Instance-optimality in optimal value estimation: Adaptivity via variance-reduced <italic>Q</italic>-learning

Eric Xia, Koulik Khamaru, Martin J. Wainwright, Michael I. Jordan

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

Various algorithms in reinforcement learning exhibit dramatic variability in their convergence rates and ultimate accuracy as a function of the problem structure. Such instance-specific behavior is not captured by existing global minimax bounds, which are worst-case in nature. We analyze the problem of estimating optimal <italic>Q</italic>-state-action value functions for a discounted Markov decision process with discrete states and actions; our main result is to identify an instance-dependent functional that controls the difficulty of estimation in the &#x2113;&#x221E;-norm. Using a local minimax framework, we show that this functional arises in lower bounds on the accuracy on any estimation procedure. We establish the sharpness of these lower bounds, up to factors logarithmic in the state and action spaces, by analyzing a variance-reduced version of <italic>Q</italic>-learning. Our theory provides a precise way of distinguishing &#x201C;easy&#x201D; problems from &#x201C;hard&#x201D; ones in the context of <italic>Q</italic>-learning, as illustrated by an ensemble with a continuum of difficulty.

Original languageEnglish (US)
Pages (from-to)1
Number of pages1
JournalIEEE Transactions on Information Theory
DOIs
StateAccepted/In press - 2024
Externally publishedYes

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Keywords

  • <italic xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Q</italic>-learning
  • Complexity theory
  • Data science
  • Estimation
  • Kernel
  • Q-learning
  • Reinforcement learning
  • Standards
  • Upper bound
  • instance-dependent complexity
  • minimax lower bounds
  • stochastic control
  • variance reduction

Fingerprint

Dive into the research topics of 'Instance-optimality in optimal value estimation: Adaptivity via variance-reduced <italic>Q</italic>-learning'. Together they form a unique fingerprint.

Cite this