Optimistic MLE: A Generic Model-Based Algorithm for Partially Observable Sequential Decision Making

This paper introduces a simple, efficient learning algorithm for general sequential decision making. The algorithm combines Optimism for exploration with Maximum Likelihood Estimation for model estimation, and is thus named OMLE. We prove that OMLE learns near-optimal policies for an enormously rich class of sequential decision making problems in a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems (such as tabular MDPs, factored MDPs, low witness rank problems, tabular weakly-revealing/observable POMDPs and multi-step decodable POMDPs), but also many new challenging RL problems, especially in the partially observable setting, that were not previously known to be tractable. Notably, the new problems addressed by this paper include (1) observable POMDPs with continuous observation and function approximation, where we achieve the first sample complexity that is completely independent of the size of the observation space; (2) well-conditioned low-rank sequential decision making problems (also known as Predictive State Representations (PSRs)), which include and generalize all known tractable POMDP examples under a more intrinsic representation; (3) general sequential decision making problems under the SAIL condition, which unifies our existing understandings of model-based RL in both fully observable and partially observable settings. The SAIL condition is identified by this paper and can be viewed as a natural generalization of Bellman/witness rank to address partial observability. This paper also presents a reward-free variant of the OMLE algorithm, which learns approximate dynamic models that enable the computation of near-optimal policies for all reward functions simultaneously.


INTRODUCTION
A wide range of modern artificial intelligence applications can be cast as sequential decision making problems, in which an agent interacts with an unknown environment through time, and learns to make a sequence of decisions using intermediate feedback. Sequential decision making covers not only problems like Atari games [27], Go [32], Chess [6] and basic control systems [35], where states are fully accessible to the learner (the fully observable setting), but also applications including StarCraft [37], Poker [4], robotics with local sensors [1], autonomous driving [24] and medical diagnostic systems [14], where observations only reveal partial information about the underlying states (the partially observable setting). While fully observable sequential decision making problems have been under intense theoretical investigation in recent years, partially observable problems remain comparatively less understood.
Distinguished from fully observable systems, a learner in partially observable systems is only able to see observations that contain partial information about the underlying states. Observations in general are no longer Markovian. As a result, it is no longer sufficient for the learner to make decisions based only on the observation or information available at the current step. Instead, the learner is required to additionally infer the latent states using past histories (memories). Such histories of observations have exponentially many possibilities, leading to many well-known worst-case hardness results in both computation [28-30, 38] and statistics [22]. To avoid these worst-case barriers, a recent line of results started to investigate rich subclasses of Partially Observable Markov Decision Processes (POMDPs) under the basic settings of finite states and observations [see, e.g., 18, 26], which still only constitute a relatively small subset of all partially observable problems of practical interest.
In this paper, we introduce a simple, generic, model-based algorithm-OMLE-which combines Optimism (O) for exploration with Maximum Likelihood Estimation (MLE) for model estimation. We prove that OMLE learns near-optimal policies for an enormously rich class of sequential decision making problems in a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems such as tabular MDPs, factored MDPs, low witness rank problems [34], tabular weakly-revealing/observable POMDPs [18, 26] and multi-step decodable POMDPs [9], but also, more importantly, many new challenging RL problems, especially in the partially observable setting, that were not previously known to be tractable (see Section 1.1). To achieve these new results, this paper develops new frameworks and techniques which address a set of fundamental challenges that are unique to partially observable systems:

Challenge 1: Continuous observation space and function approximation with partial observability. Modern applications of sequential decision making often involve an enormous (or even infinite) number of observations, where function approximation must be deployed to approximate dynamic models, value functions, or policies. While function approximation greatly expands the potential reach of existing frameworks, particularly via deep architectures, it raises a number of fundamental questions including generalization, model misspecification, and how to address those issues in the presence of exploration. Function approximation becomes even more complicated in the partially observable setting when further coupled with the inference of latent states and the use of history-dependent policies. As a result, existing results on function approximation in the partially observable setting remain very limited [5, 36]. They make rather restrictive assumptions, and do not provide efficient guarantees even for a relatively simple continuous-observation extension of the basic tabular weakly-revealing or observable POMDPs [13, 26]-GM-POMDPs (Section 5.1.2), which simply add Gaussian noise to the observations of the original models.
Challenge 2: Learning under an intrinsic representation of partially observable systems. Most existing works on efficient learning of partially observable problems focus on the model of POMDPs. POMDPs are based on latent states that are unobservable and subject to nontrivial ambiguity-there can exist multiple different POMDPs that represent the same sequential decision making problem. This ambiguity directly leads to the unidentifiability of latent states even in the benign settings where learning a near-optimal policy is possible. This paper considers a more intrinsic modeling of partially observable dynamic systems-Predictive State Representations (PSRs) [25, 33], which model a dynamic system using only observable experiments of futures. It is known that PSRs can represent any low-rank sequential decision making problem, and are more expressive than finite-state POMDPs [15]. However, it remains unclear how to learn large classes of PSRs sample-efficiently.
Challenge 3: A unified understanding of fully observable and partially observable RL. There has been a long line of important works on generic frameworks of reinforcement learning [8, 10, 16, 19, 34]. However, most of them focus on fully observable problems and are only capable of dealing with very special partially observable problems such as reactive POMDPs. A majority of them critically rely on complexity measures that are based on Bellman rank [16] or witness rank [34] (the model-based version), which assume the Bellman error or the model estimation error (in the model-based setting) to have a bilinear structure. These bilinear-based complexity measures completely fail to explain the tractability of many basic partially observable problems [9, 13, 26]. It remains open to develop a unified theoretical framework which explains large classes of both fully observable and partially observable problems.
This work addresses all three challenges above. For Challenge 1, we prove that OMLE learns observable POMDPs with continuous observation and function approximation, achieving the first sample complexity that is completely independent of the size of the observation space. For Challenge 2, we show that OMLE learns well-conditioned PSRs, which include and generalize all known tractable POMDP examples under a more intrinsic representation. For Challenge 3, we identify a new condition-Summation of Absolute values of Independent biLinear functions (SAIL)-which can be viewed as a natural generalization of Bellman/witness rank to address partial observability. We prove that OMLE learns general sequential decision making problems under the SAIL condition, which covers all problems considered in this paper and unifies our existing understanding of model-based RL in both fully observable and partially observable settings.

Overview of Our Results
This paper introduces the generic algorithmic framework of OMLE, and proves that it learns a very rich class of sequential decision making problems sample-efficiently. The OMLE algorithm (in its basic form) was first proposed in [26] for sample-efficient learning of tabular weakly-revealing POMDPs. Here we introduce some extra flexibility to the algorithm, address new challenges, and provide learning guarantees in a significantly more general setup. Specifically,
• We identify a sufficient condition for OMLE-a generalized eluder-type condition (Condition 3.1)-under which OMLE is guaranteed to find a near-optimal policy in a polynomial number of samples. We will use this generalized eluder-type condition to analyze all problems considered in this paper.
• We consider sequential decision making with low-rank structure (also known as Predictive State Representations (PSRs)). We first show that learning generic PSRs is intractable. We then identify a rich subclass called well-conditioned PSRs, and prove that OMLE learns them sample-efficiently. Our sample complexity depends polynomially on the rank of the PSRs and the size of the core action sequences, and is independent of the size of the core tests and the size of the observation space.
• We show that a wide range of POMDP models fall into the class of well-conditioned PSRs. They include not only previously known tractable problems such as tabular weakly-revealing/observable POMDPs [18, 26] and multistep decodable POMDPs [9], but also new problems including observable POMDPs with continuous observation (in particular, GM-POMDPs, see Section 5.1.2).
• We show that Reward-free OMLE learns an approximate dynamic model sample-efficiently under a slightly stronger version of the SAIL condition. This approximate dynamic model allows us to compute near-optimal policies for all reward functions simultaneously.
Besides the above results, this paper also establishes rigorous formulations for overparameterized PSRs, studies their properties, gives a rigorous treatment of PSRs with continuous observation, and bounds the bracketing number of tabular PSRs, which might be of independent interest to the community.

Technical Contribution
Underlying our new results is a set of new techniques for handling PSRs with infinite observations.
• New sharp elliptical potential style lemma for SAIL.
A crucial component for analyzing optimistic algorithms is the pigeonhole principle [2, 17] or the so-called elliptical potential lemma [23], which ensures that the size of the confidence set shrinks fast enough to guarantee near-optimality of the learned policy after a small number of rounds. A similar problem has been studied in [26], but the bounds derived therein scale with the size of the observation space in PSRs/POMDPs. Such a result becomes vacuous in the infinite-observation setting. We address this issue by developing a significantly sharper argument, which gives bounds completely independent of the size of the observation space. Please see Appendix G.1 for details.
• To apply the new sharp elliptical potential lemma discussed above, we need a projection operator which maps a function (or high-dimensional vector) defined on the observation space into a low-dimensional Euclidean space whose dimension is equal to the intrinsic complexity of POMDPs or PSRs. Our analysis further requires the resulting vector after projection to have a small ℓ1-norm. In POMDPs, we can directly construct such a projection by taking the pseudo-inverse of the emission matrices (as in [26]). However, such a choice does not apply to PSRs, which have less structure than POMDPs. To address this issue, we consider the general problem of projecting high-dimensional vectors (that lie in a low-dimensional subspace) to a low-dimensional Euclidean space without significantly increasing their ℓ1-norm. We achieve this by constructing a projection using the barycentric spanner technique (a minimal sketch of the spanner computation follows this list). Please see Lemma G.3 and Step 3 in Appendix C.4 for details.
• Matrix pseudo-inverse with small ℓ1-norm. To establish efficient guarantees for learning observable POMDPs, we need to construct the operators M as in the framework of PSRs, and bound the ℓ1-norm of these operators. All previous works [e.g., 3, 18, 26, 40, etc.] construct such operators using the pseudo-inverse of the emission matrices O†, whose ℓ1-norm scales with the size of the observation space even under the observable condition (Condition 5.1). Such dependency prevents their analysis from generalizing to the infinite-observation setting. We address this issue by adding a matrix Y that lies in the subspace complementary to O†. We show that with an optimal choice of Y, the ℓ1-norm of O† + Y is small and independent of the size of the observation space. To our best knowledge, this operator design is completely new and has not been considered in the previous POMDP literature.
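To make the barycentric spanner technique mentioned above concrete, the following is a minimal sketch (not the paper's implementation) of the classic determinant-swapping procedure for computing an approximate barycentric spanner of a finite set of vectors; the function name, the approximation factor c, and the greedy initialization are illustrative assumptions.

```python
import itertools
import numpy as np

def barycentric_spanner(vectors, c=2.0, max_iters=10_000):
    """Approximate barycentric spanner of a set of vectors spanning R^d.

    Returns indices of d vectors such that every vector in the input set can be
    written as a linear combination of them with coefficients in [-c, c].
    """
    X = np.asarray(vectors, dtype=float)          # shape (n, d)
    n, d = X.shape
    # Initialize with any d vectors forming a nonsingular basis (greedy scan).
    basis = []
    for idx in range(n):
        trial = basis + [idx]
        if np.linalg.matrix_rank(X[trial]) == len(trial):
            basis.append(idx)
        if len(basis) == d:
            break
    assert len(basis) == d, "input vectors must span R^d"
    for _ in range(max_iters):
        improved = False
        base_det = abs(np.linalg.det(X[basis]))
        # Swap in any vector that increases |det| by more than a factor c.
        for i, idx in itertools.product(range(d), range(n)):
            cand = list(basis)
            cand[i] = idx
            if abs(np.linalg.det(X[cand])) > c * base_det:
                basis, improved = cand, True
                break
        if not improved:
            return basis
    return basis
```

By definition, every vector in the input set can be expressed as a combination of the returned spanner elements with coefficients in [-c, c], which is the property the ℓ1-norm argument above relies on; this snippet only illustrates the spanner computation itself.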

PRELIMINARIES
Notation. For a positive integer n, we let [n] = {1, . . . , n}. We use the notation x_{1:n} to denote the sequence (x_1, . . . , x_n). We use bold upper-case letters B to denote matrices and bold lower-case letters b to denote vectors. Given a matrix B ∈ R^{m×n}, we use B_{ij} to denote its (i, j)-th entry, ∥B∥_p = max_{∥z∥_p ≠ 0} ∥Bz∥_p / ∥z∥_p to denote its matrix p-norm, and B† to denote its Moore-Penrose inverse. For a vector b ∈ R^n, we use b_i to denote its i-th entry, ∥b∥_p to denote its vector p-norm, and diag(b) to denote the diagonal matrix with [diag(b)]_{ii} = b_i. Given a set X, we use 2^X to denote the collection of all subsets of X.

Sequential Decision Making
We consider general episodic sequential decision making problems, which can be specified by a tuple (O, A, H, P, r). Here O and A denote the observation space and the action space respectively, and H denotes the length of each episode. P = {P_h}_{h=1}^H specifies the joint distribution over observations o_{1:H} conditioned on an action sequence a_{1:H}, which can be factorized as P(o_{1:H} | a_{1:H}) = ∏_{h=1}^H P_h(o_h | o_{1:h-1}, a_{1:h-1}); P is also known as the system dynamics. r = {r_h}_{h∈[H]} are the known reward functions from O to [0, 1], such that the agent receives reward r_h(o_h) when she observes o_h ∈ O at step h. To simplify the presentation, we also use the notation P(o_{1:h}, a_{1:h}) := P(o_{1:h} | a_{1:h}) for any trajectory (o_{1:h}, a_{1:h}) to denote the conditional probability of the observations given the actions. Throughout this paper we assume a finite action space A with |A| = A, but allow an infinitely large observation space O.
At each step h ∈ [H] of each episode, the environment first samples an observation o_h according to P_h(· | o_{1:h-1}, a_{1:h-1}) based on the observation-action sequence in the past, and then the agent takes an action a_h. The current episode terminates immediately after a_H is taken.
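To make the interaction protocol concrete, here is a minimal sketch of a single-episode rollout under a history-dependent policy; the callable signatures (sample_obs, policy, rewards) are illustrative assumptions rather than notation from the paper.

```python
def rollout_episode(sample_obs, policy, rewards, horizon, rng):
    """Roll out one episode of the general sequential decision making protocol.

    sample_obs(o_hist, a_hist, rng) -> next observation   (plays the role of P_h)
    policy(o_hist, a_hist) -> probability vector over the A actions (history-dependent)
    rewards: list of functions r_h mapping an observation to a reward in [0, 1]
    rng: a numpy random Generator, e.g. numpy.random.default_rng(0)
    """
    o_hist, a_hist, total_reward = [], [], 0.0
    for h in range(horizon):
        o_h = sample_obs(o_hist, a_hist, rng)          # o_h ~ P_h(. | o_{1:h-1}, a_{1:h-1})
        total_reward += rewards[h](o_h)                # agent receives r_h(o_h)
        dist = policy(o_hist + [o_h], a_hist)          # pi_h depends on the whole history
        a_h = rng.choice(len(dist), p=dist)
        o_hist.append(o_h)
        a_hist.append(a_h)
    return o_hist, a_hist, total_reward
```

Averaging total_reward over repeated rollouts estimates the value V^π defined in the next subsection.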

Policy and value.
A policy π = {π_h}_{h=1}^H is a collection of functions where π_h : (O × A)^{h-1} × O → Δ_A maps a length-h observation-action sequence to a distribution over actions. Given a policy π, we use V^π to denote its value, which is defined as the expected total reward received under policy π: V^π := E_π[∑_{h=1}^H r_h(o_h)], where the expectation is with respect to the randomness of the system dynamics P and the policy π.
Since the action space and the episode length are both finite, the maximal value over all policies max_π V^π always exists. We call this maximum the optimal value, denoted by V★, and call the policy that achieves this optimal value the optimal policy, denoted by π★.
Learning objective. Our goal is to learn an ε-optimal policy π, in the sense that V^π ≥ V★ − ε, using a number of samples polynomial in all relevant parameters. We also consider the problem of learning with low regret. Suppose the agent interacts with the sequential decision making problem for K episodes, and plays policy π^k in the k-th episode for all k ∈ [K]. The total (expected) regret is then defined as Regret(K) := ∑_{k=1}^K (V★ − V^{π^k}). The question then is whether a learner can keep the regret small. Below we describe several widely studied reinforcement learning models that can be cast into the framework of sequential interactive decision making.
Example 1 (Contextual bandit). In a contextual bandit, the observation is the context of the problem. The episode length H is equal to 1 and there exists a distribution ν ∈ Δ_O so that the first-step observation o_1 of each episode is independently sampled from ν, i.e., P(o_1 = ·) = ν(·).
Example 2 (MDP). In a Markov decision process (MDP), the observation is the state of the MDP. The observation-action pairs satisfy the Markov property. That is, there exists a collection of transition kernels T = {T_h}_{h=1}^H so that P_h(o_h | o_{1:h-1}, a_{1:h-1}) = T_{h-1}(o_h | o_{h-1}, a_{h-1}).

Example 3 (POMDP). In a partially observable Markov decision process (POMDP), there is an additional latent state space S, a collection of transition kernels T = {T_h}_{h=1}^H, an initial distribution μ_1 over the latent state space, and a collection of emission kernels O = {O_h}_{h=1}^H. In a POMDP, the latent states are hidden from the agent. At the beginning of each episode, the environment samples an initial state s_1 from μ_1. At each step h ∈ [H], the agent first observes o_h, which is sampled from O_h(· | s_h), the emission distribution of the hidden state s_h at step h. Then the agent takes action a_h and receives reward r_h(o_h). After this, the environment transitions to the next latent state s_{h+1}, which is sampled from T_h(· | s_h, a_h).

We note that MDPs are fully observable models while POMDPs are partially observable models. Distinguished from MDPs, where the optimal policies only depend on the current observation, the near-optimal policies of POMDPs in general depend on the entire history. This makes both learning and planning in POMDPs significantly more challenging than in MDPs.

Model-Based Function Approximation
We consider interactive decision making problems where the observation space O, the action space A, the horizon H, and the reward function r are known, while the system dynamics P is unknown. To address an infinitely large observation space, we consider the setting where we are given a model class Θ, which specifies a class of system dynamics {P_θ}_{θ∈Θ}. We denote the system dynamics of the real model as P_{θ★}. Throughout this paper, we make the following realizability assumption.
Assumption 2.1 (Realizability). The true model θ★ belongs to the model class Θ.

Realizability states that the true model resides in the given model class, so there is no misspecification error. Realizability is a standard assumption which appears in a majority of theoretical works on RL.
Following the convention in analyzing MLE [e.g., 11], we use the bracketing number to control the complexity of the model class Θ.

Definition 2.2 (Bracketing number). Given two functions g and h with g ≤ h pointwise, the bracket [g, h] is the set of all functions f satisfying g ≤ f ≤ h. An ε-bracket (with respect to a norm ∥·∥) is a bracket [g, h] with ∥h − g∥ ≤ ε. The ε-bracketing number of a function class is the minimum number of ε-brackets needed to cover the class.
The bracketing number is required in the existing MLE analysis [11], and is in general equal to or greater than the standard covering number. Throughout this paper, we use N_Θ(ε) to denote the ε-bracketing number of the function class {P_θ}_{θ∈Θ} with respect to the policy-weighted ℓ1-distance, where the policy-weighted ℓ1-distance between two system dynamics P_1 and P_2 is max_π ∑_{o_{1:H}, a_{1:H}} π(a_{1:H} | o_{1:H}) |P_1(o_{1:H} | a_{1:H}) − P_2(o_{1:H} | a_{1:H})|, and the maximum is taken over all policies π. Intuitively, we need this maximization because P is a conditional probability of observations given actions.

Algorithm 1 Optimistic Maximum Likelihood Estimation (Θ, β)
1: initialize the confidence set B^1 ← Θ and the dataset D ← ∅
2: for iteration k = 1, . . . , K do
3:   compute (θ^k, π^k) = argmax_{θ ∈ B^k, π} V^π_θ
4:   compute exploration policies Π_exp ← Π_exp(π^k)
5:   for each π ∈ Π_exp do
6:     execute policy π and collect a trajectory τ = (o_1, a_1, . . . , o_H, a_H)
7:     add (π, τ) into dataset D
8:   update confidence set B^{k+1} ← {θ ∈ Θ : ∑_{(π,τ)∈D} log P^π_θ(τ) ≥ max_{θ'∈Θ} ∑_{(π,τ)∈D} log P^π_{θ'}(τ) − β}
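As a concrete illustration of the policy-weighted ℓ1-distance defined above, the following sketch evaluates it for one fixed policy on a small finite problem by enumerating all trajectories; since the actual distance takes a maximum over policies, a single policy only provides a lower bound. Function names and encodings are illustrative.

```python
import itertools

def policy_weighted_l1(p1, p2, policy, n_obs, n_act, horizon):
    """Policy-weighted l1-distance between two system dynamics, for one policy.

    p1(o_seq, a_seq), p2(o_seq, a_seq): conditional probability of the observations
        given the actions, for full-length sequences.
    policy(o_prefix, a_prefix): probability vector over the n_act actions.
    The distance used for the bracketing number takes the max over all policies;
    evaluating a single policy therefore only yields a lower bound.
    """
    total = 0.0
    for o_seq in itertools.product(range(n_obs), repeat=horizon):
        for a_seq in itertools.product(range(n_act), repeat=horizon):
            # probability that the policy selects exactly this action sequence
            pi_weight = 1.0
            for h in range(horizon):
                pi_weight *= policy(o_seq[: h + 1], a_seq[:h])[a_seq[h]]
            total += pi_weight * abs(p1(o_seq, a_seq) - p2(o_seq, a_seq))
    return total
```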

OPTIMISTIC MLE
In this section, we present the generic Optimistic Maximum Likelihood Estimation (OMLE) algorithm. Moreover, we provide a general sufficient condition-a generalized eluder-type condition (Condition 3.1)-and prove that for any RL problem satisfying this condition, OMLE learns it within a polynomial number of samples.

Algorithm
The pseudocode of OMLE is provided in Algorithm 1. We remark that the OMLE algorithm was first proposed in [26] for sample-efficient learning of weakly-revealing POMDPs; here we introduce some extra flexibility in the data collection steps to handle more general learning problems.
Formally, OMLE is a model-based algorithm which takes as input a model class Θ, and executes the following three key steps in each iteration k ∈ [K]:
• Optimistic planning (Line 3): OMLE computes the most optimistic model θ^k in the model confidence set B^k and its corresponding optimal policy π^k.
• Data collection (Lines 4-7): Based on the optimistic policy π^k, OMLE constructs a set of exploration policies Π_exp(π^k), and the learner executes each of them to collect a trajectory. As will be explained in later sections, these exploration policies could simply be π^k, or composite policies that combine π^k with random or certain action sequences, depending on the structure of the problem to solve. Intuitively, by actively trying exploratory action sequences after π^k, the learner can gather more information about the system dynamics under π^k. As an example, when applying OMLE to learning PSRs, the exploration policies will execute the core action sequences after π^k, which we will explain in detail in Section 4.
• Confidence set update (Line 8): Finally, OMLE updates the model confidence set using the newly collected data. Specifically, it constructs B^{k+1} to include all the models θ ∈ Θ whose log-likelihood on all the historical data collected so far is close to the maximal log-likelihood up to an additive factor β. This can be viewed as a relaxation of the classic maximum likelihood estimation (MLE) approach, which chooses the model estimate to be the one exactly maximizing the log-likelihood. In particular, when β = 0, B^{k+1} reduces to the solution set of MLE. One important reason behind this construction is that by choosing the relaxation parameter β properly, we can guarantee that the true model θ★ lies in the confidence set for all k ∈ [K] with high probability, under the realizability assumption. (A minimal code sketch of this loop follows the list.)
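The three steps above can be summarized in a short sketch of the OMLE loop for a small finite model class, with brute-force optimistic planning and the log-likelihood confidence-set update with slack β; all helper names (value_of, exploration_policies, collect_trajectory, log_likelihood) are illustrative assumptions, not interfaces from the paper.

```python
def omle(models, value_of, exploration_policies, collect_trajectory,
         log_likelihood, beta, num_iterations):
    """Minimal OMLE loop over a finite model class `models`.

    value_of(theta) -> (optimal policy of theta, its optimal value under theta)
    exploration_policies(pi) -> list of exploration policies Pi_exp(pi)
    collect_trajectory(pi) -> one trajectory obtained by executing pi
    log_likelihood(theta, pi, tau) -> log P_theta^pi(tau)
    """
    confidence_set = list(models)            # B^1 = Theta
    dataset = []                             # pairs (policy, trajectory)
    for _ in range(num_iterations):
        # Optimistic planning: most optimistic model in the current confidence set.
        theta_k, (pi_k, _) = max(((th, value_of(th)) for th in confidence_set),
                                 key=lambda pair: pair[1][1])
        # Data collection with the exploration policies built from pi_k.
        for pi in exploration_policies(pi_k):
            dataset.append((pi, collect_trajectory(pi)))
        # Confidence-set update: keep models within beta of the maximal log-likelihood.
        ll = [sum(log_likelihood(th, pi, tau) for pi, tau in dataset) for th in models]
        max_ll = max(ll)
        confidence_set = [th for th, l in zip(models, ll) if l >= max_ll - beta]
    return confidence_set
```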

Theoretical Guarantees
In this section, we present the theoretical guarantees for OMLE. To present our results in the most general form, we first introduce a sufficient condition, called the generalized eluder-type condition. We then provide sample-efficiency guarantees for OMLE in learning any RL problem that satisfies this condition. Let P^π_θ denote the distribution over trajectories (o_{1:H}, a_{1:H}) induced by executing policy π in model θ.

Condition 3.1 (Generalized eluder-type condition). There exists a real number d_Θ ∈ R_+ and a function ξ such that: for any (K, Δ) ∈ N × R_+, and for any models {θ^k}_{k∈[K]} with corresponding policies {π^k = π_{θ^k}}_{k∈[K]},

∑_{t<k} ∑_{π∈Π_exp(π^t)} d_TV(P^π_{θ^k}, P^π_{θ★})² ≤ Δ for all k ∈ [K]   ⟹   ∑_{k=1}^K d_TV(P^{π^k}_{θ^k}, P^{π^k}_{θ★}) ≤ ξ(d_Θ, K, Δ, |Π_exp|),   (1)

where |Π_exp| := max_π |Π_exp(π)| is the largest possible number of exploration policies in each iteration.
At a high level, Condition 3.1 resembles the pigeonhole principle and the elliptical potential lemma, which are widely used in tabular MDPs [e.g., 2, 17] and linear bandits/MDPs [e.g., 20, 23] respectively. Such a condition is widely used as a sufficient condition for algorithms using optimistic exploration [31]. Importantly, we will prove that Condition 3.1 holds for all the problems studied in this paper, with moderate d_Θ and a function ξ whose leading term scales as Õ(√(d_Θ Δ |Π_exp| K)).
For an intuitive understanding of this generalized eluder-type condition, imagine that in each iteration k, the learner chooses a model θ^k that accurately predicts the behavior of the historical exploration policies in Π_exp(π^1), . . . , Π_exp(π^{k-1}) up to cumulative error Δ (i.e., the left inequality of (1)). Since θ^k could be different from θ★, the learner will still suffer an instantaneous error in predicting the behavior of policy π^k using model θ^k. And ξ(d_Θ, K, Δ, |Π_exp|) essentially measures the worst-case growth rate of the cumulative instantaneous error with respect to K.
The key motivation behind Condition 3.1 is that, because of the way OMLE constructs the confidence set B^k, we can use the classical analysis of MLE [11] to guarantee that any model inside B^k is close to the true model θ★ in TV-distance under the historical policies in Π_exp(π^1), . . . , Π_exp(π^{k-1}) with high probability. As a result, if the problem further satisfies the generalized eluder-type condition, then OMLE immediately enjoys a low-suboptimality guarantee by the optimism of {θ^k}_{k=1}^K and Condition 3.1. Formally, we have the following theoretical guarantee for OMLE.

Theorem 3.2. There exist absolute constants c_1, c_2 > 0 such that for any δ ∈ (0, 1] and K ∈ N, if we choose β = c_1 log(K N_Θ(1/K) |Π_exp| / δ) in OMLE (Algorithm 1) and assume Condition 3.1 holds, then with probability at least 1 − δ, we have ∑_{k=1}^K (V★ − V^{π^k}) ≤ O(ξ(d_Θ, K, c_2 β, |Π_exp|)).

As mentioned before, for all problems studied in this paper, the leading term (in terms of K dependency) of the function ξ scales as Õ(√(d_Θ β |Π_exp| K)). Then, Theorem 3.2 immediately leads to a guarantee of ∑_{k=1}^K (V★ − V^{π^k}) ≤ Õ(√(d_Θ β |Π_exp| K)), which gives the optimal √K dependency up to a polylogarithmic factor. We remark that Theorem 3.2 is not a regret guarantee unless Π_exp(π) = {π}, because it is the policies in {Π_exp(π^k)}_{k=1}^K that are executed by OMLE, not {π^k}_{k=1}^K.
Sample complexity. Since the output policy π_out is a uniform mixture of {π^k}_{k=1}^K, we have V^{π_out} = (∑_{k=1}^K V^{π^k})/K. As a result, Theorem 3.2 immediately implies that with probability at least 1 − δ, the output π_out of OMLE is ε-optimal as long as ξ(d_Θ, K, c_2 β, |Π_exp|)/K ≤ ε/2. In particular, when ξ scales as Õ(√K) with respect to K, it suffices to run OMLE for K ≥ Õ(ε^{-2}) episodes, where the dependency on ε is again optimal up to a polylogarithmic factor.

LOW-RANK SEQUENTIAL DECISION MAKING
In this section, we consider an important large class of sequential decision making problems which have low-rank structure. Note that the entire dynamics of a sequential decision making problem is fully specified by the joint probability P(o_{1:H} | a_{1:H}). We can equivalently view this joint probability as a collection of system-dynamics matrices {D_h}_{h∈[H]}: for each fixed step h, we call an observation-action sequence over the steps up to h, i.e., τ_h = (o_{1:h}, a_{1:h}), a history, and call an observation-action sequence over future steps, i.e., ω_h = (o_{h+1:w}, a_{h+1:w}) for any w ∈ [h+1, H], a future (or test). Denote the set of all possible histories at step h as T_h and the set of all possible futures as Ω_h. Then we can define the system-dynamics matrix D_h ∈ R^{|T_h|×|Ω_h|} as the matrix with histories as rows and futures as columns, whose entries are specified as [D_h]_{τ_h, ω_h} = P(τ_h, ω_h), the joint conditional probability of the observations in the concatenated sequence (τ_h, ω_h) given its actions. The rank of the sequential decision making problem is simply defined as max_{h∈[H]} rank(D_h), the maximal rank of the system-dynamics matrices.
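The system-dynamics matrices and the rank can be computed explicitly for small finite problems; the sketch below enumerates histories and (for simplicity, full-length) futures and fills in the joint probabilities, with illustrative function names.

```python
import itertools
import numpy as np

def system_dynamics_matrix(joint_prob, n_obs, n_act, horizon, h):
    """System-dynamics matrix D_h of a small finite sequential decision making problem.

    joint_prob(obs_seq, act_seq) -> P(o_{1:H} | a_{1:H}) for full-length sequences.
    Rows are histories tau_h = (o_{1:h}, a_{1:h}); columns are full-length futures
    omega_h = (o_{h+1:H}, a_{h+1:H}); the entry is the joint conditional probability
    of the concatenated trajectory.
    """
    histories = list(itertools.product(
        itertools.product(range(n_obs), repeat=h),
        itertools.product(range(n_act), repeat=h)))
    futures = list(itertools.product(
        itertools.product(range(n_obs), repeat=horizon - h),
        itertools.product(range(n_act), repeat=horizon - h)))
    D = np.zeros((len(histories), len(futures)))
    for i, (o_past, a_past) in enumerate(histories):
        for j, (o_fut, a_fut) in enumerate(futures):
            D[i, j] = joint_prob(o_past + o_fut, a_past + a_fut)
    return D

# The rank of the problem is max_h rank(D_h); for a single step h one would compute
# np.linalg.matrix_rank(system_dynamics_matrix(joint_prob, n_obs, n_act, horizon, h)).
```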

Predictive State Representations
Predictive State Representations (PSRs) were proposed by [25, 33] as a generic approach to model low-rank sequential decision making problems. Consider a fixed step h ∈ [H−1], and denote r = rank(D_h). For any integer d ≥ r, there always exist d columns (denoted as Q_h) of the matrix D_h such that the submatrix restricted to these columns, D_h[Q_h], satisfies rank(D_h[Q_h]) = r. These columns correspond to futures Q_h = {q^1, . . . , q^d}, which are called core tests. Throughout this section, we assume all models in our model class Θ share the same sets of core tests, which are known to the learner. While most of the PSR literature chooses d = r, in many applications (as shown in the next section) the learner only knows a set of core tests of larger size. Therefore, we also consider the setting where d > r, which we refer to as overparameterized PSRs.
Core tests allow the system-dynamics matrix D_h to be factorized as D_h = D_h[Q_h] W_h^⊤ for a certain matrix W_h ∈ R^{|Ω_h|×d}. (3) This implies an important property: for any history τ_h, the τ_h-th row of D_h[Q_h], which we denote as q(τ_h) := (P(τ_h, q^1), . . . , P(τ_h, q^d)), serves as a sufficient statistic for the history τ_h in predicting the probabilities of all futures conditioned on τ_h. In sum, a PSR captures the state of a dynamic system using q(τ_h)-a vector of predictions for future tests.
Formally, a PSR models the dynamic system using a tuple (φ, M, q_0), where φ is a vector, M = {M_h(o, a)}_{h∈[H], (o,a)∈O×A} is a collection of matrices, and q_0 is a vector in R^{|Q_0|}. The tuple satisfies the following two equations: for any h ∈ [0, H−1] and any observation-action sequence (o_{1:h}, a_{1:h}),

P(o_{1:H} | a_{1:H}) = φ^⊤ M_H(o_H, a_H) · · · M_1(o_1, a_1) q_0,   (4)
q(o_{1:h}, a_{1:h}) = M_h(o_h, a_h) · · · M_1(o_1, a_1) q_0.   (5)

That is, in a PSR, the joint probability P(o_{1:H} | a_{1:H}) can be factorized as a product of matrices and vectors where each matrix only depends on the observation and action at the corresponding step. The second condition (5) further requires the product of the first h matrices to have a probabilistic interpretation-the sufficient statistic q(o_{1:h}, a_{1:h}) for the history (o_{1:h}, a_{1:h}). In condition (5), we include the special case h = 0, where the history τ_h is empty (∅), and the condition becomes q(∅) := (P(q^1), . . . , P(q^d)) = q_0 for the core tests {q^1, . . . , q^d} in Q_0. We call the sets of core tests along with the tuple (φ, M, q_0) the PSR representation of the dynamic system. Finally, we define the rank of a PSR to be the rank of the underlying sequential decision making problem that the PSR describes (according to (4)).
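Under the reconstruction of Equations (4) and (5) above, a PSR evaluates probabilities and prediction vectors by simple matrix-vector recursions; the following minimal sketch illustrates that recursion (the operator indexing convention is an illustrative assumption).

```python
import numpy as np

def psr_prediction_vector(M, q0, obs_seq, act_seq):
    """q(tau_h) = M_h(o_h, a_h) ... M_1(o_1, a_1) q_0   (Equation (5)).

    M[h][(o, a)] is the matrix applied at step h+1 (0-indexed list over steps).
    """
    q = np.asarray(q0, dtype=float)
    for h, (o, a) in enumerate(zip(obs_seq, act_seq)):
        q = M[h][(o, a)] @ q
    return q

def psr_joint_prob(phi, M, q0, obs_seq, act_seq):
    """P(o_{1:H} | a_{1:H}) = phi^T M_H(o_H, a_H) ... M_1(o_1, a_1) q_0   (Equation (4))."""
    return float(np.asarray(phi, dtype=float) @ psr_prediction_vector(M, q0, obs_seq, act_seq))
```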
Representation power of PSRs. The following theorem [see e.g., 25, 33] guarantees the existence of such a PSR representation (φ, M, q_0) for any low-rank sequential decision making problem.

Theorem 4.1. Any rank-r sequential decision making problem admits sets of core tests {Q_h}_{h∈[H−1]} with size max_{h∈[H−1]} |Q_h| ≤ r, and a corresponding tuple (φ, M, q_0), which jointly satisfy Equations (4) and (5).

Theorem 4.1 demonstrates the superior expressive power of PSRs, in the sense that any low-rank sequential decision making problem admits an equivalent and compact PSR representation. This is in sharp contrast to other models of dynamical systems such as POMDPs, which not only implicitly require the system dynamics to be low-rank but also explicitly assume the existence of latent nominal states so that the current state of the system can be represented as a probability distribution over these unobservable nominal states. As a result, PSRs can model strictly more complex dynamical systems than POMDPs with finite states, e.g., the probability clock introduced in [15].
Linear weight vectors. According to the low-rank factorization (3), there exist linear weight vectors {m(ω_h)}_{ω_h∈Ω_h} depending only on the futures (where m(ω_h) can be the ω_h-th row of the matrix W_h) such that for any future ω_h and history τ_h, the joint probability can be written in the bilinear form P(τ_h, ω_h) = m(ω_h)^⊤ q(τ_h). (6) Equations (4) and (5) give two natural constructions for weight vectors. First, consider futures of full length, Ω_h := (O × A)^{H−h}. Equation (4) gives the weight vector of any future ω_h = (o_{h+1:H}, a_{h+1:H}) ∈ Ω_h as m_1(ω_h)^⊤ = φ^⊤ M_H(o_H, a_H) · · · M_{h+1}(o_{h+1}, a_{h+1}). (7) Second, consider futures consisting of one more observation-action pair followed by a core test; Equation (5) gives the weight vector of a future ω_h = (o_{h+1}, a_{h+1}, q^i) (where q^i ∈ Q_{h+1} is the i-th core test of Q_{h+1}) as m_2(ω_h)^⊤ = e_i^⊤ M_{h+1}(o_{h+1}, a_{h+1}). (8) We note that in the overparameterized setting (|Q_h| > rank(D_h)), the choice of linear weights m(·) in (6) may not be unique. As a result, the constructions in (7) and (8) are not necessarily related in general, unless a further self-consistency condition is satisfied (see the discussion in Appendix C.3 for more details).
Core action sequences. We note that multiple core tests might use the same action sequence a_{h+1:w} for some w ∈ [h+1, H]. Therefore, on many occasions, it is convenient to consider the set of core action sequences Q^a_h, which is the set of unique action sequences within the set of core tests Q_h. We know immediately that |Q^a_h| ≤ |Q_h| and that any rank-r system-dynamics matrix D_h admits at least one set of core action sequences of size |Q^a_h| ≤ r. The size of the core action sequences |Q^a_h| determines the number of experiments we need to conduct in the dynamic system in order to estimate q(τ_h). As we will see later, all our sample complexity results only depend on |Q^a_h| instead of |Q_h|. WLOG, we assume that no core action sequence is a prefix of another core action sequence.
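Extracting the core action sequences from a set of core tests is a purely combinatorial step; here is a small illustrative sketch (the encoding of a test as an (observation sequence, action sequence) pair is an assumption made for the example).

```python
def core_action_sequences(core_tests):
    """Unique action sequences Q^a_h within a set of core tests Q_h.

    Each core test is encoded as a pair (obs_seq, act_seq); only the action part
    determines which experiments need to be executed in the dynamic system.
    """
    return {act_seq for _obs_seq, act_seq in core_tests}

# Example: two tests sharing the action sequence (0, 1) collapse to one experiment.
tests = [((3, 7), (0, 1)), ((5, 2), (0, 1)), ((3, 3), (1, 1))]
assert core_action_sequences(tests) == {(0, 1), (1, 1)}
```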

Continuous observation.
For clean presentation, we state the results in this section using the formulation with finite observations. As we will see, our sample complexity results are completely independent of the number of observations, which allows our results to readily extend to the setting of continuous observations. For a rigorous treatment, we note that under our current definition each core test is a single observation-action sequence, which has probability 0 of being observed if the observations are continuous. In Appendix A, we provide two approaches to modify the PSR formulation to resolve this issue. One approach is to consider a dense set of core tests with infinitely many futures, and generalize q(·) and M(·, ·) from vectors and matrices to functions and linear operators in Hilbert space. Our results remain meaningful even with infinitely many core tests as long as the number of core action sequences is small. The second approach is to generalize the definition of a core test to be the event that the future lands in a measurable subset of the future space. We defer the details of the rigorous treatment of continuous observations to Appendix A.

Well-Conditioned PSRs
Since PSRs include POMDPs as a special case, they naturally inherit all the hardness results for learning POMDPs. In particular, even when the observation space, the action space and the sets of core tests are all small, finding a near-optimal policy still requires an exponential number of samples in the worst case.

Proposition 4.2. There exists a class of PSRs with constant-size observation space, action space and core test sets such that any algorithm requires at least Ω(2^H) samples to learn a (1/4)-optimal policy with probability 1/6 or higher.
The proof of Proposition 4.2 essentially follows from Theorem 6 in [26], which shows the hardness of learning POMDPs when the weakly-revealing coefficient is bad. See Appendix C.2 for details.
Intuitively, the hard instances in Proposition 4.2 arise for the following reason: in the definition of a PSR, we require that for each step h, the core tests Q_h satisfy rank(D_h[Q_h]) = rank(D_h) := r. However, this requirement alone does not prohibit the submatrix D_h[Q_h] from being extremely close to some matrix whose rank is strictly less than r. That is, the matrix D_h[Q_h] can be highly ill-conditioned. This leads to high non-robustness in predicting the probability P(τ_h, ω_h) = m(ω_h)^⊤ q(τ_h) when the vector q(τ_h) needs to be estimated-the corresponding linear weight m(ω_h) can be extremely large, so that we need to estimate q(τ_h) to extremely high accuracy. Indeed, in the hard instances of Proposition 4.2 there exists some future ω_h such that ∥m(ω_h)∥_1 ≥ Ω(2^H).
To rule out such hard instances, the core tests are required not only to guarantee rank(D_h[Q_h]) = rank(D_h), but also to ensure that D_h[Q_h] is "well-conditioned" in a certain sense. In this paper, we enforce such a condition by assuming an upper bound on the magnitude of the linear weight vectors.

Condition 4.3 (γ-well-conditioned PSR). We say a PSR is γ-well-conditioned if for any h ∈ [H−1] and any policy π independent of the history before step h+1, the weight vectors m_1(·), m_2(·) and the corresponding future sets Ω_h in (7), (8) satisfy

max_{x ∈ R^{|Q_h|}: ∥x∥_1 ≤ 1} ∑_{ω_h ∈ Ω_h} π(ω_h) |m_i(ω_h)^⊤ x| ≤ 1/γ for i ∈ {1, 2},

where π(ω_h) denotes the probability that π takes the actions in ω_h given its observations.
Intuitively, the parameter γ^{-1} above measures how much the future weight vectors {m(ω_h)}_{ω_h∈Ω_h} can amplify the error x arising from estimating the probabilities of the core tests, in an averaged sense where the future ω_h is sampled from policy π. Being γ-well-conditioned naturally requires this error amplification to not be extremely large, since otherwise the hard instances mentioned before come into play. In Section 5, we will prove that many common partially observable RL problems are naturally γ-well-conditioned PSRs with moderate γ, e.g., observable POMDPs and multi-step decodable POMDPs.
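Under our reconstructed reading of Condition 4.3, the quantity bounded by 1/γ can be evaluated exactly for a finite future set at a single step and policy: the objective is convex in x and the ℓ1-ball is a polytope, so the maximum is attained at a signed coordinate vector. The sketch below uses this observation; array layouts and names are illustrative.

```python
import numpy as np

def conditioning_parameter(weights, probs):
    """1/gamma at one step h under one policy pi (reconstructed Condition 4.3).

    weights: array of shape (num_futures, d); row omega is m(omega)^T.
    probs:   array of shape (num_futures,); probability pi assigns to omega's actions.
    The objective max_{||x||_1 <= 1} sum_omega pi(omega) |m(omega)^T x| is convex in x,
    hence attained at +-e_i, so it equals the largest pi-weighted column sum of |weights|.
    """
    weights = np.abs(np.asarray(weights, dtype=float))
    probs = np.asarray(probs, dtype=float)
    return float(np.max(probs @ weights))   # max_i sum_omega pi(omega) |m(omega)_i|

# A PSR is gamma-well-conditioned (at this step, under this policy) iff
# conditioning_parameter(weights, probs) <= 1 / gamma.
```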

Theoretical Results
In this subsection, we present the theoretical guarantees for learning well-conditioned PSRs with OMLE. To analyze OMLE, we first need to specify the exploration policy function Π_exp. Denote by (π, h, a) a composite policy that first executes policy π from step 1 to step h−1, then takes a uniformly random action at step h, after that executes the action sequence a = (a_{h+1}, . . . , a_w) until a certain step w, and finally finishes the remaining steps of the current episode by taking uniformly random actions. We construct the following exploration policy function:

Π_exp(π) := {(π, h, a) : h ∈ [H−1], a ∈ Q^a_{h+1}}.   (10)

By using the above exploration policy function in OMLE, we have the following polynomial sample-efficiency guarantee for learning well-conditioned PSRs.
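The composite exploration policies (π, h, a) and the exploration policy set can be sketched as follows; the policy interface and the dictionary encoding of the core action sequences are illustrative assumptions.

```python
import numpy as np

def composite_policy(base_policy, h, action_seq, n_act):
    """Composite exploration policy (pi, h, a) described above (steps are 1-indexed)."""
    uniform = np.full(n_act, 1.0 / n_act)

    def policy(o_hist, a_hist):
        step = len(a_hist) + 1                     # current step, 1-indexed
        if step < h:
            return base_policy(o_hist, a_hist)     # follow pi up to step h-1
        if step == h:
            return uniform                         # uniformly random action at step h
        offset = step - h - 1
        if offset < len(action_seq):               # then replay the core action sequence a
            onehot = np.zeros(n_act)
            onehot[action_seq[offset]] = 1.0
            return onehot
        return uniform                             # finish the episode with random actions

    return policy

def exploration_policies(base_policy, core_action_seqs, horizon, n_act):
    """Pi_exp(pi): one composite policy per step h and core action sequence a in Q^a_{h+1}.

    core_action_seqs: dict mapping step index h+1 to a list of action tuples.
    """
    return [composite_policy(base_policy, h, a, n_act)
            for h in range(1, horizon) for a in core_action_seqs.get(h + 1, [])]
```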
Theorem 4.5. Let c > 0 be a large enough absolute constant and Θ be a class of rank-r γ-well-conditioned PSRs. For any δ ∈ (0, 1] and K ∈ N, if we choose β = c log(K N_Θ(1/K)/δ) and Π_exp specified by Equation (10) in OMLE (Algorithm 1), then with probability at least 1 − δ, we have ∑_{k=1}^K (V★ − V^{π^k}) ≤ poly(r, γ^{-1}, max_h |Q^a_h|, A, H) · √(β K).

The bound in Theorem 4.5 scales polynomially with respect to the rank r of the PSRs, the inverse well-conditioning parameter γ^{-1}, the number of core action sequences max_h |Q^a_h|, the log-bracketing number of the model class log N_Θ, the number of actions A, and the episode length H. In particular, (1) it does not depend on the size of the core tests, but instead only depends on the size of the core action sequences; (2) it is completely independent of the size of the observation space. Both properties empower our results to handle problems with continuous observations. Moreover, when the bracketing number satisfies log N_Θ(ε) ≤ polylog(1/ε) (e.g., for tabular PSRs and POMDPs with mixture-of-Gaussian observations), Theorem 4.5 guarantees that K = Õ(ε^{-2}) episodes suffice for finding an ε-optimal policy, which is optimal up to a polylogarithmic factor.
The proof of Theorem 4.5 relies on the following key lemma, which states that any class of well-conditioned PSRs satisfies the generalized eluder-type condition (Condition 3.1) with favorable d_Θ and ξ. Once Lemma 4.6 is established, Theorem 4.5 follows immediately from combining it with the guarantee of OMLE (Theorem 3.2).
Technical challenge. One of the key steps in proving Lemma 4.6 is to establish a generalized version of the elliptical potential lemma for Summation of Absolute values of Independent biLinear (SAIL) functions of the form ∑_{l}∑_{m} |⟨x_l, y_m⟩|. Although similar problems have been investigated in the previous analysis of OMLE [26], the bound derived therein scales with quantities that depend on the number of observations. As a result, that bound is incapable of handling settings with infinite observations. To address this issue, we develop a much tighter elliptical potential lemma which completely removes this dependence. With the help of this strengthened elliptical potential lemma and other newly developed techniques, we are able to prove Lemma 4.6 without suffering any dependence on the size of the observation space. We refer the interested reader to Appendix G.1 for more technical details.

Special cases: tabular PSRs.
To apply Theorem 4.5, we still need to upper bound the bracketing number of the model class Θ. The following result states that for tabular PSRs (i.e., PSRs with finite observations and actions) the log-bracketing number of Θ is always bounded.

Theorem 4.7 (bracketing number of tabular PSRs). Let Θ be the collection of all rank-r PSRs with O observations, A actions and episode length H. Then log N_Θ(ε) ≤ poly(r, O, A, H) · log(1/ε).
We remark that the bracketing number in Theorem 4.7 is independent of the size of the core tests or core action sequences. This is because the representation power of rank-r PSRs is limited to rank-r sequential decision making problems regardless of the choice of core tests.
The key intermediate step in proving Theorem 4.7 is to show that every low-rank sequential decision making problem admits an observable operator model (OOM) representation wherein the norms of the operators are well controlled. Once this argument is established, we can upper bound the bracketing number by discretizing those operators. In comparison, recent works on PSRs [40] simply assume that every PSR representation has bounded operator norm without proving it. To our knowledge, Theorem 4.7 provides the first polynomial upper bound on the bracketing number of tabular PSRs without any additional assumptions.
Finally, by plugging the above upper bound back into Theorem 4.5, we immediately obtain the following sample complexity bound for learning tabular PSRs:

IMPORTANT PSR SUBCLASSES
In this section, we introduce several partially observable RL problems of interest and prove that they are all special subclasses of γ-well-conditioned PSRs with moderate γ. All the proofs for this section are deferred to Appendix D.

Observable POMDPs
We first consider observable POMDPs [13]-an important, natural and rich subclass of POMDPs wherein there exists an integer m ∈ [H] so that any two different distributions over latent states induce different m-step observation-action distributions. We will prove a new result that OMLE can sample-efficiently learn any observable POMDP, even with infinite or continuous observations. We remark that while such a result has been proved in the setting of finite observations [26], the sample complexity in [26] has a polynomial dependence on the number of observations, and thus does not extend to the setting of continuous observations. Our new result is highly non-trivial: in addition to the sample-complexity guarantees for well-conditioned PSRs with continuous observations (Theorem 4.5), it further requires new techniques on matrix pseudo-inverses with small ℓ1-norm (Appendix G.3) and a new core test design technique (Appendix D.2).
To formally state the observability condition, we first define the m-step observation-action probability kernels G_h as follows: for an observation sequence o of length m, a latent state s and an action sequence a of length m−1, the value of the (o, a)-th probability function in G_h at s is P(o_{h:h+m−1} = o | s_h = s, a_{h:h+m−2} = a). We say a POMDP is m-step α-observable (m ∈ [H] and α > 0) if its m-step observation-action probability kernels satisfy the following condition.
Condition 5.1 (m-step α-observable condition). For any μ_1, μ_2 ∈ Δ_S and h ∈ [H − m + 1],

∥G_h μ_1 − G_h μ_2∥_1 ≥ α ∥μ_1 − μ_2∥_1.

In the above condition, we use ∥f − g∥_1 = ∫_{x∈X} |f(x) − g(x)| dx to denote the ℓ1-distance between two functions from X to R. Intuitively, Condition 5.1 can be viewed as a robust version of assuming that the probability functions in each G_h are linearly independent, which guarantees that for any two different latent state mixtures μ_1 ≠ μ_2 ∈ Δ_S, there exists an action sequence a of length m−1 so that these two mixtures can be distinguished from the distributions over the next m-step observations provided that action sequence a is executed.
The following theorem states that any m-step α-observable POMDP admits a well-conditioned PSR representation with core action sets equal to A^{m−1}.

Theorem 5.2. Let Θ be a model class of m-step α-observable POMDPs. Then Θ satisfies Condition 4.3 with γ = Ω(α/√S) and Q^a_h = A^{m−1}.

New PSR operators for observable POMDPs. The key challenge in proving Theorem 5.2 is to construct a set of PSR operators that satisfy Condition 4.3 with a parameter γ independent of the number of observations O. For simplicity of illustration, let us consider 1-step α-observable tabular POMDPs as examples in this paragraph. Previous work [26] and concurrent works [7, 40] all adopt an operator construction based on the pseudo-inverse O_h^† of the emission matrix at each step h. However, the resulting operators have ℓ1-norm that scales with the size of the observation space in the worst case, which hinders generalization to the infinite-observation setting. To address this issue, we propose a different operator construction based on a novel ℓ1-norm matrix inverse technique (Lemma G.4), which replaces O_h^† with O_h^† + Y_h for a carefully chosen correction matrix Y_h and, importantly, satisfies Condition 4.3 with γ completely independent of O. When moving from the single-step observable tabular setting to the more challenging multi-step observable infinite-observation setting, the same idea still plays an important role in constructing well-conditioned PSR operators, where we first use a novel partition technique to group different observations to obtain an (α/2)-observable meta-POMDP with finitely many (but exponentially many) meta-observations, and then apply the above operator construction on top of the meta-POMDP. For more technical details, please refer to Appendices D.1 and D.2.
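For intuition, the following sketch shows the pseudo-inverse-based operator construction from the prior work cited above for a 1-step observable tabular POMDP, as we read it; the paper's new construction replaces the plain pseudo-inverse with a corrected matrix O† + Y, which we do not attempt to reproduce here. The matrix conventions below are illustrative assumptions.

```python
import numpy as np

def oom_operator(T_a, O_h, O_h_next, o):
    """Prior-work style PSR/OOM operator for a 1-step observable tabular POMDP.

    T_a:      (S, S) transition matrix for action a at step h, T_a[s_next, s] = P(s_next | s, a).
    O_h:      (O, S) emission matrix at step h, O_h[o, s] = P(o | s).
    O_h_next: (O, S) emission matrix at step h+1.
    Returns the (O, O) operator O_{h+1} T_a diag(O_h[o, :]) O_h^+, whose l1-norm can
    scale with the number of observations; the paper's correction term is meant to
    remove exactly that dependence.
    """
    return O_h_next @ T_a @ np.diag(O_h[o, :]) @ np.linalg.pinv(O_h)
```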
Sample complexity. By combining Theorem 5.2 with Theorem 4.5, we immediately obtain the following sample-efficiency guarantee for learning observable POMDPs with OMLE.
Corollary 5.3. Let Θ be a model class of m-step α-observable POMDPs. There exists an absolute constant c > 0 such that for any δ ∈ (0, 1] and K ∈ N, if we choose β = c log(K N_Θ(1/K)/δ) in OMLE (Algorithm 1), then with probability at least 1 − δ, we have ∑_{k=1}^K (V★ − V^{π^k}) ≤ poly(S, α^{-1}, A^m, H) · √(β K).

Different from previous works on tabular POMDPs [12, 20, 26], where the sample complexity scales with the number of observations, the result in Corollary 5.3 completely gets rid of the dependence on O thanks to our novel PSR operator design discussed above. As a result, it also applies to learning observable POMDPs with continuous observations as long as the log-bracketing number of the model class Θ is well controlled, whereas previous works cannot.

Observable tabular POMDPs.
We first consider tabular observable POMDPs, where the number of observations is finite. In this case, the m-step observation-action probability kernel G_h is equivalent to an O^m A^{m−1}-by-S matrix wherein the entry at the intersection of the (o, a)-th row and the s-th column is equal to P(o_{h:h+m−1} = o | s_h = s, a_{h:h+m−2} = a), and the observable condition (Condition 5.1) can be equivalently written as: for any μ_1, μ_2 ∈ Δ_S and h ∈ [H − m + 1], ∥G_h(μ_1 − μ_2)∥_1 ≥ α∥μ_1 − μ_2∥_1. (13) To apply OMLE to tabular POMDPs with S states, O observations and A actions, we choose the model class Θ to consist of all the legitimate POMDP parameterizations θ = (T, O, μ_1) whose corresponding m-step observation-action probability matrices satisfy Equation (13). By a simple discretization argument [e.g., see Appendix B in 26], we can bound the ε-bracketing number of Θ by log N_Θ(ε) ≤ poly(S, O, A, H) · log(1/ε). (14) Plugging this upper bound back into Theorem 5.2, we immediately recover the sample-efficiency guarantee for learning tabular observable POMDPs in [26].

Observable POMDPs with Gaussian emission.
To showcase the power of Theorem 5.2 in handling POMDPs with continuous observations, we consider the model of POMDPs with Gaussian-mixture emissions (abbreviated as GM-POMDPs hereafter), which can be intuitively viewed as tabular observable or weakly-revealing POMDPs whose observations are corrupted by Gaussian noise. The Gaussian emissions further allow us to directly control the bracketing number. We start with the formal definition of GM-POMDPs.
Without further assumptions on GM-POMDPs, the observable condition can be arbitrarily violated and sample-efficient learning is in general impossible. Therefore, we introduce the following natural separation condition on the Gaussian mixtures in GM-POMDPs which, once satisfied, immediately implies the observable condition. Informally, the α-separable condition (Condition 5.5) requires that for all h ∈ [H], i ≠ j ∈ [n] and μ_1, μ_2 ∈ Δ_S, (a) different base Gaussian components are well separated, which is standard in learning Gaussian mixtures in classical statistics, and (b) different latent state distributions induce different weights over the base Gaussian components, which resembles the one-step observable condition for tabular POMDPs. Importantly, in Lemma D.2 in Appendix D.3, we show that any GM-POMDP satisfying the α-separable condition is an Ω(α)-observable POMDP. We remark that GM-POMDPs belong to the infinite-observation extension of tabular observable POMDPs but not of tabular weakly-revealing POMDPs; this is also the major reason we choose to present observable POMDPs in Section 5.1.
To apply OMLE to learning α-separable GM-POMDPs with S states, A actions and n base Gaussian components in R^d, we construct the model class Θ to include all the valid POMDP models wherein (a) the observation distributions are α-separable (Condition 5.5) and (b) the norms of the means and the variances of the base Gaussian components are well behaved, i.e., the means have bounded ℓ2-norm and the standard deviations are bounded from above and below.
By carefully discretizing the parameter space and constructing envelope functions, we can derive an upper bound on the bracketing number of the model class Θ that is polynomial in all relevant parameters (Lemma D.3 in Appendix D.3). Now that we know α-separable GM-POMDPs are Ω(α)-observable and have bounded bracketing number, we can invoke Theorem 5.2, which gives a polynomial sample complexity guarantee for learning α-separable GM-POMDPs with OMLE.
Despite the observation space being infinitely large and unbounded, the resulting sample complexity scales only polynomially with respect to the dimension d of the observation space and other relevant finite parameters. Finally, we emphasize that although we only focus on POMDPs with Gaussian-mixture observations in this subsection, our main result (Theorem 5.2) also applies to learning other types of continuous observation distributions as long as the observable condition (Condition 5.1) holds and the model class has bounded bracketing number.
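For GM-POMDPs, the ingredient OMLE needs from the continuous observation space is the log-density of an observation under a state-dependent Gaussian mixture; a minimal sketch (assuming, for illustration, a shared isotropic covariance) follows.

```python
import numpy as np

def gm_emission_log_density(obs, means, weights, sigma):
    """Log-density of one observation under a Gaussian-mixture emission.

    obs:     observation in R^d.
    means:   (n, d) means of the n base Gaussian components at this step.
    weights: (n,) mixing weights induced by the current latent state.
    sigma:   shared isotropic standard deviation (an illustrative simplification).
    """
    obs, means = np.asarray(obs, dtype=float), np.asarray(means, dtype=float)
    d = means.shape[1]
    sq_dist = np.sum((means - obs) ** 2, axis=1)
    log_comp = -0.5 * sq_dist / sigma ** 2 - 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    # log-sum-exp over components, weighted by the state-dependent mixing weights
    m = np.max(log_comp)
    return float(m + np.log(np.sum(np.asarray(weights) * np.exp(log_comp - m))))
```

In OMLE's objective, such per-step densities would be combined with a forward pass over latent states to obtain the trajectory log-likelihood log P_θ^π(τ).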

Multi-Step Decodable POMDPs
Multi-step decodable POMDPs [9] are a subclass of POMDPs in which a suffix of length m of the most recent history contains sufficient information to decode the latent state. We remark that neither of multi-step decodable POMDPs and multi-step observable POMDPs is more general than the other; each of them contains statistically tractable POMDP instances that are not included in the other class (see Lemma D.4 in Appendix D.5 for concrete constructions). Nonetheless, the following theorem states that multi-step decodable POMDPs also fall into the family of γ-well-conditioned PSRs with γ = 1 and sets of core action sequences equal to A^m. As a result, OMLE also enjoys a polynomial sample-efficiency guarantee for learning multi-step decodable POMDPs.
Theorem 5.8. Let Θ be a model class of m-step decodable POMDPs. Then Θ admits rank-r PSR representations with Q^a_h = A^{min{m, H−h}} and satisfies Condition 4.3 with γ = 1. Moreover, there exists an absolute constant c > 0 such that for any δ ∈ (0, 1] and K ∈ N, if we choose β = c log(K N_Θ(1/K)/δ) in OMLE (Algorithm 1), then with probability at least 1 − δ, we have ∑_{k=1}^K (V★ − V^{π^k}) ≤ poly(r, A^m, H) · √(β K), where we always have r ≤ S in any POMDP and r ≤ d_lin when the underlying MDP can be represented as a d_lin-dimensional kernel linear MDP.
Similar to the results in previous sections, the sample complexity in Theorem 5.8 is independent of the number of observations, which means it also applies to the case of infinite observations as long as the log-bracketing number of Θ is finite. Moreover, the above result scales with the rank r of the PSR representations instead of the number of latent states S. Although it is well known that r ≤ S in any POMDP, the rank r can be much smaller than the number of latent states in certain settings of interest. For example, when the underlying MDP can be represented as a d_lin-dimensional linear kernel MDP [39], we have r ≤ d_lin while S can be arbitrarily large.
Finite observations. When the number of observations is finite, we can easily upper bound the bracketing number of Θ by the standard discretization argument as in Equation (14). By plugging this bound back into Theorem 5.8, we immediately obtain a poly(S, O, A^m, H, log δ^{-1}) × ε^{-2} sample complexity upper bound for finding an ε-optimal policy with OMLE in tabular m-step decodable POMDPs.

POMDPs with A Few Known Core Action Sequences
In Section 5.1, we proved that if a POMDP satisfies the property that any two state mixtures can be distinguished from the observation distributions induced by taking m steps of random actions, then it can be represented as a well-conditioned PSR and OMLE can learn it sample-efficiently. However, the sample complexity there scales exponentially with respect to m due to the m-step random exploration, which could be prohibitively large even for moderate m. In this subsection, we show that it is possible to get rid of this exponential dependence when there exists a small set of known exploratory action sequences such that any two state mixtures can be distinguished from the observation distributions induced by at least one exploratory action sequence.
To simplify notations, we first define the observation-action probability kernel K_h at step h ∈ [H]: for a latent state s and an action sequence a of length ℓ ≤ H − h, K_h(s, a) is the probability density function over o_{h:h+ℓ} when action sequence a is executed from state s at step h. Formally, we consider the following observability-style condition.

Condition 5.9. For any h ∈ [H], there exists a known set A_h of action sequences so that for any θ ∈ Θ and μ_1, μ_2 ∈ Δ_S, the observation distributions induced by μ_1 and μ_2 under at least one action sequence in A_h differ by at least α∥μ_1 − μ_2∥_1 in ℓ1-distance.

Notice that in Condition 5.9 the exploratory action sequences in A_h can have length Ω(H), which means a POMDP class Θ that satisfies Condition 5.9 might satisfy the m-step observable condition only for m = Ω(H). Nonetheless, the following theorem states that as long as Θ satisfies Condition 5.9 with A_h of small cardinality, OMLE is guaranteed to learn a near-optimal policy for any θ ∈ Θ within a number of samples that scales only polynomially with respect to max_h |A_h|. When the number of exploratory action sequences (max_h |A_h|) is small but their length (max_h max_{a∈A_h} |a|) is large, Theorem 5.10 offers an exponentially sharper sample complexity guarantee than Theorem 5.2. As an extreme case, when each A_h contains a single action sequence of length H − h, Theorem 5.10 improves over Theorem 5.2 by a factor of A^{Ω(H)}.

BEYOND LOW-RANK SEQUENTIAL DECISION MAKING
In this section, we extend the sample-efficiency guarantees of OMLE to any sequential decision making problem satisfying a new structural condition-the SAIL condition. We will show that the SAIL condition holds not only for all well-conditioned low-rank sequential decision making problems studied in Section 4, but also for problems beyond low-rank sequential decision making, such as factored MDPs and low witness rank problems.

SAIL Condition
In the fully observable setting, RL with general function approximation has been intensively studied in the theory community, and various complexity measures have been proposed, including Bellman rank [16], witness rank [34], and more [8, 19]. Most of them critically rely on the Bellman error (model-free setting) or the error in model estimation (model-based setting) having a bilinear structure. Unfortunately, partial observability significantly complicates the learning problem, and neither structure mentioned above holds even for the basic tabular weakly-revealing POMDPs.
Here, we introduce a new general structural condition that is also capable of addressing the partially observable setting. Our new condition can be viewed as a generalization of the bilinear structures mentioned above. Since our focus is OMLE, which is a model-based algorithm, our new condition requires the model estimation error to be upper and lower bounded by a Summation of Absolute values of Independent biLinear functions (SAIL). Formally, let Π denote the universal policy space.

Condition 6.1 (SAIL condition). We say a model class Θ satisfies the (d, κ, B)-SAIL condition with exploration policy function Π_exp : Π → 2^Π if there exist two sets of mappings {f_{h,l}}_{(h,l)∈[H]×[L]}, {g_{h,m}}_{(h,m)∈[H]×[M]} from Θ to R^d satisfying three inequalities for any θ, θ' ∈ Θ and the optimal policy π_θ of model θ. The first inequality requires the model estimation error of θ' (measured by TV distance) on the exploration policies computed using θ to be lower bounded by the coefficient κ^{-1} times SAIL, i.e., κ^{-1} ∑_{h,l,m} |⟨f_{h,l}(θ), g_{h,m}(θ')⟩|. In particular, each summand ⟨f_{h,l}(θ), g_{h,m}(θ')⟩ is a bilinear function: it is a linear function of f_{h,l}(θ) (the features of θ) when θ' is fixed, and it is a linear function of g_{h,m}(θ') (the features of θ') when θ is fixed. The second inequality requires the model estimation error of θ on its own optimal policy to be upper bounded by SAIL. The third inequality is a normalization condition with parameter B.
At a high level, the standard Bellman rank or witness rank can be viewed as conditions similar to SAIL, with the LHS of the first two inequalities replaced by an appropriate error measure and the RHS of the first two inequalities replaced by a single bilinear function ⟨f(θ), g(θ')⟩. The SAIL condition generalizes them by allowing multiple feature functions {f_{h,l}}_{l∈[L]}, {g_{h,m}}_{m∈[M]}, indexed by l and m, and taking a summation over them. One key structure here is that the indices are decoupled between the two feature sets f and g, and the summation is taken over the two indices independently. This is crucial in many partially observable applications where L and M are extremely large and we do not want to suffer any dependence on L, M in the sample complexity.
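The decoupled-index structure discussed above also makes the SAIL quantity easy to evaluate for given feature maps: per step, it is simply the entrywise ℓ1-norm of the matrix F_h G_h^⊤. A small illustrative sketch (array layouts assumed) follows.

```python
import numpy as np

def sail_value(features_f, features_g):
    """SAIL quantity  sum_h sum_{l,m} |< f_{h,l}(theta), g_{h,m}(theta') >|.

    features_f[h]: array of shape (L_h, d), rows are f_{h,l}(theta).
    features_g[h]: array of shape (M_h, d), rows are g_{h,m}(theta').
    Per step, the double sum over l and m is the entrywise l1-norm of F_h G_h^T,
    with each feature set enumerated independently of the other.
    """
    return float(sum(np.abs(np.asarray(F, dtype=float) @ np.asarray(G, dtype=float).T).sum()
                     for F, G in zip(features_f, features_g)))
```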
We will prove in Section 6.3 that the SAIL condition is very general: it holds not only for all well-conditioned low-rank sequential decision making problems studied in Section 4, but also for problems beyond low-rank sequential decision making, such as factored MDPs and low witness rank problems.

Theoretical Guarantees for SAIL
Now we present the theoretical guarantees for OMLE in learning sequential decision making problems that satisfy the SAIL condition.

Theorem 6.2. There exists an absolute constant c > 0 such that for any δ ∈ (0, 1] and K ∈ N, if we choose β = c log(K N_Θ(1/K) |Π_exp| / δ) in OMLE (Algorithm 1) and assume the (d, κ, B)-SAIL condition holds, then with probability at least 1 − δ, we have ∑_{k=1}^K (V★ − V^{π^k}) ≤ poly(d, κ, B, H) · √(|Π_exp| β K).

The result in Theorem 6.2 scales polynomially with respect to the parameters (d, κ, B) and the number of exploration policies |Π_exp| in the SAIL condition. Moreover, the result is completely independent of the number of feature mappings L and M, which is key in addressing the case of well-conditioned PSRs, where the SAIL condition requires exponentially many feature mappings. When the log-bracketing number has a reasonable growth rate log N_Θ(ε) ≤ polylog(1/ε), Theorem 6.2 guarantees that K = Õ(poly(d, κ, B, H, |Π_exp|) · log N_Θ(ε^{-1}) · ε^{-2}) episodes suffice for finding an ε-optimal policy. The ε-dependency is optimal up to polylogarithmic factors.
The critical step in proving Theorem 6.2 is our new elliptical potential style lemma for SAIL, which significantly generalizes the standard elliptical potential lemma that only applies to bilinear functions. Our new lemma immediately implies that any model class satisfying the SAIL condition also satisfies the generalized eluder-type condition (Condition 3.1).
With this lemma, we can directly invoke the guarantee for OMLE (Theorem 3.2), which gives the bound in Theorem 6.2.
Sharper guarantee for a single feature mapping. For sequential decision making problems that satisfy the SAIL condition with a single pair of feature mappings (f_h, g_h) for each h ∈ [H], e.g., sparse linear bandits, factored MDPs, and linear MDPs, we can further derive the following sharper sample complexity guarantee.

Theorem 6.4. Suppose the (d, κ, B)-SAIL condition holds with L = M = 1. Then under the same choice of parameters as in Theorem 6.2, OMLE satisfies, with probability at least 1 − δ, ∑_{k=1}^K (V★ − V^{π^k}) ≤ poly(d, κ, B, H) · √(β K).

Theorem 6.4 directly implies a regret bound whose leading-order term scales as Õ(√(K log N_Θ(K^{-1}))), up to problem-dependent factors, when the exploration policy function Π_exp is the identity; this shaves a square-root factor off the bound in Theorem 6.2. Theorem 6.4 also implies an Õ(log N_Θ(ε^{-1}) · ε^{-2}) sample complexity upper bound, again up to problem-dependent factors, for finding an ε-optimal policy when the log-bracketing number of Θ grows polylogarithmically with respect to the covering precision, which improves by a further factor over the sample complexity implied by Theorem 6.2.

Important Examples of SAIL
In this section, we present several widely studied sequential decision making problems that satisfy the SAIL condition. We remark that all problems considered in this section are MDPs, so we will use {s_h}_{h=1}^H to denote states.

Low-rank sequential decision making.
To demonstrate the generality of the SAIL condition, we prove the following proposition, which states that (a) any well-conditioned PSR satisfies the SAIL condition with moderate (d, κ, B), and (b) there exist sequential decision making problems whose system-dynamics matrices have exponentially large rank but which still satisfy the SAIL condition with mild (d, κ, B).
The Q-type witness condition requires that, at each single step, the expected model discrepancy between the true model θ★ and a candidate model θ' under the state-action distribution induced by the optimal policy of θ is roughly proportional to the inner product of the features of θ and θ'. The V-type version is defined similarly, except that the last action a_h is sampled from the optimal policy of θ' instead of that of θ. By basic algebra, we can easily relate the above per-step model discrepancy in the witness condition to the whole-trajectory model discrepancy in the SAIL condition, which leads to the conclusion that the SAIL condition is satisfied with almost the same (d, κ, B) whenever either the Q-type or the V-type witness condition holds. When log N_Θ grows polylogarithmically with respect to the covering precision, plugging Proposition 6.7 back into Theorem 6.4 immediately yields an Õ(log N_Θ(ε^{-1}) · ε^{-2}) sample complexity upper bound, up to polynomial factors in the witness rank and H, for OMLE in the V-type witness rank setting, which improves over the quadratic dependence on the witness rank in [34]. Moreover, OMLE further enjoys an Õ(√(K log N_Θ(K^{-1}))) regret guarantee, again up to problem-dependent factors, in the Q-type witness rank setting, which is new to our knowledge.
In other words, the transition of the i-th factor of the state is determined only by a subset of all factors, denoted pa_i, instead of by the whole state. In factored MDPs, it is standard to assume that the factorization structure and the reward function are known [21, 34]. Therefore, our model class Θ only needs to parameterize the transitions under the given factorization structure.
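A factored transition model with known factorization structure can be parameterized by one conditional table per factor, as in the following minimal sketch (the table encoding is an illustrative assumption).

```python
def factored_transition_prob(next_state, state, action, parents, factor_cpts):
    """P(s' | s, a) in a factored MDP with known factorization structure.

    state, next_state: tuples of factor values.
    parents[i]:        indices of the parent factors pa_i that the i-th factor depends on.
    factor_cpts[i]:    conditional table; factor_cpts[i][(parent_values, action)] is a
                       probability vector over the i-th factor's next value.
    """
    prob = 1.0
    for i, s_next_i in enumerate(next_state):
        parent_values = tuple(state[j] for j in parents[i])
        prob *= factor_cpts[i][(parent_values, action)][s_next_i]
    return prob
```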
Sparse linear bandits. In sparse linear bandits, the mean reward function can be represented as a sparse linear function of the arm feature. Formally, we have r(a) = ⟨a, θ⟩ where (i) the arm a ∈ A lies in a ball of radius R_A in R^{d_lin}, and (ii) Θ := {θ in the ball of radius R_Θ in R^{d_lin} : ∥θ∥_0 ≤ k and ⟨a, θ⟩ ∈ [0, 1] for any a ∈ A}. Without loss of generality, assume the stochastic reward feedback is binary. The following proposition states that the witness rank of sparse linear bandits is no larger than the ambient dimension d_lin.

Proposition 6.10. Let Θ be the family of d_lin-dimensional k-sparse linear bandits. Then Θ satisfies the Q-type witness condition with d = d_lin, κ = 1, and B = 4 d_lin R_Θ R_A.
By combining Proposition 6.10 with Proposition 6.7 and Theorem 6.4, we recover the optimal Õ(√(k d_lin K)) regret for sparse linear bandits up to a polylogarithmic factor.