An optimal tradeoff between entanglement and copy complexity for state tomography

There has been significant interest in understanding how practical constraints on contemporary quantum devices impact the complexity of quantum learning. For the classic question of tomography, recent work tightly characterized the copy complexity for any protocol that can only measure one copy of the unknown state at a time, showing it is polynomially worse than if one can make fully-entangled measurements. While we now have a fairly complete picture of the rates for such tasks in the near-term and fault-tolerant regimes, it remains poorly understood what the landscape in between looks like. In this work, we study tomography in the natural setting where one can make measurements of $t$ copies at a time. For sufficiently small $\epsilon$, we show that for any $t \le d^2$, $\widetilde{\Theta}(\frac{d^3}{\sqrt{t}\epsilon^2})$ copies are necessary and sufficient to learn an unknown $d$-dimensional state $\rho$ to trace distance $\epsilon$. This gives a smooth and optimal interpolation between the known rates for single-copy and fully-entangled measurements. To our knowledge, this is the first smooth entanglement-copy tradeoff known for any quantum learning task, and for tomography, no intermediate point on this curve was known, even at $t = 2$. An important obstacle is that unlike the optimal single-copy protocol, the optimal fully-entangled protocol is inherently biased and thus precludes naive batching approaches. Instead, we devise a novel two-stage procedure that uses Keyl's algorithm to refine a crude estimate for $\rho$ based on single-copy measurements. A key insight is to use Schur-Weyl sampling not to estimate the spectrum of $\rho$, but to estimate the deviation of $\rho$ from the maximally mixed state. When $\rho$ is far from the maximally mixed state, we devise a novel quantum splitting procedure that reduces to the case where $\rho$ is close to maximally mixed.


Introduction
As the computational resources for near-term quantum devices continue to grow, so too does their potential to help us analyze quantum experimental data and learn about the physical universe. It is timely not only to understand the fundamental limitations that contemporary devices impose for such tasks relative to fault-tolerant quantum computation, but also to map out avenues for gracefully scaling up our near-term algorithms as the platforms on which we run them mature.
Here we examine this challenge for arguably the most fundamental problem in quantum data analysis: state tomography. Recall that in this problem, one is given $n$ copies of a $d$-dimensional mixed state $\rho$, and the goal is to estimate its density matrix to error $\epsilon$, e.g. in trace distance, by measuring copies of $\rho$. The standard figure of merit here is copy complexity: how small does $n$ have to be, in terms of $d$ and $1/\epsilon$?
Given fault-tolerant quantum computers, [18] and [30] settled the optimal copy complexity of this problem, showing that $n = \Theta(d^2/\epsilon^2)$ copies are necessary and sufficient. The protocols they gave came with an important caveat, however: they require fully coherent measurements across the joint state $\rho^{\otimes n}$. In conventional experimental setups, this renders the algorithms impractical, as typically we do not have the ability to simultaneously prepare and store so many copies of our quantum state. Motivated by contemporary device constraints, there has been a great deal of recent interest in understanding the other extreme, namely what one can do with incoherent (a.k.a. independent or unentangled) measurements, where the algorithm measures a single copy of $\rho$ at a time. These algorithms are much more practical to run: the experimenter merely needs to prepare a single copy of the state at a time, interact with it once, and repeat. Recent work of [6] determined the optimal copy complexity in this setting, showing that $n = \Theta(d^3/\epsilon^2)$ copies are necessary and sufficient, and thus that incoherent measurements are provably weaker than coherent measurements.
That said, the incoherent setting increasingly appears to be an overly restrictive proxy for current platforms and realistic experimental settings. For instance, an experimenter could simultaneously prepare $t$ copies of the state by simply replicating her experimental setup $t$ times, and some existing devices can already store multiple (albeit few) copies of a quantum state at a time [2,21,25]. In many settings, this is already very powerful. For instance, recent experimental demonstrations of exponential advantage for estimating Pauli observables using Bell measurements naturally operate on two copies of the input state at a time [20], and cannot be carried out with incoherent measurements. In fact, the incoherent setting precludes even basic quantum learning primitives like the SWAP test.
Yet remarkably, prior to the present work, it was not even known how to rule out the possibility of achieving $\Theta(d^2/\epsilon^2)$ copy complexity using measurements of just $t = 2$ copies of $\rho$ at a time! This raises the natural question: When are $t$-entangled measurements asymptotically stronger than incoherent measurements for state tomography?
Apart from the practical motivation, understanding $t$-entangled measurements also poses an important conceptual challenge in theory, namely the fundamental incompatibility between existing algorithms in the fully entangled and incoherent settings.
The fully entangled algorithms of [18] and [30] crucially rely on estimates derived from weak Schur sampling, which is grounded in the representation theory of the symmetric and general linear groups. Indeed, if one is only allowed a single entangled measurement, then any optimal measurement provably must first go through weak Schur sampling [22]. However, at the same time, this approach seems inherently tied to such "one-shot" algorithms. For example, while one can show that sampling from the Schur-Weyl distribution yields an estimate for the eigenvalues of $\rho$ which is close in expectation, it is actually a biased estimator for the spectrum (indeed, its expectation always majorizes the spectrum; see Lemma 4.1.2 in [34]). This bias decreases with $t$, and unless $t = \Omega(d^2/\epsilon^2)$, it is unclear if weak Schur sampling on $t$ copies gives any useful information, let alone whether one can somehow combine the outcomes of several different trials of weak Schur sampling in a meaningful fashion. This is in contrast to algorithms for incoherent measurements, which rely on many unbiased (but less informative) measurements of the underlying state, which can then be averaged to obtain a good overall estimate. Unfortunately, this unbiasedness seems specific to incoherent measurements. Ultimately, neither approach by itself seems capable of yielding a nontrivial result for general $t$-entangled measurements. This calls for a new algorithmic framework that can synthesize the two techniques.

Our results
In this paper, we give a tight characterization of the copy complexity of state tomography with $t$-entangled measurements, for a large range of $t$. More specifically, we show the following pair of theorems. Here, and throughout the paper, we let $\widetilde{O}$, $\widetilde{\Omega}$, and $\widetilde{\Theta}$ hide polylogarithmic factors in $d$ and $1/\epsilon$, and we let $\|\cdot\|_1$ denote the trace norm of a $d \times d$ matrix.
Theorem 1.1 (informal upper bound, see Theorem 5.7). Let $t \le \min(d^2, (\sqrt{d}/\epsilon)^c)$ for some (sufficiently small) universal constant $c$. There is an algorithm that uses $n = \widetilde{O}(d^3/(\sqrt{t}\,\epsilon^2))$ total copies of $\rho$ by taking $n/t$ separate $t$-entangled measurements, and outputs $\widehat{\rho}$ so that $\|\widehat{\rho} - \rho\|_1 \le \epsilon$ with high probability.
Theorem 1.2 (informal lower bound, see Theorem 7.13). Let $t \le 1/\epsilon^c$ for some (sufficiently small) universal constant $c$. Any algorithm that uses $n$ total copies of $\rho$ by taking $n/t$ separate (but possibly adaptively chosen) $t$-entangled measurements must have $n = \widetilde{\Omega}(d^3/(\sqrt{t}\,\epsilon^2))$ to succeed at state tomography with probability $\ge 0.01$.
Together, these theorems imply that the copy complexity of tomography with $t$-entangled measurements is, up to logarithmic factors, exactly $\Theta(d^3/(\sqrt{t}\,\epsilon^2))$, for all $t \le 1/\epsilon^c$. In other words, for this regime of $t$, the sample complexity smoothly improves with $t$. To our knowledge, this is the first nontrivial setting in which the copy complexity has been sharply characterized to smoothly depend on the amount of entanglement. Prior works on the power of $t$-entangled measurements either did not obtain sharp tradeoffs, or studied settings where simply taking $t$ to be a large enough constant sufficed to obtain optimal copy complexity (e.g. Bell sampling [10,27]).
We pause here to make a few remarks about our result. First, as mentioned above, the most practically pertinent regime is where $t$ is small but greater than 1. Our result fully characterizes this setting so long as $\epsilon = o(1)$, and obtains a nontrivial partial tradeoff even for constant $\epsilon$. We believe that the restriction $t \le 1/\epsilon^c$ is ultimately an artifact of our techniques, and we conjecture that $\Theta(d^3/(\sqrt{t}\,\epsilon^2))$ is the right rate for all $t \le O(d^2)$. We leave resolving this question as an interesting future direction.
Second, we note that our result implies that when $\epsilon$ is sufficiently small, entanglement only helps up until a certain point. When $\epsilon \ll 1/d$, we can take $t = \Theta(d^2)$, and our algorithm achieves a rate of $n = \widetilde{O}(d^2/\epsilon^2)$, which matches (up to logarithmic factors) the lower bound against general quantum algorithms with any degree of entanglement. In other words, we can match the rate of the fully coherent algorithm with asymptotically less coherence. To our knowledge, this is the first instance of a natural learning problem where a super-constant, but still partial, amount of entanglement is the minimal amount required to obtain the statistically optimal rate.

Our techniques.
As previously mentioned, we need a number of novel ideas to overcome the difficulties with prior techniques in the $t$-entangled setting. In this section, we describe some of these conceptual contributions at a high level. For a more in-depth discussion of our techniques, we refer the reader to Section 2.
A key idea, which is crucial to both our upper and lower bounds, and which to our knowledge is novel in the literature, is something we call linearization. For any state $\rho$, we can write it as $\rho = \frac{I_d}{d} + E$, where $E$ is some traceless Hermitian matrix. Then, for any integer $t$, we observe that $\rho^{\otimes t}$ can be approximated as
$$\rho^{\otimes t} \;\approx\; \frac{I_{d^t}}{d^t} + \frac{1}{d^{t-1}}\,\mathrm{sym}(E), \qquad (1)$$
where here and throughout the paper, we let $\mathrm{sym}$ denote the symmetric sum, i.e. $\mathrm{sym}(E) = \sum_{i=1}^{t} I_d^{\otimes(i-1)} \otimes E \otimes I_d^{\otimes(t-i)}$. We call the expression on the RHS of (1) the $t$-linearization of $\rho$, as it is the linear term (in $E$) of the expansion of $\rho^{\otimes t}$. Note that while the linearization of the state is not necessarily itself a mixed state (as it need not be PSD), we can still perform formal calculations with it. For our purposes, the linearization of $\rho$ has a number of crucial properties. While estimators based on weak Schur sampling do not yield an unbiased estimate of $\rho$ when applied to $\rho^{\otimes t}$, they do yield an unbiased estimate of $E$ (up to a known correction term) when "applied" to the $t$-linearization of $\rho$. Therefore, as long as the linearization of $\rho$ is a good approximation for $\rho^{\otimes t}$, we may average the results of many independent trials of such estimators to obtain a better estimate for $E$, and consequently, $\rho$. For the lower bound, it turns out that the amount of information gained by the algorithm from a measurement outcome (more formally, the likelihood ratio of the posterior distribution for the algorithm starting from two different states; see Lemma 6.2) is controlled by the Frobenius norm of some linear transformation of the corresponding POVM element.
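To make the linearization concrete, the following is a small numerical sketch (not part of the paper's formal development) comparing $\rho^{\otimes t}$ against its $t$-linearization for a randomly chosen small traceless Hermitian perturbation $E$; the dimensions and the norm scale of $E$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 3, 3

# Random traceless Hermitian perturbation E with small operator norm.
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
E = (A + A.conj().T) / 2
E -= (np.trace(E).real / d) * np.eye(d)
E *= 0.01 / np.linalg.norm(E, 2)      # spectral norm 0.01

I = np.eye(d)
rho = I / d + E

def kron_all(mats):
    out = np.eye(1)
    for M in mats:
        out = np.kron(out, M)
    return out

rho_t = kron_all([rho] * t)

# sym(E): sum over positions of I ⊗ ... ⊗ E ⊗ ... ⊗ I
sym_E = sum(kron_all([E if j == i else I for j in range(t)]) for i in range(t))

linearization = np.eye(d**t) / d**t + sym_E / d**(t - 1)

err = np.linalg.norm(rho_t - linearization)   # Frobenius norm of neglected terms
lin_term = np.linalg.norm(sym_E) / d**(t - 1) # size of the kept linear term
print(err, lin_term)
```

The higher-order terms carry extra factors of the small matrix $E$, so `err` comes out far smaller than `lin_term`, which is what makes averaging estimators against the linearization sound.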
Another important and novel algorithmic idea is that of quantum splitting. It turns out that our techniques work best for states which are relatively spread out, i.e. states with spectral norm at most $O(1/d)$.
To extend our techniques to general mixed states, we give a novel procedure which reversibly transforms any state into another state of marginally larger dimension that is spectrally bounded. One can view this as a quantum generalization of the splitting procedure introduced in [14] for classical distribution learning and testing. There, as well, the goal was to take an arbitrary distribution over $[d]$ elements and produce a "well-spread" distribution over a slightly larger domain. However, more care must be taken in the quantum setting, as our splitting procedure must be done obliviously to the (unknown) eigenvectors of the true state $\rho$, and so any such procedure will nontrivially alter the spectral properties of $\rho$.

Related work
This work comes in the wake of a flurry of recent works characterizing the effect that near-term constraints like noise and limited quantum memory have on statistical rates for quantum learning tasks. Many of these have focused on proving lower bounds for protocols that can only make incoherent measurements, e.g. [4,13,12,9,10,1,15,6]. Among these works, the two most relevant ones are [6] and [9].
In [6], the authors prove an optimal lower bound on the copy complexity of state tomography with incoherent measurements, showing that the copy complexity upper bound of $O(d^3/\epsilon^2)$ originally obtained by [24] is tight. The proof of the lower bound in Theorem 1.2 builds upon the general proof strategy developed in that work. Roughly speaking, in lieu of standard mutual information-based proofs of minimax lower bounds, this framework involves showing that for some prior over input states, if one observes a typical sequence of measurement outcomes resulting from any protocol that makes insufficiently many incoherent measurements, the conditional distribution over the input state places negligible mass on the true state. As we explain in Section 2, there are a number of obstacles that we must overcome in order to adapt this strategy to the $t$-entangled setting.
In [9], the authors are motivated by the same general question as the present work: are there tasks for which being able to perform $t$-entangled measurements for large $t$ is strictly more powerful than for small $t$? Instead of tomography, they studied a certain hypothesis testing task based on distinguishing approximate state $k$-designs on $n$ qubits from maximally mixed. They show that when $t < k$, any protocol must use exponentially many copies, but with fully entangled measurements, $\mathrm{poly}(n, k, 1/\epsilon)$ copies suffice because one can simply run quantum hypothesis selection [3]. Unlike the present work, they do not give a full tradeoff. Additionally, they prove some partial results for using $t$-entangled measurements for mixedness testing [29,4,12], where the goal is to distinguish the maximally mixed state from states that are $\epsilon$-far in trace distance from maximally mixed. Specifically, when $t = o(\log(d))/\epsilon^2$, the polynomial dependence on $d$ in their lower bound matches that of the lower bound of [4] for $t = 1$. Although that lower bound is suboptimal even for $t = 1$, and although [9] did not achieve a full tradeoff for mixedness testing, we remark that it hints at an interesting phenomenon. Unlike in tomography, where copy complexity smoothly decreases as $t$ increases, for mixedness testing the number of copies $t$ must exceed some $\epsilon$-dependent threshold before one can improve upon the single-copy rate.
Lastly, we remark that our analysis of the upper bound makes crucial use of certain representation-theoretic estimates proved by [30]. While these estimates were originally used to analyze Keyl's estimator in the fully entangled setting, in our analysis we leverage them in a novel way to prove guarantees in the partially entangled setting (see Section 2 for details).

Discussion and open problems
In this work we gave, to our knowledge, the first smooth tradeoff between the copy complexity of a quantum learning task and the number of copies that one can measure at once. When the target error $\epsilon$ is sufficiently small, we show that up to logarithmic factors, our protocol achieves the optimal tradeoff between the best single-copy and fully entangled protocols for the full range of possible $t$. Below we mention a number of interesting open directions to explore.

Larger $\epsilon$. The most obvious axis along which our results could be improved is that we do not achieve the full tradeoff when $\epsilon$ is somewhat large. Indeed, our upper and lower bounds require respectively that $t \le (\sqrt{d}/\epsilon)^c$ and $t \le (1/\epsilon)^{c'}$ for constants $c, c'$, so e.g. for $\epsilon$ which is a small constant, we can only show our $t$-entangled protocol is optimal for $t$ at most some constant. In some sense this is unavoidable with our current techniques, as we crucially exploit the fact that for any state $\rho = I_d/d + E$, its tensor power $\rho^{\otimes t}$ is well approximated by its linearization, provided $E$ is sufficiently small relative to $t$. A larger value of $\epsilon$ corresponds to a larger deviation $E$ relative to $t$ in our arguments, and at a certain point this linear approximation breaks down. It would be interesting to see whether one can leverage higher-order terms in the expansion of $\rho^{\otimes t}$ to handle larger $\epsilon$.

Rank-dependent rates. For fully entangled measurements, it is known [30,18] how to achieve sample complexity $O(rd/\epsilon^2)$ when $\rho$ has rank $r$, e.g. using a "truncated" version of Keyl's algorithm. Even when $t = 1$, the optimal sample complexity for rank-$r$ state tomography is open; conjecturally, the answer is the rate given by the best known upper bound of $O(dr^2/\epsilon^2)$ [24,17]. It is an interesting question whether the protocol in the present work can be adapted to give a smooth tradeoff between the best-known single-copy and fully entangled algorithms.
Quantum memory lower bounds. While the model of $t$-entangled measurements is a natural setting for understanding the power of quantum algorithms that have some nontrivial amount of quantum memory, it is not the most general model for such algorithms. More generally, one could consider a setting where there are $tn$ qubits, where $n \overset{\mathrm{def}}{=} \log_2 d$, on which the algorithm can perform arbitrary quantum computation interspersed with calls to a state oracle given by the channel $\sigma \mapsto \rho \otimes \operatorname{tr}_{1:n}(\sigma)$, where $\sigma$ is an arbitrary $tn$-qubit density matrix and $\operatorname{tr}_{1:n}(\cdot)$ denotes the partial trace over the first $n$ qubits. To date, the only known lower bounds for learning tasks in this general setting are in the context of Pauli shadow tomography for both states [10] and processes [8,5,11,7], as well as learning noisy parity [26], and the techniques in these works break down as one approaches $t = 2$. It is an outstanding open question whether one can obtain nontrivial quantum memory lower bounds even for $2n$ qubits of quantum memory, not just for state tomography but for any natural quantum learning task.

Technical overview

Basic setup
Throughout, let $\rho$ denote a mixed state. We recall that a mixed state is described by its density matrix, a PSD Hermitian matrix in $\mathbb{C}^{d \times d}$ with trace 1. We use $I_d$ to denote the $d \times d$ identity matrix. Given a matrix $M$, we use $\|M\|$, $\|M\|_1$, and $\|M\|_F$ to denote its operator norm, trace norm, and Frobenius norm respectively.
Measurements. We now define the standard measurement formalism, which is the way algorithms are allowed to interact with a quantum state $\rho$.
Definition 2.1 (Positive operator valued measurement (POVM), see e.g. [28]). A positive operator valued measurement $\mathcal{M}$ in $\mathbb{C}^{d \times d}$ is a collection of PSD matrices $\mathcal{M} = \{M_z\}_{z \in \mathcal{Z}}$ satisfying $\sum_z M_z = I_d$. When a state $\rho$ is measured using $\mathcal{M}$, we get a draw from a classical distribution over $\mathcal{Z}$, where we observe $z$ with probability $\operatorname{tr}(\rho M_z)$.
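Definition 2.1 can be illustrated with a minimal simulation: sample an outcome $z$ with probability $\operatorname{tr}(\rho M_z)$. The two-outcome POVM and the diagonal qubit state below are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def measure(rho, povm, rng):
    """Sample an outcome index z with probability tr(rho @ M_z)."""
    probs = np.array([np.trace(rho @ M).real for M in povm])
    assert np.all(probs >= -1e-9) and abs(probs.sum() - 1) < 1e-9
    return rng.choice(len(povm), p=np.clip(probs, 0, None) / probs.sum())

# Example: computational-basis measurement {|0><0|, |1><1|} on a qubit.
povm = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
rho = np.array([[0.75, 0.0], [0.0, 0.25]])   # a diagonal mixed state

samples = [measure(rho, povm, rng) for _ in range(20000)]
freq0 = samples.count(0) / len(samples)
print(freq0)  # empirical frequency of outcome 0, close to tr(rho M_0) = 0.75
```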
t-Entangled measurements. An algorithm that uses $t$-entangled measurements and $n$ total copies of $\rho$ operates as follows: it is given $m = n/t$ copies of $\rho^{\otimes t}$. It then iteratively measures the $i$-th copy of $\rho^{\otimes t}$ (for $i = 1, 2, \ldots, m$) using a POVM in $\mathbb{C}^{d^t \times d^t}$ which may depend on the results of previous measurements, records the outcome, and then repeats this process on the $(i+1)$-th copy of $\rho^{\otimes t}$. After performing all $m$ measurements, it must output an estimate of $\rho$ based on the (classical) sequence of outcomes it has received.
Remark 2.2. Note that limiting the batch size to exactly $t$ is not actually a limitation: we can simulate an algorithm that adaptively chooses batch sizes of at most $t$ with an algorithm that uses batch sizes of exactly $t$, at the cost of at most a factor of 2 in the total copy complexity. This is because we can simply combine batches whenever their total size is at most $t$.
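The merging argument in Remark 2.2 can be sketched as a greedy packing: a group is closed off only when the next request would overflow $t$, so any two consecutive groups together exceed $t$, which is what yields the factor-of-2 overhead. The helper and the batch sizes below are illustrative, not from the paper.

```python
def pack_batches(sizes, t):
    """Greedily merge requested batch sizes (each <= t) into groups whose
    total is <= t; each group is then served by one batch of exactly t copies."""
    groups, current = [], 0
    for s in sizes:
        assert 1 <= s <= t
        if current + s > t:      # next request would overflow: close the group
            groups.append(current)
            current = 0
        current += s
    if current:
        groups.append(current)
    return groups

t = 5
sizes = [3, 4, 2, 5, 1, 1, 4]            # n = 20 copies requested adaptively
groups = pack_batches(sizes, t)
used = t * len(groups)                   # copies consumed using exact batches of t
print(groups, used)                      # used = 30 <= 2 * 20
```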

Key Technical Tools
Now we introduce some of the key technical tools used in our proofs. We first focus on the learning algorithm. We will discuss the proof of the matching lower bound at the end of this section; it turns out that the lower bound actually reuses many of the same ideas.

Power Series Approximation
It will be instructive to first consider the case where $\rho$ is close to the maximally mixed state. Write $\rho = I_d/d + E$ for some $E \in \mathbb{C}^{d \times d}$ with $\operatorname{tr}(E) = 0$. Observe that when $E$ is small, we can approximate
$$\rho^{\otimes t} \;\approx\; \frac{I_{d^t}}{d^t} + \frac{1}{d^{t-1}}\,\mathrm{sym}(E). \qquad (3)$$
Let the LHS of the above be $X$ and the RHS be $X'$. In general $X'$ will be significantly easier to work with than $X$ when analyzing various measurements, so when the above approximation is good, we can work with $X'$ in place of $X$. Now consider a POVM, say $\{\omega_z zz^\dagger\}_{z \in \mathcal{Z}}$ in $\mathbb{C}^{d^t \times d^t}$, where the $z$ are unit vectors and the $\omega_z$ are nonnegative weights (WLOG it suffices to consider rank-1 POVMs, see e.g. [10,12,6]). Now consider constructing an estimator $f : \mathcal{Z} \to \mathbb{C}^{d \times d}$, whose expectation over the observed outcome $z$ is
$$\mathbb{E}_z[f(z)] \;=\; \int_{\mathcal{Z}} \omega_z\, (z^\dagger X' z)\, f(z)\, dz. \qquad (4)$$
The above is a linear function in the entries of $E$ since $X'$ is linear in $E$, and as long as $f(z)$ is chosen appropriately, it will be a linear function of the matrix $E$. We can write it as $Y + cE$ for some constant $c$ and some fixed matrix $Y$ independent of $E$ (under our eventual choice of $f$ and $\{\omega_z zz^\dagger\}$, $Y$ will be a multiple of $I_d$ by symmetry). Now, we have an estimator, given by $f(z)$, whose mean is $cE$ (as we can simply subtract off $Y$) and whose variance is controlled by a quantity $\theta$ given by the average of $\|f(z)\|_F^2$. Technically $\theta$ is the average of $\|f(z)\|_F^2$ when measuring the maximally mixed state, but when $E$ is small, it is the same up to constant factors as the average of $\|f(z)\|_F^2$ when measuring $X'$. Thus, the sample complexity that we need to estimate $E$ to accuracy $\epsilon$ in Frobenius norm is $O(\theta/(c^2\epsilon^2))$; in particular it scales linearly in the variance $\theta$ and inverse quadratically with the "signal" $c$ in the estimator. Now it suffices to maximize the ratio $c^2/\theta$. We can express
$$z^\dagger\, \mathrm{sym}(E)\, z \;=\; \langle E, G_1(z) \rangle,$$
where the matrix $G_1(z)$ is constructed as follows (see Section 6 for more details):
• View the unit vector $z \in \mathbb{C}^{d^t}$ as a tensor with $t$ modes, each of dimension $d$.
• Let $F_1(z), \ldots, F_t(z)$ be the $t$ different ways to unfold this tensor into a $d \times d^{t-1}$ matrix by flattening all but one of the modes.
• Set $G_1(z) = \sum_{i=1}^{t} F_i(z) F_i(z)^\dagger$.
Roughly, we can think of maximizing $c$ as the same as maximizing $\mathbb{E}_z[\langle E, f(z)\rangle]$. We can write
$$\mathbb{E}_z[\langle E, f(z)\rangle] \;=\; \int_{\mathcal{Z}} \omega_z\, (z^\dagger X' z)\, \langle E, f(z)\rangle\, dz.$$
The above is a linear function in $f(z)$, and roughly, we are trying to maximize it subject to a norm constraint on $\|f(z)\|_F^2$, so a natural choice for $f(z)$ (that turns out to be optimal) is $f(z) = G_1(z)$. In this case, we can compute that both $c$ and $\theta$ scale with $\int \omega_z \|G_1(z)\|_F^2\, dz$, and thus the quotient $c^2/\theta$ scales linearly with this integral as well. We get that
$$\frac{c^2}{\theta} \;\sim\; \frac{1}{d^2} \int_{\mathcal{Z}} \omega_z\, \|G_1(z)\|_F^2\, dz,$$
where the factor of $1/d^2$ comes from the fact that $c$ contains an extra factor of $1/d$. Thus, to summarize, it suffices to understand the average Frobenius norm of the unfolded matrix $G_1(z)$.
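The unfolding construction can be checked numerically. The sketch below takes $G_1(z)$ to be the sum of the Gram matrices of the $t$ unfoldings; this concrete form is our assumption for illustration (the paper's Section 6 has the precise definition), and under it the identity $z^\dagger\,\mathrm{sym}(E)\,z = \langle E, G_1(z)\rangle$ holds exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
d, t = 3, 4

# Random unit vector z in (C^d)^{⊗t}, viewed as a tensor with t modes.
z = rng.standard_normal(d**t) + 1j * rng.standard_normal(d**t)
z /= np.linalg.norm(z)
Z = z.reshape([d] * t)

def unfolding(Z, i):
    """F_i(z): move mode i to the front, flatten the rest (a d x d^{t-1} matrix)."""
    return np.moveaxis(Z, i, 0).reshape(d, -1)

# Assumed concrete form of G_1(z): sum of the unfoldings' Gram matrices.
G1 = sum(unfolding(Z, i) @ unfolding(Z, i).conj().T for i in range(t))

# Check the identity against a random traceless Hermitian E.
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
E = (A + A.conj().T) / 2
E -= (np.trace(E).real / d) * np.eye(d)

I = np.eye(d)
def kron_all(mats):
    out = np.eye(1)
    for M in mats:
        out = np.kron(out, M)
    return out
sym_E = sum(kron_all([E if j == i else I for j in range(t)]) for i in range(t))

lhs = (z.conj() @ sym_E @ z).real
rhs = np.trace(E @ G1).real   # <E, G_1(z)> in the Hilbert-Schmidt inner product
print(lhs, rhs)               # the two sides agree
```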

Bounding the Unfolded Matrix
The key insight in understanding $\|G_1(z)\|_F^2$ is that it is actually related to the projection of $z$ onto the different Schur subspaces. We prove in Lemma 6.1 that, roughly,
$$\|G_1(z)\|_F^2 \;\approx\; \sum_{\lambda \vdash t} (\lambda_1^2 + \cdots + \lambda_d^2)\, \|\Pi_\lambda z\|^2,$$
where the sum is over all partitions $\lambda$ of $t$ and $\Pi_\lambda$ denotes the projection onto the $\lambda$-Schur subspace of $\mathbb{C}^{d^t}$ (see Section 3 for formal definitions). The point is that the distribution over partitions $\lambda \vdash t$ induced by the projections $\Pi_\lambda$ is exactly the well-studied Schur-Weyl distribution. Since $\{\omega_z zz^\dagger\}$ must form a POVM, the average of $\|G_1(z)\|_F^2$ is roughly determined by the value of $\lambda_1^2 + \cdots + \lambda_d^2$ for a "typical" partition drawn from the Schur-Weyl distribution. We prove that this typical value is exactly $\Theta(t^{1.5})$ when $t \le d^2$ (see Claim 3.26 and Claim 3.27). For our learning algorithm, we show that Keyl's POVM [23] actually achieves $\|G_1(z)\|_F^2 \sim t^{1.5}$ for most $z$ in the POVM. For our lower bound, we formalize the intuition that the quantity $\|G_1(z)\|_F^2$ precisely controls the "information" gained by the learner after each measurement.
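For intuition about the $\Theta(t^{1.5})$ scaling, one can sample the Schur-Weyl distribution classically: for the maximally mixed state it coincides with the distribution of the RSK insertion shape of a uniformly random word of length $t$ over $[d]$. A Monte Carlo sketch (the parameters and the wide tolerance below are arbitrary illustrative choices):

```python
import bisect
import random

def rsk_shape(word):
    """Shape of the RSK insertion tableau of a word, via row insertion."""
    rows = []
    for x in word:
        for row in rows:
            # Bump the leftmost entry strictly greater than x into the next row.
            j = bisect.bisect_right(row, x)
            if j == len(row):
                row.append(x)
                break
            row[j], x = x, row[j]
        else:
            rows.append([x])
    return [len(r) for r in rows]

random.seed(0)
d, t, trials = 50, 400, 200   # t <= d^2 regime
vals = []
for _ in range(trials):
    word = [random.randrange(d) for _ in range(t)]
    lam = rsk_shape(word)
    vals.append(sum(l * l for l in lam))

avg = sum(vals) / trials
print(avg, t ** 1.5)  # avg is within a small constant factor of t^1.5
```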

Learning Algorithm
Section 2.2.1 and Section 2.2.2 already give the blueprint for the learning algorithm when $\rho$ is close to maximally mixed. In particular, we have a POVM (namely Keyl's POVM) given by $\{\omega_z zz^\dagger\}$ with $\|G_1(z)\|_F^2 \sim t^{1.5}$ for most $z$, and thus using the estimator $f(z) = G_1(z)$ in (4), we can use $d^2/(t^{1.5}\epsilon^2)$ batches and compute $\widehat{\rho}$ such that $\|\widehat{\rho} - \rho\|_F \le \epsilon$.

Rotational Invariance
We first note that, naively, the approximation (3) seems to require $t \ll 1/(d\|E\|)$. However, we can exploit the symmetries in Keyl's POVM to deduce that the approximation holds for a much larger range of $t$: we show that it holds up to $t \sim (1/\|E\|_F)^c$ for some small constant $c$. In particular, we obtain nontrivial guarantees even when $t$ is far larger than $1/(d\|E\|)$. The key point is that Keyl's POVM is copywise rotationally invariant, meaning that for any unitary $U \in \mathbb{C}^{d \times d}$, replacing each element $zz^\dagger$ of the POVM with $U^{\otimes t} zz^\dagger (U^\dagger)^{\otimes t}$ results in the same POVM. The key lemma is Lemma 4.5, which bounds, for any matrix $E$ and unit vector $v$, the moments of $v^\dagger U E U^\dagger v$, where the expectation is over Haar random unitaries $U$. In particular, due to the random rotation on the LHS, the resulting bound scales with $\|E\|_F$ instead of $d\|E\|$. We can use (7) to bound the contribution of the terms of order 2 and higher in the expansion $\rho^{\otimes t} = (I_d/d + E)^{\otimes t}$. Roughly, the point is that for constant $k$, there are at most $t^k$ terms of order $k$ and they are bounded by $\|E\|_F^k$, so as long as $t \sim \|E\|_F^{-c}$ for some small constant $c$, the error terms in the truncation in (3) decay geometrically.
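The quadratic case of this phenomenon can be checked directly: for a Haar random unit vector $w = U^\dagger v$ and traceless Hermitian $E$, the exact fourth-moment identity $\mathbb{E}[(w^\dagger E w)^2] = \operatorname{tr}(E^2)/(d(d+1))$ holds, so the typical size of $v^\dagger U E U^\dagger v$ is $\approx \|E\|_F/d$ rather than $\|E\|$. (This is a standard moment computation over the complex sphere, not a restatement of Lemma 4.5; the test matrix $E$ below is an arbitrary illustrative choice with one large eigenvalue.)

```python
import numpy as np

rng = np.random.default_rng(3)
d = 30

# Traceless Hermitian E with operator norm 1 but Frobenius norm only ~1:
# a single big eigenvalue spread against d-1 tiny ones.
E = np.diag([1.0] + [-1.0 / (d - 1)] * (d - 1))

n_samples = 20000
acc = 0.0
for _ in range(n_samples):
    w = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    w /= np.linalg.norm(w)          # Haar random unit vector (= U^† v for Haar U)
    acc += (w.conj() @ E @ w).real ** 2
empirical = acc / n_samples

predicted = np.trace(E @ E).real / (d * (d + 1))   # = ||E||_F^2 / (d(d+1))
print(empirical, predicted)
```

Note the contrast with a fixed basis vector: $e_1^\dagger E e_1 = 1 = \|E\|$, while a Haar-rotated vector sees only $\sim \|E\|_F/d \approx 0.03$.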

A Multi-stage Approach via Quantum Splitting
The algorithm we sketched so far only works when $\rho$ is close to $I_d/d$. It remains to show how to make it work for general states. We do so in two steps. First, we can bootstrap our algorithm to learn $\rho$ whenever $\|\rho\| \le 2/d$. This is because we can first obtain a rough estimate $\widehat{\rho}$ and then simulate access to the state
$$\frac{\rho + (2I_d/d - \widehat{\rho})}{2}.$$
This state is close to maximally mixed, so we can obtain a finer estimate of $\rho - \widehat{\rho}$, which allows us to refine our estimate of $\rho$. Note that the same argument works whenever $\|\rho\| \le C/d$ for some constant $C$.
Next, when $\|\rho\|$ is large, we rely on a splitting procedure that transforms a state $\rho$ into a state $\mathrm{Split}(\rho)$ such that:
• $\|\mathrm{Split}(\rho)\| \le C/d$ for some constant $C$;
• we can simulate measurement access to copies of $\mathrm{Split}(\rho)$ given measurement access to copies of $\rho$.
The splitting procedure works as follows: we first (approximately) diagonalize $\rho$. Let its eigenvalues be $\alpha_1, \ldots, \alpha_d$. For each $j$, let $b_j$ be the smallest nonnegative integer such that $\alpha_j/2^{b_j} \le C/d$. For each $j$, we split the $j$th row and column of $\rho$ into $2^{b_j}$ rows and columns and divide the entries evenly between them (see Definition 5.1). Note that we lose a factor of $2^{\max(b_1, \ldots, b_d)/2}$ in our recovery guarantee due to the splitting procedure. Thus, we incorporate a multi-scale approach where we consider projecting out eigenvalues and splitting at different scales. Due to this, our final recovery guarantee for learning general states is in trace norm instead of Frobenius norm. See Section 5 for more details.
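A classical analogue of the splitting step (splitting probability masses rather than rows and columns of a density matrix) can be sketched as follows; the constant $C$ and the spiky test distribution are arbitrary illustrative choices.

```python
import math

def split_distribution(alpha, C):
    """Classical analogue of splitting: divide mass alpha_j into 2^{b_j} equal
    parts, with b_j the smallest nonnegative integer making each part <= C/d."""
    d = len(alpha)
    out = []
    for a in alpha:
        b = math.ceil(math.log2(a * d / C)) if a > C / d else 0
        out.extend([a / 2**b] * 2**b)
    return out

d, C = 100, 2.0
alpha = [0.5, 0.3] + [0.2 / (d - 2)] * (d - 2)   # very spiky distribution
split = split_distribution(alpha, C)

print(max(split), C / d, len(split))
# every part is <= C/d, and the support grows by only an O(1) factor
```

Since each $b_j$ is minimal, $2^{b_j} \le \max(1,\, 2\alpha_j d/C)$, so the total support size is at most $d + 2d/C$: the domain is only marginally larger, mirroring the "marginally larger dimension" in the quantum procedure.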

Lower Bound
The lower bound follows the likelihood ratio framework of [6], and the key new ingredient is the upper bound on the average Frobenius norm of $G_1(z)$ in (6). Recall the high-level framework in [6]. There is some prior distribution on valid quantum states $\rho$. The unknown state $\rho_0$ is sampled from this distribution. After making a sequence of measurements and observing outcomes, say $z_1, \ldots, z_m$ where $m = n/t$, we have some posterior distribution over states $\rho$, and learning succeeds only when the posterior concentrates around $\rho_0$. The posterior measure at some alternative hypothesis $\rho$ is governed by the likelihood ratio
$$\prod_{i=1}^{m} \frac{\operatorname{tr}(M_{z_i}\, \rho^{\otimes t})}{\operatorname{tr}(M_{z_i}\, \rho_0^{\otimes t})}.$$
To prove a lower bound, it roughly suffices to show that this ratio is at least $\exp(-d^2)$ for typical alternatives (this is because the volume of the ball of alternative hypotheses is $\exp(d^2)$ times the volume of valid outputs with respect to $\rho_0$). To lower bound the ratio, it suffices to show that it decreases by a factor of at most $\exp(-\epsilon^2 t^{1.5}/d)$ in each step, on average. The key lemma that proves this is Lemma 6.3. While the full lemma is a bit more complicated and only holds in an average-case sense, for the purposes of this overview, we can think of the lemma as saying that each step's factor is at least
$$\exp\!\Big(-\frac{\|G_1(z)\|_F^2\, \epsilon^2}{d}\Big).$$
Now, in Section 2.2.2, we argued that for any POVM, $\|G_1(z)\|_F^2$ can be at most $O(t^{1.5})$ on average, so typically the above factor will be at least $\exp(-t^{1.5}\epsilon^2/d)$. Thus, the posterior ratio stays above $\exp(-d^2)$ as long as $m \le d^3/(t^{1.5}\epsilon^2)$, which is the same as $n \le d^3/(\sqrt{t}\,\epsilon^2)$, and this completes the sketch of the lower bound.
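The exponent bookkeeping behind this sketch can be recorded explicitly:

```latex
\begin{align*}
\frac{\text{posterior}(\rho)}{\text{posterior}(\rho_0)}
  \;\gtrsim\; \prod_{i=1}^{m} \exp\!\Big(-\frac{t^{1.5}\epsilon^2}{d}\Big)
  \;=\; \exp\!\Big(-\frac{m\, t^{1.5}\epsilon^2}{d}\Big),
\end{align*}
which stays above $\exp(-d^2)$ precisely when
\begin{align*}
m \;\le\; \frac{d^3}{t^{1.5}\epsilon^2}
\quad\Longleftrightarrow\quad
n \;=\; mt \;\le\; \frac{d^3}{\sqrt{t}\,\epsilon^2}.
\end{align*}
```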

Representation theory basics
We will introduce some notation and facts from representation theory for analyzing entangled quantum measurements. Our exposition closely follows [30,31]. For a more detailed explanation of the elementary representation theory results, see e.g. [34]. We use $GL_d$ to denote the general linear group in $\mathbb{C}^{d \times d}$ and $U_d$ to denote the unitary group in $\mathbb{C}^{d \times d}$. Next, we introduce some notation for partitions.

Definition 3.1. Given a positive integer $n$, a partition of $n$ into $d$ parts is a list of positive integers $\lambda = (\lambda_1, \ldots, \lambda_d)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and $\lambda_1 + \cdots + \lambda_d = n$.

We write $\lambda \vdash n$ to denote that $\lambda$ is a partition of $n$. We use $|\lambda|$ to denote the size of $\lambda$, i.e. the sum of its parts, and $\ell(\lambda)$ to denote the number of nonzero parts of $\lambda$; e.g. in the previous example $|\lambda| = n$ and $\ell(\lambda) = d$.

Definition 3.2. Given two partitions $\lambda, \lambda' \vdash n$, we say that $\lambda$ majorizes $\lambda'$ if
$$\sum_{j=1}^{i} \lambda_j \;\ge\; \sum_{j=1}^{i} \lambda'_j$$
for all $i$, where $\lambda_j, \lambda'_j$ are defined to be 0 whenever $j$ exceeds the number of parts in the partition. If $\lambda$ majorizes $\lambda'$, we write $\lambda \succeq \lambda'$.

Fact 3.3 ([32]). For any $n$, the number of distinct partitions $\lambda \vdash n$ is at most $2^{3\sqrt{n}}$.

Definition 3.4. Given a partition $\lambda \vdash n$, let $f_i(\lambda)$ be the number of parts of $\lambda$ that are equal to $i$, for each integer $i$.

Definition 3.5. Given a partition $\lambda \vdash n$, we let $\lambda^T \vdash n$ denote its transpose, i.e. $\lambda^T_1$ is the number of parts of $\lambda$ that are at least 1, $\lambda^T_2$ is the number of parts of $\lambda$ that are at least 2, and so on.
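The partition operations above (majorization from Definition 3.2 and the transpose from Definition 3.5) are easy to implement and sanity check; the example partitions below are arbitrary.

```python
def transpose(lam):
    """Transpose (conjugate) partition: lam^T_i = #{j : lam_j >= i}."""
    if not lam:
        return []
    return [sum(1 for part in lam if part >= i) for i in range(1, lam[0] + 1)]

def majorizes(lam, mu):
    """Check whether lam majorizes mu; both are partitions of the same n."""
    assert sum(lam) == sum(mu)
    k = max(len(lam), len(mu))
    pad = lambda p: list(p) + [0] * (k - len(p))
    lam, mu = pad(lam), pad(mu)
    partial_l = partial_m = 0
    for a, b in zip(lam, mu):
        partial_l += a
        partial_m += b
        if partial_l < partial_m:
            return False
    return True

print(transpose([4, 2, 1]))             # [3, 2, 1, 1]
print(majorizes([4, 2, 1], [3, 3, 1]))  # True
print(majorizes([3, 3, 1], [4, 2, 1]))  # False
```

A classical sanity check: $\lambda \succeq \mu$ holds exactly when $\mu^T \succeq \lambda^T$, which the two helpers reproduce on these examples.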

Definition 3.6 (Young tableaux). We have the following standard definitions:
• Given a partition λ ⊢ n, a Young diagram of shape λ is a left-justified set of boxes arranged in rows, with λ i boxes in the ith row from the top.
• A standard Young tableau (SYT) $T$ of shape $\lambda$ is a Young diagram of shape $\lambda$ where each box is filled with a distinct integer in $[n]$ such that the rows are strictly increasing from left to right and the columns are strictly increasing from top to bottom.
• A semistandard Young tableau (SSYT) $T$ of shape $\lambda$ is a Young diagram of shape $\lambda$ where each box is filled with some integer in $[d]$ for some $d$, such that the rows are weakly increasing from left to right and the columns are strictly increasing from top to bottom.
Now we review the correspondence between Young tableaux and representations of the symmetric and general linear groups.

Definition 3.7. We say a representation $\mu$ of $GL_d$ over a complex vector space $\mathbb{C}^m$ is a polynomial representation if for any $U \in \mathbb{C}^{d \times d}$, the entries of $\mu(U) \in \mathbb{C}^{m \times m}$ are polynomials in the entries of $U$.

Fact 3.8 ([33]). The irreducible representations of the symmetric group $S_n$ are exactly indexed by the partitions $\lambda \vdash n$ and have dimensions $\dim(\lambda)$ equal to the number of standard Young tableaux of shape $\lambda$. We denote the corresponding vector space $Sp_\lambda$.

Fact 3.9 ([16]). For each $\lambda \vdash n$, there is a (unique) irreducible polynomial representation of $GL_d$ corresponding to $\lambda$. We denote the corresponding map and vector space $(\pi_\lambda, V^d_\lambda)$. The dimension $\dim(V^d_\lambda)$ is equal to the number of semistandard Young tableaux of shape $\lambda$ with entries in $[d]$. This representation, restricted to $U_d$, is also an irreducible representation.
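Fact 3.8 identifies $\dim(\lambda)$ with the number of standard Young tableaux of shape $\lambda$; one standard way to compute this count is the classical hook length formula $\dim(\lambda) = n!/\prod_{\text{boxes}} \mathrm{hook}$ (the formula is textbook material, not from this paper). A small sketch:

```python
from math import factorial

def dim_sym(lam):
    """Number of standard Young tableaux of shape lam (hook length formula)."""
    n = sum(lam)
    # Transpose partition, to read off column (leg) lengths.
    lam_t = [sum(1 for p in lam if p >= i) for i in range(1, lam[0] + 1)]
    hooks = 1
    for i, row in enumerate(lam):
        for j in range(row):
            arm = row - j - 1          # boxes to the right in the same row
            leg = lam_t[j] - i - 1     # boxes below in the same column
            hooks *= arm + leg + 1
    return factorial(n) // hooks

print(dim_sym([2, 1]))     # 2: the two SYT of shape (2,1)
print(dim_sym([3, 2]))     # 5
print(dim_sym([1, 1, 1]))  # 1
```

As a consistency check with the decomposition of the regular representation, $\sum_{\lambda \vdash 3} \dim(\lambda)^2 = 1 + 4 + 1 = 6 = 3!$.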
Theorem 3.10 (Schur-Weyl Duality [16]). Consider the representation of $S_n \times GL_d$ on $(\mathbb{C}^d)^{\otimes n}$ where the action of a permutation $\pi \in S_n$ permutes the different copies of $\mathbb{C}^d$ and the action of $U \in GL_d$ is applied independently to each copy. This representation can be decomposed as a direct sum
$$(\mathbb{C}^d)^{\otimes n} \;\cong\; \bigoplus_{\lambda \vdash n,\; \ell(\lambda) \le d} Sp_\lambda \otimes V^d_\lambda.$$

Definition 3.11 (Schur Subspace). We call $Sp_\lambda \otimes V^d_\lambda$ the $\lambda$-Schur subspace. Given integers $n, d$ and $\lambda \vdash n$, we define $\Pi^d_\lambda : (\mathbb{C}^d)^{\otimes n} \to Sp_\lambda \otimes V^d_\lambda$ to be the projection onto the $\lambda$-Schur subspace.
Theorem 3.12 (Gelfand-Tsetlin Basis [16]). Let n, d be positive integers. For each partition λ ⊢ n with at most d parts, there is a basis v_{f^{(1)}}, . . ., v_{f^{(m)}} for V^d_λ, where the f^{(i)} are d-tuples that give the frequencies of 1, 2, . . ., d in each of the different semistandard tableaux of shape λ.
We now present a few consequences of Theorem 3.10 and Theorem 3.12 that will be used in our learning primitives. Roughly, we give a more explicit representation of the Gelfand-Tsetlin basis when embedded in the representation of (C^d)^{⊗n}.

Definition 3.13. For integers n, d and a d-tuple f_1, . . ., f_d such that f_1 + · · · + f_d = n, we define G_{f_1,...,f_d}[n → d] to be the set of functions g : [n] → [d] where the multiset {g(1), . . ., g(n)} has 1 with frequency f_1, 2 with frequency f_2, and so on.
sorting in decreasing order), then there are weights {w_g}_{g ∈ G_{f_1,...,f_d}[n→d]} such that the vector

Σ_{g ∈ G_{f_1,...,f_d}[n→d]} w_g · v_{g(1)} ⊗ · · · ⊗ v_{g(n)}

is in the Schur subspace Sp_λ ⊗ V^d_λ. Furthermore, we can choose dim(λ) orthogonal unit vectors v_{f,1}, . . ., v_{f,dim(λ)} of the above form, with the weights chosen so that

where the expectation is over Haar random unitaries U.
Proof. First, WLOG v_1, . . ., v_d are the standard basis e_1, . . ., e_d. This is because, by Theorem 3.10, we can always rotate all of the n copies simultaneously by the same unitary while staying within the same Schur subspace. Now imagine decomposing (C^d)^{⊗n} into the subspaces given by Theorem 3.10, and take one of the copies of V^d_λ. Note that since λ majorizes (f_1, . . ., f_d), there is a semistandard tableau of shape λ where the numbers 1, 2, . . ., d occur exactly f_1, . . ., f_d times respectively. Thus, we can apply Theorem 3.12 to find the vector v_f in it such that v_f^† D_α^{⊗n} v_f = α^f, where D_α = diag(α_1, . . ., α_d). Note that both sides of the above are polynomials in α_1, . . ., α_d and the identity holds for all values, so it must actually hold as a polynomial identity. Thus, v_f must be contained in the eigenspace of D_α^{⊗n} with eigenvalue α^f, which is exactly the subspace spanned by e_{g(1)} ⊗ · · · ⊗ e_{g(n)} as g ranges over all of the functions in G_{f_1,...,f_d}[n → d]. This immediately implies the first statement.
For the second statement, the fact that V^d_λ is an irreducible representation of U_d implies (by Schur's lemma) that the relevant Haar average is a scalar multiple of the projection onto the copy of V^d_λ that v_f is in. Thus, we can simply use the same construction and pick v_{f,1}, . . ., v_{f,dim(λ)}, one from each of the copies of V^d_λ in (C^d)^{⊗n}, and then we get the desired statement.
In light of Lemma 3.14, we make the following definition.

Definition 3.15. Given integers n, d and a partition λ ⊢ n, we define the vectors v^d_{λ,1}, . . ., v^d_{λ,dim(λ)} ∈ (C^d)^{⊗n} to be the vectors constructed in Lemma 3.14 where we choose v_1, . . ., v_d to be the standard basis and (f_1, . . ., f_d) = (λ_1, . . ., λ_d) (when the vectors are not unique, we pick arbitrary ones; the choice will not matter when we use this later on). We define

M^d_λ := Σ_{j=1}^{dim(λ)} v^d_{λ,j} (v^d_{λ,j})^†.

We may drop the d in the superscript and simply write M_λ when d is clear from context.
We also have the following lemma that provides a type of converse to Lemma 3.14.
Then for any partition λ ⊢ n that does not majorize (f_1, . . ., f_d) (after sorting in decreasing order), we have

Proof. As before, due to Theorem 3.10, WLOG v_1, . . ., v_d are the standard basis e_1, . . ., e_d. Take any of the copies of V^d_λ. Note that by assumption, there does not exist any semistandard tableau of shape λ where 1, 2, . . ., d occur with frequencies f_1, . . ., f_d respectively. Now consider a basis v_{f^{(1)}}, . . ., v_{f^{(m)}} for V^d_λ as given by Theorem 3.12. As in Lemma 3.14, each vector v_{f^{(i)}} must be contained in the span of vectors of the corresponding frequency type. However, as no semistandard tableau of shape λ has 1, 2, . . ., d occurring with frequencies f_1, . . ., f_d, these subspaces are all orthogonal to v_{g(1)} ⊗ · · · ⊗ v_{g(n)}, and thus each v_{f^{(i)}} is orthogonal to it as well. Since the v_{f^{(i)}} form a basis for (one copy of) V^d_λ and we can repeat the same argument for all of the other copies, v_{g(1)} ⊗ · · · ⊗ v_{g(n)} is actually orthogonal to the entire λ-Schur subspace and we are done.
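Both Lemma 3.14 and its converse hinge on whether λ majorizes the frequency vector (f_1, . . ., f_d) after sorting. A small helper making this condition concrete (our own illustration, using the standard definition of majorization via prefix sums):

```python
# lambda majorizes f (after sorting f in decreasing order) iff the totals agree and
# every prefix sum of lambda is at least the corresponding prefix sum of f.
def majorizes(lam, f):
    f = sorted(f, reverse=True)
    lam = list(lam)
    k = max(len(lam), len(f))
    lam += [0] * (k - len(lam))   # pad so prefix sums line up
    f += [0] * (k - len(f))
    if sum(lam) != sum(f):
        return False
    prefix_lam = prefix_f = 0
    for a, b in zip(lam, f):
        prefix_lam += a
        prefix_f += b
        if prefix_lam < prefix_f:
            return False
    return True
```

For instance, (2, 1) majorizes (1, 1, 1), but (1, 1, 1) does not majorize (2, 1).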
Equipped with the above constructions, we can define the following POVMs.

Definition 3.17 (Weak Schur Sampling). We use the term weak Schur sampling to refer to the POVM on C^{d^n × d^n} with elements given by Π^d_λ for λ ranging over all partitions of n into at most d parts.
Our algorithm will make use of a POVM introduced by Keyl [23]. However, the actual estimator we construct and its analysis will (necessarily) be very different from previous works.

Definition 3.18 (Keyl's POVM [23]). We define the following POVM on C^{d^n × d^n}: first perform weak Schur sampling to obtain λ ⊢ n. Then within the subspace Sp_λ ⊗ V^d_λ, measure according to the (continuous) POVM with elements

dim(V^d_λ) · U^{⊗n} M_λ (U^†)^{⊗n},

where U ranges over Haar random unitaries. Note that the outcome of the measurement consists of a partition λ ⊢ n and a unitary U ∈ C^{d×d}.
Remark 3.19. The fact that this is a valid POVM follows from Lemma 3.14.
One of the key tools for understanding the Schur-Weyl distribution is the following combinatorial characterization.
Fact 3.25 (The RSK Correspondence, Greene's Theorem [32]). Consider λ ∼ SW_n(α). Alternatively, sample a sequence of n tokens in [d], say x = (x_1, . . ., x_n), where each token is drawn independently from the distribution (α_1, . . ., α_d). For each k ≤ n, let l_k be the maximum length of the union of k disjoint weakly increasing subsequences of x, and let l′_k be the maximum length of the union of k disjoint strictly decreasing subsequences of x. Then we have that:

• The distribution of λ is the same as the distribution of (l_1, l_2 − l_1, l_3 − l_2, . . ., l_n − l_{n−1}).

• The distribution of λ^T is the same as the distribution of (l′_1, l′_2 − l′_1, l′_3 − l′_2, . . ., l′_n − l′_{n−1}).

We now prove a few basic inequalities about the Schur-Weyl distribution. Compared to [30, 31], where inequalities of a similar flavor are used, the inequalities here are stronger in the regime n ≪ d^2, which will be crucial later on.

Proof. Consider sampling a sequence x = (x_1, . . ., x_n) in [d]^n where each token is drawn independently from the distribution (α_1, . . ., α_d), as in Fact 3.25. Now we pair up each sequence x with its reverse, say x′. These sequences occur with equal probability. Next, let λ(x) be the partition corresponding to x as defined in Fact 3.25. Note that by construction, λ(x′) ≻ λ(x)^T. On the other hand, for any partition λ ⊢ n, we claim that

Once we prove the above we are done, since x and x′ occur with the same probability, and then we would get that either λ(x) or λ(x′) satisfies the desired property. To see why the above holds, if

The above then implies the claimed inequality. This completes the proof.

Claim 3.27. Let α = (α_1, . . ., α_d) be a vector of nonnegative weights summing to 1. Then for λ ∼ SW_n(α),

Proof. Define the polynomial

It is a standard fact (see [29]) that

Next, note that

Finally, recall from Fact 3.25 that λ^T_1 has the same distribution as the length of the longest strictly decreasing subsequence of a random sequence of n tokens in [d] drawn independently from the distribution (α_1, . . ., α_d). This is dominated by the longest decreasing subsequence of a random permutation, and it is known (see e.g. [34]) that this has expected value at most 2√n. Thus, combining the above with (9) implies the claim, as desired.
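The combinatorial content of Fact 3.25 can be made concrete via the classical RSK row-insertion algorithm, whose insertion-tableau shape realizes the Greene invariants; applying it to a random word drawn i.i.d. from α simulates λ ∼ SW_n(α). This is a standard construction, sketched here for illustration (not code from the paper):

```python
import bisect

def rsk_shape(word):
    """Return the partition (row lengths) of the RSK insertion tableau of `word`."""
    rows = []
    for x in word:
        for row in rows:
            # Bump the first entry strictly greater than x (rows stay weakly increasing).
            i = bisect.bisect_right(row, x)
            if i == len(row):
                row.append(x)
                break
            row[i], x = x, row[i]
        else:
            rows.append([x])
    return tuple(len(r) for r in rows)
```

For example, the word (1, 2, 1, 1, 2) has longest weakly increasing subsequence 1, 1, 1, 2 of length 4, and indeed its RSK shape is (4, 1).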

Tensor manipulations
We will also need some general notation for working with vectors, matrices and tensors.
Definition 3.28. For a vector v ∈ C^{d^t}, we define T(v) to be the d × · · · × d tensor (with t modes) obtained by reshaping v (we assume that this is done in a canonical and consistent way throughout this paper).
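As a concrete instance of Definition 3.28, numpy's row-major reshape provides one canonical and consistent convention (the choice of convention here is ours, for illustration):

```python
import numpy as np

d, t = 2, 3
v = np.arange(d ** t, dtype=float)   # a vector in C^{d^t} (real entries for simplicity)
T = v.reshape((d,) * t)              # the d x ... x d tensor with t modes

# Reshaping is a bijection: flattening recovers the original vector.
assert T.shape == (d, d, d)
assert np.array_equal(T.reshape(-1), v)
# Entry T[i1, i2, i3] is the coordinate of v indexed by the base-d string i1 i2 i3.
assert T[1, 0, 1] == v[1 * d * d + 0 * d + 1]
```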
Definition 3.29. For matrices (or tensors) A_1, . . ., A_t with all dimensions equal, we denote the symmetric sum

If we additionally have positive integers k_1, . . ., k_t, then

where S_{k_1,...,k_t} consists of all distinct permutations of the multiset with k_1 elements equal to 1, k_2 elements equal to 2, and so on.
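A small sketch of the symmetric sum for tensor (Kronecker) products of matrices; the function name and the object-identity convention for repeated factors ("identical" means the same array object) are ours, since the displayed formula was not preserved above:

```python
from itertools import permutations
import numpy as np

def sym_tensor(*mats):
    """Sum of Kronecker products over all distinct orderings of the input factors."""
    seen = set()
    total = None
    for perm in permutations(range(len(mats))):
        key = tuple(id(mats[i]) for i in perm)
        if key in seen:
            continue  # skip repeated orderings arising from identical factors
        seen.add(key)
        term = mats[perm[0]]
        for i in perm[1:]:
            term = np.kron(term, mats[i])
        total = term if total is None else total + term
    return total
```

For instance, with one copy of E and t − 1 copies of I this produces the sum E ⊗ I ⊗ · · · ⊗ I + I ⊗ E ⊗ · · · ⊗ I + · · ·, the first-order term used later in Section 4.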
Now we have the following relations.
Fact 3.32. For any vector v ∈ C^{d^t}, matrix E ∈ C^{d×d} and integer

Proof. The identity follows immediately from the definitions.
where g : [t] → [d] ranges over all functions such that exactly f_j distinct elements are mapped to j for each j ∈ [d].

Proof. Note that for any two distinct functions g, g′ with the specified frequencies, the corresponding terms are orthogonal for any index j, because g, g′ must differ on at least two inputs l ∈ [t] (since their outputs have the same frequencies), so then ⟨v_{g(l)}, v_{g′(l)}⟩ = 0 for some l ∈ [t]\{j}. Next, using the same formula as above but with g = g′,

Writing out the definition of v and combining the above two identities completes the proof of the desired equality.
As a consequence of Lemma 3.33, we have:

Corollary 3.34. Let d be an integer and λ ⊢ t be a partition. Let v^d_{λ,j} ∈ C^{d^t} be one of the vectors as defined in Definition 3.15.

Proof. This statement follows immediately from Lemma 3.14 and Lemma 3.33.

Additional facts
In this section, we present a few additional facts that will be used later in the proof. First, we give an expression for the symmetric polynomials as a linear combination of products of power sums.

Proof. We prove the statement by induction on λ, where we induct on the reverse "majorizing order". The base case is when λ = (n, 0, . . ., 0), which is the (unique) maximal partition according to the majorizing order. Now assume that we have proved the claim for all λ′ that strictly majorize λ. Note that we can write

for some coefficients z_µ that are nonnegative integers. Furthermore, the coefficient z_λ is exactly equal to count(λ). Subtracting, the remainder is a symmetric polynomial whose monomials strictly majorize λ, and we can now apply the inductive hypothesis to write count(λ) sym x^λ in the desired form. Using the inductive hypothesis, we can bound the coefficients. First, c_λ = 1 by construction. Next, for µ ≻ λ, we can bound c_µ as

(we are assuming µ′ ≻ λ). Also, z_{µ′}/count(µ′) is at most the number of partitions of the set {λ_1, . . ., λ_k} into k′ parts such that the sums of the parts of the partition are exactly µ′_1, . . ., µ′_{k′}. Thus, Σ_{ℓ(µ′)=k′} z_{µ′}/count(µ′) is at most the number of ways to partition [k] into exactly k′ disjoint subsets. This is at most the number of labeled forests on k vertices with k − k′ edges, which is at most k^{2(k−k′)} ≤ (n^2)^{ℓ(λ)−ℓ(µ′)}. Substituting this into the above sum completes the proof.
We also have the following formulas for integrals over Haar random unitaries.

Claim 3.36. Let v ∈ C^d be a uniformly random unit vector. Let v_1 be its first entry and v_2 be its second entry. Then
Proof. These follow from direct computation using formulas for Haar integrals (see [19]).
Claim 3.37. Let X, Y ∈ C^{d×d} be Hermitian matrices. Then

where the expectation is over a Haar random unitary U.
Proof. First, by Claim 3.36, for a random unit vector v,

and for Haar random orthonormal unit vectors v_1, v_2,
Now let the eigenvalues of X be x_1, . . ., x_d and the eigenvalues of Y be y_1, . . ., y_d. We have
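The displayed computations here were not preserved, but Haar-integral identities of this type are easy to sanity-check numerically. The following sketch verifies the standard special case E_U[tr(X U Y U†)] = tr(X) tr(Y)/d; this is an illustrative identity of the same flavor, not necessarily the exact statement of Claim 3.37.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_unitary(d, rng):
    # QR of a complex Gaussian matrix, with the standard phase fix, is Haar-distributed.
    z = (rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

d = 4
X = rng.standard_normal((d, d)); X = (X + X.T) / 2   # random real symmetric (Hermitian)
Y = rng.standard_normal((d, d)); Y = (Y + Y.T) / 2

est = np.mean([np.trace(X @ (U @ Y @ U.conj().T)).real
               for U in (haar_unitary(d, rng) for _ in range(30000))])
exact = np.trace(X) * np.trace(Y) / d
assert abs(est - exact) < 0.1 * (1 + abs(exact))
```

The identity follows from E_U[U Y U†] = (tr(Y)/d) · I_d, which is the first-moment analogue of the second-moment formulas the claim needs.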

Learning balanced states
Now we present our learning algorithm for states that are close to maximally mixed. The main result that we prove in this section is as follows:

Theorem 4.1. Let ρ = I_d/d + E be an unknown quantum state in C^{d×d}. Let t be a parameter such that t ≤ d^2 and ‖E‖_F ≤ (0.01/t)^4. Then for any target accuracy ε, there is an algorithm (Algorithm 1) that takes O(d^2/(t^{1.5} ε^2)) copies of ρ^{⊗t} and returns an estimate Ê ∈ C^{d×d} such that

Our algorithm will make use of Keyl's POVM (recall Definition 3.18). Let λ ⊢ t be the partition and U be the unitary obtained from the measurement.

Algorithm 1: Algorithm for Learning Balanced States
Output: Ê

We can actually use Theorem 4.1 to learn any state ρ with all eigenvalues at most 4/d by first constructing an estimate σ for the "complement" (4/(3d)) I_d − ρ/3, and then simulating measurement access to copies of (ρ + 3σ)/4 and applying Theorem 4.1 to refine the error in the estimate σ. To initially estimate σ, we will use the standard algorithm for state tomography with unentangled measurements.
In this way, we can prove the following corollary of Theorem 4.1. It is actually slightly more general, in that we can learn any state ρ that can be written as a sum σ + ∆ where ‖σ‖ ≤ 4/d and ‖∆‖_F is small.

Corollary 4.3. Let δ, ε < 1 and let ρ ∈ C^{d×d} be an unknown state. Let t ≤ min(d^2, c(1/ε)^{0.2}) be some parameter, where c is a sufficiently small absolute constant. Assume that ρ can be written as ρ = ρ′ + ∆ where ‖ρ′‖ ≤ 4/d and ‖∆‖_F ≤ √ε/t^2. Then there is an algorithm that takes O(d^2 log(1/δ)/(√t ε^2)) total copies of ρ, measures them in batches of t entangled copies, and with probability 1 − δ outputs a state ρ̂ such that ‖ρ̂ − ρ‖_F ≤ ε.
Proof. In the first step, we run the algorithm in Theorem 4.2 with O(d^2 log(1/δ)/(√t ε^2)) samples to produce an estimate ρ̃ satisfying ‖ρ̃ − ρ‖_F ≤ O(ε t^{1/4}). Project this (with respect to the Frobenius norm) onto the convex set of density matrices with operator norm at most 4/d to produce a new estimate ρ̂. Note that ‖ρ̂ − ρ′‖_F ≤ ‖ρ̃ − ρ′‖_F, and

where in the last step we used the assumption that t ≤ c(1/ε)^{0.2}. Because ‖ρ̂‖ ≤ 4/d by design, the "complement" state σ := (4/(3d)) I_d − ρ̂/3 is a valid density matrix. We would like to apply Theorem 4.1 to the state (ρ + 3σ)/4. Note that

and for E := (ρ − ρ̂)/4, we have

by the assumption that t ≤ c(1/ε)^{0.2}, provided we take c sufficiently small. Theorem 4.1 thus implies that with O(d^2/(t^{1.5} ε^2)) copies of ((ρ + 3σ)/4)^{⊗t}, measured in batches of t entangled copies, we can use Algorithm 1 to produce an estimate Ê for which

with probability at least 1 − δ, where in the last step we used that ‖E‖_F^2 ≤ O(ε/t^4). By Lemma 4.4, we can simulate t-entangled measurement access to (ρ + 3σ)/4 using just t-entangled measurement access to ρ.
Our estimate for ρ is given by taking ρ̂ + 4Ê and projecting onto the convex set of density matrices. As this projection can only decrease the Frobenius distance to ρ, the preceding arguments imply that with constant probability,

using O(d^2/(√t ε^2)) copies. To achieve failure probability 1 − δ, we can simply repeat this process O(log(1/δ)) times and take the geometric median of the outputs. Standard arguments then imply that this is O(ε)-close to ρ with probability 1 − δ. Rescaling ε by an appropriate constant factor completes the proof.

Lemma 4.4. Let 0 ≤ λ ≤ 1. Given t copies of an unknown state ρ, and given a description of a density matrix σ, it is possible to simulate any measurement of (λρ + (1 − λ)σ)^{⊗t} using a measurement of ρ^{⊗t}.
Proof. Let {M_z}_{z∈Z} be an arbitrary POVM. Then for any z ∈ Z, observe that

where ρ ⊗_S σ denotes the product state whose i-th component is a copy of ρ for every i ∈ S, and a copy of σ for every other component. To simulate measuring (λρ + (1 − λ)σ)^{⊗t} with {M_z}_{z∈Z}, we can simply sample S by including each i ∈ [t] in S independently with probability λ, prepare the state ρ ⊗_S σ, and measure with {M_z}.
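The subset-averaging identity behind this proof can be checked directly for small t. The sketch below (with arbitrary small example states of our choosing) verifies it numerically for t = 2:

```python
# (lam*rho + (1-lam)*sigma)^{(x)t} = sum over S of lam^{|S|} (1-lam)^{t-|S|} * (rho on S, sigma elsewhere)
from itertools import product
import numpy as np

lam, t = 0.3, 2
rho = np.array([[0.7, 0.1], [0.1, 0.3]])     # a valid 2x2 density matrix
sigma = np.array([[0.5, 0.0], [0.0, 0.5]])   # maximally mixed state

mix = lam * rho + (1 - lam) * sigma
lhs = np.kron(mix, mix)                      # (lam*rho + (1-lam)*sigma)^{(x)2}

rhs = np.zeros_like(lhs)
for picks in product([0, 1], repeat=t):      # picks[i] = 1 means position i gets rho
    weight = np.prod([lam if p else 1 - lam for p in picks])
    factors = [rho if p else sigma for p in picks]
    term = factors[0]
    for f in factors[1:]:
        term = np.kron(term, f)
    rhs += weight * term

assert np.allclose(lhs, rhs)
```

Sampling the subset S with these binomial weights and measuring ρ ⊗_S σ therefore reproduces the outcome distribution of measuring the mixture's tensor power exactly.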
The remainder of this section will be devoted to proving Theorem 4.1.

Approximation for rotationally invariant POVMs
Recall that our goal in this section is to learn ρ when ρ = I_d/d + E for some sufficiently small E. One important insight in the analysis of Algorithm 1 is that because Keyl's POVM is invariant under rotating all of the copies simultaneously, we can replace X = ρ^{⊗t} = (I_d/d + E)^{⊗t} with its first-order approximation X′ = (I_d/d)^{⊗t} + sym(E ⊗ (I_d/d)^{⊗(t−1)}) at the cost of some small error. We can then analyze the POVM applied to X′ instead, which is significantly simpler (although X′ may not technically be a state). First, we have the following lemma for bounding quantities that involve averaging over simultaneous Haar random rotations of all copies.

Lemma 4.5. Let v ∈ C^{d^t} be any vector with ‖v‖ = 1. Let E ∈ C^{d×d} be a Hermitian matrix with tr(E) = 0. Then

where U is a Haar random unitary.
Proof. Let the eigenvalues of E be α_1, . . ., α_d. Then the quantity in question is a symmetric, degree-t polynomial in the α_i, since U is Haar random. Now consider any tuple of nonnegative integers λ = (λ_1, . . ., λ_d) with

Next, note that S_1 = 0 and for all k ≥ 2,

where we used the coefficient bound in Lemma 3.35 and Fact 3.3. Next, recall that by Schur-Weyl duality, we can write

All of the Schur subspaces V^t_λ have dimension at least (d/t)^t (to see this, note that if t < d then there are at least (d/t)^t distinct SSYT of any given shape, and otherwise the claim is vacuously true). Thus, since ‖v‖ = 1, we can write

where the coefficients z_λ satisfy Σ_λ |z_λ| ≤ (t/d)^t. Finally, note that the Schur polynomial s_λ(α_1, . . ., α_d) has positive coefficients, and these are all dominated, monomial by monomial, by the corresponding coefficients, which are all at most t^t. Thus, overall we conclude by (13) that

We now define the family of POVMs on C^{d^t × d^t} that are invariant under rotating all of the d-dimensional copies simultaneously by the same unitary.

Definition 4.6. We say a POVM {M_z}_{z∈Z} on C^{d^t × d^t} is copy-wise rotationally invariant if it is equivalent to

where U ∈ C^{d×d} is a random unitary drawn from the Haar measure.
Next, we bound a χ²-like distance between the outcome distributions from measuring X and X′ with a copy-wise rotationally invariant POVM. Again, we emphasize that the low-degree truncation X′ is not necessarily an actual density matrix; nevertheless, the expression on the left-hand side of Eq. (14) below is a well-defined quantity.

Lemma 4.7. Let {M_z}_{z∈Z} be a POVM on C^{d^t × d^t} that is copy-wise rotationally invariant. Let E ∈ C^{d×d} be a matrix with tr(E) = 0 and ‖E‖_F ≤ 0.01

Proof. We write

Now we will first bound
where the expectation is over Haar random unitaries U ∈ C^{d×d} (the last step is valid because we assumed that the POVM is copy-wise rotationally invariant). Now we can use Lemma 4.5 to conclude that the above is at most

Thus, by Cauchy-Schwarz we have

where we use the condition that ‖E‖_F ≤ (0.01/t)

for all z ∈ Z and unitary U.
We can apply Lemma 4.7 to bound the difference in the error of a rotationally compatible estimator between measuring X′ and X with a copy-wise rotationally invariant POVM. Note that in Algorithm 1, the POVM is rotationally invariant and our estimator is rotationally compatible with it.
Lemma 4.9. Let {M_z}_{z∈Z} be a POVM on C^{d^t × d^t} that is copy-wise rotationally invariant. Let f : {M_z}_{z∈Z} → C^{d×d} be a rotationally compatible estimator such that tr(f

Proof. We will upper bound the left-hand side by bounding

Note that the integral on the left-hand side has trace 0, so it suffices to consider A with trace 0. Consider a matrix A ∈ C^{d×d} with tr(A) = 0.
To bound the first term above, note that by Claim 3.37, for any traceless Hermitian matrix Y ∈ C^{d×d},

where the expectation is over the Haar measure. Thus, since the POVM {M_z}_{z∈Z} is copy-wise rotationally invariant, we have

Next, by Lemma 4.7,

and thus, taking the maximum over all A with ‖A‖_F ≤ 1, we get

Proof of Theorem 4.1
Recall that the high-level idea for proving Theorem 4.1 is to replace measurements of X = ρ^{⊗t} with measurements of X′ and then analyze the latter. The next result allows us to compute the mean of the estimator in Algorithm 1 if we were able to measure X′.
Corollary 4.10. Let {M_{λ,U}}_{λ,U} be Keyl's POVM, where λ ranges over partitions of t and U ranges over unitaries in

Proof. Note that the actual POVM elements of Keyl's POVM are dim(V^t_λ) U^{⊗t} M_λ (U^†)^{⊗t}, where λ ranges over all partitions and U ranges over Haar random unitaries. Now we apply Fact 3.32 and Corollary 3.34 to the vectors U^{⊗t} v_{λ,j} as j ranges over all of the components of M_λ (recall Definition 3.15). Then

So by taking X = diag(λ_1/t, . . ., λ_d/t) and Y = E in Claim 3.37, for any fixed λ we get

Now summing over all λ, we conclude

Now we can complete the proof of Theorem 4.1.
Proof of Theorem 4.1. Note that the POVM in Algorithm 1 is clearly copy-wise rotationally invariant and the estimator is rotationally compatible with it. Let us use the shorthand {M_z}_{z∈Z} to denote this POVM, and for M_z corresponding to unitary U and partition λ, we let f(M_z) = U diag(λ_1/t, . . ., λ_d/t) U^†. We have

where the expectation is over the randomness of the quantum measurement in Algorithm 1. We can make the estimator D_j have trace 0 by simply subtracting out I_d/d and adding it back at the end. Thus, by Lemma 4.9 and Corollary 4.10, recalling the definition of θ in Line 7 of Algorithm 1, we have

Thus, if Ê is the output of Algorithm 1, then

Next, we compute the variance of the estimator Ê. We have

Now by Claim 3.27, we can upper bound

where in the last step we used the assumption that ρ = I_d/d + E for ‖E‖_F ≤ (0.01/t)^4 and that t ≤ 0.01d^2. While Theorem 4.1 is technically stated for t ≤ d^2, for t in the range 0.01d^2 ≤ t ≤ d^2, we can just use 0.01d^2-entangled measurements instead, and this loses at most a constant factor in the total copy complexity. Also, by Claim 3.26, we have θ ≥ t^{1.5}/4. Thus, putting everything together, we conclude

Reducing to balanced states via quantum splitting
Now we demonstrate how to generalize the results of Section 4 to learn arbitrary states. The main idea is to take an arbitrary state ρ and construct a state Split(ρ) that preserves the information in ρ and to which we can simulate measurement access, but which also has bounded operator norm.

Construction of the quantum splitting
We formalize the splitting procedure below. Given b_1, . . ., b_d ∈ Z_{≥0} and a matrix M ∈ C^{d×d}, we define Split_{b_1,...,b_d}(M) to be the matrix whose rows and columns are indexed by pairs (j, s) where j ∈ [d] and s ∈ {0, 1}^{b_j}, and these are sorted first by j and then lexicographically according to s. The entry indexed by row (j_1, s_1) and column (j_2, s_2) is defined in terms of M_{j_1 j_2}/2^{max(b_{j_1}, b_{j_2})}. Conversely, given a matrix N of the above dimensions, we define Rec_{b_1,...,b_d}(N) to be a d × d matrix X obtained as follows. First index the rows of N by pairs (j, s) where j ∈ [d] and s ∈ {0, 1}^{b_j} in sorted order (first by j and lexicographically by s). Now the entry X_{j_1 j_2} is equal to the sum of the entries N_{(j_1, s[:b_{j_1}]),(j_2, s)}, where s[:b_{j_1}] denotes truncating s to its first b_{j_1} bits.
The following basic facts about the splitting and its inverse are easily verified through direct computation.
Claim 5.3. Let b_1, . . ., b_d ∈ Z_{≥0}. We have the following statements (for any matrices M, N of appropriate dimensions):

The first statement is immediate from the definition, since Rec sums up exactly the entries that are equal to M_{j_1 j_2}/2^{max(b_{j_1}, b_{j_2})} in Split. The second statement holds because Split simply splits up each of the entries M_{j_1 j_2} evenly into multiple entries, which can only decrease the Frobenius norm. The last statement holds because each element of Rec_{b_1,...,b_d}(N) is equal to a sum of at most 2^{max(b_1,...,b_d)} elements of N.

Now, the key fact about the splitting that makes it a useful abstraction in our learning algorithm is that we can actually simulate measurements on t-entangled copies of Split_{b_1,...,b_d}(ρ) with measurements on t-entangled copies of ρ. Indeed, Split_{b_1,...,b_d}(ρ)^{⊗t} can be constructed by embedding ρ^{⊗t} in various different principal submatrices of a k^t × k^t matrix and averaging them. Thus, when we measure Split_{b_1,...,b_d}(ρ)^{⊗t} with some POVM in C^{k^t × k^t}, this is equivalent to averaging the different embeddings of ρ^{⊗t}, and so we can simulate this measurement by measuring ρ^{⊗t} with a single POVM.
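A sketch of Split and Rec as described above. The precise indexing and the prefix rule for which entries of Split are nonzero are our reading of the (partly garbled) definitions, chosen to be consistent with all three statements of Claim 5.3; the paper's exact conventions may differ in minor details.

```python
from itertools import product
import numpy as np

def indices(b):
    # All pairs (j, s) with j in [d] and s in {0,1}^{b_j}, sorted by j then lexicographically.
    return [(j, s) for j in range(len(b)) for s in product((0, 1), repeat=b[j])]

def split(M, b):
    idx = indices(b)
    N = np.zeros((len(idx), len(idx)), dtype=M.dtype)
    for a, (j1, s1) in enumerate(idx):
        for c, (j2, s2) in enumerate(idx):
            k = min(len(s1), len(s2))
            if s1[:k] == s2[:k]:  # nonzero only when the shorter string is a prefix of the longer
                N[a, c] = M[j1, j2] / 2 ** max(b[j1], b[j2])
    return N

def rec(N, b):
    pos = {pair: a for a, pair in enumerate(indices(b))}
    d = len(b)
    X = np.zeros((d, d), dtype=N.dtype)
    for j1 in range(d):
        for j2 in range(d):
            m = max(b[j1], b[j2])
            for s in product((0, 1), repeat=m):
                X[j1, j2] += N[pos[(j1, s[:b[j1]])], pos[(j2, s[:b[j2]])]]
    return X
```

With this convention, each entry M_{j1 j2} is split evenly into 2^{max(b_{j1}, b_{j2})} equal pieces, so Rec inverts Split and the Frobenius norm can only decrease.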

Full algorithm
In this section, we present our full learning algorithm. Note that the quantum splitting procedure requires knowledge of ρ, or at least an estimate of ρ, to be useful. We can obtain such a rough estimate for ρ via tomography with unentangled measurements. First, we show how to learn when we are given this estimate as a black box. We will then put everything together to prove our main learning result, Theorem 5.7.

Lemma 5.5. Let d, t, ε, δ, C be parameters and let ρ ∈ C^{d×d} be an unknown quantum state. Assume that t ≤ min(d^2, (1/ε)^{0.2}). Let ρ′ be a quantum state whose description we know such that ‖ρ′‖ ≤ C/d and ‖ρ′ − ρ‖_F ≤ √ε/t

by construction, so we are done.
Sometimes the states that we work with will have large eigenvalues, and we do not want to apply Lemma 5.5 directly, as the √C factor loss in the accuracy could be very large. Instead, we will project out the large eigenvalues and learn in the orthogonal complement. Thus, we have the following subroutine.
To do this, for each copy of ρ, we first measure with the POVM (P, I_d − P) and keep only those copies whose outcome is P. This lets us simulate measurement access to the state ρ_0. Note that if we have t entangled copies of ρ, then with probability at least 1/2, when we measure them with (P, I_d − P), at least t′ of them will have outcome P, in which case we can make an entangled measurement on t′ copies of ρ_0. In other words, if we have batches of the form ρ^{⊗t}, then on at least half of the batches we will be able to simulate measurements of ρ_0^{⊗t′}. By Lemma 5.5, with probability 1 − 0.1δ, we obtain a state ρ̃ such that

Clearly, we can ensure that the state ρ̃ lives entirely in the subspace given by P. Now we simply output ρ̂ = β ρ̃. We have that

so we are done.

Now we present our full learning algorithm. At a high level, we first replace ρ with σ = (ρ + I_d/d)/2 and learn a rough estimate σ_0 via unentangled measurements. Then we restrict to subspaces corresponding to eigenvalues of σ_0 at various scales and apply Corollary 5.6 to refine our estimate on each of these subspaces. Finally, we aggregate our estimates to obtain a refined estimate σ̂, from which we can recover an estimate of ρ. Note that previously ε was used to measure the accuracy in Frobenius norm, but in Algorithm 2 it will be used for accuracy in trace norm. Our algorithm only guarantees recovery to trace norm ε. The algorithm takes

copies of ρ, measures them in batches of t entangled copies, and with probability 1 − δ outputs a state ρ̂ such that ‖ρ̂ − ρ‖_1 ≤ ε.
Proof. By Theorem 4.2, with probability 1 − 0.1δ, we have

Now we verify that the conditions of Corollary 5.6 hold whenever we apply it in Algorithm 2. The condition on t clearly holds. Next, note that all eigenvalues of σ are at least 1/(2d). For all j ≥ 1, the dimension of the orthogonal complement of P_j is at most d/2, and thus tr(P_j σ P_j^†) ≥ 1/4. Also, by definition, we have

so we can ensure that with probability 1 − 0.1δ/t,

Thus, since P_{j+1} is a projector onto a subspace of P_j, we also have that

Next, the dimension of the orthogonal complement of P_{j+1} is at most 2^{j+1} d/√t, so

Also, similarly, for P_0(σ_0 − σ)

Putting these together, we get that ‖σ̂ − σ‖_1 ≤ O(ε log t). This immediately implies

Finally, the truncation and trace normalization used to obtain ρ̂ increase the trace norm distance by at most a constant factor, so we conclude ‖ρ̂ − ρ‖_1 ≤ O(ε log t). This completes the proof (note that we can simply redefine ε appropriately and absorb the logarithmic factors in the number of copies into the O(·)).

Lower bound machinery
Recall Definition 3.31. The key ingredients in our lower bound involve understanding

Proof. Let A ∈ C^{d×d} be an arbitrary Hermitian matrix with ‖A‖_F = 1. Let its eigenvectors be v_1, . . ., v_d, and write v in the basis given by v_1, . . ., v_d.

where the first step holds because A is diagonal in the basis v_1, . . ., v_d and all of the cross terms that appear when we expand G_1(v) are off-diagonal. Now the above is at most

since by assumption, v

For each s, let f_s denote the partition corresponding to (f_s(1), . . ., f_s(d)) (in sorted order). We have that

where the last inequality holds because, by Lemma 3.16, we know that for any s with f_s(1)

However, we have that

Putting (15), (16), (17) together, and taking the maximum over all choices of A, we conclude the claim, as desired.

Now we will use Lemma 6.1 to bound the "likelihood ratio" (x^† ρ^{⊗t} x)/(x^† ρ_0^{⊗t} x) for different quantum states ρ and ρ_0 and vectors x ∈ C^{d^t}. We explain the lower bound framework in detail, and why this quantity is meaningful, in Section 7.

Lemma 6.2. Let 0 < ε < 1 be some parameter. Let ρ_0 = (I_d + Z)/d ∈ C^{d×d} be a quantum state with ‖Z‖ ≤ ε. Let µ be a distribution on matrices in C^{d×d} that is rotationally symmetric, i.e. invariant under rotation by any unitary U. Also assume that any ∆ in the support of µ has tr(∆) = 0 and ‖∆‖ ≤ ε/d. Let t be an integer with t ≤ 0.01/ε^{0.1}. Then for any vector x ∈ C^{d^t} with ‖x‖ = 1,

Proof. Using the condition on t, we have that all eigenvalues of ρ_0^{⊗t} and (ρ_0 + ∆)^{⊗t} are between 0.98/d^t and 1.02/d^t, so

For the first term, we can write

Let

Since µ is supported on traceless matrices and is rotationally invariant, E_{∆∼µ}[∆] = 0.
Now by Lemma 4.5,

Thus, the first sum in the RHS of (19) is 0, and aside from T, the remaining terms all contain a product of at least two copies of ∆ and one copy of Z, or at least three copies of ∆. We can upper bound the norm of the expectation of these terms using (20). Combining (19), (20) and using the condition t ≤ 0.01/ε^{0.1}, we deduce

Next, we lower bound the quadratic term. Further expanding (19), we can write

and we label the three sums above as S_1(∆), S_2(∆), S_3(∆). Now let us consider expanding

The key point is that in the expansion, aside from S_1(∆) ⊗ S_1(∆), all other terms have a product of at least two copies of ∆ and one copy of Z, or at least three copies of ∆, so we can apply (20) to upper bound the norms of all of these terms by ε^3/d^{2t+1}. Formally, we get

The above implies that

Finally, observe that by Fact 3.32,

Combining this with (21), (22), we get

where we also used that, by the assumption on t, ρ_0^{⊗t} has all eigenvalues at least 0.9/d^t. This completes the proof.
We can aggregate Lemma 6.2 over sequences of vectors x_1, . . ., x_m and apply Jensen's inequality to get the following bound on the product of a sequence of likelihood ratios.

Lemma 6.3. Let 0 < ε < 1 be some parameter. Let ρ_0 = (I_d + Z)/d ∈ C^{d×d} be a quantum state with ‖Z‖ ≤ ε. Let µ be a distribution on matrices in C^{d×d} that is rotationally symmetric, i.e. invariant under rotation by any unitary U. Also assume that any ∆ in the support of µ has tr(∆) = 0 and ‖∆‖ ≤ ε/d. Let t be an integer with t ≤ 0.01/ε^{0.1}. Let x_1, . . ., x_m ∈ C^{d^t} be vectors such that

Proof. By Jensen's inequality and Lemma 6.2,

Now by Lemma 4.5, ‖E_{∆∼µ}[∆ ⊗ ∆]‖ ≤ 10^8 ε^2/d^3, so combining this with the assumption about x_1, . . ., x_m and that tr(∆) = 0, we get

Next, by Claim 3.37 and Lemma 6.1,

Plugging the above two inequalities into (23), we get

Proof of lower bound

Lower bound framework
The remainder of the proof of the lower bound will closely follow the framework in [6], with Lemma 6.3 as the main new ingredient. Recall that the learner measures m = n/t copies of ρ^{⊗t} in sequence with POVMs in C^{d^t × d^t}, possibly chosen adaptively. It is a standard fact (see e.g. [10, Lemma 4.8]) that without loss of generality we may assume that all POVMs used are rank-1, and we will work with this assumption for the rest of the lower bound. We will sometimes represent a sequence of m measurement outcomes by x = (x_1, . . ., x_m), denoting that in the i-th step, the observed outcome corresponds to a POVM element which is a scalar multiple of x_i x_i^†. Next, we review a standard formalism for representing an adaptive algorithm as a tree. There is a slight difference between the definition here and the definition in [10], since we allow the algorithm to make entangled measurements on t copies of ρ simultaneously at each step.

Definition 7.1 (Tree representation, see e.g. [10]). Fix an unknown d-dimensional mixed state ρ. An algorithm for state tomography that only uses m batches of t-entangled copies of ρ can be expressed as a pair (T, A), where T is a rooted tree of depth m satisfying the following properties:

• Each node is labeled by a string of vectors x = (x_1, . . ., x_k), where each x_i corresponds to the measurement outcome observed in the i-th step.
• Each node x is associated with a probability p ρ ⊗t (x) corresponding to the probability of observing x over the course of the algorithm.The probability for the root is 1.
• At each non-leaf node x, we measure ρ^{⊗t} using a rank-1 POVM {ω_x d^t · x x^†}_x to obtain a classical outcome (which is a unit vector) x ∈ C^{d^t}. The children of x consist of all strings x′ = (x_1, . . ., x_k, x) for which x is a possible POVM outcome.
• If x′ = (x_1, . . ., x_k, x) is a child of x, then

• Every root-to-leaf path has length m.

Note that T and ρ induce a distribution over the leaves of T.
$\mathcal{A}$ is a randomized algorithm that takes as input any leaf $x$ of $\mathcal{T}$ and outputs a state $\mathcal{A}(x)$. The output of $(\mathcal{T}, \mathcal{A})$ upon measuring $m$ $t$-entangled batches of copies of a state $\rho$ is the random variable $\mathcal{A}(x)$, where $x$ is sampled from the aforementioned distribution over the leaves of $\mathcal{T}$.
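To make the tree formalism concrete, the following NumPy sketch (a hypothetical example with illustrative dimensions, not the paper's construction) simulates a single step of the tree: measuring $\rho^{\otimes t}$ with a rank-1 POVM $\{\omega_x d^t \cdot xx^\dagger\}_x$, where the vectors $x$ are taken to be a random orthonormal basis of $\mathbb{C}^{d^t}$ with uniform weights $\omega_x = 1/d^t$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 2, 2  # small illustrative dimensions (hypothetical example values)

# An arbitrary density matrix rho on C^d.
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T
rho /= np.trace(rho).real

# Form rho^{tensor t}.
rho_t = rho
for _ in range(t - 1):
    rho_t = np.kron(rho_t, rho)

# A rank-1 POVM {omega_x d^t x x^dagger}_x: here the x's are the columns of a
# random orthonormal basis of C^{d^t}, with uniform weights omega_x = 1/d^t.
B = rng.normal(size=(d**t, d**t)) + 1j * rng.normal(size=(d**t, d**t))
Q, _ = np.linalg.qr(B)

# Outcome probabilities: omega_x d^t * x^dagger rho^{ot t} x = x^dagger rho^{ot t} x.
probs = np.array([(Q[:, i].conj() @ rho_t @ Q[:, i]).real for i in range(d**t)])

# One step of the tree: sample an outcome x; the node (x_1,...,x_k) then
# transitions to the child (x_1,...,x_k, x).
outcome = rng.choice(d**t, p=probs)
```

Since the POVM elements sum to the identity, the outcome probabilities sum to $\operatorname{tr}(\rho^{\otimes t}) = 1$, matching the normalization of $p_{\rho^{\otimes t}}$ in Definition 7.1.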
We also recall the definition of the Gaussian Unitary Ensemble (GUE) and define a trace-centered variant, $\mathrm{GUE}^*(d)$, which will be the basis of our hard distribution. We recall the following standard fact about the extremal eigenvalues of a GUE matrix.
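For reference, the standard edge-concentration fact for the GUE takes the following form (stated here with unspecified constants, which is all the argument needs; the exact normalization convention is an assumption):

```latex
% For G ~ GUE(d) normalized so that the spectrum concentrates in [-2, 2],
% there is an absolute constant c > 0 such that for all 0 < \delta \le 1,
\Pr\big[\|G\| \ge 2 + \delta\big] \le 2\, e^{-c\, d\, \delta^{3/2}}.
% In particular, \Pr[\|G\| \ge 3] \le e^{-\Omega(d)}.
```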

Construction of hard distribution
We construct the following hard distribution $\mu$ over quantum states. Let $U \subseteq \mathbb{C}^{d \times d}$ be the subspace of Hermitian matrices with trace 1 and $U_0 \subseteq \mathbb{C}^{d \times d}$ be the subspace of Hermitian matrices with trace 0. These spaces inherit the inner product of $\mathbb{C}^{d \times d}$, which defines Lebesgue measures $\mathrm{Leb}_U$ and $\mathrm{Leb}_{U_0}$ on them. Let $\varepsilon$ be the target accuracy. We may assume $\varepsilon$ is sufficiently small, and let $\sigma = C\varepsilon$ for some constant $C > 1$ to be chosen later. A sample $\rho \sim \mu$ is generated by
$$\rho = \frac{I_d + \sigma G}{d},$$
where $G$ is a sample from $\mathrm{GUE}^*(d)$ conditioned on $\|G\| \le 4$. Note that such matrices are clearly valid quantum states. Concretely, $\mu$ has density (with respect to $\mathrm{Leb}_U$)
$$\cdots$$
where $Z$ is a normalizing constant. Further define a set of "good" states $S_{\mathrm{good}}$, corresponding to the event $\|G\| \le 3$. By Lemma 7.3, $\mu(S_{\mathrm{good}}) \ge 1 - e^{-\Omega(d)}$.
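As a concrete illustration, here is a hypothetical NumPy sketch of sampling $\rho \sim \mu$. It assumes the form $\rho = (I_d + \sigma G)/d$ with $G$ a trace-centered GUE sample conditioned on $\|G\| \le 4$, and a GUE normalization in which the spectrum concentrates in $[-2, 2]$ (so the conditioning rarely triggers a rejection); both conventions are assumptions consistent with the bound $\|\rho_0 - I_d/d\| \le 4C\varepsilon/d$ used later.

```python
import numpy as np

rng = np.random.default_rng(1)

def gue_star(d):
    """A sample from the trace-centered ensemble GUE*(d). The normalization
    (spectrum concentrating in [-2, 2]) is an assumption about the paper's
    convention."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    H = (A + A.conj().T) / (2 * np.sqrt(d))
    return H - (np.trace(H).real / d) * np.eye(d)

def sample_hard_state(d, sigma):
    """rho = (I_d + sigma*G)/d with G ~ GUE*(d) conditioned on ||G|| <= 4,
    implemented by rejection (the condition fails only with tiny probability)."""
    while True:
        G = gue_star(d)
        if np.linalg.norm(G, 2) <= 4:
            return (np.eye(d) + sigma * G) / d

d, sigma = 16, 0.05  # sigma = C*epsilon, for illustrative parameter values
rho = sample_hard_state(d, sigma)
```

Since $G$ is trace-centered, $\operatorname{tr}(\rho) = 1$ exactly, and $\|G\| \le 4$ with $\sigma < 1/4$ guarantees positive definiteness, so the samples are valid quantum states.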
In the proof below, we will show that every $\rho_0 \in S_{\mathrm{good}}$ is hard to learn. The important property of $S_{\mathrm{good}}$ is that it is far from the boundary of $\operatorname{supp}(\mu) = S_{\mathrm{supp}}$; this ensures that we can choose a suitable sub-sampling of $\mu$ in a neighborhood of $\rho_0$ which is rotationally symmetric around $\rho_0$.
Finally, we record the following straightforward fact.

We first prove that the sequence of observations $x$ is indeed well-balanced with high probability.
Lemma 7.10. Let $\varepsilon, d$ be parameters such that $\varepsilon \le \varepsilon_0$ for some sufficiently small absolute constant $\varepsilon_0$. Let $t, m$ be parameters such that $t \le \frac{1}{10C\varepsilon}$, where $C$ is some constant. Let $\rho_0 \in \mathbb{C}^{d \times d}$ be a state with $\|\rho_0 - I_d/d\| \le \frac{4C\varepsilon}{d}$. Measure the state $\rho_0^{\otimes t}$ sequentially $m$ times with arbitrary rank-1 POVMs (possibly chosen adaptively), and let the outcomes be $x = (x_1, \dots, x_m)$, where each $x_j$ is a unit vector in $\mathbb{C}^{d^t}$. Then with probability $1 - (\varepsilon/d)^{10}$, the collection $x$ is well-balanced.
Proof. For the first claim, note that for any POVM $M = \{\omega_x d^t\, x x^\dagger\}_x$, if we measure $\rho_0^{\otimes t}$ with this POVM, then
$$\cdots$$
where the expectation is over the outcome from $M$. Thus,
$$\cdots$$
where the last step uses Claim 3.27. Also note that the quantity inside the expectation is always at most $2t^2$, since $\rho_0^{\otimes t}$ has all eigenvalues at least $\frac{1}{2d^t}$ by the assumption on $t$. Thus, since the above holds for any POVM, we can apply Azuma's inequality to get that with probability at least $1 - (\varepsilon/d)^{20}$,
$$\cdots$$
and thus the first condition of well-balancedness is satisfied.

Now we analyze the second condition. By (28), we also have
$$\cdots$$
Also, the quantity inside the expectation is PSD and always has trace at most $2t^2 d^2$. This means its variance has operator norm at most $2t^4 d^2$. Since the above holds for any POVM, we can apply the matrix Azuma inequality to get that with probability at least $1 - (\varepsilon/d)^{20}$,
$$\cdots$$
which matches the second condition of well-balancedness, as desired.

Now we can bound the average likelihood ratio whenever the sequence of observations is well-balanced.
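For reference, the scalar and matrix Azuma inequalities invoked in the proof take the following standard forms (constants as in Tropp's matrix-martingale tail bounds; the argument only needs them up to constants):

```latex
% Scalar Azuma: for a martingale difference sequence Y_1, ..., Y_m with |Y_j| \le c_j,
\Pr\Big[\Big|\sum_{j=1}^m Y_j\Big| \ge \lambda\Big]
  \le 2\exp\Big(-\frac{\lambda^2}{2\sum_{j} c_j^2}\Big).

% Matrix Azuma: for D x D Hermitian martingale differences X_1, ..., X_m
% with X_j^2 \preceq A_j^2 deterministically, and \sigma^2 = \|\sum_j A_j^2\|,
\Pr\Big[\lambda_{\max}\Big(\sum_{j=1}^m X_j\Big) \ge \lambda\Big]
  \le D\,\exp\Big(-\frac{\lambda^2}{8\sigma^2}\Big).
```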
Proof. Note that the distribution $\gamma(\rho_0)$ is rotationally symmetric around $\rho_0$. Thus, we can apply Lemma 6.3. Whenever
$$\cdots$$
for some choice of internal randomness for $\mathcal{A}$ and some $x$, note that the corresponding inner expectation in (29) is zero. We can thus upper bound the double expectation in (29) by
$$\cdots$$
where in the last step we used the fact that under $E$ we have $(\rho^{\mathcal{A}}_x, x) \in S$, so by Theorem 7.6 the posterior measure $\nu_x$ places $o(1)$ mass on the trace-norm $\varepsilon$-ball around $\rho^{\mathcal{A}}_x$.

Input: $m$ copies of $\rho^{\otimes t}$ for some unknown quantum state $\rho \in \mathbb{C}^{d \times d}$
for $j \in [m]$ do
    Measure $\rho^{\otimes t}$ according to Keyl's POVM

Definition 5.1. Let $b_1, \dots, b_d \in \mathbb{Z}_{\ge 0}$. We define $\mathrm{Split}_{b_1,\dots,b_d}$ to be a linear map that sends any $M \in \mathbb{C}^{d \times d}$ to a square matrix of dimension $2^{b_1} + \cdots + 2^{b_d}$, defined as follows. The rows and columns of $\mathrm{Split}_{b_1,\dots,b_d}(M)$ are indexed by pairs $(j, s)$, where $j \in [d]$ and $s \in \{0,1\}^{b_j}$.

Claim 5.4. Given measurement access to $\rho^{\otimes t}$, where $\rho \in \mathbb{C}^{d \times d}$ is a state, $\mathrm{Split}_{b_1,\dots,b_d}(\rho)$ is a valid state and we can simulate measurement access to $\mathrm{Split}_{b_1,\dots,b_d}(\rho)^{\otimes t}$.

Proof. Note that $\mathrm{Split}_{b_1,\dots,b_d}(\rho)$ can be constructed by embedding $\rho$ into various principal submatrices of a $k \times k$ matrix (where $k = 2^{b_1} + \cdots + 2^{b_d}$) and averaging them. In particular, for a string $s \in \{0,1\}^{\max(b_1,\dots,b_d)}$, we can imagine indexing the rows and columns of the $k \times k$ matrix as in Definition 5.1 and embedding $\rho$ into the rows and columns indexed by $(1, s[:b_1]), \dots, (d, s[:b_d])$, where $s[:b_j]$ denotes the truncation of $s$ to its first $b_j$ bits. Averaging these embeddings over all $2^{\max(b_1,\dots,b_d)}$ choices of $s$ gives exactly $\mathrm{Split}_{b_1,\dots,b_d}(\rho)$.
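The averaging construction in this proof can be written out directly. The following NumPy sketch (with a hypothetical helper name) builds $\mathrm{Split}_{b_1,\dots,b_d}(M)$ by summing the $2^{\max(b_1,\dots,b_d)}$ embeddings and normalizing:

```python
import numpy as np

def split(M, b):
    """Split_{b_1,...,b_d}(M): average, over all strings s in {0,1}^{max b},
    the embeddings of M into the rows/columns indexed by
    (1, s[:b_1]), ..., (d, s[:b_d]) of a k x k matrix, k = 2^{b_1}+...+2^{b_d}."""
    d, B = len(b), max(b)
    offsets = np.concatenate(([0], np.cumsum([2**bj for bj in b])))
    out = np.zeros((offsets[-1], offsets[-1]), dtype=complex)
    for s in range(2**B):  # s read as a B-bit string, most significant bit first
        # linear index of the pair (j, s[:b_j]): block offset + prefix value
        idx = [offsets[j] + (s >> (B - b[j])) for j in range(d)]
        out[np.ix_(idx, idx)] += M
    return out / 2**B

# Example: splitting a 2x2 state with b = (1, 1) yields a 4x4 state.
rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
tau = split(rho, [1, 1])
```

Each embedding of a state is PSD with trace 1, so their average is again a valid state, as Claim 5.4 asserts.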
For a vector $v \in \mathbb{C}^{d^t}$ and integer $1 \le j \le t$, we define $G_j(v)$ to be the $d^j \times d^j$ matrix
$$\cdots$$

Definition 3.30. Given a tensor $T \in (\mathbb{C}^d)^{\otimes t}$, we index its modes by $1, 2, \dots, t$. For sets $S_1, \dots, S_k$ that partition $[t]$, we define $F_{S_1,\dots,S_k}(T)$ to be the order-$k$ tensor with dimensions $d^{|S_1|} \times \cdots \times d^{|S_k|}$ obtained by flattening the respective modes of $T$ indexed by the elements of $S_1, \dots, S_k$.

Definition 3.31.
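In NumPy terms, the flattening $F_{S_1,\dots,S_k}$ of Definition 3.30 is a mode permutation followed by a reshape. A minimal sketch (with a hypothetical helper name, and 0-indexed modes rather than the 1-indexed convention above):

```python
import numpy as np

def flatten_modes(T, parts):
    """F_{S_1,...,S_k}(T) for an order-t tensor T with all modes of dimension d:
    permute the modes so each group S_i is contiguous, then merge each group
    into a single mode of dimension d^{|S_i|}. parts is a partition of
    range(T.ndim), given as a list of lists of 0-indexed modes."""
    d = T.shape[0]
    Tt = np.transpose(T, [m for S in parts for m in S])
    return Tt.reshape([d ** len(S) for S in parts])

# Example: t = 3, d = 2; grouping modes {0, 2} and {1} gives a 4 x 2 tensor
# with F[(i1, i3), i2] = T[i1, i2, i3].
T = np.arange(8.0).reshape(2, 2, 2)
F = flatten_modes(T, [[0, 2], [1]])
```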