Learning quantum Hamiltonians at any temperature in polynomial time

We study the problem of learning a local quantum Hamiltonian $H$ given copies of its Gibbs state $\rho = e^{-\beta H}/\textrm{tr}(e^{-\beta H})$ at a known inverse temperature $\beta>0$. Anshu, Arunachalam, Kuwahara, and Soleimanifar (arXiv:2004.07266) gave an algorithm to learn a Hamiltonian on $n$ qubits to precision $\epsilon$ with only polynomially many copies of the Gibbs state, but which takes exponential time. Obtaining a computationally efficient algorithm has been a major open problem [Alhambra'22 (arXiv:2204.08349)], [Anshu, Arunachalam'22 (arXiv:2204.08890)], with prior work only resolving this in the limited cases of high temperature [Haah, Kothari, Tang'21 (arXiv:2108.04842)] or commuting terms [Anshu, Arunachalam, Kuwahara, Soleimanifar'21]. We fully resolve this problem, giving a polynomial time algorithm for learning $H$ to precision $\epsilon$ from polynomially many copies of the Gibbs state at any constant $\beta>0$. Our main technical contribution is a new flat polynomial approximation to the exponential function, and a translation between multi-variate scalar polynomials and nested commutators. This enables us to formulate Hamiltonian learning as a polynomial system. We then show that solving a low-degree sum-of-squares relaxation of this polynomial system suffices to accurately learn the Hamiltonian.


Introduction
Quantum computing has sparked a major interest in increasing the scale and control of quantum systems [GAN14]. This increased interest is accompanied by the need for better algorithms to characterize and verify these systems [CEKKZ21]. A central computational task in controlling and verifying quantum systems is that of Hamiltonian learning, where the goal is to estimate physical properties, namely the interaction strengths, of an interacting quantum many-body system from measurements [Cra+10; SLP11; BAL19; AAKS20]. Formally, we consider n qubits (quantum particles with a local dimension of two) on a lattice.1 The resulting system is characterized by a Hamiltonian, a 2^n × 2^n complex Hermitian matrix of the form H = ∑_{a=1}^m λ_a E_a, where a term E_a encodes an interaction on at most K of the particles, and the coefficient λ_a ∈ [−1, 1] is the strength of the corresponding interaction. We assume that the system has reached thermal equilibrium at a known inverse temperature β, in which case it is in the Gibbs state with density matrix ρ = e^{−βH}/tr(e^{−βH}). The density matrix is normalized by the partition function, tr(e^{−βH}), which ensures that ρ has trace one. The goal of the Hamiltonian learning problem is to estimate the λ_a's, given the ability to prepare copies of the Gibbs state.
Problem (Hamiltonian learning, Problem 1 (informal)). Consider n qubits on a constant-dimensional lattice. Let H = ∑_{a=1}^m λ_a E_a ∈ C^{2^n × 2^n} be a Hamiltonian whose terms E_a are known, distinct, non-identity Pauli operators supported on at most k qubits that are local with respect to the lattice. Further suppose the coefficients λ_a ∈ R satisfy |λ_a| ⩽ 1. Given copies of the corresponding Gibbs state ρ at a known inverse temperature β > 0, and ε > 0, find estimates λ̂_a such that |λ̂_a − λ_a| ⩽ ε for all a ∈ [m].
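As a concrete (and purely illustrative) instance of this setup, the following numpy sketch builds a tiny 3-qubit Hamiltonian with known Pauli terms, forms its Gibbs state exactly by diagonalization, and reads off the marginals tr(E_a ρ) that a learner could estimate from measurements. The specific terms and coefficients are hypothetical choices, not taken from the paper.

```python
# Toy instance of the Hamiltonian learning setup: H = sum_a lambda_a E_a,
# Gibbs state rho = e^{-beta H} / tr(e^{-beta H}) computed via eigh.
import numpy as np

Id = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron(*ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

# Terms E_a: nearest-neighbour ZZ couplings and single-site X fields.
terms = [kron(Z, Z, Id), kron(Id, Z, Z), kron(X, Id, Id), kron(Id, X, Id), kron(Id, Id, X)]
lam = [0.7, -0.3, 0.5, -0.9, 0.2]          # unknown coefficients, |lambda_a| <= 1
H = sum(l * E for l, E in zip(lam, terms))

beta = 2.0
evals, V = np.linalg.eigh(H)               # H is Hermitian
gibbs = V @ np.diag(np.exp(-beta * evals)) @ V.conj().T
rho = gibbs / np.trace(gibbs).real         # rho = e^{-beta H} / tr(e^{-beta H})

assert abs(np.trace(rho) - 1) < 1e-10      # unit trace
marginals = [np.trace(E @ rho).real for E in terms]
```

Since each E_a has operator norm one, every marginal lies in [−1, 1]; the learning problem asks to recover the λ_a from (estimates of) such statistics.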
We are interested in both the number of copies of ρ that we need, which is called the sample complexity, as well as the running time of the algorithm. Of particular interest to us is Hamiltonian learning in the low-temperature regime, where β is an arbitrarily large constant.
Motivation. As alluded to above, this problem is of fundamental importance in science and engineering. For example, in pursuit of understanding phenomena like topological order and superconductivity that are studied in condensed matter physics, experimentalists carefully design systems which exhibit this exotic behavior. In particular, analog quantum simulators are tuned to obey poorly-understood Hamiltonians like the Fermi-Hubbard model for experimental exploration [GB17; End+11; Hir+23].2 For these experiments, a natural goal is to learn the interactions which give rise to various phenomena [WGFC14; KBEVZ21]. Intractability of computation is a major barrier to resolving open problems like finding the phase diagram of the 2D Fermi-Hubbard model, so having better algorithmic tools is of key importance in this domain [LeB+15]. This problem also arises when engineering quantum systems: a major challenge in building near-term quantum devices is being able to validate them (certify that they implement the desired Hamiltonian) and understand sources of error [SLP11; Shu+14]. Quantum devices with 100 or more qubits are challenging to simulate classically, but quantum Hamiltonian learning has emerged as an alternative strategy for benchmarking devices by combining quantum resources and classical learning techniques [CEKKZ21; GCC22].

1 Our results do not need the Hamiltonian to be geometrically local as described here; we merely require it to be low-intersection in the sense of Haah, Kothari, and Tang [HKT22]. So, our algorithm will still work if the locality structure of the qubits is given by, say, an expander graph.
2 See also [GHLS15, Section 5.4.2] for a description of this work aimed towards theoretical computer scientists.
The low-temperature setting is of particular interest because quantum phenomena are most prominent at zero or near-zero temperature [AFOV08], precisely where high-temperature series expansions fail [LeB+15]. 3 In some sense, this is the only relevant setting for analog quantum simulators, since models at high temperature can be solved with a classical computer [OHZ06, Chapter 8], without needing to resort to a quantum simulation. More generally, low temperature is the computationally interesting regime, since quantum advantage is a low-temperature phenomenon: "temperature scaling" laws show that quantum annealers can only achieve large speedups over classical computers when β scales with system size [AMH17].
Prior work. Despite its importance, the computational complexity of Hamiltonian learning from Gibbs states is not well understood. Anshu, Arunachalam, Kuwahara, and Soleimanifar gave the first polynomial sample complexity bounds for this task in 2020 [AAKS20], attaining coefficient estimates using 2^{poly(β)} · m^2 log(m)/(β^c ε^2) copies of the Gibbs state [HKT22, Remark 4.5]. However, their work comes with a serious drawback: it is computationally inefficient. In particular, they give a stochastic gradient descent algorithm and show that it converges to the true parameters in a small number of iterations, but actually computing an iterate involves evaluating a log-partition function, which is well-known to be computationally hard even for classical systems [Mon15].
Prior work has obtained fast algorithms for Hamiltonian learning in limited regimes. For example, in a follow-up paper, Anshu, Arunachalam, Kuwahara, and Soleimanifar [AAKS21] show that when the terms of H commute, a direct generalization of the classical "Markov property" algorithm can learn the parameters efficiently. Further, [AAKS20] notes that the log-partition function can be efficiently evaluated at high temperature (small β) because its multivariate Taylor series expansion can be shown to converge rapidly. Haah, Kothari, and Tang [HKT22] later gave an improved algorithm that achieves the sample and time complexity e^{O(β)} log(m)/(β^2 ε^2) and m e^{O(β)} log(m)/(β^2 ε^2), respectively [HKT22], which they prove is tight up to the constant factor in the exponential, even in the classical case. In practice quantum many-body systems are run at low temperature, which is also when most macroscopic phenomena arise, so this is the most important regime for the problem. However, as we discuss later, no strategies had been suggested for solving Hamiltonian learning at low temperature. In fact, the situation is even more dire: all approaches to Hamiltonian learning used in prior settings fail catastrophically here, since reduction to sufficient statistics [AAKS20], efficient computation of the partition function [AAKS20], the approximate Markov property [KKB20], and cluster expansion [HKT22] all provably fail for sufficiently large β. This state of the literature reflects a broader scarcity of algorithmic tools known for understanding Hamiltonians outside of special settings like high temperature or one dimension. So, a negative resolution to this question seemed plausible, or even likely. Indeed, a recent survey on the complexity of learning quantum systems by Anshu and Arunachalam [AA23] discusses Hamiltonian learning and concludes by asking several questions:

Question 1 ([AA23]). Is there a polynomial-time algorithm for learning a Hamiltonian from copies of its Gibbs state at any temperature?

Question 2 ([AA23]). Can we achieve Hamiltonian learning under the assumption that the Gibbs states satisfy an approximate conditional independence?4

Question 3 ([AA23]). Could low temperature Gibbs states be pseudorandom, which would explain the difficulty in finding a time efficient algorithm?

Our results
Surprisingly, we provide a positive resolution to Question 1. Our main result is a computationally efficient algorithm for Hamiltonian learning that works at all temperatures. This is a fortunate development since, if learning were truly computationally hard in the low-temperature regime, then we could not understand the behavior of analog quantum simulators in precisely the regimes where they outperform classical simulators [Pre18, Section 6.10]. As a consequence of our main result, we also resolve Question 2 positively and Question 3 negatively.
Theorem 1.1 (Efficiently learning a quantum Hamiltonian, Theorem 6.1 (informal)). Given ε > 0, β ⩾ β_c for a fixed universal constant β_c > 0, and n copies of the Gibbs state of a low-intersection Hamiltonian H = ∑_{a∈[m]} λ_a E_a, there exists an algorithm that runs in time n^{O(1)} and outputs estimates {λ̂_a}_{a∈[m]} such that with probability at least 99/100, for all a ∈ [m], |λ̂_a − λ_a| ⩽ ε, whenever n ⩾ poly(m, (1/ε)^{2^{O(β)}}).
Remark 1.2 (On temperature). For our algorithm, we only need to know an upper bound on β, as we can consider a Gibbs state at temperature β to be a Gibbs state at temperature, say, 2β with Hamiltonian H/2. Our requirement that β > β_c is for simplicity, and β_c can be any constant bounded away from zero. In particular, we can take β_c to be the temperature at which the high-temperature algorithm of [HKT22] fails; so, when β < β_c, we can simply appeal to [HKT22] to achieve a sample and time complexity of log(m)/(β^2 ε^2) and m log(m)/(β^2 ε^2), respectively.

As noted by prior work [AAKS20], Hamiltonian learning is a generalization of the classical and well-studied problem of learning undirected graphical models, specifically parameter learning of these models. This classical problem requires e^{O(β)} m log(m)/(β^2 ε^2) time (and there is an algorithm matching the lower bound), so exponential dependence on β is necessary [HKT22].5 Analogies with the classical setting turn out to be of limited use, since the non-commutativity and nonlocality inherent in the quantum setting rule out generalizations of classical ideas. However, with the classical setting we can identify barriers to designing a time-efficient algorithm.
A key challenge of time-efficient Hamiltonian learning is that we cannot work directly with the partition function. The only previous approach to low temperature [AAKS20] only used its copies of ρ to estimate tr(E_a ρ) for all a ∈ [m]. It is known in the classical literature that taking just these estimates and using them to compute the parameters λ_a is as hard as computing the partition function [Mon15]. To avoid this barrier, we take a richer set of expectations tr(Pρ) that allows us to reduce learning to a tractable, but fairly involved, optimization problem instead. Along the way we develop several new tools of independent interest, and ultimately give a semi-definite programming algorithm based on the sum-of-squares hierarchy. Consequently, we show that sophisticated modern tools in optimization theory lead to a surprising resolution of the Hamiltonian learning problem.

4 Approximate conditional independence is a property of Gibbs states which is proven to hold in 1D and conjectured to hold in general.
5 It is an interesting open question to improve our doubly exponential dependence to singly exponential.

Technical overview
The recipe for quantum Hamiltonian learning introduced by Anshu, Arunachalam, Kuwahara, and Soleimanifar [AAKS20] is based on matching the local marginals of the Gibbs state ρ, which we can estimate with copies of ρ. Specifically, for two Hamiltonians H = ∑_a λ_a E_a and H′ = ∑_a λ′_a E_a with Gibbs states ρ and ρ′, their structural result shows that if the marginals match, i.e. tr(E_a ρ′) = tr(E_a ρ) for all a ∈ [m], then the coefficients must be close. However, finding coefficients λ′_a whose Gibbs state matches given marginals amounts to evaluating the partition function, which is a computationally hard problem. Formally, for a classical Hamiltonian,6 the tr(E_a ρ) are sufficient statistics of a graphical model, and it is known that estimating the parameters of a graphical model from these sufficient statistics is computationally intractable [Mon15]. This doesn't mean that the problem is hopeless, but rather that to find a tractable algorithm, we should be looking for the opportunity to use a richer family of statistics.

Designing a new system of constraints.
We interpret the previous argument as defining and then solving a constraint system in the set of unknowns, {λ′_a}_{a∈[m]}. The structural result in [AAKS20] shows that an approximate solution to this system will be close to the true parameters λ_a. However, this system is computationally hard to solve. Our starting point is to define a larger set of constraints which {λ_a}_{a∈[m]} must satisfy, which can be verified by measuring expectations of observables slightly less local than the terms {E_a}_{a∈[m]}. Let P_local be the set of Pauli matrices whose support is K-local for some large constant K. Writing H′ = ∑_{a∈[m]} λ′_a E_a, we begin with the following system of constraints:

    −1 ⩽ λ′_a ⩽ 1 for all a ∈ [m],
    tr(Q e^{−βH′} P e^{βH′} ρ) = tr(PQρ) for all P, Q ∈ P_local,    (1)

where the second constraint is satisfied by the true coefficients (H′ = H) since e^{βH} ρ = I/tr(e^{−βH}), which follows from the cyclic property of the trace. Two main challenges remain: Must a solution to this system be close to the true coefficients? And how can we efficiently solve the system? Eventually we will derive a convex relaxation for it that is based on (A) replacing the last constraint in Eq.
(1), which involves the matrix exponential, with low-degree polynomial constraints on the indeterminates {λ′_a}_{a∈[m]} instead, and (B) showing that any choice of λ′ that satisfies the constraints must also approximately match the true coefficients λ.
In general, solving systems of polynomial equations is computationally hard, but because our analysis in (B) will be based on sum-of-squares proofs, there is by now standard machinery for turning it into an efficient algorithm (see Section 2.5.1 for a detailed explanation).
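The constraint in Eq. (1) that the true Hamiltonian satisfies can be checked numerically on a small random instance. The following numpy sketch (illustrative terms, not from the paper) verifies that tr(Q e^{−βH} P e^{βH} ρ) = tr(PQρ) when H′ = H:

```python
# Sanity check of the cyclic-trace constraint: with H' = H we have
# e^{beta H} rho = I / tr(e^{-beta H}), so the two traces coincide.
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def kron(*ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

def herm_exp(A, t):
    """e^{tA} for Hermitian A, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.exp(t * w)) @ V.conj().T

rng = np.random.default_rng(0)
terms = [kron(Z, Z, I2), kron(X, I2, I2), kron(I2, Y, I2), kron(I2, I2, X)]
lam = rng.uniform(-1, 1, size=len(terms))
H = sum(l * E for l, E in zip(lam, terms))

beta = 1.5
rho = herm_exp(H, -beta)
rho /= np.trace(rho).real

P, Q = kron(X, Y, I2), kron(I2, Z, X)
lhs = np.trace(Q @ herm_exp(H, -beta) @ P @ herm_exp(H, beta) @ rho)
rhs = np.trace(P @ Q @ rho)
assert abs(lhs - rhs) < 1e-8
```

The point of the larger system is that these richer statistics, unlike the marginals tr(E_a ρ) alone, will pin down the coefficients without evaluating a partition function.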
Identifying an equivalence between nested commutators and polynomials. Working towards the goal of replacing the term tr(Q e^{−βH′} P e^{βH′} ρ) with a low-degree polynomial in the variables λ′, we begin by recalling the Hadamard formula:

    e^{−βH′} P e^{βH′} = ∑_{ℓ=0}^∞ ((−β)^ℓ/ℓ!) [H′, P]_ℓ,    (2)

where the order-ℓ nested commutator [H′, P]_ℓ = [H′, [H′, ⋯, [H′, P]]] is a degree-ℓ polynomial in the λ′_a indeterminates. However, the series in Eq. (2) only converges quickly when β is sufficiently small [HKT22], so we cannot use it. 8 Nevertheless, from this observation we can develop a general formalism for constructing polynomial approximations of evolutions of operators. We observe that in the eigenbasis of H′, with eigenvalues {σ_i}, conjugation acts entrywise as

    (e^{−βH′} P e^{βH′})_{ij} = e^{−β(σ_i − σ_j)} P_{ij},

and thus we can focus our attention on designing low-degree polynomials in σ_i − σ_j that approximate the scalar quantity e^{−β(σ_i − σ_j)}. Further, any degree-d polynomial p(z) = ∑_{ℓ=0}^d c_ℓ z^ℓ can be extended to commutators as follows:

    p(H′ | P) := ∑_{ℓ=0}^d c_ℓ [H′, P]_ℓ,  so that  (p(H′ | P))_{ij} = p(σ_i − σ_j) P_{ij}.

This allows us to translate between matrix series expansions involving nested commutators and univariate polynomials. We note that for technical reasons we need to extend our equivalence to nested commutators with two distinct operators X, Y appearing in an arbitrary order, such as [X, [Y, [X, ⋯]]], and bi-variate polynomials p(x, y) (Section 3). The translation between bi-variate polynomials and nested commutators incurs additive error depending on [X, Y] due to re-ordering of the X and Y operators, as expected (Theorem 3.9). In our full algorithm (Section 6) we introduce an additional constraint to drive this additive error to zero. Focusing on the scalar polynomial approximation to the exponential function, we now formalize the notion of approximation that we require.
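The entrywise action of nested commutators in the eigenbasis, ([H, P]_ℓ)_{ij} = (σ_i − σ_j)^ℓ P_{ij}, is easy to confirm numerically on a toy 2-qubit instance (the Hamiltonian below is an arbitrary illustrative choice):

```python
# Check that the order-3 nested commutator [H,[H,[H,P]]] acts as
# multiplication by (sigma_i - sigma_j)^3 in the eigenbasis of H.
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def kron(a, b):
    return np.kron(a, b)

H = 0.8 * kron(Z, Z) + 0.4 * kron(X, I2) - 0.6 * kron(I2, X)
P = kron(X, Z)

sigma, V = np.linalg.eigh(H)
to_eig = lambda M: V.conj().T @ M @ V   # change of basis to eigenbasis of H

C = P
for _ in range(3):                       # order-3 nested commutator
    C = H @ C - C @ H

gaps = sigma[:, None] - sigma[None, :]   # matrix of sigma_i - sigma_j
assert np.allclose(to_eig(C), gaps**3 * to_eig(P), atol=1e-9)
```

By linearity, the same identity extends from monomials z^ℓ to any polynomial p, which is exactly what makes the definition p(H′ | P) behave like p(σ_i − σ_j) entrywise.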
Constructing a new, flat approximation to the exponential. Recall that we want a polynomial p such that, working in the eigenbasis of H′,

    p(σ_i − σ_j) ≈ e^{−β(σ_i − σ_j)},

where "≈" denotes an unusual notion of approximation which, for the purposes of this discussion, we can consider to mean that the corresponding matrices are close in some norm. The natural candidate is the Taylor series approximation of Eq. (2), which, as we established, requires too high a degree.
Our key insight is to choose a better polynomial approximation. We begin by observing that an operator with small support is approximately band-diagonal in the basis of eigenvectors of H, which is a property of local terms proved by Arad, Kuwahara, and Landau [AKL16]. We state a weak version of this here: let P be a Pauli operator with support size O(1), and let H = ∑_i d_i v_i v_i† be an eigendecomposition of H. Then, considering P in the eigenbasis of H,

    |v_i† P v_j| ⩽ e^{−Ω(|d_i − d_j|)}.

Consequently, the entries of the error of the polynomial approximation are weighted inverse exponentially in σ_i − σ_j. Therefore, our polynomial approximation need not be equally good for all σ_i − σ_j; rather, our approximation should be ε-good in a small range but is allowed to diverge at a sufficiently slow exponential rate outside that range. We call this a flat approximation. In particular, given parameters β ⩾ 0 and 0 < ε, η < 1, we construct p such that

    |p(z) − e^{−βz}| ⩽ ε for all |z| ⩽ β,  and  |p(z)| ⩽ e^{ηβ|z|} for all |z| ⩾ β.    (4)

The key difficulty in satisfying the above constraints is satisfying |p(z)| ⩽ e^{ηβ|z|} for z ⩾ β, as standard approximations like Taylor series truncations and Chebyshev series truncations fail this condition (Remark 4.3). In Section 4, we explicitly construct a degree 2^{O(1/η)} · (β + log(1/(εη))) polynomial that satisfies Eq. (4). This construction is inspired by the iterative "peeling" of the exponential used in proofs of Lieb-Robinson bounds [LR72; Has10]. We can write e^{−βz} = (e^{−β_c z})^{β/β_c}, a product of β/β_c copies of e^{−β_c z} for a fixed small constant β_c, and then truncate the Taylor series expansion of e^{−β_c z} at a different scale for each of the β/β_c = O(β) copies in the product so that the tails of the different truncations don't "interfere".
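A toy version of the peeling idea can be seen numerically. The sketch below (a simplification: it uses a uniform per-factor truncation rather than the paper's different scales per factor, and the parameter choices β = 8, β_c = 1 are illustrative) shows that a product of low-degree truncations of e^{−β_c z} is accurate on a small range yet stays controlled far outside it, while a single Taylor truncation of e^{−βz} of the same total degree blows up:

```python
# Product-of-truncations ("peeling") vs a single Taylor truncation.
import math

def taylor_exp_neg(z, deg):
    """Degree-`deg` Taylor truncation of e^{-z}."""
    return sum((-z) ** l / math.factorial(l) for l in range(deg + 1))

beta, beta_c = 8.0, 1.0
copies = int(beta / beta_c)                 # 8 factors of e^{-beta_c z}

def p_flat(z, per_factor_deg=10):
    # (truncation of e^{-beta_c z})^(beta/beta_c), total degree 80
    return taylor_exp_neg(beta_c * z, per_factor_deg) ** copies

def p_naive(z, deg=80):
    # single truncation of e^{-beta z}, same total degree
    return taylor_exp_neg(beta * z, deg)

# accurate on the small range [-1, 1] ...
grid = [x / 50 - 1 for x in range(101)]
err = max(abs(p_flat(z) - math.exp(-beta * z)) for z in grid)
assert err < 1e-2

# ... and comparatively tame far outside it, unlike the single truncation
assert abs(p_flat(10.0)) < 1e28
assert abs(p_naive(10.0)) > 1e30
```

The product form grows roughly like a power of a low-degree polynomial, which is what keeps it below a slow exponential envelope; the single truncation's largest monomials dominate at moderate |z|.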
We show that when p is a flat approximation of the above form for some sufficiently small η, then Qp(H | P)ρ is a good approximation to Qe^{−βH}Pe^{βH}ρ. In other words, the polynomial approximation is good when H′ = H and we right multiply by ρ; this is crucial for the polynomial system that we set up next to be feasible.

Formulating a polynomial system. We now have all the tools to describe a polynomial system that captures the Hamiltonian learning problem. The constraint system we describe in this section is an informal treatment of the system that appears in Section 6, and avoids several technical details. We show that using our flat approximation to the exponential, we can obtain a polynomial p such that tr(Qe^{−βH}Pe^{βH}ρ) ≈ tr(Qp(H | P)ρ).
Then, we can re-write Eq. (1) as the following polynomial constraint system, for a suitable error tolerance ε′:

    −1 ⩽ λ′_a ⩽ 1 for all a ∈ [m],
    |tr(Q p(H′ | P) ρ) − tr(PQρ)| ⩽ ε′ for all P, Q ∈ P_local,    (5)

and observe that the last constraint encodes a relaxation of the last constraint in Eq. (1) and is satisfied when H′ = H. Further, all of the constraints are indeed succinctly representable as low-degree polynomials in the indeterminates {λ′_a}_{a∈[m]}, as discussed earlier. Finally, the coefficients, such as tr([E_a, [E_b, P]] ρ), are expectations of the Gibbs state against a slightly larger set of local observables, which are the richer class of test functions we desired. We can obtain estimates of these expectations through quantum measurements (Section 5). Computing these estimates is the only quantum part of our algorithm, and the rest of the algorithm is entirely classical.
Feasibility of the polynomial system. Recall that to show that the polynomial system in (5) is feasible, we need to argue that tr(Qe^{−βH}Pe^{βH}ρ) ≈ tr(Qp(H | P)ρ) for all P, Q. Working in the eigenbasis of H, let its eigenvalues be {σ_i}_{i∈[2^n]}. The key tool that we leverage is from [AKL16] (see Corollary 2.16), which roughly states that any local term E must be approximately diagonal in the eigenbasis of H, with off-diagonal entries decaying as |E_{ij}| ⩽ e^{−Ω(|σ_i − σ_j|)}. Thus, we can decompose the matrices Q, P into two parts: entries indexed by i, j where |σ_i − σ_j| ⩽ β, and entries indexed by i, j where |σ_i − σ_j| ⩾ β. Then we use the fact that p(x) is a good approximation to e^{−βx} on [−β, β] to prove that the error on the first part is small. We then appeal to the exponential decay of the off-diagonals to argue that the contribution from the second part in both tr(Qe^{−βH}Pe^{βH}ρ) and tr(Qp(H | P)ρ) is small. Our flat approximation to the exponential is designed to ensure that it does not overwhelm the exponential decay in the off-diagonal entries of P, Q in any regime.
Efficiently optimizing polynomial systems. Now that we know that our polynomial system is feasible, we consider a convex relaxation of this system. In particular, we consider a degree-d sum-of-squares relaxation, which can be efficiently optimized by expressing it as a semi-definite program (see Section 2.5.1 for details), with d = log(1/ε) · 2^{O(β)}. Since we have m variables and 2^{O(β)} constraints, and each constraint is a degree-d polynomial, we can solve the degree-2d sum-of-squares relaxation of Eq. (5) in m^{log(1/ε) · 2^{O(β)}} time. The main challenge in analyzing the sum-of-squares relaxation is to show that we can round it to estimates {λ′_a}_{a∈[m]} that are close to the true parameters. Here, we adopt the so-called proofs-to-algorithms philosophy, where we instead work with the dual object to the sum-of-squares relaxation, namely sum-of-squares proofs (see [Bar; FKP+19] and references therein). This perspective states that if the true parameters are identifiable using only the sum-of-squares proof system, then we immediately obtain an efficient algorithm, and we show that we can easily and accurately round the solution.
We then provide a proof of identifiability, i.e. for all a ∈ [m], the inequality |λ′_a − λ_a| ⩽ ε can be derived using the system of polynomial constraints and other basic inequalities that admit sum-of-squares proofs (we refer the reader to Section 9 for a detailed exposition). At a high level, the proof works by arguing that when H′ − H is large, there are witnesses P, Q such that |tr(Qp(H′ | P)ρ) − tr(Qp(H | P)ρ)| is large. Since we know that H is a feasible solution, this would imply that such an H′ cannot be feasible, so any feasible solution must have H′ − H small. The construction of the witnesses relies on an additional property of the polynomial p that we construct, namely that it is strongly monotone (in some appropriate quantitative sense).
For the identifiability proof, we crucially use an additional important property of local Hamiltonians. It deals with the quantity tr(A^2 ρ), where A = ∑_b σ_b P_b is a Hermitian linear combination of Pauli matrices with small support. Thinking of ρ as a distribution, tr(A^2 ρ) is a second moment term with respect to ρ; we can prove this is not much smaller than tr(A^2 I/dim) = ∑_b σ_b^2, the second moment against the uniform distribution: for some constant c > 0,

    tr(A^2 ρ) ⩾ e^{−cβ} ∑_b σ_b^2.

Intuitively, this shows that ρ is not close to zero in any local direction. This was first shown by [AAKS20] for quasi-local operators; we adjust their proof to hold for just local operators and give a tighter bound. We show that we can obtain a slightly weaker statement of this form in the sum-of-squares proof system by formulating it as a quadratic inequality. This inequality can be used to remove the dependence on ρ in expressions appearing in the proof.

Finally, we observe that our identifiability proof does not use the full power of a degree-2d sum-of-squares relaxation, and therefore it should suffice to solve a significantly smaller semidefinite program. We show that we can execute our proof of identifiability by only appealing to a sparse subset of monomials of degree at most 2d, and invoke a linearization theorem by Steurer and Tiegel [ST21] to obtain a final running time of poly(m) · (1/ε)^{2^{O(β)}}, as desired.

Background
Throughout, log denotes the natural logarithm and i = √−1. O(·), Θ(·), and Ω(·) denote standard asymptotic notation, and we use the notation f ≲ g to mean f = O(g). The notation Õ(f) denotes O(f polylog(f)). For a parameter t, O_t denotes big O notation where t is treated as a constant; the same holds for the notation for polynomial scaling, poly_t(·). Everywhere, the binary operator · denotes the usual multiplication. For a sequence S ∈ {0, 1}*, len(S) denotes its length.

Linear algebra
We work in the Hilbert space C N corresponding to a system of n qubits, C 2 ⊗ · · · ⊗ C 2 , so that N = 2 n . For a matrix A, we use A † to denote its conjugate transpose and ∥A∥ to denote its operator norm; for a vector v, we use ∥v∥ to denote its Euclidean norm. We will work with this Hilbert space, often considering it in the basis of (tensor products of) Pauli matrices.
Definition 2.1 (Pauli matrices). The single-qubit Pauli matrices are

    σ_I = [1 0; 0 1], σ_x = [0 1; 1 0], σ_y = [0 −i; i 0], σ_z = [1 0; 0 −1].

These matrices are unitary and (consequently) involutory. Further, σ_x σ_y = iσ_z, σ_y σ_z = iσ_x, and σ_z σ_x = iσ_y, so the product of two Pauli matrices is a Pauli matrix, possibly up to a factor in {i, −1, −i}. The non-identity Pauli matrices are traceless. We also consider tensor products of Pauli matrices, P_1 ⊗ · · · ⊗ P_n where P_i ∈ {σ_I, σ_x, σ_y, σ_z} for all i ∈ [n]. The set of such products of Pauli matrices, which we denote P, forms an orthogonal basis for the vector space of 2^n × 2^n (complex) Hermitian matrices under the trace inner product. The product of two elements of P is an element of P, possibly up to a factor in {i, −1, −i}.
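These identities are quick to verify numerically:

```python
# Check the basic Pauli algebra facts used throughout: unitarity,
# involution, multiplication rules, tracelessness, and orthogonality
# of distinct tensor products under the trace inner product.
import numpy as np

sI = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

for s in (sx, sy, sz):
    assert np.allclose(s @ s.conj().T, sI)   # unitary
    assert np.allclose(s @ s, sI)            # involutory

assert np.allclose(sx @ sy, 1j * sz)
assert np.allclose(sy @ sz, 1j * sx)
assert np.allclose(sz @ sx, 1j * sy)

assert all(abs(np.trace(s)) < 1e-12 for s in (sx, sy, sz))

P1, P2 = np.kron(sx, sz), np.kron(sy, sz)    # distinct Pauli products
assert abs(np.trace(P1.conj().T @ P2)) < 1e-12
```

The orthogonality check is the trace inner product ⟨P1, P2⟩ = tr(P1† P2) vanishing for distinct Pauli products, which is what makes P a basis.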
Definition 2.2 (Support of an operator). For an operator P ∈ C N×N on a system of n qubits, its support, supp(P) ⊂ [n] is the subset of qubits that P acts non-trivially on. That is, supp(P) is the minimal set of qubits such that P can be written as P = O supp(P) ⊗ I [n]\supp(P) for some operator O.
So, for example, the support of a tensor product of Paulis P_1 ⊗ · · · ⊗ P_n is the set of i ∈ [n] such that P_i ≠ σ_I. A central object we consider is (nested) commutators of operators.

Definition 2.3 (Commutator). Given operators A, B ∈ C^{N×N}, their commutator is [A, B] := AB − BA. The nested commutator of order ℓ is defined recursively as [A, B]_ℓ := [A, [A, B]_{ℓ−1}], with [A, B]_0 := B.

Pauli matrices behave straightforwardly under commutation: the commutator of two Pauli matrices is another Pauli matrix up to a scalar (see Lemma 2.10).
Finally, we define the following notation to extract the piece of an operator that acts on a particular qubit.

Definition 2.4 (Localizing an operator). For an operator A ∈ C^{N×N} on a system of n qubits and a qubit i ∈ [n], define

    [A]^{(i)} := A − (I_i/2) ⊗ tr_i(A) = A − ∫ U A U† dµ_i(U),

where I_i denotes the identity operator on the ith qubit, tr_i denotes the partial trace operation with respect to the ith qubit, and µ_i denotes the Haar measure over the set of unitaries only supported on qubit i. In other words, [·]^{(i)} : (C^{2×2})^{⊗n} → (C^{2×2})^{⊗n} is the linear map on operators that is the identity on every qubit but i, and on the ith qubit maps M ↦ M − tr(M) I/2. For a tensor product of Pauli matrices P ∈ P, [P]^{(i)} is P when i ∈ supp(P) and 0 otherwise. So, for a linear combination of Paulis, A = ∑_{P∈P} λ_P P, applying this map restricts the sum to Pauli matrices that interact with qubit i:

    [A]^{(i)} = ∑_{P∈P : i∈supp(P)} λ_P P.
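A minimal numpy implementation of the localizing map on two qubits, assuming the formula [A]^{(i)} = A − (I_i/2) ⊗ tr_i(A) for the first qubit:

```python
# The localizing map [.]^(0) kills Pauli products acting trivially on
# qubit 0 and fixes those acting non-trivially on it.
import numpy as np

sI = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def localize_q0(A):
    """[A]^(0) = A - (I_0 / 2) (x) tr_0(A) on a 2-qubit operator A."""
    T = A.reshape(2, 2, 2, 2)                # indices: (i, k; j, l)
    tr0 = np.trace(T, axis1=0, axis2=2)      # partial trace over qubit 0
    return A - np.kron(sI, tr0) / 2

# a Pauli product survives iff it acts non-trivially on qubit 0
assert np.allclose(localize_q0(np.kron(sx, sz)), np.kron(sx, sz))
assert np.allclose(localize_q0(np.kron(sI, sz)), 0)

# linearity: only the qubit-0-interacting part of a combination survives
A = 0.5 * np.kron(sx, sz) + 0.25 * np.kron(sI, sz)
assert np.allclose(localize_q0(A), 0.5 * np.kron(sx, sz))
```

The two assertions on Pauli products correspond to the i ∈ supp(P) versus i ∉ supp(P) cases of the definition.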

Definition 2.5 (Projector onto eigenspaces). For a Hermitian matrix X and an interval I ⊆ R, Π_I^{(X)} denotes the orthogonal projector onto the subspace spanned by eigenvectors of X with eigenvalues in I.
We sometimes work in bases where matrices are diagonal, in which case Hadamard products become useful.

Hamiltonians of interacting systems
We begin by defining a Hamiltonian, which encodes the interaction forces between quantum particles in a physical system. Definition 2.7 (Hamiltonian). A Hamiltonian is an operator H ∈ C^{N×N} that we consider as a linear combination of local terms E_a with associated coefficients λ_a, H = ∑_{a=1}^m λ_a E_a. For normalization, we assume ∥E_a∥ ⩽ 1 and |λ_a| ⩽ 1. This Hamiltonian is K-local if every term E_a satisfies |supp(E_a)| ⩽ K.
We will only consider Hamiltonians whose terms E_a are distinct, traceless products of Pauli matrices. Other kinds of local Hamiltonians can be reduced to this setting by expanding out local terms, which are Hermitian matrices, into the basis of products of Pauli matrices. This preserves the locality of the Hamiltonian and only inflates the number of terms by a factor exponential in the locality, which we think of as constant. The assumption that the E_a are traceless is without loss of generality, since adding a multiple of the identity to the Hamiltonian does not affect its corresponding Gibbs state.

Definition 2.8 (Low-intersection Hamiltonian [HKT22]). For a K-local Hamiltonian H = ∑ λ_a E_a on a system of n qubits, its dual interaction graph G is an undirected graph with vertices labeled by [m] and an edge between a, b ∈ [m] if and only if supp(E_a) ∩ supp(E_b) ≠ ∅. Let d denote the maximum degree of this graph. We call H low-intersection 9 if K and d are constant.
Throughout, we assume that our Hamiltonian is K-local and has a dual interaction graph of degree d. We will treat K and d as constants throughout the paper. This encompasses most Hamiltonians discussed in the literature, including "geometrically local" Hamiltonians, i.e. Hamiltonians whose qubits are thought of as being on a constant-dimensional lattice like Z^3 and whose terms are spatially local with respect to the lattice. This class is more general than geometrically local Hamiltonians, as it extends to qubits where locality is dictated by, say, an expander graph.
We will be considering operators other than the terms, so we define a notion of locality with respect to the dual interaction graph that generalizes to such operators. Concretely, what we need about ℓ-local operators for ℓ ⩾ k is that (1) they contain the Hamiltonian terms E_a, (2) the dimension of the subspace they span is O(n), and (3) they contain nested commutators involving k-local operators.

Definition 2.9 (Local operator with respect to the dual interaction graph). Consider a K-local Hamiltonian with terms {E_a}_{a∈[m]} and dual interaction graph G. We call a Pauli matrix kℓ-G-local if its support is contained in ∪_{a∈S} supp(E_a) for some subset S ⊆ [m] of at most ℓ terms that is connected in G. We define P_kℓ to be the set of kℓ-G-local Pauli matrices. Generally, we call an operator M ∈ C^{N×N} kℓ-G-local if it is equal to a linear combination of elements in P_kℓ.

By this definition, if E_a and E_b are terms of a Hamiltonian, then the commutator [E_a, E_b] is 2k-G-local.
We state the form of nested commutators of Pauli matrices below.
Lemma 2.10. Let P_1 ∈ P_{k_1}, . . . , P_a ∈ P_{k_a}, and Q ∈ P_ℓ (with respect to some background dual interaction graph G). Then the nested commutator [P_a, [⋯, [P_1, Q]]] is either zero or, up to a scalar factor of 2^a times a fourth root of unity, a G-local Pauli matrix in P_{ℓ + k_1 + ⋯ + k_a}.

Proof. Let a = 1. By the properties given in Definition 2.1, for P ∈ P_{k_1} and Q ∈ P_ℓ, PQ is a tensor product of Pauli matrices up to a fourth root of unity, and P and Q either commute or anticommute. Consequently,

    [P, Q] = PQ − QP = 0 (if P and Q commute) or 2PQ (if P and Q anticommute).

This proves the lemma for a = 1; the general statement follows by iterating this case.
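The base case of this lemma can be checked directly with numpy; the specific Pauli products below are illustrative:

```python
# For Pauli products P, Q: [P, Q] is 0 (commuting case) or 2PQ
# (anticommuting case), and PQ is again a Pauli product up to phase.
import numpy as np

sI = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

# Anticommuting pair: sx vs sz clash on qubit 0, qubit 1 matches.
P = np.kron(sx, sz)
Q = np.kron(sz, sz)
comm = P @ Q - Q @ P
assert np.allclose(comm, 2 * P @ Q)                       # [P,Q] = 2PQ
assert np.allclose(P @ Q, np.kron(-1j * sy, sI))          # PQ = (-i sy) (x) I

# Commuting pair: sx/sz clash on both qubits, signs cancel.
R = np.kron(sx, sx)
assert np.allclose(R @ Q - Q @ R, 0)
```

Iterating the anticommuting case is what produces the 2^a factor in the lemma's scalar.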

Properties of local operators on quantum systems
For a Hamiltonian describing a quantum system, we consider getting access to the associated state attained by thermalizing it at a particular inverse temperature β. This is known as a Gibbs state.
Definition 2.11 (Gibbs state). The Gibbs state of the Hamiltonian H = ∑ λ_a E_a at inverse temperature β > 0 is given by

    ρ = e^{−βH} / tr(e^{−βH}).

A useful piece of intuition is to think of ρ as a distribution over its eigenspaces, with probability proportional to the corresponding eigenvalue. In that sense, we can think about expectations and variances against this distribution.
Definition 2.12 (Expectation against the Gibbs state). For a Gibbs state ρ of a Hamiltonian H and an operator A ∈ C N×N , we define ⟨A⟩ = tr(Aρ).
A key result in the prior work of Anshu, Arunachalam, Kuwahara, and Soleimanifar gives a lower bound on the variance of a local operator with respect to the energy distribution defined by the Gibbs state. The authors prove this for the more general class of quasi-local operators, but we only need it for local operators, for which the result can be tightened. We give this tighter version below; its proof is in Appendix A.
Theorem 2.13 ([AAKS20, Theorem 33]). Let H be a K-local Hamiltonian with dual interaction graph G with max degree d. Let A = ∑ b σ b P b be a K'-local operator where P b are products of Pauli matrices and −1 ⩽ σ b ⩽ 1 and whose dual interaction graph has max degree d ′ . For β > 0, let ρ be the corresponding Gibbs state of H. Then Here, c and c ′ are positive constants that depend on K, d, K ′ , and d ′ .
After some manipulation we can conclude that, if a local operator has small variance with respect to ρ, then the operator itself must be close to zero. Some may recognize this as the quantum analogue of the statement that bounded-degree graphical models have local marginals bounded away from zero (see e.g. [Bre15]).
Corollary 2.14 ("No small local marginals"). Let H be a K-local Hamiltonian with dual interaction graph G with max degree d. Let A = ∑_{P∈P} σ_P P be a sum over Paulis with support at most K′ and with a dual interaction graph G′ with max degree d′, with coefficients σ_P ∈ R.

Proof. Rescaling, tr(X^2 ρ) = g^2 tr((X/g)^2 ρ). This gives the desired statement.
Another key statement we use about local operators is that a local operator is approximately diagonal when considered in the basis of eigenvectors of another local operator.
The assumption that the h_S are PSD can be removed: if the h_S are not PSD, we can apply the theorem to h_S + I∥h_S∥, which only affects the final bound by inflating g and R by a factor of two. Note that this just shifts H by a multiple of the identity, and so only shifts the spectrum.
Corollary 2.16. Let H = ∑_{S⊂[n]} h_S be a Hamiltonian where all of its terms h_S are supported on at most K qubits, and the terms interacting with any particular site have bounded norm. When A satisfies |supp(A)| ⩽ K′, we can take R to be the number of terms that intersect the support of A. In our setup, where the terms are Pauli matrices with a dual interaction graph of maximum degree d, we can take g = d + 1. If the support of A is contained in the support of a term, we can take R = d + 1; otherwise, we can take R = K′d.

Bounds on nested commutators
One key property of commutators [A, B] is that they compose well when the inputs A, B are sums of local terms (recall Lemma 2.10). In this section, we make these composition properties more precise for a nested commutator of the form [H_1, [H_2, · · · , [H_ℓ, A] · · · ]], where H_1, . . . , H_ℓ are local operators and A has small support. We use a cluster expansion argument to bound these terms.
Now we argue about which of the terms in the sum above are nonzero. Note that for the commutator to be nonzero, P_ℓ must intersect the support of A, P_{ℓ−1} must intersect supp(P_ℓ) ∪ supp(A), and so on. This condition is equivalent to (A, P_ℓ, . . . , P_1) being a cluster. Now by Lemma 2.10, the nested commutator is proportional to a Pauli matrix with coefficient c_{P_1,...,P_ℓ} ∈ {0, ±1, ±i}. Because (A, P_ℓ, . . . , P_1) is a cluster, we can also conclude that the elements of P_{ℓ+1} that appear in the sum have distance ⩽ ℓ from A. Now it remains to count the number of clusters. When choosing P_a, we have a choices for which of (A, P_1, . . . , P_{a−1}) to intersect, and at most d + 1 choices for an element of E intersecting that choice. This gives an upper bound on the total number of clusters, completing the proof.
We will also need the following lemma from [HKT22] that counts the number of distinct multisets of terms that form a cluster (under some ordering).
As a consequence, we can bound the number of distinct elements in P kℓ .
Corollary 2.20. We have |P_{kℓ}| ⩽ m(10^k d)^ℓ. Furthermore, the number of elements of P_{kℓ} whose support intersects that of a fixed term P ∈ P_k is at most (10^k d)^{ℓ+1}.

Sums-of-squares polynomials
We will need some preliminaries about polynomials and the sum-of-squares (SoS) framework. First, we introduce a notion of polynomials with sum-of-squares representations with bounded coefficients. We maintain coefficient bounds because, later, certain sampling and approximation errors are multiplied by these coefficients. The general SoS framework is not concerned with such tight bounds on the coefficients, so we need to make a few definitions outside of the general framework.
We sometimes abbreviate sum-of-squares as SoS.
Definition 2.22 (Bounded polynomial). We say a polynomial p(x_1, . . . , x_m) is (d, C)-bounded if (a) p has degree at most d, and (b) the coefficient of each degree-k monomial of p has magnitude at most C/k!. We say p is a (k, d, C)-bounded sum-of-squares polynomial if p is a sum-of-squares polynomial, p = q_1^2 + · · · + q_k^2, and each of the q_i's is (d, C)-bounded.

Claim 2.23 (Composition of bounded SoS polynomials). Let p_1(x_1, x_2) be a (k_1, d_1, C_1)-bounded SoS polynomial and p_2(x_1, x_2) be a (k_2, d_2, C_2)-bounded SoS polynomial. Then their sum and product are bounded SoS polynomials, and boundedness is preserved under an affine change of variables.

Proof sketch. The first statement is obvious, as we can simply combine the two representations as sums of squares. The second also follows immediately by taking the two representations as sums of squares and expanding the product. For the final statement, write p_1(x_1, x_2) = q_1(x_1, x_2)^2 + · · · + q_{k_1}(x_1, x_2)^2 and substitute in the change of variables; each coefficient of the new polynomial is a sum of suitably rescaled coefficients of the old one, which preserves the coefficient bound.

The sum-of-squares framework
We now provide an overview of the sum-of-squares proof system. We closely follow the exposition as it appears in the lecture notes of Barak [Bar].

Pseudo-Distributions.
A discrete probability distribution over R m is defined by its probability mass function, D : R m → R, which must satisfy ∑ x∈supp(D) D(x) = 1 and D ⩾ 0. We extend this definition by relaxing the non-negativity constraint to merely requiring that D passes certain low-degree non-negativity tests. We call the resulting object a pseudo-distribution.
Definition 2.24 (Pseudo-distribution). A degree-ℓ pseudo-distribution is a finitely-supported function D : R m → R such that ∑ x D(x) = 1 and ∑ x D(x)p(x) 2 ⩾ 0 for every polynomial p of degree at most ℓ/2, where the summation is over all x in the support of D.
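For a univariate toy example (our own, not from the paper): checking the squared-polynomial tests of Definition 2.24 up to degree ℓ is the same as checking that the (ℓ/2 + 1) × (ℓ/2 + 1) moment matrix is PSD. The sketch below exhibits a signed function with a negative "probability" that nevertheless passes all degree-2 tests, while failing at degree 4:

```python
import numpy as np

def moment_matrix(support, weights, half_deg):
    """Moment matrix M[p, q] = sum_x D(x) * x**(p + q) of a signed measure D
    on the real line. D is a degree-(2 * half_deg) pseudo-distribution iff
    its weights sum to 1 and M is positive semidefinite."""
    M = np.empty((half_deg + 1, half_deg + 1))
    for p in range(half_deg + 1):
        for q in range(half_deg + 1):
            M[p, q] = sum(w * x ** (p + q) for x, w in zip(support, weights))
    return M

# A signed function with a negative "probability", so not a true distribution:
support = [-1.0, 0.0, 1.0]
weights = [0.75, -0.5, 0.75]
assert abs(sum(weights) - 1.0) < 1e-12

# Its degree-2 moment matrix is PSD, so the square of every degree-1
# polynomial has nonnegative pseudo-expectation: a degree-2 pseudo-distribution.
assert np.linalg.eigvalsh(moment_matrix(support, weights, 1)).min() >= -1e-12

# But it fails the degree-4 tests: some squared quadratic gets a negative
# pseudo-expectation, illustrating that the degree of the relaxation matters.
assert np.linalg.eigvalsh(moment_matrix(support, weights, 2)).min() < 0
```

This is exactly why higher-degree relaxations are stronger: they rule out more non-distributions.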
Next, we define the related notion of pseudo-expectation.
Definition 2.25 (Pseudo-expectation). The pseudo-expectation of a function f on R^m with respect to a pseudo-distribution D, denoted Ẽ_D[f], is defined as Ẽ_D[f] = ∑_x D(x) f(x). We use the notation Ẽ_{µ(x)}[(1, x_1, x_2, . . . , x_m)^{⊗ℓ}] to denote the degree-ℓ moment tensor of the pseudo-distribution µ. In particular, each entry of the moment tensor corresponds to the pseudo-expectation of a monomial of degree at most ℓ in x.
Definition 2.26 (Constrained pseudo-distributions). Let A = {p_1 ⩾ 0, . . . , p_r ⩾ 0} be a system of r polynomial inequality constraints of degree at most d in m variables. Let µ be a degree-ℓ pseudo-distribution over R^m. We say that µ satisfies A at degree ℓ ⩾ 1 if for every subset S ⊂ [r] and every sum-of-squares polynomial q such that deg(q · Π_{i∈S} p_i) ⩽ ℓ, we have Ẽ_µ[q · Π_{i∈S} p_i] ⩾ 0. Further, we say that µ approximately satisfies the system of constraints A if the above inequalities are satisfied up to an additive error that is exponentially small in n and scales with ∥q∥ · Π_{i∈S} ∥p_i∥, where ∥·∥ denotes the Euclidean norm of the coefficients of the polynomial, represented in the monomial basis.
Crucially, there is an efficient weak separation oracle for moment tensors of constrained pseudo-distributions. We give the unconstrained statement below; the constrained statement follows analogously.
This fact, together with the equivalence of weak separation and optimization [GLS81] forms the basis of the sum-of-squares algorithm, as it allows us to efficiently approximately optimize over pseudo-distributions.
Given a system of polynomial constraints, denoted by A, we say that it is explicitly bounded if it contains a constraint of the form {∥x∥^2 ⩽ 1}. Then, the following fact follows from Fact 2.27 and [GLS81]:

Theorem 2.28 (Efficient optimization over pseudo-distributions). There exists an (m + r)^{O(ℓ)}-time algorithm that, given any explicitly bounded and satisfiable system A of r polynomial constraints in m variables, outputs a degree-ℓ pseudo-distribution that satisfies A approximately, in the sense of Definition 2.26. 11

Remark 2.29 (Bit complexity and approximate satisfaction). We will eventually apply this result to a constraint system that can be defined with numbers with log(t) bits, where t is the sample complexity of the algorithm (scaling polynomially with m). Consequently, we can run this algorithm efficiently, and the errors incurred here (which are exponentially small in n) can be thought of as "machine precision" error, dominated by the sampling errors incurred elsewhere. We can therefore safely ignore precision issues in the rest of our proof.

10 A separation oracle for a convex set S ⊂ R^M is an algorithm that can decide whether a vector v ∈ R^M is in the set and, if not, provide a separating hyperplane between v and S. Roughly, a weak separation oracle is a separation oracle that allows for some ε slack in this decision.

11 Here, we assume that the bit complexity of the constraints in A is (m + t)^{O(1)}.
The pseudo-distribution D found will satisfy A only approximately. But provided the bit complexity of the sum-of-squares proof of A ⊢_{r′} B, i.e. the number of bits required to write down the proof, is bounded by m^{O(ℓ)} (assuming that all numbers in the input have bit complexity m^{O(1)}), we can compute to sufficiently good accuracy in polynomial time that soundness holds approximately. All of our sum-of-squares proofs will have this bit complexity.
We now state some standard facts for pseudo-distributions, which extend facts that hold for standard probability distributions. These can be found in the prior works listed above.
Fact 2.30 (Cauchy-Schwarz for pseudo-distributions). Let f, g be polynomials of degree at most d in the variables x ∈ R^m. Then, for any degree-d pseudo-distribution µ, Ẽ_µ[f g] ⩽ √(Ẽ_µ[f^2]) · √(Ẽ_µ[g^2]).

Fact 2.31 (Hölder's inequality for pseudo-distributions). Let f, g be polynomials of degree at most d in the variables x ∈ R^m. Fix t ∈ N. Then, for any degree-dt pseudo-distribution µ, Ẽ_µ[f^{t−1} g] ⩽ Ẽ_µ[f^t]^{(t−1)/t} · Ẽ_µ[g^t]^{1/t}.
In particular, when t is even, Ẽ_µ[f]^t ⩽ Ẽ_µ[f^t].

Sum-of-squares proofs. Up to minor technical details, our algorithm is to set up a polynomial constraint system and then call Theorem 2.28 to get a pseudo-distribution µ over the indeterminates λ′ = {λ′_1, . . . , λ′_m} which approximately satisfies the constraints. With this pseudo-distribution, we will output Ẽ_µ[λ′] as our estimated Hamiltonian coefficients. To show that these estimates are close to the true coefficients λ, we use that, under µ, the polynomial constraints hold: for a constraint p ⩾ 0, we have Ẽ_µ[p] ⩾ 0. If we can use these constraints to derive that Ẽ_µ[λ′] is close to λ, then the algorithm is correct. So, we provide such a proof, carried out in the sum-of-squares proof system.
Let f_1, f_2, . . . , f_r and g be multivariate polynomials in the indeterminates x ∈ R^m. Given the constraints {f_1 ⩾ 0, . . . , f_r ⩾ 0}, a sum-of-squares proof of the identity {g ⩾ 0} is a set of polynomials {p_S}_{S⊆[r]} such that g = ∑_{S⊆[r]} p_S^2 · Π_{i∈S} f_i. As its name suggests, the existence of such an SoS proof shows that if the constraints {f_i ⩾ 0 | i ∈ [r]} are satisfied, then the identity g ⩾ 0 is also satisfied. We say that this SoS proof has degree ℓ if for every set S ⊆ [r], the polynomial p_S^2 Π_{i∈S} f_i has degree at most ℓ. If there is a degree-ℓ SoS proof of {g ⩾ 0} from A, we write A ⊢_ℓ {g ⩾ 0}. Sum-of-squares proofs allow us to deduce properties of pseudo-distributions that satisfy some constraints.
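A toy instance of this definition (our own example, not from the paper): from the single constraint {x − 1 ⩾ 0} one can derive {x² + x − 2 ⩾ 0} via the degree-2 certificate x² + x − 2 = (x − 1)² + (√3)²·(x − 1), i.e. p_∅ = x − 1 and p_{1} = √3. The sketch below verifies the identity numerically:

```python
import numpy as np

# SoS certificate: x^2 + x - 2 = (x - 1)^2 + 3 * (x - 1),
# which proves g >= 0 from the constraint f = x - 1 >= 0.
def g(x):
    return x ** 2 + x - 2

def cert(x):
    return (x - 1) ** 2 + 3 * (x - 1)

xs = np.linspace(-5, 5, 101)
assert np.allclose(g(xs), cert(xs))            # the identity holds formally
assert min(g(x) for x in xs if x >= 1) >= -1e-12  # hence g >= 0 wherever x >= 1
```

Note the certificate says nothing about x < 1 (and indeed g(0) = −2 there); SoS proofs only certify nonnegativity on the feasible region of the constraints.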

Sum-of-squares proofs compose under standard inference rules: if A ⊢_ℓ {f ⩾ 0} and A ⊢_{ℓ′} {g ⩾ 0}, then A ⊢_{max(ℓ,ℓ′)} {f + g ⩾ 0} (the addition rule) and A ⊢_{ℓ+ℓ′} {f g ⩾ 0} (the multiplication rule). They are also sound for pseudo-distributions:

Fact 2.32 (Soundness). Let µ be a degree-ℓ pseudo-distribution. If µ is consistent with the set of degree-d_A polynomial constraints A, denoted µ ⊨_{d_A} A, and there is a degree-d_B sum-of-squares proof of a constraint B from A, then µ satisfies B at a correspondingly bounded degree.

We also have a converse to Fact 2.32: every property of low-degree pseudo-distributions can be derived by low-degree sum-of-squares proofs.
Fact 2.33 (Completeness). Let d ⩾ r ⩾ r′. Suppose A is a collection of polynomial constraints with degree at most r.

Basic sum-of-squares proofs. Now, we recall some basic facts about sum-of-squares proofs. First, any univariate polynomial inequality admits a sum-of-squares proof over the reals.
Second, if p ⩾ 0 and p is a quadratic, then this admits a sum-of-squares proof.
Fact 2.35 (Quadratic polynomial inequalities admit SoS proofs). Let p be a polynomial in the indeterminates x ∈ R^m such that p has degree 2 and p(x) ⩾ 0 for all x ∈ R^m. Then p is a sum-of-squares polynomial.

Proof. Let M be the unique (m + 1) × (m + 1) symmetric matrix such that, for v(x) = (1, x_1, . . . , x_m)^⊤, we have p(x) = v(x)^⊤ M v(x). Nonnegativity of p shows that M must be PSD, so we can write M = ∑_{i=1}^{m+1} u_i u_i^⊤ for some vectors u_i ∈ R^{m+1}. Thus, p(x) = ∑_{i=1}^{m+1} (u_i^⊤ v(x))^2, which is a degree-2 SoS polynomial, and we are done.
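The proof of Fact 2.35 is constructive, and the construction is easy to run. The sketch below (a minimal illustration with a quadratic of our own choosing) builds the matrix M from the coefficients of p, checks that it is PSD, and extracts the squares from an eigendecomposition:

```python
import numpy as np

# A nonnegative quadratic p(x) = c + b.x + x'Qx in m = 2 variables.
c = 1.0
b = np.array([-2.0, 0.0])
Q = np.array([[2.0, -1.0], [-1.0, 2.0]])

# p(x) = v(x)' M v(x) for v(x) = (1, x) and the symmetric matrix below.
M = np.block([[np.array([[c]]), b[None, :] / 2],
              [b[:, None] / 2, Q]])

evals, evecs = np.linalg.eigh(M)
assert evals.min() >= -1e-12                    # M is PSD, so p >= 0 everywhere
U = evecs * np.sqrt(np.clip(evals, 0, None))    # columns u_i with M = U U'

def p(x):       # the original quadratic
    return c + b @ x + x @ Q @ x

def p_sos(x):   # the extracted sum-of-squares representation
    v = np.concatenate(([1.0], x))
    return sum((U[:, i] @ v) ** 2 for i in range(U.shape[1]))

x = np.array([0.3, -1.7])
assert np.isclose(p(x), p_sos(x))
```

The same recipe underlies semidefinite-programming formulations of SoS: searching for a certificate is searching for a PSD Gram matrix M.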
We also use the following basic sum-of-squares proofs. For further details, we refer the reader to a recent monograph [FKP+19].
Rearranging yields the claim.

Translating between polynomials and nested commutators
In this section, we relate nested commutators of matrices to polynomials of their associated eigenvalues. We begin with the following basic observation mentioned in the technical overview.
Further, by linearity, the same identity extends term by term to any polynomial q(x) = ∑_{k=0}^d c_k x^k. In light of the above, we make the following definition associating a polynomial to an expression involving commutators of matrices.
Definition 3.2 (Univariate "commutator polynomials"). For a polynomial p(x) = a_0 + a_1 x + · · · + a_d x^d, given square matrices X, A of the same size, define p(X|A) = a_0 A + a_1 [X, A] + a_2 [X, [X, A]] + · · · + a_d [X, · · · , [X, A] · · · ], where the k-th term is the k-fold nested commutator of X with A. We will need a generalization of the above that associates bivariate polynomials p(x, y) to expressions with nested commutators involving two matrices X, Y. We will primarily be interested in the case where X, Y commute or are close to commuting. We begin by generalizing the nested commutator. For a monomial x^i y^j, we wish to associate it to a nested commutator [(X, Y)_S, A] where the number of 0's and 1's in S is i and j, respectively. There are many different such commutators, reflecting that X and Y need not commute. We will show that when X, Y are close to commuting, the nested commutators are also close, so the ordering in S does not matter. When |S| = 2, this follows from the identity below. We extend this to higher-order commutators.
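The eigenvalue picture behind this correspondence can be checked numerically. Assuming the natural reading of Definition 3.2, p(X|A) = a_0 A + a_1 [X, A] + a_2 [X, [X, A]] + · · ·, the k-fold nested commutator acts entrywise in the eigenbasis of X as multiplication by (x_i − x_j)^k, so p(X|A) acts as multiplication by p(x_i − x_j). A sketch (our own, with random test matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
X = np.diag(rng.standard_normal(n))   # WLOG work in the eigenbasis of X
A = rng.standard_normal((n, n))

def comm(P, Q):
    return P @ Q - Q @ P

def comm_poly(coeffs, X, A):
    """p(X|A) = a_0 A + a_1 [X, A] + a_2 [X, [X, A]] + ... for coeffs = (a_k)."""
    out = np.zeros_like(A)
    term = A.copy()
    for a in coeffs:
        out = out + a * term
        term = comm(X, term)      # next k-fold nested commutator
    return out

# In the eigenbasis of X, ([X, A])_{ij} = (x_i - x_j) A_{ij}, so
# (p(X|A))_{ij} = p(x_i - x_j) * A_{ij}.
coeffs = [1.0, -0.5, 0.25, 2.0]   # p(t) = 1 - 0.5 t + 0.25 t^2 + 2 t^3
x = np.diag(X)
gaps = x[:, None] - x[None, :]
expected = np.polyval(coeffs[::-1], gaps) * A
assert np.allclose(comm_poly(coeffs, X, A), expected)
```

This "eigenvalue-gap" action is exactly what lets scalar polynomial identities be transferred to nested-commutator identities later in this section.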
Proof. Consider when S, S ′ differ exactly by a single swap of two adjacent elements. In this case, by Fact 3.4, the difference on the LHS is equal to exactly one term of the form where S i is the prefix up to the point where S, S ′ differ and T i is the suffix. Now we can repeatedly apply this to swap adjacent elements of S until it matches S ′ . Each of the residual terms is of the form given on the RHS so we are done.
Now we define the correspondence between bivariate polynomials and bivariate nested commutators by arbitrarily choosing an order for the S associated to every monomial.
Definition 3.6 (Bivariate "commutator polynomials"). Given a polynomial of degree d in two variables x, y, p(x, y) = ∑ i+j⩽d a ij x i y j , we define its associated matrix commutator polynomial for matrices X, Y, A ∈ C n×n as The key property that we will need is that certain polynomial identities (in the original bivariate polynomials) are essentially preserved under this translation. We begin by showing this for monomials. The following fact shows a relation between a commutator polynomial on A to a commutator polynomial on B.
Fact 3.7. For any Hermitian matrix X and matrices A, B, ρ, we have the identity Proof.
This can be extended to general monomials.
Lemma 3.8 (Commutator monomial equivalences). Let p(x, y) = x^{i_1} y^{i_2}, q(x, y) = x^{j_1} y^{j_2}, and r(x, y) = p(x, y)q(x, y). Let d = deg(r). Then for some ℓ ⩽ d^2, we can write the difference as a sum of ℓ error terms, where every Z_i is one of the following three types of errors:

Proof. Our goal is to express the difference between the two sides as a sum of errors. Observe that Fact 3.7 allows us to remove one copy of X or Y from the commutator in front of A and move it onto the commutator in front of B, at the cost of an error term of type 1 or 2. Thus, we can repeatedly apply Fact 3.7 to move all of the X's and Y's in the commutator in front of A onto the commutator in front of B, writing the difference as a sum of i_1 terms of type 1 and i_2 terms of type 2. Next, we can apply Lemma 3.5 to reorder the sequence of X and Y on the commutator in front of B at the cost of error terms of type 3; this contributes a sum of i_2(j_1 + i_1) terms of type 3. Together, this gives the desired bound.
Theorem 3.9 (Polynomial identities to nested commutator identities). Consider a formal polynomial identity in two variables p_1(x, y)q_1(x, y) + · · · + p_k(x, y)q_k(x, y) = 0, where each of the polynomials p_i, q_i is (d, C)-bounded. Let X, Y ∈ C^{N×N} be Hermitian matrices and ρ, A, B ∈ C^{N×N} be arbitrary matrices. Then we can write tr((p_1(X, Y|A)q_1(X, Y|B)^† + · · · + p_k(X, Y|A)q_k(X, Y|B)^†)ρ) as a linear combination ∑_{i=1}^t c_i tr(Z_i ρ), where t ⩽ 4kd^6, the coefficients c_i satisfy |c_i| ⩽ C^2 2^{2d}, and every Z_i is one of the three types of errors from Lemma 3.8.

Proof. Let r_ℓ(x, y) = p_ℓ(x, y)q_ℓ(x, y). By the assumed formal polynomial identity, tr((A r_1(X, Y|B)^† + · · · + A r_k(X, Y|B)^†)ρ) = 0.
So, it suffices to express each product tr(p_ℓ(X, Y|A)q_ℓ(X, Y|B)^† ρ) − tr(A r_ℓ(X, Y|B)^† ρ) as a linear combination of errors. Let p_ℓ = ∑_{i_1,i_2} a_{ℓ,i_1,i_2} x^{i_1} y^{i_2} and q_ℓ = ∑_{j_1,j_2} b_{ℓ,j_1,j_2} x^{j_1} y^{j_2}. We can expand the above expression into its commutator monomials. Lemma 3.8 shows how to expand each summand into error terms; note that the degree of the product is i_1 + j_1 + i_2 + j_2 ⩽ 2d. There are at most (2d)^2 error terms per summand, and there are d^4 summands per product p_ℓ q_ℓ. This gives a total bound of 4kd^6 on the number of error terms, as desired. All that is left is to bound the size of the coefficients. For an error term of any type associated to a particular summand, the coefficient is a_{ℓ,i_1,i_2} b_{ℓ,j_1,j_2}. Let S, T be the sequences associated to the error term. Because p_ℓ and q_ℓ are (d, C)-bounded, we can bound |a_{ℓ,i_1,i_2} b_{ℓ,j_1,j_2}| ⩽ C^2/((i_1 + i_2)!(j_1 + j_2)!) ⩽ C^2 2^{2d}/(i_1 + i_2 + j_1 + j_2)! ⩽ C^2 2^{2d}/(len(S) + len(T))!.
The last inequality holds because, in all error types, len(S) + len(T) is at most the degree of the product.

Polynomial approximations of the exponential
Our goal for this section is to construct a polynomial approximation to e x that satisfies the properties necessary for our algorithm. In particular, we need a polynomial that approximates e x on an interval [−κ, κ], but with the additional property that the polynomial does not grow too quickly outside the area of approximation. The canonical polynomial approximation of e x is its Taylor series truncation.

Definition 4.2. Let s_ℓ(x) = ∑_{k=0}^{ℓ} x^k/k! denote the degree-ℓ truncation of the Taylor series of e^x.

Remark 4.3. The Taylor series truncation s_ℓ(x) does not satisfy our desired approximation properties from Definition 4.1, even for large ℓ. This is because the truncation blows up too quickly on the negative tail: |s_ℓ(−ℓ)| ≂ e^ℓ > e^{ηℓ}. The same issue occurs with conventional approximations of the exponential, like truncations of the Chebyshev series and QSVT-style "bounded" approximations of the Chebyshev series [GSLW19; TT23]: these truncations at degree ℓ fail in the region around −ℓ (so, notably, increasing the degree does not improve the flatness parameter of these approximations).
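The blow-up described in Remark 4.3 is easy to see numerically; the sketch below (an illustration with ℓ = 20, not from the paper) shows that the truncation is excellent near the center of the expansion yet enormous at −ℓ, where e^x itself is tiny:

```python
import math

def taylor_trunc(l, x):
    """s_l(x): degree-l truncation of the Taylor series of e^x."""
    return sum(x ** k / math.factorial(k) for k in range(l + 1))

l = 20
# Near the center the truncation is excellent...
assert abs(taylor_trunc(l, 1.0) - math.e) < 1e-12
# ...but on the negative tail it blows up: |s_l(-l)| is huge even though
# e^{-l} is tiny, so s_l is far from "flat" outside the approximation window.
assert abs(taylor_trunc(l, -float(l))) > 1e5
assert math.exp(-l) < 1e-8
```

This is the failure mode that the flat polynomial approximation constructed below is designed to avoid.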
While standard techniques for approximating the exponential do not suffice, we construct a modified polynomial that is a flat approximation to the exponential.
Theorem 4.6 (Monotonicity of the flat approximations). For any positive integers k, ℓ, we have the following: Moreover, the LHS is a (10 2 k ℓ , 2 k ℓ + 10, 200 2 k ℓ )-bounded sum-of-squares polynomial in x and y.
Remark 4.7. Note that Theorem 4.6 stems from an intuitive monotonicity property of the flat approximations. However, this property does not hold exactly, so the additional terms in Theorem 4.6 are there to account for this. The fact that the difference is not only non-negative, but also a sum of squares, will be crucial later in the analysis of our algorithm.
We will prove Theorem 4.5 in this section. The proof of Theorem 4.6 is deferred to the appendix, as it is quite long and computational. We begin by establishing some basic facts about the truncated Taylor series and our polynomials p and q. The following fact follows from Taylor's theorem, which implies that for every x ∈ R there exists some c ∈ [0, 1] such that e^x − s_ℓ(x) = e^{cx} x^{ℓ+1}/(ℓ + 1)!.
Fact 4.8 (Bounds on truncated Taylor series of e x ). For x ⩾ 0, e x ⩾ s ℓ (x) for all ℓ. For x < 0, we have s 2ℓ−1 (x) ⩽ e x ⩽ s 2ℓ (x) for all ℓ.
Proof. This statement also follows from Taylor's theorem (9). From this, for all x, the approximation error of the Taylor series truncation is |e^x − s_ℓ(x)| ⩽ e^{|x|} |x|^{ℓ+1}/(ℓ + 1)!. Taking ℓ ⩾ e^2 |x| + log(e^{|x|}/ε) makes the above expression bounded by ε. 13 The final bound comes from taking |x| ⩽ b.
Proof. Note that the coefficients of the monomials in p_{k,ℓ} are all positive and are all at most the corresponding coefficients in the series obtained by replacing each truncated sum with the full infinite sum; the latter is exactly 1 + x + x^2/2! + · · · . Thus, for any degree d, the coefficient of x^d in p_{k,ℓ}(x) is at most 1/d!, so p_{k,ℓ} is ((2^{k+1} − 1)ℓ, 1)-bounded, since its degree is (2^{k+1} − 1)ℓ. Recall that q_{k,ℓ} is obtained by integrating p_{k,ℓ}, and thus we immediately get the same bound on its coefficients, i.e., it is ((2^{k+1} − 1)ℓ + 1, 1)-bounded, and we are done. Now we move on to the proof of Theorem 4.5.
Proof of Theorem 4.5. We prove the statement for p_{k,ℓ} first. By Lemma 4.10 (with ε ← ε/(k 2^{2κ})), we have the stated accuracy for all x ∈ [−κ, κ], and we immediately get the same guarantee for q_{k,ℓ}(x) by integrating. Now we prove the second condition for p_{k,ℓ}. For x ⩾ 0, we simply incur the e^x upper bound of Fact 4.8. For x ⩽ 0, let j_0 be the smallest positive integer such that −x < 2^{j_0+1}kℓ; we bound the factors with 1 ⩽ j < j_0, with j > j_0 + 2 (using Fact 4.8), and with j_0 ⩽ j ⩽ j_0 + 2 separately. Combining the above, we conclude Equation (12), where the factor of 8 arises because we need to apply (11) to at most 3 terms. Note that there must be some k_0 ∈ {j_0, j_0 + 1, j_0 + 2} such that the relevant inequality holds exactly when j ⩽ k_0. Then, we can upper bound Equation (12), and substituting in Equation (13) gives the claimed bound. The desired bound for q then follows as well from the definition, as q is obtained by simply integrating p.

Accessing Gibbs states: the quantum piece
We now describe how our algorithm uses its copies of ρ. This is the only location in the algorithm where we access ρ; the rest of the algorithm is entirely classical. By making measurements on copies of ρ, we can estimate expectations of observables tr(Xρ) for various choices of the matrix X. These estimates will then be used in the learning algorithm.
Lemma 5.1. Let X = UDU † ∈ C N×N be a unitarily diagonalizable matrix, and suppose we are given copies of a quantum state with density matrix ρ ∈ C N×N . Then we can estimate tr(Xρ) to ε∥X∥ error with probability ⩾ 1 − δ using O( 1 ε 2 log 1 δ ) copies of ρ. The running time is the number of samples times the cost of applying U † to a quantum state and measuring in the computational basis.
Proof. Consider taking ρ and applying U † to form U † ρU, then measuring in the computational basis. We see the outcome i with probability ⟨u i | ρ |u i ⟩, where u i is the ith column of U. Let z be the random variable attained by performing this measurement and taking z = D i,i ; then E[z] = ∑ i D i,i ⟨u i | ρ |u i ⟩ = tr(Xρ), and z is always bounded by max i |D i,i | = ∥X∥. By a Chernoff bound we conclude that averaging O( 1 ε 2 log 1 δ ) independent copies of this estimator gives the desired error bound.
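The estimator in the proof of Lemma 5.1 can be simulated classically for small systems. The sketch below (our own toy instance: a Bell state and the observable X⊗Z) rotates into the eigenbasis of the observable, samples measurement outcomes, and averages the eigenvalues seen:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_expectation(obs, rho, shots):
    """Estimate tr(obs * rho) as in Lemma 5.1: rotate into the eigenbasis of
    obs, measure in the computational basis, and average the eigenvalues."""
    evals, U = np.linalg.eigh(obs)
    probs = np.real(np.diag(U.conj().T @ rho @ U))   # <u_i| rho |u_i>
    probs = np.clip(probs, 0, None)
    probs /= probs.sum()
    outcomes = rng.choice(len(evals), size=shots, p=probs)
    return evals[outcomes].mean()

# Toy instance: the Bell state and the two-qubit observable X (x) Z.
Z = np.diag([1.0, -1.0])
X1 = np.array([[0.0, 1.0], [1.0, 0.0]])
obs = np.kron(X1, Z)
psi = np.zeros(4)
psi[0] = psi[3] = 1 / np.sqrt(2)
rho = np.outer(psi, psi)

est = estimate_expectation(obs, rho, shots=20000)
exact = np.trace(obs @ rho).real
assert abs(est - exact) < 0.05   # O(1/sqrt(shots)) statistical error
```

Since every outcome is an eigenvalue bounded by ∥obs∥ = 1, the Chernoff bound in the proof gives exactly this 1/√shots error scaling.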
We wish to estimate tr(Xρ) for many different Pauli observables X, so we could simply run Lemma 5.1 for each one. However, we can use a classical shadows-like procedure [HKP20] to estimate them all at once.

Lemma 5.2 (Computing expectations of observables of Gibbs states).
Let H = ∑_{a=1}^m λ_a E_a be a K-local Hamiltonian on n qubits whose dual interaction graph G has maximum degree d. Consider the set of ℓ-G-local Paulis, P_ℓ. There is a quantum algorithm that, with probability ⩾ 1 − δ, outputs estimates of tr(P_1P_2P_3ρ) to ε error for all P_1, P_2, P_3 ∈ P_ℓ. This algorithm uses t = O((4^{3ℓ}/ε^2) log(|P_ℓ|^3/δ)) samples, O(nt) quantum gates, and poly(t, n, |P_ℓ|) classical post-processing time.
Proof. Consider the following procedure: take as input a copy of ρ, rotate into the eigenbasis of a Pauli matrix P ∈ P chosen uniformly at random, and then measure in the computational basis to get a measurement outcome b ∈ {0, 1}^n. Take t = O((4^{3ℓ}/ε^2) log(|P_ℓ|^3/δ)) of these samples (P, b) and denote the collection S. Now, consider estimating some tr(Xρ) where X = UDU^† ∈ P and |supp(X)| ⩽ 3ℓ; P_1P_2P_3 is such an X, up to a root of unity. Since tr(Xρ) only depends on the reduced state tr_{[n]∖supp(X)}(ρ), it suffices to consider ρ on the support of X. Let T ⊂ S be the collection of samples whose Pauli matrix agrees with X on the support of X. Then we can use this subsample to run Lemma 5.1 and estimate tr(Xρ), as all that is needed is measurements in a basis where X is diagonal. Since the probability that any given sample is in a basis where X is diagonal is ⩾ 4^{−3ℓ}, with t samples we can guarantee that we get an estimate of tr(Xρ) to ε error with failure probability δ/|P_ℓ|^3. By a union bound, we can then conclude that with our set of samples we can get ε-good estimates of every tr(P_1P_2P_3ρ) with probability ⩾ 1 − δ.
The running time can be seen to be poly(t, n, |P_ℓ|). Note that, for any product of Paulis P ∈ P, we can apply the quantum gates rotating into its eigenbasis with O(n) gates, so the quantum gate complexity is O(nt) = O((4^{3ℓ}n/ε^2) log(|P_ℓ|^3/δ)).

Algorithm and analysis
Now we are ready to present our learning algorithm. We solve the following Hamiltonian learning problem:

Problem 1 (Hamiltonian learning). Recalling the set-up described in Definitions 2.7 and 2.8, let H = ∑_{a=1}^m λ_a E_a ∈ C^{N×N} be a K-local Hamiltonian on n qubits whose terms E_a are known, distinct, non-identity Pauli operators, and whose coefficients λ_a ∈ R satisfy |λ_a| ⩽ 1. Let d denote the maximum degree of a vertex in the dual interaction graph associated to the Hamiltonian. Given ε, δ > 0, along with copies of the Gibbs state ρ = exp(−βH)/tr(exp(−βH)) corresponding to H at a known inverse temperature β > 0, find estimates λ̂_a such that, with probability ⩾ 1 − δ, (λ̂_a − λ_a)^2 ⩽ ε^2 for all a ∈ [m].

Theorem 6.1 (Efficiently learning a quantum Hamiltonian). Let H = ∑_{a=1}^m λ_a E_a ∈ C^{N×N} be a K-local Hamiltonian on n qubits whose dual interaction graph has degree d (as in Problem 1). Suppose we are given the terms {E_a}_{a∈[m]}, ε > 0, δ > 0, and copies of the Gibbs state ρ at a known inverse temperature β > 0. Then there exists an algorithm that can output estimates λ̂_a such that, with probability ⩾ 1 − δ, (λ̂_a − λ_a)^2 ⩽ ε^2 for all a ∈ [m]. This algorithm uses O((m^6/ε^{e^{f(K,d)β}}) log(m/δ) + (f(K,d)/(β^2 ε^2)) log(m/δ)) copies of the Gibbs state and running time poly(m, log(1/δ)) · (1/ε)^{e^{f(K,d)β}}, where f(K, d) is a positive function depending only on K and d.
Specifically, we prove that, when β > g(K, d) for some positive g, there exists an algorithm that succeeds with sample complexity (m^6/ε^{e^{f(K,d)β}}) log(m/δ) and time complexity poly(m, log(1/δ)) · (1/ε)^{e^{f(K,d)β}} for some positive f(K, d). We get Theorem 6.1 by taking g to be the threshold at which the high-temperature algorithm [HKT22] stops working and then combining our complexity bounds with those of the high-temperature algorithm. That β is lower bounded by a constant is just a simplification; the same algorithm and analysis should work at any temperature after appropriate adjustments.
We are concerned with the setting where K and d are constant (that is, when the Hamiltonian is low-intersection), so we do not try to optimize dependence on these parameters. Further, we assume that n = O(m); this is without loss of generality, as m ⩾ n/K is necessary for the support of the Hamiltonian to be all n qubits.
Our main theorem follows from showing that a low-degree sum-of-squares relaxation of a carefully chosen polynomial system can be rounded to obtain accurate estimates of the true coefficients {λ_a}_{a∈[m]}. We first set up the system, and then build up to the full algorithm (Algorithm 6.2).

Presenting the polynomial system. We begin by setting up the precise polynomial system that we will then solve. Let ε_0 = ε/(10^{C_{K,d}β} m^3) for some sufficiently large constant C_{K,d} depending only on K and d. Next, let ℓ_0 = 2^{C_{K,d}β} log(1/ε) and ℓ_1 = 4K, and define A = P_{4^{C_{K,d}β} ℓ_0} and B = P_{ℓ_1} (as in Definition 2.9).
Let λ′ = [λ′_1, λ′_2, . . . , λ′_m] be a set of indeterminates. We will solve a polynomial system in these indeterminates, and then use it to extract estimates of the true coefficients. In the system, and throughout, we denote H′ = ∑_a λ′_a E_a. This object H′ is merely a notational convenience: we do not optimize with this exponentially sized object. We only ever need to work with traces of H′, so our polynomial system admits a succinct (polynomially sized) representation.
We begin by making measurements of ρ to construct ε_0-accurate estimates of tr(A_1A_2A_3ρ) for all A_1, A_2, A_3 ∈ A using Lemma 5.2. Let tr denote the map from A_1A_2A_3ρ to our estimate of tr(A_1A_2A_3ρ), extended by linearity to linear combinations of such matrices that we have estimated. 14 Again, this is a notational convenience for our polynomial system. By Corollary 2.20, |A| ⩽ m(10d)^{4^{C_{K,d}β}ℓ_0} ⩽ m(1/ε)^{10^{C_{K,d}β}}, so we can produce these estimates in running time poly_{β,K,d}(m, 1/ε) and corresponding sample complexity. This is the sample complexity of the entire algorithm, as this is all of the information we need from ρ. Then, we write the following polynomial system, where q is the polynomial from Definition 4.4 extended to commutators using Definition 3.6. With this set of parameters, Theorem 4.5 guarantees the approximation properties we need of q.

Representing the polynomial system. It is important to understand how the above can be represented as a polynomial system in the λ′ with real coefficients and polynomial (in m) size.
The key is that we can write the expressions inside the traces in the form (∑_α (λ′)^α M_α)ρ, where α ranges over low-degree monomials and the M_α are rescaled Pauli matrices, which we can explicitly and succinctly represent. Then, by linearity of tr, we can write the system by plugging in our estimates for tr(M_α ρ). For example, for the first constraint, we can plug in our estimates tr(A_1A_2E_aρ) and tr(E_aA_1A_2ρ) obtained from Lemma 5.2. These trace estimates may be complex-valued, so that tr(A_1A_2E_aρ) − tr(E_aA_1A_2ρ) = ψ_a + iχ_a for some ψ_a, χ_a ∈ R. We can rewrite this, revealing that we have a polynomial constraint in λ′ with real coefficients. For the second constraint, the nested commutators that appear have length ℓ ⩽ 2 · 4^{C_{K,d}β} log(1/ε) + 1, so by Lemma 2.10, such a commutator is an element of A. The coefficient associated with this nested commutator is (−β)^ℓ λ′_{a_ℓ} · · · λ′_{a_1}, so that for some polynomials p_A(λ′) we can write the constraint in terms of traces tr(Aρ) with A ∈ A. Since we can plug in our estimates with tr as before, this expression is well-defined. We can further conclude that the associated constraint can be written as a polynomial constraint in λ′ with real coefficients. Recall that |A| ⩽ m(1/ε)^{10^{C_{K,d}β}}. Furthermore, by Lemma 2.19, there are at most m(10d)^{4^{C_{K,d}β}ℓ_0} ⩽ m(1/ε)^{10^{C_{K,d}β}} distinct monomials in the λ′ that appear in all of the p_A, and thus this entire representation has size poly(m, 1/ε).

Algorithm 6.2 (Learning a Hamiltonian from Gibbs states, Theorem 6.1).

Input: Description of K-local Hamiltonian terms {E_a}_{a∈[m]} with dual interaction graph degree d; accuracy and failure probability parameters ε, δ ∈ (0, 1); inverse temperature β > 0; (m^6/ε^{e^{f(K,d)β}}) log(m/δ) copies of the Gibbs state ρ = exp(−βH)/tr(exp(−βH)) for an unknown Hamiltonian H = ∑_a λ_a E_a.

Operation: where C_{K,d} is a sufficiently large constant depending only on K, d;
4. For all A_1, A_2, A_3 ∈ A, compute estimates of tr(A_1A_2A_3ρ) to ε_0 error using Lemma 5.2;
5. Consider the following constraint system:

Output: λ̂ such that, with probability at least 1 − δ, |λ̂_a − λ_a|^2 ⩽ ε^2 for all a ∈ [m].

Analyzing the polynomial system. We have now described the bulk of the algorithm (Algorithm 6.2). To obtain Theorem 6.1, we prove the following intermediate lemmas. In light of Lemma 5.2, we may assume that all of our estimates are accurate to within ε_0 with high probability.

Assumption 6.3 (Measuring Gibbs states). For all
The first lemma states that when our estimates are accurate, the true coefficients, i.e. λ′ = λ, form a feasible solution to the system.

Lemma 6.4 (Feasibility of the constraint system). If Assumption 6.3 holds, then the constraints in the system C_{λ′} in Equation (16) are satisfied when λ′ = λ (so that H′ = H).
Next, we need to prove soundness. The proof of soundness is broken into two key steps. First, we prove that for any solution to the system, H′ = ∑_a λ′_a E_a must nearly commute with H.
Remark 6.5 (On sum-of-squares proof degree). All of our sum-of-squares proofs will be of degree 2^{f(K,d)β} log(1/ε). For the sake of brevity, we omit the precise degrees from all subsequent statements.
Lemma 6.6 (SoS finds an almost-commuting state). Assume that Assumption 6.3 holds. Let H′ = ∑_a λ′_a E_a and write i[H, H′] = ∑_b γ_b F_b for some 2K-local terms F_b, where each γ_b is a linear expression in the λ′_a (such a representation exists by Lemma 2.10).

Using the above lemma, we can derive identifiability of each parameter in the sum-of-squares proof system:

Lemma 6.7 (A sum-of-squares proof of identifiability). Suppose Assumption 6.3 holds. Given the constraint system C_{λ′}, for any term a* ∈ [m],

Combining the above lemmas, we can almost prove Theorem 6.1. We first show how to obtain the same learning guarantee with the same sample complexity but a slightly worse running time. Later, in Section 10, we show how to improve this runtime and complete the proof of the stronger Theorem 6.1.
Theorem 6.8 (Weaker version of Theorem 6.1). Let H = ∑_{a=1}^m λ_a E_a ∈ C^{N×N} be a K-local Hamiltonian on n qubits whose dual interaction graph has degree d (as in Problem 1). Suppose we are given the terms {E_a}_{a∈[m]}, ε > 0, δ > 0, and copies of the Gibbs state ρ at a known inverse temperature β > (10(d + 1))^{−10}. Then there exists an algorithm that, given O((m^6/ε^{e^{f(K,d)β}}) log(m/δ)) copies of ρ, outputs estimates λ̂_a such that, with probability at least 1 − δ, |λ̂_a − λ_a| ⩽ ε for all a ∈ [m], where f(K, d) is a positive function depending only on K and d.
Proof of Theorem 6.8. Given O((m^6/ε^{e^{f(K,d)β}}) log(m/δ)) copies of the Gibbs state ρ, it follows from Lemma 5.2 that with probability at least 1 − δ, Assumption 6.3 holds. Conditioning on this event, it follows from Lemma 6.4 that the constraint system C_{λ′} is feasible. Given a degree t = Ω(2^{C_{K,d}β} log(1/ε)) pseudo-distribution μ, where C_{K,d} is a fixed universal constant that depends only on K, d, it follows from Lemma 6.7 (after redefining ε appropriately as (ε/2^{C_{K,d}β})^2) that the desired guarantee holds for all a ∈ [m]. The running time is dominated by computing a degree-t pseudo-distribution over m indeterminates and |A| + |B| ⩽ m^{O(1)}(1/ε)^{exp(O(C_{K,d}β))} many constraints. It follows from Theorem 2.28 that the overall running time is at most (m/ε)^{log(1/ε) exp(O(C_{K,d}β))}, which concludes the proof.
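For intuition about the object being computed, here is a hedged toy (not the paper's solver): any genuine distribution over λ′ ∈ [−1, 1]^m induces a PSD moment matrix, and a degree-2 pseudo-expectation is precisely required to satisfy the same PSD constraint, possibly without any underlying distribution:

```python
import numpy as np

def degree2_moment_matrix(samples):
    """Build the degree-2 moment matrix E[(1, x)(1, x)^T] from samples of a
    true distribution; any genuine distribution yields a PSD moment matrix,
    which is the defining constraint a pseudo-expectation must also satisfy."""
    xs = np.asarray(samples)                        # shape (num_samples, n)
    lifted = np.hstack([np.ones((xs.shape[0], 1)), xs])
    return lifted.T @ lifted / xs.shape[0]

rng = np.random.default_rng(4)
samples = rng.uniform(-1, 1, size=(500, 3))         # lambda' in [-1, 1]^3
M = degree2_moment_matrix(samples)
eigs = np.linalg.eigvalsh(M)                        # all nonnegative
```

An SoS solver searches for such a matrix (at higher degree, and subject to the polynomial constraints) directly, without sampling.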

A proof of feasibility (Lemma 6.4)
Our goal in this section is to prove Lemma 6.4. Since we are working with the ground truth Hamiltonian H, we can relax our requirements: the proofs in this section need not be carried out within the sum-of-squares proof system.
Next, consider tr(B_2 q(−βH|B_1)ρ), where q is a good flat approximation to the exponential e^x (recall Definition 4.1). The intuition for why this is approximately tr(B_1 B_2 ρ) is that if, WLOG, we work in the diagonal basis of H and let its eigenvalues be σ_i, then the two traces agree up to the error of the approximation, where the key step uses that q is a good approximation to e^x. In the following lemma, we make this intuition precise, showing that when q is a sufficiently good flat exponential approximation, the above derivation indeed holds up to a small error.
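As a scalar sanity check (using a plain Taylor truncation as a stand-in for the paper's flat approximation q; the degree heuristic below is an assumption, not the paper's bound), one can verify numerically that a sufficiently high degree makes the approximation error on [−κ, κ] negligible, while a low degree fails badly:

```python
import math

def taylor_exp(x, degree):
    """Degree-`degree` Taylor truncation of e^x about 0 (a crude stand-in
    for the paper's flat polynomial approximation q)."""
    term, total = 1.0, 1.0
    for k in range(1, degree + 1):
        term *= x / k
        total += term
    return total

def max_error_on_interval(kappa, degree, grid=2001):
    """Max |q(x) - e^x| over an evenly spaced grid on [-kappa, kappa]."""
    worst = 0.0
    for i in range(grid):
        x = -kappa + 2 * kappa * i / (grid - 1)
        worst = max(worst, abs(taylor_exp(x, degree) - math.exp(x)))
    return worst

kappa = 4.0
err_lo = max_error_on_interval(kappa, degree=5)    # degree too low: large error
err_hi = max_error_on_interval(kappa, degree=40)   # comfortably high degree
```

The paper's construction additionally needs the approximation to be "flat" (controlled growth off the interval), which a raw truncation does not provide; this snippet only illustrates the on-interval accuracy.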
Lemma 7.2 (Feasibility of the polynomial approximation). Let A, B ∈ P_K, let H be a K-local Hamiltonian with interaction degree d, let β be an inverse temperature, and let q be a polynomial that is a (κ, η, ε)-flat exponential approximation, where
• κ ⩾ Cβ log(1/ε) for a sufficiently large constant C depending only on K, d;
• η < c/β for a sufficiently small constant c depending only on K, d.
Then |tr(Bq(−βH|A)ρ) − tr(ABρ)| ⩽ 20ε∥A∥∥B∥.
Proof. We work in the basis where H is diagonal, i.e., H_{ii} = σ_i.
where the first equality follows from cyclicity of trace, the second uses the definition of ρ, the third uses Definition 3.2, and the fourth groups relevant terms together. Now, we use the approximation properties we have proved about q to bound the above quantity. Let L = κ/(3β) and let S_a = {i ∈ [N] : σ_i ∈ [aL, (a + 1)L)}. We bound the resulting sum by breaking into cases depending on the value of the offset α. For −2 ⩽ α ⩽ 2, since 3βL ⩽ κ, the bound follows from Definition 4.1, where A_{S_a,S_b} denotes the submatrix of A (when written in the eigenbasis of H) indexed by i ∈ S_a, j ∈ S_b, and B_{S_a,S_b} is defined similarly. Next, for α ⩾ 3, we have σ_j > σ_i whenever j ∈ S_{a+α} and i ∈ S_a, which bounds the sum ∑_{i∈S_a} ∑_{j∈S_{a+α}}; the case α ⩽ −3 is symmetric, using σ_j < σ_i. Finally, by Lemma 2.16, for some absolute constant c depending only on K, d, the off-diagonal blocks decay exponentially. Since η is sufficiently small, we can ensure (|α| + 1)ηβ ⩽ c(|α| − 1) whenever |α| ⩾ 3. Thus, combining (21), (22), and (23) with the above, we can bound the total contribution by a geometric sum of terms of the form (1 + e^{(|α|+1)ηβL}) e^{−2c(|α|−1)L}, where in the last step we used that L = κ/(3β) ⩾ C log(1/ε) and C is a sufficiently large constant (in terms of K, d). Plugging this back into the relation at the beginning, we conclude |tr(Bq(−βH|A)ρ) − tr(ABρ)| ⩽ 20ε∥A∥∥B∥, as desired.
We need one more intermediate lemma relating the sampling error in Assumption 6.3 to the error in the actual constraints of the SoS program.

Lemma 7.3. Suppose that Assumption 6.3 holds. Then for all B_1, B_2 ∈ B, the estimated traces are close to the true ones.

Proof. Clearly, Assumption 6.3 implies that the estimate of tr(B_1 B_2 ρ) is within 0.1ε of tr(B_1 B_2 ρ) for all B_1, B_2. Next, by Lemma 2.18 and Claim 4.11, we can write q_{C_{K,d}β,ℓ_0}(−βH|B_1) as a linear combination of the form ∑_{A∈A} c_A A, where A ⊂ P_{2^{C_{K,d}β+1} ℓ_0 K} and the coefficients satisfy ∑_{A∈A} |c_A| ⩽ ((1 + β)d)^{2^{C_{K,d}β+5} ℓ_0}. Thus, conditioned on the event that Assumption 6.3 holds, the estimate of tr(B_2 q_{C_{K,d}β,ℓ_0}(−βH|B_1)ρ) is within 0.1ε of the true value.

Now we can complete the proof of Lemma 6.4.
Proof of Lemma 6.4. The first four constraints are satisfied by Lemma 7.1. It remains to prove that the last two constraints are satisfied. By Theorem 4.5, and as stated in (18), the polynomial q_{C_{K,d}β,ℓ_0} is a flat exponential approximation with parameters (0.001 · 2^{C_{K,d}β} log(1/ε), 5^{−C_{K,d}β}, 0.001ε).

A sum-of-squares proof that the commutator is small (Lemma 6.6)
In this section, we prove Lemma 6.6. We will only use a subset of the constraints, namely the commutator constraints involving tr(A(H′ρ − ρH′)). The proof involves two parts. First, we show that the constraints imply that tr(i[H, H′](H′ρ − ρH′)) is small.

Proof. By Lemma 2.10, we can write [H, H′] as a linear combination of terms in P_2K. The total number of terms F ∈ P_2K is at most m(10^K d)^2 (recall Corollary 2.20). We have the constraint −ε_0 ⩽ tr(F(H′ρ − ρH′)) ⩽ ε_0 for each F ∈ P_2K, and thus we can sum over the F's, where the first inequality follows from the constraint that −1 ⩽ λ′_i ⩽ 1 and the fact that each f_a has at most d non-zeros, and the second inequality follows from there being at most m(10^K d)^2 terms F ∈ P_2K, with each corresponding trace at most ε_0 due to the constraints. Similarly, we can derive the matching lower bound. Next, again expanding H and H′ in the Pauli basis, we obtain coefficients g_{F_1,F_2}(λ′) that are quadratic functions of the λ′, each with at most d nonzero monomials and coefficients between ±2. Using Assumption 6.3 and the constraint that −1 ⩽ λ′_i ⩽ 1, the first inequality follows from (26), (27), and Assumption 6.3, and the second follows from (24), the fact that g has coefficients in [−2, 2], and the fact that there are at most m²(10^K d)^4 terms F_1, F_2 ∈ P_2K. Similarly, using the lower bound estimate in (25), we obtain the matching bound, which completes the proof.
Next, we move on to the second part of the proof. First, we prove the following inequality relating the two commutator quantities. We note that we consider the expression i[H′, H] to ensure that the matrix is Hermitian and the coefficients of the λ′_i are real.

Lemma 8.2 (Lower bounding the commutator at arbitrary temperature). Given 0 < β,

Proof. Consider the basis where H is diagonal and let its eigenvalues be σ_i. Let Z = tr(e^{−βH}). We expand both traces in this basis. We then observe that, since e^x ⩾ 1 + x, we have e^{−x} ⩽ 1/(1 + x) for x > −1. Applying this inequality with x = σ_j − σ_i, which are constants in the sum-of-squares proof system, and substituting back into (30), the claim follows, where the second inequality follows from |σ_j − σ_i| ⩽ 2∥H∥ and the last equality follows from (31).
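The elementary inequalities used above, e^x ⩾ 1 + x and hence e^{−x} ⩽ 1/(1 + x) for x > −1, are easy to confirm numerically (a sanity check, not part of the proof):

```python
import math

def check_exp_inequalities(xs):
    """Numerically confirm e^x >= 1 + x for all x, and the equivalent
    e^{-x} <= 1/(1 + x) for x > -1 (applied in the proof with
    x = sigma_j - sigma_i)."""
    for x in xs:
        assert math.exp(x) >= 1 + x - 1e-12
        if x > -1:
            assert math.exp(-x) <= 1 / (1 + x) + 1e-12
    return True

xs = [i / 10 for i in range(-9, 50)]   # grid in (-0.9, 4.9)
ok = check_exp_inequalities(xs)
```

The equivalence follows by multiplying both sides of e^x ⩾ 1 + x by e^{−x}/(1 + x), which is positive for x > −1.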
Proof. Note that i[H′, H] is Hermitian and 2K-local. Thus, by Corollary 2.14, the inequality holds for any real values of the λ′_i. Now, both sides of the above are quadratic expressions in the λ′_i, so by Fact 2.35 the difference between the two sides can be written as a sum of squares of linear functions of the λ′_i. Thus, the inequality also holds within the sum-of-squares proof system.

Now we can finish the proof of Lemma 6.6 by combining the previous lemmas.
Proof of Lemma 6.6. Combining Lemma 8.1, Lemma 8.2, and Lemma 8.3 yields exactly the desired statement.
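Fact 2.35, as invoked in the proof of Lemma 8.3, is the statement that a nonnegative quadratic form is a sum of squares of linear forms; an eigendecomposition makes this concrete (a generic numeric sketch, not the paper's code):

```python
import numpy as np

def sos_decomposition(Q):
    """Write the quadratic form x^T Q x (Q symmetric PSD) as
    sum_k (l_k . x)^2 with linear forms l_k = sqrt(w_k) * v_k,
    obtained from the eigendecomposition Q = V diag(w) V^T."""
    w, V = np.linalg.eigh((Q + Q.T) / 2)
    assert w.min() >= -1e-10, "form must be PSD"
    return [np.sqrt(max(wk, 0.0)) * V[:, k] for k, wk in enumerate(w)]

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Q = A.T @ A                      # a PSD quadratic form in 4 indeterminates
forms = sos_decomposition(Q)

x = rng.standard_normal(4)
lhs = x @ Q @ x
rhs = sum(float(l @ x) ** 2 for l in forms)
```

In the SoS proof system, this decomposition is exactly what certifies a PSD quadratic inequality in the indeterminates λ′.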

A sum-of-squares proof of identifiability (Lemma 6.7)
In this section, we prove Lemma 6.7. At a high level, we will rely on properties of q_{C_{K,d}β,ℓ_0} proven in Section 4. However, since we are working with commutator polynomials, we will need to invoke the translation between polynomials and commutators from Section 3 at each step. First, it is critical that H′ and H almost commute, so that the "error" terms that appear in the translation (those on the RHS of Theorem 3.9) are small. We make this precise in the following subsection.

Bounding error terms: polynomials to nested commutators
We begin by showing the following lemma.

Lemma 9.1 (Bounding switched commutators of type 3). Let S, T ∈ {0, 1}* be arbitrary sequences of length at most ℓ, and let A ∈ P_K. Then we can write the switched commutator in the form ∑_c ζ_c G_c, where the ζ_c are polynomials in the λ′ and the terms G_c are in P_{4(ℓ+1)K} and have distance at most 4(ℓ + 1)K from the support of A. If Assumption 6.3 holds, then the ζ_c are appropriately bounded. Note that the factor of i^{|S|+|T|} makes the expression Hermitian, so that the ζ_c are real polynomials.
Proof. We can write i[H, H′] = ∑_b γ_b F_b for F_b ∈ P_2K, and apply Lemma 6.6. For notational convenience, for an index j ∈ {1, 2, . . . , |S| + |T|}, let λ_{a,[j]} be equal to λ_a if the jth entry of the concatenated sequence ST is 0, and equal to λ′_a otherwise. Now we can apply Lemma 2.18 (with K ← 2K and d ← 10d², since [H, H′] is 2K-local) to expand the nested commutator, where c_{a_1,...,a_{|S|+|T|},b} ∈ {±1, ±i}, the sum has at most (|S| + |T| + 1)! d^{2(|S|+|T|+1)} terms, and each of the terms A_{a_1,...,a_{|S|+|T|},b} ∈ P_{4(ℓ+1)K} has distance at most 4(ℓ + 1)K from the support of A. Thus, we can rewrite the original commutator in the form ∑_{G_c ∈ P_{4(ℓ+1)K}} ζ_c G_c, where each ζ_c is a polynomial with real coefficients (because the original commutator is Hermitian) in the variables λ′, of degree at most 2ℓ + 2.
Since we have the constraint −1 ⩽ λ′_a ⩽ 1 and (33), and we also know that −1 ⩽ λ_a ⩽ 1, combining over all of the terms in the above sum gives the upper bound. The lower bound can be obtained in a similar manner.
Lemma 9.2 (Bounding switched commutators of type 1 and 2). Let S ∈ {0, 1}* be an arbitrary sequence of length at most ℓ, and let A ∈ P_K. Then we can write the switched commutator in the form ∑_c ζ_c G_c, where the terms G_c ∈ P_{(ℓ+1)K} have distance at most (ℓ + 1)K from the support of A, and furthermore the ζ_c are appropriately bounded. Note that the factor of i^{|S|} ensures that the expression is Hermitian, so that the ζ_c are real.
Proof. The proof is the same as that of Lemma 9.1, except that we do not need Lemma 6.6 at all; we only use the constraint −1 ⩽ λ′_a ⩽ 1 and the fact that −1 ⩽ λ_a ⩽ 1 to bound the coefficients.
We will also need the following lemma, which bounds the effect of the sampling error. It is analogous to Lemma 7.3, but must now be enforced for all potential choices of λ′.
Proof. By Lemma 2.18 and Claim 4.11, we can write q_{C_{K,d}β,ℓ_0}(−βH′|B_1) as a linear combination of the form ∑_{A∈A} c_A A, where A ⊂ P_{2^{C_{K,d}β+1} ℓ_0 K} and the coefficients c_A are polynomials in the λ′ of degree at most that of q_{C_{K,d}β,ℓ_0}. Furthermore, the sum of the magnitudes of all of the coefficients of all of the c_A is at most ((1 + β)d)^{2^{C_{K,d}β+5} ℓ_0}. Thus, conditioned on the event that Assumption 6.3 holds, and using the constraints −1 ⩽ λ′ ⩽ 1, we get the desired bound.

No small local marginals in sum-of-squares
Now we move on to the main proof. At a high level, we will show that if H and H′ are far apart, then there must be certain choices of B_1, B_2 in the constraint system C_{λ′} that "witness" this, and thus the constraints will be violated. We break the desired statement into a series of inequalities.
First, we can take a suitable linear combination of the constraints to derive that the following quantities must be small.
Proof. Fix any B ∈ P_K. By Lemma 2.10, we can expand the relevant expression as ∑_b γ_b F_b for some γ_b that are degree-3 polynomials in the indeterminates λ′_i with real coefficients (because the expression is Hermitian). Since −1 ⩽ λ_i ⩽ 1 and we also have the constraint −1 ⩽ λ′_i ⩽ 1, the γ_b are bounded, and at most d³ of them are nonzero. We recall the corresponding statements for all B_1, B_2 ∈ B. Using the last constraint in the system C_{λ′} and the above, we can plug in B_1 = B and B_2 = F_b and take a linear combination with coefficients equal to the γ_b. Using the properties of the γ_b from the beginning, we conclude the claimed inequality.

On the other hand, we will show, using the properties of q_{C_{K,d}β,ℓ_0} from Theorem 4.6, that the expression on the LHS above is in fact lower bounded by some function of [H − H′, B]. The first step is the following lemma.
Lemma 9.5 (Key polynomial identity for nested commutators in sum-of-squares). Under Assumption 6.3, we have for any B ∈ P_K the following identity.

Proof. By Theorem 4.6, we can write the polynomial equality (34) in two formal variables x, y, where m ⩽ 10^{2^{C_{K,d}β} ℓ_0} and each of the polynomials r_j is (2^{C_{K,d}β} ℓ_0 + 10, 200^{2^{C_{K,d}β} ℓ_0})-bounded. Also note that since H, H′, B are Hermitian, the identity is preserved under taking adjoints for any polynomial q. Thus, we can apply Theorem 3.9 to write a formal identity in which D is a sum of the four types of terms given on the RHS of Theorem 3.9. By Claim 4.11, all of the polynomials in the original identity are (2^{C_{K,d}β+1} ℓ_0, 200^{2^{C_{K,d}β} ℓ_0})-bounded, and thus Theorem 3.9, combined with applying Lemma 9.1 and Lemma 9.2 to the individual terms in D, bounds the error. Now, by definition, 2^{4^{C_{K,d}β} ℓ_0} m^{1.5} √ε_0 ⩽ 0.1ε. Finally, it remains to note that the remaining terms are sums of squares. This is because we can rewrite the LHS as ∥ρ^{1/2} r_j(H, H′|B)∥_F², which is a sum of squares in the real and imaginary parts of the entries of the matrix ρ^{1/2} r_j(H, H′|B). This matrix has entries that are polynomials with complex coefficients in the λ′; separating out real and imaginary parts, the entries are thus polynomials in the λ′ with real coefficients. So overall, ∥ρ^{1/2} r_j(H, H′|B)∥_F² is a sum of squares of polynomials in the λ′. Combining everything with (36) completes the proof.
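The final step above rewrites tr(r† ρ r)-type quantities as ∥ρ^{1/2} r∥_F², which is manifestly a sum of squares of real quantities; a small numeric check of this identity (illustrative helper names only):

```python
import numpy as np

def weighted_frob_sq(rho, M):
    """tr(M^dagger rho M), rewritten as ||rho^{1/2} M||_F^2 -- manifestly a
    sum of squares of the real and imaginary entries of rho^{1/2} M."""
    w, V = np.linalg.eigh(rho)
    sqrt_rho = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.conj().T
    R = sqrt_rho @ M
    return float(np.sum(np.abs(R) ** 2))

rng = np.random.default_rng(2)
H = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
H = (H + H.conj().T) / 2
w, V = np.linalg.eigh(H)
rho = V @ np.diag(np.exp(-w)) @ V.conj().T   # toy Gibbs state, beta = 1
rho /= np.trace(rho).real

M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
direct = np.trace(M.conj().T @ rho @ M).real
via_frob = weighted_frob_sq(rho, M)
```

In the proof, M is a matrix of polynomials in λ′ rather than numbers, so the same rewriting yields a sum of squares of real polynomials.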
Next, we analyze the subtracted term tr([H − H′, B]² p_{C_{K,d}β,ℓ_0}(−βH|B)ρ) from Lemma 9.5. We use the property that p_{C_{K,d}β,ℓ_0} is a good approximation to the exponential to relate it to a much simpler expression.

Lemma 9.6 (Derivative is uniformly lower bounded). For any B ∈ P_K, the following bound holds.

Proof. We can write [H − H′, B]² = ∑_{F_b ∈ P_3K} γ_b F_b for some γ_b that are degree-2 polynomials in the λ′ with real coefficients (because the expression is Hermitian). Since −1 ⩽ λ_i ⩽ 1 and we have the constraint −1 ⩽ λ′_i ⩽ 1, the γ_b are bounded, and the number of nonzero γ_b is at most 2d². Now, by Theorem 4.5, the polynomial p_{C_{K,d}β,ℓ_0} is a weak exponential approximation with parameters (0.001 · 2^{C_{K,d}β} log(1/ε), 5^{−C_{K,d}β}, 0.001ε), and thus by Lemma 7.2 we have a bound for all F ∈ P_3K. Note that this is simply a numerical inequality and does not involve any variables of the SoS system. Now, taking a linear combination over all F ∈ P_3K given by the decomposition [H − H′, B]² = ∑_{F_b ∈ P_3K} γ_b F_b and using (37), we conclude the claimed bound. Writing B[H − H′, B] as a linear combination of elements of P_2K, and using −1 ⩽ λ_a ⩽ 1 together with the constraint −1 ⩽ λ′_a ⩽ 1, we obtain the final estimate from the commutator constraints in C_{λ′}; combining this with (38) completes the proof.
Proof. The proof is exactly the same as the proof of Lemma 8.3.
Lemma 9.8 (Selecting each coefficient in the Hamiltonian). Let X = ∑_{E_a ∈ P_K} x_a E_a be written as a linear combination of K-local Pauli matrices. Then for any E_a ≠ I, there exist B ∈ P_K and P ∈ P_2K whose supports intersect that of E_a such that the corresponding coefficient is isolated. The relevant expression is Hermitian, so the LHS is a real polynomial in the λ′. Now fix an index a; we will analyze λ_a − λ′_a. By Lemma 9.8, we can find B ∈ P_K and F_b ∈ P_2K isolating the coefficient of λ_a − λ′_a. Then Lemma 9.7 implies the desired bound as long as C_{K,d} is sufficiently large in terms of K and d, and this completes the proof.

A faster algorithm
We now complete the proof of Theorem 6.1. The key to obtaining the faster runtime is the observation that only a specific family of monomials appears in the sum-of-squares proofs of the previous sections. This observation is formally stated below.
Definition 10.1 (Relevant monomials via cluster expansion). We say a monomial in the variables λ′_a is relevant if, viewing monomials as unordered multisets of the indices a, it arises from the cluster expansion. We use L to denote the set of all relevant monomials in the λ′.
Lemma 10.2. Fix an E_a ∈ P_K. Then we can write the sum-of-squares proof in the form ∑_i r_i(λ′)² ∏_{g∈S_i} g(λ′), where the r_i are polynomials in the λ′ and the S_i are subsets of constraints in C_{λ′} (written in the form g(λ′) ⩾ 0). Furthermore, each product of constraints ∏_{g∈S_i} g(λ′) involves:
• at most one commutator constraint;
• at most one polynomial approximation constraint;
• a product of the constraints λ′_a + 1 ⩾ 0 and λ′_a − 1 ⩽ 0, where the λ′_a that appear form a relevant monomial.

Proof. This follows from examining the proofs in Section 8 and Section 9.2, using Lemma 2.18 to characterize all of the monomials that appear whenever we expand a nested commutator in H and H′. Note that we can ensure that all of the constraints multiplied in the sets S_i are distinct: if any constraint were multiplied twice, we would get a term of the form g(λ′)² and could instead fold it into the r_i(λ′)².

Now, using Lemma 2.19, we can count the total number of relevant monomials.
Lemma 10.3. The total number of distinct relevant monomials is at most m · (1/ε)^{10^{C_{K,d}β}}.
Proof. This follows immediately from Lemma 2.19.
Finally, we have the following theorem from Steurer and Tiegel [ST21] that allows us to solve sum-of-squares systems with running time depending only on the number of monomials that appear in the proofs.
Theorem 10.4 (Degree reduction via linearization [ST21]). Let p : R^n → R be a multivariate polynomial of degree at most t. Suppose that there exists a system of polynomial inequalities A = {q_1 ⩾ 0, . . . , q_m ⩾ 0} such that A ⊢_t {p(x) ⩾ 0}, and further assume that this proof can be written in the form ∑_i r_i(x)² ∏_{g∈S_i} g(x), where there are at most M distinct sets S_i. Also assume that the number of distinct monomials appearing in p(x) is at most N. Then we can write a polynomial system A′ in x and some additional auxiliary variables such that the desired inequality is still derivable, and we can compute a pseudoexpectation satisfying A′ in time O(m + M + (tN)³).
Proof of Theorem 6.1. The proof is exactly the same as the proof of Theorem 6.8, except that we use Lemma 10.3 and Lemma 10.2 to bound the complexity of the final sum-of-squares proof, and Theorem 10.4 to solve the sum-of-squares system in running time poly(m, log(1/δ), (1/ε)^{2^{f(K,d)β}}).
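The linearization behind Theorem 10.4 can be sketched as pure bookkeeping (a hypothetical helper, not from [ST21]): each distinct monomial receives one fresh auxiliary variable, so the solver's work scales with the number of distinct monomials rather than with all monomials of degree t:

```python
from itertools import combinations_with_replacement

def linearize(monomials):
    """Map each distinct monomial (a sorted tuple of variable indices) to a
    fresh auxiliary-variable index. The resulting SDP size scales with the
    number of distinct monomials actually present, not with n^degree."""
    table = {}
    for mono in monomials:
        key = tuple(sorted(mono))
        if key not in table:
            table[key] = len(table)
    return table

# All monomials of degree <= 2 in 3 variables (indices 0, 1, 2):
# one constant, three linear, six quadratic.
monos = ([()] + [(i,) for i in range(3)]
         + list(combinations_with_replacement(range(3), 2)))
table = linearize(monos)
```

For the Hamiltonian-learning system, only the relevant monomials of Definition 10.1 ever appear, which is what makes this reduction pay off.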

A.1 Reproving lemmas
First, we show that this quantity is "protected" against local unitary operations, meaning that it does not change much when we apply a local unitary to A. This works in [AAKS20] as follows.
Thinking of d and K as constants and choosing c, c′ appropriately, the inequality follows; the O(· · ·) gets folded into the values of c and c′.
Proof. The overview of the proof is as follows. We want to bound the influence of U on the expression tr(U† M² Uρ), so we consider it in terms of the eigenbasis of H. The off-diagonal pieces are bounded by Corollary 2.16 (applied with g = d + 1 and R = (d + 1)|S|). We split our expression into the on-diagonal and off-diagonal pieces.
We first bound (ON_i).
The remaining terms can be summed up nicely. In the last line above, we are careful to pull out factors of (d + 1)K and |S| to deal with the regimes where |S|/k and β are ≪ 1 or ≫ 1.
By setting γ appropriately, we can ensure this.
The final factor of 1/γ⁴ is not present in the statement, but is incurred in [AAKS20, Eq. 121].
Proof. Throughout, the c's denote positive constants that depend on k, k′, d, d′. We use Corollary 2.16 to conclude exactly the same bound as in Eq. (39), but this time for A.
We split up our expression into pieces based on the eigenbasis of A. Let γ be a parameter to be chosen later, and define the projectors Π_j accordingly. By Eq. (42), we can conclude a bound of the form e^{−c_0 γ|j−i|} e^{c_1 |S|} on the cross terms. We do not want to incur dependence on the j = 0 piece directly; when j = 0, we can use Lemma A.2 to get a bound that still depends on ⟨I − Π_0⟩. With both of these bounds in hand, we can now bound (TERM_i). Returning to what we originally wanted to bound, we obtain a bound of the form γ² + γ² e^{c_6 |S|} ∑_{i⩾1} (i + 1)² ⟨I − Π_i⟩.

Proof. Take U to be the unitary that sends the ath largest eigenvector of A_2^{(i)} to the ath largest eigenvector of tr_{[n]\supp(A^{(i)})} ρ.

B Proof of Theorem 4.6
Here we prove Theorem 4.6. As the proof is quite long and computational, we break it into several manageable steps. First, we show that the odd and even truncations s_{2ℓ−1}(x), s_{2ℓ}(x) can be related as follows:

Lemma B.1 (Even truncations are bounded). For all ℓ ∈ N and all x ∈ R, |s_{2ℓ−1}(x)| < 99 s_{2ℓ}(x).
We can rearrange the above accordingly. It suffices to prove that f(−x) ⩾ 0 for x ⩾ 0.1ℓ, since then we get that 0.99 s_{2ℓ}(−x) + 0.01 s_{2ℓ−1}(−x) ⩾ e^{−x} ⩾ 0. Now, analyzing the integral on the RHS of the above, we conclude f(−x) ⩾ 0, and we are done.
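Lemma B.1 lends itself to a quick numerical spot check (a grid sanity check, under the standard reading that s_k denotes the degree-k Taylor truncation of e^x; not part of the proof):

```python
def taylor_trunc(x, k):
    """s_k(x): the degree-k Taylor truncation of e^x about 0."""
    term, total = 1.0, 1.0
    for j in range(1, k + 1):
        term *= x / j
        total += term
    return total

def check_lemma_b1(ell, xs):
    """Check |s_{2l-1}(x)| < 99 * s_{2l}(x) on a grid (Lemma B.1)."""
    for x in xs:
        if not abs(taylor_trunc(x, 2 * ell - 1)) < 99 * taylor_trunc(x, 2 * ell):
            return False
    return True

xs = [i / 4 for i in range(-80, 81)]   # grid on [-20, 20]
ok = all(check_lemma_b1(ell, xs) for ell in range(1, 8))
```

Note the check also implicitly relies on the standard fact that even truncations s_{2ℓ}(x) are strictly positive for all real x.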
Next, we present a few basic facts about polynomials that allow us to construct bounded-coefficient sum-of-squares representations.
Claim B.2. Let p(x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_0 be a polynomial, and let C > 0 be a constant such that |a_i| ⩽ C^{n−i} |a_n| for all i ∈ [n]. Then all of the (complex) roots of p have magnitude at most 2C.
Proof. Let z ∈ C with |z| ⩾ 2C. Then for all i, |a_i z^i| ⩽ (C^{n−i}/|z|^{n−i}) · |a_n z^n| ⩽ |a_n z^n|/2^{n−i}, and thus |a_n z^n| > |a_{n−1} z^{n−1}| + · · · + |a_0|, so we cannot have p(z) = a_n z^n + a_{n−1} z^{n−1} + · · · + a_0 = 0. Thus, it is impossible for z to be a root of p.
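Claim B.2 can likewise be spot-checked numerically on random instances satisfying the hypothesis |a_i| ⩽ C^{n−i}|a_n| (helper names here are illustrative, not from the paper):

```python
import numpy as np

def roots_bounded(coeffs_low_to_high, C):
    """Check Claim B.2 numerically: if |a_i| <= C^{n-i} |a_n| for all i,
    every complex root of p should have magnitude at most 2C."""
    roots = np.roots(coeffs_low_to_high[::-1])   # np.roots wants high-to-low
    return bool(np.all(np.abs(roots) <= 2 * C + 1e-9))

rng = np.random.default_rng(3)
n, C = 6, 1.5
a_n = 1.0
# Sample coefficients satisfying the hypothesis |a_i| <= C^{n-i} |a_n|.
coeffs = [rng.uniform(-1, 1) * C ** (n - i) * abs(a_n) for i in range(n)] + [a_n]
ok = roots_bounded(coeffs, C)
```

The argument in the proof is just a comparison against the geometric series ∑_{i<n} 2^{−(n−i)} < 1, so the root bound is in fact strict.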
As a straightforward corollary, we show that each of the following admits a bounded-coefficient sum-of-squares decomposition: it is a (2ℓ, ℓ, 10^ℓ)-bounded SoS polynomial in x (see Definition 2.22).