On the Power of Interactive Proofs for Learning

We continue the study of doubly-efficient proof systems for verifying agnostic PAC learning, for which we obtain the following results. - We construct an interactive protocol for learning the $t$ largest Fourier characters of a given function $f \colon \{0,1\}^n \to \{0,1\}$ up to an arbitrarily small error, wherein the verifier uses $\mathsf{poly}(t)$ random examples. This improves upon the Interactive Goldreich-Levin protocol of Goldwasser, Rothblum, Shafer, and Yehudayoff (ITCS 2021) whose sample complexity is $\mathsf{poly}(t,n)$. - For agnostically learning the class $\mathsf{AC}^0[2]$ under the uniform distribution, we build on the work of Carmosino, Impagliazzo, Kabanets, and Kolokolova (APPROX/RANDOM 2017) and design an interactive protocol, where given a function $f \colon \{0,1\}^n \to \{0,1\}$, the verifier learns the closest hypothesis up to $\mathsf{polylog}(n)$ multiplicative factor, using quasi-polynomially many random examples. In contrast, this class has been notoriously resistant even for constructing realisable learners (without a prover) using random examples. - For agnostically learning $k$-juntas under the uniform distribution, we obtain an interactive protocol, where the verifier uses $O(2^k)$ random examples to a given function $f \colon \{0,1\}^n \to \{0,1\}$. Crucially, the sample complexity of the verifier is independent of $n$. We also show that if we do not insist on doubly-efficient proof systems, then the model becomes trivial. Specifically, we show a protocol for an arbitrary class $\mathcal{C}$ of Boolean functions in the distribution-free setting, where the verifier uses $O(1)$ labeled examples to learn $f$.


INTRODUCTION
Can we verify the results of a supervised learner much more cheaply than running the learning algorithm from scratch? This is a fundamental question at a time when training modern machine learning models demands vast amounts of data, meticulously collected and labeled, coupled with substantial computational resources. The infeasibility of repeating this process multiple times makes verifying the results of the learning algorithm all the more critical.
Recently, Goldwasser, Rothblum, Shafer and Yehudayoff [14] formalised the question of verifying the results of supervised machine learning tasks in the context of PAC learning [30]. They initiated the study of PAC-verification, with the aim of developing interactive proof systems that enable a verifier to check the results of an untrusted learner (or a prover). The goal is to achieve this verification while conserving computational resources and reducing data access, either quantitatively or qualitatively. This notion of utilising interactive proofs to verify learning algorithms is rooted in the understanding that interaction with a prover can be remarkably helpful for many computational tasks, as evidenced by major results like IP = PSPACE or MIP* = RE.
In more detail, [14] focus on agnostic learning. In the setting of agnostic learning over the uniform distribution for a class of Boolean functions C, for an arbitrary f : {0,1}^n → {0,1}, the learning algorithm has access to a random example oracle that provides labeled examples of the form (x, f(x)), where each x is drawn uniformly at random from {0,1}^n. A learner (α, ε, δ)-agnostically learns a class C over the uniform distribution using random examples, for some α ≥ 1, if it outputs a hypothesis h with error at most α · opt_C(f) + ε with probability at least 1 − δ, where opt_C(f) = min_{c ∈ C} {dist(f, c)} is the best possible approximation of f by any function in C. Another well-studied variant is one where the learner is required to meet the requirements of (α, ε, δ)-agnostic learning, but is given a stronger form of access to f via membership queries, and not just random examples. Agnostic learning goes beyond "realisable" PAC learning, which assumes that f always comes from C; instead, it represents a more realistic scenario where there is no "ground truth" about the labeling function (see the full version for a formal definition).
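As a toy illustration of these definitions, the following sketch computes opt_C(f) by brute force and checks the agnostic-learning guarantee for a hypothetical class C of dictator functions over n = 3 variables. All names and parameter choices here are illustrative, not from the paper:

```python
from itertools import product

n = 3
points = list(product([0, 1], repeat=n))

# Toy target: majority of three bits (no dictator fits it perfectly).
def f(x):
    return 1 if sum(x) >= 2 else 0

# Hypothetical toy class C: the n "dictator" functions x -> x_i.
C = [(lambda x, i=i: x[i]) for i in range(n)]

def dist(g, h):
    # Distance under the uniform distribution over {0,1}^n.
    return sum(g(x) != h(x) for x in points) / len(points)

opt = min(dist(f, c) for c in C)  # opt_C(f) = min_{c in C} dist(f, c)

# An (alpha, eps, delta)-agnostic learner must output h with
# dist(f, h) <= alpha * opt_C(f) + eps (with probability >= 1 - delta).
alpha, eps = 1, 0.05
h = min(C, key=lambda c: dist(f, c))  # brute-force best hypothesis in C
assert dist(f, h) <= alpha * opt + eps
```

Here every dictator disagrees with majority on 2 of the 8 points, so opt_C(f) = 0.25 and any best-in-class hypothesis meets the guarantee with α = 1.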
In the (α, ε, δ)-PAC-verification model over the uniform distribution, a verifier, which accesses f using a random example oracle over the uniform distribution, interacts with an untrusted prover (or learner) that has query access to f, to output a hypothesis h with error at most α · opt_C(f) + ε, with probability at least 1 − δ. For the completeness requirement of the proof system, the honest prover should be able to convince the verifier to output a "good" hypothesis with high probability (this could also include scenarios where the prover sends a purported hypothesis first and then tries to convince the verifier of its accuracy). For the soundness requirement, a malicious prover, even if computationally unbounded, should not be able to convince the verifier to accept a hypothesis whose error deviates significantly from α · opt_C(f), except with small probability. This represents a PAC-verification task with the goal of achieving a qualitative difference between the verifier and the prover, as queries are considered to be a more powerful form of access.
Another equally appealing setting of PAC-verification is one where both the verifier and the prover have query access (or random example access) to f, and the goal is for the verifier to output a good hypothesis with high probability while making far fewer queries (or using far fewer samples) than the prover, providing a quantitative difference. We refer to the full version for a formal definition of the model.
Following the work of [14] and its follow-ups [4, 22], we are mainly interested in proof systems that are doubly efficient, where the honest prover is also an efficient algorithm. The notion of doubly efficient proof systems is intimately connected to that of delegating a computational task to an efficient, yet untrusted, prover [13]. While such proof systems have found widespread use in theory and in practice, their study in the setting of delegating learning tasks is nascent.
In this work, we establish a set of general tools for PAC-verification and illustrate their utility by constructing PAC-verification protocols for certain classes of Boolean functions whose study is foundational in the field of computational learning theory. Our constructions showcase the power of interacting with a prover, achieving quantitative and/or qualitative improvements for verifying the results of an agnostic learner, compared to performing the learning task without a prover.

Our Results
We now provide a concise overview of the main results. Interested readers are directed to the full version of the paper for complete details.
To begin with, we construct a sample-efficient interactive Goldreich-Levin protocol for the problem of learning heavy Fourier coefficients, which improves upon an analogous result by [14]. Following this, we construct the first PAC-verifiers for classes of functions computable by constant-depth circuits like AC^0[2], as well as for functions that are k-juntas. Finally, we illustrate the power of such proof systems with unbounded honest provers, giving PAC-verifiers for any arbitrary class of Boolean functions that use very few samples, even in the distribution-free setting. We elaborate on these results next.
1.1.1 Learning Heavy Fourier Characters. One way of understanding the structure of a class of Boolean functions C is by studying the Fourier spectrum of functions in C (see Section 2.3 of the full version for definitions about Fourier analysis of Boolean functions). This idea was made concrete in the context of learning theory through general results that show efficient learnability of classes for which the Fourier spectrum is concentrated on low-degree monomials [20], or concentrated on a sparse set [19].
In particular, [19] used the Goldreich-Levin algorithm [12] for computing the set of all heavy Fourier characters of a Boolean function (i.e., characters with large Fourier coefficients), given membership query access to it.¹ Following these results, the ability to learn heavy Fourier characters of a Boolean function has been used in multiple works in learning theory (as well as other areas like cryptography and coding theory).
For any n-variate Boolean function f, t ∈ N and ε > 0, we say that a set S of characters is ε-close to the set Λ_t of the t heaviest Fourier characters of f, if no character in S with the smallest (absolute) coefficient is replaceable by any character outside S whose (absolute) coefficient is larger by a value of ε. Note that this implies that for each χ ∈ Λ_t, there exists a distinct χ̃ ∈ S such that |f̂(χ)| − |f̂(χ̃)| ≤ ε, where f̂(χ) and f̂(χ̃) are the respective Fourier coefficients. For our first result, we show an algorithm that learns the t heaviest Fourier characters of a given Boolean function up to an error ε, where the verifier only uses poly(t/ε) random labeled examples. Specifically, we have

Theorem 1.1 (Learning heavy Fourier characters (see Section 4 of the full paper for a formal statement)). There exists an interactive proof such that for any f : {0,1}^n → {0,1}, any ε > 0 and any t ∈ N, the verifier uses at most poly(t/ε) many random examples and outputs a set Λ̃ that ε-approximates Λ_t with probability at least 0.9.
Moreover, the prover is efficient in the sense that it is essentially as efficient as the learning algorithm it invokes.
We emphasize that Theorem 1.1 presents a qualitative as well as a quantitative improvement over the prover, which runs the Goldreich-Levin algorithm and thus has query complexity poly(n, 1/ε). As an example, in an application where we just want to find the character that approximates the heaviest Fourier coefficient of f up to a small constant error, the verifier only needs O(1) random examples to learn it using a prover, whereas without a prover it needs to make poly(n) queries to f!
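To see why estimating a single Fourier coefficient is cheap from random examples alone, the sketch below estimates f̂(χ) = E[(−1)^{f(x)} · (−1)^{⟨χ,x⟩}] by an empirical average. The toy function, the character, and the sample size are illustrative assumptions, not from the paper:

```python
import random

n, m = 8, 20000  # m ~ O(1/eps^2) samples; toy parameter choices

def f(x):
    # Toy function: parity of the first two bits, perturbed by x2 AND x3.
    return x[0] ^ x[1] ^ (x[2] & x[3])

chi = [1, 1] + [0] * (n - 2)  # character to estimate, as a vector in {0,1}^n

random.seed(0)
est = 0.0
for _ in range(m):
    x = [random.randint(0, 1) for _ in range(n)]
    # Each sample contributes (-1)^{f(x)} * chi(x), whose mean is f-hat(chi).
    est += (-1) ** f(x) * (-1) ** sum(c * xi for c, xi in zip(chi, x))
est /= m
# For this f, the true coefficient is exactly 1/2; Hoeffding's bound makes
# the estimate accurate to within a small additive error.
assert abs(est - 0.5) < 0.05
```

The point mirrored in the protocol: the number of samples depends only on the accuracy parameter, never on n.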
The interactive Goldreich-Levin theorem is obtained as an application of Theorem 1.1. For any τ > 0, let S_{≥τ/2} be the set of Fourier characters whose coefficients are at least τ/2. From Parseval's theorem, we see that |S_{≥τ/2}| ≤ 4/τ². To find S_{≥τ}, we set Λ_t = S_{≥τ/2} in Theorem 1.1, where t = 4/τ² and ε = τ/2. The verifier outputs a list of characters that contains S_{≥τ}, with high probability, using only poly(1/τ) many random examples, whereas the prover makes at most poly(n/τ) queries.² We highlight that our interactive Goldreich-Levin protocol vastly improves the sample complexity over that of [14] (Lemma 2.2), which is poly(n, 1/τ). In particular, the sample complexity of our interactive Goldreich-Levin protocol is independent of n. Theorem 1.1 builds upon a novel algorithm that (approximately) computes the largest Fourier coefficients (but not the associated characters) by making poly(t/ε) membership queries to f, which in turn gives a PAC-verification protocol in which the verifier uses far fewer queries (poly(t/ε)) than the prover. We then show a framework for query-to-sample reduction for PAC-verification, based on the query pattern of the verifier, and apply it to this protocol. We find these tools to be of independent interest and potentially applicable for designing other PAC-verification protocols, and provide more intuition in Section 2.1.
1.1.2 Agnostically Learning AC^0[2]. Let AC^0[2] be the class of functions computable by constant-depth circuits of polynomial size with AND, OR, NOT, XOR gates of unbounded fan-in. Similarly, for any prime p > 2, we define AC^0[p] to be the class of functions computable by constant-depth circuits of polynomial size (number of gates) with AND, OR, NOT and MOD_p gates of unbounded fan-in.
Theorem 1.2 (PAC-verification for AC^0[2]). The class AC^0[2] is (polylog(n), 1/10)-PAC-verifiable over the uniform distribution, where the verifier uses at most quasi-polynomially many random examples. Moreover, the protocol is doubly efficient, where both the verifier and the honest prover (which has query access to f) run in quasi-polynomial time.
Moreover, we can easily extend this to get PAC-verification for AC^0[p] (Theorem 5.3 of the full version) with similar complexities, for any prime p > 2.
The class AC^0[2], or more generally AC^0[p], is at the frontier of Boolean function classes that are learnable. In the realisable setting, [2] obtain a learner for AC^0[p] over the uniform distribution using membership queries in quasi-polynomial time; obtaining a learner for these classes using random examples has been a long-standing open question. Moreover, removing the membership queries from the agnostic learner for AC^0[2] would give quasi-polynomial time algorithms for two notoriously difficult problems: learning parities with noise (LPN) and, for AC^0[p], learning with errors (LWE). We refer to [3] (Section 5) for further discussion on the implications for LPN and LWE.
In contrast, we show the power of PAC-verification protocols for AC^0[2], where the verifier uses random examples to agnostically learn AC^0[2] upon interaction with a quasi-polynomial time prover. This also generalises another result by [14], who show that parities are PAC-verifiable, implying that the LPN assumption might not be relevant in the setting of PAC-verification.
Our techniques crucially use the structure of the agnostic learner with membership queries of [3]. In particular, they use tools from pseudorandomness and meta-complexity in the construction of the learner (the Nisan-Wigderson reconstruction algorithm [18, 23]). We perform a careful analysis of the query patterns in this algorithm, and extend the query-to-sample reduction framework to exploit it. Considering the ubiquity of the Nisan-Wigderson generator in complexity theory, we expect this analysis to have other applications.
We briefly mention some subtleties about Theorem 1.2. Our protocol builds on the [3] agnostic improper learner (whose hypothesis is not in AC^0[2]), which translates into the sample and running time complexities, as well as the hypothesis approximation error, of our PAC-verification protocol. Indeed, we also inherit from them the non-negligibility criterion on opt(f, AC^0[2]), which ensures quasi-polynomial sample and running time complexities.
General Boolean Circuit Classes. Let C be any well-known Boolean circuit class like ACC^0, NC^1, or P/poly. The agnostic learner for AC^0[p] over the uniform distribution by [3] is in fact obtained via the meta-algorithmic "circuit lower bounds to learning algorithms" framework.
In more detail, a γ(n)-tolerant natural property (in the sense of [27]) against functions that are γ(n)-close to C (i.e., opt(f, C) ≤ γ(n)) is an "efficient" algorithm that distinguishes functions that are γ(n)-close to C from a dense subset of all functions (see Section 5.1.2 for formal definitions). Tolerant natural properties correspond to constructive average-case circuit lower bound techniques, which not only show a lower bound for a given function against C, but also extend the lower bound against C to a non-negligible fraction of all functions. [3] show that a (1/2 − γ(n))-tolerant natural property against C implies agnostic learners for C using membership queries over the uniform distribution, in a very general and black-box fashion.
The generality of the techniques used in our PAC-verification protocol for AC^0[2] allows them to be extended to the "tolerant natural properties to agnostic learners" framework. Thus, we show conditional PAC-verifiers for any typical circuit class C[poly(n)] (functions computable by C-circuits of polynomial size).⁴ In more detail, we show that if there exists a (1/2 − 1/poly(n))-tolerant natural property against C[poly(n)], then C[poly(n)] is (ε, 1/10)-PAC-verifiable over the uniform distribution, for 0 < ε < 1, where the verifier uses at most sub-exponentially many random labeled examples. Moreover, both the verifier and the honest prover run in sub-exponential time. We refer to Theorem 5.4 of the full version for a formal statement.
1.1.3 Learning Juntas. Let f : {0,1}^n → {0,1} be a Boolean function that depends only on an unknown subset of k ≪ n variables. The class of such functions is called k-juntas. The class of k-juntas has been widely explored in the testing and learning literature, e.g., [5, 7, 16, 21], and questions about the complexity of learning it remain open, in both the realisable and agnostic settings. We construct the following PAC-verification protocol for k-juntas.
Theorem 1.3 (PAC-verification for k-juntas (see the full version for a formal statement)). For any integer k ≤ n, the class of k-juntas is (ε, 1/10)-PAC-verifiable with respect to the uniform distribution.
Moreover, the verifier uses at most 2^k · poly(k/ε) random examples, and the honest prover is efficient in the sense that it is essentially as efficient as any k-junta learning algorithm that it invokes.
The honest prover runs an agnostic learner for k-juntas from [16], and uses at most (2^k/ε²) · log(n) random examples. On the other hand, the verifier in the protocol from Theorem 1.3 uses 2^k · poly(k/ε) random examples. It is worth stressing the quantitative improvement here: the verifier's sample complexity is independent of n, unlike the honest prover's. For example, a PAC-verifier for an O(1)-junta only uses O(1) random examples to output a hypothesis with constant error.
The PAC-verifier for k-juntas is obtained by another general transformation, from query-based tolerant testers for a class C to PAC-verifiers for C, followed by an application of the query-to-sample reduction from Section 1.1.1 to this protocol.⁵

1.1.4 The Power of PAC-Verification with Unbounded Provers. Finally, we study the power of PAC-verification if we allow the honest prover to have unbounded computational power. For this section, we extend the PAC-verification model (Definition 2.2 of the full version) to the distribution-free agnostic learning setting, where for any unknown, yet fixed, distribution D over the examples, the verifier is now given access to the random example oracle with respect to D. The completeness and soundness requirements are stronger: for every fixed, but unknown, underlying distribution D, the hypothesis output by the interaction should have error at most opt_D(f, C) + ε, where the distance is now measured over D.
Let P/poly be the class of functions computable by general Boolean circuits of polynomial size, which captures efficient non-uniform computation. We show the following PAC-verification protocol.
Theorem 1.4 (PAC-verification for P/poly (see the full version for a formal statement)). For any ε > 0, P/poly is distribution-free, proper, (ε, 1/10)-PAC-verifiable, where the verifier uses at most O(1/ε) many random labeled examples and runs in poly(n/ε) time. Moreover, the honest prover is computationally unbounded.

⁵ A tolerant tester for C is a sub-linear algorithm that accepts an input f if opt(f, C) ≤ γ, and rejects it if opt(f, C) > γ + ε, for any input parameters γ, ε > 0. Clearly, a tolerant tester implies a suitable tolerant natural property for C, since a random function is far from C (for a typical circuit class). However, it is unclear whether a tolerant natural property gives us a tolerant tester, which has a more stringent rejection criterion. In particular, the results of [25] only imply a tolerant tester from a proper agnostic learner for C (where the hypothesis is from C), and the learner from [3] is improper. This may explain the difference in hypothesis error for PAC-verifiers obtained using our techniques: additive error (tolerant testers) versus multiplicative error (tolerant natural properties).
For contrast, assuming the existence of standard cryptographic primitives (like one-way functions), we know that P/poly cannot be learnt in polynomial time over the uniform distribution using membership queries, even in the realisable setting [11]. Thus, access to an unbounded prover can help a verifier learn P/poly even in the significantly stronger distribution-free, proper, agnostic learning setting.
Theorem 1.4 is proved using a more general result showing that any arbitrary class of functions C is distribution-free PAC-verifiable using O(1/ε) random examples, where the verifier runs in time poly(log(|C|)/ε). The interactive proof is constructed by viewing agnostic learning as a suitable Empirical Risk Minimisation (ERM) task (which does an exhaustive search) and delegating this computational task to the prover using [13]. More details can be found in Section 7 of the full version.
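The ERM view admits a compact sketch: for a toy finite class (parities over 4 variables, an illustrative choice, with a realisable target for simplicity), the prover exhaustively searches for the hypothesis minimising empirical error on a small labeled sample:

```python
import random
from itertools import product

random.seed(1)
n = 4
# Hypothetical finite class C: all parities over subsets of the n variables.
C = [tuple(s) for s in product([0, 1], repeat=n)]

def parity(s, x):
    return sum(si * xi for si, xi in zip(s, x)) % 2

target = (1, 0, 1, 0)  # unknown labeling rule (realisable, for simplicity)

# The (unbounded) prover runs ERM: exhaustive search for the hypothesis
# minimising empirical error on m = O(log|C|/eps^2) labeled examples.
m = 200
sample = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
labels = [parity(target, x) for x in sample]

def emp_err(s):
    return sum(parity(s, x) != y for x, y in zip(sample, labels)) / m

h = min(C, key=emp_err)  # empirical risk minimiser over C
assert emp_err(h) == 0.0 and h == target
```

The delegation result replaces this exhaustive search by an untrusted prover, leaving the verifier with only the O(1/ε)-sample empirical check.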

TECHNICAL OVERVIEW
In this section, we highlight the proofs of Theorems 1.1 and 1.2. Our interactive proofs are constructed using a sequence of novel ideas for obtaining general transformations, and below we highlight some of the key ideas used to obtain them. These interactive proofs are built through an interplay of various techniques and tools from interactive proofs, property testing, Fourier analysis, and pseudorandomness. We refer to the individual technical sections for further details.

Proof Outline of Theorem 1.1
Overview of the Goldreich-Levin algorithm (GL). For this outline, we focus on the goal of learning the set Λ_τ of all the τ-heavy Fourier characters (i.e., vectors α ∈ {0,1}^n such that the associated Fourier coefficient satisfies |f̂(α)| ≥ τ), given query access to the input function f : {0,1}^n → {−1, 1} (for the output, we use the equivalent representation of {0,1} as {−1,1}). With additional work, we generalise the ideas highlighted below to PAC-verifying the t heaviest Fourier characters, for any t, thus obtaining Theorem 1.1. The GL algorithm follows a divide-and-conquer strategy over the inputs that can be viewed as a binary tree (see Section 3.5 in [24]). For any 1 ≤ i ≤ n, and any node at depth i in the tree (which represents a subcube of {0,1}^n), GL considers both subcubes obtained from this node by restricting the (i + 1)-th coordinate to either 0 or 1. It then estimates the Fourier weights of both subcubes up to a small error using O(1/τ²) many queries to f, and eliminates those with Fourier weight at most τ²/4 from future consideration.⁶ GL stops at depth n, where each leaf consists of just one vector, and thus the set of leaves contains Λ_τ. Since there are at most O(1/τ²) many subcubes with large Fourier weight (by Parseval's theorem) at any depth of the tree, GL makes roughly poly(n/τ) queries overall. It is crucial that GL goes all the way down to depth n in order to identify all the heavy vectors in Λ_τ.
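The tree-pruning structure of GL can be sketched as follows, using exact Fourier weights in place of the estimates GL would actually make from O(1/τ²) queries. The toy function and parameters are illustrative assumptions:

```python
from itertools import product

n, tau = 4, 0.4
points = list(product([0, 1], repeat=n))

def f(x):  # toy function with exactly one heavy character, chi_{1100}
    return (-1) ** (x[0] ^ x[1])

def fhat(a):  # exact Fourier coefficient (GL would *estimate* weights)
    return sum(f(x) * (-1) ** sum(ai * xi for ai, xi in zip(a, x))
               for x in points) / len(points)

# A node at depth i is a prefix p in {0,1}^i; its weight is the total
# squared Fourier mass of all characters extending p.
def weight(prefix):
    return sum(fhat(prefix + tail) ** 2
               for tail in product([0, 1], repeat=n - len(prefix)))

frontier = [()]
for depth in range(n):
    frontier = [p + (b,) for p in frontier for b in (0, 1)
                if weight(p + (b,)) > tau ** 2 / 4]  # prune light subtrees

assert frontier == [(1, 1, 0, 0)]  # only the tau-heavy character survives
```

By Parseval, at most O(1/τ²) prefixes survive at each depth, which is what caps the query complexity at poly(n/τ).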
Query-based PAC-verification of Λ_τ. Suppose we could stop this process at depth O(log(1/τ)) instead of going all the way to n, with the hope of getting the query complexity down to poly(1/τ). We immediately encounter a significant challenge: since each subcube at any depth is obtained by partitioning over a fixed variable in the layers above, it could contain multiple τ-heavy vectors. Furthermore, GL only estimates the Fourier weight of each subcube, and does not help us with identifying the actual heavy vectors within.
To resolve this, we consider a partitioning scheme based on random linear functions, which also appears in [15]. In more detail, we take a set {v_1, ..., v_k} of uniformly random vectors from F_2^n, where k = O(log(1/τ)) is suitably large. Define V as span(v_1, ..., v_k), of dimension k (with high probability). First, we show that the 2^k possible affine shifts of V^⊥ (the subspace orthogonal to V) form a partition of F_2^n such that each of these subspaces contains at most one τ-heavy vector.⁷ Following this, we prove a new method of estimating the heaviest vector from each of these affine subspaces, based on the following lemma: for any affine subspace Γ in which τ-heavy vectors are "rare", the ℓ4-Fourier norm is an additive approximation to the ℓ∞-Fourier norm, i.e., (Σ_{α∈Γ} f̂(α)^4)^{1/4} is roughly equal to max_{α∈Γ} |f̂(α)|, up to a small additive error. Using the ℓ4-Fourier norm, as opposed to the ℓ2-Fourier norm utilised in [15], seems crucial for estimating the heaviest Fourier coefficient. Now, if we estimate the ℓ4-Fourier norm of each of these affine shifts, we get all the coefficients of the τ-heavy vectors. In Lemma 4.4 of the full version, by proving that the sum Σ_{α∈Γ} f̂(α)^4 can be expressed as an expectation, we show that this estimate can be made using poly(1/τ) queries to f. Thus, we get an algorithm that makes poly(1/τ) queries to f and computes the coefficients of all the τ-heavy vectors, without identifying the corresponding characters.⁸
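A small numerical check of the ℓ4-versus-ℓ∞ comparison on cosets, with a toy function and a hand-picked pair of vectors spanning V (both illustrative assumptions; the paper uses uniformly random vectors):

```python
from itertools import product

n = 4
points = list(product([0, 1], repeat=n))

def g(x):  # arbitrary toy Boolean function in +/-1 representation
    return 1 if (x[0] & x[1]) | x[2] else -1

# Exact Fourier coefficients of g via brute force.
coeffs = {}
for a in points:
    coeffs[a] = sum(g(x) * (-1) ** sum(ai * xi for ai, xi in zip(a, x))
                    for x in points) / len(points)

def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w)) % 2

# Partition all characters into the 2^k cosets of V-perp, V = span(v1, v2).
v1, v2 = (1, 0, 1, 0), (0, 1, 1, 1)
cosets = {}
for a in points:
    cosets.setdefault((dot(a, v1), dot(a, v2)), []).append(a)

for bucket in cosets.values():
    linf = max(abs(coeffs[a]) for a in bucket)
    l4 = sum(coeffs[a] ** 4 for a in bucket) ** 0.25
    # The l4-norm always upper-bounds the heaviest coefficient in the coset,
    # and is a good additive approximation when heavy characters are rare.
    assert l4 >= linf - 1e-9
```

Monotonicity of ℓp-norms gives the one-sided bound unconditionally; the lemma's content is the matching upper bound on cosets where heavy characters are rare.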
As a solution to this, we simply ask the prover (which runs GL) to send us the τ-heavy vectors! The honest prover sends the τ-heavy vectors to the verifier, and completeness is ensured since we can easily estimate the Fourier coefficient of a vector using O(1/τ²) many random queries (up to an error of, say, τ/4). For soundness, we observe that if a malicious prover sends a vector α′ that is not heavy, we can efficiently find the affine shift of V^⊥ it belongs to, since V has a succinct representation using just k vectors. Using the ℓ4-norm based estimation technique above, we can find the maximum Fourier coefficient μ̃ in this affine subspace and reject if μ̃ is too far from |f̂(α′)|. Put together, we obtain a PAC-verification protocol for learning Λ_τ, where the verifier makes poly(1/τ) queries and the honest prover makes poly(n/τ) queries to f, obtaining a significant quantitative improvement in the number of queries, which is independent of n.
For the sake of exposition, we ignore the effect of the additive errors on various estimates made by the verifier as well as the honest prover that runs GL; accounting for this gives us a list of coefficients of which the set of all τ-heavy vectors is a subset. We refer to Sections 4.1 and 4.2 of the full version for details.

⁷ For comparison, [15] study testing sparse Boolean functions and show that under such a partitioning scheme, at most one non-zero Fourier coefficient of a sparse Boolean function falls in each subspace. In such a case, computing the Fourier ℓ2-norm of each subspace suffices for their testing algorithm. While they work with the decision task of testing, in our learning/Goldreich-Levin setting, we get an arbitrary Boolean function without the nice structure provided by sparsity, which could render any such subspace to contain a large number of non-zero Fourier coefficients. It is unclear how estimating the ℓ2-norm of the entire subspace would help the learner identify the heaviest character (up to some error).

⁸ One can view this as a full binary tree, where the nodes in layer i represent all the 2^i affine shifts of the subspace orthogonal to span(v_1, ..., v_i). It is worth noting that unlike the iterative process used by GL, we only compute the ℓ4-estimates at the end.
Query-to-sample reductions. In order to obtain a qualitative improvement as well, we need the verifier in the aforementioned PAC-verification protocol to run using only random examples over the uniform distribution, without a large blow-up in the sample complexity. Our main challenge is that query access is typically a more powerful resource than random labeled examples of f (formalised by [8] for agnostic learning over the uniform distribution). We prove that for a PAC-verification protocol in which the verifier has query access to f, the verifier can instead use only random examples and a prover to answer its queries, when the queries have suitable structure.
To understand this framework, we make the following observations about the queries made by the verifier in the aforementioned PAC-verification protocol for Λ_τ. Firstly, we observe that the marginal distribution of every query is uniform. In more detail, to estimate the ℓ4-norm of an affine subspace using Lemma 4.4 of the full version, the verifier queries f on uniformly random x, y, z ∼ {0,1}^n, and on a fourth point which is the sum of these vectors with a uniformly drawn vector from the random subspace V. Next, we see that the queries are non-adaptively generated by the verifier, i.e., there exists a query construction algorithm Query that uses its internal randomness to generate all the queries before they are made. Further, the queries are independent of the proof sent by the prover.
Suppose that the verifier makes q queries to f. Now, the verifier picks a query index i ∼ [q] uniformly at random and "hides" a random example (x, f(x)) from the uniform distribution as the i-th query. The important point here is that Query can generate the remaining q − 1 queries conditioned on the i-th query being x. Indeed, the non-trivial case is when this query corresponds to one that uses a random vector from the subspace V, which is constructed beforehand by Query using random vectors; we prove in Section 6.2 of the full version that Query can still generate the rest of the queries conditioned on this query being x. For the reduction, the verifier sends the entire embedded query set to the prover and asks for the answers to the queries. The honest prover answers correctly, and completeness follows from that of the query-based protocol. On the other hand, since the marginal distribution of each query is uniform, which is the same as the underlying distribution, a malicious prover cannot distinguish between the "embedded" query and any other query generated by Query. Thus, the best a malicious prover can do is pick a query at random and answer it incorrectly, hoping that it is not x, while providing correct answers everywhere else (note that the verifier has f(x) and will catch a prover that lies on x). A malicious prover can get away with cheating on a single query with probability 1 − 1/q; repeating this process O(q) times ensures that we catch a cheating prover with high probability. Finally, we get a PAC-verification protocol for learning Λ_τ where the verifier uses just O(q²) samples.
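The soundness amplification can be sanity-checked with a tiny simulation; the parameters q and the number of repetitions are arbitrary toy choices:

```python
import random

random.seed(2)
q, reps = 10, 200  # q queries per round, O(q) repetitions (toy parameters)

def run_round():
    # The verifier hides its labeled example at a uniformly random index;
    # an indistinguishability argument forces a cheating prover to lie on
    # a uniformly random query, hoping to miss the hidden one.
    hidden = random.randrange(q)
    lie = random.randrange(q)
    return lie == hidden  # the verifier catches the lie iff they collide

caught = any(run_round() for _ in range(reps))
# Each round catches the prover with probability 1/q, so after `reps`
# independent rounds the prover escapes with probability (1 - 1/q)^reps.
assert caught
```

With q = 10 and 200 repetitions, the escape probability (1 − 1/q)^reps is below 10^−9, matching the "repeat O(q) times" analysis.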
Discussion. More generally, if the query construction algorithm of the verifier satisfies embeddability, i.e., the query set can be generated conditioned on a random query index being fixed to a given query x, with the marginal distribution of each query being the same as the underlying distribution, then we get a PAC-verification protocol where the verifier only uses random examples, with a quadratic blow-up in sample complexity. We refer to Section 3.2 of the full version for more details.
While embeddability seems necessary for the hiding process, Section 2.2 extends this framework to handle more general query marginal distributions. Moreover, we use non-adaptivity to embed the random example in the query set; we leave the study of extending this framework to handle adaptive queries, which may or may not depend on the interaction with the prover, as an interesting direction for future work.

Proof Outline of Theorem 1.2
Overview of the [3] agnostic learner. We start by recalling the Nisan-Wigderson (NW) generator. A family of sets S_1, ..., S_m over a universe of size d is called an NW-set design if each S_i has size n and, for every i ≠ j, the overlap between S_i and S_j is very small, i.e., at most log(m). Given this, for any f : {0,1}^n → {0,1}, we define the NW-generator NW_f : {0,1}^d → {0,1}^m, with seed length d and stretch m, as NW_f(z) = f(z|_{S_1}), ..., f(z|_{S_m}), where z|_{S_i} is the restriction of the seed z to the indices in S_i.
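A minimal sketch of the NW-generator on a toy, hand-picked design (the design, sizes, and base function are illustrative assumptions, not the ones used by [3]):

```python
def nw(f, design, seed):
    # NW_f(z) = f(z|_{S_1}), ..., f(z|_{S_m}): evaluate f on the seed
    # restricted to each design set.
    return [f([seed[i] for i in S]) for S in design]

n, d = 2, 4
# m = 4 sets of size n over a universe of size d, pairwise overlap <= 1,
# which is within the log(m) = 2 bound required of an NW-set design.
design = [(0, 1), (2, 3), (0, 2), (1, 3)]
f = lambda x: x[0] ^ x[1]  # toy base function f : {0,1}^n -> {0,1}

out = nw(f, design, [1, 0, 1, 1])
assert out == [1, 0, 0, 1]
```

The small pairwise overlaps are what make the hybrid-argument reconstruction (and, later, the marginal-distribution analysis) tractable.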
Their algorithm is obtained by a "play-to-lose" argument. Let f be the underlying n-variate Boolean function, such that opt(f, AC^0[2]) = β*(n), for a non-negligible β*. [3] show an efficient algorithm R that takes as input truth tables of length N, and rejects every function over ℓ inputs that has distance at most 1/ℓ³ from AC^0[2]-circuits of size 2^{ℓ^c} (for a fixed c), while accepting a random truth table with constant probability. This is nothing but a (1/ℓ³)-tolerant natural property.
The main idea is that, with high probability over its seeds, NW_f with stretch set to N = 2^{polylog(n)} produces the truth table of a function h : {0,1}^ℓ → {0,1}, where ℓ = log(N) = polylog(n), such that the distance between h and AC^0[2]-circuits of size 2^{ℓ^c} is at most 4β*(n). As long as β*(n) is less than 1/(4ℓ³), the tolerant natural property acts as a distinguisher for NW_f; indeed, by definition, it rejects almost every truth table that could be output by NW_f, whereas it accepts a constant fraction of strings from {0,1}^N. Thus, we can directly use the natural property as a distinguisher in the NW-reconstruction algorithm to get a circuit that computes f on a (1/2 + 1/N)-fraction of the inputs in {0,1}^n. This gives a weak agnostic learner for AC^0[2] over the uniform distribution using membership queries.
Query-to-sample reductions for NW-reconstruction algorithms. Our main technical idea for this section is to show that the verifier can run the NW-reconstruction algorithm A using random examples over the uniform distribution, by interacting with a prover to answer the queries. For this, we extend the query-to-sample reduction to the NW-reconstruction algorithm.
To do this, we need to understand the query pattern of A. Define a random subcube $E_w$ as one obtained by setting a fixed set of coordinates $T \subseteq [n]$ to a uniformly random string $w \in \{0,1\}^{|T|}$, and considering the set of all Boolean strings in $\{0,1\}^n$ that are consistent with $w$. Further, we define a subcube membership query over $E_w$ as one which queries $f$ over all the $2^{n - |T|}$ many strings in $E_w$.
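The two definitions above translate directly into code; the following sketch (with toy parameters $n = 5$ and $T = \{1, 3\}$, which are illustrative rather than from the paper) enumerates a random subcube and issues a subcube membership query:

```python
import itertools
import random

def random_subcube(n, T):
    """Fix the coordinates in T to a uniformly random string w; the
    subcube E_w is the set of all n-bit strings consistent with w."""
    w = {i: random.randint(0, 1) for i in T}
    free = [i for i in range(n) if i not in T]
    cube = []
    for bits in itertools.product([0, 1], repeat=len(free)):
        x = [0] * n
        for i, b in w.items():
            x[i] = b
        for i, b in zip(free, bits):
            x[i] = b
        cube.append(tuple(x))
    return cube

def subcube_membership_query(f, cube):
    """Query f on every one of the 2^{n-|T|} strings in the subcube."""
    return {x: f(x) for x in cube}

random.seed(0)
cube = random_subcube(5, T={1, 3})
answers = subcube_membership_query(lambda x: sum(x) % 2, cube)
print(len(cube))  # 2^{5-2} = 8 strings queried
```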
To hide a random example $x$ in the query set, our next idea is that the query construction based on the hybrid argument allows us to embed $x$ in the seed. Indeed, we pick a subcube $E_w$ at random and embed $x$ in the seed at a location that corresponds to $E_w$ (i.e., $x$ projected onto $[n] \setminus T$). Following this, we prove the NW-marginal query distribution lemma, in which we show that for any query index $i$ that falls in a subcube $E_w$, the probability that the $i$-th query is a fixed string is equal to $2^{-\log(|E_w|)}$. This is the most technical lemma of this section, and the main challenge here is to account for the effect of all the overlapping sets coming from the set design on the $i$-th marginal distribution. We refer to Lemma 5.12 of the full version for more details.
The query-to-sample reduction follows a similar strategy as described earlier. While completeness holds for the same reasons, for soundness, for any query that lies in a subcube $E_w$, no malicious prover can distinguish between the distribution of $x|_{[n] \setminus T}$, where $x$ is the uniformly random example embedded in the query set, and the $i$-th query marginal distribution, since they are identical. The rest of the analysis follows from ideas similar to Section 2.1.
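The soundness argument can be illustrated by a small simulation: with hypothetical toy parameters ($n = 4$, $T = \{0, 2\}$, not from the paper), the projection of a uniform example onto the free coordinates and the uniform query marginal inside the subcube produce statistically indistinguishable histograms:

```python
import random
from collections import Counter

random.seed(0)
n, T = 4, {0, 2}                       # hypothetical toy parameters
free = sorted(i for i in range(n) if i not in T)
trials = 40000

def project(x, coords):
    return tuple(x[i] for i in coords)

# Distribution 1: a uniformly random example x, projected onto the free
# coordinates of the subcube (what the embedded example contributes).
d1 = Counter(
    project([random.randint(0, 1) for _ in range(n)], free)
    for _ in range(trials)
)

# Distribution 2: the query marginal inside the subcube, which the
# NW-marginal lemma shows is uniform over the consistent strings.
d2 = Counter(
    tuple(random.randint(0, 1) for _ in free)
    for _ in range(trials)
)

# Both are (statistically) uniform over {0,1}^{n-|T|}: a malicious
# prover cannot tell which query hides the verifier's example.
print(sorted(d1.items()), sorted(d2.items()))
```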
In order to boost the error of the hypothesis to $\mathrm{polylog}(n) \cdot \varepsilon^*$, the [3] learner learns the amplified function $\mathrm{Amp}^k f \colon \{0,1\}^{kn+k} \to \{0,1\}$, defined as $\mathrm{Amp}^k f(x_1, \ldots, x_k, r_1, \ldots, r_k) = \sum_{i=1}^{k} r_i f(x_i) \pmod 2$, where $k$ is set to $\frac{1}{4\ell^3 \varepsilon^*}$. While dealing with the final CIKK-learner, it is worth highlighting a subtlety which becomes a critical issue for the PAC-verification model. In order to get a hypothesis with low error, the learner learns $\mathrm{Amp}^k f$, for which it needs to know the value of $\varepsilon^*$ to set $k = \frac{1}{4\ell^3 \varepsilon^*}$. However, the main challenge for PAC-verification is that the verifier does not know $\varepsilon^*$; otherwise the model becomes trivial. We overcome this by running the protocol over multiple guesses of the unknown $\varepsilon^*$, and use a finer analysis of the CIKK-learner to find the right hypothesis. Details on how we extend the ideas highlighted above to the NW-reconstruction algorithm for $\mathrm{Amp}^k f$, and the finer analysis of [3], can be found in Section 5.3 of the full version.
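As a sketch, the amplified function can be written as follows, assuming an inner-product-mod-2 form of the amplification (the toy function $f$, the example blocks, and the mask bits are illustrative choices, and the paper's exact variant of $\mathrm{Amp}^k f$ may differ):

```python
def amp(f, xs, rs):
    """Amplified function: the mod-2 inner product of the mask bits r_i
    with the labels f(x_i)."""
    assert len(xs) == len(rs)
    return sum(r * f(x) for x, r in zip(xs, rs)) % 2

f = lambda x: (x[0] & x[1]) ^ x[2]      # toy 3-variable Boolean function
xs = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]  # k = 3 example blocks
rs = [1, 0, 1]                          # mask bits
print(amp(f, xs, rs))
```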

RELATED WORK
Following [14], [22] extend PAC-verification to statistical query learning algorithms, and [4] extend it to classical verification of quantum agnostic learners. A related model is that of covert learning by [1], where the goal is to ensure that no untrusted intermediary who monitors the interaction gains any information about the function or the learner.
Of particular interest, [22] show a verifier sample complexity lower bound of $\Omega(\sqrt{d})$, where $d$ is the VC-dimension of the underlying hypothesis class, in the distributional agnostic learning setting, where the learner gets samples from an unknown joint distribution $\mathcal{D}$ over $\{0,1\}^n \times \{0,1\}$. Our results on PAC-verifying $\mathsf{P/poly}$ (Theorem 1.4) indicate that we can beat this verifier sample complexity lower bound in the setting of functional agnostic learning, i.e., when the labels come from an arbitrary Boolean function (over arbitrary example marginal distributions), or in the case of $k$-juntas (Theorem 1.3), by additionally considering distributions with restricted marginals over the examples, such as the uniform distribution.
The study of interactive proofs for testing properties of distributions, initiated by [6] (see also [17]), is also relevant here. In particular, [14] note that any PAC-verification task can be formulated as a property of distributions that can be tested using an interactive proof. While a direct application of [6] shows that any property of distributions supported on $\{0,1\}^n$ can be tested using $O(2^{n/2}/\varepsilon^2)$ many samples given an unbounded prover, Theorem 1.4 gives a much stronger sample complexity of $O(1/\varepsilon)$ for the specific case of PAC-verification in this scenario.
Another related model is that of delegating a property testing task to an untrusted prover (interactive proofs of proximity, or IPPs), where the verifier has query access to the input [28]. In particular, [9] show query-to-sample reductions for IPPs. However, it is unclear whether such techniques for IPPs can be extended to PAC-verification. Moreover, extending our framework to IPPs incurs a quadratic blow-up in the sample complexity, which could render interesting query complexity regimes for IPPs trivial. [10] show that easiness of certain meta-complexity problems implies agnostic learnability of $\mathsf{P/poly}$ in polynomial time using random examples over polynomially-samplable distributions. It remains open whether this can be extended to typical circuit class restrictions.

FUTURE DIRECTIONS
Our work establishes PAC-verification protocols for learning some fundamental classes in learning theory, such as heavy Fourier coefficients, constant-depth circuits, and $k$-juntas, with techniques that could have more general applicability. We state some directions for possible future exploration.
• We show a general procedure for embedding a random example into specific query construction algorithms and, further, prove that this gives a PAC-verification protocol, for query distributions where each marginal is identical to the underlying distribution (Section 2.1), or for a set of uniformly random subcube queries (Section 2.2). An interesting direction is to explore the full generality of this framework and obtain a full distributional characterisation of the query-to-sample reduction. Along these lines, it is worth studying whether the property of embeddability can be extended to adaptive query construction algorithms.
• Another question is whether we can use the algebraic structure of $\mathsf{AC}^0[2]$ arising from [26, 29] to get better PAC-verification protocols that output a hypothesis with just an additive error, i.e., an error of $\mathsf{opt}(f, \mathsf{AC}^0[2]) + \varepsilon$.
• The role of round complexity in PAC-verification is not well-understood. Is there a hypothesis class that requires more than 2 rounds of interaction? Is there a round-hierarchy for hypothesis classes that have doubly-efficient PAC-verification protocols?