Super Non-singular Decompositions of Polynomials and their Application to Robustly Learning Low-degree PTFs

We study the efficient learnability of low-degree polynomial threshold functions (PTFs) in the presence of a constant fraction of adversarial corruptions. Our main algorithmic result is a polynomial-time PAC learning algorithm for this concept class in the strong contamination model under the Gaussian distribution with error guarantee $O_{d, c}(\text{opt}^{1-c})$, for any desired constant $c>0$, where $\text{opt}$ is the fraction of corruptions. In the strong contamination model, an omniscient adversary can arbitrarily corrupt an $\text{opt}$-fraction of the data points and their labels. This model generalizes the malicious noise model and the adversarial label noise model. Prior to our work, known polynomial-time algorithms in this corruption model (or even in the weaker adversarial label noise model) achieved error $\tilde{O}_d(\text{opt}^{1/(d+1)})$, which deteriorates significantly as a function of the degree $d$. Our algorithm employs an iterative approach inspired by localization techniques previously used in the context of learning linear threshold functions. Specifically, we use a robust perceptron algorithm to compute a good partial classifier and then iterate on the unclassified points. In order to achieve this, we need to take a set defined by a number of polynomial inequalities and partition it into several well-behaved subsets. To this end, we develop new polynomial decomposition techniques that may be of independent interest.

In this paper, we study the problem of PAC learning degree-$d$ PTFs in the presence of a constant fraction of adversarially corrupted data. More concretely, we define the following data contamination model considered in the current work. A learning algorithm in the strong contamination/nasty noise model is given as input an opt-corrupted set of examples from $C$ (see Definition 1.1), and its goal is to output a hypothesis $h$ such that, with high probability, the error $\Pr_{x \sim D_x}[h(x) \neq f(x)]$ is small, as compared to the information-theoretically optimal error of opt.
That is, in the nasty noise model [4], an omniscient adversary can arbitrarily corrupt a small constant fraction of both the data points and their labels. The nasty noise model is equivalent to the strong contamination model studied in the field of robust statistics [15, 17] and generalizes well-studied corruption models, including the agnostic (adversarial label noise) model [28, 35] and the malicious noise model [36, 48]. In the adversarial label noise (agnostic) model, the adversary can corrupt an opt-fraction of the labels, but cannot change the distribution of the unlabeled points. In the malicious model, the adversary can add an opt-fraction of corrupted labeled examples, but is not allowed to adversarially remove clean labeled examples.
The goal of this work is to understand the efficient learnability of degree-$d$ PTFs under the Gaussian distribution in the presence of nasty noise. Our main algorithmic result is the following:

Theorem 1.2 (Main Learning Result). There exists an algorithm that, given any $\epsilon, c \in (0, 1)$, has sample and computational complexity $\mathrm{poly}_{d,c}(n/\epsilon)$, and learns the class of degree-$d$ PTFs on $\mathbb{R}^n$ in the nasty noise model under the Gaussian distribution within 0-1 error $O_{d,c}(1)\,\mathrm{opt}^{1-c} + \epsilon$.
Discussion. Some comments are in order. We start by noting that our learning algorithm is not proper. Specifically, the output hypothesis is a decision list whose leaves are degree-$d$ PTFs.
It is instructive to quantitatively compare the complexity and error guarantee of Theorem 1.2 with prior work. The $L_1$-polynomial regression algorithm of [29] achieves the optimal error of $\mathrm{opt} + \epsilon$ in the (weaker) adversarial label noise model with sample and computational complexity $n^{\mathrm{poly}(d/\epsilon)}$. Moreover, the exponential complexity dependence on $1/\epsilon$ is inherent [21, 46]. The latter computational lower bounds motivate the design of faster (ideally, fully polynomial-time) algorithms with relaxed error guarantees. When restricting to fully polynomial-time algorithms (i.e., with runtime $\mathrm{poly}_d(n/\epsilon)$), [22] gave a robust learner with error guarantee $\tilde{O}_d(\mathrm{opt}^{1/(d+1)}) + \epsilon$. For $d > 1$, this was the best previously known error guarantee (for $\mathrm{poly}_d(n/\epsilon)$-time algorithms) even in the weaker adversarial label noise model. (See the following subsection for a detailed summary of prior work.) The latter error guarantee deteriorates dramatically as a function of the degree $d$. A natural question that motivated this work is whether it is possible to qualitatively nearly match the $d = 1$ case, where polynomial-time algorithms with error $O(\mathrm{opt}) + \epsilon$ are known [1, 22], for any constant degree $d$ (or even for $d = 2$!).

More concretely:
Is there a $\mathrm{poly}_d(n/\epsilon)$-time algorithm that, for any constant $d$, robustly learns degree-$d$ PTFs with error $O_{d,c}(1)\,\mathrm{opt}^{c}$, where $c > 0$ is independent of $d$? Our main result answers this question in the affirmative. Moreover, we can take the parameter $c$ above to be any constant less than 1. Achieving error $\tilde{O}_d(\mathrm{opt})$ or $O_d(\mathrm{opt})$ is left as an open question.
Finally, we reiterate that our algorithm is the first algorithm with this error guarantee even in the weaker model of adversarial label noise.
Interestingly, to obtain our algorithmic result, we generalize the localization technique [1, 3], developed in the context of learning linear threshold functions, to the problem of learning degree-$d$ PTFs. To achieve this goal, we develop the algorithmic theory of super non-singular polynomial decompositions, which we believe is of broader interest beyond learning theory.

Prior Work
In the realizable PAC learning model (i.e., with clean/consistent labels), low-degree PTFs are known to be efficiently learnable in the distribution-free setting via a reduction to linear programming [39]. Specifically, the class of degree-$d$ PTFs on $\mathbb{R}^n$ can be learned to 0-1 error $\epsilon$ with sample size $N = \tilde{O}(n^d/\epsilon)$ in $\mathrm{poly}(N)$ time. By standard VC-dimension arguments, this sample size is information-theoretically necessary for any learning algorithm.
In the presence of adversarial noise in the data (the focus of the current work), the learning problem becomes significantly more challenging computationally. Specifically, in the distribution-free setting, the agnostic learning problem (i.e., in the presence of adversarial label noise) is known to be computationally intractable, even for the special case of $d = 1$ and constant accuracy [7, 18, 46]. As a result, research in this area has focused on the distribution-specific setting, i.e., with respect to specific natural distributions on the domain, such as the Gaussian distribution.
In the distribution-specific agnostic model, the $L_1$-polynomial regression algorithm [29] learns degree-$d$ PTFs within error $\mathrm{opt} + \epsilon$ with sample and computational complexity $n^{O(d^2/\epsilon^4)}$ under the Gaussian distribution (and the uniform distribution on the hypercube) [14, 24, 25, 27, 30, 38]. Importantly, the exponential dependence on $1/\epsilon$ is inherent in the complexity of the problem, both in the Statistical Query model [20] and under standard cryptographic assumptions [21, 46] (even under the Gaussian distribution).
The aforementioned hardness results motivated the design of faster algorithms with relaxed error guarantees. Over the past fifteen years, substantial progress has been made in this direction, in particular for the special case of Linear Threshold Functions (corresponding to $d = 1$). Specifically, a sequence of works [1, 6, 22, 37] developed $\mathrm{poly}(n/\epsilon)$-time robust learners for LTFs in the malicious/nasty model (thus, also in the adversarial label noise model) under the Gaussian distribution and, in some cases, for isotropic (i.e., zero-mean, identity-covariance) log-concave distributions. In more detail, [1] gave a malicious-noise learning algorithm for homogeneous LTFs (i.e., halfspaces whose separating hyperplane goes through the origin) with the near-optimal error guarantee of $O(\mathrm{opt}) + \epsilon$ under all isotropic log-concave distributions. Subsequently, [22] gave an efficient algorithm that achieves error $O(\mathrm{opt}) + \epsilon$ for arbitrary LTFs and succeeds under the Gaussian distribution. At the technical level, [1] developed a localization method (see also [2] for a precursor) which is crucial to obtain the near-optimal error guarantee of $O(\mathrm{opt})$. In fact, the algorithm of [22] for general halfspaces proceeds by a refinement of this idea.
For the case of degree-$d$ PTFs, progress in this direction has been slow. The only prior algorithmic work on the topic is due to [22]. That work gave a $\mathrm{poly}_d(n/\epsilon)$-time algorithm that succeeds in the presence of nasty noise under the Gaussian distribution and attains an error guarantee of $\tilde{O}_d(\mathrm{opt}^{1/(d+1)}) + \epsilon$.

TECHNICAL OVERVIEW
Prior Work: Learning via Degree-$d$ Chow Parameters. A polynomial threshold function (PTF) can be thought of as a linear threshold function (LTF) composed with the Veronese map $x \mapsto x^{\otimes d}$. Thus, we can think of the question of (robustly) learning a PTF of Gaussian inputs as the problem of learning an LTF with input $\tilde{x} := x^{\otimes d}$.
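To make the reduction concrete, here is a minimal Python sketch (ours, purely illustrative; `monomials` and `feature_map` are names we chose) showing that a degree-$d$ PTF over $\mathbb{R}^n$ is exactly an LTF over the monomial (Veronese) features of degree at most $d$:

```python
import itertools
import math
import random

def monomials(n, d):
    # all exponent vectors (a_1, ..., a_n) with a_1 + ... + a_n <= d
    return [a for a in itertools.product(range(d + 1), repeat=n) if sum(a) <= d]

def feature_map(x, d):
    # the (affine) Veronese features: one coordinate per monomial of degree <= d
    return [math.prod(xi ** ai for xi, ai in zip(x, a)) for a in monomials(len(x), d)]

# the PTF sign(x1^2 - x2) corresponds to a weight vector over the monomials
n, d = 2, 2
mons = monomials(n, d)
w = [0.0] * len(mons)
w[mons.index((2, 0))] = 1.0    # coefficient of x1^2
w[mons.index((0, 1))] = -1.0   # coefficient of x2

rng = random.Random(1)
for _ in range(100):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    as_ltf = 1 if sum(wi * fi for wi, fi in zip(w, feature_map(x, d))) > 0 else -1
    as_ptf = 1 if x[0] ** 2 - x[1] > 0 else -1
    assert as_ltf == as_ptf   # the two views agree on every input
```

The feature dimension here is $\binom{n+d}{d}$ (6 for $n = d = 2$), which is why the complexity of feature-based approaches scales as $n^{O(d)}$.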
A common approach to learning PTFs (and other related geometric concept classes) in the literature is via (low-degree) Chow-parameter fitting; see, e.g., [8, 9, 12, 22, 47]. More precisely, in [22], the approach is to find an LTF (as a function of the tensor features $\tilde{x}$) by fitting its low-degree Chow parameters. The Chow-parameter fitting approach requires two crucial assumptions about the distribution of $\tilde{x} = x^{\otimes d}$. First, it requires concentration bounds in order to show that the adversarial noise cannot significantly affect the relevant Chow parameters $\mathbb{E}_{(\tilde{x}, y)}[y \tilde{x}]$. Gaussian hypercontractivity indeed implies that polynomials of Gaussian random variables enjoy strong concentration. In addition to this, showing that a small error in the Chow parameters translates to a small error in the total variation distance of the corresponding threshold function requires some anti-concentration properties of the underlying distribution. More precisely, it requires showing that any linear function of $\tilde{x}$ is not-too-small with high probability. We can obtain this using results of [5]; but, unfortunately, the anti-concentration provided is weak, showing only a bound of the form $\Pr[|p(x)| \leq \eta \|p\|_2] = O_d(\eta^{1/d})$ for any degree-$d$ polynomial $p(\cdot)$. This translates quantitatively to an algorithm that robustly learns degree-$d$ PTFs to error $O(\mathrm{opt}^{1/\Theta(d)})$, a far cry from our goal of error $O(\mathrm{opt}^{1-c})$, especially when $d$ is large.
Our Approach: Learning PTFs Using Perceptron and Localization. Our high-level plan to improve upon the error guarantee of $O(\mathrm{opt}^{1/\Theta(d)})$ is via the method of localization, a powerful approach for learning with corrupted labels; see [1-3]. For technical reasons, our starting point is an early instantiation of this technique [3], developed in the context of learning LTFs with random label noise. At a high level, localization consists of first learning some LTF that achieves good error for all large-margin points, and then conditioning on low-margin examples to learn a new (or improve the current) hypothesis. Importantly, all localization-based algorithms require that, after conditioning on the low-margin region $|w \cdot x| < \gamma$, the resulting distribution satisfies strong anti-concentration properties. While this property is true for learning LTFs under the Gaussian distribution, it completely fails to hold under the conditional distribution of low-margin points with respect to a PTF, i.e., $|p(x)| \leq \sigma$. Our approach consists of two new ingredients: (i) a robust version of the localized margin-perceptron algorithm for learning PTFs under weaker (anti-)concentration assumptions, and (ii) a localization process for PTFs so that the corresponding conditional distributions satisfy (anti-)concentration. In the following presentation, we first focus on the localization process for PTFs, and then present our robust margin-perceptron learning algorithm.

PTF Localization via Partitioning
Naive localization fails. We first investigate why naively conditioning on the low-margin region $|p(x)| < \sigma$ fails to satisfy the required anti-concentration property when $p(x)$ has degree larger than 1. In particular, consider the polynomial $p(x_1, x_2) = x_1^2 x_2^2$. To simplify the calculations, we first observe that the set $U = \{x : |p(x)| \leq \sigma\}$ is essentially a union of two axis-aligned rectangles of width roughly $\sqrt{\sigma}$ around the coordinate axes (see the left figure in Figure 1). The probability of $U$ under the standard Gaussian distribution is roughly $\sqrt{\sigma}$, so we still need to learn a classifier inside it (note that we could simply ignore a region of mass $O(\epsilon)$). We examine the anti-concentration property of a different polynomial, namely $x_1$, under the Gaussian distribution conditioned on the union of the two rectangles $U$. It is not hard to see that the $L_2$-norm of $x_1$ is $\Theta(1)$ under the conditional distribution (the points within the green rectangle in Figure 1 have large $x_1$ coordinate with constant probability). To give some intuition of what would be "good" anti-concentration, we remark that for a polynomial whose $L_2$-norm is constant, we would like the probability of $|q(x_1, x_2)| < \sigma$ to be roughly $\mathrm{poly}(\sigma)$ (as is indeed the case for the Gaussian, by [5]). However, $x_1$ turns out to have much worse anti-concentration conditional on $U$: the mass of the set $x_1^2 < \sigma$ conditioned on $U$ is roughly $\Theta(1)$, due to the contribution of the rectangle around the $x_2$-axis. Thus, a naive localization procedure, which tries to reapply a learner on the low-margin conditional distribution directly, is unlikely to work as long as the learner requires any non-trivial anti-concentration property.
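A quick Monte Carlo check (our sketch, not from the paper) illustrates the failure numerically: conditioned on the low-margin region of $p(x_1, x_2) = x_1^2 x_2^2$, the linear polynomial $x_1$ places constant mass near zero, rather than the small, $\mathrm{poly}(\sigma)$-type mass that Gaussian-style anti-concentration would predict:

```python
import random

rng = random.Random(0)
sigma = 1e-4   # margin parameter

# rejection-sample the standard Gaussian conditioned on the low-margin region
inside = []
while len(inside) < 2000:
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    if (x1 * x1) * (x2 * x2) <= sigma:   # low-margin region of p = x1^2 x2^2
        inside.append((x1, x2))

# conditional mass that x1 places near zero: Theta(1), not poly(sigma)-small
frac_small = sum(1 for (x1, _) in inside if abs(x1) <= 0.1) / len(inside)
# for the unconditioned standard Gaussian, Pr[|x1| <= 0.1] is only about 0.08
```

By the $x_1 \leftrightarrow x_2$ symmetry of the region, roughly half of its conditional mass sits on the strip around the $x_2$-axis, so `frac_small` comes out near $1/2$ no matter how small $\sigma$ is.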
Localization via Partitioning. A way to preserve the (anti-)concentration properties in the previous example is to (approximately) partition the region where $|x_1^2 x_2^2| < \sigma$ into small (axis-aligned) rectangles (see the right figure in Figure 1). The Gaussian distribution conditioned on each rectangle is a log-concave distribution, and thus has good concentration and anti-concentration. Hence, we could attempt to use the margin-perceptron learner on each of the conditional distributions. As one of our main contributions, we give an efficient algorithm that finds such a partition for any low-degree polynomial. In particular, for a degree-$d$ polynomial $p$, we show that the low-margin area $|p(x)| \leq \sigma$ can be partitioned into polynomially many subsets such that the Gaussian distribution conditioned on each of them satisfies strong (anti-)concentration properties.
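As a sanity check (an illustrative sketch, not the paper's partitioning algorithm), consider a single axis-aligned rectangle $[-w, w] \times [1, 2]$ with $w = \sqrt{\sigma}/2$, which is contained in the low-margin region of $p = x_1^2 x_2^2$. The Gaussian conditioned on this rectangle is a product of truncated Gaussians, hence log-concave, and the coordinate $x_1$ now anti-concentrates at the scale of the rectangle:

```python
import random

rng = random.Random(0)
sigma = 1e-2
w = sigma ** 0.5 / 2.0   # [-w, w] x [1, 2] lies inside {x1^2 * x2^2 <= sigma}

def trunc_gauss(rng, lo, hi):
    # N(0, 1) conditioned on [lo, hi], by rejection (fine for these intervals)
    while True:
        v = rng.gauss(0.0, 1.0)
        if lo <= v <= hi:
            return v

# the Gaussian conditioned on an axis-aligned rectangle is a product of
# truncated Gaussians, so the two coordinates can be sampled independently
samples = [(trunc_gauss(rng, -w, w), trunc_gauss(rng, 1.0, 2.0))
           for _ in range(1000)]

t = 0.1
frac = sum(1 for (x1, _) in samples if abs(x1) <= t * w) / len(samples)
# frac is now roughly proportional to t (good anti-concentration),
# instead of the Theta(1) mass seen before partitioning
```

Since the Gaussian density is nearly flat on $[-w, w]$, the conditional law of $x_1$ is close to uniform there, so `frac` is close to $t$.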
Theorem 2.1 (Informal: Partitioning the Low-Margin Region of Polynomials). Fix $\sigma \in (0, 1)$ and let $p : \mathbb{R}^n \to \mathbb{R}$ be a polynomial of degree at most $d$. There exists an efficient algorithm that (approximately) decomposes the set $\{x \in \mathbb{R}^n : |p(x)| < \sigma\}$ into $m = \mathrm{poly}_d(1/\sigma)$ sets $A^{(1)}, \ldots, A^{(m)} \subseteq \mathbb{R}^n$ such that $N(0, I)$ conditioned on each $A^{(i)}$ satisfies good anti-concentration.

Super Non-Singular Decomposition. To get some intuition of how the partition routine operates, we revisit the example $p(x_1, x_2) = x_1^2 x_2^2$, which
can be decomposed into the linear terms $x_1$ and $x_2$. Conditioning the Gaussian density on a rectangle of the form $x_1 \in I_1, x_2 \in I_2$ yields a log-concave distribution with good anti-concentration. More generally, if we could always decompose a polynomial $p$ into a small number of linear polynomials $q_1, \ldots, q_\ell$, then we would still have anti-concentration in the resulting conditional distributions after partitioning the region $|p(x)| < \sigma$ into rectangles defined by the linear terms, i.e., every $q_i(x)$ lies in an interval $I_i$. Unfortunately, this is not possible for a general polynomial $p$. However, such a decomposition will exist if we allow the set of polynomials $q_1, \ldots, q_\ell$ (or, more precisely, the polynomial mapping $x \mapsto q(x) := (q_1(x), \ldots, q_\ell(x))$) to only resemble a linear transformation locally. To achieve this, we will need to leverage and generalize the results of [32] on non-singular decompositions, which itself builds on the techniques of diffuse decompositions from [33]. In particular, we say that a collection of polynomials $q_1, q_2, \ldots, q_\ell$ is non-singular if there is only a negligible probability that the Jacobian of the (vector-valued) polynomial transformation $q(x)$ (i.e., the matrix $[\nabla q_1(x)\ \nabla q_2(x)\ \cdots\ \nabla q_\ell(x)]$) has small singular values. Intuitively, when this is the case, the polynomial transformation $q$ will locally resemble a non-singular linear transformation. In [32], it is shown that for any polynomial $p$ of degree at most $d$, there exists a non-singular set of $O_d(1)$ polynomials $q_1, \ldots, q_\ell$ so that $p$ can approximately be written as a polynomial in the $q_i$'s. It turns out that having a non-singular decomposition is not enough to establish the anti-concentration properties that we require. We introduce the notion of super non-singularity, which enforces "local linearity" by restricting the high-order derivatives of the polynomials. In particular, we establish two structural results on super non-singular sets of polynomials. First, we establish that the Gaussian distribution conditioned on a super non-singular set of polynomials, each lying in some interval, satisfies good anti-concentration and concentration properties.
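The non-singularity condition can be probed numerically. The hedged sketch below (maps, sample size, and thresholds are our illustrative choices) estimates, over Gaussian inputs, how often the Jacobian of a two-coordinate polynomial map has a small singular value, contrasting a map that is typically well-conditioned with a degenerate one whose Jacobian is always singular:

```python
import math
import random

def smin_2x2(a, b, c, d):
    # smallest singular value of [[a, b], [c, d]], via eigenvalues of A^T A
    T = a * a + b * b + c * c + d * d     # trace of A^T A
    D = a * d - b * c                     # det(A), so det(A^T A) = D^2
    disc = max(T * T - 4.0 * D * D, 0.0)
    lam_min = (T - math.sqrt(disc)) / 2.0
    return math.sqrt(max(lam_min, 0.0))

rng = random.Random(0)
eta = 1e-3           # "small singular value" threshold
N = 20000
bad_ok, bad_deg = 0, 0
for _ in range(N):
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # q(x) = (x1, x1*x2): Jacobian [[1, 0], [x2, x1]] -- rarely near-singular
    if smin_2x2(1.0, 0.0, x2, x1) <= eta:
        bad_ok += 1
    # q(x) = (x1, x1^2): Jacobian [[1, 0], [2*x1, 0]] -- singular everywhere
    if smin_2x2(1.0, 0.0, 2.0 * x1, 0.0) <= eta:
        bad_deg += 1
```

For the first map, a tiny singular value forces $|x_1|$ itself to be tiny, so the bad event has probability $O(\eta)$; for the second, the two gradient columns are parallel everywhere, so the bad event always occurs.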
Equipped with the above structural and algorithmic results, obtaining an efficient partition algorithm is relatively straightforward. After computing a super non-singular decomposition $q_1, \ldots, q_\ell$ of $p$ using Theorem 2.3, we have that, since $p$ can be approximately expressed as a polynomial $h(q_1(x), \ldots, q_\ell(x))$, the value of $p(x)$ is (approximately) determined by the values of the $q_i(x)$. Therefore, we can show that the set $\{x \in \mathbb{R}^n : |p(x)| < \sigma\}$ can be approximately covered by sets of the form $\{x : (q_1(x), \ldots, q_\ell(x)) \in B\}$, where $B$ is an $\ell$-dimensional axis-aligned rectangle. Hence, anti-concentration properties of the conditional distributions on these sets follow. This allows us to perform at least one round of localization by partitioning the set $A = \{x : |p(x)| \leq \sigma\}$. Assuming that we have obtained a polynomial $p'$ that achieves good error in the region $A$, we then need to "localize" on the region $A' = \{x : |p(x)| \leq \sigma, |p'(x)| \leq \sigma'\}$. We show that this is possible by a subtle "extendibility" property of our super non-singular decomposition algorithm. Specifically, assuming that we have a super non-singular decomposition $q_1, \ldots, q_\ell$ of $p$ (which was used to compute $p'$), we can then extend it into a larger super non-singular set $q_1, \ldots, q_m$ that also decomposes $p'$. For each rectangle $B$ of the original partition of $A$, we can now further cover the region $\{x \in \mathbb{R}^n : (q_1(x), \ldots, q_\ell(x)) \in B, |p'(x)| \leq \sigma'\}$ by sets of the form $\{x : (q_1(x), \ldots, q_m(x)) \in B \times B'\}$, where $B'$ is some other $(m - \ell)$-dimensional axis-aligned rectangle. As a slight digression, we note that this extendibility property is also what makes the proof of Theorem 2.2 possible.

Anti-concentration via Extendible Super Non-Singular Decomposition
We have already seen (see Figure 1) that the Gaussian distribution conditioned on sets of the form $\{x : |p(x)| \leq \sigma\}$ for generic polynomials $p$ does not satisfy good anti-concentration.
To mitigate this issue, we need the polynomials appearing in the conditioning to collectively satisfy a strong non-singularity condition concerning their high-order derivatives. In the following definition, we denote by $\nabla$ the standard gradient operator and by $\partial_y$ the derivative in the direction $y$.
We remark that Definition 2.4 resembles the definition of non-singular polynomials in [32], but imposes additional requirements on the high-order derivatives of the polynomials. This additional structure turns out to be crucial in proving Theorem 2.3, which is itself an important building block to establish Theorem 2.5. As one of our main contributions, we show that the distribution $N(0, I)$, conditioned on a set of super non-singular polynomials each lying in some interval (satisfying some mild conditions), satisfies good polynomial concentration and anti-concentration properties.
For brevity, we henceforth refer to both concentration and anti-concentration as (anti-)concentration.
Theorem 2.5 (Informal: Conditional (Anti-)Concentration for SNPT, see Theorem 2.2). Let $q$ be a degree-$d$, "sufficiently" super non-singular polynomial transformation (i.e., $(\delta, k)$-super non-singular for sufficiently small $\delta$ and large $k$ in Definition 2.4). Let $B \subseteq \mathbb{R}^m$ be an axis-aligned rectangle that is not too far from the origin, and let $D$ be $N(0, I)$ conditioned on the set $\{x : q(x) \in B\}$. Then, for any unit-variance, mean-zero polynomial $p$ of degree at most $d$, the distribution of $p(x)$ for $x \sim D$ satisfies good (anti-)concentration bounds.

We now provide a sketch of the high-level ideas behind the proof of the above theorem. In what follows, we denote by $D(p)$ the distribution of the random variable $p(x)$ when $x \sim D$. Let $D$ be the distribution of $x \sim N(0, I)$ conditioned on $q(x) \in B$ for a rectangle $B$. Our goal is to show that for any low-degree polynomial $p$, the distribution $D(p)$ has good anti-concentration.
Constructing a Low-Dimensional Surrogate Distribution. As our first step, instead of directly analyzing the (anti-)concentration properties of $p$ under the $n$-dimensional distribution $D$, which is challenging, we construct low-dimensional "surrogates" for $D$ and $p$. Specifically, we consider a low-dimensional distribution $\tilde{D}$ together with a polynomial $\tilde{p}$, such that the outcome $\tilde{D}(\tilde{p})$ enjoys roughly the same concentration and anti-concentration properties as $D(p)$.
Given the construction, we can in turn focus on analyzing this low-dimensional surrogate pair.
The fact that super non-singular decompositions are "extendible" will play a critical role in this construction. In particular, the super non-singular polynomials $\{q_1, \ldots, q_\ell\}$ appearing in the conditioning of $D$ will first be extended into a super non-singular decomposition for the target polynomial $p$: there exists a larger set of super non-singular polynomials $q_1, \ldots, q_m$ and a composition polynomial $h : \mathbb{R}^m \to \mathbb{R}$ such that $p(x) \approx h(q_1(x), \ldots, q_m(x))$. Define $q : \mathbb{R}^n \to \mathbb{R}^m$ to be the vector-valued polynomial whose $i$-th coordinate is $q_i$, and $\tilde{D}$ to be the $m$-dimensional distribution of $q(N(0, I))$ conditioned on the set $\{y \in \mathbb{R}^m : y_i \in I_i\ \forall i = 1, \ldots, \ell\}$. Then, provided the intervals appearing in the conditioning satisfy some mild conditions and the polynomials appearing in the conditioning are sufficiently super non-singular, one can verify that $\tilde{D}(h)$ enjoys roughly the same (anti-)concentration properties as $D(p)$. For the details of this argument, we refer the reader to the full version of the paper.
Given such a construction, we can now shift our focus from the $n$-dimensional distribution $D$ to the $m = O_{d,\ell}(1)$-dimensional conditional distribution defined by a set of super non-singular polynomials. In particular, if we use $q : \mathbb{R}^n \to \mathbb{R}^m$ to denote the vector-valued polynomial whose $i$-th coordinate is $q_i$, we are interested in the distribution of $q(N(0, I))$ conditioned on the event $\{q_i(x) \in I_i\}_{i=1}^{\ell}$, where each $I_i$ is some interval. To build some intuition as to why such conditional distributions may have desirable properties, we can start with the simple case where all $q_i$ are linear functions. In that case, the distribution of $q(N(0, I))$ is simply some other Gaussian distribution $N(\mu, \Sigma)$. Then, even if we condition on the event that the $i$-th coordinate of $q(N(0, I))$ is equal to $a_i$, the resulting distribution will simply be some lower-dimensional Gaussian distribution. Under mild conditions on $a$ and $q$, the resulting low-dimensional Gaussian will not be too different from a standard Gaussian, in the sense that its mean is not very far from the origin and its covariance is bounded above and below by multiples of the identity.

Definition 2.6 ($(\delta, R)$-Reasonable Gaussian). Let $N(\mu, \Sigma)$ be a Gaussian distribution. Given $\delta \in (0, 1)$ and $R > 1$, we say $N(\mu, \Sigma)$ is a $(\delta, R)$-reasonable Gaussian if $\|\mu\|_2 \leq R$ and $\delta \cdot I \preceq \Sigma \preceq R \cdot I$.
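In two dimensions, checking Definition 2.6 reduces to the closed-form eigenvalues of a $2 \times 2$ covariance matrix. A minimal sketch (the function name is ours, and it assumes the eigenvalue bounds $\delta I \preceq \Sigma \preceq R I$ as reconstructed above):

```python
import math

def is_reasonable(mu, Sigma, delta, R):
    """Check whether N(mu, Sigma) is a (delta, R)-reasonable 2-d Gaussian:
    ||mu||_2 <= R and delta*I <= Sigma <= R*I (via eigenvalue bounds)."""
    a, b, c = Sigma[0][0], Sigma[0][1], Sigma[1][1]   # symmetric [[a, b], [b, c]]
    tr, det = a + c, a * c - b * b
    disc = max(tr * tr - 4.0 * det, 0.0)
    lam_min = (tr - math.sqrt(disc)) / 2.0
    lam_max = (tr + math.sqrt(disc)) / 2.0
    return math.hypot(mu[0], mu[1]) <= R and lam_min >= delta and lam_max <= R

# the standard Gaussian is reasonable; a badly stretched one is not
ok = is_reasonable((0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]], delta=0.5, R=2.0)
bad = is_reasonable((0.0, 0.0), [[10.0, 0.0], [0.0, 0.01]], delta=0.5, R=2.0)
```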
When the polynomials are of degree more than 1, it becomes hard to characterize the exact form of $q(N(0, I))$. Nonetheless, the hope is that we can still compare its probability density function to that of some other, more structured distribution family.

Definition 2.7 (Distribution Comparability). Let $D, D'$ be probability distributions with the same support. We say that $D$ and $D'$ are comparable if, for all $x$ in their common support, the ratio of their density functions is bounded above and below by absolute constants.

We show that if two distributions are comparable to each other, then they will have similar (anti-)concentration properties, even under an arbitrary conditioning. The formal statement of this fact and its proof can be found in the full version of the paper.
Super Non-Singular Polynomial Transformations of Gaussians are Reasonable. Given a polynomial transformation $x \mapsto q(x)$, we will say that $q$ is super non-singular if the set of its polynomial coordinates $q_i(x)$ is super non-singular. We show that a super non-singular transformation $q$ behaves similarly to a linear transformation, in the sense that the distribution $q(N(0, I))$ is comparable to a mixture of reasonable Gaussians.

Proposition 2.8 (Informal: Super Non-Singular Polynomial Transformations are Reasonable). Let $q$ be a $(\delta^{1/(3d)}, k)$-super non-singular polynomial transformation for some sufficiently small $\delta$ and large $k$. Then, $q(N(0, I))$ is $O(\delta)$-close in total variation distance to some distribution that is comparable to a mixture distribution $\int N_u \, du$, where each $N_u$ is a $(\delta, \log^{O(d)}(1/\delta))$-reasonable Gaussian.
We now provide a proof sketch for Proposition 2.8. We first note that $q(N(0, I))$ has a distribution identical to $q(\sqrt{1 - \epsilon^2}\, x + \epsilon z)$, for $\epsilon \in (0, 1)$ an appropriately chosen real number and $x, z$ distributed as two i.i.d. Gaussians. Fixing the value of $x$ and Taylor expanding around $z$, we find that $q(\sqrt{1 - \epsilon^2}\, x + \epsilon z)$ is approximately $g_x + \epsilon\, \mathrm{Jac}_q(x)\, z + O(\epsilon^2)\, e_x(z)$, where $\mathrm{Jac}_q$ represents the Jacobian of the transformation $q$, $g_x$ is some vector that depends only on $x$, and $e_x$ is some degree-$d$ polynomial. It turns out that if $q$ consists of super non-singular polynomials, $\mathrm{Jac}_q(x)$ will have no small singular values with high probability. Conditioned on some fixed value of $x$ that makes $\mathrm{Jac}_q(x)$ non-singular, the distribution of $g_x + \epsilon\, \mathrm{Jac}_q(x)\, z$, where $z \sim N(0, I)$, will be a reasonable Gaussian. We remark that the transformation still has the high-order error term $O(\epsilon^2)\, e_x(z)$ that we have to bound. We notice that the coefficient in front of the high-order term is significantly smaller than the minimum singular value of the linear component. As a result, the distribution produced by the transformation will still be close in total variation distance to some distribution comparable to a reasonable Gaussian distribution. We refer to the full version of the paper for more details.
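The first-order expansion can be verified numerically on a toy coordinate map. In the sketch below (our illustrative choice $q(x) = x_1 x_2$, so $\mathrm{Jac}_q(x) = (x_2, x_1)$ and the Taylor remainder is exactly $\epsilon^2 z_1 z_2$), the error of the linear surrogate scales as $O(\epsilon^2)$:

```python
import math
import random

def q(x1, x2):
    # a single degree-2 coordinate of a polynomial map
    return x1 * x2

rng = random.Random(0)
eps = 1e-3
s = math.sqrt(1.0 - eps * eps)   # the sqrt(1 - eps^2) scaling of x

max_err = 0.0
for _ in range(1000):
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    exact = q(s * x1 + eps * z1, s * x2 + eps * z2)
    # first-order surrogate g_x + eps * Jac_q(s*x) z, with Jac_q(x) = (x2, x1)
    approx = q(s * x1, s * x2) + eps * (s * x2 * z1 + s * x1 * z2)
    max_err = max(max_err, abs(exact - approx))
# for this bilinear q the remainder is exactly eps^2 * z1 * z2 = O(eps^2)
```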
We remark that all of the above analysis is done for a fixed value of $x$ that ensures non-singularity of $\mathrm{Jac}_q(x)$. Hence, to conclude the proof, we simply need to take a mixture over the values of $x$ following the standard Gaussian distribution. By super non-singularity, $\mathrm{Jac}_q(x)$ has no small singular values with high probability. Consequently, most of the distributions within the mixture will be comparable to a reasonable Gaussian distribution. Proposition 2.8 thereby follows.
Given Proposition 2.8 and the definition of comparability, we conclude that the transformation $q$ conditioned on an axis-aligned rectangle enjoys good (anti-)concentration properties. By our construction, the target polynomial $p$ under the target distribution $D$ enjoys roughly the same (anti-)concentration properties as some polynomial $h$ under $q(N(0, I))$ conditioned on an axis-aligned rectangle. The proof of Theorem 2.2 follows.

Efficiently Extending a Super Non-Singular Decomposition
In this section, we discuss our efficient algorithm for obtaining and extending a super non-singular decomposition. [32] shows that any polynomial of degree at most $d$ can be approximately decomposed into a non-singular polynomial set of size at most $O_d(1)$. In Theorem 2.9, we show that this is also true for the notion of super non-singularity. Theorem 2.9 extends and strengthens the result of [32] in two ways: (i) we are able to decompose multiple (as opposed to just one, as in [32]) generic polynomials into a common set of super non-singular polynomials, and (ii) we are able to do so when the generic polynomials arrive in an online fashion. In particular, given a super non-singular set of polynomials $Q$ obtained while decomposing some polynomials $p_1, \ldots, p_t$ in the past rounds, after receiving a new polynomial $p_{t+1}$, we are able to extend $Q$ into a larger set of super non-singular polynomials $Q'$ and decompose $p_{t+1}$ in terms of $Q'$. We remark that the fact that we can keep extending a super non-singular set of polynomials to ensure it can be used to represent increasingly many polynomials is a unique characteristic of super non-singular decompositions (compared to their "non-super" counterpart). Crucially, this additional "extendibility" property of the decomposition is what makes the (anti-)concentration result (Theorem 2.5) and the polynomial set partitioning routine (Theorem 2.1) possible. In the following result, we present our efficient algorithm for extending a super non-singular decomposition.
Suppose we only want to compute a non-singular decomposition for a polynomial $p$. The process given in [32] maintains a data structure to which we refer as a partial decomposition. Informally, the data structure keeps track of a list of polynomials $q_1, \ldots, q_\ell$ (which is not necessarily non-singular), a coefficient vector $b \in \mathbb{R}^\ell$, and a composition polynomial $h$ such that $p(x) \approx h(q_1(x), \ldots, q_\ell(x))$. If the list of polynomials is already non-singular, we are done. Otherwise, following the definition of non-singularity, there exists a linear combination of the polynomials $p^*(x) = \sum_i b_i\, q_i(x)$ such that the gradient of the combined polynomial $p^*$ is small with non-trivial probability under the Gaussian distribution. In the second case, we show that we can approximately decompose the combined polynomial $p^*$ into a set of lower-degree polynomials $r_1, \ldots, r_s$. Hence, we can rewrite one of the polynomials which has non-trivial weight in the linear combination in terms of the set of newly obtained lower-degree polynomials and the remaining polynomials in the linear combination. We then end up with a new partial decomposition consisting of these polynomials and a new composition polynomial $h'$. It turns out that such a rewriting strategy will always decrease the total weight of the polynomials that have the same degree as the rewritten polynomial in the new coefficient vector $b'$, but may end up increasing the weights of the other polynomials in the linear combination. However, if we always choose to rewrite the highest-degree (or one of the highest-degree) polynomials in the linear combination, it is then guaranteed that we will have fewer and fewer high-degree polynomials in the decomposition. This process must then eventually terminate and give us a non-singular set of polynomials.
In order to adapt the above strategy for extendible super non-singular decomposition, a caveat here is that, for the additional extendibility property to hold, we are not allowed to rewrite any of the initial polynomials $q_1, \ldots, q_\ell$. Fortunately, if a set of polynomials does not satisfy super non-singularity, we are always capable of finding a linear combination of polynomials of the same degree such that the combined polynomial has small gradient with non-trivial probability. Therefore, no matter which polynomial we choose to rewrite, it will always have the highest degree among the polynomials in the linear combination (since all of them have the same degree!). The induction argument for showing the termination of the process will then go through, giving us a super non-singular decomposition algorithm.
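The termination argument is a well-founded-ordering argument on the multiset of degrees: each rewriting step removes one highest-degree polynomial and inserts finitely many polynomials of strictly smaller degree. A toy Python model (ours, purely illustrative; it tracks only degrees, not actual polynomials) that always halts:

```python
import random

def rewrite_until_stable(degrees, branching, rng):
    """Toy model of the rewriting process: while a polynomial of degree >= 2
    remains, replace one highest-degree entry by up to `branching` entries of
    strictly smaller degree. Terminates because the multiset of degrees
    strictly decreases in the (well-founded) multiset ordering."""
    degrees = sorted(degrees, reverse=True)
    steps = 0
    while degrees and degrees[0] >= 2:
        d = degrees.pop(0)                  # rewrite a highest-degree polynomial
        k = rng.randint(1, branching)       # it decomposes into k lower-degree ones
        degrees.extend(rng.randint(1, d - 1) for _ in range(k))
        degrees.sort(reverse=True)
        steps += 1
    return degrees, steps

rng = random.Random(0)
final, steps = rewrite_until_stable([5, 3], branching=3, rng=rng)
# every remaining entry has degree 1, after finitely many steps
```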

Learning via Localization and Margin-Perceptron
At a high level, our learning algorithm can be viewed as a robust version of the margin-perceptron algorithm of [26]. In particular, we establish the following: given sample access to a distribution corrupted by opt-nasty noise, the margin-perceptron is a semi-agnostic LTF learner when the underlying (uncorrupted) $x$-marginal is "sufficiently" (anti-)concentrated.
Proposition 2.10 (Informal). Let $\sigma \in (0, 1)$, $d \in \mathbb{Z}_+$, and let $D$ be a distribution on $\mathbb{R}^n \times \{\pm 1\}$ whose $x$-marginal satisfies suitable concentration and anti-concentration assumptions. Then there is a modified perceptron algorithm that semi-agnostically learns LTFs from opt-corrupted samples from $D$.

Our modified perceptron algorithm of Proposition 2.10 uses a robust subroutine of [22] for estimating the Chow parameters of the LTF under nasty noise. For example, in the first round, we learn a PTF $p$ that achieves error $\mathrm{opt}^{1-c}$ for all high-margin points in the set $\{x : |p(x)| \geq \sigma \|p\|_2\}$. We then localize (condition) on the set of low-margin points $\{x : |p(x)| \leq \sigma \|p\|_2\}$, and use the partitioning algorithm of Theorem 2.1 to partition the above region into sets $A^{(1)}, \ldots, A^{(m)}$ such that the standard normal conditioned on those sets has good (anti-)concentration. We then again condition on each set of this partition and use the robust perceptron algorithm of Proposition 2.10 to learn a PTF inside each set, and continue recursively until the probability mass of the "unclassified" low-margin region is at most $O(\epsilon)$. Our final hypothesis is therefore a decision list of degree-$d$ PTFs: one PTF for each set of the partition. For more details, we refer the reader to the full version of the paper.
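At the skeleton level, the recursive structure of the final learner can be sketched as follows (our illustrative pseudocode in Python; `learn_ptf` and `partition_low_margin` are stand-ins for the robust margin-perceptron of Proposition 2.10 and the partitioning routine of Theorem 2.1, respectively):

```python
def robust_localized_learner(data, learn_ptf, partition_low_margin,
                             margin, max_depth):
    """Sketch of the outer loop: learn a PTF, keep it for the high-margin
    region, recurse on each piece of the partitioned low-margin region.
    Returns a decision list of (region-test, PTF) pairs."""
    decision_list = []

    def recurse(points, depth):
        if not points or depth == max_depth:
            return
        p = learn_ptf(points)   # robust margin-perceptron (Proposition 2.10)
        # this leaf classifies only the high-margin points |p(x)| >= margin
        decision_list.append((lambda x, p=p: abs(p(x)) >= margin, p))
        low = [x for x in points if abs(p(x)) < margin]
        for piece in partition_low_margin(p, low):   # Theorem 2.1 partition
            recurse(piece, depth + 1)

    recurse(data, 0)
    return decision_list

def classify(decision_list, x, default=1):
    # scan the decision list; the first matching leaf labels the point
    for in_region, p in decision_list:
        if in_region(x):
            return 1 if p(x) > 0 else -1
    return default

# toy run with stub subroutines (1-d points, a fixed "learned" polynomial,
# and a trivial one-piece partition)
data = [[0.5], [2.0], [-1.0], [0.05]]
learn = lambda pts: (lambda x: x[0])
part = lambda p, low: [low] if low else []
dl = robust_localized_learner(data, learn, part, margin=0.1, max_depth=3)
```

With these stubs, high-margin points are labeled by the first leaf and the lone low-margin point falls through to the default label once the depth budget runs out.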

Definition 1.1 (Strong Contamination Model). Let $C$ be a class of Boolean functions on $\mathbb{R}^n$, $D_x$ a distribution over $\mathbb{R}^n$, and $f \in C$ an unknown target function. For $0 < \mathrm{opt} < 1/2$, we say that a set of labeled examples is an opt-corrupted set of examples from $C$ if it is obtained using the following procedure: First, we draw a set $S = \{(x^{(i)}, y_i)\}$ of $N$ labeled examples, $1 \leq i \leq N$, where for each $i$ we have that $x^{(i)} \sim D_x$, $y_i = f(x^{(i)})$, and the $x^{(i)}$'s are independent. Then an omniscient adversary, upon inspecting the set $S$, is allowed to remove an opt-fraction of the examples and replace these examples by the same number of arbitrary examples of its choice. The modified set of labeled examples is the opt-corrupted set.
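As an illustrative sketch (ours, not the paper's; `flip_adversary` is one hypothetical adversary), the sampling procedure of Definition 1.1 can be simulated directly: draw clean Gaussian examples labeled by a target PTF, then let an omniscient adversary replace an opt-fraction of them:

```python
import random

def corrupted_sample(f, n, N, opt, adversary, rng):
    """Draw N clean labeled Gaussian examples, then let the adversary replace
    an opt-fraction with arbitrary labeled examples (Definition 1.1)."""
    clean = []
    for _ in range(N):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        clean.append((x, f(x)))
    k = int(opt * N)                 # number of examples the adversary may replace
    corrupted = adversary(clean, k)  # omniscient: it sees the whole clean set
    assert len(corrupted) == N       # same cardinality, per the definition
    return corrupted

# toy degree-2 target PTF: sign(x1^2 - 1)
f = lambda x: 1 if x[0] * x[0] - 1.0 > 0 else -1

def flip_adversary(clean, k):
    # one simple adversary: replace the k largest-margin examples (x, y)
    # by (x, -y), i.e., flip their labels (a special case of replacement)
    s = sorted(clean, key=lambda xy: -abs(xy[0][0] ** 2 - 1.0))
    return [(x, -y) for (x, y) in s[:k]] + s[k:]

rng = random.Random(0)
S = corrupted_sample(f, n=2, N=1000, opt=0.1, adversary=flip_adversary, rng=rng)
disagreements = sum(1 for (x, y) in S if y != f(x))   # exactly opt*N of them
```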

Figure 1: The localization region $|p(x_1, x_2)| = |x_1^2 x_2^2| \leq \sigma$ is shown in blue. It is essentially a union of two rectangles (shown in the left figure) of width roughly $\sqrt{\sigma}$. It is easy to see that (i) the total mass of the union is roughly $\sqrt{\sigma}$; (ii) the expected value of $x_1^2$ conditioned on the union is roughly $\Theta(1)$ (due to the contribution of the green rectangle). If the conditional distribution were a Gaussian, Carbery-Wright anti-concentration would imply that the conditional probability of $x_1^2 < \sigma$ should be at most $\mathrm{poly}(\sigma)$. In sharp contrast, the mass of the set $x_1^2 < \sigma$ conditioned on the union is roughly $\Theta(1)$ (due to the contribution of the orange rectangle). To mitigate the issue, we will partition the low-margin set $|p(x_1, x_2)| \leq \sigma$ into multiple rectangles as in the right figure. Since the Gaussian conditioned on each rectangle is a log-concave distribution, we have the desirable (anti-)concentration properties by [5].