New Tools for Smoothed Analysis: Least Singular Value Bounds for Random Matrices with Dependent Entries

We develop new techniques for proving lower bounds on the least singular value of random matrices with limited randomness. The matrices we consider have entries that are given by polynomials of a few underlying base random variables. This setting captures a core technical challenge for obtaining smoothed analysis guarantees in many algorithmic settings. Least singular value bounds often involve showing strong anti-concentration inequalities that are intricate and much less understood compared to concentration (or large deviation) bounds. First, we introduce a general technique involving hierarchical $\epsilon$-nets to prove least singular value bounds. Our second tool is a new statement about least singular values that reasons about higher-order lifts of smoothed matrices, and the action of linear operators on them. Apart from yielding simpler proofs of existing smoothed analysis results, we use these tools to handle more general families of random matrices. This allows us to produce smoothed analysis guarantees in several previously open settings. These include new smoothed analysis guarantees for power sum decompositions, subspace clustering and certifying robust entanglement of subspaces, where prior work could only establish least singular value bounds for fully random instances or only show non-robust genericity guarantees.


INTRODUCTION
Over the past two decades, there has been significant progress in using algebraic methods for high-dimensional statistical estimation (e.g., [2]). Techniques like tensor decomposition have been used for parameter estimation in mixture models [3, 10, 14], shallow neural networks [5, 25], stochastic block models [2], and more [26]. Recently, more sophisticated decomposition methods based on tensor networks [21], circuit complexity [12] and algebraic geometry [12, 19] have given rise to new algorithms for many problems in high-dimensional geometry and parameter estimation. These algorithms start by building appropriate algebraic structures that "encode" the hidden parameters of interest. Then, they use the algebraic techniques described above to recover the solution.
Unfortunately, in most of these applications, the recovery problem turns out to be NP-hard in general. So the algorithms have provable recovery guarantees only under certain algebraic conditions. Typically, these conditions can be formulated in terms of appropriately defined matrices being well-conditioned, i.e., having a non-negligible least singular value. Furthermore, the least singular value determines the sample complexity and running time, so it is important to obtain inverse polynomial bounds. Now it is natural to ask: do the algebraic conditions typically hold? Due to NP-hardness, we know there exist parameters for which the conditions do not hold. But how common or rare are such parameter settings/instances? A strong way to address this question is via the framework of smoothed analysis, developed in the seminal work of Spielman and Teng [23, 27, 28]. A condition is said to hold in a smoothed analysis setting if for any instance, a small random perturbation of magnitude, say, $\rho = 1/n^2$, where $n$ is the input size, results in an instance that satisfies the condition with high probability. Smoothed analysis guarantees show that any potential bad instance is isolated or degenerate: most other instances in a small ball around it have good guarantees. On the one hand, smoothed analysis gives a much stronger guarantee than average-case analysis, where one shows that the condition holds w.h.p. for a random choice of parameters from some distribution. On the other hand, it provides quantitative, robust analogs of genericity results in algebraic settings, which are needed in most algorithmic applications.
Considering the flavor of the algebraic non-degeneracy conditions, the problem of smoothed analysis boils down to the following: given a matrix M whose entries are functions (typically polynomials) of some base variables, does randomly perturbing the variables result in M having a non-negligible least singular value with high probability?
This question is non-trivial even in very specialized settings, as it is a statement about anti-concentration, a topic that is less understood in probability theory than concentration or large deviation bounds. For example, when the underlying variables form a matrix $U \in \mathbb{R}^{n \times m}$, the structured matrix $M = U \odot U = (u_i \otimes u_i)_{i \in [m]}$, where $\odot$ represents the Khatri-Rao product, has been the subject of much past work [4, 9, 11] that developed intricate arguments specialized for this setting. Least singular value bounds on $M = \widetilde{U} \odot \widetilde{U}$ for a randomly perturbed $\widetilde{U}$ have led to smoothed analysis guarantees for several problems including tensor decomposition [9], recovering assemblies of neurons [4], parameter estimation of latent variable models like mixtures of Gaussians [13], hidden Markov models [11], independent component analysis [15] and even learning shallow neural networks [5]. Another approach is to use concentration bounds to prove lower bounds on the least singular value [7, 22, 29] for analyzing random instances; these techniques based on concentration bounds cannot handle smoothed instances. We lack a broader toolkit that allows us to analyze more general classes of random matrices that arise in many other smoothed analysis settings of interest.
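To make the objects concrete, here is a small numerical sketch (our own illustration, with illustrative parameters) of the smoothed Khatri-Rao setting: an adversarial base matrix $U$ can make $\sigma_{\min}(U \odot U)$ zero, while a random $\rho$-perturbation restores a non-negligible least singular value.

```python
import numpy as np

def khatri_rao(U):
    """Column-wise Kronecker product: column i is u_i (x) u_i."""
    n, m = U.shape
    return np.stack([np.kron(U[:, i], U[:, i]) for i in range(m)], axis=1)

rng = np.random.default_rng(0)
n, m, rho = 10, 20, 0.1

# Adversarial base matrix: all columns equal, so U (.) U has rank 1.
U = np.ones((n, m))
sigma_bad = np.linalg.svd(khatri_rao(U), compute_uv=False)[-1]

# A rho-perturbation of U: independent Gaussian noise in every entry.
U_tilde = U + rho * rng.standard_normal((n, m))
sigma_smooth = np.linalg.svd(khatri_rao(U_tilde), compute_uv=False)[-1]
```

The precise rate at which `sigma_smooth` scales with $\rho$ and $1/n$ is exactly what the least singular value bounds discussed here quantify.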
Consider, for example, the symmetric lift of the matrix $\widetilde{U}$, represented by $\widetilde{U}^{\circledast 2} := (\tilde{u}_i \otimes \tilde{u}_j + \tilde{u}_j \otimes \tilde{u}_i : 1 \le i \le j \le m)$, where the columns (up to reshaping) give a basis for the space of all the symmetric matrices that are supported on the subspace $\widetilde{\mathcal{U}}$. Here $\circledast$ denotes the symmetrized Kronecker product.

Question 1.1. For a linear operator $\Phi$ acting on the space of symmetric $n \times n$ matrices (e.g., a projection matrix), can we obtain an inverse polynomial lower bound with high probability on the least singular value of the matrix $\Phi\,\widetilde{U}^{\circledast 2}$, when $m \le cn$ for a sufficiently small $c \in (0, 1)$?
The new techniques developed in this paper, to our knowledge, give the first inverse polynomial lower bound on the least singular value of $M = \Phi\,\widetilde{U}^{\circledast 2}$, and its higher-order generalizations; see Theorem 1.4. As it turns out, this already captures the Khatri-Rao product $\widetilde{U} \odot \widetilde{U}$ setting as a special case, by choosing the operator $\Phi$ appropriately. One interpretation of the statement is that $\widetilde{U}^{\circledast 2}$ acts like a "truly random" subspace of the lifted space $\mathrm{Sym}(\mathbb{R}^n \otimes \mathbb{R}^n)$ with the same dimension. With high probability, a random subspace of $\mathrm{Sym}(\mathbb{R}^n \otimes \mathbb{R}^n)$ with dimension $o(n^2)$ will not contain any vector near the kernel of $\Phi$. The affirmative answer to the above question shows that the lifted space corresponding to the column space of $\widetilde{U}^{\circledast 2}$ behaves similarly and is far from the kernel of $\Phi$! In other words, it is rotationally well-spread; it is not too aligned with any specific subspace. Note that $\widetilde{U}$ only has about $nm$ truly independent coordinates or "bits", whereas a random subspace of the same dimension has $c \cdot n^2 m^2$ independent coordinates. Hence the lift $\mathcal{U}^{\circledast 2}$ of a smoothed subspace $\mathcal{U}$ is "pseudorandom": it acts like a random subspace in the lifted space with respect to all linear operators of reasonable rank.
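The lifted matrix and the action of an operator on it can be written down directly. The following sketch (small illustrative dimensions; a random projection-like $\Phi$ standing in for the operator in Question 1.1) builds $\widetilde{U}^{\circledast 2}$ and checks that $\Phi\,\widetilde{U}^{\circledast 2}$ has a non-negligible least singular value on one random instance:

```python
import numpy as np
from itertools import combinations_with_replacement

rng = np.random.default_rng(1)
n, m, rho = 8, 3, 0.1

U = rng.standard_normal((n, m))               # arbitrary base matrix
U_t = U + rho * rng.standard_normal((n, m))   # rho-perturbation

# Columns of the symmetric lift: u_i (x) u_j + u_j (x) u_i for 1 <= i <= j <= m.
cols = [np.kron(U_t[:, i], U_t[:, j]) + np.kron(U_t[:, j], U_t[:, i])
        for i, j in combinations_with_replacement(range(m), 2)]
L = np.stack(cols, axis=1)                    # shape (n^2, m(m+1)/2)

# A linear operator of large rank on the lifted space, here a random map
# onto R^(n^2 / 2); this is a stand-in, not the paper's specific operator.
R = n * n // 2
Phi = rng.standard_normal((R, n * n)) / np.sqrt(n * n)

sigma_min = np.linalg.svd(Phi @ L, compute_uv=False)[-1]
```

A single random instance is of course no proof; the theorems below show such bounds hold for every base matrix $U$, with high probability over the perturbation.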
Matrices of this flavor arise in open questions about the smoothed analysis of various algebraic algorithms for problems like robust certification of quantum entanglement in subspaces, certifying distance from varieties [19], and decomposition into sums of powers of polynomials [7, 12]. Specifically, rank-1 matrices (of unit norm) correspond to separable or non-entangled states in bipartite quantum systems. For a certain specific choice of $\Phi$, the positive resolution of Question 1.1 certifies that a smoothed subspace of $n_1 \times n_2$ matrices of dimension $c\,n_1 n_2$ (for some $c > 0$) is far from any rank-1 matrix of unit norm. Moreover, the recent algebraic algorithms of [7, 12] consider generic or random subspaces $\mathcal{U}_1, \mathcal{U}_2, \ldots, \mathcal{U}_t \subset \mathbb{R}^n$ and need to argue that the corresponding $d$-th order lifts $\mathcal{U}_1^{\circledast d}, \mathcal{U}_2^{\circledast d}, \ldots, \mathcal{U}_t^{\circledast d}$ are far from each other. Our results give a novel and modular way to analyze such matrices. Our contributions are twofold:
• We give new tools for proving least singular value lower bounds via $\epsilon$-nets. This involves identifying a key property that is sufficient for carrying out net-based arguments, and giving a new tool for proving that such a property holds.
• We consider matrices that have the structure of a linear operator applied to higher-order lifts corresponding to the Kronecker product, and give new techniques to reason about the least singular value. This resolves open questions raised in [7, 12, 19].

Our Results
1.1.1 Hierarchical Nets. Our first set of results focuses on $\epsilon$-net based arguments for proving bounds on least singular values. Given a random matrix M, the idea is to consider a fixed "test" vector $\alpha$, prove that $\|M\alpha\|$ is large enough with high probability, and then take a union bound over "all possible vectors $\alpha$". As the set of candidate $\alpha$ is infinite, the idea is to take a fine enough net over possible vectors $\alpha$. The challenge when dealing with structured matrices (of the kind discussed above) is that for a single test vector $\alpha$, we do not obtain a sufficiently strong probability guarantee. This is because the individual columns of M may not have "sufficient randomness", and since we do not know how $\alpha$ spreads its mass across columns, the bound will be weak. Our main observation is that in the matrices we consider for our applications, as long as $\alpha$ is well spread, we can obtain a much stronger bound. We refer to this as a "combination amplifies anticoncentration" (CAA) property of M.
CAA Property (Informal Definition). We say that M has the CAA property if for every $k \ge 1$ and any test vector $\alpha$ that has $k$ entries of magnitude $\ge \delta$, we have $\|M\alpha\| \ge \Omega(\delta)$ with probability $1 - \exp(-\omega(k))$.
Formally, to capture the $\omega(k)$ term, we have a parameter $\beta$; see Definition 4.1 for details. Our first result is that for any matrix with this property, we have a lower bound on $\sigma_{\min}(M)$.
Informal Theorem 1.2. Suppose M is a random matrix with $m$ columns that satisfies the CAA property with parameter $\beta > 0$. Then with high probability (indeed, with exponentially small probability of failure), we have $\sigma_{\min}(M) > \mathrm{poly}(1/m)$.

(See Theorem 4.2 for the formal statement.) The proof uses a novel $\epsilon$-net construction. Nets that use structural properties of the test vector $\alpha$ have been used in prior works in the context of proving least singular value bounds, notably in the celebrated work of Rudelson and Vershynin [24]. In proving our result, the natural approach of constructing a hierarchy of nets based on increasing $k$ (and using some threshold $\delta$) does not work. Informally, this is because the error from ignoring entries that are slightly smaller than $\delta$ can add up significantly, causing the argument to fail. We introduce a new hierarchical construction that overcomes this problem.
The next question we consider is how to prove that the CAA property holds in a particular context. This can be shown via a direct argument when M is simple, e.g., a random matrix with independent entries. However, for matrices with more structured entries, it can require a careful analysis. To handle this, we develop a new tool for proving anticoncentration that we believe is of independent interest.
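For intuition, the direct argument for a fully random M can be checked numerically. Below is a small simulation (our own illustration, not from the paper) where $\alpha$ has $k$ entries of magnitude $\delta$: the empirical probability that $\|M\alpha\|$ falls below $\delta$ drops sharply as $k$ grows, which is exactly the "combination amplifies anticoncentration" phenomenon.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, delta, trials = 30, 40, 0.1, 2000

def failure_rate(k):
    """Empirical P[||M alpha|| < delta] over fresh Gaussian matrices M,
    for a test vector alpha with k entries of magnitude delta."""
    alpha = np.zeros(m)
    alpha[:k] = delta
    fails = 0
    for _ in range(trials):
        M = rng.standard_normal((n, m)) / np.sqrt(n)
        if np.linalg.norm(M @ alpha) < delta:
            fails += 1
    return fails / trials

rate_k1 = failure_rate(1)   # a single large entry: failure is common
rate_k9 = failure_rate(9)   # nine large entries: failure is very rare
```

Here $M\alpha \sim N(0, \|\alpha\|^2/n \cdot I_n)$, so $\|M\alpha\|$ concentrates around $\delta\sqrt{k}$ and the failure probability decays exponentially in $k$, matching the $1 - \exp(-\omega(k))$ shape of the CAA property in this fully random case.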
1.1.2 Anti-concentration of a Vector of Polynomials. Consider $P(x) := (p_1(x), p_2(x), \ldots, p_N(x))$, where each $p_i$ is a polynomial in $n$ "base" random variables. Suppose we wish to show anti-concentration bounds for $P(\tilde{x})$, where $\tilde{x}$ is a perturbation of some $x$ (i.e., we wish to show that the probability that $P(\tilde{x})$ lies within a small ball around a point $y$ is small, for all $y$). One hope is to use a coordinate-wise bound (e.g., using known results like [30]) and take the product over the coordinates $1, 2, \ldots, N$. It is easy to see that this is too good to be true: consider an example where the $p_i$ are all equal; here having $N$ coordinates is the same as having just one. So we need a good metric for "how different" the polynomials $p_i$ are at a typical $x$. We capture this notion using the Jacobian of the polynomial map $P$. Recall that the Jacobian $J(x)$ is a matrix with one column per $p_i$, containing the vector of partial derivatives $\nabla p_i(x)$.
Jacobian rank property (Informal Definition). We say that $P(x)$ has the Jacobian rank property if for every $x$, at a slightly perturbed point $\tilde{x}$, the Jacobian $J(\tilde{x})$ has at least $k$ singular values that are large enough (where $k$ is a parameter).
We refer to Definition B.1 for the formal statement. Our result here is that this property implies anticoncentration.

Informal Theorem 1.3. Suppose $P(x)$ defined as above satisfies the Jacobian rank property with parameter $k$. Then for a perturbation $\tilde{x}$ of any point $x$, we have that for all $y$, $\Pr[\|P(\tilde{x}) - y\| < \epsilon] < \exp(-\Omega(k))$.

(Here, $\epsilon$ is a quantity that depends on the dimensions, $k$, the perturbation, and the singular value guarantee; see Theorem 4.7 for the formal statement.) Intuitively, the Jacobian having several large singular values must result in anticoncentration (because $P(x)$ locally behaves linearly). However, the challenging aspect is that the Jacobian need not always have many large singular values. Our assumption (the Jacobian rank property) is itself made for a perturbed vector, i.e., we assume that $J(\tilde{x})$ has many large singular values with high probability. Further, the magnitude of these singular values will depend on the perturbation: if a "bad" $x$ was perturbed by $\rho$, $J(\tilde{x})$ will have most of the large singular values being $\approx \rho$. Dealing with this issue turns out to be the main challenge in proving the theorem (see Theorem 4.7 for a formal statement).
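As a concrete toy example (our own, not from the paper), take $p_i(x) = \langle a_i, x \rangle^2$. The Jacobian then has one column $\nabla p_i(x) = 2\langle a_i, x\rangle a_i$ per coordinate, and its rank cleanly separates a "diverse" map from the degenerate all-equal map mentioned above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 6, 4

A = rng.standard_normal((N, n))      # p_i(x) = <a_i, x>^2

def jacobian(A, x):
    """n x N matrix whose i-th column is grad p_i(x) = 2 <a_i, x> a_i."""
    return np.stack([2 * (A[i] @ x) * A[i] for i in range(len(A))], axis=1)

x = rng.standard_normal(n)
rank_typical = np.linalg.matrix_rank(jacobian(A, x))

# Degenerate map: all p_i equal. The Jacobian collapses to rank 1, so no
# coordinate-wise anti-concentration bound can be multiplied across coordinates.
A_equal = np.tile(A[0], (N, 1))
rank_equal = np.linalg.matrix_rank(jacobian(A_equal, x))
```

At a generic $x$ the diverse map has a full-rank Jacobian, while the degenerate map's Jacobian has rank one, which is precisely the distinction the Jacobian rank property formalizes.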
As an application of the Jacobian rank method, we re-prove the main results of [9] and [4]. They consider random matrices M where the $i$-th column is $\tilde{u}_i \otimes \tilde{v}_i$, and $\tilde{u}_i, \tilde{v}_i$ are perturbed vectors in $\mathbb{R}^n$. We show that this M satisfies the CAA property, and thus our first result (above) implies a condition number lower bound. In order to prove the CAA property, we consider a combination of the columns $\sum_i \alpha_i (\tilde{u}_i \otimes \tilde{v}_i)$ and prove that if $\alpha$ has $k$ entries of magnitude $\ge \delta$, then the Jacobian has $\ge nk/2$ large singular values. Using our second result, we obtain a strong anticoncentration bound, thus completing the proof. This technique also lets us tackle Question 1.1 described above, but in what follows, we describe a different technique that also generalizes to higher orders.
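This Jacobian is easy to write down explicitly: the map $(u, v) \mapsto \sum_i \alpha_i (u_i \otimes v_i)$ has derivative blocks $\alpha_i (I_n \otimes v_i)$ and $\alpha_i (u_i \otimes I_n)$. The sketch below (illustrative sizes; the $\ge nk/2$ count is checked only numerically on one random instance) assembles it and counts the large singular values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 5, 6, 3

U = rng.standard_normal((n, m))
V = rng.standard_normal((n, m))
alpha = np.zeros(m)
alpha[:k] = 1.0                      # k "large" coefficients

# Jacobian of P(u, v) = sum_i alpha_i (u_i (x) v_i), an n^2 x 2nm matrix.
I = np.eye(n)
blocks = []
for i in range(m):
    blocks.append(alpha[i] * np.kron(I, V[:, [i]]))   # dP / d u_i  (n^2 x n)
    blocks.append(alpha[i] * np.kron(U[:, [i]], I))   # dP / d v_i  (n^2 x n)
J = np.concatenate(blocks, axis=1)

svals = np.linalg.svd(J, compute_uv=False)
num_large = int(np.sum(svals > 1e-8))
```

For each index $i$ with $\alpha_i \neq 0$, the two blocks together span a subspace of dimension $2n - 1$, so generically the Jacobian has well over $nk/2$ nonzero singular values.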
1.1.3 Structured Matrices from Kronecker Products. Next, we consider a general class of structured matrices that are obtained by taking the symmetrized Kronecker product of a $\rho$-perturbation $\widetilde{U}$ of an underlying matrix $U$ and applying a linear operator $\Phi$. Here, $\widetilde{U}$ being a $\rho$-perturbation of $U$ means $\widetilde{U} = U + N$, where $N$ has independent $\mathcal{N}(0, \rho^2)$ entries. In other words, the matrix of interest is $M = \Phi\,\widetilde{U}^{\circledast d}$, where $d$ is a constant. For such a matrix, we can ask: are there conditions on $\Phi$ under which we can prove that $\sigma_{\min}(M)$ is large, with high probability over the perturbation? We provide an affirmative answer to this question in terms of the rank of $\Phi$.
This question captures a variety of settings studied previously. For example, [11] studies matrices M whose columns are tensor products of some underlying vectors (i.e., the columns have the form $\tilde{u}_i^{\otimes d}$). This turns out to be a special case of our setting above. Likewise, in the work of [7], one of the matrices they consider is an M formed by concatenating the Kronecker products of a collection of underlying matrices, and the analysis of their algorithm relies on $\sigma_{\min}(M)$ being non-negligible. This also falls into our setting by choosing $\Phi$ appropriately (as we show in Corollary 5.3). Finally, as we discuss in our applications, the setting $M = \Phi\,\widetilde{U}^{\circledast d}$ also directly appears in the work of [19]. The following is an informal statement of our result. $\mathrm{Sym}^d(\mathbb{R}^n)$ will refer to a symmetrization of $(\mathbb{R}^n)^{\otimes d}$. Also, as before, $\sigma_{\min}$ corresponds to right singular vectors.
Informal Theorem 1.4. Let $\Phi$ be a linear operator on $\mathrm{Sym}^d(\mathbb{R}^n)$ of rank at least $\delta n^d$ for some constant $\delta > 0$, and let $U$ be any $n \times m$ matrix. Let $\widetilde{U}$ be a $\rho$-perturbation of $U$. Then as long as $m \le cn$ for some constant $c$, we have, with probability at least $1 - \exp(-\Omega(n))$, that $\sigma_{\min}(\Phi\,\widetilde{U}^{\circledast d}) \ge \mathrm{poly}(\rho, 1/n)$.

(See Theorem 5.1 for a formal statement.) Note that the above Theorem 1.4 with $d = 2$ answers Question 1.1 affirmatively. It also proves a similar statement about how the column space of a $d$-th order lift $\widetilde{U}^{\circledast d}$ behaves like a random subspace of the lifted space of the same dimension with respect to linear operators of reasonable rank in the lifted space, even though we have only $dnm$ random "bits" as opposed to $\Omega_d((mn)^d)$. As we describe in Section 2, the proof relies on first moving to non-symmetric products via a new decoupling argument. In the case of non-symmetric products, we end up having to analyze the least singular value of a matrix of the form $\Phi(\widetilde{U}^{(1)} \otimes \cdots \otimes \widetilde{U}^{(d)})$. This can be interpreted as a "modal contraction" (or dimension reduction along the modes) defined by $\{\widetilde{U}^{(i)}\}$ applied to the tensor $\Phi$. We then show how to analyze such smoothed modal contractions, which ends up being one of our technical contributions (see Section 2.3 and Theorem 5.2).

Applications.
Certifying distance from a variety and quantum entanglement. Our first application is to the problem of certifying that a variety is "far" from a generic linear subspace. As a simple motivation, suppose we have a linear subspace $\mathcal{X}$ of dimension $\delta n$ in $\mathbb{R}^n$ (assume $\delta < 1/2$). Then for a randomly $\rho$-perturbed subspace $\widetilde{\mathcal{U}}$ of dimension $m < n/2$, we can show that the two spaces have no overlap in a strong sense: every unit vector $u \in \mathcal{X}$ is at a distance $\Omega(\rho)$ from $\widetilde{\mathcal{U}}$. It is natural to ask if a similar statement holds when $\mathcal{X}$ is an algebraic variety (as opposed to a subspace). This problem also has applications to quantum information (see [19] and references therein). Furthermore, we can ask if there is an efficient algorithm that can certify that every unit vector in $\mathcal{X}$ is far from $\widetilde{\mathcal{U}}$. We answer both these questions in the affirmative.
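The subspace-versus-subspace motivating claim is easy to sanity-check numerically. The sketch below (a sampled check over random unit vectors, not a certificate, with illustrative dimensions and a random rather than adversarial $\mathcal{X}$) measures the distance from unit vectors of a fixed subspace to a perturbed one:

```python
import numpy as np

rng = np.random.default_rng(4)
n, dim_X, m, rho = 40, 10, 15, 0.1

X = np.linalg.qr(rng.standard_normal((n, dim_X)))[0]   # fixed subspace basis
B = rng.standard_normal((n, m))                        # arbitrary base for U
B_t = B + rho * rng.standard_normal((n, m))            # rho-perturbed basis
Q, _ = np.linalg.qr(B_t)                               # orthonormal basis of U~

# dist(u, U~) is the norm of u's component orthogonal to U~; sample it
# over random unit vectors of X.
dists = []
for _ in range(200):
    c = rng.standard_normal(dim_X)
    u = X @ (c / np.linalg.norm(c))
    dists.append(np.linalg.norm(u - Q @ (Q.T @ u)))
min_dist = min(dists)
```

The theorems here are much stronger: they bound the distance for *every* unit vector of $\mathcal{X}$ simultaneously, even when $\mathcal{X}$ is chosen adversarially or is a variety rather than a subspace.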
Informal Theorem 1.5. Suppose $\mathcal{X} \subset \mathbb{R}^n$ is an irreducible variety cut out by $\delta\binom{n+d-1}{d}$ homogeneous degree-$d$ polynomials. There exists a $c > 0$ such that for any $\rho$-perturbed subspace $\widetilde{\mathcal{U}}$ of dimension at most $cn$, with probability $1 - \exp(-\Omega(n))$, every unit vector in $\mathcal{X}$ has distance at least $\mathrm{poly}(\rho, 1/n)$ from $\widetilde{\mathcal{U}}$. Further, this can be certified by an efficient algorithm.

(See Theorem D.1 for the formal statement.) The recent work of [19] gave an algorithm that we also use, but our new least singular value bounds imply the quantitative distance lower bound stated above. Applying this theorem with the variety of rank-1 matrices gives the following direct corollary.

Corollary 1.6. There is a polynomial time algorithm that, given a $\rho$-perturbed subspace $\widetilde{\mathcal{U}}$ of $n_1 \times n_2$ matrices of dimension at most $c\,n_1 n_2$ (for some universal constant $c > 0$), certifies w.h.p. that $\widetilde{\mathcal{U}}$ is at least $\mathrm{poly}(\rho, 1/n)$ far from every rank-1 matrix of unit norm.
The above also has direct implications for robustly certifying entanglement of different kinds, which we describe in Section D.
Decomposing sums of powers of polynomials. Our second application is to the problem of "decomposing power sums" of polynomials, a question that has applications to learning mixtures of distributions. In the simplest setting, [12] and [7] consider the following problem: given a polynomial $p(\mathbf{x})$ that can be expressed as $p(\mathbf{x}) = \sum_{t} a_t(\mathbf{x})^3 + e(\mathbf{x})$, where the $a_t$ are quadratic polynomials and $e(\mathbf{x})$ is a small enough error term, the goal is to recover the $a_t$. The work of [7] gave an algorithm for this problem, but their analysis relies on certain non-degeneracy conditions, which can be formulated as lower bounds on the least singular values of appropriate matrices. They prove that these conditions hold if the instances (i.e., the polynomials $a_t$) are random, using the machinery of graph matrices [1]. However, the question of obtaining a smoothed analysis guarantee is left open. As discussed earlier, a smoothed analysis guarantee is much stronger than a guarantee for random instances, as it shows that even in the neighborhood of hard instances, most instances are easy.
Their analysis requires least singular value bounds for various matrices that arise from higher-order lifts and polynomials of some underlying random variables. For example, they require least singular value bounds on matrices of the form $S\,\widetilde{U}^{\circledast 3}$, for a specific symmetrization operator $S$ that acts on the lifted space. Another type of matrix that they analyze is a block Kronecker product arising from different partial derivatives. These kinds of matrices are ideal candidates for our techniques.
Informal Theorem 1.7. For the matrices M arising in the analysis of [7], a $\rho$-perturbation of the parameters of the $a_t$ results in $\sigma_{\min}(M) \ge \mathrm{poly}(\rho, 1/n)$, with probability $1 - \exp(-\mathrm{poly}(m, n))$.

(This corresponds to the formal statements of Propositions E.1, E.2, and E.3.) These least singular value bounds allow us to conclude that the algorithm of [7] indeed has a smoothed analysis guarantee. In Section E, we outline the algorithm of [7], identify the different non-degeneracy conditions required, and show that each of these conditions holds for smoothed/perturbed polynomials $a_t$. Interestingly, we can avoid the technically heavy machinery of graph matrices, while obtaining stronger (smoothed) results. We hope our new techniques can also help obtain smoothed analysis guarantees for other algebraic methods like the framework of [12].

PROOF OVERVIEW AND TECHNIQUES

2.1 Improved Net Analyses
$\epsilon$-Nets and limitations. The classic approach to proving least singular value bounds is an $\epsilon$-net argument. The argument proceeds by trying to prove that $\|M\alpha\|$ is large for all $\alpha$ in the unit sphere. It does so by constructing a fine "net" over points in the sphere with the properties that (a) the net has a small number of points, and hence a union bound can establish the desired bound for points in the net, and (b) for every other point $\alpha$ in the sphere, there is a point $\alpha'$ in the net that is close enough, and hence the bound for $\alpha'$ "translates" to a bound for $\alpha$. However, in settings where the columns $\tilde{x}_i$ of M have "limited randomness", this approach cannot be applied in many parameter regimes of interest. The simplest example is one where each $\tilde{x}_i$ is of the form $\tilde{u}_i \otimes \tilde{u}_i$, where $\tilde{u}_i \in \mathbb{R}^n$ and we have around $m = n^2/4$ such vectors. In this case, (a) above causes a problem: the size of a net for unit vectors in the sphere in $\mathbb{R}^m$ is $\exp(m) = \exp(n^2/4)$. This is much too big for applying a union bound, since each column only has "$n$ bits" of randomness, so the failure probability we can obtain for a general $\alpha$ is $\exp(-n)$. For this specific example, the works [4, 9] overcome this limitation by considering more ad-hoc methods for showing least singular value bounds, not based on $\epsilon$-nets.
Main idea from Section 4.1. As described above, the limited randomness in each column $\tilde{x}_i$ limits the probability with which we can show that $\|M\alpha\|$ is large. However, we observe that in many settings, as long as we consider an $\alpha$ that is spread out, we can show that $\|M\alpha\|$ is large with significantly better probability. Informally, in this case, the randomness across many different columns gets "accumulated", thus amplifying the resulting bound. We refer to this phenomenon as combination amplifies anticoncentration (CAA) (described informally in Section 1.1; see Definition 4.1). Our first theorem states that the CAA property automatically implies a lower bound on $\sigma_{\min}(M)$ with high probability.
To outline the proof of the theorem, let us consider some unit vector $\alpha \in \mathbb{R}^m$. If $\alpha$ has, say, $m/2$ "large enough" entries, then the CAA property implies that $\|M\alpha\|$ is non-negligible with probability $1 - \exp(-m)$ (roughly), and so we can take a union bound over a (standard) $\epsilon$-net, and we would be done. However, suppose $\alpha$ had only $k$ entries that are large enough (defined as $\ge \delta$ for some threshold), with $k \ll m$. In this case, the CAA property implies that $\|M\alpha\| \ge c\delta$ with probability roughly $1 - \exp(-k)$. While this is large enough to allow a union bound over just the large entries of $\alpha$ (placing a zero in the other entries), the problem is that there can be many entries in $\alpha$ that are just slightly smaller than $\delta$. In this case, having $\|M\alpha_\delta\| \ge c\delta$ (where $\alpha_\delta$ is the vector $\alpha$ restricted to the entries of magnitude $\ge \delta$, with zeros everywhere else) does not let us conclude that $\|M\alpha\| > 0$, unless $c$ is very large. Since we cannot ensure that $c$ is large, we need a different argument.
The idea will be to use the fact that our definition of the CAA property comes with a slack parameter $\beta$. In particular, for $\alpha$ as above with $k$ values of magnitude $\ge \delta$, it allows us to take a union bound over $k \cdot m^\beta$ entries. Thus, if we knew that there are at most $k \cdot m^\beta$ entries that are "slightly smaller" (by a factor roughly $\theta$) than $\delta$, we could include them in the $\epsilon$-net. Defining $\theta$ appropriately, we can ensure that the problem described above (where the slightly smaller entries cancel out $M\alpha_\delta$) does not occur. The problem now is when $\alpha$ has more than $k \cdot m^\beta$ entries of magnitude between $\theta\delta$ and $\delta$. While this is indeed a problem for this value of $\delta$, it turns out that we can then work with $\theta\delta$ instead. The problem can recur, but it cannot recur more than $1/\beta$ times (because each time, $k$ grows by an $m^\beta$ factor). This allows us to define a hierarchical net, which helps us identify a threshold $\delta$ for which the ratio between the number of entries of magnitude $\ge \theta\delta$ and the number of magnitude $\ge \delta$ is smaller than $m^\beta$.
By carefully bounding the sizes of all the nets and setting $\theta$ appropriately, Theorem 4.2 follows.
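The threshold-selection step at the heart of the hierarchical net can be sketched as follows (our own illustrative rendering of the recursion; the paper's actual net construction is more involved):

```python
import numpy as np

def stable_threshold(alpha, theta, beta):
    """Walk thresholds delta, theta*delta, ... starting at 1/sqrt(m) (a unit
    vector always has an entry at least this large). Stop at the first delta
    where one more theta-step grows the count of large entries by at most an
    m^beta factor. Since the count multiplies by more than m^beta on every
    failed step and is capped at m, the loop runs at most ~1/beta times."""
    m = len(alpha)
    mag = np.abs(alpha)
    delta = 1.0 / np.sqrt(m)
    k_big = np.count_nonzero(mag >= delta)
    while True:
        k_next = np.count_nonzero(mag >= theta * delta)
        if k_next <= k_big * m ** beta:
            return delta, k_big
        delta, k_big = theta * delta, k_next

# One large entry, many entries just below the first threshold, and a tail:
# the recursion steps down once and then stabilizes.
alpha = np.array([0.9] + [0.2] * 8 + [0.01] * 7)   # m = 16
delta, k = stable_threshold(alpha, theta=0.5, beta=0.5)
```

At the returned threshold, the entries in $[\theta\delta, \delta)$ number at most an $m^\beta$ factor more than those $\ge \delta$, which is exactly the condition the hierarchical net argument needs.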

2.2 Jacobian-Based Anticoncentration
As described in Section 1.1, proving smoothed analysis bounds often requires dealing with a vector of polynomials $P$ in some underlying variables $x$. The goal is to show that for every $x$, evaluating $P$ at a $\rho$-perturbed point $\tilde{x}$ gives a vector that is not too small in magnitude. (A slight generalization is to show that $P(\tilde{x})$ is not too close to any fixed $y$.) We first observe that such a statement is not hard to prove if we know that the Jacobian $J(x)$ of $P(x)$ has many large singular values at every $x$, and if the perturbation $\rho$ is small enough. This is because around the given point $x$, we can consider the linear approximation of $P(\tilde{x})$ given by the Jacobian. As long as the perturbation has a large enough projection onto the span of the corresponding singular vectors of $J(x)$, $P(\tilde{x})$ can be shown to have the desired anticoncentration properties (by using the standard anticoncentration result for Gaussians). Finally, if $J(x)$ has $k$ large singular values, a random $\rho$-perturbation will have a large enough projection onto the span of the singular vectors with probability $1 - \exp(-k)$. Now, in the applications we are interested in, the polynomials $P$ tend to have the Jacobian property above for "typical" points $x$, but not all $x$. Our main result here is to show that this property suffices. Specifically, suppose we know that for every $x$, the Jacobian at a $\rho$-perturbed point has $k$ singular values of large magnitude with high probability. Then, in order to show anticoncentration, we view the $\rho$-perturbation of $x$ as occurring in two independent steps: first perturb by $\rho\sqrt{1 - z^2}$ for some parameter $z$, and then perturb by $\rho z$. The key observation is that for Gaussian perturbations, this is identical to a $\rho$-perturbation!
This gives an approach for proving anticoncentration. We use the fact that the first perturbation yields a point with sufficiently many large Jacobian singular values with high probability, and combine this with our earlier result (discussed above) to show that if $z$ is small enough, the linear approximation can indeed be used for the second perturbation; this yields the desired anticoncentration bound.
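The Gaussian two-step identity underlying this argument is elementary, and can be verified empirically (a quick check of the variance bookkeeping, with arbitrary choices of $\rho$ and $z$):

```python
import numpy as np

rng = np.random.default_rng(5)
rho, z, trials = 0.3, 0.25, 200_000

# Two-step perturbation: first by rho*sqrt(1 - z^2), then independently by rho*z.
two_step = (rho * np.sqrt(1 - z**2) * rng.standard_normal(trials)
            + rho * z * rng.standard_normal(trials))

# Single-step rho-perturbation.
one_step = rho * rng.standard_normal(trials)

# Independent Gaussians add in variance: rho^2 (1 - z^2) + rho^2 z^2 = rho^2,
# so both samples have standard deviation rho (and in fact identical
# distributions, since sums of independent Gaussians are Gaussian).
std_two, std_one = np.std(two_step), np.std(one_step)
```

This is why the two-step view is free: the analyst may condition on the first, larger perturbation (to get the Jacobian property) and then use only the second, smaller one for the anticoncentration step.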
Applications. The simplest application of our framework is the setting where M has columns $\tilde{u}_i \otimes \tilde{v}_i$, for $\rho$-perturbations of underlying vectors $u_i, v_i$. (This setting was studied in [4, 9] and already had applications to parameter recovery in statistical models.) Here, we can show that M has the CAA property. To show this, we consider a combination $\sum_i \alpha_i (\tilde{u}_i \otimes \tilde{v}_i)$ with $k$ "large" coefficients in $\alpha$, and show that in this case, the Jacobian property holds. Specifically, we show that the Jacobian has $\Omega(kn)$ large singular values. This establishes the CAA property, which in turn implies a lower bound on $\sigma_{\min}(M)$. This gives an alternative proof of the results of the works above.

2.3 Structured Matrices from Kronecker Products and Higher-Order Lifts

Our second set of techniques allows us to handle structured matrices that arise from the action of a linear operator on Kronecker products, as described in Question 1.1. For simplicity, let us focus on the setting when $d = 2$, and let $\Phi : \mathrm{Sym}(\mathbb{R}^n \otimes \mathbb{R}^n) \to \mathbb{R}^R$ be an (orthogonal) projection matrix of rank $R \ge 0.01 n^2$ acting on the space of symmetric matrices $\mathrm{Sym}(\mathbb{R}^n \otimes \mathbb{R}^n)$ (in general $\Phi$ can also be any linear operator of large rank). Let $m = o(n)$ and let $\widetilde{U} \in \mathbb{R}^{n \times m}$ be a small random $\rho$-perturbation of an arbitrary matrix $U \in \mathbb{R}^{n \times m}$. The columns of the matrix $\widetilde{U}^{\circledast 2}$ are linearly independent with high probability, and span the symmetric lift of the column space of $\widetilde{U}$.
An arbitrary subspace of $\mathrm{Sym}(\mathbb{R}^n \otimes \mathbb{R}^n)$ of the same dimension may intersect non-trivially with, or lie close to, the kernel of $\Phi$. Theorem 1.4 shows that the column space of $\widetilde{U}^{\circledast 2}$ for a smoothed $\widetilde{U}$ is in fact far from the kernel of $\Phi$ with high probability. Note that $\widetilde{U}$ only has about $nm$ truly independent coordinates or "bits", whereas a random subspace (matrix) of the same dimension has $c \cdot n^2 m^2$ independent coordinates.
Challenge with existing approaches. This setting captures many kinds of random matrices that have been studied earlier, including [4, 9, 11]. For example, [11] studies the setting where a fixed polynomial map $f : \mathbb{R}^n \to \mathbb{R}^k$ is applied to a randomly perturbed vector $\tilde{u}_i$ to produce the $i$-th column $f(\tilde{u}_i)$. This turns out to be a special case of our setting above when $m = 1$. These works use the leave-one-out approach to lower bound the least singular value, where they establish that every column has a non-negligible component orthogonal to the span of the rest of the columns (see Lemma 3.1). However, this approach crucially relies on the columns bringing in independent randomness. This does not hold in our setting, since every column shares randomness with $\Omega(m)$ other columns.
In the recent algebraic algorithms of [7, 12] for decomposing sums of powers of polynomials, the analysis involves the least singular values of different random matrices. One such matrix M is formed by concatenating the Kronecker products of a collection of underlying matrices. This allows us to reason about the non-overlap, or distance, between the lifts of a collection of subspaces. The work of [7] analyzed the fully random setting and proved least singular value bounds with intricate arguments involving graph matrices, matrix concentration, and other ideas. Specifically, as in [29], they show that $\mathbb{E}[M]$ has a good least singular value, and then prove deviation bounds on the largest singular value of $M - \mathbb{E}[M]$ to get a bound of $\sigma_{\min}(\mathbb{E}[M]) - \|M - \mathbb{E}[M]\|$. But this approach does not extend to the smoothed setting, since the underlying arbitrary matrix $U$ makes it challenging to get good bounds on $\|M - \mathbb{E}[M]\|$.
For the smoothed case, when $d = 2$, it turns out that we can use ideas similar to those described in Sections 2.1 and 2.2 to show Theorem 1.4. However, the approach runs into technical issues for larger $d$. Thus, we develop an alternate technique to analyze higher-order lifts that proves Theorem 1.4 for all constant $d$. In order to prove Theorem 1.4, we first move to a decoupled setting where we analyze the action of a linear operator on decoupled products of the form $\Phi(\widetilde{U} \otimes \widetilde{V})$, where $\widetilde{V}$ has a random component that is independent of $\widetilde{U}$. This new decoupling step leverages symmetry and the Taylor expansion, and carefully groups together terms in a way that decouples the randomness. The main technical statement we prove is the following non-symmetric version of Theorem 1.4, which analyzes a linear operator acting on a Kronecker product of different smoothed matrices.
Informal Theorem 2.1 (Non-symmetric version for $d = 2$ and modal contractions). Suppose $\Phi \in \mathbb{R}^{R \times n^2}$ is a matrix with at least $\delta n^2$ non-negligible singular values, and let $\widetilde{U}, \widetilde{V} \in \mathbb{R}^{n \times m}$ be $\rho$-perturbations of arbitrary matrices, with $m \le cn$. Then with probability $1 - \exp(-\Omega(n))$, we have $\sigma_{\min}(\Phi(\widetilde{U} \otimes \widetilde{V})) \ge \mathrm{poly}(\rho, 1/n)$. (See Theorem 5.2 for the formal statement.)

Smoothed modal contractions. While $\Phi$ is specified as a linear operator or a matrix of dimension $R \times n^2$ in Theorem 2.1, one can alternately view $\Phi$ as an order-3 tensor of dimensions $R \times n \times n$, as shown in Figure 1. Theorem 2.1 then gives a lower bound on the multilinear rank (or its robust analog) under smoothed modal contractions (dimension reduction) along the modes of dimension $n$ each. The proof of this theorem is by induction on the order $d$. We perform each modal contraction one at a time. As shown in Figure 2, we first do a modal contraction by $\widetilde{V}$ to obtain an $R \times n \times m$ tensor $B$, and then contract by $\widetilde{U}$ to form the final $R \times m \times m$ tensor. We need to argue about the (robust) ranks of the matrix slices (we also call them blocks) and tensors obtained in intermediate steps. For any matrix $M$ (potentially a matrix slice of the tensor $B$) of large (robust) rank $k \ge 1.1m$, a smoothed contraction $M\widetilde{U}$ has full rank $m$ (i.e., non-negligible least singular value) with probability $1 - \exp(-\Omega(k))$.
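The one-mode-at-a-time contraction can be written with `np.tensordot`. The sketch below (small random dimensions, with random matrices standing in for the smoothed $\widetilde{U}, \widetilde{V}$) checks that contracting the two modes in sequence gives exactly the flattened $\Phi(\widetilde{U} \otimes \widetilde{V})$, and that the result has full rank $m^2$ on this instance:

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, R = 8, 3, 30

Phi = rng.standard_normal((R, n, n))   # operator viewed as an R x n x n tensor
U_t = rng.standard_normal((n, m))      # stand-ins for the smoothed matrices
V_t = rng.standard_normal((n, m))

# Contract one mode at a time: first by V~ (last mode), giving R x n x m,
# then by U~ (middle mode), giving R x m x m. tensordot appends the free
# axis of the second factor, so a transpose restores the (r, i, j) order.
B = np.tensordot(Phi, V_t, axes=([2], [0]))   # shape (R, n, m)
T = np.tensordot(B, U_t, axes=([1], [0]))     # shape (R, m, m), indices (r, j, i)
T = T.transpose(0, 2, 1)                      # reorder to (r, i, j)

# Flattened, this is exactly Phi (as an R x n^2 matrix) times U~ (x) V~.
direct = Phi.reshape(R, n * n) @ np.kron(U_t, V_t)
match = np.allclose(T.reshape(R, m * m), direct)
sigma_min = np.linalg.svd(direct, compute_uv=False)[-1]
```

The theorem's content is that this full-rank behavior survives even when $U, V$ are adversarial and only the $\rho$-perturbation supplies randomness.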
To argue that the final tensor (when flattened) has full rank $m^2$, we need to argue that for the tensor $W$ in the intermediate step, each of the $m$ slices (along the contracted mode) has rank at least $\Omega(n)$. The original rank of $A$ was large, so we know that a constant fraction of the slices $A_1, \ldots, A_n$ must have rank $\Omega(n)$. But this alone may not be enough, since many of the slices can be identical, in which case the $m$ slices are not sufficiently different from each other.
One could use the large rank of $A$ to argue that a constant fraction of the matrix slices have large "marginal rank", i.e., they have large rank even after we project out the column spaces of the slices that were chosen before them. While this strategy may work in the non-robust setting, it incurs an exponential blowup in the least singular value. Instead, we use the following randomized strategy: find a collection of blocks or slices $S_1 \subset [n]$, each of which has a large "relative rank", even after we project out the column spaces of all the other blocks in $S_1$ (we show these statements in a robust sense, formalized using appropriate least singular values).
Finding many blocks with large relative rank. We note that while the idea is quite intuitive, the proof of the corresponding claim (Lemma 5.4) is non-trivial, because we require that in any selected block, there must be many vectors with a large component orthogonal to the entire span of the other selected blocks. As a simple example, consider two blocks $A_1, A_2$ in which the span of the vectors in $A_2$ contains all the vectors in $A_1$. In this case, even if $\varepsilon$ is tiny, we cannot choose both blocks.
The proof proceeds by first identifying a set of roughly $R' = \Omega(n^2)$ vectors (spread across the blocks) that form a well-conditioned matrix, and then randomly restricting to a subset of the blocks. We start with the following claim, which gives us the first step. The claim is a robust version of the simple statement that if $\sigma_k(A) > 0$, then there exist $k$ linearly independent columns. Its proof is elegant and uses a so-called Auerbach basis, or well-conditioned basis, for the column span.
The outline of the main argument is as follows: (1) First, find a submatrix $M$ of $R' = \delta n^2$ columns of $A$ such that $\sigma_{R'}(M)$ is large. (2) Randomly sample a subset $T \subseteq [n]$ of the blocks.
(3) Discard any block $j \in T$ that has fewer than $\delta n/6$ vectors with a non-negligible component orthogonal to the span of $\bigcup_{r \in T \setminus \{j\}} A_r$; argue that $\Omega(\delta n)$ blocks remain. We remark that the above idea of using a random restriction to obtain many blocks with large relative rank (in a robust sense) seems of independent interest, and it also comes in handy in the application to power sum decompositions (Claim E.5).
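The relative-rank notion used above (the rank of a block after projecting out the spans of the other selected blocks) can be sketched numerically. This is an illustrative toy computation, not the paper's procedure; the sizes and the selected subset are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
R_dim, block_cols, n_blocks = 30, 4, 5
blocks = [rng.standard_normal((R_dim, block_cols)) for _ in range(n_blocks)]

def relative_rank(j, selected, tol=1e-8):
    """Rank of block j after projecting out the span of the other selected blocks."""
    others = [blocks[i] for i in selected if i != j]
    if others:
        Q, _ = np.linalg.qr(np.hstack(others))
        proj_out = np.eye(R_dim) - Q @ Q.T   # projector orthogonal to the others
    else:
        proj_out = np.eye(R_dim)
    s = np.linalg.svd(proj_out @ blocks[j], compute_uv=False)
    return int((s > tol).sum())

selected = [0, 2, 4]  # a random restriction to a subset of the blocks
print([relative_rank(j, selected) for j in selected])
```

For generic (random) blocks every selected block keeps full relative rank; in the degenerate example from the text (one block's span containing another's), the relative rank of the contained block would drop to zero.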
Finishing the inductive argument. As shown in Figure 2, after the modal contraction along $\tilde{V} \in \mathbb{R}^{n \times m}$, we get $W \in \mathbb{R}^{R \times n \times m}$ with slices $W_1, \ldots, W_n$. Now we would like to argue that when we perform a smoothed contraction with $\tilde{U}$, the contracted slices have large rank, while simultaneously preserving the relative rank across the slices. Let $W_{S_1} \in \mathbb{R}^{R \times S_1 \times m}$ denote the subtensor corresponding to the slices obtained from the "good" blocks $S_1 \subset [n]$ (which have large relative rank), and let $W_{[n] \setminus S_1} \in \mathbb{R}^{R \times ([n] \setminus S_1) \times m}$ denote the remaining slices. Also let $W^{(j)} \in \mathbb{R}^{R \times n}$ denote the matrix slices along the alternate mode, for each $j \in [m]$. We can show that each randomly contracted matrix splits into two summands in which the randomness is independent. Arguing that the high relative rank across the slices is preserved involves some work, and this is achieved in Lemma 5.5. The lemma proves that with high probability, every test unit vector $\alpha \in \mathbb{R}^{m \cdot m}$ gives a non-negligible value of $\|M\alpha\|_2$. A standard argument would consider a net over all potential unit vectors $\alpha \in \mathbb{R}^{m \cdot m}$. However, this approach fails here, since we cannot get the high enough concentration (of the form $e^{-\Omega(m^2)}$) required for such an argument. Instead, we argue that if there were such a test vector $\alpha \in \mathbb{R}^{m \cdot m}$, there would exist a block $j^* \in [m]$ at which a highly unlikely event occurs. This allows us to conclude the inductive proof that establishes Theorem 2.1.
Here $a_j$ and $b_j$ denote the $j$th columns of $A$ and $B$, respectively, and $a_j \otimes b_j$ is the Kronecker product (or simply the tensor product) of these columns.
Here $S_d$ denotes the symmetric group on $[d]$, and $u^{(j)}_{i_{\pi(j)}}$ denotes the $i_{\pi(j)}$th column of $U^{(j)}$. For example, for matrices $U, V \in \mathbb{R}^{n \times m}$, the column of $U \circledast V$ corresponding to a tuple $(i, j)$ with $i \le j$ is $\frac{1}{2}(u_i \otimes v_j + u_j \otimes v_i)$; in the case that $i = j$, this reduces to $u_i \otimes v_i$. Let $U^{\circledast d}$ denote the $\circledast$-product of a total of $d$ copies of $U$. The product $\circledast$ can be viewed as a partially symmetrized version of the Kronecker product, since all columns of $U^{\circledast d}$ are symmetric with respect to the natural symmetrization of $(\mathbb{R}^n)^{\otimes d}$. Along these lines, we introduce the operator $\mathrm{Sym}_d : \mathbb{R}^{n^d} \to \mathbb{R}^{n^d}$, which symmetrizes elements of $\mathbb{R}^{n^d}$ with respect to the identification $\mathbb{R}^{n^d} \cong (\mathbb{R}^n)^{\otimes d}$. With this notation, the columns of the matrix $U^{\circledast d}$ are precisely the unique columns of the matrix $\mathrm{Sym}_d(U^{\otimes d})$.
Finally, for a vector space $\mathcal{U}$, we have that $\mathcal{U}^{\circledast d} = \mathrm{Sym}_d(\mathcal{U}^{\otimes d})$ is the space of symmetric $d$th-order tensors over the space $\mathcal{U}$. We also call this the symmetric $d$th-order lift of the space $\mathcal{U}$.
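The symmetrization operator is easy to sketch numerically for $d = 2$ (an illustrative toy, with the helper names `sym2` and `sym_col` introduced here, not taken from the paper): under the identification $\mathbb{R}^{n^2} \cong \mathbb{R}^n \otimes \mathbb{R}^n$, $\mathrm{Sym}_2$ symmetrizes the corresponding $n \times n$ matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 3
U = rng.standard_normal((n, m))

def sym2(w, n):
    """Sym_2: symmetrize a vector in R^{n^2} via the identification with R^n (x) R^n."""
    W = w.reshape(n, n)
    return ((W + W.T) / 2).ravel()

def sym_col(i, j):
    """Column of U (*) U for the pair (i, j): the symmetrized tensor of u_i and u_j."""
    return sym2(np.kron(U[:, i], U[:, j]), n)

# For i = j the symmetrization is trivial: the column is u_i (x) u_i.
print(np.allclose(sym_col(0, 0), np.kron(U[:, 0], U[:, 0])))
# Sym_2(x (x) y) = (x (x) y + y (x) x) / 2:
rhs = (np.kron(U[:, 0], U[:, 1]) + np.kron(U[:, 1], U[:, 0])) / 2
print(np.allclose(sym_col(0, 1), rhs))
```

Collecting `sym_col(i, j)` over all pairs $i \le j$ gives exactly the unique columns of $\mathrm{Sym}_2(U^{\otimes 2})$, matching the description of $U^{\circledast 2}$ above.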
Leave-one-out distance. The leave-one-out distance of a matrix $U$ is a useful tool for analyzing least singular values. Given $U \in \mathbb{R}^{n \times m}$, define the leave-one-out distance $\ell(U) = \min_{i \in [m]} \mathrm{dist}\big(u_i, \mathrm{span}(u_j : j \ne i)\big)$. The least singular value of $U$ is related to the leave-one-out distance of $U$ through the following lemma [24].
See also Lemma A.2 for a block-version of leave-one-out singular value bounds.
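The relation in question is commonly stated as $\ell(U)/\sqrt{m} \le \sigma_m(U) \le \ell(U)$; the following sketch (illustrative sizes, with a small numerical slack) checks it on a random instance.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10, 4
U = rng.standard_normal((n, m))

def leave_one_out(U):
    """ell(U): minimum over columns of the distance from u_i to the span of the rest."""
    n, m = U.shape
    dists = []
    for i in range(m):
        others = np.delete(U, i, axis=1)
        Q, _ = np.linalg.qr(others)
        resid = U[:, i] - Q @ (Q.T @ U[:, i])  # component orthogonal to the span
        dists.append(np.linalg.norm(resid))
    return min(dists)

ell = leave_one_out(U)
sigma_min = np.linalg.svd(U, compute_uv=False)[-1]
# Standard relation: ell(U)/sqrt(m) <= sigma_m(U) <= ell(U).
print(ell / np.sqrt(m) <= sigma_min <= ell + 1e-12)
```

The upper bound holds because a suitable test vector supported on column $i$ certifies $\sigma_m(U) \le \mathrm{dist}(u_i, \mathrm{span}(u_j : j \ne i))$; the lower bound is the direction that makes $\ell(U)$ useful for least singular value bounds.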

HIERARCHICAL NETS AND ANTICONCENTRATION FROM JACOBIAN CONDITIONING
A complete version of this section, including all deferred proofs, can be found in Appendix B. In this section, we will primarily deal with a matrix $\mathbf{M}$ of dimensions $N \times m$, where $m < N$. The columns will be denoted by $\tilde{x}_i$, and we wish to show a lower bound on $\sigma_m(\mathbf{M})$. In this section, we describe the finer $\epsilon$-net argument outlined in Section 2. We begin with a formal definition of the CAA property.

Definition 4.1 (CAA property). We say that a random matrix $\mathbf{M}$ with $m$ columns has the CAA property with parameter $\beta > 0$ if, for all $k \ge 1$ and all test vectors $\alpha \in \mathbb{R}^m$ with at least $k$ coordinates of magnitude at least $\delta$, there exist $\lambda > 0$ and $c \ge 8/\beta$ (dependent only on $\mathbf{M}$) such that for all $\eta \in (0, 1)$, $\Pr[\|\mathbf{M}\alpha\| < \delta\eta/\lambda] \le \exp\big(-c \min(m, k m^{\beta}) \log(1/\eta)\big)$.

Remark. We note that the condition $c \ge 8/\beta$ may seem strong; however, as we will see in applications, it is satisfied as long as $m$ is small enough compared to $N$, the number of rows of the matrix.

Hierarchical Nets
The following result shows that the CAA property implies a least singular value guarantee.

Theorem 4.2. Suppose $\mathbf{M}$ is a random matrix with $m$ columns that satisfies the CAA property with some parameter $\beta > 0$. Suppose additionally that we have the spectral norm bound $\|\mathbf{M}\| \le L$ with probability $1 - \nu$. Then with probability at least $1 - \exp(-m^{\beta}) - \nu$, we obtain a lower bound on $\sigma_m(\mathbf{M})$ in terms of $L$ and the parameter $\lambda$ from the CAA property.
As discussed in Section 2, the natural approach to proving such a result would be to take nets based on the sparsity of the test vector $\alpha$. In other words, if there are $k$ nonzero values of magnitude at least $\delta > 0$, the CAA property yields a least singular value lower bound of $\delta/\lambda$ (choosing $\eta$ to be a small constant), and we can take a union bound over a net of size $\exp(k)$. The issue with this argument is that $\alpha$ might have many other non-zero values that are slightly smaller than $\delta$, and these might lead to a zero singular value (unless it so happened that $\lambda < 1/m$, which we do not have control of). Of course, in this case, we should have worked with a slightly smaller value of $\delta$, but this issue may recur, so we need a more careful argument.
We construct a sequence of nets $\mathcal{N}_1, \mathcal{N}_2, \ldots, \mathcal{N}_{s-1}$ as follows. The net $\mathcal{N}_1$ is a set of vectors parametrized by pairs $(r_1, r_2) \in \mathbb{N}^2$: (a) for each pair $(r_1, r_2)$, we include all the vectors whose entries are integer multiples of $\theta_1/m$ and which have exactly $(r_1 + r_2)$ non-zero entries, of which $r_1$ entries are in $(\tau_1, 1]$ and $r_2$ entries are in $[\tau_2, \tau_1]$. The number of vectors in $\mathcal{N}_1$ for a single pair $(r_1, r_2)$ can then be bounded by a direct count.
The next net $\mathcal{N}_2$ has vectors parametrized by triples $(r_1, r_2, r_3) \in \mathbb{N}^3$, subject to conditions (a)–(c) analogous to the above. For each such triple, we include the vectors that have exactly $(r_1 + r_2 + r_3)$ non-zero entries (in the corresponding $\tau$ ranges as above), and whose values are all integer multiples of $\theta_2/m$.
We have nets of this form for $j = 1, 2, \ldots, s-1$, where $s = \lceil 1/\beta \rceil$. We now have the following claim.

Claim 4.3. Fix any $1 \le j < s$. (The full statement and proof appear in Appendix B.)

Finally, we have a bigger net for all "dense" vectors $\alpha$, i.e., those that have at least $m^{1-\beta}$ coordinates of magnitude at least $\theta_{s-1}/m$. This net consists of vectors in $\mathbb{R}^m$ for which (a) every coordinate is an integer multiple of $\theta_s/m$ (between 0 and 1), and (b) at least $m^{1-\beta}$ coordinates are at least $\theta_{s-1}/m$. Call this net $\mathcal{N}_0$. An easy upper bound on its size then yields Claim 4.4 (see Appendix B).
One of the advantages of our $\epsilon$-net argument is that if we only care about "well-spread" vectors, we can obtain a much stronger concentration bound (Eq. (10)).

Observation 4.5. Suppose $\mathbf{M}$ is a random matrix that satisfies the CAA property with parameter $\beta$. Call a test vector $\alpha$ (of length at most 1) "dense" if it has at least $m^{1-\beta}$ coordinates of magnitude $> \delta$. Then the probability that there exists a dense $\alpha$ with $\|\mathbf{M}\alpha\|$ small is exponentially small. Note that in the above claim, $m$ could be quite large compared to $n$. The observation follows immediately from (10), but we will use it later in Section 4.3.

Anticoncentration of a Vector of Homogeneous Polynomials
We consider the following setting: let $p_1, \ldots, p_N$ be homogeneous polynomials and let $P(x) = (p_1(x), \ldots, p_N(x))$. Our goal will be to show anticoncentration results for $P$. Specifically, we want to prove that $\Pr[\|P(\tilde{x}) - y\| < \epsilon]$ is small for all $y$, where $\tilde{x}$ is a perturbation of some (arbitrary) vector $x \in \mathbb{R}^n$. We give a sufficient condition for proving such a result in terms of the Jacobian of $P$ (see Section 3 for background).

Definition 4.6 (Jacobian rank property). We say that $P$ has the Jacobian rank property with parameters $(k, c, \gamma)$ if for all $\rho > 0$ and all $x$, the matrix $P'(\tilde{x})$ has at least $k$ singular values of magnitude at least $c\rho$, with probability at least $1 - \gamma$. Here, $\tilde{x} = x + \eta$, where $\eta \sim \mathcal{N}(0, \rho^2 I)$ is a perturbation of the vector $x$.
Comment. Indeed, all of our results hold even if we only have the required condition for small enough perturbation sizes $\rho$. To keep the statements simple, we work with the stronger definition.
For many interesting settings of $P$, the Jacobian rank property turns out to be quite simple to prove. Our main result is that the property above implies an anticoncentration bound for $P$.

Theorem 4.7. Suppose $P(x)$, defined as above, satisfies the Jacobian rank property with parameters $(k, c, \gamma)$, and suppose further that the Jacobian $P'$ is $M$-Lipschitz in our domain of interest. Let $x$ be any point and let $\tilde{x}$ be a $\rho$-perturbation of $x$. Then for any $\eta > 0$, we obtain the anticoncentration bound stated in Appendix B.

A key ingredient in the proof is the following linearization-based lemma.

Lemma 4.8. Suppose $x$ is a point at which the Jacobian $P'(x)$ has at least $k$ singular values of magnitude at least $\tau$. Also suppose that the norm of the Hessian of $P$ is bounded by $M$ in the domain of interest. Then, for small perturbations $0 < \rho < \frac{\tau}{4Mnk}$, an anticoncentration bound holds for every $\epsilon > 0$. We remark that the lemma does not imply Theorem 4.7 directly, because it only applies when the perturbation $\rho$ is much smaller than the singular value threshold $\tau$.
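The linearization step behind Lemma 4.8 can be illustrated numerically (a toy sketch with illustrative sizes, not the paper's construction): for a vector of homogeneous quadratics, $P(x + \eta)$ agrees with $P(x) + P'(x)\eta$ up to a second-order error, so anticoncentration of $P(\tilde{x})$ reduces to anticoncentration of the Gaussian linear form $P'(x)\eta$, which is controlled by the singular values of the Jacobian.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6

# A vector of homogeneous quadratics P(x) = (x^T A_k x)_k with its Jacobian.
A_list = [rng.standard_normal((n, n)) for _ in range(3)]
def P(x):
    return np.array([x @ A @ x for A in A_list])
def jacobian(x):
    # Row k is the gradient of x^T A_k x, namely (A_k + A_k^T) x.
    return np.stack([(A + A.T) @ x for A in A_list])

x = rng.standard_normal(n)
rho = 1e-4
eta = rho * rng.standard_normal(n)

# P(x + eta) = P(x) + P'(x) eta + O(||eta||^2).
lin = P(x) + jacobian(x) @ eta
err = np.linalg.norm(P(x + eta) - lin)
print(err < 1e-5)  # only a second-order (rho^2-scale) error remains
```

Here the Jacobian at a generic point has full rank, which is what the Jacobian rank property asks for in this toy case.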

Jacobian Rank Property for Khatri-Rao Products
As the first application, let us use the machinery from the previous sections to prove the following.
Note that the result is stronger in terms of the success probability than the main result of [9], and it matches the result of [4]. The following lemma is the main ingredient of the proof, as it establishes the CAA property for $\tilde{U} \odot \tilde{V}$. Theorem 4.9 then follows immediately from Theorem 4.2.

Lemma 4.10. Let $\alpha \in \mathbb{R}^m$ be a unit vector, at least $k$ of whose coordinates have magnitude at least $\delta$. Let $U, V$ be arbitrary (as above), and let $\tilde{U}$ and $\tilde{V}$ be $\rho$-perturbations. Define $P(\tilde{U}, \tilde{V}) = \sum_i \alpha_i \tilde{u}_i \otimes \tilde{v}_i$. Then for $M = (m+n)^2$ and all $\eta > 0$, we have the stated anticoncentration bound.

Remark. To see why this satisfies the CAA property (the hypothesis of Theorem 4.2), note that as long as $m < n^2/C$ for a sufficiently large (absolute) constant $C$, the term $kn/16 \ge 16 \min(m, k m^{1/2})$; thus the property is satisfied with $\beta = 1/2$.
The Jacobian property used to establish Lemma 4.10 can be extended to higher-order Khatri-Rao products. We give details in Section B.3.
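The phenomenon behind Lemma 4.10 can be seen in a small experiment (illustrative sizes and perturbation scale, not the paper's parameters): the Khatri-Rao product of adversarial base matrices can be completely rank deficient, while its smoothed version has full column rank.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, rho = 6, 10, 0.1   # m can be as large as ~ n^2 / C (illustrative)

def khatri_rao(U, V):
    """Column-wise Kronecker product: column j is u_j (x) v_j."""
    return np.einsum('ij,kj->ikj', U, V).reshape(U.shape[0] * V.shape[0], -1)

# Adversarial base matrices (all columns identical), then rho-perturbations.
U = np.ones((n, m))
V = np.ones((n, m))
U_t = U + rho * rng.standard_normal((n, m))
V_t = V + rho * rng.standard_normal((n, m))

# The unperturbed Khatri-Rao product has rank 1; the smoothed one has rank m.
s0 = np.linalg.svd(khatri_rao(U, V), compute_uv=False)
s1 = np.linalg.svd(khatri_rao(U_t, V_t), compute_uv=False)
print((s0 > 1e-9).sum(), (s1 > 1e-9).sum())
```

This is exactly the smoothed regime of Theorem 4.9: the least singular value of $\tilde{U} \odot \tilde{V}$ is non-negligible with high probability even though the base instance is degenerate.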

HIGHER ORDER LIFTS AND STRUCTURED MATRICES FROM KRONECKER PRODUCTS
A complete version of this section, including all deferred proofs, can be found in Appendix C. We prove the following theorem.
Theorem 5.1. Suppose $d \in \mathbb{N}$, let $\Phi : \mathrm{Sym}_d(\mathbb{R}^n) \to \mathbb{R}^D$ be an orthogonal projection of rank $R = \delta \binom{n+d-1}{d}$ for some constant $\delta > 0$, and let $\mathrm{Sym}_d : (\mathbb{R}^n)^{\otimes d} \to \mathrm{Sym}_d(\mathbb{R}^n)$ be the orthogonal projection onto the symmetric subspace of $(\mathbb{R}^n)^{\otimes d}$. Let $U = (u_i : i \in [m]) \in \mathbb{R}^{n \times m}$ be an arbitrary matrix, and let $\tilde{U}$ be a random $\rho$-perturbation of $U$. Then there exists a constant $c_d > 0$ such that for $m \le c_d \delta n$, with probability at least $1 - \exp(-\Omega_{d,\delta}(n))$, the least singular value $\sigma_{\binom{m+d-1}{d}}(\Phi \tilde{U}^{\circledast d})$ is bounded below (the precise bound is deferred to Appendix C).

In the above statement, one can also consider an arbitrary linear operator $\Phi$ and suffer an extra factor of $\sigma_R(\Phi)$ in the least singular value bound (by considering the projector onto the span of the top $R$ singular vectors). In the rest of the section, we assume that $\Phi$ is an orthogonal projector of rank $R$, without loss of generality.
Theorem 5.1 follows from Theorem 5.2, which gives a non-symmetric analog of the same statement. The proof of Theorem 5.1 proceeds by a reduction to Theorem 5.2, given in Lemma C.4. In what follows, $A \in \mathbb{R}^{R \times n^d}$ denotes the natural matrix representation of $\Phi$, so that $\Phi x^{\otimes d} = A(x^{\otimes d})$ for all $x \in \mathbb{R}^n$.
Applying Theorem 5.1 along with the block leave-one-out approach (see Lemma A.2), we arrive at the following corollary.
We note that while the statement of Lemma 5.4 is quite intuitive, the proof is non-trivial because we require that in any selected block, there must be many vectors with a large component orthogonal to the entire span of the other selected blocks.We prove this lemma in Section C.2 by restricting to randomly chosen columns as described in the overview (Section 2.3).
The following lemma will be important in the inductive proof of the theorem. It reasons about the robust rank (also called multilinear rank) after a modal contraction by a smoothed matrix along a specific mode. The lemma is proved in slightly more generality than needed; we will use it for the theorem with $\varepsilon = 1$.

Lemma 5.5 (Robust rank under random contractions). Suppose $\varepsilon \in (0, 1]$ is a constant. For all constants $\gamma, C > 0$, there is a constant $c \in (0, 1)$ such that the following holds for all $s$ at most exponential in $k$. Consider matrices $A_1, A_2, \ldots, A_s \in \mathbb{R}^{R \times k}$ and $C_1, \ldots, C_s \in \mathbb{R}^{R \times m}$, and for each $j \in [s]$ let $\Pi_j^{\perp}$ denote the projector orthogonal to the span of the column spaces of $\{A_{j'} : j' \ne j, j' \in [s]\}$. Suppose the conditions listed in the full statement (Appendix C) are satisfied.

Finally, we reduce the setting of symmetric products to that of non-symmetric products. We provide the details in Section C.3.

Figure 1: The figure shows the setting of Theorem 2.1 with $d = 2$. Left: the linear operator $A : \mathbb{R}^{n \times n} \to \mathbb{R}^R$, interpreted as a tensor consisting of an $n \times n$ array of $R$-dimensional vectors. Smoothed or random contractions are applied using matrices $\tilde{U}, \tilde{V} \in \mathbb{R}^{n \times m}$. Right: the operator $A(\tilde{U} \otimes \tilde{V}) : \mathbb{R}^{m \times m} \to \mathbb{R}^R$, interpreted as an $m^2$ array of $R$-dimensional vectors. Theorem 2.1 shows that, under the conditions of the theorem, with high probability the robust rank is $m^2$.

Figure 2: The intermediate tensor $W$ and its slices: the blocks in $S_1$ have large relative rank with respect to each other, and the random modal contraction by $\tilde{U}$ can then be applied while preserving this structure.

Figure 3: Left: the linear operator $A : \mathbb{R}^{n \times n} \to \mathbb{R}^R$, interpreted as a tensor consisting of an $n \times n$ array of $R$-dimensional vectors. Smoothed or random contractions are applied using matrices $\tilde{U}, \tilde{V} \in \mathbb{R}^{n \times m}$. Right: the operator $A(\tilde{U} \otimes \tilde{V}) : \mathbb{R}^{m \times m} \to \mathbb{R}^R$, interpreted as an $m^2$ array of $R$-dimensional vectors. Theorem 5.2 shows that, under the conditions of the theorem, with high probability the robust rank of this operator is $m^2$, i.e., the least singular value of the $R \times m^2$ matrix is inverse polynomial.