Supermodular Approximation of Norms and Applications

Many classical problems in theoretical computer science involve norms, even if implicitly; for example, both XOS functions and downward-closed sets are equivalent to some norms. The last decade has seen a lot of interest in designing algorithms beyond the standard ℓ_p norms ∥·∥_p. Despite notable advancements, many existing methods remain tailored to specific problems, leaving their broader applicability to general norms less understood. This paper investigates the intrinsic properties of ℓ_p norms that facilitate their widespread use and seeks to abstract these qualities to a more general setting. We identify supermodularity, often reserved for combinatorial set functions and characterized by monotone gradients, as a defining feature beneficial for ∥·∥_p^p. We introduce the notion of p-supermodularity for norms, asserting that a norm is p-supermodular if its p-th power exhibits supermodularity. The association of supermodularity with norms offers a new lens through which to view and construct algorithms. Our work demonstrates that for a large class of problems, p-supermodularity is a sufficient criterion for developing good algorithms. This is either by reframing existing algorithms for problems like Online Load-Balancing and Bandits with Knapsacks through a supermodular lens, or by introducing novel analyses for problems such as Online Covering, Online Packing, and Stochastic Probing. Moreover, we prove that every symmetric norm can be approximated by a p-supermodular norm. Together, these results recover and extend several existing ones, and support p-supermodularity as a unified theoretical framework for optimization challenges centered around norm-related problems.


INTRODUCTION
Many classical problems in theoretical computer science are framed in terms of optimizing norm objectives. For instance, Load-Balancing involves minimizing the maximum machine load, which is an ℓ_∞ objective, while Set Cover aims at minimizing the ℓ_1 objective, i.e., the number of selected sets. However, contemporary applications, such as energy-efficient scheduling [2], network routing [24], paging [39], and budget allocation [1], demand algorithms that are capable of handling more complex objectives. Norms also underlie other seemingly unrelated concepts in computer science, such as XOS functions from algorithmic game theory (both are maxima of linear functions) and downward-closed constraints from combinatorial optimization (the downward-closed set corresponds to the unit ball of a norm); these connections are further discussed in Section 1.4.
Hence, ongoing efforts have focused on designing good algorithms for general norm objectives. Notably, the last decade has seen a lot of progress in this direction for the class of symmetric norms, those invariant to coordinate permutations. Examples include ℓ_p norms, Top-k norms, and Orlicz norms. They offer rich possibilities, e.g., enabling the simultaneous capture of multiple symmetric norm objectives, as their maximum is also a symmetric norm. We have seen the fruit of this in algorithms for a range of applications like Load-Balancing [17,18], Stochastic Probing [45], Bandits with Knapsacks [35], clustering [17,18], nearest-neighbor search [5,6], and linear regression [4,48].
Despite the above progress, our understanding of algorithms beyond ℓ_p norms remains incomplete. For instance, while [9] (where 3 independent papers were merged) provides an algorithm for Online Covering with ℓ_p norms, which was extended to sums of ℓ_p norms in [44], the extension to general symmetric norms is unresolved. Indeed, [44] posed as an open question whether good Online Covering algorithms exist for more general norms. Other less understood applications with norms include Online Packing [14] and Stochastic Probing [28].
A notable limitation of current techniques extending beyond ℓ_p norms is that they are often ad hoc. Our aim is to create a unified framework that provides a better understanding of norms in this context, simplifies proofs, and enhances generalizability.
What properties of ℓ_p norms make them amenable to various applications? Can we reduce the problem of designing good algorithms for general norms to ℓ_p norms? A common approach when working with ℓ_p norms is to instead work with the function ∥x∥_p^p = Σ_i x_i^p. This function has several nice properties, e.g., it is separable and convex. We want to understand which of its fundamental properties suffice for many applications, hoping that this would allow us to define similarly nice functions beyond ℓ_p norms.
We identify Supermodularity, characterized by monotone gradients, as a particularly valuable property of ∥·∥_p^p. This may sound intriguing because Supermodularity is typically associated with combinatorial set functions and not, a priori, with norms. This is perhaps because all norms, except for scalings of ℓ_1, are not Supermodular. We therefore propose the notion that a norm ∥·∥ is p-Supermodular if ∥·∥^p exhibits Supermodularity.
We show that for a large class of problems involving norms or equivalent objects, p-Supermodularity suffices to design good algorithms. This is either by reframing existing algorithms for problems like Online Load-Balancing [35] and Bandits with Knapsacks [32,36] through a Supermodular lens, or by introducing novel analyses for problems such as Online Covering [9], Online Packing [14], and Stochastic Probing [28,45].
Moreover, we demonstrate that p-Supermodular approximations are possible for large classes of norms, in particular for all symmetric norms. Our approach paves the path for a unified approach to algorithm design involving norms and for obtaining guarantees that depend only polylogarithmically on the number of dimensions n. In particular, it can bypass the limitations of ubiquitous approaches like "concentration + union bound" or Multiplicative Weights Update, which typically cannot give bounds depending only on the ambient dimension (they usually depend on the number of linear inequalities/constraints that define the norm/set); we expand on this a bit later.

p-Supermodularity and a Quick Application
Throughout the paper, we only deal with non-negative vectors, i.e., x ∈ R^n_+, and monotone norms, namely those where ∥y∥ ≥ ∥x∥ whenever y ≥ x coordinate-wise.

Definition 1.1 (p-Supermodularity). A monotone norm ∥·∥ on R^n_+ is p-Supermodular if its p-th power is supermodular, i.e., for all x ≤ y in R^n_+ and all z ∈ R^n_+ we have ∥x + z∥^p − ∥x∥^p ≤ ∥y + z∥^p − ∥y∥^p.
As an example, ℓ_p norms are p-Supermodular (this follows from the convexity of t ↦ t^p). It may not be immediately clear, but the larger the p, the weaker this condition is and the easier it is to satisfy (but the guarantees of the algorithms also become weaker as p grows). In Section 2.1 we present an in-depth discussion of p-Supermodularity, including this and other properties, equivalent characterizations, how to create new p-Supermodular norms from old ones, etc.
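To make the monotone-gradient intuition concrete, here is a small numeric sanity check (the helper names `lp_pow` and `marginal` are ours, not from the paper): for f(x) = ∥x∥_p^p, the marginal gain of adding δ to a coordinate never decreases when the base point grows coordinate-wise.

```python
import numpy as np

def lp_pow(x, p):
    """Return ||x||_p^p for a non-negative vector x."""
    return float(np.sum(np.asarray(x, dtype=float) ** p))

def marginal(x, i, delta, p):
    """Increase of ||.||_p^p when coordinate i of x grows by delta."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[i] += delta
    return lp_pow(y, p) - lp_pow(x, p)

# Supermodularity = monotone marginals: the same increment delta on
# coordinate i is worth at least as much at the larger point y >= x.
rng = np.random.default_rng(0)
p = 3
for _ in range(1000):
    x = rng.uniform(0, 1, size=5)
    y = x + rng.uniform(0, 1, size=5)   # y >= x coordinate-wise
    i = rng.integers(5)
    d = rng.uniform(0, 1)
    assert marginal(x, i, d, p) <= marginal(y, i, d, p) + 1e-12
```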
But to give a quick illustration of why p-Supermodularity is useful, we consider the classic Online Load-Balancing problem [8,10]. In this problem, there are T jobs arriving one-by-one that are to be scheduled on n machines. On arrival, job t ∈ [T] reveals the size s_{i,t} ∈ R_+ it takes if executed on machine i ∈ [n]. Given an n-dimensional norm ∥·∥, the goal is to find an online assignment that minimizes the norm of the load vector, i.e., ∥Λ∥, where the i-th coordinate of Λ is the sum of the sizes of the jobs assigned to the i-th machine. The following simple argument shows why p-Supermodularity implies a good algorithm for Online Load-Balancing.
Theorem 1.2. For the Online Load-Balancing problem with a norm objective that is p-Supermodular, there is an O(p)-competitive algorithm.
Proof. The algorithm is simple: be greedy with respect to ∥·∥, i.e., allocate job t to a machine such that the increase in the norm of the load vector is the smallest, breaking ties arbitrarily.
For the analysis, let σ_t ∈ R^n_+ be the load increment that the algorithm incurs at time t and Λ_t := σ_1 + ... + σ_t, and let σ*_t and Λ*_t be defined analogously for the hindsight optimal solution. Then the cost of the algorithm to the power p is
∥Λ_T∥^p = Σ_t (∥Λ_{t−1} + σ_t∥^p − ∥Λ_{t−1}∥^p) ≤ Σ_t (∥Λ_{t−1} + σ*_t∥^p − ∥Λ_{t−1}∥^p) ≤ Σ_t (∥Λ_T + Λ*_{t−1} + σ*_t∥^p − ∥Λ_T + Λ*_{t−1}∥^p) = ∥Λ_T + Λ*_T∥^p − ∥Λ_T∥^p,
where the first inequality follows from the greediness of the algorithm and the second from p-Supermodularity (applied with Λ_{t−1} ≤ Λ_T + Λ*_{t−1}); the last equality is a telescoping sum. Rearranging and applying the triangle inequality gives 2∥Λ_T∥^p ≤ (∥Λ_T∥ + ∥Λ*_T∥)^p, and taking p-th roots yields ∥Λ_T∥ ≤ ∥Λ*_T∥/(2^{1/p} − 1) = O(p) · ∥Λ*_T∥. □
Since ℓ_p norms are p-Supermodular, we obtain O(p)-competitive algorithms for Online Load-Balancing with these norms, implying the results of [8,10].
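The greedy algorithm and the proof's bound ALG ≤ OPT/(2^{1/p} − 1) can be checked on toy instances. The following sketch (function names `greedy_loads` and `opt_loads` are ours) compares greedy against a brute-force hindsight optimum for the ℓ_4 objective:

```python
import itertools
import numpy as np

def greedy_loads(jobs, p):
    """Greedy: send each job to the machine that least increases ||load||_p."""
    n = len(jobs[0])
    load = np.zeros(n)
    for sizes in jobs:  # sizes[i] = size of this job if run on machine i
        best = min(range(n),
                   key=lambda i: np.linalg.norm(load + sizes[i] * np.eye(n)[i], ord=p))
        load[best] += sizes[best]
    return load

def opt_loads(jobs, p):
    """Brute-force hindsight optimum (tiny instances only)."""
    n = len(jobs[0])
    best = None
    for assign in itertools.product(range(n), repeat=len(jobs)):
        load = np.zeros(n)
        for sizes, i in zip(jobs, assign):
            load[i] += sizes[i]
        c = np.linalg.norm(load, ord=p)
        best = c if best is None else min(best, c)
    return best

rng = np.random.default_rng(1)
p = 4
jobs = [rng.uniform(0.1, 1.0, size=3) for _ in range(6)]
alg = np.linalg.norm(greedy_loads(jobs, p), ord=p)
opt = opt_loads(jobs, p)
# the O(p) guarantee from the proof: ALG <= OPT / (2^(1/p) - 1)
assert alg <= opt / (2 ** (1 / p) - 1) + 1e-9
```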

p-Supermodular Approximation and Our Technique via Orlicz Norms
One difficulty is that many norms (e.g., ℓ_∞) are not p-Supermodular for a reasonable p (e.g., polylogarithmic in the number of dimensions n). Indeed, the greedy algorithm for Online Load-Balancing is known to be Ω(n)-competitive for ℓ_∞ [8]. However, in such cases one would like to approximate the original norm by a p-Supermodular norm before running the algorithm; e.g., approximate ℓ_∞ by ℓ_{log n}.
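The ℓ_∞-by-ℓ_{log n} approximation is a simple norm-comparison fact: ∥x∥_∞ ≤ ∥x∥_p ≤ n^{1/p}·∥x∥_∞, so p = log₂ n gives a factor-2 approximation (with p = ln n the factor is e). A quick check:

```python
import numpy as np

def lp(x, p):
    """Plain l_p norm of a vector."""
    return float(np.sum(np.abs(np.asarray(x, dtype=float)) ** p) ** (1.0 / p))

rng = np.random.default_rng(2)
n = 1024
p = np.log2(n)                      # p = 10, so n^(1/p) = 2
for _ in range(100):
    x = rng.uniform(0, 1, size=n)
    linf = float(np.max(x))
    # sandwich: ||x||_inf <= ||x||_p <= 2 * ||x||_inf
    assert linf <= lp(x, p) <= 2 * linf + 1e-9
```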
One of our main contributions is showing that such an approximation exists for large classes of norms. Formally, we say that a norm ∥·∥′ α-approximates a norm ∥·∥ if ∥x∥ ≤ ∥x∥′ ≤ α · ∥x∥ for all x ∈ R^n_+. As our first main result (in Section 2), we show that all symmetric norms can be O(log n)-approximated by an O(log n)-Supermodular norm.
Moreover, this approximation can be done efficiently given Ball-Optimization oracle access to the norm ∥·∥. This result plays a crucial role not only in allowing us to rederive many existing results for symmetric norms in a unified way, but also in obtaining new results where previously general symmetric norms could not be handled.
We now give a high-level idea of the different steps in the proof of Theorem 1.1.
Reduction to Top-k norms. The reason why general norms are often difficult to work with is that they cannot be easily described. An approach that has been widely successful when dealing with symmetric norms is to instead work with Top-k norms, the sum of the k largest coordinates of a non-negative vector. Besides giving a natural way to interpolate between ℓ_1 and ℓ_∞, they actually form a "basis" for all symmetric norms. In particular, it is known that any symmetric norm can be O(log n)-approximated by the max of polynomially many (weighted) Top-k norms (see Lemma 2.15). Leveraging this property, we reduce our problem to that of finding p-Supermodular approximations of Top-k norms.
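For concreteness, a Top-k norm is easy to compute, and the interpolation endpoints mentioned above are immediate (the helper name `top_k` is ours):

```python
import numpy as np

def top_k(x, k):
    """Top-k norm: sum of the k largest coordinates of a non-negative vector."""
    x = np.sort(np.asarray(x, dtype=float))[::-1]   # sort descending
    return float(np.sum(x[:k]))

x = [0.2, 1.5, 0.7, 1.1]
assert top_k(x, 1) == max(x)                      # Top-1 is the l_inf norm
assert abs(top_k(x, len(x)) - sum(x)) < 1e-12     # Top-n is the l_1 norm
```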
Our Approach via Orlicz Norms. Even though Top-k norms have a very simple structure, it is still not clear how to design p-Supermodular approximations for them. Not only is it hard to reason about p-th powers of functions in a high-dimensional setting, but there is also no constant or "wiggle room" in the definition of p-Supermodularity to absorb errors. Our main idea to overcome this is to instead work with Orlicz norms (defined in Section 2.2). These norms are fundamental objects in functional analysis (e.g., see the book [29]) and have also found use in statistics and computer science; see for example [4,48] for their application in regression. Orlicz norms are much easier to work with because they are defined via a 1-dimensional function ψ : R_+ → R_+.
So our next step is showing that any Top-k norm can be O(1)-approximated by an Orlicz norm. This effectively reduces our task of designing a p-Supermodular approximation from an n-dimensional situation to a 1-dimensional one.
Approximating Orlicz Norms. The last step is showing that every Orlicz norm can be approximated by a p-Supermodular one.
As an example, an immediate corollary of this result along with Theorem 1.2 is an O(log n)-competitive algorithm for Online Load-Balancing with an Orlicz norm objective.
Our key handle for approaching Theorem 1.2 is the proof of a sufficient condition for an Orlicz norm to be p-Supermodular: the 1-dimensional function generating it should grow "at most like a polynomial of power p" (Lemma 2.9). Then the construction of the approximation in the theorem proceeds in three steps. First, we simplify the structure of the Orlicz function ψ by approximating it with a sum of (increasing) "hinge" functions h_j(t) := a_j · (t − b_j)_+. These hinge functions, by definition, have a sharp "kink", and hence do not satisfy the requisite growth condition. Thus, the next step is to approximate them by smoother functions g_j(t) that grow at most like power p. The standard smooth approximations of hinge functions (e.g., the Huber loss) do not give the desired approximation properties, so we design an approximation that depends on the relation between the slope and the location of the kink of the hinge function. Finally, we show that the Orlicz norm ∥·∥_ψ̂, generated by the function ψ̂(t) = Σ_j g_j(t), both approximates ∥·∥_ψ and is O(log n)-Supermodular.
Putting these ideas together gives the desired approximation of every symmetric norm by an O(log n)-Supermodular norm.

Direct Applications of p-Supermodularity
Next, we detail a variety of applications of p-Supermodular functions. Our discussion includes both reinterpretations of existing algorithms through the lens of Supermodularity and the introduction of novel techniques that leverage Supermodularity to address previously intractable problems. In this section, we discuss applications that follow immediately from prior works due to p-Supermodularity.

Online Covering with a Norm Objective. The OnlineCover problem is defined as follows: a norm f : R^n → R is given upfront, and at each round t a new constraint ⟨a_t, x⟩ ≥ 1 arrives (for some non-negative vector a_t ∈ R^n_+). The algorithm needs to maintain a non-negative solution x ∈ R^n_+ that satisfies the constraints ⟨a_1, x⟩ ≥ 1, ..., ⟨a_t, x⟩ ≥ 1 seen thus far, and is only allowed to increase the values of the variables over the rounds. The goal is to minimize the cost f(x) of the final solution x.
When the cost function is linear (i.e., the ℓ_1 norm), this corresponds to the classical problem of Online Covering LPs [3,15], where O(log ρ)-competitive algorithms are known (ρ is the maximum row sparsity) [14,26]. This was first extended to O(p log ρ)-competitive algorithms when f is an ℓ_p norm [9], and was later extended to sums of ℓ_p norms [44]. [44] posed as an open question whether good Online Covering algorithms exist beyond ℓ_p-based norms. The following result, which follows directly by applying the algorithm of [9] to the p-Supermodular approximations of Orlicz and symmetric norms provided by Theorem 1.2 and Theorem 1.1, shows that this is indeed the case.
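To fix ideas for the ℓ_1 base case, here is a toy multiplicative-update covering sketch in the spirit of the primal schemes of [3,15]; the exact update rule and the parameter `eps` are our simplifications, and no competitive ratio is claimed for this variant. It only illustrates the "monotonically boost the coordinates of the unsatisfied constraint" mechanic:

```python
import numpy as np

def online_cover_l1(constraints, eps=0.1):
    """
    Simplified sketch of fractional online covering with l_1 objective:
    maintain x >= 0, only ever increase it, and satisfy each arriving
    constraint <a, x> >= 1 via multiplicative boosts (toy variant; the
    actual algorithms of [3, 15] use a more careful update).
    """
    m = len(constraints[0])
    x = np.zeros(m)
    for a in constraints:
        a = np.asarray(a, dtype=float)
        d = np.count_nonzero(a)            # row sparsity of this constraint
        while a @ x < 1:
            grow = a > 0
            # boost active coordinates multiplicatively, seeding zeros
            x[grow] = x[grow] * (1 + eps) + eps / d
    return x

cons = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
x = online_cover_l1(cons)
assert all(np.dot(a, x) >= 1 for a in cons)   # all constraints satisfied
assert np.all(x >= 0)
```

Since x only increases, constraints satisfied in earlier rounds remain satisfied, mirroring the monotonicity requirement in the problem definition.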
Corollary 1.3. In the OnlineCover problem, if the objective can be α-approximated by a p-Supermodular norm then there exists an O(α p log ρ)-competitive algorithm, where ρ is the maximum row sparsity. Hence, if the objective is an Orlicz norm then this yields an O(log n · log ρ) competitive ratio, and if the objective is a symmetric norm then this yields an O(log² n · log ρ) competitive ratio.

Applications via Gradient Stability: Bandits with Knapsacks or Vector Costs. Recently, [35] introduced the notion of gradient stability of norms and showed that it implies good algorithms for online problems such as Online Load-Balancing, Bandits with Vector Costs, and Bandits with Knapsacks. (Gradient stability, however, does not suffice for other applications in this paper, such as Online Covering, Online Packing, Stochastic Probing, and robust algorithms.) In the full version, we show that gradient stability is (strictly) weaker than p-Supermodularity, and hence we can recover all of the results in [35]. Due to Theorem 1.2 for Orlicz norms, this also improves the approximation factors in all these applications from O(log² n) to O(log n) for Orlicz norms. See the full version for more details.

Robust Algorithms.
Supermodularity also has implications for online problems in stochastic, and even better, robust input models. Concretely, consider the Online Load-Balancing problem from Section 1.1, but in the Mixed model where the time steps are partitioned (unbeknownst to the algorithm) into an adversarial part and a stochastic part, where in the latter the jobs are generated i.i.d. from an unknown distribution. Such models that interpolate between the pessimism of the pure worst-case model and the optimism of the stochastic model have received significant attention in both online algorithms [7,12,21,33,34,37,40–42] and online learning (see [23] and references within).
Consider the (Generalized) Online Load-Balancing problem in this model, with processing times normalized to be in [0, 1]. For the ℓ_p-norm objective, [43] designed an algorithm with cost at most O(1) · (OPT_adv + OPT_stoch), where OPT_adv and OPT_stoch are the hindsight optimal solutions for the items on each part of the input. That is, the algorithm has strong performance on the "easy" part of the instance, while being robust to "unpredictable" jobs. We extend this result beyond ℓ_p-norm objectives by applying Theorem 1 of [43] and our p-Supermodular approximation for Orlicz norms from Theorem 1.2.

New Applications using p-Supermodularity
We now discuss applications that require additional work but crucially rely on p-Supermodularity. The details can be found in the full version.

Online Covering with Composition of Norms.
To illustrate the general applicability of our ideas, in particular going beyond symmetric norms, let us reconsider the OnlineCover problem but now with a "composition of norms" objective. This version of the problem is surprisingly general: its offline version captures the fractional setting of other fundamental problems such as Generalized Load-Balancing [20] and Facility Location. Formally, in OnlineCover with composition of norms, the objective function is defined by a monotone outer norm ∥·∥ in R^k, monotone inner norms ∥·∥_(1), ..., ∥·∥_(k) in R^n, and subsets of coordinates S_1, ..., S_k ⊆ [n], composed as x ↦ ∥(∥x_{S_1}∥_(1), ..., ∥x_{S_k}∥_(k))∥. As before, covering constraints arrive online, and the algorithm is only allowed to increase the values of the variables over the rounds. The goal is to minimize the composed norm objective.
Our next theorem shows that good algorithms exist for OnlineCover even with composition of p-Supermodular norm objectives. (Since this composed norm may not be p-Supermodular, Corollary 1.3 does not apply.) Theorem 1.3. If the outer norm ∥·∥ is p′-Supermodular and the inner norms ∥·∥_(ℓ) are p-Supermodular, then there is an O(p′ p log² ρ)-competitive algorithm for OnlineCover, where ρ is the maximum between the sparsity of the constraints and the size of the coordinate restrictions S_ℓ. Unlike Corollary 1.3, which followed from p-Supermodularity immediately, this result needs new ideas to analyze the algorithm. We combine ideas from Fenchel duality used in [9] with breaking up the evolution of the algorithm into phases where the gradients of the norm behave almost p-Supermodularly, inspired by [44] in the ℓ_p case.

Online Packing.
The OnlinePacking problem has the form
max ⟨v, x⟩ s.t. Cx ≤ b, x ≥ 0, (1)
where v ∈ R^T, C ∈ R^{#constraints × T}, and b ∈ R^{#constraints} have all non-negative entries. At the t-th step, we see the value v_t of the item and its vector size (c_{1,t}, ..., c_{#constraints,t}), and have to immediately set x_t (which cannot be changed later). The classic online primal-dual algorithms were designed to address such problems [14,15], and we know O(log(κ · #constraints))-competitive algorithms, where the "width" κ of the instance is the ratio between the largest and smallest non-zero normalized entries c_{i,t}/b_i.
For many packing problems, however, #constraints is exponential in the number of items T; e.g., matroid constraints are given by {x : Σ_{t∈S} x_t ≤ rank(S), ∀S ⊆ [T]}, where rank is the rank function. In such situations, a competitive ratio that depends logarithmically on the number of constraints is not interesting, and we are interested in obtaining competitive ratios that only depend on the intrinsic dimension of the problem.
More formally, we consider the general OnlinePacking problem of the form
max ⟨v, x⟩ s.t. Cx ∈ P, x ≥ 0, (2)
where P is an n-dimensional downward-closed set. Again, items come one-by-one (along with v_t and (c_{1,t}, ..., c_{n,t})) and we need to immediately set x_t. Can we obtain polylog(n, T, κ)-competitive online algorithms? In the stochastic setting of this problem, where items come in a random order (secretary model) or from known distributions (prophet model), Rubinstein [47] obtained O(log² n)-competitive algorithms (see also [1]). But in the adversarial online model, despite this being a very basic problem, we do not know of good online algorithms beyond very simple sets P.
We propose the use of p-Supermodularity as a way of tackling this problem. The connection with norms is that there is a 1-to-1 equivalence between downward-closed sets and monotone norms, given by the gauge function ∥x∥_P := inf{α > 0 : x/α ∈ P}, so that x ∈ P ⇔ ∥x∥_P ≤ 1. Thus, the packing constraint Cx ∈ P in (2) is equivalent to ∥Cx∥_P ≤ 1. Our next result illustrates the potential of this approach.
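The gauge correspondence is easy to make concrete for sets where membership can be tested: ∥x∥_P is the smallest t with x/t ∈ P, and since P is downward-closed, membership is monotone in t and bisection applies (a sketch with our own helper names):

```python
import numpy as np

def gauge(x, in_set, lo=1e-9, hi=1e9, iters=100):
    """
    Gauge (Minkowski) norm of a downward-closed set P:
        ||x||_P = inf { t > 0 : x / t in P },
    found by bisection on t; assumes in_set is a membership test and
    that membership is monotone in t (downward-closedness).
    """
    x = np.asarray(x, dtype=float)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if in_set(x / mid):
            hi = mid
        else:
            lo = mid
    return hi

# Example: P = unit box intersected with {sum <= 2}, a downward-closed set.
in_box = lambda y: bool(np.all(y <= 1)) and float(np.sum(y)) <= 2
assert abs(gauge([1, 1, 0], in_box) - 1.0) < 1e-6   # the sum constraint binds
assert abs(gauge([2, 0, 0], in_box) - 2.0) < 1e-6   # the box constraint binds
```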
Theorem 1.4. Consider an instance of the OnlinePacking problem where the norm associated with the feasible set P admits an α-approximation by a differentiable p-Supermodular norm.
• If a γ-approximation OPT ≤ ÕPT ≤ γ · OPT of OPT is known, then there is an algorithm whose expected value is O(αp) · max{1, log γ}-competitive.
• If no approximation of OPT is known, then there is an algorithm whose expected value is O(αp) · max{1, log κ̃}-competitive, where κ̃ upper bounds the width κ.
In the setting of (1), the norm ∥·∥_P is just ℓ_∞ with rescaled coordinates. Hence, Theorem 1.4 together with the O(log n)-Supermodular approximation of ℓ_∞ gives an O(log(n κ̃))-competitive algorithm for the setting of (1), which is essentially the same classical guarantee as [14], albeit with a slightly different notion of width. Moreover, if our Conjecture 1.6 about p-Supermodularity of general monotone norms is true, then this gives the desired polylog(n, T, κ)-approximation for every downward-closed P. As a side comment, this result/technique highlights a fact that we were unaware of, even for the classical problem (1): if an estimate of OPT within poly(n) factors is available, then one can avoid the dependence on any width parameter κ.

Adaptivity Gaps and Decoupling Inequalities. We show that p-Supermodularity is related to another fundamental concept, namely the power of adaptivity when making decisions under stochastic uncertainty. To illustrate this, we consider the problem of Stochastic Probing (StochProbing), which was introduced as a generalization of stochastic matching [11,19] and has been greatly studied in the last decade [13,25,27,28,45].
In this problem, there are n items with unknown non-negative values V_1, ..., V_n that were drawn independently from known distributions. Items need to be probed for their values to be revealed. There is a downward-closed family F ⊆ 2^{[n]} indicating the feasible sets of probes (e.g., if the items correspond to edges in a graph, F can say that at most k edges incident on a node can be queried). Finally, there is a monotone function f : R^n_+ → R_+, and the goal is to probe a set S ∈ F of elements so as to maximize E[f(V_S)], where V_S has i-th coordinate equal to V_i if i ∈ S and 0 otherwise (continuing the graph example, f(V_S) can be the maximum matching with edge values given by V_S).
The optimal probing strategy is generally adaptive, i.e., it probes elements one at a time and may change its decisions based on the observed values. Since adaptive strategies are complicated (they can be an exponential-sized decision tree, and probes cannot be performed in parallel), one often resorts to non-adaptive strategies that select the probe set upfront based only on the value distributions. The fundamental question is how much we lose by making decisions non-adaptively: if Adapt(V, F, f) denotes the value of the optimal adaptive strategy and NonAdapt(V, F, f) denotes the value of the optimal non-adaptive one, then what is the maximum possible adaptivity gap Adapt(V, F, f)/NonAdapt(V, F, f) over a class of instances?
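The non-adaptive benchmark in this definition is straightforward to estimate by Monte-Carlo sampling. The following toy sketch (the names and the 3-item instance are ours) evaluates E[f(V_S)] for fixed probe sets and picks the best one, here with f equal to the Top-1 norm (max):

```python
import numpy as np

def nonadaptive_value(probe_set, sample_values, f, trials=4000, seed=0):
    """
    Monte-Carlo estimate of E[f(V_S)] for a fixed (non-adaptive) probe
    set S: coordinates outside S are zeroed out before applying f.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        x = sample_values(rng)
        masked = np.where(np.isin(np.arange(len(x)), probe_set), x, 0.0)
        total += f(masked)
    return total / trials

# Toy instance: 3 Bernoulli items, probe at most 2 of them, f = max.
sample = lambda rng: rng.binomial(1, [0.5, 0.5, 0.9]).astype(float)
best_set, best_val = max(
    ((S, nonadaptive_value(S, sample, np.max)) for S in [[0, 1], [0, 2], [1, 2]]),
    key=lambda t: t[1],
)
# any probe set containing the p = 0.9 item beats {0, 1}
assert best_set in ([0, 2], [1, 2])
```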
For submodular set functions, the adaptivity gap is known to be at most 2 [13,28]. For XOS set functions of width W, [28] showed the adaptivity gap is at most O(log W), where a width-W XOS set function f : 2^{[n]} → R_+ is a max over W linear set functions. The authors conjectured that the adaptivity gap for all XOS set functions should be poly-logarithmic in n, independent of their width. Since a monotone norm is nothing but a max over linear functions (given by the dual-norm unit ball), monotone norms form an extension of XOS set functions from the hypercube to all non-negative real vectors. Thus, the generalized conjecture of [28] is the following: Conjecture 1.5. The adaptivity gap for Stochastic Probing with monotone norms is polylog(n).
We prove this conjecture for p-Supermodular norms. This simultaneously recovers the O(log W) adaptivity gap result of [28] (via Lemma 2.4) and the result of [45] for all monotone symmetric norms (within polylog(n) factors). Moreover, if our Conjecture 1.6 about p-Supermodularity of general monotone norms is true, this would settle the full Conjecture 1.5. Importantly, neither the techniques from [28] nor those from [45] seem able to prove Conjecture 1.5: the former uses a "concentration + union bound" over the linear functions composing f (leading to the expected O(log W) loss), and the latter showed an Ω(√n) lower bound for non-symmetric functions with their approach.
The proof of Theorem 1.5 is similar to the Load-Balancing application of Section 1.1: we replace one-by-one the actions of the optimal adaptive strategy Adapt by those of the "hallucination-based" non-adaptive strategy that runs Adapt on "hallucinated samples" V̄_i (but receives value according to the true item values V_i). However, additional probabilistic arguments are required; in particular, we need to prove a result of the type "E∥X_1 + ... + X_T∥ ≤ O(p) · E∥X̄_1 + ... + X̄_T∥", where the X_t's and X̄_t's correspond to Adapt and the hallucinating strategy, respectively. We do this via an interpolation idea inspired by Burkholder [16].
In fact, we prove a more general result than Theorem 1.5 that shows connections with probability and the geometry of Banach spaces: a decoupling inequality for tangent sequences of random variables (see the full version); these have applications from concentration inequalities [46] to Online Learning [22,49]. Two sequences of random variables X_1, ..., X_T and X̄_1, ..., X̄_T are called tangent if, conditioned on the history up to time t − 1, X_t and X̄_t have the same distribution. We show that for such tangent sequences in R^n_+ and a p-Supermodular norm ∥·∥, we have E∥X_1 + ... + X_T∥ ≤ O(p) · E∥X̄_1 + ... + X̄_T∥, independent of the number of dimensions. This complements the (stronger) results known for the so-called UMD Banach spaces [31].

Our Conjecture and Future Directions
In this work we demonstrate that p-Supermodularity is widely applicable to many problems involving norm objectives (from online to stochastic, and from maximization to minimization problems). Our Theorem 1.1 shows that all symmetric norms have an O(log n)-Supermodular approximation. We conjecture that such an approximation should exist for all norms.
Conjecture 1.6. Any monotone norm in n dimensions can be polylog(n)-approximated in the positive orthant by a norm that is polylog(n)-Supermodular.
If true, this conjecture would significantly push the boundary of what is known. It is akin to the phenomenon of going "beyond the trivial union bound" that appears in multiple settings. For instance, it would positively resolve the adaptivity gap conjecture of [28] for XOS functions, where the current best results depend on the number of linear functions, and it would give online packing/covering algorithms that do not depend on the number of constraints but only on the ambient dimension.
Another interesting future direction is to obtain integral solutions for the OnlineCover problem. Similar to the work of [44], our Corollary 1.3 and Theorem 1.3 can only handle the fractional OnlineCover problem. Unlike the classic Online Set Cover (ℓ_1 objective), where randomized rounding suffices to obtain integral solutions, it is easy to show that we cannot round with respect to the natural fractional relaxation of the problem, since it has a large integrality gap. Hence, a new idea will be required to capture integrality in the objective.
p-Supermodularity is also related to the classic Online Linear Optimization problem (e.g., see the book [30]). For the maximization version of the problem, in the full version we show how to obtain total value at least (1 − ε)OPT − O(p · D/ε) when a norm associated with the problem is p-Supermodular, where D is a "diameter" parameter. In the case of prediction with experts, this recovers the standard (1 − ε)OPT − O((log N)/ε) bound (N being the number of experts), and generalizes the result of [42] to when the player chooses actions in an ℓ_p ball. This gives an intriguing alternative to standard methods like Online Mirror Descent and Follow the Perturbed Leader. It would be interesting to find further implications of this result, and more broadly of p-Supermodularity, in the future.
In the next section we discuss properties of p-Supermodularity and defer the proofs of the applications to the full version.

SUPERMODULAR APPROXIMATION OF NORMS
In this section we discuss p-Supermodularity and show how many general norms can be approximated by p-Supermodular norms.

p-Supermodularity and Its Basic Properties
p-Supermodularity can be understood in a natural and more workable manner through the first and second derivatives of the norm; this is the approach we use in most of our results. While norms may not be differentiable, using standard smoothing techniques, every p-Supermodular norm can be (1 + ε)-approximated by another p-Supermodular norm that is infinitely differentiable everywhere except at the origin; see the full version.
• (Gradient property): ∥·∥^p has monotone gradients over the non-negative orthant, i.e., for all x ≤ y ∈ R^n_+ and all i ∈ [n], ∂_i(∥x∥^p) ≤ ∂_i(∥y∥^p).
• (Hessian property): the mixed second derivatives of ∥·∥^p are non-negative over the non-negative orthant, i.e., ∂²_{ij}(∥x∥^p) ≥ 0 for all x ∈ R^n_+ and i, j ∈ [n].
Proof (sketch). The first part of the Gradient property follows when we take ∥z∥ → 0, and the first part of the Hessian property follows from the monotonicity of gradients; the converse directions follow by integrating these local conditions. □
Two immediate implications of the above equivalence are the following. As mentioned in the introduction, for every p ≥ 1 the ℓ_p norm is p-Supermodular; this follows, e.g., from the Gradient property. For p ≥ log n, the ℓ_p norm is O(1)-approximated by ℓ_{log n}; so ℓ_∞ can be O(1)-approximated by an O(log n)-Supermodular norm. We first generalize this fact (ℓ_∞ is a max of linear functions that are each 1-Supermodular).
Proof. Let p′ = max{p, log m}, where m is the number of linear functions in the max, and consider the norm ∥x∥′ := (Σ_{a∈A} ⟨a, x⟩^{p′})^{1/p′}, where A is the set of these linear functions; it m^{1/p′} = O(1)-approximates max_{a∈A} ⟨a, x⟩, and for all x ≤ y ∈ R^n_+ it inherits monotone gradients from the linear terms. An implication of this is that any norm in n dimensions can be O(1)-approximated by an O(n)-Supermodular norm. This is because we can find a 1/4-net N ⊆ A of the unit ball A of the dual norm of size 2^{O(n)}, and since ∥x∥ = max_{a∈A} ⟨a, x⟩, restricting the max to N achieves the approximation. Corollary 2.5. Any monotone norm in n dimensions can be O(1)-approximated by an O(n)-Supermodular norm.
Although p-Supermodular norms have several nice properties, they also exhibit some strange behavior. For instance, the sum of two p-Supermodular norms can be very far from being p-Supermodular.

Orlicz Norms and a Sufficient Condition for p-Supermodularity
The following class of Orlicz functions and Orlicz norms will play a crucial role in all our norm approximations.
Definition 2.8 (Orlicz Norm). Given an Orlicz function ψ, the associated Orlicz norm is defined by ∥x∥_ψ := inf{α > 0 : Σ_i ψ(|x_i|/α) ≤ 1}. Since we only focus on non-negative vectors, we will ignore the absolute value |·| throughout.
For example, any ℓ_p norm is an Orlicz norm, obtained by selecting ψ(t) = t^p. Orlicz norms are fundamental in functional analysis [38], but have also found versatile applications in TCS. For instance, in regression the choice between ℓ_1 and ℓ_2 norms depends on outliers and stability, so an Orlicz norm based on the popular Huber convex loss function is better suited [4,48]. Later we will show that Orlicz norms can be used to approximate any symmetric norm.
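The definition ∥x∥_ψ = inf{α > 0 : Σ_i ψ(x_i/α) ≤ 1} can be evaluated by bisection over α, since the left-hand side is decreasing in α. The sketch below (the helper name `orlicz_norm` is ours) recovers ℓ_p when ψ(t) = t^p:

```python
import numpy as np

def orlicz_norm(x, psi, lo=1e-9, hi=1e9, iters=100):
    """
    Orlicz (Luxemburg) norm of a non-negative vector:
        ||x||_psi = inf { a > 0 : sum_i psi(x_i / a) <= 1 },
    computed by bisection (psi convex, increasing, psi(0) = 0).
    """
    x = np.asarray(x, dtype=float)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if np.sum([psi(v / mid) for v in x]) <= 1:
            hi = mid
        else:
            lo = mid
    return hi

# With psi(t) = t^p we recover the l_p norm.
p = 3
val = orlicz_norm([1.0, 2.0, 2.0], lambda t: t ** p)
assert abs(val - (1 + 8 + 8) ** (1 / 3)) < 1e-6
```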
The following lemma is our main tool for working with Orlicz norms. It states that for such a norm to be p-Supermodular, it suffices that its generating function grows "at most like power p". The key is that this reduces the analysis of the n-dimensional norm ∥·∥_ψ to the analysis of a 1-dimensional function, which is significantly easier. Lemma 2.9. Consider a twice differentiable convex function ψ : R_+ → R_+ with ψ(0) = 0 such that t · ψ′(t) ≤ p · ψ(t) and t · ψ″(t) ≤ (p − 1) · ψ′(t) for all t ≥ 0. Then the Orlicz norm ∥·∥_ψ is (2p − 1)-Supermodular. Notice that the function ψ(t) = t^p satisfies these conditions at equality. While in this special case the norm ∥·∥_ψ = ℓ_p is p-Supermodular, in general we obtain the slightly weaker conclusion of (2p − 1)-Supermodularity.
The rest of this subsection proves this lemma. The proof relies on the Hessian property of p-Supermodular norms. First, we observe the following formula for the gradient of the Orlicz norm ∥·∥_ψ; this can be found on page 24 of [38], but we repeat the proof for completeness.
Claim 2.1. If ψ is differentiable, then the gradient of the Orlicz norm ∥·∥_ψ is given by ∂_i ∥x∥_ψ = ψ′(x̃_i) / Σ_j x̃_j ψ′(x̃_j), where x̃ := x/∥x∥_ψ.
Definition 2.10. For x ∈ R^n_+, let x̃ := x/∥x∥_ψ denote the normalized vector. Differentiating the expression for the gradient ∇∥x∥_ψ gives a closed-form formula for the Hessian of the Orlicz norm. (To be careful with the chain rule, we use brackets; for example, ∇(ψ′(x̃_ℓ)) denotes the gradient of the composed function ψ′ ∘ x̃_ℓ, not of ψ′ alone.) Claim 2.2. If ψ is twice differentiable, then the Hessian of the norm ∥·∥_ψ admits a closed-form expression (3) in terms of ψ′, ψ″, and x̃. Before proving the claim (which is mostly algebra), we complete the proof of the lemma.
Proof of Lemma 2.9. We first compute the gradient of the normalized coordinate x̃_ℓ = x_ℓ/∥x∥_G. When ℓ ≠ i we have ∂x̃_ℓ/∂x_i = −(x̃_ℓ/∥x∥_G) · ∇_i∥x∥_G, and when ℓ = i we get an extra +1/∥x∥_G from the product rule. Letting 1(ℓ = i) denote the indicator that ℓ = i, this implies ∂x̃_ℓ/∂x_i = 1(ℓ = i)/∥x∥_G − (x̃_ℓ/∥x∥_G) · ∇_i∥x∥_G. Applying this to (3) and using ∇(G'(x̃_ℓ)) = G''(x̃_ℓ) · ∇x̃_ℓ, we obtain a lower bound on the entries of the Hessian of ∥x∥_G, where the inequality uses that the dropped terms are non-negative for x ≥ 0. Moreover, the assumption on G implies that x̃_ℓ · G''(x̃_ℓ) ≤ (p − 1) · G'(x̃_ℓ). Bounding the diagonal entries similarly, this proves Lemma 2.9 by the Hessian property of Lemma 2.1. □

Finally, we prove the missing claim.

Approximation of Orlicz Norms
This section shows that every Orlicz norm can be approximated by an O(log n)-Supermodular norm.
Before giving an overview of the proof of the theorem, it helps to have the following lemma, which shows that in order to approximate an Orlicz norm ∥·∥_G, it suffices to approximate the corresponding Orlicz function G.
Proof Overview of Theorem 1.2. Given the sufficient condition for p-Supermodularity via the growth rate of the Orlicz function from Lemma 2.9, together with Lemma 2.11 above, the proof of Theorem 1.2 involves three steps. First, we simplify the structure of the Orlicz function G by approximating it, in the interval where G(t) ≤ 1, with a sum of (increasing) "hinge" functions. These hinge functions by definition have a sharp "kink", hence do not satisfy the requisite growth condition. Thus, the next step is to approximate them by smoother functions that grow at most like power p. However, the standard smooth approximations of hinge functions (e.g., the Huber loss) do not give the desired properties, so we use a subtler approximation that depends on the relation between the slope and the location of the kink of the hinge function (this is because the approximation condition required by Lemma 2.11 is mostly multiplicative, while standard approximations focus on additive error). Finally, we show that the Orlicz norm generated by the sum of these smoothed functions both approximates ∥·∥_G and is O(log n)-Supermodular.
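The first step, decomposing a convex function with G(0) = 0 into a sum of hinge functions, can be illustrated with a generic piecewise-linear interpolation (a sketch only: the actual proof chooses breakpoints and slopes more carefully, since it must control multiplicative rather than additive error):

```python
def hinges_for(G, bps):
    """Slopes s_j so that sum_j s_j * max(0, t - bps[j]) interpolates G
    at the breakpoints bps (assumes G convex with G(bps[0]) = 0)."""
    slopes, prev = [], 0.0
    for b0, b1 in zip(bps, bps[1:]):
        m = (G(b1) - G(b0)) / (b1 - b0)  # secant slope on [b0, b1]
        slopes.append(m - prev)          # increment over previous slope
        prev = m
    return slopes

def eval_hinges(bps, slopes, t):
    """Evaluate the sum of hinge functions at t."""
    return sum(s * max(0.0, t - b) for s, b in zip(slopes, bps))

G = lambda t: t * t                          # sample convex Orlicz function
bps = [j / 100.0 for j in range(101)]        # uniform breakpoints on [0, 1]
slopes = hinges_for(G, bps)
err = max(abs(eval_hinges(bps, slopes, t / 200.0) - G(t / 200.0))
          for t in range(201))
```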
Proof of Theorem 1.2. The first claim gives the desired approximation of the Orlicz function G by piecewise linear functions.

Approximation of Top-k and Symmetric Norms
In this section we give p-Supermodular norm approximations of Top-k and general symmetric norms. The strategy is to first construct such an approximation for Top-k norms; general symmetric norms are then handled by writing them as a composition of Top-k norms and applying the p-Supermodular approximation to each term.
Approximation of Top-k norms. Even though Top-k norms have a simple structure, it is not clear how to approximate them by a p-Supermodular norm directly. Instead, we resort to an intermediate step of expressing a Top-k norm (approximately) as an Orlicz norm.
Together with Theorem 1.2 from the previous section, this implies the following.

Corollary 2.13. For every k ≥ 1, the Top-k norm ∥·∥_Top-k in n dimensions can be 2-approximated by an O(log n)-Supermodular norm.
The construction in the proof of Theorem 2.7 is inspired by the embedding of Top-k norms into ℓ_∞ by Andoni et al. [6]. They considered the "Orlicz function" G(t) that is 0 until t = 1/k and behaves as the identity afterwards, i.e., G(t) := t · 1(t ≥ 1/k). The rough intuition for why the associated "Orlicz norm" approximately captures the Top-k norm of a vector x is that x/∥x∥_Top-k has ≈ k coordinates with value above 1/k (the top ≈ k coordinates), which are picked up by G and sum to ≈ 1; thus, by the definition of Orlicz norm, ∥x∥_G ≈ ∥x∥_Top-k. However, this function is not convex due to a jump at t = 1/k, so it does not actually give a norm. Convexifying this function also does not work: the convexified version of G is the identity, which yields the ℓ_1 norm and does not approximate Top-k. Interestingly, a modification of this convexification actually works.
Proof of Theorem 2.7. We define the Orlicz function G(t) := max{0, t − 1/k}. We show that the norm ∥·∥_G generated by this function is a 2-approximation to the Top-k norm.
Upper bound ∥x∥_G ≤ ∥x∥_Top-k. By the definition of Orlicz norm, it suffices to show that Σ_i G(x_i/∥x∥_Top-k) ≤ 1. For that, note that at most k coordinates have x_i > ∥x∥_Top-k/k, so all nonzero terms correspond to the k largest coordinates; letting S denote this set, we get Σ_i G(x_i/∥x∥_Top-k) ≤ Σ_{i ∈ S} x_i/∥x∥_Top-k = 1.

Lower bound ∥x∥_G ≥ ∥x∥_Top-k/2. By the definition of Orlicz norm, it suffices to show that for any α < ∥x∥_Top-k/2, we have Σ_i G(x_i/α) > 1. Let S denote the set of the k largest coordinates of x. Then Σ_i G(x_i/α) ≥ Σ_{i ∈ S} (x_i/α − 1/k) = ∥x∥_Top-k/α − 1, which is > 1 whenever α < ∥x∥_Top-k/2. This concludes the proof of Theorem 2.7. □

Given Theorem 2.7, one might wonder whether all symmetric norms can be approximated within a constant factor by Orlicz norms. The following lemma shows that this is impossible.

Lemma 2.14. There exist symmetric norms that cannot be approximated within a (log n)^{1−ε} factor by an Orlicz norm, for any constant ε > 0.
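The two bounds above are easy to confirm numerically: with the hinge function G(t) = max{0, t − 1/k}, the Orlicz norm (evaluated by bisection on the scaling, as in the definition) always lands between ∥x∥_Top-k/2 and ∥x∥_Top-k. A self-contained sketch on random inputs:

```python
import random

def topk_norm(x, k):
    """Sum of the k largest absolute coordinates."""
    return sum(sorted((abs(v) for v in x), reverse=True)[:k])

def orlicz_norm(x, G, tol=1e-9):
    """||x||_G = inf{a > 0 : sum_i G(x_i / a) <= 1}, via bisection on a."""
    x = [abs(v) for v in x]
    lo, hi = 1e-12, 1.0
    while sum(G(v / hi) for v in x) > 1.0:   # grow hi until feasible
        hi *= 2.0
    while hi - lo > tol * hi:                # invariant: s(lo) > 1 >= s(hi)
        mid = (lo + hi) / 2.0
        if sum(G(v / mid) for v in x) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi

k = 3
G = lambda t: max(0.0, t - 1.0 / k)          # Orlicz function of Theorem 2.7
random.seed(0)
ratios = []
for _ in range(200):
    x = [random.random() for _ in range(10)]
    ratios.append(orlicz_norm(x, G) / topk_norm(x, k))
```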
We defer the proof of this observation to the full version.
Approximation of symmetric norms. Although Lemma 2.14 rules out approximating every symmetric norm by an Orlicz norm within a constant factor, we show that every symmetric norm can be O(log n)-approximated by an O(log n)-Supermodular norm.
As mentioned before, the idea is to write a general symmetric norm as a composition of Top-k norms and apply the p-Supermodular approximation to each term. More precisely, the following lemma, proved in [35] (with similar properties in [6, 17]), shows that any monotone symmetric norm can be approximated by Top-k norms. With the decomposition of monotone symmetric norms into Top-k norms from Lemma 2.15 and the p-Supermodular approximation of the latter from Corollary 2.13, we can now prove that every symmetric norm can be O(log n)-approximated by an O(log n)-Supermodular norm.
Proof of Theorem 1.1. Consider a monotone symmetric norm and its approximation |||·||| given by Lemma 2.15. For each k, take the p-Supermodular 2-approximation of the Top-k norm given by Corollary 2.13, where p = Θ(log n). We replace in |||·||| the Top-k norms by these approximations, and the outer ℓ_∞-norm by the ℓ_p-norm, to obtain the approximating norm.
Moreover, to see that the resulting norm is p-Supermodular, consider the gradient of its pth power. Since each Top-k approximation is p-Supermodular and the multipliers are non-negative, this gradient is non-decreasing. By the Gradient property in Lemma 2.1, this implies p-Supermodularity. □

We remark that, given a Ball-Optimization oracle, we can evaluate at a given point the value and gradient of the approximating norm constructed in Theorem 1.1, up to error ε, in time poly(log(1/ε), n). This is because the decomposition into Top-k norms from Lemma 2.15 can be found in polynomial time given this oracle (e.g., see [17, 35]), the Orlicz function of the Orlicz norm approximation of each Top-k norm can be constructed explicitly, and the value and gradient of this Orlicz norm can be evaluated by binary search on the scaling α in the definition of the Orlicz norm (and Claim 2.1).