SEAM: Searching Transferable Mixed-Precision Quantization Policy through Large Margin Regularization

Mixed-precision quantization (MPQ) suffers from the time-consuming search for the optimal bit-width allocation (i.e., the policy) of each layer, especially on large-scale datasets such as ISLVRC-2012. This limits the practicality of MPQ in real-world deployment scenarios. To address this issue, this paper proposes a novel method for efficiently searching for effective MPQ policies using a small proxy dataset instead of the large-scale dataset used for training the model. By departing from the established norm of using the same dataset for both model training and MPQ policy search, our approach substantially improves the efficiency of MPQ exploration. Nonetheless, using discrepant datasets makes it challenging to search for a transferable MPQ policy. Driven by the observation that the quantization noise of a sub-optimal policy harms the discriminability of feature representations, manifesting as diminished class margins and ambiguous decision boundaries, our method aims to identify policies that preserve the discriminative nature of feature representations, i.e., intra-class compactness and inter-class separation. This general and dataset-independent property allows us to search for the MPQ policy over a rather small-scale proxy dataset and then directly use the policy to quantize the model trained on a large-scale dataset. Our method offers several advantages, including high proxy data utilization, no excessive hyper-parameter tuning, and high searching efficiency. We search high-quality MPQ policies with a proxy dataset that has only 4% of the data scale of the large-scale target dataset, achieving the same accuracy as searching directly on the latter while improving MPQ searching efficiency by up to 300×.


INTRODUCTION
With the success of deep learning, deep neural networks (DNNs) have been adopted for many artificial intelligence tasks such as image classification [18,20,35] and object detection [33,34], and have become an indispensable part of modern multimedia applications [37]. However, the large computational resource requirements of DNNs remain one of the biggest stumbling blocks for deploying deep learning models. Several compression techniques reduce the redundancy in a deep model, such as pruning [29], knowledge distillation [19] and quantization [6,26,41,53]. Quantization is a promising technique that remarkably reduces both storage and computational overhead by leveraging the fact that inference does not require precision as high as training. Therefore, quantization enables large models to run directly on edge and mobile devices without redesigning the model architecture, which significantly empowers edge intelligence.
Quantization can be divided into two categories: fixed-precision quantization and mixed-precision quantization (MPQ). Fixed-precision quantization [12,28,53] designates an identical bit-width for all layers in a deep model. While such a paradigm is proven to achieve sufficiently good performance at high bit-widths (e.g., ≥ 8 bits), a uniform bit-width is challenging in the ultra-low bit-width (i.e., ≤ 4 bits) scenario. For example, BRQ [17] reports more than 20% top-1 accuracy degradation for 2-bit quantization of the MobileNetV2 model compared to its full-precision (FP) counterpart.
Mixed-precision quantization (MPQ) [11,14,21,46,51] offers a flexible and efficient way to quantize deep models by allocating varying bit-widths to individual layers based on their diverse redundancy levels. Unlike fixed-precision quantization, MPQ assigns specific precisions to different layers, with higher-redundancy layers receiving smaller bit-widths than lower-redundancy ones, thereby achieving an optimal accuracy-efficiency trade-off. The MPQ process typically involves two stages: firstly, a full-precision (FP) model $M_{fp}$ is trained on a training dataset $D_{train}$; subsequently, the FP model serves as the weight initialization for quantization while the optimal MPQ policy, which determines the quantization precision of each layer, is simultaneously searched over a searching dataset $D_{search}$. The searching process also performs quantization-aware training, therefore all search-based approaches [3,21,46] use the same dataset in both stages, namely, $D_{train} = D_{search}$. Although using consistent datasets surely brings an accurate policy for the model to be quantized, two problematic issues arise: (a) when $D_{train} = D_{search}$ and $D_{train}$ is large-scale, the combinatorial nature [40,46] of the MPQ problem poses severe difficulties in search efficiency (e.g., BP-NAS consumes 35.6 GPU-hours to search for ResNet-50 [47]); (b) in some application scenarios with sensitive user data, the training dataset is inaccessible.
Nevertheless, little research has explored decoupling the datasets used in the model training and MPQ search stages. Doing so promises to improve search efficiency, since the searching process can be done on a small-scale proxy dataset, but it inevitably encounters intractable challenges due to the shifted data distributions and reduced data volume caused by the disparate datasets. Notably, when CIFAR-10 was used to search the MPQ policy for ResNet50 trained on ISLVRC-2012, EdMIPS [3] suffered a substantial loss of nearly 7% in Top-1 accuracy [47]. Recently, GMPQ [47] indicated that, for an input image, preserving the attribution rank between the FP and quantized models allows a generalizable MPQ policy to be searched. They resort to the feature visualization technique Grad-CAM [36] to maintain the consistency of the image attribution rank between the quantized and FP models. GMPQ can be regarded as an instance-level regularization over the proxy dataset that enforces a consistent relationship between the FP and quantized models for each input instance. However, GMPQ does not harness information beyond the instance level, namely, at the class level. Furthermore, GMPQ entails intricate hyper-parameter tuning to align the attribution rank, which adds to its complexity.
In this paper, we search for effective transferable MPQ policies by exploiting the class-level information of the proxy datasets, considering that class-level information is richer than instance-level information [4]. Our idea is motivated by the observation that quantization has side effects on the quantized model in the feature space compared to the FP model. Our finding sheds light on a common drawback of quantization: the quantization noise remarkably narrows the margin between classes and blurs the decision boundary (see Figure 2). On the other hand, maximizing inter-class separation while enhancing intra-class compactness is highly favorable for classification, as there is a consensus, from statistical machine learning (e.g., SVM) to recent deep learning research [32,45], that a large classification margin enhances generalizability. We hence look for the MPQ policy that properly gathers the features of the same class and separates the features of different classes, making the features more robust to quantization noise.
Experimental results validate that a large margin between the classes of the proxy data helps search for a transferable MPQ policy for quantizing a model trained on challenging large-scale datasets. Our approach achieves competitive performance when searching on very small proxy datasets versus directly on large-scale datasets, where the size of the former is only 4% of the latter. Consequently, we substantially improve the MPQ policy search efficiency. For ResNet18 and MobileNetv1, using StanfordCars [23] as the proxy dataset, our method achieves 375× and 300× speedups, respectively, compared to the state-of-the-art MPQ approach FracBits [50].

RELATED WORK

Fixed-Precision Quantization
Fixed-precision quantization assigns a uniform bit-width to all layers. In this paper, we only consider quantization-aware training, as it can achieve a higher compression ratio than post-training quantization [22,30] and zero-shot quantization [25,49,52].
DoReFa [53] and PACT [6] use a low-precision representation for weights and activations during forward propagation, and utilize the Straight-Through Estimation (STE) [2] to estimate the gradient of the piece-wise quantization function for backward propagation. To relieve the biased gradient of the STE, DSQ [13] employs tangent functions to approximate the non-differentiable quantization function. LSQ [12] introduces learnable step-size scale factors to scale the tensor-wise weight and activation distributions. BRQ [17] further applies a bin regularization to ensure the weights fall in the center of the quantization bins. All these works focus on training a well-performing quantized network, but suffer from severe performance degradation when the bit-width is decreased significantly.
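
To make the forward/backward asymmetry of STE-based training concrete, the following is a minimal sketch of a uniform quantizer with a straight-through gradient. The function names, the clipping range [-1, 1], and the choice to zero the gradient outside that range are illustrative assumptions, not the exact formulation of any cited method.

```python
def quantize(x: float, bits: int, lo: float = -1.0, hi: float = 1.0) -> float:
    """Forward pass: clip x to [lo, hi] and snap it to one of 2**bits uniform levels."""
    step = (hi - lo) / (2 ** bits - 1)  # distance between adjacent levels
    clipped = min(max(x, lo), hi)
    return lo + round((clipped - lo) / step) * step


def ste_backward(x: float, upstream_grad: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Backward pass (assumed STE variant): treat rounding as the identity.

    The true derivative of round() is zero almost everywhere, so the
    upstream gradient is passed through unchanged inside the clipping
    range and blocked outside it.
    """
    return upstream_grad if lo <= x <= hi else 0.0


# Lower bit-widths give coarser levels, while the surrogate gradient is unchanged.
for b in (2, 4, 8):
    print(b, quantize(0.3, b), ste_backward(0.3, 1.0))
```

The coarser the grid, the larger the gap between the value used in the forward pass and the value the surrogate gradient assumes, which is one way to picture why accuracy degrades at very low bit-widths.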

Mixed-Precision Quantization
The fundamental idea of mixed-precision quantization (MPQ) is that different layers in a model have different redundancy, so high-redundancy layers can be allocated small bit-widths to ensure low complexity without a severe performance drop. However, the bit-width choice is discrete, and the number of combinations of bit-widths and layers (i.e., policies) grows exponentially. Therefore, the main challenge is how to determine the optimal bit-width for each layer.
Obviously, brute-force search is ineffective, as an $L$-layer model with $n$ candidate bit-widths for both activations and weights has $n^{2L}$ possible policies [46]. To solve this, several studies apply intelligent algorithms to search for the optimal MPQ policy. HAQ [46] and ReleQ [11] use reinforcement learning (RL) to train a bit-width allocator. SPOS [14], EdMIPS [3] and BP-NAS [51] adopt neural architecture search (NAS) methods to learn the bit-width. In particular, GMPQ [47] develops an instance-level regularization that makes searching the MPQ policy on a small dataset possible. However, GMPQ suffers from fussy hyper-parameter tuning, including the approximated attribution rank level, the number of interested pixels, etc.
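
The size of this search space is easy to check numerically. The sketch below counts policies under the assumption that each layer independently picks one weight bit-width and one activation bit-width from the same number of candidates; the layer and candidate counts are illustrative values, not figures from the paper.

```python
def num_policies(num_layers: int, num_bitwidths: int) -> int:
    """Count MPQ policies when every layer picks one weight and one activation bit-width."""
    # Each layer has n choices for weights and n for activations: n * n per layer.
    return (num_bitwidths ** 2) ** num_layers


# Illustrative values only: 18 quantized layers, 5 candidate bit-widths.
count = num_policies(num_layers=18, num_bitwidths=5)
print(f"{count:.3e} candidate policies")
```

Even this modest configuration is far beyond exhaustive evaluation, since every candidate would require at least a short quantization-aware training run to score.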
Unlike approaches that learn the optimal MPQ policy, HAWQ [8,9] and MPQCO [5] use Hessian information as the quantization sensitivity metric to assist bit-width assignment. LIMPQ [40] proposes to learn the layer-wise importance during a single quantization-aware training process. In contrast to these methods, which aim to define metrics that estimate the quantization sensitivity of layers, we propose to directly learn effective bit-width configurations on a small proxy dataset.

Discriminative Feature Learning
Learning discriminative features is highly favorable since it greatly facilitates the generalization of deep models; its core is to clarify the decision boundaries between classes. For nearly two decades, several studies have worked toward this goal.
DrLIM [16] proposes to use the contrastive loss to identify the classes. L-Softmax [27] introduces a multiplicative hyper-parameter for the softmax function to produce a rigorous decision margin. L-GM [45] assumes that the output of the penultimate layer (i.e., the deep features) follows a Gaussian Mixture (GM) distribution, and leverages the non-negative squared Mahalanobis distance to construct a GM loss. OPL [32] observes a potential orthogonality of features in the cross-entropy loss, and leverages this observation to explicitly enforce orthogonality of features. These works successfully demonstrate the significance of producing clear decision boundaries in the feature space, as the learned features become more robust and even increase the separation of features for the novel classes in a few-shot learning setting [32].

METHOD
In this section, we first review the mixed-precision quantization (MPQ) problem in a differentiable way and discuss why it cannot be adopted on inconsistent datasets directly. Next, we consider MPQ policy searching from the feature perspective: what kind of MPQ policy ensures that the quantized model has deep features as generalizable as those of its full-precision counterpart? Motivated by our observation, we introduce the separation regularization to search for a policy that guarantees the discriminative property of deep features. The illustration of our approach is shown in Figure 1.

Problem Formulation
We consider a differentiable MPQ policy searching process [3,47,51]. Typically, the whole searching pipeline is organized as a Directed Acyclic Graph (DAG), where the nodes represent a specific quantization precision (e.g., 3bit), and the edges represent the learnable weight for the corresponding quantization precision. Therefore, a differentiable searching graph is built to determine the optimal quantization bit-width through the learnable weights, by adding a complexity constraint (e.g., BitOPs, model size) to the loss function.
Accordingly, the loss function is defined as

$$\mathcal{L} = \mathcal{L}_{task} + \lambda \mathcal{L}_{comp}, \qquad (1)$$

where $\mathcal{L}_{task}$ represents the task loss, i.e., the cross-entropy loss, that guarantees the classification accuracy, $\mathcal{L}_{comp}$ denotes the complexity loss that guarantees the target computational budget (i.e., BitOPs), and $\lambda$ is the hyper-parameter that controls the accuracy-complexity trade-off. $\mathcal{L}_{comp}$ is defined as

$$\mathcal{L}_{comp} = \sum_{l=1}^{L} \Big( \sum_{j=1}^{|B^w|} \alpha_l^j\, b_w^j \Big) \Big( \sum_{j=1}^{|B^a|} \beta_l^j\, b_a^j \Big)\, \mathrm{comp}_l, \qquad (2)$$

$$\mathrm{comp}_l = c_{in}\, c_{out}\, k_a\, k_b\, w_{out}\, h_{out}, \qquad (3)$$

where $L$ is the number of layers, $B^w$ and $B^a$ are the pre-defined bit-width candidate sets for weights and activations, and $\alpha_l$ and $\beta_l$ are the learnable weight vectors for the corresponding bit-width candidates of layer $l$, e.g., $\alpha_l^j \in \alpha_l$ represents the learned weight for bit-width candidate $b_w^j \in B^w$. $\mathrm{comp}_l$ is the BitOPs constraint of layer $l$, where $c_{in}$ and $c_{out}$ are the numbers of input and output channels, respectively, $k_a$ and $k_b$ are the kernel sizes, and $w_{out}$ and $h_{out}$ are the width and height of the output feature map. After searching, the bit-widths for the weights and activations of layer $l$ are determined by an $\arg\max$ function acting on its learnable weight vectors $\alpha_l$ and $\beta_l$. This paradigm and its variants [3,21] require the searching dataset to be consistent with the one used for full-precision model training; otherwise, serious accuracy degradation results [47]. Inevitably, using a consistent dataset leads to inefficiency, especially on large-scale datasets like ISLVRC2012 [7] with over 1 million samples to search over.
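
The complexity term and the final bit-width selection described above can be sketched as follows. This is a simplified illustration under stated assumptions: the learnable weights are normalized with a softmax before use (a common choice in differentiable search that the text does not specify), the expected weight and activation bit-widths are multiplied per layer, and all names, layer shapes and weight values are hypothetical.

```python
import math


def softmax(logits):
    """Normalize learnable weights into a probability vector (assumed normalization)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_comp(c_in, c_out, k_a, k_b, w_out, h_out):
    """Multiply-accumulate count of one convolution layer."""
    return c_in * c_out * k_a * k_b * w_out * h_out


def complexity_loss(layers, bits_w, bits_a):
    """Sum over layers of (expected weight bits) * (expected activation bits) * comp."""
    total = 0.0
    for layer in layers:
        p_w = softmax(layer["alpha"])
        p_a = softmax(layer["beta"])
        exp_w = sum(p * b for p, b in zip(p_w, bits_w))
        exp_a = sum(p * b for p, b in zip(p_a, bits_a))
        total += exp_w * exp_a * layer["comp"]
    return total


def select_policy(layers, bits_w, bits_a):
    """After the search, keep the candidate with the largest learned weight per layer."""
    policy = []
    for layer in layers:
        j = max(range(len(bits_w)), key=lambda i: layer["alpha"][i])
        k = max(range(len(bits_a)), key=lambda i: layer["beta"][i])
        policy.append((bits_w[j], bits_a[k]))
    return policy


# Toy two-layer example with hypothetical shapes and learned weights.
bits = [2, 3, 4]
layers = [
    {"alpha": [0.1, 0.2, 1.5], "beta": [0.0, 0.9, 0.3], "comp": layer_comp(3, 16, 3, 3, 32, 32)},
    {"alpha": [1.2, 0.4, 0.1], "beta": [0.5, 0.5, 2.0], "comp": layer_comp(16, 32, 3, 3, 16, 16)},
]
print(complexity_loss(layers, bits, bits))
print(select_policy(layers, bits, bits))
```

Because the expectation is differentiable with respect to the learnable weights, the complexity term can be minimized jointly with the task loss by gradient descent, and the discrete policy is only read out at the end.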
However, when an MPQ policy is searched on a proxy dataset (e.g., the small-scale CIFAR-10 with only 50000 training samples) through Equation 1 and then directly applied to a model trained on a large-scale dataset (e.g., ISLVRC2012), meeting the accuracy and complexity objectives on the proxy dataset is not of direct interest to us, because high accuracy on the proxy dataset does not imply equally high accuracy on challenging large-scale datasets. One may argue that we could shrink the target dataset to improve efficiency, e.g., by using a subset of the target dataset to conduct the MPQ search, but this would also result in serious performance degradation, as shown in Sec. 4.4.
Accordingly, instead of optimizing the above improper objective on the proxy dataset, we aim to search for an MPQ policy that guarantees a large margin on the proxy dataset in order to handle the incoming classes of the large-scale dataset.

Exploiting the Class-level Information
From the perspective of class-level features, under a well-performing MPQ policy, features should be well separated if they belong to different classes and tightly gathered if they belong to the same class. This has the following benefits: a) It alleviates the side effect of quantization on the classification boundary. As shown in Figure 2(a) and Figure 2(b), we observe that quantization sharply narrows the class boundaries in the feature space compared to the full-precision model. Therefore, an MPQ policy with an explicit feature separation guarantee can effectively alleviate the side effect of quantization. b) It is a widely pursued and dataset-independent attribute, as both classical statistical machine learning and recent deep learning research [27,32,45] recognize that a large classification margin in feature space helps generalization.

Figure 1: The illustration of our approach. During the MPQ policy search process on the small-scale proxy dataset, we not only use the conventional classification loss and complexity loss as the optimization objective, but also introduce a large-margin constraint to search for a policy that ensures the discriminative property in the feature space. In short, we hope the searched MPQ policy, with the general and favorable attribute of gathering the features of the same classes and separating the features of different classes, can be applied efficaciously to the target large-scale dataset (e.g., ISLVRC-2012) for model deployment.
Motivated by this, we aim to search for the MPQ policy that guarantees a large class margin on the proxy data distribution as much as possible. As discussed above, such a general property in the searched MPQ policies can ensure usability across data distributions. However, the cross-entropy cannot provide this property, as the class margin is not explicitly formulated. Therefore, the objective is not only to optimize accuracy and complexity, but also to find an MPQ policy that maximizes the class margin.
We regard our approach as a class-level proxy data utilization, as it discovers effective MPQ policies by leveraging the inter-class and intra-class information of the proxy dataset. The 2D visualization of our approach is shown in Figure 2(c). We observe that the t-SNE pattern is quite similar to that of the full-precision model, indicating that an MPQ policy able to separate the features has been found for the quantized model.

Separation Regularization. The first term in Equation 1 is the soft-max cross-entropy loss [3,47]. For simplicity, we revisit it here by considering a binary classification problem, which can be trivially generalized to multi-class classification. For a sample from class 1,

$$\mathcal{L}_{task} = -\log \frac{e^{w_1^\top f}}{e^{w_1^\top f} + e^{w_2^\top f}}, \qquad (4)$$

where $w_1^\top$ and $w_2^\top$ are the weights for class 1 and class 2, respectively. $f$ is the deep feature of the model produced by several convolution layers (i.e., the layers that need to be quantized to mixed precision).
Since the optimized term does not carry a margin objective, Equation 4 cannot explicitly guarantee any margin between classes. Some previous works even observe that the learned feature regions of some classes tend to be bigger than those of others. Combined with the side effect of quantization on decision boundaries, this inevitably leads the search to sub-optimal MPQ policies. In other words, the performance objective in Equation 1, the cross-entropy, is improper when the MPQ searching dataset and the full-precision model training dataset are inconsistent.
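
The lack of an explicit margin can be seen in a small numerical sketch of the binary case. The code assumes a sample of class 1 with class scores $w_1^\top f$ and $w_2^\top f$; the function name and the printed score gaps are illustrative.

```python
import math


def binary_softmax_ce(score_1: float, score_2: float) -> float:
    """Cross-entropy for a class-1 sample with scores w_1^T f and w_2^T f."""
    # -log(e^{s1} / (e^{s1} + e^{s2})) simplifies to log(1 + e^{s2 - s1}).
    return math.log1p(math.exp(score_2 - score_1))


# The loss is a function of the score gap only, and it decreases smoothly.
for gap in (0.0, 0.5, 2.0, 5.0):
    print(gap, binary_softmax_ce(gap, 0.0))
```

Any positive gap already yields a correct prediction, and the objective contains no parameter that requires the gap to exceed a chosen threshold, so a small perturbation such as quantization noise can flip samples that sit close to the boundary.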
To this end, we introduce separation regularization to enforce a large-margin guarantee in the searched policy. Firstly, a small intra-class variance should be achieved to compact the features,

$$\min \frac{1}{N} \sum_{i=1}^{N} D(f_i, \mu_{y_i}), \qquad (5)$$

where $N$ is the number of samples, and $f_i$, $y_i$ and $\mu_{y_i}$ are the feature and label (ground truth) of sample $i$ and the feature mean of class $y_i$, respectively. $D(\cdot, \cdot)$ is the metric for calculating the distance between a feature and its class mean (e.g., L2 distance). Secondly, we consider the inter-class margin by minimizing a classification loss $\mathcal{L}_{cls}$ (Equation 6), in which $h(\cdot; \cdot)$ is a map from the feature space $\mathbb{R}^d$ (i.e., $f$) to class-wise prediction scores and will be introduced in Equation 9, $\mathbb{1}(\cdot)$ is the indicator function, and $K$ represents the number of classes. $m$ is a non-negative scalar that represents the margin between classes, forming an explicit classification margin between the label class of sample $i$ and the other classes in feature space, i.e., $h_j(f_i; m) > h_k(f_i; 0)$ ($k \neq j$, and $j = y_i$). One can see that Equation 6 becomes the classic log-softmax cross-entropy loss when $h(f; m)$ is a linear transformation and $m \equiv 0$; e.g., in the classic softmax cross-entropy, a linear layer with weight $W \in \mathbb{R}^{d \times K}$ and no biases is used to project the deep feature $f_i$ to $\mathbb{R}^K$. Let $w_k$ denote the $k$-th column vector of $W$; then $h_k(f_i; 0) = w_k^\top f_i$. Please note that when $m \neq 0$, the classification margin requires the output of $h$ to have a fixed sign, always positive or always negative, which is not guaranteed in the classic softmax cross-entropy loss because the sign of a linear projection is not certain.
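
As a concrete reading of the intra-class compactness objective, the sketch below computes the average distance between each feature and the mean of its class. It assumes the L2 distance as the metric and estimates the class means from the given batch; the helper names and the toy features are hypothetical.

```python
import math


def class_means(features, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for d, value in enumerate(f):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def intra_class_distance(features, labels):
    """Mean L2 distance between each feature and the mean of its own class."""
    means = class_means(features, labels)
    total = sum(math.dist(f, means[y]) for f, y in zip(features, labels))
    return total / len(features)


# Two samples of class 0 that are 2 apart, and a single sample of class 1.
features = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]]
labels = [0, 0, 1]
print(intra_class_distance(features, labels))
```

Minimizing this quantity pulls same-class features toward a common center, which is the "tightly gathered" half of the separation property; the inter-class half is handled by the margin-based classification loss.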
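We hence follow the previous work L-GM [45], which assumes that the feature $f_i$ follows a Gaussian Mixture Distribution (GMD). Namely,

$$p(f_i) = \sum_{k=1}^{K} \mathcal{N}(f_i; \mu_k, \Sigma_k)\, p(k), \qquad (7)$$

where $p(k)$ is the prior probability of class $k$, and $\mu_k$ and $\Sigma_k$ are the mean and covariance of class $k$. The posterior probability of feature $f_i$ is derived through Bayes' rule,

$$p(y_i \mid f_i) = \frac{\mathcal{N}(f_i; \mu_{y_i}, \Sigma_{y_i})\, p(y_i)}{\sum_{k=1}^{K} \mathcal{N}(f_i; \mu_k, \Sigma_k)\, p(k)}. \qquad (8)$$

Under the GMD assumption, we can easily derive the additive inter-class margin according to

$$h_{y_i}(f_i; m) = p(y_i)\, \mathcal{N}(f_i; \mu_{y_i}, \Sigma_{y_i}, m), \qquad (9)$$

where $h(\cdot)$ is formulated from a probability perspective, thus it is guaranteed to be non-negative. By replacing the subscript $y_i$ of Equation 9 with $k$ and setting $m = 0$, we can derive $h_k(f_i; 0) = p(k)\, \mathcal{N}(f_i; \mu_k, \Sigma_k, 0)$. Substituting it and Equation 9 into Equation 6, we obtain $\mathcal{L}_{cls}$ accordingly. Finally, we apply a log-likelihood term [45], denoted $\mathcal{L}_{intra}$, to keep the feature $f_i$ centralized near its mean $\mu_{y_i}$ and thus achieve intra-class compactness according to Equation 5 and Equation 7. We assume $p(k) = \frac{1}{K}$ and that $\Sigma_k$ is diagonal, both for simplicity and in line with existing research [10,44].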
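
A minimal numerical sketch of a Gaussian-mixture classification loss with a margin is given below, under the stated assumptions of equal priors and diagonal covariances. The multiplicative factor (1 + margin) applied to the label-class distance follows the L-GM formulation [45] and is an assumption of this sketch; all function names and the toy class statistics are hypothetical.

```python
import math


def half_sq_mahalanobis(f, mean, var):
    """0.5 * (f - mean)^T Sigma^{-1} (f - mean) for a diagonal covariance."""
    return 0.5 * sum((a - b) ** 2 / v for a, b, v in zip(f, mean, var))


def log_score(f, mean, var, prior, margin=0.0):
    """log of p(k) * N(f; mean, var), with the distance enlarged by (1 + margin).

    The (2*pi)^(-d/2) constant is dropped because it cancels in the posterior.
    """
    log_det = sum(math.log(v) for v in var)
    dist = half_sq_mahalanobis(f, mean, var)
    return math.log(prior) - 0.5 * log_det - (1.0 + margin) * dist


def margin_cls_loss(f, label, means, variances, margin):
    """Negative log-posterior of the label class, margin applied to that class only."""
    prior = 1.0 / len(means)  # equal priors p(k) = 1/K
    scores = [
        log_score(f, means[k], variances[k], prior, margin if k == label else 0.0)
        for k in range(len(means))
    ]
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum - scores[label]


def likelihood_reg(f, label, means, variances):
    """Negative log-likelihood of f under its own class Gaussian (constant dropped)."""
    log_det = sum(math.log(v) for v in variances[label])
    return half_sq_mahalanobis(f, means[label], variances[label]) + 0.5 * log_det


means = [[0.0, 0.0], [4.0, 0.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
feature = [1.0, 0.0]
print(margin_cls_loss(feature, 0, means, variances, margin=0.0))
print(margin_cls_loss(feature, 0, means, variances, margin=0.3))
print(likelihood_reg(feature, 0, means, variances))
```

With a positive margin the label class is scored as if the feature were farther from its center, so the loss stays larger until the feature moves closer to its own mean relative to the other classes; the likelihood term separately penalizes the distance to the class center.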
Thus, the optimization objective during MPQ searching is

$$\mathcal{L} = \mathcal{L}_{cls} + \gamma \mathcal{L}_{intra} + \lambda \mathcal{L}_{comp}, \qquad (11)$$

where $\mathcal{L}_{cls}$ is the classification loss, $\mathcal{L}_{intra}$ is the intra-class compactness loss and $\mathcal{L}_{comp}$ is the complexity loss. $\gamma$ and $\lambda$ are the hyper-parameters that weight the corresponding losses in the optimization process.

We search the MPQ policy on the training set of the proxy datasets. After searching, we finetune (quantize) the model with the searched policies on the target dataset. We use basic data augmentation methods during finetuning and evaluate the final performance on the ISLVRC-2012 validation set.

Models.
We conduct the experiments on three representative models including the ResNet-{18, 50} [18] and the MobileNet [20].Particularly, we use the standard architecture for ResNet.
For searching, we adopt the SGD optimizer, and the initial learning rate is set to 0.01 for 15 epochs. Empirically, we find that the intra-class compactness regularization is not sensitive to its hyper-parameter and set $\gamma = 0.1$ for all proxy datasets; more details on $\gamma = 0.1$ can be found in the ablation study. We set the class margin $m = 0.3$ and $m = 0.01$ for CIFAR-10 and StanfordCars, respectively, while multiplying by the non-negative term in Equation 9. We fine-tune the hyper-parameter $\lambda$ in line with prior works on differentiable MPQ [3,47]. A higher $\lambda$ value corresponds to searching for a policy with lower computational complexity.

Table 1: Accuracy and efficiency results for ResNet. "Top-1 Q/FP" represents the Top-1 accuracy of the quantized model and the full-precision model. "MP" means mixed-precision quantization. "Cost" denotes the MPQ policy search time measured in GPU-hours. "*": reproduced with the vanilla ResNet architecture [18]. "#": the result of shortening the search epochs to half. "Ours-C": the MPQ policies searched on CIFAR-10. "Ours-S": the MPQ policies searched on StanfordCars. The lowest accuracy degradation results are bolded in each metric.

For finetuning (quantizing), we follow the basic quantization-aware training settings in LSQ [12] and LIMPQ [40]. Specifically, we use the full-precision model (trained on $D_{train}$) as the initialization and adopt the SGD optimizer with Nesterov momentum [39]; the initial learning rate and weight decay are set to 0.04 and $2.5 \times 10^{-5}$, respectively. We use the cosine learning rate scheduler and finetune the model for 90 epochs, of which the first 5 epochs are used as warm-up.
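
The finetuning schedule can be sketched as a function of the epoch index. The base learning rate (0.04), total length (90 epochs) and warm-up length (5 epochs) come from the settings above; the linear shape of the warm-up, the per-epoch granularity, and the decay toward zero are assumptions of this sketch, and the function name is hypothetical.

```python
import math


def lr_at_epoch(epoch: int, base_lr: float = 0.04,
                total_epochs: int = 90, warmup_epochs: int = 5) -> float:
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear ramp that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 4, 5, 45, 89):
    print(epoch, round(lr_at_epoch(epoch), 6))
```

In practice the schedule is often stepped per iteration rather than per epoch; the per-epoch form is used here only to keep the example short.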

Comparisons with the State-of-the-Art
We compare our method with the SOTA quantization works on the classification task.

ResNet.
We show the mixed-3bits and mixed-4bits results of ResNet-{18, 50} in Table 1. We provide the full-precision accuracy to compare the absolute accuracy degradation between the full-precision and quantized models.
For ResNet18, under 3-bits level BitOPs constraints, "Ours-C" causes only 0.5% Top-1 accuracy degradation compared to the full-precision model, which is the lowest among recent works. Under 4-bits level BitOPs constraints, "Ours-C" achieves the highest Top-1 accuracy. Meanwhile, it achieves about a 160× policy search speedup compared with FracBits. Thanks to the small data volume of StanfordCars, "Ours-S" uses only 8041 training samples to search a very competitive MPQ policy.
For ResNet50, we search 4-bits level policies. One can see that our method achieves performance quite similar to the gradient-based methods BP-NAS and FracBits while further reducing the search time significantly.
Overall, our method not only achieves accuracy comparable to searching directly on ISLVRC-2012, but also significantly improves the searching efficiency.

MobileNet. Table 2 summarizes the results of mixed-3bits and mixed-4bits on MobileNetv1.
For mixed-3bits searched on CIFAR-10, our method outperforms both the existing SOTA mixed-precision work LIMPQ and the fixed-precision work LSQ. In particular, our method achieves a 1.8% absolute gain in Top-1 accuracy compared to LSQ, and 1.2% higher accuracy than FracBits. We further narrow the gap between the full-precision and quantized MobileNet. Please note that ours is the first work to provide a 3-bits level MobileNet that achieves almost 70% Top-1 accuracy. For mixed-4bits searched on CIFAR-10, our method has up to a 237× searching efficiency improvement compared to FracBits and up to 0.4% higher accuracy compared to the SOTA efficient MPQ approach LIMPQ.
The mixed-3bits and mixed-4bits policies searched on StanfordCars show 0.3% and 0.1% absolute Top-1 accuracy degradation compared to those searched on CIFAR-10, but further save about 20% of the searching cost. This further shows that our method remains very effective even if the proxy dataset (i.e., all cars) has much lower class similarity to the target dataset.

Discussion for Proxy Datasets.
In this subsection, we observe that using CIFAR-10 as the proxy dataset yields better-performing MPQ policies than StanfordCars. On the other hand, StanfordCars has higher search efficiency than CIFAR-10. We conjecture this is because the categories of CIFAR-10 are more similar to those of the target dataset ISLVRC-2012, and CIFAR-10 contains more data than StanfordCars. Meanwhile, we find that the performance loss of policies searched on StanfordCars is slightly larger than that of CIFAR-10 when the complexity constraint becomes tighter, e.g., the mixed-3bits results for MobileNet.
Therefore, while it is feasible to search a well-performing MPQ policy by using an arbitrary proxy dataset, if the model requires more aggressive quantization, a proxy dataset with more class-level similarity compared to the target dataset could be considered to further improve the performance.

Complexity-Accuracy Trade-off
In Figure 3, we show the complexity-accuracy trade-off of LSQ [12], EdMIPS [3] and our method for ResNet18 and MobileNet. Unless otherwise specified, the proxy dataset used in our method is CIFAR-10.
For ResNet18, our method achieves significant performance gains compared to the mixed-precision approach EdMIPS, with a consistent absolute advantage of over 2% Top-1 accuracy.
For MobileNet, our method provides a very high accuracy improvement under approximately equal complexity constraints. In particular, our method improves Top-1 accuracy by 4.9% compared to LSQ at the 3G BitOPs constraint. Meanwhile, our method has a much more fine-grained trade-off thanks to mixed-precision quantization.

Ablation Study
In this subsection, we investigate: (a) the effectiveness of using a subset of $D_{train}$ as the searching dataset; (b) what happens when one adds SEAM to the baseline; (c) the performance difference under various hyper-parameter settings. As shown in Table 3, the subset of ISLVRC-2012 without the proposed method still suffers about 1% performance degradation compared to CIFAR-10 with the proposed method. This is because the data distribution in the subset is significantly different from that of the full set. When the proposed method is enabled, this subset yields better performance than StanfordCars. This further demonstrates the effectiveness of our method, and indicates that we can gain more performance by leveraging the class similarity between the proxy and target datasets.

Performance Improvement over Baseline. To show that MPQ benefits from the discriminability of feature representations, we further add the proposed method to EdMIPS [3], a baseline MPQ approach. Specifically, EdMIPS is a conventional differentiable mixed-precision quantization approach that requires the same dataset for model training and policy searching. We directly apply the proposed large-margin regularization term to it to search an MPQ policy. As shown in Table 4, we observe that the proposed method helps the baseline discover a better MPQ policy.

The use of a fixed $\gamma$ follows several regularization-based quantization-aware training studies [1,17,38]. Specifically, these studies do not enable their regularization term until after tens of training epochs. This delay is intended to avoid optimization interference between the different loss terms, ensuring that the cross-entropy (CE) term dominates early training to optimize the parameters properly. Once the CE loss becomes small, the regularization term is added and plays a major role in optimization. In this paper, we directly use a small value for the weight $\gamma$ of the regularization term to simulate the above optimization idea. We ablate this hyper-parameter in Table 5.
One can see that $\gamma$ does need a relatively small value, which conforms with our optimization principle.

Bit-width Assignment Behavior
In Figure 4, we visualize the searched MPQ policies for the mixed-3bit ResNet18, ResNet50 and MobileNet. For ResNet, we clearly see that the residual convolution layers receive nearly the highest bit-width. That is because these layers are more important for bypassing signals from shallow to deep layers [43], and they also have fewer parameters. For MobileNet, we find that a higher bit-width is assigned to the Depthwise-Convolution (DW) layers than to the Pointwise-Convolution (PW) layers, as DW layers are typically less redundant [40].

Effectiveness of Knowledge Distillation
Following SDQ [21], we use a ResNet101 as the FP distillation teacher during fine-tuning. The distillation temperature is set to 1. We compare our method with GMPQ and SDQ on 3-bits level (about 23G BitOPs) searched policies.
As shown in Table 6, our approach achieves the highest performance when knowledge distillation is applied. In particular, compared to the state-of-the-art work SDQ under approximately equal complexity, our method attains an absolute accuracy improvement of 0.5%, indicating that our method can properly search the optimal MPQ policy on a small-scale proxy dataset for the purpose of knowledge distillation.

CONCLUSION
In this work, we propose to search the MPQ policy on a small-scale proxy dataset for a model trained on a large-scale one. To bridge the inconsistent data distributions, we not only optimize the accuracy on the proxy dataset, but also enforce that the searched MPQ policy satisfies a large-margin constraint. We regard this as a class-level exploitation of the limited proxy data, which is more data-efficient than instance-level data exploitation [47].
Our class-level data exploitation enables the searched policies to compact the features within the same class and separate the features of different classes, which is a favorable and dataset-independent property. The experiments validate our idea: we use only 4% of the data to search for high-quality MPQ policies, achieving the same accuracy as searching directly on the large-scale dataset and speeding up the MPQ searching process by up to 300×.

ACKNOWLEDGMENT
This work is supported in part by Shenzhen Science and Technology Program (Grant No. RCYX20200714114523079 and JCYJ20220818101014030). The authors would like to thank the anonymous reviewers for their valuable comments.

Effectiveness of $\gamma$. The setting of a fixed $\gamma$ is inspired by several regularization-based quantization-aware training studies.

Figure 4: (a) Bit-width assignment for weights. (b) Bit-width assignment for activations.

Table 3: Results of ablation study. $D_{search}$ denotes the dataset used for MPQ policy searching. ISLVRC-2012 (4%) indicates a subset of ISLVRC-2012 sampled as 4% of the full training set.

Table 4: Effectiveness of the proposed method SEAM upon EdMIPS.

Subset of ISLVRC-2012. Although GMPQ has shown that searching directly over the proxy dataset incurs severe performance degradation, no relevant literature studies the effect of using a subset of the target dataset (e.g., ISLVRC-2012) as the proxy dataset. To this end, we randomly sample 4% (roughly the same sample size as CIFAR-10) of the training data from ISLVRC-2012 and use it to search a 3-bits level policy for ResNet18 without/with the proposed method.
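
The sampling step for this subset can be sketched as follows. The training-set size of 1,281,167 images is the commonly cited figure for the target dataset and is an assumption here (the text only states "over 1 million samples"); the seed and variable names are hypothetical.

```python
import random

TARGET_TRAIN_SIZE = 1_281_167  # assumed size of the ISLVRC-2012 training set
CIFAR10_TRAIN_SIZE = 50_000    # CIFAR-10 training samples, as stated in the text

# Size of a 4% random subset and the CIFAR-10 to target size ratio.
subset_size = round(0.04 * TARGET_TRAIN_SIZE)
ratio = CIFAR10_TRAIN_SIZE / TARGET_TRAIN_SIZE
print(subset_size, f"{ratio:.2%}")

# Draw the subset indices without replacement, with a fixed seed for repeatability.
rng = random.Random(0)
subset_indices = rng.sample(range(TARGET_TRAIN_SIZE), subset_size)
```

Under these assumptions the 4% subset and CIFAR-10 differ by only about a thousand samples, which is what makes the comparison between the two proxy choices roughly size-matched.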

Table 5: Performance of different $\gamma$ values.

Table 6: Results of finetuning ResNet18 with an external teacher model ResNet101 (*: result from Table 1).