Poster: Design of Elastic Deep Neural Network Candidate Spaces for Inference on Diverse Devices

Deep Neural Network (DNN) inference on edge devices is now common practice. However, tailoring a model to multiple devices takes considerable time and effort. While elastic models, also known as weight-sharing models, have been proposed as an efficient way to create high-accuracy, low-latency models, the design of an elastic model's candidate space (search space) has been underexplored. We identified a new characteristic of candidate spaces, which we name sensitivity, formulated a design rationale for candidate spaces based on it, and built a preliminary algorithm to generate candidate spaces. Results show that training only once yields a range of models (spanning a 2.75× FLOPs range) on the Pareto frontier of the space.


INTRODUCTION
Enhancing IoT and mobile edge computing with DNN models has become common practice. However, different devices have different computational capabilities, which requires tailoring and retraining models for each device. A reasonable balance among inference latency, performance, and training time is required. Complicated models such as the Transformer [4] and its variants [2,3] (hereinafter 'Transformers') require even greater effort. Approaches in the field of Neural Architecture Search (NAS) have targeted Transformer models. In particular, elastic models, also known as weight-sharing models, share the weights of one large model with multiple smaller models and have been successful [1,5].
We define our terminology as follows. A model is specified by a set of hyperparameters, called dimensions, that determine its exact shape and operation flow. An elastic model allows sampling a smaller model from the original model by choosing a value for each dimension and inheriting the weights of the overlapping parts. We call the original model the supernetwork and the smaller model a subnetwork. For example, an elastic BERT [2] based on BERT-base can have as dimensions the intermediate size of each of its 12 layers, the hidden size (embedding size), and the number of layers.
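As an illustration of the weight-sharing idea, the following minimal Python sketch shows one common way a subnetwork can inherit weights from its supernetwork: by slicing the overlapping portion of each weight matrix. The dimension names, shapes, and slicing scheme here are illustrative assumptions, not the poster's implementation.

```python
import numpy as np

# Toy supernetwork weights for one feed-forward block of BERT-base
# (hidden size 768, intermediate size 3072). Shapes are illustrative.
W_in = np.random.randn(3072, 768)   # hidden -> intermediate projection
W_out = np.random.randn(768, 3072)  # intermediate -> hidden projection

# A subnetwork is defined by choosing a value for each dimension.
sub_config = {"hidden_size": 384, "intermediate_size": 1536}

# Weight inheritance: the subnetwork reuses the overlapping slice of the
# supernetwork's weights instead of training its own from scratch.
h, i = sub_config["hidden_size"], sub_config["intermediate_size"]
W_in_sub = W_in[:i, :h]
W_out_sub = W_out[:h, :i]
```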
Critical to subnetwork accuracy and to the time needed to search for optimal subnetworks of an elastic model is the candidate space (the search space, in NAS terms): the set of all possible combinations of value choices for each dimension. However, previous works on elastic models have not explored the relationship between the candidate space and subnetwork performance in depth; many of them used a manually set space. While [1] explored the idea of searching for a candidate space, it did so only for a specific type of Transformer (ViT [3]) and within a limited range of space options.
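For concreteness, a candidate space can be represented as a mapping from each dimension to its list of candidate values; the number of possible subnetworks is the product of the list lengths. The dimensions and values below are illustrative only, not the spaces used in the paper.

```python
from itertools import product
from math import prod

# Illustrative candidate space for an elastic BERT (not the paper's space).
candidate_space = {
    "hidden_size": [384, 576, 768],
    "num_layers": [6, 8, 10, 12],
    "intermediate_size": [1536, 2304, 3072],
}

# Size of the candidate space: 3 * 4 * 3 = 36 possible subnetworks.
num_subnetworks = prod(len(v) for v in candidate_space.values())

# Enumerating (or randomly sampling from) all combinations gives the
# set of subnetworks that training and search must cover.
all_configs = [dict(zip(candidate_space, combo))
               for combo in product(*candidate_space.values())]
```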
In this research, we aim to find, at low search cost, a candidate space that yields high-accuracy, low-latency subnetworks for elastic models of general size and model type. We made the following contributions: (1) We identified and named the concept of dimension sensitivity, an important factor in the relationship between the candidate space and subnetwork performance. (2) Based on this new concept, we established the first design rationale for candidate spaces, which can be applied to complicated Transformer models. (3) We built a preliminary algorithm, named candidate space configuration, to construct candidate spaces that satisfy the rationale; a preliminary experiment shows that the resulting spaces can indeed generate high-accuracy, low-latency models.

SOLUTION
We first introduce the concept of sensitivity, which captures the accuracy-latency tradeoff of a dimension. We discovered that each dimension has a different tradeoff. We conducted a preliminary experiment with a sample elastic BERT whose candidate space includes the dimensions hidden size and number of hidden layers. We trained the sample elastic BERT on the MNLI GLUE task and dataset, then randomly sampled 500 subnetworks and measured their latency (in FLOPs) and accuracy. Figure 1 shows the results. The average (accuracy, FLOPs) of subnetworks with hidden size 384 and 576 are (72.4%, 2.6E+9) and (76.2%, 4.5E+9), respectively, while those of subnetworks with 7 and 9 layers are (75.5%, 3.4E+9) and (75.9%, 4.2E+9). In other words, an average subnetwork suffers a 3.8% accuracy loss for a 1.9E+9 FLOPs latency gain when the hidden size is lowered, whereas the tradeoff for the number of layers is a 0.4% accuracy loss for a 0.8E+9 FLOPs gain. We name the average accuracy loss divided by the latency gain sensitivity; a higher sensitivity value means a worse tradeoff (i.e., more accuracy lost per unit of latency gained). In this example, the best strategy for sampling high-accuracy, low-latency subnetworks is to search among subnetworks with a low number of hidden layers and a high hidden size.

Based on this insight, we propose a design rationale for candidate spaces. For high-sensitivity dimensions, the candidate values should be close to the maximum value. For low-sensitivity dimensions, it is better to allocate a wider variety of candidate values. This leads an average subnetwork to have low values for low-sensitivity dimensions and high values for high-sensitivity dimensions, minimizing the accuracy lost for the latency gained.
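A minimal sketch of how sensitivity could be estimated from sampled subnetworks follows, assuming each sample is recorded as a (config, accuracy, FLOPs) tuple. It mirrors the hidden-size comparison above; the exact estimator used in the poster may differ.

```python
from statistics import mean

def sensitivity(samples, dim, low_value, high_value):
    """Average accuracy loss per unit of FLOPs saved when `dim` is
    lowered from `high_value` to `low_value`.

    `samples` is a list of (config: dict, accuracy: float, flops: float).
    A higher return value means a worse tradeoff for this dimension.
    """
    low = [(acc, fl) for cfg, acc, fl in samples if cfg[dim] == low_value]
    high = [(acc, fl) for cfg, acc, fl in samples if cfg[dim] == high_value]
    acc_loss = mean(a for a, _ in high) - mean(a for a, _ in low)
    flops_gain = mean(f for _, f in high) - mean(f for _, f in low)
    return acc_loss / flops_gain
```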
We built a preliminary algorithm that follows this rationale: candidate space configuration. Given a Transformer model, the algorithm finds a candidate space containing high-accuracy, low-latency subnetworks of its elastic version. Omitting minor details, the core procedure is as follows (a sketch in Python appears below): (1) Initialize the candidate space: each dimension starts with 1/4, 2/4, 3/4, and 4/4 of the input model's value as its initial candidate values. (2) Repeat the following n times: (a) Train the model elastically with the current candidate space for a limited number of epochs. (b) Sample m random subnetworks and measure their latency and accuracy; from this data, compute the sensitivity of each dimension. (c) For a dimension with sensitivity higher than a threshold, remove the minimum value from its candidate values; for a dimension with sensitivity lower than a threshold, add an additional value to its candidate values. The algorithm can be summarized as an evolutionary procedure that discards inefficient subnetworks from the candidate space.
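The sketch below shows the overall shape of the candidate space configuration loop as we read the description above. The training, sampling, and sensitivity routines are passed in as placeholders, the two thresholds are assumed parameters, and the way a new candidate value is added (here, the midpoint of the two smallest values) is our assumption rather than the authors' exact rule.

```python
def configure_candidate_space(dims, train_fn, sample_fn, sensitivity_fn,
                              n_rounds, upper_thr, lower_thr):
    """Sketch of candidate space configuration.

    `dims` maps each dimension name to the input model's (maximum) value.
    `train_fn(space)` trains the elastic model on the current space for a
    few epochs; `sample_fn(space)` returns m measured subnetwork samples;
    `sensitivity_fn(samples, dim)` estimates that dimension's sensitivity.
    """
    # (1) Initialize each dimension with 1/4, 2/4, 3/4, 4/4 of its value.
    space = {d: sorted({max(1, v * k // 4) for k in (1, 2, 3, 4)})
             for d, v in dims.items()}

    for _ in range(n_rounds):                       # (2) repeat n times
        train_fn(space)                             # (a) brief elastic training
        samples = sample_fn(space)                  # (b) latency/accuracy of m subnetworks
        for d in space:
            s = sensitivity_fn(samples, d)
            if s > upper_thr and len(space[d]) > 1:
                space[d].remove(min(space[d]))      # (c) prune the smallest value
            elif s < lower_thr and len(space[d]) >= 2:
                # add an extra candidate between the two smallest values
                new_val = (space[d][0] + space[d][1]) // 2
                space[d] = sorted(set(space[d]) | {new_val})
    return space
```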

PRELIMINARY EVALUATION
We conducted a preliminary evaluation of the candidate space configuration algorithm on elastic BERT. We prepared BERT-base as the base model with the same dimensions as in Section 2. We compared our algorithm against two baselines: (1) a dense space, where the initial space is used for elastic training without any changes, and (2) a sparse space, identical to the dense space except that only the minimum and maximum candidate values are kept. Three elastic models, one per candidate space, were trained on the GLUE QQP task and dataset from the same initial weights. Figure 2 shows the distribution of latency and accuracy of subnetworks randomly sampled from each space, with the bold points on the lines marking the Pareto-frontier subnetworks. Our resulting candidate space contains only 8208 subnetworks, while the baselines without any candidate space design contain 6.8E+7 (sparse) and 3.8E+8 (dense) subnetworks, requiring considerably longer to find optimal subnetworks. Furthermore, in the FLOPs range 3.8E+9 to 7.1E+9, our space yields subnetworks with 1.5% to 2% higher accuracy at the same latency compared to the baselines.
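For reference, the Pareto-frontier subnetworks in Figure 2 can be identified with a standard sweep that keeps every sampled point not dominated by a cheaper-or-equal, more accurate one; this is a generic construction, not code from the poster.

```python
def pareto_frontier(samples):
    """Return the (flops, accuracy) points on the Pareto frontier.

    `samples` is a list of (flops, accuracy) pairs. A point is kept if no
    other point has lower-or-equal FLOPs and strictly higher accuracy.
    """
    frontier = []
    best_acc = float("-inf")
    # Ascending FLOPs; for equal FLOPs, higher accuracy first.
    for flops, acc in sorted(samples, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:          # better than every cheaper subnetwork
            frontier.append((flops, acc))
            best_acc = acc
    return frontier
```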
Our preliminary results suggest that the proposed candidate space configuration method, which utilizes dimension sensitivity, can find efficient subnetworks for a Transformer model. We plan to improve the algorithm so that it works across different combinations of tasks, datasets, and Transformer models. We also plan to measure latency on actual devices instead of using FLOPs as a proxy.

Figure 1: Accuracy and FLOPs of 500 randomly sampled subnetworks of elastic BERT trained for the MNLI task, each colored differently according to the value of a dimension.

Figure 2: Pareto-frontiers of candidate spaces.