PockEngine: Sparse and Efficient Fine-tuning in a Pocket

On-device learning and efficient fine-tuning enable continuous and privacy-preserving customization (e.g., locally fine-tuning large language models on personalized data). However, existing training frameworks are designed for cloud servers with powerful accelerators (e.g., GPUs, TPUs) and lack the optimizations for learning on the edge, which faces challenges of resource limitations and edge hardware diversity. We introduce PockEngine: a tiny, sparse, and efficient engine that enables fine-tuning on various edge devices. PockEngine supports sparse backpropagation: it prunes the backward graph and sparsely updates the model with measured memory saving and latency reduction while maintaining the model quality. Second, PockEngine is compilation-first: the entire training graph (including forward, backward, and optimization steps) is derived at compile time, which reduces the runtime overhead and brings opportunities for graph transformations. PockEngine also integrates a rich set of training graph optimizations that further accelerate training, including operator reordering and backend switching. PockEngine supports diverse applications, frontends, and hardware backends: it flexibly compiles and tunes models defined in PyTorch/TensorFlow/Jax and deploys binaries to mobile CPUs/GPUs/DSPs. We evaluated PockEngine on both vision models and large language models. PockEngine achieves up to 15× speedup over off-the-shelf TensorFlow (Raspberry Pi) and 5.6× memory saving for backpropagation (Jetson AGX Orin). Remarkably, PockEngine enables fine-tuning LlamaV2-7B on NVIDIA Jetson AGX Orin at 550 tokens/s, 7.9× faster than PyTorch.

CCS CONCEPTS
• Computer systems organization → Neural networks.


INTRODUCTION
Edge devices are ubiquitous and produce an increasing amount of data in our daily lives. The need for intelligent, personalized, and private AI is rapidly growing, as a single model fails to fit different users' needs. However, while deep learning inference is widely performed on edge devices, the training of deep neural networks is typically run on cloud GPU servers. Cloud-based training requires users to upload their personal data to the cloud, which not only incurs additional data transfer costs, but also brings privacy risks over sensitive data (e.g., healthcare data, keyboard input history, GPS location, etc.).
On-device training is a promising solution for model customization without sacrificing privacy (Figure 1). It allows a pre-trained model to continuously adapt to sensor data without sending it to the cloud. For example, the smart keyboard model can update itself to better predict the next word from users' typing history; the email assistant can learn from users' previous drafts and train personalized language models; vision models can automatically adapt to environments with domain shifts [53]. The near-sensor training paradigm also brings important benefits for energy and connectivity: it saves energy from data transmission (which is much more expensive than computation [35]); it also helps with applications like ocean sensing [25] and smart agriculture [56] that do not have physical access to the Internet.
Despite all the benefits, on-device training is difficult due to the following challenges: (1) Resource Limitations. The capacity of edge devices is orders of magnitude smaller than cloud servers. People have been trying hard to squeeze deep learning models just for edge inference, while model training and fine-tuning are more power-, computation-, and memory-expensive. We need extra memory to store all intermediate feature maps for backpropagation, and extra computation for the backward pass (roughly 3× compared to inference). Sometimes the training needs a larger batch size to ensure stable convergence, making the process even more costly. For MobileNetV2 [50], the training memory is 14× and 7.3× larger than inference (batch size 8), and for BERT [18] the peak memory usage is 7.3× larger compared to inference. Furthermore, the optimizers also require extra memory (2× for Momentum and 3× for Adam [30]). With current training frameworks, the training costs could soon exceed the resource limits of edge hardware.
(2) Hardware Diversity. While the accelerators on cloud servers are dominated by GPUs, the hardware of edge platforms has a wide range of options on the market. The processor ranges from ARM microcontrollers to powerful Apple M1 chips, and the accelerator varies between Qualcomm Adreno GPUs, Hexagon DSPs, and edge TPUs. Each hardware platform comes with a different inference library. PockEngine can directly use these inference libraries for training by compiling the training graph into the standard ONNX format. On the other hand, popular deep learning training frameworks like TensorFlow [4], PyTorch [46], and Jax [9] are developed for high-end cloud GPUs/TPUs. Their performance is poor when directly applied to edge platforms.
To address the above challenges, we introduce PockEngine, a tiny and efficient training engine designed for on-device training. We highlight the following properties:
• PockEngine provides system-level support for both dense and sparse backpropagation. Apart from updating the whole model, PockEngine supports flexible sparse update schemes by computing the gradients for only part of the weights, which proves to be a more efficient option for fine-tuning/transfer learning without harming the accuracy [10, 20, 23, 24, 37, 41, 42]. Existing training frameworks can only simulate sparse backpropagation by computing the full backward pass and masking out gradients, but cannot realize measured speedup or memory savings. PockEngine supports sparse backpropagation via graph pruning and dead code elimination thanks to its compilation nature, leading to smaller computation and memory usage.
• PockEngine is a compilation-based efficient training engine and enables many inference-only frameworks to perform training. Our compilation workflow helps to connect diverse model architectures and frontend options (e.g., vision/NLP models, PyTorch/TensorFlow/ONNX definitions) with various backend libraries (e.g., SNPE for Qualcomm, Metal for Apple Silicon, TVM), exposing a unified intermediate representation (IR). By sharing the same set of operators for both forward and backward operations, we not only enable inference frameworks to train neural networks, but also allow for various graph optimizations to improve efficiency (see Figure 4).

RELATED WORK

Cloud Deep Learning Systems
The success of deep learning is built on top of popular training frameworks such as PyTorch [46], TensorFlow [5], MXNet [12], JAX [9], etc. These systems are designed for development flexibility and depend on a host language (e.g., Python) to execute. This brings significant memory overhead (>300MB) and makes the runtime especially slow on low-frequency CPUs (e.g., ARM Cortex). Moreover, the operator kernels are optimized for high-end GPU devices and lack performance tuning for edge devices, and some overheads, such as extra gradient buffers for the optimizer step, are not considered a bottleneck on powerful server hardware. PockEngine is a compilation-based framework, so the runtime does not rely on host languages, as compared in Table 1. This moves most workloads from runtime to compile time to minimize the runtime overhead and enables later optimizations to improve training throughput.

Edge Deep Learning Systems
When deploying models on tiny edge devices, inference libraries like TVM [13], TF-Lite, NCNN [1], TensorRT [2], and OpenVINO [57] deliver optimized kernels for mobile platforms and provide a lightweight runtime without a host language. However, they focus mostly on inference and do not support on-device training. MNN [29] has preliminary support for CNNs, but the flexibility is rather limited and it does not optimize training memory usage. POET [47] applies rematerialization and paging to deal with restricted memory size, but it introduces extra computation, relies on large external Flash (e.g., a 32GB SD card), and does not support general model and workload definitions. PockEngine provides complete training support for popular models at various scales, including MCUNet [40], MobileNetV2 [50], ResNet [22], DistilBERT [51], and BERT [18]. PockEngine optimizes both computation and memory efficiency to make on-device training easy and realistic.

Efficient On-Device Learning Algorithms
Edge devices have limited computational capacity. Therefore, on-device training for edge devices often focuses on transfer learning [10, 33]. It first pre-trains the model on large-scale datasets to learn general and rich features, such as ImageNet [17] for ConvNets or BooksCorpus [64] for BERT. The model is then transferred to downstream tasks, such as Visual Wake Words [16] for vision or the GLUE benchmark [58] for language. Afterwards, the model can be customized with a small amount of personal data (e.g., learning a user's accent) to perform better at the same task. Due to the smaller scale and diversity of the downstream data, people found that it is not always necessary to update the entire model to achieve good performance. Sparsely updating part of the model proves to be a good solution that achieves similar or better performance at a smaller training cost [10, 20, 23, 24, 37, 41, 42]. The most straightforward method is to fine-tune only the classifier layer [11, 19, 21, 52], but the capacity is limited when the domain shift is large. For CNN models, people have investigated fine-tuning only biases [10, 61], batch normalization layers [20, 43], added parallel branches [10], etc. The sparse backpropagation scheme is even more popular for adapting pre-trained language models (e.g., BERT [18], GPT [49]) to various downstream tasks, which significantly reduces the trainable parameters [23, 24, 37]. However, sparse backpropagation lacks system support. Despite the great theoretical savings, existing training frameworks cannot realize measured speedup or memory saving from sparse backpropagation. PockEngine provides system-level support for such flexible workloads to deliver a faster program and an efficient runtime.

Computation Graph Transformation and Optimizations
There are plenty of graph transformations for inference scenarios. For example, one common transform used in edge deployment is data layout conversion, as the 'NCHW' layout preferred by GPU training is not efficient on the edge. Another common optimization technique is layer fusion. IO-intensive layers (e.g., ReLU) can usually be fused into preceding compute-intensive layers (e.g., CONV, LINEAR). In addition, MetaFlow [27] proposes functional-preserving graph transformations to optimize DNN architectures. TASO [26] further introduces automated generation of transformation rules using formal verification. These techniques have been proven effective in inference, but few studies have explored their performance on training, even though the training graph is much more complex.
Standing on the shoulders of this conventional wisdom, PockEngine is an early exploration of applying these graph optimization techniques to on-device training and discovering further optimization opportunities. PockEngine shows that these optimizations bring up to 1.2× speedup.

Compilation-Based Workflow
Existing training frameworks (e.g., PyTorch, TensorFlow) are based on runtime auto-differentiation for flexibility. However, this design is not suitable for edge devices with limited memory and computation resources. Instead, PockEngine is based on a compilation-based workflow, with the following benefits: Offload Workload from Runtime to Compile Time. With the compilation-centric design, we can offload part of the workload from runtime to compile time, such as backward graph derivation with autodiff, memory scheduling, and execution planning. Modern neural networks usually consist of thousands of operators; the overhead might be small for cloud servers but is not negligible for edge devices (Figure 7).
By offloading computation to the compiler, it is possible to perform more aggressive optimizations that would not be feasible or efficient to perform at runtime.For example, PockEngine performs graph pruning, fusions, and backend switching, which can lead to significant performance gains and memory saving.
Another advantage of the compilation-based workflow is that it allows us to optimize the code across the entire program, rather than just focusing on optimizing individual operations at runtime. This not only allows us to compile only the used operators to ship slim binaries, but also reveals the memory redundancy in the training loop (details in Section 3.2). Support Diverse Frontends/Backends. Unlike the cloud, edge platforms are highly diverse, with different instruction sets, degrees of parallelism, etc. Our compilation-based workflow provides general support for various frontends/backends. It can effortlessly support training on hardware and vendor libraries that are designed specifically for inference (e.g., PockEngine can enable training on Qualcomm Hexagon DSPs with the SNPE library).
The PockEngine frontend takes in a neural network expressed in various representations (e.g., ONNX, torchscript, tf.graph) and analyzes the DAG structure. It then performs automatic differentiation (autodiff) to derive the backward graph, which computes the gradients w.r.t. the loss function (Figure 7). With the static forward and backward graph, PockEngine converts it into a unified intermediate representation (IR), performs graph optimizations (introduced later), and generates code for different backends. Only the used operators are compiled, and PockEngine links these OPs to build a lightweight executable binary. The PockEngine backend supports both vendor libraries (e.g., SNPE for Snapdragon GPUs and DSPs, TensorRT for NVIDIA GPUs) and customized kernels (e.g., TVM [13] tuning for ARM CPUs).
Notably, instead of binding each operator with a backward implementation (e.g., matmul, matmul_backward), PockEngine uses the same set of primitive operations as inference to construct the training graph, allowing us to utilize inference-only backends (e.g., SNPE, TensorRT, TVM) for training, achieving high efficiency at minimal engineering effort.
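To make the compile-time derivation concrete, below is a minimal sketch over a toy static-graph IR; the class, op, and function names are illustrative and are not PockEngine's actual IR or API.

```python
# Minimal sketch of compile-time differentiation over a toy static-graph IR.
# All class/op names here are illustrative, not PockEngine's actual IR.
class Node:
    def __init__(self, op, inputs=(), requires_grad=False):
        self.op = op
        self.inputs = list(inputs)
        self.requires_grad = requires_grad

def topo(node, seen=None, order=None):
    seen = set() if seen is None else seen
    order = [] if order is None else order
    if node in seen:
        return order
    seen.add(node)
    for x in node.inputs:
        topo(x, seen, order)
    order.append(node)
    return order

# Backward rules are expressed with the SAME primitive ops used for inference,
# so an inference-only backend can also execute the derived training graph.
BACKWARD = {
    "matmul": lambda dy, a, b: (Node("matmul", [dy, Node("transpose", [b])]),
                                Node("matmul", [Node("transpose", [a]), dy])),
    "add":    lambda dy, a, b: (dy, Node("reduce_sum", [dy])),
    "relu":   lambda dy, a: (Node("mul", [dy, Node("step", [a])]),),
}

def derive_backward(loss):
    """Walk the DAG in reverse topological order and emit gradient nodes."""
    grads = {loss: Node("ones_like", [loss])}
    for node in reversed(topo(loss)):
        if node not in grads or node.op not in BACKWARD:
            continue
        for x, dx in zip(node.inputs, BACKWARD[node.op](grads[node], *node.inputs)):
            if x.requires_grad or x.inputs:       # skip frozen leaves (sparse-BP hook)
                grads[x] = Node("add", [grads[x], dx]) if x in grads else dx
    return grads                                  # static backward graph for codegen
```

Because every emitted gradient node is built from inference primitives (matmul, transpose, reduce_sum, ...), the derived graph can be lowered through the same backend path as the forward graph.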

Sparse Backpropagation and Computation Graph Pruning
Edge devices have a limited computation capacity compared to the cloud. Therefore, on-device training on the edge usually targets a transfer learning/fine-tuning scenario. Due to the smaller scale and diversity of the downstream data, people found that updating the entire model may not always lead to the best performance due to over-fitting and feature distortion [10, 33]. Updating only a subset of the model is proven to be a good solution that achieves similar or better performance at a much smaller training cost, including updating bias terms [10] and normalization layers [20] for vision models, training the low-rank parts [24] and input prompts [37] for language models, and sparsely updating the important modules [41]. PockEngine aims to generally support on-device training for various workloads, and we focus on sparse updates to reduce training costs.
During compilation, PockEngine takes in a user-defined sparse backpropagation scheme and prunes the corresponding subgraphs of the backpropagation calculation. PockEngine flexibly supports the following sparse backpropagation patterns: Bias-only Update. Bias-only update does not require saving the intermediate activations [10], which significantly reduces memory usage (consider a linear layer y = Wx + b with dW = f1(dy, x) and db = f2(dy): only the weight gradient requires saving the input x). It also saves 1/3 of the backward computation by skipping the dW computation.
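The memory argument can be made concrete with a small sketch, assuming a plain linear layer implemented as a custom PyTorch autograd function; this is an illustration of the idea, not PockEngine's generated code.

```python
import torch

# Sketch (not PockEngine's generated code): bias-only backward for y = x @ W.T + b.
# db depends only on dy, so the input x never needs to be saved for backward,
# while dW = dy.T @ x would require it -- skipping dW removes that buffer.
class BiasOnlyLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, W, b):
        ctx.save_for_backward(W)          # note: x is deliberately NOT saved
        return x @ W.t() + b

    @staticmethod
    def backward(ctx, dy):
        (W,) = ctx.saved_tensors
        dx = dy @ W                       # still propagate upstream (chain rule)
        db = dy.sum(dim=0)                # bias gradient needs only dy
        return dx, None, db               # None: the weight stays frozen
```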
Layer-wise Sparse Backpropagation. Not all the layers/weight tensors are equally important for transfer learning [41]. For transfer learning to a downstream task, we find that part of the layers can be kept frozen without affecting the transfer learning performance (we can find the layers to freeze by sensitivity analysis [41]; detailed in Section 4.1). Therefore, we can skip the computation of part of the layers to further improve the training throughput.
Sub-layer Sparse Backpropagation. For edge devices with limited capacity (e.g., microcontrollers), we further support sub-layer-level sparse BP, where only part of the channels of a layer (convolutional and linear layers) are updated. This further reduces the memory cost for storing intermediate activations (we do not need to store activations for the frozen channels) and the computation cost for gradient calculation. Additionally, sparse backpropagation does not backpropagate through the very first layers of DNN models, since there is no need to compute gradients for the front layers if they do not require updates (the red X mark in Figure 5).
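A rough sketch of the sub-layer pattern for a linear layer is shown below, updating only the weight columns that touch the first k input channels so that only that activation slice must be saved; the slicing choice and names are illustrative, not PockEngine's generated code.

```python
import torch

# Sketch of the sub-layer pattern for a linear layer: only the weight columns
# touching the first k input channels are updated, so only x[:, :k] is saved.
class PartialChannelLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, W, b, k):
        ctx.save_for_backward(x[:, :k].contiguous(), W)   # store only the needed slice
        ctx.k = k
        return x @ W.t() + b

    @staticmethod
    def backward(ctx, dy):
        x_k, W = ctx.saved_tensors
        k = ctx.k
        dx = dy @ W                        # full chain rule to earlier layers
        dW = torch.zeros_like(W)
        dW[:, :k] = dy.t() @ x_k           # gradients only for the trainable columns
        db = dy.sum(dim=0)
        return dx, dW, db, None
```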

POCKENGINE
None of the prior work can convert the theoretical savings into measured speedup and memory savings. PockEngine provides systematic support for sparse BP and is able to actually reduce the on-device training cost, which we expand on below.

Searching for Sparse Backpropagation Scheme
Not all the weights are equally important for transfer learning [20, 34, 41]. We aim to fine-tune only the important weights to reduce the training costs while preserving the model's accuracy.
Cost Model and Search Criterion. In order to find the training scheme, we build cost models for model quality and training cost. Following [41], we first fine-tune only one linear (conv, fc) layer until convergence, and then repeat this process for all layers. This is an offline analysis, and we use the accuracy improvement/degradation as the "contribution" of the i-th layer's weights (Δacc_Wi). Similarly, we obtain the results for the bias terms of the i-th layer (Δacc_bi), and then iteratively repeat the same operations on all weights and biases to estimate their contributions.
For the training cost, we focus on memory, as edge devices usually have limited memory and easily run out of it. Thus we profile the feature map sizes and record them as Memory_{i,j,r}. We then solve the following optimization:

maximize  Σ_i Δacc_{W_i, r_i} + Σ_j Δacc_{b_j}    s.t.   Memory(i, j, r) ≤ constraint,

where i is the layer index of updated weights, j is the layer index of updated biases, and r is the ratio of learnable weights within an updated layer. Optimizing the objective finds the update config whose total contribution is maximized while the memory footprint does not exceed the constraint. We assume that the accuracy contribution of each tensor (Δacc) can be summed up; thus the problem can be efficiently solved with evolutionary search (a toy sketch is given below). Generalization and Acceleration. It is worth noting that the sparse update scheme is general and universal across different datasets. We only perform ONE scheme search on CIFAR (for vision models) and CoLA (for language models), and sparse-BP demonstrates good generalization capability. The schemes achieve competitive training accuracy compared to full fine-tuning (Table 2 and Table 3). Specifically, we find that for CNNs it is most effective to update the weights of the first convolution in each block, while for transformer blocks the weights in the attention module and the first linear layer in the Feed-Forward Network (FFN) are more important (Figure 6). Such schemes are also memory-efficient: the depthwise conv and second pointwise conv in the inverted bottleneck block (Figure 6a) and the second linear layer in the FFN (Figure 6b) have the largest input activations, while our update scheme does not require saving these large features.
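A compact illustration of how the constrained maximization above might be solved with a simple evolutionary search is sketched below; the cost tables, mutation rates, and population settings are placeholders rather than the profiling data or search hyper-parameters actually used.

```python
import random

# Toy evolutionary search for the constrained maximization above. Each scheme is a
# list of (update_weight, update_bias) flags per layer; contrib and mem hold the
# profiled delta-accuracy and activation-memory costs (placeholder values here).
def search_scheme(contrib, mem, budget, pop_size=64, iters=200, mut_rate=0.1):
    n = len(contrib)

    def fitness(scheme):
        acc = sum(contrib[i][0] * w + contrib[i][1] * b
                  for i, (w, b) in enumerate(scheme))
        cost = sum(mem[i] for i, (w, _) in enumerate(scheme) if w)
        return acc if cost <= budget else float("-inf")   # reject over-budget schemes

    def random_scheme():
        return [(random.random() < 0.3, random.random() < 0.5) for _ in range(n)]

    def mutate(scheme):
        return [(w ^ (random.random() < mut_rate), b ^ (random.random() < mut_rate))
                for (w, b) in scheme]

    population = [random_scheme() for _ in range(pop_size)]
    for _ in range(iters):
        population.sort(key=fitness, reverse=True)
        elites = population[: pop_size // 4]
        population = (elites
                      + [mutate(random.choice(elites)) for _ in range(pop_size // 2)]
                      + [random_scheme() for _ in range(pop_size // 4)])
    return max(population, key=fitness)
```

For example, `search_scheme(contrib=[(0.4, 0.1)] * 20, mem=[1.0] * 20, budget=5.0)` would return per-layer update flags that stay within the hypothetical 5-unit memory budget.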
After finding and specifying the gradients needed for on-device training, PockEngine automatically traces dependencies, analyzes the updated topology, and then prunes the training graph using dead code elimination (DCE), removing intermediate nodes and buffers that are no longer needed for training. Because the pruning is performed at the graph level at compile time, it delivers measured memory saving and throughput improvement.
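Conceptually, this pass is a reverse reachability walk from the outputs that are still requested. A rough sketch, reusing the toy Node IR from the earlier autodiff sketch (illustrative only, not PockEngine's actual pass):

```python
# Toy dead-code-elimination pass: keep only nodes reachable from the outputs that
# are still requested (the loss plus the gradients selected by the sparse update
# scheme). Dead nodes, e.g. the dW subgraphs of frozen layers, are dropped together
# with their buffers, so the saving is realized at compile time.
def prune_training_graph(all_nodes, kept_outputs):
    live, stack = set(), list(kept_outputs)
    while stack:
        node = stack.pop()
        if node in live:
            continue
        live.add(node)
        stack.extend(node.inputs)          # whatever a live node reads stays live
    return [n for n in all_nodes if n in live]
```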

Training Graph Optimization
After we get the static, pruned training graph, PockEngine applies various graph optimization techniques on the unified IR before translating to different backends, which further improves the training efficiency.
Operator Reordering and In-place Update. Different execution orders lead to different life cycles of tensors, and the overall/peak memory footprint is affected even for the same computation graph. This has been well studied for inference [6, 38] but less discussed for training, because the backward graph is usually derived at runtime and the compiler/scheduler does not have global information about the training process.
A concrete example is the optimizer, where the gradients are applied to update the model parameters. In conventional training, frameworks calculate all gradients and then apply the update. This is common among frameworks like PyTorch and TensorFlow, as the optimizer and the forward-backward pass are separate components in the system design. However, such a practice leads to significant memory waste for storing the gradients. In small-batch training with sparse backpropagation, the cost of storing parameter gradients is close to the peak memory usage of the forward and backward passes, as shown in Table 4. To address this overhead, PockEngine obtains all tensor information and plans a better execution schedule. By reordering operators, the gradients can be immediately applied to the corresponding parameters before back-propagating to earlier layers. We further trace the life cycle of all tensors (weights, activations, gradients) and re-order the schedule to reduce memory usage, leading to up to 21× savings on microcontrollers for MCUNet.
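The effect of this reordering can be emulated at runtime in PyTorch (2.1+) with per-parameter hooks; the sketch below only illustrates the memory behavior and is not PockEngine's compile-time pass.

```python
import torch

# Runtime emulation of the compile-time reordering described above: apply the
# update to each parameter as soon as its gradient is produced, then free the
# gradient buffer, instead of holding every gradient until optimizer.step().
def attach_immediate_sgd(model, lr=1e-2):
    handles = []
    for p in model.parameters():
        if not p.requires_grad:
            continue

        def apply_now(param):
            with torch.no_grad():
                param.add_(param.grad, alpha=-lr)   # in-place parameter update
            param.grad = None                       # release the buffer right away

        handles.append(p.register_post_accumulate_grad_hook(apply_now))
    return handles   # keep the handles alive for as long as training runs
```

After `loss.backward()` returns, every parameter has already been updated and no per-parameter gradient tensors remain resident, mirroring the memory effect of the compile-time reorder.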
Operator Fusion. In most deep learning frameworks, a simple operation usually requires a number of fine-grained kernels to implement. For example, a single layer-normalization operation requires three kernel calls and two memory reads/writes for the forward pass, and six kernel calls and five memory reads/writes for the backward pass. Moreover, transformations such as fusing cheap operations into expensive ones (e.g., CONV-BN-ReLU) and batching parallel linear operations (e.g., batched matmul) have been shown to be effective in improving inference. During compilation and codegen, PockEngine fuses these kernels into a single one, resulting in less memory I/O and fewer kernel calls.
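The idea can be illustrated in PyTorch terms (PockEngine performs the fusion during codegen on its own IR, not through these library calls):

```python
import torch

# Unfused: three kernels with extra round-trips through memory.
def linear_bias_relu_unfused(x, W, b):
    y = x @ W.t()          # kernel 1
    y = y + b              # kernel 2
    return torch.relu(y)   # kernel 3, allocates another buffer

# Fused at the graph level: the bias add is folded into the GEMM (addmm)
# and the ReLU is applied in place, avoiding an extra buffer.
def linear_bias_relu_fused(x, W, b):
    return torch.addmm(b, x, W.t()).clamp_min_(0)
```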
Functional-Preserving Graph Transformation. Existing DNN frameworks optimize a computation graph by applying rules either designed by domain experts [2, 4] or automatically discovered by programs [26, 28]. There are more optimization opportunities in training, but previous research was unable to utilize them since the backward graph was derived at runtime in earlier frameworks: extensively searching for potential graph optimizations at runtime would slow down training and incur undesired overhead.
Our engine integrates these optimization techniques and is an early attempt to apply them to the training graph. PockEngine transforms the data layout for different hardware. For vision tasks, NCHW is the most widely used layout, but this format is only efficient on accelerators like GPUs. When training on mobile CPUs/DSPs, such a format is no longer optimal, and PockEngine transforms the layout at compile time to facilitate runtime training efficiency.
Furthermore, PockEngine explores different kernel implementations. For example, Winograd has been widely used in inference because of its faster computation. However, the savings are not free: it requires extra pre-processing of the weights. If the weights are not static, the transformation needs to be applied every epoch and the total FLOPs can be even higher than a normal convolution. Hence it has been utilized for inference but not incorporated into training frameworks. In on-device training scenarios, there are many frozen layers whose weights do not change during training [10, 61]. These layers can in fact use Winograd for acceleration, but such opportunities are ignored in current frameworks even when the requires_grad attribute is set to False. PockEngine obtains the complete training graph at compile time and thus knows the update status of each parameter. Therefore, we can analyze the tensor and graph information, knowing which weights are static and which are dynamic, and PockEngine binds each operation to the fastest implementation, enabling Winograd even during training (see the sketch below).
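A minimal sketch of such a compile-time dispatch decision follows; the kernel names are placeholders for whatever the target backend actually provides.

```python
from dataclasses import dataclass

# Sketch of the compile-time dispatch decision; the kernel names are placeholders
# for whatever the target backend actually provides.
@dataclass
class ConvInfo:
    kernel_size: int
    stride: int
    weight_is_frozen: bool     # known statically from the pruned training graph

def pick_conv_impl(conv: ConvInfo) -> str:
    if conv.weight_is_frozen and conv.kernel_size == 3 and conv.stride == 1:
        # The Winograd weight transform can be precomputed once at compile time,
        # so its preprocessing cost is never paid again during training.
        return "winograd_conv3x3"
    if conv.kernel_size == 1:
        return "gemm_conv1x1"
    return "direct_conv"       # weights change every step; keep the plain kernel
```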

RESULTS
In this section, we comprehensively evaluate the performance of PockEngine. We first study the effectiveness of sparse backpropagation, then present the experimental results on different hardware and platforms, compared with other training frameworks. Finally, we discuss the graph optimization results.
We update the biases of the last 7 blocks and update {100%, 100%, 50%, 100%} of the weights of the first convolutions for the intermediate 4 blocks. Sparse backpropagation matches the accuracy of full fine-tuning on both vision and language models (<1% performance drop). On some downstream datasets, the performance of sparse backpropagation is even higher, surpassing the full-update baselines, such as Flowers in vision and MRPC accuracy in language. The performance is far above the common requirements for TinyML [7] (80% accuracy on VWW), suggesting sparse backpropagation is a good strategy for on-device training. Furthermore, when evaluating language models, sparse backpropagation also maintains the fine-tuning accuracy at a reduced training cost. The average performance degradation is within 1%. This means that sparse backpropagation can effectively reduce the time and cost required for training language models without sacrificing accuracy. In fact, the results show that sparse backpropagation can even improve the model's performance on certain sub-tasks (e.g., MRPC and RTE). By making training more efficient, sparse backpropagation could help to accelerate progress in these fields and enable the development of more advanced language models.
Sparse Backpropagation Reduces Training Time and Memory. Besides the comparable performance when transferring to downstream tasks, sparse backpropagation greatly reduces the training peak memory and improves the training speed.
As shown in Table 4, the training memory grows rapidly with the batch size and soon exceeds the limit of edge devices (e.g., 1GB for Raspberry Pi); using swap or rematerialization techniques [47] would introduce extra computation and energy cost. Sparse backpropagation cuts down peak memory usage (2.2× to 21.3×), and the saving is general across models and applications. Even when the batch size grows, the required memory remains small: the memory cost of training MCUNet-5FPS with sparse BP at batch size 8 is still smaller than that of full BP at batch size 1. Batched training helps improve device utilization as well as training stability.
When applying sparse backpropagation, operations and tensors related to frozen layers are automatically trimmed from the training graph via dead code elimination, resulting in less computation and higher training throughput. Figure 9 shows that sparse backpropagation further accelerates training by 1.3× to 1.6× on Raspberry Pi. Previous efficient training algorithms only discuss the theoretical savings; PockEngine provides the system-level support that translates them into measured reductions. Furthermore, the compilation-based workflow allows us to choose the best runtime backend for different training scenarios, including both vendor libraries (e.g., SNPE for Snapdragon GPUs and DSPs, TensorRT for NVIDIA GPUs) and customized kernels (e.g., TVM-tuned kernels for ARM CPUs and Apple M1). We present a comparison of training workflows in Figure 9 and discuss it below:
Edge CPU. For platforms like the Raspberry Pi, PockEngine offers 13 to 21× better performance compared to popular DNN training frameworks. This speedup is due to kernel tuning, which existing frameworks either overlook in favor of GPU kernel implementations (PyTorch, TensorFlow, Jax) or optimize only for the inference pipeline and operators (MNN). The corresponding ARM kernels do not provide ideal performance, let alone the overhead brought by the frameworks.

PockEngine Speeds Up On-Device Training
Edge GPU. We benchmark edge GPU platforms using NVIDIA Jetson Nano and Jetson AGX Orin due to their widespread use in edge applications. GPUs have a much higher degree of parallelism and better training throughput than CPUs. The faster training speed of PockEngine (2.2× to 2.6× speedup) is mainly due to the compilation process: the host language Python is typically slow on low-frequency CPUs, while PockEngine's compiled graph can run without host languages. While other frameworks like TensorRT [2] may also achieve this, they are limited to inference only and do not provide training support.
Apple M-Chip. The Apple M1 chip is a relatively new platform for training. While PyTorch and TensorFlow have preliminary GPU support, the compatibility is not ideal 3. Even with the latest build (commit ID: c9913cf), PyTorch throws errors when launching training for BERT and DistilBERT. On the other hand, PockEngine compiles the training graph to Metal, providing better compatibility and faster training speeds.
Mobile DSP. For the Qualcomm DSP, we integrate SNPE [48] to deliver the final binaries. It is worth noting that SNPE is conventionally an inference-only library for integer models, and PockEngine easily extends it with training capability. As shown in Figure 9(g), the peak performance of the DSP is impressive and even on par with edge GPUs.
Microcontrollers. For the microcontroller platform, we integrate TinyEngine [40] to perform the codegen and enable training under extremely limited memory constraints. Previous frameworks like TF-Lite-Micro [3] are inference-only, so we report their projected latency. As shown in Figure 7(c), their speed is much lower than PockEngine's.

3 https://github.com/pytorch/pytorch/issues/77764

PockEngine enables efficient on-device training by compilation and adaptation to various runtimes. It further supports advanced backpropagation schemes and graph optimizations, which we expand on in the following section.

FINE-TUNING CHATBOT WITH POCKENGINE
With the growing attention ChatGPT has received, the demand for fine-tuning one's own chatbot models has also been increasing. This allows users to tailor the model to their domain-specific needs (e.g., law, biomedical, health care) and ensures privacy (e.g., private emails, personal assistants) by not uploading information to the cloud. By fine-tuning our own language model, we can address these concerns and obtain a high-quality language model that meets our needs. In this section, we demonstrate how PockEngine can efficiently fine-tune a chatbot on an edge platform (Jetson AGX Orin).
Models. We choose Meta's LlamaV2 [55], using the 7B model as the backbone for our experiments. This decision is based on the trade-off between model quality and device resources. The detailed fine-tuning settings are discussed below.
Evaluation. For evaluation, we follow Alpaca-Eval [36] and MT-Bench [62] and use LLMs as the automated evaluator for benchmark generation and performance assessment. The quality of the answers is evaluated based on helpfulness, relevance, accuracy, and detail, over 805 questions and 80 questions from the Vicuna project. This is a pairwise comparison, and we choose text-davinci-003 for the Alpaca-Eval win rate (%) and ChatGPT-3.5-Turbo for the MT-Bench score.
Datasets. To align pre-trained language models with instructions, we follow self-instruct [59] and adapt data from Stanford Alpaca [54]. The total training set has 52K examples containing diverse instructions and responses.
Fine-tuning. We fine-tune the models for 3 epochs using a learning rate of 1e-4 and no weight decay. The optimizer we use is the memory-efficient Lion [14], and the maximum sentence length is limited to 512. The instruction-tuning batch size is 1, and the gradient is accumulated over 16 steps. We sparsely update the biases of the last 5 blocks (out of 32) and the weights of the attention module and the first linear layer in the FFN for the last 5 blocks. We further freeze the layer-norm layers to reduce training costs and speed up training. For concreteness, this recipe is summarized in the sketch below.
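The field names in this summary are illustrative and are not PockEngine's actual configuration format.

```python
# Illustrative summary of the instruction-tuning recipe above; field names are
# made up for readability and are not PockEngine's actual configuration format.
finetune_config = {
    "model": "LlamaV2-7B",
    "epochs": 3,
    "optimizer": "Lion",                 # memory-efficient optimizer [14]
    "learning_rate": 1e-4,
    "weight_decay": 0.0,
    "max_seq_len": 512,
    "batch_size": 1,
    "grad_accumulation_steps": 16,
    "sparse_update": {
        "bias_blocks": 5,                # biases of the last 5 transformer blocks
        "weight_blocks": 5,              # attention + first FFN linear, last 5 blocks
        "weight_modules": ["attention", "ffn.linear1"],
        "freeze_layernorm": True,        # LayerNorm layers are kept frozen
    },
}
```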

Quantitative Comparison.
PockEngine Accelerates Training. As shown in Table 5, PyTorch can train on the Jetson AGX Orin, but one iteration takes more than 7 seconds for LlamaV2-7B. Fine-tuning on 1000 records would require 2 hours, while PockEngine accelerates training by 4.4× and can finish in less than half an hour.
Sparse Backpropagation Accelerates Training. For popular parameter-efficient fine-tuning methods like LoRA [24], although they can effectively reduce the memory footprint (from 45.1GB to 30.9GB), the training cost is not significantly improved as they still need to backpropagate to the first layer. In contrast, sparse backpropagation reduces the backpropagation depth and significantly improves training speed (from 1768 ms to 914 ms, 1.9× faster).
Sparse-BP Achieves Comparable Accuracy. Besides training throughput improvement, sparse backpropagation also maintains fine-tuning accuracy. When compared to full-BP, sparse-BP demonstrates similar performance, achieving an impressive Alpaca-Eval score of 43.7. This score closely matches the performance of full-BP, which has an Alpaca-Eval score of 44.1. Sparse-BP also performs favorably when compared to LoRA (Alpaca-Eval score 43.1).

Qualitative Comparison between Full-BP and Sparse-BP
In addition to the quantitative analysis on accuracy and throughput, we also conduct a qualitative analysis to compare the response quality of sparse-BP and full-BP when performing instruction tuning on LlamaV2-7B. We make the assumption that a model's knowledge and capabilities are learned primarily during pretraining, while alignment teaches LLMs how to interact with users [63]. Therefore, we examine the response quality from multiple perspectives.
Factual Recall. To thoroughly evaluate the performance of full and sparse backpropagation, we conduct a detailed analysis of the corresponding models' proficiency in answering questions related to general knowledge. We first examine the capability to correctly identify the capital city of a given country. All models consistently deliver accurate responses, while the tuned models not only provide accurate answers but also offer more comprehensive information, enriching the overall user experience. Similarly, when we inquire about the dates of notable events, all models exhibit a remarkable level of confidence and accuracy in their responses. This suggests the reliability and correctness of both full and sparse backpropagation during fine-tuning. However, when the question becomes obscure, the original Llama model no longer provides accurate answers. For example, the raw Llama model (without fine-tuning) incorrectly states that Michael Jackson died in 2020 (the actual year is 2009), and it also provides incorrect album names (the proper title is "Invincible"). Meanwhile, the Full-BP model and the Sparse-BP model both properly interpret the questions and provide accurate responses.
Input: what was Michael Jackson's last album?
LlamaV2-7B-original: Michael Jackson passed away on June 25, 2020, so he did not release an album after that. However, before his death, he released the album "Michael Jackson: The Ultimate Collection," which was a compilation of his previous studio albums and bonus tracks.
LlamaV2-7B-tuned (Full-BP): Michael Jackson's last album was 'Invincible', released in 2001.
LlamaV2-7B-tuned (Sparse-BP): Michael Jackson's last album, released during his lifetime, was "Invincible", which was released in 2001.
Example 3. The Title of Michael Jackson's Final Album

Operation and Logic Inference. Next, we evaluate a simple but interesting task: reversing the order of words. This task requires LLMs to understand the concept of reversing and to properly segment the sentence. The original model misunderstood the task and reversed the words' meanings, leading to incorrect answers. Meanwhile, the fine-tuned Full-BP and Sparse-BP models both correctly understood the term "reverse" and executed the given task, though the Sparse-BP model made a tiny mistake and still misplaced the word "the".
Input: Please reverse the words in the sentence "I love the Micro conference"
LlamaV2-7B-original: I hate the conference Micro.
LlamaV2-7B-tuned (Full-BP): The reversed result is "Conference Micro the love I"
LlamaV2-7B-tuned (Sparse-BP): The reversed sentence would be "Conference the Micro love I".
After comparing the response quality on executing given tasks, we then evaluated the logical inference capability of the fine-tuned models. We first briefly introduced the hosting locations of the Micro conference for the past three years and asked where the conference was held last year. Without fine-tuning, the original Llama model failed to understand the question and simply repeated the first sentence. However, for the fine-tuned models, both Full-BP and Sparse-BP were able to correctly name the place, with the response from Full-BP being more confident.
Maths. We then compared the mathematical abilities of different models. Mathematics is often considered a significant challenge for many language models, and indeed the original LlamaV2 model struggled with math-related questions, while the Full-BP and Sparse-BP models demonstrated certain capabilities for reasoning and properly calculating the results.
Input: Lee mows one lawn and charges $33. Last week he mowed 16 lawns and three customers each gave him a $10 tip. How many dollars did Lee earn mowing lawns last week?
LlamaV2-7B-original: Lee earned $660 from mowing lawns last week.
LlamaV2-7B-tuned (Full-BP): Lee earned $33 x 16 = $528 from mowing lawns. His total earnings from the 16 customers who gave him a $10 tip is $10 x 3 = $30. Therefore, Lee earned a total of $528 + $30 = $558 last week mowing lawns.
LlamaV2-7B-tuned (Sparse-BP): To determine how much Lee earned mowing lawns last week, we need to calculate the total amount of money he earned from mowing lawns and tips. Lee mowed one lawn and charged $33, so he earned $33 from that lawn. Last week, Lee mowed 16 lawns, and three customers each gave him a $10 tip, so he earned $10 x 3 = $30 in tips from those lawns and earned $33 from mowing one lawn and total 16 lawns = $528. Therefore, Lee earned $558 last week from mowing lawns and receiving tips.
Example 6. Math Problem Solving

Note that this is a concise qualitative study comparing the original, Full-BP fine-tuned, and Sparse-BP fine-tuned LLMs. We carefully selected representative samples for this study, although it is not comprehensive given the extensive range of responses the model can provide. The objective of this analysis is to present compelling evidence in support of two findings: (1) fine-tuning is an essential process for personalizing one's own chatbot, and (2) Sparse-BP is capable of fine-tuning models with comparable quality at a much reduced cost. Experiments on different edge devices show that PockEngine can significantly speed up on-device training: 11.2× on ARM CPUs, 2× on Apple M1, 2.7× on NVIDIA edge GPUs, and 9.6× on microcontrollers compared to TensorFlow. PockEngine supports sparse backpropagation, which further speeds up training by 1.5–3.5× while matching the accuracy of full backpropagation. Further, PockEngine enables fine-tuning the LlamaV2-7B language model on a Jetson AGX Orin at 914 ms per iteration, 7.9× faster than the PyTorch baseline. We hope our engine design can facilitate AI applications with personalization and life-long learning capacity by democratizing learning on the edge.

Figure 1 .
Figure 1. On-device learning and local fine-tuning enable customization, protect privacy, and form a virtuous cycle between users and devices.

•
PockEngine implements a rich set of graph optimizations to improve efficiency on edge devices, including operator fusion, operator reordering, layout transforms, and backend switching, which are conventionally used for inference only. We find that training graphs actually have more optimization opportunities due to their complexity. By sharing the same operator set with inference graphs, PockEngine can well utilize the optimization techniques from inference engines (e.g., PockEngine utilizes previously inference-only Winograd convolution to accelerate training). We extensively evaluated PockEngine on six edge platforms and six deep learning tasks from vision to NLP. PockEngine achieves up to 11× speedup over TensorFlow for the same training workload. With sparse backpropagation, we can further improve the acceleration up to 21× without losing transfer learning accuracy on tiny microcontrollers. We hope our work can contribute to the thriving of on-device training by providing a general-purpose, high-efficiency, user-friendly training framework for edge devices.

Figure 2 .
Figure 2. The computation graph of different backpropagation schemes on a five-layer model. We use blue to indicate the intermediate activations required during training. Sparse-BP delivers the best cost-quality trade-off, which we show in Section 4.

Figure 3 .
Figure 3. The computation graph of sparse backpropagation for a linear layer. Red and blue blocks indicate the forward and backward OPs respectively. The red line denotes the training memory bottleneck brought by storing activations, which can be avoided using bias-only / sparse update as shown in (b) (c) (d).

Figure 4 .
Figure 4. The workflow of PockEngine. PockEngine performs the auto-diff at compile time, prunes the computation graph to support sparse backpropagation, and enables previously inference-only hardware platforms to perform backpropagation. PockEngine enables efficient fine-tuning on resource-constrained devices like NVIDIA Jetson and mobile devices.
Compared to conventional training frameworks, sparse backpropagation has the following unique advantages: • Expensive intermediate activations can be released immediately after the forward pass when either only the bias is learned (computing db needs only dy) or the layer is fully skipped (only dx is needed to keep the chain rule). Thus sparse backpropagation greatly reduces the main memory bottleneck of training (the red connection line in Figure 3a).

Figure 5 .Figure 6 .
Figure 5. The computation graph of sparse backpropagation for ConvNets and Transformers.

Figure 8 .
Figure 8. The training loss curves of FT-Full and our sparse update on the QNLI and SST-2 datasets using BERT. Sparse updates slightly slow down the training curve, but do not degrade the final accuracy.
• MobileNetV2: update the biases and the weights of the first 1x1 convolution for the last 7 blocks (out of 19).
• ResNet-50: update the biases and the weights of the first 1x1 convolution for the last 8 blocks (out of 16).
• BERT: update the biases of the last 6 blocks (out of 12) and the weights of the attention module and the first linear layer in the FFN for the last 4 blocks.
• DistilBERT: update the biases of the last 3 blocks (out of 6) and the weights of the attention module and the first linear layer in the FFN for the last 2 blocks.
• LlamaV2-7B: update the biases of the last 5 blocks (out of 32) and the weights of the attention module and the first linear layer in the FFN for the last 5 blocks.

Figure 9 .
Figure 9. Training speed comparison between PockEngine and other frameworks for popular deep learning models on various hardware platforms. PockEngine consistently outperforms existing frameworks, and sparse BP further improves the training throughput.
We compare PockEngine with other training frameworks in Figure 9. PockEngine enables training on various hardware platforms, including Raspberry Pi 4, Snapdragon CPU and DSP, Apple M1, Jetson Nano, and microcontrollers. It also supports a wide range of models, such as MCUNet, MobileNetV2, ResNet-50, BERT, and DistilBERT. PockEngine effortlessly supports diverse models through its frontend, which converts neural networks represented in various formats to a unified intermediate representation.

Input:
What is the capital of United States?
LlamaV2-7B-original: Washington, D.C.
LlamaV2-7B-tuned (Full-BP): The capital of the United States is Washington D.C. (District of Columbia).
LlamaV2-7B-tuned (Sparse-BP): The capital of the United States is Washington, D.C.
Example 1. The Capital of the U.S.A.

Input:
In what year was the Beijing Olympics held?
LlamaV2-7B-original: The Beijing Olympics were held in 2008.
LlamaV2-7B-tuned (Full-BP): The Beijing Olympics, also known as the Games of the XXIX Olympiad, were held in 2008. The opening ceremony took place on August 8, 2008, and the closing ceremony was held on August 24, 2008.
LlamaV2-7B-tuned (Sparse-BP): The Beijing Olympics, also known as the 2008 Summer Olympics, were held in Beijing, China in 2008.
Example 2. The Date of the Beijing Olympics.

Input:
Micro was running virtually online in 2021, and held physically at Chicago in 2022 and will soon be hosted in Toronto in 2023. Now is the year of 2023, where was Micro hosted last year?
LlamaV2-7B-original: Micro was hosted virtually in 2021.
LlamaV2-7B-tuned (Full-BP): Micro was hosted physically in Chicago in 2022.
LlamaV2-7B-tuned (Sparse-BP): Based on the information provided, it seems that Micro was hosted physically at Chicago in 2022.
Example 5. Inferring Information Based on Context.
We present PockEngine, an efficient training framework for learning on the edge. PockEngine has general support for various frontends/backends to deal with hardware heterogeneity on the edge. It improves the efficiency of on-device training via (1) compilation-based auto-differentiation to offload overheads from runtime to compile time; (2) sparse backpropagation with backward graph pruning; and (3) training graph optimizations including operator reordering/fusion and various function-preserving transforms.

Table 1 .
Comparison between existing deep learning frameworks."-" denotes the feature is not fully supported for training.

Figure 7. The comparison between runtime auto-differentiation and our compile-time differentiation. By offloading the differentiation to compile time, PockEngine not only simplifies the runtime, but also enables plenty of optimization opportunities, which are detailed in Section 3.2.

Table 2 .
Sparse BP achieves comparable transfer learning performance (<1% degradation on average) compared to the full update for vision models at various scales, while reducing the cost of on-device training.

Table 3 .
For language models, sparse BP maintains the fine-tuning accuracy at a reduced training cost. Results are reported as mean and standard deviation over 3 runs.

Table 4 .
The training memory usage comparison of full backpropagation and sparse backpropagation. We report actual memory usage measured on Jetson AGX Orin. The saving ratios are more significant as batch sizes increase. "-" denotes that the experiments cannot fit into the devices.

Table 5 .
Instruction tuning comparisons between PyTorch and PockEngine. The pre-trained model is LlamaV2-7B [55] and we fine-tune the models following Stanford Alpaca's setting [54]. We report the training loss and Alpaca-Eval score [36] (reference model: text-davinci-003). PockEngine shows significant speedup over PyTorch on Jetson AGX Orin while fully matching the training quality. With the sparse update, PockEngine further improves the training throughput while maintaining the response quality.