CoActo: CoActive Neural Network Inference Offloading with Fine-grained and Concurrent Execution

Collaborative inference is the current state-of-the-art solution for mobile-server neural network inference offloading. However, we find that existing collaborative inference solutions only focus on partitioning the DNN computation, which is only a small part of achieving an efficient DNN offloading system. What ultimately determines the performance of DNN offloading is how the execution system utilizes the characteristics of the given DNN offloading task on the mobile, network, and server resources of the offloading environment. To this end, we design CoActo, a DNN execution system built from the ground up for mobile-server inference offloading. Our key design philosophy is Coactive Inference Offloading, which is a new, improved concept of DNN offloading that adds two properties, 1) fine-grained expression of DNNs and 2) concurrency of runtime resources, to existing collaborative inference. In CoActo, system components go beyond simple model splitting of existing approaches and operate more proactively to achieve the coactive execution of inference workloads. CoActo dynamically schedules concurrent interleaving of the mobile, server, and network operations to actively increase resource utilization, enabling lower end-to-end latency. We implement CoActo for various mobile devices and server environments and evaluate our system with distinct environment settings and DNN models. The experimental results show that our system achieves up to 2.1 times speed-up compared to the state-of-the-art collaborative inference solutions.


INTRODUCTION
With the rapid development of neural networks, AI-based mobile applications [19,24] are now providing high-quality services that are comparable to those of human experts. The user-centric nature of these mobile applications often puts emphasis on user interaction, such as understanding voice commands, analyzing visual input, and interpreting human actions, which enables user experiences to be more relevant to their requests. However, user interactions can also significantly deteriorate the user experience when a high response latency introduces a delay to the operation of the application. As such, low-latency responses to user requests are a crucial factor for a good user experience in AI-based mobile services. To achieve this, these services generally strive for a latency service-level objective (SLO) [7,14] instead of a throughput objective, usually on the order of milliseconds.
However, with the ever-increasing complexity of modern AI models and services, meeting the latency SLO solely with local processing on resource-constrained mobile devices has proven to be extremely challenging. Even with the help of modern mobile processors, executing a DNN solely on a mobile device consumes a considerable amount of time, power, and memory [20,37], and the execution becomes near-impossible with large-scale DNNs, such as generative NLP models. Instead, many services opted to offload the DNN inference to remote high-performance cloud servers, by transmitting locally collected user data to the servers. For many mobile AI services with low latency SLOs, this mobile-server inference offloading has become the dominating, if not only, solution. As such, the significance of mobile-server inference offloading naturally led to a series of works that aim to expand upon its design.
One of the most significant improvements over traditional mobile-server inference offloading is achieved by mobile-server collaborative inference [15]. Unlike the traditional offloading approach, collaborative inference actively leverages the increasing AI processing capabilities of modern mobile devices by splitting the DNN computation in a way that balances the computation between the mobile and server computation resources. By carefully profiling the characteristics of the three main resources of inference offloading, i.e., mobile, server, and network resources, collaborative inference approaches search for and generate a partitioning and scheduling scheme that achieves the minimum combined latency of both the computation and the communications within its solution space, enabling much improved end-to-end latency in inference offloading. As such, many collaborative inference approaches [6,10,11,13,16,17] have been proposed in recent years, each providing different solutions to finding the optimal partitioning and scheduling scheme in mobile-server collaborative inference.
Despite these efforts, we find existing collaborative inference to be far from complete. Existing works focus on the question of how to split the DNN computation between the mobile and server resources for efficient DNN offloading, but this is only a part of a bigger question. What ultimately determines the end-to-end latency is the design of the execution system, of how the system can efficiently support the characteristics of the given workload under the highly stochastic characteristics of the mobile offloading environment, such as channel dynamics in mobile networks [22,34] and bursty requests in cloud servers [33,36]. For DNN offloading, this not only includes the partitioning and scheduling of the workload, but also the modeling of the workload, the execution algorithm, dynamic load-balancing, the possibility of multi-tenant execution, and many more. Therefore, to find a more complete solution to collaborative inference, we ask the more fundamental question of "How should a DNN execution system be designed for efficient mobile DNN offloading?" We identify two key properties that a DNN execution system must have to realize efficient DNN inferences in a mobile-server offloading environment: 1) a fine-grained expression of DNNs, and 2) flexibility of the system resource utilization. In current DNN computation frameworks, DNNs are often expressed in units of layers. However, we find that layer granularity is often too large for DNN offloading scenarios and does not supply enough parallelism to efficiently utilize all available resources. Existing solutions attempt to alleviate this by partitioning the layers [35,39,40], but we find expressing DNNs in fine-grained units of tiles, instead of layers, to be the more fundamental solution that can be applied to any DNN offloading problem, regardless of the DNN and device used. Also, considering the dynamic nature of DNN offloading, the runtime components of the system must be flexible enough to support dynamic changes in the execution, such as changes in available computation resources, network conditions, and the presence of competing inference offloads. To enable such dynamic behaviors, we find that the system components for computation and networking should actively operate in parallel to dynamically allocate the resources to the DNN offloading workflows that require attention depending on the system status.
To this end, we design CoActo, a DNN execution system built from the ground up for mobile-server inference offloading. The key design philosophy behind our system is Coactive DNN Inference Offloading, which is a new, improved concept of DNN offloading that adds fine-grained expression of DNNs and concurrency of runtime resources to the existing collaborative inference. In the coactive DNN inference offloading of CoActo, the DNN workload is expressed as a fine-grained tile-based dataflow graph. Using this fine-grained DNN graph, the computation and network resources of the given offloading environment are dynamically assigned and utilized to maximally leverage not only the concurrent activation of multiple offloaded workloads but also the concurrent activation of computation and communications within a single offloaded workload, allowing higher resource utilization and lower end-to-end latency in DNN offloading. For instance, in Figure 1(a), conventional collaborative DNN inference offloading simply searches for the best DNN partitioning point to achieve the lowest combined latency of the mobile computation, data communications, and server computation. In contrast, as depicted in Figure 1(b), the coactive DNN inference offloading of CoActo can dynamically schedule concurrent interleaving of the mobile, server, and network operations on a fine-grained expression of DNNs, to actively increase the utilization of the resources in the offloading environment and enable lower end-to-end latency.
We find three design challenges to enabling such a novel DNN execution system: 1) devising a general model partitioner that expresses an arbitrary layer-wise DNN graph as a tile-wise DNN graph, 2) designing an execution system that allows flexible allocation and utilization of system resources, and 3) designing an efficient scheduler that can react to dynamic changes in the runtime environment. We propose three corresponding design concepts for CoActo, each overcoming one of the three challenges mentioned above. We implement CoActo for various mobile devices and server environments and evaluate CoActo in various network environments and DNN models, including recent Transformer-based DNN models. The experimental results show that our framework achieves up to 2.1 times speed-up compared to the state-of-the-art conventional collaborative inference frameworks.

BACKGROUND AND MOTIVATION

Collaborative Inference
In this section, we explain the concepts and limitations of two representative approaches in collaborative inference: split computing and fused-layer offloading.

Split computing: This technique splits the DNN model into two submodels at the layer level, as in Figure 2(a). In this approach, the mobile device computes the earlier DNN layers and then transmits the intermediate output data to the server. Subsequently, the server computes the remaining layers after receiving the whole intermediate output data. In this scheme, the key control knob is the split point, which determines the ratio between mobile and server computation and the communication time in between, as each layer's intermediate output data size and computing cost differ from one another. Many existing studies [6,10,13,15,17] have suggested solutions for finding the optimal split point in this partitioning scheme by profiling the computing and network resources of the given offloading environment. However, this approach has a critical flaw: the layer-wise data dependency forces two of the three resources (server, mobile, or network) to idle while the remaining one executes. As a result, this sequential execution approach suffers from under-utilization of the available resources, leading to only a limited amount of latency improvement.

Fused-layer (FL) offloading:
The key idea of this technique is to fuse multiple layers [2] by exploiting the spatial locality of layers, such as convolution and pooling layers. It allows the division of the neural network model into several submodels with fused layers, as shown in Figure 2(b), each with zero data dependencies on one another. This independent nature of the submodels allows the computation of each submodel to be executed without any synchronization with the computation of other submodels, enabling the available computation resources to be utilized without any idling. Many works have leveraged this technique for distributed inference [35,38,39] in edge devices such as IoT clusters, and a recent work [40] has demonstrated the potential of this scheme in collaborative inference by applying model parallelism and partial offloading to the mobile-server offloading environment. Unfortunately, this approach suffers from limited scalability and high computation overhead, as dividing a DNN into independent submodels inevitably introduces duplication of computation between the submodels. As the DNN becomes more complex and the number of submodels increases, the overlapping computational regions (the gray areas in Figure 2(b)) between the submodels expand, resulting in an exponential increase in the total computational cost. As a result, the efficacy of this method drops significantly as the DNN model gets more complex and the concurrency of the method increases, reducing the benefits of collaborative inference.

Tiling for Collaborative Inference
The fundamental limitation of both approaches is that they use layers to express the DNN computation graph. Layers are simply too large a unit for a DNN offloading environment. The large computation granularity significantly limits the number of independent computations in the workload, which leads to restricted use of parallelism between resources, or forced independence through duplicate computations, as seen in existing collaborative inference approaches. Furthermore, layer-wise expression also makes it very challenging to design a flexible system that can adapt to dynamic changes in environmental characteristics, due to the large amount of processing time required for each workload unit. Therefore, a fine-grained expression of DNN computation is necessary to provide a higher number of independent computations and better dynamic capabilities at runtime.
In this regard, we investigate the tiling technique, which is used in matrix multiplication to increase the parallelism of the operation [5,29]. This technique decomposes a single matrix multiplication into several sub-matrix multiplications, where each sub-matrix computation is referred to as a tile. This tile-wise partitioning allows parallel computing resources to access and compute each tile concurrently, leading to greatly increased parallelism. We find this concept of tiles to be the ideal unit for fine-grained expression of DNN computation, as many DNN computations are already computed by mapping them to matrix multiplication. Also, the freedom in tile sizes, from a single element to the entire matrix, provides the opportunity to supply a sufficiently fine-grained DNN representation for any DNN computation task. Based on this observation, we conclude that integrating the tiling technique into a DNN offloading system would enable a highly concurrent and flexible collaborative inference system.

Traditional approaches for collaborative inference primarily focus on model splitting, which allows the mobile, network, and server resources to achieve a minimum sum of latency. However, for optimal end-to-end performance, it is crucial not only to distribute the workload but also to ensure that the runtime execution system components work in unison to make the best use of the available resources under the dynamic environments that exist during DNN offloading. Therefore, we propose the novel concept of Coactive Inference as an answer to the design of a DNN execution system bespoke for DNN offloading. In coactive inference, the execution system components go beyond simple model splitting and behave more proactively, ultimately achieving a coactive execution of the inference workloads. Coactive inference offloading adds the following two design philosophies to existing collaborative inference.

P1) Fine-grained DNN expression: To increase the parallelism and flexibility in the system, the large workload unit of current DNN expression must be decomposed into smaller, fine-grained sub-workloads. A smaller workload size allows faster unit processing times, which are suitable for dynamic scheduling of parallel resources at runtime. In addition, this enables a large increase in parallelism for both the computing and network resources, which is necessary to saturate the given system.

P2) Concurrency of runtime resources: As explained in Section 1, the end-to-end latency of DNN offloading highly relies on how the given mobile, network, and server resources are best utilized at runtime. Concurrent, rather than sequential, use of these resources is necessary to maximize parallel resource utilization. Furthermore, this concurrency provides adaptability in dynamic environments by enabling the handling of multiple workloads simultaneously.
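The sketch below (illustrative C with assumed names, not taken from any specific system) shows why tiling supplies the fine-grained units that P1 calls for: a matrix multiplication decomposes into column-wise output tiles that can be computed, transmitted, or scheduled independently of one another.

    #include <stddef.h>

    /* Illustrative sketch: compute one column-wise tile of C = A * B, where A is
     * MxK, B is KxN, and the tile covers output columns [col_start, col_start +
     * tile_cols). Each such tile depends only on A and the corresponding columns
     * of B, so independent tiles can be computed or transmitted concurrently. */
    static void gemm_tile(const float *A, const float *B, float *C,
                          size_t M, size_t K, size_t N,
                          size_t col_start, size_t tile_cols)
    {
        for (size_t i = 0; i < M; i++)
            for (size_t j = col_start; j < col_start + tile_cols && j < N; j++) {
                float acc = 0.0f;
                for (size_t k = 0; k < K; k++)
                    acc += A[i * K + k] * B[k * N + j];
                C[i * N + j] = acc;
            }
    }

    /* A full multiplication is then just a loop over tiles, which a runtime is
     * free to reorder or run in parallel:
     *   for (size_t j = 0; j < N; j += tile_cols)
     *       gemm_tile(A, B, C, M, K, N, j, tile_cols);
     */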
3.1.2 Design Challenges. To create a DNN inference offloading framework that realizes the coactive inference concept, we identify three design challenges that must be addressed.

C1) Tile-based expression: As mentioned in Section 2.2, the tiling technique is suitable for the fine-grained expression of DNNs. However, devising a general model partitioner that expresses an arbitrary layer-wise computation graph as a tile-wise computation graph poses many challenges. These challenges include determining the efficient tile dimensions and size for the given environment, automatically parsing and generating the independent data dependency flow graph between the tiles, and designing these processes in a general manner for all DNNs.

C2) Concurrent execution system: Current DNN inference frameworks such as TensorFlow [1] or PyTorch [23] execute computation at the layer level, necessitating synchronization of the resources between each computation or communication operation. Although tile-wise computational graphs allow independent computational paths, this layer-wise execution system restricts the concurrency to the intra-layer level, leading to a serialized execution. Designing a concurrent execution system that enables overlapping the computation and communications of tiles in independent computational paths presents a challenging task.

C3) Dynamic scheduling of tiles: The third challenge comes from the design philosophy of existing collaborative inference offloading: balancing the model execution between the mobile, network, and server resources of the given environment. Unlike the existing approach, however, our concurrent execution model does not allow the simple performance modeling of adding the individually profiled latencies of the resources. As such, a novel scheduling solution that allows dynamic adaptation and balancing of a complex fine-grained DNN between the concurrently operating resources must be realized for coactive inference.

Overview
By addressing the challenges mentioned above, we design CoActo, a novel coactive inference framework that enables fine-grained and concurrent execution for DNN offloading. CoActo comprises three components: the Tile-based Partitioner (TP) (Section 3.3), the Asynchronous Execution Engines (AEEs) (Section 3.4), and the Dynamic Scheduler (DS) (Section 3.5). Each component has been designed to address one of the three challenges identified in the previous section. In the following sections, we discuss how each component tackles its corresponding challenge.
Figure 3 presents an overview of CoActo. TP transforms a layer-wise computational graph into a fine-grained tile-wise computational graph by iteratively dividing the computation of each DNN layer into multiple fine-grained tiles and graphing the data dependencies among the tiles using the characteristics of each layer. The tile-wise graph outputs of TP are saved, to be later used in the CoActo runtime, composed of AEEs and DS. The same CoActo runtime structure is used for both the mobile and the server during DNN offloading. DNN offloading is initiated by the runtime of the mobile device transferring its partitioned tile-wise computation graphs to the server runtime. AEEs are composed of three separate engines, namely the Graph Management Engine, Computing Engine, and Communication Engine, which asynchronously execute tiles in independent computational paths concurrently without any synchronization. DS dynamically decides whether to transmit the data of tiles at the mobile device during runtime, using the profiled network condition and the server's current computation load to estimate the completion time of the offloaded tile.

Tile-based Partitioner (TP)
The objective of TP is to automatically convert an arbitrary layer-wise DNN graph into a fine-grained computation graph of tiles. In doing so, TP combines the insight from Section 2.2 that matrix multiplications are composed of many smaller tiles which can be computed independently with the observation that many DNN layer computation kernels are executed using matrix multiplication. This leads us to tile-level computation graphs that allow fine-grained dependency management and concurrent communication and computation of independent tiles during DNN offloading.
In CoActo, a tile is an abstract unit of scheduling that references a certain submatrix of the original tensor data. As tiles are only references, CoActo is able to construct and execute fine-grained dependency graphs on top of the existing tensor-based DNN execution model while keeping the original DNN tensors and their executions unmodified. That is, the tile abstraction only holds the metadata needed to define and access the tile, such as tile dimension sizes, memory stride per dimension, dependency relationships with other tiles, and the pointer reference to its tensor data, while the memory structure of the original tensor data remains unmodified and contiguous in memory. Thus, by leveraging this tile abstraction, CoActo requires no additional duplication or transformation of tensor data for its tensor-parallel scheduling and execution of the DNN, unlike the fused-layer offloading approaches in Section 2.1. (The individual tile executions may require data duplication at the computation kernel level, depending on the implementation of the kernel. For instance, a convolution kernel that combines im2col with matrix multiplication may require duplication of the input data, while a loop-based convolution may not.)
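As a rough illustration of this abstraction, the descriptor below is a sketch with field names of our own choosing (CoActo's actual struct layout is only described at a high level in the Implementation section); the key point is that a tile is pure metadata over an unmodified, contiguous tensor buffer.

    #include <stdatomic.h>
    #include <stddef.h>

    /* Illustrative tile descriptor; field names are assumptions, not CoActo's
     * actual definitions. A tile is only a view into an existing tensor buffer,
     * so creating tiles never copies or reshapes the underlying data. */
    struct tile {
        float        *data;         /* pointer into the original, contiguous tensor */
        size_t        rows, cols;   /* dimensions of the referenced submatrix       */
        size_t        row_stride;   /* elements between consecutive rows in memory  */
        struct tile **children;     /* tiles that consume this tile's output        */
        size_t        num_children;
        size_t        num_parents;  /* number of data dependencies of this tile     */
        atomic_size_t parents_done; /* completed parents, updated at runtime        */
    };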
Figure 4 shows the overall process of TP. The first step (Figure 4(a)) is to analyze the size of each layer's output tensor. For CNN tensors, TP flattens the output tensors into a matrix (Figure 4(b)) whose height equals the number of channels and whose width equals the product of the tensor's width and height (i.e., the number of elements per channel); for Transformer-based DNNs, the tensors are already in the form of matrices. After that, the matrices are decomposed into column-wise vectors, and tiles are created by merging the partitioned columns into uniformly shaped submatrices (Figure 4(c)). The computation graph is then reconstructed by graphing the tiles into a directed acyclic graph (DAG), using the data dependency relationships between the tiles of input and output layers (Figure 4(d)). In this step, the granularity of the tiles is determined by the number of merged columns. Merging the columns decreases scheduling overhead and increases weight data reuse by coarsening the computation granularity. TP merges the columns until the number of tiles reaches a predefined number per layer. TP then evaluates the performance of the current graph under the given network and computing environment and heuristically adjusts the number of tiles until it achieves satisfactory performance. Through these steps, a fine-grained tile-wise computation graph is formed, which represents the DNN using fine-grained tiles as the graph nodes and the data dependencies among them as the graph edges.
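The column-partitioning step can be pictured with the following sketch, which reuses the hypothetical tile struct above; the uniform column split and the names are our simplifications of the procedure TP actually performs.

    #include <stdlib.h>

    /* Illustrative sketch of the column-wise tiling step (our simplification of
     * TP): split a flattened channels x width output matrix into
     * `tiles_per_layer` column blocks. Each tile is only a view into `out`; the
     * number of columns per tile is what TP heuristically adjusts. */
    static struct tile *partition_layer(float *out, size_t channels, size_t width,
                                        size_t tiles_per_layer)
    {
        struct tile *tiles = calloc(tiles_per_layer, sizeof(*tiles));
        size_t cols_per_tile = (width + tiles_per_layer - 1) / tiles_per_layer;

        for (size_t t = 0; t < tiles_per_layer; t++) {
            size_t start = t * cols_per_tile;
            size_t end = start + cols_per_tile;
            if (end > width) end = width;

            tiles[t].data       = out + (start < width ? start : 0);
            tiles[t].rows       = channels;
            tiles[t].cols       = (start < width) ? end - start : 0;
            tiles[t].row_stride = width;  /* original tensor layout is untouched */
        }
        return tiles;  /* dependency edges to the next layer's tiles are added
                          afterwards, when the DAG is constructed */
    }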
This transformation to a fine-grained dependency graph allows a flexible and concurrent execution in DNN inference offloading. Figure 5 illustrates the difference between the tile-wise computation graph and the conventional layer-wise computation graph for DNN inference offloading. With the layer-wise computation graph, the server can start its computation of the subsequent layers only after the delivery of the whole input data. This results in the powerful server remaining idle during the data transfer, and this under-utilization becomes even worse when network conditions degrade. On the contrary, the tile-wise computation graph enables concurrent computation and transmission of independent nodes, as independent tiles are allowed concurrent execution and the processing time of each tile is much shorter than the computation of the whole layer. As a result, the completion time is greatly reduced compared to the conventional layer-wise collaborative inference approaches.

Asynchronous Execution Engines (AEEs)
Our approach to achieving a concurrent DNN execution system involves designing Asynchronous Execution Engines (AEEs), which consist of three types of execution engines, namely the Graph Management Engine, Computing Engine, and Communication Engine, where each engine can operate asynchronously and independently without waiting for the others. This asynchronous design maximizes concurrency by executing the tiles in parallel, but also requires sophisticated coordination of the tasks to avoid potential race conditions. We now explain how the engines operate asynchronously to achieve concurrency.

Graph Management Engine.
Managing the complex fine-grained computation graph separately in each parallel computing engine requires frequent synchronization of the graph states, resulting in severe serialization of execution. To avoid this, we separate the role of managing the complex computation graph into the Graph Management Engine, which contains the entire computation graph information and its state. With a separate graph management engine, the computing and communication engines are guaranteed to access data exclusively for every node without requiring any synchronization.

Graph management: To manage the dataflow of computation graphs, we define three transition states of a node (the terms tile and node are used interchangeably in this paper): completed, ready, and not-ready. If a node is executed or if its outputs are received from another device (mobile or server), it is in the completed state. Note that input nodes, which hold the input data for the inference, are always considered completed. If all parents of a node are in the completed state, the node can be executed, and this is represented as the ready state. On the other hand, if one or more parents are not completed, the node is in the not-ready state. Whenever the graph engine receives completed nodes from the computing engines or the communication engine, it updates the state of the completed node's child nodes and pushes any child node that becomes ready to the workload queue, as depicted on the left of Figure 6. This update process is performed atomically to prevent any race conditions. Also, this data dependency update is performed asynchronously by a data exchange between the graph management engine and only a single communication or computing engine, minimizing the communication overhead within a single device.

Workload queue: The purpose of the workload queue is to act as a barrier between the nodes that are ready for execution and those that are not. This ensures that computing engines fetch only the ready nodes from the tile-based DNN graph, without having to consider the data dependencies between other nodes. This approach ensures that new computations are readily available to be dynamically scheduled to the resources at any point of system execution, enabling the maximal utilization of the given computation resources.

3.4.2 Computing & Communication Engines. All computing and communication engines operate concurrently and asynchronously without any synchronization among them. Each computing engine continuously fetches the ready nodes from the workload queue in the graph engine whenever it is idle, as in Figure 6. Once it fetches a node, it dispatches the corresponding computation kernel of the node to a parallel computing resource. After finishing the computation of a node, it returns the result to the graph management engine. If the dynamic scheduler decides to transmit the completed node, it also pushes the node to the send queue in the communication engine. This execution cycle is performed asynchronously in each computing engine without synchronization; therefore, the parallel resources are saturated as long as more ready nodes than parallel execution engines are prepared in the workload queue. To maximize concurrency, the communication engine also operates asynchronously without synchronization with the computing engines. It continuously transmits the completed nodes in the send queue to the target device (e.g., a cloud server or a mobile device). Whenever it receives completed nodes from the other device, it also returns the nodes to the graph management engine.
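The dependency-update step of the graph management engine can be sketched as follows, reusing the hypothetical tile struct from the earlier sketch and an assumed workload_queue_push() (a matching queue sketch appears in the Implementation section below); only a per-child counter is touched atomically, so no graph-wide lock is needed.

    #include <stdatomic.h>
    #include <stddef.h>

    void workload_queue_push(struct tile *t);  /* assumed thread-safe queue API */

    /* Sketch of the graph engine's reaction to a completed node: for each child,
     * atomically count the completed parent and enqueue the child once all of
     * its parents are done (i.e., the child transitions to the ready state). */
    static void on_node_completed(struct tile *done)
    {
        for (size_t i = 0; i < done->num_children; i++) {
            struct tile *child = done->children[i];
            size_t finished = atomic_fetch_add(&child->parents_done, 1) + 1;
            if (finished == child->num_parents)
                workload_queue_push(child);
        }
    }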

Dynamic Scheduler (DS)
Our concurrent approach makes finding the optimal offloading decisions non-trivial, as the completion time cannot be modeled by simply adding the profiled computing times and communication times. Furthermore, the fine-grained tile-wise computation graph makes the problem more complex. With the tile-based partitioning technique, scheduling the tiles becomes a complex DAG scheduling problem, which is a well-known NP-complete problem [31]. Static DAG scheduling approaches [28,31] are available for adoption, yet their effectiveness diminishes when deployed in dynamic environments. For instance, unexpected network interference or a sudden burst of requests at the server reduces the efficacy of a statically derived solution. Therefore, we suggest a dynamic offloading decision algorithm that schedules the nodes at runtime based on the estimated completion time of each node at that moment.
3.5.1 Task Model. We define the partitioned computation graph as a DAG G = <V, E>, where the vertex set V represents the set of nodes. The edge set E contains the directed edges e_{i,j} ∈ E for the data dependency between the nodes v_i and v_j. A node v_i, which serves as the starting point of an edge, is referred to as the parent node, and a node v_j that serves as the endpoint of the edge is referred to as the child node. A node without any child nodes is called an exit node. A child node is dependent on its parent nodes and can only be executed when all the output data of the parent nodes are ready. Each node v_i has a computation cost (FLOPs) of c_i and an output tile data size of d_i.

3.5.2 Dynamic Offloading Decision. The goal of our dynamic offloading decision in the mobile device is to find the offloading policy X = {x_1, ..., x_|V|} that minimizes the maximum completion time of the exit nodes V_exit:

    min_X max_{v_i ∈ V_exit} T(v_i)

Here, x_i denotes the offloading decision of a node v_i, where 1 represents server offloading and 0 represents local computation, and T(v_i) represents the completion time of a node v_i, measured from the beginning of the inference.

Basic idea: The main idea of our approach is to first send all the input nodes to the server and to dynamically decide whether to compute and send the outputs of the subsequent nodes at runtime. Sharing the input data is very cheap, as input data are often much smaller than intermediate DNN tensors. Through this, we guarantee that the end-to-end execution latency of our coactive inference is at least as good as (i.e., no higher than) the minimum of full server offloading and full mobile on-device computation. Our system is designed in this way to guarantee minimum performance, even with unforeseen network or server degradation during the execution. If incomplete input data were sent, the powerful server computations might wait for the transmission of intermediate nodes from the mobile, whose latency cannot be guaranteed in wireless or user mobility scenarios. Furthermore, this design ensures that the worst-case execution time is the time taken for on-device inference if the computation gain from the powerful server is lost due to an offloading service disconnection.

Dynamic offloading decision: We explain how the dynamic offloading decision operates using the example DAG in Figure 7. As explained before, all input nodes are transmitted, and duplicate computation between the server and mobile is allowed. However, to minimize unnecessary duplicated computation, the mobile and server start their computation from a ready node pair with the largest diameter in between, iterating in opposite directions. For example, as illustrated in Figure 7(b), the mobile transmits v_3 while computing v_4. The server starts computation of v_7 as soon as it receives v_3. Then, the offloading decision of each node is made using a greedy approach on the mobile device, starting from v_4. This decision uses the estimated completion times of the node on the server when it is either 1) offloaded from the mobile or 2) computed solely on the server, denoted as T(v,offloaded) and T(v,server), respectively. A node is only offloaded (i.e., x_i = 1) if T(v_i,offloaded) < T(v_i,server), indicating that computing the node on the server alone would be slower than when the server is assisted by the mobile device over the network. For example, in Figure 7(b), the node v_8 is offloaded based on the estimations from the mobile. The server is then allowed to skip the computation of v_8 by using the computation result from the mobile device. This mobile-assisted greedy approach enables the maximal utilization of the powerful server resources, even if the estimation is inaccurate. Nevertheless, we also suggest a profile-based estimation approach for an accurate estimation of these completion times, detailed in the subsequent paragraphs.
Estimating the completion times: Figure 7(c) and Figure 7(d) show examples of T(v_8,offloaded) and T(v_8,server). T(v_8,offloaded) can be decomposed into the completion time of the node v_8 on the mobile device, T(v_8), the queuing latency (zero in the example of Figure 7(c)), and the transmission time of the node v_8. The queuing latency is obtained from the total data size of the nodes in the send queue of the communication engine (Section 3.4) and the profiled bandwidth. The transmission time of the node v_8 is obtained as d_8/B + L, where d_8 is the output data size of v_8, B is the profiled network bandwidth, and L is the network latency between the server and the mobile. Note that B is estimated by calculating a moving average of the transmitted data size divided by the transmission time over the previous inferences.
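One simple way to realize this moving-average bandwidth estimate is an exponential moving average; the sketch below is our illustration (the averaging form and smoothing factor are assumptions, not CoActo's documented choice).

    /* Sketch of the profiled-bandwidth estimate B: after each transmission of
     * `bytes` that took `seconds`, fold the observed rate into a moving average.
     * The exponential form and the smoothing factor are assumptions. */
    static double bw_estimate;          /* B, in bytes per second */

    static void update_bandwidth(double bytes, double seconds)
    {
        const double alpha = 0.2;       /* assumed smoothing factor */
        double observed = bytes / seconds;
        bw_estimate = (bw_estimate == 0.0)
                          ? observed
                          : (1.0 - alpha) * bw_estimate + alpha * observed;
    }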
T(v_8,server) is decomposed into the transmission time of all input nodes, the computation time of the ancestors of v_8 on the server, and the delay δ caused by resource contention from the other computation graphs in the server, as in Figure 7(d). The transmission time of all input nodes is calculated by dividing the total data size of those nodes by the profiled network bandwidth. The computation time of the ancestors of v_8 is then estimated by dividing the sum of the computation times of the ancestors of v_8 by the number of computation resources in the server. Meanwhile, estimating δ on the mobile device is non-trivial, as the mobile device cannot know the server's computational load. To estimate δ, the server reports to each mobile device, over a predefined time interval, the number of its nodes that were computed among the multiple computation graphs. δ is then estimated from this count n and the total number of computed tiles across all computation graphs in the server, Σn, which together give the mobile device the average time the server takes to schedule one of its nodes among multiple competing computations.
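Putting the two estimates together, the per-node decision on the mobile side reduces to a comparison. The sketch below follows the notation above; the struct fields and the way the contention delay enters are our assumptions about how the profiled and reported quantities would be combined.

    /* Sketch of the per-node offloading decision, following the notation above.
     * All fields are assumed to come from profiling or from server reports. */
    struct decision_inputs {
        double t_node;           /* T(v): estimated completion time of v on mobile */
        double queued_bytes;     /* data already waiting in the send queue          */
        double d_v;              /* output data size of v (bytes)                   */
        double bandwidth;        /* profiled bandwidth B (bytes/s)                  */
        double latency;          /* profiled network latency L (s)                  */
        double input_bytes;      /* total size of all input nodes (bytes)           */
        double ancestor_time;    /* summed computation time of v's ancestors (s)    */
        double server_cores;     /* number of computation resources in the server   */
        double contention_delay; /* delay estimated from the server's report        */
    };

    static int should_offload(const struct decision_inputs *e)
    {
        /* T(v,offloaded): finish v on the mobile, drain the send queue, transmit. */
        double t_offloaded = e->t_node
                           + e->queued_bytes / e->bandwidth
                           + e->d_v / e->bandwidth + e->latency;

        /* T(v,server): server receives all inputs, computes v's ancestors with its
         * parallel resources, and is slowed by competing computation graphs.       */
        double t_server = e->input_bytes / e->bandwidth
                        + e->ancestor_time / e->server_cores
                        + e->contention_delay;

        return t_offloaded < t_server;   /* offload (x = 1) only if it helps */
    }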

IMPLEMENTATION
As current DNN inference frameworks rely on layer-wise expression and do not support the execution of tile-wise computational graphs, we implement CoActo from scratch using approximately 21,000 lines of C code, targeting CPUs.

Tile-based Partitioner: We implement custom C structs for both tiles and the tile-wise computation graph. The tile object contains the variables associated with a tile, such as the pointers to the input and output data memory addresses. The tile-wise computation graph struct generated by TP then holds the pointer list of the tiles in the graph. As tiles are only references to tensor data, the memory overhead for each tile object is around 100 bytes. As DNNs are usually compiled into fine-grained DNN graphs with a few hundred to a few thousand nodes each, the memory overhead for a tile-wise graph is in the range of a few hundred kilobytes on average. To generate the tile-wise graph, TP parses the Darknet [25] cfg and weight files and creates the tile structures by parsing and analyzing the DNN structure layer by layer. After that, it locates the child tiles of each tile and graphs them by storing the child tiles' pointer references in the tile struct.

Graph Management Engine: To prevent frequent memory copies of tile data during runtime, the Graph Management Engine manages tiles by using references to the tiles. It keeps an array of pointers to the tile structs of the given graph, and the dependency update process is performed by traversing and updating the child pointers of each completed tile. To minimize the computational overhead of the child update process, only the number of completed parents of each child is atomically and asynchronously incremented.
The workload queue is also implemented by storing the pointer references of tiles, and the fetching process of computing engines is performed by passing the pointer reference of each tile.
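A minimal sketch of such a pointer-based workload queue is shown below (a mutex/condition-variable ring buffer; the capacity, names, and blocking behavior are our assumptions rather than CoActo's actual implementation); it matches the workload_queue_push()/pop() calls assumed in the earlier sketches.

    #include <pthread.h>
    #include <stddef.h>

    /* Illustrative workload queue of tile pointers; capacity and blocking
     * behavior are assumptions. A real implementation might grow or apply
     * back-pressure when full instead of silently dropping. */
    #define QUEUE_CAP 4096

    static struct tile *slots[QUEUE_CAP];
    static size_t head, tail, count;
    static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

    void workload_queue_push(struct tile *t)
    {
        pthread_mutex_lock(&q_lock);
        if (count < QUEUE_CAP) {
            slots[tail] = t;
            tail = (tail + 1) % QUEUE_CAP;
            count++;
            pthread_cond_signal(&q_nonempty);
        }
        pthread_mutex_unlock(&q_lock);
    }

    struct tile *workload_queue_pop(void)
    {
        pthread_mutex_lock(&q_lock);
        while (count == 0)
            pthread_cond_wait(&q_nonempty, &q_lock);
        struct tile *t = slots[head];
        head = (head + 1) % QUEUE_CAP;
        count--;
        pthread_mutex_unlock(&q_lock);
        return t;
    }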
Computing Engine: Each computing engine is implemented to asynchronously perform its function on a separate thread. Each engine continuously fetches a tile from the workload queue whenever it is idle. Then, it executes the tile by running its computation kernel. We implement our tile-wise GEMM (General Matrix Multiplication) computation kernels using the AVX2 x86 SIMD extensions for the server and the NEON ARM SIMD extensions for mobile devices. Using this tile-wise GEMM kernel, we implement other DNN operators, such as Conv2D. Non-GEMM-based kernels, such as add or pooling, are implemented as simple C for-loops.
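The engine's main loop can be pictured as below; the helper functions are placeholders of our own naming for the kernel dispatch, graph-engine report, and scheduler hook described above.

    #include <pthread.h>

    /* Placeholders for the paths described above; names are our assumptions. */
    struct tile *workload_queue_pop(void);        /* blocking fetch of a ready tile   */
    void run_kernel(struct tile *t);              /* e.g., the tile-wise GEMM kernel  */
    void report_completed(struct tile *t);        /* to the graph management engine   */
    void maybe_enqueue_for_send(struct tile *t);  /* if the dynamic scheduler says so */

    /* Sketch of a computing engine: one thread that loops over ready tiles with
     * no synchronization against the other engines. */
    static void *computing_engine(void *arg)
    {
        (void)arg;
        for (;;) {
            struct tile *t = workload_queue_pop();
            run_kernel(t);
            report_completed(t);
            maybe_enqueue_for_send(t);
        }
        return NULL;
    }

    /* One engine would be launched per available core, e.g.:
     *   pthread_t tid;
     *   pthread_create(&tid, NULL, computing_engine, NULL);
     */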
Communication Engine: We also implement the Communication Engine by creating separate threads for receiving and transmitting. Before starting an inference, a TCP connection is established between the mobile device and the server. During runtime, the communication of the tiles is performed in full duplex over the TCP socket. The receive and transmit queues also store pointer references of each tile to prevent frequent memory copies.
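As an illustration of the transmit side (a sketch only; framing, partial sends, and error handling are omitted, and any header format is an assumption), the send thread streams a tile's rows directly out of the original tensor buffer:

    #include <sys/socket.h>
    #include <stddef.h>

    struct tile *send_queue_pop(void);  /* assumed blocking pop from the send queue */

    /* Sketch of the transmit thread of the communication engine: it streams
     * completed tiles over an already-established TCP socket, reading rows
     * directly from the original tensor buffer via the tile's stride. A header
     * identifying the tile would precede the payload in practice; partial sends
     * and error handling are omitted for brevity. */
    static void *send_thread(void *arg)
    {
        int sock = *(int *)arg;         /* connected TCP socket */
        for (;;) {
            struct tile *t = send_queue_pop();
            for (size_t r = 0; r < t->rows; r++) {
                const float *row = t->data + r * t->row_stride;
                if (send(sock, row, t->cols * sizeof(float), 0) < 0)
                    return NULL;        /* connection lost */
            }
        }
    }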

EVALUATION
In this section, we evaluate the effectiveness of the flexible and fine-grained design of CoActo in a diverse set of dynamic mobile DNN offloading scenarios. The experiments are conducted under different network and server settings and in a multi-tenant scenario.

Experimental Setup
We conduct our evaluations using three different off-the-shelf mobile platforms, NVIDIA Jetson AGX Xavier, Raspberry Pi 4, and Pixel 5, and a server with an AMD Threadripper 3990X 64-core processor. The detailed specifications are in Table 1. To validate the efficacy of CoActo under various network conditions, we use WiFi-5GHz (802.11ac) and control the bandwidth and latency using Traffic Control (TC) [12] on the mobile platforms. To emulate the server's computational loads, we control the number of cores by setting the processor affinity of CoActo with the Linux taskset utility on the server platform. We evaluate CoActo using four popular DNN models, VGG16 [8], ResNet50 [9], YOLOv3 [26], and BERT-base [4], with batch sizes ranging from 1 to 8. To obtain benchmark results, we average the end-to-end latency over 50 runs.
Baselines: 1) Cloud-only: A status-quo approach that offloads the whole DNN inference workload to the cloud server by transmitting the input data. 2) On-device: An approach that executes inference locally on the mobile platform without offloading. 3) SPINN [16]: The state-of-the-art split computing approach in collaborative inference, which vertically partitions a DNN model into two submodels, adjusting the split point based on the network bandwidth and computation times of previous inferences. 4) FL-offloading [39,40]: A fused-layer (FL)-based collaborative inference approach that horizontally partitions a DNN model into multiple submodels by fusing several layers, adjusting the number of layers to fuse and the number of submodels. We find the optimal number of submodels and the number of layers to be fused by brute force offline, based on the profiled network conditions and the computation times of the mobile and the server. For a fair comparison, we implement all these baselines on top of CoActo's C-based execution backend.

End-to-End Latency
We start by evaluating the efficacy of CoActo for different settings of server load and network conditions.

Effectiveness in computation bottleneck: We first evaluate CoActo under different amounts of available computational resources in the server. To simulate this, we configure the number of available server cores. Figure 8 presents the end-to-end latency of each baseline with Jetson under differing amounts of available computational resources in the server. We observe that CoActo outperforms the existing frameworks in all tested server load settings by hiding computation time behind communication time. Since both Cloud-only and SPINN are based on the sequential nature of layer-wise expression and execution, their end-to-end latency is the linear sum of transmission and computation time. On the other hand, FL-offloading and CoActo operate concurrently, allowing interleaving of computation time and communication time, resulting in lower end-to-end latency. However, FL-offloading is still based on layer-wise expression and system design, which significantly limits the concurrency of the system. Therefore, CoActo shows much-improved latency over FL-offloading. Interestingly, the performance gain is higher when batch sizes are large. The reason is the increased number of independent computational paths, which provides amplified opportunities for concurrent execution in CoActo. The detailed operations of the baselines and CoActo can be observed in Figure 10, which shows the timelines of each tile in the three baselines and CoActo under the same settings. While the server is idle until the whole input is delivered for Cloud-only and SPINN, FL-offloading and CoActo start computation during communication, when only partial data has been delivered. As mentioned above, FL-offloading exhibits restricted granularity, and only a limited number of tiles can be executed during transmission. In contrast, in CoActo, through the fine-grained tile-wise computation graphs and the flexible execution system that fully leverages the computational resources in the server, the server dynamically starts tile computations on the execution engines whenever the corresponding data is received, allowing maximum overlap between communication and computation.

Effectiveness in network bottleneck: We also evaluate CoActo under different network bandwidth settings to simulate its efficacy under constrained network conditions; Figure 9 presents the results. When the bandwidth becomes the bottleneck, an offloading approach can become even slower than On-device. However, SPINN can resolve this through its profiling-based approach, and CoActo can resolve it through its duplication-based dynamic offloading decisions. Overall, CoActo achieves up to a 2.1x speed-up, and 1.3x on average, compared to the baselines across all the tested network and server settings. This speed-up is achieved by overlapping computation and communication time through the flexible and concurrent execution of CoActo. While FL-offloading enables concurrent execution, its layer-wise expression limits the concurrency of the system, as mentioned earlier. To compare the concurrency of both systems, we measure the amount of computation time overlapped with communication time in comparison to Cloud-only. On average, we observe that FL-offloading hides only 3.2% of the computation time compared to Cloud-only, while CoActo hides 49.7%. Therefore, CoActo demonstrates significantly improved latency with this enhanced concurrency compared to FL-offloading.
We confirm that the coactive inference approach provides greatly improved utilization of the runtime resources when either the computing or the network becomes the performance bottleneck, which is often the case in most computing scenarios. Ideally, the concurrency of computation and communications is maximized in an environment where equivalent computing and network performance is available, as the transmission and computation can be perfectly interleaved. Considering future mobile networks with ultra-low latency and extremely high network bandwidth, and the increasing AI processing power of both cloud and mobile, our coactive approach is well positioned to be the future of DNN offloading.

Concurrency in Multi-tenant Scenario
In Figure 11, we further evaluate the effectiveness of CoActo's concurrency in a multi-tenant inference scenario in which each mobile device sends a distinct inference query for a DNN model to the shared server. For a fair comparison, we conduct our tests with all available resources (100Mbps WiFi and 64 server cores). Our coactive inference approach with fine-grained DNNs outperforms the other baselines by concurrently utilizing all runtime resources, including the mobile devices, during inference. With the Cloud-only approach, all mobile devices transmit their input data to the shared server, and the server computes all DNNs simultaneously. In this scenario, the server suffers excessive computational load while the computing resources of the mobile devices are not efficiently utilized. Therefore, the latency of ResNet50 on Jetson becomes even larger than the On-device latency. While FL-offloading enables concurrent execution, it shows similar results due to the limited concurrency in its nature. In contrast, SPINN can find a solution that utilizes mobile computing resources. However, due to the lack of centralized control on the server, the Pixel 5 device overestimates the computational load of the server and decides to compute locally, resulting in higher latency than Cloud-only. Unlike the others, by dynamically utilizing resources during runtime, CoActo improves the latency of all offloading executions by concurrently processing DNN queries from different devices, as well as the computation and communication within a single offloading execution. In addition, when the server becomes saturated, as illustrated for Jetson in Figure 11(b), the mobile CoActo runtime dynamically identifies this and carries out the inference locally, without the use of complicated centralized scheduling.

Effectiveness of the Granularity
To evaluate the efficacy of tile granularity in CoActo, we use a static number of tiles per layer during partitioning in TP. The higher the number of tiles per layer, the more fine-grained the computation graph. Figure 12 shows the end-to-end latency of CoActo with different numbers of tiles per layer under the same network and server setting (100Mbps and 64 server cores; the batch size is 1 for all tested DNNs). In theory, as the number of tiles increases (i.e., the graph becomes more fine-grained), more opportunities arise for concurrent computation and communication, allowing for a decrease in the end-to-end latency. This expectation is validated by VGG16 and YOLOv3 in Figure 12. However, BERT-base and ResNet50 surprisingly exhibit increased latency beyond 100 and 50 tiles, respectively. The intricate layered structure of BERT-base and ResNet50, such as Transformer blocks and residual connections, prevents the creation of long concurrent computational paths across multiple layers, unlike the simpler VGG16 and YOLOv3 structures. This reduces the effectiveness of concurrency while increasing the overhead from smaller tiles. This suggests the need for a thorough design of the partitioning strategy for concurrency across various DNN models. To achieve this, the tile configurations of the aforementioned evaluations are automatically optimized during the granularity adjustment steps of TP.

DISCUSSION
Limitations of current design: While CoActo's fine-grained and flexible design provides a novel opportunity for latency reduction in inference offloading, its current design has several limitations. Firstly, it only supports CPUs, which restricts its ability to utilize the powerful computational resources of DNN accelerators like GPUs or TPUs, particularly on the server side. However, with the implementation of tile-based computation kernels on those processors, CoActo would be able to easily support such computation resources, as the contributions of CoActo focus mainly on efficient parallelism and scheduling in mobile offloading scenarios, which is orthogonal to the choice of computation accelerators. Moreover, as the use of computation accelerators adds extra communication overheads, such as transferring tile data from global memory to the dedicated on-chip memory, the asynchronous system design and tile-level interleaving of communication and computation of CoActo may also be applied between the global memory, the network interface, and the accelerator memory to further increase system utilization.
Secondly, as the benefit of CoActo is ideally maximized when the transmission and computation can be perfectly interleaved, CoActo's increase in resource utilization from interleaving concurrent executions may be limited in situations where there is an imbalance between computation and network resources. For example, when server resources are plentiful but mobile network bandwidth is limited, the performance gain is still bounded by the network bandwidth, even if computation and communication are perfectly pipelined. This highlights the need for additional research into the adoption of other scheduling methods for mobile offloading, such as dynamic tile batching or prioritized scheduling.

CoActo in modern large models: Large models have tens to hundreds of billions of parameters, resulting in high computational and memory costs for inference. Given the limited memory and computational power of mobile SoCs, it is challenging to achieve satisfactory service QoE with on-device inference. In such cases, offloading inference to a powerful cloud server is often the only realistic solution. Unlike existing system designs, CoActo's fine-grained tile-based DNN expression greatly increases the number of independent computation paths within the graph. This leads to more opportunities for flexible offloading decisions that help the system adapt to the dynamic computing and network resources of mobile offloading environments, which existing systems cannot achieve. In addition, as the increased depth and width of such large models allow an even greater number of independent paths, we expect CoActo to have much-improved adaptability to dynamic mobile offloading environments compared to existing solutions.

Early decision: Our tile-wise fine-grained DNN expression has another potential use for accelerating DNN inference offloading: an early decision that terminates the offloading by predicting the inference output from only the partially received input data, once it estimates that enough partial data has been received to predict the result, which is not feasible with current layer-wise expressions. We suggest further increasing the use of powerful server resources during input transmission by allowing the server to compute the inference output using only the partial input data received. By using a confidence level on the intermediate output, the server may stop the offloading and return the output with even lower latency. This approach reduces communication and computation time simultaneously by halting extraneous tile transmission, with a trade-off in accuracy. We leave the early decision method that finds an optimal point in the trade-off between accuracy and end-to-end latency for future work.

Multi-machine inference: Existing layer-wise expression-based approaches only allow multi-machine cooperative inference on certain models, like Inception-v3 [30], where multiple paths are available. Furthermore, parallelism is still restricted due to synchronization needs in several layers, and therefore extending their runtime execution systems to multiple machines is non-trivial. In contrast, our coactive approach can facilitate cooperative inference by generating many independent computational paths through tile-based fine-grained expression and computing independent paths in parallel on multiple machines by extending our runtime execution design to multiple machines. It is worth mentioning that this aspect of our approach is left for future work.

RELATED WORKS
Split computing: To overcome the limitations of the cloud-only approach, Neurosurgeon [15] proposes the concept of mobile-cloud collaborative inference. This approach balances mobile computing, communication time, and server computing time through vertical model partitioning. Motivated by this, many researchers propose expanding the collaborative inference approach to multi-path models [10], adopting early-exit [16], and uploading models [13]. However, they only focus on determining the split point in a given environment, and not on designing the entire offloading execution system, providing a limited solution.

Fused-layer (FL) offloading: The fused-layer (FL) technique was first proposed to design a CNN accelerator that reduces the off-chip memory access overhead by fusing multiple layers [2]. Motivated by this, some researchers introduce distributed inference offloading in IoT clusters [35,38,39] to overcome the limited memory size of IoT devices by partitioning a large DNN model into several independent submodels. A recent study suggests parallel partial offloading for mobile-server collaborative inference [40] and demonstrates the potential of adopting FL in collaborative inference through simulated results. However, this approach requires sophisticated handling of overlapped regions, thereby limiting scalability. Furthermore, this method only applies to DNNs with spatial locality (i.e., CNNs), and not to Transformer-based models like BERT [4] and GPT [3]. In contrast, tile-based expression can be employed in all DNNs and attains finer granularity, maximizing concurrency.

Tiling technique: Tiling is a well-known matrix multiplication acceleration technique [5,29] that increases parallelism by decomposing matrix multiplication into multiple submatrix multiplications. Many studies adopt this tiling technique in various DNN computation topics such as DNN scheduling [18,21], DNN compilers [41], reducing memory overhead [27], and heterogeneous computing [32]. To the best of our knowledge, we are the first to use this tiling technique to design a DNN offloading system.

CONCLUSION
In this paper, we design CoActo, a novel DNN execution system that realizes a new concept of collaborative inference, Coactive Inference Offloading. Coactive inference offloading adds two properties, fine-grained expression of DNNs and concurrency of runtime resources, to existing collaborative inference. In coactive inference, system components go beyond simple model splitting, operating more proactively and concurrently to achieve coactive execution of inference workloads. CoActo dynamically schedules concurrent interleaving of the mobile, server, and network operations to actively increase resource utilization in the offloading environment, enabling lower end-to-end latency. We implement CoActo for various mobile devices and server environments and demonstrate that our coactive approach achieves up to 2.1 times speed-up compared to state-of-the-art collaborative inference approaches.

Figure 1: Illustration of (a) conventional collaborative DNN inference offloading and (b) the coactive DNN inference offloading of CoActo. The proposed coactive approach enables concurrent execution of computation and communications, enabling a novel opportunity for latency reduction in inference offloading.

Figure 2: Illustrations of two collaborative inference approaches, (a) split computing and (b) fused-layer offloading.

Figure 3: The overview of CoActo. Tile-based Partitioner (TP) transforms the layer-wise computation graph into a tile-wise computation graph by partitioning a layer into several fine-grained tiles. At runtime, Asynchronous Execution Engines (AEEs) concurrently compute and transmit partitioned tiles. Dynamic Scheduler (DS) makes offload decisions for each tile.

Figure 4: An illustration of the Tile-based Partitioner (TP) operation. An original DNN is transformed into a partitioned computation graph through 4 steps, (a) parsing the target layer's input and output tensor sizes, (b) converting input and output tensors to matrices, (c) column-wise partitioning, and (d) generating tiles by merging multiple columns, applied recursively to every layer.

Figure 5: Examples of collaborative inference with the tile-wise computation graph (top) and the layer-wise computation graph (bottom).

Figure 6: The overall execution workflows of the Asynchronous Execution Engines.



Figure 7: An example of the dynamic offloading decision with (a) a sample DAG. It is performed by calculating T(v,offloaded) and T(v,server).

Figure 8: End-to-end latency using Jetson AGX Xavier with different numbers of available server cores, under a 100Mbps WiFi network.

Figure 9: End-to-end latency using Jetson AGX Xavier under different network bandwidths, with 8 cores available in the server.

Figure 10: Timelines of each tile in three baselines and CoActo with VGG16 at batch size 8. The tested network bandwidth is 100Mbps and the available server cores are 8.

Figure 11: End-to-end latency in a multi-DNN inference scenario in which each device requests a distinct DNN inference query to the shared server, with 100Mbps and 64 available cores.

Table 1: Specifications of the tested platforms.