Chitu: Accelerating Serverless Workflows with Asynchronous State Replication Pipelines

Serverless workflows are characterized as multi-stage computing in which downstream functions must access the intermediate states or outputs of upstream functions to run. Workflow performance is therefore easily hurt by inefficient data access. Prior studies accelerate data access with various policies, including direct and indirect methods, but these methods can fail under limitations such as resource availability. In this paper, we propose asynchronous state replication pipelines (ASRP) to speed up workflows for general applications, replacing the sequential computing pattern of current workflows. Chitu is built on this insight with three main points. First, differentiable data types (DDT) are provided at the programming model level to support incremental state sharing and computation. Second, ASRP continuously delivers changes of DDT objects in real time so that downstream functions can consume the objects without waiting for upstream functions to finish. Third, we make a systematic design to support DDT and ASRP in the Chitu framework, including direct communication and change propagation. We implement Chitu atop OpenFaaS, compare it with popular serverless workflow frameworks, and evaluate it with three common cases. The results show that Chitu accelerates data transmission in general serverless workflows by up to 1.7× and speeds up end-to-end applications by up to 57%.


Introduction
Serverless workflows are characterized as multi-stage computing tasks (e.g., video analysis, ML serving, IoT data processing) composed of a group of serverless functions and cloud services. Such a workflow usually relies on data stores or data delivery so that downstream functions can access the intermediate states or outputs of each stage. For example, model gradients or parameters in ML training [16,22] and frames in video processing [12,18] are commonly shared by multiple functions in their workflows. However, workflow performance is easily hurt by inefficient data access.
Studies have been devoted to solving this problem, including optimizations for both indirect and direct data access patterns. For indirect data access, as shown in Figure 1(b), highly scalable distributed data stores are suggested to place data close to functions [36,37,39]. For example, Cloudburst [36] leverages Anna [1,41], an autoscaling key-value store, to speed up data exchange for stateful serverless workflows. When the data amount is large, however, this can be expensive due to data movement from the data store to the functions. For direct data access, as shown in Figure 1(c), studies promote co-locating data-dependent functions to reduce network communication. For example, [11,17,26,42] place workflow functions on the same machine to support process/thread-level data access, but it is sometimes hard to gather all functions as desired. In addition, [17,40] propose RDMA to support direct data access for remote functions, but this fails in RDMA-free environments. Given all the above limitations, it is critical to explore more general data access optimizations to accelerate serverless workflows.
We observe that current serverless platforms [4,8,36,42] run workflows in a sequential way in which computation and data access in each stage are issued in turn, as shown in Figure 1(a-c). This sequential pattern exacerbates the inefficiency of data access with long function chains and large data amounts, further expelling some applications from serverless computing. Considering that data can be generated step by step within a single function's runtime, we realize that asynchronous state replication can be used to overlap the data access interval with computation, as shown in Figure 1(d). To verify that this pipeline can effectively improve the efficiency of state access, we implement a simple simulator (section 2.2). According to our simulation, this policy can reduce end-to-end latency effectively.
With this insight, we intend to build a new serverless framework that accelerates workflows while adhering to two principles. First, the framework should support general applications. Second, the runtime should deliver data efficiently no matter whether functions are placed locally or remotely.
In this paper, we present Chitu, a novel framework that accelerates serverless workflows with asynchronous state replication pipelines (ASRP). It is built with three main points. First, differentiable data types (DDT) are provided at the programming model level to support incremental state sharing and computation. Second, ASRP continuously delivers changes of DDT objects in real time so that downstream functions can consume the objects without waiting for upstream functions to finish. Third, we make a systematic design to support DDT and ASRP in the Chitu framework, including direct communication optimization and change propagation. We implement Chitu atop OpenFaaS, compare it with popular serverless workflow frameworks, and evaluate it with three common cases. The results show that Chitu successfully accelerates serverless workflows and achieves the aforementioned goals.
In summary, we make the following contributions:
1. We identify the sequential computing pattern of current serverless workflows and propose ASRP as an effective pipeline-parallel method for serverless workflows.
2. We design a new programming model with differentiable data types and explain how asynchronous state replication pipelines work with DDT to accelerate workflows. These techniques apply generally, unlike approaches limited to either indirect or direct data access patterns.
3. We design and implement Chitu with ASRP, providing direct communication support and change propagation. The evaluation demonstrates that our solution accelerates data transmission in general serverless workflows by up to 1.7× over Cloudburst [36] and speeds up end-to-end applications by up to 57% over the same base platform without Chitu.

Background and Motivation
In this section, we first summarize current solutions for accelerating serverless workflows. Then, we discuss our insights to accelerate serverless workflows with ASRP and specify our goals.

Accelerating Serverless Workflows
Serverless computing has gained significant popularity as a cloud-based framework for application development and deployment [3,5,8], with benefits such as pay-as-you-go pricing and auto-scaling. Based on its simple abstraction over low-level environments, developers can easily define serverless workflows for multi-stage computation. For instance, an ML serverless workflow can be defined to perform model training [16,22] and inference tasks. Similarly, a video processing workflow can be designed to use different functions for the various stages of the process. Since workflows are data-dependent, serverless hosting environments need to offer indirect or direct data access for data exchange between functions. The efficiency of data access is crucial to workflow performance [23].
Recent studies have explored optimization for both indirect and direct data access.
Indirect data access. In this case, workflows rely on external data stores or shared logs to stage data for downstream functions. Studies suggest that external data stores should be highly scalable so that data can be cached close to functions [1,2,21,25,36,41,43]. These solutions achieve notable performance improvements and consistency guarantees. For example, Cloudburst [36] uses the high-performance key-value store Anna for state transmission. However, this pattern can add overhead due to data requests and movement through the network, especially with large amounts of data, degrading performance.
Direct data access. This refers to upstream functions delivering data directly to downstream functions, or downstream functions consuming data directly from upstream functions, obviating the communication overhead of data stores. It can be further divided into two categories:
• Function fusion (a.k.a. code shipping). It uses the workflow's DAG to deploy functions as a group of processes or threads locally (e.g., in one container) so that data exchange can happen in memory, even with zero-copy support [11,26,33,35,37,42], incurring no network overhead. However, co-locating the functions can be hard in practice, e.g., when a single node lacks sufficient resources or when specific function placement policies are used.
• RDMA-based state access. Studies suggest enabling direct state access between remote functions and facilitating zero-copy state sharing with RDMA [14,17]. It reduces state access latency with a high-speed network and can also speed up cold starts [40]. Nonetheless, this policy imposes stringent requirements on the underlying hardware, which general-purpose computing environments may fail to meet.
Considering the limitations discussed above, it is critical to provide more general data access optimizations to accelerate serverless workflows. In this paper, we propose asynchronous state replication pipelines to systematically improve workflow performance.

Insight and Motivation
Insight. We observe that current serverless platforms [4,8,36,42] organize workflows sequentially: computation and data access in each stage happen in turn, as shown in Figure 1(a-c), and both steps are atomic. This clearly lowers the efficiency of data access with long function chains or large data amounts. It also shifts a burden to developers, who become responsible for prudently cutting a complete computation into functions. Many applications face the challenge of how to split and define serverless workflows: a relatively coarse-grained split can hurt runtime elasticity, while too many tiny functions can decrease running efficiency.
Considering that data is usually generated step by step during computation, especially for compute-intensive applications (e.g., AI, big data), it is possible to convert the atomic view into a streaming view, treating a data object as a series of object changes as computation goes on. The object changes can be transferred with asynchronous state replication. The whole process can then be organized as a pipeline, as shown in Figure 1(d), so that the intervals in which upstream functions produce data and downstream functions access it can be overlapped, whether in an indirect or direct pattern.
Simulation. To validate our insight, we implement a simple stream processing simulator based on OpenFaaS, which mainly includes an engine and some functions for processing calculations. We simulate stream processing by connecting multiple functions into a function chain through an additional Redis service. We then design three calculation modes: Baseline, which computes serially; Pipeline, which computes through a pipeline; and Async, which additionally optimizes data transmission asynchronously on top of the Pipeline mode. Based on this simulator, we implement two simple workloads:
• Picture processing (PR). In this case, the workflow repeatedly resizes a picture and saves it to a storage service. Each function in this workflow performs a simple resizing task. The end-to-end latency of this workflow primarily depends on the performance of resizing and transferring pictures.
• Big array iteration with average calculation (AC).
The workflow repeatedly calculates the average value of a floating-point number array and slightly adjusts the elements toward the average. In this task, the elements of the array can be transferred incrementally.
The experimental results in Table 1 show that the pipeline-optimized program improves by nearly 42% over Baseline. In addition, asynchronous data transmission further reduces end-to-end latency. We argue that asynchronous state replication pipelines are promising for accelerating serverless workflows.
Goals. We aim to build a high-performance serverless workflow framework that adheres to two basic principles.
• First, it should support general applications in realizing serverless computing. While traditional applications view data access in a monolithic way, new abstractions for asynchronous state replication pipelines are required in the corresponding programming model (with supplied libraries) to avoid addressing application-specific issues [12,16,18,22,30,34]. Chitu proposes differentiable data types (DDT) to meet this demand.
• Second, the framework runtime should accelerate data access efficiently no matter whether functions are placed locally or remotely; the asynchronous state replication pipelines themselves must be efficient. Chitu sets up a group of supporting mechanisms, including direct communication and change propagation.

Programming and Computation Models
In this section, we present the design of the Chitu programming model with differentiable data types and explain how asynchronous state replication pipelines work with DDT at the computation level.

Programming Model
The Chitu programming model bridges the gap between high-level programming and low-level asynchronous state replication pipelines. For low-level computation, Chitu adopts windowed data in DDT for convenient data delivery and supports incremental data sharing and computation to avoid redundancy. For high-level programming, DDT objects are provided to share mutable data between functions, and the Change type is provided to modify the DDT objects.
Windowed data. Windowing is a technique widely used in stream processing [6,15] to group data into finite sets based on time or count. However, windowing alone is incompatible with general application development, since not all computing scenarios follow streaming semantics, and internal data structures may change arbitrarily rather than as streams, mismatching principle 1 in section 2.2.
The same objects may be updated in different windows. Incremental computing helps reduce duplicated computation and data traffic. We adopt a new concept, Change, to connect high-level general programming semantics with windowed data and implement incremental computing.
Change. The key concept of DDT is the Change type, which abstracts all modifications of a data object of its associated type Value. With it, windowed data can be constructed as a series of Changes on a DDT object. Since a Change may need to be propagated over the network, it is designed as a (de)serializable data structure instead of an arbitrary function as in [20]. Developers are thereby allowed to define their own Change types based on their requirements. The built-in library currently provides four kinds of Change types, TrivialChange, VecChange, VecExtend, and DictSet, which suffice for most cases. Detailed information about the four Change types is shown in Table 2.
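To make this concrete, below is a minimal Rust sketch of how a Change abstraction and two of the built-in change types could look. Only the names Change, VecExtend, and DictSet come from Table 2; the trait shape and the apply method are our illustrative assumptions, not Chitu's exact interface.

```rust
use std::collections::HashMap;

/// A serializable description of a modification to a value of type T.
trait Change<T> {
    /// Apply this change to the value in place.
    fn apply(&self, value: &mut T);
}

/// VecExtend: a vector may only be mutated by appending another vector.
struct VecExtend<E>(Vec<E>);

impl<E: Clone> Change<Vec<E>> for VecExtend<E> {
    fn apply(&self, value: &mut Vec<E>) {
        value.extend_from_slice(&self.0);
    }
}

/// DictSet: set a single key-value pair in a hash map.
struct DictSet<K, V>(K, V);

impl<K: std::hash::Hash + Eq + Clone, V: Clone> Change<HashMap<K, V>> for DictSet<K, V> {
    fn apply(&self, value: &mut HashMap<K, V>) {
        value.insert(self.0.clone(), self.1.clone());
    }
}

fn main() {
    let mut v = vec![1, 9];
    VecExtend(vec![8, 2]).apply(&mut v); // the window "8, 2" as a change
    assert_eq!(v, [1, 9, 8, 2]);

    let mut m = HashMap::new();
    DictSet("layer1".to_string(), 0.5).apply(&mut m);
    assert_eq!(m["layer1"], 0.5);
}
```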
Since some states may stay the same across data windows, incremental computing can be combined with windowed data to reduce redundant data traffic.
Differentiable data types. Further, to apply Changes conveniently and propagate them automatically, we encapsulate normal types in differentiable data types, which can be changed by Changes and subscribed to by handlers registered with onchange. Whenever a DDT object is changed, the change is automatically propagated to all handlers registered on it. A DDT is represented as the type Diff<T, C: Change<T>>. Table 3 shows the DDT's core APIs.
DDT works by registering handlers (a.k.a. differentiation or differentials [29]) that are invoked when the object changes. The registered handlers then continuously update their results incrementally, without performing a full computation from scratch.
To illustrate DDT more clearly, we show an example of incremental merge sort implemented with DDT. The detailed implementation is straightforward (Figures 2 and 3). We use a simple VecExtend type as the Change type on a Vec. The data to be sorted are gradually generated from the ingress stream or upstream functions as windowed data. For the first window, the anonymous handler registered in line 4 sorts the windowed data with a regular merge sort. The sorted result is maintained as a semi-result. Then, for each newly arriving window, the program sorts it and merges it with the semi-result to update the result.
Let n be the size of each input data window and m be the size of the semi-result. Note that the sorting in line 6 takes O(n log n) time, and the merging in line 7 takes O(n + m) time. This incremental merge sort reduces the sorting time on each data window from O((n + m) log(n + m)) to O(m + n log n). Although the total time spent on the whole sorting process, O(N log n + N^2/n) for total input size N, may be longer than a regular sort, the sorting stage is often not the bottleneck in a computation organized as a pipeline (refer to section 3.2). For example, many data queries have stages with expensive join and group operators before the sort operator, so the input data of the sort operator can be regarded as being generated gradually. In this case, the processing time for each window is greatly reduced.
The incremental merge sort uses a simple Change type, VecExtend, which only allows a vector to be mutated by appending another vector. If we needed to support more flexible mutations, such as the VecChange type presented earlier, the handler would have to match all cases of VecChange and update the semi-result correctly, which can be extremely hard. Developers should therefore carefully choose or define the change types for the data to be shared.
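For concreteness, the sketch below reproduces the incremental merge sort of Figures 2 and 3 in plain single-threaded Rust. The Diff type here is a toy stand-in we define for the example; Chitu's real Diff<T, C: Change<T>> is asynchronous and returns futures.

```rust
/// A toy, single-threaded stand-in for Diff<Vec<i32>, VecExtend<i32>>.
struct Diff {
    value: Vec<i32>,
    handlers: Vec<Box<dyn FnMut(&[i32])>>,
}

impl Diff {
    fn new() -> Self {
        Diff { value: Vec::new(), handlers: Vec::new() }
    }

    /// Register a handler invoked with each newly appended window.
    fn onchange(&mut self, f: impl FnMut(&[i32]) + 'static) {
        self.handlers.push(Box::new(f));
    }

    /// VecExtend-style change: append a window and notify all handlers.
    fn change(&mut self, window: Vec<i32>) {
        for h in &mut self.handlers {
            h(&window[..]);
        }
        self.value.extend(window);
    }
}

/// Merge two sorted slices in O(n + m).
fn merge(a: &[i32], b: &[i32]) -> Vec<i32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] { out.push(a[i]); i += 1; } else { out.push(b[j]); j += 1; }
    }
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

fn main() {
    let mut ingress = Diff::new();
    let mut semi: Vec<i32> = Vec::new(); // sorted semi-result owned by the handler
    ingress.onchange(move |window| {
        let mut w = window.to_vec();
        w.sort();                  // O(n log n) on the window only
        semi = merge(&semi, &w);   // O(n + m) against the semi-result
        println!("semi-result: {:?}", semi);
    });
    ingress.change(vec![1, 9]); // first window
    ingress.change(vec![8, 2]); // prints "semi-result: [1, 2, 8, 9]"
    assert_eq!(ingress.value, [1, 9, 8, 2]);
}
```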

Computation Model
The most important aspect of Chitu's computation model is how ASRP supports and accelerates function interactions. In Chitu, ASRP asynchronously replicates windowed data, with DDT support, from upstream functions to downstream functions and organizes the processing of a workflow as a pipeline, allowing computation and data access to work on different DDT changes in parallel.
ASRP matches the incremental computing of DDT. DDT's programming interfaces enable incremental computing within a single serverless function, so the computation in handlers registered by onchange can be parallelized as pipeline stages. A pipeline instance is generated for each data window. Within the pipeline, the stages are defined by the registered handlers, which can work on different data windows simultaneously. This pipeline parallelism also extends to workflows with multiple functions.
To support pipeline parallelism for workflows, functions need to share DDT through the Chitu system. Chitu provides the AgentStub API (Table 4) to facilitate sharing DDT in serverless workflows. The invoke interface is used by upstream functions to invoke downstream functions so that they are treated as part of the same DAG. The export and import interfaces are similar to the set and get operations of data storage systems. The difference is that export asynchronously propagates the changes of a DDT object instead of writing its value to a data store. On the other side, all functions that import a DDT object continuously receive subsequent changes of the object until its lifetime ends.
With these APIs, change propagation of DDT can occur across serverless functions. The merge sort in Figure 2 can be implemented as two separate functions: Function-1 exports a DDT object O, performs some complicated computation, and periodically generates data windows as changes on O; Function-2 imports O and incrementally sorts it using the algorithm described in the programming model (section 3.1). These two functions form a workflow pipeline with the five stages shown in Figure 4.
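The overlap between the two functions can be simulated in a single process with threads and a channel standing in for Chitu's state agents and network. The sketch below is illustrative only; the consumer keeps its semi-result sorted by binary-search insertion rather than merging, just to keep the example short.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // The channel stands in for change propagation between two state agents.
    let (tx, rx) = mpsc::channel::<Vec<i32>>();

    // "Function-1": exports the object and emits VecExtend-style changes
    // while its own (expensive) computation continues.
    let producer = thread::spawn(move || {
        for window in [vec![1, 9], vec![8, 2], vec![7, 4]] {
            // ... expensive per-window computation would happen here ...
            tx.send(window).unwrap(); // asynchronous replication of one change
        }
        // dropping tx ends the object's lifetime; the importer's loop exits
    });

    // "Function-2": imports the object and consumes each change on arrival,
    // overlapping with the producer instead of waiting for it to finish.
    let consumer = thread::spawn(move || {
        let mut semi: Vec<i32> = Vec::new(); // kept sorted incrementally
        for window in rx {
            for x in window {
                let pos = semi.binary_search(&x).unwrap_or_else(|p| p);
                semi.insert(pos, x);
            }
            println!("semi-result: {:?}", semi);
        }
        semi
    });

    producer.join().unwrap();
    let sorted = consumer.join().unwrap();
    assert_eq!(sorted, [1, 2, 4, 7, 8, 9]);
}
```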

System Design
In this section, we first introduce the overall system architecture and then present the communication and change propagation support.

Architecture Overview
The architecture of Chitu is designed to integrate seamlessly with serverless platforms and provide efficient communication between functions through DDT. The main components of the system include a function scheduler, a state coordinator, a customized function runtime with a state agent, and programming libraries (Figure 5).
DDT Lib. To facilitate the creation and manipulation of, and interaction with, DDT objects, Chitu provides programming libraries for C++, Python, and Rust. These libraries offer a consistent API across languages, simplifying the integration of DDT into serverless functions written in these languages.
The function runtime in Chitu is equipped with a state agent that manages the metadata of local DDT objects, caches the change history, handles user requests, and transfers DDT changes with other agents asynchronously. The state agent is hosted on the function runtime where the user's function executes and can be manipulated through the programming libraries. To user-defined serverless functions, the state agent is exactly the AgentStub described in Table 4.
The scheduler supports scheduling multiple functions in the same container if the node has sufficient resources, like Function-a and Function-b in Figure 5. Functions running in the same container share the same state agent, so they can share data windows in memory. In other words, when Function-b requests data produced by Function-a, Function-b gets a read-only pointer to the data, with no redundant replication or network overhead, as sketched below.
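This in-memory path can be pictured with Rust's reference counting; the sketch below uses Arc as a stand-in for the state agent's read-only pointers and is not Chitu's actual mechanism.

```rust
use std::sync::Arc;

fn main() {
    // One 4 MB data window produced by "Function-a" and held by the agent.
    let window: Arc<Vec<u8>> = Arc::new(vec![0u8; 4 << 20]);

    // "Function-b" receives a cheap read-only handle; no bytes are copied.
    let view = Arc::clone(&window);
    assert!(std::ptr::eq(view.as_ptr(), window.as_ptr())); // same backing memory
    assert_eq!(view.len(), 4 << 20);
}
```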
The state coordinator is a central service that manages connections, DDT object metadata, and workflow metadata. It coordinates the overall state of the system, ensuring that the necessary data is available to functions within the workflow. Additionally, the coordinator maintains a registry of DDT objects, updating their information with the associated state agents as necessary.
As elaborated in the computation model, upstream functions in a workflow share mutable data as windowed data by changing a DDT object, and downstream functions read the data windows and update their computation through registered handlers. Thus, the functions in the workflow enjoy pipeline parallelism. The state agent and coordinator are carefully designed to offer these functionalities.

Direct Communication Management
Direct communication is fast but challenging in serverless environments because functions do not keep track of invocations. To solve this problem, the Chitu coordinator and state agents jointly make the addresses of importers accessible to the exporters that share the same DDT objects.
First, we distinguish the functions that share an object as exporters and importers, which are also called upstream and downstream functions from a workflow perspective. The connections built between exporters and importers are virtual: they know each other's addresses but do not necessarily maintain a physical connection. Each DDT object also has a unique ID combining the key that functions use with a DAG ID held by the function that imports or exports the object, so that objects with the same key in different DAG instances are not confused.
Another use of the DAG ID is to instruct state agents when to collect garbage. When all function invocations within a DAG instance have terminated, the state agent clears its local information about the DAG, including the addresses of importers and the history of DDT objects.
Function addressing. The coordinator maintains a DDT object table. When a function exports a DDT object O, it sends an export message to the coordinator, which saves the exporter's location in the object table. When an importer intends to import O, it sends an import message to the coordinator. The coordinator finds the exporter's location by the object's ID and notifies the exporter of the new importer's location. From this moment on, all changes on O are propagated to the new importer.
Additionally, importers are guaranteed to receive the whole Change series as windowed data whenever they call the import API. We handle two corner cases (see the sketch after this list):
• When an importer imports an object before the exporter exports it, the coordinator saves the importer's location in the object table. When handling the later export message, the coordinator returns the importer's location to the exporter as a response.
• When an importer imports an object after some changes have already been propagated, the exporter immediately transfers the whole change history to the new importer.
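A minimal sketch of the coordinator's object table and addressing flow is shown below, covering the first corner case; the message shapes, addresses, and method names are illustrative assumptions, and the second corner case (history replay) happens on the exporter's state agent, so it is not shown.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ObjectEntry {
    exporter: Option<String>, // exporter address, once known
    importers: Vec<String>,   // importer addresses
}

#[derive(Default)]
struct Coordinator {
    // Keyed by (DDT key, DAG ID) so that objects with the same key in
    // different DAG instances are not confused.
    table: HashMap<(String, u64), ObjectEntry>,
}

impl Coordinator {
    /// Handle an export message; returns importers that arrived early
    /// (corner case 1) so the exporter can start serving them.
    fn on_export(&mut self, key: &str, dag: u64, addr: &str) -> Vec<String> {
        let e = self.table.entry((key.into(), dag)).or_default();
        e.exporter = Some(addr.into());
        e.importers.clone()
    }

    /// Handle an import message; returns the exporter address, if known,
    /// so it can be notified of the new importer.
    fn on_import(&mut self, key: &str, dag: u64, addr: &str) -> Option<String> {
        let e = self.table.entry((key.into(), dag)).or_default();
        e.importers.push(addr.into());
        e.exporter.clone()
    }
}

fn main() {
    let mut c = Coordinator::default();
    // Importer arrives before the exporter (corner case 1).
    assert_eq!(c.on_import("nums", 1, "10.0.0.2:7000"), None);
    let early = c.on_export("nums", 1, "10.0.0.1:7000");
    assert_eq!(early, vec!["10.0.0.2:7000".to_string()]);
    // A later importer is forwarded the known exporter's location.
    assert_eq!(c.on_import("nums", 1, "10.0.0.3:7000"),
               Some("10.0.0.1:7000".to_string()));
}
```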
Once the exporter's state agent knows the addresses of the importers, it can transfer data to them directly; that is, changes of DDT objects are propagated between exporters and importers. As described in Table 4, an object can be exported only once, so each object has exactly one exporter and one or more importers. The connections are virtual: for each function pair, all virtual connections share at most one physical connection, and there may be no physical connection at all when an importer is on the same node as the exporter. Thus, no extra overhead arises when the exporter transfers data to multiple importers.

Change Propagation
A DDT change refers to a modification of a DDT object, as described in section 3. Overlapping the execution of functions is enabled by propagating changes instead of complete values. Change propagation in Chitu is based on the principles of incremental computation. When a DDT object is changed, the state agent is responsible for transferring the serialized change to all downstream functions that import the object. Upon receiving the changes, downstream functions can begin processing the new data incrementally, without waiting for the entire upstream function to complete.
The numbered arrows in Figure 5 show a complete change propagation link in an end-to-end example. Changes originate in user-defined functions through calls to Diff::change. Next, the user library sends the changes to a serialization thread ①. The next step differs between Rust native functions and other languages ②. A Rust native function directly sends the serialized change to the state agent through channels (asynchronous communication between threads in Rust); user-defined functions implemented in other languages are executed as subprocesses, in which case we use Unix named pipes (a.k.a. FIFOs) to exchange data, including DDT changes, between functions and state agents. The state agent then packages the bytes serialized from DDT changes into windows and broadcasts them to all importers in its local object table ③. If the state agent finds an importer on the same node as the exporter, it shares the data through memory with zero copy. After crossing the network and arriving at the importer, the windowed data is received by the importer's state agent ④, sent through Unix named pipes or channels, deserialized asynchronously, and finally applied to registered handlers as DDT changes ⑤.
Every stage in this process runs asynchronously. Thus, each function pair in a workflow forms a pipeline with five stages: producing, serializing, transferring, deserializing, and consuming. User-defined functions can continue to process data without any synchronous I/O.
Window size tuning for grouping. By default, a change is propagated as soon as it occurs, which minimizes the latency for that change to reach its destination. However, when propagating changes over the network, transferring them one by one is not always efficient. Here, state agents can group changes using a window size configured by users. Currently, the optimal count for each workflow needs to be tuned through multiple experiments.
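A sketch of such grouping inside a state agent is shown below; the Grouper type, its fields, and the flush callback (standing in for serialize-and-send) are illustrative assumptions rather than Chitu's implementation.

```rust
/// Buffers changes and flushes them once `window_size` of them accumulate.
struct Grouper<C> {
    window_size: usize,
    buffer: Vec<C>,
    flush: Box<dyn FnMut(Vec<C>)>,
}

impl<C> Grouper<C> {
    fn push(&mut self, change: C) {
        self.buffer.push(change);
        if self.buffer.len() >= self.window_size {
            (self.flush)(std::mem::take(&mut self.buffer)); // one send per window
        }
    }

    /// On end of the object's lifetime, flush any partial window.
    fn end(&mut self) {
        if !self.buffer.is_empty() {
            (self.flush)(std::mem::take(&mut self.buffer));
        }
    }
}

fn main() {
    let mut g: Grouper<i32> = Grouper {
        window_size: 3, // a window size of 1 propagates every change immediately
        buffer: Vec::new(),
        flush: Box::new(|w| println!("send window {:?}", w)),
    };
    for change in 1..=7 {
        g.push(change);
    }
    g.end(); // flushes the trailing partial window [7]
}
```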

Case Studies
In this section, we present three representative use cases, a big data query (Q3), data-parallel training (DP), and video face detection (VFD), to show how to develop and run applications on Chitu.

Big Data Query Q3
Q3 is a typical data query benchmark [7]. It reads two record tables into memory and performs a query with four stages: filtering, joining, grouping, and sorting. We develop Q3 as a serverless workflow consisting of four functions corresponding to these stages. Since Q3 is a batch processing job, we focus on end-to-end latencies. Task-level parallelism, such as MapReduce-style parallelism and hash joins, is not used in the workflow.
Our implementation has two versions. The basic version, running in a single function without DDT, uses iterators in Rust; its critical part is filtering, folding, and mapping over an iterator created from the input data. The ASRP version is implemented with DDT. To demonstrate how easy it is to transform the basic Q3 into the pipelined Q3 with DDT, we describe the implementation in detail.
We use the DDT Diff<Vec<_>, VecExtend<_>>, the change type VecExtend, and four handlers, one per stage. First of all, the DDT object that represents the input data changes as data records are loaded. Filter: the first stage is filtering, a typical stateless mapping operation. Filtering a data window is trivial: call onchange to register a handler that filters each element in the new window by the same predicate and appends the survivors to the existing semi-result. Join: after waiting for the whole dataset of one table, the joining stage produces joined data based on the new window and the loaded table, again similar to the basic program. Group: unlike the previous two trivial stages, the output of grouping is no longer data windows. Instead, we use a hash map to group the data; in other words, grouping is stateful, and each newly arriving data window changes the hash map on existing keys. Sort: since we do not implement an incremental sorting algorithm that handles arbitrarily changing items, the sorting stage has to wait for the complete results produced by the grouping stage. Hence, of the four stages in Q3, at most three can run pipeline-parallel without advanced incremental computing algorithms. The filtering and grouping functions do not need to wait for upstream functions to finish, so the first three functions process data windows in parallel. Theoretically, when the input data is windowed into sufficiently small windows, the execution time of the entire workflow approaches the time taken by the slowest stage in the pipeline (joining) plus the time taken by the sorting stage. A sketch of the stateful grouping handler follows.
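The sketch below illustrates the grouping handler's core logic; the record fields are invented for illustration and do not reflect the real Q3 schema.

```rust
use std::collections::HashMap;

struct Record {
    key: String,
    revenue: f64,
}

fn main() {
    // Per-key aggregates maintained across windows: grouping is stateful.
    let mut groups: HashMap<String, f64> = HashMap::new();

    // Two data windows arriving as VecExtend-style changes from the join stage.
    let windows = vec![
        vec![
            Record { key: "a".into(), revenue: 1.0 },
            Record { key: "b".into(), revenue: 2.0 },
        ],
        vec![Record { key: "a".into(), revenue: 3.0 }],
    ];

    for window in windows {
        // Body of the onchange handler: fold the new window into the groups.
        for r in window {
            *groups.entry(r.key).or_insert(0.0) += r.revenue;
        }
        println!("groups after window: {:?}", groups); // semi-result
    }
}
```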

Data Parallel Distributed Training
Data parallelism (DP) for distributed ML training involves multiple training workers, each training on local data and exchanging model parameters or gradients via a parameter server [27] or AllReduce [24] after iterations of forward propagation (FP) and backward propagation (BP), leading to a long wait before the next training iteration.
In this case, the basic version follows the traditional DP procedure in the AllReduce paradigm, calculating the gradients of all layers and then merging them together. The ASRP version, implemented with DDT, takes advantage of the layer-by-layer structure of neural networks, fully overlapping BP with the merging computation. The detailed execution process is shown in Figure 6.
All parameters are registered as handlers on a DDT object of type Diff<HashMap<String, Tensor>, DictSet<_>>. Once the gradients of certain layers are finished, they are immediately transmitted to the master for merging, triggered by torch.backward_hook functions, while BP is still ongoing concurrently. Specifically, worker-1 and worker-2 compute the gradients of Part-2 while transmitting the Part-1 gradients for merging. Moreover, the master proactively sends the merged gradients of some layers back to each worker without waiting for the gradient calculation of the entire model to complete; corresponding to Figure 6, the master computes gradients of Part-3 while transmitting the Part-2 merged gradients. This strategy ensures efficient utilization of computational resources and reduces overall training time.

Video Face Detection
We used about 100 lines of code to implement the initial version of the face detection program and added another 100 lines of Chitu-related code to create the Chitu version. The main additions in the Chitu version are the definitions of the DDT and Change types and the important onchange function.
VFD is a program consisting of three stages: read, detect, and write.
In the basic version, the detect function waits for the read function to finish completely, and the write function likewise waits for detect. The ASRP implementation overlaps the execution of the read, detect, and write functions: the detect function can start working as soon as the read function has read and parsed a frame.
For the face detection program, we use three functions and two DDT objects, both of type Diff<Vec<Frame>, VecChange<Frame>>, to transmit intermediate frame data. Read: reading is the first step in the video processing pipeline, where the original video file is read and parsed into frames using the OpenCV library. After reading, the frames are passed to the detection function by invoking the change method of the DDT. Detect: the second step performs detection on OpenCV Mat objects. The detect function uses the CascadeClassifier of the OpenCV library to perform face detection on each image and marks the detected face areas; the modified image is then passed to the writing function through the change method of the other DDT. Write: finally, the Mats are written back into a video file.

Implementation
We have implemented Chitu atop OpenFaaS [10], a general open-source FaaS platform. Compared to native OpenFaaS, the most significant additions of Chitu are the state coordinator and the state agents.
Scheduler. Since Chitu directly uses the original scheduler of the underlying serverless platform (i.e., OpenFaaS), developers need to declare co-scheduled functions explicitly in configuration files. Downstream functions can be invoked early through the OpenFaaS gateway, and the scheduler then responds as ordered.
State coordinator. The coordinator is deployed as a separate serverless function in 700 lines of Go. Although it is a centralized component, the coordinator handles only three HTTP requests within a DDT lifetime in most cases, which is lightweight. Thus, we argue it can achieve high scalability.
State agent. The state agent is implemented in 2,200 lines of Rust. The Rust, C++, and Python programming libraries comprise 500 lines of Rust, 600 lines of C++, and 500 lines of Python, respectively. In the OpenFaaS deployment, the addresses of functions and state agents are obtained from the containers' IP addresses. This allows state agents to call any other agent directly by IP address and transfer data without intermediary services.

Evaluation
In this section, we evaluate the effectiveness of Chitu by comparing it with several popular serverless workflow frameworks, and then explore the performance of the three cases described in section 5.

Experimental Setup
Because different serverless frameworks have different dependencies and environment requirements, we adopt two experimental infrastructures: one self-hosted (resources listed in Table 5) and one on AWS with four c5.xlarge EC2 instances. The self-hosted infrastructure is used in both sections 7.2 and 7.3, while AWS is used only in section 7.2 for the open-source serverless computing frameworks.
Table 5. Experiment setup and resource limits for three cases.

General Workflow Acceleration Performance
In this section, we evaluate the general workflow performance of Chitu and other stateful serverless computing frameworks with the settings below.
• Chitu. Due to the 10 Gbps network bandwidth of EC2 instances, Chitu performs 5-40% better on AWS than on the self-hosted infrastructure. For fairness (Knix is deployed self-hosted), we compare only the self-hosted Chitu against the other platforms.
• Cloudburst [36] on AWS. It uses a fast cache for indirect data access. We consider two versions: Cloudburst-local (functions located on the same node) and Cloudburst-remote (functions across nodes).
• AWS Step Functions (ASF) [4]. A service developers use to build serverless workflows with Lambda.
• Knix [9] on self-hosted. An evolution of SAND [11] that schedules functions in the same container for direct data access.
• Faastlane [26] on AWS. It reduces function interaction latency by providing thread-level isolation domains using Intel Memory Protection Keys.
We consider the following two workloads: (a) No-ops workload. Each function of the workflow immediately returns the initial input strings as its response. Chitu groups the strings into window-sized chunks, encapsulates them with Diff<String>, and delivers them continuously. (b) Loop workload. In contrast to the No-ops workload, which performs no operations, each function repeatedly converts all characters in the input strings to lowercase and back to uppercase over 15 iterations.
The performance of these workloads helps us understand Chitu's workflow acceleration through direct communication and computation-transmission overlapping: the No-ops and Loop workloads verify the efficiency of Chitu's direct communication and of its pipelines, respectively. Under these two workloads, we separately vary the initial input data size and the workflow chain length, i.e., the number of functions.
Performance with different data sizes. Overall, Chitu-local outperforms most baselines across data sizes from 1KB to 100MB, as shown in Figure 7. When the data size is small (e.g., 1KB, 10KB), Chitu performs slightly worse than Cloudburst. This is because, before data is transferred by change propagation, the exporter and importer must establish the direct connection described in section 4.2, which costs three HTTP round trips, measured at about 2.5 ms in our experiments. However, as the data size increases, Chitu outperforms all other methods. For example, in the Loop workload with an initial input size of 100MB, Chitu-local improves on Faastlane by 21%, and Chitu-remote improves on Cloudburst-remote by 1.7×. This can be attributed to Chitu's ASRP: the connection establishment overhead is relatively low, while the time spent on transfer dominates. On the other hand, due to the lack of specific optimizations in orchestration and scheduling, ASF incurs significant overhead, most pronounced at smaller data sizes. Knix and Faastlane accelerate workflows by scheduling functions on the same node, so we compare them with Chitu-local, which outperforms both.
Performance with different workflow chain lengths. Chitu-local performs better than most baselines with 100KB data and chain lengths from 5 to 80, as shown in Figure 8. In all cases, Chitu-local consistently outperforms Cloudburst-local, the best of the baselines, by 1.8-3.7× in both workloads. In comparison to Cloudburst-remote, Chitu-remote performs slightly worse on shorter workflow chains, such as 5 and 10, for the same reason as in the previous experiment. However, it demonstrates superior performance on longer call chains, where all stages of the Chitu-remote and Chitu-local workflows can transfer data in parallel.

End-to-end Application Performance
It would be costly to implement all three end-to-end applications on the other serverless frameworks; specifically, those frameworks lack adequate SDK support for our applications. So we build the three end-to-end applications described in section 5 on Chitu and on OpenFaaS-Single/Redis as below.
• OpenFaaS-Single: the basic version running in a single function deployed on OpenFaaS. This is the trivial solution for most applications in the real world. Its advantage is that all data exchange occurs in memory, with no data transmission over the network; however, it violates the serverless paradigm because no components can scale independently.
• OpenFaaS-Redis: the components of a workflow deployed as individual functions that share data via an external Redis service.
In the experiments of this section, functions of all versions are warmed up to avoid cold starts. We measure the total end-to-end execution time for various input sizes and compare the results with the baselines.
Q3 case. As shown in Figure 9a, Chitu employs pipeline parallelism and delivers the best performance across data sizes from 65MB to 10GB under the most efficient window size, running 52-57% faster than OpenFaaS-Single and 130-150% faster than OpenFaaS-Redis. Since Chitu lets developers decide how many records to pack into a window, we also measure execution time with window sizes varied from 10 to 10^5. OpenFaaS-Redis has the worst performance: compared to OpenFaaS-Single, it adds data (de)serialization and transmission overhead that makes it 48-71% slower. Figure 9b shows the results of Chitu with varied window sizes.
DP case. OpenFaaS-Redis uses Redis as the intermediate store between functions: the master pulls all workers' gradients and pushes the merged gradients for the next iteration. OpenFaaS-Single trains locally in one function. As Figure 10a shows, Chitu outperforms all baselines, with speedups of up to 50% over OpenFaaS-Redis, which suffers high data access overhead. Figure 10b indicates that performance varies more with window size in this case than in the others, caused by the extremely non-uniform parameter sizes of the model layers. Moreover, Chitu benefits further as the model size grows (Figure 10b) and as parallelism increases (Figure 11). The underlying reason is that the master and workers in Chitu exchange gradients ahead of time without waiting for BP to fully complete, significantly reducing synchronization time.
VFD case. As shown in Figure 12, Chitu outperforms all baselines on videos of three different resolutions, and the performance gain grows with resolution. Figure 12a shows the execution time of each implementation at different video resolutions; for the 1440p video, Chitu reduces execution time by 38.4% compared to OpenFaaS-Redis and by 17% compared to OpenFaaS-Single. Figure 12b shows that increasing the window size has little effect on the 1080p and 1440p videos but can speed up the 240p video by 50% at a large window size. For 1080p and 1440p videos, a single frame is already large, so their execution time gains little from a larger window; a frame of a 240p video is small, however, so a window size of 1 is inefficient.

Related Work
Data access optimization. Considering the high latency of state access and delivery in serverless workflows, recent studies mainly concentrate on optimizing both indirect and direct state access. Indirect state access optimizations [1,2,21,25,36,41,43] primarily involve highly scalable distributed storage systems or shared logs and cache data near functions to reduce latency. However, indirect state access depends heavily on network communication, which seriously damages throughput under low bandwidth.
In contrast, direct state access leverages algorithms and hardware to make the execution of stateful serverless workflows local or approximately local. Function fusion [11,26,33,35,37,42] takes the workflow as prior knowledge and schedules functions with potential state sharing into the same container, running them as multiple processes or threads in a local pattern; it further employs shared memory and zero-copy state delivery to reduce transmission overhead. However, its usage is limited by the resource concentration it requires. Remote Direct Memory Access (RDMA) [14,17] is also used to accelerate direct state access, obviating the overhead of intermediate storage, but it imposes higher hardware requirements on serverless platforms. Chitu, by contrast, offers efficient state access for general serverless computing.
Pipeline acceleration in workflows. Pipelining is widely used in data processing workflows and sequential task execution. [13] overlaps the transmission of model layer parameters with forward computation to reduce context-switching overhead. [31,32,44] employ inter-batch pipelining policies for tasks to accelerate context switching. In serverless workflows, [28] splits mini-batch data into micro-batches to improve resource utilization, enabling functions to compute ahead of time in the model-parallel training pattern. [19] supports pipelined model parallelism in GPU-enabled containers to accelerate DNN training. [38] pipelines aggregation-intensive operations on CPU servers and lightweight computation-intensive operations on lambda functions for distributed GNN training.
Chitu, in contrast, is designed as an ASRP-enabled framework for serverless workflows that supports a wide range of general application scenarios, including distributed AI training, big data sorting, and video stream processing.

Conclusions and Discussions
In this paper, we present Chitu, a novel serverless workflow framework that uses asynchronous state replication pipelines to accelerate workflow execution. Chitu provides three main capabilities. First, differentiable data types (DDT) are provided at the programming model level to support incremental state sharing and computation. Second, ASRP continuously delivers changes of DDT objects in real time so that downstream functions can consume the objects without waiting for upstream functions to finish. Third, we make a systematic design to support DDT and ASRP in the Chitu framework, including direct communication support and change propagation. We implement Chitu atop OpenFaaS, compare it with popular serverless workflow frameworks, and evaluate it with three representative cases. The results show that Chitu improves the efficiency of state transmission in serverless workflows by up to 1.7× and reduces end-to-end latency by up to 57%.
For further study, several aspects are worth exploring, such as fault tolerance for ASRP, automatic window size tuning, and applicability to multi-cloud environments.

Figure 1 .
Figure 1. Data access methods for serverless workflows can be divided into four types. The data can be carried by invocation requests in a piggyback way (a), accessed indirectly via an external data store (b), accessed directly through local memory (c), or accessed via asynchronous state replication pipelines in Chitu (d).

Figure 2 .
Figure 2. DDT computing illustration with an incremental merge sort algorithm. Take the processing of the data window with "8, 2" as an example. This window is essentially a Change on the ingress stream. It will be sorted and then merged with the semi-result "1, 9". The sort and merge processes are defined in registered handlers of the DDT object Results.

Figure 3 .
Figure 3. Rust code implementation of the DDT computation in Figure 2. To incrementally merge sort data windows, developers need to register a handler on a DDT object that represents the ingress data stream. The core logic for sorting is implemented in the handler, which will be invoked every time the DDT object is changed.

Figure 4 .
Figure 4. A pipeline with five stages constructed by a serverless workflow of two functions that share a DDT object, executing five data windows (blocks 1-5). The five pipeline stages, Function-1's computation, serialization of raw data, data transmission over the network, deserialization of serialized data, and Function-2's incremental sorting, run in pipeline parallel.

Figure 5 .
Figure 5. The system architecture of Chitu is composed of three layers: the application layer, the function runtime/container layer, and the global runtime manager layer. Function programs run with the DDT libraries, and the Changes of DDT objects are acquired by state agents and propagated to local functions (via references) or remote functions (via network communication).
DAG instances. Chitu uses DAG IDs to identify DAG instances. When a DAG is invoked, the root function is scheduled and invoked first. The invocation request is forwarded to the state agent hosted on the function runtime where the root function is scheduled, and that state agent generates a new unique DAG ID for the root function request. Any further invocation triggered by a function that has a DAG ID inherits the DAG ID and is treated as part of the same DAG instance.

Figure 6 .
Figure 6. Pipeline in the DP case when parallelism is set to 2. With Chitu, functions synchronize gradients while calculating new gradients in parallel.

Figure 7 .
Figure 7. Execution time comparison on two workloads with different initial data sizes. Each workload contains only two functions. ASF only supports payloads of up to 256KB, so it is excluded from the [1MB, 10MB, 100MB] experiments.

Figure 8 .
Figure 8. Execution time comparison on two workloads with different numbers of functions in the workflow, with an initial input data size of 100KB.

Figure 11 .
Figure 11. Performance comparison under different degrees of parallelism in the DP case.

Table 1 .
Comparison of the overhead of two cases under different methods.

Table 2 .
Change Types Provided by the DDT Library.
TrivialChange: The simplest change; it can apply to any type of value. Since a value of any type can be changed by replacing it with a new value or keeping the old value, there is always a trivial implementation of Change for any value type. However, it is useless in real applications because it carries all information of the value, which prohibits incremental computation.
VecExtend: Only allows a vector to be changed by appending another vector. Though not flexible, it is efficient, as it enables items of any size to be processed together. Our Q3 case (big data query, section 5.1) uses it to represent the datasets.
DictSet: Sets a key-value pair in a hash map (a.k.a. dictionary). It is useful in the many cases where hash maps are needed. Our DP case (data-parallel distributed training, section 5.2) uses this type to represent updates of a tensor.
VecChange: Provides Push and Pop. It is widely used because implementing map and reduce functions with it is straightforward. Our VFD case (video face detection, section 5.3) uses this type to abstract a stream of video frames.
Table 3 .
The core APIs of DDT.
new() -> Diff: Instantiate a new DDT object.
change(Change): Change a DDT object by an instance of its change type. Asynchronously apply the change and propagate it to all subscribers.
onchange(Fn) -> Fut: Register a handler for a DDT object. When the object is changed, the handler will be invoked and finally return a future.
end(): End the lifetime of an object. Once an object is ended, all registered handlers will exit.

Table 4 .
The APIs for sharing DDT.
invoke: Asynchronously invoke a downstream function and pass the arguments. Function instances invoked by this API will be treated as in the same DAG as the invoker.
export: Export a DDT object so that its changes are asynchronously propagated to all importers. An object can be exported only once.
import: Import a DDT object. The importer continuously receives the object's subsequent changes until its lifetime ends.