Approximating Fork-Join Systems via Mixed Model Transformations

While product-form queueing networks are effective in analyzing system performance, they encounter difficulties in scenarios involving internal concurrency. Moreover, the complexity introduced by synchronization delays challenges the accuracy of analytic methods. This paper proposes a novel approximation technique for closed fork-join systems, called MMT, which relies on transformation into a mixed queueing network model for computational analysis. The approach substitutes fork and join with a probabilistic router and a delay station, introducing auxiliary open job classes to capture the influence of parallel computation and synchronization delay on the performance of original job classes. Evaluation experiments show that the proposed method forecasts performance metrics more accurately than a classic method, the Heidelberger-Trivedi transformation. This suggests that our method could serve as a promising alternative in evaluating queueing networks that contain fork-join systems.


INTRODUCTION
Modern software systems are typically implemented in a distributed manner to improve performance, reliability, and scalability [18]. Under this pattern, parallel and concurrent structures have gained increased importance over the years [9]. Software components are often parallelized to achieve higher execution efficiency and increase the utilization of the available resources. Parallel jobs typically follow a fork-join mechanism. A parallel job is forked into several tasks, which are executed concurrently on distinct resources within the system. After finishing execution, a task has to await its sibling tasks at the join point. The job exits the join point and continues only when all its tasks have finished execution.
Queueing networks are a class of efficient performance models to understand the impact of execution mechanisms on system performance. A large number of computer systems can be abstracted as product-form queueing networks from which designers obtain accurate performance predictions [3]. Nevertheless, product-form queueing network models do not accommodate jobs that feature internal concurrency. Furthermore, the exact analysis of internal concurrency within a queueing network can rapidly lead to a state-space explosion.
In addition to the time spent on service, the total time of a parallel execution involves two other delays: queueing delay and synchronization delay [19]. Synchronization delay is incurred by any completed task that waits for the completion of its sibling tasks before leaving the fork-join system. This coordination introduces dependencies and increases the complexity of designing an accurate analysis for parallel executions. Therefore, software designers and practitioners need approximate analytic performance models to analyze parallelism.
The technique developed by Heidelberger and Trivedi [13] is long established. This method, which we refer to as the HT method in the rest of the paper, decomposes a job into a primary task and a fixed number of secondary tasks; these tasks are assumed to be independent of each other and belong to different job classes. Then probabilistic routing and pseudo-servers are used to replace parallel executions so that a product-form solution can be produced.
In this paper, we propose a novel approach, named MMT, for the analysis of fork-join systems. Similarly to the HT method, the fork and join are replaced by a probabilistic router and a delay, respectively. In our terminology, a probabilistic router is an abstract node that routes incoming jobs along output branches, according to a probabilistic routing policy and with zero service time. Beyond that, our method introduces auxiliary open job classes to mimic the influence of parallel computation and synchronization delay on original job classes, which leads to a mixed queueing network model to analyze. The arrival rates and service rates of the auxiliary classes are computed by an iterative procedure that finally yields the approximate solution of the original queueing network containing parallelism. The main performance measures in the fork-join system can be obtained using the method described.
The effectiveness of the proposed method is validated by comparison with simulations and the HT method. We use the simulation results as ground truth. Experiments are conducted on closed queueing networks with homogeneous or heterogeneous fork-join queues (i.e., where servers can have the same or different service rates). Compared to the HT method, the proposed method offers lower error rates on predicted performance measures. The rest of the paper is organized as follows. In Section 2, we provide background on queueing network theory, fork-join systems, and their analysis. In Section 3, we propose our MMT method for the analytic analysis of fork-join systems. In Section 4, we present evaluation experiments and results. In Section 5, we review related work. Finally, we conclude the paper in Section 6.

BACKGROUND

Queueing Networks
Queueing networks serve as a class of models for analyzing the performance of systems. They are made up of a collection of service stations indexed by $i = 1, \ldots, M$, in which jobs are queued and executed. A service station could have either infinite servers or finite servers, and is referred to as a delay or queue station, respectively. The common use of delay stations is to represent the think times of workloads, while queue stations typically represent system resources [16]. In a queue station, there may be competition among jobs for the server, resulting in waiting times. A queueing network may execute multiple classes of jobs indexed by $r = 1, \ldots, R$. Distinct job classes typically feature different service rates $\mu_{i,r}$, and can be further categorized into closed and open classes. For closed job classes, a fixed number of jobs circulate within the network. For open job classes, jobs from a source continuously arrive to the network with rate $\lambda_r$. Scheduling policies determine the order in which jobs are served, with common policies including First-Come First-Serve (FCFS) and Processor Sharing (PS).
To evaluate the performance of a queueing network, one approach is to solve a system of global balance equations to get the state probabilities of the underlying Markov chain, from which all performance measures can be obtained. However, this method is hardly practical on complex networks due to the state-space explosion. Among queueing networks, there is a special class named product-form queueing networks. They have local balance properties and their exact performance measures can be obtained without

Fork-Join Systems
A queueing network may include a number of fork-join systems indexed by $f = 1, \ldots, F$. Fork-join systems contain both fork and join nodes, which are a particular structure in queueing networks.
The fork node has the property that any arriving job is split into multiple tasks to be serviced independently and in parallel, while the join node combines these tasks back into the original job after they have finished processing [23]. Let $K_f$ denote the number of parallel paths spawned by the fork node of the system $f$; the $K_f$ executions are assumed to be independent of each other, so they do not intersect before reaching the join node. The notations for networks considered by this paper are summarized in Table 1.
An example of a closed queueing network containing a fork-join system is given in Figure 1. Jobs circulate between the fork-join system and a single queueing station. Any job that arrives at the fork node is split into three tasks to be processed on Q1, Q2, and Q3. After completing processing, each task must wait for its sibling tasks to finish. Once all the tasks spawned by the job are finished, they are joined together back into the original job, which subsequently moves on to Q4.

Heidelberger-Trivedi Approximation Method
A queueing network that contains fork-join systems has no product-form solution [3]; thus, it cannot be efficiently solved in general. A classic transformation method was developed by Heidelberger and Trivedi [13], which originally transforms single-class closed queueing networks including a single parallel construct into a queueing network with no concurrency so that analytic methods such as approximate mean value analysis (AMVA) can be applied.
The main feature of this transformation is that each closed class is coupled with one auxiliary closed class for each parallel path. The parallel constructs targeted by this work are analogous to fork-join systems. In its model, each job is represented by a primary task with multiple secondary tasks. Primary and secondary tasks are used to represent the activity of the job executed outside and inside the fork-join system, respectively. Each primary task arriving at the fork node is forked into multiple secondary tasks, while secondary tasks arriving at the join node need to wait for their siblings to arrive before they can be joined back into the primary task.
This work associates the join node of the fork-join construct with a delay station. The time a primary task is supposed to spend at this delay station represents the overall response time of the job in the fork-join system, whereas for a secondary task it represents the corresponding synchronization delay. This approach also adds an auxiliary delay station to the network, which models the time a primary task spends outside the fork-join system. The secondary tasks use the auxiliary delay as their reference station. Their service rates are computed using an iterative procedure analogous to [14].
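Suppose $R_0$ represents the response time of the fork-join system, $R_k$ denotes the response time at the $k$-th parallel path, and $K$ denotes the number of parallel paths spawned by the fork. The expectation of the job response time in the fork-join system is
$$E[R_0] = E[\max(R_1, \ldots, R_K)].$$
Therefore, the service time of the primary task at the delay station that replaces the join node is $E[R_0]$. The $R_k$ are assumed to be exponential random variables with $E[R_k] = 1/\mu_k$ for all $k \in \{1, \ldots, K\}$, thus $E[R_0]$ can be further expressed as the following [24]
$$E[R_0] = \sum_{i} \frac{1}{\mu_i} - \sum_{i<j} \frac{1}{\mu_i + \mu_j} + \sum_{i<j<l} \frac{1}{\mu_i + \mu_j + \mu_l} - \cdots + (-1)^{K+1} \frac{1}{\mu_1 + \cdots + \mu_K}$$
where $i$, $j$, and $l$ represent path indices.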
Let $S_k$ represent the service times of the secondary tasks at this delay station, i.e., the synchronization delays. $E[S_k]$ is obtained by subtracting the mean response time of the $k$-th parallel path from the fork-join response time:
$$E[S_k] = E[R_0] - E[R_k].$$
Figure 2 depicts an example of a network alteration using the aforementioned procedure. The original job class and its primary tasks are depicted with red circles in the original and transformed systems, whereas the distinct secondary tasks are represented through blue, green, and yellow circles. It can be observed from Figure 2b that the HT method produces four job classes for this single-class fork-join system with three parallel paths. In the transformed fork-join system, the primary tasks are routed straight to the synchronization delay station, whereas the secondary tasks are routed through the fork-join system. From the synchronization delay station, every job is routed to the auxiliary delay station. From the auxiliary delay station, the jobs belonging to the primary task continue proceeding through the rest of the network, while the jobs from the secondary tasks are sent out to the router. The obtained model is a product-form queueing network that can be solved easily.
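The HT quantities described above can be checked numerically. The following is a minimal sketch, assuming independent exponential path response times with known rates; the function names are hypothetical and are not taken from the paper or from any library. It evaluates the expected maximum by inclusion-exclusion over subsets of paths and then derives each synchronization delay by subtraction.

```python
from itertools import combinations


def expected_max_exponential(rates):
    """E[max(R_1..R_K)] for independent exponentials with the given rates.

    Inclusion-exclusion: subsets of odd size add 1/(sum of rates),
    subsets of even size subtract it.
    """
    total = 0.0
    for size in range(1, len(rates) + 1):
        sign = 1.0 if size % 2 == 1 else -1.0
        for subset in combinations(rates, size):
            total += sign / sum(subset)
    return total


def ht_sync_delays(rates):
    """Mean synchronization delay per path: E[R_0] - E[R_k]."""
    fork_join_rt = expected_max_exponential(rates)
    return [fork_join_rt - 1.0 / mu for mu in rates]


# Illustrative rates for a two-path fork-join (made-up values).
rates = [1.0, 2.0]
fork_join_rt = expected_max_exponential(rates)
delays = ht_sync_delays(rates)
print(fork_join_rt, delays)
```

The enumeration visits all non-empty subsets of paths, so its cost grows exponentially with the number of paths; this is acceptable for the small fan-outs used here but would need care for wide forks.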

PROPOSED METHODOLOGY
The proposed method is a simpler transformation of fork-join systems that introduces only one more delay station and fewer job classes. This transformation leads to a simple yet effective computation procedure for the performance measures of the original system. The new notations introduced by the proposed method are summarized in Table 2.

Network Transformation
3.1.1 Mixed model construction. We transform the network into one to which analytic approximations can be applied. The join node of the fork-join system $f$ is replaced by a delay station $d_f$, while the fork node is replaced by a router that forwards incoming jobs to the original $K_f$ parallel paths. Each path carries an equal probability $1/K_f$ of being selected to forward a job to, and the sum of probabilities is 1. However, a router does not offer the functionality required to spawn other tasks. Thus, even if a job is routed to a parallel path, no sibling tasks are executed concurrently on the other parallel paths. To counter this issue, the parallelism induced by a fork-join system is simulated through the addition of open job classes, which we shall refer to as the auxiliary classes. The auxiliary classes mirror the behavior of their original classes in terms of routing and service rates within the fork-join system. As shown in Figure 3a, the auxiliary classes are routed directly to the router from the source, and they are forwarded to the sink after leaving the delay station. Thus, their impact is limited to their corresponding fork-join systems.
In contrast to the HT method, each original class passing through the fork node is attributed only one auxiliary class whose jobs are meant to act as siblings of the original job class. Due to the approximation decision of assigning path selection probabilities equal to $1/K_f$, the transformation exercises all parallel paths equally. This is a key difference compared to HT, as it results in a model in which classes do not map in a one-to-one fashion with a particular parallel path. Hence, any approximation error that affects an auxiliary class is equally distributed across the paths. A comparison between the HT and MMT transformations is shown in Table 3, where the original queueing network has one fork-join system and $K$ represents the number of parallel paths.

3.1.2 Arrival rate of the auxiliary open class. Since we have introduced auxiliary open classes into the network, it is necessary to determine their arrival rates. In equilibrium, the router that replaces the fork node of the structure $f$ satisfies the flow balance condition [16]. Let $T_{f,r}$ denote the class-$r$ arrival rate/throughput at the router, and assume $T_{f,r}$ to be the same as the class-$r$ arrival rate at the original fork node; the class-$r$ throughput at the original fork node is then $T_{f,r} \cdot K_f$. Therefore, the arrival rate of the auxiliary open class can be computed by balancing the flow at the router as
$$\lambda_{r'_f} = T_{f,r} \cdot K_f - T_{f,r} = (K_f - 1)\, T_{f,r}$$
where $r$ denotes the original job class, $r'_f$ denotes the auxiliary class of $r$ created for the fork-join structure $f$, $\lambda_{r'_f}$ represents the arrival rate of the auxiliary open class, and $K_f$ here denotes the number of branches connected to the router.
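The flow balance at the router can be illustrated with a short calculation. This is a minimal sketch under one reading of the text: the router must carry the fork's total output, which is the number of branches times the original class throughput, and the original class itself contributes one share of that flow. The function name and the numeric values are hypothetical.

```python
def auxiliary_arrival_rate(original_throughput, num_branches):
    """Arrival rate of the auxiliary open class from flow balance at the router.

    Assumption: router outflow = num_branches * original_throughput, and the
    original class supplies original_throughput of the inflow, so the
    auxiliary class supplies the remainder.
    """
    if num_branches < 1:
        raise ValueError("a fork needs at least one branch")
    if original_throughput < 0:
        raise ValueError("throughput cannot be negative")
    return (num_branches - 1) * original_throughput


# Example: original class throughput 2.0 jobs per time unit, three branches.
rate = auxiliary_arrival_rate(2.0, 3)
print(rate)
```

With a single branch the auxiliary arrival rate is zero, which matches the intuition that a fork with one path spawns no sibling tasks.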

Synchronization Delay
In this section, we illustrate how the MMT method approximates the synchronization delay. For ease of presentation, we assume there is one fork-join system ($F = 1$) in the original network, so we temporarily remove the subscript $f$. We assume that the response time at any parallel path $k$ is approximately exponentially distributed, which has proved to be an effective assumption for analytic analysis [22]. Besides, let the random variables $R_{1,r}, \ldots, R_{K,r}$ represent the class-$r$ response times at parallel paths $1, \ldots, K$; we assume they are mutually independent. Based on the same two assumptions as the HT method, we propose the following approximation approach, consisting of two main steps, to derive the synchronization delays for the fork-join system in a queueing network.

3.2.1 Mean response time at a parallel path. The first step of our approach is to obtain the response times at the parallel paths of a fork-join system. For each path $k = 1, \ldots, K$, the mean response time is the sum of the mean response times at all queueing stations on that path.
We merge the performance measures of the auxiliary class with those of its corresponding original class. Given the merged queue length and throughput, Little's law [17] is then used to compute
$$E[R_{k,r}] = \sum_{i \in P_k} \frac{Q_{i,r} + Q_{i,r'}}{T_{i,r} + T_{i,r'}}$$
where $Q_{i,s}$ and $T_{i,s}$ denote the class-$s$ mean queue length and throughput at station $i$, $r'$ is the auxiliary class of $r$, and $P_k$ is the set of service stations on path $k$.
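The Little's law step can be sketched in a few lines. This is an illustration only: it assumes that merging means adding the queue lengths and adding the throughputs of the original and auxiliary class at each station, which the text does not spell out. The function name and the sample numbers are hypothetical.

```python
def path_response_time(stations):
    """Mean response time of one parallel path via Little's law.

    stations: iterable of (q_orig, q_aux, t_orig, t_aux) tuples, one per
    station on the path, holding mean queue lengths and throughputs of the
    original class and its auxiliary class.
    Assumption: merged measures are plain sums of the two classes.
    """
    total = 0.0
    for q_orig, q_aux, t_orig, t_aux in stations:
        merged_queue_length = q_orig + q_aux
        merged_throughput = t_orig + t_aux
        if merged_throughput <= 0:
            raise ValueError("merged throughput must be positive")
        # Little's law at one station: R = Q / T.
        total += merged_queue_length / merged_throughput
    return total


# A path with two stations (made-up measures).
path = [(1.0, 2.0, 0.5, 1.0), (0.5, 0.5, 0.5, 1.5)]
response_time = path_response_time(path)
print(response_time)
```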

3.2.2 Approximation by a homogeneous fork-join system. The second step is to update the service rates of the original and auxiliary classes at the synchronization delay station. We propose to approximate the given fork-join system with a homogeneous fork-join system. As shown in Figure 3b, the corresponding homogeneous fork-join system has the same number of parallel paths and features one delay station per path. The service times of both job classes at any delay station of the new fork-join system are assumed to be exponential random variables with means equal to the average of the original response times of the path executions,
$$E[R'_{k,r}] = \frac{1}{K} \sum_{j=1}^{K} E[R_{j,r}]$$
where the random variable $R'_{k,r}$ represents the class-$r$ service time at the delay station of the parallel path $k$. The homogeneous system can approximate the behavior of the original system since their response times are close to each other, i.e., $E[\max(R_{1,r}, \ldots, R_{K,r})] \approx E[\max(R'_{1,r}, \ldots, R'_{K,r})]$ [...] (2), and the random variable $S_{k,r}$ represents the class-$r$ synchronization delay at the parallel path $k$ of the original fork-join system.
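The homogeneous approximation admits a closed form that is easy to compute. The sketch below relies on the standard fact that the expected maximum of K independent exponential variables with a common mean m equals m times the K-th harmonic number. The final subtraction of each path's own mean response time to obtain a synchronization delay mirrors the HT definition and is an assumption here, as are the function names and sample values.

```python
def harmonic_number(k):
    """H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / j for j in range(1, k + 1))


def homogeneous_sync_delays(path_means):
    """Approximate fork-join response time and per-path synchronization delays.

    path_means: mean response times of the K original parallel paths.
    The homogeneous system uses their average as the common mean.
    Assumption: sync delay of path k = fork-join response time - path_means[k].
    """
    k = len(path_means)
    common_mean = sum(path_means) / k
    fork_join_rt = harmonic_number(k) * common_mean
    delays = [fork_join_rt - mean for mean in path_means]
    return fork_join_rt, delays


# Three heterogeneous paths (made-up means).
fork_join_rt, delays = homogeneous_sync_delays([1.0, 2.0, 3.0])
print(fork_join_rt, delays)
```

For strongly unbalanced paths the subtraction can yield a negative value for the slowest path; how such cases are treated is not stated in the text, so the sketch leaves the value as computed.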

Algorithm
In our method, the arrival rates of auxiliary classes and the service rates at the delay station are not known in advance. This implies that an iterative computation framework is needed to approximate the solution of the original network.
The framework is shown in Algorithm 1. For ease of presentation, we consider one fork-join system and temporarily remove the subscript $f$. The input of the algorithm includes an original queueing network, a tolerance that serves as the iteration stopping threshold, and a boolean value that determines whether to stop the iteration. We set the initial value of  , to 0, and both  , ,  0 , to the same number (line 1), and transform the original queueing network into a product-form mixed queueing network by the proposed procedure (line 3). The stopping criterion is the difference between the queue lengths of two successive iterations (lines 5-9). At each iteration, the transformed queueing network is solved by AMVA (line 10). Then, for each auxiliary open class, we update its arrival rate (lines 12-13) and update the service rates of both the auxiliary and the corresponding original classes at the synchronization delay station (lines 14-16). In this way, the parameters of the queueing network are updated. The last step of an iteration is to merge the results for each original class (line 17).
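The overall shape of such an iteration can be sketched as a generic fixed-point loop. This is not Algorithm 1 itself: the solver and the parameter update are passed in as callables, and the toy functions below stand in for the AMVA solution and for the updates of arrival and service rates. All names are hypothetical.

```python
def solve_by_fixed_point(solve, update, params, tol=1e-6, max_iter=1000):
    """Iterate solve/update until successive queue lengths differ by < tol.

    solve(params) -> list of queue lengths (stands in for an AMVA call).
    update(params, queue_lengths) -> new params (stands in for updating
    auxiliary arrival rates and synchronization delay service rates).
    """
    previous = None
    for iteration in range(1, max_iter + 1):
        queue_lengths = solve(params)
        if previous is not None:
            gap = max(abs(new - old) for new, old in zip(queue_lengths, previous))
            if gap < tol:
                return queue_lengths, params, iteration
        params = update(params, queue_lengths)
        previous = queue_lengths
    raise RuntimeError("fixed-point iteration did not converge")


# Toy stand-ins: a contraction whose fixed point is a queue length of 2.0.
def toy_solve(x):
    return [0.5 * x + 1.0]


def toy_update(x, queue_lengths):
    return queue_lengths[0]


queue_lengths, final_param, iterations = solve_by_fixed_point(toy_solve, toy_update, 0.0)
print(queue_lengths, iterations)
```

Passing the solver as a callable keeps the loop independent of any particular queueing library; in practice the convergence of the real iteration depends on the model and is not guaranteed by this sketch.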

Extension to Nested Fork-Join Systems
A nested fork-join structure is a hierarchical arrangement of fork-join systems. Nested fork-join structures are significant in the field of software development [4,15]. Figure 4 shows a network transformed from a nested fork-join system by the proposed procedure. As can be observed, the original system includes two fork-join structures. We refer to the outer fork-join as $f_1$ and to the inner fork-join as $f_2$. In this network, the possible paths of the original job class are depicted in red, whereas the possible paths of its auxiliary job classes are depicted in green or blue. The green paths are for the auxiliary class created for $f_1$, whereas the blue paths are for the auxiliary class created for $f_2$.
To adapt to nested fork-join systems, for each original class, we first build auxiliary classes for every fork-join structure visited by the original class. Two additions are introduced in the MMT method compared to our original method for systems without nested fork-join structures. The first addition is to start computing the synchronization delay only at the outermost fork-join structure, and then compute the nested fork-join structures recursively.
The second addition is the calculation of the arrival rates of the auxiliary classes. Because nested systems incorporate inner fork-join structures and auxiliary classes are created for each of them, solely considering the throughput of the original class at an inner fork is not enough to compute the arrival rate of the auxiliary class associated with this structure. The auxiliary class created for an inner fork-join structure also simulates the sibling tasks that execute concurrently with the auxiliary classes created for the outer fork-join structures. In other words, the influence of both the original class and the outer auxiliary classes should be considered. Hence, before computing the arrival rate, the throughput $T_{f,r}$ is updated using the throughputs of class $r$ and its auxiliary classes at that delay station:
$$T_{f,r} = T_{d_f,r} + \sum_{g \in \mathcal{F},\, g \neq f} T_{d_f,r'_g} \qquad (8)$$
where $T_{d_f,s}$ denotes the class-$s$ throughput at the delay station $d_f$, $\mathcal{F}$ denotes the set of fork-join structures in the queueing network, $r'_f$ denotes the auxiliary class of $r$ created for the fork-join structure $f$, and $d_f$ denotes the delay station created to replace the join node of the fork-join structure $f$.
We refer to class $r$ and its auxiliary classes as $r$-related classes. Equation (8) merges the throughputs of all $r$-related job classes at the delay station, except $T_{d_f,r'_f}$, which is the throughput of the auxiliary class created for the fork-join structure associated with the current delay station $d_f$. In other words, this equation gives the total $r$-related throughput that does not leave for the sink from the current delay station. If the fork-join structure $f$ is not nested (i.e., it is the outermost fork-join), the throughput is left unchanged by this update.
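The merging rule for nested structures can be sketched as a sum over the related classes with one exclusion. This is a minimal illustration, assuming the throughputs at the delay station are available in a mapping keyed by class name; the function name, the class names, and the numbers are hypothetical.

```python
def merged_related_throughput(throughputs_at_delay, own_auxiliary):
    """Total throughput of an original class and its auxiliary classes at a
    delay station, excluding the auxiliary class created for the fork-join
    structure that this delay station belongs to.

    throughputs_at_delay: dict mapping class name -> throughput at the
    delay station, covering the original class and all its auxiliary classes.
    own_auxiliary: name of the auxiliary class whose jobs leave for the sink
    from this delay station.
    """
    return sum(
        throughput
        for name, throughput in throughputs_at_delay.items()
        if name != own_auxiliary
    )


# Inner fork-join: the original class and the outer auxiliary class both pass through.
inner = {"r": 1.0, "r_aux_outer": 2.0, "r_aux_inner": 3.0}
inner_total = merged_related_throughput(inner, "r_aux_inner")

# Outermost fork-join: only the original class and its own auxiliary class appear.
outer = {"r": 1.0, "r_aux_outer": 2.0}
outer_total = merged_related_throughput(outer, "r_aux_outer")
print(inner_total, outer_total)
```

In the outermost case the result equals the original class throughput, consistent with the remark that nothing changes when the structure is not nested.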

EVALUATION
We first compare the accuracy of the HT method [13] and the MMT method against simulation results obtained with Java Modelling Tools (JMT) [2]. The implementations of these methods are included in LINE [5], which is an algorithmic framework for queueing networks and layered queueing networks [6]. The closed queueing networks in our evaluation can be categorised into distinct groups depending on the service rates of parallel executions (homogeneous or heterogeneous) and the scheduling used at the queueing stations (FCFS or PS).
There are two job classes in every queueing network. Both classes are closed with a population of 10 jobs each. The queueing networks always include a fork-join system containing two or three queueing stations. The service rates at these stations are randomly generated, with the average service time between 0.3 and 0.8. Table 4 shows a list of the fork-join queues used in the first experiments. We use the following notations to describe their topology: < denotes a fork node, > represents a join node, || defines a parallel branch, and → defines a serial routing.
The evaluation results are shown in Table 5. It can be observed that the proposed method achieves lower errors than the HT method in most cases. Compared to the baseline, the MMT method reduces the prediction error of queue length, response time, utilization, and throughput by 30.9%, 62.8%, 34.6%, and 35.3% on average, respectively. Figure 5a demonstrates that the two methods exhibit similar prediction accuracy on homogeneous networks, whereas Figure 5c illustrates that our method achieves notably higher accuracy on heterogeneous networks. Apart from accuracy, runtime is the other important factor to consider. As shown in Figures 5b and 5d, the average runtime of our method is less than 0.015s, substantially lower than that of the HT method, which is around 0.03s, as the runtime of AMVA scales with the number of service stations and job classes. Compared to our method, where only one auxiliary job class is created for a fork-join system, the HT method creates one auxiliary class for each parallel path of the concurrent system and an additional delay station to model the time spent outside the fork-join system. Hence, the models created by our method are more efficient to compute than those created by the HT method.
We then evaluate the MMT method on nested fork-join queues. Here we only compare the results with those of simulations since the baseline HT method is not designed for nested fork-join queues. The networks used for evaluation and the numerical results are provided in Table 6 and Table 7, respectively. Figure 6 visualizes mean and maximum errors in the FCFS and PS groups. As can be observed, the MMT method provides accurate predictions. This method inherently possesses the capability to handle nested fork-join systems because the auxiliary classes it creates are restricted to their corresponding fork-join systems, and they are independent of the time their original classes spend outside the fork-join systems, which is a distinct advantage of our approach. In contrast, the HT method was originally devised for a network with a single fork-join system.

RELATED WORK
Fork-join queueing networks represent the key to modelling and solving parallel systems. However, they do not follow the product-form restrictions, so most algorithms devised for fork-join networks are approximate [23], [7]. The only exact solution has been devised for two parallel servers [20]. The main difficulty in analysing fork-join queueing networks stems from the synchronization delays incurred by tasks waiting for the other tasks created by the same job to finish [11]. Duda and Czachórski [11] devise an algorithm to analyse fork-join queueing networks by replacing the fork-join constructs with load-dependent queueing stations. Its foundation consists of the flow-equivalent server method [3] and the decomposition principle.
Varki [25] modifies the computation of residence time in the MVA algorithm to adapt it to closed, single-class queueing networks containing fork-join systems. This modification assumes service stations to have exponentially distributed service times and FCFS scheduling strategies. Alomari and Menasce [1] propose a method for analyzing fork-join systems involving servers with heterogeneous service times in open networks. The core of this method involves establishing bounds on the response time of a job in a fork-join system. This is achieved by analyzing the system under two scenarios: one where all stations have the same service rates and the other where their service rates vary. Mak and Lundstrom [19] introduce an iterative algorithm capable of approximating the performance measures of directed acyclic graphs abstracted from parallel systems in polynomial space and time. Franks and Woodside [12] illustrate the capability of the layered modelling framework, in which a platform for defining models with parallelism can be conveniently used.

CONCLUSION
The paper proposes an accurate and computationally efficient approach for analyzing closed queueing networks containing fork-join systems. The core of this method involves establishing auxiliary open job classes to simulate the behavior of the original parallelism. This transformation leads to a mixed queueing network model that can be solved by analytic methods. Compared to the well-established Heidelberger-Trivedi method, which creates auxiliary closed job classes in a one-to-one fashion for each parallel path, our approach produces one auxiliary open job class for each entire fork-join system. The evaluation results show that our method achieves lower error rates. Meanwhile, the proposed method is faster than the baseline method since our transformed network has fewer job classes and therefore requires fewer analytic computations.
In addition, the design of our transformation enables us to deal with nested fork-join systems, which is a notable advantage. Hence, this paper contributes a simple yet effective fork-join transformation which has the potential to be of great value in concurrent system research. An extended version of the work presented in this paper is available in [10].

Figure 2: HT method for fork-join transformation

Figure 3: The proposed method for fork-join transformation. (a) The network gained from the transformation of a fork-join system. (b) Corresponding homogeneous fork-join system of the network in Figure 3a.

Figure 4: An example of nested fork-join transformation by the proposed method

Figure 5: (a)-(b) for homogeneous networks. (c)-(d) for heterogeneous networks. (a),(c): Average prediction errors of the HT and MMT methods. (b),(d): Box plots of average runtimes for both methods. Red circles and lines inside boxes represent mean and median values, respectively.

Figure 6: (a) for networks with FCFS scheduling. (b) for networks with PS scheduling.

Table 1: Notations for considered queueing networks

Table 2: Notations for auxiliary classes and network elements created by the MMT method

Table 4: Distinct fork-join queues used in evaluation

Table 5: Queue length, response time, utilization and throughput errors of HT and MMT methods

Table 6: Nested fork-join queues used in evaluation

Table 7: Queue length, response time, utilization and throughput errors of the MMT method