Managing Edge Offloading for Stochastic Workloads with Deadlines

Increasing demand for computationally intensive jobs on mobile devices is driving interest in computation offloading to edge/cloud servers. This paper presents a comprehensive framework for managing the offloading of stochastic and heterogeneous user-generated jobs while considering job deadlines and congestion on wireless channels and edge/cloud servers. The goal of offloading is to maximize either the net computational work offloaded or power savings. We propose a class of policies called Predictive Abandonment (PA), where users opportunistically cut and offload jobs but abandon offloading if they predict that communication and computation delays will preclude on-time completion. Although these user-driven policies are desirable from an implementation perspective and achieve relatively good performance, they cannot coordinate tradeoffs amongst users with heterogeneous job types. To address this, we propose a complementary approach to coordinate offloading based on Probabilistic Admission Control and Cut Assignment (PACCA). When combined with PA, it delivers significant offloading benefits. We also develop an upper bound on the benefits of offloading, which can serve as a baseline for evaluating the additional gains of more complex offloading policies. We evaluate these policies via simulation for a range of loads and job profiles, demonstrating robust gains over a naive greedy offloading policy and near-optimal performance in some settings. Furthermore, we assess the robustness of PACCA + PA to imperfect knowledge of offered job rates.


INTRODUCTION
Next-generation applications and the MEC fabric. A new generation of applications powered by machine learning, e.g., AR/VR/XR, autonomous navigation, and photo editors, is pushing the computational and energy limits of mobile devices. One way to overcome these limitations while addressing low latency and privacy requirements is for users to (partially) offload computationally intensive jobs to shared Mobile Edge Computing (MEC) resources. By combining mobile devices' sensing, communication, and computation capabilities with computation at nearby edge servers and/or more distant cloud servers, one can envision a computation-communication fabric that can cost-effectively address the most demanding mobile users' compute jobs.
Benefits of compute job offloading. There are several benefits mobile devices can reap from offloading. First, devices with insufficient computation resources may only be able to complete a job through offloading. Second, even if a device can complete a job, it may opt to offload to save energy and/or reduce its computational work to allow for the computation of other jobs. Third, offloading a job might enable a mobile device to leverage powerful MEC/cloud computation resources to speed up job completion. In this paper, we introduce policies that maximize the amount of computational work mobile devices offload or the amount of energy devices can save while completing jobs within their deadlines.
Managing compute job offloading. To realize these benefits, one must orchestrate offloading across various resources and account for the possible costs of doing so. In general, offloading a compute job may include the following steps: (i) (partially) computing the job on the device; (ii) transferring data to an edge/cloud server via a shared wireless link for performing the remaining computational work; (iii) performing further computation on the edge/cloud server; and (iv) transferring the results back to the device. These steps may involve shared computation resources on the mobile device, shared wireless channels, and shared edge/cloud computation resources, all of which may become congested under stochastic loads. Such systems also face significant heterogeneity in devices' computation and/or communication capabilities, as well as heterogeneous compute jobs with different Quality-of-Service (QoS) requirements, e.g., constraints on completion time. To address these complex challenges, we present an offloading framework that combines an offload admission control policy with a lightweight user-driven offload abandonment policy.
DAG job cutting and offloading. In this work, we focus on offloading compute jobs that can be roughly modeled as linear Directed Acyclic Graphs (DAGs), where nodes represent computational sub-tasks, e.g., Deep Neural Network (DNN) layers, and edges represent the data dependencies and potential cut locations between sub-tasks, e.g., see Figure 1. We use the findings of [19] to select only a single cut location for time-sensitive jobs. The authors show that, under a block fading/Markovian stochastic channel, the optimal policy for offloading computational work from a local device to an edge/cloud server offloads at most once. This result assumes a congestion-free system, a device that expends more power on processing a job than on sending/receiving data, an edge server that processes faster than the device, and the flexibility to execute sub-tasks on either the device or the edge server. In the shared MEC fabric, a job's optimal cut location depends on its computation-communication requirements per cut location, its completion deadline, current wireless network conditions, and the computational resources of the device it is generated on relative to available networked edge compute servers. It also depends on the operator's preferences [9] (e.g., rewards and/or fairness). Thus, some form of coordination of offloading decisions is necessary.
Applicability of DAG model. Several works [5-7, 11, 15] in the literature embrace the linear DAG model as an effective abstraction/approximation of jobs that might particularly benefit from offloading. The underlying driver is the layered structure of DNNs, currently used in applications ranging from image classification and facial, digit, and speech recognition to many others. We believe that a substantial volume of future workloads will have structures like linear DAGs, with flexibility to cut and offload jobs. For general DAGs, the authors of [18] extend the offloading policy of [19] for linear DAGs to general DAGs by exploiting the notion of a DAG's critical path.

Related work
Offloading problem. Offloading of compute jobs to MEC has been widely studied; the literature can be divided into two main categories: binary offloading and partial offloading. In binary offloading, a job is either executed on the device or offloaded to one or more edge servers for execution, with the intent of optimizing performance metrics such as average computation delay or energy consumption. In most cases, the binary offloading problem is NP-hard, and various heuristic, approximation, and stochastic approaches have been proposed, see e.g., [1,3,4,10,12,16,17,20]. Researchers in [1] explore the behavior of users when making decisions about offloading compute jobs in a multi-MEC-server environment. They propose a Prospect Theory-based solution that considers users' risk-seeking or loss-averse behavior. However, [1] has limitations: it focuses on jobs consisting of independent sub-tasks and lacks a strict deadline constraint. Similarly, the authors of [17] propose a game-theoretic solution.
In [4], the authors address fairness and maximum delay tolerance in hybrid fog/cloud systems by jointly optimizing computation migration and resource allocation (including computing and bandwidth). They propose a suboptimal algorithm to solve the formulated mixed integer non-linear programming problem. Another paper, [20], focuses on joint computation offloading decisions, resource allocation, and content caching strategy. The authors transform the problem into a convex form and solve it in a distributed and efficient manner using optimization theory tools. Given a set of jobs and multiple edge servers, [3] proposes an approximate solution considering dynamic voltage frequency scaling for mobile devices. Their heuristic algorithm optimizes job offloading and frequency scaling decisions. However, all of the aforementioned works focus on static regimes where all jobs are assumed to be present at the beginning, and they ignore congestion on wireless channels and edge servers.
Partial offloading. In partial offloading, a job represented as a DAG, see Figure 1, can be offloaded at several cut locations. Several research studies, such as [2,8,13,18,19], have been conducted on partial offloading in the context of edge computing. In [19], the focus is on minimizing energy consumption while meeting latency constraints in a collaborative mobile device and edge server environment with stochastic channels. They propose a polynomial-time algorithm for efficient job execution. Building on this work, [18] extends the approach to encompass general DAG frameworks beyond linear ones. In [2], the authors employ Reinforcement Learning (RL) to explore offloading multiple users' jobs to multiple servers. The users offload heterogeneous jobs over time-varying wireless channels. However, the RL-based policy requires re-learning for each new environment (number of users, job profiles, channel capacity, etc.). To overcome this limitation, [13] introduces Meta Reinforcement Learning, enabling the RL agent to quickly adapt to new environments without re-learning. [8] investigates an online offloading framework similar to ours, where heterogeneous job types with deadline constraints arrive in the network according to a stochastic process and are executed dynamically over time. The authors propose a heuristic approach that relaxes the deadline constraint; the objective is to minimize the average makespan.

Contributions and organization
The main contributions of this paper are summarized as follows:
• To the best of our knowledge, this is the first work to tackle the design of offloading policies for stochastic and heterogeneous job requests (distinct deadlines, cut locations, and computation-communication requirements per cut location) under strict deadline constraints.
• We introduce and evaluate two "revenue" models to capture the possible benefits of offloading. The first model, termed net timely offloaded workload, aims to maximize the amount of computational work offloaded while accounting for offloading overheads. The second model, referred to as power savings with wastage, aims to maximize power savings while considering the power wastage associated with unsuccessful offloads. As a comparative baseline, we derive an upper bound on the revenue under any offloading and scheduling policy.
• We propose several classes of offloading policies. They differ in terms of (1) their requirements for knowledge of the system state, including adapting to the long-term offered offloading load and job types; (2) whether they leverage a measurement-based abandonment policy, which reacts dynamically to congestion resulting from an excessive number of active users or poor wireless channel capacity relative to the job deadlines; and (3) their ability to adapt decisions regarding the choice of cut location and the fraction of jobs to admit based on changes in long-term loads. We evaluate and compare these policies using representative job profiles from [6]. We do this for a range of offered job rates resulting in varying levels of congestion on the wireless network and edge server, and we study how close their performance is to the performance upper bound. Since some policies, such as PACCA + PA, our best policy, require prior knowledge of loads and thus possibly re-optimization when loads change, we also evaluate it under imperfect estimates of offered loads, showing the approach is robust to such errors.
The paper is organized as follows: In Section 2, we introduce our basic system model. In Section 3, we develop an upper bound on the achievable revenue rate and explore different offloading management policies. We end the section with some simulation results. We discuss offloading management policies for heterogeneous compute job types in Section 4. Finally, Section 5 concludes the paper.

SYSTEM MODEL
We begin by introducing our system model for a set of users generating homogeneous jobs (identical deadlines, cut locations, and computation-communication requirements per cut location), with limited local computation resources and a limited amount of shared wireless network and edge/cloud computation resources. We later consider heterogeneous jobs.

Model for load
We let U denote a set of users sharing a wireless access point; the set has cardinality N = |U|. Each user u generates homogeneous jobs according to a stationary process. Users can have different channel qualities/classes. We let C denote the set of possible channel qualities with associated capacities. We let c_u denote the channel quality of user u and use λ_{u,c_u} to denote the arrival rate of jobs from user u. The total arrival rate of jobs from users with channel quality c is given by λ_c = Σ_{u ∈ U : c_u = c} λ_{u,c_u}. We denote the vector of total arrival rates for each channel quality as Λ = (λ_c : c ∈ C). Finally, we define the total job arrival rate to the system as λ = Σ_{c ∈ C} λ_c.

Job model
The execution of a job may involve computation on a user's device, offloading of data to the edge/cloud server, processing on the edge/cloud server, and then transmitting the result back to the device. Initially, we focus on a single job type with a fixed time budget, τ, for job execution that does not include the time required to transmit the result back to the user device. We model the job as a DAG, where the possible cut locations are denoted by a set S = {1, 2, ..., n}, with n representing the last cut location. For a cut location k ∈ S, we let β_k denote the cumulative device processing measured in floating point operations (FLOPs), including overhead related to the cutting itself; d_k denotes the offload data volume (in bits); and γ_k models the cumulative edge server processing (in FLOPs). Here, k = 1 corresponds to processing everything on the edge server, and k = n corresponds to processing everything on the user's device.

Model for user's device, wireless channels, and edge server resources
A user's device has an effective processing speed denoted by δ, measured in floating point operations per second (FLOPs/sec). A user with channel quality c has an uplink capacity to the base station of r_c Mbps. However, the transmission rate throughout the offloading process, as explained later, may be reduced by congestion, e.g., by competing job offloads from other users. We consider multiple processors with multiple cores at the edge server. The total processing capacity, ω (FLOPs/sec), is modeled as the sum of all cores' processing rates across all processors, assuming jobs can be parallelized across all processors and cores. Thus, all active jobs get an "equal" share of edge server time. Note that modern computing systems allow the parallelization of jobs across only a limited number of cores of a given processor, so this is a simplification.
We let S_c ⊆ S denote the set of cut locations for a user with channel quality c that guarantee the job will meet its delay deadline τ when one optimistically assumes there is no competition for communication or computation resources in the system. Thus S_c = {k ∈ S : β_k/δ + d_k/r_c + γ_k/ω ≤ τ}, where the left-hand side of the inequality is the best possible end-to-end time to complete the offload when a job is cut at location k.
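As an illustration, S_c can be computed by checking this best-case latency at every cut location. The sketch below is in Python rather than the MATLAB used for our simulations, and the job profile numbers are made up for illustration (they are not the actual AlexNet characteristics):

```python
def feasible_cuts(beta, d, gamma, delta, r_c, omega, tau):
    """Return S_c: cut locations k whose best-case end-to-end time
    beta_k/delta + d_k/r_c + gamma_k/omega fits within the budget tau.
    Position k-1 of each list holds the values for cut location k."""
    S_c = set()
    for k, (b, bits, g) in enumerate(zip(beta, d, gamma), start=1):
        if b / delta + bits / r_c + g / omega <= tau:
            S_c.add(k)
    return S_c

# Illustrative profile with n = 4 cut locations: cutting later means
# more device FLOPs (beta) but fewer bits (d) and edge FLOPs (gamma).
beta  = [0.0, 2e8, 5e8, 9e8]   # cumulative device FLOPs per cut
d     = [8e6, 4e6, 1e6, 0.0]   # offload data volume in bits per cut
gamma = [9e8, 7e8, 4e8, 0.0]   # cumulative edge FLOPs per cut
S_c = feasible_cuts(beta, d, gamma, delta=1e9, r_c=2e7, omega=1e10, tau=0.5)
```

With these numbers, only the two earliest cuts fit within the budget; the later cuts are dominated by the slow device processor.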

Sharing base station uplink resources
At any time t, multiple users may be offloading data. U(t) represents the set of active users, and N(t) = |U(t)| is its cardinality. We shall assume that all users in U(t) share the BS's uplink resources in a Proportional Fair manner, with each ongoing offload over channel quality c served at rate r_c / N(t).

Model for computation on the device and data oloading
When a job with local computational requirements is generated on a user's device, it undergoes computation based on a non-preemptive priority scheme that prioritizes jobs by their generation time.As a result, a job may be queued before its execution.After it leaves the queue, it is processed if there is enough time to execute it; otherwise, it is dropped.
Once the local part of the execution is complete, a job with data to offload is served on a first-come, first-served basis, so there may be additional waiting before offloading to the edge server begins.

Stationary oloading policies
We consider a set Π of stationary offloading policies. A policy may consist of any combination of job admission control/cutting/offloading methods, wireless channel scheduling, and edge server resource sharing. For a given offered load Λ and policy π ∈ Π, we define φ(Λ, π) = (q_{c,k}(Λ, π) : c ∈ C, k ∈ S), where q_{c,k}(Λ, π) denotes the long-term fraction of jobs that belong to users with channel quality c, are cut at location k, and complete. Naturally, these fractions must sum to at most one, i.e., Σ_{k ∈ S} q_{c,k}(Λ, π) ≤ 1 for all c ∈ C (since not all jobs need complete on time). We define F(Λ) = {φ(Λ, π) | π ∈ Π} as the set of fractions that are feasible under some policy. This set is convex since, if π_1, π_2 ∈ Π, then by alternating between the policies over long periods one can achieve any convex combination of their associated performance.

Reward model and revenue metric
We introduce two reward models to guide the design and evaluation of offloading policies. We use α_k to represent the reward associated with the timely completion of a job cut at location k.
Net timely offloaded workload. The first reward model captures the total amount of work offloaded to the edge server for jobs that complete on time. It indirectly captures the freeing up of users' computation resources. The reward for offloading a job at cut location k is modeled as α_k = γ_k − g·d_k. Here γ_k denotes the computation work offloaded to the edge and g·d_k the overhead of doing so, where d_k represents the volume of data offloaded, and g is a factor that "converts" bits to FLOPs. The net timely offloaded workload, measured in FLOPs/sec, for a given offered load Λ under policy π is defined as R_ow(Λ, π) − L_ow(Λ, π), where R_ow(Λ, π) denotes the rate at which net work is offloaded, and L_ow(Λ, π) is the rate of computational work on users' devices associated with jobs that do not complete on time and with jobs that a user attempts to offload but ends up completing locally.
Power savings with wastage. The second reward model quantifies the energy savings on a user device resulting from offloading. The energy expended by a device when offloading a job at cut k is modeled as a·β_k + b·d_k joules, where a and b represent the energy expended per FLOP for local computation and per bit for data offloading, respectively. The energy savings from offloading at cut k vs. not offloading at all, i.e., cutting at n (the last cut location), is given by α_k = a·(β_n − β_k) − b·d_k joules. This captures the energy saved from decreased local computation while accounting for the energy overhead of data offloading. We define the net power savings with wastage, measured in Watts, for a given load Λ under policy π as R_ps(Λ, π) − L_ps(Λ, π), where R_ps(·) is defined in the same way as R_ow(·) but with the energy savings reward α_k defined above for each timely job completion, and L_ps(Λ, π) represents the power expended at devices associated with jobs that miss their deadlines and with jobs that a user attempts to offload but ends up completing locally.
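To make the two per-cut rewards concrete, here is a minimal Python sketch; the constants a, b, and g and the per-cut values are illustrative placeholders, not the parameters used in our evaluation:

```python
def reward_offload(gamma_k, d_k, g):
    """Net timely offloaded workload reward: alpha_k = gamma_k - g*d_k,
    i.e., edge FLOPs gained minus a bits-to-FLOPs conversion overhead
    for the d_k bits that must be moved."""
    return gamma_k - g * d_k

def reward_power(beta_k, d_k, beta_n, a, b):
    """Power-savings reward: alpha_k = a*(beta_n - beta_k) - b*d_k,
    i.e., local compute energy saved relative to a fully local run
    (cut n), minus the transmit energy for the offloaded bits."""
    return a * (beta_n - beta_k) - b * d_k

# Cutting earlier offloads more work but also ships more data.
w = reward_offload(gamma_k=7e8, d_k=4e6, g=10)                    # FLOPs
p = reward_power(beta_k=2e8, d_k=4e6, beta_n=9e8, a=1e-9, b=1e-8) # joules
```

Note that either reward can go negative at a poorly chosen cut, e.g., when g·d_k exceeds γ_k, which is why the feasible-cut and admission decisions matter.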

HOMOGENEOUS JOBS AND OFFLOADING POLICIES
In this section, we propose and evaluate several offloading policies for users with heterogeneous channel qualities but a homogeneous job type. In the next section, we extend the analysis to the case with heterogeneous job types.

Upper bound
We begin by developing a simple upper bound on the net timely offloaded workload or power savings with wastage achievable by any stationary offloading policy. Let q = (q_{c,k} : c ∈ C, k ∈ S), where q_{c,k} denotes the fraction of jobs that belong to users with channel quality c, are cut at location k, and complete. We define the set of all possible vectors q as Q = {q ≥ 0 : Σ_{k ∈ S_c} q_{c,k} ≤ 1 for all c ∈ C, and q_{c,k} = 0 for k ∉ S_c}, where in some settings a fraction of jobs may not complete, hence the fractions need not sum to 1, and fractions for infeasible cut locations are zero. Given q, we define the channel and edge server utilization as ρ_ch(q) = Σ_{c ∈ C} Σ_{k ∈ S} λ_c q_{c,k} d_k / r_c and ρ_ed(q) = Σ_{c ∈ C} Σ_{k ∈ S} λ_c q_{c,k} γ_k / ω, respectively. Recall that we defined F(Λ) to be the set of feasible long-term fractions of successful job completions under a set of stationary offloading policies when the system load is Λ. We define the set of all possible successful long-term fractions as F̄(Λ) = {q ∈ Q : ρ_ch(q) ≤ 1, ρ_ed(q) ≤ 1}, a natural outer bound for F(Λ), which leads to the following simple performance bounds.
Theorem 1. Given an offered load Λ, we have that F(Λ) ⊆ F̄(Λ). Then the maximum net timely offloaded workload achievable by any stationary offloading policy satisfies, for all π ∈ Π, R_ow(Λ, π) − L_ow(Λ, π) ≤ max_{q ∈ F̄(Λ)} R_ow(q, Λ). (11) Similarly, the maximum power savings with wastage achievable by any stationary offloading policy satisfies R_ps(Λ, π) − L_ps(Λ, π) ≤ max_{q ∈ F̄(Λ)} R_ps(q, Λ). (12)
Proof. We first argue that F(Λ) ⊆ F̄(Λ). Indeed, suppose q ∈ F(Λ). Recall that q_{c,k} represents the fraction of incoming jobs that belong to users with channel quality c, are cut at location k, and complete. Since each job is cut at only one location, these fractions always sum to at most 1 over all cut locations, so q is clearly in Q. Now suppose the load on the channel or edge server were greater than 1 given these fractions of jobs, i.e., ρ_ch(q) > 1 or ρ_ed(q) > 1. Then not all jobs would complete on time, contradicting our earlier statement. Thus the channel and edge server loads must satisfy ρ_ch(q) ≤ 1 and ρ_ed(q) ≤ 1 if q ∈ F(Λ), which implies q ∈ F̄(Λ). Thus F(Λ) ⊆ F̄(Λ). This then implies max_{q ∈ F(Λ)} R_ow(q, Λ) ≤ max_{q ∈ F̄(Λ)} R_ow(q, Λ), which yields Equation (11), since L_ow(Λ, π) ≥ 0. Similarly, under the energy savings reward model and recognizing that L_ps(Λ, π) ≥ 0, we obtain Equation (12). □
Remark 1. We let q*_ow(Λ) = argmax_{q ∈ F̄(Λ)} R_ow(q, Λ) and q*_ps(Λ) = argmax_{q ∈ F̄(Λ)} R_ps(q, Λ) denote the vectors that maximize the bounds for the two reward models. These indicate the fraction of load to admit across sets of channel qualities and cut locations to maximize revenue in the absence of resource contention during job offloading.
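The bound in Theorem 1 is a linear program: maximize Σ_c Σ_k λ_c q_{c,k} α_k over q ∈ Q subject to ρ_ch(q) ≤ 1 and ρ_ed(q) ≤ 1. A hypothetical single-channel, two-cut instance, with made-up numbers and solved by a coarse grid search for self-containment (a real instance would use an LP solver), might look like:

```python
def upper_bound(lam, alpha, d, gamma, r, omega, steps=200):
    """Grid-search the Theorem 1 LP for one channel quality and two
    feasible cut locations: maximize lam*(q1*alpha1 + q2*alpha2)
    subject to q1 + q2 <= 1 and channel/edge utilizations at most 1."""
    best, best_q = 0.0, (0.0, 0.0)
    for i in range(steps + 1):
        for j in range(steps + 1 - i):          # enforces q1 + q2 <= 1
            q1, q2 = i / steps, j / steps
            rho_ch = lam * (q1 * d[0] + q2 * d[1]) / r
            rho_ed = lam * (q1 * gamma[0] + q2 * gamma[1]) / omega
            if rho_ch <= 1 and rho_ed <= 1:
                rev = lam * (q1 * alpha[0] + q2 * alpha[1])
                if rev > best:
                    best, best_q = rev, (q1, q2)
    return best, best_q

# Cut 1 offloads more work but also more data than cut 2.
best, q_star = upper_bound(lam=10, alpha=(8.2e8, 3.9e8),
                           d=(8e6, 1e6), gamma=(9e8, 4e8),
                           r=2e7, omega=1e10)
```

In this instance the channel constraint binds, and the optimizer mixes the two cuts (q* ≈ (1/7, 6/7)) rather than sending every job to the highest-reward cut: exactly the kind of load-aware tradeoff the bound encodes.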

Oloading policies
Naive Greedy (NG). The NG offloading policy optimistically assigns the cut location, k_u, that yields the highest reward among all feasible cut locations given deadline τ, to all jobs generated by user u ∈ U with channel quality c_u, as follows: k_u = argmax_{k ∈ S_{c_u}} α_k, (13) where α_k reflects either net offloaded workload or energy savings, see Section 2.7. The policy then greedily tries to offload data and process on the edge server until success or until the time budget, τ, expires.
Predictive Abandonment (PA). PA is a real-time user-based offloading policy that adapts to channel and edge/cloud server congestion. Similar to the NG policy, a user u implementing PA selects the cut location k_u with the highest reward, see Equation (13). However, unlike NG, during the offloading process a user implementing PA estimates the feasibility of meeting a job's completion deadline given current channel and/or edge server congestion. If the deadline is unlikely to be met, PA saves resources by abandoning the job's offload, thus increasing the likelihood of completing other jobs on time. Furthermore, a job whose offload is abandoned can still attempt to complete its residual processing on the user device if time permits. Thus, PA can be viewed as performing a type of state-dependent self-admission control or abandonment policy.
Under PA, a user u determines whether its i-th job is likely to complete on time by considering/predicting the following factors: queuing time, local processing time, data offloading time, and edge processing time. The queuing time for a job i at time t, denoted W_{u,i}(t), corresponds to the time the job has been waiting in the user's queue. The local processing time is calculated as the number of operations before the cut location k_u divided by the device's execution speed, i.e., β_{k_u}/δ secs. Since a user's compute speed is fixed, this value is the same for all t. Also, recall that all jobs from a user u are cut at the same location. The data offloading time at time t is calculated as the sum of the time already spent offloading the job (if any) and the time required to offload any remaining data. The latter can be estimated by dividing the data yet to be offloaded at time t, s_{u,i}(t), by an estimate of the future average transmission rate, h_{u,i}(t). Suppose user u initiates data offloading of job i at time t_{u,i} over channel quality c_u. At any instant t′ ≥ t_{u,i}, the transmission rate under our model is given by r_{c_u}/N(t′), where recall r_{c_u} is the uplink channel capacity and N(t′) is the number of active users. We estimate the future average transmission rate based on the average throughput experienced by user u since it began offloading job i, i.e., h_{u,i}(t) = (d_{k_u} − s_{u,i}(t))/e_{u,i}(t), where e_{u,i}(t) is the time elapsed since offloading began. The last factor in the equation for job latency is the edge processing time. We assume that the edge server provides users with, or users themselves maintain, estimates of the job's processing time per cut location, m_{k_u}. These estimates are periodically updated.
Putting together the elapsed time and estimated future transmission/processing latencies, the estimated total latency for user u's job i is given by T_{u,i}(t) = W_{u,i}(t) + β_{k_u}/δ + e_{u,i}(t) + s_{u,i}(t)/h_{u,i}(t) + m_{k_u}. Under PA, a user may abandon offloading if it determines that its estimated total latency for job i is greater than its time budget, τ. Once abandoned, any remaining computation, γ_{k_u}, for job i can be completed on the user's device if there is enough time.
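The abandonment test can be sketched as follows (Python; the helper and its argument names are illustrative, not our simulator's implementation, and before any data has been sent one could seed the throughput estimate with the instantaneous rate r_{c_u}/N(t) instead of the placeholder used here):

```python
def estimated_latency(wait, beta_k, delta, elapsed, sent, d_k, m_k):
    """Estimated total latency T_{u,i}(t): queuing time + local
    processing time beta_k/delta + offload time elapsed so far +
    predicted remaining offload time + edge processing time m_k.
    `sent` is the data already offloaded, so the data still to send is
    s = d_k - sent, and the throughput estimate is h = sent/elapsed."""
    remaining = d_k - sent
    if remaining <= 0:
        tx_left = 0.0
    elif elapsed > 0 and sent > 0:
        tx_left = remaining / (sent / elapsed)  # remaining / h
    else:
        tx_left = float("inf")  # placeholder: no throughput observed yet
    return wait + beta_k / delta + elapsed + tx_left + m_k

def pa_abandon(wait, beta_k, delta, elapsed, sent, d_k, m_k, tau):
    """Abandon the offload if the estimated latency exceeds the budget."""
    return estimated_latency(wait, beta_k, delta, elapsed, sent, d_k, m_k) > tau
```

For example, a job that queued for 0.02 s, needs 0.2 s of local processing, has spent 0.1 s sending half of 4e6 bits, and expects 0.07 s of edge processing has an estimated latency of 0.49 s: it continues under a 0.5 s budget but is abandoned under a 0.4 s one.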
This prediction method is crude but roughly captures the impact of the user's channel capacity r_{c_u}, the channel's uplink congestion N(t), and congestion at the edge server via the estimates of processing latency, m_{k_u}. Note that these estimates will reflect changes resulting from the PA policy itself, since PA impacts the load on the channel and edge server. For more accurate predictions, one can consider additional factors such as the number of active users during each user's offload and/or the remaining service requirements of currently offloading jobs. See [14] for such a discussion in a processor-sharing scenario without abandonment.
PA effectively performs a sort of "real-time" admission control by abandoning jobs unlikely to meet their deadlines due to channel and/or edge server congestion. In congested scenarios, PA "prefers" users with better channels, since they are more likely to complete their offloads on time. Finally, note that PA, like NG, is decentralized and does not require knowledge of the overall offered job rate, Λ. Next, we explore a policy that considers the overall rate of offered jobs for coordination.
Probabilistic Admission Control and Cut Assignment (PACCA). Under PACCA, we pre-determine p_{c,k}, the fraction of jobs belonging to users with channel quality c that should be offloaded at each cut location k to maximize revenue given the rate of incoming jobs. We let p = (p_{c,k} : c ∈ C, k ∈ S), and require Σ_{k ∈ S} p_{c,k} ≤ 1 for all c ∈ C. We denote the associated policy as PACCA(p). If a job is not admitted for offloading under this policy, it will attempt to execute locally. Determining p is complex, as it involves multiple factors, such as contending offloads from users with different channel qualities, channel and server congestion, delay constraints, and revenue rate optimization. We propose to choose p based on the upper bounds from Theorem 1, either q*_ow(Λ) to maximize net timely offloaded workload or q*_ps(Λ) to maximize power savings with wastage. In later sections, for brevity, we use q* to refer to either q*_ow(Λ) or q*_ps(Λ). Given that there is no resource contention underlying Theorem 1 (see Remark 1), these probabilities reflect optimistic admission control and cut assignment for a given Λ. Nevertheless, they still reflect reasonable overall system tradeoffs.
Once a job is admitted for offloading, the user can either attempt to offload the job greedily at the assigned cut location or perform PA, only proceeding with the offload if it determines the job's deadline can be met. We refer to the former policy as PACCA(q*) + Greedy and the latter as PACCA(q*) + PA.
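Per arriving job, PACCA's admission step amounts to sampling a cut location from the admission probabilities. A minimal sketch (Python; the helper is hypothetical, taking q* restricted to one channel quality as a map from cut location to probability):

```python
import random

def pacca_admit(q_star_c, rng=random.random):
    """Sample the cut assignment for one arriving job of a user with
    channel quality c.  q_star_c maps cut location k -> q*_{c,k}; the
    probabilities sum to at most 1, and any leftover mass means the job
    is not admitted for offloading and executes locally (None)."""
    x, acc = rng(), 0.0
    for k in sorted(q_star_c):
        acc += q_star_c[k]
        if x < acc:
            return k
    return None
```

For example, with q*_{c,1} = 0.2 and q*_{c,3} = 0.5, draws below 0.2 are assigned cut 1, draws in [0.2, 0.7) cut 3, and the remaining 30% of jobs run locally; an admitted job is then handed to either the greedy offloader or PA.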

Simulation results
In this subsection, we evaluate the performance of our proposed offloading policies via discrete-time simulations in terms of: (i) net timely offloaded workload; and (ii) power savings with wastage (explained in Section 2.7). We also plot the fraction of jobs completed. The simulations are conducted in MATLAB R2023a.
Settings. We consider a system with N = 20 users, where each user generates an equal rate of homogeneous jobs per second, according to a Poisson process with intensity λ/20. This results in an aggregate job generation/arrival rate of λ per second. The system includes two channel qualities; half of the users (i.e., 10) have one channel quality, and half the other. A user's channel quality/capacity does not change, but the transmission rate at any instant depends on both the channel capacity and the number of competing users because of the Proportional Fair sharing of uplink resources. Table 1 summarizes the simulation parameters. We present results for the homogeneous scenario based on the job characteristics of AlexNet, a state-of-the-art Convolutional Neural Network for image classification. In the next section, we will also use DeepFace, which is used for face recognition. Figure 2 displays the job characteristics of AlexNet on the left and DeepFace on the right. The four bar graphs (from top to bottom) show the volume of data that gets offloaded, the amount of local vs. edge processing, the energy saved, and the amount of work that gets offloaded per cut location. The worst delay a job may experience is calculated as τ_max = max_{k ∈ S} (β_k/δ + d_k/r_min + γ_k/ω), where r_min is the worst channel capacity.
However, this may not be the absolute worst case, as it only considers the worst channel quality and not congestion. We evaluated our policies under both a strict, τ = 0.4·τ_max, and a relaxed, τ = 0.8·τ_max, delay deadline. Results are averaged over 20 Monte Carlo simulations, each performed over 5e5 time slots, where a time slot is 100 µs long.
Results discussion. Our first set of results, presented in Figures 3a and 3b, shows the net timely offloaded workload and the fraction of completed jobs for our policies per total offered job rate, respectively, under a strict job deadline. In Figure 3a, we show the net computational workload offloaded, which depends on the fraction of completed jobs and the reward per completed job. Therefore, a policy can perform equally well in two cases: (i) completing numerous jobs with a low reward or (ii) completing fewer jobs with a high reward. Thus, we also present the fraction of completed jobs in Figure 3b for the same simulation setting for all policies.
We observe that as the offered job rate λ increases, the NG policy experiences throughput collapse, see Figure 3a. By contrast, policies like PA, which implements congestion-dependent offload abandonment, and PACCA, which determines admission control and cut assignments based on prior knowledge of incoming jobs per user and channel quality, perform well under heavy job loads. However, PA does not perform as well as PACCA, highlighting the benefit of reserving channel and edge resources for jobs with good channels and/or high-reward cut locations. The benefit of combining these policies, PACCA + PA, becomes more evident in high-load regimes, where admission control and congestion management are crucial.
In Figure 3b, we show the fraction of completed jobs under different offloading policies as the job arrival rate λ increases. With PA, more than 90% of jobs are completed in the load regimes considered. Indeed, for all the results reported hereafter, we only consider load regimes where PA has a high completion rate (at least 90%) and where the channel is the bottleneck. Interestingly, the fraction of jobs that complete under PACCA + Greedy is non-monotonic with increasing load. This is because initially (from 0 to 23 jobs/sec), PACCA selects the highest-reward cut location for every job. However, since channel capacity is fixed, an increase in the number of jobs means a decrease in completions; hence the downward curve. Then, at 23 jobs/sec, PACCA determines it will earn more revenue by adjusting the distribution of cut locations so that jobs have less data to offload. This results in less reward per job, but more completed jobs. PACCA makes this adjustment every time the rate of incoming jobs increases beyond 23 jobs/sec, thus the upward curve. We observe similar non-monotonic behavior for PACCA + PA, though it is barely perceptible in the figure. In Figure 3a, we saw the advantage of combining PACCA with PA under a strict delay deadline. We observe similar benefits under a relaxed delay deadline, see Figure 4; however, the performance gap between PACCA + PA and PACCA + Greedy decreases. This is because, with a relaxed delay deadline, a user has a higher chance of completing an offload, even with network congestion. Thus, network congestion is detrimental mainly under strict delay deadlines, necessitating a congestion-aware policy like PA. We see similar trends for power savings with wastage in Figures 5a and 5b. Additionally, as we relax the completion deadline, our best-performing policy (PACCA + PA) approaches the upper bound for both performance metrics.
Robustness study. We demonstrate the robustness of PACCA + PA to imperfect knowledge of offered jobs per second in Figure 6, for net timely offloaded workload under a strict delay deadline (due to space constraints, we exclude similar results for the power savings with wastage metric). In these simulations, we added errors to the estimates of job arrival rates per user, which affects the aggregate arrival rate per channel quality. We optimize PACCA + PA for the corresponding load and compare three scenarios: (i) PACCA + PA (exact), where we provide PACCA with the exact load, i.e., λ; (ii) PACCA + PA (overestimate), where we provide PACCA with an overestimate of the load, i.e., λ · (1 + x%), leading to less offloading compared to (i); and (iii) PACCA + PA (underestimate), where we provide PACCA with an underestimate of the load, i.e., λ · (1 − x%), resulting in more offloading compared to (i). We show PA as a baseline. The results show that for an estimation error of 25% (i.e., x = 25; we also performed simulations for x = 50% but exclude them due to space limitations), both PACCA + PA (overestimate) and PACCA + PA (underestimate) are within 10% of PACCA + PA (exact). Additionally, we observe that PACCA + PA (underestimate) performs at least as well as PA, since large underestimation errors result in admitting every job at the highest-reward cut location, effectively imitating PA.
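The load perturbation used in the robustness study amounts to scaling every per-user rate estimate by a common factor, which scales the aggregate load PACCA is optimized for by the same factor. A minimal sketch (function and variable names are ours):

```python
def perturbed_load(per_user_rates, x_percent, overestimate=True):
    """Scale per-user arrival-rate estimates by (1 + x%) for the
    overestimate scenario or (1 - x%) for the underestimate scenario,
    mimicking imperfect knowledge of the offered job rates."""
    factor = (1 + x_percent / 100) if overestimate else (1 - x_percent / 100)
    return [rate * factor for rate in per_user_rates]
```

For x = 25, per-user rates of 2.0 and 4.0 jobs/sec become 2.5 and 5.0 under overestimation and 1.5 and 3.0 under underestimation, and the aggregate load shifts accordingly.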

HETEROGENEOUS JOBS
In this section, we investigate the performance of our policies in networks with heterogeneous jobs, where jobs no longer have identical deadlines, cut locations, and requirements per cut location (such as computation and data to offload).
We have a set of users generating different types of jobs, denoted by J = {1, 2, ..., J}. For each job type j in J, we let U_j be the set of users generating such jobs, and λ_{j,u} denotes the arrival rate of type j jobs generated by user u in U_j. The total arrival rate of job type j is denoted by λ_j = Σ_{u ∈ U_j} λ_{j,u}. We define the arrival rate of job type j over channel quality c in C from all users as λ_{j,c}, and the vector of arrival rates of the job types is captured by Λ = (λ_1, ..., λ_J). We use τ_j to represent the delay constraint, and S_j to capture the set of possible cut locations for type j jobs. As before, ρ_ch(Q) and ρ_ed(Q) denote the total channel and the total edge server utilization, respectively.
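The aggregation of per-user rates into the per-type rates λ_j and per-type, per-channel rates λ_{j,c} defined above is a direct summation. A sketch, where the dictionary encodings of users, job types, and channel qualities are our own illustrative choices:

```python
from collections import defaultdict

def aggregate_rates(user_rates, user_channel):
    """Compute lambda_j (total rate of job type j) and lambda_{j,c}
    (rate of type j offered over channel quality c).

    user_rates:   dict mapping (job_type j, user u) -> lambda_{j,u}
    user_channel: dict mapping user u -> channel quality c
    """
    lam_j = defaultdict(float)
    lam_jc = defaultdict(float)
    for (j, u), rate in user_rates.items():
        lam_j[j] += rate                           # sum over u in U_j
        lam_jc[(j, user_channel[u])] += rate       # sum over u on channel c
    return dict(lam_j), dict(lam_jc)
```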
Recall that we defined T(Λ) to be the set of feasible long-term fractions of successful job completions achievable by stationary offloading policies when the system load is Λ. Here we define T̄(Λ) as a natural outer bound. Theorem 2. Given an offered load Λ, we have that T(Λ) ⊆ T̄(Λ). Then the maximum weighted net timely offloaded workload revenue achievable by any stationary offloading policy is bounded by the corresponding maximization over T̄(Λ); similarly, the maximum weighted power savings with wastage revenue achievable by any stationary offloading policy is bounded by the analogous maximization over T̄(Λ). Proof. Similar to the proof of Theorem 1. We let the associated maximizers for the two reward models be denoted accordingly; once again, for brevity, we will use Q*.

Simulation results
In this subsection, we compare the performance of various offloading management policies when dealing with heterogeneous jobs. For PACCA, we determine the admission control probabilities per combination of job type, user channel quality, and cut location based on the fractions that maximize system revenue, i.e., Q*. We then again schedule the offloading of admitted jobs using either the Greedy or PA policies, i.e., PACCA + Greedy or PACCA + PA. Due to space constraints, we omitted our study of the robustness of PACCA + PA for the heterogeneous case.
Setting. The simulation involves 20 users generating two job types, AlexNet and DeepFace. Half of the users (i.e., 10) generate job Type 1, while the other half generate job Type 2. Each user in either category generates jobs at an equal rate according to a Poisson process with rate λ_j/10 jobs per second, where λ_j is the aggregate arrival rate for job type j. We set equal aggregate arrival rates for the two job types. Users of each job type are divided equally into two channel quality groups, with half (i.e., 5) offloading over channel Quality 1 and the other half over channel Quality 2. For more information on the simulation parameters, refer to Table 1 and Figure 2.
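The per-user arrival process in this setting can be generated by sampling i.i.d. exponential inter-arrival gaps. The sketch below mirrors the described setup (10 users per job type, each at rate λ_j/10); the function names, seed, and simulation horizon are our own choices:

```python
import random

def poisson_arrival_times(rate_per_s, horizon_s, rng):
    """Sample arrival times of a Poisson process with the given rate on
    [0, horizon_s), via exponential inter-arrival gaps."""
    times, t = [], rng.expovariate(rate_per_s)
    while t < horizon_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times

rng = random.Random(0)
lambda_j = 20.0                 # aggregate arrival rate of one job type (jobs/s)
per_user_rate = lambda_j / 10   # 10 users per job type, equal rates
arrivals = [poisson_arrival_times(per_user_rate, 100.0, rng)
            for _ in range(10)]
```

Pooling the 10 independent per-user streams yields a Poisson process of aggregate rate λ_j, so over a 100 s horizon the total job count concentrates around 2000.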
Results discussion. Figures 7 and 8 illustrate the weighted net timely offloaded workload and the weighted power savings with wastage revenue achieved by various offloading policies, respectively. As observed in the case of homogeneous jobs, the PACCA + PA policy outperforms the other policies. However, the relative performance of the PA policy has declined compared to the homogeneous case, since it only considers the residual data and channel capacity, ignoring a job's weight relative to others. In contrast, PACCA coordinates job admission control and cut assignment based on how much relative revenue each job type and cut will generate.

CONCLUSION
Managing heterogeneous compute job offloading in the MEC fabric subject to delay constraints presents significant challenges that require careful management of user device, channel, and edge server resources while considering different job characteristics and system loads. To address this, we have detailed a comprehensive framework which relies on job profiling, using probabilistic admission control and cut assignment, coupled with a predictive abandonment policy that abandons offloads unlikely to meet their deadlines (this not only frees up resources for jobs with more promise but also avoids throughput collapse). Our proposed approach, PACCA + PA, is expected to perform robustly and effectively but requires prior knowledge of offered job rates across job types and channel qualities. If these are not known, signal processing techniques such as window averaging can be employed to learn the offered job rates over time.
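The window-averaging estimator mentioned above can be sketched as a sliding-window arrival count; the class name, interface, and window length below are our own illustrative choices, not a prescription from the framework:

```python
from collections import deque

class WindowedRateEstimator:
    """Estimate the offered job rate as
    (# arrivals in the last W seconds) / W."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.arrivals = deque()   # arrival timestamps, oldest first

    def record(self, t):
        self.arrivals.append(t)

    def rate(self, now):
        # Evict arrivals that have fallen out of the window.
        while self.arrivals and self.arrivals[0] <= now - self.window_s:
            self.arrivals.popleft()
        return len(self.arrivals) / self.window_s
```

For instance, with a 10 s window, five arrivals at t = 1, ..., 5 give an estimated rate of 0.5 jobs/s at t = 5; by t = 12.5 the arrivals at t = 1 and t = 2 have aged out and the estimate drops to 0.3 jobs/s.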

Figure 1: Cutting and offloading of a linear DAG.

Figure 2: Simulation parameters associated with AlexNet and DeepFace job types.

Figure 3: Comparing the net timely offloaded work and the fraction of jobs that complete for different policies when the job's delay deadline is strict, i.e., τ = 0.4τ_max.

Figure 4: Comparing the net timely offloaded workload for different policies when the job's delay deadline is relaxed, i.e., τ = 0.8τ_max.

Figure 5: Comparing the power savings with wastage for different policies when the job's delay deadline is strict vs. relaxed.