Creating Edge AI from Cloud-based LLMs

Cyber-human and cyber-physical systems have tight end-to-end latency bounds, typically on the order of a few tens of milliseconds. In contrast, cloud-based large language models (LLMs) have end-to-end latencies that are two to three orders of magnitude larger. This paper shows how to bridge this large gap by using LLMs as offline compilers for creating task-specific code that avoids LLM accesses. We provide three case studies as proofs of concept, and discuss the challenges in generalizing this technique to broader uses.

Cyber-human (CH) systems such as wearable cognitive assistants and cyber-physical (CP) systems such as autonomous drones have worst-case end-to-end latency constraints on the order of a few tens of milliseconds. In contrast, cloud-based LLMs exhibit end-to-end latencies that are two to three orders of magnitude larger. Given this gap, we ask: how can latency-constrained AI systems use cloud-based LLMs?

Position: LLMs as Compilers
To bridge this gap, we advocate the use of LLMs as powerful compilers for generating domain-specific edge AI. This viewpoint seamlessly connects the cloud-centric character of LLMs and the edge-centric character of CH and CP systems. It avoids placing LLMs in the latency-critical code paths of these systems. Instead, the power of LLMs is brought to bear in the offline process of creating domain-specific and latency-critical code paths. This strategy accommodates the unique performance attributes and resource demands of these two very different classes of systems. At the same time, it also avoids the IP complications of needing to create stripped-down LLM functionality for use outside the cloud. All that is needed is SaaS use of cloud-based LLMs, exactly as they are deployed today.
Of course, our use of the term "compiler" here is metaphorical rather than literal. An LLM is very different in construction and execution from well-known compilers for programming languages such as C, Rust or Java. The similarity is at a higher level: the workflow that we are advocating bears resemblance to compilation-based strategies. There is an initial offline phase that is not latency-critical, and can thus afford the luxury of extensive and deep processing of the static attributes of the problem. The result of this processing is embodied in much simpler code that is narrowly tailored to the specific application context, and can be executed within latency-critical time bounds on dynamic aspects of the problem.
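To make the metaphor concrete, the sketch below shows the two-phase pattern in Python, assuming the OpenAI Python SDK. The function names and the FSM-as-JSON artifact are purely illustrative, not part of any toolchain described in this paper.

```python
# A minimal sketch of the "LLM as offline compiler" pattern, assuming the
# OpenAI Python SDK. compile_task_logic and run_latency_critical_step are
# hypothetical names; the FSM-as-JSON artifact is illustrative.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def compile_task_logic(task_description: str) -> dict:
    """Offline phase: latency-tolerant, may take many seconds per call."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": "Emit JSON for a finite state machine that "
                              "implements this task: " + task_description}],
    )
    return json.loads(response.choices[0].message.content)

def run_latency_critical_step(fsm: dict, observation: str) -> str:
    """Online phase: a table lookup on the compiled artifact, with no LLM
    access, so it runs comfortably within tens of milliseconds."""
    return fsm["transitions"].get(observation, fsm["default_action"])
```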
Viewing LLMs as compilers aligns well with their emerging ethos. As detailed in a recent survey [21], enormous effort is being invested in approaches to leverage the power of LLMs outside the cloud. All of these efforts suffer from one of two deep problems. In "white-box" approaches that inspect the internal structure of an LLM, the full LLM has to be visible and is therefore vulnerable to IP compromise. In "black-box" approaches that avoid examination of LLM internals, the LLM is used as a teacher, requiring an enormous effort to train a student LLM. Our approach avoids both of these problems.
We provide qualitative validation of this viewpoint using three different case studies: two CH systems, and one CP system. In all three cases, we show that the power of cloud-based LLMs can be leveraged to greatly simplify the effort needed to create unique domain-specific edge AI. This provides initial evidence to support our position that LLMs are best viewed as extremely powerful offline tools in the creation of latency-critical systems.

Latency in CH and CP Systems
Many CH and CP systems run on small, lightweight and energy-efficient mobile devices such as augmented reality (AR) glasses or ultralight drones. These devices correspond to Tier-3 of the reference architecture shown in Figure 1 [17]. They typically offload resource-intensive operations over low-latency, high-bandwidth wireless networks to cloudlets at Tier-2. Whether such offloading is optimal depends on the capabilities of the Tier-3 device, the resource demands of the algorithm, and runtime variables such as network quality, device battery level and cloudlet load [15].
Ideally, the "cyber" component of CH systems should avoid becoming the latency bottleneck in task performance. Humans are remarkably fast and accurate on cognitive tasks, but acutely sensitive to delays in the critical path of interaction. For example, face recognition takes between 370 and 620 ms, depending on familiarity [14]. Speech recognition takes 300 to 450 ms for short phrases, and only 4 ms to detect a human voice [1]. Virtual reality applications that use head-tracked systems require latencies less than 16 ms to achieve perceptual stability [6]. A safe goal for the "cyber" component is an end-to-end latency target of a few tens of ms. Since jitter is also annoying and distracting, it is important to avoid long-tailed distributions of end-to-end latency.
Similar considerations apply to CP systems. A drone traveling at 15 m/s has only 100 ms (1.5 m ÷ 15 m/s) to detect and actuate to avoid an obstacle that is 1.5 m away. Thus, end-to-end cyber delays can be no more than a few tens of ms for safe operation at high speed in tight spaces.

Latency in Cloud-based LLMs
Table 1 reports the mean and standard deviation of response time over five runs of each experiment, along with the length of each output in tokens. The input prompts for all rows except "No-Op" are examples taken from OpenAI's official blog post [10]. The "No-Op" measurement was obtained by using an incorrect API key in the request; it captures the full end-to-end round-trip time (RTT) without execution of the LLM. The requests were made via the OpenAI chat completions API, using OpenAI's official Node.js library, with the temperature parameter set to 0 to optimize for deterministic output.
These results confirm that even for simple inputs, the end-to-end latency of cloud-based LLMs is far too large for the "tens of milliseconds" latency tolerance of CH and CP systems. Consistent with the results of others [13], our results confirm a positive linear correlation between the number of output tokens and response time. Table 2 shows that the latencies are even longer for the more complex inputs of our three case studies in Sections 5 to 7.
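As a concrete illustration of this methodology, the sketch below uses the OpenAI Python SDK rather than the Node.js library used for Table 1; the model name is illustrative, and the timing logic mirrors the description above.

```python
# A minimal sketch of the Table 1 methodology, assuming the OpenAI Python
# SDK. The "No-Op" row corresponds to running the same loop with an
# invalid API key, which times the round trip without LLM execution.
import time
from openai import OpenAI

client = OpenAI()  # for "No-Op", construct with api_key="invalid-key"

def measure_latency(prompt: str, runs: int = 5) -> list[float]:
    latencies = []
    for _ in range(runs):
        start = time.monotonic()
        try:
            client.chat.completions.create(
                model="gpt-3.5-turbo",
                temperature=0,  # optimize for deterministic output
                messages=[{"role": "user", "content": prompt}],
            )
        except Exception:
            pass  # an invalid-key request still measures the full RTT
        latencies.append(time.monotonic() - start)
    return latencies
```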

Case Study: Wearable Cognitive Assistance
Wearable Cognitive Assistance (WCA) refers to a class of CH systems that have the "look and feel" of AR, but use compute-intensive AI algorithms such as computer vision based on machine learning (ML) in their time-critical execution paths. An application of this genre combines a wearable device such as Microsoft HoloLens or Google Glass (Tier-3) with edge computing (Tier-2) to offer real-time task-specific guidance. The tasks can span a wide range, from assistive guidance for a visually-impaired user to assembly or repair of a complex mechanical artifact. The role of the AI algorithms is to track progress through the task, verify correct execution of steps, and offer timely detection and correction of errors as they occur [4, 8, 16]. Figure 2 shows three example WCA applications.
Implementing a new WCA application involves software development in two areas: (a) training of an ML model to visually detect completion of task steps and (b) creation of a finite state machine (FSM) for sequencing of steps and detection of error states. Even using a tool chain to assist in this work [12], considerable manual development effort is still needed. For example, creating an assistant to guide the assembly of the IKEA cart shown in Figure 3 took nearly 200 person-hours for a Master's student in computer science [20]. Even if an experienced WCA developer could halve this time, it is still a substantial investment of effort.
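To illustrate what the FSM side of this development effort looks like, the sketch below gives a minimal, hypothetical encoding: each state carries an instruction and the detector class of the pre-trained vision model that signals step completion. None of these names come from the Ajalon toolchain.

```python
# A minimal sketch of a WCA task FSM, with illustrative names. Each state
# binds to one class of a pre-trained visual detector; a transition fires
# when that class is observed in the video stream.
from dataclasses import dataclass

@dataclass
class Step:
    instruction: str        # guidance spoken or shown to the user
    detector_class: str     # vision-model class that signals completion

@dataclass
class TaskFSM:
    steps: list[Step]
    current: int = 0

    def advance(self, detected_classes: set[str]) -> str:
        """Called per frame with the detector's output classes."""
        step = self.steps[self.current]
        if step.detector_class in detected_classes:
            self.current += 1
            if self.current == len(self.steps):
                return "Task complete!"
            return self.steps[self.current].instruction
        return step.instruction  # repeat guidance until the step is verified
```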

Figure 6: Raw Closed-Caption Text from YouTube

Figure 7: Engineered Prompt for WCA Compilation. "We will build a software application to guide users to assemble a Voxelab Aquila 3D printer step-by-step. I want you to act as an assistant to walk users through the assembly process by giving verbal instructions. Please read the below video subtitles with timestamps and create an ordered list of step-by-step guidance on how to complete the assembly task. Each list item should start with a time range in the video, like 00:00:15 - 00:03:10. Please list around 20 items, be concise, and make sure not to repeat any steps and not to have overlapping time ranges."

Our goal is to greatly reduce, and eventually totally eliminate, development effort. Our approach is based on the observation that many helpful YouTube videos are already available for assembly and repair tasks. Creating such a video only requires task-specific knowledge and domain-specific skills. It does not require programming skills or WCA expertise. We have successfully demonstrated this approach for assembly of a 3D printer (Figure 4) from an off-the-shelf kit [2]. A third-party YouTube video for this task is available [19]. Once created, this WCA app can be shared to benefit others who attempt the same task in the future.
Figure 5 shows the compilation workflow. The starting point is YouTube audio and video. YouTube automatically generates closed-caption (CC) text for the audio channel, as shown in Figure 6. The timestamps help in synchronizing the display of CC text to video. The raw CC text has to be manually reformatted and truncated in order to provide inputs of optimal size to ChatGPT. The crucial part of the compilation process is the creation of the prompt to ChatGPT shown in Figure 7. The prompt is similar to the instructions that one would give to an intelligent human assistant. Yet, even minor differences in this prompt can result in suboptimal outputs from ChatGPT. The term "Prompt Engineering" has emerged for the skill and experience needed to create optimal ChatGPT prompts. Creating the right prompt involves significant trial and error, much like programming. Once the developer has converged on the right prompt, ChatGPT produces the output shown in Figure 8.

Figure 8: ChatGPT Output (excerpt). "Here is the step-by-step guidance for assembling the Voxelab Aquila 3D printer: 1. [00:08:41 - 00:09:37] Install the left z extrusion with the z limit switch using four m5-by-45 screws. 2. [00:09:58 - 00:10:10] Repeat step 1 for the right z extrusion. 3. [00:10:22 - 00:11:42] Install the z-axis motor and screw using two m4-by-20 screws. 4. [00:11:48 - 00:12:23] Install the x motor and pneumatic joint on the extruder using two m5-by-14 screws and one m6 pneumatic joint. 5. [00:12:23 - 00:13:06] Install the x and e motors on the x extrusion using screws inserted from the non-extrusion side. …"

Using this as input, the OpenWorkFlow component of the Ajalon toolchain [12] generates the FSM for the WCA task. As Figure 9 shows, the developer binds individual FSM states to classes of the pre-trained ML model for visual detection of task step completion. The resulting code can be run at Tier-2 to guide a user in performing this WCA task. No runtime accesses to ChatGPT are necessary: the FSM already embodies the necessary task-relevant knowledge. A video demo of this WCA application, implemented as described above, is available [18].
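The sketch below, again with hypothetical names, shows the shape of this offline compilation step: the Figure 7 prompt and the reformatted CC text go to ChatGPT, and the returned step list is parsed into (time range, instruction) pairs for FSM generation.

```python
# A minimal sketch of the Figure 5 compilation step, assuming the OpenAI
# Python SDK. figure7_prompt stands for the Figure 7 text; parse_steps is
# a hypothetical helper matching the Figure 8 output format.
import re
from openai import OpenAI

client = OpenAI()

def compile_steps(figure7_prompt: str, cc_text: str) -> list[tuple[str, str]]:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": figure7_prompt + "\n\n" + cc_text}],
    )
    return parse_steps(response.choices[0].message.content)

def parse_steps(text: str) -> list[tuple[str, str]]:
    """Extract (time_range, instruction) pairs from lines like
    '[00:08:41 - 00:09:37] Install the left z extrusion ...'."""
    pattern = r"\[(\d{2}:\d{2}:\d{2}\s*-+\s*\d{2}:\d{2}:\d{2})\]\s*(.+)"
    return re.findall(pattern, text)
```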

Case Study: Mission-Centric Drone Flight
Our second case study involves autonomous surveillance tasks on lightweight, cheap, commercial off-the-shelf (COTS) drones. In this CP system, the drone is a Tier-3 IoT device that offloads operations such as ML-based computer vision to a ground-based cloudlet (Tier-2) over a 4G LTE wireless network [3]. This system has to strike a balance between precise description of a mission from the user's viewpoint, and its optimal execution for a specific drone type under current runtime conditions. This separation of concerns is achieved by describing the mission at a high level using the Google Maps API, imaging primitives, and drone-independent high-level navigation operations. A tool chain transforms this high-level mission description into a mission script that uses the Keyhole Markup Language (KML) format [9]. Runtime progress of a mission can be monitored and controlled via software such as QGroundControl (QGC) [5].


Figure 10: ChatGPT-driven Flight Mission Creation Workflow
Using ChatGPT, we have been able to reduce the human effort and expertise involved in describing a mission. A user does not need to know KML; he just describes the mission verbally, using natural language. Figure 10 shows the KML code compilation workflow. The engineered prompt for this workflow is shown in Figure 11. The use of a real example in the prompt enables the GPT model to perform in-context learning to understand the structure of KML code, which might not have been seen in its training data. This prompt is combined with the voice transcript of input from the user that contains the flight command in natural language (Figure 12). The voice transcript is obtained by performing speech-to-text conversion using the Whisper API [11]. We process the transcribed text by attaching geographic coordinates to each place name that appears in the command. The result is then combined with the prompt, and given as input to ChatGPT. Figure 13 shows the KML code generated by ChatGPT for this example. The code can be imported into QGC to create a flight route as shown in Figure 14.

Figure 11: Engineered Prompt for KML Generation. "Given text: (placeholder for user command) Instruction: The above text is a description of a flight route for a drone. Generate the kml file based on the route described by the text. Here is the requirement of the kml file: 1. Include only one placemark section. 2. If altitude in the text has not been specified then set it as 50 for every waypoint in KML. 3. Precise the value of the altitude, longitude, and latitude in 7 decimal places for each waypoint. 4. One way point for one area of interest. 5. Coordinates are measured in units of degree. 100 meters in the latitude direction is 0.001 degree, and 100 meters in the longitude direction is 0.001 degree. 6. Rules for latitude calculation. new latitude = old latitude + distance. For example, given that from (-74.0059731,40.…"

Figure 12: Example Flight Command. "Starting from Cathedral of Learning, first, fly 100 meters south. Second, fly 100 meters east. Third, go directly to Gates and Hillman Centers."
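The sketch below gives the shape of this pipeline under the same OpenAI SDK assumption; annotate_with_coordinates is a hypothetical stand-in for the place-name geocoding step, and figure11_prompt stands for the Figure 11 text.

```python
# A minimal sketch of the Figure 10 workflow, assuming the OpenAI Python
# SDK. The Whisper transcription call is the real API; figure11_prompt
# stands for the Figure 11 text, and annotate_with_coordinates is a
# hypothetical helper (e.g., backed by the Google Maps API).
from openai import OpenAI

client = OpenAI()

def annotate_with_coordinates(command: str) -> str:
    """Hypothetical: append (latitude, longitude) after each place name."""
    ...

def mission_to_kml(audio_path: str, figure11_prompt: str) -> str:
    # Speech-to-text via the Whisper API [11]
    with open(audio_path, "rb") as f:
        command = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # Attach geographic coordinates to each place name in the command
    command = annotate_with_coordinates(command)

    # Combine the engineered prompt with the annotated command
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user",
                   "content": figure11_prompt.replace(
                       "(placeholder for user command)", command)}],
    )
    return response.choices[0].message.content  # KML, importable into QGC
```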

Case Study: Real-Time Style Transfer
Our third case study focuses on an edge-native augmented reality (AR) application, OpenRTiST, which stands for Open Real-Time Style Transfer. This CH application lets you "see the world through the eyes of an artist," using a smartphone or wearable device [7]. At the heart of OpenRTiST is a neural style transfer (NST) algorithm at Tier-2 that transforms a live video stream captured at Tier-3. This algorithm uses a DNN pre-trained offline to capture the style of a specific artwork, such as a famous painting.
We have shown how ChatGPT can help in discovering a famous painting that embodies a verbally-described effect. Figure 15 shows this workflow. Its engineered prompt is shown in Figure 16. The user provides a vague description of the artwork he seeks (Figure 17). From this input, ChatGPT hallucinates several image descriptions, along with search strings for each description (Figure 18). From the image search results (Figure 19), the user picks the artwork that best matches his intent, and then uses it to train a DNN for OpenRTiST.
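The sketch below shows, under the same OpenAI SDK assumption, the shape of this discovery step; figure16_prompt stands for the Figure 16 text.

```python
# A minimal sketch of the Figure 15 workflow, assuming the OpenAI Python
# SDK. figure16_prompt stands for the Figure 16 text; the output is the
# list of candidate descriptions and search strings of Figure 18.
from openai import OpenAI

client = OpenAI()

def suggest_artworks(figure16_prompt: str, user_description: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": figure16_prompt + "\n\n" + user_description}],
    )
    return response.choices[0].message.content
```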

LLM Latency for Case Studies
Using the same format and notation as Table 1, the measured end-to-end latency of our three case studies is shown in Table 2. Two facts are immediately apparent. First, the more complex inputs of the case studies lead to significantly higher latency. Second, these results confirm the premise of this work: these latencies are far too high for use in the time-critical paths of CH and CP systems. However, they are perfectly acceptable in offline compilation.

Challenges and Opportunities Ahead
The case studies of Sections 5 to 7 show that the use of LLMs as offline compilers for CH and CP systems is already an achievable goal today. The more powerful LLMs of the future will make this capability even more valuable. At the same time, our experience reveals some limitations in our use of LLMs. We present these below as opportunities for simplifying the use of LLMs in the kinds of applications discussed in this paper.

Incremental Guidance and Convergence
When a human assistant receives initial guidance on a task, he may not know precisely what is expected. The output produced is best effort, with the assumption that incremental guidance will help him refine the output to final form. Today's LLMs view each request-response as a distinct interaction, with no built-in concept of a series of interactions leading to convergence. The effect of history can be achieved today by including previous messages and responses as context to new inputs. However, this is a clumsy solution to the lack of task-specific long-term memory in LLMs. It places the entire burden of simulating history on the code implementing the task, thus making the developer's job more difficult. A more elegant and easier-to-use solution would maintain task-specific context internally. One possibility is to structure such a capability as a new API on top of today's LLMs, and to maintain history within that new layer.
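For concreteness, the sketch below shows today's workaround, assuming the OpenAI Python SDK: a thin wrapper that simulates long-term task memory by resending the full history on every call, which is exactly the burden we would like the LLM service to internalize. TaskSession is a hypothetical wrapper, not an existing API.

```python
# A minimal sketch of the history-simulation workaround, assuming the
# OpenAI Python SDK. TaskSession is a hypothetical wrapper, not a real API.
from openai import OpenAI

class TaskSession:
    def __init__(self, task_description: str, model: str = "gpt-4"):
        self.client = OpenAI()
        self.model = model
        self.history = [{"role": "system", "content": task_description}]

    def refine(self, guidance: str) -> str:
        """Each round of incremental guidance carries all prior rounds."""
        self.history.append({"role": "user", "content": guidance})
        response = self.client.chat.completions.create(
            model=self.model, messages=self.history
        )
        reply = response.choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply
```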

Acquired Proficiency
A human assistant improves on a task as he does it more often. However, to the best of our knowledge, LLMs do not learn as they respond to requests. Their learning is done offline, guided by their creators rather than by their users. The path to improvement on a user task is thus extremely long and indirect. The ability for a user to more directly guide the learning of an LLM on a task would be valuable. This leads to deeper IP questions. Should the learning acquired by an LLM from one customer be available to benefit a competitor? How do you partition learning so that there are rigid and enforceable boundaries for cloud-based LLM services? There are no easy answers to these questions, but explicit or implicit policies in this regard will become inevitable.

User-Specific Customization
In real life, the same exact request may be serviced differently for different users because of the specific whims and preferences of those users. Human assistants know this well. An assistant to persons A and B learns to respond to their identical requests in slightly different ways, and carries that learning over to new tasks. Today's LLMs support a form of this customization via fine-tuning. OpenAI has recently provided APIs for users to upload training data and fine-tune their private models. However, while fine-tuning allows customization of models for specific tasks, it can be complex and require substantial labeled data. More importantly, it places the entire burden of customization on the developer. Internalizing user-specific customization, possibly via a new layer above the current API, would greatly enhance the value of LLMs and help them better emulate human assistants. This would simplify the use of LLMs for CH tasks such as guidance of visually-impaired persons.
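For concreteness, the sketch below shows what this fine-tuning path looks like with OpenAI's current Python SDK; the file name and base model are illustrative.

```python
# A minimal sketch of per-user fine-tuning with the OpenAI Python SDK.
# The JSONL file would contain chat-formatted examples reflecting how
# user A prefers identical requests to be handled.
from openai import OpenAI

client = OpenAI()

# Upload the user-specific training examples
training_file = client.files.create(
    file=open("user_a_preferences.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job; the result is a private, customized model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```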

Correctness of Generated Code
Today, a human has to manually check the output of an LLM to ensure that it is consistent with the intended actions. This is still a win over the much larger generative effort that would have been required to create that output manually. An even better approach would be to avoid the need for checking entirely. Modern compilers are assumed, by construction, to generate correct code. An open research question is whether we can ever reach a similar state of confidence with LLMs. One possibility is to substitute automated checking tools for human checking. In Case Study 1, for example, the OpenWorkFlow tool could be extended to include correctness checking. For Case Study 2, a custom-built correctness checker could verify that LLM-generated code does not violate safety constraints, as sketched below.
There is a large body of verification- and correctness-related research in computer science that can be harnessed towards this goal.
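As one example of such an automated checker, the sketch below (with illustrative constraint values, not taken from the paper) validates LLM-generated KML from Case Study 2 against an altitude ceiling and a geofence before the mission is flown.

```python
# A minimal sketch of a Case Study 2 safety checker. The altitude ceiling
# and geofence bounds are illustrative, not from the paper.
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"
MAX_ALTITUDE_M = 120.0
LAT_BOUNDS = (40.42, 40.45)      # illustrative operating area
LON_BOUNDS = (-80.02, -79.93)

def check_kml(kml_text: str) -> list[str]:
    """Return a list of safety violations; empty means the mission passes."""
    violations = []
    root = ET.fromstring(kml_text)
    for coords in root.iter(KML_NS + "coordinates"):
        # KML coordinates are whitespace-separated "lon,lat,alt" triples
        for triple in (coords.text or "").split():
            lon, lat, alt = map(float, triple.split(","))
            if alt > MAX_ALTITUDE_M:
                violations.append(f"altitude {alt} m exceeds ceiling")
            if not (LAT_BOUNDS[0] <= lat <= LAT_BOUNDS[1]
                    and LON_BOUNDS[0] <= lon <= LON_BOUNDS[1]):
                violations.append(f"waypoint ({lat}, {lon}) outside geofence")
    return violations
```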

Conclusion
We have shown how the power of LLMs can be leveraged by CH and CP systems. The key insight is that one does not need the full power of an LLM in the latency-critical paths of these systems. Using an LLM as an offline compiler to generate latency-sensitive code paths is the approach that we advocate. This concept can be extended to a hierarchical approach, in which cloud-based LLMs are invoked every few seconds to generate fresh latency-sensitive code paths. Such a system would be achievable using today's LLM deployments. It would embody the powerful adaptive intelligence of a cloud-based LLM, while honoring the tight latency bounds of Edge AI.

Figure 1: Reference Architecture (Source: [17])

Figure 2: Example WCA Applications. Face: whispers in your ear the name of a familiar person whose name you cannot recall; detects and extracts a tightly-cropped image of the face, and then applies a face recognizer. Pingpong: tells a novice to hit the ball to the left or right, depending on which is more likely to beat the opponent. Lamp: guides a user in assembling a table lamp from an IKEA kit; gives video and verbal guidance for the next step.

Figure 14: Screenshot of Flight Route in QGroundControl

Table 1: OpenAI API Response Time and Output Token Count. Results are the mean of five runs, with standard deviations in small font.

Table 2: LLM Latency for Case Studies