Language, Camera, Autonomy! Prompt-engineered Robot Control for Rapidly Evolving Deployment

The Context-observant LLM-Enabled Autonomous Robots (CLEAR) platform offers a general solution for large language model (LLM)-enabled robot autonomy. CLEAR-controlled robots use natural language to perceive and interact with their environment: contextual descriptions derived from computer vision, along with optional human commands, prompt LLM responses that map to robotic actions. By emphasizing prompting, system behavior is programmed without manipulating code, and unlike other LLM-based robot control methods, we perform no model fine-tuning. CLEAR employs off-the-shelf pre-trained machine learning models to control robots ranging from simulated quadcopters to terrestrial quadrupeds. We provide the open-source CLEAR platform, along with sample implementations for a Unity-based quadcopter and a Boston Dynamics Spot® robot. Each LLM used (GPT-3.5, GPT-4, and LLaMA2) exhibited distinct behavior when embodied by CLEAR, differing in actuation preference, ability to apply new knowledge, and receptivity to human instruction. GPT-4 performed best, executing tasks successfully 97% of the time, compared to GPT-3.5 and LLaMA2. The CLEAR platform contributes to HRI by increasing the usability of robotics for natural human interaction.


INTRODUCTION
Recent advances in large language models (LLMs) have enabled versatile new modalities for improving usability in the field of Human-Robot Interaction (HRI). Thus far, LLM-driven robotics research has used models specifically trained for robot control [1][2][3] or has asked an LLM to generate robot actuation code [6, 7, 9, 14, 17]. Advances in LLM-driven agent-based models have also provided ways to overcome LLM token-length "memory" and chain-of-thought reasoning limitations [5, 11, 16].
Our robot vision system, Context-observant LLM-Enabled Autonomous Robots (CLEAR), builds on previous work as a robot interface that does not require LLM fine-tuning. In other words, the system uses LLMs as they are; no further development of the model is required for CLEAR to be effective. This work allows untrained users to interact with a vision-enabled robot to perform dynamic, context-appropriate behaviors.
We present a robot-vision-LLM system that is 1) robot-agnostic, 2) LLM-agnostic, and 3) prompt-only (no LLM fine-tuning or robot-specific pre-training). In contrast to previous work, our software focuses specifically on HRI applications: users can interact with the robot during and between tasks via a web-based voice/chat interface, allowing a different level of user interaction than other, more planning-focused work. We evaluate CLEAR on tasks carried out by a simulated quadcopter using multiple LLMs with a YOLOv8 [8] vision model. GPT-4 demonstrates the most consistent execution of commands (97%) compared to GPT-3.5 and LLaMA2. Next, changing only the initial prompt, we provide an example of using CLEAR to direct a Boston Dynamics Spot® quadruped to accomplish multistep tasks via verbal commands. This makes CLEAR the first prompt-only system that can easily swap between different robot form factors.
CLEAR is provided under an MIT license at https://github.com/MITLL-CLEAR so that, as LLMs, vision models, and robotics continue to evolve, almost anyone can use an LLM to autonomously control a robot. The CLEAR_setup repository provides the means for installing, managing, and working with the components. We welcome community input and contributions.

CHARACTERISTICS
CLEAR is composed of several distributed services following a separation-of-concerns design philosophy. The services communicate via Representational State Transfer (REST) APIs built on Node.js. This design enables deployment configuration options, such as a globally available cloud service, a local network service, or a hybrid.
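As a rough illustration of this flexibility, the sketch below shows how the same two services could be addressed under each deployment mode. The hostnames, ports, and resolver function are hypothetical; the actual CLEAR configuration format may differ.

```python
# Illustrative only: hypothetical endpoint table for CLEAR-style services.
DEPLOYMENTS = {
    "cloud": {
        "coordinator": "https://clear.example.com/coordinator",
        "worker": "https://clear.example.com/worker",
    },
    "local": {
        "coordinator": "http://192.168.1.10:3000",
        "worker": "http://192.168.1.11:3001",
    },
    "hybrid": {  # e.g., local coordinator, cloud-hosted GPU worker
        "coordinator": "http://192.168.1.10:3000",
        "worker": "https://clear.example.com/worker",
    },
}

def endpoint(mode: str, service: str) -> str:
    """Resolve a service URL for the chosen deployment mode."""
    return DEPLOYMENTS[mode][service]
```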

Object Detection and Tracking
CLEAR's awareness of its surroundings is achieved via a YOLOv8 object detection model, which perceives a constant stream of meta-objects. These meta-objects are abstract representations of objects detected in the environment. Their attributes include object classification, local coordinates, time of initialization (age), and sub-object detections (e.g., the hands of a person). These attributes are routinely updated (age reset and coordinates corrected) based on feature similarities with newly perceived meta-objects, and meta-objects are deleted if they are not updated before exceeding a maximum age threshold. Together, these meta-objects enable the system to perceive the robot's visual environment and characterize meaning based on knowledge of the system definition. Meta-objects and their attributes are thus the foundational features used to facilitate robotic actuation.
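A minimal sketch of this lifecycle follows. The attribute names, similarity test, and age threshold are illustrative assumptions, not CLEAR's actual implementation.

```python
import time
from dataclasses import dataclass, field

MAX_AGE_S = 5.0  # hypothetical maximum age threshold

@dataclass
class MetaObject:
    classification: str   # e.g., "person"
    coords: tuple         # local (x, y, z) estimate
    sub_objects: list = field(default_factory=list)  # e.g., detected hands
    last_seen: float = field(default_factory=time.time)

    def matches(self, detection: "MetaObject") -> bool:
        """Crude feature-similarity test (the real matching is richer)."""
        return detection.classification == self.classification

    def update(self, detection: "MetaObject") -> None:
        """Reset age and correct coordinates from a new detection."""
        self.coords = detection.coords
        self.last_seen = time.time()

def prune(tracked: list) -> list:
    """Drop meta-objects that exceeded the maximum age threshold."""
    now = time.time()
    return [m for m in tracked if now - m.last_seen < MAX_AGE_S]
```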

System Design
CLEAR has five component services, each in its own repository. These services operate in tandem to contextualize sensor/user inputs, process LLM input/output, and actuate the robot (Figure 1). The CLEAR_interface_server connects the robot system to the CLEAR platform at large by handling data transfer between the robot, human users, and other CLEAR services. Its browser-based user interface allows users to converse with the LLM via voice/chat, vet the LLM's commands to the robot, and take manual control of the robot (Figure 3). Information obtained from users and the robot is then relayed to the CLEAR coordinator.
CLEAR_coordinator is the central handler between CLEAR's sensor-related services, the LLM, and actuation commands. The coordinator manages two abstract representations of the robot's environment: a volatile list of detected objects (meta-objects) and a natural-language exposition detailing context. The meta-objects support tracking-dependent actions, such as moving to or grabbing an object, which rely on information like location and distance. The exposition, dubbed the Conversation Ledger, is what the LLM perceives and is kept in the LLM handler; however, the coordinator creates the conversation ledger and, in it, details a conceptual framework for understanding the abstract representations of the first layer that are subsequently provided as prompts. This perception is derived and applied to the various services through the coordinator's relationship with the two web servers: the interface server and the worker server. From the interface server, the coordinator receives images and user text that require extensive processing; to prevent the coordinator from stalling, this processing is shared with the worker services.
CLEAR_worker_server handles data transfer between the coordinator, computer vision, and the LLM handler. The worker server makes computationally rigorous functionality remotely available to the coordinator, providing object detection, depth estimation, and LLM inference. These worker services employ an event-driven architecture, awaiting HTTP POST requests that deliver input data and emit a signal. Select events launch the respective worker processes, which in turn yield data usable by the coordinator.
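The pattern is roughly as below. This sketch uses Python and Flask purely for brevity; as noted above, CLEAR's servers are built on Node.js, so the route name, port, and stub detector are all hypothetical.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_object_detection(image_bytes: bytes) -> list:
    """Placeholder for the object detection worker process."""
    return [{"class": "person", "bbox": [0, 0, 10, 10]}]  # dummy output

@app.route("/detect", methods=["POST"])
def detect():
    # The POST request delivers the input data and acts as the event
    # signal; the response carries data back for the coordinator.
    detections = run_object_detection(request.get_data())
    return jsonify(detections)

if __name__ == "__main__":
    app.run(port=3001)  # hypothetical port
```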
CLEAR_computer_vision interprets visual data, furnishing the coordinator's meta-object collection through its two sub-services: object detection and monocular depth estimation. The coordinator provides camera input from the robot, and the computer vision service transforms and returns the visual data as depth matrices and encoded strings of object information. Depth is used for object avoidance, while the encoded strings are refined into meta-objects. These meta-objects are initialized in the coordinator's first perception layer, then translated into the second layer, the Conversation Ledger, as prompts to be handled by the CLEAR_LLM_handler.
CLEAR_LLM_handler connects the coordinator to an LLM to generate responses that produce robot behavior. The coordinator and LLM handler are related through the conversation ledger. The coordinator initializes the conversation ledger and iteratively appends prompts to it. The LLM handler preserves the conversation ledger, shares it with the LLM, and appends the LLM's responses to it, allowing the LLM to use its conversation history to infer responses.
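Schematically, the handshake might look like the following, where the entry fields are our own illustrative choices rather than CLEAR's exact JSON schema.

```python
import json

ledger = []  # the conversation ledger: an append-only JSON log

def append_prompt(text: str) -> None:
    """Coordinator side: add a sensory or user prompt to the ledger."""
    ledger.append({"role": "prompt", "content": text})

def append_response(text: str) -> None:
    """LLM handler side: preserve the model's reply so later
    inferences can draw on the full conversation history."""
    ledger.append({"role": "response", "content": text})

append_prompt("OBJECTS: person at (1.2, 0.0, 3.5); apple at (0.4, 0.1, 2.0)")
append_response("MOVETO(apple)")
print(json.dumps(ledger, indent=2))
```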

The Conversation Ledger
The conversation ledger is a natural-language JSON log that drives the LLM's perception and interactivity. Entries of the conversation ledger are conceptually divided into three interdependent categories: prompts, responses, and system definitions (Figure 2). Prompts are statements expressing system sensory input, including an abstract representation of the coordinator's meta-objects, general system information, and human comments from the web interface (Figure 3). These prompts are exchanged with the LLM for responses.
Responses from the LLM can spur robotic actuation and/or message users on the interface server. These responses must be strictly structured because they are semantically similar to programmatic function calls: responses are parsed for arguments and matched against a dictionary whose values are function pointers.
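A minimal sketch of this dispatch pattern follows; the call-like response grammar and the stub action functions are assumptions for illustration, though the action names come from Section 3.2.1.

```python
import re

def move_to(target): print(f"moving to {target}")
def look_at(target): print(f"looking at {target}")
def throw(target):   print(f"throwing at {target}")

# Dictionary relating parsed action names to function pointers.
ACTIONS = {"MOVETO": move_to, "LOOKAT": look_at, "THROW": throw}

def dispatch(response: str) -> None:
    """Parse a structured LLM response and invoke the matching action."""
    match = re.match(r"(\w+)\((.*)\)", response.strip())
    if match is None or match.group(1) not in ACTIONS:
        raise ValueError(f"malformed response: {response!r}")  # system error
    ACTIONS[match.group(1)](match.group(2))

dispatch("MOVETO(apple)")  # -> moving to apple
```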
System definitions are tailored to each CLEAR configuration and contextualize its conversation ledger (Figure 2). System definitions detail the robot-specific actions available and how to use them. For example, the Unity quadcopter (Section 3.2.1) can throw objects (e.g., apples) at other objects because the equipped system definition defines a THROW action that takes a target object as a parameter.
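For instance, a system definition excerpt might read as follows. This is a hypothetical paraphrase; the actual definitions shipped in the repository differ in wording.

```python
# Hypothetical excerpt of a system definition for the Unity quadcopter.
SYSTEM_DEFINITION = """
You control a quadcopter. Respond with exactly one action per turn:
  MOVETO(object)  - fly to the named object
  LOOKAT(object)  - orient the camera toward the named object
  THROW(object)   - throw a held item (e.g., an apple) at the target
  ROTATE(degrees) | RESTART | STOP | DO NOTHING | MESSAGE(text)
Any response outside this list is invalid.
"""
```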

SOFTWARE
We provide two reference implementations: a Unity-based quadcopter simulation and a Boston Dynamics Spot robot. Both were tested with object detection (the most computationally intensive local process) running on either an NVIDIA DGX A100 or an NVIDIA GeForce RTX 3060 GPU. As multimodal LLMs evolve, local compute may only be necessary for privacy considerations. We first describe the three LLMs used in testing, then the two reference implementations and associated results.

Choice of Large Language Model
For the simulated quadcopter, CLEAR was tested with gpt-3.5-turbo (GPT-3.5), gpt-4 (GPT-4), and LLaMA2-70B-chat-hf (LLaMA2), primarily due to their low latencies [4, 10, 15]. Local instances of "open-source" LLMs, such as the large-parameter LLaMA2 and Falcon models, were set up for integration in CLEAR and, with sufficient optimization, will be testable, but their latencies were too high for real-time feedback [12, 15]; we leave this, and the integration of other LLMs, to future work.

Example Implementations
3.2.1 Simulated Quadcopter. CLEAR_virtual_drone includes a Unity prefab that can be immediately deployed in simulation. Due to the ease of sim-based replication, we focus on this implementation to demonstrate CLEAR's usability; it also encourages the community to build new environments and use cases. We do not seek state-of-the-art results in planning or other benchmarks, focusing instead on providing a novel, accessible HRI capability. To demonstrate CLEAR's behavior, we performed three experiments, each with 1000 trials for each of the three LLMs. The system definition used by the LLMs in each experiment is included in our repository. Our results are derived from a prerelease CLEAR version using the LLMs in September and October of 2023. We first consider how the LLMs differ in synthesizing input for robotic control without human intervention, as a type of "background chatter" and initial environmental surveying. Each LLM tested has access to the output commands (described in the initial prompt): MOVETO, LOOKAT, THROW, ROTATE, RESTART, STOP, DO NOTHING, and MESSAGE. Table 1 indicates that the GPT models tend to observe and act on their surroundings, with high counts of THROW, MOVETO, and LOOKAT, whereas LLaMA2 tends to move, do nothing, and message more frequently.
In the same vein of providing observable insight into LLM capability without human intervention, we measure each LLM's resiliency against memory fatigue. LLMs face a well-documented difficulty in utilizing older information. With CLEAR, we can track this resiliency to showcase the system's longevity of use before a system error occurs; here, a system error refers to the LLM violating the system definition by asking CLEAR to coordinate an undefined action. Table 2 reports the number of responses across all scenarios that were correctly formed (no failures), as well as the number of responses that were incorrect on the first, second, third, and subsequent tries. In all cases, the LLMs followed instructions without failure on the first try more than 85% of the time, with GPT-4 doing so 97% of the time.
To measure how receptive CLEAR is to robot instruction-following under each LLM, we posed a series of increasingly challenging human-directed task scenarios: a chained request of three actions (restart, message, and move to an object). The actions were required to be sequential; if any additional action occurred within the sequence, the trial was counted as a failure. The results (Table 3) show that LLaMA2 is the least receptive, satisfying the full three-action command no more than 8% of the time, while GPT-4 was consistently the most likely to follow the entire human command. Both GPT-4 and GPT-3.5 completed this three-action task more than 30% of the time, even under this somewhat strict definition of success and without any attempt to optimize the initial prompts.
3.2.2 Boston Dynamics Spot. We also offer an example of CLEAR on the Boston Dynamics Spot quadruped robot with an arm attachment (software is in the CLEAR_spot repository). A user/robot chat exchange is shown in Figure 3; the output commands matched most of the simulation's commands, except RESTART and THROW. We included a new command, GRAB, with a human-in-the-loop approval requirement as a safety feature.

USAGE AND FUTURE WORK
The interfaces provided by CLEAR enable its use as an HRI research platform focused on natural interactions between humans and robots. From the perspective of robotics accessibility, interfaces like CLEAR have the potential to democratize access to complex robotic systems by making their use closer to human-human communication, while directly leveraging state-of-the-art open-source machine learning models as they become available. Such systems do, however, present some risks. We provide some basic guidelines for safe and effective use, though experimental validation of this system with users remains future work. CLEAR provides basic safety checks, such as the rejection of commands not in the system definitions, but more may be warranted at the LLM or robot-control levels. We describe examples where human-in-the-loop autonomy is required and implemented (i.e., Spot's GRAB and the simulated drone's THROW commands), though safe robot behavior cannot be guaranteed. Additionally, "tricking" the system into performing dangerous actions is relatively simple if object detection or robot action labels are supplied maliciously (e.g., a dangerous action mapped to an innocent name). The system also inherits vulnerabilities to adversarial attacks at the LLM level [18]. However, harm-reduction training in many off-the-shelf LLMs filters some typical (non-adversarial) unsafe prompts.
To minimize unsafe robot actions, we recommend 1) restricting UI access to trusted users; 2) limiting actuation speeds on the robot, regardless of the LLM's commands; 3) ensuring that object detection and action labels are appropriately mapped; and 4) potentially implementing an observer layer that judges commands in the context of the robot's actions and rejects unsafe ones.
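As a sketch of recommendation 4, combined with the human-in-the-loop gating used for Spot's GRAB and the drone's THROW, an observer layer could look like the following. The action groupings and function names are hypothetical.

```python
SAFE_ACTIONS = {"MOVETO", "LOOKAT", "ROTATE", "STOP", "DO NOTHING", "MESSAGE"}
NEEDS_APPROVAL = {"GRAB", "THROW"}

def vet(action: str, argument: str, ask_user) -> bool:
    """Return True only if the command may be forwarded to the robot."""
    if action in SAFE_ACTIONS:
        return True
    if action in NEEDS_APPROVAL:
        # Defer to a human before any potentially unsafe actuation.
        return ask_user(f"Allow {action}({argument})?")
    return False  # undefined actions are rejected outright
```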

CONCLUSION
With LLMs becoming ubiquitous, further research is needed to understand how they interact with humans in both virtual and physical environments. The CLEAR platform, which synthesizes input from both LLMs and humans, has the potential to help researchers better understand how robotics can be designed for the general public. This implementation is the first of its kind that is 1) robot-agnostic, 2) LLM-agnostic, and 3) prompt-only, allowing users to use components of their choice without fine-tuning. As more advanced LLMs and vision models are developed, their advances can be rapidly incorporated into robotic systems via CLEAR. We offer baseline performance measurements that support CLEAR's usage with a variety of LLMs and indicate which may work better. Additionally, because CLEAR is implemented with human-centered principles of safety and natural-language-based instruction, it improves the overall usability of robotics by engaging humans without complicating interaction.
CLEAR will be used by our team as part of a human-AI interaction project at least through October 2025, with a high likelihood of continued use beyond that timeframe. Associated development and maintenance will continue through that project and will be pushed to the open-source repository.

Figure 1: System architecture for CLEAR.

Figure 2: The initial instructions, user commands, and vision-model textual output are all passed to the LLM, via the conversation ledger, to federate the robot actions.

Figure 3: Web interface for Spot (left) and Unity Drone (right). The interfaces show the robot's point of view, the robot's current action, and the chat history between the user and the LLM. Any user interaction with the robot occurs through this interface, primarily via the chat box on the bottom right.
ACKNOWLEDGEMENTS

This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.

Table 1: Behavioral preferences by LLM without human input. Prompts are generated entirely by object detection.

Table 2: Number of responses until the first incorrect robotic instruction, i.e., one deviating from the required response format.

Table 3: Percent task completion per step of the 3-action task requests given in the human prompt.