The Conversation is the Command: Interacting with Real-World Autonomous Robots Through Natural Language

In recent years, autonomous agents have become increasingly common in real-world environments such as our homes, offices, and public spaces. However, natural human-robot interaction remains a key challenge. In this paper, we introduce an approach that synergistically exploits the capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enable humans to interact naturally with autonomous robots through conversational dialogue. We leverage LLMs to decode high-level natural language instructions from humans and abstract them into precise, robot-actionable commands or queries. We further utilise VLMs to provide a visual and semantic understanding of the robot's task environment. Our results, with 99.13% command recognition accuracy and 97.96% command execution success, show that our approach can enhance human-robot interaction in real-world applications. Video demonstrations of this paper can be found at https://osf.io/wzyf6 and the code is available at our GitHub repository (https://github.com/LinusNEP/TCC_IRoNL.git).


INTRODUCTION
The exploration of human-robot interaction (HRI) [29], [32] and its advancement into real-world applications has been a topic of significant research over the past decades [30]. Current approaches for controlling and interacting with autonomous robots in the real world have been dominated by complex teleoperation controllers [13], teach pendants [2], and rigid command protocols [16], where the robots execute predefined tasks based on specialized programming languages. As the tasks we present to these robots become more intricate and the environments they operate in grow more unpredictable [18], there is a clear need for more natural and intuitive interaction mechanisms.
Prior work has tilted towards techniques such as reinforcement learning [21] and imitation learning [2]. By leveraging iterative learning and human demonstrations, these strategies have shown a capacity for fostering nuanced robot behaviours, as demonstrated in [28]. However, the computational burden [15] and the high costs associated with reward specification [25], task-specific training, or fine-tuning that are common in reinforcement and imitation learning frameworks have limited the practical applicability of these methods, especially for simpler robotic tasks.
Prompted by these challenges, we turned our focus to recent advances in large language models (LLMs) [23], [6] and multi-modal vision-language models (VLMs) [22], [24] to foster intuitive human-robot collaboration. This paper introduces an approach that exploits the inherent natural language capabilities of pre-trained LLMs and VLMs to enable humans to interact with autonomous robots through natural language dialogues. As demonstrated in Figure 1, we aim to realize a new approach to human-robot interaction: one where the conversation is the command (refer to Sections 3 & 4 for more details).
Our contributions are therefore threefold: (a) we introduce a framework that can leverage independent pre-trained LLMs (e.g., OpenAI GPT-2 [23] & GPT-3 [6], Google BERT [8], Meta AI LLaMA [31], etc.) and VLMs (e.g., CLIP [22]) to enable real-world autonomous robots to interact with humans or other entities using natural language dialogues; (b) we performed real-world experiments with our developed framework to ensure that the robot's actions are always aligned with the user's instructions, thereby reducing the likelihood of erroneous operations; and (c) we have made our code and associated resources publicly available, allowing for easy reproducibility of our results.

RELATED WORK
The recent rise of natural language processing (NLP) [33], marked by large language models (LLMs) such as OpenAI GPT-3 [6], Google BERT [8], HuggingFace DistilBERT [26], EleutherAI GPT-NeoX [3], Meta AI LLaMA [31], and Facebook RoBERTa [14], together with multi-modal vision-language models (VLMs) such as CLIP [22] and DALL-E [24] and their successors, has opened new avenues for human-robot interaction. The inherent capacity of these models to understand and generate human-like text as well as visual observations has led to several interesting applications [4], [5]. Recent works such as [1], [5], [10] and [12] have successfully incorporated LLMs and VLMs into robotic systems, allowing the LLMs to interpret and execute complex commands. Similarly, Wangchunshu et al. [35], Kaiwen Zhou et al. [34], and Miguel et al. [9] demonstrated how LLMs can be used to facilitate real-time feedback, zero-shot object navigation, and cognitive learning in autonomous robots.

Figure 1: Example demonstration of our framework. We demonstrated these results in the real world as shown in the summary video at https://osf.io/wzyf6. In (a), our framework decodes high-level instructions from humans, such as "move in a circular pattern" or "move forward, go right", and abstracts them into the robot's physical actions. In (b), we leveraged our framework for the robot's task-environment understanding, information requests, and goal navigation.
While these works are notable, they focus solely on step-by-step task descriptions [12] and rely entirely on the LLM's ability to plan the robot's actions and act. In complex real-world scenarios, especially since LLMs can hallucinate [7] or generate inconsistent data, this approach may introduce inconsistencies and randomness into the robot's actions.
In contrast, our approach draws inspiration from the work of Yagi Xie et al. [33]. Instead of relying completely on the LLMs' ability to plan and execute the robot's actions, we employ a bidirectional approach: we use the LLMs purely as a linguistic decoder [33], and a classical robot operating system (ROS) [20] navigation planner to plan the robot's actions. We provide the LLMs with a dictionary of task descriptions and action patterns. We then use the ROS planner to plan the actual physical actions of the robot (e.g., path planning, localisation, obstacle avoidance, mapping, etc.), as shown in Figure 1b.

METHODS
An overview of our framework's architecture is shown in Figure 2. One of our core objectives is to develop a framework that enables real-world autonomous robots to interact with humans or other entities using natural language dialogues. To achieve this objective, we decompose the task into three subtasks: (a) the integration of LLMs and VLMs, (b) the development of the robot execution mechanism (REM) node, and (c) the development of the chat graphical user interface (ChatGUI). This section provides details of each of these subtasks.

Integration of LLMs and VLMs
To decode natural language conversations and abstract them into the robot's actions, we developed a ROS node, LLMNode (light green block in Figure 2), to establish communication between the pre-trained LLMs and the other interfaces within the ROS ecosystem [20]. The LLMNode subscribes to topics that provide essential data, e.g., odometry for spatial sensing and outputs from the CLIPNode (light purple block in Figure 2) for visual observation and object recognition. We used the LLMNode to handle incoming natural language inputs from the ChatGUI (Subsection 3.3) by first passing them through the pre-trained LLM [23]. The output is then mapped to the robot's actionable commands or queries.
In the mapping process, we leverage pattern matching to align the generated text with predefined actions or information requests. For example, navigation commands are translated into goals for the robot to pursue within its environment, while queries Q about the robot's status or surroundings are addressed with information derived from the robot's sensor data. The LLMNode also oversees the execution and feedback process. We added this function to provide real-time feedback through the publishing of messages, which not only inform the user but also log the interaction data for subsequent analysis (see Subsection 4.1 for more details).
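A minimal sketch of this pattern-matching step is shown below; the pattern strings, command labels, and helper function are illustrative stand-ins, not the full dictionary used by the LLMNode.

```python
import re

# Hypothetical dictionary of action patterns -> command labels.
# The actual LLMNode uses its own task-description dictionary.
ACTION_PATTERNS = {
    r"\b(move|go)\s+forward\b": "MOVE_FORWARD",
    r"\brotate\s+in\s+place\b": "ROTATE_IN_PLACE",
    r"\bnavigate\s+to\s+(?P<place>.+)$": "NAVIGATE_TO_GOAL",
    r"\bwhat\s+objects?\b": "QUERY_OBJECTS",
}

def map_text_to_command(llm_output: str):
    """Match the LLM-decoded text against known patterns.

    Returns a (label, arguments) tuple, or ("UNRECOGNIZED", {}) so the
    robot can be halted safely when nothing matches.
    """
    text = llm_output.lower().strip()
    for pattern, label in ACTION_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            return label, match.groupdict()
    return "UNRECOGNIZED", {}

# Example:
# map_text_to_command("Please navigate to the secretary's office")
# -> ("NAVIGATE_TO_GOAL", {"place": "the secretary's office"})
```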
In summary, the LLMNode can be described as a mapping from natural language inputs to robot actions, i.e., LLMNode: L ↦ A, where L represents the space of natural language inputs and A denotes the set of possible robot actions. This mapping is a composition of several functions, as depicted in Eq. 1.
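A schematic form of this composition, assembled from the components described next (LM, Sen(Data), and the REM translation), could be written as follows; this is a reconstruction, and the exact notation of Eq. 1 may differ.

```latex
% Schematic form of the LLMNode mapping: the REM node turns the language
% model's interpretation of the input, together with sensor context, into
% an executable action. (Reconstruction; the paper's Eq. 1 may be notated
% differently.)
\mathrm{LLMNode}(\ell) = \mathrm{REM}\big(\mathrm{LM}(\ell),\, \mathrm{Sen}(\mathrm{Data})\big) \in \mathcal{A},
\qquad \ell \in \mathcal{L}
```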
From Eq. 1, LM(ℓ) is the language model's interpretation of the input ℓ ∈ L, and Sen(Data) represents the sensor data that informs the context of the command. The REM node (Subsection 3.2) then translates this into an executable command for the robot. Furthermore, to provide a visual and semantic understanding of the task environment (e.g., Figure 1b), we used the OpenAI contrastive language-image pre-training (CLIP) model [22]. The CLIP model consists of language and image encoders trained on 400 million image-text pairs [27]. We therefore used it to encode the stream of RGB images from our observation source (an Intel RealSense D435i) alongside textual descriptions of objects in the image.
Formally, given an image I_t at time t, we consider a set of predefined textual descriptions D = {d_1, d_2, ..., d_n}. Each description d_i ∈ D is mapped to a tokenized representation, forming a set T = {t_1, t_2, ..., t_n}. This set encompasses human-readable labels for common office objects such as "table", "chair", "person", and so on. Using the CLIP model [22], we extract the feature vector of the image, i.e., f_I = CLIP_encode_image(I_t). For each tokenized description t_i ∈ T, its feature vector is obtained as f_{T_i} = CLIP_encode_text(t_i). Subsequently, for the image feature and every text feature, we compute the similarity scores S_i = f_I · f_{T_i}^⊤. We then determine the recognized object within the image by selecting the textual description that yields the highest similarity score, as depicted in Eq. 2.
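A minimal sketch of this scoring step with the open-source CLIP package is shown below; the ViT-B/32 model variant, the label set, and the image file name are illustrative assumptions rather than the exact CLIPNode configuration.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/32 is an illustrative choice; any available CLIP variant would do.
model, preprocess = clip.load("ViT-B/32", device=device)

# Human-readable labels for common office objects (illustrative set).
descriptions = ["a table", "a chair", "a person", "a door", "a plant"]
text_tokens = clip.tokenize(descriptions).to(device)

# One RGB frame from the observation source (file name is hypothetical).
image = preprocess(Image.open("frame_from_d435i.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)      # f_I
    text_features = model.encode_text(text_tokens)  # f_{T_i}

# Normalise, then take dot products (S_i = f_I . f_{T_i}^T).
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T       # shape (1, n_labels)

# Eq. 2: pick the description with the highest similarity score.
best = similarity.argmax(dim=-1).item()
print("recognised object:", descriptions[best])
```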
Additionally, our model uses the bounding boxes from YOLOv8 [11] to determine regions of interest (ROI) within the image. Notably, the centres of these bounding boxes are employed as the spatial coordinates of the recognized objects, capturing both the identity and the location of the objects in the scene.
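The bounding-box centres could be obtained with the ultralytics YOLOv8 API as in the sketch below; the choice of weights file is an assumption for illustration.

```python
from ultralytics import YOLO

# "yolov8n.pt" is an illustrative choice of weights, not necessarily the
# variant used in our setup.
model = YOLO("yolov8n.pt")

def object_centres(bgr_image):
    """Return (label, (cx, cy)) pairs: bounding-box centres serve as the
    spatial coordinates of the recognised objects in the image."""
    results = model(bgr_image)[0]
    centres = []
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        label = results.names[int(box.cls[0])]
        centres.append((label, ((x1 + x2) / 2.0, (y1 + y2) / 2.0)))
    return centres
```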
We encapsulated the entire process within a ROS node, CLIPNode (light purple block in Figure 2). The output from Eq. 2, representing the recognized objects along with their respective spatial coordinates, is published as a ROS [20] topic. These outputs are subsequently subscribed to by the LLMNode to handle natural language commands, generate responses, and decide on actions for the robot to take. For instance, depending on the user's prompt, the LLMNode can direct the robot to navigate to a detected object or provide information about detected objects and their positions.
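One possible shape for this published output is sketched below; the topic name and JSON message layout are assumptions rather than the CLIPNode's actual interface.

```python
import json

import rospy
from std_msgs.msg import String

rospy.init_node("clip_node_sketch")

# Topic name and message layout are illustrative assumptions.
pub = rospy.Publisher("/clipnode/detected_objects", String, queue_size=10)

def publish_detections(detections):
    """detections: list of (label, (cx, cy)) pairs from the recogniser."""
    msg = [{"label": lbl, "x": cx, "y": cy} for lbl, (cx, cy) in detections]
    pub.publish(String(data=json.dumps(msg)))
```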

Robot Execution Mechanism (REM)
To abstract the high-level language understanding and environment sensing from the LLMNode into actual robot actions, we developed the robot execution mechanism (REM) node. This node translates intents extracted from the LLMNode into actionable tasks for physical execution by the robot. Central to the REM node's functionality is the processing of navigation goals G_d (e.g., Figure 1b). When a textual description of a goal destination G_d, such as "navigate to the Secretary's office", is provided, the REM node translates it into precise goal coordinates (x_g, y_g, z_g, w_g) within the robot's operational environment via a mapping process that correlates the descriptive labels with their corresponding spatial coordinates, i.e., G_d ↦ (x_g, y_g, z_g, w_g). To navigate the robot to the goal, we used the MoveBase package of the ROS navigation stack, which provides an action server for handling navigation goals. The REM node sends the goal to this server and monitors its progress.
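The following sketch sends such a goal to the MoveBase action server via actionlib; the map frame id and the reading of the four goal components as a planar position plus quaternion z/w orientation are assumptions for illustration.

```python
import actionlib
import rospy
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

rospy.init_node("rem_node_sketch")

client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
client.wait_for_server()

def send_goal(x, y, z, w):
    """Send a pose goal to MoveBase and wait for the result.

    The (x, y, z, w) values would come from the label-to-coordinate
    mapping, e.g. the entry stored for "Secretary's office".
    """
    goal = MoveBaseGoal()
    goal.target_pose.header.frame_id = "map"
    goal.target_pose.header.stamp = rospy.Time.now()
    goal.target_pose.pose.position.x = x
    goal.target_pose.pose.position.y = y
    goal.target_pose.pose.orientation.z = z
    goal.target_pose.pose.orientation.w = w
    client.send_goal(goal)
    client.wait_for_result()
    return client.get_state()  # e.g. GoalStatus.SUCCEEDED
```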
Besides navigation goals, the REM node also handles custom movement commands C (e.g., Figure 1a), which are not tied to specific goal locations but rather to particular motion patterns, such as "rotate in place", "move forward", etc. We encoded these patterns in the robot's YAML configuration files, allowing for a flexible command set c_i ∈ C that can be expanded or modified as required. The REM node translates these commands into Twist messages W with linear (v_x, v_y, v_z) and angular (ω_x, ω_y, ω_z) velocity components, i.e., W(c_i) = f(v_x, v_y, v_z, ω_x, ω_y, ω_z).
In addition to handling movement, we integrated a safety measure that halts the robot when an unrecognized command (e.g., the last command in Figure 1a) is received, issuing zero velocities to stop all motion and ensure safe operation.
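A minimal sketch of this pattern execution and safety stop is shown below, assuming a /cmd_vel topic and a simple YAML layout; the robot's actual configuration files may differ.

```python
import rospy
import yaml
from geometry_msgs.msg import Twist

rospy.init_node("rem_motion_sketch")
cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=10)

# Hypothetical YAML layout for motion patterns; the real files may differ.
PATTERNS = yaml.safe_load("""
move_forward:    {linear: [0.2, 0.0, 0.0], angular: [0.0, 0.0, 0.0]}
rotate_in_place: {linear: [0.0, 0.0, 0.0], angular: [0.0, 0.0, 0.5]}
""")

def execute_pattern(name: str):
    """Publish the Twist for a named motion pattern; unrecognised
    commands fall through to a zero-velocity Twist (safety stop)."""
    twist = Twist()
    pattern = PATTERNS.get(name)
    if pattern is not None:
        twist.linear.x, twist.linear.y, twist.linear.z = pattern["linear"]
        twist.angular.x, twist.angular.y, twist.angular.z = pattern["angular"]
    # else: all fields stay at 0.0, halting the robot.
    cmd_pub.publish(twist)
```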
Formally, as summarised in Eq. 3, the REM node abstracts the complexity of the robot navigation and command execution, translating the high-level instructions into physical actions.

Chat Interface Development
To provide an intuitive conversational platform that facilitates natural language interaction between the robot and its human users, we developed a simple and user-friendly chat graphical user interface (ChatGUI), which serves as the user's primary interaction point with the robot through textual communication. We designed the ChatGUI using the Tkinter library and integrated it within ROS [20] for message passing. We employed the standard ROS publish/subscribe communication mechanisms for the ChatGUI, specifically a bidirectional message exchange approach, i.e., ChatGUI: UserInputs ↔ LLMNodeOutputs. User natural language inputs are published to the LLMNode, and the responses are subscribed to and displayed to the user on the ChatGUI.
We developed the ChatGUI with an event-driven architecture to ensure that user actions, such as sending a message or issuing a command, trigger corresponding updates in the ChatGUI or result in the publishing of commands to the LLMNode. We encapsulated this process in a function that translates user actions into corresponding LLMNode responses.
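A compact sketch of such an event-driven Tkinter front end wired to ROS topics is shown below; the topic names are assumptions, not the ChatGUI's actual interface.

```python
import tkinter as tk

import rospy
from std_msgs.msg import String

rospy.init_node("chat_gui_sketch", anonymous=True)

root = tk.Tk()
root.title("ChatGUI")
chat_log = tk.Text(root, state="disabled", width=60, height=20)
chat_log.pack()
entry = tk.Entry(root, width=60)
entry.pack()

# Topic names are illustrative assumptions.
user_pub = rospy.Publisher("/chatgui/user_input", String, queue_size=10)

def append(text):
    """Append one line to the (read-only) chat log widget."""
    chat_log.configure(state="normal")
    chat_log.insert(tk.END, text + "\n")
    chat_log.configure(state="disabled")

def on_send(event=None):
    """Event handler: publish the user's message to the LLMNode."""
    text = entry.get().strip()
    if text:
        append("You: " + text)
        user_pub.publish(String(data=text))
        entry.delete(0, tk.END)

def on_response(msg):
    """Subscriber callback: marshal the LLMNode's response to the Tk thread."""
    root.after(0, append, "Robot: " + msg.data)

entry.bind("<Return>", on_send)
rospy.Subscriber("/llmnode/response", String, on_response)
root.mainloop()
```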

PRELIMINARY RESULTS
We conducted real-world and simulated experiments to demonstrate the applicability and adaptability of our framework. For simulation, we used the Unitree Go1 Gazebo packages and a ROS-based open-source mobile robot adapted from [17]. We ran all simulations on a ground-station PC with an Nvidia GeForce RTX 3060 Ti GPU and 8 GB of memory, running Ubuntu 20.04 with the ROS Noetic distribution.
For the real-world experiments, we used a Lenovo ThinkBook with an Intel Core i7 processor and Intel Iris graphics, running Ubuntu 20.04 with the ROS Noetic distribution, together with a Unitree Go1 quadruped robot. The robot is equipped with an Intel RealSense D435i RGB-D camera and an Ouster 3D LiDAR for visual and spatial observation of the task environment. All real-world experiments were performed in our laboratory office (11 rooms) and an outside corridor, measuring approximately 18 × 20 m and 6 × 120 m respectively.

Initial Evaluation / Participants
In our initial evaluation, we invited 21 participants (mostly students) with an average age of 23 (±5) and a gender distribution of 61.9% male, 28.6% female, and 9.5% other to assess the intuitiveness of our framework by interacting with the robots using natural language. We instructed the participants to command the robots to navigate to locations, identify objects, and inquire about their status. We logged the interaction data, which include the participant's input text, the LLM's output, the true label, the LLMNode's predicted label, etc. To quantitatively evaluate the performance of our framework, we established four key metrics:

(a) Command Recognition Accuracy (CRA): With the CRA, we assess how accurately the LLMNode interprets the natural language commands. This helps us pinpoint instances where the predicted label diverged from the true label, providing insight into potential areas for improvement.
(b) Object Identification Accuracy (OIA): We employed this metric to assess the precision of the CLIPNode in identifying and localizing objects within the robot's task environment.
(c) Navigation Success Rate (NSR): We utilised this metric to determine the effectiveness of the REM node in successfully navigating the robot to the designated locations.
(d) Average Response Time (ART): We logged, in the ROS Unix-epoch clock standard, the time a message is sent from the ChatGUI, the time it is received by the LLMNode, and the time the robot responds. With the ART, we compute the average duration from receiving the user's chat command to initiating the robot's movement.

Figure 3 presents our preliminary statistical results obtained from the interaction-data analysis. The top row of Figure 3 shows the performance metrics and the confusion matrix (for selected labels) of the LLMNode. The CRA, with a prediction accuracy of 99.13% (i.e., how often the predicted labels matched the true labels), indicates a high level of accuracy in command interpretation and reflects the robustness of the LLMNode in processing natural language inputs. The OIA, on the other hand, reached 55.20%, indicating room for improvement in our CLIPNode integration. Further, the NSR of 97.96% indicates good performance in the REM node's ability to abstract the high-level understanding from the LLMNode into the robot's actual navigation actions. The overall ART across all the selected commands (refer to the figure at https://osf.io/ufctx) is approximately 0.45 seconds; on average, the robot takes less than half a second from receiving a chat command to initiating movement, which suggests a relatively quick response time for our framework.
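As a worked example of how metrics such as the CRA and ART can be computed from the logged interaction data, consider the sketch below; the file and column names are hypothetical, not the schema of our actual logs.

```python
import pandas as pd

# Column names are hypothetical; the actual interaction logs may differ.
log = pd.read_csv("interaction_log.csv")

# Command recognition accuracy: fraction of rows where the LLMNode's
# predicted label matches the annotated true label.
cra = (log["predicted_label"] == log["true_label"]).mean()

# Average response time: mean gap between the ChatGUI send time and the
# time the robot starts moving (both logged as Unix epoch seconds).
art = (log["robot_start_time"] - log["chat_sent_time"]).mean()

print(f"CRA = {cra:.2%}, ART = {art:.2f} s")
```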
Furthermore, the bottom row of Figure 3 shows the participants' feedback (refer to the questionnaire at https://osf.io/dgbtr). With ratings of 4 and 5 as favourable benchmarks, 80.9% and 76.2% of the participants rated the ease of interaction and the intuitiveness of our framework, respectively, as favourable, while 85.7% were satisfied with the robot's response to their natural language commands.

CONCLUSION AND FUTURE WORK
We introduced a framework that leverages the inherent capabilities of large language models (LLMs) and multimodal vision-language models (VLMs) to enhance human-robot interaction through natural conversation. Our evaluation, based on the logged interaction data and participants' feedback, was overwhelmingly positive. The high command recognition accuracy and effective task execution show that our framework can simplify human-robot interaction. Looking ahead, we aim to refine the framework across several dimensions, not just for ROS-based autonomous robots. The CLIPNode will be improved for broader object recognition, and the LLMNode will be fine-tuned with domain-specific data for better contextual and voice understanding. User experience will be a priority, with a focus on creating a more intuitive and adaptive chat interface.

Figure 2: Overview of our framework's architecture. The LLMNode decodes the natural language conversations. The CLIPNode provides a visual and semantic understanding of the robot's task environment. The REM node abstracts the high-level understanding from the LLMNode to actual robot actions. The ChatGUI serves as the user's primary interaction point. See Subsections 3.1, 3.2, and 3.3 for more details.

Figure 3: Performance and variability measures illustrating CRA, OIA, and NSR (top) and the participants' feedback (bottom) based on the logged interaction data.