Integrating ChatGPT with Blockly for End-User Development of Robot Tasks

This paper presents an End-User Development environment for collaborative robot programming, which integrates OpenAI ChatGPT with Google Blockly. Within this environment, a user who is an expert in neither robotics nor computer programming can define the items characterizing the application domain (e.g., objects, actions, and locations) and specify pick-and-place tasks involving these items. Task definition is achieved through a combination of natural language and block-based interaction, which exploits the computational capabilities of ChatGPT and the graphical interaction features offered by Blockly to check the correctness of generated robot programs and modify them through direct manipulation.


INTRODUCTION
The growing interest in the use of collaborative robots (cobots) to assist human operators in different work scenarios, such as manufacturing, healthcare, and logistics, opens up an important research question about the ease of robot task definition. Cobots are flexible automation technologies, thus suitable for small-batch production [15]; some models are also lightweight and can therefore be moved to different places within a work environment to support location-based tasks [25]. For these reasons, cobots must be quickly and easily programmable by end users, who may be experts in their work domain, but not necessarily in computer and/or robot programming.
The recent survey [1] highlights how end-user robot programming can be much more complex than traditional end-user programming: the created robot programs must in fact refer to physical objects and locations; in addition, the robot must interact with the surrounding environment, including humans, by moving its arm(s) in space and performing actions on the physical objects. Therefore, an important objective of research on end-user robot programming is defining methods that allow end users without expertise in robotics and programming to cope with this complexity. The literature presents methods based on programming by demonstration [14,21], visual programming languages, and natural language programming. Programming by demonstration usually requires many trials, thus making the programming task inefficient. Visual programming languages, in turn, exploit flowchart notation [2], hierarchical trees [20], or block composition [9,26]; in all these cases, programming concepts, such as variables, conditionals, or loops, are visually expressed in the programming environment, but their meaning must be known to the user; in other terms, the user must possess some programming knowledge, especially because the user must conceive the whole program from scratch. The skill-based visual programming language of [22] adopts a higher-level approach than the previous ones: the user can create a sequence of skills, namely robot actions (e.g., "pick object", "navigate to location", "rotate object"), to define the desired robot behavior. However, this approach appears less flexible than those based on the other visual programming languages. Natural language programming has been proposed, for instance, in [4,16,17], and the recent spread of Large Language Models (LLMs) has revived this approach. In [24], the authors explore OpenAI ChatGPT [18] as a potentially versatile tool for robot programming; in their approach, a high-level function library is first created, and ChatGPT can then parse user requests and convert them into a logical sequence of function calls. However, the user must be able to evaluate the code generated by ChatGPT and provide feedback on its quality and safety; the assumption is that the user can understand the generated code and suggest modifications, if needed. Other studies explore the use of LLMs in robotics for task planning [10,23] and control [11], but do not consider the users as the recipients of the generated code.
In this work, we do not simply propose a new, user-oriented programming method for cobots, but an End-User Development (EUD) approach that combines different interaction modalities and can lead the user to generate robot programs in an unwitting manner [5]. EUD subsumes end-user programming [3,12], since it is not simply focused on facilitating programming by end users; it has a wider perspective, in that it aims to empower end users to modify, extend, and create digital artifacts with software environments designed around their own domain concepts and work practices.
In the approach presented in this paper, the user's goal is to define pick-and-place robot tasks. The idea is that the user can describe desired robot tasks to an assistant using natural language (as in the real world), but what the assistant understands is not visualized as program code, as in [24], but through a more intuitive representation. In other words, we propose an EUD environment where the user interacts with a chat interface based on ChatGPT to describe pick-and-place robot tasks, and then verifies and possibly corrects the generated program through an intermediate visualization based on Google Blockly [8]. The selection of ChatGPT is based on a thorough assessment of its computational capabilities [13], its ease of adaptation, and its ease of integration and customization. The use of Blockly for the graphical representation and programming of robot tasks arises from its inherent benefits in fostering an intuitive programming environment, as shown in [9], even though in our case the user does not need to define a Blockly program from scratch. In addition, pick-and-place tasks can be created based on the items (objects, actions, and locations) defined by the users themselves within the EUD environment, thus making it tailorable to different application domains.
The rest of the paper presents the architecture and implementation of the EUD environment. The generated programs can be executed on the DENSO COBOTTA collaborative robot.

ARCHITECTURE
The developed EUD environment is a web application implemented according to a client-server paradigm, with a front-end developed in JavaScript/TypeScript React and a back-end developed in Python Django; an SQLite database is used as the data management system. The application architecture is shown in Figure 1. It encompasses four layers: (1) the User Interface Layer, through which the user can interact either with a chat or with a graphic interface to define pick-and-place tasks through natural language interaction or block-based interaction, respectively; (2) the Task Definition Layer, which constitutes the application logic of the EUD environment, where ChatGPT and Blockly are exploited to process user requests coming from the User Interface Layer; (3) the Data Layer, in which robot programs are stored in the database using a JSON format compatible with Blockly, and parsing functions transform the output of ChatGPT into JSON data that the Blockly engine can recognize; (4) the Execution Layer, where task execution by the robot is managed by means of a program template.
Going into more detail on the user workflow, let us assume that the definition of a new task for the robot begins with interaction with the Chat interface (Step 1 in Figure 1). User messages, conveyed via the chat, are directed to an Adapter, which exploits the ChatGPT API to interpret the user's request (Step 2). This module provides ChatGPT not only with the user's request, but also with specific constraints and instructions, to ensure a precise and deterministic interpretation of the robot task (Step 3). After the task definition is complete and the interaction with the Chat ends, a custom JSON representation of the robot program is generated (Step 4). Following this, parsing functions are employed to convert the custom JSON format used by the Chat into a JSON format that can be interpreted by Blockly (Step 5), enabling the program to be visualised in graphical form (Step 6). At this stage, the user can check the correctness of the program in the Graphic interface (Step 7), and can either confirm or modify it (Step 8). Finally, the Blockly program representation is used by a program template (Step 9); this program executes the elementary robot actions needed to accomplish the task (Step 10), such as moving the arm to a specific position or searching for an object in the working area.
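In code, Steps 1-6 of this workflow amount to a short pipeline. The sketch below is purely illustrative: interpret_request and to_blockly_json are hypothetical stand-ins for the Adapter call and the parsing functions, which are sketched in the Implementation section.

```python
# Illustrative sketch of Steps 1-6; the names are hypothetical, not the
# actual implementation. interpret_request() and to_blockly_json() are
# sketched further below in the Implementation section.

def define_task_via_chat(dialogue: list) -> dict:
    custom_json = interpret_request(dialogue)     # Steps 2-4: Adapter + ChatGPT
    blockly_json = to_blockly_json(custom_json)   # Step 5: format conversion
    return blockly_json                           # Step 6: shown in the Graphic interface
```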
Advanced users of the application can directly define a new robot task through the Graphic interface, starting from Step 7. While this strategy allows a faster definition process, it requires more cognitive effort than using the Chat interface and then simply checking and correcting what the AI has generated.

IMPLEMENTATION
As an example of robot task definition, useful for explaining the workflow in detail, let us present an interaction scenario: the user aims to organize their desk using a collaborative robot. They need to specify one or more items that ought to be picked up, such as the highlighter item, along with the location where to drop it, in this case a box. The user wants the robot to execute a shaking action before dropping the highlighter, to prevent its tip from drying out. Once the items are defined, they can request through the chat to execute the operation and repeat it five times, since five highlighters are present on the desk. Subsequently, the user can check the task created in the graphic interface and run it on the robot.

Item definition
To create a task for the robot, the first step is to define the items, namely the object to pick (highlighter), the location where to place the object (box), and the optional action (shaking) to be executed between the pick and place steps. These items are then saved in the corresponding libraries to be re-used for future tasks.

An object is a tangible entity that the robot can manipulate. To create a new object, the user captures a picture of it using the robot camera and sets all the data required for proper manipulation. For identification purposes, users must provide an object name and, optionally, a set of synonyms to be used by the natural language processing function of the chat. Technical data must also be provided, pertaining to the force required for grasping the object and the approach to the object, i.e., the Z-axis distance the robot arm needs to reach in order to pick up the object. After the photo of the object has been acquired, the image is processed through segmentation, binarization, filtering, and contour search. At execution time, the robot can thus efficiently search for items in the designated area, retrieve their position and orientation, and successfully grasp them. With this procedure, the user can define the "highlighter" object of our running example.

A location refers to a point in space that serves as the target of a "place" operation. Its definition is therefore straightforward: after providing the identification information for the location, the user manoeuvres the robot arm to the intended point and then acquires the specific position. In this way, the user can define the "box" location of our running example.

An action enables the robot to move along a specified path between a "pick" and a "place" step. It consists of a sequence of points (a cloud of points) that the robot arm must reach to follow the defined path. Recording a cloud of points is achieved by guiding the robot arm along the desired path and repeatedly pressing a specific button in the software application to capture significant points of the trajectory. As for objects and locations, identification data for the action must also be provided. In this way, the user can define the "shaking" action of our running example.
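The image-processing pipeline for objects can be illustrated with a short OpenCV sketch; the specific functions and parameters below are illustrative assumptions, not the actual implementation described in the paper.

```python
import cv2

def find_object_contours(image_path: str) -> list:
    """Illustrative detection pipeline: filtering, binarization (as a simple
    form of segmentation), and contour search; parameters are assumptions."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)                # filtering (noise removal)
    _, binary = cv2.threshold(blurred, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)  # binarization
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)    # contour search
    results = []
    for contour in contours:
        if cv2.contourArea(contour) < 100:                     # discard small blobs
            continue
        (cx, cy), (w, h), angle = cv2.minAreaRect(contour)     # position and orientation
        results.append({"center": (cx, cy), "size": (w, h), "angle": angle})
    return results
```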

Task definition and execution
The chat-based interface guides the user through the programming of robot tasks. Figure 2 presents the interaction between the user and our application integrating ChatGPT, considering the scenario introduced above. We specifically chose the gpt-3.5-turbo model, which is recognized as the most powerful model currently available for free trial.
To tailor the behavior of ChatGPT, the model must be provided with clear instructions about its goals and context. As mentioned above, the Adapter is responsible for handling the calls to the ChatGPT API and instructing the model about its goals. All instructions are given as prompts at the beginning of the conversation and are not visible to the user. This specification allows the model to recognize the details required to define a complete robot task. In addition to these instructions, the request from the Adapter includes the JSON data format expected as the output of the model. The desired output is defined by passing a function specification to the ChatGPT Chat Completions API, which takes as a parameter a description of the custom JSON format; in this way, the API responds with a JSON object compliant with the expected format. As for the temperature parameter of the model, the value 0.2 was chosen to obtain fairly deterministic reasoning and prevent creative interpretation of the user's intentions. However, a small degree of creativity was retained to allow the chat to better adapt to the user's requests and exhibit human-like behavior.
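A minimal sketch of such an Adapter call, using the function-calling mechanism of the OpenAI Python SDK, might look as follows; the system prompt and the function schema are simplified placeholders for the paper's actual instructions, which are not reported here.

```python
import json
from openai import OpenAI  # openai>=1.0 SDK; the paper's code may differ

client = OpenAI()

# Simplified placeholder for the task-definition schema (an assumption).
TASK_FUNCTION = {
    "name": "define_task",
    "description": "Record a complete pick-and-place robot task.",
    "parameters": {
        "type": "object",
        "properties": {
            "object": {"type": "string"},
            "action": {"type": "string"},
            "location": {"type": "string"},
            "repetitions": {"type": "integer"},
        },
        "required": ["object", "location"],
    },
}

def interpret_request(messages: list) -> dict:
    """Ask gpt-3.5-turbo to map the dialogue onto the custom JSON format."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,  # mostly deterministic, with a small creative margin
        messages=[{"role": "system",
                   "content": "You help users define pick-and-place robot tasks."}]
                 + messages,
        functions=[TASK_FUNCTION],
        function_call={"name": "define_task"},  # force the structured output
    )
    return json.loads(response.choices[0].message.function_call.arguments)
```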
After the dialogue is concluded, ChatGPT produces a task description in a custom JSON format (see the top right-hand corner of Figure 2). This format was introduced because ChatGPT was not always able to provide output directly compatible with Blockly. The custom JSON data is passed to Python functions, which parse the data and transform it into a Blockly-compatible JSON format.
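The conversion step could be sketched as follows; the custom keys and the block types ("pick", "processing", "place", "repeat") are assumptions based on the description in this paper, while the output envelope follows Blockly's JSON serialization format.

```python
def to_blockly_json(task: dict) -> dict:
    """Convert a (hypothetical) custom task description into Blockly's
    JSON serialization format, chaining steps and wrapping them in a
    Repeat block when the task must be executed several times."""
    steps = [{"type": "pick", "fields": {"OBJECT": task["object"]}}]
    if task.get("action"):
        steps.append({"type": "processing", "fields": {"ACTION": task["action"]}})
    steps.append({"type": "place", "fields": {"LOCATION": task["location"]}})

    # Chain the step blocks via Blockly's "next" links.
    for current, following in zip(steps, steps[1:]):
        current["next"] = {"block": following}

    program = steps[0]
    if task.get("repetitions", 1) > 1:  # wrap the chain in a Repeat block
        program = {"type": "repeat",
                   "fields": {"TIMES": task["repetitions"]},
                   "inputs": {"DO": {"block": steps[0]}}}
    return {"blocks": {"languageVersion": 0, "blocks": [program]}}
```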
Blockly is a library of puzzle-like visual blocks, organized in categories, which allows creating intuitive programming interfaces, thus empowering users to conceptualise and articulate program logic in a highly comprehensible manner. For integration in our EUD environment, Blockly categories and blocks have been properly tailored. The Logic category was created for logic constructs, such as "When" and "Repeat". The Events category encompasses conditions like "Sensor" and "Find" (the former is used to check whether a sensor signal has arrived, while the latter checks whether a specific object is recognized). Steps is the category containing basic robot skills, like "Pick", "Place", and "Processing". Finally, the Objects, Locations, and Actions categories serve as user-defined libraries of items, which can be used like variables without requiring the user to be familiar with the variable concept.
After the generation of the Blockly JSON data, the parsing functions query the database to retrieve the identifiers of the objects, locations, and actions involved in the task defined through the chat, and integrate them into the data structure as metadata. Item identifiers are needed for the subsequent task execution by the robot, to obtain the other technical data (e.g., object photo, location position, etc.). Item search in the database also uses the synonyms specified during item definition. Once the item details have been retrieved, a Blockly-compatible data structure including all the information required for execution is available. Subsequently, the task is displayed in the graphical interface (see the bottom right-hand corner of Figure 2).
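With Django's ORM, the synonym-aware lookup could be sketched as follows, assuming a hypothetical item model with a name field and a comma-separated synonyms field; the actual schema is not reported in the paper.

```python
from django.db.models import Q

def resolve_item_id(model, mentioned_name: str):
    """Find the database identifier of an item mentioned in the chat,
    matching either its name or one of its synonyms (illustrative only)."""
    match = model.objects.filter(
        Q(name__iexact=mentioned_name) |
        Q(synonyms__icontains=mentioned_name)
    ).first()
    return match.pk if match else None
```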
The task can now be executed on the COBOTTA robot. This robot is designed to be lightweight, easily portable, and highly adaptable. It is a single articulated arm with six axes, and its standard configuration includes a camera and a gripper suited for pick-and-place operations. To execute the task, a Python program template is activated, which analyzes the Blockly JSON code and extracts the elements needed to generate an executable sequence of robot instructions. Besides calls to the proprietary DENSO APIs, these instructions cover the objects to pick up, the actions to execute, the locations where the objects should be placed, and the conditions and loops to apply. During execution, the robot camera searches for the objects to be picked up, which are then manipulated by the robot arm and finally placed in the desired location.
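Conceptually, the program template walks the Blockly JSON tree and dispatches each block to an elementary robot action. The sketch below uses a hypothetical robot interface (pick, run_action, place) in place of the proprietary DENSO APIs, and the block types match the illustrative parser above.

```python
def execute_block(block: dict, robot) -> None:
    """Recursively execute a chain of Blockly blocks (illustrative sketch;
    `robot` stands in for a wrapper around the proprietary DENSO APIs)."""
    while block:
        kind = block["type"]
        if kind == "repeat":
            for _ in range(int(block["fields"]["TIMES"])):
                execute_block(block["inputs"]["DO"]["block"], robot)
        elif kind == "pick":
            robot.pick(block["fields"]["OBJECT"])        # camera search + grasp
        elif kind == "processing":
            robot.run_action(block["fields"]["ACTION"])  # replay recorded point cloud
        elif kind == "place":
            robot.place(block["fields"]["LOCATION"])     # move to stored position
        block = block.get("next", {}).get("block")       # follow the block chain
```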

DISCUSSION AND CONCLUSION
The system described here is inspired by the hybrid approach to user-oriented programming presented in [6]. In this paper, we experimented with ChatGPT's capabilities for user intent recognition to create domain-independent robot programs. Our objective was not to generate the final code for the robot based on the user's input, but rather to obtain a task representation that could be processed in our EUD environment, enabling users to evaluate program correctness via Blockly, here tailored to represent pick-and-place robot tasks. In this way, non-expert users can visually check robot programs and modify them through drag-and-drop of the blocks, or through direct manipulation of the variables corresponding to user-defined items.
In the future, we plan to evaluate system usability with the participation of university students not enrolled in computer science courses, who must acquire basic programming skills as part of their curriculum. In this way, we also aim to assess whether our hybrid approach could be employed for educational goals, to improve the learning of imperative programming languages. We are experimenting with the same approach to support pharmacists in the definition of robot tasks for the preparation of personalized medicines [7]; in this case, a block-based visualization simpler than Blockly (and more suitable for the target users) was implemented. Positive qualitative feedback has been collected so far.
As to the limitations of the approach, a first one concerns the complexity of the tasks that can be defined. In the chat, only the generation of tasks without nested logic has been implemented; to define more complex tasks, users must possess adequate computational thinking skills. It is essential to further investigate ways to support greater task complexity while minimizing the required end-user technical expertise. A second limitation concerns the image processing algorithms used for object recognition. These algorithms are quite basic; it would be interesting to explore the new approaches to computer vision that have recently emerged along with LLMs, such as GPT-Vision [19].

Figure 1: The prototype architecture and the collaborative robot COBOTTA by DENSO WAVE Ltd.