ROS-based Robotic Applications Orchestration in the Compute Continuum: Challenges and Approaches

With the adoption of robots growing in several industrial sectors (e.g., logistics, healthcare, agriculture) comes the experience that in "robotic applications" robots are but components of larger distributed systems and, despite their specific requirements and assumptions, should be integrated with the other elements. This paper reports on the main challenges of building distributed robotic applications and discusses different approaches in which such applications are orchestrated and managed in the compute continuum from the Cloud to the Edge of the network.


INTRODUCTION
Cloud Robotics [1], despite relatively slow acceptance by the robotics community, has emerged as an advantageous paradigm that is finding growing adoption in industry. This is due not only to complementary resource aspects 1, but, more importantly, to economic considerations. While robots are still expensive (high Capex) and tend to have expected operational lifetimes on the order of ten years or more, cloud computing resources are billed on demand, per usage (Opex), and, thanks to Moore's law, become more capable every year. Hence, combining robots and cloud computing can help reduce up-front costs for robots (e.g., by buying more, cheaper ones), extend their operational life, and reduce operational effort and costs by sharing cloud services across robots. It is therefore not surprising that one of the largest retailers in the US uses a cloud robotics approach to manage hundreds of warehouses with thousands of robots each 2.
When robotic applications require off-board processing with stricter time constraints, the latency and jitter incurred by using public networks might prevent remote cloud resources from being used effectively. Edge and Fog computing paradigms [5] alleviate this issue by bringing compute resources "closer" to the application, i.e., closer to the robots. As a testimony to the perceived need for a set of best practices and approaches, several companies and technologies are addressing cloud robotics development today 3. A recent work discussing some of these technologies and their readiness is [16].
In this paper, we will discuss the challenges in building distributed robotic applications in the fog, edge, cloud continuum, and the currently available approaches and technological solutions.

ISSUES AND CHALLENGES
The main challenges in building distributed robotic applications today stem from the fact that the assumptions, data representations, communication paradigms, and protocols used in robotics differ extensively from cloud computing (and micro-services) best practices. While there are advantages in combining robotics and cloud computing, there is a sort of "impedance mismatch" that needs to be addressed. Moreover, cloud/edge robotics are relatively new topics and lack universally accepted design patterns [15].

ROS Middleware for Robotic Applications
The Robot Operating System (ROS) is currently the most widely adopted middleware for the design and development of robotic systems. A ROS application is typically composed of distributed processes, ranging from sensor and camera drivers to algorithm implementations and external interfacing and control. The processes (known as ROS nodes) communicate using topics (an asynchronous pub/sub message system), actions (an asynchronous interface with a request-feedback-response architecture), or services (which follow a synchronous request-response pattern).
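To make the three primitives concrete, the following minimal sketch mimics a topic, a service, and an action inside a single Python process. It deliberately does not use the ROS client libraries: the class `MiniBus`, the function `count_to_goal`, and all topic and service names are hypothetical stand-ins, and message delivery here is a plain function call rather than real asynchronous transport.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterator, List


class MiniBus:
    """Toy single-process stand-in for ROS communication (hypothetical, not rclpy)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)
        self._services: Dict[str, Callable[[Any], Any]] = {}

    # Topic: publishers and subscribers are decoupled and many-to-many.
    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, msg: Any) -> None:
        for callback in self._subscribers[topic]:
            callback(msg)

    # Service: one handler per name, the caller waits for the response.
    def advertise_service(self, name: str, handler: Callable[[Any], Any]) -> None:
        self._services[name] = handler

    def call_service(self, name: str, request: Any) -> Any:
        return self._services[name](request)


def count_to_goal(goal: int) -> Iterator[tuple]:
    """Action-like interaction: a goal, a stream of feedback, then one result."""
    for step in range(1, goal + 1):
        yield ("feedback", step)
    yield ("result", goal)


bus = MiniBus()
received: List[Any] = []
bus.subscribe("/scan", received.append)
bus.publish("/scan", {"range_m": 1.5})

bus.advertise_service("/add_two_ints", lambda req: req["a"] + req["b"])
service_reply = bus.call_service("/add_two_ints", {"a": 2, "b": 3})

action_events = list(count_to_goal(3))
```

The sketch only illustrates the interaction shapes; discovery, serialization, and quality-of-service settings, which are where most of the networking issues discussed in this paper arise, are left out.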
ROS aims to enable complex and robust robot behaviours across the widest possible variety of robotic platforms. It achieves this by allowing the reuse of robotics software packages and by providing a hardware-agnostic abstraction layer. On the other hand, from an architectural point of view, ROS still treats the robot as the central point of the system and relies on local computation. This limitation makes it much harder to create large-scale and advanced robotics applications. By creating a cloud-to-edge-to-IoT continuum that meets the needs of such robotic environments, we can lower the hurdle for an application developer to use or extend robots' capabilities.

Networking
ROS1 and now ROS2 stem from years of research and practice in robotics and are shaped by the real-time requirements imposed by controlling physical systems. The initial idea was that, albeit distributed across different processes for failure isolation, ROS nodes would run either on the robot itself or in the same local network. This implies assumptions on latency, bandwidth, and jitter that hardly hold in geographically distributed systems. While ROS1 relied on its own TCP- and UDP-based transports for ROS messages, ROS2 builds on a standard protocol (DDS). However, "DDS as a middleware protocol tends to have scalability and reliability issues when implemented over: i) wireless networks; and ii) non-local area networks" [5]. Modern container orchestration solutions (e.g., K8S) leverage DNS resolution to discover the IP endpoints of virtual service addresses, while DDS does not cope well with dynamically changing IPs. Moreover, the default peer discovery mechanism in DDS has substantial overhead, so much so that it is known to cause issues in its default configuration 4, and it relies on multicast communication, which does not travel well across networks. This is why several technological solutions addressing robotic application networking have been proposed, as we discuss in Section 3.

Optimal and Adaptive Placement
When building distributed robotic applications, a common approach is for ROS nodes to be deployed across robots, edge, and cloud devices according to component, application, and safety requirements (e.g., nodes requiring physical access to sensors will be co-located with the sensors, and safety components will be on-board). This makes it possible to extend ROS beyond robots and to take advantage of its primitives for representing robotic state and messages. However, the mobile nature of robots, or the dynamic nature of workloads, might require adaptive placement of components (and subsequent network re-configuration) in response to changing conditions (e.g., high transmission errors on radio interfaces due to robot location). Many orchestration solutions support the initial deployment of components, but not their adaptive placement; we discuss them in Section 4.
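The requirement-driven placement described above can be sketched as a small greedy procedure. This is only an illustration under simplifying assumptions: the `RosNode` fields, the three tiers, the latency figures, and the core counts are all hypothetical, and a real orchestrator would consider many more constraints.

```python
from dataclasses import dataclass
from typing import Dict, List

TIERS = ["robot", "edge", "cloud"]  # ordered from closest to farthest


@dataclass
class RosNode:
    name: str
    needs_hardware: bool = False    # must sit next to a sensor or actuator
    safety_critical: bool = False   # must stay on-board
    max_latency_ms: float = float("inf")
    cpu_cores: float = 1.0


def place(nodes: List[RosNode],
          latency_ms: Dict[str, float],
          free_cores: Dict[str, float]) -> Dict[str, str]:
    """Pin constrained nodes to the robot; push the rest as far out as
    their latency bound and the remaining capacity allow."""
    remaining = dict(free_cores)
    placement: Dict[str, str] = {}
    for node in nodes:
        if node.needs_hardware or node.safety_critical:
            candidates = ["robot"]
        else:
            candidates = [t for t in reversed(TIERS)
                          if latency_ms[t] <= node.max_latency_ms]
        for tier in candidates:
            if remaining[tier] >= node.cpu_cores:
                remaining[tier] -= node.cpu_cores
                placement[node.name] = tier
                break
        else:
            raise RuntimeError(f"no feasible tier for {node.name}")
    return placement


nodes = [
    RosNode("camera_driver", needs_hardware=True),
    RosNode("emergency_stop", safety_critical=True),
    RosNode("slam", max_latency_ms=20, cpu_cores=4),
    RosNode("fleet_analytics", cpu_cores=2),
]
placement = place(
    nodes,
    latency_ms={"robot": 0, "edge": 10, "cloud": 80},
    free_cores={"robot": 2, "edge": 8, "cloud": 64},
)
```

With these illustrative numbers, the driver and the safety component stay on the robot, SLAM lands on the edge because of its latency bound, and the analytics node goes to the cloud. Adaptive placement would amount to re-running such a decision whenever the measured latencies or capacities change.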

Stateful Components
Many ROS components (e.g., SLAM, arm motion planners, navigation planners) are tied to a single robot and are stateful. This is very different from cloud micro-service development, where RESTful interfaces and stateless components are the current best practice to achieve elasticity, scalability, and resource sharing with load balancing.
One more consequence of components having internal state is that, upon disconnections / message loss, distributed state consistency has to be restored through a reconciliation mechanism.
One alternative would be to modify stateful components so that representational state transfer could be used (e.g., transferring an entire planning scene and the robot kinematics to an arm motion planner). However, this might simply be infeasible in the presence of safety requirements (e.g., preventing collisions when objects in the planning scene move may require high-frequency updates from sensors), or it might require breaking currently monolithic functionality into smaller distributed components (e.g., the collision checker runs on the edge, the motion planner in the cloud). A similar idea, with the purpose of achieving dynamic placement, is proposed in [11].

ROS Boundary and Communication
Another consideration is where the boundary of ROS should be. Should ROS nodes and communication primitives be limited to the robot (or the robot and the edge), while common service intercommunication protocols (e.g., REST over HTTP for synchronous communication, message buses for asynchronous and multi-recipient communication) are adopted for the rest of the application? 5 In that case, ROS communication primitives (i.e., topics, services, actions) could be mapped to common service intercommunication protocols, including by using streaming protocols and components.
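One possible shape of such a mapping is sketched below. The envelope format, the channel naming scheme, and the `/services/...` path layout are assumptions made purely for illustration; they do not correspond to rosbridge or to any other existing bridge.

```python
import json
from typing import Any, Dict

# Hypothetical correspondence between ROS primitives and service protocols.
PRIMITIVE_MAPPING = {
    "topic": "message bus (asynchronous, multi-recipient)",
    "service": "REST over HTTP (synchronous request-response)",
    "action": "HTTP request to start a goal, streamed feedback and result",
}


def topic_to_bus_message(topic: str, msg: Dict[str, Any], stamp_s: float) -> Dict[str, str]:
    """Wrap a ROS-style message in a JSON envelope for a generic message bus."""
    return {
        "channel": topic.strip("/").replace("/", "."),
        "payload": json.dumps({"stamp": stamp_s, "data": msg}, sort_keys=True),
    }


def service_to_http_request(service: str, request: Dict[str, Any]) -> Dict[str, str]:
    """Describe a ROS-style service call as an HTTP POST (illustrative layout)."""
    return {
        "method": "POST",
        "path": "/services/" + service.strip("/"),
        "body": json.dumps(request, sort_keys=True),
    }


bus_message = topic_to_bus_message("/robot1/odom", {"x": 1.0, "y": 2.0}, stamp_s=12.5)
http_request = service_to_http_request("/robot1/set_mode", {"mode": "dock"})
```

A JSON envelope keeps the example readable; for large payloads such as images or point clouds, a binary encoding or a streaming protocol would be a more natural fit.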

Monitoring and Updates
The locality assumption of inter-node ROS communication has the consequence that some topics are high frequency and transmit data even when there is little or no state change (e.g., the ROS transformations topic /tf). In a geographically distributed system, this should be dynamically configurable to minimize bandwidth usage. Notable cloud robotics solutions focusing on monitoring are, for instance, Formant 6 and Freedom Robotics 7.
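As an illustration of how such traffic could be reduced before it leaves the local network, the sketch below forwards a planar pose only when it has changed noticeably or when nothing has been sent for a while. The class `ChangeFilter` and its thresholds are hypothetical and independent of any ROS API.

```python
import math
from typing import Optional, Tuple

Pose = Tuple[float, float, float]  # x [m], y [m], yaw [rad]


class ChangeFilter:
    """Forward a pose only if it changed noticeably or the link was silent too long."""

    def __init__(self, min_translation_m: float = 0.01,
                 min_rotation_rad: float = 0.01,
                 max_silence_s: float = 1.0) -> None:
        self.min_translation_m = min_translation_m
        self.min_rotation_rad = min_rotation_rad
        self.max_silence_s = max_silence_s
        self._last_pose: Optional[Pose] = None
        self._last_time_s = 0.0

    def should_forward(self, pose: Pose, now_s: float) -> bool:
        if self._last_pose is None:
            forward = True  # always send the first sample
        else:
            dx = pose[0] - self._last_pose[0]
            dy = pose[1] - self._last_pose[1]
            dyaw = pose[2] - self._last_pose[2]
            dyaw = abs(math.atan2(math.sin(dyaw), math.cos(dyaw)))  # wrap to [0, pi]
            forward = (math.hypot(dx, dy) >= self.min_translation_m
                       or dyaw >= self.min_rotation_rad
                       or now_s - self._last_time_s >= self.max_silence_s)
        if forward:
            self._last_pose, self._last_time_s = pose, now_s
        return forward


# A robot standing still for 2 s while publishing at 100 Hz.
tf_filter = ChangeFilter()
forwarded = sum(tf_filter.should_forward((0.0, 0.0, 0.0), i / 100) for i in range(200))
```

For the stationary robot in this example, only the first sample and one keep-alive sample pass the filter. Making the thresholds configurable at run time is what would allow an orchestrator to trade freshness against bandwidth.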

Storage
Robots interact with their environment by consuming, producing, and analyzing several data sources and data types that vary in structure, format, and size. Examples of data produced and consumed by robotic applications are a map of an area to navigate, rosbags storing message data from specific ROS topics, and sensor data such as pictures or point clouds. Isolated robots would store the data on local disk and in memory. However, this is not a viable solution for large amounts of data and distributed applications in the continuum [4]. External databases hosted at the edge or in the cloud can help distributed applications store and retrieve data. Object storage, key-value storage, and time series databases may be adopted. Open questions for robotic applications are [4]: How can data be stored and accessed in conjunction with context? How can data gathered from various sources be dynamically combined with context to deduce application-specific information?

Overlaps
ROS has its own process management (orchestration) mechanism called roslaunch, which can be used to start multiple ROS nodes in parallel. ROS2 introduced managed nodes so that ROS orchestration can "ensure that all components have been instantiated correctly before it allows any component to begin executing its behaviour" 8. This is a powerful feature that can, for instance, simplify safety considerations (e.g., stop robot motors if range sensors are not publishing data). On the other hand, this robot-specific orchestration needs to be either integrated with the distributed application orchestration, which might have its own health check control loops as in K8S, or limited to on-board nodes.
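The idea behind managed nodes can be illustrated with a reduced state machine. The sketch below keeps only the four primary lifecycle states and the main transitions, omits the intermediate transition states and error processing, and does not use the ROS2 lifecycle API; the `healthy` flag and the `bring_up` helper are assumptions introduced for the example.

```python
from typing import List

# Primary states and transitions of a managed node (simplified).
TRANSITIONS = {
    ("unconfigured", "configure"): "inactive",
    ("inactive", "cleanup"): "unconfigured",
    ("inactive", "activate"): "active",
    ("active", "deactivate"): "inactive",
    ("unconfigured", "shutdown"): "finalized",
    ("inactive", "shutdown"): "finalized",
    ("active", "shutdown"): "finalized",
}


class ManagedNode:
    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy  # hypothetical: whether configuration succeeds
        self.state = "unconfigured"

    def trigger(self, transition: str) -> str:
        key = (self.state, transition)
        if key not in TRANSITIONS:
            raise ValueError(f"{transition!r} not allowed in state {self.state!r}")
        if transition == "configure" and not self.healthy:
            return self.state  # failed configuration: stay unconfigured
        self.state = TRANSITIONS[key]
        return self.state


def bring_up(nodes: List[ManagedNode]) -> List[str]:
    """Configure every node first; activate only if all of them succeeded."""
    for node in nodes:
        node.trigger("configure")
    if all(node.state == "inactive" for node in nodes):
        for node in nodes:
            node.trigger("activate")
    return [node.state for node in nodes]


ok_states = bring_up([ManagedNode("range_sensor"), ManagedNode("motors")])
failed_states = bring_up([ManagedNode("range_sensor", healthy=False), ManagedNode("motors")])
```

In the second call the motors are never activated because the range sensor failed to configure, which mirrors the safety example in the text. A cluster-level orchestrator with its own health checks would need to be aware of these states to avoid fighting the on-board logic.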

COMMUNICATION PARADIGMS
Several solutions for robotic application networking have been proposed to cope with the limitations of the DDS protocol in ROS2. The first and foremost limitation of DDS arises when ROS2 devices sit in different networks and cannot reach each other. This can happen when they have neither a public nor a static IP address, or when they are behind the NAT of a Wi-Fi router. In these cases DDS cannot perform auto-discovery. To cope with this issue, eProsima proposed the use of a DDS Router. A DDS Router, based on the Fast DDS middleware, is deployed on an edge device of each local network to enable communication between geographically separated DDS networks. The DDS Router can route traffic from one network to the other through WAN communication, and it also provides built-in ROS2 topic filtering (e.g., to create secure fleets of robots or to specify the ROS2 interface that a robot exposes to the outside world). Vulcanexus 9 is a ROS2 all-in-one tool set that supports Fast DDS and includes the DDS Router.
A widely used alternative is the use of Virtual Private Networks (VPNs) to establish secure communication among robots and between robots and the cloud (e.g., using WireGuard). However, establishing a VPN link between a robot and the cloud can be a cumbersome process. To cope with this, web-based solutions (e.g., Husarion VPN 10 and the Husarnet network) have appeared on the market to ease the work of robotic application developers. FogROS [10] and FogROS2 [9] are examples of VPN-based solutions that also try to automate certificate generation for the VPN.
FogROS2-SGC [2] is an extension of FogROS2 that addresses the challenge of effectively connecting robot systems across different physical locations, networks, and Data Distribution Services (DDS). The key to this is the use of globally unique and location-independent identifiers that enable secure and efficient routing of data between robotics components. Moreover, FogROS2-SGC is agnostic to the ROS2 distribution and configuration, is compatible with non-ROS2 software, and seamlessly extends existing ROS2 applications without any code modification. Another adopted solution is rosbridge [3], available for both ROS1 and ROS2, which allows non-ROS software to interact with ROS nodes. It can also be used to bridge two non-compatible and remote ROS applications when used in conjunction with rosduct [6]. However, rosduct and rosbridge have significantly high message latency when the message size is large (e.g., images).
8 http://design.ros2.org/articles/node_lifecycle.html
9 Vulcanexus, the all-in-one ROS2 tool set: https://vulcanexus.org/
10 https://husarion.com/software/os/vpn-access/
Finally, Zenoh [7] is a protocol and suite of tools for data sharing and communication in distributed systems. It aims to provide a unified approach to data sharing and communication, regardless of the underlying hardware, network topology, or programming languages used. Zenoh offers several plugins for interacting with multiple protocols. Among them, both ROS1 and ROS2 middleware are supported to enhance peer-to-peer connectivity through discovery-efficient pub-sub communications in ROS applications over the continuum. In terms of performance, an analysis was presented in [17] comparing the latency and throughput of FastDDS, CycloneDDS, Zenoh, and MQTT with ROS messages under different network setups, including Ethernet, Wi-Fi, and 4G. The results show that under Ethernet, CycloneDDS has minimal latency and throughput, due to its UDP multicast mechanism. Under Wi-Fi and 4G, Zenoh has better performance.
As a drawback, Zenoh is integrated with CycloneDDS, but it is not compatible with other DDS implementations. However, in September 2023 Open Robotics, as the main maintainer of ROS, selected Zenoh as an alternative middleware for ROS 2: "The research has concluded that Zenoh best meets the requirements, and will be chosen as an alternative middleware" 11.

ORCHESTRATION APPROACHES

Robot-First
FogROS2 [9] is an extension of the original work from some of the same authors [10] "to support ROS 2 applications, transparent video compression and communication, improved performance and security, support for multiple cloud-computing providers, and remote monitoring and visualization" [9]. The authors provide evidence of vastly improved performance achieved by combining robotics and cloud computing with FogROS2 in three different use cases (SLAM, grasp planning, motion planning). FogROS2 adds a ROS2 image transport mechanism that leverages video compression (H.264), and hence inter-image information, for improved performance; it drops the use of proxies for robot-cloud communication in favor of VPNs; and it supports multiple cloud providers.
As far as orchestration is concerned, FogROS2 leverages ROS2 Python launch files, extending the "Node" class with a "CloudNode" and adding a "machine" attribute that specifies the placement of the node on a virtual machine to be spawned on a cloud provider. This is the same philosophy as in the original FogROS paper, and the authors acknowledge it as a deliberate choice, in contrast, for instance, with the "cloud-first" approaches of Rapyuta Robotics [8] and AWS Greengrass 12. They claim that the FogROS2 approach is "simplified to standard ROS2", while cloud-first approaches require "interfacing with a proprietary library". Still, even though FogROS2 is an open source ROS2 library, launch files with explicit placement of components and spawning of virtual machines 13 have to be created (which is comparable to, if not more work than, installing an agent or interfacing with a library). Much more importantly, a "robot-first" approach to orchestration is inherently not scalable for larger (e.g., multi-robot) applications. In our opinion, this limits the benefits of FogROS2 to smaller, robot-driven robotic applications.
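The robot-first style of launch description can be mimicked with a few plain Python classes. The sketch below is not the FogROS2 API: `Machine`, `Node`, `CloudNode`, and `placement_plan` are hypothetical stand-ins that only reproduce the idea of attaching a "machine" attribute to individual nodes, and the package, executable, and instance names are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Machine:
    """Hypothetical description of a cloud virtual machine to be spawned."""
    provider: str
    region: str
    instance_type: str


@dataclass
class Node:
    package: str
    executable: str


@dataclass
class CloudNode(Node):
    machine: Machine  # explicit, per-node placement


def placement_plan(nodes: List[Node]) -> Dict[str, List[str]]:
    """Group executables by where the launch description pins them."""
    plan: Dict[str, List[str]] = {}
    for node in nodes:
        target = "robot"
        if isinstance(node, CloudNode):
            m = node.machine
            target = f"{m.provider}:{m.region}:{m.instance_type}"
        plan.setdefault(target, []).append(node.executable)
    return plan


vm = Machine(provider="example-cloud", region="region-1", instance_type="gpu-small")
launch_description = [
    Node(package="demo_drivers", executable="camera"),
    CloudNode(package="demo_slam", executable="slam", machine=vm),
    CloudNode(package="demo_grasp", executable="grasp_planner", machine=vm),
]
plan = placement_plan(launch_description)
```

Every placement decision is written by hand into the launch description of a single robot, which illustrates why this style becomes laborious once many robots and shared cloud components are involved.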

Edge-First
ROS with Kubernetes. Several practitioners have discussed or published software to integrate ROS with Kubernetes (K8S). The most prominent work is probably that of Tomoya Fujita from Sony [14], who gave several recorded presentations on the ways they integrated their robots with a K8S cluster. The main approach he proposes is to use each robot as a K8S node (i.e., a server in the K8S cluster) upon which pods (aggregations of containers) can be scheduled with the standard K8S control API and tools (e.g., kubectl).
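When robots are K8S nodes, node labels can be used to restrict a workload to a subset of the fleet. The sketch below builds a DaemonSet manifest as a Python dictionary and serializes it to JSON, so that one pod runs on every robot carrying a given label. The label key `robot.example.com/camera`, the workload name, and the container image are hypothetical; only the manifest fields themselves are standard K8S.

```python
import json
from typing import Any, Dict


def camera_daemonset(camera_label: str, image: str) -> Dict[str, Any]:
    """DaemonSet that runs one camera pod on every node labelled with the given camera."""
    labels = {"app": "ros-camera"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "ros-camera"},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Hypothetical label key; schedules only on matching robots.
                    "nodeSelector": {"robot.example.com/camera": camera_label},
                    "containers": [{"name": "camera", "image": image}],
                },
            },
        },
    }


manifest = camera_daemonset("orbbec-astra", "registry.example.com/ros-camera:latest")
manifest_json = json.dumps(manifest, indent=2)
```

Assuming the robots have been labelled beforehand (e.g., with `kubectl label nodes <robot-node> robot.example.com/camera=orbbec-astra`), the JSON output can be saved to a file and applied with `kubectl apply -f`. Access to the physical camera device and DDS traffic between pods would still need additional configuration that is not shown here.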
The orchestration approach is extremely elegant as it can leverage all the features offered by K8S, for instance labelling, to deploy a specific workload on a specific set of robots (e.g., deploy a ROS camera node on all robots using the Orbbec Astra camera), or automatic restarting of components on failure (with K8S health checks), greatly simplifying the deployment and update of software on hundreds of robots concurrently. Although making a robot part of a K8S cluster could be considered an overhead, node-side K8S components are generally lightweight enough to run on constrained hardware (e.g., a Raspberry Pi). The current issues with this approach lie in the indirect way in which K8S-managed containers acquire access to physical devices 14, which is more convoluted than running ROS components directly on a robot or even in a container, and in the added complexity of managing intra-cluster ROS-specific communication (e.g., DDS over UDP across K8S nodes) and the robot-cluster communication that is in any case required to add a robot as a node.

KubeROS. Zhang et al. propose KubeROS [18], an automated platform for the deployment of multi-robot ROS2 applications as a solution for inadequate on-board resource requirements and lifecycle management of robot software. KubeROS abstracts on-board computing devices, edge, and cloud as a unified infrastructure for developers. Unlike FogROS [10] and FogROS2 [9], KubeROS provides automatic deployment of entire applications based on a deployment configuration in which developers can state requirements and preferences, and it selects the appropriate resource type based on network conditions and resource availability on top of a K8S cluster.
KubeROS follows an "edge-first" approach, with the main cluster sitting on the edge, and allows users to set "preference" parameters to specify software placement (i.e., where the software should be deployed) on robot, edge, or cloud. A "resource manager" is provided to store the hardware specifics and provide the essential system state for software orchestration. KubeROS's own scheduler matches the requirements and preferences with the system state for K8S node assignment.
The user can manually run "kubeadm join" or upload SSH public keys to KubeROS for automated registration of new robots or hardware. KubeROS also provides an interface to upload cluster certificates and service tokens for connecting to third-party clusters. One disadvantage of KubeROS is its complex hardware and network setup, which needs to be handled by a system administrator. For security, the edge and on-board resources should be in an isolated private network. The communication between the private network and the cloud is established through one of two mechanisms, depending on which cluster hosts the resources. Own cluster (e.g., cloud): a network administrator can set up communication between the cloud and the local network through a secure Virtual Private Network (VPN). Third-party cluster: KubeROS provides KubeROS-Bridge over gRPC, which sets up a gRPC client in the third-party cluster and a gRPC server in the main cluster to encode messages.
14 See for instance https://github.com/fujitatomoya/ros_k8s/issues/17

Cloud-First
Rapyuta.io [8] was the first ROS-specific Platform-as-a-Service on the market. Based on K8S (more precisely, Red Hat OpenShift), it supported from the start the core functionalities required to deploy software across robots and cloud devices, namely: a device (or fleet) manager, a set of networking solutions to bridge devices and locations (now also leveraging the eProsima DDS Router), and orchestration based on K8S and SaltStack [12].
AWS Greengrass and AWS RoboMaker were the first AWS solutions to, respectively, orchestrate local processing on cloud-connected IoT devices and leverage AWS resources in robotic simulation. More recently, AWS IoT RoboRunner 15 focused on "integrating robot systems from multiple vendors and building fleet management applications".
Robolaunch 16 is a recently released, publicly available cloud robotics platform. Also based on K8S, it leverages more advanced K8S features for device management (e.g., custom resources and operators), and supports the orchestration of distributed applications spanning robots and cloud, including GPU acceleration to support AI workloads and virtual desktops as in [13].
Other, more limited approaches to support live-video streaming, remote teleoperations, deployment and configuration management, observability, log ingestion, and health monitoring are for instance provided by Transitive Robotics 17 and Freedom Robotics.

Adaptive
Adaptive orchestration modifies the placement of components (or distributes their functionality, in the case of ElasticROS) and/or reconfigures networking in response to variations in deployment conditions or QoS constraints.
ElasticROS. ElasticROS [11] proposes a novel cloud robotics approach building on FogROS [10] for "algorithm-level" collaborative computing: algorithm execution (e.g., what would be computed within a single ROS node) is distributed across devices, rather than following the standard approach of distributing ROS nodes across robots and edge/cloud. The ElasticAction algorithm governs the elastic interaction between the robot and the server, dynamically adjusting its parameters to accommodate the changing conditions the robot encounters. A Press-Elastic-Release node splits one function node into a "press" and a "release" node. Release nodes are deployed in the cloud, and press nodes are deployed on the robot. The advantage of this approach, according to the authors, is that ElasticROS manages CPU usage: when there is insufficient CPU capacity available on the robot, it transmits more data to the cloud for computation; conversely, when there is sufficient CPU capacity, it adjusts its policy and performs more calculations on the robot while reducing data transmission. The authors report that ElasticROS significantly outperforms conventional and current approaches in three robotic tasks: SLAM, grasping, and human-robot dialogue.
Despite the promising results, the authors only describe single-robot scenarios and do not discuss scenarios in which multiple robots compete for or share resources.
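The CPU-driven trade-off described for ElasticROS can be illustrated with a simple fixed-threshold rule. This heuristic is an assumption made here for illustration and is not the authors' ElasticAction algorithm; the function name, the thresholds, and the step size are all hypothetical.

```python
def next_offload_fraction(current: float,
                          cpu_load: float,
                          link_utilisation: float,
                          step: float = 0.1,
                          cpu_high: float = 0.8,
                          cpu_low: float = 0.5,
                          link_high: float = 0.9) -> float:
    """Return the share of computation (0..1) to run in the cloud next.

    cpu_load and link_utilisation are fractions in [0, 1] measured on the robot.
    """
    if cpu_load > cpu_high and link_utilisation < link_high:
        proposed = current + step   # robot CPU is saturated: offload more
    elif cpu_load < cpu_low or link_utilisation >= link_high:
        proposed = current - step   # spare CPU or congested link: compute locally
    else:
        proposed = current          # inside the comfort band: keep the split
    return round(min(1.0, max(0.0, proposed)), 3)


busy_robot = next_offload_fraction(0.5, cpu_load=0.95, link_utilisation=0.3)
idle_robot = next_offload_fraction(0.5, cpu_load=0.2, link_utilisation=0.3)
congested_link = next_offload_fraction(0.5, cpu_load=0.95, link_utilisation=0.95)
```

Called periodically, such a rule moves the split point step by step. Extending it to several robots sharing the same cloud resources would require coordinating these per-robot decisions, which is the gap noted above.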

Multi-agent Orchestration Approach in the Continuum. A research direction for orchestration in the compute continuum is the adoption of multi-agent systems to distribute the functionality of an application and its deployment among several agents, either collaborating or competing under the guidance of an orchestrator. Orchestration in the computing continuum can thus be modelled as a hierarchical network of intelligent, autonomous agents that manage resources in a decentralized manner. According to their place in the hierarchy, agents are given specific responsibilities and objectives for managing specific resources, i.e., infrastructure, network, or application elements. Agents can also play the role of multiple stakeholders taking part in the continuum's operation. Handling the synergies arising among the stakeholders and developing mechanisms that can facilitate their coordination is the big challenge that orchestration architectures will have to face in the cloud continuum. Moreover, in complex systems, a multi-agent system needs to autonomically reorganize itself to adapt and evolve in response to changes in the participating agents or in the external environment (e.g., using AI and ML solutions).
NEPHELE 18 is a research and innovation action (RIA) project that follows the aforementioned approach. NEPHELE's vision is to enable the efficient, reliable and secure end-to-end orchestration of hyper-distributed applications, such as robotic applications, over a programmable infrastructure that spans the compute continuum from IoT to edge to cloud. To reach this goal, interoperability barriers in the convergence of IoT technologies with cloud and edge computing orchestration platforms are to be removed. An IoT and edge computing multi-layer software stack, called a virtual object stack (VOStack), is proposed for leveraging the virtualization of IoT devices at the edge part of the infrastructure. A physical convergence layer in the VOStack enables registration, bootstrapping, authentication, networking and protocol interoperability, supporting various communication protocols between Virtual Objects and IoT devices (e.g., HTTP, MQTT, CoAP). This would enable ROS application lifecycle management from a remote non-ROS environment and, more generally, communication between the ROS world and other application components that do not necessarily adopt ROS. This can be achieved, for instance, with solutions like Zenoh or ROS-based MQTT client implementations 19. For an overall orchestration solution over the Cloud-to-Edge continuum, a synergetic meta-orchestration framework will manage the coordination between cloud and edge computing orchestration platforms through high-level scheduling supervision and definition. A "system of systems" management approach will be adopted for coordinating and assigning responsibilities to cloud and edge computing and networking orchestration managers.
18 NEPHELE project: https://nephele-project.eu/, last accessed: September 19th, 2023
Using a microservices-based approach, hyper-distributed applications are composed of independently deployable application components that can be orchestrated at the cloud or the edge part of the continuum. The synergetic meta-orchestrator is responsible for activating the appropriate orchestration modules to efficiently manage the deployment of the application components across the continuum. To this end, it interacts with a set of further components to coordinate the management and orchestration of distributed compute and network resources via the Federated Resources Manager and the Computing Continuum Network Manager, as well as the enforcement of AI-assisted orchestration mechanisms in the various parts of the compute continuum (see Figure 1).
Other Adaptive Approaches. In [5], the authors present the design and preliminary experiments of an edge robotics system in which the orchestration and control system manages the "full" application life cycle, i.e., it also handles QoS violations due to degraded networking caused by robot mobility by re-configuring the placement of network components. This is a more advanced functionality than what is found in common cloud orchestration. Similarly to the approach of NEPHELE, specialized algorithms are required "in order to select the appropriate host(s) for instantiating or migrating the components based on constraints defined by the application, such as latency, computing requirements, availability of resources and services". This includes networking components and their (re)configuration.
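A minimal sketch of such a control loop is given below: a watchdog flags a sustained latency violation, and a selection function then picks a new host that satisfies the application constraints. The class and function names, the window length, and the host data are hypothetical; the algorithms in [5] and in NEPHELE are not reproduced here.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class LatencyWatchdog:
    """Report a QoS violation only when a whole window of samples exceeds the bound."""

    def __init__(self, max_latency_ms: float, window: int = 3) -> None:
        self.max_latency_ms = max_latency_ms
        self.samples: Deque[float] = deque(maxlen=window)

    def observe(self, latency_ms: float) -> bool:
        self.samples.append(latency_ms)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.max_latency_ms for s in self.samples)


def select_host(hosts: List[Dict], max_latency_ms: float, cpu_cores: float) -> Optional[str]:
    """Pick the lowest-latency host that satisfies the latency and compute constraints."""
    feasible = [h for h in hosts
                if h["latency_ms"] <= max_latency_ms and h["free_cores"] >= cpu_cores]
    if not feasible:
        return None
    return min(feasible, key=lambda h: h["latency_ms"])["name"]


watchdog = LatencyWatchdog(max_latency_ms=20, window=3)
violations = [watchdog.observe(sample) for sample in (25, 30, 22, 10)]

hosts = [
    {"name": "cloud-a", "latency_ms": 80, "free_cores": 64},
    {"name": "edge-1", "latency_ms": 12, "free_cores": 2},
    {"name": "edge-2", "latency_ms": 15, "free_cores": 8},
]
new_host = select_host(hosts, max_latency_ms=20, cpu_cores=4)
```

Requiring a full window of violations avoids migrating a component because of a single latency spike. In this example the component would move to `edge-2`, the only host that satisfies both the latency and the compute constraint.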

DISCUSSION AND CONCLUSION
In this paper we discussed outstanding challenges in the development and orchestration of ROS-based robotic applications in the compute continuum. Since no established best practice or pattern has yet emerged, the current orchestration landscape sees different proposals taking a robot-, edge-, or cloud-first approach (see Figure 2). They are, however, limited in that they consider only the "deployment" of an application, not its "full life-cycle management", which requires dynamic adaptation. While adaptive orchestration approaches for both compute and networking (including radio access networks) are starting to emerge in academia (see Section 4.4), it is worth making some considerations with respect to their assumptions and expected benefits. The first thing to notice is that hyper-distributed applications are based on what used to be called an "open world assumption" in the Web services literature, that is, the expectation that economic and technological "protocols" will be in place to allow application providers to orchestrate and manage resources across compute and networking infrastructure owned by third parties. On the other hand, even if this were not to happen in the near future, many of the technological solutions proposed in adaptive approaches can find direct application in scenarios that combine public cloud with privately owned edge and network resources (including private 5G installations).
Another important aspect in this regard is the more prominent role that orchestration could play in robotic applications. In robotics, orchestration is typically identified with the parallel execution of ROS nodes (i.e., roslaunch). Robots are "started" and expected to do their job. Task management and assignment components are typically included in fleet management solutions for single-purpose robots: tasks are sent to robots that are "ready" and can be reassigned. However, the orchestration "of the full application life-cycle" in distributed robotic applications can impact several aspects that transcend the simple horizontal scaling and health checks of cloud computing. We already mentioned dynamic network reconfiguration and dynamic component placement or even splitting, but even more advantages can come from integrating robotic considerations with the orchestration logic all the way up to the application logic. For instance, one could leverage orchestration to optimize battery recharge cycles across a robotic fleet, implement task assignment rotation to prevent task-specific wear in robots, or, in a not-so-distant future, use fewer multi-purpose robots, enabling only the minimal set of components required to perform the required tasks and thereby minimizing energy consumption.
We can therefore conclude that, although a lot of progress has been made in recent years, research and innovation efforts are still needed from industry and academia to reach the full adoption potential of robotic applications in the cloud continuum.

Figure 2: Current orchestration approaches with their scope