Reinforcement learning in autonomous multi-vehicle systems: A structured review

Unmanned autonomous vehicles are becoming increasingly important in various application domains, such as manufacturing, logistics, communication, and the military. They are typically designed to navigate a specific environment, such as land, air, or water. However, many of the corresponding tasks require multiple vehicles to collaborate in a coordinated manner. Since these vehicles operate in dynamic environments, algorithms based on Reinforcement Learning (RL) are particularly well suited to achieve a high level of coordination. In this paper, we review the scientific literature that applies RL to control and coordinate multi-vehicle systems. A classification scheme is developed to analyze relevant articles with regard to various aspects of the application context and the technical implementations of the RL algorithms. Based on the results, we delineate the current state of research, identify current trends, and propose future research avenues in this field.


INTRODUCTION
In recent years, the technology of unmanned autonomous vehicles has advanced rapidly. They are expected to play an integral role in a wide range of application areas such as manufacturing, warehousing, agriculture, security, and the military ([7, 39]). Depending on their application, unmanned vehicles need to navigate on land, on water, or in the air, which they achieve through numerous sensors that collect information on the immediate surroundings. Importantly, many application domains require multi-vehicle systems to complete given tasks. These vehicle fleets need to be controlled and coordinated simultaneously in dynamic environments. Various control mechanisms have been developed to achieve (coordinated) navigation and control. Approaches based on Reinforcement Learning (RL) are particularly suited for this task since they can learn without any prior knowledge of the system, do not require system modeling, and can be adapted to both single- and multi-agent optimization.
In the literature, several RL algorithms have been proposed for complex multi-vehicle coordination. They address a multitude of applications and have various methodological underpinnings. While several literature reviews exist in this area, they are always restricted to a specific environment (land, air, or water) and a specific problem, such as scheduling and routing for land-bound vehicles ([6]) or path planning for airborne vehicles ([16]). No comprehensive overview exists that captures the state of research across the different environments and problem types. Moreover, a specific multi-vehicle perspective, although posing many additional challenges compared to a single-vehicle problem, is rare.
We address this gap in the literature by means of a structured review of RL-based autonomous multi-vehicle systems. The objective of this article is to provide a detailed picture of the current state of research and identify future research directions in this field. To this end, a two-dimensional classification scheme is developed, which allows papers to be categorized according to their application context and the technical implementation of the approach. Within the application context, several subcategories are investigated, i.e., the environment in which the autonomous vehicle operates, the application area, and the primary problem that is addressed. The technical implementation, on the other hand, is concerned with the coordination mechanisms of vehicles and the considered action spaces.
Our findings reveal a rapidly growing interest in the scientific community regarding RL-based autonomous multi-vehicle systems. Moreover, they provide a comprehensive overview of this field from an application and a technical perspective. For instance, the results show that the current research focus lies on airborne vehicle fleets, which are characterized by very diverse application possibilities. While land-bound vehicle systems are also a very active research stream, they are much more focused on specific applications, mostly in commercial contexts. From a technical perspective, the results highlight that inter-vehicle coordination is achieved by combining varying degrees of information sharing with various reward types. Moreover, we find that recent papers have begun to study systems comprised of multiple vehicle types (e.g., land-bound and airborne vehicles) and increasingly address multiple problems simultaneously (e.g., path planning and scheduling).
The remainder of this paper is structured as follows. In Section 2, we introduce the theoretical foundation of this work. Section 3 describes the applied methodology. In Section 4, we present our analysis before discussing the results in Section 5. Concluding remarks are provided in Section 6.

CONCEPTUAL FOUNDATIONS

Unmanned Autonomous Vehicles
An unmanned autonomous vehicle is defined as a vehicle that operates in the absence of a human supervisor (neither directly nor remotely) and, thus, has to monitor its own state autonomously, spot potential system faults, and react appropriately ([7]). While this definition is independent of the application area and covers vehicles operating on land, in and on water, and in the air, conventions exist in the literature to use distinct terms to differentiate between these modes. Vehicles operating on land are commonly named automated guided vehicles (AGVs) or autonomous mobile robots (AMRs). According to Fragapane et al. [7], they differ in their guiding systems: AGVs are only able to follow predefined paths, as they rely on mechanical, optical, inductive, Cartesian-coordinate, inertial, or laser guidance, while AMRs leverage modern computer vision systems to navigate. Waterborne vehicles are referred to as unmanned surface vehicles (USVs), while airborne vehicles are commonly termed unmanned aerial vehicles (UAVs). The adjunct fixed-wing is used to delineate fixed-wing aerial vehicles from rotary-wing aerial vehicles.
Notably, vehicles exist that are autonomous but not unmanned (e.g., self-driving cars) or unmanned but not autonomous (e.g., remote-controlled surface vehicles ([36])). In any case, these systems typically focus on singular vehicles, thereby disregarding inter-vehicle coordination. The work at hand, however, is concerned with multi-vehicle systems that are autonomous and unmanned, their coordination, and distinct application areas.

Reinforcement Learning
Machine Learning is a hypernym for the three main methodological approaches of computer learning: Supervised Learning, Unsupervised Learning, and Reinforcement Learning, which are separated by their exposure to ground-truth knowledge ([33]). This work focuses on Reinforcement Learning (RL), which uses an oracle reward function to assess the behavior of an agent that acts in an environment, with the aim of iteratively improving a policy. RL imitates human trial-and-error tabula rasa learning ([33]), i.e., it learns without prior knowledge. It is concerned with problems that are modeled by states, which are periodically observed ([26]). The set of all possible states defines the state space, which forms the decision basis. At each time step, an agent chooses an action (decision), upon which the system returns a reward (decision assessment), i.e., feedback generated through an oracle reward function, and transitions to the next state. The next state depends solely on the current state and chosen action, not on any other state or action from the history. In other words, RL assumes a Markov decision process for the underlying optimization problem ([33], p. 66). The set of all possible actions for each state defines the action space. A policy in RL (i.e., a stationary policy) is a probability mass or density function that assigns to each state a probability of selecting each action. RL is concerned with finding the best stationary policy for a given Markov decision process.
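These concepts can be made concrete with a toy sketch (our own minimal illustration, not drawn from any of the surveyed papers): a five-state corridor as the Markov decision process, learned with tabular Q-learning, from which a greedy policy is then derived.

```python
import random

random.seed(0)

# Toy Markov decision process: five states on a line, actions 0 = left, 1 = right.
# The agent starts in state 0 and earns a reward of 1 for reaching state 4.
N_STATES, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # tabular action-value function
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1       # step size, discount, exploration rate

def env_step(s, a):
    """The next state depends only on the current state and action (Markov property)."""
    s_next = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s_next, (1.0 if s_next == GOAL else 0.0)

for _ in range(200):                         # episodes of trial-and-error learning
    s = 0
    while s != GOAL:
        if random.random() < EPSILON:        # explore: random action
            a = random.randrange(2)
        else:                                # exploit: greedy action, random tie-break
            a = random.choice([x for x in (0, 1) if Q[s][x] == max(Q[s])])
        s_next, r = env_step(s, a)
        # Q-learning update: bootstrap on the best action value of the successor state
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next

# Derive a deterministic policy from the learned action values.
policy = [Q[s].index(max(Q[s])) for s in range(GOAL)]
print(policy)  # [1, 1, 1, 1] -> move right in every non-goal state
```

The state space here is {0, …, 4}, the action space {left, right}, and the learned table Q induces the (here deterministic) policy.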
Over the past decade, Artificial Neural Networks (ANNs) have become the standard choice for function approximators in RL, where usually the input vector describes the system state and the ANN output represents the actions. To overcome instability and divergence, two neural networks, a worker and a target network, are used, which reintroduce the required learning stability ([28]). This is referred to as Double Q-Learning.
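The worker/target idea can be sketched in a few lines: the target network trails the worker network, either by a periodic hard copy or by a slow "soft" blend. The parameter vectors below are purely hypothetical placeholders for real network weights.

```python
# Hypothetical flattened parameter vectors of the two networks.
worker = [0.8, -0.3, 1.2]
target = [0.0, 0.0, 0.0]
TAU = 0.1  # soft-update rate; TAU = 1.0 would be a hard (periodic) copy

def soft_update(target, worker, tau):
    """Move the target parameters slowly toward the worker parameters,
    which keeps the bootstrapping target stable between updates."""
    return [(1 - tau) * t + tau * w for t, w in zip(target, worker)]

target = soft_update(target, worker, TAU)
print(target)  # approximately [0.08, -0.03, 0.12]
```

Because the target network changes slowly, the bootstrapped learning target does not chase the rapidly updated worker network, which is the stabilizing effect referenced above.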
RL combines the advantages of Dynamic Programming and Monte Carlo methods. The former bootstrap, i.e., they update the policy on the basis of the successor state ([33], p. 109), while the latter sample through direct interaction with the environment ([33], p. 126), i.e., no system model is required. Sometimes, these properties are softened; e.g., Proximal Policy Optimization ([32]) samples various steps to perform an aggregated update. There exist three approaches to learning in RL. First, in value iteration, the agent evaluates each state, from which a policy is then inferred ([28]). Second, policy iteration aims to learn the policy, i.e., the decision probabilities, directly ([31]). Third, actor-critic approaches learn both a value function and the policy ([27]), where the value function provides a measure of importance for the policy update.

Classification scheme
Through an iterative, deductive review process, two dimensions and several categories thereof emerged as recurring themes across the analyzed literature. We carefully abstracted and distilled these categories based on their prevalence and significance, resulting in a structured classification scheme that effectively captures essential aspects of the field. The identified articles are analyzed with respect to two dimensions, i.e., the application context and the technical implementation. Within each dimension, three subcategories were developed. The application context comprises (i) the environment in which the vehicle navigates (referring to air, land, and water), (ii) the application, and (iii) the primary problem handled by the RL algorithm. In the technical implementation dimension, on the other hand, (iv) the fleet coordination mechanisms, (v) the action spaces, and (vi) the state spaces are analyzed.
(i) Environment. Autonomous vehicles are developed to operate in specific environments, i.e., land, air, or water. While land-bound vehicles mostly include wheeled vehicles in the form of AGVs or AMRs, airborne vehicles typically refer to multi-rotor drones or fixed-wing vehicles, and water-bound vehicles to unmanned vessels operating on the water surface, that is, no submarines. Autonomous submarine fleets are not an active area of research, as our findings indicate.
(ii) Application. Naturally, the operating environment is closely related to the intended application area. These areas include commercial applications, communication networks, military use, and security/safety. Some articles do not mention a specific application for their RL algorithm and thus were categorized as generic.
(iii) Primary Problem. In each application area, very specific problems have to be tackled by the underlying algorithm. We identified five different primary tasks/problems, defined as the task or problem that is directly controlled by the RL algorithm: flocking/formation control; path planning, including route optimization, moving to a stationary target, and tracking a non-steady target; positioning; scheduling; and task assignment.
(iv) Coordination Mechanism. There are several approaches to reach coordination between multiple vehicles. We classify the RL coordination mechanism along two categories. First, the reward function can be designed to maximize group rewards (GRew), individual rewards (IRew), or reciprocal rewards (RRew). Using GRew means that every vehicle (agent) contributes to a common (group) goal, whereas some agents coordinate by solely accumulating individual rewards (IRew), while RRew use both an individual and a group component that can be linearly or exponentially weighted. The second category is the scope of information sharing. One can generally differentiate whether agents share information or not (e.g., [3]). With regard to the former, there are several forms of information sharing, such as global information sharing, meaning that the information of a vehicle is shared with all other vehicles. Local information sharing, in contrast, means that communication is restricted to a smaller area between neighboring agents, e.g., due to limited sensor/communication range.
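A linearly weighted reciprocal reward, for instance, can be sketched as follows; the weight and the reward values are illustrative assumptions of ours, not taken from any specific reviewed paper.

```python
def reciprocal_reward(individual, group, w):
    """Linearly weighted reciprocal reward (RRew): a blend of an agent's own
    reward and the shared group reward. w = 1 recovers IRew, w = 0 GRew."""
    return w * individual + (1 - w) * group

# Three hypothetical vehicles: each has an individual reward (e.g., reaching
# its own waypoint); the group reward is taken as the fleet average here.
individual_rewards = [1.0, 0.0, 0.5]
group_reward = sum(individual_rewards) / len(individual_rewards)  # 0.5

rewards = [reciprocal_reward(r, group_reward, w=0.7) for r in individual_rewards]
print(rewards)  # approximately [0.85, 0.15, 0.5]
```

Even the agent with zero individual reward still receives part of the group reward, which preserves an incentive to support the common goal.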
(v) Action space. Historically, RL algorithms, such as Q-Learning ([37]) or Deep Q-Learning ([28]), are designed for discrete action spaces. Hence, the majority of the investigated publications adopt discrete action spaces. Due to recent developments in RL, in particular actor-critic designs (e.g., [27, 32]), the RL framework nowadays naturally extends to continuous action spaces (see also [20]).
(vi) State space. The state spaces used always depend on the specific problem description. They can include aspects such as locations, velocities, angles, and task status indicators.

METHODOLOGY

Structured Literature Review
To conduct a rigorous and comprehensive review of the relevant literature, the methodology developed by Webster and Watson [38] was applied. The authors propose a two-step process for a Systematic Literature Review (SLR). First, relevant literature needs to be identified. To this end, we conducted a thorough database search by applying topical keywords in combination with Boolean expressions. Then, we evaluated the search results regarding their relevance based on pre-specified criteria and excluded papers that failed to meet them. Second, the final set of texts is analyzed from a concept-centric perspective (see Subsection 2.3) in order to categorize the identified literature with respect to these essential concepts of RL and autonomous vehicle fleet management.

Database Search
To query databases for literature on RL in autonomous vehicle fleet operations, a corresponding search string consisting of keywords and Boolean expressions was developed. The search string consists of two parts. The first part specifies the applied method, i.e., "reinforcement learning", and the second part specifies the types of autonomous vehicles that may be operated via RL. For the latter, the chosen keywords include "mobile robot", "guided vehicles", and "unmanned", which were deduced from the typical terminology used in the literature. Variations as well as truncations of these keywords were considered in the database query. At least one keyword from each of the two search string parts had to appear in the title, abstract, or keywords of a paper in order for it to be detected by the search. The described search string was applied to the ScienceDirect database. Only full papers published in peer-reviewed scientific journals in the English language were considered. No restriction with regard to the publication year was applied. The systematic query produced 226 papers in total (April 5th, 2023). These papers were complemented by peer-reviewed journal articles and conference papers (52 additional papers), which were identified through a manual backward/forward search.
To identify relevant papers from the search results, various exclusion criteria were applied. First, manuscripts that use a primary methodology other than RL were dismissed. Moreover, RL must be applied in autonomous vehicle operations, which includes their control, coordination, path planning, and related tasks. Finally, the focus of this paper is on fleets of autonomous vehicles and their coordination and collaboration mechanisms. Therefore, relevant papers need to address the simultaneous operation of more than one autonomous vehicle. After the exclusion criteria were applied and a backward/forward search was conducted, the final set comprised 54 papers for detailed content analysis.

ANALYSIS
First, we briefly discuss the historical distribution of the identified publications. Our results show that research on RL-based autonomous multi-vehicle systems is very much in its infancy. To the best of the authors' knowledge, the first paper in this field was published in 1999 (see Figure 1). In the following 20 years, however, research articles remained extremely scarce, as only a handful of papers were published. Then, technological progress allowed researchers to address more computationally demanding problems. As a result, since 2019, there has been a seemingly exponentially growing number of publications. While there were only 8 publications in total in the two decades before 2019, we find 17 publications in the year 2022 alone. Meanwhile, the numbers for 2023 were on track to exceed those of the previous year at the time we conducted the literature search. Therefore, we can attest that RL in autonomous multi-vehicle systems is a rapidly growing research area. In the following subsections, the results of the article classification are discussed. Notably, the classification was conducted by two authors independently to ensure inter-coder reliability. Any conflicts were resolved by the third author. In some cases, a particular article falls into more than one category, which is marked accordingly in the respective table.

Application Context
(i) Environment. Overall, we find that more than 80% of the papers focus on land and air environments: the majority of papers (51.85%) analyze airborne vehicles and 18 papers (33.33%) land-bound vehicles. This leaves just 16.67% for water environments among the investigated articles. The detailed split between the different environments can be viewed in Table 1, which also contains the breakdown of the application areas discussed in the following.
(ii) Application. Table 1 displays the distribution of all papers across these categories. We find that most papers have no specific area of application and, hence, are categorized as generic. They typically focus on advanced fleet coordination mechanisms for flocking (e.g., [42]), some of which are inspired by biological processes (e.g., [12, 13]). A large portion of the identified papers are concerned with commercial applications, which is a diverse field covering problems in manufacturing (e.g., [24, 29]), logistics (e.g., [14]), as well as agriculture ([5]). About 15% of all papers are concerned with establishing effective communication networks, which includes issues such as improved network security (e.g., [4]) and quality of service (e.g., [1]). Interestingly, the military is the second largest application area. The addressed use cases range from target tracking (e.g., [48]) to collaborative combat operations (e.g., [30]). The last and smallest category is security/safety. The corresponding articles are typically concerned with the exploration or surveillance of large areas. This mostly relates to target spotting in the context of search and rescue missions ([25]) or possible threats to public safety ([23]).
The aforementioned relationship between the operating environment and application area was also investigated for the identified literature. The results are presented in Table 1. Overall, we find that most research focuses on either land or air environments, while applications of water-bound vehicles are only represented by a few articles. Interestingly, all land-bound vehicle fleets are either used for commercial applications (e.g., [45]) or generic purposes. UAVs, on the other hand, are used across the board in all application areas. However, there appears to be a strong focus on communication and military operations as, for example, addressed by Liu et al. [21] and Zhang et al. [46], respectively.
(iii) Primary Problem. Table 2 shows the number of papers in each category in combination with the application area. More than 50% of the identified papers belong to the category path planning (30 out of 54; e.g., [17]), with most of them within the military application area. The second most researched task is the flocking/formation control problem (10 out of 54 articles; e.g., [12]), closely followed by scheduling (8 out of 54; e.g., [34]), where all but one paper belong to the application area of manufacturing and logistics (subsumed in the commercial category). Finally, only four articles analyze positioning and task assignment problems, where the former is mainly covered in the context of communication (e.g., [1, 3]) and the latter papers are mainly within the military domain (e.g., [47]).

Technical Implementation
(iv) Coordination Mechanism. In Table 3, we depict the number of papers using the various coordination mechanisms (combinations of reward and information sharing). In general, we find that most papers use local information sharing (e.g., [41]), meaning that communication is restricted to neighboring agents. However, almost as many articles apply global information sharing. For example, in the paper of Zhang et al. [44], all robots can communicate with each other via a wireless communication network. Notably, Zhou et al. [49] apply both global and local information sharing. Moreover, in 11 of the reviewed papers, no information is shared among the vehicles.
Furthermore, the majority of papers use reciprocal rewards (RRew) to coordinate multiple vehicles. While Xia et al. [39] combine RRew with local information sharing, Zhang et al. [46] apply it in combination with global information sharing. Most of the remaining articles employ group rewards (GRew). For example, the paper of [24] uses group rewards where all AMR agents receive a positive reward if an order is completed in time and, if it is delayed, they all receive a penalty. Different from that, Tang et al. [34] give group rewards for completing orders and, among others, penalize each AGV if it hits a wall. Finally, six papers do not use any group reward and coordinate solely with individual rewards (IRew), such as Xue et al. [41], who solve a collision avoidance path planning problem for four USVs operating in the same area. Here, the agents receive individual rewards for reaching the goal of the path, for the distance between a desired and their actual velocity, and for not colliding with obstacles or other USVs.
It is noteworthy that all but one paper in the communication category use GRew, which is contrary to the overall trend. This might be due to the fact that in these use cases, high quality of (communication) service has to be guaranteed ([1, 3]). While there is no trend in coordination mechanisms for commercial applications, the majority of papers that analyze military use cases utilize local information sharing. One special case is presented by Zhou et al. [49], who use both global and local information sharing and find that using local information sharing leads to better performance for their multi-target tracking task. Additionally, in the group of papers that do not share information between vehicles, eight out of eleven papers use a group reward.
Building on this, we analyze whether group, reciprocal, and individual rewards are used for different primary problems (see Table 4). For flocking/formation control and for path planning problems, the vast majority of papers (6 of 10 and 21 of 30, respectively) use reciprocal rewards. Thus, we conjecture that reciprocal rewards are highly effective when it is important to prevent collisions with other vehicles ([39]). On the other hand, papers on positioning and task assignment problems predominantly (75%) use group rewards.
(v) Action Spaces. A slight majority of the investigated publications adopt discrete action spaces. In particular, there are 29 articles with discrete action spaces, while 24 use continuous ones (1 could not be classified).
The definition of an appropriate discrete action space falls into one of the following three categories. First, it is a direct consequence of the optimization problem. Ikenoue et al. [11], for example, use the movement actions "forward", "backward", "left", and "right", while Hook et al. [9] also add a "No-Operation" action. Second, it comprises a predefined set of algorithms or operations. Xu et al. [40], for instance, use RL to choose predefined crossover and mutation operations of a Genetic Algorithm. In the third category, continuous action spaces are discretized, as was done by Miao et al. [25] and Bai et al. [1], who discretize the heading angle of UAVs to a set of actions. Another example of discretization is the work of Li et al. [17], who discretize the linear acceleration vector of quadcopters in equidistant steps for a trajectory planning problem.
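The third category can be illustrated with a short sketch; the eight-way angular grid and the helper function below are our own illustrative choices, not the exact schemes used in [1, 25].

```python
import math

def discretize_heading(n_actions):
    """Split the continuous heading angle [0, 2*pi) into n equally spaced
    discrete actions, as a generic version of UAV heading discretization."""
    return [2 * math.pi * k / n_actions for k in range(n_actions)]

def nearest_action(theta, actions):
    """Index of the discrete heading closest to a continuous angle theta,
    measured along the shorter arc of the circle."""
    return min(range(len(actions)),
               key=lambda k: abs((actions[k] - theta + math.pi) % (2 * math.pi) - math.pi))

actions = discretize_heading(8)                 # 0, 45, 90, ... degrees
print(nearest_action(math.pi / 3, actions))     # 60 degrees maps to the 45-degree bin: 1
```

A finer grid (larger `n_actions`) reduces the discretization error at the cost of a larger discrete action space.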
The definition of continuous action spaces is, in general, straightforward and directly inferred from the problem description. Commonly used are vectors for velocity, roll, torque, and other quantities (e.g., [39]) or (motor) control variables ([43]). To ensure successful and more efficient learning, ANN output values are usually kept close to the range (−1, 1), and re-scaling to the original variable range is performed afterward (e.g., [35]). Some articles set the ANN output range to (0, 1) instead (e.g., [24]).
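This re-scaling step can be sketched as follows; the heading-rate range used here is a hypothetical example of ours, not taken from a specific reviewed paper.

```python
import math

def rescale(y, low, high):
    """Map a network output y in (-1, 1) (e.g., a tanh activation)
    back to the physical actuator range [low, high]."""
    return low + (y + 1.0) * (high - low) / 2.0

# Hypothetical example: one output head controls a UAV heading rate
# in [-30, 30] degrees per second.
raw = math.tanh(0.5)                        # bounded network output in (-1, 1)
action = rescale(raw, -30.0, 30.0)
print(round(action, 2))                     # 13.86 (degrees per second)
```

Keeping the network output bounded and re-scaling afterward decouples the learning dynamics from the physical units of the actuators.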
From a more application-focused point of view, we find that action spaces mostly address the agents' motions. Specifically, only 10 papers do not define any movement-related actions in the action space, while 44 papers do. Within these papers, we can observe varying levels of control being actuated. For example, Yasuda et al. [43] define the action space as continuous signals that directly control the vehicles' motors. More commonly, the action space is defined by higher-level controls such as the heading angle (e.g., [1]) or (cardinal) directions (e.g., [34]). Papers not concerned with motion address the problem of optimal scheduling by defining the action space as priority values of orders ([29]), bid values of orders ([24]), or combinations of dispatching rules ([18]). Other publications address task ([44]) and vehicle allocation ([47]) and consequently define their action space as the set of available tasks/vehicles. For example, Zhang et al. [45] define the action space as the set of (grouped) parking spaces for their allocation problem.
(vi) State Spaces. The state spaces used in the reviewed manuscripts depend heavily on the specific problem description. Hence, it is unsurprising that they differ vastly, especially across application areas. For task assignment, scheduling, and flocking/formation tasks, the AGV/USV location, velocity, and task status indicators are commonly used (e.g., [15, 45]). For problems concerned with positioning, common features comprise location and supply information (e.g., [1, 14]), and for path planning problems, common features are positions/coordinates, (heading) angles, and, sometimes, velocity (e.g., [25, 48]).

DISCUSSION
In this section, we lay out the current state of research regarding autonomous multi-vehicle systems that are based on RL algorithms. Moreover, we identify important gaps in the literature and derive possible directions for future research.
First, we discuss the findings in the application context dimension and its subcategories. We can attest that land and air environments are much more active research areas than water environments. A possible explanation for this imbalance can be offered by taking the application areas into account (see Table 1). Here, we find that land-based vehicle systems heavily focus on commercial applications, to the point that all other specific applications (meaning non-generic ones) are completely omitted in this environment. This specialization may have various reasons. Given the increasing digitization and automation as part of the Industry 4.0 paradigm, AGVs and AMRs are widely considered to play essential roles in future manufacturing systems. The associated promise of increased efficiency and, thus, profitability makes this a highly important issue for many companies and, by extension, an interesting research avenue. Airborne vehicle systems appear to have a much broader spectrum of applications, which indicates a high degree of versatility. There is research on UAVs in virtually every application area, with a notable focus on those that are neglected by land-bound systems. Hence, both vehicle types seem to complement each other with respect to their application areas. Moreover, land-bound and airborne autonomous vehicle systems excel in different ways, the former in specialization and the latter in versatility. The aforementioned lack of research in water environments can also be traced back to these findings. USVs have very specific applications, often driven by military or security considerations. In contrast to AGVs and AMRs, however, there is currently no prominent application for which USVs are extraordinarily well suited, nor are they as versatile as UAVs. Notably, integrated systems that combine multiple vehicle types (e.g., AGVs and UAVs) have been completely omitted in the literature thus far. Given the complementary features of different vehicle types, this might be a promising research avenue.
With regard to the primary problems that were addressed with RL algorithms, we find that path planning is by far the most prominent issue in the existing literature and has been part of all application areas (see Table 2). This was to be expected, given that this is a core issue for autonomous vehicles. Interestingly, the results show that other problem types are mainly addressed in the context of specific application areas, while some have not received much attention at all. This reveals several research gaps and plenty of opportunities for future work. For instance, we find that scheduling is one of the largest problem classes (together with flocking/formation) and is almost exclusively considered in commercial applications. Moreover, task assignment has not received much attention from this research community despite being a highly relevant issue in manufacturing and logistics. Furthermore, flocking/formation, which is mainly concerned with the coordination of large numbers of autonomous vehicles, is only considered in generic or military applications. For positioning, on the other hand, the primary focus is on communication applications, as the objective is typically to set up a high-performance communication network with UAVs serving as relays.
In the following, the results within the technical implementation dimension and the respective subcategories are discussed. Regarding the coordination mechanism, we identified a link between the information-sharing approach and the reward function. If there is any kind of information sharing, either global or local, the vast majority of papers apply reciprocal rewards (see Table 3). In cases of no information sharing, on the other hand, the group reward appears to be the method of choice. Hence, we deduce that there are two general approaches to achieving multi-vehicle coordination, either through information sharing (global or local) or through group rewards. This logic holds across all application areas, with the exception of communication. Applications concerned with communication networks almost always apply group rewards regardless of the information-sharing approach.
Furthermore, we find that a slight majority of articles (about 54%) consider discrete action spaces, while the others consider continuous ones. This is likely explained by the fact that learning is more difficult in continuous than in discrete action spaces ([48]). While most of the approaches define motion-related actions, as one would expect, there are only a few papers (10 out of 54) that do not require motion-related actions to fulfill their tasks. For the former, the analyzed papers reveal that motion-related actions can be situated on different levels of control. At the lower level, the action space addresses individual components of the vehicle, such as the engine, to influence the torque or velocity. At the higher level, these actions are abstracted to mere angles and cardinal directions. The action spaces are closely connected to the considered state spaces. It is noteworthy that the detailed makeup of state spaces appears to be strongly dependent on the specific problem being addressed. Hence, the considered state spaces are very diverse and may include vehicle velocity, task completion status, supply information, or other parameters. However, there are also commonalities, as the majority of approaches feature vehicle locations in their state spaces.

CONCLUSION
Due to the fast advancements in the technology of unmanned autonomous vehicles and their ubiquitous applications in multiple areas, we investigate the scientific literature on RL for multi-vehicle systems. We analyzed the identified articles with respect to the application context and the technical implementation. Drawing from our results, we outline the current state of the research and identify key issues for future research.
Limitations to our study pertain to the limited amount of research on RL for multi-vehicle systems, although the first paper on this topic was already published in 1999. This results in a relatively small sample size of 54 scientific articles serving for our analysis. It is also noteworthy that our findings are limited to the database of ScienceDirect and the results of the backward/forward search, which was not confined to this database. Furthermore, the classification of the literature is subject to a degree of interpretation by the authors. However, the rigorous content analysis conducted according to the methodology of Webster and Watson [38] limits potential biases and provides a comprehensive picture of this research field.
Our results enable us to identify several research opportunities. Indicated by current trends, approaching multiple problem types simultaneously as well as integrating different vehicle types (e.g., ground and aerial vehicles) appears to be a promising direction. Moreover, this stream of literature appears to lag behind in real-world applications. This also includes the integration of human agents/decision-makers, who are still a crucial component in most application areas (e.g., [8]). We also find that the diversity and complexity of coordination mechanisms for autonomous multi-vehicle systems is a fruitful research topic. Hence, an in-depth analysis of such coordination mechanisms is a promising subject for another review. Finally, we find that only a few studies provide the source code or sufficient implementation details (e.g., software packages used for simulation and algorithms) to allow replication (see online repository). There is also a lack of papers that compare RL algorithms. Hence, there is little to no guidance for practitioners and future research on which algorithm or design works best for different problem specifications.

Figure 1: Chronological distribution of publications

Table 1: Number of papers per application area and environment (Yang [22]: Air & Water; * Hu et al. [10]: Path planning & Scheduling; † Li et al. [19]: Flocking/formation & Path planning)

Table 2: Number of papers per application area and primary problem

Table 3: Number of papers per coordination mechanism (* Zhou et al. [49]: Global & Local info sharing)

Table 4: Number of papers per primary problem and reward