Learning to Coordinate with Anyone

In open multi-agent environments, agents may encounter unexpected teammates. Classical multi-agent learning approaches train agents that can only coordinate with teammates seen during training. Recent studies attempt to generate diverse teammates to enhance generalizable coordination ability, but are restricted by pre-defined teammate sets. In this work, we aim to train agents with strong coordination ability by generating teammates that fully cover the teammate policy space, so that agents can coordinate with any teammates. Since the teammate policy space is far too large to be enumerated, we seek only dissimilar teammates that are incompatible with the controllable agents, which greatly reduces the number of teammates that need to be trained with. However, it is hard to determine the number of such incompatible teammates beforehand. We therefore introduce a continual multi-agent learning process, in which the agent learns to coordinate with different teammates until no more incompatible teammates can be found. This idea is implemented in the proposed Macop (Multi-agent compatible policy learning) algorithm. We conduct experiments in 8 scenarios from 4 environments with distinct coordination patterns. The results show that Macop generates training teammates with much lower compatibility than previous methods; as a result, Macop achieves the best overall coordination ability in all scenarios while never being significantly worse than the baselines, demonstrating strong generalization.


Introduction
Cooperative Multi-Agent Reinforcement Learning (MARL) [Oroojlooy and Hajinezhad(2023)] has garnered significant attention due to its demonstrated potential in various real-world applications. Recent studies have showcased MARL's exceptional performance in tasks such as pathfinding [Sartoretti et al.(2019)], active voltage control [Wang et al.(2021)], and dynamic algorithm configuration [Xue et al.(2022b)]. However, these achievements are typically made within closed environments where teammates are pre-defined. When trained policies are deployed in real-world, open environments, where agents may encounter unexpected teammates, the system suffers a decline in coordination ability [Zhou(2022)].
Training with diverse teammates presents a promising avenue for tackling this challenge. Various methods have emerged in domains such as ad-hoc teamwork [Mirsky et al.(2022)], zero-shot coordination [Treutlein et al.(2021)], and few-shot teamwork [Fosong et al.(2022)]. Addressing the challenge effectively involves two crucial factors. First, to enhance generalization and avoid overfitting to specific partners, agents must be exposed to diverse teammates during training. Diversity can be achieved through various techniques, such as hand-crafted policies [Papoudakis et al.(2021a)], objective regularizers designed among agents [Derek and Isola(2021), Lupu et al.(2021), Charakorn et al.(2023)], or population-based training (PBT) [Strouse et al.(2021), Xue et al.(2022a)]. Second, dealing with multiple teammates, especially in multi-modal scenarios, requires specialized consideration. Naive approaches, such as self-play (or "self-training") [Tesauro(1994), Silver et al.(2018)], Fictitious Co-Play (FCP) [Heinrich et al.(2015), Strouse et al.(2021)], or coevolving agent and partner populations [Xue et al.(2022a)], have been explored (see related work in App. A.1). Nevertheless, complex scenarios present substantial challenges arising from both the complexity and vastness of the teammate policy space. On one hand, enumerating all possible teammate groups is a daunting task, and training the agents against them is time-consuming. On the other hand, even when we pre-define only representative and diverse teammates, we may still accidentally omit some instances, and the exact number of such teammates cannot be determined in advance. This prompts a crucial question: can we design a more efficient training paradigm that ensures our controllable agents are trained alongside partners from a policy space with guaranteed coverage, ultimately enabling high generalization and effective coordination with diverse teammates?
To tackle this issue, we propose a novel coordination paradigm, Macop, which obtains a multi-agent compatible policy via incompatible teammate evolution. The core principle of Macop is the adversarial generation of new teammate instances, strategically crafted to challenge and refine the coordination policy of the ego-system (the agents we control). However, the exact number of representative teammates cannot be determined beforehand, and maintaining a sufficiently diverse population requires significant computing and storage resources. We therefore introduce the Continual Teammate Dec-POMDP (CT-Dec-POMDP), wherein the ego-system is trained with groups of teammates generated sequentially until convergence. Our approach is rooted in two crucial factors: instance diversity and incompatibility between the newly generated teammates and the ego-system. During training, we iteratively refine teammate generation and optimize the ego-system until convergence is reached, yielding a coordination policy capable of seamlessly handling a wide array of team compositions and promptly adapting to new teammates.
We conduct experiments on different MARL benchmarks with distinct coordination patterns, including Level-based Foraging (LBF) [Papoudakis et al.(2021b)], Predator-Prey (PP) and Cooperative Navigation (CN) from MPE [Lowe et al.(2017)], and two customized maps from the StarCraft Multi-Agent Challenge (SMAC) [Samvelyan et al.(2019)]. Experimental results show that Macop exhibits remarkable improvement over existing methods, achieving nearly 20% average performance improvement across the benchmarks compared to multiple baselines; further experiments analyze its behavior from multiple aspects.

Problem Formulation
As we aim to solve a continual coordination problem, where the controllable agents are required to cooperate with diverse teammates that arrive sequentially, we formalize it as a Continual Teammate Dec-POMDP (CT-Dec-POMDP) by extending the Dec-POMDP [Oliehoek and Amato(2016)]. A CT-Dec-POMDP is a tuple $M = \langle N, S, A, P, \{\pi_{tm}^k\}_{k=1}^{\infty}, m, \Omega, O, R, \gamma \rangle$, where $N = \{1, \ldots, n\}$, $S$, $A = A_1 \times \cdots \times A_n$, and $\Omega$ are the sets of agents, global states, joint actions, and observations, respectively. $P$ is the transition function, $\{\pi_{tm}^k\}_{k=1}^{\infty}$ represents the groups of teammates encountered sequentially during the training phase, $m$ is the number of controllable agents, $R$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. At each time step, agent $i$ receives the observation $o_i = O(s, i)$ and outputs an action $a_i \in A_i$.
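The components of this tuple can be sketched as a small container, assuming a minimal setting; all names here are illustrative and not taken from the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class CTDecPOMDP:
    """Illustrative container for the CT-Dec-POMDP tuple (names are hypothetical)."""
    n_agents: int          # |N|, total number of agents
    n_controllable: int    # m, number of controllable (ego) agents
    gamma: float           # discount factor in [0, 1)
    # {pi_tm^k}: teammate groups arriving sequentially during training
    teammate_groups: List[Callable] = field(default_factory=list)

    def add_teammate_group(self, policy: Callable) -> int:
        """Register the next teammate group; the ego-system only trains
        against the newest one, without access to earlier groups."""
        self.teammate_groups.append(policy)
        return len(self.teammate_groups)  # index k of the new group

env = CTDecPOMDP(n_agents=4, n_controllable=2, gamma=0.99)
k = env.add_teammate_group(lambda obs: 0)  # trivial stand-in teammate policy
print(k)  # -> 1
```

The point of the sketch is only the sequential-arrival structure: each call to `add_teammate_group` models the next group $\pi_{tm}^k$ appearing, while earlier groups remain inaccessible to the learner.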
Concretely, when training to cooperate with a group of teammates $\pi_{tm}^k$, the agents do not have access to previous teammate groups $\pi_{tm}^{k'}$, $k' = 1, \ldots, k-1$. However, they are expected to remember how to cooperate with all previously encountered teammate groups. For simplicity, we refer to a group of teammates as a "teammate" when no ambiguity arises. When cooperating with teammate $\pi_{tm}^k$, the controllable agents $\pi_{ego}$ are optimized to maximize the expected return
$$\mathcal{J}(\langle \pi_{ego}, \pi_{tm}^k \rangle) = \mathbb{E}_{\tau \sim \langle \pi_{ego}, \pi_{tm}^k \rangle}\big[R(\tau)\big],$$
where $R(\tau) = \sum_t \gamma^t r_t$ is the return of a joint trajectory $\tau$. At the same time, for a formal characterization of the relationship between the policy spaces of $\pi_{ego}$ and $\pi_{tm}$, we introduce the concept of a complementary policy class:

Definition 1 (complementary policy class). For any sub-policy $\pi \in \Pi_{i:j} = \otimes_{h=i}^{j} \Pi_h$, $i \le j$, we define its complementary policy class as the joint policy space of the remaining agents, $\Pi_{i:j}^c = \otimes_{h \notin \{i, \ldots, j\}} \Pi_h$.

We denote the complementary policy classes of the controllable agents and the teammate as $\Pi_{ego}^c$ and $\Pi_{tm}^c$ for simplicity. We also refer to $\mathcal{J}_{sp}(\pi_{ego}) = \max_{\hat{\pi}_{tm} \in \Pi_{ego}^c} \mathcal{J}(\langle \pi_{ego}, \hat{\pi}_{tm} \rangle)$ and $\mathcal{J}_{sp}(\pi_{tm}) = \max_{\hat{\pi}_{ego} \in \Pi_{tm}^c} \mathcal{J}(\langle \hat{\pi}_{ego}, \pi_{tm} \rangle)$ as the "self-play return" of $\pi_{ego}$ and $\pi_{tm}$, respectively.

We first introduce a novel continual teammate generation module by combining population-based training and incompatible policy learning (Fig. 1(a)). Next, we outline the design of our continual coordination policy learning paradigm, which consists of a shared backbone and a dynamic head expansion module (Fig. 1(b)). These two phases proceed alternately to train a robust multi-agent coordination policy capable of effectively cooperating with diverse teammates (Fig. 1(c)).

Incompatible teammate generation
The objective of Macop is to develop a joint policy that can effectively cooperate with diverse teammates. Since the policy space of teammate groups is far too large to be enumerated, we focus on identifying dissimilar teammate groups.
To achieve this, we begin by establishing a complementary-policy-agnostic measure that quantifies the similarity between two teammate groups while remaining unaffected by the choice of complementary policy. In particular, we pair two teammate groups with an arbitrary complementary policy, as defined in Def. 1; the groups are considered similar if the trajectory distributions they induce under every such pairing stay within a predefined threshold of each other.
Definition 2 ($\epsilon$-similar policies). We measure the similarity between two different teammates $\pi_{tm}^i, \pi_{tm}^j$ by the probability of the trajectories they induce when paired with any complementary policy. Specifically, for any fixed complementary policy $\pi \in \Pi_{tm}^c$, the probability of a trajectory $\tau$ produced by the joint policy $\langle \pi_{tm}^i, \pi \rangle$ is $P(\tau \mid \langle \pi_{tm}^i, \pi \rangle)$. Accordingly, we define the dissimilarity between the two teammates as the largest gap between their induced trajectory distributions over all complementary policies, $d(\pi_{tm}^i, \pi_{tm}^j) = \max_{\pi \in \Pi_{tm}^c} D\big(P(\cdot \mid \langle \pi_{tm}^i, \pi \rangle),\, P(\cdot \mid \langle \pi_{tm}^j, \pi \rangle)\big)$ for a distributional distance $D$, and call $\pi_{tm}^i$ and $\pi_{tm}^j$ $\epsilon$-similar if $d(\pi_{tm}^i, \pi_{tm}^j) \le \epsilon$.

Based on Def. 2, our approach identifies representative teammate groups whose pairwise dissimilarity surpasses the specified threshold $\epsilon$. We continually generate such dissimilar teammate groups in order to gradually cover the space of teammate policies. Drawing inspiration from the proven efficacy of population-based training (PBT) [Jaderberg et al.(2019)] and evolutionary algorithms (EAs) [Zhou et al.(2019)], we formulate teammate generation as an evolutionary process that maintains a population of teammates $P_{tm} = \{\pi_{tm}^j\}_{j=1}^{n_p}$ under the changing controllable agents $\pi_{ego}$. By ensuring that teammate groups exhibit dissimilarity between instances in not only the current population but also previous ones, we aim to systematically explore and cover the entire teammate policy space over time. Specifically, in each generation, the current population is first initialized through a customized parent selection mechanism (details provided later). We promote diversity within the teammate population by enhancing the dissimilarity between individuals, i.e., maximizing $\max_{i \ne j} d(\pi_{tm}^i, \pi_{tm}^j)$. To this end, we take the Jensen-Shannon divergence (JSD) [Fuglede and Topsøe(2004)] as a reliable proxy for the dissimilarity between teammates' policies, as introduced in [Yuan et al.(2023c)]:
$$\mathcal{L}_{div} = \frac{1}{n_p} \sum_{i=1}^{n_p} D_{KL}\big(\pi_{tm}^i(\cdot \mid s) \,\|\, \bar{\pi}_{tm}(\cdot \mid s)\big),$$
where $\bar{\pi}_{tm}(\cdot \mid s) = \frac{1}{n_p} \sum_{i=1}^{n_p} \pi_{tm}^i(\cdot \mid s)$ is the average policy of the population and $D_{KL}$ is the Kullback-Leibler (KL) divergence between two distributions. We provide proofs that the JSD proxy is a certifiable lower bound of the original dissimilarity objective in App. A.2.
The advantages of JSD are immediately apparent. Unlike the TV or KL divergence, which only allow pairwise comparison between two distributions, JSD enables a more comprehensive assessment of the diversity within a population by accommodating multiple distributions. Meanwhile, JSD is symmetric, i.e., invariant under interchange of the distributions being compared, which simplifies the implementation.
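As a concrete illustration, the population-level JSD proxy can be computed from per-individual action distributions at a single state. This is a minimal sketch under our own naming; the averaging over states and any network details are omitted:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def population_jsd(action_dists):
    """JSD diversity proxy: mean KL of each individual policy at a state
    to the population-average policy at that state."""
    dists = np.asarray(action_dists, float)   # shape (n_p, |A|)
    mean_policy = dists.mean(axis=0)          # average policy of the population
    return float(np.mean([kl(d, mean_policy) for d in dists]))

# Identical policies have zero diversity; distinct ones score higher.
same = population_jsd([[0.5, 0.5], [0.5, 0.5]])
diff = population_jsd([[0.9, 0.1], [0.1, 0.9]])
print(same < diff)  # -> True
```

Maximizing this quantity over the population's parameters pushes each individual's action distribution away from the population average, which is exactly the diversity pressure the objective above describes.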
Despite the effectiveness of population-based training with $\mathcal{L}_{div}$ in Eqn. 2, without further guarantees the continual generation can still produce teammate groups with similar behaviors across generations. Meanwhile, the population size $n_p$ might also have a significant impact. Inspired by the relationship between similarity and compatibility proved in [Charakorn et al.(2023)], we extend the result to our CT-Dec-POMDP:

Definition 3 ($\epsilon$-compatible teammates). For the controllable agents $\pi_{ego}$, let $\mathcal{J}_{sp}(\pi_{ego}) = \alpha$. We call $\pi_{tm}$ an $\epsilon$-compatible teammate of $\pi_{ego}$ if and only if $\mathcal{J}(\langle \pi_{ego}, \pi_{tm} \rangle) \ge (1 - \epsilon)\alpha$.

Theorem 1. Given the controllable agents $\pi_{ego}$ and teammate policies $\pi_{tm}$ and $\pi'_{tm}$ such that $\pi_{tm}$ and $\pi'_{tm}$ are $\epsilon$-similar, we have $\mathcal{J}(\langle \pi_{ego}, \pi'_{tm} \rangle) \ge (1 - \epsilon)\mathcal{J}(\langle \pi_{ego}, \pi_{tm} \rangle)$.

The underlying idea behind Thm. 1 is that controllable agents that effectively collaborate with a specific teammate group will also be compatible with that group's $\epsilon$-similar policies. Proofs are given in App. A.2. We thus have the following corollary:

Corollary 1. Given the controllable agents $\pi_{ego}$ and teammates $\pi_{tm}$, if $\mathcal{J}(\langle \pi_{ego}, \pi'_{tm} \rangle) < (1 - \epsilon)\mathcal{J}(\langle \pi_{ego}, \pi_{tm} \rangle)$, then $\pi_{tm}$ and $\pi'_{tm}$ are not $\epsilon$-similar policies, i.e., $d(\pi_{tm}, \pi'_{tm}) > \epsilon$.
The result from Cor. 1 shows that we can ensure that teammate groups generated in the current population differ from earlier ones by decreasing their compatibility with the controllable agents $\pi_{ego}$, which have been trained to collaborate effectively with the teammates generated so far. Assuming the controllable agents are fixed during the teammate-population evolving stage, the incompatibility objective can be written as
$$\mathcal{L}_{incom} = -\mathcal{J}(\langle \pi_{ego}, \pi_{tm} \rangle).$$
To ensure meaningful learning of the teammate groups' policies, each individual in the population must also be capable of cooperating with its complementary policies. Thus, the optimization of the teammate additionally maximizes the self-play objective
$$\mathcal{L}_{sp} = \mathcal{J}(\langle \hat{\pi}_{ego}, \pi_{tm} \rangle), \quad \hat{\pi}_{ego} \in \Pi_{tm}^c.$$
Considering the specified objectives, the complete objective function for the teammate population is
$$\mathcal{L}_{tm} = \mathcal{L}_{sp} + \alpha_{div} \mathcal{L}_{div} + \alpha_{incom} \mathcal{L}_{incom},$$
where $\alpha_{div}$ and $\alpha_{incom}$ are adjustable hyper-parameters that balance the three objectives.
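The combined teammate objective can be sketched as a weighted sum of scalar estimates of the three terms. The function below is illustrative and assumes each return has already been estimated by rollouts; the names mirror the text, not the paper's code:

```python
def teammate_objective(self_play_return, jsd_diversity, cross_play_return,
                       alpha_div=1.0, alpha_incom=1.0):
    """L_tm: maximize self-play return and population diversity while
    minimizing compatibility with the (fixed) controllable agents."""
    incompatibility = -cross_play_return  # lower cross-play return => more incompatible
    return self_play_return + alpha_div * jsd_diversity + alpha_incom * incompatibility

# A teammate that cooperates well with its own complement (high self-play),
# behaves distinctly (high JSD), and breaks the ego policy (low cross-play)
# scores highest under this objective.
good = teammate_objective(self_play_return=10.0, jsd_diversity=2.0, cross_play_return=1.0)
bad = teammate_objective(self_play_return=10.0, jsd_diversity=0.0, cross_play_return=9.0)
print(good > bad)  # -> True
```

The self-play term prevents the population from degenerating into teammates that are merely adversarial: incompatibility with the ego-system is only meaningful for teammates that can still cooperate with someone.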

Compatible Coordination Policy Learning
After generating a new teammate population that is diverse and incompatible with the controllable agents, we train the controllable agents to cooperate effectively with the newly generated teammate groups while maintaining coordination ability with previously encountered ones. This requires the controllable agents to possess continual learning ability, as introduced in Sec. 2, where teammate policies appear sequentially in the CT-Dec-POMDP.
In the context of evolutionarily generated teammate groups appearing sequentially, employing a single generalized policy network poses challenges due to the multi-modality and varying behaviors among teammate groups. Consequently, conflicts and degeneration in the controllable agents' policies may arise. To address this issue, recent approaches such as MACPro [Yuan et al.(2023b)] learn customized heads for each specific task. Building upon this idea, we design a policy network with a shared backbone $f_\phi$, complemented by multiple output heads $\{h_{\psi_i}\}_{i=1}^{m}$. The shared backbone extracts relevant features, while each output head makes the final decisions.
With this structured policy network, when paired with the new teammate group's policy $\pi_{tm}^{k+1}$, we first instantiate a new output head $h_{\psi_{m+1}}$. We then train the controllable agents, via the best-response loss $\mathcal{L}_{com}$, to effectively cooperate with the new teammate group.
It is worth noting that once trained, the output heads $\{h_{\psi_i}\}_{i=1}^{m}$ remain fixed; during training, the gradient of $\mathcal{L}_{com}$ only propagates through the parameters $\phi$ and $\psi_{m+1}$.
Training the best response via $\mathcal{L}_{com}$ yields a policy capable of cooperating with the new teammate group $\pi_{tm}^{k+1}$. However, the single shared backbone inevitably forgets previously learned cooperation, especially when encountering teammates with different behaviors, leading to failures with teammates seen before. One straightforward remedy is to fix the backbone parameters after training the first policy head, but this has obvious drawbacks. On one hand, the fixed backbone might fail to extract common features adequately due to the limited coverage of the training data. On the other hand, the output head's capacity might be insufficient, leading to suboptimal performance when training to cooperate with new teammates. To mitigate catastrophic forgetting while retaining the policy's expressiveness, we apply a regularization objective that constrains the backbone parameters from changing abruptly while learning the new output head $h_{\psi_{m+1}}$:
$$\mathcal{L}_{reg} = \sum_{i=1}^{m} \| \phi - \phi_i \|_p,$$
where $\phi_i$ is the saved snapshot of the backbone $\phi$ after obtaining the $i$-th output head, and $\| \cdot \|_p$ is the $l_p$ norm. This regularization retains previously learned knowledge while still allowing the shared backbone to adapt to the new teammate. By striking a balance between adaptability and retention, we can effectively enhance the policy's cooperative performance with diverse teammates. The overall objective of the controllable agents when encountering the $(k+1)$-th teammate group is
$$\mathcal{L}_{ego} = \mathcal{L}_{com} + \alpha_{reg} \mathcal{L}_{reg},$$
where $\alpha_{reg}$ is a tunable weight.
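The snapshot-based regularizer can be sketched directly: it sums the distance of the current backbone parameters to each saved snapshot. This is a minimal NumPy illustration with flattened parameter vectors; the names are ours:

```python
import numpy as np

def backbone_regularizer(phi, snapshots, p=2):
    """L_reg: penalize the shared backbone's drift from each saved snapshot
    phi_i (taken after finishing the i-th output head), using an l_p norm."""
    phi = np.asarray(phi, float)
    return float(sum(np.linalg.norm(phi - np.asarray(s, float), ord=p)
                     for s in snapshots))

# Two snapshots: one at the origin, one equal to the current parameters.
snapshots = [np.zeros(3), np.ones(3)]
loss = backbone_regularizer(np.ones(3), snapshots)
print(loss)  # distance sqrt(3) to the first snapshot, 0 to the second
```

In practice this term would be added to the best-response loss with weight $\alpha_{reg}$ and differentiated through $\phi$ by the deep-learning framework in use.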
Despite the effectiveness of combining the proposed $\mathcal{L}_{ego}$ with the carefully designed policy network architecture, a major limitation lies in its poor scalability, as the number of output heads grows linearly with the dynamically generated teammate groups. To address this limitation, we propose a resilient head expansion strategy that effectively reduces the number of output heads while maintaining the policy's compatibility:
• Upon completing the training of the output head $h_{\psi_{m+1}}$, we evaluate the coordination performance of this head and all existing ones $\{h_{\psi_i}\}_{i=1}^{m+1}$ when paired with the new teammate group's policy $\pi_{tm}^{k+1}$. Coordination performance is measured by the empirical average return $\hat{R}_i = \frac{1}{N} \sum_{j=1}^{N} R(\tau_j^i)$, the average return over trajectories $\tau_j^i$ generated by applying the $i$-th output head.
• To manage the number of output heads and prevent uncontrolled growth, we retain the newly trained head only if its performance surpasses that of the best-performing existing head by a relative threshold. Formally, we keep the newly trained head if $\frac{\hat{R}_{m+1} - \max_i \{\hat{R}_i\}_{i=1}^{m}}{\max_i \{\hat{R}_i\}_{i=1}^{m}} \ge \lambda$. This ensures that we only expand the number of output heads when there is a substantial improvement in performance, indicating that the new teammate group's behavior requires a distinct policy. Otherwise, if the existing output heads generalize well enough to cooperate effectively with the new teammate, no new head is added.
By adopting this resilient head expansion strategy, we strike a balance between limiting the number of output heads and maintaining the policy's adaptability, resulting in a more scalable and efficient approach to handling dynamic teammate groups in the continual coordination setting.
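The expansion rule above reduces to a one-line check on empirical returns. A minimal sketch, assuming positive returns as in the formula (function and variable names are ours):

```python
def should_expand(returns_existing, return_new, lam=0.1):
    """Resilient head expansion: keep the newly trained head only if its
    empirical return beats the best existing head by a relative margin lam.
    Assumes positive returns, matching the ratio form of the criterion."""
    best = max(returns_existing)
    relative_gain = (return_new - best) / best
    return relative_gain >= lam

# New teammate is already well served by an old head -> discard the new head.
print(should_expand([10.0, 8.0], return_new=10.5, lam=0.1))  # -> False
# Genuinely novel behavior that only the new head handles -> expand.
print(should_expand([10.0, 8.0], return_new=12.0, lam=0.1))  # -> True
```

The threshold $\lambda$ trades memory for specialization: a larger $\lambda$ forces more head reuse, while $\lambda = 0$ would add a head for any marginal improvement.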

Overall Algorithm
In this section, we present a comprehensive overview of the Macop (Multi-agent Compatible Policy learning) procedure. Macop trains controllable agents to cooperate effectively with various teammate groups. During the training phase, Macop employs an evolutionary method to generate diverse, incompatible teammate groups and trains the controllable agents to be compatible with them in the continual setting. In each iteration (generation) $k$ ($k > 1$), we first select the $(k-1)$-th teammate population $P_{tm}^{k-1}$ as the parent population. The offspring population is then derived by training the parent population with $\mathcal{L}_{tm}$ in Eqn. 5, i.e., mutation. The teammate groups are constructed based on value-based methods [Sunehag et al.(2018), Rashid et al.(2018)]. With the $n_p$ teammate groups of the parent population and the $n_p$ teammate groups of the offspring population, we apply a carefully designed selection scheme. To expedite the training of meaningful teammate groups, we first eliminate the $\lfloor n_p/2 \rfloor$ teammate groups with the lowest self-play return, i.e., $\max_{\hat{\pi}_{ego} \in \Pi_{tm}^c} \mathcal{J}(\langle \hat{\pi}_{ego}, \pi_{tm}^i \rangle)$. Next, we eliminate the $\lceil n_p/2 \rceil$ teammate groups with the highest cross-play return with the controllable agents, $\mathcal{J}(\langle \pi_{ego}, \pi_{tm}^i \rangle)$, so as to improve incompatibility. Finally, the remaining $n_p$ teammate groups form the new teammate population of iteration $k$, $P_{tm}^k$. With $P_{tm}^k$ in place, we construct $n_p$ continual coordination processes in sequential order and train the controllable agents to learn compatible policies. The controllable agents are optimized using $\mathcal{L}_{ego}$ (defined in Eqn. 8), and the output heads are expanded as introduced in Sec. 3.2.
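The two-stage survivor selection can be sketched as plain list filtering over scalar return estimates. Field names and the dict representation are our own illustration:

```python
import math

def select_population(groups, n_p):
    """Survivor selection over 2*n_p candidate teammate groups, each a dict
    with 'sp' (self-play return) and 'xp' (cross-play return with the ego
    agents). Drop the floor(n_p/2) lowest self-play groups, then the
    ceil(n_p/2) highest cross-play groups; n_p survivors remain."""
    # 1) remove the weakest cooperators (lowest self-play return)
    groups = sorted(groups, key=lambda g: g['sp'], reverse=True)
    groups = groups[:len(groups) - math.floor(n_p / 2)]
    # 2) remove those most compatible with the ego agents (highest cross-play)
    groups = sorted(groups, key=lambda g: g['xp'])
    groups = groups[:len(groups) - math.ceil(n_p / 2)]
    return groups

cands = [{'sp': s, 'xp': x} for s, x in
         [(9, 1), (8, 7), (7, 2), (6, 6), (5, 3), (4, 9), (3, 4), (2, 5)]]
survivors = select_population(cands, n_p=4)
print(len(survivors))  # -> 4
```

Note how the two filters pull in the directions the text describes: survivors must cooperate well with their own complements yet remain hard for the current ego-system to coordinate with.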
To determine when the continual process should terminate, a carefully designed stopping criterion is employed. The training phase terminates at the $k$-th iteration if the minimum cross-play return between $P_{tm}^{k+1}$ and the controllable agents of iteration $k$, $C = \min_i \mathcal{J}(\langle \pi_{ego}, \pi_{tm}^i \rangle)$, exceeds a certain value. This indicates that the controllable agents at the $k$-th iteration can effectively cooperate with the $(k+1)$-th teammate population even though the latter has been trained to decrease compatibility, i.e., the teammate policy space is covered for the given environment.
During the testing phase, a meta-testing paradigm determines which output head to pair with an unknown teammate group. Initially, all output heads interact with the teammate group to collect a few trajectories, and their cooperation abilities are evaluated by empirical returns. The output head with the highest performance is then chosen for testing. Pseudo-code for both the training and testing phases of Macop is provided in App. A.3.
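The meta-testing step amounts to a small empirical argmax over heads. A minimal sketch, where `rollout_fn` stands in for an environment interaction we assume is provided elsewhere:

```python
def select_head(heads, rollout_fn, n_episodes=3):
    """Meta-testing: let every output head interact with the unknown teammate
    group for a few episodes and keep the head with the best empirical return.
    rollout_fn(head) -> one episodic return (assumed supplied by the env)."""
    def avg_return(head):
        return sum(rollout_fn(head) for _ in range(n_episodes)) / n_episodes
    return max(heads, key=avg_return)

# Toy stand-in: each "head" is represented by its deterministic episodic return.
best = select_head(heads=[1.0, 5.0, 3.0], rollout_fn=lambda h: h)
print(best)  # -> 5.0
```

With stochastic returns, `n_episodes` controls the usual trade-off between evaluation cost and the reliability of the head choice.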

Experiments
In this section, we conduct a series of experiments to answer the following questions: 1) Can Macop produce controllable agents that collaborate effectively with diverse teammates in different scenarios, surpassing other methods? 2) Does the evolutionary generation of teammates bring a noticeable increase in diversity, and how do our controllable agents compare to the baselines in terms of compatibility? 3) What does the training process of Macop look like in detail? 4) How does each component and hyperparameter influence Macop?

Table 1: Average test return ± std when paired with teammate groups from the evaluation set in different scenarios. We re-scale the values by taking the result of Finetune as an anchor and present average performance improvement w.r.t. Finetune. The best result of each column is highlighted in bold. The symbols '+', '≈', and '-' indicate that the result is significantly inferior to, almost equivalent to, and superior to Macop, respectively, based on the Wilcoxon rank-sum test [Mann and Whitney(1947)] with confidence level 0.05.

We select four multi-agent coordination environments and design eight scenarios as evaluation benchmarks (Fig. 2). Level-based Foraging (LBF) [Papoudakis et al.(2021b)] presents a challenging multi-agent cooperative game, in which agents with varying levels navigate a grid world and collaboratively strive to collect food items of different levels; a collection succeeds when the sum of the levels of the participating agents matches or exceeds the level of the food item. Predator-Prey (PP) and Cooperative Navigation (CN) are two benchmarks from the popular MPE environment [Lowe et al.(2017)]. In PP, agents (predators) must jointly pursue the moving adversaries (prey). In CN, multiple agents receive rewards for navigating toward landmarks while avoiding collisions with one another. We also conduct experiments in the widely used StarCraft II combat benchmark SMAC [Samvelyan et al.(2019)], which involves unit micromanagement tasks; ally units are trained to beat enemy units controlled by the built-in AI. We design two scenarios for each benchmark (e.g., PP1 and PP2); details can be found in App. A.4.
To investigate whether Macop is capable of coordinating with diverse seen/unseen teammates, we implement Macop on top of the popular value-based methods VDN [Sunehag et al.(2018)] and QMIX [Rashid et al.(2018)] and compare it with multiple baselines. First, to assess the impact of the teammate generation process on the coordination ability of the controllable agents, we compare Macop with FCP [Strouse et al.(2021)], which first generates a set of teammate policies independently and then trains the controllable agents to be the best response to that set; diversity among teammate policies is achieved solely through random network initialization. Additionally, we examine another population-based training mechanism that trains the teammate population using both $\mathcal{L}_{sp}$ and $\mathcal{L}_{div}$, aiming to generate teammates with enhanced diversity. This approach, which aligns with existing literature [Lupu et al.(2021), Ding et al.(2023)], is referred to as TrajeDi for convenience. LIPO [Charakorn et al.(2023)], in contrast, induces teammate diversity by reducing the compatibility between the teammate policies in the population. Concretely, it trains the teammate population with an auxiliary objective $\mathcal{J}_{LIPO} = -\sum_{i \ne j} \mathcal{J}(\langle \pi_{tm}^i, \pi_{tm}^j \rangle)$, where the indices $i, j$ refer to two randomly sampled teammates in the population. Furthermore, with the teammate generation module held constant, we compare Macop with Finetune, which directly tunes all parameters of the controllable agents to coordinate with the currently paired teammate group. We also investigate two other variants: Single Head, which applies the regularization $\mathcal{L}_{reg}$ to the backbone but does not use the multi-head architecture, and Random Head, which randomly selects an existing head during evaluation, thus verifying the necessity of Macop's testing paradigm. Finally, we employ the popular continual learning method EWC [Kirkpatrick et al.(2017)] to learn to coordinate with the teammates generated by TrajeDi, thereby providing an overall validation of the effectiveness of Macop. More details are given in App. A.4.

Competitive Results
In this section, we analyze the effectiveness of the controllable agents learned by different methods from two aspects: coordination performance with diverse seen/unseen teammates, and continual learning ability on a sequence of incoming teammates.
Overall Coordination Performance. To ensure a fair comparison of coordination performance, we aggregate all teammate groups generated by Macop and the baselines into an evaluation set. For each method, we pair the learned controllable agents with each teammate group in this evaluation set and run 32 episodes per pairing. The average episodic return over all episodes and pairings serves as the evaluation metric, a comprehensive measure of the overall coordination performance and generalization ability of the controllable agents. We run each method with five distinct random seeds.
As depicted in Tab. 1, approaches such as FCP, TrajeDi, and LIPO exhibit limited coordination generalization in different scenarios, especially when the population size is restricted. This highlights the need for ample coverage of the teammate policy space to establish a robust coordination policy. Intriguingly, we found no significant differences among the three methods, indicating that design elements such as instance diversity among teammates alone fail to fundamentally address this challenge. In contrast, when using the generated teammates, simply finetuning the multi-agent policy or employing widely used continual approaches like EWC yields inferior coordination performance, as confirmed by our experiments and in line with the findings in MACPro [Yuan et al.(2023b)]. This suggests that specialized designs tailored to multi-agent continual settings play a crucial role. Macop, on the other hand, exhibits a remarkable performance advantage over nearly all baselines across scenarios, demonstrating that the controllable agents trained by Macop possess robust coordination abilities. Furthermore, the Single Head architecture struggles due to the multi-modality of teammate behavior, underscoring the necessity of a multi-head architecture, and an effectively designed testing paradigm that utilizes the multiple learned heads proves indispensable: Random Head fails to select the optimal head for evaluation, degrading performance. Our pipeline relies on efficient designs for continual learning, and more comprehensive results on the necessity of each component can be found in Sec. 4.5.

Continual Learning Ability. To assess continual learning ability, we compare Macop with CLEAR, EWC [Kirkpatrick et al.(2017)], and Finetune. CLEAR is a replay-based method that stores data from previously trained teammates to rehearse the controllable agents while training with the current teammate group. For a principled assessment of continual learning ability, we introduce two
metrics inspired by concepts used in continual learning [Wołczyk et al.(2021), Wang et al.(2023b)] within our CT-Dec-POMDP framework. Let $\alpha_k^j$ denote the coordination performance (empirical episodic return) of the controllable agents paired with the $j$-th teammate group after training to cooperate with the $k$-th teammate group, and let $\bar{\alpha}_j$ denote the coordination performance of a randomly initialized complementary policy trained with the $j$-th teammate group. With $K$ teammate groups in total: 1) $\text{BWT} = \frac{1}{K-1} \sum_{j=1}^{K-1} (\alpha_K^j - \alpha_j^j)$. BWT (Backward Transfer) evaluates the average influence of learning to cooperate with the newest teammate group on previously encountered teammates. 2) $\text{FWT} = \frac{1}{K-1} \sum_{j=2}^{K} (\alpha_{j-1}^j - \bar{\alpha}_j)$. FWT (Forward Transfer) assesses the average influence of all previously encountered teammate groups on the coordination performance with the new teammate.
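Given a matrix of returns, both metrics are simple averages over its entries. A minimal sketch following the standard continual-learning definitions, with our own names and 0-indexing:

```python
def bwt_fwt(perf, baseline):
    """Backward/forward transfer over K teammate groups.
    perf[k][j] = return with the j-th teammate group after training on group k
    (0-indexed); baseline[j] = return of a freshly initialized complementary
    policy trained with group j alone."""
    K = len(perf)
    # BWT: how much final training changed performance on earlier groups.
    bwt = sum(perf[K - 1][j] - perf[j][j] for j in range(K - 1)) / (K - 1)
    # FWT: performance on a group just before training on it, vs. from scratch.
    fwt = sum(perf[j - 1][j] - baseline[j] for j in range(1, K)) / (K - 1)
    return bwt, fwt

perf = [[10, 2, 1],
        [9, 10, 3],
        [8, 9, 10]]
bwt, fwt = bwt_fwt(perf, baseline=[0, 1, 1])
print(bwt, fwt)  # -> -1.5 1.5
```

A negative BWT signals forgetting (later training hurt earlier pairings), while a positive FWT means previously encountered teammates gave the agents a head start on new ones.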
We record the experimental results in Tab. 2. At first glance, Finetune demonstrates the worst BWT among all methods, validating the need for algorithmic designs that prevent catastrophic forgetting. Even the popular continual learning methods CLEAR and EWC grapple with forgetting to some degree. In contrast, Macop achieves the best BWT in all evaluated environments. As for FWT, Macop obtains competitive results compared with the other methods. Taking both BWT and FWT into account, Macop demonstrates robust continual learning ability, empowering the controllable agents to progressively acquire coordination proficiency with diverse teammates, in line with the expanding coverage of the teammate policy space.

Teammate Policy Space Analysis
To investigate whether Macop is capable of generating teammate groups with diverse behaviors, a straightforward method is to compare the self-play trajectories of different teammate groups. Concretely, we first learn a transformer-based encoder to map trajectories into a low-dimensional feature space (details are provided in App. A.4.3). We then encode the teammates' self-play trajectories generated by Macop into this feature space. For visualization, we select 10 teammate groups from the CN2 scenario and extract their trajectory features, as shown in Fig. 3(a). The projection displays a notable dispersion, confirming that teammate groups generated by Macop exhibit diverse behaviors as expected.
Furthermore, we conduct experiments to assess the compatibility among the generated teammate groups. In accordance with Def. 3, we pair different teammate groups in LBF4. The cross-play returns, generated by Macop and TrajeDi respectively, are presented in Fig. 3(b)(c). When pairing two distinct groups from Macop, there is a noticeable drop in returns outside the main diagonal, indicating a lack of compatibility among the teammate groups generated by Macop. Conversely, the cross-play returns of TrajeDi's teammate groups are nearly identical to their self-play returns, suggesting a significantly lower level of incompatibility among the teammate groups generated by TrajeDi, due to poorer coverage of the teammate policy space.
To further explore whether methods without dynamic teammate generation can address policy space coverage by increasing the population size, we trained controllable agents using TrajeDi with population sizes varying from 1 to 15, and evaluated their coordination ability on the evaluation set, as depicted in Fig. 3(d). The results clearly illustrate that coordination ability improves as the population size increases, until convergence is reached. However, a considerable performance gap between TrajeDi and Macop persists. We conclude that in intricate, multi-modal scenarios, vanilla methods lacking dynamic teammate generation struggle with new and unfamiliar teammates due to inadequate coverage of the teammate policy space. On the contrary, Macop's deliberate generation of incompatible teammates contributes to a more comprehensive coverage of the teammate policy space, ultimately enhancing its coordination ability.

Learning Process Analysis
To gain a comprehensive understanding of Macop's functioning, it is essential to delve into its learning process, which involves generating incompatible teammates and refining controllable agents until convergence is achieved. Fig. 4 illustrates the process in PP1, showcasing key aspects, including the number of teammate groups generated, the number of existing heads, and the stop criterion C, all presented for each iteration (Fig. 4(c)). In the first iteration, the teammate generation module produces a population of four distinct teammate groups, with three specializing in capturing the first prey and one focused on the second prey (Fig. 4(a)). However, the population lacks the desired diversity, as none of the groups learn to catch the remaining third prey. As for the controllable agents, they acquire the ability to collaborate with their teammates: Head 1 coordinates with those capturing the first prey, while Head 2 interacts with the group targeting the second prey.
During the second iteration, the teammate generation module generates new teammates incompatible with the controllable agents, expanding the coverage of the teammate policy space. As shown in Fig. 4(b), a new teammate group (identified as "tm5" in blue) successfully acquires the skill to capture the last prey, showcasing a completely novel behavior. Consequently, when the controllable agents complete their training with this new group, they establish a new head for better coordination.
The dynamic interplay between the adversarial teammate generation module and the training of controllable agents persists until the seventh iteration, resulting in an increased number of teammate groups and output heads. In this final iteration, the teammate generation module attempts to generate seemingly "incompatible" teammates as it has throughout training, but fails: the teammate groups generated up to this point already cover a wide range of the teammate policy space, and the controllable agents have acquired the ability to coordinate with a sufficiently diverse array of teammates. The newly generated teammate groups do not exhibit enough incompatibility, as indicated by the stop criterion surpassing the specified threshold ξ; that is, the cross-play performance between the controllable agents and these new "incompatible" teammates is comparable to the self-play performance of the teammate groups. Notably, the C value in the second iteration also exceeds the threshold, yet a minimum iteration count of 4 is enforced to ensure thorough exploration of the teammate policy space. This automated, self-regulating learning process concludes after the seventh iteration, producing a notable set of 28 teammate groups with remarkable diversity, along with controllable agents that possess 10 heads. Their robust coordination abilities are prominently illustrated in Fig. 4(d).
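A minimal sketch of such a stopping rule, assuming C is the ratio of the controllable agents' cross-play return to the new teammates' self-play return (the paper's exact form of C may differ):

```python
def should_stop(cross_play_return, self_play_return, iteration,
                xi=0.8, min_iters=4):
    """Illustrative stop criterion: training halts once newly generated
    teammates are no longer incompatible, i.e. the controllable agents'
    cross-play return is close to the teammates' self-play return
    (C >= xi), but never before `min_iters` iterations have elapsed.
    The ratio form of C and the value of xi are assumptions."""
    C = cross_play_return / max(self_play_return, 1e-8)
    return iteration >= min_iters and C >= xi
```

This captures the behavior described above: a high C at iteration 2 does not terminate the loop because the minimum iteration count has not yet been reached.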

Ablation and Sensitivity Studies
Here we conduct ablation studies on CN2 and SMAC1 to comprehensively assess the impact of each module.
No Incom, No Div, and No Incom & Div are derived by setting $\alpha_{incom} = 0$, $\alpha_{div} = 0$, and $\alpha_{incom} = \alpha_{div} = 0$, respectively. Furthermore, we examine the impact of $L_{reg}$, designating this variant No Reg, to explore the effects of regularization on the backbone network $\phi$. To ensure a fair comparison, we incorporate the teammate groups generated by the four ablations into the evaluation set. The results, illustrated in Fig. 5(a), reveal essential insights into the functioning of Macop. Removing $L_{incom}$ or $L_{div}$ leads to performance degradation compared to the complete Macop, highlighting their significant contributions to teammate diversity. Moreover, No Incom & Div exhibits a substantial performance degradation, verifying the necessity of actively generating diverse teammates instead of relying solely on random network initialization. Furthermore, No Reg demonstrates the poorest performance among all variants: the absence of regularization on the backbone network undermines the controllable agents' continual learning ability, weakening their coordination capability with diverse teammates. These findings emphasize that each module plays an indispensable role in Macop.
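The weighting implied by these ablations can be sketched as a simple combined objective; the concrete loss terms are defined in the paper, and only the $\alpha$-weighting is shown here.

```python
def teammate_generation_loss(l_task, l_incom, l_div,
                             alpha_incom=1.0, alpha_div=1.0):
    """Weighted objective for the teammate generation module, as suggested
    by the ablations: setting alpha_incom and/or alpha_div to zero recovers
    the No Incom / No Div / No Incom & Div variants. The individual loss
    terms (and their signs) are as defined in the paper; this function
    merely illustrates the weighting scheme."""
    return l_task + alpha_incom * l_incom + alpha_div * l_div
```

With both weights at zero, teammate diversity relies only on random network initialization, matching the weakest ablation.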
As Macop includes multiple hyperparameters, we conduct experiments to investigate their sensitivity. The teammate groups generated under different hyperparameter settings are also incorporated into the evaluation set for a fair comparison. One important hyperparameter is the population size $n_p$. On one hand, with a very small population, Macop cannot cover the teammate policy space efficiently. On the other hand, an excessively large population unnecessarily increases Macop's running time, reducing overall efficiency. As shown in Fig. 5(b), when $n_p \le 4$, the performance of Macop improves with increasing population size; however, there is no further improvement as $n_p$ grows beyond that, indicating that $n_p = 4$ is the best setting in scenario PP2. More detailed analysis of other important hyperparameters is provided in App. A.5.

Final Remarks
We propose Macop, a novel approach to multi-agent policy learning designed to enhance the coordination abilities of controllable agents when working with diverse teammates. Our approach starts by framing the problem as a CT-Dec-POMDP, under which the ego-system is trained with sequentially generated groups of teammates until convergence is achieved. Empirical results obtained across various environments, compared against multiple baseline methods, provide strong evidence of its effectiveness. Looking ahead, in few-shot settings where some trajectories must be collected to select an optimal head during policy deployment, developing mechanisms such as context-based recognition is a potential future solution. Additionally, an intriguing direction for future research is harnessing the capabilities of large language models [Wang et al.(2023a)] like ChatGPT [Liu et al.(2023)] to expedite the learning process and further enhance the generalization capabilities of our approach.

A Appendix
A.1 Related Work
Many real-world problems can be effectively modeled as multi-agent systems [Dorri et al.(2018)]. Harnessing the problem-solving prowess of deep reinforcement learning [Wang et al.(2020)], Multi-Agent Reinforcement Learning (MARL) [Zhang et al.(2021)] has achieved significant success across diverse domains. When agents share a common goal, the problem falls under the category of cooperative MARL [Oroojlooy and Hajinezhad(2023)], which has shown impressive progress in areas such as path finding [Sartoretti et al.(2019)], active voltage control [Wang et al.(2021)], and dynamic algorithm configuration [Xue et al.(2022b)]. Within this wide range of research domains, building an agent capable of cooperating and coordinating with different or even previously unknown teammates remains a fundamental challenge [Dafoe et al.(2021)]. Recent approaches, such as ad-hoc teamwork (AHT) [Mirsky et al.(2022)], zero-shot coordination (ZSC) [Treutlein et al.(2021)], and few-shot teamwork (FST) [Fosong et al.(2022)], have been developed to address this challenge. AHT involves designing agents that can effectively collaborate with new teammates without prior coordination [Stone et al.(2010)], covering aspects such as teammate type inference [Barrett and Stone(2015), Chen et al.(2020)], change-point detection [Ravula et al.(2019)], handling partial observability [Gu et al.(2021)], and adversarial training [Fujimoto et al.(2022)]. ZSC addresses the problem of independently training two or more agents in a cooperative game, ensuring that their strategies are compatible and achieve high returns when paired together at test time [Treutlein et al.(2021)]. This line of work includes diversity measurement [Lupu et al.(2021)], generation of diverse teammates [Charakorn et al.(2023)], and policy co-evolution for heterogeneous settings [Xue et al.(2022a)]. Furthermore, in the FST setting [Fosong et al.(2022)], skilled agents trained as a team to complete one task are combined with skilled agents from different tasks, and they must collectively learn to adapt to an unseen but related task [Ding et al.(2023)].
One key factor for agents with strong coordination generalization is fostering diversity among teammates. One such approach is Fictitious Co-Play (FCP) [Strouse et al.(2021)], which trains a controllable agent partner as the best response to a group of self-play agents and their past checkpoints over the course of training. However, FCP lacks an explicit mechanism to enforce diversity among teammates. To address this limitation, other approaches have been proposed to encourage diversity in teammate behavior. For example, TrajeDi [Lupu et al.(2021)] introduces an auxiliary loss term that enhances team diversity by evaluating the trajectories generated by different teams. MEP [Zhao et al.(2023)] presents a maximum-entropy population-based training scheme that mitigates distributional shift when collaborating with previously unencountered partners. LIPO [Charakorn et al.(2023)] cultivates diverse behaviors by assessing policy compatibility, which measures how effectively policies coordinate their actions. Alternatively, MAZE [Xue et al.(2022a)] addresses the challenge of heterogeneous coordination with a coevolution-based method that simultaneously evolves two populations: agents and partners. Despite their merits, these approaches often assume that the process of generating teammates is independent of the optimization objective of the coordination policies. This assumption limits their coverage of the teammate policy space, subsequently undermining their performance when coordinating with previously unseen teammates.
Another related topic is continual reinforcement learning [Khetarpal et al.(2022), Abel et al.(2023)], which has garnered increasing attention in recent years and focuses on enabling agents to learn sequentially across different tasks. Various methods have been proposed to tackle this challenge. EWC [Kirkpatrick et al.(2017)] employs $l_2$-distance-based weight regularization toward previously learned weights, necessitating additional supervision to select a specific Q-function head and task-specific exploration schedules. CLEAR [Rolnick et al.(2018)], on the other hand, is a task-agnostic method that does not require task information during continual learning; it stores a large experience replay buffer and addresses forgetting by sampling data from previous tasks. Other approaches, such as HyperCRL [Huang et al.(2021)] and [Kessler et al.(2022a)], utilize learned world models to enhance the efficiency of continual learning. To address scalability in scenarios with a large number of tasks, LLIRL [Wang et al.(2022)] decomposes the task space into subsets and employs the Chinese Restaurant Process to expand the neural network, making continual reinforcement learning more efficient. OWL [Kessler et al.(2022b)] is a recently proposed approach that uses a multi-head architecture to achieve high learning efficiency, and CSP [Gaya et al.(2023)] incrementally builds a subspace of policies for training a reinforcement learning agent on a sequence of tasks. Regarding the multi-agent continual learning problem, [Nekoei et al.(2021)] investigate whether agents can coordinate with unseen agents by introducing a multi-agent learning testbed based on Hanabi, but it only considers uni-modal coordination among tasks. In contrast, MACPro [Yuan et al.(2023b)] proposes multi-agent continual coordination via progressive task contextualization, obtaining a factorized policy with shared feature-extraction layers but separate, independent task heads, each specializing in a specific class of tasks. Nevertheless, MACPro requires a handcrafted environment for testing efficiency, which is impractical for real-world applications with unpredictable task scenarios.

A.2 Proofs for Theorems
We first show that a reliable proxy effectively measures the dissimilarity between teammates' policies.

Assumption 1. $P(s' \mid s, a) > 0,\ \forall s, a, s'$.

Theorem 2. Define the minimum total variation (TV) divergence between teammates' policies as $D_{TV}^{\min}(\pi_{tm}^i \,\|\, \pi_{tm}^j) = \min_{\tau_{tm}} D_{TV}\big(\pi_{tm}^i(\cdot \mid \tau_{tm}) \,\|\, \pi_{tm}^j(\cdot \mid \tau_{tm})\big)$, where $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_a |p(a) - q(a)| \in [0, 1]$ is the total variation divergence between discrete probability distributions $p$ and $q$. Assume that $\min_{\tau_{tm}, a_{tm}} \pi_{tm}^x(a_{tm} \mid \tau_{tm}) = \delta > 0$ for $x \in \{i, j\}$. Then, with $k = |A_{tm}|$ and horizon $T$, the following implication holds:
$$D_{TV}^{\min}(\pi_{tm}^i \,\|\, \pi_{tm}^j) > \min\Big\{\tfrac{k(k-1)}{2}\big(1 - (1-\epsilon)^{1/T}\big),\ \tfrac{k}{2}\big((1+\epsilon)^{1/T} - 1\big)\Big\} \;\Longrightarrow\; d(\pi_{tm}^i, \pi_{tm}^j) > \epsilon. \quad (9)$$

Proof. To simplify notation, we use $i, j$ to represent $\pi_{tm}^i, \pi_{tm}^j$, respectively, and omit the subscript "tm" when no ambiguity arises. Let $r(a \mid \tau) = \pi^i(a \mid \tau) / \pi^j(a \mid \tau)$, which is well-defined since $\delta > 0$, and recall that $d(i, j) = \big|1 - \prod_{t=0}^{T-1} r(a_t \mid \tau_t)\big|$.

To prove the theorem, we transform the problem into determining a value $x$ such that $D_{TV}^{\min}(i \,\|\, j) > x$, i.e., $\sum_a |\pi^i(a \mid \tau) - \pi^j(a \mid \tau)| > 2x$ for all possible trajectories $\tau$, implies $d(i, j) > \epsilon$. Equivalently, we seek to show that there exists $\tau$ such that $\prod_{t=0}^{T-1} r(a_t \mid \tau_t) < 1 - \epsilon$ or $\prod_{t=0}^{T-1} r(a_t \mid \tau_t) > 1 + \epsilon$. However, relying on such "or" constraints at each local step $\tau_t$ does not guarantee a global constraint on the product, so we proceed to derive more general constraints.

Since $\sum_a |\pi^i(a \mid \tau) - \pi^j(a \mid \tau)| > 2x$, there exists an action $a_A$ with $|\pi^i(a_A \mid \tau) - \pi^j(a_A \mid \tau)| > \frac{2x}{k}$; without loss of generality (eliminating the absolute value), suppose $\pi^i(a_A \mid \tau) - \pi^j(a_A \mid \tau) > \frac{2x}{k}$. Considering the normalization of the policy, $\sum_a \big(\pi^i(a \mid \tau) - \pi^j(a \mid \tau)\big) = 0$, it follows that $\sum_{a \ne a_A} \big(\pi^i(a \mid \tau) - \pi^j(a \mid \tau)\big) < -\frac{2x}{k}$. Letting $a_{B_1} = \arg\min_a \pi^i(a \mid \tau) - \pi^j(a \mid \tau)$, we obtain $\pi^i(a_{B_1} \mid \tau) - \pi^j(a_{B_1} \mid \tau) < -\frac{2x}{k(k-1)}$. Combining these constraints, for any $\tau$ the following bounds hold: $r(a_A \mid \tau) > 1 + \frac{2x}{k}$ and $r(a_{B_1} \mid \tau) < 1 - \frac{2x}{k(k-1)}$. By Assumption 1, for any trajectory we can replace each transition $(\tau_t, a_t, \tau_{t+1})$ with $(\tau_t, a', \tau_{t+1})$ for any $a'$ and still obtain a feasible trajectory; this allows us to pick, at every step $t$, an action whose ratio $r(a \mid \tau_t)$ satisfies either the lower bound or the upper bound.

Taking the lower bound as an example, we solve the inequality $\big(1 - \frac{2x}{k(k-1)}\big)^T < 1 - \epsilon$ and derive $x > \frac{k(k-1)}{2}\big(1 - (1-\epsilon)^{1/T}\big)$. Similarly, solving the inequality under the upper-bound constraint $\big(1 + \frac{2x}{k}\big)^T > 1 + \epsilon$, we obtain $x > \frac{k}{2}\big((1+\epsilon)^{1/T} - 1\big)$. The idea behind Thm. 2 is that a significant total variation divergence tends to result in dissimilarity between teammates' policies. Due to the computational complexity of directly calculating the divergence with respect to the trajectory distribution, a more practical approach is to maximize the total variation divergence instead.
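The two step-wise thresholds solved for above can be checked numerically with a short script; the function bodies transcribe the closed-form solutions of the two inequalities.

```python
def lower_bound_threshold(k, eps, T):
    # x > (k(k-1)/2) * (1 - (1-eps)^(1/T)) guarantees
    # (1 - 2x/(k(k-1)))^T < 1 - eps.
    return 0.5 * k * (k - 1) * (1 - (1 - eps) ** (1 / T))

def upper_bound_threshold(k, eps, T):
    # x > (k/2) * ((1+eps)^(1/T) - 1) guarantees (1 + 2x/k)^T > 1 + eps.
    return 0.5 * k * ((1 + eps) ** (1 / T) - 1)
```

Any x marginally above either threshold makes the corresponding product constraint hold, which is exactly what the proof requires.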

A.3.1 The Overall Workflow of Macop
We introduce the pseudo-codes for the training and testing phases of Macop in Alg. 1 and Alg. 2, respectively. At test time, each head $h_i$ is evaluated by composing $\pi_{ego}^i$ from the backbone $f$ and $h_i$, calculating an episode return $R_i$ from $\langle \pi_{ego}^i, \pi_{tm} \rangle$, and updating a running-average return $\bar{R}_i$ accordingly. We implement Macop based on the PyMARL [Samvelyan et al.(2019)] codebase. For the agent network architecture, we apply parameter sharing, so self-play and cross-play can be easily implemented by feeding different agent IDs into the agent network. We design the feature-extraction backbone $f_\phi$ as a 2-layer MLP followed by a GRU [Cho et al.(2014)], and each policy head $h_{\psi_i}$ as a 2-layer MLP; the hidden dimension is 64 for both the MLP and the GRU. Each head takes the output of the backbone as input and outputs the Q-values of all actions. The individual Q-values of the agents are then fed into a mixing network to calculate the joint Q-value, following existing MARL methods: we select VDN [Sunehag et al.(2018)] for the LBF, PP, and CN environments, and QMIX [Rashid et al.(2018)] for SMAC. We adopt Adam [Kingma and Ba(2015)] as the optimizer with learning rate $5 \times 10^{-4}$. The whole framework is trained end-to-end with collected episodic data on NVIDIA GeForce RTX 2080 Ti and 3090 GPUs, with a time cost of about 10 hours in the LBF, PP, CN, and SMAC1 scenarios, and about 30 hours in the SMAC2 scenario.
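For illustration, a minimal NumPy forward pass of the described architecture (a shared 2-layer MLP + GRU backbone with multiple 2-layer MLP heads, hidden size 64) might look as follows; the weight initialization and exact layer arrangement are assumptions, and the actual implementation is built on PyMARL/PyTorch.

```python
import numpy as np

H = 64  # hidden size of the MLP and GRU, as stated in the paper

def linear(n_in, n_out, rng):
    """Hypothetical initializer: returns a weight matrix and bias vector."""
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MultiHeadAgent:
    """Illustrative forward pass of the described agent network: a shared
    backbone f_phi (2-layer MLP followed by a GRU cell) and several
    2-layer MLP policy heads h_psi, each mapping the recurrent state to
    Q-values over actions."""

    def __init__(self, obs_dim, n_actions, n_heads, seed=0):
        rng = np.random.default_rng(seed)
        self.W1, self.b1 = linear(obs_dim, H, rng)
        self.W2, self.b2 = linear(H, H, rng)
        # GRU cell parameters: update gate z, reset gate r, candidate n.
        self.Wz, self.Uz = linear(H, H, rng)[0], linear(H, H, rng)[0]
        self.Wr, self.Ur = linear(H, H, rng)[0], linear(H, H, rng)[0]
        self.Wn, self.Un = linear(H, H, rng)[0], linear(H, H, rng)[0]
        self.heads = [(linear(H, H, rng), linear(H, n_actions, rng))
                      for _ in range(n_heads)]

    def step(self, obs, h):
        """One backbone step: MLP feature extraction, then a GRU update."""
        x = np.maximum(0, np.maximum(0, obs @ self.W1 + self.b1)
                       @ self.W2 + self.b2)
        z = sigmoid(x @ self.Wz + h @ self.Uz)
        r = sigmoid(x @ self.Wr + h @ self.Ur)
        n = np.tanh(x @ self.Wn + (r * h) @ self.Un)
        return (1 - z) * n + z * h  # new hidden state

    def q_values(self, h, head_idx):
        """Q-values of all actions from one policy head."""
        (W1, b1), (W2, b2) = self.heads[head_idx]
        return np.maximum(0, h @ W1 + b1) @ W2 + b2
```

In the full system, these per-agent Q-values would be combined by a VDN or QMIX mixing network to form the joint Q-value.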
We use the default hyperparameter settings of PyMARL; e.g., the batch size of trajectories used to calculate the temporal-difference error is set to the default value of 32. The selection of the additional hyperparameters introduced in our approach, e.g., the size of each teammate population, is listed in Tab. 3.

A.4.1 Environments
We select four multi-agent coordination environments and design two scenarios in each as evaluation benchmarks. Four scenarios (LBF4, PP1, CN3, SMAC1) are displayed in the manuscript; together with the other four, they are shown in Fig. 6.
Here we provide details of all eight scenarios.
Level-based Foraging (LBF) [Papoudakis et al.(2021b)] is a discrete grid-world game where two agents with varying levels navigate through the grid to collect food items with different levels. Each agent moves one cell at a time in one of the four directions {up, left, down, right}. Agents gain reward 1 when they are at a distance of one cell from the food and

Figure 2: Environments used in this paper; all details can be found in App. A.4.

Figure 3: Teammate policy space analysis. (a) The t-SNE projections of the self-play trajectory features of Macop's generated teammate groups in CN2. (b)(c) The cross-play returns of Macop's and TrajeDi's generated teammate groups in LBF4. (d) The change in TrajeDi's coordination ability with varying population sizes in LBF4 and CN2, compared with Macop.

Figure 4: Macop's learning process analysis. (a)(b) The self-play trajectories of the first four/five teammate groups. (c) The change of the number of trained teammate groups, the number of existing heads, and the stop criterion C at each iteration. (d) Coordination performance comparison with different teammate groups in the evaluation set.

Table 2: Continual learning ability. Average BWT/FWT ± std of four different methods in different evaluated environments.