Cooperative Multi-Type Multi-Agent Deep Reinforcement Learning for Resource Management in Space-Air-Ground Integrated Networks

The Space-Air-Ground Integrated Network (SAGIN), integrating heterogeneous devices including low earth orbit (LEO) satellites, unmanned aerial vehicles (UAVs), and ground users (GUs), holds significant promise for advancing smart city applications. However, resource management in the SAGIN is a challenge requiring urgent study, since inappropriate resource management causes poor data transmission and hence degrades services in smart cities. In this paper, we develop a comprehensive SAGIN system that encompasses five distinct communication links and propose an efficient cooperative multi-type multi-agent deep reinforcement learning (CMT-MARL) method to address the resource management issue. The experimental results highlight the efficacy of the proposed CMT-MARL, as evidenced by key performance indicators such as the overall transmission rate and transmission success rate. These results underscore the potential value and feasibility of future implementation of the SAGIN.


INTRODUCTION
The advent of 5G technology has ushered in a new era where the Internet of Things (IoT) serves as the backbone of numerous applications and services, such as intelligent transportation systems, home automation, and smart factories [4,7,20]. However, the burgeoning computational demands of IoT devices stretch the limits of existing wireless communication networks. Current terrestrial communication paradigms are ill-equipped to cope with this demand. Enter the Space-Air-Ground Integrated Network (SAGIN), a proposed heterogeneous system that amalgamates satellites, unmanned aerial vehicle (UAV)-based air systems, and terrestrial communication systems like base stations [17]. Its objective is to provide versatile network coverage and services.
SAGIN has emerged as a potential solution to meet the mounting computational requirements of IoT services and applications [6,14,19]. By integrating terrestrial systems with satellites and unmanned aerial vehicle (UAV)-based air systems, SAGIN offers a flexible and scalable approach that could potentially rise to these challenges [24]. However, the implementation of SAGIN within IoT services is not without its hurdles [11]. The primary obstacle lies in the inherent high mobility of the air system. This mobility results in dynamic and often unpredictable channel conditions and coverage areas. Moreover, the ground connections in current approaches are frequently overlooked; exceptions are made only for connections between ground users and UAVs, an approach that can lead to inefficiencies and limit the full utilization of the integrated network [15]. An additional complication arises from the heterogeneity of the SAGIN subsystems. Each subsystem (terrestrial, aerial, and space) uses a unique communication interface, reflecting its particular technological requirements and constraints [17]. Furthermore, channels between these various subsystems possess distinct properties, further complicating communication and integration. Therefore, it is imperative to develop a discerning resource management policy for SAGIN.
The rapid evolution of machine learning techniques offers potential solutions to these challenges [27]. Among these, reinforcement learning (RL) stands out as a reward-centric approach designed to tackle combinatorial optimization problems. An RL agent interacts with and learns from its environment to optimize toward an end goal, guided by a predefined reward system [23]. In some contexts, there may be multiple agents, or centralized learning methods may prove ineffective due to inherent environmental characteristics [25]. In these situations, the notion of multi-agent RL (MARL) emerges as a decentralized learning method; it has been increasingly adopted by researchers and institutions over recent years [13]. Additionally, there may exist more than one type of agent in a multi-agent system, where different types of agents need to employ diverse behaviors to coordinate with others [22]. In this work, we develop an ambiguous stochastic optimization (ASO) SAGIN communication model aimed at handling the system's overall resource management problem. We then employ a tailor-made MARL technique, dubbed Cooperative Multi-Type MARL (CMT-MARL), to study in detail how the multi-agent system performs in this dynamic SAGIN.

HETEROGENEOUS SYSTEM OF WIRELESS COMMUNICATION
In the SAGIN system presented in Fig. 1, we consider four communication categories at three different altitudes: a set of GUs (i.e., vehicles) and a BS on the ground layer, a UAV swarm hovering in the air layer, and an LEO satellite in the space layer, with five classes of communication links among these categories.
The links originating from GUs to the BS, UAVs, and satellite are modeled as the GU-to-BS (G2B), GU-to-UAV (G2U), and GU-to-Satellite (G2S) links, respectively, while each UAV can establish two kinds of links, to the BS and to the satellite, similarly expressed as the UAV-to-BS (U2B) and UAV-to-Satellite (U2S) links.
Both the GU and UAV groups continuously generate data packets that must eventually be transmitted to remote servers. The BS and the satellite are considered the terminal transmission devices for each GU and UAV in the SAGIN, where a UAV in the air layer can act as an intermediary for a GU on the ground when the G2S or G2B link is temporarily congested or offers a relatively low transmit rate. Specifically, the GUs and the UAV swarm comprise $N_1 (= |\mathcal{N}_1|)$ and $N_2 (= |\mathcal{N}_2|)$ homogeneous individuals, respectively, and can be further treated as the heterogeneous system $\mathcal{N} (= \mathcal{N}_1 \cup \mathcal{N}_2)$, where $\mathcal{N}_1 = \{g_1, g_2, \ldots, g_{N_1}\}$ and $\mathcal{N}_2 = \{u_1, u_2, \ldots, u_{N_2}\}$. Each GU can select only one link from the G2B, G2U, and G2S links to transmit data, while each UAV chooses between the U2B and U2S links. Note that a G2U link can only be established if the UAV selected by the GU agrees to make this connection.

Wireless Communication Links
The unicast protocol is employed in this work, where each GU or UAV can select only one target to transmit its message, and transmission within the same category, such as GU-to-GU and UAV-to-UAV, is prohibited to prevent message congestion, following the classic configuration of the SAGIN [26]. Since there are three layers in the SAGIN, we allocate two bandwidths $B_1$ and $B_2$ to support low-altitude and high-altitude transmission, respectively: the low-altitude bandwidth $B_1$ supports the G2B, G2U, and U2B links, while the G2S and U2S links operate on the high-altitude bandwidth $B_2$.
2.1.1 Low-Altitude Links. The low-altitude links, also known as the terrestrial links, are operated terrestrially and include the G2B, G2U, and U2B links [3,9]. The low-altitude links feature relatively lower bandwidth and frequency than the high-altitude links and provide communication with lower latency and enhanced reliability due to low pathloss [26]. Taking into account the interfering channels within the sub-band, as discussed in [16], the signal-to-interference-plus-noise ratio (SINR) for low-altitude links operating on the $h$-th sub-band (where $h \in [H]$) can be formulated as

$$\gamma^{L}_{i}[h] = \frac{p_i[h]\, g_{i,d_i}[h]}{I^{L}[h] + \sigma^2},$$

where $p_i[h]$ and $g_{i,d_i}[h]$ refer to the transmit power and the interfering channel from the $i$-th ($i \in \mathcal{N}$) device (GU or UAV) to its designated communication endpoint $d_i$ (base station or another device) over the $h$-th sub-band. $I^{L}[h]$ denotes the interference of low-altitude links and can be expressed as

$$I^{L}[h] = \sum_{i' \neq i} \mathbb{1}_{i'}[h]\, p_{i'}[h]\, g_{i',d_i}[h],$$

where $p_{i'}[h]$ and $g_{i',d_i}[h]$ similarly represent the transmit power and the interfering channel from the $i'$-th device to the communication destination, and $\mathbb{1}_{i'}[h]$ is the indicator function equal to 1 if the $h$-th sub-band is occupied and 0 otherwise.
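As a concrete illustration of this SINR computation, the following sketch evaluates the signal power of one device against the co-channel interference plus noise on a single sub-band; all powers, channel gains, and the noise floor are hypothetical values, not taken from the paper.

```python
def low_altitude_sinr(p, g, occupied, noise_power=1e-13):
    """SINR on one sub-band: signal / (interference + noise).

    p[i]        : transmit power of device i (W), hypothetical
    g[i]        : channel gain from device i toward device 0's endpoint
    occupied[i] : 1 if device i transmits on this sub-band, 0 otherwise
    Device 0 is the transmitter of interest; devices 1.. are interferers.
    """
    signal = p[0] * g[0]
    interference = sum(occupied[i] * p[i] * g[i] for i in range(1, len(p)))
    return signal / (interference + noise_power)

# Hypothetical numbers: one GU of interest and two potential interferers,
# only the first of which shares the sub-band.
p = [0.1, 0.1, 0.2]        # transmit powers (W)
g = [1e-9, 2e-10, 5e-11]   # channel gains (linear scale)
occupied = [1, 1, 0]       # third device sits on another sub-band
sinr = low_altitude_sinr(p, g, occupied)
```

Note that the indicator term simply zeroes out devices on other sub-bands, matching the occupancy indicator in the interference sum.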

2.1.2 High-Altitude Links. The high-altitude links, also known as the non-terrestrial links, encompass the connections between terrestrial devices and the satellite [8]. Specifically, the high-altitude links consist of the G2S and U2S links. These links enable communication and data exchange between the terrestrial devices (GUs and UAVs) and the satellite. Since the satellite can support long-range transmission due to its wide horizon, it can be treated in the proposed SAGIN model as a relatively small BS with a relatively wide bandwidth that receives messages from both GUs and UAVs [12]. Therefore, we express the SINR of a high-altitude link over the $h$-th ($h \in [H]$) sub-band as

$$\gamma^{H}_{j}[h] = \frac{p_j[h]\, g_{j,s}[h]}{I^{H}[h] + \sigma^2},$$

where $p_j[h]$ and $g_{j,s}[h]$, similar to the low-altitude case, are the transmit power and the interfering channel from the $j$-th terrestrial device to the satellite over the $h$-th sub-band. Considering that only the G2S and U2S links occupy the high-altitude bandwidth $B_2$, we use

$$I^{H}[h] = \sum_{j' \neq j} \mathbb{1}_{j'}[h]\, p_{j'}[h]\, g_{j',s}[h]$$

to indicate the interference power of high-altitude links from the $j'$-th device to the satellite over the $h$-th sub-band.
2.1.3 Transmit Rate and Latency. Since transmit rate and latency are critical performance metrics that directly affect the overall quality and reliability of heterogeneous network communication, we mainly focus on optimizing them in this work. Based on all types of SINRs above, the transmit rate of a link over the $h$-th sub-band can hence be established as [21]

$$R^{k}_{i}[h] = B_m \log_2\!\left(1 + \gamma^{k}_{i}[h]\right),$$

where $k \in \{L, H\}$ represents the link type and $B_m$, $m \in \{1, 2\}$, is the bandwidth occupied for transmission, with $B_1$ and $B_2$ denoting the bandwidths of the low- and high-altitude links, respectively.
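As a quick numerical sketch of this rate model (Shannon capacity over the occupied bandwidth), assuming purely hypothetical bandwidth and SINR values:

```python
import math

def transmit_rate(bandwidth_hz, sinr):
    """Shannon-capacity rate B * log2(1 + SINR), used for both link types."""
    return bandwidth_hz * math.log2(1.0 + sinr)

# Hypothetical bandwidths: B1 for low-altitude links, B2 for high-altitude links.
B1, B2 = 1e6, 20e6
rate_low = transmit_rate(B1, 4.0)    # low-altitude link with SINR = 4
rate_high = transmit_rate(B2, 0.5)   # high-altitude link with lower SINR
```

This also illustrates the trade-off noted above: a high-altitude link with a worse SINR can still deliver a higher rate when its bandwidth is much wider.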

OPTIMIZATION OBJECTIVE
In this section, we introduce the optimization model used to assess how the SAGIN operates, in which the ASO model is utilized to optimize over a range of possible latency probability distributions. ASO refers to the optimization of a system or process under conditions of uncertainty, where the probabilities of different outcomes or events are not precisely known [18]. Given an object-power-channel decision profile $(\mathbf{x}, \mathbf{p}, \mathbf{h})$ of this heterogeneous communication system, the ambiguous transmission rate $R : \mathcal{X} \times \mathcal{P} \times \mathcal{H} \times \Xi \to \mathbb{R}$, and a set of plausible distributions $\Omega$ over the uncertain parameters, the goal is to find the optimal decision profile $(\mathbf{x}^*, \mathbf{p}^*, \mathbf{h}^*)$ that maximizes the worst-case expected transmission rate over the set of plausible distributions,

$$(\mathbf{x}^*, \mathbf{p}^*, \mathbf{h}^*) = \arg\max_{(\mathbf{x}, \mathbf{p}, \mathbf{h})} \inf_{Q \in \Omega} \mathbb{E}_{Q}\!\left[ R(\mathbf{x}, \mathbf{p}, \mathbf{h}, \xi) \right],$$

where $\xi$ denotes the random variables representing the uncertain parameters, and $\mathbb{E}_Q$ denotes the expected value with respect to the distribution $Q$. This formulation seeks a transmit decision that maximizes the minimum expected value of the objective function over all plausible distributions, accounting for the ambiguity and uncertainty in this SAGIN system.
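A minimal sketch of this worst-case selection over an ambiguity set, assuming a toy setting with one GU choosing among its three links, two uncertain channel states, a finite ambiguity set of two distributions, and entirely hypothetical rate values:

```python
# Hypothetical expected rates (Mbps) for each link under each channel state.
decisions = ["G2B", "G2U", "G2S"]   # candidate links for one GU
scenarios = [0, 1]                  # uncertain channel states
rate = {("G2B", 0): 0.20, ("G2B", 1): 0.05,
        ("G2U", 0): 0.15, ("G2U", 1): 0.12,
        ("G2S", 0): 0.10, ("G2S", 1): 0.10}

# Ambiguity set: two plausible probability distributions over the states.
ambiguity_set = [(0.9, 0.1), (0.4, 0.6)]

def worst_case_expected_rate(x):
    """inf over the ambiguity set of the expected rate of decision x."""
    return min(sum(q[s] * rate[(x, s)] for s in scenarios)
               for q in ambiguity_set)

# Maximize the worst-case expectation (the max-inf in the formulation).
best = max(decisions, key=worst_case_expected_rate)
```

Here G2B looks best under the optimistic distribution but degrades badly under the pessimistic one, so the robust choice differs from the nominally optimal one, which is exactly the behavior the ambiguity set is meant to capture.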

MARKOV DECISION PROCESS OF SAGIN
Considering that the aforementioned SAGIN optimization objective is non-convex and difficult to solve with conventional programming methods, we model it as a Markov decision process (MDP), given as $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, for embedding the framework of the MARL method, where $\mathcal{S} \triangleq \{\mathcal{S}_1, \mathcal{S}_2\}$ and $\mathcal{A} \triangleq \{\mathcal{A}_1, \mathcal{A}_2\}$ are the joint observation and action spaces of the GU and UAV agents, respectively, $P$ indicates the transition probability, $R$ refers to the reward given by the SAGIN environment, and $\gamma$ represents the discount factor that determines the importance of future rewards.

Observation Space
In this SAGIN, we model GUs and UAVs as two types of agents that cooperatively explore the uncertain and dynamic communication environment. As discussed in Section 2.1, the SINR, on the one hand, measures the quality of the received signal relative to the interference and noise; agents can use it to estimate the strength of the received signal and make decisions about scheduling data transmissions. On the other hand, the interference power measures the power of the interfering signals caused by other data transmissions; agents can utilize this metric to estimate the impact of other transmissions on the quality of their own. In addition, the data packet size is the size of the packet that needs to be transmitted; agents can leverage this information to estimate the time required to transmit the packet. Based on the above information, the observation space of an agent is defined as a vector of features that includes the SINR $\gamma$, the interference power $I$, and the data packet size $c$. We hence establish the observation of GU agent $g$ as

$$o_g = f_{GU}(\gamma, I, c) \in \mathcal{S}_1, \quad \forall g \in \mathcal{N}_1,$$

and similarly, the observation of UAV agent $u$ can be expressed as

$$o_u = f_{UAV}(\gamma, I, c) \in \mathcal{S}_2, \quad \forall u \in \mathcal{N}_2,$$

where $f_{GU}(\cdot)$ and $f_{UAV}(\cdot)$ represent the observation functions of the GU and UAV agents, respectively. The SINR $\gamma = \{\gamma^L, \gamma^H\}$ and the interference power $I = \{I^L, I^H\}$ are both modeled in Section 2.1.
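A minimal sketch of such a three-feature observation as a flat vector an agent's policy network could consume; the field names and numeric values are illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """One agent's observation: SINR, interference power, packet size."""
    sinr: float            # measured SINR on the current sub-band
    interference_w: float  # measured interference power (W)
    packet_bits: int       # size of the packet awaiting transmission

    def as_vector(self):
        """Flatten the features into the vector form o = f(sinr, I, c)."""
        return [self.sinr, self.interference_w, float(self.packet_bits)]

# Hypothetical snapshot for one GU agent.
obs_gu = Observation(sinr=4.9, interference_w=2e-11, packet_bits=8000)
vec = obs_gu.as_vector()
```

In practice the raw features would typically be normalized before being fed to a network, but the vector layout is the point here.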

Action Space
The action space represents the set of feasible actions that can be taken by the GU or UAV agents in the SAGIN environment. The goal of each individual GU or UAV is to develop a collaborative policy that optimizes their collective long-term reward by selecting actions from its own designated action space. Each GU agent first selects a receiver for its data packet, i.e., the communication link $x_g \in \mathcal{X}_1$, and then chooses the communication channel $h_g \in \mathcal{H}_1$ and transmit power $p_g \in \mathcal{P}_1$ required for acceptable transmission efficiency. Therefore, each GU agent makes an object-power-channel decision profile

$$a_g = (x_g, p_g, h_g) \in (\mathcal{X}_1, \mathcal{P}_1, \mathcal{H}_1) \subseteq \mathcal{A}_1, \quad \forall g \in \mathcal{N}_1.$$

Similarly, the decision profile of each UAV agent at each time step is expressed as

$$a_u = (x_u, p_u, h_u) \in (\mathcal{X}_2, \mathcal{P}_2, \mathcal{H}_2) \subseteq \mathcal{A}_2, \quad \forall u \in \mathcal{N}_2.$$
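A sketch of one GU's discrete object-power-channel action space as the Cartesian product of its three choices; the link set, power levels, and number of sub-bands are hypothetical placeholders, not values from the paper.

```python
import random
from itertools import product

# Hypothetical choice sets X1, P1, H1 for a single GU agent.
links = ["G2B", "G2U", "G2S"]   # X1: receivers a GU may select
powers = [0.05, 0.1, 0.2]       # P1: discrete transmit powers (W)
subbands = list(range(4))       # H1: available sub-bands

# The GU's action space A1: every object-power-channel triple (x, p, h).
action_space = list(product(links, powers, subbands))

# One decision profile a_g = (x, p, h), here drawn uniformly at random.
action = random.choice(action_space)
```

A UAV agent's space would be built the same way from its two links (U2B, U2S) and its own power and channel sets.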

Scenario Reward
Since the objective is to maximize the transmission rate of the SAGIN, we can naturally use the rate to design a reward that encourages agents to take actions maximizing the overall rate while still ensuring the stability and robustness of the network. Second, we stimulate coordination between agents: the reward should encourage agents to coordinate their actions to maximize the rate. Third, the reward encourages exploration of the environment to discover new strategies that improve the rate; for example, agents can be rewarded for taking actions that have not been tried before, or for actions that improve the rate in a novel way. Notably, robustness and performance need to be balanced.
Based on these considerations, we specifically design the reward function for the cooperative MARL architecture in the aforementioned SAGIN as

$$r = \alpha_1 r_{rate} + \alpha_2 r_{lat} + \alpha_3 r_{coord} + \alpha_4 r_{exp},$$

where $r$ is the common reward that all agents share, known as a cooperative reward, and $\{\alpha_k\}_{k=1}^{4}$ are constant hyper-parameters that balance the different reward components. The latency term

$$r_{lat} = -\left(\lambda_1 L_{GU} + \mu_1 L_{UAV}\right)$$

stands for the latency of the network. To incentivize agents to minimize the latency of the SAGIN, we use the negative of the latency as the reward: the higher this value, the greater the reward, which ensures that agents prioritize reducing the latency of the network. Here $\lambda_1$ and $\mu_1$ are two hyper-parameters that balance the latencies of the different communication devices. The coordination term

$$r_{coord} = \sum_{g \in \mathcal{N}_1} w_g c_g + \sum_{u \in \mathcal{N}_2} w_u c_u$$

is a reward for actions that improve coordination between agents. To encourage collaboration between GUs and UAVs to minimize latency, we reward agents whose actions contribute to a coordinated strategy that reduces latency, where $w_g$ and $w_u$ refer to the weights assigned to GU $g$ and UAV $u$, indicating the importance of each agent's role in the network, and $c_g$ and $c_u$ are binary indicators of whether their actions contributed to a coordinated strategy that reduced latency. To measure the reduction in latency resulting from a coordinated strategy, we use the difference between the latency before and after the strategy is implemented: if the coordinated strategy yields a lower latency than without it, we consider that the action of GU $g$ or UAV $u$ contributes to a coordinated strategy that reduces the average latency of the SAGIN. Further, $r_{exp}$ indicates a reward for novel actions that improve latency. To motivate agents to explore the environment and try new strategies, we use an exploration bonus that rewards agents for taking actions that have not been tried before or that improve latency in a novel way. One way to do this is to reward agents whose actions increase the diversity of strategies used to reduce latency, where $\beta_1$ and $\beta_2$ are two weighting factors that control the influence of the exploration bonus, and $\mathcal{H}(P)$ is the entropy of the probability distribution $P$ over the set of strategies used to reduce latency. We utilize the Shannon entropy as a measure of the uncertainty or randomness of a probability distribution, given as

$$\mathcal{H}(P) = -\sum_{a \in \mathcal{A}} P(a) \log_2 P(a),$$

in which $P(a)$ is the probability of selecting action $a$ over its action space $\mathcal{A}$. This encourages agents to explore different strategies and avoid relying on a single strategy that may not be effective in all situations.
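A sketch of the entropy-based exploration bonus, computing the Shannon entropy of an agent's empirical strategy distribution; the single `weight` parameter is a hypothetical stand-in for the paper's balancing factors.

```python
import math
from collections import Counter

def shannon_entropy(probs):
    """H(P) = -sum_a P(a) * log2 P(a), the Shannon entropy used above."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def exploration_bonus(strategy_history, weight=0.5):
    """Entropy of the agent's empirical strategy distribution, scaled by
    a single hypothetical weight (a stand-in for the balancing factors)."""
    counts = Counter(strategy_history)
    probs = [c / len(strategy_history) for c in counts.values()]
    return weight * shannon_entropy(probs)

# An agent that reuses one strategy earns no bonus; a uniform mix over four
# hypothetical strategies attains the maximum entropy log2(4) = 2 bits.
bonus_single = exploration_bonus(["G2B"] * 8)
bonus_mixed = exploration_bonus(["G2B", "G2U", "G2S", "U2B"] * 2)
```

The bonus therefore grows with strategy diversity, which is exactly the incentive described above: agents are nudged away from collapsing onto a single transmission strategy.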

EXPERIMENTS AND RESULTS
For implementing the proposed CMT-MARL method in the SAGIN environment, we have designed a novel SAGIN communication scenario in which different channel models are combined to characterize the different channels in the SAGIN. Specifically, the G2B channel is modeled using the WINNER II channel model [10], the U2B channel is modeled according to the definition provided in 3GPP TR 36.777 Rel. 15 [1], the G2U channel follows the definition in [5], and the non-terrestrial links (G2S and U2S) are modeled based on the definition in 3GPP TR 38.811 Rel. 15 [2]. The specific simulation experiment settings can be found in Table 1.
To evaluate the performance of the proposed CMT-MARL method in the training phase, we first employ 2 vehicles and 1 UAV and exhibit the accumulated reward of the SAGIN system, in which the reward is normalized within the range of [0, 1] to effectively showcase the capabilities of the multi-type multi-agent system. Each episode comprises 100 steps, corresponding to a duration of 100 ms, during which the vehicle and UAV agents transmit packages. The convergence of system performance can be observed in Figure 3, where it becomes apparent that the performance stabilizes after approximately 500 episodes. The baseline represents the case in which the vehicle and UAV agents select actions randomly at each time step.
During the test phase, we assess the effectiveness of the CMT-MARL approach by varying the quantities of vehicle-UAV combinations, while maintaining a constant number of one base station (BS) and one satellite throughout the evaluations, as listed in Table 1. The test area spans 200×100 m². Figure 4 illustrates the average transmission rates achieved by the vehicle and UAV groups; the results show that the vehicle group attains an average transmission rate of approximately 0.15 Mbps, with the UAV group's rates shown alongside. In Figure 5, we demonstrate the transmission efficacy as reflected by the success rates across varying quantities of vehicle-UAV combinations. As the number of agents rises, the escalating interference within the SAGIN system poses hurdles to the transmissions. Consequently, agents must discern and implement more efficient transmission policies to overcome the intensified interference and successfully accomplish their package transmission tasks. The CMT-MARL technique allows the SAGIN system to achieve a nearly 100% success rate with relatively small numbers of agents, i.e., the {2, 1} and {5, 2} vehicle-UAV combinations. As the agent count expands from {10, 5} to {30, 15}, the success rates for both vehicle and UAV agents remain consistently high, hovering around 55% to 60% for the former and approximately 65% for the latter. This attests to the robustness and effectiveness of the CMT-MARL technique in facilitating reliable and successful transmissions within the SAGIN environment.

CONCLUSION
In this study, to investigate the resource management problem within the SAGIN system in detail, we have constructed a SAGIN system incorporating five distinct communication links and proposed a specific MARL technique called CMT-MARL to solve it. This method is tested using different amounts of vehicle-UAV agent combinations and showcases its reliability and robustness in the SAGIN, offering hope for implementation over a physical SAGIN environment. In future work, distributed decision-making, as well as scalability and efficiency issues, will be further investigated using mean-field and MARL methods, considering that each vehicle or UAV agent can make decisions based on local observations and interactions with its immediate environment, without requiring centralized control. This distributed approach allows the network to scale effectively as the number of agents increases.

Figure 2: The workflow of Markov decision process in SAGIN.

Figure 4: Average transmission rates with different numbers of vehicles and UAVs.