Privacy-Preserving Individual-Level COVID-19 Infection Prediction via Federated Graph Learning

Accurately predicting individual-level infection state is of great value because it plays an essential role in reducing the damage of the epidemic. However, the fine-grained user mobility trajectories required by individual-level infection prediction carry an inescapable risk of privacy leakage. In this paper, we focus on developing a framework for privacy-preserving individual-level infection prediction based on federated learning (FL) and graph neural networks (GNN). We propose Falcon, a Federated grAph Learning method for privacy-preserving individual-level infeCtion predictiON. It utilizes a novel hypergraph structure with spatio-temporal hyperedges to describe the complex interactions between individuals and locations in the contagion process. By combining the FL framework with hypergraph neural networks, the information propagation process of graph learning can be divided into two stages, distributed on the server and the clients respectively, so as to protect user privacy while still transmitting high-level information. Furthermore, Falcon includes a differential privacy perturbation mechanism and a plausible pseudo location generation approach to preserve user privacy in the graph structure. In addition, it introduces a cooperative coupling mechanism between the individual-level prediction model and an additional region-level model to mitigate the detrimental impact of the injected obfuscation mechanisms. Extensive experimental results show that our methodology outperforms state-of-the-art algorithms and is able to protect user privacy against actual privacy attacks. Our code and datasets are available at https://github.com/wjfu99/FL-epidemic.


INTRODUCTION
The rapid spread of COVID-19 across the world has caused great damage to human health and the economy, which creates a strong motivation for accurate individual-level infection prediction. With such predictions, strategies including early warning [55] and mobility control [16] can be implemented precisely to reduce the damage of the epidemic. However, accurate individual-level infection prediction requires fine-grained user mobility trajectories to characterize the contagion process driven by close contacts between individuals, which raises growing privacy concerns [29,62]: a trajectory contains sensitive information, such as where each individual went and whom they met, and this information may be leaked in the process. Therefore, how to simultaneously provide accurate individual-level infection prediction and user privacy protection is a research problem of great value.
Since the contagion process of the epidemic is essentially driven by close contacts between individuals [51], it is naturally suited to being characterized and modeled with graph learning [67]. In light of the great success of graph neural networks (GNN) in various fields, e.g., recommendation [7,22,27,68], drug design [26,30] and spatio-temporal prediction [50,76], numerous GNN-based methods have been proposed for COVID-19 forecasting [45,53]. However, due to the risk of privacy leakage in individual mobility trajectories, most existing methods implement infection prediction at the regional aggregation level instead [5,31,53,76], giving up fine-grained individual infection prediction, which limits their effectiveness in guiding fine-grained epidemic control strategies. Other approaches simply ignore the risk of user privacy leakage [9,45]. At the same time, federated learning (FL) is a distributed machine learning framework whose concept was proposed by Google [34]. Its goal is to train machine learning models on training data distributed across multiple devices without requiring the data to be uploaded: only locally calculated intermediate results are shared and aggregated between the server and the devices. FL techniques thus provide a promising solution to the problem of privacy-preserving individual-level infection prediction.
However, realizing privacy-preserving individual-level infection prediction based on GNN and FL, i.e., federated graph machine learning (FGML), is a difficult task with three distinct challenges. First, in the context of FGML, there is a risk of client-side subgraph structure leakage when information follows the propagation mechanism of the GNN and is transmitted between the client and the server. Specifically, the graph structure utilized in the infection prediction task characterizes either the contact between individuals [45] or the interaction between individuals and locations [23], which may reveal sensitive information about users to the honest-but-curious server or to other users. For instance, the interaction between individuals and locations can be indirectly exposed to the server by an inference attack via non-zero gradients [42] or via sequential location queries [58]. Thus, how to characterize the complicated interactions between individuals and locations while protecting user privacy in graph-structured data is the first challenge. Moreover, directly implementing individual-level infection prediction with vanilla GNN-based models in the manner of FL leads to the second challenge: cross-client missing information [18]. Specifically, each individual conceals its interaction data with other individuals or locations on its local device to avoid leaking the graph structure, which means each client only owns a subgraph of the global contact graph. In addition, clients will try to avoid exposing the raw node features of their local subgraph to others [18,74]. Therefore, a client can only aggregate features on its own subgraph and cannot access the node features located on other clients' subgraphs, which leads to insufficient node representations [39]. Finally, the various obfuscation mechanisms utilized in FGML guarantee user privacy from multiple perspectives, such as the perturbation mechanism used in FL to achieve differential privacy [28,64,70]. As a double-edged sword, they also cause a decline in prediction performance because they degrade data quality or structure. How to overcome this performance decline is the third challenge.
In this paper, we propose Falcon, a Federated grAph Learning method for privacy-preserving individual-level infeCtion predictiON. Falcon introduces a novel hypergraph structure with spatio-temporal hyperedges to characterize the complicated interaction between individuals and locations in the contagion process. The spatio-temporal hyperedge is employed as the mediator to propagate individual information: its features are gathered on the server based on intermediate results shared in FL and then transmitted back to the client devices. By utilizing the spatio-temporal hyperedge as the bridge for transferring the features of high-order neighbors, Falcon is able to overcome the challenge of cross-client missing information. Further, Falcon is embedded with a novel plausible pseudo location generation technique and a differential privacy perturbation mechanism, which respectively confront the leakage of graph structure caused by sequential location queries and by non-zero gradients in the transferred intermediate results. Finally, we design a novel cooperative coupling mechanism, which incorporates the individual-level prediction model with an auxiliary region-level infection prediction model to counteract the performance decline caused by the injected obfuscation mechanisms. Overall, our main contributions are summarized as follows:
• We design a hypergraph construction method to extract the spatio-temporal interaction among regions and individuals. Then, we detach the propagation procedures of the hypergraph into client and server sides, which realizes cross-client information exchange in a secure manner.
• We propose a novel plausible pseudo location generation technique and a differential privacy perturbation mechanism against two kinds of location inference attacks from the honest-but-curious server. They trade the lowest possible cost in prediction utility for user privacy.
• We design a novel cooperative coupling mechanism between the macroscopic region-level model and the microscopic individual-level model, which is able to overcome the performance decline caused by the utilized obfuscation mechanisms.
• Extensive and multi-scenario experimental results show that our proposed framework is able to implement accurate infection prediction, outperforming state-of-the-art algorithms. In addition, it can protect user privacy against actual privacy attacks.
The structure of this paper is as follows. We first define the problem and introduce preliminary knowledge in Section 2. We then present our solution in Section 3 and conduct experiments in Section 4. Last, we review the related works in Section 5 and conclude the paper in Section 6.

PRELIMINARIES
In this section, we introduce the fundamental concepts used in this paper and formally define the individual-level infection prediction task that we investigate. We also give a brief review of hypergraph neural networks (HGNN). The key notations utilized in this paper are described in Table 1.

Problem Formulation
In the real world, predicting the infection status of each individual is beneficial for the precise design of individual-level intervention strategies. Meanwhile, the spread of the epidemic is driven by individual contacts, which are in turn characterized by the mobility pattern of each individual. Thus, we cast this problem as a semi-supervised classification problem, utilizing the mobility data and the observed positive cases to predict the health status of the remaining individuals. Nevertheless, the mobility data of a massive number of individuals raises several practical issues, including significant computational requirements and broad privacy implications. Moving the whole task to an FL setting is an effective way to mitigate these concerns [37].

Table 1. Key notations.
Notation    Description
$N_{l,t}$    The number of individuals located in region $l$ at time $t$.
$M$    The number of regions.
$M_{u,t}$    The number of regions visited by user $u$ at time $t$.
$L_u$    The historical trajectory of user $u$.
$y_u$    The infection status of user $u$.
$\Theta$    The parameters of the prediction model.
$\mathcal{G}$    The spatio-temporal hypergraph.
$\mathcal{V}$    The vertex set of the hypergraph.
$\mathcal{E}$    The hyperedge set of the hypergraph.
$\mathbf{H}$    The adjacent matrix of the hypergraph.
$\mathbf{D}_v$    The degree matrix of vertices.
$\mathbf{D}_e$    The degree matrix of hyperedges.
$\mathbf{E}$    The aggregate hyperedge embedding.
$\mathbf{E}_m$    The output hidden state of the macroscopic model.
$\sigma_l$    The standard deviation of the Gaussian noise for the perturbation mechanism.
$\sigma_f$    The standard deviation of the Gaussian noise for DPSGD.
$\mathcal{G}^{(t)}$    The graph of the macroscopic model at time $t$.
$\mathbf{c}^{(t)}$    The new cases of each region observed at time $t$.
$\beta$    The transmission rate of the disease.
$\mu$    The recovery rate of the infected individual.
$1/\alpha$    The average duration of the latent period.
$R_0$    The basic reproductive number of a disease.
We consider COVID-19 transmission within a city or a district, which involves $N$ individuals and $M$ regions. The infection status of an individual falls into one of four categories [71]: susceptible, asymptomatic infectious, symptomatic infectious, and recovered. An individual in the asymptomatic infectious or symptomatic infectious state is considered an infected case. Based on government policies and individual intuition, we assume that a certain proportion of the total population will go to a hospital or the Center for Disease Control and Prevention (CDC) to obtain their infection status. Note that these institutions provide individuals with accurate diagnosis results through medical techniques such as molecular tests and antigen tests. Besides, given the rapid development of mobile localization techniques, we suppose that each individual is able to record their historical trajectory and always retains the trajectory of a fixed number of most recent days.
As depicted in Figure 1, the overall prediction model is based on the FL framework, in which each individual is considered as a client, and the server is located in a specific agency (e.g., a government agency, the CDC, or a technology corporation). The infection diagnosis result and the historical trajectory are both stored on the local device.
We intend to provide policymakers with a GCN-based individual-level infection prediction model, based on which they can develop more precise individual-level intervention strategies. In the meantime, this model should ensure that each individual's historical trajectory $L_u$ and diagnosis result $y_u$ are still stored on their local device and cannot be accessed by other individuals or the server.
Based on the above notations, we formulate the formal definition of the individual-level COVID-19 infection prediction in the manner of FL as follows: where  represents the proportion of the uploaded trajectory to the complete trajectory.Thus, the lower the value of , the lower the frequency of trajectory uploads by the individuals.

Hypergraph Neural Networks
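A hypergraph is a variant of a graph in which an edge can connect any number of vertices. A hypergraph can be formally expressed as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, in which $\mathcal{V}$ and $\mathcal{E}$ denote the sets of vertices and hyperedges, respectively. The adjacent matrix $\mathbf{H} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{E}|}$ of a hypergraph $\mathcal{G}$ can be represented as follows:
$$h(v, e) = \begin{cases} 1, & \text{if } v \in e, \\ 0, & \text{if } v \notin e, \end{cases} \quad (2)$$
where $v$ and $e$ index the vertices and hyperedges, respectively. Vertex and edge degrees can be defined as $d(v) = \sum_{e \in \mathcal{E}} h(v, e)$ and $\delta(e) = \sum_{v \in \mathcal{V}} h(v, e)$, respectively. Further, $\mathbf{D}_e$ and $\mathbf{D}_v$ denote the diagonal matrices of the edge degrees and the vertex degrees. HGNN [17] extends graph convolution from graphs to hypergraphs to capture high-order correlations, which can be represented as:
$$\mathbf{X}^{(k+1)} = \sigma\left(\mathbf{D}_v^{-1/2} \mathbf{H} \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^{\top} \mathbf{D}_v^{-1/2} \mathbf{X}^{(k)} \mathbf{\Theta}^{(k)}\right), \quad (3)$$
where $\mathbf{X}^{(k)} \in \mathbb{R}^{|\mathcal{V}| \times d_k}$ represents the input to the HGNN layer, $\mathbf{\Theta}^{(k)} \in \mathbb{R}^{d_k \times d_{k+1}}$ denotes the trainable parameters in the $k$-th layer, $\mathbf{H}^{\top}$ is the operator that aggregates features from nodes to edges, and $\mathbf{H}$ is the operator that transforms edge features back to nodes. $\mathbf{D}_e$ and $\mathbf{D}_v$, the degree matrices of edges and vertices, are utilized as normalization factors. $\mathbf{W}$ is the weight matrix of the hyperedges.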
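The normalized propagation at the core of the HGNN layer can be illustrated on a toy hypergraph. The following is a minimal sketch, not the authors' implementation: it assumes $\mathbf{W}$ is the identity, omits the trainable parameters and the nonlinearity, and assumes every vertex and hyperedge has non-zero degree. The function name `hgnn_propagate` is hypothetical.

```python
import math


def hgnn_propagate(H, X):
    """Compute D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X (W = I assumed).

    H: |V| x |E| incidence matrix as nested lists of 0/1.
    X: |V| x d node features as nested lists.
    Trainable weights and the nonlinearity are deliberately left out.
    """
    n_v, n_e, d = len(H), len(H[0]), len(X[0])
    d_v = [sum(row) for row in H]                                 # vertex degrees
    d_e = [sum(H[v][e] for v in range(n_v)) for e in range(n_e)]  # hyperedge degrees
    # D_v^{-1/2} X: scale each node's features by its degree
    Xn = [[x / math.sqrt(d_v[v]) for x in X[v]] for v in range(n_v)]
    # Node -> hyperedge aggregation: D_e^{-1} H^T (D_v^{-1/2} X)
    E = [[sum(H[v][e] * Xn[v][j] for v in range(n_v)) / d_e[e]
          for j in range(d)] for e in range(n_e)]
    # Hyperedge -> node aggregation: D_v^{-1/2} H E
    return [[sum(H[v][e] * E[e][j] for e in range(n_e)) / math.sqrt(d_v[v])
             for j in range(d)] for v in range(n_v)]


# Three vertices, two hyperedges: e0 = {v0, v1}, e1 = {v1, v2}
H = [[1, 0], [1, 1], [0, 1]]
X = [[1.0], [2.0], [3.0]]
out = hgnn_propagate(H, X)
```

Vertex v1 belongs to both hyperedges, so its output mixes information from v0 and v2 even though the three vertices are never directly connected by a pairwise edge.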

System Overview
To tackle the challenges of FGML mentioned in Section 1, we propose a novel algorithm named Falcon. The overall architecture of Falcon is presented in Figure 1: we use a hypergraph to model the spatio-temporal user-location interaction and split its two aggregation phases between the client and server sides. In addition, a plausible pseudo location generation method is designed against location inference attacks. Furthermore, to mitigate the performance decline caused by the privacy mechanisms, we design a GNN-LSTM-based macroscopic model that cooperates with our microscopic individual-level model. We summarize the detailed algorithm of Falcon in Algorithm 1 and Algorithm 2.

Privacy-preserving FGML Framework
First, to extract the complex interaction among a massive number of users, we design a spatio-temporal hypergraph construction method, as presented on the left of Figure 1. Specifically, by regarding each individual as a node, we connect all nodes that visit a location $l$ in the same time interval $t$ to a hyperedge $e(l, t)$, which enables the model to capture both the interaction between individuals and locations and the contact among individuals. Consequently, a person moving between two spatio-temporal points is represented as a node connected to two distinct hyperedges. Then, we craft a novel detached hypergraph propagation mechanism in Section 3.2.1 to address the cross-client missing information problem in FGML, which detaches the two phases of hypergraph aggregation so that they are executed on the client and the server, respectively. Additionally, we propose two strictly coupled obfuscation mechanisms, the plausible pseudo location generation algorithm and the differential privacy perturbation mechanism, in Section 3.2.3 and Section 3.2.2, respectively, to avoid divulging users' actual locations to the server. In this way, Falcon can effectively reduce the privacy leakage risk of the graph structure. Finally, we utilize a differential privacy mechanism [1] to perturb the gradients of the model parameters before the client uploads them to the central server, which avoids exposing feature values via gradients.
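The hyperedge construction described above can be sketched in a few lines. This is only an illustration under assumed data structures: trajectories are taken to be lists of `(time_interval, location)` pairs, and the names `build_hyperedges` and `local_subgraph` are hypothetical. The global view is built here purely for exposition; in the framework each client only ever derives the hyperedges of its own trajectory.

```python
from collections import defaultdict


def build_hyperedges(trajectories):
    """Group users by (location, time interval).

    trajectories: dict mapping user -> list of (time_interval, location).
    Returns a dict mapping the hyperedge key (location, time_interval)
    to the set of users it connects.
    """
    hyperedges = defaultdict(set)
    for user, visits in trajectories.items():
        for t, loc in visits:
            hyperedges[(loc, t)].add(user)
    return dict(hyperedges)


def local_subgraph(user, trajectories):
    """Hyperedge keys a single client can derive from its own trajectory."""
    return {(loc, t) for t, loc in trajectories[user]}


trajectories = {
    "alice": [(0, "cafe"), (1, "office")],
    "bob": [(0, "cafe"), (1, "gym")],
    "carol": [(1, "office")],
}
edges = build_hyperedges(trajectories)
alice_view = local_subgraph("alice", trajectories)
```

Here alice moves from the hyperedge `("cafe", 0)` to `("office", 1)`, so her node is connected to two distinct hyperedges, matching the description of a person moving between two spatio-temporal points.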
3.2.1 Detached Hypergraph Propagation. In the scenario of FGML, an important challenge is that each client can only access a subgraph of the global graph, and some of a node's neighbors may reside on other clients [18]. Under the constraint of privacy requirements, however, a client can usually only aggregate the features of nodes within its local storage. In the task of individual-level infection prediction, this challenge still exists and is even more difficult. Specifically, restricted by privacy considerations, each individual stores its trajectory data and features on the local device and conceals them from other individuals. Thus, it is infeasible to construct the individual-individual contact graph in each client, which makes it even harder for an individual to aggregate the features of other individuals. To address this issue, we first opt for the spatio-temporal hypergraph to establish interaction information between individuals and locations, allowing each client to construct a subgraph from its local historical trajectory. In addition, as depicted in Figure 2, we decouple the two phases of hypergraph aggregation and use the hyperedge as the mediator to propagate individual information, allowing clients to share information by uploading embeddings of visited locations. Throughout the entire process, the historical trajectories of users are not shared with other users, and no user is able to detect other users who have visited the same location.
Intuitively, the propagation from node to edge is deployed on the server side, which can be expressed as:
$$\mathbf{E}^{(k+1)} = \mathbf{W} \mathbf{D}_e^{-1} \mathbf{H}^{\top} \mathbf{D}_v^{-1/2} \mathbf{X}^{(k)}, \quad (4)$$
which represents the first part of (3). Here $\mathbf{E}^{(k+1)}$ denotes the aggregate hyperedge embedding and $\mathbf{X}^{(k)}$ represents the node embedding. The hyperedge weight matrix $\mathbf{W}$ is set to an identity matrix in this work. The propagation from edge to node is executed by local clients, which can be represented as follows:
$$\mathbf{X}^{(k+1)} = \sigma\left(\mathbf{D}_v^{-1/2} \mathbf{H} \mathbf{E}^{(k+1)} \mathbf{\Theta}^{(k)}\right), \quad (5)$$
which represents the second part of (3). Consequently, Falcon maintains the cross-client transfer of information and prevents the exposure of the local subgraph structure to other clients. Nevertheless, the server can still infer where a user has been through inference attacks, which estimate the real location of users via non-zero gradients [42] or sequential location queries [58]. Thus, we propose a novel plausible pseudo location generation technique and introduce a differential privacy (DP) mechanism to guarantee that even the central server cannot directly infer the real locations a user visited.
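The split between the server-side and client-side phases can be sketched as three small functions. This is a simplified illustration rather than the authors' code: it replaces the degree normalization of the two phases with plain averaging, leaves out the trainable parameters and the nonlinearity, and all function names are hypothetical.

```python
def client_upload(x_u, visited):
    """Client side: contribute the local user embedding to each visited hyperedge."""
    return {edge: list(x_u) for edge in visited}


def server_aggregate(uploads):
    """Server side (node -> hyperedge): average the contributions per hyperedge."""
    sums, counts = {}, {}
    for upload in uploads:
        for edge, vec in upload.items():
            acc = sums.setdefault(edge, [0.0] * len(vec))
            for j, val in enumerate(vec):
                acc[j] += val
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: [s / counts[edge] for s in acc] for edge, acc in sums.items()}


def client_update(edge_embeddings, visited):
    """Client side (hyperedge -> node): average the embeddings of visited hyperedges."""
    vecs = [edge_embeddings[edge] for edge in visited]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


visits = {"a": [("cafe", 0), ("office", 1)], "b": [("cafe", 0)]}
x = {"a": [2.0, 0.0], "b": [4.0, 2.0]}
uploads = [client_upload(x[u], visits[u]) for u in visits]
edge_emb = server_aggregate(uploads)
new_a = client_update(edge_emb, visits["a"])
new_b = client_update(edge_emb, visits["b"])
```

Client `a` receives information about client `b` only through the shared hyperedge `("cafe", 0)`; neither client sees the other's trajectory or raw embedding. The server, however, still sees which hyperedges each client uploads to, which is the leakage the obfuscation mechanisms are meant to address.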

3.2.2 Differential Privacy Perturbation Mechanism.
We now introduce a differential privacy perturbation mechanism, which is integrated into the aforementioned detached propagation mechanism to confront the inference attack via non-zero gradients [42]. Note that each client only stores the user embedding on the local device and, according to (4), the server needs to aggregate user embeddings into location embeddings. Consequently, instead of simply uploading user embeddings, each client calculates the local embedding for each location beforehand. The detailed calculation of the local location embedding in each epoch is as follows:

[Figure 3: Plausible pseudo location generation. Panels: global aggregate mobility model; location epidemic risk graph with classes; transformation of each location of a trace (a sequence of locations visited over time) into the epidemic risk domain; generation of fake trajectories by sampling locations from the epidemic domain according to the aggregated mobility model; fake user trajectories, which are uploaded to the server simultaneously with the real trajectories.]
Here, the local embedding of location $l$ in client $u$ at time interval $t$ is computed from the global location embedding downloaded from the server in the previous epoch, the user embedding stored in the client, and $N_{l,t}$, the total number of individuals located in region $l$ at time $t$. Thus, client $u$ contributes an embedding to each location in its history trace record $L_u$; for a location that does not exist in $L_u$ (i.e., a pseudo location), the local location embedding is equal to 0. To prevent the honest-but-curious server from inferring the real location by detecting which location embeddings have a non-zero gradient, we propose a differential privacy perturbation mechanism satisfying $(\epsilon, \delta)$-DP [64]. Specifically, the updated value of a user for each location is clipped with a constant threshold, and then Gaussian noise $\mathcal{N}(0, \sigma_l^2)$ is added to each location embedding. The standard deviation $\sigma_l$ depends, among other quantities, on the exposure number of the parameters, that is, the number of training epochs. After executing the above processes, each user uploads their update of the corresponding location embedding to the central server.
On the server side, the server aggregates all local location embeddings received from users and takes their average to obtain the aggregated location embedding of the current epoch. Then, each aggregated location embedding is distributed to all clients, and each client propagates the location embeddings to its node embedding following (5).
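The client-side perturbation step can be sketched as follows. This is a minimal illustration under stated assumptions: the text only says that updates are "clipped with a constant value", so element-wise clipping to `[-clip, clip]` is an assumption here, and the choice of `sigma` (which the paper derives from the privacy budget) is left to the caller. The function name is hypothetical.

```python
import random


def perturb_location_updates(real_updates, pseudo_locations, clip, sigma, rng):
    """Clip and noise the per-location updates of one client.

    real_updates: dict mapping (location, time) -> update vector for visited locations.
    pseudo_locations: keys of pseudo locations, whose true update is all zeros.
    Every coordinate is clipped to [-clip, clip] (assumed element-wise) and
    Gaussian noise N(0, sigma^2) is added, so real and pseudo locations both
    arrive at the server with non-zero values.
    """
    dim = len(next(iter(real_updates.values())))
    noisy = {}
    for loc in list(real_updates) + list(pseudo_locations):
        base = real_updates.get(loc, [0.0] * dim)
        clipped = [max(-clip, min(clip, v)) for v in base]
        noisy[loc] = [v + rng.gauss(0.0, sigma) for v in clipped]
    return noisy


real = {("cafe", 0): [5.0, -0.2]}
pseudo = [("gym", 0)]
no_noise = perturb_location_updates(real, pseudo, clip=1.0, sigma=0.0, rng=random.Random(0))
noisy = perturb_location_updates(real, pseudo, clip=1.0, sigma=0.5, rng=random.Random(0))
```

With `sigma=0.0` the pseudo location stays exactly zero and is trivially distinguishable from the real one; with noise added, both entries are non-zero, which is what defeats the non-zero-gradient inference.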

3.2.3 Plausible Pseudo Location Generation.
A straightforward strategy is to have each client upload the embedding of the actual visited location $l$ together with the embeddings of several randomly chosen pseudo locations $(l'_1, l'_2, \cdots)$, and to add noise to all of these embeddings. This effectively prevents inference attacks through the non-zero gradient, since the server only observes the set $\{l, l'_1, l'_2, \cdots\}$ without knowing which element is genuine. However, it cannot defend against location inference attacks via sequential location queries. If the pseudo locations are not believable, these attacks are able to filter out the genuine location over multiple location queries over time. Thus, a more plausible strategy for generating pseudo locations is needed. We propose a plausible pseudo location generation algorithm inspired by a trace synthesis technique [3], as shown in Figure 3, which generates synthetic traces as opposed to independent pseudo locations. The algorithm consists of the following four parts.
Aggregate Mobility Model: Conceptually, as shown in Figure 3.2, we construct the global aggregate mobility model by averaging over the local mobility models of sampled user groups U. First, the local mobility model of a specific user $u$ is represented by a Markov model, which combines the probability that user $u$ visits location $l$ at time $t$ with the transition probability that indicates the willingness of user $u$ to move from location $l$ to location $l'$. Together they give the estimated probability that user $u$ moves from location $l$ to $l'$ at time $t$. The aggregate global mobility model is then obtained from these local estimates, normalized by the total number of users; it includes a small constant to avoid zero probabilities and takes the distance between $l$ and $l'$ into account.
Epidemic Clustering: The trajectory synthesis approach proposed in [3] takes both geographic and semantic features into consideration. In the COVID-19 infection prediction task, these semantic features are less relevant, since semantic similarity was specifically proposed for location-based services (LBS), where it guarantees that synthetic locations do not perturb the semantic profile of each user and thus do not affect the service quality of the LBS (e.g., location recommendation). Whereas an LBS needs to preserve semantic similarity among users for better personalization, here the epidemiological similarity of locations should be considered when synthesizing pseudo trajectories, so that the epidemiological profile remains invariant. Therefore, we propose a method to estimate the epidemiological similarity between locations and then construct a location epidemiology graph, in which the nodes are locations and the edge weights are the epidemiological similarities between nodes. The similarity between two locations $l$ and $l'$ is calculated from the numbers of new cases at $l$ and $l'$ in each time interval $t$, together with the distance between $l$ and $l'$ scaled by a constant coefficient. We then run a clustering algorithm on the epidemiology graph to divide the location set into several clusters, as illustrated in Figure 3.3; within a specific cluster, locations share a similar epidemiological status.
Transforming Traces into Epidemic Domain: As demonstrated in Figure 3.4, we transform a trajectory into its corresponding epidemiological domain trajectory by substituting each location with its epidemiological equivalent. The epidemiological domain trajectory thus comprises all candidate locations that exhibit an epidemiological condition similar to that of the genuine one.
Sampling Trace from the Epidemic Domain: As depicted in Figure 3.5, we finally decode the epidemiological domain trajectory into the geographic domain, producing a trajectory with the same format as the original actual trajectory. Using a time-varying Markov model, we perform a random walk on the epidemiological domain trajectory. To initialize the generation procedure, we sample a location from the first candidate location set in the epidemiological domain trajectory as the initial location. Then, we randomly select each next location based on the aggregate mobility model (i.e., the transition probability). Note that multiple plausible trajectories can be generated by using different initial locations or by repeating the sampling process with the time-varying Markov model.
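The pipeline from aggregate mobility model to sampled pseudo trace can be sketched as below. This is a simplified illustration, not the algorithm as published: the transition model here is time-homogeneous and ignores the distance term, the clusters are supplied by hand rather than computed from the epidemiology graph, and the function names and the smoothing constant `eps` are hypothetical.

```python
import random
from collections import defaultdict


def aggregate_transitions(traces, locations, eps=1e-3):
    """Estimate smoothed transition probabilities P(l' | l) from sampled users' traces.

    Simplifications (assumptions): no time dependence and no distance term;
    eps plays the role of the small constant that avoids zero probabilities.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1.0
    probs = {}
    for src in locations:
        row = {dst: counts[src][dst] + eps for dst in locations}
        total = sum(row.values())
        probs[src] = {dst: w / total for dst, w in row.items()}
    return probs


def sample_pseudo_trace(real_trace, cluster_of, clusters, probs, rng):
    """Random walk over the epidemic-domain trajectory of a real trace."""
    # Epidemic domain: each real location is replaced by its cluster's candidate set.
    domain = [clusters[cluster_of[loc]] for loc in real_trace]
    current = rng.choice(domain[0])
    trace = [current]
    for candidates in domain[1:]:
        weights = [probs[current][c] for c in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        trace.append(current)
    return trace


locations = ["a", "b", "c", "d"]
clusters = {0: ["a", "b"], 1: ["c", "d"]}          # hand-made epidemic clusters
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
probs = aggregate_transitions([["a", "c", "a"], ["b", "d", "b"]], locations)
real = ["a", "c", "a"]
pseudo = sample_pseudo_trace(real, cluster_of, clusters, probs, random.Random(0))
```

Every step of the pseudo trace stays in the same epidemic cluster as the corresponding real step, while the transitions follow the aggregate mobility model, so the synthetic trace is harder to filter out over sequential queries than independently drawn pseudo locations.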

3.2.4 Privacy-preserving Model Updating. In our framework, although the two phases of aggregation are split between the clients and the server, the training of the model still follows the general setting of FL. That is, the parameters of the model are stored on the local client of each user, and the clients update the global model by uploading the updated gradients or parameters of the local model. Nevertheless, existing studies have shown that the private information of users can still be divulged via the uploaded gradients [42] or parameters [25,57]. Adding random noise is a natural approach to prevent these values from leaking too much information about the user, and one prominent example is differential privacy (DP) [1,57]. Thus, we utilize DPSGD [1], a widely used method satisfying DP, to train this distributed model. DPSGD clips the local model gradient $\mathbf{g}_u$ of user $u$ to a range bounded by a clip threshold and adds Gaussian noise, whose standard deviation $\sigma_f$ controls the volume of noise, before the gradient is uploaded to the server. Note that the DP mechanism introduced here is not identical to the differential privacy perturbation mechanism described in Section 3.2.2. DPSGD protects the model parameters under the FL framework, while the perturbation mechanism protects the node features under the FGML framework.

Macroscopic Model Cooperation
The privacy-preserving FGML framework we propose is capable of protecting users from location privacy leakage while helping them share cross-client information. However, the overall performance of the framework declines when we deploy DPSGD for distributed model training, together with the pseudo location generation method and the coupled perturbation mechanism for hiding users' genuine locations. The perturbation mechanism and the pseudo location generation method conceal the real locations of users by generating accompanying pseudo locations and adding noise to all location embeddings. These location obfuscation mechanisms diminish the information quality of spatio-temporal points, making it challenging for user nodes to aggregate accurate epidemic information from them. Therefore, it is natural to utilize an auxiliary model to re-extract noise-free information about the spatio-temporal points. Specifically, as shown in Figure 4, we construct a region-level infection prediction model as the auxiliary macroscopic model to recapture the epidemic information of locations. Then, we design a novel cooperative coupling mechanism to integrate the microscopic model with the macroscopic model and counteract the performance decline. Before diving into the details of the cooperative coupling mechanism, we first give a brief introduction to the macroscopic model.
Macroscopic Model: In contrast to the microscopic model, which models the infectious disease at the individual level, the macroscopic model models the pandemic at the level of locations or regions. Specifically, the goal of the macroscopic model is to predict the number of new infection cases for the $M$ regions over a future horizon by using the inter-regional population flow and the historical number of new infection cases in each location. In this paper, we utilize a spatio-temporal graph to model the population flow among different regions and then capture the epidemic diffusion with a GNN model. The population flow network among regions at time interval $t$ can be represented as a weighted directed graph $\mathcal{G}^{(t)} = (\mathcal{V}, \mathcal{E}, \mathbf{W}^{(t)})$, where $\mathcal{V}$ is the set of regions, $\mathcal{E}$ is the set of edges and $\mathbf{W}^{(t)} \in \mathbb{R}^{M \times M}$ is the corresponding weight matrix of the edges. Let $\mathbf{c}^{(t)} \in \mathbb{R}^{M \times 1}$ represent the new cases of each region observed in time interval $t$. The formal target of the macroscopic model is to fit a function $f$ that predicts the new cases at a future time from the records of a fixed number of historical time intervals and the corresponding graphs. In this work, we apply DCRNN [38], a widely used spatio-temporal graph neural network (STGNN), to this time series prediction task. In order to make DCRNN adapt to the dynamic changes of the inter-regional population flow, we revise its convolution layer to support a dynamic graph. The refined layer keeps the gated recurrent structure of DCRNN, with a reset gate and an update gate computed at each time interval $t$ from the input and the hidden state, but replaces the static diffusion convolution with a diffusion convolution over the current graph $\mathcal{G}^{(t)}$. This diffusion convolution, proposed in [38], models the dynamics of the epidemic as a diffusion process, and each gate has its own filter parameters. Note that the DCRNN model can be easily substituted with other STGNN models such as STGCN [73], GraphWaveNet [66] and D2STGCN [54].
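How a time-indexed flow matrix $\mathbf{W}^{(t)}$ might be tabulated from movement records can be sketched as follows. The text does not define the edge weights precisely, so counting the individuals who move from one region to another between consecutive intervals is an assumption, as are the input layout and the function name `flow_matrix`. How such counts are collected in a privacy-compatible way is outside the scope of this sketch.

```python
def flow_matrix(movements, regions, t):
    """Count moves from region i (at interval t) to region j (at interval t + 1).

    movements: dict mapping individual -> dict mapping time interval -> region.
    Returns an M x M nested list W with W[i][j] equal to the number of movers.
    Individuals without a record at both t and t + 1 are skipped.
    """
    index = {region: i for i, region in enumerate(regions)}
    W = [[0.0] * len(regions) for _ in regions]
    for visits in movements.values():
        if t in visits and t + 1 in visits:
            W[index[visits[t]]][index[visits[t + 1]]] += 1.0
    return W


movements = {
    "u1": {0: "A", 1: "B"},
    "u2": {0: "A", 1: "B"},
    "u3": {0: "B", 1: "B"},
    "u4": {1: "A"},  # no record at t = 0, so not counted
}
W0 = flow_matrix(movements, ["A", "B"], t=0)
```

Because the matrix is recomputed for every interval, the graph seen by the spatio-temporal model changes over time, which is why the convolution layer has to support a dynamic graph.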
Cooperative Coupling Mechanism: The microscopic and macroscopic models described so far are still two independent frameworks. To incorporate the individual-level infection prediction model with the auxiliary region-level infection prediction model, we design a cooperative coupling mechanism that establishes a connection between the two models. The performance degradation of the microscopic model is largely attributable to the excessive disturbance of spatio-temporal point embeddings during feature aggregation. Thus, we extract the output hidden state $\mathbf{E}_m$ from the encoder module of the macroscopic model, which contains noise-free information about spatio-temporal points and can serve as a powerful supplement to the microscopic model. Consequently, rather than only downloading the hyperedge embeddings (i.e., the aggregate location embeddings) for aggregation, each client also obtains the hidden state from the macroscopic model and concatenates it to the hyperedge embeddings. In this way, the propagation from edge to node deployed on the client side can be amended as follows:
$$\mathbf{X}^{(k+1)} = \sigma\left(\mathbf{D}_v^{-1/2} \mathbf{H} \left(\mathbf{E}^{(k+1)} \,\|\, \mathbf{E}_m\right) \mathbf{\Theta}^{(k)}\right),$$
where $\|$ denotes the concatenation operation and $\mathbf{E}_m$ is the output hidden state of the encoder module in the macroscopic model. Notably, the number of new infections in each region is estimated from the confirmed cases observed in that region, and the inter-regional population flux is determined by statistically analyzing individual trajectory data. Hence the macroscopic model requires no data beyond what the microscopic model already uses. In addition, information is propagated unidirectionally from the macroscopic model to the microscopic model. Thus, the coupling component introduces no information leakage channel into the microscopic model.
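The concatenation step of the coupling can be sketched as follows. This is an illustration only: it assumes the macroscopic hidden state can be looked up per `(location, time)` key, which the text does not spell out, and the function name `couple_embeddings` is hypothetical.

```python
def couple_embeddings(edge_embeddings, macro_hidden):
    """Concatenate the macroscopic hidden state to each noisy hyperedge embedding.

    edge_embeddings: dict mapping (location, time) -> perturbed hyperedge embedding.
    macro_hidden: dict mapping (location, time) -> hidden state from the
        macroscopic encoder (the per-key lookup is an assumption).
    """
    return {edge: list(vec) + list(macro_hidden[edge])
            for edge, vec in edge_embeddings.items()}


noisy_edges = {("cafe", 0): [0.1, 0.2], ("office", 1): [0.4, 0.3]}
macro = {("cafe", 0): [0.9], ("office", 1): [0.5]}
coupled = couple_embeddings(noisy_edges, macro)
```

The client then runs its edge-to-node aggregation on the widened vectors, so the trainable parameters of that layer must accept the larger input dimension. Data flows only from the macroscopic model into the client, never back.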

Prediction and Optimization
We utilize a fully connected layer to convert the node embedding $X^{(L)} \in \mathbb{R}^{N \times d}$ from the last HGNN layer into the probability output $Y \in \mathbb{R}^{N \times C}$:

$$Y = \mathrm{softmax}\left(X^{(L)} W\right),$$

where $X^{(L)}$ is the output of the last HGNN layer and $W \in \mathbb{R}^{d \times C}$ is the weight matrix of the linear layer. The softmax function is used to transform each prediction value into a probability. Then, we utilize the optimizer to minimize the cross-entropy loss, which is evaluated on the predicted values and the ground truth:

$$\mathcal{L} = -\sum_{i} \log \frac{\exp(y_i[c_i])}{\sum_{j} \exp(y_i[j])},$$

where $y_i$ denotes the probability vector of node $i$ and $c_i$ represents the target class of node $i$. Thus, $y_i[c_i]$ indicates the predictive value of the target class, and $y_i[j]$ indicates the predictive value of class $j$.
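The prediction head and loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy dimensions, the `labeled` mask that restricts the loss to individuals with known infection state, and the choice to take the mean negative log-probability directly on the softmax output are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def predict_proba(node_emb: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Linear layer followed by softmax: (N, d) @ (d, C) -> (N, C)."""
    return softmax(node_emb @ weight)


def cross_entropy(proba: np.ndarray, targets: np.ndarray, labeled: np.ndarray) -> float:
    """Mean negative log-probability of the target class over labeled nodes.

    `labeled` is a boolean mask (an assumption of this sketch) selecting the
    nodes whose infection state is known.
    """
    target_proba = proba[np.arange(len(targets)), targets]
    return float(-np.log(target_proba[labeled] + 1e-12).mean())


rng = np.random.default_rng(0)
node_emb = rng.normal(size=(5, 4))   # toy setting: N = 5 individuals, d = 4
weight = rng.normal(size=(4, 2))     # C = 2 classes
proba = predict_proba(node_emb, weight)
targets = np.array([0, 1, 0, 1, 1])
labeled = np.array([True, True, False, False, True])
loss = cross_entropy(proba, targets, labeled)
```

In practice the weight matrix would be trained jointly with the HGNN layers by an optimizer such as Adam; the sketch only evaluates the forward pass and the loss.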

Algorithm Analysis
3.5.1 Privacy Guarantee. In this section, we systematically introduce the privacy guarantees of our framework from two perspectives, the cross-client side and the server side, which address the challenges of cross-client information missing and graph structure leakage, respectively.
• Cross-client side: We design a detached propagation method on the proposed spatio-temporal hypergraph, which ensures that trajectory information is stored only on the local client and theoretically guarantees privacy on the cross-client side. Specifically, clients only share the locally calculated gradients of hyperedge embeddings with the server, and then download the aggregated gradients from the server for the subsequent prediction procedure. Thus, the privacy of each client is guaranteed with respect to other clients.
• Server side: Even with the proposed detached propagation mechanism, the central server can still infer the actual location using adversarial algorithms such as the inference attack via non-zero gradients [42] or the location inference attack [58]. Thus, we propose a perturbation mechanism and a pseudo location synthesis method to address these two concerns:
- Pseudo location synthesis method against the location inference attack (localization attack [58]): We propose a pseudo location generation algorithm to protect the user's location privacy, i.e., to hide the user's actual location from the server and to prevent inference of the full trajectory. The user's device sends a number of pseudo locations along with the real location. For example, if three pseudo locations are used, the device sends four locations (one real and three pseudo) to the server. The method is designed so that the server does not know which location is real and cannot estimate the real location from multiple queries over time. Additional experiments in Section 4.4 verify that our pseudo location synthesis method can defend against the widely used location inference attack [58], and the results demonstrate that it achieves a better balance between privacy and utility.
- Perturbation mechanism against the inference attack via gradients [42]: During training, the gradient of the location embedding in a client is sparse, with non-zero entries only for the locations in the client's visit history. Besides, the client needs to upload the gradients of its local location embeddings to the server for aggregation. Thus, we propose a perturbation mechanism that prevents the server from inferring the actual locations from the non-zero gradients.
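To make the server-side protection more concrete, the following NumPy sketch shows one way a client could combine the two ideas: it uploads gradient rows for the real and pseudo locations alike, clips each row, and adds Gaussian noise so that non-zero rows no longer single out visited locations. This is an illustrative sketch under stated assumptions, not the authors' algorithm: the function name, the per-row clipping rule, and the parameters `clip` and `sigma` are hypothetical.

```python
import numpy as np


def perturb_location_gradients(grad, real_ids, pseudo_ids, clip=0.1, sigma=0.1, rng=None):
    """Clip and add Gaussian noise to the rows of real and pseudo locations.

    grad:       (num_locations, dim) gradient of the location embedding;
                only rows of visited (real) locations are non-zero.
    real_ids:   indices of locations the client actually visited.
    pseudo_ids: indices of pseudo locations uploaded as cover.
    Returns the perturbed gradient and the sorted indices that are uploaded.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros_like(grad, dtype=float)
    upload_ids = np.union1d(real_ids, pseudo_ids)
    for loc in upload_ids:
        row = grad[loc]
        norm = np.linalg.norm(row)
        row = row * min(1.0, clip / (norm + 1e-12))   # bound each row's L2 norm by `clip`
        out[loc] = row + rng.normal(0.0, sigma, size=row.shape)
    return out, upload_ids


rng = np.random.default_rng(0)
grad = np.zeros((6, 3))
grad[2] = [3.0, 4.0, 0.0]            # the only location actually visited
noisy, uploaded = perturb_location_gradients(
    grad, real_ids=[2], pseudo_ids=[0, 5], clip=0.1, sigma=0.05, rng=rng
)
```

With `sigma > 0`, every uploaded row is non-zero, so the sparsity pattern alone no longer reveals which location is real; how much noise is needed to reach a formal differential privacy guarantee depends on the accounting used and is outside the scope of this sketch.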

Computational Complexity. Existing GNN-based models only model the interactions among users and cannot characterize the complicated interactions between individuals and locations well. Besides, the user-user interaction graph is too large to compute on, with $O(N^2)$ time complexity, where $N$ denotes the number of individuals. We design a spatio-temporal hypergraph construction method, as presented on the left of Figure 1, which regards each individual as a node and connects all nodes that visit a location during the same time interval with a hyperedge. This design not only captures the temporal and spatial dependencies simultaneously but also models the complex interactions between users and locations. Therefore, our model (even without the macroscopic model enhancement) achieves better prediction performance. At the same time, our method greatly reduces the computational complexity of the GNN to $O(N \times M)$, where $M$ denotes the number of locations, which is usually much lower than the population size.
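The hypergraph construction described above can be sketched in a few lines of Python. The visit-record format `(individual, location, time_slot)` and the helper names are assumptions made for illustration; the sketch also counts the pairwise contacts that a user-user graph would need, to show why the hyperedge representation is more compact.

```python
from collections import defaultdict


def build_hyperedges(visits):
    """Group individuals by spatio-temporal point.

    visits: iterable of (individual_id, location_id, time_slot) records.
    Returns {(location_id, time_slot): set of individual ids}, i.e. one
    hyperedge per visited spatio-temporal point.
    """
    members = defaultdict(set)
    for person, location, slot in visits:
        members[(location, slot)].add(person)
    return dict(members)


def pairwise_contact_count(hyperedges):
    """Number of distinct user-user edges an ordinary contact graph would store."""
    pairs = set()
    for people in hyperedges.values():
        ordered = sorted(people)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return len(pairs)


visits = [
    (0, "A", 0), (1, "A", 0), (2, "A", 0),   # three people co-located at A in slot 0
    (3, "B", 0),
    (0, "B", 1), (3, "B", 1),
]
hyperedges = build_hyperedges(visits)
incidence_entries = sum(len(people) for people in hyperedges.values())
contact_edges = pairwise_contact_count(hyperedges)
```

A hyperedge with k members costs k incidence entries, whereas the equivalent clique in a user-user graph costs k(k-1)/2 edges, which is where the gap between the two representations comes from for crowded locations.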

Experiment Settings
4.1.1 Simulation Environment. In this paper, we build an epidemic prediction benchmark environment based on the simulator provided in the Prescriptive Analytics for the Physical World (PAPW) Challenge, which has been widely used in existing research [12,15,16]. This simulator provides an accurate approach to synthesizing human mobility and then simulates disease transmission based on the mobility data. To reconstruct realistic scenarios more accurately, we substitute the synthetic mobility data with two real-world, fine-grained mobility datasets that represent two distinct scenarios. Besides, we provide two sets of disease transmission parameters, for SARS-CoV-2 and for the Omicron variant, where the Omicron setting is used in an extended experiment to evaluate the generalization of Falcon in Section 4.3.

Experimental Scenarios.
To conduct multi-scenario experiments, we collect and prepare two datasets of location information for two distinct scenarios, Basic and Larger. The information of the two scenarios is given in Table 2. For the Basic scenario, we consider the diffusion of an infectious disease within M regions in a district. Each region is considered a POI (Point of Interest, e.g., school, restaurant, bank). This scenario corresponds to the situation in which trajectory data is collected by fine-grained localization technologies such as GPS [2], Bluetooth [14], and Wi-Fi [69]. For the Larger scenario, we consider the disease spread within a city divided into regions, where each region is a community with an average radius of approximately 0.79 km. This scenario corresponds to the situation in which trajectories are collected by coarse-grained localization technologies such as Cell-ID [60].
The aforementioned datasets of the two scenarios were supplied by a well-known location-based services (LBS) provider. When users use location-based services, their coordinates are uploaded to the cloud server. For the ethical use of this data, we ensure that every user is aware of and consents to the collection of their location information, and all user IDs are anonymized to prevent users from being identified. Besides, our affiliation has issued the IRB approval and determined this study to be non-human subject research.
In this research, the epidemic dynamics are characterized by an SEIR model [71], and the parameters are calibrated to align with the $R_0$ of SARS-CoV-2 (5.7) provided by recent research [52]. To further evaluate the generalization ability of the models, as discussed in Section 4.3, we recalibrate the epidemic parameters to align with the $R_0$ of the Omicron variant (10.78) for an additional experiment. Detailed information on the two sets of disease transmission parameters is presented in Table 3, where $\beta$ is the transmission rate of infected individuals, $1/\gamma$ denotes the recovery period, and $1/\sigma$ is the average duration of the latent period. $R_0 = \beta/\gamma$ represents the basic reproductive number, which indicates the average number of secondary cases attributable to infection by an index case by the time that case has recovered [52]. The number of epidemic simulation days is set to align with the trajectory length for both the Basic and Larger scenarios. To obtain a reasonable infected ratio on the last day, we initialize 300 and 30 infected individuals for SARS-CoV-2 and Omicron, respectively.
4.1.4 Evaluation Protocol. In our experiments, we assume that we have prior knowledge of 40% of the positive cases and try to track the remaining potentially infected individuals. To comprehensively measure the performance of the various baseline methods, we evaluate them with multiple metrics, including AUC, F1-score, Accuracy, and BEP (Break-Even Point, the value at which recall and precision are equal). Note that the F1-score and accuracy reported later are both the maximum values the model can achieve in the trade-off between precision and recall.
To provide a more intuitive metric for the infection prediction task, we introduce a metric called disease extinction precision (DEP). The DEP assesses the maximum prediction precision that an individual-level infection prediction model can achieve while maintaining the minimum recall required for disease extinction.
In epidemiological theory, the total number of infections increases by a factor of $R_0$ after each average recovery period $1/\gamma$. Thus, we need to detect and control a ratio of $(R_0 - 1)/R_0$ of the positive cases to guarantee the extinction of the disease, which corresponds to the minimum required recall $r_{\min} = 1 - 1/R_0$. Based on this minimum recall for disease extinction, we define DEP as follows.
Definition 4.1 (Disease Extinction Precision). DEP is the precision that the prediction model can offer while maintaining the minimal recall $r_{\min}$. Since we must maintain a recall of at least $r_{\min}$ to ensure disease extinction, the value of DEP is the precision at the point where the recall reaches $r_{\min}$:

$$\mathrm{DEP} = P(r_{\min}), \qquad (18)$$

where $P(\cdot)$ represents the function corresponding to the precision-recall curve.
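As a worked illustration of the DEP definition, the sketch below computes the minimum recall from the basic reproduction number and then reads off the best precision among all ranking cutoffs whose recall meets that minimum. The function names and the cutoff-scanning procedure are assumptions of this sketch rather than the authors' evaluation code. For the two settings used in this paper, the minimum recall is roughly 0.825 for a basic reproduction number of 5.7 and roughly 0.907 for 10.78.

```python
def min_recall_for_extinction(r0: float) -> float:
    """Minimum recall 1 - 1/R0 needed to push the effective reproduction below 1."""
    return 1.0 - 1.0 / r0


def disease_extinction_precision(scores, labels, r0):
    """Best precision over ranking cutoffs whose recall is at least 1 - 1/R0.

    scores: predicted infection risk per individual (higher = more likely).
    labels: 1 for infected, 0 otherwise.
    Individuals are ranked by score; ties keep their input order.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    r_min = min_recall_for_extinction(r0)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best, hits = 0.0, 0
    for k, (_, label) in enumerate(ranked, start=1):
        hits += label
        if hits / total_pos >= r_min:        # recall requirement satisfied
            best = max(best, hits / k)       # precision at this cutoff
    return best


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
dep = disease_extinction_precision(scores, labels, r0=2.0)   # toy R0, r_min = 0.5
```

In the toy example the recall requirement of 0.5 is first met at the third-ranked individual, and the best feasible precision is obtained by flagging the top four.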

Baseline Methods.
To conduct a comprehensive comparison, we select a series of traditional methods and recent state-of-the-art works, resulting in a total of twelve baseline algorithms in our experiments. They comprise a vanilla GCN model [33], two attention-based GCN models [56,61], an infection prediction GNN model [59], five STGNN models [35,38,54,66,73], one FGML framework [65], and an HGNN model [17]. Furthermore, DCT is adopted as a baseline method for estimating the utility of existing digital contact tracing techniques [44]. Of all the baseline methods mentioned above, only the FGML framework FedPreGNN [65] takes user privacy into consideration, and it still fails to address the privacy leakage of graph structures [18].
• DCT [44]: Digital contact tracing utilizes modern communication technologies, such as GPS, Wi-Fi, and Bluetooth, to digitally track and alert users who have interacted with infected individuals [44].In this work, DCT will track all potential infected individuals whose trajectories have intersected with those who have tested positive.
• GCN [33]: The basic semi-supervised classification algorithm on graph-structured data.It extends the convolutional operation into the graph structure by adopting ChebNet's first-order approximation in layer-wise propagation.
• GAT [61]: A well-known revised version of the GCN model. GAT enables the model to focus on important links by assigning different weights to neighbor nodes with the attention mechanism. Moreover, GAT incorporates multi-head attention to stabilize the learning process.
• UniMP [56]: A GCN-based semi-supervised classification algorithm that can simultaneously perform feature embedding and label embedding as input information for propagation within a graph transformer network.Besides, a masked label prediction strategy is proposed to avoid overfitting.
• EPIGNN [59]: A GCN-based individual-level infection prediction algorithm that is capable of estimating the infection state of all individuals in a large contact network.
• STGCN [73]: Spatio-Temporal Graph Convolutional Networks is a GCN model that captures spatial dependency and temporal dependency simultaneously by integrating graph convolution and gated temporal convolution through the spatio-temporal convolutional block.
• DCRNN [38]: Diffusion Convolutional Recurrent Neural Network models the spatio-temporal dependency by proposing a novel Diffusion Convolutional Gated Recurrent Unit (DCGRU), which replaces the fully connected layer in GRU with a graph convolutional layer.
• GraphWaveNet [66]: GraphWaveNet stacks the diffusion convolution layer and the dilated causal convolution layer [46] to jointly capture the spatial-temporal dependency.
• D2STGNN [54]: Decoupled Dynamic Spatial-Temporal Graph Neural Network proposes a Decoupled Spatial-Temporal Framework (DSTF) to separate the diffusion signal and the inherent signal from the original data.
• DSTAGNN [35]: Dynamic Spatial-Temporal Aware Graph Neural Network introduces a novel dynamic spatial-temporal aware graph based on a data-driven strategy to substitute the static graph, as well as a multi-head attention mechanism to capture dynamic relations among nodes.
• FedPreGNN [65]: FedPreGNN introduces a privacy-preserving graph expansion method to alleviate the information isolation problem in the federated learning scenario.
• Hypergraph Neural Networks (HGNN) [17]: This pioneering work extends the GCN model from an ordinary graph to a hypergraph, which can capture high-order interactions among more than two nodes. To adapt this model to the individual-level infection prediction task, we utilize the hyperedge of HGNN to represent a spatio-temporal point. Thus, it can be considered a simplified version of our framework without specific modules such as FL, the obfuscation mechanisms, and the cooperative coupling mechanism.
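Among the baselines listed above, DCT is simple enough to sketch directly. The snippet below flags every individual who shared a spatio-temporal point with a confirmed case; the record format `(individual, location, time_slot)` and the function name are assumptions made for illustration, not the exact procedure used in our experiments.

```python
def digital_contact_tracing(visits, confirmed):
    """Flag individuals whose trajectory intersects that of a confirmed case.

    visits:    iterable of (individual_id, location_id, time_slot) records.
    confirmed: ids of individuals who have tested positive.
    Returns the set of flagged (potentially infected) individuals,
    excluding the confirmed cases themselves.
    """
    visits = list(visits)
    confirmed = set(confirmed)
    # Spatio-temporal points visited by at least one confirmed case.
    exposed_points = {(loc, slot) for person, loc, slot in visits if person in confirmed}
    flagged = {person for person, loc, slot in visits if (loc, slot) in exposed_points}
    return flagged - confirmed


visits = [
    (0, "A", 0), (1, "A", 0), (2, "A", 0),
    (3, "B", 0),
    (0, "B", 1), (3, "B", 1),
]
flagged = digital_contact_tracing(visits, confirmed={0})
```

The output is a binary flag per individual rather than a risk score, so there is no threshold to tune; this is the reason DCT yields a single operating point instead of a full precision-recall curve.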
4.1.6 Experiment Environment. All experiments are run on a Linux server (CPU: AMD EPYC-7763, GPU: NVIDIA GeForce RTX 3090). The default hyperparameter settings are as follows. For model optimization, we set the learning rate to 0.001, the dropout rate to 0.2, the number of training epochs to 500, and the weight decay to 0.0005. For the mobility and epidemic data, the granularity of the time interval is set to 2 hours. We assume that the infection status of 40% of the total population is obtainable, meaning that 40% of individuals are used as training samples. The Adam optimizer is utilized to train the prediction model. For the perturbation mechanisms, the number of pseudo trajectories is set to 2 by default, and we set $\epsilon = 1$, $\delta = 0.001$, and the clipping threshold to 0.1 for the two differential privacy mechanisms (i.e., DPSGD and the differential privacy perturbation mechanism). Our individual-level infection prediction model consists of an embedding layer for feature extraction, two HGNN layers for information propagation, and a fully connected layer for prediction.

Overall Performance Comparison
As shown in Table 4, we first compare Falcon with the baselines on all evaluation criteria under the Basic and Larger scenarios, including multiple metrics for the classification task and our proposed metric DEP. In addition, we illustrate the precision-recall curves of each algorithm in Figure 5 for a more intuitive presentation. From these experiments, the following observations can be summarized:
• Our proposed model significantly outperforms existing baseline methods in both the Basic and Larger scenarios, even within a strict privacy budget: Across the scenarios, Falcon always achieves the best infection prediction performance, despite employing several obfuscation mechanisms for location privacy preservation. In particular, with the spatio-temporal hypergraph construction and the cooperative coupling mechanism, Falcon improves the prediction precision by at least 22.43% to 24.29% in multiple realistic scenarios. Besides, compared with the existing state-of-the-art FGML framework, i.e., FedPreGNN [65], our method improves the prediction precision by about 50.96% to 56.49%. Furthermore, as demonstrated in Figure 5, our method always strikes the best balance between precision and recall.
• DCT achieves fairly weak and inflexible performance in practical applications, especially with coarse-grained mobility data: The basic concept of DCT is to directly classify individuals who have had close contact with confirmed infected individuals as the high-risk population. Therefore, it is hard for DCT to provide a flexible trade-off between precision and recall, which is critical for meeting various requirements of intervention strength. Specifically, as shown in Figure 5, the recall achieved by DCT in the Basic scenario is not sufficient to control the outbreak of the pandemic, and the extremely low accuracy achieved in the Larger scenario would waste substantial medical resources. In contrast, GNN-based models achieve relatively better prediction performance in both scenarios while simultaneously providing continuously adjustable intervention strength. This further verifies the necessity of designing a machine learning-enhanced DCT technology.
• STGNNs generally achieve better performance by capturing the spatio-temporal dependency: Except for our proposed algorithm and HGNN, which utilize hyperedges to model spatio-temporal points, STGNNs show the most promising performance among all baselines. STGNNs such as D2STGNN [54] and DSTAGNN [35] capture both the spatial and temporal dependencies during disease transmission with GNNs and temporal units. Among them, DSTAGNN achieves better performance than the other STGNNs. The underlying reason is that DSTAGNN effectively models the dynamic relationships among nodes, providing a more accurate representation of the evolving contact networks during the pandemic. The overall result shows that it is crucial to model the spatio-temporal dependency in the infection prediction task.
• Applying the attention mechanism to GNNs leads to similar or even worse prediction performance: Neither of the attention-enhanced GNN models, GAT and UniMP, provides performance gains for the prediction task. A potential explanation is that attention cannot assign correct weights to edges given the limited node features available in disease prediction. This phenomenon also points to the importance of estimating the weights of different interactions among individuals, which is a key to improving prediction precision in future research.

Generalization Ability
In this section, we evaluate the generalization ability of Falcon from the following two perspectives.

Transmission Rates of Diseases:
Different infectious diseases, and even different variants of the same infectious disease, have different transmission rates. To verify the generalization ability of Falcon across COVID-19 strains with varying transmission rates, we conduct extended experiments under the disease parameters of the Omicron variant. The results in Figure 6 demonstrate that, under the Omicron setting, Falcon also achieves the best performance and maintains a better balance between precision and recall. Thus, Falcon can effectively and efficiently track potentially infected individuals to control the spread of the disease.

Stages of Pandemic:
Outbreaks can occur at various stages of disease transmission. Therefore, we investigate the prediction performance of Falcon and representative baselines at different stages. The experiments are conducted under the Larger scenario with the Omicron settings. As shown in Figure 7, we first plot the cumulative number of susceptible, infectious (including asymptomatic and symptomatic), and recovered individuals for each day, and then provide the daily number of newly confirmed infectious cases. We notice that the growth rate of the number of infections peaks around the 14th to 15th day, indicating the occurrence of an outbreak. Therefore, we evaluate each algorithm on the 10th and 14th day as two representative scenarios before and after the outbreak, respectively. The results are summarized in Table 5 and indicate that, compared with other algorithms, Falcon achieves better performance both before and after the outbreak. Moreover, during the early stage of the pandemic (before the outbreak), Falcon ensures a precision higher than 20% even when infectious individuals make up only a small percentage (6%) of the population. Therefore, at the same cost, Falcon can detect more potentially infected individuals, thus contributing to faster control of the pandemic.

Evaluation of Privacy and Utility
As discussed in Section 3.5.1, there exist two privacy concerns with regard to the server side: the localization attack and the inference attack. We introduce a pseudo location synthesis method and a perturbation mechanism to address these two concerns. Therefore, we conduct experiments to assess the level of privacy provided by these two privacy-preserving methods and the performance decline they incur.
Perturbation mechanism against inference attack via gradient: We perform a revised inference attack via gradient [42] on the honest-but-curious server. Specifically, the attacker sets a threshold on the 2-norm of the gradient. Locations whose gradient norm is above this threshold are considered real locations, while those below are considered pseudo locations. The intensity of the proposed perturbation mechanism is determined by the magnitude of the Gaussian noise. Thus, we summarize how the prediction performance and the attack error rate evolve with the noise magnitude, where we set the number of pseudo locations to nine. As shown in Table 6, as the magnitude of the noise increases, the attack error rate gradually rises and eventually reaches 90% (random guess). When no noise is added, the error rate of the attack is 0, since the server can directly infer the actual location.
In the pseudo location synthesis method, each client uploads pseudo locations along with its genuine location. Correspondingly, a widely adopted location inference attack, the localization attack [58], is deployed on the LBS server. Given the historical observations (i.e., the sequence of locations queried to the LBS), the localization attack concentrates on discovering the user's true location at each time interval. Formally, the goal is to estimate, for each region r ∈ R, the posterior probability that the user is located in r at that time interval given the observations. These probabilities can be easily computed with the Forward-Backward algorithm [48]. Then, the adversary can form a distribution over possible regions, from which it can choose the most probable one. We utilize the error probability of the inference attack as the privacy level of the pseudo location generation algorithm, where a larger value means a higher privacy guarantee. Different pseudo location selection mechanisms not only provide distinct privacy levels but also degrade the utility of the infection prediction model to various extents. Therefore, we also evaluate the utility loss caused by the pseudo location generation technique. Here, we evaluate the performance decline with two metrics, AUC and F1-score.
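The gradient-norm attack on the perturbation mechanism, described earlier in this section, can be illustrated with a small NumPy sketch. The error definition used here is an assumption of the sketch: the attacker keeps every uploaded row whose 2-norm exceeds the threshold and then guesses uniformly among those candidates, so the expected error is 1 - 1/(number of candidates) when the real location survives the threshold. The function name and the toy noise level are hypothetical.

```python
import numpy as np


def expected_attack_error(uploaded_grad, real_row, threshold):
    """Expected error of a threshold attacker guessing the real location.

    uploaded_grad: (num_uploaded, dim) gradient rows received by the server,
                   one per uploaded (real or pseudo) location.
    real_row:      index of the row that belongs to the real location.
    threshold:     rows with a larger L2 norm are treated as candidates.
    """
    norms = np.linalg.norm(uploaded_grad, axis=1)
    candidates = np.flatnonzero(norms > threshold)
    if candidates.size == 0:
        candidates = np.arange(len(norms))   # nothing stands out: guess among all rows
    if real_row not in candidates:
        return 1.0
    return 1.0 - 1.0 / candidates.size


rng = np.random.default_rng(0)
clean = np.zeros((10, 4))      # one real and nine pseudo locations
clean[3] = 0.05                # only the real location has a non-zero gradient
noisy = clean + rng.normal(0.0, 1.0, size=clean.shape)

error_clean = expected_attack_error(clean, real_row=3, threshold=1e-6)
error_noisy = expected_attack_error(noisy, real_row=3, threshold=1e-6)
```

Without noise the real row is the only non-zero one and the attack never errs; once every row carries noise, the attacker is reduced to a uniform guess among ten locations, which matches the 90% random-guess level quoted for nine pseudo locations.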
In this section, we compare our pseudo location generation algorithm with the following three existing pseudo location generation methods:
• Uniform IID [58]: We synthesize each pseudo location independently and identically from the uniform probability distribution. Therefore, the pseudo trace consists of a set of unrelated pseudo locations.
• Aggregate Mobility IID [58]: We synthesize each pseudo location independently and identically from the aggregate mobility profile. Similarly, the pseudo trace consists of a set of unrelated pseudo locations.
• Random Walk on Aggregate Mobility [72]: We synthesize the pseudo location sequence by continuously sampling the next location following the aggregate transition probability distribution.
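The three baseline generators listed above differ only in the distribution they sample from, which the following sketch makes explicit. It uses Python's standard `random` module; the function names, the representation of the aggregate mobility profile as a list of visit probabilities, and the representation of the aggregate transition distribution as a row-stochastic matrix are assumptions made for illustration.

```python
import random


def uniform_iid(n_locations, length, rng):
    """Each pseudo location is drawn independently and uniformly."""
    return [rng.randrange(n_locations) for _ in range(length)]


def aggregate_mobility_iid(profile, length, rng):
    """Each pseudo location is drawn independently from the aggregate visit profile."""
    return rng.choices(range(len(profile)), weights=profile, k=length)


def random_walk_on_aggregate_mobility(transition, start, length, rng):
    """Each pseudo location is drawn from the transition row of the previous one."""
    trace = [start]
    while len(trace) < length:
        row = transition[trace[-1]]
        trace.append(rng.choices(range(len(row)), weights=row, k=1)[0])
    return trace


rng = random.Random(0)
profile = [0.6, 0.3, 0.1]                 # toy aggregate mobility profile
transition = [                            # toy aggregate transition matrix
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
]
trace_uniform = uniform_iid(3, 5, rng)
trace_profile = aggregate_mobility_iid(profile, 5, rng)
trace_walk = random_walk_on_aggregate_mobility(transition, start=0, length=5, rng=rng)
```

Only the random walk produces traces whose consecutive locations are correlated, which is why it is harder to distinguish from a genuine trajectory than the two IID baselines.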
The experiment is conducted in the context of the Larger scenario, and we run it with three different numbers of pseudo trajectories: 1, 5, and 10. Note that the macroscopic model is removed, since it would largely offset the performance decline caused by the privacy mechanisms, as we discuss in Section 4.5. In Figure 8, we illustrate the trade-off between user privacy and model utility for the aforementioned pseudo location generation algorithms. The results show that our method clearly outperforms all the existing techniques, especially the random strategies. For the random walk on aggregate mobility, the privacy level against the localization attack is close to what we achieve, owing to its similarity with our method in geographic semantics. However, our method incurs a smaller performance decline by taking the random walk on the epidemic domain.

Ablation Study
To evaluate the effectiveness of the macroscopic model, which is proposed to counter the performance decay produced by the privacy-preserving mechanisms, we perform several ablation experiments by disabling specific modules. In particular, we remove the macroscopic model to assess the performance gain it brings. Then we further remove specific privacy-preserving modules one by one to estimate their impact on model performance and verify the robustness of Falcon under a strict privacy budget. Specifically, we remove the perturbation mechanism and the plausible pseudo location generation method, either separately or simultaneously. The results are presented in Table 7, where we deploy five variant models of Falcon. In particular, the variant with the macroscopic model and all privacy-preserving techniques removed coincides with the HGNN model. From these results, we summarize the findings as follows:
• With the enhancement of the macroscopic model, Falcon can effectively mitigate the performance decay: Falcon not only overcomes the performance decay but also achieves significantly higher prediction accuracy under the various privacy-preserving mechanisms. Besides, with the integration of the macroscopic model, the performance of Falcon remains robust under various privacy strategies.
• Even without the macroscopic model, Falcon still achieves better prediction performance than the baseline methods: We compare Falcon without the macroscopic model with all the baseline methods mentioned before. Falcon with strict privacy guarantees always achieves the best overall performance relative to the baselines, which have no privacy consideration. This demonstrates that, although the macroscopic model significantly improves prediction performance, the microscopic model alone remains an innovative and effective approach compared with existing methods.

Hyperparameters Study
In real-world scenarios, it is impossible for individuals to continuously upload their trajectories and report their infection states. Furthermore, different reporting rates can be attributed to various factors, such as government policies and the initiative of the public. Consequently, we conduct experiments to quantify the impact of two hyperparameters, the reporting rate of infection states and the proportion of uploaded trajectory points, on the performance of infection prediction models.
Trajectory Reporting Frequency: We conduct the experiment with a set of reporting frequencies from 1 to 0.5, where 1 means that each individual continuously uploads their complete movement trajectory, while 0.5 means that each trajectory point is not uploaded with a probability of 50%. The results are summarized in Figure 9(a). We observe that Falcon consistently achieves the best and most stable prediction performance, because it can extract multi-scale information about disease transmission through the joint macroscopic model. The performance of the baselines declines to varying degrees as the reporting frequency decreases, with the DCT method showing a more significant decline. This reveals the high dependence of traditional DCT methods on individual trajectory reporting rates, which aligns with existing research findings [13,41]. At the same time, it emphasizes the necessity of combining DCT with strong and effective methods such as GNNs.
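The reduced reporting frequency in this experiment can be emulated by randomly dropping trajectory points before the hypergraph is built. The sketch below assumes that each point is kept independently with probability equal to the reporting frequency; this per-point reading, the function name, and the record format are assumptions of the illustration.

```python
import random


def subsample_trajectory(points, report_freq, rng):
    """Keep each trajectory point independently with probability `report_freq`."""
    return [point for point in points if rng.random() < report_freq]


rng = random.Random(0)
trajectory = [("A", 0), ("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)]
reported = subsample_trajectory(trajectory, report_freq=0.5, rng=rng)
```

A frequency of 1 leaves the trajectory untouched, and the order of the surviving points is preserved, so downstream time-slot grouping is unaffected.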
State Reporting Rate: Similarly, we select six different reporting rates, ranging from high to low, to cover most of the scenarios that may exist in reality. As depicted in Figure 9(b), Falcon under scenarios with lower reporting rates still achieves better prediction performance than the baselines with higher reporting rates. Consequently, Falcon can maintain relatively good prediction utility even when individuals are not actively reporting their infection status.

RELATED WORKS

Federated Graph Machine Learning (FGML)
With the rapid development of machine learning and the emerging requirement of privacy protection [36], federated learning has been proposed to train a global model with massive distributed clients [34], which protects user privacy by keeping data on local devices. Further, some researchers introduce differential privacy into federated learning [1,40] to defend against inference attacks on model gradients [25,42]. Meanwhile, some works [8,24,43,65] integrate federated learning with graph neural networks [33] to develop a federated graph machine learning (FGML) framework. Although the existing works provide a promising paradigm of FGML for multi-scenario applications (e.g., recommendation systems [39,65], traffic prediction [43], and molecular prediction [24]), some critical challenges remain to be addressed, including cross-client missing information [8], privacy leakage of graph structures [39], and performance decline [65]. To tackle these challenges, we propose a spatio-temporal hypergraph construction method and split the hypergraph propagation procedure into two stages distributed across the server and the clients for precise individual-level infection prediction. Besides, a macroscopic model is introduced to address the performance decline.

Individual-level Infection Prediction
Conventional individual-level infection prediction is carried out by digital contact tracing (DCT) [21]. However, DCT fails to provide a precise estimate of an individual's infection risk in practice [11]: existing DCT techniques only predict potential positive cases by cross-checking against the traces of confirmed cases [10,32,47,49], so the complex interactions among massive numbers of individuals are not modeled well [6].
Recently, a large body of research has investigated the epidemiology domain with the help of GNNs. Some works [19,46,63] utilize GNNs to embed spatial signals from disease dynamics to achieve more accurate infection prediction at the region level. Gao et al. [19] introduce a spatio-temporal attention network that fuses recurrent neural networks (RNN) and graph attention networks (GAT) to simultaneously extract geographic trends and temporal patterns during disease transmission. To account for the limited amount of data in some countries, Panagopoulos et al. [46] propose a method based on model-agnostic meta-learning to transfer the GNN-based infection prediction model from one country to another where limited data is available. Wang et al. [63] design a causal module on GNNs to further capture the causal temporal signals in the pandemic for better performance. However, these methods only model interactions at the region level, and thus fail to capture the intercorrelation among a large number of individuals and to accurately assess the infection risk of each person. Thus, other works [15,45,59] explore employing GNNs in individual-level tasks and achieve relatively good performance. Murphy et al. [45] utilize a GNN model to learn the contagion dynamics on complex networks and exploit the percolation and phase transitions in the epidemic. Tomy et al. [59] propose an individual-level infection prediction model that employs GNNs to capture the contacts among individuals. Feng et al. [15] utilize GNNs and reinforcement learning to design an individual-level epidemic control agent for searching for efficient intervention strategies. Nevertheless, these works have the following limitations. First, they only model the interactions among individuals and omit the high-order interactions between individuals and locations. Second, none of the existing works addresses the privacy concern of using mobility data. In this work, we propose a novel federated graph learning method for infection prediction, namely Falcon, to capture the high-order interactions within a strict privacy budget.

CONCLUSION
In this paper, we investigate individual-level infection prediction for more precise individual-level intervention strategies (e.g., early warning and mobility control) and propose Falcon, a privacy-preserving federated graph learning framework. In this framework, we introduce a novel hypergraph construction method to capture the spatio-temporal dependency, and then utilize the spatio-temporal hyperedge as a bridge for cross-client information sharing under privacy constraints. Besides, we design two obfuscation mechanisms that protect the users' actual locations from being divulged to the honest-but-curious server. Furthermore, a novel cooperative coupling mechanism is designed to integrate the microscopic model with a macroscopic model to overcome the performance decline caused by the obfuscation mechanisms. Extensive multi-scenario experiments verify that our method outperforms existing state-of-the-art baseline methods. Our analysis provides useful guidance for future research on individual-level infection prediction.

Fig. 1. Overall architecture of Falcon.

Definition 2.1 (Individual-level COVID-19 Infection Prediction in the manner of FL). Given a city with M areas, N individuals, the reported historical trajectories of all individuals, and the diagnostic results for a proportion of the population, the overall goal of the entire FL framework is to train an effective distributed individual-level COVID-19 infection prediction model that predicts the infection status of all remaining individuals, i.e., those among the N individuals whose diagnostic results are not reported.

Fig. 4. Illustration of the coupling mechanism between microscopic and macroscopic models.

Fig. 5. The precision-recall (PR) trade-off curves of all baseline methods on the two scenarios.

Fig. 6. The precision-recall (PR) trade-off curves of all baseline methods on the two scenarios with the Omicron settings.

Fig. 7. Evolution of the cumulative number of susceptible, infectious, and recovered individuals, as well as the daily number of newly confirmed infectious cases.

Fig. 8. The trade-off between location privacy and model performance for different synthetic location generation techniques.

Fig. 9. The infection prediction performance of Falcon and representative baselines under different trajectory reporting frequencies and infection state reporting rates.

Table 1. Key notations used in this paper.

Algorithm 1. Training algorithm of Falcon on the server side.

Table 2. Dataset information for the Basic and Larger scenarios.

Table 3. The values and corresponding information sources of the two terms of disease transmission parameters.

Table 4. Performance comparison of our model and baselines in two scenarios, where higher values represent better performance. Bold denotes the best results for each metric, and "-" denotes that the metric cannot be evaluated for this baseline method.

Table 5. Prediction performance of Falcon and representative baselines on the stages before and after the outbreak.

Table 6. Prediction performance and attack error rate w.r.t. different noise magnitudes.

the location with non-zero gradients. Moreover, with the mitigation of the macroscopic model, the performance decline is relatively acceptable with the increasing intensity of the perturbation mechanism.

Pseudo location synthesis method against localization attack: To measure how much privacy budget a plausible pseudo location generation algorithm can provide, we deploy a location inference attack [58] on the server side. We consider the server of the LBS provider as an adversary, who is capable of capturing the sequential location queries of a user and has prior knowledge of the population mobility profile. As we mentioned in Section 3.2.3, a user who queries the LBS at a time interval will upload a number of pseudo locations

Table 7. Results of the ablation study.