GAD-NR: Graph Anomaly Detection via Neighborhood Reconstruction

Graph Anomaly Detection (GAD) is a technique used to identify abnormal nodes within graphs, finding applications in network security, fraud detection, social media spam detection, and various other domains. A common method for GAD is Graph Auto-Encoders (GAEs), which encode graph data into node representations and identify anomalies by assessing the reconstruction quality of the graphs based on these representations. However, existing GAE models are primarily optimized for direct link reconstruction, resulting in nodes connected in the graph being clustered in the latent space. As a result, they excel at detecting cluster-type structural anomalies but struggle with more complex structural anomalies that do not conform to clusters. To address this limitation, we propose a novel solution called GAD-NR, a new variant of GAE that incorporates neighborhood reconstruction for graph anomaly detection. GAD-NR aims to reconstruct the entire neighborhood of a node, encompassing the local structure, self-attributes, and neighbor attributes, based on the corresponding node representation. By comparing the neighborhood reconstruction loss between anomalous nodes and normal nodes, GAD-NR can effectively detect all three types of anomalies studied in this work. Extensive experimentation conducted on six real-world datasets validates the effectiveness of GAD-NR, showcasing significant improvements (by up to 30% in AUC) over state-of-the-art competitors. The source code for GAD-NR is openly available. Importantly, the comparative analysis reveals that the existing methods perform well only in detecting one or two types of anomalies out of the three types studied. In contrast, GAD-NR excels at detecting all three types of anomalies across the datasets, demonstrating its comprehensive anomaly detection capabilities.

Unlike anomaly detection methods for tabular and time-series data, Graph Anomaly Detection (GAD) [3,49] poses additional challenges. Graph data is often multi-modal, containing information from both node/edge attributes and topological structures. This complexity makes it difficult to find a unified definition of anomalies for graph-structured data and to design a principled algorithm for detecting them.
Due to the inherent multi-modality of graph-structured data, anomalies on graphs can be grouped into three categories: contextual, structural, and joint-type, as illustrated in Fig. 1. Contextual anomalies refer to nodes whose attributes are vastly different from those of regular nodes, such as spammers or fake account holders in social media networks [29,32,80]. Structural anomalies refer to nodes with different connectivity patterns compared to other nodes, such as a group of malicious sellers exchanging fake reviews with super dense connections [73] or bots retweeting the same tweet forming a densely connected co-retweet network [21,28]. Joint-type anomalies are those that can only be identified by considering both attributes and connectivity patterns, such as a node that is sending a large number of phishing emails to users across different communities in an email network [35,54]. To identify all these types of anomalies, we need a powerful model to capture attribute information, connectivity patterns, and most importantly the correlation between them.
However, current GAD approaches [3,45,49] perform well at detecting only one or two of these anomaly types, but not all of them.

[Figure 1: An input graph with the three types of anomalies: structural, contextual, and joint-type.]

Some GAD approaches only leverage network structure, which cannot detect contextual anomalies. Examples include methods that check centrality measures or clustering coefficients [50,65], factorize the adjacency matrix [71], or perform network clustering [82]. Some approaches check the distribution of node features to detect anomalies [5,43], such as applying the nearest-neighbor algorithm to node features to detect nodes that are isolated from others. These approaches fail to detect anomalies other than contextual anomalies. Recently, autoencoders have been widely employed for anomaly detection [7,15,20,36,61]. The rationale is that autoencoders leverage neural networks to reduce the dimension of the data. Anomalies are often sparse in the data, so such a data compression process tends to record only the principal part of the data and automatically exclude sparse anomalies. Therefore, one can use the obtained compressed data representations to approximately reconstruct the normal data but not the anomalies. Monitoring the reconstruction loss can thus separate anomalies from normal data. For GAD, Graph Auto-Encoders (GAEs) have been proposed to leverage Graph Neural Networks (GNNs) [25,38,81] to encode both graph structure and node attributes, and have recently been used to detect anomalies on graphs [15,20,36].
However, current GAE-based methods [15,20,36] often adopt a strategy of reconstructing direct links between nodes based on their representations, which pushes nodes that are connected in the graph close to each other in the latent space. Such a proximity-driven loss for reconstructing graph structure may be effective for detecting structural anomalies that are inherently clustered together in the graph. However, it fails to detect joint-type anomalies that are not naturally clustered. Intuitively, detecting joint-type anomalies requires the entire neighborhood, because both which nodes are connected and the attributes of those neighboring nodes carry useful signals.
In this paper, we address this limitation and propose a novel framework, Graph Anomaly Detection via Neighborhood Reconstruction (GAD-NR). GAD-NR extends a recently proposed neighborhood-reconstruction-based GAE model, namely NWR-GAE [70], to address fundamental problems in GAD. Specifically, rather than using a proximity-driven loss to recover direct links, GAD-NR requires the dimension-reduced node representations to reconstruct the entire neighborhoods, i.e., the receptive fields that are encoded/compressed by GNNs into the node representations. Concretely, GAD-NR aims to reconstruct a node's own attributes, its connectivity pattern, and the attributes of its neighboring nodes. By checking the different types of reconstruction losses, GAD-NR can detect all three types of anomalies.
The key novelty of GAD-NR is that it is the first work that identifies neighborhood reconstruction as a powerful metric for GAD, which fundamentally differs from previous GAE models that adopt the metric of link reconstruction/prediction.
Moreover, GAD-NR also advances technical aspects of the backbone model NWR-GAE [70] for GAD tasks, which yields substantial improvements in stability, scalability, and accuracy. Specifically, GAD-NR adopts a Gaussian approximation of the neighbors' feature distributions, which not only substantially reduces the computational cost of NWR-GAE but also avoids learning an overly expressive model that risks overfitting the anomalous behaviors in the data. This non-trivial change makes NWR-GAE, originally proposed solely for dimension reduction, suitable for GAD tasks.
We extensively compare GAD-NR with state-of-the-art (SOTA) models on six real-world graph anomaly detection datasets that have been benchmarked recently [45]. GAD-NR outperforms all baselines significantly (by up to 30%↑ in AUC) on five of the six datasets under the settings of [45]. We also evaluate and demonstrate the capability of GAD-NR in detecting each of the three types of anomalies. Note that in real-world applications, the types of anomalies are often unknown. The significance of GAD-NR is that it detects the real-world anomalies across different datasets (in [45]) with one fixed hyperparameter configuration, which illustrates the robustness of GAD-NR. Further ablation studies also justify the effectiveness and computational efficiency of the Gaussian approximation adopted by GAD-NR when compared with NWR-GAE [70].
The contributions of this paper can be summarized as follows:
• We design a novel framework, GAD-NR, for graph anomaly detection. GAD-NR leverages the reconstruction loss of the entire neighborhood of a node from the node representation, which in principle can detect all three types of anomalies in Fig. 1.
• Technically, GAD-NR adopts a Gaussian approximation of the distribution of neighbors' representations and computes a closed-form KL divergence as the reconstruction loss, which substantially improves the scalability and effectiveness of the approach.
• Extensive experiments on six real-world networks demonstrate the effectiveness of GAD-NR compared to SOTA baselines, and the rationale behind its design choices.

RELATED WORKS
We put previous methods for GAD into three categories as follows.
Structure-only-based methods: Traditional graph anomaly detection focuses on detecting only structural anomalies. Many works in this category leverage spectral analysis of the adjacency matrix and its variants [31,51]. Recent methods define structural similarity measures for anomalies and then perform clustering for detection [56,82]. Statistical features computed from the graph structure, such as in/out degrees, total edge weights, the number of neighbors of a node, or dense subgraphs, can also be utilized for GAD [2,17,28]. However, these structure-based methods are only able to detect structural anomalies. They may detect some joint-type anomalies, but they tend to raise many false alarms because they miss the information from node attributes.
Traditional methods for GAD over attributed networks: In real-world applications, most graphs have node attributes (features). Nodes with inconsistent attributes have a high chance of being anomalous. Moreover, considering node attribute information along with structure helps to locate anomalies more accurately. Detecting anomalies in attributed networks can be achieved by clustering methods [9,59], interaction with human experts [16], and group merging techniques [90]. Network embedding methods [23,60,68] can also be applied to GAD on attributed graphs [6,8]. Network embeddings can be paired with anomaly detection techniques for tabular data, such as density-based approaches [5] and distance-based techniques [1,43], to find node anomalies on graphs. However, because these approaches process graph structure separately from node attributes, they often fail to capture the synergy between the two and may be suboptimal for GAD in some cases.
Deep learning based GAD approaches: Auto-encoder frameworks that extract principal components from the data via deep learning have been extensively applied in anomaly detection [7,15,20,36,48]. Applying traditional autoencoders to node attributes [61] can only detect contextual anomalies. GAEs built upon GNNs can properly combine node attributes and graph structure and can detect anomalies by checking the reconstruction loss of node attributes or links [15,20,36]. But these works do not reconstruct the entire neighborhood for GAD; rather, they use link or attribute reconstruction error, and estimating a Gaussian mixture density over the representations has also been applied for GAD [42]. Some works model nodes with multiple views, where a node may or may not be considered an anomaly depending on the view; to capture such multi-view information, multiple GNNs are often applied [47,57,66,78,79] for anomaly detection. GNNs have also been applied to detect anomalies at multiple scales [24], and to detect anomalies and solve recommendation tasks simultaneously [75,87]. More involved techniques such as self-supervised learning [13,30,33,46,83,89] and reinforcement learning [16,40,52] have also been recently applied to GAD.

NOTATIONS AND PROBLEM FORMULATION
In this work, we focus on detecting anomalous nodes over attributed static graphs. An attributed graph $G = (\mathcal{V}, \mathcal{E}, X) \in \mathbb{G}$ consists of a vertex set $\mathcal{V} = \{1, 2, \dots, N\}$ and an edge set $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. The matrix $X \in \mathbb{R}^{N \times d}$ collects all node attributes, and $x_v \in \mathbb{R}^d$ is the attribute of node $v$. The degree of node $v$ is denoted $d_v$. This work focuses on unsupervised anomaly detection. Each node $v$ has an anomaly label $y_v$, where $y_v = 0$ or $y_v = 1$ implies node $v$ is normal or anomalous, respectively. The goal is to design a detection method $f(\cdot): \mathbb{G} \to \{0,1\}^N$ that associates each node with a label. However, these node labels are assumed to be unknown when designing $f$.
Let $\mathcal{N}_v$ be the set of 1-hop neighbors of node $v$. Let $\tilde{\mathcal{N}}_v$ be the augmented 1-hop neighborhood of node $v$ that includes the attribute of node $v$ and the set of attributes of its neighbors, i.e., $\tilde{\mathcal{N}}_v \triangleq (x_v, \{x_u \mid u \in \mathcal{N}_v\})$. Our assumption for detecting anomalous nodes is that, given the label $y_v$, the distributions $\mathbb{P}(\tilde{\mathcal{N}}_v \mid y_v)$ differ between normal nodes and anomalies. Here, we consider just the one-hop neighborhood as a proof of concept, which is often adequate for use cases in practice [4]. The considered neighborhood can be extended to the multi-hop case, at extra computational cost.
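As a small illustration (our own sketch, not the paper's code), the augmented neighborhood can be materialized from a feature matrix and an adjacency list; the names `X` and `neighbors` are hypothetical:

```python
import numpy as np

def augmented_neighborhood(X, neighbors, v):
    """Return the augmented 1-hop neighborhood of node v: its own
    attribute together with the multiset of its neighbors' attributes."""
    return X[v], [X[u] for u in neighbors[v]]

# Tiny example: 3 nodes with 2-dimensional attributes.
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0]}
x_v, nbr_attrs = augmented_neighborhood(X, neighbors, 0)
```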

METHODOLOGY
In this section, we first provide the motivation of our method by narrating the potential drawbacks of previous graph auto-encoder methods. Then, we introduce GAD-NR, which is based on neighborhood reconstruction.

Motivations
AutoEncoder (AE) is an easy-to-use and effective framework for anomaly detection. The motivation of AE is to perform dimension reduction by compressing the high-dimensional input data into a low-dimensional latent representation [27] via an encoder and reconstructing the original input with the help of a decoder. AE can be used for anomaly detection because such dimension reduction is expected to capture the principal properties of the data, which mostly correspond to the normal data points. The data points that cannot be properly reconstructed via the decoder, i.e., those with larger reconstruction losses, tend to be anomalies.
Graph AutoEncoder (GAE) performs dimension reduction of graph data via a Graph Neural Network (GNN) as the encoder [37]. Given a graph $G = (\mathcal{V}, \mathcal{E})$, GAE encodes the graph data into node representations $\{h_v \mid v \in \mathcal{V}\}$. The decoder of current GAE methods reconstructs the graph structure and node attributes from these node representations. Graph-structure reconstruction typically relies on a mapping from the representations of two nodes to 0 or 1 indicating whether there is an edge between them [15,20], e.g., comparing $h_v^\top h_u$ with some threshold $\epsilon$ to reconstruct the edge. However, this procedure can only preserve proximity information of nodes in the graph, i.e., it pushes node representations close together if the corresponding nodes are directly connected, which may miss information useful for detecting anomalies. Moreover, by checking the reconstruction loss of such a decoder, one may only tell whether an edge is an anomaly. To detect node anomalies, which are often more useful in practice, one needs to aggregate the reconstruction losses of edges to the node level, and how to properly aggregate these losses is not a trivial problem by itself and often depends on heuristics.
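To make the proximity-driven decoding concrete, here is a minimal numpy sketch (our own illustration, not the authors' implementation) of an inner-product edge decoder and one heuristic way to aggregate edge-level errors to the node level:

```python
import numpy as np

def link_reconstruction_errors(H, edges, threshold=0.5):
    """Score each edge (u, v) by whether sigmoid(h_u . h_v) exceeds a
    threshold, as in proximity-driven GAE decoders. Returns per-edge
    reconstruction errors (0 = edge reconstructed, 1 = missed)."""
    errors = []
    for u, v in edges:
        prob = 1.0 / (1.0 + np.exp(-H[u] @ H[v]))  # sigmoid of inner product
        errors.append(0.0 if prob >= threshold else 1.0)
    return np.array(errors)

def node_level_anomaly_score(errors, edges, num_nodes):
    """Heuristic aggregation of edge errors to the node level -- the
    non-trivial step the text refers to (here: mean error over incident edges)."""
    total = np.zeros(num_nodes)
    count = np.zeros(num_nodes)
    for err, (u, v) in zip(errors, edges):
        total[u] += err; count[u] += 1
        total[v] += err; count[v] += 1
    return total / np.maximum(count, 1)  # avoid division by zero for isolated nodes
```

The averaging rule here is only one of many possible heuristics (max or sum over incident edges are equally plausible), which is precisely the ambiguity the text points out.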

GAE via Neighborhood Reconstruction
Our strategy to overcome the drawback of traditional GAEs lies in the first-principle idea of autoencoders. Autoencoders aim to perform dimension reduction of the data with the least loss when recovering the original data. A GAE encodes each node's attributes and the attributes of the nodes in its one-or-several-hop neighborhood into a node representation. Therefore, the node representation should be able to reconstruct the neighborhood and the node's own attributes with the least loss. This idea leads to the design of the GAE in this work. The model architecture is illustrated in Fig. 2 and its pseudocode is described in Algorithm 1.

4.2.1 The encoder. The encoder $\Phi(\cdot)$ follows the common pipeline of message-passing GNNs [22], e.g., GCN [38] or GraphSAGE [25]. A GNN iteratively aggregates the representations from the neighbors and combines them with the node's own representation to update it. Specifically, let $h_v^{(k)}$ denote the representation of node $v$ after the $k$-th layer:

$h_v^{(k)} = \text{UPDATE}\big(h_v^{(k-1)}, \text{AGG}(\{h_u^{(k-1)} \mid u \in \mathcal{N}_v\})\big). \quad (1)$

The AGG function aggregates messages from the neighbors and the UPDATE function updates the node representations. Note that in practice, if the node attribute $x_v$ is extremely high-dimensional and sparse, a random linear projection is used to encode it into a dense low-dimensional initial representation $h_v^{(0)}$.
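A minimal numpy sketch of one AGG/UPDATE round with mean aggregation and a linear-plus-ReLU update (an illustrative choice on our part; the paper's encoder may use GCN or GraphSAGE aggregation instead):

```python
import numpy as np

def message_passing_layer(H, neighbors, W_self, W_agg):
    """One round of the AGG + UPDATE scheme of Eq. (1):
    mean-aggregate neighbor representations (AGG), then combine with
    the node's own representation via a linear map + ReLU (UPDATE).
    `neighbors[v]` lists the 1-hop neighbors of node v."""
    H_new = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        if neighbors[v]:
            agg = np.mean([H[u] for u in neighbors[v]], axis=0)  # AGG
        else:
            agg = np.zeros(H.shape[1])  # isolated node: nothing to aggregate
        H_new[v] = np.maximum(H[v] @ W_self + agg @ W_agg, 0.0)  # UPDATE
    return H_new
```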

4.2.2 The decoder. Our decoder is designed based on the first principle of designing an autoencoder [27]. We are supposed to reverse the procedure of Eq. (1) by using $h_v^{(k)}$ to reconstruct all the information within the $k$-hop neighborhood of $v$. In practice, this is computationally heavy for large $k$. In this work, we focus on reconstructing the information within just the one-hop neighborhood as a proof of concept, which we find is practically sufficient for GAD.

[Algorithm 1: GAD-NR — Graph Anomaly Detection via Neighborhood Reconstruction. Input: graph $G(\mathcal{V}, \mathcal{E})$, input features $X$, anomaly labels $Y$. The encoder produces node representations; the decoders reconstruct self-attributes, degrees, and neighbor representation distributions; a weighted loss function combines the reconstruction losses.]

The information within the one-hop neighborhood $\tilde{\mathcal{N}}_v$ consists of the attributes of the center node, its degree, and the attributes of its neighbors. Self-attribute reconstruction: we reconstruct the center node's initial representation $h_v^{(0)}$ from $h_v^{(k)}$ by a multi-layer perceptron (MLP), yielding $\hat{h}_v^{(0)}$. The self-reconstruction loss for node $v$ is then

$\mathcal{L}_v^x = \mathcal{D}(h_v^{(0)}, \hat{h}_v^{(0)}), \quad (2)$

where $\mathcal{D}(\cdot, \cdot)$ is a distance function, such as the L2 distance, that measures the discrepancy between the original attributes and the reconstructed attributes.
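For illustration, the self-reconstruction loss of Eq. (2) with a two-layer MLP decoder and a squared L2 distance can be sketched as follows (the weight names `W1, b1, W2, b2` are hypothetical parameters, not from the paper):

```python
import numpy as np

def self_reconstruction_loss(x_v, h_v, W1, b1, W2, b2):
    """Decode the center node's attribute from its representation h_v
    with a 2-layer MLP and measure the squared L2 discrepancy (Eq. (2))."""
    hidden = np.maximum(h_v @ W1 + b1, 0.0)   # ReLU hidden layer
    x_hat = hidden @ W2 + b2                  # reconstructed attribute
    return float(np.sum((x_v - x_hat) ** 2))  # D = squared L2 distance
```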
Neighborhood reconstruction: It is far from trivial to decode the set $\mathcal{H}_v = \{h_u^{(0)} \mid u \in \mathcal{N}_v\}$ from the compressed $h_v^{(k)}$. The difficulty comes from two aspects. First, the size of the set varies across nodes $v \in \mathcal{V}$. Second, the elements of the set have no order. Using an MLP to decode a set of variable size from $h_v^{(k)}$ is impossible. Our idea is inspired by the recent work NWR-GAE [70]. We regard the set $\mathcal{H}_v$ as $d_v$ many i.i.d. samples from a distribution $\mathbb{P}_v$. In fact, the empirical distribution of the elements of $\mathcal{H}_v$ is $\hat{\mathbb{P}}_v^{\text{emp}}(h) = \frac{1}{d_v} \sum_{u \in \mathcal{N}_v} \delta(h - h_u^{(0)})$, where $\delta(\cdot)$ is the Dirac delta function. In this sense, we can decompose the neighborhood information into two parts, namely the number of neighbors (i.e., the node degree) $d_v$ and the distribution of neighbor representations $\mathbb{P}_v$. The reconstruction procedure should reconstruct both parts of the information properly.
Node degree reconstruction. To reconstruct the node degree $d_v$, we use another MLP: $\hat{d}_v = \text{MLP}_d(h_v^{(k)})$. The node degree reconstruction loss for node $v$ is then

$\mathcal{L}_v^d = \mathcal{D}(d_v, \hat{d}_v). \quad (3)$

Here, we simply use the $\ell_2$ loss as the metric $\mathcal{D}$, though as node degrees are non-negative integers, one could also model them with discrete distributions such as the Poisson distribution.
Neighbors' representation distribution reconstruction. To reconstruct the distribution $\mathbb{P}_v$, we learn a mapping from the node representation $h_v^{(k)}$ to an estimate $\hat{\mathbb{P}}_v$ of the distribution. As we do not know $\mathbb{P}_v$ at the population level, the first direct idea is to reconstruct the empirical distribution $\hat{\mathbb{P}}_v^{\text{emp}}$ by following NWR-GAE [70], which adopts the Wasserstein distance between $\hat{\mathbb{P}}_v$ and $\hat{\mathbb{P}}_v^{\text{emp}}$ as the reconstruction loss. However, computing such a loss has a huge overhead, because it needs to solve a matching problem via the Hungarian algorithm [34], which has complexity $O(d_v^3)$. Moreover, we empirically observe that reconstructing such an empirical distribution is likely to overfit the anomalies, which actually harms anomaly detection.
Therefore, we propose to reconstruct a multivariate Gaussian approximation of $\mathbb{P}_v$. Specifically, we estimate the mean and covariance matrix of the neighbors' representations as

$\mu_v = \frac{1}{d_v} \sum_{u \in \mathcal{N}_v} h_u^{(0)}, \quad \Sigma_v = \frac{1}{d_v} \sum_{u \in \mathcal{N}_v} (h_u^{(0)} - \mu_v)(h_u^{(0)} - \mu_v)^\top. \quad (4)$

Then, we map $h_v^{(k)}$ through two MLPs $\psi_\mu(\cdot)$ and $\psi_\sigma(\cdot)$ and draw reconstructed samples $\hat{h}_i$ via reparameterization, where each entry of $\psi_\sigma(h_v^{(k)})$ is non-negative (an MLP followed by $\exp(\cdot)$). We then estimate the mean and covariance of the reconstructed neighbor features as

$\hat{\mu}_v = \frac{1}{n} \sum_{i=1}^{n} \hat{h}_i, \quad \hat{\Sigma}_v = \frac{1}{n} \sum_{i=1}^{n} (\hat{h}_i - \hat{\mu}_v)(\hat{h}_i - \hat{\mu}_v)^\top. \quad (5)$

Given the two sets of parameters $(\mu_v, \Sigma_v)$ and $(\hat{\mu}_v, \hat{\Sigma}_v)$ of multivariate Gaussian distributions, we adopt the KL divergence between the two distributions as the reconstruction loss:

$\mathcal{L}_v^n = \frac{1}{2} \Big( \log \frac{|\hat{\Sigma}_v|}{|\Sigma_v|} - m + \text{tr}(\hat{\Sigma}_v^{-1} \Sigma_v) + (\hat{\mu}_v - \mu_v)^\top \hat{\Sigma}_v^{-1} (\hat{\mu}_v - \mu_v) \Big), \quad (6)$

where $m$ is the dimension of the representations. Note that $\mu_v$ and $\Sigma_v$ should not allow gradients to pass through, as they provide the supervision signal. In practice, we may encounter the case where $d_v$ is smaller than $m$, which makes $\Sigma_v$ in Eq. (4) rank-deficient and causes numerical problems. Therefore, we add a scaled identity matrix to the covariance matrices, $\Sigma_v \leftarrow \Sigma_v + \epsilon I$ and $\hat{\Sigma}_v \leftarrow \hat{\Sigma}_v + \epsilon I$ for some constant $\epsilon > 0$, before computing Eq. (6).
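The closed-form pieces above can be sketched in a few lines of numpy; this illustrates Eqs. (4) and (6) rather than reproducing the authors' code:

```python
import numpy as np

def neighborhood_stats(H_nbrs):
    """Empirical mean and covariance of a node's neighbor
    representations, as in Eq. (4). H_nbrs has shape (d_v, m)."""
    mu = H_nbrs.mean(axis=0)
    centered = H_nbrs - mu
    cov = centered.T @ centered / H_nbrs.shape[0]
    return mu, cov

def gaussian_kl(mu_p, cov_p, mu_q, cov_q, eps=1e-3):
    """Closed-form KL( N(mu_p, cov_p) || N(mu_q, cov_q) ), with the
    identity regularization cov <- cov + eps*I described in the text
    to guard against rank-deficient covariances."""
    m = mu_p.shape[0]
    cov_p = cov_p + eps * np.eye(m)
    cov_q = cov_q + eps * np.eye(m)
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    kl = 0.5 * (np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p))
                - m
                + np.trace(cov_q_inv @ cov_p)
                + diff @ cov_q_inv @ diff)
    return float(kl)
```

The loss is zero when the reconstructed Gaussian matches the empirical one and grows as the two drift apart, which is exactly the supervision signal used here.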
Note that the complexity of the above computation, including Eqs. (4), (5), and (6), is $O(d_v)$ per node, which significantly reduces the complexity of the pipeline in [70]. The remaining challenge is that, since node degrees vary across nodes, the computation of Eq. (4) is irregular. For this, we extend the implementation adopted in principal neighborhood aggregation [14] to compute Eq. (4) efficiently in parallel across different nodes.
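One common way to handle this irregular, variable-degree computation in parallel is scatter-style segment aggregation; a numpy sketch of the idea (our illustration, not the paper's implementation):

```python
import numpy as np

def segment_mean(values, segment_ids, num_segments):
    """Per-node neighbor means computed in one vectorized pass despite
    irregular degrees: values[i] (a neighbor representation) belongs
    to the node given by segment_ids[i]."""
    sums = np.zeros((num_segments, values.shape[1]))
    counts = np.zeros(num_segments)
    np.add.at(sums, segment_ids, values)   # scatter-add rows per node
    np.add.at(counts, segment_ids, 1)      # per-node degree
    counts = np.maximum(counts, 1)         # guard isolated nodes
    return sums / counts[:, None]
```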

4.2.3 The overall reconstruction loss. The overall reconstruction loss combines the losses for reconstructing node self-attributes in Eq. (2), node degrees in Eq. (3), and neighbors' representation distributions in Eq. (6):

$\mathcal{L}_v = \lambda_x \mathcal{L}_v^x + \lambda_d \mathcal{L}_v^d + \lambda_n \mathcal{L}_v^n, \quad (7)$

where $\lambda_x$, $\lambda_d$, and $\lambda_n$ are hyperparameters that control the weights of the different types of reconstruction losses.

Anomaly Detection
We may adopt $\mathcal{L}_v$ in Eq. (7) as the score characterizing how anomalous each node $v$ is. A greater score means the encoded information is harder to reconstruct, and thus the corresponding node is more likely to be an anomaly. We may also adopt different hyperparameters $\lambda'_x$, $\lambda'_d$, and $\lambda'_n$ if we have different confidence in, or prior knowledge about, the type of anomaly to be detected. For example, to target contextual anomalies, we can increase $\lambda'_x$. To encode such flexibility, we define the anomaly score $\hat{y}_v$ in the general form

$\hat{y}_v = \lambda'_x \mathcal{L}_v^x + \lambda'_d \mathcal{L}_v^d + \lambda'_n \mathcal{L}_v^n, \quad (8)$

whose ranking tells which nodes are more likely to be anomalies.
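Combining the per-node losses into the score of Eq. (8) and ranking nodes is straightforward; a sketch with purely illustrative weight values:

```python
import numpy as np

def anomaly_scores(loss_x, loss_d, loss_n, w_x=1.0, w_d=1.0, w_n=1.0):
    """Weighted combination of the three per-node reconstruction losses
    (self-attribute, degree, neighbor distribution), as in Eq. (8);
    higher score = more anomalous. The default weights are placeholders."""
    return (w_x * np.asarray(loss_x)
            + w_d * np.asarray(loss_d)
            + w_n * np.asarray(loss_n))

def rank_anomalies(scores):
    """Node indices ordered from most to least anomalous."""
    return np.argsort(-np.asarray(scores))
```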
Although different weights here emphasize the detection of different types of anomalies, in Sec. 5 we show that GAD-NR is robust to the selection of these weights: a fixed choice of weights is sufficient to outperform the baselines in detecting real-world anomalies across different datasets.

Improvements over NWR-GAE
Here, we provide a more direct explanation of how GAD-NR advances the idea of neighborhood reconstruction (previously proposed in NWR-GAE [70] for the purpose of dimension reduction rather than GAD) to better fit GAD tasks. NWR-GAE is built upon an optimal transport loss and needs to run a complicated Hungarian matching algorithm [34] for each node to reconstruct its neighbors' attributes when computing the loss function. The complexity is $O(d^3)$ for a node of degree $d$. GAD-NR regards the representations of neighbors' attributes as samples from a Gaussian distribution and adopts the KL divergence [12] between Gaussian distributions as the reconstruction loss, which has a closed form and complexity $O(d)$. This approximation is crucial for GAD tasks. NWR-GAE did not adopt such an approximation because it aims to perform dimension reduction: achieving a low reconstruction error is its ultimate goal, so NWR-GAE must be sufficiently expressive to let low-dimensional representations recover high-dimensional data. GAD tasks, however, have different goals.
A model for GAD should not be too expressive; otherwise it risks overfitting the anomalies. GAD-NR adopts the right trade-off: the Gaussian approximation (checking just the first and second moments of the distributions) not only improves anomaly detection accuracy but also substantially reduces computational complexity. Moreover, NWR-GAE also supports reconstructing multi-hop neighborhoods. However, we found that multi-hop reconstruction did not yield obvious improvement for GAD while introducing considerable computational overhead, so GAD-NR only considers the first hop in practice.

EXPERIMENT
In this section, we extensively compare GAD-NR with several baseline methods for graph anomaly detection. Specifically, we aim to answer the following questions:
• How does neighborhood reconstruction facilitate the performance improvement of GAD-NR for GAD?
• Which parts of GAD-NR are important for detecting the different types of anomalies?
• How do important hyperparameters, such as the size of the hidden representations and the weights of the different reconstruction losses, affect the performance of GAD-NR?
• How does the adopted Gaussian approximation of the neighborhood feature distribution improve the running-time efficiency of GAD-NR?

Experimental Settings
Our first experimental setting follows the benchmark paper BOND [45]. Note that among the datasets, Weibo, Reddit, Disney, Books, and Enron have real-world anomaly labels. The Cora dataset has no real benchmark anomaly labels, so we follow BOND [45], where the union of contextual and structural anomalies is used as the anomaly labels for evaluation on Cora. The results are reported in Table 2. We call this setting benchmark anomaly detection. Moreover, we want to study the performance of detecting each type of anomaly separately, so for every dataset, including those with real-world labels, we also inject contextual, structural, and joint-type anomalies for evaluation, which gives the later results in Table 3. Due to the page limit, we present contextual anomaly detection and the merged structural plus joint-type anomaly detection results in this work. Contextual anomalies are nodes whose attributes are significantly different from their neighboring nodes. Hence, to generate this type of anomaly for a target node $v$, its feature $x_v$ is replaced with the feature $x_u$ of the randomly sampled candidate node $u$ that has the largest Euclidean distance from $x_v$. Structural anomalies are nodes that are densely connected, in contrast to sparsely connected normal nodes. To inject structural outliers, we pick $m$ nodes at random and make them fully connected, and this process is repeated $n$ times to generate $n$ such cliques of size $m$. Following the BOND paper, we approximately set the clique size and the number of cliques to twice the average degree for most datasets. To add joint-type anomalies in different datasets, we choose a set of nodes at random as anomalies and connect each of them to a number of randomly sampled other nodes. Those anomalous nodes can thus be seen as high-degree nodes connected to neighbors with different types of features. We utilized
the PyGOD library [45] for the injection of contextual and structural anomalies and for running the baseline anomaly detection models.
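The injection procedures described above can be sketched as follows; this is our own simplified version for illustration, while the experiments use PyGOD's implementations:

```python
import numpy as np

def inject_structural_anomalies(adj, rng, clique_size, num_cliques):
    """Pick `clique_size` random nodes, fully connect them, and repeat
    `num_cliques` times. Returns the modified adjacency matrix and the
    indices of the injected structural anomalies."""
    adj = adj.copy()
    anomalies = set()
    n = adj.shape[0]
    for _ in range(num_cliques):
        members = rng.choice(n, size=clique_size, replace=False)
        for i in members:
            for j in members:
                if i != j:
                    adj[i, j] = 1  # dense clique among the chosen nodes
        anomalies.update(members.tolist())
    return adj, sorted(anomalies)

def inject_contextual_anomaly(X, v, candidates):
    """Replace node v's feature with the candidate feature farthest
    from it in Euclidean distance."""
    dists = np.linalg.norm(X[candidates] - X[v], axis=1)
    X = X.copy()
    X[v] = X[candidates[int(np.argmax(dists))]]
    return X
```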
Hyperparameter Tuning: In practice, we often do not know the anomaly labels, so we cannot use them to tune the hyperparameters. Typically, hyperparameters are chosen based on expert experience, and a good model should be robust under such a choice. Hence, we fix GAD-NR's encoder as a GCN with hidden dimension 16 (except for Cora, where 128 is used) and fix the three decoder loss weights as 0.8, 0.5, and 0.001; we run the experiments on all datasets five times and report the averaged performance with standard deviation. We compare GAD-NR obtained via this fixed hyperparameter setting with the averaged performance of the baselines proposed in the benchmark work [45]. Our experiments demonstrate that GAD-NR can outperform the baselines with a set of hyperparameters that is not sensitive to the datasets, which makes the most practical sense.
Hardware: All the experiments are performed on a Linux server with a 2.99GHz AMD EPYC 7313 16-Core Processor and 1 NVIDIA A10 GPU with 24GB memory.

Evaluation Metric
We adopt the area under the ROC curve (AUC) as the evaluation metric. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. In the experiments, we regard anomalous nodes as the positive class and compute the AUC accordingly. An AUC of 1 means the model makes perfect predictions, while an AUC of 0.5 means the model has no distinguishing ability. AUC is better suited than accuracy for evaluating anomaly detection, since it is not sensitive to the imbalanced class distribution of the data. In Table 2, we show the results of benchmark anomaly detection with the baseline models. In Table 3, we present the results of injected contextual, structural, and joint-type anomaly detection. From the results, we observe that GAD-NR outperforms the baseline methods on most datasets in detecting benchmark anomaly labels and structural plus joint-type anomalies, and on all datasets in contextual anomaly detection.
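The AUC can be computed directly from its rank interpretation; a small self-contained sketch of this metric:

```python
import numpy as np

def auc_score(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen anomaly
    (label 1) receives a higher score than a randomly chosen normal
    node (label 0), counting ties as 0.5."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # anomaly scored higher
    ties = (pos[:, None] == neg[None, :]).sum()  # equal scores
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

This pairwise form makes the class-imbalance robustness mentioned above apparent: the metric depends only on the relative ordering of anomalies versus normal nodes, not on their proportions.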
The key reason behind the performance improvement can be attributed to the reconstruction of the entire neighborhood around a target node, which includes self-feature reconstruction, degree reconstruction, and neighbor-feature distribution reconstruction. Feature-based models like MLPAE perform quite well at detecting contextual anomalies, specifically on the Cora dataset, as they emphasize self-feature reconstruction. However, MLPAE performs worse at detecting joint-type and structural anomalies because it ignores graph structure. Approaches that only consider structural information, such as SCAN, perform exceedingly well at detecting structural plus joint-type anomalies but do poorly at contextual anomaly detection. For the GAE-based models, the performance is more competitive, though still worse than GAD-NR in Table 2 and Table 3.

5.4.2 Impacts of different types of reconstruction losses. In Table 2 and Table 3, along with the performance of GAD-NR with all three types of reconstruction loss, we also show results obtained by removing each part of the loss function in GAD-NR's decoder, i.e., setting $\lambda_x = 0$, $\lambda_d = 0$, or $\lambda_n = 0$ in Eq. (7). From the results, it is clearly visible that without the neighborhood reconstruction part ($\lambda_n = 0$), the performance of GAD-NR drops the most on both types of anomaly detection. From the results of Table 3, we observe that without the self-feature reconstruction loss ($\lambda_x = 0$), the performance of GAD-NR drops heavily when detecting contextual anomalies; when detecting structural plus joint-type anomalies, the performance decay is moderate, which matches expectations. Removing the degree reconstruction loss ($\lambda_d = 0$) also causes some performance decay, though it is less severe than removing self-feature reconstruction for contextual anomalies or removing neighbor feature distribution reconstruction for structural and joint-type anomalies.

5.4.3 Comparison with NWR-GAE [70]. We compare GAD-NR with NWR-GAE [70] in two aspects: performance and running time. For performance, we report the benchmark anomaly detection comparison between NWR-GAE and GAD-NR in Table 4. We observe that GAD-NR performs significantly better than NWR-GAE on all six datasets. NWR-GAE directly tries to match the empirical distribution of neighbor representations, by which it may capture neighbors' features more precisely (though at greater cost), but it tends to overfit anomalous behaviors compared to the Gaussian approximation that GAD-NR adopts. For the running time comparison, we also report results for NWR-GAE and GAD-NR. Optimizing the KL divergence reduces the running-time complexity to $O(d)$, from the $O(d^3)$ of the Hungarian-algorithm-based neighborhood matching. Therefore, GAD-NR is far more scalable than NWR-GAE for detecting anomalies on relatively large graph datasets.

Impacts of tuning $\lambda'_x$, $\lambda'_d$, and $\lambda'_n$. We present the trend of GAD-NR's performance on different types of anomaly detection when varying the weights in Eq. (8), shown in Figure 3. When increasing the weight for self-feature reconstruction $\lambda'_x$ (Figure 3, left top), the performance curve for contextual anomaly detection (blue) rises steeply. Similarly, in Figure 3, left bottom, a growing trend is observed for both the contextual and joint-type anomaly detection performance curves (blue and green). The reason is intuitive: with higher weights for self-feature reconstruction, the decoder of GAD-NR assigns a higher penalty to contextual as well as joint-type anomalies. When varying the weight for degree reconstruction $\lambda'_d$ (Figure 3, middle column), the change in performance is not significant across the different anomaly types. This is because contextual and structural anomalies do not change node degrees much. For joint-type anomalies, where node degrees may provide useful signals, checking node degrees alone is often insufficient to determine an anomaly, because a normal node can also have a high degree; node degree reconstruction should be paired with neighbor feature distribution reconstruction to provide effective anomaly detection. Lastly, in Figure 3, right column, when we vary the weight $\lambda'_n$ for neighborhood reconstruction, we notice a significant performance gain in joint-type and structural plus joint-type anomaly detection, which demonstrates the effectiveness of neighborhood reconstruction in leveraging signals from the neighborhoods.

CONCLUSION
In this study, we introduce GAD-NR, a method for identifying anomalous nodes in graphs. GAD-NR is based on a graph autoencoder that reconstructs the neighborhood information from node representations generated by a GNN encoder. The reconstruction encompasses self-feature reconstruction, degree reconstruction, and the distribution of neighboring node representations, thus allowing the detection of various anomalies including contextual, structural, and joint-type anomalies. Experimental results on six real-world datasets demonstrate the effectiveness of neighborhood reconstruction in identifying different types of anomalies. GAD-NR outperforms state-of-the-art GAD baselines on five of the six datasets in benchmark evaluations. Additionally, GAD-NR provides flexibility for detecting different types of anomalies by combining the different reconstruction losses with varying weights. GAD-NR is also robust to the selection of these weights when detecting real-world anomalies.

APPENDIX 7.1 Baseline Description
• LOF [5]: Local Outlier Factor (LOF) computes how isolated a node is compared to its neighborhood. It uses only the node features, and the neighborhood is computed using k-nearest neighbors.
• IF [43]: Isolation Forest is an ensemble of base trees for anomaly detection, where the closeness of individual instances to the roots of the trees defines the decision boundary.
• MLPAE [61]: MLPAE uses a multi-layer perceptron as both encoder and decoder to reconstruct the node features; the reconstruction loss of the node features is used as the anomaly score of a node.
• SCAN [82]: Structural Clustering Algorithm for Networks (SCAN) uses only the structure of the graph to detect clusters, which are regarded as the structural anomalies of the graph.
• Radar [41]: Radar takes both structure and node attribute information as input and detects anomalous nodes that differ from the majority in terms of attribute residual and network coherence; the residual reconstruction norm is used as the anomaly score.
• GCNAE [36]: GCNAE employs an encoder-decoder architecture where the encoder is a GCN that uses node attribute and structure information to find a latent representation, and the decoder employs two GCNs to reconstruct the node attributes and the graph structure. The reconstruction error of the decoder is used as the anomaly score.
• DOMINANT [15]: DOMINANT also follows the encoder-decoder architecture: the encoder is a two-layer GCN, the node attribute decoder is a two-layer GCN, and the adjacency information is decoded using the dot product. The anomaly score is defined as a combination of both decoders' reconstruction errors.
• DONE [7]: DONE uses MLPs for the encoder and decoder. The node embeddings and anomaly scores are simultaneously optimized with a unified loss function.
• AnomalyDAE [20]: AnomalyDAE uses a structure AE and an attribute AE to detect anomalous nodes, where the structure AE takes both the adjacency matrix and the node attributes as input, and the attribute decoder uses structure and attribute embeddings to reconstruct the node attributes.
• GAAN [11]: GAAN is a GAN-based anomaly detection method that generates fake graphs using an MLP and encodes graph information using another MLP. The discriminator is trained to decide whether a graph is fake or real. The real-node detection confidence and the node attribute reconstruction are used as the anomaly score.
• GUIDE [86]: GUIDE preprocesses the structure information, using node motif degrees to represent the structure vector; the rest of the architecture is the same as DONE.
• CONAD [83]: CONAD generates augmented graphs to impose prior knowledge of anomalous nodes, and a contrastive learning loss is used for optimization.
• NWR-GAE [70]: The Neighborhood Reconstruction-based Graph Auto-Encoder performs neighborhood matching with an optimal transport loss, along with self-feature and degree reconstruction, for node classification and role identification tasks.
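To make the density intuition behind LOF concrete, here is a toy k-NN sparsity score (our simplified stand-in, assuming numpy; the actual LOF formula additionally uses reachability distances and local reachability densities):

```python
import numpy as np

def knn_sparsity_scores(X, k=2):
    """Average distance to the k nearest neighbors, normalized by the
    dataset mean: isolated points get scores well above 1."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                    # ignore self-distances
    knn_dist = np.sort(D, axis=1)[:, :k].mean(axis=1)
    return knn_dist / knn_dist.mean()

# Four clustered points and one far-away point: the isolated point
# (index 4) receives the largest score.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
scores = knn_sparsity_scores(X)
```

Note that such feature-only baselines are blind to graph structure, which is why purely structural methods like SCAN and attribute-plus-structure methods are also included in the comparison.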

Dataset Description
• Cora [64]: Cora is a citation network where the nodes are research papers and the edges represent citations between papers. The node attributes are bag-of-words vectors from the papers' content.
• Weibo [88]: Weibo is a directed user-user interaction graph based on common hashtags from Tencent-Weibo (a Twitter-like platform in China). Suspicious users are marked as anomalies based on the temporal information of their posts. The node attributes of each user are based on the location information of their posts and bag-of-words representations of the posts' content.
• Reddit [39,77]: Reddit is a user-subreddit interaction network from the social media platform Reddit. Users who are banned from a subreddit are marked as anomalous. The posts of users and subreddits are converted into feature vectors representing LIWC categories [58], and the summation of these vectors forms the attributes of users and subreddits.
• Disney [53] and Books [62]: Disney and Books are co-purchase networks of movies and books. The anomaly labels of Disney were obtained by majority voting among school students, whereas the Books dataset's anomaly labels are determined by the amazonfail tag information. Both datasets' features contain information about price, number of reviews, and ratings.

Figure 1: Contextual anomalies are feature-wise different, structural anomalies form dense subgraphs in the network, and joint-type anomalies connect with many nodes with different features.

Figure 2: Model architecture of GAD-NR. The encoder (left) performs dimension reduction with an MLP followed by a message-passing GNN to obtain the hidden representation of a node. The decoder (right) reconstructs the self features and the node degree via MLPs and estimates the neighbor feature distribution with an MLP-predicted re-parameterized Gaussian distribution. Reconstructions of the self features and the node degree are optimized with an MSE loss, whereas the KL-divergence between the ground-truth and the learned neighbors' feature distribution is used to optimize the distribution estimation.
and the set of attributes of the direct neighbors of the center, H_v = {h_u^(0) | u ∈ N_v}. Self reconstruction: In order to reconstruct the attributes of the center node, we design a simple decoder that takes h_v^(L) as input and reconstructs ĥ_v^(0). Neighborhood reconstruction: We approximate the empirical distribution of the neighbors' features with a multi-variate Gaussian distribution P_v through the following procedure. We sample m neighborhood features ĥ_1, ..., ĥ_m by transforming samples z_1, ..., z_m from the distribution N(μ_v, Σ_v) via a fully-connected neural network (FNN). Here the parameters μ_v, Σ_v are determined by μ_v = FNN_μ(h_v^(L))
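The sampling step above can be sketched as follows (a hedged illustration: plain linear maps stand in for the FNNs, and the diagonal covariance and random weight matrices are our assumptions, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbor_features(h_v, m, W_mu, W_sigma, W_out):
    """Re-parameterized sampling of m reconstructed neighbor features
    from a Gaussian whose parameters are predicted from h_v."""
    mu = W_mu @ h_v                          # stand-in for FNN_mu(h_v^(L))
    sigma = np.exp(W_sigma @ h_v)            # diagonal std, kept positive via exp
    z = mu + sigma * rng.standard_normal((m, mu.shape[0]))  # z ~ N(mu, Sigma)
    return z @ W_out.T                       # stand-in FNN to feature space

# 4-dim node representation, 5 samples, 6-dim reconstructed neighbor features.
h_v = rng.standard_normal(4)
W_mu, W_sigma, W_out = (rng.standard_normal((4, 4)),
                        rng.standard_normal((4, 4)),
                        rng.standard_normal((6, 4)))
feats = sample_neighbor_features(h_v, 5, W_mu, W_sigma, W_out)
```

Because the noise enters additively after mu and sigma are computed, gradients flow through the Gaussian parameters, which is the point of the re-parameterization trick.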

5.4.1 GAD-NR shows superior performance in different types of anomaly detection.

Table 2: Performance comparison (ROC-AUC) of GAD-NR in benchmark anomaly detection for six different real-world datasets (injected anomalies for the Cora dataset). For the results of the baseline methods, we follow the BOND [44] paper, where the avg performance ± the STD of performance (max performance) is reported. For our model GAD-NR, we fix hyperparameters λ_x = 0.8, λ_d = 0.5, and λ_n = 0.001 and report the avg performance ± the STD of performance for all datasets, including the best performance in each dataset with tuned hyperparameters. The best and second-best performances are marked in bold and underlined respectively, and _ indicates out of memory with regard to GPU.

Table 3: Performance comparison (ROC-AUC) of GAD-NR in contextual (left) and structural + joint-type (right) anomaly detection for different real-world datasets. The best and second-best performances are marked in bold and underlined respectively, and _ indicates out of memory with regard to GPU.
Figure 3: Impacts of varying the feature reconstruction loss weight λ'_x, the degree reconstruction loss weight λ'_d, and the neighbor reconstruction loss weight λ'_n in Eq. (8) on detecting different types of anomalies in the Cora (top) and Books (bottom) datasets.

Table 4: Direct performance comparison between NWR-GAE [70] and our model GAD-NR.

5.4 Detection Performance Comparison

Table 5: Performance comparison of GAD-NR with different latent dimension sizes for detecting benchmark anomalies in the Cora and Reddit datasets.

5.5.2 Impact of the latent representation's dimension. In Table 5, we present the performance of GAD-NR on benchmark anomaly detection in the Cora and Reddit datasets while varying the dimension of the hidden representations. From the results, we observe that the performance of GAD-NR gradually increases as the latent dimension increases for Cora (32 to 128) and Reddit (8 to 32), compared to other GAE-based methods. We attribute this gradual improvement to neighborhood reconstruction: other autoencoders can only increase the capacity of the latent representations by increasing the latent dimension, whereas GAD-NR can also increase the supervision strength of neighborhood reconstruction by increasing the latent dimension. When the dimension increases even further, e.g., to 256 in Cora and 64 in Reddit, the anomaly detection performance of GAD-NR drops. With a higher latent dimension, the model becomes too expressive and can overfit the anomalies. For anomaly detection, we aim to capture normal behaviors rather than make the model memorize all information in the data, especially abnormal behaviors. Therefore, we need to strike a balance between model expressiveness and the proportion of normal information extracted for the best anomaly detection performance.
• Enron [62]: Enron is an email-interaction network where the nodes are email addresses and the edges represent the interactions

Table 7: Performance comparison (ROC-AUC) of GAD-NR in joint-type anomaly detection for different real-world datasets. The best and second-best performances are marked in bold and underlined respectively, and _ indicates out of memory with regard to GPU.