Phases, Modalities, Spatial and Temporal Locality: Domain Specific ML Prefetcher for Accelerating Graph Analytics

Memory performance is a key bottleneck in accelerating graph analytics. Existing Machine Learning (ML) prefetchers encounter challenges with phase transitions and irregular memory accesses in graph processing. We propose MPGraph, an ML-based Prefetcher for Graph analytics using domain specific models. MPGraph introduces three novel optimizations: soft detection of phase transitions, phase-specific multi-modality models for access delta and page predictions, and chain spatio-temporal prefetching (CSTP) for prefetch control. Our transition detector achieves 34.17-82.15% higher precision compared with Kolmogorov-Smirnov Windowing and decision tree. Our predictors achieve 6.80-16.02% higher F1-score for delta and 11.68-15.41% higher accuracy-at-10 for page prediction compared with LSTM and vanilla attention models. Using CSTP, MPGraph achieves 12.52-21.23% IPC improvement, outperforming state-of-the-art non-ML prefetcher BO by 7.58-12.03% and ML-based prefetchers Voyager and TransFetch by 3.27-4.58%. For practical implementation, we compress the prediction models to reduce the storage and latency overhead. MPGraph with the compressed models still shows significantly superior accuracy and coverage compared to BO, with 3.58% IPC improvement.


INTRODUCTION
Graph analytics is widely used in scientific and engineering fields to analyze complex structural relationships in graphs. However, the enormous size of Big Data graphs and the attendant complexity of the analytics algorithms [66] often result in underperformance due to inefficient memory utilization (low data reuse, high cache miss rates, etc.) [3]. While many effective graph processing frameworks for improving graph analytics performance have been proposed [12,15,28,44,55,60,65,82], there still remains ample scope for ameliorating memory performance issues.
Data prefetching is the process of proactively fetching data into the memory cache before the data is requested. There are several options for incorporating prefetching within graph analytics applications. Traditional rule-based prefetchers [8,20,22,36,39,45,58,62,72,79], for example, offer the simplicity of easy hardware implementation along with some performance acceleration [69]. However, these rule-based prefetchers have limited adaptability: they cannot handle the complex memory access patterns seen in graph analytics. In contrast, Machine Learning (ML) algorithms for memory access prediction and prefetching show high adaptability and generalizability [17,59,64,87,88], thereby offering a promising alternative for boosting performance. Further, their superior prediction accuracy on complex sequences and patterns makes them an attractive design choice for graph analytics acceleration.
Given the importance of the graph analytics domain, we believe there is a need to develop dedicated high-performance ML prefetchers for such applications. To the best of our knowledge, there do not exist such domain-specific ML prefetchers (for graph analytics). Developing such a prefetcher presents multiple unique challenges. 1) The memory access patterns vary across different graph processing phases [28,29,55], making it difficult to train a general ML model that performs well across all phases [13,27]. 2) Parallel executions under multi-core systems introduce randomness and irregularity [38], decreasing pattern matches in temporal prefetchers, which typically rely on recurring patterns [20,59]. 3) Processing of connected nodes stored on multiple pages causes wide-range page jumps, making spatial prefetchers that predict within a page [39,87] less effective. Existing ML-based prefetchers [17,47,59,87] model the prefetching process as a simple sequence prediction problem. Such a modeling approach integrates little or no domain knowledge of applications and does not address the above challenges.
In this paper, we present MPGraph, the first ever ML-based Prefetcher for Graph analytics. Uniquely, MPGraph is built over a collection of domain specific (DS) ML models, which capture the domain specific context of both architecture and computation in graph analytics. Specifically, MPGraph is optimized along three DS dimensions: phases, modalities, and locality.
Optimizations for phases. Graph processing applications often exhibit different memory access patterns in different phases. To improve prediction accuracy, we propose phase-specific models. However, detecting phase transitions in a timely and reliable manner is challenging. We address false detections caused by impulse pattern shifts and present two methods for phase transition detection based on whether phase labels are accessible or not. If the phase labels are inaccessible, we develop an unsupervised model called Soft-KSWIN, which is a variant of the windowing Kolmogorov-Smirnov test (KSWIN). If the phase labels are accessible, we train a decision tree classifier offline by supervised learning. Both methods use a soft detection scheme to avoid false positives in detection.
Optimizations for modalities. Modality refers to the way of describing an event or experience [2]. We propose a novel approach to model memory access prediction based on modality. We treat the program counter (PC) sequence as a distinct modality, which describes the memory accesses from the instruction perspective. We introduce AMMA, an Attention-based network structure using Multi-Modality Attention fusion, which combines the address input and PC input using an attention mechanism [21,42]. The attention mechanism [70] has shown success in prefetching due to high adaptability and parallelizability. Our AMMA model outperforms existing methods on various benchmarks and datasets.
Optimizations for locality. To address the limitations of prefetchers relying only on spatial or temporal locality, we propose to exploit both by combining memory access page prediction with delta prediction. We predict a future memory page based on temporal locality and multiple deltas within a page based on spatial locality. Then we develop a novel Chain Spatio-Temporal Prefetching (CSTP) strategy, which uses the predicted page as the base for delta prediction and forms a chain of prefetch requests.
We conduct a comprehensive evaluation of our approach using three state-of-the-art graph processing frameworks: GPOP [28], X-Stream [55], and PowerGraph [15]. These frameworks adopt a synchronous computation model, where each phase of the graph algorithm is followed by a global barrier synchronization. We use a diverse set of real-world and synthetic graphs as input data for our experiments. We use ChampSim [10] to generate memory access traces and simulate the physical memory behavior of our approach.
While our primary focus in this paper is improving prefetching accuracy, we also develop several techniques for reducing implementation complexity. While ML prefetchers are generally more complex than rule-based ones, such designs will become increasingly practical as hardware efficiency evolves (for example, efficient ML model parallelization). Thus, we believe ML graph analytics prefetchers are needed even at (currently) higher implementation costs due to the promise of superior performance. Notwithstanding this, we develop several techniques for optimizing our models for practical implementation. 1) To reduce model storage, we compress the models based on binary encoding, knowledge distillation, and quantization. 2) To reduce inference latency, we analyze the critical path under parallel implementation and apply distance prefetching to hide the latency. Even our most compressed model, with drastically reduced storage/latency, outperforms the best performing non-ML prefetcher and can thus be integrated into any state-of-the-art graph analytics framework as the preferred prefetcher.
We summarize our contributions below:
• A methodology for developing domain specific (DS) ML models for prefetching. We analyze features that capture the context of the architecture and computation, then illustrate their application in developing a prefetcher for graph analytics.
• MPGraph, the first ever dedicated ML prefetcher for graph analytics, built over DS ML models optimized for phases, modalities, and locality.
• Phase transition detectors using a soft detection scheme. Our approach achieves 34.17-82.15% higher detection precision than the KSWIN and decision tree baselines.
• Phase-specific multi-modality predictors that achieve 6.80-16.02% higher F1-score for delta prediction and 11.68-15.41% higher accuracy-at-10 for page prediction compared with LSTM and vanilla attention models.
• A chain spatio-temporal prefetching (CSTP) strategy with which MPGraph achieves 12.52-21.23% IPC improvement, outperforming the state-of-the-art prefetchers BO, Voyager, and TransFetch.
• Model compression techniques for practical implementation; even the compressed MPGraph outperforms the best performing non-ML prefetcher (BO [39]) by 3.58%.

RELATED WORK

2.1 State-of-the-Art Data Prefetchers
Rule-based Prefetchers. Traditional prefetchers learn from predefined rules. For example, the spatial prefetchers BO [39] and VLDP [58] learn from history access page offsets or deltas and predict future accesses within a spatial region. Temporal prefetchers like the Irregular Stream Buffer (ISB) [20] and Domino [1] predict temporally correlated memory accesses by recording and replaying history access sequences. The indirect prefetcher IMP [79] detects indirect accesses based on index-address pairs. These approaches use heuristic rules that cannot adapt to complex graph analytics memory access patterns. Recently, rule-based prefetchers (DROPLET [3], RnR [81], and Prodigy [67]) were developed specifically for graph analytics. However, they require extra support from input data, programmer annotation, or the compiler to exploit graph analytics properties. In contrast, our work uses the same interfaces as traditional hardware prefetchers. Therefore, we use BO and ISB as rule-based baselines that are compatible with our architecture.
ML-based Prefetchers. Machine Learning (ML) has become a key technique for memory access prediction and data prefetching. Various approaches have been proposed, including logistic regression and decision trees for pattern classification [52], reinforcement learning for context-based prefetching [46], and Long Short-Term Memory (LSTM) networks to predict memory accesses [7,17,47,80,86,88]. Hashemi et al. used LSTM to learn memory access delta patterns [17]. Shi et al. proposed Voyager [59], which predicts both page sequences and page offsets using two LSTM models along with a dot-product attention mechanism. Zhang et al. developed TransFetch [87], an attention-based network that uses fine-grained address segmentation as input and achieves state-of-the-art prefetching performance. Although useful, these general models may not be the most efficient solution for specific domains such as graph analytics. Recently, ML algorithms were applied to predicting memory accesses for graph analytics [83,85]. However, they do not develop the models based on graph analytics properties and do not integrate the models into prefetchers. To the best of our knowledge, our work is the first ML-based prefetcher developed specifically for the graph analytics domain, achieving superior prediction and prefetching performance.

Incorporating Domain Knowledge into ML
Domain knowledge has been integrated into ML models in various areas [11], such as climate modeling [26], turbulence modeling [74], fire engineering [43], earth science [23], and chemistry [33]. Murdock et al. demonstrate the effectiveness of domain knowledge in improving ML model predictive ability [40]. Several notions related to domain specific ML have been explored in recent literature. Theory-guided data science [25] proposes a new paradigm that is gaining prominence in scientific disciplines when designing data science models. Informed machine learning [71] integrates prior knowledge into the machine learning pipeline by using it as an independent input source. Physics-based machine learning [24,73] incorporates physical properties into model training by introducing physics-guided loss functions and initialization.
In this work, we propose using domain specific ML in the context of memory address prediction and prefetching. Some existing ML-based prefetchers, though not designed specifically for graph analytics, incorporate domain knowledge into their models: for example, TransFetch [87] (memory address configuration), Voyager [59] (program counters), and ReSemble [84] (spatio-temporal localities for ensemble prefetching). We generalize this process by analyzing the architecture context of the target hardware and the computation context of the target applications to guide our modeling.

DOMAIN SPECIFIC ML FOR PREFETCHING
Previous ML-based prefetchers used general models with limited domain specific optimizations for particular classes of applications. We propose developing DS ML models for prefetchers that incorporate context features based on domain knowledge to achieve higher prefetching performance.

Domain Specific Features for Prefetching
By exploiting domain specific features from the context of architecture and computation for a specific domain application, we can integrate domain knowledge into ML models for prefetching.
Example features from the context of the architecture include:
• Platform: the target architecture on which the computation is executed, such as a multi-core platform, multi-CPU cluster, heterogeneous architecture, etc.
• Memory hierarchy: the parameters for each level of cache, main memory, persistent memory [76], flash memory, etc.
• Memory address configuration: the bit length of a memory address, the cache line size, the page size, etc.
Example features from the context of the computation include:
• Computing paradigm: the programming model that a particular application follows, such as MapReduce [12], Scatter-Gather [28,55], and GAS [15] in graph analytics.
• Phase: a step or a super-step in the domain computation, such as the local computation and value communication steps in a distributed computing paradigm.
• Modality: a mode, i.e., a set of features, used to describe the computation. For memory accesses, modalities include program counters, memory addresses, thread IDs, etc.
• Locality: the data access patterns in a computation, e.g., spatial locality and temporal locality.
• Thread: a unit of execution within a process. Multiple threads can run in parallel on a multi-core CPU and share the same memory space, challenging memory access prediction.
• Coordination: the way a parallel computing application defines how multiple cores or nodes of a system work together, including synchronous and asynchronous coordination.

Developing Domain Specific ML Models for Graph Analytics Prefetcher
To build a high-performance ML-based prefetcher for graph analytics, we propose developing domain specific ML models, incorporating the domain specific features into the model design.
In the context of architecture, our target platform is a multi-core shared memory architecture, as depicted in Figure 1, which can serve as a node of an HPC system. The memory hierarchy consists of private L1 caches (including data cache L1D and instruction cache L1I), private L2 caches, a shared last-level cache (LLC), and a shared main memory. This results in LLC data requests from interleaved instructions from different cores. The memory subsystem directly impacts the performance of the HPC system [48,57]. Our prefetcher leverages domain specific models to predict memory accesses. We take into account the memory address configuration of the system: prefetching is at the block level and virtual-to-physical address translation is at the page level. Based on this configuration, we train page and block index prediction models for data prefetching.
In the context of computation, many graph processing frameworks using various computing paradigms have been developed with exceptional performance [12,15,28,55,89]. We target graph analytics frameworks with iterative barrier-synchronized phases. In these frameworks, the phases are defined in software and all the applications implemented using the framework follow the same programming paradigm. Thus, a domain specific ML prefetcher design is applicable to all applications developed using the framework.
Incorporation of phase. Although the number of patterns within a phase is not fixed, the number of phases is small and constant. This observation has led us to train a separate model for each phase to improve memory access prediction performance. Additionally, Figure 2b shows that program counters form clusters for each phase, which suggests that program counters can be used to detect phase transitions.
Incorporation of modality. Parallel execution of multi-threaded applications results in interleaved instructions and high irregularity in memory accesses. For example, GPOP processes partitions of the input graph in parallel on each core. The memory address sequence cannot fully reveal this characteristic. Instead, from an instruction perspective, a multi-threaded process has multiple PCs, each pointing to the next instruction to execute for a given thread. Therefore, we use the PC sequence as an input modality equally important as the address sequence and develop a multi-modality network to fuse the two inputs for memory access prediction.
Incorporation of locality. Large-scale graph processing often involves high irregularity in memory access when accessing graph nodes stored in different memory pages. As shown in Figure 3, there are frequent and wide memory access page jumps in the example applications running in GPOP. Therefore, in addition to predicting deltas within a page following spatial locality, we propose to also predict memory access pages. Considering that graph processing is iterative, we predict pages following temporal locality.
In summary, to build a high-performance prefetcher for graph analytics, we design domain specific ML models tailored for multi-core shared-memory platforms running graph analytics applications with iterative barrier-synchronized phases. These models include a phase transition detector and phase-specific multi-modality predictors for memory access delta and page prediction. The detailed design is presented in Section 4. The proposed approach can be extended to accelerate 1) graph neural networks with well-defined phases (e.g., aggregation and update), 2) GPU applications that use shared memory, and 3) HPC applications where graphs are extremely large.

APPROACH

4.1 Overview
We introduce MPGraph, an ML-based prefetcher designed to accelerate graph analytics frameworks using domain specific ML models. Figure 4 illustrates the overall design and workflow of MPGraph.
The prefetching process begins with a phase transition detector that reads the PC sequence and detects phase transitions (see Section 4.2). For each phase, there are phase-specific multi-modality predictors for spatial delta prediction and temporal page prediction, both using the address sequence and PC sequence as two modalities of inputs (see Section 4.3). To select the phase-specific predictors and manage the delta and page predictions, we develop a prefetching controller that operates a novel chain spatio-temporal prefetching strategy and generates prefetch requests (see Section 4.4). We train models for each phase to learn the mapping from the history of memory accesses to future accesses. These models are then used to perform memory access prediction and prefetching to improve instructions per cycle (IPC).

Phase Transition Detector
Phases are defined by the software, and extracting phase labels requires access to the source code and software interface. With this in mind, we have developed phase transition detectors for two scenarios based on the accessibility of phase labels.

4.2.1 Phase Label Inaccessible.
In scenarios where phase labels are not accessible, we use an unsupervised learning model to detect phase transitions. Given a stream of program counters at time $t$, represented as $S_t = s_1, s_2, \ldots, s_t$, and the corresponding phase class $y_t \in \{0, 1, \ldots, K-1\}$, the joint distribution of the stream is denoted $P_t(S_t, y_t)$. If a phase transition occurs at time $t+1$, then:

$$P_t(S_t, y_t) \neq P_{t+1}(S_{t+1}, y_{t+1})$$

Such a phase transition corresponds to a concept drift [35] in the context of machine learning. Kolmogorov-Smirnov Windowing (KSWIN) [51] is a state-of-the-art concept drift detection model for a data stream. It is based on the two-sample Kolmogorov-Smirnov (K-S) test [5], which estimates the probability that two sets of samples were drawn from the same distribution. The K-S statistic $D$ is the maximum absolute distance between the two empirical cumulative distributions $F_H$ and $F_R$:

$$D = \sup_x |F_H(x) - F_R(x)|, \qquad F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{[-\infty, x]}(s_i)$$

where $F_n$ is the empirical distribution function for $n$ ordered observations $s_i$; $\mathbb{1}_{[-\infty, x]}(s_i)$ is an indicator function that equals 1 if $s_i < x$ and 0 otherwise; and $\sup$ is the supremum of the set of distances. KSWIN detects concept drifts by comparing the distributions of two windows: one containing $h$ history samples $H$ and the other containing $r$ recent samples $R$. These windows are sampled from a sliding window $\Psi$, which keeps the $w$ most recent points of the stream. We can reject the null hypothesis (that there is no statistically significant difference between the two observations) at a significance level of $\alpha$ if the following inequality is satisfied:

$$D > \sqrt{-\frac{\ln \alpha}{r}}$$

The hyperparameter $\alpha$ determines the threshold for drift detection in the KSWIN model, and its selection is crucial: the model is very sensitive to the choice of $\alpha$. A large value of $\alpha$ may increase the rate of false positive detections, while a small value may cause the model to fail in detecting drifts. KSWIN reports a positive phase transition when $D$ exceeds the threshold. This "hard" detection process ignores pulsing pattern shifts within a phase and can result in false positive detections, as shown in Figure 5a.
Soft-KSWIN is a domain specific variant of KSWIN designed to detect graph processing phase transitions. It exploits the domain knowledge that phases are stable for millions of instructions to avoid false positives. We design a soft detection process using a soft history window $H'$ sampled from a dynamic set of history data points, as shown in Figure 5b and Equation 6:

$$H' \subseteq \{s_{t-w}, \ldots, s_{t-r-c}\}$$

where $c$ is a counter initiated when a positive detection occurs, so that the history window is drawn only from points observed before the detected shift.
The Soft-KSWIN algorithm is shown in Algorithm 2. The history window samples from the portion of the sequence not polluted by the detected new pattern. As data flows through the stream, if a phase transition is detected between the recent window and the old pattern, both the counter $c$ and the number of detections are incremented (lines 13-15 in Algorithm 2). When an entirely new recent window has been sampled and $c \geq r$, if the ratio between the number of detections and the counter is larger than a soft threshold $th\_soft$ (default 0.5), a final detection is declared positive and the model is reset for future detections (lines 16-20 in Algorithm 2).
4.2.2 Phase Label Accessible. Processing phases can be labeled offline using source code, programmer annotation, and instrumentation tools like Intel Pin [37]. A supervised model can then be trained using the PC trace and phase labels.
Decision Tree (DT). We use a simple decision tree classifier [41] to predict the current phase from the PC trace sequence. When two consecutive predictions differ, we detect a phase transition.
Soft Decision Tree (Soft-DT). We observe that DT can produce false positive detections because it immediately reports a phase transition upon predicting a different phase; this includes short-term pattern shifts within a phase and false predictions. To reduce the false positive rate, we use a soft detection method similar to Section 4.2.1. We store past phase inferences in a result queue $Q$ and compare the modes (the most frequently occurring element) of its head and tail halves. We report a transition only when the two modes differ, which filters out transient pattern shifts.
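The soft detection idea can be sketched in a few lines of Python. This is an illustrative simplification of the scheme, not the paper's exact Algorithm 2: the pure-Python `ks_statistic` helper, the non-overlapping recent windows, and the parameter names are our own choices.

```python
import bisect

def ks_statistic(a, b):
    """Two-sample K-S statistic: supremum distance between the
    empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    ecdf = lambda xs, v: bisect.bisect_right(xs, v) / len(xs)
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in points)

def soft_detect(history, recent_stream, r, d_threshold, th_soft=0.5):
    """Soft detection: count how many recent windows of size r drift
    away from the frozen history window, and report a transition only
    if the majority do -- a single impulsive shift is ignored."""
    detections = counter = 0
    for i in range(0, len(recent_stream) - r + 1, r):
        counter += 1
        if ks_statistic(history, recent_stream[i:i + r]) > d_threshold:
            detections += 1
    return counter > 0 and detections / counter > th_soft
```

A stable phase yields few window-level detections, so no transition is reported; a genuine phase change drifts persistently and crosses the soft threshold.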

Phase-Specific Multi-Modality Predictors
We train phase-specific predictors for each phase of graph processing. To utilize spatial and temporal locality, we design two domain specific models: the spatial delta predictor and the temporal page predictor. We develop AMMA, an Attention-based network with Multi-Modality Attention fusion, as the backbone (feature extractor) for both predictors; see Figure 7. We use the attention mechanism in our models due to its high adaptability in prediction and high parallelizability in implementation [70].

Workflow of Training and Inference of Predictors.
The predictors in MPGraph are trained offline and then deployed for online inference. The workflow is shown in Figure 6. We extract memory access traces by monitoring the application accesses to the shared last-level cache. Considering the iterative characteristic of graph analytics applications, we extract the memory access traces for the first iteration of execution. Using this extracted trace, we set scanning windows over the past accesses (as input) and the future accesses (to collect labels for the future page and deltas) at a specific time $t$. We train phase-specific models offline using memory access traces extracted from the phases. Once the models are trained, we deploy them in the proposed ML-based prefetcher MPGraph to accelerate future executions of the application.
Self-Attention Layer takes the embedding of items as input, converts them to three matrices through linear projection, then feeds them into a scaled dot-product attention defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$$

where $Q$ represents the queries, $K$ the keys, $V$ the values, and $d$ the dimension of the layer input.
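The scaled dot-product attention above can be written in a few lines of numpy; this is a generic illustration of the standard formula, not MPGraph's implementation (single head, projections omitted).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for Q, K, V of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                 # weighted sum of values
```

Each output row is a convex combination of the value rows, with weights given by query-key similarity.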
Multi-Modality Attention Fusion Layer merges the inputs of the two modalities through concatenation and self-attention:

$$\mathrm{MMAF}(X_1, \ldots, X_M) = \mathrm{SelfAttention}([X_1; \ldots; X_M]\, W)$$

where $X_m$ is the input of modality $m$ to the MMAF layer, $W$ is the weight matrix of a fully connected layer, and $M$ is the number of modalities. $M = 2$ in this work: one modality is the address sequence and the other is the PC sequence.
Input. The input preprocessing employs the address segmentation method [87]. An input memory address is divided into a list of segments to make it processable by an ML model and to avoid tokenizing the vast space of addresses (in the millions). The PC is hashed and normalized for processing by the model.
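A minimal numpy sketch of the fusion step: concatenate the two modality features, project with a fully connected weight matrix, then self-attend. The dimensions, random weights, and identity attention projections are assumptions for illustration, not the AMMA configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections
    (simplified for the sketch)."""
    d = X.shape[-1]
    s = X @ X.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

def mmaf(addr_feat, pc_feat, W):
    """Multi-Modality Attention Fusion: concatenate the M=2 modality
    features, apply a fully connected projection W, then self-attend."""
    fused = np.concatenate([addr_feat, pc_feat], axis=-1)  # (seq, 2*d)
    return self_attention(fused @ W)                       # (seq, d_out)

seq, d, d_out = 4, 8, 8
addr = rng.standard_normal((seq, d))   # address-modality features
pc = rng.standard_normal((seq, d))     # PC-modality features
W = rng.standard_normal((2 * d, d_out))
out = mmaf(addr, pc, W)
```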
Output. The model predicts multiple future deltas within a spatial range, i.e., a page size. The sum of the address and a predicted delta is a prediction of a future access.
Training. The model is trained using labels from future deltas in the form of a bitmap for multi-label classification. We use binary cross entropy as the loss function.
Input. The page sequence input is tokenized because its small vocabulary (in the thousands) can be processed by an ML model [59].
Output. The model outputs the probability of the next future page, as determined by the Softmax function.
Training. The model is trained using the future page token as the label. We use categorical cross entropy as the loss function.
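The bitmap labels for the delta predictor's multi-label training might be constructed as follows; the 64-delta spatial range is an illustrative assumption, not the paper's configuration.

```python
PAGE_DELTAS = 64  # spatial range: deltas considered within one page (assumed)

def delta_bitmap(current_block, future_blocks):
    """Multi-label target: bit i is set if delta i+1 (within the
    spatial range) appears among the future accesses."""
    bits = [0] * PAGE_DELTAS
    for blk in future_blocks:
        delta = blk - current_block
        if 1 <= delta <= PAGE_DELTAS:
            bits[delta - 1] = 1
    return bits
```

A model trained against such bitmaps with binary cross entropy can emit several deltas per prediction, one per set bit.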

Chain Spatio-Temporal Prefetching Controller
The prefetching controller performs two functions: it switches between phase-specific predictors and specifies the prefetch request.
We propose a novel strategy called Chain Spatio-Temporal Prefetching (CSTP) to determine what to prefetch.

Switching Predictors.
The prefetching controller receives signals from the phase transition detector. Upon detecting a transition, it activates all of the phase-specific predictors to work in parallel. The controller then monitors the performance of these predictors for a small number of accesses and selects the best-performing one.
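The selection step can be sketched as scoring each predictor on a short warmup window and keeping the one with the most hits. The `TablePredictor` class and the `predict` interface are hypothetical stand-ins for the phase-specific AMMA predictors.

```python
class TablePredictor:
    """Toy stand-in for a phase-specific predictor: `predict` returns
    candidate future accesses for an input (hypothetical interface)."""
    def __init__(self, table):
        self.table = table

    def predict(self, x):
        return self.table.get(x, [])

def select_predictor(predictors, warmup_trace):
    """Score each predictor on (input, actual next access) pairs from a
    short warmup window and keep the one with the most hits."""
    best, best_hits = None, -1
    for p in predictors:
        hits = sum(1 for x, y in warmup_trace if y in p.predict(x))
        if hits > best_hits:
            best, best_hits = p, hits
    return best
```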

Chain Spatio-Temporal Prefetching. Figure 8 illustrates the CSTP strategy. Given an input sequence of memory block addresses denoted as "Page-offset" (e.g., A-1 represents page A, offset 1) and a PC sequence, the spatial delta predictor and temporal page predictor operate in parallel. A page base offset table (PBOT) records the latest offset and PC for past pages. For a predicted page, the latest offset and PC can be retrieved from the PBOT for further spatial and temporal inference. This process continues as a chain until either the temporal degree is exceeded or the page offset is missing in the PBOT. Given a spatial degree $D_s$ and a temporal degree $D_t$, the total prefetch degree $D$ lies in the range:

$$D_s \leq D \leq D_s \times (D_t + 1)$$
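The chaining logic can be sketched as follows; the dict-based PBOT and the `page_pred`/`delta_pred` callables are illustrative stand-ins for the hardware table and the AMMA predictors.

```python
def cstp_prefetch(addr, pc, page_pred, delta_pred, pbot, d_s, d_t):
    """Chain spatio-temporal prefetching: issue spatial deltas off the
    current access, then hop to each predicted page, using the last
    offset/PC recorded in the PBOT as the base for the next spatial
    step. The chain stops after d_t hops or at a PBOT miss."""
    page, offset = addr
    requests = [(page, offset + d) for d in delta_pred(offset, pc)[:d_s]]
    for _ in range(d_t):                   # temporal chain hops
        page = page_pred(page, pc)
        if page not in pbot:               # chain breaks: no base offset
            break
        offset, pc = pbot[page]
        requests += [(page, offset + d) for d in delta_pred(offset, pc)[:d_s]]
    return requests
```

With `d_s = 2` and `d_t = 2` this yields at most six requests per access, matching the degree range above.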
5.1.2 Datasets. We evaluate MPGraph and the baselines using 6 real-world graph datasets [31] and a randomly generated synthetic graph based on R-MAT [9], as summarized in Table 2.

Trace Generation.
We use Intel Pin [37] to extract instruction traces from the benchmark applications running on 4 cores. Then, we use ChampSim to extract the shared LLC memory access trace. For model training, we use the trace from the first iteration of computation in the frameworks. For model testing and prefetching simulation, we use the traces from the following 10 iterations.

Results
Table 4 presents the performance of various phase transition detectors. All detectors accurately detect all true phase transitions, resulting in a recall of 1. The challenge lies in the precision of detection: false positives are common during a phase. By avoiding false positives, Soft-KSWIN achieves up to 66% higher precision compared to KSWIN, and Soft-DT achieves up to 82.15% higher precision compared to DT. Figure 9 shows in detail how Soft-KSWIN avoids false positives. While KSWIN quickly reports a phase detection when the K-S statistic D exceeds a threshold, it also reports multiple false positive predictions due to impulsive pattern shifts. Simply setting a higher threshold cannot solve this problem and may cause true transitions to be missed. Soft-KSWIN uses soft detection to avoid false predictions while incurring only a small lag. Since the number of instructions in a phase is in the millions, this lag is acceptable.

Evaluation of Multi-Modality Predictors
5.3.1 Baselines. We implement AMMA using the configurations in Table 5. We implement several models to demonstrate the effectiveness of our approach, including:
• LSTM [19], which uses a concatenation of the address and PC inputs for each time step. The hidden dimension is 256.
• Attention [70,87].
• AMMA-PS, Phase-Specific AMMA, in which separate AMMA models are trained specifically for each phase.

Metrics.
We use F1-Score to evaluate the performance of spatial delta prediction, which performs multi-label classification.
We use accuracy-at-10 (accuracy@10) as in [17] to evaluate page prediction: a prediction is considered correct if the predicted page occurs within the next 10 memory accesses.
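The accuracy@k metric as described can be computed directly from a trace; this is a straightforward restatement of the definition above, not the paper's evaluation harness.

```python
def accuracy_at_k(predictions, accesses, k=10):
    """A prediction made at step t counts as correct if the predicted
    page appears among the next k accesses after t."""
    hits = sum(
        1 for t, pred in enumerate(predictions)
        if pred in accesses[t + 1 : t + 1 + k]
    )
    return hits / len(predictions)
```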

Results
Table 6 shows the spatial delta prediction performance. AMMA-PS shows the highest F1-score for all the applications, outperforming LSTM by 11.15%-24.01%, Attention by 4.52%-11.83%, AMMA by 2.46%-5.9%, and AMMA-PI by 1.79%-5.2%. Table 7 shows the temporal page prediction performance. AMMA-PS also shows the best performance with respect to accuracy@10 for all the applications, outperforming LSTM by 8.23%-25.5%, Attention by 4.67%-22.6%, AMMA by 2.03%-21.15%, and AMMA-PI by 1.36%-21.24%. Phase-specific models are particularly advantageous for page prediction due to the unique temporal patterns in each phase. We set the spatial and temporal degrees of MPGraph to 2, for a total degree of 6 (Equation 11). We set the degree of all baselines to 6.

TOWARDS PRACTICAL PREFETCHERS
This paper explores using domain specific ML models to improve memory access prediction and prefetching performance.While these models are not optimized for hardware implementation, we discuss techniques to adapt them for practical use.

Reducing Storage Overhead
Binary Encoding [64]. We can use binary-encoding compression to reduce the vocabulary and output dimension of the temporal page predictor. By representing 2^16 classes (page tokens) with a 16-dimensional binary vector, we reduce the model output dimension to 16 and the vocabulary of input tokens to 2. This results in up to 33× compression for the model in Table 5 with 2^16 classes, reducing its parameters from 13M to 397K. The model can be further compressed using knowledge distillation as discussed below.
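The encoding itself is a plain base-2 expansion of the class id, replacing a 2^16-way softmax with 16 binary outputs; a sketch:

```python
def encode_page_token(token, bits=16):
    """Binary-encode a page class id as a `bits`-dim 0/1 vector
    (least significant bit first)."""
    return [(token >> i) & 1 for i in range(bits)]

def decode_page_token(vector):
    """Recover the class id from its binary vector."""
    return sum(bit << i for i, bit in enumerate(vector))
```

At inference time, thresholding the 16 sigmoid outputs and decoding recovers a predicted page token.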
Knowledge Distillation [16,18]. Knowledge distillation involves transferring knowledge from a larger teacher model to a smaller student model. By tuning the AMMA configuration described in Table 5 and training the smaller variants as student models, we reduce the size of the spatial predictor from 419.6K to 7.5K parameters and the temporal predictor from 397K to 1.9K parameters, achieving an overall compression of 87×. Figure 13 shows that knowledge distillation significantly improves model performance when compressing models. By training a single student model using the phase-specific teacher models, we can further compress the predictor by a factor equal to the number of phases per iteration. Although this compression results in performance degradation, MPGraph still shows 5.47% higher IPC improvement compared with the best non-ML prefetcher, BO.
Quantization [49]. By representing the weights in the models using 8 bits and applying the above optimizations, we can reduce the storage cost of the spatial model to 7.5KB and the temporal model to 1.9KB. After this optimization, the storage requirement of our models is similar to that of baseline rule-based prefetchers such as BO (4KB storage) and ISB (8KB storage).
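The core of the distillation step, matching temperature-softened teacher outputs, can be sketched as below; the temperature value and the omission of the hard-label term are our simplifications.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross entropy between temperature-softened teacher and student
    distributions: the 'soft target' term of knowledge distillation."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -sum(t * math.log(s) for t, s in zip(p_t, p_s))
```

The loss is minimized when the student reproduces the teacher's softened distribution, so a student matching the teacher scores lower than one that diverges.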

Reducing and Hiding Inference Latency
Parallel Implementation. Hardware acceleration of neural networks has been widely studied [14,56]. MPGraph is based on the attention mechanism, which is highly parallelizable. Using a fully parallel implementation of the AMMA models, we can estimate the latency along the critical path of Figure 7 as follows: T_emb is the latency of the embedding layer, composed of a matrix multiplication (taking T_mm) and an activation function (taking T_act); T_att is the latency of the self-attention layer, which takes 4T_mm + 3T_act; T_fus is the latency of the multi-modality attention fusion layer, which takes T_emb + T_att + 4T_mm; T_trans is the latency of the Transformer layer, which has the same critical path as the fusion layer; T_hash is the latency of the input processing (hashing, segmentation, and tokenization), which can be implemented in look-up tables with latency 1; and T_out is the latency of the output layer, which takes T_mm. Assuming full parallelism, T_mm = 1 + log2(d) for dimension d and T_act = 1 for activation functions implemented as look-up tables, the overall latency is estimated as T ≈ 123 processor cycles for the original model with Transformer dimension d = 128 and T ≈ 79 for the compressed model with d = 8.
Approximation as a Look-Up Table. We can further accelerate model inference by using Look-Up Tables (LUT) to avoid complex computations. Recent work has explored approximating matrix multiplication and activation functions using LUT [54], developing LUT-based Processing-In-Memory approaches for fast neural network inference [4,6,53], and using LUT for layer-wise approximation [68]. By implementing layer-wise LUT for the models in Figure 7, with two sub-layers within each of the fusion layer and the Transformer layer [70], the model inference latency can be reduced to approximately 8 cycles regardless of the model dimension.
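The per-component estimates under full parallelism can be sketched numerically. The symbol names (t_mm, t_act, and so on) are assumptions for illustration; only the stated relations (a fully parallel matrix multiply taking 1 + log2(d) cycles, activations taking 1 cycle via look-up table, self-attention taking four matrix multiplies and three activations) are taken from the text.

```python
import math

T_ACT = 1  # activation implemented as a look-up table: 1 cycle

def t_mm(d: int) -> int:
    """Fully parallel matrix multiply: 1 cycle of multiplies + log2(d) reduction tree."""
    return 1 + int(math.log2(d))

def t_embedding(d: int) -> int:
    """Embedding layer: one matrix multiplication plus one activation."""
    return t_mm(d) + T_ACT

def t_attention(d: int) -> int:
    """Self-attention critical path: 4 matrix multiplications + 3 activations."""
    return 4 * t_mm(d) + 3 * T_ACT
```

For example, with d = 128 a matrix multiply costs 8 cycles, so the self-attention layer alone costs 35 cycles; shrinking to d = 8 halves the per-multiply cost to 4 cycles, which is where the compressed model's latency savings come from.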
Distance Prefetching. In addition to reducing the inference latency of a memory access prediction model, distance prefetching [87] offers an alternative way to hide or offset the latency by skipping the immediate inference slot and predicting memory accesses at a distance in the future. Figure 14 shows that distance prefetching (DP) effectively avoids the performance loss caused by model inference latency. With 200 cycles of latency introduced in the simulation, the uncompressed and the 87× compressed MPGraph still outperform BO by 8.77% and 3.58% w.r.t. IPC improvement, respectively.
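The idea can be sketched by how training pairs are built: instead of targeting the very next access after the history window, the label is the access a fixed distance ahead, so a prediction whose inference spans several accesses still arrives early enough to be useful. Parameter names are illustrative.

```python
def make_distance_training_pairs(trace, history_len=8, distance=4):
    """Build (history, target) pairs where the target lies `distance` accesses
    beyond the history window, hiding model inference latency.

    Illustrative sketch of distance prefetching; parameter names are assumptions.
    """
    pairs = []
    for i in range(len(trace) - history_len - distance + 1):
        history = trace[i:i + history_len]
        target = trace[i + history_len + distance - 1]  # access `distance` ahead
        pairs.append((history, target))
    return pairs

trace = list(range(100, 120))
pairs = make_distance_training_pairs(trace, history_len=4, distance=3)
```

With distance = 1 this reduces to ordinary next-access prediction; larger distances trade prediction difficulty for latency tolerance.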

Analysis of Computational Complexity
We use three complexity metrics to demonstrate that our method is state-of-the-art in terms of the complexity-performance trade-off.

CONCLUSION
In this paper, we propose MPGraph, an ML-based prefetcher for graph analytics. Our approach leverages phase, modality, and locality features to develop domain specific models, including phase transition detectors and phase-specific multi-modality models for memory access page and delta prediction. We use a chain spatio-temporal prefetching strategy to manage the models, resulting in 12.53% to 21.23% IPC improvement, outperforming state-of-the-art prefetchers. We compress model storage costs via knowledge distillation and reduce inference latency through distance prefetching. The compressed model still outperforms the non-ML baselines.
Domain specific (DS) ML can be extended to many scenarios. For example, graph frameworks using asynchronous execution allow processes to go beyond the current phase without a synchronization barrier. The phase transition detector in MPGraph can be extended to each thread for such asynchronous frameworks [34,61,90]. DS ML can also be applied to accelerate graph machine learning frameworks [75]. The computation of graph machine learning consists of multiple phases, including sampling, updating, and aggregation. Phase detection and phase-specific models can be extended to this domain. Characteristics of graph machine learning, such as sparsity in a graph network, can also be used in the design of domain specific models. A more general extension of DS ML is to accelerate the training of machine learning models, e.g., neural networks. Each optimization step consists of iterative phases, including fetching batch data, forward-path inference, and back-propagation for weight updates.

Algorithm 1: Scatter-Gather Paradigm with Synchronized Phases
1: procedure Scatter(vertex v)
2:     Propagate the value of v to neighbors along edges
3: procedure Gather(vertex v)
4:     Accumulate values from neighbors to update v
5: while not done do
6:     Scatter and Gather all vertices, with a synchronization barrier between the phases

We use the GPOP framework [28] to illustrate the incorporation of domain specific features for developing domain specific ML models. GPOP is a graph processing framework based on the Scatter-Gather paradigm, as outlined in Algorithm 1. It consists of two phases: Scatter, which propagates the current value of a vertex to its neighboring vertices along edges, and Gather, which accumulates values from neighboring vertices to update the value of a vertex. Specifically, we incorporate the domain specific features phase, modality, and locality into our model development.
Incorporation of phase. Memory access patterns vary among different phases of graph processing. We perform Principal Component Analysis (PCA) of the memory access sequences from the Connected Component and PageRank applications in GPOP. The results of the top three components (Comp 1-3) are shown in Figure 2a; there is diversity in memory access patterns both within and between the two phases.
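The synchronized super-step of Algorithm 1 can be sketched as below. This is an illustrative sketch of the paradigm, not GPOP's actual binned, cache-optimized implementation; the function and variable names are assumptions.

```python
def scatter_gather_step(values, out_edges):
    """One synchronized Scatter + Gather super-step over a directed graph.

    `out_edges[v]` lists v's out-neighbors. The barrier between the two
    phases is implicit in finishing the Scatter loop before Gather begins.
    """
    # Scatter phase: each vertex propagates its value along its out-edges.
    messages = {v: [] for v in out_edges}
    for v, val in values.items():
        for u in out_edges[v]:
            messages[u].append(val)
    # (synchronization barrier between the phases)
    # Gather phase: each vertex accumulates incoming values to update itself.
    return {v: sum(msgs) for v, msgs in messages.items()}

out_edges = {0: [1, 2], 1: [2], 2: [0]}
values = {0: 1.0, 1: 2.0, 2: 3.0}
new_values = scatter_gather_step(values, out_edges)
```

The phase structure is what MPGraph exploits: memory accesses inside the Scatter loop and inside the Gather loop follow distinct patterns, separated by a detectable transition.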
PCA for program counters.

Figure 2 :
Figure 2: Scatter and Gather phases in GPOP applications have distinct memory access patterns. The phase transitions can be detected by analyzing the program counters.

Figure 3 :
Figure 3: Wide-range memory access page jumps in GPOP.

Figure 4 :
Figure 4: Overall design of MPGraph. Domain specific ML models operate in a phase transition detector and phase-specific multi-modality predictors. A prefetching controller performs a chain spatio-temporal prefetching strategy to manage the predictors and request prefetches.

Figure 5 :
Figure 5: KSWIN's false positive issue arises from its hard detection of the test statistic against a threshold. Soft-KSWIN addresses this using a modified history window sampling.

Figure 6 :
Figure 6: Workflow of training and inference of predictors.
Figure 7a illustrates the model of the spatial delta predictor.It uses both the address sequence and PC sequence as inputs, with AMMA serving as the backbone and a Multi-Layer Perceptron (MLP) with a Sigmoid function acting as the classification head.

Figure 7 :
Figure 7: Phase-specific attention-based multi-modality models for memory access delta and page predictions.

4.3.4 Temporal Page Predictor.
Figure 7b illustrates the model of the temporal page predictor. It takes the page address part of a memory address as one modality and uses the PC as another modality. AMMA serves as the feature extractor, and an MLP with a Softmax activation function is used as the classification head.
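The two modalities of a raw memory address can be derived as in this minimal sketch; the 4 KiB page size (12 offset bits) is an illustrative assumption.

```python
PAGE_BITS = 12  # 4 KiB pages -- an assumption for illustration

def split_address(addr: int):
    """Split a memory address into the page number (temporal modality)
    and the in-page offset used for spatial delta prediction."""
    return addr >> PAGE_BITS, addr & ((1 << PAGE_BITS) - 1)

page, offset = split_address(0x7F3A2B48)
```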

Figure 9 :
Figure 9: Case study on the phase detection performance of KSWIN and Soft-KSWIN on GPOP PageRank. KSWIN reports false positive results due to "hard" detection. Soft-KSWIN avoids the false positives while incurring a small window of lag.

Figure 10 :
Figure 10: Prefetch accuracy of MPGraph and the baselines.

Figure 11 :
Figure 11: Prefetch coverage of MPGraph and the baselines.

5.4.1 Baselines.
We compare MPGraph with state-of-the-art rule-based prefetchers and ML-based prefetchers:
• Best-Offset prefetcher (BO) [39]: rule-based spatial prefetcher that predicts delta patterns within a page.
• Irregular Stream Buffer (ISB) [20]: rule-based temporal prefetcher based on record and replay.
• Delta-LSTM [17]: ML-based prefetcher using delta inputs and delta outputs.
• Voyager [59]: ML-based prefetcher using address and PC as inputs and using two LSTM-based models to temporally predict the next page and offset.
• TransFetch [87]: ML-based prefetcher using address and PC as inputs and using an attention-based model to predict deltas within a spatial range beyond a page.

Figure 12 :
Figure 12: IPC improvement using MPGraph and the baselines.
The record-and-replay mechanism cannot work well on multi-core executions. TransFetch shows higher accuracy but lower coverage than Voyager, which performs page prediction covering wide-range jumps. MPGraph achieves both high accuracy and high coverage due to spatio-temporal prefetching. Figure 12 shows the IPC improvement for each application on each input graph. MPGraph achieves the highest performance among all the prefetchers: on average the improvement is 12.53% for GPOP, 21.23% for X-Stream, and 14.57% for PowerGraph. Compared with the top baselines, MPGraph outperforms TransFetch in GPOP by 3.72%, Voyager in X-Stream by 4.58%, and Voyager in PowerGraph by 4.14%.

Figure 13 :
Figure 13: Performance of MPGraph under knowledge distillation (KD). Baselines are uncompressed. We compress the models by up to 87× for each phase with 5.47% higher IPC improvement than BO.

Figure 14 :
Figure 14: Effectiveness of distance prefetching (DP) for MPGraph using the uncompressed and the compressed models.
• AMMA, a multi-modality attention-based network for memory access prediction. It outperforms LSTM and vanilla attention by 6.80-16.02% w.r.t. F1-score for delta prediction and by 11.68-15.41% w.r.t. accuracy-at-10 for page prediction.

Table 4 :
Phase Detection Evaluation

Table 5: AMMA model configuration
• Attention, which uses the address as input and the PC as side information. It uses 2 Transformer layers with dimension 128 and 4 heads.
• AMMA, which uses attention layers for each modality with dimension 64; the attention fusion and Transformer layers have dimension 128, with 1 layer each.
• AMMA-PI, Phase-Informed AMMA, in which phase embeddings are incorporated as side information after the fusion of the two modalities in AMMA.

Table 6 :
F1-Score of Spatial Delta Prediction

Table 8: Computational Complexity of MPGraph and state-of-the-art ML-based prefetchers. n is the sequence length; l is the number of layers in the model. *MPGraph using 7.2× compressed prediction models.

We compare with baseline ML-based prefetchers on: 1) total number of model parameters (Param), 2) number of operations performed in model inference (OPs), and 3) critical path, indicating potential speedups for future parallelized model implementations. The results are shown in Table 8. With regard to IPC improvement (IPC Impv), we demonstrate that MPGraph using 7.2× compressed models achieves superior performance while utilizing a smaller number of parameters and OPs. The critical path of our attention-based model (Section 4.3, [70]) does not depend on the input sequence length n but only on the number of layers l, leading to lower latency compared to LSTM-based models when n is large.