INFINEL: An efficient GPU-based processing method for unpredictable large output graph queries

With the introduction of GPUs, which are specialized for iterative parallel computations, the execution of computation-intensive graph queries on a GPU has seen significant performance improvements. However, due to the memory constraints of GPUs, there has been limited research on handling graph queries with large and unpredictable output sizes on a GPU. Traditionally, two-phase methods have been used, which split the query into sub-tasks based solely on a static estimate of the output size and then re-execute it. However, two-phase methods become highly inefficient on graph data with extreme skew and fail to maximize GPU performance. This paper proposes INFINEL, which handles unpredictable large output graph queries in a one-phase method through chunk allocation per thread and kernel stop/restart methods. We also propose optimization techniques enabled by INFINEL's unique characteristics of operating with low time/space overhead and not depending heavily on the GPU output buffer size. Through extensive experiments, we demonstrate that the one-phase method of INFINEL improves performance by up to 31.5 times over the conventional two-phase methods for the triangle listing ULO query.


Introduction
GPUs are widely used in various domains, such as deep learning, virtual reality, simulation, and modeling, which require efficient processing of complex computations, owing to their strength in iterative parallel computation. As GPUs have also come to be used for data analysis, there is a growing number of examples of significant improvements in graph query processing performance [4, 8, 11, 12, 14-16, 22-25, 27, 29, 31, 32]. For example, the triangle counting query is known to be difficult to process due to its high time complexity, but it has been shown to run over a hundred times faster on a GPU [4].
Currently, as various graph data models such as knowledge graphs become more widely used, the sizes of these graphs are also increasing. Thus far, much research has focused on how to process graph queries with large input data on a GPU [8, 11, 12, 14-16, 22, 23, 25, 31, 32]. In contrast, few studies have addressed how to process a graph query on a GPU when the size of its output data is large. In particular, Unpredictable Large Output (ULO) graph queries, which have a large output size that is difficult to determine before executing the query, are extremely challenging to process due to the limited memory capacity of a GPU. Examples of ULO queries include triangle listing and subgraph enumeration, among others.
To address the processing of ULO queries, some studies have proposed two-phase methods consisting of a precompute-phase and a write-phase [10, 13, 18, 30]. These studies can be classified into count-write [13, 30] and sampling-write [10, 18] methods according to the precompute-phase. The count-write method executes the query once to count the output size for each thread. This information is used to divide the query into sub-tasks that can be executed within the GPU output buffer. Next, each sub-task sequentially performs the query again and writes the output. However, this method requires approximately twice the time, as it performs the query twice [10, 18]. The sampling-write method uses a sampling approach to roughly estimate the output size and then divides the task into sub-tasks that can be executed within the GPU output buffer. This method can encounter buffer overflows in some sub-tasks due to sampling bias. To address this issue, Zhuohang et al. [18] proposed the Direct Result Output (DRO) method, which utilizes at most two kernels for each sub-task. In the first kernel, if the output buffer becomes full during write operations, the residual output size is recorded. Then, in the second kernel, the overflow is handled by assigning an output buffer equal to the residual output size and performing the remaining write operations. However, the DRO method can operate very inefficiently when the estimate is strongly biased by the skewness of the graph data. Both conventional methods share a fundamental issue of inefficient GPU utilization: they statically partition the sub-tasks based solely on the output size, without considering that the execution time of each thread varies dynamically on the GPU at runtime.
Ideally, to overcome the disadvantages of the traditional methods, it would be better to process queries in the write-phase only, without a precompute-phase. However, two significant challenges arise. First, the ability to stop and restart a kernel is essential for dividing the query into sub-tasks at kernel runtime. However, the highly parallel nature of the thousands of cores within a GPU makes the computation state extremely complex, and this functionality is therefore not supported [2, 3, 7]. Second, for ULO queries it is difficult to determine the output size of each thread in advance, making it challenging to allocate a unique output buffer to each thread without a precompute-phase. Multiple threads must therefore write to a single output buffer at the same time, which can degrade write performance due to write conflicts. One way to address this is a dynamic memory manager [1, 26, 28], but it suffers from significantly lower performance due to frequent dynamic re-allocations and the large synchronization overhead of severely fragmented per-thread output buffers.
In this paper, we propose INFINEL, a method for processing ULO queries on a GPU in a one-phase manner without a precompute-phase. The main concept is to solve the performance degradation that arises during write operations using a chunk allocation per thread strategy. Furthermore, we propose a method of stopping and restarting the kernel with low time/space overhead to utilize the GPU's computational capabilities more fully. In addition, by extending this method, the load balancing problem caused by the skewness of graph data can be addressed through a thread block segmentation optimization technique. As INFINEL shows little performance difference with respect to the GPU output buffer size, it applies optimization techniques that hide the kernel execution time or the time to transfer the GPU memory output data to the main memory using asynchronous GPU streams and a double buffer configuration.
The main contributions of this paper are as follows:
• We propose the INFINEL method that performs ULO queries in one phase on a GPU without a precompute-phase.
• We propose a method to resolve the inefficient GPU utilization on large-scale output and the performance degradation of write operations through the chunk allocation per thread and kernel stop/restart methods.
• We extend the INFINEL method to propose a thread block segmentation optimization technique that easily applies load balancing to reduce the processing time.
• We show that INFINEL operates with low time/space overhead and maintains nearly constant performance even with a relatively small output buffer.

Preliminaries
In this section, we review the method of processing graph queries on a GPU and explain the traditional two-phase methods for handling ULO queries.

GPU Architectures
GPU architectures typically have dozens of Streaming Multiprocessors (SMs), each of which contains hundreds of cores. Each core can be assigned and execute a single thread, and each SM can be allocated one or more thread blocks, which are groups of threads. Within a thread block, threads are divided into warps according to the warp size based on their thread IDs, and the threads within a warp process the same instruction simultaneously. GPUs utilize three types of memory to store and manage data: register memory, which is accessible by a single thread; shared memory, accessible by all threads running on a single SM; and GPU memory, accessible by all threads running on all SMs. The GPU operates by copying input data stored in the CPU main memory to the GPU memory (H2D copy), executing a kernel function on the GPU, and then copying the output data stored in the GPU memory back to the main memory (D2H copy). For the ULO queries addressed in this paper, the output data can be very large. To complete a single query, whenever the output buffer in GPU memory becomes full, it is necessary to perform a D2H copy of the output data, initialize the buffer, and continue processing the query. We define this single process as an iteration.
Through this architecture, the GPU can process hundreds of millions of threads in parallel. However, the instructions performed by the threads can all be different. Additionally, as the GPU is designed with a focus on fast computation, it does not provide the ability to store or manage the instructions processed by each thread. Consequently, the GPU does not support the functionality to stop and restart a kernel [2, 3, 7].

Graph Query Processing on GPU
For efficient GPU computing, a number of graph query processing methods using memory coalescing have been proposed that allow a single warp to access adjacent GPU memory [4, 14, 22, 24, 25, 27]. Representatively, Bisson et al. [4] proposed a GPU-based query processing method that efficiently finds all triangles in a graph using data in the Compressed Sparse Row (CSR) format. Figure 1 shows an example graph G and its CSR representation. Figure 1(a) presents the example directed graph G, and in Figure 1(b) we denote the edges with v_0 as the source vertex and v_3, v_4, v_6 as the destination vertices in orange. Algorithm 1 represents a triangle listing overwrite kernel, slightly modified in this paper to perform a triangle listing query using the triangle counting algorithm proposed by Bisson et al. [4]. To identify triangles where three distinct vertices are connected by three edges, this method computes the intersection between the one-hop vertices and the two-hop vertices of each starting vertex (Line 11). First, each vertex u of the V array is assigned to a thread block to search for triangles with u as the starting vertex (Lines 2-3). Next, the list of one-hop vertices of u is stored in S_u (Line 5). Through the two nested for loops, all thread blocks explore the vertices w at two-hop distance (Lines 6-10). If w belongs to S_u, a triangle (u, v, w) is found (Lines 11-12). This kernel ensures that threads within a warp read adjacent memory regions (Lines 9-10). The discovered triangles are repeatedly overwritten into a register memory variable, without actually returning the enumerated results to the user (Line 12).
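To make the structure of such a kernel concrete, the following is a minimal CUDA sketch of an intersection-based triangle listing overwrite kernel. It is not the paper's Algorithm 1: the paper keeps adj(u) in shared memory (S_u), whereas this sketch tests membership with a binary search over the sorted CSR adjacency, and all names are ours.

```cuda
#include <cuda_runtime.h>

// Membership test in the sorted adjacency list adj(u) = colIdx[begin..end).
__device__ bool inAdj(const int* colIdx, int begin, int end, int w) {
    int lo = begin, hi = end;
    while (lo < hi) {                          // standard lower-bound search
        int mid = lo + (hi - lo) / 2;
        if (colIdx[mid] < w) lo = mid + 1; else hi = mid;
    }
    return lo < end && colIdx[lo] == w;
}

// Each thread block handles one starting vertex u; its threads scan the
// two-hop candidates w cooperatively. Found triangles are only overwritten
// into a register, never returned (the "overwrite" behavior of Algorithm 1).
__global__ void triangleListingOverwrite(const int* rowPtr, const int* colIdx,
                                         int numVertices, long long* scratch) {
    long long t = 0;                                        // register "output"
    for (int u = blockIdx.x; u < numVertices; u += gridDim.x) {
        for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
            int v = colIdx[e];                              // one-hop vertex v
            // consecutive threads read consecutive colIdx entries (coalesced)
            for (int f = rowPtr[v] + threadIdx.x; f < rowPtr[v + 1]; f += blockDim.x) {
                int w = colIdx[f];                          // two-hop vertex w
                if (inAdj(colIdx, rowPtr[u], rowPtr[u + 1], w))
                    // triangle (u, v, w): packed and overwritten in-register
                    t = (long long)u ^ ((long long)v << 20) ^ ((long long)w << 40);
            }
        }
    }
    // one slot per thread; prevents the compiler from eliding the loops
    scratch[blockIdx.x * blockDim.x + threadIdx.x] = t;
}
```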
In this paper, we use the triangle listing query as a representative ULO query to explain the proposed method, INFINEL. We expand the output part of the triangle listing overwrite kernel (Line 12) to store all of the triangles in main memory and return them to the user. The output size of a triangle listing query cannot be accurately predicted without pre-executing a triangle counting query, and simply enumerating the triangles of a dataset with 10 billion edges can generate massive amounts of output data, typically more than 2 TB for 165 billion triangles [5].

Two-phase methods for processing ULO
In this paper, existing two-phase methods are classified into count-write and sampling-write methods. We provide a detailed explanation of the Lock Free Scheme for Result Output (LFRO) [13], a representative count-write method, and of an extended version of the Direct Result Output (DRO) method [18], a representative sampling-write method, which we call DRO-TL (DRO capable of processing a Triangle Listing). While the original DRO method proposed a process that could be handled with at most two iterations per sub-task, it had limitations when dealing with queries that exhibit significant workload skew. To overcome this shortcoming, we extend it to DRO-TL, which has no limit on the number of iterations.
LFRO: First, a count-phase is performed to count the output size for each thread. During this phase, a partially modified kernel is executed in which each thread stores its output size in register memory instead of performing write operations. When the query is completed, this information is stored in the array C within the GPU memory. Once the kernel terminates, a D2H copy operation transfers C, and the output sizes are summed for each thread block. Then, consecutive thread blocks whose combined output does not exceed the size of the GPU output buffer are grouped together, and the range processed by those thread blocks is designated as a single sub-task. Subsequently, a write-phase is performed to execute the query again for each sub-task sequentially.
Figure 2(a) shows an example of performing triangle listing with the LFRO method of Algorithm 1. In the count-phase, the output size is counted for each thread, and this value is stored in the array C. Assuming that the output buffer size is 20, thread blocks 0 and 1 containing v_0, v_1, v_2, and v_3 are grouped as ST_0, thread block 2 with v_4 and v_5 is grouped as ST_1, and thread block 3 with v_6 and v_7 is grouped as ST_2. Then, in the write-phase, each sub-task is executed sequentially. Before executing a sub-task, unique write spaces are allocated to each thread using the information from the array C. Afterwards, following the division into sub-tasks, ST_0 is performed with two thread blocks, while ST_1 and ST_2 are performed with one thread block each.
This method has the drawback of performing nearly identical queries twice, in the count-phase and the write-phase, resulting in twice the execution time [10, 18]. Additionally, in Figure 2(a), ST_1 and ST_2 are executed with only one thread block each, illustrating the disadvantage that a sub-task may not fully utilize the GPU performance capabilities because only a few thread blocks are used.

Figure 2. Examples of a two-phase method with four thread blocks, each composed of a single warp containing two threads (output buffer size = 20).
DRO-TL: First, a sampling-phase is performed to estimate the output size. In the case of a triangle listing, the query is run on a subset of the vertices in the input graph to determine their output sizes (e.g., sampling rate = 0.01). The average value avg is then used to estimate the overall output size. Furthermore, the input graph is partitioned by vertices into sub-tasks whose output can be produced in a single iteration. For a triangle listing, the sampling-phase is performed on a subset of the vertices in the V array in Algorithm 1 (Line 2) in order to estimate the output size. For example, in Figure 2(b), when {v_0, v_5} is sampled, avg = (9 + 1) / 2 = 5, and the estimated output size is 5 × 6 = 30. Assuming an output buffer size of 20, the vertices are grouped into ST_0 with v_0, v_1, v_2, and v_3, and ST_1 with v_4 and v_5.
Next, each sub-task is performed sequentially in the write-phase. Because it is difficult to accurately determine the output size of each thread in advance, the output buffer is pre-divided into fixed-size chunks. Each warp is assigned a chunk on which to perform write operations. If the assigned chunk becomes full, the warp repeatedly seeks and is assigned another empty chunk. If no empty chunk is available for a particular warp, the unprocessed residual output size and the breakpoint are recorded without stopping the kernel. In the next iteration, the kernel is executed again starting from the breakpoint. If there is significant workload skew among the sub-tasks, certain sub-tasks may require multiple iterations. For example, in Figure 2(b), ST_0 is processed by four warps, W_0, W_1, W_2, and W_3, each of which is assigned a chunk for write operations. In iteration I_0, W_0, W_2, and W_3 have an output size exceeding the allocated chunk and therefore require additional iterations I_1 and I_2 to process the remaining output.
This method is generally faster than the LFRO method because the first phase is fast due to sampling, and the write-phase utilizes all thread blocks for each sub-task. However, when the skewness of the graph is very high, a single sub-task must be processed through multiple iterations. This incurs overhead for determining, in each iteration, the residual output size that was not written to the output buffer. In addition, as illustrated in Figure 2(b), where only W_2 and W_3 are executed in I_2, it generally cannot utilize all of the GPU's core resources in later iterations. Lastly, because a warp-sized number of threads perform write operations on a single chunk, writes to a chunk must go through atomic operations to prevent conflicts between write operations. Given that many write operations occur in ULO queries, performance degradation can arise from the serialization these atomic operations cause. The experimental results of this performance degradation are discussed in Section 6.2.

ULO Processing Method
In this section, we introduce the INFINEL method, which handles ULO queries through the write-phase only, without a precompute-phase.

Chunk Allocation per Thread
This section explains the Chunk Allocation per Thread (CAT) method, which resolves the performance degradation of write operations. The GPU output buffer is divided into n fixed-size chunks, each assigned a unique chunk ID c_0, c_1, ..., c_{n-1}. A single chunk is sized to hold tens to hundreds of the minimum unit of output (e.g., a single triangle in a triangle listing). Each thread is allocated the chunk with the smallest ID among the available chunks and performs write operations on that chunk without competition. When the chunk is full, the allocation of a new chunk is repeated, handling the variable output size of each thread. During this process, chunks are allocated to threads using atomic operations on the GPU memory variable nextChunkID, which represents the smallest unallocated chunk ID. Figure 3 shows an example of the CAT method in a kernel operating with four threads, t_0, t_1, t_2, and t_3. The output buffer is divided into ten chunks, c_0 to c_9, and seven of these chunks have been assigned to the four threads at time T_0. If write operations are required in t_2 and t_3, they are allocated different new chunks, c_8 and c_7 respectively (assuming t_3 was allocated before t_2). Through this method, an appropriately sized output buffer can be allocated to each thread, even without knowing the output size of each thread in advance.
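The following device-side sketch illustrates the CAT write path, assuming fixed-size chunks of TRIS_PER_CHUNK triangles and a global nextChunkID counter; the constant, struct, and function names are ours, not the paper's.

```cuda
#include <cuda_runtime.h>

#define TRIS_PER_CHUNK 100           // outputs per fixed-size chunk (illustrative)

struct CatState {                    // per-thread, kept in registers/local memory
    int chunkId;                     // currently owned chunk, -1 if none
    int used;                        // outputs already written into it
};

// Write one triangle (u, v, w) under the CAT method. Returns false when no
// chunk could be allocated, i.e., the thread's warp must stop this iteration.
__device__ bool catWrite(int3* outBuf, int numChunks, int* nextChunkID,
                         CatState* s, int u, int v, int w) {
    if (s->chunkId < 0 || s->used == TRIS_PER_CHUNK) {
        // atomic only on chunk allocation, not on every write operation
        int id = atomicAdd(nextChunkID, 1);
        if (id >= numChunks) return false;     // buffer exhausted: must stop
        s->chunkId = id;
        s->used = 0;
    }
    outBuf[s->chunkId * TRIS_PER_CHUNK + s->used] = make_int3(u, v, w);
    s->used++;                                 // uncontended write into own chunk
    return true;
}
```

Note how the design choice shows up directly in the code: the atomic operation touches only nextChunkID once per chunk, while the writes themselves go to a chunk owned exclusively by one thread.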
The CAT method has lower output write overhead than DRO-TL, as it requires atomic operations only when a thread allocates a new chunk, rather than for every write operation. Nevertheless, because chunks are allocated per thread rather than per warp, there can be greater internal fragmentation due to partially filled chunks. However, the theoretical maximum of wasted memory is the product of the chunk size and the total number of threads, which does not exceed 100 MB for a kernel with a 1 KB chunk size and 100,000 threads. This is less than 0.7% of the 16 GB output buffer used in the experiments, indicating that the memory waste due to internal fragmentation is negligible.

Kernel Context for Stop and Restart
This section describes the method used to stop and restart the kernel, which is necessary for processing ULO queries with the CAT method. When the output buffer becomes full, it is necessary to stop the kernel, perform a D2H copy of the output buffer, initialize the buffer, and restart the kernel for the next iteration. To do this, we define the warp context array, the warp state array, and the block state array.
Warp Context (WC) array: The WC array is an array in GPU memory with a length equal to the number of warps. It compactly stores the contexts of all threads within each warp. A thread context is defined as the information required for each thread to restart the kernel after it has been stopped. Taking advantage of the fact that threads within a warp execute the same instruction and have consecutive thread IDs, the thread contexts within a warp can be represented as a single context, referred to as the WC. For example, for a triangle listing query, the thread context can be defined as the (u, v, w) tuple determined by the loop variables u, v, and w (Lines 2, 6, 9). Furthermore, because the threads within a single warp operate within the same (u, v) range, the WC can be represented as (u, v, {w_0, w_1, ..., w_{WS-1}}), where WS is the warp size. Moreover, by utilizing memory coalescing, the WC can be represented as (u, v, w_0). Figure 4 illustrates an example of a triangle listing based on Algorithm 1, using Figure 1 as the input graph, with two thread blocks composed of two warps each and each warp composed of two threads (for a total of eight threads, where the length of the WC array is four). The kernel explores the starting vertices through the loop variable u, the one-hop vertices through the loop variable v, and the two-hop vertices through the loop variable w.
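One possible layout of such a WC entry is sketched below; the paper only specifies the (u, v, w_0) representation, so the field and function names are ours. At 4 ints, the entry matches the 16-byte warp context size used in the paper's overhead example.

```cuda
#include <cuda_runtime.h>

// One WC entry per warp in GPU memory. Because a warp's threads share (u, v)
// and lane i's two-hop position is w0 + i, a single 16-byte entry suffices.
struct WarpContext {
    int u;       // starting vertex shared by the warp
    int v;       // one-hop vertex shared by the warp
    int w0;      // two-hop position of lane 0
    int valid;   // 1 if this warp stopped mid-computation
};

// Save: every stopping lane derives the same lane-0 position, so the racy
// writes below all store identical values (a benign race).
__device__ void saveWC(WarpContext* wc, int warpId, int u, int v, int wPos) {
    int lane = threadIdx.x & 31;
    wc[warpId] = WarpContext{u, v, wPos - lane, 1};
}

// Restore on kernel restart: each lane derives its own resume point from
// the single shared entry (memory-coalesced resume positions).
__device__ void loadWC(const WarpContext* wc, int warpId,
                       int* u, int* v, int* wPos) {
    WarpContext c = wc[warpId];
    *u = c.u; *v = c.v;
    *wPos = c.w0 + (threadIdx.x & 31);
}
```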
Warp State (WS) array: The WS array is an array of shared memory with a length equal to the number of warps present in each thread block.It stores the state of each warp within a thread block, indicating whether it is still running (RUNNING) or stopped (STOPPED).At the beginning of each iteration when the kernel starts, the WS array is initialized to RUNNING.During the execution of the computation, if any thread within a warp is unable to receive a new chunk, the WS is updated to STOPPED for the next iteration.This ensures that all threads within a warp can stop at the same point by referring to the WS values.Additionally, by maintaining the WS array in shared memory, the reference overhead is reduced.
For example, in Figure 4, it is assumed that t_3 finds the triangle (v_0, v_4, v_6) and attempts to perform a write operation, but stops because a new chunk cannot be assigned. In this case, t_3 changes its WS value to STOPPED, and the thread t_2 within the same warp refers to it and also stops. A more detailed explanation of this process can be found in Lines 22-28 of Algorithm 3.
Block State (BS) array: The BS array is an array in GPU memory with a length equal to the number of thread blocks. It stores the state of each thread block, indicating whether any thread within the thread block is still proceeding with the kernel (PROCEEDING) or all threads have completed the kernel execution (COMPLETED). Before starting the query execution, the BS array is initialized to PROCEEDING once. If all warps within a thread block complete the kernel execution without any stops during the current iteration, the BS of that thread block is changed to COMPLETED. This ensures that only thread blocks with a BS of PROCEEDING execute the kernel in the next iteration.
For example, in Figure 4, thread block 0 intends to find triangles starting with v_0 and v_4 sequentially, corresponding to V[0] and V[2], and thread block 1 intends to find triangles starting with v_3 and v_6 sequentially, corresponding to V[1] and V[3]. For thread block 1, warps W_2 and W_3 have discovered the one-hop vertex v_5 from v_6. However, because there are no vertices connected to v_5 as two-hop vertices, the query is considered complete. Accordingly, the BS of thread block 1 is changed to COMPLETED. In the next iteration, either only thread block 0 executes the kernel, or thread block 1 shares and executes the workload of thread block 0 (explained in detail in Section 4).
In contrast to LFRO and DRO-TL, which statically divide the query into sub-tasks considering only the output size, the method proposed in this paper dynamically stops the kernel at warp granularity, reflecting the fact that the workload allocated to each thread (e.g., the number of output triangles in triangle listing) differs due to the skewness of the graph. As a result, it does not incur the overhead of the precompute-phase required for sub-task partitioning. Moreover, this method performs kernel stopping and restarting with low time/space overhead. A detailed theoretical analysis of this overhead is presented in Section 5.
Two issues may arise when adopting this method. First, if some threads within a warp cannot be allocated a new chunk and need to stop, all threads in the warp must stop, even though the other threads could still perform write operations on their allocated chunks. However, similar to the memory waste caused by internal fragmentation, the theoretical maximum of wasted memory is very small. Second, as the iterations progress, the number of thread blocks whose BS is COMPLETED increases. Consequently, only a few remaining thread blocks would be executed, potentially degrading GPU performance. To address this issue, we provide a detailed explanation of an optimization technique in Section 4.1.

INFINEL Framework
Algorithm 2 represents the pseudo code of the INFINEL framework. The framework assumes that the query is given as a ULO-aware kernel function K. A ULO-aware kernel function is the original query kernel modified as described in Section 3.1 and Section 3.2; further details are provided in Algorithm 3. First, INFINEL performs an H2D copy to load the input data stored in the main memory input buffer (InBuf_h) into the GPU memory input buffer (InBuf_d) (Line 1). For a triangle listing query, the graph data corresponds to the input data. Then, after initializing the WC and BS arrays (Lines 2-3), it iteratively initializes the GPU output buffer (OutBuf_d), executes the ULO-aware kernel function K, and performs a D2H copy of the output data stored in OutBuf_d into the main memory output buffer (OutBuf_h) until the ULO query is completed (Lines 5-12). For each iteration, the workload is re-balanced by applying the Thread Block Segmentation method, as described in Section 4.1 (Line 10). In the iterNum-th iteration, the output data stored in OutBuf_d is copied (D2H) into OutBuf_h[iterNum] (Line 9). Before the kernel execution, the nextChunkID used in the CAT method is initialized (Line 6). When the kernel finishes, if nextChunkID is less than the total number of chunks in the output buffer, the query is terminated (Line 12).
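A host-side sketch of this loop follows, reusing the WarpContext type from the Section 3.2 sketch; the kernel signature, the rebalancing call of Line 10, and all other names are illustrative assumptions rather than the paper's exact interface.

```cuda
#include <cuda_runtime.h>

// Hypothetical ULO-aware listing kernel (the role of Algorithm 3).
__global__ void uloAwareListing(const int* rowPtr, const int* colIdx,
                                int numVertices, int3* outBuf, int numChunks,
                                int* nextChunkID, WarpContext* wc, int* bs,
                                int iterNum);
void rebalanceThreadBlocks(WarpContext* wc_d, int* bs_d);  // Section 4.1, abstract

enum { PROCEEDING = 0, COMPLETED = 1 };   // PROCEEDING must be 0 for cudaMemset

// rowPtr_d/colIdx_d: graph already H2D-copied (Line 1 of Algorithm 2).
void infinel(const int* rowPtr_d, const int* colIdx_d, int numVertices,
             int3* outBuf_h, size_t outBytes, int numChunks,
             int numBlocks, int threadsPerBlock) {
    int numWarps = numBlocks * (threadsPerBlock / 32);
    int3* outBuf_d;  WarpContext* wc_d;  int *bs_d, *nextChunkID_d;
    cudaMalloc(&outBuf_d, outBytes);
    cudaMalloc(&wc_d, numWarps * sizeof(WarpContext));
    cudaMalloc(&bs_d, numBlocks * sizeof(int));
    cudaMalloc(&nextChunkID_d, sizeof(int));
    cudaMemset(wc_d, 0, numWarps * sizeof(WarpContext));    // init WC (Line 2)
    cudaMemset(bs_d, PROCEEDING, numBlocks * sizeof(int));  // init BS (Line 3)

    for (int iterNum = 0; ; ++iterNum) {
        cudaMemset(outBuf_d, 0, outBytes);                  // init OutBuf_d
        cudaMemset(nextChunkID_d, 0, sizeof(int));          // Line 6
        uloAwareListing<<<numBlocks, threadsPerBlock>>>(rowPtr_d, colIdx_d,
            numVertices, outBuf_d, numChunks, nextChunkID_d, wc_d, bs_d, iterNum);
        // D2H copy of this iteration's output into OutBuf_h[iterNum] (Line 9)
        cudaMemcpy(outBuf_h + (size_t)iterNum * (outBytes / sizeof(int3)),
                   outBuf_d, outBytes, cudaMemcpyDeviceToHost);
        int next;
        cudaMemcpy(&next, nextChunkID_d, sizeof(int), cudaMemcpyDeviceToHost);
        if (next < numChunks) break;  // some chunks unused: query done (Line 12)
        rebalanceThreadBlocks(wc_d, bs_d);                  // Line 10
    }
}
```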
Algorithm 3 represents the ULO-aware kernel obtained by applying the INFINEL method to Algorithm 1. The parts updated relative to Algorithm 1 are shaded in three different colors.
First, the WS is initialized to RUNNING for each warp and the query is performed (Line 7). While performing the query, when a write operation is required (e.g., when a triangle is discovered), each thread checks whether it has available space within its allocated chunk or can receive a new chunk for writing (Line 22). If some threads need to stop due to a lack of write space, the WS of these threads is changed to STOPPED and the WC is updated (Line 23). Then, all threads within the warps to which these threads belong stop by referencing the WS, as indicated in the shaded blue part (Lines 24-25, 27-28). If there is available write space, the write operation is performed using the CAT method (Line 26). If the kernel is restarted (i.e., iterNum > 0), the values stored in the WC are retrieved, allowing each thread to restart execution from its most recent stop point (Lines 9, 14, 18). In this case, loading the WC into u, v, and w is required only once during the current iteration of the kernel; the flag variable loadWC is used for this purpose (Lines 3-6, 29). When the kernel execution in a thread block is completed, its BS is changed to COMPLETED (Lines 30-31). In the case of a restart, if the BS is in the COMPLETED state, the corresponding thread block no longer executes the kernel (Lines 1-2).
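As a device-side sketch of this write-or-stop step, reusing catWrite (Section 3.1 sketch) and saveWC (Section 3.2 sketch), the helper below mirrors the shaded logic; the state encoding and visibility handling are our simplifications, not the paper's exact code.

```cuda
enum { RUNNING = 0, STOPPED = 1 };

// One write attempt inside the ULO-aware kernel. ws is the shared-memory
// Warp State array of this block (volatile so lanes see each other's
// updates); wPos is the thread's current two-hop position. Returns false
// when the warp must stop until the next iteration; the saved WC makes the
// warp resume from this point after restart.
__device__ bool writeOrStop(volatile int* ws, int warp,
                            WarpContext* wc, int gWarp,
                            int3* outBuf, int numChunks, int* nextChunkID,
                            CatState* cat, bool found,
                            int u, int v, int wPos, int w) {
    if (ws[warp] == STOPPED)            // Lines 24, 27: cheap shared-memory read
        return false;                   // a sibling lane already stopped the warp
    if (found && !catWrite(outBuf, numChunks, nextChunkID, cat, u, v, w)) {
        ws[warp] = STOPPED;             // Line 23: no chunk left for this lane
        saveWC(wc, gWarp, u, v, wPos);  // record the warp's restart point
        return false;
    }
    return true;                        // Line 26: output written via CAT
}
```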

Optimization for load balancing
In this section, we introduce optimization techniques that are enabled by the unique characteristics of INFINEL.

Thread Block Segmentation
This section explains the Thread Block Segmentation method, which dynamically distributes the workload evenly among thread blocks using the WC and BS introduced in Section 3.2. When few thread blocks are in the PROCEEDING state, the GPU's SMs may not operate efficiently. To address this, we re-distribute the remaining workload of the thread blocks in the PROCEEDING state to the thread blocks in the COMPLETED state. Instead of storing only the restart point of the threads within a warp, we extend the WC to store the range of the workload that the threads within a warp should perform. For example, in a triangle listing, while the warp context was represented as (u, v, w) in Section 3.2, the expanded WC, when re-balancing is based on the u value, is represented as ([u_start, u_end), v, w). In this paper, a heuristic method is proposed that evenly re-balances the workload based on the extended WC, ensuring that more than 50% of all thread blocks are always executing the kernel whenever more than half of the thread blocks are in the COMPLETED state. The sketch below illustrates one way such a heuristic could work.
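This host-side sketch runs between iterations and halves the remaining starting-vertex ranges of stopped warps, donating the upper halves to idle warps; the struct layout, the donor/idle pairing, and all names are our illustration of the heuristic, which the paper does not spell out at this level of detail.

```cpp
#include <algorithm>
#include <vector>

enum { PROCEEDING = 0, COMPLETED = 1 };
constexpr int WARPS_PER_BLOCK = 4;   // e.g., 128 threads / warp size 32

// Extended WC: the (v, w0) restart point plus the [uBegin, uEnd) range of
// starting vertices the warp still owns.
struct ExtWarpContext { int uBegin, uEnd, v, w0, valid; };

void rebalance(std::vector<ExtWarpContext>& wc, std::vector<int>& bs) {
    size_t completed = (size_t)std::count(bs.begin(), bs.end(), COMPLETED);
    if (completed * 2 <= bs.size()) return;        // act only past the 50% mark
    size_t j = 0;                                  // scans for idle warp slots
    for (size_t i = 0; i < wc.size(); ++i) {
        if (!wc[i].valid || wc[i].uEnd - wc[i].uBegin < 2) continue;
        while (j < wc.size() && wc[j].valid) ++j;  // find a warp with no work
        if (j >= wc.size()) break;
        int mid = wc[i].uBegin + (wc[i].uEnd - wc[i].uBegin) / 2;
        // the donated half starts fresh: no (v, w0) restart point applies
        wc[j] = ExtWarpContext{mid, wc[i].uEnd, -1, -1, 1};
        wc[i].uEnd = mid;                          // donor keeps the lower half
        bs[j / WARPS_PER_BLOCK] = PROCEEDING;      // wake the idle warp's block
        ++j;
    }
}
```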
Figure 5 demonstrates examples of Algorithm 3 operating with two thread blocks of two warps each, without and with the Thread Block Segmentation technique. It is assumed that when the kernel is stopped, thread block 1 has completed its workload, while thread block 0 has stopped with remaining workload. In Figure 5(a), only thread block 0 processes the remaining workload in the next iteration. In Figure 5(b), on the other hand, the workload of thread block 0 is divided, part of it is allocated to thread block 1, and both thread blocks process the remaining workload. This optimization exploits the characteristic of the INFINEL method that the time overhead of kernel stop/restart is very low, as will be shown in Section 5 and Section 6.

Double buffering with a small output buffer
Due to the large output size of ULO queries, the D2H copy time is significant. Specifically, in Algorithm 2, ULO queries must repeatedly execute kernel calls (Line 8) and D2H copy operations (Line 9), which account for most of the total query execution time. We therefore apply an optimization that hides the kernel execution time or the D2H copy time by utilizing double buffering with asynchronous GPU streams (e.g., CUDA Streams) [16]. The INFINEL method operates with low time/space overhead and, thanks to the CAT method, is minimally affected by the GPU output buffer size. Accordingly, even when the given output buffer is divided into two sub-buffers for double buffering, the performance degradation is negligible.
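A minimal sketch of this double-buffering scheme is shown below; the kernel stand-in, the termination check, and all names are hypothetical placeholders for the ULO-aware kernel and Algorithm 2's loop condition.

```cuda
#include <cuda_runtime.h>

__global__ void fillKernel(long long* out);   // stand-in for the ULO-aware kernel
bool queryCompleted();                        // hypothetical termination check

// The output buffer is split into two sub-buffers. While iteration i's
// kernel fills sub_d[i % 2] on streams[0], the D2H copy of iteration i-1
// drains the other sub-buffer on streams[1]. outBuf_h must be pinned
// (cudaMallocHost) for the copy to truly overlap the kernel.
void runDoubleBuffered(long long* outBuf_h, long long* sub_d[2], size_t subBytes,
                       int numBlocks, int threadsPerBlock) {
    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);
    size_t subElems = subBytes / sizeof(long long);
    int iter = 0;
    for (;; ++iter) {
        int cur = iter & 1;
        fillKernel<<<numBlocks, threadsPerBlock, 0, streams[0]>>>(sub_d[cur]);
        if (iter > 0)   // drain the previous sub-buffer while the kernel runs
            cudaMemcpyAsync(outBuf_h + (size_t)(iter - 1) * subElems,
                            sub_d[cur ^ 1], subBytes,
                            cudaMemcpyDeviceToHost, streams[1]);
        cudaStreamSynchronize(streams[0]);   // this iteration's kernel is done
        cudaStreamSynchronize(streams[1]);   // old copy done before buffer reuse
        if (queryCompleted()) break;
    }
    // drain the final sub-buffer
    cudaMemcpy(outBuf_h + (size_t)iter * subElems, sub_d[iter & 1],
               subBytes, cudaMemcpyDeviceToHost);
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}
```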
In contrast, when two-phase methods are applied, the kernel execution time increases significantly as the output buffer size decreases. In the LFRO method, more sub-tasks are created and fewer thread blocks are assigned to a single sub-task, leading to slower operation. In the DRO-TL method, sub-tasks with large outputs require more iterations, and the task of determining the residual output size must also be repeated more frequently. A performance comparison based on the output buffer size is shown in Section 6.2.

Analysis of the Time/Space overhead
In this section, we analyze the time and space overhead of applying the INFINEL framework to ULO queries. The consistency of this analysis with the experimental results will be shown in Section 6.3.
Time overhead: We analyze the additional operations performed when converting to a ULO-aware kernel by comparing Algorithm 3 with Algorithm 1. The most frequently executed addition is the access to the WS that determines whether the threads should stop (Lines 24, 27: the portion shaded in blue). Here, a read of the WS is required before deciding whether a write operation should be performed. However, because the WS resides in shared memory, the reference time is relatively short, resulting in low overhead during this step. Next, there are read and update operations on the WC and BS stored in GPU memory (Lines 1, 9, 14, 18, 23, 31), but these are performed at most once per iteration for each thread. In this way, our method minimizes access to GPU memory, which has relatively high access latency, resulting in low time overhead for query processing.
Space overhead: In addition to the output buffer, the INFINEL framework stores the WC and BS arrays in GPU memory. The size of the extra memory usage (in bytes) can be defined by Eq. (1):

$$S_{extra} = N_{TB} \times \frac{N_T}{WS} \times S_{WC} + N_{TB} \times S_{BS} \quad (1)$$

where $N_{TB}$ is the number of thread blocks, $N_T$ the number of threads per thread block, $WS$ the warp size, $S_{WC}$ the size of a warp context, and $S_{BS}$ the size of a block state entry.
Assuming a kernel with 1280 thread blocks, 128 threads per thread block, a warp size of 32, and a warp context size of 16 bytes, the extra memory usage is less than 90 KB, which is less than 0.001% of the memory available in a modern GPU, ranging from 24 to 80 GB. Additionally, for the WS stored in shared memory, each thread block uses $N_T / WS$ bytes of space (one byte per warp state). Even if a kernel is executed with the maximum of 1024 threads per thread block, each thread block occupies 32 bytes when the warp size is 32. Compared to the typical SM shared memory size of 32 to 64 KB, this is negligible.
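Plugging the example numbers into Eq. (1), and assuming one byte per block state entry (the paper does not state $S_{BS}$; even four-byte entries stay under the quoted bound):

```latex
S_{extra} = \underbrace{1280 \times \tfrac{128}{32} \times 16\,\mathrm{B}}_{\text{WC array}}
          + \underbrace{1280 \times 1\,\mathrm{B}}_{\text{BS array}}
          = 81920\,\mathrm{B} + 1280\,\mathrm{B} \approx 83\,\mathrm{KB} < 90\,\mathrm{KB}
```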

Performance Evaluation
In this section, we present experimental results in three categories. First, we show the performance of the INFINEL method for processing ULO queries, in particular the triangle listing query, in Section 6.2, comparing it with the LFRO method [13] and the DRO-TL method [18]. Second, we present the performance characteristics of INFINEL with respect to the relevant parameters, namely the chunk size and the output buffer size, in Section 6.3. Third, we analyze the contribution of each key idea of INFINEL to the overall performance.

Experimental Setup
For the experiments, we use both synthetic and real-world datasets. For synthetic datasets, we generate scale-free graphs following a power-law degree distribution using RMAT [6]. RMAT-k has 2^k vertices and 2^{k+4} edges. We generated the RMAT24 to RMAT27 datasets, which can be stored within GPU memory, with the four RMAT parameters set as follows [20]: a = 0.57, b = c = 0.19, d = 0.05. For the real-world datasets, we use well-known graph datasets of various sizes and characteristics: LiveJournal [19], Orkut [19], Friendster [19], and Twitter [17]. Table 1 presents the statistics of the datasets used in our experiments. When simply enumerating all of the triangles, the output size can reach up to 2 TB, which is dozens of times larger than the GPU output buffer size used in our experiments.
We conduct all experiments on a single server equipped with two 16-core 3.0 GHz CPUs, 1 TB of memory, two PCI-E SSDs with a capacity of 6.4 TB (RAID 0), and one Nvidia A100 GPU with 80 GB of memory. The operating system is Ubuntu 20.04.02, the compiler is g++ 7.5.0, and the kernels are compiled with the CUDA 11.6 toolkit. We apply the LFRO, DRO-TL, and INFINEL methods to Algorithm 1 to perform a triangle listing query that enumerates all triangles into main memory. The query execution time is measured excluding the H2D copy time of the input data; for instance, for INFINEL, we measure the execution time of Lines 2 to 12 of Algorithm 2. Furthermore, the kernel execution time is measured by excluding the D2H copy time of the output data from the query execution time. When measuring only the kernel execution time, we denote the method with the suffix (K) (e.g., INFINEL(K)). All three methods operate with the same GPU parameters: 1280 thread blocks, 128 threads per thread block, and a warp size of 32. Unless otherwise specified, the GPU output buffer size used in the experiments is 16 GB, and the chunk size is set to a default of 1.2 KB.

Comparison with Other Methods

Figure 6 presents the query execution time of each method, and a detailed breakdown of the kernel execution time and D2H copy time can be found in Table 2. The INFINEL method outperforms the other methods on all datasets. In LFRO, which uses the count-write method, a single thread must perform a significant number of write operations as the data size increases. Consequently, the number of thread blocks operating within a single sub-task decreases, preventing the GPU from fully utilizing its performance. This increases the query execution time for datasets with large output sizes (e.g., RMAT26, RMAT27, Twitter). DRO-TL, which uses the sampling-write method, has many sub-tasks with numerous iterations due to skewness. Because it does not support kernel stops and restarts, it executes the kernel to completion in each iteration. This also leads to a sharp increase in the query execution time.
Figure 7(a) presents the kernel execution time of each method depending on the size of the GPU output buffer. As the output buffer size decreases, for LFRO(K), the number of thread blocks processing a single sub-task diminishes. Moreover, for DRO-TL(K), sub-tasks with a high degree of skew require more iterations, leading to numerous redundant operations. However, INFINEL(K) shows very little difference in performance, as it effectively utilizes the computational capabilities of the GPU by stopping and restarting the kernel using the approach proposed in Section 3.2. Furthermore, the number of iterations increases as the output buffer size decreases, but the additional operations incur only low time overhead, such as initializing the nextChunkID (Line 6 of Algorithm 2) and updating the WC and WS once for restart (Line 23 of Algorithm 3). This shows that INFINEL has minimal dependence on the size of the GPU output buffer.
Figure 7(b) illustrates the efficiency depending on the chunk allocation unit described in Section 3.1. As mentioned there, INFINEL allocates chunks at the thread unit. For comparison, we implemented a method referred to as INFINEL-warp, which allocates chunks at the warp unit. In INFINEL-warp, atomic operations are used for the threads within a warp to perform write operations to a single chunk. In Figure 7(b), although the difference is small, the performance is better when allocating chunks at the thread unit for all datasets, and the difference becomes more pronounced as the output size increases. This indicates that our chunk allocation per thread unit is more appropriate and more efficient.

Figure 8(a) and (b) demonstrate the performance depending on two major INFINEL parameters: the GPU output buffer size and the chunk size. The number of iterations and the buffer utilization for each experiment are shown in Table 3. Buffer utilization represents the ratio of the actual output data to the total output size copied to the main memory via D2H copy operations during a query execution; the remaining portion represents the output buffer wasted due to internal fragmentation. As buffer utilization approaches 100%, wasted memory is reduced and the number of iterations thus decreases.

Figure 8(a) illustrates the performance when varying the output buffer size for the same 1.2 KB chunk size. When the output buffer size is 1 GB, the ratio of the output buffer wasted due to internal fragmentation is 100% - 88.39% = 11.61%, which leads to a significant increase in the number of iterations and degrades performance. As the output buffer size increases, the internal fragmentation ratio within a single iteration gradually decreases, reducing the number of iterations. If the output buffer is 8 GB or larger, the number of iterations decreases steadily as the output buffer size increases, resulting in a slight reduction in execution time due to fewer kernel stops and restarts. Thus, once the output buffer size exceeds a certain threshold (e.g., 8 GB for the triangle listing query), performance is scarcely affected by the output buffer size.

Figure 8(b) shows the performance when varying the chunk size for the same 16 GB output buffer size. As the chunk size increases, the internal fragmentation within the GPU output buffer also increases, leading to inefficient utilization of the output buffer. Accordingly, when the chunk size is 76.8 KB, the number of iterations increases to 84, consuming a significant amount of time. On the other hand, when the chunk size is too small, there is also some performance degradation. For instance, when the chunk size is 0.3 KB, the output buffer is utilized up to 99.80%; however, compared to the 1.2 KB chunk size, there is a slight decrease in performance due to the atomic operations required for allocating approximately two billion chunks.

Table 4 demonstrates that the INFINEL method operates with the low time overhead analyzed in Section 5. To illustrate this, we compare it with a slightly modified Triangle Listing Overwrite in GPU memory kernel (TL-Overwrite) derived from Algorithm 1. TL-Overwrite modifies Line 12 of Algorithm 1 to simply overwrite the output at a specific location in GPU memory. It is an idealized approach that has no overhead for kernel stop/restart, write conflict management, or load balancing. In Table 4, the INFINEL method operates with a low time overhead of 8.6% for RMAT27 and 2.2% for Twitter compared to TL-Overwrite.

Characteristics of INFINEL
Our INFINEL method builds on three key ideas: (1) chunk allocation per thread and kernel context for stop and restart; (2) thread block segmentation; and (3) double buffering. Figure 9 shows the effectiveness of each key idea in terms of performance. The method that applies none of the ideas is compared with the LFRO method as a baseline. For the RMAT26 dataset, the impact of Idea 1 is significant, improving performance by more than 7.4 times. On top of Idea 1, applying Idea 2 improves performance by 4.7%, Idea 3 by 60.0%, and both together by 72.3%. Idea 2 only affects the latter part of the query execution, so its improvement is relatively small; however, greater gains can be expected for datasets with high skewness.

Related Work
To address the limited GPU memory issue when dealing with large-scale input graph data, several methods have been proposed [8, 11, 12, 14, 16, 22, 23, 25, 31, 32]. Some of these methods [12, 16] are based on partitioning a large input graph into smaller sub-graphs. GTS [16] employs asynchronous streaming of partitions to the GPU memory. Graphie [12] directly loads only to-be-processed (active) partitions into the GPU memory. Other methods [22, 31] focus on minimizing the copying of duplicate data into the GPU memory to reduce the data transfer cost. Subway [22] performs the query by loading only a sub-graph composed of active vertices and edges. LargeGraph [31] dynamically explores frequently used paths to maintain them in the GPU memory.
The triangle listing query, which we cover as a representative ULO query in this paper, has mostly been studied with distributed system methods due to its large-scale output [9, 21]. In PDTL [9], every machine within the distributed system stores a copy of the entire graph; each machine performs a sub-task, and the results are then aggregated on a master machine. PTE [21] divides the edges into multiple edge sets and loads them into a distributed system. Afterwards, it performs the query on all possible combinations of edge sets to check for the existence of triangles. To the best of our knowledge, there has been no prior work that processes the triangle listing query with a GPU-based method.

Conclusion
In this paper, we have proposed an efficient GPU-based ULO query processing method called INFINEL. It shows significantly enhanced performance by executing in one phase, without the traditional precompute-phase that statically divides the query into sub-tasks based only on the output size. Through theoretical and experimental analyses, we have shown that INFINEL operates with low time/space overhead. In particular, it maintains consistent performance without being significantly affected by the GPU output buffer size. Through various experiments, we demonstrated that INFINEL significantly outperforms the conventional two-phase methods. The proposed method can be applied not only to graph queries but also to other large-scale output queries with unpredictable output sizes.

Figure 3. Example of the CAT method.

Figure 4. Example of warp context for kernel stop and restart in Triangle Listing.

Figure 5. Examples with and without Thread Block Segmentation.

Figure 6. Performance comparison with other methods for the query execution time.

Figure 8. Performance varying the GPU output buffer size and the chunk size.

Figure 7. Performance varying GPU output buffer size and chunk allocation unit for the kernel execution time.


Table 1. Statistics of graph datasets used in the experiments.

Table 2. Kernel execution time and D2H copy time comparison with other methods.

Table 3. Number of iterations and buffer utilization varying GPU output buffer sizes and chunk sizes.

Table 4. Comparison with the triangle listing overwrite kernel.