ngAP: Non-blocking Large-scale Automata Processing on GPUs

Finite automata serve as compute kernels for various applications that require high throughput. However, despite the increasing compute power of GPUs, their potential in processing automata remains underutilized. In this work, we identify three major challenges that limit GPU throughput. 1) The available parallelism is insufficient, resulting in underutilized GPU threads. 2) Automata workloads involve significant redundant computations since a portion of states matches with repeated symbols. 3) The mapping between threads and states is switched dynamically, leading to poor data locality. Our key insight is that processing automata "one-symbol-at-a-time" serializes the execution, and thus needs to be revamped. To address these challenges, we propose Non-blocking Automata Processing, which allows parallel processing of different symbols in the input stream and also enables further optimizations: 1) We prefetch a portion of computations to increase the chances of processing multiple symbols simultaneously, thereby utilizing GPU threads better. 2) To reduce redundant computations, we store repeated computations in a memoization table, enabling us to substitute them with table lookups. 3) We privatize some computations to preserve the mapping between threads and states, thus improving data locality. Experimental results demonstrate that our approach outperforms the state-of-the-art GPU automata processing engine by an average of 7.9× and up to 901× across 20 applications.


Introduction
Finite automata are workhorses for many applications across various domains such as data analytics [10,31,49,54,76], machine learning [62], network intrusion detection [42,46,73], graph processing [48], and bioinformatics [18,47]. Widely used regular expression engines also use finite automata as the compute kernels to match patterns [2,7,68]. However, processing automata on compute-centric architectures is extremely challenging due to irregular accesses and data dependency. The former causes poor cache performance, while the latter serializes the execution, known as "embarrassingly sequential" [75]. Worse, in recent years, applications with automata as compute kernels have grown increasingly large [41]. For example, Snort [46] is a network intrusion detection and prevention system that matches a series of rules to identify malicious network activity from many packets (input streams), where each rule can be represented as an automaton. From 2014 to 2021, the number of rules in one ruleset (ET Pro) increased by 71% [6], while third-party users continue to contribute new rulesets.
To provide high throughput for such large-scale automata processing applications, many domain-specific accelerators have been proposed. Many of them address the irregular data movement with processing-in-memory architectures [22, 29, 50-53, 59, 60, 64]. While they can achieve orders of magnitude higher performance than von Neumann architectures [39], they lack programmability [13,21,23] and are not easily accessible to all users [40]. Integrating many domain-specific accelerators into a computer system for diverse workloads also leads to higher heterogeneity and complexity [63].
Due to their massive data parallelism, GPUs have become the most widely accepted general-purpose accelerator [37], and continue to scale faster than CPUs [61]. Processing large-scale automata applications on GPUs is therefore an attractive option if high throughput can be achieved. Most previous works have used two representations of automata: Deterministic Finite Automata (DFAs) and Non-deterministic Finite Automata (NFAs). Since NFAs allow multiple active states and are compact in size, they can better utilize the parallelism of GPUs [34]. While DFAs are inherently NFAs by definition [55], they only permit one state to be active at any time, thereby losing a potential source of parallelism.
Although processing automata on GPUs has been studied for many years, early works that use small automata of up to a few thousand states [20,63,77] perform suboptimally on large-scale automata applications. In contrast, recent work [34] has improved the throughput for large-scale automata processing on GPUs. However, our detailed characterization reveals that three challenges have not been systematically addressed:
• Challenge #1: GPU Threads Underutilization. Prior works program the GPU to process a single symbol from the input stream at a time. Consequently, numerous threads that do not have matched states finish earlier and wait on the cross-symbol barrier until all GPU threads finish processing the current symbol (Figure 1 (a)). Improving GPU thread utilization is challenging when the GPU is only allowed to process one symbol at a time, because that symbol alone offers insufficient parallelism.
• Challenge #2: Redundant Computations. Due to the nature of NFA workloads, the always-active states need to be processed for every symbol of the input stream. This leads to repeated computations for matching symbols with the next states, resulting in inefficient utilization of GPU cycles.
• Challenge #3: Poor Data Locality. In each iteration, GPU threads access items from a shared worklist and push the newly generated computations back. However, the dynamically changing mapping between threads and data impairs data locality.
To this end, we propose Non-blocking Automata Processing (ngAP for short) on GPUs. In contrast to prior works that process one symbol at a time, ngAP allows states matched with different symbols to be processed in parallel and hence improves thread utilization (Figure 1 (b)). More importantly, ngAP provides support for more optimizations to address these challenges systematically. First, to address Challenge #1, we prefetch the matches between symbols and always-active states to the worklist, enabling concurrent processing of more symbols. This results in improved thread utilization due to the availability of more parallelism. Second, we address Challenge #2 by designing a memoization mechanism for a portion of the states, in which a sequence of matching operations can be replaced by a table lookup. We extend the memoization table to support prefixes of patterns, further reducing redundant computations. Third, to address Challenge #3, a thread can selectively privatize the newly generated computations by preserving the data in registers instead of writing it back to the shared worklist, thus improving data locality. Overall, a detailed evaluation demonstrates the impact of each optimization and shows that the combined optimizations significantly improve the throughput of automata processing on GPUs.
To the best of our knowledge, this is the first work that reduces the scope of barrier synchronization, allowing many symbols to be processed in parallel. The non-blocking processing enables new opportunities for unexplored optimizations to address challenges in automata processing on GPUs. Our contributions can be summarized as follows:
• We characterize and identify three major challenges (thread underutilization, redundant computations, and poor data locality) that limit the throughput of automata processing on GPUs but remain unsolved in prior works.
• We propose a new approach, ngAP, that allows matches between different states and symbols to be processed in parallel, enabling us to propose three new optimizations that address the identified challenges synergistically.
• Our detailed evaluation demonstrates that the proposed approach and optimizations achieve a 7.9× geometric mean speedup across a wide range of 20 applications compared to the state-of-the-art GPU automata processing engine. Moreover, our approach outperforms HyperScan, the advanced CPU automata processing engine, by 11.9×.
[Figure 2 notation: ".+" matches any character one or more times; "(na|an)?" matches "na", "an", or nothing.]

Non-deterministic Finite Automata
A Non-deterministic Finite Automaton (NFA) is defined as a quintuple (Q, Σ, Q0, δ, F), where Q is a finite set of states, Σ is the alphabet defined by the NFA, Q0 is a set of starting states, and F is a set of reporting states. The transition function δ(S, a) defines the set of states to be activated when a set of states S matches input symbol a, where S ⊆ Q and a ∈ Σ. As in the ANML [1] or MNRL [12] format, this work focuses on homogeneous NFAs, where each state has valid incoming transitions for only one input symbol; such NFAs can be converted from classical NFA representations by Glushkov construction [25]. An NFA processing application consists of multiple homogeneous NFAs, where each NFA represents a pattern that the application searches for in the input streams.
Starting States. The matching process begins by activating the starting states. An NFA processing application can have two types of starting states, "all-input" states and "start-of-data" states, as defined in the ANML or MNRL format. The "all-input" starting states are used in applications that search for a pattern throughout the input stream regardless of the starting location of the pattern. For example, the pattern "/apple/" searches for all occurrences of the word "apple" in the input stream. The "all-input" starting states are always active, so we refer to them as always-active states in this work. In contrast, if the application is only interested in a pattern that appears at the beginning of the input stream (e.g., "/^apple/"), the "start-of-data" starting states are only active at the beginning of the input stream.
NFA Example. Figure 2 (a) illustrates an NFA that accepts the regular expression / * .+(na|an)?/, in which each node is a state and each edge is a state transition. Each state has a matchset of symbols in the alphabet that the state accepts (the symbol "*" in a matchset means the state can accept any symbol). The states drawn as hexagons (S0, S1) denote always-active starting states, while the state drawn as a double circle (S6) denotes a reporting state. Initially, only the starting states are active. In each iteration, a symbol of the input stream is matched against the active states. When the matchset of an active state contains the incoming symbol, the state becomes matched and then activates its neighbors. Figure 2 (b) illustrates the matching process of this NFA under the input stream "banana". For example, when the incoming symbol is "b", S0 and S1 are active in iteration 0. Since S1 accepts "b", S1 becomes matched and then activates its neighbors S2, S3, and S6 in the next iteration. As the starting states (S0, S1) are always active, they are also active in iteration 1. We will use this NFA as the running example in the following sections.

Processing Automata on GPUs
An NFA application has multiple levels of parallelism [34]. For instance, many input streams (e.g., network packets) and NFAs (e.g., network intrusion signatures) can be processed concurrently. Further, multiple states can be active at the same time. This abundant parallelism makes NFAs a good fit for GPUs.
Prior works that process NFAs on GPUs can be categorized into topology-driven and data-driven approaches. Topology-driven approaches [20,71,72] statically map GPU threads to NFA states or transitions. However, these approaches tend to underutilize GPU threads when states or transitions are idle, leading to lower throughput compared to data-driven approaches. In contrast, to alleviate the underutilization of GPUs, state-of-the-art approaches often adopt variants of data-driven approaches [34,35,77], which maintain double-buffered worklists only for the active or matched states. Next, we introduce the basic idea of data-driven approaches.
Illustrative Example. Figure 3 depicts the matching process in the data-driven approach, where two worklists containing matched states (the current worklist and the next worklist) are double-buffered. During each iteration, each thread is assigned to one or more states in the worklist (①). The matched states in the current worklist (S0, S1, S3, S4, S6) activate their neighbors (S1, S2, S3, S6, S5, S6). Then, the activated neighbors are matched against the incoming symbol ("a"). Only the neighbors that match are pushed to the next worklist. At the end of the step, the next worklist is assigned to the current worklist while the current worklist is emptied (④). In other words, every state that shares the same worklist must wait on a barrier until all the states in the current worklist are processed, making the process "blocking". This matching process consumes one symbol per iteration (i.e., "one-symbol-at-a-time") until all symbols in the input stream are processed. We refer to this approach as "BAP" (blocking automata processing).
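The blocking loop described above can be sketched on the CPU as follows. This is a minimal sequential illustration, not the implementation of any prior work: the three-state NFA for /ban/ and the names `MATCHSET`, `NEIGHBORS`, `ALWAYS_ACTIVE`, `REPORTING`, and `blocking_match` are hypothetical, and the per-thread work is simulated by an ordinary loop.

```python
# Hypothetical homogeneous NFA for the pattern /ban/ (not the NFA of Figure 2).
MATCHSET = {0: {"b"}, 1: {"a"}, 2: {"n"}}   # symbols each state accepts
NEIGHBORS = {0: [1], 1: [2], 2: []}         # states activated after a match
ALWAYS_ACTIVE = [0]                         # "all-input" starting states
REPORTING = {2}


def blocking_match(stream):
    """Process one symbol per iteration with double-buffered worklists."""
    current, reports = [], []
    for index, symbol in enumerate(stream):
        next_worklist = []
        # Neighbors of previously matched states, plus the always-active
        # states, are the candidates for the incoming symbol.
        candidates = [n for state in current for n in NEIGHBORS[state]]
        candidates += ALWAYS_ACTIVE
        for state in candidates:
            if symbol in MATCHSET[state] and state not in next_worklist:
                next_worklist.append(state)
                if state in REPORTING:
                    reports.append((index, state))
        # On a GPU, a barrier would sit here: no thread may start the next
        # symbol until every state in the current worklist is processed.
        current = next_worklist
    return reports
```

Calling `blocking_match("banban")` returns one `(symbol index, reporting state)` pair for each occurrence of the pattern.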

Challenge #1: GPU Threads Underutilization
This section shows factors that can affect GPU thread utilization. We first discuss the definition of thread utilization, followed by an example showing why the input can affect it.
Thread Utilization. We define thread utilization as the ratio of the busy thread count to the total number of threads. If there is no work available to be picked up by GPU threads, they are considered idle. To illustrate this concept, we use the example of matching the input stream "banana" with the NFA in Figure 2. Figure 8 compares the thread utilization of BAP and our optimizations, which will be discussed later. To process the fourth symbol "a", Figure 8 (a) indicates that four threads are mapped to the first four states in the worklist (t1), while the first thread works on the fifth state in the second round (t2). Consequently, in the second round, all threads except the first remain idle, resulting in poor thread utilization (i.e., thread utilization = 5/8). When processing the next symbol "n", at time t3, three states in the worklist are assigned to four threads, leaving one thread idle; thus the thread utilization is 3/4. Figure 4 depicts the measured thread utilization across the evaluated applications. We observed an average thread utilization of only 32.6% in blocking automata processing (shown in the "BAP" bar). Furthermore, a few applications have a thread utilization of less than 3%.
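The arithmetic behind the 5/8 and 3/4 figures can be reproduced with a short calculation. The sketch below assumes that a worklist is consumed in rounds of one state per thread, as in the example; the function name `thread_utilization` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def thread_utilization(worklist_len, num_threads):
    """Busy thread slots divided by all thread slots for one symbol."""
    if worklist_len == 0:
        return Fraction(0)
    # Each round maps at most one state to each thread.
    rounds = ceil(worklist_len / num_threads)
    return Fraction(worklist_len, rounds * num_threads)
```

With four threads, five states need two rounds and keep 5 of 8 thread slots busy, while three states need one round and keep 3 of 4 slots busy.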
Insufficient Parallelism. The states in the worklist run in parallel. However, the length of the worklist at any given iteration depends on the number of matched states, which may not be enough to utilize all the threads. This issue is also highlighted by previous research showing that only a small fraction of states are active most of the time [33,34,58]. Although it is possible to decrease the number of threads allocated to the worklist to improve utilization, doing so would result in serialized execution and poor performance. Thus, leveraging only the parallelism provided by one symbol is often insufficient, leading to thread underutilization.
Summary. We conclude that when the worklist does not have enough states, the GPU threads are underutilized. One key reason is that the threads can only work on states matched at the same iteration of the input stream, as a barrier must be performed at the end of each iteration. If threads could instead proceed without per-symbol barriers, as in Figure 1 (b), thread utilization could be improved significantly.

Challenge #2: Redundant Computations
This section first analyzes the redundant matches associated with always-active states, and then investigates the potential to reduce the work by not computing these matches.
Due to the nature of the matching process of automata, when matching a given string, a fixed set of active states will transition to another predetermined set of active states. For example, the set of active states {S2, S3} in the NFA of Figure 2 (a) will transition to {S6} when the input string is "an". When the combination of active states and string recurs during the matching process, computations become redundant. In practice, this happens often. Consider the matching process depicted in Figure 3 (③), where the always-active states are matched against the incoming symbols and the matched elements are pushed into the subsequent worklist. During each iteration, the identical set of always-active states is matched against the incoming symbol. Since each application has an alphabet Σ, there are only |Σ| combinations of input symbols and the always-active states. As a result, it is unnecessary to repeatedly match the always-active states against the incoming symbol to obtain the matching results. Instead, these matching results can be stored for future reference. It is important to note that only 2 out of 20 evaluated applications (APR and SM; see Table 2) do not have always-active states. Consequently, most applications face the challenge of redundant computations.

Potential to Convert the Matches to Table Lookups.
We investigate how much work can be eliminated by converting redundant matches to table lookups. Given that always-active states are matched with every symbol in the input stream, we associate these matches with individual symbols. To achieve this, we define a pattern to be a sequence of symbols beginning with a match against always-active states and extending until no state stays active. Hence, in Figure 5 (a), we show the patterns vertically and place them at the symbol at which they start. For example, searching for the NFA depicted in Figure 2 (a) from the first symbol in the input stream "banana" results in the matching process shown in Figure 5 (b), where "bana" is identified. When the matching results of a pattern prefix and the always-active states are stored, we can convert the matches to table lookups. Figure 5 (a) depicts the pattern prefixes of length 3 in shaded boxes. Using the total number of matched symbols as an indicator of the total amount of work, storing matching outcomes for all prefixes of length 3 removes a considerable portion of the computations (79%) in this illustration.
Results. We further measure the percentage of work that could be eliminated by storing the matching results of pattern prefixes and always-active states when varying the prefix length from 1 to 5. Figure 6 shows that eliminating matches associated with all prefixes of length 1 to 5 reduces the total work by 59.7% to 88.6% on average across the evaluated applications. Notably, this reduction is more pronounced for prefix lengths from 1 to 3 (from 59.7% to 81.9%) than for those from 3 to 5 (81.9% to 88.6%).
Summary. Our analysis highlights that transforming matches between pattern prefixes and always-active states into table lookups can substantially reduce the workload.

Challenge #3: Poor Data Locality
In this section, we show why prior data-driven designs lead to poor data locality.
Threads and States Mapping Switches Frequently. As discussed in Section 2.2, in blocking automata processing (BAP), threads store the matched neighbors of states in the current worklist to the next worklist. Then, the states in the next worklist are evenly assigned to the threads. This approach redistributes the newly generated work in each iteration and hence ensures load balance.
Analysis of Poor Locality. However, the remapping between threads and states happens frequently, leading to poor locality. In each iteration, a thread loads a different state and then fetches its neighbors. For example, suppose that in some iteration a thread is mapped to a state. To match a symbol, the thread loads the data structures of that state and of its neighbors from memory. Consider one of these neighbors. At this time, the neighbor must be in the registers of the thread's context and in the cache. However, the thread pushes the neighbor into the next worklist and is then mapped to another state in the next iteration, resulting in a loss of data locality. Similarly, as the neighbor may be mapped to a different thread in the next iteration, it is unclear whether it is still in the cache or has been evicted due to the limited cache size (in particular, the L1 cache is small in GPUs).
Summary. Overall, prior approaches do not exploit the temporal locality in the matching process, potentially resulting in suboptimal performance. Therefore, if we could preserve the mapping between threads and data for a longer duration, the data locality would be improved.

ngAP: Non-blocking Automata Processing
We propose ngAP, Non-blocking Automata Processing, which allows threads to work on different symbols in parallel. Our key insight is that the scope of synchronization can be reduced to a single state, eliminating the need to process the input stream one symbol at a time. This non-blocking approach leverages parallelism across different symbols of the input stream, making the matching process more efficient. This section first discusses the design of ngAP (§ 4.

Design of Non-blocking Automata Processing
Worklist of State-Index Pairs. The basic execution flow of ngAP is presented in Figure 7. In contrast to prior works, we use a single worklist instance for all iterations instead of double-buffered worklists. In BAP, all states in the worklist process the same symbol index, so synchronization across all states is needed before the next iteration. However, if each state in the worklist is aware of the index of the input symbol it needs to match with, the matching process can continue by matching the state's neighbors against the symbol at the corresponding index. Therefore, we represent each element of the worklist as a pair that contains the symbol index and the matched state (i.e., a state-index pair). Figure 7 (①) illustrates the worklist of state-index pairs. As a result, this worklist can hold pairs for any indices of the input stream, which can thus be processed in parallel.
State Transition. A sliding window (dashed rectangle) of the pairs is mapped to the threads in each iteration (②), managed by two pointers, head and tail, that delimit the range of elements in the worklist. In each iteration, the threads fetch a sliding window of state-index pairs from the worklist, which is done by updating the two pointers atomically. In Figure 7, the second thread is mapped to the pair (S1, 3). It loads the symbol at index 3 of the input stream ("a") and then matches "a" with S1's neighbors (③). As S2 accepts "a", the pair (S2, 4), where 4 is the next symbol index, will be pushed into the worklist (⑤).
Handling Always-Active States. Besides the states in the sliding window, the threads must process the always-active states. When the threads start to process a new index of the input stream that has not been processed (in this example, index 3), they match the always-active states with the symbol at the new index. The matched always-active states are paired with the next index, giving (S1, 4) (④), and will be pushed into the worklist (⑤). Finally, in the next iteration, the threads are mapped to new state-index pairs by updating the sliding window (⑥).
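The execution flow above can be sketched sequentially as follows. This is an illustration under stated assumptions rather than the ngAP kernel: the three-state NFA for /ban/, the names `nonblocking_match`, `push`, `frontier`, and `pushed`, and the duplicate filter are all hypothetical, and the threads of one sliding window are simulated by a plain loop instead of atomic head and tail updates.

```python
# Hypothetical homogeneous NFA for the pattern /ban/ (not the NFA of Figure 2).
MATCHSET = {0: {"b"}, 1: {"a"}, 2: {"n"}}
NEIGHBORS = {0: [1], 1: [2], 2: []}
ALWAYS_ACTIVE = [0]
REPORTING = {2}


def nonblocking_match(stream, window=4):
    worklist, head = [], 0   # single worklist of (state, next index) pairs
    pushed, reports = set(), []
    frontier = 0             # first index not yet matched by always-active states

    def push(state, next_index):
        # `state` matched the symbol at position next_index - 1.
        if (state, next_index) in pushed:
            return
        pushed.add((state, next_index))
        if state in REPORTING:
            reports.append((next_index - 1, state))
        worklist.append((state, next_index))

    def match_always_active(index):
        for state in ALWAYS_ACTIVE:
            if stream[index] in MATCHSET[state]:
                push(state, index + 1)

    while head < len(worklist) or frontier < len(stream):
        if head == len(worklist):
            # Worklist drained: seed it from the next unprocessed index.
            match_always_active(frontier)
            frontier += 1
            continue
        window_pairs = worklist[head:head + window]   # sliding window
        head += len(window_pairs)
        for state, index in window_pairs:   # one pair per thread on a GPU
            if index >= len(stream):
                continue
            while frontier <= index:        # a new index is being touched
                match_always_active(frontier)
                frontier += 1
            for neighbor in NEIGHBORS[state]:
                if stream[index] in MATCHSET[neighbor]:
                    push(neighbor, index + 1)
    return sorted(reports)
```

Because every pair carries its own symbol index, pairs for different indices can sit in the same window, and the reports do not depend on the window size.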
Summary. ngAP relaxes the synchronization scope from all states in the application to the states within a sliding window. Next, we discuss how ngAP supports further solutions to the three challenges.
... with a shift from 8/12 to 1. Although ngAP enables processing of different symbols simultaneously, we have noticed that execution is serial at the state level and that the worklist is scheduled in first-come-first-served order. This limits the parallelism improvement, since few indices coexist in the worklist. Figure 4 shows that ngAP alone only slightly improves thread utilization, increasing it from 32.6% to 33.2%.

Prefetching Always-Active States. To further address Challenge #1, we leverage the parallelism that arises from processing different symbols, and maximize the opportunity to process different symbols. As discussed in Section 2, always-active states must be matched with every symbol in the input stream. Therefore, we propose to prefetch matches between each symbol and the always-active states to the worklist in batches. At the start of execution, the threads load a batch of symbols and match them with the always-active states. For instance, with a batch size of 3, the symbols at indices 0 to 2 are matched with the always-active states, and the resulting state-index pairs are added to the worklist. Iteration t1 of Figure 8 (c) depicts the content of the worklist after loading the first batch. As a result, the symbols at indices 1, 2, 3, and 3 are matched with the neighbors of S1, S1, S0, and S1, respectively. In the following iteration t2, since index 3 has not yet been matched with the always-active states, a batch of matching results between symbols (indices 3 to 5) and always-active states is loaded to the worklist (omitted from this figure).
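The batch prefetch described above can be sketched as a small helper. The two always-active states and their matchsets are hypothetical stand-ins (they are not taken from Figure 2), as is the name `prefetch_batch`.

```python
# Hypothetical always-active states: state 0 accepts "a", state 1 accepts "b".
MATCHSET = {0: {"a"}, 1: {"b"}}
ALWAYS_ACTIVE = [0, 1]


def prefetch_batch(stream, start, batch_size):
    """Match a batch of symbols against the always-active states up front.

    Returns (state, next index) pairs that can be appended to the worklist,
    so that several symbol indices coexist in it.
    """
    pairs = []
    end = min(start + batch_size, len(stream))
    for index in range(start, end):
        for state in ALWAYS_ACTIVE:
            if stream[index] in MATCHSET[state]:
                pairs.append((state, index + 1))
    return pairs
```

For the input "banana" and a batch size of 3, the first call covers indices 0 to 2 and a second call with `start=3` covers indices 3 to 5.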
Thread Utilization Improvement. Compared with Figure 8 (b), more symbols are processed in each iteration with Prefetching Always-Active States, thereby improving thread utilization considerably. According to Figure 4, adding Prefetching Always-Active States (with a batch size of 256) to ngAP results in a substantial improvement in thread utilization across the evaluated applications, increasing it from an average of 33.2% to 83.6%. We will discuss how we determine the batch size in Section 5.
Summary. We propose ngAP to solve Challenge #1, enabling parallel processing of symbols. By prefetching the matches between always-active states and symbols to the worklist, we increase the number of indices coexisting in the worklist. As a result, our approach improves thread utilization significantly.

Optimization #2: Reducing Redundant Work via Prefix Memoization
To address Challenge #2, we introduce Prefix Memoization in this section. Prefix Memoization substitutes matches between always-active states and pattern prefixes with table lookups, thereby reducing redundant computations.

Memoization Table.
The memoization table is computed offline, since the number of possible combinations between always-active states and short prefixes of a given length is finite. To construct the memoization table, we first enumerate all prefixes of that length over the alphabet. Thus, if the alphabet size is |Σ|, the table needs |Σ| raised to the power of the prefix length entries to store all prefixes of that length. We match each prefix with the always-active states and their subsequent states, and store in each table entry the states that are matched when the prefix ends. Since reports could be generated within patterns of that length, we also store the generated reports in the table entries.
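The offline construction described above can be sketched as follows. The three-state NFA for /ban/ and the names `build_memo_table`, `MATCHSET`, `NEIGHBORS`, `ALWAYS_ACTIVE`, and `REPORTING` are hypothetical, and the entry layout (final matched states plus reports as offset-state pairs) is an assumption made for illustration. Only the first symbol of a prefix is matched against the always-active states, on the assumption that matches starting at later offsets belong to the lookups for those later indices.

```python
from itertools import product

# Hypothetical homogeneous NFA for the pattern /ban/ (not the NFA of Figure 2).
MATCHSET = {0: {"b"}, 1: {"a"}, 2: {"n"}}
NEIGHBORS = {0: [1], 1: [2], 2: []}
ALWAYS_ACTIVE = [0]
REPORTING = {2}


def build_memo_table(alphabet, prefix_len):
    """Map every prefix to (states matched at its end, reports inside it)."""
    table = {}
    for prefix in product(alphabet, repeat=prefix_len):
        # The first symbol is matched against the always-active states.
        matched = {s for s in ALWAYS_ACTIVE if prefix[0] in MATCHSET[s]}
        reports = [(0, s) for s in sorted(matched) if s in REPORTING]
        # Later symbols are matched against neighbors of the matched states.
        for offset in range(1, prefix_len):
            matched = {n for s in matched for n in NEIGHBORS[s]
                       if prefix[offset] in MATCHSET[n]}
            reports += [(offset, s) for s in sorted(matched) if s in REPORTING]
        table["".join(prefix)] = (tuple(sorted(matched)), tuple(reports))
    return table
```

For a three-symbol alphabet, the table has 9 entries at prefix length 2 and 27 entries at prefix length 3, which shows how quickly the size grows with the prefix length.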
Illustrative Example. Figure 9 illustrates an example of using a memoization table for pattern prefixes. In the current iteration, we need to match index 3 with the always-active states, and the memoization table records prefixes of length 2. Thus, we look up the table for the prefix composed of the symbols at indices 3 and 4 ("an", shaded in the figure), and then add the entry to the worklist, along with the other matched states, at the end of this iteration. If the entry includes reporting states, reports are generated at the corresponding symbol indices. By substituting computations with table lookups, we can eliminate a significant amount of redundant computation, since the same prefix can occur many times during execution.
Integrating with Prefetching Always-Active States.
We can build Prefix Memoization on top of Prefetching Always-Active States (Section 4.2): we prefetch the matching results from the memoization table in batches. Threads load each run of adjacent symbols of the prefix length from the input stream and look up the corresponding entries in the table. However, it is worth noting that the sizes of the matching results for prefixes and always-active states can vary significantly. As shown in Figure 10, the entries differ in size. Loading one entry per thread results in a significant load imbalance, which hurts thread utilization. To tackle this issue, we distribute the entries evenly across threads by loading them in stages. In the first stage, each thread loads the states of its entry only up to a length threshold. We set the threshold to the average length of the entries, as it achieves the best performance empirically. During the second stage, all threads in the thread block collectively load the remaining states of the longer entries, distributing them equally among themselves. In the final stage, since reports are infrequent in the prefixes, each thread loads only the reports from the entry it needs.
Memory and Time Consumption. Table 1 shows the memory usage and construction time for a memoization table containing prefixes of length 3, as indicated in the "W/O Comp" row. The memory space required for the table ranges from hundreds of MBs to several GBs across applications. We observe that most offline computations finish in several seconds; however, a few applications (e.g., CRP2 and HM) require minutes because they have more active states on average for the prefixes. Nevertheless, since these generated tables can serve all input streams, the computation time can be amortized.

Prefix Length Selection.
With prefix lengths ranging from 1 to 3, we observe increased throughput for all applications. Therefore, we use a prefix length of 3 for all applications except YARA and HM, which use a prefix length of 2 because a longer prefix exceeds the memory capacity of the GPUs used for evaluation. In this work, we limit the prefix length to 3 for two reasons: 1) In Section 3.2, we demonstrate that the benefits of increasing the prefix length to 4 or more are less significant. 2) Computing a memoization table with a prefix length of 4 or more takes at least hours, and storing it is very expensive.

Memoization Table Compression.
Since the table has entries of different sizes, we store it in a format similar to Compressed Sparse Rows (CSR, as shown in Figure 11 (a)). However, we found that many rows of the table, such as the rows for "ab" and "ba" in this figure, are empty and take up space in the row pointer array (RPtr) of the CSR format, even though no values are stored in them. To mitigate this issue, we compress the memoization table by indexing it with the prefix values. Figure 11 (b) illustrates our design. The initial two rows contain the values of the non-empty prefixes and their corresponding starting locations. Retrieving a prefix's entry requires a thread to perform a binary search on the first row (RIdx) and then load the row pointer to access the entry. Table 1 illustrates the impact of compression on the table. In general, compression substantially reduces the storage space required for memoization tables, and takes a similar amount of time to prepare them. For instance, certain applications (e.g., Bro, Ran5, PEN, CAV) may only need up to a few MBs, as opposed to several hundred MBs without compression. We observe that the throughput is also improved by 5.3% on average due to the reduced memory footprint.
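The compressed layout and its lookup can be sketched as follows. The function names `compress` and `lookup`, the use of prefix strings as sort keys, and the extra end-of-array sentinel in the row pointer array are assumptions made for this illustration; only the idea of binary-searching an index of non-empty prefixes (RIdx) and then following a row pointer (RPtr) comes from the text.

```python
from bisect import bisect_left


def compress(table):
    """Keep only non-empty rows: sorted prefix keys, row pointers, values."""
    ridx, rptr, values = [], [], []
    for prefix in sorted(table):
        if table[prefix]:
            ridx.append(prefix)          # value of a non-empty prefix
            rptr.append(len(values))     # where its entry starts
            values.extend(table[prefix])
    rptr.append(len(values))             # sentinel marking the end (assumed)
    return ridx, rptr, values


def lookup(compressed, prefix):
    """Binary-search RIdx, then slice the values through RPtr."""
    ridx, rptr, values = compressed
    pos = bisect_left(ridx, prefix)
    if pos == len(ridx) or ridx[pos] != prefix:
        return []                        # empty row: nothing was stored
    return values[rptr[pos]:rptr[pos + 1]]
```

Empty rows cost no storage in this layout, at the price of a binary search per lookup instead of a direct row-pointer access.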

Optimization #3: Improving Data Locality via Work Privatization
In this section, we propose Work Privatization to improve the data locality. The key reason behind Challenge #3 is that each thread pushes the state-index pairs it produces to the shared worklist. However, the consumer thread of the state-index pairs is dynamically mapped. This loses data locality and requires more writes and loads on the GPU memory.

Table 3. Evaluated schemes. ngAP's parameters: sliding window, batch size, prefix length, and divergence threshold.
Scheme | Description
HyperScan [68] | State-of-the-art automata processing engine on CPUs.
NFA-CG [77] | Calculating compatible groups and mapping to threads.
AsyncAP [35] | Matching patterns asynchronously in the input stream.
GPU-NFA [34] | Flexibly scheduling hot and cold states to threads.
ngAP-default | Our proposed design with default parameters (sliding window = 25600, batch size = adaptive, ...).

Controlling Warp Divergence and Work Serialization. First, a thread must decide whether it privatizes its computations or terminates the privatization by pushing the results back to the worklist. However, on a GPU, threads within the same warp must run in lockstep [24]. If a few threads within a warp choose to privatize their computations while others do not or cannot due to mismatches, warp divergence impairs thread utilization. To address this, each warp calculates an active thread ratio for the next step, depending on whether its threads have more work to do. For example, assuming the warp size is 4, in step 1 of Figure 12, only 3 of the 4 threads have more computations, thus the active thread ratio is 3/4. To control the warp divergence, we set up a divergence threshold: when the active thread ratio is at or below the threshold, the threads terminate privatization, as thread utilization would be low if it continued. If 3/4 is below the threshold, all threads within the same warp terminate privatization; otherwise, the threads can step forward. Essentially, a threshold of 100% indicates that we disable Work Privatization, while a threshold of 0% indicates that a thread always privatizes its work. Second, a thread needs to limit the number of steps for which it privatizes computations, because privatizing for many steps accumulates the work of a thread, serializing the computations. To control the serialization, we limit the maximum number of steps. We empirically observe that when the maximum number of steps is greater than 1, the throughput degrades drastically due to work serialization; thus, we set it to 1 in this scheme while leaving the divergence threshold as a tunable parameter.
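The warp-level decision rule for Work Privatization can be illustrated with a small sketch. It is a plain-Python model of the rule as described in the text, not the GPU kernel: the function names, the boolean-list representation of a warp, and the `steps_done` counter are assumptions introduced here for illustration.

```python
def active_thread_ratio(has_more_work):
    """Fraction of threads in a warp that still have work for the next step."""
    return sum(has_more_work) / len(has_more_work)


def warp_continues_privatization(has_more_work, threshold, steps_done, max_steps=1):
    """Decide for the whole warp whether to keep privatizing.

    The warp stops when the step limit is reached or when the active
    thread ratio is at or below the divergence threshold.
    """
    if steps_done >= max_steps:
        return False  # limit serialization (the text sets the limit to 1)
    return active_thread_ratio(has_more_work) > threshold


# Warp of size 4 in which three threads have more work (ratio 3/4).
warp = [True, True, True, False]
```

With this rule, a threshold of 1.0 never lets the warp continue (privatization disabled), and a threshold of 0.0 lets it continue whenever at least one thread still has work.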
Discussion. Although our proposed optimizations tackle different challenges, they share a common requirement: the worklist must be able to handle various indices of symbols. Consequently, these optimizations cannot be applied to prior work, and they must be built on top of ngAP. We will show in § 6 that these optimizations work synergistically to address the three challenges effectively.

Evaluation Methodology
System Configuration. We perform most experiments on a computer with an NVIDIA RTX 3090 (Ampere architecture, 24 GB memory, 6 MB L2 cache, and 82 SMs). The system runs Linux on a 12-core Intel Xeon 4214R CPU with 128 GB memory. We use NVIDIA Nsight Compute to profile the kernels. All CUDA/C++ programs were compiled with the -O3 flag using GCC 9.5 and CUDA 12.0. We use an NVIDIA Tesla V100 (Volta architecture, 32 GB memory, 6 MB L2 cache, and 80 SMs) to evaluate the performance sensitivity. The throughput is defined as the number of input symbols processed per second. The execution time is averaged across three repeated runs, excluding automata loading and data preparation, as they can be amortized with long input streams.
Figure 13. Throughput results normalized to GPU-NFA.
Benchmarks. We evaluate a wide range of applications from AutomataZoo [67], ANMLZoo [65], and Regex [17]. Only two were selected from ANMLZoo, as most of its applications were updated in AutomataZoo. Table 2 shows the characteristics of the evaluated applications. To simulate a real-world scenario, we take the 1 MB input stream provided with each application in the benchmark suites and generate 600 copies of it as the input. To reduce the duration of experiments, for the applications that cannot finish within an hour, we use a throughput of 0.16 MB/s (i.e., 600 MB / 3,600 s) to estimate their upper bound. To validate our implementations, we use a serial version of automata processing on the CPU as a reference and verify that the generated reports, which include the reporting states and corresponding symbol indices, are identical.
Evaluated Schemes. Table 3 summarizes the evaluated schemes. For GPU work, we evaluate three prior data-driven designs: NFA-CG [77], GPU-NFA [34], and AsyncAP [35]. NFA-CG and GPU-NFA are variants of BAP. AsyncAP increases the parallelism by starting matching from different locations in the input stream. HyperScan [68] is the state-of-the-art CPU automata processing engine, combining many optimizations. We use MNCaRT [11] and VASim [66] to convert the automata for HyperScan. ngAP-default and ngAP-best are our schemes with different parameter setups.
Parameter Setup. Table 3 also shows the tunable parameters: sliding window size (§ 4.1), batch size (§ 4.2), and divergence threshold (§ 4.4). ngAP-default uses a set of parameters that work well for all applications, while ngAP-best tunes the best parameter combination for each application. We propose an adaptive scheme for the batch size, as we observe that a larger value performs better but may overflow the worklist. When Prefix Memoization and Prefetching Always-Active States are enabled, we estimate the number of state-index pairs that each loaded prefix adds to the worklist as the size of the largest entry (the one with the maximum number of states) among the prefixes to be loaded. Thus, when a batch is loaded, the batch size is set to the remaining space in the worklist divided by this estimate.
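The adaptive batch size rule amounts to one division. The sketch below is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, and the use of integer (floor) division and the clamping to zero are choices made here, since the text does not specify rounding.

```python
def adaptive_batch_size(worklist_capacity, worklist_used, max_entry_states):
    """Number of prefixes that can be loaded without overflowing the worklist.

    max_entry_states is the worst-case number of state-index pairs that a
    single loaded prefix may add (the largest memoization-table entry).
    """
    remaining = worklist_capacity - worklist_used
    if remaining <= 0:
        return 0  # no room left: load nothing in this round
    return remaining // max(1, max_entry_states)
```

For example, with a hypothetical capacity of 1000 slots, 400 of them in use, and a largest entry of 25 states, the sketch loads at most 24 prefixes in the next batch.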
6 Experimental Results

Throughput
The normalized throughput of the evaluated applications is shown in Figure 13. It should be noted that GPU-NFA and NFA-CG transform NFAs to restrict each node from having an out-degree greater than 4. However, a few NFAs cannot be transformed, so we excluded 2 NFAs from CAV and 5 NFAs from Snort and refer to the reduced applications as CAV' and Snort', respectively. In contrast, our design utilizes a compressed sparse rows (CSR) format for the NFA topology, eliminating the need for NFA transformation and avoiding this limitation. The absolute throughput of the full applications is in Table 4.
Application Analysis. Compared to GPU-NFA, ngAP-default and ngAP-best achieve maximal speedups of 486.6× and 901.9×, respectively, for CAV'. Eight out of 15 applications experience more than 10× speedup. Our schemes improve performance in these applications for two reasons: 1) Some applications have imbalanced worklists, resulting in low thread utilization. Applications with few states, such as DS, BRO, and EM, lack parallelism to fully utilize threads when processing one symbol at a time. 2) These applications have short patterns, allowing the memoization table to eliminate most computations during prefix lookup. Our schemes synergistically address thread utilization and redundant work problems, resulting in significant speedup.
Certain applications see no speedup due to the absence of always-active states (SM) or the memoization table covering only a small percentage of work (LV, RF, HM). Our schemes are slower than AsyncAP for a few applications (CRP2 and HM) because AsyncAP has less overhead when an application's patterns are balanced, although it performs poorly in imbalanced scenarios (e.g., PEN). Overall, our scheme achieves significant speedup for most evaluated applications.
Comparison with HyperScan. Table 4 shows the HyperScan throughput, which is higher than that of prior GPU work on some applications. However, overall, ngAP-best outperforms HyperScan, except for YARA. Notably, in the intrusion detection application Snort that HyperScan focuses on, ngAP-best

Breakdown Analysis
To understand the impact of the proposed optimizations for ngAP, we evaluate them incrementally on top of a baseline blocking automata processing on GPUs ("BAP", discussed in § 2). Figure 14 shows the throughput normalized to BAP for each optimization, added on top of the previous one (as listed in Table 5). Our observations are as follows: 1) On average, ngAP slightly decreases throughput by 10% compared to BAP due to increased overhead from maintaining the worklist of state-index pairs. 2) ngAP+O1 achieves a 1.9× increase in throughput compared to BAP due to significant improvement in thread utilization from Prefetching Always-Active States. 3) Throughput is significantly improved by 7× with ngAP+O2, indicating that Prefix Memoization effectively eliminates redundant computations. 4) ngAP+O3 improves throughput to 7.7× compared to BAP, demonstrating the effectiveness of Work Privatization in improving the data locality.
Profiling Results. We profiled cache and memory statistics to understand the reasons behind the improved throughput. Figure 15 (a) shows that all optimizations increase the L1 cache hit rate, with ngAP+O1 showing the most significant improvement due to Prefetching Always-Active States. This

Sensitivity Studies
To investigate the sensitivity to each parameter, we fix the parameters as listed in Table 5 and then vary one parameter at a time. Then, we study whether our approaches work on other GPU architectures.
Sensitivity to Sliding Window Size. As discussed in Section 4.1, the sliding window size can affect thread utilization. Figure 16 (a) demonstrates the performance sensitivity to the sliding window size. We observe that the throughput is generally better, but not very sensitive, when the window size is at least 5120 compared with a window size of 256, because a larger sliding window amortizes the overhead of updating the sliding window pointers.
Sensitivity to Batch Size. Figure 16 (b) shows the sensitivity of throughput to the batch size. Larger batch sizes generally result in better performance due to improved parallelism, but if the batch size is too large, it may exceed the GPU memory capacity (e.g., batch sizes of 512 and 1024 in LV). We observe that the adaptive batch size approach (discussed in § 5), which considers the remaining memory and the entries to be loaded, often leads to acceptable performance.
Sensitivity to Divergence Threshold. Figure 16 (c) shows the sensitivity results for different divergence thresholds. We make the following observations: 1) Work Privatization (threshold < 100%) generally benefits most applications, but can have a negative impact on a few. 2) The benefits of Work Privatization depend on the specific application, as the trade-off between work serialization and data locality improvement can vary. Enabling Work Privatization without considering divergence (threshold = 0%) can cause a performance loss, as seen in the case of LV, which achieves only 60% of the throughput obtained when it is disabled (threshold = 100%). Therefore, the divergence threshold is a tunable parameter for each application.
Sensitivity to Volta architecture. We conducted the experiment on an NVIDIA V100 GPU, where ngAP-default uses the same parameters as those used for the 3090 GPU, and ngAP-best is tuned for the V100. The results are presented in Figure 17. Our observations indicate that ngAP-default and ngAP-best achieve a speedup of 3.4× and 6.0× over GPU-NFA, respectively. The lower speedup of ngAP-default suggests that the default parameters may not be portable enough for all GPU architectures. Nevertheless, ngAP still achieves a significant improvement on the V100 compared to prior works.
Discussion on GPU Architectures. To understand why GPU architectures result in different performance profiles for ngAP, which only requires integer operations, we first analyze their performance under the integer roofline model of the 3090 and V100. As shown in Figure 18, NFA-CG, AsyncAP, and GPU-NFA are compute-bound, suggesting that their performance may benefit from increased integer capabilities. Figure 19 illustrates the speedup that each evaluated scheme could achieve on the 3090 over the V100. We observe that the performance of NFA-CG, AsyncAP, and GPU-NFA on the 3090 is 1.1×, 1.2×, and 1.1× that on the V100, respectively. This speedup can be attributed to the compute-bound nature of these schemes, as depicted in Figure 18, which aligns with the 3090's 1.1× peak integer performance compared to the V100 (13.1 TIOPS vs. 12.0 TIOPS [3,4]).
In contrast, ngAP-default and ngAP-best are likely to be latency-bound (Figure 18), as they exhibit lower compute and memory efficiency. They achieve performance improvements of 1.8× and 1.5× on the 3090, respectively, which exceed the difference in peak integer performance between the 3090 and V100 GPUs (Figure 19). This can be attributed to two key factors. First, the 3090 has more registers per thread [3,4], enabling more complex GPU kernels to have higher occupancy and hence hiding latency by fine-grained multithreading [5]. This is supported by the fact that both ngAP-best and ngAP-default reach an average achieved occupancy of 0.62 on the 3090, compared to 0.48 and 0.36, respectively, on the V100. Second, the 3090 operates at a higher SM frequency (1.39 GHz vs. 1.25 GHz), further reducing the latency, given that cache accesses require a similar number of cycles [8,26]. In summary, to improve the performance of ngAP, GPU architectures could employ techniques that reduce latency or improve the concurrency to hide latency, while other schemes are more sensitive to compute resources for integer operations.
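For readers unfamiliar with the roofline model used in this discussion, the bound it places on a kernel is the minimum of the peak compute rate and the product of arithmetic intensity and memory bandwidth. The sketch below evaluates that bound. The peak integer rates are the ones quoted in the text; the function name and the bandwidth and intensity values are illustrative assumptions, not measurements from this work.

```python
def roofline_bound(intensity, peak_tiops, bandwidth_tb_per_s):
    """Attainable tera integer ops/s for a kernel under the roofline model.

    intensity: integer operations per byte moved from memory.
    """
    return min(peak_tiops, intensity * bandwidth_tb_per_s)


PEAK_TIOPS = {"3090": 13.1, "V100": 12.0}  # peak integer rates quoted in the text

# Assumed memory bandwidths in TB/s, chosen only to make the example concrete.
ASSUMED_BW = {"3090": 0.9, "V100": 0.9}

# A compute-bound kernel (high intensity) is capped by the peak rate, so its
# best-case speedup between the two GPUs equals the ratio of the peaks.
compute_bound_speedup = (
    roofline_bound(100.0, PEAK_TIOPS["3090"], ASSUMED_BW["3090"])
    / roofline_bound(100.0, PEAK_TIOPS["V100"], ASSUMED_BW["V100"])
)
```

Under these assumptions the compute-bound ratio rounds to 1.1, the same as the ratio of the quoted peak rates. A latency-bound kernel sits below both roofs, so this bound alone does not predict its speedup.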

Latency
We measure the latency of ngAP by evaluating its throughput in processing a 1 MB input stream. To coordinate all thread blocks to process the NFAs, ngAP partitions all connected components (CCs; the total number of CCs for each evaluated application is shown in Table 2) into a fixed number of groups and assigns each group to a thread block with its own worklist. We use 200 as the default number of groups in ngAP-default, while ngAP-best tunes the number of groups for lower latency. Other parameters remain unchanged from those specified in Section 6. Figure 20 shows the speedup over GPU-NFA. We observe that ngAP-best results in better performance compared to prior GPU works by 10.9×, 3.2×, and 12.0×, respectively.
Table 6 shows the absolute throughput with one input stream. We observe that HyperScan has an advantage, especially for the applications with fewer NFAs, resulting in significantly shorter latency. In this scenario, the level of input stream parallelism is insufficient to utilize all the compute resources of GPUs. Therefore, we suggest enhancing ngAP by incorporating speculation or enumeration schemes (discussed in Section 7), allowing ngAP to function as if operating in a multi-input scenario. This augmentation would leverage all available GPU resources, resulting in latency improvements. We conclude that while ngAP primarily focuses on enhancing the throughput of large-scale automata applications, it effectively reduces latency compared to prior GPU schemes.
Figure 20. Latency results normalized to GPU-NFA. We define throughput (in bytes per second) for a single input stream as the measure of latency.

Related Work
This section provides an overview of related work on automata processing, specifically focusing on the discussed techniques.
Improving Parallelism. Previous works have attempted to improve parallelism by enumerating, speculating, or using hybrid approaches to partition the input stream into segments and allow them to run in parallel. Enumeration approaches [30,38] enumerate all possible active states at the starting locations of input segments, but require significantly more work. Speculation approaches [36, 43-45, 69, 74, 75] speculate one active state at an input segment to reduce the work, but must recompute misspeculated input segments. Speculative enumeration [27,70] is a hybrid of these two approaches. All these approaches focus on DFAs, as only one state is active at any iteration, making it easier to speculate or enumerate. AsyncAP [35] focuses on NFAs by starting to process patterns from always-active states in parallel but performs poorly under imbalanced workloads. In contrast, Prefetching Always-Active States works on top of ngAP to improve parallelism and thread utilization. Unlike previous approaches, our optimization does not require enumeration or misspeculation handling, resulting in lower overhead. Additionally, Prefetching Always-Active States improves data locality due to optimized worklist scheduling (§ 6).
Reducing Computations. String algorithms [9,28,32] reduce redundant computations by keeping a memoization table. However, these approaches cannot be applied to automata processing. HyperScan [68] transforms a subset of NFAs into string matching, resulting in reduced time complexity. Other works use multi-striding processing [14,16,19] to process a stride at a time, reducing the total computations on symbols. However, multi-striding approaches do not scale well for large-scale automata problems, as the number of state transitions grows exponentially after transformation. In contrast, Prefix Memoization memoizes only a small portion of transitions into a look-up table, effectively reducing computations.
Data Movement and Locality. Many accelerators for automata processing reduce data movement by in-memory processing [22, 29, 50-53, 59, 60, 64]. Other than accelerators, transformation approaches [15,56,57] construct new representations for automata that have better locality and memory efficiency, which are complementary to our work. GPU-NFA [34] loads the topology information of always-active states into GPU registers, thereby reducing data movement. In contrast, our work optimizes data locality through new schedules of the worklist with Prefetching Always-Active States and Work Privatization, which are general to any automata.

Conclusions
We present Non-blocking Automata Processing (ngAP), which allows multiple symbols to be processed concurrently for automata processing on GPUs and further broadens the design space in automata processing on GPUs. On top of ngAP, this work proposes optimizations that address three identified challenges: poor thread utilization, redundant computations, and poor data locality. Evaluation of the synergistic approach demonstrates that our work significantly outperforms state-of-the-art GPU automata processing engines across a wide range of emerging applications.

Figure 1. Comparison of processing automata using the prior "blocking" approach (a) vs. the "non-blocking" approach (b) proposed in this work: (a) The states matched with two adjacent symbols are denoted by ○ and ☆. A barrier is required between them; (b) Our proposal allows matching states with different symbols in parallel, making the process "non-blocking".

Figure 3. Illustrating the blocking automata processing (BAP) on GPUs corresponding to iteration 3 of Figure 2.

Figure 5. Measuring the potential of omitting the computations of pattern prefixes. (a) We measure the ratio of the number of symbols in pattern prefixes of length 3 to the total number of matched symbols in all patterns ((3+3+3+3+2+1)/(4+5+4+3+2+1) = 0.79). A larger ratio indicates that more work is associated with always-active states and the subsequent steps. (b) Illustrating how the first pattern "bana" is identified.
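The ratio in the caption of Figure 5 (a) can be reproduced with a short calculation. The sketch assumes one reading of the caption: each pattern contributes its matched length to the denominator and at most the prefix length to the numerator. The function name and the list representation are introduced here for illustration only.

```python
def prefix_work_ratio(matched_lengths, prefix_length):
    """Share of matched symbols that fall inside prefixes of the given length."""
    in_prefix = sum(min(n, prefix_length) for n in matched_lengths)
    return in_prefix / sum(matched_lengths)


# Matched lengths taken from the denominator in the caption: 4+5+4+3+2+1.
ratio = prefix_work_ratio([4, 5, 4, 3, 2, 1], prefix_length=3)
```

For these lengths the numerator is 15 and the denominator is 19, which rounds to the 0.79 reported in the caption.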

Figure 6. The percentage of work that could be eliminated by not computing the prefixes associated with the always-active states, with the prefix length varied from one to five. The applications without always-active states are excluded.

Figure 7. The basic execution flow of ngAP.

4.2 Optimization #1: Enhancing Thread Utilization via Prefetching Always-Active States
In this section, we first characterize how ngAP improves thread utilization, then propose a new optimization that works with ngAP to address Challenge #1.
Thread Utilization Revisited. In comparison to Figure 8 (a), where the thread utilization is only 8/12, ngAP exhibits an improved thread utilization, as depicted in Figure 8 (b). Prefetching Always-Active States boosts parallelism from multiple symbols.

Figure 8. Techniques to improve thread utilization compared to BAP. (a) BAP underutilizes the threads as it processes one symbol at a time. (b) ngAP improves thread utilization by exploiting parallelism arising from processing multiple symbols simultaneously. (c) Prefetching Always-Active States further improves thread utilization by enabling more opportunities for the worklist to have different symbol indices.

Figure 10. Balancing workload when loading results for multiple prefixes from the memoization table.

Figure 11. The comparison between uncompressed and compressed memoization table formats.
them in three stages. In the first stage, each thread loads the initial few states in each entry. The dashed line in Figure 10 represents a threshold, and all entries that exceed this threshold are fetched in the second stage. We set the threshold to the average length of entries, as it achieves the best performance empirically. During the second stage, all threads in the thread block collectively load the remaining states in the longer entries, distributing them equally among themselves. In the final stage, since reports are infrequent in the prefixes, each thread loads only the reports from the entry it needs.
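The first two stages of this staged loading can be modeled with a small sketch that counts how many states each thread loads. It is an illustration under stated assumptions rather than the GPU code: the function name, the one-entry-per-thread mapping, the floor of the average as the threshold, and the way leftover states are dealt out are all choices made here. The third stage (loading reports) is omitted.

```python
def balanced_load_counts(entry_lengths):
    """Per-thread load counts for stage 1 and stage 2 of the staged loading.

    entry_lengths[t] is the number of states in the entry needed by thread t.
    """
    n = len(entry_lengths)
    threshold = sum(entry_lengths) // n  # average entry length (floor, assumed)
    # Stage 1: each thread loads its own entry up to the threshold.
    stage1 = [min(length, threshold) for length in entry_lengths]
    # Stage 2: the states beyond the threshold are pooled and shared evenly.
    leftover = sum(length - threshold for length in entry_lengths if length > threshold)
    base, extra = divmod(leftover, n)
    stage2 = [base + (1 if t < extra else 0) for t in range(n)]
    return threshold, stage1, stage2


# Hypothetical entry lengths for a block of four threads.
threshold, stage1, stage2 = balanced_load_counts([1, 2, 9, 4])
```

In this hypothetical case the busiest thread loads 5 states in total instead of the 9 it would load if every thread fetched its own entry alone.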
Figure 12 shows how Work Privatization works. The threads extend to the neighbors of states that are mapped to them, and then check whether the neighbors match with the symbols in the corresponding indices (Step 1). With Work Privatization, each thread can decide whether it privatizes the extended neighbors without writing them back to the shared worklist. As shown in Step 2, the thread could further compute ahead without interacting with the shared worklist. This improves temporal locality at the expense of parallelism, as the extended states are processed sequentially.

Figure 15. Average profiling results for evaluated applications.

Figure 16. Performance sensitivity to parameter tuning.

Figure 18. Roofline model focusing on Integer Operations Per Second (IOPS) for the RTX 3090 and V100 GPUs. We quantify the performance and arithmetic intensity of various schemes by calculating their geometric means across 14 different automata applications. Applications that do not run with any scheme or with the ncu profiler are excluded.

Figure 9. Memoization table for pattern prefixes of length 2.

Table 1. The memory consumption and construction time of the memoization table with prefixes of length 3, with and without compression. Applications without always-active states are excluded.

Table 2. Characteristics of evaluated applications. CC stands for "connected component".

Figure 14. Throughput results normalized to BAP for breakdown analysis. To demonstrate the impact of the proposed optimizations, we evaluate them individually by adding them one at a time.

Table 5. Breaking down ngAP to evaluate the effect of each optimization.

Table 6. Absolute throughput (in MB/s) with one input stream for evaluated applications. U: Unsupported applications.