Paguroidea: Fused Parser Generator with Transparent Semantic Actions

Parser generators have long been a savior for programmers, liberating them from the daunting task of crafting correct and maintainable parsers by hand. Yet this much-needed simplicity often comes at the expense of efficiency. We present Paguroidea, a parser generator that harnesses lexer-parser fusion to create parsers with user-friendly grammar definitions and performance that rivals specialized parsers. Building upon the foundations of the Flap parser, our work introduces a series of extensions. One of our key contributions is a novel approach to the normalization method: by encoding reduction actions directly into the Deterministic Greibach Normal Form (DGNF), we give parser generators flexibility in manipulating semantic actions. This approach empowers developers to customize their parser generators to their specific needs while maintaining semantic correctness. Furthermore, we formulate the execution of the parser in substructural logic, providing an elegant way to prove the correctness of the amended normalization procedure. In this exposition, we offer a glimpse into efficient, user-friendly, and provably correct parser generation.


Introduction
Discussions surrounding lexical and syntactical analysis often delineate them into two discrete phases [1,2,21]. This established paradigm involves the initial processing of raw input by a lexer, resulting in the generation of token streams. Subsequently, the parser engages with these tokens, orchestrating semantic actions to produce the anticipated syntactic structures.
However, it is increasingly evident that the demarcation between lexers and parsers is growing less distinct, primarily driven by the pursuit of more agile and efficient parsing techniques capable of handling intricate inputs. Prominent exemplars of this shift can be found in Parsing Expression Grammars (PEGs) [4][5][6], which eschew the convention of employing different metalanguages to define tokens and production rules. This scanner-less parsing paradigm is also used in Generalized LR (GLR) [22] and Earley parsers [20] when dealing with tricky grammar definitions. Moreover, real-world applications such as Clang may require frequent interactions between lexers and parsers, instead of running them separately [13].
Recent work has introduced a novel algorithm called "Flap", which leverages Deterministic Greibach Normal Form (DGNF) to seamlessly integrate the lexer and parser components [26]. The Flap generator first subjects the context-free grammar to a rigorous type-checking phase, ensuring that the grammar is free of left recursion and unambiguous under single-token lookahead [11]. The production rules that pass this type-checking are normalized into DGNF. In the normalized grammar, a named rule can be associated with one or more normalized production rules, each of which either accepts the empty string or commences with a terminal followed by zero or more non-terminals; notably, these rules are distinguishable solely by their initial terminal symbols. Once the DGNF representation is obtained, Flap generates a parser routine for each named rule. Each routine invokes a "local" derivative-based lexer [16], constructed specifically for the limited set of terminals that appear at the heads of the corresponding production rules, and continues with the matched rule accordingly.
Flap is implemented in MetaOCaml [8,9]. Utilizing multi-stage programming, Flap allows users to supply their grammar definitions as parser combinators [7,10,23]. Semantic actions are maintained alongside the parser functions and composed during the normalization process.
In this study, we introduce a novel approach to managing semantic actions during the normalization process. Instead of treating the semantic actions as opaque functions, we propose the use of "reduction symbols." These symbols explicitly delineate positions and demarcate input ranges pertinent to semantic actions. Such a modification not only enhances the generality of the algorithm for generating parsers across diverse target languages but also offers expanded opportunities for optimization, leveraging the transparent representation of the semantic actions. Furthermore, we draw upon an abstract formulation of parsers in substructural logic [3] to validate the accurate preservation of semantic actions.

The Flap Generator
Flap operates on context-free grammars (CFGs) and normalizes them into DGNF [26]. To ensure that these procedures terminate with accurate results, Flap utilizes the type system proposed in [11] to confirm the absence of ambiguity in the input grammar rules under a single-token lookahead. The type system associates each term in the CFG with its nullability, "first set", and "follow set", which allows the checker to rule out syntax definitions with unexpected properties.
More specifically, the type-checking algorithm ensures that well-typed grammar rules exhibit the following properties:
1. Sequential Uniqueness Preservation: Let g₁ and g₂ represent two grammar rules, and denote g₁ ∼ g₂ as the rule "g₁ followed by g₂". Sequential uniqueness for g₁ ∼ g₂ necessitates that g₁ is non-nullable and that the follow set of g₁ is disjoint from the first set of g₂.
2. Disjunctive Uniqueness Preservation: Let g₁ and g₂ denote two grammar rules, with g₁ | g₂ representing the rule "g₁ or g₂". The disjunctive uniqueness of g₁ | g₂ requires that g₁ and g₂ cannot both be nullable, and that their first sets are disjoint.
3. Guarded Fixpoints: every fixpoint must be guarded; consequently, left recursion is absent.
These properties collectively ensure that the grammar can be parsed with a single-token lookahead. Moreover, a set of rules fulfilling these restrictions can be soundly normalized into DGNF within finitely many steps.
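Schematically, once nullability, first sets, and follow sets have been computed for each term, the two uniqueness checks reduce to simple set operations. The following Rust sketch illustrates this; the Ty record and the set representation are simplifications of ours, not Paguroidea's actual checker:

    use std::collections::HashSet;

    type Token = u32;

    // Typing information attached to each term of the grammar.
    struct Ty {
        nullable: bool,
        first: HashSet<Token>,
        follow: HashSet<Token>,
    }

    // Sequential uniqueness for g1 ~ g2: g1 must be non-nullable and
    // FOLLOW(g1) must be disjoint from FIRST(g2).
    fn seq_ok(g1: &Ty, g2: &Ty) -> bool {
        !g1.nullable && g1.follow.is_disjoint(&g2.first)
    }

    // Disjunctive uniqueness for g1 | g2: at most one branch may be
    // nullable, and the first sets must be disjoint.
    fn alt_ok(g1: &Ty, g2: &Ty) -> bool {
        !(g1.nullable && g2.nullable) && g1.first.is_disjoint(&g2.first)
    }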
As stated in the previous section, production rules for a named rule A in DGNF can take only two forms:
1. A → ε: accepts the empty input.
2. A → t B₁ … Bₙ: commences with a terminal t, followed by zero or more non-terminals.
Once DGNF is obtained, the "localized" lexer for A can be generated by collecting all the terminals at the beginning of its production rules. The parser routine for A begins by calling the localized lexer to check which token is recognized. If a token is recognized, the type system guarantees that a single-token lookahead already determines which production rule to use; otherwise, the parser may or may not fail, depending on whether there is an empty production rule. Without this fusion technique, one needs to generate a large lexer containing all regular expressions defined by the language and use it to recognize all tokens from the input. With fusion, "localized" smaller lexers are used once per parser routine, without introducing extra states into the DFA.
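To make the shape of a fused routine concrete, consider a hypothetical rule A → 'a' A | ε. The sketch below is illustrative rather than Paguroidea's actual output; the localized lexer degenerates to a single-byte test here, but in general it is a small DFA built only from the terminals heading A's productions:

    // Localized lexer for A: recognizes the sole head terminal 'a'.
    fn lex_a(input: &[u8], pos: usize) -> Option<usize> {
        (input.get(pos) == Some(&b'a')).then(|| pos + 1)
    }

    // Parser routine for A -> 'a' A | ε. A single-token lookahead picks
    // the production; when no local token matches, the empty production
    // applies and the routine returns without consuming input.
    fn parse_a(input: &[u8], mut pos: usize) -> usize {
        while let Some(next) = lex_a(input, pos) {
            pos = next; // the tail call for A -> 'a' A becomes a loop
        }
        pos
    }

    fn main() {
        assert_eq!(parse_a(b"aaab", 0), 3); // consumes the three 'a's
    }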

Extended DGNF
In this section, we introduce an extension to DGNF that directly integrates "reduction symbols" into the normal form to explicitly denote semantic actions. This approach offers a more transparent representation of semantic actions, thereby facilitating optimization techniques such as inlining action routines and eliminating intermediate result materialization.
It is worth noting that the incorporation of "reduction symbols" into DGNF can be adapted to a wide range of semantic actions, making it applicable to various parsing scenarios. While our current focus centers on optimizing semantic actions within the context of syntax tree generation, the extension's flexibility implies that it can readily accommodate arbitrary semantic actions.

Tree-Generation
The most common usage of a parser is to generate an abstract syntax tree (AST) for the given language. Therefore, this work adopts the approach used by Pest [19] and many other popular parser generators, defaulting to creating a syntax tree as the output of the parser.
There are two kinds of semantic actions associated with tree generation, distinguished by the activity of a named rule:
• An active rule corresponds to a node in the AST, which is tagged with the name of the rule.
• A silent rule does not generate a node in the AST. Instead, all of its children (if any) are directly inserted into its parent.
The user inputs (in the form of a metalanguage that defines parsers) are described in Equation 1. A named rule may consist of one or more production rules, each sharing the same activity; users can provide these alternations as multiple rules, as illustrated in Equation 1. The body of a production rule, denoted as atom+, signifies an ordered sequence: the rule is recognized only if the corresponding atoms are accepted in the specified order. Throughout the remainder of this section, we employ g₁ ∼ g₂ to represent "g₁ followed by g₂", and g₁ | g₂ to denote an (unordered) alternation "g₁ or g₂". In alignment with the conventions outlined in [11], we introduce the concept of μ-fixpoints, which are named rules that include self-recursion within their production rules, either directly or indirectly. For example, the Kleene closure 'a'* can be expressed as μx. ε | ('a' ∼ x). When grammar definitions are presented in the format demonstrated in Equation 1, it is feasible to deploy depth-first search algorithms to determine whether a rule should be classified as a μ-fixpoint [24]. For the purposes of this section, we proceed on the assumption that we possess prior knowledge of recursive definitions.
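Assuming a Pest-style untyped parse tree, the active/silent distinction above can be pictured with a node type along the following lines (a simplification of ours, not the exact implementation):

    // A tagged parse-tree node. An active rule allocates one of these;
    // a silent rule pushes its children directly into the parent's
    // `children` vector instead.
    struct Node<'src> {
        tag: &'static str,          // name of the active rule
        text: &'src str,            // input span covered by the node
        children: Vec<Node<'src>>,  // sub-nodes in input order
    }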
A parser for S-expressions, for instance, can be defined with the following rules (tokens are represented by capitalized identifiers):
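    list  ⇒ LPAREN ∼ atoms ∼ RPAREN
    atoms → (atom ∼ atoms) | ε
    atom  ⇒ ATOM

Here we anticipate the notation of the next subsection, writing ⇒ for active rules and → for silent ones: list and atom are active, while atoms is silent, so the atom nodes produced inside atoms are spliced directly into the enclosing list node.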

Normal Form with Reduction Symbols
Similar to DGNF, non-empty productions in the extended normal form (EDGNF) always commence with terminals. For the sake of consistency, we distinguish active and silent rules using distinct arrow notations.

A → ε (silent empty rule)
A ⇒ ε (active empty rule)
A → t (B | [C])* (silent rule)
A ⇒ t (B | [C])* (active rule)

Here, t stands for a terminal, B for a non-terminal, and [C] for a reduction symbol of a rule named C. The primary distinction between EDGNF and DGNF is the inclusion of reduction symbols within the bodies of the production rules. When the parser encounters a reduction symbol [C], it performs reduction operations on all preceding tokens and previously reduced results, utilizing the semantic action associated with rule C. For instance, consider the following rules in EDGNF, designed for parsing a sequence of ATOMs (subscripts are used solely for reference and do not indicate distinct names):

    list   ⇒ LPAREN atoms rparen
    atoms₁ → ATOM [atom] atoms
    atoms₂ → ε
    rparen → RPAREN

The parsing process initiates with the rule list to analyze the input sequence "(a b c)". At the outset, the lexer for list recognizes LPAREN, guiding the parser into the atoms routine. Subsequently, the lexer at atoms identifies ATOM, leading to the selection of rule atoms₁. Owing to the reduction symbol [atom], "a" is incorporated into a tree node tagged as atom. The parser then proceeds with a recursive call to the atoms routine. Notably, list is active while atoms is silent, so all nodes generated by atoms are placed within the list node as its children. Ultimately, the resulting tree structure resembles this:

    list
    ├── atom "a"
    ├── atom "b"
    └── atom "c"

Analogous to a shift-reduce parser, reduction symbols serve as instructions for the parser to "reduce", while conventional non-terminals guide the parser to "shift".
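In an implementation, the symbols of an EDGNF production admit a direct first-class representation, for instance (a Rust simplification of ours, not Paguroidea's internal types):

    type RuleId = usize;

    // One symbol in the body of a normalized production.
    #[derive(Clone)]
    enum Symbol {
        Terminal(String),     // regex token; only ever heads a production
        NonTerminal(RuleId),  // instructs the parser to "shift"
        Reduce(RuleId),       // reduction symbol [r]: build a node tagged r
    }

    #[derive(Clone)]
    struct Production {
        active: bool,      // ⇒ (active) versus → (silent)
        body: Vec<Symbol>, // an empty body encodes the ε production
    }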

Normalize Grammar Definitions into EDGNF
In this section, we present an algorithm for normalizing user input (as defined in Equation 1) into EDGNF. As previously mentioned, we may refer to the operator connectives in the following explanations.
The fundamental concept involves dividing the normalization algorithm found in [26] into two distinct phases. While Flap's implementation also utilizes a two-staged normalization approach, it is initially defined using a single set of rules. To enhance the clarity of the implementation and to elaborate on how reduction symbols are inserted, we explicitly separate the procedure into two phases.
The function N₁ (semi-normalization) extends the rule set based on the input definitions. When user input contains terminals within the trailing part of production rules, N₁ introduces fresh names and converts them into non-terminals. Additionally, N₁ divides the top-level rule and the body of a μ-fixpoint by associating them with different names.
The context for stage 1 is a mapping that associates names with CFG production rules. We assert that implementing the normalization algorithm based on the provided semantic definitions is straightforward. For instance, to normalize the sequencing rule g₁ ∼ g₂, the algorithm first creates two fresh names, A₁ and A₂, for g₁ and g₂, respectively. It then obtains the semi-normalized form A₁ A₂ and continues by normalizing g₁ under A₁ and g₂ under A₂. This process aligns precisely with the definitions outlined in Equation 4.
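A sketch of this step, with Expr, Context, and the queueing discipline as our own stand-ins for the real data structures, could read:

    // A user-level grammar expression; only the sequencing case matters here.
    enum Expr {
        Seq(Box<Expr>, Box<Expr>),
        // Alt, Token, Ref, Fix, ... elided
    }

    struct Context {
        counter: usize,
        pending: Vec<(usize, Expr)>, // definitions awaiting semi-normalization
    }

    impl Context {
        fn fresh_name(&mut self) -> usize {
            self.counter += 1;
            self.counter
        }
    }

    // N1 on g1 ~ g2: bind each operand to a fresh name so that the body
    // becomes a pair of non-terminals, and queue g1 and g2 so that they
    // are semi-normalized under their new names in turn.
    fn seminormalize_seq(ctx: &mut Context, g1: Expr, g2: Expr) -> Vec<usize> {
        let (a1, a2) = (ctx.fresh_name(), ctx.fresh_name());
        ctx.pending.push((a1, g1));
        ctx.pending.push((a2, g2));
        vec![a1, a2]
    }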
We present the second phase, known as full normalization, as an iterative function. The following semantic formulae provide a single-step execution, where the symbol R refers to the output obtained from the first stage: R maps a name to the set of all its associated rules. The step function N₂ leaves empty rules and rules beginning with regex tokens unchanged, as they are already in fully normalized form. For a rule starting with a non-terminal A₁, N₂ combines each production rule of A₁ with the original trailing part to create a combined production rule. The normalization procedure and the type system guarantee that there can only be two cases for the elements of R(A₁): either an already fully normalized non-empty rule or a semi-normalized one starting with a non-terminal. In fact, if an empty rule were contained in R(A₁), it would violate sequential uniqueness.
An important aspect of Equation 5 is that when expanding a leading non-terminal, the algorithm inserts a "reduction symbol" for the target non-terminal if that non-terminal is active. This concept can be viewed as inlining parsing routines, with the "reduction symbols" representing the original subroutine calls that incorporate "active" semantic actions. A detailed proof of the correctness of this expansion will be provided in the next section.
Additionally, we offer an algorithm (Algorithm 1) for fully normalizing the rule set R. The algorithm iteratively normalizes the rules until no semi-normalized rules remain. In each step, for each name A in dom(R), we apply N₂ to all its associated production rules and collect the results using a "flat-map" operation.
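Under the simplified types sketched above, the driver loop of Algorithm 1 is essentially a fixpoint iteration over a flat-map. Here step2 is a hypothetical stand-in for N₂ that also reports whether it changed anything:

    use std::collections::HashMap;

    // Iterate the one-step expansion until no semi-normalized rule remains.
    fn normalize(
        mut rules: HashMap<RuleId, Vec<Production>>,
    ) -> HashMap<RuleId, Vec<Production>> {
        loop {
            let mut changed = false;
            let snapshot = rules.clone(); // read R(A1) while rewriting rules
            for prods in rules.values_mut() {
                // flat-map: expanding a leading non-terminal may turn one
                // production into several fully normalized ones
                let expanded: Vec<Production> = prods
                    .drain(..)
                    .flat_map(|p| step2(&snapshot, p, &mut changed))
                    .collect();
                *prods = expanded;
            }
            if !changed {
                return rules;
            }
        }
    }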

Correctness of the Normalization
The correctness of the Flap normalization algorithm from a grammar perspective has been previously established in [26]. Specifically, when provided with a type-checked grammar G, the Flap normalization algorithm terminates, yielding G′ in DGNF, which recognizes precisely the same language as G.
In this work, we present the Flap normalization algorithm in two stages, incorporating the insertion of reduction symbols. Importantly, these reduction symbols do not alter the language recognition capabilities of the resulting grammar. Therefore, we do not reiterate the proof of the well-definedness and soundness of the normalization algorithm. Instead, our focus in the subsequent sections centers on demonstrating that the algorithm, with the insertion of reduction symbols, correctly preserves the handling of semantic actions. To establish this correctness, we begin by formulating a substructural logic representation of parser state transitions.
This approach to the proof appeals to us primarily because:
• it provides considerations different from [26], involving not only the grammatical correctness of parsers but also semantic actions;
• it formalizes the execution of parsers (including semantic actions) in substructural logic, which provides a novel perspective for understanding the behavior of parsers.
The substructural logic systems that we refer to do not inherently assume structural properties of the proof context, such as the contraction, weakening, or exchange rules [17]. The execution of the parser involves the selection of production rules, the consumption of input elements, and the generation of syntax tree nodes. More specifically, the parser operates as a deterministic finite transducer (DFT) that scans through the input and generates a corresponding syntax tree. This execution is intricately sensitive to the order and arrangement of objects within the context (including the input stream, the outputs on the stack, and the production rule sequences). Consequently, substructural logic is a natural fit for abstracting and formalizing the execution of parsers. A similar formulation for DFTs can be found in [3].
We represent the state of such a transducer using a tripartite structure, denoted as

G | O | I

Here, G signifies the grammar rule sequence that the transducer is currently processing, O corresponds to the output being actively constructed, and I represents the input sequence under consideration. As an illustration, in Equation 6, the state transition denoted as Shift captures the process of accepting a terminal token t. Note that we employ placeholders g and t to represent grammar rules and input tokens, respectively.
To comprehensively investigate the properties of normalization, we seek to establish a precise understanding of parser execution semantics. Recall that, within our context of tree generation, each grammar rule carries one of two semantic actions: it either passes information to the parent node or allocates a tree node to encapsulate its children. To substantiate the correctness of the normalization process, we conduct parsing operations on the original grammar definition, as stipulated in Equation 1.
Equation 7 articulates the semantics of all remaining rules derived from the user's input before any normalization takes place (here, ←Δ signifies that the body of the production rule Δ is expanded in reverse order). The definition aligns with the semantics of a shift-reduce parser and exhibits the following behaviors:
• When the parser encounters a token that matches the current expectation, it consumes the token and proceeds.
• When an empty rule is expected, the parser advances without consuming the input stream.
• When a silent non-terminal is expected, the parser simply expands the production rule.
• When an active non-terminal rule is expected, the parser records the reduction position and transitions to that non-terminal rule.
• Upon reaching a reduction position, the parser applies the corresponding semantic actions.
Equation 7 does not define a DFT, because Shift⇒ and Shift→ may not be deterministic if multiple production rules are associated with a non-terminal. To address this, and given our requirement of unambiguous grammars under single-token lookahead, we introduce an "oracle" Δ(A, t) that represents the production rule of A determined by looking ahead at the token t. We then update the shift rules as depicted in Equation 8.
For the sake of simplicity, we omit the rules associated with the successful or failed termination of the parser. The DFT terminates when it becomes stuck while following the aforementioned rules.
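For intuition, a schematic run on the input ATOM RPAREN under the rules atoms → (atom ∼ atoms) | ε and atom ⇒ ATOM from the previous section might look as follows, writing states as G | O | I and ↦atom for a recorded reduction position (the rendering is ours; the precise symbols appear in Equations 6-8):

    atoms                 | ·            | ATOM RPAREN
    atom ∼ atoms          | ·            | ATOM RPAREN   (expand silent atoms)
    ATOM ∼ ⇓atom ∼ atoms  | ↦atom        | ATOM RPAREN   (expand active atom; record position)
    ⇓atom ∼ atoms         | ↦atom "a"    | RPAREN        (shift consumes ATOM)
    atoms                 | atom("a")    | RPAREN        (⇓ reduces everything after ↦atom)
    ε                     | atom("a")    | RPAREN        (oracle selects the ε production)

The remaining RPAREN would be consumed by an enclosing rule such as list.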
In the ensuing section, we delve into specific cases that pertain to the normalization algorithm. Our objective is to demonstrate that normalization engenders the same execution behaviors as the original rule set. Consequently, the rules post-normalization can be viewed as a static expansion, meticulously preserving the underlying semantics.

Lemma 4.1. (Consecutive Expansion) Starting from a state whose grammar part expects a non-terminal, the DFT performs a maximal sequence of expansions, without consuming any input, until the expected symbol becomes a terminal.

Proof. Each execution of Shift⇒ or Shift→ places another non-terminal Aᵢ₊₁ at the rightmost position of the grammar part of the state, continuing until some Aₙ. Consequently, during this process, none of the rules corresponding to token consumption, empty rules, or ⇓ are applicable. Furthermore, these shifts do not consume the input; therefore, the DFT continues looking ahead at the same token and expands the grammar sequence accordingly. □

Lemma 4.2. (Whole Stack Reduction) Under the same conditions as Lemma 4.1: for each ⇓ ∈ Δᵢ, the reduction always consumes every "non-arrow" result before Γ. Additionally, immediately after every such reduction, before Γ, there can only be a single "non-arrow" object followed by multiple "arrows".
Proof. Each expansion inserts one "arrow" at the rightmost position of the grammar part and one "arrow" at the leftmost position of the output part. This holds true even when silent rules appear as heads during the process: if silent rules are involved in consecutive expansions, they introduce no ⇓ symbols, and the next active rule, if any, will still be expanded into the leading position. Without loss of generality, after such consecutive expansions, the state ends up in the form of Equation 9 (Δ′ᵢ represents a production rule that has already been expanded; some rules are silent (e.g., A₂) and thus insert no reduction symbol into the sequence).
Note that the bodies Δ′ᵢ may still contain non-terminals to be expanded by future execution; Lemma 4.2 only concerns the reduction symbols (arrows) inserted during a consecutive expansion, before any advancement over terminals. When a ⇓ is executed, any possible reduction inside Δᵢ has already finished, so ⇓ consumes all "non-arrow" results before Γ and reduces them to a single result. Inductively, the property holds for the entire sequence. □

With Lemma 4.1 and Lemma 4.2, one may already foresee some insights into the proof. These lemmas state properties of the execution semantics of the parser that are "captured" by the normalization process. Indeed, in the following, we demonstrate how normalization evaluates possible consecutive expansions statically and inserts triggers for whole-stack reductions.
Recall that the step function N₂ (Equation 5) preserves rules beginning with terminals while expanding the production rules of leading non-terminals. In the "substructural logic abstract machine", such expansion corresponds to decisions made from runtime look-ahead information. In a static setting, however, we lack access to runtime look-ahead; therefore, the algorithm exhaustively considers every potential production rule. Furthermore, N₂ executes iteratively until all production rules are fully normalized. This iterative process mirrors all feasible consecutive expansions that can occur during the runtime of our DFT abstraction. Consequently, we establish the following lemma:

Lemma 4.3. After the termination of the iterative N₂ algorithm, all possible consecutive expansions have been statically evaluated. Consequently, if the DFT triggers a consecutive expansion of some rule A at any step, the expanded rules uniquely resolve to a subset of the normalized production rules associated with A, expanded in the same order as in the runtime execution.
Proof. This conclusion follows naturally from the earlier discussion. The uniqueness arises from the grammar's unambiguity under single-token lookahead. Since the iterative algorithm expands the leading non-terminals step by step, the expansion order aligns with the runtime execution order. □

During the normalization process, N₂ inserts a reduction symbol whenever an active rule is expanded. Such a symbol indicates the construction of a tree node from the reduced results to the left of the whole rule. As Lemma 4.3 states, the runtime expansion procedure corresponds to a static expansion path in which the same set of rules is expanded in the same order. Therefore, these reduction symbols are analogous to the ⇓ symbols in the substructural logic formulation. We summarize this observation in Theorem 4.4:

Theorem 4.4. (Correctness of the Normalization) The normalization procedure N₂ accurately maintains the semantic actions. Specifically, the inserted reduction symbols trigger reductions of the same active rules, in the same order, and on the same children, as in the abstract parser execution with look-ahead oracles.
Proof. Leveraging Lemma 4.3, we readily observe that N₂ expands production rules in a manner that aligns each feasible execution of a non-terminal with uniquely resolved normalized production rules. Building upon Lemma 4.2, we can ascertain that in such executions, every reduction consumes every output within the "stack frame" initiated at the corresponding active rule. Concurrently, the reduction symbols within the normalized production rules also initiate reductions on all reduced results to the left of the rule. As a result, these statically expanded reductions execute in the same order, and on the same ordered set of children, as if they were dynamically determined by the abstract parser employing look-ahead oracles. □


Implementation
Paguroidea's implementation comprises a lexer engine based on derivatives, a frontend responsible for reading grammar definitions and converting them into normal form, and a backend that generates fused parsers in the form of Rust token streams. The implementation is designed to fulfill the requirements of tree generation and to showcase how transparent semantic actions can be utilized to improve performance. It should not be hard to extend similar optimization techniques to semantic actions other than tree generation.
The parser generator makes full use of the transparent representation of semantic actions, which enables it to perform optimizations and produce high-quality code. For instance, in the case of silent rules, intermediate results are never materialized. Instead, silent routines accept a mutable reference to a dynamic array and insert their results into it directly. This approach is possible because the parser generator knows that a subsequent active rule will consume the outputs of silent rules. In contrast, parser combinators compose such semantic actions opaquely in the early stages of normalization, making it challenging to perform similar optimizations. Additionally, the parser generator can apply tail-call optimizations to silent rules, leading to significant speed improvements when parsing silent rules of the form A+ or A*.
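Concretely, a generated silent routine has roughly the following shape (illustrative only; lex_atom stands for the localized lexer and Node for a parse-tree node type as sketched earlier):

    // Silent routine for  atoms -> ATOM [atom] atoms | ε . No intermediate
    // vector is materialized: nodes go straight into the buffer owned by
    // the nearest enclosing active rule, and the tail-recursive self call
    // is compiled into a loop.
    fn atoms<'s>(input: &'s str, mut pos: usize, out: &mut Vec<Node<'s>>) -> usize {
        while let Some(end) = lex_atom(input, pos) {
            out.push(Node { tag: "atom", text: &input[pos..end], children: Vec::new() });
            pos = end;
            // the tail call `atoms(input, pos, out)` becomes the next iteration
        }
        pos
    }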
Even when normalized production rules contain multiple active reduction symbols, the operations are still executed efficiently. Essentially, the generator constructs a function that manages the current results in a dynamic vector. It appends nodes during shifting and "compresses" nodes into a single result during reduction actions. The transparent representation allows the generator to avoid passing results between various semantic action routines, further enhancing efficiency and performance.
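The "compress" step triggered by a reduction symbol is then a cheap splice on that shared vector (again a sketch):

    // On reaching a reduction symbol [tag], everything pushed since `mark`
    // (the recorded reduction position) becomes the children of one new node.
    fn reduce<'s>(stack: &mut Vec<Node<'s>>, mark: usize, tag: &'static str, text: &'s str) {
        let children = stack.split_off(mark);
        stack.push(Node { tag, text, children });
    }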
Paguroidea also applies general optimization techniques such as SIMD and look-up-table optimizations. In derivative-based lexers, it is common to encounter self-loops in DFA states, for example, those arising from Kleene closures. Paguroidea estimates a pattern's cost by counting the boundaries of its intervals; it uses look-up tables if the pattern is complex and SIMD otherwise. For SIMD, it utilizes Rust's std::simd library to pack input tokens and compare them with splat representations of the left and right boundaries. Depending on the distribution of the intervals, the generated lexer may choose between positive and negative look-ahead methods. If the control flow is too complicated to be compiled into packed SIMD patterns, Paguroidea falls back to look-up tables. We use an approach similar to Logos, where multiple look-up patterns are stacked within a global byte array [15]. In our implementation, each cell of the table occupies a single bit rather than a full byte; this compact representation brings better cache locality and higher memory efficiency.
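The one-bit-per-cell layout can be sketched as follows (our own illustration of the idea, not the exact tables emitted by Paguroidea or Logos):

    // 256 cells packed into 32 bytes: bit (b % 8) of TABLE[b / 8] records
    // whether byte value b belongs to the character class of a DFA self-loop.
    const fn build_table(ranges: &[(u8, u8)]) -> [u8; 32] {
        let mut t = [0u8; 32];
        let mut i = 0;
        while i < ranges.len() {
            let mut b = ranges[i].0;
            let hi = ranges[i].1;
            loop {
                t[(b / 8) as usize] |= 1u8 << (b % 8);
                if b == hi { break; }
                b += 1;
            }
            i += 1;
        }
        t
    }

    // Example class: ASCII identifier characters.
    static IDENT: [u8; 32] =
        build_table(&[(b'a', b'z'), (b'A', b'Z'), (b'0', b'9'), (b'_', b'_')]);

    #[inline]
    fn in_class(b: u8) -> bool {
        (IDENT[(b / 8) as usize] & (1u8 << (b % 8))) != 0
    }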

Evaluation
In this section, we present a detailed assessment of the parsers generated by Paguroidea, focusing primarily on their performance for two common data formats: CSV (Comma-Separated Values) and JSON (JavaScript Object Notation). Our evaluation spans several dimensions, including hardware platforms and input data types, selected to reflect the efficiency of the parsers.
As the following sections show, Paguroidea is capable of generating parsers that beat popular parser generators in Rust and match the performance of state-of-the-art parsers specialized to their input formats.

General Setup
Our implementation comprises multiple Rust crates. To maximize the optimization potential, Link-Time Optimization (LTO) and level-3 optimization are enabled, allowing the Rust compiler to exploit all possible optimization opportunities. Since we evaluate the parsers on specific platforms, the compiler flag -Ctarget-cpu=native is passed to enable the use of native microarchitecture features.
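In concrete terms, this build setup amounts to something like the following; the profile keys and the flag are standard Cargo/rustc options, though the exact crate layout is elided:

    # Cargo.toml
    [profile.release]
    lto = true
    opt-level = 3

    # running the benchmarks with native microarchitecture features enabled
    RUSTFLAGS="-Ctarget-cpu=native" cargo bench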
Considering that the parsing process constructs sizable data structures, we employ the state-of-the-art memory allocator snmalloc [12] to mitigate the effects of memory allocation during our experiments.
For the AArch64 CPUs, we selected the Apple M1 (up to 3.2GHz with 8M LLC, ASIMD available) and AWS Graviton 3E (2.6GHz with 32M LLC, SVE available).

Random CSV
For the random CSV benchmark, the program generates multiple rows of either integral or textual data of varying lengths, using the same random seed for all parsers. This data is then processed repeatedly by the benchmark framework (cargo bench), producing throughput figures averaged over a minimum of 500 data points.
Three parsers are evaluated in this experiment:
• pag: generated by Paguroidea;
• csv: the widely adopted implementation from the Rust community;
• pest: the parser constructed using Pest, sharing the same grammar definition as Paguroidea.
The results are illustrated in Figure 1.
As evident from the figure, the performance of the Paguroidea-generated parser is either comparable to or exceeds that of the specialized parsers.

Random JSON
The setup for the random JSON benchmarks mirrors that of the random CSV tests, albeit with a distinct set of parsers:
• pag: generated by Paguroidea;
• serde: the leading JSON parser implementation for "serde", Rust's most widely used serialization library;
• simd-json: a cutting-edge JSON parser ported to Rust;
• lalrpop: a table-driven LR(1) parser generator in Rust, with its default lexer;
• lalrpop+logos: "lalrpop" combined with a lexer generated by "logos", a high-performance Rust lexer generator;
• pest: the parser designed using Pest, with a grammar definition identical to Paguroidea's.
It is essential to note that both simd-json and serde execute additional semantic actions, including attribute-map creation and integer deserialization. Thus, if solely generating syntax trees, the specialized parsers might achieve even higher throughput. Nonetheless, we present the data in its original form to underscore that the parser generated by Paguroidea remains competitive with these specialized parsers. Furthermore, as depicted in Figure 2, our performance notably surpasses the other generated parsers.

Twitter JSON
The Twitter JSON evaluation replicates the experiments with the previously mentioned parsers, but introduces a distinct dataset sourced from websites. This dataset is considerably larger and encompasses a greater number of Unicode strings. As depicted in Figure 3, the parser produced by Paguroidea maintains a performance level commensurate with the specialized parsers.

Comparison with Flap
To provide a more detailed analysis of the performance outcomes, we conducted an additional series of JSON benchmarks against Flap. We created a comparable JSON parser in MetaOCaml, using parser combinators sourced from ocaml-flap. This implementation mirrors the approach of Paguroidea, wherein the parser generates a tree structure without deserializing string or numeric literals. Alongside the normal parsers, we also construct JSON scanners with both Paguroidea and Flap: in Paguroidea, the scanner is implemented by marking all rules except the top-level one as silent; in Flap, we discard all intermediate results in semantic actions by returning units.
The comparative throughput of the various parsers is depicted in Figure 4. Parsers created by Paguroidea consistently surpass those produced by Flap across all metrics. This performance disparity can be attributed to several key factors. Paguroidea capitalizes on the transparency of semantic actions to avoid passing intermediate results, further reinforced by explicit tail-call optimization. In contrast, the parser combinators in Flap invariably return composite types to encapsulate the results derived from iterative or concatenated rules; since these functions are composed opaquely, converting the parsing routines into tail-recursive form is infeasible. Indeed, the benchmark suite from the original Flap work requires unlimited stack sizes to accommodate the deep recursion.

Conclusion
This research introduces Paguroidea, a parser generator that employs lexer-parser fusion. Building upon the Flap parser detailed in [26], our approach formulates a normalization algorithm that integrates semantic actions with normal forms. The correctness of this algorithm is examined through a substructural logic abstraction of parser execution. Our empirical evaluations demonstrate that the parsers generated by Paguroidea exhibit performance comparable to specialized parsers.
However, this study is not without limitations. Inheriting from [26], Paguroidea imposes relatively strict constraints on grammar definitions, which curtail the expressiveness and user-friendliness of the generator. Future research should contemplate extending the fusion technique to broader grammar families, such as PEG, Adaptive LL(*) [18], or LR(k). Earlier studies such as [14] may already provide insights into similar static analyses of PEG.
One challenge that emerges when creating a parser generator for a high-level language is ensuring compatibility between the generated code and the target language's type system. This compatibility becomes particularly complicated when general semantic actions come into play. It would be worth exploring whether the transparent representation proposed in this study offers a more universally applicable method for inferring types within parser routines.
In conclusion, this research furnishes both practical techniques for efficient parser generation and a formalized logical approach to understanding the runtime behavior of parsers. We are optimistic that this amalgamation of hands-on implementation and theoretical reasoning will inspire further advancements in the field of efficient and trustworthy compiler construction.

Data-Availability Statement
The data that support the findings of this study are openly available in Zenodo at 10.5281/zenodo.10570638, reference number [27].

Figure 1. CSV Parsing Throughput. The y-axis represents throughput (in MiB/s), and the x-axis denotes different hardware platforms.

Figure 2. Random JSON Parsing Throughput. The configurations are analogous to Figure 1.

Figure 3. JSON Parsing Throughput with a Real Dataset Sourced from Websites.

Figure 4. JSON Parsing Throughput of Paguroidea Compared with Flap.