Fast Deterministic Black-box Context-free Grammar Inference

Black-box context-free grammar inference is a hard problem because in many practical settings the inference engine only has access to a limited number of example programs. The state-of-the-art approach Arvada heuristically generalizes grammar rules starting from flat parse trees and uses non-determinism to explore different generalization sequences. We observe that many of Arvada's generalization steps violate common language concept nesting rules. We thus propose to pre-structure input programs along these nesting rules, apply learnt rules recursively, and make black-box context-free grammar inference deterministic. The resulting TreeVada yielded faster runtime and higher-quality grammars in an empirical comparison. The TreeVada source code, scripts, evaluation parameters, and training data are open-source and publicly available (https://doi.org/10.6084/m9.figshare.23907738).


INTRODUCTION
Learning a context-free grammar from sample programs with just the help of a black-box parser currently does not scale well to realistic settings. Existing approaches either need a large number of sample programs (deep learning), require the ability to manipulate a grey-box or white-box parser, or are non-deterministic. The most closely related approach, the recent Arvada work [22], is non-deterministic; its authors thus ran Arvada 10 times for each input to explore different sequences of grammar inference steps.
Black-box context-free grammar inference is crucially important when a language only has a black-box parser that cannot be instrumented. On the other hand, program samples are often available (e.g., as open-source code or as example programs from the language vendor). Such languages typically only have closed-source parsers that are often only available remotely (or cannot be instrumented for legal reasons). This unfortunately rules out white-box or grey-box parser instrumentation [16, 19, 27, 42].
The task of black-box inference of a context-free grammar is fundamentally hard. First, the given input programs likely do not cover all aspects of their language's "golden grammar". Second, it is often very hard to generalize from a few programs exhibiting a few combinations of language features to a grammar describing the language features with the correct nesting rules. Finally, not being able to inspect or instrument the language's parser makes black-box inference significantly harder than grey-box or white-box inference, as a black-box approach has a much narrower access to the parser's encoding of the language's golden grammar.
While there has been a lot of interest in applying deep learning techniques to learning grammars from program samples [11, 15], a principal limitation of deep-learning approaches is that (a) they need a very large amount of training samples (which may not be available) and (b) they do not take advantage of black-box parsers that are typically available even for closed-source languages. Indeed, the Arvada paper reports on a comparison with state-of-the-art deep learning approaches, in which deep-learning tools did not match the precision of either Glade [5] or Arvada [22].
While Arvada [22] has made significant improvements over the pioneering Glade [5, 6] work, it still has several limitations. For example, Arvada has O(n⁴) runtime in its n input tokens and requires its "seed" input programs to be very short. On average, Arvada mostly produced [22] grammars within 5 minutes with over 80% F1 scores when running on a few dozen hand-selected minimal sample programs that on average consist of just 12.5 characters. However, when running on 25 randomly generated nodejs programs with an average length of 50 characters, Arvada yielded on average a 29% F1 score, after a 12-hour runtime.
TreeVada combines several new techniques. First, to guide its grammar generalization steps to avoid breaking common nesting rules, TreeVada first pre-structures its input programs according to nesting rules induced by balanced brackets that are common in many languages [41]. Depending on the language grammar's nesting structure, this step reduces TreeVada's runtime from Arvada's O(n⁴) down to as low as O(n²). Second, once TreeVada accepts a grammar generalization step, TreeVada applies this generalization rule recursively. Finally, building on these techniques, TreeVada carefully removes non-determinism and thus yields a reproducible grammar in a single run. In an empirical comparison the resulting TreeVada implementation achieved both faster runtime and better grammar quality than the most-closely-related Arvada tool. To summarize, the paper makes the following major contributions.
• TreeVada is the first fast deterministic black-box approach for context-free grammar inference that produces high-quality grammars.
• The paper compares TreeVada empirically with its closest competitor (Arvada) using Arvada's setup and achieves faster runtime and better grammar quality.

BACKGROUND
While there are trade-offs and special cases, "context-free" remains an important abstraction level for programming language definition, both for human-level programming language understanding and for automated language processing tools. For example, the latest versions of the official language specifications of complex mainstream languages such as Java [17], JavaScript [45], and C++ [21] include context-free grammars in their language syntax descriptions. Similarly, many sample grammars of the widely-used ANTLR4 line of parser generators [33] are context-free. Figure 1 is a small example context-free grammar. As usual, a rule with alternatives (A → α₁ … αₘ | β₁ … βₙ) is just shorthand for having both a first (A → α₁ … αₘ) and a second (A → β₁ … βₙ) rule. Each rule is thus essentially of the same form (A → α₁ … αₘ) with a single non-terminal (A) on the left and a sequence (α₁, …, αₘ) of terminals, non-terminals, or both on the right. This, of course, allows recursive rules (e.g.: B → ∼ B) and balanced nesting structures (e.g.: N → ( N + N )).
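To make the rule forms concrete, the shorthand above can be sketched as a tiny data structure. This is a minimal illustration in Python; the encoding and the `expand_alternatives` helper are ours, not part of Figure 1:

```python
# A grammar maps each non-terminal to a list of right-hand sides,
# so "N -> ( N + N ) | n" is stored as two entries under "N".
grammar = {
    "N": [["(", "N", "+", "N", ")"],  # balanced nesting structure
          ["n"]],                     # base case
    "B": [["~", "B"],                 # recursive rule
          ["true"]],
}

def expand_alternatives(g):
    """Flatten 'A -> alt1 | alt2' into individual (A, rhs) rules."""
    return [(lhs, tuple(rhs)) for lhs, alts in g.items() for rhs in alts]
```

Each (A, rhs) pair produced by `expand_alternatives` corresponds to one rule of the uniform form A → α₁ … αₘ described above.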
Many programming languages allow balanced nesting of language concepts, where a concept has a dedicated start terminal and a dedicated end terminal and a concept can contain other concepts [41]. For these start and end terminals many languages use matching round, square, and curly brackets: ( ) [ ] { }. Commonly nested concepts include class and function definitions, parameter lists, code blocks, array creation and access, and various other expressions: for example, a code block containing other code blocks that in turn contain arithmetic expressions.

Black-box Grammar Inference
The long-term goal of this line of work is to reverse engineer the (unknown) grammar (or specification) of a programming language from only two things: (1) a few valid sample programs and (2) a black-box parser. Having such a specification would support various software engineering tasks, including code comprehension [31], reverse engineering [28], smell detection and refactoring [20, 29], test input generation [18], and code transformation [1].
For example, some popular commercial languages (e.g., MATLAB/Simulink) neither have a formal specification nor parsers that can be analyzed. Specifically, the language's tools are closed-source and cannot be instrumented for legal or technical reasons (e.g., they are only available as a remote service). But valid sample programs are often widely available, via GitHub or the vendor's website (to document language features and encourage language adoption).
We follow Arvada's definition [22] of grammar quality. So an inferred grammar G′ is better if it has a higher F1 score, i.e., if the set of input programs G′ accepts is closer to the set of input programs accepted by the golden grammar.

State-of-the-art Inference: Arvada in O(n⁴)
Black-box inference of context-free grammars was pioneered by Glade [5, 6] and recently advanced by Arvada [22]. Arvada's evaluation showed that Arvada's average run provided some 5× improvement in recall and 3× improvement in F1 score over Glade (while being just 30% slower).
Arvada initially treats each input program as a flat parse tree, i.e., a single rule that can only reproduce the given input program. Arvada then iteratively generalizes grammar rules. It groups a few (leaf and/or internal) parse-tree sibling nodes nᵢ, …, nⱼ (aka a "bubble") under a new bubble parent node B, reflecting a candidate grammar rule (B → nᵢ … nⱼ). Arvada then picks an existing interior tree node A and its child nodes c₁, …, cₘ, which together implicitly define a grammar rule (A → c₁ … cₘ). It then checks if it can rename the new bubble parent node to (and thereby merge it with) the existing interior node A, yielding the generalized rule (A → c₁ … cₘ | nᵢ … nⱼ).
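One such generalization step can be sketched as follows. All names here (`try_merge`, `sample_programs`, `oracle`) and the nested-list tree encoding are our illustrative assumptions, not Arvada's actual API:

```python
def try_merge(siblings, i, j, target, sample_programs, oracle):
    """Sketch of one generalization step: wrap siblings[i:j] under a
    bubble, rename the bubble to the existing non-terminal `target`,
    and keep the merge only if the black-box parser `oracle` accepts
    every freshly sampled check program."""
    bubble = siblings[i:j]            # candidate rule: B -> bubble
    if all(oracle(p) for p in sample_programs(target, bubble)):
        # accepted: the siblings now contain one [target, *bubble] node
        return siblings[:i] + [[target] + bubble] + siblings[j:]
    return None                       # rejected merge
```

With an oracle that accepts everything, grouping the inner `n+n` of `(n+n)` under a node `t2` would yield `["(", ["t2", "n", "+", "n"], ")"]`.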
Arvada heuristically accepts such a rule merge when a black-box parser accepts up to 100 freshly generated sample programs that exercise the newly-merged rule. Since each such merge check (and especially failing it) is expensive, Arvada orders its potential bubbles via heuristics. The key heuristic is to compare the siblings immediately before ("left-context") and after ("right-context") a candidate bubble. Arvada thus ranks a bubble higher if the bubble's contexts are more similar to the contexts of existing interior tree nodes. The secondary bubble-ranking metric is each bubble's occurrence count in the input programs' parse trees (a higher bubble occurrence frequency yields a higher rank). To increase the chance of a merge, Arvada also tries to merge two bubbles directly with each other ("2-bubble"), for which it ranks all bubble pairs. The evaluation of Arvada (and Glade) [22] points to two scalability issues. (1) First, Glade's and Arvada's training sets only contain very small input programs, i.e., the largest input programs range from just 5 (arith language) to 245 characters (tinyc). (2) Second, relatively more complex languages (tinyc and especially nodejs) have relatively larger golden grammars and input programs. Besides the higher runtime, here Glade and Arvada also yield lower F1 scores. Following are the key technical challenges of Arvada.
2.2.1 Arvada Run = 10 Non-deterministic O(n⁴) Runs. Arvada's first key challenge is its non-determinism, which makes the results hard to reproduce. For example, when we ran it 10 times on the Figure 2 input programs P₁, Arvada produced two different grammars. Non-determinism also creates a trade-off between using the first run's grammar vs. re-running Arvada in the hope of finding a better grammar. On each set of input programs, the Arvada work ran Arvada 10 times, effectively yielding an order-of-magnitude worse total runtime than the reported average runtime.
Arvada uses non-determinism to explore various sequences of grammar generalization steps. Such a generalization sequence can get Arvada stuck in the sense of cutting off subsequent generalization options, reducing the inferred grammar's quality. For example, the Arvada study reported a high F1 score variance among its 10 runs for several languages. Among 10 nodejs runs, F1 scores ranged from 0.14 to 0.55. Following are Arvada's four main sources of non-determinism.
Shuffling Initial Candidate Node-pair Merges: Arvada first tokenizes each input program along character classes (lower-case, upper-case, digits, whitespace), keeping only other ASCII (aka "punctuation") and non-ASCII characters as individual tokens. (Arvada treats each such resulting token as the only child of a token-specific "dummy" parent node connected to the root node; for brevity we omit these dummy nodes from figures.) Arvada then tries an initial attempt ("MergeAllValid") to generalize grammar rules. Specifically, it creates all pairs of existing non-terminal (mostly dummy) nodes across all parse trees, orders the pairs arbitrarily (by storing them in a non-deterministic data structure), and tries to merge each pair. For example, as Figure 2's two initial parse trees contain 12 unique non-terminal node types (t₀ and the implicit parents of while, ␣, n, etc.), Arvada tries merging 66 node pairs, which yields one successful merge (skip with t₀).
Ranking & Shuffling O(n⁴) Candidate Merges: After the initialization phase, for each grammar generalization step Arvada first (re-)collects and (re-)ranks all possible parse-tree sibling-node sequences ("1-bubbles") up to a configurable length together with their pairs ("2-bubbles"). For n tokens in the initial parse trees there are essentially O(n²) 1-bubbles, which makes the ranking overall O(n⁴). Arvada then takes the top-100 candidates, shuffles them, stores the existing non-terminal tree nodes in a non-deterministic structure, and iteratively tries the merges until one succeeds. For example, in Figure 2's first bubbling step Arvada ranks 1,043 candidate 1- and 2-bubbles to merge the lime bubble (t₁ → L␣=␣n).

[Figure 2: Top to bottom: Input while programs P₁ and a resulting Arvada run: initial (pre-tokenized) flat parse trees, initial node-pair merges (green), 1st bubble merge (lime), 2nd bubble merge (yellow) without reapplying rule, and 3rd bubble merge (orange) breaking tree nesting; resulting grammar.]
Accepting Rule Generalization Via Sampled Programs: For both above cases (merging initial single nodes or a candidate bubble), Arvada accepts the merge if the black-box parser accepts up to 100 freshly generated programs. From all programs that exercise the proposed generalized grammar rule, Arvada samples these 100 programs (50 per merged side) uniformly.
Final Step (Expand Terminals): At the end Arvada expands each terminal to a larger character class, so the grammar may accept tokens that were not in the seed programs. For example, Arvada tries to expand t₁ → 1 | 2 to all single digits, integers, or alphanumeric letters. Arvada then samples 10 strings, generates programs, and checks them via the parser. A grammar's terminals may thus differ across Arvada runs on the same seed inputs.

2.2.2 Not Generalizing Recursively. Arvada's second challenge is that it does not recursively reapply a rule generalization it just learned and thus on some runs needs additional expensive steps or gets stuck. For example, Figure 2 shows 5/10 runs we observed Arvada pursue for the P₁ input programs. As the second bubble (yellow) it grouped sibling nodes (n+n) under a new bubble parent and merged that parent with n's (not shown) dummy parent into t₂.
While this bubble yields an appropriate generalized grammar rule (t₂ → n | ( t₂ + t₂ )), Arvada does not recursively reapply this just-learned rule to its parse trees, even though the sibling sequence ( t₂ + t₂ ) is now present in the right parse tree. Instead, Arvada re-ranks all bubbles (an expensive operation), picks and merges another bubble (orange), and thus gets stuck.

2.2.3 Breaking Bracket-implied Nesting Structure. Many languages use matching round ( ), square [ ], and curly { } brackets to recursively nest concepts such as class and function definitions, code blocks, parameter lists, array creation and access, and various other expressions. Arvada's third key challenge is that on some runs it prioritizes a bubble that conflicts with a parse tree's bracket-implied nesting structure and thus gets stuck.
For example, in some runs on the Figure 2 input programs P₁, Arvada breaks the while language's numerical expression nesting, which is defined via matching round brackets. In these runs Arvada partially merges the bracket-wise unbalanced orange bubble ()␣;␣t₀) with the implicit parent of the last closing bracket. Arvada then cannot further generalize the grammar. The resulting grammar is recursive. But for statement sequences (recursive applications of the semicolon) it only allows very specific instantiations, i.e., each generated statement sequence must start with an assignment statement that contains at least one addition (L␣=␣( t₂ + t₂ …).

OVERVIEW AND DESIGN
We guided our design via feedback from running Arvada and our alternatives on Arvada's seed programs for tinyc [22]. To prevent over-fitting, we did not use feedback from any other programs or languages we used in the subsequent evaluation.

Assumptions on Strings & Brackets
TreeVada's current heuristics build on two "soft" assumptions, i.e., that many languages (1) use ' " quotes to wrap strings and (2) use ( ) [ ] { } brackets for nesting.If a language (also) uses these characters for other purposes then TreeVada's F1 score may suffer.

Pre-tokenizing Input Programs
As many languages share basic tokenization (or lexing) rules (e.g., an identifier is separated by some non-identifier token from the following token), TreeVada and Arvada first tokenize their input programs. Tokenizing likely both yields higher-quality grammars and is more efficient than rediscovering common lexing rules on each run via relatively expensive grammar inference.
Both approaches thus replace a sequence of elements of one of the four character classes (lower-case, upper-case, whitespace, or digits) with a new terminal. This leaves all brackets, punctuation, and other "special" characters as individual character terminals. For example, on the P₁ input programs of Figure 2, Arvada and TreeVada produce the same token sequence.
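The shared character-class tokenization can be sketched with a single regular expression; this is an illustrative approximation of the scheme described above, not the tools' actual lexer:

```python
import re

# A run of one character class (lower-case, upper-case, digits,
# whitespace) collapses into one terminal; any other character
# (brackets, punctuation, non-ASCII) stays a 1-character token.
TOKEN_RE = re.compile(r"[a-z]+|[A-Z]+|[0-9]+|\s+|.", re.DOTALL)

def pre_tokenize(program):
    """Split a program into character-class tokens, left to right."""
    return TOKEN_RE.findall(program)
```

For example, `pre_tokenize("L = n + 10")` keeps `=` and `+` as individual tokens while collapsing `10` into one digit-run terminal.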

Program Structure in String Literals.
While not part of the main three problems we focus on, we also notice that Arvada treats the contents of string literals as program structure. For example, during initial tokenization Arvada tokenizes the 7-character input fragment "k␣:-)" into 7 nodes. It may then bubble and merge some of these nodes and get stuck. We thus want to distinguish string literals from program elements.
As this is not the paper's main focus, we use a simple heuristic that covers several scenarios that are common in many languages. Specifically, we notice that many languages wrap a string literal in single (') or double (") quotes. When it encounters either quote character, TreeVada thus groups all following characters until it again encounters the same quote character. While this scheme cannot handle all cases (e.g., escaped quote characters), it tokenizes common simple cases correctly, e.g., "k␣:-)" into three tokens: one per double-quote character plus one for the string literal's content.
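A minimal sketch of this quote-grouping heuristic (our own illustrative code; as noted above, it deliberately ignores escaped quotes):

```python
def tokenize_strings(text):
    """Group each quoted literal into three tokens: quote, content,
    quote. Unterminated quotes fall back to 1-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c in "'\"":
            end = text.find(c, i + 1)   # same quote character again
            if end == -1:               # unterminated literal
                tokens.append(c)
                i += 1
            else:
                tokens += [c, text[i + 1:end], c]
                i = end + 1
        else:
            tokens.append(c)
            i += 1
    return tokens
```

On the 7-character fragment `"k :-)"` this yields three tokens, matching the behavior described above.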

Pre-structuring Parse Trees Along Brackets
We observe that Arvada's first bubble-ranking generalization step is its most expensive, as it may rank in O(n⁴) all pairs of all possible sibling token sequences of the input programs. Many such bubbles are likely illegal as they cross a round/square/curly bracket "boundary" and thus violate a nesting rule that is common among languages. This becomes clear on the extreme example program of n = 2k + 1 tokens that starts with k round opening brackets followed by x and k round closing brackets. Arvada's first bubble generalization step ranks O(n⁴) bubble pairs. Most such bubbles cross a nesting boundary and thus the parser likely rejects them.
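The bubble-count arithmetic behind the O(n⁴) bound can be made concrete. The helper below is purely illustrative: it counts contiguous sibling subsequences of length at least two ("1-bubbles"), whose pairs ("2-bubbles") then grow quadratically in that count:

```python
def count_one_bubbles(n, max_len=None):
    """Number of contiguous subsequences of length >= 2 among n flat
    siblings: n*(n-1)/2 when length is unbounded. Squaring this count
    (all 2-bubble pairs) illustrates the O(n^4) growth."""
    max_len = max_len or n
    return sum(max(0, min(max_len, n - i) - 1) for i in range(n))
```

For 101 flat tokens there are already 5,050 1-bubbles, so ranking all 2-bubble pairs considers on the order of 25 million combinations.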
The deeper-structured the parse trees become via generalization steps, the cheaper each subsequent generalization step is. These later steps are cheaper not so much due to earlier grammar generalization but because a more-structured tree only permits shorter (and thus fewer) sibling token sequences. Our goal is thus to quickly convert parse trees from flat to richly structured, by essentially enforcing common nesting rules. In the above extreme nesting example, the nesting-implied parse tree consists of the root node plus a single stack of k layers of a single bracket-wrapped node, reducing O(n⁴) to a single O(n²) step, i.e., the upfront MergeAllValid.
Given the wide use of round/square/curly bracket-defined nesting, TreeVada pre-structures parse trees heuristically, likely without significantly impairing the inferred grammar's quality. Specifically, TreeVada makes one simple stack-based pass over each input program, initializing the stack (and the parse tree) with root t₀. TreeVada adds each token to the parse tree as the child of the current top-of-stack node. When encountering an opening ( [ { bracket, TreeVada first pushes a new non-terminal onto the stack. When encountering a matching closing bracket ) ] }, TreeVada then pops the top element off the stack. When brackets no longer match, TreeVada reverts to a flat tree. The subsequent attempt to merge all tree node pairs often merges some of these new rules with each other or with other nodes. For example, the Figure 3 run merges the bracket-implied t₁, t₂, and t₃ with the existing node n. If a pre-structured rule remains un-merged, it does not generalize the grammar. As it just adds slightly to grammar verbosity, we do not remove such rules.
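The stack-based pass can be sketched as follows, here building nested Python lists instead of parse-tree nodes (an illustrative simplification of TreeVada's actual data structures):

```python
OPEN = {"(": ")", "[": "]", "{": "}"}   # opener -> expected closer
CLOSE = {")", "]", "}"}

def pre_structure(tokens):
    """One stack-based pass: wrap each balanced bracket pair in a
    fresh nested node. Reverts to a flat tree on any mismatch."""
    root = []
    stack = [(root, None)]              # (children list, expected closer)
    for tok in tokens:
        if tok in OPEN:
            node = [tok]                # new non-terminal for this pair
            stack[-1][0].append(node)
            stack.append((node, OPEN[tok]))
        elif tok in CLOSE:
            if len(stack) == 1 or tok != stack[-1][1]:
                return tokens[:]        # unbalanced: fall back to flat
            stack[-1][0].append(tok)
            stack.pop()
        else:
            stack[-1][0].append(tok)
    return root if len(stack) == 1 else tokens[:]
```

For example, `(n+n)` becomes a single nested node, while a mismatched `( x ]` falls back to the flat token list.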
Arvada's motivating while example language uses brackets only lightly (i.e., only one rule in the Figure 1 golden grammar contains brackets). But even then there are several cases (e.g., for the Figure 2 input programs) where TreeVada is faster, infers a better grammar, or does both. Figures 2 and 3 show an example of the latter.

Removing Specialized Bubbling Heuristics
As TreeVada creates nesting structure upfront, there is less need for special cases and we therefore remove the following two rarely successful strategies Arvada uses.
1-bracket Bubbles: Pre-structuring the parse trees ensures that each sibling node sequence contains at most two round, curly, or square brackets. While this prevents a bubble from crossing concept nesting, we observe that a bubble containing just one bracket rarely generalizes the grammar correctly either. Not generating 1-bracket bubbles thus ensures that TreeVada never considers a bracket-unbalanced bubble.
Partial 1-char Node Merges: When Arvada cannot merge a given bubble with any interior node, it also tries merging the bubble with a subset of interior node instances that represent a terminal character, e.g., to merge the bubble with one ")" instance but not others. As both Arvada and TreeVada pre-tokenize their input programs, such special treatment of 1-char tokens rarely yields a successful merge, and TreeVada thus omits such partial merges.

Deterministic Grammar Inference
TreeVada addresses Arvada's sources of non-determinism. Several of these were easy to fix without significant performance degradation, just by switching the implementation to a deterministic data structure. Specifically, TreeVada orders the parse trees' unique node types by their shortest distance from any program root node (in ascending order) for two related operations: first, when trying to merge all node pairs upfront and after exhausting bubbling; second, when ordering merge-target nodes for a given bubble.
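Such a deterministic ordering can be sketched in a few lines; the function and parameter names below are ours, and the name tie-break is an assumption for the sketch:

```python
def order_merge_targets(node_types, root_distance):
    """Deterministic replacement for iterating a hash-ordered set:
    order unique node types by shortest distance from any program
    root (ascending), breaking distance ties by name."""
    return sorted(node_types, key=lambda n: (root_distance[n], n))
```

Unlike iterating a Python `set` (whose order depends on hashing), this yields the same merge-target order on every run.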
Similarly, when there are more than 50 candidate programs that exercise a candidate rule merge on one of the two merged rules, TreeVada makes Arvada's program sampling strategy deterministic, by switching to deterministic data structures and always using the same random number generator seed value. Finally, TreeVada makes Arvada's terminal expansion deterministic, by fixing the random number generator's seed value to sample 10 programs from the larger character class. Following is a more complex case, where TreeVada needed a new heuristic to compensate for Arvada's benefits from arbitrary order and randomness.
3.5.1 Depth- and Length-aware Bubble Ranking. To avoid Arvada's non-deterministic shuffling of the top-100 ranked bubbles, TreeVada builds on two observations. First, a longer bubble has a higher chance of being rejected because it tries to group together more nodes. A longer bubble, when accepted, also has a higher chance of getting grammar inference stuck. For example, in Figure 2 Arvada's last merge is a 5-node bubble t₃ → )␣;␣t₀. The bubble's use of ")" prevents ")" from being used as the closing bracket in an otherwise possible subsequent bubbling of ( t₂ + t₂ ).
Second, and more importantly, a more deeply nested bubble is more promising than a bubble closer to the root. The intuition is that a more deeply nested node sibling sequence is in a more specialized area of the input program that has more of its immediate surroundings already correctly structured via other rules. The likelihood of correctly generalizing such an already specialized area thus tends to be higher and, crucially, the impact of getting it wrong is lower, as it will only affect a relatively specialized program area.
TreeVada thus refines the bubble ranking by adding two new criteria: bubble depth and bubble length. We have found that with these additions the bubble ranking is more reliable and does not require shuffling to get to promising candidate bubbles early on. The resulting ranking scheme ranks by context similarity first, then resolves ties via bubble depth (i.e., the bubble occurrences' minimum root-distance), further ties via bubble occurrence counts, and additional ties via bubble length.
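Assuming per-bubble fields with the names below (our assumption, not TreeVada's actual field names), this refined ranking can be expressed as a single sort key:

```python
def bubble_rank_key(b):
    """Sort key mirroring the ranking described above: higher context
    similarity first, then deeper bubbles (larger minimum root
    distance), then more frequent bubbles, then shorter bubbles."""
    return (-b["context_similarity"],
            -b["min_root_distance"],   # deeper = ranked earlier
            -b["occurrences"],
            b["length"])               # shorter = ranked earlier

bubbles = [
    {"context_similarity": 0.9, "min_root_distance": 1, "occurrences": 7, "length": 2},
    {"context_similarity": 0.9, "min_root_distance": 4, "occurrences": 1, "length": 5},
]
ranked = sorted(bubbles, key=bubble_rank_key)
```

In the example, context similarity ties, so the deeper bubble wins despite occurring less often, exactly because depth is the first tie-breaker.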

Applying Learned Rules Recursively
After merging a bubble, Arvada does not recursively reapply the just-learned grammar rule. For example, in Figure 2 Arvada merged the bubble (n+n) with n into t₂, yielding rule t₂ → ( t₂ + t₂ ) | n. While Arvada proceeds by re-ranking all bubbles, TreeVada here instead directly tries to recursively reapply the just-learned rule as renamed by the merge (i.e., t₂ → ( t₂ + t₂ )).
For the Figure 2 scenario, TreeVada would group ( t₂ + t₂ ) under a new t₂ node, yielding fewer direct children under the second parse tree's root node (L␣=␣t₂␣;␣t₀). This cheaply adds to the parse trees' structure we have just accepted as correct. In this case it would also prevent Arvada's last bubble step, which gets Arvada stuck by breaking the parse tree's nesting structure.
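Recursive reapplication can be sketched on a nested-list tree encoding (the encoding and names are our illustration, not TreeVada's implementation):

```python
def reapply_rule(tree, rhs, lhs):
    """Recursively rewrite every occurrence of the just-learned rule
    lhs -> rhs in a nested-list parse tree, bottom-up: any sibling
    run matching rhs is grouped under a new lhs node."""
    if not isinstance(tree, list):
        return tree
    children = [reapply_rule(c, rhs, lhs) for c in tree]
    out, i, n = [], 0, len(rhs)
    while i < len(children):
        if children[i:i + n] == list(rhs):
            out.append([lhs] + list(rhs))   # group matched siblings
            i += n
        else:
            out.append(children[i])
            i += 1
    return out
```

Applied to a flat sibling list containing `( t2 + t2 )` with the rule t₂ → ( t₂ + t₂ ), it groups those five siblings under one t₂ node, leaving fewer direct children under the root.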

EVALUATION
Overall we would like to get a better understanding of how TreeVada compares with the state-of-the-art approach Arvada, both on very small and slightly larger input programs. While the larger input programs may not yet be representative of how a user would want to apply these approaches on other languages, they at least give us a glimpse of the scalability of the compared approaches. We thus seek to answer the following research questions.
RQ0 Baseline: How does non-determinism affect Arvada?
RQ1 Grammar quality: At similar runtime, does TreeVada infer better grammars than Arvada?
RQ2 Runtime: When inferring grammars of similar quality, does TreeVada have a lower runtime than Arvada?
RQ3 Readability: How compact are the inferred grammars?
RQ4 Ablation study: How do TreeVada's components influence its resource consumption and grammar quality?
To ease comparison, we run our experiments in Arvada's Docker image¹. Specifically, from the image we reuse the Arvada, black-box parser, grammar-sampler, random program generator, and ANTLR4 parser generator [33] binaries. From the image we also use the languages' existing 1k test programs. The following summarizes the metrics we reuse from Arvada's work.
Precision: From each Arvada-/TreeVada-inferred grammar we sample 1k programs and count how many of these 1k programs the respective existing "golden" black-box parser accepts.
Recall: We compile the Arvada-/TreeVada-inferred grammar into a parser and count how many of the given 1k ("golden") test programs that parser accepts.
F1 score: As usual, the F1 score is the harmonic mean of precision and recall and ranges from 0 (zero precision or zero recall) to 1 (both perfect precision and perfect recall).
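For reference, the harmonic-mean computation can be written out directly (illustrative code, following the definition above):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 if either is 0."""
    if precision == 0 or recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, perfect precision with 50% recall yields an F1 score of 2/3, not the arithmetic mean of 0.75: the harmonic mean penalizes imbalance between the two.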
Runtime: The main measure is the Arvada/TreeVada runtime, which does not include computing precision or recall.
Averages: We follow Arvada [22] in comparing a deterministic technique's result with the average of 10 non-deterministic runs. The latter estimates what a user may expect from running Arvada once. We also plot each of our Arvada and TreeVada runs.

RQ0: Timeouts vs. Precision Results
To get a sense of non-determinism's effect on Arvada, we first reproduce ("different team, same experimental setup")² Arvada's main results [22, Table 1], i.e., runtime and F1 score. We use Arvada's Docker image (same Arvada configuration options, etc., including the same ("seed") training input programs) and reran all languages from Arvada's experiment 10 times, as in the Arvada paper. Here we used a 24GB RAM Ryzen-9 5900HX 3.30GHz CPU laptop. Table 1 summarizes the results. First, in our rerun the runtime was consistently lower (likely due to the different machine). For the main grammar quality measure (F1 score), the average over all 11 languages was similar (80.8% vs. our rerun's 79.3%). For individual languages the impact was larger. For example, tinyc's average F1 score dropped from 81% to 69%. It may thus be misleading to compare Arvada performance across languages. For example, while in the earlier study Arvada produced much better grammars for tinyc than for curl (81% vs. 68%), this difference all but disappeared in our rerun (69% vs. 68%).
From the source code we learnt that when calling a black-box parser, Arvada's precision calculation enforces a 10s timeout. While the programs are just a few dozen tokens, arith's "golden" parser timed out for many programs sampled from the Arvada-inferred grammars, with parse time growing quickly with program size. On one example arith run, 76/1k programs timed out. Arvada's metric tool unfortunately treats a timeout as if the parser accepted the program, which very likely corrupts precision (and thus F1 score). As we could not easily solve the problem (e.g., by increasing the timeout by 10×), we exclude arith from tool comparisons.
Excluding a language for unreliable measurement is not meant to avoid running TreeVada. On the Table 1 arith seeds, TreeVada "scores" 100% precision via Arvada's metric tool. Due to the metric tool's parser timeout treatment, this result is equally unreliable.

Experimental Setup for RQ1 to RQ4
While the Arvada work carefully constructed minimal input programs that cover all rules of a given golden grammar, we aim to emulate a more realistic scenario where the user does not have a golden grammar and thus cannot construct a minimal set of minimal input programs. Hence we rely on the Arvada Docker image's 1k test programs. These programs are not guaranteed to cover all golden grammar rules. For the first 5/8 languages, we randomly pick seed inputs from this pool of 1k programs (which may slightly inflate recall but still allows comparing TreeVada with Arvada). As in the Arvada work, for the next 3/8 languages (curl, tinyc, and nodejs) we do not have a golden grammar. We thus use the same 3rd-party random program generators the Arvada work used (with their default settings) to create new seed programs. For all 8 languages the new input sets may thus not cover all golden grammar rules. We call this random input set R1.
Table 2 compares the R1 input programs with the ones used in the Arvada work: the first 5 languages had handpicked seeds ("H") and the next 3 used 3rd-party generators ("R0"). We focus on the token counts via TreeVada's tokenization scheme, as it only differs in how it treats " and ', which only a few input programs contain. Compared to the Arvada study, the average token count tends to be larger, for most of the first 5 languages by an order of magnitude. Especially the largest programs are significantly larger.
To explore larger input programs we generate another set R5, using the same generators used for tinyc [16] ("tinyc-500") and nodejs [32] ("nodejs-500"). Here we skip programs under 200 characters long. Table 2 shows that tinyc-500 and nodejs-500 programs are on average 5× larger than tinyc and nodejs programs by token count. For each language, the R1/R5 seed (and test) programs either both did (json, xml, curl, js, js-500) or both did not contain some programs with quotes. All R1/R5 seed (and test) programs had brackets, except for xml. Curl had one seed program with unmatched brackets.
Here we run each experiment on an EPYC Milan 7763 64-core CPU machine in TACC's Lonestar6 cluster³, which does not support Docker. We thus recreated the Docker image's setup as closely as possible (same oracle binaries, etc.). After removing arith, we removed the parser timeouts. First, we removed a 3s parser timeout Arvada used for its grammar inference. This should improve the quality of Arvada-inferred grammars, as Arvada no longer has to interpret a parser timeout as "parsed ok".
After removing the 3s parser timeout, sampled programs of fol and math also got stuck (for at least 30 minutes each) in their "golden" parsers (due to poorly written grammars). To protect the integrity of the precision calculation we also removed fol and math. For the precision calculation we could then remove its 10s parser timeout, which allowed us to just use the first 1k programs sampled from an Arvada-/TreeVada-inferred grammar (Arvada's evaluation [22] silently discarded any sampled program over 300 characters).
We started all experiments with 32GB RAM. As Arvada's grammar inference ran out of memory on lisp, nodejs, and nodejs-500, for these three experiments we then used an otherwise identically configured 256GB RAM machine.

RQ1, RQ2: Precision, Recall, F1, Resources
Table 3 shows the main evaluation results (on R1 and R5). Across all 10 experiments, TreeVada on average both produces better grammars and is faster than Arvada, i.e., TreeVada has a 9.3% higher recall, a 22.1% higher precision, a 19.5% higher F1 score, and a 2.4× speedup over Arvada. In 9/10 experiments TreeVada inferred a grammar of the same or higher quality than the average Arvada run (i.e., TreeVada's F1 score was at least as high as Arvada's).
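The evaluation metrics above can be sketched as follows. This is our own minimal illustration, not the tools' evaluation harness; the parser callbacks and sample lists are hypothetical placeholders. Precision is the fraction of programs sampled from the inferred grammar that the golden parser accepts, recall is the fraction of "golden" test programs that a parser for the inferred grammar accepts, and F1 is their harmonic mean:

```python
# Sketch of the evaluation metrics (hypothetical helper names).

def precision(golden_parses, inferred_samples):
    # golden_parses: program -> bool (the black-box golden parser);
    # inferred_samples: programs sampled from the inferred grammar.
    return sum(map(golden_parses, inferred_samples)) / len(inferred_samples)

def recall(inferred_parses, golden_tests):
    # inferred_parses: program -> bool (parser for the inferred grammar);
    # golden_tests: the held-out "golden" test programs.
    return sum(map(inferred_parses, golden_tests)) / len(golden_tests)

def f1(p, r):
    # Harmonic mean of precision and recall; 0 if both are 0.
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```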
The outlier is curl, where TreeVada's 72% F1 score is slightly below Arvada's 78%. curl has brackets but does not use them for nesting. TreeVada's attempts to pre-structure the input programs' parse trees thus either fail quickly during the initial pass over the input programs or get TreeVada stuck with a sub-optimal grammar.
Compared to our reruns of the earlier study (Table 1, middle), Arvada increased its F1 score on 2/8 languages, i.e., for curl from 68 to 78% and for while from 83 to 100%, likely due to non-determinism and differences in input programs. On the other hand, switching from hand-picked minimal programs to randomly selected input programs here may have contributed to lowering Arvada's F1 score on 4/8 other languages: json from 97 to 79%, xml from 88 to 76%, tinyc from 69 to 62%, and nodejs from 39 to 12%.
At the same time, in all 10 experiments TreeVada was faster than the average Arvada runtime. For 6/10 experiments TreeVada was at least twice as fast as Arvada's average run. On the larger input programs tinyc-500 and nodejs-500 (which have about double the total number of input tokens as tinyc and nodejs), TreeVada remains faster than the average Arvada run and achieves higher F1 scores, i.e., 63 vs. 67% on tinyc-500 and 12 vs. 46% on nodejs-500 (Arvada vs. TreeVada). For scalability it is further interesting to consider the group of 7/10 experiments with the highest total token counts in their input programs, i.e., lisp, turtle, while, tinyc, nodejs, tinyc-500, and nodejs-500. In these 7 experiments TreeVada has either much better grammar quality at similar runtime, much better runtime at similar grammar quality, or both much better grammar quality and runtime. For example, on the experiment with the highest total token count (tinyc-500, with some 4.2k total tokens), TreeVada's F1 score is 67% vs. Arvada's 63% while using less than a quarter of the time. To better understand the differences between Arvada and TreeVada, we further compare the following metrics.
Bubble Ranking Time: Both Arvada and TreeVada rank their potential grammar generalization steps (aka bubbles).The main difference is that TreeVada tries to omit from this ranking bubbles that may violate common bracket-defined nesting rules.
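To illustrate the nesting constraint, the following sketch (our own simplification, not TreeVada's actual code) checks whether a candidate bubble, i.e., a contiguous token span, respects bracket-defined nesting: every bracket opened inside the span must close inside it, and no closer inside the span may match an opener outside it.

```python
# Sketch of a bracket-nesting check for candidate bubbles.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def respects_nesting(tokens, start, end):
    stack = []  # expected closers for brackets opened inside the span
    for tok in tokens[start:end]:
        if tok in PAIRS:
            stack.append(PAIRS[tok])
        elif tok in CLOSERS:
            if not stack or stack.pop() != tok:
                return False  # closes a bracket opened outside the span
    return not stack  # no bracket may stay open past the span's end
```

For example, in the token list of `(a)+(b)` the span `(a)` respects nesting while `)+(` does not; bubbles of the latter kind could be omitted from the ranking.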
String Sampling Time: Except for the use of non-determinism, Arvada and TreeVada use the same scheme for sampling programs that exercise a proposed grammar rule merge.
Oracle Calls & Time: Arvada and TreeVada call an external parser during grammar inference in the same way.
Memory Use: We measure the peak memory use during grammar inference via Linux's time command.
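For reference, GNU time's verbose mode reports the child process's peak resident set size as "Maximum resident set size"; the same figure can be read programmatically, as in this Python sketch (Linux-specific; the child command is a stand-in for an inference run, and `ru_maxrss` is in kilobytes on Linux):

```python
import resource
import subprocess

# Run a child process (stand-in for a grammar-inference run), then read the
# peak resident set size of completed children, the number that
# /usr/bin/time -v prints as "Maximum resident set size".
subprocess.run(["python3", "-c", "x = bytearray(50_000_000)"], check=True)
peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak RSS of children: {peak_kb} KB")  # KB on Linux
```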
Results: TreeVada consistently uses the same or less memory than Arvada. In 3/10 experiments this difference is one (nodejs, nodejs-500) or even two (lisp) orders of magnitude. Figure 4 reinforces that the source of this difference is the large difference in time spent on bubble ranking. Given that in these three experiments TreeVada also yields significantly higher F1 scores, much of Arvada's bubble ranking is clearly counter-productive, as it prioritizes bubbles that ultimately get the grammar inference stuck.
Bubble ranking is the dominating time expenditure for 4/10 of the Arvada experiments, but in none of the TreeVada experiments. Instead, the external parser dominates TreeVada's overall runtime in 7/10 experiments, and in 6/7 of these cases by a large margin. In the remaining 3/10 cases TreeVada's bottleneck is string sampling.

RQ3: Grammar Readability
Neither Arvada nor TreeVada attempts to simplify its inferred grammars. They just export the state of the grammar when they cannot find any additional grammar generalization steps. Since there are use-cases involving human consumption, such as program understanding, it is still interesting to determine if higher grammar quality comes at the expense of larger grammars.
A related question is whether a larger grammar for a given language is perhaps structured more efficiently for parsing, i.e., in parser runtime and memory consumption. To explore these two related questions we thus measure the following two metrics.

Figure 5: F1 score of 10 Arvada (-) and TreeVada (◀) runs on hand-picked (H [22]) and random seeds (R0 [22], R1, R2, R5).
Grammar Size: We count a grammar's unique non-terminals, unique terminals, number of rules (i.e., rule alternatives), and each rule's length (i.e., the length of a rule's right-hand-side sequence of terminals and non-terminals). The grammar's size is then the sum of its rule lengths.
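The size metric can be sketched as follows, assuming a grammar is represented as a dict from each nonterminal to its list of alternatives, each alternative a list of terminal/nonterminal symbols (this representation is our own illustration, not the tools' internal one):

```python
# Grammar-size sketch: grammar = {nonterminal: [rhs, ...]}, each rhs a list
# of symbols; size = sum of all right-hand-side lengths.

def grammar_size(grammar):
    return sum(len(rhs) for alts in grammar.values() for rhs in alts)

def grammar_counts(grammar):
    nonterminals = set(grammar)
    terminals = {sym for alts in grammar.values() for rhs in alts
                 for sym in rhs if sym not in nonterminals}
    n_rules = sum(len(alts) for alts in grammar.values())  # rule alternatives
    return len(nonterminals), len(terminals), n_rules

# Toy grammar S -> "(" S ")" | "x": size 4, 1 nonterminal, 3 terminals, 2 rules.
toy = {"S": [["(", "S", ")"], ["x"]]}
```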
Parse Time & Memory: We measure the total time required and peak memory used to parse the 1k "golden" test programs using a parser generated from an Arvada-/TreeVada-inferred grammar.
Results: Table 4 gives an overview of grammar size and parse performance. Despite covering more of the golden grammars (higher recall) and having higher F1 scores, TreeVada's grammars are smaller for 8/10 languages. The biggest difference is lisp, where TreeVada's grammar is less than one sixth the size of the average Arvada grammar. For json, TreeVada's grammar size equals Arvada's average grammar size. The only outlier is xml (157 vs. 149), where TreeVada has a significantly higher F1 score.
Arvada's larger grammars do not improve parsing performance, as there is no experiment in which Arvada's average parse time or memory use is lower than TreeVada's. On the contrary, for 6/10 experiments TreeVada's parse time is less than half of Arvada's.

Performance Variance Across Seed Sets: R2
To compare performance across seeds we generate a fresh round of random input program sets ("R2") in the same style as for R1.
Table 2 shows R2's size metrics. In R2 all seed sets contain some brackets, except for xml and curl. For space we only plot their individual results (with the Table 1+3 results for context), i.e., F1 score in Figure 5 and runtime in Figure 6. TreeVada's results were stable compared to Arvada's in the sense that across R1, R2, and R5 (except for curl on R1) TreeVada's F1 score was better than the average Arvada F1 score and sometimes better than every Arvada F1 score. Similarly, across all R1, R2, and R5 runs, in the single run where Arvada was faster than TreeVada (js-500 = R5 nodejs) Arvada's F1 score was zero. Specifically, on the R1 and R2 seeds TreeVada's F1 score was better than all Arvada runs for both while and xml. Across all seed sets, TreeVada had a better F1 score than at least 9/10 Arvada runs for nodejs. Notable outliers are TreeVada's while and xml runs on the Arvada work's seeds (H). Since the while language uses C-style bracket nesting, TreeVada does not seem overly overfitted to C-style programming languages.

RQ4: Ablation Study
Here we explore the impact of TreeVada's components, using Table 3's experimental setup. We first make Table 3's Arvada deterministic (Section 3.5). For each language, the resulting Table 5 and the following ablation tables show the average of 10 runs. For space we omit standard deviations (all very small or zero).
Making Arvada deterministic yielded several interesting effects. For example, for lisp recall and F1 score are down, runtime is up, parser calls are down, and memory consumption is up, each by about a factor of two. Overall, though, average F1 scores improved while runtimes slowed down. This indicates that some of Arvada's non-deterministic runs got stuck in sub-optimal grammars and thus terminated relatively quickly.
From Table 5 to Table 6 we add recursive rule application (Section 3.6). For most languages this change had little impact on F1 scores. The notable exceptions are nodejs, tinyc-500, and nodejs-500, which all had lower F1 scores and a faster runtime. Since the initial bracket-implied parse trees are not imposed here and partial merges are used, this version still has bubbles breaking nesting rules. Reapplying such learned nesting-breaking rules does not improve F1 scores (but may get the grammar stuck relatively quickly).
From Table 6 to Table 7 we add bracket-based initial parse trees (Section 3.3) and ignoring likely string literal contents (Section 3.2). Adding these two features makes the F1 score of lisp jump from 0.34 to 0.99, with a decrease in memory use from 58.7 to only 0.07GB. Additionally, lisp's runtime drops from 11.4k to 0.47k seconds. All 10/10 experiments improved in runtime, and languages with more nesting structure benefited the most.
From Table 7 to Table 8 we remove partial merges. The F1 scores tend to improve. Especially nodejs's F1 score spiked from 0.09 to 0.56, which indicates that partial merges were responsible for some invalid merges, which essentially blocked learning. Finally, from Table 8 to TreeVada in Table 3 we add the new bubble ranking scheme (Section 3.5.1), which is overall neutral on F1 scores but reduces runtimes.

THREATS TO VALIDITY
We briefly summarize key threats to internal and external validity.
Threats to external validity: Grammar inference tools often show mixed performance with different seeds [6]. Also, relying on hand-crafted small toy programs as seeds does not replicate real-world situations. To overcome these challenges, we randomly sample the seed sets R1, R2, and R5 for our experiments.
Threats to internal validity: TreeVada's pre-structuring may fail if brackets serve language-specific purposes other than nesting. For example, in xml names enclosed in angle brackets < > are used for nesting, whereas in Java or C++ these are mainly used for relational or bit-manipulation operators. We have carefully chosen only three bracket types for pre-structuring parse trees, as these three are the most commonly used for nesting [41]. For terminal expansion, synthesizing regular expressions for the terminals would give a more robust grammar, which is future work.

RELATED WORK
Grammatical inference is important as the inferred grammar can serve many tasks [39] when the language's golden grammar is unknown. Rather than probabilistic learning [24,40], active learning [3] is a good fit, as a parser can serve as a minimally adequate teacher (MAT or oracle). Even as a black box, the MAT/oracle can answer membership queries. Recent work [5,22,42] follows this setting to infer grammars. Following is other related work.

Linguistics
The linguistics community has developed several negative [4,12,13] and positive results on grammar inference in a wide variety of settings, i.e., for various grammar classes (including context-free), oracles, and availability of additional kinds of input [36]. GRIDS [23] starts with flat parse trees and iteratively bubbles and merges rules. Maybe most closely related to Arvada is applying a GRIDS-like approach iteratively [46] and thus sampling new inputs from an updated grammar. Maybe most closely related to TreeVada are Sakakibara's techniques for inferring a subclass of context-free grammars when given positive examples with their complete (but unlabeled) parse trees plus a parser-like oracle and a grammar equivalence oracle [34,35]. We observe that a program's bracket-implied nesting structure can be captured by a Dyck language [8], i.e., a context-free language comprising only balanced brackets. TreeVada's pre-structured parse trees can thus be seen as instances of the Dyck language D3 on the alphabet (, ), [, ], {, }. An interesting property is that a Dyck language captures the "non-regular essence" of a context-free language. Specifically, the Chomsky-Schützenberger representation theorem [10] says any context-free language can be mapped to the intersection of a Dyck language and a regular language.
For TreeVada, this pre-structuring is just a startup heuristic. For example, in subsequent steps TreeVada may discover additional balanced structures defined by some other opening and closing terminal pair (which taken together may then be represented by a D4 language). Similarly, TreeVada may not be able to merge rules in the initial parse trees.
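A minimal sketch of such bracket-implied pre-structuring (our own simplification: single-character tokens, only the three bracket pairs, no quote handling): each matched bracket pair and the tokens between them become one nested subtree, yielding a Dyck-style initial parse tree.

```python
# Sketch: build a bracket-implied parse tree, with nested lists as subtrees.
PAIRS = {"(": ")", "[": "]", "{": "}"}

def prestructure(tokens):
    root = []
    stack = []  # (parent_node, expected_closer) for each open bracket
    cur = root
    for tok in tokens:
        if tok in PAIRS:
            child = [tok]            # subtree starts with its opening bracket
            cur.append(child)
            stack.append((cur, PAIRS[tok]))
            cur = child
        elif stack and tok == stack[-1][1]:
            cur.append(tok)          # closing bracket ends the subtree
            cur, _ = stack.pop()
        else:
            cur.append(tok)          # plain token (or an unmatched closer)
    if stack:
        raise ValueError("unbalanced brackets")
    return root
```

For example, `prestructure(list("{a(b)}"))` groups `(b)` as a subtree nested inside the `{...}` subtree.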

Deep Learning
Several reports have been negative on using deep learning for inference of a context-free grammar (CFG) [37,43]. RNNs lack the ability to learn concrete hierarchical rules, leading to a decline in generalization with increasing input length and recursion depth [7,44]. LSTMs learn statistical approximations, not a deterministic rule-based solution [37]. Even state-of-the-art attention-based Seq2Seq models struggle to understand CFGs [44]. Arvada showed significantly better precision than an LSTM-based model (measuring recall is not possible). Deep learning's under-performance may be due to solely relying on (many) input samples. Glade, Arvada, and TreeVada use active learning [3], where an additional oracle guides the learning process. Another significant deep learning limitation is that it does not produce an explicit grammar, which also makes it hard to measure recall and F1 score.

Grey-box Grammar Inference
Grey-box grammar inference approaches do not make use of the entire parser source code. GRIMOIRE [9] is a grey-box fuzzing tool that makes use of the parser's coverage information. GRIMOIRE synthesizes a grammar-like structure of the inputs while fuzzing.

White-box Grammar Inference
White-box grammar inference approaches utilize the parser source code to extract input grammars that follow the structure of the input. Lin et al. [25,26] proposed the first white-box method, which recovers parse trees from inputs using static and dynamic analysis. Autogram [19], introduced by Höschele et al., adopts another white-box approach that tracks the dynamic data flow between program variables to infer an approximate context-free grammar. Mimid [16] by Gopinath et al. infers a grammar by leveraging dynamic control flow and tracking input character access across parser locations.

CONCLUSIONS
Black-box context-free grammar inference is a hard problem, as in many practical settings it only has access to a limited number of example programs. The state-of-the-art approach Arvada heuristically generalizes grammar rules starting from flat parse trees and is non-deterministic to explore different generalization sequences. We observe that many of Arvada's generalization steps violate common language concept nesting rules. We thus propose to pre-structure input programs along these nesting rules, apply learnt rules recursively, and make black-box context-free grammar inference deterministic. The resulting TreeVada yielded faster runtime and higher-quality grammars in an empirical comparison. The TreeVada source code, scripts, evaluation parameters, and training data are open-source and publicly available.

Figure 4 :
Figure 4: Average (and standard deviation) of time spent on ranking bubbles, sampling strings, and in the black-box parser. Each value is normalized by dividing by Arvada's average total runtime for that language (R1, R5 seeds); A = Arvada; T = TreeVada.

Figure 6 :
Figure 6: Runtime of the Figure 5 runs; H/R0 runtimes are omitted as those were run on different machines.

Table 8 :
Table 7 + Remove partial merges.