Diffy: Data-Driven Bug Finding for Configurations

Configuration errors remain a major cause of system failures and service outages. One promising approach to identify configuration errors automatically is to learn common usage patterns (and anti-patterns) using data-driven methods. However, existing data-driven learning approaches analyze only simple configurations (e.g., those with no hierarchical structure), identify only simple types of issues (e.g., type errors), or require extensive domain-specific tuning. In this paper, we present Diffy, the first push-button configuration analyzer that detects likely bugs in structured configurations. From example configurations, Diffy learns a common template, with "holes" that capture their variation. It then applies unsupervised learning to identify anomalous template parameters as likely bugs. We evaluate Diffy on a large cloud provider's wide-area network, an operational 5G network testbed, and MySQL configurations, demonstrating its versatility, performance, and accuracy. During Diffy's development, it caught and prevented a bug in a configuration timer value that had previously caused an outage for the cloud provider.


INTRODUCTION
Ensuring the correct operation of software systems is critical to avoid downtime and performance degradation. While many techniques exist to find bugs in software, increasingly systems are built by assembling existing software artifacts whose behavior is customized through specialized configuration languages [9,10,18,27,29,31,49,50,60]. For instance, routing protocols that control how packets move across the Internet [49], application orchestration frameworks [18,27] that manage critical infrastructure, and databases backing popular web services [10] are all managed through extensive configuration.
To combat misconfiguration, one line of research has explored the use of data-driven learning methods to identify possible errors as anomalies or deviations from normal usage based on a training corpus of example configurations [12,50,51,60]. This approach is appealing because it requires no effort from users to provide specifications of correctness, which may be difficult to obtain or even unknown to the system operators [31].
While the prospect of automatically identifying configuration bugs without specifications is alluring, current incarnations of this approach have several major limitations. First, a majority of them [50,51,60] apply only to "flat" configurations whose format comprises a bag of key-value string pairs. While useful in some simple settings, many configurations include richer structures such as sets, lists, tables, maps, objects, or even specialized domain-specific languages (DSLs) [2,31]. Second, prior work requires that users incorporate extensive domain-specific knowledge into their tools. For instance, parts of configurations are frequently represented using ad hoc string data formats. Existing tools typically identify bugs by checking if these strings conform to a set of user-defined types (e.g., a file path, an IP address [31,60] or that "ON" and "OFF" are boolean [50]). As a result, they either learn little from data that does not conform to predefined types or require that users arduously add such domain-specific knowledge.
In this paper, we present Diffy, the first push-button approach to automatically find likely bugs in arbitrary JSON configurations. We target the JSON [7] format due to its pervasive use in practice, its ability to represent complex structure (e.g., nested lists and objects), and its ease of conversion from other configuration formats (e.g., YAML, XML). While the focus of the paper is JSON, the techniques we present are general and can apply to other hierarchically structured data formats. Diffy requires no user specifications; its input includes only (1) a set of configuration files and (2) an optional set of regular expression tokens to help guide its analysis. From these inputs Diffy synthesizes a template, which is another JSON file that captures the similarities and differences in the configurations. The template contains "holes" that, when filled in with specific values, will recreate the original configurations. Using the synthesized template, Diffy then employs a state-of-the-art unsupervised anomaly detection algorithm known as isolation forest [37] to identify likely bugs in configurations as those with anomalous parameters for these "holes".
At the core of Diffy's approach is a novel template synthesis algorithm that attempts to extract similarities between JSON configuration elements. For a given set of configurations, many valid templates exist. Efficiently finding a "good" template, i.e., one that is neither too complex nor too general, is challenging. We define a cost metric to approximate how well a template matches a set of configurations, and then formulate the synthesis problem as a dynamic programming algorithm. The algorithm is related to the idea of string edit distance, but instead finds the minimal regular-expression-aware edit distance between strings. Diffy finds a template for other data structures, e.g., lists of lists of strings, by recursively evaluating element template costs and selecting the most suitable template type, such as ordered, unordered, or repetitive.
We assess Diffy against a variety of real configurations, including those from a large cloud provider's wide-area network (WAN), an operational 5G radio access network (RAN), and MySQL database configurations mined from GitHub. Our results show that Diffy: (1) generalizes across domains as the first automatic bug finder applicable to arbitrary JSON configurations; (2) scales well, analyzing thousands of configurations comprising millions of lines within seconds to a few minutes, exhibits a near-linear scaling trend, and outperforms existing data mining tools by 2-3 orders of magnitude; (3) achieves high precision (up to 97% on the WAN and RAN); and (4) identifies issues comparable to domain-specialized tools.

During Diffy's development, it automatically identified a bug in a protocol timer configuration value that had previously caused a large outage for the cloud provider's WAN and that was silently reintroduced by faulty automation.
Contributions: In summary, this paper's contributions are to:
• Define the problem of statistical bug finding for configurations in terms of synthesizing a low-cost template to fit the configuration data.
• Solve the template synthesis problem with a novel dynamic programming formulation based on the minimal regular-expression-aware string edit distance and then generalize this approach to richer configuration structures.
• Employ unsupervised learning to identify the most likely bugs from configuration differences based on the synthesized template.
• Implement Diffy based on these ideas, the first tool to automatically detect likely bugs in arbitrary JSON configurations.
• Demonstrate the versatility, performance, and precision of Diffy on three diverse configuration datasets (wide-area network, radio access network, and database).

OVERVIEW OF DIFFY
Consider the example configurations used to define networking policies for a wide-area network (WAN) shown in Figure 1. For this example, we picked 48 different configuration files, one from each router, to provide to Diffy. These configurations are parsed from vendor-specific formats (e.g., Cisco [9], Juniper [29]), and define low-level settings for various routing protocols, access control policies, queueing and performance, and more. In practice, these configurations are often several tens of thousands of lines long, and thus not easily amenable to human review. For the sake of exposition, we have highly simplified these configurations for the example.
Challenges for Diffy. Analyzing configuration files like those for this WAN poses several challenges. First, they make use of complex and domain-specific structure. Route policies (e.g., "Policy"), for example, are defined using a domain-specific policy language for matching and modifying layer 3 protocol routes. These policies consist of nested JSON structures whose particular order, size, and elements may or may not be semantically relevant depending on the context. For instance, the order of the declared route length matches ("LenMatch") does not matter, while the order of the "Actions" applied to the routes is important. Similarly, the policies are defined using nested objects with different keys, some of which may be missing or intentionally different across configurations. Existing data-driven methods cannot analyze such structure (see §7).
Second, an abundance of data exists in ad hoc data formats. For example, route policy actions like autonomous system (AS) number prepending to routes (e.g., "prepend 65000 65000") and route length filtering (e.g., "/12−/24" and "le 8") employ unconventional data formats that hinder existing configuration bug finding tools. These tools will learn little to nothing from strings that do not conform to a predefined set of types such as an integer, boolean, or a file path [36,50,51,60]. In theory, it is possible to enhance these tools by incorporating specialized parsers, types, and rules tailored to specific domains. This is the approach taken by SelfStarter [31], which models domain-specific concepts such as IP address octets and table reordering rules. However, incorporating such information requires substantial human effort and engineering. Router vendors, as one example, support thousands of distinct configuration parameters, presented in a variety of formats [9,29]. Furthermore, this process needs to be repeated for each new configuration type that one aims to analyze, thereby restricting the tool's broad applicability.
Synthesizing templates. Diffy takes the configurations in Figure 1 as input, along with regular expression tokens of interest. For the example, we include a single token describing a number
For the SNMP source interface ("SnmpSourceIface") configuration segment, Diffy identifies a "choice" template indicating that the value is either for a management ("Management1") or loopback ("Lo0") interface, with the former appearing in 45 out of 48 files.As there is no simple pattern to combine these choices, they are left separate.
For the route length filters list ("LenMatch"), Diffy learns a "@type:repeat" template where all the elements of the list are well summarized with one of two patterns: "/[num]−/24" or "le [num]". On the other hand, for the "StaticRoutes", Diffy identifies that the better template for these values is an unordered multiset ("@type:unordered"). One element of the multiset is always "10.1.2.0/24" while a second element is optional and only appears in three configurations. In those three, Diffy has learned that the value of the second list element is "10.3.19[num].0/24" and that the value for the "hole" is different for each configuration.
For the route policy actions ("Actions"), Diffy synthesizes an ordered sequence template ("@type:ordered") consisting of a prepend action followed by an "accept" action that permits the route.Diffy also correctly learns the ad hoc data format for the action that prepends autonomous system (AS) numbers to Border Gateway Protocol (BGP) routes as "prepend 65000 [num]", and understands the values captured by each [num] token as well as their frequencies.In the example, the value "65000" appears in 43 configurations, "65001" in two and "2" in three.Despite lacking prior knowledge of Autonomous System (AS) path prepending, Diffy successfully extracts a helpful pattern from the data to detect anomalies as we will see next.
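To make the idea of a "hole" and its value frequencies concrete, the following is a minimal sketch, not Diffy's implementation. It assumes the learned pattern "prepend 65000 [num]" can be written as an ordinary regular expression with one capture group, and it fabricates the list `actions` purely to mirror the frequencies quoted above (43, 2, and 3 out of 48).

```python
import re
from collections import Counter

# Hypothetical data: one "prepend" action per router configuration, using the
# frequencies stated in the text (43 + 2 + 3 = 48 configurations).
actions = (
    ["prepend 65000 65000"] * 43
    + ["prepend 65000 65001"] * 2
    + ["prepend 65000 2"] * 3
)

# The template "prepend 65000 [num]" as a regex; the capture group is the hole.
template = re.compile(r"prepend 65000 ([0-9]+)")

# Collect the value that fills the hole in every configuration.
hole_values = Counter(template.fullmatch(a).group(1) for a in actions)

for value, count in hole_values.most_common():
    print(value, count)
```

The resulting table of hole values and counts is the kind of input that the anomaly detection step consumes.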
Finding bugs with unsupervised learning. After synthesizing a configuration template, Diffy employs unsupervised learning [37] to identify template parameter anomalies, which are shown for the example in the bottom left of Figure 1. The first anomaly captures that the value "2" for the 0th parameter of the 0th list element in the "Actions" list template is a likely bug, since most configurations use value "65000" or "65001". Indeed, this is a misconfiguration, likely due to different configuration vendors implementing different AS path prepending semantics. An operator might have expected the second argument of the "prepend" action to mean how many times to prepend the AS number 65000 (twice) to the BGP AS path, when instead the configuration expects a space-separated list of AS numbers to prepend to the path. Such a mistake can cause other networks to erroneously filter these routes, reducing Internet connectivity. In this case Diffy correctly marks the more frequent "2" as an anomaly and not "65001" (another private AS number) because "2" is more distant from other values. Diffy leverages a state-of-the-art anomaly detection algorithm (§4) that is able to identify such anomalies and achieve consistently accurate detection results.
The second anomaly states that "SnmpSourceIface" has pattern "Lo0" instead of "Management1", and the last anomaly indicates that only 3 out of the 48 configurations have a second static route. Diffy reports all the anomalies with an anomaly likelihood score above a user-provided threshold.
Why it works and limitations. Diffy works on the principle of "bugs as deviant behavior" [13], where bugs are often easily identified as behavior that deviates from the normal. Template synthesis and unsupervised learning go hand-in-hand to identify only the most likely bugs and thereby reduce false positives. Synthesized templates, even when imperfect, concisely summarize all the differences across the configurations, and unsupervised learning effectively sifts through these differences to find the most likely bugs while filtering noise. For instance, in the three configurations that have a second static route, the pattern learned for the route is "10.3.19[num].0/24". However, since the value for the "hole" is different in each configuration and the evidence (3 examples) is low, the unsupervised learning algorithm marks these differences as unlikely to be bugs and filters them out. In this way, Diffy often avoids reporting intentional differences as bugs.
Diffy requires the existence of configurations with at least some overlap to be effective, as configurations with no similarity provide little opportunity for learning. Diffy also currently limits its analysis to bugs that are identified as deviations in single template parameters. Finding multi-parameter relationships, e.g., if setting 1 is "true" then setting 2 must not be "0", is outside the scope of our approach, and we leave it to future work to efficiently learn such relationships.

TEMPLATE SYNTHESIS
In this section, we describe the format for configurations and templates (§3.1), explain how Diffy defines costs for templates, show how it efficiently synthesizes low-cost string templates (§3.4), and lift this idea to JSON lists and objects (§3.5, §3.6). Before we can explain how Diffy computes a template like that in Figure 1, we must first define the input format and template syntax. A summary of the JSON, template, and anomaly syntax is illustrated in Figure 2.
Diffy processes configurations in JSON format [7], where each configuration is a JSON node. Specifically, a JSON node is either a string, a list of JSON nodes, or an object whose values are JSON nodes. A labeled template $T = t^{\ell}$ consists of a template $t$ as well as a unique label $\ell$ that we use to reference $T$. The template $t$ may be a constant string $s$, a regular expression token $r$ such as those in Table 1, or a concatenation of sub-templates ($T_1$ • … • $T_k$). For example, "/[num]−/24" from the example is a concatenation of "/" • [num] • "−/24". Templates for lists may be unordered multisets {$T_1$, …, $T_k$}, ordered sequences [$T_1$, …, $T_k$], or repetitions of a sub-template $T^*$. An optional template $T?$ describes an element that may be present or absent. Object templates consist of key-value pairs {$s_1$ : $T_1$, …, $s_k$ : $T_k$}. A choice template ($T_1$ | … | $T_k$) signifies a case split, where a configuration will match any of $T_1$ through $T_k$, useful for representing sets of configurations that correspond to distinct patterns. Finally, an expression $e$ is defined over configuration values captured by template labels (e.g., $\ell$) and is used to identify anomalies.
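The string portion of this template syntax can be sketched as a small data structure. The class names `Const`, `Token`, and `Concat`, and the helpers `to_regex` and `holes`, are hypothetical names chosen for illustration; list, object, optional, and choice templates are omitted, and an ASCII hyphen stands in for the minus sign in the route length filter.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple, Union

@dataclass(frozen=True)
class Const:
    """A constant string template."""
    text: str

@dataclass(frozen=True)
class Token:
    """A regular expression token such as [num]."""
    name: str
    pattern: str

@dataclass(frozen=True)
class Concat:
    """A concatenation of constant and token sub-templates."""
    parts: Tuple[Union[Const, Token], ...]

NUM = Token("num", r"[0-9]+")

def to_regex(t) -> str:
    """Translate a string template into a regex; every token becomes a capture group."""
    if isinstance(t, Const):
        return re.escape(t.text)
    if isinstance(t, Token):
        return "(" + t.pattern + ")"
    return "".join(to_regex(p) for p in t.parts)

def holes(t, s: str) -> Optional[Tuple[str, ...]]:
    """Return the values captured by the template's tokens, or None if s does not match."""
    m = re.fullmatch(to_regex(t), s)
    return None if m is None else m.groups()

# "/" . [num] . "-/24" from the running example.
length_filter = Concat((Const("/"), NUM, Const("-/24")))
print(holes(length_filter, "/12-/24"))
print(holes(length_filter, "le 8"))
```

Representing tokens as capture groups means that matching a configuration value against a template directly yields the values for its holes.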

Example Tokens
Example templates: Going back to Figure 1, we can write the template presented in JSON more formally. The synthesized template for the "LenMatch" field from Figure 1 represents a sequence (list) of zero or more elements, where each element is described by a choice of either of the two patterns. The labels a through i identify the sub-templates.
As another example, consider the "Actions" field for the route policy. The synthesized template consists of an ordered sequence of two sub-templates. Each sub-template in Diffy has a unique label to allow for writing expressions over configuration values by referencing one or more labels. The expression in Figure 1 states that the value captured by b is unlikely to be "2", i.e., (value(b) = "2") is an anomaly. This approach is general and permits expressions involving multiple parameters, such as length(d) < value(b), or even the entire string with c. Expressions transform configuration values to reveal anomalies. For example, the values "1", "2", "3", "Z" for some label x may not appear anomalous as strings, but when evaluated over the function integer(x), the string "Z" becomes anomalous.
As a final example, the "StaticRoutes" field is an unordered list template where one of the elements is optional. The expression present(e) from Figure 1 maps the optional value to a boolean representing whether e is present in a configuration. Since most configurations do not have the second static route, those with result "true" are flagged as potential bugs. We elaborate on this idea in §4.
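The two expressions mentioned above, integer(x) and present(e), can be sketched as plain functions over the values captured for a label. This is an illustration under stated assumptions: a missing optional value is modeled as `None`, and the three concrete routes in `second_route` are invented to fit the pattern "10.3.19[num].0/24".

```python
from collections import Counter

def integer(value):
    """Map a captured string to an int, or to None when it is not an integer."""
    try:
        return int(value)
    except ValueError:
        return None

def present(value):
    """Map an optional captured value to a boolean."""
    return value is not None

# Values captured for a hypothetical label x: "Z" stands out only after
# the values are viewed through integer(x).
values_x = ["1", "2", "3", "Z"]
as_integers = [integer(v) for v in values_x]

# Hypothetical values for the optional second static route across 48
# configurations: absent in 45 of them, present in 3.
second_route = [None] * 45 + ["10.3.191.0/24", "10.3.192.0/24", "10.3.193.0/24"]
presence = Counter(present(v) for v in second_route)

print(as_integers)
print(presence)
```

Each expression turns raw captured strings into a feature (an integer, a boolean) over which rare values become visible.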
For the remainder of the paper, we sometimes elide the labels from templates for readability. Given a template $T = t^{\ell}$, we define $L(T) = \ell$ as the label for the template. For a set $X$, the notation $\{\{X\}\}$ refers to the set of all multisets over elements of $X$. For a regex $r \in \mathcal{R}$ and a string $s \in \mathit{Node}$, we write $s \sim r$ to mean that $s$ is in the language of $r$. To make the synthesis problem more tractable, we do not allow choice templates to be used within a concatenation template. For instance, "abc" • ([num] | "d") is not allowed, while ("abc" • [num]) | "abcd" is. This restriction reduces the search space without loss of generality. We similarly assume that nested choice templates are flattened.
The template synthesis problem inputs include (1) $n$ JSON nodes (configurations), (2) a set of regular expressions $\mathcal{R}$, and (3) a cost function $C : \mathcal{R} \to \mathbb{R}$ that assigns a cost $C(r)$ to every character of a string matched by $r \in \mathcal{R}$. We assume that $0 \le C(r) \le 1$. Typically one should define $C(r)$ according to how specific the regular expression is, i.e., regular expressions that match fewer strings should have lower (better) cost since they are more specific. For instance, [num] should have lower cost than an identifier token [a−zA−Z0−9]+ since this would prioritize matching numbers over identifiers when the text is all digits. To ensure a template always exists, we assume that a default regular expression [any] $\in \mathcal{R}$ matches all strings, and that $C(\text{[any]}) = 1$.
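The role of the per-character cost can be illustrated with a small token table. The token names, patterns, and cost values below are assumptions made for the example (only [any] having cost 1 comes from the text); the helper `cheapest_token` is hypothetical and simply picks the most specific token that matches a whole string.

```python
import re

# Hypothetical token set: (name, regex, per-character cost in [0, 1]).
# More specific tokens receive lower cost; [any] matches everything at cost 1.
TOKENS = [
    ("num", r"[0-9]+", 0.2),
    ("ident", r"[a-zA-Z0-9]+", 0.5),
    ("any", r".*", 1.0),
]

def cheapest_token(text: str) -> str:
    """Return the name of the lowest-cost token whose language contains text."""
    matching = [
        (cost, name)
        for name, pattern, cost in TOKENS
        if re.fullmatch(pattern, text, re.DOTALL)
    ]
    # [any] always matches, so the list is never empty.
    return min(matching)[1]

print(cheapest_token("65000"))
print(cheapest_token("Lo0"))
print(cheapest_token("le 8"))
```

With these costs, an all-digit string is attributed to [num] rather than to the identifier token, which is the prioritization described above.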

Template Cost Function
There are many templates that can represent a given set of configurations. The goal is to find a template that balances generality and complexity. For example, several templates could represent the strings "Eth1", "Eth2", and "Eth3". The token [any] covers all three strings but is perhaps too broad because it matches all other strings as well. Alternatively, the template ("Eth1" | "Eth2" | "Eth3") precisely captures only those three strings but is complex and likely overfits the data. In this case, a compromise such as "Eth" • [num] strikes a balance between conciseness and specificity.
To synthesize desirable templates, we need a way to compare templates. We do so by defining a heuristic template cost function that gives a direct measure of the "quality" of a template. This cost function is inspired by the Minimum Description Length Principle [16], where the cost of a template is related to the amount of data that one would need to transmit to recreate a template's configurations. The cost function is designed to return a normalized score between 0 (best) and 1 (worst) and lends itself to a natural and efficient synthesis algorithm based on dynamic programming. In this way, Diffy avoids an enumerative search over templates, such as those used in prior data mining tools [21,44,45], and scales to large real-world configurations (§6.3).
Cost of string templates. We define a heuristic cost function for templates in Figure 3. The cost function $C$ has three arguments: a template instantiation function $\theta$, the number of configurations $n$, and the template itself $T$. The cost function returns a value $c \in \mathbb{R}$ such that $0 \le c \le 1$. The template instantiation function $\theta : \mathcal{L} \times \mathbb{N} \to \{\{\mathit{Node}\}\}$ takes a template label $\ell \in \mathcal{L}$ and a configuration index $i \in \mathbb{N}$ such that $1 \le i \le n$, and returns a multiset of JSON nodes representing the matches for the template in configuration $i$. The function $\theta$ essentially "fills in" the concrete values for each configuration for each sub-template.
For strings, the cost definitions are simple: the normalized cost for a constant string (const-cost) is 0 (ideal), and the normalized cost for a regular expression template (regex-cost) is given by the per-character cost $C(r)$. For a concatenation of templates $T_1$ • … • $T_k$, we compute the normalized cost (concat-cost) for each $T_i$ as $C(\theta, n, T_i)$ and then multiply this cost by the length of the strings matched by each template, $S(\theta, n, L(T_i))$. This value is then re-normalized by dividing by the total length of the matched strings over all templates, $\sum_{i=1}^{k} S(\theta, n, L(T_i))$.
Example: Consider the template $T$ = ("Eth" • [num]) that matches strings "Eth12" and "Eth8". Assume we have $C(\text{[num]}) = \frac{1}{5}$ (this exact value is used for illustrative purposes only). The number of configurations is $n = 2$, and the template instantiation function maps each label to the substring it matches in each configuration. To compute the cost of $T$ we first compute the costs of "Eth" and [num], which are 0 and $\frac{1}{5}$ respectively. We then compute the cost of the concatenation as the length-weighted average of these costs: "Eth" matches 6 characters in total and [num] matches 3, giving $\frac{0 \cdot 6 + \frac{1}{5} \cdot 3}{6 + 3} = \frac{1}{15}$.
Cost of other templates. Intuitively, the cost for list and object templates should be related to the costs of their respective elements. However, this is not enough. Elements of a list or set may be missing from some configurations (due to optional elements $T?$) or the template may require more or fewer cases to describe the content. For instance, the templates ["Eth1", "Eth2"?, "Eth3"?] and ("Eth" • [num])* both describe lists, yet the latter may be preferred due to its simplicity.
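The concat-cost rule for the "Eth" example can be checked with a few lines of arithmetic. This is a sketch of the rule as described in the text, assuming that matched lengths are summed across all configurations; the function name `concat_cost` is hypothetical, and the value 1/5 for [num] is the illustrative cost from the example.

```python
from fractions import Fraction

def concat_cost(parts):
    """Length-weighted average of sub-template costs.

    parts is a list of (normalized_cost, total_matched_length) pairs, one per
    sub-template, where the length is summed over all configurations.
    """
    total_length = sum(length for _, length in parts)
    weighted = sum(cost * length for cost, length in parts)
    return weighted / total_length

# "Eth" is a constant (cost 0) matching 3 characters in each of "Eth12" and "Eth8".
# [num] (illustrative cost 1/5) matches "12" and "8", i.e. 3 characters in total.
eth_num = concat_cost([(Fraction(0), 6), (Fraction(1, 5), 3)])
print(eth_num)
```

Exact fractions are used so the result can be compared against a hand calculation without rounding.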
Figure 3 defines our cost function for ordered (ord-cost), unordered (unord-cost), repeat (repeat-cost), choice (choice-cost), and object (object-cost) templates in a similar manner. All of them evaluate the cost of a sequence of templates $T_1 \cdots T_k$ using $\vec{C}$. The function $\vec{C}$ takes the label $\ell$ of the callee template as well as a lower bound $lo$ and an upper bound $hi$ on the number of template elements possible for the sequence given the configurations. For each template $T_i$ in the sequence, we compute the inverse cost, or "benefit", $(1 - C(\theta, n, T_i))$. Each element is then scaled by the fraction of configurations with that element present relative to those in scope, $\frac{N(\theta, n, L(T_i))}{N(\theta, n, \ell)}$, and we divide the result by $k$ to calculate the average benefit. To penalize more complex (longer) templates, we scale the result by $\frac{hi - k}{hi - lo}$, which represents the proximity of the template to the minimal possible length compared to the maximum. The final value represents the "benefit" of the template, and 1 minus this value gives the normalized cost. For ordered, unordered, and object templates, $lo$ is set to $M(\theta, n, \ell)$, the maximum concrete list or object size, and $hi$ is set to $S(\theta, n, \ell)$, the total number of list or key elements across all configurations (as if every element required a unique template). For a repeat template, we set $lo = 1$ since a single pattern could describe all list elements, and keep the same value for $hi$. A choice template has $lo = 1$ and $hi = N(\theta, n, \ell)$, the number of configurations in scope.
Example: Consider two JSON lists ["Eth1", "Eth2"] and ["Eth2", "Eth3"]. Many templates describe these lists. For instance, consider the ordered template [("Eth1"^a)?^b, "Eth2"^c, ("Eth3"^d)?^e]^f. The cost of the template is given by (ord-cost), which computes $\vec{C}$ with $lo = M(\theta, n, f) = 2$ (the largest list size) and $hi = S(\theta, n, f) = 4$ (the total number of elements). Recursively computing each sub-template (b, c, e) results in cost 0, and the overall cost then follows from $\vec{C}$. In this case, the template cost is high because (1) labels b and e only match elements in half of the configurations and (2) the template consists of 3 list elements rather than the minimal 2 possible for the ordered type.
Now consider the repeat template ("Eth" • [num])*, which also matches the same two lists. This time, we compute the cost according to (repeat-cost), which uses $\vec{C}$ with $lo = 1$ and the same $hi$. In this case, the repeat template has substantially lower cost than the ordered template because a single pattern captures every element of both lists.
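The comparison between the ordered and repeat templates for ["Eth1", "Eth2"] and ["Eth2", "Eth3"] can be worked through numerically. The sketch below is a reading of the prose description rather than a copy of Figure 3: in particular, it assumes the length penalty is the factor (hi − k) / (hi − lo) for a sequence of k sub-templates, treats the degenerate case hi = lo as no penalty, and reuses the illustrative cost 1/5 for [num]. The function name `sequence_cost` is hypothetical.

```python
from fractions import Fraction

def sequence_cost(elements, in_scope, lo, hi):
    """Normalized cost of a sequence of sub-templates.

    elements: list of (cost, present) pairs, where present is the number of
              configurations in which that sub-template matches something.
    in_scope: number of configurations in scope for the enclosing template.
    lo, hi:   lower and upper bounds on the number of sub-templates.
    """
    k = len(elements)
    # Average "benefit" (1 - cost), weighted by how often each element is present.
    benefit = sum((1 - cost) * Fraction(present, in_scope) for cost, present in elements) / k
    # Assumed penalty for templates longer than the minimum possible length.
    scale = Fraction(hi - k, hi - lo) if hi != lo else Fraction(1)
    return 1 - benefit * scale

# Ordered template ["Eth1"?, "Eth2", "Eth3"?]: three zero-cost elements, two of
# which appear in only one of the two lists; lo = 2 (largest list), hi = 4.
ordered = sequence_cost([(0, 1), (0, 2), (0, 1)], in_scope=2, lo=2, hi=4)

# Repeat template ("Eth" . [num])*: one pattern present in both lists. Its own
# cost is (0 * 12 + 1/5 * 4) / 16 = 1/20 by the concatenation rule; lo = 1, hi = 4.
repeat = sequence_cost([(Fraction(1, 20), 2)], in_scope=2, lo=1, hi=4)

print(ordered, repeat)
```

Under these assumptions the ordered template costs 2/3 and the repeat template costs 1/20, which matches the qualitative conclusion in the text that the repeat template is substantially cheaper.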

String Template Synthesis
To efficiently synthesize a low-cost template for a collection of JSON nodes, we first solve the simpler problem of synthesizing a template for a collection of strings (i.e., the base case). We then lift this approach to lists and objects in §3.5. To synthesize string templates efficiently, Diffy uses a dynamic programming approach to identify a low-cost union of concatenations of templates that match the set of strings. To further simplify the problem, the algorithm takes two string templates as input and produces a new lowest-cost template that combines these templates. We begin by computing the lowest-cost template for the first two strings $s_1$ and $s_2$ in the set and then iteratively combine the resulting template with the subsequent strings until none remain.
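The pairwise-then-fold strategy described above can be sketched with a much simpler merge step than Diffy's. The code below is a stand-in, not the paper's regex-aware dynamic program: it aligns a template with the next string using Python's `difflib.SequenceMatcher`, keeps aligned characters as constants, and replaces each differing region with [num] when both sides are numeric or with [any] otherwise. The names `merge` and `synthesize` are hypothetical.

```python
from difflib import SequenceMatcher
from functools import reduce

NUM, ANY = "[num]", "[any]"

def is_numeric(items):
    """True if the region is non-empty and consists only of digits or [num] tokens."""
    return bool(items) and all(x == NUM or x.isdigit() for x in items)

def merge(template, string):
    """Combine a template (list of single characters and tokens) with one more string."""
    other = list(string)
    merged = []
    matcher = SequenceMatcher(None, template, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(template[i1:i2])          # shared constants are kept
        elif is_numeric(template[i1:i2]) and is_numeric(other[j1:j2]):
            merged.append(NUM)                      # both sides numeric: a [num] hole
        else:
            merged.append(ANY)                      # fall back to the catch-all token
    return merged

def synthesize(strings):
    """Fold the pairwise merge over all strings, starting from the first one."""
    return "".join(reduce(merge, strings[1:], list(strings[0])))

print(synthesize(["Eth1", "Eth2", "Eth30"]))
print(synthesize(["true", "false"]))
```

Folding a two-input merge over the collection keeps each step small, at the price of making the result depend on the order in which strings are processed.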
We assume all string templates are a sequence of either regex tokens or constant string templates of size 1 (e.g., "10" is converted to "1" • "0"). Given the first string template as a concatenation of $p$ elements and the second as a concatenation of $q$ elements, Diffy computes the lowest-cost combined template through the dynamic programming cost function $D(\theta, n, i, j)$, defined through a recurrence over the indices $i$ and $j$ into the two templates. This definition takes the current template instantiation function $\theta$ and the number of configurations analyzed so far $n$ (initially $n = 2$). The best template has (un-normalized) cost $D(\theta, n, p, q)$. The cost is initially 0.0 and is then updated recursively in two main scenarios. When both sequences have the same next template, the previous cost $D(\theta, n, i-1, j-1)$ is updated by the normalized costs of the matched $i$-th and $j$-th templates times the number of characters matched by each. All other cases involve a "backwards jump" using regex tokens. For a token $r \in \mathcal{R}$, we consider all earlier indices $i'$ and $j'$ such that $r$ matches the sub-sequences of both templates that end at $i$ and $j$. This logic is represented by an index match function $I$, which is implemented efficiently using automata. For instance, for the template "1" • [num] • "M", the sub-sequence from index 1 ("1" • [num]) matches [num] since concatenating two numbers remains a number, and the sub-sequence from index 2 ([num]) also matches [num]. We discuss the details of $I$ later. In this case, there is a "diagonal" jump, and the lowest cost is updated from $D(\theta, n, i'-1, j'-1)$ by adding the regex cost $C(r)$ for each character matched by the regex. If token $r$ matches the empty string ("" ∼ $r$), then we can also jump backwards only in a horizontal (1st case) or vertical direction (2nd case).
Calculating index matches. In Figure 4, Diffy must compute $I$. The algorithm must know, for instance, that [num] has no index match for a sub-sequence ending in "M" (such as "1" • "M"), since any string ending with "M" is not a number. To calculate $I$ one could check for language containment (e.g., the language of "1" • [num] is a subset of the language of [num]). However, this approach is far too slow in practice because it calculates $I$ for every regex and for every pair of indices, resulting in $O(p^2 + q^2)$ expensive language containment checks.
Instead, we efficiently compute $I$ for every substring simultaneously by tracking possible automaton states for each end index, as depicted in Figure 4 (part 3). For template "1" • [num] • "M", the algorithm begins at state 0 of the [num] automaton and transitions to state 1 after processing "1". Since this is an accepting state, "1" matches [num]. Similarly, "1" • [num] matches [num] since, transitioning from state 1 upon seeing a [num], we must remain in state 1. However, "1" • [num] • "M" will end with "M" (state 2), not matching [num]. Simulating a state machine for a token is done using a product automaton construction, and we discuss the details of this algorithm in the supplemental materials for space reasons. This method enables near-linear time token substring matching.
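The state-tracking idea for the [num] token can be sketched with a three-state automaton. This is a simplification under stated assumptions: the automaton is hard-coded for [num] rather than built by a product construction, template elements are single characters or the literal string "[num]", and the scan is restarted from each start index instead of tracking all starts simultaneously. The names `step` and `num_matches` are hypothetical.

```python
# States of a hand-written automaton for [num]:
# 0 = start, 1 = accepting (one or more digits seen), 2 = dead.
START, ACCEPT, DEAD = 0, 1, 2

def step(state, element):
    """Advance the [num] automaton over one template element."""
    if state == DEAD:
        return DEAD
    # A digit extends a number, and a [num] token followed or preceded by
    # digits is still a number, so both lead to the accepting state.
    if element == "[num]" or element.isdigit():
        return ACCEPT
    return DEAD

def num_matches(template):
    """Return all 1-based inclusive (start, end) spans whose sub-sequence matches [num]."""
    spans = []
    for start in range(len(template)):
        state = START
        for end in range(start, len(template)):
            # Carry the state forward: each extension costs a single transition.
            state = step(state, template[end])
            if state == DEAD:
                break
            spans.append((start + 1, end + 1))
    return spans

print(num_matches(["1", "[num]", "M"]))
```

Because the state is carried from one end index to the next, no sub-sequence is ever re-matched from scratch, which is the property that avoids repeated language containment checks.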
Choice string templates. Whenever the normalized template cost exceeds a threshold (0.5 by default), we create a new choice. When evaluating new strings, the algorithm compares them to existing choices and selects the one with the lowest cost. For example, naive synthesis for strings "true", "false", and "True" produces template [any] • "e", with high cost. Instead, because the normalized cost for the template for "true" and "false" exceeds 0.5, we create a choice template. Comparing "True" to both "true" and "false", the former offers a lower-cost match. Thus, the final template is ([any] • "rue" | "false"). While the choice of threshold may seem arbitrary, we show in §6.4 that Diffy's ability to correctly identify bugs is fairly insensitive to the value of this threshold.
Fig. 5. How Diffy calculates a list template for input lists ["le 16", "/20−/24", "/30−/32"] and ["/16−/24", "le 8"]. The ordered template (top) is determined using sequence alignment [52], where the pairwise element cost is the string template cost. The grey rectangles represent unmatched elements. The first element of the list on the bottom ("/16−/24") is matched with the second element of the list on the top ("/20−/24"), and other possible matches not used are greyed out due to their higher costs. The unordered template (middle) is from the lowest-cost matching of elements, i.e., finding the pairs of elements in each list that should be matched together to produce the lowest total cost. The solid lines show the chosen matches, and the dashed lines show some of the matches that were not selected. Finally, Diffy finds the minimum cost repeat template (bottom) by comparing each list element with an accumulated set of patterns, producing [ "@type:repeat−one−of", "/[num]−/[num]", "le [num]" ].
Complexity. Given two string templates with lengths $p$ and $q$, computing $D$ has worst-case complexity $O(|\mathcal{R}| \cdot p^2 \cdot q^2)$ in the event that each regular expression $r \in \mathcal{R}$ matches every substring. The quadratic terms exist because, for each element in the table, each regex could potentially match $p \cdot q$ substrings and Diffy must compare every such match to find the one with the lowest cost. In the best case, the algorithm must do work proportional to $|\mathcal{R}| \cdot p \cdot q$ because it must "fill in" every element of the dynamic programming table, similar to the standard string edit distance algorithm. We discuss the optimizations we use to mitigate this worst-case complexity and achieve performance closer to the best case in practice in §5.

List Template Synthesis
To synthesize a template for two JSON lists, Diffy recursively computes the lowest-cost templates for every pair of list elements. There are three template types: ordered, unordered, and repeat. Diffy computes a template of each type and selects the one with the lowest cost. Figure 5 illustrates the computation of a list template for each type using example lists: ["le 16", "/20−/24", "/30−/32"] and ["/16−/24", "le 8"].
Ordered template. To synthesize the best ordered template, Diffy uses a sequence alignment algorithm to efficiently arrange list elements. Diffy computes pairwise templates for every element in both lists and selects the lowest-cost alignment to form a new template. For unaligned elements, it introduces an optional template. The example's resulting alignment from Figure 5 is: [ "le 16"?, "/" • [num] • "−/24", "/30−/32"?, "le 8"? ].
Unordered template. To synthesize the optimal unordered template, Diffy calculates the pairwise template cost of list elements and computes the minimum cost matching between the elements, provided their cost falls below the cutoff threshold. The unordered template for the example is: {"le " • [num], "/" • [num] • "−/24", "/30−/32"?}, as shown in Figure 5 (middle). This is a better match than the ordered template since the "le " • [num] pattern captures multiple matches.
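The minimum-cost matching step for the unordered template can be sketched on the example lists. Two assumptions are made here: the pairwise cost is approximated by one minus `difflib.SequenceMatcher.ratio()` instead of Diffy's string template cost, and the matching is found by brute force over permutations, which is only practical for short lists. The names `pair_cost` and `best_matching` are hypothetical, and ASCII hyphens replace the minus signs.

```python
from difflib import SequenceMatcher
from itertools import permutations

def pair_cost(a, b):
    """Stand-in pairwise cost in [0, 1]: 0 for identical strings, 1 for no overlap."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()

def best_matching(xs, ys, cutoff=0.5):
    """Match each element of the shorter list to a distinct element of the longer one.

    Returns (longer_element, shorter_element) pairs of the lowest total cost,
    keeping only pairs whose cost does not exceed the cutoff.
    """
    short, long_ = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    best = min(
        permutations(range(len(long_)), len(short)),
        key=lambda perm: sum(pair_cost(long_[j], short[i]) for i, j in enumerate(perm)),
    )
    pairs = [(long_[j], short[i]) for i, j in enumerate(best)]
    return [(a, b) for a, b in pairs if pair_cost(a, b) <= cutoff]

top = ["le 16", "/20-/24", "/30-/32"]
bottom = ["/16-/24", "le 8"]
matching = best_matching(top, bottom)
print(matching)
```

Elements left unmatched, such as "/30-/32" here, are the ones that would become optional in the unordered template.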
Repeat template. To synthesize a repeat template, Diffy merges elements from each list and applies a greedy clustering algorithm, as depicted in Figure 5 (bottom). The algorithm starts by creating a group with the first element ("le 16"). Then, it compares the next element ("/20−/24") to the group using a new template. Due to the high cost, "/20−/24" forms a separate group. The process continues with the subsequent element "/30−/32", comparing it against both groups. It has a significantly lower cost when combined with "/20−/24", so Diffy updates the second group to be "/" • [num] • "−/" • [num]. The algorithm proceeds until no elements remain, resulting in a final template that repeats one of the two patterns "le " • [num] and "/" • [num] • "−/" • [num], as shown in Figure 5 (bottom).
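The greedy grouping can be sketched as follows. This is a simplified illustration rather than Diffy's algorithm: instead of comparing an element with a group's accumulated template, it compares the element with every member of the group and takes the worst case, the cost function is a `difflib`-based stand-in, and `CUTOFF` is a hypothetical threshold.

```python
from difflib import SequenceMatcher

CUTOFF = 0.5  # hypothetical: elements costlier than this never share a group


def cost(a: str, b: str) -> float:
    """Stand-in for the string template cost (0.0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def greedy_groups(elements, cutoff=CUTOFF):
    """Assign each element to the cheapest compatible group, else open a new one."""
    groups = []
    for element in elements:
        best_group, best_cost = None, cutoff
        for group in groups:
            # Worst-case cost against the members stands in for the cost of
            # merging the element into the group's accumulated pattern.
            group_cost = max(cost(element, member) for member in group)
            if group_cost < best_cost:
                best_group, best_cost = group, group_cost
        if best_group is None:
            groups.append([element])
        else:
            best_group.append(element)
    return groups


# Elements of both example lists, merged in order.
merged = ["le 16", "/20-/24", "/30-/32", "/16-/24", "le 8"]
groups = greedy_groups(merged)
```

On the example, the sketch ends with two groups, one holding the "le" entries and one holding the prefix-length ranges, which correspond to the two patterns of the repeat template.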
Handling mismatched types. In certain scenarios, the same node can have different types across configurations. A key might have a string value in one configuration and a list or null value in another. We separate the configurations based on their types and template each type individually. The outcome is presented as a choice template that joins the per-type templates with "|", for example a string template and a list template.

FINDING BUGS WITH UNSUPERVISED LEARNING
We split Diffy into two orthogonal components: template synthesis and anomaly detection. In this section, we describe the second component, which uses a state-of-the-art anomaly detection algorithm known as isolation forest [37]. While we describe our approach in detail, we note that it would also be possible to apply other learning approaches to Diffy's templates.
Isolation forests are an unsupervised (or self-supervised) anomaly detection technique. Given a vector of values such as [1.2, 1.8, 0.9, 2.2, 20.1], an isolation forest assigns an "anomaly score" to each input. This score ranges from 0 to 1, where a score closer to 1 indicates a higher likelihood of being an anomaly. In the example, 20.1 would have the highest anomaly score. Although this example uses only one-dimensional inputs, isolation forests also handle multi-dimensional data (i.e., a vector of vectors).
Isolation forest is based on the observation that anomalies are both "few and different," i.e., they not only occur infrequently but also have values that noticeably differ from the norm, making them easier to separate from other samples. Empirically, this observation leads to more accurate and reliable anomaly detection results compared with other methods that rely solely on frequency, density, or distance [23]. To quantify the observation, isolation forest partitions data randomly and recursively, recording the number of partitions required to isolate each sample. This process is repeated many times to derive the average number of partitions per sample, from which the anomaly score is computed: the fewer partitions needed to isolate a sample, the higher its score.
As an unsupervised learning method, isolation forest does not require anomaly labels from users and thus is well suited to our problem, where labels are often unavailable. Isolation forest is also highly scalable and efficient, with linear time and space complexity.
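To make the scoring concrete, here is a minimal one-dimensional sketch of the idea, not Diffy's implementation and not a full isolation forest. It omits subsampling and, rather than building shared trees, traces each sample through fresh random partitions, which gives the same path-length distribution for that sample. The tree count, depth limit, and seed are arbitrary assumptions.

```python
import math
import random


def expected_depth(n: int) -> float:
    """Average path length of an unsuccessful binary-search-tree lookup over n
    items; used to normalize path lengths into a score."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def path_length(x, sample, rng, depth=0, limit=8):
    """Number of random splits needed to isolate x within sample."""
    if depth >= limit or len(sample) <= 1:
        return depth + expected_depth(len(sample))
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth + expected_depth(len(sample))
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as x.
    same_side = [v for v in sample if (v < split) == (x < split)]
    return path_length(x, same_side, rng, depth + 1, limit)


def anomaly_scores(values, trees=200, seed=0):
    """Score in (0, 1]; values isolated in fewer splits score closer to 1."""
    rng = random.Random(seed)
    norm = expected_depth(len(values))
    scores = []
    for x in values:
        avg = sum(path_length(x, values, rng) for _ in range(trees)) / trees
        scores.append(2.0 ** (-avg / norm))
    return scores


scores = anomaly_scores([1.2, 1.8, 0.9, 2.2, 20.1])
```

For the example vector, 20.1 is almost always separated by the first split, so it receives the highest score, while the four clustered values need several splits each and score noticeably lower.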
Embedding configuration data. Given the synthesized template from §3, Diffy recursively traverses it. For each label ℓ in the template, Diffy translates the concrete values that each configuration supplies for ℓ into a numerical vector that is fed to the isolation forest algorithm. This translation occurs according to the following steps: (1) apply each type of applicable expression (e.g., present for an optional template) to get string values (e.g., "true"), and (2) if the strings from all configurations are numerical, encode them directly; otherwise, sort the strings and apply a standard ordinal encoding to obtain a numerical representation. Table 2 showcases example expressions that we used, and users can extend Diffy with additional expressions.
Analyzing multiple parameters. In this work, we focus on identifying configuration errors involving only a single configuration parameter. Single-parameter errors are by themselves a prevalent category and encompass many kinds of misconfigurations, including typos or "fat finger" mistakes, copy-paste errors, type errors, missing configuration, and more (see §6). We leave learning multi-parameter errors, i.e., correlations across multiple configuration settings, as future work.
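The two translation steps can be sketched as below. The function names are hypothetical, and the "false" output of `present` for a missing value is an assumption (Table 2 only lists "true"); the sketch illustrates the encoding rule and is not Diffy's code.

```python
def present(value):
    """Hypothetical rendering of the `present` expression for an optional value.

    Returning "false" for a missing value is an assumption of this sketch.
    """
    return "true" if value is not None else "false"


def embed(strings):
    """Encode one template parameter's strings (one per configuration) as numbers."""
    try:
        # Step 2a: if every string is numerical, use the numbers directly.
        return [float(s) for s in strings]
    except ValueError:
        # Step 2b: otherwise sort the distinct strings and use their rank.
        rank = {s: i for i, s in enumerate(sorted(set(strings)))}
        return [float(rank[s]) for s in strings]


timers = embed(["30", "30", "1200"])                  # numerical strings
anchors = embed(["", "^", ""])                        # non-numerical strings
flags = embed([present(v) for v in ["x", None, "y"]])  # optional parameter
```

The resulting vectors are what an anomaly detector would consume: here the rare "1200", the rare "^", and the single missing value each map to a number that differs from the rest.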

IMPLEMENTATION OF DIFFY
Diffy is implemented in 4.7K lines of F# and 1.6K lines of C# code. It accepts a file directory with JSON configuration files and an optional set of tokens with associated costs through the command line. It always includes the [any] token in its analysis. Diffy outputs a template and a set of potential bugs, along with a score for each bug. Diffy accepts many other optional flags to include or exclude different parts of the configurations from its analysis or to filter out different kinds of anomaly expressions from its results. To enhance scalability, Diffy employs several optimizations.
While creating the JSON abstract syntax tree from Figure 2 in memory, Diffy calculates a perfect hash for each node and stores it as part of the node. This hash enables quick approximate equality checks between JSON elements and allows Diffy to efficiently cache calls to its synthesize function to prevent redundant work. Furthermore, when synthesizing a template for a set of strings, lists, etc., Diffy identifies duplicate JSON nodes between configurations using these hashes and replaces them with a single copy of the node. This simplification applies recursively and significantly reduces the cost of synthesis.
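A minimal sketch of hash-based deduplication is shown below. It is only an approximation of the idea: Diffy computes and stores hashes while building its own syntax tree, whereas this stand-in hashes a canonical JSON serialization after the fact, and `node_hash` and `dedupe` are hypothetical names.

```python
import hashlib
import json


def node_hash(node) -> str:
    """Structural hash of a JSON value; object key order does not matter."""
    canonical = json.dumps(node, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(nodes):
    """Keep a single copy of each structurally identical JSON node."""
    unique = {}
    for node in nodes:
        unique.setdefault(node_hash(node), node)
    return list(unique.values())


configs = [
    {"name": "PL1", "entries": ["le 16", "le 8"]},
    {"entries": ["le 16", "le 8"], "name": "PL1"},  # same content, other key order
    {"name": "PL2", "entries": ["le 24"]},
]
unique_nodes = dedupe(configs)
```

The same hash could serve as a cache key for a synthesis function, so that repeated inputs are templated once.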
Recall that the worst-case complexity of string synthesis is $O(|R| \cdot m^2 \cdot n^2)$, where $m$ and $n$ are the lengths of the two string templates. We apply several optimizations to avoid this worst-case complexity in practice. First, Diffy's implementation includes an additional rule in the dynamic programming formulation for the [any] token that only allows that token to match a single character (i.e., horizontally or vertically). We then post-process the template to coalesce any consecutive [any] tokens. Doing so avoids having to consider all backward jumps. Similarly, we use greedy regex matching semantics to greatly reduce the number of backward jumps we must consider (i.e., we consider only the maximal match of the regex for each start index).
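The greedy matching restriction can be illustrated with Python's `re` module, which is only a stand-in for Diffy's automata. For a simple token such as [num], a greedy quantifier yields the maximal match at each start index; for tokens with alternation, Python's backtracking matcher is not guaranteed to return the longest match, so this is a sketch of the idea only.

```python
import re

NUM = re.compile(r"[0-9]+")  # the [num] token from the running example


def maximal_matches(token, s):
    """Map each start index to the end index of its single greedy match.

    Keeping one candidate per start index, instead of every matching
    substring, is what bounds the number of jumps to consider.
    """
    ends = {}
    for start in range(len(s)):
        match = token.match(s, start)
        if match and match.end() > start:
            ends[start] = match.end()
    return ends


jumps_10m = maximal_matches(NUM, "10M")
jumps_200 = maximal_matches(NUM, "200")
```

For "200", the unrestricted formulation would consider six substrings that match [num]; the greedy restriction keeps three, one per start index.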
To speed up the substring matching process, Diffy caches state transitions for the index tables from Figure 4. When updating states, the cache stores the new set of states as a function of the current state set, token, and new token. This one simple idea removes most of the overhead from tracking automata states and reduces the index match function I to linear time in practice, as only a few state sets are typically visited. Finally, Diffy first checks whether every string shares a common prefix. If so, it removes this prefix and inserts a constant template at the beginning. It then synthesizes a template for the remaining suffixes.
To improve the performance of synthesizing unordered templates, Diffy uses a greedy matching algorithm that iteratively pairs off the two elements with the lowest-cost template. Doing so improves performance and, in our experience, has little effect on the quality of the final template.
Lastly, Diffy omits the computation of some list templates if their cost cannot outperform the existing ones. For instance, when an ordered template's cost is low enough that an unordered template could never surpass it, Diffy does not bother synthesizing the best unordered template.
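The common-prefix step is simple enough to sketch directly. The helper name is hypothetical, and `os.path.commonprefix` is used only because it compares strings character by character.

```python
import os


def split_common_prefix(strings):
    """Return the shared prefix and the suffixes that still need a template."""
    prefix = os.path.commonprefix(list(strings))
    return prefix, [s[len(prefix):] for s in strings]


prefix, suffixes = split_common_prefix(["le 16", "le 8"])
no_prefix, unchanged = split_common_prefix(["10M", "15M", "200"])
```

In the first call, only the short suffixes "16" and "8" remain for the dynamic programming step; in the second, nothing is shared and the strings pass through unchanged.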
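A sketch of such a greedy matching follows, under stated assumptions: the cost function is a `difflib`-based stand-in for the string template cost, the cutoff value is hypothetical, and the example strings use ASCII hyphens. Sorting all candidate pairs once and scanning them in order is equivalent to repeatedly picking the cheapest remaining pair.

```python
from difflib import SequenceMatcher


def cost(a: str, b: str) -> float:
    """Stand-in for the string template cost (0.0 means identical)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def greedy_match(xs, ys, cutoff=0.5):
    """Pair elements cheapest-first; pairs at or above the cutoff are never made."""
    candidates = sorted(
        (cost(x, y), i, j) for i, x in enumerate(xs) for j, y in enumerate(ys)
    )
    used_x, used_y, pairs = set(), set(), []
    for pair_cost, i, j in candidates:
        if pair_cost >= cutoff:
            break  # every remaining candidate is at least as expensive
        if i in used_x or j in used_y:
            continue
        used_x.add(i)
        used_y.add(j)
        pairs.append((xs[i], ys[j]))
    unmatched = [x for i, x in enumerate(xs) if i not in used_x]
    unmatched += [y for j, y in enumerate(ys) if j not in used_y]
    return pairs, unmatched


matched, optional = greedy_match(
    ["le 16", "/20-/24", "/30-/32"], ["/16-/24", "le 8"]
)
```

On the running example, the greedy result pairs the two "le" entries and the two "/24" ranges and leaves "/30-/32" unmatched, which has the same shape as the unordered template in Figure 5 (middle).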

EVALUATION
Our primary focus is to examine the versatility, scalability, and accuracy of Diffy. We assess these characteristics through multiple case studies, each representative of a distinct aspect of the network stack. These include the configurations from the wide-area network of a large cloud provider, the radio units of an operational 5G testbed, and a collection of MySQL database configurations mined from GitHub. Due to space limitations, we have included the MySQL analysis in the supplemental materials. These configurations vary significantly in their structure and scale.

Case Study 1: Wide-Area Network
We analyzed one of the world's largest backbone networks, a wide-area network (WAN) consisting of thousands of routers and millions of configuration lines. Routers in the WAN are categorized by roles, such as edge, border, core, and reflector. Table 3 shows the total number of lines of configuration for each role. For sensitivity reasons, the exact role names and configuration details are anonymized. We apply Diffy to each role since configurations within a role share similar definitions, like prefix lists and community lists, though specifics can vary across routers.
Routers' running configuration files (i.e., those active on the devices) are stored in a centralized database, providing an accurate snapshot for Diffy. A set of golden configurations (i.e., the expected configurations) is also available. The running and golden configurations may differ for a large number of operational reasons, including partial or incomplete configuration change rollouts, temporary changes (e.g., QoS settings to alleviate ongoing network congestion), differing maintenance schedules (e.g., operators can only take a limited number of devices offline in a given month for refresh), and many more.
Table 3. Diffy performance on WAN configurations. For each configuration element type, the table shows the (P)arsing time to read JSON files, the (S)ynthesis time, and the time for finding (A)nomalies. Each entry lists two times separated by "/": the time to template or find anomalies without any tokens, and the time with additional tokens like [num] and [alpha]. All times are in seconds.

To reduce network outages, the WAN uses a custom "diffing" service that periodically compares the golden and running configurations and reports all current deviations that may need repair. We leverage this service to identify the ground truth (i.e., those differences that are unintentional) and use this data to evaluate the accuracy of Diffy. Because the service reports all differences, it also includes many differences that are ultimately harmless (e.g., the version number for the configurations). For this reason, we focus on evaluating Diffy's precision rather than recall.
Diffy performance. To convert vendor-specific configuration files to JSON, we utilized two parsers: open-source Batfish [17] and a closed-source internal parser. We employed the latter for prefix lists and route policies, while Batfish handled the rest (e.g., community lists, VRFs, SNMP servers, etc.). We used both parsers because Batfish offers broader coverage but also complicates and obscures the original configuration data by expanding and inlining prefix lists and route policies. The internal parser's JSON representation of prefix lists is exemplified in the top section of Figure 1. Using these parsers, we analyzed over 99% of the configuration lines for the WAN routers. Table 3 displays the time Diffy takes to transform the JSON files into internal data structures, synthesize a template, and find anomalies for each role and configuration element type, both with and without tokens. During the learning phase, we had Diffy report all potential anomalies, which were then sorted and filtered by score for operator review. In most instances, Diffy completes within seconds or minutes, even when processing millions of configuration lines. Route policies, being the most complex element, take the longest time. However, even in the worst-case scenario, Diffy completes route policies in approximately 20 minutes.
Accuracy of Diffy. We run Diffy on each role using a minimum anomaly score filter of 0.51 to find all issues with some evidence of being an anomaly. For each router and configuration element name, we record whether Diffy found an anomaly in that element for that router and compare the result with the "diffing" tool based on golden configurations. From this ground truth, we plot the precision of Diffy (the fraction of reported issues that are true positives) as a function of the anomaly score filter in Figure 6. Diffy's anomaly score is closely correlated with its precision against the "diffing" service. For instance, for issues with an anomaly score of 0.8 or higher, roughly 80% of Diffy's findings are confirmed as true positives, and the precision rises to 97% for an anomaly score of 0.88 or higher. While ground truth exists for route policies and prefix lists, other elements such as SNMP servers are not currently tracked by the diffing tool. For these, we identified a small subset of classes of issues that Diffy reported with high scores and manually investigated them with network operators. Of these, the operators identified 77% as true positives, 8% as false positives, and 15% as requiring further investigation. Many of the issues represented "cruft" in the configurations from legacy devices and were subsequently addressed by the operators.
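The precision-versus-filter computation behind a plot like Figure 6 can be sketched as follows. The findings below are invented toy data for illustration only and are unrelated to the paper's measurements; `precision_at` is a hypothetical helper.

```python
def precision_at(findings, threshold):
    """Fraction of findings scoring at or above threshold that the ground truth confirms.

    findings is a list of (anomaly_score, confirmed) pairs.
    Returns None when no finding passes the filter.
    """
    kept = [confirmed for score, confirmed in findings if score >= threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Invented toy data: (anomaly score, confirmed by the ground truth?)
toy_findings = [(0.90, True), (0.85, True), (0.82, False), (0.60, False), (0.55, True)]

loose = precision_at(toy_findings, 0.51)   # keeps all five findings
strict = precision_at(toy_findings, 0.80)  # keeps the top three
```

Sweeping the threshold from the minimum filter upward and plotting the returned values gives the precision curve; recall is not computed because the ground truth also contains harmless differences.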
Breakdown of anomalies. We show the frequency of each type of anomaly expression from Table 2 across all configuration elements in the WAN in Table 4. Most reported issues for this dataset are related to (1) a configuration element that is missing, (2) a configuration element that is present but should not be, or (3) a parameter value from the template being different.
Template types. We examined the templates for each element type to better understand the use of the different list template types described in §3.5. For prefix lists, ordered templates are prevalent; however, in some cases where the prefixes differed greatly across routers, Diffy synthesized a more general template. Route policies mainly use object templates with certain list-type values, for which ordered templates are frequent, as the keys (match, set, or action commands) demand order. Repeat and unordered templates are common for other elements such as communities, AS paths, Virtual Routing and Forwarding instances (VRFs), and servers, as they have set semantics. These observations demonstrate that the different template types are useful in practice for capturing the domain-specific semantics of the various configuration components.
Example Bug: BGP Community Regex Match. Diffy discovered inconsistencies in regular expression filter policies used for matching route community tags. For example, some routers configured a regex starting with a "^" character, while most did not. Diffy synthesized the template [any] • [num] • ":2[1−3]" and found that the value of [any] was typically "" and only rarely "^". This character, when present, requires a match at the beginning of the BGP community string and may lead to a failed match. The operators acknowledged the problem and standardized the configurations.
Example Bug: IGP Timer. In another instance, a single router had an IGP routing protocol configuration parameter with an "overload" setting. This setting was supposed to be "true" but had previously been erroneously set to a time in seconds (e.g., "1200") on a single router. This switches the value to "false" 1200 seconds after a reboot and resulted in the router taking in more traffic than intended. The device did not have the capacity to handle this traffic and started dropping most of the packets, leading to connectivity issues for many of the cloud's customers. The configuration issue was resolved, but automation later mistakenly reapplied the flawed setting, opening up the potential for another outage. Diffy detected a high-confidence anomaly from the template {"overload" : ("false" | "1200")}, and we were able to alert the operators in time.

Case Study 2: 5G Radio Access Network
We collected configuration files from a 5G vRAN (virtualized Radio Access Network) testbed developed and operated within a global cloud provider. The testbed represents a state-of-the-art 5G vRAN deployment, consisting of hardware components such as radio units (RUs), vRAN servers, PTP (Precision Time Protocol) grandmaster clocks, and network switches, along with 5G software stacks.
Table 5. We compare the string synthesis performance of Diffy and FlashProfile [44]. The "Total calls" column shows the number of string synthesis calls in thousands (§3.4). We also show the mean and median number of strings in each call. The sub-columns show the breakdown by prefix list, route policy, and other.
Our focus in this dataset is on the configurations established for a total of seven RUs in the testbed. These RUs, manufactured by Foxconn for 5G vRAN, use four antennas to transmit and receive radio signals on the n78 band (3.3-3.8 GHz). During operation, each RU loads an XML configuration file that specifies its behavior, e.g., the MAC address of the vRAN server for baseband processing, the IP address of the PTP grandmaster for clock synchronization, as well as many low-level configuration knobs (see the supplemental materials for a sample configuration snippet).

Given that the testbed is time-shared among users who frequently experiment with various setups, it is common for multiple variants of the XML configuration to coexist on each RU. To evaluate Diffy, we first acquired 42 RU configurations that were deliberately created by previous testbed users, excluding factory defaults. In addition, we recreated several types of bugs commonly encountered during the daily operation of the 5G testbed and retroactively injected them by generating an additional 8 synthetic configurations to simulate the operational experience. Each synthetic RU configuration showcases a distinct category of bug, such as a different firmware version, an old PTP grandmaster IP, or a deprecated VLAN tag for user-plane messages. In total, the final dataset encompasses 50 configurations.
Diffy performance. We converted the 50 XML configurations into JSON and ran Diffy on them to detect anomalies with two tokens: [num] and [hex]. Diffy completed in 225 ms, with 44 ms spent on parsing, 110 ms on synthesis, and 71 ms on discovering anomalies. Diffy outputs anomalies in descending order of their anomaly scores, and we manually reviewed the 64 flagged issues with an anomaly score of 0.7 or higher. Among these issues, 62 were true positives, resulting in a precision of 96.9%. Notably, Diffy correctly identified 38 issues in non-synthetic configurations and flagged all retroactively injected bugs with high confidence, resulting in a recall of 100%. The only two false positives were associated with uncommon yet plausible values for RRH_DST_MAC_ADDR, the MAC address of the destination vRAN server, because Diffy has no knowledge of the MAC address that the user intended to allocate to a vRAN server.
Example bugs. Diffy successfully identified a misconfigured RU frequency with the synthesized template "3" • [num] and value "3929700" kHz. This was outside the range of typical values, which all fall into the n78 band. In another case, Diffy flagged a field called "RRH_TX_ATTENUATION", which indicates the transmit power attenuation before the RU's power amplifier. In one configuration, this field has a value of "[20.0, 20.0, 20.0, 20.0]", whereas the typical value is "[30.0, 30.0, 30.0, 30.0]". As a result, Diffy flagged 4 anomalies, one for each element. It is likely that the configuration was created by a testbed user who experimented with power attenuation, but this deviation should still be flagged to the testbed operator.

Diffy Performance
In this section, we expand on Diffy's performance along several dimensions.
String synthesis performance. We evaluated Diffy's string synthesis algorithm from §3 by comparing it to FlashProfile [44], a leading data profiling tool that also learns regular expression patterns to represent string sets. We obtained the set of all calls in Diffy to synthesize string patterns for each of the WAN roles R1-R5 in Table 5. We replay this set of calls in both Diffy and FlashProfile and measure the total time to synthesize string patterns. We measure the performance difference for prefix lists (first column), route policies (second column), and other elements (third column), and show the corresponding speedup (FlashProfile time / Diffy time).
The performance comparison results are shown in Table 5. Diffy takes anywhere from under 1 second to 86 seconds to synthesize the string patterns. In contrast, FlashProfile takes between 10 seconds and 13.6 hours to synthesize similar patterns. This corresponds to a speedup of roughly 2-3 orders of magnitude for Diffy. One of the main reasons for these differences is likely the different approaches taken by the tools. FlashProfile has its own notion of pattern cost, but relies on heavyweight program synthesis to enumerate regular expression candidate patterns [44,45]. In contrast, Diffy takes the cost into account during synthesis as part of its dynamic programming algorithm. By using FlashProfile in Diffy, for instance, the time to template the route policies in role R1 would increase from roughly 20 min to over 13 days.
We also manually compared the quality of patterns generated by Diffy and FlashProfile using FlashProfile's publicly available benchmark dataset [43]. We found that Diffy produced similar sets of patterns as FlashProfile if given the same regular expression tokens, costs, and an appropriate grouping threshold cutoff. Moreover, slight variations in the patterns often do not impact anomaly detection much (see §6).
Performance trends. We have seen that Diffy can easily scale to the entire WAN. However, to get a better sense of how the performance varies with the number of configurations, we additionally ran Diffy on smaller subsets of the full set of configurations of the largest role, R1, and observed its relative performance. In Figure 7, we show the fraction of the runtime for the subset of configurations compared to that of the full set of configurations, relative to the ratio of configurations analyzed. This lets us see the general scaling trend, which appears to be just slightly more than linear.
To evaluate the performance of Diffy in relation to the number of tokens, we executed Diffy on WAN configurations, varying the number of tokens between 1 and 10. Due to space limitations, the selected tokens and the particulars of the experiment are described in the supplemental materials. Our findings reveal that Diffy scales linearly with the number of tokens.

Diffy Hyperparameter Sensitivity
To evaluate the sensitivity of the grouping threshold used to create choice templates, we vary the threshold from 0.02 to 0.98 and plot the fraction of the maximum possible anomalies reported at each threshold using an anomaly score threshold of 0.8 or higher in Figure 8 for the WAN, RAN, and MySQL configurations (see the supplemental materials). The plot shows the results for each of the configuration element types in the WAN for all the roles summed together.
We find that beyond a certain point (e.g., 0.2) the particular value of the threshold does not greatly affect the anomalies reported. There may be several reasons for this phenomenon. For the WAN, many of the anomalies are due to missing policies (see Table 4), which remain largely unaffected by the particular groupings of patterns. In other cases, however, the anomaly detection approach described in §4 is fairly robust to the groupings used. For instance, Diffy may detect an anomaly in the value of a string parameter (e.g., value) or due to separate patterns (e.g., choice).

Diffy Generality
Diffy offers extensive coverage of configurations due to its support of complex structure for formats such as JSON, XML, and YAML, as well as its ability to learn domain-specific data formats from examples. In Table 6, we showcase this generality by comparing a sample of WAN router configuration element types against existing data-driven configuration analysis tools. SelfStarter [31] is built specifically to analyze router configurations and supports structured policies like prefix lists and route policies. However, it requires hand-tuning and a specific implementation for each element type that one wants to analyze and therefore cannot support many conceptually similar and widely used configuration elements like BGP community lists without nontrivial modification. Even in the domain of router configurations, Diffy analyzes significantly more configuration features than SelfStarter. Meanwhile, other tools [50,51] find simple kinds of issues for basic strings but fail to analyze complex structured configurations.

RELATED WORK
Diffy is related to several lines of prior work:
Configuration anomaly detection. Previous research has extensively explored identifying configuration errors using data-driven learning, such as ConfigC [51], ConfigV [50], and Minerals [36]. These approaches view configurations as bags of key-value pairs, learning rules like type or arithmetic rules. EnCore [60] learns configuration invariants while considering the system environment to minimize false positives (e.g., by verifying file paths). However, unlike Diffy, these methods do not apply to hierarchically structured configurations like JSON and do not handle ad hoc data formats, limiting their applicability.
SelfStarter [31] is a closely related work, uniquely capable of analyzing some table structures in configurations. It identifies errors in ACLs, prefix lists, and route policies, but needs domain-specific semantic knowledge (e.g., about IP address octets and ACL reordering). In contrast, Diffy generalizes SelfStarter's approach to manage general JSON structures and ad hoc data formats.
Additionally, as discussed in Table 6, Diffy supports a broader range of templates, such as string, repeat, and choice templates (refer to Figure 1).
Synthesizing data formats. Learning concise patterns from a given set of strings (§3.4) has been explored in prior work [1,16,44], including FlashProfile [44] and PADs [16]. FlashProfile relies on expensive program synthesis techniques that involve enumerating all regular expressions matching a set of examples [45], then utilizing cost metrics and clustering methods to choose the most appropriate patterns. However, exhaustive enumeration renders FlashProfile impractical for frequent use in template synthesis, as shown in Table 5. In comparison, Diffy employs a dynamic programming approach and various optimizations to efficiently synthesize string templates.
PADs attempts to solve the ambitious problem of learning complex data formats directly from text using a top-down statistical approach. Rather than learning from raw text, Diffy instead directly leverages the hierarchical structure that exists in JSON and optimally solves the string synthesis sub-problem using dynamic programming. PADs also uses regular expression tokens but relies on a pre-tokenizing step for input strings that may hinder anomaly detection (e.g., all integers are converted to [num] prior to analysis). Diffy lazily tokenizes strings based on the specific configuration values. These two approaches vary greatly in their details, though it may be possible to combine ideas from PADs and Diffy.
Network verification. Verification enables users to formally prove the correctness of their configurations by precisely modeling a system and allowing proactive analysis of configuration changes. By comparing behavior to user-provided specifications, verifiers identify bugs before deployment. Verification has been successful in specialized domains like data plane forwarding [3,26,28,34,35,39,40,57,61], routing [4-6, 15, 20, 53], and DNS [30]. However, it requires (1) rigorous system modeling, which can be time-consuming or infeasible for many systems, and (2) user specifications, which may be challenging to obtain. Diffy requires no specifications and no system modeling but also provides no guarantees of correctness.
Log file anomaly detection. There is extensive work related to log file anomaly detection [11,19,24,25] that attempts to predict system failures based on anomalous log file entries. These tools typically work by clustering log lines into templates such as "Command Failed on: *" and tracking template counts over time to detect anomalous behavior. For instance, an increase in a template that mentions the word "Failed" may predict a future system crash. Compared to Diffy, these tools are highly specialized to the domain of log files and neither analyze the hierarchical structure of JSON nor find bugs in template parameters that are common in configurations.

CONCLUSION
In this paper, we presented Diffy, the first tool that can identify likely bugs in complex configurations with ad hoc data formats. Given JSON configurations, Diffy first efficiently synthesizes a low-cost template that succinctly captures similarities as well as differences across these configurations. This template approach leverages a novel dynamic programming algorithm to efficiently learn a set of patterns summarizing string data, and Diffy then lifts this algorithm recursively to synthesize templates for structured data. Finally, by applying unsupervised learning, Diffy identifies likely configuration errors as deviations from normal behavior. Evaluating Diffy on a variety of network configurations, we demonstrated its versatility, scalability, and accuracy.

Fig. 1. Example of Diffy in use for anonymized and simplified configurations inspired by a wide-area network (48 configurations in total). Top left shows example configuration snippets for WAN routers. Right shows the synthesized template produced by Diffy. Bottom left shows the high-confidence bugs flagged by Diffy.

Fig. 3. Definition of a template cost C given JSON configurations and a template instantiation function.

Fig. 4. Diffy's string template synthesis algorithm for strings "10M", "15M", and "200" using regular expression tokens for numbers [num] = "[0−9]+" and file paths [path] = ".*/.*" with cost 1/5. The algorithm has three steps: (1) constructing DFAs for each token, (2) synthesizing a template using dynamic programming, and (3) tracking automaton states in index tables. The table entries can either have an exact match with zero cost or jump backward by matching a regex token. An index entry, such as {q3} ↦ {1,2} at [num] for token [path], signifies that the substring starting at index 1 or 2 and ending at [num] leads to {q3} in the FSM of [path]. Since q3 is not a final state, it does not match either substring. The final path through the table results in the pattern "1" • [num] • "M" before being adjusted to [num] • [any] for the third string.

Fig. 8. Fraction of anomalies vs. grouping threshold for the WAN, RAN, and MySQL configurations.

Table 2. Example expressions for a value x with label ℓ that transform the value to another value, with example outputs.
present(x): Optional value has a value. Example output: "true"
value(x)

Table 4. Percentage of anomalies by expression type for WAN configurations with anomaly threshold 0.8.

Table 6. Representative subset of router configuration elements and their support in existing tools.