Benchmarking Combinations of Learning and Testing Algorithms for Automata Learning

Automata learning enables model-based analysis of black-box systems by automatically constructing models from system observations, which are often collected via testing. The required testing budget to learn adequate models heavily depends on the applied learning and testing techniques. Test cases executed for learning (1) collect behavioural information and (2) falsify learned hypothesis automata. Falsification test cases are commonly selected through conformance testing. Active learning algorithms additionally implement test-case selection strategies to gain information, whereas passive algorithms derive models solely from given data. In an active setting, such algorithms require external test-case selection, like repeated conformance testing to extend the available data. There exist various approaches to learning and conformance testing, where interdependencies among them affect performance. We investigate the performance of combinations of six learning algorithms, including a passive algorithm, and seven testing algorithms by performing experiments using 153 benchmark models. We discuss insights regarding the performance of different configurations for various types of systems. Our findings may provide guidance for future users of automata learning. For example, counterexample processing during learning strongly impacts efficiency, which is further affected by testing approach and system type. Testing with the random Wp-method performs best overall, while mutation-based testing performs well on smaller models.


INTRODUCTION
Automata learning enables the automatic construction of finite-state models of black-box software systems on the basis of samples of their input-output behaviour. In active approaches to learning, these samples are commonly gathered through testing those systems. Hence, the combination of automata learning and testing offers the potential to apply model-based verification to black-box systems, which would otherwise be infeasible. Applications of learning-based verification range from model checking [19] over manual analysis of models [18] to model-based regression testing [4]. Active learning was also used to learn the behaviour of recurrent neural networks [32, 36, 43, 67]. Domains in which such techniques have been successfully applied include communication protocols [5, 6, 18-21, 48-50, 60], embedded systems [2, 3, 54], and cyber-physical systems [1, 33].
There are two main types of automata learning: active learning and passive learning. Active learning repeatedly alternates between two phases of learning involving two types of queries. Membership queries are test cases that are executed to gain knowledge about the system under learning (SUL) to build hypothesis automata, while equivalence queries check whether a hypothesis conforms to the SUL. The former are selected by the used learning algorithm, while the latter are usually implemented through conformance testing. The selection of test cases for both types of queries affects the learning runtime. In contrast, passive learning takes a sample of the SUL's input-output behaviour and constructs a finite-state model that is consistent with the provided sample. There is no data selection prescribed by algorithms of this type. The accuracy of the learned model completely depends on the given sample data. For this reason, approaches to iteratively refine passively learned models have been proposed in the literature, thus turning passive learning into active learning. An approach to achieve this consists of testing intermediate learned models to extend the sample data with input-output behaviour observed during testing [64]. Compared to conventional active learning, such approaches may be viewed as performing only equivalence queries.
Since the dominant factor in active automata learning is usually the test execution time [18, 60], it is paramount to minimise the number and length of tests. Various approaches have been suggested for minimising the number of tests required for membership queries [29, 51] and for equivalence queries [7, 28]. In this article, we empirically analyse the interaction between these approaches in active automata learning. To gain a better understanding of the influence of membership queries in general, we also analyse the behaviour of a passive learning approach that we make active through conformance testing.
We examine the learning performance of combinations of various learning algorithms and conformance-testing algorithms implementing equivalence queries. Our goal is to provide data on the relative performance of different learning setups by determining the testing budget required for correct learning. Such data may support practitioners in choosing a particular learning setup. Our analysis focuses on communication protocols and is based on benchmark models from the field of active automata learning [44]. Models of communication protocols are often succinct, but they may be challenging to test. For example, the learned Mealy machine models of Transmission Control Protocol (TCP) servers [19], Message Queuing Telemetry Transport (MQTT) brokers [60], Transport Layer Security (TLS) servers [18], and Secure Shell Protocol (SSH) servers [21] have at most 66 states. Note that abstraction is generally applied for automata learning to be feasible. These properties make such protocols well-suited experimental subjects for the evaluation of approaches to learn complete automata models.
This article is the extended version of a paper presented at the 14th International Conference on Tests and Proofs (TAP 2020), which was postponed and merged with TAP 2021.


RELATED WORK
Smetsers et al. [56] presented an efficient method for finding counterexamples in active automata learning that applies mutation-based fuzzing. In their evaluation, they compared four learning configurations with respect to learning performance, measured in terms of the size of learned models and the required number of queries. They considered combinations of the L* algorithm [10] and the TTT algorithm [29] with their proposed method and the W-method [15, 63]. Peled et al. [47] introduced a method of combining automata learning with model checking, showing that it is more efficient than first learning the automaton and model checking it afterwards. Counterexamples for model checking basically serve as an additional source of counterexamples for learning. They additionally apply the W-method for conformance testing to avoid incorrect results stemming from spuriously positive model-checking results that do not produce counterexamples.
Groz, Brémond, and Simão applied concepts from finite-state-machine-based testing to implement an efficient algorithm for active learning of Mealy machines without resets [14, 26]. The authors evaluated various active learning configurations, including L*-based configurations with and without resets.
Additionally, we use an approach for making a passive algorithm active that was proposed by Walkinshaw et al. [64]. This approach was also applied in other contexts, such as SMT-based passive learning [55] and learning of timed automata via genetic programming [6]. An even earlier approach of extending passive learning with queries was proposed by Grinchtein et al. [25].

PRELIMINARIES
In this section, we define Mealy machines, provide an overview of active learning of Mealy-machine models of black-box systems, and discuss test-case selection for this kind of learning. We conclude the section by discussing passive learning via RPNI [46].

Mealy Machines
We use Mealy machines as modelling formalism, as they have successfully been used in contexts combining learning and verification [18, 19, 35, 60]. Additionally, the Java library LearnLib [30] provides algorithms for both learning and conformance testing of Mealy machines.
Mealy machines are finite-state machines with inputs and outputs. Their execution starts in an initial state, and they change their state by executing inputs. During execution, they produce exactly one output in response to each input. Formally, Mealy machines are defined as follows.

Definition 3.1 (Mealy Machines).
A Mealy machine M is a 6-tuple M = ⟨Q, q0, I, O, δ, λ⟩, where
• Q is a finite set of states,
• q0 is the initial state,
• I and O are finite sets of input and output symbols,
• δ : Q × I → Q is the state transition function, and
• λ : Q × I → O is the output function.
Our definition implies that Mealy machines are input enabled and deterministic. This means that outputs and successor states must be defined for all inputs in all states. A Mealy machine is deterministic if there is at most one output and one successor state for every pair of input and source state, in contrast to non-deterministic Mealy machines that may have multiple outputs and multiple successor states for any given pair of input and source state.
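The 6-tuple above can be sketched as a small data structure. The following Python class and example machine are illustrative only (dictionary-based δ and λ, names chosen for this sketch; this is not LearnLib code):

```python
class MealyMachine:
    """Deterministic, input-enabled Mealy machine M = (Q, q0, I, O, delta, lam)."""

    def __init__(self, q0, delta, lam):
        self.q0 = q0        # initial state
        self.delta = delta  # dict: (state, input) -> successor state
        self.lam = lam      # dict: (state, input) -> output

    def output(self, inputs, q=None):
        """Output function extended to input sequences: lam(q, s) in O*."""
        q = self.q0 if q is None else q
        outputs = []
        for i in inputs:
            outputs.append(self.lam[(q, i)])
            q = self.delta[(q, i)]
        return outputs

# A two-state machine: input 'a' toggles the state and reports the parity.
m = MealyMachine(
    q0=0,
    delta={(0, 'a'): 1, (1, 'a'): 0},
    lam={(0, 'a'): 'odd', (1, 'a'): 'even'},
)
print(m.output(['a', 'a', 'a']))  # ['odd', 'even', 'odd']
```

Since δ and λ are total on Q × I here, the machine is input enabled, and storing them as dictionaries with a single entry per (state, input) pair makes it deterministic by construction.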
We extend λ to sequences of inputs in the standard way. For s ∈ I* and q ∈ Q, the output function λ(q, s) = t ∈ O* returns the outputs produced in response to s executed in state q, and we define λ(s) = λ(q0, s). We say that two Mealy machines over the same input and output alphabets are equivalent if they produce the same outputs in response to all possible input sequences. Let M1 and M2 be two Mealy machines with output functions λ1 and λ2, respectively. Formally, they are equivalent, denoted M1 ≡ M2, if

λ1(s) = λ2(s) for all s ∈ I*.    (1)

Furthermore, we define a trace as a sequence t ∈ (I × O)*, that is, a sequence of pairs consisting of inputs and corresponding outputs. We use the function makeTrace(i, o) = t to create a trace from a pair of input and output sequences i ∈ I* and o ∈ O* of the same length. Given two sequences s, s' ∈ I*, we denote their concatenation as s • s' = s''. We say that s is a prefix of s'' and s' is a suffix of s'', where s and s' may be the empty sequence ϵ; thus, s'' is a prefix/suffix of itself. We use the function prefixes(s) to return the set containing all prefixes of s.

Fig. 1. The interaction between a learner and a minimally adequate teacher (MAT) communicating with a SUL to learn a Mealy machine [62].
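The helper functions makeTrace and prefixes admit direct definitions; the following is an illustrative sketch (names adapted to Python convention):

```python
def make_trace(inputs, outputs):
    """Pair an input sequence with the outputs it produced: a trace in (I x O)*."""
    assert len(inputs) == len(outputs), "sequences must have the same length"
    return list(zip(inputs, outputs))

def prefixes(s):
    """All prefixes of s, from the empty sequence up to s itself."""
    return [tuple(s[:k]) for k in range(len(s) + 1)]

print(make_trace(['a', 'b'], ['0', '1']))  # [('a', '0'), ('b', '1')]
print(prefixes(['a', 'b']))                # [(), ('a',), ('a', 'b')]
```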
As we talk about testing a lot in this article, let us also formally define test case and test step. Syntactically, a test case is a sequence of inputs i ∈ I*. Semantically, a test-case execution consists of resetting the SUL, applying the test case, and observing the outputs produced by the SUL; therefore, a test case is always applied from the initial state q0. A test step is the evaluation of δ(q, i) for a given q ∈ Q and i ∈ I and the observation of the corresponding output λ(q, i).

Active Automata Learning
We apply active automata learning algorithms in the minimally adequate teacher (MAT) framework introduced by Angluin for the L* algorithm [10]. While L* was originally proposed for deterministic finite automata (DFA), it has been extended to other types of automata, such as Mealy machines [35, 45, 52]. For the following discussion of learning, we assume that we interact with a MAT to learn a Mealy machine producing the same outputs as a black-box SUL. In this context, the MAT basically wraps the SUL, which is assumed to behave like a Mealy machine.
A MAT is usually required to answer two types of queries that are posed by learning algorithms. These queries are commonly called membership queries and equivalence queries; see Figure 1 for a schematic depiction of the interaction between a learning algorithm, also called learner, and a MAT, also called teacher.
In membership queries (also called output queries [52]), the learner provides a sequence of inputs and asks for the corresponding outputs. In test-based learning, the teacher implements such a query by performing a single test on the SUL, while recording the observed outputs. In equivalence queries, the learner provides a learned hypothesis automaton and asks whether this automaton is correct, i.e., whether it is equivalent to the SUL. This is commonly implemented through conformance testing, i.e., the teacher generates a test suite from the hypothesis and executes it on the SUL. If a test case reveals a difference between SUL and hypothesis, then it is returned as a counterexample to equivalence. Otherwise, the teacher returns yes, signalling that SUL and hypothesis are considered to be equivalent. Put differently, conformance testing approximates equivalence checking between the hypothesis and the Mealy machine underlying the SUL. It basically checks Equation (1) while sampling only a finite set of input sequences from I*. From the point of view of the SUL, membership queries and tests look the same: their execution requires a sequence of inputs upon which the SUL produces a sequence of outputs. We depict their similarity in Figure 1 by using the same labels in the teacher.
Active automata learning operates in rounds by performing the two types of queries in alternation. In every round, the learner performs membership queries until there is sufficient information to build a hypothesis. What constitutes sufficient information depends on the learning algorithm, but generally, all algorithms will only build a hypothesis if they have enough information to build an input-complete hypothesis, i.e., every input's transition is determined in every state of the hypothesis. In the case of L*, this means a closed and consistent observation table (see Section 4.1.1). After that, the learner issues an equivalence query, asking whether the hypothesis is correct and learning can stop. If the teacher returns a counterexample, then the learner integrates it into its knowledge and starts a new round of learning. Otherwise, learning stops with the correctly learned hypothesis as output.
Because active automata learning algorithms produce minimal automata as their intermediate hypotheses [10], every equivalence query must necessarily add at least one new state to a hypothesis. The number of equivalence queries required is therefore bounded linearly by the size of the automaton.
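The round-based interaction described above can be sketched as a simple loop. The learner and teacher classes below are hypothetical toy stand-ins (a lookup table plays the role of the hypothesis, and the teacher checks it exhaustively); they are meant only to show the control flow of membership and equivalence queries, not a real learning algorithm:

```python
class ToyLearner:
    """Hypothetical learner that records corrections until no more arrive."""
    def __init__(self):
        self.known = {}                    # input -> output, learned so far
    def build_hypothesis(self):
        return dict(self.known)            # 'hypothesis': a lookup table
    def process_counterexample(self, cex):
        inp, out = cex
        self.known[inp] = out              # integrate counterexample

class ToyTeacher:
    """Teacher wrapping a 'SUL' given as a plain input/output dict."""
    def __init__(self, sul):
        self.sul = sul
    def equivalence_query(self, hyp):
        for inp, out in self.sul.items():  # stand-in for conformance testing
            if hyp.get(inp) != out:
                return (inp, out)          # counterexample found
        return None                        # hypothesis deemed correct

def learn(learner, teacher):
    """Rounds of hypothesis construction and equivalence queries."""
    while True:
        hyp = learner.build_hypothesis()
        cex = teacher.equivalence_query(hyp)
        if cex is None:
            return hyp                     # learning stops
        learner.process_counterexample(cex)

print(learn(ToyLearner(), ToyTeacher({'a': '0', 'b': '1'})))  # {'a': '0', 'b': '1'}
```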

Test-Case Selection for Learning. There are several factors influencing test-case selection in the learning process outlined above. First, learning algorithms differ in the number of membership queries required for creating hypotheses. This largely depends on the internal data structures used, such as the observation tables of L* [10]. Tree-based learners often require fewer membership queries per round [29, 31]. The second factor concerns counterexample processing, which may also require testing. There are different ways to extract information from counterexamples, affecting the content stored in data structures and consequently future membership queries. Third, the test-case selection for conformance testing depends on the applied testing technique. Since test cases revealing differences serve as counterexamples, conformance testing affects subsequent counterexample processing and the selection of membership queries. Therefore, we investigate which combinations of learners and testing techniques are the most efficient overall, i.e., which combinations require the lowest testing budget for learning.

Passive Automata Learning
In contrast to active learning algorithms, passive ones do not have access to a teacher and may therefore not interact with the SUL at all. They must instead form a hypothesis of the workings of the SUL by analysing a sample S, a set of traces.
For the sake of simplicity, let us consider passive learning of DFA first, where we want to learn an automaton consistent with a sample of words from I*. In this case, a sample is usually of the form S = (S+, S−), where S+ contains words that should be accepted by the learned automaton, called a positive sample, and S− contains words that should be rejected by the automaton, called a negative sample. The goal is then to find an automaton with the smallest number of states that accepts all words in S+ and none of the words in S−.
In this article, the RPNI algorithm by Oncina and Garcia [46] is used to passively learn automata. To learn DFA, RPNI initially constructs a prefix tree automaton that accepts exactly the words in the positive sample S+. It then attempts to merge all states (tree nodes) with each other as long as no inconsistency is detected, i.e., as long as the resulting automaton does not accept a word from the negative sample S−. Finally, the algorithm terminates once all merges have been attempted and no further merges are possible, returning the final automaton as the learning result.
The automaton resulting from applying the RPNI algorithm is completely determined by the information available in the sample S. The larger the sample and the more useful the traces in it, the more accurate the final automaton becomes. RPNI is guaranteed to produce an automaton isomorphic to the canonical automaton underlying the regular language to be learned if the sample S is a characteristic set [17, 46]. This means that an automaton learned from a characteristic set is minimal in terms of the number of states, like intermediate hypotheses in active automata learning [10]. However, automata learned from non-characteristic sets are typically larger. While other algorithms, such as SAT-solving techniques, produce minimal intermediate hypotheses, we have chosen RPNI for its polynomial runtime in the sample size. However, using RPNI with an equivalence oracle means that we do not have the bound on the number of equivalence queries that most active automata learning algorithms have, as RPNI does not necessarily produce minimal intermediate hypotheses. Without going into details, we note that characteristic sets of polynomial size with respect to the canonical automaton exist for RPNI [17].
A DFA D admits a straightforward transformation to a Mealy machine M with the two outputs true and false. Every word accepted by D corresponds to a trace of M where the last output is true, and every rejected word corresponds to a trace where the last output is false. This insight provides a way of generalising RPNI to such Mealy machines. Recall that an inconsistent merge is one that would lead to accepting a word from S−. In the Mealy machine interpretation, such a merge would lead to non-determinism, where an input sequence should map to true and false at the same time. Hence, for learning Mealy machines over arbitrary output alphabets with RPNI, merges are performed on input sequences, and they are rejected as inconsistent if they would introduce non-deterministic behaviour in the resulting Mealy machines. In this case, the sample S provided to learning is a set of traces. We use the RPNI implementation for Mealy machines available in LearnLib [30].
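The core of this merging criterion, checking whether two states can be merged without introducing non-determinism, can be sketched as follows. This is an illustrative fragment (prefix-tree construction plus a recursive compatibility check on a tree), not the full state-merging loop of RPNI:

```python
def build_prefix_tree(traces):
    """Prefix tree from traces: node -> {input: (output, child_node)}."""
    tree = {0: {}}
    fresh = 1
    for trace in traces:
        node = 0
        for inp, out in trace:
            if inp not in tree[node]:
                tree[node][inp] = (out, fresh)
                tree[fresh] = {}
                fresh += 1
            prev_out, child = tree[node][inp]
            assert prev_out == out, "sample itself is non-deterministic"
            node = child
    return tree

def compatible(tree, a, b):
    """Merging nodes a and b keeps the machine deterministic: outputs agree
    on all shared inputs, and the corresponding successors are compatible."""
    for inp in set(tree[a]) & set(tree[b]):
        out_a, succ_a = tree[a][inp]
        out_b, succ_b = tree[b][inp]
        if out_a != out_b or not compatible(tree, succ_a, succ_b):
            return False
    return True

# Node 0 answers 'a' with '0', its child (node 1) answers 'a' with '1':
traces = [[('a', '0'), ('a', '1')], [('a', '0')]]
t = build_prefix_tree(traces)
print(compatible(t, 0, 1))  # False
```

A merge is rejected exactly when the check fails, mirroring the inconsistency criterion in the text: an input sequence would otherwise map to two different outputs at once.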

EXPERIMENTAL SETUP
We evaluate the performance of combinations of six learning algorithms and seven conformance-testing algorithms. For this purpose, we determine the lowest conformance-testing budget for learning to be successful, i.e., for learning correct models. To determine whether a given testing budget is sufficient to learn correctly, we "re-learn" known models of network protocols, which are part of a benchmark suite for learning and testing [44]. We treat these models as black boxes during learning by simulating them to generate outputs in response to input sequences. Once learning terminates, we compare the learned model to the true model. We deem learning successful if the correct model has been learned once with a configuration involving deterministic testing. In experiments involving randomised testing, we repeat learning runs ten times and deem learning successful if all runs produce the correct model. To ensure reproducibility, we use fixed seed values for random number generators.
We chose to repeat learning runs with randomised testing ten times to reduce the variability of the measured algorithm performance. It would be ideal to perform more repetitions, because variability increases with problem size, but to keep the experimentation effort manageable, we fixed the number of runs to ten. For example, some combinations with random testing require several hours to learn certain SULs a single time.
The random number generation used during testing is the one used by the implementations of the various testing algorithms in LearnLib [30]. These use the pseudo-random number generator available in Java. Upon launch of a new learning run, this random number generator is seeded with a seed between 1 and 10, depending on which of the ten runs is currently executed. These seeds are fixed for all experiments, i.e., we always use the same seeds, to allow for ten reproducible yet different runs on every benchmark model. Seeding the random number generator will then produce the same sequence of pseudo-random numbers every time, thus allowing for consistently reproducible experiments. However, this reproducibility depends on the implementation of the random number generator, such that using different versions or implementations of Java, such as OpenJDK or Oracle, may lead to different learning results. Additionally, other factors may also prevent reproducible experiments, such as different implementations of learning or testing algorithms, the choice of iterator functions, or even specific encodings of input and output alphabets.

Table 1. Evaluated learning algorithms and testing algorithms.

Learning algorithm              Testing algorithm
L* [10, 52]                     W-method [15, 63]
RS [51]                         partial W-method [23]
KV [31]                         random words
TTT [29]                        random walks
Active RPNI [46] (Section 5)    mutation [7]
ADT [22]                        transition coverage [7]
                                random Wp-method
The setup for the learning experiments and the measurement results from these experiments can be found in the supplementary material [66]. The results include all relevant data, such as system resets (executions of test cases) and test steps (executions of test inputs) for equivalence and membership queries. Here, we present statistics computed from these data. In our target application of network protocols, we consider test steps to be the most relevant performance measure, as resets can often be implemented efficiently by simply reconnecting. Since we are interested in the overall performance of learning, we generally consider the combined number of test steps required for equivalence queries and membership queries.
Because we focus on the number of test steps and not on execution time, the setup should work with any CPU. It should generally be possible to estimate the time a test step requires on the SUL. For example, Tappler et al. [60] present times for MQTT brokers that range, depending on the application, between 25 and 300 ms for a single test step. This supports our decision to focus on test steps as the dominant metric for the performance of learning and testing algorithms, especially since the given benchmark set is composed mostly of communication protocols.
Most of the benchmarking was done in parallel to increase the speed at which benchmarks could be executed. In most cases, RAM was not an issue for the majority of algorithms. However, the more memory-intensive algorithms, such as mutation, transition coverage, and the random Wp-method, explained in the section below, were benchmarked single-threaded and had a maximum allocation of 12 GB of RAM during execution.

Selection of Algorithms
The evaluated learning algorithms are listed in the first column of Table 1 and the testing techniques in the second column. These lists include various popular algorithms available in LearnLib [30]. Hence, our evaluation considers, e.g., the performance of L* [10] combined with the partial W-method [23].
The learning algorithms can be split into three groups, and we will use this grouping for the presentation and discussion of experimental results. The first group comprises the four algorithms L*, RS, KV, and TTT. This group was considered in the conference version of this article. Here, we consider two additional learning algorithms that form their own groups with distinct characteristics.
The algorithms in the first group use preset sequences to distinguish states. This means that the states in learned hypothesis models are distinguished by a number of fixed experiments, i.e., through the execution of non-branching test cases. The second group contains only active RPNI, the passive learning algorithm RPNI that we make active by combining it with conformance testing. Thus, the test-case selection of active RPNI is mostly governed by the applied conformance-testing algorithm. The third group contains only the ADT algorithm. In contrast to the first group, this algorithm uses adaptive distinguishing sequences to determine states. That is, it uses the insight that states can be distinguished through the execution of branching test cases, where input stimuli depend on previous observations. Such a strategy potentially reduces the number of required tests.
In the following, we provide a brief discussion of the most important features of the applied algorithms. Generally, we apply the implementations of these algorithms available in LearnLib 0.16 [30]. In some cases, we slightly adapted the testing techniques to be able to control the number of test cases executed during equivalence queries.

Learning Algorithms with Preset Distinguishing Sequences.
L* and RS. Angluin established the basis for active automata learning by introducing the L* algorithm and the MAT framework [10]. L* stores information in so-called observation tables and processes counterexamples by adding all prefixes of a counterexample to the table. Rivest and Schapire improved L* by maintaining smaller observation tables [51]. This is achieved through advanced counterexample processing that extracts a distinguishing suffix from a counterexample. Such a suffix distinguishes two SUL states corresponding to a single state in the current hypothesis. We refer to this improved version as the RS algorithm.
The advanced counterexample processing of RS affects the membership query complexity. Angluin's L* requires O(kmn^2) membership queries [51], where k is the (input) alphabet size, m is the length of the longest counterexample, and n is the size of the learned automaton, while RS requires O(kn^2 + n log(m)) membership queries. Hence, the number of test cases performed for membership queries depends only logarithmically on the counterexample length.
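The counterexample processing of RS can be sketched as a binary search over the counterexample, which is where the logarithmic dependence on the counterexample length comes from. The callables sul_output and access_seq are assumptions of this sketch: the former returns the SUL's final output for an input sequence, the latter replaces a prefix by the access sequence of the hypothesis state it reaches.

```python
def find_distinguishing_suffix(cex, sul_output, access_seq):
    """Rivest-Schapire-style counterexample analysis (schematic).

    Binary search for an index where mapping the prefix cex[:i] into the
    hypothesis flips the SUL's answer; the remaining suffix then
    distinguishes two SUL states merged into one hypothesis state."""
    lo, hi = 0, len(cex)
    # Invariant: replacing cex[:lo] keeps the SUL's answer, replacing
    # cex[:hi] changes it (true initially because hypothesis and SUL
    # disagree on the counterexample).
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if sul_output(access_seq(cex[:mid]) + cex[mid:]) == sul_output(cex):
            lo = mid
        else:
            hi = mid
    return cex[hi:]  # found with O(log m) membership queries

# Toy example: the SUL computes the parity of 'a's; a one-state hypothesis
# maps every prefix to the empty access sequence.
sul = lambda seq: 'odd' if seq.count('a') % 2 else 'even'
acc = lambda prefix: []
print(find_distinguishing_suffix(['a', 'a', 'a'], sul, acc))  # ['a', 'a']
```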
KV and TTT. Kearns and Vazirani presented an active automata learning algorithm that stores queried data in trees [31]. We refer to this algorithm as the KV algorithm. Without going into details, the original KV algorithm requires one round of learning for each state of the final hypothesis, thus conformance testing needs to be performed more often. However, we have observed that each round requires fewer membership queries. The TTT algorithm [29] also stores information in trees, but improves upon KV in various ways. For example, it also processes counterexamples by extracting distinguishing suffixes. Additionally, counterexample prefixes are processed as well.
Analogously to L*, the number of membership queries performed by KV depends linearly on the counterexample length [31]. TTT, in contrast, has the same worst-case membership query complexity as RS [29]. We can expect TTT and RS to perform better than KV and L* in the presence of long counterexamples.

ADT. Introduced in [22], the ADT algorithm is a recently developed, highly efficient, incremental active learning algorithm. While its worst-case performance is asymptotically as efficient as, or even less efficient than, that of other learning algorithms, this mostly results from degenerate cases. Applications have shown ADT's performance to be among the best [38], which prompted its inclusion here to evaluate it in combination with different testing algorithms. In contrast to a more traditional learning algorithm such as L*, which produces few hypotheses, resulting in a low number of equivalence queries, incremental learning algorithms such as ADT try to produce more hypotheses quickly to reduce the number of membership queries. This property also makes ADT especially interesting to pair with different testing strategies. ADT works by integrating adaptive distinguishing sequences into its core data structure, the adaptive discrimination tree. It uses heuristics to reduce the number of resets during the learning process by replacing subtrees using adaptive distinguishing sequences. We will look more closely at the effects of this approach in Section 6.2.

Passive Learning.
In contrast to all other learning algorithms presented thus far, RPNI, as introduced by Oncina and Garcia [46], is a passive learning algorithm instead of an active one. It works by first constructing a tree out of given samples, such as observed input-output data. Then it merges all possible nodes in the tree that are compatible with each other, i.e., that do not result in non-deterministic behaviour during the merge. Once all nodes have been checked against each other, the resulting merged model is the learned hypothesis. To turn RPNI into an active learning algorithm, we combine it with conformance testing, like Walkinshaw et al. [64], which we discuss in detail in Section 5. The reason for the inclusion of RPNI is twofold: First, we use RPNI as a general representative of passive learning algorithms and their effectiveness when compared against active ones. Second, we study the potential performance impact of the choice of approach to applying a passive learning algorithm in an active context. For this purpose, we evaluate test-case selection based on detected counterexamples with varying thoroughness of testing.

Testing Algorithms.
Random Testing. Random-words-based testing and random-walks-based testing generate random sequences of inputs. Both select inputs completely randomly and differ only in the distribution of the test-case length. The length of random words is uniformly distributed within some range, whereas random walks have a geometrically distributed length.
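The two generators can be sketched as follows. The parameter defaults mirror the configuration used in our experiments (word lengths in [10, 50] and stop probability 1/30, which gives both techniques an expected length of 30); the function names are illustrative:

```python
import random

def random_word(inputs, min_len=10, max_len=50, rng=random):
    """Random words: length uniformly distributed in [min_len, max_len]."""
    return [rng.choice(inputs) for _ in range(rng.randint(min_len, max_len))]

def random_walk(inputs, p_stop=1 / 30, rng=random):
    """Random walk: after each step, stop with probability p_stop, giving a
    geometrically distributed length with expected value 1 / p_stop."""
    walk = []
    while True:
        walk.append(rng.choice(inputs))
        if rng.random() < p_stop:
            return walk

rng = random.Random(1)            # fixed seed, as in our experiments
w = random_word(['a', 'b'], rng=rng)
print(10 <= len(w) <= 50)         # True
```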
Variations of the W-Method. The W-method [15, 63] is a deterministic conformance-testing technique, which requires a bound m on the number of SUL states. Given such an m, it can prove equivalence between hypothesis and SUL up to m states. Hence, if all generated test cases pass, then we know that either SUL and hypothesis are equivalent, or the SUL has strictly more than m states. LearnLib [30] uses a depth parameter to define m, which specifies the difference between the number of hypothesis states and m. The partial W-method [23], also called Wp-method, improves upon the W-method by requiring fewer test cases, while providing the same guarantees. However, the number of test cases generated by both techniques is exponential in the bound m, thus they usually do not scale to large systems. The random Wp-method, as implemented in LearnLib [30], uses the partial W-method as basis, but executes only a random subset of all generated test cases; therefore, it does not prove equivalence.
Since the W-method generally creates larger test suites than the partial W-method, individual equivalence queries using the partial W-method are more efficient. However, the partial W-method and the W-method may find different counterexamples, leading to different intermediate hypotheses. For this reason, we included both testing algorithms in our evaluation.
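Schematically, every W-method test case concatenates a transition-cover prefix, a bounded "middle" part of arbitrary inputs, and a characterising sequence. The following sketch illustrates why the suite grows exponentially in the depth parameter; function and parameter names are illustrative, and the transition cover and characterising set are given rather than computed:

```python
from itertools import product

def w_method(transition_cover, inputs, charset, depth):
    """W-method test suite (schematic): transition-cover prefix, up to
    `depth` arbitrary middle inputs, then a characterising sequence.
    The suite size grows as |P| * |I|^depth * |W|."""
    suite = []
    for d in range(depth + 1):
        for p, mid, w in product(transition_cover,
                                 product(inputs, repeat=d),
                                 charset):
            suite.append(list(p) + list(mid) + list(w))
    return suite

# Two-state toy hypothesis: cover = {eps, a}, W = {a}, depth 1.
suite = w_method([[], ['a']], ['a', 'b'], [['a']], depth=1)
print(len(suite))  # 6: 2 prefixes * (1 + 2) middle parts * 1 suffix
```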
Mutation and Transition Coverage. In our previous work, we developed two conformance-testing techniques for active automata learning, which work similarly. Both techniques start by generating a large set of test cases through random walks on the hypothesis. The random walks alternate between completely random sequences and paths to randomly chosen transitions. Afterwards, a subset of the generated test cases is selected and executed. The mutation-based technique selects test cases based on mutation coverage, where mutants model potential successor hypotheses. The transition-coverage-based technique selects test cases with the goal of covering all hypothesis transitions.

Configuration of Testing Techniques. We apply the same configuration of every testing technique for all considered models. The configurations have been chosen to enable reliable learning of systems with up to approximately 50 states. For instance, we configured random-words-based testing such that all generated test cases have a length between 10 and 50. However, as we demonstrate in Section 6, the parameter settings also support learning of slightly larger system models. The parameter configurations are as follows.
• random walks: test stop probability: 1/30. This setting ensures that the expected length of random walks is the same as that of random words.
• random Wp-method: we set the minimal length of the middle sequences in test cases to 0 and the expected length to 4.
• transition coverage: maximum test-case length: 50, maximum length of random sequences: 4, retry and stop probability for test-case generation: p_retry = 29/30 and p_stop = 1/30, respectively. For more information on the parameters, we refer to our previous work [7]. Note that we have chosen p_stop to be the same value as the stop probability of random walks.
• mutation: we used the same test-case generation settings as for transition coverage. For test-case selection, we generated mutants with distinguishing sequences of length two and applied mutant sampling such that at most 10,000 mutants are considered: we applied the redmin mutant sampling strategy, followed by fraction sampling with r = 1. Finally, we sampled 10,000 of the remaining mutants, unless there were fewer than 10,000 mutants remaining after fraction sampling.
The only parameter of the deterministic algorithms, the W-method and the partial W-method, is the depth parameter. For transition coverage and mutation, we generated min(100,000, n_sel · 100) test cases and selected n_sel of these test cases based on coverage. In the remainder of this article, we write testing techniques in italics.
Search for Required Testing Budget. While learning with deterministic conformance testing, we increase the depth parameter linearly until learning succeeds. In the case of randomised testing techniques, we control the number of test cases that are executed during each individual equivalence query to find a counterexample to equivalence. We apply a binary search to find the minimum number of test cases required to reliably learn correct models, i.e., to learn correctly in ten repeated learning runs.
In our analysis, we consider the testing budget in terms of test steps. This quantity is more difficult to control uniformly across the different testing techniques, but it is clearly correlated with the number of test cases. For this reason, we deem the search appropriate. The exact relation between the number of test cases and test steps depends on the applied test-case generation algorithm.
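The binary search over test-case budgets can be sketched as follows; the oracle `learns_reliably` is a hypothetical stand-in for running ten repeated learning runs and checking that all of them produce the correct model, and the search assumes that reliability is monotone in the budget.

```python
def min_reliable_budget(learns_reliably, lo=1, hi=1 << 20):
    """Binary search for the smallest number of test cases per
    equivalence query such that learning is reliable.  Assumes
    monotonicity: a larger budget never hurts reliability."""
    if not learns_reliably(hi):
        raise ValueError("even the maximum budget does not learn reliably")
    while lo < hi:
        mid = (lo + hi) // 2
        if learns_reliably(mid):
            hi = mid          # mid suffices; try smaller budgets
        else:
            lo = mid + 1      # mid fails; a larger budget is needed
    return lo

# Stand-in oracle: pretend learning becomes reliable at 1337 test cases.
budget = min_reliable_budget(lambda n: n >= 1337)
```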

Benchmark Models
We consider a subset of the benchmark models from the automata-learning benchmark collection of the Radboud University Nijmegen [44]. In particular, we use all six TCP models, including both server and client models of the TCP stacks of Ubuntu, Windows, and BSD, learned by Fiterău-Broştean et al. [19]. We consider all 32 MQTT models, created in our previous work on learning-based testing of MQTT [60]. We also consider a simple coffee machine that is similar to a model used by Steffen et al. [58]. Finally, we consider 114 further benchmark models, among which are GnuTLS, the ABP protocol, OpenSSL, and more. The exact list of all benchmark models can be found in Reference [66]. We have chosen this selection to cover system models of different categories, which are defined below.
Estimating Model Complexity. On the topic of quantifying the complexity of learning and testing, Meinke [40] has shown that benchmarking the performance of learning algorithms should not be based on state-space size alone. He shows that finding distinguishing sequences can be a dominant factor in the complexity of learning automata with hard-to-distinguish states. We extend these observations and present a concrete hardness score that takes into account learning and testing complexity to determine the overall learning hardness of an automaton.
The difficulty in learning and testing an automaton is primarily composed of two different factors: reachability, i.e., the difficulty in navigating to a certain state from the initial state, and the distinguishing of states. Reachability mostly depends on the size of an automaton and the length of its prefixes, which are the shortest transfer sequences from the initial state to all other states, while distinguishing states depends on the length of distinguishing sequences. A distinguishing sequence α ∈ I* is able to distinguish two states s, s' if the output sequences produced in response to α from s and s' differ. For a given Mealy machine M, let n be the number of states |Q|, k the size of its input alphabet |I|, m the approximation of the longest counterexample, Prefixes the set of all prefixes, and let W be the characterization set as described in Reference [63], which contains input sequences such that every pair of states may be distinguished with at least one of them. We approximate the length of the longest counterexample with the length of the longest prefix in Prefixes plus the length of the longest distinguishing sequence from the W-set, as seen in Equation (2). Alternatively, m could be approximated by the number of states n [10], but this would be less accurate. We further define p_max to be the longest prefix out of all prefixes, so that |p_max| is the length of the longest prefix sequence, and w_max and |w_max| to be the longest distinguishing sequence of the W-set and its length, respectively. The hardness score is then defined as

hardness score = learn-hardness / test-chance.    (5)
The learn-hardness depicted in Equation (3) approximates the worst-case complexity in terms of the number of steps required to learn the model M, i.e., the total number of symbols contained in all membership queries, which is equivalent to the symbol complexity given by Isberner et al. [29] for the TTT algorithm. This worst-case symbol complexity is also shared by the RS algorithm.
The test-chance shown in Equation (4) represents the chance of randomly sampling the longest counterexample by first choosing the longest prefix, i.e., reaching the hardest-to-reach state, and then choosing the correct sequence of inputs of length |w_max|, i.e., the longest distinguishing sequence, at random. The smaller this chance is, the less likely we are to find the hardest counterexample through random sampling, which in turn makes testing the automaton harder.
Finally, the hardness score in Equation (5) is the learn-hardness divided by the test-chance. The larger the learn-hardness and the smaller the test-chance, the harder an automaton will be to learn and test, respectively, and the higher the hardness score will be. Most of the time, the test-chance will be the dominant factor and determine the order of magnitude of the hardness score, which is consistent with most active learning algorithms having polynomial runtime while testing is more difficult. In fact, complete conformance testing without additional knowledge, such as the bound on the number of system states assumed by the W-method [15, 63], is not possible, as testing is inherently incomplete.
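To illustrate the shape of such a score, the sketch below computes learn-hardness divided by test-chance. The concrete stand-ins used here, a learn-hardness of k·n²·m and a test-chance of (1/k)^m, are our own simplifying assumptions for illustration; the article's Equations (3) and (4) give the exact forms.

```python
def hardness_score(n, k, len_p_max, len_w_max, learn_hardness=None):
    """Illustrative hardness score: learn-hardness divided by
    test-chance (Equation (5)).  The stand-in learn-hardness k*n^2*m
    and test-chance (1/k)**m below are simplifying assumptions."""
    m = len_p_max + len_w_max          # Equation (2): longest prefix
                                       # plus longest distinguisher
    if learn_hardness is None:
        learn_hardness = k * n * n * m # hypothetical stand-in
    test_chance = (1.0 / k) ** m       # chance of randomly guessing
                                       # the longest counterexample
    return learn_hardness / test_chance

easy = hardness_score(n=9, k=4, len_p_max=3, len_w_max=2)
hard = hardness_score(n=66, k=4, len_p_max=10, len_w_max=7)
```

Even with these stand-ins, the score reproduces the qualitative behaviour described above: a single additional symbol in the longest distinguishing sequence multiplies the score by k, so the test-chance dominates the order of magnitude.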
Categories. Certain behavioural aspects of communication-protocol models may favour a particular learner-tester combination, while other aspects may favour different combinations. For this reason, we grouped the benchmark models into categories based on the following properties:
• small: a model is small if it has less than or equal to 15 states
• sink-state: a model satisfies the property sink-state if there exists a (sink) state q such that all outgoing transitions from q reach q
• strongly-connected: a model satisfies the property strongly-connected if its underlying directed graph is strongly connected, i.e., for each ordered pair of nodes there exists a directed path between these nodes
The above categories have been chosen with common application scenarios in mind. Given a concrete application scenario, learned models can often be expected to have certain properties. For instance, we may want to learn a behavioural model capturing a single session of an application protocol. In this case, learned models are likely to have a sink state that is reached after closing a session. This holds true for many TLS models [18] in the benchmark set [44]. On the contrary, if restarting of sessions is allowed during learning, then learned models can be expected to be strongly connected. We can observe this behaviour in most MQTT models [60] in the benchmark set. The size of models depends on the abstraction: harsh abstraction leads to small models and is often applied when testing is expensive. Hence, such assumptions on model categories are reasonable and do not require sacrificing our black-box view of systems. Therefore, we have, for instance, examined which learner-tester combinations perform best for small models that have a sink state.
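The two structural properties can be checked mechanically on a model's transition function. A minimal sketch, where the dictionary-based encoding of the transition function is our own illustration rather than the article's implementation:

```python
from collections import deque

def has_sink_state(delta):
    """A (sink) state q has all outgoing transitions leading back to q."""
    return any(all(target == q for target in delta[q].values())
               for q in delta)

def is_strongly_connected(delta):
    """The underlying graph is strongly connected iff every state is
    reachable from an arbitrary start state and vice versa."""
    def reachable(start, successors):
        seen, todo = {start}, deque([start])
        while todo:
            q = todo.popleft()
            for t in successors(q):
                if t not in seen:
                    seen.add(t)
                    todo.append(t)
        return seen

    states = set(delta)
    start = next(iter(states))
    forward = reachable(start, lambda q: delta[q].values())
    backward = reachable(start, lambda q: [p for p in states
                                           if q in delta[p].values()])
    return forward == states and backward == states

# Toy 3-state machine over inputs {a, b}; state 2 is a sink.
with_sink = {0: {'a': 1, 'b': 2}, 1: {'a': 0, 'b': 2}, 2: {'a': 2, 'b': 2}}
ring = {0: {'a': 1}, 1: {'a': 0}}
```

Note that the two properties are mutually exclusive for models with more than one state: a reachable sink state immediately breaks strong connectivity.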
Our hardness measure lends itself to comparing different models, such as those from the benchmark set [44]. Hence, we have chosen the hardness ranges given above based on the distribution of hardness values computed for the benchmark models.

PASSIVE TO ACTIVE LEARNING
In this section, we present how to turn RPNI into an active automata learning algorithm. For this purpose, we follow an iterative approach as in conventional active automata learning algorithms, where we learn hypotheses via RPNI and perform equivalence queries by testing the intermediate hypotheses. In contrast to algorithms such as L*, we do not perform a systematic exploration using membership queries. This kind of approach to making a passive algorithm active has been proposed by Walkinshaw et al. [64] and has since been applied in other contexts, such as SMT-based passive learning [55], state-merging-based stochastic learning [8], and learning of timed automata via genetic programming [6]. An even earlier approach of extending passive learning with queries was proposed by Grinchtein et al. [25]. Algorithm 1 formalizes our approach to active learning with RPNI. The algorithm employs three auxiliary functions. The function rpni takes a set of traces and returns an automaton learned with RPNI from the traces. The function confTest applies a conformance-testing algorithm. That is, it performs an equivalence query on a hypothesis, returning either an input sequence that is a counterexample to equivalence between SUL and hypothesis or None. Finally, membershipQuery executes a single test case, which is a sequence of inputs, on the SUL, returning the sequence of outputs produced by the SUL. The algorithm takes a parameter el, called extension length, as input that controls the initialisation and counterexample processing. It starts by initialising the trace data, creating all input sequences of length el and performing membership queries with them (function initialize called in Line 1). After that, we have some initial knowledge about the SUL's behaviour and learn a first hypothesis in Line 3, which we test in Line 4. If we do not receive a counterexample, then we return the learned hypothesis in Line 5.
Otherwise, we process the counterexample by creating and executing additional test cases based on it. We create a test case for every counterexample prefix concatenated with every input sequence of length el. In the remainder of this article, we refer to el also as the counterexample extension length and we refer to Algorithm 1 as active RPNI(el).
The counterexample processing serves two purposes. First, it explores the neighbourhood of counterexamples to potentially reduce the number of expensive equivalence queries, which are the only other source of data generation. Second, combining every counterexample prefix with every input, by choosing el ≥ 1, ensures that RPNI creates input-enabled models. This property is required by all testing algorithms that we apply, except for random words and random walks. Variations of the W-method actually require minimality of automata, which is not satisfied by intermediate hypotheses learned by RPNI. We leave minimisation of learned hypotheses implicit, since it does not require additional tests.
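The overall loop of Algorithm 1 can be sketched as follows. The stand-in components at the bottom (a trivial "SUL", a fake conformance tester, and a `rpni` stub that simply returns the trace data) are our own illustrations so the sketch is runnable; they are not a real RPNI implementation.

```python
from itertools import product

def initialize(inputs, el, membership_query):
    """Seed the trace data with all input sequences of length el."""
    return {seq: membership_query(seq)
            for seq in product(inputs, repeat=el)}

def active_rpni(inputs, el, rpni, conf_test, membership_query):
    """Sketch of Algorithm 1: iterate passive RPNI learning with a
    conformance tester serving as the equivalence oracle."""
    data = initialize(inputs, el, membership_query)
    while True:
        hyp = rpni(data)                 # learn hypothesis from traces
        cex = conf_test(hyp)             # equivalence query via testing
        if cex is None:
            return hyp
        # Counterexample processing: extend every prefix of the
        # counterexample with every input sequence of length el.
        for i in range(1, len(cex) + 1):
            for ext in product(inputs, repeat=el):
                test = cex[:i] + ext
                data[test] = membership_query(test)

# Stand-in components for a demonstration run: the "SUL" maps a to x
# and b to y; the fake tester reports one counterexample, then None.
sul = lambda seq: tuple('x' if s == 'a' else 'y' for s in seq)
calls = {'n': 0}
def fake_conf_test(hyp):
    calls['n'] += 1
    return ('a', 'b') if calls['n'] == 1 else None

hyp = active_rpni(['a', 'b'], 1, lambda d: dict(d), fake_conf_test, sul)
```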

EXPERIMENTAL RESULTS
In this section, we present results from the benchmarking experiments that we performed. We structure our presentation into three parts. First, we present results and selected findings from experiments with active automata learning algorithms that use preset sequences to distinguish states. In comparison to the corresponding part in the conference version of this article [9], we extended the results from the original 39 to a total of 153 benchmark models. In the second part, we investigate the performance of ADT, that is, active automata learning with adaptive distinguishing sequences. There, we use the best-performing algorithms from the first part as a benchmark for comparison. Finally, we present and analyse the results from learning experiments with the active version of RPNI presented in Section 5.
Performance Measures and Presentation. We present selected results from our experiments in the following, focusing on the number of test steps required for both equivalence queries and membership queries. In particular, we consider the maximum and mean number of test steps required to learn reliably. Due to the large number of learning experiments, we present aggregated results for learner-tester combinations in (1) cactus plots and (2) bar plots. Additional information and the complete results can be found in the accompanying supplementary material [66].
The cactus plots show how many experiments can be finished successfully, such that learning is reliable, given a maximum number of test steps. The bar plots show two different scores, s1 and s2, computed for the learner-tester combinations lt. The concrete score values are not important, but they allow for comparisons, where a lower value means better performance. The scores are given by

s1(lt) = Σ_{b ∈ B} meanSteps(lt, b)    and    s2(lt) = Σ_{b ∈ B} meanSteps(lt, b) / max_{lt' ∈ LT} meanSteps(lt', b),

where B is the set of considered benchmark models, LT is the set of all considered learner-tester combinations, and meanSteps(lt, b) returns the mean number of steps required to reliably learn the model of benchmark b with the combination lt. The first score s1(lt) simply sums up the average number of test steps required in all experiments, whereas s2(lt) is normalised through dividing by the worst-performing combination on each benchmark. Hence, s1 allows us to analyse which combinations perform best when learning all considered benchmark models consecutively. This is an objective measure under the assumption that test steps require the same amount of time in every benchmark experiment. The normalised score s2 accounts for the large variation in terms of model complexity across the different benchmarks. Normalisation ensures that individual performance outliers do not severely affect the overall score of a learner-tester combination. As information about outliers is useful, it is represented in the cactus plots. When using these two scores, it is important to have valid results for every learner-tester combination lt on every b ∈ B, as otherwise we would sum over different numbers of benchmark models, which would skew the results in favour of combinations that were run on a lower number of benchmark models. However, some combinations could not successfully learn some benchmark models due to performance, memory, or implementation constraints. Therefore, when using the s2 score, we penalise any missing experiment, which would normally add a value between 0 and 1, with a value of 1, the worst possible result per experiment.
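The two scores, including the missing-experiment penalty, can be computed as follows; the data at the bottom is a hypothetical two-combination example, not results from the article.

```python
def s1(lt, benchmarks, mean_steps):
    """Sum of mean test steps over all benchmarks (lower is better)."""
    return sum(mean_steps[lt][b] for b in benchmarks)

def s2(lt, benchmarks, mean_steps, combinations):
    """Normalised score: each benchmark's mean steps divided by the
    worst-performing combination on that benchmark; a missing result
    is penalised with the worst possible value, 1."""
    total = 0.0
    for b in benchmarks:
        worst = max(mean_steps[c][b] for c in combinations
                    if b in mean_steps[c])
        total += (mean_steps[lt][b] / worst
                  if b in mean_steps[lt] else 1.0)
    return total

# Two hypothetical combinations on two benchmarks; 'slow' has no
# result for benchmark b2 and is penalised there.
mean_steps = {'fast': {'b1': 100.0, 'b2': 5000.0},
              'slow': {'b1': 400.0}}
lts = ['fast', 'slow']
```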
Benchmarking Set Missing Models. As already mentioned, some of the 153 benchmark models could not be successfully learned by some of the learner-tester combinations due to performance, memory, or implementation limitations. Some of the large models would have required more than the 12 GB of RAM available in our test setup (see Section 4) and could therefore not be learned. Due to this issue, mutation learned on average 125, transition coverage learned 143, and the random Wp-method in combination with ADT learned 150 of the 153 tested benchmark models.
Additionally, we excluded random walks and random words for 37 of the harder models due to the infeasible runtime of these testing algorithms with our setup.
Finally, we attempted to learn one very large model, the "esm-manual-controller-v2" with 3410 states, which would have required a bound on the maximum number of tests greater than the range of Java integers in our setup and in the LearnLib implementation of the random Wp-method. This model is therefore not included in the final 153 benchmark models.
Altogether, we performed 116 learning experiments with each of the 35 learner-tester combinations. Additionally, we performed 37 learning experiments with a selection of learner-tester combinations, for a total of 153.
Benchmarking Set Hardness. Figure 2 gives an overview of the sizes and hardness of the entire benchmarking set in the form of a histogram, with the groups of hardness defined as in Section 4.2. We also show the correlation of the test steps of the TTT-random Wp-method combination with the hardness score in Figure 3. Unsurprisingly, small models are mostly easy and large models are mostly hard; however, there are some interesting exceptions. For example, the hardest model in the benchmarking set according to our hardness score is the "BitVise" automaton with a hardness of 2.6 × 10^15, while only having 66 states. This high score is due to this automaton having the longest distinguishing sequence, of length 7, out of the entire benchmarking set. In the combination of TTT and random Wp-method, "BitVise" was the fourth-worst-performing automaton with an average number of test steps of over 8 million.
Another example is the "river.flat_0_10" benchmark model, which, with a hardness score of 1.3 × 10^8 and only 9 states, is among the smallest yet hardest models, while the "sip.flat_0_8" automaton, with a hardness score of 5.4 × 10^11 and 200 states, is among the largest medium-hard automata.
In general, the hardness score seems to represent the difficulty of learning an automaton more accurately than state-space size alone would. The Spearman correlation [57] between the average number of test steps for the TTT/random Wp-method combination and the hardness scores for the entire benchmark set is 0.878, a strong correlation, while the correlation of test steps and automaton size is only 0.8. This correlation can be seen in Figure 3, where the number of test steps is shown with the corresponding hardness score of each model.
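Spearman's rank correlation is the Pearson correlation of the ranks of the two samples; a small self-contained sketch (the data values are toy stand-ins for per-model hardness scores and mean test steps, not the article's measurements, and the implementation ignores ties for simplicity):

```python
def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks
    (assuming no ties, which suffices for this illustration)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy stand-ins: hardness and test steps are monotonically related
# here, so the rank correlation is perfect.
hardness = [1e4, 1e6, 1.3e8, 5.4e11, 2.6e15]
steps = [900, 4000, 60000, 2000000, 8000000]
rho = spearman(hardness, steps)
```

Because Spearman's correlation only compares ranks, it is robust against the many orders of magnitude spanned by the hardness scores, which would distort a plain Pearson correlation.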

Active Algorithms with Preset Distinguishing Sequences
6.1.1 Overview. First, we want to provide a rough overview. Figure 4 shows the score s1(lt) for each learner-tester combination, computed over the 116 experiments that could be learned by all combinations. Due to large variations in the required test steps, it uses a logarithmic scale. Figure 5 shows the normalised score s2(lt). Similar to observations in previous work [7], we see that mutation, transition coverage, and random Wp perform well in comparison to other techniques. In Figure 4, we can observe that the relative gap between mutation and the worst-performing techniques is very large. This is caused by a few outliers. In particular, the TCP server models required a very large number of test steps for random walks and random words to learn reliably. For this reason, we see a smaller gap between those test techniques and mutation in Figure 5, because s2 is less affected by outliers.
Furthermore, we see that the W-method indeed generally performs worse than the partial W-method. Random words and random walks perform similarly well. Figure 5 shows that, using the same testing algorithm, KV and L* either perform similarly efficiently or KV performs worse. For these reasons and to ease readability, we will ignore certain combinations in the following. In the discussion of findings from the experiments with learners with preset distinguishing sequences, we will not show performance plots for combinations involving the W-method and random-walks-based testing. We will also not analyse the performance of the KV algorithm further, as it performs similarly to or worse than L*.
Figure 6 shows all 153 experiments, including the 37 that could not be learned by some combinations, with missing experiments penalised as explained above. As can be seen, random words and random walks, which are missing 37 experiments each, got considerably worse compared to the Wp-method. Furthermore, mutation and, to a lesser extent, transition coverage got penalised due to not being able to learn all benchmark models. We see that the random Wp-method achieves the best scores, due to being able to learn all benchmark models and faring considerably better than the worst-placed W-method. We will also see that the random Wp-method performs well on hard benchmark models, as most of the remaining 37 models are in this category. For the remaining learner-tester combinations, Figure 7 shows a cactus plot describing how many learning experiments can reliably be completed with a limited number of test steps. For instance, with RS-mutation we are able to learn about 75 models with at most approximately 10,000 test steps, whereas L*-random words requires about 100,000 test steps to learn the same number of models. We see a steep increase in the required test steps for random-words-based testing to learn about half of the given 153 models. This explains the discrepancy between the s1 score and the s2 score of random-words-based testing. It is interesting to note that transition coverage performs excellently for about 60 experiments, mutation is then the best-performing testing algorithm for about 110 of our experiments, and random Wp is the overall best-performing testing algorithm over the entire benchmark set. Additionally, L* combinations perform worst overall, while random words performs worst among the testing algorithms for the entire set, although it is better for some of the smaller experiments.

Selected Findings. Next, we discuss a few selected findings related to features of the examined techniques and benchmark categories.
Counterexample Processing. In Figures 4 and 5, we see that mutation combined with RS and mutation combined with TTT perform best overall. In contrast to that, mutation combined with KV and mutation combined with L* perform substantially worse, whereas random Wp shows uniform performance across combinations with different learning algorithms. Similar observations as for mutation can be made for transition coverage.
This can be explained by considering the counterexample-processing techniques of the different learning algorithms. RS processes counterexamples by extracting distinguishing suffixes [51], like TTT, which also performs additional processing steps [29]. This reduces the length and number of sequences that are added to the learning data structures. L* and KV do not apply such techniques; therefore, the performance of these learning algorithms suffers from long counterexamples. We have chosen the parameters for mutation conservatively to create long test cases, which leads to long counterexamples, explaining our observations. In contrast to this, random Wp generates much shorter test cases. Therefore, we see uniform performance in combination with different learning algorithms. Hence, mutation and transition coverage should be combined with either RS or TTT. In such combinations, mutation-based testing performs efficient equivalence queries, while sophisticated counterexample processing ensures that a low number of short membership queries is performed to process each counterexample. Comparing RS and TTT combined with mutation, there is no clear winner; both combinations performed similarly well in our experiments.

Small Models with Sink State. We evaluated the learner-tester combinations on 33 small models that also have at least one sink state. Small models may result from harsh abstraction. Sink states may be created if learning focuses on individual sessions in a communication protocol, where the sink state is reached upon session termination. Hence, this is an important class of systems that we can identify prior to learning. Therefore, it makes sense to analyse which active automata learning configurations work well in such scenarios.
Figures 8 and 9 show scores computed for this kind of models. The non-normalised score s1 shows that transition-coverage-based testing may be very inefficient for such models. In particular, the combinations with RS and TTT are the two worst-performing with respect to s1, excluding random words. However, their normalised score s2 is in a similar range as the s2 score of random-words-based testing. Interestingly, there is a large discrepancy between the s1 and s2 scores of the combination of L* with random words when compared to the combinations with RS and TTT. Overall, all three combinations perform poorly according to s1, but the s2 score, which is computed individually for each benchmark model, shows favourable results for the combination with L*. Thus, there is a number of experiments where this combination works well, but random words performs very poorly regardless of the learner when considering all combinations. The cactus plot shown in Figure 10 demonstrates that this is indeed the case. There is a steep increase in the test steps required to reliably learn in 17 or more experiments. On the one hand, this means that 17 benchmark models seem to be difficult to learn with random words and L*. On the other hand, in cases where we are constrained to applying random testing and we expect to learn models with sink states, random words may be best combined with L*, as this combination performs well for about half the benchmarks.
Actually, we can make similar observations for other combinations with L*: For example, the combination with mutation performs best overall according to Figure 10. The cause for this behaviour is likely the exploration enforced by L*'s counterexample processing. It adds every prefix of a counterexample c to its data structure, the observation table, thus enforcing an exploration of the state space close to what is covered by c. This kind of thorough exploration seems to be beneficial in models with sink states, which often have a DAG-like structure.
When comparing Figures 8 and 9, we can see that transition coverage seems to perform very well for some examples (low s2 score), but poorly for others, resulting in a large number of steps overall (high s1 score). This is most apparent for the combinations with RS and TTT. In Figure 10, we can actually see an increase in the required test steps at about the 17th benchmark model. Hence, the remaining models seem more difficult to learn with transition-coverage-based testing. We analysed one of these models in more detail to determine the reason for the poor performance of transition coverage. It is a coffee-machine model similar to the model used as an illustrative example by Steffen et al. [58]. Figure 11 shows the corresponding Mealy machine. Two properties of the coffee machine cause the poor performance of transition-coverage-based testing. First, many input sequences reach the sink state q5 that only produces error outputs. Second, other states require very specific input sequences. In experiments, we observed that learning frequently produced incorrect models with five states that did not include q1 or q2. The transition-coverage heuristic does not help to detect these states. In fact, it is even detrimental. To reach q1 or q2 from a state other than q0, we need to reach the initial state q0 first. Consequently, covering any known hypothesis transition other than the water (pod) transition in q0 leads away from reaching and detecting q2 (q1). Moreover, the transition-coverage heuristic generates very long test cases. For this reason, most suffixes of these test cases merely execute the self-loop transitions in q5, because the probability of reaching q5 is high. This worsens the performance of transition coverage even more.
It is interesting to note that mutation-based conformance testing performs well on the coffee machine, although it applies the same test-case generation strategy as transition coverage. In contrast to transition coverage, mutation applies mutation-coverage-based test-case selection. Hence, this form of test-case selection is able to drastically improve performance, as can be seen in Figure 10. This can be explained by considering the same situation as outlined above. Suppose that an intermediate hypothesis with five states has been learned. In this scenario, the true model is a mutant of the hypothesis that can be generated through the used split-state mutation [7]. By covering that mutant, it is possible to detect the last remaining state and learn the true model.

Large Models. Finally, we examine the learning performance on large models. In our classification, models are large if they have more than 15 states. Our benchmark set includes 58 such models. Figure 12 shows the s2 score with a penalty for models that could not be successfully learned. Due to the substantial number of such missing models, we do not show the s1 score and the unpenalised s2 score for large models. The results are better visualised in Figure 13, which shows the corresponding cactus plots.
We can observe that the plots are missing some of the 58 models in Figure 13. These are the models that could not be successfully learned, for the variety of reasons explained above; see the paragraph on missing benchmark models. Of the given combinations, random words performs worst overall. In fact, we decided to learn only 21 of the 58 large models with random words, due to the dramatic explosion in runtime that every additional experiment required. This rapid increase in test steps only happened at a certain point, as about 10 of the models could be learned quite efficiently with random words. Additionally, the cactus plot shows the average over ten runs, but random words also had the highest variability, i.e., the highest standard deviation, of all combinations, which makes it even more unreliable for large models. The mutation combinations performed better, especially for some of the smaller experiments, but 28 models could not be learned due to memory constraints. Transition coverage also failed to learn some large models, but only ran into the memory constraints for 8 of the largest models. We can observe that the combinations RS-mutation and RS-transition coverage perform very well for large models, as long as the memory constraints allow the algorithm to successfully learn said models. This is in line with findings from our previous work on the mutation-based testing technique [7]. There, we focused on models with up to approximately 50 states, as mutation analysis is generally costly. Finally, the Wp-method as well as the random Wp-method successfully learned all 58 models. Of these two, the random Wp-method performed better in every metric. It consistently required fewer average test steps with every single learning algorithm than the corresponding Wp-method combinations, and it seems well suited for large models overall. Please note that, while the combinations with the Wp-method successfully learned all models, they required a large number of test steps; see Figure 13.
Medium-hard Models. Finally, we show the results for the 32 medium-hard models in Figure 14. This classification uses our hardness measure explained in Section 4.2: the medium-hard models all have hardness scores between 10^8 and 10^12. We start to see difficulties for some of the combinations in this range of hardness values. For one, we see that mutation learns 19 models until the memory requirements prevent the learning of the rest. These missing models are mostly the larger models, as mutation depends more strongly on the state-space size, which determines the number of mutants that should be generated and which may require a lot of memory. However, transition coverage learns all of the medium-hard models and is in fact one of the best-performing combinations in this range. It runs into memory constraints much later than mutation, which can also be seen in Figure 13. The range of medium-hard models also seems to be the point where random-words-based conformance testing goes from a solid choice to the worst choice. We had to exclude random-words combinations for parts of the medium-hard and harder models, as the required runtime increased steeply.

Benchmarking Experiments with ADT
Next, we examine ADT, the active learning algorithm that uses adaptive distinguishing sequences. Since we established above that RS and TTT perform best in the considered experiments, we compare ADT only with these two learning algorithms. We further restrict our attention to testing with mutation, random Wp, the Wp-method, and random words. As transition coverage mostly performed similarly to or worse than mutation, we only consider the latter. We omit random words when considering large models, so as not to distort the graphics.
We start our comparison with an overview in Figures 15-17, showing the s_1 score as well as the s_2 score without and with penalty, respectively, for the mentioned learner-tester combinations. It can be seen that, aside from the s_1 score, where TTT-mutation beats ADT narrowly, ADT performs best across all three scores and testing algorithms. Additionally, mutation again performs best among the testing algorithms when it is not penalised, as it is in Figure 17, where the entire benchmark set was used. In Figures 15 and 16, we only show the set of 124 benchmark models that all three algorithms could successfully learn. As in other cases, let us take a closer look at how the learners perform. Figure 18 shows how many models can be successfully learned when limiting the average number of test steps. We can see that performance is mostly governed by the applied testing algorithm. Although it is difficult to see in said plot, ADT performs better than the other learning algorithms in most cases. For instance, aside from the range between 40 and 75, where TTT-mutation performs better, ADT combined with mutation requires the lowest number of steps everywhere else, closely followed by TTT. ADT and TTT may be very efficient on small models due to a characteristic they share: both are also referred to as incremental learning algorithms [38], meaning that they attempt to create a larger number of hypotheses, where each hypothesis takes fewer membership queries to learn than in, for instance, L*. Such an approach in turn requires more equivalence queries, which pays off when equivalence queries are computationally cheap [38]. However, it may be detrimental when equivalence queries are expensive, as for the largest benchmark models. Restricting the analysis to only small models supports this reasoning: the normalised scores s_2 of ADT and TTT are lower than those of RS, conditional on the specific testing algorithm (we do not show a figure of this score for brevity).
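The trade-off between membership and equivalence queries described above can be made concrete with a schematic learning loop (the interfaces are hypothetical, not LearnLib's API): incremental learners pay for cheaper hypothesis refinement with additional equivalence queries.

```python
def learn(learner, eq_oracle):
    """Schematic active-learning loop. Incremental learners (TTT, ADT)
    produce many hypotheses, each refinement needing few membership
    queries but one more equivalence query; L*-style learners do the
    opposite. (Membership queries for the initial hypothesis are
    omitted for brevity.)"""
    mq_count = eq_count = 0
    hyp = learner.initial_hypothesis()
    while True:
        eq_count += 1
        cex = eq_oracle.find_counterexample(hyp)   # implemented via testing
        if cex is None:
            return hyp, mq_count, eq_count         # hypothesis deemed correct
        mq_count += learner.refine(hyp, cex)       # returns #membership queries used
```

When equivalence queries are implemented via conformance testing, each iteration of this loop incurs real test steps on the system, which is why cheap equivalence queries favour incremental learners.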
It should be noted that ADT supports various configuration options and that we used the standard configuration available in LearnLib. Hence, while ADT already showed very good performance in our experiments, it might benefit from settings and heuristics tailored toward an application domain. Frohme [22], for instance, notes that heuristics employed in ADT may reduce the number of system resets, i.e., the number of tests performed during learning. The rationale behind such a reduction is that system resets may be more expensive than test steps, on which we focused as a measure of performance. Given the potential relevance of reducing resets, Figure 19 plots the number of tests required for correct learning with ADT, TTT, and RS. While the number of tests depends heavily on the testing algorithm, we can see that ADT performs best over large parts, especially when paired with mutation.

Benchmarking Experiments with Active RPNI
In the following, we present our learning experiments with active RPNI. For performance reasons that we discuss below, we limited the experiments to the small benchmark models with up to 15 states, including models of MQTT brokers [60], TCP servers and clients [19], and the coffee machine [58] shown in Figure 11. We want to examine two performance-related aspects of learning.
First, we consider the influence of the employed test-case generation approach and the counterexample extension length el on learning performance. With that, we address considerations that are relevant when turning a passive learning algorithm into an active one. After that, we compare the number of steps required for correct learning with active RPNI to L* and TTT. We have chosen L*, since its counterexample processing is similar to how we process counterexamples in active RPNI. We have further chosen TTT for comparison, as it performs well in general and applies a sophisticated approach to counterexample processing. This analysis shall provide insights into the amount of data required to learn accurate models with passive approaches.

Testing Algorithm Choice and Counterexample Processing.
Figure 20 provides an overview of how well the different testing algorithms work with active RPNI(2), i.e., with extension length el = 2. The figure compares the testing algorithms based on the normalised score s_2 over all small models. In stark contrast to all results presented so far, the W-method and the Wp-method perform best. There are several reasons for that. First and most importantly, both required only a depth of one to learn all models correctly. In practical scenarios where the ground-truth model is unknown, more conservative depth settings would be appropriate. Therefore, the good performance in this particular case should be taken with a grain of salt. Second, both testing techniques create short test cases. This is beneficial due to the linear dependence of counterexample processing on counterexample length. A third potential reason for the good performance of the (partial) W-method is that testing with these techniques results in a systematic exploration of the state space. Such a systematic exploration is usually done via membership queries in active learning. Berg et al. [12] examined this correspondence between conformance testing and active automata learning. They note that conformance testing "systematically constructs" tests, whereas learning synthesises a model through a "systematic experimentation process". We argue that with systematic test-case generation, as performed by the W-method, we can implement systematic experimentation to some extent.
To get a clearer picture of the influence of the testing algorithms, we show a cactus plot with all testing algorithms in Figure 21. The plot describes how many models can be learned correctly with a limited number of average test steps. Here we can see that mutation and transition coverage perform almost as poorly as completely random testing. The aforementioned linear dependence on counterexample length causes this behaviour, as both mutation and transition coverage predominantly create long test cases. While performing worse for most models (see also Figure 20), random Wp performs similarly to the Wp-method when considering all models. We can conclude from this first batch of results that the exact approach used to turn a passive algorithm into an active one has a big influence on performance: in this benchmarking study, the specific test-case generation technique used to steer data generation significantly impacted performance.
In the remainder of the RPNI analysis, we focus on random Wp and random-words-based testing. On the one hand, random Wp showed acceptable performance while offering potentially better scalability to larger models than the W(p)-method. On the other hand, analysing active RPNI combined with random words can provide insights into how much data a passive algorithm requires. With random words, we do not change the data generation throughout learning, with the exception of counterexample processing.
Counterexample processing is exactly what we investigate next. Figure 22 shows a cactus plot comparing random Wp and random-words-based testing, using extension lengths from one to four when processing counterexamples. We can see that the extension length does influence the performance, but the choice of testing algorithm is more important. An excessive extension length of four expectedly results in bad performance. However, when considering the average steps required to learn all 28 models, the three other extension-length options all show similar behaviour. Performing best overall, random Wp with an extension length of el = 2 provides a good balance between efficiency and thoroughness in the processing of each counterexample.

Fig. 22. A cactus plot showing how many learning experiments involving small models can be completed successfully when running active RPNI with a limited number of test steps, comparing different counterexample extension lengths.
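One plausible realisation of such extension-based counterexample processing is sketched below, under the assumption that the extension length el means appending all input words of length up to el to the counterexample and querying the SUL for each extended word (the paper's exact procedure may differ; the `query` interface is hypothetical):

```python
from itertools import product

def extend_counterexample(cex, inputs, el, query):
    """Hypothetical counterexample processing for an active RPNI setting:
    extend the counterexample `cex` with every input word of length up
    to `el`, query the SUL for outputs, and return the resulting sample.
    `query` maps an input word to the SUL's output word (assumption)."""
    samples = {tuple(cex): query(cex)}        # the counterexample itself
    for length in range(1, el + 1):
        for ext in product(inputs, repeat=length):
            word = list(cex) + list(ext)
            samples[tuple(word)] = query(word)
    return samples
```

Under this reading, the cost of processing one counterexample grows with both its length (each query replays the full word) and exponentially with el, which is consistent with el = 4 performing poorly while el = 2 balances efficiency and thoroughness.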

Comparison to Active Automata Learning Algorithms.
Finally, we want to compare the performance of our active RPNI with conventional active automata learning. For an overview, Figure 23 shows bar charts of the unnormalised score s_1 for active RPNI(2), L*, and TTT, combined with random Wp and random words. Active RPNI combined with either testing algorithm performs worst. Thus, the first conclusion from our experiments is that the systematic exploration performed via membership queries clearly improves performance. Monitoring the learning process, we noticed that active RPNI usually required a large number of learning rounds and that intermediate hypotheses had up to approximately 100 states, even though we restricted our experiments to benchmark models whose minimal representation has at most 15 states. For this reason, we did not perform experiments with larger models.
For a more detailed analysis, we show cactus plots for the same learner-tester combinations in Figure 24. These plots also show that active RPNI performs considerably worse than L* or TTT. RPNI combined with random Wp requires, for a large part of the experiments, about an order of magnitude more test steps than TTT. Only when considering the steps required to learn all 28 models does the gap between RPNI and TTT decrease to a factor of five. The gap between RPNI and the active learning algorithms is even bigger when we apply random-words-based testing: active RPNI combined with random words requires almost 50 times as many test steps as TTT to learn all models. As noted before, random words does not change the test selection during learning. For this reason, active RPNI combined with random words simulates a case where we have no control over data generation. We can therefore conclude that the amount of data required for correct passive learning is potentially considerably larger than in applications of active learning.

SUMMARY
We examined the performance of 42 combinations of learning and conformance-testing algorithms in the context of learning Mealy machines of black-box systems. Our analysis covers active automata learning algorithms with preset distinguishing sequences, an active algorithm that uses adaptive distinguishing sequences, and a passive learning algorithm that we turned into an active learning technique by combining it with conformance testing.
Since the learning runtime in practical applications is usually dominated by the time required for interacting with systems, we generally quantify learning performance in terms of the test steps required for correct learning. Our experimental evaluation is based on 153 benchmark models, including models of implementations of TCP and MQTT. It focuses on learning and testing techniques available in LearnLib [30] and also includes two testing algorithms developed in our previous work. We presented measurement results and discussed selected insights with respect to overall learning performance and specific properties of systems. The results and insights may serve as guidance for practitioners seeking to apply active automata learning.
Figure 25 depicts a summary of our measurement results. It is an alternative visualisation of Figure 6, with the addition of ADT, in which we show the general performance of all combinations aside from RPNI. This heat map shows the performance of the combinations compared to each other, normalised per benchmark model, such that some of the results may appear skewed. For example, while the W-method may appear to be the worst possible choice, it should still be preferred over random walks and random words for very large or hard models. Additionally, random walks and random words had the most variability, i.e., the highest standard deviation, over the performed runs, which makes them even less reliable. Therefore, we have analysed different groups of experiments in detail and in different arrangements, which we present in Section 6.
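A sketch of how such a per-model-normalised score with a penalty for missing experiments could be computed (one plausible formulation for illustration only; the precise definition of s_2 is given earlier in the paper):

```python
def s2_score(steps, combo, models):
    """Illustrative per-model-normalised score with penalty (a plausible
    reading of the paper's s_2, smaller is better). `steps[(combo, model)]`
    holds the average test steps for that experiment, or None if it
    failed; each failure incurs a penalty of 1, the worst possible
    normalised value."""
    combos = {c for (c, _) in steps}
    total = 0.0
    for m in models:
        # normalise by the worst successful combination on this model
        worst = max(steps[(c, m)] for c in combos
                    if steps.get((c, m)) is not None)
        v = steps.get((combo, m))
        total += 1.0 if v is None else v / worst
    return total / len(models)
```

Normalising per model keeps easy and hard models on the same scale, but, as noted above, it also means a combination that is merely worst in a crowded field can look as bad as one that fails outright.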
If absolutely no assumptions can be made about the SUL, then random Wp may be a good choice of testing algorithm. It performed best overall, especially on hard models. However, we described a wide range of situations where some assumptions about state-space size or connectivity can be made in advance, due to the level of abstraction or the application scenario, respectively. In some of these cases, mutation and transition coverage may be a good choice, as we show in Figures 4 and 10. Additionally, we have seen that different learning algorithms succeed under certain assumptions. In general, ADT performs very well compared to the other learning algorithms, as can be seen in Section 6.2, followed by RS and TTT. However, L* performed best on small machines with a sink state, as seen in Figures 9 and 10.

CONCLUSION
Our results regarding the performance of active learning algorithms are generally in line with their asymptotic query complexity. The TTT algorithm [29] and L* extended with the improvements by Rivest and Schapire [51] have the same membership-query complexity, and they performed similarly well. It is interesting to note that the TTT algorithm generally performs more equivalence queries, but requires a similar number of test steps overall. Another interesting insight is that, despite a potentially higher query complexity, ADT performed similarly to the best-performing active learning algorithm with preset distinguishing sequences. Hence, ADT is a viable choice for learning not only in settings where equivalence queries are less expensive, like in model-checking-based implementations [38], but also when equivalence queries are implemented via testing.
Our measurements demonstrate that neither deterministic conformance testing nor pure random testing scales on the evaluated benchmark set. Deterministic conformance testing showed especially poor performance for large models; random-words-based testing, in particular, cannot reliably learn large models with a limited number of test steps. Hence, it is neither efficient to guarantee conformance up to some bound nor to test completely blind. Transition coverage showed weaknesses for small models with sink states, an important class of system models. However, transition coverage combined with RS performed very well for large models. In general, we have observed that the counterexample processing implemented by RS and TTT may have a large impact on efficiency. This is especially true if test cases are long, as is the case in transition-coverage-based testing.
The random Wp-method and mutation-based testing [7] performed well for all types of benchmarks. Both techniques benefit from learned hypotheses and add variability through randomisation. While random Wp showed uniform performance across different learners, mutation combined with RS performed best overall. Mutation-based testing requires a low number of test cases for equivalence queries, while the counterexample processing by RS keeps the number and length of membership queries low. However, while mutation performed very well, it was constrained by memory and was unable to learn a number of very large benchmark models. We conclude that mutation or random Wp combined with RS or TTT should be chosen to efficiently learn automata: mutation for smaller automata and random Wp for larger automata or if the state-space size is completely unknown.

Fig. 25. Performance of all combinations for all 153 benchmark models in the form of a heat map. The values given are the s_2 score with a penalty of 1 for each missing experiment (smaller score is better). In parentheses, we give the number of benchmark models that could be learned successfully by each combination.
The experiments with passive learning show that the systematic exploration performed via membership queries makes a substantial difference. The combination of RPNI and random testing suggests that correct passive learning requires substantially more data than active learning; in our experiments, the relative gap between RPNI and TTT was a factor of 50. Moreover, we have also shown that the applied conformance-testing techniques have a big impact on performance. For instance, mutation and transition coverage performed poorly in combination with RPNI. Hence, the characteristics of a passive learning algorithm need to be considered for applications in an active setting. This is especially relevant in cases where only passive learning algorithms exist for certain types of models.
In future work, we plan to investigate the performance impact of test-case selection strategies on the learning of other types of systems. For instance, we plan to examine the behaviour of nondeterministic and stochastic automata learning as implemented in AALpy [42,48,61]. Of interest is also the further study of these combinations in other contexts, such as on random automata or on benchmark sets outside of communication protocols. Furthermore, it would be interesting to see whether a state-minimisation algorithm applied to the intermediate hypotheses of the active RPNI algorithm might improve its results. Finally, another interesting avenue for future research would be a benchmarking study where we fix a limited testing budget and learn potentially incomplete models. In this context, we could compare algorithms in terms of the accuracy of their learned models through stochastic equivalence checking [40]. Such stochastic equivalence and statistical model checking approaches may also provide statistical guarantees on the learning outcome, as recently shown for recurrent neural networks [32], which we can take into account.

Fig. 2 .
Fig. 2. The number of benchmark models of a certain size and hardness (in colour) in the entire benchmark set of all 153 models.

Fig. 3 .
Fig. 3. The number of test steps of the TTT-random Wp-method combination compared to the hardness score of each model, for all 153 models of the benchmarking set. The benchmarking models are additionally labelled as small (green) and large (red).

Fig. 4 .
Fig. 4. The score s_1 computed over 116 experiments for all learner-tester combinations from the first group of learners, grouped by testing technique.

Fig. 5 .
Fig. 5. The score s_2 computed over 116 experiments for all learner-tester combinations from the first group of learners, grouped by testing technique.

Fig. 6 .
Fig. 6. The score s_2 computed over all 153 experiments with all learner-tester combinations from the first group of learners, penalising missing experiments with 1 each, grouped by testing technique.

Fig. 7 .
Fig. 7. A plot showing how many learning experiments can be completed successfully with a limited number of test steps.

Fig. 8 .
Fig. 8. The score s_1 computed for experiments involving small models with a sink state (33 models).

Fig. 9 .
Fig. 9. The score s_2 computed for experiments involving small models with a sink state (33 models).

Fig. 10 .
Fig. 10. A cactus plot showing how many learning experiments involving small models with a sink state can be completed successfully with limited test steps.

Fig. 11 .
Fig. 11. An input-complete Mealy-machine model of a coffee machine [58]. The transitions are depicted with input/output or alternatively, if multiple transitions with the same output and target state exist, as {input1, input2}/output.

Fig. 12 .
Fig. 12. The score s_2 with a penalty of 1 for missing models computed for all 58 experiments involving large models.

Fig. 13 .
Fig. 13. A cactus plot showing how many learning experiments involving large models can be completed successfully with a limited number of test steps (58 models).

Fig. 14 .
Fig. 14. A cactus plot showing how many learning experiments involving medium-hard models can be completed successfully with a limited number of test steps (32 models).

Fig. 15 .
Fig. 15. The score s_1 computed for experiments with the 124 benchmark models for which all combinations could successfully learn said models, performed with ADT, RS, and TTT.

Fig. 16 .
Fig. 16. The score s_2 computed for experiments with the 124 benchmark models for which all combinations could successfully learn said models, performed with ADT, RS, and TTT.

Fig. 17 .
Fig. 17. The score s_2 with a penalty of 1 for missing models computed for experiments with all 153 benchmark models performed with ADT, RS, and TTT.

Fig. 18 .
Fig. 18. A cactus plot comparing the performance of ADT, TTT, and RS combined with different test-case generation approaches on all 153 benchmark models.

Fig. 19 .
Fig. 19. A cactus plot comparing the number of tests required to learn the 124 benchmark models that all nine combinations can successfully learn.

Fig. 20 .
Fig. 20. The score s_2 computed for learning experiments with active RPNI combined with all of the testing algorithms.

Fig. 23 .
Fig. 23. The score s_1 computed for learning experiments with active RPNI, L*, and TTT, combined with two of the testing algorithms.

Fig. 24 .
Fig. 24. A cactus plot comparing the performance of RPNI(2), L*, and TTT when combined with random words and random Wp.

Table 1 .
Evaluated Learning and Testing Algorithms

• small: a model is small if it has at most 15 states
• large: a model is large if it has more than 15 states
• easy: a model is easy if it has a hardness score of less than 10^8
• medium-hard: a model is medium-hard if it has a hardness score of greater than or equal to 10^8 but less than 10^12
• hard: a model is hard if it has a hardness score of greater than or equal to 10^12