L*-Based Learning of Markov Decision Processes (Extended Version)

Automata learning techniques automatically generate system models from test observations. These techniques usually fall into two categories: passive and active. Passive learning uses a predetermined data set, e.g., system logs. In contrast, active learning actively queries the system under learning, which is considered more efficient. An influential active learning technique is Angluin's L* algorithm for regular languages, which inspired several generalisations from DFAs to other automata-based modelling formalisms. In this work, we study L*-based learning of deterministic Markov decision processes, first assuming an ideal setting with perfect information. Then, we relax this assumption and present a novel learning algorithm that collects information by sampling system traces via testing. Experiments with the implementation of our sampling-based algorithm suggest that it achieves better accuracy than state-of-the-art passive learning techniques with the same amount of test data. Unlike existing learning algorithms with predefined states, our algorithm learns the complete model structure including the states.

Passive learning algorithms take a given sample of system traces as input and generate models consistent with the sample. The quality and comprehensiveness of learned models therefore largely depend on the given sample. In contrast, active algorithms actively query the system under learning (SUL) to sample system traces. This makes it possible to steer the trace generation towards parts of the SUL's state space that have not been thoroughly covered, potentially finding yet unknown aspects of the SUL.
Many active automata learning algorithms are based on Angluin's L* algorithm [4]. It was originally proposed for learning deterministic finite automata (DFA) accepting regular languages and was later applied to learn models of reactive systems, by considering system traces to form regular languages [24]. L* has been extended to formalisms better suited for modelling reactive systems, such as Mealy machines [31,36] and extended finite-state machines [13]. Most L*-based work, however, targets deterministic models, with the exceptions of algorithms for non-deterministic Mealy machines [26] and non-deterministic input-output transition systems [43]. Both techniques are based on testing, but abstract away the observed frequency of events, thus they do not use all available information.
Here, we present an L*-based approach for learning models of stochastic systems with transitions that happen with some probability depending on non-deterministically chosen inputs. More concretely, we learn deterministic Markov decision processes (MDPs), like IoAlergia [29,30], a state-of-the-art passive learning algorithm. Such models are commonly used to model randomised distributed algorithms [9], e.g. in protocol verification [27,33]. We present two learning algorithms: the first takes an ideal view assuming perfect knowledge about the exact distribution of system traces. The second algorithm relaxes this assumption by sampling system traces to estimate their distribution. We refer to the former as the exact learning algorithm L*_MDP^e and to the latter as the sampling-based learning algorithm L*_MDP. We implemented L*_MDP and evaluated it by comparing it to IoAlergia [29,30]. Experiments showed favourable performance of L*_MDP, i.e. it produced more accurate models than IoAlergia given approximately the same amount of data. Apart from the empirical evaluation, we show that the model learned by L*_MDP converges in the limit to an MDP isomorphic to the canonical MDP representing the SUL. To the best of our knowledge, L*_MDP is the first L*-based learning algorithm for MDPs that can be implemented via testing. Our contributions span the algorithmic development of learning algorithms, their analysis with respect to convergence, and their implementation as well as evaluation.
This work is an extended version of the conference paper "L*-Based Learning of Markov Decision Processes" accepted for presentation at FM 2019, the 23rd International Symposium on Formal Methods in Porto, Portugal. It provides additional details on the implementation of L*_MDP, the convergence analysis of both learning algorithms and an extended evaluation.
The rest of this paper is structured as follows. We introduce notational conventions, preliminaries on MDPs and active automata learning in Sect. 2. Sect. 3 provides a characterisation of MDPs and presents the exact learning algorithm L*_MDP^e. Sect. 4 describes the sampling-based L*_MDP and analyses it with respect to convergence. Sect. 5 discusses the evaluation and in Sect. 6, we discuss related work. We provide a summary and concluding remarks in Sect. 7.

Preliminaries
Notation & Auxiliary Definitions. Let S be a set. We denote the concatenation of two sequences s and s′ in S* by s · s′, the length of a sequence s by |s| and the empty sequence by ε. We implicitly lift elements in S to sequences of length one. Sequence s is a prefix of s′ if there exists an s″ such that s · s″ = s′, denoted by s ≪ s′. The pairwise concatenation of sets of sequences A, B ⊆ S* is A · B = {a · b | a ∈ A, b ∈ B}. A set of sequences A ⊆ S* is prefix-closed iff for every a ∈ A, A also contains all prefixes of a. Suffixes and suffix-closedness are defined analogously. For a sequence s in S*, s[i] is the element at index i, with indexes starting at 1, s[≪ i] is the prefix of s with length i and prefixes(s) = {s′ | s′ ∈ S* : s′ ≪ s} is the set of all prefixes of s. Given a multiset S, we denote the multiplicity of x in S by S(x). Dist(S) denotes the set of probability distributions over S, i.e. for all µ : S → [0, 1] in Dist(S) we have Σ_{s∈S} µ(s) = 1.
In the remainder of this paper, distributions µ may be partial functions, in which case we implicitly set µ(e) = 0 if µ is not defined for e. For A ⊆ S, 1_A denotes the indicator function of A, i.e. 1_A(e) = 1 if e ∈ A and 1_A(e) = 0 otherwise. Hence, 1_{e} for e ∈ S is the probability distribution assigning probability 1 to e. In Sect. 4, we apply a pseudo-random function randSel taking a set S as input and returning a single element of the set, whereby the element is chosen according to a uniform distribution, i.e. ∀e ∈ S : P(randSel(S) = e) = 1/|S|. In addition to that, we use the function coinFlip(p) returning true with probability p and false otherwise.
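The two auxiliary functions can be sketched in Python as follows; `rand_sel` and `coin_flip` are illustrative names for the paper's randSel and coinFlip:

```python
import random

def rand_sel(s):
    """Uniform random selection from a finite set (the paper's randSel)."""
    return random.choice(sorted(s))  # sorting makes runs reproducible under a fixed seed

def coin_flip(p):
    """Return True with probability p and False otherwise (the paper's coinFlip)."""
    return random.random() < p

random.seed(42)
assert rand_sel({"a", "b", "c"}) in {"a", "b", "c"}
assert coin_flip(1.0) and not coin_flip(0.0)
```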

Markov Decision Processes.
Definition 1 (Markov decision process (MDP)). A labelled Markov decision process (MDP) is a tuple M = ⟨Q, Σ_I, Σ_O, q_0, δ, L⟩ where
- Q is a finite non-empty set of states,
- Σ_I and Σ_O are finite sets of input and output symbols respectively,
- q_0 ∈ Q is the initial state,
- δ : Q × Σ_I → Dist(Q) is the probabilistic transition function, and
- L : Q → Σ_O is the labelling function.
An MDP is deterministic if ∀q, q′, q″ ∈ Q, ∀i ∈ Σ_I : δ(q, i)(q′) > 0 ∧ δ(q, i)(q″) > 0 → q′ = q″ ∨ L(q′) ≠ L(q″). We learn deterministic labelled MDPs as learned by passive learning techniques like IoAlergia [30]. Such MDPs define at most one successor state for each source state and input-output pair.
In the following, we refer to these models uniformly as MDPs. We use ∆ : Q × Σ I × Σ O → Q ∪ {⊥} to compute successor states. The function is defined by ∆(q, i, o) = q ′ ∈ Q with L(q ′ ) = o and δ(q, i)(q ′ ) > 0 if there exists such a q ′ , otherwise ∆ returns ⊥. Fig. 1 shows an MDP model of a faulty coffee machine [3]. Outputs in curly braces label states and inputs with corresponding probabilities label edges. After providing the inputs coin and but, the coffee machine MDP produces the output coffee with probability 0.8, but with probability 0.2, it resets itself, producing the output init.
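A minimal sketch of such a deterministic MDP and the successor function ∆, encoding the coffee machine fragment described above (the beep label and the transitions out of the coffee state are illustrative assumptions):

```python
# Dict-based encoding of a deterministic labelled MDP and the successor
# function Delta(q, i, o); None stands for ⊥.
mdp = {
    "init": "q0",
    "L": {"q0": "init", "q1": "beep", "q2": "coffee"},
    "delta": {  # delta[q][i] is a distribution over successor states
        "q0": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
        "q1": {"coin": {"q1": 1.0}, "but": {"q2": 0.8, "q0": 0.2}},
        "q2": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
    },
}

def successor(m, q, i, o):
    """Delta(q, i, o): unique successor of q under input i with output o, or None."""
    for q2, p in m["delta"][q][i].items():
        if p > 0 and m["L"][q2] == o:
            return q2  # determinism guarantees at most one such q2
    return None

assert successor(mdp, "q1", "but", "coffee") == "q2"
assert successor(mdp, "q1", "but", "init") == "q0"
assert successor(mdp, "q0", "coin", "coffee") is None
```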
Execution. A path ρ through an MDP is an alternating sequence of states and inputs starting in the initial state q_0, i.e. ρ = q_0 · i_1 · q_1 · i_2 · q_2 · · · i_{n-1} · q_{n-1} · i_n · q_n. In each state q_k, the next input i_{k+1} is chosen non-deterministically and based on that, the next state q_{k+1} is chosen probabilistically according to δ(q_k, i_{k+1}). We denote the set of all paths of an MDP M by Path_M. The execution of an MDP is controlled by a so-called scheduler, resolving the non-deterministic choice of inputs. A scheduler as defined below specifies a distribution over the next input given the current execution path. The composition of an MDP M and a scheduler s induces a deterministic Markov chain, i.e. a fully probabilistic system allowing to define a probability measure over paths. In addition to M and s, we also need a probability distribution p_l ∈ Dist(N_0) over the path lengths. An MDP M, a scheduler s, and a path length probability distribution p_l induce a probability distribution P^{p_l}_{M,s} on finite paths in Path_M, defined by:

P^{p_l}_{M,s}(q_0 i_1 q_1 · · · i_n q_n) = p_l(n) · Π_{j=1}^{n} s(q_0 · · · i_{j-1} q_{j-1})(i_j) · δ(q_{j-1}, i_j)(q_j)   (1)

Sequences of Observations. During the execution of a finite path ρ, we observe a trace L(ρ) = t, i.e. an alternating sequence of inputs and outputs starting with an output, with t = o_0 i_1 o_1 · · · i_{n-1} o_{n-1} i_n o_n and L(q_i) = o_i. Since we consider deterministic MDPs, L is invertible, thus each trace in Σ_O × (Σ_I × Σ_O)* corresponds to at most one path, and P^{p_l}_{M,s} can be adapted to traces t by defining P^{p_l}_{M,s}(t) = P^{p_l}_{M,s}(ρ) if there is a path ρ with L(ρ) = t, and P^{p_l}_{M,s}(t) = 0 otherwise. We say that a trace t is observable if there exists a ρ with L(ρ) = t, thus there is a scheduler s and a p_l such that P^{p_l}_{M,s}(t) > 0. In a deterministic MDP M, each observable trace t uniquely defines a state of M reached by executing t from the initial state q_0.
We compute this state by δ*(t) = δ*(q_0, t), defined by δ*(q, L(q)) = q and δ*(q, t · i · o) = ∆(δ*(q, t), i, o), where ∆(⊥, i, o) = ⊥. If t is not observable, then there is no path ρ with t = L(ρ), denoted by δ*(t) = ⊥. We denote the last output o_n of a trace t = o_0 · · · i_n o_n by last(t).
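The computation of δ* can be sketched as follows, reusing the coffee machine fragment (its exact transition structure is an assumption); `None` is the sketch's stand-in for ⊥:

```python
L = {"q0": "init", "q1": "beep", "q2": "coffee"}
delta = {
    "q0": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
    "q1": {"coin": {"q1": 1.0}, "but": {"q2": 0.8, "q0": 0.2}},
    "q2": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
}

def delta_star(trace):
    """State reached by trace o0 i1 o1 ... in on, or None (⊥) if unobservable."""
    q = "q0" if trace and trace[0] == L["q0"] else None
    for j in range(1, len(trace), 2):          # consume (input, output) pairs
        if q is None:
            return None                        # extensions stay unobservable (Lemma 1)
        i, o = trace[j], trace[j + 1]
        q = next((q2 for q2, p in delta[q][i].items()
                  if p > 0 and L[q2] == o), None)
    return q

assert delta_star(["init"]) == "q0"
assert delta_star(["init", "coin", "beep", "but", "coffee"]) == "q2"
assert delta_star(["init", "coin", "coffee"]) is None      # unobservable trace
```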
We use three types of observation sequences with short-hand notations:
- Traces: abbreviated by TR = Σ_O × (Σ_I × Σ_O)*
- Test sequences: abbreviated by TS = TR × Σ_I
- Continuation sequences: abbreviated by CS = Σ_I × (Σ_O × Σ_I)*
These sequence types alternate between inputs and outputs, thus they are related among each other. In slight abuse of notation, we use A × B and A · B interchangeably for the remainder of this paper. Furthermore, we extend the sequence notations and the notion of prefixes to Σ_O, Σ_I, TR, TS and CS, e.g., test sequences and traces are related by TR = TS · Σ_O.
As noted, a trace in TR leads to a unique state of an MDP M. A test sequence s ∈ TS of length n + 1 consists of a trace t ∈ TR with n outputs and an input i ∈ Σ_I with s = t · i; thus executing test sequence s = t · i puts M into the state reached by t and tests M's reaction to i. Extending the notion of observability, we say that the test sequence s is observable if t is observable. A continuation sequence c ∈ CS begins and ends with an input, i.e. concatenating a trace t ∈ TR and c creates a test sequence t · c in TS. Informally, continuation sequences test M's reaction in response to multiple consecutive inputs.

Lemma 1. If trace t ∈ TR is not observable, then any t′ ∈ TR such that t ≪ t′ is not observable as well.
Lemma 1 follows directly from (1). For a non-observable t, we have ∀s, p_l : P^{p_l}_{M,s}(t) = 0 and extending t to create t′ only adds further factors to the product, which therefore remains zero. The same property holds for test sequences.
Active Automata Learning. We consider active automata learning in the minimally adequate teacher (MAT) framework [4], introduced by Angluin for the L * algorithm. It assumes the existence of a MAT, which is able to answer queries. L * learns a DFA representing an unknown regular language L over some alphabet A and therefore requires two types of queries: membership and equivalence queries. First, L * repeatedly selects strings in A * and checks if they are in L via membership queries. Once the algorithm has gained sufficient information, it forms a hypothesis DFA consistent with the membership query results. It then poses an equivalence query checking for equivalence between L and the language accepted by the hypothesis. The teacher responds either with yes signalling equivalence; or with a counterexample to equivalence, i.e. a string in the symmetric difference between L and the language accepted by the hypothesis. After processing a counterexample, L * starts a new round of learning, consisting of membership queries and a concluding equivalence query. Once an equivalence query returns yes, learning stops with the final hypothesis as output.
L * has been extended to learn models of reactive systems such as Mealy machines [36]. In practice, queries for learning models of black-box systems are usually implemented via testing [2]. Therefore, equivalence queries are generally only approximated as complete testing for black-box systems is impossible unless there is an upper bound on the number of system states. We cover the ideal setting in Sect. 3 by presenting an L * -based exact learning algorithm for MDPs. In Sect. 4, we discuss an implementation in a sampling-based setting that approximates queries by testing the SUL.

Exact Learning of MDPs
This section presents L*_MDP^e, an exact active learning algorithm for MDPs, which is the basis for the sampling-based algorithm presented in Sect. 4. In contrast to sampling, L*_MDP^e assumes the existence of a teacher with perfect knowledge about the SUL that is able to answer two types of queries: output distribution queries and equivalence queries. The former asks for the exact distribution of outputs following a test sequence in the SUL. The latter takes a hypothesis MDP as input and responds either with yes iff the hypothesis is observationally equivalent to the SUL, or with a counterexample to equivalence. A counterexample is a test sequence leading to different output distributions in hypothesis and SUL. First, we describe how we capture the semantics of MDPs.
Semantics of MDPs. We can interpret an MDP as a function M : TS → Dist(Σ_O) ∪ {⊥}, mapping test sequences s to output distributions, or to undefined behaviour ⊥ for non-observable s. This follows the interpretation of Mealy machines as functions from input sequences to outputs [37]. Likewise, we will define which functions M capture the semantics of MDPs by adapting the Myhill-Nerode theorem for regular languages [32]. We denote the set of sequences s where M(s) ≠ ⊥ as the defined domain dd(M) of M.
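Under this interpretation, the semantics of the coffee machine fragment can be sketched as follows (the transition structure is an assumption based on Fig. 1; `None` stands for ⊥):

```python
L = {"q0": "init", "q1": "beep", "q2": "coffee"}
delta = {
    "q0": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
    "q1": {"coin": {"q1": 1.0}, "but": {"q2": 0.8, "q0": 0.2}},
    "q2": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
}

def delta_star(trace):
    """State reached by a trace, or None (⊥) if the trace is unobservable."""
    q = "q0" if trace and trace[0] == L["q0"] else None
    for j in range(1, len(trace), 2):
        if q is None:
            return None
        i, o = trace[j], trace[j + 1]
        q = next((q2 for q2, p in delta[q][i].items()
                  if p > 0 and L[q2] == o), None)
    return q

def M(test_seq):
    """M(t·i): output distribution after trace t and input i, or None (⊥)."""
    q = delta_star(test_seq[:-1])
    if q is None:
        return None
    return {L[q2]: p for q2, p in delta[q][test_seq[-1]].items() if p > 0}

assert M(["init", "coin"]) == {"beep": 1.0}
assert M(["init", "coin", "beep", "but"]) == {"coffee": 0.8, "init": 0.2}
assert M(["init", "coin", "coffee", "but"]) is None   # trace is unobservable
```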
Definition 3 (MDP Semantics). Given an MDP ⟨Q, Σ_I, Σ_O, q_0, δ, L⟩, its semantics is a function M, defined for i ∈ Σ_I, o ∈ Σ_O, t ∈ TR as follows:
- M(ε) = 1_{L(q_0)},
- M(t · i) = ⊥ if δ*(t) = ⊥, and
- M(t · i)(o) = δ(δ*(t), i)(∆(δ*(t), i, o)) otherwise, where we set δ(q, i)(⊥) = 0.

Definition 4 (M-Equivalence of Traces). Two traces t_1, t_2 ∈ TR are equivalent with respect to M, denoted t_1 ≡_M t_2, iff last(t_1) = last(t_2) and M(t_1 · c) = M(t_2 · c) for all continuation sequences c ∈ CS.

A function M defines an equivalence relation on traces, like the Myhill-Nerode equivalence for formal languages [32]. Two traces are M-equivalent if they end in the same output and if their behaviour in response to future inputs is the same. Two traces leading to the same MDP state are in the same equivalence class of ≡_M, as in Mealy machines [37].
We can now state which functions characterise MDPs, as an adaptation of the Myhill-Nerode theorem for regular languages [32], like for Mealy machines [37].

Theorem 1 (Characterisation). A function M : TS → Dist(Σ_O) ∪ {⊥} is the semantics of an MDP iff (1) ≡_M has finite index, (2) M(ε) = 1_{o_0} for some o_0 ∈ Σ_O, (3) the defined domain dd(M) is prefix-closed, and (4) for every t ∈ TR, either M(t · i) ≠ ⊥ for all i ∈ Σ_I, or M(t · i) = ⊥ for all i ∈ Σ_I.
Proof. Direction ⇒: first we show that the semantics M of an MDP M = ⟨Q, Σ_I, Σ_O, q_0, δ, L⟩ fulfils the conditions of Theorem 1. According to Def. 3, M(ε)(L(q_0)) = 1, thus the second condition is fulfilled.
Let t ∈ TR be an observable trace; then for each i ∈ Σ_I, the distribution M(t · i) is determined by δ(δ*(t), i). Since M contains finitely many states q′, δ(q′, i) and therefore also M(t · i) take only finitely many values. M-equivalence of traces t_i depends on the outcomes of M and on their last outputs last(t_i), which are both finite, therefore M-equivalence defines finitely many equivalence classes for observable traces. For non-observable t ∈ TR, we have δ*(t) = ⊥, which implies M(t · i) = ⊥. As a consequence of Lemma 1, we also have M(t · c) = ⊥ for any c ∈ CS. Hence, non-observable traces are equivalent with respect to M if they end in the same output, therefore M defines finitely many equivalence classes for non-observable traces. In summary, ≡_M has finite index, i.e. the first condition is fulfilled. Prefix-closedness of the defined domain dd(M) of M follows from Lemma 1. Any extension of a non-observable test sequence is also non-observable, thus M fulfils the third condition.
For the fourth condition, we again distinguish two cases. If t is a non-observable trace, i.e. δ*(t) = ⊥, then M(t · i) = ⊥ for all i ∈ Σ_I according to Def. 3, which fulfils the second sub-condition. For observable t, the distribution M(t · i) depends on δ(δ*(t), i), which is defined for all i due to input-enabledness of M, satisfying the first sub-condition.
Direction ⇐: from an M satisfying the conditions given in Theorem 1, we can construct an MDP M_c = ⟨Q, Σ_I, Σ_O, q_0, δ, L⟩. Each equivalence class of ≡_M gives rise to exactly one state in Q, except for the equivalence classes of non-observable traces, which do not correspond to states.
The MDP M_c in the above construction is minimal with respect to the number of states and unique up to isomorphism. Therefore, we refer to such an MDP as the canonical MDP can(M) for the MDP semantics M.
Viewing MDPs as reactive systems, we consider two MDPs to be equivalent if we make the same observations on both.
Definition 5 (Observation Table). An observation table is a tuple ⟨S, E, T⟩, consisting of a prefix-closed set of traces S ⊂ TR, a suffix-closed set of continuation sequences E ⊂ CS, and a mapping T : (S ∪ Lt(S)) · E → Dist(Σ_O) ∪ {⊥}.

We can view an observation table as a two-dimensional array with rows labelled by traces in S ∪ Lt(S) and columns labelled by E. We refer to traces in S as short traces and to their extensions in Lt(S) as long traces. An extension s · i · o of a short trace s ∈ S is in Lt(S) if it is observable.

Table 1. Parts of observation table for learning the faulty coffee machine (Fig. 1).
Analogously to traces, we refer to rows labelled by S as short rows and to rows labelled by Lt(S) as long rows. The table cells store the mapping defined by T. To represent rows labelled by traces s, we use functions row(s) : E → Dist(Σ_O) ∪ {⊥} with row(s)(e) = T(s · e). Equivalence of rows labelled by traces s_1, s_2, denoted eqRow_E(s_1, s_2), holds iff row(s_1) = row(s_2) ∧ last(s_1) = last(s_2) and approximates M-equivalence s_1 ≡_M s_2 by considering only continuations in E, i.e. s_1 ≡_M s_2 implies eqRow_E(s_1, s_2). The observation table content defines the structure of hypothesis MDPs based on the following principle: we create one state per equivalence class of S/eqRow_E, thus we identify states with traces in S reaching them and we distinguish states by their future behaviour in response to sequences in E (as is common in active automata learning [37]). The long traces Lt(S) serve to define transitions. Transition probabilities are given by the distributions in the mapping T. Table 1 shows a part of the observation table created during learning of the coffee machine shown in Fig. 1. The set S has a trace for each state of the MDP. Note that these traces are pairwise inequivalent with respect to eqRow_E, where E = Σ_I = {but, coin}. We only show one element of Lt(S), which gives rise to the self-loop in the initial state with the input but and probability 1.
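The row representation and the eqRow_E check can be sketched as follows, with illustrative table entries for the coffee machine:

```python
E = [("but",), ("coin",)]   # columns: single-input continuations
T = {                       # illustrative exact output distributions
    (("init",), ("but",)):  {"init": 1.0},
    (("init",), ("coin",)): {"beep": 1.0},
    (("init", "but", "init"), ("but",)):  {"init": 1.0},
    (("init", "but", "init"), ("coin",)): {"beep": 1.0},
}

def row(s):
    """row(s): the tuple of output distributions T(s·e) for all e in E."""
    return tuple(tuple(sorted(T[(s, e)].items())) for e in E)

def eq_row(s1, s2):
    """eqRow_E: equal rows and equal last outputs."""
    return s1[-1] == s2[-1] and row(s1) == row(s2)

# the long trace init·but·init behaves like the short trace init,
# yielding the self-loop on input but in the initial state
assert eq_row(("init",), ("init", "but", "init"))
```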
Definition 6 (Closedness). An observation table S, E, T is closed if for all l ∈ Lt(S) there is an s ∈ S such that eqRow E (l, s).

Definition 7 (Consistency). An observation table ⟨S, E, T⟩ is consistent if for all s_1, s_2 ∈ S with eqRow_E(s_1, s_2) and all i ∈ Σ_I, o ∈ Σ_O such that s_1 · i · o and s_2 · i · o are in S ∪ Lt(S), we have eqRow_E(s_1 · i · o, s_2 · i · o).
Closedness and consistency are required to derive well-formed hypotheses, analogously to L* [4]. We require closedness to create transitions for all inputs in all states and we require consistency to be able to derive deterministic hypotheses. During learning, we apply Algorithm 1 repeatedly to establish closedness and consistency of observation tables.

Algorithm 1 MakeClosedAndConsistent(⟨S, E, T⟩)
1: if ⟨S, E, T⟩ is not closed then
2:   l ← l′ ∈ Lt(S) such that ∀s ∈ S : row(s) ≠ row(l′) ∨ last(s) ≠ last(l′)
3:   S ← S ∪ {l}
4: else if ⟨S, E, T⟩ is not consistent then
5:   for all s_1, s_2 ∈ S such that eqRow_E(s_1, s_2) do
6:     if there are i ∈ Σ_I, o ∈ Σ_O, e ∈ E with T(s_1 · i · o · e) ≠ T(s_2 · i · o · e) then
7:       E ← E ∪ {i · o · e}

Algorithm 2 The main loop of L*_MDP^e
1: S ← {L(q_0)}, E ← Σ_I
2: fill(S, E, T)
3: repeat
4:   while ⟨S, E, T⟩ is not closed or not consistent do
5:     ⟨S, E, T⟩ ← MakeClosedAndConsistent(⟨S, E, T⟩)
6:     fill(S, E, T)
7:   H ← hyp(S, E, T)
8:   eqResult ← eq(H)
9:   if eqResult ≠ yes then
10:    cex ← eqResult
11:    for all (t · i) ∈ prefixes(cex) with i ∈ Σ_I do
12:      S ← S ∪ {t}
13:    fill(S, E, T)
14: until eqResult = yes
15: return hyp(S, E, T)
16: procedure fill(S, E, T)
17:   for all s ∈ S ∪ Lt(S), e ∈ E do
18:     if T(s · e) undefined then  ⊲ we have no information about T(s · e) yet
19:       T(s · e) ← odq(s · e)

Learning Algorithm. Algorithm 2 implements L*_MDP^e using the queries odq and eq. First, the algorithm initialises the observation table and fills the table cells via output distribution queries (Lines 1 to 2). The main loop in Lines 3 to 14 makes the observation table closed and consistent, derives a hypothesis H and performs an equivalence query eq(H). If a counterexample cex is found, all its prefix traces are added as short traces to S; otherwise the final hypothesis is returned, as it is output-distribution equivalent to the SUL. Whenever the table contains empty cells, the fill procedure assigns values to these cells via odq.
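The overall loop can be sketched in Python, with the table-handling helpers passed in as parameters (their names and signatures are illustrative, not the paper's implementation):

```python
def learn(odq, eq, init_out, inputs, make_closed_consistent, hyp):
    """Sketch of the exact learning loop: odq/eq are the teacher's queries."""
    S = {(init_out,)}                 # short traces, initialised with L(q0)
    E = {(i,) for i in inputs}        # suffixes, initialised with single inputs
    T = {}

    def fill():
        for s in set(S):
            for e in E:
                T.setdefault((s, e), odq(s + e))   # output distribution query

    while True:
        fill()
        make_closed_consistent(S, E, T)   # may grow S (closedness) or E (consistency)
        fill()
        H = hyp(S, E, T)
        cex = eq(H)                       # equivalence query; None means "yes"
        if cex is None:
            return H
        for k in range(1, len(cex), 2):   # add all prefix traces of cex to S
            S.add(tuple(cex[:k]))

# degenerate run with stub teacher and helpers, just to exercise the loop
H = learn(odq=lambda s: {"o": 1.0},
          eq=lambda H: None,
          init_out="o", inputs=["a"],
          make_closed_consistent=lambda S, E, T: None,
          hyp=lambda S, E, T: sorted(S))
assert H == [("o",)]
```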
Correctness & Termination. In the following, we will show that L*_MDP^e terminates and learns correct models, i.e. models that are output-distribution equivalent to the SUL. Like Angluin [4], we will show that derived hypotheses are consistent with the queried information and that they are minimal with respect to the number of states. For the remainder of this section, let M be the semantics of the MDP underlying the SUL, let M = can(M) be the corresponding canonical MDP, and let H = ⟨Q, Σ_I, Σ_O, q_0, δ, L⟩ denote hypotheses. The first two lemmas relate to observability of traces.
Lemma 2 (Observable Rows). All traces in S ∪ Lt(S) are observable.

Proof. The lemma states that traces labelling rows are observable. Algorithm 2 adds elements to S and consequently Lt(S) in two cases: (1) if an equivalence query returns a counterexample and (2) to make observation tables closed.
Case 1. Counterexamples c ∈ TS returned by equivalence queries eq(H) satisfy M(c) ≠ ⊥ (see also Remark 1). When processing a counterexample in Algorithm 2, we add t_p to S for each t_p · i_p ∈ prefixes(c). Due to prefix-closedness of dd(M), M(t_p · i_p) ≠ ⊥ for all t_p · i_p ∈ prefixes(c), and therefore M(s · i)(o) = T(s · i)(o) > 0 for each added trace t_p of the form t_p = s · i · o with i ∈ Σ_I and o ∈ Σ_O. The set Lt(S) is implicitly extended by all observable extensions of added t_p. By this definition, the traces in Lt(S) are observable as well.

Case 2. If an observation table is not closed, we add traces from Lt(S) to S. As noted above, all traces t = s · i · o in Lt(S) satisfy T(s · i)(o) > 0. Consequently, all traces added to S satisfy this property as well.
Theorem 2 (Minimality). Let ⟨S, E, T⟩ be a closed and consistent observation table and let H = hyp(S, E, T) be a hypothesis derived from that table with semantics H. Then H is consistent with T, that is, ∀s ∈ (S ∪ Lt(S)) · E : T(s) = H(s), and any other MDP consistent with T but inequivalent to H must have more states.

Similarly to [4], we prove consistency with T by induction; minimality follows from Lemma 6 below.

Lemma 5. Let ⟨S, E, T⟩ be a closed and consistent observation table with hypothesis H = hyp(S, E, T) and semantics H. Then T(s · e) = H(s · e) for all s ∈ S ∪ Lt(S) and e ∈ E.

Proof. We will prove this by induction on the length of e, i.e. the number of inputs of e. As induction hypothesis, we assume T(s · e) = H(s · e) for all s ∈ S ∪ Lt(S) and e ∈ E of length at most k. For the base case, we consider e consisting of a single input, i.e. e ∈ Σ_I. From Def. 3 we can derive that H(s · i) = ⊥ if δ*(s) = ⊥; otherwise H(s · i) is given by the transition distribution of the hypothesis state reached by s, which is defined from T(s · i), thus T(s · i) = H(s · i). For the induction step, let e ∈ E be of length k + 1; thus it is of the form e = i · o · e_k for i ∈ Σ_I, o ∈ Σ_O and, due to suffix-closedness of E, e_k ∈ E. We have to show that T(s · e) = H(s · e) for s ∈ S ∪ Lt(S). Let s′ ∈ S such that eqRow_E(s, s′), which exists due to observation table closedness. Traces s and s′ lead to the same hypothesis state, thus s and s′ are H-equivalent and therefore H(s · e) = H(s′ · e). Due to eqRow_E(s, s′), we also have T(s · e) = T(s′ · e). In combination, T(s · e) = T(s′ · e) = T(s′ · i · o · e_k) = H(s′ · i · o · e_k) = H(s′ · e) = H(s · e), where the middle equality holds by the induction hypothesis applied to s′ · i · o, as e_k has length k.

With Lemma 5, we have shown consistency between derived hypotheses and the queried information. Now, we show that hypotheses are minimal with respect to the number of states.
Lemma 6. Let ⟨S, E, T⟩ be a closed and consistent observation table and let n be the number of different values of ⟨last(s), row(s)⟩ for s ∈ S, i.e. the number of states of hypothesis hyp(S, E, T). Any MDP consistent with T must have at least n states.
Proof. Let M′ be an MDP with semantics M′ that is consistent with T, and let s_1, s_2 ∈ S with ¬eqRow_E(s_1, s_2). If last(s_1) ≠ last(s_2), then s_1 and s_2 cannot reach the same state in M′, because the states reached by s_1 and s_2 need to be labelled differently. If row(s_1) ≠ row(s_2), then there exists an e ∈ E such that M′(s_1 · e) ≠ M′(s_2 · e), because M′ is consistent with T. In this case, s_1 and s_2 cannot reach the same state either, as the observed future behaviour is different. Consequently, M′ has at least n states.
Lemma 7. Any MDP M′ = ⟨Q′, Σ_I, Σ_O, q′_0, δ′, L′⟩ that is consistent with T and has n or fewer states is isomorphic to hyp(S, E, T).
Proof. From Lemma 6, it follows that M′ has at least n states, therefore we examine M′ with exactly n states. For each state of H, i.e. each unique pair ⟨last(s), row(s)⟩ with s ∈ S, there exists a unique state in Q′. We define a mapping φ from hypothesis states to Q′ by φ(⟨last(s), row(s)⟩) = δ′*(q′_0, s) for s ∈ S. It is bijective, and we show that it maps q_0 to q′_0, that it preserves the probabilistic transition relation for each s ∈ S, i ∈ Σ_I and o ∈ Σ_O, and that it preserves labelling for all s ∈ S. Hence, H and M′ are isomorphic.
Termination. Suppose an equivalence query eq(H) returns a counterexample c. Then M(c) ≠ H(c), which shows that H is not equivalent to M, with c being a counterexample to equivalence. We do not remove elements from S, E, or T, thus subsequent hypotheses remain consistent with T. Therefore, the next hypothesis must have at least one state more than H according to Theorem 2. It follows that each round of learning which finds a counterexample adds at least one state. Since Algorithm 2 derives minimal hypotheses and M can be modelled with finitely many states, there can only be finitely many rounds that find counterexamples. Hence, we terminate after a finite number of rounds, because Algorithm 2 returns the final hypothesis as soon as no counterexample can be found via equivalence queries eq.
Correctness. The algorithm terminates when the equivalence query eq(H) does not find any new counterexample between the final hypothesis H and M. Since there is no counterexample, we have H ≡_od M. Theorem 2 states that H is minimal and M = can(M) is consistent with T, therefore it follows from Lemma 7 that H is isomorphic to M, the canonical MDP modelling the SUL.

Learning MDPs by Sampling
In this section, we introduce L*_MDP, an approximate sampling-based learning method for MDPs based on L*_MDP^e. In contrast to L*_MDP^e, which requires exact information, we place weaker assumptions on the teacher. Here, we do not require exact output distribution queries and equivalence queries, but approximate these queries via sampling, i.e. testing. Since large amounts of data are required to produce accurate models, we also alter the structure of the learning algorithm in contrast to the previous section. The sampling-based L*_MDP makes it possible to derive an approximate model at any time, unlike most other L*-based algorithms. Therefore, this section is split into three parts: first, we present a sampling-based interface between teacher and learner, as well as the interface between teacher and SUL. The second and third parts describe the adapted learner and the implementation of the teacher, respectively.
Queries. The sampling-based teacher maintains a multiset of traces S for the estimation of output distributions that grows during learning. It offers an equivalence query and three queries relating to output distributions and samples S.
- frequency (fq): given a test sequence s ∈ TS, fq(s) returns the output frequencies observed after s, i.e. a function fq(s) : Σ_O → N_0 with fq(s)(o) = S(s · o).
- complete (cq): given a test sequence s ∈ TS, cq(s) returns true if sufficient information is available to estimate an output distribution from fq(s); returns false otherwise.
- refine (rfq): instructs the teacher to refine its knowledge of the SUL by testing it directed towards rarely observed samples. Traces sampled by rfq are added to S, increasing the accuracy of subsequent probability estimations.
- equivalence (eq): given a hypothesis H, eq tests for output-distribution equivalence between the SUL and H; it returns a counterexample from TS showing non-equivalence, or none if no counterexample was found.
The sampling-based teacher thus needs to implement two different testing strategies, one for increasing the accuracy of probability estimations along observed traces (refine) and one for finding discrepancies between a hypothesis and the SUL (equivalence). The frequency query and the complete query are used for hypothesis construction by the learner.
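The frequency and complete queries can be sketched over a multiset of sampled traces; the sample counts and the completeness threshold `n_c` are illustrative assumptions:

```python
from collections import Counter

samples = Counter({  # multiset S of sampled traces with multiplicities (illustrative)
    ("init", "coin", "beep", "but", "coffee"): 41,
    ("init", "coin", "beep", "but", "init"):   9,
})

def fq(test_seq):
    """fq(s): output frequencies observed directly after test sequence s = t·i."""
    counts = Counter()
    for trace, mult in samples.items():
        if len(trace) == len(test_seq) + 1 and trace[:-1] == tuple(test_seq):
            counts[trace[-1]] += mult
    return counts

def cq(test_seq, n_c=20):
    """cq(s): do we have enough observations to estimate a distribution?"""
    return sum(fq(test_seq).values()) >= n_c

assert fq(("init", "coin", "beep", "but")) == Counter({"coffee": 41, "init": 9})
assert cq(("init", "coin", "beep", "but"))
assert not cq(("init", "but"))
```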
To test the SUL, we require the ability to (1) reset it and (2) perform an input action and observe the produced output. For the remainder of this section, let M = ⟨Q, Σ_I, Σ_O, q_0, δ, L⟩ be the MDP underlying the SUL with semantics M. Based on q ∈ Q, the current execution state of M, we define two operations available to the teacher: reset resets M to the initial state, i.e. q = q_0, and returns L(q_0); step takes an input i ∈ Σ_I and selects a new state q′ according to δ(q, i)(q′).
The step operation then updates the execution state to q ′ and returns L(q ′ ).
Note that we consider M to be a black box, i.e. its structure and transition probabilities are assumed to be unknown. We are only able to perform inputs and observe output labels, e.g., we observe the initial SUL output L(q 0 ) after performing a reset.
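The reset/step testing interface can be sketched as a small class wrapping the coffee machine fragment (the embedded transition structure is an assumption; the tester sees only the returned labels):

```python
import random

class SUL:
    """Black-box view of the SUL: only reset and step are exposed."""
    L = {"q0": "init", "q1": "beep", "q2": "coffee"}
    delta = {
        "q0": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
        "q1": {"coin": {"q1": 1.0}, "but": {"q2": 0.8, "q0": 0.2}},
        "q2": {"coin": {"q1": 1.0}, "but": {"q0": 1.0}},
    }

    def reset(self):
        self.q = "q0"
        return self.L[self.q]          # initial output L(q0)

    def step(self, i):
        dist = self.delta[self.q][i]   # probabilistic successor choice
        self.q = random.choices(list(dist), weights=dist.values())[0]
        return self.L[self.q]

sul = SUL()
assert sul.reset() == "init"
assert sul.step("coin") == "beep"              # deterministic outcome (prob. 1)
assert sul.step("but") in {"coffee", "init"}   # faulty transition: 0.8 / 0.2
```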

Learner Implementation
Observation Table. The sampling-based learner is also based on observation tables, therefore we use the same terminology as in Sect. 3.

Definition 8 (Sampling-based Observation Table). An observation table is a tuple ⟨S, E, T⟩, consisting of a prefix-closed set of traces S ⊂ TR, a suffix-closed set of continuation sequences E ⊂ CS, and a mapping T : (S ∪ Lt(S)) · E → (Σ_O → N_0) storing observed output frequencies.
An observation table can be represented by a two-dimensional array, containing rows labelled with elements of S and Lt(S) and columns labelled by E. Each table cell corresponds to a sequence c = s · e, where s ∈ S ∪ Lt(S) is the row label of the cell and e ∈ E is the column label. It stores queried output frequency counts T(c) = fq(c). To represent the content of rows, we define the function row on S ∪ Lt(S) by row(s)(e) = T(s · e). The traces in Lt(S) are input-output extensions of S which have been observed so far. We refer to traces in S/Lt(S) as short/long traces. Analogously, we refer to rows labelled by corresponding traces as short and long rows. As in Sect. 3, we identify states with traces reaching these states. These traces are stored in the prefix-closed set S. We distinguish states by their future behaviour in response to sequences in E. We initially set S = {L(q_0)}, where L(q_0) is the initial output of the SUL, and E = Σ_I. Long traces, as extensions of access sequences in S, serve to define transitions of hypotheses.
Hypothesis Construction. As in Sect. 3, observation tables need to be closed and consistent for a hypothesis to be constructed. Unlike before, we do not have exact information to determine equivalence of rows. We need to statistically test if rows are different. First, we give a condition determining whether two sequences lead to statistically different observations, i.e. the corresponding output frequency samples come from different distributions. This condition is based on Hoeffding bounds which are also used by Carrasco and Oncina [12]. We further apply this condition in a check for approximate equivalence between cells and extend this check to rows. Using similar terminology to [12], we refer to such checks as compatibility checks and we say that two cells/rows are compatible if we determine that they are not statistically different. These notions of compatibility serve as the basis for slightly adapted definitions of closedness and consistency.
Definition 9 (Different). Two sequences s and s′ in TS produce statistically different output distributions with respect to f : TS → (Σ_O → N_0), denoted different_f(s, s′), iff (1) n_1 > 0 and n_2 > 0, where n_1 = Σ_{o∈Σ_O} f(s)(o) and n_2 = Σ_{o∈Σ_O} f(s′)(o), and (2) one of the following conditions holds:
- the observed supports differ, i.e. {o | f(s)(o) > 0} ≠ {o | f(s′)(o) > 0}, or
- there is an o ∈ Σ_O with |f(s)(o)/n_1 − f(s′)(o)/n_2| > (√(1/n_1) + √(1/n_2)) · √(1/2 · ln(2/α)),
where α is a confidence parameter of the Hoeffding bound.

Definition 10 (Compatible). Two cells labelled by sequences c and c′ are compatible, denoted compatible(c, c′), iff ¬cq(c), ¬cq(c′), or ¬diff T (c, c′), i.e. we consider cells compatible unless both carry complete information and their stored frequencies are statistically different.
Two rows labelled by s and s′ are compatible, denoted compatible E (s, s′), iff last(s) = last(s′) and the cells corresponding to all e ∈ E are compatible, i.e. compatible(s · e, s′ · e).
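The difference and compatibility checks can be sketched as follows. This is a sketch under our own naming; the Hoeffding radius follows the bound attributed to Carrasco and Oncina [12], and frequency samples are plain dicts mapping outputs to counts.

```python
from math import log, sqrt

def eps_alpha(n1, n2, alpha):
    # Hoeffding-based radius: how far two empirical probabilities may drift
    # apart before we call the underlying distributions different
    return (sqrt(1.0 / n1) + sqrt(1.0 / n2)) * sqrt(0.5 * log(2.0 / alpha))

def different(f1, f2, alpha=0.05):
    n1, n2 = sum(f1.values()), sum(f2.values())
    if n1 == 0 or n2 == 0:          # no observations -> cannot distinguish
        return False
    if {o for o in f1 if f1[o]} != {o for o in f2 if f2[o]}:
        return True                  # condition 2.a: supports differ
    return any(abs(f1[o] / n1 - f2[o] / n2) > eps_alpha(n1, n2, alpha)
               for o in f1)          # condition 2.b: Hoeffding test

def compatible_cell(c1, c2, complete1, complete2, alpha=0.05):
    # cells are compatible if one is incomplete or they are not different
    return not complete1 or not complete2 or not different(c1, c2, alpha)
```

For instance, `different({"a": 98, "b": 2}, {"a": 5, "b": 95})` holds, while two samples with identical empirical distributions are never flagged.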
Compatibility Classes. In Sect. 3, we formed equivalence classes of traces with respect to eqRow E creating one hypothesis state per equivalence class. Now we partition rows labelled by S based on compatibility. Compatibility given by Def. 10, however, is not an equivalence relation, as it is not transitive in general. As a result, we cannot simply create equivalence classes. We apply the heuristic implemented by Algorithm 3 to partition S.
First, we assign a rank to each trace in S. Then, we partition S by iteratively selecting the trace r with the largest rank and computing a compatibility class cg(r) for r. The trace r is the (canonical) representative for each s in cg(r), which we denote by rep(s) (Line 9). Each r is stored in the set of representative traces R. In contrast to equivalence classes, elements in a compatibility class need not be pairwise compatible, and an s may be compatible to multiple representatives, in which case the unique representative rep(s) of s is the compatible one with the largest rank. However, in the limit, compatible E based on Hoeffding bounds converges to an equivalence relation [12], and therefore compatibility classes are equivalence classes in the limit (see Sect. 4.3). Note that the first condition of consistency may be satisfied merely because of incomplete information.

Given a closed and consistent observation table ⟨S, E, T⟩, we derive a hypothesis MDP H = hyp(S, E, T) through the steps below. Note that extensions s · i · o of s in S define transitions. Some extensions may have few observations, i.e. T(s · i) is low and cq(s · i) = false. In case of such uncertainties, we add transitions to a special sink state labelled by chaos, an output not in the original alphabet.
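The partitioning heuristic can be sketched as follows, in the spirit of Algorithm 3. The function and parameter names are our assumptions; `rank` and `compatible` are passed in as black boxes.

```python
# Rank-based partitioning into compatibility classes: repeatedly pick the
# unassigned trace with the largest rank as representative and assign every
# compatible unassigned trace to its class. Because assignment proceeds in
# decreasing rank order, rep(s) is the compatible representative of largest
# rank, as in the text.
def partition(traces, rank, compatible):
    unassigned = set(traces)
    representatives, rep_of = [], {}
    while unassigned:
        r = max(unassigned, key=rank)          # canonical representative
        cls = {s for s in unassigned if compatible(s, r)}
        cls.add(r)
        for s in cls:
            rep_of[s] = r                       # rep(s) for every s in cg(r)
        representatives.append(r)
        unassigned -= cls
    return representatives, rep_of

# toy example: traces are compatible iff they share the same first symbol
reps, rep_of = partition(["a1", "a2", "b1"], rank=lambda s: s,
                         compatible=lambda s, r: s[0] == r[0])
```

In the toy example, "b1" and "a2" become representatives and "a1" is assigned to "a2", the higher-ranked compatible representative.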

Definition 11 (Sampling Closedness). An observation table ⟨S, E, T⟩ is sampling closed iff for all long traces l ∈ Lt(S) there exists a short trace s ∈ S such that compatible E (l, s).
The hypothesis H = hyp(S, E, T) = ⟨Q h , Σ I , Σ O , q 0h , δ h , L h ⟩ is derived as follows:
- the states are Q h = {⟨last(r), row(r)⟩ | r ∈ R} ∪ {q chaos }, with L h (⟨o, row(r)⟩) = o
- representatives for long traces l ∈ Lt(S) are given by rep(l) (see Algorithm 3)
- for q chaos : L h (q chaos ) = chaos and for all i ∈ Σ I : δ h (q chaos , i)(q chaos ) = 1
- q 0h = ⟨L(q 0 ), row(L(q 0 ))⟩
- for q = ⟨o, row(s)⟩ ∈ Q h \ {q chaos } and i ∈ Σ I (note that Σ I ⊆ E):
1. If ¬cq(s · i): δ h (q, i)(q chaos ) = 1, i.e. move to chaos
2. Otherwise estimate a distribution µ = δ h (q, i) over the successor states: for each o′ ∈ Σ O with T(s · i)(o′) > 0, set µ(⟨o′, row(rep(s · i · o′))⟩) = T(s · i)(o′) / Σ o′′ T(s · i)(o′′)

Updating the Observation Table. Analogously to Sect. 3, we make observation tables closed by adding new short rows and we establish consistency by adding new columns. While Algorithm 2 needs to fill the observation table after executing MakeClosedAndConsistent, this is not required in the sampling-based setting due to the adapted notions of closedness and consistency.
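The derivation of transitions, including the chaos sink for under-sampled extensions, can be sketched as follows. All names (`transitions`, `freq`, `rep_of`, the threshold `n_c`) are our assumptions for illustration.

```python
# Sketch of deriving hypothesis transitions from a sampling-based table:
# for the state reached by representative r and input i, move to a 'chaos'
# sink when too few observations exist (cq(r·i) is false); otherwise estimate
# the transition distribution from relative frequencies.
CHAOS = "chaos"

def transitions(r, inputs, freq, rep_of, n_c=5):
    delta = {}
    for i in inputs:
        counts = freq.get((r, i), {})        # output -> observation count
        total = sum(counts.values())
        if total < n_c:                      # cq(r·i) = false: uncertain
            delta[i] = {CHAOS: 1.0}
        else:                                # empirical output distribution
            delta[i] = {rep_of[(r, i, o)]: n / total
                        for o, n in counts.items()}
    return delta

delta = transitions("s0", ["go"], {("s0", "go"): {"ok": 8, "err": 2}},
                    rep_of={("s0", "go", "ok"): "s1",
                            ("s0", "go", "err"): "s2"})
# delta == {'go': {'s1': 0.8, 's2': 0.2}}
```

With no observations at all, `transitions("s0", ["go"], {}, {})` routes the input to the chaos state with probability one.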
Trimming the Observation Table. Observation table size greatly affects learning performance, therefore it is common to avoid adding redundant information [34,25]. Due to inexact information, this is hard to apply in a stochastic setting. We instead remove rows via a function Trim, once we are certain that this does not change the hypothesis. Given an observation table ⟨S, E, T⟩, we remove s and all s′ such that s ≪ s′ from S if:
1. there is exactly one r ∈ R such that compatible E (s, r),
2. s ∉ R and ∀r ∈ R : ¬(s ≪ r),
3. and ∀s′ ∈ S, i ∈ Σ I with s ≪ s′ : diff fq (s′ · i, r · i) = false, where r ∈ R such that ⟨last(r), row(r)⟩ = δ* h (s′), and δ h is the transition relation of hyp(S, E, T).
The first condition is motivated by the observation that if s is compatible to exactly one r, then all extensions of s can be assumed to reach the same states as the extensions of r, i.e. we do not need to store s in the observation table.
The other conditions make sure that we do not remove required rows because of a spurious compatibility check in the first condition. The third condition is related to the implementation of equivalence queries and basically checks whether an extension s′ reveals a difference between observed frequencies (queried via fq).

Algorithm 4 The main algorithm implementing L * mdp
Input: sampling-based teacher capable of answering fq, rfq, eq and cq
1: S ← {L(q 0 )}, E ← Σ I ⊲ initialise observation table
2: perform rfq(⟨S, E, T⟩) ⊲ sample traces
3: for all s ∈ S ∪ Lt(S), e ∈ E do
4: T(s · e) ← fq(s · e) ⊲ update observation table with frequency information
5: round ← 0
6: repeat
7: round ← round + 1
8: while ⟨S, E, T⟩ not closed or not consistent do
9: ⟨S, E, T⟩ ← MakeClosedAndConsistent(⟨S, E, T⟩)
10: H ← hyp(S, E, T) ⊲ create hypothesis
11: ⟨S, E, T⟩ ← trim(⟨S, E, T⟩, H) ⊲ remove rows that are not needed
12: cex ← eq(H)
13: if cex ≠ none then ⊲ we found a counterexample
14: for all t · i ∈ prefixes(cex) with i ∈ Σ I do
15: S ← S ∪ {t} ⊲ add all prefixes of the counterexample
16: perform rfq(⟨S, E, T⟩) ⊲ sample traces to refine knowledge about SUL
17: for all s ∈ S ∪ Lt(S), e ∈ E do
18: T(s · e) ← fq(s · e) ⊲ update observation table
19: until stopping criterion satisfied
20: return H

Learning Algorithm. Algorithm 4 implements L * mdp . It first initialises an observation table ⟨S, E, T⟩ with the initial SUL output as first row and with the inputs Σ I as columns (Line 1). Lines 2 to 4 perform a refine query and then update ⟨S, E, T⟩, which corresponds to output distribution queries in L * mdp e . Here, the teacher resamples the only known trace L(q 0 ). Resampling that trace consists of observing L(q 0 ), performing some input and observing another output.
After that, we perform Lines 6 to 19 until a stopping criterion is reached. We establish closedness and consistency of S, E, T in Line 9 to build a hypothesis H in Line 10. After that, we remove redundant rows of the observation table via Trim in Line 11. Then, we perform an equivalence query, testing for equivalence between SUL and H. If we find a counterexample, we add all its prefix traces as rows to the observation table like in L * mdp e . Finally, we sample new system traces via rfq to gain more accurate information about the SUL (Lines 16 to 18). Once we stop, we output the final hypothesis.
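The overall control flow can be condensed into a structural sketch. All callbacks are stand-ins for the teacher's queries and the table operations; this is not the algorithm itself, only its loop structure.

```python
# Highly simplified driver mirroring the learning loop: close the table,
# build a hypothesis, trim, ask for a counterexample, add its prefixes,
# resample, and stop when the stopping criterion fires.
def learn(make_closed_consistent, hyp, trim, eq, refine, add_prefixes,
          stop, table):
    while True:
        table = make_closed_consistent(table)
        h = hyp(table)
        table = trim(table, h)
        cex = eq(h)
        if cex is not None:                 # counterexample found
            table = add_prefixes(table, cex)
        table = refine(table)               # rfq: resample rare traces
        if stop(h, cex):
            return h

# trivial stubs: no counterexample, stop immediately after one round
ident = lambda t: t
h = learn(make_closed_consistent=ident, hyp=lambda t: "H",
          trim=lambda t, h: t, eq=lambda h: None, refine=ident,
          add_prefixes=lambda t, c: t, stop=lambda h, cex: True, table={})
```

The stubbed run returns the single hypothesis `"H"` after one iteration.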
Stopping. L * mdp e and deterministic automata learning usually stop learning once equivalence between the learned hypothesis and the SUL is achieved, i.e. no counterexample can be found. Here, we employ a different stopping criterion, because equivalence can hardly be achieved via sampling. Furthermore, we may wish to carry on resampling via rfq although we did not find a counterexample. Resampling may improve accuracy of a hypothesis which is beneficial for the test-case generation in subsequent equivalence queries.
Our stopping criterion takes uncertainty in compatibility checks into account. As previously noted, rows may be compatible to multiple other rows. In particular, a row labelled by s may be compatible to multiple representatives, i.e. we are not certain which state is reached by the trace s. We address this issue by stopping based on the ratio r unamb of unambiguous traces to all traces, computed by r unamb = |{s ∈ S ∪ Lt(S) : s is compatible to exactly one r ∈ R}| / |S ∪ Lt(S)|. More concretely, we stop if:
1.a. at least r min rounds have been executed,
1.b. the chaos state q chaos is unreachable, and
1.c. r unamb ≥ t unamb , where t unamb is a user-defined threshold;
2.a. or, alternatively, we stop after a maximum number of rounds r max .
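The unambiguity ratio and the combined stopping condition can be sketched as follows; the parameter defaults and names (`r_min`, `r_max`, `t_unamb`) are illustrative assumptions.

```python
# Ratio of traces compatible with exactly one representative, and the
# round-count / chaos-reachability / threshold stopping check built on it.
def r_unamb(traces, representatives, compatible):
    unambiguous = [s for s in traces
                   if sum(compatible(s, r) for r in representatives) == 1]
    return len(unambiguous) / len(traces)

def should_stop(round_, chaos_reachable, ratio,
                r_min=10, r_max=300, t_unamb=0.99):
    if round_ >= r_max:                      # hard cap on rounds
        return True
    return round_ >= r_min and not chaos_reachable and ratio >= t_unamb

ratio = r_unamb(["aa", "ab", "ba"], ["aa", "ba"],
                compatible=lambda s, r: s[0] == r[0])
```

In the toy call every trace matches exactly one representative, so the ratio is 1.0 and stopping is permitted once enough rounds have run.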

Teacher Implementation
In the following, we describe the implementation of each of the four queries provided by the teacher. Recall that we interact with the SUL M with semantics M (see Sect. 3).
Frequency Query. The teacher keeps track of a multiset of sampled system traces S. Whenever a new trace is added, all its prefixes are added as well, as they have also been observed. Therefore, we have for t ∈ T R, t′ ∈ prefixes(t) : S(t) ≤ S(t′). The frequency query fq(s) : Σ O → N 0 for s ∈ T S returns the output frequencies observed after s, i.e. fq(s)(o) = S(s · o) for o ∈ Σ O .

Complete Query. Trace frequencies retrieved via fq are generally used to compute empirical output distributions µ following a sequence s in T S, i.e. the learner computes µ(o) = fq(s)(o) / Σ o′ fq(s)(o′). The complete query cq takes a sequence s as input and signals whether s should be used to approximate M(s), e.g. to perform statistical tests. We base cq on a threshold n c > 0: cq(s) = true if (1) Σ o fq(s)(o) ≥ n c , or (2) s extends a sequence s′ · o for a complete s′ and an o ∈ Σ O with S(s′ · o) = 0. Note that for a complete s, all prefixes of s are also complete. Additionally, if cq(s), we assume that we have seen all extensions of s; therefore, we set for each o with S(s · o) = 0 all extensions of s · o to be complete (second clause). The threshold n c is user-specifiable in our implementation.

Algorithm 5 Refine query
1: rare ← {s | s ∈ (S ∪ Lt(S)) · E : ¬cq(s)} ⊲ select incomplete sequences
2: trie ← buildTrie(rare)
3: for i ← 1 to n resample do ⊲ collect n resample new samples
4: newTrace ← sampleSul(trie)
5: S ← S ⊎ {newTrace}
6: function sampleSul(trie)
7: node ← root(trie)
8: trace ← reset ⊲ initialise SUL and observe initial output
9: loop
10: i ← input labelling an edge leaving node ⊲ choose next input within trie
11: o ← step(i) ⊲ execute SUL and observe output
12: trace ← trace · i · o
13: if trace ∉ trie or trace labels a leaf then ⊲ did we leave the trie?
14: return trace
15: node ← node′, where node′ is the node reached from node via i · o
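The teacher's bookkeeping for fq and cq can be sketched as follows. Traces are tuples alternating outputs and inputs, and the class and method names are our assumptions.

```python
from collections import Counter

# Sketch of the teacher's sample store: a multiset of traces closed under
# (output-ended) prefixes, a frequency query counting outputs observed after
# a test sequence, and a threshold-based completeness query.
class Teacher:
    def __init__(self, n_c=5):
        self.samples = Counter()
        self.n_c = n_c

    def add(self, trace):
        # record all prefixes ending in an output, so S(t) <= S(t') holds
        # for every prefix t' of t
        for k in range(1, len(trace) + 1, 2):
            self.samples[trace[:k]] += 1

    def fq(self, s):
        # outputs observed directly after test sequence s (s ends in an input)
        return Counter({t[-1]: n for t, n in self.samples.items()
                        if t[:-1] == s})

    def cq(self, s):
        # first clause of the completeness check: enough observations of s
        return sum(self.fq(s).values()) >= self.n_c

teacher = Teacher(n_c=2)
teacher.add(("init", "go", "ok"))
teacher.add(("init", "go", "ok"))
teacher.add(("init", "go", "err"))
freqs = teacher.fq(("init", "go"))
```

After three samples, `freqs` counts two "ok" and one "err" observation, and `cq(("init", "go"))` holds for the threshold 2.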
Refine Query. Refine queries serve the purpose of refining our knowledge about output distributions along previously observed traces. Therefore, we select rarely observed traces and resample them. We implemented this through the procedure outlined in Algorithm 5.
First, we build a trie from rarely observed traces (Lines 1 and 2), where edges are labelled by input-output pairs and nodes are labelled by traces reaching the nodes. This trie is then used for directed online-testing of the SUL via sample-Sul (Lines 6 to 16) with the goal of reaching a leaf of the trie. In this way, we create n resample new samples and add them to the multiset of samples S.
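The trie-directed resampling can be sketched as follows. For simplicity we represent the trie as a set of rare sequences and check prefix membership directly; `step` simulates the SUL, and all names are illustrative.

```python
import random

# Sketch of the refine query's directed testing: walk the (simulated) SUL
# along inputs that keep the current trace a prefix of some rare sequence,
# returning once no rare sequence extends the trace any further.
def build_trie(rare_sequences):
    return set(rare_sequences)          # a prefix set stands in for the trie

def sample_sul(trie, initial_output, step, rng):
    trace = (initial_output,)
    while True:
        # inputs that extend the current trace inside the trie
        nexts = [t[len(trace)] for t in trie
                 if len(t) > len(trace) and t[:len(trace)] == trace]
        if not nexts:
            return trace                # left the trie (or reached a leaf)
        i = rng.choice(nexts)
        trace = trace + (i, step(i))    # execute SUL, observe output

rng = random.Random(0)
rare = build_trie([("init", "go")])     # a rare sequence ending in an input
trace = sample_sul(rare, "init", step=lambda i: "ok", rng=rng)
```

The sampled trace extends the rare sequence `("init", "go")` by the observed output and is then added to the multiset of samples.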
Equivalence Query. Equivalence queries are often implemented via (conformance) testing in active automata learning [23], e.g. via the W-method [16] for deterministic models. Such testing techniques generally execute some test suite to find counterexamples to conformance between a model and the SUL. In our setup, a counterexample is a test sequence inducing a different output distribution in the hypothesis H than in the SUL. Since we cannot directly observe these distributions, we apply two strategies to find counterexamples during equivalence queries. First, we search for counterexamples with respect to the structure of H via testing. Second, we check for statistical conformance between all traces S collected so far and H, which allows us to detect incorrect output distributions.
Note that all traces to the state q chaos are guaranteed to be counterexamples, as chaos is not part of the original output alphabet Σ O . For this reason, we do not search for other counterexamples if q chaos is reachable in H. In slight abuse of terminology, we implement this by returning none from eq(H). L * mdp in Algorithm 4 will then issue further rfq queries, lowering uncertainty about state transitions, which in turn causes q chaos to be unreachable eventually.

Algorithm 6 State-coverage-based testing for counterexample detection
Input: H = ⟨Q, Σ I , Σ O , q 0 , δ, L⟩, schedulers qSched
Output: counterexample test sequence s ∈ T S or none
1: q curr ← q 0 ⊲ current state
2: trace ← reset
3: q target ← randSel(reachable(Q, q curr )) ⊲ choose a target state
4: loop
5: if coinFlip(p rand ) then
6: in ← randSel(Σ I ) ⊲ random next input
7: else
8: in ← qSched(q target )(q curr ) ⊲ next input leads towards target
9: out ← step(in) ⊲ perform input
10: q curr ← ∆(q curr , in · out) ⊲ move in hypothesis
11: if q curr = ⊥ then ⊲ output not possible in hypothesis
12: return trace · in ⊲ return counterexample
13: trace ← trace · in · out
14: if coinFlip(p stop ) then ⊲ stop with probability p stop
15: return none
16: if q curr = q target or q target ∉ reachable(Q, q curr ) then
17: q target ← randSel(reachable(Q, q curr )) ⊲ choose new target

Testing of Structure. Our goal in testing is to sample a trace of the SUL that is not observable on the hypothesis. For that, we adapted a randomised testing strategy from Mealy machines to MDPs, which proved effective in previous work [2]. In that work, we generated test cases for active automata learning by interleaving random walks in hypotheses with paths leading to randomly chosen transitions. By generating many of these tests, we aim at covering hypotheses adequately, while exploring new parts of the SUL's state space through random testing. Here, we aim at covering randomly chosen states and apply an online testing procedure, as the SUL is stochastic. This procedure is outlined in Algorithm 6.
The algorithm takes a hypothesis and qSched as input, where qSched is a mapping from states to schedulers. Given q ∈ Q, qSched(q) is a scheduler maximising the probability of reaching q, i.e. it selects inputs optimally with respect to reachability of q. For optimal reachability, there exist schedulers that are memoryless and deterministic [19], which means that they take only the last state in the current execution path into account and that input choices are not probabilistic. Therefore, a scheduler qSched(q) is a function from Q to Σ I . In Algorithm 6, we start by randomly choosing a target state q target from the states reachable from the initial state (Line 3), which are given by reachable(Q, q curr ). Then, we execute the SUL, either with random inputs (Line 6) or with inputs leading to the target (Line 8), which are computed using schedulers. If we observe an output that is not possible in the hypothesis, we return a counterexample (Line 12); alternatively, we may stop with probability p stop (Line 15). If we reach the target or it becomes unreachable, we simply choose a new target state (Line 17).
For each equivalence query, we repeat Algorithm 6 up to n test times and report the first counterexample we find. In case we find a counterexample c, we resample it up to n retest times or until cq(c), to get more accurate information about it.
Checking Conformance to S. For each sequence t · i ∈ T S with i ∈ Σ I such that cq(t · i), we check for consistency between the information stored in S and the hypothesis H by evaluating two conditions:
1. Is t observable in H? If it is not, then we determine the longest observable prefix t′ of t such that t′ · i′ · v = t, where i′ is a single input, and return t′ · i′ as counterexample from eq(H).
2. Otherwise we determine the state q = ⟨o, row(r)⟩ reached by t in H, where r ∈ R, and return t · i as counterexample if diff fq (t · i, r · i) is true. This statistical check approximates the comparison M(t · i) = M(r · i), to check if t ≡ M r. Thereby, it implicitly checks M(t · i) = H(t · i), as t ≡ H r.
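The two-step conformance check can be sketched as follows, with every collaborator (`observable_cex`, `state_rep`, `fq`, `different`) stubbed out; all names are illustrative assumptions.

```python
# Sketch of checking statistical conformance between collected samples and
# the hypothesis: for each complete test sequence t·i, first check whether t
# is observable in H, then compare observed output frequencies against those
# of the representative trace reaching the same hypothesis state.
def conformance_cex(complete_seqs, observable_cex, state_rep, fq, different):
    for t, i in complete_seqs:           # complete test sequences t·i
        cex = observable_cex(t)          # shortened counterexample if t is
        if cex is not None:              # ...not observable in H, else None
            return cex
        r = state_rep(t)                 # representative reaching delta*(t)
        if different(fq(t + (i,)), fq(r + (i,))):
            return t + (i,)              # observed distributions disagree
    return None

# stubbed runs: identical frequencies pass, diverging frequencies are flagged
same = conformance_cex([(("init",), "go")],
                       observable_cex=lambda t: None,
                       state_rep=lambda t: ("init",),
                       fq=lambda s: {"ok": 3},
                       different=lambda a, b: a != b)
found = conformance_cex([(("init",), "go")],
                        observable_cex=lambda t: None,
                        state_rep=lambda t: ("r",),
                        fq=lambda s: {"ok": 1} if s[0] == "init" else {"ok": 9},
                        different=lambda a, b: a != b)
```

The first run reports no counterexample; the second returns the sequence `("init", "go")` whose frequencies disagree with its representative's.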

Convergence of L * mdp
In the following, we will show that the sampling-based L * mdp learns the correct MDP. Based on the notion of language identification in grammar inference [21], we describe our goal as producing an MDP isomorphic to the canonical MDP modelling the SUL with probability one in the limit. To show identification in the limit, we introduce slight simplifications. First, we disable trimming of the observation table (see Sect. 4.1), i.e. we do not remove rows. Second, we set p rand = 1 for equivalence testing and we do not stop at the first detected difference between SUL and hypothesis, but solely based on a p stop < 1; i.e. all input choices are uniformly random and the length of each test is geometrically distributed with parameter p stop . This is motivated by the common assumption that sampling distributions do not change during learning [21]. Third, we change the function rank in Algorithm 3 to assign ranks based on a lexicographic ordering of traces instead of ranks based on observed frequencies, such that the trace consisting only of the initial SUL output has the largest rank. We actually implemented both types of rank functions and found that the frequency-based function led to better accuracy, but it would require more complex proofs. Fourth, we let the number of samples for learning approach infinity, therefore we do not use a stopping criterion. Finally, we concretely instantiate cq by setting n c = 1, since n c is only relevant for applications in practice.
Proof Structure. We show convergence in two major steps: (1) we show that the hypothesis structure derived from a sampling-based observation table converges to the hypothesis structure derived from the corresponding observation table with exact information. (2) Then, we show that if counterexamples exist, we will eventually find them. Through that, we eventually arrive at a hypothesis with the same structure as the canonical MDP can(M ), where M is the SUL semantics. Given a hypothesis with correct structure, it follows by the law of large numbers that the estimated transition probabilities converge to true probabilities, thus the hypotheses converge to an MDP isomorphic to can(M ).
A key point of the proofs concerns the convergence of the statistical test applied by diff f , which is based on Hoeffding bounds [22]. With regard to that, we apply similar arguments as Carrasco and Oncina [12]. Given convergence of diff f , we also rely on the convergence of the exact learning algorithm L * mdp e discussed in Sect. 3. Another important point is that the shortest traces in each equivalence class of S/≡ M do not form loops in can(M). Hence, there are finitely many such traces. Furthermore, for a given can(M) and some hypothesis MDP, the shortest counterexample has bounded length, therefore it suffices to check finitely many test sequences to check for overall equivalence.
Auxiliary Definitions & Notation. We show convergence in the limit of the number of sampled system traces n. We take n into account through a data-dependent α n for the Hoeffding bounds used by diff f defined in Def. 9. More concretely, let α n = 1/n^r for r > 2 as used by Mao et al. [30], which implies Σ n n · α n < ∞. For the remainder of this section, let ⟨S n , E n , T n ⟩ be the closed and consistent observation table containing the first n samples stored by the teacher in the multiset S n . Furthermore, let H n be the hypothesis hyp(S n , E n , T n ), let the semantics of the SUL be M and let M be the canonical MDP can(M). We say that two MDPs have the same structure if their underlying graphs are isomorphic, i.e. exact transition probabilities may be different.
Theorem 4 (Convergence). Given a data-dependent α n = 1/n^r for r > 2, such that Σ n n · α n < ∞, then with probability one, the hypothesis H n is isomorphic to M, except for finitely many n.
Hence, we learn an MDP that is minimal with respect to the number of states and output-distribution equivalent to the SUL.
Hoeffding-Bound-Based Difference Check. First, we briefly discuss the Hoeffding-bound-based test applied by diff f . Recall that for two test sequences s and s′, we test for each o ∈ Σ O whether the probability p of observing o after s differs from the probability p′ of observing o after s′. Given n 1 and n 2 observations of s and s′, respectively, this is implemented by checking |f(s)(o)/n 1 − f(s′)(o)/n 2 | > ε α (n 1 , n 2 ) with ε α (n 1 , n 2 ) = (√(1/n 1 ) + √(1/n 2 )) · √(½ ln(2/α)). As pointed out by Carrasco and Oncina [12], this test works with a confidence level above (1 − α)² and for large enough n 1 and n 2 it tests for difference and equivalence of p and p′. More concretely, for convergence, n 1 and n 2 must be such that 2ε α (n 1 , n 2 ) is smaller than the smallest absolute difference between any two different p and p′. As our data-dependent α n decreases only polynomially, ε α (n 1 , n 2 ) tends to zero for increasing n 1 and n 2 . Hence, the test implemented by diff f converges to an exact comparison between p and p′.
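A small numeric illustration of this convergence argument, under the assumed radius formula and with r = 3 chosen for concreteness:

```python
from math import log, sqrt

# With the data-dependent alpha_n = 1 / n**r (r > 2), the Hoeffding radius
# eps_alpha still shrinks as sample sizes grow, so the difference check
# approaches an exact comparison of the underlying probabilities.
def eps_alpha(n1, n2, alpha):
    return (sqrt(1 / n1) + sqrt(1 / n2)) * sqrt(0.5 * log(2 / alpha))

radii = []
for n in [100, 10_000, 1_000_000]:
    alpha_n = 1 / n**3                  # r = 3, so sum over n of n*alpha_n < inf
    radii.append(eps_alpha(n, n, alpha_n))
```

The radii decrease monotonically and drop below 0.01 at a million samples, even though the confidence parameter is tightened at every step.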
In the remainder of the paper, we ignore Condition 2.a of diff f , which checks whether the sampled distributions have the same support. By applying a data-dependent α n as defined above, Condition 2.b converges to an exact comparison between output distributions, thus 2.a is a consequence of 2.b in the limit. Therefore, we only consider the Hoeffding-based tests of Condition 2.b.
Access Sequences. The exact learning algorithm L * mdp e presented in Sect. 3 iteratively updates an observation table. Upon termination, it arrives at an observation table ⟨S, E, T⟩ producing a hypothesis H = ⟨Q h , Σ I , Σ O , q 0h , δ h , L h ⟩ = hyp(S, E, T). Let S acc ⊆ S be the set of shortest access sequences leading to the states in Q h , given by S acc = {s ∈ S | ∄s′ ∈ S : s′ ≪ s ∧ s′ ≠ s ∧ δ* h (q 0h , s) = δ* h (q 0h , s′)}, i.e. the shortest traces in each equivalence class of S/≡ M . By this definition, S acc forms a directed spanning tree in the structure of H. There are finitely many different spanning trees for a given hypothesis, therefore there are finitely many different S acc . Hypothesis models learned by L * mdp e are isomorphic to M, thus there are finitely many possible final hypotheses. Let S be the finite union of all access sequence sets S acc forming spanning trees in all valid final hypotheses. Let L = {s · i · o | s ∈ S, i ∈ Σ I , o ∈ Σ O , M(s · i)(o) > 0} be the set of one-step extensions of S with non-zero probability. Observe that for the construction of correct hypotheses in L * mdp e , it is sufficient for eqRow E to approximate M-equivalence (see Def. 4) for traces in L. Consequently, the approximation of eqRow E via compatible E needs to hold only for traces in L.

Hypothesis Construction.
Theorem 5 (Compatibility Convergence). Given α n such that Σ n n · α n < ∞, then with probability one: compatible E (s, s′) ⇔ eqRow E (s, s′) for all traces s, s′ in L, except for finitely many n.
Proof. Let A n be the event that compatible E (s, s′) ⇔ eqRow E (s, s′) does not hold for some traces s and s′, and let p(A n ) be the probability of this event. In the following, we derive a bound for p(A n ) based on the confidence level of the applied tests of Def. 9, which is above (1 − α n )² [12]. An observation table stores |S ∪ Lt(S)| · |E| cells, which gives us an upper bound on the number of tests performed for computing compatible E (s, s′) for two traces s and s′. However, note that cells do not store unique information; multiple cells may correspond to the same test sequence in T S, therefore it is simpler to reason about the number of tests in calls to diff T (c, c′) = diff fq (c, c′) with respect to S n . A single call to diff fq involves either 0 or |Σ O | tests. We apply tests only if we have observed both c and c′ at least once, therefore we perform at most 2 · |Σ O | · n different tests for all pairs of observed test sequences. The event A n may occur if any test produces an incorrect result, i.e. it yields a Boolean result different from the comparison between the true output distributions induced by c and c′. This leads to p(A n ) ≤ 2 · |Σ O | · n · (1 − (1 − α n )²) ≤ 4 · |Σ O | · n · α n . By choosing α n such that Σ n n · α n < ∞, we have Σ n p(A n ) < ∞ and we can apply the Borel-Cantelli lemma like Carrasco and Oncina [12], which states that A n happens only finitely often. Hence, there is an N comp such that for n > N comp , we have compatible E (s, s′) ⇔ eqRow E (s, s′) with respect to S n .

Lemma 8. Under the assumed uniformly randomised equivalence testing strategy, for every s · i · o ∈ L : S n (s · i · o) > 0 after finitely many n.
Proof. Informally, we will eventually sample all traces l ∈ L. The probability p L of sampling l = o 0 · i 1 · o 1 ⋯ o n · i n+1 · o n+1 during a single test, where l[≪ k] denotes the prefix of l of length k, is bounded from below by p L ≥ Π j=1..n+1 (1 − p stop ) · (1/|Σ I |) · M(l[≪ 2j−1] · i j )(o j ) (note that we may also sample l as a prefix of another sequence). Every factor of this product is positive: tests continue with probability 1 − p stop after each step, inputs are chosen uniformly at random, and l is observable. Hence, p L > 0 and, with probability one, S n (l) > 0 after finitely many n.

Lemma 9. If rank imposes a lexicographic ordering on traces, then the set of representatives R computed by Algorithm 3 for the closed and consistent observation table ⟨S n , E n , T n ⟩ is prefix-closed.
Proof. Recall that we assume the function rank to impose a lexicographic ordering on traces. This simplifies showing prefix-closedness of R, which we do by contradiction. Assume that R is not prefix-closed. In that case, there is a trace r of length n in R with a prefix r p of length n − 1 that is not in R. As r p ∉ R, we have r p ≠ rep(r p ) and rank(r p ) < rank(rep(r p )), because the representative rep(r p ) has the largest rank in its class cg(r p ). Since S n is prefix-closed and R ⊆ S n , we have r p ∈ S n . Let i ∈ Σ I and o ∈ Σ O such that r p · i · o = r. Algorithm 3 enforces compatible E (r p , rep(r p )) and due to consistency, we have compatible E (r p · i · o, rep(r p ) · i · o). Since r = r p · i · o is a representative in R, this gives rep(r p ) · i · o ∈ cg(r). Representatives r have the largest rank in their compatibility class cg(r) and r ≠ rep(r p ) · i · o, thus rank(r) > rank(rep(r p ) · i · o).
In combination we have rank(r p ) < rank(rep(r p )) and rank(r p · i · o) > rank(rep(r p ) · i · o) which is a contradiction given the lexicographic ordering on traces imposed by rank. Consequently, R must be prefix-closed under the premises of Lemma 9.
Lemma 10. Let ⟨S n , E n , T̂ n ⟩ be the exact observation table corresponding to the sampling-based observation table ⟨S n , E n , T n ⟩, i.e. with T̂ n (s) = odq(s) for s ∈ (S n ∪ Lt(S n )) · E. Then, T̂ n (r · i)(o) > 0 ⇔ T n (r · i)(o) > 0 for r ∈ R, i ∈ Σ I , o ∈ Σ O after finitely many n.
Proof. First, we will show for prefix-closed R (Lemma 9) that R ⊆ S, if compatible E (s, s ′ ) ⇔ eqRow E (s, s ′ ). S contains all traces corresponding to simple paths of can(M ), therefore we show by contradiction that no r ∈ R forms a cycle in can(M ).
Assume that r forms a cycle in can(M ), i.e. it visits states multiple times. We can split r into three parts r = r p · r c · r s , where r p ∈ T R such that r p and r p · r c reach the same state, and r s ∈ (Σ I × Σ O ) * is the longest suffix such that r s visits every state of can(M ) at most once. As R is prefix-closed, R includes r p and r p · r c as well. The traces r p and r p · r c reach the same state in can(M ), thus we have r p ≡ M r p · r c which implies eqRow E (r p , r p · r c ) and compatible E (r p , r p · r c ). By Algorithm 3 all r ∈ R are pairwise not compatible with respect to compatible E leading to a contradiction, thus no r visits a state of can(M ) more than once and we have R ⊆ S.
Hence, every observable r l = r · i · o for r ∈ R, i ∈ Σ I and o ∈ Σ O is in L, as L includes all observable extensions of S. By Lemma 8, we will sample r l eventually, i.e. T n (r · i)(o) > 0, and therefore T̂ n (r · i)(o) > 0 ⇔ T n (r · i)(o) > 0 after finitely many n.
Lemma 11. The chaos state q chaos is not reachable in H n , except for finitely many n.
Proof. We add a transition from state q = ⟨last(r), row(r)⟩ with input i to q chaos if cq(r · i) = false. As we consider n c = 1, cq(r · i) = true if there is an o such that T n (r · i)(o) > 0. Lemma 10 states that T n (r · i)(o) > 0 for any observable r · i · o after finitely many n. Thus, Lemma 10 implies cq(r · i) = true for all r ∈ R and i ∈ Σ I , therefore the chaos state is unreachable in H n , except for finitely many n.
Combining Theorem 5, Lemma 10 and Lemma 11, it follows that, after finitely many n, hypotheses created in the sampling-based setting have the same structure as in the exact setting.

Corollary 1. Let ⟨S n , E n , T̂ n ⟩ be the exact observation table corresponding to the sampling-based observation table ⟨S n , E n , T n ⟩, i.e. T̂ n (s) = odq(s) for s ∈ (S n ∪ Lt(S n )) · E. Then there exists a finite N struct such that the exact hypothesis hyp(S n , E n , T̂ n ) has the same structure as H n for n > N struct .
Theorem 6 (Convergence of Equivalence Queries). Given α n such that Σ n n · α n < ∞, an observation table ⟨S n , E n , T n ⟩ and a hypothesis H n , then with probability one, H n has the same structure as M or we find a counterexample to equivalence, except for finitely many n.
According to Corollary 1, there is an N struct such that H n has the same structure as in the exact setting and compatible E (s, s′) ⇔ eqRow E (s, s′) for n > N struct . Therefore, we assume n > N struct for the following discussion of counterexample search through the implemented equivalence queries eq. Let H n be the semantics of H n . Recall that we apply two strategies for checking equivalence: 1. Random testing with a uniformly randomised scheduler (p rand = 1): this form of testing can find traces s · o, with s ∈ T S and o ∈ Σ O , such that H(s)(o) = 0 and M(s)(o) > 0. While this form of search is coarse, we store all sampled traces in S n , which is used by our second counterexample search strategy performing a fine-grained analysis.
2. Checking conformance with S n : for all observed test sequences, we statistically check for differences between output distributions in H n and distributions estimated from S n through applying diff fq . Applying that strategy finds counterexample sequences s ∈ T S such that M(s) ≠ ⊥ (as s must have been observed) and, approximately, M(s) ≠ H(s).

Case 1.
If H n and M have the same structure and n > N struct , such that eqRow E (s, s′) ⇔ compatible E (s, s′), we may still find counterexamples that are spurious due to inaccuracies. Therefore, we will show that adding a prefix-closed set of traces to the set of short traces S n does not change the hypothesis structure, as this is performed by Algorithm 4 in response to counterexamples returned by eq.

Lemma 12. If H n has the same structure as M and n > N struct , then adding a prefix-closed set of observable traces S t to S n will neither introduce closedness violations nor inconsistencies, i.e. ⟨S n ∪ S t , E n , T n ⟩ is closed and consistent. Consequently, the hypothesis structure does not change, i.e. H n and hyp(S n ∪ S t , E n , T n ) have the same structure.
Proof. Let t be a trace in S t and q t = δ * h (t) be the hypothesis state reached by t, which exists because H n has the same structure as M. Let t s ∈ S n be a short trace also reaching q t . Since M and H n have the same structure, t and t s also reach the same state of M, therefore t ≡ M t s (by reaching the same state both traces lead to the same future behaviour), implying eqRow E (t, t s ). With n > N struct , we have compatible E (t, t s ). By the same reasoning, we have compatible E (t · i · o, t s · i · o) for any i ∈ Σ I , o ∈ Σ O with M (t · i)(o) > 0; which is the condition for consistency of observation tables, i.e. adding t to S n leaves the observation tables consistent.
Furthermore, because ⟨S n , E n , T n ⟩ is closed, there exists a t′ s ∈ S n with compatible E (t s · i · o, t′ s ). Since compatible E (t · i · o, t s · i · o) and because compatible E is transitive for n > N struct , we have compatible E (t · i · o, t′ s ). Hence, adding t to S n does not violate closedness, because for each observable extension of t, there exists a compatible short trace t′ s .

Case 2.
If the hypothesis H n does not have the same structure as M and n > N struct , then H n has fewer states than M (following Lemma 6 given that H is consistent with T n and compatible E (s, s ′ ) ⇔ eqRow E (s, s ′ )). Since M is minimal with respect to the number of states, H n and M are not equivalent, thus a counterexample to observation equivalence exists and we are guaranteed to find any such counterexample after finitely many samples.

Lemma 13.
If compatible E (s, s ′ ) ⇔ eqRow E (s, s ′ ) for traces s and s ′ in S n , then the hypothesis H n derived from S n , E n , T n is the smallest MDP consistent with T n .
Proof. Recall that for a given observation table ⟨S, E, T⟩, the exact learning algorithm L * mdp e derives the smallest hypothesis consistent with T. By Corollary 1, H n is the smallest MDP consistent with T̂ n . As diff Tn does not produce spurious results for n > N struct (Theorem 5), H n is also the smallest MDP consistent with T n with respect to diff Tn .

Lemma 14. Let n q be the number of states of M. There is a finite set C of test sequences with lengths bounded by n q ² + 1 such that testing all sequences in C suffices to check equivalence between M and any MDP with at most n q states with certainty.
Proof. Let M and M′ with states Q and Q′ be as defined above, i.e. |Q| = n q and |Q′| ≤ n q , and let reachQSeq(t) ∈ (Q × Q′)* be the sequence of state pairs visited along a trace t by M and M′, respectively. Let C be the set of all test sequences of length at most n q ² + 1. M ≡ od M′ iff for all t ∈ T R and i ∈ Σ I , we have M(t · i) = M′(t · i). If the length of t · i is at most n q ² + 1, then t · i ∈ C. Otherwise, reachQSeq(t) contains duplicated state pairs, because |Q × Q′| ≤ n q ². For t longer than n q ², we can remove loops on Q × Q′ from t to determine a trace t′ of length at most n q ² such that reachQSeq(t)[|t|] = reachQSeq(t′)[|t′|], i.e. such that t and t′ reach the same state pair. Since t′ reaches the same states as t in both M and M′, we have M(t · i) = M(t′ · i) and M′(t · i) = M′(t′ · i). Consequently, for all t · i ∈ T R · Σ I : either t · i ∈ C, or there is a t′ · i ∈ C leading to the same check between M and M′.
We further restrict C to C obs by considering only observable test sequences in C. This restriction is justified by Remark 1.

Lemma 15. Every c ∈ C obs is observed with probability greater than zero during equivalence testing.

Proof. Due to p rand = 1 and p stop < 1, we apply uniformly randomised inputs during testing and each test has a length that is geometrically distributed dependent on p stop . Let c = o 0 i 1 o 1 · · · o n−1 · i n be a sequence in C obs with c[≪ k] being its prefix of length k. Each input i j is applied with probability (1 − p stop ) · 1/|Σ I | and each output o j is observed with probability M (c[≪ j])(o j ), so the probability p c of observing c is the product of these factors (note that we may observe c as a prefix of another sequence). By definition of C obs , we have M (c[≪ j])(o j ) > 0 for all indexes j and c in C obs , therefore p c > 0.
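As a hedged illustration of this argument, the following Python sketch computes the probability of observing a given test sequence on a small hypothetical two-state deterministic MDP (the MDP, its labels and the helper name are invented for this example): each input contributes a factor (1 − p stop )/|Σ I | and each output contributes its probability under the MDP, so any observable sequence has positive probability.

```python
# Hypothetical deterministic MDP: state -> input -> [(prob, output, successor)].
MDP = {
    "s0": {"a": [(0.8, "ok", "s0"), (0.2, "err", "s1")]},
    "s1": {"a": [(1.0, "err", "s1")]},
}
INPUTS = ["a"]

def observation_probability(mdp, seq, p_stop, init="s0"):
    """Probability of observing seq = [i1, o1, i2, o2, ...] when inputs are
    chosen uniformly at random and testing stops with probability p_stop
    before each input."""
    p, state = 1.0, init
    for inp, out in zip(seq[0::2], seq[1::2]):
        p *= (1 - p_stop) / len(INPUTS)   # input applied, test not stopped
        matches = [(pr, s) for pr, o, s in mdp[state][inp] if o == out]
        if not matches:
            return 0.0                    # sequence is not observable here
        p *= matches[0][0]                # deterministic MDP: unique successor
        state = matches[0][1]
    return p
```

For p_stop = 0.25, the sequence ["a", "ok", "a", "err"] has probability (0.75 · 0.8) · (0.75 · 0.2) = 0.09 > 0, while any sequence with an impossible output has probability 0.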
In every round of L * mdp , we check for conformance between S n and the hypothesis H n and return a counterexample if we detect a difference via diff fq . Since we apply diff fq , we follow a similar reasoning as for the convergence of hypothesis construction. Here, we approximate the check M (c) = H(c) for c ∈ T S by diff fq (t · i, r · i), where c = t · i for a trace t and input i, and where r ∈ R is the representative short trace reaching the hypothesis state ⟨last (r), row (r)⟩ that is also reached by t.
Lemma 16. Given α n such that Σ n n · α n < ∞, then with probability one M (c) ≠ H(c) ⇔ diff fq (t · i, r · i) for c = t · i ∈ C obs and r as defined above, except for finitely many n.
Proof. We use the identity H(t · i) = H(r · i) for traces t and r and inputs i, which holds because t and r reach the same state in the hypothesis H. Applying that, we test for M (t · i) = H(t · i) by testing M (t · i) = H(r · i) via diff fq (t · i, r · i). We perform |Σ O | tests for each unique observed sequence c, therefore we apply at most n · |Σ O | tests. Let B n be the event that any of these tests is wrong, that is, M (t · i) = H(r · i) ⇔ diff fq (t · i, r · i) for at least one observed c = t · i. Due to the confidence level greater than (1 − α n )² of the tests, the probability p(B n ) of B n is bounded by p(B n ) ≤ n · |Σ O | · (1 − (1 − α n )²) ≤ 2 · n · |Σ O | · α n . By choosing α n such that Σ n n · α n < ∞, we get Σ n p(B n ) < ∞ and can apply the Borel–Cantelli lemma as above. Hence, B n only happens finitely often, thus there is an N 1 such that for all n > N 1 we have M (t · i) ≠ H(r · i) ⇔ diff fq (t · i, r · i) for all observed c = t · i. Furthermore, the probability of observing any c of the finite set C obs during testing is greater than zero (Lemma 15), thus there is a finite N 2 such that S n contains all c ∈ C obs for n > N 2 . Consequently, there is an N cex such that Lemma 16 holds for all n > N cex . Lemma 13 states that hypotheses H n are minimal after finitely many n and thus all potential counterexamples are in C obs (Lemma 14). From Lemma 16, it follows that we will identify a counterexample in C obs if one exists. Combining that with Lemma 12 concludes the proof of Theorem 6.
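The summability condition can be checked concretely. Choosing, for example, α n = n⁻³ (an illustrative choice, not one prescribed by the algorithm) makes Σ n n · α n = Σ n n⁻² a convergent series, as the Borel–Cantelli argument requires; a minimal Python check of the bounded partial sums:

```python
import math

# With alpha_n = n**-3, the union bound gives p(B_n) <= 2*n*|Sigma_O|*alpha_n,
# and sum_n n*alpha_n = sum_n n**-2 converges (to pi**2/6), so by
# Borel-Cantelli the error event B_n occurs only finitely often.
def partial_sum(k, alpha=lambda n: n ** -3):
    return sum(n * alpha(n) for n in range(1, k + 1))
```

Already partial_sum(10**5) lies within 10⁻⁴ of the limit π²/6 ≈ 1.6449, illustrating that the partial sums stay bounded.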
Putting Everything Together. We have established that after finitely many rounds n, the sampling-based hypothesis H n has the same structure as in the exact setting (Corollary 1). Therefore, certain properties of the exact learning algorithm L * mdp e hold for the sampling-based L * mdp as well. In particular, the derived hypotheses are minimal, i.e. they have at most as many states as M. As with L * mdp e , adding a non-spurious counterexample to the trace set S n introduces at least one new state in the derived hypotheses. Furthermore, we have shown that equivalence queries return non-spurious counterexamples, except for finitely many n (Theorem 6). Consequently, after finitely many rounds we arrive at a hypothesis H n with the same structure as M. We derive transition probabilities by computing empirical means, thus by the law of large numbers these estimated probabilities converge to the true probabilities. Hence, we learn a hypothesis H n isomorphic to the canonical MDP M in the limit, as stated by Theorem 4.
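The final step, estimating transition probabilities by empirical means, can be sketched in a few lines of Python (the example distribution, the seed and the helper name are invented for illustration):

```python
import random
from collections import Counter

def estimate_distribution(observations):
    """Empirical output distribution computed from observed frequencies."""
    counts = Counter(observations)
    total = len(observations)
    return {out: c / total for out, c in counts.items()}

# By the law of large numbers, the estimate converges to the true
# distribution as the number of samples grows.
rng = random.Random(7)
true_dist = {"straight": 0.9, "deviate": 0.1}
samples = rng.choices(list(true_dist), weights=true_dist.values(), k=100_000)
estimate = estimate_distribution(samples)
```

With 100 000 samples, the estimated probability of "straight" already lies within about 0.01 of the true value 0.9.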
More efficient parameters. So far, we discussed a particular parametrisation of L * mdp . Among others, we used uniformly random input choices for equivalence testing with p rand = 1, and instantiated cq to accept samples as complete after only n c = 1 observation. This simplified the proof, but is inefficient in practical experiments. However, the arguments based on n c = 1, such as Lemma 10 and Lemma 11, are easily extended to small constant values of n c : Since the samples are collected independently, any observation that occurs at least once after a finite number of steps also occurs at least n c times after a finite number of steps.

Experiments
In deterministic active automata learning, the goal is generally to learn a model equivalent to the true model of the SUL. This changes in the stochastic setting, where we want to learn a model close to the true model, as equivalence can hardly be achieved. Note that we perform experiments with known models, which we treat as black boxes during learning. As a reference, we also learn models and perform the same measurements with IoAlergia. Our experiments aim to measure the similarity between the learned models and the true model: 1. We compute the discounted bisimilarity distance between the true models and the learned MDPs [7,8]. We adapted the distance measure from MDPs with rewards to labelled MDPs by defining a distance of 1 between states with different labels. 2. Additionally, we perform probabilistic model-checking. We compute and compare maximal probabilities of manually defined temporal properties with all models. The computation is done via Prism [28].
Experimental results and the implementation can be found in the evaluation material [38].
Measurement Setup. As in [30], we configure IoAlergia with a data-dependent significance parameter for the compatibility check, by setting ǫ N = 10000/N , where N is the total combined length of all traces used for learning. This parameter serves a role analogous to the α parameter for the Hoeffding bounds used by L * mdp . In contrast to IoAlergia, we observed that L * mdp shows better performance with a non-data-dependent α, therefore we set α = 0.05 for all experiments. Motivated by convergence guarantees given in [30], we collect traces for IoAlergia by sampling with a scheduler that selects inputs according to a uniform distribution. The length of these traces is geometrically distributed with a parameter p l and the number of traces is chosen such that IoAlergia and L * mdp learn from approximately the same amount of data.
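To illustrate the role of α, here is a sketch of a Hoeffding-bound-based check for a difference between two sampled frequencies. It mirrors the Alergia-style test; the exact threshold used by L * mdp may differ, so treat the formula as an assumption for illustration.

```python
import math

def hoeffding_different(f1, n1, f2, n2, alpha):
    """Return True if frequencies f1/n1 and f2/n2 are considered different.

    Alergia-style Hoeffding check: the empirical probabilities must differ
    by more than a confidence-dependent threshold eps.
    """
    if n1 == 0 or n2 == 0:
        return False  # too little data to claim a difference
    eps = math.sqrt(0.5 * math.log(2.0 / alpha)) * \
          (1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > eps
```

With alpha = 0.05, observing an output 900 times out of 1000 in one state and 100 times out of 1000 in another is flagged as different, while 52/100 versus 48/100 is not; smaller α makes the check more conservative.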
We implemented L * mdp and IoAlergia in Java. In addition to our Java implementations, we use Prism 4.4 [28] for probabilistic model-checking and an adaptation of the MDPDist library, available at [6], for computing bisimilarity distances. We performed the experiments on a Lenovo Thinkpad T450 with 16 GB RAM and an Intel Core i7-5600U CPU at 2.6 GHz, running Xubuntu Linux 18.04.

First Gridworld

Fig. 2. The first gridworld
Models similar to our gridworlds have, e.g., been considered in the context of learning control strategies [20]. Basically, a robot moves around in a world of tiles of different terrains. It may make errors in movement, e.g. move south-west instead of south, with an error probability depending on the target terrain. Our aim is to learn an environment model, i.e. a map. Figure 2 shows the first gridworld used for evaluation. Black tiles are walls and other terrains are represented by different shades of grey and letters (Sand, Mud, Grass & Concrete). A circle marks the initial location and a double circle marks a goal location. Four inputs enable movement in four directions. Observable outputs include the different terrains, walls, and a label indicating the goal. The true model of this gridworld has 35 different states. We set the sampling parameters to n resample = n retest = 300, n test = 50, p stop = 0.25 and p rand = 0.25. The stopping parameters were set to t unamb = 0.99, r min = 500 and r max = 4000. Finally, the parameter p l for IoAlergia's geometric trace length distribution was set to 0.125.
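The movement-error model can be made concrete with a hypothetical Python sketch. The error probabilities of 0.4 for mud and 0 for concrete follow values mentioned later in the discussion; the remaining values and the two-way deviation are invented for illustration.

```python
# Hypothetical per-terrain movement error probabilities; the values for
# Mud (0.4) and Concrete (0.0) follow the discussion section, the rest
# are invented for illustration.
ERROR_PROB = {"Concrete": 0.0, "Grass": 0.2, "Sand": 0.25, "Mud": 0.4}

def move_distribution(target_terrain, intended, deviations):
    """Distribution over actual movement directions for one intended move."""
    err = ERROR_PROB[target_terrain]
    dist = {intended: 1.0 - err}
    for d in deviations:  # error mass split among neighbouring directions
        dist[d] = dist.get(d, 0.0) + err / len(deviations)
    return dist
```

For instance, trying to move south onto mud yields probability 0.6 for south and 0.2 each for the two deviating directions.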
Results. Table 2 shows the measurement results for learning the first gridworld. Our active learning stopped after 1147 rounds, sampling 391 530 traces (Row 2) with a combined number of outputs of 3 101 959 (Row 1). The bisimilarity distance discounted with λ = 0.9 to the true model is 0.144 for L * mdp and 0.524 for IoAlergia (Row 5); thus it can be assumed that model checking the L * mdp model produces more accurate results. This is indeed true for our three evaluation queries in the last three rows. These model-checking queries ask for the maximum probability (quantified over all schedulers) of reaching the goal within a varying number of steps. The first query does not restrict the terrain visited before the goal, but the second and third require avoiding G and S, respectively. The absolute difference to the true values is at most 0.015 for L * mdp , but the results for IoAlergia differ greatly from the true values. One reason is that the IoAlergia model with 21 states is significantly smaller than the minimal true model, while the L * mdp model has as many states as the true model. IoAlergia is faster than L * mdp , which applies time-consuming computations during equivalence queries. However, the runtime of learning-specific computations is often negligible in practical applications, such as learning of protocol models [39,35], as the communication with the SUL usually dominates the overall runtime. Given the smaller bisimilarity distance and the lower difference to the true probabilities computed with Prism, we conclude that the L * mdp model is more accurate.

Second Gridworld
Fig. 3 shows the second gridworld used in our evaluation. As before, the robot starts in the initial location in the top left corner and can only observe the different terrains. The goal location is in the bottom right corner in this example. The true MDP representing this gridworld has 72 states. We configured learning as for the first gridworld, but collected more samples per round by setting n retest = n resample = 1000. Table 3 shows the measurement results for learning. We sampled 515 950 traces with a combined number of outputs of 3 663 415, i.e. the combined length of all traces is in a similar range as before, although we sampled more traces in a single round. This is the case because learning already stopped after 500 rounds. We used similar model-checking queries as in the previous example and we can again see that the difference between the true model and the L * mdp model is much smaller than for IoAlergia. However, compared to the previous example, the absolute difference between L * mdp and the true model with respect to model-checking has slightly increased.

Shared Coin Consensus
This example is a randomised consensus protocol by Aspnes and Herlihy [5]. In particular, we used a model of the protocol distributed with the Prism model checker [28] as a basis for our experiments. We generally performed only minor adaptations, such as adding action labels for inputs, but in doing so we also slightly changed the functionality. For the purpose of this evaluation these changes are immaterial, though.
We consider only the configuration with the smallest state space of size 272 with two processes and constant K set to 2. Basically, the SUL has two inputs go 1 and go 2 , one for each process, where executing input go i causes process p i to perform exactly one step. The outputs of the SUL comprise the counter state, the processes' coin states, as well as additional propositions, e.g., denoting that the protocol finished. Note that we need to make the coin states visible, to be able to model the SUL with deterministic MDPs. In this experiment, we basically learn the state machine underlying the protocol, which we cannot observe directly.
We set the learning parameters to n resample = n retest = n test = 50, p stop = 0.25 and p rand = 0.25. We controlled stopping with t unamb = 0.99, r min = 500 and r max = 4000. Finally, we set p l = 0.125 for IoAlergia. Table 4 shows the measurement results for learning a model of the shared coin consensus protocol. Compared to the previous example, we need a significantly lower sample size of 98 064 traces containing 537 665 outputs, although the models are much larger. A reason for this is that there is a relatively large number of outputs in this example, such that states are easier to distinguish from each other. The bisimilarity distance is in a similar range as before for L * mdp , which is again significantly smaller than IoAlergia's bisimilarity distance. The L * mdp model is again larger than the IoAlergia model, but in this example it is smaller than the true model. This happens because many states are never reached during learning, as reaching them within a bounded number of steps has a very low probability – see, e.g., the fifth model-checking query determining the maximum probability of finishing the protocol within less than 40 steps, but without consensus, as p 1 chooses heads and p 2 chooses tails. Here, we also see that the model-checking results computed with the IoAlergia model are more accurate in some cases, but L * mdp produces more accurate results overall. The absolute difference from the true values averaged over all model-checking results is about 0.066 for L * mdp , approximately half of IoAlergia's average absolute difference of 0.138. We see an increase in runtime compared to the gridworld examples, which is caused by the larger state space, since the precomputation time for equivalence testing grows with the state space.

Slot machine
The slot machine originally served as an example in [29,30], as an adaptation from another model, and we subsequently used it in [3] as well. It has three reels, each controlled by a separate input. Initially, the reels are blank, but after a reel is spun, it shows either apple or bar. A play generally spans m rounds (spins), after which a prize is awarded: Pr10 if all reels show bar, Pr2 if two reels show bar, and Pr0 otherwise. The probability of bar decreases as the number of remaining rounds decreases. Finally, there is also a fourth input stop, which with equal probability either stops the game or grants two extra rounds, but the number of remaining rounds cannot exceed m.
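The slot-machine behaviour described above can be sketched as follows. This is an illustrative approximation only: the function name and the concrete bar probability (here proportional to the remaining rounds) are assumptions, since the text only states that this probability decreases as rounds run out.

```python
import random

def play_slot_machine(rng, inputs, m=3):
    """Simulate one play: 'spin1'..'spin3' spin a reel, 'stop' either ends
    the game or grants two extra rounds (capped at m); the prize depends
    on how many reels show bar."""
    reels = ["blank", "blank", "blank"]
    rounds_left = m
    for inp in inputs:
        if rounds_left == 0:
            break
        if inp == "stop":
            if rng.random() < 0.5:
                break                                  # game stops
            rounds_left = min(m, rounds_left + 2)      # two extra rounds
        else:
            reel = int(inp[-1]) - 1
            p_bar = 0.2 * rounds_left   # assumed: bar gets rarer over time
            reels[reel] = "bar" if rng.random() < p_bar else "apple"
            rounds_left -= 1
    bars = reels.count("bar")
    return "Pr10" if bars == 3 else ("Pr2" if bars == 2 else "Pr0")
```

Note that spinning only two reels can never yield Pr10, since a blank reel does not count as bar.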
For our experiments, we configured the slot machine with m = 3. In this configuration, the true minimal model has 109 states. We configured sampling for IoAlergia with p l = 0.125 and we set the following parameters for L * mdp : n resample = n retest = n test = 300, p stop = 0.25, p rand = 0.25, r min = 500 and r max = 20000. To demonstrate the influence of the parameter t unamb , we performed experiments with t unamb = 0.9 and t unamb = 0.99. Table 5 and Table 6 show the results for t unamb = 0.9 and t unamb = 0.99, respectively. Configured with t unamb = 0.9, L * mdp stopped after 2988 rounds; configured with t unamb = 0.99, it stopped after 12 879 rounds. We see here that learning an accurate model of the slot machine requires a large amount of samples; in the case of t unamb = 0.99, we sampled 7 542 332 traces containing 24 290 643 outputs. These are almost 10 times as many outputs as for the gridworld examples. However, we also see that sampling more traces clearly pays off: the L * mdp results shown in Table 6 are much better than those shown in Table 5. Notably, the size of the learned state space stayed the same in both settings. Thus, the model learned with fewer traces presumably includes some incorrect transitions. This is exactly what our stopping heuristic aims to avoid: ambiguous membership of traces in compatibility classes, which creates uncertainty when creating transitions.
We also see in both settings that L * mdp models are more accurate than IoAlergia models, with respect to bisimilarity distance and with respect to model-checking results. While the experiment with t unamb = 0.99 required the most samples among all experiments, it also led to the lowest bisimilarity distance. It is also noteworthy that the model-checking results for the L * mdp model are within a low range of approximately 0.01 of the true results. A drawback of L * mdp compared to IoAlergia is again the learning runtime, as L * mdp required about 5 hours while learning with IoAlergia took only about 8.7 minutes. However, in a non-simulated environment, the sampling time would be much larger than 5 hours, such that the learning runtime becomes negligible. Consider for instance a scenario where sampling a single trace takes 20 milliseconds. The sampling time of L * mdp would be about 42 hours in that scenario, i.e. about 8.4 times the learning runtime.

Discussion & Threats to Validity
Our case studies demonstrated that L * mdp is able to achieve better accuracy than IoAlergia. The bisimilarity distances of L * mdp models to the true models were generally lower and the model-checking results were more accurate. These observations will be investigated in further case studies. It should be noted, though, that the considered systems have different characteristics. The gridworld has a small state space, but is strongly connected, and the different terrains lead to different probabilistic decisions, e.g. if we try to enter mud there is a probability of 0.4 of entering one of the neighbouring tiles, whereas entering concrete is generally successful (the probability of entering other tiles instead is 0). The consensus protocol has a large state space with many different outputs and finishing the protocol takes at least 14 steps. The slot machine requires states to be distinguished based on subtle differences in probabilities, as the probability of seeing bar decreases in each round. L * mdp has several parameters that affect performance and accuracy. We plan to investigate the influence of these parameters in further experiments. For the present experiments, we fixed most of the parameters except for n retest , n test and n resample , and we observed that results are robust with respect to these parameters. We, e.g., increased n resample from 300 for the first gridworld to 1000 for the second gridworld. Both settings led to approximately the same results, as learning simply performed fewer rounds with n resample = 1000.
Hence, further experiments will examine if the fixed parameters are indeed appropriately chosen and if guidelines for choosing other parameters can be provided.
L * mdp and IoAlergia learn from different traces, thus the trace selection may actually be the main reason for the better accuracy of L * mdp . We examined whether this is the case by learning IoAlergia models from two types of traces: traces with uniform input selection and traces sampled during learning with L * mdp . We noticed that models learned from L * mdp traces altogether led to less accurate results, especially in terms of bisimilarity distance, and therefore we reported only results for models learned from traces with uniformly distributed inputs.

Related Work
In the following, we discuss techniques for learning both model structure and transition probabilities in case of probabilistic systems. There are many learning approaches for models with a given structure, e.g., for learning control strategies [20]. Covering these approaches is beyond the scope of this paper.
We build upon Angluin's L * [4], thus our work shares similarities with other L * -based work like active learning of Mealy machines [36]. Interpreting MDPs as functions from test sequences to output distributions is similar to the interpretation of Mealy machines as functions from input sequences to outputs [37].
Volpato and Tretmans presented an L * -based technique for non-deterministic input-output transition systems [43]. They simultaneously learn an over- and an under-approximation of the SUL with respect to the input-output conformance (ioco) relation [40]. Inspired by that, L * mdp uses completeness queries and we add transitions to a chaos state in case we have low information. Beyond that, we consider systems to behave stochastically rather than non-deterministically. While [43] leaves the concrete implementation of queries unspecified, L * mdp 's implementation closely follows Sect. 4. Early work on ioco-based learning for non-deterministic systems has been presented by Willemse [44]. Khalili and Tacchella [26] addressed non-determinism by presenting an L * -based algorithm for non-deterministic Mealy machines. Like Volpato and Tretmans [43], they assume to be able to observe all possible outputs in response to input sequences applied during learning. Our implementation does not require this assumption by checking for compatibility, i.e. approximate equivalence, between output distributions. Both these approaches assume a testing context, as we do.
Most sampling-based learning algorithms for stochastic systems are passive, i.e. they assume preexisting samples of system traces. Their roots can be found in grammar inference techniques like Alergia [11] and rlips [12], which identify stochastic regular languages. We share with these techniques that we also apply Hoeffding bounds [22] for testing for differences between probability distributions. Alergia has been extended to MDPs by Mao et al. [29,30]. The extension is called IoAlergia and basically creates a tree-based representation of the sampled system traces and repeatedly merges compatible nodes to create an automaton. Finally, transition probabilities are estimated from observed output frequencies. Like L * mdp , IoAlergia converges in the limit, but showed worse accuracy in Sect. 5. It was adapted to an active setting by Chen and Nielsen [15]. They proposed to generate new samples to reduce uncertainty in the data. In contrast to this, we base our sampling not only on the data collected so far (refine queries), but also on the current observation table and the derived hypothesis MDPs (refine & equivalence queries), i.e. we take information about the SUL's structure into account. In previous work, we presented a different approach to apply IoAlergia in an active setting which takes reachability objectives into account with the aim of maximising the probability of reaching desired events [3]. L * -based learning for probabilistic systems has also been presented by Feng et al. [17]. They learn assumptions in the form of probabilistic finite automata for compositional verification of probabilistic systems. Their learning algorithm requires queries returning exact probabilities, hence it is not directly applicable in a sampling-based setting. The learning algorithm shares similarities with an L * -based algorithm for learning multiplicity automata [10], a generalisation of deterministic automata.
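IoAlergia's starting point, a frequency-annotated prefix tree of the sampled traces, can be sketched in a few lines of Python (the nested-dictionary representation and helper name are our own simplification; the compatibility-based merging step is omitted):

```python
def build_frequency_tree(traces):
    """Build a frequency prefix tree from traces (sequences of symbols,
    e.g. alternating inputs and outputs). Node counts give the observed
    frequencies from which transition probabilities are later estimated."""
    root = {"count": len(traces), "children": {}}
    for trace in traces:
        node = root
        for symbol in trace:
            child = node["children"].setdefault(
                symbol, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return root
```

For example, the three traces ("go", "ok"), ("go", "ok") and ("go", "err") yield a tree where "go" was observed three times with "ok" twice below it, so the output ok would be estimated with probability 2/3 at that node.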
Further query-based learning in a probabilistic setting has been described by Tzeng [41]. He presented a query-based algorithm for learning probabilistic automata and described an adaptation of Angluin's L * for learning Markov chains. In contrast to our exact learning algorithm L * mdp e , which relies on output distribution queries, Tzeng's algorithm for Markov chains queries the generating probabilities of strings. Castro and Gavaldà review passive learning techniques for probabilistic automata with a focus on convergence guarantees and present them in a query framework [14]. Unlike MDPs, the learned automata cannot be controlled by inputs.

Conclusion
We presented L * -based learning of MDPs. For our exact learning algorithm L * mdp e , we assumed an ideal setting that allows querying information about the SUL with exact precision. Subsequently, we relaxed our assumptions by approximating exact queries through sampling SUL traces via directed testing. These traces serve to infer the structure of hypothesis MDPs, to estimate transition probabilities and to check for equivalence between the SUL and learned hypotheses. The resulting sampling-based L * mdp iteratively learns approximate MDPs which converge to the correct MDP in the large sample limit. We implemented L * mdp and compared it to IoAlergia [30], a state-of-the-art passive learning algorithm for MDPs. The evaluation showed that L * mdp is able to produce more accurate models. To the best of our knowledge, L * mdp is the first L * -based algorithm for MDPs that can be implemented via testing. Experimental results and the implementation can be found in the evaluation material [38].
The evaluation showed promising results, therefore we believe that our technique can greatly aid the black-box analysis of reactive systems such as communication protocols. While deterministic active automata learning has successfully been applied in this area [18,39], networked environments are prone to be affected by uncertain behaviour that can be captured by MDPs. L * mdp converges in the limit, therefore a potential direction for future work is an analysis with respect to probably approximately correct (PAC) learnability [42,14]. A challenge towards this goal will be the identification of a distance measure suited to verification [30]. Furthermore, L * mdp provides room for experimentation, e.g. different testing techniques could be applied in equivalence queries.