Nonlinear Dynamics for the Ising Model

We introduce and analyze a natural class of nonlinear dynamics for spin systems such as the Ising model. This class of dynamics is based on the framework of mass action kinetics, which models the evolution of systems of entities under pairwise interactions, and captures a number of important nonlinear models from various fields, including chemical reaction networks, Boltzmann’s model of an ideal gas, recombination in population genetics, and genetic algorithms. In the context of spin systems, it is a natural generalization of linear dynamics based on Markov chains, such as Glauber dynamics and block dynamics, which are by now well understood. However, the inherent nonlinearity makes the dynamics much harder to analyze, and rigorous quantitative results so far are limited to processes which converge to essentially trivial stationary distributions that are product measures. In this paper we provide the first quantitative convergence analysis for natural nonlinear dynamics in a combinatorial setting where the stationary distribution contains non-trivial correlations, namely spin systems at high temperatures. We prove that nonlinear versions of both the Glauber dynamics and the block dynamics converge to the Gibbs distribution of the Ising model (with given external fields) in times O(nlogn) and O(logn) respectively, where n is the size of the underlying graph (number of spins). Given the lack of general analytical methods for such nonlinear systems, our analysis is unconventional, and combines tools such as information percolation (due in the linear setting to Lubetzky and Sly), a novel coupling of the Ising model with Erdős-Rényi random graphs, and non-traditional branching processes augmented by a ”fragmentation” process. Our results extend immediately to any spin system with a finite number of spins and bounded interactions.


INTRODUCTION
Mass action kinetics is a general framework for studying systems of interacting entities.The framework emerged in the study of chemical reaction networks, dating back at least to the seminal work of Horn and Jackson in the 1970s [29], and has seen a resurgence of activity in recent years; see the monograph [23].However, it also captures a wide range of processes that are of interest in other fields, including Boltzmann's model of an ideal gas [7], classical models of population genetics [27,54], genetic algorithms in combinatorial optimization [25,38], and random sampling [44,47].
We describe mass action kinetics in the special case where all interactions are pairwise and homogeneous; this captures most of the complexity of general systems while keeping notation and technicalities to a minimum.Let Ω denote a finite set of types.A (quadratic) mass action system is described by a directed graph whose vertices are ordered pairs of types (σ, σ ′ ), and a directed edge from (σ, σ ′ ) to (τ, τ ′ ) indicates the presence of a reaction in which types σ, σ ′ combine to produce types τ, τ ′ .Reactions involving a specific pair (σ, σ ′ ) are governed by a collision kernel Q(σ, σ ′ ; • , • ), where Q(σ, σ ′ ; τ, τ ′ ) is the probability that the outcome of the reaction is the pair (τ, τ ′ ).We assume throughout the symmetry property The state of the system at any time t is fully described by the vector p t , where p t (σ) is the mass of type σ at time t, normalized so that σ∈Ω p t (σ) = 1 (i.e., the p t (σ) can be viewed as concentrations, or probabilities).The initial state is denoted p 0 .According to the so-called "mass action" principle, each reaction (σ, σ ′ ) → (τ, τ ′ ) takes place at a rate determined by the product of the current masses of types σ, σ ′ .The dynamics of the system is described by the following set of equations 1 , one for each type τ ∈ Ω: (1.1) At this level of generality such systems can be arbitrarily badly behaved (e.g., chaotic), so it is necessary to impose standard regularity conditions.A mass action system is said to be reversible 2 or detailed balanced if there exists a strictly positive mass vector µ = (µ(σ)) > 0 such that µ(σ)µ(σ ′ )Q(σ, σ ′ ; τ, τ ′ ) = µ(τ )µ(τ ′ )Q(τ, τ ′ ; σ, σ ′ ) , ∀ σ, σ ′ , τ, τ ′ . (1. 2) It is easy to check that any such µ is necessarily an equilibrium or stationary point for the dynamics (1.1).A mass action system may have many positive equilibrium points, but it is known (see, e.g., [23], and also Proposition 2.10 in this paper) that if any one of them satisfies the detailed balance condition then they all do.We stress that we do not require the kernel Q to be irreducible (i.e., the directed graph describing it need not be strongly connected).The mass action system defined in (1.1) above can be viewed as a natural nonlinear analog of a reversible Markov chain, whose dynamics takes the form p t+1 (τ ) = σ p t (σ)Q(σ ; τ ), where now Q(σ ; τ ) is the transition matrix of the chain and the reversibility condition is µ(σ)Q(σ ; τ ) = µ(τ )Q(τ ; σ) for all σ, τ .In the linear setting, there are well known criteria for convergence to stationarity and there is by now a vast literature on mixing times of reversible Markov chains and their algorithmic applications to sampling, approximate counting and integration, statistical physics, etc.By contrast, in the nonlinear setting even the most basic questions are still open: for example, the Global Attractor Conjecture [23] asserts that any detailed balanced 3 mass action system converges to a stationary point when started from any initial point p 0 with full support.(In fact many stationary points may exist, but only one is consistent with any given initial condition p 0 .)And even in particular cases of interest where convergence has been proved, almost nothing is known about the rate of convergence (the analog of the mixing time for Markov chains).
In the discrete combinatorial setting, one of very few examples for which useful bounds on the convergence rate are known is the classical Hardy-Weinberg model of genetic recombination [27,54].Here the types are bit strings Ω = {0, 1} n (each bit representing an allele on a chromosome), and a reaction between two strings σ, σ ′ involves picking a "crossover" subset Λ ⊆ {1, . . ., n} of positions according to some probabilistic rule and exchanging the bits in Λ between σ and σ ′ to obtain two new strings τ, τ ′ .(For example, one classical rule is to pick i ∈ {0, . . ., n} u.a.r. and let Λ consist of the first i bits.)It is well known that this dynamics converges to the distribution µ in which all bits are independent, with the marginal probabilities of a 1 at each position given by those in the initial distribution p 0 .In [43,11], the rate of convergence was related precisely to the rate at which the strings are fragmented by the repeated random cuts Λ, thus enabling very precise estimates of the convergence time for any choice of crossover rule.
The above analysis relies crucially on the fact that in the equilibrium distribution all bits are independent.When there is even a small amount of correlation, there appear to be no techniques available to obtain useful bounds on convergence rates.In this paper, we address this question for arguably the most natural example in which correlations arise, namely the Ising model of statistical physics.Here the types are spin configurations σ ∈ Ω = {±1} V which assign one of two possible spin values ±1 to each vertex of a graph G = (V, E).The Gibbs distribution is given by where h = {h x } x∈V is a vector of real numbers whose entry h x represents the external field at vertex x, and J = {J xy } x,y∈V is a symmetric real matrix whose entry J xy represents the interaction between spins at adjacent vertices x, y. (When there is no edge between x and y, J xy = 0.) The normalizing factor Z J,h is the partition function.Note that we allow the interactions J xy to be either positive (ferromagnetic) or negative (antiferromagnetic), and the fields h x to be either positive (favoring +1 spins) or negative (favoring −1 spins).Setting J = βA, where β > 0 and A is the adjacency matrix of G, corresponds to the standard ferromagnetic Ising model on G at inverse temperature β.We emphasize that, while for simplicity we develop our results for the specific case of the Ising model, they hold equally for any spin system with a constant number of different spins and bounded pairwise interactions (such as the q-state Potts model); see Section 5 for more detail.
The classical Glauber dynamics for the Ising model picks a random vertex x ∈ V at each step and resamples the spin at x according to the correct conditional distribution given its neighboring spins; this Markov chain converges to the Gibbs distribution (1.3) from any initial configuration.The analogous nonlinear mass action kinetics is defined by equation (1.1) with the following kernel: Given two configurations σ, σ ′ , pick a random vertex x and exchange the spins σ x , σ ′ x , obtaining two new configurations τ, τ ′ .The transition probabilities Q(σ, σ ′ ; τ, τ ′ ) are chosen to satisfy the detailed balance condition (1.2), where µ = µ J,h is the Gibbs distribution.We emphasize that, in contrast to Glauber dynamics, here the system is evolving endogenously via pairwise interactions between configurations, rather than via exogenously applied spin updates.Our first result shows that this dynamics converges to the Gibbs distribution (1.3), where the fields h are determined by the marginal probabilities of the spins at each vertex in the initial distribution.(The fact that the marginals determine a unique vector of fields h follows from standard convexity arguments; see, e.g., [18].)Theorem 1.1.Let p t denote the distribution at time t for the above mass action kinetics for the Ising model with interactions J starting from any initial distribution p 0 , and let h be the unique choice of external fields such that the marginal spin probabilities at each vertex x ∈ V in µ J,h are the same as those in p 0 .Then p t converges to µ J,h as t → ∞.
Note that, unlike the standard Glauber dynamics, the nonlinear dynamics has conserved quantitiesnamely, the marginal probabilities of the spins at each vertex-and the values of these conserved quantities determine which of the family of stationary points the dynamics converges to.This phenomenon is typical in mass action kinetics.The key to our proof of Theorem 1.1, which is based on an earlier argument in [44] (see Remark 2.12), is establishing an irreducibility property, namely that along any trajectory, the probability of any configuration eventually remains uniformly bounded away from zero.
We pause to briefly mention some features of mass action kinetics that make its analysis much more complex than that of Glauber dynamics, and which explain the lack of quantitative convergence results.First, as noted above, there are in general multiple equilibrium points, which are characterized by conserved quantities.Second, in contrast to the linear case, the total variation distance to stationarity is not monotonically decreasing (see [10,Remark 2.7]) and there are no simple coupling arguments to rely upon.Finally, the nonlinearity means that we do not have at our disposal a spectral theory and other functional analysis tools that have proved so powerful in the analysis of Markov chains.As usual in kinetic theory, a natural way to study convergence to stationarity here is to use relative entropy, which provides a monotonically decreasing functional; however, quantitative analysis of this quantity is a notoriously difficult problem in the nonlinear setting, which has so far been solved only in the non-interacting case (genetic recombination) [11,12].
Our main result establishes tight bounds on the rate of convergence for this nonlinear dynamics in the so-called "high-temperature" regime, when the interactions are non-trivial but relatively weak.Specifically, the condition we require is that max x∈V y∈V |J xy | ≤ δ 0 for some absolute constant δ 0 > 0, i.e., the aggregated strength of all interactions at any given vertex is not too large.This condition mirrors the standard Dobrushin condition for Glauber dynamics, which gives a non-trivial sufficient condition for rapid mixing (see, e.g., [55]).We state this result in the following theorem.Theorem 1.2.In the scenario of Theorem 1.1, with the additional assumption that max x∈V y∈V |J xy | ≤ δ 0 for an absolute constant δ 0 > 0, the rate of convergence of p t to µ J,h is given by for absolute constants C, c > 0, where • TV denotes total variation distance.Thus in particular the time required to achieve variation distance ε is t = O(n log(n/ε)).
We note that this upper bound on convergence time is (up to constants) the same as for the analogous version of the genetic recombination model discussed above (where just a single allele is exchanged between the strings at each step) [43], and is therefore also tight by virtue of the lower bound in the same paper.That model is equivalent to the trivial case of the Ising model in which there are no interactions (J xy = 0 for all x, y), with the h x determined by the marginal probabilities at each allele x (and the spins ±1 identified with the bits 1, 0).However, as we explain in Section 1.1 below, the correlations present in the Ising model make the analysis much more challenging.
We also consider a "block" version of the nonlinear dynamics, in which σ, σ ′ exchange their spins at a random subset Λ ⊆ V of vertices (rather than just at a single randomly chosen vertex).The kernel Q(σ, σ ′ ; τ, τ ′ ) is again determined by the detailed balance condition (1.2), and the basic convergence result in Theorem 1.1 still holds.Under the same Dobrushin-type condition on the interactions as in Theorem 1.2, we again obtain a tight bound on the convergence rate: Theorem 1.3.With the same notation and assumptions as in Theorem 1.2, the variation distance of the block version of the mass action kinetics for the Ising model satisfies for absolute constants C, c > 0. Thus in particular the time required to achieve variation distance ε is t = O(log(n/ε)).
Note that convergence here is exponentially faster than in the single-vertex version of Theorem 1.2, reflecting the fact that this version is non-local and changes large portions of the configurations at each step.Again, the bound of Theorem 1.3 matches the lower bound for zero interaction [43].
We stress that the goal of this paper is not to design an efficient algorithm for sampling configurations of the Ising model.Such algorithms, based on standard linear Glauber dynamics, are already known throughout the high-temperature regime.Rather, our goal is to analyze the rate of convergence of a natural nonlinear dynamics, for the first time in a model with correlations.We view this as a first step towards a better understanding of such dynamics and the techniques needed to understand them; in addition to their inherent interest, these techniques may lead to algorithmic applications in future.Our work can be viewed as an extension of the successful application of a TCS lens in the analysis of mixing times of linear dynamics (Markov chains), which, as is well known, has seen both mathematical and algorithmic applications over many years.However, we should point out that our convergence analysis in Theorems 1.2 and 1.3 actually does yield polynomial time sampling algorithms based on simulation of the respective nonlinear dynamics.We outline these algorithms, together with some associated open questions, in Section 5 at the end of the paper.
We also point out a further interesting algorithmic aspect of our results.Recall that our nonlinear processes sample from an Ising Gibbs measure µ J,h , where the fields h are determined implicitly by the marginals at each site.It is these marginals (not the fields) that are specified by the initial distribution p 0 .(As far as we are aware, all existing sampling algorithms for the Ising model require the specification of the fields h rather than the marginals.)Thus our processes can be viewed as a novel, direct method for constructing a maximum entropy distribution subject to these marginal constraints, an important problem in its own right (see, e.g., [47]).Additionally, our processes can be used to learn the fields h corresponding to given marginals, an inverse problem that is also of independent interest (see [40] for a survey of such inverse problems): given samples from µ J,h produced by the nonlinear dynamics, standard methods can be used to infer the field vector h.
1.1.Techniques.We begin by describing the earlier approach of [43] to analyzing the rate of convergence of the simpler population genetics dynamics, which corresponds to the trivial case of the Ising model with no interactions (J = 0).Since the equilibrium distribution here consists of independent bits, the analysis is relatively straightforward once one observes the following insight.The derivation of an individual σ at time t can be viewed as a binary tree going backwards in time, in which each individual inherits a random subset of its bits from each of its two parents according to the random crossover subset Λ.We may therefore follow the derivation of the n bits in σ back in time, until each of these bits is derived from a distinct individual at time 0. At that point we can deduce that the bits of σ are independently sampled from their respective marginal distributions, so σ is in equilibrium.The analysis therefore reduces to the question of how many steps are needed until all the n bits are separated, or "fragmented", under the repeated action of partition by the random crossover subset Λ, which in turn is a straightforward combinatorial calculation.(See Section 2.4 for a more detailed description of this process.) In the case where correlations are present in the equilibrium distribution, as in the Ising model, the above analysis breaks down because it is no longer sufficient to consider only fragmentation of the bits: indeed, the process must involve not only the breakdown of correlations in the initial distribution, but also, crucially, the creation of the correct equilibrium correlations as mandated by the Gibbs distribution (1.3).Moreover, the process by which an individual inherits bits from its parents is no longer independent of the parents, but dictated by a complex function of both parents.
To account for this, we appeal to the information percolation framework developed by Lubetzky and Sly [35] in the context of Glauber dynamics for the Ising model.This framework suggests that we keep track of a "dependence cluster" going back in time, which records the neighboring spins that have influenced each spin in our current configuration.In the linear setting of [35], it can be shown (under a similar hightemperature assumption to ours) that this cluster is dominated by a subcritical branching process and thus will die out with large probability: the equilibrium correlations are then implicitly encoded by the history of this process.The time until the process dies out gives a bound on the mixing time.
In our nonlinear setting, the dependence clusters are no longer describable in terms of a simple branching process, but rather by a new type of process that combines branching with fragmentation, a process we refer to as "fragmentation plus noise."The first main ingredient of our analysis is the precise construction of such a process and the proof that it encodes the complex dependence structure of the nonlinear dynamics.The second main ingredient is the proof that, under the high-temperature assumption max x∈V y∈V |J xy | ≤ δ 0 , the fragmentation plus noise process has a subcritical behavior and therefore dies out with large probability on a suitable time scale.To establish this latter fact, we introduce a non-standard form of "high-temperature expansion" for the dependence structure obtained by a coupling with non-uniform Erdős-Rényi random graphs.We refer the reader to Section 3.1 for a more technical high-level description of these ideas.1.2.Related work.The framework of mass action kinetics was first introduced in the chemical reaction networks literature, most notably in the landmark paper of Horn and Jackson [29].The dynamics defined in (1.1) above, generalized to allow interactions between arbitrary numbers of types, models chemical processes in an obvious way.The recent monograph of Feinberg [23] describes the state of the art in the area.While almost nothing is known about rates of convergence, much effort has been devoted to a proof of the Global Attractor Conjecture mentioned earlier; indeed, the original paper [29] contained this as a theorem, but the authors later retracted it and restated it as a conjecture [28].Since then there have been numerous attempts at a proof, including recent papers that handle various special cases, typically based on rather severe structural conditions on the set of reactions (such as forming a single connected component)-see, e.g., [19,41,26,1,2,21].The Conjecture remains open.
Several other important classes of dynamics fit into the mass action framework, the most classical of which is Boltzmann's model of an ideal gas [7], where the types are momentum values of the molecules and interactions correspond to (randomized) collisions between pairs of them 4 .In this case the dynamics (1.1) is the so-called Boltzmann equation, which remains a major object of study in mathematical physics today (see [51] for a survey).Another famous example is Hardy-Weinberg recombination in population genetics, as described earlier.The same dynamics can also be used to model the "recombination" step in genetic algorithms [25,38], where two (or more) candidate solutions to a combinatorial optimization problem are combined to produce new solutions; here the mass action dynamics is typically combined with a "selection" operator that weeds out less desirable solutions.The viewpoint taken in this paper, where mass action kinetics are viewed as a nonlinear version of the Markov chain Monte Carlo method for sampling combinatorial structures from a given distribution, was first proposed in [44] and later explored in a different context in [47].Negative results on the worst-case computational complexity of simulating mass action kinetics were derived in [4].Mass action kinetics resembles certain more refined MCMC approaches that are used in practice for numerical simulations of spin systems such as the Ising model, including replica Monte Carlo [48], parallel tempering [31] and-most closely-cluster Monte Carlo [30]; these are in fact linear dynamics, but because they update a set of configurations in a dependent way, they can be viewed as approximate finite realizations of nonlinear dynamics.
As discussed earlier, there are almost no quantitative results on the rate of convergence of mass action kinetics in combinatorial settings, the main exception being genetic recombination [43,11,12].There is a vast and still evolving literature on convergence rates of the Boltzmann equation (see, e.g., [32,51,37,20]), which however is tailored to the specifics of that model and does not seem to generalize.
Finally we mention the more general class of so-called "nonlinear Markov chains", which (in discrete time) are stochastic processes (X t ) in which the distribution of X t depends not only on the previous state X t−1 but also (in an arbitrarily complex way) on the distribution of X t−1 .(Mass action kinetics as discussed above is a particular example where the dependence is quadratic.)This class was formally introduced by McKean [36], who studied the continuous time version and its deep relationships to nonlinear parabolic equations, including the Boltzmann equation, and other kinetic models.These nonlinear systems arise naturally as the limit of large finite mean-field particle systems through the so-called "propagation of chaos" [32,33,49,52,14,13,15,37,8,16,17].Several works have been devoted to the development of nonlinear Markov chain Monte Carlo methods [3,39,22,9,45,53], but in contrast with the classical (linear) Markov chain framework, quantitative results on convergence to stationarity are very limited and difficult to obtain.
1.3.Organization of the paper.In Section 2 we formally define both of our nonlinear dynamics and establish some of their basic properties, including a proof of a general convergence result, Theorem 2.8, which we then use to prove Theorem 1.1.In Section 3 we proceed with our quantitative analysis of the convergence rate for the nonlinear block dynamics, culminating in a proof of Theorem 1.3; prior to embarking on the details, we provide in Section 3.1 a more technical, high-level sketch of our approach.In Section 4 we apply a similar approach, though substantially different in detail, to analyze the nonlinear Glauber dynamics and prove Theorem 1.2.We conclude with some additional observations, extensions and open problems in Section 5.

PRELIMINARIES
2.1.The Ising model.We recall from the introduction the definition (1.3) of the Ising model via its Gibbs distribution µ J,h .Note that we allow arbitrary edge-dependent interactions J = {J xy } x,y∈V and arbitrary external fields h = {h x } x∈V .When all the external fields are zero we write simply µ J .We identify the set of vertices (or sites) V with [n], and denote the set of spin configurations by Ω = {±1} [n] .Remark 2.1.For simplicity we have taken a model with no boundary conditions.However, there is no difficulty in extending our analysis and results to the case of a Gibbs measure with arbitrary boundary conditions, i.e., when the spins in some subset V 0 ⊆ V are pinned to given values ±1.This generalization can be easily achieved by taking limits h x → ±∞ for all x ∈ V 0 that are pinned to the values ±1, respectively.
We view this dynamics as a dynamical system p → T t (p), where T 0 (p) = p ∈ P(Ω) is the initial distribution, T t (p) ∈ P(Ω) is the distribution after t steps, and T t (p) = T t−1 (p) • T t−1 (p).Here one step of the dynamics is defined, as in (1.1), by (2. 2) It will be convenient later to write (2.2) in the equivalent form where, for fixed σ, σ ′ ∈ Ω, the distribution In this paper we take µ = µ J,h as the Ising measure (1.3) and consider two natural choices of the kernel Q that satisfy (2.1), which we now describe.

Nonlinear block dynamics.
The first model, which we refer to as the nonlinear block dynamics, corresponds to interactions in which a pair of configurations (σ, σ ′ ) exchange their spins at an arbitrary, randomly chosen subset Λ ⊆ [n] of sites, i.e., where σ Λ σ ′ Λ c denotes the element of Ω with entries σ x for x ∈ Λ and σ ′ x for x ∈ Λ c = [n] \ Λ.Here, to ensure reversibility, the set Λ is chosen with probability proportional to µ(σ Thus the associated kernel is defined as Note that transitions of the form (2.5) can only produce pairs (τ, τ ′ ) that belong to the equivalence class Thus the kernel (2.6) defines a (linear) Markov chain on the pair space Ω×Ω that is in general not irreducible, and whose communicating classes are precisely C(σ, σ ′ ).Note also that the kernel Q J depends on µ = µ J,h only through the interaction J and is insensitive to the choice of fields h.Indeed, once σ, σ ′ are given, then for any (τ, τ ′ ) ∈ C(σ, σ ′ ) and (η, η ′ ) ∈ C(σ, σ ′ ) one has Thus, w.l.o.g., we may take h = 0, and µ = µ J = µ J,0 , in the definition of the kernel (2.6).The kernel Q J in (2.6) is an example of a so-called "folding" transformation [50].
Observe that, for all h ∈ R n , the reversibility condition (2.1) holds in the form for all σ, σ ′ , τ, τ ′ ∈ Ω.Thus, for a fixed interaction J, the kernel Q J is reversible w.r.t.all measures {µ J,h , h ∈ R n }.In particular, all these measures are stationary for the dynamics (2.2), i.e., as can be easily checked from reversibility.We note that in the case of zero interactions, i.e., J = 0, the nonlinear block dynamics reduces to the uniform crossover model from population genetics [43,12].In this case, the stationary distributions µ J,h are just product measures over spins with marginals determined by h.

Nonlinear Glauber dynamics.
In our second model, the configurations σ, σ ′ exchange their spins at a single randomly chosen site x ∈ [n], i.e., . By analogy with the familiar Glauber dynamics (a Markov chain that updates the spin at one site in each step), we refer to this as the nonlinear Glauber dynamics.As usual, to ensure reversibility w.r.t.µ J,h , we need to perform such an exchange with an appropriate probability α x (σ, σ ′ ).Specifically, we use the generic dynamics (2.2) with the kernel where .
Once again, the Markov chain on pairs Ω × Ω defined by the kernel (2.9) is not irreducible, and the kernel Q J depends on µ = µ J,h only through the interaction J.As in (2.7)-(2.8),reversibility and stationarity of all measures µ J,h can be easily checked.

Conservation laws.
In both dynamics defined above, the map p → p • p conserves the marginal probabilities of spins at every vertex, i.e., for every x ∈ [n], and for any p ∈ P(Ω), one has where p x (a) := p(σ x = a), a ∈ {−1, 1}, denotes the marginal of p at x.It is convenient to state the following stronger property.Let us define the commutative convolution product of two distributions p, q ∈ P(Ω) by where Q is defined by (2.4) and (2.6) for the nonlinear block dynamics and by (2.4) and (2.9) for the nonlinear Glauber dynamics, respectively.Note that the notation (2.11) is consistent with (2.3).
Lemma 2.2.Both of the above dynamics satisfy In particular, the conservation law (2.10)holds.
Proof.The proof is a consequence of the fact that, in both dynamics, spins are simply exchanged between σ, σ ′ , as well as the symmetry of Q J .For the kernel (2.6) we compute In conclusion, This proves the lemma for the nonlinear block dynamics.To prove it for the nonlinear Glauber dynamics, observe that in this case by (2.9) one has By the symmetry α and the conclusion follows as in (2.12).

2.4.
The derivation tree and fragmentation.Throughout the paper, the following view of the nonlinear dynamics will be central.By definition, T t (p) is the result of repeated pairwise interactions and can be represented as the distribution at the root of a binary "derivation" tree, where each leaf is equipped with the distribution p ∈ P(Ω), and recursively, starting from the leaves, each internal node is assigned the distribution p 1 • p 2 where p 1 , p 2 represent the distributions assigned to the left and right descendants of that node; see Figure 2.1 for a schematic picture of the case t = 2.
We focus now on the simple case J = 0, i.e., no correlations between spins.Under block dynamics, the configuration τ at the root of the tree (at time t) is constructed according to the random partition (Λ, Λ c ) of V , which is equivalent to drawing each spin τ x from the configuration at the left or right child node with probability 1  2 , independently for each site x ∈ V .Continuing down the tree in the same fashion, we see that each spin at the root is drawn from one of the 2 t leaves (at time 0), independently and uniformly at random.Thus this process induces a partition of the sites V into 2 t disjoint subsets (some of which may be empty), where the ℓth subset consists of those sites that draw their spin from the configuration at leaf ℓ.Now let A denote the event that none of the 2 t subsets in this partition contains more than one site; equivalently, each spin in τ is drawn from a distinct leaf.We call A the "complete fragmentation" event.Note that, conditional on A, the distribution of the configuration τ at the root is just the product π := ⊗ x∈V p x , since there are no remaining correlations between spins.Hence we may write for some other distribution q(t), where ν denotes the uniform distribution over all 2 t −1 independent random subsets Λ occurring in the tree.We can use (2.13) to obtain an upper bound on the convergence time for the nonlinear block dynamics when J = 0, as was done in [43].First, we claim that ν(A c ) ≤ n 2 2 −t .To see this, note that for any given pair of distinct sites x, y ∈ V , the probability that x, y are not separated after t levels of the successive partitioning process is 2 −t , and then take a union bound over pairs.Hence by (2.13) For the nonlinear Glauber dynamics with J = 0 a similar analysis applies, except that now each node in the tree chooses one random spin from the left child and the remainder from the right child.The complete fragmentation event A now corresponds to isolating each of the n bits in this process, which is just a couponcollecting event for n = |V | coupons.Thus we have ν (where now ν denotes the uniform distribution over all 2 t − 1 independent choices of random spins occurring in the tree), which by (2.13) implies a convergence time of O(n log(n/ε)).
When J = 0, so that correlations are present, it is no longer possible to reduce the analysis of convergence to the above simple fragmentation process, because the mechanism by which a configuration inherits spins from its parents depends on the actual configurations at the parent nodes.Thus to prove Theorems 1.2 and 1.3 we will need to augment the simple derivation process above to obtain a more complex process that we call "fragmentation with noise" (see Sections 3 and 4).

2.5.
Irreducibility.Next we establish a rough lower bound on the probability of any configuration after a sufficiently long time.This observation will be key to our proof of convergence in the next subsection.We note that the lack of such a lower bound is the principle obstacle to proving the Global Attractor Conjecture for general reversible mass action systems.
The following definition captures the desired property.
Definition 2.3.We say that an initial distribution p ∈ P(Ω) is irreducible for a given mass action system with kernel Q if there exist ε > 0 and t 0 such that T t (p)(τ ) ≥ ε for all t ≥ t 0 and all τ ∈ Ω.
Thus irreducibility says that the trajectory of the dynamics starting from p eventually remains bounded away from the boundary of the simplex.Note that, under this definition, irreduciblity is a property of initial distributions: i.e., for a given kernel Q, some initial distributions may be irreducible while others are not.
In what follows we shall assume that the initial distribution p of our dynamical system has nondegenerate marginals, by which we mean that there exists δ > 0 such that This is actually no loss of generality since one can otherwise restrict to the nondegenerate spins and consider the degenerate spins as a fixed boundary condition, or pinning; see Remarks 2.1 and 2.11.
Lemma 2.4.For any interaction matrix J, any initial distribution p ∈ P(Ω) with nondegenerate marginals is irreducible for both the nonlinear block dynamics and the nonlinear Glauber dynamics.
Proof.Let us first consider the nonlinear block dynamics (2.6).We note that for a fixed interaction matrix J ∈ R n×n , there exists a constant δ J > 0 such that (Recall that Q J,h = Q J does not depend on h, so since we are discussing a property of Q J we may take h = 0.) It follows from (2.6) that where, for each σ, σ ′ , Q J (σ, σ ′ ; •, •) ∈ P(Ω × Ω) is some probability distribution that we do not need to describe explicitly.Therefore, from (2.3), where ) denotes the product of marginals of p on Λ, Λ c , and Φ(p) ∈ P(Ω) is some new distribution.We may interpret (2.15) as saying that the outcome of each interaction is, with probability (at least) δ J , equal to the factorized distribution p Λ ⊗ p Λ c , where Λ is chosen u.a.r.among all subsets of V .Now note that the map p → 2 −n Λ⊆V p Λ ⊗ p Λ c corresponds to one step of the block dynamics when J = 0. Writing T t (p) for the t-step evolution of this J = 0 dynamics, and noting from the tree representation that the construction of T t (p) involves N := 2 t − 1 interactions, we may rewrite (2.15) as where T t (p) ∈ P(Ω) is again some other distribution that we will not describe.
To complete the argument, we appeal to the analysis of the J = 0 case from the previous subsection.Specifically, we use equation (2.13) together with the analysis of the complete fragmentation event A, which implies that ν Here π is the stationary distribution for the J = 0 case, which is just the product π := ⊗ x∈V p x , and the assumption (2.14) implies π(τ ) ≥ δ n for all τ .Plugging these observations into (2.16)gives where N 0 = 2 t 0 − 1.This proves the claim for t = t 0 with ε = 1 2 δ N 0 J δ n .To prove it for all t ≥ t 0 , observe that if t ≥ t 0 then T t (p) = T t 0 (T t−t 0 (p)).Since by Lemma 2.2, T t−t 0 (p) has the same marginals as p, we can apply the bound (2.14) to T t−t 0 (p) with the same constant δ.Using (2.17) with p replaced by T t−t 0 (p) yields T t (p)(τ ) ≥ ε with the same ε for all t ≥ t 0 .This completes the proof of the lemma for the nonlinear block dynamics.
To prove it for the nonlinear Glauber dynamics we follow the same reasoning.We first observe that for some δ J > 0 one has and therefore (2.15) now takes the form for some new measure Φ(p) ∈ P(Ω).We may again write the expression (2.16), where now T t (p) is the t-step evolution of the J = 0 nonlinear Glauber dynamics, p → p x ⊗ p [n]\{x} with x ∈ V chosen uniformly at random.(Incidentally, it is interesting to note that, in contrast with the block dynamics, the process T t (p) here is actually linear in p, since the marginals p x are constants of the motion.)Following the same reasoning as above, again using equation (2.13) from the previous subsection together with the analysis of the complete fragmentation event A (which in this case corresponds to couponcollecting) leads to ν(A) ≥ 1 2 for t ≥ t 0 = n⌈1 + log n⌉, which in turn with the assumption (2.14) implies T t 0 (p)(τ ) ≥ δ n /2 for all τ ∈ Ω.It follows that The desired conclusion for all t ≥ t 0 follows as in the block dynamics case.We now address the convergence result claimed in Theorem 1.1 in the introduction, which we restate here more formally.
Theorem 2.7.Fix an interaction matrix J.For any initial distribution p ∈ P(Ω) with nondegenerate marginals, both the nonlinear block dynamics and the nonlinear Glauber dynamics satisfy the convergence where h is the unique choice of external fields such that µ J,h and p have the same marginals, i.e., (µ J,h ) x = p x for all x ∈ V .
We will prove Theorem 2.7 as a consequence of a more general convergence theorem (Theorem 2.8 below), together with the irreducibility property established in Lemma 2.4.
Our general convergence theorem is based on the following more abstract framework for mass action kinetics.Given a finite space of types Ω, consider the nonlinear mass action dynamics T t (p) ∈ P(Ω) defined by T t (p) = T t−1 (p) • T t−1 (p), T 0 (p) = p, where the collision operator • is defined as in (2.2), and As always we assume the mild exchange symmetry property Indeed, as a consequence of (2.18), we may actually assume w.l.o.g. that Q satisfies the additional symmetries To see this, note that because of the symmetry of the factor p(σ)p(σ ′ ) in the dynamics 2.2, one can always replace Q in (2.2) by the symmetrized kernel without altering the dynamics.Using (2.18), we immediately see that this symmetrized kernel satisfies (2.19).Accordingly, we will assume (2. 19) from now on.We also assume that the kernel has positive diagonal elements, i.e., in particular, this rules out periodic behavior.However, note that, as usual, we do not assume that Q is irreducible.We will also not assume that Q is reversible (detailed balanced), as it is in our Ising model systems; rather, it will be enough to assume the weaker property that there exists a strictly positive distribution This is equivalent to saying that the product distribution ) holds for some positive µ, we say that Q is balanced 5 .Note that if Q is detailed balanced (reversible), as defined in (2.1), then it is also balanced.Moreover, it is easy to check that any µ satisfying (2.21) must necessarily also be stationary for the mass-action dynamics defined by Q.Indeed, for any balanced system, it will follow from the proof of Theorem 2.8 below that the converse is also true: any stationary µ must satisfy (2.21) (see Proposition 2.10).
The following general theorem says that, for any mass action kernel satisfying the above properties, irreducibility of the initial distribution (as specified in Definition 2.3) is sufficient to guarantee convergence.Theorem 2.8.Suppose the kernel Q satisfies (2.19) and (2.20), and is also balanced.Then, for any irreducible initial distribution p, we have that T t (p) → ν as t → ∞ for some stationary ν ∈ P + (Ω).Moreover, Note that, in general, the limit point ν will depend on the initial distribution p.
Proof.Define ρ := p ⊗ p and π := µ ⊗ µ.The assumption that Q is balanced w.r.t.µ implies that πQ = π, while the fact that p is not stationary implies that ρQ = ρ.Now we may equivalently write the operation (2.2) in the form Equation (2.23) suggests a two-step decomposition of p → p • p.The first step is a mapping on the pair space Ω × Ω, which takes the product distribution ρ = p ⊗ p to the distribution ρQ (which is not typically a product).The second step is the mapping back down to Ω obtained by marginalizing out the second element τ ′ of the pair.We argue that each of these steps separately decreases the relative entropy w.r.t.µ, the first step yielding a strict decrease whenever p is not stationary.For the first step, note that Q defines one step of a Markov chain on Ω × Ω. Simple convexity considerations imply the inequality D(ρQ π) ≤ D(ρ π).We claim the stronger property that whenever ρQ = ρ.To prove this, we use the following basic fact that applies to any (not necessarily irreducible or reversible) Markov chain with a finite state space, with positive diagonal entries, and with an everywhere positive stationary distribution π.Let C 1 , . . ., C k denote the communicating classes (irreducible components) associated with Q.Note that the assumption π > 0 implies that there is no transient state, so that the C i partition the state space Ω × Ω.Let F = ρ/π denote the density of ρ w.r.t.π and note that ρQ = ρ iff F is constant within each C i .Thus, to prove (2.24) it is sufficient to show that the identity where 25) holds.To see the converse, observe that the strict convexity of x → x log x implies that and that H(u) = 0 for some state u if and only if F (v ′ ) = F (v) for all v, v ′ in the neighborhood of u in the Markov chain graph (i.e., the set of v such that Q(u, v) > 0).By the assumption (2.20) of positive diagonal elements for Q, this is equivalent to F (v) = F (u) for all v such that Q(u, v) > 0. Therefore, H = 0 everywhere implies that F is constant within each communicating class C i .On the other hand, it follows from (2.26) and the invariance πQ * = π that (2.25) is equivalent to π(H) = 0, which by the positivity assumption on π, and the fact that H ≥ 0, is equivalent to H = 0 everywhere.This ends the proof of (2.24).
For the second (marginalization) step, we appeal to the well-known fact that, among all distributions on Ω × Ω with fixed marginals, the relative entropy w.r.t. a product measure π = µ ⊗ µ is minimized when that distribution is a product distribution.Namely, if a probability ν ∈ P(Ω × Ω) has marginals ν 1 , ν 2 on the first and second element respectively, one has for all probability measures µ 1 , µ 2 ∈ P + (Ω).To see this, recall the variational principle for relative entropy, asserting that where ν, ζ are arbitrary probability measures and F ranges over all real functions.Now, take In conclusion, if p is not stationary we have shown that This completes the proof of (2.22) and hence the lemma.
Proof of Theorem 2.8.Let Q be balanced w.r.t.µ ∈ P + (Ω).Lemma 2.9 shows that D(T t (p) µ) is monotonically strictly decreasing with t.Hence, since D(T t (p) µ) ≥ 0, we know that D(T t (p) µ) converges to a limit, say d µ .By compactness, there exists ν ∈ P(Ω) and a subsequence t 1 , t 2 , . . .such that T t i (p) → ν as i → ∞.Also, it must be the case that D(T t i (p) µ) → D(ν µ), and therefore (2.28) Moreover, ν must be stationary.To see this, assume for contradiction that it is not.Then, by Lemma 2.9, one step of the dynamics must strictly decrease the relative entropy and therefore, for some ε > 0, On the other hand, by continuity of the map T 1 and the function D(• µ), since T t i (p) → ν, there exists t ε ∈ N sufficiently large such that Combining this with (2.29) gives which contradicts (2.28).Now note that, if p is irreducible, then it must be the case that ν ∈ P + (Ω).Thus we have established that ν is a stationary point with full support.Therefore, we may take µ = ν in (2.28), so that D(T t (p) ν) → D(ν ν) = 0 as t → ∞.The latter implies, e.g., by Pinsker's inequality, that T t (p) → ν, completing the proof.
Before proceeding with the proof of Theorem 2.7, we pause to note the following simple consequences of the arguments given above.
Proposition 2.10.Suppose Q is a balanced kernel satisfying (2.19) and (2.20) and let p ∈ P(Ω).Then p is stationary iff p ⊗ p is Q-invariant.Moreover, if Q is also detailed balanced, then p is stationary iff p ⊗ p is Q-reversible (i.e., iff (1.2) holds with µ replaced by p).
The argument in the proof of Lemma 2.9 shows that if p • p = p then ρ is a constant multiple of µ ⊗ µ over the communicating classes of Q.This implies (a).To prove (b), notice that if the detailed balance condition (1.2) holds then it continues to hold if µ ⊗ µ is replaced by any ρ that is a constant multiple of µ ⊗ µ over the communicating classes of Q, since only transitions within such components are relevant in (1.2).
We conclude this section with a proof of our convergence result for the nonlinear Ising model dynamics, Theorem 2.7.This will follow immediately from the general criterion for convergence in Theorem 2.8, together with the irreducibility property we proved in Lemma 2.4.
Proof of Theorem 2.7.In light of Theorem 2.8, it suffices to check that both our kernels satisfy all the required properties and that any p ∈ P(Ω) with nondegenerate marginals is irreducible.The latter fact, for both nonlinear block and Glauber dynamics, is precisely the statement of Lemma 2.4.The fact that both kernels are balanced follows from the fact that both are reversible.To check the properties (2.19) and (2.20), notice that the diagonal entries are positive for both the nonlinear block dynamics and the nonlinear Glauber dynamics.Moreover, the pair symmetry property (2.19) is easily seen to be satisfied by the block dynamics kernel (2.6).Finally, the single site kernel as written in (2.9) does not satisfy (2.19), but it does satisfy the exchange symmetry (2.18) and hence is equivalent to a symmetrized kernel Q that satisfies (2.19), as discussed earlier.
Remark 2.11.We proved Lemma 2.4 and Theorem 2.7 under the assumption (2.14) of nondegenerate marginals.However, there is no difficulty in extending to the general case of an arbitrary p ∈ P(Ω), where some spins may be deterministically set to +1 or −1.It is easy to check that the proofs of Lemmas 2.2 and 2.4 and Theorem 2.7 continue to hold in this setting, with the external fields at the pinned sites taken to be ±∞ as discussed in Remark 2.1.
Remark 2.12.Theorem 2.8 shows that, for balanced systems, convergence to a stationary point with full support is guaranteed provided one can prove that the trajectory starting from a given initial distribution p eventually remains uniformly bounded away from zero everywhere; i.e., no type "dies out".This observation is already known in the chemical reaction networks literature [46].However, we have provided an alternative proof here for several reasons: (i) we are working in discrete rather than continuous time as in the reaction networks community, which necessitates different arguments; (ii) we have made extensive use of probabilistic, rather than dynamical systems concepts in our proof; and (iii) we aim to make this paper self-contained.We point out also that our proof follows the same lines as that in [44,Theorem 2] for the restricted case where Q is symmetric (i.e., reversible w.r.t. the uniform distribution µ), while correcting some omissions in that earlier proof: namely, the assumption of positive diagonal elements (2.20) and the requirement that the initial condition p be irreducible.In light of Theorem 2.8 the key to the proof of our convergence result Theorem 2.7 for the nonlinear Ising model dynamics is establishing irreducibility, as we do in Lemma 2. 4. With respect to progress on the Global Attractor Conjecture, the most interesting question here seems to be that of identifying minimal assumptions on the kernel Q that guarantee such an irreducibility property; we leave this question for future work.Theorem 2.7 provides no quantitative estimate on the rate of convergence to stationarity.In particular, there is no explicit dependence on the size of the system n.In analogy with the mixing time analysis for linear Markov chains, in the remainder of the paper we will study the rate of convergence to equilibrium under the assumption that the interactions in J are sufficiently weak (usually referred to as the "high temperature" regime).

THE NONLINEAR BLOCK DYNAMICS
Let T t (p), t ∈ N, denote the evolution of the initial distribution p ∈ P(Ω) under the nonlinear block dynamics (2.6).From Theorem 2.7 we know that for any fixed interaction matrix J, and any p ∈ P(Ω), one has the convergence where h is the unique vector of external fields such that µ J,h and the initial state p have the same marginals at x, for all x ∈ V .Our main result for the nonlinear block dynamics (Theorem 1.3 in the introduction) establishes a tight bound on the rate of convergence in (3.1) as a function of the cardinality n = |V |, under the Dobrushin-type high-temperature condition on the interaction matrix J.For convenience we restate this result here.
Theorem 3.1.There exist absolute constants δ 0 > 0, c > 0 and C > 0 such that, if then, for any p ∈ P(Ω), t ∈ N, where h is the unique choice of external fields such that p x = (µ J,h ) x for all x ∈ [n].In particular, for any ε > 0, one has T t (p) − µ J,h TV ≤ ε as soon as t ≥ 2 c log n + C 1 (ε), where C 1 (ε) = 1 c log(C/ε).3.1.Main ideas of the proof.Before embarking on the details of the proof, we give a high level description of the main steps.By symmetry we may rewrite the operator (2.11) in the form where ) is a probability measure over subsets Λ ⊆ V .It will be convenient to view the distribution γ(• | σ, σ ′ ) as a spin system, i.e., a probability measure over spin configurations η ∈ {−1, +1} n , by identifying η x = +1 with x ∈ Λ and η x = −1 with x / ∈ Λ. Recall that in the non-interacting case J = 0, the distribution γ(• | σ, σ ′ ) does not depend on the pair (σ, σ ′ ), and is simply the product of Bernoulli measures with parameter 1/2.As described in Section 2.4, the dynamics is then entirely governed by the pure fragmentation process that starts with the set V and recursively splits sets of vertices uniformly at random until it reaches a collection of singletons.The simple argument given in that proof then allows one to obtain (3.3) with C = 1  2 and c = log 2; see [43,12] for a detailed analysis of the non-interacting case.
When there is a nontrivial interaction J = 0, this straightforward analysis breaks down.Our proof of Theorem 3.1 is based on a coupling argument that allows us to reduce the problem to the analysis of a more general process in which the fragmentation mechanism is perturbed by a "local growth" process arising from the correlations inherent in the interactions.The main idea is that if the local growth is sufficiently sparse, then the underlying fragmentation dominates and eventually the memory of the initial distribution (except for the marginals) is lost.
The first step in the proof is to couple the above random variable η with distribution γ(• | σ, σ ′ ) with a random subgraph G of the complete graph K n having a suitable distribution ν, i.e., we shall write where the sum extends over all possible subgraphs G ⊆ K n , and γ G (• | σ, σ ′ ) is a probability measure on Ω for each realization G.The key features of this coupling are: • the distribution ν does not depend on the pair (σ, σ ′ ); • the distribution γ G (• | σ, σ ′ ) depends on the pair (σ, σ ′ ) only through the spins Actually, it will be crucial that ν can be taken to be the inhomogeneous Erdős-Rényi random graph with edge weights proportional to λ xy := e 4|Jxy| − 1.
This ensures that, under the assumption (3.2), the graph G will be sufficiently sparse and the size of the connected components will satisfy good tail bounds.Note that the expression (3.6) can be seen as a form of high-temperature expansion [24] for the measure γ(• | σ, σ ′ ).However, a standard high-temperature expansion would produce an expression of the form (3.6) with real-valued coefficients ν(G) which depend on (σ, σ ′ ), while it is crucial for our coupling argument that ν be a probability measure independent of (σ, σ ′ ).
Armed with the coupling (3.6), we consider all 2 t − 1 interactions in the derivation tree of Section 2.4 that produce the final distribution T t (p).For each interaction we use a realization of the graph G and we specify a realization B of the Bernoulli random variables with parameter 1/2 which determine η y for y ∈ V \ V G .We then compute the resulting distribution.Letting ( G, B) = (G 1 , B 1 ), . . ., (G 2 t −1 , B 2 t −1 ) denote the vector of all such realizations, we may then write where ν is a suitable distribution over the realizations ( G, B) and T t (p | G, B) ∈ P(Ω) represents the distribution at time t conditional on the realizations ( G, B).The important point here is that ν is independent of the initial conditions, and therefore all correlations in the initial distribution appear only in the measures T t (p | G, B).Moreover, the measure ν can naturally be interpreted as a stochastic process that combines fragmentation with local growth.
The second main ingredient in the proof of Theorem 3.1 is the identification of an event E t for this process, roughly representing the fact that within time t all fragments have reached their minimum size, and such that for some absolute constants A, b > 0. The nature of the event E t will be such that for all p, p ′ ∈ P(Ω) which have the same marginals at every vertex x ∈ V .Once these facts are established, (3.7), (3.8) and (3.9) imply that for any such p, p ′ ∈ P(Ω) one has This implies the result of Theorem 3.1 by taking p ′ = µ J,h .We now turn to detailed proofs of the various claims sketched above.
Let G be the set of all subgraphs of the complete graph K n over V ⊆ [n] with isolated vertices removed, and write P(G) for the set of probability measures over G. Thus G ∈ G can be viewed as a collection of unordered pairs {x, y} for x, y ∈ V .Note that G ∈ G need not be connected and can be the empty graph (with no vertices).We write V G , E G for the vertex and edge set of G ∈ G, respectively.(3.12) where, for any assigns independently the values ±1 with probability 1/2 to each x ∈ V \ V G .Moreover, the probability measure µ where G 1 , . . ., G k are the connected components of G.
Proof.We start with a high-temperature expansion, which is valid for any probability measure on Ω = {−1, +1} V of the form where e denotes an arbitrary undirected edge e = {x, y} for x, y ∈ V , η(e) = η x η y , and φ(e) ∈ R are some given interaction coefficients.Note that (3.10) is of this form.By adding a constant independent of η to the exponent, we can rewrite this as Letting G 1 , . . ., G k denote the (maximal) connected components of G, we see that where, for any connected graph G ∈ G, we define Since δ e (η(e)) ≥ 0, the weights w(G) are all nonnegative.Therefore, the measure (3.14) satisfies where is a probability measure p φ ∈ P(G) depending on the interactions φ := {φ(e)}, and μG is the probability measure on {−1, +1} V G given by the product μG = ⊗ k i=1 v G i , where, for any connected graph G ∈ G, we define the probability measure We remark that, for any G ∈ G, μG depends on the interactions φ G := {φ(e), e ∈ E G } only.We now apply this decomposition to the measure (3.5), which by Lemma 3.2 is (3.14) with the choice φ(e) = Jxy , e = {x, y}, to obtain for a distribution p J,σ,σ ′ ∈ P(G) that depends on the interactions J = {J xy } and on the configurations σ, σ ′ through the set D(σ, σ ′ ); see (3.11), (3.15) and (3.16).We note that, because of the dependance on σ, σ ′ , this is not sufficient to prove the desired claim (3.13).We shall use a further coupling argument to lift the decomposition (3.17) to the decomposition (3.13) with the desired properties.
We are going to show that for all (σ, σ ′ ), the measure p J,σ,σ ′ is stochastically dominated by ν J , or equivalently that there exists a coupling π ∈ P(G × G) of ν J and p J,σ,σ ′ such that for all G, H ∈ G, In particular, it follows that the conditional distribution Moreover, we will also show that for any where H 1 , . . ., H k denote the connected components of H. Once these facts are established we can quickly conclude the proof of Lemma 3.3 as follows.Let π be the coupling of p J,σ,σ ′ and ν J as above, so that From (3.17) we obtain where we define (3.20) must necessarily also have the desired product structure along the connected components H 1 , . . ., H k of H. Therefore, the decomposition (3.19) concludes the proof of the lemma once we establish the desired properties of the coupling π.
To construct the coupling π we use the following coupled Glauber-type Markov chains.Let U 1 , U 2 , . . .For any fixed G 0 , H 0 ∈ G, we write (X t , Y t ), t = 0, 1, . . ., for the Markov chain with X 0 = G 0 , Y 0 = H 0 , and such that for any t ≥ 1, In words, at each step a uniformly random edge e t is chosen, and the graphs X t−1 , Y t−1 are updated by adding or removing the edge e t according to the prescribed probabilities and the common source of randomness U t .This defines the Markov chain with state space G × G.By construction, we have the reversibility conditions for any G, H ∈ G. Therefore, the marginals X t and Y t , t = 0, 1, . . ., are Markov chains with state space G with stationary distributions p J,σ,σ ′ and ν J respectively.
Let us now observe that the above coupled process preserves ordering, in the sense that if G 0 ⊆ H 0 ∈ G then X t ⊆ Y t for all t = 1, 2, . . . .In view of our construction, to prove this it suffices to note that, for any G ⊆ H and any e, p(G, e, +) ≤ p(H, e, +) .The monotone coupling π of p J,σ,σ ′ and ν J may then be defined as the stationary distribution of the Markov chain (X t , Y t ).The required property (3.18) is a consequence of the above construction and the inequality (3.21), since if we start the Markov chain at (X 0 , Y 0 ) with X 0 ⊆ Y 0 we will have X t ⊆ Y t at all times and thus the limiting distribution π as t → ∞ must be supported on ordered pairs.Finally, we need to check that for any H ∈ G, the conditional distribution π(• | H) of G ⊆ H depends on the spin configurations σ, σ ′ only through their values σ V H , σ ′ V H on V H .To this end, observe that π(• | H) is the stationary distribution of the Markov chain (X t , Y t ) obtained by conditioning on Y t = H for all t, or equivalently the Markov chain defined as above with the restriction that any transition that would result in adding or removing an edge from the graph Y t = H is rejected.This yields the Markov chain X H t , t = 0, 1, . . .with state space G H = {G ∈ G : G ⊆ H} and transition probabilities given by (3.22) with the restriction that adding an edge e is only allowed if G ∪ {e} ⊆ H. Since the expression (3.22) factorizes over connected components of G, and since G ⊆ H it follows that (3.22)only depends on the coefficients φ(e) = Jxy , where e = {x, y} ∈ H, and therefore the stationary distribution π(• | H) has the same property.This ends the proof of Lemma 3.3.So far we have not used the weak interaction assumption (3.2), so Lemma 3.3 holds for arbitrary coefficients J xy .Next, we observe that the condition (3.2) implies a strong sparsity property of the measure ν J in Lemma 3.3.From the definition (3.12), this measure is the inhomogeneous Erdős-Rényi random graph where each edge e = {x, y}, x, y ∈ V , is included independently with probability (3.23) In particular, using (e z − 1)e −z ≤ z for z ≥ 0, one has p xy ≤ 4|J xy |, and if (3.2) holds, then for any x ∈ V , Moreover, if δ 0 is sufficiently small, then for any given x ∈ V the size of the connected component of G at x has an exponential tail.We make this precise in Lemma 3.8 below.

3.3.
Fragmentation with noise.We now develop the main construction behind the convergence result in Theorem 3.1.It is based on a perturbed fragmentation process, i.e., a process that combines the random fragmentations of the non-interacting case (as described in Section 2.4) with some competing noise represented by the random graphs encoding dependencies.Given a set A ⊆ [n] and a random graph G ∈ G, we define the random set A ′ as the vertex set of the union of all connected components of G that have non-empty intersection with A. More formally, write G = ∪ ℓ i=1 G i where G i are the connected components of G and let Then we set A ′ = V G(A) .We may sample A ′ starting from A by a breadth-first search, i.e., by revealing sequentially for each x ∈ A the neighborhood of x in G, then recursively the neighborhood of each vertex revealed at the previous step, and so on until there are no more neighbors to reveal.Clearly, A ′ may contain sites that are not in A.
, we independently declare x to be in or out by a fair coin flip.We thus obtain two random sets A in and A out , such that Next, for any A ⊆ [n], we define two sets Φ 0 (A), Φ 1 (A) by and Definition 3.4.The fragmentation plus noise process F t , t = 0, 1, . . . is the random process defined as follows.For each t ∈ N, F t consists of 2 t labeled fragments, i.e., (possibly empty) subsets obtained by repeated application of the following rule.At time zero we have 2 t−1 ), then for each i independently, we replace )) where Φ 0 , Φ 1 are the random maps defined by (3.24)-(3.25),so that We say that the process dies out if there is a time t such that all fragments are empty, i.e., F (t) i = ∅ for all i = 1, . . ., 2 t .With slight abuse of notation, we write F t = ∅ for the latter event.
In the non-interacting case J = 0 one has p xy = 0 for all {x, y}, and thus ) is one step of a pure fragmentation process, where the set A is partitioned into two subsets using independent fair coin flips for each vertex.In this case F (t) i ∩ F (t) j = ∅ for all i, j = 1, . . ., 2 t and for all t ∈ N. In particular, it is not hard to see that in this case Indeed, by construction F t = ∅ implies that there exist two vertices x, y ∈ [n] that belong to the same fragment up to time t − 1.Thus (3.26) follows as in the analysis of Section 2.4.In the interacting case, there is a first stage where the set A grows according to the local branching at every x ∈ A, and the fragmentation occurs only on those vertices that have an empty neighborhood in G. Our main technical result below establishes that the fragmentation plus noise process also satisfies a bound of the form (3.26), with a slightly weaker exponential decay rate, provided (3.2) holds for a suitably small δ 0 > 0.
Lemma 3.5.For any δ ∈ (0, 1), there exists δ 0 > 0, and a constant C δ > 0 such that if (3.2) holds with constant δ 0 then The proof of this lemma is quite technical and is postponed to Section 3.5.We turn first to the proof of Theorem 3.1.

3.4.
Proof of Theorem 3.1.We now have the tools to conclude the proof of Theorem 3.1, our main result for nonlinear block dynamics.Recall again the construction of T t (p) in terms of the binary derivation tree in Section 2.4.
By the invariance property (2.8), the target measure µ J,h can be obtained at the root by taking the distribution µ J,h on each leaf.Thus, Theorem 3.1 says that when each leaf is given a distribution p with the same marginals as µ J,h , the two distributions T t (p) and µ J,h can be coupled with an error at most Cn 2 e −ct for any t.We shall actually prove the following stronger result.Let p = (p 1 , . . ., p 2 t ) and q = (q 1 , . . ., q 2 t ) be arbitrary vectors of distributions in P(Ω) whose marginals on σ x at all sites x agree, i.e., p i , q i ∈ P(Ω) satisfy Let T t ( p) (resp., T t ( q)) denote the distribution at the root of the binary tree of depth t, where the leaf labeled i = 1, . . ., 2 t is equipped with the distribution p i (resp., q i ); recall Figure 2.1 for the case t = 2.
Theorem 3.6.There exist absolute constants δ 0 > 0, c > 0 and C > 0 such that, if (3.2) holds with constant δ 0 then for any choice of initial distributions p, q as in (3.27), Clearly, Theorem 3.6 implies Theorem 3.1 since we may take p i ≡ p ∈ P(Ω) and q i ≡ µ J,h , where the external fields h are chosen in such a way that p and µ J,h have the same marginals.
We shall prove Theorem 3.6 by analyzing the interaction history backwards in time, i.e., from the root to the leaves.This is reminiscent of the coupling from the past approach for linear Markov chains [42,34], and to some extent our proof is inspired by ideas that have been developed in that context.In particular, our proof is related to the information percolation framework developed by Lubetzky and Sly in [35].
Each internal node of the tree is associated with an interaction, or collision, which according to (3.4) is specified by the random set Λ with distribution γ(• | σ, σ ′ ).For each such interaction we reveal the realization of the graph G and of the Bernoulli variables B in V \ V G that are used in sampling Λ; see Lemma 3.3.In this way, starting from the root, we have a pair (G, B), G ∈ G and a subset B ⊆ V \ V G is identified with the set of x ∈ V \ V G for which η x = +1.Suppose the descendants of the root have distributions p and q respectively, as in Figure 3.1.Then, according to Lemma 3.3, the distribution at the root is given by where ν(G, B) = 2 −|V \V G | ν J (G), and for each realization (G, B), T (p, q | G, B) ∈ P(Ω) is the distribution Here, for τ ∈ Ω, the notation τ with the understanding that the value of Therefore, to compute the distribution T (p, q | G, B) we only need the marginals p V G ∪B and q V G ∪B ′ , where ), where Φ 0 , Φ 1 are the maps defined in (3.25).Indeed, by definition of the measure ν J , these random sets have the same distribution since, when A = [n], V G is equivalent to A ′ and B is equivalent to A in .One way to rephrase this is to say that, as far as the distribution T (p, q | G, B) is concerned, the only relevant information about the distribution p is contained in the set Φ 0 ([n]) and the only relevant information about the distribution q is contained in the set Φ 1 ([n]).
Graphical representation of the distribution at the root, when t = 2 and the leaves are equipped with distributions p ′ , q ′ , p ′′ , q ′′ .Each internal node is equipped with a realization of the random pair (G, B), where G ∈ G and Next, we move one step backwards in time and consider the interaction that produced the distribution p from the previous computation.Suppose that p ′ , q ′ are the distributions at the two descendants of p respectively, so that p = p ′ • q ′ , as in Figure 3.1.Suppose we revealed the realization (G ′ , B ′ ) of the graph and Bernoulli variables associated with this interaction.Note that we can use the same expressions (3.29)-(3.30) to compute T (p ′ , q ′ | G ′ , B ′ ), provided we replace (p, q) by (p ′ , q ′ ) and (G, B) by (G ′ , B ′ ).However, the key point is that we now only need the marginal of p on the set Φ 0 ([n]), and therefore when we compute T (p ′ , q ′ | G ′ , B ′ )(τ ) as above we can sum away all τ y , y / ∈ Φ 0 ([n]), so that the indicator function (3.30) is relevant only for sites x ∈ Φ 0 ([n]), and the distribution of )), where we recall that G ′ (A) is the union of all connected components of G ′ that have nonempty intersection with the set A. The latter property is a consequence of the product structure of the measure Hence we can neglect all connected components of G ′ that do not intersect Φ 0 ([n]), and we can discard the information about all Bernoulli variables at sites y ∈ V \ V G ′ such that y / ∈ Φ 0 ([n]).A close inspection of our definition of the maps Φ 0 and Φ 1 then reveals that the only information about the distributions p ′ , q ′ that is needed to compute )) and q ′ Φ 1 (Φ 0 ([n])) .Similarly, considering the interaction which produced the distribution q = p ′′ • q ′′ (see Figure 3.1), we may fix a realization (G ′′ , B ′′ ) of the graph and Bernoulli variables, and repeating the above reasoning one has that the only information about the distributions p ′′ , q ′′ that is needed to compute T (p ′′ , q ′′ | G ′′ , B ′′ ) is contained in the marginals p ′′ Φ 0 (Φ 1 ([n])) and q ′′ Φ 1 (Φ 1 ([n])) .We note that, after two steps of the evolution, conditional on the realizations of the variables (G, B), (G ′ , B ′ ), (G ′′ , B ′′ ), we have obtained a probability measure at the root depending only on (G, B), (G ′ , B ′ ), (G ′′ , B ′′ ) and on the marginals )) .Equivalently, using the notation introduced in Definition 3.4, after two steps we have that the correlations of the initial distributions at the leaves are entirely contained in the fragments Note that in this example, conditionally on the given realizations of the variables (G, B), (G ′ , B ′ ), and (G ′′ , B ′′ ), the distribution p • q at the root in Figure 3.1 can be computed only using the marginals (p ′ ) {1} , (q ′ ) {2} , (p ′′ ) {1,2} , (q ′′ ) {3,4} of the input distributions p ′ , q ′ , p ′′ , q ′′ .
Repeating the above procedure one has that, after t steps, conditional on the realizations of all graphs and Bernoulli variables involved in each of the 2 t − 1 interactions, the only information needed from leaf i is contained in the marginal of the distribution at that leaf on the fragment F (t) i , for each i = 1, . . ., 2 t , as defined in Definition 3.4.The crucial observation is that, as soon as a fragment either becomes empty or contains one site only, then the information carried by the corresponding leaf is irrelevant.Indeed, if it is empty this is obvious, while if it contains one site only then the marginal at that site is irrelevant since all marginals are assumed to be fixed, and in particular they are the same in any choice of the initial conditions p or q.This explains why we introduced the killing step (3.24) in our definition of the fragmentation plus noise process F t , which in turn is crucial to the probability of extinction we are able to establish in Lemma 3.5.
The above discussion shows that, if we denote by ( G, B) = (G (1) , B (1) , . . ., G (2 t −1) , B (2 t −1) ) the vector of realizations of the random graphs and Bernoulli variables involved in each interaction at the 2 t − 1 internal nodes of the binary tree of depth t, one can write where ν( G, B) = is the distribution of the random graphs and Bernoulli variables, while T t ( p | G, B) is some probability measure that may depend on ( G, B) and on p = (p 1 , . . ., p 2 t ) in a complicated way but has the property that its dependence on the distribution p i from the i-th leaf occurs only through the marginal of p i on the fragment = ∅ for all i, then T t ( p | G, B) = T t ( q | G, B).We note that the event F t = ∅ is measurable with respect to ( G, B), so that we may write and hence Next, we note that where the latter is the probability estimated in Lemma 3.5.Indeed, (3.34) is a consequence of our definition of the fragmentation plus noise process: we have already observed that each step of the fragmentation is produced with the correct distribution, and the product structure (3.32) of the measure ν( G, B) guarantees that all such steps are performed independently.From Lemma 3.5 we thus conclude that, for any δ ∈ (0, 1), there exists a constant δ 0 > 0 in (3.2), and a constant C δ > 0 such that This implies (3.28) (and in fact shows that we can take the constant c as close as we wish to log 2 provided δ 0 is taken suitably small).This ends the proof of Theorem 3.6 and Theorem 3.1.
We now need to provide the missing proof of Lemma 3.5.

3.5.
Proof of Lemma 3.5.The event F t = ∅ implies that there exists a fragment F (t−1) i at time t − 1 that has cardinality |F (t−1) i | ≥ 2. Indeed, by construction, if all fragments have size at most 1 at time t − 1, then F t = ∅; see (3.24).Since there are 2 t−1 fragments at time t, and since all the F (t−1) i have the same distribution, by a union bound it suffices to show that for any δ ∈ (0, 1),

P(|F
provided (3.2) holds with a sufficiently small δ 0 > 0. To prove (3.35) we shall use a stochastic domination argument that bounds the evolution of the fragment F (t−1) 1 , t ≥ 1 by means of independent labeled branching processes.
Consider n independent processes X y := {X y ℓ , ℓ = 0, 1, . . .}, y ∈ [n], such that for each y ∈ [n], X y is the labeled branching process with X y 0 = {y} and such that, at time t ∈ N, each individual with label x in the (t − 1)-th generation independently gives birth to the set of individuals U ⊆ [n] with offspring distribution where G(x) denotes the connected component of G containing x and, for any x ∈ [n], is the probability that x has a non-empty neighborhood in the random graph defined by (3.23)Let N y (t) denote the size of the whole population of the labeled branching process X y at time t − 1, i.e., the total number of individuals generated up to time t − 1.The proof of Lemma 3.5 is based on the following bound on the exponential moment of the random variable X y (t).Lemma 3.8.For all a ∈ (0, 1), there exists δ 0 > 0 and C a > 0 such that if (3.2) holds with constant δ 0 then, for all y ∈ [n] and all t ∈ N, E[2 aN y (t) ] ≤ C a . (3.38) We postpone the proof of Lemma 3.8 and conclude the proof of Lemma 3.5 assuming the validity of this bound.Now denote by X y ℓ the set of all individuals generated at the ℓ-th step, i.e., the ℓ-th generation of the process X y , and let |X y ℓ | denote its cardinality.With this notation, an inspection of the definition of the fragmentation plus noise process shows that F (t−1) 1 is stochastically dominated by the union of the X y 's, i.e., {F (ℓ−1) 1 , ℓ ∈ N} and the independent processes {X y , y ∈ [n]} can be coupled so that, for any ℓ = 1, 2, . . ., The main observation at this point is that, from the definition of fragmentation plus noise, the event |F | ≥ 2 for all 1 ≤ ℓ ≤ t.This in turn, by the domination (3.39), implies that at all times 1 ≤ ℓ ≤ t − 1 one has For this to happen there must be two processes X y , X z and a time s < t such that both X y , X z are alive up to time s and such that X y has at least two individuals in each generation from time s + 1 to time t − 1.For example, taking s = t − 1, this includes the case where both X y , X z are alive up to time t − 1, while taking s = 0 it includes the case where all processes die at the first time step, except for X y which has |X y ℓ | ≥ 2 for all 1 ≤ ℓ ≤ t − 1.We refer to Figure 3.2 for an illustrative example.
y z FIGURE 3.2.Illustration of the event in (3.40).The dashed line represents a time s up to which both processes X y and X z have cardinality at least 1 and after which the process X y has cardinality at least 2.
Using the independence of X y , X z , and the union bound one has the estimate P(|F by the total population of W x , i.e., the union of all sets W x ℓ , ℓ = 0, 1, . . ., where W x ℓ represents the ℓ-th generation of W x .To control the size of the total population of W x , we note that for any integer ℓ ≥ 1, a union bound and the multinomial theorem show that where the third sum extends over all nonnegative integer vectors {n y , y ∈ [n] \ {x}} such that y n y = ℓ.We use (3.43) to define a random variable N such that the size |V x | of the neighborhood at x is stochastically dominated by N , and N has distribution μ given by Indeed, provided ρ 0 < 1, (3.44) defines the geometric distribution with parameter 1 − ρ 0 , and by (3.43) one has for all ℓ ≥ 0. This implies the desired stochastic domination, so that V x and N can be coupled so that |V x | ≤ N .Thus, the size of the total population of W x is dominated by the total size |T | of the Galton Watson tree T with offspring distribution μ.In particular, we can use a coupling of U x and T such that ] denote the moment generating function of N .Standard calculations show that the extinction probability for T is given by the smallest positive solution s ext of s = ϕ(s), and since E[N ] = ρ 0 /(1−ρ 0 ) < 1 here one has s ext = 1.Recall also that ϕ is increasing and convex.With similar calculations one finds that for any fixed u > 0, the exponential moment where s * (u) is defined as the smallest positive solution of the equation s = uϕ(s).Note that we are free to choose u.Let us show that, if we take u = 1/(4ρ 0 ), then s * (u) ≤ 1/(2ρ 0 ).Indeed, by explicit calculation, Noting that the solution of the equation s = ϕ 0 (s) is s 0 = 1/(2ρ 0 ), we see that the desired solution s * (u) of s = uϕ(s) exists and satisfies s * (u) ≤ 1/(2ρ 0 ).From (3.38) and the previous arguments, using Markov's inequality one finds This proves (3.41).
We now turn to the proof of Lemma 3.8.Recall the definition of X y as the labeled branching process with offspring distribution µ x from (3.36).Call Ũx the random variable with distribution µ x .Let ν be defined as , where ρ * = 8ρ 0 .Note that this is a well defined probability provided 16ρ 0 ≤ 1.Let Ñ denote the integer valued random variable with distribution ν.Lemma 3.9 implies that | Ũx | is stochastically dominated by Ñ .
To see this, note that for any k ≥ 2, the event | Ũx | = k has the same probability as the event |U x | = k, where U x is the random variable with distribution ν x .From Lemma 3.9 we know that Moreover, by the definition of µ x one has P(| Ũx | = 1) = 1 2 − ρ x ≤ 1 2 , and therefore This proves the stochastic domination | Ũx | ≤ Ñ .Recall that N y (t) denotes the size of the whole population of the labeled branching process X y up to time t − 1.The above domination argument implies that, for any t ∈ N, we can dominate N y (t) by the size | T | of the total population of the Galton Watson tree T with offspring distribution ν, so that As in (3.45) we have that, for any fixed u > 0, E u | T | = s * (u), where s * (u) is the smallest solution of the equation s = u φ(s), for φ(s) = E s Ñ , the generating function of ν.We calculate for some fixed positive η ≤ 1/2, and note that κ(t) ≤ η if t ≤ η/18 for all η ∈ [0, 1/2].Thus, assuming sρ 0 ≤ η/18, one has Therefore, s * (u) ≤ (1 − η)/η 2 .For this value of s to also satisfy the requirement sρ 0 ≤ η/18, we need ρ 0 ≤ η 3 /(18(1 − η)).By the discussion at the end of Section 3.2, δ 0 ≤ η 3 /(72(1 − η)) is sufficient for this to hold.Under this assumption one has The estimate (3.38) then holds with 2(1 − η) = 2 a , C a = η −2 , provided δ 0 ≤ η 3 /(72(1 − η)).This completes the proof of Lemma 3.8.
Remark 3.10.Note that the value of δ 0 required for Lemma 3.8 to hold is precisely the value of δ 0 that we will require in the high-temperature condition (3.2) in our main result for the nonlinear block dynamics, Theorem 3.1.We have made no attempt here to optimize the dependence of δ 0 on our arguments, and with further work one could no doubt obtain a better, explicit bound.However, it is clear that our current arguments will not be able to obtain an optimal value of δ 0 , matching Dobrushin-type thresholds for rapid mixing for linear Markov chains.For instance, in the mean field case where J ≡ β/n, the threshold for the equilibrium phase transition is β = 1, and thus δ 0 < 1 would be an optimal condition.

NONLINEAR GLAUBER DYNAMICS
Recalling the definition (2.9), the interaction (2.11) in the nonlinear Glauber dynamics takes the form where We write S t (p), t ∈ N, for the t-th iteration of p → p • p. From Theorem 2.7 we know that for any fixed interaction matrix J, and any initial distribution p ∈ P(Ω), the above dynamics converges to µ ,h , where h is the unique vector of external fields such that µ J,h and the initial state p have the same marginals on σ x , for all x ∈ V .Our main result for the nonlinear Glauber (single site) dynamics, Theorem 1.2 in the introduction, establishes a tight bound on the rate of convergence as a function of n = |V |, under the same Dobrushin-type condition (3.2) on the interaction as in the case of block dynamics.We restate this theorem here for convenience.
Theorem 4.1.There exist absolute constants δ 0 > 0, c > 0 and C > 0 such that, if (3.2) holds with constant δ 0 , then for any p ∈ P(Ω) and t ∈ N, where h ∈ R n is the unique choice of external fields such that p x = (µ J,h ) x for all x ∈ [n].In particular, for any ε > 0, one has S t (p) − µ J,h TV ≤ ε as soon as t ≥ n c log n + C 1 (ε), where C 1 (ε) = n c log(C/ε).The main difference with respect to Theorem 3.1 is the rate of exponential decay, which is n times slower in this case.This reflects the intuitive fact that only one spin is exchanged at each time step whereas Θ(n) spins are exchanged in a typical transition of the nonlinear block dynamics studied in Theorem 3.1.
The proof of Theorem 4.1 follows a similar strategy to that of Theorem 3.1, but with some important technical differences.We begin with an analog of Lemma 3.3, in which the role of Erdős-Rényi graphs is now played by random star graphs centered at a specific vertex x.

4.1.
Coupling with inhomogeneous random star graphs.We start with a single site version of Lemma 3.2.

Lemma 4.2. For any
Proof.The expression (4.2) follows directly from (1.3), the definition (2.9) and the observation that Remark 4.3.Consider the random variable η x which takes the value −1 with probability α x (σ, σ ′ ), and +1 with probability 1 − α x (σ, σ ′ ).Then by Lemma 4.2, its probability density can be written as α and Jxy is defined as in (3.11).This also shows that α x , then Jxy = 0 for all y = x, and therefore α and let G x be the set of all subgraphs of the star graph S x = (V, E x ) with vertex set V = [n], and edge set E x := {{x, y}, y ∈ [n] \ {x}}.We view G ∈ G x as a collection of edges, i.e., a subset of E x , with no isolated vertices, and write P(G x ) for the set of probability measures over G x .Note that any G ∈ G x is always connected, but can be the empty graph (with no vertices).We use V G , E G for the vertex and edge sets of G ∈ G x , respectively.The following is the single site version of Lemma 3.3.
• if x / ∈ W t then set W t+1 = W t • if x ∈ W t then let V x denote the neighborhood of x in G, i.e., the random variable with distribution as in (3.42): if From our definitions we see that 1 has the same distribution as W t .Next, consider a continuous time version of the process W t .Namely, let T 1 , T 2 , . . .denote the arrival times of the Poisson point process with intensity 1 on [0, ∞) and call W t , t ≥ 0, the process obtained by repeating the steps above at each arrival time T j , so that W t = W Tt for every t ∈ N.
From standard properties of the Poisson process, the random process W t can be equivalently obtained by attaching to each x ∈ [n] an independent Poisson process with intensity 1/n, and performing updates at x independently at the arrival times of the process at x. Neglecting the overlap between different branches, and using the domination argument in the proof of Lemma 3.9, one can stochastically dominate the cardinality of W t by where M x t , x ∈ [n] are i.i.d.copies of the continuous time branching process obtained by setting M x 0 = 1 and then letting each branch evolve independently with branching times given by exponential random variables with parameter 1/n.In this process, each individual gives birth to ℓ individuals with probability µ * (ℓ) given by where μ is defined as in (3.44) with the same value of ρ 0 .(See [5] for background on continuous time branching processes.)Since M 1 t is subcritical as soon as δ 0 is sufficiently small, and μ has exponential tails, standard estimates on subcritical continuous time branching processes imply that P(M 1 t > 0) ≤ Ae −at/n for some constants a, A > 0. It follows that To conclude the proof, recall that W t = W Tt and that T t is the sum of t i.i.d.exponentials of parameter 1, so that for some constant c > 0. Therefore, for all t ∈ N, P(C {C t = ∅} = {Q Tt = ∅}.Note that the branching times (but not the choices of vertices x to be updated) of newly added evolutions are fully coupled to the branching times of the master evolution dictated by the original Poisson process.If, for every newly opened evolution we use instead an independent realization of the Poisson point process, then we obtain a global process Q t , which is dominated as explained above by the process Z t .Thus it is sufficient to find a coupling of Q t and Q t such that P(Q Tt = ∅) ≤ P( Q Tt = ∅) + e −bt , for some constant b > 0.
Note that if b > 0 is a sufficiently small constant, then a simple large deviation bound shows that the probability that a Poisson process of intensity 1 has a total number of arrivals in a time t/2 that is less than bt, is at most 3 −t .Thus, if there are 2 t such processes, the event E t that all of them have at least bt arrivals within time t/2 satisfies P(E c t ) ≤ 2 t 3 −t = (2/3) t .
We As in the proof of Theorem 3.1, the first step is to use the coupling with random (star) graphs and Bernoulli variables to decompose the distribution at every internal node of the interaction tree.For this we use Lemma 4.4.This leads to the analysis of the branching process, which in this case takes the form of coupon collecting with noise.The conclusion then follows from Lemma 4.8.With the notation from Theorem 3.6 we are going to establish that there exist absolute constants δ 0 > 0, c > 0 and C > 0 such that, if (3.2) holds with constant δ 0 then, for any choice of initial distributions p, q with the same marginals as in (3.27), S t ( p) − S t ( q) TV ≤ Cne −c t/n , t ∈ N , (4.12) where S t ( p) denotes the distribution at the root of the binary tree when the leaves are equipped with the distributions p and the single site interaction (4.1) occurs at each internal node.
For each interaction we draw a uniformly random vertex x ∈ [n], and reveal a realization of the neighborhood G = V x and a Bernoulli variable B ∈ {−1, +1}.Then, according to Lemma 4.4, the interaction (4.1) has the form and, for τ ∈ Ω, the notation τ ∼ (σ, σ ′ , η x ) stands for τ ∼ (σ, σ ′ , η x ) ⇔      τ x = σ x , η x = +1 τ x = σ ′ x , η x = −1 τ y = σ y , ∀y = x (4) As mentioned in the introduction, our results for the Ising model generalize to any spin system with a constant number of spins at each vertex and bounded pairwise interactions, including the q-state Potts model, under an analogous Dobrushin-type condition (3.2).(In that condition, the sum is now over the maximum absolute values of all interactions involving any given site x.)This follows from the fact that, as can readily be checked, the representation of the measure γ(• | σ, σ ′ ) in terms of a two-spin system as described in Lemma 3.2 still holds in this more general setting, and beyond that point the rest of the analysis depends only on that representation.(5) We note that, in contrast to linear Markov chains (which can be simulated using a single point walking randomly around the state space), nonlinear dynamics do not immediately provide an efficient algorithm for sampling from the stationary distribution, even when the convergence time is short.This is a consequence of the fact that, in order to obtain a single sample from the time-t distribution T t (p), we would naively need to simulate the entire evolutionary tree of depth t, which begins with 2 t independent random samples from p at the leaves.(Indeed, Arora et al. [4] prove that simulating arbitrary reversible quadratic dynamics is in fact PSPACE-complete.)However, for both of the dynamics we consider here, our analysis does in fact also provide a polynomial time algorithm for sampling from (very close to) the stationary distribution µ J,h .This is immediately obvious for the nonlinear block dynamics: since Theorem 1.3 establishes that t = O(log n) steps are enough for convergence, the entire tree is of polynomial size and thus can be constructed in polynomial time.
(A detail here is that each interaction in the tree requires sampling the set of sites Λ to be exchanged according to the distribution (3.11).However, this itself is a high-temperature Ising Gibbs measure, and thus can be sampled from efficiently by other means.)For the nonlinear Glauber dynamics, a polynomial time simulation is also possible once we observe from our analysis that the actual size of the master evolution for S t (p) is O(t); since the master evolution is sufficient to construct our time-t sample, this can be done in poly(n) time.
The simulations in the previous paragraph leave much to be desired algorithmically, as they are rather unwieldy and, more significantly, do not retain the underlying pairwise interaction structure of the nonlinear dynamics.An interesting related question, for both processes, is whether a more explicit simulation is possible: namely, starting with a finite population of size N = poly(n), each member of which is sampled independently from the initial distribution p, evolve this population in the obvious way (by carrying out interactions between randomly chosen pairs to construct the nextgeneration population of the same size).The question is whether the time-t population in this finite implementation is close to the true population T t (p), at least for modest times t-this is not obvious since unwanted correlations will arise due to the finite population size, though these will decrease with the population size N .(This implementation, which is actually a Markov chain on a very large state space, is known as the "Kac model" in kinetic theory [32], and the convergence to the true population is referred to as "propagation of chaos".)In the non-interacting (population genetics) case, it was shown that a low-degree polynomial population size does in fact suffice [43,12].It would be interesting to see if this argument can be extended to the Ising model at high temperature; in particular, since the key insight in [43] comes from the fragmentation process, the question boils down to whether the same insight carries through in the presence of noise.
and observe that ν(F ) = D(ν 1 µ 1 ) + D(ν 2 µ 2 ) while ζ(e F ) = 1 because of the product structure.This proves (2.27).Now, in view of the symmetry (2.19), the marginal of ρQ on the first element of the pair equals the marginal of ρQ on the second element of the pair, and they are both equal to p • p.It follows that D(ρQ µ ⊗ µ) ≥ 2D(p • p µ) .

1 2 (
S(p, q | x, G, B) + S(q, p | x, G, B)) (4.13)whereS(p, q | x, G, B)(τ ) = σ,σ ′ p(σ)q(σ ′ ) ηx∈{−1,+1} [11]rk 2.5.For (linear) Markov chains, an irreducibility statement as inLemma 2.4immediately implies exponential convergence to stationarity (at some possibly very slow rate).For the nonlinear dynamics considered here, however, it is an open question whether this holds.We will see in the next subsection that Lemma 2.4 is enough to ensure convergence, though without any information about the rate; one reason why this may be delicate is the fact that, for our nonlinear dynamics, the total variation distance to stationarity is in general not monotonically decreasing[10, Remark 2.7].However, our main results (Theorems 1.2 and 1.3 in the introduction) do imply exponential convergence in the high-temperature regime (i.e., when the interactions in J are sufficiently weak).2.6.Convergence to stationarity.As we have seen, reversibility implies that, for a fixed interaction matrix J, the Ising measures µ J,h defined in(1.3) are all stationary, regardless of the choice of h.In fact, these are the only stationary distributions, as proved in[11]: Lemma 2.6.[11, Lemma 3.2] For both the above dynamics, for any fixed interaction matrix J, a distribu- tion µ ∈ P(Ω) is stationary for (2.3) if and only if µ has the form (1.3) for some choice of the fields h.
. Notice that by definition either G(x) is empty or |G(x)| ≥ 2, and therefore (3.36), for any x ∈ [n], defines a probability measure on subsets of [n]: a sample U from µ x is obtained by first sampling the neighborhood G(x) from ν J ; if |G(x)| ≥ 2 then we set U = G(x); if G(x) = ∅ then we flip a fair coin and set U = ∅ if heads and U = {x} if tails.