Physical Oscillator Model for Supercomputing

A parallel program together with the parallel hardware it is running on is not only a vehicle to solve numerical problems, it is also a complex system with interesting dynamical behavior: resynchronization and desynchronization of parallel processes, propagating phases of idleness, and the peculiar effects of noise and system topology are just a few examples. We propose a physical oscillator model (POM) to describe aspects of the dynamics of interacting parallel processes. Motivated by the well-known Kuramoto Model, a process with its regular compute-communicate cycles is modeled as an oscillator which is coupled to other oscillators (processes) via an interaction potential. Instead of a simple all-to-all connectivity, we employ a sparse topology matrix mapping the communication structure and thus the inter-process dependencies of the program onto the oscillator model and propose two interaction potentials that are suitable for different scenarios in parallel computing: resource-scalable and resource-bottlenecked applications. The former are not limited by a resource bottleneck such as memory bandwidth or network contention, while the latter are. Unlike the original Kuramoto model, which has a periodic sinusoidal potential that is attractive for small angles, our characteristic potentials are always attractive for large angles and only differ in the short-distance behavior. We show that the model with appropriate potentials can mimic the propagation of delays and the synchronizing and desynchronizing behavior of scalable and bottlenecked parallel programs, respectively.


INTRODUCTION AND RELATED WORK 1.Additive runtime models
Analytic white-box (first-principles) performance models are based on a simplified mathematical description of the hardware, the software, and their interactions.They deliver an unmatched level of insight into performance bottlenecks, but they also tend to be imprecise beyond the node level; this is because the intricate interplay between node-level and cluster-level bottlenecks challenges the assumption that the overall runtime is the sum of computation time and communication overhead:  total =  comp +  comm [2].Many parallel programs do have the regular compute-communicate cycles underlying this model, but they still fail to show additive runtime.The reason is that, irrespective of a perfect translational symmetry, noise, slight load imbalances, hardware topology, and the presence of hardware bottlenecks leads to processes going out of their "natural" lock-step [3,5], which may have surprising consequences for performance in applications [1,6].

Coupled oscillators
The motivation for describing a parallel program in terms of a dynamical system emerges from the observation of analogies between the dynamics of coupled oscillators and the behavior of processes in bulk-synchronous MPI-parallel codes.For example, if a singular effect causes one MPI process to be delayed, the communication dependencies cause this delay to "ripple" through the system, a phenomenon called an idle wave.Idle waves get damped as they travel and will run out eventually [16].For parallel programs that are not subject to bottlenecks like memory or network bandwidth, this behavior leads to a resynchronization of the processes [2], similar to a swarm of fireflies flashing randomly at first but later going into sync [22].Memory-bound programs, on the other hand, are prone to desynchronization and can develop a permanent out-of-sync state that we call a computational wavefront [3].
A plethora of research has been conducted in engineering and physics about the general phenomenon of self-synchronization among coupled "oscillating entities." Motivational examples of oscillators are the iconic swarms of fireflies, lasers, the Millennium Bridge, coupled metronomes, neurons, and heart cells.There is an abundance of literature on the paradigmatic Kuramoto model [14], which is the most popular among the many models proposed for a description of synchronization [19].Numerous reviews [8,21,22] have been dedicated to the analysis of the Kuramoto model in exploring the onset of synchronization for networks of coupled oscillators.This paper uses the Kuramoto Model as a motivation to describe the local and global dynamics of communicating processes in parallel programs on high performance computing clusters.
To the best of our knowledge, describing the execution of a parallel program by mapping it to a collection of coupled harmonic oscillators is a novel idea.The model presented here seems to provide a good representation of the qualitative behavior of parallel code, especially in terms of bottleneck behavior, de-and resynchronization, and the propagation of noise.The coupling among the oscillators is implemented as a custom interaction potential, which differs from the one in the Kuramoto model, and a topology matrix, which describes dependencies between processes and mimics the communication topology of a parallel program.Note that there is an obvious mismatch between the physical concept of a harmonic oscillator and an MPI process which "oscillates" between different states of execution and/or communication.However, this is also true for many physical systems for which Kuramoto-like models were used to describe collective phenomena.
Motivation The physical model is an unknown analogy in the parallel computing field that enables simple and cheap experimentation by solving a system of ODEs rather than running highly parallel programs.The model, which we provide as a convenient Matlab application, is fully parameterized.Beyond the additive runtime model, a physical model seems more appealing for some scenarios since it incorporates communication topology, noise, and bottlenecks for the description of parallel distributed computing.
Contributions In this paper, we show that a collection of coupled harmonic oscillators with a suitably chosen interaction potential and connectivity (or topology) matrix can reproduce dynamical effects in scalable and memory-bound parallel programs such as propagation and decay of idle waves, re-and desynchronization, and the formation of computational wavefronts.We also provide a MATLAB implementation 1 [7] of the model that can be used to explore its parameter space.
Overview This paper is organized as follows: After covering the motivation for having a physical model for supercomputers, Sec. 2 provides details about the search for suitable physical analogs.Sec. 3 covers the description of the physical model and the analogy between the model and the parallel programs.After providing the details about our experimental environment and methodology (Sec.4), the evaluation is done in Sec. 5. Finally, Sec. 6 discusses the findings and provides an outlook to future work.

SEARCH FOR POSSIBLE PHYSICAL MODELS
Many analogies are thinkable between interacting processes in parallel programs and physical models.Here, we briefly discuss the possible options, highlighting their strengths and weaknesses.

Continuum models
There are a lot of possible continuum models which include wave equations with dissipative and non-linear parts [11,12] and which may serve as blueprints of wave-like behavior in parallel programs.For instance, solitons [10] are solutions of weakly nonlinear dispersive PDEs.They are stabilized by a cancellation of nonlinear and dispersive effects in the medium and propagate at a constant velocity and are similar in character to idle waves.Other motivational examples of PDEs include Navier-Stokes and shallow-water equations, etc. [9,20].Despite the appeal of such analogies, it is as yet unclear how a continuum limit of the discrete MPI processes in parallel programs should be performed.

Discrete models
We considered the following two discrete models to look for a physical analogy to the dynamics of parallel programs.

Ising Model (IM).
Models like the Ising model [13] describe interacting entities (spins in the Ising case) and show interesting effects like symmetry breaking, phase transitions, and long-range correlations.The system tends to assume a state of lowest energy.Heat disturbs this tendency, thus creating the possibility of different structural phases.Since this model lacks the necessary natural oscillating behavior (processes alternate at least between computation and communication), it seems unsuitable for the desired analogy.

Kuramoto Model (KM).
The Kuramoto model [14,15] is an iconic, well-studied model of self-organizing behavior on which a lot of literature exists, including its continuum limit.It describes a collection of  oscillators with natural frequencies {  } which are coupled via a periodic interaction potential that depends on the 1 https://github.com/RRZE-HPC/OSC-ADmutual phase differences (  () −   ()): The coupling strength is globally parameterized by the parameter .There are several reasons why the plain Kuramoto Model is unsuitable for describing parallel programs.
First, it implements full connectivity, i.e., a global, all-to-all coupling of the oscillators, which is an undesirable pattern in parallel computing.In fact, a parallel program with this pattern shows extremely fast idle wave propagation and no potential for natural process desynchronization because the all-to-all connectivity acts like a synchronizing barrier in each time step (analogous to one oscillator period).Later research modified the model by incorporating varying coupling strengths between pairs of oscillators, which can be seen as a kind of network topology [18,21].We will come back to this idea later.
Second, the periodic sinusoidal interaction potential in the KM leads to the onset of self-synchronization of the oscillators, a phenomenon that occurs also in resource-scalable parallel programs [3]. 2owever, resource-bottlenecked (e.g., memory-bound, communication-bound) parallel programs also exhibit spontaneous-onset desynchronization [3], which the original Kuramoto Model does not show.
Third, the periodicity of the KM potential allows for phase slips, i.e., two oscillator phases can be a multiple of 2 apart without changing the dynamics.This is impossible with periodically communicating processes where usually a computation phase cannot start before a message has been received.Finally, parallel computing systems are subject to many sources of noise, which the plain KM does not know.However, later research modified the model to include it [21].

PHYSICAL OSCILLATOR MODEL (POM) 3.1 Theoretical description
We use the following modified coupled oscillator model to describe  parallel MPI processes on a cluster: The intrinsic angular frequency   of the oscillators in the plain Kuramoto Model is substituted by the frequency of the processes alternating between computation and communication phases of duration  comp and  comm , respectively.Noise in MPI can take different forms; here we include time-dependent process-local noise with some distribution   () and interaction noise    (), i.e., random delays caused by varying communication time.The former is implemented as a jitter in the local oscillator frequency and can also serve to model load imbalance, while the latter impacts the phase difference  (,    ()) =   ( −    ()) −   () [21].The coupling strength   =  •  comp + comm is motivated by the connection between idle wave speed and communication characteristics as described in [4]: Messages sent via the eager (rendezvous) protocol have  = 1 (2), and  is the sum over all communication distances.However, if the outstanding non-blocking MPI requests of all communication partners are grouped in the same MPI_Waitall, the parameter  becomes equal to longest distance only [4].The topology matrix    together with the interaction potential  describes how processes interact: We have    = 1 if there is a connection between oscillators  and , and    = 0 otherwise (see Fig. 2).The shape of the potential  (•) can be used to distinguish between synchronizing (i.e., Kuramoto-like) and desynchronizing behavior.

Model implementation
To solve the first-order coupled ODEs system (2) we use a robust explicit Runge-Kutta (4,5) method (Dormand-Prince).We provide a full implementation of the solver, including a graphical interface for easy experimentation, as a Matlab code in the artifact appendix 1 .
It is a flexible tool to study the influence of the model parameters and allows to set different initial conditions (synchronized, desynchronized), noise (both variants in our model), one-off delays, coupling strengths, and communication topology.
Three options for result visualization are provided: (i) the circle diagram, where colors represent the different frequencies, with blue being fast and yellow being slow, (ii) the timeline of phase differences for oscillators, and (iii) the timeline of potentials (see Listings in the artifact appendix 1 ).To demonstrate analogy between MPI program and model, in the standard view we choose the timeline of oscillator phases   −  normalized to the slowest ("lagger") process as the baseline.

TEST BED AND EXPERIMENTAL SETUP
We use MPI-parallel toy codes for comparisons with the oscillator model.They include MPI point-to-point communication via MPI_Irecv, MPI_Send, and MPI_Wait*.To mimic resource-scalable parallel programs, we use a "PISOLVER" code which calculates the value of  by computing ∫ 1 0 4/(1 +  2 ) d using the midpoint rule and 500 M steps.For resource-bottlenecked parallel programs we run pure-MPI versions of the STREAM Triad [17] A(:) = B(:) + s*C(:) and of the "slow" Schönauer Triad A(:) = B(:) + cos(C(:)/D(:)), for which the low-throughput cosine and floatingpoint division shifts the bandwidth saturation point to a higher number of cores.The scaling behavior across the cores of one socket is shown in Fig. 1(b) for the three codes.In all cases we add MPI with short messages after each sweep to implement different communication topologies.The benchmark results in terms of delay propagation, de-, and resynchronization were published in previous work [2][3][4].We present the results on one of the benchmark systems (Meggie3 ) here for reference; the results for the other system (SuperMUC-NG) can be found in the artifact appendix 1 ."Meggie" is a fat-tree 100 Gbit s −1 Omni-Path cluster with dual-socket nodes (64 GB DDR4) comprising ten-core Intel Xeon "Broadwell" E5-2630v4 CPUs (2.2 GHz, 68 GB/s).We ran programs with 40 and 18 MPI processes on 4 and 2 sockets of the Meggie system, respectively.Working sets for the memory-bound codes were chosen large enough to not fit into any cache, i.e., at least 10× the last-level cache size.We refer to artifact appendix 1 for further details on the experimental setup, the methodology, and the hardware and software environment.

EVALUATION AND IMPLICATIONS
To evaluate the modeling power of our novel coupled oscillator model, we describe its connections to MPI program phenomenology. Figure 2 visualizes some of the corner cases with next-neighbor (top row) and next-plus-next-next neighbor (bottom row) communication and scalable (left column) versus saturating (right column) codes, using MPI traces (inner images) and phase visualizations (circles).

Delay propagation
5.1.1Resource-scalable parallel programs.Figure 2 shows MPI traces obtained using Intel Trace Analyzer and Collector (ITAC) with computation (white) and communication (red) Within the circular phase visualizations of the physical oscillator model.Idle waves originating from a one-off delay (extra workload performed by the 5th MPI process in blue) ripple through the parallel program due to the dependencies between processes mediated via MPI communication.This is called delay propagation.Idle waves interact nonlinearly with each other and with system noise, leading to their eventual decay.Communication properties (periodicity, range), system noise, and system topology all have a huge impact on the speed and general behavior of the wave [4], but the most important property is its speed.Very weak or nonexistent coupling (corresponding to  ≈ 0 in the model (Eq.2)) corresponds to free processes with no dependencies.The case  = 1 describes nextneighbor coupling and minimum idle wave speed, leading to slow relaxation into a synchronized state.The larger  the faster the wave and the "stiffer" the system gets, until with very large  the system is strongly synchronizing since any variation is immediately propagated through the whole system.
5.1.2Resource-bottlenecked parallel programs.The propagation speed of idle waves is influenced by the coupling in the same way for non-scalable as for scalable programs.One major difference is that for memory-bound code, idle waves have an additional decay mechanism even under noise-free conditions, and after the idle wave has run out, a residual computational wave with desynchronized execution remains; see Fig. 2(b, d).

5.
2 Connecting scalability and model potential 5.2.1 Self-synchronization in resource-scalable parallel code.Bulksynchronous, resource-scalable applications exhibit self-resynchronization on an undisturbed system, i.e., the coupling favors a close proximity of the phases.A disturbance or "pull" causes phase differences across oscillators, but the system "snaps back" into a synchronized state, i.e., all oscillators are eventually in phase with the same natural frequency   ; see Fig. 2(a, c).The interaction potential in the oscillator model should be able to mimic this behavior, but the periodic sinusoidal potential of the plain KM is unsuitable since it allows for phase slips (see above) and has zeros at multiples of .
One option is a hyperbolic tangent; red line Fig. 1(a): This potential forces oscillators with any phase difference into sync, which is the property we are looking for.

Self-desynchronization in resource-bottlenecked parallel code.
Bulk-synchronous bottlenecked programs on a silent system avoid the bottleneck by drifting out of lockstep.In other words, the translationally symmetric state is unstable and any slight disturbance blows up and leads to a broken-symmetry state [3].In the model, this means that each oscillator still has the same frequency but there is a phase difference between adjacent oscillators; see Figs. 2(b, d) for the asymptotic state of the phase differences among the oscillators in the circular phase diagram and the runtime differences among processes on three of four Meggie sockets in the MPI traces.
There are many interaction potentials that would show this property; the requirement is to have attraction at long range but repulsion at short range.We use the following function: where  acts as an "interaction horizon" which marks the transition to a constant potential in either direction; see the blue line in Fig. 1(a).A small  corresponds to almost synchronized oscillators, or a "stiff" parallel code with long-range, synchronizing communication since the phase differences settle at the first zero of the potential, which is at 2/3.Large  describes strong desynchronization with short-range dependencies.The "interaction horizon"  is thus directly correlated with the idle wave propagation speed and the phase spread.For instance, increasing the communication stiffness from

DISCUSSION AND FUTURE WORK
Motivated by the well-known Kuramoto model, we have suggested a description of MPI-parallel, bulk-synchronous and barrier-free codes using a novel physical model comprising coupled oscillators.The coupling is described by a topology matrix and an interaction potential, where the former mimics the dependencies between processes enforced by MPI communication and the latter is used to model different scaling behavior on the system's inherent hardware bottlenecks.We introduced potentials for those two cases which can reproduce the asymptotic behavior of memory-bound and scalable programs under one-off noise injections and subsequent idle wave propagation: resynchronization, i.e., re-establishing the translationally symmetric state, and desynchronization, i.e., settling in a state of broken translational symmetry.The potentials were designed so that different communication behavior in the MPI program can be described in a qualitative way.
As the parallelism of HPC clusters grows, the inherent nodelevel and network-level bottlenecks become more difficult to assess, and their interaction with parallel code causes counter-intuitive effects.Furthermore, frequently synchronizing parallel programs are incompatible with massive parallelism; in the future, parallel code may be more strongly task-based and asynchronous, allowing for slow idle wave progression and desynchronization.Since the number of model parameters is very small, the physical oscillator model provides a simple way to characterize systems where collective behavior is relevant for program performance.If a well-defined continuum limit of the model can be found, it could be useful in hardware-software co-design for expensive components like network infrastructure. 4The deviation from lock-step behavior could help to evade bottlenecks, allowing for cheaper design points in the hardware.
Although the idea of mapping a parallel program to a physical oscillator model is novel and appealing, it is as yet unclear whether more refined models and/or potentials could be found that do a better job at mimicking parallel program dynamics; for instance, we have not yet explored the role of the noise functions that the physical model includes and whether these would be able to properly describe idle wave decay.It is also unknown whether the symmetry-breaking transition observed for non-scalable programs (and in the model) is connected to a Goldstone mode.These questions are subject to future work.

1 𝜎Figure 1 :
Figure 1: (a) Potential  ((  −   )) for scalable (red) and nonscalable (blue) parallel programs.In the latter, the position of the first zero defines the stable desync state.(b) Scalability of the MPI-parallel micro-benchmarks on Meggie cluster.

Figure 2 :
Figure 2: Analogy between MPI programs and physical oscillator model for the communication typologies of  = ±1 (top) and  = ±1, −2 (bottom).Inside the circles we show MPI traces of idle waves progressing.The dots on the circle show the asymptotic state of the phase differences among the oscillators in the model.See text for more details.Video animations are available at http://tiny.cc/MPI_triad.

Figure 2 (
b) to Figure 2(d) led to a threefold increase in the speed of delay propagation (seen in the MPI traces) and a corresponding decrease in oscillator phase spread in the asymptotic state (circular phase diagrams).