JuliQAOA: Fast, Flexible QAOA Simulation

We introduce JuliQAOA, a simulation package specifically built for the Quantum Alternating Operator Ansatz (QAOA). JuliQAOA does not require a circuit-level description of QAOA problems, or another package to simulate such circuits, instead relying on a more direct linear algebra implementation. This allows for increased QAOA-specific performance improvements, as well as improved flexibility and generality. JuliQAOA is the first QAOA package designed to aid in the study of both constrained and unconstrained combinatorial optimization problems, and can easily include novel cost functions, mixer Hamiltonians, and other variations. JuliQAOA also includes robust and extensible methods for learning optimal angles. Written in the Julia language, JuliQAOA outperforms existing QAOA software packages and scales well to HPC-level resources.


INTRODUCTION
The Quantum Alternating Operator Ansatz [20] (QAOA), building on the earlier Quantum Approximate Optimization Algorithm [13,14], is a leading quantum algorithm for solving combinatorial optimization problems.Many questions remain open regarding the overall power of QAOA, as there exist few theoretical guarantees on performance.Instead, QAOA is most commonly studied as a heuristic optimization tool, involving both a classical outer loop and quantum inner loop.The building blocks of QAOA are an optimization problem, encoded in a cost function  () on binary strings (), an initial state | 0 ⟩, a unitary determined by the mixer Hamiltonian   , and a set of 2 parameters {  ,   } known as angles.Finally, the cost function  () is used to create a cost, or phase separator

Pre-computation
Learning Angles Quantum Simulation Save data for re-use To begin evaluating a QAOA, JuliQAOA precomputes the value of the cost function  () across all feasible states, as well as a diagonalized version of the mixer Hamiltonian.The quantum simulation subroutine calls on this information to efficiently compute ⟨ ()⟩.The angle-finding outer-loop uses ⟨ ()⟩ to find optimal angles, saving results for intermediate steps.
The physical intuition behind QAOA is that the the mixer Hamiltonian is designed to generate destructive interference between states with poor  () and constructive interference between states with good  ().QAOA can also be viewed as a Trotterization of quantum annealing, with   serving as the initial Hamiltonian and the   as the target.
While there are some theoretical proofs on the efficacy of QAOA [13], the ideal choices for many of the above quantities -| 0 ⟩,   , classical optimization technique -for a given  () remain the focus of much active research.Numerical experimentation is therefore a critical tool in evaluating the potential and limitations of QAOA.Previous numerical methods for testing QAOA have largely been based off of general purpose quantum circuit simulators [6,32], which are often costly to execute and have thus limited simulations to small numbers of qubits and few rounds (e.g. < 10,  < 5).Meanwhile, purpose-built QAOA simulation has pushed up to  = 30, but only in the special case of MaxCut on 3-regular graphs [24].Here we introduce the QAOA simulation tool JuliQAOA, which has been specifically built to study a wide range of QAOA problems.JuliQAOA enables quick, low-overhead QAOA simulation on personal computers, while also scaling well to HPC-level resources.JuliQAOA has already enabled robust numerical studies in several publications [17][18][19]29], and will be released as an open source package later this year to facilitate further development.

JULIQAOA OVERVIEW
JuliQAOA is a quantum simulator expressly built for exact statevector simulation of QAOA, exploiting several QAOA-specific features in order to reduce time and memory usage.In particular, the repeated nature of the QAOA algorithm means that several key components can be pre-calculated and re-used throughout the simulation.JuliQAOA is designed for large-scale and wide-ranging numerical experimentation, incorporating a broad variety of cost and mixer Hamiltonians.See Figure 1 for a conceptual overview.The core distinguishing features of JuliQAOA are: • No circuits -does not require a circuit-level description of either the cost or mixer Hamiltonians.• Flexible -can efficiently study both constrained and unconstrained optimization problems, and incorporate nontraditional QAOA approaches (e.g.multi-angle [21], threshold phase separator [18], different initial states [11]).• Performant -outperforms other QAOA implementations in terms of time and memory.• HPC-friendly -easy to use on large scale HPC clusters with either GPUs or CPUs.
As an example of the capabilities of JuliQAOA, Figure 2 shows the results of using JuliQAOA to find high-quality angles for random instances of four different problem types (MaxCut, 3-SAT, Densest -Subgraph, -Vertex Cover), each with a different mixer (Transverse Field, Grover, Clique, Ring), at  = 12 qubits up to  = 10 rounds.The data for Figure 2 was generated on an Apple M2 Max laptop in ≤ 1hr.
The flexibility and performance of JuliQAOA comes from three main sources: pre-computation, efficient quantum simulation, and robust angle-finding.

Pre-Computation
The pre-computation step is used to store key quantities which will be re-used throughout the calculation.First, for a given optimization problem characterized by a cost function  (), the user computes and stores the value of  () across all feasible states .In the case of unconstrained optimization,  is the set of all -qubit computational basis states.For constrained optimization  is a subspace, e.g., in the cases of -Densest Subgraph or Max -Vertex Cover the feasible solutions are states with  qubits set to 1 (also known as the Dicke state |   ⟩, which is an equal weight superposition of all  qubit states with Hamming Weight  [7,9,10]).The feasible subspace thus has size   , and one need only provide  () evaluated on that subspace.
The second quantity in the pre-computation step is the mixer Hamiltonian.For unconstrained problems, JuliQAOA is specifically optimized to use mixer Hamiltonians written as sums of products Pauli  operators.This choice covers a broad array of mixers [19], including the original transverse-field mixer [13,20] and Grover mixer [8].Time evolution of such a mixer Hamiltonian, which we generically denote  (  ) can be diagonalized by exploiting  =  : ( This reduces the complicated matrix exponential into a simple sequence of single-qubit operations and vector exponentiation.For a given input set of   mixing terms, we pre-compute and store the equivalent diagonal matrix in terms of   . For constrained problems, we employ a similar trick of diagonalization.Commonly studied mixers for constrained problems, e.g. the Clique <      +     and Ring =+1     +     mixers [20], are designed to only mix between feasible states, however they generally do not admit a diagonalization in terms of singlequbit gates.Instead, we pre-compute the eigenvalue decomposition   =   −1 , where  is a diagonal matrix.These computations can be costly but need only be done once, and then the results are stored for future re-use.This again allows us to avoid matrix exponentiation and instead use  −  =   −  −1 .As with the cost function, we do not consider the mixer as a 2  × 2  matrix, instead restricting the action of the mixer to the feasible subspace, e.g.resulting in matrices of size   ×   for Dicke state problems.Furthermore, any mixer that is not of the above formats (for both constrained and unconstrained problems) can be implemented as a unitary matrix, and JuliQAOA will compute and store the eigendecomposition.

Simulation
Once the cost function and mixer have been pre-computed, simulating the QAOA is a relatively straightforward effort in efficient linear algebra.JuliQAOA is written in Julia [5], a language that is inter-operable with Python at performance levels comparable to C. GPU support is currently enabled via CUDA.jl[4], and was used in [17] on an NVIDIA RTX A6000 with 48GB to analyze constrained optimization problems with  = 18 up to  ≈ 30.The main limiting factor in this case was the memory requirements in finding the eigendecomposition of the Clique mixer matrix.
In calculating the statevector simulation, we pre-allocate and reuse memory, allowing for functionally zero overhead.As discussed previously, the most time-intensive part of the simulation is multiplying by the  and  −1 of the eigendecomposition,   =   −1 .In the case of the Clique and Ring mixers, this is necessarily a matrix-level operation (unless one adopts Trotter approximations, which we do not consider here).
In the case of unconstrained problems with Pauli  mixers,  =  −1 =  ⊗ .Because this is a sequence of single qubit operations, they can be efficiently implemented in  (2  ) time via appropriate tensor contractions.The specifics of our Julia implementation are drawn from Yao.jl [23], the same basic techniques have been used elsewhere [2,31].

Angle Finding
Finally, we include an extensible format for learning good angles for QAOA in the classical angle-finding outer-loop.JuliQAOA has been specifically designed to exploit automatic differentiation [3], commonly referred to as autodiff or AD.AD is a method of calculating exact gradients of complicated functions, in essence using the chain rule across every step of the function.JuliQAOA relies on Enzyme.jl[26][27][28], which works with code compiled at the LLVM level.This allows for highly efficient AD, and also avoids difficulties with complex numbers as well as in-place memory modification, common issues in other AD packages such as ForwardDiff.jland ReverseDiff.jl.We discuss the specifics of the performance gains garnered by AD in Section 4.
Research done with early versions of JuliQAOA [17] showed the power of iterative angle-finding, i.e. using high-quality angles for a ( − 1)-round QAOA to seed angles for the -round QAOA.Starting with these extrapolated angles, we then use the basinhopping algorithm [33] to explore nearby local minima.Other common angle-finding methods are grid search, random local minima exploration, and median angles.For example, in [22] they use all three of these techniques to study MaxCut with the Transverse Field mixer up to  = 9 and  = 3.The random local minima search begins at a random choice of angles and then uses the Broyden-Fletcher-Goldfarb-Shanno (BFGS) [15] algorithm to find the closest local minima.This process is repeated 100 times, each from a different random initial point, and the lowest minima is taken as the final output.The median angles approach takes the results of the random local minima approach over a large number of problem instances and then finds the median angles.In Figure 3 we extend this analysis to  = 12 and up to  = 10, and show the difference in performance against our extrapolated basinhopping approach.
JuliQAOA has a built-in wrapper for this iterative approach that saves angles for each round as they are found.User-defined methods for exploring angles are also supported.

Grover Mixer
JuliQAOA provides an additional level of specialized and optimized code when studying the Grover mixer [8].The Grover mixer is given by   = | 0 ⟩ ⟨ 0 |, where | 0 ⟩ is the equal superposition over all computational basis states.The Grover mixer has several  [22].The speedup provided by JuliQAOA allows us to easily extend the analysis of [22], which was limited to  = 9,  = 3.These results are mean values across 50 random MaxCut instances at  = 12 on  (, 0.5) graphs.interesting properties.First, it conserves Hamming weight when mixing, making it suitable for Hamming weight constrained as well as unconstrained problems.Second, it can be used in conjunction with a threshold-based phase separator to reproduce Grover's search algorithm as a QAOA [17].Third, it gives fair sampling, that is, all states with equal objective values have the same amplitude.The third property can be significantly exploited to calculate the expectation value in a recursive fashion [16].
JuliQAOA includes special code for evaluating Grover-QAOA simulations, allowing simulation for very large (up to  = 100) problems.For problems of this size, significant computational bottlenecks arise in the traditionally straightforward step of pre-computing the  (2 100 ) objective values.Such calculations are made possible by the fact that, in the special case of the Grover mixer, one does not need to store all objective values, but instead only the distinct values that the objective function can take as well as their degeneracies.The calculation of these degeneracies can be easily spread across many threads or GPUs.Determining which binary states are evaluated on each worker is simple in the case of unconstrained problems, as the range of integers 1 . . . 2  can be partitioned appropriately.In the case of Hamming weight  states, one can use Gosper's hack [1] to efficiently iterate through all binary strings with  ones.

EXAMPLES
Note: Please see the latest documentation, available at https: //lanl.github.io/JuliQAOA.jl,for more up-to-date examples.
As described in the introduction, a QAOA is defined by three core components: a cost Hamiltonian, a mixer Hamiltonian, and a set of angles.For JuliQAOA, the cost and mixer Hamiltonians are pre-computed and then passed to the simulator along with angles.JuliQAOA includes several common cost-functions, e.g.MaxCut, -SAT, -Densest Subgraph, etc.These all take as input some structure, usually a graph or a set of clauses, as well as a computational basis state (passed as an array of 0's and 1's) and output a scalar objective value.Any user-defined cost function following this basic format will work as well.See Listing 1 for a simple example of this approach.The core simulate() command outputs a special object, which stores the statevector as well as objective values, and can be used to extract the expectation value, amplitudes for each state, and ground state probability.The command simulate_gpu() is equivalent in functionality to simulate() but runs on any available GPUs via CUDA.jl[4].
The unconstrained Pauli  mixers are extremely efficient to calculate, and do not take much memory, so there is no need to save the results for future calculations.The same cannot be said for constrained problems utilizing the Clique or Ring mixers.In this case, JuliQAOA provides an option to save the mixers to a user-specified location for re-use.See Listing 2 for an example of a constrained QAOA.Note that in this case the cost function densest_subgraph is called across all Dicke(, ) states, i.e. -qubit states with Hamming weight .A particular strength of JuliQAOA is its flexibility and extensibility.By default, simulate() begins the QAOA in the uniform superposition of all states (or all Hamming weight  states, if the mixer is targeted at a specific weight).However, simulate() accepts an optional argument initial_state, which can be used to specified any other initial state, e.g. to study the effects of warm starts [11].Furthermore, the mixers argument can accept an array of mixers (of length ) to test the efficacy of multiple mixers in a single QAOA.In order to test multi-angle QAOA [21], one can even pass an array of arrays of mixers, along with a nested array of angles, which allows for multiple mixers at each layer.
Finally, we include a simple call for finding good angles, see Listing 3. find_angles() defaults to attempting to maximize the objective value.In the case of minimization, an overall minus sign can be added to the objective value list.And in the case of problems where the objective values are both negative and positive, one must add an offset to make them all the same sign.
The function find_angles() uses the angle-finding scheme described in [18], which starts by finding good angles at  = 1 and then uses them as a seed for finding good angles at  = 2, continuing up to a target number of rounds.The results of each step of the angle-finding can be stored in a user-defined file.If the angle-finding is interrupted for any reason, e.g. a server crash, it will load any saved results and resume from the last calculated angles.At each step, we use the basinhopping algorithm, via Basinhopping.jl,to explore local minima.The number of minima to explore, acceptance criteria, and other basinhopping parameters can be modified through optional arguments passed to find_angles().Listing 3: Examples of angle optimization outer-loop for  rounds.find_angles() defaults to an iterative approach, which uses the angles from round  − 1 to seed the basinhopping at round .If the included file path exists, the previously calculated angles will be used.One can also avoid the iterative approach by specifying initial_angles, which will then perform basinhopping from that point.The function find_angles_rand() is included to showcase a straightforward user-defined angle-finding technique.In this case, we implement the random local minima search from [22].

PERFORMANCE
QAOA circuits can be written and simulated in many different simulation platforms, e.g.Qiskit [30] or Pennylane [12].Two prominent QAOA-specific packages are the Python-based QAOAKit [32] and the recently introduced Julia package QAOA.jl[6].QAOAKit only works with the MaxCut problem and transverse-field mixer, Figure 4a shows the scaling in qubit size for CPU time (solid lines) and memory usage (dashed lines) needed to simulate a  = 1 MaxCut QAOA on a random  (, 0.5) graph with the Transverse Field mixer.Figure 4b shows the scaling in rounds for CPU time needed to evaluate an  = 14 MaxCut QAOA on a random  (, 0.5) graph (memory increase is neglegible across rounds for all packages and is therefore not shown).All data generated on an Apple M2 Max device.
while QAOA.jl can accommodate different mixers and problem types.In both cases, these packages compose QAOA circuits and then pass them to general simulators for evaluation; QAOA.jl utilizes Yao.jl, while QAOAKit uses Qiskit.We find that the pre-computation and purpose-built simulation of JuliQAOA outperforms existing alternatives, see Figure 4.For example, for an  = 6,  = 1 MaxCut QAOA, JuliQAOA is faster than QAOAKit by a factor of over 2000, and faster than QAOA.jl by a factor of over 70.Furthermore, the automatic differentiation built in to JuliQAOA gives a factor of  () improvement in time to compute the gradient of the QAOA expectation value ⟨,  | ()|,  ⟩ in terms of ,  .This is because, after an initial caching pass, AD gives the gradient with a single evaluation of the expectation value and some constant overhead.Meanwhile, traditional finite difference methods require at least  () evaluations of the expectation value.See Figure 5 for a comparison of time to find the closest local minima to a random initial point using the BFGS algorithm, with either finite difference or AD providing the gradient.
Perhaps most importantly, we find that JuliQAOA facilitates a wide range of QAOA studies.For example, only requiring a list of  () evaluated across all feasible states allows total freedom in the choice of cost function, and simplifies usage for scientists without experience developing Hamiltonian encodings of optimization problems.Researchers can explore arbitrarily complicated or synthetic optimization functions and mixer Hamiltonians, which can be useful in exploring the limits of QAOA performance.
Constrained optimization is a particular strength of JuliQAOA.In traditional circuit-based simulators, constrained optimization problems must be encoded in terms of unconstrained Hamiltonians with artificial "penalty" terms meant to dissuade the classical anglefinding optimization from selecting non-feasible solutions.The design of JuliQAOA allows us to use mixers which stay within the feasible subspace and simply ignore all non-feasible states, increasing accuracy and significantly reducing computational effort.
While this manuscript was under preparation, a similar QAOA simulation toolkit, QOKit, was introduced [25].QOKit shares many of the same fundamental ideas with JuliQAOA, namely, precomputation and efficient implementation of mixer Hamiltonians.However, there are several key differences between the two packages.First, QOKit is written in Python, and leverages multiple backends for executing the statevector simulation (most notably NVIDIA's cuQuantum framework).They have further optimized their code for large-scale GPU simulation, going up to  = 40 with over 1000 GPUs.However, for exact statevector simulation they only support the Transverse Field mixer.They include both Clique and Ring mixers, but their implementation is equivalent to a firstorder Trotter approximation.Finally, they do not currently provide support for automatic differentiation.

Figure 1 :
Figure1: Conceptual overview of JuliQAOA.To begin evaluating a QAOA, JuliQAOA precomputes the value of the cost function  () across all feasible states, as well as a diagonalized version of the mixer Hamiltonian.The quantum simulation subroutine calls on this information to efficiently compute ⟨ ()⟩.The angle-finding outer-loop uses ⟨ ()⟩ to find optimal angles, saving results for intermediate steps.

Figure 2 :
Figure 2: Example use of JuliQAOA to find high-quality angles for increasing  across a variety of problem types with different mixers.Each line represents the results for a single random instance of the given problem type.All problems have  = 12.The MaxCut, Densest -Subgraph, and -VertexCover instances are all on a random  (, 0.5) graph, with  = 6 for the constrained problems.The 3-SAT instance has clause density 6.The mixers are described in Section 2.1, and the angle-finding technique is described in Section 2.3.

Figure 3 :
Figure3: Comparison of our extrapolated basinhopping angle-finding technique against the random local minima exploration and median angles approaches of[22].The speedup provided by JuliQAOA allows us to easily extend the analysis of[22], which was limited to  = 9,  = 3.These results are mean values across 50 random MaxCut instances at  = 12 on  (, 0.5) graphs.

Listing 2 :
densest_subgraph(graph, x) for x in dicke_states(n, k)] mixer = mixer_clique(n, k; file="path/to/saved/mixer") Setting up a Densest -Subgraph problem with the Clique mixer.If the included file path exists, the pre-computed mixer is loaded.If it does not exist, the eigencomposition is stored for future re-use.

Figure 4 :
Figure 4: Performance comparison of JuliQAOA against alternative QAOA simulation packages QAOA.jl and QAOAKit.Figure4ashows the scaling in qubit size for CPU time (solid lines) and memory usage (dashed lines) needed to simulate a  = 1 MaxCut QAOA on a random  (, 0.5) graph with the Transverse Field mixer.Figure4bshows the scaling in rounds for CPU time needed to evaluate an  = 14 MaxCut QAOA on a random  (, 0.5) graph (memory increase is neglegible across rounds for all packages and is therefore not shown).All data generated on an Apple M2 Max device.

ACKNOWLEDGMENTS
This work was supported by the U.S. Department of Energy through the Los Alamos National Laboratory.Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of U.S. Department of Energy (Contract No. 89233218CNA000001).Research presented in this article was supported by the NNSA's Advanced Simulation and Computing Beyond Moore's Law Program at Los Alamos National Laboratory.This material is based upon work supported by the U.S. Department of Energy, Office of Science, National Quantum Information Science Research Centers, Quantum Science Center.This research used resources provided by the Darwin testbed at Los Alamos National Laboratory (LANL) which is funded by the Computational Systems and Software Environments subprogram

Figure 5 :
Figure 5: Performance comparison of local minima finding with the BFGS algorithm using either finite difference or AD to calculate gradients.Performance is averaged over 100 random  = 14 MaxCut instances.All data generated on an Apple M2 Max device.