Fast American Option Pricing using Nonlinear Stencils

We study the binomial, trinomial, and Black-Scholes-Merton models of option pricing. We present fast parallel discrete-time finite-difference algorithms for American call option pricing under the binomial and trinomial models and American put option pricing under the Black-Scholes-Merton model. For T-step finite differences, each algorithm runs in O (T log2 T)/p + T) time under a greedy scheduler on p processing cores, which is a significant improvement over the Θ (T2/p) + Ω (T log T) time taken by the corresponding state-of-the-art parallel algorithm. Even when run on a single core, the O (T log2 T) time taken by our algorithms is asymptotically much smaller than the Θ (T2) running time of the fastest known serial algorithms. Implementations of our algorithms significantly outperform the fastest implementations of existing algorithms in practice, e.g., when run for T ≈ 1000 steps on a 48-core machine, our algorithm for the binomial model runs at least 15× faster than the fastest existing parallel program for the same model with the speedup factor gradually reaching beyond 500× for T ≈ 0.5 × 106. It saves more than 80% energy when T ≈ 4000, and more than 99% energy for T > 60,000. Our algorithms can be viewed as solving a class of nonlinear 1D stencil (i.e., finite-difference) computation problems efficiently using the Fast Fourier Transform (FFT). To our knowledge, ours are the first algorithms to handle such stencils in o (T2) time. These contributions are of independent interest as stencil computations have a wide range of applications beyond quantitative finance.


Introduction
Option pricing or computing the value of a contract giving one the right to buy/sell an asset under some given constraints is one of the most important computational problems in quantitative finance [3].Rapid changes in financial markets often lead to rapid changes in asset prices which makes the ability to quickly estimate option prices essential in avoiding potential financial losses [81].
An option is a two-party financial contract that gives one party (called the holder) the right (but not an obligation) to buy/sell (i.e., exercise) an asset from/to the other party (called the writer) at a fixed price (called the strike/exercise price) on or before an expiration date (called the exercise/maturity date).A call option gives the right to buy whereas a put option gives the right to sell.Also, based on the expiration date and the settlement rule, there are two major styles of options: European and American.A European option can only be exercised at the expiration date while an American option can be exercised at any time before that.
The option pricing problem asks for assigning a value or price to an options contract based on the calculated probability that the contract will be exercised at expiration.The theoretical value of an option [13,15,39,58,71,98] is determined by its stock price  (i.e., its current market price), strike price , risk-free rate of return  (i.e., the theoretical rate of return assuming zero risk), dividend yield  (i.e., a ratio that shows how much dividend/year is paid relative to ), volatility  (i.e., how much the trading price varies over time), and time to expiration  (e.g., in days).
Analytical solutions to the option pricing problem are sometimes available, particularly for European options [48,83,96,97].But they are not available for American options except for a limited number of cases with significant constraints (e.g., American call options with zero or one dividend [97] and perpetual American put options [65,97]).This difficulty in finding closed-form analytical solutions for most option pricing problems makes computational approaches the only path forward.The main computational approaches to solving the option pricing problem include the binomial tree method [54], the finite difference method [3,47,60,94,102,107], and the Monte Carlo method [34,62,108].
The binomial tree method works by tracing the option's value at discrete time steps over the life of the option.For a given number of time steps  between the valuation and expiration dates of the option, a binomial tree of height  is created with the leaves storing the potential prices of the asset at the time of expiration.Then one works backward to compute for each  ∈ [0, − 1] the value of the nodes at depth  of the tree (each giving a possible price at time step ) from the values of the nodes at depth  + 1 using a simple formula.The value computed for the root node is the required option value.Straightforward iterative implementation of this method runs in Θ  2 time on a single processing core and Θ  2 / +  log time on  cores (see Figure 1 and Table 2).It provides a discrete-time approximation of the continuous-time option pricing in the Black-Scholes model and is widely used by professional option traders.
The trinomial tree method extends the binomial method by accounting for the possibility that an asset value remains the same after a time step [18].With only a constant factor increase in run-time it provides more precise predictions than the binomial model.
The finite-difference method approximates the continuoustime differential equations describing the evolution of an option price over time by a set of discrete-time difference equations and then solves them iteratively under appropriate boundary conditions.The explicit finite difference method divides the lifetime of the option into  discrete time steps and then uses the potential values of the asset at time step  (the time of expiration) to compute the asset values at each time step  ∈ [0, − 1] from the asset values at time step  + 1 based on the difference equations (i.e., update equations or stencils).The final option value is found at time  = 0. Similar to the binomial tree method, the iterative implementation of this method runs in Θ  2 time on a single processing core and Θ  2 / +  log time on  cores.Other finite difference methods used for option pricing include implicit finite difference and the Crank-Nicolson method [29].
The Monte Carlo method works by generating random backward paths the asset price may follow starting from the time of expiration and ending at the time of valuation.The bounds hold for call option pricing under the binomial and trinomial option pricing models, and put option pricing under the Black-Scholes-Merton model.Here,  = number of time steps,  = number of processing cores, and  = cache size.Also, T  = running time on  cores, and thus T 1 (Work) and T ∞ (Span) represent run-times on one core and an unbounded number of cores, respectively.Under a greedy scheduler, T  = Θ (T 1 / + T ∞ ), which is asymptotically optimal.

Work
Span Parallel Running Time T1 (𝑇 ) T∞ ( ) T (𝑇 ) Nested Loop (standard, see Figure 1) Recursive Tiling (cache-oblivious) [20,37,38,105] Our Algorithms Each of these paths leads to a payoff value for the option and the average of these payoff values can be viewed as an expected value of the option.This method is used for pricing options with complicated features and/or multiple sources of uncertainty that other methods (analytical, treebased, finite difference) cannot handle [34], but is usually not competitive when those methods apply as the convergence rate of Monte Carlo method is sublinear [51,73].They are hard to develop for some options, such as Black-Scholes Model for the American put option, but still many results exist on Monte Carlo methods [19,49,62,108].
Our Contributions.We present three shared-memory parallel algorithms for American option pricing -call option under the binomial and trinomial option pricing models and put option under the Black-Scholes-Merton model.All three algorithms run in Θ  log 2  / +  time on  processing cores which is a significant improvement over the Ω  2 / +  log time taken by the state-of-the-art parallel algorithms, where  is the number of time steps.When run on a single processing core they run in Θ  log 2  time compared to the Θ  2 time taken by the fastest existing serial algorithms.We use the Fast Fourier Transform (FFT) to speed up the computation.Table 2 summarizes the results.We use the work-span model [27] to analyze the performance of parallel programs.Let T  be the running time on a -processor machine under a greedy scheduler.Then T 1 and T ∞ are called work and span, respectively.The parallel running time The following proposition, which follows easily from the complexities given in Table 2, notes that each of our parallel algorithms runs asymptotically faster than the corresponding fastest existing parallel algorithm for every value of .We have implemented our algorithms and compared their running times, energy consumption, and cache performance with those of the option pricing implementations available in the Par-bin-ops framework [20] developed recently in 2022.Implementations of our algorithms run orders of magnitude faster, consume significantly less energy, and usually incur far fewer L1 cache misses than those implementations.How Our Algorithms Differ from Existing FFT-based Option Pricing Algorithms.FFTs have been used for European option pricing before.European option pricing is simpler than American option pricing, e.g., the European version of the American option pricing algorithm shown in Figure 1 can be obtained by replacing the assignment  , ← max  0  +1, +  1  +1,+1 ,  2 − −  in Step 3 with the simpler assignment  , ←  0  +1, +  1  +1,+1 .The absence of the 'max' operator in this assignment makes an efficient evaluation of the doubly-nested loop in Step 3 easier.
Black, Scholes, and Merton [15,71] showed that the European option can be calculated using a Parabolic PDE with infinite domain constraint.By using the Fourier transform, one gets an integral form for European options.To obtain the numerical value from the integral form, one uses numerical integration [21,44] which can be sped up using FFTs.
There are also approximation results [22,64,76,126] based on repeated Richardson extrapolation [87,88] and FFT for numerical integration in American options.However, even if the extrapolation is repeated only for a constant number of times for an option that expires in  days, the approximation takes Ω ((/Δ)  log  ) time when  grid points are used to discretize the price of the underlying asset and /Δ exercise points are placed with every pair of consecutive exercise points being Δ days apart.Observing that  = /Δ corresponds to the number of discrete time steps in the finite difference formulation of the problem, we can rewrite the complexity as Ω (  log  ).Usually,  ≥  is used in practice [76], which reduces the complexity to Ω  2 log .
A major weakness of the existing FFT-based numerical integration approach above is that a closed-form expression for the characteristic function of the log-price must be known for the technique to work.However, our approach does not need to know such a closed-form expression as we apply FFT to speed up stencil/finite-difference computations and not numerical integration.Thus, our approach will work on a larger set of option pricing problems.We are not restricted to infinite-domain problems either [25].Implications for Nonlinear Stencil Computations.Our option pricing algorithms can be viewed as solving a class of nonlinear 1D stencil computation problems asking to evolve a grid of size Θ ( ) for  time steps, in Θ  log 2  work (i.e., serial time) or Θ ( log 2  )/ +  parallel time on  processing cores.As stencil algorithms, they are of independent interest.A stencil is a pattern (equation) used to update the values of cells in a spatial grid and evolve the grid over a given number of time steps.The process of evolving cell values in the spatial grid according to a stencil is called a stencil computation [37].The finite-difference method performs a stencil computation with an update equation derived from the differential equations used as a stencil.Stencils are widely used in various fields, including mechanical engineering [79,84,86,103,123], meteorology [7,52,53,74,89,90], stochastic and fractional differential equations [63,120,121], chemistry [6,42,61,75,99], electromagnetics [5,55,104,110], finance [24], and physics [12,30,40,45,69,78,106,112,113], image processing [82,111,115].
A stencil is called linear if it computes the value of a cell at time step  as a fixed linear combination of cell values at time steps before , otherwise it is called nonlinear.For 1D linear stencils Ahmad et al. [1] provide FFT-based algorithms that take O ( log ) serial time for periodic grids and O  log 2  serial time for aperiodic grids, assuming that the input grid is of size Θ ( ).
The stencils we encounter in our current work are nonlinear because they do not use a linear combination of cell values from prior time steps for updating a target cell, provided that the resulting value is smaller than the value of a function computed solely based on the spatial coordinates of the target cell and other option pricing parameters (e.g., see Step 3 of Figure 1).Such a stencil divides the space-time grid into two disjoint regions -in one region only the linear combination applies, while in the other only the function value applies.However, the problem is that the boundary between these two regions is not known ahead of time and the location of the boundary may move as the time step  progresses.As a result, Ahmad et al. 's [1] results for linear stencils do not apply.To the best of our knowledge, ours are the first algorithms for handling such stencils running in time subquadratic in  .Binomial Option Pricing Model 2.1 Binomial Option Pricing Model (BOPM) BOPM [28,85,95] is a simple discrete-time option pricing model without using advanced mathematical tools.It is a paradigm of practice.
BOPM encodes the various sequences of prices the asset might take as paths in a binomial tree.Each node in the tree represents a possible price at a certain time and the nodes at two successive layers in the tree represent prices at times that are apart by some fixed time step Δ.The prices increase or decrease by some factor after every Δ time.Figure 2() gives an example of a 3-time-step binomial price tree that is produced by moving from the valuation day to the expiration day.Denote the initial price by .The price in the next time step (i.e., after Δ time) can go up to   =  •  or go down to   =  • , where the up factor  =   √ Δ and the down factor  = 1/ are determined by Δ and volatility  .
Denote the node value as   =  ×    −  , where   and   are the numbers of ticks up and down, respectively.The final nodes of the tree represent the prices on the expiration date.Given the strike price of , the price one can call or put before the contract expires, i.e., the exercise value of each node will be max(  − , 0) for a call option and max( −   , 0) for a put option.
The risk-neutral valuation of the binomial value is performed iteratively backward.Under the assumption of risk neutrality, the value of the option today is its expected future payoff discounted at the risk-free rate of .Let us number the nodes in each layer of Figure 2() from top to bottom by successive integers starting from 1. Then the node values  , and  ,+1 of a layer representing some time  can be used to compute the binomial value of the th node in the layer representing time  − Δ as follows: ), where,  = ( Δ − )/( − ) [48].Denote  =  −•Δ ,  0 = (1 − ), and  1 = .Then the binomial value of that node is: The value at each node will be equal to its binomial value for European options and the larger of its binomial value and exercise value for American options.
We say that cell (, ) of  is red provided  , =   , , and green otherwise.We show in Section 2.2 that all red cells in  form a single contiguous region and all green cells form another.A single boundary divides the two regions.We analyze the properties of this red-green divider in Section 2.1 which we will exploit to design an efficient algorithm for American call options in Section 2.3.Lemma 2.2 shows that within the upper-left triangle of , if a cell is green then the cell to its right is also green.
Proof. +1,+1 =   +1,+1 and Lemma 2.4 gives: The following corollary says that at every time step all red cells appear to the left of all green cells, and the boundary between the green and the red regions either remains the same or moves by one cell towards the left every time step.

Algorithm for American call option pricing under BOPM
The solution space is a right-angle isosceles triangle with base length  .We know the boundary between the red and green cells in the first row of the triangle (solution space); however, we do not know the locus of the boundary in the subsequent rows of the triangle.We compute the boundary in the following process.
We partition the triangle (solution space) into trapezoids (see Figure 3a).We compute the first trapezoid with its first row as the same first row of the triangle (the solution space) and solve this newly created trapezoid (we explain how we create a trapezoid and solve it later in this section).Then we compute the second trapezoid with its first row as the last row of the first trapezoid and solve the second trapezoid.This process continues until we are left with a right-angle isosceles triangle with base size at most √  .We solve this triangle iteratively by doing quadratic work in time O (𝑇 ).We describe the process in detail below.Partitioning the Triangle into Trapezoids.Let  be a right-angle isosceles triangle with base length  (see Figure 3a).Let ℓ 1 be the number of red cells on the line segment .They will appear consecutively from  to some point .
Let  be the point in the line segment  that is ℓ 1 distance from .Draw a horizontal line from ; let the line intersect  at point .Therefore, we get a trapezoid  with height ℓ 1 with ℓ 1 red cells in its first row.
We solve trapezoid , which means that we compute the values of all red cells in its last row and thus find the boundary point  between red and green cells in .Let || = ℓ 2 .Let  be the point on  that |  | = ℓ 2 .We draw a horizontal line  , and get the second trapezoid  of height ℓ 2 .The last row of the trapezoid  becomes the first row of  .
We solve the trapezoid  and all subsequent trapezoids created following the approach described above until we are left with a right-angle isosceles triangle  with base length at most √  .We compute all cell values of  iteratively in O ( ) time.
Solving a Trapezoid.Solving a trapezoid means given all red cell values in its first row computing all red cell values in its last row.Let  be a trapezoid of height ℓ with ℓ red cells from  to  in its first row (Figure 3b).Let  be the point on  such that | | = ⌊ℓ/2⌋.Draw a horizontal line from  intersecting  at .To compute the values of red cells on , we (1) compute all red cells on , and (2) using the cell values on , compute all red cells on .
(1) Computing all red cells on   .Let  and  intersect .We compute the cell values on line segment  using the FFTbased stencil algorithm of [1] which are guaranteed to be red because of Corollary 2.7.Note that | | = | | = ⌊ℓ/2⌋.The newly created trapezoid  has height ⌊ℓ/2⌋, and there are ⌊ℓ/2⌋ red cells in its first row.We solve trapezoid  recursively similar to trapezoid , and thus compute all red cell values on .
(2) Computing all red cells on    using the red cells on   .Let point  be on the boundary between red and green cells on .Draw a line from  parallel to  intersecting  at . Next, draw a line through  perpendicular to  intersecting  at point .We compute all red cells on  in exactly the same way we computed the red cells on  above.We first use the FFT-based stencil algorithm of [1] to find all cell values on , and then recursively solve trapezoid  of height ⌈ℓ/2⌉ to find all red cell values on .
The following theorem gives the work and span of our algorithm.Theorem 2.8.Our algorithm solves the American call option pricing problem under BOPM in O  log 2  work and O ( ) span, where  is the number of time steps.Proof.Let  1 (ℓ) and  ∞ (ℓ) be the work and span, respectively, for solving a trapezoid of height ℓ recursively.We call the O (ℓ log ℓ)-work FFT-based periodic algorithm [1] twice and solve two smaller trapezoids of height ℓ/2 each recursively.Hence,  1 (ℓ) = 2 1 ( ⌈ℓ/2⌉) + Θ(ℓ log ℓ) = O ℓ log 2 ℓ .Observe that although the two smaller trapezoids must be solved one after the other (e.g.,  after  in Figure 3b), each of them can be solved in parallel with the FFT-based algorithm (of span O (log ℓ log log ℓ) [1]) called to find the red cells on its left (on line segments  and , respectively).Hence,  ∞ (ℓ) = 2 ∞ ( ⌈ℓ/2⌉) + O (log ℓ log log ℓ) = O (ℓ).
Suppose that we solve  trapezoids of heights ℓ 1 , ℓ 2 , . . ., ℓ  , respectively, using the above process as shown in Figure 3a.Let Ψ 1 and Ψ ∞ be the total work and span, respectively, for solving those  trapezoids followed by the time needed to solve the leftover triangle of height Since those trapezoids and the triangle are solved in sequence, we have . □

American Call Option under the Trinomial Option Pricing Model
The trinomial options pricing model (TOPM) encodes the possible sequences of prices for a given asset within the structure of a trinomial tree (see Figure 2()).It expands on BOPM by allowing the value of an asset to remain unchanged after a given time step.TOPM was introduced by Boyle [18], and while it is less popular than the BOPM, it provides for the possibility of more accurate predictions than BOPM at the cost of only a constant factor blowup in runtime when using naïve O  2 methods.Langat, Mwaniki, and Kiprop showed that TOPM converges to the same solution as Black-Scholes with half as many time steps [59].TOPM is also equivalent to the explicit finite difference method [48].TOPM carries over many properties from BOPM, e.g.,   =  ×    −  since the number of "remain the same" moves does not factor into the price.Here  =   √ 2Δ , and  = 1/.The exercise value in TOPM is max(  − , 0).
The transition probabilities can be expressed as , and   = 1 −   −   , which are alternate forms of those given in [48]. Let , otherwise.
In the supplementary material, we show that the TOPM grid shows properties similar to the BOPM grid, that is, in every row all red cells appear first in contiguous locations followed by all green cells, and with every time step the red-green boundary moves by at most one cell to the left.This allows us to use a similar algorithm to that given for BOPM (Section 2.3) with the identical work and span.
4 American Put Option Under the Black-Scholes-Merton Model BSM is a mathematical method to calculate the theoretical value of an option contract.The option pricing problem is transformed into a partial differential equation (PDE) with variable coefficients.An explicit formula for the price can be obtained assuming a log-normal distribution of the asset price.Note that the limit of the discrete-time BOPM approximates the continuous-time BSM under the same assumption.
While BOPM utilizes simple statistical methods, BSM requires a solution of a stochastic differential equation.
It is equivalent to satisfying the following: where,  ∈ [0, ], Recall that at the maturity time  , the option price (call option case) will be  ( ) = max(S( ) − , 0) = ( − S( )) + .It also means that  ( , ) = ( − ) + on the boundary.Therefore, the goal becomes solving Equation (1) or Inequality (2).For more details on how the BSM model is formulated or why the complementary form is equivalent to the classical form, see [23,97,109].We show the two areas are contiguous and find properties of their boundary in Section 4.2, which can be exploited by our algorithm in Section 4.3.

Properties of the Two-Area Boundary
Note that Equation (1) includes dimensional variables.First, we find nondimensionalized forms of Equations ( 1) and (2).

Δ𝑠
and  0  = max(1− Δ , 0) for all integer .It also satisfies the following condition by discretizing Equation (4): Similar to the BOPM for American option, we define the green zone and the red zone: Definition 4.1.We call that    is in the green zone when Δ ≤ L(Δ), otherwise it is in the red zone.
To apply Equations ( 5)-( 6), we need to determine the form of L(( + 1)Δ) which has no analytical form, although it has asymptotic results [2] or approximation results [124,125].Instead we can use the following theorem from [23]:  Proof.We first prove that   −  +1 ≥ 0. Suppose that this is not true.Then we will have: L(( + 1)Δ) ≥  +1 ≥   + 1 > L(Δ), which contradicts Theorem 4.2.Now we prove that   −  +1 ≤ 1.Because   is in the green zone, we will have the following: Considering  +1   −1 , we first observe that: −1 is in the green zone.Suppose that it is in the red zone.Then: which leads to a contradiction because  +1   −1 should be ≥ 1 −  (  −1)Δ .By the definition of  +1 , we must have  +1 ≥   − 1, completing the proof of the theorem.□

Algorithm for American put option pricing under BSM
Our algorithm for the American put option under BSM is similar to our algorithm for the American call option under BOPM as described in Section 2.
Observe that we will have to perform a nonlinear stencil computation based on the update equation (5).For  time steps we use a ×2 space-time grid with the time dimension being  and spatial dimension 2 .According to Equation ( 5), we compute a cell  +1  of that grid, where  +1 represents the time coordinate and  the spatial coordinate, from cells    −1 ,    , and   +1 using a 3-point stencil provided  > L ( (+1)Δ )

Δ𝑠
. Otherwise, we set it to 1 −  Δ .In the first case, cell  +1  will be in the red zone, and in the second case it will be in the green zone.As explained in Section 4.2, the entire boundary between these two zones in not known ahead of time, but it moves by at most one cell toward the green region with every time step.The goal of the algorithm is to compute the value of the central cell of the spatial dimension at time step  (e.g., apex  of the isosceles triangle  in Figure 4b).
We solve the problem by decomposing the isosceles triangle  into a sequence of the isosceles trapezoids of geometrically decreasing heights (see Figure 4b) and solving (i.e., find the cell values of top base given those of the bottom base of the trapezoid) them one by one from bottom to top until we reach a leftover triangle of small constant size which we solve naïvely to find the value of .We solve an isosceles trapezoid recursively by decomposing it into two smaller trapezoids of smaller height and solving them recursively and also using the FFT-based algorithm from [1] solve two subtrapezoids that are entirely composed of red cells (see Figure 4a).Details of this algorithm are given below.
We first show how to solve an isosceles trapezoid  (as shown in Figure 4a) of height , bottom/longer base () length 4ℓ, and ∠ = ∠ = 45 • .Thus, the top/shorter base  is of length 2ℓ.Solving trapezoid  means computing the values of the cell at the top base  given the values of the cells at the bottom base .
If ℓ ≤ 10, we naïvely solve  and identify the location of the red-green boundary point  in  in O (1) time.If ℓ > 10, we find the row ℎ at height ℓ 2 and calculate all the cells in it.To do so, we recursively solve the trapezoid   which is found as follows: 1. Let  be the point on  that lies in the green region, but  + 1 is in the red region.Proof.The proof is very similar to that of Theorem 2.8.Let  1 (ℓ) and  ∞ (ℓ) be the work and span, respectively, for solving a trapezoid of height ℓ (see Figure 4a).We recursively solve two trapezoids of height ℓ/2 each in sequence but use a parallel FFT-based algorithm [1] on each half, both size Θ (ℓ).
Since the FFT-based algorithm performs O (ℓ log ℓ) work in O (log ℓ log log ℓ) span, we can write: Now, let Ψ 1 ( ) and Ψ ∞ ( ) be the work and the span, respectively, of solving an isosceles triangle of base size  (see Figure 4b).Then Ψ 1 ( ) = Ψ 1

Experimental Results
In this section, we present an experimental evaluation of our algorithms and compare them with the fastest existing solutions.Our experimental setup is shown in Table 3.The legends used in our plots are listed in Table 4, which are described in more detail in the next few paragraphs.
Benchmarks.For benchmarks, we use American call option pricing under BOPM and TOPM, and American put option under the BSM.For the BOPM call option benchmarks, our baselines are the option call probability calculations from QuantLib [26] and Zubair et al.'s parallel cache optimized model [127].We use the optimized implementations of these two baselines available in Par-bin-ops [20].These implementations are the fastest existing implementations of BOPM call option pricing.For the TOPM call option and BSM put option, our FFT-based implementations are compared with our parallel looping-based vanilla implementations, as we could not find any publicly available faster implementations.We use the perf (version: 3.10.0-1160.53.1.el7.x86_64.debug)tool [80] to analyze the system-wide energy consumption, and the PAPI (version: 5.6.0)library [114] for cache miss counts for our implementations and benchmarks.Par-bin-ops.In 2022, Brunelle et al. [20] released an opensource framework that can leverage parallel, cache-optimized algorithms to compute a variety of binomial option types and enables a simple interface for developers.We used the latest version from Github.Experimental evaluations by Brunelle et al. [20] has shown that Par-bin-ops achieves more than 139× speedup over the QuantLib library when evaluating a European call option using 200,000 steps.Therefore, we chose the Par-bin-ops tool to benchmark our implementations of our FFT-based algorithms.To have a valid comparison of running times, we use Par-bin-ops for both QuantLib and Zubaer et al. 's [127] option probability calculation equations and report the running times comparing with our FFT-based implementations.We use the stencil-based cache-optimized version of Zubaer et al. 's algorithm from Par-bin-ops.Parameter Values.As Table 2 shows, the only option pricing parameter that influences the performance bounds of the algorithms in our experiments is the number of time steps  .Therefore, we keep all other option pricing parameters fixed in all of our experiments.We use the following parameter values:  = 252,  = 130,  = 127.62, = 0.00163,  = 0.2,  = 0.0163 (notations are in Table 1).

Parallel Running Times
Figure 5() shows the parallel running time comparison of American call option pricing calculations under BOPM.Our FFT-based algorithm uses a recursive divide-and-conquer approach.We have found empirically that a base case size of 8 steps yields the best running times.Experimental results show that our FFT-based algorithm can outperform Par-binops for any number of step sizes for both serial and parallel implementations.We achieve more than 16× speedup for  ≈ 1000 and more than 500× speedup for  ≈ 0.5 million w.r.t.Par-bin-ops implementations.
Our TOPM algorithm runs more than 2.5× faster for  ≈ 2000 and more than 390× faster for  ≈ 2.1 million w.r.t. the parallel vanilla code.Figure 5() shows the comparisons.
Figure 5() shows the parallel running time comparisons of the American put option pricing computations under BSM.Our FFT-based implementation is compared with the loopingbased vanilla implementation.Our algorithm achieves more than 8× speedup for  ≈ 1000 and more than 14× speedup for  ≈ 0.13 million w.r.t. the vanilla implementation.

Energy Consumption
Figure 6 shows the comparison of system-wide energy consumption while running our FFT-based implementation and respective benchmarks.We collected the energy consumption estimate of the entire package (pkg) and the main memory (RAM) through the RAPL (Running Average Power Limit) interface of model-specific registers (MSR) using the perf [80] profiling tool.Our FFT-based BOPM, TOPM, and BSM implementations consume 99%, 99%, and 96% less energy, respectively, compared to their benchmarks for large  values   used in our experiments.For  ≈ 4000, the energy savings are 80%, 50%, and 40%, respectively.Energy plots for pkg and RAM separately are included as supplementary material.

Cache Misses
Figure 7 shows L1 cache-miss (= L2 cache access) and L2 cache-miss results, respectively, for all implementations.Our FFT-based implementation incurs far fewer L1 cache misses than both Par-bin-ops implementations for BOPM.However, while we incur far fewer L2 cache-misses than ql-bopm, the other one (zb-bopm) incurs fewer L2 misses than ours.In case of TOPM, while our FFT-based implementation incurs far fewer L2 misses than our parallel looping implementation (vanilla-topm), the trend is not clear for L1 misses.For BSM, neither L1 nor L2 misses seem to have a clear winner.

Scalability
As Table 2 shows, our algorithms reduce the span T ∞ of solving the option pricing problems that we consider by a factor of Ω (log ), but they reduce the work T 1 by a substantially larger factor of Θ  /log 2  .While the reduction in work leads to a significant reduction in energy consumption (see Section 5.2), it also leads to a low parallelism of T 1 /T ∞ = Θ log 2  .But it is still easy to see from the complexities given in Table 2 that the parallel running time T  of our algorithms will be asymptotically lower than that of the corresponding fastest existing parallel algorithms for every value of  (stated in Proposition 1.1).Table 5 above shows how the parallel running times of our FFT-based BOPM implementation (fft-bopm) and that of the QuantLib-based implementation from Par-bin-ops (ql-bopm) vary with  as  is kept fixed at 2 15 .We see that although the parallel running time of fft-bopm stops decreasing somewhere between  = 4 and  = 8, it remains significantly faster than ql-bopm even when  = 48.Our algorithm scales better as  increases, e.g., for  = 2 19 , it scales to a  ∈ [8,16), and compared to a subsecond running time of fft-bopm for  = 1, ql-bopm takes ≈ 2 hours to run.

Conclusion and Future Work
We have designed fast American option pricing algorithms under the binomial, trinomial, and the Black-Scholes-Merton models.We solve a type of nonlinear stencil problem that is of independent interest with potential applications beyond quantitative finance.
Future work may explore other models for American option pricing, such as the time dependent volatility model, stochastic volatility model, and the jump diffusion model.European and Asian options, Lookback options, Knock-out Barrier options, and Bermudan options are also of interest.Supplementary Material.Includes our FFT-based implementations and details on our TOPM results (Section 3).takes up the upper left triangle, such that the diagonal has a slope of 1/2.Lemma A.1 shows that within the upper-left triangular area of  if a cell is green then the cell to its right must also be green.Assuming the claim holds for some  + 1 ≤  , we will show that it holds for  (Δ +1, ≤ 0 for 0 ≤  <  + 1).From this assumption, this implies the existence of a  +1 such that  , =   , for  >  +1 and  , =   , for  ≤  +1 .This  +1 is defined as the largest  such that  +1, > 0, the last "red" cell in row  + 1. Therefore: From the inductive assumption, we can assume that  0 δ+1, +  1 δ+1,+1 +  2 δ+1,+2 ≤ 0. Therefore, to show that Δ , ≤ 0, it suffices to show that the remaining terms are less than or equal to 0. A useful note is that since diagonals indicate no change with respect to the exercise value, From here, we can start dividing out common terms so long as they are greater than 0.
We can transform the expression   −   .We will let  (− )Δ =  to simplify.By induction, Δ , ≤ 0 for all 0 ≤  ≤  , 0 ≤  ≤ 2 − 1, we prove Lemma A.1.□ Lemma A.2 shows that within the upper-left triangular area of  if a cell is green then the cell below it must also be green.
From Lemma A.1, we know that all the  +1, =   +1, for all  <=  <= 2 + 1.Therefore, we can expand this to: (  +   +   − 1/) ≤   − −1 ( 0 + ( 1 − 1) +  2  2 ) Note that since   +  +  = 1, we know that the expression on the left side is strictly greater than 0, allowing us to make the following substitution: In Lemma A.1, we showed this expression to be less than or equal to 0, which is a contradiction.Therefore, Lemma A. Proof.This can also be proved by induction using the similar methods as Lemma 2.5.We have the same base case, when  =  we have   −1, −1 ≤   , .If both are green, then since   −1− +1 −  =   − −  it holds.If   −1, −1 is green and   , is red, then we know that   −1, −1 ≥ 0 since it is green and that   , = 0 because it is red.If both are red, then the same applies.We then can consider the general case of  = ℓ, where if all ℓ <  ≤  hold.

B Energy Comsumption
We give additional plots of energy consumption in Fig. 10.In addition to outperforming existing approaches in terms of total energy consumption, the reduced energy consumption is pronounced when restricted to specific domains.Here we give the values for the package domain (pkg) and the RAM domain as described in Section 5.2 of the main body.For both measures across all algorithms our algorithm outperforms vanilla and standard implementations after instances of size 2 1 3. Our algorithm for the Black-Scholes-Merton Model was even more competitive, outperforming on instances of all sizes.

Figure 1 .
Figure 1.Standard looping code for pricing American Call options under the Binomial Option Pricing Model.

Proposition 1 . 1 .
Let T ( )  ( ) be the running time of any existing algorithm from Table 2 on  cores, and let T ( )  ( ) denote the same for our algorithm.Then for every (positive) value of  under a greedy scheduler: T ( )  ( ) =  T ( )  ( ) .

Figure 2 .
Figure 2. () A 3 time-step binomial tree.() A binomial tree of 7 time steps embedded in an 8 × 8 grid.An upward arrow has a price change factor of 1/ while a rightward arrow has a price change factor of .

Figure 4 .
Figure 4. American put option pricing under BSM

2 .
Identify the points  and  to the left and right of  , respectively, such that |  | = |  | = ℓ.Construct an isosceles trapezoid with base , height ℓ 2 and top  such that ∠  = ∠ = 45 • .Thus, |  | = ℓ.After solving the trapezoid  , we have the cell values in  and the location of the red-green boundary point  in .We can easily calculate the values of cells in ℎ  since those values are independent of time and depend only on spatial coordinates.Finally, we use the   -based algorithm of[1]   to solve the trapezoid  , where the point  is found by forcing   to be an isosceles trapezoid.Therefore, we can get the values of the cells in  and thus the values of all the cells in ℎ.Then we can calculate the values of the cells in  given the values of the cells in ℎ exactly the same way as we computed the cell values in ℎ from those in .Now, let us go back to Figure 4b to see how to compute the value of the apex .After solving  as above to calculate the cells in , if | | ≤ 10, we naïvely calculate the value of , which takes O (1) time.But if | | > 10, we recursively apply our trapezoid algorithm to solve a smaller trapezoid with the bottom base .Theorem 4.4.Our algorithm solves the American put option pricing problem under BSM in O  log 2  work and O ( ) span, where  is the number of time steps.

Figure 10 .
Figure 10.Additional comparisons of energy consumption by domain, Package (pkg) and RAM.

Table 4 .
Legends used in plots and tables.
Running time comparisons of our algorithms with state-of-the-art parallel algorithms.