Efficient CHAD

We show how the basic Combinatory Homomorphic Automatic Differentiation (CHAD) algorithm can be optimised, using well-known methods, to yield a simple, composable, and generally applicable reverse-mode automatic differentiation (AD) technique that has the correct computational complexity that we would expect of reverse-mode AD. Specifically, we show that the standard optimisations of sparse vectors and state-passing style code (as well as defunctionalisation/closure conversion, for higher-order languages) give us a purely functional algorithm that is most of the way to the correct complexity, with (functional) mutable updates taking care of the final log-factors. We provide an Agda formalisation of our complexity proof. Finally, we discuss how the techniques apply to differentiating parallel functional array programs: the key observations are 1) that all required mutability is (commutative, associative) accumulation, which lets us preserve task-parallelism and 2) that we can write down data-parallel derivatives for most data-parallel array primitives.

Our starting point is the CHAD automatic differentiation technique [Elliott 2018; Nunes and Vákár 2023; Vákár 2021; Vákár and Smeding 2022], which has our desired features of simplicity, purely functional nature, general applicability and verified correctness, but is not known to be efficient or parallelism preserving.
The main contributions of this paper to the existing body of literature (see Section 10) are:
• the observation that the CHAD algorithm for first-order functional programs (Section 2) has efficiency problems (Section 4) that are solvable, up to some log-factors, by using a sparse implementation of the required data types (Section 4.1) and state-passing style (Section 4.2);
• a further observation that the remaining log-factors can be optimized away by making the vectors mutable (Section 5);
• a proof (formalized in Agda) that the resulting implementation achieves the correct computational complexity expected of reverse AD algorithms (Section 6);
• a demonstration that the same techniques enable efficient AD of array programs (Section 7);
• the insight that all required state is only written to using (commutative, associative) accumulation operations that can easily be parallelized (Section 8);
• the finding that the naive CHAD algorithm for higher-order functions is fundamentally inefficient, but that efficiency can be restored through the use of defunctionalisation or closure conversion (Section 9).
This paper aims to provide a comprehensive exploration of the proposed AD algorithm while addressing its efficiency and applicability to a wide range of programming contexts.

BASIC REVERSE-MODE CHAD
We summarize the basic reverse-mode CHAD algorithms described and proved correct in [Elliott 2018; Nunes and Vákár 2023; Vákár 2021; Vákár and Smeding 2022; Vytiniotis et al. 2019], before proceeding to explain how its implementation can be made efficient. Consider programs written in a standard first-order functional language with the types and programs of Fig. 1. The types of this language are built using a primitive type R of real numbers, tuple types 1 and σ × τ, and sum types σ ⊔ τ. This language allows us to build complex programs by combining primitive (partial) functions op : Rⁿ → R, where Rⁿ ≔ R × · · · × R, and sign : R → B, which can be used to perform case distinctions on real numbers. We postpone the treatment of array and function types until Section 7 and Section 9, respectively. We use the standard sugar ⟨t 1 , . . . , t n ⟩ ≔ ⟨⟨t 1 , t 2 ⟩, · · · , t n ⟩ and let ⟨x, y⟩ = s in t ≔ let z = s in let x = fst z in let y = snd z in t (for a fresh variable z).
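To fix intuitions, the following is a minimal Haskell sketch (ours, not the paper's Fig. 1; constructor names are assumptions) of such a first-order source language as a deep embedding:

data Ty = TReal | TUnit | TPair Ty Ty | TSum Ty Ty

data Term
  = Var String
  | Let String Term Term
  | Unit | Pair Term Term | Fst Term | Snd Term
  | Inl Term | Inr Term
  | Case Term (String, Term) (String, Term)
  | Op String [Term]   -- a primitive (partial) operation op : R^n -> R
  | Sign Term          -- sign : R -> B, for case distinctions on reals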
The CHAD code transformation maps these programs into a target language with linear types for cotangents, with the typing of Fig. 3. In particular, all linear types are commutative monoids. For now, the reader may keep in mind the naive implementation of the linear types of Fig. 4, which we will later replace with a more optimized one. The linear operations ∂op : Rⁿ ⇀ R ⊸ Rⁿ are assumed to be (efficient) implementations of the transposed derivatives of the operations op.
Identifying Complexity Challenges. We want to prove this criterion by induction on the structure of programs. Doing so immediately exposes efficiency problems with the naive implementation of the linear types of Fig. 4. For such an inductive argument to go through, we would hope to match up the time complexity of all program constructs with that of their derivative. In particular, to efficiently differentiate programs that (1) discard values, we need zeros 0 to cost constant time; (2) copy values, we need addition + to cost constant time; (3) reference variables, we need one-hot environment vectors one x:τ∈Γ to cost constant time. In fact, these turn out to be the only problems with the complexity of the naive CHAD transformation on first-order programs, as split Γ,τ is already constant-time, as required to differentiate let-bindings.
Sparsity for Cheap Zeros and One-Hots. Problems (1) and (3) have a relatively straightforward solution: in Section 4.1 we replace the naive implementation of cotangents to tuples and environments by a sparse one: we implement tuples σ × τ in all linear data types by 1 ⊔ (σ × τ) and we use a (tree) map EMap Γ to implement EV Γ. The code transformation does not change, merely the under-the-hood implementation of linear types. This has introduced a slight new problem, however: the cost of split Γ,τ has become log-time. We will later remove this and other final log-factors from the complexity in Section 5.
Monadic Lifting to Avoid Sums on Cotangent Environments. The hardest problem in the complexity of CHAD turns out to be the efficiency of +, particularly on environment cotangents. Such additions arise from the use of the structural rule of contraction in the typing context in the source program. Put differently, for programs formed from multiple subprograms (partially) sharing the same typing context, we end up with such additions in the derivative program: in the derivatives of let-bindings, pairing, case-constructs and primitive operations. Unfortunately, such sharing of environments costs constant time in the source program, but constant-time addition of sparse environment cotangents is impossible [Brown and Tarjan 1979]. We therefore have to do something fundamentally different.
Our solution, in Section 4.2, is to observe that there is never a need to keep previous versions of the environment around: we can mutably update the cotangent environments. We realise this in a functional framework by slightly modifying the CHAD transformation to generate monadic code: in a monad EVM Γ that we think of as a local state monad with an environment vector of type EMap Γ as the stored state. We replace the use of split Γ,τ by the scoping operation of the monad: scope Γ,τ : EVM (Γ, x : τ) α → EVM Γ (τ × α).
(We postpone the discussion of the precise implementation of the monad, but the reader may keep the naive implementation EVM Γ α ≔ EMap Γ → α × EMap Γ in mind for intuition.) Environment cotangents are updated when we reference a variable in the source program. Finally, we replace environment vector addition by monadic sequencing, which is indeed constant-time as required.

Fig. 5. Pictured are two sparse structures, in blue and in red, together with their overlay. Leaves represent R, nodes with a vertical child indicate (one branch of) σ ⊔ τ, and nodes with diagonal children indicate σ × τ. Omitted lines are children omitted due to sparsity. Computing the sum of the two structures involves work on their overlap; in this case, at 4 nodes. The structure of the sum is shown in black on the right.
Amortising the Cost of Sums. As we explain in Section 4.3, our algorithm now has the correct complexity, as long as we can come up with some good implementation of the monad that does insertions in EMap Γ in constant-time. Indeed, the key idea is that we can amortise the cost of additions to cotangent environments (which happen in the derivative of variable references and, due to the monadic lifting, now are the only occurrence of cotangent addition) against the creation of the cotangents being added. The reason that this amortisation works is that (1) CHAD treats cotangents in an affine way, never copying them and thus only reusing the result after something has been added to a cotangent, and (2) the addition of our sparsely represented cotangents can be performed in cost that is proportional to the size of their intersection, as depicted in Fig. 5. This lets us carry out the amortisation argument by tracking, for each cotangent d when it is produced and consumed, a potential Φ(d) that is proportional to its size (sparsity). As we will see, by generalising the complexity criterion to arbitrary programs and by including the potential terms for amortisation, we can prove it by induction without problems.
Proving the Complexity Criterion by Induction. We need to slightly modify our complexity criterion of Eq. (1) before proceeding to prove it by induction, which we do in Section 6. Firstly, we need to generalize it to apply to programs x 1 : τ 1 , . . . , x n : τ n ⊢ t : σ of arbitrary type and in arbitrary typing context, to carry out the induction. Secondly, the fact that we have modified the CHAD transformation to produce monadic code means that we now need to make some mostly cosmetic modifications to the complexity criterion to involve the handler run : EVM Γ α → Γ → α × Γ of the monad. Thirdly, as discussed in the previous paragraph about amortisation, we need to add (negative and positive, resp.) terms to the complexity criterion for the potential of the incoming and outgoing cotangents, respectively. We arrive at the modified complexity criterion of Eq. (2), where we have abbreviated ⟨_, ⟨res 1 , . . . , res n ⟩⟩ ≔ run (snd D Γ [t] d) ⟨v 1 , . . . , v n ⟩. We prove Eq. (2) by induction on t, assuming a "reasonable" cost model (see Section 5) for the API of EVM Γ. Most cases in the proof are straightforward, since we do not build or consume any non-trivial fragments of cotangent values and hence do not change the potential. In those cases, the outgoing potential cancels precisely against the incoming potential and the induction goes through by our affine use of cotangents. The only exception is the case of differentiating variable references: here we need to manipulate potential because we are doing non-trivial computations with cotangents (i.e. adding an accumulated cotangent to the running total in the environment cotangent). Thus it is here that we need our observation about the cost of addition on sparsely represented cotangents being proportional to the size of their intersection to carry out the induction step.
By initialising env 0 with a tuple of zeros (as one would do in practice) in Eq. (2) and using that the potential of each res i is non-negative, we get the following corollary (Eq. (3)), which is the key complexity result of our paper. We have formalised our development of this complexity proof in Agda. The key construction is to implement the source and target languages of our CHAD code transformation as embedded languages with an evaluator eval : Programs → Values × Z that computes the result of a computation as well as a formal cost. Our transformation uses an abstract EVM Γ monad that we stipulate must have the desired cost for its operations. Next, we describe a few ways to implement such a monad.
Implementing a Mutating Monad for the Final Log-Factors. In Section 5 we explain how one might in practice replace our monad with one that does mutation under the hood, which is crucial to shave the final undesired log-factors off our complexity and to achieve the "reasonable" cost model under which our complexity proof goes through. As we have phrased our CHAD code transformation with respect to an abstract monad, the transformation stays identical. A first option uses a linear state monad EVM Γ α ≔ EArr Γ ⊸ !α × EArr Γ that we can implement in, for example, Linear Haskell [Bernardy et al. 2018], where EArr is like EMap except that its operations have linear types. A second approach relies on the ST monad [Launchbury and Jones 1994]: we implement EVM Γ in terms of ST s, where universal quantification of the type parameter s in the handler runST : (∀ s. ST s α) → α is used to ensure that the mutable state is properly encapsulated.
Efficient CHAD for Arrays. Arrays are viewed analogously to product types, and our approach builds on the concept of sparsity to efficiently represent zeros and one-hot cotangents. Our chosen representation for the cotangents to an array Array τ, in Section 7, is a collection, denoted as Bag (Z × D[τ] 2 ), where Bag σ is the free monoid on σ. We think of each element as a pair of an index and a cotangent at that index. This representation is essential for efficient differentiation, especially when considering operations like discarding, indexing, and sharing of elements in the array, as it achieves constant-time zero, one-hot cotangent, and addition of cotangents. The collection is finally converted into an array form, if needed, using collect : Bag σ → Array σ. The cost of this conversion is proportional to the time spent constructing the bag, and this cost is amortized.
We discuss how this representation gives us efficient derivatives for typical array operations. We treat the following illustrative set of array primitives, but the methodology applies more generally:
• array indexing a ! i that retrieves position i from an array a;
• an element-wise construction operation 'build n (i. t)' that constructs an array of length n with value t at position i;
• reduction 'fold (z. t) a', combining the elements of a non-empty array a : Array τ using t.
Similarly to the case of tuple projections, the derivative of indexing is given by a one-hot cotangent. As 'build' is the paradigmatic backward permutation operation, it is not surprising, given the reversal of data-flow in reverse AD, that we obtain its derivative through the use of the typical forward array permutation operation, 'scatter'. Anticipating our desire to preserve parallelism, we record the tree structure of the performed reduction in the primal computation for the CHAD transformation of 'fold' and invert the data-flow for the transposed derivative computation.
In all cases, our complexity argument works as before since: (1) 0 and + are constant time; (2) we use cotangents affinely; (3) CHAD-transformations of subterms are executed exactly once in the CHAD-transformation of the whole; (4) further overhead per primitive is within the same complexity as the primitive itself (for example, one-hot arrays are constant time, just as indexing is).
Parallelising CHAD. Parallelism tends to be important in code that we perform AD on. We typically want to retain it under AD and get parallel derivative code, as we discuss in Section 8. Our CHAD transformation generates monadic code, however, and monadic sequencing cannot be expected to parallelise. Luckily, we only use monadic sequencing where there are already data dependencies in the original program; otherwise, we independently combine monadic actions using >> and sequence. Further, we only modify the state by adding to it. That is, we are working with a (commutative) accumulation effect rather than a general piece of state that allows for arbitrary destructive updating. Such a commutative effect allows for parallel implementations of >> and sequence to get task-parallelism in the AD'ed code. Finally, we discuss how to give data-parallel derivatives for our data-parallel array operations.
Closure Conversion for Efficient CHAD of Lambdas. Naive CHAD of higher-order functions leads to a duplication of work, due to the separation of the transposed derivative w.r.t. function arguments and captured context variables. This inefficiency can be exploited to get an exponential blow-up in complexity. An obvious solution is to get rid of all captured context variables, i.e. to apply closure conversion before CHAD. One option is to apply full defunctionalisation first, which, using a global program analysis, reduces function types to a combination of finite sum and tuple types, constructs we already know how to differentiate. Another option, which does retain locality of the code transformation, is to explain how to differentiate the infinite sum types or existential types required to do typed closure conversion. We discuss both in Section 9.

FINDING AND SOLVING EFFICIENCY PROBLEMS
The basic algorithm defined in Section 2 is relatively simple and pleasant to analyse. However, it fails to satisfy the complexity criterion in Eq. (1), so there are some complexity issues to address.
Identifying Complexity Problems. To spot the complexity problems in the algorithm from Fig. 2, we can try to prove the complexity criterion in Eq. (1) inductively. To do this, we first have to strengthen the statement to also apply to subexpressions with larger contexts and non-R result types: ∃c, c′ > 0. ∀(x 1 : τ 1 , . . . , x n : τ n ⊢ t : σ). ∀v 1 : τ 1 , . . . , v n : τ n . ∀d : D[σ] 2 . Interpreting Eq. (4) intuitively, the criterion states that the cost of evaluating the code on the right-hand side of D Γ [t], including calling the backpropagator in the second component of its result, must be within a constant factor of the cost of evaluating the original expression t. 2 For instance, consider t = ⟨t 1 , t 2 ⟩. We can see (in Fig. 2) the evaluation of D Γ [t 1 ] and D Γ [t 2 ] on the right-hand side, and calls to their backpropagators inside that of t. By the induction hypothesis, these are offset by the evaluation of the original subterms on the left-hand side. Thus, the inductive step of the proof would hold for pair expressions - were it not for the (single) use of + on environment vectors EV Γ in the backpropagator: the fact that + EV Γ is expensive on the naive representation of EV Γ from Section 2 makes the pair case problematic.
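In our paraphrase, the criterion has the shape cost(let ⟨p, b⟩ = D Γ [t] ⟨v 1 , . . . , v n ⟩ in b d) ≤ c · cost(t; v 1 , . . . , v n ) + c′; that is, running the derivative code, including one call to the backpropagator b on a cotangent d, is at most a constant factor more expensive than running t itself on the same inputs.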
Inefficient Monoid Structures on Cotangents. Similar reasoning elsewhere in the code transformation uncovers some more operations that are required to be efficient and are potentially not. Naively (but see Section 4.3), all of the operations shown in Table 1 would need to run in constant time. Because our source language type system (Fig. 1) does not have unbounded-size products (i.e. n-ary tuples), one-hot cotangents are not an issue, because just building the one-hot value from (constant-time) zeros will take bounded cost anyway. An example of this is the ⟨|d, 0|⟩ in D Γ [fst t] in Fig. 2. Had we included n-ary tuples in our language, these one-hots would have been a larger issue, needing a solution similar to that for arrays (Section 7); after all, from a complexity perspective, arrays are similar to (homogeneous) n-ary tuples. Thus we can focus on the other operations.
As an example of the need for constant-time zero, consider the program : ] also take time, and thus it would seem that + needs to be constant-time as well, explaining its presence in the table above. However, as we explain in Section 4.3, with a smarter analysis we can weaken this requirement.
Our current naive implementations of EV Γ and D[τ] 2 do not at all reach these constant-time requirements, so we need to do something about this. Indeed, so far, EV Γ is a simple product of the types in Γ (EV (x 1 : τ 1 , . . . , x n : τ n ) = τ 1 × · · · × τ n , as defined in Section 2); this representation has expensive zero, plus and one-hot (all at least linear in the number of variables in Γ), but efficient split Γ,τ .
Solving the Problems with Sparse Representations. We address the problem of expensive zeros in Section 4.1 by choosing sparse representations of the monoids that are not already sparse: D[σ × τ] 2 and EV Γ. Then (in Section 4.2) we lift the output of the transformation to monadic code in order to eliminate the need for + EV Γ , as well as to prepare for making one and split constant time (which became logarithmic because of the sparse representation) using a mutable array in Section 5. Finally, we only have + left, which then turns out to not be a problem any more: an amortisation argument (informally in Section 4.3, more formally in Section 6) shows that we can discount + D[τ] 2 against the building up of cotangent values.

4.1
Step 1: Sparsity

Sparse data structures aim to represent the uninteresting parts of a data structure as compactly as possible, focusing on the interesting (usually the non-zero) parts. Such representations not only conserve memory but also serve to avoid computing with many useless (zero) values, since the result is typically as uninteresting as the input: zero.
data EMap Γ -- all types in Γ must be monoids
empty : EMap Γ -- O(1)
get x:τ∈Γ : EMap Γ → τ -- O(log(map size))
push 0 : EMap Γ → EMap (Γ, x : τ) -- only a type change; O(1)
pop ′ : EMap (Γ, x : τ) → EMap Γ -- O(log(map size)); map delete operation
modify x:τ∈Γ : (τ → τ) → EMap Γ → EMap Γ -- O(log(map size)) + O(one call to the function)
union : EMap Γ → EMap Γ → EMap Γ -- see caption

Fig. 6. The interface of the environment map, used in Section 4.1. Here, the notation "x : τ ∈ Γ" means a pointer into Γ, i.e. an integer from 1 to the length of Γ. Ignoring type safety, this interface, including the noted time complexities, can be implemented with a standard immutable tree map. Regarding union: if passed two maps of size n and m, n ≤ m, then its runtime is O(n log(m/n + 1)) + O(the required + calls), and this complexity is optimal [Brown and Tarjan 1979]. The first term is O(n) when n ≈ m and O(log m) when n is small.
Fixing Cotangent Zeros. Given the tree structure of our types (which are built out of products and sums), a natural first attempt for a sparse representation is to add an explicit, redundant value representing "zero" for products. (0 1 , 0 R and 0 σ⊔τ are already constant-time.) That is, we change the implementation of the linear type σ × τ from σ × τ to 1 ⊔ (σ × τ). The implementation of its API changes accordingly (a sketch follows below). The result is that 0 D[τ] 2 is now constant time for all types in our type system (Fig. 1).
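For concreteness, here is a minimal Haskell sketch of such a sparse product of two commutative monoids (ours, not the paper's Fig. 4 code; the names SPair, CMonoid, czero and cplus are assumptions, while lfst and lsnd mirror the paper's projections):

-- Sparse product of two commutative monoids: an explicit zero constructor
-- avoids allocating a pair of zeros when a cotangent is discarded.
data SPair a b = SZero | SPair a b

class CMonoid a where
  czero :: a
  cplus :: a -> a -> a

instance (CMonoid a, CMonoid b) => CMonoid (SPair a b) where
  czero = SZero                     -- constant-time zero
  cplus SZero q = q                 -- constant-time when one side is zero
  cplus p SZero = p
  cplus (SPair a b) (SPair a' b') = SPair (cplus a a') (cplus b b')

-- Projections: a missing (zero) pair projects to the zero of the component.
lfst :: CMonoid a => SPair a b -> a
lfst SZero       = czero
lfst (SPair a _) = a

lsnd :: CMonoid b => SPair a b -> b
lsnd SZero       = czero
lsnd (SPair _ b) = b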
Fixing Environment Zeros and One-Hots. Since we implemented EV (x 1 : τ 1 , . . . , x n : τ n ) as τ 1 × · · · × τ n in Section 2, the 0 EV Γ in D Γ [⟨⟩] and the one x:τ∈Γ still have to create a big tuple filled with zeros. A standard way to make a sequence (a total map from bounded integers to values) sparse is to use an associative array, often implemented as a balanced binary tree in programming languages 3 (which is essentially a partial map from keys to values). Since our keys are slightly odd - they are pointers into the environment, and the type of the associated value depends on the index - we will explicitly specify the interface of such a map data structure instead of directly instantiating the standard data structure. However, it should be clear that it can be implemented with a standard immutable tree-map, modulo type-safety. This API is shown in Fig. 6. In the interface, empty creates an empty map, and get x:τ∈Γ m gets the entry for x from the map m, returning 0 if there is no entry there yet. push 0 and pop ′ , respectively, extend and reduce the environment, where pop ′ performs actual work because it has to remove the now-extraneous entry from the underlying map structure; push 0 simply changes the type to make an additional key allowable, but does not need to add the key yet. (Conceptually, push 0 pushes a 0, but zeros are elided due to sparsity.) modify x:τ∈Γ f m applies the function f to the value in the map m for variable x, preinitialising with 0 if necessary. union takes the union of the two maps, adding overlapping values using the addition operation + of the monoids contained in Γ. Finally, pop is a derived operation.
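A Haskell sketch of ours of this interface on top of a standard immutable tree map, ignoring the type-safety of keys as the Fig. 6 caption allows; for illustration the cotangent values are plain Doubles rather than the sparse values of Fig. 4:

import qualified Data.Map.Strict as M

type V = Double       -- stand-in for the cotangent values

zeroV :: V
zeroV = 0

plusV :: V -> V -> V
plusV = (+)

-- Keys are positions in the environment (the "x : τ ∈ Γ" pointers).
newtype EMap = EMap (M.Map Int V)

empty :: EMap
empty = EMap M.empty                              -- O(1)

get :: Int -> EMap -> V
get x (EMap m) = M.findWithDefault zeroV x m      -- missing entry = zero

modify :: Int -> (V -> V) -> EMap -> EMap         -- preinitialises with zero
modify x f (EMap m) = EMap (M.insertWith (\_ old -> f old) x (f zeroV) m)

union :: EMap -> EMap -> EMap                     -- add overlapping entries
union (EMap m1) (EMap m2) = EMap (M.unionWith plusV m1 m2)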
Using this interface, we can implement EV Γ as EMap Γ and instantiate its methods as follows: 0 EV Γ becomes empty, + EV Γ becomes union, one x:τ∈Γ d builds a singleton map, and split Γ,τ becomes pop (a get of the topmost entry followed by pop ′ ). At this point, zero cotangents (due to our new sparse × ) as well as zero and one-hot environment vectors can all be constructed in constant time as required. Note that the program transformation of Fig. 2 has not changed: simply the implementation of some of the types has changed. Thus the remaining points of care in the derivative program are addition of cotangent values and environment cotangents, as well as split Γ,τ , which has inadvertently become log-time with this change in representation of EV Γ; we address these in the following two sections.

4.2
Step 2: Monadic Lifting

Let us first focus on the addition operation on environment cotangents (EV D[Γ] 2 ). It occurs in the differentiated program whenever evaluation of the corresponding source program term involves evaluating more than one subterm. 4 Usually, however, these source program terms do not take time on the order of the size of the environment, and hence do not "pay" (in the sense of a direct inductive proof of the complexity criterion Eq. (4) as explained above in Section 4) for union, which is at least logarithmic in the size of its arguments. 5 Fortunately, we have some yet-unexploited flexibility: because the environment cotangent is only ever modified by adding contributions to it (and since addition is commutative and associative), we can rearrange these additions at will. For example, instead of returning the environment cotangent contributions from each subprogram and merging the results using union, we can also pass a growing accumulator around in state-passing style, adding individual one x:τ∈Γ contributions to it as the derivative program executes. Using a slightly modified definition of one (call it one ′ ), we can add the single contribution (here d) from D Γ [x] in time O(log(map size)) + O(cost of '+') to the passed state.
To use this one ′ , we have to change the structure of the derivative program somewhat: instead of passing environment contributions upwards, merging them as control flows meet, we have to pass around a single environment cotangent in state-passing style, that we modify with one ′ each time we encounter a variable reference. The result is that the derivative program then lives inside a variant of a local state monad [Plotkin and Power 2002; Staton 2010]: it is a state monad whose state is divided up into components, one for each entry in the environment. 0 EV Γ becomes return ⟨⟩, + EV Γ becomes >>, one x:τ∈Γ becomes one ′ x:τ∈Γ , and usages of split Γ,τ become usages of scope Γ,τ instead, using >>= to extract the results. EVM is the new local state monad; its methods are sketched below.
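The following is a minimal Haskell sketch of ours of such an accumulation monad (untyped in Γ and with Double cotangents for illustration; the paper's graded typing EVM Γ α is elided), including one ′ , scope and the handler run:

import qualified Data.Map.Strict as M

type V   = Double
type Env = M.Map Int V   -- the sparse environment cotangent, keyed by variable position

-- A state monad over the environment cotangent.
newtype EVM a = EVM { runEVM :: Env -> (a, Env) }

instance Functor EVM where
  fmap f (EVM g) = EVM (\s -> let (a, s') = g s in (f a, s'))
instance Applicative EVM where
  pure a = EVM (\s -> (a, s))
  EVM mf <*> EVM ma = EVM (\s ->
    let (f, s1) = mf s
        (a, s2) = ma s1
    in (f a, s2))
instance Monad EVM where
  EVM m >>= k = EVM (\s -> let (a, s1) = m s in runEVM (k a) s1)

-- one': add a contribution d to the accumulator entry of variable x.
one' :: Int -> V -> EVM ()
one' x d = EVM (\s -> ((), M.insertWith (+) x d s))

-- scope: run a subcomputation with one extra (initially absent, i.e. zero)
-- entry at index 'top', then pop that entry off and return it with the result.
scope :: Int -> EVM a -> EVM (V, a)
scope top m = EVM (\s ->
  let (a, s1) = runEVM m s
  in ((M.findWithDefault 0 top s1, a), M.delete top s1))

-- run: the handler of the monad, starting from an initial environment cotangent.
run :: EVM a -> Env -> (a, Env)
run = runEVM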
About the Methods. Notable here is that the role of split is now fulfilled by scope. In principle, because we are now state-passing, we must not only pop (split off) the outermost variable of the environment cotangent vector returned by the derivative of a subcomputation with an extended scope (e.g. the body of a let), but also push an empty entry on the incoming vector before being able to pass it on to that same subcomputation in the first place. Had we chosen to use pure state-passing style (EV Γ = EMap Γ → EMap Γ), we would indeed have used push 0 from Fig. 6; however, here in monadic style, such separate push and split would not typecheck. Thus, scope combines both.
Fortunately, D Γ is compositional, meaning that D Γ ′ [s] is a subterm of D Γ [t] whenever s is a subterm of t. (And D Γ [t] does not depend in any other way on the structure of t.) Therefore, we can scope the usage of an extended environment to the monadic subcomputation that handles the subterm with that extended environment using a grading in the style of a local state monad. This scoping is done by the updated scope Γ,τ above: conceptually, it first extends the EMap Γ in the state to an EMap (Γ, x : τ) (the push step - semantically storing 0 in the new cell but operationally, because of sparsity, just changing the monad type), then runs the subcomputation of type EVM (Γ, x : τ) α with that extended state, and finally pops off the extra value of type τ and returns it along with the return value of the subcomputation (of type α).
Finally, run is the handler (see [Plotkin and Pretnar 2013]) of the monad, which threads an initial environment cotangent through the computation and returns the final one alongside the result.

What Did We Achieve? In contrast to the changes in Section 4.1, we now have to change the code transformation to produce monadic code; the updated code transformation is shown in Fig. 7.
Let us look back at the operations in Table 1 and take stock. All zeros are now constant time and + EV Γ is gone. On the other hand, one x:τ∈Γ d adds (using + τ ) the cotangent value d to the entry in the environment cotangent (in the monad state) corresponding to the variable x. This takes time logarithmic in the size of the environment cotangent (thus usually O(log |Γ|), unless most variables are unused), plus the time required to invoke + τ on d and the running total. Furthermore, scope Γ,τ is also logarithmic in the size of the environment cotangent.
In Section 5, we will remove the logarithmic-time operations by using a mutable array instead of a persistent tree map (EMap); this will modify only the implementation of EVM and its methods, keeping the code transformation itself completely as-is. 6 Then, the only remaining potential problem is the cotangent addition (+ τ ) in one. However, as we discuss below in Section 4.3, we can amortise the cost of these additions against the creation of the cotangent values being added, meaning that the code transformation as it is now in Fig. 7 - with the efficient implementation of EVM from Section 5 - is actually already finished and efficient.

4.3
Step 3: There is no step 3 (the amortisation argument, informally)

Affine Use of Cotangents. To see how we are going to argue that the use of + is already efficient, first observe that in Fig. 2 (and still in Fig. 7), cotangent values are used in an affine 7 manner: they are mostly not duplicated, and when they are (in D Γ [⟨t 1 , t 2 ⟩]), the structure is split using lfst and lsnd before using the constituent parts affinely again. (We could encode this affine usage with a resource-aware type system, but this does not really bring benefit for our presentation; the observation is simply useful to explain why the amortisation argument will go through, and in a way it is proved by the amortisation argument itself.) Because of this affine usage, once a cotangent value has been added to another, only the sum will again be used elsewhere; the values used to build this sum will never be used again.
Addition of Sparse Structures. The second ingredient that we need is an observation about the cost of +. The addition of two sparse cotangent values d 1 and d 2 of type D[τ] 2 , i.e. d 1 + D[τ] 2 d 2 , is essentially a zip of the two (sparse) structures. In this zip, we assume that the two cotangents, as far as they are defined (i.e. not omitted due to sparsity), have equal structure: all values of type σ × τ have the form ⟨|d, d ′ |⟩ for further structures d and d ′ , and although a value of type σ ⊔ τ can be both linl d and linr d, we raise a runtime error if the two do not correspond (see Fig. 4 and the error calls inside lcastl and lcastr).
Hence, this zipping operation visits precisely the common "prefix", or rather the intersection, of both structures - no more, no less. Subtrees that occur in only one of the two arguments to + are simply returned as-is, which is constant-time. An example is shown in Fig. 5 in Section 3. The circled nodes are the nodes that the two inputs share, i.e. neither has omitted. The implementation of + does not have to recurse into the left subtree of the blue structure, nor into some of the branches of the red one.
More formally, define the size of a cotangent value d, size d, as the number of materialised nodes in d. Then, formalising the observation of Fig. 5, we have the following inequality, in which c is the number of computation steps (in our cost model) that + at most requires to fully handle one node in a cotangent value. Now we define our potential function as Φ(d) = c · size d, measuring the number of computation steps that we budget in a node and can still use when consuming the cotangent value later. Then, rearranging terms, the inequality rewrites to Eq. (5): in other words, after accounting for the potential flowing in (through d 1 and d 2 ) and flowing out (through d 1 + d 2 ), addition is free.
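In our paraphrase, writing d 1 ∩ d 2 for the overlap of the materialised nodes of d 1 and d 2 , the two inequalities have the shape cost(d 1 + d 2 ) ≤ c · size(d 1 ∩ d 2 ) + O(1) and, using size(d 1 + d 2 ) ≤ size d 1 + size d 2 − size(d 1 ∩ d 2 ) together with Φ(d) = c · size d, the rearranged form cost(d 1 + d 2 ) + Φ(d 1 + d 2 ) ≤ Φ(d 1 ) + Φ(d 2 ) + O(1).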
Amortising Sums. Armed with these two observations, first consider the simplified situation where a large number of cotangents d 1 , . . . , d k are successively added to a single, threaded-through accumulator: (· · · ((a + d 1 ) + d 2 ) + · · · ) + d k . Each of the additions involved takes time at most proportional to the size of the d i added there - "size" being the number of materialised nodes in memory. Furthermore, since we do not duplicate structures, constructing d i itself will also have taken time at least on the order of the size of d i . Thus, this chain of additions (at most) duplicates the work done in constructing the d i , which we had to do anyway; so essentially these additions are free. We have amortised the additions against the construction of their inputs.
Of course, in a general derivative program, the additions will not necessarily be done in such a linear fashion, but a more precise analysis of the above situation will show how the amortisation argument works in general. Indeed, because of our second observation (Fig. 5), the cost of (. . .) + d i is not really proportional to the whole size of d i : only the intersection of (. . .) and d i is traversed in +. Furthermore, the parts of d i that are not traversed are included as-is in the sum without processing. We can still amortise against those non-traversed subtrees of the sum! Indeed, assuming we could still amortise against the entirety of both arguments d 1 and d 2 of an addition d 1 + d 2 , their sum will consist of a number of subtrees that were included unchanged, connected by their intersection that + did traverse once. However, on precisely that intersection, we had two input nodes we could amortise against: we can arbitrarily choose to amortise this addition against the overlapping part of d 1 , and still have the corresponding part of d 2 to amortise against later. So in the end, the entire result of the addition can still be amortised against, preserving the invariant that we can always still amortise against the entirety of a cotangent value. Seeing that we never copy cotangents, there is no risk of amortising against the same cotangent twice.
It turns out that modifying the full complexity criterion by accounting for potential flowing in and out like in Eq. (5), yields a statement that easily proves itself by induction; the only thing we still need is to implement EVM efficiently, removing the logarithmic factors in its complexity. This optimisation is described in Section 5, after which Section 6 sketches the full complexity proof.
Fig. 8. The interface of the mutable environment array (data EArr Γ; all types in Γ must be monoids). Its methods are the resource-linear analogues of those of EMap Γ in Fig. 6: push 0 runs in amortised O(1) (by the standard vector push amortisation) and the remaining operations (pop, modify, get) run in O(1), apart from any calls to + or to the supplied function.

GETTING RID OF LOG FACTORS
The current implementation of EVM is not quite efficient enough for our purposes: one x:τ∈Γ d not only adds (using +) d to the value for x in the monad state, but also takes O(log |Γ|) time to find and update that value in the EMap, a purely functional tree map. Similarly, scope has logarithmic overhead for updating values in the EMap. These logarithmic computations violate the complexity criterion (because neither variable references nor let-bindings in the source program account for those logarithmic costs), but fortunately we can do better by replacing the EMap with a mutable array.
Using resource-linear types (such as those implemented in Linear Haskell [Bernardy et al. 2018]), we can redefine EVM as follows: 8 EVM Γ α ≔ EArr Γ ⊸ !α × EArr Γ (6), where EArr Γ is a mutable array with methods analogous to those of EMap Γ; its methods are shown in Fig. 8 together with their complexities. The changes are the usual ones for a resource-linear mutable array (giving back the input array in the result and changing some arrows to resource-linear ones). ! denotes an unrestricted type, e.g. Ur in Linear Haskell. The constant-time complexity of push 0 can be justified either by appealing to the standard amortisation argument 9 or by preallocating an array as large as the maximum environment depth (a statically-known property, after all) at the start of execution.
If the reader is unfamiliar with resource-linear types, they can simply imagine that we use the original, purely functional State monad, but assume that EMap somehow has a magical implementation where pop ′ , modify x:τ∈Γ and get x:τ∈Γ from Fig. 6 all run in O(1) (apart from the cost of calling f once in modify). A further possible implementation (and the one that we will use in Section 8) is one in terms of the Haskell ST monad [Launchbury and Jones 1994], sketched below. This version uses the monadic structure (instead of linear types) to enforce sequentiality of array access operations, enabling mutable updates just like the version using linear types from Eq. (6). Our Agda formalisation of the complexity proof is actually generic over all these implementations, because it assumes only an API, complexity properties of that API, and some equations on that API that define the semantics of the methods. The implementation is kept black-box; see Section 6.2.
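A minimal Haskell sketch of ours of such an ST-based accumulation monad (untyped in Γ, with Double cotangents and an STArray accumulator):

{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STArray, newArray, readArray, writeArray, getElems)

-- Cotangent values are Doubles for illustration; the environment cotangent is
-- a mutable array indexed by variable position, threaded implicitly by ST.
type V = Double
type EArr s = STArray s Int V

-- one': add a contribution d to the accumulator cell of variable x.
one' :: EArr s -> Int -> V -> ST s ()
one' arr x d = do
  v <- readArray arr x
  writeArray arr x (v + d)

-- run: the handler. It allocates an accumulator of n zeros, runs the
-- computation, and returns its result together with the final accumulator;
-- runST's universally quantified 's' keeps the mutation encapsulated.
run :: Int -> (forall s. EArr s -> ST s a) -> (a, [V])
run n act = runST (do
  arr <- newArray (0, n - 1) 0
  a   <- act arr
  vs  <- getElems arr
  return (a, vs))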

Method and cost:
one x:τ∈Γ : τ → EVM Γ 1, implemented as one x:τ∈Γ (d : τ) (arr : EArr Γ) = modify x:τ∈Γ (d +) arr -- O(1) + O(cost of adding d to the value in arr for x)
scope Γ,τ : EVM (Γ, x : τ) α → EVM Γ (τ × α) -- O(1)
run : EVM Γ α → Γ → α × Γ -- c run · |Γ| + (cost of the monadic computation m)

Table 2. The API of EVM Γ, using the implementation of EArr Γ from Fig. 8. The cost of run is not written as the slightly weaker O(|Γ|) + O(cost of m), to allow cancelling some run occurrences when computing with run in the proof. The implementation of run uses the linear array API and is verbose but unsurprising.
Resulting Complexities. The resulting complexity of the methods of EVM is shown in Table 2. A user of the API will typically pass a tuple of zeros to run, but in our inductive argument in Section 6 we will also pass in more filled-in environments.
To understand the complexity of run, note that its implementation has to allocate and deallocate an array, serialising the input and the output environment (of type Γ) to and from that array. We assume here that we are allowed to return cotangents in our sparse format; if not, the |Γ| term in its complexity would increase to the sum of size τ over all x : τ ∈ Γ (where size τ is proportional to the time required to (de)serialise a value of type τ, and thus to convert back and forth to our sparse representation).
With this optimised implementation of EVM, we are finally ready to prove our algorithm efficient.

COMPLEXITY PROOF
Because the derivative program now lives in a monad, the (generalised) complexity criterion (Eq. (4)) has to be modified to include its handler, run. Since run takes an initial environment cotangent (to be accumulated into) as an additional argument, we need to provide one; we generalise over which environment the monad is initialised with to enable a proof by induction. The constant c run is from Table 2. The next step is to account for amortisation, by subtracting incoming potential from the cost of the scrutinised expression (i.e. making incoming potential available as free computation steps), and adding outgoing potential to its cost (i.e. declaring outgoing potential as additional computation steps). This yields the criterion of Eq. (2) (in Section 3). We prove this statement by induction on t; Section 6.1 gives a proof sketch illustrating the main ideas.
From Eq. (2) we derive a corollary that more directly states what a user can expect from our optimised version of CHAD. We initialise env 0 with a tuple of zeros, because we typically want the gradient, not the gradient plus some initial value; furthermore, we bound the outgoing potential using the fact that it is non-negative, and recall that the potential of a cotangent is proportional to its size. This eliminates the potential from the theorem statement, yielding Eq. (3) as a corollary. Note that c ′′ captures both c run and the O(n) creation of the tuple ⟨0 τ 1 , . . . , 0 τ n ⟩. This is our final complexity theorem, and our Agda formalisation proves Eq. (3). 10 Note that Eq. (3) is the same as Eq. (4) apart from an additional term on the right-hand side of order n + size d. However, this is usually small, because even if a program has many inputs, these are typically organised in data structures rather than being passed as a large set of separate inputs; for most applications, we feed reverse AD'd code very sparse cotangents (e.g. basis vectors) for which size d is small.

Proof Sketch: Induction on t
The statement being proved here is Eq. (2). We consider three representative cases of terms in the induction: (1) the very simplest case of t = ⟨⟩ (where we do not require an induction hypothesis), to illustrate some of the basic book-keeping about potential flowing in and out of computations; (2) a slightly more complex case of t = ⟨t 1 , t 2 ⟩ to show how we invoke the induction hypothesis; (3) the case of variable references t = x, which is the only case in the whole proof where real work happens, as this is where the potential is actually used, so we cannot merely cancel out the incoming and outgoing potentials.
The proof sketch uses "O(1)" as notation to indicate some unspecified bounded value whose bounds are independent of any of the other variables in the proof. We use this to abbreviate the cost of some constant-time work whose exact cost is immaterial to the argument it appears in.
For t = ⟨⟩, Eq. (2) simplifies to a clearly true inequality. Note what happened with the potential: the first source of incoming potential (in d : D[1] 2 ) was ignored, resulting in an extra −1 term on the left-hand side of the inequality. This only made proving the theorem "easier": we got some unused free evaluation steps. The second source of incoming potential (in env 1 , . . . , env n ) cancelled exactly against the outgoing potential (in res 1 , . . . , res n ) because we did not change the accumulated environment cotangent.
Subterms and Sparsity. Now let us consider the term t = ⟨t 1 , t 2 ⟩. When we evaluate the scrutinised expression, run (snd D Γ [t] d) env 0 , in addition to some constant-cost work we are going to do the following things (see Fig. 7): (1)-(3) evaluate the CHAD transforms of t 1 and t 2 and call the backpropagator of t 1 on the argument lfst d; (4) call the backpropagator of t 2 on the argument lsnd d; (5) sequence the results of (3) and (4) in the monad and run the result.
In short, this is equivalent (apart from some constant-cost work) to evaluating a monadic expression that sequences the two backpropagator calls. Because of the implementation as a state monad, we have, after defining ⟨a, env ′ ⟩ ≔ run m env, a lemma about the cost of bind (Eq. (9)), in which the term −c run · |Γ| arises because on the left-hand side we (de)serialise env only once whereas on the right-hand side we do so twice: we have to subtract one of the two to make the left and right-hand sides equal. Using this lemma about the cost of bind (Eq. (9)), Eq. (2) simplifies accordingly, and the induction hypotheses for t 1 and t 2 discharge the remaining cost terms.
Again, note what happened to the potential: the potential in the incoming cotangent d was mostly immaterial: it contributed +1 or −1 depending on how sparsely it was represented, but did not do anything significant. This is expected, since in D Γ [⟨t 1 , t 2 ⟩] we do not build or consume any non-trivial fragments of cotangent values. As for the potential in the environment cotangent accumulator: the outgoing potential of snd D Γ [t 1 ], equal to the summed potential of the entries of env ′ , cancels precisely against the incoming potential (via the environment cotangent) of snd D Γ [t 2 ], which is again expected because the environment cotangent itself is passed as-is from t 1 to t 2 .
In general, this is what always happens in the proof: as long as we do not do anything material to cotangents, they at most consume a bounded number of evaluation steps in stored potential, and as long as we do not modify the environment cotangent ourselves, all the corresponding terms cancel. The only case where we do something material to all of these, and where the terms do not immediately cancel, is for variable references.
Variable References: Amortisation. Taking t = x i : τ i in the environment x 1 : τ 1 , . . . , x n : τ n and inlining into Eq. (2), we get Eq. (12), whose left-hand side is O(1) + cost(run (one x i :τ i ∈Γ d) env 0 ). Here we write v i for the value of the variable x i in the current evaluation environment. Now we know that cost(x i ; Γ) = 1 for a variable x i , and furthermore that for all j ≠ i, we have res j = env j . For res i , we know from the semantics of one (see Table 2) that res i = env i + d. Its complexity property specialises to this case, and thus Eq. (12) simplifies to Eq. (13). Now we use the central amortisation property of (+) (Eq. (5) in Section 4.3), which implies Eq. (14). Subtracting Eq. (14) from Eq. (13) gives us that it is enough to show that O(1) ≤ c + c ′ + c run · |Γ|, which is immediate for sufficiently large c. Unlike before, we have actually used potential here: we received potential for env i and d and needed to return potential for their sum. Because the sum will contain less potential than the inputs to (+), we can use the excess potential to pay for the execution of (+) itself, without needing to count more than a bounded number of evaluation steps here for the transform of a variable reference.
The Other Cases. The other cases in the proof are mostly analogous to the cases discussed above. For t = (let x = t 1 in t 2 ), we end up needing the lemma that the CHAD primal is equal to the original program result (i.e. that fst D Γ [t] returns the same result as t when run in the same environment); this is required because we need to relate the cost of evaluating t 2 to that of evaluating run (snd D Γ [t 2 ] d) env 0 , and these two evaluations happen in the same environment only if fst D Γ [t 1 ] and t 1 return the same result.

Agda Formalisation
We have formalized the above complexity proof in Agda (2.6.3, --safe --without-K). Agda [Norell 2007] is a dependently-typed functional language and proof assistant, and one of the standard proof assistants in the domain of programming languages. While not typically used for proofs with integer reasoning, it admits a very natural encoding of the problem statement. Our full development can be found online 11 ; the statements of the theorems, and the definitions required to write those statements, are included in Appendix B.
In the development, the source and target language are encoded as a fully well-typed well-scoped (De Bruijn) inductive data type in the standard fashion; the cost model is encoded in the evaluator, which evaluates an expression of type τ to a (meta-language, i.e. Agda) value of type τ × Z. The integer contains the number of 'steps' taken in evaluating the expression: our cost model. In a way, the Agda proof is more generic than the sketch above, because it defines the methods of EVM 12 together with properties about their semantics and complexity in an abstract block. This means that the rest of the proof cannot use the implementation of the methods and the monad type itself, but only their types (and the properties of those methods that we provide in the block). Because all three of the monad implementations that we outlined in Section 5 satisfy those properties, we know that our Agda proof works for all three, regardless of exactly which concrete monad implementation we choose for the Agda formalisation. 13

12 The concrete implementation there is called LACM, for "Linear ACcumulation Monad".
13 The actual Agda implementation is a state monad with cons-list state, but with costs as if it was a constant-time version.

ARRAYS

So far, the AD algorithm is defined only for simple first-order programs. However, adding arrays to our language is not difficult, if somewhat tedious. We will show how the elements of a complexity proof for array operations are analogous to the cases already discussed in Section 6, but we leave a full proof to future work.
An array is a product type, hence we should take inspiration from the handling of binary products (σ × τ), where we introduced sparsity in order to efficiently represent zeros 0 D[σ×τ] 2 and one-hot cotangents ⟨d, 0⟩ and ⟨0, d⟩. Thus our choice of D[Array τ] 2 should allow efficient zeros and one-hots as well.
A one-hot array cotangent is a pair of an index (of type Z) 14 and a cotangent for that cell (of type D[τ] 2 ), hence a sufficient choice seems to be to let D[Array τ] 2 be a collection Bag (Z × D[τ] 2 ) of pairs that we will convert to an array using collect : Bag σ → Array σ once we are done constructing it. For efficient differentiation of discarding, indexing and sharing, this bag should furthermore support constant-time creation of empty and singleton collections, as well as constant-time combination of two collections. It turns out to be sufficient to simply defunctionalise the operations that we want to be efficient and use the following type definition (i.e., make them constant-time by construction): 15

data Bag σ = BEmpty | BOne σ | BPlus (Bag σ) (Bag σ)

Observe that 'collect'ing such a Bag costs at most as much time as was spent constructing it. Hence, if we use the Bag affinely, which we do, the cost of 'collect'ing it can be amortised against its creation. Therefore, in terms of complexity, we cannot do better than this, absent parallelism.

Operations. The array operations that we discuss here are elementwise construction, indexing and associative reduction, with the typing one would expect: build n (i. t) : Array τ for Γ ⊢ n : Z and Γ, i : Z ⊢ t : τ; a ! i : τ for a : Array τ and i : Z; and fold (z. t) a : τ for Γ, z : τ × τ ⊢ t : τ and a : Array τ. The informal semantics are as follows: build n (i. t) = [t[0/i], . . . , t[n − 1/i]], [v 1 , . . . , v n ] ! i = v i+1 , and fold (z. fst z ★ snd z) [v 1 , . . . , v n ] = v 1 ★ · · · ★ v n . 'fold' requires its array argument to be of non-zero length, and performs a reduction in unspecified order, assuming that ★ is associative. 16

14 With arrays we also need Z in our type system; being a discrete type, D[Z] 1 ≔ Z and D[Z] 2 ≔ 1 suffice.
15 The fact that Bag is a free monoid is unsurprising (because among the operations we defunctionalised are zero and plus), but also not very fundamental: in Section 8.1 we will add more constructors to Bag to avoid pessimising other operations.
16 This is a typical operation in parallel array languages, such as fold1 in Accelerate and reduce in Futhark. One could lift the requirement that the array be non-empty by adding an additional initial value for the reduction, but that would make this section more verbose without adding interesting new problems.
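A small Haskell sketch of ours of this bag and of a flattening collect (with lists standing in for the Array type):

-- The defunctionalised bag: empty, singleton and union are constant-time
-- constructors, so zero, one-hot and + on array cotangents are all O(1).
data Bag a = BEmpty | BOne a | BPlus (Bag a) (Bag a)

instance Semigroup (Bag a) where (<>) = BPlus
instance Monoid    (Bag a) where mempty = BEmpty

-- 'collect' flattens a bag into a list; the work is proportional to the
-- number of nodes, i.e. to the time already spent building the bag.
collect :: Bag a -> [a]
collect b = go b []
  where
    go BEmpty      acc = acc
    go (BOne x)    acc = x : acc
    go (BPlus l r) acc = go l (go r acc)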
Parallel array languages typically have other operations as well, including special cases of 'build' such as gather, transpose, stencils/convolutions and more (which can be implemented more efficiently than the general case), but also independent operations such as various scans and scatter (forward array permutation). In practice one will need a specialised derivative for each of these, the former for efficiency and the latter for expressivity, but here we restrict ourselves to the mentioned threesome, which is already powerful enough to express most machine learning models.
Derivatives. We show the derivatives of the three operations in Fig. 9. The simplest of the three, D Γ [a ! i], should not be surprising given the choice of D[Array τ] 2 : it behaves similarly to the derivative of a tuple projection (fst and snd), and its complexity is clearly sound for the same reasons. For build, we use three additional array operations with the following types: unzip : Array (σ × τ) → (Array σ) × (Array τ), scatter : Monoid τ ⇒ Array τ → Array (Z × τ) → Array τ, and zipWith (x y. t) a b : Array ρ for Γ, x : σ, y : τ ⊢ t : ρ, a : Array σ and b : Array τ, where 'scatter' "adds" the values in its second argument to the indicated positions in the first argument using the "add" operation from its monoid structure. (The 'Monoid τ ⇒' notation indicates a constraint on τ, in Haskell syntax.) 'unzip' and 'zipWith' can be defined in terms of 'build' and indexing; 'scatter' is a new primitive running in time linear in the size of its inputs in the sequential case. Finally we also use collect : Bag σ → Array σ and the monadic sequence operation (sequence : Array (EVM Γ τ) → EVM Γ (Array τ)). 17 Note that the appearance of 'scatter' (forward array permutation) is unsurprising, because 'build' is the quintessential gathering (backward array permutation) operation, and reverse differentiation dualises data flow. As for the complexity for 'build': all array operations in D Γ [build n (i. t)] operate on arrays of the same length and can run in linear time in the sequential setting - in addition to the expected invocations of the backpropagators. The derivative functions resulting from the execution of D Γ [n] and D Γ, i:Z [t] are executed at most once, 18 and cotangent values are treated affinely as required. Hence, the complexity proof should extend analogously to the cases shown in Section 6.1.
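A sequential Haskell sketch of ours of such a monoid-based scatter (again with lists standing in for arrays; using a mutable array keeps it linear in the size of its inputs):

{-# LANGUAGE ScopedTypeVariables #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STArray, newListArray, readArray, writeArray, getElems)

-- scatter "adds" each (index, value) contribution of the second argument onto
-- the indicated position in the first argument, using the monoid's (<>).
scatter :: forall v. Monoid v => [v] -> [(Int, v)] -> [v]
scatter base updates = runST go
  where
    go :: forall s. ST s [v]
    go = do
      arr <- newListArray (0, length base - 1) base :: ST s (STArray s Int v)
      let add (i, v) = do
            old <- readArray arr i
            writeArray arr i (old <> v)
      mapM_ add updates
      getElems arr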
The derivative of 'fold' in Fig. 9 is a bit more involved. 19 The approach taken here is to record (in a 'Tree' -see Fig. 10) the reduction tree taken in the primal pass by the 'fold' combinator, and to unfold over that same reduction tree, but now from the root instead of from the leaves, in the reverse pass (with 'unTree') 20 . In practice, one would implement 'unTree' as a primitive operation together with the 'Tree' data type, and hide this complexity from users. (See also Section 8.) Finally we also use a new array primitive: fromList : List → Array , clearly also linear-time in the length of the list. The complexity for fold is sound for the same reasons as for 'build' above: its direct work is within bounds, and it calls backpropagators of its subterms at most (here precisely) once.
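To illustrate the idea (this is our simplified sketch, not the paper's Fig. 10): the primal fold records its reduction tree, and the reverse pass pushes the root cotangent back down that same tree.

-- 'Node' caches the combined value of its two subtrees, which the reverse
-- pass needs to call the combination function's backpropagator.
data Tree a = Leaf a | Node a (Tree a) (Tree a)

treeVal :: Tree a -> a
treeVal (Leaf x)     = x
treeVal (Node v _ _) = v

-- Primal: fold a non-empty list with a binary operation, remembering the tree.
foldT :: (a -> a -> a) -> [a] -> Tree a
foldT _ [x] = Leaf x
foldT f xs  = let (ls, rs) = splitAt (length xs `div` 2) xs
                  l = foldT f ls
                  r = foldT f rs
              in Node (f (treeVal l) (treeVal r)) l r

-- Reverse: given a backpropagator for the combination function (mapping the
-- two primal arguments and the result cotangent to the argument cotangents),
-- distribute the root cotangent over the leaves, left to right.
unTree :: (a -> a -> d -> (d, d)) -> Tree a -> d -> [d]
unTree _  (Leaf _)     d = [d]
unTree bp (Node _ l r) d = let (dl, dr) = bp (treeVal l) (treeVal r) d
                           in unTree bp l dl ++ unTree bp r dr

The paper's version instead returns a Cayley-transformed ("difference") list to make the concatenation constant-time; the plain ++ above is only for readability.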

PARALLELISM AND PRACTICAL EFFICIENCY
Automatic differentiation is typically applied to programs with inherent parallelism, so it would be a shame if the derivative program was forced to be sequential. So far, the prime inhibitor to parallelisation of the derivative program seems to be the monad EVM. Of course, we cannot run the left-hand and right-hand side of a bind operation (>>=) in parallel, but in our code transformation (see Figs. 7 and 9) such binding is only used when there was an actual dependency in the source program already. For independent source expressions, the corresponding monad actions are independently sequenced using (>>) and 'sequence'. 21 Can we run those actions in parallel?
Because our monad is a (local) accumulation monad, all updates to the individual cells add a new contribution to the value already in that cell. Addition is commutative and associative, hence it does not matter in which order we add these contributions: the inevitable reordering resulting from concurrent updates is fine. We just need to ensure that the individual contributions do not get corrupted by concurrent access to the same mutable cells; for this, locks or atomic updates suffice. 22 Reimplementing the API of EVM in this way, it becomes safe to execute the monadic actions sequenced with (>>) and 'sequence' in parallel, resulting in parallelism corresponding to independent expressions in the source program. In an actual implementation it would be prudent to have both sequential and parallel versions of (>>) and 'sequence', because for some uses, parallelisation might cost more in overhead than it gains in useful parallelism.

17 This sequencing is the thing that seems to prevent a parallel implementation here, but see Section 8.
18 snd D Γ [n] is not executed at all, because being a linear function with 1 as domain, it cannot compute anything useful. Said differently, being of discrete type (Z), it cannot make non-zero derivative contributions to anything.
19 An alternative is given in [Paszke et al. 2021], which is expressed just in terms of standard second-order array combinators but has the downside of being not quite complexity-efficient - it has a complexity blowup in the case of nesting fold in the combination function of fold.
20 The 'List τ → List τ' return type is a Cayley-transformed list, or "difference list", to make concatenation constant-time [Hughes 1986].
21 Note that 'do x; y' is syntactic sugar for x >> y.
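A small Haskell sketch of ours of why commutative accumulation makes parallel sequencing safe (it assumes the async package; cells are IORefs updated atomically):

import Control.Concurrent.Async (concurrently_)
import Data.IORef (IORef, newIORef, atomicModifyIORef', readIORef)

-- One accumulator cell per environment entry; contributions are only ever
-- added, and addition is commutative and associative, so the order in which
-- concurrent contributions land does not matter.
type Cell = IORef Double

addTo :: Cell -> Double -> IO ()
addTo cell d = atomicModifyIORef' cell (\v -> (v + d, ()))

-- Parallel analogue of (>>): run two independent accumulating actions at the
-- same time; atomicity of addTo keeps individual contributions uncorrupted.
parSeq :: IO () -> IO () -> IO ()
parSeq = concurrently_

example :: IO Double
example = do
  cell <- newIORef 0
  parSeq (addTo cell 1.5) (addTo cell 2.5)
  readIORef cell          -- always 4.0, regardless of scheduling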
This covers all operations expressed with the mentioned parallelisable combinators; what is left is the unTree function from Fig. 10, which we need to execute in such a way that the parallelism of the original 'fold' reduction is reflected in its derivative. Fortunately, it suffices to execute the two recursive calls to unTree in parallel: this is possible because they are independent (which one could formalise by putting them in an array to 'sequence'). The resulting task-parallelism mirrors the reduction tree structure, and thus the parallelism structure, of the original 'fold'.
Complexity. Unlike the sequential case, it is unclear what a proper complexity criterion should be for the parallel case: requiring a constant factor slowdown over the source program is impossible, since parallel replication (build n (i. x)) seemingly has (in a naive cost model) constant runtime given enough parallel execution units, whereas its derivative, which must perform a parallel reduction (summing all n entries in the incoming cotangent value), surely spends time at least logarithmic in n. Note that n need not be a visible, or even easily computable, property of the source program if it is a computed value, making it hard to even formulate the optimal complexity criterion for reverse AD on parallel array programs. For this reason, we leave a formal complexity analysis of the parallel case to future work.

Constant Factors and Execution on Wide-Vector Machines
The approach described above will work acceptably on multicore CPU platforms with relatively low core counts, once some care has been taken to avoid excessive parallelism overhead by switching to sequential execution for subexpressions that are already executed in a sufficiently parallel manner. However, to work on GPU/TPU platforms or similar wide-vector machines, as well as to gain more performance on MIMD architectures, it will be necessary to: (1) find alternate implementations of the tree-like structures: Bag (Section 7) and Tree (Fig. 10); (2) analyse and optimise the output derivative program, recognising places where our base transformation was too general and a special-case approach would be more efficient.
'Tree' on Vector Machines. In D Γ [fold], 'Tree' is used to record the reduction tree in the primal pass so that we can replay it in reverse order in the reverse pass. In practice, on wide-vector machines such as GPUs, the reduction tree structure of a fold is statically determined to a certain extent. For example, in the (still competitive) approach described in [Merrill and Garland 2016] (see their Fig. 5), despite the fact that the block aggregates are combined in some nondeterministic order, the tree of intermediate values corresponding to the more classical chained-scan approach (their Fig. 4) is still computed and stored. Keeping these stored intermediate values around until the reverse pass allows assuming the chained-scan reduction order in the reverse pass, meaning that we only need to store the block size (an integer), the block aggregates and the block-local sequential aggregates, where our Fig. 9 stored the full 'Tree'. In effect, we thus specialise 'Tree' to the practical (strongly reduced) space of possible reduction orders in the actual implementation, and choose a more compact (and in this case non-recursive!) representation for 'Tree' that describes just this smaller space. Doing so allows us to convert the task-parallelism in unTree to data-parallelism (for suitable combination functions), mirroring the data-parallel reduction in the primal pass.
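One possible compact representation is sketched below in Haskell; the field names and element type are our own illustration, not the implementation's actual definitions.

-- A non-recursive specialisation of 'Tree' for a chained-scan style
-- reduction: just enough information to replay the reduction in reverse.
data ScanTree a = ScanTree
  { blockSize       :: Int   -- number of elements per block
  , blockAggregates :: [a]   -- one aggregate per block
  , localAggregates :: [a]   -- block-local sequential (running) aggregates
  }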
'Bag' on Vector Machines. First note that we may only add constructors to Bag, not remove them, as we certainly need the current ones (zero, plus, and singleton) for general source programs that use array indexing. But sometimes we can, and need to, do better. Consider, for example, differentiating the program of Eq. (15), which first computes some array and then multiplies it by 2 pointwise.[23] The derivative program will create many BOne values (in the derivative of the indexing operation !), add those together into a large tree (in 'sequence' in the D Γ [build] for the result, thus in a parallel context of nondeterministic associativity), serialise that tree to an array (with 'collect' in the D Γ [build] of the intermediate array), and finally perform a 'scatter' to construct that array's gradient. Clearly, a more efficient derivative would simply multiply the result cotangent by 2 pointwise; and while what we generate is indeed "only" a constant factor off in a sequential setting, this constant factor is in fact very large, and furthermore it parallelises poorly.

[23] This assumes the addition of length : Array τ → Z to the source language; because its return type is discrete, its derivative is trivial.
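The program of Eq. (15) itself is not reproduced in this excerpt; the following Haskell lines are only a guess at its general shape (an intermediate array, then a pointwise scaling by 2), together with the efficient derivative described above. All names and the computation of the intermediate array are illustrative.

-- Guessed shape of the source program: compute an array, then scale it by 2.
scaleProgram :: [Double] -> [Double]
scaleProgram input =
  let x = map (\v -> v * v) input                    -- "first computes some array"
  in  [ (x !! i) * 2 | i <- [0 .. length x - 1] ]    -- build (length x) (\i -> x ! i * 2)

-- The efficient reverse derivative with respect to the intermediate array:
-- scale the incoming result cotangent pointwise, instead of building and
-- flattening a large Bag of BOne values.
scaleProgramT :: [Double] -> [Double]
scaleProgramT dResult = map (* 2) dResult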
This program exhibits a pattern known as a gather operation,[24] i.e. the program shape 'build n (λi. x ! e)' for some index term e that uses i. In this case, aside from the operations that we already made constant-time by construction by making them constructors of Bag directly (zero, plus, and singleton), we also want the insertion of a full array of cotangents to be constant-time. We can thus improve the situation with a principled change to our data structure: adding a constructor BArray (Array τ) to the Bag data type. To then productively use this constructor when differentiating the sample program in Eq. (15), the implementation should either recognise the gather shape of the source program before differentiation and rewrite it to use some gather-style primitive, or recognise the (inefficient) pattern resulting from naively differentiating a gather-like build and optimise it to the special-purpose form using BArray.

[24] The derivative of 'gather' is 'scatter', already used above. A single scatter, even if no more efficient form is found based on the particular index mapping used in the program, will be much faster than creating large numbers of BOne values and having to combine those in a log-depth tree with many uses of BPlus.
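For concreteness, here is a Haskell sketch of the extended Bag type. BOne and BArray are the names used in the text; BZero and BPlus are our assumed names for the zero and plus constructors, and the index and array types are simplified.

data Bag a
  = BZero                  -- zero: no contributions (constant time)
  | BPlus (Bag a) (Bag a)  -- plus: combine two bags (constant time)
  | BOne Int a             -- singleton: one contribution at a given index
  | BArray [a]             -- new: a full array of cotangents, one per index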
As with all compiler optimisations, such tricks cannot cover all possible programs, but the common cases can be dealt with in this manner. We leave a more thorough investigation of the problems and solutions to future work, in particular how to efficiently handle on GPU hardware those Bag operations that do not fall into the nice, vectorisable case described above.

9 HIGHER-ORDER CHAD
Naive CHAD of Higher-Order Functions. We give here the (semantically verified) recipe of [Vákár 2021; Vákár and Smeding 2022] for differentiating function types using CHAD, where we omit any use of abstract or linear types and work directly with their default implementation. We write [] and ++ for the empty list and concatenation, and 'fold x with (a, y → e) from e₀' for the usual (right) fold elimination of the list x, starting the accumulator from an initial value e₀ and
folding in values using the operation e.

The key idea is that in a function application, the incoming cotangent must be propagated backwards through the λ-abstraction being called, both to the function argument and to the captured context variables of the closure. CHAD separates these two parts of the derivative and handles the former with D[σ → τ]₁ and the latter with D[σ → τ]₂.

Identifying the Complexity Issues. It is precisely this separation of the derivative that leads to real complexity problems. In particular, because the derivative of a function value is split in two parts, it is impossible (from D Γ [λx : σ. t]) to get the full gradient of a function, i.e. with respect to both its argument and its context, in one pass through D Γ,x:σ [t]. And for function application, which is the only eliminator of functions in the source language, snd D Γ [s t] does indeed need the full gradient of the function s that was called. It is therefore reduced to using both halves of the function's derivative separately, meaning that we end up differentiating through a function twice each time it is called. If that function contains other function applications inside its body, the derivatives of the functions called there are evaluated four times, and so on.
This behaviour can be exploited to violate our complexity criterion Eq. (1). Indeed, consider programs t_n, for each n, built from nested identity applications and thus semantically just the identity; a sketch of their shape is given below. Their transposed derivatives snd D x:R [t_n], computed using the CHAD formulas above, take time on the order of 2^n to execute: because the t_n contain nested applications of an abstraction, the backpropagator of each application will execute the backpropagator of its body twice.
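The exact terms are not reproduced in this excerpt; the following Haskell rendering is only a hedged guess at their shape, enough to see where the blowup comes from.

-- Each t n is semantically the identity on its argument, but under CHAD the
-- backpropagator of every application re-enters the backpropagator of its
-- body twice, so the transposed derivative of t n does work on the order of 2^n.
t :: Int -> Double -> Double
t 0 x = x
t n x = (\y -> y) (t (n - 1) x)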
Solving Complexity Issues Through Defunctionalisation. Defunctionalisation [Reynolds 1998] translates away function types into a language that we can already differentiate efficiently: it implements a function type σ → τ as a sum type with one summand σ_{ℓ,1} × · · · × σ_{ℓ,k_ℓ} for each syntactic lambda-abstraction ℓ of type σ → τ in the program, where σ_{ℓ,1}, . . . , σ_{ℓ,k_ℓ} are the types of ℓ's captured context variables. (This list is a subset of the types in ℓ's environment.) As such, we can simply defunctionalise (a well-known strategy for compiling code with function types) before applying CHAD and call it a day. Why exactly does this solve the problem of inefficient function types, though? The key observation is to decompose defunctionalisation into the composition of (see the sketch below): • the local program transformation of (typed) closure conversion [Minamide et al. 1996]: we convert every function into a closure, which is a pair of (a subset of) its environment and a function that does not capture any context variables (a "closed" function); • the global program transformation of "deexistentialisation", which replaces an existential type with the finite sum type of all of its instantiations found in the whole program. A global program analysis is needed here to be able to use a finite rather than an infinite sum type: we need to analyse precisely which instances of the existential are actually used.
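A minimal Haskell sketch of the two steps, with made-up lambda bodies and captured types; it is meant only to illustrate the decomposition, not the paper's transformation.

{-# LANGUAGE ExistentialQuantification #-}

-- Step 1, closure conversion: a function becomes a pair of an environment and
-- a closed function that receives that environment explicitly.
data Closure a b = forall env. Closure env ((env, a) -> b)

applyC :: Closure a b -> a -> b
applyC (Closure env f) a = f (env, a)

-- Step 2, deexistentialisation: a whole-program analysis tells us which
-- lambdas can flow to a call site, so the existential is replaced by a finite
-- sum with one constructor per syntactic lambda (two hypothetical ones here).
data DefunFn
  = Lam1 Double  -- captured context of a first, hypothetical lambda
  | Lam2 Int     -- captured context of a second, hypothetical lambda

applyD :: DefunFn -> Double -> Double
applyD (Lam1 c) x = c * x                -- illustrative body of the first lambda
applyD (Lam2 k) x = x + fromIntegral k   -- illustrative body of the second lambda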
As we show in Appendix C, it is the first part of this transformation that solves the efficiency problems of CHAD on function types: by replacing all functions with closed functions, we avoid the need to propagate any cotangents back to captured context variables, removing the need for one half of a function's CHAD derivative. With only one half left, the duplication is gone, eliminating the complexity problem. In particular, it now suffices to take D[σ → τ]₂ = 1, which simplifies the term-level derivatives accordingly. We can retain locality in the resulting code transformation.
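In simplified Haskell types (our own rendering, not the paper's definitions), the split derivative of a function type and its collapse after closure conversion look roughly as follows.

-- First half: the primal function returns its result together with a
-- backpropagator from result cotangents to argument cotangents.
type DFun1 a da b db = a -> (b, db -> da)

-- Second half: cotangent contributions to the captured context. A closed
-- function captures nothing, so this half degenerates to the unit type,
-- which is what taking D[σ → τ]₂ = 1 expresses.
type DFun2 = ()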
This idea of using closure conversion to speed up AD of higher-order functions was first used by [Pearlmutter and Siskind 2008] (and later distilled to its essence by [Alvarez-Picallo et al. 2021]). More recently, the short paper [Vytiniotis et al. 2019] suggested its use in the context of CHAD-like AD transformations. This section can be seen as an elaboration of the idea suggested in the latter paper.
10 RELATED WORK
This paper shows how the basic AD algorithm CHAD can be made efficient. The basic CHAD technique for a first-order language with tuples, in combinator form, was originally introduced by [Elliott 2018]. [Nunes and Vákár 2023; Vákár 2021; Vákár and Smeding 2022] show how it applies to a λ-calculus with various type formers, giving a correctness proof but no complexity proof. [Kerjean and Pédrot 2022] points out that the resulting code transformation closely resembles the Diller-Nahm variant of the Dialectica interpretation.
Similar optimisation techniques to the ones we use to make CHAD efficient (notably, sparse vectors and functional mutability) were previously used by [Krawiec et al. 2022;Smeding and Vákár 2023] to make dual numbers AD (a fundamentally different AD technique) efficient. The required ID-generation in that algorithm makes it less obvious how it might parallelise.
Previous work on computer-formalised proofs about AD is, to the best of our knowledge, limited to [Chin Jen Sem 2020], which formalises the correctness proof for the forward-mode AD transformation of [Huot et al. 2020, 2021] in Coq, and [de Vilhena and Pottier 2021], which gives a Coq proof of the correctness of an effect-handler-based variant of the reverse-mode AD techniques of [Wang et al. 2018; Wang and Rompf 2018; Wang et al. 2019] (which rely on non-functional control flow). Both papers focus on the correctness of AD rather than its complexity.
In currently used industrial systems (such as TensorFlow [Abadi et al. 2016], PyTorch [Paszke et al. 2017] or JAX [Bradbury et al. 2018]), AD is typically performed on (data-parallel) functional array-processing languages via taping rather than directly as a source transformation, sometimes on a second-order array language (JAX). AD of functional array languages as a source transformation has been considered recently by [Schenck et al. 2022]. They allow some recomputation, and a resulting suboptimal complexity, to achieve a simpler and more practically efficient algorithm, in the hope that such recomputation is rare in practice. By contrast, here we study how to avoid all recomputation. Relatedly, [Paszke et al. 2021] discusses in detail how scans over arrays can be efficiently differentiated, similarly to how we differentiate folds.
The idea of using closure conversion to make AD of higher-order functions efficient first appears buried in the details of VLAD/Stalin∇ [Pearlmutter and Siskind 2008; Siskind and Pearlmutter 2016]. The idea is again present in [Vytiniotis et al. 2019] (in the context of CHAD) and in [Alvarez-Picallo et al. 2021] (for an AD algorithm using string-diagram rewrites), without its importance being precisely demonstrated. We have made an effort to spell out and motivate the idea in the present paper, making clear how it arises as a natural solution to the complexity problems of higher-order CHAD.

APPENDIX
A WHY UNION IS NOT EFFICIENT
In Section 4.2 we changed the right-hand side of the transformation to pass around a growing environment cotangent in state-passing style, instead of simply returning the local environment contributions upwards from each branch of the program. Not only does this provide the right program structure to later swap out the (log-time) functional persistent tree map for a mutable array in Section 5, it is also necessary from a complexity perspective: simply keeping union in the transformation is inefficient, even if one has a magical tree map whose union takes linear instead of linearithmic time. The actual map union[27] has runtime O(m log(n/m + 1)), where m is the size of the smaller argument to union and n the size of the larger. For small m > 0 this simplifies to O(log n), and for m in Θ(n) it simplifies to O(n). Suppose that we had access to a magical union running in time O(m), where m is again the size of the smaller argument to union. This is still too expensive, as witnessed by the program x_1 : R, . . . , x_n : R ⊢ magic_n : R, defined as a complete binary tree of ★ operations adding up x_1, . . . , x_{2^d}, where d = ⌊log_2(n)⌋ so that 2^d ≤ n < 2^{d+1}. All variables in the leaves are distinct. For 0 ≤ i ≤ d − 1, layer i of this tree has 2^i occurrences of ★, and hence in D Γ [magic_n] we will get 2^i applications of union on maps of size 2^{d−i} in layer i. Under the magical-union assumption, the total computational cost of all these unions is Σ_{i=0}^{d−1} 2^i · O(2^{d−i}) = O(d · 2^d) = O(n log n), which is asymptotically larger than O(n), the runtime cost of magic_n itself. (Note that the total number of ★ operations in magic_n is Σ_{i=0}^{d−1} 2^i = 2^d − 1 = O(n).) Hence even with this magical union, CHAD does not yet attain the correct complexity for reverse AD, and the state-passing modification of Section 4.2 is necessary.

[27] As defined in the Haskell containers library, and proved optimal in [Brown and Tarjan 1979].
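The source-language definition of magic_n is not reproduced in this excerpt; the following Haskell function is only a hedged reconstruction of its shape, with (+) standing in for the binary primitive ★ and the input assumed to be a non-empty list whose length is a power of two.

-- A complete binary tree of ★ nodes, built level by level over the inputs.
magic :: [Double] -> Double
magic [x] = x
magic xs  = magic (pairwise xs)
  where
    pairwise (a : b : rest) = (a + b) : pairwise rest  -- one ★ node per pair
    pairwise rest           = rest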

B AGDA FORMALISATION SPECIFICATION
The specification of the Agda proof, which provides only those definitions necessary to state the main complexity theorems (but not prove them), follows on subsequent pages. This specification consists of three modules; each module starts on a new page.
The full formalisation can be found at https://github.com/tomsmeding/efficient-chad-agda.

-- The zero part of the monoid structure of the linear types. Aside from
-- returning the value, this also returns an integer recording the number of
-- evaluation steps taken during the operation. This integer is used for
-- complexity analysis.
--
-- For sum types, we return zero on adding incompatible values instead of
-- throwing an error. This prevents D2τ (σ :+ τ) from being a monoid, but of
-- course, if a proper implementation that would throw errors does not in fact
-- error on a given input program, the implementation here would not introduce
-- values that violate the monoid laws either.
--
-- In particular, because the dependently-typed variant of CHAD is correct (see
-- Nunes, Vákár. "CHAD for expressive total languages." MSCS 2023 (to appear)
-- (also arXiv:2110.00446)), there is an external proof that those error cases
-- would be impossible, and thus that the cases that violate the monoid laws
-- here are also impossible.
--
-- The reason we have this explicit representation of reindexing mappings, as
-- opposed to a general sink function with the following type:
--   sink : {Γ Γ' : Env tag} {τ : Typ tag}
--       -> ({σ : Typ tag} -> Idx Γ σ -> Idx Γ' σ)
--       -> Term tag Γ τ -> Term tag Γ' τ
-- is that with the above representation we'd need (a very weak form of?)
-- functional extensionality to use certain lemmas in the complexity proof. The
-- reason for that is that we'd like to use multiple lemmas about the same
-- things together, and all of those lemmas return facts about terms that
-- normalise to the same thing but contain uses of 'sink' applied to unknown
-- indices inside them. If 'sink' took a function argument, then proving
-- equality here would involve proving equality of functions given equal
-- syntactic representation, which Agda does not do, despite being much weaker
-- than full functional extensionality.
After applying closure conversion, we can apply CHAD to types without free type variables (monomorphic types) as well as to their programs. We can implement the required operations completely analogously to the case of binary sum types, by representing D[Σ α : Type. τ]₂ as 1 ⊔ (Σ α : Type. D[τ]₂). To implement this API, it is crucial that we work with Σ-types that hold type tags with decidable equality, rather than untagged existentials. Observe that, similarly to our treatment of sum types (with lcastl and lcastr), the differentiation of existential types requires some runtime casting[30] in the absence of a type system with full dependent types. We can prove, however, that all required casts are type-safe in a stronger, dependently-typed type system. If we are willing to do a global program analysis, we can identify at compile time the finite subset of components of the sum types that are actually used in practice. This allows us to simply replace the infinite sum type with a finite sum type (a transformation that we have referred to as "deexistentialisation" in Section 9). We have then effectively arrived at the combination of closure conversion and defunctionalisation that we discussed in Section 9.

[29] Indeed, for any program t between first-order types (types on which the type translation acts as the identity), the CHAD-transformed program is βη-equal to t, so in particular it is observationally equivalent.
[30] This casting throws an error if it fails; such an error will never be hit by the code generated by CHAD after closure conversion.
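As an illustration of the tagged-existential idea (our own sketch, not the paper's implementation), here is how cotangents for an existential package might be combined using a runtime type-tag check; following the Agda comment above, mismatching tags yield zero here, though an implementation could instead raise an error as in footnote [30].

{-# LANGUAGE ExistentialQuantification #-}
import Data.Typeable (Typeable, cast)

-- Either the zero cotangent, or a cotangent packaged at some tagged type.
data ExCotangent = Zero | forall a. (Typeable a, Num a) => Some a

combine :: ExCotangent -> ExCotangent -> ExCotangent
combine Zero     y        = y
combine x        Zero     = x
combine (Some x) (Some y) =
  case cast y of
    Just y' -> Some (x + y')  -- tags match: accumulate the contributions
    Nothing -> Zero           -- incompatible tags: return zero rather than erroring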