Unboxed Data Constructors: Or, How cpp Decides a Halting Problem

We propose a new language feature for ML-family languages, the ability to selectively unbox certain data constructors, so that their runtime representation gets compiled away to just the identity on their argument. Unboxing must be statically rejected when it could introduce confusion, that is, distinct values with the same representation. We discuss the use-case of big numbers, where unboxing lets us write code that is both efficient and safe, replacing either a safe but slow version or a fast but unsafe version. We explain the static analysis necessary to reject incorrect unboxing requests. We present our prototype implementation of this feature for the OCaml programming language, discuss several design choices and the interaction with advanced features such as Guarded Algebraic Datatypes. Our static analysis requires expanding type definitions in type expressions, which is not necessarily normalizing in the presence of recursive type definitions. In other words, we must decide normalization of terms in the first-order λ-calculus with recursion. We provide an algorithm to detect non-termination on-the-fly during reduction, with proofs of correctness and completeness. Our algorithm turns out to be closely related to the normalization strategy for macro expansion in the cpp preprocessor.


INTRODUCTION

Sum Types and Constructor Unboxing
A central construct of ML-family programming languages is algebraic datatypes, in particular (disjoint) sum types, also called variant types. Values of a sum type are built with a (data) constructor, here Pos or Zero or Neg, followed by its arguments (zero, one or several), which are the data carried along with the constructor. In mathematical terms this corresponds to a sum or coproduct + between sets of values, rather than a union ∪, because one can always tell, by pattern-matching, from which side of the sum a value comes.
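The OCaml definition of this running example is elided in this excerpt; it can be reconstructed from the constructor names in the text. The type nat is left unspecified by the text, so we use a stand-in:

```ocaml
type nat = int  (* stand-in; the text leaves nat unspecified *)

(* Reconstruction of the elided running example: a sum type with
   three constructors, two of which carry an argument of type nat. *)
type num =
  | Pos of nat  (* strictly positive numbers *)
  | Zero        (* constant constructor, no argument *)
  | Neg of nat  (* strictly negative numbers *)
```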
In mathematics, the coproduct A + B between sets is typically implemented as a (disjoint) union, using cartesian products of the form {i} × S to build pairs of a tag value i and an element of S. Software implementations of programming languages use the same approach: the representation of sum types in memory typically includes not only the data for their arguments, but also a tag representing the constructor; pattern-matching on a value is implemented by a test on this tag, often followed by accessing the arguments of the constructor. For example, the standard OCaml compiler will represent the value Neg(n) by a pointer to a block of two consecutive machine words in heap memory, with the first word (the block header) containing the tag (among other things), followed by the argument of type nat in the second word. We say that the parameter of type nat is boxed in this representation: it is contained inside another value, a "box".
Boxing induced by data constructors introduces a performance overhead. In the vast majority of cases it is negligible, as ML-family language implementations heavily optimize the allocation of heap blocks and often benefit from good spatial locality. There remain some performance-critical situations where the overhead of boxing (compared to carrying just a nat) is significant.
One common, easy case where boxing can be avoided is sum types with a single constructor: type t = Foo of arg. Values of this type can be represented exactly as values of type arg, as there are no other cases to distinguish from Foo. Haskell provides an explicit newtype binder for this case. In OCaml, this is expressed by using the @@unboxed attribute on the datatype declaration.
newtype Fd = Fd Int              -- Haskell
type fd = Fd of int [@@unboxed]  (* OCaml *)

Note that the programmer could have manipulated values of type int directly, instead of defining a single-constructor type fd (file descriptor) isomorphic to int; but often the intent is precisely to define an isomorphic but incompatible type, to have explicit conversions back and forth in the program, and avoid mistaking one for the other. Single-constructor unboxing makes efficient programming more expressive or safe, and expressive/safe programming more efficient.

Constructor Unboxing
In the present work, we introduce a generalization of single-constructor unboxing that we simply call constructor unboxing. It enables unboxing one or several constructors of a sum type, as long as disjointness is preserved.
Our main example for this selective unboxing of constructors is a datatype zarith of (relative) integers of arbitrary size, represented either by an OCaml machine-word integer (int) when they are small enough, or by a "big number" of type Gmp.t implemented by the GMP library [Granlund and contributors 1991]. The [@unboxed] attribute requests the unboxing of the data constructor Small, that is, that its application be represented by just the identity function at runtime. This request can only be satisfied if it does not introduce confusion in the datatype representation, that is, two distinct values (at the source level) that get the same representation. In our zarith example, the definition can be accepted with an unboxed constructor, because the OCaml representation of int (immediate values) is always disjoint from the boxed constructor Big of Gmp.t (heap blocks). Otherwise we would have rejected this definition statically:

type clash = Int of int [@unboxed] | Also_int of int [@unboxed]
Error: This declaration is invalid, some [@unboxed] annotations
       introduce overlapping representations.
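The zarith definition itself is elided in this excerpt; based on the constructor names Small and Big used throughout the text, it presumably reads as follows (prototype syntax, not accepted by mainline OCaml):

```ocaml
(* Reconstruction of the elided zarith datatype (an assumption
   based on the surrounding text, not the paper's verbatim code). *)
type zarith =
  | Small of int [@unboxed]  (* fits in a machine word: represented as an immediate *)
  | Big of Gmp.t             (* arbitrary-size GMP number: a heap block *)
```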
While constructor boxing is cheap in general, unboxing the Small constructor of the zarith type does make a noticeable performance difference: arithmetic operations in the fast path are very fast, even with an explicit overflow check, so the relative overhead of boxing would be significant. On a synthetic benchmark, unboxing the Small constructor provides a 20% speedup.

Head Shapes
We propose a simple criterion to statically reject unboxing requests that would introduce confusion (several distinct values with the same runtime representation), parameterized on two notions:
• The head of a value, an approximation/abstraction of its runtime representation.
• The head shape of a type, the (multi)set of heads of values of this type.
Our static analysis computes the head shape of datatype definitions, and rejects definitions where the same head occurs several times.
In addition, we require that the head of a value be efficiently computable at runtime. In the presence of an unboxed constructor, the head shape of its argument type is used to compile pattern-matching: the generated code may branch at runtime on the head of its scrutinee.
We provide a definition of heads that is specific to the standard OCaml runtime representation.Other languages would need to use a different definition, but the static checking algorithm and the pattern-matching compilation strategy are then fully language-agnostic.

A Halting Problem
To compute the head shape of types, and therefore to statically check unboxed constructors for absence of confusion, we need to unfold datatype abbreviations or definitions. Example:

type num = int
and name = Name of string [@unboxed]
type id =
  | By_number of num [@unboxed]
  | By_name of name [@unboxed]

To check the final definition of id, we must determine that num is the primitive type int and that name has the same representation as a primitive string, through a definition-unfolding process that corresponds to a form of β-reduction.
In the general case, type definitions may contain arbitrary recursion. In particular, unfolding definitions may not terminate for some definitions; we need to detect this to prevent our static analysis from looping indefinitely:

type 'a id = Id of 'a [@unboxed]
type loop = Loop of loop id [@unboxed]

This practical problem is in fact exactly the halting problem (deciding whether terms have a normal form) for the pure, first-order λ-calculus with recursion. We present an algorithm to detect non-termination on the fly, during normalization. Running the algorithm on a term either terminates with a normal form, or it stops in finite time after detecting a loop. We prove that our algorithm is correct (it rejects all non-terminating programs) and complete (it accepts all terminating programs). The proof of correctness requires a sophisticated termination argument.
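To fix intuitions, here is a crude sketch of on-the-fly loop detection during unfolding. This is not the paper's algorithm (which is more precise, and is proved complete); it is the simpler "painting" idea, in the spirit of cpp's treatment of self-referential macros: refuse to re-expand a definition already being expanded on the current path. All names below are ours:

```ocaml
(* First-order type expressions: variables and applications of
   (possibly recursive) named definitions. *)
type ty = Var of string | App of string * ty list

exception Loop of string

(* defs maps a definition name to (parameters, body).
   painted is the list of definitions being expanded on the
   current path; revisiting one signals a loop. *)
let rec expand (defs : (string * (string list * ty)) list) painted t =
  match t with
  | Var _ -> t
  | App (name, args) ->
    let args = List.map (expand defs painted) args in
    (match List.assoc_opt name defs with
     | None -> App (name, args)  (* primitive constructor: keep as is *)
     | Some (params, body) ->
       if List.mem name painted then raise (Loop name)
       else
         (* substitute the arguments for the parameters, then
            keep expanding with name painted *)
         let subst = List.combine params args in
         let rec inst = function
           | Var v -> Option.value ~default:(Var v) (List.assoc_opt v subst)
           | App (n, ts) -> App (n, List.map inst ts)
         in
         expand defs (name :: painted) (inst body))
```

This simple path-based check rejects some terminating inputs; the point of the paper's algorithm is precisely to be complete as well as correct.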
It turns out that this termination-monitoring algorithm is related to the approach that the cpp preprocessor uses to avoid non-termination of recursive macros, another example where non-termination must be detected on the fly.
Our related-work discussion (Section 8) mixes production programming languages, experimental language designs, and theoretical work on termination of recursive λ-calculi. We list unboxing-related features in other languages and implementations (GHC Haskell, MLton, F♯, Scala, Rust). We mention research projects that provide an explicit description language for data layout (Hobbit, Dargent, Ribbit). Finally, we discuss Rust niche-filling optimizations in more detail.

CASE STUDY: BIG INTEGERS

A Primer on OCaml Value Representations
In the reference implementation of OCaml, sum types have the following in-memory representation:
• constant constructors (without parameters), such as [] or None, are represented as immediate values, exactly like OCaml integers. The immediate values used are 0 for the first constant constructor in the type, 1 for the second one, etc.
• non-constant constructors, with one or several parameters, are represented by pointers to memory blocks that start with a header word followed by the representation of each parameter. The header word contains some information of interest for the GC and runtime system, including a one-byte tag (0 for the first non-constant constructor, 1 for the second one, etc.) and the arity of the constructor, its number of parameters.
Some primitive types that are not sum types (strings, floats, lazy thunks, etc.) have special support in the OCaml runtime. They are also represented as memory blocks with a header word, using a dozen reserved high tag values that cannot be used by sum constructors. These reserved tags include Custom_tag (255) for blocks whose parameters are foreign data words accessed only through the C Foreign Function Interface (FFI).
OCaml distinguishes immediate values from pointers to blocks (both word-sized) by reserving the least significant bit for this purpose: immediate values are encoded as odd words, while pointers are even words. (In particular, OCaml integers are 63-bit on 64-bit machines.) Pattern-matching can check this immediate-or-pointer bit first, to distinguish constant from non-constant constructors, and then switch on a small immediate value or block tag.
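These conventions can be observed directly with the unsafe Obj module of the standard library; a small sketch (for illustration only, never for normal code):

```ocaml
(* Observing the representation conventions described above. *)
type t = A | B of int | C | D of string

let () =
  assert (Obj.is_int (Obj.repr A));        (* constants are immediates *)
  assert ((Obj.magic A : int) = 0);        (* 1st constant constructor: 0 *)
  assert ((Obj.magic C : int) = 1);        (* 2nd constant constructor: 1 *)
  assert (Obj.tag (Obj.repr (B 42)) = 0);  (* 1st non-constant: block tag 0 *)
  assert (Obj.tag (Obj.repr (D "x")) = 1)  (* 2nd non-constant: block tag 1 *)
```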

Unsafe Zarith
The Zarith library [Miné and Leroy 2012] provides an efficient OCaml implementation of arbitrary-precision integers, on top of the reference C library GMP [Granlund and contributors 1991].
Some users of arbitrary-precision integers perform a majority of their computations on very large integers, way larger than the "small" integers that fit in a machine word. On the other hand, many users perform computations that rarely, if ever, overflow, but need the guarantee that the result will remain correct even in the presence of occasional overflows. For this latter use-case, we want to minimize the overhead of Zarith compared to using machine-sized integers directly (in OCaml, the int type). We want to ensure that, when operating on "small" integers, an operation only performs machine-size arithmetic and bound checks, without any memory allocation nor any call to non-trivial C functions; in other words, we want a "fast path" for small integers.
Zarith uses a type Zarith.t whose inhabitants are either machine-sized OCaml integers (type int) or a "custom" value, a pointer to a memory block with tag Custom_tag and the GMP digits as arguments. This type cannot be expressed in OCaml today, so Zarith has to use low-level, unsafe compiler intrinsics to perform unchecked tests and casts, giving up on the type- and memory-safety usually guaranteed by the OCaml programming language. Our version using an unboxed constructor is equivalent to the previous one and generates exactly the same machine code, but does not require any unsafe casts. One still has to trust the FFI code of c_add to respect the intended memory representation, and we trust the [@shape custom] annotation claiming that the abstract type custom_gmp_t is only inhabited (via the FFI) by Custom_tag-tagged blocks. But the unsafe boundary has been pushed completely off the fast path; it can disappear completely in other examples not involving bindings to C libraries.
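The safe version discussed above is elided in this excerpt. A sketch of what it presumably looks like, in the prototype syntax (the external symbol name, the exact overflow check, and the attribute placement are our assumptions, not Zarith's actual code):

```ocaml
(* Abstract type populated only through the C FFI; the shape
   annotation (trusted, not checked) claims its inhabitants are
   Custom_tag blocks. *)
type custom_gmp_t [@shape custom]

type t =
  | Small of int [@unboxed]  (* machine integers: immediates *)
  | Big of custom_gmp_t      (* GMP numbers: Custom_tag blocks *)

(* Slow path implemented in C; the symbol name is made up here. *)
external c_add : t -> t -> t = "caml_zarith_add"

let add a b =
  match a, b with
  | Small x, Small y ->
    let z = x + y in
    (* Overflow iff x and y have the same sign and z a different one. *)
    if (x >= 0) = (y >= 0) && (z >= 0) <> (x >= 0)
    then c_add a b   (* overflow: fall back to GMP *)
    else Small z     (* fast path: no allocation, no C call *)
  | _ -> c_add a b
```

Note that the fast path allocates nothing: Small z is the identity on z after unboxing.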
On a synthetic microbenchmark, we observed that our new version has essentially the same performance as the previous unsafe code, and is 20% more efficient than a boxed version -using a sum type without [@unboxed] annotations.
Case-study conclusion. Unboxing relieves some of the tension between safety and efficiency in performance-critical libraries. Users sometimes have to choose between safe, idiomatic sum types and more efficient encodings that are unsafe and require more complex code. In some cases, such as Zarith, constructor unboxing provides a safe, clear and efficient implementation.

Other Use-Cases
Let us briefly mention a few other use-cases for constructor unboxing.
A ropes benchmark. Our original design proposal [Yallop 2020] includes a similar performance experiment on ropes (trees formed by concatenating many small strings), reporting a 30% performance gain on an example workload, a similar performance ballpark to our 20%. The example was implemented using unsafe features only, as unboxing was not implemented at the time; we can now express it directly, with a boxed branch constructor declared as | Branch of { llen: int; l: rope; r: rope }.

Coq's native_compute machinery. Another use-case where constructor unboxing could provide safety is the representation of Coq values in the native_compute implementation of compiled reduction, first introduced in Boespflug, Dénès and Grégoire [2011]. native_compute is a Coq tactic that compiles a Coq term into an OCaml term, such that evaluating the OCaml term (by compilation then execution, in the usual call-by-value OCaml strategy) computes a strong normal form of the original Coq term. It uses an unsafe representation of values that mixes (unboxed) functions, sum constructors, and immediate values, and could be defined as a proper OCaml inductive type if constructor unboxing were available. (The relation with constructor unboxing was pointed out to us by Jacques-Henri Jourdan, Guillaume Melquiond and Guillaume Munch-Maccagnoni.)
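The rope type mentioned above can be reconstructed as follows; only the Branch line appears in the text, the Leaf constructor is our assumption (prototype syntax):

```ocaml
(* Reconstructed rope type. Strings are blocks with a reserved tag
   (String_tag), disjoint from the tag-0 Branch record, so the
   unboxing of Leaf is accepted. *)
type rope =
  | Leaf of string [@unboxed]                  (* a leaf is just its string *)
  | Branch of { llen : int; l : rope; r : rope }
```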
A partial sum-type presentation of dynamic values. This feature could make some forms of dynamic introspection of runtime values more ergonomic than what is currently exposed in the Obj module. We could think of defining a sort of "universal type" dyn, with an unsafe but ergonomic injection:

let to_dyn : 'a -> dyn = fun x -> (Obj.magic x : dyn)

This interface cannot cover all needs (for example it is not possible to distinguish Int32 from Int64 by their tags, so they would be lumped together in a Custom case), and we may need to restrict it due to the shape approximations required by portability concerns (see Section 4.2). But it would still provide a pleasant pattern-matching interface for the unsafe value-introspection code that people write today using the Obj module directly.

Non-use-case: magic performance gains in many places. One should not hope that there are plenty of performance-sensitive OCaml codebases lying around today that would get a noticeable performance boost by sprinkling a few (or many) unboxing annotations. In the vast majority of cases, unboxing provides no noticeable performance improvement. There are two reasons:
(1) Allocating values in the OCaml minor heap is really fast. In the boxed version of the Zarith benchmark, checking against integer overflow is slower than allocating the Small constructor. Most programmers overestimate the performance cost of boxing; it makes little difference for most workloads.
(2) In the few cases of performance-sensitive programs where boxing would add noticeable overhead, the authors of the program already chose a different implementation strategy to avoid this boxing, for example using unsafe tricks as Zarith does.
The second point is common to most language design for performance: existing performance-sensitive codebases are written with the existing feature set in mind, and typically do not present low-hanging fruit for a new performance-oriented feature. The benefits rather come from giving more, better options (safer, simpler, more idiomatic) to write performant code in the future.
It is also reassuring for users to know that a newly available idiom is "zero-cost". Even in cases where the non-optimized approach would in fact have perfectly fine performance, there is a real productivity benefit for users in knowing that a given change has zero performance impact. For example, this can avoid the need to write specific benchmarks to reassure the reviewers of a change.

HEADS AND HEAD SHAPES
To formalize constructor unboxing, we use a simple language of types and datatype definitions. It captures sum types in typical ML-inspired, typed functional programming languages.
Notation. We write (o_i)^{i ∈ I} for a family of objects (o_i) indexed over I. (Placing the domain as a superscript is reminiscent of the exponential notation for function spaces.) We often omit the indexing domain I, writing just (o_i)^i. Indexing domains I, J, K, etc. are meta-variables that we understand as denoting finite, totally ordered sets, for example integer intervals [0; n].

Types, Datatype Declarations, Values
Types ∋ τ ::= α | δ (τ_j)^j | t (τ_j)^j
TyDecls ∋ d ::= type δ (α_j)^j = (K_i of (τ_{i,j})^{j ∈ J_i})^{i ∈ I} + (K'_i unboxed of τ'_i)^{i ∈ I'}

A datatype definition introduces a datatype constructor δ, parameterized over the family of type variables (α_j)^j, as a sum type made of a (possibly empty) family of boxed constructors (K_i of (τ_{i,j})^j)^i, where each K_i expects a family of arguments at types (τ_{i,j})^{j ∈ J_i}, and a (possibly empty) family of unboxed constructors (K'_i unboxed of τ'_i)^i, each expecting a single argument of type τ'_i.
A type is either a type variable α, an instance δ (τ_j)^j of a datatype δ (where the (τ_j)^j instantiate the datatype parameters (α_j)^j), or an instance t (τ_j)^j of some primitive type constructor t, such as integers, floats, functions, tuples, strings, arrays, custom values, etc.
Closed types (without type variables) in this grammar are inhabited by values v defined by the following grammar, with a simple typing judgment v : τ expressing that v has type τ.
(We assume language-specific typing rules giving primitive values v their primitive types t (τ_j)^j.)
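As a concrete aid, the formal syntax above can be transcribed into OCaml; this is our own transcription, with all names ours:

```ocaml
(* Transcription of the formal grammar of types and declarations. *)
type ty =
  | Var of string             (* type variable α *)
  | Data of string * ty list  (* datatype instance δ (τ_j)^j *)
  | Prim of string * ty list  (* primitive instance t (τ_j)^j *)

type constr =
  | Boxed of string * ty list (* K of (τ_j)^j : boxed constructor *)
  | Unboxed of string * ty    (* K unboxed of τ : single argument *)

(* type δ (α_j)^j = sum of boxed and unboxed constructors *)
type decl = { name : string; params : string list; constrs : constr list }
```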

Low-Level Representation of Values
Unboxed constructors intrinsically depend on a notion of low-level data representation.
We assume given a set Data of low-level representations, and a function repr : Value × Type → Data that determines the data representation of each value.
We assume two properties of repr:
(1) Injectivity: repr is injective on well-typed values built without unboxed constructors; distinct values of a given type get distinct representations.
(2) Unboxing: for any K unboxed of τ′ in δ and v : τ′, we have repr(K unboxed v, δ) = repr(v, τ′).
Our injectivity assumption merely states that our representation was correct before the introduction of constructor unboxing. Our static analysis rejects some unboxed constructor definitions to extend this property to well-typed values with unboxed constructors.
3.2.1 For OCaml. In the specific case of OCaml, writing Z_m for the set of machine integers, we claim that the representation of values in the reference OCaml implementation can be modeled as follows: as we mentioned in Section 2.1, the low-level representation of an OCaml value is either an immediate value, which we approximate as living in Z_m, or a block starting with a header word containing a tag in Z_m, followed by several words of block arguments d_0 ... d_{n-1}. Block arguments can be either valid OCaml values themselves or arbitrary machine words. Note that this representation loses some information, for example: immediate values live in a smaller space with one bit less available, and the tag of a block determines whether its arguments must be valid OCaml values (most tags) or machine words (Custom_tag, String_tag, Double_tag, Double_array_tag, etc.).
We can now define the repr function going from well-typed source values to their low-level representation.In OCaml, constant constructors (taking no argument) are represented as immediates, while non-constant constructors are represented as blocks, and the representation of a constructor in each category depends on its position (indexed starting at 0) in the type declaration.
As required, this repr function is injective on boxed constructors and erases unboxed constructors.
The representations of some primitive values include, for example, OCaml integers represented as immediates, and strings, floats and custom values represented as blocks carrying their reserved tags.
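Continuing our OCaml transcription (self-contained here, names ours), the repr function can be modeled as a small recursive function; the key clause is the one erasing unboxed constructors:

```ocaml
(* A model of the OCaml low-level representation (sketch). *)
type data =
  | Imm of int                (* immediate machine word *)
  | Block of int * data list  (* tag, then argument words *)

type value =
  | Constant of int                 (* i-th constant constructor *)
  | NonConstant of int * value list (* i-th non-constant constructor *)
  | UnboxedApp of value             (* application of an unboxed constructor *)
  | PrimInt of int                  (* a primitive integer *)

let rec repr : value -> data = function
  | Constant i -> Imm i                           (* immediate i *)
  | NonConstant (i, args) -> Block (i, List.map repr args)
  | UnboxedApp v -> repr v                        (* unboxing: the identity *)
  | PrimInt n -> Imm n
```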

Heads and Head Shapes
We assume given a set Head of value heads. The head of a value v is an easily computable abstraction/approximation of the low-level representation of v: we assume a function head_data : Data → Head computing the head of a value representation, and define head(v, τ) ≜ head_data(repr(v, τ)). Note that if two values have different heads, then they are necessarily distinct.
Our static analysis will run on arbitrary type definitions allowed by our syntax of type declarations, and reject certain type declarations that would introduce conflicts, that is, allow distinct values with the same representation.The cleanest way we found to model this was to define the head shape of a type expression as a multiset of heads, which may contain duplicate elements.
We use two standard union-like operations on multisets, the maximum and the sum: viewing a multiset S as a function from elements to multiplicities, (S₁ ∨ S₂)(h) ≜ max(S₁(h), S₂(h)) and (S₁ + S₂)(h) ≜ S₁(h) + S₂(h). We define the head shape headshape_ClosedTypes(τ) of a closed type expression τ as the multiset of heads of values of this type. We extend this notation to constructor declarations instantiated at a closed return type, which we call "type components" as they come up in type declarations.
Finally, we can extend the notion of head shape to open types (or type components) τ containing type variables, by taking the union of all their closed instances. We write τ ↑ τ′ if τ′ is a closed type (or type component) that instantiates the free type variables of τ.
headshape_TyComps : TyComps → M(Heads)
headshape_TyComps(τ) ≜ max { headshape_ClosedTypes(τ′) | τ ↑ τ′ }

Note that we take the maximum of all closed instances, not their sum. In particular, if all the closed instances have head shapes that are sets (they do not contain any duplicates), then the head shape of the open type is itself a set. In particular, in the absence of unboxed constructors, headshape(α) is typically equal to the set Heads of all heads seen as a multiset (assuming that each head is in the image of at least one well-typed value). If we had used a sum in our definition above, then headshape(α) would have duplicates as soon as two distinct types have heads in common.
3.3.1 For OCaml. In the specific case of OCaml, we define the head of an immediate n ∈ Z_m as just the pair (Imm, n), and the head of a block of tag t ∈ Z_m as the pair (Block, t).
head_data,OCaml(Imm(n)) ≜ (Imm, n)
head_data,OCaml(Block(t, d_0, ..., d_{n-1})) ≜ (Block, t)

Note that, while our previous choice of Data and of the repr function for OCaml models an existing representation and is not visible to users (even in the presence of unboxed constructors), the definition of heads and the function head : Value → Head are new design choices that have the user-visible impact of accepting or rejecting certain unboxed constructors in datatype declarations.
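In the model transcription above (repeated self-contained here), head_data is a one-liner that discards everything but the immediate value or the block tag:

```ocaml
type data = Imm of int | Block of int * data list

(* A head keeps only the immediate value or the block tag. *)
type head = HImm of int | HBlock of int

let head_data : data -> head = function
  | Imm n -> HImm n
  | Block (tag, _args) -> HBlock tag  (* arguments are never inspected *)
```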
We could use a finer-grained notion of head (for example, we could include the arity of a block in its head), which would allow distinguishing more types and thus accepting more unboxed type definitions.
Conversely, a coarser-grained notion of head would be more portable to other implementations.For example, an implementation that would represent constructors by user-visible name rather than position could not use our notion of head as is.We discuss this portability question in Section 4.2.
Another reason to choose a coarser-grained notion of head is to have a simpler model to explain to users, at the cost of rejecting more declarations; for example, one could restrict unboxing to immediate types by using a pessimistic ⊤ shape for all types containing blocks.
Finally, our implementation defines a concrete syntax of head shapes that denote sets of heads and is easy to use in computations.Elements of this head shape syntax are pairs of approximations, one for immediates and one for blocks.Approximations are defined as either a finite set of machine words (including in particular the empty set ∅) or the wildcard shape ⊤ representing all heads.
This syntax is a correct abstraction of multisets that happen to be mere sets; it does not let us express multisets with conflicts. We can directly implement the non-disjoint union S₁ ∪ S₂ of two syntactic shapes, and also implement the disjoint union S₁ ⊎ S₂ as a partial operation that returns a syntactic shape if S₁ and S₂ are disjoint, and is undefined otherwise, that is, when the resulting multiset has duplicates and cannot be represented as a syntactic shape.
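A sketch of this shape syntax as an OCaml data structure (our transcription, not the implementation's actual code): a pair of approximations, each a finite set or the wildcard ⊤, with union total and disjoint union partial:

```ocaml
module IntSet = Set.Make (Int)

(* An approximation: a finite set of machine words, or everything. *)
type approx = Top | Fin of IntSet.t

(* A syntactic shape: immediates on one side, block tags on the other. *)
type shape = { imms : approx; tags : approx }

(* Non-disjoint union: always defined. *)
let union a b = match a, b with
  | Top, _ | _, Top -> Top
  | Fin s1, Fin s2 -> Fin (IntSet.union s1 s2)

(* Disjoint union: None signals a conflict (a duplicated head). *)
let disjoint_union a b = match a, b with
  | Top, Fin s | Fin s, Top when IntSet.is_empty s -> Some Top
  | Top, _ | _, Top -> None
  | Fin s1, Fin s2 ->
    if IntSet.disjoint s1 s2 then Some (Fin (IntSet.union s1 s2))
    else None

let shape_union s1 s2 =
  { imms = union s1.imms s2.imms; tags = union s1.tags s2.tags }

let shape_disjoint_union s1 s2 =
  match disjoint_union s1.imms s2.imms, disjoint_union s1.tags s2.tags with
  | Some imms, Some tags -> Some { imms; tags }
  | _ -> None
```

Checking a declaration then amounts to folding shape_disjoint_union over the components of its sum normal form and rejecting on None.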
In the general case, the definition headshape(τ) of head shapes for open types may be difficult to compute, as it contains a quantification over all closed instances of τ. The OCaml value representation is very regular, which makes it easy to compute the shapes of type variables, constructor declarations and primitive types. We can represent them directly in our head shape syntax:

headshape_OCaml(α) ≜ ⊤
headshape_OCaml(K) ≜ ({i}, ∅) when K is the i-th constant constructor of its type
headshape_OCaml(K of (τ_j)^{J ≠ ∅}) ≜ (∅, {i}) when K is the i-th non-constant constructor of its type
headshape_OCaml(t (τ_j)^j) ≜ the immediates and tags of the primitive type constructor t

This definition of headshape_OCaml on base types and boxed constructor declarations agrees with the generic definition headshape, in the sense that headshape_TyComps(c) = headshape_OCaml(c) for the OCaml value representation and our choice of heads.

Sum Normal Form
To compute the head shape of a type expression τ, we must unfold type definitions and traverse unboxed constructors. This transformation is of independent interest; we formalize it in this section.
We define a grammar of sum normal forms that captures the result of this unfolding process, and a (partial) normalization judgment τ ⇒ n that computes the sum normal form of a type τ.
A sum normal form is a multiset of components, written as a formal sum, that are either a boxed constructor, a type variable or a primitive type constructor. The normalization judgment unfolds datatype declarations, summing the boxed constructors with the normal forms of the arguments of unboxed constructors.
In the presence of mutually-recursive definitions, some type expressions may "loop" forever: they do not have a sum normal form. Consider for example:

type loop = Int unboxed of int | Loop unboxed of loop

Fortunately, the problem of whether a given type expression has a sum normal form is in fact decidable. We discuss our decision procedure in Section 6.

Rejecting Conflicts
Finally, given a type declaration, our static analysis computes the head shape of its sum normal form by summing the head shapes of each component of the sum. Our static analysis accepts the definition if and only if this head shape does not contain any duplicates.
The result of this analysis is easily computed, in the case of OCaml, using our head shape syntax: the head shape of a sum c₁ + c₂ is conflict-free if headshape_OCaml(c₁) ⊎ headshape_OCaml(c₂) is defined, and has a conflict otherwise.

Pattern-Matching Compilation
When checking a type declaration with unboxed constructors, we record, for each unboxed constructor K unboxed of τ, the head shape of its argument type τ. When compiling pattern-matching clauses using an unboxed constructor in a pattern, say K unboxed p, we implement matching on K unboxed of τ as a condition that the head of the scrutinee must belong to the shape of τ. (The details of how to do this depend of course on the pattern-matching compilation algorithm of the language.) We know that this approach is always sound, thanks to the property that none of the other scrutinees (starting, at the source level, with a different constructor) may have a head belonging to the head shape of τ. Note that this property, enforced by our static analysis, is in fact slightly stronger than the absence of conflicts: not only must the representations of inhabitants of τ be distinct from all the other possible scrutinees, they must furthermore have distinct heads.
The runtime cost of checking the head depends on the language and the notion of head chosen.For our choice of heads for OCaml, it is exactly as costly as checking the head constructor of a value, so this does not add overhead on pattern-matching.A finer-grained notion of head that would inspect the value "in depth" could add a higher cost -to balance against the space savings of accepting more unboxing.
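For the zarith example, the compiled match can be pictured over the Obj primitives; a sketch of the generated logic, not actual compiler output:

```ocaml
(* Sketch of the compiled form of
     match (z : zarith) with Small n -> ... | Big b -> ...
   after unboxing Small. The head test Obj.is_int plays the role of
   "the head of the scrutinee belongs to the shape of int". *)
let match_zarith (z : Obj.t) ~small ~big =
  if Obj.is_int z                 (* head is an immediate: Small case *)
  then small (Obj.obj z : int)
  else big (Obj.obj z)            (* otherwise a Custom_tag block: Big case *)
```

With our choice of heads, this is exactly the immediate-or-pointer test that ordinary pattern-matching already performs, which is why no overhead is added.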

SCALING TO A FULL LANGUAGE
In this section, we describe less formally all the "other issues" that we had to consider to scale this proposed feature to a full programming language, namely OCaml.

Handling All the Tricky Cases
We did not encounter any conceptual issue when scaling this approach to all the primitive types supported by the OCaml runtime. (This is not too surprising, given that our heads are closely modeled on the existing runtime data representation.) To demonstrate the difference between the simple situation of datatypes with a simple representation and everything else, let us give here an exhaustive list of all the tricky cases.
(1) Double_array_tag is used to represent nominal records whose fields are all float, and also values of type float array.
To determine the shape of a record type, we must call the same logic that the type-checker uses to decide the "unboxed float record" criterion, and use (∅, {Double_array_tag}) instead of (∅, {0}) in that case.
For arrays, we use the head shape (∅, {0, Double_array_tag}) in all cases. Note that OCaml supports a configuration option to disable the unboxing of float array (supporting this representation adds some dynamic checks on array operations), but we decided to use the pessimistic shape with both tag values independently of the configuration, to avoid having the compiler statically reject some programs only in some specific configurations.
(2) Values of type ty Lazy.t have an optimized representation where they may be represented by a lazy thunk of tag Lazy_tag, a computed value of tag Forward_tag, or directly a value of type ty, under some conditions on ty. The corresponding shape is the maximum of (∅, {Lazy_tag, Forcing_tag, Forward_tag}) and of the shape of ty. (OCaml 5 added a third tag Forcing_tag to detect concurrent forcing; it was trivial to adapt our analysis.) Note that Section 3.3.1 defined headshape_OCaml(t (τ_j)^j) as depending only on t, not on the (τ_j)^j; here we are handling lazy values in a more precise way due to their non-uniform representation. (We could also approximate them to the uniform shape ⊤.)
(3) Function closures may have either tag Closure_tag or Infix_tag (used for some mutually-recursive functions). The pattern-matching code that we generate on sum types with an unboxed function type (which is useful for Coq's native_compute) is slightly less good than it could be because these two tags are not consecutive (247 and 249), so we are slightly tempted to renumber the tags in the future.
(4) Exceptions, and in general inhabitants of extensible sum types, use tag Object_tag. Object types themselves have an obscure and complex data representation due to various optimizations, and we just assigned them the top shape ⊤.

Portability of our Heads
The language of head shapes makes some aspects of the low-level representation of values visible to users of the surface language.This comes at the risk of complexity, but also at the risk of reducing the portability of the language by setting in stone certain representation choices, that would rule out other implementations.
One could think of making constructor-unboxing a "best effort" feature to avoid this downside, by simply emitting a warning in the case where an unboxing annotation would introduce a conflict under the current implementation, and keeping the constructor boxed.We decided against this for now, because we believe that advanced performance features such as constructor unboxing are used when users reason about the performance of their application, that is, when data representation is part of their specification for the code they are writing.In this context, silently ignoring representation requests is arguably a bug: it breaks the specification the user has in mind.
Instead we are discussing with other implementors of OCaml whether we should make our heads more coarse-grained in some places, to increase portability without breaking relevant examples of interest. In particular, the alternative backend js_of_ocaml compiles OCaml to JavaScript, and uses native JavaScript numbers for most OCaml numeric types (int, float, nativeint, int32). We are planning to quotient over the difference between those types in our language of shapes, to improve portability.

Abstract Types with Shapes
Abstract types can readily be given the top shape ⊤. We also support annotating an abstract type with a shape restriction [@shape ..], which gives a head shape for this type.
For abstract types coming from the interface of modules or functor parameters, those shape annotations are checked when checking that the interface conforms to the implementation.
For abstract types used to represent values only populated by the FFI, these shape annotations have to be trusted, in the same way that the OCaml FFI trusts foreign functions to respect the type they are given on the OCaml side.
We used this feature in our Zarith example in Section 2.3 to allow unboxing the Big constructor, whose argument is an abstract type custom_gmp_t of GMP numbers implemented through the FFI.
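For concreteness, here is a sketch of that Zarith-style interface. This is hedged: the payload of the [@@shape] annotation and the shape name custom are our guesses at the proposed syntax, the constructor names follow Section 2.3, and constructor unboxing itself is the proposed feature, not stock OCaml.

```ocaml
(* The abstract type is populated by the GMP FFI; its shape annotation
   is trusted, like the types of foreign functions. *)
type custom_gmp_t [@@shape custom]   (* always a Custom_tag block *)

type t =
  | Small of int                     (* small integers: immediate values *)
  | Big of custom_gmp_t [@unboxed]   (* no conflict: Big's argument is never immediate *)
```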
4.4 Are We Really First-Order?
Parametrized type definitions fall in the first-order fragment because OCaml does not support higher-kinded types.
Note that some designs for higher-kinded types in related languages are restricted to higher-order "type constructors" that do not create β-redexes, so they do not necessarily have the expressiveness of the full higher-order λ-calculus.
On the other hand, the OCaml module system does provide higher-order abstractions through functors: a type in a functor may depend on a parametrized type in the functor argument. However, unfolding of type definitions remains first-order in nature:
• When we are checking the body of a functor and encounter a type that belongs to a module parameter, it is handled as any other type declaration.
• When we encounter a type expression containing a functor application, e.g. Set.Make(Int).t, the type-checker has access to the signature of the functor application Set.Make(Int) and we check its type t.
Another way to think of the treatment of functor application is that the OCaml type-checker performs β-reduction of functor applications before we compute shapes. In other words, in this work, we consider the module language as a strongly-normalizing higher-order subset whose normal forms are first-order.

OCaml Features Subsumed by Head Shapes
The OCaml type-checker currently contains three subsystems that we believe would be subsumed by our head shape analysis:
(1) It contains an analysis of the "unboxed form" of a type (due to the presence of unboxing for single-constructor variants and single-field records) that corresponds to our notion of sum normal form of a type, and would benefit from our normalizing algorithm to compute those in presence of recursion.
(2) It defines a property called [@@immediate] for abstract types, which claims that the inhabitants of the type are all immediate values. (This is used by the runtime to specialize ad-hoc polymorphic functions such as comparison and serialization.) Computing the head shape subsumes immediateness-checking.
(3) To implement unboxing of single-variant GADTs, it must perform an intricate static analysis to reject attempts to unbox existentials, which would make the float array optimization unsound. (Don't ask.) We believe that this static analysis, detailed in Colin, Lepigre and Scherer [2019], could be subsumed by our head shape computation.
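Point (2) refers to an attribute that already exists in stock OCaml; a minimal example of its current use:

```ocaml
(* Current OCaml: the signature promises that t is represented only by
   immediate values, and the type-checker verifies this claim against
   the implementation. In shape terms, immediacy is inclusion in the
   immediates-only shape, so a head shape computation subsumes this check. *)
module M : sig
  type t [@@immediate]
end = struct
  type t = int          (* accepted: int is immediate *)
end
```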
We are planning to simplify the compiler implementation by removing the existing logic for these separate aspects, replacing it with a shape computation.

POTENTIAL EXTENSIONS
We have considered the following aspects, but have not implemented them. They are not necessary for upstreaming a first useful version of constructor unboxing.

Shape Constraints on Type Variables
Section 4.3 shows how abstract types can now be annotated with constraints on the head shapes of their inhabitants. A related feature would be to constrain the type variables of parametrized types. The type 'a block_option is similar to the standard 'a option datatype, but its type parameter 'a may only be instantiated with type expressions whose shape is included in the shape any_block: any block tag, but no immediate value. This restriction allows unboxing the Some constructor without conflicting with the None immediate value. Conversely, imm_option may only be instantiated with immediate types, and its Some constructor can also be unboxed because we made its None constructor a block constructor, None of unit.
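A sketch of these two types, under the hypothetical syntax of a [@shape ...] constraint on a type parameter (the shape names any_block and immediate are assumptions):

```ocaml
type ('a [@shape any_block]) block_option =
  | None
  | Some of 'a [@unboxed]   (* 'a is always a block: no clash with None *)

type ('a [@shape immediate]) imm_option =
  | None of unit            (* a block constructor: immediates stay free *)
  | Some of 'a [@unboxed]
```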
Similarly to abstract types with shapes (Section 4.3), this feature provides modularity.We can construct large types with specialized representations by composing together smaller parameterized types (or functors), with shape assumptions on the boundaries between the various definitions.
This change is easy conceptually, but requires non-trivial changes to the OCaml compiler where type variables do not carry kind information.It goes in the same direction as other "layout" changes experimented with by Jane Street, so some of the implementation work can be shared.

Harmless Cycles
Our algorithm to compute shapes unfolds potentially-recursive type definitions and monitors termination: it stops when encountering a cycle in the definitions. Currently our prototype rejects all definitions that contain such cycles. But the cycles fall in two categories: most cycles are "harmful cycles" that must be rejected, but there are also "harmless cycles" that could be accepted.
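The two categories can be illustrated by the following sketch; the constructor layout is reconstructed from the discussion of the values A, Loop(A), etc., and [@unboxed] on a constructor is the proposed feature:

```ocaml
type harmful = A | Loop of harmful [@unboxed]
  (* must be rejected: A, Loop A, Loop (Loop A), ... would all share
     a single representation *)

type harmless = Loop of harmless [@unboxed]
  (* also a cycle, but the type is empty, so no confusion can arise *)
```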
Both examples are rejected by our prototype, as their shape computation detects a cycle. harmful cannot soundly be accepted, as there would be a confusion between the values A, Loop(A), Loop(Loop(A)), etc. On the other hand, allowing the unboxing of harmless would not in fact introduce any confusion, as the type would be empty, without any inhabitant. Said otherwise, cycles in shape computations can be interpreted as least fixpoints; most of those fixpoints contain conflicts, but a few are the empty set of values.
Accepting harmless cycles would be of medium difficulty. From an implementation perspective it is not easy to distinguish harmless cycles and accept them; it is substantially more work than rejecting all cycles. Besides, there is no point in writing types such as harmless in practice: one can just write an empty type directly. So this is naturally left as future work.
There is however one good reason to do more work there, which is related to data abstraction. Consider the following example: This weird definition is accepted by our shape analysis. For the abstract type 'a foo we assume the shape ⊤ of any possible value; this does not depend on the parameter 'a. Then weird foo has the same shape ⊤ and the definition of weird is accepted. (Adding any other constructor to weird would make it rejected.) However, we could later learn that the type 'a foo is in fact defined as type 'a foo = 'a. If we perform the substitution, we get a harmless cycle. In other words, rejecting harmless cycles breaks the substitutability property of abstract types. This is a nice meta-theoretical property, and breaking it may result in surprising software engineering situations that are problematic in practice.
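The abstraction scenario above can be sketched as follows (the constructor name Weird is our own):

```ocaml
type 'a foo                                 (* abstract: assumed shape ⊤ *)
type weird = Weird of weird foo [@unboxed]  (* accepted: a single unboxed
                                               constructor cannot conflict *)

(* If the abstraction is later revealed to be the identity:
     type 'a foo = 'a
   then substituting into weird yields a harmless cycle, which our
   prototype rejects: substitutability is lost. *)
```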

Unboxing by Transformation
In our work, unboxed constructors act as the identity on the representation of their arguments. One could generalize this by allowing constructor unboxing to be realized by a non-identity transformation on its argument, chosen to be more efficient than the default constructor representation. Applying non-identity transformations could avoid conflicts in value representations, allowing more unboxing requests. Consider for example: Our work only supports unboxing the constructor A in this example. Unboxing B is not supported:
• if A is not unboxed, then we would have a confusion between blocks constructed by A and those coming from the Some constructor of the option;
• if A is unboxed, then we would have a confusion between the immediates corresponding to the false value (in A) and the None value (in B).
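A hedged reconstruction of the example discussed here, with the conflict spelled out in comments:

```ocaml
type 'a t =
  | A of bool          (* boxed: a tag-0 block; unboxed: immediates 0 and 1 *)
  | B of 'a option [@unboxed]
      (* rejected either way: Some is a tag-0 block and None the
         immediate 0, so B conflicts with A boxed or unboxed *)
```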
It would however be possible to unbox the constructor B if we agreed to change the representation of the constructor A of bool [@unboxed]. Instead of storing a bool value directly (an immediate in {0, 1}), we could transform the bool value to store it as an immediate in {1, 2} for example, avoiding a conflict with the None value (immediate 0) unboxed from B. Examples of representative approaches follow.
The type t1 is in fact not an example of a transformation associated with an unboxed constructor: instead we assume that it is possible to specify a non-standard choice of tag at declaration time, via an imaginary [@tag 2] attribute. Supporting this would be an easy change, but it relies on a non-standard boolean type and thus requires code changes from the user, making it cumbersome or impractical in many situations. The type t2 performs a transformation at unboxing time (adding 1 to the value) that is specified by the user. Pattern-matching code then has to undo this transformation on the fly (by subtracting 1). Note that (add 1) is not an arbitrary OCaml term here: it must be part of a dedicated transformation DSL that we know how to invert efficiently.
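Hedged sketches of the three variants; the attributes [@tag ...], [@unboxed (add 1)] and [@unboxed auto], and all constructor names, are imaginary syntax of our own:

```ocaml
(* t1: no shifting, but a different boolean type whose constructors are
   assigned the non-standard immediates 1 and 2 at declaration time *)
type bool' = False' [@tag 1] | True' [@tag 2]
type 'a t1 =
  | A of bool' [@unboxed]
  | B of 'a option [@unboxed]     (* None (immediate 0) is now free *)

(* t2: a user-specified transformation from a dedicated, invertible DSL *)
type 'a t2 =
  | A of bool [@unboxed (add 1)]  (* stored as an immediate in {1, 2} *)
  | B of 'a option [@unboxed]

(* t3: the compiler is asked to infer a suitable transformation *)
type 'a t3 =
  | A of bool [@unboxed auto]
  | B of 'a option [@unboxed]
```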
Another valid choice would be to have the constructor B transform its argument in a way that leaves its block values unchanged but shifts its None value from the immediate 0 to an immediate outside {0, 1}, or to a block of non-zero tag (with no argument). Note that unboxing B would then come at a higher runtime cost, as the transformation (to apply at construction time and undo at matching time) is more complex.
Finally, the type t3 assumes a version of constructor unboxing that implicitly infers such transformations to satisfy the user's unboxing request. This is not the design approach that we have used in our OCaml work, but it corresponds to unboxing strategies in some other programming languages; see our discussion of Rust's niche-filling optimizations in Related Work Section 8.2.
Arbitrary transformations can be supported, as long as we can express the corresponding abstract transformation on shapes. For OCaml and the notion of heads that we proposed, a natural space of transformations are those that change the head of a value and leave the rest unchanged, by:
(1) applying a mapping to its immediate values, and/or turning some of them into blocks (constant blocks with fixed tags, or non-constant blocks with the transformed immediate as argument);
(2) applying a mapping to the tag of its blocks, leaving its arity and arguments unchanged.
The set of transformations of interest is also constrained by performance considerations. In particular, turning an immediate into a non-constant block requires an allocation and a memory indirection (constant blocks can be preallocated), which is precisely what we wanted to avoid by unboxing the constructor. It may still be beneficial if this transformation occurs only for some inputs that are rare in practice, with all other cases unboxed.
Supporting tag choice requests as in t1 should be easy; user-specified transformations as in t2 would be of medium difficulty, depending on the expressiveness of the transformations.We are not planning to work on full transformation inference in the context of OCaml.

Using Unboxing to Describe Existing Representation Tricks
Some subtle data-representation choices of the OCaml compiler and runtime, mentioned in Section 4.1, could in fact be presented as unboxing, possibly with further extensions.
Flat float arrays. OCaml arrays use a uniform representation (using tag 0), except for arrays of elements represented as floats, which have the tag Double_array_tag and a custom representation. The OCaml runtime (written in C) checks the array tag on each low-level operation to determine how to access the array. OCaml cannot currently express the type of custom arrays of float-represented values, but our proposed shape annotations (Section 5.1) would make it possible to do so: type ('a [@shape double]) double_array [@@shape double_array]. With constructor unboxing we can then express the array representation trick in safe OCaml code:
Lazy forwarding. The representation of a lazy value may be a block of tag Lazy_tag, for a thunk that has not yet been evaluated to a result, or a block of tag Forward_tag storing a result, or sometimes this resulting value directly. The OCaml runtime sometimes "shortcuts" forward blocks when they are moved around, replacing them by their value directly, with a dynamic check that this does not introduce an ambiguity: the value should not itself have tag Lazy_tag or Forward_tag.
Let us try to express this shortcutting trick in OCaml rather than in the C runtime code. Let us assume an imaginary [@unboxed unsafe] attribute that does not perform any static confusion check for this constructor; we intentionally do not provide this in our current design proposal, which focuses on safe uses. We could then write: One could even think of an imaginary [@unboxed dynamic] variant where the compiler is in charge of inserting this dynamic check:

6 OUR HALTING PROBLEM
In Section 3.4, we mention that computing the head shape of a type requires unfolding datatype definitions, and that this unfolding process may not terminate in presence of mutually-recursive datatype definitions.
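Under these imaginary attributes, the shortcutting type could be sketched as follows (the names and layout are our own, not the paper's figure):

```ocaml
type 'a lazy_repr =
  | Thunk of (unit -> 'a)            (* block of tag Lazy_tag *)
  | Forward of 'a                    (* block of tag Forward_tag *)
  | Shortcut of 'a [@unboxed unsafe]
      (* no static confusion check; with [@unboxed dynamic] instead,
         the compiler would insert the runtime tag test itself *)
```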
In the present section, we discuss this problem in more detail, and present a novel algorithm to normalize safely in presence of recursion: it either returns the sum normal form of a type, or reports (in a finite amount of time) that the definition loops and no sum normal form exists.
First, we remark that this problem corresponds to the halting problem for a specific fragment of the pure λ-calculus (just function types, no products, booleans, natural numbers, etc.), namely the first-order fragment with arbitrary recursion. Consider the following example: In this translation, we use a free variable sum as a binary operator to separate constructor cases, and a free variable box over the translation of type expressions appearing under a constructor. (Other free variables encode primitive types.) The sum normal form of any type in the definition environment above can be read back from a normal form of the translated λ-terms in presence of the recursive definitions. (More precisely, we only need a "weak" normal form that does not reduce under box applications.) The algorithm that we present in this section decides the halting problem for the first-order pure λ-calculus with recursive definitions, also called "order-1 recursive program schemes" in the literature, with an arbitrary reduction strategy. This is not unreasonable, given that the halting problem for this fragment is already known to be decidable, as demonstrated for example in Khasidashvili [2020] in the first-order case and in Plotkin [2022]; Salvati and Walukiewicz [2015] in the more general setting of the pure simply-typed λ-calculus with arbitrary recursive definitions!

6.1 On-The-Fly rather than Global Termination Checking
The normalization arguments in previous work on recursive program schemes are global in nature; they reason by normalizing all mutually-recursive definitions at once [Khasidashvili 2020], or at least they compute a termination bound that depends on the size of the whole mutually-recursive system [Plotkin 2022].
In our setting, the set of mutually-recursive definitions potentially contains large type definitions in scope, that are either explicitly mutually-recursive, or depend on each other through recursive modules. Normalizing datatype definitions without unboxed constructors is immediate, as they are their own normal forms, but we also have to normalize through OCaml type abbreviations, which are widely used. (We did not include type abbreviations in our Section 3, as an abbreviation type t = τ can be understood in this context as syntactic sugar for type t = Abbrev of τ [@unboxed].) Expanding all abbreviations is also known to potentially generate very large structural types for some use-cases, so the OCaml type checker uses careful memoization to only expand on-demand during type inference.
Another issue with a global termination analysis is that OCaml type definitions change often due to functor applications. Some type definitions in a functor body rely on abstract (or concrete) types from the functor argument. When the functor is applied to a module argument, we get a new instance of those definitions where previously-abstract type constructors from the formal parameter become concrete, which may even introduce new recursive dependencies. Any global termination analysis done on all type definitions would thus have to be partially recomputed on functor applications, and delimiting the part of the computation to rerun may not be obvious in presence of recursive dependencies.
Performing a global termination analysis thus runs the risk of large computational costs in practice, which is all the more frustrating given that we expect unboxed constructors themselves to be rarely used, being an advanced feature. The feature should come at no cost when not used, and at little cost when used sparingly.
Instead, we propose an on-the-fly termination checking algorithm. Without any static precomputation on the set of mutually-recursive definitions (which may be large and/or change often during type checking), our algorithm takes a term and monitors its reduction sequence: it maintains some information on the side that is updated during reduction, and may "block" the reduction if it detects that it is about to loop forever. We must provide the following guarantees:
• Correctness: reduction sequences that are never blocked by the termination monitor are always finite; they cannot diverge.
• Completeness: if a reduction sequence is blocked by the termination monitor, then it would have diverged in the absence of monitoring.
The existence of a termination monitor that is correct and complete implies decidability of the halting problem for the reduction being considered. We have not found termination results in the existing literature whose proofs suggest this termination-monitoring approach; to the best of our knowledge, this approach is novel for the pure first-order λ-calculus with recursive definitions. (But the literature on term rewriting systems is vast and our knowledge of it very partial.) We consider this a notable contribution of our work, whose interest is independent of constructor unboxing. Already in the OCaml compiler, there are other parts of the type checker that need to normalize type definitions (including unboxed single-constructor datatypes), and they rely on the unprincipled approach of passing a fixed amount of "fuel" and failing with an error once it is exhausted. We plan to rewrite these computations using on-the-fly normalization checking. We hope that other language implementors can use this approach to work with recursive type definitions, and there may be other use-cases thanks to the generality of the language considered.
Remark: our termination-monitoring approach means that we are only adding some bookkeeping logic to a head-shape computation that we need to do anyway, that happens rarely (single-constructor unboxing is a rarely used, opt-in feature), and whose cost is bounded by a very small constant in practice: the depth of type definitions that need to be unfolded to compute the head shape (at most 5 for reasonable OCaml programs). In particular, we know with certainty that head shape computations will add no noticeable overhead to the compilation of real-world OCaml programs.

Intuition
Attempt 1: detect repetition of whole terms. A first idea to prevent non-termination is to perform simple cycle detection: block the reduction sequence if we encounter a term that was already part of the reduction sequence. This approach is obviously complete (when we block, we have found a cycle that can diverge), but it is not correct in presence of "non-regular recursion", which can generate infinitely many distinct terms. Consider for example (in λ-calculus syntax): In this environment, loop(int) reduces to loop(list int), then loop(list (list int)), etc., without ever repeating the same term, or even the same subterm in reducible position.
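The recursive environment of this example can be written as follows (reconstructed from the reductions described in the text):

```
let rec loop(a) = loop(list a) in loop(int)
```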
Attempt 2: detect repetition of head functions. The second idea is that, if detecting repetition of whole terms is too coarse-grained, we should instead track repetitions of the head function of the term, in the usual sense of the topmost function/rule/constructor in reducible position. This approach would prevent the infinite reduction sequence for loop(int) in the example above, by blocking at the second redex with the same head loop. It is easy to show that it is correct for termination, given that the number of distinct heads is finite. However, this approach is incomplete: it blocks reduction sequences that would have normalized, for example: let rec id(a) = a in id (id int). This reduction sequence needs to reduce the function id twice before reaching a normal form.
Solution: trace head functions for each subexpression. Our solution is a refinement of detecting repetitions of heads. Instead of tracking the heads that have been expanded in the whole term, we track a different set of heads for each subterm, corresponding to the set of function heads whose expansions were necessary for the subterm to appear in the term.
In the first example above, we start with the term loop int where all subterms are annotated with the empty trace (no expansion has happened yet). We can write this as []loop []int: the subterms loop int and int are both in the empty trace. The first step of the reduction results in the annotated term [loop]loop ([loop]list []int): the subterm []int was unchanged by the expansion, but the surrounding context loop (list _) appeared during the reduction of loop in the empty trace, so this part of the new term gets annotated with [loop]. At this point, our algorithm blocks the redex loop (list int), as it is already annotated with the function head [loop].
In the second example, []id ([]id []int) reduces to its argument []id []int unchanged: this argument was already present in the term before the expansion, it did not appear during the reduction. []id []int can in turn reduce to []int, which is an (annotated) normal form.

Formalizing our Algorithm
We start from a grammar for programs in the first-order λ-calculus with recursive definitions, containing in particular terms t, and introduce a distinct category of annotated terms t̄ whose function-call subterms carry an expansion trace L (a list of function names without duplicates).
We extend the usual notion of β-reduction t → t′ to annotated terms. Expanding a function call f(t̄) is only possible if its trace does not already contain f; otherwise this term is stuck, and we call it a blocked redex. The arguments t̄ of the call are annotated terms, but the body t′ of the definition of the function f is a non-annotated term: the body and its subterms appear in the reduction sequence at this point, and we use an annotating substitution t′[σ]@L to annotate them, where σ is a substitution from variables to annotated terms and L is the trace to use to annotate the new subterms.
Note that the annotating substitution t[∅]@L annotates each function call of an unannotated term t with the trace L.
We are only interested in annotated terms that were obtained starting from an initial annotated term with all traces empty. Outside this subset of annotated terms there are terms with weird/impossible annotations (for example, annotations mentioning functions that do not exist in the recursive environment) that we sometimes want to rule out from our statements. Definition 6.2 (Reachable annotated term). An annotated term t̄ is reachable if it occurs as a subterm of a term in a reduction sequence starting from an initial program of the form let rec E in t[∅]@∅.
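To make the formalization concrete, here is a small executable sketch of annotated reduction. It is our own illustration, not the paper's prototype: we fix an innermost strategy (the paper's results hold for any strategy), and assume arity-correct calls.

```ocaml
type term =
  | Var of string
  | Call of string * term list                   (* f(t1, ..., tn) *)

type aterm =
  | AVar of string
  | ACall of string list * string * aterm list   (* trace, f, arguments *)

type def = { params : string list; body : term }

(* The annotating substitution  body[sigma]@trace : every function call
   of the unannotated body gets annotated with [trace]; variables are
   replaced by already-annotated arguments, keeping their own traces. *)
let rec annotate trace sigma = function
  | Var x -> (try List.assoc x sigma with Not_found -> AVar x)
  | Call (f, args) -> ACall (trace, f, List.map (annotate trace sigma) args)

exception Blocked of string * string list

(* Normalize, blocking any redex whose head already occurs in its own
   trace. *)
let rec normalize defs = function
  | AVar x -> AVar x
  | ACall (trace, f, args) ->
    let args = List.map (normalize defs) args in
    (match List.assoc_opt f defs with
     | None -> ACall (trace, f, args)      (* free head, e.g. a primitive *)
     | Some d ->
       if List.mem f trace then raise (Blocked (f, trace))
       else
         let sigma = List.combine d.params args in
         normalize defs (annotate (f :: trace) sigma d.body))

let run defs t =
  match normalize defs (annotate [] [] t) with
  | nf -> Ok nf
  | exception Blocked (f, trace) -> Error (f, trace)

(* let rec loop(a) = loop(list a)  blocks on  loop(int) ... *)
let defs_loop = ["loop", { params = ["a"];
                           body = Call ("loop", [Call ("list", [Var "a"])]) }]
(* ... while  let rec id(a) = a  normalizes  id(id(int)). *)
let defs_id = ["id", { params = ["a"]; body = Var "a" }]
```

On defs_loop, run blocks with the trace ["loop"]; on defs_id, it reaches the annotated normal form []int, matching the two examples of the Intuition section.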

A Sketch of Correctness (Termination)
Due to space limitations, we moved our proofs of correctness and completeness for this annotated reduction algorithm to Appendix A, which proves the following results.

Lemma. If t̄ is reachable and reduces, in the annotated system, to a β-normal form ū, then ⌊ū⌋ is the β-normal form of ⌊t̄⌋.

Theorem (Correctness). Annotated reduction is strongly normalizing: it either reduces to a β-normal form or reaches a blocked redex in a finite number of reduction steps.

Theorem (Completeness). If an annotated program t̄ contains a blocked redex, then its unannotated erasure ⌊t̄⌋ admits an infinite reduction sequence.
In this section we merely sketch our termination argument. The general approach is to use a termination measure. We first define a measure, that is, a function from our terms into a well-ordered set: a set with an order relation such that there exist no infinite strictly-decreasing sequences. Then we prove that our annotated reduction strictly decreases the measure of terms.
We first define a measure on traces L. There is a finite set of type declarations in our system, and a trace can contain each type constructor at most once, so there is a largest possible trace Lmax that contains all type constructors. An application annotated with this trace cannot be reduced. Any trace L can then be measured by the length difference length(Lmax) − length(L), a natural number.
The question is then how to extend this measure on traces into a measure on annotated terms. One could think of using the (measure of the) trace of the head application, but this does not decrease during reduction if a subterm contains a larger trace and ends up in head position. (Note: we state correctness for any reduction strategy on annotated terms, not necessarily head reduction.) A second idea is to measure a term t̄ by the set of (measures of) the traces that occur inside it, technically the multiset of traces, using the multiset ordering. But this is not decreasing either: when we reduce an application f(t̄)@L, we may duplicate its arguments arbitrarily many times, and those may contain traces that are strictly larger than L, resulting in a larger overall measure for the reduced term.
The trick to make the proof work, which was suggested to us by Irène Waldspurger, is to use multisets of multisets: we measure each subterm in our annotated term by the path from this subterm to the root of the term, seen as a multiset of traces of applications.And then we measure our term by the multiset of measures of its subterms.
Consider now a subterm ū of an argument of the redex f(t̄)@L. The measure of its head application may be larger than the measure of L, but its measure as a subterm is the path to the root, which contains L. When the application of f gets replaced by new subterms carrying the strictly longer trace (L, f), of strictly smaller trace measure, the path from ū to the root changes: L gets replaced by these new nodes in the path, so the path measure of ū decreases strictly. This works even if the subterm ū gets duplicated by the expansion: we get several copies, but each at a strictly smaller measure, so the multiset measure still decreases.

ON CPP
Our on-the-fly termination checking algorithm is related to the macro expansion algorithm of the cpp preprocessor for C, as presented in the C11 standard. Quoting the standard (6.10.3.4 (2)): "If the name of the macro being replaced is found during this scan of the replacement list (not including the rest of the source file's preprocessing tokens), it is not replaced. Furthermore, if any nested replacements encounter the name of the macro being replaced, it is not replaced. These nonreplaced macro name preprocessing tokens are no longer available for further replacement even if they are later (re)examined in contexts in which that macro name preprocessing token would otherwise have been replaced." In this context, the "replacement list" denotes the body of a function-like macro definition; here we only consider function-like macros, #define FOO(..) ... rather than #define FOO .... The standard explains that a macro must not be expanded from its own definition or from a "nested replacement". This corresponds to our idea of blocking redexes whose function name occurs in their own trace. The idea that this non-replacement information remains active "in later contexts" corresponds to the idea of carrying annotations around in subterms as the computation proceeds.
The phrasing of the standard is not very clear! In the 1980s, Dave Prosser worked on a strategy to ensure that macro replacement always terminates by disallowing dangerous cyclic/recursive macros, and wrote a careful algorithm to allow as much replacement as possible without (hopefully) endangering termination. This algorithm was published as pseudo-code in a technical note, Prosser [1986]. The C89 standard committee then translated Dave Prosser's pseudo-code into the obscure prose that became the standard text.
We know about Dave Prosser's pseudo-code today thanks to Diomidis Spinellis, who spent "five years trying to implement a fully conforming C preprocessor"; Spinellis wrote an annotated version of Prosser's pseudo-code that explains it: Spinellis [2008].
Prosser's algorithm is strongly related to our termination-monitoring algorithm, though it was not an inspiration: we were unfortunately unaware of the connection when designing our own. We are not aware of any proof of correctness for Prosser's algorithm, that is, a proof that it does guarantee termination.
In this section, we compare the two algorithms. We only consider the "core" fragment of cpp macros, consisting of the function-like macros that use their formal parameters directly: no conditionals, no use of stringization or concatenation, etc.

First-Order and Closed-Higher-Order Function Macros
The cpp preprocessor performs β-reduction, but also parsing: it starts from a linear sequence of tokens instead of an abstract syntax tree. It is difficult to reason at the level of sequences of tokens, in particular about macros that generate unbalanced sequences of parentheses; we will not attempt to do so here. Let us only consider (core) macros whose terms are well-parenthesized. What is their expressivity in terms of a λ-calculus?
We define the first-order fragment of core macros as the fragment where all bound occurrences of a macro name foo are syntactically an application foo(...): foo is immediately followed by well-bracketed parentheses. This is the fragment that the vast majority of C programmers use. But it is possible to write (well-parenthesized) macros outside that fragment, whose reduction behavior is less clear. We will mention two examples, which we call the NIL example and the a(a) example: In the NIL example, G1 is used in non-applied position in the definition of G0. In the a(a) example, the second occurrence of a in a(a) is not in application position.
We call this general case a closed-higher-order language: it is higher-order in the sense that functions (macro names) can be passed as parameters and returned as results, but those functions remain closed: functions cannot be declared locally and capture lexical variables. This language also corresponds to simply-typed supercombinators [Hughes 1982; Turner 1979]. (It is also similar to the use of function pointers in C, but with macro names instead of runtime addresses.) We can consider typed or untyped versions of this closed-higher-order language. The simply-typed version is a fragment of the simply-typed λ-calculus with recursion, so its halting problem is decidable. The untyped version is Turing-complete, just like most extensions with more constructors or richer type systems, so its halting problem is undecidable. C macro authors probably do not consider typing their macros, so they work in the untyped version, but an extension of ML with higher-kinded type definitions would correspond to the typed version. Our algorithm does not depend on types, so it can work on either version.
In the rest of this section we will detail the following claims:
Fact 1. Our algorithm gives the same result as Dave Prosser's on the first-order macro fragment.
Fact 2. Our algorithm extends to the closed-higher-order macro fragment, and remains correct.
Fact 3. Neither our algorithm nor Prosser's is complete on the closed-higher-order macro fragment.

The C and C++ Standard Bodies on Closed-Higher-Order Macros
Those two examples, NIL and a(a), come from the Defect Report 017 of the C and C++ standard committee [The C standard committee, working group WG14 1992]. Since the C89 standard was published, programmers have asked for clarifications about the replacement behavior that would dictate how those examples should behave; we have indicated possible reduction sequences in comments, but implementations in the wild would behave differently and often stop before reaching the normal form.
In the first few years, the C standard committee refused to provide clarifications; we understand that those examples were perceived as unrealistic and absent from real-world C code. See the answers to questions 17 and 23 in the Defect Report 017 [The C standard committee, working group WG14 1992]. It later became clear that some C or C++ programmers made real-world use of the non-first-order fragment, and lately standard bodies have moved towards actually specifying this behavior, choosing to honor Dave Prosser's original intent to reduce as much as possible. See in particular the discussion of the NIL example above in the document N3882 from the C++ standard body [The C++ standard committee, working group SG12 2014].

Relating our Algorithm to Dave Prosser's Pseudo-Code
In Appendix B, we show and explain Dave Prosser's algorithm, we relate it to our own termination-monitoring algorithm (they are different but related), we claim that they provide the same expansions in the first-order case, and finally we explain how Dave Prosser's algorithm works outside the first-order fragment.
We are not aware of any previous proof that Dave Prosser's algorithm terminates on all inputs; our comparison provides a proof in the first-order case for the core fragment, but the closed-higher-order case remains open.

Adapting our Algorithm to Closed-Higher-Order Macros
For the purposes of OCaml head shape computation, we have only formulated our termination-monitoring algorithm on the first-order λ-calculus with recursion. We now extend our first-order calculus from Section 6.3 to the closed-higher-order fragment, and show that our termination-monitoring algorithm still enforces termination. In the ML world, this would correspond to extending ML declarations to higher-kinded type parameters, a feature present in Haskell.
We have highlighted above the changes compared to the first-order case: we add function names as first-class terms, and relax our syntax of application from the application of a known function name to the application of an arbitrary term. Note that the annotation here is the trace of the whole application node, not the trace of the function symbol: function symbols do not carry an annotation. The definition of reduction, repeated below for reference, is unchanged. But it now implicitly restricts redexes to the case where the left-hand side of the application has first been reduced to a function name.
This extension can trivially express the closed-higher-order examples that we discussed so far, as well as the famous non-terminating λ-term (λx. x x) (λx. x x).

Theorem 7.1. Our termination-monitoring algorithm remains correct for this closed-higher-order language: the annotated language is strongly normalizing.
The proof of this result reuses the technical machinery of the correctness proof for the first-order fragment. It is detailed in Appendix A.5.

Both Algorithms are Incomplete Outside the First-Order Fragment
Here is a counterexample to completeness in the closed-higher-order fragment:

#define f(p,q) p(f(q,q))
#define id(x) x
#define stop(x) done

f(id,stop) // → id(f(stop,stop)) → f(stop,stop) → stop(f(stop,stop)) → done

This example requires two nested expansions of f to reduce to its normal form done. Neither our algorithm nor Prosser's can reach this normal form; they block at f(stop,stop). Note that this term is an example of a term that is only weakly normalizing: it has a normal form, but also an infinite reduction sequence f(stop,stop) → stop(f(stop,stop)) → stop(stop(f(stop,stop))) → ...
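The blocking behavior described in this section can be reproduced with a small executable model. The sketch below is our own simplification of the annotated calculus, not the paper's exact formalization (the names, reduction strategy and annotation discipline are ours): every application node carries a trace of the names already expanded to produce it, expanding a name already present in the trace is blocked, and the nodes coming from a body get the extended trace while substituted arguments keep their own.

```ocaml
(* Trace-monitored reduction for the closed-higher-order fragment:
   a simplified executable model, not the paper's exact formalization. *)
type term =
  | Var of string                          (* definition parameter *)
  | Name of string                         (* first-class function/macro name *)
  | App of term * term list * string list  (* head, arguments, trace *)

let rec subst s = function
  | Var x -> (try List.assoc x s with Not_found -> Var x)
  | Name f -> Name f
  | App (h, args, tr) -> App (subst s h, List.map (subst s) args, tr)

(* Expanding f: the nodes coming from the body get the extended trace;
   the substituted arguments keep their own traces. *)
let expand f params body args tr =
  let tr' = f :: tr in
  let rec annotate = function
    | App (h, xs, _) -> App (annotate h, List.map annotate xs, tr')
    | t -> t
  in
  subst (List.combine params args) (annotate body)

(* Normalize the head first; expand only if the head is a defined name
   that is not already present in the trace of the application node. *)
let rec normalize env t =
  match t with
  | Var _ | Name _ -> t
  | App (h, args, tr) ->
      (match normalize env h with
       | Name f when (not (List.mem f tr)) && List.mem_assoc f env ->
           let params, body = List.assoc f env in
           normalize env (expand f params body args tr)
       | h' -> App (h', List.map (normalize env) args, tr))

(* The a(a) example: a(x) = x(x). One expansion, then the redex is blocked. *)
let env_a = [ ("a", (["x"], App (Var "x", [Var "x"], []))) ]
let omega = App (Name "a", [Name "a"], [])

(* The incompleteness example: f(p,q) = p(f(q,q)); id(x) = x; stop(x) = done. *)
let env_f =
  [ ("f", (["p"; "q"],
           App (Var "p", [App (Name "f", [Var "q"; Var "q"], [])], [])));
    ("id", (["x"], Var "x"));
    ("stop", (["x"], Name "done")) ]
let start = App (Name "f", [Name "id"; Name "stop"], [])
```

On omega, one expansion yields a(a) annotated with the trace [a] and reduction stops, so the model terminates where unmonitored expansion would loop. On start, it stops at f(stop,stop) whose trace already contains f, matching the incompleteness discussion above, even though the unannotated term normalizes to done.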

RELATED WORK
There is a lot of work on value representations in programming languages, including questions of how to optimize data representation. We are not aware, however, of previous academic work on custom (datatype-specific) representation of sum/coproduct types, which requires checking that a candidate representation ensures disjointedness. We will discuss the state of representation optimization in neighboring languages with sum types. In particular, the "niche-filling" optimizations of Rust are the most closely related work, as they involve a form of disjointedness. To our knowledge they have not previously been discussed in an academic context, except for a short abstract [Bartell-Mangel 2022].

Functional Programming Languages
Haskell. Haskell offers newtype for single-constructor unboxing. GHC supports unboxed sum types among its unboxed value types. Unboxed value types live in kinds different from the usual kind Type of types whose value representation is uniform. GHC also supports an UNPACK pragma [Marlow 2003] on constructor arguments to require that this argument be stored unboxed, generalizing OCaml's unboxing of float arrays and records. Haskell would still benefit from constructor unboxing. Note that lifted types (containing lazy thunks) would conflict with each other, limiting applicability; one has to use explicitly unlifted types, or Strict Haskell, etc.
MLton. MLton can eliminate some boxing due to aggressive specialization [Weeks 2006]; for example, (int * int) array will implicitly unbox the (int * int) tuple. Its relevant optimizations are SimplifyTypes, which performs unboxing for datatypes with a single constructor (after erasing constructors with uninhabited arguments), and DeepFlatten and RefFlatten, which optimize combinations of product types and mutable fields. The representation of sum types with several (inhabited) constructors remains uniform.
Scala. In Scala, representation questions are constrained by the JVM but also by the high degree of dynamic introspectability expected. Even the question of single-constructor unboxing is delicate. A widespread AnyVal pattern has disappointing performance [Chan 2017; Compall 2017] and dotty introduced a specific opaque type synonym feature [Odersky and Moors 2018; Osheim, Cantero and Doeraene 2017] to work around this.
Specialization and representation optimizations. Both MLton and Rust create opportunities for datatype representation optimizations by performing aggressive monomorphization. In OCaml, statically specializing the representation of (int * int) array (as MLton does) or Option<Box<T>> (as Rust does) to be more compact would not be possible, as polymorphic functions working with 'a array or 'a option inputs are compiled once for all instances of these datatypes.
On the other hand, a non-monomorphizing language could perfectly implement and support representation optimizations for types that are only used in a specialized context. For example, at the cost of some code duplication, high-performance code could define its own 'a option_ref type that looks similar to ('a option) ref but with a more compact representation. (In fact people do this today using foreign or unsafe code; an 'a ref_option can be safely expressed with inline records.) Finally, in Section 5.1 we described parameterized type definitions that support a quantification on the head shape of the type parameter, which could provide a compromise between genericity and unboxing opportunities.
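The inline-record encoding alluded to above can be sketched as follows. The type and function names are ours, and note one caveat of this particular sketch: the cell cannot switch between empty and full after creation, unlike a genuine ('a option) ref.

```ocaml
(* Sketch (our own naming): an option-like reference using an inline
   record, so the mutable field lives directly in the constructor's
   block instead of behind the extra indirection of ('a option) ref. *)
type 'a option_ref =
  | Empty
  | Full of { mutable contents : 'a }

let get = function
  | Empty -> None
  | Full r -> Some r.contents

let set_full cell v =
  match cell with
  | Empty -> invalid_arg "set_full: empty cell"
  | Full r -> r.contents <- v
```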
We thus consider that monomorphization is a separate concern from representation optimizations such as unboxing. Monomorphization creates more opportunities for representation optimizations, with well-known tradeoffs in terms of compilation time and code size. Languages could consider representation optimizations whether or not they perform aggressive specialization.
Unboxed sums and active/view patterns. Some languages, such as GHC Haskell and F♯, support both unboxed sums as "value types" [Ağacan 2016; Syme 2016] and active/view patterns [Peyton-Jones 2007; Syme, Neverov and Margetson 2007] to apply user-defined conversions during pattern matching. Combining those two features can get us close to unboxed constructors in some contexts, including our Zarith/gmp example. The idea is to have a very compact primitive/opaque value representation, and expose a "view" of these values in terms of unboxed sums; in F♯ these are called "struct-based discriminated unions". This style can combine a compact in-memory representation with the usual convenience of pattern-matching, without extra allocations, even when crossing an FFI boundary. The overhead would typically be higher than native constructor unboxing, but only by a constant factor.
This approach is very flexible: it can be used to perform representation optimizations that are not covered by single-constructor unboxing alone. For example, it can be used to view native integers as an unboxed sum type of positive-or-negative-or-zero numbers. On the other hand, constructor unboxing extends the range of memory layouts that can be expressed directly in the language, as in our Zarith example; to use the "view" style in this situation, we have to implement the representation datatype and the view function in unsafe foreign code.
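In an OCaml setting, the view style can be sketched safely as follows; the module, its internal representation, and all names are ours (a real Zarith-like type would use a genuinely compact representation implemented in unsafe or foreign code, here simulated by a plain variant):

```ocaml
(* Sketch of the "view" style: a compact abstract representation, with
   an ordinary variant exposed only at pattern-matching time. *)
module Compact : sig
  type t
  type view = Small of int | Big of string
  val of_int : int -> t
  val of_string : string -> t
  val view : t -> view  (* the view may allocate, by a constant factor *)
end = struct
  type t = I of int | S of string  (* stand-in for the real compact repr *)
  type view = Small of int | Big of string
  let of_int n = I n
  let of_string s = S s
  let view = function I n -> Small n | S s -> Big s
end

(* Clients pattern-match on the view, never on the representation. *)
let describe x =
  match Compact.view x with
  | Compact.Small n -> "small:" ^ string_of_int n
  | Compact.Big s -> "big:" ^ s
```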

Rust: Niche-Filling
Rust performs a form of constructor unboxing by applying so-called "niche-filling" rules. The paradigmatic example of niche-filling is unboxing Option<A> for all types A whose representation contains a "niche" leaving one value available: typically, if A is a type of non-null pointers, the 0 word is never a valid A representation and can be used to represent the option's None.
Note that niche-filling in Rust is a qualitatively different optimization from constructor unboxing in OCaml, as it does not remove indirection through pointers. In Rust datatypes, sums (enums) are unboxed and pointer indirections are fully explicit through types such as Box<T>, and to our knowledge they are never optimized implicitly. Niche-filling shrinks the width of (value/unboxed) sum types by removing their tag byte; this is important when using (unboxed) arrays of enums, and because all values of an enum are padded to the width of its widest constructor. One could say that it is an untagging rather than an unboxing optimization; but a tag is a sort of box.
The current form of niche-filling was discussed in Beingessner [2015] and implemented in two steps in Burtescu [2017] and Benfield [2022]. We understand that it works as follows: among the constructors whose arguments have maximal total width, try to find one with a "niche", a set of impossible values, that is large enough to store the tag of the sum. Then check that the arguments of all other constructors can be placed in the remaining space, outside the niche.
We can formulate niche-filling in our framework. Head shapes are abstractions of sets of binary words, all of the same size. In particular, the complement of the head shape of a type is its set of niches. Rust allows unboxing exactly when all constructors of a type can be "unboxed" in our sense: they can be placed in non-overlapping subsets of binary words. Rust will implicitly change the placement of some constructor arguments to obtain more non-overlapping cases, which is an instance of the "unboxing by transformation" approach that we discussed in Section 5.3.
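As a toy illustration of this formulation (our own simplification, with head shapes reduced to finite sets of head words):

```ocaml
(* Disjointedness check on head shapes: constructors may share an
   unboxed representation only if their shapes are pairwise disjoint. *)
module IntSet = Set.Make (Int)

let disjoint a b = IntSet.is_empty (IntSet.inter a b)

let rec pairwise_disjoint = function
  | [] -> true
  | s :: rest -> List.for_all (disjoint s) rest && pairwise_disjoint rest

(* e.g. a "null pointer" shape {0} versus a (word-aligned) non-null
   pointer shape: the shapes are disjoint, so an Option-like None can
   fill the niche. *)
let null_shape = IntSet.singleton 0
let nonnull_shape = IntSet.of_list [8; 16; 24]
```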
Niche-filling has some limitations. (1) Filling a niche with a type B is a direct inclusion of the tag and values of B, possibly placed at non-zero offsets within A; it is not possible to transform the representation of B along the way. (For example, if B has two values, it is not possible to use the niches of two separate non-null pointer fields; there must be a full padding bit available somewhere.) This limitation is shared with our work. (We point it out because some people expect magic from Rust optimizations.) One reason is that this guarantees that projecting the B out of the value is cheap, instead of requiring a possibly-expensive transformation.
(2) Today niche-filling is implementation-defined in Rust, purely a compiler heuristic, and there is no interface for the user to control niche-filling behavior. There seem to be some rare cases where niche-filling in fact degrades performance by making pattern-matching slightly more complex; the only recourse for programmers is to use the repr(C) attribute to ask for an ABI-specified layout that disables all representation optimizations. (There is an instance of this workaround in the PR that generalized niche-filling, 94075, out of concern for a potential performance regression in the Rust compiler itself.) (3) Rust currently only allows niche-filling when all constructors can be unboxed at once. This is less expressive than our approach, where some constructors can be unboxed selectively while others are not. In the context of Rust, removing the tag bytes from all constructors is the very point of niche-filling, so unboxing (untagging, really) only certain constructors selectively is not interesting. This suggests however that niche-filling as currently conceived in the Rust community is a specialized form of constructor unboxing that is not necessarily suitable for other programming languages. Our presentation is more general.

Declarative Languages for Custom Datatype Representations
Hobbit. The experimental language Hobbit offered syntax to define algebraic "bit-datatypes" together with a mapping into a lower-level bit-level representation for FFI purposes, with support for constructing and matching the lower-level value directly. See in particular Diatchki, Jones and Leslie [2005], Section "2.2.4 Junk and Confusion?" on page 9, which refers to the coproduct property as a "no junk and confusion" guarantee, and mentions that their design in fact allows confusion: confusions are not forbidden by a static analysis, but a warning detects them in some cases.
One expressivity limitation is that "bit-datatypes" form a "closed universe": all types mentioned in a "bit-datatype" declaration must themselves be bit-data whose representation is available to the compiler. It does not seem possible to specify only a part of the variant value as bit-data and have arbitrary values of the language (that may not have a bit-data encoding) for the other arguments.
Cogent / Dargent. Dargent [Chen, Lafont, O'Connor, Keller, McLaughlin, Jackson and Rizkallah 2023] is a data layout language for Cogent, focused on providing flexible data representation choices for structure/product types. Users can define custom layout rules for record types; the system generates custom getters and setters for each field and verifies that each pair of a getter and setter satisfies the expected specification of being a lens. Cogent also has sum types, but Dargent does not allow unboxed layouts for them; it requires a bitset to store a unique tag for each case. Extending this formalization to richer representations of sum types would allow constructor unboxing. This could be done by generating a pattern-matching function and a family of constructor functions, and checking that those satisfy the expected properties of a coproduct: disjointedness.
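The getter/setter specification mentioned here is the standard lens specification; it can be stated generically as follows (the OCaml phrasing and names are ours, not Dargent's):

```ocaml
(* A getter/setter pair and the three lens laws that each generated
   pair would be verified against. *)
type ('s, 'a) lens = { get : 's -> 'a; set : 's -> 'a -> 's }

let lens_laws l s a a' =
  l.get (l.set s a) = a                  (* get after set returns what was set *)
  && l.set s (l.get s) = s               (* setting the current value is a no-op *)
  && l.set (l.set s a) a' = l.set s a'   (* the later set wins *)

(* example: a lens focusing the first component of a pair *)
let fst_lens = { get = fst; set = (fun (_, y) x -> (x, y)) }
```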
(There are many other data-description languages (DDLs), but they tend not to offer specific support for sum types / disjoint unions, especially not approaches to unbox them. The Dargent Related Work section mentions the prominent previous works in this area.)

Ribbit: Bit Stealing Made Legal. Simultaneously with our work, Baudon, Radanne and Gonnord [2023] propose a new domain-specific language to express data layouts. It is designed for efficiency rather than for interoperability with foreign data, supports flexible representation optimizations for sum types, and formalizes (on paper) the "no-confusion" correctness condition for sums.
Having a full layout DSL is more expressive than our approach. The authors had both OCaml and Rust in mind when elaborating their proposed DSL, called Ribbit, which is thus expressive enough to express constructor unboxing (Zarith is mentioned as a use-case) as well as Rust niche-filling optimizations. The language currently suffers from the same "closed universe" limitation as Hobbit, and needs to be extended to deal with abstract types or non-specialized type parameters, but those are not fundamental limitations of the approach.
It would be nice to restructure our work as elaborating the user-facing feature into an intermediate representation of data layouts, and Ribbit would be a good candidate for this. The authors discuss efficient compilation of pattern-matching involving datatypes with a custom layout; they moved from matrix-based pattern-matching compilation (as used in the OCaml compiler) to decision-tree-based compilation to handle more advanced, "irregular" representations. (The Ribbit prototype does not yet support real programs, so it is hard to draw performance conclusions; in contrast our prototype supports the full OCaml language.) So far our intuition is that constructor unboxing, and the various extensions we considered, are regular enough that matrix-based compilation can give good results and no invasive changes are required in the OCaml compiler. But this question deserves further study.

Deciding Termination of Recursive Rewrite Rules
Our termination-monitoring algorithm provides an alternative proof of decidability of the halting problem for the first-order recursive calculus. This result was well-known (inside a different scientific sub-community than ours), but we want to point out that we provide a new (to our knowledge) decision algorithm, and that having a simple yet efficient algorithm is important to our application. By "termination-monitoring" we mean that our algorithm only adds some bookkeeping information to computations that we have to perform in any case to compute a normal form.
The simplest proof of decidability of termination for the first-order recursive calculus that we have found (thanks to Pablo Barenbaum) is Khasidashvili [2020]. Seen as a decision algorithm, this proof is terrible: it performs an exponential computation of all normal forms of all type definitions.
On the other hand, other termination proofs exist that also scale to the higher-order cases, and rely on evaluating the program in a semantic domain where base types are interpreted as the two-valued Scott domain {⊥, ⊤}, where ⊥ represents potentially-non-terminating terms and ⊤ represents terminating terms. Terms at more complex types are interpreted in more complex domains, but (in the absence of recursive types) these remain finite, so in particular recursive definitions, interpreted as fixpoints, can be computed by finite iteration. This approach in particular underlies the termination proofs in Plotkin [2022] and Salvati and Walukiewicz [2015]. This argument has a computational feel, but naively computing a fixpoint for each recursive definition in the type environment is still more work than we want to perform. It is possible that a clever on-demand computation strategy for those fixpoints would correspond to our algorithm.
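As a concrete (and deliberately coarse) instance of this approach, the two-valued interpretation can be computed for first-order definitions by Kleene iteration. The term syntax, the names, and the simplified rule "a call terminates only if the callee and all its arguments do" are our own:

```ocaml
(* {⊥, ⊤} interpretation as booleans: true = "surely normalizes".
   Start from false (⊥) everywhere and iterate to the least fixpoint. *)
type term = Param | Call of string * term list

let rec terminates env = function
  | Param -> true
  | Call (f, args) ->
      List.for_all (terminates env) args
      && (try List.assoc f env with Not_found -> false)

let fixpoint defs =
  let step env = List.map (fun (f, body) -> (f, terminates env body)) defs in
  let rec iter env =
    let env' = step env in
    if env' = env then env else iter env'
  in
  iter (List.map (fun (f, _) -> (f, false)) defs)

(* id(x) = x and nest(x) = id(id(x)) reach top after a few iterations;
   loop(x) = loop(x) stays at bottom. *)
let defs =
  [ ("id", Param);
    ("nest", Call ("id", [Call ("id", [Param])]));
    ("loop", Call ("loop", [Param])) ]
```

Naively, one such fixpoint computation is performed per recursive definition group, which is exactly the cost the text above wants to avoid.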

ACKNOWLEDGMENTS
Jeremy Yallop proposed the general idea of this feature in March 2020 [Yallop 2020]. There are various differences between our implementation and Jeremy Yallop's initial proposal, mostly in the direction of simplifying the feature by keeping independent aspects for later, and our formal developments were a substantial new effort. Jeremy Yallop also suggested Zarith as a promising application of the feature.
Nicolas Chataing and Gabriel Scherer worked on this topic together in spring-summer 2021, during a master's internship of Nicolas Chataing supervised by Gabriel Scherer. Most of the ideas presented in Sections 3, 4, 5 were obtained during this period, along with the termination-checking algorithm, but without a convincing proof of termination. An important part of the work not detailed in this article was the implementation of a prototype within the OCaml compiler, which required solving difficult software engineering questions about dependency cycles between various processes in the compiler (checking type declarations, managing the typing environment, computing type properties for optimization, accessing type properties during compilation). Around and after the end of the internship, Gabriel Scherer implemented the Zarith case study, implemented small extensions, and set out to prove termination, that is, soundness of the on-the-fly termination checking algorithm. Gabriel Scherer also did the writing, and intends to work on upstreaming the feature.
Stephen Dolan entered the picture with high-quality remarks on the work, in particular identifying the notion of benign cycles, and remarking that the termination algorithm is similar to the cpp specification, with precise pointers to the work of Dave Prosser via Diomidis Spinellis that Gabriel Scherer used to propose a detailed comparison in Section 7.
The most important contribution of Stephen Dolan is probably that he killed the first four or five attempts at a termination proof.Once we collectively got frustrated with Stephen Dolan's repeated ability to strike down proposed termination arguments, Irène Waldspurger played the key role of suggesting a correct argument: one has to use multisets of multisets of nodes, instead of trying to make do with multisets of nodes.
Jacques-Henri Jourdan remarked on a relation to the Coq value representation optimization in native_compute, Guillaume Melquiond provided extra information and Guillaume Munch-Maccagnoni tried some early experiments.We have yet to investigate this idea in full, but it may provide a nice simplification of this low-level aspect of native_compute.
Adrien Guatto greatly improved the presentation of a previous version of this work by a healthy volume of constructive criticism. Anonymous POPL reviewers also provided excellent feedback and engaging questions; in particular Section 5.4 is due to their curiosity.
Finally, the types mailing-list has helped tremendously in finding previous work on decidability of recursive rewrite systems (thread).
Definition A.4 (p, u ∈_p t, Subtrees(t), p.p′, p′ ≥ p). We define a path p, the relation u ∈_p t, which holds when the path p relates the sub-tree u to a parent tree t, and the multiset Subtrees(t) of sub-trees of t.
Finally, we write that p′ is an extension of p, using the notation p′ ≥ p, if and only if p′ is of the form p′′.p.
t[p ← u] is the tree where the p-rooted sub-tree of t is replaced by u. We use n-ary trees to model our annotated λ-terms.
Example A.6 (Trees of annotated terms). Let us consider our annotated terms t̄ as n-ary trees tree(t̄); these trees belong to the set Trees(T, ∅) for a set of tree node constructors T defined by the following grammar. For example, an annotated term whose head application (of trace τ) has two arguments, the first itself an application (of trace τ′) without arguments and the second a variable x, becomes the tree node(@τ, 0 ↦ node(@τ′, ∅), 1 ↦ node(x, ∅)).

A.1.2 Tree Expansion. We now define an expansion process for our n-ary trees that generalizes our reduction of terms. Informally, we consider the following expansion process:
• A node (in arbitrary position) in the tree is chosen to be expanded. We call the sub-tree rooted at this node the expanded sub-tree.
• The expansion process replaces this expanded sub-tree by another sub-tree containing zero, one or several new nodes, that do not correspond to nodes from the previous tree, and an arbitrary number of copies of some sub-trees of the expanded sub-tree. A sub-tree of the expanded sub-tree can disappear during an expansion, or be duplicated into one or several copies, placed inside the new sub-tree that replaces the expanded sub-tree.
Formally:
Definition A.7 (Tree expansion). A head tree expansion t →_head t′ replaces a tree t by another tree t′, some sub-trees of which are strict sub-trees of t (not t itself).
A tree expansion t → t′ is a head tree expansion in a sub-tree of t.

Example A.8. Figure 1 shows a tree on the left, and three possible expansions of it on the right, all expanding the same node; new nodes are shown in red. In the first expansion, the expanded node is replaced by a single new node, which carries the children of the expanded node (they were neither duplicated nor erased). In the second expansion, there is no new node: the expanded node has been replaced by one of its sub-trees, and the rest was erased. In the third expansion, there are two new nodes; two copies of one sub-tree of the expanded node and one additional copy of another sub-tree are placed under them, and the expanded node itself is erased.

Fact 6. The tree expansion process generalizes β-reduction of our annotated first-order terms: for any definition environment, each reduction step on annotated terms corresponds to a tree expansion on the corresponding trees. In particular, any infinite reduction sequence of annotated terms could be mapped to an infinite sequence of tree expansions on the corresponding trees.
Remark. β-reductions in our annotated terms replace variables by function arguments. In terms of trees, this corresponds to a subset of expansions where the erased/duplicated sub-trees are always exactly the immediate children of the expanded node, not sub-trees of arbitrary depth. We believe that the ability to manipulate arbitrary sub-trees, which does not make the proof more difficult, is in fact useful to express expansion processes outside our model. In particular, it would be useful to model type constraints in the OCaml programming language: type 'a t = ('b * bool) constraint 'a = 'b * 'b * int, which effectively allow one to designate arbitrary sub-expressions of the type parameters and use them directly in an abbreviation definition. In rewriting systems it is also not uncommon to have complex left-hand sides that match on their arguments in depth.

Definition A.9 (Trees(N)). Variables V were used to define expansion by substitution. In the rest of this document we will mostly work with closed n-ary trees in Trees(N, ∅), which we write as just Trees(N).
Definition A.10 (constr(t)). For t ∈ Trees(N) we define constr(t) as the head constructor of t (which cannot be a variable, as t is in Trees(N, ∅)): constr(node(c, (t_i)_i)) is defined as c.

We now define a measure Node(t) : Subtrees(t) → (M(N), <_M(N)) on the nodes of t, by considering for each node the path from this node (included) to the root of the tree, seen as a multiset of nodes.
Formally, we define our measure Node(t)(u) into M(N), for each node u ∈_p t, as the multiset sum of the measure of the constructor of u and a measure Path(t)(p), also into M(N), defined by induction on the predicate u ∈_p t. Note that the measure of the root is not taken into account by our path measure: we define Path(t)(t ∈_∅ t) as ∅ rather than {{μ_N(constr(t))}}. This is essential for the measure to be compatible with path concatenation. From this measure on nodes Node(t) : Subtrees(t) → (M(N), <_M(N)) we now define a measure on trees Tree : Trees(N) → (M(M(N)), <_M(M(N))) by considering for each tree the multiset of measures of its nodes. Our trees are thus measured as multisets of multisets of node constructors, themselves equipped with a well-founded partial order.

Theorem. If N has a measure μ_N, then the tree expansions that are measured for μ_N are strictly decreasing for the order Tree. In particular, there are no infinite sequences of measured expansions.

Proof. A measured expansion t → t′ is necessarily of the following form: a sub-tree u at a path p in t is replaced by a new sub-tree, whose new constructors are of strictly smaller measure than constr(u). We have to show that Tree(t) > Tree(t′).
Nodes inside or outside p. We can partition the paths valid inside t or t′ into two disjoint sets:
• the paths to nodes inside p, that is the paths p′ ≥ p that extend p; they denote a sub-tree of u (in t) or a sub-tree of its replacement (in t′), and
• the paths to nodes outside p, that is the paths that do not extend p; they reach an unrelated part of the tree.
The measure Node(t)(u′) of a sub-tree u′ of t depends only on the constructors along the path from the root to the node u′ included. In particular, if a node u′ is at a path that does not extend p, it is also present at the same path as a sub-tree of t′, where it has the same measure, as t and t′ only differ inside p.
Recall that our multiset order is defined by inequalities M + M1 <_M(N) M + M2 with ⌊M1⌋_set <_P(N) ⌊M2⌋_set. We use the measures of the nodes outside p as the common multiset M in that definition. It remains to prove the set ordering between the measures of the nodes inside p, playing the role of M1 and M2 in the definition.

A classification of paths under the replacement. For each sub-tree under p in t′, that is each sub-tree of the replacement, we have to show that there exists a sub-tree under p in t, that is a sub-tree of u, of strictly larger measure.
Let us remark (and prove) that a sub-tree u′ of t′ has a path of the form q_sub.q_new.p, with, reading right to left from the root to the leaves:
• the path p from the expanded position to the root,
• a path q_new inside zero, one or several "new" nodes introduced by the expansion,
• a path q_sub, possibly empty, inside a copy of a strict sub-tree of u.
When the path does indeed reach inside a sub-tree copied from u, we furthermore define a path q_old from u to the first old node copied; otherwise q_old is defined to be empty.
For example, in Figure 1, some paths into the nodes of the third expansion are as follows; for readability we write paths by marking the constructor of the node rather than its child index. The presence of p is by hypothesis: we are classifying the paths p′ ≥ p, which necessarily have p′ = q.p for some q that is a path inside the replacement sub-tree. Formally, we just defined how to split this sub-path q into a pair (q_new, q_sub), by a function split_join defined by induction on q and inversion on the definition of the replacement. This decomposition exhibits a sub-tree of the replacement skeleton such that one of the following two cases holds:
(1) Either this sub-tree is not of the form var(x): the path stops inside the new nodes, before reaching a node var(x).
In this case the node reached is a "new node" of the expansion; in particular, we know that the expansion generated at least one new node, otherwise the replacement would be of the form var(_). We have q_sub = ∅ by definition of split_join, and we additionally define q_old as ∅: there is no path to a corresponding "old node" in u.
(2) Or the path reaches a node var(x) of the replacement skeleton: q_new is the path between that node and the expanded position, and the rest of the path is a possibly empty path q_sub reaching inside the strict sub-tree of u bound to x. We then define q_old as the path of that strict sub-tree in u, which is non-empty since the sub-tree is strict.
We will come back to the distinction between these two cases (1) and (2) in the rest of the proof.
We shall now conclude the proof by finding a sub-tree of t under p (thus inside u) whose measure is strictly larger than the measure of our sub-tree u′ at path q_sub.q_new.p inside t′.
Larger path. Our candidate sub-tree, which we name u″, is at the path q_sub.q_old.p in t.
Case (1). In case (1) above (q_sub = q_old = ∅), u′ is a new node, and u″ is the sub-tree at path ∅.∅.p, that is p: the node u″ is exactly u. We have to show Node(t)(u) > Node(t′)(u′). The expansion from t to t′ is measured, so we have μ_N(constr(u)) > μ_N(constr(u′)), which concludes the proof in this case.
Case (2). In case (2) above, u′ is a sub-tree of a copy of a strict sub-tree of u generated by the expansion; q_sub is the path of u′ inside that copy, and q_old is the (non-empty) path of the copied sub-tree inside u. To conclude, we have to show Path(t)(q_old) > Path(t′)(q_new). This is true because q_old is non-empty, so Path(t)(q_old) is a sum that contains in particular the measure of its root μ_N(constr(u)), which is strictly larger than the measure of all the new nodes along q_new, as the expansion is measured.
Corollary A.15. Annotated reduction is strongly normalizing.
We have proved that our on-the-fly termination checking strategy is sound, in the sense that it does enforce termination: each reduction sequence terminates in a finite number of steps, either because it reaches a normal form (also normal as an un-annotated term), or because it is blocked, with every reducible position trying to reduce a function already in its trace.
A.4 On-The-Fly Termination: Completeness
In this section we establish completeness of our on-the-fly termination-checking strategy, in the sense that it does not prevent any normalizing term from reaching its normal form.

Theorem A.16 (Weak completeness). If a reachable annotated program let rec in ¯ contains a blocked redex (( ¯ ′ ) ) @ with ∈ , then its unannotated erasure ⌊ ¯ ⌋ admits an infinite reduction sequence.

Proof. To make notations lighter, we will assume that takes a single argument. This does not change the proof argument in the least.
Our proof proceeds in three steps.
(1) We work backwards from the blocked redex ( ′ ) with a trace of the form 1 , , 2 : we rewind the trace to exhibit a previous redex ( ) with trace 1 that reduces in at least one step to our term ′ [ ( ′ )]: ( ) [ ← ] @ , ★ ′ [ ( ′ )]. At this point we can erase the trace annotations and consider the corresponding unannotated reduction sequence.
(2) We generalize this reduction sequence to a valid, non-empty reduction sequence that does not depend on , with instead a fresh variable : ( ) + ′′ [ ( ′′ )].
(3) At this point we can build an infinite reduction on unannotated terms by repeating this generalized reduction sequence inside its own conclusion.
Step 1: rewinding the trace. We have a blocked redex of the form ( ′ ) @ ( 1 , , 2 ) that is reachable, that is (Definition 6.2), that appears in a reduction sequence starting from a term [∅] @ ∅ with empty traces; we consider the recursive environment fixed once and for all in this proof.
We claim that if a reachable term ¯ has a trace of the form , ′ for some function call ′ (let us assume ′ unary again for simplicity), then it necessarily arises from the reduction of a call of the function ′ , followed possibly by some further reduction steps. More precisely, there exists a non-empty reduction sequence: ′ ( ) @ ′ [ ← ] @ , ′ ★ [ ¯ ]. We show this by induction on the reduction sequence reaching ¯ (which must be non-empty, as it has a non-empty trace), doing a case analysis on the reductions that can result in a term of the form [ ¯ @ ( , ′ )]:
• If the last reduction was precisely the call to ′ whose expansion generated ¯ , we are done.
• If the last reduction occurred in a strict subterm of ¯ , then its source term ¯ ′ also has trace , ′ ; we use our induction hypothesis to get a reduction from a ′ ( ) @ to [ ¯ ′ ], and complete with the last reduction step.
• If the last reduction step occurred in the context outside ¯ , without touching ¯ , we forget it and apply the induction hypothesis to the strictly smaller (but still non-empty) reachable sequence.
• Finally, if the last reduction step substituted ¯ (unchanged) from a different position in the term, our induction hypothesis immediately gives us a suitable reduction sequence reaching this earlier occurrence of ¯ , and we are done.
Now that we know how to rewind the last element of the trace, we can iterate this process on the trace 1 , , 2 of our blocked redex ( ′ ), once per element of 2 and once for . We thus get a non-empty, -repeating reduction sequence ( ) @ 1 [ ← ] @ 1 , ★ ′ [ ( ′ ) @ ( 1 , , 2 )].
Let us drop the annotations now. We crucially used our trace annotations to build this -repeating reduction sequence; in the rest of the proof we will not use the annotations anymore, so we move to the lighter world of unannotated terms and redefine accordingly.
Step 2: generalizing the -repeating sequence. We claim that our -repeating reduction sequence (now unannotated), starting in ( ), can be generalized into a sequence starting in ( ) for any fresh variable .
This proof step relies on the first-order restriction of our -calculus: it works easily because substitution inside the body of cannot create new redexes. After the substitution, we can reduce redexes in copies of (if any), or reduce redexes already present in the body of (possibly with a variable now replaced by ), but substituting does not make new function calls possible.
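The first-order restriction can be checked mechanically on a toy syntax (our own illustration, with made-up names): a call head is always a function name, never a variable, so substitution may duplicate existing calls but can never create a new call head.

```ocaml
(* First-order terms: the head of a Call is a name, never a variable. *)
type t = Var of string | Call of string * t list | Constr of string * t list

(* All function names called anywhere in a term. *)
let rec calls = function
  | Var _ -> []
  | Call (f, ts) -> f :: List.concat_map calls ts
  | Constr (_, ts) -> List.concat_map calls ts

(* Capture-free first-order substitution. *)
let rec subst x u = function
  | Var y -> if y = x then u else Var y
  | Call (f, ts) -> Call (f, List.map (subst x u) ts)
  | Constr (c, ts) -> Constr (c, List.map (subst x u) ts)

let () =
  (* Every call in body[x := arg] already occurs in body or in arg:
     substitution introduces no new redex. *)
  let body = Constr ("S", [Call ("g", [Var "x"])]) in
  let arg = Call ("h", [Var "z"]) in
  assert (List.for_all
            (fun f -> List.mem f (calls body @ calls arg))
            (calls (subst "x" arg body)))
```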
In consequence, we can systematically replace by a fresh variable in our reduction sequence, dropping any reduction step that is internal to . Notice that the first step in our sequence ( ) [ ← ] ★ ′ [ ( ′ )] is not internal to , so we know that it remains in the filtered reduction sequence; we need the sequence to remain non-empty. Modulo -renaming, we will assume that our fresh variable is also the name of the bound variable in the declaration of . We get a filtered sequence of the form: ( ) ★ ′′ [ ( ′′ )].
Step 3: profit. Our generalized -repeating reduction sequence starts from ( ), for a fresh variable , and, in at least one step, reduces to a term that contains a call to again. We can immediately build an infinite reduction sequence by repeatedly composing our generalized sequence with itself.
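A concrete instance of Step 3 (our own toy example, not taken from the paper): take f(x) = g(f(x)), whose generalized sequence is f(u) reducing in one step to g(f(u)). Because the conclusion still contains a call to f, the sequence can be composed with itself inside its own conclusion arbitrarily many times, which is exactly the infinite reduction.

```ocaml
(* Unary terms are enough for this illustration. *)
type t = Var of string | Call of string * t | Constr of string * t

(* [nest n u] is the term reached after composing the generalized
   sequence n times starting from f(u): g(g(...g(f(u)))), in >= n steps. *)
let rec nest n u =
  if n = 0 then Call ("f", u) else Constr ("g", nest (n - 1) u)

let rec has_call_to f = function
  | Var _ -> false
  | Call (g, u) -> g = f || has_call_to f u
  | Constr (_, u) -> has_call_to f u

let () =
  (* The invariant enabling one more composition: a call to f survives
     every iteration, so reduction sequences of unbounded length exist. *)
  assert (has_call_to "f" (nest 0 (Var "y")));
  assert (has_call_to "f" (nest 100 (Var "y")))
```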
What we have shown so far is that blocked redexes can generate infinite reduction sequences. But our reduction is non-deterministic, so some terms have both terminating and non-terminating reduction sequences: they are normalizing but not strongly normalizing. Using standard(ization) results on the -calculus, we believe that we can get a stronger completeness result. The proof below is a bit sketchy, so we degraded our claim from Theorem to Conjecture.

Conjecture A.17 (Strong completeness). If an annotated program let rec in ¯ has a weakly normalizing erasure, that is, if ⌊ ⌋ has some reduction path to a normal form, then ¯ has an annotated reduction sequence to a (non-blocked) -normal form.

Proof. A standard fact about the -calculus, which applies to our setting, is that the leftmost-outermost reduction strategy is minimally diverging; in particular, it reaches the normal form of any weakly normalizing term.
Our argument relies on studying the annotated leftmost-outermost reduction sequence of the annotated term ¯ (under the definitions ). If this reduction sequence does not reach a blocked redex in leftmost-outermost position, we are done: it reaches a -normal form of ⌊ ¯ ⌋.
We now argue that ¯ cannot reach a blocked redex by following the leftmost-outermost strategy. If ¯ could reach a blocked redex, then the proof of the previous theorem would provide an infinite reduction sequence from that point. This infinite reduction sequence is obtained by repeating a suffix of the reduction sequence to this blocked redex, which follows the leftmost-outermost strategy.
In consequence, the infinite reduction sequence is itself a leftmost-outermost sequence. This is in contradiction with the assumption that ⌊ ⌋ is weakly normalizing.
Finally, it is interesting to study how those proofs break down when we move from our first-order calculus to its closed-higher-order extension presented in Section 7.4. In that section, we show that completeness does not hold in the closed-higher-order fragment, with the following counterexample:
let rec f(p,q) = p(f(q,q)) and id(x) = x and stop(x) = done
in f(id, stop)
(* id(f(stop,stop))  f(stop,stop)  stop(f(stop,stop))  done *)
The first blocked redex in this reduction sequence is the application of f in the second term id(f(stop,stop)) of the sequence. Using the reasoning of our proof of weak completeness, we can "generalize" the -repeating sequence to a sequence of the form f(k,stop) k(f(stop,stop)) for any k, whose instantiations can be repeated into an infinite reduction sequence: f(id,stop) id(f(stop,stop)) id(stop(f(stop,stop))) id(stop(stop(...))). On the other hand, this first occurrence of a blocked redex is not in head position for the leftmost-outermost reduction strategy, so our proof of the strong completeness theorem does not apply.
Another blocked redex does appear in head position in the third term of the reduction sequence, f(stop,stop). But then the reasoning of our proof of weak completeness cannot be applied anymore: the reduction sequence f(id,stop) id(f(stop,stop)) f(stop,stop) crucially uses a parameter of f to create a new redex, and this sequence cannot be generalized by replacing id with an arbitrary parameter k.
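The key difference with the first-order case can again be shown on a toy syntax (our own illustration): in the closed-higher-order fragment a parameter may occur in call position, so substitution can create a new redex, exactly as substituting id for p in p(f(q,q)) does in the counterexample above.

```ocaml
(* Higher-order application: the head may now be a variable. *)
type t = Var of string | Name of string | App of t * t list

let rec subst x u = function
  | Var y -> if y = x then u else Var y
  | Name f -> Name f
  | App (h, ts) -> App (subst x u h, List.map (subst x u) ts)

(* A redex is an application whose head is a known function name. *)
let is_redex = function App (Name _, _) -> true | _ -> false

let () =
  (* The body p(f(q,q)) of f is inert at the head: p is a variable. *)
  let body = App (Var "p", [App (Name "f", [Var "q"; Var "q"])]) in
  assert (not (is_redex body));
  (* Substituting the function id for p creates a brand-new redex,
     which is impossible in the first-order fragment. *)
  assert (is_redex (subst "p" (Name "id") body))
```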
A.5 Proof of Correctness for the Closed-Higher-Order Fragment
This is a proof missing from Section 7.4.

Theorem. Our termination-monitoring algorithm remains correct for this closed-higher-order language: the annotated language is strongly normalizing.

Proof. The proof uses the same machinery as the first-order proof; see Appendix A. We translate our terms to trees with measured expansions. We can then immediately reuse our result that measured expansions are strongly normalizing.