MacoCaml: Staging Composable and Compilable Macros

We introduce MacoCaml, a new design and implementation of compile-time code generation for the OCaml language. MacoCaml features a novel combination of macros with phase separation and quotation-based staging, where macros are considered as compile-time bindings, expressions cross evaluation phases using staging annotations, and compile-time evaluation happens inside top-level splices. We provide a theoretical foundation for MacoCaml by formalizing a typed source calculus, maco, that supports interleaving typing and compile-time code generation, references with explicit compile-time heaps, and modules. We study various crucial properties including soundness and phase distinction. We have implemented MacoCaml in the OCaml compiler, and ported two substantial existing libraries to validate our implementation.


INTRODUCTION
Program generation is a powerful and expressive approach to eliminating abstraction overhead and improving program performance, which has been studied and implemented in a variety of languages in different forms, such as C++ templates [Abrahams and Gurtovoy 2004], macros [Burmako 2013; Clinger and Rees 1991; Flatt 2002; Kohlbecker et al. 1986], or multi-stage programming for compile-time [Kovács 2022; Sheard and Jones 2002; Xie et al. 2022] and runtime code generation [Calcagno et al. 2003b; Kiselyov 2014; Rompf and Odersky 2010; Taha et al. 1998].
This paper presents the design and implementation of MacoCaml, an extension of OCaml with support for compile-time code generation. We offer the following contributions:

• A unifying framework for phases and staging: In the design space of compile-time code generation, there has been work supporting macros with phase separation [Flatt 2002] and work supporting quotation-based staging [Sheard and Jones 2002], but the two notions are often considered separate. We present a novel, simple, yet effective combination of the two techniques, where macros are considered as compile-time bindings, expressions cross evaluation phases using staging annotations, and macro invocation is part of the compile-time evaluation implied by top-level splices (§2.1 and 2.2). This way, we unify macros and staging in a single framework, enabling users to apply the unified abstraction to build reusable, composable, and maintainable programs. Moreover, we show that staged macros integrate smoothly with features found in real-world languages, including module imports (§2.3) and computations with side effects (§2.4).

• A comprehensive formalism of a feature-rich macro calculus: We provide a theoretical foundation for MacoCaml by formalizing a typed source calculus (§3). Staging calculi in the literature often focus on a minimal set of features, but building MacoCaml on the OCaml language forces us to confront several practical issues: interleaving of typing and compile-time code generation, references with explicit compile-time heaps, and modules. We believe ours is the first typed formalism to address such a rich staging feature set. The calculus describes the essence of the macro system in OCaml, laying the foundation for a language feature to be integrated into a full-scale language, and allowing for further extensions that use or build on top of the OCaml macro system.
• Soundness and phase distinction for staged macros: To model compile-time evaluation, we formalize a core calculus as the compilation target for the source calculus, and present a type-directed elaboration from source to core (§4). Separating the source and the core calculi allows us to distinguish modules from their compiled forms, making the phase separation explicit. We establish key properties of our design, including 1) type soundness of the core calculus (§4.4), 2) soundness of the elaboration from source to core (§4.5), and 3) phase distinction (§4.5). Thus we formally establish properties essential for safe and modular programming: well-typed source programs generate well-typed core programs. Moreover, the theorems are, to the best of our knowledge, the first to formally reason about macros with compile-time heaps and to prove that compile-time computations do not interfere with runtime computations. Specifically, we show that compile-time heaps can be discarded after compilation, and compile-time computations can be erased before runtime evaluation.
• A working implementation for OCaml: We provide an implementation of MacoCaml in the OCaml compiler, following the key ideas in our calculus (§5). The implementation is detailed, and the modified compiler is available as an artifact [Xie et al. 2023]. All examples presented in the paper work in the compiler. To validate our implementation, we have ported two substantial existing libraries: Strymonas for stream fusion [Kiselyov et al. 2017], and the OCaml library for typed formatting [Vaugon 2013]. The results show that MacoCaml works in practice and can be applied to large-scale implementations.
We discuss related work in the rich design space of staging and macros in §6, and conclude and discuss future work in §7. Our formalism is detailed, and some rules are elided for space reasons.
The complete set of rules and all proofs of stated theorems are provided in the appendix. While this work focuses on OCaml, we believe this study can help with the design and formalism of staging or macros in other programming languages.

As a running example, consider a recursive function power, where power n x computes x to the power n. For example, power 5 2 returns 32, as expected. However, while this function can take an arbitrary integer n, the abstraction comes with runtime performance overhead for each recursive call.
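The recursive power function referenced here is not shown in this excerpt; a standard definition, consistent with the staged version discussed below (using the names power, n, and x from the surrounding text), would be:

```ocaml
(* power n x computes x to the power n, with one recursive call per step *)
let rec power n x =
  if n = 0 then 1 else x * power (n - 1) x

let thirty_two = power 5 2  (* 32 *)
```

Each recursive call incurs a function-call at runtime, which is exactly the overhead the macro version eliminates.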
Macros provide an expressive way to balance abstraction and efficiency. Below we define power as a macro in MacoCaml, assuming that n is statically known:

  macro rec mpower' n x =   (* int -> int expr -> int expr *)
    if n = 0 then << 1 >>
    else << $x * $(mpower' (n - 1) x) >>

  macro mpower n =          (* int -> (int -> int) expr *)
    << fun x → $(mpower' n <<x>>) >>

In MacoCaml, macros are always functions, which are the most common case in practice. The macro definitions make use of staging annotations: <<e>> is a quotation that delays an expression's computation by turning it into its code representation, while $e is a splice that triggers evaluation of a code representation. From a typing perspective, if e : t, then <<e>> : t expr; conversely, if e : t expr, then $e : t. Splicing a call to the macro generates at compile time a specialized power function with n equal to 5:

  let power5 = $(mpower 5)  (* fun x -> x * (x * (x * (x * (x * 1)))) *)

Calling power5 2 produces 32, the same result as before, but now with less runtime overhead, as all calls to mpower have been unrolled and inlined.
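The expansion shown in the comment for power5 can be written out and checked in plain OCaml, copying the unrolled body by hand:

```ocaml
(* The code that $(mpower 5) generates, written by hand: no recursion remains *)
let power5 = fun x -> x * (x * (x * (x * (x * 1))))

let result = power5 2  (* 32 *)
```

The specialized function performs only multiplications; all the recursive structure of the generator has disappeared.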

Evaluation Phases
The power example illustrates the first key design of our system: we divide program evaluation into two phases, where let definitions are runtime bindings and macros are compile-time bindings. Staging annotations are used to cross phases. The design provides a novel view of macros, unifying macros and staging in a single framework.

Call relation. Fig. 1 depicts the call relation between let definitions and macros, where ⟨⟩ is quoting and $ is splicing. As we have seen, we can splice a macro application in a let definition, as in the definition of power5. Conversely, we can quote the application of a let-defined function (and splice it elsewhere in the program), delaying its computation to runtime, e.g. macro delayed32 () = <<power5 2>>.
Leveled bindings and well-stagedness. Formally, we manage definitions using the notion of a level. The level of an expression is defined as the integer given by the number of quotes surrounding it minus the number of splices: quotation increases the level of an expression, while splicing decreases it. Intuitively speaking, levels indicate the evaluation phase of expressions: expressions at negative levels are compile-time expressions, and expressions at level 0 are runtime expressions. MacoCaml further features leveled bindings: let definitions are bindings at level 0, while macros are bindings at level −1. Well-stagedness specifies that a definition can only be used at the level at which it is defined. Specifically, the right-hand side of a let definition is type-checked at level 0, and the definition itself can be used at level 0. Similarly, the right-hand side of a macro definition is type-checked at level −1, and the macro itself can be called at level −1. This explains why we can splice macros inside let definitions, or quote let definitions inside macros. This way, we unify macros and staging in a single framework, and all bindings simply follow the same requirement specified by well-stagedness.
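The level bookkeeping described above can be illustrated with a small checker in plain OCaml. This is our own sketch of the rule "a binding is used exactly at the level at which it was bound", not MacoCaml's actual type checker; all names here are illustrative:

```ocaml
(* A toy expression language with quoting and splicing *)
type expr =
  | Var of string          (* use of a let- or macro-bound name *)
  | Quote of expr          (* << e >> : raises the level inside *)
  | Splice of expr         (* $ e     : lowers the level inside *)
  | App of expr * expr

(* env maps each name to its binding level: let = 0, macro = -1 *)
let rec well_staged env level = function
  | Var x ->
      (match List.assoc_opt x env with
       | Some l -> l = level
       | None -> false)
  | Quote e -> well_staged env (level + 1) e
  | Splice e -> well_staged env (level - 1) e
  | App (f, a) -> well_staged env level f && well_staged env level a

let env = [ ("power5", 0); ("mpower", -1) ]

(* let ok = $(mpower ...) : mpower sits under one splice, so it is checked
   at level -1, matching its binding level *)
let ok = well_staged env 0 (Splice (Var "mpower"))

(* let err = mpower ... : mpower used directly at level 0 is ill-staged *)
let err = well_staged env 0 (Var "mpower")
```

Running the checker, ok is true and err is false, matching the err example discussed below.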
Macros and compile-time evaluation. Now the question is: where exactly does a compile-time computation happen? In MacoCaml, compile-time computation happens inside top-level splices, namely, splices without surrounding quotations. In the example for power, the top-level splice $(mpower 5) generates the definition of power5 during compilation. In other words, macro applications (such as mpower 5) alone do not force any compile-time evaluation; rather, macro invocation is part of compile-time evaluation. We explain the key idea with the following definitions:

  let power5 = $(mpower 5)     (* well-typed and mpower expanded *)
  let err = mpower 5           (* error: mpower is defined at level -1 but called at 0 *)
  macro mpower5 () = mpower 5  (* well-typed and mpower not expanded *)
  macro merr () = $(mpower 5)  (* error: mpower is defined at level -1 but called at -2 *)

Recall that macros are defined at level −1, allowing macros to be spliced inside top-level splices in let definitions or called inside macro definitions. Here, only power5 expands mpower 5, as it is inside a top-level splice; mpower5 is well-typed but does not expand mpower; err and merr are ill-staged.
The examples demonstrate that MacoCaml provides a systematic approach to combining macros with staging. This distinguishes MacoCaml from macro systems where macro invocation is itself a compile-time computation, in which case err and mpower5 might expand mpower.
Interleaving typing and compile-time computations. In MacoCaml, compile-time evaluation happens during typing. As we will see, interleaving typing and compile-time evaluation poses challenges to the system's type safety. A similar design is used in Template Haskell [Sheard and Jones 2002], but unlike Template Haskell, where the result of splicing a code value may not type-check, MacoCaml offers a static guarantee: we prove that well-typed source programs always generate well-typed core programs.
Lastly, we note that since all top-level splices are evaluated during typing, compiled modules no longer contain top-level splices, though they may still contain splices inside quotations, as in the definition of mpower. Moreover, as macros are functions, after compilation macros are always values. We will revisit this observation when we discuss module imports (§2.3 and 2.4).

Modules and Imports
In the power example, all definitions were defined in a single module. In practice, programmers organize programs into separate modules for maintainability, and import modules to use their definitions. However, this separation means that the levels of definitions are determined by the imported modules that they inhabit. For example, importing a let definition means that the definition can only be used at level 0 but not at -1. Changing the level involves changing the let definition to a macro, which requires the source of the imported module to be under our control, which may not be the case. Further, we may wish to import and use a definition at both compile-time and runtime, and it would be tedious and error-prone to give essentially the same definition at both phases.

Fig. 2. Left: The call relation between definitions across modules. The module in the middle is the one currently being defined; the one on the left is imported at level -1, and the one on the right is imported at level 0. Definitions in the gray box are compile-time computations. Right: The call relation between definitions across levels.
As an example, consider a module Term. The type term defines a datatype for lambda terms. The definition id defines an integer reference, initialized to 0, and every call to fresh generates a fresh variable. Suppose we want to import this module to define a macro that builds a representation of the function λx. λy. x y, calling fresh twice:

  macro apply () =
    let x = fresh () in
    let y = fresh () in
    Lam (x, Lam (y, App (Var x, Var y)))

As written, this program is ill-staged, since the imported function fresh is a let definition defined at level 0, but used in the definition of the macro apply at level -1.

The module import system. To solve this issue, we allow importing modules at different levels, following Flatt [2002]. Specifically, when a module is imported at runtime (or level 0), all definitions are at the same level as if they were defined in the current module. Namely, let definitions are imported at level 0 and macros are imported at level -1. More interestingly, when a module is imported at compile-time (or level -1), all definitions have their levels shifted by -1. That is, its let definitions are imported at level -1, while its macros are imported at level -2.
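The body of Term is described but not shown in this excerpt; one plausible plain-OCaml reconstruction, under the assumption that variables are represented by integers (which matches fresh "returning 1" as discussed in §2.4), is:

```ocaml
module Term = struct
  (* Lambda terms; variable names are integers (an assumption of this sketch) *)
  type term =
    | Var of int
    | Lam of int * term
    | App of term * term

  let id = ref 0                      (* shared counter for fresh names *)
  let fresh () = id := !id + 1; !id   (* each call yields the next name *)
end

let x = Term.fresh ()  (* 1 *)
let y = Term.fresh ()  (* 2: the same reference id is updated in between *)
```

Because both calls go through the single reference id, successive calls yield distinct names.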
Therefore, importing Term at level -1 allows apply to call fresh. In MacoCaml, we write:

  module DT = Term   (* Term is imported at level 0 as DT *)
  module ~ST = Term  (* Term is imported at level -1 as ~ST *)

Now we can define the apply macro as before, with fresh referring to the compile-time import ST.

Combining macros, staging, and modules. The module import system of Flatt [2002] fits extremely well into our framework. In particular, since let definitions and macros are simply bindings at different levels, importing modules at a specific level simply shifts the levels of the bindings. Well-stagedness, as before, is managed through levels, with staging annotations that can adjust the levels in an expression. In this work, we focus on module imports at two levels, 0 and -1, which are the most practical and are sufficient to encode many interesting applications, but the essential idea can be generalized to support importing modules at more levels.
The updated call relation between definitions across modules is given on the left of Fig. 2. Definitions in the gray box are compile-time computations; that is, they can be spliced at top level; notice how they are all targets of some splice arrow. Importing modules makes more definitions available for compile-time computations: let definitions can splice let definitions imported at level -1, and macros can splice macros imported at level -2. Note that the relation is not defined in an ad-hoc way; rather, it is derived following the rule of levels, as shown on the right of Fig. 2. In principle, these relations can extend to an arbitrary number of levels.
Finally, we note that only compiled modules can be imported. As all top-level splices are evaluated during compilation, imported modules never contain top-level splices. We have seen in Fig. 2 the call relation between definitions across modules. The call relation inside a compiled module can be simplified to the original call relation with splice edges removed, indicating the absence of top-level splices. An example is given as module (a) at the top of Fig. 3; details about the figure can be ignored for now and will be introduced in the next section. Recall that the call relation features direct call relations, and there can still be non-top-level splices, i.e. splices inside quotations, in a compiled module.

Compile-Time Side Effects
The Term and Apply example demonstrates another key aspect of our design: compile-time evaluation may perform side effects such as allocation and updates of references.
At the point when apply is spliced, we expect the reference in id from Term to have been initialized. Consequently, id ought to have been evaluated at that point. Indeed, as specified in Flatt [2002], a module imported at level -1 needs to be evaluated during compilation. The alternative semantics of evaluating imported definitions whenever they are spliced would produce quite unexpected results. In our example, every time ~ST.fresh was called, the alternative semantics would allocate a new fresh reference for id, causing ~ST.fresh to return 1 twice and making the two bound variables in the generated term identical rather than distinct, which is not the programmer's intention. By evaluating modules imported at level -1, the reference in id from Term will be properly initialized in the heap.
Compile-time evaluation with references means that compilation needs a compile-time heap. Moreover, we observe that the two calls to fresh in apply use the same reference id, and the state of the reference should be updated between the two calls: the first time fresh () is called, it returns 1, and the second time it returns 2. This suggests that the compile-time heap needs to be threaded through compilation, and after every step of compile-time computation, the possibly updated heap will be used for the rest of the computation.
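The threading of the compile-time heap can be made concrete with a toy state-passing model in plain OCaml. The names heap, alloc, get, set, and fresh below are our own illustrative choices, not MacoCaml internals:

```ocaml
(* A heap maps locations (ints) to values (ints) *)
type heap = (int * int) list

let alloc h v = let l = List.length h in ((l, v) :: h, l)
let get h l = List.assoc l h
let set h l v = (l, v) :: List.remove_assoc l h

(* "Invoking" Term at compile time allocates the reference id exactly once *)
let (h0, id) = alloc [] 0

(* Each call to fresh reads and updates the same location, threading the heap *)
let fresh h = let n = get h id + 1 in (set h id n, n)

let (h1, x) = fresh h0  (* x = 1 *)
let (h2, y) = fresh h1  (* y = 2: the update from the first call is visible *)
```

Because each step receives the heap produced by the previous step, the second call observes the first call's update, just as the two fresh calls in apply do.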
A potential problem with compile-time heaps. Compile-time heaps require some care to avoid problems. Consider:

  module ~ST = Term
  macro fresh_err () = ST.id := !ST.id + 1; <<!ST.id>>
    (* error: id is bound at level -1 but used at level 0 *)

In this definition, fresh_err returns the result of id inside a quotation. Consider what would happen if this definition were accepted: the reference id evaluates to some location loc in the compile-time heap, and splicing fresh_err expands to code that refers to the location:

  let err () = $(fresh_err ())  (* expands to (!loc) at compile-time *)

Now evaluating err () will raise an error at runtime: we cannot find the location loc, as the compile-time heap is no longer available! In this case, fortunately, the definition of fresh_err is rejected as ill-staged: id is imported at level −1, but used at level 0 inside the quotation. But in general, how can we ensure that a runtime computation will not depend on compile-time values?
Our solution: phase distinction of heaps. To answer this question, we formalize (integer) references and compile-time heaps in our calculi, and prove that well-staged expressions never need compile-time heaps after compilation (§4.5). As such, the compile-time heap can be safely discarded after compilation, which enables separate compilation of separate modules. This is the first time, to the best of our knowledge, that such a result has been formally established for compile-time computations. The id and fresh_err example demonstrates the intuition behind this result: if a reference is created at compile-time, it can be used at compile-time, but in a well-staged program it can never be captured using quotations. Note that references may still appear inside quotations, such as $<<!ST.id>>, but as the top-level splice gets evaluated at compile time, this is equivalent to !ST.id.

Before evaluating the call to apply in the splice, we must ensure that Term has been evaluated and thus id has been initialized. This kind of dependency can be arbitrarily deep: for example, importing Program in another module would also make it necessary to find and evaluate Term. For this reason, Flatt [2002] considers a tower of modules.
Fig. 3 presents a tower of modules, focusing on three modules: module (b) in the middle is the module currently being compiled, and it imports module (a) at level -1 and module (c) at level 0. As both (a) and (c) have been compiled, their call relations have no splice edges, indicating the absence of top-level splices. And as they can themselves import modules, the tower can be arbitrarily high. Returning to our example, the definition program corresponds to a definition in module (b), apply comes from module (a), and fresh and id are imported in module (a). Now the goal is to find what definitions need to be evaluated during compilation. As all macros are values after compilation, macros themselves do not need to be evaluated, but they may require other definitions to be evaluated. All definitions colored in gray are let definitions at negative levels that need to be evaluated during compilation of module (b). The intuition behind evaluating those let definitions is as follows. Let definitions at negative levels can be spliced and thus evaluated at compile-time. We thus need to ensure that they have been evaluated before they can be spliced, so that all references have been initialized properly in the compile-time heap, just like id from Term.
As the tower of modules can be arbitrarily high, finding and evaluating these let definitions at negative levels is specified using two processes, called visiting and invoking [Flatt 2002].
Visiting and invoking in MacoCaml. Fig. 4 shows the two processes. Visiting a module evaluates its compile-time computations, and invoking a module evaluates its runtime computations. We represent visiting by dotting macro definitions, and invoking by coloring let definitions gray.

Fig. 3. The call relation between definitions across module towers. The module (b) in the middle is the module currently being compiled, and it imports two modules, in green and purple respectively, with their call relations given in (a) and (c). This figure highlights 3 modules, but actually involves 7 modules: module (b), module (a) and its two imported modules at level 0 and -1 respectively, and module (c) and its two imported modules at level 0 and -1 respectively. Definitions in gray are evaluated during compilation. In practice, the tower can be arbitrarily high.

Since macros are values, visiting a module (Fig. 4a) involves invoking modules imported at level -1 and recursively visiting modules imported at level -1 and 0. Invoking a module (Fig. 4b) evaluates let definitions and recursively invokes modules imported at level 0.
When compiling a module (module (b) in Fig. 3), we start by visiting as well as invoking modules imported at level -1, and also visiting modules imported at level 0. For our example, the Program module visits Apply, which in turn invokes Term and evaluates id.
Incorporating visiting and invoking into our framework introduces several differences from Flatt's design. The fact that macros in our system are always values after compilation makes visiting simpler, since there is no need to evaluate macros during visiting. Moreover, as top-level splices have already been evaluated in compiled modules, the value of the spliced definition on the purple background of Fig. 3a will not be needed after compilation. As such, it is a design choice whether it needs to be visited; in our system, we choose to visit it for its effects. Furthermore, we prove that phase distinction of heaps continues to hold in the presence of module towers, where a compile-time heap is threaded through visiting and invoking.
Compile-time and runtime phase distinction. We further prove that compile-time-only computations do not interfere with runtime evaluation: when evaluating a compiled module, we can erase all macros and all modules imported at compile-time. With this result, we establish a full phase distinction between compile time and runtime.
Summary. We briefly summarize the MacoCaml notions introduced so far. Bindings are separated into let definitions at level 0 and macro definitions at level -1. Quotations and splices respectively increase and decrease the level of expressions, and can thus be used to cross phases in definitions. Modules can be imported at runtime or compile-time; in the latter case the levels of imported definitions are shifted by -1. As imported modules can themselves import other modules, visiting and invoking traverse the tower of modules, evaluating definitions during module imports. Side effects require care with compile-time heaps; we prove that compile-time heaps can be discarded after compilation. As we can see, phases, staging, and modules work together in MacoCaml. We have implemented MacoCaml in the OCaml compiler (§5) and compare our design to other systems in §6.

Larger Example: Print Formatted Data
Having introduced the key features of MacoCaml, in this section we present a more substantial example program. §5 describes larger-scale examples.
Consider defining a C-like printf-style function that takes a format and a sequence of arguments, and returns the formatted output. We start by encoding the format using a datatype: fmt is defined as a generalized algebraic datatype (GADT) taking two type parameters. Intuitively, in a format ('a, 'b) fmt, 'a is 'b prefixed by one int → for each integer argument required by the format. There are three cases: the format either asks for an extra integer argument (Int), or is given a string (Lit), or concatenates two formats (Cat). The datatype can easily be extended to take more sorts of inputs. The function % is simply an infix alias for Cat.
For example, the following format corresponds to the C-style printf format string "(%d, %d)" that consumes a pair of integers.
One way to implement printf is through a continuation-passing-style (CPS) auxiliary function printk.

Fig. 5. Key judgments of modules, structures, and expressions with their dependencies in the paper.

The printk function takes a continuation k of type string → 'b and a format ('a, 'b) fmt, and constructs a function that takes as many integer arguments as the format specifies. If the format is Int, printk returns a function that takes an integer s and passes its string representation to k. If the format is Lit s, s is passed directly to k. For Cat (l, r), printk processes l, binding its result to x, processes r, binding its result to y, and finally concatenates (^) the two strings, passing the result to k.
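The code for fmt, printk, and printf does not appear in this excerpt; a plain-OCaml version consistent with the description above (the concrete definitions are our reconstruction) is:

```ocaml
(* ('a, 'b) fmt: 'a is 'b prefixed by one int -> per integer the format needs *)
type (_, _) fmt =
  | Int : (int -> 'a, 'a) fmt
  | Lit : string -> ('a, 'a) fmt
  | Cat : ('a, 'b) fmt * ('b, 'c) fmt -> ('a, 'c) fmt

let ( % ) l r = Cat (l, r)  (* infix alias for Cat *)

(* CPS interpreter: builds a function taking the format's integer arguments.
   Polymorphic recursion (the "type a b." annotation) is needed for Cat. *)
let rec printk : type a b. (string -> b) -> (a, b) fmt -> a =
  fun k fmt -> match fmt with
    | Int -> fun s -> k (string_of_int s)
    | Lit s -> k s
    | Cat (l, r) -> printk (fun x -> printk (fun y -> k (x ^ y)) r) l

let printf fmt = printk (fun s -> s) fmt

(* The format corresponding to the C-style "(%d, %d)" *)
let pair = Lit "(" % Int % Lit ", " % Int % Lit ")"
```

With these definitions, printf pair has type int -> int -> string, exactly as the GADT indexing promises.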
The definition gives us the desired behavior:

  let three_and_four = printf pair 3 4  (* "(3, 4)" *)

Defining printk as an interpreter of formats is inefficient, so it is useful to eliminate the interpretation overhead with macros. The macro definitions mprintk and mprintf correspond to printk and printf. Passing a format to mprintf generates code that takes exactly the right number of arguments and does not create or inspect the constructors of fmt:

  let mpair = $(mprintf pair)
  (* fun s1 -> fun s2 -> "(" ^ string_of_int s1 ^ ", " ^ string_of_int s2 ^ ")" *)
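The generated code quoted in the comment can be checked directly in plain OCaml by writing the specialization by hand:

```ocaml
(* The specialized printer that $(mprintf pair) would produce: it consumes two
   integers directly and never constructs or inspects fmt values *)
let mpair = fun s1 -> fun s2 ->
  "(" ^ string_of_int s1 ^ ", " ^ string_of_int s2 ^ ")"

let three_and_four = mpair 3 4  (* "(3, 4)" *)
```

The interpretive dispatch on Int, Lit, and Cat has been compiled away, leaving only string concatenation.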

A MACRO CALCULUS WITH STAGING AND MODULES
In this section, we present our source calculus, which forms the foundation for MacoCaml. Since the source can import compiled modules and enforce compile-time evaluation, the source judgments may depend on the core judgments to be defined in §4. Fig. 5 presents the key judgments and their dependencies in the paper. The reader is advised to refer back to this figure while reading the rest of the article, as what it depicts will gradually come to make sense.

Syntax.
Fig. 6. Syntax of the source calculus. For clarity, we use colors to distinguish source and core; syntax in blue and light blue denotes and refers to definitions in the core calculus.
A structure is a sequence of items: source modules (module X : Δ = M), types (type t = τ), let definitions (def x = e), and macros (def↓ m = λx : τ. e), which are always functions. We write def and def↓, instead of let and macro, to emphasize that macros are essentially definitions living at a shifted level (-1). Structure items also include explicit constructs for importing modules at level 0 (import) and at level -1 (import↓). For clarity, imports in our system are qualified imports. Here the imported M denotes a module in the core calculus, since it is only possible to import modules that have been compiled to the core. We will describe the core calculus in detail in §4. Here, we remark that the core calculus is essentially the source calculus after compile-time evaluation. Consequently, the core calculus does not have top-level splices, which are evaluated during type-checking, but additionally includes locations, which are the values obtained by evaluating references. A path P is a sequence of module variables that can be used to locate a definition inside (nested) modules.
Expressions include literals i, the unit value unit, variables x, functions λx : τ. e, applications e₁ e₂, definitions x and qualified definitions P.x, and macros m and qualified macros P.m. For simplicity, the calculus has only integer references, with reference creation ref e, access !e, and assignment e₁ := e₂. Lastly, ⟨e⟩ quotes an expression and $e splices an expression.
Types. Module types Δ include structure types sig S end, where structure types keep track of module types X : (Δ, n), with a level n indicating the level at which the module is available, type definitions (t = τ) to propagate type equivalence, and def and def↓ types (x : τ and m : τ). Types τ include the integer type Int, the unit type Unit, functions τ₁ → τ₂, references Ref τ, code fragments Code τ, type variables t, and qualified type variables P.t.

Contexts.
The context Γ maps definitions, as well as local variables x, to their types and levels. Throughout this paper, we assume all definitions have distinct names, so there is no shadowing.
Heaps σ and evaluation contexts Ω refer to definitions in the core calculus. A heap σ maps a location ℓ to a value v, while Ω maps modules, definitions, and macros to their definitions and values. Both contexts are needed in the source calculus for compile-time computations.
Fig. 7. Typing modules and structures in the source calculus

Typing Modules and Structures
The rules for typing modules and structure items are given in Fig. 7. The reader is advised to ignore σ, Ω, and the elaboration part until §3.3. The module elaboration judgment reads: under the heap σ₁, the evaluation context Ω, and the type context Γ, the module M has type Δ and elaborates to a core module, updating the heap to σ₂. Rule m-struct simply uses the typing rule for structure items. In rule m-mvar, we get a module variable from the type context. Rule m-pmvar type-checks a module variable from a path. In both cases, the module variable has level 0, as in programs modules themselves are always at level 0, in contrast to their names in paths: in P.X, the path P can be of level −1.
The path typing judgment Γ ⊢ P : Δⁿ says that under the type context Γ, the path P has type Δ with level n. The types and levels of module variables are obtained from the context (rule p-mvar). For a nested path P.X (rule p-pmvar), we first get the type and level n₁ of P, and get (Δ, n₂) for the module X from that type. The return type is ⌈Δ⌉ᴾ with level n₁ + n₂. The notation ⌈·⌉ᴾ prefixes P to all type variables that are defined in P. For example, if X₁ defines t = Int and a nested module X₂ : (sig x : t end), then X₁.X₂ has type ⌈sig x : t end⌉^X₁ = sig x : X₁.t end.

Fig. 8. Typing expressions with compile-time code generation in the source calculus.

The structure typing judgment type-checks a structure item S. Rules st-empty and st-type are straightforward. For definitions def x = e (rule st-def), we type-check e at level 0 (expression typing is explained in the next section) and put x : τ in the context Γ for type-checking the rest of S.
Rule st-macro type-checks macros def↓ m = e. We type-check the macro definition e at level −1 and add its type m : τ to the context Γ for checking the rest of S.
Rules st-module, st-importR, and st-importC type-check modules and imports. When we define or import a module, the module is at level 0. In contrast, when we import↓ a module, the module is at level -1. Importing a module also requires the module to be well-typed in the core under an empty heap (• • ⊢ M : Δ), making compilation independence [Culpepper et al. 2007] explicit: the compilation of a module does not depend on side effects that occurred during the compilation of imported modules.

Typing Expressions
Fig. 8 presents the typing rules for expressions. The expression typing judgment reads: under the heap σ₁, the evaluation context Ω, and the type context Γ, the expression e has type τ at level n and mode ★, elaborates to a core expression, and updates the heap to σ₂. The definition of compiler modes appears at the top of the figure. The mode is only significant for staging annotations, and is explained with the staging rules.
The first three rules are straightforward. Rule kvar says that a definition variable x is well-typed only at level 0. Similarly, rule macro says that a macro m is well-typed only at level −1. Definitions from paths can have lower levels (rules pkvar and pmacro). Thus we first get the level n of the path P. Then P.x is well-typed at level n, and P.m is well-typed at level n − 1. In both cases, the return type is ⌈τ⌉ᴾ.
Rule abs introduces the binder x at level n, so later x can be used only at level n. The notation Γ ⊢ τ checks that all type variables in τ are bound. Rule app is self-explanatory. Rules ref, get, and set are standard typing rules for creating, reading, and writing a reference. Rule eq says that if e has type τ1, and τ1 is equivalent (≈) to τ2, then e can also be given type τ2. The type equivalence judgment is standard [Leroy 1994] and appears in the appendix. As an example, if a module X defines type t = Int and k : t, then 1 + X.k is well-typed, since X.k : X.t (pkvar), X.t ≈ Int, and thus X.k : Int (eq).
Typing staging annotations. The final three rules type-check staging annotations. Rule quote says that if e has type τ at level n + 1, then ⟨e⟩ has type Code τ at level n. Dually, rules splice and codeGen say that if e has type Code τ at level n − 1, then $e has type τ at level n; the two rules apply under different compiler modes.
The compiler mode ★ manages compile-time code generation, and is similar to the typing states in Template Haskell [Sheard and Jones 2002]. The transitions between the three modes are given at the bottom of Fig. 8. They work as follows. As described in §2.2, compile-time evaluation happens inside top-level splices, i.e. splices that are not inside quotations. When we first enter the judgment for typing expressions and macros, as in rules st-def and st-macro during structure typing, the compiler is in mode c. If the compiler then encounters a splice, rule codeGen applies. The rule switches to mode s to type-check the spliced body, and then forces evaluation of the elaboration result (§3.3). With compiler modes, we ensure that compile-time code generation happens only inside top-level splices in rule codeGen. In contrast, if the compiler encounters a quotation, then it goes to mode q (rule quote). Rules splice and quote switch back and forth between modes s and q, but the compiler can never go back to c. Notably, our rule codeGen generalizes that of Sheard and Jones [2002]: our rule applies at any level, reflecting the fact that top-level splices can occur in let definitions at level 0 (rule st-def) as well as in macros at level −1 (rule st-macro), while Sheard and Jones [2002] required (and only needed) the level to be 0.
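To make the mode discipline concrete, here is an illustrative sketch in MacoCaml's concrete syntax (§5). The `<< >>`/`$` quotation and splice notation is assumed, and the program is ours, not a listing from the paper:

```ocaml
(* Sketch of mode transitions; '<< >>' and '$' are assumed
   quote/splice syntax. *)
macro double = fun e -> << $e + $e >>
(* checking the macro body starts in mode c at level -1;
   '<< ... >>' switches c -> q; the inner '$e' switches q -> s;
   neither splice is top-level, so codeGen does not fire here *)

let four = $(double << 2 >>)
(* this splice is top-level: rule codeGen applies (mode c), and
   the body  double << 2 >>  is evaluated during compilation *)
```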

Elaboration and Compile-Time Evaluation
As we will see, the source syntax is a subset of the core syntax. The elaboration part (the output e′) in Figs. 7 and 8 mostly does nothing, except for two kinds of compile-time computation: (1) visiting and invoking imported modules in rules st-importR and st-importC; (2) compile-time code generation in rule codeGen.
Heaps and evaluation contexts. To support compile-time computation, typing judgments take as input a heap σ and an evaluation context Ω. Heaps keep track of references, whose values are stored at locations. The typing judgment takes in a heap σ1 and updates it to σ2.

(Fig. 9. Key rules of visiting and invoking.)
The evaluation context Ω stores definitions needed for evaluation, and is extended in four rules: rule st-macro when typing macros, and rules st-module, st-importR, and st-importC when typing (imported) modules. Notably, in source typing the evaluation context stores only compile-time definitions; hence we ignore def definitions in rule st-def, as they will not be needed during compile-time evaluation.
Visiting and invoking modules. As explained in §2.4, when we import a module at level 0, we need to visit the module (rule st-importR), and when we import a module at level −1, we need to visit as well as invoke the module (rule st-importC). Intuitively, visiting and invoking ensure that compile-time computations from imported modules are evaluated for their side effects (e.g., all compile-time references are initialized). Note that in both cases, the elaborated module is the original module, as visiting and invoking do not affect runtime computations.
We present the key rules for visiting and invoking in Fig. 9; for space reasons, the complete rules are in the appendix. We have two judgments, for visiting and invoking a module respectively:
- Rule v-struct: visiting a module traverses its structure to find imported modules; it visits and invokes import↓-ed modules (rule v-importC) and visits imported modules (rule v-importR).
- Rule i-struct: invoking a module simply evaluates it; the rules for evaluating a module are defined in the core calculus (§4).
Compile-time code generation. Rule codeGen is where compile-time code generation happens, inside top-level splices. After elaborating the spliced expression to e′, the rule evaluates e′ into a value ⟨e1⟩. Level-annotated values are introduced in §4.1; for now, it suffices to know that ⟨e1⟩ is a quotation value that cannot be further reduced. The rule then removes the quotation and inserts e1 as the elaboration result.

Example
Fig. 10 presents an example of typing and compile-time code generation. The expression $(x ← ref 0; ⟨$(x := 1; x)⟩) is a top-level splice. Inside the splice, we first create a reference x. Then, inside a quotation and another (non-top-level) splice, we set x to 1, and call the macro with x.
The example demonstrates several interesting aspects. First, it shows how the level and the mode change during typing. Second, at position 1, we apply rule splice rather than rule codeGen, as the splice is not a top-level splice. Applying rule codeGen here would in fact go wrong: if rule codeGen were applied, we would have no way to evaluate x := 1, since x has not been initialized yet! Lastly, at …

(Fig. 10. Example: typing with compile-time code generation. For clarity, we use the following syntactic sugar: (1) x ← e1; e2 for (λx : _. e2) e1, and (2) e1; e2 for (λ_ : _. e2) e1. We assume if-expressions and the equality operator = on integers. We omit some uninteresting details for space reasons.)

COMPILATION TARGET WITH DYNAMIC SEMANTICS
In this section, we present the core calculus, which is the compilation target for the source calculus of §3.

Syntax and Typing
Fig. 11 presents the syntax, typing rules, and level-annotated expressions and values for the core calculus.
Syntax. The syntax of the core is mostly the same as the source. In structure items S, the definitions def and def↓ carry an extra syntactic condition, written e^0, which ensures that definitions have no top-level splices after compilation (explained below). Modules and imported modules now both have the form M, but we still distinguish them, as imported modules need to be type-checked under an empty context for separate compilation. Expressions include an additional construct, locations ℓ, which are the values of references. We omit the definitions of types and contexts, which are exactly the same as in the source calculus (Fig. 6).
Typing. Typing rules in the core have no compile-time evaluation, and thus judgments have no evaluation context, output heap, compiler mode, or elaboration result. The input heap σ is used to type-check locations (rule c-loc): a location is well-typed only if it is bound in the heap. We omit the typing rules for modules, paths, and structure items, as they are otherwise the same as the corresponding typing rules in the source calculus (Fig. 7).
The judgment σ; Γ ⊢ e : τ^n reads: under the heap σ and the type context Γ, the expression e has type τ at level n. Most rules are self-explanatory. Rule c-loc, as mentioned above, type-checks locations. As references always hold integers, locations have type Ref Int.
Level-annotated expressions and values. We disallow top-level splices in the core calculus, enforcing that restriction by means of level-annotated expressions [Calcagno et al. 2003b; Taha et al. 1998]. The notation e^n means that e is an expression at level n, where n ≥ 0. Importantly, a splice $e is an expression only at positive levels n + 1. Thus, the condition e^0 in the structure syntax ensures that expressions have no top-level splices. In other words, there are no negatively level-annotated expressions.
The level-annotated values v^n, where values v are a subset of expressions e, mean that v is a value at level n. Values v^0 include literals, units unit, locations ℓ, lambdas λx : τ. e, and quotations ⟨e⟩, in the last two cases only if the body satisfies e^0, i.e., it is an expression at level 0 that has no top-level splices. This suggests that evaluation can happen inside lambdas and quotations, as we will see. Values v^{n+1} are the same set as the expressions e^n. Intuitively, e is a value at level n + 1 if it does not evaluate at level n + 1, and thus it can have only up to n nested splices, which is exactly the expressions e^n. For example, if $e is an expression at level 1, then it is a value at level 2.
Notation: we often write e simply for e^0. We write partially annotated expressions (and values), such as e1^n e2, to mean that e1 is an expression at level n but there is no level restriction on e2.

Dynamic Semantics
Fig. 12 presents the dynamic semantics for the core calculus.
A module value M contains a structure value S. A structure is a value when its modules, defs, def↓s, and imported modules are values. Note that import↓-ed modules can be any M, as they are compile-time-only computations that do not evaluate in the dynamic semantics. The judgment σ1; Ω ⊢ M1 −→ M2 ⊣ σ2 reads: under heap σ1 and evaluation context Ω, the module M1 evaluates to M2, updating the heap to σ2. Rule ev-m-struct evaluates the structure. Rules ev-m-mvar and ev-m-pmvar get the module definition from the context. We use the notation p.X = M ∈ Ω to mean that we can get the definition of X inside Ω by following the path p. Recall that the notation ⌈·⌉_p prefixes p to all variables defined in its argument.
The judgment σ1; Ω ⊢ S1 −→ S2 ⊣ σ2 evaluates structures. For def k = e, we first evaluate e to a value v (rule ev-st-def1), and then add k = v to the evaluation context to evaluate the rest of the structure (rule ev-st-def2). Type definitions are ignored in the dynamic semantics (rule ev-st-type). Macros are not added to the evaluation context (rule ev-st-macro), making it explicit that macros are compile-time-only computations. For space reasons, we omit the evaluation rules for modules. At a high level, just like defs, we evaluate modules and imported modules to values and add them to the evaluation context; and just like macros, we ignore import↓-ed modules.
The judgment σ1; Ω ⊢ e1 −→^n e2 ⊣ σ2 reads: under heap σ1 and evaluation context Ω, evaluating e1 at level n results in e2 and updates the heap to σ2. The rules are used both for compile-time evaluation (as in rule codeGen) and at runtime. The judgment is level-indexed: intuitively, the rules search for expressions at level 0 to evaluate, adjusting the level when evaluating inside quotations and splices. For an application e1 e2 at level n, we first evaluate e1 (rule ev-app1) until it becomes a value v, and then we evaluate e2 (rule ev-app2). We evaluate a lambda λx : τ. e by evaluating its body, so that splices inside the body can get evaluated (rule ev-abs); the level is positive, as there are no splices at negative levels. Beta-reduction happens only at level 0 (rule ev-beta). Evaluation for references is similar to applications, and we have rules for reference creation (rules ev-ref1 and ev-ref), reads (rules ev-deref1 and ev-deref), and writes (rules ev-assign1, ev-assign2, and ev-assign).
Quotations (rule ev-quote) and splices (rule ev-splice) increment and decrement the evaluation level, respectively. Rule ev-spliceCode is where we cancel out a pair of quotation and splice: when v1 is a value at level 1, splicing the quotation ⟨v1⟩ at level 1 removes the quotation and steps to v1. The last four rules evaluate definitions at level 0 to their values in the evaluation context. Rules ev-macro and ev-pmacro for macros should be used only during compile-time evaluation.
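As a worked illustration of these rules (our example, not one of the paper's figures), consider evaluating a quotation whose body contains a splice:

```
Evaluating  ⟨ 1 + $⟨2⟩ ⟩  at level 0:

  ev-quote       enter the quotation; evaluate  1 + $⟨2⟩  at level 1
  ev-splice      enter the splice; evaluate  ⟨2⟩  at level 0
                 ⟨2⟩ is already a value at level 0 (its body 2 is e^0)
  ev-spliceCode  the splice meets the quotation value at level 1:
                   $⟨2⟩  −→  2
  result:        ⟨ 1 + 2 ⟩, a value at level 0 (no remaining splices)
```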

Example
Fig. 13 presents the evaluation steps for the compile-time evaluation that happens in Fig. 10. The derivation from step 1 to step 2 is given on the right of the figure, demonstrating how we search for expressions to evaluate inside quotations and splices, and how heaps are updated. The last step, from step 3 to step 4, cancels out a pair of quotation and splice with rule ev-spliceCode.

Type Soundness
We discuss type soundness for the core calculus. First, we need notions of well-formed contexts. Γ ok means that all types in the type context are well-formed, with all type variables bound. σ ok checks that all values in the heap are integers. An evaluation context Ω is well-typed, with respect to a context Γ and a heap σ, if all definitions in it have the types specified by Γ. Interestingly, as the reader may have noticed, our calculi distinguish between compile-time and runtime evaluation contexts. Specifically, when typing the source calculus, we add macros (rule st-macro) but not definitions (rule st-def) to the evaluation context, while we do the opposite when evaluating modules: we add definitions (rule ev-st-def2) but not macros (rule ev-st-macro). Therefore, evaluation contexts have two well-formedness judgments, Γ ⊢c Ω and Γ ⊢r Ω, as given in Fig. 14, used for compile-time and runtime computations, respectively.
Preservation holds for both compile-time and runtime evaluation. For space reasons, we show theorems only for expressions, but Theorems 4.1, 4.2, and 4.4 also hold for modules and structures.

Theorem 4.1 (Preservation). Given Γ ok, and σ ok, and …

The theorem does not relate the typing level n′ to the evaluation level n, as it takes the typing and the evaluation step as given, and proves that every possible step of evaluation preserves typing. The progress theorem is subtler. For example, evaluating modules (rule ev-st-macro) does not add macros to the evaluation context, so we must ensure that runtime evaluation never encounters macros (rule ev-macro), to avoid evaluation getting stuck. Similarly, compile-time evaluation should never encounter rule ev-kvar. But given an expression, how do we know whether we are evaluating at compile time or at runtime? Rule codeGen gives us a hint: the elaborated expression is typed at level n − 1, which is either −1 or −2 for top-level splices in let definitions and macros, but evaluated at level 0. Thus, we can determine the evaluation phase by comparing the typing level n′ with the evaluation level n: a smaller typing level indicates compile-time evaluation; otherwise, the phase is runtime. The following theorem establishes progress, where a side condition on Γ checks that Γ does not contain local variables (i.e., x).
Theorem 4.2 (Progress). Given Γ ok, and σ ok, and …, then either e is a value v, or there exist e′ and σ′ such that σ; Ω ⊢ e −→^n e′ ⊣ σ′.
Combining preservation and progress yields type soundness. Below, we show soundness for runtime, where n′ = n = 0, and for compile time, where n′ = −1 (or any negative level) and n = 0.

Elaboration Soundness and Phase Distinction
We discuss additional properties of our calculi.
Elaboration soundness and phase distinction of heaps. First, we want to show that elaboration preserves typing. A natural question then is: if σ; Ω; Γ ⊢★ e : τ^n ⇝ e′ ⊣ σ′, under which heap, σ or σ′, should we type-check e′? The answer is: neither. In practice, we may compile a program in one environment and then run the compiled program in another environment, where compile-time information is no longer available. It is therefore important that e′ does not refer to values from the compile-time heap. Indeed, in the following theorem, we prove that e′ is well-typed under an empty heap in the core calculus. That is, heaps have a clear compile-time/runtime phase distinction, suggesting that we can safely discard the compile-time heap after compilation.
Theorem 4.4 (Elaboration Soundness). Given Γ ok, σ ok, and …

The intuition behind the empty heap is as follows. In rule codeGen, suppose e is typed at level −1 and then evaluated at level 0. Any locations created at evaluation level 0 are at typing level −1. Because of preservation, the result ⟨e1⟩ is also at typing level −1, and thus e1 has typing level 0, and so e1 cannot capture locations from typing level −1. The theorem also says that e′ is an expression at level 1 under mode q, and at level 0 otherwise. Thus, in rule codeGen, since e is typed under mode s, we have e′^0. Therefore, compile-time type soundness (Theorem 4.3) applies to the evaluation derivation in rule codeGen.
Phase distinction. The distinction between compile-time and runtime heaps has already made it evident that compile-time-only computations are not needed for runtime evaluation. The following theorem makes the phase distinction more explicit: an erasure operation removes all macros (including those inside modules) and all modules imported at compile time (import↓). The theorem says that if a module M evaluates to M′, then the erasure of M evaluates to the erasure of M′. Consequently, when evaluating a module we can safely erase all compile-time-only computations.
Theorem 4.5 (Phase Distinction). Given Γ ok, σ ok, and …

MACOCAML: IMPLEMENTATION
So far we have focused on the calculi that capture the essence of our design. We have incorporated our design into OCaml, and provide the modified OCaml compiler as an artifact [Xie et al. 2023].

Compiler
The implementation is a substantial change that touches many parts of the compiler (top level, type checker, runtime, dynamic loading, parser, code generator, primitive types, standard library, etc.).
Syntax. For harmony with other OCaml features, our syntax differs from the formalism in several ways. Rather than def k = e and def↓ m = e, we write let k = e and macro m = e. In place of import M or import↓ M, we project from modules using M.k at runtime or ~M.k at compile time, and we make names available without qualification using open M and open ~M. For the code type, we write the postfix τ expr rather than the prefix Code τ.
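The correspondence above can be summarized as follows; the summary and the small example are ours, with `<< >>`/`$` assumed as the concrete quote and splice syntax:

```ocaml
(* Formal syntax        MacoCaml concrete syntax
   def k = e        ~   let k = e            (runtime definition)
   def↓ m = e       ~   macro m = e          (compile-time definition)
   import M         ~   M.k, open M          (use M at runtime)
   import↓ M        ~   ~M.k, open ~M        (use M at compile time)
   Code τ           ~   τ expr               (postfix code type)      *)

macro inc = fun e -> << $e + 1 >>   (* e : int expr -> int expr *)
let three = $(inc << 2 >>)          (* top-level splice, expanded
                                       during compilation *)
```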
Quotation. Following other multi-stage language implementations such as BER MetaOCaml [Kiselyov 2014], compilation translates typed quoted expressions into combinators that construct terms. Since type checking guarantees that generated code is well-typed, the combinators do not need to carry type information; they construct and compose untyped representations.
Splicing. Executing the compile-time code inserted into top-level splices builds an array of intermediate code values (of type lambda array). Splicing these values constructs a single large representation for each module, which is compiled with OCaml's standard backend.
Compile-time evaluation. As in the formalism, compilation involves evaluating top-level splices. These are first translated to bytecode, then executed using the Meta.reify_bytecode function that OCaml's interactive top level uses to execute phrases. Although the OCaml compiler supports compilation to both bytecode and native code, for simplicity and portability compile-time code is currently always compiled and executed as bytecode in our implementation. In the future we plan to support native-code execution at compile time by integrating a method to generate and execute native code dynamically, as proposed by Fischbach and Meurer [2011].
Compilation. Since modules can only import compiled modules, which type-check under an empty heap, it is clear that compilation of a module does not depend on side effects that occurred during the compilation of imported modules (§3.1). When compiling a module, each imported module is visited (and, similarly, invoked) at most once at each level. For example, module ~ST1 = Term followed by module ~ST2 = Term will only visit Term once. As another example, if modules A and B both import Term at compile time, and another module C imports A at runtime and B at compile time, then Term will be visited twice: once at level −1 (for A) and once at level −2 (for B). If module C imports both A and B at runtime, then Term is visited only once, at level −1, and A and B share its heap state at level −1.

Language Extensions
The implementation of MacoCaml follows the key ideas of our design. Full integration into OCaml requires additional steps, which we touch on briefly here.

References of code. Our calculus restricts references to integers in order to study compile-time heaps. If references can store code, then a variable can escape its scope: code quoting the variable can be stored in a reference and spliced outside the corresponding binder; this issue is known as scope extrusion. MacoCaml follows other systems (e.g. Kiselyov [2014]) in detecting scope extrusion at splice time.
Module subtyping. In the formalism, every module exports all the names that are defined in its body. In contrast, the richer module systems of full-scale ML-family languages such as OCaml offer module subtyping [Mitchell and Harper 1988], which supports exporting only a subset of names. There is an interesting interaction between subtyping and quotation. Consider a module M whose signature exports a macro public that quotes an unexported function secret. When the result of M.public is spliced outside the module, its expansion includes a name that is not in scope. To deal with this, the compiler incorporates path closures, which systematically transform macros and signatures to ensure that names like secret that are hidden from user code by module boundaries are nonetheless accessible in elaborated programs. Essentially, the compiler closure-converts each macro that uses module-local definitions; the produced closure is a quoted module containing those definitions, whose path is then injected into the macro definition. As a result, the program will generate M.Closure1.secret (3+1) instead of M.secret (3+1). As users cannot access closures, data abstraction is preserved for user programs. As macros are always function values, the insertion of an additional parameter to inject the path closure does not change their evaluation behaviour. We leave a formal treatment of path closures to future work.
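A reconstructed sketch of the interaction just described; the names M, public, secret, and Closure1 come from the text, while the concrete types, bodies, and signature syntax are our guesses:

```ocaml
module M : sig
  macro public : int expr -> int expr   (* exported macro *)
end = struct
  let secret x = x                      (* hidden by the signature *)
  macro public = fun e -> << secret ($e + 1) >>
end

let y = $(M.public << 3 >>)
(* A naive expansion yields  M.secret (3 + 1) , but secret is not
   exported. With path closures the compiler instead generates
   M.Closure1.secret (3 + 1), keeping the name reachable in the
   elaborated program but inaccessible to user code. *)
```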
Cross-stage persistence. Cross-stage persistence (CSP) in multi-stage languages comes in three flavours: heap-based CSP, value-based CSP, and path-based CSP. In heap-based CSP, heap-allocated values are stored in code as references into the heap. In compile-time staging systems such as MacoCaml and Template Haskell [Sheard and Jones 2002], this form of CSP is not permitted, since the compile-time heap where code is constructed is discarded after compilation. In path-based CSP, top-level identifiers can appear in code quotations and are stored as names. Our system naturally allows something similar: a macro can quote a top-level let identifier and, more generally, quotations at level n can quote identifiers at level n + 1. Path closures, discussed above, manage the tricky interactions with the hiding of top-level names.
Finally, in value-based CSP, simple immutable values are automatically converted to their code representations. For example, in MetaOCaml one might write let x = "foo" in .<x>. to produce a code value equivalent to .<"foo">.. In our system this example is rejected as ill-typed; the user must apply a type-specific function lift_string : string → string expr instead. In some systems, such as Template Haskell, these lift functions are inserted automatically using type classes, and we hope that the eventual integration of modular implicits [White et al. 2014] will allow a similar approach in MacoCaml.
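The contrast can be sketched as follows; lift_string is the type-specific lifting function mentioned above, and the `<< >>` quote syntax is assumed:

```ocaml
(* MetaOCaml: value-based CSP is implicit.
     let x = "foo" in .<x>.     (* a code value, like .<"foo">. *)

   MacoCaml: the analogous  let x = "foo" in << x >>  is rejected
   as ill-typed; lifting must be explicit. *)
let c : string expr =
  let x = "foo" in lift_string x
```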

Libraries
While we leave a systematic evaluation of MacoCaml to future work, we have ported two substantial existing libraries to MacoCaml to validate our implementation.
Stream fusion. Kiselyov et al. [2017] present a MetaOCaml library, Strymonas, for stream fusion with appealing performance improvements on OCaml microbenchmarks. The interface offers high-level combinators such as map and fold, and generates efficient loop-based code.
We have ported the 904 lines of code of Strymonas to MacoCaml, which generates identical code to the original library, thus inheriting the same performance improvements. Although Strymonas was designed for run-time staging, the port was straightforward, since MacoCaml's quotes and splices are similar to MetaOCaml's. The compilation time is also similar, at around 600ms for both the MetaOCaml implementation and the MacoCaml port. Besides syntactic updates, porting involved two changes. First, the original implementation has several uses of cross-stage persistence, which MacoCaml does not support; we changed these to explicit applications of MacoCaml's lifting functions (e.g. Expr.of_int : int → int expr). Second, where the original implementation executes quoted code either using MetaOCaml's primitive run function or by printing to a file followed by compilation, the MacoCaml port uses top-level splices. We expect porting other existing MetaOCaml libraries to be similarly straightforward.
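To give a flavor of such a port, here is our illustration (not code from Strymonas): the combinator names and staged types loosely follow Kiselyov et al. [2017], and the MacoCaml quote/splice syntax is assumed.

```ocaml
(* A fused sum-of-squares pipeline; Stream.of_arr/map/fold are
   illustrative names, not the exact Strymonas API. *)
let sum_sq (a : int array expr) : int expr =
  Stream.of_arr a
  |> Stream.map (fun x -> << $x * $x >>)
  |> Stream.fold (fun s x -> << $s + $x >>) << 0 >>

(* In the MacoCaml port, running the generator is a top-level splice
   rather than MetaOCaml's run: *)
let f = $( << fun a -> $(sum_sq << a >>) >> )
```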
Format. OCaml has a sophisticated formatting library that represents typed format trees using GADTs [Vaugon 2013], similarly to the example in §2.5. The library is distributed with OCaml, and the compiler includes a translation from format strings such as "(%d, %d)" to typed format trees, which are then interpreted at run time by functions such as printf and scanf.
We have staged the core formatting library (2476 lines) using MacoCaml to interpret the format trees during compilation, eliminating run-time overhead. The port to MacoCaml was largely a matter of performing a manual binding-time analysis to ascertain which subexpressions in the program are statically known, then adding the expr type, quotations, splices, and import annotations accordingly. The implementation uses most of the features described in the paper (147 quotes, 85 splices, 33 module annotations), but does not use compile-time effects, since the generation of specialized code from format strings in the original library is purely functional.
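The intent of this staging can be illustrated as follows; gen_printf and the produced code are hypothetical, showing only the idea of interpreting the format tree at compile time:

```ocaml
(* Hypothetical: a compile-time interpreter of format trees. *)
let print_pair = $(gen_printf "(%d, %d)")

(* ...which could elaborate to straight-line code equivalent to: *)
let print_pair x y =
  print_char '('; print_int x; print_string ", ";
  print_int y; print_char ')'
```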
We have measured the compilation overhead of using MacoCaml's features, and found it to be modest: the ported format library compiles in 543ms, compared to 477ms for the original library.
Other libraries. We expect that other OCaml libraries will present similar opportunities for optimization using MacoCaml's staging constructs. Regular expression libraries (e.g. re [2023]) will benefit from compiling regexes during compilation, and from generating OCaml code without the overhead of table-based automata. Numerical libraries (e.g. Owl [Wang and Zhao 2022]) will be able to use macros to generate specialized code (e.g. applying the generative techniques in Carette and Kiselyov [2005]). Libraries that generate code in an untyped way (e.g. the ctypes foreign function library [Yallop et al. 2018]) will enjoy improved implementations using our typed code generation facilities. Generic programming libraries based on type representations (e.g. lrt [2023]) will be able to statically generate type-specialized functions using the techniques in Yallop [2017]. More generally, a library will benefit from MacoCaml's staging constructs whenever some aspect of the library can be specialized using information (e.g. a regex, a parameter, or a type) that is unavailable when the library is written, but available at the point where code using the library is compiled.

RELATED WORK
Staging and macros have a rich literature. Here we discuss the most closely related work.
Staged programming. Staging has been used for both compile-time [Sheard and Jones 2002; Xie et al. 2022] and runtime code generation [Calcagno et al. 2003b; Kiselyov 2014; Taha et al. 1998], each with its own merits. Compile-time code generation, as in our system, has a clear and proven phase distinction between compile-time and runtime heaps, while dynamic code generation allows code to be specialized with respect to information that only becomes available at runtime.
Template Haskell (TH) [Sheard and Jones 2002] supports compile-time code generation, and our notion of compiler modes is inspired by TH. However, TH does not present a formal operational semantics. Xie et al. [2022] formalized Typed Template Haskell (TTH) with an operational semantics, but TTH is not yet implemented. Neither work supports side effects, compile-time heaps, or modules. TTH also does not explicitly formalize compile-time evaluation; instead, negative-leveled splices are lifted to the top level so that they are evaluated before the rest of the program, leaving evaluation phases unclear. Kovács [2022] takes a radically different approach, employing two-level type theory (2LTT) to model staged compilation in a dependently typed setting.
Staging for runtime code generation comes with a rather different methodology. Languages such as MetaML [Taha et al. 1998], MetaOCaml [Calcagno et al. 2003b], and BER MetaOCaml [Kiselyov 2014] support an additional run construct that evaluates code at runtime. There are no compile-time bindings, top-level splices, or compile-time evaluation. As code is generated at runtime, there is no clear phase distinction. Like the present paper, the MetaML and MetaOCaml work also defines calculi and establishes theorems, but their formalisms do not support features like side effects, heaps, or modules. Our future plans include extending MacoCaml with runtime code generation.
Another line of work uses modalities to model staged computation. Davies and Pfenning [1996] introduce a language Mini-ML□ based on the modal logic S4 that supports manipulation of closed terms. Davies [1996] shows that applying the Curry-Howard correspondence to linear-time temporal logic produces a language λ◯ supporting manipulation of open terms (similarly to MetaML, which was introduced shortly afterwards). Nanevski et al. [2008] generalize the work of Davies and Pfenning in a different direction, augmenting the □ type modality with the context of free variables that may appear in a term of that type. This contextual modal type theory provides an appealing basis for metaprogramming, but integrating it into a language like OCaml would be a significantly more disruptive change than adding a simple MetaML-style type constructor for code.
Macros and modules. The design of the module system in MacoCaml builds directly on Racket [Flatt 2002] (see also its extensions [Culpepper et al. 2007; Flatt et al. 2012]), which fits extremely well with our notion of macros and staging (§2.3). Flatt [2013] further allows shifting a module to a positive level. From the typing perspective, shifting a module to a positive level would work in our system, though it remains to be seen what implications that would have for code generation and runtime; for example, shifting a module to level +1 means macros from that module are now at level 0 and can be used at runtime. However, there are important differences between the notion of macros in MacoCaml and those in Racket. First, Racket and MacoCaml manage macros differently. In particular, a macro in Racket can be viewed as a binding whose body is defined at level −1 but which itself is bound at level 0. This approach allows a macro body to use definitions imported at compile time, while the macro itself can be called and expanded in a normal program. The discrepancy between levels can cause surprising results. Consider, for example, an example adapted from the Racket documentation. Instead of returning 0, evaluating module b will raise an error about button being unbound. Let us understand what happened. Module a defines button and see-button as normal bindings at level 0, where see-button's value is the syntax object (#') for button. The provide form exports the definitions. Module b imports module a at compile time, so both button and see-button are at level −1. The macro m, as its body is defined at level −1, can see see-button at level −1 and will return the #'button syntax object, which refers to button at level −1. The use of m is at level 0, as the macro binding itself is at level 0.
So both the macro definition and its use are well-leveled, yet the program raises an error after macro expansion! The reason is that since macro m is used at level 0, it expands to #'button at level 0, but there is no button at level 0!

MacoCaml differs from Racket, giving a more consistent and refined view of compile-time computations: macros are defined and bound both at level −1, and by leveraging staging annotations we can distinguish between a macro such as mpower (§2.3), which is an expression at level −1, and splices of macros such as $(mpower 5), which is a top-level splice at level 0 that triggers compile-time evaluation. Moreover, MacoCaml is fully typed, built on top of our calculus with proven soundness results, so a well-typed program never generates an ill-typed program.
On the other hand, macros in Flatt's system are more expressive, as they may scrutinise abstract syntax. In a typed setting, such analytic macros require a sophisticated type system (such as the contextual types used in Squid [Parreaux et al. 2018] or Moebius [Jang et al. 2022]) that exposes contexts in types, rather than a simple MetaML-style system, which is sufficient for our purely generative macros.
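The comparisons below all use the power macro from §2.1 as their running example. In MacoCaml notation it is roughly the following sketch (<< >> quotes code, $ splices; details may differ from §2.1):

```ocaml
(* Sketch of the power macro as a compile-time (level -1) binding. *)
macro rec mpower n x =
  if n = 0 then << 1 >>
  else << $x * $(mpower (n - 1) x) >>

(* A top-level splice triggers compile-time evaluation of the macro. *)
let power5 y = $(mpower 5 << y >>)
```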
Macros and staging in MacroML. MacroML [Ganz et al. 2001] views macros as multi-stage programs, by translating macros to multi-stage programs in MetaML. The key idea is to model macro expansion as one step of elaboration under level 1 (e ↪¹ e′), which translates ordinary arguments into code and macro applications into splices, followed by one step of evaluating the quotation of the resulting program (⟨e′⟩ ⟶₀* e′₁). Regular execution then calls the run construct (run e′₁) to get the final result. As an example, the power macro from §2.1 can be expressed in MacroML (we use MacoCaml notations for quotations and splices throughout this section). MacoCaml differs from MacroML in several aspects. At a high level, rather than considering macros as staged programs, MacoCaml combines macros and staging in the same language, where macros are compile-time bindings and staging and macros can interact. Moreover, MacoCaml models compile-time evaluation using top-level splices and runtime evaluation via direct operational semantics. In contrast, evaluation in MacroML is, in a sense, simulating the behavior of top-level splices: as macro applications elaborate to splices, the elaboration result e′ could contain top-level splices, and thus macro expansion evaluates ⟨e′⟩ rather than e′. This in turn forces regular execution to call run on the evaluation result (i.e., run e′₁), as the result of macro expansion e′₁ is unnecessarily wrapped inside a quotation. Lastly, extending MacroML with references would require proving phase distinction by showing that the runtime heap is independent of the compile-time heap.

Combining macros and staging in Scala. Stucki et al. [2018] described a design in Scala to combine macros and multi-stage programming. As they did not present formal semantics, we compare with their system based on the design description and example programs from their paper.
In their setting, macros are inline functions defined using top-level splices, which are only evaluated when the code is inlined. For example, the power macro from §2.1 is defined using a normal definition mpower' together with the macro mpower, defined as an inline function whose body is a top-level splice. There are a few notable things. First, mpower' is defined at level 0, but used at level -1 in the definition of mpower. To make the call to mpower' well-staged, macros such as mpower are type-checked "as if they were in a quoted context" [Stucki et al. 2018, §3.3]. This can be implemented by type-checking mpower under level 1, similar to MacroML. Moreover, this means that programs using macros such as mpower (5, 2) also need to be "thought of as a quoted program" to make sure that after mpower expands, the call to mpower' is still well-staged. This is conceptually similar to MacroML, where the elaboration result is evaluated inside a quotation. Furthermore, because mpower has a top-level splice, the parameter n seems ill-staged, as it is introduced at the definition level of mpower but used at a lower level inside the splice. To make this work, the program marks n as inline, reflecting the fact that its value will be known during macro expansion. Finally, the top-level splice in the definition of mpower does not trigger evaluation until it is inlined when mpower is called.
MacoCaml differs from Scala in several aspects. First, by modeling macros as compile-time bindings, MacoCaml maintains well-stagedness without needing the inlining mechanism or considering programs in a quoted context. Moreover, while both MacoCaml and Scala treat top-level splices as compile-time evaluation, MacoCaml uses top-level splices at the macro expansion site instead of the macro definition site, thus treating top-level splices more consistently. However, we note that some of these differences arise from Scala's aim of supporting both compile-time and runtime code generation; for runtime code generation, it is important that mpower' is not a macro. It is thus interesting future work to extend MacoCaml with runtime code generation and compare the resulting system again with Scala.
Combining staging and references. Calcagno et al. [2003a] studied references in MetaML and handled scope extrusion, where a variable can escape its scope if code quoting it is stored in a reference and then spliced outside its binder. More recently, Kiselyov et al. [2016] introduced a type system that models context nesting using subtyping, allowing references to store open code values while preventing scope extrusion.
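Scope extrusion can be illustrated with the following schematic sketch (MetaOCaml-style notation, not an example taken from either paper): a reference smuggles open code past its binder.

```ocaml
(* Schematic sketch: the open code value << x >> is stored in the
   reference r while inside x's scope, then observed outside it. *)
let r = ref << 0 >> in
let _ = << fun x -> $( r := << x >> ; << x >> ) >> in
!r  (* now holds << x >>, where x has escaped its binder *)
```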
maco restricts references to integers, as they serve a different purpose: they make compile-time heaps explicit, thus making the language interesting for studying phase distinction. We remark that scope extrusion is orthogonal to our phase distinction result. That is, even in a system with scope extrusion, one may still prove phase distinction (Theorem 4.5), as the extruded part is still well-leveled (though not well-scoped), and any compile-time-only (extruded or not) computations are not needed for runtime evaluation.
We believe that the type systems of Calcagno et al. and Kiselyov et al. could be integrated into our system to support references to code. Currently, our implementation (§5.2) follows MetaOCaml [Kiselyov 2014] in detecting scope extrusion at splice time.
Staging modules. While MacoCaml focuses on term-level code generation, there is a line of work [Inoue et al. 2016; Sato and Kameyama 2021; Sato et al. 2020] on generating modules to eliminate the performance penalty of functor applications. In the future, we would be interested to see how module generation could be incorporated into our system. In particular, we would like to study how the approach of Sato and Kameyama [2021] and Sato et al. [2020], which converts code of a module into a module of code, relates to the path closures (§5.2) used in our system, which essentially require quoting and splicing of module paths.
Existing approaches to compile-time computation in OCaml. OCaml currently offers two approaches to compile-time computation. Historically, Camlp5 [de Rauglaudre 2007] (and the similar tool Camlp4) has supported preprocessing of OCaml programs by transformation of concrete syntax trees. More recently, ppx [White 2013] supports transformation of OCaml programs by abstract syntax tree rewriting. Although these tools have notable drawbacks (e.g., they guarantee neither hygiene nor type preservation), they have found many uses in practice: for example, ppx_monad provides Haskell-style "do" notation, ppx_expect provides facilities for embedding expect tests in programs, and ppx_bitstring provides bitstring pattern matching. We expect that these and other such libraries can be restructured to keep their current ppx-based interfaces while using MacoCaml's typed code representations internally, improving confidence in their correctness.

CONCLUSION
We have presented the design and implementation of MacoCaml. The MacoCaml system supports compile-time code optimization of a variety of OCaml programs, and its formalization can be used to support a novel combination of staging and macros for other languages.
In the future, we plan to extend MacoCaml with a number of features. First, we would like to allow nested quotations and splices. Nested splices raise an interesting question: how do we ensure that a nested splice is evaluated before a single splice, while evaluation is interleaved with typing? Second, we plan to enrich our module language with module subtyping as described in §5.2, as well as functors, where macros, like types, will be considered compile-time components of modules, by following and extending the phase-splitting transformation of Harper et al. [1989].
Proof. The proof is the same as that of Theorem B.2, except for the case for references. Note that during the proof we can apply the I.H. because the difference between the typing level and the evaluation level stays constant; for example, in the case of rule ev-quote, we have e = ⟨e₁⟩, and ...

Proof. By a straightforward induction on the typing derivation. Since Γ contains no locations, rule c-loc is not needed. □

Lemma B.10 (Shifted typing contains no locations). If Γ ⊢ e : τ at level n + 1 in the constrained core calculus, then e does not contain any locations.

Proof. By induction on the typing derivation. Most cases follow from the I.H. Below we discuss the interesting cases.

Fig. 1. The call relation between let definitions and macros; we use ⟨⟩ for quotation and $ for splicing.

Fig. 3. Tower of imports: the call relation between definitions across module towers. The module (b) in the middle is the module currently being compiled, and it imports two modules, in green and purple respectively, with their call relations given in (a) and (c). This figure highlights 3 modules, but actually involves 7 modules: module (b), module (a) and its two imported modules at levels 0 and -1 respectively, and module (c) and its two imported modules at levels 0 and -1 respectively. Definitions in gray are evaluated during compilation. In practice, the tower can be arbitrarily high.
Fig. 6 presents the syntax of maco. Modules M include structures (struct S end), module variables, and qualified module variables. Structures S are either empty (•), or contain module
Tower of modules. Side effects can happen across multiple modules. Suppose that we have another module, Program, that imports and uses apply: