The Next 700 ML-Enabled Compiler Optimizations

There is a growing interest in enhancing compiler optimizations with ML models, yet interactions between compilers and ML frameworks remain challenging. Some optimizations require tightly coupled models and compiler internals, raising issues with modularity, performance and framework independence. Practical deployment and transparency for the end-user are also important concerns. We propose ML-Compiler-Bridge to enable ML model development within a traditional Python framework while making end-to-end integration with an optimizing compiler possible and efficient. We evaluate it on both research and production use cases, for training and inference, over several optimization problems, multiple compilers and their versions, and gym infrastructures.


Introduction
With the success of Machine Learning (ML) models in various domains, there is a growing interest in applying ML to improve optimization heuristics in compilers [CST02, Ama20]. Several ML and Reinforcement Learning (RL) approaches have been proposed to improve optimizations like vectorization [HAAW+20, MYP+19], loop unrolling and distribution [SA05, JVA+22], function inlining [SCWK13, TQB+21], register allocation [DAK20, TQBL21, KPM22, VJK+23], prediction of phase sequences [ABP+17, HHAM+19, JAVU22], among many others [ABDS18, WO18]. More specifically, the widely used LLVM compiler [LA04] has supported RL-based inlining decisions since version 11, and RL-based eviction decisions in its register allocator since version 14 [TQBL21]. The title of our paper acknowledges this growing trend and anticipates the needs of the ML-enabled optimizations that are yet to come, in the spirit of Landin's seminal paper [Lan66] on the diversity of existing and future programming languages.
Setting up an ML-based compiler optimization is a challenging task. In addition to model design, it involves specialized data collection, compiler engineering, and packaging: 1. Preparing the training data. 2. Designing and training the model, typically within a gym infrastructure such as CompilerGym [CWG+22] or Supersonic [WTZ+22]. 3. Interfacing the model with the compiler for training. 4. Finally, building and deploying the compiler with the trained model for inference.
In most works, the process ends with step (3) and a simplified benchmark-oriented version of step (4) to evaluate the trained model. Indeed, while there exist a number of solutions for steps (1) and (2), a proper methodology and solutions for steps (3) and (4), which involve model-compiler interaction, have not yet been adequately addressed.
The diversity of compiler optimizations and ML models is associated with an equally broad range of requirements for model-compiler interaction. In Tab. 1, we illustrate this on recent proposals. There exist multiple ML frameworks and even more types of ML models. A model's input may be a plain floating-point vector, or tensors of different ranks and shapes. Outputs range from a unique Boolean decision to complex data structures. These need to be communicated with the compiler: only once for simple scenarios, or many times and involving large amounts of data for more intricate ones. And this may involve extensive source code modifications for the sole purpose of implementing the compiler-model interface.
Some of these interactions have been explored in the literature and even landed in production; however, there does not exist a single generic method to address the vast diversity of scenarios that are imaginable and the trade-offs therein. Such a situation limits the scope, applicability and effectiveness of ML for compiler optimizations in the following ways:
• Scalability: Integrating a Python model with C++ code using wrappers induces significant compile-time overhead: e.g., 6×-100× [JVA+22].
• Integration: Not all optimizations are simple enough that the outputs of the model can be communicated using flags [VJK+23, JVA+22, KPM22, TQB+21]. As ML-based optimizations deepen, such ad hoc integration tends to couple the compiler with a specific ML framework; we however believe that a generic compiler infrastructure like LLVM should remain ML-framework-independent.
The existing gym libraries primarily aim at facilitating model training for research and reproducibility by providing a high-level integration. For example, the recent CompilerGym [CWG+22] provides a high-level interface in the form of C++ wrapper methods outside the compiler to invoke out-of-tree compiler APIs to materialize the predicted actions. Such integration caters well to training certain interactions like Phase Ordering [JAVU22]. However, other optimizations like RegAlloc [VJK+23, KPM22, DAK20], loop distribution [JVA+22] and inlining [TQB+21] necessitate a deeper interfacing of the model within the compiler, with multiple rounds of interaction for both training and inference scenarios. Further, in these gym libraries, the inference flow is driven by Python: the compilation starts by invoking a Python process, breaking the isolation between the end user and the internal compiler algorithms; this limits deployment opportunities among other downsides. We discuss these issues in detail in Sec. 4.
To address these shortcomings, we propose ML-Compiler-Bridge, a library that allows ML model development within a traditional Python framework while providing tightly coupled and efficient end-to-end integration with the compiler. Our library bridges the compiler and ML model by providing a suite of communication approaches (model runners) and the related (de-)serialization mechanisms (SerDes) to cater to diverse scenarios. It also provides support for both inter- and in-process communication by exposing different model runners: gRPC and named-pipes for the former, and the TensorFlow interface and ONNX for the latter. Diverse SerDes options based on Protobuf, JSON, and native bitstreams improve efficiency and versatility. The appropriate model runner and SerDes can be chosen based on the usage scenario and requirements, and these may differ between training and inference. Our library provides C++ and Python APIs to expose model runners and SerDes for integration with compilers and ML frameworks respectively.
We show that the inter-process model runners effectively support training. Once the model is trained, the in-process model runners provide interfacing of the model within the compiler in a transparent manner, with much lower latency to aid deployment. Besides, both our model runner and SerDes modules can be easily extended to support more forms of communication and serialization. Our library also provides C APIs to aid integration with C-based compiler infrastructures like Pluto, GCC, and SQLite.
We evaluate ML-Compiler-Bridge on four ML-enabled optimizations in LLVM: RL-LoopDistribution, POSET-RL, RL4ReAl, and the Inliner. We show that our library can be integrated with other compilers like Pluto [BHRS08] and MLIR [LAB+21] with minimal effort. We study the impact of communication and serialization options on compile time under different complex scenarios that the existing infrastructures could not handle. We conduct extensive evaluations to measure the overhead caused by each model runner and SerDes. We also study the impact of integrating ML-Compiler-Bridge with LLVM in terms of additional dependencies, compile-time, and binary size overhead. Here are our contributions:
• We propose ML-Compiler-Bridge, a library to enable the deeper integration of ML models and the compiler in a framework-independent manner.
• We provide a suite of two inter- and two in-process model runners, and three (de-)serialization mechanisms (SerDes) to support different interaction scenarios.
• We provide multi-language user APIs: C++ and C APIs to interface model runners and serializers with compilers, and Python APIs to interface inter-process model runners with ML frameworks.
• We show that our library is easy to integrate with three different compilers spanning different representations, and carry out extensive evaluations on four ML-enabled optimizations on two versions of LLVM (V10, V17).


ML-enabled Compiler Optimizations

The process of supporting or fully implementing optimization decisions with one or more ML models involves the steps shown in Fig. 1. This process repeats until the end of the compilation process for each ML-based optimization. The above scheme is generic enough to capture any optimization involving single or multiple ML models with multiple two-way interactions. For the cases that need multiple interactions, steps (1)-(7) are repeated until the final outcome.
More broadly, there are three actors involved in developing and using such an ML-enabled compiler: (i) the compiler expert who develops the compiler optimization, (ii) the ML expert who designs the ML model for the optimization problem, and (iii) the end-user who uses the compiler. Ideally, compiler experts should use the ML models with minimal understanding of the internals and process specific to ML modeling and the framework on which the model is built. Similarly, ML experts should design the models with minimal or no understanding of compiler internals, infrastructural details, and integration points, focusing on the optimization objectives and information flow. For the end-user, however, the presence of ML-based compiler optimizations should be transparent and indistinguishable from the conventional (non-ML-based) compilation process. To achieve this scheme of abstraction and segregation among all three actors, it is important to distinguish between the training and inference flows.
Training. Typically, training the ML model becomes part of compiler development and build, and inference becomes part of compiler deployment and execution. However, occasionally this boundary may shift towards the user, as with domain-specific training or fine-tuning at deployment time. Since ML developers usually prefer developing models within a Python-based framework, the training process involving a C++ compiler infrastructure like LLVM requires a communication channel, typically inter-process, while catering to the needs of (de-)serializing data between the native types of C++ and Python. The distributed nature of training processes may also require extending communication beyond a single operating-system node.
Inference. When focusing on inference and deployment, compile time and ease of use become crucial factors. The communication and serialization methods involved should take this into account, along with considering converting the Python model to a streamlined C++ implementation. These factors hold even for the simplest forms of communication, like one-time evaluations of the ML model communicating via flags. Making the flow transparent to the user also requires a deeper, end-to-end integration with the compiler.
There is no tool providing the necessary layers of abstraction between the three actors while supporting the required training and inference scenarios, not to mention ML-framework independence. Designing such a library and evaluating its suitability for diverse use cases is the challenge we tackle in this paper. We propose an abstraction mechanism made of two main components: Serializer and Model Runner. The SerDes module (de-)serializes the data to/from the requested format, and the MLModelRunner module is responsible for communication with the model. The model runner obtains the serialized data, writes it to a communication channel, queries the model, and deserializes the output received from the model. ML-Compiler-Bridge exposes methods to be invoked by the user to interact with the model, decoupled from serialization and communication. We provide three framework-independent model runners, gRPC, named-pipes, and ONNX, and one framework-specific TensorFlow model runner. These can be combined with three different serializations: Protobuf, JSON, and bitstream. The modular design enables new forms of communication and serialization to be added by overriding a minimal set of methods. Fig. 2 shows the components and interactions of ML-Compiler-Bridge.

ML Model Runners
We provide two classes of model runners. The inter-process class provides the easiest mechanism to decouple Python models from a compiler running as a separate process. The in-process class assumes that the ML model is readily available in a compiled form and can be accessed within the compiler through a specific API. Clearly, in-process communication is designed with inference and deployment in mind, while inter-process communication enjoys more diverse use cases. Model runners may support simple ML queries and feedforward networks as well as more involved Reinforcement Learning (RL) algorithms or Graph Neural Networks (GNNs).
Internally, MLModelRunner is the abstract base class from which the other model runners are derived (List. 1). It exposes two APIs: populateFeatures() populates the input features, and evaluate() queries the model. The latter returns the output of the model and is templated according to the expected output type. Internally, evaluate() invokes evaluateUntyped(), which is overridden by the concrete model runner classes that derive from MLModelRunner. The MLModelRunner interfaces with the methods of SerDes via populateFeatures() so as to serialize the inputs. The method populateFeatures() is implemented as a variadic function that takes variable-length key-value pairs as arguments. The key is a string identifier that describes the input, and the value is of template type.

During training, the compiler and model interact as follows: (1) The compiler starts in server mode and blocks, listening for requests. (2) The model, acting as the client, starts training on the given input. (3) When input from the compiler is required, the model sends requests to the compiler with appropriate queries and waits for the response. (4) The compiler gets out of the blocked state and processes the query to generate an appropriate response. (5) The response is sent back to the client, and the model goes on to complete training on that input.
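The base-class pattern above can be illustrated with a minimal Python sketch. The class and method names mirror the C++ API described in the text, but the concrete DummyModelRunner and its summing "model" are purely hypothetical stand-ins:

```python
from abc import ABC, abstractmethod

class MLModelRunner(ABC):
    """Sketch of the MLModelRunner pattern: features are staged as
    key-value pairs, then evaluate() defers to a runner-specific
    evaluate_untyped()."""

    def __init__(self):
        self.features = {}

    def populate_features(self, **kwargs):
        # Variadic key-value interface: the key names the input,
        # the value may be a scalar or a tensor-like list.
        self.features.update(kwargs)

    def evaluate(self):
        return self.evaluate_untyped()

    @abstractmethod
    def evaluate_untyped(self):
        ...

class DummyModelRunner(MLModelRunner):
    # A stand-in "model" that just sums all numeric features.
    def evaluate_untyped(self):
        total = 0.0
        for value in self.features.values():
            total += sum(value) if isinstance(value, list) else value
        return total

runner = DummyModelRunner()
runner.populate_features(loop_depth=3, embedding=[0.5, 1.5])
print(runner.evaluate())  # → 5.0
```

A concrete runner only overrides evaluate_untyped(); callers interact solely through populate_features() and evaluate(), which is what keeps the compiler side agnostic to the communication mechanism.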
Inference follows the same steps, yet the compiler becomes the client and the model becomes the server so as to support a regular compilation process.
gRPC Model Runner. gRPC [WZZ93] provides RPC methods specifying the type of input and output in Protobuf format [Pro]. During the build process of the library, the proto files are automatically translated to C++ and Python code by invoking the protoc compiler. An example is shown in List. 2. The generated code defines the Service class that exposes the RPC methods to be overridden by the user in the optimization that makes use of gRPCModelRunner. Due to design constraints of gRPC, we only support Protobuf serialization with gRPCModelRunner.
gRPCModelRunner takes in the server address and the port number at which the connection is to be established. In training mode, gRPCModelRunner starts the server and listens for an RPC call invoked by the model. The overridden RPC method is directly called by the Python model to generate new observations by applying the action predicted by the model. In inference mode, gRPCModelRunner starts the gRPC connection at the given address and port. evaluateUntyped() is overridden to invoke the RPC method defined by the Python model after preparing the input data, and getAdvice() serves as the RPC method for inference.

Pipe Model Runner. As the name suggests, pipeModelRunner relies on named pipes for inter-process communication (the mkfifo system call). Pipes provide a simple and effective means of communication that is local to the machine without any network or security constraints. As pipes are unidirectional, pipeModelRunner creates separate read and write pipes for communication. The read pipe in the compiler obtains the data written by the model in Python, and the write pipe provides the data that is read by the model on the other end. The evaluateUntyped() method is overridden to read from and write into the pipes appropriately. read() is a blocking call, forcing the compiler to wait till data is written by the model. Once the data is written, the model enters a blocking state by invoking read() on the second pipe, waiting for the response from the compiler. The pipe model runner ensures proper opening, closing, and cleanup. pipeModelRunner provides a simpler interface for establishing communication, as the user directly invokes evaluate() after setting the inputs.

ONNXModelRunner for RL. In RL, the agent is usually the learner, trained to predict appropriate actions given the observations from the environment. Exporting a trained model to ONNX implies exporting only the agent. To facilitate RL-based interaction for a generic multi-agent scenario between the environment and the agents, ONNXModelRunner provides separate Environment and Agent classes; the Agent accesses the ONNX model and queries it through the ONNX C++ APIs. A map containing the identifier of each agent (label) and the corresponding model path is passed while instantiating the ONNXModelRunner. In the case of multiple agents, the identifier of the next one to use is set by the Environment while returning the observation. ONNXModelRunner queries the corresponding agent with the observation to obtain the requested action. This process goes on until the Environment invokes setDone().
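The blocking read/write handshake over named pipes described above can be sketched as follows. This is a minimal, POSIX-only illustration with our own ad hoc text protocol, not the library's actual implementation:

```python
import os
import shutil
import tempfile
import threading

# Two unidirectional FIFOs, one per direction, as in pipeModelRunner.
d = tempfile.mkdtemp()
to_model = os.path.join(d, "compiler_to_model")
to_compiler = os.path.join(d, "model_to_compiler")
os.mkfifo(to_model)
os.mkfifo(to_compiler)

def model_side():
    # The "model": blocks reading the features, then replies with a decision.
    with open(to_model) as rx:
        features = [float(x) for x in rx.read().split(",")]
    with open(to_compiler, "w") as tx:
        tx.write("1" if sum(features) > 0 else "0")

t = threading.Thread(target=model_side)
t.start()

# The "compiler": writes serialized features, then blocks on the response pipe.
with open(to_model, "w") as tx:
    tx.write("0.5,1.5,-0.25")
with open(to_compiler) as rx:
    decision = rx.read()
t.join()
shutil.rmtree(d)  # cleanup, mirroring the runner's close/cleanup duties
print(decision)  # → 1
```

Each side blocks on read() until the other writes, which is exactly the lock-step query/response pattern the text describes.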
ONNXModelRunner for plain ML models.ONNXModelRunner can also be used to query non-RL models by directly invoking the evaluate method upon instantiating the object with the path to the ONNX model.
TensorFlow Model Runners. This is a framework-specific model runner built on the TensorFlow ahead-of-time (AOT) saved model. There are two implementations: (i) the "Release Mode Model Runner" used in production environments, and (ii) the "Model Under Training Model Runner" intended either for fine-tuning or for quickly evaluating candidate models and parameters. TFLite is a scaled-down TensorFlow interpreter designed to be embedded in native binaries, and can be used to further reduce overheads.
The TensorFlow model runner uses the AOT saved-model compiler, which produces a header exposing the model as a C++ class, and a native object file with its implementation. The model runner then reduces to a simple adapter [GHJV95] around that class. The compiler binary does not expose new runtime dependencies as it is statically linked, which simplifies deployment. Note that the model compiler can be configured to generate code loading the weights from a file passed via the command line to the LLVM compiler.

SerDes: Serializer and Deserializer Module
When data is transferred, especially across two processes, it is important to convert it between the native types of C++ and Python and the formats expected on each side. This is the purpose of (de-)serialization, as implemented by the SerDes module.
Internally, the MLModelRunner interacts with SerDes to (de-)serialize C++ native data to model-specific types and back. The choice of (de-)serialization depends on the optimization and the ML model. We currently provide three options: bitstream, JSON, and Protobuf. They vary in terms of usage scenario, usage effort, and (de-)serialization time. SerDes effectively abstracts away the underlying mechanism while providing the flexibility of different serialization options.
Base SerDes. Internally, each SerDes is derived from the BaseSerDes class. SerDes uses key-value based serialization as described in Sec. 3.1. The populateFeatures() method of MLModelRunner invokes the appropriate version of the overloaded setFeature() exposed by BaseSerDes to serialize inputs. These methods are overridden by the SerDes classes that derive from BaseSerDes according to the underlying serializer. This class also exposes the deserialize() method to deserialize the received data; it is overridden by the derived classes to obtain the data in native types. Our library supports (de-)serializing basic (int, float, double, string, bool) and compound (vector, list) data types.
Protobuf SerDes. Protobuf SerDes needs the user to provide the input and output data specifications in a proto file. These are compiled to generate the C++ and Python sources (Sec. 3.1.1). ProtobufSerDes serializes the input key-value pairs by overriding the setFeature() methods to set the appropriate fields of the message described in the proto file. Deserializing Protobuf data to the native format only involves reading and returning the appropriate fields of the message. Except for providing the proto file, ProtobufSerDes is transparent to the user.

JSON SerDes. JSONSerDes overrides the setFeature() methods to populate the JSON buffer appropriately, given the key-value pairs. Similarly, the received data is deserialized by first converting it to a JSON object; the JSON fields are then cast to native types and returned. JSON SerDes is also transparent to the user.
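The key-value JSON round trip can be sketched in a few lines; the helper names below are ours, not the library's API:

```python
import json

# Sketch of key-value JSON (de-)serialization between compiler-native
# and model-native types.
def serialize(features):
    # setFeature-style: every input is a (key, value) pair.
    return json.dumps(features)

def deserialize(payload):
    obj = json.loads(payload)
    # Cast JSON fields back to native types (here: floats inside lists).
    return {k: [float(x) for x in v] if isinstance(v, list) else v
            for k, v in obj.items()}

buf = serialize({"action": 7, "embedding": [0.5, 1.5]})
print(deserialize(buf))  # → {'action': 7, 'embedding': [0.5, 1.5]}
```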
Bitstream SerDes. The bitstream starts with a JSON header which specifies the key (identifier), type and shape of the tensors, and the order in which they will be serialized. Tensor values themselves are dumped as raw bytes. The received bitstream is interpreted based on the type and shape specified in the header and converted to native types. Processing the header induces negligible overhead if the communicated data does not involve complex data types. Internally, BitstreamSerDes overrides the setFeature() methods similarly to the other SerDes to expose the functionality. Fig. 4 shows the class diagrams [GHJV95] of model runners and SerDes.
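The bitstream layout described above, a JSON header followed by raw tensor bytes, can be sketched as follows. The exact framing here (a 4-byte header-length prefix, float32 values) is our own assumption, not the library's wire format:

```python
import json
import struct

def serialize(tensors):
    # JSON header: key, type and shape of each tensor, in order.
    header = [{"key": k, "type": "float", "shape": [len(v)]}
              for k, v in tensors.items()]
    hdr = json.dumps(header).encode()
    # Tensor values are dumped as raw bytes (float32 here).
    body = b"".join(struct.pack(f"{len(v)}f", *v) for v in tensors.values())
    return struct.pack("I", len(hdr)) + hdr + body

def deserialize(stream):
    (hdr_len,) = struct.unpack_from("I", stream)
    header = json.loads(stream[4:4 + hdr_len])
    out, offset = {}, 4 + hdr_len
    for spec in header:  # interpret raw bytes using the header's type/shape
        n = spec["shape"][0]
        out[spec["key"]] = list(struct.unpack_from(f"{n}f", stream, offset))
        offset += 4 * n
    return out

blob = serialize({"weights": [1.0, 2.5], "bias": [0.5]})
print(deserialize(blob))  # → {'weights': [1.0, 2.5], 'bias': [0.5]}
```

Since the values travel as raw bytes, only the small header pays any parsing cost, which matches the negligible-overhead claim for simple data.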

C-APIs
We provide C wrappers around the C++ implementation to integrate with C-based compilers.These wrappers are C++ files written in C-style.Each method internally queries the original C++ implementation and returns results in a way compatible with C calling conventions.This code is built as a separate library that may be linked with a C-based compiler.We used it with the Pluto polyhedral compiler in particular.

Extensions
Both MLModelRunners and SerDes can be easily extended to support new model runners and serializers. New model runners may include TVM [CMJ+18] and ahead-of-time compiled PyTorch models; new serializers may include FlatBuffers [Fla] and YAML formats. New model runners can be contributed by inheriting MLModelRunner and overriding the evaluateUntyped() method accordingly. Similarly, a new (de-)serializer can be added by inheriting BaseSerDes and overriding the setFeature() and deserialize() methods specific to the new serializer.

In the following sections, we describe four ML-enabled optimizations in LLVM and their interactions with an ML model. All the components are configured, compiled and linked during the regular build process of LLVM. Integration challenges range from redesigning the entire framework of the original publication, to minor changes to the communication mechanisms.

Phase Ordering of Optimization Passes
POSET-RL predicts pass-ordering sequences to jointly optimize code size along with execution time. An RL agent is trained with the DDQN algorithm [VHGS16] to predict a subsequence as the action, given program embeddings as the input observation. There are about 15 predetermined subsequences provided by the authors. The predicted optimization subsequence is applied to the input program, and the embeddings corresponding to the transformed program are used as the new observation. This process goes on until reaching a threshold on the number of subsequences.
In the published version, the above process was not integrated within LLVM but driven from a Python model. An LLVM opt process was spawned, passing the optimization sequence through a compiler flag for each prediction by the agent. In addition, computing embeddings involved spawning yet another process to invoke IR2Vec on the .ll IR file generated by the compiler. A similar strategy was in place for training.
We revisited the above using ML-Compiler-Bridge to operate directly within LLVM as a new transformation pass. Our new PosetRL pass implements a pass manager that applies the predicted optimization sequence, and also generates the next observation by invoking IR2Vec. The MLModelRunner communicates with the model and serializes the data to be transferred. The model communicates the predicted optimization subsequence as an integer ID (one among 15) to PosetRL, and the 300-dimensional module-level embedding vector is sent to the model for the next prediction. Integrating with the ONNX model runner only amounts to extending the Environment class and overriding the step() and reset() methods. We also override setDone() to signal the end of the episode upon reaching the threshold.
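The interaction loop can be pictured with a toy sketch; the policy, the effect of applying a subsequence, and the episode threshold below are all stand-ins, not POSET-RL's actual model:

```python
# Toy sketch of the POSET-RL interaction loop.
NUM_SUBSEQUENCES = 15
EPISODE_THRESHOLD = 4  # max subsequences per module (assumed value)

def agent(observation):
    # Stand-in policy: derive an action id in [0, 15) from the embedding.
    return int(abs(sum(observation))) % NUM_SUBSEQUENCES

def apply_subsequence(observation, action):
    # Stand-in for running the pass subsequence and recomputing the
    # IR2Vec embedding of the transformed module.
    return [x + 0.01 * (action + 1) for x in observation]

observation = [0.1] * 300  # 300-D module-level embedding (IR2Vec-style)
actions = []
for _ in range(EPISODE_THRESHOLD):  # setDone() fires after the threshold
    action = agent(observation)
    actions.append(action)
    observation = apply_subsequence(observation, action)

print(len(actions), all(0 <= a < NUM_SUBSEQUENCES for a in actions))  # → 4 True
```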

Loop Distribution for Vectorization and Locality
Jain et al. [JVA+22] improve loop distribution by modeling SIMD parallelization and locality-optimization opportunities. Their approach uses two RL agents with fully-connected networks to identify the vertex processing order and when to distribute. Along with these agents, a Gated Graph Neural Network (GGNN) [LTBZ16] processes the connected components of the dependence graph, where each node holds the embeddings of the corresponding instructions.
During training, a Python driver spawns a process to invoke the Loop Distribution pass. The RL model processes the input graph and predicts the sequence of instructions to be packed together as a loop. Upon applying the prediction, the rewards indicate the effectiveness of distribution. All these steps involve model-compiler interaction via file I/O. Inference itself is integrated with LLVM using Python wrappers.
In this paper, we eliminate the need for Python wrappers, file I/O, and spawning new processes. The model runners internally (de-)serialize data depending on the chosen SerDes and MLModelRunner. For the runners that use serialization, the input graph is represented as key-value pairs, and a variable-length matrix in R^{n×300} encodes the sequence of n 300-D instruction embeddings. The output takes the form of a variable-length integer array with node identifiers that are to be distributed.

RL-Based Register Allocation
We also evaluate RL4ReAl, an RL-based register allocator implementing the splitting, coloring, and spilling sub-tasks as separate RL agents on LLVM's Machine IR. These RL agents pose a formidable engineering challenge in interfacing the model with the compiler during both training and inference. Unlike other optimizations that need one single communication at the end, RL4ReAl involves multiple interleaved communication rounds to obtain a new observation and let the relevant agent make the next prediction. Also, the RL agents are arranged hierarchically: the outcome of one agent determines which agent is invoked next. Unlike other use cases, this optimization involves transferring an interference graph where each variable is associated with an R^{n×100} matrix, each of the n instructions in the live range of the variable being represented in 100-D, together with a variable-length integer array specifying interferences and use points, and a variable-length floating-point array of spill weights. Other metadata like function name, file name, and status are also sent as string fields. The model returns key-value pairs mapping variables to split or color decisions. Both training and inference use gRPC and Protobuf serialization.
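The per-function payload described above can be pictured as a nested key-value structure; all field names below are hypothetical, only the shapes follow the description:

```python
# Hypothetical field names; shapes follow the text: per-variable
# R^{n x 100} live-range matrices, variable-length integer arrays for
# interferences/use points, float spill weights, plus string metadata.
def make_observation(func, variables):
    return {
        "function_name": func,
        "file_name": func + ".ll",
        "status": "ready",
        "nodes": {
            var: {
                "embedding": [[0.0] * 100 for _ in range(n_instrs)],
                "interferences": interferences,
                "spill_weight": weight,
            }
            for var, (n_instrs, interferences, weight) in variables.items()
        },
    }

obs = make_observation("foo", {"v0": (3, [1, 2], 0.75), "v1": (1, [0], 0.1)})
print(len(obs["nodes"]["v0"]["embedding"]))  # → 3
```

Each variable's matrix has one 100-D row per instruction in its live range, which is why the payload is variable-length and needs a structured serializer rather than flat flags.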
We investigate different communication and serialization improvements, with specialized scenarios for distributed training and deployment-friendly inference.

LLVM Inliner
The inliner pass traverses call sites in a bottom-up fashion, one connected component of functions at a time. For a given component, a work queue is initialized with the set of all static call sites. As the algorithm marks some call sites for inlining, it appends the inlined callee's call sites to the work queue. The decision to inline or not is made in two steps. First, the pass determines legality and whether the user provided any guidance (always/never inline). Only if the operation is legal and non-mandatory does a heuristic determine its profitability.
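The work-queue algorithm can be sketched as follows, with a trivial stand-in for the profitability model:

```python
from collections import deque

# Sketch of the bottom-up inliner work queue described above; the
# profitability "model" is a stand-in for the RL policy.
def run_inliner(call_sites, callee_sites, legal, profitable):
    queue = deque(call_sites)
    inlined = []
    while queue:
        site = queue.popleft()
        if not legal(site):        # legality / always-never guidance first
            continue
        if profitable(site):       # only then consult the heuristic/model
            inlined.append(site)
            # Inlining exposes the callee's own call sites:
            queue.extend(callee_sites.get(site, []))
    return inlined

callee_sites = {"a->b": ["b->c"]}
result = run_inliner(["a->b"], callee_sites,
                     legal=lambda s: True,
                     profitable=lambda s: s != "b->c")
print(result)  # → ['a->b']
```

The queue growth on each inlining decision is what makes the traversal bottom-up within a component: newly exposed call sites are revisited with the same two-step legality/profitability check.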
The decision is driven by a simple RL-based model. It takes a number of scalar features characterizing the caller and callee (instruction counts, basic block counts, maximum loop depth), the call site itself (the number of compile-time constant parameters), as well as module-wide features (the current number of functions and statically known call edges). For the published version [TQB+21], the cost metric was size, with no reliance on dynamic profile data. The implementation uses an AOT-compiled TensorFlow model for inference with C++ APIs. We modularized it to use any model runner.

Evaluation
We measure compilation time on an Intel Xeon SkyLake W2133 with 6 cores, 12 threads, and 32GB RAM. Training time is measured on an Intel Xeon W1390P with 8 cores, 16 threads, 64GB RAM, and an Nvidia 3060 GPU. We evaluate POSET-RL, RL-LoopDistribution and RL4ReAl with gRPC, Pipe and ONNX model runners and different SerDes options, and take the median of 3 runs. Most experiments use SPEC CPU 2006 and SPEC CPU 2017 benchmarks.

Impact on Deployment
Tab. 2 shows the POSET-RL compile time using different model runners. Among the in-process runners, we use ONNX for PyTorch and RLlib models. Overall, in-process runners achieve better compile times in all cases in comparison with any of the inter-process ones. Among the latter, gRPC has higher compile times (6.8-7.6%) compared to pipes with JSON and bitstream SerDes. This is because of the overheads associated with establishing connections and invoking RPC methods. Pipes with bitstream SerDes yield slightly better performance than with JSON SerDes due to the lower (de-)serialization overhead of bitstreams. The ONNXModelRunner yields a 7.2× speedup on POSET-RL compared to the original method of Sec. 4.1, which involved spawning new processes to invoke the compiler and other dependencies.
In-process model runners natively support multithreaded compilation, while inter-process model runners necessitate concurrently running multiple instances of the model, resulting in a high memory and compute overhead. Tab. 3 shows compile times with in-process model runners on the LLVM Inliner and RL4ReAl optimizations while varying the degree of parallelism. The LLVM Inliner and RL4ReAl are respectively evaluated on benchmarks from [Bou] and the LLVM Test Suite [LO]. The ONNX model runner yields an improvement of 16× in comparison to the original Python wrapper.

Impact on Training
In this section, we evaluate the effectiveness of ML-Compiler-Bridge during the training of POSET-RL and RL4ReAl.We use inter-process model runners for training.

5.2.1
Training Time. Fig. 5(a) shows the cumulative training time and the number of training iterations observed for POSET-RL. We obtain large improvements in training time across all the model runners. We see similar trends between gRPC and pipes, as explained in the previous experiment.
The original training process of POSET-RL involves spawning processes and takes ≈10Ks to complete 500 iterations. In comparison, the gRPC model runner takes about 5.7Ks, while pipes with the JSON and bitstream serialization options take about 5.5Ks each. Across the iterations, we observe an overhead of about 20s of JSON over bitstream serialization. This minimal overhead is associated with the additional serialization effort involved in using JSON SerDes. Overall, using the inter-process model runners enables an end-to-end integration of the model and the compiler during training, yielding a significant improvement.

5.2.2
Multi-Worker Support. ML-Compiler-Bridge supports multi-worker training on both CPUs and GPUs. To support multiple workers while using gRPC, we expose a method taking an array of ports to establish connections with each worker. Similarly, multi-worker support with pipes is enabled by instantiating one pair of pipes per worker. We extended RL4ReAl to handle multi-worker scenarios; training times are shown in Fig. 5(b) for CPU and GPU workers. Using 10 workers with a GPU trainer takes about 2 seconds per episode, while a CPU trainer with <10, 5, 1> workers takes <4s, 8s, 15s> respectively. We obtained similar trends among the workers even when using pipes for communication.

5.2.3
Using Different RL Policies. One may train and deploy models with different RL policies without impacting the compiler. For this experiment, we evaluate RL4ReAl with the different RL policies provided by RLlib. We perform hyperparameter tuning using Tune [LLN+18b]. We trained the models with the PPO [SWD+17], APPO [SWD+17], and A2C [MBM+16] policies until convergence. On the SPEC CPU 2017 benchmarks, this resulted in a 2% improvement on average using the APPO policy. The PPO and A2C policies perform similarly to the original paper.

Round-Trip Time
Let us finally isolate the Round-Trip Time (RTT) of each model runner as a limit study of the achievable communication throughput. We consider random floating-point vectors of increasing length, ranging from 500 to 50K elements in steps of 500. The model itself is a single fully-connected layer that consumes the vector and returns a scalar float. Fig. 5(c) shows the RTT of the whole process. The TF and ONNX runners achieve a very high throughput with total RTTs of 21ms and 68ms respectively; Pipes+JSON and Pipes+Bitstream yield 3154ms and 772ms respectively, and gRPC yields a larger RTT of 5948ms. These differences can be attributed to serialization and communication overheads. The TF and ONNX runners benefit from in-process communication, making them suitable candidates for deployment; the higher throughput of TF is due to the AOT-precompiled model. The Pipe runner is a good candidate for training on local machines, and the gRPC runner supports training in a distributed environment. This makes each model runner important in its own way.
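The limit study can be mimicked with a toy in-process setup: a single fully-connected layer (reduced here to a plain dot product) consumes growing float vectors, and we time one serialize-evaluate-deserialize cycle per size. The sizes and the model are illustrative only, not the paper's actual benchmark harness:

```python
import json
import time

def dense_layer(x, w):
    """Single fully-connected layer reduced to a dot product returning one float."""
    return sum(a * b for a, b in zip(x, w))

def round_trip(n):
    """One serialize -> evaluate -> deserialize cycle for an n-element vector."""
    x = [0.5] * n
    w = [1.0] * n
    t0 = time.perf_counter()
    payload = json.dumps(x)              # stand-in for the SerDes step
    y = dense_layer(json.loads(payload), w)
    _ = json.dumps(y)                    # serialize the scalar decision back
    return y, time.perf_counter() - t0

# Sweep a few payload sizes, as in the 500..50K-element study
for n in (500, 5000, 50000):
    y, rtt = round_trip(n)
    print(n, y, f"{rtt * 1e3:.2f} ms")
```

Even in this toy form, the serialization step dominates the model evaluation as the vector grows, which is exactly why the in-process TF and ONNX runners, which skip it, come out ahead.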

Gym Integration
We carried out additional experiments to evaluate the benefits of our library in the context of a state-of-the-art RL gym. The two goals are to facilitate deployment and to reduce compilation time by using in-process model runners. For this purpose, we trained the pass ordering for code size of CompilerGym [CWG + 22] and exported the resulting model in the ONNX format. We then used our ONNX model runner within LLVM to materialize predictions and generate code. The inference times are shown in Fig. 6, with speedups ranging from 2× to 13×, primarily due to the gRPC overheads in CompilerGym, as shown in Fig. 5(c).

Domain-Specific Compilers
Integration with MLIR. Given LLVM's dominance in the general-purpose and backend compiler landscape, MLIR forms a natural next target for integration. As end-to-end ML compilers based on MLIR are still undergoing rapid changes [Tea23a], we designed a simple experiment to demonstrate the integration of ML-Compiler-Bridge with MLIR. We wrote a custom MLIR pass that communicates data with a dummy ML model to mimic a typical ML-compiler interaction. We use the same experimental setup as discussed in Sec. 5.3 and measure the round-trip time; the results are shown in Fig. 5(c). This opens up ML-based optimizations in MLIR-native compilers such as IREE and OpenXLA [Tea23a], Triton [Tea23b], Polygeist [MCZZ21], and many other frameworks.
Integration with Pluto. We also experimented with Pluto, a polyhedral source-to-source compiler. As Pluto is written in C, we use the C APIs of ML-Compiler-Bridge to interface with the models, illustrating the Pipe model runner and SerDes. We measured round-trip times with the different SerDes options and report them in Fig. 5(e). This integration opens new opportunities for ML-based polyhedral optimizations, including autoscheduling and tile-size selection.
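A named-pipe exchange of the kind the Pipe model runner performs can be sketched as follows (paths, message format, and the model's decision are all illustrative; the library's real SerDes formats differ):

```python
import json
import os
import tempfile
import threading

def model_side(path_in, path_out):
    """Plays the Python model's role: read an observation, write back an action."""
    with open(path_in, "r") as f:
        obs = json.loads(f.read())
    with open(path_out, "w") as f:
        f.write(json.dumps({"action": max(obs["features"])}))

tmp = tempfile.mkdtemp()
to_model = os.path.join(tmp, "compiler_to_model")
to_compiler = os.path.join(tmp, "model_to_compiler")
os.mkfifo(to_model)      # one named pipe per direction
os.mkfifo(to_compiler)

t = threading.Thread(target=model_side, args=(to_model, to_compiler))
t.start()

# Compiler side: serialize the observation, then block on the reply
with open(to_model, "w") as f:
    f.write(json.dumps({"features": [3.0, 1.0, 2.0]}))
with open(to_compiler, "r") as f:
    reply = json.loads(f.read())
t.join()
print(reply)  # {'action': 3.0}
```

The two FIFOs give a simple blocking request/response discipline with no server process, which is why pipes suit single-machine training while gRPC is reserved for distributed setups.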

Discussion
Let us now study the ease of integrating ML-Compiler-Bridge with compiler optimizations.

Lines of Code
In Tab. 4, we show the number of additional Lines of Code (LOC) needed to integrate ML-Compiler-Bridge with different compiler optimizations. We observe a significant reduction in LOC compared to the original published works. We do not compare with the published version of POSET-RL, as its model was not integrated with the compiler. With Loop Distribution and RL4ReAl, the effort of writing Python wrappers and invoking protobuf and gRPC is completely removed. Among the available model runners and SerDes, only gRPC, ONNX, and Protobuf involve (small amounts of) additional code to handle RPC, the environment, and Protobuf messages. Notably, ML-Compiler-Bridge removes the tedious work of managing dependencies like gRPC and Python wrappers that was otherwise necessary.
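The integration pattern that keeps the LOC count low reduces to roughly three calls at the optimization site: construct a runner, hand it the features, ask for a decision. The sketch below mocks this interface in Python (class and method names are illustrative; the library's real API is C++ and may differ):

```python
class MockModelRunner:
    """Minimal stand-in for an in-process model runner."""
    def __init__(self, model):
        self.model = model
        self.features = {}

    def set_features(self, **features):  # 1. hand the observation to the runner
        self.features = features

    def evaluate(self):                  # 2. query the model for a decision
        return self.model(self.features)

# A dummy "model": unroll only if the loop body is small (hypothetical heuristic)
runner = MockModelRunner(lambda f: f["body_size"] < 16)
runner.set_features(body_size=8, trip_count=100)
decision = runner.evaluate()             # 3. the pass consumes this decision
print(decision)
```

Everything behind `evaluate()`, such as serialization, transport, and deserialization, is the library's concern, which is why swapping runners or SerDes needs no change at the call site.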

Impact on binary size, compile time and memory
In Tab. 5, we show the compile time, binary size, and average resident set size (RSS) during the compilation of Clang 10 with and without ML-Compiler-Bridge. The difference in binary size is ≈80KB, the average RSS differs by 400KB, and the release build time increases only by a few seconds. ML-Compiler-Bridge thus incurs only a negligible overhead in binary size, compile time, and memory when statically linked with the production version of Clang.

Limitations
As mentioned earlier, not all model runners are compatible with all ML models due to the nature of the underlying libraries. For instance, TensorFlow AOT compilation supports any TensorFlow or JAX model, but not PyTorch models. Also, upon exporting the inliner model from TensorFlow to ONNX, we encountered an operator (TFL-Bucketize 1 ) that is not supported by ONNX. To handle such cases, the ONNX runtime allows registering custom operators; once exported, the models can be used seamlessly without restriction. Similarly, protobuf does not natively support a C runtime, so our C APIs do not support using the gRPC model runner with protobuf serialization. The current TF AOT compilation generates C++ code, making it not directly usable from C; this can be mitigated by using the TF C APIs instead of AOT models.
1 https://www.tensorflow.org/mlir/tfl_ops#tflbucketize_tflbucketizeop

Related Work
RL environments for compilers come closest to our work, such as CompilerGym [CWG + 22], PolyGym [BGC21], and Supersonic [WTZ + 22]. These primarily aim at facilitating research and reproducibility, which are only two of the broader ambitions of our work (e.g., deployment, a programmable compiler interface, finer-grained interaction). CompilerGym internally calls the compiler APIs from a C++ wrapper, and the communication between the Python model and the wrapper is established by predefined gRPC methods. This limits the functionality to the APIs supported by the library and to the particular compiler version with which the library is compatible. Supersonic [WTZ + 22] also interfaces via gRPC in the CompilerGym way, and, to our understanding, PolyGym [BGC21] does not provide a programmable compiler interface.
The gym libraries and ML-Compiler-Bridge solve different problems: the former facilitate research and training, while our library provides a variety of interfaces for communication. We envision ML-Compiler-Bridge complementing these gym environments by offering more diverse, finer-grained, and framework-independent interfacing of ML models with compilers, facilitating the transition from research to production.

Conclusions
We present ML-Compiler-Bridge, a modular and extensible library to integrate ML models within compiler optimizations. It provides inter- and in-process model runners with different serialization options to support both training and deployment scenarios. We show that a model and a compiler pass can be integrated with only 3 lines of code, while also enabling very deep interleaving of RL-based algorithms like RL4ReAl, as well as leaner and production-friendly optimizations like function inlining.
Our library exposes C++/C and Python APIs for integration with compilers and ML frameworks respectively. We considered multiple ML frameworks (TensorFlow, PyTorch, RLlib), both feature-based and embedding-based representations, and multiple compilers (and versions) written in different languages to show the versatility and suitability of ML-Compiler-Bridge in research and production environments. We will open-source the library and artifacts with extensive documentation.

Figure 2 .
Figure 2. The compiler instantiates a model runner and sets the input features to be used by the model. MLModelRunner internally invokes SerDes to serialize the data in one of the supported formats and query the model. The returned decision is deserialized and provided to the optimization.
// Exposed to the user; returns the model's output
template <typename T> T evaluate() {
  return *reinterpret_cast<T *>(evaluateUntyped());
}
// To be overridden by derived classes
virtual void *evaluateUntyped() = 0;
};
3.1.1 Inter-process Model Runners. gRPCModelRunner uses gRPC and may run the model and the compiler on different machines, while pipeModelRunner uses named pipes for single-machine scenarios only. At training time, the compiler acts as a server and the Python-based ML model acts as a client. The sequence of steps is as follows: (1) Compilation starts and the compiler listens for queries at the wait() call inserted at the point of interest. (2) The Python model starts training; this can be started concurrently with Step (1).
// RPC function for model evaluation
// Blocking call; released upon receiving a response
rpc getAdvice(Observation) returns (Action) {}
}
Listing 2. Example gRPC function declaration

Figure 3 .
Figure 3. Sequence diagram indicating different events and the interaction between various classes for RL-based optimization by ONNXModelRunner. Only the highlighted methods are to be overridden by the user; other methods are internal to the library.

void *ONNXModelRunner::evaluateUntyped() {
  Observation obs = this->env->reset();
  while (true) {
    Action action;
    // Query the current agent
    auto current_agent = this->agents[this->env->getNextAgent()];
    action = current_agent->computeAction(obs);
    obs = this->env->step(action);
    if (this->env->checkDone())
      break;
  }
}
Listing 3. Snippet from ONNXModelRunner showing the environment-agent interaction to generate an observation.
The sequence of events describing this interaction is shown in Fig. 3. ONNXModelRunner exposes the Environment class with APIs for the standard step and reset operations, along with a setDone() API to indicate the end of the episode. step() returns the next observation given an action: internally, it applies the action predicted by the agent (model) to move to the next state and returns the new observation from the environment. step() also signals when the terminal state is reached by invoking setDone() to stop the current prediction. reset() resets the environment to the initial state and returns the initial observation; hence ONNXModelRunner invokes reset() first to obtain the initial observation. This sequence of APIs is invoked within evaluateUntyped() of ONNXModelRunner, as shown in Listing 3. An optimization pass using ONNXModelRunner should inherit from Environment and override step() and reset() depending on the optimization requirements.
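The Environment contract described above can be mocked in a few lines. This is a Python stand-in (the real interface is C++), with a toy episode whose names and logic are purely illustrative:

```python
class Environment:
    """Stand-in for the library's Environment: step/reset plus a done flag."""
    def __init__(self):
        self.done = False

    def set_done(self):
        self.done = True

    def reset(self):
        raise NotImplementedError

    def step(self, action):
        raise NotImplementedError

class CountdownEnv(Environment):
    """Toy optimization: the episode ends once the counter reaches zero."""
    def reset(self):
        self.done = False
        self.state = 3
        return self.state

    def step(self, action):
        self.state -= action
        if self.state <= 0:
            self.set_done()
        return self.state

# Mirrors ONNXModelRunner::evaluateUntyped(): reset first, then step until done
env = CountdownEnv()
obs = env.reset()
while not env.done:
    obs = env.step(1)  # a real agent would compute this action from obs
print(obs)
```

The driving loop lives in the runner, so a pass only supplies the two overrides, which is what keeps the integration surface small.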

Table 1 .
Diverse ML and RL requirements in previous work; unknown or unclear ones are left blank.
ML models are typically written in Python across different frameworks like TensorFlow, JAX, PyTorch, etc.; expecting the model to be written in C++ within the compiler is not ML-developer-friendly.
• Portability: Several proposals involve a tight coupling

Table 3 .
Multithreaded compile time with -O3 (in s) using in-process model runners. Compile time with gRPC is shown for RL4ReAl for comparison.

Table 4 .
LOC to integrate model runners. gRPC shows LOC for API calls and RPC; values in parentheses indicate LOC in the protobuf specification. Other SerDes do not need any additional code.

Table 5 .
Comparison of the time taken to build Clang and the final binary size with/without ML-Compiler-Bridge.

We updated the build system of Clang 10 to use C++17 and fixed the issues arising from migrating the earlier experiments on POSET-RL, RL-LoopDistribution, and RL4ReAl. We were able to use Clang 17 for the Inliner experiments. Though ML-Compiler-Bridge itself does not introduce any dependency, model runners do: gRPCModelRunner, ONNXModelRunner, and ProtobufSerDes require gRPC, the ONNX C++ Runtime, and Protobuf setups respectively.

6.4 Characterization
As discussed earlier, different model runners exhibit different characteristics. During deployment, neither of the inter-process model runners offers multi-threaded compilation when running a single model instance; this could be addressed by instantiating multiple model instances, but that would consume unreasonable amounts of memory. The in-process model runners do not face this problem. Though the gRPC and pipe model runners involve a separate serialization overhead, it is handled automatically without developer involvement. Due to the nature of inter-process communication, communication errors may arise from network and compiler crashes; we handle such cases as explained in Sec. 3.5. We summarize these characteristics in Tab. 6.

Table 6 .
Characteristics of different model runners