Zoetrope Genetic Programming for Regression

The Zoetrope Genetic Programming (ZGP) algorithm is based on an original representation for mathematical expressions, targeting evolutionary symbolic regression. The zoetropic representation uses repeated fusion operations between partial expressions, starting from the terminal set. Repeated fusions within an individual gradually generate more complex expressions, ending up in what can be viewed as new features. These features are then linearly combined to best fit the training data. ZGP individuals then undergo specific crossover and mutation operators, and selection takes place between parents and offspring. ZGP is validated on a large number of public domain regression datasets, and compared to other symbolic regression algorithms as well as to traditional machine learning algorithms. ZGP reaches state-of-the-art performance with respect to both types of algorithms, and demonstrates a low computational time compared to other symbolic regression approaches.


Introduction
Symbolic Regression (SR) is a supervised learning approach that consists in searching through a vast space of predictive models.
This space encompasses the rigid linear and polynomial models by enabling other transformations such as trigonometric and logarithmic ones, as well as the generalized additive models by allowing (linear and) nonlinear combinations of the transformed variables (see [37] and references therein). SR models are often represented via expression trees, and may include decision tree models if we consider equality and inequality operators instead of functions (see e.g. the Boolean multiplexer in [16]). Models can also be unravelled through mathematical formulae, which makes them much more interpretable than other tree- or network-based machine learning algorithms such as random forests [6] or neural networks where, with few recent exceptions [14], the relationships among the variables remain hidden. SR thus offers a good tradeoff between flexibility and interpretability. Moreover, it does not need large numbers of observations as in deep neural networks, and can be applied to smaller datasets (typically from several dozens to tens of thousands of observations). The history of SR is closely related to that of Genetic Programming (GP), starting with the early works of Koza [16,20]. Indeed, most of the approaches for SR are GP algorithms (often denoted GPSR), as these offer a nice framework with expression trees representing the potential models. GPSR algorithms start with a pool of initial models, which are then iteratively and randomly perturbed to create new ones, until the one that fits the data best is finally selected. Variants of this scheme are discussed in [34], whereas newer approaches are listed in [35]. On the contrary, "standard" machine learning algorithms are not well suited for the complex task of optimizing both parameters and model shape at the same time [28,36]. Several exceptions are worth noting, combining the idea of expression trees with classical ML [22] or with neural networks [14].
While SR is very present in the field of Evolutionary Algorithms (EA) [22], it is almost completely absent from Machine Learning (ML) reference books [12,5] and toolboxes [27]. Possible reasons could be the following: first, there are many SR algorithms in the literature, each offering various advantages [36], and it might be difficult to know which one to use and how to tune their often large number of parameters; second, these algorithms are often slower than most ML algorithms, with performance that did not match that of e.g. random forests until recently; finally, earlier works mostly tested symbolic regression on synthetic data with a known equation involving few input variables, with the aim of recovering exactly this equation [15], and not often on real datasets with more than 5 variables. For a critical view on SR benchmarks we refer the reader to [23,25] and references therein.
These limitations have been overcome in the last decade, at least partially. The issue of computational time has been treated by Geometric Semantic Genetic Programming (GSGP) 2.0 [7] with the proposition of a very efficient algorithm. However, this efficiency comes at the cost of interpretability, as the use of geometric semantic variation operators results in exponentially growing trees. The performance of GPSR has been increased for instance by the combination of GP with more standard ML approaches [2]. Finally, novel benchmarks were established lately that also compare SR and classical ML algorithms on real datasets [26,38,1]. These benchmarks show that the performance of random forests can be matched by increasing the number of individuals and generations, considerably slowing down the computations. So the issue remains for GPSR to get good performance in a reasonable time without losing its characteristic interpretability. A recent and promising work has been proposed in that sense [18], as well as our own work that we present here.
In this article we present a new GP algorithm called Zoetrope Genetic Programming (ZGP), which brings the following main contributions: (1) a new and unseen representation of models, allowing fast computation and feature engineering, while keeping the interpretability advantage of most SR methods through the explicit model formula; (2) novel mutation and crossover processes, leading to improvement of models over the generations; (3) performance that is comparable to the best ML (Gradient Boosting) and SR algorithms. While ZGP can handle the three main supervised learning tasks, namely regression, binary classification and multiclass classification, we focus here only on regression. The paper is organized as follows. Section 2 briefly presents the general context and introduces GPSR state-of-the-art frameworks which are related to our work. Section 3 describes the entire ZGP algorithm, with its uncommon representation of individuals and variation operators, as well as the choices for fitness and cost. Section 4 presents and discusses the results of our benchmark against state-of-the-art SR frameworks and ML algorithms on 98 regression datasets. Finally, we conclude on our main contributions in Section 5, and suggest potential future work.

Context
Given a dataset D made of i.i.d. observations (x, y) ∈ R^d × R with x = (x_1, ..., x_d), the goal of regression algorithms is to find a function M : R^d → R modelling the link between x and y, such that it generalizes well on unseen data from the same distribution.
As is common in regression problems, as performance measure for M on a dataset D we use the Mean Squared Error (MSE) defined by:

MSE(M, D) = (1/|D|) · Σ_{(x,y) ∈ D} (y − M(x))²

Any dataset D used in this work will be divided into training (D_tr), validation (D_val) and test, or holdout (D_h) sets. The holdout set D_h is never to be seen during the learning procedure, and is only used to assess the final performance of the model. The uses of the training and validation sets are detailed in Section 3.2.
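As an illustration, the performance measure and the three-way split described above can be sketched as follows (a minimal sketch; the function names and the split fractions are ours, not part of ZGP):

```python
import numpy as np

def mse(model, X, y):
    """Mean squared error of a callable model on a dataset (X, y)."""
    preds = np.array([model(x) for x in X])
    return float(np.mean((preds - np.asarray(y)) ** 2))

def train_val_holdout_split(X, y, rng, f_train=0.5, f_val=0.25):
    """Shuffle a dataset and split it into training, validation and
    holdout parts; the holdout part is only used for final assessment."""
    X, y = np.asarray(X), np.asarray(y)
    idx = rng.permutation(len(X))
    n_tr = int(f_train * len(X))
    n_val = int(f_val * len(X))
    tr, val, ho = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
    return (X[tr], y[tr]), (X[val], y[val]), (X[ho], y[ho])
```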

Related work
As mentioned above, several SR frameworks have been recently proposed and already compared to classical ML algorithms. First, some SR techniques are not based on an evolutionary process. For example, Fast Function Extraction (FFX) [22] only generates a large set of linear and nonlinear features and then fits a linear model on the features using elastic net [39]. While the deterministic part can be attractive to avoid getting different models from one run to another, it turns out that FFX often results in much larger models than conventional GP. Evolutionary Feature Synthesis (EFS) [3] uses a similar idea, but avoids building the entire basis by randomly generating the features. It is however not a GP algorithm. The idea of linearly combining branches of a tree is also very present in GP, as it allows the construction of new features. Multiple Regression Genetic Programming (MRGP) [2] combines all the possible subtrees of a tree through LASSO [33], thereby decoupling the linear regression from the construction of a tree. More recently, La Cava et al. developed the Feature Engineering Automation Tool (FEAT) [18], which trades conciseness for accuracy. It is a stochastic optimization providing a succinct syntactic representation with variable dependencies explicitly shown (in contrast to the semantic approach [29]). Another related recent work is the Interaction-Transformation Evolutionary Algorithm (ITEA) [9], which builds generalized additive models including interactions between variables. The efficiency of GP has been another direction of study. Geometric Semantic Genetic Programming (GSGP) [24] is a technique combining trees to get new individuals, and adds semantic methods for crossover and mutation in order to introduce a degree of 'awareness'. However, in GSGP the generated individuals are larger than their parents, resulting in large bloat and longer computing times.
This is addressed by using a practical development environment, GSGP-C++ [7], with operators in native C++. Finally, other frameworks propose efficient selection techniques. Age-Fitness Pareto Optimization (AFP) [31] is meant to prevent premature convergence in evolutionary algorithms by including age as an optimization criterion, using a Pareto front between age and fitness. This allows younger individuals to compete with older and fitter ones. Also, ε-lexicase selection (EPLEX) [19] performs parent selection according to fitness on a few random training examples, dropping all the population individuals with error higher than the best error. This selection technique is used in FEAT.
With respect to the above works, ZGP proposes two novelties. First, ZGP uses a parametric representation for its models. Second, within its complex genotype-to-phenotype mapping, ZGP borrows from Geometric Semantic Crossover [24], and thus compensates for the potential limitations of a fixed representation by creating a richer set of smoother trajectories in the space of all possible analytical expressions / programs. Furthermore, this process sets a strict bound on the complexity of the resulting expressions, and thus limits the bloat.

The ZGP algorithm
The Zoetrope Genetic Programming (ZGP) algorithm is based on the original zoetropic representation for programs, together with the corresponding variation operators (crossover and mutation). However, the "natural selection" components of all evolutionary algorithms are here directly incorporated into the variation operators (i.e., selection takes place between the parents and their offspring only). Furthermore, ZGP uses evolutionary components to build possible branches of a regression tree, and standard ML techniques to optimize the combination of those branches.

The Zoetropic Representation
This section describes both the genotype and the genotype-to-phenotype mapping of ZGP individuals. As in standard tree-based GP [16,4], ZGP individuals are built from a set of unary or binary operators O and a set of terminals T, the variables of the problem and ephemeral constants. The genotype of a ZGP individual is built using m elements (partial expressions built on T and O) and fusion operations (see below). Two parameters control the size of the genotype as well as the derivation of the corresponding phenotype (the final expression used to evaluate the fitness of the individual): the number of initial elements m and the number of maturation stages s. An individual is built as follows:

Overview and notations The elements used during the process can be seen as organized in levels, one per maturation step. The elements of level j, denoted (e^j_1, ..., e^j_m), are constructed by maturation step j from the elements of level j−1. However, due to boundary conditions depending on the parity of m, it is more convenient to visualize all the elements in a circle (reminiscent of the original zoetrope mechanism). Figure 1 illustrates the creation process; for the sake of simplicity, the elements of levels 0, 1, 2, 3 are there denoted e, e', e'' and z respectively (explanations below).
Initialization The elements (e^0_1, ..., e^0_m) are randomly drawn from T, each being a uniformly chosen variable with 90% probability, or an ephemeral constant with 10% probability. The latter are uniformly drawn in [c_min, c_max], for some user-defined parameters c_min and c_max.

Fusion A fusion combines two elements a and b into a new one as a weighted mix of two operators op_1, op_2 drawn from O: F(a, b) = w · op_1(a, b) + (1 − w) · op_2(a, b), with weight w uniformly drawn in [0, 1] (Equation (2)). Note that these fusion operations, and in particular Equation (2), are somehow similar to the Geometric Semantic Crossover [24]. But the linear combination with random weight w is done here at the level of simple operators, not subtrees, and during the genotype-to-phenotype mapping, not during crossover. In both situations, this results in a smoother landscape than only allowing blunt choices between one or the other argument, offering more transitional states to the evolutionary process.
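The fusion operation can be sketched as follows, under the reading that a fusion mixes two operators with a random weight, as Equation (2) and the weight mutation of Section 3 describe (the toy operator set and the function names are ours):

```python
import math
import random

# A toy operator set O; unary operators ignore their second argument.
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "cos": lambda a, b: math.cos(a),
}

def random_fusion(rng):
    """Draw the parameters of a fusion: two operators from O and a weight."""
    return rng.choice(sorted(OPS)), rng.choice(sorted(OPS)), rng.random()

def apply_fusion(fusion, a, b):
    """F(a, b) = w * op1(a, b) + (1 - w) * op2(a, b)."""
    op1, op2, w = fusion
    return w * OPS[op1](a, b) + (1 - w) * OPS[op2](a, b)
```

Mutating w slides the result continuously between op1 and op2, which is the "smoother landscape" argument made above.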
Maturation The jth maturation step, or stage, consists of the sequence of ⌊m/2⌋ fusions defining the elements e^j from pairs of elements e^{j−1}.
The Zoetrope model After s maturation steps, the elements of level s, called "zoetropes", are linearly combined to obtain the final model (see Section 3.2.1).
Complexity analysis There are n_f = s · ⌊m/2⌋ + (m mod 2) fusions in total. At each fusion, the size of the elements increases by the application of Eq. (2). In terms of standard GP indicators (though we never express the ZGP models as trees), the depth of F(a, b) is three more than the maximum depth of a and b. Hence the depth of the zoetropes is at most 3s + 1. The linear combination applied to the zoetropes using the m-ary addition operator can be viewed as adding two more levels of depth. In particular, because all created individuals use the same template, the complexity of any ZGP model remains bounded. Therefore, ZGP individuals are not subject to uncontrolled bloat.

Figure 1: Illustration of the zoetropic representation building: for m = s = 3, there are n_f = 4 fusions in total; for the sake of readability, the third one, generating (e''_2, e''_3) from (e'_2, e'_3) and taking place between the center and right figures, is not represented. Note that z_3 = e''_3, as no element is left for a fusion.
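These counts can be written down directly (a sketch; the extra fusion for odd m follows our reading of Figure 1, where m = s = 3 yields 4 fusions):

```python
def n_fusions(m, s):
    """Total number of fusions for m initial elements and s maturation
    stages: floor(m/2) fusions per stage, plus one extra when m is odd."""
    return s * (m // 2) + m % 2

def max_zoetrope_depth(s):
    """Each fusion adds at most three levels of depth to its arguments,
    starting from terminals of depth 1, hence the 3s + 1 bound."""
    return 3 * s + 1
```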

Combination of zoetropes
As said in the previous section, at the end of all fusions, the zoetropes (z_1, ..., z_m) are combined to obtain the full model as:

M(x) = β_1 z_1(x) + ... + β_m z_m(x)

for some weights β = (β_1, ..., β_m) ∈ R^m.
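Fitting these weights amounts to ordinary least squares on the zoetrope features; a sketch (the function names are ours, and the least-squares solver is one reasonable choice among several):

```python
import numpy as np

def fit_zoetrope_weights(Z, y):
    """Least-squares weights beta minimizing ||Z @ beta - y||^2, where
    column j of Z holds the values of zoetrope z_j on the training set."""
    beta, *_ = np.linalg.lstsq(np.asarray(Z, float), np.asarray(y, float),
                               rcond=None)
    return beta

def predict(Z, beta):
    """The full model: a linear combination of the zoetrope features."""
    return np.asarray(Z, float) @ beta
```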

The Fitness Function
The fitness function, used in Darwinian selection, is the MSE of the best linear combination of the zoetropes on the training set, MSE(M*, D_tr). In ZGP, this fitness function is applied within the variation operators, between parents and offspring. Furthermore, the best individual in the population w.r.t. the MSE on the validation set D_val is stored at every generation, and after the algorithm has stopped, the overall best of these best-per-generation individuals is returned (still according to the MSE on the validation set).

The variation operators
This section introduces the representation-specific variation operators, i.e., crossover and mutation (the initialization has been described in Section 3.1). As said, in ZGP, the selection is made within these variation operators, between parents and their offspring, using the MSE for comparisons. Furthermore, the way these operators are applied is also specific. This section will hence describe the operators as well as the choice of the individuals they are applied to.

The Crossover Operator
The crossover process of ZGP uses two parents, but works one-way: it only propagates components from the fittest parent to the other one, somewhat similarly to the Inver-over operator for permutations [32]. It starts by selecting k individuals uniformly from the population (as in standard tournament selection). Then, it randomly replaces some of the 'genes' of the weakest parent by the corresponding genes of the fittest parent. The genes to replace are randomly chosen from the initial elements (terminals in T) and the fusions, each fusion being considered as a single gene here.

Applying the Crossover
Our GP strategy for applying the crossover amounts to repeating the above procedure (tournament of size k, one-way gift of genes from best to worst) p_c · N times, where N is the population size and p_c ∈ [0, 1] is a hyperparameter. The actual implementation runs some tournaments in parallel (i.e., without replacement between the tournaments), in order to decrease the overall computational time.
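The tournament-plus-one-way-gift procedure can be sketched as follows (an illustrative sketch: the list-of-genes encoding, the per-gene transmission probability of 0.5, and the replace-if-not-worse rule are simplifying assumptions of ours):

```python
import random

def one_way_crossover(pop, fit, k, rng):
    """One tournament: draw k individuals, then copy a random subset of
    genes from the fittest of them into the weakest (lower fit is better).
    Selection between parent and offspring: keep the better of the two."""
    group = rng.sample(range(len(pop)), k)
    best = min(group, key=lambda i: fit(pop[i]))
    worst = max(group, key=lambda i: fit(pop[i]))
    child = [bg if rng.random() < 0.5 else wg
             for bg, wg in zip(pop[best], pop[worst])]
    if fit(child) <= fit(pop[worst]):
        pop[worst] = child

def apply_crossovers(pop, fit, k, p_c, rng):
    """Run p_c * N tournament rounds, N being the population size."""
    for _ in range(int(p_c * len(pop))):
        one_way_crossover(pop, fit, k, rng)
```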

Point Mutation
The point mutation operator considers one parent, and works as expected: it replaces some 'genes' of the parent by random values. However, the fusions are here considered to be made of four 'genes', the four components (op_1, op_2, w, b) (Section 3.1), that can be modified independently. For each point mutation, either one element or one fusion is randomly chosen from the "genes" and mutated.
When an element is to be mutated, it is replaced by a constant with some fixed probability, or by a uniformly chosen variable (different from the current one if the element is a variable). When replacing a variable with a constant, this constant is simply chosen uniformly in [c_min, c_max]. When mutating a constant c into a new constant, an auxiliary constant ĉ is uniformly drawn, also in [c_min, c_max], an operator ⊙ is uniformly drawn in {×, /, +, −}, and c is replaced by c ⊙ ĉ. When a fusion is to be mutated, only one (uniformly drawn) of its four components op_1, op_2, w, b is modified. In the case of an operator, a new operator is chosen uniformly in O; w is modified by flipping one bit of its binary representation, and b is simply flipped. Note that in ZGP, each individual designated for mutation is actually mutated twice. A first point mutation is applied to a component (element or fusion) randomly chosen from the "effective components", i.e., the components which are actually used by the model, thus ensuring that the mutation has an impact on the model. A second point mutation is applied to a component randomly chosen from all the components (effective or not), allowing components free from fitness pressure to drift and preserve diversity once they become effective [17].
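The element-mutation rules above can be sketched as follows (a sketch only: the constant range, the element encoding, and the zero-division guard are assumptions of ours, since the actual values are hyperparameters):

```python
import random

C_MIN, C_MAX = -1.0, 1.0  # hypothetical range for ephemeral constants

def mutate_constant(c, rng):
    """Replace a constant c by (c <op> c_hat), with c_hat drawn uniformly
    in [C_MIN, C_MAX] and <op> a uniformly drawn arithmetic operator."""
    c_hat = rng.uniform(C_MIN, C_MAX)
    op = rng.choice(["*", "/", "+", "-"])
    if op == "*":
        return c * c_hat
    if op == "/":
        return c / c_hat if c_hat != 0 else c  # guard against division by zero
    if op == "+":
        return c + c_hat
    return c - c_hat

def mutate_element(element, n_vars, p_const, rng):
    """Replace an element by a fresh constant (probability p_const) or by a
    variable different from the current one."""
    if rng.random() < p_const:
        return ("const", rng.uniform(C_MIN, C_MAX))
    current = element[1] if element[0] == "var" else None
    return ("var", rng.choice([v for v in range(n_vars) if v != current]))
```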

Experimental Validation
This section describes and analyzes the performance of ZGP on regression tasks with tabular data, and compares it with that of state-of-the-art symbolic regression and classical machine learning algorithms.

Experimental Setting
The experiment closely follows the benchmark in [26], where the algorithms were run on the Penn Machine Learning Benchmarks (PMLB) database [25], a collection of real-world, synthetic and toy datasets, with a restriction to datasets with less than 3000 observations (small data regime). We compare ZGP with the same SR algorithms as in [26], namely MRGP [2], GSGP [24,7], EPLEX [19] and AFP [31], with the best parameters their authors found by 5-fold cross-validation. To this list, we also added the more recent FEAT [18] and the deterministic FFX [22]. We also chose those algorithms because of the availability of a Python interface (provided by the benchmark's authors in the case of GSGP and MRGP). Indeed, while many state-of-the-art SR algorithms are open source, their source code comes in different languages (C++, Java, Matlab), hence quite some work is needed to re-implement and run those under the same conditions. As for classical ML approaches, we chose the following algorithms from scikit-learn [27]: gradient boosting, random forests (RF), decision trees, elastic net, kernel ridge and linear SVR, the latter three being optimized by 5-fold cross-validation. Finally, we added a multi-layer perceptron with keras [8] as, in our experience, the one from scikit-learn does not perform well in general. The parameters for each algorithm are provided in the supplementary material. The experiment consisted in 20 runs of each algorithm, based on the same splits of training and test sets (70-30%) for all algorithms. All datasets were standardized with scikit-learn's StandardScaler. We computed the Normalized Root Mean Squared Error (NRMSE), i.e., the square root of the MSE divided by the range of target values, the R2-score (computed with scikit-learn), and the computational time for each algorithm and each run. Note that in [26], only 10 independent runs were performed for each algorithm, with random train-test splits.
However, given the variability of symbolic regression approaches, we believe it is more robust to increase the number of runs, and fairer to compare them on exactly the same data.
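The NRMSE used throughout the benchmark can be computed as follows (a small sketch consistent with the definition above; the function name is ours):

```python
import numpy as np

def nrmse(y_true, y_pred):
    """Root mean squared error normalized by the range of the target."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))
```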
All experiments were performed on a HP Z8 server with 40 cores. All runs end when the maximum number of generations is reached, or when the standard deviation of the best fitness over a window of size n_w reaches some user-defined threshold ε, whichever comes first. The code to replicate the comparison experiments will be provided together with the final paper. A file containing the results for all the algorithms, runs and datasets is provided in CSV format in the supplementary material.
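The two-part stopping rule can be sketched as follows (a sketch; the default window of 30 generations and threshold of 1e−3 follow the values reported with Table 1):

```python
import statistics

def should_stop(best_fitness_history, g_max, n_w=30, eps=1e-3):
    """Stop when the number of generations reaches g_max, or when the best
    fitness has stagnated: its standard deviation over the last n_w
    generations falls below the threshold eps."""
    if len(best_fitness_history) >= g_max:
        return True
    window = best_fitness_history[-n_w:]
    return len(window) == n_w and statistics.pstdev(window) < eps
```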

Hyperparameters
ZGP has quite a large number of hyperparameters. On the one hand, this allows great flexibility when tuning the algorithm. But on the other hand, it makes its use time-consuming. Hence some default values have been fixed by intensive trial-and-error experiments; they are reported in Table 1.
The optimization of these hyperparameters by some automatic Hyper-Parameter Optimization (HPO) procedure, like SMAC [13], AutoSkLearn [10] or HyperBand [21], will be the subject of further work.

Results and Discussion
As in [26], we report the median values of R2 and NRMSE over the 20 runs, for each algorithm and each dataset. Figure 2 compares the distribution of these median R2 scores (2a) and NRMSE (2b) over all datasets, while the red dots show the average of the median R2/NRMSE scores over all datasets ("average R2/NRMSE" in the sequel). The algorithms are ordered by decreasing average R2 and increasing average NRMSE, the best one being on the left. Table 2 gives the average rank for each algorithm, based on the median R2 scores (middle column) and median NRMSE (right column) for each dataset. Standard deviations of the ranks are given in parentheses. Figure 2 and Table 2 show that gradient boosting attains the best performance, closely followed by random forests, FEAT and ZGP, and less closely by MRGP. Note that ZGP has a lower average R2 than FEAT but a better average rank in R2 and NRMSE. This fact comes from a few worse estimations for ZGP on some datasets (lower outliers in R2, Figure 2a), and better ones for other datasets (higher first and third quartiles in R2). The remaining algorithms display lower performance with a much higher variance, especially on the R2 scores.

Table 1 (footnote): stopping criterion (g_max, n_w = 30, ε = 1e−3). The algorithm is stopped if either one of the two following criteria is reached: the number of generations reaches the maximum g_max, or the standard deviation of the best fitness over n_w generations goes below the threshold ε (inspired by [30]).
To further the comparison for symbolic regression, Figure 3 displays the performance in R2 against the computational time for all SR algorithms (top), and for the top three SR algorithms (bottom), namely ZGP, FEAT and MRGP. A 'good' algorithm should be in the upper left corner of these graphs (high performance and low computational time). Note that this comparison of runtime is somewhat qualitative because the algorithms rely on different programming languages (ZGP and FEAT are in C++ with a Python interface, while MRGP is in Java). In order to make the comparison as fair as possible, we provided FEAT and MRGP with the maximum execution time as run by ZGP, because both FEAT and MRGP require an upper limit in their input time parameter. These figures show that ZGP and FEAT share the best performance. Moreover, while all SR algorithms have scattered computational times, GSGP is the fastest SR algorithm, complying with its claim. However, it has a large variance in performance, as does FFX. Among the best three, ZGP thus shows the highest performance and the shortest computational time. Also, we note that the algorithms with the lowest performance, GSGP, AFP and EPLEX, are those relying on the smallest operator set O = {+, −, *, /}. On the contrary, the best performance is obtained by algorithms with a wider operator set, including trigonometric functions and square roots among others (the full list of operators for each algorithm is provided in the supplementary material). The choice of the operator set appears to be an important one: we investigated it further by expanding it for EPLEX and AFP to {+, −, *, /, sin, cos, sqrt}, and their performance was indeed greatly improved, but still far from that of ZGP and FEAT; we therefore decided to keep the default set in the results presented here, to be consistent with the benchmark in [26], and to report the corresponding metrics in the supplementary material.
Note also that EPLEX and AFP are selection methods, not full GPSR algorithms, and that EPLEX is used as a selection mechanism for FEAT. As for the deterministic FFX, it turns out that both its performance and computational time are widely scattered, ranging from very good R2 and low computational time on some datasets, to the worst R2 or computational time on others. It is also worth noting that FFX is the only algorithm that cannot be parametrized directly to run on one thread only: it actually spans all 40 cores of our server, while all the other algorithms were limited to one core per run. We are aware of the limitations of the datasets utilized. Despite their number, as mentioned in this section and in the introduction, a good portion (62 of them) consists of simulated data from the Friedman collection of artificial datasets, which follow a known nonlinear function with only 3 relevant variables, as described in [11]. Hence, the chosen database may be a favorable setting for symbolic regression approaches that tend to select few variables in the final model (among which ZGP).

Conclusion and future work
ZGP, a novel GPSR algorithm, has been presented and its inner workings explained in detail. The algorithm has been validated on regression tasks, in comparison with several state-of-the-art algorithms, both from classic ML tools and from existing GP-based SR frameworks. The performance of ZGP is comparable to or better than that of state-of-the-art SR algorithms, in terms of accuracy of the resulting model (measured both by the RMSE and the R2) and of computational time. It is also comparable to state-of-the-art classic ML algorithms, but performs somewhat worse than the most advanced ML algorithms like gradient boosting.

Similarly to the majority of SR algorithms, ZGP's interpretability is attained through a tight selection of variables, and the output of an analytical formula linking the selected variables to the target. However, and differently from most other SR algorithms, ZGP is "bloat-averse by design": the zoetropic representation and the genotype-to-phenotype mapping give an upper bound on the complexity of all ZGP models. Last but not least, ZGP performs feature construction and selection: the "zoetropes", the elements obtained at the last layer of the development process, are simply combined by linear regression. Therefore, they do represent useful features, being a by-product of the algorithm rather than a separate pre-processing step. These features offer yet another insight into the interpretation of the model.
One of the specifics of ZGP is the use of the fusion operation during the genotype-to-phenotype mapping (see Section 3.1 and in particular Equation 2). No operator is applied alone, and smooth transitions from one operator to the other are possible through modifications by mutation of the random weight w, in a way similar to that of Geometric Semantic Crossover [24]. However, the linear combination is limited here to simple operators, and is only performed n_f times, and thus does not result in uncontrolled bloat: instead of increasing the search space by augmenting the complexity of the trees, as in traditional GP, the search space is extended in ZGP by replacing the discrete set of operators by the continuous family obtained by their linear combinations. On-going ablation studies are investigating this hypothesis.
In contrast to algorithms designed for big data, ZGP, like all GP-based SR algorithms, can attain its results by handling datasets with less than a few thousand observations (less than 3000 in the present experiments), a context often loosely referred to today as "small data". Whereas emphasis has been put on Big Data in recent years, due to impressive results in image recognition and Natural Language Processing to name a few, for many more companies out there the available data does not qualify as "Big".
As mentioned in the introduction, ZGP can also be applied to classification, and benchmarking on both binary and multi-class tasks is the subject of on-going work. Furthermore, an extension of the benchmark to a database including more real-world datasets for all three tasks will provide a fuller assessment of those algorithms. Even though the inner mechanism of the ZGP algorithm does limit the bloat, a major effort for future work is the quantitative assessment of the model complexity. Several measures may capture the complexity of symbolic regression models, and we plan to assess it as an additional level for model selection. Finally, hyperparameter tuning is often performed ad hoc, whereas a systematic treatment may help, in particular to select the number of elements and stages, which can be constraining at present.