Discovering Attention-Based Genetic Algorithms via Meta-Black-Box Optimization

Genetic algorithms constitute a family of black-box optimization algorithms, which take inspiration from the principles of biological evolution. While they provide a general-purpose tool for optimization, their particular instantiations can be heuristic and motivated by loose biological intuition. In this work we explore a fundamentally different approach: Given a sufficiently flexible parametrization of the genetic operators, we discover entirely new genetic algorithms in a data-driven fashion. More specifically, we parametrize selection and mutation rate adaptation as cross- and self-attention modules and use Meta-Black-Box-Optimization to evolve their parameters on a set of diverse optimization tasks. The resulting Learned Genetic Algorithm outperforms state-of-the-art adaptive baseline genetic algorithms and generalizes far beyond its meta-training settings. The learned algorithm can be applied to previously unseen optimization problems, search dimensions and evaluation budgets. We conduct extensive analysis of the discovered operators and provide ablation experiments, which highlight the benefits of flexible module parametrization and the ability to transfer ('plug-in') the learned operators to conventional genetic algorithms.


INTRODUCTION
Motivation. Genetic algorithms (GAs) provide a set of evolution-inspired optimization algorithms, which are flexibly applicable to black-box optimization (BBO) problems. They commonly rely on human-designed operators, which impose a restrictive and, most importantly, subjective set of manual priors. This bears the risk of domain overfitting and limited generalization capabilities. Based on recent results in the discovery of attention-based Evolution Strategies [25], we propose that these limitations can be overcome by meta-learning effective GA operators from data. Thereby, the inductive biases of the GA itself are indirectly encoded by its parametrization & can be discovered in an optimization-driven fashion, i.e. by improving its meta-performance on a distribution of relevant tasks.
Approach. Inspired by the recent success of the Set Transformer [27] architecture, we introduce neural network-based architectures to substitute core genetic operators: Selection and mutation rate adaptation are cast as dot-product attention modules, which can flexibly be applied to problems with varying dimensions & population sizes. The resulting family of genetic algorithms can implement different operations based on the specific module weights.

RELATED WORK
Discovery via Meta-Learned Algorithms. Recent efforts have proposed to replace manually designed algorithms by end-to-end optimized inductive biases, by meta-learning parametrized components on a representative task distribution. E.g. this includes the discovery of Reinforcement Learning objectives [28, 33, 44], schedules of algorithm hyperparameters via meta-gradients [9, 34, 45, 47], and the meta-learning of entire learning algorithms [18, 19, 42]. The discovery process can be supported by suitable neural network architectures. Our proposed LGA architecture leverages attention layers to derive a neural network-based family of GAs.
Meta-Learned Gradient-Based Optimization. Our work is closely related to the ambition of meta-learning gradient descent-based learning rules [1, 3, 31]. These approaches rely on access to efficient gradient calculations via the backpropagation algorithm and thereby do not apply to BBO problems. A small neural network processes the gradient and standard optimizer statistics (momentum, etc.) to output a weight change. The optimizer network weights in turn are meta-learned on a task distribution [30]. Metz et al. [29] showed that this results in a highly competitive optimizer for deep learning tasks. Our MetaBBO-discovered LGA, on the other hand, provides a general-purpose BBO algorithm, which does not require differentiable objective functions.
Meta-Learned Population-Based Optimization. Shala et al. [38] meta-learn a controller for the scalar mutation rate in CMA-ES [15]. Chen et al. [5], Gomes et al. [12] and TV et al. [41] previously optimized entire neural network-parametrized algorithms for low-dimensional BBO. All of them use a recurrent network, which processes raw solution candidates and their respective fitness scores. These methods often struggle to generalize to new optimization domains and are often constrained to fixed population sizes and/or search dimensions. Lange et al.
[25] recently leveraged the equivariance of dot-product self-attention to the input ordering [20, 27, 39] to learn adaptive recombination weights for evolution strategies. The proposed LGA extends this attention-based BBO perspective in order to characterize GA operators. After successful meta-training, the learned GA is capable of generalizing to unseen population sizes and large search spaces. To the best of our knowledge, we are the first to demonstrate that a meta-learned GA generalizes to challenging neuroevolution tasks. Finally, the MetaBBO approach does not require access to knowledge of meta-task optima [41] or a teacher algorithm [38].
Baseline Genetic Algorithms. Throughout the paper we compare against four competitive baseline GAs:
• Gaussian GA [35]: A simple GA with Gaussian perturbations and fixed mutation strength using truncation selection.
• MR-1/5 GA [35]: Doubles the mutation rate if 1/5 of all perturbations were beneficial; otherwise, the rate is halved.
• SAMR-GA [7]: Self-adapts per-parent mutation rates based on a simple co-evolution heuristic and meta-mutation rate.
• GESMR-GA [21]: Avoids vanishing mutation rates by using a group elite selection criterion and mutation rate sharing.
We additionally consider Sep-CMA-ES [36] as a scalable evolution strategy baseline for neuroevolution tasks. Each baseline GA was tuned using small grid search sweeps (see Appendix B). Otherwise, we adopted the settings provided by the authors.

BACKGROUND
Black-Box Optimization. Throughout this manuscript, we are interested in efficient continuous black-box optimization: Given a function f(x): R^d → R with unknown functional form, i.e. we cannot compute its derivative, we seek to find its global optimum x* = arg min_x f(x). Commonly the number of parents E is set to be small, E ≪ N, and often even E = 1, enforcing a type of hill climbing. N children candidates are uniformly sampled with replacement from the E parents. Afterwards, they are perturbed using a mutation rate (MR) σ ∈ R_+, which controls the strength of the Gaussian noise ε_i ∼ N(0_d, I_d): x_i ← x̃_i + σ ε_i, for i = 1, ..., N. Many competitive GAs keep a vector of parent-specific [7] mutation rates σ ∈ R^E and additionally perform an intermediate mutation rate adaptation (MRA) step to improve the mutation rate(s) given information gathered throughout the fitness evaluations. In this case, the children are perturbed by their sampled individual-specific mutation rate (σ_i ← σ ∈ R^N) and selection also applies to the children's MR. Alternatively, GESMR-GA [21] forms parent sub-groups and online co-evolves group-level mutation rates based on their observed fitness improvements.
Set Operations via Dot-Product Self-Attention. Scaled dot-product attention (SDPA) is especially well suited to characterize algorithms performing set operations, since it naturally enforces a permutation-invariant function. Consider the standard formulation of SDPA, which embeds a set of N input tokens X ∈ R^{N×D} into D_K-dimensional latent query Q, key K and value V representations. The output is computed as a linear combination of the values, Attention(Q, K, V) = softmax(QK^T / √D_K) V. It can be shown that this transformation is equivariant to the ordering of the tokens in X, i.e. permuting the rows of X will apply the same permutation to the rows of the output [20, 39]. We will leverage this suitable inductive bias to characterize GAs, which inherently operate on sets of solution candidates and their fitness scores.
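As a minimal illustration of the equivariance property (a sketch with a single attention head and randomly drawn projection matrices; no trained weights are implied), the following NumPy snippet checks that permuting the input rows permutes the output rows identically:

```python
import numpy as np

def sdpa(X, Wq, Wk, Wv):
    """Scaled dot-product attention over a set of N tokens (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])           # (N, N) attention logits
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                # row-wise softmax
    return A @ V                                     # linear combination of values

rng = np.random.default_rng(0)
N, D, Dk = 8, 4, 16
X = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, Dk)) for _ in range(3))

perm = rng.permutation(N)
out = sdpa(X, Wq, Wk, Wv)
out_perm = sdpa(X[perm], Wq, Wk, Wv)
# Permuting the input rows permutes the output rows in the same way:
assert np.allclose(out[perm], out_perm)
```

This is exactly the inductive bias that makes the operators applicable to arbitrary population sizes: the set of solution candidates carries no meaningful ordering.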

ATTENTION-BASED GENETIC OPERATORS
We now introduce an attention-based parametrization of the genetic selection and adaptation operators (Figure 2). These in turn will be meta-optimized on a set of representative optimization tasks in order to capture useful BBO mechanisms. We start by answering a natural question: What inputs should be processed by the operators in order to enable generalization across fitness and solution scales?
Attention Features via Fitness Scores & Mutation Strength.
To compute attention scores across parents and children, we need to construct sufficient features, which modulate effective selection and mutation rate adaptation. Furthermore, we want the meta-learned operations to generalize across different test optimization scales. Hence, we consider scale-invariant normalizations, e.g. z-scoring and centered ranks (in [−0.5, 0.5]). We transform both the raw fitness scores (N + E dim.) and the parent mutation rates (N dim.) to construct a set of features processed by the attention layers:
• F ∈ R^{(N+E)×D_F}: Joint fitness transformations of parents and children (z-scores & centered ranks).
• F_C = F_{1:N} ∈ R^{N×D_F}: Fitness transformations of children extracted from the joint transforms (z-scores & centered ranks).
• F_P = F_{N:(N+E)} ∈ R^{E×D_F}: Fitness transformations of parents extracted from the joint transforms (z-scores & centered ranks).
• F_{C'} ∈ R^{N×D_F}: (Separate) fitness transforms of sampled parents after the selection operation (z-scores & centered ranks).
• M' ∈ R^{N×D_M}: Mutation rate features of sampled parents.
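The two scale-invariant fitness transformations above can be sketched as follows (a minimal NumPy version; the exact feature layout used by the paper may differ):

```python
import numpy as np

def z_score(f):
    """Standardize fitness scores to zero mean and unit variance."""
    return (f - f.mean()) / (f.std() + 1e-8)

def centered_rank(f):
    """Map fitness scores to centered ranks in [-0.5, 0.5]."""
    ranks = np.argsort(np.argsort(f))     # rank of each entry, 0 .. len(f)-1
    return ranks / (len(f) - 1) - 0.5

f = np.array([3.0, -1.0, 2.0, 0.5])
print(centered_rank(f))                   # each score mapped into [-0.5, 0.5]
```

Both transforms are invariant to affine rescaling of the raw fitness values, which is what lets the learned operators transfer across objective functions with very different output scales.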
The fitness features additionally include a Boolean indicating whether an individual performs better than the best fitness observed so far.
Selection via Cross-Attention. We replace the common sorting-based selection mechanism with a cross-SDPA layer, which compares children and parents. It first embeds the fitness transformations of the parents and children into queries, keys and values. Afterwards, we compute the normalized dot-product cross-attention features and construct a selection matrix S ∈ R^{E×(N+1)}: we concatenate an E-dimensional vector of ones to the outer product of the parent and children attention features. Intuitively, this extra column represents a fixed offset used to indicate whether the parent copies any child at all or is not replaced. The rows of S then specify the probability of each offspring replacing a parent: S_{1,1} denotes the probability of replacing parent 1 with child 1, while S_{1,N+1} corresponds to not replacing the first parent. We sample row-wise from a categorical distribution in order to determine whether a child replaces a particular parent. Afterwards, we use the selection matrix to update the parent archive and, via masked addition, the associated fitness and mutation rate archives. This selection operator can flexibly regulate the amount of truncation selection by replacing multiple parent slots with the same child.
Mutation Rate Adaptation (MRA) via Self-Attention. Next to selection we meta-learn MRA. The concatenated fitness and MR features of the sampled parents are processed by an SDPA layer, which outputs a child-specific feature matrix A ∈ R^{N×D_K}. Afterwards, the multiplicative adaptation to the MR is constructed by projecting and re-parametrizing the attention output, and the children's MR is obtained via element-wise multiplication. The joint attention weights characterize a specific instance of an LGA. We use a small feature dimension D_K = 16, which results in <1500 trainable meta-parameters. In summary, we introduced two dot-product attention-based operators, which replace the standard selection and MRA operations. Throughout the paper we focus on learned selection and MRA, but in Appendix A we outline how to additionally construct sampling and cross-over operators using self-attention.
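To make the stochastic replacement step concrete, here is a hedged NumPy sketch of sampling from a selection matrix (row-wise categorical over N children plus a 'keep parent' column; all names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def sample_selection(S, parents, children, rng):
    """S: (E, N+1) row-stochastic selection matrix.
    Column j < N means 'replace this parent with child j';
    the last column means 'keep the parent unchanged'."""
    E, num_cols = S.shape
    N = num_cols - 1
    new_parents = parents.copy()
    for e in range(E):
        choice = rng.choice(num_cols, p=S[e])
        if choice < N:                        # parent e replaced by child `choice`
            new_parents[e] = children[choice]
    return new_parents

rng = np.random.default_rng(0)
E, N, d = 3, 4, 2
parents = np.zeros((E, d))
children = np.ones((N, d)) * np.arange(1, N + 1)[:, None]
S = np.full((E, N + 1), 1.0 / (N + 1))        # uniform replacement probabilities
print(sample_selection(S, parents, children, rng))
```

Note how a single well-performing child can be chosen by several rows at once, which is exactly the mechanism that lets the operator adapt its effective elite archive size.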

META-TRAINING, OBJECTIVE & TASKS
We meta-optimize the weights of the GA attention modules to perform BBO on a family of representative tasks. More specifically, we make use of the previously introduced MetaBBO procedure [25] and evolve the LGA parameters to maximize performance on a task distribution of 10 BBOB [13] functions. These include functions with different properties, i.e. separability, conditioning and multi-modality (see Table 1). At each meta-generation (see Figure 1) we start by uniformly sampling a set of BBO tasks and LGA parameters from a meta-evolutionary optimization algorithm. We denote the set of parameters characterizing the LGA by θ_j for j = 1, ..., M meta-population members. Afterwards, each LGA is evaluated on all tasks by running an inner loop search and we compute an aggregated meta-performance score to update the meta-EO. The MetaBBO objective is computed based on the collected inner loop fitness scores of each LGA instance on T tasks. For each candidate θ_j we minimize the final performance of the best population member. Afterwards, we z-score the task-specific results over meta-population members and compute the median across tasks.
The outer loop optimizes θ using OpenAI-ES [37] for 750 meta-generations with a meta-population size of M = 512.
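The meta-fitness aggregation described above can be sketched as follows (assuming a raw score matrix of shape (meta-population, tasks), where lower raw scores are better; variable names are ours):

```python
import numpy as np

def meta_fitness(scores):
    """scores[j, t]: best final fitness of LGA candidate j on task t.
    Z-score each task column across candidates, then take the
    median across tasks to obtain one meta-score per candidate."""
    z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)
    return np.median(z, axis=1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 10))     # 8 LGA candidates, 10 BBOB tasks
print(meta_fitness(scores).shape)     # one meta-score per candidate
```

The per-task z-scoring makes tasks with very different fitness scales commensurable, and the median makes the aggregate robust to outlier tasks.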

EXPERIMENTS
We now turn to an exhaustive experimental evaluation of the MetaBBO optimization procedure and the discovered LGA. We thereby set out to answer the following questions: (1) Is it possible to meta-evolve competitive LGAs via MetaBBO using a limited set of meta-training BBO tasks (Section 6.1)? (2) Does the resulting LGA outperform GA baselines on unseen BBO problems and different search budgets (Section 6.2)? (3) How much can an LGA discovered on a limited set of tasks generalize beyond its meta-training setting (e.g. hyperparameter optimization & neuroevolution; Sections 6.3 & 6.4)?

Meta-Training on BBOB Functions
We start by meta-evolving the LGA parameters on a task distribution consisting of 10 BBOB functions with different random optima offsets, evaluation noise and considered problem dimensionality (d ≤ 10). Throughout meta-training we evaluate the performance of the optimized LGA on several different downstream tasks. These include BBOB functions seen during meta-training, hold-out meta-test BBOB functions and small neuroevolution tasks. In Figure 3 we plot the detailed evaluation curves across meta-training. The MetaBBO-trained LGA quickly learns how to perform optimization on the low-dimensional BBOB meta-training functions. Interestingly, we find that the MetaBBO procedure can lead to an LGA that overfits to the BBOB tasks on which it was meta-trained: The downstream performance of LGA decreases and becomes unstable for both an unseen Pendulum control task with an MLP policy and a downsized 14-by-14 MNIST classification task using a CNN. We therefore investigated whether meta-regularization can improve the generalization on such unseen neuroevolution tasks. More specifically, we compared three different meta-mean regularization coefficients λ ∈ {0, 0.005, 0.02}, which exponentially decay the meta-mean towards zero, θ' ← (1 − λ) θ'. We observe that the generalization to the neuroevolution tasks can be improved and stabilized using a properly chosen decay of λ = 0.005. In Appendix C we further explore the impact of the meta-task distribution, meta-objective and LGA attention size. The MetaBBO procedure is largely robust to the choice of these settings. Small attention layers are sufficient for consistently discovering performant LGAs. This comes with the additional advantage of reducing the FLOPs and memory requirements of executing the LGA. The evaluation of a meta-trained LGA is easily feasible on a single-core CPU device.
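The meta-mean regularization amounts to a single extra update per meta-generation (a sketch; the parameter count and decay value are the ones reported in the text):

```python
import numpy as np

# meta_mean: mean of the meta-ES search distribution over the ~1500 LGA weights
meta_mean = np.full(1500, 0.3)
lam = 0.005                            # meta-mean regularization coefficient
# Applied once per meta-generation: exponential decay of the mean toward zero
meta_mean = (1.0 - lam) * meta_mean
```

Shrinking the meta-mean toward zero acts like weight decay on the meta-learned operators, which is what counteracts the observed meta-overfitting.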

Meta-Testing on BBOB Functions
Next, we exhaustively evaluate the performance of LGA on the full set of BBOB benchmark functions, including test functions unseen during meta-training. We compare against 4 competitive GA baselines: Gaussian GA, MR-1/5 GA, SAMR-GA and GESMR-GA. We compare the performance on all BBOB functions for a population size of N = 32, d = 20 search dimensions and T = 50 generations. The best-across-generations function value is normalized by the performance of the Gaussian GA. All baselines were tuned using grid searches over the parent archive size and initial mutation rate scale; we report all BBOB evaluation & tuning settings in Appendix B.2. In Figure 4 we find that LGA outperforms all baselines on the majority of both BBOB functions seen during meta-training (left) and unseen BBOB functions (right). This holds true for functions with very different characteristics (single/multi-modal, high/low conditioning, separable/non-separable), search dimensions and population sizes (Appendix Figures 17 & 18). This provides further evidence that LGA does in fact not overfit to the BBO functions seen during meta-training, but instead has discovered a general-purpose GA algorithm. In Figure 5 we further demonstrate that LGA generalizes to different population sizes and problem dimensions. The meta-learned GA achieves lower function values in fewer generations (right) and performs well for different problem settings (left). As the problems become harder with increased dimensionality, LGA can compensate with a larger population size.

Meta-Testing on Continuous HPO-B
Next, we test LGA's performance on the HPO-B benchmark [2]. The benchmark considers a vast array of hyperparameter optimization tasks, including 16 different model types (SVM, XGBoost, etc.) and their respective search spaces (d ∈ {2, 3, ..., 16}). Each model is evaluated on 2 to 7 different datasets, which leads to a total of 86 hyperparameter search tasks. We consider the continuous HPO-B version, which uses a previously fitted surrogate model. Note that the LGA has not been trained on such hyperparameter optimization tasks. Figure 6 compares the performance of LGA against the GA baselines and a random search baseline. Additionally, we report the reference performance of the recently proposed OptFormer model [6] after 105 total evaluations. We find that LGA outperforms the majority of considered GA baselines. Again, this observation holds for both considered population sizes. LGA can also achieve similar performance as OptFormer, which has been trained on a much more diverse task distribution. This highlights the transfer capabilities of LGA and its applicability to optimization domains unseen during meta-training.

Meta-Testing on Neuroevolution Tasks
Until now we have evaluated the performance of the discovered LGA on moderately small search spaces (i.e. d ≤ 20). But is it also possible to deploy LGA in neuroevolution settings with thousands of search dimensions and arguably very different fitness landscape characteristics? Again, note that LGA has never explicitly been trained to evolve such high-dimensional genomes and that this requires strong transfer of the learned GA operators.
The considered Reinforcement Learning (RL) tasks consist of six robotic control tasks (Pendulum-v1 [24] and 5 Brax tasks [10]) using MLP policies and three MinAtar visual control tasks (SpaceInvaders, Breakout & Asterix [46]) with CNN-based policies. The MLP genomes consist of fewer than 1000 weights, while the MinAtar CNNs have ca. 50,000. Hence, the search space is orders of magnitude larger than what the LGA has been meta-trained on (d ≤ 10). The top three rows of Figure 7 show that LGA can compete with all tuned baseline GAs on the nine RL tasks with different search spaces, fitness landscapes and evaluation budgets. Interestingly, the performance gap between LGA and the considered baselines is biggest for the CNN policies and tends to increase with the number of search dimensions. Finally, in the last row of Figure 7 we show that LGA can also successfully be applied to three image classification tasks, MNIST, Fashion-MNIST and K-MNIST (28-by-28 grayscale images), with a small CNN (2 convolutional layers, ReLU activation and a linear readout) with 11,274 evolvable weights. The LGA generalizes far beyond the meta-training search horizon (T = 50 versus 4000 generations) and does not meta-overfit [26].

ANALYSIS OF DISCOVERED LGA
After having established that the meta-trained LGA is capable of outperforming a set of GA baselines on unseen optimization problems, we now investigate the underlying discovered mechanisms, transfer ability and robustness of the meta-learned GA operators.

Visualization of Learned Genetic Operators
What types of mechanisms underlying the black-box genetic operators has LGA discovered? Has it simply re-discovered fitness-based truncation selection or a more complex parent replacement procedure? In Figure 8 we consider a 2-dim Sphere problem and visualize the selection mask S_{:,1:N} used to update the parent archive. We observe that the selection operator uses child solutions to replace parents based on their improvement over the best seen solution.
Furthermore, one well-performing child oftentimes replaces more than a single parent. This indirectly implies that the selection operator has meta-learned to dynamically adapt its elite archive size and thereby also the effective sampling distribution. A child that has replaced multiple parents will be sampled (with replacement) more frequently in the next generation. Furthermore, this implies a robustness mechanism: Since children can be stored multiple times in the parent archive, they are less likely to be 'forgotten' by the stochastic selection. The mutation rates, on the other hand, are decreased over the course of generations in order to explore closer to the global optimum. Furthermore, we observe a grouping of the MR based on the performance of the parents: Children with poorly performing parents tend to exhibit a higher mutation rate.

Ablation & Transfer of Genetic Operators
How much do the different learned components contribute to the overall performance of LGA? Can the learned modules act as drop-in replacements for other genetic algorithms? To answer these questions we consider two types of comparative studies: (1) Operator ablation before MetaBBO discovery: We meta-train the LGA with a variable number of learned genetic operators, e.g. we fix the selection operator to its white-box truncation counterpart. (2) Operator transfer after MetaBBO discovery: After meta-training is completed, we ask whether it is possible to substitute the learned operators into other genetic algorithms. This in turn allows us to assess whether a specific learned operator is overfit to the downstream GA computations or whether it can act as a transferable inductive bias for genetic computation.
For the first study we compare meta-training combinations of attention-parametrized selection (SE), mutation rate adaptation (MRA), cross-over (CO) and sampling (SA). The CO and SA parametrizations are introduced in Appendix A; throughout the main text we mainly focused on learned mutation rate adaptation and selection. For each combination we plot the evaluation performance across meta-generations in Figure 9. We observe that MRA is crucial for good performance on the neuroevolution tasks. Intuitively, this can be explained by the smaller scale of solution parameters associated with neural network weights: The GA benefits from the ability to flexibly down-regulate its perturbation strengths. Cross-over, on the other hand, is detrimental for the generalization of LGA to the MNIST CNN neuroevolution task. This behavior can arguably be attributed to the challenge of finding beneficial crossing-over pairs for different neural network genomes. We further observed that learned sampling does not significantly improve the performance of the LGA. We hypothesize that this is due to the indirect effect of the selection mechanism on the sampling of children. Finally, the overall best performing configuration only meta-learns selection and MRA.
Next, we considered replacing the truncation selection and fixed mutation rate of the Gaussian GA baseline with the learned selection and MRA operators. In Figure 10 we show that this can successfully be accomplished for four neuroevolution tasks. Replacing the selection operator or adding the learned MRA operator improves the performance of the Gaussian GA. The learned operators can act as drop-in replacements and are transferable inductive biases.

Hyperparameter Robustness of LGA
Finally, we assess the sensitivity of LGA to its remaining hyperparameter choices. More specifically, we compare the performance of LGA and the baseline GAs for various initial mutation rate scales σ_0 and parent archive sizes E = ⌈η × N⌉, where η denotes the fraction of population members making up the number of parents. Note that while LGA was meta-trained for η = 1, i.e. E = N, we find that it is capable of generalizing to many different archive sizes and is robust to the initial scale parameters (Pendulum control task; see Figure 11). In Section D.2 we provide the same analysis for all neuroevolution tasks. LGA is far less hyperparameter-sensitive than the considered baseline GAs. This highlights the robustness of the LGA induced by the MetaBBO process.

CONCLUSION
Summary. In this study, we used evolutionary optimization to discover novel genetic algorithms via meta-learning. We leveraged the insight that GA operators perform set operations and parametrized them with novel attention modules that induce the inductive bias of permutation equivariance. Our benchmark results on BBOB, HPO-B and neuroevolution tasks highlight the potential of combining flexible GA parametrization with data-driven meta-evolution.
Limitations. We use powerful neural network layers to characterize GA operators. While flexible and interpretable (Section 7.1), the underlying mechanisms do remain partially opaque. Future work needs to be done in order to fully reverse-engineer the discovered operators. In Appendix E we provide a first set of insights unraveling simple linear relationships between the attention inputs and their outputs. We believe these can guide the design of new 'white-box' GAs informed by 'grey-box' discovered LGAs. Furthermore, our analysis highlights the importance of meta-regularization and the potential for the automated design of meta-training curricula.
Future Work. We are interested in explicitly regularizing LGAs to maintain diversity in their parent archive. This may provide a bridge to meta-learned quality-diversity methods [8]. Furthermore, it may be possible to parametrize a flexible EO algorithm that can interpolate between the exploration of multiple solution candidates as in GAs and a single search distribution mode as in traditional evolution strategies. Finally, we believe that better meta-learned GAs can be discovered by simultaneously co-evolving the meta-task distribution and the learned GA.

C ADDITIONAL METABBO RESULTS

C.1 Attention Feature Dimension & Heads
Figure 13: LGA MetaBBO – Different model sizes. We report mean/1.96 standard error intervals across 3 independent runs.

C.4 Comparison of Meta-Objective Functions
We also compared different meta-objectives, which construct the meta-fitness in different ways:
• minN-minT: Minimizes over the population evaluations within a generation and across all generations.
• minN-finalT: Minimizes over all population evaluations of the final generation.
• meanN-minT: Computes the mean performance across members within a generation & minimizes this score across generations.
• meanN-finalT: Computes the mean performance across all members within the final generation.
Interpretation of MetaBBO Comparative Studies. The LGA discovery process is largely robust to the considered MetaBBO specifications. A small attention module parametrization (single head and D_K = 8) is sufficient to learn powerful GA operators. Furthermore, a small task distribution (a single Sphere function with random offsets/noise) already leads to strong performance on BBOB tasks. For MNIST classification more function diversity is required. The choice of the meta-optimizer and objective are more important: OpenAI-ES (Figure 3) and minN-finalT provide the best performance.
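The four aggregation schemes can be sketched as follows (assuming an inner-loop fitness array of shape (population, generations) with lower values being better; the function and mode names are ours):

```python
import numpy as np

def aggregate(f, mode):
    """f[n, g]: fitness of population member n at generation g (lower is better)."""
    if mode == "minN-minT":
        return f.min()                  # best member across all generations
    if mode == "minN-finalT":
        return f[:, -1].min()           # best member of the final generation
    if mode == "meanN-minT":
        return f.mean(axis=0).min()     # best per-generation population mean
    if mode == "meanN-finalT":
        return f[:, -1].mean()          # population mean of the final generation
    raise ValueError(f"unknown mode: {mode}")

f = np.array([[3.0, 2.0],
              [1.0, 4.0]])             # 2 members, 2 generations
print(aggregate(f, "minN-finalT"))     # 2.0
```

The minN-finalT variant rewards an LGA for where its single best solution ends up, rather than for transient progress, which matches the reported finding that it yields the best meta-training signal.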

E REVERSE-ENGINEERING THE LEARNED GA
We further investigated whether there exist simple relationships between the attention input features and the learned operators' outputs. More specifically, we explored if the mutation rate adaptation multiplier and the selection logits can be explained by the normalized fitness features (z-score and centered ranks) of the children.
In Figure 31 we plot all features, logits and mutation multipliers across an LGA evaluation run on a 2-dim Sphere task. We can observe a clear positive correlation between the performance of the children and their selection probability. Furthermore, the mutation rate is decreased for well-performing solutions. This may open up the possibility to reverse-engineer a newly discovered GA without the need for arguably opaque neural network modules.

F SOFTWARE, COMPUTE REQUIREMENTS
This project has been enabled by the usage of freely available open-source software. This includes the following:
• NumPy: Harris et al. [16]
• Matplotlib: Hunter [17]
• Seaborn: Waskom [43]
• JAX: Bradbury et al. [4]
• Evosax: Lange [23]
• Gymnax: Lange [24]
• MinAtar: Young and Tian [46]
• Evojax: Tang et al. [40]
• Brax: Freeman et al. [11]
• MLE-Infrastructure: Lange [22]
A network checkpoint and accompanying architecture code are publicly available in evosax. All experiments (both meta-training and evaluation) were implemented using the JAX library for parallelization of fitness rollout evaluations. Each MetaBBO meta-training run was executed on 4 RTX 2080Ti Nvidia GPUs and takes roughly 2.5 hours. The LGA downstream BBOB, HPO-B and gym task evaluations were run on a CPU cluster using 2 CPU cores; they last between 2 and 5 minutes. Finally, the neuroevolution tasks were run on individual NVIDIA V100S and A100 GPUs. The Brax evaluations require between 30 minutes and 1.5 hours depending on the control task. The computer vision evaluation experiments take ca. 10 minutes and the MinAtar experiments last ca. 1 hour on a V100S GPU.
Genetic Algorithms. GAs provide a class of BBO algorithms, which iteratively evaluate a population consisting of N solution candidates X = [x_1, ..., x_N]^T ∈ R^{N×d} ('children') with fitness f ∈ R^N. Given a set of E 'parent' solutions P = [x̃_1, ..., x̃_E]^T ∈ R^{E×d} with associated fitness f̃ ∈ R^E, the parents are replaced by the children using a heuristic fitness-based selection criterion. Most GAs make use of truncation selection, in which all children and parents [x_1, ..., x_N, x̃_1, ..., x̃_E]^T are jointly sorted by their fitness. The top-E performing solutions replace the parent archive.
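The standard truncation-selection update that LGA's learned operator replaces can be sketched as (a minimal NumPy version; lower fitness is better):

```python
import numpy as np

def truncation_selection(children, f_children, parents, f_parents, E):
    """Jointly sort children and parents by fitness (lower is better)
    and keep the top-E solutions as the new parent archive."""
    X = np.concatenate([children, parents], axis=0)   # (N+E, d) joint pool
    f = np.concatenate([f_children, f_parents])       # (N+E,) joint fitness
    top = np.argsort(f)[:E]                           # indices of the E best
    return X[top], f[top]

children = np.array([[0.0], [2.0], [4.0]])
parents = np.array([[1.0], [3.0]])
P, fP = truncation_selection(children, np.array([0.0, 2.0, 4.0]),
                             parents, np.array([1.0, 3.0]), E=2)
print(fP)   # fitness values of the two retained solutions
```

Unlike this deterministic sort-and-cut, the learned cross-attention operator samples replacements stochastically and may keep a parent even when a fitter child exists.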

Figure 2: Learned Genetic Operators. Top: Cross-attention selection between parent & children fitness features. Bottom: MRA via self-attention on parent mutation & fitness features.

Figure 4: Meta-Evaluation on BBOB Tasks. Left: Training Functions. Right: Hold-Out Functions. Scores are normalized by the Gaussian GA baseline performance. Lower is better. Averaged over 50 independent evaluation runs.

Figure 10: Transfer of learned operators to a Gaussian GA. Mean & 1.96 standard error intervals across 5 runs.

Figure 11: Hyperparameter Robustness of LGA. Pendulum-v1 performance across elite ratios and initial mutation rates. η = 0 uses a single parent, E = 1. Results are averaged over 5 independent evaluation runs.

Figure 31: Top: Linear relationship between fitness features and MRA. Bottom: Linear relationship between fitness features and selection logits.