Curing ill-Conditionality via Representation-Agnostic Distance-Driven Perturbations

The objective value of an ill-conditioned function may change significantly under a minor shift of the argument in the search space. Ill-conditioned functions exhibit few or no hints towards better solutions and are thus usually difficult to optimize with randomized search heuristics. However, problems that emerge in practical applications are likely to be formulated as ill-conditioned functions, as the Euclidean metric is often used to measure distance in the search space. At the same time, it may be possible to use domain-specific knowledge to define a metric on the search space under which the function stops being ill-conditioned. We consider finite search spaces and propose two mutation operators that leverage such a metric to optimize the function more efficiently. The first operator assumes prior knowledge about the distance; the second uses the distance as a black box. Both operators apply an estimation-of-distribution algorithm to find the best mutant according to an internal function which employs the given metric. For pseudo-Boolean and integer optimization problems, we experimentally show that both mutation operators speed up the search on most of the considered functions when applied in evolutionary algorithms and random local search. Moreover, these operators can be applied in any randomized search heuristic that uses perturbations. However, our mutation operators increase wall-clock time and are therefore helpful in practice when the distance is (much) cheaper to compute than the real objective function.


INTRODUCTION
Randomized search heuristics (RSH) are convenient off-the-shelf solvers of difficult optimization problems [6, 53]. While they do not necessarily guarantee to locate the exact solution to any problem, they aim to approximate a solution of sufficient quality within a feasible time frame. Applications of such algorithms require the user to define a search space Ω, where possible solutions are located, and the objective function f with domain Ω. This function defines the considered problem and is usually expensive to evaluate (in terms of computation cost, licenses, or resources). In the scope of the current work, we consider a finite set Ω ⊆ Z^n, a function f with codomain R, and algorithms that aim to maximize f, i.e., to find x* ∈ Ω such that max{f(x) | x ∈ Ω} − f(x*) = 0.

Evolutionary algorithms (EAs) are RSH inspired by the process of natural evolution. Such algorithms are supported by theory [18, 27, 37] and show promising applications to various ranges of problems, including pseudo-Boolean/integer problems [19, 22, 38] and real-world scenarios [7, 24, 58]. One possible assumption on the function optimized by means of an EA is that the function is not ill-conditioned. This is also classically formulated as strong causality [48] of the optimization problem, contiguity, and locality [45]. It means that a small change in the candidate solution leads to a solution with similar performance according to the objective function. The significance of the change is measured according to a distance metric d: Ω² → R [28], for example, Hamming distance in the case of pseudo-Boolean domains, or Euclidean distance in the case of R^n domains. The classical definition of an optimization landscape [51] is a triple of the search space, the metric defined for this search space, and the function being optimized. We will say that the landscape is well-conditioned if any two points in the search space with a small distance between them produce similar values of the objective function (this is defined
more formally in Definition 2.2). Let us consider a set of functions F defined on the same search space Ω. If there exists a metric d on Ω such that the landscapes {(Ω, d, f) | f ∈ F} are well-conditioned, then F is not closed under permutations of the search space [29]. In this case, the NFL theorem [56] guarantees the existence of an algorithm that outperforms other algorithms on average over the functions in F. So the existence of such a metric is sufficient to design a problem solver for F that performs above average. Well-conditioned optimization landscapes {(Ω, d, f) | f ∈ F} may exist for a set of functions F that a practitioner solves. However, obtaining the metric d for these landscapes usually requires a certain level of expertise in the domain. It means that a well-conditioned landscape gives away some information on the structure of the considered problems. While EAs do not require any information on the domain of the problem, it is well known that problem-specific knowledge significantly increases the performance of EAs [21]. We identify four main categories of approaches that take advantage of different kinds of domain-specific information in EAs:

1) Asymmetric mutation operator. Methods in this category use information about the problem to modify the mutation operator employed by an EA. Doerr et al. [14, 15] consider the Eulerian cycle problem and rigorously show that asymmetry in the mutation operator, for example avoiding certain modifications, leads to a performance increase compared to an EA with classical mutations. Raidl et al. [46] show the benefits of an EA with a mutation operator that preferentially chooses edges with smaller weights to solve the minimum spanning tree problem and the traveling salesman problem. Jansen and Sudholt analyzed the influence of a bias in the mutation operator when it is applied to pseudo-Boolean functions [30]. They considered a mutation operator that flips ones with a higher probability when a solution contains more ones, and vice versa. The work rigorously shows that
such bias speeds up the (1 + 1) EA equipped with this operator on three of the functions they considered, and slows down the algorithm on another function. Their work was extended by proposing an adaptation of the degree of asymmetry during the optimization process [47].
2) Enhanced representation. Traditionally, there are no specific restrictions on the used representation, but there are a number of recommendations [52]. In the EA community, the mainstream is to simply use the most natural representation of the individual [54]. However, finding such a representation may already require expert-level knowledge in the domain of the problem. Many works with a focus on practical applications prove this; for example, Yu et al. applied an EA to object decomposition for 3D printing [64], where they used a Binary Space Partitioning tree encoded as an array of continuous numbers. The design of such a representation requires an understanding of the principles of object decomposition. In the area of spatial design creation, the challenging problem of defining a search space that does not significantly restrict the space of all possible solutions was subjected to comprehensive research in [42]. Moreover, a standard EA may work significantly slower when the representation is not carefully chosen. Doerr et al. developed a representation for the Eulerian cycle problem that allowed them to build an EA with arguably the best expected runtime reachable by an RSH [16].
3) Data-driven approaches. This category covers automated algorithm selection [25, 33, 35, 39, 63] and dynamic algorithm configuration [1, 9], where automatically extracted features of the problem are used to suggest an optimization method or a configuration of the algorithm.
4) Diverse set of candidate solutions. The diversity of a set of candidate solutions can be evaluated using a domain-specific distance, as suggested in [59]. It is known that more diverse sets of parents in an EA facilitate the global exploration properties of the algorithm, reduce the risk of getting stuck in local optima, and allow obtaining multiple dissimilar high-quality solutions [60]. Niching methods were developed to maintain diverse sets of individuals in EAs; for a comprehensive overview see [11, 43, 44]. Distance-based niching methods have a great number of successful applications, which shows the advantage of distance knowledge. For example, Dynamic Peak Identification niching [57] maintains high-performing solutions with a pairwise distance larger than a given lower bound. This method was successfully applied to find a number of high-performing lens designs in a single run of an EA in [34]. In the domain of airfoil optimization, a variation of niching was proposed where the algorithm alternates between pure quality optimization and distance-based diversity optimization, which allows obtaining better-performing solutions [49, 50].
As we see from the examples of approaches in categories 1 and 2, the user needs to have deep knowledge of both the domain of the problem and the optimization method. Approaches in category 3 have the disadvantage that the predicted algorithms/configurations do not always perform well, especially when a new (unseen) problem is dissimilar from the problems that the approaches were trained on. Approaches in category 4 leave the hard work of picking a distance that represents the problem well to the expert in the domain and allow using this distance as a black box in the optimization algorithm. However, this distance influences only the selection part of the EA. Since the distance gives away information on the problem, it can also be leveraged in other parts of an EA to speed up the optimization even more. At the moment, there exist classical requirements for the design of crossover and mutation operators for an EA in a metric space [20, 21, 28]. Apart from this, there is a number of works where neighborhood structure and distance were applied explicitly or via group theory, for example [8, 40, 41, 55]. We observe in those works that a practitioner who wants to use metrics in an EA needs to redesign the operators accordingly, which requires the application of methods from categories 1 and 2.
Our contribution: We propose two versions of a mutation operator that employs a given metric on the search space. It is possible to change the metric used in our mutation operators without amending the rest of the algorithm. The first developed mutation operator relies on a given distribution over mutation strengths. The second operator applies a distribution which is created automatically and allows control of its parameters. The application of such mutation operators in an EA does not require any additional effort from a practitioner to switch from one metric to another, and it has great benefits where a natural representation makes the problem ill-conditioned. In this case, our mutation operator can cure the problem if a domain-specific distance is known.

PRELIMINARIES
Evolutionary algorithms (EAs) are a class of search and optimization algorithms inspired by the principles of natural selection and Darwinian evolution. These algorithms are based on the idea that a population of solutions to a problem can evolve and improve over time through the application of selection, reproduction, and variation operators.
During the optimization process an EA samples elements of the search space Ω and evaluates the objective function f at the sampled points. While doing so, an EA ideally balances exploration and exploitation. It explores the search space by sampling elements with significantly different structures, which can be seen as gaining knowledge of the problem. The significance of the structural difference between sampled solutions can be measured with a metric d, which is usually considered as the distance between elements of Ω. Despite the fact that the optimization problem is completely specified when Ω and f are given, it is convenient to talk about the landscape of the problem when the exploration/exploitation capabilities of an EA are analyzed. The landscape of the optimization problem is classically defined as follows.
Definition 2.1 (Landscape [51]). For the problem f defined on the metric space Ω with metric d, the landscape L is the triple (Ω, d, f).
In this work, we distinguish between ill-conditioned and well-conditioned landscapes. In order to clearly define the latter we need to refer to small values that restrict the significance of the change in the argument and in the objective value. Of course, significance is always relative, so we assume that the knowledge of what is small already exists. In the following definition, we assume that this knowledge is given as sets of constants for the corresponding sets of real numbers.

Definition 2.2 (Well-conditioned landscape). For the given search space Ω, metric d: Ω² → R and objective function f: Ω → R, consider sets of problem-dependent, small constants {ε_{x0,d} | x0 ∈ Ω} for the distances d(x0, ·) and {ε_{x0,f} | x0 ∈ Ω} for the objective differences |f(·) − f(x0)|, given by an oracle. Given those constants, the optimization landscape (Ω, d, f) is well-conditioned, and the metric d is well-conditioning, if for every x0, x ∈ Ω, d(x0, x) < ε_{x0,d} implies |f(x) − f(x0)| < ε_{x0,f}.

The following algorithms are used in this work as the basis of our proposed mutation operators:

(1 + λ) EA is an elitist evolutionary algorithm without crossover. Given the objective function f: Ω → R, such an algorithm maintains the fittest solution and iteratively samples modifications of this solution. Each modification is obtained using a given parameterized distribution over the solution candidates [13].
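As an illustration of this scheme, a minimal sketch of a (1 + λ) EA on bit strings with standard bit mutation might look as follows (all function and parameter names here are ours, not taken from the paper):

```python
import random

def one_plus_lambda_ea(f, n, lam, budget, rng=random.Random(0)):
    """Minimal (1 + lambda) EA on bit strings, maximizing f.

    Each offspring flips every bit independently with probability 1/n
    (standard bit mutation); the fittest of parent and best offspring
    survives (elitist selection, ties accepted).
    """
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = f(x)
    evals = 1
    while evals < budget:
        best_y, best_fy = None, None
        for _ in range(lam):
            y = [b ^ 1 if rng.random() < 1.0 / n else b for b in x]
            fy = f(y)
            evals += 1
            if best_fy is None or fy > best_fy:
                best_y, best_fy = y, fy
        if best_fy >= fx:
            x, fx = best_y, best_fy
    return x, fx
```

On OneMax (f = sum of bits) this sketch reliably reaches the optimum within a modest budget.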
(1 + (λ, λ)) EA is an extension of the (1 + λ) EA which uses crossover with bias c. At first, this algorithm samples λ solutions using mutation, then it picks the best mutant and recombines it λ times with the best-so-far solution. Each recombination applies crossover in such a way that every component is taken from the best-so-far solution with probability (1 − c) and from the best mutant with probability c. The best-so-far solution is updated with the fittest individual after such recombinations. This algorithm has proven itself well in theory [2, 3].
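The component-wise biased crossover used in this recombination step can be sketched as follows (names are ours):

```python
import random

def biased_crossover(parent, mutant, c, rng):
    """Biased crossover of the (1 + (lambda, lambda)) EA: each position
    is taken from the mutant with probability c and from the current
    best-so-far parent with probability 1 - c."""
    return [m if rng.random() < c else p for p, m in zip(parent, mutant)]
```

With c = 0 the offspring equals the parent, with c = 1 it equals the mutant; intermediate values of c interpolate between the two.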
Randomized Local Search (RLS) is a stochastic local search algorithm. RLS iteratively explores the neighborhood of a current solution and selects a new solution based on a randomized criterion that takes into account the quality of the neighboring solutions. RLS is characterized by its simplicity, efficiency, and ease of implementation. In this work, we apply an extension of this algorithm to integer search spaces, the so-called RLS-ab [12]. With this extension, the algorithm maintains the candidate solution as a vector of n integers, together with an additional vector of n integers that determine the step size with which every component of the solution is updated. This additional vector is called the vector of velocities. The velocities are adjusted independently during the optimization process: when a new mutant is fitter than the best-so-far solution, the corresponding velocity is increased, otherwise it is decreased. We give the pseudocode of RLS-ab in Algorithm 3.
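The velocity mechanism can be sketched as below; this is a simplified variant with multiplicative updates, and the update constants a and b are illustrative choices of ours, not necessarily those of [12]:

```python
import random

def rls_ab(f, n, low, high, budget, a=1.2, b=0.85, rng=random.Random(0)):
    """Sketch of RLS-ab for integer vectors in [low, high]^n, maximizing f.

    One uniformly chosen component is shifted up or down by its current
    velocity; on success the velocity is multiplied by a, on failure by b
    (a > 1 > b; constants here are illustrative, not those of the paper).
    """
    x = [rng.randint(low, high) for _ in range(n)]
    v = [1.0] * n  # per-component step sizes ("velocities")
    fx = f(x)
    for _ in range(budget - 1):
        i = rng.randrange(n)
        y = list(x)
        step = max(1, round(v[i]))
        y[i] = min(high, max(low, y[i] + rng.choice((-1, 1)) * step))
        fy = f(y)
        if fy > fx:  # fitter mutant: accept and speed up
            x, fx = y, fy
            v[i] = min(v[i] * a, high - low)
        else:        # otherwise slow down, but keep at least unit steps
            v[i] = max(v[i] * b, 1.0)
    return x, fx
```

On a simple separable target the velocities first grow to cross the space quickly and then shrink back to unit steps near the optimum.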
(1 + λ) EA-ab is our simple extension of RLS-ab. The update adds the possibility to create λ individuals in every iteration and samples the number of modified components from the binomial distribution with success probability 1/n. The operator that mutates every selected component stays the same, including the update rule of the parameters. For simplicity, we give the pseudocode of the particular case of (1 + λ) EA-ab with λ = 1 in Algorithm 5.
Univariate Marginal Distribution Algorithm (UMDA) [36] is another population-based stochastic optimization algorithm. UMDA operates by sampling and updating the probability distribution of each variable in the solution vector independently, and generates new candidate solutions using a random sampling process based on the updated probability distributions. In every iteration it samples λ individuals and uses the μ best of them to update the probability distribution. This algorithm does not use the notion of a neighborhood, so it has the potential to be efficient even when a well-conditioned landscape does not exist for the problem. However, it is suggested to use a large enough value of λ to solve problems with difficult properties [17], so this algorithm may not be practically applicable right away to the initial optimization problem. We apply this algorithm to a relatively cheap function when we design our mutation operator. In this work, UMDA is adapted to handle integer domains. To achieve this we maintain the estimated probability of every integer in every component. We give the pseudocode of this extension in Algorithm 4.
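The integer extension, with one categorical distribution per component, can be sketched as follows (the frequency clamping margin is our own illustrative safeguard against premature convergence):

```python
import random

def umda_int(f, n, values, lam, mu, budget, rng=random.Random(0)):
    """Sketch of UMDA over integer vectors, maximizing f.

    A per-component categorical distribution over `values` is sampled
    lam times per iteration and re-estimated from the mu best samples;
    frequencies are clamped away from 0 so every value stays reachable.
    """
    k = len(values)
    # p[i][j] = estimated probability that component i equals values[j]
    p = [[1.0 / k] * k for _ in range(n)]
    best, fbest, evals = None, None, 0
    lo = 1.0 / (k * n)  # clamping margin (our choice)
    while evals < budget:
        pop = []
        for _ in range(lam):
            x = [rng.choices(values, weights=p[i])[0] for i in range(n)]
            fx = f(x)
            evals += 1
            pop.append((fx, x))
            if fbest is None or fx > fbest:
                best, fbest = x, fx
        pop.sort(key=lambda t: t[0], reverse=True)
        elite = [x for _, x in pop[:mu]]
        for i in range(n):
            for j, v in enumerate(values):
                freq = sum(1 for x in elite if x[i] == v) / mu
                p[i][j] = min(max(freq, lo), 1.0 - lo)
            s = sum(p[i])
            p[i] = [q / s for q in p[i]]
    return best, fbest
```

Because the marginals are estimated independently, the sampler never relies on any neighborhood structure of the search space.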

HIGH-LEVEL OVERVIEW OF THE PROPOSED APPROACHES
Given Definition 2.1 of a landscape, we say that an algorithm becomes landscape-aware if it uses the specific distance d for defining the variation operators. In the scope of this work, we only propose ways to use this distance in the mutation operator. We call a mutation operator distance-driven if it uses such a distance to generate mutants. Accordingly, we say that an algorithm is distance-driven if it uses a distance-driven mutation operator. Let us consider an RSH A which applies a mutation operator when it solves an optimization problem in the landscape L′ = (Ω, d′, f). We make the assumption that L′ is not well-conditioned. At the same time, we assume that a well-conditioned landscape L = (Ω, d, f) is given to us by an oracle. In this section, we propose a way to change A into a distance-driven algorithm dd-A which solves the optimization problem (at least partially) in the landscape L. The optimization is performed completely in the new landscape when A uses only mutations to sample new solutions. However, we leave distance-driven crossover operators for future work, so if A uses recombination in addition to mutation, then dd-A partially works in the old landscape.
Both our distance-driven mutation operators rely on a parameterized family of distributions over all possible step sizes. In Section 4 we explain what exactly we mean by this. In Section 4 we assume that the distribution is defined by hand, which is possible when the so-called structure of the distance is known. Then in Section 5, we show how to create this distribution automatically for any given distance. The distance-driven mutation operator itself is given in Algorithm 1 and is applied in both proposed methods, but with different distributions. Consequently, given a well-conditioned landscape L and a distribution family D defined in Section 4, we transform A into dd-A by substituting the usual mutation operator in A with the distance-driven mutation operator.

PROPOSED METHOD FOR THE DISTANCE WITH KNOWN STRUCTURE
Traditionally, mutation creates a structural change in the individual.
The change is usually defined by a probability distribution over different structural changes. The shape of the distribution reflects the intuition about the search. For example, in local search methods, the distribution is defined only over small mutations that likely do not lead to leaving the basin of attraction. In global search methods, such as Evolution Strategies (ES), the distribution is defined over the whole search space, but smaller changes happen with higher probability. Even though the mutation itself is sampled from this distribution, the significance of the change is defined w.r.t. the distance metric. Classically, it is assumed that the distribution is paired with a distance metric that considers components independently, such that a small change in the value of every component contributes a small addition to the total value of the distance. When the metric is different but the same algorithm is applied, the mutations sampled from the same distribution likely do not have the properties that are expected from the perturbations. Let us consider the mutation operator M of an optimization algorithm as a stochastic rule which chooses some mutant y for every candidate solution x such that (x, y) ∈ Ω². We write M(x) to denote a mutation of the individual x. We say that the step size of a mutation w.r.t. the distance d is the value d(x, M(x)). The probability distribution that is used in the mutation operator defines the step size of the operator. This step size may be sampled from the distribution explicitly, or implicitly, as happens in ES when the mutation itself is sampled. In the mutation operator that we design, we do not use anything from the distribution except the information on the step size. Hence, we consider the distribution defined over step sizes.
Let us define what we denote as the distribution over step sizes more precisely. First of all, we call the set of all pairwise distances between points in the search space, S = {d(x, y) | x, y ∈ Ω}, the structure of the distance. In this section, we assume that S is known to the designer of the algorithm. In order to allow balancing exploration/exploitation in the algorithm we add parametrization to the distribution, which significantly improves the performance of EAs [4, 23]. All the parameters of the distribution are denoted by θ. This vector controls the shape of the distribution and is updated outside of the mutation operator according to the rules defined in the distance-driven algorithm. For a given individual x ∈ Ω we work with elementary outcomes {y | y = M(x)} and a probability measure Pr_θ parameterized by the vector θ. Given the parameters θ, the individual x ∈ Ω which is being perturbed, and the metric d, the distribution over step sizes from x is:

  D_{x,θ}: Pr[r] = Pr_θ[d(x, M(x)) = r] for every r ∈ S.  (1)

For the given θ, distance-driven algorithms possess the following set of distributions in order to apply a distance-driven mutation operator at any point of Ω:

  D_θ = {D_{x,θ} | x ∈ Ω}.  (2)

Then the parameterized family of sets of distributions is given by:

  D = {D_θ | θ}.  (3)

We say that D defined in Eq. 3 is the distribution family. The choice of this distribution family is left to the user. To choose D it is sufficient to define D_θ in Eq. 2, and for that it is sufficient to define D_{x,θ} in Eq. 1. Therefore it is sufficient to define the parameterized probability of mutating to each possible distance from any element of Ω. In order to do this, it is sufficient to know the set S.
Our distance-driven mutation operator generates another individual that steps away from the initial one to the distance closest to a given step size. To implement this generation we solve another optimization problem, as written in Algorithm 1. The goal of solving the problem in line 2 is to locate any y ∈ Ω such that the distance between x and y is as close as possible to the sampled step size r. If several pairwise different individuals are solutions to the optimization problem in line 2, then we do not distinguish between them and pick an arbitrary one.
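The essence of Algorithm 1 can be sketched as follows; the inner optimizer below is a stand-in (plain random search instead of the UMDA used in the paper), and all names are ours:

```python
import random

def distance_driven_mutation(x, d, sample_step, inner_opt):
    """Sketch of the distance-driven mutation operator (Algorithm 1).

    A target step size r is drawn from the step-size distribution, and
    the mutant is any point whose distance from x is as close as possible
    to r; the inner search is delegated to `inner_opt` (UMDA in the
    paper, but any maximizer of the cheap inner objective works).
    """
    r = sample_step()
    # inner objective: the closer d(x, y) is to r, the better
    return inner_opt(lambda y: -abs(d(x, y) - r))

def random_search_inner(space_sampler, budget, rng=random.Random(0)):
    """Stand-in inner optimizer (plain random search), for illustration."""
    def opt(g):
        best, gb = None, None
        for _ in range(budget):
            y = space_sampler(rng)
            gy = g(y)
            if gb is None or gy > gb:
                best, gb = y, gy
        return best
    return opt
```

Note that the inner objective evaluates only the distance d, never the expensive objective f, which is why the cheapness of d matters.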

Analysis of internal optimization problem
In line 2 of Algorithm 1 an optimization problem is solved. In particular cases of constrained Ω it might be possible to solve this problem analytically or using a deterministic algorithm. However, when the size of S is too big or S is not known (as in Section 5), we can solve this problem with an RSH. In order to pick an optimization algorithm for this problem we notice the following property.

Proposition 1. Let h be the Hamming distance and assume that the landscape (Ω, h, f) is not well-conditioned, while (Ω, d, f) is well-conditioned. Then there exists a step size r such that the landscape (Ω, h, g_r) of the internal problem with g_r(y) = |d(x, y) − r| is not well-conditioned.

Proof sketch. Assume, towards a contradiction, that the landscape (Ω, h, g_r) is well-conditioned for all r ∈ S. Then a small change of y w.r.t. h implies a small change of |d(x, y) − r|. For r = 0 the latter transforms to d(x, y) < ε_{x,d}. Since (Ω, d, f) is well-conditioned, this implies |f(y) − f(x)| < ε_{x,f}. Combining these together, we obtain that a small change w.r.t. h implies a small change of f. It means that (Ω, h, f) is well-conditioned, which contradicts the initial condition of the proposition. □

We assumed that the landscape (Ω, h, f) is not well-conditioned when h is the Hamming distance. Then, following Proposition 1, the internal optimization problem may not be well-conditioned for some values of r. To optimize this function we pick an optimizer that does not use the notion of a neighborhood. One of the simplest such algorithms is UMDA. We applied it with a large λ and a big budget. In practical applications, the λ evaluations of the objective function from line 2 can be done in parallel, which reduces the wall-clock time spent on internal optimization.
Application of this version of the mutation operator requires the user to know the structure of the distance and to define the distribution family. Obtaining knowledge of the distance structure may require expertise in the domain of the problem. Moreover, the size of S may be extremely big, so defining the distribution family may not be feasible in some cases. To free the user from the necessity to learn this structure, we propose the following extension of the mutation operator, which automatically does both: it explores the structure of the distance and creates a distribution family.

PROPOSED METHOD FOR BLACK-BOX DISTANCE
The main purpose of this section is to propose a method that is able to extend the perturbation operator to a distance with an unknown structure. This allows decoupling the expert in optimization from the expert in the domain. We expect that the distance created by an expert in the domain is well-conditioning and satisfies certain structural properties, which we summarize in the following assumption.
Assumption 1. For any points x, y ∈ Ω: (cheapness) the distance d(x, y) is much cheaper to evaluate than the objective function f; (variability) the set of distances {d(x, z) | z ∈ Ω} from x contains many pairwise different values; (consistency) the range of distances observed from x is similar to the range of distances observed from y.

The most important assumption for our method is cheapness, because the inner optimization problem is solved for every application of the mutation operator, which is normally called as often as the objective function. When the metric does not satisfy this property, our method becomes very time-consuming. If the variability assumption does not hold for many points, then a practitioner who knows this fact already has enough knowledge to use the previous method and create the distribution family D according to their preferences. The consistency assumption is needed to save computational resources when we explore the space of possible values of the given metric. Our method is still applicable if this condition does not hold on a subset of the search space, but for every point in this subset we need to run a new exploration procedure, which becomes time-consuming.
In the case of a black-box distance, it is impossible to define a distribution family D in advance. Here we propose a method to create a distribution family that can be used in Algorithm 1. Designing this distribution family for a general metric is challenging because of two problems: the ranges of distance values differ between metrics, and the values may collapse into very dense segments where they are hardly distinguishable. At first, we propose a general transformation of the distance metric to solve both problems. Then we define the distribution family over the transformed values of the distance. This transformed distance and distribution family are used in Algorithm 1 in the same way as in the previous section.

Transformation of the distance
Given the metric d, for every point x ∈ Ω we map S_{d,x} = {d(x, y) | y ∈ Ω} to a subset of (0, 1) with the monotonic function

  T_σ(x, y) = 1 − exp(−d(x, y)² / (2σ²)).  (4)

The second term in the mapping is the well-known Gaussian function with codomain (0, 1). Hence, the codomain of T_σ(x, y) is (0, 1). Let us define the set I_x = {T_σ(x, y) | y ∈ Ω \ {x}}. Such a mapping transforms S_{d,x} to I_x ⊂ (0, 1) for any metric d and thus yields the same range of distances, which solves the first problem we mentioned. The parameter σ of Eq. 4 is chosen in such a way that I_x is as spread as possible. This requirement ensures that the values of I_x do not collapse into a very dense segment where different values of the distance are not distinguishable given the limitations of floating-point representation.
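Assuming the mapping has the Gaussian form 1 − exp(−d²/(2σ²)) (our reconstruction of Eq. 4 from the surrounding description), it can be sketched as:

```python
import math

def transform(d_xy, sigma):
    """Hedged reconstruction of the distance mapping of Eq. 4:
    t(d) = 1 - exp(-d^2 / (2 * sigma^2)).

    It is monotonically increasing in d >= 0 and maps every positive
    distance into (0, 1); sigma controls how spread the image set is.
    """
    return 1.0 - math.exp(-(d_xy ** 2) / (2.0 * sigma ** 2))
```

A small σ pushes large distances close to 1, a large σ pushes small distances close to 0; the calibration below picks σ between these extremes.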
To satisfy this requirement we consider the following feasibility problem regarding small constants ε_1, ε_2 ∈ (0, 1):

  T_σ(x, y_min) ≤ ε_1,  (7)
  T_σ(x, y_max) ≥ 1 − ε_2,  (8)

where y_min and y_max denote the closest and the farthest points from x, respectively. The constraint in Eq. 7 defines an upper bound on the smallest mapped distance; analogously, the constraint in Eq. 8 defines a lower bound for the largest mapped distance.
For the purposes of our application, we specify the requirements for the solution of this problem. We aim to find a sufficiently small ε_1 and ensure that ε_2 is small as well. Specifically, we iterate over values of ε_1 in the range (10^−4, 10^−2) with step 10^−4 and stop when we find the smallest ε_1 ≥ ε_2 such that ε_1, ε_2 satisfy the constraints in Eq. 7 and Eq. 8.
To check whether a given value of ε_1 satisfies the constraints, we simplify them as follows. Using the monotonicity of Eq. 4, the two constraints are equivalent to the following bounds on σ:

  d_min / sqrt(−2 ln(1 − ε_1)) ≤ σ ≤ d_max / sqrt(−2 ln ε_2),  (9)

where d_min = min S_{d,x} \ {0} and d_max = max S_{d,x}. It implies the following lower bound for ε_2:

  ε_2 ≥ (1 − ε_1)^{(d_max / d_min)²}.

Then there exists an ε_2 which satisfies our additional condition ε_2 ≤ ε_1 if and only if ε_1 satisfies:

  (1 − ε_1)^{(d_max / d_min)²} ≤ ε_1.  (10)

To estimate the values d_max and d_min, we consider fixed values a, b ∈ R with 0 < a < 1 < b and a sequence of points (y_i), i = −k, ..., k, whose distances from x grow with i: each y_i is at least b times farther from x than y_{i−1} for i > 0, and each y_{−i} is at most a times as far from x as y_{−i+1}. As approximations of d_max and d_min we take d(x, y_k) and d(x, y_{−k}), respectively. We substitute these values into Eq. 10 in order to check whether ε_1 satisfies our constraints. When we obtain a solution (ε_1*, ε_2*), we evaluate σ*; it is taken as the average of its bounds obtained from Eq. 9. The steps above are summarized in Algorithm 2, which iterates over the candidate values of ε_1, stops as soon as the condition ε_2 ≤ ε_1 is satisfied, and returns {ε_1, ε_2, σ}.

In our implementation, we used the constant values 10, 1.2, 0.5, and 100 for the parameters of this procedure. To find an element of the set defined in line 3 of Algorithm 2 we apply UMDA to maximize the objective function g_1(y) = d(x, y) − b·d(x, y_{i−1}), with a stopping criterion that, apart from limiting the number of function evaluations, includes the condition g_1(y) ≥ 0. We stop optimizing the function once the termination condition is satisfied for some solution and return this solution as the element of the set. Since b·d(x, y_{i−1}) can be an arbitrary number, we follow Proposition 1 and motivate the application of UMDA in the same way as earlier, i.e., in Section 4.1. Similarly, to compute y_{−i} in line 4 we apply UMDA to maximize the objective function g_2(y) = a·d(x, y_{−i+1}) − d(x, y), with a stopping criterion that includes the conditions g_2(y) ≥ 0 and g_2(y) < a·d(x, y_{−i+1}). The second additional condition is needed to ensure that y_{−i} ≠ x. The first solution which satisfies the termination criteria is returned as the element of the set. It is possible that for some i and some j ∈ {1, 2} we have g_j(y) < 0 for all y. It means that the set defined in line 3 (for j = 1) or in line 4 (for j = 2) is empty. In this case we set y_i to y_{i−1} for j = 1, and y_{−i} to y_{−i+1} for j = 2.
Algorithm 2 takes a point from Ω as input. When a certain point x ∈ Ω is passed as input to Algorithm 2, we say that the output values ε_1, ε_2, σ are computed for the point x. Following our consistency assumption, the values of ε_1 computed for different points x, x′ ∈ Ω are similar. Using the same assumption we have d(x, y_k) ≫ d(x, y_{−k}). When the optimization problems in lines 3 and 4 of Algorithm 2 are solved precisely, we obtain d_max ≫ d_min.

Distribution over the transformed values
In this subsection, we derive the distribution D_{x,θ} for an arbitrary element x ∈ Ω and parameters θ. As we discussed before, if the distance function satisfies our assumptions, then the values ε_1, ε_2, σ are practically the same for all choices of x ∈ Ω. Hence, the distributions D_{x,θ} are the same for all x when the parameters θ are fixed. To be more concise, in this section we denote D_{x,θ} as D.
The distribution over I_x is chosen to be continuous to avoid computing the exact set I_x. The fact that I_x has finite size does not corrupt the algorithm, because of the big density of points in I_x. Moreover, the optimization problem in Algorithm 1 may not be solved precisely even when the distribution is discrete. It means that for a random variable r ∼ D the value min over y of |T_σ(x, y) − r| may not be zero in practice, but it is close to zero both for discrete and for continuous distributions. Then, the distribution is chosen as the D* that has maximal differential entropy H(D) among continuous distributions with support (0, 1) and fixed mean m ∈ I_x:

  D* = argmax {H(D) | D continuous on (0, 1), E[D] = m}.  (11)

The maximal entropy is enforced to obtain an unbiased mutation operator [54]. We fix the mean of this distribution to allow controlling the step size during the optimization process. Distributions with fixed moments of higher order and other properties are left for future work, because enforcing other properties would make it much more complex to solve the optimization problem defined in Eq. 11. On the other hand, a distribution with a fixed mean and maximal entropy is already enough to efficiently solve problems with different complex properties when the algorithm adjusts the mean during the optimization process. For example, a (1 + λ) EA with mutation rate control optimizes a number of difficult pseudo-Boolean functions faster than its competitors [19]. The step size of a mutation in the (1 + λ) EA is sampled from the binomial distribution, and mutation rate control is equivalent to controlling only the mean of this distribution. At the same time, the binomial distribution is the maximal entropy distribution among discrete distributions with fixed mean and fixed support [26]. This motivates the hope that our mutation operator with the distribution from Eq. 11 has the potential to assist optimization better than mutation operators with other distributions, given that a smart enough method to control the mean is used.
In order to find the probability density function (pdf) P of the distribution defined in Eq. 11, we consider the optimization problem in Eq. 12. The solution of Eq. 12 is unique for the λ_0, λ_1 ∈ R that satisfy the constraints Eq. 13 and Eq. 14 [10,31,32]. Substituting Eq. 15 into Eq. 13 and Eq. 14 and integrating gives a system of two equations; combining them into one under the condition λ_1 ≠ 0 gives Eq. 16. If a solution of Eq. 16 with respect to λ_1 exists, it is unique, because the solution of Eq. 12 is unique. The left part of Eq. 16 converges to 0 both for λ_1 → −∞ and for λ_1 → 0. The right part of Eq. 16 grows without bound for λ_1 → −∞ and converges to 0 for λ_1 → 0. Moreover, for λ_1 → −0 the left part is greater than the right part. Since both parts are continuous functions of λ_1 on the negative half-line, a solution of Eq. 16 with λ_1 < 0 exists. To find it, we apply binary search with bounds (−C, 0), where C is sufficiently large.
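As a minimal numerical sketch (our own illustration, not the paper's code), the bisection step can be written for the simplified case m_1 = 0, m_2 = 1 used later in the experiments, where the maximal-entropy density is p(s) ∝ exp(λ_1 s) on [0, 1] and its mean is increasing in λ_1:

```python
import math

def trunc_exp_mean(lam):
    """Mean of the maximal-entropy density p(s) ~ exp(lam * s) on [0, 1]."""
    if abs(lam) < 1e-9:                 # lam -> 0 is the uniform limit
        return 0.5
    return (math.exp(lam) * (lam - 1.0) + 1.0) / (lam * (math.exp(lam) - 1.0))

def solve_lambda(mu, c=1e6, tol=1e-12):
    """Binary search on (-c, 0) for the lam < 0 whose mean equals mu < 1/2.

    The mean is monotonically increasing in lam, so plain bisection
    converges to the unique solution."""
    lo, hi = -c, 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if trunc_exp_mean(mid) < mu:
            lo = mid                    # mean too small: move lam towards 0
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a target mean μ = 0.2, the solution indeed falls into the interval [−1/μ, 0) mentioned in the experimental section.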

EXPERIMENTS
In this section we outline the experimental setups used to validate the proposed contributions. This is done in two experiments: binary OneMax-based problems show the effectiveness of distance-driven perturbation in an easy-to-analyze use case, and a discretized version of the BBOB benchmark suite shows the generalizability of the proposed approach.

Experimental Setup
Binary problems. In order to benchmark Algorithm 1 as is, we consider a binary OneMax-based problem in n dimensions (i.e., with search space X = {0, 1}^n). We adopt the Ruggedness problem based on OneMax and implement it as in the W-Model suite [62]. In our experiments, we define OneMax as f_1(x) = ||x||_1 (the l_1-norm of the binary string x). Ruggedness takes a given permutation σ and applies it to the objective values of the underlying function, which is OneMax in our case; applying this permutation reorders the objective values. We define Ruggedness as r_γ(x) = σ^{-1}(f_1(x)). We define the distance for this problem as d(x, y) = |r_γ(x) − r_γ(y)|. Note that this is not a metric, since d(x, y) can be zero even when x ≠ y. However, this distance satisfies all other requirements of a metric, because it is based on the absolute value. Moreover, the landscape (X, d, r_γ) is well-conditioned. The permutation is stored implicitly: for every integer 0 < v < n the permuted value is computed in closed form from v, γ and n. In our implementation, we considered γ ∈ [1, 5]. Note that with γ = 1 Ruggedness is OneMax. Greater values of γ create deceptive intervals of size γ; for example, for γ = 5 the objective value 99 of OneMax is permuted to 95, 98 to 96, . . . , 95 to 99.
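A toy implementation of this construction, assuming a simplified block-reversal permutation of the objective values in place of the exact W-Model closed form (function names and the precise map are our own):

```python
def one_max(x):
    """OneMax: the number of one-bits (the l1-norm of the bit string)."""
    return sum(x)

def ruggedness_sigma(n, gamma):
    """A permutation of the objective values {0, ..., n} that reverses
    consecutive blocks of size gamma while keeping the optimum n fixed.

    This is a simplified stand-in for the W-Model ruggedness map, not
    its exact closed form; gamma = 1 yields the identity (plain OneMax)."""
    sigma = list(range(n + 1))
    for start in range(0, n, gamma):
        stop = min(start + gamma, n)           # never move the optimum n
        sigma[start:stop] = sigma[start:stop][::-1]
    return sigma

def make_ruggedness(n, gamma):
    """Return the Ruggedness objective and its well-conditioned distance."""
    sigma = ruggedness_sigma(n, gamma)         # block reversal is an
    r = lambda x: sigma[one_max(x)]            # involution, so sigma is
    dist = lambda x, y: abs(r(x) - r(y))       # its own inverse
    return r, dist
```

For n = 100 and γ = 5 this reproduces the reordering above: 99 ↔ 95, 98 ↔ 96, with 97 and the optimum 100 fixed.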
We extended both the (1 + λ) EA and the (1 + (λ, λ)) EA with our distance-driven mutation operator; the modified algorithms are called dd-(1 + λ) EA and dd-(1 + (λ, λ)) EA accordingly. Both algorithms are considered without any parameter control. In the implementation of Algorithm 1 we used the binomial distribution with n Bernoulli experiments with success probability 1/n. Note that the value sampled from this distribution may be larger than the maximal distance reachable from the point x (for example, when x is of the form 11..100..0). In this case, the optimization in line 2 of Algorithm 1 is terminated when it runs out of budget. The parameters of UMDA taken in the experiments on binary problems are μ = 50, λ = 100, budget = 1000.
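In outline, the mutation step works as follows (a hedged sketch: `dd_mutation` and the random bit-flip inner search are our own stand-ins for Algorithm 1 with UMDA as the inner optimizer):

```python
import random

def dd_mutation(x, distance, budget=1000, rng=random):
    """Distance-driven mutation: sample a target step size s from
    Bin(n, 1/n), then search for a mutant whose distance to x is as
    close to s as possible.

    Random bit flips replace the UMDA inner optimizer of the paper;
    the search stops early on an exact match and otherwise terminates
    when it runs out of budget (e.g. when s exceeds the largest
    distance reachable from x)."""
    n = len(x)
    s = sum(rng.random() < 1.0 / n for _ in range(n))   # Bin(n, 1/n)
    best, best_gap = list(x), abs(distance(x, x) - s)
    for _ in range(budget):
        y = [b ^ (rng.random() < 1.0 / n) for b in x]   # standard bit flips
        gap = abs(distance(x, y) - s)
        if gap < best_gap:
            best, best_gap = y, gap
        if best_gap == 0:
            break
    return best
```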
Integer problems. We consider integer problems to demonstrate the advantages of the distance-driven mutation operator when the distance satisfies the assumptions. The domain of these problems is X = Π_{i=1}^{d} {0, . . . , k_i − 1}, where k_1, . . . , k_d are the cardinalities. The problems are taken from the COCO/BBOB-MIXINT suite [61]. We extended the problems to get rid of continuous variables and to support arbitrary cardinality of every component; the remaining details of these problems were left unchanged. To show the advantages of our mutation operator, we transform those problems in our experiments. For a given permutation π and a function f from BBOB-MIXINT we define the transformed function f_π(x) = f((π^{-1}(x_i))_{i=1,...,d}). Here x_i stands for the i-th component of x. Note that the same permutation is used to reorder the integers in every dimension of the search space; we implemented the transformation this way because in our experiments the cardinalities of all dimensions are equal. When the permutation is applied to the search space, the neighbourhood structure is destroyed in every component, so the Euclidean distance becomes ill-conditioned. We define the distance for the transformed problem as d(x, y) = l_2((π^{-1}(x_i))_{i=1,...,d} − (π^{-1}(y_i))_{i=1,...,d}). This distance takes the inverse permutation of every component and computes the l_2-norm of the difference of the obtained vectors. In all experiments, the same permutation was used for the search space transformation.
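The transformation and its matching well-conditioned distance can be sketched as follows (helper names are ours; the construction follows the definitions above):

```python
import math

def make_transformed(f, pi):
    """Apply the same permutation pi of {0, ..., k-1} to every component.

    The algorithm then optimizes f_pi(x) = f(pi^{-1}(x_1), ..., pi^{-1}(x_d)),
    whose neighbourhood structure under the plain Euclidean metric is
    destroyed, together with the well-conditioned distance that undoes
    the permutation before taking the l2 norm."""
    inv = [0] * len(pi)
    for i, v in enumerate(pi):
        inv[v] = i                                # inverse permutation
    f_pi = lambda x: f([inv[c] for c in x])
    dist = lambda x, y: math.dist([inv[c] for c in x],
                                  [inv[c] for c in y])
    return f_pi, dist
```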
In the experiments we consider d = 40 dimensions, each with cardinality k = 100, for every problem in the COCO/BBOB-MIXINT suite. We extended UMDA to domains where the cardinality of every component is greater than 2. The parameters of this extended UMDA taken in the experiments on those problems are μ = 10², λ = 10³, budget = 4·10⁴.
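A minimal sketch of such a categorical UMDA (the frequency clamping and the toy parameters in the test below are our own choices, not the paper's exact implementation):

```python
import random

def umda(f, n, k, mu, lam, budget, rng=random):
    """UMDA over {0, ..., k-1}^n: one categorical frequency vector per
    component, re-estimated from the mu best of lam sampled offspring.

    Frequencies are clamped away from zero so that no value becomes
    permanently unreachable (a standard safeguard, our own choice)."""
    eps = 1.0 / (n * k)
    freqs = [[1.0 / k] * k for _ in range(n)]
    vals = range(k)
    best, best_f, evals = None, float("-inf"), 0
    while evals < budget:
        pop = [[rng.choices(vals, weights=freqs[i])[0] for i in range(n)]
               for _ in range(lam)]
        scored = sorted(((f(x), x) for x in pop), reverse=True)
        evals += lam
        if scored[0][0] > best_f:
            best_f, best = scored[0]
        for i in range(n):                       # re-estimate frequencies
            counts = [0] * k
            for _, x in scored[:mu]:
                counts[x[i]] += 1
            freqs[i] = [max(eps, c / mu) for c in counts]
    return best, best_f
```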
In our experiments we maximized the negative regret, i.e., the value f(x) − f(x*), where f is a problem, x* is the global optimum of f, and x is a candidate solution. At the same time, we plot the positive regret in Figure 2 to avoid confusion.
We extended the RLS algorithm with our mutation operator. Note that the new algorithm no longer performs local search, because the distribution is defined over all found values of distances. We control the mean of the distribution in the following way. Assume that the mutation operator produced an individual y and the best-so-far solution is x. If f(y) > f(x), we increase the mean (which increases the expected step size): μ ← μ · a, where a > 1. If f(y) < f(x), we reduce the mean: μ ← μ · b, where b < 1. The rest of the steps (except mutation and parameter update) are the same as in the RLS algorithm. We call this modified algorithm dd-(1 + 1) EA_ab. In the implementation, we take a = 1.001, b = 0.999. To simplify the numerical solution of Eq. 16, we took m_1 = 0, m_2 = 1 in that equation (and only there). In this case the solution with respect to λ_1 always belongs to [−1/μ, 0).
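The outer loop with this multiplicative mean control can be sketched as follows (names are ours; the behaviour on f(y) = f(x) is not specified in the text, so we keep both the solution and the mean unchanged in that case):

```python
def dd_one_plus_one_ab(f, x0, mutate, steps, a=1.001, b=0.999, mean0=0.5):
    """Outer loop of dd-(1+1) EA_ab: the mutation operator receives the
    current mean of the step-size distribution, and the mean is scaled
    by a > 1 on improvement and by b < 1 on a worsening offspring."""
    x, fx, mean = x0, f(x0), mean0
    for _ in range(steps):
        y = mutate(x, mean)
        fy = f(y)
        if fy > fx:
            x, fx, mean = y, fy, mean * a
        elif fy < fx:
            mean *= b
        # fy == fx: keep x and the mean unchanged (our assumption)
    return x, fx
```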
The implementation of all considered algorithms, problems and experiments can be accessed at Zenodo [5].

Results
Binary problems. In Figure 1 we compare the performance of the described algorithms on the Ruggedness problem with different values of γ. The algorithms with modified mutation operators performed much better than their analogues in all considered cases with γ > 1.
For γ = 1 the performance of the algorithms is almost the same; indeed, the distance-driven mutation operator uses the same step sizes, because the distribution is the same for all algorithms. However, dd-(1 + (4, 4)) EA slightly outperforms dd-(1 + 4) EA on most of the points, except the final points for γ = 4 and γ = 5. When an individual is in a local optimum, the mutation-only algorithm waits in this optimum until a sufficiently big step size is sampled from the distribution, while crossover adds additional variability, which helps to escape the local optimum. This is the reason for the slight advantage of dd-(1 + (4, 4)) EA over most of the optimization trajectory. At the same time, the performance of both algorithms is significantly better than that of their analogues, and the gap between them grows with γ.
Integer problems. In Figure 2 we compare the performance of dd-(1 + 1) EA_ab against two algorithms that are not aware of well-conditioning metrics for the problems. Although we maximized the negative regret, we plot the positive regret, so the lower the curve in Figure 2, the better the performance of the corresponding algorithm. As expected, for the majority of functions, specifically F1, F2, F6-F14, F16, and F20-F22, the increase in performance of dd-(1 + 1) EA_ab over the other algorithms is significant. It shows that our algorithm manages to explore the search space well enough to find the parameter values that allow the mutation operator to generate small distances as well as big ones. However, we observe that the performance is much worse on functions F3 and F4. On function F4, this may happen because both competitor algorithms got lucky with the permutation of the search space, which allowed them to jump to local optima of better quality at the beginning, where they converged to the basin of attraction and got stuck. On function F3, the distance-aware algorithm may lose because it treats all directions evenly. When it finds the ridge of F3, it continues to sample solutions at a distance corresponding to the sampled step size s. The vectors of all possible perturbations form a hyperball of radius s, which contains many points apart from the points on the ridge, so the probability of improving the objective value becomes very small. This is the reason why we observe a very slow convergence of the distance-driven algorithms on this function. On function F24 the distance-driven algorithm also loses to the competitor that is able to make jumps with bigger step sizes in the search space. The reason for this might be that the basins of attraction on F24 are small, so the algorithm needs to make small steps in order not to jump from one valley to another. This is the weak spot of the distance-driven algorithm, because in order to find smaller values of distance it needs to spend much more time on the internal optimization, which makes the algorithm significantly slower. On top of that, the algorithm stops making jumps to other basins of attraction because, after the initial convergence, the mean of the distribution becomes too small. This happens because the parameter control method that we picked for dd-(1 + 1) EA_ab is rather simple. On the rest of the functions, the distance-driven algorithm slightly wins or shows approximately the same performance as the competitors.

CONCLUSION
In this work, we propose two extensions of the perturbation operator. Both operators are applicable to a broader class of algorithms than just EAs: they can extend other RSHs that use the notion of distance to make changes in the solution, for example, simulated annealing. The first extension relies on the structure of the distance. The user of this approach is expected to design the distribution over the step sizes that they want to see in the perturbation operator. When a step size is sampled from the distribution, it is used to solve the inner optimization problem in which the mutant is generated. The goal of the inner optimization is to generate an individual that lies at a distance from the given individual close to the sampled step size.
The second mutation operator uses the distance as a black box. It explores the search space prior to defining the distribution. The information about the problem obtained during the exploration is then used to define the distribution over the step size with maximal entropy and a fixed value of the mean. To estimate the parameters of this distribution we solve a numerical optimization problem, and we repeat this optimization every time the mean changes. This allows us to control the strength of the changes made by the mutation operator.
To empirically evaluate the proposed operators, we observed the performance of the first operator on pseudo-Boolean problems and the performance of the second operator on integer problems. From the results, the distance-driven operators were found to do better on most functions, which shows the advantage of being aware of the distance during the optimization. However, the performance on some functions is worse because of imperfections of the parameter control, specificities of the considered problems, and the fact that our operator treats all directions evenly.
In future work, we plan to identify directions in the metric space to help the mutations proceed in the most promising direction with higher probability. This should improve the performance on functions such as F3. Moreover, it will allow us to apply the most advanced parameter control methods that exist in Evolution Strategies to our mutation operators. Apart from this, we plan to investigate the possibility of creating a surrogate for the well-conditioning distance.
The creation of such a model would significantly loosen the current constraints on the optimization problems for which our mutation operators are efficient.
The values m_2 and m'_2 are very close when computed as in line 10 of Algorithm 2 for different points x, x' ∈ X. We take advantage of this and compute m_1, m_2, μ only once, for a point x ∈ X chosen uniformly at random. Following the variability assumption, |I_x| ≫ 1, which makes the elements of I_x = {d(x, y) | y ∈ X} lie very close to each other. This means that there are no significant gaps between the values in I_x, which solves the second problem.

Figure 1: Average parallel optimization time and standard deviation of the considered algorithms (x-axis: best-so-far value of the objective function; y-axis: average number of iterations) on Ruggedness with γ ∈ [1, 5]. Dimensionality n = 100, 11 runs for every algorithm. UMDA is used as the internal optimizer of the distance-driven algorithms with parameters μ = 50, λ = 100, budget = 1000. The limit on the number of iterations is 5·10⁶.