Some General Heuristics in the Traveling Salesman Problem and the Problem of Reconstructing the DNA Chain Distance Matrix

With all their differences, the two problems under consideration, namely the traveling salesman problem and the problem of reconstructing the DNA chain distance matrix, have a lot in common. This commonality primarily consists in the following. Real instances of these problems can formally be attacked by standard methods such as gradient descent, but in practice they are described by systems of equations with several dozen, and sometimes hundreds, of variables. For this reason, we solve them with sequential (step-by-step) algorithms for filling matrices, sometimes also using backtracking over the variables already considered. We show that, in the situations we consider, such heuristics give acceptable anytime algorithms.


INTRODUCTION
This paper is a continuation of our previous works [1, 2, 3]. We continue to consider the application of the multiheuristic approach to discrete optimization. In this approach, we add several auxiliary heuristic algorithms to the "usual" variants of the branch-and-bound algorithm; these auxiliary heuristics are implemented almost identically in different subject areas. It is important to note that the greatest effect of these auxiliary heuristics is usually achieved when they are applied simultaneously, i.e., in combination.
It is often necessary to calculate distances between sequences of various kinds. Similar algorithms are used in bioinformatics to compare sequenced genetic chains. Due to the large dimension of such chains, it is necessary to use heuristic algorithms that give approximate results.
Various such algorithms exist for genomes, but they have an obvious disadvantage: computing the distance between the same pair of DNA strings with different algorithms yields different results. Therefore, there arises the problem of assessing the quality of the metrics (distances) used; from the results of this assessment, one can draw conclusions about the applicability of an algorithm to various studies.
The other problem considered in biocybernetics is recovering the matrix of distances between DNA sequences when not all elements of the matrix are known at the input of the algorithm. We consider the possibility of using the method of comparative evaluation of algorithms for calculating distances between a pair of DNA strings, developed and studied by us earlier, to restore a partially filled distance matrix. Matrix recovery occurs over several computational passes, and the estimates of unknown matrix elements are averaged in a special way.
Continuing to improve the algorithms, we consider the use of the branch-and-bound method. To do this, we apply the algorithms considered before to some known sequence of unfilled elements, but now we choose special sequences of elements. In our interpretation of the branch-and-bound method, all possible sequences of unknown elements of the upper triangular part of the matrix are taken as the set of admissible solutions. In each current subtask, any of the blank elements of the matrix is taken as the separating element, and the sum of the badness values over all triangles already formed by the time this subtask is considered is taken as the bound. Thus, the elements of an incompletely filled matrix are determined in such a sequence that the final badness value over all triangles is selected using a greedy heuristic; this fits completely into the framework of classical descriptions of the branch-and-bound method.
As a result of applying such an algorithm, we obtain either the lowest possible badness (in the case of a completed run of the branch-and-bound method) or values close to optimal. In our computational experiments, the running time of the algorithm practically coincides with that of the algorithm considered before (it exceeds it by no more than 10%), while the badness value usually decreases by 20–40% from the initial value. Thus, we are able to quickly and efficiently restore the DNA matrix, often even if it is filled to less than 40%.
Here is a brief description of the subsections that we consider the most important; we shall consider the following questions:
• the mathematical justification of the correctness of the constructions being made (Subsection 2.3);
• a brief statistical study of the problem of DNA matrix reconstruction of small dimensions (Subsection 3.1);
• the consideration of an application example of the branch-and-bound algorithm in the problem of reconstructing the DNA matrix (Subsection 3.4).

SOME HEURISTICS FOR THE TRAVELING SALESMAN PROBLEM
The traveling salesman problem (TSP) is a well-known optimization problem whose goal is to find the shortest possible route that visits a given set of cities and returns to the origin city. Mathematically, it can be represented (in the Miller–Tucker–Zemlin form) as follows:

minimize ∑_{i=1}^{n} ∑_{j≠i} c_{ij} x_{ij}

subject to

∑_{j≠i} x_{ij} = 1 for all i,
∑_{i≠j} x_{ij} = 1 for all j,
u_i − u_j + n·x_{ij} ≤ n − 1 for 2 ≤ i ≠ j ≤ n,

where c_{ij} is the distance between city i and city j, x_{ij} is a binary variable that equals 1 if the path goes from city i to city j and 0 otherwise, and u_i are auxiliary variables to prevent sub-tours.
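For small instances, the formulation above can be checked against exhaustive enumeration. The following sketch (our illustration, not code from the study) computes an exact optimum, which is useful as ground truth when judging the heuristics discussed in this section:

```python
from itertools import permutations

def tsp_brute_force(dist):
    """Exact TSP by enumerating all tours that start at city 0.

    dist is an n x n matrix of pairwise distances; only feasible for
    small n, since (n - 1)! tours are examined.
    """
    n = len(dist)
    best_tour, best_len = None, float("inf")
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        # close the cycle back to city 0 via the (i + 1) mod n index
        length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if length < best_len:
            best_tour, best_len = tour, length
    return list(best_tour), best_len
```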
The following two subsections consider the possibility of applying, in our case (that is, in the case of the pseudo-geometric traveling salesman problem), standard algorithms commonly used for the geometric case.

Nearest neighbor algorithm
The nearest neighbor algorithm is a greedy heuristic that selects the nearest city to the current city at each step. The steps of the algorithm are as follows:
(1) Select a random city as the starting point.
(2) Find the nearest city to the current city that has not been visited yet.
(3) Move to the nearest city and mark it as visited.
(4) Repeat steps 2–3 until all cities have been visited.
(5) Return to the starting city to complete the tour.
The objective function of this algorithm can be expressed as

L_NN = ∑_{i=1}^{n−1} d(c_i, c_{i+1}) + d(c_n, c_1),

where L_NN is the total distance of the tour, c_i is the i-th city in the tour, and d(c_i, c_{i+1}) is the distance between city c_i and city c_{i+1}.
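The five steps above can be sketched as follows (an illustrative implementation; for determinism, the starting city is a parameter rather than random):

```python
def nearest_neighbor_tour(dist, start=0):
    """Greedy nearest-neighbor heuristic for the TSP.

    dist is a symmetric n x n matrix of pairwise distances; returns
    (tour, length), where the tour starts and ends at `start`.
    """
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        # step 2: pick the closest not-yet-visited city
        nxt = min(unvisited, key=lambda j: dist[last][j])
        tour.append(nxt)         # step 3: move there and mark visited
        unvisited.remove(nxt)
    length = sum(dist[tour[i]][tour[i + 1]] for i in range(n - 1))
    length += dist[tour[-1]][start]  # step 5: close the tour
    return tour, length
```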

Simulated annealing
Simulated annealing is a probabilistic technique that allows accepting suboptimal solutions in the early stages of the search in order to escape local minima. The algorithm can be outlined as follows:
(1) Choose an initial solution and set the initial temperature T.
(2) Perform a small perturbation of the current solution to obtain a new solution.
(3) If the new solution is better, accept it. If it is worse, accept it with probability exp(−Δ/T), where Δ is the increase in the objective function value.
(4) Decrease the temperature according to a cooling schedule and go to step 2.
(5) Repeat steps 2–4 until the stopping criteria are met.
The objective function can be written similarly to the one in the nearest neighbor section, but the acceptance criterion involves the Metropolis criterion

P(Δ) = exp(−Δ/T),

where P(Δ) is the probability of accepting a solution with an increase in energy Δ, and T is the temperature parameter that decreases over time according to the cooling schedule.
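A minimal sketch of this scheme follows (our illustration; the swap perturbation, cooling constants, and iteration budget are assumptions, not values from this study):

```python
import math
import random

def simulated_annealing_tsp(dist, T0=100.0, alpha=0.995, iters=5000, seed=0):
    """Simulated annealing for the TSP with a two-city swap perturbation.

    A worse tour is accepted with probability exp(-delta / T) (the
    Metropolis criterion); T decreases geometrically: T <- alpha * T.
    """
    rng = random.Random(seed)
    n = len(dist)
    tour = list(range(n))

    def length(t):
        return sum(dist[t[i]][t[(i + 1) % n]] for i in range(n))

    cur = length(tour)
    best, best_len = tour[:], cur
    T = T0
    for _ in range(iters):
        i, j = sorted(rng.sample(range(n), 2))
        cand = tour[:]
        cand[i], cand[j] = cand[j], cand[i]      # small perturbation
        delta = length(cand) - cur
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            tour, cur = cand, cur + delta        # Metropolis acceptance
        if cur < best_len:
            best, best_len = tour[:], cur
        T *= alpha                               # cooling schedule
    return best, best_len
```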

The mathematical justification of the correctness of the constructions being made
As stated above, we use heuristics previously applied to the traveling salesman problem. The possibility of their use stems from the following: for a given entry of the distance matrix, which represents the distance between two species, we consider a triangle, that is, the three distances between three species, and treat the remaining two sides of this triangle as two segments of the traveling salesman problem.
We impose the standard conditions of cubic spline construction:
(1) the cubic functions pass through the given data points;
(2) the first and second derivatives are continuous at the junctions of adjacent segments.
We denote the sets of such piece-wise functions by F and G. Then we define the functional Φ[f, g] as a sum of a data-fidelity term and a smoothness term weighted by a parameter λ; here, the domain Ω is the union of all the segments' domains. A standard form of such a functional is

Φ[f, g] = ∑_i (y_i − f(x_i))² + ∑_i (z_i − g(x_i))² + λ ∫_Ω ((f″(x))² + (g″(x))²) dx,

where (x_i, y_i) and (x_i, z_i) are the noisy data. Let λ be such that the minimizing elements f(x) and g(x) of Φ[f, g] satisfy the conditions above; we then find these minimizing elements. The minimizing cubic spline functions f(x) and g(x) provide the best approximation of the true path, given the noisy data and the chosen smoothness parameter λ.
2.4 Geometric approach to the pseudo-geometric problem: optimal and pseudo-optimal placement of the points

In this subsection, we delve deeper into the methodology developed for solving the pseudo-geometric traveling salesman problem (TSP). This method includes techniques for classifying input data and establishing a pseudo-optimal placement of points, which may have significant applications in various fields, including logistics and network design. The task of pseudo-recovering the original coordinates of a set of points is more complex than merely classifying them (i.e., determining their class based on the known cost matrix in the context of the TSP).
Our proposed algorithm addresses this by reducing the problem to restoring the location of points in a geometric version of the TSP using a distance matrix defined by the function given in equation (3). This problem can be tackled effectively using the following strategy:
• Select two arbitrary points P_1 and P_2 from the vertex set V and fix their positions.
• To determine the coordinates of each subsequent point (denoted P with coordinates (x, y)), select two points A = (x_a, y_a) and B = (x_b, y_b) that have already been positioned, and solve the system of equations

(x − x_a)² + (y − y_a)² = c(A, P)²,
(x − x_b)² + (y − y_b)² = c(B, P)².

It should be noted that applying this algorithm to the distance matrix described in equation (3) in the pseudo-geometric TSP context does not, in general, recover the original points' locations from the geometric TSP scenario. Furthermore, a possible violation of the triangle inequality in the given distance matrix may render the algorithm inapplicable.
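The two circle equations for placing a new point can be solved in closed form. The following sketch (our illustration) returns one of the two mirror-symmetric solutions and signals inconsistent input data:

```python
import math

def place_point(A, rA, B, rB, pick_upper=True):
    """Find (x, y) at distance rA from A and rB from B.

    Subtracting the two circle equations gives a line; intersecting it
    with the first circle yields up to two mirror-image solutions, and
    pick_upper chooses between them. Returns None when the circles do
    not intersect (e.g. when the triangle inequality is violated in
    the given matrix, as discussed in the text).
    """
    (xa, ya), (xb, yb) = A, B
    d = math.hypot(xb - xa, yb - ya)
    if d == 0 or d > rA + rB or d < abs(rA - rB):
        return None  # inconsistent distances: no intersection
    a = (rA * rA - rB * rB + d * d) / (2 * d)  # distance from A along AB
    h = math.sqrt(max(rA * rA - a * a, 0.0))   # offset perpendicular to AB
    xm = xa + a * (xb - xa) / d
    ym = ya + a * (yb - ya) / d
    sgn = 1.0 if pick_upper else -1.0
    return (xm - sgn * h * (yb - ya) / d, ym + sgn * h * (xb - xa) / d)
```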
In this research, our main focus and forthcoming work concern a refined algorithm of ours that addresses a specific TSP instance and employs the aforementioned algorithm for retrieving points' positions in a geometric TSP context. Moreover, this algorithm is versatile enough to be used in any TSP variant, although its application is not recommended in most "random", non-pseudo-geometric TSP cases.
Therefore, notwithstanding the general infeasibility of applying similar algorithms to the pseudo-geometric TSP variant, we endeavor to employ the same algorithms to address the city positioning problem, essentially tackling the minimization problem for a specially computed discrepancy, or "badness", metric, defined as

badness = (2 / (n · (n − 1))) · ∑_{1 ≤ i < j ≤ n} (ρ(P_i, P_j) − c(P_i, P_j))²,

where:
• P_i (i = 1, …, n) are the points, with the coordinates of the i-th point represented by (x_i, y_i);
• ρ(P_i, P_j) denotes the (i, j)-th element of the given cost matrix;
• c(P_i, P_j) signifies the (i, j)-th element of the obtained cost matrix.
In the degenerate case (i.e., when the noise parameter equals 0) and for an ideal solution, a badness value of 0 should be achieved.
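For concreteness, the badness of a candidate placement can be computed directly from the coordinates and the given matrix. This is our sketch; the normalization 2/(n(n−1)), averaging over the n(n−1)/2 pairs, is our reading of the damaged formula:

```python
import math

def placement_badness(points, C):
    """Average squared discrepancy between the distances induced by the
    placed points and the given cost matrix C.

    points is a list of (x, y) pairs; C is the given n x n cost matrix.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            (xi, yi), (xj, yj) = points[i], points[j]
            rho = math.hypot(xi - xj, yi - yj)  # distance from coordinates
            total += (rho - C[i][j]) ** 2
    return 2.0 * total / (n * (n - 1))
```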
As a heuristic solution to this problem, we propose the following algorithm. Essentially, we address the same minimization problem, albeit disregarding the geometric positioning of the points.
Algorithm for pseudo-optimal placement of points.
Input. A matrix C : V² → ℕ₀ (the matrix of weights of the edges of a complete weighted graph with vertex set V = {v_1, …, v_n}); a value k ∈ ℕ.
The parameter k represents the number of pairs selectable from the already placed points. These pairs are tried in order to accommodate each new point, starting with the fourth one.

HEURISTICS IN THE PROBLEM OF RECONSTRUCTING THE DNA CHAIN DISTANCE MATRIX
The reconstruction of the DNA chain distance matrix is a critical problem in computational biology, which has implications in phylogenetics, molecular evolution, and other fields.In general, the problem entails determining the evolutionary distances between different DNA sequences.
A DNA molecule is a double helix consisting of four types of nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). The sequential arrangement of these nucleotides forms a DNA sequence, which can be represented as a string of characters, e.g., S = ACGTGACGTGGTAC… A distance matrix contains the distances between all pairs of sequences in a set. If we have n sequences, the distance matrix D will be an n × n matrix, where D_{ij} represents the distance between sequence i and sequence j.
A common approach to computing the distances between sequences is to perform pairwise sequence alignments, using methods such as the Needleman–Wunsch algorithm.
The Needleman–Wunsch (NW) algorithm is a foundational global alignment method in bioinformatics, formulated by Saul B. Needleman and Christian D. Wunsch in 1970. This dynamic programming algorithm is primarily used to find the optimal alignment between two sequences over their entire length. The mathematical formulation is as follows. Given two sequences S_1 and S_2 of lengths m and n respectively, the NW algorithm constructs an (m + 1) × (n + 1) scoring matrix F, where the entry F(i, j) represents the optimal score for aligning the prefixes S_1[1…i] and S_2[1…j]. The recursive definition of the scoring matrix is

F(i, j) = max{ F(i − 1, j − 1) + s(S_1[i], S_2[j]), F(i − 1, j) + g, F(i, j − 1) + g },

where:
• s(a, b) is the score of aligning character a with b, often obtained from a substitution matrix;
• g is the gap penalty, representing the cost of inserting a gap in the alignment.
The matrix F is initialized as follows: F(0, 0) = 0, F(i, 0) = i · g, and F(0, j) = j · g. After computing the entire matrix F, the optimal alignment is obtained by backtracking from F(m, n) to F(0, 0), constructing the alignment by either matching characters or inserting gaps as dictated by the values in the matrix.
This global alignment algorithm ensures that the best alignment across the entire length of the sequences is found, providing a holistic view of the similarities and differences between them.
The time complexity of the Needleman–Wunsch algorithm is O(mn), and the space complexity is also O(mn), which may become a bottleneck for very long sequences.
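A compact sketch of the recurrence and initialization follows (the scoring values match = 1, mismatch = −1, gap = −2 are illustrative choices of ours; the paper does not specify its substitution scores or gap penalty):

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the Needleman-Wunsch DP.

    F[i][j] holds the best score of aligning s1[:i] with s2[:j].
    Returns the optimal global alignment score F[m][n].
    """
    m, n = len(s1), len(s2)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap              # initialization: leading gaps
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # (mis)match
                          F[i - 1][j] + gap,     # gap in s2
                          F[i][j - 1] + gap)     # gap in s1
    return F[m][n]
```

Recovering the alignment itself requires the backtracking pass described above; the sketch returns only the score, which is what a distance matrix needs.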

A brief statistical study of the problem of DNA matrix reconstruction of small dimensions
In this section, we consider the problem of reconstructing the matrix of distances between DNA sequences. This problem belongs to the field of biocybernetics and was previously described by the first author in [4, 5]. The task is to restore the elements of the distance matrix between DNA sequences when, as a rule, only about 50% of the matrix elements are known. To solve this problem, it is advisable to use a so-called anytime algorithm [6], which allows one to track the gradual recovery of the elements of the distance matrix.
The paper [7] investigates algorithms for solving the problem of restoring a low-rank matrix with an arbitrarily damaged fraction of its elements. This problem can be considered a robust version of classical principal component analysis [8] and occurs in a number of applications, including image processing, web data ranking, and bioinformatics data analysis.
The distance matrix reconstruction algorithm is based on the analysis of all possible triangles in the distance matrix. For each triangle, a badness value is calculated, the tracking of which is part of the heuristic approach (see [4] for details). However, if one passes through the elements of the matrix "left to right" and "top to bottom", one can get unwanted results with a significant increase in badness for previously examined triangles. One approach to working around this problem is to form a sequence of subtasks by modifying the branch-and-bound algorithm [5]. In this case, the badness value of the matrix acts as the boundary value in the classical branch-and-bound algorithm [9, 10]. Our goal is to consider various statistical indicators arising from the reconstruction of the matrices described in [5] by the modified branch-and-bound algorithm.
Below are the results of our statistical study of DNA matrices of small dimensions for the reconstruction problem. The results of our computational experiments are given in Table 1 below. As in the case of the previous problem (the TSP), we answered only one question: how often at least one pair of similar matrices occurs when applying the first few steps of the "classical" branch-and-bound algorithm. However, in comparison with the TSP and the previous section, we have replaced the word "identical" with the word "similar": here we consider matrices similar if their sets of empty elements coincide. It is easy to verify (for example, by computational experiments) that the further work of the branch-and-bound algorithm with two similar matrices in the vast majority of cases proceeds with the same order of selection of the not-yet-filled matrix elements.
The input data were generated from matrices obtained by applying the Needleman–Wunsch algorithm [11]. We applied this algorithm to the mitochondrial DNA (mtDNA) chains of different animals taken from the NCBI databank [16]; the sequenced mtDNA chains were taken for one representative of each of the 28 mammalian orders (the mammalian classification is chosen according to [15]; other classification options were not considered). From this matrix we randomly, using a uniform distribution, chose the desired number of its rows/columns (5, 10, or 15), obtaining a matrix of smaller dimension, in which we left the desired percentage of elements (from 30% to 70%). For each pair consisting of the dimension and the percentage of deleted elements, we performed 1000 such generations. Next, we ran the branch-and-bound method, but, in contrast to the consideration of the TSP in the previous section, the last parameter was not the number of steps of the branch-and-bound algorithm but the number of resulting subproblems: upon obtaining 10 (or 30) subproblems, we stopped the calculation. We also, of course, stopped the calculations upon obtaining two similar matrices, and it is the number of such cases (out of 1000 possible) that is reflected in Table 1. In that table:
• the rows specify the dimension;
• the columns indicate the percentage of deleted elements (for example, if the dimension is 10, we have only 45 elements located above the main diagonal; removing 40% means removing 18 of them);
• each filled cell contains the results of the calculations for that case: the number of cases (out of 1000 considered) for which at least one pair was obtained;
• the first value in a cell is given for calculations that stopped after obtaining 10 subtasks, and the second for those stopped after obtaining 30 subtasks.

Minimization technique
In the study of the matrix A = (a_{i,j}) and its constituent elements a_{i,j}, it is conventionally assumed that the matrix is symmetric, i.e., a_{i,j} = a_{j,i}, with the exception of the diagonal elements. In this context, the diagonal elements are the elements a_{i,i}, and any arithmetical expressions containing them are excluded from the formulas. This presupposition aids in focusing on the pivotal elements contributing to the total error, denoted by ε. Different approaches exist for its calculation; one of them involves the sequential application of the following formulas:
• m(1)_{i,j,k} = max(a_{i,j}, a_{i,k}, a_{j,k});
• m(2)_{i,j,k} = min(a_{i,j}, a_{i,k}, a_{j,k});
• the badness ε_{i,j,k} of the triangle (i, j, k) is then computed from m(1)_{i,j,k} and m(2)_{i,j,k}, and ε is the sum of ε_{i,j,k} over all triangles.
Moving forward, we can rephrase the primary problem as an optimization task aiming to minimize the error value ε, a metric previously referred to as "badness" in earlier research. The procedure requires a sequential, stepwise insertion of the missing elements, thereby streamlining the implementation of the corresponding algorithm.
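Since the exact combining formula for the triangle badness is not fully legible in our copy, the following sketch assumes one plausible normalized variant consistent with the "isosceles triangle" principle: it compares the two largest entries of each triangle.

```python
def triangle_badness(a, i, j, k):
    """Badness of the triangle (i, j, k) of matrix `a`: the normalized
    gap between the two largest of its three entries (by the
    "isosceles triangle" principle they should be nearly equal).
    The exact formula here is our assumption, not taken from the text.
    """
    s = sorted((a[i][j], a[i][k], a[j][k]), reverse=True)
    return (s[0] - s[1]) / (s[0] + s[1]) if s[0] + s[1] else 0.0

def total_error(a):
    """Total error: the sum of triangle badness over all triangles;
    diagonal elements never appear, as stipulated in the text."""
    n = len(a)
    return sum(triangle_badness(a, i, j, k)
               for i in range(n)
               for j in range(i + 1, n)
               for k in range(j + 1, n))
```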
By employing this approach, we generate a matrix populated with noisy data, denoted ã_{i,j}. Subsequently, we reconstruct the missing values through the resolution of the previous equations. It is crucial to note that the noise level, which manifests during the restoration of the missing values, can be quantitatively assessed through an analysis of the deviations from the "isosceles triangle" principle, as detailed in [12].
Ultimately, this strategy of sequentially populating the missing matrix elements assures a progressive enhancement of the resulting solution. Theoretically, this rationalizes forsaking the more time-consuming branch-and-bound method in favor of the greedy algorithm for determining the values of individual elements, thereby expediting the overall computational process.

Quality assessment criteria for numerical solutions
Previously, we discussed the need to determine an effective method of gauging the quality of the solutions generated by our recovery algorithms. The existing computational model does not fully address the quality assessment of matrix restoration. Consequently, a straightforward quality criterion could be the comparison, under an appropriate metric, of the restored matrix with the actual distance matrix. This comparison could be executed for small-scale examples, albeit restricted to a finite number of iterations, preferably during the preliminary phases of algorithm testing.
To this end, we introduce two plausible criteria for evaluating numerical solutions to such restoration problems: (1) The first criterion contrasts the matrix reconstructed by the current simplified algorithm with the matrix developed through the application of a comprehensive algorithm during the formation of each element, as documented in [13, 14]. This criterion value will be denoted by τ.
(2) The second criterion scrutinizes the discrepancy by a different method, leveraging the same algorithms that serve as auxiliary procedures within the general recovery algorithm explored in this study. The criterion value in this case will be denoted by δ (it was denoted differently in some previous publications).
In both instances, the overarching aim remains to diminish the values deduced by the applied criteria.
We characterize the criteria as follows: (1) For τ, the typical formulation is a normalized element-wise comparison of the reconstructed matrix with the matrix all of whose elements a_{i,j} are retrieved via the original algorithm (such as the well-cited Needleman–Wunsch algorithm), without any element restoration. It is pertinent to mention that this method is seldom utilized, especially for extensive matrices generated through certain distance determination algorithms, which highlights the more universal applicability of the subsequent criterion, δ.
(2) Subsequently, we define δ through the badness values of all triangles of the reconstructed matrix. It is worth noting that δ can be computed swiftly, despite the necessity of analyzing approximately n³ triangles. Furthermore, we observe a correlation between these criteria and the task at hand: for instance, random matrices yield notably poorer results according to the δ criterion, even for small dimensions. As evidence, a 13×13 random matrix recorded δ values within the 0.4 to 0.5 range, substantially exceeding the corresponding values for correct matrices of size 28×28 with a lower initial fullness percentage.

The consideration of an application example of the branch-and-bound algorithm
Each matrix has a number of characteristics that affect the outcome of the branch-and-bound method. One of these characteristics is the percentage of deleted elements. Fig. 1 shows a selection of the results of the method for the problem discussed in the previous sections, for 5×5 matrices with the removal of 40%, 50%, and 60% of the elements of the original matrix.
The following graph (Fig. 2) shows the average percentage of successful recoveries as a function of the percentage of deleted elements.
The results of similar numerical experiments for 10×10 matrices are given in Fig. 3. We also present a graph of the average number of successful recoveries in Fig. 4. Compared to the previous results for 5×5 matrices, it is noticeable that the percentage of successful recoveries has increased significantly. This seems to be due to the fact that there is more room for triangle selection and element recovery during the computational process. With a further increase in the dimension of the matrices, we observe that the percentage of successful recoveries tends to 100 for the specified percentages of deleted elements (40%, 50%, 60%); see Fig. 5. The next interesting characteristic is the height of the decision tree (i.e., the maximum path length from the tree root to a leaf). The presented graphs clearly show that the average height of the decision tree for 5×5 matrices does not vary much with the percentage of deleted elements.
It also makes sense to look at the change in the height of the decision tree depending on the dimension of the matrix with a fixed percentage of deleted elements. The two diagrams below, Fig. 6 and Fig. 7, illustrate this comparison.
We can see that the height of the decision tree grows rapidly with the dimension of the matrix. Also, with increasing dimension, the number of subtasks of the same level increases significantly, and therefore the search among subproblems takes much longer.

CONCLUSION
The statistical regularities obtained in this article in fact estimate the probability of a favorable situation, one that makes it possible to avoid computing the separating element yet again. (According to our calculations, also obtained in the course of computational experiments, in some variants of the branch-and-bound algorithm more than 99% of the program's running time is spent on the choice of the separating element.) Therefore, the results provide a rationale for applying the clustering of situations in the development of algorithms for solving discrete optimization problems by the branch-and-bound method. This application gives easily observable improvements of the algorithm; it is this variant that, from our point of view, reflects the representativeness of the data in many real problems.

Figure 1 :
Figure 1: The selection of the results, 5×5.

Figure 5 :
Figure 5: The tendency of the percentage of successful recoveries.

Figure 6 :
Figure 6: The height of the decision tree, 40% of deleted elements.

Figure 7 :
Figure 7: The height of the decision tree, 60% of deleted elements.

Table 1 :
The number of cases (out of 1000 considered), for which at least one pair was obtained