GPU-acceleration of neighborhood-based dimensionality reduction algorithm EmbedSOM

Dimensionality reduction methods have found vast applications as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets. In this paper, we propose a highly optimized GPU implementation of EmbedSOM, a dimensionality reduction algorithm based on self-organizing maps. We detail the optimizations of the k-NN search and 2D projection kernels that comprise the core of the algorithm. To tackle thread divergence and low arithmetic intensity, we use a modified bitonic sort for the k-NN search and a projection kernel that utilizes vector loads and register caches. The performance benchmarks indicate that the optimized EmbedSOM implementation is capable of projecting over 30 million individual data points per second.


INTRODUCTION
Dimensionality reduction algorithms emerged as indispensable utilities that enable various forms of intuitive data visualization, providing insight that in turn simplifies rigorous data analysis. The development has benefited especially the life sciences, where algorithms like t-SNE [16] reshaped the accepted ways of interpreting many kinds of measurements, such as genes, single-cell phenotypes and development pathways, and behavioral patterns [2, 15].
The performance of non-linear dimensionality reduction algorithms becomes a concern if the analysis pipeline is required to scale or when the results are required in a limited amount of time, such as in clinical settings. To tackle the limitations of poor scalability, Kratochvíl et al. developed EmbedSOM [7], a dimensionality reduction and visualization algorithm based on self-organizing maps (SOMs) [5]. EmbedSOM provided a 10× speedup on datasets typical for single-cell cytometry data visualization while retaining competitive quality of the results. Still, the parallelization potential of EmbedSOM has remained mostly untapped.
This paper describes an efficient, highly parallel GPU implementation of EmbedSOM designed to provide real-time results on large datasets. The implementation is accompanied by performance benchmarks of the individual optimizations to evaluate the optimal variants for different dataset sizes. Both the implementation and the data are available in our GitHub repository. Furthermore, the repository also contains the figures that were omitted due to space constraints.
In the paper, we first describe the EmbedSOM algorithm in Section 2. We specifically detail the CUDA-based GPU implementation of the algorithm in Section 3 and evaluate its performance in Section 4. Related work is discussed in Section 5 and Section 6 concludes the paper.

LANDMARK-DIRECTED REDUCTION
EmbedSOM is a visualization-oriented method of non-linear dimensionality reduction. It works by describing high-dimensional points by their locations relative to landmarks equipped with a topology, and then reproducing the points in a low-dimensional space using an explicit low-dimensional projection of the landmarks with the same topology [7].
Formally, the EmbedSOM algorithm works as follows. Let $d$ be the dimension of the high-dimensional space, and assume $\mathbb{R}^2$ is the low-dimensional space for brevity. EmbedSOM processes $n$ $d$-dimensional points in a matrix $X$ of size $n \times d$ and outputs $n$ 2-dimensional points in a matrix $Y$ of size $n \times 2$. The high- and low-dimensional landmarks similarly form matrices $M$ of size $g \times d$ and $W$ of size $g \times 2$, where usually $g \ll n$. Each point $x \in X$ is transformed to a point $y \in Y$ as follows:

(1) the $k$ nearest landmarks are found for point $x$ ($k$ is a constant parameter satisfying $3 \le k \le g$),
(2) the landmarks are ordered and scored by a smooth distance function that assigns the highest score to the closest landmark and 0 to the $k$-th landmark (ensuring the smoothness of the projection in cases $k < g$ [7]),
(3) for each pair $(i, j)$ of the closest $k - 1$ landmarks (the ones with non-zero score), a projection of the point $x$ is found on the 1-dimensional affine space with coordinate 0 at $M_i$ and 1 at $M_j$; the 1-dimensional coordinate of the projection in this affine space is taken as $p_{ij}(x)$, and the same projected coordinates are defined in the low-dimensional space as $q_{ij}(y)$,
(4) point $y$ is fitted to the low-dimensional space so that the squared error in the coordinates, weighed by the nearest-landmark scores $(s_i, s_j)$, is minimized:

$$y = \operatorname{argmin}_{y' \in \mathbb{R}^2} \sum_{i \neq j} s_i s_j \bigl( p_{ij}(x) - q_{ij}(y') \bigr)^2 .$$

Because $q_{ij}(\cdot)$ is designed as a linear operator, the error minimization problem (step 4) collapses to a trivial solution of 2 linear equations with 2 variables. The complete algorithm may be found in the original publication [7, Algorithm 1].
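To see why this collapse happens, consider the following derivation sketch (written in the notation above; the exact formulation in [7] may differ slightly):

```latex
\begin{aligned}
E(y) &= \sum_{i \neq j} s_i s_j \bigl( p_{ij}(x) - q_{ij}(y) \bigr)^2
  && \text{($q_{ij}$ is affine in $y=(y_1,y_2)$, so $E$ is quadratic in $y$)} \\
\frac{\partial E}{\partial y_1} = \frac{\partial E}{\partial y_2} = 0
  \;&\Longrightarrow\; A\,y = b,
  && A \in \mathbb{R}^{2 \times 2},\; b \in \mathbb{R}^2 ,
\end{aligned}
```

and the resulting $2 \times 2$ system can be solved directly, e.g., by Cramer's rule (cf. Section 3.2).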

GPU IMPLEMENTATION OF EMBEDSOM
While EmbedSOM is relatively straightforward to parallelize for mainstream CPU architectures, several challenges appear when optimizing for contemporary GPUs. This section outlines the key optimizations that made the high-performance EmbedSOM implementation possible and overviews the relative performance gains achieved by the individual choices made. The algorithm consists of two main parts, which we describe in the remainder of the section:
• $k$-NN selection step: the search of the $k$ nearest landmarks in $M$ for each data point from $X$ requires a highly irregular selection of the indices of the $k$ lowest values from the columns of the dynamically computed distance matrix between $X$ and $M$, and is difficult to optimize due to the irregular memory access patterns of collecting the data for the computation.
• Projection step: the computation of the small linear system that is used to find the projection of a point, namely of the projections $p_{ij}(x)$ and $q_{ij}(y)$ and the derived least-squares fit.

𝑘-NN selection step
The task of the first part of the algorithm is to find the $k$ nearest landmarks (from $M$) for every data point in $X$. This comprises two substeps: computing the Euclidean distances for every pair from $X$ and $M$, and performing a point-wise reduction that selects a set of $k$ nearest landmarks for each of the $n$ points, based on the computed distances.
While the Euclidean distance computation is mathematically simple and embarrassingly parallel, achieving optimal throughput on GPUs is quite challenging [10]. In particular, the ratio between the data transfers and the arithmetic operations performed by each GPU core is heavily biased towards data transfers. The overhead of data transfers is best prevented by finding a good caching pattern for the input data that is able to optimally utilize all hardware caches (L1 and L2), shared memory, and core registers.
The parallel implementation of the $k$-NN search is even more challenging. The $k$-NN problem is computed individually for each data point, which provides a natural source of parallelism. However, concurrently processed instances of a naïve $k$-NN implementation exhibit severe code divergence because the selection process is purely data-driven, and they require a large amount of memory allocated per core. Optimally, the $k$-NN selection is realized by customized versions of parallel sorting algorithms, which are well-researched and possess existing GPU implementations [13].
Our implementation optimizes both sub-steps, since the ratio of the amounts of computation they require can be easily biased by the configuration of the parameters $d$ and $k$. In particular, processing high-dimensional datasets with a low $k$ parameter spends significantly more time in the distance computation, while lower-dimensional datasets with higher $k$ require more time in the nearest-neighbor selection.
From the perspective of software design, the implementation may use separate kernels for the two sub-tasks or a single fused kernel. Kernel separation provides better code modularity and more flexibility in the work-to-thread division and the data caching strategy, at the cost of having to materialize all the computed distances in the GPU global memory, thus significantly increasing the total amount of data transfers. In contrast, a fused kernel may immediately utilize the computed distances in the $k$-NN computation without transferring the data to global memory, and interleaving the distance computations with the $k$-NN selection may help to improve the ratio between computations and data transfers. Since our initial observations showed that the overhead of the data transfers required for kernel communication is relatively high, we decided to implement only the fused variant for the sake of simplicity. The usage of separate kernels might be interesting in the future, especially for extreme values of $d$ that diminish the relative cost of the distance data transfer.

Available algorithms for 𝑘-NN.
There are many approaches to $k$-NN selection, varying in complexity and parameter-dependent performance. We implemented several of the possibilities (as described in this section) to substantiate our choice of the algorithm for GPU EmbedSOM.
As a baseline (labeled Base), we used the most straightforward approach to GPU parallelization, which simply invokes the original sequential code for every data point concurrently. The Base kernel is spawned in $n$ threads (one for each data point), and each thread computes the distance between its data point and all landmarks while maintaining an ordered array of the $k$ nearest neighbors. The array is updated by an insert-sort step performed for every newly computed distance, i.e., by starting at the end of the array and moving the new distance-index pair towards smaller values until it reaches the correct position.
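A minimal sketch of this baseline follows (illustrative names and a simplified memory layout, not the actual code from our repository):

```cuda
// Sketch of the Base k-NN kernel: one thread serves one data point and keeps
// its k nearest landmarks in a sorted per-thread array.
#include <cfloat>

#define K_MAX 64  // assumed upper bound on k, so the arrays fit per thread

__global__ void baseKnn(const float *points, const float *landmarks,
                        int n, int g, int d, int k, int *outIdx) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;

    float bestDist[K_MAX];
    int bestIdx[K_MAX];
    for (int i = 0; i < k; ++i) bestDist[i] = FLT_MAX;

    for (int l = 0; l < g; ++l) {
        // squared Euclidean distance between point p and landmark l
        float dist = 0.0f;
        for (int c = 0; c < d; ++c) {
            float diff = points[p * d + c] - landmarks[l * d + c];
            dist += diff * diff;
        }
        if (dist >= bestDist[k - 1]) continue;  // not among the k nearest
        // insert-sort step: shift larger entries right, insert the new pair
        int pos = k - 1;
        while (pos > 0 && bestDist[pos - 1] > dist) {
            bestDist[pos] = bestDist[pos - 1];
            bestIdx[pos] = bestIdx[pos - 1];
            --pos;
        }
        bestDist[pos] = dist;
        bestIdx[pos] = l;
    }
    for (int i = 0; i < k; ++i) outIdx[p * k + i] = bestIdx[i];
}
```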
The Shared algorithm is a modified version of the baseline that utilizes shared memory as a cache, following the recommended optimization practice of improving performance by caching data that are reused multiple times [12]. In this case, we cache the landmark coordinates, which are sufficiently small to fit in the shared memory for all tested parametrizations. In the GridInsert algorithm, we utilize the shared memory to cache both landmarks and points. However, the limited size of the shared memory restricts the amount of cached data. Hence, the algorithm is parametrized by the block height $h$ (the number of cached points from $X$) and the block width $w$ (the number of cached landmarks from $M$). The algorithm runs in epochs, each of which first caches $h$ points and $w$ landmarks, and then computes $h \cdot w$ distance values using only data in the shared memory. While the distances are computed concurrently by the whole thread block, we chose to avoid explicit synchronization in the $k$-NN step, using only $h$ threads to incorporate the newly computed distances into the $h$ separate $k$-NN results using the insert-sort steps. GridInsert should achieve better throughput in the distance computation thanks to the caching, at the cost of a slightly sub-optimal $k$-NN reduction, thus giving the best performance on high-dimensional datasets and low values of $k$.
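The shared-memory staging used by Shared can be sketched as follows (hypothetical names; only the caching differs from the Base kernel):

```cuda
// Sketch of the Shared k-NN variant: the whole landmark matrix is staged into
// shared memory once per thread block; the per-thread insert-sort loop of the
// Base kernel then reads landmarks from shared memory.
// Launched as: sharedKnn<<<blocks, 256, g * d * sizeof(float)>>>(...);
__global__ void sharedKnn(const float *points, const float *landmarks,
                          int n, int g, int d, int k, int *outIdx) {
    extern __shared__ float smLandmarks[];  // g * d floats, sized at launch

    // cooperative, coalesced load of all landmark coordinates
    for (int i = threadIdx.x; i < g * d; i += blockDim.x)
        smLandmarks[i] = landmarks[i];
    __syncthreads();

    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n) return;
    // ...the distance and insert-sort loop is identical to the Base kernel
    // above, with landmarks[l * d + c] replaced by smLandmarks[l * d + c]...
}
```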
Finally, building on our previous work [8], we implemented the Bitonic $k$-NN selection algorithm, which utilizes routines from the highly parallelizable bitonic sorting algorithm. Bitonic sorting is very suitable for parallel lockstep execution [10], and its capability to merge sorted sequences has allowed us to keep only $2k$ distances (instead of $g$) in the shared memory. This method benchmarked the best on average, so it is selected as the default for EmbedSOM, and we describe it more thoroughly in the following.
Bitonic approach to 𝑘-NN.
The Bitonic approach can be seen as a combination of the benefits of the other algorithms: it does not require materializing all distances in memory to perform a full sort, and even though it does not use an elaborate input caching strategy like GridInsert, it still performs well because the data loading operations can be partially overlapped with the bitonic sorting operations if enough warps are allocated to one streaming multiprocessor.
The bitonic comparator network provides a building block that, given two buffers of $k$ neighbor distances sorted by the bitonic sort, selects the closest $k$ of the neighbors in a single (parallel) operation, allowing us to quickly discard neighbors that do not belong to the $k$-neighborhood. Applying this operation iteratively on $k$-sized blocks of distances sorted by the bitonic sort (as shown in Figure 1), we obtain a highly performing scheme that requires only $2k$ items present in the shared memory. In particular, the shared memory always contains one $k$-block of distances (and the corresponding indexes) that holds the $k$ so-far-nearest neighbors, and one block of $k$ distances newly computed from $M$; in each iteration, both blocks are sorted by the bitonic sorter in parallel and merged by the bitonic comparator to move the distances of the new $k$ nearest neighbors into the intermediate block. The other block is then re-filled with a new set of $k$ distances from $M$.
Technically, each step of the sorting network requires $k/2$ comparators, thus optimally $k/2$ threads that work concurrently on a $k$-sized block. Hence, we allocate $k$ threads for each data point, which alternate their work between computing a block of $k$ distances and performing two bitonic sorts on two $k$-sized blocks in parallel. For simplicity, our implementation assumes that $k$ is always a power of 2, and the excessive output of the sorter is discarded.
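A simplified sketch of the core merge step follows (hypothetical names; it assumes that the $k$ cooperating threads of a point reside in a single warp, while the full implementation also handles larger $k$):

```cuda
// Sketch of the Bitonic top-k step. Two ascending-sorted blocks of k
// (distance, index) pairs live in shared memory: cur holds the k nearest
// landmarks found so far, neu holds k freshly computed candidates. One
// comparator layer moves the overall k smallest into cur (as a bitonic
// sequence); a bitonic merge of log2(k) layers then restores ascending order.
__device__ void bitonicTopK(float *curD, int *curI,
                            const float *neuD, const int *neuI, int k) {
    int t = threadIdx.x % k;  // k cooperating threads per data point
    // pair cur[t] with neu[k-1-t] and keep the smaller distance in cur;
    // this yields the k smallest of all 2k elements, in bitonic order
    int o = k - 1 - t;
    if (neuD[o] < curD[t]) {
        curD[t] = neuD[o];
        curI[t] = neuI[o];
    }
    __syncwarp();  // assumes k <= warpSize; larger k needs __syncthreads()
    // bitonic merge: sort the now-bitonic k-element block ascending
    for (int stride = k / 2; stride > 0; stride /= 2) {
        int partner = t ^ stride;
        if (partner > t && curD[partner] < curD[t]) {
            float td = curD[t]; curD[t] = curD[partner]; curD[partner] = td;
            int   ti = curI[t]; curI[t] = curI[partner]; curI[partner] = ti;
        }
        __syncwarp();
    }
}
```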

Projection step
The second part of the dimensionality reduction method is the actual projection into the low-dimensional space. The computation of the low-dimensional point position $y$ by EmbedSOM involves: (1) conversion of the distances collected in the $k$-NN step to scores; (2) orthogonal projection of $x$ to the $k^2$ lines generated by the $k$ neighbors to create the contributions to the final approximation matrix; (3) solution of the resulting small linear system using Cramer's rule.
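For illustration, the per-point solve of step (3) can look as follows (a minimal sketch with our names; the determinant is assumed nonzero for valid inputs):

```cuda
// Sketch of step (3): solving the accumulated 2x2 system A y = b with
// Cramer's rule. Each point is solved independently, hence this step is
// embarrassingly parallel.
__device__ float2 solve2x2(float a11, float a12, float a21, float a22,
                           float b1, float b2) {
    float det = a11 * a22 - a12 * a21;  // assumed nonzero
    float2 y;
    y.x = (b1 * a22 - b2 * a12) / det;  // first unknown
    y.y = (a11 * b2 - a21 * b1) / det;  // second unknown
    return y;
}
```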
Since the first and the last steps are embarrassingly parallel problems with straightforward optimal implementations, and since the second step is the most time-demanding (performing $O(k^2)$ operations on vectors of size $d$), we focus mainly on the orthogonal projections. Their computation is complicated by a highly irregular pattern of repeated accesses to an arbitrary $k$-sized subset of $M$. We designed several algorithms that successively optimize the access patterns, detailed below.
The baseline algorithm Base uses the most straightforward parallel approach (similar to the Base $k$-NN), where each thread computes the projection of one single point sequentially, so concurrency is achieved only by processing multiple points simultaneously. All data are stored in the global memory, and no explicit cache control is performed.
The irregular repeated access to the elements of $M$ hinders the performance of the baseline algorithm. In the Shared algorithm, we chose to reorganize the workload so that each projection is computed by a whole block of threads that cooperatively iterate over the landmark pairs. As a result, the input data of the orthogonal projection (i.e., the $k$ nearest neighbors from $M$ together with the distances, scores, and 2D versions of the landmarks) can be cached in shared memory. The intermediate sub-results, represented by $2 \times 3$ matrices, are successively added into per-thread privatized copies to avoid explicit synchronization, and aggregated at the end using a standard parallel reduction enhanced with warp-shuffle instructions (a similar scheme is used in an optimal CUDA k-means implementation [9]).
Because the data transfers comprise a considerable portion of the Shared algorithm execution time, we optimized the transfers using alignment and data packing techniques, yielding the Aligned algorithm. The implementation is based on using vector data types (e.g., float4 in CUDA) to enable the utilization of 128-bit load/store instructions, which improves the overall data throughput. The vectorization comes only at a relatively small cost of aligning and padding the vectors to 16-byte blocks.
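A sketch of a vectorized inner product as used by the orthogonal projections (illustrative; dPad denotes $d$ padded to a multiple of 4, with the padding zeroed so it does not affect the result):

```cuda
// Sketch of the Aligned technique: rows are padded to a multiple of four
// floats, so the inner products needed by the projections can be computed
// with 128-bit float4 loads. All pointers are float4 views of padded rows.
__device__ float dotDiff4(const float4 *x, const float4 *mi,
                          const float4 *mj, int dPad) {
    // computes <x - mi, mj - mi> over the padded row
    float acc = 0.0f;
    for (int w = 0; w < dPad / 4; ++w) {
        float4 a = x[w], b = mi[w], c = mj[w];  // three 128-bit loads
        acc += (a.x - b.x) * (c.x - b.x) + (a.y - b.y) * (c.y - b.y)
             + (a.z - b.z) * (c.z - b.z) + (a.w - b.w) * (c.w - b.w);
    }
    return acc;
}
```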
To further improve the data caching, we implemented the Registers algorithm, where each thread computes more than one landmark pair in a single iteration, so that the coordinates loaded into its registers can be shared as inputs among multiple landmark-pair computations. The data sharing scheme is detailed in Figure 2. We found it optimal to group the threads into small blocks of 2×2 computation items, saving half of the data loads. Larger groups are theoretically possible, but even 3×3 caused excessive register pressure and impaired performance on contemporary GPUs. The innermost loop of the algorithm iterates over $d$ so that only a single float4 value per landmark is kept in the registers.
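The register-level reuse can be sketched as follows (illustrative: plain dot products stand in for the actual per-pair projection contributions):

```cuda
// Sketch of the 2x2 data-reuse scheme in Registers. One thread processes the
// four pairs (a0,b0), (a0,b1), (a1,b0), (a1,b1); each landmark word loaded
// into registers feeds two computations, halving the global loads compared
// to one pair per thread. words = padded dimension / 4.
__device__ void pairGroup2x2(const float4 *lm4, int words,
                             int a0, int a1, int b0, int b1,
                             float dot[2][2]) {
    dot[0][0] = dot[0][1] = dot[1][0] = dot[1][1] = 0.0f;
    for (int w = 0; w < words; ++w) {   // innermost loop over the dimension
        float4 A[2] = { lm4[a0 * words + w], lm4[a1 * words + w] };
        float4 B[2] = { lm4[b0 * words + w], lm4[b1 * words + w] };
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)  // every loaded word is used twice
                dot[i][j] += A[i].x * B[j].x + A[i].y * B[j].y
                           + A[i].z * B[j].z + A[i].w * B[j].w;
    }
}
```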

EXPERIMENTAL RESULTS
The main objective of the benchmarking was to measure the speedups achieved by the individual optimizations and to determine the optimal algorithms and their parameter settings for the sub-tasks of the EmbedSOM computation.
The timing results, presented in the following sections, were collected as kernel execution times measured by a standard system high-precision clock. Each test was repeated 10× and the mean values are presented in the subsequent figures. The relative standard deviations of the measurements were less than 5%, so we chose not to include them. Complete measurements are available in our GitHub repository.
Results were collected on an NVIDIA Tesla A100 PCIe 80 GB running CUDA 12.2. All benchmarking datasets were synthetic, containing exactly 1Mi points ($n = 2^{20}$, reflecting the common sizing of real-world datasets [1]) with all coordinates sampled randomly from the same uniform distribution. The performance of the benchmarked algorithms is not data-dependent, except for the caching performance in the projection step, where a completely random dataset is the worst-case scenario.
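For illustration, such a dataset can be generated on the host, e.g., as follows (a sketch; the seed and value range are arbitrary choices of ours, as any uniform source works here):

```cuda
// Host-side sketch of the synthetic benchmark data: n x d coordinates
// sampled i.i.d. from the same uniform distribution.
#include <random>
#include <vector>

std::vector<float> makeUniformDataset(std::size_t n, std::size_t d,
                                      unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> uni(0.0f, 1.0f);
    std::vector<float> data(n * d);  // n points, d coordinates each
    for (float &v : data) v = uni(rng);
    return data;
}
```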

Performance of k-NN selection
Here we give an overview of the performance and viable parameter settings observed for the $k$-NN selection algorithms. Notably, all algorithms for $k$-NN are affected by the CUDA thread block sizing, which affects warp scheduling and the data reuse possibilities of the shared-memory cache. We observed that a total thread block size of 256 threads was either optimal or near-optimal for almost all tested configurations, except for GridInsert, which performed the best with 64 threads for lower values of the $d$ and $k$ parameters.
Parameters $w$ and $h$ of the GridInsert algorithm determine the ratio between data transfers and computations, but may also affect the pressure on the shared memory. Empirical evaluation indicates that the algorithm performs the best when each parallel insertion sort is performed in a separate warp, so that code divergence in the SIMT execution is prevented (i.e., $w$ is a multiple of 32). The optimal performance was observed for $w$ equal to 96 or 128; however, the speedup over $w = 32$ is relatively low.
A comparison of the best parametrizations of each algorithm on various configurations common in our target use cases is shown in Figure 3. The Bitonic algorithm significantly outperformed the other algorithms. The speedup of Bitonic over Base was between 3× and 20×, and usually more than 2× over the second-ranking method.
The benchmarking also confirmed a substantial scaling difference between the algorithms based on the divergent insertion sort and the algorithms based on sub-quadratic parallelizable sorting schemes. We conclude that despite the simplicity that might enable GPU speedups in certain situations, the insertion sort is too slow for larger values of $k$ in this case.
As an interesting result, we observed that despite following the general recommendations, the straightforward use of shared memory (in the Shared algorithm) did not improve the overall performance over Base. Quite the contrary, the overhead of explicit caching even caused a slight decrease in the overall performance.
We additionally report the performance measurements for two selected corner cases with extreme values of $d$ and $g$ (figure omitted due to the page limit). Mainly, the total volume of the computation required to prepare the Euclidean distances scales with $d \cdot g$, which becomes dominant when both are maximized. At that point, we observed that GridInsert provides comparable or mildly better performance than Bitonic, especially in cases where $k$ is small and the overhead of the insertion sorting is not as pronounced. Naturally, we should ask whether it would be feasible to combine the benchmarked benefits of the GridInsert and Bitonic algorithms in order to get the best of both approaches (optimal input caching and fast $k$-NN filtering). While an investigation of this possibility could be intriguing, we observed that a fused algorithm would require very complicated management of the shared memory (which both algorithms utilize heavily), and the estimated performance improvement was not sufficient to substantiate this overhead; we thus leave the question open for future research.

Performance of projection step
The projection algorithms described in the previous section have only two execution parameters: the size of the CUDA thread block and the number of data points assigned to a thread block (the threads are divided among the points evenly). We observed that assigning more than one point per thread block is beneficial only in the case of relatively small problem instances (low $d$ and $k$), because it prevents underutilization of the cores.
The optimal size of the CUDA thread blocks depends mainly on the parameters $d$ and $k$. In the case of the Shared algorithm, the optimal values ranged from 32 (for $d = 8$, $k = 4$) to 64 ($d = k = 64$). With the caching optimizations in Aligned and Registers, the optimal thread block size was slightly higher, reaching 128 for the most complex problem instances. We assume this is a direct consequence of the improved memory access efficiency, which gives space for the parallel execution of additional arithmetic operations.
Figure 4 shows the performance of the best algorithm configurations for the representative parametrizations. All three algorithms perform almost equally for small $d$, giving around a 3× speedup over Base. The importance of the optimizations in Aligned and Registers grows steadily as the parameter $d$ increases, up to around a 10× speedup at $d = 64$. In conclusion, the optimal algorithm for the EmbedSOM projection is determined by the dimensionality of the $X$ dataset: Registers performs better at higher dimensions ($d \ge 32$), while Aligned was slightly better at lower dimensions.

Complete algorithm
The complete GPU implementation of the EmbedSOM algorithm is the combination of the best implementations of the $k$-NN and projection steps. The selected algorithms, Bitonic and Registers, are simply executed sequentially on large blocks of $X$, sharing only a single data exchange buffer for transferring the $k$-NN data. Notably, since the data exchange between the algorithm parts is minimal, comprising only the distances and neighbor indexes from the $k$-NN selection, no specific optimizations of the interface are required. Our implementation provides a speedup between 200× and 1000× over a serial CPU implementation, and between 3× and 10× over the straightforward GPU implementation that we used as a baseline (figure omitted due to the page limit). Finally, we highlight the relative computation complexity of both steps (Figure 5), which changes dynamically with $k$ and might serve as a guide for further optimization. We observed that for common parametrizations ($k \simeq 20$, $g \simeq 500$), most of the computation time is spent in the $k$-NN step, and the projection performance becomes problematic only in cases of almost impractically high $k$. The results align with the asymptotic time complexities of the algorithms, roughly following $O(n \cdot g \cdot (d + \log^2 k))$ for the $k$-NN and $O(n \cdot d \cdot k^2)$ for the projection.
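For completeness, the top-level composition can be sketched as follows (host-side; all names are ours, not the actual repository API):

```cuda
// Sketch of the top-level host routine: the two selected kernels run back to
// back and communicate only through one device buffer of k-NN records.
#include <cuda_runtime.h>

struct knn_entry { float distance; int index; };

// hypothetical wrappers around the Bitonic k-NN and Registers projection
// kernels sketched in the previous sections
void launchBitonicKnn(const float *dX, const float *dM, knn_entry *dKnn,
                      int n, int g, int d, int k);
void launchRegistersProjection(const float *dX, const float *dM,
                               const float *dW, const knn_entry *dKnn,
                               float *dY, int n, int g, int d, int k);

void embedsomGpu(const float *dX, const float *dM, const float *dW,
                 float *dY, int n, int g, int d, int k) {
    knn_entry *dKnn;  // k records per data point
    cudaMalloc(&dKnn, (size_t)n * k * sizeof(knn_entry));
    launchBitonicKnn(dX, dM, dKnn, n, g, d, k);                   // k-NN step
    launchRegistersProjection(dX, dM, dW, dKnn, dY, n, g, d, k);  // projection
    cudaFree(dKnn);
}
```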

RELATED WORK
The essential component of our success is the GPU acceleration of the projection computation, which needs to be fast enough to recalculate the embedding in real-time. In the following, we address the most relevant works that influenced or inspired our solution.
As one of the most prominent visualization methods, t-SNE was studied to explore the possibilities of a fast GPU-enabled implementation. One of the initial implementations was the t-SNE-CUDA library [3]. The most complicated step (computing the attractive forces of the N-body simulation) is handled as a multiplication of a sparse matrix and a vector by the cuSPARSE library. This work was slightly improved a year later [4], when the authors replaced the cuSPARSE library with their own implementation of the multiplication, which takes advantage of atomic operations to perform the reduction of the scalar sums.
Perhaps the most popular contemporary method for data visualization is the Uniform Manifold Approximation and Projection algorithm (UMAP), which often produces better results than t-SNE at the cost of higher computational demands. There are two GPU implementations worth mentioning, both of which were made part of the RAPIDS cuML library [11, 14]. They use a similar approach, combining an approximate $k$-NN search with gradient-descent-based optimization. The first implementation [14] relies more on existing solutions and libraries, while the second one [11] is slightly more low-level, implementing the embedding using custom kernels.
Even though the presented methods (especially for t-SNE) achieved speedups exceeding two orders of magnitude, they are still quite far from real-time processing when the number of points reaches the order of millions. The proposed EmbedSOM projection is based on SOMs and a linear projection derived from a $k$-NN search [6, 7], which is technically closest to the work of Yeh et al. [17]. For the SOM part, we adapted a state-of-the-art implementation of the k-means algorithm [9], since SOM shares many of its steps. The crucial part of the projection is the $k$-NN search, which also appears in the aforementioned papers; however, we found that the solution based on bitonic sorting [8] performs the best in our case.

CONCLUSION
We have presented a GPU implementation of the semi-supervised dimensionality reduction algorithm EmbedSOM, in which we independently optimized two kernels that may also be used separately: a generic $k$-NN search and a 2D projection. The $k$-NN search was solved by an adapted bitonic sorting, which eliminates thread divergence. The projection kernel was optimized to fetch and use data most efficiently by utilizing vector loads and data reuse at the register level. Thorough benchmarking indicated that both kernels achieve a significant speedup over the baseline GPU implementation.
The proposed implementation should enable subsequent research in interactive dimensionality reduction tools, where the user changes the SOM parameters or landmarks and the projections are re-computed and visualized in real-time. The results show that the optimized EmbedSOM version can project more than 1 million individual data points in each frame while maintaining a frame rate above 30 FPS.

Figure 1: Bitonic algorithm for $k$-NN selection ($k = 4$). Each horizontal line represents a data item in the shared memory. Red lines represent comparators ensuring that the intermediate $k$ 'best' neighbors end up in the top buffer.

Figure 2: Detail of the caching of landmark data in the Registers projection kernel. Multiple landmark pairs (small boxes) are processed by each thread (large boxes). Caching the landmark data in registers allows the reuse of loaded data (colored lines), thus reducing the number of memory accesses.

Figure 3: Amortized performance of the $k$-NN step for a single input point, using parameters usual in flow cytometry.

Figure 4: Amortized performance of a single projection operation in the algorithms that compute the projection step (showing the most important problem parametrizations).

Figure 5: The relative time spent in the $k$-NN computation usually dominates the execution of GPU EmbedSOM (composed of the Bitonic+Registers algorithms). The projection computation time becomes dominant only for relatively impractical parametrizations with low $g$ and high $k$.