An Efficient and Scalable Approach to Build Co-occurrence Matrix for DNN's Embedding Layer

Embedding is a crucial step for deep neural networks. Datasets from different applications and with different structures can all be processed through an embedding layer and transformed into a dense matrix. The transformation must minimize both the loss of information and the redundancy of data. Extracting appropriate data features ensures the efficiency of the transformation. The co-occurrence matrix is an excellent way of representing the links between elements in a dataset. However, for large datasets, building the co-occurrence matrix becomes a problem in terms of computation power and memory footprint. In this paper, we propose a parallel and distributed approach to efficiently construct the co-occurrence matrix in a scalable way. Our solution takes advantage of different features of Boolean datasets to minimize the construction time of the co-occurrence matrix. Our experimental results show that our solution outperforms traditional approaches by up to 34x. We also demonstrate the efficacy of our approach with a cost model.


INTRODUCTION

Embedding in Deep Learning
Thanks to the development of Deep Neural Networks (DNN), embedding has become omnipresent in modern life. Embedding plays a key role in generalizing models to tasks with different data structures. Models such as Word2vec [16] have made possible the capture of semantic and syntactic information about words, enabling a more subtle representation of words. Popular Transformer models [29] such as BERT [11] or GPT [26] have been adapted for other domains such as image recognition [6], image generation [4] and graphs [30]. While embeddings were initially used in the Natural Language Processing (NLP) domain to represent words [10,21], methods have since been introduced to produce vector representations of other data structures such as graphs [8,20] or categorical data [9].
A good embedding ensures that DNN training is based on data with quality and variability, and it has an impact on the overall training. The setting of the embedding method is a trade-off between the loss of information and the redundancy of data. Data embedding generally consists of mapping data into a finite, reduced-dimensional space. However, reducing complex data structures into a low-dimensional Euclidean space cannot preserve all the information that was previously available. A correct embedding needs to retain enough information to maintain an accurate representation so that the model can make its predictions correctly from the embedded data.
Ensuring good data representation requires control of the embedding, but one of the main limitations of the embedding layer is the lack of control over its training. The embedding layer is early in the model's structure: the back-propagation used to tune the model's parameters will bring relatively little information up to the embedding layer. The embedding layer needs a lot of time to be trained and to produce relevant results. This is paradoxical because an adapted embedding layer facilitates the training of the rest of the network.

[Figure 1: example dataset of three sentences ("I like you.", "You like dogs.", "I don't like dogs.") with the feature indices 1 = I, 2 = you, 3 = don't, 4 = like, 5 = dogs.]
The only instruments we have to control embedding are hyper-parameters and parameter initialization. Pre-trained embeddings or representations [17,22] can be used when available. However, this is only possible if the data is already known and properly explored. For other types of data, it's necessary to learn embeddings from scratch. In this situation, the embedding layer is initialized with random weights, and the embeddings are learned jointly with the rest of the model parameters during training.
A possible in-between is to extract information from the input data to initialize the embedding layer. In the same way as Word2vec, which offers word embedding from a given dataset, one can use methods to extract key features from the dataset and use the results to initialize the model's embedding layer. A well-initialized embedding layer provides the model with a consistent input data representation, reducing model training time. The cost of initialization methods must be relatively low to justify their use. We need approaches with limited computational complexity, since analyzing datasets can quickly require considerable computation power.

Co-occurrence matrix
The co-occurrence matrix [14] is a matrix that depicts the frequency of co-occurrence of pairs of items in a dataset. This matrix provides information about the relationships and patterns between items in a dataset. Each row and column with the same index represents a unique item, and the cells of the matrix store the count of how often two items co-occur in the dataset. Initially used for visualizing co-citations [15], it has become very popular in information science [12] for tasks like finding associations, identifying patterns, calculating similarity measures, and building recommendation systems.
In NLP, a co-occurrence matrix can be used as the basis for numerical analysis of how words or word pairs appear together within a given corpus. For example, the co-occurrence matrix plays a crucial role in GloVe [19], a neural network-based algorithm used to generate word embeddings, by providing the statistical information necessary to learn the word embeddings through the neural network training process. The co-occurrence matrix also has a major role in different topic models like LDA [1,25] or PLSA [13]. An example of co-occurrence matrix construction is shown in figure 1 with a small sentence corpus. This dataset is composed of 3 sentences and uses a total of 5 different words. We'll use the term instances to designate the sentences and the term features to designate the words that compose the corpus. The co-occurrence matrix is thus a good tool to prepare the embedding layer. This paper focuses on how to build a co-occurrence matrix efficiently.

Computation complexity limitation
The co-occurrence matrix can be obtained by a matrix multiplication, whose complexity is $O(n^3)$, where $n$ is the size of the matrix. However, the use of this symmetrical dot product can quickly become limited because of the exponential growth in the volume of textual data and the demands of real-time applications. Reducing the complexity of co-occurrence matrix construction would improve the efficiency of the algorithms and methods that are based on it, and would also improve the attractiveness of this matrix.
In DNN applications, including NLP and recommendation systems, we observed that the datasets have a very low density and that the arithmetic is Boolean. Taking these domain-specific features into account helps to reduce the computation complexity while maintaining good scalability.
In this paper, we propose to improve the efficiency of the construction of the co-occurrence matrix for a dataset with Boolean features. The main contributions of this paper are:
• A new approach that reduces the computation time to construct the co-occurrence matrix associated with a binary dataset.
• Cost analysis to compare the computation and memory complexity of different approaches.
• A comprehensive verification of computational complexity.
• Validation of the approach with real-world datasets.
By taking advantage of both the sparsity of this class of dataset and the arithmetic particular to this data, our method enables the co-occurrence matrix to be built efficiently. Designed for use with large datasets, the computations are well adapted to a massively parallel or distributed environment. With our innovative method for faster and more efficient construction of co-occurrence matrices, this study aims to overcome these fundamental limitations, paving the way for smoother data manipulation and deeper comprehension of large-scale textual data.

BASIC NOTATIONS
Let's recall in this section how a co-occurrence matrix is basically built, in order to prepare a smooth understanding of our design of the Sparse-Pairwise co-occurrence matrix construction presented in section 3. We first provide the notations with the basic symmetrical dot product approach (a.k.a. matrix product) in both sequential and distributed environments, in parts 2.1 and 2.2 respectively. We then extend it to sparse matrices in part 2.3 with a discussion on storage format and space complexity.

Symmetrical dot product from incidence matrix
Let's first define an incidence matrix before going into the entire symmetrical dot product approach. An incidence matrix, noted $A$, is a representation used to show the connections between two sets of data. In our example in figure 1, the incidence matrix is used to show the connections between instances and features in our dataset. Each row of the matrix represents an instance, and each column a feature. We can quickly see from this matrix which data are linked to each other. The co-occurrence matrix $C$ is constructed from this incidence matrix $A$. Based on the associations between instances and features, it can be used to determine how often each feature is associated with another feature.
More generally, the construction of the co-occurrence matrix between the $n$ features of a dataset composed of $k$ instances is a level-3 BLAS matrix multiplication. With $A^T \times A$ we can build the co-occurrence matrix $C$, which gives the exact co-occurrence frequencies of the features. The result of this operation is a symmetrical matrix. This operation corresponds to a multiplication between an $n \times k$ and a $k \times n$ matrix, which corresponds to $k \times n^2$ multiplications and $(k - 1) \times n^2$ additions. The complexity of this operation as a function of $k$ and $n$ is $O(k \times n^2)$.
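As an illustration, the dense symmetrical dot product can be sketched in a few lines of Python on the three-sentence corpus of figure 1. This is a minimal sketch rather than the implementation evaluated in the paper: the name `symmetrical_dot_product`, the 0-based indexing and the column order (I, you, don't, like, dogs) are assumptions made for the example.

```python
# Incidence matrix of the figure 1 corpus (rows: instances, columns: features).
# Assumed column order: I, you, don't, like, dogs.
A = [
    [1, 1, 0, 1, 0],  # "I like you."
    [0, 1, 0, 1, 1],  # "You like dogs."
    [1, 0, 1, 1, 1],  # "I don't like dogs."
]


def symmetrical_dot_product(A):
    """Dense construction of the co-occurrence matrix as A^T x A.

    The three nested loops perform k * n^2 multiplications, whatever the
    number of zeros in A.
    """
    k, n = len(A), len(A[0])
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for v in range(k):
                C[i][j] += A[v][i] * A[v][j]
    return C


C = symmetrical_dot_product(A)
for row in C:
    print(row)
```

The diagonal of the result counts the instances that contain each feature, and the matrix is symmetrical, as stated above.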
The proportion of non-zero elements in the matrix over the total number of elements in the matrix is called the density of the matrix. The complement of the density, that is, the proportion of zero elements, is called the sparsity of the matrix. When the density of non-zero elements in a matrix is sufficiently low, storing only the positions and values of non-zero elements can save both memory and computing power. Low-density matrices are called sparse matrices [7].
Sparse matrices can be used to build the co-occurrence matrix. When the proportion of non-zero values is very low in the $k$ vectors of the dataset, it's possible to consider the incidence matrix $A$ as a sparse matrix to speed up calculations. Exploiting matrix sparsity considerably reduces the computational costs associated with matrix multiplication. However, performing a multiplication between two sparse matrices (SpGEMM) is a complex and costly operation. The costs associated with reformatting data and preparing this operation limit its use. Therefore, we will not discuss the SpGEMM approach in this paper, and we will consider the sparse approach to be the one where only one of the two matrices is stored in a sparse storage format. Multiplying a sparse matrix with a dense matrix is a very popular and well-referenced operation. The great advantage of this approach is that the computational complexity depends on the density of the sparse matrix. The use of this approach is therefore most beneficial when the density is close to 0.
In the rest of this paper, we'll refer to the Dense symmetrical dot product approach when both multiplication matrices are stored in memory in dense storage format. The approach where one of the two matrices is stored in memory and manipulated in a sparse storage format will be called the Sparse symmetrical dot product. We'll compare both the dense and sparse symmetrical dot product (SDP) approaches in section 5.

Distributed Dot Product
Multiplying two matrices in a distributed environment is well studied. A comparison of different data distributions in terms of computational power, memory and communication costs can be found in [23].
By distributing the left matrix in $\sqrt{p}$ row blocks and the right matrix in $\sqrt{p}$ column blocks, where $p$ is the number of processors, we maximize the load balancing while minimizing the memory space required on each node and limiting communications. This distribution of data and calculations ensures good performance efficiency. The computational complexity of such a Dense symmetrical dot product approach is $O\left(\frac{k \times n^2}{p}\right)$ and we need two blocks of size $k \times \frac{n}{\sqrt{p}}$ on each processor. Each processor calculates partial values of the result matrix block. A communication phase is required to obtain the final values of the result matrix elements. Many-to-many communications are needed to process the reduction of these partial results.

Sparse storage format
To efficiently store sparse matrices, there are several available formats to choose from. One of the most commonly used formats is ELLPACK [27]. ELLPACK is essentially a compression of non-zero values per row, achieved through the use of two matrices. The first matrix stores the column index of non-zero elements, while the second matrix stores the values of these same elements. When working with Boolean element matrix compression, only the matrix storing the element indices is necessary, as non-zero element values are always equal to 1.
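The index-only compression of a Boolean matrix can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `to_ellpack_indices`, the 0-based column indices and the padding value `-1` are assumptions made for the example, and the input is the incidence matrix of the three sentences of figure 1.

```python
def to_ellpack_indices(A, pad=-1):
    """Return the ELLPACK-style index matrix of a Boolean matrix.

    Each row lists the column indices of its non-zero elements. Rows are
    padded with `pad` up to the largest number of non-zeros found in a row.
    No value matrix is needed because every non-zero value is 1.
    """
    rows = [[j for j, x in enumerate(row) if x] for row in A]
    width = max((len(r) for r in rows), default=0)
    return [r + [pad] * (width - len(r)) for r in rows]


# Incidence matrix of the figure 1 corpus (columns: I, you, don't, like, dogs).
A = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
]

index_matrix = to_ellpack_indices(A)
for row in index_matrix:
    print(row)
```

Here the first two rows receive one padding entry because the third instance holds four non-zero values; this padding is the reason ELLPACK suits matrices whose rows have similar numbers of non-zeros.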
ELLPACK is ideal for cases where the number of non-zero values is distributed relatively evenly between the rows of the sparse matrix. However, when this is not the case, it's preferable to use other sparse matrix storage formats, such as CSR or COO, which are also quite popular.
To facilitate a better understanding of approaches that deal with sparse matrices, we will be using ELLPACK as the sparse storage format in our examples. This format is easy to visualize and comprehend while also effectively demonstrating the benefits of compressing matrix data. It should be noted that depending on the characteristics and requirements of the dataset, other sparse matrix storage formats can be employed in place of ELLPACK. The choice of format is completely free and flexible.
Storing low-density matrices in a sparse storage format saves a lot of memory space. If the sparse matrix fits in the memory of each node, then it's very interesting to consider duplicating it on each node. In fact, one of the data distribution options allows each node to obtain a block of the result matrix without any additional communication. The result matrix is then distributed over the different nodes. Duplicating the sparse matrix on each node and splitting the other, dense matrix into $p$ blocks avoids the communication phase involved in the reduction of partial results with the $\sqrt{p}$ block approach described in section 2.2. This data distribution is more memory-intensive on each node but eliminates any need for communication to obtain the final results.

PROPOSED APPROACH
We have seen the symmetrical dot product and the basic notations. However, how can we take advantage of DNN's domain-specific features to reduce the cost of constructing the co-occurrence matrix? To answer this question, we'll first, in part 3.1, propose another way of visualizing the construction of the co-occurrence matrix, using the pairwise approach and taking into account the Boolean nature of the data. In subsection 3.2, we then present an upgraded approach, named Sparse-Pairwise, which is a mixture of the symmetrical dot product approach and the pairwise approach. This allows us to take advantage of both data sparsity and data-specific arithmetic. We will then discuss in section 3.3 implementations of these approaches in a massively distributed environment. This will help us compare the complexity of the different approaches in section 4.

Pairwise approach
The pairwise approach is based on the following idea: the element $C_{i,j}$ of the co-occurrence matrix represents the number of times that features $i$ and $j$ have been simultaneously active for an instance. In other words, in a dataset composed of $k$ instances, the co-occurrence matrix allows us to visualize the number of times the features were simultaneously present in an instance. When $i = j$, the co-occurrence matrix tells us in how many instances the feature is present. Therefore, it is possible to construct the co-occurrence matrix by forming the set of feature pairs $(i, j)$ among all dataset instances. In concrete terms, it consists in finding all combinations of pairs of non-zero values within each vector of the dataset.

Algorithm 1 Pairwise approach
1: …
2: for each non-zero element $i$ in the dataset do
3:   for each non-zero element $j$ in the same vector as $i$ do
4:     $C_{i,j} \leftarrow C_{i,j} + 1$
5:   end for
6: end for
7: return $C$

Let's take as an example the dataset proposed in figure 1a. The first instance (e.g., "I like you") is composed of the features $f_1$, $f_2$ and $f_4$. We add 1 to the three elements on the diagonal of the co-occurrence matrix, $C_{f_1,f_1}$, $C_{f_2,f_2}$ and $C_{f_4,f_4}$, then add 1 for each possible pair with $i \neq j$. There are 6 such pairs: $(f_1, f_2)$, $(f_1, f_4)$, $(f_2, f_4)$, $(f_2, f_1)$, $(f_4, f_1)$, $(f_4, f_2)$. We therefore add 1 to all the elements of the co-occurrence matrix with these indices. Doing the same with the other sentences in the dataset gives the co-occurrence matrix $C$.
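The pairwise idea described above can be sketched in Python as follows. This is a minimal sequential sketch under the assumption that the dataset is a dense list of Boolean vectors; the function name `pairwise` and the 0-based feature indices are choices made for the example, not the paper's code.

```python
def pairwise(dataset):
    """Build the co-occurrence matrix by enumerating pairs of non-zero values.

    No multiplication is needed: each pair (i, j) of active features found
    in the same vector adds 1 to C[i][j].
    """
    n = len(dataset[0])
    C = [[0] * n for _ in range(n)]
    for vector in dataset:
        for i in range(n):
            if vector[i]:                # test on the first loop
                for j in range(n):       # second loop traverses the whole vector
                    if vector[j]:
                        C[i][j] += 1
    return C


# Figure 1 corpus (columns: I, you, don't, like, dogs).
dataset = [
    [1, 1, 0, 1, 0],  # "I like you."      -> features 0, 1, 3
    [0, 1, 0, 1, 1],  # "You like dogs."
    [1, 0, 1, 1, 1],  # "I don't like dogs."
]

C = pairwise(dataset)
for row in C:
    print(row)
```

The first sentence alone contributes its three diagonal elements and its six off-diagonal pairs, matching the worked example above.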
Note here that it is possible to limit the search for pairs to those with $i \leq j$. This makes it possible to construct only the upper triangle of the co-occurrence matrix. If we name the resulting triangular matrix $C_u$, we obtain $C = C_u + C_u^T - \mathrm{diag}(C_u)$. Algorithm 1 represents the pairwise method. Although in theory this is a very interesting approach, since it takes advantage of the fact that the dataset is Boolean, it generally gives less interesting performance. Finding all possible pairs of elements in a vector means finding the non-zero elements in the vector, then, for each of these values, finding the other non-zero elements in the same vector. In algorithm 1, the for loops on lines 2 and 3 are actually two nested loops whose execution depends on the result of a condition: for each element in the vector, the element value is tested. If the value is non-zero, execution continues in the next loop; otherwise, the next element is tested. The problem is that if statements tend to break the instruction pipeline of CPUs on modern architectures [18]. We'll look at this in more detail in section 5.
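The upper-triangle variant and the reconstruction of the full matrix can be sketched as follows. The helper names `pairwise_upper` and `symmetrize` are hypothetical, and the dense list-of-lists layout is an assumption made to keep the example short.

```python
def pairwise_upper(dataset):
    """Pairwise search restricted to pairs (i, j) with i <= j."""
    n = len(dataset[0])
    Cu = [[0] * n for _ in range(n)]
    for vector in dataset:
        for i in range(n):
            if vector[i]:
                for j in range(i, n):    # only the upper triangle is visited
                    if vector[j]:
                        Cu[i][j] += 1
    return Cu


def symmetrize(Cu):
    """Rebuild the full matrix as Cu + Cu^T - diag(Cu)."""
    n = len(Cu)
    return [
        [Cu[i][j] + Cu[j][i] - (Cu[i][i] if i == j else 0) for j in range(n)]
        for i in range(n)
    ]


# Figure 1 corpus (columns: I, you, don't, like, dogs).
dataset = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
]

Cu = pairwise_upper(dataset)
C = symmetrize(Cu)
```

Subtracting the diagonal once is needed because the sum of the triangular matrix and its transpose counts each diagonal element twice.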
The sparsity of the dataset has an impact on the performance of this method: it defines the number of times we enter the first for loop (line 2). The second for loop runs through all elements, regardless of sparsity. In the next section, we'll take a look at an approach derived from the pairwise approach that takes greater advantage of data sparsity.

Algorithm 2 Sparse-Pairwise approach
1-2: …
3: for each vector $v$ in the dataset do
4:   for each non-zero element $i$ in $v$ do
5:     for each element $j$ in $M_{v,:}$ do
6:       $C_{i,j} \leftarrow C_{i,j} + 1$
7:     end for
8:   end for
9: end for
10: return $C$

Sparse-Pairwise approach
We have seen in the previous sub-section that the pairwise approach takes advantage of the fact that the dataset is composed only of Boolean elements, and that the symmetrical dot product approach takes advantage of high sparsity to speed up computations thanks to sparse linear algebra. In this part, we propose an approach that combines the pairwise approach with the dot product approach to speed up the construction of the co-occurrence matrix with both sparse linear algebra and Boolean arithmetic. Figure 2 illustrates the main points of this approach to building the co-occurrence matrix.
The limitation of the pairwise approach is that each time a non-zero element is found, the set of other non-zero elements in the feature vector must be found. Instead of traversing the entire vector when a non-zero value is found in the dataset, the Sparse-Pairwise approach performs an initial scan of the dataset to prepare the index list of non-zero values. By doing this, each time a non-zero value is found, we can immediately refer to the index list to find the pairs in which this non-zero element appears. This quickly completes the list of pairs, without having to go through the rest of the vector.
Taking as an example the dataset in figure 1, compressing the incidence matrix in ELLPACK format gives the following index matrix:

$$M = \begin{pmatrix} 1 & 2 & 4 & \ast \\ 2 & 4 & 5 & \ast \\ 1 & 3 & 4 & 5 \end{pmatrix}$$

where $\ast$ marks a padding entry. Then, for each vector $v$ of dataset instances, we increment all the elements of the co-occurrence matrix whose coordinates are the index pairs stored in row $v$ of the above matrix.
With this approach, we take advantage both of the pairwise search made possible by the fact that dataset elements are binary values, and of the sparse storage format made possible by the data's sparsity. Algorithm 2 represents the sequential version of this approach, and we'll discuss its deployment in a massively distributed environment in the following part.
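A sequential sketch of the Sparse-Pairwise idea is given below. It is a hedged illustration rather than the authors' implementation: the function name `sparse_pairwise` is hypothetical, and ragged Python lists stand in for the padded ELLPACK index matrix so that no padding test is needed.

```python
def sparse_pairwise(dataset):
    """Sparse-Pairwise construction of the co-occurrence matrix.

    An initial scan builds, for every instance, the list of indices of its
    non-zero values. Each non-zero element found afterwards is paired
    directly with that list instead of re-traversing the whole vector.
    """
    n = len(dataset[0])
    # Initial scan: index list of non-zero values for each instance.
    index_lists = [[j for j, x in enumerate(v) if x] for v in dataset]
    C = [[0] * n for _ in range(n)]
    for vector, indices in zip(dataset, index_lists):
        for i in range(n):
            if vector[i]:
                for j in indices:        # only the non-zero positions
                    C[i][j] += 1
    return C


# Figure 1 corpus (columns: I, you, don't, like, dogs).
dataset = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
]

C = sparse_pairwise(dataset)
```

Compared with the plain pairwise search, the inner loop length now depends on the number of non-zero values of the instance rather than on the number of features.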

Deploying in a Massively Distributed Environment
Datasets are generally very large, and to be able to build the co-occurrence matrix on very large datasets, it is essential to have an algorithm adapted to a distributed computing environment. In this section, we compare the two previous implementations and see what possible optimizations we can take advantage of with distribution. Let's note $p$ the number of processors on which calculations will be distributed. For communication purposes, we assume that these $p$ nodes are linked by a network and have distributed memory.
Given the general size and density of large DNN datasets, we assume that every node has enough memory to replicate the sparse matrix, as elaborated in section 2.3. Duplicating the data to avoid communication seems to be the most advantageous approach for data distribution when dealing with sparse matrices. In the case where the sparse matrix is too large to be stored as such on each node, dividing the sparse matrix into $\sqrt{p}$ blocks is also a plausible method for data distribution. This leads us back to the data distribution described in section 2.2.

Pairwise approach.
The naive way to distribute the pairwise search is to distribute the for loops between the nodes. This approach is not efficient because building the final co-occurrence matrix creates a lot of communication for the reduction. A more interesting approach is to construct the co-occurrence matrix by blocks of rows. This approach allows us to play with the intervals covered by the for loops. Let $B$ be the number of blocks into which the matrix $C$ is divided. The block $b$ of the matrix represents the rows $[\frac{n}{B} \times b, \frac{n}{B} \times (b + 1)[$. Since the matrix is symmetrical, adding data to the element $C_{i,j}$ also adds data to the element $C_{j,i}$. So we can limit the range of the loops by checking that either $i \in [\frac{n}{B} \times b, \frac{n}{B} \times (b + 1)[$ or $j \in [\frac{n}{B} \times b, \frac{n}{B} \times (b + 1)[$. In addition, since $i \leq j$, we can also limit the interval of the first for loop to $[0, \frac{n}{B} \times (b + 1)]$. When the element visited by the first loop is non-zero, we check whether the element is in the block interval. If yes, the second loop must traverse the rest of the vector. If not, the interval of the second loop is limited to $[\frac{n}{B} \times b, \frac{n}{B} \times (b + 1)[$.
Implementing the concept of the Sparse-Pairwise approach in a distributed environment is a challenging task. Indeed, distributing the different index lists of non-zero values of each instance effectively distributes the computational load, but each node then builds a partial result of the entire co-occurrence matrix. Allreduce communications must be made with a length of $n^2$ values. This scenario is unthinkable with very large datasets, given the size of the communications and the associated cost.
To be able to eliminate communications, each node must build a block of the final result of the co-occurrence matrix independently of the other nodes. This results in the co-occurrence matrix being distributed across the different nodes, with blocks of similar size.

Sparse-Pairwise adapted from the sparse symmetrical dot product.
The first approach is to use the same data and computation distribution as the sparse symmetrical dot product approach. The matrix $M$ representing the list of indices is duplicated on each calculation node. Each node then calculates a block of $\frac{n}{p}$ rows of the co-occurrence matrix, by scanning each vector in the dataset for non-zero elements. When a non-zero value is found, we update the matrix as explained in section 3.2 with the indices of the matrix $M$. This approach requires no additional computation. This is the approach we'll be deploying when memory constraints are not the priority. We will refer to this approach as the standard Sparse-Pairwise approach in the remainder of this paper.
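The communication-free distribution described above can be simulated in a single process as follows. This is a sketch under stated assumptions: the "nodes" are plain loop iterations rather than MPI ranks, the helper names `row_block` and `node_block` are hypothetical, and the even split of features between nodes is one possible load-balancing choice.

```python
def row_block(rank, p, n):
    """Half-open interval of features (rows of C) owned by node `rank`."""
    return rank * n // p, (rank + 1) * n // p


def node_block(dataset, index_lists, rank, p):
    """Rows of the co-occurrence matrix computed by one node.

    The node scans only its own dense slice of every vector and pairs each
    non-zero value with the index lists, which are replicated on all nodes.
    """
    n = len(dataset[0])
    lo, hi = row_block(rank, p, n)
    block = [[0] * n for _ in range(hi - lo)]
    for vector, indices in zip(dataset, index_lists):
        for i in range(lo, hi):
            if vector[i]:
                for j in indices:
                    block[i - lo][j] += 1
    return block


# Figure 1 corpus (columns: I, you, don't, like, dogs).
dataset = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
]
index_lists = [[j for j, x in enumerate(v) if x] for v in dataset]

p = 2  # number of simulated nodes
blocks = [node_block(dataset, index_lists, rank, p) for rank in range(p)]
# Stacking the blocks gives the full matrix without any reduction step.
C = [row for block in blocks for row in block]
```

Each block holds final values, so no partial results have to be reduced between nodes; the price is that the index lists are stored on every node.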

Sparse-Pairwise approach to save memory.
Storing dense blocks of vectors for scanning may require too much memory when the dataset is large. This is why we propose an approach that uses the Sparse-Pairwise principle to limit the memory space required. The aim is to transform all input data into sparse formats. This reduces the amount of memory required to store the input data, at the cost of transforming the data.
The principle of this approach is similar to the first, except that instead of dispatching the vectors to the different nodes, we first calculate the columns' compressed matrix, then dispatch this compressed matrix and use the indices in this matrix. It's impossible here to use the already calculated rows' compressed matrix for this purpose, as it gives no information on the position of the indices to be taken into account when creating a block of the co-occurrence matrix. With the rows' compressed matrix alone, searching for the values included in the processing interval would require going through the entire compressed matrix, which reduces its interest in a distributed environment. Scattering the columns' compressed matrix ensures that each node immediately has the set of non-zero values it needs in order to update its result block in the matrix.
However, using only compressed matrices requires more computational power than the standard Sparse-Pairwise approach. To build the columns' compressed matrix, we need to go through the blocks of vectors in the dataset and then build the matrix. This costs more than simply traversing the dense blocks, as the standard Sparse-Pairwise approach does.
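The memory-saving variant can be sketched as follows. This is only one possible reading of the description above: it assumes that each node keeps the row-compressed index lists and receives a slice of the column-compressed matrix, and the helper names `compress_rows`, `compress_columns` and `node_block_compressed` are hypothetical.

```python
def compress_rows(dataset):
    """For each instance, the indices of its active features."""
    return [[j for j, x in enumerate(v) if x] for v in dataset]


def compress_columns(dataset):
    """For each feature, the indices of the instances where it is active."""
    columns = [[] for _ in range(len(dataset[0]))]
    for v_id, vector in enumerate(dataset):
        for j, x in enumerate(vector):
            if x:
                columns[j].append(v_id)
    return columns


def node_block_compressed(row_index, column_slice, n):
    """Rows of C for the features whose column lists were sent to this node."""
    block = [[0] * n for _ in column_slice]
    for local_i, instances in enumerate(column_slice):
        for v_id in instances:           # instances where the feature is active
            for j in row_index[v_id]:    # features active in that instance
                block[local_i][j] += 1
    return block


# Figure 1 corpus (columns: I, you, don't, like, dogs).
dataset = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
]
rows = compress_rows(dataset)
cols = compress_columns(dataset)

# Simulated scatter over two nodes: features 0-1 and features 2-4.
slices = [cols[0:2], cols[2:5]]
C = [row for s in slices for row in node_block_compressed(rows, s, 5)]
```

Once both compressed matrices are built, the dense vectors are no longer needed, which is where the memory saving comes from; building the column lists is the extra computation mentioned above.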
This approach is very interesting for processing very large datasets on machines with limited RAM. The saving in terms of memory space depends on the sparsity of the data. All the input data used to build the co-occurrence matrix is compressed. This approach will only be used when RAM cannot store all the information required to build the co-occurrence matrix with the standard Sparse-Pairwise approach. In the next section, we examine the theoretical comparison of the different approaches with a cost analysis.

A PRIORI COMPLEXITY ANALYSIS
Table 1 compares the different approaches regarding computational complexity, memory and communication. Complexities are given as a function of $k$ and $n$, the dimensions of the dataset, the number of processors $p$ and the density of the dataset, noted $d$. The density $d$ is between 0 and 1 and represents the ratio between the number of non-zero values and the total number of elements in the matrix. To obtain the theoretical approximation of complexity, we have used the BSP approach [3]. For the dense and sparse symmetrical dot product approaches, we used the data partitioning described in section 2.2. We also used the different Pairwise and Sparse-Pairwise approaches described in section 3.3.
When sparsity starts to become significant, the most interesting approach from a memory perspective is the memory-saving Sparse-Pairwise approach. This is the only approach where the total memory space required for input values is directly related to the matrix density. This means that if the density is very low, the storage space required to store the data will be low. However, from a computational point of view, compressing data by both rows and columns is more demanding than the standard Sparse-Pairwise approach. If there's a need to save even more memory, it's possible to compress the incidence matrix in SGP format [24], a compression pattern that makes it possible to quickly toggle between row and column compression in exchange for some additional computation.
Regarding the computational complexity required to build the co-occurrence matrix, the two Sparse-Pairwise approaches are equivalent in complexity. The dense symmetrical dot product approach is the only one where the complexity does not depend on the density $d$. Sparse-Pairwise approaches have smaller complexities than the sparse symmetrical dot product and Pairwise approaches. This is due to the fact that $d \in [0, 1]$ and therefore $d^2 \leq d$. As long as $d < 1$, our proposed Sparse-Pairwise approach is the most interesting in terms of computational complexity. In the next section, which presents our experiments, we verify these complexities in practice.
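As a rough check of this ordering, the sequential operation counts can be written out for a dataset of $k$ instances, $n$ features and density $d$, under the simplifying assumption that every instance holds about $d \times n$ non-zero values. These counts are a back-of-the-envelope sketch derived from the descriptions of the approaches, not a copy of table 1, and they ignore the distribution over the processors.

```latex
\begin{align*}
\text{Dense SDP:} \quad & k \times n \times n = k \times n^2 \\
\text{Sparse SDP, Pairwise:} \quad & k \times (d\,n) \times n = d \times k \times n^2 \\
\text{Sparse-Pairwise:} \quad & k \times (d\,n) \times (d\,n) = d^2 \times k \times n^2
\end{align*}
```

With $d = 0.1$, for example, the Sparse-Pairwise count is $0.01 \times k \times n^2$, ten times smaller than the count of the sparse symmetrical dot product and a hundred times smaller than the dense one.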

EXPERIMENTS
To validate our cost analysis and check the performance of our Sparse-Pairwise approach, we implemented the 4 approaches described above in C++ with MPI. In part 5.1, we describe our experimental environment. In section 5.2, we use a dataset generator to independently vary different parameters and see how the different approaches perform. Finally, in section 5.3, we look at the performance of the different approaches with various datasets from real-world applications.

Experimentation environment and datasets
Our working environment is as follows: we have at our disposal 25 nodes, each comprising 2 Intel Xeon Gold 6230 processors with 20 cores @ 2.1 GHz (Cascade Lake). This enabled us to distribute calculations over a maximum of 1000 cores. Each compute node has a RAM capacity of 192 GB. The operating system is CentOS 7.9.2009 and the network technology is an Intel Omni-Path Architecture network at 100 Gbit/s. Our disk storage capacity is 500 GB. It is a Spectrum Scale GPFS parallel file system that allows a 9 GB/s input/output rate.
To be able to test co-occurrence matrix construction approaches accurately and under different conditions, we have developed a Boolean dataset generator. Algorithm 3 shows how we build a dataset with a defined size and sparsity. Parameters $k$ and $n$ are respectively the number of instances and the number of features we want in the dataset. After generating an empty dataset of the desired size on line 1, we use the parameter $d$ to fill our dataset according to the target density.

Algorithm 3 Boolean dataset generator
1: generate an empty dataset of size $k \times n$
2-4: …
5:     change the value of the element to 1
6:   end if
7: end for
8: return the dataset

We also chose to use three datasets for our experiments. An overview of the characteristics of these datasets is available in table 2. We selected the Anonymous MS Web dataset for its low density. In contrast, Criteo is a relatively high-density dataset. Finally, the last dataset, Kasandr, enables us to see the scalability of the approaches thanks to its large size.
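A generator in the spirit of the one described above can be sketched as follows. Because the middle lines of the algorithm are not recoverable here, the filling rule is an assumption: each element is set to 1 independently with probability equal to the target density. The function name `generate_boolean_dataset` and the `seed` parameter are also hypothetical.

```python
import random


def generate_boolean_dataset(k, n, density, seed=None):
    """Return a k x n Boolean dataset whose expected density is `density`.

    Assumption: every element is drawn independently, so the observed
    density fluctuates around the target for small datasets.
    """
    rng = random.Random(seed)
    dataset = [[0] * n for _ in range(k)]    # empty dataset of the desired size
    for row in dataset:
        for j in range(n):
            if rng.random() < density:       # random() lies in [0, 1)
                row[j] = 1
    return dataset


sample = generate_boolean_dataset(k=200, n=1000, density=0.01, seed=0)
observed = sum(map(sum, sample)) / (200 * 1000)
print(f"observed density: {observed:.4f}")
```

Fixing the seed makes an experiment repeatable, which helps when only one parameter is varied at a time.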

Efficiency and scalability
In order to test co-occurrence methods, we used the generator introduced in 5.1 to vary the parameters one by one and observe the resulting variations in the execution time needed to construct the co-occurrence matrix. This also enables us to progressively verify that the results are consistent with the complexity analysis performed in section 4 and to see the advantage of the Sparse-Pairwise approach relative to other approaches.

Memory complexity analysis.
The memory complexity of our implementation is shown in table 3. The theoretical values according to table 1 are also given for comparison. In this example, the environment parameters have been set as follows: $n = 100000$, $k = 200$, $p = 1000$ and $d = 1\%$. As the memory required to store the co-occurrence matrix is the same for all approaches, only the memory required to store the input data has been taken into account in this table.
For the Sparse SDP and Sparse-Pairwise approaches, we observe a fairly large difference with the theoretical value. This is due to our sparse matrix storage format. Using the ELLPACK format, we initialize an array larger than the number of non-zero values when the non-zero values are not evenly distributed between the rows. The slight difference in complexity of the Dense SDP approach is due to the fact that $\sqrt{p}$ is rounded up to the nearest integer during load balancing.
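The ELLPACK overhead mentioned above can be illustrated with a small sketch that compares the number of non-zero values with the number of index slots actually allocated. The function name `ellpack_storage` is hypothetical and the two small matrices are illustrative inputs, not measurements from table 3.

```python
def ellpack_storage(matrix):
    """Return (non-zero count, allocated ELLPACK index slots).

    ELLPACK allocates one slot per row for the largest number of non-zero
    values found in any row, so uneven rows lead to padding.
    """
    row_counts = [sum(1 for x in row if x) for row in matrix]
    nonzeros = sum(row_counts)
    allocated = len(matrix) * max(row_counts, default=0)
    return nonzeros, allocated


# Figure 1 corpus: rows hold 3, 3 and 4 non-zero values.
balanced = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [1, 0, 1, 1, 1],
]
# A skewed matrix: one full row forces padding on the two others.
skewed = [
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

print(ellpack_storage(balanced))
print(ellpack_storage(skewed))
```

The gap between the two numbers grows with the imbalance between rows, which is consistent with the difference observed between measured and theoretical memory usage.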
For each method implementation, we used a vector of size $p$ to store the index of the first row of the block associated with each matrix. This buffer vector is used to distribute the data to ensure good load balancing. However, the additional memory cost required to store this information is very low. We can see that approaches using sparsity require the least memory space. We also observe that the pairwise approach is very memory-intensive, making it difficult to use with very large datasets.

Density 𝑑.
Figure 3 shows the execution times of the different approaches as a function of dataset density. The figure shows that the execution times of all the approaches vary as a function of density, except for the dense symmetrical dot product approach. The results have been deliberately zoomed in on the lowest curves, removing the Pairwise approach, whose execution times are very high when the density exceeds 0.2. The Sparse symmetrical dot product approach performs better than the dense one when the density is less than 0.7. Similarly, the Pairwise approach performs better than the dense one when the density is less than 0.2. We observe that execution times follow curves similar to the cost analysis predictions. Execution times increase linearly as a function of density with the sparse symmetrical dot product approach. The execution times for the Sparse-Pairwise approach follow a parabolic pattern, confirming the squared complexity according to density. The Sparse-Pairwise approach achieves the fastest execution times regardless of density in the [0.1, 0.9] range.
To study the performance of the approaches at low density in more detail, we ran experiments with densities in the interval [0, 0.1]. The results are given in figure 4. They show that even at low density, the Sparse-Pairwise approach is the most attractive in terms of execution time. The Pairwise approach performs about as well as the sparse symmetrical dot product approach when the density is 1%.
The time required to build the co-occurrence matrix becomes negligible with the Sparse-Pairwise approach when the density is very low. With a density of 1%, building the matrix takes 0.19 seconds, while building the sparse matrix from the dataset takes 3.09 seconds.

Number of instances 𝑘.
For the parameter k, which corresponds to the number of individuals in the dataset, experiments have shown that the impact on execution time is linear. This fully confirms the cost analysis carried out in section 4: doubling k doubles the execution time. The approaches differ only in the value of the linearity coefficient. In table 4, we have calculated the coefficient of linearity for each method between two measurements with k = 200 and k = 2000. For each approach, this coefficient represents the additional time required when k is incremented by 1. The results were obtained by setting the parameters p = 1000 and n = 50000, with the density at 30%. It can be seen that k has no impact on the differences in performance between the approaches. Whatever the value of k, under the experimental conditions of table 4, the Sparse-Pairwise approach is 25 times faster than the Pairwise approach, 10 times faster than the Dense symmetrical dot product approach and 4.6 times faster than the Sparse symmetrical dot product approach.
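As a small illustration of how such a linearity coefficient can be derived from two measurements, the sketch below computes the slope between two (k, time) points. The timings are hypothetical placeholders, not values from table 4, and the function names are illustrative.

```python
def linearity_coefficient(k_low, t_low, k_high, t_high):
    """Slope of execution time versus k: extra seconds per additional instance."""
    return (t_high - t_low) / (k_high - k_low)


def predict_time(k, k_low, t_low, coef):
    """Linear extrapolation of the execution time for another value of k."""
    return t_low + coef * (k - k_low)


# Hypothetical timings in seconds (NOT measurements from the paper).
t_200, t_2000 = 1.5, 15.0
coef = linearity_coefficient(200, t_200, 2000, t_2000)
t_400 = predict_time(400, 200, t_200, coef)
print(f"coefficient: {coef:.4f} s per instance, predicted t(400): {t_400:.2f} s")
```

With these placeholder values the fitted line passes through the origin, so the predicted time at k = 400 is twice the time at k = 200, which is the doubling behaviour described above.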

Number of features 𝑛.
In figure 5, we scaled the parameter n while setting the other two parameters to 500 and 1000 and fixing the density at 10%. The figure shows execution times for n between 10000 and 100000. All the approaches have execution times that follow a curved trajectory as n increases. The differences lie in the second-degree coefficient associated with each curve. The curve is much flatter for the Sparse-Pairwise approach than for the other approaches. In this configuration, the Sparse-Pairwise approach offers the best performance whatever the value of n.
We have shown that the Sparse-Pairwise approach performs best whatever the values of k, n and matrix density. The Sparse-Pairwise approach is scalable and well suited to a wide range of k and n values. Its efficiency improves even further with very sparse datasets, but it remains worth using regardless of density. In addition, we have verified that the experiments match the theoretical computational complexity obtained in the previous section.

Figure 6 shows execution times as a function of the number of processors p. The matrix size is set to k = 100 and n = 100000 and the sparsity is fixed to 20%. We can see in this figure that the different methods for building the co-occurrence matrix all have excellent scalability. The Sparse-Pairwise approach has an efficiency of 96.9% with 1000 nodes relative to the execution time with 100 nodes, which is very good scalability. The efficiencies of the other methods are quite similar, except for the sparse SDP approach, which achieves 87.8% and is therefore the least attractive in terms of scalability.
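The efficiency figures above compare a run on many nodes with a baseline run on fewer nodes. A minimal sketch of that computation follows, assuming the usual definition of relative strong-scaling efficiency (measured speedup divided by ideal speedup); the paper does not spell out its formula in this section, and the timings below are hypothetical.

```python
def relative_efficiency(p_base, t_base, p, t):
    """Strong-scaling efficiency on p nodes relative to a p_base-node baseline.

    Assumed definition: (t_base / t) / (p / p_base).
    """
    speedup = t_base / t
    ideal_speedup = p / p_base
    return speedup / ideal_speedup


# Hypothetical timings in seconds (NOT measurements from the paper).
t_100, t_1000 = 8.0, 1.0
eff = relative_efficiency(100, t_100, 1000, t_1000)
print(f"efficiency with 1000 nodes relative to 100 nodes: {eff:.1%}")
```

Under this definition, an efficiency of 100% would mean that using ten times more nodes divides the execution time by exactly ten.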
The results for the study of weak scalability are shown in figure 7. In this figure, we varied n and p linearly, so that each processor always holds a block of the input dataset of the same size. In other words, the problem size is fixed for each processor. In the experiments shown in figure 7, we set k = 100 and n = 100 × p, so the size of the block distributed to each compute node is 100 × 100.
We observe that execution time increases linearly as a function of n and p. When n and p are doubled, execution time is also doubled. This confirms the computation complexity given in table 1. If the complexity is $n^2 k / p = t$, then $(2n)^2 k / (2p) = 2 n^2 k / p = 2t$. All else being equal, we therefore expect execution times to double when n and p are doubled. The experiments in figure 7 were performed with a density of 5%. The performance obtained at this density is consistent with figure 4.

Validation with real-case datasets
Having demonstrated the efficiency of our method on generated data, we now show that it also works on real-world datasets. To do this, we use the three datasets presented in table 2. We apply the different co-occurrence matrix building approaches to these datasets and compare their performance.
The execution times for building the co-occurrence matrix for each dataset with each approach are reported in table 5. They show that the Sparse-Pairwise approach builds the co-occurrence matrix fastest on all datasets. The Sparse-Pairwise approach is just over 4 times faster than the Sparse symmetrical dot product approach on the Criteo dataset and more than 34 times faster on the Kasandr dataset. This shows that the lower the density of the dataset, the more effective the Sparse-Pairwise approach. Thanks to the sparse storage formats, our approach also benefits from the limited memory required to store matrices. This makes it possible to work with large datasets like Kasandr, where memory space is insufficient to store matrices densely. The results obtained are consistent with the performance observed with the dataset generator: the Sparse-Pairwise approach significantly reduces the execution time required to build the co-occurrence matrix, and the greater the sparsity of the dataset, the greater the performance gains. The results obtained with Kasandr support the scalability of the Sparse-Pairwise approach to very large datasets.

CONCLUSION
In this paper, we proposed Sparse-Pairwise, an approach to building the co-occurrence matrix from a dataset composed of categorical and Boolean variables. This approach takes advantage of both arithmetic and sparsity to efficiently build the co-occurrence matrix. Cost analysis and experiments show that our Sparse-Pairwise approach reduces computational complexity compared with dense and sparse symmetrical dot product approaches, regardless of dataset density. We defined and used a dataset generator to study the impact of each matrix parameter on the performance of the approaches. The results show that Sparse-Pairwise reduces the execution time required to build the co-occurrence matrix over a very wide range of parameter values. Experiments with datasets from real-world applications show that the performance of our approach makes it realistic to use co-occurrence matrices as tools in many applications.

As future work, this approach could be applied to initialize deep neural networks in order to reduce training time. Sparse-Pairwise's distributed approach was validated on a CPU cluster. Using accelerators to express intra-node parallelism is a natural extension of this work for increasing the efficiency and scalability of the proposed approach. We plan to integrate this approach into MindSpore [5] to enable Sparse-Pairwise to be used on GPUs and NPUs.

Figure 1: Example of a corpus of words with (b) the incidence matrix and (c) the co-occurrence matrix associated with (a) the distribution.

Algorithm 1: Build the co-occurrence matrix from the Pairwise approach
Require: the dataset (list of the k input boolean vectors of size n)
Ensure: the co-occurrence matrix of size n × n
1: initialize all elements of the co-occurrence matrix to 0
2: for each non-zero element of the dataset do
3:

Algorithm 2: Sequential algorithm of the Sparse-Pairwise approach
Require: the dataset (list of the k input boolean vectors of size n)
Ensure: the co-occurrence matrix of size n × n
1: initialize all elements of the co-occurrence matrix to 0
2: build the ELLPACK sparse matrix index from the dataset
3: for each of the k vectors do
4:
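Since the listing above is truncated, the following is only a hedged sketch of how a sequential Sparse-Pairwise construction could look, not the authors' implementation. It assumes that the index stores, for each vector, the column positions of its non-zero values (an unpadded stand-in for the ELLPACK index), that every pair of non-zero features in a vector increments the matrix symmetrically, and that the diagonal counts feature occurrences. The function names are illustrative, and the result is checked against a plain dense dot product.

```python
def build_index(dataset):
    """Column indices of the non-zero entries of each boolean vector."""
    return [[j for j, v in enumerate(vec) if v] for vec in dataset]


def sparse_pairwise(dataset, n):
    """Co-occurrence matrix built by visiting only pairs of non-zero entries."""
    cooc = [[0] * n for _ in range(n)]
    for cols in build_index(dataset):
        for a, i in enumerate(cols):
            cooc[i][i] += 1              # feature i occurs in this vector
            for j in cols[a + 1:]:       # each unordered pair is visited once
                cooc[i][j] += 1
                cooc[j][i] += 1          # keep the matrix symmetric
    return cooc


def dense_dot_product(dataset, n):
    """Reference result: the dense product of the dataset transpose with itself."""
    return [[sum(vec[i] * vec[j] for vec in dataset) for j in range(n)]
            for i in range(n)]


dataset = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
cooc = sparse_pairwise(dataset, 4)
print(cooc == dense_dot_product(dataset, 4))
```

The inner loops touch only the non-zero entries of each vector, which is why the work grows with the square of the density rather than with the square of the full vector size.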

Figure 2: Overview of the Sparse-Pairwise approach.

Figure 3: Execution time comparison between the pairwise and matrix approaches to building the co-occurrence matrix as a function of the sparsity.

Figure 4: Execution time comparison between the different approaches to building the co-occurrence matrix as a function of the density, zoomed in on the interval [0, 0.1].

Figure 5: Execution time for the different co-occurrence matrix building approaches as a function of the number of features n.

Figure 6: Strong scalability: execution time for the different co-occurrence matrix building approaches as a function of the number of processors p.

Figure 7: Weak scalability: execution time for the different co-occurrence matrix building approaches when n and p are scaled linearly together.

Table 1: Comparison of the computation power, memory and communication required by each approach in a distributed environment.

Table 2: Datasets overview.

For each element in the dataset, we add non-zero elements to the dataset with a probability equal to the expected density (lines 3-6). The value of this parameter lies in the interval [0, 1].
Require: k the number of elements in the dataset, n the number of dataset features, and the expected density of the dataset
Ensure: A dataset
1: create k vectors of size n and initialize all elements to 0
2: for each element in the dataset do
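The generator listing is cut off after its second step, so the sketch below is one plausible reading of it rather than the authors' code: each of the k vectors of size n starts at zero, and each position is set to 1 independently with probability equal to the expected density. The function name `generate_dataset` and the `seed` argument are assumptions added for reproducibility.

```python
import random


def generate_dataset(k, n, density, seed=None):
    """Generate k boolean vectors of size n with the given expected density."""
    rng = random.Random(seed)
    dataset = [[0] * n for _ in range(k)]     # all elements start at 0
    for vec in dataset:
        for j in range(n):
            if rng.random() < density:        # non-zero with probability `density`
                vec[j] = 1
    return dataset


dataset = generate_dataset(k=200, n=50, density=0.3, seed=0)
observed = sum(map(sum, dataset)) / (200 * 50)
print(f"observed density: {observed:.3f}")
```

The observed density only approximates the requested one, since each entry is drawn independently; the approximation tightens as k × n grows.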

Table 3: Memory complexity for each approach implementation.

Table 4: Execution time to build the co-occurrence matrix with different approaches for two values of k. The coefficient is the slope of the linear function relating execution time to k.

Table 5: Execution time in seconds to build the co-occurrence matrix with different approaches. These results are obtained with p = 1000. The execution times include the time required to build sparse matrices from the dataset, where necessary.