Recommender Systems based on Parallel and Distributed Deep Learning

As individuals have become overloaded with information, Recommender Systems (RS) were created to provide machine-generated recommendations. Significant advancements in RS have been made thanks to Machine Learning methods; Deep Learning (DL) in particular has become extremely popular. Although deep neural networks (DNNs) notably improve the performance of RS, they also make these systems larger and more memory-intensive. To address this, parallel and distributed (data- or model-parallel) algorithms are added to DL-based RS. In this paper, we present our large-scale, multi-staged, hybrid RS that processes a million-scale dataset, as well as the most noteworthy parallel and/or distributed DL systems. Finally, we outline directions for the future evolution of our RS by adopting features and ideas from such systems.


INTRODUCTION
In recent years, people have been overwhelmed with an endless supply of information about practically all of their interests. When faced with this reality, a person must separate the pertinent and useful information while also avoiding being misinformed. This need has already been the focus of technological advancements, and much research has been done in this area. Recommendation or Recommender Systems (RS) have been developed to offer machine-generated recommendations that aid and facilitate people's daily lives. The algorithms and software used by RS aim to provide consumers with personalized recommendations that help them manage information overload and facilitate decision-making. Thus, RS generate a list of recommendations, which, depending on the circumstances, may take many different shapes. The recommended items can be anything from movies and music to commercial products and academic publications (papers).
The application of Artificial Intelligence (AI) and Machine Learning (ML) technologies has resulted in major improvements in RS, as well as in many other fields and systems. In particular, Deep Learning (DL), a subset of ML focused on using neural networks for function approximation, has gained widespread popularity. Deep neural networks (DNNs) have advanced the state of the art in a plethora of research areas, ranging from image and video recognition and natural language processing (NLP) to RS. Nichols et al. [19] state that the DNNs' popularity stems from their ability to automatically learn low-dimensional representations from high-dimensional unstructured data such as images and text. Aminu Da'u and Naomie Salim [4] state that using DL algorithms for RS has gained popularity as a result of a number of outstanding accomplishments in producing high-quality suggestions. DL-based RS models provide a better depiction of user/item interactions than classic recommendation structures. Consequently, creating individualized DL-based RS has emerged as an exciting idea.
However, it is common knowledge that DL-based RS are memory-intensive systems that are increasingly demanding in hardware resources. In [19], Nichols et al. argue that it is frequently impractical to train big networks on a single accelerator (Graphics Processing Unit (GPU), Field-Programmable Gate Array (FPGA) or Application-Specific Integrated Circuit (ASIC)) due to long execution times or a lack of memory that can accommodate these models. For instance, the extremely popular neural network GPT-2, which is utilized in NLP, needs 84 GB of GPU DRAM for training. Recently, DL tasks have been parallelized by training huge models on several GPUs on a single node [10], [12] or across multiple nodes connected by a communication network [15]. In order to deal with this increased memory demand and reduce the training time of DNNs, researchers utilize parallel and distributed technology in ML/DL systems; this way they harvest the power of multi-processor or multi-GPU clusters. Thus, we decided to focus our research on parallel and distributed technology, in order to include this kind of technology in the next version of the RS from our latest work (a multi-staged, hybrid DL RS that has been submitted as a manuscript for publication [26]).
The remainder of this paper is organized as follows. Firstly, related work on deep learning and parallel or distributed systems is presented in Section 2. Secondly, our latest work, entitled "An Academic Recommender System on large citation data based on clustering, graph-modeling and deep-learning" (submitted for publication [26]), is explained in Section 3. Next, our plans for adding new features (a parallel or distributed implementation) to the upgraded version of our RS are discussed in Section 4. Lastly, in Section 5 we conclude this work and discuss possible directions for our RS.

RELATED WORK
There are various academic publications and ongoing studies in the area of deep learning, as well as in the area of parallel deep learning applications and systems. Parallel and distributed systems utilizing deep learning offer increased scalability, enabling them to make use of large datasets or big data, while at the same time being more efficient in training time. Therefore, in this section, we present some state-of-the-art studies in the above scientific areas, in order to highlight the next steps (future work) in the evolution of our multi-staged, hybrid RS, described in Section 3.

Deep Learning Systems
Firstly, Gunduz in [7] presents his study on Parkinson's Disease classification using vocal feature sets. His study proposes two frameworks based on Convolutional Neural Networks (CNNs) to classify Parkinson's Disease (PD) using sets of vocal (speech) features. Although both frameworks are employed to combine various feature sets, they differ in how the feature sets are combined. The first framework combines the different feature sets before they are given to a 9-layered CNN as inputs, whereas the second framework passes the feature sets to parallel input layers that are directly connected to convolution layers. Thus, deep features from each parallel branch are extracted simultaneously before being combined in the merge layer.
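To illustrate the parallel-branch idea (this is a minimal sketch, not Gunduz's actual code), the following PyTorch snippet feeds each vocal feature set into its own convolutional branch and merges the branch outputs before classification; all layer sizes and the two hypothetical feature-set dimensions are assumptions:

```python
import torch
import torch.nn as nn

class ParallelBranchCNN(nn.Module):
    """Each feature set enters its own parallel 1-D conv branch; branch
    outputs are concatenated in a merge layer before classification."""
    def __init__(self, feature_dims, n_classes=2):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(8),
                nn.Flatten(),
            )
            for _ in feature_dims
        ])
        self.classifier = nn.Linear(16 * 8 * len(feature_dims), n_classes)

    def forward(self, feature_sets):
        # feature_sets: list of tensors, each of shape (batch, 1, dim_i)
        merged = torch.cat(
            [branch(x) for branch, x in zip(self.branches, feature_sets)], dim=1)
        return self.classifier(merged)

# Hypothetical feature-set sizes (e.g., two different vocal feature groups).
model = ParallelBranchCNN(feature_dims=[22, 44], n_classes=2)
logits = model([torch.randn(4, 1, 22), torch.randn(4, 1, 44)])  # (4, 2)
```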
Next, Shambour in [21] presents a deep learning-based method for multi-criteria recommender systems that uses deep autoencoders to take advantage of the complex, non-linear, and hidden relationships among users' multi-criteria preferences and produce more accurate recommendations. Experiments on the multi-criteria datasets from Yahoo! Movies and TripAdvisor demonstrate that the suggested algorithm is quite effective in delivering more accurate predictions than the most recent recommendation systems. The deep autoencoder is a special type of deep neural network whose input vectors and output vectors have the same dimensionality. It is a nonlinear feature extraction method used for learning a representation of the original data at the hidden layers. The proposed deep autoencoder-based multi-criteria recommendation algorithm (AEMC) [21] employs deep feedforward neural networks.
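As a minimal illustration of the deep autoencoder structure described above (with assumed, illustrative layer sizes rather than the AEMC configuration), the input and output layers share the same dimensionality while the bottleneck learns a nonlinear representation:

```python
import torch.nn as nn

n_items = 1000  # hypothetical length of a user's multi-criteria rating vector

autoencoder = nn.Sequential(
    nn.Linear(n_items, 256), nn.ReLU(),    # encoder
    nn.Linear(256, 64), nn.ReLU(),         # bottleneck representation
    nn.Linear(64, 256), nn.ReLU(),         # decoder
    nn.Linear(256, n_items), nn.Sigmoid()  # reconstruction, same size as input
)
```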
Wang et al. [28] apply collaborative neighborhood knowledge to session-based recommendation, i.e., the challenge of anticipating the next item to propose when anonymous behavior sequences are the only information provided. Previous techniques for session-based recommendation ignore the collaborative information in so-called neighborhood sessions, which were created previously by other users and reflect user intents similar to the present session; Wang et al. propose that this cooperative information may enhance the effectiveness of recommendations for the current session. They suggest a Collaborative Session-based Recommendation Machine (CSRM), a novel hybrid system composed of two parallel modules: an Inner Memory Encoder (IME) and an Outer Memory Encoder (OME). The IME uses recurrent neural networks (RNNs) and an attention mechanism to represent the user's own information in the current session. By looking into neighboring sessions, the OME uses collaborative knowledge to more accurately predict the intentions of the present session. The final representation of the current session is then created by carefully combining data from the IME and OME using a fusion gating technique. The IME comprises two parts, a global and a local encoder; both of them use a Gated Recurrent Unit (GRU), because GRUs have shown better performance than Long Short-Term Memory (LSTM) networks or plain RNNs for session-based recommendations.
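The following PyTorch sketch shows a simplified, IME-style session encoder (not the authors' implementation): a GRU summarizes the clicked-item sequence, an additive attention over its hidden states yields a "local" summary, and this is concatenated with the last hidden state as a "global" summary; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SessionEncoder(nn.Module):
    """Toy IME-style encoder: GRU over the session plus attention pooling."""
    def __init__(self, n_items, dim=64):
        super().__init__()
        self.emb = nn.Embedding(n_items, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.att = nn.Linear(dim, 1)

    def forward(self, session):                  # session: (batch, seq_len) ids
        h, h_last = self.gru(self.emb(session))  # h: (batch, seq_len, dim)
        w = torch.softmax(self.att(h), dim=1)    # attention over time steps
        local = (w * h).sum(dim=1)               # attended "local" summary
        return torch.cat([h_last.squeeze(0), local], dim=-1)
```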
Moreover, Aminu Da'u et al. in their paper [5] propose an RS model that exploits neural attention techniques to learn adaptive user/item representations and fine-grained user-item interactions to enhance the accuracy of item recommendation; namely, the Adaptive Deep learning-based method for Recommendation Systems (ADRS). The system essentially comprises three components: an attentive CNN, mutual attention, and a prediction layer. An attentive pooling layer is first designed based on CNNs to learn the adaptive latent features of the user/item from reviews. A mutual attention network technique is then introduced for modelling the fine-grained user-item interaction, jointly capturing the most informative features at a higher granularity. Eventually, a prediction layer is applied for the final prediction based on the adaptive user/item representation and the user/item importance.
In [14], Lee and Kim study the complicated interplay between user and item attributes, which has been the focus of numerous studies in the field of DL RS. They propose a convolutional neural network-based RS that makes use of cross convolutional filters and the outer product matrix of features. The suggested approach can handle a variety of feature types and capture relevant higher-order interactions between users and items. The proposed method effectively models the user-item interactions, capturing useful nonlinear relationships between user and item features by using cross convolutional filters. The proposed RS consists of: (1) variable embedding, (2) convolution with cross convolutional filters, and (3) rating prediction.
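A hedged sketch of the outer-product idea follows: user and item embeddings form a 2-D interaction map via their outer product, which a small CNN (standing in here for the cross convolutional filters) scans for higher-order interactions; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim = 32
user_emb = torch.randn(8, dim)   # batch of user feature embeddings
item_emb = torch.randn(8, dim)   # batch of item feature embeddings

# Outer product: (batch, dim, dim) interaction map, treated as a 1-channel image.
interaction = torch.einsum('bi,bj->bij', user_emb, item_emb).unsqueeze(1)

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
    nn.Linear(8 * 4 * 4, 1),     # rating prediction head
)
rating = cnn(interaction)        # (8, 1) predicted ratings
```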
Hanafi et al. [8] state that e-commerce is the most essential application for conducting business transactions. Therefore, RS have been adopted by many large e-commerce companies such as Amazon, eBay, Alibaba, YouTube and iTunes. Ratings have become an essential factor for calculating product information; unfortunately, ratings are extremely sparse, and generating rating predictions is a major issue in the RS research field. Their research aimed to develop a novel model that generates rating predictions using two deep learning variants, a Stacked Denoising Auto-Encoder (SDAE) and Long Short-Term Memory (LSTM), combined with a latent factor model based on Probabilistic Matrix Factorization (PMF). This study considered integrated information resources, including user information and document product information. In experiments on the MovieLens and Amazon Instant Video datasets, their model outperformed previous works. LSTM is an enhanced variant of the RNN, which in turn is an advance over traditional feed-forward artificial neural networks (ANNs).

Parallel and Distributed Deep Learning Systems
Large and complex models are needed for modern personalization and recommendation systems to fully utilize massive volumes of data. In particular, compared to other popular deep learning models like CNNs, RNNs, and Generative Adversarial Networks (GANs), DL RS have a relatively high parameter density. As a result, training periods of DL RS can last for a few weeks or more. Therefore, it is crucial to add parallel or distributed technology to these models in order to achieve realistic time scales. The most significant details of such systems are listed in Table 1. Firstly, Naumov et al. in [18] develop a state-of-the-art deep learning recommendation model (DLRM) and design a specialized parallelization scheme that utilizes model parallelism on the embedding tables to mitigate memory constraints, while exploiting data parallelism to scale out the computations of the fully-connected layers. They utilize the most fundamental model, the multilayer perceptron (MLP), composed of an interleaving sequence of fully connected (FC) layers and an activation function. The parallelized DLRM uses a combination of model parallelism for the embeddings and data parallelism for the MLPs to mitigate the memory bottleneck produced by the embeddings while parallelizing the forward and backward propagations over the MLPs. Combined (model and data) parallelism is a unique requirement of their system as a result of its architecture and large model sizes.
Huang et al. [10] address the need for efficient and task-independent model parallelism and introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. They demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification and (ii) Multilingual Neural Machine Translation. GPipe allows scaling arbitrary deep neural network architectures beyond the memory limitations of a single accelerator by partitioning the model across different accelerators (model parallelism).
Kim et al. [12] design and implement a ready-to-use library in PyTorch for performing micro-batch pipeline parallelism with the checkpointing proposed by GPipe [10]. In particular, they develop a set of design components to enable pipeline-parallel gradient computation in PyTorch's define-by-run and eager execution environment. They show that each component is necessary to fully benefit from pipeline parallelism in such an environment, and demonstrate the efficiency of the library by applying it to various network architectures, including AmoebaNet-D and U-Net. Hence, one can effectively parallelize the tasks by assigning tasks with different micro-batch indices to different devices, which is a form of data parallelism.
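Assuming the library in question is the torchgpipe package, a minimal usage sketch on two GPUs looks as follows: the sequential model is split into two partitions (balance) and each mini-batch is divided into micro-batches (chunks) that flow through the pipeline.

```python
import torch.nn as nn
from torchgpipe import GPipe  # assumed to be the library by Kim et al. [12]

# Four modules split into two partitions of two modules each; each partition
# is placed on its own GPU, and every mini-batch is cut into 4 micro-batches.
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)
model = GPipe(model, balance=[2, 2], chunks=4)
# Inputs must be moved to the first partition's device, e.g. x.to(model.devices[0]).
```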
Kalamkar et al. [11] focus on Facebook's DLRM benchmark. By enabling it to run on the latest CPU hardware (a CPU cluster architecture) with software tailored for high-performance computing (HPC), they achieve a more than two-orders-of-magnitude improvement in performance (110x) on a single socket compared to the reference CPU implementation, and high scaling efficiency up to 64 sockets, while fitting ultra-large datasets. DLRM comprises the following major components: a) a sparse embedding realized by tables (databases) of various sizes, and b) a small dense multi-layer perceptron (MLP). Both a) and b) interact and feed into c) a larger and deeper MLP.
Shoeybi et al. [23] deal with language modeling and training large transformer models that advance the state of the art in Natural Language Processing applications. However, due to memory limitations, training very big models can be quite challenging. Their paper presents the approaches they use to train very large transformer models, along with a straightforward, efficient, intra-layer model-parallel approach that permits training transformer models with billions of parameters. Their method can be fully implemented with the addition of a few communication operations in native PyTorch, is orthogonal and complementary to pipeline model parallelism, and does not call for a new compiler or library changes. They illustrate this approach by converging transformer-based models using 512 GPUs.
Li et al. in their paper [15] present the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training. PyTorch natively provides several techniques to accelerate distributed data parallelism, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. PyTorch offers several tools to facilitate distributed training, including DataParallel for single-process multi-thread data-parallel training using multiple GPUs on the same machine, and DistributedDataParallel for multi-process data-parallel training, including across machines.
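A minimal sketch of the DistributedDataParallel wrapper follows; it assumes the script is launched with one process per GPU (e.g., via torchrun), so that LOCAL_RANK is set in the environment.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; the process group coordinates gradient all-reduce.
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)

model = nn.Linear(128, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])  # gradients synced per bucket

# From here, an ordinary training loop works unchanged: each process trains
# on its own data shard, and DDP synchronizes gradients during backward().
```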
Shi et al. [22] note that synchronous stochastic gradient descent (S-SGD) with data parallelism is widely used for training DL models in distributed systems. A pipelined schedule of DL training is an effective scheme to hide some of the communication costs. In such pipelined S-SGD, tensor fusion (i.e., merging some consecutive layers' gradients into a single communication) is a key ingredient for improving communication efficiency. Therefore, they exploit simultaneous All-Reduce communications. Through theoretical analysis and experiments, they show that simultaneous All-Reduce communications can effectively improve the communication efficiency of small tensors.
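A conceptual sketch of tensor fusion follows (not the authors' scheduling algorithm): several small gradients are flattened into one buffer and reduced with a single all-reduce call, amortizing the per-message latency. It assumes an already-initialized torch.distributed process group.

```python
import torch
import torch.distributed as dist

def fused_allreduce(grads):
    """Merge several small gradient tensors into one all-reduce call."""
    flat = torch.cat([g.flatten() for g in grads])  # tensor fusion
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)     # one communication
    flat /= dist.get_world_size()                   # average across workers
    out, offset = [], 0
    for g in grads:                                 # un-fuse back to shapes
        n = g.numel()
        out.append(flat[offset:offset + n].view_as(g))
        offset += n
    return out
```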
Nagrecha et al. [17] propose a new form of "shard parallelism" that combines task parallelism and model parallelism, packaged into a framework named "Hydra". Hydra recasts the problem of model parallelism in the multi-model context to produce a fine-grained parallel workload of independent model shards, rather than independent models. This new parallel design promises dramatic speedups relative to the traditional model parallelism paradigm. Models like BERT have increased accuracy in practical fields; however, the memory and processing power needed to train such models can be astronomical. Single-device model training solutions have consequently grown increasingly unworkable, and the idea of model parallelism has emerged as a potential remedy.
Lai et al. [13] present SplitBrain, a high-performance distributed deep learning framework supporting hybrid data and model parallelism. Specifically, SplitBrain provides layer-specific partitioning that co-locates computationally intensive convolutional layers while sharding memory-demanding layers. A novel scalable group communication scheme is proposed to further improve the training throughput with reduced communication overhead. To address the massive demands on space and performance, both data parallelism (DP) and model parallelism (MP) are taken into consideration.
Cai et al. [2] propose an efficient Hybrid Parallel deep learning Model (HPM) for intrusion detection based on margin learning. Firstly, HPM constructs two parallel CNN architectures and fuses the spatial features obtained through full convolution. Secondly, the temporal information of the fused features is parsed separately using two parallel LSTMs. Finally, the extracted spatial-temporal features are fed into the classifier for classification detection.
Russel et al. [20] argue that categorizing leaf diseases poses challenges like the intensity of the disease in the leaf, the resolution of the image, the shot category and complex backgrounds. Thus, rather than a single deep stream of network, they propose a specialized parallel multiscale stream (data parallelism) with learnable filters that extract inherent attributes, which are utilized for improved performance. Their system applies data parallelism to a deep CNN model for image classification.
Sima et al. in [24] introduce "Ekko", a large-scale, distributed DL RS with low-latency model updates. Ekko, as a geo-distributed system, updates models in a central data center (DC) and then disseminates the updated models to geo-distributed DCs close to global users (i.e., clients), using software-based routers. Ekko has an efficient P2P model update algorithm which can coordinate billions of model updates so that they are efficiently disseminated to replicas in geo-distributed DCs. Its production environment comprises 4,600 servers, spread across 6 geo-distributed DCs, and supports a wide range of recommendation services for more than one billion users per day.
Dai et al. [3] present BigDL, a distributed deep learning framework for Apache Spark, which has been used by a variety of users in industry for building deep learning applications on real-world, big-data platforms. It allows DL applications to run on an Apache Hadoop/Spark cluster so as to directly process the production data, as part of an end-to-end data analysis pipeline for deployment and management. Unlike existing deep learning frameworks, BigDL implements distributed, data-parallel training directly on top of the functional computing model of Spark.
Miao et al. [16] present "Hetu", a highly efficient and easy-to-use distributed DL framework. Hetu is the first distributed DL system developed in Chinese universities; it provides both high availability for industry and innovation for academia.

AN ACADEMIC RECOMMENDER SYSTEM ON LARGE CITATION DATA BASED ON CLUSTERING, GRAPH-MODELING AND DEEP-LEARNING
This section presents the description and architecture of our latest work [26], which is currently submitted for publication.
In [26], we present a novel multi-staged RS based on clustering, graph-modeling and Deep Learning (as depicted in Figure 1) that manages to run on a full dataset (a scientific digital library) with millions of users and items (papers).
Our system is a hybrid one, based on the implemented recommendation algorithm. Content-based Filtering (CBF) is applied at the first stages of the system, while in the last stage a DL/Collaborative Filtering (CF) RS is incorporated: the hyperparameter-tuned (HP-tuned) version of CATA++ (Collaborative Dual Attentive Auto-encoder method for recommending publications [1]), as described in our previous work [25]. In order to tune the HPs of CATA++, many activation functions, weight initializers and numbers of training epochs were selected for the testing (evaluation) phase [25]. Thus, our system combines the powers of CBF and DL/CF and overcomes issues like cold start (for a new user or publication). Moreover, it can model the user's behavior or likes and retrieve/learn latent features of the relationship between users and items (papers).
Another point is that our system deals with the fact that DL models are highly memory-, CPU- and GPU-intensive during the training process. Usually, state-of-the-art RS that use an ANN/DL architecture while running on a single machine (PC or server) are constrained to exploit some thousands of papers to train a model and make recommendations, due to hardware limitations (memory, CPU or GPU). Facilitated by the field-of-study (fos) characteristic, our system can handle millions of papers from AMiner's database by creating two models in the respective two stages:
• Graph model: A fos-to-fos weighted graph structure. A vertex (or node) of the graph is a unique fos, and it is connected to one or more other fos with weighted edges. An edge connects two fos when they appear together in the fos list of one, two, three or more papers, making the weight of the edge one, two, three, etc. (see the sketch after this list).
• Clustering model: A clustering model based on fos similarity. We have utilized the cosine similarity measure, as it performs very well with textual data.
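The sketch below illustrates how such a fos-to-fos graph can be built with networkx; the papers variable is a toy stand-in for the AMiner records, not our actual pipeline.

```python
import itertools
import networkx as nx

# Toy stand-in for AMiner paper records, each with a fields-of-study list.
papers = [
    {"fos": ["machine learning", "recommender systems", "deep learning"]},
    {"fos": ["recommender systems", "deep learning"]},
]

G = nx.Graph()
for paper in papers:
    # Every fos pair co-occurring in a paper adds +1 to that edge's weight.
    for u, v in itertools.combinations(sorted(set(paper["fos"])), 2):
        w = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

print(G["recommender systems"]["deep learning"]["weight"])  # prints 2
```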
Moreover, we have utilized some other techniques and algorithms, such as the following (a minimal sketch of this pipeline appears at the end of this section):
• Dimensionality Reduction (DR): The TF-IDF algorithm is used for DR and feature selection. The Term Frequency (TF) of a term or word is the number of times the term appears in a document compared to the total number of words in the document. The Inverse Document Frequency (IDF) of a term reflects the proportion of documents in the corpus that contain the term.
• K-means: Used for clustering based on fos. In fact, we used the K-means++ algorithm, with improved initial centroid selection.
• Elbow method: Used before running K-means in order to determine the optimum number (K) of clusters.
All the development and experiments of our system were based on data from the DBLP-Citation-network-v13 (2021-05-14) dataset from AMiner [27] (https://www.aminer.org/citation), which was available at the time this work was written and which included 48,227,950 citation relationships and 5,354,309 publications.
DBLP-Citation-network-v13 provides a variety of information for each paper; some of the most useful fields are the paper's unique number, title, authors, keywords, fields of study (fos), and the abstract or indexed abstract.
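As a minimal, illustrative sketch of the TF-IDF/K-means++ pipeline described above (using scikit-learn defaults and toy data rather than our actual configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy stand-ins for the papers' fos lists, joined into one string per paper.
fos_docs = [
    "machine learning deep learning recommender systems",
    "graph theory clustering algorithms",
    "deep learning natural language processing",
]

tfidf = TfidfVectorizer()            # TF-IDF features (DR / feature selection)
X = tfidf.fit_transform(fos_docs)

# Elbow method: inspect inertia over a range of K and pick the "elbow" point.
inertias = [KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
            .fit(X).inertia_ for k in range(1, 3)]

# K-means++ clustering with the chosen K (here K=2 for the toy data).
model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(model.labels_)                 # cluster assignment per paper
```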

FUTURE WORK
As stated earlier, large and complex models are needed for modern, personalized recommender systems to fully utilize the massive volumes of data available online or offline in digital libraries. As a result, training periods can last for a number of weeks or more. Therefore, it is crucial to effectively parallelize these models in order to handle these issues at realistic scales.
In the process of improving our RS described in Section 3, and motivated by the state-of-the-art literature described in Section 2.2, we went deeper into a number of ideas for future advances and improvements on our current work:

Data Parallelism
In deep learning, data parallelism (DP) refers to parallelization across several processors in parallel computing environments. It concentrates on spreading the data across various nodes, which carry out operations on the data in parallel [19]. In simpler words, the term data parallelism refers to the division of data among processes (or parallel computing nodes), with each process receiving a specific portion of the data. The sizes of the data sections are almost equal (an even division of training data among workers, e.g., GPUs). If the processing times for the sections vary considerably, performance is constrained by the pace of the slowest worker. In that scenario, the issue can be resolved by dividing the data into numerous smaller chunks.
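A minimal sketch of this even division in PyTorch, assuming torch.distributed has already been initialized with one process per worker: the DistributedSampler hands each process a disjoint shard of the dataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(10000, 64))   # placeholder training data

# Each process gets a roughly equal, disjoint shard, selected by its rank.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)      # reshuffle shard assignment each epoch
    for (batch,) in loader:
        pass                      # forward/backward on this worker's shard
```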
As mentioned before, the AMiner data (the DBLP-Citation-network-v13 dataset) that our RS utilizes in order to make recommendations has 5.3M papers; however, new papers on Computer Science (CS) are published every day. This (i.e., the size) will be intensified if we consider merging the data with scientific publications from other sources, e.g., Semantic Scholar, arXiv, etc., so that we create an even larger scientific publications dataset. Consequently, this leads us to believe that data parallelism will be inevitable in order to feed batches of data to multiple copies of our system running on different processing units, reducing the overall processing time.

Model Parallelism
In model parallelism (MP), every compute node participates by training the same samples of data on a different portion of the model. Each processing node, such as a GPU, is in charge of one of the various pieces that make up the model [23]. When a neuron receives its input from the output of another computational node, communication between the computing nodes takes place. Model parallelism typically does not perform as well as data parallelism, because model parallelism has far greater transmission costs than data parallelism.
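A minimal PyTorch sketch of this idea, assuming two GPUs are available: the two halves of the network live on different devices and the activation is transferred between them during the forward pass, which is exactly the transmission cost noted above.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Each half of the model is placed on its own device (model parallelism)."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 1024).to('cuda:0')
        self.part2 = nn.Linear(1024, 10).to('cuda:1')

    def forward(self, x):
        x = self.part1(x.to('cuda:0'))
        return self.part2(x.to('cuda:1'))   # device-to-device transfer
```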
Thus, we will add model parallelism to Stage A (Figure 1), so that the offline pre-processing time of the AMiner Citation-Network dataset is reduced; currently the system needs about two days to complete the full process of Stage A while running on a modern, regular PC. If Stage A becomes a faster procedure overall, we could add even more data, as well as repeat this step more often to include the latest publications available in online sources.
Moreover, we will try to apply model parallelism in Stages B, C and D, as depicted in Figure 1. This implementation would provide our system with the capability to serve (make paper recommendations for) many more users concurrently.

Implementation and Available Frameworks
Taking into consideration the presented literature, along with the important points listed in Table 1, the most prominent solutions/frameworks for parallel and distributed DL systems (keeping in mind that Caffe2 is now a part of PyTorch) are the following:
• PyTorch (https://pytorch.org/)
• TensorFlow (https://www.tensorflow.org/overview)
• Apache Hadoop/Spark (https://spark.apache.org/)
Regarding hardware resources for the implementation and evaluation of a parallel and distributed RS, we plan to use a server machine equipped with multiple Nvidia GPUs, or a cluster of Nvidia Jetson Nano micro-computers; each of the latter comes with a powerful GPU with 128 CUDA cores that allows running multiple DNNs in parallel. In addition, we could make our RS a distributed system using DP or MP on a cluster of machines or a cluster of GPUs.

CONCLUSION
To conclude, it is clear from the above-mentioned observations of state-of-the-art DL models that parallel and distributed technology can be highly beneficial for them in general, as well as for DL-based RS in particular. Such systems achieve significantly lower training times while at the same time achieving advanced performance on larger datasets, compared to traditional (single-machine) systems.
From our standpoint, we believe that our large-scale, multi-staged, hybrid RS can be upgraded and notably advanced when we adopt the appropriate parallel (DP or MP) and/or distributed technology in its architecture.

Table 1: Characteristics of publications on parallel and distributed DL systems

Publication            Parallelism  Distributed           DL model(s)    Framework
Naumov et al. [18]     Hybrid       ✓                     MLP            PyTorch, Caffe2
Huang et al. [10]      Model        -                     Transformer    Tensorflow
Kalamkar et al. [11]   Hybrid       ✓                     MLP            PyTorch
Li et al. [15]         Data         ✓ (32-GPU cluster)    BERT [6]       PyTorch
Shoeybi et al. [23]    Model        ✓ (512-GPU cluster)   BERT, GPT-2    PyTorch
Shi et al. [22]        Data         ✓ (32-GPU cluster)    CNN, BERT [6]  PyTorch
Nagrecha et al. [17]   Model        ✓                     -              -
Lai et al. [13]        Hybrid       ✓                     CNN            C++
Dai et al. [3]         Data         ✓                     NCF [9], CNN   Apache Hadoop/Spark
Cai et al. [2]         Hybrid       -                     CNN, LSTM      PyTorch
Russel et al. [20]     Data         -                     CNN            -
Sima et al. [24]       Model        ✓                     -              -