Research Article | Open Access

esDNN: Deep Neural Network Based Multivariate Workload Prediction in Cloud Computing Environments

Published: 17 August 2022


Abstract

Cloud computing has been regarded as a successful paradigm for the IT industry by providing benefits for both service providers and customers. In spite of these advantages, cloud computing also suffers from distinct challenges, one of which is inefficient resource provisioning for dynamic workloads. Accurate workload prediction for cloud computing can support efficient resource provisioning and avoid resource wastage. However, due to the high-dimensional and highly variable features of cloud workloads, it is difficult to predict them effectively and accurately. The current dominant work for cloud workload prediction is based on regression approaches or recurrent neural networks, which fail to capture the long-term variance of workloads. To address these challenges and overcome the limitations of existing work, we propose an efficient supervised learning-based Deep Neural Network (esDNN) approach for cloud workload prediction. First, we utilize a sliding window to convert the multivariate data into a supervised learning time series that deep learning can process. Then, we apply a revised Gated Recurrent Unit (GRU) to achieve accurate prediction. To show the effectiveness of esDNN, we also conduct comprehensive experiments based on realistic traces derived from Alibaba and Google cloud data centers. The experimental results demonstrate that esDNN can accurately and efficiently predict cloud workloads. Compared with state-of-the-art baselines, esDNN can reduce the mean square error significantly, e.g., by 15% compared with the approach using GRU only. We also apply esDNN to machine auto-scaling, which illustrates that esDNN can reduce the number of active hosts efficiently, and thus the costs of service providers can be optimized.


1 INTRODUCTION

Today’s organizations and enterprises are becoming more dependent upon information technologies with cloud services that are deployed in cloud data centers [1, 2]. Cloud services offer significant benefits for both customers and service providers [3]. The customers can access the services with high availability, and the service providers can take advantage of elasticity and low management costs of infrastructure. The pay-as-you-go pricing model is also a dominant benefit that promotes the fast development of cloud computing [4]. Due to these benefits, large cloud service providers, e.g., Amazon, Google, and Microsoft, have established large-scale data centers to provide resources for their services and a great number of companies have started to migrate their local services to the cloud [5].

Although cloud computing features these attractive benefits, some unpredictable situations, e.g., workload bursts, can lead to insufficient resources. Resources that are unmatched with workloads can also waste resources or degrade performance; for instance, more resources are provisioned than required when workloads are at a low level, while only limited resources are offered when workloads are increasing dramatically [6]. Therefore, to improve resource usage, predicting workloads in an accurate manner is required. With effective prediction of future workloads, the service provider can plan resources in a more efficient and rational way by allocating or de-allocating resources in advance [7]. However, it is not easy to predict cloud workloads efficiently and accurately due to their native characteristics. Cloud workloads have high variance and high dimensionality, which make them difficult to forecast. High variance means that the number of workloads and their demanded resources can change dramatically: according to the analysis of Alibaba cloud data centers, the average resource utilization can range from 5% to 80% [8], and in Google cloud data centers, workloads can change randomly during a specific observation period. High dimensionality means that cloud workload traces record a great amount of information and different specifications of machines, from which the necessary and valuable information must be extracted for model training.

To address the high variance challenge of cloud workloads, the pattern of workloads, as well as the relationship with time series, should be learned and exploited to design efficient and accurate prediction algorithms to fit with the variances of workloads. As for the high dimensionality challenge, the dataset can be further analyzed to extract the necessary data while assuring the prediction accuracy.

A significant amount of research has been devoted to cloud workload prediction. Traditional approaches are mostly based on regression methods, heuristic algorithms, and traditional neural network approaches. Traditional neural networks generally refer to shallow networks that contain only several layers, such as the Multi-layer Perceptron (MLP) and Radial Basis Function (RBF) networks. However, these approaches only work effectively for workloads with obvious patterns, e.g., for small-scale data centers of ordinary companies or organizations. For large-scale public cloud data centers, these approaches cannot obtain high prediction accuracy. The main reason is that regression methods and simple neural networks cannot capture the complicated correlations of workloads. Therefore, to achieve higher accuracy, more complicated neural networks can be applied to take full advantage of the correlations between neurons.

As a representative of neural network-based approaches, the Recurrent Neural Network (RNN) [9] has been applied to predict cloud workloads, as it can model changes over time series. An RNN can use its memory to process a set of inputs in sequence. However, it is inefficient for an RNN to learn long-term memory dependencies because of gradient vanishing. To overcome this limitation, revised RNNs, including Long Short-Term Memory (LSTM) [10] and the Gated Recurrent Unit (GRU) [11], have been proposed, which have demonstrated a strong capacity to learn long-term memory dependencies. Compared with LSTM, GRU has demonstrated better prediction accuracy and learning efficiency in practice. Thus, in this work, we apply a GRU-based approach to capture the variance of cloud workloads.
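For reference, a GRU cell from [11] computes its hidden state through an update gate and a reset gate; in one common formulation (the revised gates used by esDNN are detailed in Section 4):

```latex
\begin{align}
z_t &= \sigma (W_z x_t + U_z h_{t-1}), & \text{(update gate)}\\
r_t &= \sigma (W_r x_t + U_r h_{t-1}), & \text{(reset gate)}\\
\tilde{h}_t &= \tanh (W_h x_t + U_h (r_t \odot h_{t-1})), & \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t. & \text{(new hidden state)}
\end{align}
```

The update gate \( z_t \) controls how much of the previous state is carried forward, which is what allows the GRU to retain long-term dependencies without the vanishing gradients of a plain RNN.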

1.1 Motivation and Our Contributions

To address the high-dimensionality challenge, extraction of features of the original data is required. Our main motivations are as follows:

  • Some approaches including Principal Component Analysis (PCA) [12] and auto-encoder [13] have been investigated, which can reduce dimension largely, while the accuracy is degraded as some features have been ignored.

  • Traditional machine learning models can only show the mapping relationship between the source data and the target data; the time relationship cannot be extracted and exploited.

  • When predicting over long periods, the dominant time series prediction approaches based on RNN and LSTM suffer from the limitations of gradient disappearance and explosion.

To address the aforementioned challenges for cloud workload prediction, we first extract some key features from the realistic traces derived from the cloud data center, and then convert the multivariate time series into supervised learning time series [14] for further training with our designed training algorithm based on GRU. Our objective is to achieve efficient and accurate predictions for highly variable and high dimensional cloud workloads to finally optimize the resource usage in cloud computing environments.

The main contributions of this article are summarized as follows:

  • The sliding window for Multivariate Time series Forecasting (S-MTF) is designed to convert multivariate time series into supervised learning time series for multivariate workloads while keeping sufficient information. S-MTF reorganizes the time series into samples X and labels Y and models the correlation between predicted data, enabling algorithms based on Deep Neural Networks (DNNs) to achieve predictions.

  • An efficient supervised learning-based Deep Neural Network (esDNN) algorithm is proposed for cloud workload prediction to learn and capture the features of historical data and accurately predict future workloads. The proposed algorithm can adapt to the variances of workloads by updating the gates of GRU and overcome the limitations of gradient disappearance and explosion.

  • Comprehensive experiments are conducted by using realistic data derived from Alibaba and Google cloud data centers to evaluate the performance of esDNN. The results demonstrate that the proposed approach can achieve better prediction accuracy than state-of-the-art algorithms. Experiments also show that the proposed approach can be applied for auto-scaling scenarios to improve resource provisioning.

1.2 Article Organization

The rest of this article is organized as follows: Section 2 discusses the related work on workload prediction in cloud computing environments. Section 3 depicts the system model of our proposed approach, followed by the problem statement. The proposed algorithm based on DNN is introduced in Section 4. Section 5 introduces the details of our experiments, which apply datasets derived from realistic traces to predict workloads and demonstrate the feasibility of our approach to improve the resource provisioning of cloud data centers. Finally, conclusions along with future directions are given in Section 6.


2 RELATED WORK

Many researchers have conducted research on workload prediction. The main contributions to cloud workload prediction can be classified into regression-based and learning-based approaches. The regression-based approaches mainly include linear regression, auto-regression, and other traditional regression techniques. For the learning-based approaches, both traditional methods based on machine learning and more recent methodologies based on deep learning have been investigated.

2.1 Regression-Based Approaches for Cloud Workload Prediction

Calheiros et al. [15] proposed an approach based on auto-regression to predict future workloads by using requests of web applications. The proposed approach can achieve high accuracy in resource utilization and QoS prediction. Yang et al. [16] introduced an approach based on linear regression for workload prediction to satisfy Service Level Agreements (SLAs) and reduce scaling costs. Based on the predicted data, an auto-scaling mechanism can be further applied to optimize virtualized resource usage. Centinski et al. [17] combined statistical and machine learning methods to improve workload prediction for cloud applications. The training method is utilized to learn the dominant system parameters that influence the application, and the prediction method is based on the regression approach. Singh et al. [18] presented a combined algorithm based on linear regression and support vector machines for workload prediction of web applications. A workload classifier was also proposed to select the model based on workload features. Liu et al. [19] introduced an adaptive workload prediction approach based on workload classification, in which different prediction models can be assigned to the different categorized workloads. Bi et al. [20] proposed a prediction method that integrates a Savitzky-Golay filter and wavelet decomposition with stochastic configuration networks to predict workloads.

These regression-based approaches have proven their effectiveness in workload prediction. However, most of them are only suitable for workloads with obvious patterns, e.g., Wikipedia workloads with fixed daily tendencies. The high variance of modern cloud workloads makes it hard for these approaches to represent the correlations between different parameters. Besides, these approaches were applied to high-performance computing workloads, small-scale data centers, or synthetic workloads, which have lower variance than cloud workloads. Therefore, to efficiently capture the characteristics of cloud workloads, more advanced learning approaches, e.g., machine learning and deep learning-based methodologies, have been investigated.

2.2 Learning-Based Approaches for Cloud Workload Prediction

Kumar et al. [21] applied a neural network and self-adaptive differential evolution algorithm to learn and extract the pattern from workloads. This evolution-based approach can reduce the prediction error by searching a large solution space, thereby minimizing the effects of initial solution choice. Zheng et al. [22] presented a deep learning model based on canonical polyadic decomposition to predict the usage of virtual machines for cloud workloads for industry informatics. Compared with machine learning-based approaches, deep learning-based approaches can achieve higher accuracy. Kumar et al. [23] proposed a prediction model based on LSTM and showed good performance in reducing mean square errors. Qiu et al. [24] introduced a deep learning approach to predict Virtual Machine (VM) workloads by extracting high-level features of VMs workloads and then predicting future VM workloads. Zhu et al. [25] presented an approach based on LSTM encoder-decoders network with an attention mechanism. The features of historical data are extracted via the encoder network and the attention mechanism is integrated into the decoder network. Amiri et al. [26] introduced an online learning approach to adapt resources according to workloads variations based on sequential pattern mining, which can learn new behavioral patterns rapidly. Chen et al. [7] proposed a deep learning-based approach, which includes a top-sparse auto-encoder to extract essential features of workloads and GRU to obtain an accurate and adaptive prediction for cloud workloads. Several different types of workloads have been investigated to validate the effectiveness of the proposed approach. Eli et al. [27] presented a resource central system to collect Azure VM parameters to learn the VM behavior offline with Microsoft learning libraries and then make online resource usage prediction, which predicts the oversubscription of VM types while ensuring VM performance.

Bi et al. [28] applied bi-directional LSTM (Bi-LSTM) to predict large-scale workloads and resource consumption in the cloud computing environment. The performance of the approach has been validated with Google traces and has shown better results than baselines. Karim et al. [29] proposed a hybrid approach combining RNN and Bi-LSTM to forecast the CPU workload of VMs, which can improve the performance of using a single technique separately. Chen et al. [30] introduced an LSTM-based approach to predict the useful life of components to indicate system health. A support vector regression is also combined to enhance the prediction robustness and marginal utility. Results based on NASA data have validated the effectiveness of the proposed approach. Singh et al. [31] proposed an evolutionary quantum neural network-based approach for cloud workload prediction, which leverages the computational efficiency of quantum computing to encode workloads and utilizes the neural network to estimate resource demands. The experiments with traces from cloud data centers and traditional data centers have validated the effectiveness of the proposed approach. Kim et al. [32] introduced a cloud prediction framework named CloudInsight that combines multiple predictors based on traditional machine learning techniques to enable accurate predictions for real cloud workloads. The ensemble supports dynamic and periodical optimization to handle the variations of workloads. The framework can also reduce the periods of under-provisioning and over-provisioning, thus improving system efficiency.

The deep learning-based approaches have been applied in predictions in many areas, such as communication, economic market, and pedestrian motion. Sun et al. [33] proposed an LSTM-based approach to predict the link quality confidence interval for wireless communication under a smart grid environment. A wavelet denoising algorithm has been applied to decompose the signal-to-noise ratio time series into the deterministic and stochastic ones to train two LSTM neural networks. Li et al. [34] introduced a recurrent attention and interaction model to predict pedestrian trajectories, which includes several modules to achieve precise prediction collaboratively. The introduced approach can comprehensively mine the spatio-temporal information to model attention mechanisms, interactions, and multimodality of pedestrian motion. Barra et al. [35] presented an approach to forecast market behavior by encoding time series to Gramian angular fields images based on neural networks. Qiao et al. [36] proposed an approach based on a neural network to model the uncertain nonlinear systems by utilizing a distance concentration algorithm to increase prediction accuracy and reduce computation time. However, these approaches are not focusing on cloud workloads prediction.

To summarize, most of the learning-based approaches are based on machine learning algorithms or traditional RNN, which either cannot exploit the long-term memory dependencies or address the gradient vanishing challenge. Thus, it is also difficult for them to predict cloud workloads accurately. Only limited research has paid attention to GRU, which is an improved version of RNN and can address the gradient vanishing challenge to achieve better accuracy. For instance, Chen et al. [7] applied GRU for cloud workload prediction; however, they also apply the auto-encoder approach to compress the dimensionality of the original data. Although the auto-encoder approach can address the high dimensionality, the accuracy is also undermined since the full data is not utilized to capture the whole features of workloads.

2.3 Critical Analysis

This article contributes to the growing body of work in the cloud workload prediction area. The comparison of our proposed approach with related work is summarized in Table 1. To solve the aforementioned challenges, e.g., high-dimensional and multivariate problems, we apply GRU to capture long-term memory dependencies and address the high variance of cloud workloads, thereby achieving high prediction accuracy. We also apply a sliding window for multivariate time series prediction to convert the original time series into supervised learning time series, which addresses the high dimensionality and further improves accuracy. From the technique perspective, our GRU-based approach is more advanced in prediction than traditional regression and machine learning-based approaches, and aims to overcome the limitations of gradient disappearance and explosion that exist in approaches like LSTM. From the data preprocessing perspective, our approach uses a sliding window to take advantage of the full data and the correlations between the predicted data, rather than extracting only part of the data as in auto-encoder-based approaches. We also validated our approach based on realistic traces from Google and Alibaba, and multiple metrics have been evaluated comprehensively.

Table 1.
Approach | Technique | Data Preprocessing | Predicted Resources | Workloads | Performance Metrics
Sub-columns — Technique: Regression; Machine Learning; Episode Mining; Deep Learning (DNN, DBN, LSTM, GRU). Data Preprocessing: Auto-encoder; Sliding window. Predicted Resources: VM utilization; Server Utilization; QoS. Workloads: Realistic (Cloud Data Centers, Traditional Data Centers); Synthetic. Performance Metrics: MSE; RMSE; MAPE; CDF.
Calheiros et al. [15]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Amiri et al. [26]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Ceninski et al. [17]\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Kumar et al. [21]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Kumar et al. [23]\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Qiu et al. [24]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Rodrigo et al. [15]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Singh et al. [18]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Yang et al. [16]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Zhang et al. [22]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Zhu et al. [25]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Bi et al. [20]\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Liu et al. [19]\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Eli et al. [27]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Sun et al. [33]\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Li et al. [34]\( \surd \)
Barra et al. [35]\( \surd \)
Qiao et al. [36]\( \surd \)\( \surd \)
Bi et al. [28]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Singh et al. [31]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Kim et al. [32]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Karim et al. [29]\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)
Chen et al. [30]\( \surd \)\( \surd \)\( \surd \)\( \surd \)
esDNN (This Work)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)\( \surd \)

Table 1. Comparison of Related Work


3 SYSTEM MODEL

In this section, we introduce our system model and optimization objective. In our system model, we aim to offer an efficient and accurate prediction model that the service providers can apply to predict future workloads. Thus, the resource usage can be optimized to reduce their costs, e.g., integrating the model with auto-scaling to reduce the number of active hosts.

It is not easy to predict cloud workloads, as they can change dramatically within a short time and their pattern is difficult to capture precisely. For example, workloads in every 5-minute interval of the Alibaba dataset can vary significantly [8]. Cloud workloads are tightly coupled with time series, and it is inefficient to obtain accurate prediction results from a simple regression model or univariate time series predictions. Multivariate time series contain more dynamic information than univariate time series; for instance, the variables in multivariate time series forecasting can have certain correlations, such as CPU usage and memory usage in workload forecasting. Therefore, we build a multivariate time series forecasting model to predict highly random workloads and use real-world datasets to verify the accuracy of the model. In our prediction model, we use CPU usage as the standard for measuring the prediction results. Figure 1 shows the main components and flow of the system model.

Fig. 1.

Fig. 1. Multivariate time series prediction model for cloud workloads.

Step 1: Data Preprocessing. This step is equipped with a workload preprocessing component and a data cleaning component, which process the raw data derived from realistic cloud traces. With the raw data of cloud workloads, we first remove the columns that contain empty data, because whether we apply a zero-filling scheme or simply ignore these data, they would have a negative impact on our predictions. Afterwards, we classify the dataset by time and calculate the average value of each parameter with the same timestamp. Next, we normalize the Alibaba and Google datasets. Normalization is a dimensionless processing method that converts absolute values of a physical system into relative ones. From the perspective of model optimization, normalization can not only improve the convergence speed of the model but also improve the accuracy of prediction. The normalization method has two forms: one changes each number to a decimal in (0, 1), and the other changes a dimensional expression into a non-dimensional scalar. In this article, the former is chosen as the normalization method, and we use MinMaxScaler to achieve this function. The MinMaxScaler operation is based on the min-max scaling method as follows: (1) \( \begin{align} X_\text{std}&=\frac{X-X_{\min }}{X_{\max }-X_{\min }}, \end{align} \) (2) \( \begin{align} X_\text{scaled}&=X_\text{std} *(max - min) + min. \end{align} \) We apply the MinMaxScaler to transform features with default configurations and scale each feature to a value between min and max. Here, X represents the set of data to be processed, \( X_{\min } \) and \( X_{\max } \) are the minimum and maximum values in the dataset, min and max bound the target feature range, and the final processed data are represented by \( X_\text{scaled} \).
Note that, in this work, the predicted value is resource utilization, which ranges from 0.0 to 1.0, and the MinMaxScaler can handle these data well. As for missing data, they are filled with the data from the previous time slot.
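The min-max scaling of Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative re-implementation of MinMaxScaler's default behavior rather than the library call itself, applied to a small CPU utilization sample:

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Scale a 1-D sequence into [min, max] per Equations (1)-(2).

    Mirrors the default behaviour of sklearn's MinMaxScaler
    (an illustrative sketch, not the library implementation).
    """
    lo, hi = feature_range
    x_min, x_max = min(values), max(values)
    # Equation (1): standardize each value into [0, 1]
    std = [(v - x_min) / (x_max - x_min) for v in values]
    # Equation (2): rescale into the target feature range
    return [s * (hi - lo) + lo for s in std]

cpu = [16.127, 21.5878, 17.3193, 16.8287, 18.6518]  # CPU utilization (%)
scaled = min_max_scale(cpu)
```

The smallest value maps to 0.0 and the largest to 1.0, so every feature ends up on a comparable scale before training.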

Step 2: Supervised Learning Conversion. The difference between supervised learning and unsupervised learning lies in whether the training samples have labels. Supervised learning trains on existing labeled samples to obtain an optimized model and then uses this model to map inputs to the corresponding outputs, thereby realizing data prediction and classification. In unsupervised learning, there are no pre-labeled training samples. In our system model, we use a supervised learning transfer function to convert the multivariate time series prediction problem into a supervised learning problem based on [14]. More details about the transfer function will be introduced in Section 4. The key idea is to use the normalized dataset as the input of the transfer function and reframe the time series dataset as a supervised learning dataset. To achieve this, we split the dataset into a training set and a validation set. After that, the dataset is divided into samples X and their corresponding labels Y. With these conversion operations, we transform the time series forecasting problem into a supervised learning-based time series problem.
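The conversion in this step can be sketched as follows. This is a minimal univariate version (Section 4 generalizes it to the multivariate case); the function name, lag parameter, and split ratio are illustrative assumptions, not taken from the article:

```python
def to_supervised(series, n_lags=1):
    """Reframe a time series as supervised pairs: each sample X holds
    the n_lags previous values; its label Y is the value that follows."""
    X, Y = [], []
    for i in range(n_lags, len(series)):
        X.append(series[i - n_lags:i])
        Y.append(series[i])
    return X, Y

cpu = [16.127, 21.5878, 17.3193, 16.8287, 18.6518]  # CPU utilization (%)
X, Y = to_supervised(cpu, n_lags=1)

# Split into training and validation sets (e.g., 75% / 25%)
split = int(len(X) * 0.75)
X_train, Y_train = X[:split], Y[:split]
X_val, Y_val = X[split:], Y[split:]
```

Each (X, Y) pair is now an ordinary labeled example, so any supervised learner can be trained on it while the temporal ordering is preserved inside the samples.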

Step 3: Model Construction. In this step, our system model focuses on the construction of deep learning networks and establishes an optimization model for cloud workload prediction based on the preprocessed data. The preprocessed data are considered as input, and the output is the optimized parameters of the model as well as the evaluation metrics, e.g., mean square errors. In this step, the hyperparameters of the deep learning network should also be defined, e.g., the number of layers, the number of neurons, and the types of network. Our proposed network model is derived from GRU, and more design details will be given in the following sections. By predicting the cloud workloads, we aim to obtain future resource usage so that the number of active machines can be optimized by auto-scaling approaches, which is achieved in coordination with the next step.
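The GRU layers at the heart of this construction can be illustrated by a single cell's forward pass over a short multivariate sequence. This is a numpy sketch of one common GRU formulation with random placeholder weights (biases omitted), not the trained esDNN network or its revised gates:

```python
import numpy as np

def gru_cell(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One forward step of a standard GRU cell (illustrative sketch)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde         # new hidden state

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4           # e.g., CPU, memory, and one more feature
Wz, Wr, Wh = (rng.standard_normal((n_hid, n_in)) for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((n_hid, n_hid)) for _ in range(3))

h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):   # run over 5 time steps
    h = gru_cell(x, h, Wz, Uz, Wr, Ur, Wh, Uh)
```

Because the new state is a gated blend of the previous state and the candidate, the hidden activations stay bounded while still carrying information across many steps.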

Step 4 and Step 5: Model Deployment and System Adaptation. These two steps focus on utilizing the models for workload prediction or other system optimization purposes. In actual workload prediction, there is a time interval between two consecutive predictions, which means that the sequence prediction is based on a discrete time series dataset. For the Alibaba dataset, the prediction interval is usually 10 seconds, while the prediction interval of Google is 5 minutes. We apply this time interval as the prediction unit. Based on the model trained in Step 3, the system can obtain the predicted future workloads and adjust the number of machines by applying auto-scaling. With the predicted data, the realistic system can dynamically adapt its resource provisioning, e.g., adding or removing machines physically, which requires the use of Application Programming Interfaces (APIs) provided by the hardware.
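As a hypothetical sketch of how predicted workloads could drive the auto-scaling decision (the function name, target utilization, and policy below are illustrative assumptions, not the mechanism evaluated in the article):

```python
import math

def hosts_needed(predicted_demand, host_capacity, target_util=0.7):
    """Toy scaling rule: choose the fewest hosts that keep the
    predicted utilization at or below the target, with a floor of
    one active host. Illustrative only, not esDNN's actual policy."""
    needed = predicted_demand / (host_capacity * target_util)
    return max(1, math.ceil(needed))

# Example: predicted demand of 52 cores, 16-core hosts, 70% target
n = hosts_needed(52, 16, 0.7)
```

A production controller would add cool-down periods and SLA-aware headroom on top, but the core decision is this capacity calculation applied to each predicted interval.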


4 ESDNN: EFFICIENT SUPERVISED LEARNING-BASED DEEP NEURAL NETWORK

This section presents our proposed approach, which is a deep learning-based approach for cloud workload prediction. To achieve efficient and accurate prediction results, a sliding window-based approach for multivariate time series prediction is applied to convert the original dataset into supervised learning-based time series data. Thereafter, a GRU-based deep learning network, named esDNN, is proposed for future workload prediction.

4.1 Sliding Window for Multivariate Time Series Forecasting

Time series forecasting requires the dataset to contain a set of time-dependent data, regardless of whether the time unit is seconds, minutes, or hours. The data need a minimum time unit, but the time interval between two adjacent timestamps does not need to be constant. Having clarified this concept, we can say that a time series is a sequence of numbers sorted by a time index. However, having only one time series is not sufficient. In Section 3, we introduced the definition of supervised learning: complete supervised learning requires a sample group (X) and a label group (Y). There are two major differences compared with existing sliding window-based work [37]: (1) we consider multivariate workloads to construct time series data rather than a single variate; and (2) we utilize the relationship between the predicted data by merging the predicted data and the source data into supervised time series data together. To illustrate the conversion process more vividly, we use a small piece of sample data from the Alibaba cloud workloads dataset to show the conversion process and results. To simplify the example, we use one-step univariate forecasting.

Table 2 shows the first five rows of data from the Alibaba dataset. After we convert these time series data into supervised learning data, they take the form shown in Table 3, where the values are shifted up by one step to form the label group (Y) and the time label has been removed.

Table 2.
Time | CPU utilization percentage
0 | 16.127
10 | 21.5878
20 | 17.3193
30 | 16.8287
40 | 18.6518

Table 2. One-Step Univariate Forecasting: Raw Dataset

Table 3.
X | Y
None | 16.127
16.127 | 21.5878
21.5878 | 17.3193
17.3193 | 16.8287
16.8287 | 18.6518
18.6518 | None

Table 3. One-Step Univariate Forecasting: Supervised Learning Sequence

For multivariate time series datasets, we can also convert them into supervised learning datasets with the sliding window approach. Similarly, we take a small fragment from the Alibaba dataset. The difference is that, in addition to Time and CPU utilization percentage, we also take the memory utilization percentage to reflect that this is a multivariate dataset. Here, we choose the memory utilization percentage as Y, which is considered the label. In our model, we use one-step multivariate forecasting. Tables 4 and 5 show the original data and the converted data, respectively.

Table 4.
Time | CPU util. percentage | Memory util. percentage
0 | 16.127 | 87.139
10 | 21.5878 | 87.0543
20 | 17.3193 | 86.9491
30 | 16.8287 | 86.9454
40 | 18.6518 | 86.9495
50 | 20.0232 | 86.9985
60 | 17.8671 | 86.9249

Table 4. One-Step Multivariate Forecasting: Raw Dataset

Table 5.
X1 | X2 | X3 | Y
None | None | 16.127 | 87.139
16.127 | 87.139 | 21.5878 | 87.0543
21.5878 | 87.0543 | 17.3193 | 86.9491
17.3193 | 86.9491 | 16.8287 | 86.9454
16.8287 | 86.9454 | 18.6518 | 86.9495
18.6518 | 86.9495 | 20.0232 | 86.9985
20.0232 | 86.9985 | 17.8671 | 86.9249
17.8671 | 86.9249 | None | None

Table 5. One-Step Multivariate Forecasting: Supervised Learning Sequence
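The conversion from Table 4 to Table 5 can be sketched directly. This illustrative Python version of the one-step multivariate case produces only the interior rows derived from Table 4's raw values, since the boundary rows containing None are deleted before training anyway:

```python
def one_step_multivariate(rows):
    """Convert (CPU, memory) records into one-step supervised pairs:
    sample (X1, X2, X3) = previous CPU, previous memory, current CPU;
    label Y = current memory — the layout of Table 5."""
    pairs = []
    for prev, cur in zip(rows, rows[1:]):
        (cpu_p, mem_p), (cpu_c, mem_c) = prev, cur
        pairs.append(((cpu_p, mem_p, cpu_c), mem_c))
    return pairs

table4 = [(16.127, 87.139), (21.5878, 87.0543), (17.3193, 86.9491),
          (16.8287, 86.9454), (18.6518, 86.9495), (20.0232, 86.9985),
          (17.8671, 86.9249)]
pairs = one_step_multivariate(table4)
```

Seven raw records yield six supervised pairs, each coupling the previous step's CPU and memory values with the current CPU value to predict the current memory utilization.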

At this point, we have illustrated multivariate time series forecasting with concrete examples; what remains is to abstract it into an algorithm. We define this algorithm as the Sliding window for Multivariate Time series Forecasting (S-MTF) algorithm, which transforms a multivariate time series forecast into supervised learning time series. The S-MTF algorithm can be applied to any time-related dataset, and it remains linearly related to time because, at any time, it contains all the data of the previous moment. In this respect, S-MTF is somewhat similar to LSTM, but the difference is that the forget gate of LSTM weakens the influence of the previous moment, while S-MTF retains all the values of the previous moment, and whether to keep them can be decided later. Besides, S-MTF contains the future labels when multi-step forecasting is used. Furthermore, S-MTF satisfies the definition of supervised learning, as it transforms time-related datasets into sample and label datasets. In a more general form, Figure 2 depicts the transformation of the S-MTF algorithm for time series and presents the supervised learning sequences obtained from the transformation in tabular form. Algorithm 1 shows the pseudocode of the S-MTF algorithm. Before the original data are processed as the input of the algorithm, we delete the None values in the dataset for the convenience of data processing, as they would influence the accuracy of the proposed algorithm.

Fig. 2.

Fig. 2. The key procedures of the S-MTF algorithm.

Figure 3 shows the typical conversion process applied to the original data. Assume that we have the time sequence \( R(t-1), R(t), \) and \( R(t+1) \), where \( R(t-1) \) is the previous element, \( R(t) \) is the current one, and \( R(t+1) \) is the next one. We set \( E(i-1), E(i), \) and \( P(i+1) \) as the elements to be combined into the supervised time sequence \( S(n) \): \( E(i-1) \) is taken from \( R(t-1) \), \( E(i) \) is assigned by \( R(t) \), and \( P(i+1) \) is assigned by \( R(t+1) \). Then \( E(i-1) \) and \( E(i) \) form the sample data and \( P(i+1) \) is the label. The other supervised time sequences, e.g., \( S(n-1) \) and \( S(n+1) \), are obtained in the same way.

Fig. 3.

Fig. 3. Data conversion in S-MTF algorithm.

Algorithm Complexity Analysis: Given a set of time series data of size N, the algorithm processes the data from 1 to \( N-1 \) to construct the matrix \( S_n \). To obtain all the data in \( S_n \), with 3 sub-elements in each time interval, the complexity is \( O(3\times (N-1)) \), which equals \( O(N) \).
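As an illustration, the sliding-window transformation can be sketched in a few lines of Python. The function name `s_mtf` and the single-lag setting are ours for illustration; Algorithm 1 is the authoritative version:

```python
import numpy as np

def s_mtf(series, n_lag=1):
    """Convert a multivariate time series (N x k) into a supervised
    learning set: each sample holds the observations at times
    t-n_lag .. t-1, and the label is the observation at time t.
    Rows with missing values are assumed to be dropped beforehand,
    as described in the text."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(n_lag, len(series)):
        X.append(series[t - n_lag:t].ravel())  # sample: previous n_lag rows
        y.append(series[t])                    # label: current row
    return np.array(X), np.array(y)

# First rows of Table 4: [CPU util. %, Memory util. %]
data = [[16.127, 87.139],
        [21.5878, 87.0543],
        [17.3193, 86.9491]]
X, y = s_mtf(data, n_lag=1)
```

Each row of `X` corresponds to the (X1, X2) columns of Table 5 and each row of `y` to the (X3, Y) columns.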

4.2 esDNN Algorithm

In our deep learning network, the input data in the training phase include the resource utilization and the corresponding time series data, e.g., at time 08:00:00 am, the CPU utilization is 20%. In the prediction phase, the input data are the resource utilization of the recent time intervals, e.g., the previous 5 minutes (configurable via parameters of the network model).

To construct our network model, we include one layer of Convolutional Neural Network (CNN). A CNN is usually built on the feedforward neural network model and generally consists of an Input Layer, Convolutional Layer, Pooling Layer, Non-linearity Layer, and Fully Connected Layer [38]. Two-dimensional convolutional neural networks (2D CNNs) are widely used in image recognition, while one-dimensional convolutional neural networks (1D CNNs) are generally used in Natural Language Processing (NLP). 1D CNNs are also capable of processing continuous sequences: when a feature must be extracted from a short segment of the whole dataset and that feature is not highly correlated with the segment's position in the dataset, a 1D CNN is particularly effective. A 1D CNN can extract features from local raw time series data and then model the short-term correlation between local time series and subsequent trends [39]. We therefore use a 1D CNN to analyze our data: we build a one-dimensional convolutional layer and add it to our neural network. We also add padding, which maintains the boundary information of the time series. Without padding, the boundary values are operated on by the convolution kernel only once, while the data in the middle of the sequence are scanned many times, so the results lose accuracy at the boundaries.
To improve accuracy, we apply a causal strategy for padding, which simply pads the layer's input with zeros at the front so that we can also predict the values of early time steps in the frame [40]. Finally, we adopt the Rectified Linear Unit (ReLU) as the activation function of the 1D CNN. After introducing the convolutional layer, the GRU-based layer can be added.
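A minimal numpy sketch of causal padding: zeros are prepended so that each output step depends only on the current and past inputs, and the output keeps the input's length. (In Keras this corresponds to `Conv1D(..., padding='causal')`; the helper below is our own illustration.)

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1D filtering with causal padding: the input is padded with
    zeros at the front only, so output[t] depends solely on
    x[0..t], and the output has the same length as the input."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # zeros in front
    return np.array([np.dot(padded[t:t + k], kernel)
                     for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
out = causal_conv1d(x, np.array([0.5, 0.5]))  # 2-tap moving average
```

Note that `out[0]` uses only `x[0]` (plus a padded zero), so no future information leaks into early time steps.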

GRU is a derived version of RNN. An RNN uses traditional backpropagation and gradient descent algorithms to learn the target data. BackPropagation Through Time (BPTT) is a commonly used method for training RNNs. The idea of BPTT is the same as that of the backpropagation algorithm: it continuously moves along the negative gradient direction of the parameters to be optimized until convergence. However, BPTT multiplies derivatives of the activation function across time steps, which leads to vanishing and exploding gradients. Two methods can be used to avoid these problems. The first is to replace the activation function: in our model, we mitigate the vanishing gradient to a certain extent by setting ReLU as the activation function. But the derivative of ReLU in the range greater than 0 is always 1, which easily causes gradient explosion. Therefore, the second method is to change the recurrent structure. We exploit GRU, which combines the forget gate and input gate into a single "update gate"; it also merges the cell state and hidden state and makes some other changes.

We adopt a GRU-based neural network and make some improvements to optimize its performance in long sequence prediction. The structure of GRU is demonstrated in Figure 4. The reset gate \( r_t \) and update gate \( z_t \) play roles analogous to the gates of LSTM, but there is no output gate in GRU. Compared with LSTM, GRU has one fewer gate and thus fewer parameters, yet it can achieve functions equivalent to LSTM [41].
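For reference, the standard GRU update can be sketched in numpy. Biases are omitted for brevity and the weight shapes are illustrative; this is the textbook formulation, not the exact esDNN implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, params):
    """One GRU step: the update gate z blends the previous hidden
    state with the candidate state; the reset gate r controls how
    much of the previous state feeds the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde          # new hidden state

rng = np.random.default_rng(0)
dim_x, dim_h = 3, 4
params = [rng.standard_normal((dim_h, d)) * 0.1
          for d in (dim_x, dim_h, dim_x, dim_h, dim_x, dim_h)]
h = np.zeros(dim_h)
for _ in range(5):  # run a short input sequence through the cell
    h = gru_cell(rng.standard_normal(dim_x), h, params)
```

Because the new state is a convex combination of the previous state and a tanh-bounded candidate, the hidden state stays bounded, which is one reason GRU resists exploding gradients.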

Fig. 4.

Fig. 4. GRU structure.

The sparse processing provided by ReLU can reduce the effective capacity of the model, i.e., too much feature masking prevents the model from learning effective features. Since the gradient of ReLU is 0 when \( x \lt 0 \), a neuron may never be activated by any data, which is called neuron necrosis. In addition, one similarity between ReLU and Sigmoid is that the output is non-negative. To address this issue, we multiply the input by the Sigmoid function to obtain the activation function Swish, represented as: (3) \( \begin{equation} f(x)=x \cdot \operatorname{sigmoid}(\beta x), \end{equation} \) where \( \beta \) is either a constant or a trainable parameter [42].
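A one-line implementation of Swish follows directly from Equation (3), using the identity \( x \cdot \operatorname{sigmoid}(\beta x) = x / (1 + e^{-\beta x}) \):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation f(x) = x * sigmoid(beta * x). Unlike ReLU,
    it is smooth and non-monotonic, and it passes small negative
    values through instead of zeroing them, avoiding dead neurons."""
    return x / (1.0 + np.exp(-beta * x))
```

For large positive x, Swish approaches the identity (like ReLU); for large negative x it approaches 0; and for moderately negative x it produces small negative outputs, which is the non-monotonic part.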

In the choice of activation function, instead of the ReLU activation commonly used by DNNs, we use Swish, a smooth and non-monotonic function. Its design is inspired by the use of the sigmoid function for gating in LSTM. Swish uses the same value for gating to simplify the gating mechanism, which is called self-gating. The advantage of self-gating is that it requires only a simple scalar input, while traditional gating requires multiple scalar inputs. This property allows Swish to easily replace activation functions that take a single scalar as input without changing the hidden capacity or the number of parameters. The pseudocode of esDNN is shown in Algorithm 2.

Algorithm complexity analysis: The time complexity of esDNN depends on the number of networks (N), the number of network weight connections (C), the number of input nodes (n), the number of hidden nodes (h), where \( h \approx n \), and the dropout value (d). Therefore, the total time complexity for a maximum of b iterations is \( b \times O(n^2\times N \times C \times d) \), which equals \( O(n^2bdNC) \).


5 PERFORMANCE EVALUATION

In this section, we will first introduce the details about the dataset we use and the experimental configurations for workload prediction. Then we compare the performance of esDNN and other RNN-based approaches. Finally, we demonstrate that our approach can be applied to auto-scaling for cloud resource provisioning optimization.

5.1 Datasets and Environment Configuration

We implement multivariate time series forecasting based on TensorFlow 2.2.0 [43] with Python 3.7. We use two real-world datasets in the experiments for the performance evaluation of our proposed approach.

  • Alibaba dataset [8]: cluster-trace-v2018 of Alibaba, which records the traces of about 4,000 machines over a period of 8 days in 2018; we use all the data to make predictions. The data can be found on Github.1

  • Google dataset [44]: derived from Google's cluster data-2011-2, recorded in 2011. This trace includes 29 days of data from 37,747 machines of three different machine types. The data can also be fetched from Github.2

Both of these datasets represent the random features of cloud workloads. We use CPU usage as the key measurement of the accuracy of our prediction model. To show the effectiveness of prediction and remove redundant information, we configure the prediction time interval as 5 minutes. The metadata for prediction differ between the two datasets because of the different types of data collected. For the Alibaba dataset, in addition to the time series and CPU usage data, we also select memory usage, incoming network traffic, outgoing network traffic, and disk I/O usage as source data for prediction. For the Google dataset, in addition to the time series and CPU usage data, we also select canonical memory usage, assigned memory usage, and total page cache memory usage as source data for prediction. When processing the Google dataset, we select 5 minutes as the time interval, group the tasks according to the Machine ID, and finally normalize the dataset.
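As a sketch of the final preprocessing step, per-feature min-max scaling can be applied after grouping; this is our assumption for illustration, as the article does not state the exact normalization scheme:

```python
import numpy as np

def min_max_normalize(x):
    """Scale one feature column into [0, 1]; applied per feature
    after records are grouped by Machine ID and 5-minute interval."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

# CPU utilization samples from Table 4
cpu = np.array([16.127, 21.5878, 17.3193, 16.8287])
norm = min_max_normalize(cpu)
```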

Figures 5 and 6 demonstrate the CPU usage in the datasets of the Alibaba and Google cloud data centers, respectively. We divide them into per-day and per-minute workload fluctuations of machines so that the fluctuations of CPU usage over time can be seen more clearly. Both datasets show high variance and random features. For the Alibaba dataset, we divide the data into the first 40,000 rows (59.5%) and the rest, which are used to train and test the model, respectively. We divide the Google dataset similarly: we select about 72 hours of data and use the first 49 hours (68.4%) as the training set and the rest as the validation set. For both datasets, the number of training epochs is 200, the batch size is 72, the loss function is Huber, the optimizer is Adam, and the metric we use is mean square error.

Fig. 5.

Fig. 5. Alibaba (a) per-day workload (b) per-minute workload fluctuations.

Fig. 6.

Fig. 6. Google (a) per-day workload (b) per-minute workload fluctuations.

5.2 Comparison with Unsupervised Learning-Based Approach

In contrast to the supervised learning approach used by esDNN, unsupervised learning can also be applied to high-dimensional problems such as multivariate time series forecasting. Therefore, in this section, we compare our approach with an unsupervised learning-based approach.

Among unsupervised learning approaches, the Autoencoder is a representative one for efficient feature extraction and representation of high-dimensional data [45]. Currently, the Autoencoder as well as the Stacked Autoencoder, Sparse Autoencoder [46], and Denoising Autoencoder [47] are widely used in the research field. An Autoencoder maps the input sample x to the hidden layer by the encoder (g) and then maps it back to the original space by the decoder (f) to obtain the reconstructed sample. In a neural network-based autoencoder, the encoder compresses the data by reducing the number of neurons layer by layer, while the decoder increases the number of neurons layer by layer from the abstract representation, finally reconstructing the input samples. The objective is to optimize both the encoder and decoder by minimizing the loss function: (4) \( \begin{equation} f, g=\mathop {\operatorname{min}}\limits _{f, g} Loss(x, f(g(x))). \end{equation} \)
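The objective of Equation (4) can be illustrated with a minimal linear autoencoder trained by plain gradient descent. This is a toy sketch on random data, not the architecture used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 8))           # toy input samples x
W_enc = rng.standard_normal((8, 3)) * 0.1   # encoder g: 8 -> 3
W_dec = rng.standard_normal((3, 8)) * 0.1   # decoder f: 3 -> 8

def loss(X, W_enc, W_dec):
    R = X @ W_enc @ W_dec                   # reconstruction f(g(x))
    return np.mean((X - R) ** 2)

lr = 0.01
initial = loss(X, W_enc, W_dec)
for _ in range(300):                        # jointly optimize f and g
    H = X @ W_enc                           # code g(x)
    E = H @ W_dec - X                       # reconstruction error
    grad_dec = H.T @ E * (2 / X.size)       # dLoss/dW_dec
    grad_enc = X.T @ (E @ W_dec.T) * (2 / X.size)  # dLoss/dW_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final = loss(X, W_enc, W_dec)
```

Minimizing the reconstruction loss drives the 3-dimensional code toward the most informative subspace of the 8-dimensional input, which is how the autoencoder achieves dimensionality reduction.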

The prediction results of esDNN and the Autoencoder within 10 minutes are shown in Figure 7, which demonstrates that the Autoencoder predicts well only at the beginning of the observed period and is significantly less accurate than esDNN overall. The reason may be that the Autoencoder does not use the sample's label in prediction; it uses the sample as both the input and the output of the neural network. Although this greatly improves the generality of the model, an autoencoder is prone to overfitting when the parameters of the neural network are complicated. Based on these results, the supervised learning-based approach demonstrates better performance than unsupervised learning. In the following experiments, we evaluate the performance against other neural network-based approaches.

Fig. 7.

Fig. 7. Comparison of the prediction results of esDNN and Autoencoder.

5.3 Comparison with Neural Network-Based Approaches

We first start the evaluation with the Alibaba dataset. To compare with our algorithm, we select several RNN-based deep learning algorithms that have been applied to time series prediction, including RNN, Bi-LSTM [28], and GRU. We compare the prediction accuracy of these four algorithms, measured by Mean Square Error (MSE), which is represented as: (5) \( \begin{equation} MSE=\frac{1}{m} \sum _{i=1}^{m}(y_{actual}-{y}_{predict})^{2}, \end{equation} \) where m represents the number of predicted time points, \( y_{actual} \) is the actual value, and \( y_{predict} \) is the predicted value. The higher the MSE of an algorithm, the greater the gap between the predicted and actual values. In order to capture the changing trend of MSE in each period, we set four different time scales: second, minute, hour, and day.
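Equation (5) translates directly into code:

```python
def mse(actual, predicted):
    """Mean square error over m predicted points (Equation (5))."""
    m = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / m

# Small worked example: errors are 0, 0.5, and -0.5
val = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.5])  # (0 + 0.25 + 0.25) / 3
```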

Figure 8 shows the MSE fluctuation of the four RNN-based methods on the Alibaba dataset with various prediction lengths. In general, all the MSE curves follow the same trend: the MSE first increases until it reaches a peak, after which the curves remain at relatively stable values. For the second-level prediction, apart from RNN, which stays at a relatively high value, there are only subtle differences between Bi-LSTM, GRU, and esDNN. As the prediction length increases, all the curves remain at relatively stable values, and for the day-level prediction there is no significant difference in MSE between Bi-LSTM, GRU, and esDNN. The main reason is that the RNN and Bi-LSTM models are designed to process time-series data and perform well in representing the nonlinear relationship between data and time; however, their gradients can vanish or explode during training, especially for long time-series data. GRU alleviates the vanishing gradient that usually affects Bi-LSTM and RNN. Therefore, the GRU-based approaches maintain relatively more stable values compared with RNN and Bi-LSTM.

Fig. 8.

Fig. 8. Prediction accuracy (MSE) of four different RNN-based methods based on the Alibaba dataset with different prediction lengths.

In order to highlight the differences between them, we use the Cumulative Distribution Function (CDF), the integral of the probability density function. For discrete variables, it represents the total probability of all values less than or equal to x, which is formulated as: (6) \( \begin{equation} F_{X}(x)=\mathrm{P}(X \le x). \end{equation} \)
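For a discrete sample of MSE values, Equation (6) reduces to counting the fraction of observations at or below x:

```python
def ecdf(samples, x):
    """Empirical CDF F(x) = P(X <= x): the fraction of observed
    values that are less than or equal to x (Equation (6))."""
    return sum(1 for s in samples if s <= x) / len(samples)

# Hypothetical MSE observations in the concentrated 0.006-0.008 range
mses = [0.006, 0.007, 0.007, 0.008, 0.009]
```

A method whose ECDF rises earlier, as esDNN's does in Figure 9, has its MSE mass concentrated at smaller values.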

Figure 9 shows the CDF of the MSE for the four RNN-based methods; the almost overlapping curves of Figure 8 can be distinguished more clearly by their CDF values. Since RNN differs considerably from the other three methods and performs worst, we focus the discussion on the other three algorithms. We can clearly see that when the MSE is between 0.006 and 0.008, the CDF rises very quickly, which shows that the MSE values of these three methods are concentrated in this range. Meanwhile, the CDF of esDNN rises significantly faster than those of Bi-LSTM and GRU: for any MSE in this range, the CDF value of esDNN always remains the highest. Although the curves are very close, the differences between them can still be identified, which means that the overall MSE of esDNN is smaller than those of Bi-LSTM and GRU.

Fig. 9.

Fig. 9. CDF comparison of MSE curve.

For Google's dataset, we also use the RNN-based methods as baselines for esDNN. Since we utilize less data from Google's dataset than from Alibaba's, we show the prediction lengths at the minute and hour levels. Figure 10 shows the MSE fluctuation of the four methods on the Google dataset. For the minute-level prediction, all the methods differ little from each other except RNN. For the hour-level prediction, the trends stabilize after a short period of growth, which is consistent with the results on the Alibaba dataset. We can also notice that RNN always remains at a high level, and that Bi-LSTM has a higher MSE than GRU and esDNN. The MSE values of esDNN and GRU are quite close, but esDNN achieves a more stable trend, while GRU fluctuates more dynamically. To conclude, the prediction result of esDNN is better than that of GRU, since its MSE is smaller.

Fig. 10.

Fig. 10. Prediction accuracy (MSE) of four different RNN methods based on the Google dataset.

We notice that esDNN is very close to GRU over a long time sequence, while RNN performs much worse than the other three approaches; thus, we compare the MSE values without RNN. We analyze the MSE values from the Alibaba and Google datasets separately, as shown in Figure 11, where GRU is set as the baseline and the MSE values of esDNN and Bi-LSTM are divided by those of GRU. A value less than 1.0 represents better performance than GRU, and vice versa. Although GRU predicts better than esDNN over short periods of time, esDNN maintains better accuracy and stability in the long run.

Fig. 11.

Fig. 11. The ratio of MSE compared with GRU for Alibaba and Google Traces.

Next, we evaluate the difference between the predicted and actual values of esDNN. Figures 12 and 13 show the CPU usage curves, so that the difference between the predicted and actual values can be seen. For the Alibaba dataset, we can observe that for the minute-level prediction, although there are some large differences between consecutive values, esDNN still gives relatively accurate prediction results. For the hour-level prediction, the predicted value is on the whole very close to the actual value, as their curves almost overlap, with only a small part of the predicted curve differing from the actual values. For the hour-level prediction based on the Google cluster data, esDNN can also accurately predict the trend of CPU usage.

Fig. 12.

Fig. 12. Prediction performance of the esDNN compared with actual Alibaba data.

Fig. 13.

Fig. 13. Prediction performance of the esDNN compared with actual Google data.

In order to identify the differences between them more intuitively, we summarize the performance of these algorithms in Tables 6 and 7. Apart from MSE, we also compare Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE), which have been widely used to evaluate prediction performance. The esDNN approach achieves the lowest MSE values compared with the other baselines at longer prediction lengths, which are more difficult to predict. For the Google dataset, our approach also achieves the lowest RMSE at longer prediction lengths. The reason is that our proposed prediction approach, based on a revised GRU and CNN, not only captures the periodic features inherent in the data but also significantly reduces the impact of resource variations on the prediction results.
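RMSE and MAPE follow their standard definitions, which the article does not restate; a minimal sketch:

```python
import math

def rmse(actual, predicted):
    """Root mean square error: the square root of the MSE."""
    m = len(actual)
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / m)

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction; pairs with a
    zero actual value are excluded to avoid division by zero."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return sum(abs((a - p) / a) for a, p in pairs) / len(pairs)
```

RMSE keeps the units of the original signal, while MAPE is scale-free, which is why the tables report all three metrics side by side.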

Table 6.
Prediction length | RNN (MSE / RMSE / MAPE) | GRU (MSE / RMSE / MAPE) | Bi-LSTM (MSE / RMSE / MAPE) | esDNN (MSE / RMSE / MAPE)
10s    | 1.25E-04 / 0.0112 / 0.2175 | 4.27E-06 / 0.0021 / 0.0401 | 9.33E-06 / 0.0031 / 0.1596 | 1.59E-07 / 0.0004 / 0.0077
30s    | 7.50E-05 / 0.0087 / 0.1396 | 1.75E-05 / 0.0042 / 0.0541 | 5.77E-05 / 0.0076 / 0.1086 | 3.13E-05 / 0.0056 / 0.0681
1 min  | 9.02E-05 / 0.0095 / 0.1568 | 9.02E-06 / 0.0030 / 0.0325 | 3.80E-05 / 0.0062 / 0.0891 | 1.71E-05 / 0.0041 / 0.0480
30 min | 2.93E-04 / 0.0170 / 0.2024 | 1.71E-04 / 0.0125 / 0.0772 | 2.00E-04 / 0.0142 / 0.0978 | 1.84E-04 / 0.0128 / 0.0766
1h     | 4.61E-04 / 0.0215 / 0.2343 | 3.13E-04 / 0.0177 / 0.0936 | 3.31E-04 / 0.0182 / 0.1053 | 3.20E-04 / 0.0179 / 0.0891
6h     | 5.07E-04 / 0.0225 / 0.2237 | 3.88E-04 / 0.0197 / 0.1050 | 4.03E-04 / 0.0200 / 0.0891 | 3.96E-04 / 0.0199 / 0.0990
1 day  | 8.91E-04 / 0.0293 / 0.2387 | 7.34E-04 / 0.0271 / 0.1260 | 7.32E-04 / 0.0271 / 0.0898 | 7.25E-04 / 0.0269 / 0.1080
2 days | 8.68E-04 / 0.0295 / 0.2058 | 6.74E-04 / 0.0260 / 0.1050 | 6.66E-04 / 0.0258 / 0.0872 | 6.62E-04 / 0.0257 / 0.0901
3 days | 8.83E-04 / 0.0297 / 0.1973 | 6.54E-04 / 0.0256 / 0.0978 | 6.50E-04 / 0.0255 / 0.0857 | 6.43E-04 / 0.0254 / 0.0842

Table 6. MSE, RMSE, and MAPE Comparison with Alibaba Dataset

Table 7.
Prediction length | RNN (MSE / RMSE / MAPE) | GRU (MSE / RMSE / MAPE) | Bi-LSTM (MSE / RMSE / MAPE) | esDNN (MSE / RMSE / MAPE)
30 min | 0.00153232 / 0.0391 / 0.1651 | 6.53E-05 / 0.0081 / 0.0228 | 7.22E-05 / 0.0085 / 0.0209 | 0.00031276 / 0.0177 / 0.0439
1h     | 0.0011334 / 0.0337 / 0.1201 | 0.00030235 / 0.0174 / 0.0506 | 0.00032074 / 0.0179 / 0.0567 | 0.0003576 / 0.0189 / 0.0507
2h     | 0.00113796 / 0.0337 / 0.1173 | 0.00056542 / 0.0238 / 0.0737 | 0.00082948 / 0.0288 / 0.0847 | 0.00052629 / 0.0229 / 0.0691
4h     | 0.00113856 / 0.0337 / 0.1246 | 0.00071719 / 0.0268 / 0.0942 | 0.0011012 / 0.0332 / 0.1143 | 0.00079793 / 0.0282 / 0.0884
6h     | 0.00113166 / 0.0336 / 0.1087 | 0.00079771 / 0.0282 / 0.0886 | 0.00094857 / 0.0308 / 0.1010 | 0.0007228 / 0.0269 / 0.0787
8h     | 0.00156561 / 0.0396 / 0.1350 | 0.00074263 / 0.0273 / 0.0880 | 0.00086032 / 0.0298 / 0.0960 | 0.00073787 / 0.0272 / 0.0833
12h    | 0.00164631 / 0.0406 / 0.1368 | 0.00065455 / 0.0256 / 0.0819 | 0.00095340 / 0.0309 / 0.0954 | 0.00070197 / 0.0265 / 0.0814
15h    | 0.00170941 / 0.0413 / 0.1368 | 0.00086864 / 0.0295 / 0.0876 | 0.0010403 / 0.0325 / 0.1017 | 0.00073697 / 0.0271 / 0.0822

Table 7. MSE, RMSE, and MAPE Comparison with Google Dataset

To compare the training and prediction costs of the different approaches, we compare the training time and prediction time in Table 8. The training time is the average time consumed for training one epoch, and the prediction time is the mean time for predicting 1,000 lines of data, repeated 10 times. Based on the results, we observe that esDNN consumes the longest training time, about 10% more than Bi-LSTM and GRU, which is an acceptable cost considering the improvement in prediction accuracy. The longer training time results from the more complicated model of esDNN with the addition of the CNN. For the prediction time, esDNN performs slightly better than Bi-LSTM and GRU; the reason may be that the Swish activation function we use slightly reduces the prediction time [48].

Table 8.
                    | RNN   | Bi-LSTM | GRU   | esDNN
Training time (s)   | 5.11  | 6.47    | 6.53  | 7.14
Prediction time (s) | 0.048 | 0.070   | 0.071 | 0.065

Table 8. Training Time and Prediction Time Comparison

To summarize, esDNN achieves good accuracy based on the MSE results. Compared with the MSE results evaluated on the same datasets derived from Google and Alibaba in [7], esDNN reduces the MSE by one order of magnitude, from around \( 7 \times 10^{-2} \) to \( 7\times 10^{-3} \).

5.4 Applying esDNN for Machines Auto-Scaling with Simulations

The auto-scaling technique can dynamically adjust the number of active machines in the system based on the system status, e.g., removing machines when the system is at a low utilization level or adding more machines when the system is overutilized. By taking advantage of auto-scaling, system performance metrics, e.g., energy consumption, can be optimized. However, without sufficiently accurate workload prediction, popular threshold-based auto-scaling approaches, such as static thresholds, are undesirable for workloads with high variance.

To further demonstrate the capability of the proposed approach, we integrate esDNN into the auto-scaling scenario for physical machines in Alibaba and Google cloud data centers by simulating the number of machines and resource usage. The specifications of machines are derived from the corresponding original datasets, and the scheduling period is configured as 5 minutes.

Our objective is to improve resource utilization and reduce the number of active machines given sufficiently accurate predictions. Therefore, CPU utilization is used as the input to the auto-scaling method, and the output is the number of active machines. As an auto-scaling baseline, we use the average number of active machines over the previous time slots [49], which can be calculated as: (7) \( \begin{equation} M(t)=\frac{\sum _{i=1}^{m} M(t-i)}{m}, \end{equation} \) where \( M(t) \) represents the number of active machines at time interval t, and m is the number of previous time slots used for the prediction; we set m to 5 in our experiments. The number of active machines calculated by Equation (7) is normalized to 1.0. We normalize the number of active machines of the esDNN-based auto-scaling approach, calculate the ratio between the esDNN-based approach and the baseline, and configure the upper CPU utilization threshold as 80%. If the ratio of machines is less than 1.0, the esDNN-based approach reduces the number of active machines. Figure 14(a) shows the prediction of the number of active machines based on the Alibaba dataset. As expected, the ratio fluctuates in the range of 0.3 to 0.6, indicating that our prediction algorithm achieves a good effect: only 30% to 60% of the original number of machines from the dataset need to be active, which significantly reduces the number of active machines.
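Equation (7) and the reported ratio can be sketched as follows; the machine counts below are hypothetical, chosen only to illustrate the computation:

```python
def baseline_machines(history, m=5):
    """Moving-average baseline of Equation (7): the number of active
    machines at time t is predicted as the mean of the previous m
    time slots."""
    window = history[-m:]
    return sum(window) / len(window)

# Hypothetical active-machine counts for the previous five slots
history = [100, 120, 110, 130, 140]
pred = baseline_machines(history)   # (100+120+110+130+140) / 5
esdnn_machines = 90                 # hypothetical esDNN-based count
ratio = esdnn_machines / pred       # < 1.0 means fewer active machines
```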

Fig. 14.

Fig. 14. Optimization ratio comparison in minutes with different traces.

As for the Google dataset, we analyze the capacity distribution of about 37,678 machines: about 92.8% of the machines have 0.5 capacity, about 1.4% have 0.25 capacity, and about 5.9% have 1.0 capacity. This differs from the homogeneous machine distribution in the Alibaba dataset. After normalizing the CPU usage in the Google dataset, we apply the same algorithm previously used for the Alibaba dataset, and Figure 14(b) shows that the ratio ranges from 0.25 to 1. For example, at the 400th minute, the baseline needs 18,371 machines, while our approach uses only 5,161 machines. These results are close to those based on the Alibaba dataset. It can be concluded that the proposed approach can efficiently improve resource usage by reducing the number of active machines, and it is promising for reducing the energy consumption of cloud data centers by providing an accurate prediction method.


6 CONCLUSIONS AND FUTURE WORK

Our deep learning-based approach for cloud workload prediction brings opportunities to optimize resource provisioning in cloud computing environments. In this article, we apply a sliding window for multivariate time series to convert high-dimensional data into supervised learning time series, addressing the high-dimensionality challenge. Based on the converted data, we propose a revised GRU-based approach to train the prediction model and achieve high prediction accuracy for high-variance cloud workloads. Comprehensive experiments based on realistic traces derived from Google and Alibaba demonstrate that our proposed approach achieves better accuracy than state-of-the-art approaches. To further show the effectiveness of optimizing resource provisioning, we applied our approach to auto-scaling based on realistic traces, and the results illustrate that our approach can significantly optimize the resource usage of cloud data centers, thus saving operational costs.

In the future, our approach can be integrated into a container-based prototype system, e.g., Kubernetes, to optimize resource provisioning. We would like to extend the proposed approach to Edge Computing to reduce response time using offloading techniques, and to consider location-aware and mobility-aware scenarios (e.g., predicting the loads allocated to different devices generated by mobile users). We would also like to automate esDNN by using the Monitor, Analyze, Plan, and Execute (MAPE) model.

Footnotes

REFERENCES

  [1] Minxian Xu and Rajkumar Buyya. 2019. Brownout approach for adaptive management of resources and applications in cloud computing systems: A taxonomy and future directions. ACM Comput. Surveys 52, 1, Article 8 (2019), 27 pages.
  [2] Rajkumar Buyya, Chee Shin Yeo, and Srikumar Venugopal. 2008. Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing utilities. In Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications. 5–13.
  [3] Mohammad Hossein Ghahramani, MengChu Zhou, and Chi Tin Hon. 2017. Toward cloud computing QoS architecture: Analysis of cloud systems and cloud services. IEEE/CAA Journal of Automatica Sinica 4, 1 (2017), 6–18.
  [4] M. Du, Y. Wang, K. Ye, and C. Xu. 2020. Algorithmics of cost-driven computation offloading in the edge-cloud environment. IEEE Trans. Comput. 69, 10 (2020), 1519–1532.
  [5] Yujun Chen, Xian Yang, Qingwei Lin, Hongyu Zhang, Feng Gao, Zhangwei Xu, Yingnong Dang, Dongmei Zhang, Hang Dong, Yong Xu, Hao Li, and Yu Kang. 2019. Outage prediction and diagnosis for cloud service systems. In Proceedings of the World Wide Web Conference (WWW'19). ACM, New York, NY, 2659–2665.
  [6] S. Wang, X. Li, and R. Ruiz. 2020. Performance analysis for heterogeneous cloud servers using queueing theory. IEEE Trans. Comput. 69, 4 (2020), 563–576.
  [7] Z. Chen, J. Hu, G. Min, A. Y. Zomaya, and T. El-Ghazawi. 2020. Towards accurate prediction for high-dimensional and highly-variable cloud workloads with deep learning. IEEE Transactions on Parallel and Distributed Systems 31, 4 (2020), 923–934.
  [8] W. Chen, K. Ye, Y. Wang, G. Xu, and C. Xu. 2018. How does the workload look like in production cloud? Analysis and clustering of workloads on Alibaba cluster trace. In Proceedings of the 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS). 102–109.
  [9] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur. 2011. Extensions of recurrent neural network language model. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5528–5531.
  [10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
  [11] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the NIPS 2014 Workshop on Deep Learning.
  [12] Hervé Abdi and Lynne J. Williams. 2010. Principal component analysis. WIREs Computational Statistics 2, 4 (2010), 433–459.
  [13] Yasi Wang, Hongxun Yao, and Sicheng Zhao. 2016. Auto-encoder based dimensionality reduction. Neurocomputing 184 (2016), 232–242.
  [14] Jason Brownlee. 2016. Supervised and unsupervised machine learning algorithms. Machine Learning Mastery 16, 03 (2016).
  [15] R. N. Calheiros, E. Masoumi, R. Ranjan, and R. Buyya. 2015. Workload prediction using ARIMA model and its impact on cloud applications' QoS. IEEE Transactions on Cloud Computing 3, 4 (2015), 449–458.
  [16] Jingqi Yang, Chuanchang Liu, Yanlei Shang, Bo Cheng, Zexiang Mao, Chunhong Liu, Lisha Niu, and Junliang Chen. 2014. A cost-aware auto-scaling approach using the workload prediction in service clouds. Information Systems Frontiers 16, 1 (2014), 7–18.
  [17] Katja Cetinski and Matjaz B. Juric. 2015. AME-WPC: Advanced model for efficient workload prediction in the cloud. Journal of Network and Computer Applications 55 (2015), 191–201.
  [18] Parminder Singh, Pooja Gupta, and Kiran Jyoti. 2019. TASM: Technocrat ARIMA and SVR model for workload prediction of web applications in cloud. Cluster Computing 22, 2 (2019), 619–633.
  [19] Chunhong Liu, Chuanchang Liu, Yanlei Shang, Shiping Chen, Bo Cheng, and Junliang Chen. 2017. An adaptive prediction approach based on workload pattern discrimination in the cloud. Journal of Network and Computer Applications 80 (2017), 35–44.
  [20] J. Bi, H. Yuan, and M. Zhou. 2019. Temporal prediction of multiapplication consolidated workloads in distributed clouds. IEEE Transactions on Automation Science and Engineering 16, 4 (2019), 1763–1773.
  [21] Jitendra Kumar and Ashutosh Kumar Singh. 2018. Workload prediction in cloud using artificial neural network and adaptive differential evolution. Future Generation Computer Systems 81 (2018), 41–52.
  [22] Q. Zhang, L. T. Yang, Z. Yan, Z. Chen, and P. Li. 2018. An efficient deep learning model to predict cloud workload for industry informatics. IEEE Transactions on Industrial Informatics 14, 7 (2018), 3170–3178.
  [23] Jitendra Kumar, Rimsha Goomer, and Ashutosh Kumar Singh. 2018. Long short term memory recurrent neural network (LSTM-RNN) based workload forecasting model for cloud datacenters. Procedia Computer Science 125 (2018), 676–682.
  24. [24] Qiu F., Zhang B., and Guo J.. 2016. A deep learning approach for VM workload prediction in the cloud. In 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). 319324. Google ScholarGoogle ScholarCross RefCross Ref
  25. [25] Zhu Yonghua, Zhang Weilin, Chen Yihai, and Gao Honghao. 2019. A novel approach to workload prediction using attention-based LSTM encoder-decoder network in cloud environment. EURASIP Journal on Wireless Communications and Networking 2019, 1 (2019), 274.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. [26] Amiri Maryam, Mohammad-Khanli Leyli, and Mirandola Raffaela. 2018. An online learning model based on episode mining for workload prediction in cloud. Future Generation Computer Systems 87 (2018), 83101.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Cortez Eli, Bonde Anand, Muzio Alexandre, Russinovich Mark, Fontoura Marcus, and Bianchini Ricardo. 2017. Resource central: Understanding and predicting workloads for improved resource management in large cloud platforms. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP’17). ACM, New York, 153167.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. [28] Bi Jing, Li Shuang, Yuan Haitao, and Zhou MengChu. 2021. Integrated deep learning method for workload and resource prediction in cloud systems. Neurocomputing 424 (2021), 3548.Google ScholarGoogle ScholarCross RefCross Ref
  29. [29] Karim Md Ebtidaul, Maswood Mirza Mohd Shahriar, Das Sunanda, and Alharbi Abdullah G.. 2021. BHyPreC: A novel Bi-LSTM based hybrid recurrent neural network model to predict the CPU workload of cloud virtual machine. IEEE Access 9 (2021), 131476131495.Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Chen Chuang, Lu Ningyun, Jiang Bin, and Wang Cunsong. 2021. A risk-averse remaining useful life estimation for predictive maintenance. IEEE/CAA Journal of Automatica Sinica 8, 2 (2021), 412422.Google ScholarGoogle ScholarCross RefCross Ref
  31. [31] Singh Ashutosh Kumar, Saxena Deepika, Kumar Jitendra, and Gupta Vrinda. 2021. A quantum approach towards the adaptive prediction of cloud workloads. IEEE Transactions on Parallel and Distributed Systems (2021).Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Kim In Kee, Wang Wei, Qi Yanjun, and Humphrey Marty. 2020. Forecasting cloud application workloads with cloudinsight for predictive resource management. IEEE Transactions on Cloud Computing (2020).Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Sun Wei, Li Pengyu, Liu Zhi, Xue Xue, Li Qiyue, Zhang Haiyan, and Wang Junbo. 2021. LSTM based link quality confidence interval boundary prediction for wireless communication in smart grid. Computing 103 (2021), 251269.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. [34] Li Xuesong, Liu Yating, Wang Kunfeng, and Wang Fei-Yue. 2020. A recurrent attention and interaction model for pedestrian trajectory prediction. IEEE/CAA Journal of Automatica Sinica 7, 5 (2020), 13611370.Google ScholarGoogle Scholar
  35. [35] Barra Silvio, Carta Salvatore Mario, Corriga Andrea, Podda Alessandro Sebastian, and Recupero Diego Reforgiato. 2020. Deep learning and time series-to-image encoding for financial forecasting. IEEE/CAA Journal of Automatica Sinica 7, 3 (2020), 683692.Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Qiao Junfei, Li Fei, Yang Cuili, Li Wenjing, and Gu Ke. 2019. A self-organizing RBF neural network based on distance concentration immune algorithm. IEEE/CAA Journal of Automatica Sinica 7, 1 (2019), 276291.Google ScholarGoogle ScholarCross RefCross Ref
  37. [37] Dietterich Thomas G.. 2002. Machine learning for sequential data: A review. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR). Springer, 1530.Google ScholarGoogle Scholar
  38. [38] Krizhevsky Alex, Sutskever Ilya, and Hinton Geoffrey E.. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 10971105.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] Xu Dongkuan, Cheng Wei, Zong Bo, Song Dongjin, Ni Jingchao, Yu Wenchao, Liu Yanchi, Chen Haifeng, and Zhang Xiang. 2020. Tensorized LSTM with adaptive shared memory for learning trends in multivariate time series. In AAAI. 13951402.Google ScholarGoogle Scholar
  40. [40] Buitinck Lars, Louppe Gilles, Blondel Mathieu, Pedregosa Fabian, Mueller Andreas, Grisel Olivier, Niculae Vlad, Prettenhofer Peter, Gramfort Alexandre, Grobler Jaques, Layton Robert, VanderPlas Jake, Joly Arnaud, Holt Brian, and Varoquaux Gaël. 2013. API design for machine learning software: Experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning. 108122.Google ScholarGoogle Scholar
  41. [41] Fu Rui, Zhang Zuo, and Li Li. 2016. Using LSTM and GRU neural network methods for traffic flow prediction. In 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC). IEEE, 324328.Google ScholarGoogle ScholarCross RefCross Ref
  42. [42] Ramachandran Prajit, Zoph Barret, and Le Quoc V.. 2017. Searching for activation functions. arXiv preprint arXiv:1710.05941 (2017).Google ScholarGoogle Scholar
  43. [43] Abadi Martín, Barham Paul, Chen Jianmin, Chen Zhifeng, Davis Andy, Dean Jeffrey, Devin Matthieu, Ghemawat Sanjay, Irving Geoffrey, Isard Michael, Kudlur Manjunath, Levenberg Josh, Monga Rajat, Moore Sherry, Murray Derek G., Steiner Benoit, Tucker Paul, Vasudevan Vijay, Warden Pete, Wicke Martin, Yu Yuan, and Zheng Xiaoqiang. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA, 265283.Google ScholarGoogle Scholar
  44. [44] Wilkes John. 2011. More Google cluster data. Google research blog. (Nov. 2011). Posted at http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html.Google ScholarGoogle Scholar
  45. [45] Bourlard Hervé and Kamp Yves. 1988. Auto-association by multilayer perceptrons and singular value decomposition. Biological cybernetics 59, 4 (1988), 291294.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. [46] Ng Andrew et al. 2011. Sparse autoencoder. CS294A Lecture notes 72, 2011 (2011), 119.Google ScholarGoogle Scholar
  47. [47] Lu Xugang, Tsao Yu, Matsuda Shigeki, and Hori Chiori. 2013. Speech enhancement based on deep denoising autoencoder. In Interspeech, Vol. 2013. 436440.Google ScholarGoogle Scholar
  48. [48] Ramachandran Prajit, Zoph Barret, and Le Quoc V.. 2017. Searching for activation functions. CoRR abs/1710.05941 (2017). http://arxiv.org/abs/1710.05941.Google ScholarGoogle Scholar
  49. [49] Aslanpour Mohammad Sadegh, Ghobaei-Arani Mostafa, and Toosi Adel Nadjaran. 2017. Auto-scaling web applications in clouds: A cost-aware approach. Journal of Network and Computer Applications 95 (2017), 2641.Google ScholarGoogle ScholarDigital LibraryDigital Library


Published in

ACM Transactions on Internet Technology, Volume 22, Issue 3
August 2022, 631 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3498359
Editor: Ling Liu

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 17 August 2022
• Online AM: 14 March 2022
• Accepted: 1 March 2022
• Revised: 1 November 2021
• Received: 1 April 2021

                  Qualifiers

                  • research-article
                  • Refereed
