Smart Quality Monitoring for Evolving Complex Systems

Evolving complex systems, such as complex software systems, dynamic cloud systems and smart ecosystems, arise from the interactions of systems, agents and people, and evolve and adjust dynamically over time. Evolving complex systems may fail due to interactions among components, agents and people that emerge during the evolution of the system. It is essential to adequately monitor evolving software systems to reveal anomalous conditions that emerge during evolution and may lead to catastrophic failures. Current monitoring approaches deal with neither the dynamic characteristics of evolving complex systems nor people as active elements of the system. In this PhD thesis, we define smart monitoring, an approach to monitor evolving complex systems and predict failures. We propose an incrementally trained neural model to capture the evolving characteristics of complex systems and detect anomalies that can later lead to failures. We exploit the state-of-the-art OCEAN model to monitor the impact of people's personality traits and detect behaviors that may lead to system failures.


INTRODUCTION
Evolving complex systems (ECS) result from the interactions of systems, agents and people. They evolve over time in response to external stimuli [15], and exhibit behaviors that (i) emerge from collective interactions among systems, agents and people, and (ii) are not observable at the individual level. Notable examples of ECS are dynamic cloud systems and smart ecosystems. Dynamic cloud systems dynamically assign resources based on usage and demand. They are widely used for scalability, on-demand allocation, resource pooling, and variable topology, and behave unpredictably. Smart ecosystems (SES), also called systems of systems [20], arise from implicit interactions among heterogeneous software systems, cyber-physical systems, and people. Notable examples of SES are smart buildings, smart grids, and smart cities. Smart ecosystems are heterogeneous and operationally autonomous. Implicit interactions among systems and people with possibly contradicting requirements may cause unexpected behaviors, and lead to SES failures despite the correct behavior of the individual components [20].
It is extremely important to monitor ECS to reveal anomalies that may lead to failures. The current approaches to monitor software systems analyze the behavior of the monitored systems, but rarely adapt to the dynamic nature of ECS, and leave users out of the loop. The current approaches to test software systems, notably approaches for field testing [3], sample the execution space to reveal failures and remove faults, but cannot cope with failures that emerge from the intrinsic interactions of heterogeneous and autonomous components of an ECS.
In this PhD thesis, we define a smart monitoring approach to monitor ECS and detect anomalies that can lead to failures if not properly handled. We address the intrinsic contradictions of the heterogeneous and autonomous components in a smart ecosystem by introducing a new concept of system health that frames the quality of the ECS in terms of hardiness, consistency and resilience. This overcomes the limitations of correctness, a relation between the system and its requirements that vanishes in the presence of intrinsically contradicting behaviors. We define a smart monitor that relies on a neural model that adapts to dynamic configurations to monitor dynamic systems, and on incremental training to cope with evolving ECS. The monitor predicts failures of the ECS from anomalies in the reconstruction error. We extend the monitor with the state-of-the-art OCEAN model to capture the relation between personality traits and human actions, and to monitor the human actions within a smart ecosystem.
The main contributions of this PhD are (i) a new concept of system health that captures the quality of ECS, (ii) a smart monitoring approach that exploits neural networks to cope with the dynamic evolution and the complexity of ECS, and (iii) a smart ecosystem monitor that exploits the OCEAN model to monitor the human activities in ECS.

STATE OF THE ART
Approaches to reveal failures in software systems (i) monitor the system behavior to detect anomalies that occur at runtime and may lead to failure, (ii) test the system to sample the behavior of the system and identify failures, or (iii) analyze the system to reveal faulty statements. Classic testing approaches sample the system execution on test beds [25], and only field testing approaches sample the program execution in production [3]. Static analysis approaches reveal anomalies in the code before execution, while dynamic analysis approaches capture the system evolution with dynamically generated models. Neither classic testing nor analysis approaches cope with either the evolution of ECS or the role of people in SES. In the following, we overview approaches to monitor systems at runtime and to predict and prevent failures, which are the approaches most closely related to the work presented in this paper.

Monitoring ECS
The main challenges in monitoring complex systems arise from the substantial amount of information generated, posing difficulties in gathering, handling, and analyzing costly data [21]. The early proposals for monitoring dynamic information in complex systems for supporting decisions relied on historical and relational databases [17]. A second tier of approaches exploited trend extraction [5], model-based monitoring [11] and Petri nets [19]. Classic approaches were defined for statically configured systems and do not adapt to the dynamic characteristics of cloud systems. Early approaches for monitoring cloud systems, such as the top-down layer approach [28] and SLA-oriented monitoring [9], also assume static allocation of resources and do not adapt to dynamic allocation and scalability [33].
The main challenges of monitoring evolving complex systems are the inherent variability of system topology [26], the completeness of the monitored information [23], and the diverse nature of components in smart ecosystems [29]. Approaches to address high data collection costs include adaptive monitoring [21], goal-oriented monitoring [8], and approaches to mitigate costly data storage [1]. Techniques like Gaussian Mixture Models [12] and information-theoretic approaches [14] tackle representativeness and scalability challenges. Fault-tolerant monitoring systems [26] address topology changes.
Global standards for monitoring architectures for cloud computing [4], along with techniques for complete data monitoring [23], support interoperability of different approaches. Recently, some automated platforms [29] and runtime monitoring techniques to detect malicious behaviors [6] address the heterogeneity in smart ecosystems.
The many attempts to address dynamic cloud systems still cannot fully cope with the scalability of contemporary cloud architectures, like Kubernetes, and largely ignore the role of people in SES.

Predicting and preventing failures in ECS with machine learning
Many recent approaches target ECS with machine learning. Toka et al. [30] propose an open-source, cloud-native system for predicting failures in cloud-based systems. The approach incorporates time-series clustering for efficient data mining. Roumani et al. [27] propose a hybrid model that combines neural networks, time series, and random forests for predicting failures. Gao et al. [10] introduce a failure prediction algorithm based on multi-layer Bidirectional Long Short Term Memory (Bi-LSTM) for identifying cloud task and job failures. Tengku et al. [22] emphasize the effectiveness of extreme Gradient Boosting combined with decision trees and random forests in predicting cloud failures, with task priority being a crucial feature.
Bandari [2] proposes proactive fault tolerance through cloud failure prediction using classic machine learning algorithms. Zhang et al. [34] integrate design-time with run-time analysis to predict system failures, and Islam et al. [13] use a Recurrent Neural Network (RNN) for proactive maintenance in cloud systems.
Twala [32] compares predictive methods and identifies Naive Bayes classifiers as robust and decision tree classifiers as accurate for predicting software failures. Lucas et al. [16] propose a transform-based feature-extraction method with evolving neural networks for detecting faults in time-varying distributed generation systems.
The machine learning approaches proposed so far for monitoring ECS lack generality, and do not fully address the challenges of monitoring cost, data representativeness, completeness, scalability, emerging system topology, and standardization [24].

RESEARCH OBJECTIVES
In this PhD research, we address the problem of predicting failures in ECS. We address two main objectives:
(1) Smart monitoring ECS: We will define an approach to evaluate the quality of an ECS. This involves the generalization of quality to the intrinsic contradictions that characterize SES, and the definition of smart monitors that suitably monitor both scalable ECS and SES to identify anomalies that emerge in production, for a set of Key Performance Indicators (KPIs), metrics that we collect from the SES in production without interfering with the ECS behavior.
(2) Predicting failures in ECS: We will define an approach to proactively predict ECS failures. This includes mechanisms to both reveal anomalous KPIs and relate anomalous KPIs to incoming failures.
The overall goal of the PhD is to define and assess a general approach for monitoring and predicting failures in ECS, able to cope with both the scalability of contemporary cloud architectures and the role of people in SES.

RESEARCH QUESTIONS
To define a smart monitor to predict failures in ECS, we need to address the following research questions:
(1) Define the quality of ECS: How can we define quality in the context of evolving and sometimes contradicting behaviors of ECS?
(2) Assess the quality of ECS: How can we assess the quality of ECS, by monitoring the ECS in production?
(3) Predict failures: How can we predict failures in ECS?
(4) Prevent failures: How can we prevent failures in ECS?

METHODOLOGY AND PRELIMINARY RESULTS
In our study, we have defined healthiness, a new concept of quality that addresses the evolving and unpredictable behavior, on one side, and the sometimes contradicting requirements of ECS, and in particular SES, on the other side (research question 1), and we validated healthiness on a case study, a ride-sharing SES. We have defined smart quality monitoring, a new approach to monitor ECS and predict failures (research questions 2, 3). The smart quality monitors address the autoscaling mechanism that Kubernetes uses to dynamically scale microservices and assign resources based on usage and demand. We validated smart quality monitoring on a case study, a microservice-based commercial application. We are currently extending smart quality monitoring to also include human and social behavior to monitor and predict failures in SES, and we will validate the approach on the ride-sharing SES case study. Our experiments detail the information that the smart quality monitors produce about the predicted failures, and that we will use to study mechanisms to prevent failures (research question 4).

Healthiness
We define healthiness to capture the evolving and unpredictable behavior and the sometimes contradicting requirements of ECS. We define the health of an ECS along three non-orthogonal dimensions: hardiness, consistency, and resilience. We borrow the terminology from existential and humanistic psychology to define three aspects of the quality of ECS that evolve over time, in analogy with human beings.
In a nutshell, hardiness captures the strength of ECS in terms of quantity and quality of provided services, analogous to the existential metaphor of human strength. Consistency captures the ability of ECS to maintain stable behavior, even in the presence of sudden and unpredictable events, akin to the metaphor of human personality. Resilience captures the ability of ECS to respond to inevitable crises and return to acceptable behavior.
We introduce a new concept of healthiness to capture the quality of ECS.We define healthiness along three dimensions: Hardiness reflects the strength of the services of ECS, Consistency embodies its stability amid unforeseen events, and Resilience characterizes its ability to recover from crises.
We measure ECS health in terms of quantifiable indicators that encapsulate the unique aspects of these dimensions. We classify indicators as global ECS indicators and specific system indicators. Global indicators evaluate the overall quality of the ECS, disregarding conflicting requirements from individual systems. Specific indicators relate to subsets of systems, and are shaped by the requirements of individual systems. The interpretation of these indicators may vary due to conflicting demands from individual systems, and variations can occur among ECS of different types or within ECS of the same type.
We assess the healthiness of ECS by referring to quantifiable indicators that capture the distinctive aspects of the three dimensions of ECS healthiness.
The healthiness of an ECS depends on the implicit interactions among individual systems that evolve over time with autonomous behavior. We identify health failures as a significant degradation in health indicators, such as a drastic drop in satisfied ride requests in a ride-sharing ECS or a major loss of market value. Faults are complex patterns of interactions that lead to ECS failures.
We characterize ECS health failures as an unacceptable deterioration of ECS indicators, and ECS faults as patterns of (implicit) interactions that result in an ECS failure.

Smart Quality Monitoring with Machine Learning
Our research builds upon Monni and Pezzè's work [18], where failures are predicted as energy anomalies in system metrics. We extended Monni and Pezzè's free energy approach to ECS to detect anomalies and predict ECS failures from indicators. Monni and Pezzè's approach efficiently approximates the free energy using Restricted Boltzmann Machines (RBMs), bipartite neural networks that infer energy-based models by using an energy function to approximate the output distribution based on the marginal distribution of the input. We experimented with both RBMs and autoencoders to identify an effective way to reveal health anomalies in ECS.
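As an illustration of the quantity involved, the following minimal sketch computes the textbook free energy of an RBM with binary hidden units, F(v) = -a·v - Σ_j log(1 + exp(c_j + Σ_i v_i W_ij)). The paper does not spell out the formula it uses, so the function name, the parameter names, and the assumption that indicators are scaled to [0, 1] are illustrative choices, not the formulation of [18].

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def rbm_free_energy(v, weights, visible_bias, hidden_bias):
    """Free energy F(v) of an RBM with binary hidden units.

    v            : list of n_visible indicator values (assumed scaled to [0, 1])
    weights      : n_visible x n_hidden nested list
    visible_bias : list of n_visible biases
    hidden_bias  : list of n_hidden biases
    """
    linear_term = sum(a * x for a, x in zip(visible_bias, v))
    hidden_term = 0.0
    for j, c in enumerate(hidden_bias):
        activation = c + sum(v[i] * weights[i][j] for i in range(len(v)))
        hidden_term += softplus(activation)
    return -linear_term - hidden_term

# Toy check with untrained (all-zero) parameters: F(v) = -n_hidden * log(2).
toy_energy = rbm_free_energy([1.0, 0.0], [[0.0] * 3, [0.0] * 3], [0.0, 0.0], [0.0] * 3)
```

With trained parameters, samples that resemble the training data receive low free energy, so an unusually high value for a new sample is one possible signal of an anomaly.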
We detect anomalies and predict failures of ECS in two steps. First, we identify anomalies from the marginal distribution of the input when using RBMs, and from the reconstruction error when using autoencoders. Second, we predict system failures based on anomalous energy values over three (overlapping) indicator subsets corresponding to hardiness, consistency, and resilience.
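The two steps can be sketched as follows for the autoencoder case. This is a minimal sketch under stated assumptions: the autoencoder itself is replaced by a stand-in reconstruction, the indicator subsets are hypothetical, and the threshold rule (mean plus k standard deviations of the errors observed in normal operation) is a common convention rather than the rule used in the thesis.

```python
import statistics

def reconstruction_error(observed, reconstructed, indices):
    """Mean squared error restricted to the indicators listed in `indices`."""
    return sum((observed[i] - reconstructed[i]) ** 2 for i in indices) / len(indices)

def fit_threshold(normal_errors, k=3.0):
    """Assumed rule: mean + k standard deviations of errors seen in normal operation."""
    return statistics.mean(normal_errors) + k * statistics.pstdev(normal_errors)

def anomalous_dimensions(observed, reconstructed, subsets, thresholds):
    """Return the health dimensions whose reconstruction error exceeds its threshold."""
    return [
        name
        for name, indices in subsets.items()
        if reconstruction_error(observed, reconstructed, indices) > thresholds[name]
    ]

# Hypothetical, overlapping indicator subsets (indices into the indicator vector).
SUBSETS = {"hardiness": [0, 1], "consistency": [1, 2], "resilience": [2, 3]}

# Errors of a trained autoencoder on normal data would go here; toy values for illustration.
thresholds = {name: fit_threshold([0.010, 0.012, 0.008, 0.011]) for name in SUBSETS}

observed = [0.90, 0.50, 0.40, 0.10]
reconstructed = [0.50, 0.48, 0.41, 0.11]  # stand-in for the autoencoder output
flagged = anomalous_dimensions(observed, reconstructed, SUBSETS, thresholds)
```

In this toy run only the first indicator is poorly reconstructed, so only the hardiness subset is flagged. How flagged dimensions are combined into a failure prediction is left open here, since the text does not specify it.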
We monitor SES and predict failures by computing the free energy with RBMs and autoencoders from SES indicators.
We defined and studied smart quality monitoring using RBM and autoencoder for both dynamic cloud systems and SES.

Smart Quality Monitoring Dynamic Clouds.
We initially focused on the dynamic aspects of ECS, and defined an approach to predict failures in dynamic cloud systems with autoscaling mechanisms. We extended Denaro et al.'s Prevent approach [7] to deal with autoscaling. Denaro et al.'s Prevent approach predicts failures in statically configured cloud systems with both RBMs and autoencoders, with excellent results. However, it requires a statically defined set of KPIs that characterize cloud systems with a static configuration, and thus cannot be used for dynamic cloud systems with autoscaling, which results in variable sets of KPIs collected at different timestamps, depending on the changes in the configuration.
We extended Prevent with a rectifier that reduces sets of KPIs whose size varies over time due to autoscaling to sets of KPIs of fixed size that are suited for both RBMs and autoencoders, by computing summary values, like average, maximum, and minimum values, for pods of the same type. We experimentally evaluated the approach on a commercial Learning Management System. We used historical data to generate a realistic workload, and Chaos Mesh to inject failures. The experiments indicate that the new approach that combines an autoencoder with a rectifier produces results comparable to the excellent results of Prevent.
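The idea of the rectifier can be sketched as below. This is a minimal illustration, not the implementation evaluated in the thesis: the function name `rectify`, the input layout, the pod types, the KPI names, and the zero padding for pod types with no running replica are all assumptions.

```python
from collections import defaultdict
from statistics import mean

def rectify(pod_kpis, pod_types, kpi_names):
    """Reduce a variable number of per-pod KPI readings to a fixed-size vector.

    pod_kpis  : list of (pod_type, {kpi_name: value}) for the pods alive at one timestamp
    pod_types : fixed, ordered list of pod types
    kpi_names : fixed, ordered list of KPI names
    """
    grouped = defaultdict(list)
    for pod_type, values in pod_kpis:
        for kpi in kpi_names:
            if kpi in values:
                grouped[(pod_type, kpi)].append(values[kpi])

    vector = []
    for pod_type in pod_types:
        for kpi in kpi_names:
            readings = grouped.get((pod_type, kpi))
            if readings:
                vector.extend([mean(readings), max(readings), min(readings)])
            else:
                # Assumption: pad with zeros when no pod of this type is running.
                vector.extend([0.0, 0.0, 0.0])
    return vector

# Hypothetical pod types and KPIs.
POD_TYPES = ["frontend", "db"]
KPI_NAMES = ["cpu", "memory"]

t0 = [("frontend", {"cpu": 0.25, "memory": 100.0}), ("db", {"cpu": 0.5, "memory": 400.0})]
t1 = t0 + [("frontend", {"cpu": 0.75, "memory": 300.0})]  # autoscaler added a replica

v0 = rectify(t0, POD_TYPES, KPI_NAMES)
v1 = rectify(t1, POD_TYPES, KPI_NAMES)
```

Both `v0` and `v1` have 12 entries (2 pod types × 2 KPIs × 3 summary values), although the second timestamp has one more pod, so the vectors can be fed to a model with a fixed input size.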

Smart Quality Monitoring SES.
We extended the smart quality monitoring system that exploits RBMs and autoencoders to SES, by feeding the approach with health indicators to monitor hardiness, consistency, and resilience, and we reported the results at FSE in 2021 [31]. We approximate the free energy of hardiness, consistency, and resilience with an autoencoder that we feed with different subsets of values for the three health dimensions.
We evaluated the smart quality monitor on the ride-sharing smart ecosystem, inspired by platforms like Uber and Lyft. We trained the autoencoder with data that we collected during weeks of normal execution of the simulator, and we evaluated the smart quality monitor on different scenarios, like the peak request and greedy crash scenarios. In the peak request scenario, there is a sudden increase in ride requests that does not correspond to an increase in the availability of drivers, for instance, after the abrupt and unexpected termination of a popular event. The effect is a temporary disruption with a sudden decrease in satisfied requests, and an increase in both ride delays and service slack time.
In the greedy crash scenario, the sudden increase of requests happens in the presence of greedy drivers, that is, drivers who decline requests to offer rides when they perceive a critical situation that may lead to an increase of ride fares. The effect is an SES failure, that is, a disruptive peak of unserved requests.
Our smart quality monitoring system correctly observes a perturbation of the SES health in both scenarios and correctly predicts the SES failure in the greedy crash scenario only.

Smart Quality Monitoring Social Behavior.
The preliminary study of our smart monitoring system with the ride-sharing smart ecosystem relies on very simple models of humans that characterize the collective behavior of drivers and passengers with few parameters, for instance, the probability distribution of drivers' declining requests. We investigated the hypothesis of a correlation between human personality traits and actions to predict human behavior in SES and detect anomalous behavior. Drawing on the contributions of Costa and McCrae, we use the OCEAN model through a 60-question BFI-2 personality assessment to profile the personality traits of a sample population, and we ask our sample to answer a set of questions in different scenarios of the ride-sharing smart ecosystem. We collected data from over 100 voluntary participants, ensuring a diverse sociodemographic profile for a comprehensive exploration. We used a structured questionnaire with three layers: acquiring socio-demographic information, gathering responses to the Big Five Inventory-2 for personality traits, and exploring decision-making within ride-sharing scenarios. Our analysis involves computing composite scores based on the OCEAN model, assessing personality traits across dimensions. Our preliminary results indicate an increasingly strong correlation between personality traits and actions when considering sequences of actions of increasing length, paving the way to a mature model of human behaviors within SES.
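A composite trait score of this kind is typically the mean of the items that belong to a trait, after reverse-scoring the reverse-keyed items. The sketch below illustrates that computation under stated assumptions: a 5-point Likert scale, a hypothetical 10-item scoring key (not the published BFI-2 key, which should be taken from the instrument's documentation), and illustrative names throughout.

```python
from statistics import mean

def ocean_scores(responses, scoring_key, scale_max=5):
    """Composite trait scores from Likert responses.

    responses   : {item_number: answer in 1..scale_max}
    scoring_key : {trait: [(item_number, is_reverse_keyed), ...]}
    """
    scores = {}
    for trait, items in scoring_key.items():
        keyed = [
            (scale_max + 1 - responses[item]) if reverse else responses[item]
            for item, reverse in items
        ]
        scores[trait] = mean(keyed)
    return scores

# Hypothetical 10-item key for illustration only; NOT the published BFI-2 key.
TOY_KEY = {
    "O": [(1, False), (6, True)],
    "C": [(2, False), (7, True)],
    "E": [(3, False), (8, True)],
    "A": [(4, False), (9, True)],
    "N": [(5, False), (10, True)],
}
toy_responses = {1: 5, 2: 4, 3: 2, 4: 3, 5: 1, 6: 1, 7: 2, 8: 4, 9: 3, 10: 5}
toy_scores = ocean_scores(toy_responses, TOY_KEY)
```

The resulting per-participant score vectors are the kind of input that can then be correlated with the decisions recorded in the ride-sharing scenarios.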

ONGOING RESEARCH ACTIVITY
In the final part of the PhD, we plan to consolidate the results that we obtained so far, by gathering additional evidence of the validity of healthiness to capture the quality of SES, and by fully integrating the different aspects of the smart monitoring system into a complete monitor that takes into consideration autoscaling, evolution and human activities within a complex SES. We plan to define monitors for large SES, like smart cities, by extending our promising results about the correlation between human personality traits and actions to traits that characterize populations and not only individuals. We plan to complete the smart monitor with preventing actions (research question 4).
We will gather additional evidence of the validity of healthiness to capture the quality of the smart ecosystem with an extensive set of experiments with our ride-sharing smart ecosystem that simulates Uber and Lyft in San Francisco, with the actual traffic of San Francisco in the street map of the city. We will complete our user study with questionnaires that will explore additional decision-making scenarios to assess the degree of correlation between human personality traits and actions, and we will extend results to homogeneous human populations within an SES. We will experimentally validate the new models of population personalities with different groups of passengers and drivers in our ride-sharing smart ecosystem.
We will use the information that our smart monitoring system produces when predicting a failure to define approaches to trigger suitable preventing actions. We will start with the simple and effective set of actions offered in popular systems, for instance, restarting critical nodes or migrating unstable pods in autoscaling cloud systems. We will define new preventing actions for cases that current simple mechanisms cannot satisfactorily prevent.

CONCLUSIONS
In this PhD, we will define a complete smart monitoring system for evolving complex ecosystems to monitor the healthiness of the system, predict incoming failures, and prevent them when possible. The smart monitoring system will monitor ECS with configurations that vary dynamically and evolve due to the dynamic assignment of resources or the evolution of the smart ecosystem, with systems dynamically entering and exiting the ecosystem. It will monitor the human behavior that depends on personality traits for both individuals and social groups. It will offer a set of actions to prevent predicted failures from occurring. The proposed system will incorporate machine learning algorithms to continuously adapt and refine its predictive capabilities, ensuring a proactive approach to system health management.
2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)