Introducing Monitoring and extended Elasticity support in TOSCA

Cloud services and applications are becoming ever more important for enterprises, which profit from the scalability, flexibility and pay-as-you-go model offered by Cloud service vendors. One of the most well-known standards in the domain, developed about ten years ago, is the TOSCA cloud application specification. TOSCA allows the definition of the structure and operation of cloud applications. Although considerable work has already been done on the specification of monitoring and elasticity, of which a thorough analysis is provided, its quality and its integration in TOSCA can be significantly improved. In this work we suggest specific extensions covering the monitoring of processing components and the elasticity policies which are associated with them. Indicative TOSCA examples are provided to aid comprehension.


INTRODUCTION
Cloud computing is constantly gaining momentum among large enterprises and SMEs. It has been estimated that the value of public cloud spending will increase by more than 20%, surpassing $720 billion in 2024 [1]. Moreover, the IaaS sector is forecast to experience the largest increase among the Cloud computing sectors [1]. The appropriate management of IaaS resources will become ever more important. A crucial role in the effort of maintaining usable and updated cloud applications is held by orchestration software, which is responsible for coordinating the deployment and reconfiguration of cloud applications. Specialized components such as Prometheus [2] or other advanced event processing and management systems [3] can help orchestrators perform adaptations which allow applications to maintain an appropriate QoS level. Indeed, without appropriate monitoring data it is not possible to adapt an application unless its behaviour is perfectly known in advance, e.g., because the application is subject to periodicity. Otherwise, it is necessary to acquire quality data frequently, in order to analyze it and ensure the necessary quality of service. However, increasing the number of measured monitoring metrics and/or the monitoring data sampling rate results in increased bandwidth usage; therefore, a means to allow only a selection of metrics to be used at an appropriate resolution is needed.
Having monitoring data, though, is not enough to ensure appropriate QoS for an application. It is further necessary to define elasticity policies or SLAs, and a means to achieve each. To this end, a multitude of technologies have appeared in the past years, chiefly exploiting vertical and horizontal scaling but also other adaptation actions. In practice, monitoring data acquisition and SLA enforcement can be as simple as a shell script with a few lines of code specifying an overprovisioning of resources for a constant load, or as complex as a full-scale monitoring solution (e.g., Prometheus [2]) coupled to a data-intensive analyzer and recommender (e.g., a machine-learning-based solution using big training data). Whichever the case, it is necessary to be able to appropriately configure a cloud application to acquire all needed data and attain the required SLAs, using simple and easily understandable terms. This is not a luxury but a need, in order to allow the easy and fast inspection of not only the processing capabilities of a node but also the monitoring data it produces and the SLAs it should respect (and possibly the algorithms it should use to respect these SLAs). This need can be fulfilled by the appropriate use and extension of cloud application modelling languages. We hold that the extension of an open standard such as TOSCA [4] will introduce a normative way to fulfil these needs. This extension is precisely the research question addressed by this work: which constructs should be added to TOSCA to allow applications to collect monitoring data, and which modeling artifacts should be exploited to enable them to conform to their elasticity requirements and/or SLAs in a modern and future-proof manner.
Support for monitoring variables and elasticity has already been considered in languages for cloud computing, including TOSCA (e.g., [5]-[7]). However, concrete artifacts bringing these capabilities to contemporary TOSCA are missing. These are essential to appropriately model the monitoring (and adaptation) of cloud applications using the TOSCA standard. In addition to its prestige as a cloud deployment standard, TOSCA holds real modeling advantages over tools which are more often used in production. As mentioned in [8], "While it is possible to consider other approaches (e.g., Ansible, Vagrant, Docker) instead of TOSCA for the automated deployment of IoT applications […] TOSCA enables a generic approach based on topology models and a corresponding graphical notation". It should be mentioned that the current official TOSCA standard version, 1.3 [4], includes some support for scaling, although this is anticipated to be removed as it does not appear in the specification draft for the next version, 2.0 [9]. However, when reviewing both the existing support and previous work, it is apparent that a complete approach to monitoring, elasticity and SLA specification in TOSCA is missing.
In [10], the author extends the definition of TOSCA to handle SLA attributes, KPI capabilities, and SLOs which should be respected. Furthermore, TOSCA is extended to allow the definition of the penalties which should be applied in the case that an SLO is not met. The work also discusses how these concepts can be applied in a marketplace environment, where clients can choose a service which possesses the appropriate characteristics through TOSCA. Although the concept of SLOs is touched upon, and some SLOs and even penalties are mentioned, the ways in which the application can meet a dynamic SLO are not considered. Instead, static SLOs such as the geographical position of a node and the service availability are discussed.
In [11], the authors propose an extension to TOSCA [4] refining the support of its autoscaling capabilities. More specifically, enhancements to the normative TOSCA policy are introduced, specifying constraints on elasticity rules and a clear interpretation of the enforcement of elasticity policies. Their work is a very important first step, especially as it is meant to be compatible with the xOpera orchestrator [12]. We fully support their philosophy of pragmatic, realistic modeling targeted at real problems. However, we consider that slightly increasing the abstraction of the modeling allows more generic application contexts to be expressed. Our approach is, abstraction-wise, similar to what was followed in the official standard [4], although we add considerably more depth from both the monitoring and the scaling perspectives. Moreover, the definition of the scaling itself can be enriched to allow different algorithms to be used, rather than relying only on a static addition/removal of instances.
In [13], extensions are proposed to XML-TOSCA to support the specification of SLA policies and current monitoring information in the TOSCA template. The adaptation of topologies is proposed to be conducted using scaling plans featuring a static set of adaptation actions, aiming to respect SLA and budget constraints using queuing models. Information related to the current values of monitoring metrics is also inserted into the TOSCA template. In our proposal, we describe a generic modeling of the monitoring metrics which will be used, and can support the definition of details for more algorithms than queueing networks.
Similar to our work, the authors in [7] propose a custom cloud resource description model (cRDM) aiming to allow users to describe their resources and elasticity features while avoiding vendor lock-in and some of the disadvantages associated with TOSCA. They use three event types to trigger a reconfiguration action: temporal events, resource-related events and user action events. In order to indicate the actions which should be taken by a cloud platform in response to the triggering of an event, horizontal scaling, vertical scaling, migration and application reconfiguration actions are described. In our approach, we consider temporal and resource events, but not user action events. We consider that the latter case should be handled either at the level of the application (and thus influence hardware-level or custom application-level monitoring metrics) or at the level of the orchestrator and its user interface (which may allow custom editing of the model properties by the user). Moreover, we allow the use of context events which allow a component to adjust its behaviour to the situation of the whole service. We support any adaptation action (see scaling_parameters in Section 3.3), yet in our view migration should rather be handled at the level of service template optimization. Moreover, we hold that application reconfiguration actions should be taken by the application itself and not necessarily be reflected in the application model (as application logic itself is not reflected in the application model). Most importantly, we consider that TOSCA is an appropriate means to specify cloud topologies, and that the weaknesses of TOSCA mentioned in [7] (mainly verbosity, textual form, and complex low-level scripting requirements) are either not relevant or are manageable.
In [14], the authors propose the enhancement of XML-TOSCA to include information related to monitoring data and application elasticity. Monitoring data is modeled in terms of the current values of monitoring metrics, while application elasticity is supported by the configuration of appropriate constraints which guide the horizontal scaling of the platform. Application elasticity is modeled by integrating the SYBL elasticity language. SYBL [6], [15] is a simple yet expressive language which helps define scaling policies, and is accompanied by a framework which exploits SYBL to govern cloud application elasticity.
In [6], to model the elasticity behaviour of the application, the authors suggest, among other modeling artifacts, the use of elasticity requirements, capabilities, and relationships, all directly compatible (in their definition) with TOSCA. Moreover, they suggest the specification of the attributes which should be monitored, although the monitoring frequency is not specified. The elasticity capabilities which are given as an example include infrastructure, platform and application-level actions. Although this conceptual differentiation allows the majority of elasticity actions to be implemented on a single elasticity controller, we maintain that since platform-level actions (e.g., join/leave cluster, adjust number of threads) and application-level actions (e.g., set configuration parameter) are in many cases tightly bound to application logic, it is more appropriate for the application itself to handle them. Further, it will be simpler (in terms of development and debugging) for a developer familiar with an application to change its configuration, instead of creating a specific plugin in another framework to implement this functionality. Concerning the viability of software, code and concept fragmentation can allow for quick and independent progress but, unfortunately, can also mean that some parts of the software are underdeveloped or are eventually abandoned. Currently, the GitHub repository of the rSYBL framework [16] has not had any commits since 2016, and the few TOSCA examples in [6] refer to XML-TOSCA, which has been deprecated in favor of YAML.
The CAMEL cloud application language [17] features a sophisticated metric model to describe monitoring metrics and scalability rules. In this work, we try to import the most essential elements of this model into the TOSCA standard.
Industrial efforts have also been made to include some monitoring and scaling capabilities within TOSCA nodes, although these have not yet been officially adopted by the TOSCA standard. OpenStack Heat [18] allows the definition of policies in TOSCA which can be used to drive autoscaling. A TOSCA policy can contain the monitoring metric based on which scaling should occur, the threshold, the number of repetitions of the alert and the time window over which the calculation of the scaling condition should happen. Cloudify [19] examples in related repositories [20], [21] illustrate that it can support the enforcement of custom TOSCA elasticity and scaling policies. These contain, among others, a description of the number of instances which should be added to a deployment of a component, the cooldown period which should be enforced between adaptations, the monitoring metric which should be used for the scaling action, and the monitoring interval. These are all valid properties which should be considered during scaling, and we have incorporated them in our modelling as well (albeit using a different modelling artifact).
Our approach complements what has already been suggested by the community, by adding support for more algorithms to be used.

Overview
In the following sections we shall describe the core, generic artifacts necessary for monitoring and elasticity support. These artifacts have been created using normative TOSCA constructs: data types, requirements, capabilities, policies, and triggers. We do consider that a TOSCA orchestrator capable of understanding these constructs is indispensable to attain the implied functionality; however, such an orchestrator is not currently available.

Support for Monitoring Attributes
We define monitoring attributes as any measurable quantity which can be used for the adaptation of a TOSCA application. The most important details which we consider should exist in a monitoring attribute definition are the monitoring metric name, the aggregation operator (e.g., max, average, 90th percentile) which should be applied and the unit which is used during the collection of data. In TOSCA, we consider that each node type depicting a software component hosted on relevant processing hardware should have a monitoring requirement. In turn, this requirement should be fulfilled by a TOSCA node template which offers a relevant capability. The monitoring capability should include the monitoring metrics which should be offered. Similarly, when a software component is required to respect SLA constraints, a relevant requirement should be fulfilled by an appropriate capability. An example of a node type requesting monitoring and SLA monitoring to be activated appears in Listing 1.
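Independently of the TOSCA listing, the three details of a monitoring attribute named above (metric name, aggregation operator, unit) can be illustrated with a minimal Python sketch. It is not part of the proposed TOSCA extension: the class name `MonitoringAttribute`, the operator keys and the nearest-rank percentile rule are assumptions made only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence


def percentile_90(samples: Sequence[float]) -> float:
    """Nearest-rank 90th percentile (an assumed definition, chosen for simplicity)."""
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]


# Hypothetical operator names mirroring the examples given in the text.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "max": max,
    "average": mean,
    "percentile_90": percentile_90,
}


@dataclass(frozen=True)
class MonitoringAttribute:
    metric_name: str   # e.g. "cpu_usage"
    aggregation: str   # a key of AGGREGATORS
    unit: str          # unit used while collecting the data, e.g. "percent"

    def aggregate(self, samples: Sequence[float]) -> float:
        """Apply the configured aggregation operator to a window of samples."""
        if not samples:
            raise ValueError("cannot aggregate an empty window of samples")
        return float(AGGREGATORS[self.aggregation](samples))
```

For instance, an attribute declared as `MonitoringAttribute("cpu_usage", "average", "percent")` would reduce a window of raw samples to a single value before any scaling condition is evaluated.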

Support for Elasticity Policies and SLA constraints
In the example of Listing 5, we provide the modelling of a suggested scaling policy using a simple threshold-based algorithm as a basis. We consider that an event named 'monitoring_available' is made available through an appropriate interface similar to the one defined in Listing 6. Additional temporal/context events could also be used to enable the evaluation of a trigger. These would allow a modeller to indicate the need to perform a different adaptation action when, e.g., the application is used at night-time or day-time, or under different weather conditions.

Listing 6: Monitoring available interface definition
An important extension which is brought by our approach is the capability to define an unlimited number of parameters, similar to the ones stated in the example of Listing 5, inside the 'scaling_parameters' property of the scaling algorithm. This allows refined adaptation actions, and also allows the most important configuration options to be understood directly from the model.
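How an orchestrator might consume such a free-form parameter map can be sketched in Python as follows. This is only an illustration under stated assumptions: the keys `threshold`, `instances` and `cooldown_seconds`, the string-valued map and the class `ThresholdScaler` are hypothetical and are not taken from the listings of this work.

```python
from typing import Dict, Optional


class ThresholdScaler:
    """Simple threshold-based scale-out, configured from a map of parameters."""

    def __init__(self, scaling_parameters: Dict[str, str]) -> None:
        # Assumption: values arrive as strings, as in a generic key/value map.
        self.threshold = float(scaling_parameters["threshold"])
        self.instances = int(scaling_parameters.get("instances", "1"))
        self.cooldown = float(scaling_parameters.get("cooldown_seconds", "0"))
        self._last_action: Optional[float] = None

    def on_monitoring_available(self, value: float, now: float) -> int:
        """Return how many instances to add for the reported metric value.

        `now` is a timestamp in seconds; 0 is returned when the threshold is
        not exceeded or when the cooldown period has not yet elapsed.
        """
        in_cooldown = (
            self._last_action is not None
            and now - self._last_action < self.cooldown
        )
        if value <= self.threshold or in_cooldown:
            return 0
        self._last_action = now
        return self.instances
```

Because the constructor reads only the keys it understands, a different algorithm could accept a different set of keys from the same map without any change to the surrounding model, which is the flexibility the free-form property is meant to provide.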
To illustrate the versatility of our approach, we present a second example of the adaptation algorithm configuration in Listing 7, which can be used to configure the Simple Severity Zones algorithm [22]. The algorithm aims to provide a configurable number of different static responses for scale-in or scale-out depending on the zone of the Severity of the current situation, subject to a cooldown period between successive adaptations.

Listing 7: Configuration of the Simple Severity Zones algorithm
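The general idea of zone-based static responses can be sketched in Python as below. The sketch does not reproduce the algorithm of [22]: it assumes that the Severity is already available as a single number, that zones are delimited by ascending boundaries, and that each zone maps to a fixed change in the number of instances. The class name `SeverityZones` and its parameters are hypothetical.

```python
from bisect import bisect_right
from typing import Optional, Sequence


class SeverityZones:
    """Maps a severity value to a static scaling response, with a cooldown."""

    def __init__(self, boundaries: Sequence[float],
                 responses: Sequence[int], cooldown: float) -> None:
        if list(boundaries) != sorted(boundaries):
            raise ValueError("zone boundaries must be in ascending order")
        if len(responses) != len(boundaries) + 1:
            raise ValueError("one response is needed per zone")
        self.boundaries = list(boundaries)
        self.responses = list(responses)  # negative: scale-in, positive: scale-out
        self.cooldown = cooldown
        self._last_action: Optional[float] = None

    def react(self, severity: float, now: float) -> int:
        """Return the instance delta for the zone that contains `severity`."""
        zone = bisect_right(self.boundaries, severity)
        delta = self.responses[zone]
        if delta == 0:
            return 0
        if self._last_action is not None and now - self._last_action < self.cooldown:
            return 0  # still cooling down after the previous adaptation
        self._last_action = now
        return delta
```

With boundaries `[0.2, 0.6, 0.9]` and responses `[-1, 0, 1, 3]`, for example, a severity below 0.2 removes one instance, a severity between 0.2 and 0.6 leaves the deployment unchanged, and a severity of 0.9 or more adds three instances.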
Moreover, based on our experience in handling workloads, we propose two additional properties: 'expected_load_intensity' and 'is_load_periodical'. The first of these should describe, using one word (understandable by the orchestrator), the character and intensity of the workload, while the second is a Boolean value which can hint at any (expected) periodicity of the workload. The complete algorithm configuration datatype, which is used in a policy template of a tosca.policies.ScalingPolicy policy type, appears in Listing 8.

description: A list of the parameters which will be used to guide the operation of the scaling algorithm
type: string

Listing 8: The AlgorithmConfiguration datatype
Concerning the definition of triggers, we consider that the multiple adaptation actions (stop, wait, notify, access and configure) which are defined in some works (e.g., [14]) can be reduced to a single 'set-state' action, the value of which can be exploited to trigger different behaviour based on the configuration of the scaling algorithm.
We do not differentiate above between triggers which aim to enforce SLAs and triggers which are used to implement elasticity policies. Differences which would be expected between triggers of the two categories in real application topologies would appear i) in the values and the conditions which are used for scaling actions and ii) in the required events which need to be satisfied before a trigger is activated.
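The point that both categories of trigger share one structure and differ only in their conditions and required events can be sketched in Python. The sketch assumes a trigger is described by a set of required events, a condition over metric values, and the state it sets; the class `Trigger`, the event `night_time` and the state names are hypothetical illustrations, not TOSCA syntax.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Mapping, Optional, Set


@dataclass(frozen=True)
class Trigger:
    required_events: FrozenSet[str]                 # events that must all be present
    condition: Callable[[Mapping[str, float]], bool]
    state: str                                      # value passed to the single set-state action

    def evaluate(self, observed_events: Set[str],
                 metrics: Mapping[str, float]) -> Optional[str]:
        """Return the state to set, or None when the trigger does not fire."""
        if not self.required_events <= observed_events:
            return None
        return self.state if self.condition(metrics) else None


# An elasticity trigger and an SLA trigger built from the same structure.
elasticity_trigger = Trigger(
    required_events=frozenset({"monitoring_available"}),
    condition=lambda m: m["cpu_usage"] > 80,
    state="scale_out",
)
sla_trigger = Trigger(
    required_events=frozenset({"monitoring_available", "night_time"}),
    condition=lambda m: m["response_time_ms"] > 500,
    state="sla_violation",
)
```

Here the SLA trigger stays silent during the day even when its condition holds, because one of its required events is missing, while the elasticity trigger depends only on monitoring data being available.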

DISCUSSION-FURTHER WORK
The approach which was illustrated in the previous sections was not the only one which could be undertaken to specify monitoring and elasticity support in TOSCA. For example, monitoring details could be specified using 'monitoring policies' instead of requirements/capabilities. Moreover, adaptation policies could be specified using adaptation requirements and capabilities. Specifying monitoring details using policies can be, from the perspective of modelling the intended functionality, an effective solution. The latest specification draft of the standard [9] itself mentions that 'A policy can express such diverse things as monitoring behavior […]', although no such example is provided there. However, using requirements and capabilities is more aligned with the concept of a software stack which is commonly used to describe reference implementations (e.g., LAMP, ELK). Moreover, it allows one to benefit from the encapsulation of node type definitions and to abstract possibly distracting details. Therefore, node types may be easily reused, while manual model inspection and comprehension, whether in a graphical or textual form, becomes easier. These advantages outweigh the decoupling offered by policy types.
In the case of the modelling of elasticity through adaptation policies, the choice to create a new policy type is based on our intent to integrate with the trigger definitions which are available in TOSCA. Therefore, we can not only exploit the existing notification mechanism, but also use the ability of TOSCA to express complex logical expressions inside trigger 'conditions'.
Regarding the modelling of scaling parameters as a TOSCA map, we could instead opt to concretely model scaling parameters for particular algorithm types. For example, in the case of the simple threshold algorithm, we could specify that the required parameters would be a property list of a new data type including the type of scaling, the number of instances, and the cooldown. Similarly, different fields could be defined for other techniques. However, since the number of possible adaptation techniques is not small, a great amount of modelling effort would be required to model all of them. On the other hand, forcing a TOSCA user to model the topology configuration may hinder the adoption of our new extensions. We therefore encourage extending the AlgorithmConfiguration data type, but we do not consider that this should be mandatory.
Data protection and privacy constitute an important aspect of monitoring data handling which has not yet been handled in TOSCA [23], and is not specified in this work. Therefore, the description of anonymization or encryption of data, along with details on any policies which need to be employed, needs to be worked on to provide the modelling artifacts necessary for privacy-aware applications.
Questioning the methodology of this work, one may disagree with our approach of integrating monitoring and elasticity aspects directly into TOSCA, rather than modelling these extensions using a dedicated language, as is done in CAMEL [17] and was suggested in the case of rSYBL [6]. Considering, though, that the contemporary cloud topology definition landscape is very fragmented, and that the most prominent open cloud standard is not universally adopted, we hold that it is currently most important to establish a set of primitive structures. Using multiple specialized languages is an effective approach, as ultimately a single language cannot cover everything without adding significant complexity, but only as a means to extend the features of the 'main' language. For example, HTML with limited formatting capabilities was used first, and only afterwards were CSS and JavaScript introduced.
Further work could also involve the a priori specification of the monitoring capabilities and requirements based on the state of topology components (e.g., different monitoring behaviour under different workload intensities). This would increase modelling complexity; however, a modeler could investigate a methodology based on [24]. Alternatively, even without relying on any TOSCA additions (as is required in [24]), if an orchestrator is provided we can assume that such changes are propagated into a new, simple model. Moreover, the specification of the QoS constraints should ideally be done using a user interface, as suggested in [5]. The tight integration of a user interface with an orchestrator can allow for more detailed modelling of the topology (as it will be less tedious) and may even allow, in the long term, for real-time model-driven visualization of monitoring streams. Finally, it is obvious that the monitoring details and adaptation policies proposed in TOSCA need an orchestrator to be extended or implemented anew in order to actually realize them. To this end, an integration with a related technology such as Prometheus [2] or EMS [3] can be pursued.

CONCLUSIONS
We presented, in a concise manner, what we consider to be the main additions which should be made to the current TOSCA standard in order to support the specification of monitoring metrics and elasticity directives. Unfortunately, current solutions are either quite complex, or are not applicable to TOSCA and require additional modelling tools to be used. We consider that solely using language structures already existing in the standard prevents fragmentation of the effort. Therefore, we propose that creating generic capabilities to fulfil monitoring requirements, using simple metric definitions, can bring monitoring support to TOSCA. Finally, this information can be exploited using appropriate trigger definitions to support elasticity actions without adding unnecessary verbosity to the template.