Modeling Resilience of Collaborative AI Systems

A Collaborative Artificial Intelligence System (CAIS) performs actions in collaboration with the human to achieve a common goal. CAISs can use a trained AI model to control human-system interaction, or they can use human interaction to dynamically learn from humans in an online fashion. In online learning with human feedback, the AI model evolves by monitoring human interaction through the system sensors in the learning state, and actuates the autonomous components of the CAIS based on the learning in the operational state. Therefore, any disruptive event affecting these sensors may affect the AI model's ability to make accurate decisions and degrade the CAIS performance. Consequently, it is of paramount importance for CAIS managers to be able to automatically track the system performance to understand the resilience of the CAIS upon such disruptive events. In this paper, we provide a new framework to model CAIS performance when the system experiences a disruptive event. With our framework, we introduce a model of performance evolution of CAIS. The model is equipped with a set of measures that aim to support CAIS managers in the decision process to achieve the required resilience of the system. We tested our framework on a real-world case study of a robot collaborating online with the human, when the system is experiencing a disruptive event. The case study shows that our framework can be adopted in CAIS and integrated into the online execution of the CAIS activities.


INTRODUCTION
A cyber-physical system (CPS) has heterogeneous hardware-software components that collaborate to deliver real-time services. The complexity of CPSs differs from one domain to another. A Collaborative Artificial Intelligence System (CAIS) is a CPS that performs collaborative actions with humans in a shared environment to achieve a common goal [9,8]. Such systems need to be resilient and recover quickly from any disruptive event that degrades their performance and eventually prevents them from providing their services on schedule [3,9].
The AI component in a CAIS is the core decision-making instrument that guides system-human interactions. This component can be trained either i) using historical data (offline training), or ii) using run-time data (online training) [7,9,8]. When the AI component is trained online, CAIS resilience is of paramount importance [8]. A disruptive event may indeed affect the ability of the AI component to restore its prediction accuracy in an acceptable time, which in turn leads to the degradation of the overall system performance and to extra interactions with the human to recover from the event [1]. Thus, CAIS managers need to ensure their system's resilience by monitoring its performance. Monitoring performance allows the managers to understand whether their system was able to detect performance degradation and recover to an acceptable performance state [3,2].
The major goal of this paper is to provide a new framework that models CAIS performance while the system learns online and faces a disruptive event. Our model shows the performance evolution of a CAIS and the performance degradation that may occur after a disruptive event, and it provides rules and measures to assess CAIS resilience upon disruptions. Our framework abstracts the states of a CAIS into a learning state and an operational state. The CAIS enters the learning state in two cases: first, when the AI model accuracy is not high enough to make a trusted decision and human interaction is required; second, when the human intervenes to correct a false positive decision. The CAIS is in the operational state when it completes its service autonomously. Our framework tracks the CAIS states at run-time, computes the ratio between the number of operational states and the number of learning states during a specific time frame, and provides a set of measures to evaluate the performance across the states. We applied our framework in an experiment with a real-world CAIS case study. Our CAIS is a collaborative robot that learns to classify objects by color in an online learning process. During the experiment, we disrupted the learning process by turning off the supporting lights of the RGB camera sensor that captures the objects to be classified. Our framework showed the incremental evolution of the system performance from a steady state, the performance degradation after the disruptive event within a disruptive state, and the recovery during a recovered state. The resilience model we obtained also shows the system entering a second disruptive state caused by removing the cause of the disruptive event. After this state, the system again enters a recovered state.
Our major contributions in this work can be summarized as follows: (1) We design a novel framework that models the evolution of CAIS performance. Our framework defines the rules and the measurements to automatically track CAIS performance and to detect performance degradation and eventual anomalies. (2) We define specific measurements and rules to describe the system performance at each state of its evolution. Our rules assess CAIS resilience upon disruptive events and set the comparison baseline for future research on resilient CAISs. (3) We automated our framework with a real-world CAIS demonstrator, a collaborative robotic arm with an AI component that learns online from human gestures. We then performed an experiment in a laboratory setting and obtained a model of the performance evolution over the different states of the system.
The rest of this paper is structured as follows. In Sec. 2 we overview the key concepts concerning this work and the proposed methodology. In Sec. 3 we introduce the research questions, our CAIS demonstrator, our experiment, and the takeaways. In Sec. 4 we discuss threats to validity. In Sec. 5 we discuss the related work. Finally, in Sec. 6 we state our conclusion and future work.

METHODOLOGY
In this paper, we aim to model the resilience (footnote 1) of a CAIS while it learns online from a human upon disruptive events. Hence, our method starts by understanding the online learning process, the different variables, and the states a CAIS passes through to achieve resilience. With this knowledge, our method defines a framework and its measures to support CAIS managers in understanding the resilience of their system. The result is a model, as in Fig. 3, equipped with the measurements illustrated in the following.
We describe the online learning process by means of the CAIS of our case study. As such, in the following, we refer to CAIS tasks as specific AI classification tasks, but the online process can be similarly defined for other AI tasks. Our robot learns object classification by color. Then, based on the learning, the AI component autonomously recommends, with a specific probability, an action to perform the classification. Fig. 1 shows the online learning process in a CAIS, where the data is collected by the CAIS sensors and then preprocessed to extract the learning features. With the learning features, the AI model estimates the probabilities of the classification classes and computes their maximum (a.k.a. the confidence level of prediction), a value in [0, 1], so that the class with the highest probability is chosen by the robotic arm. To avoid cases in which more than one class can be chosen because the probabilities are similar, a desired confidence level is set (e.g., 0.4 for three classes). The confidence level of prediction is then compared with the desired confidence level. If the former is less than the latter, the human is prompted to perform the task; otherwise, the CAIS performs it autonomously. Moreover, the human can intervene when the robot misclassifies an object (a false positive), switching the robot to the "learning from human" mode.
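The comparison between the confidence level of prediction and the desired confidence level can be illustrated with a minimal Python sketch. The function name `decide`, the class labels, and the probability values are hypothetical and chosen only for illustration; the 0.4 default mirrors the example value given in the text.

```python
from typing import Dict, Tuple


def decide(class_probs: Dict[str, float],
           desired_confidence: float = 0.4) -> Tuple[str, str, float]:
    """Return (state, predicted_class, confidence) for one object.

    Hypothetical helper: the state is "operational" when the CAIS may act
    autonomously and "learning" when the human is prompted instead.
    """
    # The confidence level is the maximum of the class probabilities.
    predicted_class = max(class_probs, key=class_probs.get)
    confidence = class_probs[predicted_class]
    # Below the desired confidence level, the human performs the task.
    state = "operational" if confidence >= desired_confidence else "learning"
    return state, predicted_class, confidence


# Clear winner: the robot acts autonomously.
print(decide({"Box1": 0.70, "Box2": 0.20, "Box3": 0.10}))
# Near-uniform probabilities: the human is prompted.
print(decide({"Box1": 0.36, "Box2": 0.34, "Box3": 0.30}))
```

The sketch does not cover the second path into the learning state, where the human corrects a false positive after an autonomous action.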

Performance Evolution
AI models evolve with learning data. In our context, an AI component starts without any knowledge about the action to perform, which means that the confidence level of prediction is zero. As the CAIS keeps running, the accuracy of the AI model increases with the confidence level in a positive relationship. With higher confidence levels, the number of autonomous actions in a time frame increases and, correspondingly, the number of human actions decreases. The ratio between these two numbers characterizes the interaction between the autonomous components in a CAIS and the human. The goal of a CAIS is to keep this ratio at its maximum. When a CAIS experiences a disruptive event, the data and the online learning process might be affected. Therefore, with a disruptive event, the confidence level and the model accuracy may decrease. When the confidence level drops below the desired confidence level, the ratio between the number of autonomous actions and the number of human actions decreases, indicating a degradation of the CAIS performance. Hence, for the AI model to recover its ability to perform its tasks autonomously, it requires further training by switching to the learning mode. The performance curve shows the performance degradation caused by the disruptive event and the recovery to an acceptable performance level. This evolution provides evidence of CAIS resilience upon the current disruptive event, as described in the following.
Footnote 1: Resilience is a non-functional property that enables a system to recover its performance after an event has degraded it [3].

Modeling Performance
To model the CAIS performance while it learns online from the human, we plot the ratio between the number of times the CAIS operates autonomously and the number of times the human operates in a specific time frame. Fig. 2 shows the ratio over the states of the system. Our model initializes a first-in first-out queue of time-frame size filled with zeros. When a new object arrives, the AI model estimates the classification probability (the confidence level of prediction). If this value is below the desired confidence level (learning state), the model enqueues a zero; otherwise, it waits for the operating state to complete before enqueueing a one (to allow the human to intervene in case of false positives). After enqueueing a new value, our model dequeues the value at the front of the queue so that the queue keeps exactly time-frame-size elements. We then define our measurement, the Autonomous Classification Ratio (ACR), as the ratio of autonomous actions in a time frame. We compute the ACR as the queue sum divided by the time frame size. For each new object, we plot the ACR value, which results in the final performance model. Fig. 3 shows the expected resilience model of CAIS performance evolution when encountering a disruptive event. The CAIS learns the classification until it enters the 1st Steady State. In the Final State, the CAIS may behave differently than in the previous disruptive and steady states, as it may maintain some historical memory of the disruptive event. It is important to note that this cycle of states is repeatable for each disruptive event that degrades the system performance. Table 1 summarizes each of the performance states of the CAIS to represent resilience.
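The queue-based ACR computation can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the class name `ACRTracker` is hypothetical, and it assumes that an iteration in which the human intervenes to correct a false positive is recorded as a zero.

```python
from collections import deque


class ACRTracker:
    """Sliding-window Autonomous Classification Ratio (hypothetical helper)."""

    def __init__(self, time_frame_size: int = 5) -> None:
        # FIFO queue of time-frame size, initialized with zeros.
        self.window = deque([0] * time_frame_size, maxlen=time_frame_size)

    def record(self, autonomous: bool) -> float:
        """Enqueue 1 for an autonomous classification, 0 for a human one.

        A deque with maxlen drops its oldest element on append, which
        plays the role of the explicit dequeue step.
        """
        self.window.append(1 if autonomous else 0)
        # ACR = queue sum over the time frame size.
        return sum(self.window) / len(self.window)


tracker = ACRTracker(time_frame_size=5)
outcomes = [False, True, True, True, True, True]
acr_curve = [tracker.record(o) for o in outcomes]
print(acr_curve)  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```

Plotting the returned value for each new object yields a performance curve of the kind described above; the window size of five matches the one used later in the experiment.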
We have defined a few measures (Table 2) and rules to support CAIS managers in using our framework and assessing CAIS resilience, as described in the following. First, the State Length is an experimental rule that defines the minimum number of iterations of a state (each new object initiates a new iteration). We set the state length equal to the number of iterations in the 1st Steady State. Then, in a disruptive state, we examine a period close to or equal to this number of iterations, starting from the last ACR point under the ACR Threshold, to understand whether the CAIS has recovered. Second, we count the Points Under the Threshold (PUT) and the Points Above the Threshold (PAT). Each of these measures indicates how resilient the CAIS is. The goal of CAIS managers is to maximize the ratio of PAT over PUT. Finally, from a business perspective, a goal of CAIS managers is to minimize human effort. Thus, we define the Human Interaction Ratio (HI Average) as the average of the human interactions needed to classify an object during the disruptive state.
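As a rough illustration of these measures, the following Python sketch computes PUT, PAT, and the HI Average from a list of ACR points, and applies one possible reading of the State Length rule. The function names and the sample values are hypothetical. Two assumptions are made: a point exactly at the threshold counts toward PAT, and the CAIS is considered recovered once at least a state length of points has elapsed since the last point under the threshold.

```python
from typing import Dict, List


def resilience_measures(acr_values: List[float],
                        human_interactions: int,
                        acr_threshold: float) -> Dict[str, float]:
    """Compute PUT, PAT, their shares, and HI Average for one state."""
    n = len(acr_values)
    if n == 0:
        raise ValueError("at least one ACR point is required")
    put = sum(1 for v in acr_values if v < acr_threshold)  # points under
    pat = n - put                                          # points at/above (assumption)
    return {
        "PUT": put,
        "PAT": pat,
        "PUT_ratio": put / n,
        "PAT_ratio": pat / n,
        "HI_average": human_interactions / n,
    }


def has_recovered(acr_values: List[float],
                  acr_threshold: float,
                  state_length: int) -> bool:
    """One reading of the State Length rule (an assumption, see lead-in)."""
    below = [i for i, v in enumerate(acr_values) if v < acr_threshold]
    if not below:
        return True  # performance never dropped under the threshold
    points_since_last_drop = len(acr_values) - 1 - below[-1]
    return points_since_last_drop >= state_length


acr = [0.0, 0.2, 0.4, 0.6, 0.8, 0.6]  # illustrative values only
print(resilience_measures(acr, human_interactions=3, acr_threshold=0.4))
print(has_recovered(acr, acr_threshold=0.4, state_length=4))
```

Keeping the counts and the ratios in the same result makes it easy to compare runs with different numbers of iterations.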

CASE STUDY
In this paper, we introduce our real-world CAIS case study. The CAIS demonstrator, shown in Fig. 4, is a robotic arm (3) responsible for classifying objects based on their color. The robot learns the object box (4) by tracking the human movement (6) and mapping it to the object's color histogram. The robot has two vision sensors: one is placed above the conveyor belt (2), and another (5) tracks the human skeleton. The two vision sensors communicate with computer vision software (1).

Table 2 (excerpt):
PAT: The number of ACR points above the ACR Threshold in the disruptive state.
PUT to PAT Ratio: The ratio of the PUT to the PAT.
HI Average: The average of human interactions in the disruptive state.

To understand whether a CAIS is resilient upon disruptive events by modeling its performance, we aim to answer the following research questions:
• RQ1. How does CAIS performance evolve in an online learning process? To answer this question, we will show a rendered plot of CAIS performance while learning from the human, to explain the performance evolution in each state.
• RQ2. What are the rules and measures that indicate whether a CAIS is resilient upon disruptive events? To answer this question, we will examine the set of measurements and rules to tell whether the resulting performance represents a resilient behavior. Additionally, these measurements can set the baseline for future research that aims to create more resilient CAISs.
To address these questions, we perform an experiment where we automate our framework to track the performance evolution of our CAIS in classifying objects of three colors (Red, Green, and Blue). The robot learns the three classes (Box1, Box2, and Box3) by tracking the human hand movements and gestures. We enforce a disruptive event that degrades the performance of the classification learning process by turning off the supporting lights of the RGB camera (Fig. 4 (2)). The experiment then runs in iterations, where each iteration represents a new object entering to be classified. To avoid the learner being trained on one color more than the others, the objects enter the conveyor belt in the order red, green, then blue, in equal quantities.

Experiment Execution
The experiment is executed in iterations, and for each iteration we collect the AI prediction probability (the confidence level of prediction) and the iteration state (learning or operating state). To choose the value of the desired confidence level, we consider the three classes we have (1/3 ≃ 0.33). A value that is too low (0.33) led to close classification probabilities, so that two classes had almost the same classification probability. On the other hand, a high value (0.50) required a longer training time to reach a steady state. The value that best fits our needs is 0.40, which helps avoid both problems.
During the execution, CAIS performance is expected to transit through different states in which the AI component learns its task online. Fig. 5 shows the performance evolution of our CAIS demonstrator (collaborative robot) as a function of the time frame, where each time frame covers five overlapping iterations (or five objects). The figure shows the ACR values over 208 iterations, covering all the different performance evolution states. We triggered a disruptive event by switching off the lights over the conveyor belt that support the RGB camera. This disrupts the robot's vision and the color histogram extracted from the object image. In this experiment, our robot was able to recover both when we switched the light off and when we switched it back on. The performance evolution model shows that the system manages to reach a recovered state during the disruption. It is also able to go back to a steady state after the cause of the disruptive event is removed.

Takeaways
In this experiment, we equipped our real-world demonstrator with our framework. The results of the performance evolution model help address our research questions as follows.
Answer to RQ1. The resilient behavior of a CAIS is its ability to overcome unforeseen disruptive events and recover its performance from a degraded state to an acceptable state. Fig. 5 shows the performance evolution of our CAIS demonstrator (collaborative robot). The performance evolution shows our robot starting from a value of 0 and then learning from the human gestures until it enters the 1st Steady State. In the 1st Steady State, the performance follows a pattern of going up and down, which is expected due to slight changes in the environment (for example, the object position on the conveyor belt). The ACR Threshold is calculated as 0.40, which is the minimum value in the 1st Steady State. Then the disruptive event was enforced, and the performance [...] 3.
Answer to RQ2. The answer to this question shows the feasibility of our rules and measures defined in Table 2. First, for the State Length: in our experiment, the length of the 1st Steady State was 34 iterations; thus, when the 1st Disruptive State started, we had to wait for at least 34 iterations after the last point under the threshold. This results in a Recovered State of length 37 and a 2nd Steady State of length 40+, since it is the last state. The experiment shows a total of PUT = 23 and PAT = 59, which makes the ratios 0.28 for PUT and 0.72 for PAT; the higher the PAT ratio, the better. Finally, for the HI Average, we need the number of times the CAIS requires a human classification, in other words, the number of times it enters the learning state. In our experiment, the number of human interactions was 44 and the number of iterations was 82; thus, HI Average = 44/82 = 0.54. These measures summarize the ability of our CAIS to be resilient to the disruptive event of turning off the lights.
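The reported ratios can be checked with a short worked calculation. It reads the two reported counts as 23 points under the threshold (PUT) and 59 points above it (PAT), an interpretation consistent with the stated ratios, and assumes that both ratios are normalized by their sum, which equals the 82 iterations used for the HI Average.

```latex
\begin{align*}
\text{PUT ratio} &= \frac{\text{PUT}}{\text{PUT} + \text{PAT}} = \frac{23}{23 + 59} = \frac{23}{82} \approx 0.28, \\
\text{PAT ratio} &= \frac{\text{PAT}}{\text{PUT} + \text{PAT}} = \frac{59}{82} \approx 0.72, \\
\text{HI Average} &= \frac{\text{human interactions}}{\text{iterations}} = \frac{44}{82} \approx 0.54.
\end{align*}
```

Under this reading the two ratios sum to one, so reporting either of them is sufficient to compare experiments.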

THREATS TO VALIDITY
The major threats to the validity of our study are internal, conclusion, and external threats.
Internal Validity is related to the factors affecting the research findings. In our experiment, we had to set the size of the time frame to compute the ACR. We use a window size that is neither too small nor too large, namely five. However, we plan to run a stress experiment with various time frame sizes to check the effect on the final results.
Conclusion Validity concerns the appropriateness of the measurement analysis and the inferences drawn from the data. Our resilience rules and measurements are constructed from our observation of our experimental protocol. However, to have a meaningful representation, we consider them to be comparison rules for future research studies to enhance resilience in CAIS.
External Validity is related to the degree of support for the generalization of the theoretical results. The nature of our research is mainly exploratory, and generalization beyond our case study is needed to consolidate our claims. Thus, we plan to run more experiments with another real-world demo we have in-house and with simulators.

RELATED WORK
We have reviewed existing literature according to i) resilience and ii) online learning in the context of CAIS. In the following, we overview them.
Resilience. The state of the art has proposed several models to address resilience, such as Januário et al. [4] using a multi-agent model, Liu and Wang [5] using a tri-optimization model, and Zarandi and Sharifi [11] using a deep learning model, to mention a few. All proposed models aim to detect performance degradation or the disruptive event that caused it and to return the system to an acceptable performance state. Disruptive events vary from one system to another: for example, the disruptive event can be a security vulnerability of the system [11,5], a defect in the cyber or physical parts of the system [4], an event originating from an oracle component of the system such as humans [9], or an event coming from the system's surrounding environment [3].
Online learning. CAIS collaboration with humans is a process of transferring knowledge from the human to the autonomous components of the CAIS. Wang et al. [10] have defined three stages for human-system learning: first, the human teaches the robot through voice instructions; then the robot learns; and finally, the human and the robot collaborate. However, in their approach, updating the learning requires manually switching back to the learning stage, which is not the case in our dynamic continuous online learning. A recent survey of learning strategies for robot-human collaboration by Mukherjee et al. [6] has discussed several input modes, such as gaze, gesture, voice commands, and facial emotions. Specifically, our system uses human gestures for object classification, which requires scene understanding [6]. The survey illustrates that the majority of the work is about the safety and physical safety of the human-system collaboration.

CONCLUSION AND FUTURE WORK
In conclusion, this work proposes a novel framework to model the performance evolution of CAIS by tracking the different states of the online learning process with the human. Additionally, it defines the measures and rules to assess CAIS resilience over time. Finally, it automates the framework and its measures by designing and executing an experiment with a real-world collaborative robot. The results show the framework's ability to render the performance evolution of our CAIS as it passes through the different performance states (steady, disrupted, and recovered).
In future work, we plan to execute additional experiments to generalize our results to other real-world and simulated case studies. We are also designing comparative experiments to automatically support CAIS in decision-making, to enhance CAIS resilience based on the measures defined in this study. Moreover, we plan to consider further human attributes, for example human energy.

Figure 1: Online Learning Process in CAIS.
Steady State. The start of this state is indicated by a set of autonomous classifications for a whole [...] Steady State, we define the ACR Threshold, which is the minimum ACR value. This value represents the level of performance when the AI component is learning autonomously. A value of ACR = 0 indicates the end of the 1st Steady State and the start of the 1st Disruptive State after a disruptive event has occurred. During the 1st Disruptive State, and due to the policies of the online learning process, the CAIS aims to learn again to restore the performance to an acceptable level (ACR ≥ ACR Threshold). When this level is reached, the CAIS enters the Recovered State. The recovery in this state represents the system resilience during disruption. After recovery, the cause of the disruptive event can be removed, and the CAIS enters, with the same modalities as above, a Final State, which again includes a 2nd Disruptive State and a 2nd Steady State.

Table 1: Performance States Definitions.

Table 2: Resilience Rules and Measurements.
• RQ2.What are the rules and measures that indicate if CAIS is resilient upon disruptive events or not?To answer this question, we will examine the set of measurements and rules to tell if the resulting performance represents a resilient behavior or not.Additionally, these measurements can set