Three Challenges in Utilizing Machine Learning to Predict Human Behavior from Observational Data

This paper outlines the three principal challenges encountered during the machine learning efforts of the Real-Time Adaptive Procedure System (R-TAPS) project to learn the behavior of chemical plant workers and provides recommendations for future HRI projects that face similar problems. It focuses specifically on data labeling, annotation processes, and model evaluation. The R-TAPS machine learning efforts aimed to predict worker behavior during task execution in real time. The project employed a step-level label system, which made it difficult to predict worker behavior at the timestamp level. The annotation process lacked uniformity, leading to inconsistencies in the data entries. The presentation of model performance caused confusion because multiple performance values were reported and it was unclear which metric to evaluate. In response, this paper offers recommendations that address each challenge for future efforts.


INTRODUCTION
With the increasing demand for human-robot collaboration (HRC) in various industries, predicting human behavior is essential. In industrial applications, given the potential occurrence of unforeseen events or human error, robots that adapt based on predicted worker behavior can avoid collisions and injuries, allowing for safer human-robot interaction (HRI) [3]. Previous work in HRI that involves human behavior modeling is spread across various applications, and many utilize machine learning techniques. Tsitos et al. [9] present the potential for real-time predictions of human behavior within the application of competitive tasks through various machine learning classifiers. Liu et al. [4] utilize human behavior modeling for dealing with varying team members' expertise in tasks to adapt the structure in an effort to improve human-robot teaming. Schirmer et al. [8] focus on predicting anomalies or unexpected human behavior with an LSTM model and their possible effect in an industrial assembly use case. Al-Saadi et al. [1] propose using human behavior predictions with a random forest classifier for any necessary conflict resolution in collaborative tasks.

Table 1: "Work As Done" (WAD) Labels

WAD Label   Description
A           Step skipped
B1          Step done out of order but right action
B2          Step done out of order but wrong action
C           Step done in order but wrong action
D           Step done as prescribed

To enable these future scenarios, the authors embarked upon the Real-Time Adaptive Procedure System (R-TAPS) project. The R-TAPS project's objective is real-time worker behavior prediction during task execution. The motivation is that these predictions will allow for adaptations and interventions in high-risk environments in an attempt to minimize risk and worker errors. Labeled observational data collected from workers performing three different procedural tasks in a chemical plant was used to train a machine learning model that would predict worker behavior during task execution in real time. This led to a complex data management and model training pipeline, the development of which resulted in three clear lessons learned.
The paper's primary contribution is a description of the three challenges and recommendations to consider when utilizing machine learning in similar HRC and HRI applications. The rest of the paper details these three challenges and provides recommendations for each (Section 2) and ends with a summary and concluding remarks (Section 3).
The primary challenges faced were with respect to the data labels, annotation process, and model evaluation; these are discussed separately, and recommendations are provided for each.

Data Labels
There were five different "Work as Done" (WAD) labels [5] considered, shown in Table 1, that can describe the worker's behavior in completing a task with a given procedure. This label taxonomy has been used in the literature to compare the performance of workers as they complete steps in procedures [5]. However, the nature of these labels caused substantial friction during the machine learning process.
At first glance, this label taxonomy appears reasonable in this setting. Nevertheless, because the R-TAPS project was focused on predicting worker behavior in real time, at the timestamp level, this taxonomy became inadequate. The taxonomy is designed to be applied at the step level instead of at the timestamp level. In other words, the labels were associated with steps within the procedures rather than timestamps within the task execution.
At the project's inception, the expectation was that a model could predict the WAD label of the worker's current step. However, this was not possible because the current step cannot be known definitively: workers do not necessarily complete steps in order or discretely, and they may return to steps that they have previously started.
As a result, it is impossible to disambiguate, in real time, the step that the worker is performing. Since the data was annotated at the step level, inference must be done at the step level. It quickly became unclear what step should be the target of that inference.
The output of the model would be a distribution over WAD labels. This approach is beneficial because it is easy to train: it is a simple classification problem, and it is connected to the procedure in a way that can be leveraged in a real-world setting. However, as discussed, these WAD labels are not temporally aligned, and thus models trained on this data cannot predict when an intervention should be made as the worker progresses through the procedure. In hopes of addressing this problem, predictions for all steps in the worker's task needed to be performed in parallel, and thus, models were trained with this prediction target in mind.
This prediction target created a substantial amount of noise in the target function. This is because the same real-time features could correspond to different WAD labels when conditioned on different steps.
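To make this source of noise concrete, the following minimal Python sketch pairs each timestamp's features with the label of every step, which is one possible reading of the parallel prediction target described above. The step names, feature values, and helper functions are hypothetical illustrations and are not taken from the R-TAPS pipeline.

```python
from collections import defaultdict

# Hypothetical step-level WAD annotations for one worker's run.
step_labels = {"step_1": "D", "step_2": "A", "step_3": "B1"}

# Hypothetical real-time feature vectors, keyed by timestamp in seconds.
features_by_time = {0: (0.2, 1.0), 1: (0.4, 0.0)}


def build_parallel_targets(features_by_time, step_labels):
    """Pair every timestamp's features with every step's WAD label."""
    examples = []
    for t, x in sorted(features_by_time.items()):
        for step, label in step_labels.items():
            examples.append((t, x, step, label))
    return examples


def labels_per_feature_vector(examples):
    """Collect the set of distinct labels attached to each feature vector."""
    seen = defaultdict(set)
    for _t, x, _step, label in examples:
        seen[x].add(label)
    return dict(seen)


examples = build_parallel_targets(features_by_time, step_labels)
conflicts = labels_per_feature_vector(examples)

for x, labels in conflicts.items():
    print(x, sorted(labels))
```

In this toy setup, every feature vector is attached to all three labels, so the features alone cannot determine the target; only the step identity separates the examples.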
In short, if predictions must be at the timestamp level, labels must be at the timestamp level. This incongruence between the operational use case and the available data hampered the machine learning process. In the future, should timestamp-level prediction be required, it is highly recommended that the data be annotated at the timestamp level.
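As an illustration of what timestamp-level annotation might look like, the sketch below expands annotated time spans into one label per second of video. The span format, the label strings ("nominal", "deviation"), and the function name are assumptions made for this example; the paper does not prescribe a specific timestamp-level label set.

```python
from typing import List, Optional, Tuple

# Hypothetical span format: (start_s, end_s, label), with end_s exclusive.
Span = Tuple[int, int, str]


def spans_to_timestamp_labels(spans: List[Span], duration_s: int) -> List[Optional[str]]:
    """Expand annotated spans into one label per second.

    Seconds not covered by any span stay None so that unannotated
    gaps remain visible instead of being filled in silently.
    """
    labels: List[Optional[str]] = [None] * duration_s
    for start, end, label in spans:
        if not 0 <= start < end <= duration_s:
            raise ValueError(f"bad span: {(start, end, label)}")
        for t in range(start, end):
            if labels[t] is not None:
                raise ValueError(f"overlapping labels at t={t}s")
            labels[t] = label
    return labels


spans = [(0, 2, "nominal"), (3, 5, "deviation")]
timeline = spans_to_timestamp_labels(spans, duration_s=6)
print(timeline)
```

With labels in this form, each training example pairs the features at a timestamp with the label at that same timestamp, which matches the granularity of the desired prediction.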

Annotation Process
The annotation process involved paid annotators who were assigned videos of workers completing tasks to watch and annotate. Annotators were instructed to annotate all of the worker's actions in executing the given task, consisting of each procedure step, the worker's behavior, and other relevant data points. Each annotator was provided with a template spreadsheet, which they then filled out for the various data fields that were required. This process became difficult to manage because there were no guardrails in place to ensure the uniformity of data entered across the different annotators. Certain annotators would label complete spans of time for the videos, while others would annotate only the transition points. This would result in certain spans of data within a video being unannotated, annotated with spurious data, or annotated with incorrectly formatted data.
Additionally, different annotators developed their own shorthand for certain fields, and each annotator would have different ways of spelling colloquial terms (such as "walkytalkie," "walkietalky," "radio," etc.). Some annotators utilized different timestamp formats, which required different parsers during data cleaning. This effect is well known in language technologies and has been utilized to generate diverse training data when desirable [7]. Finally, the Google Sheets UI would occasionally reformat certain numerical fields, resulting in malformed data that had to be recovered manually. This type of inconsistency between the different annotators, tasks, and fields became an additional source of noise in an already noisy dataset.
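The timestamp problem can be illustrated with a small normalizer that maps several formats to seconds from the start of the video. The accepted formats here ("90", "01:30", "1:01:30", "1m30s") are hypothetical examples; the formats actually used by the R-TAPS annotators are not listed in this paper.

```python
import re


def parse_timestamp(raw: str) -> int:
    """Return seconds from video start for a few hypothetical formats."""
    text = raw.strip().lower()

    # Unit-suffixed style such as "1m30s" or "2h1m5s".
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if text and match and any(match.groups()):
        hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
        return hours * 3600 + minutes * 60 + seconds

    # Colon-separated style: "ss", "mm:ss", or "hh:mm:ss".
    if re.fullmatch(r"\d+(?::\d{1,2}){0,2}", text):
        total = 0
        for part in text.split(":"):
            total = total * 60 + int(part)
        return total

    # Reject anything else instead of guessing at its meaning.
    raise ValueError(f"unrecognized timestamp format: {raw!r}")


for raw in ["90", "01:30", "1:01:30", "1m30s"]:
    print(raw, "->", parse_timestamp(raw))
```

Rejecting unrecognized input, rather than guessing, surfaces malformed entries at cleaning time so they can be corrected at the source.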
The lessons learned offer two suggestions for the future, both focused on a single theme: guardrails. In other words, additional infrastructure is required to manage label noise generated by annotators.
(1) Utilize an additional layer of processing to ensure uniformity among the labels generated by the annotators. This would ideally be a layer of software (e.g., a data entry tool) that verifies the consistency of the entered data between the annotators. Alternatively, inter-annotator agreement or averaging annotations could be leveraged as another means to manage label noise [2,6].
(2) Prior to the annotation process, develop an ontology/taxonomy of devices, subtasks, and actions for each specific procedure. Then, during the annotation process, instruct annotators to select from this ontology/taxonomy while entering data. This will ensure uniformity among the entered data and serve to decrease noise.
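A minimal sketch of such guardrails is given below: a small device ontology with known aliases, plus a row validator that a data entry tool could run before accepting an annotation. The field names (`wad_label`, `device`), the alias table, and the function names are assumptions for illustration; only the WAD labels and the example spellings come from this paper.

```python
import re

# WAD labels from Table 1.
WAD_LABELS = {"A", "B1", "B2", "C", "D"}

# Hypothetical device ontology: canonical term -> accepted spellings.
DEVICE_ALIASES = {"radio": ["radio", "walkytalkie", "walkietalky"]}

_ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in DEVICE_ALIASES.items()
    for alias in aliases
}


def normalize_device(raw):
    """Map a free-text device name to its canonical term, or None."""
    key = re.sub(r"[^a-z0-9]", "", str(raw or "").lower())
    return _ALIAS_TO_CANONICAL.get(key)


def validate_row(row):
    """Return a list of problems found in one annotation row."""
    errors = []
    if row.get("wad_label") not in WAD_LABELS:
        errors.append(f"unknown WAD label: {row.get('wad_label')!r}")
    if normalize_device(row.get("device")) is None:
        errors.append(f"device not in ontology: {row.get('device')!r}")
    return errors


print(validate_row({"wad_label": "B1", "device": "Walky-Talkie"}))
print(validate_row({"wad_label": "E", "device": "phone"}))
```

Offering annotators a fixed selection list would make the alias table unnecessary; the table is mainly useful for cleaning data that was already entered as free text.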
The utilization of these two suggestions would have mitigated the vast majority of noise present in the R-TAPS project. This guidance will translate to future related machine learning projects.

Evaluation
At its core, the R-TAPS project was a multiclass classification effort focused on the A, B1, B2, C, and D WAD labels discussed above. Early on in this project, it was discussed that certain WAD labels may be more valuable than others (e.g., it may be more important to predict steps that were completed incorrectly rather than correctly). As a result, model performance metrics were presented for each individual class rather than collectively using an aggregation function. This decision ultimately generated confusion as it became difficult to determine when one trained model was performing better than another.
In the future, an effort should be made to converge on a single numerical value to judge model performance. It may not always be possible to converge on an ideal metric. However, even when there is uncertainty about the relevance of the chosen metric, it can be further refined. The complete class-level performance should not be discarded but rather should be presented alongside the single value for situational awareness. Within this work, once the team aligned on a weighted sum of class labels, the ablation studies that were conducted became far clearer, and understanding which models and features were better than others was an easier conversation to have.
HRI '24 Companion, March 11-14, 2024, Boulder, CO, USA
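One way to realize a single weighted value is sketched below: compute a per-class score and combine the scores with class weights. The use of F1 as the per-class metric, the specific weights, and the function names are assumptions for this example; the paper does not state which per-class metric or weights the team used.

```python
CLASSES = ["A", "B1", "B2", "C", "D"]

# Hypothetical weights: deviations (A to C) count twice as much as D.
WEIGHTS = {"A": 2.0, "B1": 2.0, "B2": 2.0, "C": 2.0, "D": 1.0}


def per_class_f1(y_true, y_pred, classes):
    """Compute one-vs-rest F1 for each class (0.0 when undefined)."""
    scores = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_score(scores, weights):
    """Combine per-class scores into one value in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total


y_true = ["A", "D", "D", "C"]
y_pred = ["A", "D", "C", "C"]

scores = per_class_f1(y_true, y_pred, CLASSES)
overall = weighted_score(scores, WEIGHTS)
print(scores)
print(round(overall, 4))
```

Reporting `scores` alongside `overall` keeps the class-level detail available while still giving a single number for comparing models.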

CONCLUSION
This paper outlines three challenges encountered during the R-TAPS machine learning efforts and provides recommendations for each.
(1) When making real-time predictions, data should be annotated at the timestamp level.
(2) To reduce noise due to inconsistencies within an annotated dataset, utilize an additional verification layer for annotation consistency between annotators and develop an ontology/taxonomy for annotators to select from.
(3) To increase model performance comprehension, efforts should be made to converge on a single value to represent model performance so comparisons can be easily made.
As machine learning techniques are utilized more in HRC and HRI applications, these recommendations should be taken into consideration.