Rating Prediction in Conversational Task Assistants with Behavioral and Conversational-Flow Features

Predicting the success of Conversational Task Assistants (CTA) is critical to understanding user behavior and acting accordingly. In this paper, we propose TB-Rater, a Transformer model that combines conversational-flow features with user behavior features to predict user ratings in a CTA scenario. In particular, we use real human-agent conversations and ratings collected in the Alexa TaskBot challenge, a novel multimodal and multi-turn conversational context. Our results show the advantages of modeling both the conversational-flow and behavioral aspects of the conversation in a single model for offline rating prediction. Additionally, an analysis of the CTA-specific behavioral features brings insights into this setting and can be used to bootstrap future systems.


INTRODUCTION
Recently, Conversational Task Assistants (CTA) that are able to guide users through manual tasks have been gathering attention [6,11,24] due to their applicability in everyday routines. These differ from and expand on other paradigms, such as conversational search [29] and task-oriented conversational agents [30]. In those paradigms, the user provides information to the assistant, and the system performs a task (e.g., searching or buying a ticket). In a CTA setting, it is the user who completes a task with the help of an assistant [11]. Creating these assistants requires various sub-systems working hand-in-hand to effectively help the user complete a variety of tasks, such as "baking a cake" or "fixing a leaky faucet" [11]. Figure 1 illustrates a partial CTA dialog. First, users are prompted to search for the task they want to do, which is done using an information retrieval (IR) step. Second, the user selects one of the provided tasks and enters the task-execution phase. Third, the task instructions are presented to the user. The user is then able to follow the task or create conversational sub-flows by asking task-specific or general questions, which the system should answer using domain knowledge. Due to the complex interactions between the user and the system, errors are prone to happen, which in turn leads to user dissatisfaction and low ratings. Being able to predict the rating of an interaction is thus a critical step to understand the problems of the system and act accordingly, in both online and offline settings [5,26]. These problems motivate our work on offline rating prediction, a challenging scenario where the goal is to predict a rating at the end of the interaction, taking into account the whole conversation. This task helps discover patterns in user ratings and, more importantly, detect problematic conversations, which can be further analyzed to discover avenues for system improvement. In particular, and to the best of our knowledge, we are the first to tackle the
problem of rating prediction in a Conversational Task Assistant (CTA) [11] setting. In Figure 1, we show an example of a low-rated CTA conversation. Low ratings can arise from various issues, such as ASR errors (recognizing "die hair" instead of "dye hair") or fallback responses, for example, when the system is not able to answer a question. How to use these various signals to predict the rating is one of our goals. To evaluate the rating prediction task, we leverage data collected during the Alexa Prize TaskBot Challenge [8,11], comprised of real human-agent CTA interactions. In this setting, users interact with Alexa devices, mainly using their voice, in a conversational, multi-turn, and multimodal way.
Evaluating conversational assistants is an active and challenging research subject in which the gold-standard metric is human-based evaluation [16,21,22]. Many works [3,5,15,26] leverage this human-labeled data to train automatic methods for conversational assistant evaluation. For example, we highlight the design of manual features in [23] for a flight booking system, and in [15] for search dialogs. In [3,5,26], models are proposed to automatically predict the rating/satisfaction on Alexa's SocialBot Challenge [20]. In particular, Choi et al. [5] show advantages in leveraging both textual and behavioral features. Motivated by this work, we created user behavior features that are specific to the CTA setting. Moreover, we use recent advances in Transformer models [7,19,25] to create conversational-flow features. Despite the significant differences between the chit-chat (SocialBot) and CTA (TaskBot) settings, we believe that a combination of both types of features can bring improvements in rating prediction. With this, we combine the features into a single model, which we call TB-Rater (Transformer-Behavior Rater), that surpasses the considered baselines. To conclude, we perform an ablation study, showing how the various design decisions influence the model's results, and analyze the importance of the behavior features in this novel setting.

TRANSFORMER-BEHAVIOR RATER
In this section, we present our proposed model, Transformer-Behavior Rater (TB-Rater), which combines two sets of features.

Model Architecture
2.1.1 Conversational-Flow Features. The content and flow of the dialog convey information about the current state and rating. Thus, we propose to use conversational-flow features with the aim of capturing intricate and discriminative dialog flows. To model these features computationally, we use a Transformer-based [25] language model, which is able to capture various patterns in the language and derive a representation of the conversation's state [14,28]. We represent each turn t_i of the dialog as:

t_i = [S] [RG_i] s_i [U] [I_i] u_i,

where s_i and u_i are the system and user utterances, separated by the special tokens [S] and [U] denoting the beginning of a speaker's turn. We go beyond the utterances and include flow-based information in the form of the detected intent [I_i], which has proven useful in [12,27], and the selected/activated response generator [RG_i], to provide extra information to the model. A conversation with n turns is modeled as the sequence:

[CLS] [DEV] [DOM] t_1 t_2 ... t_n.

The first token of the sequence is a special [CLS] token [7]. Specific to our model, we use additional special tokens denoting the type of device [DEV] (screen/screen-less) and the domain of the user's task [DOM], which can be none, a recipe, or a DIY task [11]. We use all turns of the conversation and perform left truncation of the input when it exceeds the maximum sequence length. Finally, we use the embedding of the [CLS] token, e_[CLS], as the representation of the conversation. This first set of features is then complemented with user behavior features.
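As an illustration, the input sequence for the Transformer could be assembled as below. This is a sketch only: the exact spellings of the special tokens (e.g., [DEV=...], [I=...], [RG=...]) and the turn schema are assumptions, not the system's actual vocabulary.

```python
def build_input_sequence(turns, device_type, domain):
    """Assemble the flat input string for the Transformer encoder.

    The token formats below are illustrative assumptions; the paper only
    specifies that device, domain, intent, and response-generator tokens
    are prepended to the respective parts of the sequence.
    """
    parts = ["[CLS]", f"[DEV={device_type}]", f"[DOM={domain}]"]
    for turn in turns:
        # System side: speaker token, response-generator token, utterance.
        parts.extend(["[S]", f"[RG={turn['response_generator']}]", turn["system"]])
        # User side: speaker token, detected-intent token, utterance.
        parts.extend(["[U]", f"[I={turn['intent']}]", turn["user"]])
    return " ".join(parts)
```

In practice, these special tokens would be added to the tokenizer's vocabulary so each is encoded as a single token.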

2.1.2 Behavior Features. Taking inspiration from Choi et al. [5], who showed a performance increase when combining text and behavior features, we follow a similar pattern and add manually engineered features specific to the CTA domain, with the aim of providing more domain context to the model. In particular, we use the last turn of the conversation, t_n, to get the behavior features b_n. In total, we created 70 features, divided into General, System-Induced, and CTA-Specific features.
General. Table 1 presents general conversational features, where we can see a large overlap with the features in Choi et al. [5]. System-Induced. We consider the values for a particular turn, the average, and the maximum across the conversation for user latency, system latency, and the scores given by the ASR model, as in [5]. CTA-Specific. In Table 2, we propose CTA-specific features, such as the number of searches or steps read, the number of turns in a phase, which indicates how deep the user is going into the conversation, or the counts of a specific intent as predicted by another model. These features were designed based on real-world interactions and can thus serve as a basis for other works in this setting. With this approach, we make predictions benefiting from both information streams, as shown in [5,17] for different domains. The model is then trained using the cross-entropy loss.
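A minimal sketch of how such behavior features could be computed from a logged conversation. The feature names and the turn schema here are illustrative stand-ins, not the paper's exact 70-feature set:

```python
def cta_behavior_features(turns):
    """Compute a handful of illustrative CTA behavior features from the
    conversation log available at the last turn. The real feature set in
    the paper has 70 features across General, System-Induced, and
    CTA-Specific groups; these names are hypothetical examples."""
    return {
        "n_turns": len(turns),
        # CTA-specific counts of intents predicted for each user turn.
        "n_searches": sum(t["intent"] == "search" for t in turns),
        "n_steps_read": sum(t["intent"] == "next_step" for t in turns),
        # System-induced signal: how often the system fell back.
        "n_fallbacks": sum(t.get("fallback", False) for t in turns),
        # Whether the user ever entered the task-execution phase.
        "started_task": any(t["phase"] == "task_execution" for t in turns),
    }
```

The resulting feature vector b_n is what the behavior-only baselines consume directly, and what TB-Rater fuses with the Transformer's conversation embedding.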

EXPERIMENTS

Experimental Setting
3.1.1 Alexa Prize TaskBot Dataset. To evaluate our models, we use internal data collected in the first Alexa Prize TaskBot challenge [8,11]. This challenge focuses on developing a CTA that helps users perform real-world manual tasks in the cooking and DIY domains. It is also the first multimodal challenge of this type, combining both voice-only and voice-and-screen interactions. In this setting, the system interacted with thousands of users, and for each conversation, at the end of the interaction, users are asked to provide an optional rating on a 1 to 5 scale. However, only about 10% of the users provide a rating, making it hard to pinpoint which conversations require more attention, further motivating our work. We used a stable version of the system to collect ratings and considered only rated conversations with a minimum of 3 turns. In total, we used 1681 conversations, which we separated into training (80%), validation (10%), and test (10%) sets. The statistics of the dataset are in Table 3. We observe that, on average, a dialog has 8 to 9 turns, with a standard deviation of 6.8, indicating a large variety of conversation lengths. In terms of ratings, we see a larger concentration in 1 and 5, with a standard deviation of 1.55, indicating that users generally have a strong opinion about the system's performance, as also noticed in the SocialBot domain [3].

Task and Metrics.
In this work, we define the task of rating prediction at the end of the interaction. This task is challenging because it requires a model capable of understanding the entire conversation and identifying the non-trivial subtleties that contribute to the rating.
Following a similar approach to Choi et al. [5], we cast the task as binary classification by mapping ratings 1-3 to 0 and 4-5 to 1, instead of using the original 1-5 rating scale. In terms of metrics, we consider accuracy (Acc), precision (P), recall (R), and F1.
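The label mapping above can be sketched directly:

```python
def binarize_rating(rating):
    """Map a 1-5 star rating to a binary satisfaction label,
    following the setup above: ratings 1-3 -> 0, ratings 4-5 -> 1."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return 1 if rating >= 4 else 0
```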

Methods and Baselines
Behavior-only - We tested RandomForest [2], AdaBoost [9], Bagging [1], GradientBoosting [10], XGBoost [4], LogisticRegression, and an SVM. All methods are implemented using sklearn [18] and use the behavior features of the last turn. Conversational-Flow-only - To encode the dialog features, we used a BERT model [7] with a classification head. We also adapted a T5 [19] model for classification.
Conversational-Flow and Behavior - We implemented ConvSat [5], which combines text features at the utterance and character levels using BiLSTMs [13], which are then combined with behavioral features. We also present the results of the proposed TB-Rater model.

General Results
We present the results of the various methods on the Alexa TaskBot dataset in Table 4. First, we observe that the best behavior-only method is the SVM. Regarding conversational-flow-only methods, the BERT-Base model achieves the best results, surpassing the encoder-decoder model T5. This might be explained by BERT having a specific, pre-trained classification token [7], while T5 is adapted to classification using a text-to-text paradigm [19].
The BERT-Base approach also surpasses the best behavior-only method (SVM), showing that using only conversational-flow information may be a good alternative for rating prediction, avoiding the need to design domain-specific features. Comparing the conversational-flow and behavior models, we see that the best results are achieved by the proposed TB-Rater model, surpassing all of the considered baselines. This result is in line with previous work [5,17] that showed advantages in combining text and behavior features. However, ConvSat [5], which also uses both types of features, did not perform as well. We believe this may be due to having too little training data to effectively train the character- and word-level embeddings, making Transformer-based models a more robust approach. To conclude, the results show that it is possible to have conversational-flow-only models that are on par with classic approaches based on manually engineered features. We also show that combining both types of features in TB-Rater brings an improvement in performance. Furthermore, focusing on the end of the conversation is more important for predicting the rating; this can be attributed to the last turns having more impact than those at the beginning, indicating a possible recency bias.

Error Analysis.
While user subjectivity plays an important role [3,26], we believe that a portion of the model's errors can be categorized. Thus, we analyzed 50 of TB-Rater's error cases (counts of each error type are given in parentheses). We noticed that the model generally predicts a low rating if the interaction is stopped early, even when the user is able to find and/or start a task (12). Another mistake occurs when the user starts a task different from the one they were looking for but still goes further into the task, usually with consecutive dull responses (e.g., "next step"); in this case, the model predicts a high rating despite the user giving a low one (9). There were also cases where, despite the system giving a considerable number of fallback answers, the conversation still moved forward, yet the model predicted an unsatisfactory conversation (10). Finally, user ratings have a lot of variability, and some do not seem to reflect how the interaction went, for example, "throw-away"/bad interactions that returned high ratings (7), or interactions where the user was not impressed with the system, returning a low rating despite the system responding to every request correctly (12). These results reaffirm the volatility of user ratings [3,5] and the difficulty of the task, shedding light on the most common error cases.

When the user finishes a task, there is a large token overlap between the last two system utterances. The higher system latency on the last turn also appears to contribute to a positive rating, which at first seems counter-intuitive. After a closer analysis, we attribute this to the last turn of a finished task having a larger latency, while an abrupt stop has a latency value of zero. In practice, these two features indicate that finishing a task is an important signal for predicting the rating. Other features, such as the number of steps read, next step, and started task, suggest that the user is engaged with the system and going deeper into a task.
Regarding the negative coefficients, we see that a larger number of fallbacks leads to a lower rating. A high average system overlap denotes that the system is giving a similar response in multiple turns, which might indicate that the user is stuck. Finally, a higher value of the domain feature indicates that the user did not search for a task, while, in opposition, a high number of searches indicates that the user is struggling to find a task, resulting in a lower rating. It is also worth noting that, out of the 14 features, 9 are from the CTA-specific set, showing the relevance of the proposed features.
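The kind of coefficient inspection described above can be reproduced in a few lines; the feature names below are hypothetical stand-ins for the paper's actual feature set:

```python
def top_coefficients(names, coefs, k=3):
    """Rank features by absolute logistic-regression coefficient.

    The magnitude indicates importance, and the sign indicates whether the
    feature pushes the prediction toward a positive (>0) or negative (<0)
    rating."""
    ranked = sorted(zip(names, coefs), key=lambda nc: abs(nc[1]), reverse=True)
    return ranked[:k]
```

With a fitted sklearn LogisticRegression, `names` would come from the feature columns and `coefs` from the model's learned coefficients.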

CONCLUSION
In this paper, we propose TB-Rater, a model that combines conversational-flow and behavioral features to perform rating prediction in the novel CTA setting. We show the advantages of combining both types of features by evaluating on human-agent interactions collected in the Alexa TaskBot challenge. Moreover, we provided a comprehensive set of CTA-specific features and measured their importance. The proposed model can be used to estimate a rating, which may allow for the discovery and prioritization of system errors. In future work, we intend to apply the model in an online setting, using its predictions to change the course of a conversation.

User: I want to die my hair.
System: Sorry, I can't help with that type of task.
User: I said I want to dye my hair!
System: Ok, found the task "how to dye my hair".
User: Start the task.
System: Step 1: Wash your hair 24-48 hours before ...
User: What type of product should I use?
System: I'm not quite sure how to answer that.

Figure 1 :
Figure 1: User and CTA example of a low-rated interaction.System and user utterances were emulated from real dialogs.

Table 1 :
General features. A "/" in a feature denotes more than one feature.

Table 3:
Alexa TaskBot Dataset statistics.

2.1.3 Features Combination. First, we use two feed-forward neural networks (FFNN), FFNN_t and FFNN_b (with ReLU activations), that take as input the conversation embedding e_[CLS] and the behavior features b_n, respectively. The resulting representations are concatenated and passed through a final FFNN_out that combines the two streams: FFNN_out(FFNN_t(e_[CLS]) ⊕ FFNN_b(b_n)).
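A shape-level sketch of this combination step, using small placeholder dimensions and random weights (biases omitted for brevity). It mirrors FFNN_out(FFNN_t(e_[CLS]) ⊕ FFNN_b(b_n)) structurally but is not the trained model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def combine_streams(e_cls, b_n, W_t, W_b, W_out):
    """Fuse the conversational-flow and behavior streams.

    Dimensions are illustrative; biases and training are omitted."""
    h_t = relu(e_cls @ W_t)            # FFNN_t on the [CLS] embedding
    h_b = relu(b_n @ W_b)              # FFNN_b on the behavior features
    h = np.concatenate([h_t, h_b])     # the ⊕ concatenation
    return h @ W_out                   # FFNN_out: logits for the 2 classes
```

Training would then minimize the cross-entropy loss over these logits, as stated above.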

Table 4 :
Avg. result of 3 runs on the Alexa TaskBot test set.

Table 5 :
TB-Rater ablation study on the Alexa TaskBot test set.

Ablation Study. In Table 5, we analyze how our design decisions influence the model's results. As seen previously, removing behavioral features negatively affects the results. In w/o Step Token, we keep the text of the task's step instead of replacing it with a special [STEP] token. We see a decrease in performance, which we attribute to the step text not being especially important for the rating. In addition, keeping the text of a step also decreases the number of turns that fit into the Transformer model's input, due to steps typically being long. In w/o Additional Tokens, we remove the special tokens pertaining to the device, domain, intent, and response generator, but keep the special [STEP] token. Again, we see that adding extra information in the form of these tokens increases performance. Finally, we test the TB-Rater model but truncate inputs larger than the maximum input size from the right side (end of the conversation) instead of the left side. Here, we observe the worst results out of all methods.
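The two truncation strategies compared in this ablation amount to simple slicing: left truncation keeps the end of the conversation, while right truncation (the ablation variant) keeps the beginning:

```python
def truncate_left(token_ids, max_len):
    # Keep the last max_len tokens, i.e., the end of the conversation
    # (the default strategy used by TB-Rater).
    return token_ids[-max_len:]

def truncate_right(token_ids, max_len):
    # Keep the first max_len tokens, i.e., the start of the conversation
    # (the ablation variant, which performed worst).
    return token_ids[:max_len]
```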
3.3.4 Behavior Feature Importance. In Figure 2, we present the top-14 absolute feature coefficients for the Logistic Regression model. Positive/negative coefficients indicate a feature that predicts a positive/negative rating. Starting with the system word overlap on the last turn, this feature indicates that the last two system utterances share a large number of words, which happens when the user finishes a task.