DeliData: A dataset for deliberation in multi-party problem solving

Group deliberation enables people to collaborate and solve problems; however, it is understudied due to a lack of resources. To this end, we introduce the first publicly available dataset containing collaborative conversations on solving a well-established cognitive task, consisting of 500 group dialogues and 14k utterances. In 64% of these conversations, the group members are able to find a better solution than they had identified individually, and in 43.8% of the groups whose final solution was correct, none of the participants had solved the task correctly by themselves. Furthermore, we propose a novel annotation schema that captures deliberation cues and release all 14k utterances annotated with it. Finally, we use the proposed dataset to develop and evaluate two methods for generating deliberation utterances. The data collection platform, dataset and annotated corpus are publicly available at https://delibot.xyz.


INTRODUCTION
Group deliberation occurs in a variety of contexts, such as hiring panels, study groups, and scientific project meetings.
It is traditionally explored in the field of psychology, where researchers examine the conditions under which a group can make better decisions. Mercier and Sperber [26] discuss how a group can outperform even the most knowledgeable individual within it: the assembly bonus effect. This was also demonstrated by Navajas et al. [28], who showed that small focus groups can outperform the wisdom of the crowd.
The aforementioned psychology research has mainly focused on the outcomes of the discussion, with less focus on analysing the discussion itself. However, the latter is necessary to understand what makes deliberation successful and to inform interventions to facilitate it. This is also echoed by Vecchi et al. [38], who identify studying argumentation and deliberation as essential for the future of digital democracy.
In order to study what makes deliberations successful and learn how to intervene to this effect, we need a dataset that contains discussions where groups collaborate to solve a task. Furthermore, the task should be such that the correctness of the decisions made can be objectively measured. Most existing datasets are between two interlocutors [3,5,13], thus not containing group discussions. Focusing on group datasets, one could consider negotiation dialogues [1], which, while multi-party, are adversarial in nature, therefore not containing collaboration. Publicly available datasets containing collaborative group discussions are WikiDisputes [12] and AMI [6], but neither is associated with an objective measure of decision correctness, thus not enabling researchers to assess how well the conversation went. Niculae and Danescu-Niculescu-Mizil [30] collected a group dataset containing collaborative problem-solving conversations with an objective measurement of decision correctness, but their dataset is not publicly available.

Fig. 1. Abridged conversation from our dataset between 3 people solving the Wason card selection task.
In this work, we present the first publicly available dataset for group deliberation associated with a quantitative measure of decision correctness: DeliData, the Deliberation Dataset. An example conversation is shown in Figure 1, with a group deliberating to solve the Wason card selection task [40], a well-studied task in cognitive psychology. In the example, the group engages in various deliberation strategies: a participant is moderating the conversation by prompting the group for a response (utterance 1), whereas in utterance 4 a participant suggests exploring a different solution. Overall, the group starts with a common, but wrong, solution (utterances 2 and 3) and converges on the correct solution (utterances 6 and 9).
We focus on the Wason card selection task as it is a well-studied task by researchers in the psychology of reasoning [14]. Furthermore, it does not rely on prior knowledge of participants and is well-characterised as demonstrating the benefits of group deliberation [26]. This allows us to focus on the fundamentals of the deliberation process itself, and to study the factors which affect group decision-making: how individuals bring their knowledge and intuitions to the group, and how the exchange of arguments enables groups to combine information to navigate the problem space. DeliData allows us to test ideas on how this is done successfully, and thus suggest appropriate interventions. Using a task that requires no prior knowledge increases the chances that our findings are transferable to other domains of argument exchange, such as education (e.g. tutorial groups), formal moderation (e.g. board meetings, policy discussions), and informal discussions (e.g. deciding where a group should get food in the evening).
The DeliData corpus contains 500 group dialogues, each of them associated with a measurement of decision correctness (hereafter task performance) before and after the group discussion. Given these measurements, we show that after discussing the solution, 64% of the groups perform better at the Wason task, compared to the performance of their members individually. Moreover, in 43.8% of the groups who had a correct answer as their final solution, none of the participants had solved the task correctly by themselves, thus demonstrating how people can solve the task better through deliberation. To aid future analysis and dialogue system development we propose an annotation schema that captures conversational dynamics and deliberation cues in collaborative conversations, and release the 500 dialogues as an annotated corpus with 14k annotated conversational turns in total. Further, we showcase the multiple possible uses of the DeliData corpus by conducting a wide range of analyses and modelling experiments, including predicting whether the deliberation improves the decision-making and the generation of utterances that can probe the conversation by asking questions. Finally, we demonstrate the generalisability of the annotation schema and the annotated dataset by automatically annotating a real-world collaborative task: a group of people debunking deepfake images.

RELATED WORK
Niculae and Danescu-Niculescu-Mizil [30] investigated group collaboration in the context of playing a game attempting to geo-locate a photo on a map. In their experimental setup, they first evaluate each participant individually, after that they initiate a group discussion, and finally they ask the group to make a decision together. Unfortunately, their dataset is not publicly available, and thus cannot be used in other studies. Likewise, Kim et al. [21] investigate how groups of people collaborate in solving a task together, as well as how a dialogue system can be incorporated within the discussion.
Unfortunately, their dataset contains only 12 discussions, making it too small for any reasonable analysis or dialogue system development, and, similarly to the dataset of Niculae and Danescu-Niculescu-Mizil [30], it is not publicly available.
Wikipedia is a popular source of collaborative conversations. Hua et al. [18] collect 91M discussions from Wikipedia, together with the edits discussed in them. It is the largest dataset that captures group collaboration, but it is not supported by an annotated corpus. This is partly addressed by Al-Khatib et al. [2], who annotate 200k discussion turns from Wikipedia in 33 dimensions based on discourse acts, argumentative relations and semantic frames. However, unlike the conversations of Niculae and Danescu-Niculescu-Mizil [30] and the work presented in this paper, there is no assessment of whether the participants in a conversation on Wikipedia reached a better decision, which renders assessing constructiveness more difficult because there is usually no objectively correct answer.
Related to constructive conversations is the research on negotiation dialogues, which have been explored in the context of games [10,19] and trading [15,23]. However, even though negotiation dialogue research often deals with multiparty conversations [10], such systems are by nature adversarial, rather than constructive.
Multiparty conversations are also the focus of Carletta et al. [6], who created a multi-modal corpus of business meetings containing audio, video, transcriptions and auxiliary materials provided to the participants. However, they did not explore deliberation strategies, nor did they try to measure the productivity of the group. Using parts of this dataset, the CALO project [36] proposed a toolkit to assist group meetings, covering dialogue act segmentation, action item recognition and other tasks, but no attempt to assess constructiveness was made. Similarly, previous research [16,41] has investigated how people change their minds in online forums, but as the topics in online forums are very complex and the data is noisy, there is no objective measure of constructiveness. Finally, de Bayser et al. [11] evaluated turn prediction in the context of group dialogues. They evaluate their system on 3 datasets: one is proprietary, one is artificially created by combining 1-to-1 dialogues from Budzianowski et al. [5], and the third consists of transcripts of a popular TV show, which, while containing true multi-party dialogues, are not collaborative.

EXPERIMENTAL SETUP
In our experiments with the Wason card selection task [40], participants are presented with 4 cards with a number or a letter on them. They have to answer the following question: "Which cards should you turn to test the rule: All cards with vowels on one side have an even number on the other?". A common fallacy is to select the vowel and the even number (i.e. selecting the two cards mentioned in the question), which is incorrect, demonstrating confirmation bias [26].
The correct answer is to turn the vowel, to check for an even number on the other side, and to turn the odd number, to verify there isn't a vowel on the other side.
We calculate task performance in two ways. First, we consider a coarse-grained (binary) scoring of the task: 1 (Correct) if the vowel and the odd number are selected, 0 (Incorrect) otherwise. Recognising that the coarse-grained scoring may needlessly penalise answers that are close to the correct one, we also devised an alternative fine-grained scoring. We grant 0.25 points for (i) turning the vowel or the odd number, and (ii) not turning the even number or the consonant. Therefore, if the participant submitted a correct solution, their score would be 1, if they are off by one card 0.75, and so on. We also calculate performance gain by subtracting the average of the solo solutions from the average of the group performance. For example, if the average score of participants' solo submissions was 0.5 and improved to 0.75 after the discussion, the group performance gain would be 0.75 − 0.5 = 0.25 (a minimal sketch of this scoring is given after the protocol description below). We collect the data using the following protocol (full participant instructions available in Appendix A.1):
(1) Solo Phase. Each of the participants in the group is presented with the same 4 cards and submits a solution to the task.
(2) Group Phase.Following the solo phase solution submission, participants gain access to a chatbox to share their solutions and discuss.We encourage them to do so for at least 5 minutes but no longer than 7 minutes without enforcing these time limits; thus there are cases with very short and very long conversations.
(3) Revised Submission.After discussing their solutions, the participants are asked to revise their initial card selection and submit again.
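To make the scoring concrete, the sketch below implements the coarse- and fine-grained scores and the performance gain described above; the card encoding and function names are illustrative choices of ours, not part of the released DialogueDen code.

```python
# Minimal sketch of the coarse- and fine-grained Wason scoring described above.
# The card labels and helper names are illustrative, not taken from the released code.

CORRECT_CARDS = {"vowel", "odd"}                     # cards that should be turned
ALL_CARDS = {"vowel", "consonant", "even", "odd"}

def coarse_score(selected: set[str]) -> int:
    """1 if exactly the vowel and the odd number are selected, 0 otherwise."""
    return int(selected == CORRECT_CARDS)

def fine_score(selected: set[str]) -> float:
    """0.25 points for each correct card turned and each incorrect card left unturned."""
    turned_correct = len(selected & CORRECT_CARDS)
    left_incorrect = len((ALL_CARDS - CORRECT_CARDS) - selected)
    return 0.25 * (turned_correct + left_incorrect)

def performance_gain(solo_selections: list[set[str]], group_selections: list[set[str]]) -> float:
    """Average revised (group) fine-grained score minus average solo score."""
    solo = sum(fine_score(s) for s in solo_selections) / len(solo_selections)
    group = sum(fine_score(s) for s in group_selections) / len(group_selections)
    return group - solo

# Example: a solo average of 0.5 and a revised group average of 0.75 yields a gain of 0.25.
```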
We posted our data collection on the crowd-sourcing platform Mechanical Turk with the following job specification:
(1) Everyone who completes the task is paid $2.00 (approx. £1.60). Participants are given a bonus of $1.00 (£0.80) if they return the right answer. As the average time for participation is about 8 minutes, each participant is paid £12/hour (or £18/hour if they solve the task correctly). This is between 35% and 102% above the UK's National Living Wage.
(2) No personal information is collected and the participants are asked not to share anything that may reveal personal details.
(3) We recruited only adult participants from countries where English is a primary language, and required them to complete a simple reading comprehension test. The only language used in our dataset is English.
Participants are informed that we are investigating how people collaborate in solving a cognitive task and that we will be saving chat transcripts. This experimental protocol was approved by the ethics committee of the authors' institution.
The data collection is performed using a web application we call DialogueDen, which we open-source together with this study. The design of the platform allows us to record solo and group selections and the state of the game at key points of the experiment. This data can be used to identify when a participant reached the correct decision, even if they don't express it explicitly in the chat. Moreover, we integrated a number of features into DialogueDen that are specific to data collection on Mechanical Turk, addressing various issues that arise when collecting group conversations in an unsupervised manner. These are part of the code release and are presented in detail in Appendix A.2.

DELIDATA DATASET
Using the experimental protocol above we initially conducted a pilot study, in which we collected 18 group dialogues with 53 volunteers from a university psychology department who did not have prior knowledge of the task. After that, we ran a larger scale data collection on Mechanical Turk, which is often used for data collection in behavioural research and often produces results similar to in-lab experiments [9]. This data collection was not moderated in any way, making it a realistic data collection process. We ensured the quality and anonymity of the data from MTurk by manually checking each conversation. We excluded a total of 160 conversations that were too short, of poor quality, or with too few actively engaged participants. Thus, we release 482 dialogues that are of comparable quality to our in-lab pilot.
Summarised statistics of the two subsets are presented in Table 1. While the two subsets differ in terms of absolute performance, the improvement from solo to group performance is substantial in both data collections for both coarse- and fine-grained metrics, in agreement with results from psychology research on offline deliberation [26], thus validating our data collection approach using MTurk. Another difference is that the average number of utterances per dialogue is lower on MTurk, which we attribute to the psychology student volunteers being more dedicated than crowd workers.
In Table 2 we compare three multi-party dialogue datasets: StreetCrowd [30], Settlers of Catan (SoC) [1], and ours. Of these three, only two are collaborative, ours and StreetCrowd, as SoC is among players competing against each other. Ours is the only one containing collaborative group conversations available for research. Moreover, while it contains fewer dialogues than StreetCrowd, these are 2.5 times longer in terms of utterances, and thus more likely to exhibit collaborative strategies spanning multiple utterances.

ANNOTATING DELIBERATION CUES
The DeliData corpus introduced in the previous sections contains multi-party discussions of people solving the Wason card selection task. The transcripts are augmented with metadata, such as when someone clicked on a card or submitted a new solution via the interface. Both the transcripts and the metadata can be used for various tasks, such as dialogue system training, evaluation of conversation performance, and analysis of solution-finding patterns. That said, in order to enable more sophisticated analysis and application in the study of deliberation, a fine-grained annotation of each conversation turn would be required. In this section, we introduce DeliAnnotation, an annotation scheme designed to study deliberation in collaborative problem-solving.

Desiderata and previous work
In order to annotate DeliData, we draw inspiration from theoretical work on argumentation [35,39], and from studies investigating how people reason and deliberate together to achieve a common goal [26].
Given these previous studies, as well as the need to ensure the transferability of the annotated data, we outline 3 key criteria that an annotation scheme should fulfil:
• The annotation scheme should capture general argumentation structure, e.g. distinguishing between arguments about reason and arguments about a solution. Simple interactions like agreement and disagreement should also be captured.
• The annotations should highlight deliberative cues such as moderation, argument probing, and solution management.
• The annotation scheme should be specific enough to capture deliberation and collaborative problem-solving phenomena, but should be general enough to be applied to a range of problem-solving tasks.
Given these desiderata, we first considered using the annotation schemata previously proposed for discourse parsing [42] and Wikipedia discussions [2]. Both of these schemata capture some discussion markers, such as questions, argumentation and agreement, which are important in analysing constructive discussions. Unfortunately, neither of them captures how people collaborate, which could be achieved by identifying deliberative cues. Furthermore, these schemata carry over some specific labels related to discourse parsing and Wikipedia editing. In terms of collaborative discussions, the MapTask schema by Carletta et al. [7] annotates conversations between two participants who play a game together. Their annotation scheme is limited to basic interactions such as questions and answers, and it is missing important deliberative cues such as probing, argumentation and solution tracking.

DeliAnnotation
We propose an annotation schema with 3 levels of annotation, each focusing on different aspects of deliberation. Figure 2 gives an overview of the schema, and we describe it in detail in the remainder of this section.
At the top level of the schema, we are interested in identifying probing deliberation, i.e. any utterance that provokes discussion, deliberation or argumentation without introducing novel information (Hey, @Cat what do you think was the solution?). Such utterances can be considered conversational interventions that may change the flow of the conversation to induce further arguments or to moderate the conversation. In Figure 1 these would be utterances 1, 4, and 7.
We also recognise that most utterances in a conversation are not probing, but are still useful for the conversation. We label these utterances as non-probing deliberation (abbreviated as NPD); they include all discussion concerned with the task's solution and participants' reasoning (I think the answer is A, because we have to check each vowel for sure). Finally, we include a None label that covers all utterances that do not fall into the previous two categories. These utterances often include familiarities (Greetings fellas) or hesitation cues (hmm...). We refer to this first level of annotation as Type. After distinguishing between probing and non-probing deliberation, we classify each utterance into one of 5 roles at the second level:
• Moderation (exclusive to probing deliberation): Moderation utterances are not concerned directly with the task at hand, but rather with how participants converse about it (Let's discuss our initial solutions). These utterances are concerned with managing the conversation dynamics.
• Reasoning: Utterances focusing on argumentation; they can be both probing (Why did you think it wasn't 8?) and non-probing (I think it would be 7 to test if it would be incorrect).
• Solution: Utterances that manage the solution of the task. They can be either probing (Are we going for A and 4?) or non-probing (I think the answer is 7 and A).
• Agree and Disagree (exclusive to non-probing deliberation): Utterances expressing agreement or disagreement with a previous argument or solution.
An important caveat is that Reasoning takes priority over other labels. Some utterances may carry additional information beyond what is captured by their type and role, i.e. the first two levels of the annotation. Therefore, we introduce a set of additional labels that mark specific phenomena in the conversation, defined as follows:
• specific_addressee: Utterances explicitly addressing specific participant(s) (@Llama what do you think?).
• complete_solution and partial_solution: Utterances advocating for either a complete task solution (Let's turn A and 7) or a partial one (one of the cards is A).
• solution_summary: Utterances that recall previous solutions to prompt for an agreement (So, do we all agree on A and 5?).
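For working with the schema programmatically, one possible encoding of the three annotation levels is sketched below; the enum and field names are our own illustration, not the format of the released corpus.

```python
# Illustrative encoding of the DeliAnnotation hierarchy; names are ours, not the corpus schema.
from dataclasses import dataclass, field
from enum import Enum

class UtteranceType(Enum):          # level 1: Type
    PROBING = "probing"
    NON_PROBING = "non_probing_deliberation"
    NONE = "none"

class Role(Enum):                   # level 2: Role
    MODERATION = "moderation"       # probing only
    REASONING = "reasoning"
    SOLUTION = "solution"
    AGREE = "agree"                 # non-probing only
    DISAGREE = "disagree"           # non-probing only

class AdditionalLabel(Enum):        # level 3: additional labels (zero or more per utterance)
    SPECIFIC_ADDRESSEE = "specific_addressee"
    COMPLETE_SOLUTION = "complete_solution"
    PARTIAL_SOLUTION = "partial_solution"
    SOLUTION_SUMMARY = "solution_summary"

@dataclass
class AnnotatedUtterance:
    speaker: str
    text: str
    type: UtteranceType
    role: Role | None = None                        # None when type == NONE
    additional: set[AdditionalLabel] = field(default_factory=set)
```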

Annotated dataset
Using the annotation schema introduced in this section we annotated all dialogues presented in Section 4. We performed an annotation agreement study between 3 annotators on 41 of the dialogues using Cohen's kappa [8]. We obtained an inter-annotator agreement of 0.75 on the first level, 0.71 on the second level, and an average agreement of 0.53 on the additional labels.
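The agreement figures above can be reproduced for any pair of annotators with a standard implementation of Cohen's kappa; the snippet below is a minimal sketch using scikit-learn, with illustrative variable names.

```python
# Pairwise Cohen's kappa between two annotators' label sequences (illustrative sketch).
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """labels_a and labels_b are aligned lists of labels (e.g. level-1 Types) for the same utterances."""
    return cohen_kappa_score(labels_a, labels_b)
```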

The label distribution for the first two levels is presented in Table 3. Overall, the number of Reasoning and Solution utterances is substantial, confirming that the subjects in our data collection engaged in substantial discussions about the solutions and their reasoning. The corpus also contains 1739 Probing utterances, most of which are Moderation, while probing for Reasoning or Solution is fairly evenly distributed. This suggests that the strategies chosen for annotation are commonly used. Finally, 3267 utterances were annotated as non-deliberative ("None").
In Table 4 we present the distribution of the additional labels. The column Count shows the total number of occurrences of each label, while Prevalence shows how often the label occurs across all utterances, including those without any additional label. The most prevalent label is complete_solution, appearing in about 20% of the utterances. While the other additional labels occur less frequently (around 5% or less), they may still be useful for dialogue analysis.

Two-party and multi-party conversations
While two-party and multi-party (3 or more participants) conversations in our dataset have similar statistics, there are notable differences that we highlight in this section. In Figure 3, we present histograms comparing three conversational statistics: the total number of messages, the number of unique tokens, and participation balance, represented by entropy. First, dialogues between two interlocutors mostly have between 10 and 25 utterances, while group discussions in DeliData are distributed over a wider range, between 20 and 40 utterances, with a long tail of conversations longer than 50 utterances.
This occurs naturally, as multiparty discussions contain more arguments and exchange of ideas. Likewise, participants in these discussions tend to use a larger vocabulary, as shown in the histograms of unique tokens.
In this analysis, we also look at how balanced the conversations are, i.e. whether all of the participants contributed equally. We calculate the participation entropy similarly to Niculae and Danescu-Niculescu-Mizil [30]: it is maximised if everyone participates equally, and approaches 0 if there is a large imbalance. In our dataset, the balance for two-party conversations is better, with 40% of the discussions being almost uniformly balanced, while in multi-party discussions it is often the case that one of the participants is driving the discussion. This is not surprising, as in one-to-one conversations, if one of the participants asks a question, it is customary that the other participant answers. Such is not the case for multi-party discussions, where some of the participants may decide to take a more passive role.
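A normalised participation entropy of this kind can be computed as in the sketch below; this is our reading of the measure from Niculae and Danescu-Niculescu-Mizil [30], not their exact implementation.

```python
# Normalised participation entropy: 1 when all speakers contribute equally,
# approaching 0 when one speaker dominates. Our own sketch of the measure.
import math
from collections import Counter

def participation_entropy(speakers: list[str]) -> float:
    """speakers: one entry per utterance, naming who produced it."""
    counts = Counter(speakers)
    n = len(counts)
    if n < 2:
        return 0.0
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n)   # normalise by the maximum possible entropy
```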
Besides conversation statistics, we analyse the difference in task performance. Verifying the initial conditions first, the solo performance of both types of groups is comparable: 0.597 and 0.585. On the other hand, the collective performance of these groups was 0.694 for two-party conversations and 0.724 for multi-party, so the performance gain is 0.096 and 0.139 respectively. Therefore, we argue that it is the multi-party (as opposed to two-party) discussion that led to the improved conversational performance. A limitation of this analysis is that we did not study whether increasing the number of group participants stops increasing performance gain. Given previous findings by Navajas et al. [28] that communication among group members is important, we anticipate that the benefits of increasing the number of participants will be diminishing, as communication and the exchange of ideas become harder and the collaboration starts to resemble the wisdom of crowds.

Propagation of correct solutions
Analysing our data, we found a Kendall's Tau-B correlation [20] of 0.36 between group consensus and performance gain. An investigation of how correct solutions propagate through the conversations showed that 21.2% of conversations started and finished with the same number of correct submissions, i.e. the participants did not convince anyone of the correctness of their responses. In 35% of the discussions where a single participant had answered correctly in their solo submission, they convinced at least one more participant in the group phase. However, the reverse also happened: in 4% of all dialogues, the group convinced a participant with the correct answer to change it, which is considerably rarer than changing to the correct solution. Finally, in 43.8% of the groups in which at least one participant submitted a correct response after the conversation, no participant had submitted a correct solution in their solo phase.
This supports the "group is better than the sum of its parts" hypothesis, suggesting that deliberation offers more than just facilitating the spread of a correct solution among group members, and is consistent with the findings of Moshman and Geil [27] and Schulz-Hardt et al. [33], who show that deliberation plays a bigger role in task success than individual participants' ability.
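The consensus/gain association reported at the start of this subsection can be computed with SciPy's Kendall's tau (the tau-b variant handles ties); the variable names below are illustrative.

```python
# Kendall's Tau-B between per-group consensus and performance gain (illustrative sketch).
from scipy.stats import kendalltau

def consensus_gain_correlation(group_consensus: list[float], performance_gains: list[float]) -> float:
    """Both lists are aligned, with one value per group."""
    tau, p_value = kendalltau(group_consensus, performance_gains)  # tau-b is the default variant
    return tau
```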
Furthermore, we present an analysis of different solution propagation patterns based on the annotation schema. We compared the groups where at least one of the participants had the correct solution in their solo phase to the groups which reached the correct solution without anyone knowing it in their solo phase (referred to as DELI). The DELI subset contains a higher percentage of probing (17.3% vs 14.4%) and reasoning (43.8% vs 37.8%) utterances, suggesting that the participants actively engage in deliberation to reach the correct solution. Naturally, the DELI subset contains fewer utterances that propose a solution (30.4% vs 35.7%), as participants are more engaged with the reasoning behind the solution, as opposed to the solution itself. These findings are suggestive of the rich source of information about the dynamics of deliberation present in the data. In order to analyse the factors that make a conversation constructive, as well as to showcase possible applications of the DeliData corpus, we perform a series of modelling experiments in which we predict the performance gain of a conversation, i.e. whether group task performance improved or not following the deliberation.

Predicting conversation's performance gain
In these experiments we use a simple decision tree classifier [32] with a maximum depth of 7 and minimum samples per leaf set to 5, and use leave-one-out cross-validation (LOOCV). As the dataset is imbalanced (318 conversations with performance gain and 182 without), we evaluate our models using the area under the ROC curve and stability. For these experiments, we considered 7 types of features: (1) annotation statistics (i.e. normalised counts of each of the annotation labels), (2) n-grams of annotation Role sequences, (3) interaction features borrowed from StreetCrowd [30] (SC Interaction), (4) linguistic features borrowed from StreetCrowd [30] (SC Linguistic), (5) participation dynamics (i.e. whether one of the participants dominated the conversation), (6) conversational statistics (number of messages, tokens, etc.), and (7) n-grams of annotation Type sequences. Full experimental details can be found in Appendix A.3 and the code will be made publicly available. As shown in Table 5, the SC Interaction features do not perform well in our setup if used alone, achieving accuracy that is below the baseline. Without feature combinations, both conversational statistics and annotation Type n-grams are good predictors of conversational performance. The best overall performance is achieved by combining the interaction features from StreetCrowd, the participation dynamics, the conversational statistics, and the annotation Role n-grams. Both SC Interaction and Participation Dynamics model how participants interact with each other, providing a glimpse into group collaboration. On the other hand, we show that the information encoded by the deliberation annotation is also important for predicting performance gain. These results suggest that conversational dynamics are a strong addition to traditional feature-based approaches for dialogue classification.
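A minimal version of this classification setup is sketched below, assuming the features have already been extracted into a matrix X with labels y; the hyperparameters mirror those stated above, but feature extraction and the stability analysis are omitted.

```python
# Sketch of the performance-gain prediction setup: decision tree, LOOCV, ROC AUC.
# Assumes X (n_conversations x n_features) and y (1 = performance gain, 0 = none) are given.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def evaluate_loocv(X: np.ndarray, y: np.ndarray) -> float:
    scores = np.zeros(len(y))
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = DecisionTreeClassifier(max_depth=7, min_samples_leaf=5, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]  # probability of "gain"
    return roc_auc_score(y, scores)
```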

Modeling the annotation scheme
Given that annotating datasets is an expensive, time-consuming, and non-trivial endeavour, we propose to learn models that classify utterances using the annotation scheme proposed in the previous sections. Given that our annotation scheme is hierarchical, we considered two options: (i) build a single classifier that predicts both layers jointly, or (ii) build three classifiers that predict each level. In the latter case, we first classify whether an utterance is Probing, Non-probing deliberation (NPD) or None. Then, depending on the result of the first-level classification, one model classifies Probing roles and another predicts Non-probing roles.
In terms of models, we evaluate two approaches for current utterance prediction. First, as a baseline method, we consider a TF-IDF encoding of the utterance, which is then passed to a random forest classifier with 5 estimators [32]. Second, we experiment with a neural model that relies on a pre-trained encoder for the utterance embedding.
Each utterance is encoded by the GTR model [29], a T5-based encoder. Each utterance encoding is then passed to a fully-connected feed-forward neural network with one hidden layer of size 512 and one output layer with the desired number of outputs (depending on the level of annotation). The neural network is trained using the Adam optimizer [22]. The comparison between the two classifiers is presented in Table 6. For both methods, we kept the same experimental setup: 10-fold cross-validation, comparing accuracy. Both the RandomForest and the neural model outperform the majority class baseline. In the column Combined Performance we compare the cascading classifier and the single (joint) classifier. We can see that the RandomForest classifiers perform better in the single classifier setup, while for the neural model the combined performances are comparable. The three-classifier cascade version is slightly better, achieving an accuracy of 0.87 on the first level of annotation. The performance on the second level of annotation is 0.86, irrespective of whether the type of utterance is probing or not.
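A rough sketch of the cascading neural classifier is given below; the GTR checkpoint name, the use of the SentenceTransformers library, and the omitted training loop are our assumptions about a typical setup rather than the exact released configuration.

```python
# Rough sketch of the cascading utterance classifier: a frozen sentence encoder
# followed by a one-hidden-layer feed-forward network per annotation level.
# Checkpoint name and training details are assumptions, not the released config.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/gtr-t5-base")  # GTR, a T5-based encoder

class UtteranceClassifier(nn.Module):
    def __init__(self, embedding_dim: int, num_classes: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)

# Level 1: Probing vs Non-probing deliberation vs None.
level1 = UtteranceClassifier(encoder.get_sentence_embedding_dimension(), num_classes=3)
# Level 2 is handled by two further classifiers, selected according to the level-1 prediction.

def classify_type(utterances: list[str]) -> torch.Tensor:
    with torch.no_grad():
        emb = torch.tensor(encoder.encode(utterances))
    return level1(emb).argmax(dim=-1)
```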

Table 7 (abridged example of utterances generated by different methods):
Context: but if we are trying to verify then maybe we select them all
Original: how else could you know?
Random: Why did you press V
Retrieval: How many cards do you think at minimum we need to flip to confirm the rule
Generative: I think he means that the list of possible candidates is a list that will be evaluated in the upcoming days.

Generating Probing Utterances
We conclude by developing and evaluating two methods for generating probing utterances: a retrieval-based approach and a generative approach with language models. The task setup is: given the previous dialogue utterances and the Role of a probing utterance (i.e. Probing-Moderation, Probing-Reasoning, or Probing-Solution), generate the most appropriate utterance to continue the dialogue. For these experiments, we consider 50 dialogues annotated with the annotation schema of Section 5, as we assume the Role of the utterance to be generated is given, and split them into a training set of 30 dialogues and a test set of 20. In our experiments, we compare 4 candidate responses:
• Original. We take the utterance produced by the human participant in the original dataset.
• Random. We sample from the training data a random utterance that has the same Role as the one we need to generate. This is a strong baseline, as sampling for the same role often yields a contextually adequate utterance (albeit not necessarily the best).
• Retrieval. We find the most similar utterance with the same Role in our training dataset. To calculate similarity we encode the context of the probing utterance using a pre-trained DialoGPT model (a simplified sketch of this step follows the list).
• Generative. We use a pre-trained DialoGPT to generate the next utterance based on the current conversation context.
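The sketch below approximates the Retrieval candidate selection; we substitute a generic sentence encoder for the DialoGPT-based context encoding used in the paper, so the function and model names are illustrative rather than the actual implementation.

```python
# Approximate sketch of retrieval-based probing generation: encode the dialogue
# context and return the training utterance (with matching Role) whose own context
# is most similar. A generic sentence encoder stands in for the DialoGPT encoding.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the DialoGPT-based encoder

def retrieve_probing(context: list[str], role: str, train_examples: list[dict]) -> str:
    """train_examples: dicts with 'context' (list of utterances), 'role', and 'utterance'."""
    query = encoder.encode(" ".join(context[-5:]))            # last few turns as the query
    candidates = [ex for ex in train_examples if ex["role"] == role]
    cand_vecs = encoder.encode([" ".join(ex["context"][-5:]) for ex in candidates])
    sims = cand_vecs @ query / (np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(query))
    return candidates[int(np.argmax(sims))]["utterance"]
```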
For every method (except for the original) we replaced both the mentions of participants and solutions with placeholders. Once we generate an utterance, if it mentions a participant or a solution, we use a simple rule-based system to select the appropriate substitution from the context. We show an abridged example from our experiments in Table 7 (additional examples in Appendix C). We evaluate the three generated candidate responses using both automatic and human evaluation. First, we applied three commonly used measures for evaluating NLG applications: BLEU-4 [31], sentence similarity using RoBERTa [24], and BERTScore [44]. As none of our NLG methods is trained to generate the same utterance as the Original, we do not expect any of the candidate responses to achieve strong results, but automatic measures for NLG evaluation can be a good proxy for the quality of generated responses. In Table 8, we present the results comparing the candidates to the Original response. The Retrieval approach has the best overall performance, with a BLEU-4 score of 0.39 compared to 0.35 and 0.09. If we consider just the Similarity and BERTScore measures, the Retrieval and Random approaches have similar performance. On the other hand, Generative performs consistently worse on all measures.
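The automatic scores in Table 8 can be approximated with standard packages as in the sketch below; the exact tokenisation and model variants used in the paper may differ from the defaults shown.

```python
# Sketch of the automatic evaluation: BLEU-4, RoBERTa-based sentence similarity,
# and BERTScore of a generated candidate against the Original utterance.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

sim_model = SentenceTransformer("all-roberta-large-v1")  # RoBERTa-based similarity model

def evaluate_candidate(candidate: str, reference: str) -> dict:
    bleu4 = sentence_bleu([reference.split()], candidate.split(),
                          smoothing_function=SmoothingFunction().method1)
    similarity = util.cos_sim(sim_model.encode(candidate), sim_model.encode(reference)).item()
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {"bleu4": bleu4, "similarity": similarity, "bertscore_f1": f1.item()}
```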

We also perform a human evaluation study, where we asked people to rate the generated responses. We recruited 28 workers from Prolific, using worker qualifications and payment levels comparable to those on Mechanical Turk. We gave crowd workers the following instructions: "Please rank the 4 candidate responses from 1 (for the best response) to 4 (for the worst). You can give the same rank for responses you consider equally good/bad by placing them in the same box.". We asked each of the crowd workers to rank 10 sets of candidate responses, which resulted in 280 annotations of 89 probing cases. First, we compared the average ranks of each of the NLG methods. The Original and the Retrieval approaches had similar ranks, 2.12 and 2.15, while the Random candidate was ranked on average at 2.23. Finally, the generative approach performed the worst, being ranked on average at 3.02. To gain a more fine-grained understanding of which method is preferable, we calculated the pairwise preferences (adjusted for ties), presented in Table 9, which showed similar results, with the Original and Retrieval being considered equal, followed closely by Random, and Generative a distant fourth.
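One way to compute such pairwise preferences from the collected rankings is sketched below; treating "adjusted for ties" as excluding tied pairs is our interpretation rather than a documented detail of the analysis.

```python
# Pairwise preference of method A over method B from per-item rank annotations,
# excluding ties from the comparison (our interpretation of "adjusted for ties").
def pairwise_preference(ranks: list[dict], method_a: str, method_b: str) -> float:
    """ranks: one dict per annotation, mapping method name -> rank (1 = best)."""
    a_wins = sum(1 for r in ranks if r[method_a] < r[method_b])
    b_wins = sum(1 for r in ranks if r[method_b] < r[method_a])
    decided = a_wins + b_wins
    return a_wins / decided if decided else 0.5
```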
Qualitative analysis showed that the responses of the Retrieval approach are coherent despite the simple representation of dialogue context. We also found that, while large-scale pre-trained language models can be adequate in responding to general queries, they fail to produce good responses where more advanced vocabulary and reasoning are required.
Here we demonstrated methods for generating cohesive responses for group deliberation. That said, there is a reasonable question as to whether these responses are also useful for the conversation, i.e. can they contribute towards improving the group's deliberation? One way to address this, in future dialogue system development, is to leverage DeliData's numerical measure of success to filter which utterances contribute towards a constructive conversation and which do not.

CASE STUDY ON COLLABORATIVE DEBUNKING OF DEEPFAKES
We have so far studied the Wason card selection task, which is well-studied and characterised by previous work on group decision-making. While we have argued that the use of the Wason card selection task is appropriate due to its abstract nature that does not require prior knowledge, we want to examine whether the patterns learned on Wason are transferable to other tasks, closer to real-life scenarios. To this end, we investigate how accurately our classifiers trained on DeliData (introduced in Section 7.2) perform on an out-of-domain task. For this purpose, we use the dataset by Unchendu et al. [37] that contains group discussions where people collaborate to determine whether an image they are seeing is a deepfake or not. An example snippet is presented in Table 10.
We manually annotated 110 utterances from this dataset with the annotation scheme proposed in this work. The label distribution of the annotated examples is presented in Table 11.
We use this subset of the data to benchmark the accuracy of the classifiers introduced in Section 7.2. For these experiments, we selected the best-performing model, namely the neural hierarchical classifier. On the first level of annotation (Probing vs Non-probing deliberation vs None) we achieve an accuracy of 71% (compared to 87% on DeliData).
On the second level of annotation, the classifiers achieve a combined performance of 64% accuracy (compared to 86% on DeliData). This performance is encouraging, and while we note that it is lower than on the original DeliData set, we would like to highlight that the Wason card selection task and deepfake debunking are very different tasks, with very different vocabularies.
In order to get a better understanding of the model, we performed an error analysis. We note that the biggest drop in performance is for the Solution labels (0.17 F1 score vs 0.83 on DeliData). A solution in the case of the Wason card selection task is focused on which cards should be turned in order to verify a rule; on the other hand, for the deepfake data, participants are proposing whether an image is credible or not. As a result, solution utterances differ substantially in terms of the language used, and this performance drop is expected. The performance on all other labels remains high, e.g. for Reasoning, which had an F1 score of 0.75, compared to 0.91 on DeliData. This shows that even on a substantially different task, people use similar argumentation structure, expressions and deliberation patterns to express their reasoning.
An example snippet of an automatically annotated conversation is presented in Appendix D.
Given these results, two conclusions are important for future work. Firstly, we show that the annotation scheme introduced in this paper applies to other tasks, and can benefit research investigating collaboration and deliberation. Secondly, even though the Wason task is an abstract task devised for the purpose of studying decision-making, the patterns learned on the DeliData corpus are transferable to real-world collaborative scenarios such as deepfake debunking.

CONCLUSIONS AND FUTURE WORK
In this work, we introduced a dataset containing conversations where a group of participants collaborate in order to solve a task. Furthermore, we proposed an annotation schema and an annotated corpus that capture key elements of group deliberation, such as probing. In order to evaluate the dataset and the annotation scheme we performed 4 types of modelling experiments. First, we showed that we can build a classifier that predicts the annotated labels with high accuracy. Next, we investigated methods for predicting conversational success based on dialogue features and annotations. Then, we showed that one of the modules for a future dialogue system can be addressed with a retrieval method.
Finally, we showed that the classifier and the annotation scheme transfer well to an out-of-domain dataset. Given the resources and the conclusions from our experiments, we believe that this paper is a step towards addressing the call for "discourse optimization" of Vecchi et al. [38].
Two main research directions can be addressed in future research. First, while this paper analyses what contributes towards improved group decision-making (referred to as performance gain), future work should perform a more in-depth analysis of the conditions under which a group performs better than the sum of its parts. For example, the corpus can be processed with tools developed in argumentation mining [43] and discourse parsing [25] in order to provide insights into how these relate to problem-solving deliberation. Secondly, this dataset can be used to test theories of the dynamics of group deliberation and to develop dialogue agents that could improve group decision-making in numerous settings, for example debating groups, project meetings, etc. Such dialogue agents could be decomposed into three modules: determining intervention timing, determining intervention type (i.e. moderation, probing for reasoning), and generating a probing utterance. In this work we introduced an adequate approach for coherent probing generation (Section 7.3); however, determining the timing and the type of intervention is left for future work.

ETHICS STATEMENT
In this work, we present a corpus containing conversations where participants collaborate to solve a cognitive task.
Details on our setup and ethical considerations are presented in Section 3 and appendices A.1 and A.2, but in this section we will reiterate the most important points.
We collected our dataset using the crowd-sourcing platform Mechanical Turk and, for the initial experiments, in-lab volunteers. Participants gave informed consent to their participation, and we told them the purpose of the study and that the transcripts of the dialogues would be collected and used for further research. The only language used in our dataset is English. Participants were free to withdraw at any time. We asked participants not to share any personal information, and as part of quality control, we removed any instances of such information (like the city they were living in, or the institution they were studying at). We asked the participants not to use any offensive language, and as part of the quality control, we verified whether this was the case, fortunately not finding any such instances. When recruiting participants, we selected adult participants from countries where English is a primary language and where Mechanical Turk operated at the time of collection: US, Canada, UK, Ireland, and Australia. Besides that, we did not put any restrictions on (nor have a record of) participants' exact age, gender, nationality, race, political leaning, education, etc.
Crowd workers were paid on average between £12/hour and £18/hour (approx. $16.46/h-$24.68/h), depending on their time of participation and whether they solved the task correctly. This is well above the UK's living wage (£8.91/hour), as well as the minimum wage in the US ($7.25). Moreover, in cases where we were unable to start the data collection (due to inactive users, for example), we paid the participants for their time.
For our human evaluation experiments, we recruited participants from Prolific. We applied similar qualification requirements to those on Mechanical Turk, namely a minimum age of 18, fluency in English, and a minimum approval rate of 90%. We paid annotators in the same pay range as on Mechanical Turk, averaging £14.25/hr ($19.50/h).
The full experimental design was approved by the ethics committee of the authors' institution.We will release the DeliData corpus under Creative Commons 4.0.
Future work may be needed to evaluate whether this dataset would apply to other types of problem-solving (for example in a business setting).
We selected a decision tree, as it is a fairly stable model by design, and it allows us to analyse variability between different runs of the model. We performed a hyperparameter search with the following parameters: Max Depth: [2, 3, 5, 7 (selected), 20, max] and Min Samples per leaf: [1, 2, 3, 5 (selected), 10]. The total number of parameter tuning runs was 30. The best model is selected based on model accuracy and stability. Due to the size of the model and the dataset, the hyperparameter search does not require any special infrastructure and the training time is negligible.

A.4 Packages used
For training and evaluation of the performance gain we used [32] version 1.0.2. For general language tasks and featurisers we used NLTK [4] version 3.5 and spaCy [17] version 2.3.2. For generative experiments, we used DialoGPT-large from HuggingFace's transformers version 4.11.3. For evaluation, we used BERTScore [44] version 0.3.11 and SentenceTransformers version 2.1.0.

Fig. 3. Comparison between conversational statistics of two-party dialogues (left) and group dialogues (right). Each of the histograms shows the percentage of dialogues on the y-axis.

Table 1. Corpus statistics for pilot and MTurk data.

Table 3. Frequencies for the labels in the top two levels of the annotation schema.

Table 4. Label distribution of the additional labels.

Table 5. Predicting conversational performance gain.

Table 6. Classification performance of predicting the current utterance. Accuracy score reported. SC stands for a single classifier, where a single model is used to predict all labels. In all other cases, a cascading prediction of 3 models is used.

Table 7. Utterances generated by different methods.

Table 8. Automatic evaluation of Probing generation.

Table 9. The table reports pairwise preferences in columns over rows, i.e. the first column reports the preference of the Original text vs the other 3 methods.

Table 10. Abridged annotated example from the dataset of collaborative deepfake debunking. NPD stands for Non-probing deliberation.

Table 11. Label counts in the top two levels of the deepfake annotated data.

Table 14. Example of different methods for generating Probing-Moderation utterances.

Table 15. Example of different methods for generating Probing-Solution utterances.