Rehearsal: Simulating Conflict to Teach Conflict Resolution

Interpersonal conflict is an uncomfortable but unavoidable fact of life. Navigating conflict successfully is a skill, one that can be learned through deliberate practice, but few have access to effective training or feedback. To expand this access, we introduce Rehearsal, a system that allows users to rehearse conflicts with a believable simulated interlocutor, explore counterfactual "what if?" scenarios to identify alternative conversational paths, and learn through feedback on how and when to apply specific conflict strategies. Users can use Rehearsal to practice handling a variety of predefined conflict scenarios, from office disputes to relationship issues, or they can create their own setting. To enable Rehearsal, we develop IRP prompting, a method of conditioning the output of a large language model on the influential Interests-Rights-Power (IRP) theory from conflict resolution. Rehearsal uses IRP to generate utterances grounded in conflict resolution theory, guiding users towards counterfactual conflict resolution strategies that help de-escalate difficult conversations. In a between-subjects evaluation, 40 participants engaged in an actual conflict with a confederate after training. Compared to a control group with lecture material covering the same IRP theory, participants with simulated training from Rehearsal significantly improved their performance in the unaided conflict: they reduced their use of escalating competitive strategies by an average of 67%, while doubling their use of cooperative strategies. Overall, Rehearsal highlights the potential effectiveness of language models as tools for learning and practicing interpersonal skills.


INTRODUCTION
Managing interpersonal conflict is a critical skill. We occasionally find ourselves in situations where our interests, values, or goals conflict with those of others. If left unchecked, conflict can reach a boiling point, manifesting in verbal arguments, physical altercations, passive-aggressive behavior, or more [12,20]. Additionally, conflict correlates with increased stress [31], a downturn in productivity, and absenteeism [44]. While avoiding any conflict may be impractical [59], how we choose to deal with conflict is not: in most settings, an ideal outcome for both parties is to work cooperatively [56].
Directing conflict towards cooperative communication is, however, a difficult skill to learn, requiring targeted and repeated practice with immediate feedback [25]. Avenues for practicing conflict resolution are unfortunately often limited: training material for conflict resolution is usually static (e.g., a written case study), covering a fixed number of situations. Independently extrapolating beyond these predefined settings, especially without expert guidance, is challenging. While conflict roleplay with an expert is a proven and widely used technique [27], expert training is costly and scarce. If it were possible to simulate expert-level conflict practice, we could significantly improve an individual's conflict resolution skills in a cost-effective and scalable manner.
We envision that, given their generative capabilities [10], large language models (LLMs) offer an opportunity to craft expert-level conflict roleplays and provide immediate feedback to users. Despite remarkable progress in producing compelling content, however, LLMs such as ChatGPT often fall short of simulating conflict and giving feedback on it. Naively prompting LLMs introduces a host of problems that lead to unrealistic and ineffective simulations. First, current LLMs are sycophantic due to instruction following, producing generations that agree too quickly with the viewpoints of a user [66]. Second, providing targeted practice and feedback is challenging due to the open-endedness of LLM text generation. An off-the-shelf LLM may produce messages that are not directly informative, and potentially even distracting, for teaching conflict resolution. In contrast, students benefit significantly from deliberate and targeted practice [28], where feedback is readily available [38,70]. As a result, experts teaching conflict resolution rely on a proven and curated set of conflict resolution strategies when roleplaying conflict, providing feedback specific to each strategy [56]. Applying these strategies without guidance and in a fully open-ended setting can be a significant burden for students. If a student unintentionally uses an inappropriate conflict resolution strategy, they can send a conversation into a conflict spiral [9].
Figure 1: An example interaction trace with Rehearsal, where an employee practices conflict with a simulated customer. The employee quickly realizes that Rights and Power-based strategies result in heightened conflict. The conflict is eventually resolved using an Interests-oriented approach.
To mitigate challenges associated with simulating conflict, we turn to conflict resolution theory as grounding for LLMs. At its core, effective conflict resolution relies on interest-based discussion of conflict. We draw on the Interests-Rights-Power (IRP) framework from the conflict literature [56] to ground our approach. The IRP framework codes utterances from individuals onto a higher-level set of 8 strategies. For example, a Power strategy threatens to apply consequences to the other party: "I'll be telling everyone about what you did here today." Rights asserts a standard of fairness or legitimacy: "I was on call yesterday, and we rotate the job." Interests discusses or draws on each party's independent goals: "If we do this together and win, it will show up on your record for promotion next month." According to IRP, individuals in a conflict should discuss each party's Interests without resorting to threats (Power) or deferring to pre-existing norms (Rights) [76]. In a state of conflict, however, we are often unaware of how the strategies we use will cascade as the conflict proceeds. For example, deploying a Power strategy tends to result in the other party responding with Power as well, and when both parties repeatedly use Power or Rights strategies, the conflict is unlikely to end productively [76] (see footnote 1). To avoid conflict spirals, one should (if possible) rely primarily on cooperative strategies like Interests and avoid contentious strategies like Rights or Power [9]. Experts rely heavily on frameworks like IRP to jointly provide feedback and offer targeted roleplays of specific conflict scenarios [35].
Using these insights, we operationalize the IRP framework as a basis for simulating teachable conflict and enabling expert feedback with LLMs. We introduce IRP prompting, a method for steering LLM generations with a theoretical conflict resolution framework. In IRP prompting, the IRP framework plays a core role in planning the course of a conflict, serving as grounding for the LLM. IRP prompting builds on multi-step prompting techniques [77,79,81]: instead of directly generating conflict dialogue (sampling from an exponentially large space of potentially undesirable outputs), we first predict the simulated interlocutor's next conflict strategy based on the current conversation, then generate messages conditioned on that explicit conflict resolution strategy. Compared to out-of-the-box generations from an LLM, we find that IRP prompting produces conflict simulations that are more representative of expert roleplay. Grounding in IRP also yields simulations robust to repeated interaction with a user: simulations can only reach a state of agreement after multiple interest-based strategies are used. Tightly integrating conflict resolution theory with LLMs enables targeted feedback on the use of conflict resolution strategies and mitigates the teaching limitations of open-ended conflict generation.
We instantiate IRP prompting in an interactive dialogue system, Rehearsal, where users engage directly with a simulated interlocutor. Using Rehearsal, users receive targeted feedback at each dialogue turn and can explore how a simulated interlocutor might respond if different strategies were employed. For example, consider a new employee who wants to practice managing conflict with a dissatisfied customer (Figure 1). The employee can specify a conflict setting from their own life or pick a predefined one from Rehearsal for practice. The simulated interlocutor starts by using Power, threatening to involve management. Taken aback, the employee responds with Power, tone-policing the simulated customer. Here, Rehearsal (using IRP prompting) generates several alternate messages that use more cooperative strategies and scores each alternative. Looking at the generated alternatives, the employee decides to use Interests, identifying why the customer was angry. After a few interactions, the employee realizes that using Rights ("we don't do returns here") or Power ("get out of my store!") early in a conversation results in heightened conflict. By observing alternative strategies across multiple interactions, the employee learns that an Interests-based approach (brainstorming a potential compromise with the customer) yields significantly better outcomes.
Footnote 1: "I'm pulling the plug on your CHI submission; it's not ready." [Power strategy] "NO! What a betrayal! I've worked on this project nonstop all year, and I need this published to be competitive on the job market!" [Rights strategy] "I'm sorry, but I'm your advisor and senior author on the paper. I make the call here." [Rights strategy] "Not when I tell all the new Ph.D. students about how you kneecap your students." [Power strategy] (Illustrative example; don't worry, the coauthors of this paper all get along just fine.)
We conduct evaluations across two axes: a technical analysis of IRP prompting and a behavioral study of Rehearsal's end-to-end effectiveness as a system for teaching conflict resolution. First, we compare IRP prompting to out-of-the-box LLMs and to ablations, evaluating generations in a controlled setting. Compared to our ablations, we find that IRP prompting produces conflict simulations more faithful to expert training. Next, we recruit N = 40 participants and study Rehearsal's effectiveness by comparing it with traditional conflict resolution training. In a between-subjects evaluation, participants engaged in an actual conflict following training with or without Rehearsal. Participants with Rehearsal training significantly improved their application of effective conflict resolution strategies in the unaided conflict, despite not showing differences in "book smarts" recognition or recall on a test. Compared to a control group, Rehearsal participants reduced their use of competitive strategies by an average of 67% while doubling their use of cooperative strategies.
In summary, Rehearsal highlights the potential of generative AI as a teaching tool for social interaction skills, combining social scientific research on conflict resolution theory [30] with research on the generative capabilities of LLMs [10]. We contribute:
(1) Rehearsal: an interactive system for roleplaying conflict resolution. In a simulated conflict roleplay, Rehearsal generates feedback and lets people send and evaluate their messages. Furthermore, Rehearsal enables learning from alternative conflict resolution strategies.
(2) IRP Prompting: a prompting technique for producing conflict faithful to expert training by grounding LLM generations in conflict resolution theory. IRP prompting also supports generating alternative messages (that use a different conflict resolution strategy), enabling Rehearsal's interactions.
(3) An evaluation of IRP prompting and a user study of Rehearsal with N = 40 participants. Our studies highlight Rehearsal's significant effectiveness in teaching users to apply conflict resolution strategies, compared to the status quo of teaching the same material.

RELATED WORK
To design and build Rehearsal, we rely heavily on research from conflict resolution theory, large language models, and computer-supported cooperation.

Dealing with Interpersonal Conflict
Broadly, conflict resolution seeks to facilitate the end of conflict and provides frameworks to convert conflicting situations into cooperative ones. Conflict resolution theory splits types of conflict into two states: cooperative and competitive [23]. Ideally, conflict should move towards a cooperative state, where all parties discuss common goals and interests. This is easier said than done: the underlying processes behind effective conflict resolution are incredibly delicate. Concrete conflict resolution strategies provide pathways to reaching a cooperative state. Strategies are split into two broad categories: constructive and destructive [24]. Constructive strategies seek to cooperatively integrate perspectives from both parties, while destructive strategies do the opposite. While specific resolution strategies can be broadly construed as constructive or destructive, effectively using them in practice requires a narrower set of definitions. To this end, more recent work has expanded on Deutsch's [24] initial resolution work. Instead of framing conflict as constructive vs. destructive, Ury and Brett [76] frame conflict as a division across three categories: Interests, Rights, and Power. At a high level, the Interests strategy focuses on building common ground between both participants. While using Interests, both parties will actively problem solve, cooperatively discovering interests that lead to an eventual resolution. Importantly, Interests focuses on healthy future outcomes of a conflict, integrating concerns, needs, fears, and desires of both parties (e.g., "Let's try to work things out here"). Cooperative conflict builds mostly on Interests. However, conflict usually escalates to Rights or Power. When using Rights, an individual will appeal to fixed norms or standards to justify their position (e.g., "That's not allowed according to our contract"). Finally, Power strategies draw on coercion and threats, and are often attacking or accusatory: this strategy imposes an explicit burden on another person (e.g., "I'm going to fire you."). For Rehearsal, we choose to build on IRP, though Rehearsal can support other conflict resolution frameworks. IRP alone is not comprehensive of all cooperative/competitive strategies: Brett et al. [9] extend IRP to support a total of 8 strategies, including Proposal, Facts, and more. A definition of each strategy, along with specific examples, can be found in Table 1.
Conflicts also have a tendency to spiral out of control, especially when contentious resolution strategies are repeatedly used [9]. Understanding when and why a specific conflict resolution strategy is used can significantly improve the outcome of a conversation. For example, Brett et al. [9] highlight how reciprocating with Power or Rights can escalate the conflict to a state where returning to cooperation becomes exceedingly difficult. Careful use of conflict strategies early in a conversation can significantly impact the likelihood of a healthy outcome.
To hone the effective use of conflict resolution strategies, experts frequently roleplay fictional conflict scenarios [1] with learners, creating controlled settings to practice conflict resolution [26]. Roleplay, however, is dependent on having an expert in the first place. Given the challenges associated with finding experts (e.g., time and resource constraints), Rehearsal offers a simulated and controlled setting to learn these skills.

Adapting Large Language Models
To simulate conflict, Rehearsal relies on the generative capabilities of LLMs. We specifically use GPT-4, an auto-regressive LLM trained to generate text completions on a next-word prediction objective [10]. Prompted with a text prefix, LLMs generate completions that mirror their training distribution. Because of the scale of both the model size and the training data, LLMs exhibit impressive performance on a wide range of NLP tasks [53]. LLMs have also been used to simulate online community activity for prototyping [64], socioeconomic preferences [42], and broader human behaviors [4,63,73]. In interactive settings, LLMs have been used for teacher-assistant training [58], to provide real-time feedback for conversation on divisive topics [3], and for mental health support [43,72]. Effectively applying LLMs to new domains is challenging; prompting models can be brittle [18,84], and designing systems to support effective human problem solving with black-box models can be unintuitive [87]. In contrast to prior work, Rehearsal applies LLMs to teach domain-specific skills, proposing a theoretically grounded prompting pipeline to generate simulations that mirror expert roleplay. After exposing users to a simulation, we explore how likely users are to employ these skills in conflict-laden settings without real-time assistance. We highlight how Rehearsal provides a measurable and significant learning benefit to users compared to the current status quo.
Recent approaches view LLMs as a powerful backend for agent-based simulation (e.g., Generative Agents [63], ReAct [82], SwiftSage [54], Reflexion [74], and more). Rehearsal, however, deviates from these approaches in that it requires direct interaction with humans at each generation step. Unlike prior work, where agents operate in a silo and interact primarily with their environment and other agents, Rehearsal must remain consistent in the presence of repeated human interaction. To ensure that roleplays generated by Rehearsal mirror experts, simulations from Rehearsal are grounded in the Interests-Rights-Power conflict resolution training framework. We therefore evaluate interaction with LLMs in a more human-centered setting: teaching conflict resolution strategies. To support this interaction, we propose Rehearsal's IRP prompting design in §4. IRP prompting constrains and guides generation from an LLM, producing effective simulations for teaching.

Table 1 (excerpt): Cooperative Strategies
Interests: Reference to the wants, needs, or concerns of one or both parties. This may include questions about why the negotiator wants or feels the way they do. Example: "We can figure this out. I understand that you've been really busy lately."
Positive Expectations: Communicating positive expectations through the recognition of similarities and common goals.

Cooperation and Conflict Resolution in HCI
Rehearsal draws heavily from HCI work on quantifying conflict and designing appropriate interventions. Conflict and anti-social behavior are widespread on social networks [65,78]. Furthermore, conflict and coordination costs in these communities have progressively increased with their growth [46,47]. Despite conflict's prevalence, members of online communities learn to navigate around conflict, with moderators actively supporting resolution [11,13].
Handling conflict without guidance, however, is a challenging task. A range of HCI systems present effective interventions aimed at avoiding or managing conflict. For example, Zhang et al. [86] look at identifying situations where conflict will emerge in an online community, and Chang et al. [15] design an interface that nudges users when they come into contact with online community threads that may devolve into conflict. While these systems are highly effective when actively engaging in conflict, Rehearsal takes a different approach, viewing conflict resolution as an opportunity for learning. Rehearsal not only identifies signs of conflict but simulates settings where conflict may arise. By allowing users to explore conflict in a safe setting, Rehearsal encourages reflection-in-action [69]: users can explore conversational outcomes across different conflict resolution strategies and reflect on the efficacy of each in a low-stakes setting. These affordances are not present when the stakes are higher (e.g., engaging in an actual conflict).
Rehearsal functions to enable cooperative work, a longstanding goal of HCI and CSCW practitioners [45]. Forecasting conflict [14,86] and forming productive teams [36,68] are two examples of enabling cooperation. Unlike prior work, Rehearsal focuses on what to do when conflict inevitably emerges, teaching conflict resolution skills through simulated practice. In developing Rehearsal, we emphasize that cooperation comes from productive conflict; cooperation and conflict are two sides of the same coin. We frequently see this in practice: Grudin [37] highlights how navigating interpersonal dynamics (e.g., conflict) in teams is a limiting factor behind the adoption of cooperative social computing technologies. Furthermore, increased conflict appears to be a function of distance, with remote teams seeing higher instances of conflict compared to their co-located counterparts [40]. Text as a medium to resolve conflict is an established method in communities like Wikipedia, where individuals conciliate in issue-specific discussion threads [6]. Unlike prior HCI systems, Rehearsal's focus is not to reduce the likelihood of conflict emerging or to measure its prevalence. Instead, Rehearsal helps individuals turn pre-existing conflict into opportunities for cooperation.

REHEARSAL'S INTERACTION INTERFACE
Expert roleplay, when complemented with structured feedback [27], proves to be an invaluable teaching tool for conflict resolution [35]. We aim to mirror expert roleplay through Rehearsal (Figure 2), a dialogue system that enables users to engage in a simulated conflict, provides feedback, and identifies alternative response strategies at each dialogue turn. Rehearsal is instantiated as an interactive dialogue web-app built using React.js. At a high level, Rehearsal helps the user abstract from the specific simulated dialogue to the higher-level conflict strategies being employed while teaching effective use and recognition of these strategies. First, users complete a brief tutorial on using Rehearsal and watch a short video describing the IRP framework. After selecting a predefined conflict scenario or manually specifying their own, Rehearsal enables two broad interactions: (1) interfacing with a conflict simulation and (2) providing in-context feedback aimed at teaching users how to resolve a simulated conflict. Interaction with Rehearsal alternates between both modes as a conflict progresses.

Conflict Simulation: Interacting with a Faithful Conflict
Interaction with a simulation is instantiated through a dialogue interface (right side of Figure 2), where the user can send messages to a simulated interlocutor. Given a conflict premise, the user must defuse the conflict by sending effective messages to the simulation, ideally using an interests-based conflict resolution approach.
3.1.1 Rehearsal's Conflict Premises. What premise should a simulated conflict cover? Rehearsal offers two options: manually specifying a premise or picking from a selection of curated premises. Experts who teach roleplay use collections of pre-authored conflict settings, such as Harvard's Program on Negotiation [1]. For the curated preset in Rehearsal, we collect premises that integrate a diverse range of conflicts from pre-existing repositories of conflict resolution case studies [1,52]. Premises are filtered to those that include only two individuals, are self-contained, and can be completed in less than an hour, yielding a total of 12 conflict case studies. We detail a subset of the premises in Appendix A.
3.1.2 Engaging with a Simulated Conflict. After selecting a premise, the user can engage with the simulated interlocutor associated with the premise. The interlocutor begins in a dissatisfied state and sends the first message in the conversation. The user is then prompted to reply to the simulated message. When the user sends a message, Rehearsal classifies it by the strategy being used (Figure 2, annotations on top of messages from Casey), e.g., proposal, interests, rights, power, etc. Rehearsal then provides feedback on the user's message. Simulated conflicts continue until Rehearsal's internal conflict resolution score (discussed later, in §4) predicts that the conflict is least likely to escalate and the simulated interlocutor is in a satisfied state, analogous to cooperative conflict.
An alert informs users when they reach a predicted cooperative state. However, if the user feels that a conflict has gone completely off the rails, they can click the "Restart Conversation" button (top right corner in Figure 2) to reset a conversation.

Feedback: Learning Effective Conflict Resolution
The Feedback View (left side of Figure 2) is primarily responsible for teaching and communicating effective conflict resolution skills. Altogether, the Feedback View is designed to enable reflection-in-action [69]; users are prompted to reflect on the impact of a message (what-if) and the strategy itself (recognize and recall).

"What-if?": Identifying Effective Conflict Resolution Counterfactuals. Beyond recalling and recognizing strategies, the Feedback View introduces an interaction that allows for contrastive comparison between conflict resolution strategies. Inspired by the effectiveness of contrastive pairs in teaching [2,34], Rehearsal generates messages that use different resolution strategies at a specific instant in a conversation and scores their resolution efficacy on a slider (methods for grounded generation and scoring are discussed in more detail in Section 4). In presenting these alternatives, we allow users to compare how effective a specific message (paired with a strategy) is in a given context. Furthermore, users can click the "»" fast-forward button to view a predicted reply to their own message or explore predicted replies across counterfactual messages that use different conflict resolution strategies. Clicking the fast-forward button multiple times generates a different variation. In this way, users can experiment with how the simulated interlocutor reacts both to their message and to the generated, potentially more cooperative, messages.

Recognizing and Recalling Conflict Strategies. Rehearsal does not immediately reveal which conflict strategy was applied by the simulated interlocutor. The Recall & Recognition interaction requires a user to first identify the employed conflict resolution strategy (outlined in Figure 3). Concretely, this interaction aims to teach identification of conflict resolution strategies: by typing out the strategy itself, users practice recognizing which conflict strategy is being used on them. When users fail to recognize the strategy, the Feedback View prompts them to try again. Recalling the exact names of all 8 strategies immediately, however, might be challenging. In this case, Rehearsal offers a fallback, inspired by related memory-building tasks [57]. When a user fails to recognize the strategy two times in a row, the Feedback View switches to a closed-ended multiple-choice interaction, where users only need to recognize the correct strategy used by the simulated interlocutor. As an additional cue, users can hover over each choice and see the definition of each strategy in a tooltip. Clicking on the correct strategy resumes the simulated conflict. To prevent frequent context switching, we only ask users to recall and recognize strategies used by the simulated interlocutor.
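The fallback from free recall to multiple choice can be pictured as a small state machine. The following is a minimal sketch under stated assumptions, not Rehearsal's implementation: the class name `RecallPrompt`, its fields, and the case-insensitive matching rule are hypothetical, and `STRATEGIES` lists only the strategies named in the text rather than all 8.

```python
from dataclasses import dataclass

# Only the strategies named in the text; the full framework defines 8.
STRATEGIES = ["Interests", "Positive Expectations", "Proposal", "Facts", "Rights", "Power"]
MAX_FREE_RECALL_FAILURES = 2  # "two times in a row", per the text


@dataclass
class RecallPrompt:
    """Tracks one Recall & Recognition round for a single simulated message."""

    correct: str
    failures: int = 0

    @property
    def mode(self) -> str:
        # Free-text recall first; multiple choice after repeated misses.
        if self.failures >= MAX_FREE_RECALL_FAILURES:
            return "multiple_choice"
        return "free_text"

    def choices(self) -> list[str]:
        # Options are only shown in the closed-ended fallback.
        return list(STRATEGIES) if self.mode == "multiple_choice" else []

    def submit(self, answer: str) -> bool:
        """Return True on a correct answer (the conflict resumes), else count a miss."""
        if answer.strip().lower() == self.correct.lower():
            return True
        self.failures += 1
        return False
```

Because a correct answer ends the round, every recorded failure within one round is consecutive, so a simple counter is enough to express "two times in a row".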

SIMULATION VIA INTERESTS-RIGHTS-POWER PROMPTING
The interactions described in Rehearsal are powered by the generative capabilities of LLMs. Ensuring that simulated conflict is both accurate and educative poses several challenges. The technical contribution of our work is a set of LLM prompting techniques that yield utterances grounded in conflict resolution theory. Our prompts are iteratively composed to generate both high-level conflict resolution strategies (from the IRP framework) and corresponding messages. A high-level overview of our approach, consisting of three components, is summarized in Figure 5. Each component is a zero/few-shot prompt that contextualizes the conversation, generates counterfactual inputs adhering to the IRP framework, or generates corresponding replies from a simulated party. Our components are interdependent, each playing a distinct role in grounding the simulation process. Below, we detail elements from each component. At Rehearsal's core is the overarching IRP planning component, which orchestrates and constrains generation to the Interests-Rights-Power framework. In this section, we discuss IRP planning and detail how it interfaces with the rest of the IRP prompting pipeline.

Planning with Interests-Rights-Power
For Rehearsal to work as intended, it must predict which conflict strategy was used by the user and generate counterfactual inputs grounded in IRP. To this end, the IRP planning component plans and constrains generation to the 8 discrete conflict resolution strategies in the outlined IRP framework (Table 1 and §2.1). Planning and classifying utterances within the IRP framework provides a high-level grounding for each conversation.
Still, why should we expect a descriptive framework of conflict (e.g., IRP) to improve the planning capabilities of an out-of-the-box LLM? We borrow from the same intuition powering chain-of-thought prompting [77] and prompt-chains [79]: out-of-the-box LLMs must sample representative conflict, in one shot, from an exponentially large space of potentially undesirable generations. Prompt-chaining with a descriptive framework reduces the likelihood of undesirable outputs. Similarly, expert roleplay is fairly strategic in nature: experts think carefully about which strategies are practical and representative of realistic behavior. By enforcing controlled generation through strategy planning, Rehearsal simulates expert roleplay, planning a conversation using finer-grained conflict resolution units.
To control generation, the planning prompt supports several modes (Figure 4): classify a free-form message into one of the Interests-Rights-Power strategies, predict which specific strategy is most likely to be used conditioned on the context, and generate a message given a strategy.
Response Generation. Finally, to generate simulated responses, we must jointly predict $P(\text{message}_{\text{sim}}, \text{strategy}_{\text{sim}})$. We simply reuse the planning component, first predicting just $P(\text{strategy}_{\text{sim}})$, then predicting $P(\text{message}_{\text{sim}} \mid \text{strategy}_{\text{sim}})$. In this way, we break message generation into a "chain-of-conflict" prompting process, where the response conflict strategy is determined before the response itself. By breaking down final response generation into a two-step process, we force models to generate new messages that specifically adhere to a conflict resolution strategy. Additionally, we qualitatively observed that zero-shot chain-of-thought generation (e.g., sampling CoTs without constraining them to a framework like IRP) may reduce the diversity of sampled output, mirroring findings from prior NLP work [50,71].
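The two-step factorization, strategy first and message second, can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's prompts: `LLM` stands for any prompt-to-completion callable, the prompt wording and function names are hypothetical, and `STRATEGIES` contains only the strategies named in the text.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

# Only the strategies named in the text; the full framework defines 8.
STRATEGIES = ["Interests", "Positive Expectations", "Proposal", "Facts", "Rights", "Power"]


def predict_strategy(llm: LLM, context: str) -> str:
    """Step 1: choose the interlocutor's next strategy from the fixed label set."""
    prompt = (
        f"{context}\n\nWhich strategy will the simulated party use next? "
        f"Answer with one of: {', '.join(STRATEGIES)}."
    )
    answer = llm(prompt).strip().lower()
    for strategy in STRATEGIES:
        if strategy.lower() == answer:
            return strategy
    # Reject anything outside the framework instead of passing it downstream.
    raise ValueError(f"Unrecognized strategy: {answer!r}")


def generate_message(llm: LLM, context: str, strategy: str) -> str:
    """Step 2: write a message conditioned on the chosen strategy."""
    prompt = (
        f"{context}\n\nWrite the simulated party's next message "
        f"using the {strategy} strategy."
    )
    return llm(prompt).strip()


def generate_response(llm: LLM, context: str) -> tuple[str, str]:
    """Chain the two steps: the strategy is fixed before the message is written."""
    strategy = predict_strategy(llm, context)
    return strategy, generate_message(llm, context, strategy)
```

Validating the predicted label against a closed set is one simple way to keep the second step constrained to the framework; a production system might retry or fall back instead of raising.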

The IRP Prompting Pipeline
While the planning component integrates IRP into Rehearsal, it works alongside a set of additional prompting components.
Contextualization. To generate a simulation of conflict, an LLM must see the premise (setting and context) of the conflict and what messages were previously sent in a conversation (history). These factors significantly impact which strategy is most effective; the eventual resolution is contextualized in both the history and premise of a conflict. At the start of a simulated conflict, we reset the dialogue history. As we generate messages and collect user input, we encode conversation history into the contextualization prompt. The premise for conflict defines what both parties are conflicting about (e.g., Riley and Casey are arguing about an upcoming promotion; Riley is Casey's boss), and exists at the start of a conversation. Rehearsal allows for premises to be user-defined or selected from a curated preset.
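As a rough illustration of contextualization, the sketch below keeps a premise and a dialogue history and renders both into a prompt prefix. The `Conversation` and `Turn` classes and the rendered prompt layout are hypothetical; the actual prompt format is not specified in this section.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str
    message: str


@dataclass
class Conversation:
    premise: str
    history: list[Turn] = field(default_factory=list)

    def reset(self) -> None:
        """Clear the dialogue history at the start of a simulated conflict."""
        self.history.clear()

    def add(self, speaker: str, message: str) -> None:
        """Record a generated or user-written message."""
        self.history.append(Turn(speaker, message))

    def to_prompt(self) -> str:
        """Render premise plus history as the prefix shared by later prompts."""
        lines = [f"Premise: {self.premise}", "Conversation so far:"]
        lines += [f"{t.speaker}: {t.message}" for t in self.history] or ["(no messages yet)"]
        return "\n".join(lines)
```

Keeping the premise separate from the history means a restart only has to clear the turns, while the conflict setting persists.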
Generating and Scoring Counterfactual Responses. Once a user sends a message, we feed the message into the IRP planning module, classifying the message strategy and then generating counterfactuals that differ from the original strategy. A response is generated using each counterfactual message/strategy pair, again using IRP planning. We finally zero-shot prompt GPT-4 to predict a conflict resolution score for each generated response. The conflict resolution score (1 to 5) indicates how likely a specific message is to result in escalated conflict, from most likely (1) to least likely (5). We initialize a simulation to have the lowest possible conflict resolution score (1), as effective conflict resolution roleplays often start with a very dissatisfied party. In the context of our conflict resolution score, we define cooperative conflict as maximizing the conflict resolution score, therefore decreasing the likelihood of escalation. Because the first message in a conflict is conditioned on a low conflict resolution score, initial generations are likely to employ a negative strategy (e.g., power, rights).
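A simplified sketch of this step follows. It assumes a generic prompt-to-completion callable (`LLM`), hypothetical prompt wording, and a hypothetical `parse_score` helper; it also collapses the reply into a single call, whereas the text generates replies through IRP planning. `STRATEGIES` again lists only the strategies named in the text.

```python
import re
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

# Only the strategies named in the text; the full framework defines 8.
STRATEGIES = ["Interests", "Positive Expectations", "Proposal", "Facts", "Rights", "Power"]


def parse_score(text: str) -> int:
    """Extract the first standalone integer from 1 to 5 in a model reply."""
    match = re.search(r"\b[1-5]\b", text)
    if match is None:
        raise ValueError(f"No score between 1 and 5 in: {text!r}")
    return int(match.group())


def counterfactuals(llm: LLM, context: str, user_strategy: str) -> list[dict]:
    """For each strategy other than the user's, draft a message, a reply, and a score."""
    results = []
    for strategy in STRATEGIES:
        if strategy == user_strategy:
            continue  # counterfactuals must differ from the original strategy
        message = llm(f"{context}\nRewrite the user's last message using the {strategy} strategy.")
        reply = llm(f"{context}\nUser: {message}\nWrite the simulated party's reply.")
        rating = llm(
            f"{context}\nUser: {message}\nReply: {reply}\n"
            "Rate from 1 (most likely to escalate) to 5 (least likely to escalate)."
        )
        results.append(
            {"strategy": strategy, "message": message, "reply": reply, "score": parse_score(rating)}
        )
    return results
```

Parsing the score defensively matters because a zero-shot rating may arrive wrapped in prose (for example "Score: 4") rather than as a bare digit.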
User Selection and Interaction. At this stage, the user selects among their own input message and response and the alternative (message, response) counterfactuals. Upon selection, the chosen message is added to the Contextualization step, and the simulation loop repeats. The interaction trace (the message and conflict resolution score selected by the user) is included in the contextualization step. Future generations from the model are therefore conditioned on all past interactions with the user. By keeping a trace of the conflict forecast in the context, we aim to reduce the likelihood of a simulation's "cooperativeness" varying wildly from message to message. We explore this phenomenon closely in §5, where we conduct an extended evaluation and ablation of our prompting method.
Backend LLMs. In theory, IRP prompting can be powered by any LLM with instruction-following capabilities. To implement Rehearsal, we default to GPT-4 [62], OpenAI's most capable language model at the time of writing. We use a temperature of 0.0 for strategy classification in IRP planning and for conflict resolution scoring (encouraging deterministic classification behavior), but use the default parameters otherwise (temperature = 0.7, max_tokens = 256). A web server serves as a proxy between Rehearsal's frontend and OpenAI's API, and maintains the state of each user's conversations.
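The loop described in this section (classify, generate counterfactuals, respond, score, select, update the context) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in Rehearsal: the `llm` callable, the prompt wording, the function names, the "Other" fallback label, and the abbreviated strategy list are all hypothetical. Only the temperatures (0.0 for classification and scoring, 0.7 otherwise) and the initial score of 1 come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical, abbreviated strategy list used only for this sketch.
STRATEGIES = ["Interests", "Rights", "Power", "Proposal"]

# Assumed LLM interface: (prompt, temperature) -> completion text.
LLM = Callable[[str, float], str]


@dataclass
class Simulation:
    premise: str
    # Each trace entry is (selected message, simulated reply, score).
    history: List[Tuple[str, str, int]] = field(default_factory=list)
    score: int = 1  # simulations start at the lowest conflict resolution score

    def context(self) -> str:
        """Contextualization: premise, current score, and the interaction trace."""
        turns = "\n".join(
            f"User: {m}\nReply: {r}\nScore: {s}" for m, r, s in self.history
        )
        return f"Premise: {self.premise}\nCurrent score: {self.score}\n{turns}"


def classify(llm: LLM, sim: Simulation, message: str) -> str:
    """Planning: label the strategy of a message (temperature 0.0)."""
    prompt = f"{sim.context()}\nClassify the strategy of: {message}"
    label = llm(prompt, 0.0).strip()
    return label if label in STRATEGIES else "Other"


def score_reply(llm: LLM, sim: Simulation, message: str, reply: str) -> int:
    """Scoring: predict a 1-5 conflict resolution score (temperature 0.0)."""
    prompt = f"{sim.context()}\nUser: {message}\nReply: {reply}\nScore 1-5:"
    return min(5, max(1, int(llm(prompt, 0.0).strip())))


def candidates(llm: LLM, sim: Simulation, message: str):
    """Return the user's own option first, then one option per other strategy."""
    original = classify(llm, sim, message)
    alternatives = [s for s in STRATEGIES if s != original]
    options = []
    for strategy in [original] + alternatives:
        if strategy == original:
            text = message
        else:
            text = llm(f"{sim.context()}\nRewrite using {strategy}: {message}", 0.7)
        reply = llm(f"{sim.context()}\nUser: {text}\nRespond in character:", 0.7)
        options.append((strategy, text, reply, score_reply(llm, sim, text, reply)))
    return options


def select(sim: Simulation, option) -> None:
    """Selection: add the chosen (message, reply, score) to the trace."""
    _, text, reply, score = option
    sim.history.append((text, reply, score))
    sim.score = score
```

Because `select` writes the chosen score back into the context, every later prompt is conditioned on the full trace, which is the mechanism the text relies on to keep "cooperativeness" stable across turns.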

TECHNICAL EVALUATION: VALIDATING INTERESTS-RIGHTS-POWER PROMPTING
Rehearsal's ability to teach conflict resolution depends on the quality of its generated conflict simulations. Before we evaluate the end-to-end effectiveness of Rehearsal, we first evaluate the IRP prompting strategy. To simulate effective conflict resolution scenarios, Rehearsal must remain grounded in our selected theory of conflict (Interests-Rights-Power) while generating realistic conflict. Here, we validate our approach by measuring the classification performance of IRP prompting, and by ablating specific components in a controlled setting and re-evaluating the ecological validity of the corresponding generations. Concretely, we evaluate IRP prompting using two metrics:
(1) Accuracy: IRP prompting must correctly classify utterances into Interests, Rights, Power, and the other conflict strategies. We therefore compute classification accuracy for the IRP Planning component of the IRP prompting pipeline.
(2) Ecological Validity: For effective roleplay, generated messages must be believable representations of how an expert instructor might respond. During roleplay, experts strive to make simulated scenarios both believable and grounded in a teaching framework [27,48]. Believability judgments have already been used to evaluate the realism of simulations and agents [63,64]. We evaluate the ecological validity of generations from IRP prompting in relation to expert roleplay, and ask: how ecologically valid are generated messages for teaching conflict?

Procedure
Measuring accuracy and validity, however, requires evaluation from individuals familiar with IRP and teaching conflict resolution. Both the first and second authors act as annotators: the first author has received significant training and exposure to literature in conflict theory, and the second author is a graduate student in a business school studying interpersonal conflict.
Evaluating Accuracy. Identifying which strategy a user employed, or which strategy the simulated roleplay should use, is handled by the IRP Planning component. To test this component, we isolate and evaluate its classification accuracy. First, the two evaluators interacted freely with Rehearsal in an end-to-end fashion for a total of 100 dialogue turns (50 each), distributed evenly across the three premises. However, instead of allowing the IRP Planning component to select a strategy, a strategy was randomly sampled and provided to the evaluators. The annotators then had to write a message that adhered to the randomly sampled strategy. Concretely, given a conversational context (arguing with a roommate), we randomly sample a strategy (Power) from the IRP strategies. Annotators then write a response ("I'm gonna kick you out.") that uses the strategy, resulting in (strategy, response) pairs. Using this method, the evaluators collected a test set of 100 messages. Finally, messages from the test set were fed back into the IRP Planning component, yielding a set of predicted strategy classifications. We compared the predictions from the IRP Planning component to the test set labels, computing an accuracy score for each strategy. Additionally, we evaluated the scoring component of IRP prompting. We asked evaluators to independently rank, with ties allowed, the generated counterfactuals (e.g., from most to least likely to worsen conflict) for each interaction while blinded to the generated conflict resolution score. On a subset of 10 conversations, we find that evaluators agree highly with each other on the ranking of counterfactual generations, with a Spearman rank-order correlation of ρ = 0.84. We therefore additionally tested whether rankings derived from our LLM-generated conflict resolution scores correlated with evaluator judgments, computing another Spearman rank-order correlation coefficient.
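The two quantities used in this procedure, per-strategy accuracy and a Spearman rank-order correlation that tolerates ties, can be computed as in the sketch below. The function names are illustrative assumptions, and the tie handling (average ranks followed by a Pearson correlation on the ranks) is the standard textbook approach rather than a detail confirmed by the procedure above.

```python
from collections import defaultdict
from math import sqrt


def per_strategy_accuracy(gold, predicted):
    """Fraction of test messages classified correctly, per gold strategy."""
    correct, total = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        correct[g] += int(g == p)
    return {s: correct[s] / total[s] for s in total}


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

For example, `spearman(evaluator_ranks, llm_scores)` would compare one evaluator's ranking of counterfactuals against the generated conflict resolution scores for the same items; both argument names are placeholders.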
Evaluating Ecological Validity. To evaluate the validity of IRP prompting, we compare expert assessment of IRP prompt conflicts to ablated conditions. The first ablation is (1) Standard unconstrained generation. In this ablation, we only consider the contextualization step, ignoring planning and scoring. This ablation has no knowledge of IRP theory and is closest to simply prompting a language model to roleplay conflict. We also introduce a (2) Planning-Only ablation, where we consider only the planning and contextualization steps. This ablation uses IRP to generate messages but does not score messages based on conflict efficacy, or contextualize the history of scores for prior messages. In contrast, the (3) Scoring-Only setting generates and contextualizes conflict resolution scores at each turn, but does not use IRP theory as a backbone for planning. Finally, we evaluate the entire IRP prompting pipeline (contextualization, planning, and scoring) and denote this as the (4) Full condition. To measure validity, the evaluators from the accuracy evaluation ranked the four conditions, blind to condition. At each dialogue turn, Rehearsal generated four independent replies, each corresponding to a specific condition. Evaluators, while blinded to which reply came from which condition, ranked each set of four outputs (one from each condition) from most ecologically valid to least ecologically valid. In all, we collected a total of 100 rankings (25 sets of dialogues × four conditions per dialogue), split evenly across two evaluators. Agreement between evaluators was moderate, with an averaged Spearman rank correlation of ρ = 0.68.
Upon collecting rank data, we compute a TrueSkill score [39] for each ablation, in line with prior work on evaluating LLM prompting architectures [63]. To summarize, the TrueSkill score computes an Elo-like (chess ranking-like) score for players in multiplayer games, producing a mean μ and a standard deviation σ associated with each condition. Ablations with roughly overlapping scores can be seen as producing equally favorable outputs, while differences indicate that one condition is better or worse than another. We conducted a Kruskal-Wallis test to identify an overall difference between ablations and applied the Dunn post-hoc test [75] to isolate which specific conditions differed from one another. The Holm-Bonferroni correction [41] was also applied to results from the Dunn post-hoc test, correcting for multiple comparisons. Finally, the first author qualitatively analyzed the feedback, identifying the failures and strengths of each condition. Two rounds of qualitative open-coding were employed. In the first round, the author outlined codes specifying why one condition was selected above another. In the second round, the author aggregated codes to highlight higher-order strengths and weaknesses of each ablation.
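As an illustration of the multiple-comparison step, the Holm-Bonferroni procedure can be written in a few lines. The sketch below is a generic step-down implementation that returns adjusted p-values; the function name is an assumption, and the input values in the usage note are hypothetical rather than results from the study.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the same order as the input.

    The smallest p-value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each is
    capped at 1.0.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

With four conditions there are six pairwise comparisons, so a call would take six raw p-values from the post-hoc test, for example `holm_bonferroni([0.001, 0.004, 0.01, 0.02, 0.03, 0.04])` with made-up inputs, and each adjusted value is then compared against the significance threshold.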

Results
IRP prompting successfully classifies and scores conflict resolution strategies. Our planning component yields high accuracy scores across high-level strategy categories: 86% for cooperative strategies, 90% for competitive, and 79% for neutral (Table 6). We additionally find that the IRP Planning component predicts the specific conflict resolution strategy with moderately high accuracy (avg. of 82%), relative to similar classification tasks in NLP [88] (e.g., classifying coarse online community discourse acts ≈ 76.3% [85]). Note that the Interests and Positive Expectation accuracy scores are slightly lower than other categories (66% and 79% respectively). However, most misclassifications involve confusing Positive Expectations with Interests and vice versa. Although average strategy-level accuracy is fairly high (82%), we suspect that further prompt-tuning and additional few-shot examples could improve performance. Finally, we observe that evaluator and LLM rankings of forecasted conflict, produced through the generated conflict resolution score (§4.2), have moderately high Spearman rank correlation (ρ = 0.72).
The full IRP prompting pipeline outperforms all ablations. Scores from the TrueSkill rankings (Figure 7) rate the Full system as the best (μ = 28.64, σ = 0.74), followed by Scoring Only (μ = 25.63, σ = 0.72), Planning Only (μ = 24.08, σ = 0.71), and the baseline Standard condition (μ = 21.17, σ = 0.73). Results from the Kruskal-Wallis test and the Dunn post-hoc test also reflect improvements provided by IRP prompting. The Kruskal-Wallis test indicates significant differences between conditions (p < 0.001; H = 101.7). Furthermore, the Dunn post-hoc test indicates significant pairwise differences between all ablations (p < 0.05). Altogether, these results indicate that all components of IRP prompting are critical to generating more believable and informative conflict simulations.
Standard instruction-following LLMs are too agreeable. In the Standard setting, where we directly prompt a model to simulate conflict, we find that LLM generations become agreeable too quickly (see Figure 8). In conflict, however, we expect more resistance from the contentious party. Inspecting rank data, we observe that simulations from the Planning-Only and Standard conditions begin offering solutions and adopting an interests approach early in the conversation, even when the user offers no solution. From a practice perspective, a user could try any strategy in a roleplay, and the simulation would yield fairly quickly. To address this, we introduced a conflict resolution score in the Scoring-Only ablation. We deconstruct the simulation process by scoring the potential of a resolution first and then generating a message. However:
Scoring-Only ablations rarely change positions from the initial prompted viewpoint. When integrating scoring alone, generations are too fixated on the initial prompted viewpoint. Regardless of what a user does, a simulation rarely changes its perspective on how "angry" it is. Similar to a simulation that yields too quickly, a simulation that never yields is also less effective for teaching. Qualitatively, even when one spends the entire conversation applying effective strategies, the conflict resolution score rarely increases. We suspect that simply including a scoring component without specific IRP planning results in models not "knowing" when to adjust a forecast. IRP, however, offers a delineation between effective and ineffective strategies. Jointly including both planning and scoring breaks the problem into multiple steps: models first plan by classifying a strategy, then update the conflict resolution score. Through IRP-oriented planning, we confirm that models improve at identifying when to adjust scores, resulting in simulations that yield only when effective strategies are used.
Integrating all components of IRP prompting yields a practice "goldilocks zone". The full IRP prompting pipeline offers a middle ground between our system ablations: by jointly planning and scoring, simulations are neither too stubborn nor too agreeable. The benefits of our prompting pipeline become apparent when conversations extend beyond 2-3 turns: in longer conversations, overly stubborn or overly cooperative simulations become easy to spot. Looking at the mean reciprocal rank (MRR) at the tail end (last 3 messages) of our ranking data, we find that full prompting has an MRR of 0.82, compared to Standard = 0.29, Planning Only = 0.41, and Scoring Only = 0.57. Differences between conditions are far smaller during the first 3 turns of a conversation, with full prompting = 0.58 compared to Standard = 0.6, Planning Only = 0.54, and Scoring Only = 0.36. In summary, planning and scoring generations with a specific conflict resolution theory is central to generating effective conflict, especially in longer conversations.
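For reference, the mean reciprocal rank of a condition is the average of 1/rank over all ranking sets, so a condition that is always ranked first scores 1.0 and one that is always ranked last of four scores 0.25. A minimal sketch follows; the data layout (one dictionary per ranking set, mapping condition name to rank) is an assumption made for the example.

```python
from collections import defaultdict


def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank 1 is the most ecologically valid reply."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def mrr_by_condition(rankings):
    """Compute one MRR per condition from a list of {condition: rank} dicts."""
    by_condition = defaultdict(list)
    for ranking in rankings:
        for condition, rank in ranking.items():
            by_condition[condition].append(rank)
    return {c: mean_reciprocal_rank(r) for c, r in by_condition.items()}
```

Restricting `rankings` to the last three turns of each conversation, or to the first three, gives the tail-end and early-turn comparisons discussed above.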

USER STUDY: MEASURING END-TO-END EFFECTIVENESS OF REHEARSAL
The core goal of Rehearsal is to teach conflict resolution through simulated roleplay. In this section, we evaluate the effectiveness of Rehearsal's simulated roleplay through a controlled user study, testing the improvements Rehearsal brings when paired with standard video-based conflict resolution training. To measure effectiveness, we evaluate our broader education goals: (1) recognition and recall of IRP strategies, and (2) performance in a live conflict setting. Each educational goal is measured through a distinct evaluation: recognition and recall are evaluated through a traditional knowledge quiz, while performance is evaluated through a final, live conflict over chat with an author blinded to the participant's condition. We recruit N = 40 participants and outline a control condition where individuals go through a standard conflict resolution training procedure. In the experimental condition, participants are additionally provided access to Rehearsal, alongside the standard training procedure. In this section, we detail components of our user study and highlight aspects of end-to-end use across study participants.
Our study aims to evaluate the following two hypotheses, centered around knowledge and application.
Knowledge Hypothesis: On a quiz, participants in the Rehearsal condition will be more likely to recognize and/or recall specific strategies in the Interests-Rights-Power framework.
Performance Hypothesis: In an applied setting with an evaluator (confederate) roleplaying conflict, participants in the Rehearsal condition will be less likely to use contentious strategies (Rights, Power), and more likely to use cooperative strategies (Interests, Proposal, etc.).
Our study finds little effect on "book" knowledge of IRP but strong effects on conflict performance, with participants roughly doubling their use of cooperative strategies and reducing the use of competitive strategies by 67% in a live conflict. In this section, we outline our evaluation procedure and highlight the results for both hypotheses.

Evaluation Conditions
Our evaluation consists of two conditions: the control group, where participants see the status quo of conflict resolution training, and Rehearsal, where participants are additionally provided with our system to interactively practice conflict resolution. In each condition, participants are allocated a maximum of 25 minutes of total practice, determined through a series of pilot studies where participants' completion times were measured.
• Control: In the control condition, participants are provided a representative status quo of conflict resolution training: a tutorial video covering the IRP framework, and a list of conflict resolution strategies. No publicly available video covered all conflict resolution strategies, and existing slides were likewise inadequate for an introductory lecture. So, we recorded a 4-minute video covering all relevant conflict resolution strategies, modifying lecture slides from a pre-existing conflict resolution course at our institution. Additionally, participants can refer to a summary of the training video, with definitions and examples of conflict resolution strategies (similar to Table 1).
• Rehearsal: In the Rehearsal condition, in addition to the training materials from the control condition, participants are given access to Rehearsal and can practice conflict with simulations. Because participants are time-constrained to a maximum of 25 minutes of practice, we randomly sample 3 scenarios from our full curated set of 12 (sample in Appendix A). Each participant was able to practice with the same 3 scenarios.

Procedure and Analysis
We conducted a between-subjects study of N = 40 participants, measuring the causal effect of Rehearsal as a system for teaching conflict. An overview of our study design is in Figure 9. The evaluation consists of three components: a conflict self-efficacy quiz, a knowledge quiz that tests retrieval and recall of conflict resolution strategies, and a final roleplay that tests application skills. All participants begin and end the study by completing a conflict self-efficacy survey. We administer the Dutch test, a set of 5-point Likert scales used to measure conflict self-efficacy. The test consists of 20 total scales, measuring five conflict resolution dimensions (compromising, problem-solving, yielding, forcing, avoiding) and producing a score from 4 to 20 for each dimension [19]. Results from the Dutch test have been verified across a range of conflict management studies [21,22,33]. Analogous to the Interests-Rights-Power strategies, ideal conflict results in high problem-solving/compromising scores, and low yielding/avoiding/forcing scores. The Dutch test also enables us to compute pre-post self-efficacy scores after participants complete the study. Because questions from each dimension are independent (no overlap), we conduct a paired t-test for the questions in each dimension, identifying whether self-efficacy scores change within conditions following training. To identify differences across conditions, we fit a linear regression model for each dimension, predicting post-efficacy scores from pre-efficacy scores while treating the experimental condition as a dummy variable. We examine coefficients and p-values associated with the dummy condition variable.
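The arithmetic behind the self-efficacy analysis can be made concrete with a short sketch. Four 5-point items per dimension give a dimension score between 4 and 20, and the paired t statistic is the mean pre-post difference divided by its standard error. The item-to-dimension layout below (five contiguous blocks of four items) is an assumption made purely for illustration and is not the published item order; the function names are likewise hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

DIMENSIONS = ["compromising", "problem-solving", "yielding", "forcing", "avoiding"]


def dimension_scores(responses):
    """Sum four 1-5 Likert items per dimension, giving a score from 4 to 20.

    Assumes (hypothetically) that items 0-3 belong to the first dimension,
    items 4-7 to the second, and so on.
    """
    if len(responses) != 20 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected 20 responses, each between 1 and 5")
    return {d: sum(responses[4 * i: 4 * i + 4]) for i, d in enumerate(DIMENSIONS)}


def paired_t(pre, post):
    """Paired t statistic for post - pre, with len(pre) - 1 degrees of freedom."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

A negative statistic for a dimension indicates that scores dropped after training; converting the statistic into a p-value requires the t distribution, which is left to a statistics package.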
For the knowledge quiz, we curate a set of 10 context messages (e.g., "How about we spend more time working on the scheduling process instead?"), each relying on a single strategy from the full IRP framework. To keep the quiz short, we focus on core IRP strategies, choosing to omit Procedural and Concession as a result. The full quiz is in Appendix B; each participant receives the same quiz. We ask participants to either recall or recognize the strategy used in each message. Recall questions are answered by providing the exact name of the strategy, while recognition questions require users to select the strategy from a multiple-choice form. The quiz first tests recall and then recognition; we do not allow participants to go back and change their answers, preventing use of the recognition options to answer recall questions. Additionally, we limit the time allocated to both recall and recognition. In both conditions, participants are given 3 minutes to answer all questions. As with studying time, quiz time was determined through pilot studies. We conduct a two-sample t-test on recall and recognition scores across the Rehearsal and control conditions.
On finishing the knowledge quiz, all participants move to the application component of the evaluation. To evaluate a participant's ability to effectively apply conflict resolution strategies, we randomly select a scenario from Rehearsal's collected conflict training roleplays (discussed in §3.1.1): participants never see this scenario before the evaluation (the withheld scenario is detailed in Appendix A.2). Participants are then moved to a 1:1 chatroom with the evaluator. Consistent with our technical evaluation in §5, the evaluator is an author with significant exposure to conflict resolution training, who is blinded to the participant's randomized training condition. All participants then engage in a final conflict of 10 dialogue turns on the same withheld scenario, for a total of 20 messages. Participants were instructed to resolve the conflict and were informed that they were engaging with a real person. The evaluator carefully followed conflict spiral findings from Brett et al. [9], starting their roleplay with a contentious strategy, retaliating if contentious strategies were used, and yielding when cooperative strategies were repeatedly used. To ensure that the evaluator was consistent across conditions, the first and second author coded all messages with their primary IRP strategy (yielding a Cohen's Kappa [17] of 0.74), and tested for distributional differences between the conditional reply strategies used in each condition. Specifically, we used the two-sample Kolmogorov-Smirnov [60] test across P_Rehearsal(s_t | s_t−1) and P_Control(s_t | s_t−1), the distributions of a reply strategy s_t given the preceding strategy s_t−1 in each condition, finding no significant difference between distributions for conflict resolution strategies (p > 0.1). In addition to being blinded to the condition, the evaluator behaved similarly in both conditions. After aggregating strategy use across the annotated conversations, we conduct two-sample t-tests comparing each strategy's use across conditions. Following the application component of the study, participants retook the Dutch test. Finally, we asked participants to leave optional open-ended comments on their training in a textbox shown at the end of the study.
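Cohen's Kappa, used above to check agreement between the two coders, corrects raw agreement for the agreement expected by chance given each coder's label frequencies. A minimal sketch, with an assumed function name and toy labels in the usage note:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items.

    kappa = (observed - expected) / (1 - expected), where expected agreement
    is the sum over labels of the product of the two coders' label rates.
    Undefined if both coders use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For instance, two coders who agree on three of four messages, with one coder using each of two labels twice and the other using them three times and once, have an observed agreement of 0.75, an expected agreement of 0.5, and a kappa of 0.5.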
Note that participants were offered no explicit incentive or motivation to actually utilize conflict resolution strategies during the study. Aside from an initial attention check early in the study, participants were free to spend as little time as they wanted in training (e.g., using Rehearsal and/or studying static training material) or conducting the final application component. Beyond the 25-minute maximum study time (§6.1), we enforce no other constraints.

Participant Details
Participants for our study were recruited through Prolific, a crowdwork platform. We recruited participants who were over 18 years of age, lived in the U.S., and self-identified as fluent English speakers. Participants were also required to have at least 50 submissions and a 95% approval rate. All participants signed an IRB consent form approved by our institution and were paid at a rate of $12.00 per hour.

Results
Practice with Rehearsal results in increased use of cooperative strategies and decreased use of competitive strategies. Using Rehearsal has a clear effect on conflict strategy use in the application portion of our evaluation. In the final 10-turn chat, participants using Rehearsal were twice as likely to use the Interests and Proposal-oriented strategies compared to the control condition, averaging 3.0 vs. 1.5 times for Interests, and 2.9 vs. 1.5 times for Proposal. Both strategies show significant improvement compared to the control condition (two-sample t-test, p < 0.05). Furthermore, participants in Rehearsal reduce their use of competitive strategies. Use of Power strategies in the Rehearsal condition is roughly 1/3 of the control condition (1.0 vs. 2.9). Rights-based strategies are almost never used in the Rehearsal condition (≈ 0) but appear on average 0.5 times per conversation in the control condition. Differences across conditions for both the Power and Rights strategies are also significant (p < 0.05). Participants with Rehearsal perform far better in the actual conflict, supporting the Performance Hypothesis.
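The comparison above reduces to counting how often each participant used a given strategy in the coded conversation and comparing the two groups of counts. The sketch below shows that aggregation together with a two-sample t statistic. The text does not say which variant of the test was used, so the unequal-variance (Welch) form here is an assumption, as are the function names and the toy data in the usage note.

```python
from math import sqrt
from statistics import mean, variance


def strategy_counts(conversations, strategy):
    """Number of participant messages coded with `strategy`, per conversation."""
    return [sum(1 for label in labels if label == strategy) for labels in conversations]


def welch_t(x, y):
    """Two-sample t statistic without assuming equal variances (Welch)."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error
```

A call such as `welch_t(strategy_counts(rehearsal, "Interests"), strategy_counts(control, "Interests"))`, where `rehearsal` and `control` are hypothetical lists of per-participant label sequences, yields a positive statistic when the Rehearsal group uses the strategy more often.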
Passively reading or watching conflict resolution training material builds skills orthogonal to actual practice. Results from the application component of our evaluation are universally positive: participants in Rehearsal significantly outperform participants in the control condition. Participants from the control condition offer a simple explanation for our quantitative results: "It's easy to read about conflict-resolution strategies, but a lot harder to implement and stick to." For our Performance Hypothesis, where we evaluate application skills, we find that participants use significantly more cooperative strategies, and significantly fewer competitive strategies. In contrast, we cannot claim that Rehearsal improves knowledge skills, failing to support our Knowledge Hypothesis.
Simply reading over strategies does not translate over to applying them. Participants mistakenly feel prepared after watching/reading training material and are blindsided by the difficulty of the application component in the evaluation.
"I'll be honest: while I did pay attention to the training, it was really hard to imagine implementing [the conflict resolution strategies] without feeling like a total ass. I feel like it's obvious when people read about some weirdo social technique or something and try to do it in conversation, they end up sounding like brainwashed robots."
A common challenge across participants in the control condition lies in application. Rehearsal shines at providing application-oriented feedback to participants: as we've observed, application is primarily where we see outsized effects in our final evaluation. While it's challenging for participants in the control condition to "imagine implementing" conflict resolution strategies, Rehearsal offers exactly this! Practicing implementation using a simulated roleplay ensures that participants in the Rehearsal condition are more prepared to apply and implement strategies in the evaluation.
While Rehearsal builds application skills, simulated practice does not significantly improve recognition or recall. Across both conditions, the average quiz score is 12.6/20 or 63%. When stratifying by experimental condition, we see a difference in performance: 11.3/20 for control vs. 13.8/20 for Rehearsal. Using Rehearsal, participants also see an average increase in both recall and recognition quiz scores relative to control (5.3/10 → 6.2/10 for recall and 6.0/10 → 7.6/10 for recognition). However, these differences are not significant, with our t-test indicating no effect (p > 0.05). While the tool helps with application, we find no support for the Knowledge Hypothesis. This result suggests that Rehearsal improves "street knowledge" but not necessarily "book knowledge" of conflict, which, if we had to choose between the two types of knowledge, is the preferable outcome.
The "Conflict Reality Check": participants reduce conflict self-efficacy after training. Participants in both conditions generally reduced their Dutch conflict self-efficacy scores following the final conflict in the study. Participants in both conditions indicate that they use fewer cooperative strategies (compromising, yielding, and problem-solving), but use more competitive strategies, specifically forcing (paired t-test, p < 0.05). Furthermore, results from our regression model indicate that participants in the Rehearsal condition self-reported increased use of forcing and decreased use of yielding compared to the control group (p < 0.05). This is at odds with our empirical application results. Firstly, any practice should increase efficacy scores, but we observe a decrease for all cooperative dimensions. Secondly, participants in the Rehearsal condition clearly use fewer competitive strategies, but self-report significantly higher use of forcing and lower use of yielding strategies. What gives? We suspect that this is a classic example of the Dunning-Kruger effect [49]. Participants after any training have a better understanding of their true conflict resolution skills, and recalibrate their scores (moving them down!) based on their performance in the evaluation. In the Rehearsal condition, participants are even better at recognizing power-oriented forcing strategies, and reduce their scores to a larger degree than the control condition.
You can lead a participant to water... In aggregate, participants using Rehearsal significantly outperformed participants in the control condition during the actual conflict. However, even in the Rehearsal condition, 2 participants employed just a single interests-oriented strategy in the final evaluation conflict. Surprisingly, both participants performed fairly well on the quiz component of the study, averaging 7.5/10 correct answers. One of these participants left the following comment: "I felt like the training helped me learn new methods, but I am stubborn by nature and once the person was rude from the outset, I did not want them to get their way." These cases highlight the scope of Rehearsal's purpose. While Rehearsal improves use of conflict strategies on average, it offers no guarantee that a specific participant will actually choose to use them. An individual who comes in with no intent to actually resolve a conflict likely will not change their mindset following any kind of training.

Rehearsal's Failure Modes
A two-stage inductive coding of Rehearsal's interaction traces across participants highlights a range of failure modes, some more egregious than others. Anomalies in individual message generations were coded and repeating themes were merged. We describe these anomalies below:
Occasionally, simulations feel overly scripted. We note that some simulated messages are too "on the nose" because Rehearsal generates messages conditioned on a strategy. For example, Rights-generated messages explicitly mention "rights" or "fairness" (e.g., "That's not fair!"); Proposal messages explicitly mention a proposal (e.g., "I propose we do this..."). In realistic conflict, the recognition and application of strategies are more nuanced. Rights-based strategies depend on the context of a conflict and the norms that shape it. What constitutes a Power strategy depends on complex power dynamics between individuals in a party. While Rehearsal allows users to specify properties of their own conflict in custom scenarios, we do not rigorously evaluate this feature in our user studies.
Hallucinations, though mitigated by IRP, still appear. LLMs often make up facts about a scenario to justify an argument (e.g., disagreeing about the specifics of being late to work). For most scenarios, this is acceptable. In roleplay, experts will introduce facts as a conflict progresses. However, we have limited control over when LLMs hallucinate, which may detract from practicing the intended scenario. Related work on story generation [81] proposes the use of an edit module to store and correct factual inconsistencies. Integrating this into IRP prompting is an avenue for future work.
For a specific scenario, generated conversations are less diverse than they initially seem. For our evaluation, participants spend a limited amount of time practicing with Rehearsal, engaging in a few conversations across 3 scenarios. Given this limited exposure, we suspect that participants themselves did not notice diversity problems. Zooming out across all interactions, however, we note that the style of conflict is fairly regular within a scenario. Simulated conversations follow a prescriptive and formal style. We suspect that instruction-following in LLMs serves as a strong prior over the style of generated text. Manually finetuning open-source LMs to simulate conflict will likely mitigate this issue.

DISCUSSION
In this section, we reflect on teaching through strategic simulation and grounding LLMs in social science theories. We also discuss ethical risks related to applying generative models to educational applications.

Simulation as an Effective Teaching Interaction
Roleplay has long been an established method for teaching across a wide range of domains: customer service [8], language learning [55], therapy [51], healthcare [61], and more. Across these domains, effective roleplay requires an expert to simulate a specific scenario, while working meticulously through a domain-specific teaching process.
Rehearsal's goal is to focus on a setting where roleplay serves a foundational role (conflict resolution) and highlight how effective roleplay simulation can yield strong results. More broadly, we argue through this work that applying simulated roleplay requires understanding when, and for what skills, simulation is effective.
When does simulated roleplay work? We hypothesize that simulated roleplay has a higher likelihood of working in domains where expert roleplay is already effective. In these domains, experts have already laid the groundwork for roleplay: frameworks like IRP exist because experts regularly teach conflict resolution. Rehearsal's effectiveness, in turn, is only possible because it builds on these frameworks. Remaining faithful to social scientific theory that already works in practice, and ensuring that LLM generations are constrained to a selected theory, will increase the chances of successfully simulating roleplay in those domains. IRP prompting, for example, depends heavily on how experts employ conflict resolution strategies to simulate conflict. Prior work in roleplay lies primarily in soft-skill training, so we expect successful applications to operate primarily in these contexts.
What kinds of skills does simulated roleplay build? After evaluating Rehearsal, we find a significant increase in participants using cooperative conflict resolution strategies compared to a control group. However, when asked to recall and recognize strategies, participants show no significant difference between groups. Rehearsal's primary interaction revolves broadly around applying cooperative and competitive strategies. There is little incentive for users to build "book smarts", i.e., the specific names associated with each of the 8 IRP strategies. For example, our Recall and Recognition interaction, where we ask users to recall a simulation's strategy, occupies a small part of the full interaction loop. All that matters is the "street smarts", or the application of these principles. Given this incentive misalignment, users in a simulated roleplay likely prioritize application over memorization, as reflected in our results. Therefore, while simulations like Rehearsal can be beneficial for the practical application of skills such as conflict resolution, they may not significantly improve recall of specific knowledge. In domains where such knowledge is critical, educators can apply both experiential learning through simulations and traditional educational methods (e.g., flash cards, spaced repetition).

Grounding in Theory Improves LLM Generation Capabilities
Rehearsal works by constraining the generation of LLMs to a specific theory (in our case, IRP). Constrained generation is already an established trend in a range of more traditional NLP tasks. From code generation [83] to summarization to story generation [80,81], LLMs augmented with explicit planning/constraining components generally outperform out-of-the-box models [32]. Our work, however, takes traditional constrained generation a step further, using internal social-scientific knowledge of an LLM as guardrails for the same LLM's generation process. More concretely, we ground LLMs in an established subset of social scientific theory, augmenting models with latent, specialized knowledge of interpersonal interaction. Given that large language models are trained on vast swaths of the internet, we suspect that they contain information relevant to a range of established social scientific theories. IRP prompting first elicits this knowledge (with a planning component), then ties LLM generation to the elicited knowledge. Across our evaluations, we find that grounding to IRP theory improves the validity of Rehearsal as a teaching tool. Instead of sampling randomly across a larger space of generations, constraining an LLM specifically to a teachable theoretical framework significantly increases the effectiveness of Rehearsal. Users have an actionable and controlled space to practice; generations are "predictable enough," providing users with a practice sweet spot. While unconstrained generation may offer more variety, a lack of control during generation yields less effective simulations.
Still, constraining an LLM with its own internal knowledge is not a straightforward feat. A core challenge of building Rehearsal, and more specifically IRP prompting, lies in eliciting and evaluating this knowledge and applying it to a carefully constrained task. The IRP prompting pipeline is a multi-step process, with each step supervising a model's final output. Even with multi-step prompts, we conduct careful evaluations of an LLM's ability to use IRP effectively, ablating components in our pipeline and evaluating an LLM's ability to classify conflict resolution strategies.
In theory, one could simply substitute the IRP definitions in our prompts with those of another framework, and allow our planning components to elicit knowledge and constrain generation. Indeed, we suspect that general patterns from IRP prompting can be applied to different social scientific theories, enabling Rehearsal-like teaching interactions for a wider range of domains. However, while grounding in IRP proves successful for Rehearsal, there is no guarantee that LLMs possess adequate latent knowledge across other theoretical frameworks. For novel social scientific theories not already present in the training data, we are unsure of our pipeline's effectiveness. Successful constrained generation requires that LLMs can first recognize and apply the target theory.

Limitations and Future Work
While Rehearsal proves effective in our user study, there are important limitations and avenues for future work.
IRP does not cover all conflict resolution scenarios. Conflict resolution theory assumes that both parties can resolve conflict in good faith. This is not true of all conflict: there are instances where no application of cooperative strategies will cause the interlocutor to change their position.³ Conflict scenarios that might be actively dangerous to engage in, or pose too high a risk to try resolving, may require additional support or even disengagement. Rehearsal cannot extrapolate to situations like these. Future work should predict whether a specific conflict falls within IRP's scope and warn users when simulated practice is unlikely to be effective. Finally, IRP is a Eurocentric conflict resolution framework. Strategies vary significantly across cultures, and this variation should be an important consideration for future work [33].
Near and Far Transfer. We evaluate performance on conflict resolution tasks immediately after simulated training. Therefore, we can only claim that Rehearsal is effective with near transfer in learning, where skills are retained immediately after an educational intervention [67]. We do not test if Rehearsal's effect persists over a longer period of time via far transfer [5]. Evaluating far transfer is an important avenue for future work.
Additional Applications. Rehearsal currently supports two parties during a conflict. Conflict, however, often involves multiple parties. Future work can extend Rehearsal to multiparty setups, where each party is powered by IRP prompting. Supporting multiparty conflict resolution enables training across more general moderation and consensus-building activities. Finally, as generative models continue improving in domains beyond text, we also expect training systems like Rehearsal to support multimodal interaction (e.g., voice, video).

Ethical and Societal Considerations
While simulated roleplay shows promise as a useful teaching interaction, we must also weigh the ethical and societal implications of deploying systems like Rehearsal.
Deployment Risks and Distributional Shift. Rehearsal is designed primarily as a safe training environment in which users can practice their skills and learn how to transfer those skills to the real world. As with other training environments, we acknowledge a shift between training and the real world. Rehearsal is only intended for simulation; it may not reflect reality exactly. We urge users to practice with Rehearsal cautiously and to keep these risks in mind.
Stereotypes. LLMs have been documented to output a range of stereotypes [7,29]. Rehearsal may therefore reflect both negative and positive stereotypes, especially since we rely on personas in each scenario [16] and use a multi-step prompting strategy [71]. To mitigate this risk, deployments of Rehearsal should explicitly highlight important attributes (e.g., "male", "aggressive", and "ambitious") in the prompts and enable users to make changes as needed in the interface. This approach explicitly reminds users of potential stereotypes associated with their prompts and gives users full control over the types of personas they wish to communicate with, allowing for more direct engagement between users and the tool. Furthermore, Rehearsal is likely to follow any descriptions given explicitly in the prompt, so stereotypes are also likely to arise from under-description in the prompt (e.g., if the prompt says that the interlocutor is just a "boss", the model will likely stereotype them as a white male). Deployments of Rehearsal must explicitly spell out any characteristics that the model should be operationalizing, limiting its ability to stereotype.
Job Displacement. A final risk is that Rehearsal may lead to job displacement or devaluation for expert trainers. However, this would require that people who currently pay professional trainers for conflict training stop doing so. Such trainers are generally retained by extremely wealthy firms, and it seems unlikely that such firms would stop retaining them. More likely, such trainers would integrate these tools as part of their education and training events. Similarly, chemistry simulation tools don't replace chemistry classes, and management books don't replace management coaching. So, we expect that the high end will likely remain stable. It is certainly possible that some individuals on the margin will opt for a cheap or free standalone option. We weigh this risk against the potential benefit of a free-to-use tool, which can benefit a broader user population, especially those without access to expensive professional training or social capital. Such a tool may also make it easier for professional experts to serve a larger population and to focus on the more tailored, challenging scenarios that skill training requires, while retaining the high level of expertise that makes them unique.

CONCLUSION
We introduced Rehearsal, a system for teaching conflict resolution by simulating conflict. While using Rehearsal, individuals interacted with a simulated interlocutor, exercising their conflict resolution skills. Rehearsal is powered by IRP prompting, a multi-step prompting technique that grounds language models in conflict resolution theory. Through interaction with grounded simulations of conflict, users built an intuition for applying effective conflict resolution strategies. We conducted a between-subjects evaluation of N = 40 participants, who engaged in an actual conflict following training with or without the tool. Participants with Rehearsal training significantly improved their application of effective conflict resolution strategies in the unaided conflict. Finally, we discussed applications of Rehearsal beyond conflict resolution and highlighted limitations, ethical considerations, and societal impacts.

A CONFLICT CASE STUDIES
Here, we detail a subset of premises from the Harvard Program on Negotiation [1] and the Crucial Conversations book [52] that are used by Rehearsal during evaluation.

A.1 Practice Case Studies
Undercooked Meal.
"You just tried a meal your partner cooked for you, but it's slightly undercooked. You mention this to your partner, and they're visibly unhappy that you brought this up."
Where's My Refund?
"The complaints clerk (you) in a department store sees a customer (Casey) coming with a blender. The store cannot return these items to the manufacturer. You have a small weekly budget to absorb the cost of such items, if returned, and the department head has instructed that it be used sparingly. The budget for this week is overspent. Casey, having used the blender for over a week, believes it is either defective or an inadequate appliance, and has therefore decided to return it, and is angrily demanding a refund."
Work Performance.
"Jerry has been a steady employee for four years. Recently, Jerry's work and attitude have taken a turn for the worse. Jerry's supervisor (Casey) does not know why, but the situation has come to the point where the supervisor is prepared to fire Jerry, and is under considerable pressure from management to do so. The two are about to meet to discuss this situation."

A.2 Withheld Evaluation Case Study
The Unwanted Promotion.
"Your boss Chris keeps telling you that you'd make a great supervisor. You don't want the promotion. You like what you do. Chris said team players take promotions. You've heard that Chris is submitting the paperwork to have you promoted. Yesterday Chris said you'd soon be getting a big surprise. This morning he asked you to be sure to go to the afternoon team meeting. You don't want him to spring the announcement in the meeting and pressure you. You're now in a 1:1 meeting with him, and he's annoyed that you're planning on turning this down."

B KNOWLEDGE QUIZ
Our knowledge quiz is split into two categories (recall and recognition), with 5 questions per category.

Figure 2: Rehearsal's UI. Rehearsal's interaction covers two broad goals: providing users with in-context feedback (left) on a conflict simulation, and allowing users to directly interact with the simulation (right) itself. Users can specify a conflict premise or select from presets under the "Select Scenario" input (top right). In this figure, the user engaged in conversation with the simulation ("I'm sick and tired...") and was presented in the feedback bar with three alternative messages to use instead. Before deciding to use an alternative, they pressed the fast-forward button on the first one, which projects the conversation one turn into the future (final message in the simulation panel).

Figure 3: Recalling and Recognizing Conflict Resolution Strategies with Rehearsal. Before revealing the simulation's strategy, Rehearsal's Feedback interaction prompts users to recall and recognize conflict resolution strategies from IRP. To prevent frequent context switching, users are only asked to recall and recognize for the simulated interlocutor, not for their own messages. The figure above outlines how users might practice this interaction (progresses from left to right).
Classification. The planning component is frequently used to classify the strategy employed by messages from both the simulated interlocutor and the user, P(strategy | message, context). We provide an LLM with few-shot examples of messages in IRP, along with the conversational history. An abridged version of the prompt is below:
    [Output from contextualization (includes conversation history and premise)]
    [Interests-Rights-Power definitions and few-shot examples.]
    Sender: [User or Simulation]
    Message: You're being ridiculous.
    Strategy: (GPT completion) Power
Counterfactual Generation. To generate counterfactuals for user-sent messages, we invert the generation order, preselecting a set of strategies beforehand. Concretely, the planning module can generate P(message_user | strategy', context). Suppose a user inputs a message whose strategy is Power. IRP prompting can ignore the classified strategy and produce a range of user inputs conditioned on different, potentially more cooperative strategies (like Positive Expectations).
    Strategy: Positive Expectations.
    Message: (GPT completion) If we work together, we can figure this out.
We only gain counterfactual capabilities because generation is grounded in IRP. Without a conflict resolution framework, it is impossible to determine what the counterfactual should be in the first place.
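As an illustration, the two abridged prompts above can be assembled as plain strings. This is a minimal sketch, not the prompts used by Rehearsal: the function names, the bracketed placeholders, and the layout of the few-shot block are assumptions, and the only few-shot pairs included are the two message/strategy examples quoted in the text.

```python
from typing import List, Tuple

# Hypothetical few-shot pool: only the two (message, strategy) pairs quoted
# in the text; the examples in the real prompt are not shown in the paper
# excerpt and are not reproduced here.
FEW_SHOT: List[Tuple[str, str]] = [
    ("You're being ridiculous.", "Power"),
    ("If we work together, we can figure this out.", "Positive Expectations"),
]


def classification_prompt(context: str, definitions: str,
                          sender: str, message: str) -> str:
    """Prompt whose completion is a strategy label,
    i.e. a query for P(strategy | message, context)."""
    shots = "\n".join(f"Message: {m}\nStrategy: {s}" for m, s in FEW_SHOT)
    return "\n".join([
        context,
        definitions,
        shots,
        f"Sender: {sender}",
        f"Message: {message}",
        "Strategy:",  # left open for the LLM to complete
    ])


def counterfactual_prompt(context: str, definitions: str,
                          strategy: str) -> str:
    """Inverted order: the strategy is fixed first and the completion is a
    user message, i.e. a query for P(message_user | strategy', context)."""
    shots = "\n".join(f"Strategy: {s}\nMessage: {m}" for m, s in FEW_SHOT)
    return "\n".join([
        context,
        definitions,
        shots,
        f"Strategy: {strategy}",
        "Message:",  # left open for the LLM to complete
    ])


if __name__ == "__main__":
    ctx = "[premise and conversation history]"
    defs = "[Interests-Rights-Power definitions]"
    print(classification_prompt(ctx, defs, "User", "[user message]"))
    print(counterfactual_prompt(ctx, defs, "Positive Expectations"))
```

The only difference between the two builders is which field is left open at the end of the prompt, which is what inverting the generation order amounts to in this sketch.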

Figure 4: The IRP planning component supports three different modes: it classifies a user's response, generates counterfactual user messages using a pre-planned conflict resolution strategy (e.g., Interests), or plans and generates a simulated response.

Figure 5: The Full IRP Prompting Pipeline. Our conditional generation process is split into contextualization, counterfactual input generation, and response generation. To contextualize simulated conflict, we first encode a pre-defined conflict premise (e.g., arguing about job performance). If a conversation has at least one turn, we additionally include history in the contextualization step. To generate counterfactuals, we pre-plan messages across the IRP framework, allowing users to compare different strategies. A corresponding response for each strategy is planned, generated, and scored by its effectiveness.
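The control flow described in this caption can be sketched as follows. This is a rough outline under stated assumptions rather than the actual implementation: the `llm` callable, all prompt wording, the function names, and the choice to keep the score as raw model text are hypothetical, since the caption does not specify them.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


def contextualize(premise: str, history: List[str]) -> str:
    """Encode the premise; include history only if a turn already exists."""
    parts = [f"Premise: {premise}"]
    if history:
        parts.append("History:\n" + "\n".join(history))
    return "\n".join(parts)


def run_pipeline(llm: LLM, premise: str, history: List[str],
                 strategies: List[str]) -> List[Dict[str, str]]:
    """For each pre-planned strategy: generate a counterfactual user
    message, then plan, generate, and score a simulated response."""
    context = contextualize(premise, history)
    results = []
    for strategy in strategies:
        # Counterfactual input generation, conditioned on the strategy.
        message = llm(f"{context}\nStrategy: {strategy}\nUser message:")
        turn = f"{context}\nUser: {message}"
        # Response generation: plan a strategy, then generate the reply.
        plan = llm(f"{turn}\nStrategy for the simulated reply:")
        response = llm(f"{turn}\nStrategy: {plan}\nSimulated reply:")
        # Scoring prompt and score format are assumptions.
        score = llm(f"{turn}\nSimulation: {response}\nEffectiveness:")
        results.append({
            "strategy": strategy,
            "message": message,
            "response_plan": plan,
            "response": response,
            "score": score,
        })
    return results


if __name__ == "__main__":
    echo = lambda prompt: f"<completion for {len(prompt)}-char prompt>"
    for row in run_pipeline(echo, "arguing about job performance", [],
                            ["Interests", "Power"]):
        print(row["strategy"], "->", row["response"])
```

Passing the model in as a plain callable keeps the sketch runnable without any particular LLM client; a real deployment would substitute its own API call.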

Figure 6: Few-shot classification accuracy for the IRP Planning component in the IRP prompting pipeline.The evaluation was conducted over a balanced test set of 100 messages.

Figure 7: TrueSkill scores across ablations of the full IRP prompting pipeline. We observe that the full prompting pipeline handily outperforms all other conditions.

Figure 8: Representative errors across IRP Prompting ablations. Without IRP prompting, some ablations (Strategy Only and Standard Prompting) offer responses that are too agreeable even when no attempt to resolve the conflict was made (left). In contrast, other ablations (like Scoring Only) are too stubborn (middle), even if a user employs several cooperative strategies. IRP prompting offers a middle ground (right).

Figure 9: Experimental Conditions outlined in our end-to-end user study. We recruit 40 participants from Prolific, and test for the causal effect of Rehearsal by introducing it alongside standard training for conflict resolution. Participants are split into two groups: the control group with just the standard training setup (a video and list of strategies & definitions), and the Rehearsal condition, where participants are additionally provided with interactive roleplay using Rehearsal's simulated conflict. After completing training, both groups complete a quiz testing conflict resolution skills, and practice conflict (without any training material or assistance) in a final scenario with a blinded researcher.

Figure 10: Cooperative and competitive strategy use between both the Rehearsal and control conditions in our user study.
, both strategies are positive in nature; occasional misclassification should have minimal effect on the end-to-end effectiveness of Rehearsal. While

Table 2: 2-sample t-test results across our study hypotheses.
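For readers unfamiliar with the test named in this caption, the following is a minimal sketch of a two-sample t statistic for comparing two independent groups, such as a control and a treatment condition. It assumes the unequal-variance (Welch) form, which the caption does not specify, and the input values are toy numbers for illustration only, not study data.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return the Welch t statistic and its Welch-Satterthwaite degrees
    of freedom for two independent samples."""
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


if __name__ == "__main__":
    # Toy values only; these are NOT measurements from the user study.
    group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
    t_stat, dof = welch_t(group_a, group_b)
    print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Turning the statistic into a p-value requires the CDF of the t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.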