Robust Training for Conversational Question Answering Models with Reinforced Reformulation Generation

Models for conversational question answering (ConvQA) over knowledge graphs (KGs) are usually trained and tested on benchmarks of gold QA pairs. This implies that training is limited to surface forms seen in the respective datasets, and evaluation is on a small set of held-out questions. Through our proposed framework Reign, we take several steps to remedy this restricted learning setup. First, we systematically generate reformulations of training questions to increase robustness of models to surface form variations. This is a particularly challenging problem, given the incomplete nature of such questions. Second, we guide ConvQA models towards higher performance by feeding them only those reformulations that help improve their answering quality, using deep reinforcement learning. Third, we demonstrate the viability of training major model components on one benchmark and applying them zero-shot to another. Finally, for a rigorous evaluation of robustness of trained models, we use and release large numbers of diverse reformulations generated by prompting ChatGPT for benchmark test sets (resulting in a 20x increase in their sizes). Our findings show that ConvQA models with robust training via reformulations significantly outperform those with standard training from gold QA pairs only.


INTRODUCTION
Motivation. Answering questions about entities, powered by large knowledge graphs (KGs) at the backend, is a vital component of Web search [7, 43, 64, 82]. Nowadays, users' information needs are increasingly being expressed as a conversation, in a sequence of questions and answers ⟨q_i, a_i⟩, over turns {t_i} [16, 53, 84]:

q1: What's the 2022 LOTR TV series called? a1: The Rings of Power (TROP)
q2: TROP airing on? a2: Amazon Prime Video
q3: Which actor plays Isildur in the series? a3: Maxim Baldry
q4: And who in the Jackson trilogy? a4: Harry Sinclair
q5: When did the series start? ...

A conversation over a KG contains a set of entities ("The Lord of the Rings: The Rings of Power", "Amazon Prime Video"), their relationships ("aired on"), and types ("TV series", "video streaming service"). In ConvQA, users omit parts of the context in several follow-up turns (q3–q5), and use ad hoc style (q2) [16, 17, 25, 34, 61, 67]. Leaving a part of the intent implicit, coupled with the use of informal language, makes the answering of conversational questions more challenging than the complete ones tackled in older and more established branches of QA [28, 64, 73, 76]. ConvQA has high contemporary interest [15, 33, 41, 58, 75], spurred on to a big extent by systems like ChatGPT that support a conversational interface.

Limitations of state-of-the-art. We quantify robustness in QA in terms of the number of distinct question formulations of a given intent that a QA model can answer correctly: the higher this number, the more robust the model. Methods for conversational question answering (ConvQA) over KGs are usually trained and evaluated on benchmarks of gold-standard ⟨question, answer⟩ pairs [14, 25, 66, 67]. Such a paradigm limits robust learning by being restricted to question formulations roughly seen during training time. One approach in QA to demonstrate generalizability is to train and evaluate models on multiple benchmarks [33, 38, 48]. This only addresses the problem partially: the training and evaluation are still limited to surface forms seen in any of the benchmarks. A particular aspect of existing benchmarks, attributable to their construction choices via graph sampling [66] or crowdsourcing guidelines [12, 14], is that they often do not contain sloppy question formulations that could be asked by real users in the wild.
In the example conversation, q4 is phrased in a very casual way, asking for Isildur's actor in the LOTR movie trilogy (Peter Jackson directed the LOTR movies). With this difficult input, the QA system may give a wrong response. A seemingly natural approach to counter such effects would be to have the QA system automatically reformulate the question into a more complete version [3, 9, 10, 62, 75, 83], such as Which actor played the role of Isildur in the Lord of the Rings movie trilogy directed by Peter Jackson? This kind of run-time question rewriting to a complete natural language form in a deployed system may sometimes work, but adds inference-time overhead and may not improve performance [30].

Approach. We take a different route: instead of reformulating a conversational question at inference time, we strengthen the training of the ConvQA model by exposing it upfront to a larger variety of intent-preserving surface forms for the same training sample. Examples of such syntactic variations representing the same question semantics are in Fig. 1, for q1–q3 (original questions in orange boxes, perturbed zones in reformulations in blue). With this more diverse training data, the ConvQA model learns to cope better with different syntactic formulations.
Our reformulations are created from first principles. We propose a taxonomy of reformulation categories for ConvQA that systematically manipulates parts of a given conversational question based on string edit operations. For each category, we generate noisy supervision data to fine-tune an LLM, which then serves as our reformulation generator (RG, shaded gray boxes). New lexico-syntactic forms in reformulations originate via the use of a rich set of aliases in KGs, and world knowledge in LLMs.
Given that our generated instances are noisy, it is unlikely that for a given question, all categories of reformulations would improve the ConvQA model's performance. As a result, for each question, we would like to judiciously select a few of these that are most beneficial. So we pass generated reformulations to the QA model we wish to improve, and obtain ranked answer lists as responses, shown in boxes with green (correct) and red (incorrect) answers in the right half of Fig. 1. The model's answer performance metrics (or proxies) are used as rewards (shaded yellow boxes) to train a Reformulation Category Selector (RCS) with Deep Q-Networks [49], a form of RL that approximates value functions. The trained RCS is then used as a means for model-specific data augmentation: it selects only the top-k reformulations that are used as additional training data for the QA model for maximum performance improvement. Instances of such question-specific categories are in Fig. 1 (left half).
Evaluation. To assess the benefits of Reign, we perform experiments against two state-of-the-art baselines: Conqer [35], based on reinforcement learning, and Explaignn [15], based on GNN inference. Note that Reign operates by model-aware training on top of these baselines. For test data, we leverage the generative ability of ChatGPT (GPT-3.5) as a proxy to obtain human-like reformulations at scale: each original question is augmented with 20 distinct reformulations. As GPT models are proprietary and have high computational and environmental cost, we restrict our use of LLMs to the task of generating test data. High-end LLMs are not an analogous baseline to compare with.

Contributions. This work calls for more robust training and evaluation of ConvQA models, salient contributions being:
• A novel taxonomy of question reformulations for ConvQA over KGs, based on string edit distance;
• A reinforcement learning model with Deep Q-Networks, that selects helpful reformulations of conversational questions guided towards better QA performance;
• About 335k conversational question reformulations of test cases in two ConvQA benchmarks, suitable for rigorous evaluation of future models;
• The Reign framework with reusable components that judiciously augments benchmark training tailored to specific ConvQA models.
All code is at https://reign.mpi-inf.mpg.de.

CONCEPTS AND NOTATION
Salient notation is in Table 1 (some concepts are introduced in Sec. 3).

Knowledge graph. A knowledge graph (KG) consists of a set of real-world objective facts. Examples of large curated KGs (equivalently, knowledge bases or KBs) include Wikidata [78], DBpedia [4], YAGO [69], or industrial ones (e.g., the Google KG).

Fact. A KG fact is an SPO (subject, predicate, object) triple, where a subject is an entity (Lord of the Rings); an object is another entity (Maxim Baldry), a type (TV series), or a literal (01 September 2022); and a predicate is a relationship (cast member) between the subject and the object. Compound facts involving more than two entities or literals are stored as a main triple and additional ⟨predicate, object⟩ pairs (referred to as "qualifiers" in Wikidata [78]).

Answer. An answer a is a response to the information need in question q (Harry Sinclair is the answer a4 to q4). In this work, an answer can be a KG entity, a type, or a literal. An answer here can be a ConvQA model's response or a gold answer from a benchmark.

Reformulation. A question reformulation is obtained by transforming a question into a different surface form with the same intent. A reformulation is generated using an ⟨operation, operand⟩ pair and the original question. Here, operations could be {insertion, deletion, substitution}, while operands could be {entities, predicates, question entity types, expected answer types}. An example transformation is adding an answer type to question q2: TROP airing on?, to produce the reformulation rf2: Network TROP airing on?.

Mention. A mention refers to a sequence of tokens in q that is the surface form of a KG item (entity, predicate, or type). A mention of a predicate is referred to as a relation. For example, in rf2: Network TROP airing on?, "Network", "TROP", and "airing on" are mentions of the KG answer type video streaming service, the KG entity The Rings of Power (TROP), and the KG predicate original broadcaster, respectively.
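These notions can be made concrete with a small data structure. The sketch below is our own illustration (class and field names are not from the paper), showing a compound fact stored as a main SPO triple plus qualifier pairs in the Wikidata style; the qualifier shown is a hypothetical example:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Fact:
    """A KG fact: a main SPO triple plus optional ⟨predicate, object⟩ qualifiers."""
    subject: str
    predicate: str
    obj: str
    qualifiers: List[Tuple[str, str]] = field(default_factory=list)

# Main triple for the running example, with a hypothetical qualifier pair.
f = Fact("The Rings of Power", "cast member", "Maxim Baldry",
         qualifiers=[("character role", "Isildur")])
print(f.subject, "--", f.predicate, "->", f.obj)
```

A plain triple is the special case with an empty qualifier list.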

THE REIGN FRAMEWORK
An overview of the workflow in the proposed Reign architecture is depicted in Fig. 2. The pipeline consists of three trainable models, where the first two are our contributions. Reinforcement learning (RL) is used to train the RCS model (Deep Q-Networks [49] in this work), with the goal of learning to select the most suitable transformation categories given a specific question, using existing QA performance metrics or suitable alternatives as reward signals. The categories come from our novel reformulation taxonomy. The RG model is trained with (distantly) supervised learning (SL), using an LLM (BART in our case [39]) fine-tuned with questions paired with a specific category and the resulting reformulation, in the form ⟨(question, category); reformulation⟩. This is distant supervision in the sense that the reformulations used for fine-tuning are generated in a noisy manner using rules following our taxonomy, and are not human reformulations. The ConvQA model used could be trained with SL [15, 33, 67] or RL [35], according to its original training paradigm. In Fig. 2, the original model ConvQA_orig is trained with ⟨question, answer⟩ pairs in a ConvQA benchmark, while the more robust model ConvQA_robust is trained on additional QA pairs where the reformulations for a specific question are also paired with the original gold answer. We now describe each component.

REFORMULATION CATEGORY SELECTOR

Reformulation taxonomy
Categories. We propose a taxonomy of reformulations, a topic that has mostly been treated as monolithic in past work [9, 27, 35]. To begin with, observe that a reformulation of a conversational question is a modification of its basic parts. Thus, a systematic generation of reformulations involves an understanding of these parts and meaningful modifications. For (Conv)QA over KGs, these basic question components comprise mentions of one or more entities, their types, predicates, and expected answer types. In analogy with string edit operations, our modifications include insertion, deletion, and substitution. Transposition could be another basic operation, but we do not consider it in this work, as reordering question phrases has little effect on several retrieval models. Viewing these three operations as acting on the four parts of a question as operands, we obtain a taxonomy as shown in Fig. 3, where reformulation categories are leaf nodes (marked orange). Examples are "INSERT entity-type", "SUBSTITUTE relation", and "DELETE relation". Note that we require our reformulations to be intent-preserving: this imposes constraints on what we can insert or substitute in the original question. We cannot, for example, replace an entity or relation by a different one, as that would disturb the semantics of the conversation as a whole.

Phenomena. As shown with dashed boxes in Fig. 3, our taxonomy subsumes several classes of conversational phenomena:
• Insertions complete the question to a more intent-explicit form;
• Deletions cause ellipses in context;
• Substitutions create paraphrases;
• Substituting entity mentions specifically leads to coreferencing.
The last operation can be sub-divided into three categories, as per the case of substitution with a pronoun ("TROP" ↦ "it"), with its type ("TROP" ↦ "the series"), or with an alias ("TROP" ↦ "Rings of Power"). A special case in our taxonomy is the operation "RETAIN whole question", where the question is left as such: it can be considered a degenerate reformulation. In total, we have 15 reformulation categories, corresponding to these leaf nodes.
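The count of 15 categories can be verified by enumerating the taxonomy. The sketch below is our own reading of the structure described above (the exact labels in Fig. 3 may differ): three operations over four operands, with "SUBSTITUTE entity" split into three coreference-style sub-cases, plus the degenerate RETAIN category:

```python
from itertools import product

OPERATIONS = ["INSERT", "DELETE", "SUBSTITUTE"]
OPERANDS = ["entity", "relation", "entity-type", "answer-type"]

# 3 operations x 4 operands = 12 base categories.
categories = [f"{op} {arg}" for op, arg in product(OPERATIONS, OPERANDS)]

# "SUBSTITUTE entity" is sub-divided into three coreference-style cases.
categories.remove("SUBSTITUTE entity")
categories += ["SUBSTITUTE entity with pronoun",
               "SUBSTITUTE entity with type",
               "SUBSTITUTE entity with alias"]

# The degenerate reformulation: leave the question unchanged.
categories.append("RETAIN whole question")

assert len(categories) == 15
```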

Training the RCS model
Overall idea. Given an input question and the taxonomy, we would like the Reformulation Category Selector (RCS) model to suggest some categories such that training with reformulations belonging to these categories would lead to better QA performance metrics. This, in turn, means that we would like the RCS to estimate values that correspond as closely as possible to such metrics. This motivates us to use Deep Q-Networks (DQN) [49], a reinforcement learning approach that directly learns a value function approximator using QA metrics as rewards. The estimate of the value of a (state, action) pair is in turn used to infer the policy for predicting or sampling actions. This is in contrast to the relatively more popular choice of learning policy gradients that directly model action probabilities given an input state (for example, the REINFORCE algorithm [79]).
Concretely, we employ DQNs to train an agent (the RCS model) to select actions a ∈ A (reformulation categories) given a current state s ∈ S (the input question q). The agent interacts with an environment (the reformulation generator and the ConvQA model, described later). This environment provides the next state s' ∈ S (the generated reformulation rf) and a reward R (a QA performance metric) to the agent. This is a Markov Decision Process (MDP) comprising states, actions, the transition function, and rewards (S, A, T, R), where the individual parts are defined next. Algorithm 1 shows the precise steps of applying Deep Q-learning in the RCS model.

States. A state s ∈ S is defined by a conversational question, represented by its encoding with function Φ: s = Φ(q) (lines 3, 7 in Algorithm 1; BERT [19] in our experiments).

Actions. The set of actions A corresponds to the 15 reformulation categories from our taxonomy. Note that not every action (category) is available at every state. For instance, when a question does not have any mention of an entity type, it is not meaningful to apply the actions of deletion or substitution of an entity type. Therefore, we use action masking to allow only valid actions to be chosen given the current state: a masking vector M(s, A), with ones at indices corresponding to valid actions and zeros elsewhere, is element-wise multiplied with the vector containing learnt values of actions at a given state.

Transitions. The transition function T is deterministic and updates the state s ∈ S by applying one of the actions a ∈ A, resulting in a new state s' ∈ S that corresponds to the encoding of the reformulation rf. The resulting reformulation is obtained by invoking the RG model (line 6 in Algorithm 1).

Rewards. The reward R models the quality of the chosen action, and guides the agent towards its goal, which here is improved answering performance. When a selected category leads to a reformulation on which the ConvQA model obtains better performance than on the original question, the agent should get a high reward, and vice versa (line 8 calls the ConvQA model for this reward). Thus, an obvious choice here is to use any desired QA performance metric as the reward. We use the reciprocal rank (RR) [76] metric in this work, i.e., the reciprocal of the first rank at which a gold answer is found. We use it for the following reasons: (i) we have binary relevance of response entities, either correct (1) or incorrect (0); (ii) we deal with factoid QA, and there are usually only a few correct answers (typically between one and three for the benchmarks used). Formally, this reward based on the reciprocal rank difference is computed as:

R = RR(⟨A_rf⟩) − RR(⟨A_q⟩)

where ⟨A_q⟩ and ⟨A_rf⟩ are the ranked lists of responses by the ConvQA model to q and rf, respectively. Since reciprocal ranks lie in [0, +1], the range of this reward is [−1, +1]. Since this nicely corresponds to a symmetric positive reward and punishment in respective cases of success and failure, we do not perform any further reward normalization. Note, however, that our framework can be used with any metric of choice: the use of RL removes the dependency on the metric being differentiable.

Algorithm. As motivated before, we use Deep Q-Networks (DQN) as our RL algorithm, a model-free, value-based method that learns to predict so-called Q-values Q^π(s, a) for each state-action pair, quantifying the usefulness of taking action a in state s under a policy π [49]. The policy is a function mapping states to actions based on the Q-values. The main update step in Q-learning is:

Q(s, a) ← Q(s, a) + α [R + γ max_{a'} Q(s', a') − Q(s, a)]

where α is the step size and γ is the discount factor that determines how much influence the next state's estimate has on the current state. The bracketed term [·] is called the TD target. Q(s, a) is randomly initialized, except for terminal states, where it is zero. In practice, the parameters of the DQN are updated batch-wise. The updated parameters θ are obtained by minimizing the Mean-Squared Error between the TD target and the current Q-values of each state-action pair in the batch (Algorithm 1, line 19). The objective is to maximize the expected reward, whose optimum satisfies:

Q*(s, a) = E[R + γ max_{a'} Q*(s', a') | s, a]

where Q*(s, a) is the optimal Q-value for a specific ⟨s, a⟩ pair. Since our state space is large, we cannot directly learn tabular entries for each ⟨s, a⟩ pair, as was typical in more traditional RL setups. Instead, Q-values are predicted via a neural Q-network with trainable parameters θ (a two-layer feed-forward network in our case):

Q_θ(s, ⟨a⟩) = M(s, A) ⊙ (W_2 · ReLU(W_1 · s))

where Q_θ(s, ⟨a⟩) is a function returning a vector of size |A| that stores the obtained values for every action a ∈ A given some s ∈ S; W_1 ∈ R^{h×d} and W_2 ∈ R^{|A|×h} are weight matrices; d is the size of the input encoding vector; M is the action mask; and the hidden size h is a tunable hyperparameter. ReLU is the non-linear activation function.
During training, the agent needs to explore different actions in each state via sampling. In this work, we sample from a Boltzmann distribution to enable such exploration (Algorithm 1, line 5). A Boltzmann distribution is parameterized by a temperature τ that we can use to conveniently control the degree of exploration:

π(a | s) = exp(Q(s, a)/τ) / Σ_{a' ∈ A} exp(Q(s, a')/τ)

A τ-value close to zero means taking the best action (with the highest reward at this point) greedily more often, whereas larger values (τ is unbounded) make the actual Q-values less relevant and result in a random policy.
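Boltzmann exploration can be sketched as below; a toy illustration of the temperature's effect on the action distribution, not the paper's implementation:

```python
import numpy as np

def boltzmann_probs(q_values, tau):
    """Softmax over Q-values with temperature tau (Boltzmann distribution)."""
    logits = np.asarray(q_values, dtype=float) / tau
    logits -= logits.max()           # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def boltzmann_sample(q_values, tau, rng):
    """Sample one action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, tau)
    return int(rng.choice(len(probs), p=probs))

q = [1.0, 2.0, 0.5]
sharp = boltzmann_probs(q, tau=0.05)    # near-greedy: best action dominates
flat = boltzmann_probs(q, tau=100.0)    # near-random: almost uniform
rng = np.random.default_rng(7)
action = boltzmann_sample(q, tau=0.3, rng=rng)
```

With τ = 0.05 the best action gets almost all probability mass; with τ = 100 the three actions are nearly equiprobable.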

Applying the RCS model
We train the RCS on the development set of a ConvQA benchmark, and apply it on the questions in the training set. At RCS inference time, the agent follows a greedy policy π with respect to the Q-values, and typically chooses an action a* in a state s as:

a* = argmax_{a ∈ A} Q_θ(s, a)

In our case, the RCS predicts the top-k reformulation categories for each training question, which the RG picks up to actually generate the new question variants.

REFORMULATION GENERATION

Training the RG model

The RG model is an LLM (BART [39] in our case). BART is especially effective when information is both copied from the input and perturbed with noise to generate the output autoregressively [39]: this is exactly the setup in this work. The concatenation of the conversation history ⟨q_1 a_1 . . . q_{i−1} a_{i−1}⟩, the current question q_i, and a special reformulation category tag ("rc1", "rc2", ... "rc15") constitutes the input, and the category-specific reformulation is the output.
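The RG input can be sketched as a simple string concatenation. This is our own illustration, assuming whitespace-joined turns and a trailing category tag (the exact separators used by the authors may differ):

```python
def rg_input(history, question, category_id):
    """Concatenate conversation history, current question, and a category tag
    ('rc1'..'rc15') into a single input string for the RG model (sketch)."""
    turns = " ".join(f"{q} {a}" for q, a in history)
    return f"{turns} {question} rc{category_id}".strip()

inp = rg_input(
    [("What's the 2022 LOTR TV series called?", "The Rings of Power")],
    "TROP airing on?",
    2,
)
print(inp)
```

The fine-tuned model is then expected to decode the category-specific reformulation from such an input.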
Noisy data for fine-tuning. We generate fine-tuning data for the BART model through distant supervision. This is a noisy process, but the alternative of strong supervision would entail the use of human-generated reformulations. These would be expensive to obtain at scale (benchmarks like ConvRef [35] contain a relatively small number of unique reformulations, and lack category labels). Further, given a conversational question, an average crowdworker is not likely to be able to come up with several diverse and distinct reformulations for each category. Specifically, we adopt the following strategy. Given a question, we run the recently proposed Clocq API [13] to obtain mentions and disambiguations of entities and predicates. Types are also essentially entities and hence detected in this step: one can detect whether an entity is a KG type by searching the KG for a fact where this entity is an object and the predicate indicates a type relationship (instance of in Wikidata).
We look up KG types of disambiguated question and answer entities to decide if the detected type mentions correspond to question entity types or expected answer types. Once we have mentions and their disambiguations, we can apply our transformations from the taxonomy (Fig. 3) on input questions. Deletion is straightforward: the mention is simply removed from the question token sequence. For substitution, the main decision to make is the source of alternative surface forms for the linked KG items. We use aliases in curated KGs as synonyms of entities and predicates, a rich and precise yet relatively under-explored source. Substitution happens in-place: the source mention is replaced by the target mention from the KG alias list in its corresponding position in the question. Each unique alias results in a unique transformation possibility. Pronoun replacements for human entities are performed by looking up their gender in the KG. For insertions, the main concern is the position of insertion in the question: (i) mentions of answer types and relations are inserted just after the wh-question word; (ii) mentions of entity types are inserted just before the respective entity; and (iii) entity mentions are inserted at the end of the question (details are in our source code repository).
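The insertion-position heuristics above can be sketched on a token list. This is a simplified illustration of the rules as we read them (the actual implementation handles tokenization and punctuation more carefully):

```python
def insert_mention(question_tokens, mention_tokens, kind, entity_index=None):
    """Rule-based insertion positions: answer types and relations go just after
    the wh-question word; entity types just before the respective entity;
    entity mentions at the end of the question (before the final '?')."""
    toks = list(question_tokens)
    if kind in ("answer-type", "relation"):
        toks[1:1] = mention_tokens                       # after the wh-word
    elif kind == "entity-type":
        toks[entity_index:entity_index] = mention_tokens  # before the entity
    elif kind == "entity":
        toks[-1:-1] = mention_tokens                     # at the question's end
    return toks

q = ["Which", "actor", "plays", "Isildur", "?"]
out = insert_mention(q, ["cast", "member"], "relation")
print(" ".join(out))
```

Deletion would remove the mention's tokens, and substitution would splice an alias into the same positions in-place.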

Applying the RG model
The BART model is fine-tuned on distantly supervised data generated with the ConvQA dev set. It is then applied on the train set, where a question and a category from the RCS are already available.
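The categories handed to the RG come from the RCS's greedy inference described earlier. A minimal sketch of top-k selection over masked Q-values, with illustrative numbers (not from the paper):

```python
import numpy as np

def top_k_categories(q_values, mask, k=5):
    """Greedy RCS inference: pick the k valid categories with the highest
    Q-values; invalid (masked-out) categories are never selected."""
    masked = np.where(np.asarray(mask) == 1, q_values, -np.inf)
    order = np.argsort(masked)[::-1]        # indices sorted by descending Q-value
    return [int(a) for a in order[:k] if np.isfinite(masked[a])]

q = np.array([0.1, 0.9, 0.4, 0.8, 0.2, 0.7])
mask = np.array([1, 1, 0, 1, 1, 1])         # category 2 invalid for this question
picked = top_k_categories(q, mask, k=3)
```

With k = 5, as in the experiments, each training question yields up to five reformulation categories for the RG.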

CONVERSATIONAL QUESTION ANSWERING

Training the ConvQA model
A ConvQA model is trained on sequences of ⟨question, answer⟩ pairs. In the original training mode, QA pairs are directly used from the benchmark train sets. This original or initial QA model is used to collect rewards for the RCS in one pass over the dev set (as mentioned earlier, the QA dev set is used to train the RCS model). After the trained RCS and RG models generate the reformulations for each training question, these reformulations are paired with the corresponding gold answer of the original training question. These new ⟨reformulation, gold answer⟩ pairs are added to the benchmark, and the ConvQA model is trained again on this augmented resource. This model is expected to be more robust than the original model (the original and robust models are marked ConvQA_orig and ConvQA_robust in Fig. 2, respectively).
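The augmentation step can be sketched as below; function and variable names are our own illustration:

```python
def augment_training_set(benchmark, reformulations):
    """Pair each selected reformulation with the gold answer of its original
    question, and append the new QA pairs to the benchmark (sketch)."""
    augmented = list(benchmark)                   # original ⟨question, answer⟩ pairs
    for question, gold_answer in benchmark:
        for rf in reformulations.get(question, []):
            augmented.append((rf, gold_answer))   # ⟨reformulation, gold answer⟩
    return augmented

train = [("TROP airing on?", "Amazon Prime Video")]
refs = {"TROP airing on?": ["Network TROP airing on?", "Where's TROP shown?"]}
augmented = augment_training_set(train, refs)
```

The robust model is then retrained on `augmented` with the model's usual training procedure.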

Applying the ConvQA model
The trained ConvQA model is directly applied to the questions in test sets at answering time to produce ranked lists of entities.

EXPERIMENTAL SETUP
Benchmarks. As shown in Table 2, we use two ConvQA benchmarks: ConvMix [14] (more recent) and ConvQuestions [12] (more popular). These contain realistic questions from crowdworkers. We obtained 20 reformulations from ChatGPT (gpt-3.5-turbo model) for each test question in these benchmarks. Examples are in Table 3. The complete list of GPT-generated reformulations is available through our website at https://reign.mpi-inf.mpg.de. We set the temperature value to 0 for greedy decoding, and tried a few alternatives, like with and without examples of reformulations. We saw that examples did not have a noticeable effect on the generations, and so we used the following zero-shot prompt (the 'History' line is omitted for generating the variants without history):

Reformulate the 'Question' 10 times in a short, informal way. Assume third person singular if not obvious from the question.
'History': {Conversation history}
'Question': {Question}
'Reformulation':

The second sentence was used to avoid generations like Your place of birth? instead of the correct His ...? or Her ...? There are no duplicates in any of the ChatGPT reformulations. Conversations in ConvQuestions are generated by permuting questions from a seed set of 700 conversations: we used only the train set of this seed (420 conversations) for training ConvQA models, to decouple the effect of data augmentation inherent in the benchmark.

Table 2: Benchmark sizes as #questions (#conversations). Reformulations are also counted as individual questions to be answered. Questions for the GPT-Test sets subsume the original test questions.

Baselines. ConvQA models belong to two families, one based on history modeling, and the other on question completion (Sec. 9).
We choose one open-source system from each family for KG-QA: Conqer [35] (history modeling with context entities, with RL) and the very recent Explaignn [15] (completion to an intent-explicit structured representation, with GNNs). Explaignn was built for heterogeneous sources, and we use the KG-only model, in line with our setting. Default configurations were used for both systems.

Metrics. All methods produce ranked lists of entities with binary relevance. We thus used three appropriate KG-QA metrics [64]: Precision@1 (P@1), Mean Reciprocal Rank (MRR), and whether a correct answer is in the top-5 (Hit@5). We define a new metric Robust, which computes, for each question, the number of reformulations (out of 20 here) correctly answerable by a ConvQA model, averaged over the number of test intents. The higher this value, the more robust the model. Statistical significance (*) is determined via McNemar's test for binary variables (P@1 and Hit@5), and a 2-tailed paired t-test otherwise (MRR, Robust), with p < 0.05.

Initializing Reign. We use Wikidata as our KG: all models use the dump from 31-01-2022. We use BART (bit.ly/3N9WPVj, for RG) and BERT (bit.ly/3NkKRsd, for state encoding in RCS) implementations from Hugging Face. As history input to BART, we used only the first and previous turns of the conversation [59, 75]. Hyperparameters for the Deep Q-Network in the RCS were tuned on the ConvMix dev set: d = 768; hidden size h = 128; Boltzmann temperature τ = 0.3; discount factor γ = 1.0 (no decay for future rewards); step size α = 10^−5; batch size = 10; epochs = 5. The RG was trained for 3 epochs, and 2 examples from each reformulation category were used for fine-tuning BART. Both RCS and RG models are only trained on ConvMix and applied zero-shot on ConvQuestions. Five reformulation categories were selected by the RCS for every question (k = 5). The Clocq API [13] was used for RG annotations.

Implementation details. A single GPU (NVIDIA Quadro RTX 8000, 48 GB GDDR6) was used to train and evaluate all models. The TensorFlow Agents library is used for the RL components.
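The evaluation metrics, including the new Robust measure, can be sketched as follows; this is our reading of the definitions, not the authors' evaluation code:

```python
def precision_at_1(ranked, gold):
    """1.0 if the top-ranked answer is a gold answer, else 0.0."""
    return 1.0 if ranked and ranked[0] in gold else 0.0

def mrr(ranked, gold):
    """Reciprocal rank of the first gold answer in the list (0 if absent)."""
    for rank, ans in enumerate(ranked, start=1):
        if ans in gold:
            return 1.0 / rank
    return 0.0

def robust(per_intent_correct_counts):
    """Robust: number of correctly answered reformulations per intent,
    averaged over all test intents."""
    return sum(per_intent_correct_counts) / len(per_intent_correct_counts)

assert robust([21, 0, 9]) == 10.0
```

Hit@5 would analogously check whether any of the top-5 ranked answers is a gold answer.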

RESULTS AND INSIGHTS

Key findings
Reign results in robust training. The four methods Conqer, Conqer + Reign (Conqer coupled with Reign), Explaignn, and Explaignn + Reign (Explaignn coupled with Reign) are evaluated on the two benchmarks ConvMix and ConvQuestions. Results on test sets are in Table 4. A clear observation is that methods interfaced with Reign systematically outperform the original ConvQA models, on all test sets and metrics. While numbers are reported on the original test sets for completeness, results become much more significant on the GPT-test sets, with p-values of the order of 10^−80 (recall that these values are averaged over ≃100k-200k cases, Table 2). Importantly, versions with Reign score systematically higher on the robustness metric (Sec. 7), showing that the improved models are capable of handling more lexical and syntactic variations on average (differences are higher for the larger GPT-sets). ConvQA with these benchmarks and GPT reformulations is challenging: the observed values are far less than 21 (the Robust measure here lies between 0 and the number of reformulations per question including the original formulation, i.e., 21). We also computed the number of unique intents that newly become answerable (P@1 = 1 for at least one question or one of its reformulations) with Reign: this is 115 (ConvMix-GPT-set) and 407 (ConvQuestions-GPT-set) for Conqer, showing that our robust training can put more unique information needs within reach of the ConvQA model. Representative reformulations generated by Reign and GPT are in Table ?? (more in the supplementary material). On average, original questions, Reign reformulations, and GPT reformulations are 5.9, 7.5, and 7.2 words long.

Reign components are generalizable. Results on the ConvQuestions benchmark showcase successful zero-shot application of Reign modules. Given that the ConvQuestions test sets are much larger than ConvMix (see Table 2), improved results over the original QA modules show that our RCS and RG modules, individually, are immune to idiosyncrasies of specific datasets.

Benefits of Reign hold over domains and turns. We report drill-down results over five domains and individual conversation turns in Tables 5 and 6. We show that the benefits provided by reinforced reformulation generation are not limited to specific domains, or to shallower conversation turns only.

In-depth analysis
In Table 7, we report in-depth analyses of the moving parts in Reign, using Conqer on the ConvMix-GPT-set. Trends with Explaignn and ConvQuestions are similar. We do not use this table for making design choices; rather, we expose large-scale effects of sub-optimal configurations, hence the choice of a ≃100k GPT-set instead of the typically small dev set.

RCS with DQN is vital. First and foremost, we show that selecting reformulations with our DQN is necessary, and simply taking all noisy reformulations does not serve as a sledgehammer for performance improvement, even at three times the number of data points used (Row 1 vs. Row 4). This makes a solid case for judicious augmentation. Using all reformulations does lead to higher answer recall, as seen through the Hit@5 value, but at the cost of precise ranking. Using the top-5 reformulations is a sweet spot for deploying the RCS (Row 1 vs. Rows 2 and 3). Using higher numbers drastically increases the training time and often produces degenerate reformulations. Contrast against a random choice of categories inside the RCS is a natural experiment, and we find this to be sub-optimal (Row 5). Another, stronger baseline is to sample k = 5 categories according to the Q-value distribution: this again falls short of a top-k prediction (Row 6).
The whole taxonomy matters. It may appear that using only insertion or substitution operations from the taxonomy would suffice for robust learning, but we find that considering all categories jointly (Row 1) is superior to using only individual "meta"-categories (INS, DEL, SUBS in Rows 7−9). While using only deletion operations hurts performance the most (Row 8), it is clear that carefully removing parts of questions also contributes to a stronger model (for example, deleting an entity was considered to improve MRR 10% of the time on ConvQuestions, presumably by removing noise). Fig. 4 shows the union of the top-5 frequent predictions from our RCS DQN for the two benchmarks. Insertion of question entity types and expected answer types is generally useful for disambiguation, and substituting relations with aliases naturally makes the system more robust to predicate paraphrasing. The original question was retained 10−20% of the time.
BART contributes to robust QA models. We found that even with very noisy training data from distant supervision, BART-generated reformulations perform better than the original rule-based generations (Row 1 vs. Row 10). These rules (details in the supplementary PDF) were used to generate the fine-tuning data for BART, but could also be applied to questions directly to generate reformulations as per our taxonomy. The noise in the BART reformulations thus contributes to more robust ConvQA models. Table 8 shows representative examples of BART-generated reformulations along with the expected reformulation categories. Question rewriting is not enough. As discussed in Sec. 1, reformulating a conversational question into a more complete form at answering time is a prevalent approach in ConvQA. As such, comparison with such rewriting or completion approaches is out of scope, as we focus on more robust training. Nevertheless, we explore the natural possibility of using completed forms of questions during training, as opposed to a set of noisy reformulations. The ConvMix benchmark [14] contains intent-explicit questions written by the original crowdworkers who generated the conversations, and these can thus be treated as gold-standard completions. We found that this falls short of our proposed version (Row 1 vs. Row 11), as does model-generated question rewriting using T5 [42] (Row 1 vs.
Row 12). Interestingly, corroborating the findings with BART, noisy rewrites with T5 outperform human completions. Note that completion or rewriting entails one longer version of each question (hence ≃15k data points): we find that generating a small set of potentially incomplete variants improves performance more. Intrinsic rewards also work well. Our DQN uses differences in reciprocal ranks, computed from gold answers in benchmarks, as extrinsic rewards. A natural question is what happens when such relevance assessments are not available. We thus explored an alternative intrinsic reward [10,45], computed as the difference between the ConvQA model's probabilities for its top-1 answer on the reformulation and on the original question. This resulted in comparable performance on the ConvMix dev set (0.270 P@1 for extrinsic vs. 0.269 for intrinsic; 0.311 MRR for both).
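The two reward variants can be sketched as follows. This is a minimal illustration under our own function names; the actual reward computation in Reign may include further details:

```python
def extrinsic_reward(orig_ranking, ref_ranking, gold):
    # Difference in reciprocal rank of the gold answer between the
    # reformulation's and the original question's answer lists.
    def rr(ranking):
        return 1.0 / (ranking.index(gold) + 1) if gold in ranking else 0.0
    return rr(ref_ranking) - rr(orig_ranking)

def intrinsic_reward(p_top1_ref, p_top1_orig):
    # Difference in the ConvQA model's confidence in its top-1 answer;
    # usable when gold answers are unavailable.
    return p_top1_ref - p_top1_orig

# Toy example: the reformulation lifts the gold answer from rank 3 to rank 1.
print(extrinsic_reward(["a", "b", "gold"], ["gold", "a", "b"], "gold"))
```

The intrinsic variant trades relevance-grounded feedback for model self-confidence, which explains why it remains competitive only when the model's probabilities are well calibrated.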
Manual error analysis. The authors analyzed 10 reformulations from each category, for both the BART reformulations and the original fine-tuning data (15 × 10 × 2 = 300 in all), to look for potential issues. Only minor problems were detected in both scenarios.
The concerns with BART were as follows: unintelligible intent (4 cases), hallucinations (5), wrong category applied (13), information removed unintentionally (15), transformation possible but not made (13), unsuitable entity or type added (4), and information already in the question added again (5). The concerns with the initial noisy data can sometimes be traced back to incorrect processing of the benchmarks (Sec. 5.1), such as wrong predicates (4 cases).

RELATED WORK
Conversational question answering. ConvQA [11,12,63,64,66] can be viewed as a research direction under the umbrella of conversational search [16,53,84], with natural-language utterances as input. Answers are crisp entities [25,33], sentences [5], or passages [17,58]. Proposed methods belong to two major families: they either (i) derive a self-contained version of the question that can be handled by a standard QA system (referred to as rewriting [10,29,36,62,75,83], resolution [37,77], or even reformulation [50,74] in different works), or (ii) model the history as additional context to answer the current question [24,26,35,57,59,60,70]. Reign is not a QA model by itself, but can improve the performance of any given ConvQA system: we demonstrate this by choosing one method from each family of approaches in our experiments [15,35]. In this work, we enhance conversational QA over KGs [25,31-33,54,67], where answers are small sets of entities. Robustness in QA. Improving the robustness or generalizability of ConvQA models has not seen much dedicated activity: work has mostly been limited to specific benchmarks of choice [12,14,25,66,67]. Implicitly, authors have tried to demonstrate robust behavior through the use of multiple benchmarks [33,38,48], or the zero-shot application of models to new benchmarks [15]. Data augmentation, given one or more benchmarks, is one of the prominent approaches for increasing model robustness in QA [5,6,44,56,65,68,81]. Our work stands out as model-specific data augmentation: a philosophy of effective training that tries to fill "gaps" in a specific model's learned behavior, instead of feeding a very large volume of noisy data to all models. Some recent works in QA over text investigate model robustness by perturbing input passages [24,51], while we tap into question reformulations as a perturbation on the question side.
This work falls into the rephrasing-for-training quadrant, viewing reformulations as rephrased user utterances for the current question in a conversation, and leveraging these for training a more robust model. Early work on the automatic acquisition of query reformulation patterns [47,71,72,80], or on paraphrasing for improving model robustness [1,2,8,20-23], did not account for answers from previous turns, and more generally, did not address the specific difficulty of incomplete and ad hoc user utterances in conversations.

CONCLUSION
This work contributes a method that makes conversational question answering models more robust via generated reformulations that are specifically guided towards better QA performance. The proposed framework judiciously picks the most suitable choices for enhanced training, as opposed to brute-force data augmentation with all possible reformulations. Experiments with two state-of-the-art ConvQA methods demonstrate the benefits of the Reign method.

Figure 1: Performance-guided reformulation generation in Reign, illustrated through our running example conversation.

Figure 2: Workflow of Reign: RCS is trained by reinforcement learning, and RG by supervised learning.

Figure 4: Common category predictions by the RCS DQN.

Table 1: Notation for concepts in Reign.

The main triple ⟨The Rings of Power, cast member, Maxim Baldry⟩ has a qualifier ⟨character role, Isildur⟩.

Conversation. A conversation C consists of a sequence of ⟨q_t, a_t⟩ turns around a topic of interest. An example is in Sec. 1.

Notation — Concept
C; t ∈ {1, 2, ...} — Conversation; conversational turn
q = ⟨q_1 ... q_|q|⟩; a — Question and its tokens; Answer
q_t, a_t — Question and answer at turn t
Φ(⟨q_t⟩) — Function to map ⟨q_t⟩ to the state space
m(s, A) — Action masking vector
Q(s, a) — Q-value (expected reward) for action a in state s
Q*(s, a) — Optimal Q-value
π — RCS policy
α — Step size in Q-learning
γ — Discount factor in Q-learning
W_1, W_2 — Weight matrices in the RCS Deep Q-Network
h — Hidden vector size
d — Dimensionality of the input encoding vector
P(•) — Probability
τ — Boltzmann temperature for action sampling

• A reformulation category selector (RCS) model, which takes a question q_t as input and produces as output a reformulation category for transforming q_t;
• A reformulation generator (RG) model, which takes some q_t and a reformulation category as input, and produces as output a reformulation of q_t according to that category;
• An external ConvQA model, which takes some q_t as input and produces a ranked list of answers as output.
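Putting the pieces of this notation together, masked Boltzmann action sampling over Q-values could be sketched as follows, with m(s, A) excluding invalid reformulation categories and τ as the temperature. Variable names and shapes here are our assumptions, not the paper's implementation:

```python
import numpy as np

def masked_boltzmann(q_values, mask, tau=1.0, rng=None):
    # Mask out invalid actions before the softmax: masked entries get -inf,
    # so exp(.) assigns them zero probability.
    rng = rng or np.random.default_rng()
    valid = mask.astype(bool)
    logits = np.where(valid, q_values / tau, -np.inf)
    p = np.exp(logits - logits[valid].max())  # subtract max for stability
    p = p / p.sum()
    return rng.choice(len(q_values), p=p)
```

Lowering τ makes the sampling greedier (closer to an arg-max over the valid Q-values), while raising it flattens the distribution toward uniform exploration.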

Table 3: Examples of GPT reformulations for test sets.

[Books] History: How many Pulitzer Prizes has John Updike won? 2. Question: Which was the first book to win him the award? Ref 1: What book earned John Updike his first Pulitzer Prize? Ref 2: What was the author's first book to win a Pulitzer? Ref 3: Title of John Updike's first Pulitzer Prize-winning book?

[Music] History: Which singer sang the number Single Ladies? Beyonce. What is the year of its release? 2008. Who is her spouse? Jay-Z. What is his date of birth? 4 December 1969. Question: Was Kanye West a composer of the song? Ref 1: Did Kanye West contribute to the lyrics of the song? Ref 2: Did Kanye West perform the song with Beyonce? Ref 3: Was Kanye West featured in the song?

[TV series] History: What is the release year of the TV series See? 2019.

[Soccer] History: Pele scored how many goals in international play? 77. Has he scored the most goals? No. Question: Did Messi beat his goal total? Ref 1: Did Messi surpass Pele's international goal record? Ref 2: Has Messi scored more international goals than Pele? Ref 3: Did Messi break Pele's goal-scoring record?

Table 4: Main results comparing Reign-enhanced ConvQA models with their standalone versions. GPT-augmented test sets are 20x original sizes. Reign is applied zero-shot on ConvQuestions. The higher value per column per QA model is in bold.

Table 7: Large-scale effects of design choices in Reign (with Conqer on GPT-ConvMix; all differences systematic).
[Books] History: Which book won the 2017 Pulitzer Prize for Fiction? The Underground Railroad. subject of the book? Slavery in the United States. publisher of the novel? Doubleday.

[TV series] History: Who played as Marty in Ozark series? Jason Bateman. and Wendy Byrde? Laura Linney. who is the director of the series? Jason Bateman. How many episodes are in the series? 30. Question: production company of the series? Ref 1: production company of the series television series? [INS ent-type] Ref 2: production company of the series Ozark? [INS ent] Ref 3: production house of the series? [SUBS rel]

[Soccer] History: What is the full name of footballer Neymar? Neymar da Silva Santos Junior. Birthplace of Neymar? Brazil. When was he born? 5 February 1992. Question: Which club does he play now? Ref 1: Which club does he play now association football player? [INS ent-type] Ref 2: Which club does he play now Neymar? [INS ent] Ref 3: Which football team does he play now? [SUBS ans-type]

Table 8: Examples of BART-generated reformulations along with respective reformulation categories, used for training.

GPT cannot replace the Reign pipeline. It is a common trend nowadays to use LLMs like GPT at multiple points in pipelines. We thus checked whether the same ChatGPT model that generated our test set could replace the whole Reign pipeline by directly generating reformulations for training questions. Importantly, we found that this underperforms Reign when five reformulations are considered for each alternative, on the original ConvMix dev set (evaluating GPT reformulations on GPT test sets could result in undesirable biases): 0.270 P@1 for Reign vs. 0.261 for GPT (Conqer), and 0.423 for Reign vs. 0.405 for GPT (Explaignn). Note that the GPT reformulations are model-agnostic: this shows that reformulations generated with model-aware performance feedback are indeed a better choice for robust training.