Zero-shot Clarifying Question Generation for Conversational Search

A long-standing challenge for search and conversational assistants is query intention detection in ambiguous queries. Asking clarifying questions in conversational search has been widely studied and considered an effective solution to resolve query ambiguity. Existing works have explored various approaches for clarifying question ranking and generation. However, due to the lack of real conversational search data, they have to use artificial datasets for training, which limits their generalizability to real-world search scenarios. As a result, the industry has shown reluctance to implement them in reality, further limiting the availability of real conversational search interaction data. The above dilemma can be formulated as a cold start problem of clarifying question generation and conversational search in general. Furthermore, even if we do have large-scale conversational logs, it is not realistic to gather training data that can comprehensively cover all possible queries and topics in open-domain search scenarios. The risk of fitting bias when training a clarifying question retrieval/generation model on an incomprehensive dataset is thus another important challenge. In this work, we explore generating clarifying questions in a zero-shot setting to overcome the cold start problem, and we propose a constrained clarifying question generation system which uses both question templates and query facets to guide effective and precise question generation. The experiment results show that our method outperforms existing state-of-the-art zero-shot baselines by a large margin. Human annotations of our model outputs also indicate that our method generates 25.2% more natural questions, 18.1% more useful questions, 6.1% fewer unnatural questions, and 4% fewer useless questions.


Introduction
One common cause of search failure is ambiguity in queries, which refers to queries with multiple relevant information needs or unclear intent. Ambiguous queries are often the result of users not knowing how to formulate their needs. For example, a user looking for "the Discovery Channel's dinosaur site with pictures and games of dinosaurs" and a user looking for "different kinds of dinosaurs" can both search with the query "dinosaur". Ambiguous queries can also indicate that the user is conducting an exploratory search, such as a learning or investigating search [30]. A popular system feature for handling query ambiguity is search result page diversification [21,41]. However, it can hardly be applied to searches on devices with small screens or devices that by design only have speech functions.
In fact, both scenarios of ambiguous queries incentivize the search system to have multi-turn user-system interaction capabilities, i.e., Conversational Search, which has recently become a growing research frontier in the Information Retrieval (IR) community. Conversational Search addresses query ambiguity through arguably its most characteristic feature, mixed-initiative interaction. This means that not only the user but also the system can proactively lead the conversation by asking clarifying questions about the user's search intent. These clarifying questions chiefly determine the quality of conversational search. Therefore, existing works have extensively explored various approaches to selecting or generating high-quality clarifying questions.
However, there are two challenges that limit the application of conversational search systems in the real world. First, as a new retrieval paradigm, open-domain conversational search does not yet have any mature online service. The cost of collecting large-scale conversational search logs is still prohibitive, so building and evaluating reliable conversational systems is difficult, which further increases the difficulty of collecting conversational logs in practice. We refer to this dilemma as the cold-start problem for clarifying question generation. Second, traditional clarifying question generation methods [44,54] often rely on supervised learning with labeled or artificial conversational logs. It is unrealistic to require such logs to cover all the topics of possible queries, and models trained with incomprehensive datasets could suffer from catastrophic forgetting [32] and inevitably be biased on unseen queries. We refer to this problem as the data bias in conversational search log collection.
Unlike previous studies [13,39,44,47,54], we explore a new task of clarifying question generation in zero-shot scenarios without the use of conversational search logs. The main idea is to learn a clarifying question generation model directly from large-scale text data and search engine traffic without collecting or labeling conversational data from a conversational search system. In this way, we can avoid both the cold start and data bias problems from the very beginning. While there have been many studies on zero-shot text generation, as shown in this paper, applying these methods to clarifying question generation directly often produces unsatisfactory results for two reasons. First, conventional sequence-to-sequence language generation models cannot efficiently learn the needed correlations between the initial queries submitted by users and the clarifying questions generated by systems. Their generations tend to talk about general topics that are not relevant to the specific search need. Second, existing zero-shot language model generations are usually narratives instead of questions. How to guide the zero-shot model to generate text in question forms that are proper for each search query is still unknown.
To solve the above problems, in this paper, we propose to constrain the clarifying question decoding with search facets. Facets refer to possible subtopics of a query (e.g., "pictures", "map", "populations" are possible facets for the query "I am looking for information about South Africa") that can be effectively extracted from search result pages (SERP) [58], knowledge graphs [50], or other sources [39,49] in an unsupervised manner. Constrained language decoding refines the naive beam-search decoding with the ability to rank facet-containing generations higher, resulting in more questions about the facet. We also initialize the decoding with questioning prompts instead of generating the entire question sentence. Multiple question templates are used in this process and eventually ranked for the best generation.
To demonstrate the effectiveness of our zero-shot system, we compare it with multiple existing non-facet and facet-driven baselines, including several state-of-the-art supervised learning methods. These methods are finetuned on a training set that our zero-shot system cannot access. Nonetheless, we show that our system outperforms these baseline methods by a large margin, which implies our system is the best solution for zero-shot clarifying question generation.
During the evaluation phase, we compute automatic metrics [5,24,25,35] and employ humans to provide quality labels for the generations of the compared systems. Our human annotators evaluate the generations on both their naturalness [44] from a language perspective and their usefulness [40,44] from a utility perspective. The automatic metric scores are our primary evaluation, from which we conclude that our system is the best. Human annotation results suggest our method generates 25.2% more natural questions, 18.1% more useful questions, 6.1% fewer unnatural questions, and 4% fewer useless questions, which aligns with the automatic evaluations and reinforces our conclusions.
We consider the following key contributions of our work:
• We are the first to propose a zero-shot clarifying question generation system, which attempts to address the cold-start challenge of asking clarifying questions in conversational search. The zero-shot setting also maximizes the generalizability of our system to serve different search scenarios.
• We are the first to cast clarifying question generation as a constrained language generation task and show the advantage of this configuration. We show that a simple constrained decoding algorithm, even under the zero-shot setting, can guide clarifying question generation better than finetuning the model with limited training data. Our work is a compelling demonstration of how large deep models benefit from properly integrating human knowledge.
• We propose an auxiliary evaluation strategy for clarifying question generations, which removes the information-scarce question templates from both generations and references. Results computed this way expose the limitations of the existing default evaluation strategy and provide insights into the actual quality of generated questions.

Related Works
Conversational Search. Conversational Search refers to the process of information seeking involving natural language conversations with the search system [55]. It has been identified as one of the most important research areas of IR [3,11]. Recently, a plethora of seminars and tutorials have been given about Conversational Search from different standpoints, such as [3,15,16,17,18,55]. Radlinski and Craswell [37] proposed a theoretical framework for Conversational Search and highlighted mixed-initiative as one of its most desired properties. Later, Zamani and Craswell [53]
Resolving Query Ambiguity. The query ambiguity problem is an important motivation to promote conversational search over conventional single-turn search. An ambiguous query generally refers to a query for which the search system cannot confidently identify the user's information need and return search results [21]. Queries can be ambiguous for various reasons, such as containing multiple distinct interpretations or under-specified subtopics [10], anaphoric ambiguity, and syntactic ambiguities [43]. Approaches to clarifying query ambiguity can be roughly divided into three categories: (1) Query Reformulation, such as [12,14,26,52], iteratively refines the query; (2) Query Suggestion, such as [40,45,51], offers related queries to the user; (3) Asking Clarifying Questions, such as [8,38,39,54,57], proactively engages users to provide additional context. While the three approaches share many structural and functional similarities, they cannot replace one another because none of them is the best in all scenarios. For example, asking clarifying questions could be uniquely helpful for clarifying ambiguous queries without context. In contrast, query reformulation is more efficient in context-rich situations. Query suggestion is good for leading search topics, discovering user needs, etc.
Asking Clarifying Questions. Among the approaches to resolving query ambiguity, asking clarifying questions (CQ) is the most studied [21], and is considered more convenient because of its proactivity [37,46,58]. Existing studies about asking CQ can be divided into two main categories: (1) ranking/selecting CQ, such as [1,8,38], and (2) generating CQ, such as [13,39,44,47,54]. Rao and Daumé III [39] applied generative adversarial learning in training a sequence-to-sequence question generation model. Zamani et al. [54] proposed a rule-based template completion model and two neural question generation models to generate CQ given the query and its aspect. Later in [13,47], the authors also demonstrated that templates could guide CQ generation. Their solutions effectively convert the CQ generation problem to a selection task. Similar to using query aspects, Sekulić et al. [44] also proposed a query facet-driven approach. Recently, Zhao et al. [58] showed that such query facets could be extracted from web search results and guide question generation.
Constrained Natural Language Generation. Our work applies constrained natural language generation to generating clarifying questions for conversational search. The task of constrained natural language generation was proposed in [4], where the problem was modeled as beam search over $2^C$ states representing all combinatorial satisfaction states of $C$ constraints. This exponential complexity limited its applications. Hokamp and Liu [19] propose a grid beam search method that groups beams by the number of constraints already satisfied. Miao et al. [33] propose to edit generations using constraints with Metropolis-Hastings sampling. Welleck et al. [48] develop a non-monotonic tree-based generation system which can generate texts given constraints at arbitrary positions. Zhang et al. [56] suggest a tree-enhanced Monte-Carlo approach for text generation via Combinatorial Constraint Satisfaction. More recently, Lu et al. [28,29] propose NeuroLogic Decoding and A* search. Their decoding algorithms incorporate constraints as Conjunctive Normal Forms (CNF) and estimate the viability of each beam to satisfy constraints by sampling their future generations.
Zero-shot Facet-constrained Question Generation
This section gives detailed descriptions of our zero-shot clarifying question generation system, which addresses two challenges in naive models for zero-shot clarifying question generation. Our system is zero-shot, meaning that we do not train our system on any training data for clarifying question generation. The generation is also facet-constrained, which implies that we use the search facet in our question generation. A facet is one possible search direction for the ambiguous query; for example, pictures, map, location, and populations are possible facets for the query "I am looking for information about South Africa". Facets have been considered useful [44,54,58] for clarifying question generation since they provide a relevant direction for inquiring about the user intent. Clarifying question generation can be challenging without facets because the generations are often too general and clueless. In [58], Zhao et al. propose a facet extraction approach, which shows that these useful keywords can be easily extracted from web search results. Previous works also suggest that facets can be extracted from various other sources, including product reviews [39], images [49], or knowledge graphs [50]. The backbone of our system is a checkpoint of the public Generative Pretrained Transformer (GPT-2) [36] pretrained on a separate large-scale text corpus. Originally, the inference objective of GPT-2 is to predict the next token given all previous texts.
One naive method to adapt GPT-2 for clarifying question generation is to append the query $q$ and facet $f$ together as initial texts and let GPT-2 generate a continuation $cq$ as the clarifying question. However, this method faces two challenges. The first challenge is that it does not necessarily cover facets in the generation. Previous work [44] proposes a finetuning approach, which trains on a collection of $f$ [SEP] $q$ [BOS] $cq$ [EOS] paragraphs. However, as reported in their work, this structure does not outperform simply using the query alone as input. We analyze the generations of their model and find that the coverage of facet words in these generations is only about 20%. This number implies that simply appending facet words to the input of GPT-2 is highly inefficient in informing the decoder.
The other challenge is that the generated sentences are not necessarily in the tone of clarifying questions. This is because clarifying questions make up only a small portion of natural language usage. GPT-2 is pre-trained on web texts, most of which are narrative. Even the questions in these texts are not necessarily asked for the purpose of clarifying. As a result, pre-trained GPT-2 often generates relevant factoids following the query and facet.
To make our proposed system easier to explain, we divide it into two parts: (1) facet-constrained question generation and (2) multiform question prompting and ranking. The two parts respectively address the above two challenges of zero-shot GPT-2.

Facet-constrained Question Generation
Regarding the first challenge mentioned above, we find that existing works struggle to generate facet-related clarifying questions. Unlike these works, we believe that simply appending the facet to the input is inefficient. Instead, our model utilizes the facet words as decoding constraints. Specifically, we see the task of generating facet-related questions as a facet-constrained language generation problem, i.e., using the facet words as constraints during generation decoding. To encourage the decoder to choose generations containing more facet words, we employ an algorithm called NeuroLogic Decoding [29].
NeuroLogic Decoding is based on beam-search decoding. In each decoding step $t$, assuming the already generated candidates in the beam are $B = \{b_{1:k}\}$, where $k$ is the beam size, $b_i = x^i_{1:(t-1)}$ is the $i$-th candidate, and $x^i_{1:(t-1)}$ are the tokens generated from decoding step 1 to $(t-1)$, NeuroLogic Decoding works in the following steps:
(1) Generate the next token distributions $x^i_t \sim P(x^i_{1:(t-1)})$ for each candidate in the beam with GPT-2. Assume that the vocabulary size is $|V|$; then this will create $k \times |V|$ new candidates.
(2) Pruning step: Among these candidates, discard all but the candidates that are both in the top-$\alpha$ in terms of $P(x_{1:t})$ and in the top-$\beta$ in terms of the number of facet words contained by $x_{1:t}$.
Table 1: The most common 4-grams in clarifying question answers [22] and their corresponding questions with example generations from GPT-2 for the query "I am looking for information about South Africa."
We refer readers to the original paper [29] for a more vivid demonstration. We now explain why NeuroLogic Decoding could better constrain the decoder to generate facet-related questions by highlighting some key steps. First, the filtering by facet-word count in step (2) is the main reason for promoting facet words in generations. Because of this filtering, NeuroLogic Decoding tends to discard generations with fewer facet words regardless of their generation probability. Therefore, facet-related generations with low probability will more likely stand out against greedy high-probability generations that do not use facet words. Then, the grouping in step (3) is the key for NeuroLogic Decoding to explore as many branches as possible. This is because the grouping method keeps the most cases ($2^{|f|}$, for a set of facet words $f$) of facet word inclusions, allowing the decoder to cover the most possibilities of ordering constraints in generation.
As mentioned in [29], the asymptotic runtime of NeuroLogic Decoding is $O(Nk)$, where $N$ is the text sequence length and $k$ is the beam size. This is the same as normal beam search and faster than most previous constrained language generation algorithms, making it fairly applicable in real cases.
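The pruning step can be illustrated with a minimal sketch. It is not the implementation of [29]: it covers only the pruning in step (2), the names `prune`, `facet_count`, `alpha`, and `beta` are hypothetical, candidates are plain token lists with made-up log-probabilities, and breaking ties in facet count by score is an assumption of this sketch.

```python
from typing import List, Tuple

Candidate = Tuple[List[str], float]  # (tokens so far, log-probability)


def facet_count(tokens: List[str], facet_words: List[str]) -> int:
    """Count how many distinct facet words occur in a candidate."""
    present = {t.lower() for t in tokens}
    return sum(1 for w in facet_words if w.lower() in present)


def prune(candidates: List[Candidate], facet_words: List[str],
          alpha: int, beta: int) -> List[Candidate]:
    """Keep candidates in both the top-alpha by score and the top-beta by facet count."""
    idx = list(range(len(candidates)))
    top_score = set(sorted(idx, key=lambda i: candidates[i][1], reverse=True)[:alpha])
    # Ties in facet count are broken by score (an assumption of this sketch).
    top_facet = set(sorted(
        idx,
        key=lambda i: (facet_count(candidates[i][0], facet_words), candidates[i][1]),
        reverse=True)[:beta])
    return [candidates[i] for i in idx if i in top_score and i in top_facet]


# Toy beam with made-up scores; the facet is "map".
candidates = [
    ("would you like to know the history".split(), -1.0),
    ("would you like to know the map".split(), -2.5),
    ("would you like to know more".split(), -1.2),
    ("would you like to see a map".split(), -3.0),
    ("would you like to visit".split(), -4.0),
]
kept = prune(candidates, facet_words=["map"], alpha=4, beta=3)
for tokens, score in kept:
    print(score, " ".join(tokens))
```

In this toy beam, the second most probable candidate ("would you like to know more") is discarded because it contains no facet word, while two less probable candidates that mention "map" survive.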

Multiform Question Prompting and Ranking
Another challenge is guiding zero-shot GPT-2 to generate clarifying questions instead of narratives or other types of questions. We use clarifying question templates as the starting text of the generation and let the decoder generate the rest of the question body. For example, if the query is "I am looking for information about South Africa.", then we give the decoder "I am looking for information about South Africa. [SEP] would you like to know" as input and let it generate the rest. From our observation, GPT-2 is much better at finishing a question like this than at asking a new question by itself.
In our system, we use multiple prompts because we want both to cover more ways of clarification with different prompts and to avoid boring users with monotonous questions. A previous study about the effect of clarifying questions [22] reports the most common 4-grams for answering clarifying questions, as shown in Table 1. Inspired by their work, we reverse these most common answers to their original question forms (eight in total) and use them as our prompt candidates. For each query, we append these eight prompts to the query and form eight inputs. Eventually, we generate eight clarifying question candidates. For example, our generated questions for the query "I am looking for information about South Africa." with facet "map" are shown in Table 1.
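As an illustration of the prompting step, the following sketch builds one decoder input per template. The four templates listed are only a hypothetical subset taken from examples in this paper (the full set of eight is in Table 1), and the helper name `build_prompted_inputs` is an assumption.

```python
from typing import List

# Hypothetical subset of the question templates; the system uses eight (Table 1).
TEMPLATES = [
    "would you like to know",
    "are you looking for",
    "are you interested in",
    "do you need information about",
]


def build_prompted_inputs(query: str, templates: List[str], sep: str = "[SEP]") -> List[str]:
    """Append each question template to the query, one decoder input per template."""
    return [f"{query} {sep} {template}" for template in templates]


inputs = build_prompted_inputs("I am looking for information about South Africa.", TEMPLATES)
for text in inputs:
    print(text)
```

Each input is then completed by the decoder, so the number of question candidates equals the number of templates.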
In real applications, our system should return one question in the form of the concatenation of the prompt and the GPT-2 output. To find the best question, we explore various ranking methods to rank our prompted generations:
Perplexity is a commonly used method that ranks clarifying questions by the perplexity of the query-question concatenation computed by pre-trained GPT-2.
AutoScore is another commonly used method that ranks clarifying questions by a weighted sum of automatic natural language generation scores, including BLEU [35], ROUGE [25], and METEOR [5]. These scores are computed using generated questions as hypotheses and queries as references.
Cross-encoder [20] is a typical and commonly used dense retrieval structure. It ranks each question candidate by its relevance, which is computed by a transformer encoder followed by a linear scoring layer. The cross-encoder is pre-trained on millions of Reddit dialogues [31]. Directly using the pre-trained checkpoint is potentially suboptimal because the prompted generation ranking objective differs from the pretraining task.
NTES [34] is a clarifying question ranking model that won the ConvAI3 challenge [1] on the ClariQ dataset. The clarifying question ranking subtask in ConvAI3 requires a system to rank clarifying question candidates given the query and facet. We consider this task highly similar to our prompted generation ranking task. The NTES model finetunes pretrained ELECTRA [9] as its ranker on the ClariQ dataset. We use their finetuned checkpoint and aim to leverage the clarifying question ranking knowledge in this model.
Weighted Sequential Dependency Model (WSDM) [7] is a document ranking method based on query-candidate overlap in terms of unigrams and ordered/unordered bigrams within a context window. Our system treats the center words in the original query together with the facet words as the query and the prompted generations as candidates. The motivation for using WSDM is to rank generations in which the facet words co-occur closely higher. For example, given the query "I'm looking for information about South Africa" and facet "map", the WSDM model will rank all the prompted questions using "South Africa map" as the query. Questions like "do you need information about the map of South Africa" will be ranked higher than "do you want to buy a map that is made in South Africa" because "South Africa" and "map" are closer in the first question, which makes it more likely to be a useful clarification.
We have empirically compared different versions of our model using the ranking methods above in the experiments and find that WSDM achieves the best empirical performance overall. Thus, unless otherwise mentioned, we use WSDM as our primary ranker.
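To make the intuition behind the WSDM ranker concrete, the sketch below scores candidates with a simplified, count-based sequential dependence function. It is not the formulation of [7], which uses weighted and smoothed language-model features; the weights `(0.8, 0.1, 0.1)`, the window size, and all function names are assumptions chosen for illustration.

```python
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization with basic punctuation stripping."""
    stripped = (t.strip('.,?!"').lower() for t in text.split())
    return [t for t in stripped if t]


def unigram_matches(q: List[str], d: List[str]) -> int:
    return sum(d.count(t) for t in q)


def ordered_bigram_matches(q: List[str], d: List[str]) -> int:
    d_bigrams = list(zip(d, d[1:]))
    return sum(d_bigrams.count(b) for b in zip(q, q[1:]))


def unordered_window_matches(q: List[str], d: List[str], window: int) -> int:
    """Count adjacent query term pairs that co-occur within `window` tokens, in any order."""
    hits = 0
    for a, b in zip(q, q[1:]):
        pos_a = [i for i, t in enumerate(d) if t == a]
        pos_b = [i for i, t in enumerate(d) if t == b]
        if any(abs(i - j) < window for i in pos_a for j in pos_b):
            hits += 1
    return hits


def sdm_like_score(query: str, candidate: str,
                   weights: Sequence[float] = (0.8, 0.1, 0.1), window: int = 4) -> float:
    q, d = tokenize(query), tokenize(candidate)
    w_t, w_o, w_u = weights  # hypothetical weights, not the ones used in the paper
    return (w_t * unigram_matches(q, d)
            + w_o * ordered_bigram_matches(q, d)
            + w_u * unordered_window_matches(q, d, window))


def rank(query: str, candidates: List[str]) -> List[str]:
    return sorted(candidates, key=lambda c: sdm_like_score(query, c), reverse=True)


ranked = rank("South Africa map", [
    "do you want to buy a map that is made in South Africa",
    "do you need information about the map of South Africa",
])
print(ranked[0])
```

Both candidates contain all three query terms, so the unigram feature does not separate them; the window feature does, because "map" and "Africa" are three tokens apart in one candidate and six tokens apart in the other.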

Research Questions and Experiment Design
We design experiments to answer the following research questions:
RQ1. How well can we do in zero-shot clarifying question generation with existing baselines?
In this research question, we show the performance of some clarifying question generation baselines in the zero-shot setting:
(1) Q-GPT-0 simply uses GPT-2 to generate the clarifying question given the query. This and the next baseline are the zero-shot versions of approaches in [44].
(2) QF-GPT-0 appends the facet to the front of the query and generates the clarifying question.
(3) Prompt-based GPT-0 is a prompt-based GPT-2 approach which includes a special instructional prompt as input: "Ask a question that contains words in the list [$f$]."
(4) Template-0 is a template-guided approach using GPT-2. As mentioned earlier, a common problem for zero-shot GPT-2 is that it mostly generates narratives instead of questions. The Template-0 baseline adds the eight question templates during decoding and generates the rest of the question, which is similar to approaches in [13,47].

RQ2. How effective is facet information for clarifying question generation if utilized efficiently?
To the best of our knowledge, no previous work has explored clarifying question generation using the ambiguous query as the only source of information. Previous works such as [44,50,54] propose various facet-specific clarifying question generation methods using the facet or aspect of the query. Despite this, most of them did not or failed to experimentally demonstrate the importance of additional information for facet-specific clarifying question generation. Particularly in [44], it is shown that adding the facet does not significantly improve the quality of generated questions. These works leave the effectiveness of facets in doubt. We argue that the way facet information is utilized in these works is inefficient.
To answer this research question, we compare our proposed zero-shot facet-constrained approach with a facet-free variation that uses subject words from the query as constraints. For example, the subject words of the query "I am looking for information about South Africa." are "South Africa". Using a part-of-speech tagger, we extract the nouns or proper nouns from the query as its subject.
RQ3. How does our zero-shot facet-constrained approach compare to existing facet-driven baselines?
To answer this research question, we include some existing methods and a few other reasonable solutions not mentioned by previous works as our baseline models. Some of them are zero-shot, while others are not. However, we still compare their performances jointly to demonstrate the power of our zero-shot approach. We divide the dataset into training and evaluation sets. All the finetuning methods can access the training set to finetune the pre-trained GPT-2 checkpoint, while our zero-shot system cannot access it. Then, we evaluate all the methods on the evaluation set.
We compare our model against the following baseline models:
(1) Template-facet is a clarifying question rewriting baseline which appends the facet word right after the question template. For a fair comparison, we also apply multiform question templates and ranking. For example, given query $q$ and facet $f$, we first generate eight questions by appending the facet to each of the eight templates in Table 1. Then we rank these questions by their language perplexity. This baseline is not ideal. Admittedly, it can generate good questions such as:
$q$: "I am looking for information about South Africa."
$f$: "population"
$cq$: "Are you interested in [population]"
However, sometimes the facet itself is not meaningful:
$q$: "I am interested in poker tournaments."
$f$: "online"
$cq$: "Are you interested in [online]"
(2) QF-GPT [44] is a GPT-2 finetuning version of QF-GPT-0. It initializes with pretrained GPT-2 and finetunes on a set of (facet $f$, query $q$, clarifying question $cq$) tuples in the form of $f$ [SEP] $q$ [BOS] $cq$ [EOS] paragraphs, where [SEP] is the separator token, [BOS] the beginning-of-sentence token, and [EOS] the end-of-sentence token.
(3) Prompt-based finetuned GPT is a finetuning version of Prompt-based GPT-0. The motivation is that simple facet-as-input finetuning is highly inefficient in informing the decoder to generate facet-related questions, as indicated by the observed facet coverage rate of only 20%. Inspired by recent advances in prompt studies, especially for natural language generation such as [23,27,42], we add a sentence "Ask a question that contains words in the list [$f$]" between $q$ and $cq$, aiming to instruct GPT-2 to include the facet words in the clarifying question. Hence, we finetune GPT-2 with the structure: $q$ "Ask a question that contains words in the list [$f$]." $cq$
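The input layouts of the three baselines can be sketched as simple string builders. The function names are hypothetical, the special tokens are written as literal text, and the exact spacing and placement of the instruction sentence in the prompt-based variant are assumptions rather than details confirmed by the original implementations.

```python
def template_facet_question(template: str, facet: str) -> str:
    """Template-facet: append the facet right after a question template."""
    return f"{template} {facet}"


def qf_gpt_example(facet: str, query: str, question: str) -> str:
    """QF-GPT finetuning paragraph: f [SEP] q [BOS] cq [EOS]."""
    return f"{facet} [SEP] {query} [BOS] {question} [EOS]"


def prompt_finetune_example(facet: str, query: str, question: str) -> str:
    """Prompt-based finetuning: instruction sentence placed between q and cq (assumed layout)."""
    instruction = f"Ask a question that contains words in the list [{facet}]."
    return f"{query} {instruction} {question}"


query = "I am looking for information about South Africa."
print(template_facet_question("Are you interested in", "population"))
print(qf_gpt_example("population", query, "Are you interested in population"))
print(prompt_finetune_example("population", query, "Are you interested in population"))
```

Only the first builder is used at inference time without training; the other two describe how a finetuning example would be laid out as text.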

Dataset
We use the ClariQ-FKw [44] dataset for our main experiments. The ClariQ dataset is originally from the ConvAI3 challenge [1]. This dataset has rows of ($q$, $f$, $cq$) tuples, where $q$ is an open-domain search query, $f$ is a search facet, and $cq$ is a human-generated clarifying question regarding the facet. The facet in ClariQ is in the form of a faceted search query. ClariQ-FKw extracts the keyword of the faceted query as its facet column and samples a dataset with 1756 training examples and 425 evaluation examples. We report the performances of all of our proposed and baseline systems on its evaluation set.
Because we aim to solve the problem in a zero-shot setting, our proposed system does not access the training set.The other supervised learning systems can access the training set for finetuning.

Evaluation
We evaluate system performances using both automatic metrics for natural language generation and human annotators who label the generated questions. Following previous works [1,2,44], we use the human-generated question from the dataset as the gold reference.
It is worth clarifying that this type of evaluation and its metrics are only meant to measure the ability of a system to ask clarifying questions about the facet specifically, instead of generally relevant questions about the query. However, defining other types of evaluation would be challenging given what we have in the dataset.

Automatic Metrics
The automatic metrics we use are BLEU [35], ROUGE [25], METEOR [5], and Coverage [24]. BLEU and ROUGE are based on word match, while METEOR uses more general word forms to compute alignments between reference and generation. Coverage is computed as the average frequency of facet words in generations. Because these automatic metrics are mostly based on word overlap, we propose two different ways of computing them. We argue that not all the words in the generation are equally important; for instance, the words in the question templates are less important than those in the actual question body. For example:
$q$: "I am looking for information about South Africa."
$f$: "picture"
ref: "would you like to [see some pictures of South Africa]"
cq1: "would you like to [take pictures of]"
cq2: "are you looking for [pictures of South Africa]"
In this example, ref is the gold reference clarifying question, and cq1 and cq2 are two candidate generations; square brackets mark the question body. Over the full question, cq1 has more word overlap with the reference than cq2, including the 4-gram overlap "would you like to". However, we humans can quickly tell that cq1 is a worse clarifying question than cq2. The reason for this discrepancy is that the question template does not contain real information. Therefore, it can easily bias the evaluation toward generations with the same template but a bad question body. We are concerned that the conventional evaluation approaches computed on full questions could have limitations in our task, especially since the question templates are given rather than actually generated by the models. Hence, we propose another way to compute these metrics, without the question template and only on the actual question body. Within the question body, cq2 has more overlap with the reference than cq1, which corresponds to the human judgment of good and bad clarifying questions.
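The discrepancy in this example can be checked with a small sketch that counts unigram overlap on the full question and on the bracketed question body. Plain unigram counting is used here only as a stand-in for BLEU, ROUGE, and METEOR, and the helper names are hypothetical.

```python
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase whitespace tokens, ignoring the square brackets that mark the body."""
    return text.lower().replace("[", " ").replace("]", " ").split()


def body(question: str) -> str:
    """Return the question body, i.e., the text between the square brackets."""
    return question[question.index("[") + 1:question.index("]")]


def overlap(candidate: str, reference: str) -> int:
    """Number of candidate tokens that also appear in the reference."""
    ref = set(tokens(reference))
    return sum(1 for t in tokens(candidate) if t in ref)


ref = "would you like to [see some pictures of South Africa]"
cq1 = "would you like to [take pictures of]"
cq2 = "are you looking for [pictures of South Africa]"

full = (overlap(cq1, ref), overlap(cq2, ref))
body_only = (overlap(body(cq1), body(ref)), overlap(body(cq2), body(ref)))
print("full question overlap (cq1, cq2):", full)
print("question body overlap (cq1, cq2):", body_only)
```

On the full question, cq1 matches six reference tokens against five for cq2; restricted to the question body, cq1 matches two and cq2 matches four, which reverses the ordering in line with the human judgment.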
After reading Sections 3.1 and 4.3, a reasonable concern would be: wouldn't the facet-constrained generation naturally improve the automatic metrics, given that using facets as constraints makes these true-positive words more likely to be included? We are aware of this concern, and we design multiple experiments and evaluations to ensure the performances of our system are meaningful. First, in RQ3, we compare our system with the Template-facet rewriting baseline, which benefits even more than our system because of the guaranteed facet inclusion. We will show in Section 5 that our proposed system can achieve even higher scores than Template-facet. Second, we include human evaluations. Human annotators are free of this bias because they evaluate generated questions by the quality of entire sentences, not word overlaps. Last, we want to highlight that the facet is used in one way or another by all facet-driven models in RQ3. Using facets as constraints or inputs is a modeling choice that does not break fair comparison principles.

Human Evaluation Metrics
As in the example above, automatic metrics have been reported [6] not to necessarily correspond to true generation quality. Therefore, following previous works [40,44,54], we employ human annotators to evaluate the quality of the generated clarifying questions on the 425 test examples. The annotators are provided with randomly shuffled generations from all the models in RQ3 and asked to label them without knowing their sources. We provide the annotators with a detailed guideline for annotating each generated question with two labels: usefulness and naturalness. For each label, the annotators must decide whether the question is good, fair, or bad. The guideline can be found in Appendix A.
Naturalness is defined as the general fluency and understandability of the generated question.The naturalness of a question is independent of its coherence to the topic of the query.By our definition, this label mainly evaluates the overall language modeling capacities of the model.Generally speaking, a zero-shot GPT-2-based decoder would keep the same language capacity as the original GPT-2 because it uses the same model.However, finetuning could downgrade the capacity due to the bias of the limited-sized finetuning set.
Usefulness is defined as whether the question is relevant to both the query and the facet and makes the query easier to answer.Typical bad usefulness questions can fall into one of the categories: duplicate, prequel, miss-intent, too general, or too specific [40].These questions are relevant to the query but not useful for clarification.For example, for the query "Tell me about computer programming." and facet "courses", "are you interested in computer programming." is a duplicate question with the original query, and "are you looking for computer programming courses for children." is too-specific.
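Table 4 reports labels aggregated by majority vote over five annotators. A minimal sketch of such an aggregation is given below; the tie-breaking rule (falling back to the worse label when two labels receive the same number of votes) is an assumption made for this illustration and is not specified in our protocol.

```python
from collections import Counter

# Ordered from worst to best; ties go to the worse label (an assumption).
LABELS = ("bad", "fair", "good")


def majority_label(votes):
    """Return the most frequent label among the annotator votes."""
    counts = Counter(votes)
    top = max(counts.values())
    for label in LABELS:
        if counts[label] == top:
            return label
    return None  # only reached if votes contain labels outside LABELS


print(majority_label(["good", "good", "fair", "bad", "good"]))
print(majority_label(["good", "good", "fair", "fair", "bad"]))
```

The same function would be applied separately to the naturalness and usefulness labels of each generated question.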

Implementation Details
We use the NeuroLogic Decoding algorithm from the authors' GitHub implementation 2. We implement QF-GPT ourselves with pretrained GPT-2 checkpoints from Huggingface and achieve performance similar to the original work [44]. Similarly, we implement prompt finetuning by changing the input of QF-GPT.
For our question candidate rankers, the perplexity ranker is implemented using the Huggingface pretrained GPT-2 checkpoint. We use the ParlAI implementation 3 for the cross-encoder, and the ConvAI3 winning team's implementation 4 for their ranker named NTES. We implement our own Weighted Sequential Dependency Model using the intuitively adjusted parameters:

Results and Analyses
This section answers the research questions using our experiment results. In Table 2 and Table 3, we show the automatic evaluation results for the full-question evaluation and the question-body evaluation, respectively, as described in Section 4.3. In Table 4, we show the human annotation results of the models compared in the third research question. Our first research question, RQ1, is about the performance of existing methods for zero-shot clarifying question generation. The results are shown in the first four rows of both tables. From the full-question evaluation in Table 2, we see that all these baselines struggle to produce any reasonable generations except for Template-0. However, we cannot conclude that Template-0 generates significantly better questions, because when we compare the question-body evaluation results in Table 3, the scores of Template-0 drop significantly. This means that its question body is not good, which implies that its higher score in the full-question evaluation comes from the question templates. In general, we find that existing zero-shot GPT-2-based approaches cannot solve the clarifying question generation task effectively.
Our second research question, RQ2, is about the effectiveness of facet information for facet-specific clarifying question generation. To answer this question, we compare our proposed zero-shot facet-constrained (ZSFC) methods with a facet-free variation of ZSFC named Subject-constrained, which uses the subject of the query as constraints. It would be unfair to compare the Coverage metric between the two models because the Subject-constrained system does not access facet information. The gold references used for computing the other metrics are the facet-specific clarifying questions from the dataset; thus, these scores also do not fully reflect the quality of clarifying question generation in general, because a generated question could reflect another facet and get a low score on these metrics. However, these metrics can still be seen as measuring the generation quality with respect to these specific facets, which should intuitively be improved by adding the facet as input, although not with naive GPT-2 [44].
From both the full-question and question-body evaluations, we see that all the ZSFC models significantly improve over the Subject-constrained method across all the other evaluation metrics. The ZSFC models also lose less performance when switched to question-body evaluation, which suggests that their better performance comes more from the question body. In contrast to existing work, our study shows that adequate use of facet information can significantly improve clarifying question generation quality.
The last research question, RQ3, is whether our proposed zero-shot approach can perform as well as or even better than existing facet-driven baselines. To answer this question, we compare our method with a simple clarifying question rewriting baseline and two finetuning baselines in the third section of the table. Among them, QF-GPT is the existing method [44], and Prompt finetuning is our proposed prompt-based finetuning method. We see from both tables that our zero-shot facet-driven approaches are always better than the finetuning baselines. Our best-performing generation system, ZSFC+WSDM, improves on the existing method QF-GPT by a large margin in the full-question evaluation and doubles its performance in the question-body evaluation. When compared with the Template-facet baseline, ZSFC outperforms it in both tables. This implies that the ZSFC generations have more word overlap with the reference beyond just the facet, which means the performance improvements are non-trivial. We also bold the smallest performance drop between the two evaluations. We can see that our ZSFC models have relatively minor performance drops, which means that they potentially generate better question bodies.
To validate the above conclusions, we employ human annotators to label the quality of generated questions. From Table 4, we can see that our proposed system ZSFC gives the best generations for both naturalness and usefulness. Specifically, it generates the most questions with good naturalness and usefulness and the fewest bad ones. This human evaluation result strengthens the conclusion from the automatic evaluation that our method is better than the baseline and supervised learning methods. We also notice that Template-facet rewriting is a simple yet strong baseline: both finetuning-based methods are actually worse than it. However, ZSFC outperforms it by a large margin in both measures.

Ablation Study
We are also interested in whether question prompting is necessary for our system and which of the question rankers in Section 3.2 is the best.
To answer this question, we compare all five ranking methods mentioned in Section 3.2. For each (query, facet) pair in our dataset and each ranker, we run the ranker on the eight question candidates generated with the eight question templates and choose the top question from the ranked list as the generation of that ranker. The no-prompt-no-ranker approach generates only one sentence with NeuroLogic Decoding, with the query as input and the facet as constraints.
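The per-pair selection procedure can be sketched as follows. The `generate` and `score` callables are placeholders for the constrained decoder and for any one of the rankers; the canned outputs, the three templates, and the word-counting scorer below are toy stand-ins invented for this illustration, not our actual components.

```python
from typing import Callable, Sequence


def select_question(
    query: str,
    facet: str,
    templates: Sequence[str],
    generate: Callable[[str, str, str], str],
    score: Callable[[str, str, str], float],
) -> str:
    """Generate one candidate per template and return the top-ranked one."""
    candidates = [generate(query, facet, template) for template in templates]
    return max(candidates, key=lambda cand: score(query, facet, cand))


# Canned decoder outputs standing in for constrained generation (hypothetical).
CANNED = {
    "are you looking for": "are you looking for computer programming courses",
    "would you like to": "would you like to know more about",
    "do you want to": "do you want to take courses",
}


def toy_generate(query: str, facet: str, template: str) -> str:
    """Toy decoder: look up a fixed output for the given template."""
    return CANNED[template]


def toy_score(query: str, facet: str, candidate: str) -> float:
    """Toy ranker: count query and facet words present in the candidate."""
    words = set(candidate.split())
    return float(sum(w in words for w in (query + " " + facet).split()))


best = select_question(
    "computer programming", "courses", list(CANNED), toy_generate, toy_score
)
print(best)
```

Swapping `toy_score` for a different scoring function changes the ranker while leaving the candidate generation untouched, which is the comparison carried out in this ablation.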
Here, we analyze the results of all the ranker variations in the last section of Tables 2 and 3. The full-question evaluation results show that the AutoScore ranker performs best on all but the Coverage metric. However, the question-body evaluation results suggest that the WSDM ranker performs the best. On the one hand, we believe the question-body evaluation results are more convincing, given our previous analyses and examples of the two evaluation methods. On the other hand, we notice that when switched to question-body evaluation, the performance of AutoScore drops more than that of the other methods, while the scores of WSDM barely change or even increase. This further suggests that part of the performance of AutoScore should be attributed to the question templates. We now explain why WSDM has such small performance drops. Unlike other ranking methods, the WSDM scoring function is defined in a way that rewards generations with a high-quality question body, since the question templates rarely contain facet or query subject words. Based on all the above observations and reasoning, we conclude that WSDM is the best ranker.

Conclusion
In this work, we study the task of zero-shot clarifying question generation for conversational search. We propose to solve the task as a constrained language generation problem and present a concrete system. To demonstrate the power of our system, we answer three research questions, including a comparison of our zero-shot system with baseline and existing supervised learning approaches. All the experiment results have been evaluated using a variety of natural language generation metrics, and human evaluations are done for part of the results. The automatic metrics and human annotation results suggest that our proposed zero-shot system outperforms the other compared approaches. Our work can be seen as both a solid zero-shot solution to the cold start problem of conversational search and a compelling demonstration of how large deep models benefit from properly integrating human knowledge.

A Human Annotation Guideline
In this task, imagine you are a user who unintentionally asks our search system an ambiguous search query in a conversation (imagine you are using Google or talking to Siri). To better understand the intention of your query, our system asks you a clarifying question. Your task is to judge whether this clarifying question is natural and useful.
For example, the user asks "Tell me about defender". The query is ambiguous because the word "defender" can refer to a personality type coded as ISFJ, a TV series "The Defender", a vehicle named "Defender", or a video game named "Defender". In order to know whether the user is asking about the TV series, the search system asks the clarifying question "Are you interested in a television series?" As another example, the user asks "Tell me information about computer programming." Unlike the last example, this query is ambiguous NOT because the term "computer programming" is ambiguous, but because "computer programming" is a general concept, and there can be multiple search directions. For example, the user can be looking for computer programming jobs, computer programming languages, computer programming courses, or the history of computer programming. To confirm whether the user is looking for computer programming courses, the system asks the clarifying question "Are you looking for a course in computer programming?" In general, ambiguous queries have many possible "facets". For example, "TV series" is one possible facet in the "defender" example, and "course" is one possible facet in the "computer programming" example. Our system generates these questions based on the ambiguous query and one possible facet.
Your goal is to evaluate the clarifying question asked by our system in terms of its Naturalness and Usefulness (please read the explanations below). Besides the query, facet, and generated question, you will also get a human-written question as your reference. You can assume the human-written question is always good in both naturalness and usefulness.
Explanation of Naturalness: The Naturalness of a question is whether the question is fluent, grammatical, and easy to understand. Your goal is to give each question "Good", "Fair", or "Bad" in terms of its naturalness. Good naturalness means the question is fluent and like our daily language. Fair naturalness means that, although it is not grammatically perfect or contains noise, the question can still be understood with some effort. Bad naturalness means the question is incomplete or hard to understand, or the generated sentence is not a question.
Here are some examples with explanations:
Example 1
Query: "Tell me about defender"
Facet: "television series"
Reference: "are you interested in the television series defender"
Good naturalness questions:
"are you interested in a television series" (Almost the same as reference)
"do you need to be in the team" (fluent and easy to understand, although not meaningful)
Fair naturalness questions:
"do you want to know television series" (A little weird but understandable)
Bad naturalness questions:
"television series, etc." (Not a question, and the sentence is incomplete)
"would you like to know more about" (the question is incomplete)
Example 2
Query: "Tell me information about computer programming."
Facet: "courses"
Reference: "are you interested in coding courses online"
Good naturalness questions:
"are you looking for a course in computer programming" (Almost the same as reference)
"do you need to have courses in computer science" (Almost the same as reference)
"would you like to tell me about it" (fluent and easy to understand, although not meaningful)
Fair naturalness questions:
"do you want to coursework in computer programming" (sounds strange/ungrammatical, but understandable on second thought)
"do you want to know what is going on with your courses" (fluent but unlike daily language)
Bad naturalness questions:
"do you need to know" (Not a complete sentence)
Explanation of Usefulness:
(2) Not be a duplicative question of the query or ask for prequel information.
(3) Not miss the intent of the query.
(4) Not be too general or over-specific.
Your goal is to give each question "Good", "Fair", or "Bad" in terms of its usefulness. A Good usefulness question is a perfect reflection of the facet and makes the query easier to answer. A Fair usefulness question is weakly relevant to the query and facet, but not completely irrelevant. A Bad usefulness question can be completely irrelevant, duplicative, miss-intent, prequel, too general, or over-specific.
A question can be natural but not useful. Please see the examples below.
Example 1
Query: "Tell me about defender"
Facet: "television series"
Reference: "are you interested in the television series defender"
Good usefulness questions:
"are you interested in a television series" (Almost the same as reference)
"do you want to know television series" (not perfectly natural but useful)
Fair usefulness questions:
"would you like to see a television series based on your work" (although "based on your work" is weird, user could still answer the question as "yes I am referring to the TV series defender")
Bad usefulness questions:
"do you need to be in the team" (not meaningful for the query)
"television series, etc." (Not a question, and the sentence is incomplete, user cannot answer it)
Example 2
Query: "Tell me information about computer programming."
Facet: "courses"
Reference: "are you interested in coding courses online"
Good usefulness questions:
"are you looking for a course in computer programming" (Almost the same as reference)
"do you need to have courses in computer science" (not natural but useful)
"do you want to coursework in computer programming" (not natural but useful)
Fair usefulness questions:
"do you want to know what is going on with your courses" (weakly relevant to the facet)
Bad usefulness questions:
"do you need to know" (Not a complete sentence)
"would you like to tell me about it" (completely irrelevant)
(3) Grouping Step: Group the remaining candidates by the facet words they contain. This will result in $2^{|f|}$ groups, where $|f|$ is the number of facet words. However, notice that some groups could be empty.
(4) Keep the best candidate from each of the groups. Again, there may be fewer than $2^{|f|}$ candidates left now. Keep at most the best $k$ candidates with the highest score on $y_{1:t}$ in the beam and move on to decoding the $(t+1)$th token.
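Steps (3) and (4) can be sketched as below. This is a simplified illustration rather than the NeuroLogic Decoding implementation: candidates are assumed to be (tokens, score) pairs, a group is identified by the exact subset of facet words a candidate contains, and the surviving candidates are ranked by score alone. The function name `prune_beam` and the example scores are hypothetical.

```python
def prune_beam(candidates, facet_words, beam_size):
    """Keep the best candidate per facet-word group, then the top beam_size."""
    best_per_group = {}
    for tokens, score in candidates:
        # A group is the subset of facet words present in the candidate.
        group = frozenset(word for word in facet_words if word in tokens)
        current = best_per_group.get(group)
        if current is None or score > current[1]:
            best_per_group[group] = (tokens, score)
    # Empty groups never appear, so at most 2**len(facet_words) survive.
    survivors = sorted(best_per_group.values(), key=lambda c: c[1], reverse=True)
    return survivors[:beam_size]


facet_words = ["television", "series"]
candidates = [
    (["are", "you", "interested", "in", "television"], -1.0),
    (["are", "you", "interested", "in", "a"], -0.5),
    (["do", "you", "want", "television"], -2.0),
    (["a", "television", "series"], -3.0),
    (["would", "you", "like"], -0.7),
]

beam = prune_beam(candidates, facet_words, beam_size=2)
for tokens, score in beam:
    print(" ".join(tokens), score)
```

Keeping one candidate per group preserves hypotheses that satisfy different subsets of the facet constraints, even when their scores are lower than those of unconstrained hypotheses.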

Table 2: Model performances evaluated on full question and reference. ZSFC models are our zero-shot facet-constrained method with different question rankers. Bolded numbers indicate the best-performing model of the column.1

Table 3: The effectiveness of question-body-only evaluation. The first number is the metric score on the question body. The second number (Δ) is the performance gap between question-body and full-question evaluation, which indicates the portion boosted by question templates.

Table 4: Human evaluations for models in RQ3 according to the majority vote from 5 annotators. Our human evaluation results are aligned with our automatic evaluations. † and ‡ indicate $p < 0.05$ and $p < 0.0001$ statistical significance over other models.