A Comparative Study of Training Objectives for Clarification Facet Generation

Due to the ambiguity and vagueness of a user query, it is essential to identify the query facets for the clarification of user intents. Existing work on query facet generation has achieved compelling performance by sequentially predicting the next facet given previously generated facets based on pre-trained language generation models such as BART. Given a query, there are mainly two types of training objectives to guide the facet generation models. One is to generate the default sequence of ground-truth facets, and the other is to enumerate all the permutations of ground-truth facets and use the sequence that has the minimum loss for model updates. The second is permutation-invariant while the first is not. In this paper, we aim to conduct a systematic comparative study of various types of training objectives, which differ not only in whether they are permutation-invariant but also in whether they conduct sequential prediction and whether they can control the count of output facets. To this end, we propose another three training objectives with different combinations of these properties. For comprehensive comparisons, besides the commonly used evaluation that measures the matching with ground-truth facets, we also introduce two diversity metrics to measure the diversity of the generated facets. Based on an open-domain query facet dataset, i.e., MIMICS, we conduct extensive analyses and show the pros and cons of each method, which could shed light on model training for clarification facet generation. The code can be found at https://github.com/ShiyuNee/Facet-Generation.


INTRODUCTION
Since user queries can be ambiguous or vague, query intent clarification is beneficial to enhance user experience and retrieval effectiveness. In today's dialogue-based retrieval systems, where the display space or voice bandwidth is limited, it becomes even more important. For example, the query "Chicago" can mean "the musical Chicago", "the band Chicago", "the city Chicago", "the movie Chicago", etc. Accurately predicting such query facets can be conducive to various search scenarios. In web search, displaying possible facets can help users refine their original queries or provide them with relevant subtopics that they may find useful. In a conversational search system, facets can be used to ask clarifying questions.
Early studies on extracting query facets mainly rely on specific domains or external resources [5,13,18,20,21,28]. However, such methods may not be suitable for open-domain queries, which have a much wider variety of facets. Later, some studies [8,14,15,16] find that utilizing top-retrieved documents can help identify open-domain query facets based on word frequencies [8] and their co-occurrences [14,15,16]. Nevertheless, these methods could not capture semantic relations sufficiently. In recent years, grounded on pre-trained language models (PLMs) such as BERT [6] and BART [19], query facet prediction has achieved compelling performance [10,11,26,32]. Although query facet prediction can be formulated as facet generation, sequence labeling, facet classification, etc., facet generation has more flexibility than the others and has been found to perform the best [26]. In this paper, we focus on facet generation based on PLMs with top-retrieved documents.
Existing methods typically conduct facet generation by generating the sequence of facets associated with a query [10,11,26], using two common training objectives that differ in whether the ordering of the facet sequence has an impact during training. One is to generate the default sequence of facets [10,26], denoted as Seq-Default in Table 1. This approach forces the model to generate a specific facet ordering that is only one of the possible permutations, which may lead to suboptimal performance. Realizing this issue, Hashemi et al. [11] propose the other training objective that enumerates the permutations of the query facets and uses the one with minimum loss to guide model training, denoted as Seq-Min-Perm in Table 1. Theoretically, the training objective should be permutation-invariant since the ground-truth facets are an unordered set. However, enumerating the permutations incurs huge computation costs, and only using the minimum loss may not leverage the permutations of facets sufficiently. When we examine the existing methods [10,11,26] from other perspectives (see Table 1), we find that they all generate the next facet depending on the previously generated facets. Also, the count of generated facets is determined by the model, and no more facets can be generated once the model has output a termination symbol.
In this paper, we aim to conduct a comparative study of various training objectives for the task of clarification facet generation from multiple perspectives. To this end, we propose another three training objectives for facet generation that are all permutation-invariant but have different properties in terms of whether to sequentially predict facets based on the context of so-far generated facets and whether the count of generated facets is controllable. 1) Seq-Avg-Perm: It is similar to Seq-Min-Perm but uses the average loss of facet permutations instead of the minimum; 2) Set-Pred: It predicts each facet in the ground-truth set as an independent target during training and uses a search algorithm such as beam search to select the top k facets with the highest probabilities for inference; 3) Seq-Set-Pred: It sequentially predicts the remaining facet set based on arbitrarily generated facets as context. As shown in Table 1, among these methods, only Set-Pred does not output the next facet based on previously generated facets. So, it does not incur the training cost of enumerating the facet permutations. However, it could keep generating similar facets since it only refers to the original query and top documents. Only Set-Pred and Seq-Set-Pred can output a designated number of facets. They focus on the facet prediction task alone, which relieves them from also learning how many facets they need to produce and could lead to potentially better performance.
Previous work mainly evaluates the effectiveness of facet generation by measuring the matching degree between the generated facets and the ground-truth facets [10,11,26]. This, however, cannot reflect the diversity of generated facets sufficiently. To conduct systematic comparisons of different types of methods, we introduce two diversity metrics that measure the term and semantic diversity of the generated facets respectively. We evaluate models trained with the two existing objectives and three new objectives using both the commonly used matching metrics and our proposed diversity metrics. Based on MIMICS [33], an open-domain query facet dataset, our experimental analyses demonstrate that: appropriate permutation-invariant objectives can help generate better facets; facet prediction that is only based on the query and top-retrieved documents (i.e., Set-Pred) achieves compelling performance in terms of the metrics measuring matching with the ground truth but has the worst diversity performance; and methods that only learn facet prediction given context (i.e., Seq-Set-Pred) have better semantic matching with the ground truth but worse diversity than their counterpart that also learns when to stop generating facets (i.e., Seq-Avg-Perm). To sum up, the main contributions of this work include:
1) To the best of our knowledge, this is the first work that conducts a systematic comparative study on a wide variety of training objectives for clarification facet generation.
2) We propose three extra training objectives whose properties differ from those in existing work and introduce two diversity-oriented metrics for evaluation.
3) We conduct comprehensive analyses and show the pros and cons of each type of method, which we believe could provide insights for future research efforts in clarification facet generation.

RELATED WORK
Clarifying user intent is a crucial issue in information retrieval. There have been many works focusing on learning facet prediction. Additionally, some studies utilize predicted intents to generate clarifying questions, clarifying intent by asking questions to the user. So the two threads of research related to our work are: asking clarifying questions and facet prediction.

Asking Clarifying Questions
Asking clarifying questions (CQ) is an important way to clarify the intent of a query [12]. Typically, before asking a CQ, we need to determine a facet and generate the question based on that facet. Current research on CQ can be mainly divided into two categories: (1) ranking/selecting CQ [1,2,3,24] and (2) generating CQ [7,25,27,30,32].
For ranking/selecting CQ, Rao and Daumé III [24] sorted questions based on the usefulness of their answers, using the Expected Value of Perfect Information as the theoretical basis for ranking. Later, Aliannejadi et al. [2] collected a dataset for CQ and proposed a baseline for selecting CQ. Other studies [3,9] conducted CQ ranking based on top-retrieved documents and negative feedback, respectively.
For CQ generation, Rao and Daumé III [25] applied generative adversarial learning techniques when training the sequence-to-sequence question generation model. Zamani et al. [32] proposed a rule-based model and two neural question generation models to generate CQ when given a query and its facet. Later, Dhole [7] proposed a model that utilizes rule-based systems to generate discriminative questions, aiming to obtain clarifications on user intent. Wang and Li [30] introduced a template-based question generation model called TG-ClariQ, which selects a template question from a candidate set. The missing parts in the template question are filled in using selected words. They converted generating CQ into a selection task. Meanwhile, Sekulić et al. [27] also proposed a facet-driven approach and Zhao et al. [35] showed that such facets can be extracted from top-retrieved documents. Recently, Wang et al. [31] converted CQ generation to a facet-constrained question generation task to guide effective and precise question generation.

Facet Prediction
Current research on facet prediction can be mainly divided into two categories: (1) facet extraction and (2) facet generation.
The majority of early work on facet extraction [5,13,18,20,28] primarily relied on specific domains or external resources. Kohlschütter et al. [13] introduced an approach for extracting facets based on personalized PageRank link analysis and annotated taxonomies. Stoica et al. [28] put forward a technique for generating hierarchical faceted metadata. The method utilized hypernym relations in WordNet to extract this metadata from textual descriptions of items. Subsequently, Dakka and Ipeirotis [5] used entity hierarchies in Wikipedia and WordNet to extract candidate facet terms. Li et al. [20] created a faceted retrieval system that showcases pertinent facets extracted from Wikipedia hyperlinks and categories. While these methods have shown promising results in certain scenarios, they often struggle under large-scale open-domain settings.
Apart from the approaches mentioned above that rely on specific domains or external resources, another approach for facet extraction and generation is based on top-retrieved documents. Dou et al. [8] introduced QDMiner, one of the earliest open-domain facet extraction systems. This system utilizes textual patterns to aggregate frequent lists from the top web search results and gets query dimensions based on the aggregated results. Kong and Allan [14] proposed a graph-based probabilistic model for determining whether a candidate term is a facet term and for identifying whether two candidate terms belong to the same query facet. Later, they extended faceted search to the general web [15]. In their subsequent work [16], the authors put forward a graphical model that optimizes the expected performance measure and selectively displays facets for only part of the queries according to their predicted performance.
After the rise of pre-trained language models, generating query facets with such models has become a popular approach to facet prediction and achieved compelling performance. Hashemi et al. [10] proposed NMIR to cluster the documents prior to generation and learned a representation for each facet. Subsequently, to address the issue of matching between clustered documents and facets, as well as the influence caused by the order of generating facets, they proposed PINMIR [11]. Samarinas et al. [26] revisited the task of query facet extraction and generation and considered facet generation as autoregressive text generation, which produces state-of-the-art results.
Inspired by the aforementioned works, in this paper, we summarize the impact of various training objectives on facet generation and conduct a comparative analysis of the generated results. We hope to provide guidance for the research in facet generation.

METHODOLOGY
In this section, we first introduce the definition of the facet generation task. We illustrate two existing training objectives in Section 3.2 and the newly proposed three in Section 3.3.

Task Description
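Given an open-domain query, our task is to generate the associated facets based on their corresponding related search engine result pages (SERPs). We use top-retrieved documents in the SERPs as evidence to help generate better facets during both training and inference. Let $q_i$ denote the $i$-th training query, $D_i = \{d_{i1}, d_{i2}, \cdots, d_{ik}\}$ its top-retrieved documents, and $F_i = \{f_{i1}, f_{i2}, \cdots, f_{im_i}\}$ its ground-truth facets, where $m_i$ is the number of facets that $q_i$ has. The task is to generate a set of related facets $F$ for any given query $q$ with its associated documents $D$.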
In this section, we describe five representative training objectives for facet generation. Two of them have been proposed in previous work and both of them conduct sequential facet prediction. They are: (1) seq-default [10,26] and (2) seq-min-perm [11]. These two are order-sensitive and permutation-invariant respectively. Moreover, we propose another three permutation-invariant training objectives: (3) seq-avg-perm, (4) set-pred, and (5) seq-set-pred. The comparative characteristics of these five objectives can be seen in Table 1. As in [10,11,26], we use the autoregressive model BART [19], a Transformer-based encoder-decoder model parameterized by $\theta$, for sequence generation and leave the studies based on decoder-only methods to future work. Next, we will describe the details of all the objectives.

Existing Objectives for Facet Generation
In this subsection, we introduce the two existing methods used for facet generation: seq-default and seq-min-perm.
Seq-Default [10,26]. It considers the default facet sequences in the corpus as training targets and is commonly used in previous work [10,26]. For a given query $q_i$ and its corresponding related documents $D_i$, we concatenate $q_i$ and each $d_{ij} \in D_i$ using [SEP] and produce
$$x_i = q_i \; [\mathrm{SEP}] \; d_{i1} \; [\mathrm{SEP}] \; d_{i2} \; [\mathrm{SEP}] \cdots [\mathrm{SEP}] \; d_{ik}.$$
Note that we follow the same process of encoding this concatenated sequence with BART for all the studied training methods. We concatenate the facets $f_{ij} \in F_i$ with ', ' and the yielded text string $f_{i1}, f_{i2}, \cdots, f_{im_i}$, followed by the termination symbol, is the target $y_i$ for input $x_i$. This objective can be expressed as
$$\theta^* = \arg\min_{\theta} \sum_i L(h, y_i),$$
where $L(h, y_i)$ is the negative log-likelihood of generating $y_i$ and $h$ is the output of the BART encoder. This training objective could lead to suboptimal results since the model learns towards only the given facet ordering and ignores other equally valid permutations. Aware of this issue, Hashemi et al. [11] have proposed a loss function that is permutation-invariant, which we name seq-min-perm and introduce next.
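As a minimal sketch of the input and target construction described above, the two strings could be assembled as follows. The helper names, the literal separator string, and the toy query are assumptions made for illustration; the termination symbol is assumed to be appended later by the tokenizer.

```python
from typing import List

# Separator string assumed for illustration; the actual token depends on the tokenizer.
SEP = " [SEP] "


def build_input(query: str, snippets: List[str]) -> str:
    """Concatenate the query with each document snippet using the separator."""
    return SEP.join([query] + snippets)


def build_default_target(facets: List[str]) -> str:
    """Join the facets, in their corpus order, with ', ' as the decoder target."""
    return ", ".join(facets)


example_input = build_input(
    "chicago",
    ["Chicago is a city in Illinois.", "Chicago is a 2002 musical film."],
)
example_target = build_default_target(["the city", "the band", "the musical"])
```

The same input string is reused by every objective; only the way targets are built differs.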
Seq-Min-Perm [11]. It treats the query intents as a set rather than a sequence to eliminate the impact of facet order. It extends the Hungarian loss [17] for facet generation. For a given input $x_i$, this method takes all the permutations of facets in $F_i$ into consideration. Each permutation is concatenated as a sequence and the sequence with the minimum loss is used as the training target. The way of concatenation to yield the input is the same as seq-default for a query $q_i$. The objective is as follows:
$$\theta^* = \arg\min_{\theta} \sum_i \min_{y_i^* \in \pi(F_i)} L(h, y_i^*),$$
where $L(h, y_i^*)$ is the same negative log-likelihood function as in seq-default, $h$ is the output of the BART encoder, $\pi(F_i)$ means all the possible permutations of ground-truth facets for query $q_i$, and the number of items in the permutation set $|\pi(F_i)|$ equals $m_i!$. Seq-min-perm also has some limitations. Despite the extra computation on the permutations, only one permutation will have an impact on the model update, which could be insufficient. Also, the facet ordering with the minimum loss may be random in the early stages of model training. This random selection introduces noise to the training process and could result in a decrease in performance.
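The selection of the minimum-loss permutation can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `toy_loss` is a hypothetical stand-in for the model's negative log-likelihood (here a character-level mismatch count against one arbitrarily preferred ordering), and all names are invented for the example.

```python
from itertools import permutations
from typing import Callable, List, Tuple


def min_perm_target(facets: List[str], loss_fn: Callable[[str], float]) -> Tuple[str, float]:
    """Enumerate every facet ordering and return the one with the smallest loss."""
    candidates = [", ".join(p) for p in permutations(facets)]
    losses = [loss_fn(c) for c in candidates]
    best = min(range(len(candidates)), key=losses.__getitem__)
    return candidates[best], losses[best]


PREFERRED = "the band, the city, the musical"


def toy_loss(target: str) -> float:
    """Hypothetical loss: number of character positions that differ from PREFERRED."""
    mismatches = sum(a != b for a, b in zip(target, PREFERRED))
    return float(mismatches + abs(len(target) - len(PREFERRED)))


best_target, best_loss = min_perm_target(["the city", "the band", "the musical"], toy_loss)
```

Only `best_target` would contribute to the update for this query, which is the limitation noted above: the other orderings are evaluated but discarded.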
For both seq-default and seq-min-perm, during inference, we greedily generate each word with the highest probability, expecting the model to automatically generate the separator ',' and termination symbol '</s>' to distinguish and end the generated facets. Seq-default is order-sensitive and seq-min-perm is permutation-invariant. They both conduct sequential facet prediction and adaptively terminate the facet generation process. Also, seq-min-perm has larger time complexity than seq-default.

New Objectives for Facet Generation
To better model the facet generation process, we propose another three training objectives. They are all permutation-invariant but have different characteristics regarding whether to sequentially predict facets based on previously generated ones and whether the count of generated facets is controllable. We will describe each of them in detail next.
Seq-Avg-Perm. It is a straightforward extension of seq-min-perm that sequentially generates facets and is trained with the average loss of the permutations of facets. We permute the concatenation of facets in order to let the model learn towards all the possible permutations of the ground-truth facets, which enhances its ability to search for the optimal solution. In particular, we employ the following training objective:
$$\theta^* = \arg\min_{\theta} \sum_i \frac{1}{|\pi(F_i)|} \sum_{y_i^* \in \pi(F_i)} L(h, y_i^*),$$
where $L(h, y_i^*)$, $h$ and $\pi(F_i)$ are the same as in seq-min-perm. The inference procedure is also the same as seq-default and seq-min-perm.
Seq-avg-perm has the same time complexity as seq-min-perm but instead uses all the permutations for model updates. It also conducts sequential facet prediction so that previously generated facets could guide the generation of the next facet away from them. Like seq-default and seq-min-perm, seq-avg-perm cannot generate more query facets once the termination symbol is generated.
Set-Pred. Instead of using the so-far predicted facets as contexts for the current facet generation, set-pred treats each facet as an individual target and conducts parallel predictions. For a given query $q_i$ and its related documents $D_i$, we concatenate them and obtain the input $x_i$. In contrast to the other methods, we use an independent facet $f_{ij}$ as the target output for the input $x_i$ and construct $m_i$ tuples $(x_i, f_{ij})$ in total for training, where $m_i$ is the number of facets that query $q_i$ has. So the optimization objective is:
$$\theta^* = \arg\min_{\theta} \sum_i \sum_{j=1}^{m_i} L(h, f_{ij}),$$
where $h$ is the output of the BART encoder and $L$ is again the negative log-likelihood of $f_{ij}$ given $h$. During inference, we generate multiple facets by taking the top $z$ most likely predictions using beam search. The time complexity of set-pred is much lower than seq-min-perm and seq-avg-perm while higher than seq-default. According to [29], vanilla beam search could output synonyms in the top predictions. Similarly, the facets generated by set-pred may be synonyms or refer to the same concept. To enhance the diversity of generated facets, we combine the ideas of seq-avg-perm and set-pred and propose the seq-set-pred objective.
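A minimal sketch of how the set-pred training tuples could be built is given below; the function name and the toy strings are assumptions for illustration. Each facet of a query becomes its own target for the same input, so no permutation needs to be enumerated.

```python
from typing import List, Tuple


def set_pred_examples(model_input: str, facets: List[str]) -> List[Tuple[str, str]]:
    """Pair the same input with each ground-truth facet as an independent target."""
    return [(model_input, facet) for facet in facets]


pairs = set_pred_examples(
    "internet explorer [SEP] snippet text",
    ["windows 10", "windows 7", "mac"],
)
# A query with m facets yields m (input, target) tuples.
```

At inference time, one way to obtain several candidates from a single input with the Hugging Face Transformers library is to call `generate` with `num_beams` and `num_return_sequences`; whether the authors used exactly this call is not stated in the text.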
Seq-Set-Pred. It predicts each facet of the remaining facet set in parallel based on arbitrarily generated facets as context. For a given query $q_i$, we divide the generation of facets into $|F_i|$ steps and generate one facet at each step. We concatenate the input $x_{t-1}$ with the output $y_{t-1}$ from the $(t-1)$-th step to form the $t$-th step's input $x_t = x_{t-1} \; [\mathrm{SEP}] \; y_{t-1}$. During training, similar to seq-avg-perm, to mitigate the influence of facet order, we train the model on the data that covers all the possible permutations. The organization of training data is depicted in Figure 1. This prevents the model from learning a specific generation ordering and enables it to make precise predictions under different permutation contexts effectively. We formalize this optimization objective as:
$$\theta^* = \arg\min_{\theta} \sum_i \frac{1}{|\pi(F_i)|} \sum_{y_i^* \in \pi(F_i)} \sum_{t=1}^{|F_i|} L(h_t, y_{i,t}^*),$$
where $y_{i,t}^*$ is the $t$-th facet of the permutation $y_i^*$ and $h_t$ is the output of the BART encoder for the $t$-th step's input $x_t$. The time complexity of seq-set-pred is the highest among the five approaches. It also conducts sequential facet prediction so that the previously predicted facets could help it avoid generating similar facets. Like set-pred, it is able to control the count of generated facets. Note that based on decoder-only language generation models such as the GPT series [4,23], seq-set-pred will be essentially the same as seq-avg-perm since we do not need to move the generated facets to the encoder.
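The step-wise training data for seq-set-pred could be organized as in the following sketch, which walks through every facet permutation and emits one (input, target) pair per step. This is an assumed reading of the description above (the original figure is not reproduced here); the function name, the separator string, and the choice to keep repeated pairs are all illustrative assumptions.

```python
from itertools import permutations
from typing import List, Tuple


def seq_set_pred_examples(model_input: str, facets: List[str],
                          sep: str = " [SEP] ") -> List[Tuple[str, str]]:
    """For each permutation, pair the input extended by the earlier facets with the next facet."""
    examples = []
    for perm in permutations(facets):
        context = model_input
        for facet in perm:
            examples.append((context, facet))
            # The facet just used as a target is moved to the encoder side for the next step.
            context = context + sep + facet
    return examples


step_examples = seq_set_pred_examples("q", ["a", "b"])
```

With two facets this produces four pairs: both facets as targets for the bare input, and each facet as the target when the other one has already been appended.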

Model Inference
For seq-avg-perm inference, we generate facets in the same way as seq-default and seq-min-perm. For set-pred and seq-set-pred, the model generates one facet per output and we set the number of facets manually. For set-pred, we generate facets in parallel and utilize a search algorithm to select the top few facets with the highest probability as the generated results, while for seq-set-pred, we sequentially generate the facets, appending each generated facet to the current input as the input of the next step, until the count of generated facets reaches the specified number.
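The sequential inference loop for seq-set-pred can be sketched as follows. The `generate_one` callable is a hypothetical stand-in for a single decoding pass of the trained model, and `toy_generate` merely simulates it with a fixed candidate list; neither is part of the original method description.

```python
from typing import Callable, List


def seq_set_pred_inference(model_input: str, generate_one: Callable[[str], str],
                           num_facets: int, sep: str = " [SEP] ") -> List[str]:
    """Generate one facet per step, appending each result to the input of the next step."""
    facets = []
    current = model_input
    for _ in range(num_facets):
        facet = generate_one(current)
        facets.append(facet)
        current = current + sep + facet
    return facets


def toy_generate(current_input: str) -> str:
    """Hypothetical model: return the first candidate that is not yet part of the input."""
    for candidate in ["windows 10", "windows 7", "mac"]:
        if candidate not in current_input:
            return candidate
    return "other"


generated = seq_set_pred_inference("internet explorer", toy_generate, 3)
```

The loop stops because of the requested count rather than a termination symbol, which is what makes the number of facets controllable.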

Stochastic Optimization
Optimizing towards all permutations of query facets is computationally challenging. Therefore, as in [11], we adopt stochastic variants of seq-min-perm, seq-avg-perm, and seq-set-pred. In other words, instead of taking all permutations into consideration, we just sample a certain number of permutations from all the possible ones for each query and compute the approximate losses. We dynamically sample the permutations during each training epoch to make sure that the model can see as many different permutations as possible.
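A minimal sketch of the stochastic approximation is shown below, assuming that all permutations are used when fewer exist than requested (the text does not say how this case is handled). The function names and `toy_loss`, a hypothetical stand-in for the model's negative log-likelihood, are illustrative.

```python
import random
from itertools import permutations
from typing import Callable, List, Tuple


def sample_permutations(facets: List[str], num_samples: int,
                        rng: random.Random) -> List[Tuple[str, ...]]:
    """Sample distinct facet orderings; fall back to all of them when there are too few."""
    all_perms = list(permutations(facets))
    if len(all_perms) <= num_samples:
        return all_perms
    return rng.sample(all_perms, num_samples)


def approximate_losses(facets: List[str], loss_fn: Callable[[str], float],
                       num_samples: int, rng: random.Random) -> Tuple[float, float]:
    """Return the (minimum, average) loss over the sampled permutations only."""
    losses = [loss_fn(", ".join(p)) for p in sample_permutations(facets, num_samples, rng)]
    return min(losses), sum(losses) / len(losses)


def toy_loss(target: str) -> float:
    """Hypothetical loss: position of the facet 'a' in the target string."""
    return float(target.index("a"))


sampled = sample_permutations(["a", "b", "c", "d"], 6, random.Random(0))
approx_min, approx_avg = approximate_losses(["a", "b", "c", "d"], toy_loss, 6, random.Random(0))
```

Calling the sampler again in each epoch with a fresh random state gives the dynamic resampling described above.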

EXPERIMENTAL SETUP
This section introduces the data we use for training and evaluation, the metrics we use to evaluate the models, and the technical details of the experiments.

Dataset
Following [10,26], our experiments are based on the MIMICS dataset [33]. MIMICS is a collection of search clarification datasets for real search queries sampled from the Bing query logs and it contains three subsets: MIMICS-Click, MIMICS-ClickExplore and MIMICS-Manual. For each search query, it provides up to 5 ground-truth facets and at most 10 associated documents with information such as document snippets. We use MIMICS-Click, which includes over 400K unique queries, for training and MIMICS-Manual, which contains 2832 queries, for evaluation. For all the training objectives, we use the document snippets provided by MIMICS as our document text.

Evaluation Metrics
We evaluate our approach in terms of two aspects: the matching between the generated facets and the ground-truth facets and the diversity among the generated facets. On the one hand, to evaluate our approach in terms of matching with ground truth, we follow Samarinas et al. [26] and adopt four sets of evaluation metrics. (1) Precision, recall, and F1 of term overlap metrics: These metrics are computed based on the matching between the set of generated facet terms and the set of ground-truth facet terms at the term level. (2) Exact match: These metrics also compute precision, recall, and F1 between the facets generated by the model and the ground-truth facets, but at the facet level. (3) Set BLEU score: It calculates the BLEU [22] scores between the best permutation of generated facets and the ground-truth facets. (4) Set BERT-Score: It calculates the BERTScore [34] between the best permutation of generated facets and the ground-truth facets. For more details of the metrics, please refer to [10].
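To make the difference between the term-level and facet-level matching concrete, the following sketch computes precision, recall, and F1 on sets of terms and on sets of whole facets. It assumes lowercased whitespace tokenization and a single query; the official evaluation in [26] may differ in preprocessing and aggregation, and the function names are invented for the example.

```python
from typing import Iterable, Set, Tuple


def precision_recall_f1(generated: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 between generated and ground-truth items."""
    matched = len(generated & truth)
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def to_terms(facets: Iterable[str]) -> Set[str]:
    """Collect the lowercased whitespace-separated terms of all facets (assumed tokenization)."""
    return {term for facet in facets for term in facet.lower().split()}


generated_facets = ["for windows 10", "for windows 7"]
truth_facets = ["windows 10", "windows 7"]

# Term level: the extra term "for" lowers precision to 0.75 while recall stays 1.0.
term_scores = precision_recall_f1(to_terms(generated_facets), to_terms(truth_facets))
# Facet level: no generated facet equals a ground-truth facet, so all three scores are 0.
exact_scores = precision_recall_f1(set(generated_facets), set(truth_facets))
```

The example shows why a method can score well on term overlap yet poorly on exact match when it adds connecting words to otherwise correct facets.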
On the other hand, we propose two extra metrics to measure the diversity of a set of facets. (1) Term diversity: For a given set $F = \{f_1, f_2, \cdots, f_n\}$ where $f_i$ means the $i$-th facet in $F$, we calculate the term-level diversity of $F$ as the average of one minus the overlap ratio between each pair of facets in $F$, where the overlap ratio between $f_i$ and $f_j$ is computed from the terms shared by the two facets. (2) BERTScore diversity: For the set $F$, we calculate the average BERTScore between every pair of facets in the set, where BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence and the token similarity is computed using contextual embeddings. We use one minus this average as the BERTScore diversity.
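The term diversity computation can be sketched as below. The exact overlap ratio is not recoverable from the text here, so this sketch assumes the Jaccard ratio (shared terms divided by the union of terms) purely for illustration; the value returned for fewer than two facets is likewise an assumption.

```python
from itertools import combinations
from typing import List


def overlap_ratio(facet_a: str, facet_b: str) -> float:
    """Assumed overlap ratio: Jaccard similarity between the two facets' term sets."""
    terms_a, terms_b = set(facet_a.lower().split()), set(facet_b.lower().split())
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def term_diversity(facets: List[str]) -> float:
    """Average of (1 - overlap ratio) over all facet pairs; 0.0 if no pair exists (assumption)."""
    pairs = list(combinations(facets, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - overlap_ratio(a, b) for a, b in pairs) / len(pairs)


with_prepositions = term_diversity(["for windows 10", "for windows 7"])
without_prepositions = term_diversity(["windows 10", "windows 7"])
```

Under this assumed ratio, the pair with the shared preposition scores 0.5 while the pair without it scores 2/3, which illustrates how repeated connecting words reduce term diversity.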

Implementation Details
We fine-tuned BART-base for five epochs with an initial learning rate set to $5 \times 10^{-5}$ for all the following approaches as in [26] and employed the beam search algorithm with a beam size of 5. If not specifically mentioned, we use the AdamW optimizer and set the maximum sequence length to 512 tokens, the maximum output length to 32 tokens, and the batch size to 16. The count of generated facets for seq-default, seq-min-perm, and seq-avg-perm is determined by the models themselves, while we specifically set the number of generated facets to 3 for set-pred and seq-set-pred, as it yields the best results on the validation set. Additionally, practical considerations led us to perform deduplication on all the generated results for each method. It is worth noting that there was only a minor difference between the results before and after deduplication. For seq-default, we utilized the checkpoint provided by Samarinas et al. [26].
For the seq-avg-perm and seq-min-perm techniques, we also fine-tuned the BART-base model with a maximum sequence length of 512 tokens. However, we increased the maximum output sequence length to 128 tokens so that the model can generate multiple facets within a single sequence. During each training epoch, we sampled six permutations for each query, and the batch size for seq-min-perm was set to 18, enabling it to handle all sampled permutations of three queries within one batch.
Regarding seq-set-pred, we set the maximum input sequence length to 640 tokens. This ensured that the facets appended to the input would not be truncated. We sampled 6, 8, 9, 11, and 13 permutations when the count of ground-truth facets is 1, 2, 3, 4, and 5 respectively, so that the model sees almost the same number of facets as in the seq-avg-perm approach during each training epoch.

RESULTS AND DISCUSSION
Next, we show the experimental results of the five training objectives. First, we evaluate the accuracy of the generated facets against the ground-truth facets. Then, we measure their diversity with our proposed diversity metrics. We also show the model performance when different counts of facets are generated and compare the model performance using a similar amount of training data. Finally, we conduct case analyses to show the quality of the facets generated by each method.

Evaluation Against Ground Truth
Table 2 shows the matching degree between the generated facets and ground-truth facets for all the methods. We have the following observations: 1) Most permutation-invariant methods except seq-min-perm perform better than the order-sensitive method. This is consistent with our presumption that seq-default was hurt when forced to generate a specific facet ordering that is only one of the possible permutations. However, seq-min-perm is the exception and performs the worst most of the time. There are some potential reasons for its unsatisfactory performance. If the model consistently selected the same sequence for a query throughout the entire training phase, we would expect a performance similar to that of seq-default. However, in the early stages of training, the selection of the sequence with the minimum loss exhibits randomness. This random selection introduces noise to the training process, resulting in a decrease in performance. So the effectiveness of generated facets is even worse than seq-default.
2) The method that generates facets without depending on the previously generated facets (i.e., set-pred) has compelling performance in terms of both term-based and semantic matching metrics. Previous studies do not consider this way of training so this has not been observed. Its diversity, however, is lower than the others, which we will show in Section 5.2. 3) Methods that only learn facet prediction given context (e.g., seq-set-pred) have better semantic matching performance with ground truth compared to those that also learn when to stop generating facets.

Evaluation on Diversity
Table 3 shows the results of diversity evaluation on the generated facets with the query words removed. It indicates that seq-avg-perm performs the best on both metrics, suggesting there are not only fewer repeated terms between the generated facets but also a higher level of semantic differentiation. It performs worse than the ground-truth facets regarding term diversity but better in terms of semantic diversity. By checking some of its generated facets, we find that seq-avg-perm usually generates terms such as prepositions to maintain correct grammatical structures while such connection words are fewer in the ground truth, and this phenomenon also exists in other methods. For example, given the query "internet explorer", seq-avg-perm generates facets "for windows 10" and "for windows 7" while the corresponding ground-truth facets are actually "windows 10" and "windows 7" respectively (shown in Table 6). This leads to a decrease regarding term diversity for seq-avg-perm. The higher semantic diversity scores of seq-avg-perm compared to the ground truth indicate that its generated facets are more semantically different. It is worth noting that all the sequential prediction methods, except for seq-default, exhibit higher diversity than set-pred. It is not surprising to observe worse diversity in set-pred because it could keep generating similar facets since it only refers to the original query and top documents during generation. The possible reason why seq-default performs poorly regarding term diversity is that it is trained towards only one possible permutation of the facets, which could constrain the output term space and in turn result in lower term diversity. We observe good term diversity but worse semantic diversity in seq-set-pred. Multiple target ground-truth facets could be helpful for it to obtain higher term diversity. However, when we move the previously generated facets to the encoder and predict the rest with the decoder, the encoder struggles to learn that the target facet should be similar to the original query while different from the previously generated facets. In contrast, for the sequential prediction methods (i.e., seq-default, seq-min-perm, seq-avg-perm), the encoder only encodes the query and captures its relevance with the target facet, and it is feasible for the decoder to capture the difference between the target facet and the previously generated facets. This could be the reason for the much lower semantic diversity of seq-set-pred compared to the sequential prediction counterparts.
Table 4 shows the performance when generating different numbers of facets using the facet-count-controllable methods set-pred and seq-set-pred. We also compute the ratio of the generated facet counts matching the ground truth for each method and show the ratios in Table 5. Table 4 shows that the evaluation scores constantly change with different facet counts. The best matching scores are mainly achieved when generating 2 or 3 facets because the average number of ground-truth facets is between 2 and 3 (shown in Table 5). The ratio is the average, across all the queries, of one minus a penalty on the gap between the count of generated facets and $m_i$, where $m_i$ is the count of ground-truth facets for $q_i$. The ratio matching with the ground-truth facet counts is 0.7039 and 0.7431 when generating 3 and 2 facets respectively. Although seq-default and seq-min-perm match the ground-truth facet count more often than seq-avg-perm, they have worse term-level or semantic-level matching scores with the ground-truth facets. It indicates that they do generate worse facet contents. Set-pred and seq-set-pred have similar accuracy in terms of facet counts to seq-default and seq-min-perm but have better facet contents as well. Compared to seq-avg-perm, facet-count-controllable methods perform better on set BLEU score and set BERT-Score. The possible reason is that these two metrics are more sensitive to the number of generated facets. When calculating these two metrics, facets that do not match the ground-truth facets in the count will receive a score of 0, and the mismatch between the count of generated facets and the count of ground-truth facets results in worse matching scores. However, this phenomenon does not exist when calculating BERT-score diversity, because each generated facet always has other generated facets to be compared with.
When it comes to diversity, the term diversity of seq-set-pred decreases and its BERT-score diversity increases as the number of generated facets grows. The results of set-pred indicate that this approach performs well when generating a smaller number of facets. However, when generating more facets, it may produce more repeated tokens. In conclusion, seq-set-pred demonstrates better diversity than set-pred across all the numbers of generated facets. We also visualize the count of facets generated by the adaptive generation methods in Figure 2. It shows that seq-default and seq-min-perm tend to generate two facets, while seq-avg-perm generates various numbers of facets.
Table 6: Some examples of the facets generated by each model. We selected the top-5 facets for Set-Pred and Seq-Set-Pred. The duplicate facets were removed from all the models. Facets are separated using the ',' symbol.

Impact of Training Data Amount
Because the quantity of information used per training epoch differs significantly across methods, we also compare the methods when the amount of training data is at a similar scale. Specifically, we train the methods that learn towards all the facet permutations (seq-avg-perm and seq-set-pred) for only 1 epoch, so that their overall training data amount is similar to that of the other methods. As shown in Table 7, their results regress to different extents, since it takes longer for model training to converge under the permutation-invariant setting. However, both seq-avg-perm and seq-set-pred still outperform seq-default and seq-min-perm in terms of all the metrics.
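To make the difference in per-epoch training signal concrete, the following minimal sketch counts how many target orderings contribute to a model update for a single query. It is an illustration under stated assumptions, not the authors' data pipeline: the function name `orderings_per_query` is hypothetical, and the exact instance count for seq-set-pred depends on its data organization (Figure 1), which is not modeled here.

```python
from itertools import permutations
from math import factorial


def orderings_per_query(num_facets: int) -> dict:
    """Count target orderings that contribute to an update for one query.

    Assumption: a fixed-order objective learns from one sequence, while an
    objective that learns towards all permutations learns from n! sequences.
    """
    return {
        "fixed_order": 1,
        "all_permutations": factorial(num_facets),
    }


# Example facets taken from the "Chicago" query in the introduction.
facets = ["the musical Chicago", "the band Chicago", "the city Chicago"]
all_orderings = list(permutations(facets))

# Enumerating the orderings agrees with the closed-form count.
assert len(all_orderings) == orderings_per_query(len(facets))["all_permutations"]

for n in range(1, 6):
    print(n, orderings_per_query(n))
```

The factorial growth is why a single epoch over all permutations can already expose the model to an amount of training data comparable to several epochs of a fixed-order objective.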

Facet Generation with ChatGPT
Recently, Large Language Models (LLMs), such as ChatGPT, have demonstrated remarkable capabilities across various tasks. One may therefore wonder how such LLMs perform on the facet generation task. In this regard, we assess the ability of ChatGPT in facet generation and compare it to the methods investigated in this paper. We randomly sampled 50 test examples from the MIMICS-Manual dataset and asked ChatGPT to generate facets using the instruction in Figure 3. The results are displayed in Table 8. They show that ChatGPT trails most of the methods presented in our paper by a large margin across all metrics. However, despite ChatGPT's poor performance on the metrics, we still believe that its generated results are reasonable. For example, for the query "Express vp", the facets generated by ChatGPT are "virtual private network", "privacy and security", "content access", "server network", and "pricing", while the ground-truth facets are "expressvpn mac", "expressvpn android", "expressvpn windows", "expressvpn linux", and "expressvpn ios". This finding is consistent with the results in [26]. In sum, the facets generated by ChatGPT are general concepts related to the query but do not match the facets manually labeled according to the provided document snippets.
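The term-level mismatch in the "Express vp" example can be checked with a short sketch. It uses a simplified, set-based token overlap that is only an assumption for illustration. It is not the exact term-overlap metric used in the evaluation, and the helper names `tokens` and `term_overlap` are hypothetical.

```python
def tokens(facets):
    """Pool lowercased whitespace tokens over a list of facet strings."""
    return {tok for facet in facets for tok in facet.lower().split()}


def term_overlap(generated, ground_truth):
    """Return (precision, recall, F1) over the pooled token sets."""
    gen, gold = tokens(generated), tokens(ground_truth)
    common = gen & gold
    precision = len(common) / len(gen) if gen else 0.0
    recall = len(common) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Facets quoted in the text for the query "Express vp".
chatgpt_facets = [
    "virtual private network",
    "privacy and security",
    "content access",
    "server network",
    "pricing",
]
gold_facets = [
    "expressvpn mac",
    "expressvpn android",
    "expressvpn windows",
    "expressvpn linux",
    "expressvpn ios",
]

print(term_overlap(chatgpt_facets, gold_facets))  # (0.0, 0.0, 0.0)
```

No token is shared between the two facet lists, so any purely lexical matching score is zero for this query even though the generated facets are topically sensible.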

Case Study
We show the generated results for three queries with varying numbers of ground-truth facets in Table 6. We can observe that seq-default and seq-min-perm tend to generate two facets for a given query, which aligns with the pattern shown in Figure 2. Many facets generated by seq-default match the terms in the ground truth, resulting in higher precision in term overlap, but because few facets are generated, the recall is not very high. Seq-min-perm generates facets that are somewhat related to the query but fails to generate the desired facets, leading to lower scores. These findings are consistent with the results in Table 2. The remaining three methods, set-pred, seq-avg-perm, and seq-set-pred, demonstrate good generation performance. In the facet-count-controllable methods, we select the top three facets as the generated results. However, we find that the ignored facets may be better. Therefore, we can generate more facets and choose the best facets instead of directly selecting the first three based on the similarity between the facets and the query. We will investigate this in the future.
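The post-processing described above (splitting on ',', removing duplicates, keeping the top facets) and the proposed over-generate-then-select idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: `dedupe` and `select_facets` are hypothetical helpers, and the scoring function is a pluggable placeholder because the text leaves the selection criterion to future work.

```python
from typing import Callable, List, Optional


def dedupe(facets: List[str]) -> List[str]:
    """Drop empty and repeated facets (case-insensitive), keeping first-seen order."""
    seen, unique = set(), []
    for facet in facets:
        cleaned = facet.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def select_facets(
    candidates: List[str],
    k: int = 3,
    score: Optional[Callable[[str], float]] = None,
) -> List[str]:
    """Keep k facets: the first k by default, or the k best under `score`."""
    unique = dedupe(candidates)
    if score is None:
        return unique[:k]
    # sorted() is stable, so ties keep their generation order.
    return sorted(unique, key=score, reverse=True)[:k]


# A made-up model output for illustration; facets are separated by ','.
raw_output = (
    "chicago the musical, chicago the band, chicago the musical, "
    "chicago the city, chicago the movie"
)
candidates = raw_output.split(",")

print(select_facets(candidates, k=3))
```

Passing a `score` callable (for example, one based on a learned relevance model) switches the sketch from "first three" to "best three among a larger candidate pool".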

CONCLUSIONS AND FUTURE WORK
In this paper, we conducted a systematic comparative study of various types of training objectives that differ in whether they are permutation-invariant, whether they conduct sequential prediction, and whether they can control the count of output facets. For comprehensive comparisons, besides the commonly used evaluation that measures the matching with ground-truth facets, we also introduced two diversity metrics. Our experimental analyses demonstrate the following: appropriate permutation-invariant objectives can help generate better facets; facet prediction that is only based on the query and top-retrieved documents (i.e., Set-Pred) achieves compelling performance in terms of the metrics measuring matching with the ground truth but has the worst diversity performance; methods that only learn facet prediction given context (i.e., Seq-Set-Pred) have better semantic matching metrics with the ground truth but worse diversity performance than their counterpart that also learns when to stop generating facets. Our newly proposed methods outperform the previous state-of-the-art models [26].
For future work, we plan to evaluate these objectives on a decoder-only architecture such as GPT-2 [23] as the next step. As we mentioned in Section 3.3, seq-avg-perm and seq-set-pred will be essentially the same on decoder-only models. Another interesting research direction is to utilize a small amount of data to learn how to predict user intents. Although MIMICS has enough annotations for fine-tuning, in reality it is common to have limited annotated data. Thus, we plan to study intent prediction with few-shot learning.

Figure 1: The organization of training data for seq-set-pred

Figure 2: Distribution of the number of facets generated by different models.

Figure 3: The prompt used for ChatGPT

Table 1: Comparison of characteristics among different objectives. ✓ means true and × means false. $|q|$ is the average length for all the queries, $|D|$ is the average length for all the concatenations of documents, $n$ is the average number of facets for each query and $|f|$ is the average length for all the facets. $A_n^m$ means selecting $m$ elements without repetition from $n$ elements, considering the ordering.
(, ) is the negative log likelihood of the generation probability for  given .A  ( ) means all the possible orderings of  selected non-repetitive elements from  .For example, A 2 ({  ,   ,   }) = {(  ,   ), (  ,   ), (  ,   ), (  ,   ), (  ,   ), (  ,   )}.   is the remaining set consisting of facets that are in   but not in the k-th set in A  (  )  .   is the output of BART encoder given the input    , i.e., the concatenation of   ,   and facets in A  (  )  .  ℎ is the h-th ground truth facet chosen from    .We get the average loss of all the possible permutations for query   by dividing

Table 3: Diversity evaluation on the facet body, which removes words from the facets that appear in the query.

Table 4: Matching scores and diversity for different numbers of generated facets. Div is short for diversity.

Table 5: The proportion of generated facets that match the ground-truth facets.

Table 7: Matching scores of different methods based on a similar training data amount. The superscript + denotes significant improvements compared to the worst one, which is underscored, and − denotes significant decreases compared to the best one, which is in bold, in terms of a two-tailed paired t-test with Bonferroni correction at the 99% confidence level.

Table 8: Comparison between ChatGPT and our studied methods on 50 random test samples in MIMICS-Manual. The worst performance is underscored and the best one is in bold.