Leveraging Event Schema to Ask Clarifying Questions for Conversational Legal Case Retrieval

Legal case retrieval is a special IR task aiming to retrieve supporting cases for a given query case. Existing works have shown that the conversational search paradigm can improve users' search experience in legal case retrieval. One of the keys to a practical conversational search system is how to ask high-quality clarifying questions to initiate conversations with users and understand their search intents. Recently, Large Language Models (LLMs), such as ChatGPT and GPT-4, have shown superior ability in both open-domain QA and conversations with humans. Thus, it is natural to believe that they could be applied to legal conversational search as well. However, our preliminary study has shown that generating clarifying questions in legal conversational search with SOTA LLMs (e.g., GPT-4) often suffers from problems such as duplication and low-utility content. To address these problems, we propose LeClari, which leverages a legal event schema as external knowledge to instruct LLMs to generate effective clarifying questions for legal conversational search. LeClari is constructed with a prompt module and a novel legal event selection module. The former defines a prompt with legal events for clarifying question generation, and the latter selects potential event types by modeling the relationships among legal event types, the conversational context, and candidate cases. We also propose ranking-oriented rewards and employ the reward augmented maximum likelihood (RAML) method to optimize LeClari directly based on the final retrieval performance of the conversational legal search system. Empirical results on two widely adopted legal case retrieval datasets demonstrate the effectiveness of our approach compared with state-of-the-art baselines.


INTRODUCTION
In recent years, legal case retrieval has attracted much attention in the IR research community. It aims to retrieve supporting cases for a given query case and constitutes an essential component of a legal information system. In practice, prior cases are primary legal materials in various law systems. When using a conventional legal case retrieval system, users need to issue queries to express their information needs [8,21], which can be complex and difficult to verbalize [31]. Conversational search is a rising topic in IR [26], and it can help users better express their information needs [4,26], especially for complex search tasks [3,32]. In legal case retrieval, studies have shown that the conversational search paradigm can improve the search process from a variety of perspectives, including but not limited to query formulation, user satisfaction, and search success [15,17,18].
One of the key research problems in conversational search systems is how to ask good clarifying questions based on the conversation context, so that we can better understand user intents and guide future conversations based on the user's answers. Recently, revolutionary Large Language Model (LLM) techniques, such as ChatGPT [24] and GPT-4 [25], have shown strong zero-shot and few-shot generalization ability in many natural language processing tasks. Intuitively, it seems natural to apply LLMs to generate clarifying questions for legal conversational search, e.g., using simple prompts such as "Based on the above conversation, please ask a clarifying question to further understand the background information of the legal case." However, our preliminary study on state-of-the-art LLMs (e.g., ChatGPT and GPT-4) has revealed several problems that limit their performance in asking high-quality clarifying questions for legal conversational search. First, they sometimes generate clarifying questions that focus on facts already presented in the previous context, which provide little or no additional information to the conversation. Second, existing LLMs often ask general questions that are not relevant from a legal perspective and thus provide limited benefits for the performance of downstream legal retrieval models. Because LLMs are usually built with open-domain data and are not trained specifically for clarifying question generation, they do not know what to ask and how to ask effective questions in legal case retrieval.
Inspired by recent studies on constrained question generation [36] and conversational product search [6,43,45], we propose LeClari, a conversational search model that generates high-quality clarifying questions for conversational legal case retrieval. LeClari is constructed with a prompt module and an event selection module. The event selection module iteratively selects event types from a legal event schema to guide LLMs to ask clarifying questions through the prompt module. The legal event schema can be considered a special kind of legal database that contains multiple types of legal events with their descriptions. Here we leverage the existing legal event schema LEVEN [38] for prompt construction in LeClari. LEVEN reasonably divides the key facts in criminal cases into 108 event types and can be utilized as external knowledge to promote downstream legal applications. By selecting event types from LEVEN, LeClari can ask questions that effectively narrow down the search space of downstream retrieval models and thus improve the performance of the whole system.
However, adapting existing conversational models to the selection of legal event types is suboptimal because they mostly ignore the connections between event type selection and downstream retrieval tasks. To this end, LeClari explicitly models the relationships among legal event types, the conversational context, and the potential candidate cases retrieved by downstream legal case retrieval models in its event selection module. Further, we propose ranking-oriented rewards and employ the reward augmented maximum likelihood (RAML) method [22] to optimize LeClari directly for downstream retrieval metrics such as MAP and NDCG.
We conduct empirical experiments on two widely adopted legal case retrieval datasets, the LeCaRD [20] dataset and the CAIL2022-LCR dataset. For evaluation, we compare with several other event type selection strategies in conversational models to verify the effectiveness of our model. Empirical results demonstrate that our model can select appropriate event types for LLMs to construct useful clarifying questions, and it significantly improves legal case retrieval performance over all baselines on the two datasets.

RELATED WORK

Legal Case Retrieval
Legal case retrieval is a specialized IR task [33,34,39]. Several approaches have been explored in previous legal IR research, including knowledge engineering-based techniques and NLP-based methods [7]. For instance, [27] combined symbolic and connectionist artificial intelligence techniques to integrate both symbolic and sub-symbolic information in the legal domain. [28] developed a legal knowledge-based framework to overcome synonymy and ambivalence of words in the query process and enhance the user's query for retrieving truly relevant legal judgments. However, these legal case retrieval systems still followed a traditional search paradigm in which users issue keyword-based queries to describe their information needs [8,21]. With the rapid development of deep learning, applying pre-trained language models (PLMs) to legal case retrieval has achieved great success. [30] proposed a BERT-based neural network to model paragraph-level interactions for legal case retrieval, and [19,29,30] showed that BERT-based neural networks significantly improve performance on the legal case retrieval task. In recent years, knowledge transfer has become a popular research topic [5,14,16]. In addition to the original PLMs pre-trained on multiple resources, [10-13] demonstrated that domain-adaptive pretraining improves the performance of PLMs on domain-specific tasks. For example, [42] and [37] used large legal corpora to pre-train BERT and Longformer, respectively, and both models outperform their precedent PLMs on legal tasks.

Learning to Ask Clarifying Questions
Clarifying questions are an important means to improve conversational search systems and have attracted much attention in the IR research community [1,40]. Recently, LLMs [24,25] have been revolutionizing natural language processing and have great potential for clarifying question generation. As LLMs are not specifically trained for this task, researchers have relied on constrained prompts to elicit their zero-shot generalization ability. In open-domain conversational search, [36] proposed to constrain the clarifying question decoding with search facets to solve the cold-start problem. In conversational product search, researchers have focused on asking clarifying questions based on a pre-defined set of product aspects and have proposed a series of learning-to-ask strategies. [44] learned to ask a good question based on user preferences and rewards over question performance. [41] predicted the next question to ask the user by maximizing its probability under a softmax output layer. And [43] proposed a set of systematic learning-to-ask strategies, including both greedy (GBS) and explore-exploit (bandit learning) strategies. Compared to these studies, we propose LeClari, which selects event types from a legal event schema to generate clarifying questions and improve legal case retrieval systems.

PRELIMINARY STUDY
In this section, we conduct a preliminary study to investigate the clarifying question generation performance of LLMs in legal case retrieval. We aim to answer the following research questions:
• RQ1: What are the qualities of the clarifying questions generated by LLMs in legal scenarios, independent of the search system?
• RQ2: Can the clarifying questions generated by LLMs provide benefits for the performance of downstream legal retrieval models?

Conversation Construction
To address the above two research questions, we utilize two widely adopted legal case retrieval datasets, LeCaRD [20] and CAIL2022-LCR, to construct the conversations and evaluate the clarifying questions. LeCaRD is the first criminal case retrieval dataset under the Chinese law system. The Challenge of AI in Law (CAIL) is a competition held annually since 2018 under the guidance of the Supreme People's Court and the Chinese Information Processing Society of China to promote AI technology and a higher level of digital justice. As one of the eight tasks in CAIL 2022, the legal case retrieval task provides a dataset named CAIL2022-LCR. These two datasets contain several query cases (i.e., the complete fact description parts of case documents), and each query case corresponds to a candidate case pool of size 100. The task is to select relevant cases from the candidate case pool for each query case. Every candidate case has a four-level relevance label annotated by criminal law experts. To analyze the LLMs' ability to generate legal clarifying questions, we construct conversations based on the two datasets following the steps below.
(1) Initial query construction. We invited a PhD student majoring in criminal law to select 1-2 sentences from each query case and rewrite them into coherent statements as the initial query. Note that we focus on situations where the search system needs to ask clarifying questions to further understand the background information of the query case. Therefore, we asked the PhD student to leave out some important information from the query case when formulating the initial query and to mark it in the query case. We hope the clarifying questions can help the search system recover this information. We use the initial query as the start of the conversation.
(2) Clarifying question generation. We then apply LLMs, including ChatGPT and GPT-4, to generate a clarifying question as the system reply based on the contextual conversation information. Specifically, we incorporate the conversation into a prompt and feed it into the LLMs to generate the clarifying question. Based on previous user study results [18], the average number of clarifying questions is 4 in the scenario of conversational legal case retrieval. Therefore, for each query case, we generate four clarifying questions to complete the conversation construction process.
In addition, to ensure the quality of the user simulator, we randomly selected 20 conversations from each of the two datasets (a total of 40 conversations, with 160 clarifying questions and answers) and invited another three graduate students majoring in criminal law to annotate the answers to each clarifying question (1 point: correct; 0 points: incorrect). The Fleiss's κ among the three assessors was 0.913, indicating almost perfect agreement [9]. Disagreements were resolved by majority vote. We find that only 3 answers are annotated as incorrect, which shows that the user responses generated by the LLM are convincing.
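For reference, Fleiss's κ over a fixed panel of raters can be computed with a short, pure-Python routine (a sketch of the standard statistic, not the authors' code; the example data below are illustrative):

```python
from collections import Counter

def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss's kappa: `ratings` is a list of items, each a list of
    labels given by the same fixed number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]  # per-item label counts
    # Per-item observed agreement P_i.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items  # mean observed agreement
    # Chance agreement from the marginal label proportions.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters)
             for k in categories]
    p_e = sum(p ** 2 for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Illustrative data: 4 items, 3 raters each.
kappa = fleiss_kappa([[1, 1, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1]])
```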

Clarifying Question Quality Analysis
To address RQ1, we evaluate the quality of the clarifying questions generated by the LLMs from multiple aspects, independent of the search system. Specifically, the three graduate students labeled each clarifying question in the constructed conversations according to topic relevance, answerability, and information gain. These denote whether the clarifying question is relevant to the initial query, can be answered based on the query case, and provides additional information, respectively. Each labeling task asks the annotators to assign a binary label to the clarifying question (1 point: relevant/answerable/provides additional information; 0 points: irrelevant/unanswerable/provides no additional information). The Fleiss's κ scores of the three tasks were 0.863, 0.834, and 0.818, respectively, indicating almost perfect agreement. We obtain the final labels by majority voting.
We calculate the average quality scores for these three aspects, respectively. The results are shown in Table 1. We find that almost all clarifying questions are relevant to the initial query. However, more than 30% of the clarifying questions cannot be answered, indicating that the corresponding information is not mentioned in the query case. In addition, only 35-40% of the clarifying questions help the search system obtain additional information. This means that for 20-30% of the clarifying questions, although their answers can be found in the query case, the content they ask about has already been presented in the existing conversation, and the LLMs fail to realize this.

Retrieval Performance Analysis
To address RQ2, we investigate whether the clarifying questions generated by LLMs obtain useful information that improves retrieval performance in legal scenarios. Specifically, we first fine-tune two legal pre-trained language models, BERT-Crime [42] and LawFormer [37], on a conversational legal case retrieval dataset [18] to enhance their conversational search abilities. We train them with a pair-wise loss by feeding in the concatenation of the conversation and the candidate case document. We expect them to find relevant cases when the information is sufficient, so that we can determine from retrieval performance whether the clarifying questions have obtained useful information. Then we utilize them to compute the relevance scores between the constructed conversations and candidate cases. Finally, we compare the retrieval performance of conversations without clarifying questions (i.e., only the initial query) against conversations with clarifying questions generated by LLMs.
We use NDCG@10 as the retrieval metric, and the results are shown in Table 2. None of the differences are significant at the 0.05 level under a two-tailed pairwise t-test. We find that the clarifying questions proposed by LLMs (the "ChatGPT" and "GPT-4" groups) do not significantly improve legal case retrieval performance, even though the retrieval metrics of the "w/o clarify" group (no clarifying questions) are slightly lower. This indicates that although some clarifying questions from LLMs obtain additional background information, they do not help improve conversational legal case retrieval performance.

Summary
Regarding the two research questions, we find two disadvantages of legal clarifying questions generated directly by LLMs: (1) As for RQ1, although almost all the clarifying questions are relevant to the search task, more than half of them are unable to obtain additional information. In particular, some of them focus on facts that have already been presented in the previous context, which provide little or no additional information to the conversation. (2) As for RQ2, many of them are not relevant from a legal perspective and thus provide limited benefits for the performance of downstream legal retrieval models.

LECLARI
In this section, we present LeClari, a conversational search model that generates high-quality legal clarifying questions (ref. Figure 1). We first introduce the Prompt Module (PM), which uses legal event types to generate clarifying questions with LLMs, and the workflow for interacting with users. Then we present the Event Selection Module, which selects appropriate event types for the Prompt Module, and the model training strategy.

Prompt Module
To overcome the two disadvantages mentioned in Section 3 and generate useful clarifying questions for legal case retrieval, we propose the Prompt Module, which leverages the legal event schema LEVEN [38] for LLM prompt construction. LEVEN is constructed from law articles, legal textbooks, and case documents. It can be considered a special kind of legal database that groups the facts of criminal law into 108 event types. The event types can be divided into 6 categories, with examples shown in Table 3.
The first three categories are related to various human behaviors. The fourth and fifth are related to results, and the last one is related to force majeure. Each event type has a textual description, such as "Escaping: Fleeing and hiding to avoid unfavorable circumstances." Therefore, the event types can be added as constraints to the LLM prompts for generating clarifying questions. The constrained questions help the search system learn the detailed facts in the background information related to the event type. On the one hand, we can enforce different event types as constraints in different rounds, which forces LLMs to generate different clarifying questions and avoids asking about the same content. On the other hand, [38] found that legal case retrieval performance could be improved by focusing on the facts related to LEVEN event types in the query case. Intuitively, selecting appropriate event types from LEVEN can offset the disadvantages mentioned in Section 3. So at each round, the Prompt Module selects an event type and incorporates it into a pre-defined LLM prompt, where the description is also used to explain the event type. For example, when the user issues an initial query "Alice drove a car at night and crashed into Bob", we can select the "Escaping" event type to construct the following prompt:
Prompt: You are now a knowledgeable judge in law. The current conversation between you and the user is as follows: User: Alice drove a car at night and crashed into Bob.
ChatGPT can then generate the following clarifying question based on this prompt, which is highly related to the event type: "Can you provide any information on whether Alice attempted to flee or hide after the accident?"
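The prompt assembly described above can be sketched as follows; the template wording and the function name `build_prompt` are hypothetical, since the paper's full prompt text is not reproduced here:

```python
def build_prompt(conversation, event_type, event_description):
    """Assemble an event-constrained clarifying-question prompt.
    The template wording below is a hypothetical reconstruction,
    not the paper's exact prompt."""
    turns = "\n".join(f"{speaker}: {utterance}"
                      for speaker, utterance in conversation)
    return (
        "You are now a knowledgeable judge in law.\n"
        "The current conversation between you and the user is as follows:\n"
        f"{turns}\n"
        f'Please ask one clarifying question related to the event type '
        f'"{event_type}" ({event_description}) to further understand the '
        "background information of the legal case."
    )

prompt = build_prompt(
    [("User", "Alice drove a car at night and crashed into Bob.")],
    "Escaping",
    "Fleeing and hiding to avoid unfavorable circumstances",
)
```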

Workflow
After defining the Prompt Module, we can formalize the workflow of LeClari to interact with users based on it (shown in Figure 1(a)).
Assume that we have N event types E = {e_1, e_2, ..., e_N} and M candidate cases D = {d_1, d_2, ..., d_M}. We now consider a conversational search scenario where the user and the search system are discussing and finding relevant legal case documents for a specific query case.
A user with a legal query case issues an initial query q_0 in natural language to start a search session. During the session, the search system selects an event type e'_t (by the Event Selection Module in Section 4.3) for the Prompt Module to generate a clarifying question q'_t, and the user provides an associated answer a'_t based on her query case. In each search session, the system asks the user a sequence of clarifying questions and collects a sequence of user answers. Suppose LeClari asks T rounds of clarifying questions; the sequence S_T of events, clarifying questions, and answers is represented as:

S_T = {(e'_1, q'_1, a'_1), (e'_2, q'_2, a'_2), ..., (e'_T, q'_T, a'_T)}.

Based on the sequence S_T, the current conversation C_T can be denoted as:

C_T = {q_0, (q'_1, a'_1), ..., (q'_T, a'_T)}.

Finally, we apply the fine-tuned BERT-Crime or LawFormer from Section 3.3 as the Ranker, which ranks the candidate cases D based on the conversation C_T to obtain the ranking list R_T:

R_T = Ranker(C_T, D),

where R_T is a permutation of the candidate case set D. Thus, the sequence of actions in the conversation can be represented as:

{q_0, (e'_1, q'_1, a'_1, R_1), ..., (e'_T, q'_T, a'_T, R_T)}.

The goal of the search system is to maximize the retrieval metrics of the ranking list R_T. We compute the retrieval metrics each round until T clarifying questions have been asked. Note that T is a pre-defined number; we leave the selection of the number of clarifying questions as future work.
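The workflow above can be sketched as a simple loop; all component callables (`select_event`, `ask_llm`, `simulate_user`, `rank`) are stand-ins for the Event Selection Module, the Prompt Module with an LLM, the user simulator, and the Ranker:

```python
def run_session(q0, events, cases, select_event, ask_llm, simulate_user,
                rank, T=4):
    """One conversational session: T rounds of (event selection,
    clarifying question, user answer), re-ranking after each round.
    All callables are stand-ins for the system's components."""
    conversation = [("User", q0)]        # the conversation starts with q_0
    sequence = []                        # (event, question, answer) triples
    ranking = rank(conversation, cases)  # initial ranking list
    for _ in range(T):
        e = select_event(conversation, sequence, ranking, events)
        q = ask_llm(conversation, e)          # Prompt Module + LLM
        a = simulate_user(conversation, q)    # user (or simulator) reply
        sequence.append((e, q, a))
        conversation += [("System", q), ("User", a)]
        ranking = rank(conversation, cases)   # Ranker re-ranks candidates
    return conversation, ranking
```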

Event Selection Module
The relevance estimation in legal case retrieval has already reached good solutions through pre-trained models, so we fix the parameters of the Ranker and focus in this paper on selecting appropriate event types and asking useful clarifying questions. We design the Event Selection Module (ref. Figure 1(b)), which selects the (t+1)-round event type e'_{t+1} for the Prompt Module given the initial query q_0, the t-round sequence S_t, and the t-round ranking list R_t. It contains an encoding layer, an interaction layer, and a decision layer.

Encoding Layer.
To mine the rich semantic information in the conversations, event descriptions and candidate cases, we apply a shared encoding layer to generate semantic embeddings for them.
For the conversation C_t, we use LawFormer [37] as our encoder, a Longformer-based pre-trained language model for understanding long legal documents. In detail, the conversation C_t contains the initial query and a sequence of clarifying questions and answers, denoted as {q_0, (q'_1, a'_1), ..., (q'_t, a'_t)}. As usual, a special token [CLS] is inserted as the first token and another token [SEP] is used to split different segments. Therefore, the semantic embedding of the conversation is obtained as:

E_{C_t} = LawFormer([CLS] • q_0 • [SEP] • q'_1 • a'_1 • [SEP] • ... • q'_t • a'_t),

where • denotes the concatenation of two sequences. We use the [CLS] representation as the semantic embedding of the conversation.
In addition, given all event types E = {e_1, e_2, ..., e_N}, we also apply LawFormer to map each event type to a dense representation based on its description. Specifically, the description of the i-th event type e_i is a word sequence (w_{i,1}, w_{i,2}, ..., w_{i,L}), where L is the maximum length of an event description. We again insert the [CLS] token as the first token, and the semantic embedding of the i-th event type is generated as:

E_{e_i} = LawFormer([CLS] • w_{i,1} • w_{i,2} • ... • w_{i,L}),

where we use the [CLS] representation as the semantic embedding of the event type.

Interaction Layer.
The input embeddings of the two Transformers combine the semantic embeddings with segment embeddings (denoted as I_C for the conversation and I_e for event types). Specifically, the input embedding of the conversation is:

x_C = E_{C_t} + I_C.

Meanwhile, because our task is to select the next-round event type for clarifying question generation, the event selection history should be taken into consideration. For example, when the conversation already contains the event "Robbery", it may be necessary to select "Injury" to construct a clarifying question asking whether the defendant caused injuries. Therefore, given the event types selected in the previous t rounds, E'_t = {e'_1, e'_2, ..., e'_t}, the input embeddings of event types add a selected embedding E_S gated by an indicator S(e_i ∈ E'_t), where S(e_i ∈ E'_t) = 1 if e_i has been selected and S(e_i ∈ E'_t) = 0 otherwise. Specifically, the input embedding of the i-th event type is:

x_{e_i} = E_{e_i} + I_e + S(e_i ∈ E'_t) · E_S.

Based on the input embeddings of the conversation and event types, the Event-Conversation Transformer generates the enhanced representations:

[h_C; h_{e_1}, ..., h_{e_N}] = Transformer_EC([x_C; x_{e_1}, ..., x_{e_N}]),

where h_C is the enhanced conversation representation and h_{e_i} is the i-th event type's enhanced representation, which is combined with conversational information.
As input to the Event-Candidate Transformer, we use the same event input embeddings as in the Event-Conversation Transformer. The input embedding of the j-th ranked candidate case likewise contains its semantic embedding E_{d_j} and a segment embedding (denoted as I_d). In addition, because the final aim of our task is to improve legal case retrieval performance (i.e., to rerank the candidate list), the current ranking position is important for the event selection decision. Therefore, a ranking embedding P_j is added to the input, which helps the model distinguish candidate cases at different ranks:

x_{d_j} = E_{d_j} + I_d + P_j.

Then the Event-Candidate Transformer produces the enhanced representations:

[g_{e_1}, ..., g_{e_N}; h_{d_1}, ..., h_{d_M}] = Transformer_ED([x_{e_1}, ..., x_{e_N}; x_{d_1}, ..., x_{d_M}]),

where g_{e_i} is the i-th event type's enhanced representation, combined with the ranking list of candidate cases, and h_{d_j} is the enhanced representation of the j-th ranked candidate case. In the two Transformers, all input embeddings except the semantic embeddings are randomly initialized and trainable.

Decision Layer.
The two Vanilla Transformers output two kinds of representations for the i-th event type: h_{e_i} from the Event-Conversation Transformer, combined with the contextual conversation, and g_{e_i} from the Event-Candidate Transformer, combined with the candidate ranking list. We now need a list of clarifying scores to select the next-round event type. We feed these representations into an MLP followed by a softmax layer to get the predictions:

c = softmax(MLP([h_{e_1} ⊕ g_{e_1}, ..., h_{e_N} ⊕ g_{e_N}])), (13)

where ⊕ denotes concatenation. Here c = {c_1, c_2, ..., c_N} is a list of clarifying scores for the event types; namely, c_i denotes the confidence of LeClari in selecting the i-th event type for clarifying question generation. Finally, we select the event type with the highest clarifying score among those that have never been selected in previous rounds.
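A minimal sketch of the selection step, with the MLP abstracted into pre-computed logits: softmax the logits into clarifying scores, then pick the highest-scoring event type that has not been selected before:

```python
import math

def select_next_event(logits, selected):
    """Turn MLP outputs into clarifying scores via softmax, then pick
    the highest-scoring event type not selected in previous rounds.
    (The MLP itself is abstracted into pre-computed `logits`.)"""
    exps = [math.exp(x) for x in logits]
    z = sum(exps)
    scores = [e / z for e in exps]  # clarifying scores c_1 .. c_N
    best, best_score = None, float("-inf")
    for i, s in enumerate(scores):
        if i not in selected and s > best_score:
            best, best_score = i, s
    return best, scores
```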

Model Training
As mentioned before, the key to improving LeClari is to select appropriate event types for the Prompt Module via the Event Selection Module. Here we introduce the training strategy used to optimize the Event Selection Module, including the loss function and training samples.

4.4.1 Ranking-oriented Loss Function. We hope to enhance the ability of LeClari to generate high-quality clarifying questions that improve legal case retrieval performance (i.e., retrieval metrics). The key is that the Event Selection Module must select appropriate event types for clarifying question generation. Therefore, we design ranking-oriented rewards as the target clarifying scores for event types. Specifically, given the t-round conversation C_t, the event types E, the selected event types E'_t, and the ranking list R_t, we generate one round of clarifying question and answer (q^i_{t+1}, a^i_{t+1}) by LLMs for the i-th event type based on C_t, following the event-based prompt in Section 4.1 and the user simulation prompt in Section 3. In this way, we obtain (t+1)-round conversations for all event types (denoted as C^i_{t+1} for the i-th event type e_i). Then we feed each new conversation C^i_{t+1} into the ranking models (i.e., BERT-Crime or LawFormer) to generate a new ranking list R^i_{t+1}. The ranking-oriented reward for the i-th event type is computed as:

r_i = M(R^i_{t+1}) − M(R_t),

where M(·) is the retrieval metric score (e.g., MAP or NDCG) of a ranking list. The rewards thus reflect the usefulness of each event type for the current conversation C_t.
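Assuming the reward measures the retrieval-metric gain of the re-ranked list over the current list (our reading of the garbled formula in the source), it can be sketched with NDCG@10 as the metric:

```python
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_10(ranked_labels):
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0

def event_reward(new_ranked_labels, old_ranked_labels, metric=ndcg_at_10):
    """Reward for one candidate event type: how much the metric of the
    re-ranked list exceeds that of the current list. The difference
    form is our assumption about the reward's exact shape."""
    return metric(new_ranked_labels) - metric(old_ranked_labels)
```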
Based on the rewards, we can use Maximum Likelihood Estimation (MLE) to optimize LeClari directly. Specifically, the event type with the highest reward (denoted as e*) is defined as the ground truth, and the loss function is:

L_MLE = − Σ_i δ_{e*}(e_i) log c_i,

where δ_{e*} is the Dirac distribution of the ground-truth event type, i.e., δ_{e*}(e*) = 1 and δ_{e*}(e) = 0 for any other e.
As we can see, the MLE criterion ignores the structure of the output space by treating all outputs that do not match the ground truth as equally poor, which brings a discrepancy between training and test. We therefore take into account alternative outputs beyond the ground truth for better model learning. Specifically, we derive a new target distribution by employing Reward Augmented Maximum Likelihood (RAML) [22]: we normalize the target clarifying scores r = {r_1, ..., r_N} with a softmax layer to obtain a distribution over the outputs and replace the Dirac distribution in the loss function:

L_RAML = − Σ_i softmax(r)_i log c_i.

4.4.2 Training Conversations Sampling. Because the original legal case retrieval datasets contain only query cases with their candidate cases, we need to construct training conversations to apply the ranking-oriented loss function. We utilize the initial queries in Section 3 and define an event sampling strategy to generate conversations automatically. Specifically, suppose there are n query cases in the dataset; we construct m conversations for each query case, and each conversation contains (T−1) rounds of event-type-related clarifying questions.
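The RAML target distribution and loss can be sketched as follows; the temperature `tau` is an assumed hyperparameter:

```python
import math

def raml_target(rewards, tau=1.0):
    """Softmax-normalized rewards: the RAML target distribution that
    replaces the Dirac delta on the single best event type."""
    exps = [math.exp(r / tau) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

def raml_loss(pred_scores, rewards, tau=1.0):
    """Cross-entropy between predicted clarifying scores and the
    reward-derived target distribution. `tau` is an assumed
    temperature hyperparameter, not stated in the source."""
    target = raml_target(rewards, tau)
    return -sum(t * math.log(p) for t, p in zip(target, pred_scores))
```

A model whose predicted scores match the reward-derived target incurs a lower loss than one predicting a uniform distribution, which is the sense in which RAML rewards near-misses instead of treating them all as equally wrong.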
T is the pre-defined maximum number of clarifying questions, so we can obtain (n + n·m·(T−1)) training samples (including n initial queries and n·m conversations per round). In addition, to avoid most randomly sampled clarifying questions providing no more useful information than the initial query, we use the target clarifying scores in Section 4.4.1 to define a dynamic probability of sampling the i-th event type in the next round based on the current conversation:

p_i = exp(α · r_i) / Σ_j exp(α · r_j),

where α is a pre-defined parameter. A higher α tends to sample more useful events into the conversation, and if α = 0, the sampling method degenerates into random sampling. Given the initial query, we sample event types round by round from the dynamic sampling probability distributions until (T−1) rounds of clarifying questions and answers have been generated.
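The dynamic sampling distribution can be sketched as a temperature-scaled softmax over the rewards; with alpha = 0 it reduces to uniform random sampling:

```python
import math
import random

def sampling_distribution(rewards, alpha):
    """Probability of sampling each event type for the next round.
    Larger alpha concentrates mass on high-reward events; alpha = 0
    recovers uniform random sampling."""
    exps = [math.exp(alpha * r) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

def sample_event(rewards, alpha, rng=None):
    """Draw one event type index from the dynamic distribution."""
    rng = rng or random.Random(0)
    probs = sampling_distribution(rewards, alpha)
    return rng.choices(range(len(rewards)), weights=probs, k=1)[0]
```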

EXPERIMENTS
This section reports the experimental results. We first introduce the evaluation scheme and baseline models. Then we present overall performance comparisons, further analyses, and a case study.

Evaluation Scheme
Evaluation Protocol. We aim to evaluate the clarifying questions for conversational legal case retrieval and use the same user simulation method as in our preliminary study. Specifically, the conversation starts with an initial query, and the user simulator generates the answer to each clarifying question with LLMs to continue the conversation. The user simulator is the same LLM used for clarifying question generation, and its prompt is shown in Section 3. We also generate 3 conversations for each query with each model to mitigate the effects of LLM randomness. Once T clarifying questions have been generated (T is pre-defined), we apply pre-trained legal case retrieval models to obtain the final ranking lists. We evaluate on LeCaRD and CAIL2022-LCR with the corresponding initial queries, both under 5-fold cross-validation, and apply the same LLMs (i.e., ChatGPT and GPT-4) and legal case retrieval models (i.e., BERT-Crime and LawFormer) as in our preliminary study.
Metrics. We evaluate model performance using three metrics: Mean Average Precision (MAP), Precision@5 (P@5), and NDCG@10. Notably, we merge the four-level labels in the legal case retrieval datasets into binary labels when measuring MAP and P@5: only cases with the highest relevance label are regarded as relevant, and the rest are regarded as irrelevant.
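The binary label merge and the resulting MAP/P@5 building blocks can be sketched as follows (assuming the full candidate pool is ranked, so every relevant case appears in the list):

```python
def merge_binary(labels, top_label=3):
    """Merge four-level relevance labels (0-3) into binary: only the
    highest level counts as relevant, as done for MAP and P@5."""
    return [1 if label == top_label else 0 for label in labels]

def average_precision(ranked_binary):
    """AP over a fully ranked pool (all relevant cases are in the list)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_binary, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def precision_at_5(ranked_binary):
    return sum(ranked_binary[:5]) / 5
```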

Baselines
For a comprehensive evaluation, we compare our method with the following baselines. (1) "w/o Clarify": does not generate clarifying questions and uses only the initial query for retrieval. (2) "w/o Event": uses the prompt from our preliminary study to generate clarifying questions directly, without incorporating event type information. (3) Event selection models: replace only the Event Selection Module in LeClari with other selection strategies.
For the event selection models, we apply the following strategies to select event types; previous work [43] has shown them to be effective for aspect selection in conversational product search. Random selects event types randomly, and MaxE selects the most frequent event types in the candidate cases. GBS [23] selects the event type that splits the candidate cases in the current ranking list closest to two equal halves. LinRel [2] estimates a linear regression model and makes use of side information to estimate the relevance score of an event type. GP+UCB/EI [43] models event type selection as a Gaussian Process and uses two acquisition functions (Upper Confidence Bound and Expected Improvement) to select event types. Note that these strategies need one-hot vectors (i.e., whether an event type occurs or not) to represent the query cases and the candidate cases. Considering that human annotation is prohibitive, we train DMBERT [35], the best legal event detection model, on LEVEN and apply it to annotate the event types for the query and candidate cases. On LeCaRD, the query cases and candidates contain 14.83 and 9.72 events on average, respectively; on CAIL2022-LCR, they contain 13.85 and 10.69 events on average, respectively.
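The GBS heuristic can be sketched as follows, assuming each case is annotated with a set of detected event types (the function name and signature are illustrative):

```python
def gbs_select(ranking, case_events, available_events):
    """GBS-style heuristic: choose the event type whose presence splits
    the currently ranked candidate cases closest to two equal halves.
    `case_events` maps each case id to its set of detected event types
    (the one-hot annotations); names and signature are illustrative."""
    n = len(ranking)

    def imbalance(event):
        with_event = sum(1 for case in ranking if event in case_events[case])
        return abs(2 * with_event - n)  # 0 means a perfect 50/50 split

    return min(available_events, key=imbalance)
```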

Ablation Study
We conduct an ablation study to evaluate the effects of the components of the Prompt Module and the Event Selection Module in LeClari (Table 6). First, we find that LeClari performs significantly worse without encoding the event types with the conversations and the candidate cases through the two Transformers of the interaction layer, respectively.

The dialogues compared in Table 7 are as follows.

w/o Event + ChatGPT:
q′1: Can you tell us the specific amount of drugs that defendant A sold to others?
a′1: Defendant A provided methamphetamine to D and E for consumption, but the specific quantity was not provided.
q′2: Did defendant A engage in drug trafficking before being arrested? If so, what was the specific quantity of drugs sold? (Duplicated)
a′2: Defendant A provided drugs to D twice and to E three times before being arrested, indicating that he engaged in drug trafficking.
q′3: Does defendant A have a criminal record? (Useless)
a′3: It was not mentioned whether defendant A has a criminal record.

LeClari + ChatGPT:
q′1 (Event: "Buying"): Did defendant A purchase or produce the drugs, and what was the type and quantity of drugs involved?
a′1: Defendant A purchased the drugs, which were methamphetamine, and the quantity was 1 kilogram.
q′2 (Event: "Collusion"): Did anyone else conspire with defendant A in drug trafficking?
a′2: Defendant A conspired with co-defendant B to purchase drugs, and C in the same village was involved as the seller.
q′3 (Event: "Taking_drugs"): Have defendant A's buyers already consumed the drugs?
a′3: Defendant A provided drugs to D twice and to E three times, indicating that the buyers have already consumed the drugs.

Analysis on Training Strategies
Here we compare the two training strategies (i.e., MLE and RAML) with different target ranking metrics on LeCaRD (Table 8). RAML clearly outperforms MLE. This is mainly because the MLE criterion introduces a discrepancy between training and testing, leading to overfitting on the ground-truth labels and reduced generalization; RAML introduces alternative outputs beyond the ground truth and thus effectively mitigates this issue. In addition, we find that when we use MAP as the target metric to derive the target distribution over event types, LeClari achieves a higher MAP than with the other two target metrics (P@5 and NDCG@10). Similarly, LeClari achieves the best P@5 and NDCG@10 when each of them is used as the target metric, respectively. This demonstrates that we can use the metric of greatest interest as the target metric to derive the target distribution.
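The way a target distribution over event types can be derived from a ranking reward can be sketched with the standard exponentiated-reward formulation of RAML. This is a minimal sketch under our own assumptions (the temperature value and the exact reward definition are not specified here), not the paper's training code.

```python
import math

def raml_target_distribution(rewards, temperature=1.0):
    # RAML-style target distribution: each candidate output (here, an
    # event type) receives probability proportional to
    # exp(reward / temperature), where the reward is the retrieval
    # metric (e.g., MAP) achieved when that event type is selected.
    # Lower temperature sharpens the distribution toward the best output.
    exps = [math.exp(r / temperature) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]
```

For instance, with per-event-type MAP rewards [0.9, 0.5, 0.1], the resulting distribution is ordered accordingly, and lowering the temperature concentrates more mass on the highest-reward event type.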

Case Study
We conduct a case study to compare the clarifying questions generated by "w/o Event" and LeClari through ChatGPT (Table 7).
We highlight three key points of the query case for legal case retrieval: A purchased the drugs rather than producing them himself, A had a co-defendant B, and A has sold the drugs. We find that when ChatGPT generates clarifying questions directly, without event type information, some questions ask about duplicated content (e.g., q′2) and some are useless for legal case retrieval (e.g., q′3), which is consistent with the conclusion of our preliminary study. In contrast, LeClari incorporates three appropriate event types into the LLM prompts and generates three clarifying questions corresponding to the three key points, respectively. This indicates that LeClari can select appropriate event types that cover the key points of the query case.

CONCLUSION
In this paper, we first conducted a preliminary study showing that generating clarifying questions in legal conversational search with SOTA LLMs (e.g., GPT-4) often suffers from problems such as duplication and low-utility content. To address these problems, we leverage the legal event schema LEVEN and propose a novel conversational search model, LeClari, with a Prompt Module and an Event Selection Module. The former defines a prompt with legal events for clarifying question generation, and the latter selects potential event types by modeling the relationships of legal event types, conversational context, and potential candidate cases. We employ RAML for model learning to directly optimize the legal case retrieval metrics. Empirical results showed that our model can significantly outperform the state-of-the-art event selection methods. In future work, we will make the model decide dynamically when to stop asking clarifying questions.

Figure 1 :
Figure 1: The overview of LeClari. PM denotes the Prompt Module.

Table 1 :
The Quality Scores of Clarifying Questions by LLMs

Table 2 :
Retrieval performance comparison of the conversations with/without clarifying questions in terms of NDCG@10. There are no significant differences between the performances with or without clarifying questions.

Table 3 :
Statistics and examples of the event schema LEVEN

Interaction Layer. The final aim of LeClari is to find relevant legal cases, so the Event Selection Module needs to consider not only the conversation context but also the ranking list of candidate cases. Therefore, we apply an interaction layer that enhances the representations of the event types based on the conversation and the candidate cases. To this end, we leverage two vanilla Transformers to represent the event embeddings: one (the Event-Conversation Transformer) encodes the event types with the conversations, and the other (the Event-Candidate Transformer) encodes the event types with the ranking lists of candidate cases. The multi-head attention mechanism in the Transformer captures the interaction between the event types and the conversations/candidate cases.
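The attention at the core of this interaction layer can be sketched as a toy single-head cross-attention in which event embeddings attend over conversation-turn or candidate-case embeddings. A full Transformer block would add learned query/key/value projections, multiple heads, residual connections, and feed-forward sublayers; this sketch illustrates only the interaction mechanism.

```python
import math

def cross_attention(event_embs, context_embs):
    # Toy single-head cross-attention: each event embedding (query)
    # attends over context embeddings (keys == values here), yielding
    # context-aware event representations.
    d = len(event_embs[0])
    out = []
    for q in event_embs:
        # Scaled dot-product scores against every context embedding.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context_embs]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the context embeddings.
        out.append([sum(w * v[j] for w, v in zip(weights, context_embs))
                    for j in range(d)])
    return out
```

Running one Transformer of this kind over the conversation and another over the candidate ranking list yields the two context-enhanced event representations described above.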

Table 4 :
Performance comparison on the LeCaRD dataset. The best results are highlighted in boldface. † denotes that LeClari performs significantly better than the baseline at the 0.05 level with a two-tailed pairwise t-test.

Table 5 :
Performance comparison on the CAIL2022-LCR dataset. The best results are highlighted in boldface. † denotes that LeClari performs significantly better than the baseline at the 0.05 level with a two-tailed pairwise t-test.

Table 6 :
Ablation study on LeCaRD. † denotes that LeClari performs significantly better than the variations at the 0.05 level with a two-tailed pairwise t-test.
In addition, the three input embeddings also contribute to selecting better event types, especially the semantic embeddings, indicating that LeClari effectively models the relationships of legal event types, conversational context, and potential candidate cases.

Table 7 :
A case study to compare the clarifying questions by "w/o Event" and LeClari. The three key points of the query case for legal case retrieval and their corresponding answers are highlighted in boldface and distinguished by different colors.

Query Case: On the evening of August 25th, 2013, the defendant A discussed with co-defendant B about purchasing crystal meth. On August 27th of the same year, defendant A gave B RMB 28,000 to buy methamphetamine. On August 28th, B bought 1 kilogram of methamphetamine from C in the same village for RMB 22,000 and gave it to A. From October 2015 until his arrest, defendant A provided methamphetamine to D twice and to E three times.

Table 8 :
Comparison between MLE and RAML on LeCaRD. We highlight the better results in boldface, and ‡ denotes the best results among the different target metrics.