System Initiative Prediction for Multi-turn Conversational Information Seeking

Identifying the right moment for a system to take the initiative is essential to conversational information seeking (CIS). Existing work has extensively studied the clarification need prediction task, i.e., predicting when to ask a clarifying question; however, this task covers only one specific system-initiative action. We define the system initiative prediction (SIP) task as predicting whether a CIS system should take the initiative at the next turn. Our analysis reveals that for effective modeling of SIP, it is crucial to capture dependencies between adjacent user–system initiative-taking decisions. We propose to model SIP with CRFs. Due to their graphical nature, CRFs are effective in capturing such dependencies and have greater transparency than more complex methods, e.g., LLMs. Applying CRFs to SIP comes with two challenges: (i) CRFs need to be given the unobservable system utterance at the next turn, and (ii) they do not explicitly model multi-turn features. We model SIP as an input-incomplete sequence labeling problem and propose a multi-turn system initiative predictor (MuSIc) that has (i) prior-posterior inter-utterance encoders to eliminate the need to be given the unobservable system utterance, and (ii) a multi-turn feature-aware CRF layer to incorporate multi-turn features into the dependencies between adjacent initiative-taking decisions. Experiments show that MuSIc outperforms LLM-based baselines including LLaMA, achieving state-of-the-art results on SIP. We also show the benefits of SIP on clarification need prediction and action prediction.


INTRODUCTION
An essential part of conversational information seeking (CIS) is to identify the right moment for a CIS system to take the initiative [6,72], given that system initiative-taking risks frustrating the user and hurting the user experience [64,65,72,75,76]. A CIS system can take the initiative through various system-initiative actions, e.g., asking a clarifying question or requesting feedback [58,60]. Existing work has extensively studied the clarification need prediction task, that is, predicting when to ask a clarifying question in an information-seeking conversation [2,3,5,64,65,68]. However, as shown in Fig. 1, asking a clarifying question is only one of several possible system-initiative actions [1,7,72].
Task and motivation. We define the system initiative prediction (SIP) task, which is to predict whether the CIS system should take the initiative at the next turn in an information-seeking conversation. To the best of our knowledge, no existing studies explicitly model this problem. SIP has three benefits for CIS systems: (i) SIP can improve the controllability of the overall initiative level of the system to balance utility and user experience [53]. (ii) SIP can enable knowledge sharing among various system-initiative actions; the shared knowledge learned through SIP can be transferred to improve the prediction of a specific system-initiative action by transfer learning, e.g., by fine-tuning a model, pre-trained on SIP, on clarification need prediction. (iii) SIP is a high-level decision, and downstream tasks, such as action prediction, depend on SIP; SIP can boost the prediction performance on downstream tasks by reducing the decision space; e.g., on action prediction, the action requesting feedback is performed only if the SIP result is initiative. One could argue that existing action prediction methods [70] are sufficiently effective for SIP. However, our experiments show that using action prediction methods for SIP leads to suboptimal results; conversely, SIP significantly improves downstream action prediction.
Our empirical analysis of two CIS datasets [43,47] reveals that a system's initiative-taking decision at the next turn is not isolated but depends on the user's previous initiative-taking decision. Fig. 2a shows that the system is more likely to take the initiative immediately after the user has taken the initiative in a conversation; thus, capturing the dependencies between adjacent user-system initiative-taking decisions is critical for modeling SIP.
Figure 2: The probability of system initiative-taking (sys-in) conditioned on the user's initiative-taking decision at the preceding turn (pre-user-in) and the number of times the system has taken the initiative (# pre-sys-in) on the WISE and MSDialog training sets.
A natural way to capture such structural dependencies is to use probabilistic graphical models, such as conditional random fields (CRFs) [32]. We propose to use linear-chain CRFs [32,56] to model SIP for three reasons: (i) they have been shown to be effective in capturing dependencies between adjacent output decisions [56]; (ii) linear-chain CRFs for SIP can guarantee the best initiative-taking decision at the next turn by decoding the optimal sequence of initiative-taking decisions in context ($1 : T-1$ in Fig. 3a) and the next turn ($T$ in Fig. 3a), instead of outputting the decision at the next turn independently [32,56]; and (iii) due to CRFs' graphical nature, they have been shown to exhibit better interpretability and transparency than other methods [20,30], such as emergent large language models (LLMs) [14,57,74].
Challenges. When applying linear-chain CRFs to the SIP task, we face two challenges: (i) They cannot be directly applied to SIP because we face an input-incomplete sequence labeling problem. Linear-chain CRFs are designed for sequence labeling problems that have a one-to-one correspondence between input observations and output decisions. As shown in Fig. 3a, to output initiative-taking decisions in context and at the next turn, linear-chain CRFs need to be given a complete input sequence of utterances in context and at the next turn. However, given the nature of SIP, as shown in Fig. 3b, the system utterance at the next turn is unobservable, leading to an input-incomplete sequence labeling problem. (ii) Linear-chain CRFs do not explicitly model multi-turn features. Our empirical analysis shows that an initiative-taking decision depends on multi-turn features. We define a multi-turn feature as a variable that varies across turns. Consider, e.g., the number of times the system has taken the initiative; Fig. 2b shows that a system is much less likely to take the initiative once again if it has already taken the initiative before. But linear-chain CRFs do not consider this feature, as it goes beyond the dependency between adjacent initiative decisions.
Approach. To address the challenges, we cast SIP as an input-incomplete sequence labeling problem and propose a multi-turn system initiative predictor (MuSIc). We propose (i) prior-posterior inter-utterance encoders to adapt linear-chain CRFs to the input-incomplete sequence labeling problem and eliminate the need to be given the unobservable system utterance, and (ii) a multi-turn feature-aware conditional random field (CRF) layer to explicitly capture the impact of multi-turn features on an initiative-taking decision by conditioning the dependencies between adjacent initiative-taking decisions on multi-turn features. MuSIc can use an arbitrary number of multi-turn features; we consider three essential ones: (i) role transition direction, (ii) the number of times the system has taken the initiative, and (iii) the distance to the last system initiative turn.
Figure 3: A comparison between a sequence labeling problem (a) and an input-incomplete sequence labeling problem (b). In both panels, linear-chain CRFs produce the output decisions (initiative-taking decisions). $1 : T$ denote turn numbers and $T$ is the next system turn.
Experiments. We annotate the initiative-taking decision at each turn on two multi-turn CIS datasets, WISE [47] and MSDialog [42]. Experiments on both datasets show that MuSIc achieves state-of-the-art performance on SIP, outperforming strong clarification need prediction, action prediction, and LLM-based (LLaMA [57]) baselines. We get two more insights: (i) LLMs do not show promising performance on SIP, and scaling up LLMs is not an effective way to solve SIP; and (ii) probabilistic graphical modeling is still competitive and effective for this task and should not be ignored in the era of LLMs. Furthermore, a visual analysis indicates that the transition matrices learned by MuSIc exhibit meaningful transition patterns and explicitly show how MuSIc models the dependencies, demonstrating interpretability and transparency. Moreover, we fine-tune MuSIc, pre-trained on SIP, on the clarification need prediction task, achieving state-of-the-art clarification need prediction performance on ClariQ [2,3]; this indicates that the knowledge shared among various system-initiative actions learned through SIP can be used to improve the prediction of a specific system-initiative action. Finally, we construct a SIP-aware action prediction framework where action prediction is fed with SIP results returned by MuSIc. The action prediction performance is significantly improved, indicating the effectiveness of SIP in benefiting downstream tasks.
Contributions. Our main contributions are as follows:
• We introduce the task of system initiative prediction (SIP) for CIS, which has not been explicitly modeled in prior work.
• We propose a multi-turn system-initiative predictor (MuSIc), which formalizes SIP as an input-incomplete sequence labeling problem and jointly considers dependencies between adjacent user-system initiative-taking decisions and the impact of multi-turn features on an initiative-taking decision.

RELATED WORK
2.1 Conversational information seeking
In mixed-initiative CIS, the user and the system can both take the initiative at different times in a conversation. Mixed-initiative CIS systems can ask clarifying questions [3,4,49,71], elicit user preferences [44,50], ask for feedback [58,60], initiate a conversation [62] and so on. Existing work focuses on when a CIS system should take the initiative [6] and on response generation/selection given a decided system-initiative action [4,12,49,66,71]. We focus on the former. In this direction, Avula et al. [6] run a user study to investigate this question. Besides, much work has studied the prediction of when to perform one specific system-initiative action, namely asking a clarifying question (a.k.a. clarification need prediction) [2,3,5,64,65,68].
Clarification need prediction. Zou et al. [75,76] show that asking a clarifying question is not always necessary, and inappropriate requests for clarification can hurt user experience. Xu et al. [68] propose a binary classification model to identify whether clarification is needed given the conversational context. Aliannejadi et al. [2,3] fine-tune pre-trained language models fed with user queries to return a clarification need score. Wang and Ai [64,65] propose a binary classification model that further takes into account clarifying question and answer candidates returned by retrieval models. Arabzadeh et al. [5] utilize the coherency of items retrieved for the user query: the more coherent the retrieved items are, the less ambiguous the query is, and the lower the need for clarification. Our work differs from these studies, as SIP covers a broader range of system-initiative actions, while these studies are limited to one initiative type (i.e., asking a clarifying question).
System action prediction. Radlinski and Craswell [45] define a system action space and emphasize the need for system action prediction in CIS, i.e., a CIS system should predict an appropriate action from an action space at the right time. Azzopardi et al. [8] define a more detailed taxonomy of user/system actions in CIS. Schneider et al. [48] conduct a user study to reveal action flow patterns in CIS. Ghosh et al. [22] first identify the user action used in the previous user utterance and then use that to benefit the system action prediction. In this paper, we are concerned with the more challenging multi-action system action prediction task, i.e., the system performs multiple actions concurrently per turn [73]. Beyond CIS, multi-action system action prediction has been well studied for task-oriented conversations, where it is typically formulated as a multi-label classification [25,34,67] or sequence generation problem [28,34,52,63]. Ye et al. [70] propose a sequence generation-based method, called Co-Gen, achieving leading performance in terms of response generation and action prediction.
Our work differs from action prediction because SIP is a higher-level decision on which the action prediction depends.

Linear-chain conditional random fields
Linear-chain CRFs are discriminative probabilistic graphical models for sequence labeling problems that assign output decisions to all of the observations in a sequence jointly [32]. The output decisions are arranged in a sequence/linear chain where adjacent output decisions are dependent according to the first-order Markov assumption, enabling linear-chain CRFs to effectively capture dependencies between adjacent output decisions [56]. We focus on neural linear-chain CRFs [26,27], where parameters can be trained end-to-end. They have been widely used for sequence labeling tasks, e.g., POS tagging [26], named entity recognition [26,33] and dialogue act recognition [13,17,31,46,51]. None of the work listed above can be directly applied to SIP due to the input-incomplete sequence labeling problem. Another line of research captures the dependencies between adjacent output decisions by dynamically generating transition matrices [24,27,54,55]. MuSIc differs as it explicitly incorporates multi-turn features into the adjacent dependencies. While some work [11,51] injects features (e.g., emotion shifts) into the adjacent dependencies for sequence labeling, MuSIc is for input-incomplete sequence labeling and considers CIS-specific features that have not been studied yet.

TASK DEFINITION
Suppose that we have an information-seeking conversation $X = (x_1, x_2, \ldots, x_{|X|-1}, x_{|X|})$ with a sequence of $|X|$ utterances, where $x$ is an utterance uttered by either a user or system. The conversation $X$ comes with a sequence of ground-truth initiative-taking decisions $Y = (y_1, y_2, \ldots, y_{|X|-1}, y_{|X|})$, i.e., each utterance $x$ in the conversation has a corresponding initiative-taking decision $y \in \{\text{Initiative}, \text{Non-initiative}\}$. Given the context $X_{1:T-1} = (x_1, x_2, \ldots, x_{T-1})$, where $T-1$ is a user turn, the system initiative prediction (SIP) task is to predict the system's initiative-taking decision $y_T$ at the next turn $T$. We formulate SIP as an input-incomplete sequence labeling problem: we model the conditional probability $P(Y_{1:T} \mid X_{1:T-1})$ of the sequence of initiative-taking decisions in the context $X_{1:T-1}$ and at the next turn $T$ given the sequence $X_{1:T-1}$ of utterances in the context. Only the system's initiative-taking decision $y_T$ at the next turn $T$ is used for evaluation.
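The task definition can be made concrete with a small sketch of how evaluation instances might be cut out of an annotated conversation: one (context, target) pair per system turn that directly follows a user turn. The `Utterance` class, the `sip_instances` helper, and the toy conversation are illustrative assumptions, not part of the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    role: str         # "user" or "system"
    text: str
    initiative: bool  # ground-truth initiative-taking decision for this turn


def sip_instances(conversation: List[Utterance]) -> List[Tuple[List[Utterance], bool]]:
    """Return one (context, target) pair per system turn that follows a user turn.

    The context holds the utterances before the next turn; the target is the
    system's initiative-taking decision at that next turn.
    """
    instances = []
    for nxt in range(1, len(conversation)):  # 0-based index of the next turn
        if conversation[nxt - 1].role == "user" and conversation[nxt].role == "system":
            instances.append((conversation[:nxt], conversation[nxt].initiative))
    return instances


# Toy conversation (invented for illustration).
conv = [
    Utterance("user", "How do I reset my password?", True),
    Utterance("system", "Which product are you using?", True),
    Utterance("user", "The desktop mail client.", False),
    Utterance("system", "Open Settings and choose Reset.", False),
]
pairs = sip_instances(conv)
```

Each context keeps its own per-turn labels, which matches the formulation above: the whole label sequence is modeled, but only the decision at the next turn is scored.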

METHOD
4.1 Limitations of linear-chain CRFs
Linear-chain CRFs predict a sequence of output decisions based on emission and transition scores (see [26,33] for details), and have two main limitations when applied "as is" to SIP: (i) They model $P(Y_{1:T} \mid X_{1:T})$ to output the sequence $Y_{1:T}$: they use the sequence $X_{1:T}$ of utterances in the context and at the next turn to calculate emission scores over {Initiative, Non-initiative} over turns $1 : T$; there is a one-to-one correspondence between $X_{1:T}$ and emission scores over turns $1 : T$. However, $x_T$, the utterance at the next turn, is unobservable for SIP (see Fig. 3b), leading to the absence of the emission scores at turn $T$. (ii) They use a transition matrix that contains transition scores from one initiative-taking decision to itself (e.g., Initiative to Initiative) or the other (e.g., Initiative to Non-initiative) to capture dependencies between adjacent initiative-taking decisions. An initiative-taking decision $y_{t+1}$ is also affected by multi-turn features that change across turns, e.g., the number of times the system has taken the initiative (see Fig. 2b). However, the transition matrix is unique and shared across all turns; thus, the transition scores cannot be adjusted across turns to capture the impact of a multi-turn feature effectively.
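For reference, the textbook form of a linear-chain CRF over $T$ turns can be written as follows. The function names $f$ (emission score) and $g$ (transition score) are notation assumed here for illustration, with $X_{1:T}$ the utterances, $Y_{1:T}$ the decisions, and $\tilde{Y}_{1:T}$ ranging over all possible decision sequences.

```latex
P(Y_{1:T} \mid X_{1:T}) =
\frac{\exp\Big(\sum_{t=1}^{T} f(y_t, X_{1:T}) + \sum_{t=1}^{T-1} g(y_t, y_{t+1})\Big)}
     {\sum_{\tilde{Y}_{1:T}} \exp\Big(\sum_{t=1}^{T} f(\tilde{y}_t, X_{1:T})
      + \sum_{t=1}^{T-1} g(\tilde{y}_t, \tilde{y}_{t+1})\Big)}
```

Read against this form, limitation (i) is that the term $f(y_T, X_{1:T})$ needs the unobservable utterance at turn $T$, and limitation (ii) is that $g$ has no argument through which it could vary from one turn to the next.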

Overview of MuSIc
We propose MuSIc for SIP, which consists of three parts: (i) a BERT utterance encoder, (ii) prior-posterior inter-utterance encoders, and (iii) a multi-turn feature-aware CRF layer. See Fig. 4. The BERT utterance encoder is used to encode each utterance into a latent representation. Prior-posterior inter-utterance encoders enable MuSIc to model the input-incomplete sequence labeling by approximating the absent emission scores at turn $T$. We model $P(Y_{1:T} \mid X_{1:T})$ during training (see Fig. 5a) as we can access the unobservable system utterance $x_T$ at the next turn $T$; we pass $X_{1:T}$ through the BERT encoder and a posterior inter-utterance encoder to calculate emission scores over turns $1 : T$; we define them as posterior emission scores. Similarly, we pass $X_{1:T-1}$ through BERT and a prior inter-utterance encoder; we use the output of the prior inter-utterance encoder to calculate prior emission scores that are forced to approximate the posterior emission scores at turn $T$ via an MSE loss. During inference (see Fig. 5b), we model $P(Y_{1:T} \mid X_{1:T-1})$ and regard the approximate (prior) emission scores as the absent emission scores at turn $T$, eliminating the need to be given the unobservable system utterance $x_T$. The multi-turn feature-aware CRF layer incorporates three multi-turn features and conditions transition scores (dependencies) between adjacent initiative-taking decisions on multi-turn features. We extend the single transition matrix in linear-chain CRFs to multiple ones, corresponding to different multi-turn features. For a pair of adjacent initiative-taking decisions between turn $t$ and $t+1$, we adjust the transition score between them by selecting the transition matrix corresponding to the multi-turn features from turn $t$ to $t+1$.
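The inference procedure described above can be sketched as a decoder that accepts a different transition matrix for every adjacent pair of turns and a stand-in emission row for the unobserved next turn. This is a minimal pure-Python illustration that assumes Viterbi decoding (the usual choice for linear-chain CRFs); the function name, the score values, and the label order are invented for the example.

```python
from typing import List, Sequence

LABELS = ("Initiative", "Non-initiative")


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][y]      : emission score of label y at step t (T steps)
    transitions[t][a][b] : score of moving from label a at step t to label b
                           at step t+1 (T-1 matrices, one per adjacent pair)
    """
    num_steps, num_labels = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []
    for t in range(1, num_steps):
        new_score, pointers = [], []
        for b in range(num_labels):
            best_a = max(range(num_labels),
                         key=lambda a: score[a] + transitions[t - 1][a][b])
            new_score.append(score[best_a] + transitions[t - 1][best_a][b] + emissions[t][b])
            pointers.append(best_a)
        score = new_score
        back.append(pointers)
    y = max(range(num_labels), key=lambda k: score[k])
    path = [y]
    for pointers in reversed(back):  # follow back-pointers from the last step
        y = pointers[y]
        path.append(y)
    return path[::-1]


# Context turns use their own emission scores; the last row stands in for the
# next turn and would come from the prior encoder, since that utterance is unseen.
context_emissions = [[0.2, 1.0], [1.5, 0.1]]
prior_emission_next_turn = [0.4, 0.6]
emissions = context_emissions + [prior_emission_next_turn]
transitions = [
    [[0.0, 0.0], [0.0, 0.0]],   # turn 1 -> turn 2
    [[-1.0, 1.0], [0.5, 0.0]],  # turn 2 -> turn 3, picked by the multi-turn features
]
path = viterbi(emissions, transitions)
next_turn_decision = LABELS[path[-1]]
```

Only the last element of the decoded path is the SIP prediction; the earlier elements are decoded jointly so that the final decision is consistent with the rest of the sequence.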

BERT utterance encoding.
We use a BERT encoder [18] to encode an utterance $x_t$ ($t = 1, \ldots, T$ during training, $t = 1, \ldots, T-1$ during inference) into an utterance representation $\mathbf{H}_t \in \mathbb{R}^{|x_t| \times d_h}$, after which an average pooling operation [10] is used to get a condensed representation $\mathbf{h}_t \in \mathbb{R}^{1 \times d_h}$, where $|x_t|$ and $d_h$ denote the number of tokens in $x_t$ and the hidden size, respectively.

4.2.2 Prior-posterior inter-utterance encoding.¹ We have a prior encoder fed with $\{\mathbf{h}_t\}_{t=1}^{T-1}$, returning prior utterance representations $\{\mathbf{h}^{\mathrm{pri}}_t\}_{t=1}^{T-1}$, as shown in Fig. 5. Also, we have a posterior encoder fed with $\{\mathbf{h}_t\}_{t=1}^{T}$ [...] as additional input:
¹ We implement inter-utterance encoders by BiLSTMs, which got better performance than Transformers in our preliminary experiments.
(1) $r_{t:t+1}$ represents the role transition direction from turn $t$ to $t+1$, i.e., $r_{t:t+1} = \mathrm{u2s}$/$\mathrm{s2u}$ means that the role transition is from the user to the system/the system to the user from turn $t$ to $t+1$.
(2) Given $r_{t:t+1} = \mathrm{u2s}$ ("user to system"),² $n_{t:t+1}$ represents the number of times the system takes the initiative before the next system turn at $t+1$. Table 1 shows that the average number of system initiative utterances in a conversation in training sets is less than 1. To make full use of the sparse training data, we only consider the cases $n_{t:t+1} = 0$ and $n_{t:t+1} > 0$, which means that the system has not taken the initiative and has taken the initiative once or more before the next system turn at $t+1$, respectively.
(3) Given $r_{t:t+1} = \mathrm{u2s}$ (again, "user to system") and $n_{t:t+1} > 0$, $d_{t:t+1}$ represents the distance to the last system initiative turn from the next system turn at $t+1$. Similarly, to make full use of the sparse data, we only consider $d_{t:t+1} = 2$ and $d_{t:t+1} > 2$,³ which means that the distance to the last system initiative turn from the next system turn at $t+1$ is 2 and more than 2 turns, respectively.
After considering the three multi-turn features, MuSIc models: [...] where $\tilde{Y}_{1:T}$ denotes one of all possible sequences of initiative-taking decisions, [...]
$$\mathbf{e}^{\mathrm{post}}_t = \mathrm{MLP}(\mathbf{h}^{\mathrm{post}}_t), \tag{2}$$
where $t = 1, 2, \ldots, T$, $\mathbf{e}^{\mathrm{post}}_t \in \mathbb{R}^{1 \times 2}$ are posterior emission scores over {Initiative, Non-initiative}, and MLP(·) denotes a multilayer perceptron (MLP). In parallel, we calculate the prior emission scores $\mathbf{e}^{\mathrm{pri}}_{T-1} \in \mathbb{R}^{1 \times 2}$ based on the last output (at turn $T-1$) of the prior inter-utterance encoder $\mathbf{h}^{\mathrm{pri}}_{T-1}$ (see Fig. 5a):
$$\mathbf{e}^{\mathrm{pri}}_{T-1} = \mathrm{MLP}(\mathbf{h}^{\mathrm{pri}}_{T-1}). \tag{3}$$
The prior emission scores $\mathbf{e}^{\mathrm{pri}}_{T-1} \in \mathbb{R}^{1 \times 2}$ would learn to approximate the posterior emission scores $\mathbf{e}^{\mathrm{post}}_T \in \mathbb{R}^{1 \times 2}$ at turn $T$ (see Fig. 5a and Eq. 8). The parameters of the MLP in Eq. 2 and Eq. 3 are not shared.
Computing transition scores. Linear-chain CRFs do not condition a transition score on any multi-turn features:
$$g(y_t, y_{t+1}) = \mathbf{G}_{y_t, y_{t+1}}, \tag{4}$$
where $\mathbf{G} \in \mathbb{R}^{2 \times 2}$ is a transition matrix shared across all turns, and $\mathbf{G}_{y_t, y_{t+1}}$ is the transition score from the decision $y_t$ to $y_{t+1}$.
Our transition scoring function $g(y_t, y_{t+1}, r_{t:t+1}, n_{t:t+1}, d_{t:t+1})$ does condition the computation of the transition scores between adjacent initiative-taking decisions on the multi-turn features $r_{t:t+1}$, $n_{t:t+1}$, and $d_{t:t+1}$. We define separate transition matrices corresponding to different combinations of multi-turn features. For a pair of adjacent initiative-taking decisions between turn $t$ and $t+1$, we select the transition matrix corresponding to the multi-turn features from turn $t$ to $t+1$. If the transition score is only conditioned on the multi-turn feature role transition direction $r_{t:t+1}$, it is calculated as:
$$g(y_t, y_{t+1}, r_{t:t+1}) = \big(1 - I(r_{t:t+1})\big)\, \mathbf{G}^{\mathrm{s2u}}_{y_t, y_{t+1}} + I(r_{t:t+1})\, \mathbf{G}^{\mathrm{u2s}}_{y_t, y_{t+1}}, \tag{5}$$
where $I(r_{t:t+1})$ is an indicator function that equals 1 if $r_{t:t+1} = \mathrm{u2s}$ and 0 otherwise, and $\mathbf{G}^{\mathrm{s2u}} \in \mathbb{R}^{2 \times 2}$ and $\mathbf{G}^{\mathrm{u2s}} \in \mathbb{R}^{2 \times 2}$ are transition matrices corresponding to "from system to user" and "from user to system," respectively.
Given $r_{t:t+1} = \mathrm{u2s}$, if the transition score is further conditioned on the feature $n_{t:t+1}$, the number of times the system takes the initiative before the next system turn at $t+1$, it is calculated as:
$$g(y_t, y_{t+1}, r_{t:t+1}, n_{t:t+1}) = \big(1 - I(n_{t:t+1})\big)\, \mathbf{G}^{\mathrm{u2s}, n=0}_{y_t, y_{t+1}} + I(n_{t:t+1})\, \mathbf{G}^{\mathrm{u2s}, n>0}_{y_t, y_{t+1}}, \tag{6}$$
where $I(n_{t:t+1})$ is an indicator function that equals 1 if $n_{t:t+1} > 0$ and 0 otherwise, and $\mathbf{G}^{\mathrm{u2s}, n=0} \in \mathbb{R}^{2 \times 2}$ and $\mathbf{G}^{\mathrm{u2s}, n>0} \in \mathbb{R}^{2 \times 2}$ are transition matrices corresponding to "the system has not taken the initiative" and "the system has taken the initiative once or more" before the next system turn at $t+1$, respectively.
Given $r_{t:t+1} = \mathrm{u2s}$ and $n_{t:t+1} > 0$, if the transition score is further conditioned on the feature $d_{t:t+1}$, the distance to the last system initiative turn from the next system turn at $t+1$, it is calculated as:
$$g(y_t, y_{t+1}, r_{t:t+1}, n_{t:t+1}, d_{t:t+1}) = \big(1 - I(d_{t:t+1})\big)\, \mathbf{G}^{\mathrm{u2s}, n>0, d=2}_{y_t, y_{t+1}} + I(d_{t:t+1})\, \mathbf{G}^{\mathrm{u2s}, n>0, d>2}_{y_t, y_{t+1}}, \tag{7}$$
where $I(d_{t:t+1})$ is an indicator function that equals 1 if $d_{t:t+1} > 2$ and 0 otherwise, and $\mathbf{G}^{\mathrm{u2s}, n>0, d=2} \in \mathbb{R}^{2 \times 2}$ and $\mathbf{G}^{\mathrm{u2s}, n>0, d>2} \in \mathbb{R}^{2 \times 2}$ are transition matrices for "the distance to the last system initiative turn is 2 turns" and "the distance to the last system initiative turn is more than 2 turns" from the next system turn at $t+1$, respectively.
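The selection logic described in this subsection can be sketched in two small functions: one derives the three multi-turn features for a pair of adjacent turns, the other maps them to the name of the transition matrix to use. This is an illustrative reading of the text, assuming roles strictly alternate after merging consecutive same-speaker utterances; the function names and the string keys are hypothetical.

```python
from typing import List, Optional, Tuple


def multi_turn_features(roles: List[str], initiative: List[bool], t: int
                        ) -> Tuple[str, Optional[int], Optional[int]]:
    """Features for the transition from turn t to turn t+1 (0-based indices).

    roles[i] is "user" or "system"; initiative[i] is the decision at turn i.
    Only decisions at turns <= t are read, so the target turn is never used.
    Returns (r, n, d): role transition direction, number of earlier system
    initiative turns, and distance from turn t+1 back to the last such turn.
    n and d are None where the text leaves them undefined.
    """
    r = "u2s" if roles[t] == "user" and roles[t + 1] == "system" else "s2u"
    if r != "u2s":
        return r, None, None
    sys_init_turns = [i for i in range(t + 1)
                      if roles[i] == "system" and initiative[i]]
    n = len(sys_init_turns)
    d = (t + 1 - sys_init_turns[-1]) if n > 0 else None
    return r, n, d


def matrix_key(r: str, n: Optional[int], d: Optional[int]) -> str:
    """Name of the transition matrix selected by the features (hypothetical keys)."""
    if r == "s2u":
        return "s2u"
    if n == 0:
        return "u2s,n=0"
    return "u2s,n>0,d=2" if d == 2 else "u2s,n>0,d>2"


# Toy six-turn conversation; the label of the last turn is a placeholder.
roles = ["user", "system", "user", "system", "user", "system"]
initiative = [True, True, False, False, True, False]
keys = [matrix_key(*multi_turn_features(roles, initiative, t)) for t in range(5)]
```

With alternating roles, the smallest possible distance between two system turns is 2, which is consistent with the text treating "2" and "more than 2" as the only two cases.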
Training objectives. Our final loss function is defined as $\mathcal{L} = \mathcal{L}_{\mathrm{crf}} + \mathcal{L}_{\mathrm{mse}}$. We not only minimize the negative log-likelihood of the sequence $Y_{1:T}$ of ground-truth initiative-taking decisions in the context and at the next turn, but also force $\mathbf{e}^{\mathrm{pri}}_{T-1}$ to learn to approximate $\mathbf{e}^{\mathrm{post}}_T$ via an MSE loss (see Fig. 5a): [...]
Inference phase. MuSIc models the conditional probability $P(\tilde{Y}_{1:T} \mid X_{1:T-1}, S)$ of a possible sequence $\tilde{Y}_{1:T}$ of initiative-taking decisions in the context ($1 : T-1$) and at the next turn $T$ only given the sequence $X_{1:T-1}$ of utterances in the context (see Fig. 5b): [...] where the emission score for $\tilde{y}_T$ given $X_{1:T-1}$ is taken from the prior emission scores $\mathbf{e}^{\mathrm{pri}}_{T-1}$ [...]
Datasets. We consider two multi-turn CIS datasets with annotations of actions for utterances, WISE [47] and MSDialog [42,43,69]. Based on the action annotations, we annotate the initiative-taking decision for each utterance. WISE is collected through crowdsourcing; it consists of mixed-initiative conversations between two workers playing the role of user and system. All utterances are annotated with actions. We use the data split from [47]. MSDialog consists of mixed-initiative conversations between users who ask for technical help and expert users or staff (i.e., system) who help to solve problems. This dataset has two versions: the complete set and a labeled subset. Each utterance in the labeled subset is annotated with actions; we use the data split of the labeled subset from [43].
Pre-processing. Following [59,64,65], we merge consecutive utterances from either the user or system into one utterance by concatenation; their corresponding actions are merged by a union operation. See Table 1 for the statistics of the datasets. The average numbers of turns in both datasets are less than the numbers in the original papers [43,47] due to the merging operation.
Annotation of initiative-taking decision labels. For both datasets, we derive the initiative annotations by mapping the manual annotations of actions to initiative or non-initiative labels. An utterance is annotated as initiative if it is annotated with any of the actions showing initiative⁴ and non-initiative otherwise.
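The pre-processing and labeling steps can be sketched as follows: consecutive utterances by the same speaker are concatenated, their action sets are unioned, and a merged utterance counts as initiative if any of its actions is in an initiative action set. The action names below are hypothetical placeholders; the actual action-to-initiative mapping comes from the datasets' annotation schemes and is not listed in this section.

```python
from typing import List, Set, Tuple

Turn = Tuple[str, str, Set[str]]  # (role, text, actions)

# Hypothetical action names, used only to make the example runnable.
INITIATIVE_ACTIONS = {"clarify", "request_feedback"}


def merge_consecutive(turns: List[Turn]) -> List[Turn]:
    """Concatenate consecutive same-role utterances and union their actions."""
    merged: List[Turn] = []
    for role, text, actions in turns:
        if merged and merged[-1][0] == role:
            _, prev_text, prev_actions = merged[-1]
            merged[-1] = (role, prev_text + " " + text, prev_actions | actions)
        else:
            merged.append((role, text, set(actions)))
    return merged


def is_initiative(actions: Set[str]) -> bool:
    """An utterance is initiative if it carries at least one initiative action."""
    return bool(actions & INITIATIVE_ACTIONS)


raw = [
    ("user", "My app crashes.", {"question"}),
    ("user", "It happens on startup.", {"information"}),
    ("system", "Which version are you on?", {"clarify"}),
    ("user", "Version 2.", {"answer"}),
]
merged = merge_consecutive(raw)
labels = [is_initiative(actions) for _, _, actions in merged]
```

After merging, roles strictly alternate, which is why the average number of turns drops relative to the original dataset statistics.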
Baselines.We compare MuSIc with recently proposed LLM-based baselines, and three other groups of state-of-the-art baselines for the SIP task: (i) clarification need prediction, (ii) system action prediction, and (iii) linear-chain CRF-based methods.
We consider LLaMA-7B/13B/33B/65B [57] using in-context learning [9,19] as the LLM-based baselines. Mao et al. [35] prompt LLMs for conversational query rewriting and we adapt their designed prompt to SIP. We prepend the SIP task instruction at the beginning of the prompt, followed by two groups of demonstrations: (i) a few complete conversations randomly sampled from the training set, and (ii) utterances in the context $X_{1:T-1}$ prior to the next turn $T$. Given the prompt, LLaMA generates the system-initiative decision at the next system turn $T$. WISE is a Chinese language dataset; however, the original LLaMA has a limited ability to encode and decode Chinese text [15]. Cui et al. [15] release Chinese-LLaMA-Plus-7B and -13B at the time of writing. These LLaMA variants use the extended Chinese vocabulary and are further trained on Chinese data. We report the performance of both [15] on WISE.
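The prompt layout described above (task instruction, then sampled complete conversations, then the current context) can be sketched as a plain string builder. The instruction wording, line format, and function name are assumptions made for illustration; the prompt actually adapted from Mao et al. [35] is not reproduced in this section.

```python
from typing import List, Tuple

LabeledTurn = Tuple[str, str, str]  # (role, text, initiative label)
ContextTurn = Tuple[str, str]       # (role, text)


def build_prompt(instruction: str,
                 demonstrations: List[List[LabeledTurn]],
                 context: List[ContextTurn]) -> str:
    """Assemble instruction + demonstration conversations + current context."""
    parts = [instruction, ""]
    for i, conversation in enumerate(demonstrations, start=1):
        parts.append(f"Example conversation {i}:")
        for role, text, label in conversation:
            parts.append(f"{role}: {text} [{label}]")
        parts.append("")
    parts.append("Current conversation:")
    for role, text in context:
        parts.append(f"{role}: {text}")
    parts.append("System initiative at the next turn:")
    return "\n".join(parts)


# Hypothetical instruction and toy demonstrations.
instruction = ("Decide whether the system should take the initiative "
               "at the next turn. Answer Initiative or Non-initiative.")
demo = [("user", "Any good laptops?", "Initiative"),
        ("system", "What is your budget?", "Initiative")]
prompt = build_prompt(instruction, [demo, demo], [("user", "How do I export my notes?")])
```

The prompt ends right where the model is expected to generate the decision, so the first generated tokens can be mapped to one of the two labels.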
We train and test two clarification need prediction models on SIP: (i) CtxPred (BERT) uses a BERT encoder to encode the context and predict whether to take the initiative at the next turn [2,3,68]. (ii) Risk-aware Conversational Search agent with Q-learning (RCSQ) is fed with the context, clarifying question and answer candidates returned by retrievers, and is trained with a user simulator by reinforcement learning [64,65]. To adapt it to SIP,⁵ we replace the clarifying question and answer candidates with initiative and non-initiative system utterance candidates retrieved by bi-encoders;⁶ we also replace Q-learning with supervised learning using the annotations of initiative-taking decisions.
We also compare MuSIc with the state-of-the-art system action prediction method Co-Gen [70]. Co-Gen generates actions and responses concurrently; the two generators share a common latent space. We consider two variants of Co-Gen:⁷ (i) Co-Gen (action prediction) is trained with action and response generation; the model outputs actions based on which we derive initiative-taking decisions using our action-initiative mapping. (ii) Co-Gen (SIP) is trained with SIP and response generation; the action generator in the original paper directly learns SIP to output the initiative-taking decision at the next turn.
Linear-chain CRF-based methods cannot be directly applied to SIP as they need to be given the unobservable utterance at the next turn. Based on the same BERT utterance encoder and prior-posterior inter-utterance encoders as in MuSIc, we implement the following: (i) VanillaCRF only uses a unique transition matrix (see Eq. 4). (ii) VanillaCRF+features feeds the three multi-turn features into the prior-posterior inter-utterance encoders by encoding the multi-turn features as one-hot vectors at each turn and concatenating the vectors with the BERT utterance representation. (iii) DynamicCRF uses adjacent input observations $x_t, x_{t+1}$ to generate a dynamic transition matrix $\mathbf{G}_{x_t, x_{t+1}}$ to model the dependency between the corresponding output decisions $y_t, y_{t+1}$ [24,27,54,55]. $x_T$ is unseen, so $\mathbf{G}_{x_{T-1}, x_T}$ cannot be computed. Like the calculation of the prior/posterior emission scores in MuSIc, we use the output of the prior inter-utterance encoder $\mathbf{h}^{\mathrm{pri}}_{T-1}$ to generate a prior transition matrix $\mathbf{G}_{x_{T-1}}$ for the output decisions $y_{T-1}, y_T$; $\mathbf{G}_{x_{T-1}}$ approximates a posterior matrix $\mathbf{G}_{x_{T-1}, x_T}$ generated by the output of the posterior encoder $\mathbf{h}^{\mathrm{post}}_{T-1}, \mathbf{h}^{\mathrm{post}}_T$ via an MSE loss.
Evaluation metrics. Because SIP is a binary classification problem, we use macro-averaged F1, precision, recall, and accuracy.
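The evaluation metrics can be computed as in the sketch below, which assumes the common convention that macro-averaged precision, recall, and F1 are unweighted means of the per-class values (the convention also used by scikit-learn's `average="macro"`). The function name and the toy labels are illustrative; whether the reported numbers follow exactly this convention is an assumption.

```python
from typing import Dict, Sequence


def macro_scores(gold: Sequence[int], pred: Sequence[int],
                 labels: Sequence[int] = (0, 1)) -> Dict[str, float]:
    """Macro-averaged precision, recall, F1, plus accuracy, for label lists."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    return {
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
        "accuracy": sum(1 for g, p in zip(gold, pred) if g == p) / len(gold),
    }


# 1 = Initiative, 0 = Non-initiative (toy labels with class imbalance).
gold = [1, 0, 0, 0]
pred = [1, 1, 0, 0]
scores = macro_scores(gold, pred)
```

Because each class contributes equally to the macro average, a model that always predicts the majority class is penalized, which is relevant here given the imbalance between initiative and non-initiative turns.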
Implementation details. For all models except LLaMA, we use BERT encoders (BERT-base) on all datasets, set the hidden size to 768, batch whole conversations instead of individual turns, set the overall learning rate to 0.00002, use the Adam optimizer [29], and pick the best checkpoint in terms of F1 on the validation set.⁸ For LLaMA of all sizes, we randomly sample 2 complete conversations from the training set of WISE/MSDialog as demonstrations since other numbers lead to degraded performance. Note that all methods need to predict initiative-taking decisions for all system turns in all conversations in a dataset. Our code and data resources are available at https://github.com/ChuanMeng/SIP.

RESULTS AND ANALYSIS
6.1 Performance comparison
To answer RQ1, the results of MuSIc and all baselines on WISE and MSDialog are presented in Table 2. We have five observations. First, LLaMA-7B/13B gets the worst result on WISE; on MSDialog, LLaMA-13B outperforms CtxPred (BERT), and is comparable to VanillaCRF and DynamicCRF, showing the effectiveness of LLMs. However, LLaMA with a larger parameter size even performs worse. This problem is also known as inverse scaling [36]. McKenzie et al. [36] identify four potential causes of it and highlight that there is still much to uncover in understanding it. Further investigation of this problem on SIP is left for future work.
Second, MuSIc and the linear-chain CRF-based methods outperform CtxPred (BERT). In terms of F1, VanillaCRF outperforms CtxPred (BERT) by 0.59% and 2.14% on WISE and MSDialog, respectively. The gains indicate that it is beneficial for SIP to capture dependencies between adjacent initiative-taking decisions.
⁷ We use the code released by the author and adapt Co-Gen to SIP by making three changes: (i) we replace the GRU encoder with a BERT encoder like MuSIc has; (ii) Co-Gen requires a state vector (belief state and database records) that does not exist in CIS, so we replace the state vector with one-hot vectors encoding the current multi-turn features; and (iii) we remove reinforcement learning in Co-Gen as the rewards (task completion) do not exist in either CIS dataset.
⁸ We found that F1 can better show the ability of a model to deal with the class imbalance problem according to experimental results on the WISE and MSDialog validation sets.
Third, both MuSIc and VanillaCRF+features outperform VanillaCRF and DynamicCRF, indicating that it is beneficial for SIP to take into account the impact of multi-turn features on an initiative-taking decision. Also, in terms of F1, MuSIc outperforms VanillaCRF+features by more than 3% on both datasets, underlining the importance of introducing such impact in the CRF layer.
Fourth, Co-Gen (action prediction) performs poorly, indicating that SIP cannot be effectively inferred from the predicted system actions. This could be due to the large action space, making the model prone to action prediction errors, which would propagate to SIP. It also implies the potential of SIP to reduce the decision space of action prediction, which we discuss in response to RQ4. Co-Gen (SIP) outperforms Co-Gen (action prediction), suggesting that sharing a common latent space between SIP and response generation is beneficial; however, MuSIc does not use that information.
Fifth, MuSIc outperforms RCSQ, which uses system initiative and non-initiative utterance candidates returned by retrieval models, whereas MuSIc does not have access to such information.MuSIc outperforms RCSQ in terms of F1 by 2.51% and 2.89% on WISE and MSDialog, respectively, confirming the effectiveness of MuSIc.

Visualisation of transition matrices
We show MuSIc's transition matrices $G^{u2s}$, $G^{u2s,n=0}$, $G^{u2s,n>0,d=2}$ and $G^{u2s,n>0,d>2}$ on WISE and MSDialog in Fig. 6. We see different patterns in each transition matrix, indicating that different transition patterns are associated with different cases: (i) $G^{u2s,n=0}$ shows that the user's initiative tends to transition to the system's initiative when the system has not taken the initiative before. This corresponds to cases where the system tends to take the initiative for the first time to ask a clarifying question after the user has asked a question. (ii) $G^{u2s,n>0,d=2}$ shows that the user's initiative tends to transition to the system's non-initiative if the system has taken the initiative at the last system turn. In other words, the system is less likely to take the initiative in two consecutive system turns if the user takes the initiative in the middle. (iii) According to $G^{u2s,n>0,d>2}$, we see that compared to $G^{u2s,n>0,d=2}$, if the system has not taken the initiative at the last system turn, the possibility of system initiative increases, especially when the user takes the initiative (on MSDialog). This corresponds to cases where the system takes the initiative once again to ask for feedback after answering a question from the user. The complexities of the patterns described above indicate that MuSIc effectively captures the impact of multi-turn features on an initiative-taking decision.
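As a rough illustration of how the three feature-specific matrices above partition the cases, the following Python sketch maps the conversation history to the matrix that would apply to a user-to-system transition. It uses n for the number of times the system has taken the initiative and d for the distance to the last system-initiative turn. The function names, the 1-based turn indexing, and the convention that d is counted in turns back from the upcoming system turn are assumptions made for this example; they are not taken from the MuSIc implementation.

```python
from typing import List, Optional, Tuple


def history_features(system_initiative_turns: List[int],
                     next_system_turn: int) -> Tuple[int, Optional[int]]:
    """Return (n, d) for the upcoming system turn.

    n: how many times the system has taken the initiative so far.
    d: distance in turns to the most recent system-initiative turn,
       or None if the system has never taken the initiative.
    Turn indices are 1-based (an assumption of this sketch).
    """
    n = len(system_initiative_turns)
    d = next_system_turn - max(system_initiative_turns) if n else None
    return n, d


def matrix_for(n: int, d: Optional[int]) -> str:
    """Name the case that applies to a user-to-system transition."""
    if n == 0:
        return "n=0"          # system has not taken the initiative before
    if d is None or d < 2:
        raise ValueError("d must be at least 2 once n > 0")
    if d == 2:
        return "n>0,d=2"      # initiative taken at the last system turn
    return "n>0,d>2"          # last initiative was further back


# Hypothetical history: the system took the initiative at turn 2.
print(matrix_for(*history_features([2], next_system_turn=4)))
print(matrix_for(*history_features([2], next_system_turn=6)))
```

With users speaking on odd turns and the system on even turns, d = 2 means the previous system turn was an initiative turn, which matches case (ii) above.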

Effect of different multi-turn features
To answer RQ2, we evaluate MuSIc with different multi-turn features on WISE and MSDialog. We consider four settings: (i) (r, n, d) is our final model considering all features (Eq. 7); (ii) (r, n) does not consider the distance to the last system-initiative turn (Eq. 6); (iii) (r) additionally does not consider the number of times the system has taken the initiative (Eq. 5); (iv) "-" does not consider any feature, reducing to VanillaCRF (Eq. 4). See Table 3. All proposed multi-turn features contribute to the success of MuSIc. On WISE, MuSIc's F1 score shows the biggest drop (0.92%) after removing the role transition direction ((r) vs. "-"). On MSDialog, MuSIc's F1 score shows the biggest drop (1.96%) after removing the number of times the system has taken the initiative ((r, n) vs. (r)).

Benefits of SIP on other tasks
We have demonstrated the effectiveness of MuSIc on SIP. Next, we illustrate two applications of SIP.
Improving clarification need prediction via transfer learning.
To answer RQ3, we examine the benefits of SIP to clarification need prediction (CNP) [2,3,5,64,65,68]. We examine whether knowledge shared among system-initiative actions learned through SIP on a dataset (MSDialog) can be reused to improve clarification need prediction on the single-turn ClariQ dataset [2,3]. We adopt MuSIc and the two strong clarification need prediction baselines, CtxPred (BERT) [2,3,68] and RCSQ [64,65], in two settings: (i) a supervised setting (CNP, ClariQ), where we only train models on the ClariQ training dataset, and (ii) a transfer learning setting (SIP, MS. → CNP, ClariQ), where we first get the best checkpoints pre-trained on SIP on the MSDialog training set and then fine-tune them on the ClariQ training dataset. We also introduce MiniLm-ANC [5], an unsupervised learning method for clarification need prediction. We follow [5] to binarize the graded clarification need scores on ClariQ, which range from 1 (no need for clarification) to 4 (clarification is necessary). Unlike [5], where scores are split in the middle, we only regard score 1 as not asking a clarifying question because the author of ClariQ states that clarification is still needed for scores 2 and 3, but not as much as for score 4. We present the results in Table 4.
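The two binarization schemes can be made concrete with a small Python sketch. The function and scheme names are hypothetical. Reading "split in the middle" as mapping scores 1 and 2 to "no clarification" and scores 3 and 4 to "clarification" is our interpretation of that phrase.

```python
def binarize_clarification_need(score: int, scheme: str = "only_1_negative") -> int:
    """Map a graded ClariQ clarification-need score (1..4) to a binary label.

    Returns 1 if a clarifying question should be asked, else 0.
    """
    if score not in (1, 2, 3, 4):
        raise ValueError("ClariQ clarification need scores range from 1 to 4")
    if scheme == "only_1_negative":
        # Scheme used in this section: only score 1 means "do not ask".
        return int(score > 1)
    if scheme == "middle_split":
        # Assumed reading of the split-in-the-middle alternative.
        return int(score > 2)
    raise ValueError(f"unknown scheme: {scheme}")


for s in (1, 2, 3, 4):
    print(s, binarize_clarification_need(s),
          binarize_clarification_need(s, "middle_split"))
```

The two schemes disagree only on score 2, so the choice mainly shifts the class balance of the binary task.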
MuSIc outperforms strong baselines on the single-turn ClariQ dataset in the supervised setting; it outperforms MiniLm-ANC and RCSQ (CNP, ClariQ), which use retrieved documents, by 6.88% and 3.07% in terms of F1 score, respectively. Transfer learning from SIP to clarification need prediction benefits MuSIc and the baselines: performance increases with knowledge shared among system-initiative actions acquired from SIP. MuSIc (SIP, MS. → CNP, ClariQ) shows an increase (3.77%) in terms of F1 compared to MuSIc (CNP, ClariQ), significantly exceeding all baselines in the transfer learning setting and achieving state-of-the-art performance on ClariQ.
Because the MSDialog training set contains system utterances of clarifying questions, pre-training on SIP on the MSDialog dataset

Improving downstream action prediction. To answer RQ4, we propose a SIP-aware action prediction framework where action prediction is fed with the initiative-taking decision predicted by MuSIc. In our scenario, the system can take multiple actions per turn. Multi-action system action prediction is typically modeled as multi-label classification [25,34,67] or sequence generation [28,34,52,63]. We adopt one typical model of each type and a state-of-the-art system action prediction method, Co-Gen [70]: (i) following [25,34,67], we construct a multi-label classification model by using a BERT encoder to encode the context and feeding the [CLS] token to an MLP followed by a sigmoid activation function to perform binary classification for each action; (ii) following [28,34,52,63], we construct a sequence generation model by using BERT to encode the context and feeding the [CLS] token to a GRU decoder to decode actions step by step; and (iii) Co-Gen is a sequence generation model, and we use Co-Gen (action prediction) (see Section 5) to generate actions. To inject initiative-taking decisions into these models, we first embed an initiative-taking decision (annotated during training and predicted by MuSIc during inference) into a 768-dimensional vector. For the models under (i) and (ii), we concatenate the vector with the [CLS] token and feed the concatenation to the MLP or the GRU decoder, respectively. For Co-Gen, we concatenate the vector with the context representation (see [70]).
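A minimal sketch of the injection step for the multi-label classifier under (i) is given below. It is not the implementation used in the experiments: a single linear layer stands in for the MLP, a random vector stands in for the BERT [CLS] representation, and the weights are random rather than trained. The number of actions and all identifiers are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

HIDDEN = 768  # size of the [CLS] vector and of the decision embedding


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_actions(cls_vec: Sequence[float],
                    decision: int,
                    decision_emb: Sequence[Sequence[float]],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float],
                    threshold: float = 0.5) -> Tuple[List[int], List[float]]:
    """Concatenate [CLS] with the decision embedding, then score each action.

    decision: 0 for non-initiative, 1 for initiative (annotated during
    training, predicted by the SIP model during inference).
    """
    fused = list(cls_vec) + list(decision_emb[decision])  # 2 * HIDDEN values
    probs = [sigmoid(sum(w * x for w, x in zip(row, fused)) + b)
             for row, b in zip(weights, bias)]
    labels = [int(p >= threshold) for p in probs]  # one binary choice per action
    return labels, probs


rng = random.Random(0)
NUM_ACTIONS = 4  # hypothetical size of the action set
decision_emb = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)] for _ in range(2)]
weights = [[rng.uniform(-0.01, 0.01) for _ in range(2 * HIDDEN)]
           for _ in range(NUM_ACTIONS)]
bias = [0.0] * NUM_ACTIONS
cls_vec = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]  # stand-in for BERT output

labels, probs = predict_actions(cls_vec, 1, decision_emb, weights, bias)
print(labels)
```

In a trained model, the decision embedding lets the classifier learn that system-initiative actions are plausible only when the injected decision is "initiative".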
For evaluation, we adopt the same metrics as in the previous sections except for accuracy. Accuracy here is measured by the Hamming score (a.k.a. the intersection over the union) [23], which is widely used in multi-label classification evaluation [43]. Table 5 shows the results. The performance of the three action prediction models fed with the initiative-taking decision predicted by MuSIc (+ MuSIc) is significantly improved compared to the same models without SIP results. We think that this is because SIP, when effective, can reduce the action space of the downstream action prediction models. However, the downstream action prediction model cannot solve the SIP task (see Section 6.1). This shows that action prediction cannot replace SIP, reiterating the effectiveness of SIP in benefiting downstream tasks.
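For reference, the Hamming score can be sketched in a few lines of Python: for every example, the size of the intersection of gold and predicted action sets is divided by the size of their union, and the ratios are averaged. Scoring an example as 1.0 when both sets are empty is a convention assumed here; the action labels in the toy data are borrowed from Figure 1 purely for illustration.

```python
from typing import Iterable, Sequence


def hamming_score(gold: Sequence[Iterable[str]], pred: Sequence[Iterable[str]]) -> float:
    """Mean per-example intersection over union of gold and predicted label sets."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    total = 0.0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        union = g_set | p_set
        # Assumed convention: two empty sets count as a perfect match.
        total += 1.0 if not union else len(g_set & p_set) / len(union)
    return total / len(gold)


gold = [{"CQ"}, {"IR", "RC"}]
pred = [{"CQ"}, {"IR"}]
print(hamming_score(gold, pred))  # (1/1 + 1/2) / 2 = 0.75
```

Unlike exact-match accuracy, this metric gives partial credit when only some of the actions in a turn are predicted correctly.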

Error analysis
We conduct an error analysis of SIP. We group system initiative utterances in the test sets of WISE and MSDialog according to their annotated system-initiative actions; utterances in each group share the same system-initiative action. See Fig. 7. MuSIc can still perform well on some system-initiative actions that only take up a limited proportion of the training sets. For example, on MSDialog, the percentage of CQ is far lower than the percentage of IR in the training set, but the performance of MuSIc on CQ and IR in the test set is comparable. This suggests that SIP enables knowledge sharing among various system-initiative actions, benefiting individual system-initiative actions. For revise (RV), there are only 4 and 3 system utterances of this type in the WISE training and test sets, respectively; these numbers are too small to properly evaluate the performance.

CONCLUSIONS AND FUTURE WORK
We have introduced the task of system initiative prediction (SIP), which is to predict whether a CIS system should take the initiative at the next turn. We found that it is natural to utilize probabilistic graphical models for SIP, but we faced two main challenges: solving the input-incomplete sequence labeling problem and explicitly modeling multi-turn features. To solve these challenges, we proposed MuSIc, which has (i) prior-posterior inter-utterance encoders to adapt CRFs to input-incomplete sequence labeling by eliminating the need to be given the unobservable system utterance at the next turn, and (ii) a multi-turn feature-aware CRF layer to jointly consider dependencies between adjacent user–system initiative-taking decisions and the impact of multi-turn features on an initiative-taking decision.
Experiments on two CIS datasets show that MuSIc outperforms various baselines including LLMs and achieves state-of-the-art performance on SIP. A visual analysis of the learned transition matrices illustrates MuSIc's interpretability and transparency. Transferring knowledge shared among system-initiative actions learned through SIP to the clarification need prediction task greatly benefits that task; MuSIc achieves state-of-the-art performance on ClariQ. Lastly, SIP significantly improves the downstream action prediction task through the proposed SIP-aware action prediction framework.
As to MuSIc's limitations and future work, MuSIc does not utilize retrieved documents to improve SIP. Recent research into query performance prediction (QPP) on conversational search [37,38] has shown that QPP can model retrieved documents and has the potential to help a CIS system take appropriate action at the next turn. We plan to incorporate QPP-based features into our model. Clearly, splitting out SIP as a separate task adds complexity to CIS systems. Pre-training a model on SIP to learn knowledge shared among system-initiative actions and then fine-tuning the model on other tasks does not change the model architecture; it only increases training time and does not affect inference time. Our proposed SIP-aware action prediction framework models SIP and action prediction as a two-stage process, which carries additional computational costs at inference time. We plan to improve the efficiency in the future, e.g., by modeling SIP and action prediction jointly in one stage.

Figure 4: Overview of MuSIc. Its target is to predict the optimal sequence of initiative-taking decisions in the context $1:T-1$ and at the next turn $T$ given the utterances over turns $1:T-1$. I/N at the top denotes Initiative/Non-initiative.
4.2.3 Multi-turn feature-aware CRF layer. During training, we feed the unobservable system utterance $x_T$ to MuSIc and model the conditional probability $P(Y_{1:T} \mid X_{1:T})$ of the sequence $Y_{1:T}$ of initiative-taking decisions in the context and at the next turn given the sequence $X_{1:T}$ of utterances in the context and at the next turn. We consider three multi-turn features $S = \{r_{t:t+1}, n_{t:t+1}, d_{t:t+1}\}_{t=1}^{T-1}$
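To illustrate what it means for the CRF layer to be aware of multi-turn features, the sketch below decodes a decision sequence with the Viterbi algorithm [61] while allowing a different transition matrix at every step. In a feature-aware layer, the matrix used between two adjacent turns would be chosen according to the multi-turn features for that pair of turns. All scores here are made-up numbers, labels 0 and 1 stand for non-initiative and initiative, and the code is a generic linear-chain decoder written for illustration, not MuSIc's implementation.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def sequence_score(path: Sequence[int], emissions: Matrix,
                   transitions: Sequence[Matrix]) -> float:
    """Unnormalized score: emission scores plus step-specific transition scores."""
    total = emissions[0][path[0]]
    for t in range(1, len(path)):
        total += transitions[t - 1][path[t - 1]][path[t]] + emissions[t][path[t]]
    return total


def viterbi(emissions: Matrix, transitions: Sequence[Matrix]) -> Tuple[List[int], float]:
    """Return the highest-scoring label sequence and its score.

    emissions[t][y]: score of label y at turn t.
    transitions[t][a][b]: score of moving from label a at turn t to label b
    at turn t + 1; one matrix per adjacent pair of turns.
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        step = transitions[t - 1]
        new_score, pointers = [], []
        for cur in range(num_labels):
            prev = max(range(num_labels), key=lambda a: score[a] + step[a][cur])
            pointers.append(prev)
            new_score.append(score[prev] + step[prev][cur] + emissions[t][cur])
        score = new_score
        backpointers.append(pointers)
    last = max(range(num_labels), key=lambda y: score[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy three-turn example with two step-specific transition matrices.
emissions = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
transitions = [
    [[0.5, 0.1], [0.4, 0.0]],
    [[0.1, 0.6], [0.7, 0.2]],
]
best_path, best_score = viterbi(emissions, transitions)
print(best_path, best_score)
```

Because the argmax does not depend on the CRF's normalization constant, decoding only needs these unnormalized scores.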

Figure 5: Prior-posterior inter-utterance encoders and multi-turn feature-aware CRF layer during (a) training and (b) inference. The system utterance at the next turn $T$ can be accessed by the posterior inter-utterance encoder only during training.

𝑥 𝑇 − 1
, ỹ  if  =  and e   , ỹ  otherwise (see Fig. 5b). The optimal sequence $Y^{*}_{1:T}$ of initiative-taking decisions in the context and at the next turn is decoded by the Viterbi algorithm [61]:

$$Y^{*}_{1:T} = \arg\max_{\tilde{Y}_{1:T}} P(\tilde{Y}_{1:T} \mid X_{1:T-1}, S). \tag{10}$$

5 EXPERIMENTAL SETUP
Research questions. (RQ1) To what extent does MuSIc improve performance on the SIP task compared to state-of-the-art baselines? (RQ2) What is the effect of multi-turn features on the performance of MuSIc? (RQ3) To what extent does knowledge shared among various system-initiative actions learned through SIP benefit the clarification need prediction task? (RQ4) To what extent does the SIP task benefit the downstream action prediction task?

Figure 6: MuSIc's transition matrices learned on WISE and MSDialog. N and I denote non-initiative and initiative, respectively. See Section 4.2.3 for more information about each transition matrix. Transition scores are normalized across columns. Darker colors indicate higher scores.
already includes the pre-training of clarification need prediction. Is the improvement from transfer learning due to the model learning knowledge shared among various system-initiative actions on the SIP task, or is it due to the model simply being augmented with more training examples of clarification need prediction from MSDialog? To determine this, we introduce MuSIc (CNP, MS. → CNP, ClariQ), which is only pre-trained on clarification need prediction on the MSDialog training dataset, i.e., pre-trained on the subset of SIP training examples that contain clarifying questions. The performance of MuSIc (SIP, MS. → CNP, ClariQ) shows an increase (2%) in terms of F1 score compared to the performance of MuSIc (CNP, MS. → CNP, ClariQ), confirming that shared knowledge of various system-initiative actions learned through SIP benefits the model.

Figure 7: SIP accuracy over utterance groups (utterances in one group share the same system-initiative action) in the test sets and percentages of system-initiative actions in the training sets. Abbreviations are explained in Figure 1.
Figure 1: Distribution of system-initiative actions in two realistic CIS training datasets, WISE and MSDialog. CQ: clarifying question (called clarify in WISE); IR: information request (called request in WISE); RV: revise; RC: recommendation (ask users if they would like something); OQ: original question; RQ: repeat question; and FQ: follow-up question.

• We conduct experiments on two multi-turn CIS datasets, showing state-of-the-art performance of MuSIc on SIP.

Table 1: Statistics of the WISE and MSDialog datasets after preprocessing; conv. is short for "conversation."

Table 2: Performance comparison on SIP. Significant improvements over the best baseline results are marked with * (t-test, $p < 0.05$). The significance test is only performed on accuracy because it gives a score for each individual example, while other metrics evaluate the performance over all examples. Chinese versions of LLaMA-33B/65B are unavailable at the time of writing.

Table 3: Effect of multi-turn features in MuSIc. Notation for features explained in Section 6.3.

Table 4: Performance on clarification need prediction on ClariQ. (CNP, ClariQ) indicates models in the supervised setting, where we only train the models on the ClariQ training dataset; (SIP, MS. → CNP, ClariQ) indicates models in the transfer learning setting, where we further fine-tune the best checkpoints, pre-trained on SIP, on the ClariQ training dataset; MuSIc (CNP, MS. → CNP, ClariQ) is pre-trained only on the SIP examples containing clarifying questions in the MSDialog training dataset. Significant improvements over the best baseline results are marked with * (t-test, $p < 0.05$).
MS. → CNP, ClariQ)    63.03    69.74    60.61    86.89
MuSIc (SIP, MS. → CNP, ClariQ)    65.03    78.16    61.56    88.52*