Segment Augmentation and Prediction Consistency Neural Network for Multi-label Unknown Intent Detection

Multi-label unknown intent detection is a challenging task in which each utterance may contain not only multiple known intents but also unknown ones. To tackle this challenge, pioneering work proposed to first predict the number of intents in the utterance, then compare it with the number of matched known intents to decide whether the utterance contains unknown intent(s). Though it has made remarkable progress on this task, this method still suffers from two important issues: 1) using only utterance encoding is inadequate for extracting multiple intents; 2) optimizing the two sub-tasks (intent number prediction and known intent matching) independently leads to inconsistent predictions. In this paper, we propose to incorporate segment augmentation, rather than relying on utterance encoding alone, to better detect multiple intents. We also design a prediction consistency module to bridge the gap between the two sub-tasks. Empirical results on MultiWOZ2.3 show that our method achieves state-of-the-art performance and improves over the best baseline significantly.


INTRODUCTION
Intent detection is a crucial component of task-oriented dialogue systems, which aims at matching an utterance to its corresponding known intent set. Though there have been dozens of studies on single-label intent detection [11], in a realistic environment a dialogue system is expected to identify both in-distribution (IND) and out-of-distribution (OOD) utterances [7] because of the openness of users' expressions, so that the system can take proper action to provide a good user experience [3]. Most recent works study OOD detection under the assumption that each utterance has only one intent [8, 12, 13, 15], and these approaches have achieved promising performance on this task. However, in real-world scenarios an utterance usually expresses several intents. For example, Gangadharaiah and Narayanaswamy [4] reveal that 52% of utterances in an Amazon internal dataset are multi-label. Obviously, research on single-label unknown intent detection cannot fully meet the needs of dialogue systems.
To deal with the challenge of multi-label unknown intent detection, Ouyang et al. [11] propose to first predict the number of intents in the utterance, then compare it with the results of known intent matching. If the number of matched known intents is less than the predicted intent number, the utterance is judged as OOD (containing only unknown intents, or both known and unknown intents). Otherwise, the utterance is judged as IND (containing only known intents). Their method achieves SOTA performance on this task.
However, Ouyang et al.'s [11] work still suffers from two important issues. First, encoding only the whole utterance is inadequate to extract the rich information of multiple intents. Second, the two sub-tasks (i.e., intent number prediction and known intent matching) are optimized independently, which may lead to inconsistent predictions. To address these problems, in this paper we design a novel Segment Augmentation and Prediction Consistency neural network (SAPC) for multi-label unknown intent detection.

Figure 1: An example of an utterance containing multiple intents from MultiWOZ2.3 [6]. Utterance: "I also need information on the attractions that have multiple sports in town, in the same area as the restaurant please." Intents: Inform-Attraction-Area, Inform-Attraction-Type. Tokens in yellow are used as delimiter tokens to get text segments.

Specifically, we propose to take advantage of both utterance encoding and segment encoding so that the model better extracts the intents behind text segments. As shown in Figure 1, we use delimiter tokens¹ to get several text segments, among which "have multiple sports" and "same area" correspond to the intents Inform-Attraction-Type and Inform-Attraction-Area respectively. In addition, we design a prediction consistency network that uses the known intent matching results together with a preliminary intent number prediction to achieve a more accurate intent number. The contributions of this paper are threefold: 1) We propose to utilize segment representations to enhance the performance of multi-label unknown intent detection; 2) We design a prediction consistency module to bridge the gap between intent number prediction and known intent matching; 3) Experiments on the MultiWOZ2.3 dataset demonstrate that our model achieves state-of-the-art performance on all metrics.

TASK DEFINITION
Multi-label unknown intent detection aims to detect whether an utterance refers to unknown intents, under the assumption that each utterance may contain multiple intents. Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$ in which $x_i$ is an utterance and $y_i$ is the set of intents contained in $x_i$, we have $y_i \subseteq \mathcal{Y}_{known}$, where $\mathcal{Y}_{known}$ is the set of all known intents. At test time, we consider an utterance to be OOD if not all intents in its intent set $y$ are in $\mathcal{Y}_{known}$.
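The resulting decision rule (compare the predicted intent number with the matched known intents) can be sketched in a few lines; the function name and signature are ours, not the original implementation:

```python
def decide_ood(matched_known_intents, predicted_intent_num):
    """Label the utterance OOD when fewer known intents were matched
    than the predicted total number of intents; otherwise IND."""
    return "OOD" if len(matched_known_intents) < predicted_intent_num else "IND"

# An utterance predicted to carry 2 intents but matching only 1 known intent:
print(decide_ood({"Inform-Attraction-Area"}, 2))   # OOD
# All predicted intents are covered by known-intent matches:
print(decide_ood({"Inform-Attraction-Area", "Inform-Attraction-Type"}, 2))  # IND
```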

METHOD

Overall Description
As shown in Figure 2, SAPC consists of three modules: a shared encoder, a segment-augmented known intents extractor, and a prediction consistency module. Specifically, we conduct utterance encoding and segment encoding through the same encoder [1]. By considering token-level and segment-level representations as augmentation, the segment-augmented known intents extractor can further extract known-intent-wise representations. Then the prediction consistency module obtains a more precise intent number based on the known intent matching results and a preliminary intent number. At last, we compare the final predicted intent number with the number of matched known intents to decide whether the utterance is OOD.

¹ We use prepositions, conjunctions, articles, and punctuation as delimiter tokens in this paper.
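The delimiter-based segmentation can be sketched as follows; the delimiter word list here is an illustrative subset (the paper only states the categories: prepositions, conjunctions, articles, and punctuation), not the authors' exact set:

```python
# Illustrative subset of delimiter tokens (not the authors' full list).
DELIMITERS = {"on", "in", "as", "that", "the", "a", "an", "and", ",", "."}

def split_segments(tokens):
    """Split a token list into text segments, cutting at (and dropping)
    delimiter tokens; consecutive delimiters produce no empty segments."""
    segments, current = [], []
    for tok in tokens:
        if tok.lower() in DELIMITERS:
            if current:
                segments.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments

print(split_segments("I also need information on the attractions".split()))
# [['I', 'also', 'need', 'information'], ['attractions']]
```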

Model Architecture
Shared Encoder. We employ BERT [2] as our shared encoder to conduct both utterance encoding and segment encoding. For an utterance $x = \{t_1, t_2, ..., t_n\}$, where $n$ is the number of tokens, the sequence of hidden states can be written as $H^u = \{h_{CLS}, h_1, ..., h_n\}$. For the $m$ segments of the utterance, noted as $S = \{s_1, s_2, ..., s_m\}$, we encode each of them respectively and take the hidden state of the [CLS] token as its representation.

Segment-augmented Known Intents Extractor. For further interaction with global information, we prepend $h_{CLS}$ to the segment representations, and the resulting sequence is passed through a self-attention layer [16] to obtain the final segment representation sequence $H^s$. We utilize a label-wise attention mechanism [9, 11] to extract the known-intent-wise representations from both $H^u$ and $H^s$. Taking $H^u$ as an example, we initialize a trainable query $q_k$ for each known intent $k$, apply the query to calculate attention over the hidden states, and aggregate them to get the intent-wise representation $r_k^u$ for intent $k$:
$$\alpha_{k,i} = \mathrm{softmax}_i\left(q_k^\top h_i\right), \qquad r_k^u = \sum_i \alpha_{k,i} h_i.$$
In the same way, we get $r_k^s$ from $H^s$. Then we utilize a linear layer to fuse the information from $r_k^u$ and $r_k^s$ by $r_k = \mathrm{Linear}([r_k^u; r_k^s])$, in which $[\cdot\,;\cdot]$ means concatenation and $r_k$ has the same hidden size as $r_k^u$ and $r_k^s$. Based on the Mahalanobis distance between $r_k$ and $\mathcal{N}(\mu_k, \Sigma)$ (i.e., the supposed conditional Gaussian distribution of the $k$-th known intent) [10], the SAPC model can identify the known intents behind the utterance, as described in detail in [11].

Prediction Consistency Module. Ouyang et al.'s work [11] simply maps $h_{CLS}$ to a scalar through a linear layer to predict the intent number, without interacting with the known intent matching task, which could provide explicit guidance. In our SAPC model, we keep their method to get a preliminary intent number $n_p$ first. Besides, we calculate the cosine similarity score $c_k$ between $r_k$ and $\mu_k$ for each of the $K$ known intents. After concatenating $n_p$ and $c = [c_1, ..., c_K]$ into a new vector $v$ of hidden size $K + 1$, we get an enhanced global hidden state $\hat{h}$ as
$$\hat{h} = \mathrm{FeedForward}\left(h_{CLS} + \mathrm{Linear}(v)\right),$$
which is then mapped to a scalar as the final predicted intent number $n_f$:
$$n_f = \mathrm{Linear}(\hat{h}).$$
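The label-wise attention and fusion steps can be sketched with NumPy as below. This is a shape-level illustration only: the queries and the fusion matrix are random stand-ins for trained parameters, and all dimension sizes are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def label_wise_attention(H, Q):
    """H: (n, d) hidden states; Q: (K, d), one trainable query per known intent.
    Returns (K, d): one attention-pooled representation per intent."""
    scores = Q @ H.T                   # (K, n) relevance of each state to each intent
    alpha = softmax(scores, axis=-1)   # attention weights over hidden states
    return alpha @ H                   # (K, d) weighted aggregation

rng = np.random.default_rng(0)
d, n, m, K = 8, 10, 4, 3               # hidden size, tokens, segments, known intents
Hu = rng.normal(size=(n, d))           # token-level hidden states (utterance encoding)
Hs = rng.normal(size=(m, d))           # segment-level representations
Q = rng.normal(size=(K, d))            # trainable intent queries (random stand-in)

Ru = label_wise_attention(Hu, Q)       # intent-wise representations from the utterance
Rs = label_wise_attention(Hs, Q)       # intent-wise representations from the segments
W = rng.normal(size=(2 * d, d))        # stand-in for the fusion Linear layer
R = np.concatenate([Ru, Rs], axis=-1) @ W   # (K, d) fused representations r_k
```

Each row of `R` plays the role of $r_k$, which is then scored against the per-intent Gaussian via Mahalanobis distance.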

Training Objective
Intent number loss. We use mean-squared error (MSE) loss to optimize both the $n_p$ and $n_f$ predictions. Inspired by Xing et al.'s work [17], we additionally introduce a margin penalty to ensure $n_f$ is more accurate than $n_p$. The overall intent number prediction loss can be written as
$$\mathcal{L}_{num} = \mathcal{L}_p + \mathcal{L}_f + \max\left(0, \mathcal{L}_f - \mathcal{L}_p + \delta\right),$$
where $\mathcal{L}_p$ and $\mathcal{L}_f$ are the MSE losses of the preliminary intent number and the final predicted intent number respectively, and $\delta$ is the margin.
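For a single utterance, this loss can be illustrated as follows; the exact penalty form and the margin value are our assumptions, since the original equation was not preserved:

```python
def intent_number_loss(n_p, n_f, n_true, margin=0.1):
    """MSE on both the preliminary (n_p) and final (n_f) intent-number
    predictions, plus a margin penalty that is zero only when the final
    prediction beats the preliminary one by at least `margin`."""
    loss_p = (n_p - n_true) ** 2
    loss_f = (n_f - n_true) ** 2
    penalty = max(0.0, loss_f - loss_p + margin)
    return loss_p + loss_f + penalty

# Final prediction worse than preliminary -> penalty kicks in:
print(intent_number_loss(n_p=2.0, n_f=3.0, n_true=2.0))   # 2.1
# Final prediction clearly better -> penalty is zero:
print(intent_number_loss(n_p=3.0, n_f=2.0, n_true=2.0))   # 1.0
```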
Distribution loss. We calculate the distribution loss as described in [11]. Specifically, for the known intents contained in the utterance, we maximize the corresponding probability density, which, in terms of the squared Mahalanobis distance $d_c = (r_c - \mu_c)^\top \Sigma^{-1} (r_c - \mu_c)$, can be written as
$$\mathcal{L}_{in} = \sum_{c \in y} d_c.$$
For known intents not in the utterance, we make the corresponding probability density tend to be small by setting a margin $m$, calculated as
$$\mathcal{L}_{out} = \sum_{\hat{c} \notin y} \max\left(0, m - d_{\hat{c}}\right),$$
where $\hat{c}$ refers to known intents not in $y$.
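The two terms can be sketched as below, using squared Mahalanobis distance as a stand-in for negative log-density (a distance of zero corresponds to maximal density under the Gaussian); the margin value is illustrative:

```python
import numpy as np

def mahalanobis_sq(r, mu, Sigma_inv):
    """Squared Mahalanobis distance between a representation and a class mean."""
    diff = r - mu
    return float(diff @ Sigma_inv @ diff)

def distribution_loss(R, mus, Sigma_inv, in_intents, margin=5.0):
    """R: (K, d) intent-wise representations; mus: (K, d) Gaussian means.
    Pull representations of contained intents toward their means (loss_in);
    push the remaining intents at least `margin` away (loss_out)."""
    K = R.shape[0]
    loss_in = sum(mahalanobis_sq(R[k], mus[k], Sigma_inv) for k in in_intents)
    loss_out = sum(max(0.0, margin - mahalanobis_sq(R[k], mus[k], Sigma_inv))
                   for k in range(K) if k not in in_intents)
    return loss_in, loss_out
```

With a shared covariance, as in the Mahalanobis-based matching above, only the per-intent means differ between classes.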
Finally, the overall loss $\mathcal{L}$ is
$$\mathcal{L} = \lambda_1 \mathcal{L}_{num} + \lambda_2 \mathcal{L}_{in} + \lambda_3 \mathcal{L}_{out},$$
in which $\lambda_1, \lambda_2, \lambda_3$ are hyperparameters.

EXPERIMENTS

Experimental Setup
Dataset. We employ the dataset constructed by Ouyang et al. [11] based on MultiWOZ2.3 [6], which is of high quality and large enough to evaluate our model.
Metrics. We adopt four metrics widely used in single-label unknown intent detection. AUROC is the area under the true positive rate-false positive rate curve. FPR95 is the false positive rate when the true positive rate is 95% (OOD data are set as positive samples here). AUPR In is the area under the precision-recall curve. AUPR Out is the area under the precision-recall curve with OOD data treated as positive samples. Note that a lower FPR95 indicates better performance, while larger values of the other three metrics indicate better performance.
Details. We employ a shared BERT-base model as our encoder to conduct both utterance encoding and segment encoding. The maximum number of segments for an utterance is set to 6. When there are more than 6 segments in an utterance, the shortest one is merged into an adjacent segment until the number of segments reaches 6. For intent $k$, we initialize its query $q_k$ with the hidden state of the [CLS] token after BERT encoding. $\lambda_1, \lambda_2, \lambda_3$ are set to 1.5, 1, and 1 respectively. We use a learning rate of 0.0001 and a dropout rate of 0.1 to get our best performance. All the studies are conducted on a GeForce RTX 3090 with a batch size of 128. Note that for each split, we run the experiments with 5 different random seeds and report the average value. For Table 1, the results are the average of 25 experiments. We follow the other training and testing details as described in [11].
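The segment-capping rule above (merge the shortest segment into an adjacent one until at most 6 remain) can be sketched as follows; the choice of which neighbour absorbs the shortest segment when both exist is our assumption, as the paper does not specify it:

```python
def cap_segments(segments, max_num=6):
    """Repeatedly merge the shortest segment into an adjacent neighbour
    (preferring the shorter neighbour) until at most max_num remain."""
    segs = [list(s) for s in segments]
    while len(segs) > max_num:
        i = min(range(len(segs)), key=lambda j: len(segs[j]))
        left, right = i - 1, i + 1
        if left < 0:                       # first segment: only a right neighbour
            target = right
        elif right >= len(segs):           # last segment: only a left neighbour
            target = left
        else:                              # assumption: merge into the shorter side
            target = left if len(segs[left]) <= len(segs[right]) else right
        lo_i, hi_i = sorted((i, target))
        segs[lo_i:hi_i + 1] = [segs[lo_i] + segs[hi_i]]  # splice merged segment in
    return segs

print(len(cap_segments([["a"], ["b"], ["c"], ["d"], ["e"], ["f"], ["g"]])))  # 6
```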

Main Results
In this work, we choose 6 baselines for comparison: Logit [15] uses the maximum binary classifier output to detect OOD. LOF [8] uses the local outlier factor in the utterance representation to detect OOD. LLR [14] trains an additional language model to eliminate irrelevant factors in OOD detection. Likelihood [5] trains a language model with IND utterances to get a lower likelihood for OOD utterances. Energy [12] uses the sum of exponentials of binary classifier outputs to detect OOD. AIK [11] first predicts the intent number of an utterance and then checks whether the same number of known intents is contained to detect OOD. As shown in Table 1, our SAPC model reaches new state-of-the-art performance on all metrics. In particular, SAPC significantly reduces FPR95 by 6.59% compared to the best baseline AIK. The reasons can be summarized as follows: 1) By sharing the same parameters for both utterance and segment encoding, our encoder can better extract the relationships between text segments and intents.
2) The segment-augmented known intents extractor utilizes fused information from both token level and segment level.
3) The consistency prediction module can predict a more accurate intent number.

Results on Different Splits
Table 2 further proves the superiority of SAPC across 5 different splits of MultiWOZ2.3. Our SAPC model outperforms AIK on every metric except AUPR Out on split 4, which also shows the robustness of our approach.

Ablation Study
As shown in Table 3, compared with AIK, the prediction consistency module mainly contributes to Intent Num Acc because of the explicit guidance from known intent matching and the preliminary intent number. Segment augmentation helps the model better understand the intents behind segment-level information, which is more in line with how humans express an intent in language, so the other 4 metrics improve markedly.
Note that a more accurate intent number does not necessarily benefit OOD detection: although the prediction consistency module leads to higher Intent Num Acc, its improvement on the other four metrics is still smaller than that brought by segment augmentation.

CONCLUSION
In this paper, we propose a novel segment augmentation and prediction consistency neural network for multi-label unknown intent detection, which effectively addresses the problems caused by using only utterance encoding and by inconsistent predictions between the two sub-tasks. Experimental results on MultiWOZ2.3 show that our method improves over the best baseline significantly.

Table 1: Experiment results on the MultiWOZ2.3 dataset. We follow the baseline results reported in [11]. All results are in percentage and the best ones are in bold.

Table 2: Experiment results of five different splits on the MultiWOZ2.3 dataset. Results of each split for AIK are produced by the released code of [11].