Research article | Open Access

Toward Explainable Dialogue System Using Two-stage Response Generation

Published: 10 March 2023


Abstract

In recent years, neural networks have achieved impressive performance on dialogue response generation. However, most of these models still suffer from shortcomings such as yielding uninformative responses and lacking explainability. This article proposes a Two-stage Dialogue Response Generation model (TSRG), which generates diverse and informative responses through an interpretable procedure between stages. TSRG involves a two-stage framework that first generates a candidate response and then instantiates it as the final response. Positional information and a resident token are injected into the candidate response to stabilize the multi-stage framework and alleviate its shortcomings. Additionally, TSRG allows adjusting and interpreting the interaction pattern between the two generation stages, making the generated responses somewhat explainable and controllable. We evaluate the proposed model on three dialogue datasets that contain millions of single-turn message-response pairs between web users. The results show that, compared with previous multi-stage dialogue generation models, TSRG produces more diverse and informative responses while maintaining fluency and relevance.


1 INTRODUCTION

As a core of artificial intelligence chatbots, Neural-network-based Response Generation (NRG) models have attracted much attention from researchers in recent years [18, 45]. Conventional NRG models generally have a sequence-to-sequence architecture [33], where a single encoder transforms an input message into a dense vector as the message representation, and a single decoder generates a response conditioned on that representation in a single pass. Typically, the encoder-decoder models are trained on message-response pairs in an end-to-end fashion [15], making the generation procedure a black box that is difficult to understand or explain. Moreover, the generated responses are prone to non-fluency and a lack of information.

Many previous researchers focus on enhancing the single-stage decoder to improve the quality of the generated responses by leveraging topic information [20, 43, 44], mutual information [17], and external knowledge [10, 34]. However, the fact that human beings usually think a candidate response over and refine it in both content and expression before speaking or writing it out is rarely taken into account. Accordingly, endowing response generation models with multi-stage text generation procedures is of great importance for producing high-quality responses and building an explainable response generation approach.

Xia et al. [41] and Liu et al. [21] show that multi-stage text generation models that include two or more decoders can generate better results by refining the output response repeatedly. Multi-stage text generation models generate a candidate response at an early stage and polish it at the following stages. Besides being similar to the way humans talk or write, multi-stage text generation models can also expose more of the internal workflow of the generating procedure, similar to the attention weights produced by the attention mechanism [7, 42, 50, 51]. By inspecting and analyzing the candidate response and the interaction between generating stages, we can better illuminate how the neural network outputs a response, which is very important for building an eXplainable Artificial Intelligence (XAI) system.

Multi-stage text generation models are more intuitive and have great potential for response generation. However, several shortcomings remain in the multi-stage text generation framework. The exposure bias problem [48] between training and prediction is more serious in the multi-stage framework than in the standard single-stage auto-regressive decoding procedure. Multi-stage frameworks are also more complicated and more likely to accumulate errors between stages, so the training targets for different stages must be carefully designed and the previous output regularized before passing it to the next stage. To address these issues, the proposed Two-stage Response Generation model (TSRG) improves the training targets for the decoders and stabilizes the interaction between them. Empirical experiments conducted on several dialogue datasets show that the proposed model significantly outperforms representative baselines on diversity-oriented automatic metrics while ensuring the relevance between the messages and the generated responses. Human evaluations are carried out as well, verifying that the proposed model can generate more informative and fluent responses.

1.1 Contribution

The main contributions of this article can be summarized as follows:

  • First, the proposed TSRG introduces an additional second-stage decoder into the standard sequence-to-sequence architecture for response generation, providing an explainable intermediate procedure.

  • Second, we aim to make the multi-stage framework more reliable by bridging the gap between training and prediction with a resident token and by adding positional information to the output sequence of the first-stage decoder.

  • Third, we simplify the training target of the first-stage decoder from a word sequence to a cluster index sequence to alleviate the risk that the first-stage decoder generates and cascades errors. The words in responses are agglomerated into word clusters. The high-frequency word clusters are further split into smaller clusters to keep the clusters balanced. The training target of the first-stage decoder is the corresponding cluster index sequence instead of the word sequence.

  • Finally, extensive experiments are performed to determine the performance of the proposed model. We perform the experiments on three web dialogue datasets. The proposed model outperforms several baseline models on various evaluation metrics and has better explainability.

1.2 Outline of the Article

The rest of this article is organized as follows. Section 2 introduces some sequence-to-sequence and NRG models and describes the previous multi-stage text generation models in detail. In Section 3, we first give an overview of the TSRG and then explain the two-stage decoding procedure. Section 4 contains the experiments on three different dialogue datasets, verifying the effectiveness of the proposed method; we further show the explainability of the proposed model through a deep analysis of the generating procedure in this section. The conclusion of the article and some discussion of future work are in Section 5.


2 RELATED WORK

Social media platforms, such as Twitter and Weibo, store massive numbers of message-response pairs between web users worldwide, presenting the opportunity to build automatic dialogue systems in a data-driven approach [28] instead of the rule-based or template-based approaches [4, 24]. Vinyals and Le [36] trained a sequence-to-sequence model on message-response pairs to generate textual responses according to input messages in an end-to-end manner. Similar sequence-to-sequence models are also applied in other text generation tasks, such as machine translation [6], language modeling [3], and image captioning [37].

The typical sequence-to-sequence models consist of one encoder and one decoder, both of which are Recurrent Neural Networks (RNNs). Researchers have tried to enhance the sequence-to-sequence model in many ways. Chen et al. [5] introduce a hierarchical structure and a variational memory network into the sequence-to-sequence model to handle the long-term dependencies in dialogue. Transformer [35] blocks are also used to replace the RNN cells in the encoder-decoder model to improve the generation quality in [46]. The model in [49] uses an emotion classifier to provide additional supervision for the sequence-to-sequence model, making the generated responses emotionally rich. Zhu et al. [52] import keywords into the response generation process to promote the informativeness of responses, and further analyze the importing order of the keywords.

To further improve the quality of the generated responses, some researchers design multi-stage response generation models that include multiple encoders and decoders. Su et al. [32] propose a multi-stage text generation model with four independent decoders, each responsible for the words with different part-of-speech (POS) tags. The subsequent decoder inserts new words into the output sequence of the previous decoder. Furthermore, this multi-stage generation model uses an inter-layer teacher forcing mechanism that replaces the sequence generated by the previous decoder with the ground truth to accelerate training.

In a Deliberation Network (DN) [41], a deliberation process is introduced in the sequence-to-sequence model as an extra decoding procedure. A draft of the target sequence is generated by the first-stage decoder first, and it then is used as the input of the second-stage decoder to generate the final output sequence.

The Vocabulary Pyramid Network (VPN) proposed in [21] contains three different encoder-decoder pairs. One of them, named raw encoder-decoder, is the same as the canonical sequence-to-sequence model and takes the original message as input and response as output. The other two encoder-decoder pairs, named high-level encoder-decoder and low-level encoder-decoder, are responsible for the high-level and the low-level message-response pairs, respectively, which are constructed by clustering and replacing the raw words in the message-response pairs. The words in the original messages and responses are agglomerated together based on their embedding vectors. The words belonging to the same cluster are represented with the same token to build the high-level and the low-level message-response pairs. The high-level and low-level decoders emit their generation results first and then the raw decoder generates the final output based on these results. All three decoders for different generation targets are optimized together and constitute a multi-task learning paradigm.


3 TWO-STAGE DIALOGUE RESPONSE GENERATION

In this section, we first provide an overview of the proposed TSRG model and then detail the two-stage decoding procedure.

3.1 Overview

Figure 1 shows the overall framework of the TSRG, which consists of a character-aware encoding component and a two-stage decoding component. The character-aware encoding component includes a character-level encoder and a word-level encoder to obtain multi-grained representations for the input message. The character-level encoder takes the character sequence as input and obtains the character-level representation for each word through a Convolutional Neural Network (CNN) [16]. The word-level encoder encodes the input message at the word level using an RNN. The embeddings of the words are fed into the word-level encoder together with the character-level representations.

Fig. 1.

Fig. 1. Overview of the proposed Two-stage Response Generation model (TSRG) with an example message-response pair. The message is encoded at both character level and word level and imported into decoders. The output of the first-stage decoder is the cluster index sequence, which is regarded as a candidate response and fed into the second-stage decoder with positional information and a resident token.

Then, the character-level and word-level representations of the input message are fed into the decoding component. The decoding component includes two decoding stages. The first decoding stage makes a candidate response with cluster indexes. Each cluster index denotes a cluster of raw words, indicating that the raw word in the final response should come from this cluster. Using the candidate response as guidance, the second-stage decoder generates the final response.

The following sections describe the details of the encoders and decoders in TSRG.

3.2 The Character-aware Encoding Component

To satisfy resource limitations and speed up the training and generation procedures, we limit the vocabulary of NRG models to a fixed size. The words that cannot get into the vocabulary are represented as unknown tokens after pre-processing; we usually refer to such words as Out-of-vocabulary (OOV) words. Much of the information contained in the OOV words is thus ignored during encoding and decoding, leaving the message representation obtained by the encoder lacking information. Thus, the semantic information from messages may be insufficient for the decoder to generate high-quality responses. To obtain a more comprehensive representation of the input message, we adopt the character-aware encoder from [29], which encodes the input at both the word level and the character level. The character-level representations can preserve the information from OOV words.
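The vocabulary truncation described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not the authors' implementation; the function names are ours:

```python
from collections import Counter

def build_vocab(corpus, max_size):
    """Keep the most frequent words up to max_size; everything else is OOV."""
    counts = Counter(w for sent in corpus for w in sent)
    vocab = {"<unk>": 0}
    for w, _ in counts.most_common(max_size - 1):
        vocab[w] = len(vocab)
    return vocab

def encode(sent, vocab):
    """Map each word to its index, falling back to the unknown token."""
    return [vocab.get(w, vocab["<unk>"]) for w in sent]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = build_vocab(corpus, max_size=3)        # room for <unk> plus 2 words
ids = encode(["the", "bird", "sat"], vocab)    # "bird" and "cat"/"dog" become OOV
```

The character-level encoder exists precisely to recover information from words that fall back to `<unk>` here.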

As mentioned in Section 3.1, the character-aware encoding component processes the input message X in two granularities: a word sequence \(X=\lbrace t_1,t_2,\ldots ,t_m\rbrace\) and a character set sequence \(X=\lbrace \mathbf {k_1},\mathbf {k_2},\ldots ,\mathbf {k_m}\rbrace\), where \(\mathbf {k}_i=\lbrace k_{i,1},k_{i,2},\ldots ,k_{i,l_i}\rbrace\) is the character sequence obtained by dividing the word \(t_i\) into characters, and \(l_i\) is the number of characters in token \(t_i\). First, the character set sequence \(\mathbf {k}_i\) is encoded into a vector \(\mathbf {h}^k_i\) by a CNN. We use \(\mathbf {H}^k=\lbrace \mathbf {h}^k_1,\mathbf {h}^k_2,\ldots ,\mathbf {h}^k_m\rbrace\) to denote the character-level representations of the input message, which preserve supplementary semantic information from the OOV words.

The character-level representation \(\mathbf {h}^k_i\) in \(\mathbf {H}^k\) and the word embedding vector \(\mathbf {e}(t_i)\) are concatenated together and fed into the word-level encoder. A Bi-directional Gated Recurrent Unit (Bi-GRU) [6] is used as the RNN cell in the word-level encoder to project the concatenation \({[\mathbf {h}^k_i; e(t_i)]}\) into the hidden state sequence \(\mathbf {H}^t=\lbrace \mathbf {h}^t_1,\mathbf {h}^t_2,\ldots ,\mathbf {h}^t_m\rbrace\). \(\mathbf {H}^k\) and \(\mathbf {H}^t\) are used as the character-level and word-level representations of the message, respectively; they are fed into the two-stage decoders together with attention modules.

3.3 The Two-stage Decoding Component

The decoding component in TSRG contains two decoders: the first-stage decoder produces a candidate response first, and the second-stage decoder generates the final response based on the candidate response. The candidate response can be used as a preliminary plan, improving the generation of the final responses by enabling the second-stage decoder to discover some information about the further generation steps. In the following sections, we detail the decoding framework in the proposed TSRG.

3.3.1 Building the Training Targets for Decoders.

The multi-stage text generation model contains two or more decoders. We need to define an adequate training target for each decoder to make the multiple decoders compatible and the entire framework stable. The previous multi-stage NRG model, VPN [21], uses the cluster index sequence instead of the word sequence as the training target for the additional encoders and decoders, as introduced in Section 2. The words are aggregated according to their pre-trained embedding vectors [26]. The pre-trained embedding vectors can serve as high-quality initialization for the word embeddings, but there are some problems when using these vectors as the features in aggregation. The high-frequency words usually lie in the same sub-region of the word embedding space [11], which causes the high-frequency words to agglomerate into the same cluster. This phenomenon makes a single cluster index account for a large proportion of the additional message-response pairs constructed by clustering and replacing. Similar to the long-tail distribution of the raw words, the cluster-based training target suffers even more from the class imbalance problem, which makes the additional decoders tend to generate the high-frequency cluster indexes.

To show this phenomenon quantitatively, we follow the workflow in VPN to create the additional message-response pairs and analyze the word distributions. We use a hierarchical cluster algorithm to agglomerate the raw words into clusters. The sizes of clusters are set to 300 and 3,000 for the high-level and low-level agglomerations, respectively. The high level represents a higher abstraction (i.e., the number of clusters is smaller) of the raw words compared to the low level. Following the workflow in VPN, we replace every word in the original message-response pairs with its cluster index based on the high-level agglomerations to obtain the high-level message-response pairs and use low-level agglomerations for the low-level message-response pairs. We then calculate the percentage of the highest-frequency tokens in each set of message-response pairs, as shown in Table 1.

Table 1.

Setting       Number of Clusters / Vocabulary Size   Highest Percentage (%)
Raw                        129,217                     5.27
Low Level                    3,000                    16.21
High Level                     300                    21.26
Separation                     100                     8.26

Table 1. Percentage of the Highest-frequency Cluster under Different Settings

We repeat the procedure 20 times and average the percentages across repeats to reduce randomness. We can see that 21.26% of the tokens are identical in the high-level message-response pairs, a dramatic increase compared with the raw message-response pairs. And the percentage remains high even when allowing more clusters: 16.21% of the tokens are still identical after increasing the number of clusters to 3,000 at the low level.
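The cluster-replacement workflow described above can be illustrated with a small Python sketch. The word-to-cluster map and helper names below are hypothetical toy examples; the actual agglomeration uses embedding-based hierarchical clustering:

```python
from collections import Counter

def to_cluster_sequence(tokens, word2cluster):
    """Replace each raw word with its cluster index (VPN-style)."""
    return [word2cluster[w] for w in tokens]

def highest_cluster_share(pairs, word2cluster):
    """Fraction of response tokens covered by the most frequent cluster index."""
    counts = Counter(
        c for _, response in pairs
        for c in to_cluster_sequence(response, word2cluster)
    )
    total = sum(counts.values())
    return counts.most_common(1)[0][1] / total

# toy agglomeration: common function words collapse into cluster 0
word2cluster = {"i": 0, "you": 0, "the": 0, "love": 1, "cats": 2}
pairs = [("msg1", ["i", "love", "you"]), ("msg2", ["i", "love", "the", "cats"])]
share = highest_cluster_share(pairs, word2cluster)  # cluster 0 covers 4 of 7 tokens
```

Even in this tiny example, one cluster index dominates the replaced responses, mirroring the imbalance reported in Table 1.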

VPN uses the additional message-response pairs as training targets for the high-level and low-level decoders. The higher percentage means there are many identical tokens in the additional message-response pairs. At the sequence level, different responses become similar after replacing the words with cluster indexes. Accordingly, the decoders trained on such responses prefer to generate similar sequences. At the token level, the decoder tends to assign higher probabilities to the high-frequency tokens at each generation step, to minimize the empirical risk on the training set. As a result, the generation results from the additional two decoders contain many high-frequency tokens and are very homogenized across different messages. Since the final decoder uses the high-level and the low-level generation results as input, it emits more high-frequency words into the final response, making the final response mediocre and uninformative.

The experimental results support the above statements. By analyzing the generation of VPN, we find that it achieves a better score on BLEU but reduces the diversity score. According to the observations in [12, 23], the BLEU score measures fluency better than perplexity but usually fails to capture diversity. So using the conventional cluster-based training targets for the intermediate decoders hurts generation diversity and makes the generated responses uninformative.

To address this problem, the high-frequency words should be separated into different clusters instead of being agglomerated into the same one. Therefore, we separate the words in the highest-frequency cluster into different clusters. Assuming that \(C_{h} = \lbrace w_1, w_2, \ldots , w_i, \ldots \rbrace\) denotes the highest-frequency cluster, where \(w_i\) is a word in the cluster, the separation procedure can be formulated as follows: (1) \(\begin{equation} C_{h} \Rightarrow {\left\lbrace \begin{array}{ll} C_{h_0} = \lbrace w_{h_{0i}}\rbrace , & \frac{count(w_{h_{0i}})}{total} \lt 1\% \\ C_{h_i} = w_j, & \frac{count(w_{j})}{total} \ge 1\%. \end{array}\right.} \end{equation}\) According to Equation (1), \(C_{h}\) is divided into multiple clusters, denoted as \({C_{h_0}, C_{h_1}, \ldots , C_{h_i}, \ldots }\). \(count(w_{j})\) is the count of the word \(w_j\) and total is the total word count. \(C_{h_0}\) contains all the words in \(C_h\) whose frequency is less than 1%, while each \(C_{h_i}\) \((i=1,2,\ldots)\) contains a single word whose frequency is at least 1%. As shown in the last row of Table 1, by separating \(C_{h}\), the percentage of the highest-frequency tokens is reduced to 8.26% with a smaller number of clusters.
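Equation (1) amounts to a simple frequency-based split. A toy sketch under assumed word counts might look like the following; `separate_cluster` is our own name, not from the paper:

```python
def separate_cluster(cluster_words, counts, total, threshold=0.01):
    """Split the highest-frequency cluster per Equation (1): words at or above
    the frequency threshold each form a singleton cluster; the rest stay
    together in a residual cluster."""
    residual, singletons = [], []
    for w in cluster_words:
        if counts[w] / total >= threshold:
            singletons.append([w])   # C_{h_i}: one high-frequency word alone
        else:
            residual.append(w)       # C_{h_0}: remaining low-frequency words
    return [residual] + singletons

# assumed toy counts over a corpus of 10,000 tokens
counts = {"the": 500, "a": 300, "cat": 5, "dog": 4}
total = 10_000
clusters = separate_cluster(["the", "a", "cat", "dog"], counts, total)
# "the" (5%) and "a" (3%) split off; "cat" and "dog" remain together
```

This matches the separation row of Table 1 in spirit: a handful of very frequent words stop sharing a single dominant cluster index.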

In summary, the training targets of the two decoders in our model are defined as follows: the objective of the first decoding stage is to maximize the probability of the cluster index sequence obtained by replacing the original word with the index of the cluster; the objective of the second decoding stage is similar to the canonical sequence-to-sequence model. Compared with VPN, clusters used by the proposed model are re-divided according to the frequency distribution, which causes some clusters to contain a single raw word each.

3.3.2 Stabilizing the Interactions between Decoders.

TSRG adopts RNN-based decoders in the two-stage decoding procedure. An RNN-based decoder is usually trained with the teacher forcing strategy [2, 39], which uses ground-truth sequence as input in training to stabilize and accelerate the model convergence. Similarly, the multi-stage decoding framework also needs to be trained with teacher forcing at the transition between decoders (i.e., instead of predictions, ground-truth intermediate output of the previous decoder is taken to feed the next decoder) to ensure the convergence.

However, teacher forcing brings an input discrepancy between training and prediction [48], which is usually referred to as exposure bias. In the multi-stage text generation model, the output sequence of the previous decoder is used as the input of the subsequent decoder. When applying the teacher forcing strategy, the ground-truth candidate response is used as the input in the training of the subsequent decoder, but the ground truth is not available at the prediction stage, so we can only use the prediction instead. The subsequent decoder learns how to make predictions based on the ground-truth previous output as input. When the input changes from ground truth to the predictions, which may be very different, the decoder may fail to generate regular results.

The above problem widely exists in auto-regressive NRG models and sequence-to-sequence models [14, 40]. And it gets worse in multi-stage generation models because the inconsistency escalates from the word level within a single decoding stage to the sequence level between decoding stages. This creates a dilemma about teacher forcing: we should use the teacher forcing strategy to ensure the model can learn an effective relationship between input and output, i.e., generating the target based on a valid source instead of an exceptional previous prediction. However, the teacher forcing strategy causes serious exposure bias.

To alleviate this problem, we first choose the cluster index sequence instead of the raw words to bridge the two decoders, as described in Section 3.3.1. Because the distinct clusters are much fewer than the words, the combination of clusters is much more restricted and easier to decode. The prediction accuracy of the first-stage decoding improves accordingly, which reduces the risk that the first-stage decoder generates exceptional output.

Furthermore, to make the output of the first-stage decoder more stable, we introduce a pre-defined resident token \(y^R\) between the two decoders. \(y^R\) is appended to the output of the first-stage decoder \(Y^f\), so the input of the second-stage decoder becomes \(\lbrace y_1^f,y_2^f,\ldots ,y_n^f,y^R\rbrace\). During training, the quality of \(Y^f\) is guaranteed by the teacher forcing strategy, so the attention weight for \(y^R\) remains relatively small. When predicting, \(y^R\) can absorb some attention weight when the desired tokens do not exist in \(Y^f\). The resident token acts as a fail-safe token: if the sequence generated by the first-stage decoder does not contain useful information, \(y^R\) provides the attention mechanism an additional choice beyond the unimportant or wrong tokens in \(Y^f\). Because \(y^R\) exists in every input sequence, the second-stage decoder can treat attending to it as receiving no input from the first-stage decoder. If some relevant tokens are contained in \(\lbrace y_1^f,y_2^f,\ldots ,y_n^f\rbrace\), the second-stage decoder will attend to them; otherwise, it will ignore \(\lbrace y_1^f,y_2^f,\ldots ,y_n^f\rbrace\) by attending to the resident token \(y^R\).
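The fail-safe behavior of the resident token can be illustrated with a toy dot-product attention example. This is a simplified sketch with hand-picked vectors, not the model's learned attention: when no token in \(Y^f\) aligns with the decoder state, the slot holding the resident token absorbs most of the attention mass.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_with_resident(state, yf_embeds, resident_embed):
    """Dot-product attention over the first-stage output plus a resident
    token appended as the final key."""
    keys = np.vstack([yf_embeds, resident_embed])   # (n + 1, d)
    weights = softmax(keys @ state)                 # one score per key
    context = weights @ keys                        # attention-weighted average
    return weights, context

state = np.array([1.0, 0.0, 0.0, 0.0])              # decoder query
yf_embeds = np.array([[0.0, 1.0, 0.0, 0.0],         # tokens orthogonal to the
                      [0.0, 0.0, 1.0, 0.0],         # query, i.e. irrelevant
                      [0.0, 0.0, 0.0, 1.0]])
resident = np.array([3.0, 0.0, 0.0, 0.0])           # resident token embedding
weights, context = attend_with_resident(state, yf_embeds, resident)
```

Here the three \(Y^f\) tokens score zero against the query, so the resident slot collects most of the normalized weight rather than forcing attention onto irrelevant tokens.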

3.3.3 Integrating Positional Information.

As one of the primary elements in natural language, positional information provides the relationships of words during sequence modeling. But the output of the first-stage decoder lacks positional information when the second-stage decoder receives only the embedding vector sequence of the tokens in \(Y^f\): the sequential tokens in \(Y^f\) degenerate into a bag of words, and when the same token appears multiple times in \(Y^f\), its occurrences are treated identically by the second-stage decoder.

To address this problem, we add the sinusoidal position embedding [35, 38] to \(Y^f\!\). The position embedding restores the temporal information in the output of the first-stage decoder, which enables the second-stage decoder to distinguish the repeated tokens in \(Y^f\) and receive more diverse information from the first-stage decoder.
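A minimal NumPy sketch of this additive scheme, \(\mathbf{e}_p(y_j^f)=\mathbf{e}(y_j^f)+\lambda \mathbf{p}(j)\) as formalized later in Equation (7), shows how repeated tokens become distinguishable. The \(\lambda\) value here is arbitrary and the helper names are ours:

```python
import numpy as np

def sinusoidal_pe(n_positions, dim):
    """Standard sinusoidal position embeddings (Vaswani et al., 2017);
    dim is assumed even."""
    pos = np.arange(n_positions)[:, None]          # (n, 1)
    i = np.arange(0, dim, 2)[None, :]              # (1, dim/2)
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((n_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def add_position(yf_embeds, lam=0.5):
    """e_p(y_j^f) = e(y_j^f) + lambda * p(j)."""
    n, d = yf_embeds.shape
    return yf_embeds + lam * sinusoidal_pe(n, d)

embeds = np.ones((2, 4))      # two repeated tokens in Y^f, identical embeddings
out = add_position(embeds)    # after adding positions, the rows differ
```

Without the position term the two rows are indistinguishable; with it, the second-stage attention can tell the first occurrence from the second.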

3.3.4 Two-stage Decoders.

In TSRG, the decoding process has two stages: the first-stage decoder generates an intermediate cluster index sequence \(Y^f\) as the candidate response; then the second-stage decoder generates the final output sequence Y based on \(Y^f\). Both decoders adopt GRU as the basic decoding module, and the initial state of GRU is set to the final vector \(\mathbf {h}_m^t\) in the hidden state sequence \(\mathbf {H}^t\) output by the encoder.

In the first-stage decoder, the training data is constructed using the word cluster index instead of raw words as described in Section 3.3.1. At the ith timestep, the input of the GRU cell in the decoder consists of \(\mathbf {e}(y_{i-1}^f)\), \(\mathbf {c}^{kf}_i\), and \(\mathbf {c}^{tf}_i\), where \(\mathbf {e}(y_{i-1}^f)\) is the embedding vector of the previously generated token, \(\mathbf {c}^{kf}_i\) is the weighted average of the vectors in \(\mathbf {H}^k\), and \(\mathbf {c}^{tf}_i\) is the weighted average of \(\mathbf {H}^t\). They are calculated by (2) \(\begin{equation} \mathbf {c}^{kf}_i=\sum _{j=1}^m\alpha _{ij}^{kf}\mathbf {h}_j^{k}, \end{equation}\) (3) \(\begin{equation} \mathbf {c}^{tf}_i=\sum _{j=1}^m\alpha _{ij}^{tf}\mathbf {h}_j^{t}, \end{equation}\) where \(\alpha _{ij}^{kf}\) and \(\alpha _{ij}^{tf}\) are the attention weights obtained by the attention mechanism [1]. The attention weight \(\alpha _{ij}^{kf}\) for the jth vector \(\mathbf {h}_j^k\) in \(\mathbf {H}^k\) and the attention weight \(\alpha _{ij}^{tf}\) for \(\mathbf {h}_j^t\) in \(\mathbf {H}^t\) are calculated by the following equations: (4) \(\begin{equation} \alpha _{ij}^{kf}=softmax\left(\mathbf {h}_j^k\mathbf {W}_{kf}\mathbf {h}_{i-1}^f\right), \end{equation}\) (5) \(\begin{equation} \alpha _{ij}^{tf}=softmax\left(\mathbf {h}_j^t\mathbf {W}_{tf}\mathbf {h}_{i-1}^f\right). \end{equation}\) \(\mathbf {h}_{i-1}^f\) is the hidden state of the first-stage decoder at the previous timestep. \(\mathbf {W}_{kf}\) and \(\mathbf {W}_{tf}\) are matrices that map each vector pair to a scalar score. The softmax function is used to normalize \(\lbrace \alpha _{ij}^{kf}\rbrace\) and \(\lbrace \alpha _{ij}^{tf}\rbrace\) (\(j=1,2,\ldots ,m\)) into probabilities. \(\mathbf {e}(y_{i-1}^f)\), \(\mathbf {c}^{kf}_i\), and \(\mathbf {c}^{tf}_i\) are concatenated as the input of the GRU, and then the hidden state vector sequence \(\mathbf {H}^f\) is obtained.
After that, a linear transformation and a softmax function are employed to project the first-stage output into \(Y^f=\lbrace y_1^f,y_2^f,\ldots ,y_n^f\rbrace\).
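For concreteness, the bilinear attention of Equations (2) through (5) for one decoding timestep might look like the following NumPy sketch. The shapes and random values are illustrative only, not taken from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilinear_attention(prev_state, H, W):
    """alpha_ij = softmax_j(h_j W h_{i-1}); context c_i = sum_j alpha_ij h_j,
    matching Equations (2)-(5) for a single timestep i."""
    scores = H @ W @ prev_state   # one scalar score per encoder hidden state
    alpha = softmax(scores)       # normalize scores into probabilities
    context = alpha @ H           # weighted average of the hidden states
    return alpha, context

rng = np.random.default_rng(1)
H_t = rng.standard_normal((5, 8))     # word-level hidden states H^t (m=5, d=8)
W_tf = rng.standard_normal((8, 8))    # bilinear attention matrix W_tf
h_prev = rng.standard_normal(8)       # first-stage decoder state h_{i-1}^f
alpha, c_tf = bilinear_attention(h_prev, H_t, W_tf)
```

The same routine applied to \(\mathbf{H}^k\) with \(\mathbf{W}_{kf}\) would yield \(\mathbf{c}^{kf}_i\); the two context vectors are then concatenated with the previous token embedding as GRU input.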

The second-stage decoder takes \(\mathbf {e}(y_{i-1})\), \(\mathbf {c}^{ks}_i\), \(\mathbf {c}^{ts}_i\), and \(\mathbf {c}^{fs}_i\) as the input at the ith timestep. \(\mathbf {e}(y_{i-1})\) represents the embedding of the previously generated word \(y_{i-1}\) of the second-stage decoder. Similar to \(\mathbf {c}^{kf}_i\) and \(\mathbf {c}^{tf}_i\), \(\mathbf {c}^{ks}_i\) and \(\mathbf {c}^{ts}_i\) are the weighted averages of \(\mathbf {H}^k\) and \(\mathbf {H}^t\). \(\mathbf {c}^{fs}_i\) is the weighted average of the embedding vector sequence corresponding to \(\lbrace y_1^f,y_2^f,\ldots ,y_n^f,y^R\rbrace\), which is calculated by the following equation: (6) \(\begin{equation} \mathbf {c}^{fs}_i=\sum _{j=1}^{n+1}\alpha _j^{fs}\mathbf {e}_p\left(y_j^{f}\right). \end{equation}\) \(y_j^{f}\) is the jth token in the sequence \(\lbrace y_1^f,y_2^f,\ldots ,y_n^f,y^R\rbrace\). \(\mathbf {e}(y^R)\) is the embedding of the resident token \(y^R\). \(\alpha _j^{fs}\) is the attention weight of the jth token in \(\lbrace y_1^f,y_2^f,\ldots ,y_n^f,y^R\rbrace\) when feeding into the GRU cell in the second-stage decoder. \(\mathbf {e}_p(y_j^f)\) denotes the embedding vector of \(y_j^f\) with position embedding [35], which can be obtained by (7) \(\begin{equation} \mathbf {e}_p\left(y_j^f\right)=\mathbf {e}\left(y_j^f\right) + \lambda \mathbf {p}(j). \end{equation}\) \(\mathbf {e}(y_j^f)\) is the embedding of \(y_j^f\). \(\mathbf {p}(j)\) is the jth sinusoidal position embedding vector; it has the same size as the embedding vector and changes according to the position index j. \(\lambda\) is a weight that determines how much of the position information is included.


4 EXPERIMENTS

4.1 Settings

4.1.1 Datasets.

We employ three datasets that contain millions of message-response pairs between web users to evaluate the NRG models. The statistics of the datasets are shown in Table 2. For the Weibo and Twitter datasets, we filter some noisy message-response pairs and split them into Train/Dev/Test sets randomly. The Weibo-clean dataset is proposed in [9] and contains a predefined development set and test set.

Table 2.

                                        Weibo [30]   Weibo-clean [9]   Twitter [21]
Total samples (message-response pairs)   3,942,396        4,266,650      2,599,029
Messages in training set                   186,897          212,163      2,326,684
Messages in development set                  2,000              801          8,000
Messages in test set                         2,000              800          8,000
Average responses per message                20.65            20.11           1.12
Message vocabulary size                     93,469          121,521        261,907
Response vocabulary size                    71,165          105,536        250,737

Table 2. Statistics of the Datasets

4.1.2 Evaluation.

We first calculate the overlap between generated and ground-truth responses with BLEU [25], ROUGE [19], and METEOR [8] metrics. These overlap-based metrics can reveal the topicality and relevance of responses [22]. In the Weibo and Weibo-clean dataset, each message for Dev/Test has multiple corresponding ground-truth responses, so we use multi-reference BLEU that includes more ground-truth references during evaluation, the same as the setting in [27]. Meanwhile, in the Twitter dataset, the number of ground-truth responses for each message is only 1.12, so we use one ground-truth response as a reference during calculation.

On the overlap-based metrics, the uninformative responses that involve more high-frequency N-grams usually achieve higher scores. So it is inappropriate to measure the response quality by using the overlap-based metrics solely. Therefore, we also introduce the diversity score [17], which is calculated based on the number of distinct N-grams, to evaluate the informativeness of the responses. We calculate these automatic metrics by the NLGEVAL library [31] on the test set. Furthermore, we also conduct human comparisons between the responses generated by different models. Due to the complexity and variance of the dialogue, the human evaluation is usually more reliable than the automatic evaluation.
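The distinct-N diversity score mentioned above is straightforward to compute. A possible sketch, using our own helper name and following the usual definition of distinct N-grams over the pool of generated responses:

```python
def distinct_n(responses, n):
    """Distinct-N: number of unique n-grams divided by the total number of
    n-grams, computed over all generated responses (Li et al., 2016)."""
    ngrams = []
    for tokens in responses:
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# a model that repeats a safe response scores low on distinct-2
dull = [["i", "do", "not", "know"], ["i", "do", "not", "know"]]
varied = [["i", "like", "cats"], ["we", "went", "hiking"]]
dull_score = distinct_n(dull, 2)       # 3 unique / 6 total bigrams
varied_score = distinct_n(varied, 2)   # all bigrams unique
```

This is exactly why uninformative "safe" responses can score well on overlap metrics yet poorly on distinct-N.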

4.1.3 Baselines.

The following models are used as our baselines for comparison:

  • RNNSearch [1]: A sequence-to-sequence model that contains a single encoder and a single decoder, equipped with an attention mechanism from the decoder to the encoder.

  • RNNSearch with Transformer encoder: The RNNSearch model with its encoder replaced by a Transformer encoder [35].

  • Transformer: A sequence-to-sequence model that consists of a Transformer encoder and a Transformer decoder [35, 46].

  • Deliberation Networks (DN) [41]: Our implementation of the Deliberation Network, which contains one encoder and two decoders.

  • Vocabulary Pyramid Network (VPN) [21]: Our implementation of the Vocabulary Pyramid Network, which contains three encoders and three decoders.

4.1.4 Parameter Settings.

For RNNSearch, DN, and Transformer, the vocabulary size is set to 30,000. For VPN, the vocabulary size of raw words is 30,000, and the number of clusters is set to 300 and 3,000 for the high-level decoder and the low-level decoder, respectively. For TSRG, the vocabulary sizes of the first-stage decoder and the second-stage decoder are set to 100 and 30,000, respectively. The embedding dimension is set to 300. The hidden size of the bi-directional GRU encoder(s) and the GRU decoder(s) is set to 1,024, the training batch size is set to 128, and the learning rate is set to 0.0001. We use the BIRCH [47] algorithm implemented in Sklearn to perform the hierarchical clustering of raw words. In TSRG, 11 words with a frequency greater than 1% are forced to form a cluster after aggregation.
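The clustering step can be sketched with Scikit-learn's Birch estimator; random vectors stand in here for the 300-dimensional word embeddings, and the cluster count is a toy value (the paper's tree parameters, such as the BIRCH threshold, are not specified in this excerpt):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(500, 300))  # stand-in for word embeddings

# Cluster raw-word embeddings into a fixed number of first-stage labels;
# each raw word is mapped to the id of the cluster it falls into.
birch = Birch(n_clusters=50)
labels = birch.fit_predict(word_vectors)
print(labels.shape)  # one cluster label per word
```

In TSRG, such cluster ids (plus the forced high-frequency cluster and the resident token) form the small 100-label vocabulary of the first-stage decoder.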

4.2 Evaluation Results

4.2.1 Automatic Evaluation Results.

Table 3 shows the automatic evaluation results of the proposed model and the baselines on the three datasets. The results demonstrate that TSRG performs better than RNNSearch on the diversity metrics. Unlike the previous multi-stage NRG models (i.e., VPN and DN), which improve the overlap-based scores at the cost of diversity or vice versa, TSRG improves both relevance and diversity.

Table 3. Automatic Evaluation Results (top two results on each metric are shown in bold in the original)

| Dataset | Model | BLEU1 | BLEU2 | METEOR | ROUGE-L | DIS1 | DIS2 |
|---|---|---|---|---|---|---|---|
| Weibo | RNNSearch | 34.0506 | 13.2366 | 9.2404 | 21.0795 | 2.8674 | 7.0136 |
| | DN | 36.1060 | 13.0485 | 8.7969 | 21.6554 | 1.9087 | 4.1940 |
| | VPN | 39.3028 | 15.5020 | 9.7608 | 22.2190 | 2.9442 | 7.1301 |
| | RNNSearch w. Transformer | 33.7409 | 10.6657 | 7.9069 | 20.7085 | 1.4852 | 3.0635 |
| | Transformer | 32.6640 | 13.5062 | 9.3761 | 20.7562 | 3.0682 | 8.0288 |
| | TSRG | 37.0253 | 15.2435 | 9.5491 | 21.5450 | 3.3830 | 8.4732 |
| Weibo-clean | RNNSearch | 23.3918 | 7.4979 | 6.8144 | 16.9412 | 2.5593 | 4.9874 |
| | DN | 16.0938 | 4.9083 | 5.8013 | 14.6261 | 0.7500 | 1.3347 |
| | VPN | 11.1176 | 4.0788 | 3.8955 | 12.0106 | 1.0830 | 1.9929 |
| | RNNSearch w. Transformer | 15.9002 | 4.7153 | 5.8695 | 14.8721 | 0.7139 | 1.3011 |
| | Transformer | 21.3798 | 7.0313 | 6.5490 | 16.3824 | 2.6263 | 5.5050 |
| | TSRG | 22.9278 | 7.1547 | 7.4505 | 18.0727 | 3.9296 | 7.9861 |
| Twitter | RNNSearch | 5.2393 | 1.6601 | 3.2671 | 9.3069 | 0.0296 | 0.2534 |
| | DN | 7.3918 | 2.3751 | 3.4079 | 9.9720 | 0.0248 | 0.2232 |
| | VPN | 6.1910 | 1.7929 | 3.0103 | 9.5515 | 0.0216 | 0.1621 |
| | RNNSearch w. Transformer | 6.1563 | 1.7363 | 3.1796 | 9.5450 | 0.0160 | 0.0588 |
| | Transformer | 6.4898 | 2.1236 | 3.3342 | 8.9968 | 0.0285 | 0.1915 |
| | TSRG | 4.7032 | 1.5838 | 3.1828 | 8.8671 | 0.0390 | 0.2962 |

Generally, there is a tradeoff between the overlap-based scores and the diversity scores. Rare words contribute more to the diversity scores than common words but also increase the risk of lowering the overlap-based scores. The experiment results show that DN and VPN invoke more common words in their responses, which degrades informativeness. We attribute this to the multi-stage frameworks having more learnable parameters, making them more likely to overfit toward generating high-frequency words at each decoding step. To address this problem, the high-frequency target labels of the first-stage decoder in TSRG are divided into different clusters, so the distribution of target labels becomes more even and the high-frequency labels are prevented from filling the candidate response.

The experiment results also reveal the instability of the multi-stage framework. As shown in Table 2, the Weibo-clean dataset possesses a larger response vocabulary than the Weibo dataset, so it is considered more challenging for NRG models: it contains more rare words and a wider variety of responses. VPN and DN show promising results on the Weibo dataset but fail on the Weibo-clean dataset, degrading both the overlap-based and the diversity metrics compared to RNNSearch. The contradiction between the two datasets may stem from the instability of the multi-stage framework, which creates a new risk of cascading errors between stages: the subsequent generation stage relies on the previous generation result and inherits its mistakes. When the dataset is hard to learn, the previous decoder is more likely to emit wrong tokens into its generation results and mislead the subsequent prediction. To address this problem, TSRG introduces a resident token to stabilize the first-pass decoding result, reduce the proportion of generated tokens, and make the candidate response more stable. The results show that TSRG outperforms RNNSearch on all metrics, indicating better generalizability.

In the Twitter dataset, there is only one ground-truth reference, which makes the overlap-based scores very low. A generated response may vary in surface form and differ greatly from the single ground truth while still being a decent response, so judging response quality by the overlap-based scores alone is unreliable on the Twitter dataset. Because more ground-truth responses are involved as references during calculation, the overlap-based scores on the Weibo and Weibo-clean datasets are higher than those on the Twitter dataset. The results on the Weibo and Weibo-clean datasets are also more stable and reliable for analysis, since multiple ground truths cover more potentially reasonable responses in evaluation. Therefore, the following discussion focuses on the results from the Weibo and Weibo-clean datasets.

4.2.2 Human Evaluation Results.

To evaluate the generated responses from a human perspective, we conduct a human comparison between the responses generated by VPN and TSRG. One hundred messages are randomly sampled from the test set of the Weibo dataset, and the responses generated by VPN and TSRG for the same message form a comparison pair. The annotators select which response in each pair is better on three aspects [13, 21]: (1) fluency: which response is more grammatical and fluent; (2) coherence: which response is more relevant to the message; (3) informativeness: which response provides more information.

Table 4 shows the human comparison results, with 300 votes from three annotators in total. TSRG Win denotes that TSRG performs better on the corresponding aspect, and the win ratio is the proportion of TSRG wins among the decisive (non-tie) votes. The order of the responses generated by TSRG and VPN is randomly shuffled during the comparison. The human evaluation provides more reliable evidence that the responses generated by TSRG are more fluent, coherent, and informative than those of VPN. After collecting comments from the annotators, we find that both models sometimes generate ridiculous responses and tie in the comparison; further design effort is needed to guarantee that responses are logically sound and consistent with world facts. Although the responses generated by both models are not yet fully satisfying, the proposed model makes clear progress.

Table 4. Human Comparisons between TSRG and VPN

| Aspects | TSRG Win | VPN Win | Tie | Win Ratio of TSRG (%) |
|---|---|---|---|---|
| Fluency | 111 | 78 | 111 | 58.73 |
| Coherence | 86 | 79 | 135 | 52.12 |
| Informativeness | 90 | 70 | 140 | 56.25 |
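For clarity, the win ratio counts only decisive votes, i.e., wins divided by wins plus losses with ties excluded, which reproduces the figures in Table 4:

```python
def win_ratio(wins, losses):
    # Ties are excluded: wins as a percentage of decisive votes.
    return 100.0 * wins / (wins + losses)

# Fluency row of Table 4: 111 TSRG wins, 78 VPN wins, 111 ties (300 votes).
print(round(win_ratio(111, 78), 2))  # 58.73
```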

4.2.3 Case Study.

To compare the multi-stage models directly, we select some input messages from the test set of the Weibo dataset and show the responses generated by different multi-stage models in Table 5. We can see that the responses from TSRG are more informative and appropriate in these cases. For Message 1, the word “dumpling” in the response generated by TSRG is a traditional festival food in China and is more relevant to the new year theme. For Message 2, TSRG generates a more diverse response that contains the pertinent word “push-ups,” which makes the generated response consistent with the topic of the message. For Message 3, compared to the predictions of the baseline models, the response generated by TSRG is more natural and fluent. These results show the effectiveness of the proposed improvements in TSRG.

Table 5. Comparison between the Responses Generated by the Baseline Models and Our Model (i.e., TSRG)

| ID | Message | RNNSearch | DN | VPN | TSRG |
|---|---|---|---|---|---|
| 1 | The foods at home are always more delicious! Reunion for the new year. (还是家里的饭菜香啊!过年了 该团圆啦。) | Is this food, or is it food (这是吃的,还是吃的) | My family's, mine, (我家的,我的,) | I'm going. I don't want to eat anymore. (我要去,我不想吃了。) | Haha! The food you eat is dumpling! (哈哈!吃的饺子啊!) |
| 2 | Let's exercise together. I laugh every time I look at this picture (一起来做运动吧。看一次笑一次) | What is this, I want to see (这是什么的,我要去看) | Mine, I am also one (我的,我也是一个) | I don't want don't want! (我想不要不要啊!) | Is he doing push-ups? (这不是在练俯卧撑?) |
| 3 | Today is holiday. The ground is freezing, give me a hug! (今天过节。地冻天寒,一起抱抱!) | Where is this? (这是在哪里啊?) | Mine, I'm also yours (我的,我也是你的) | This is me! This is me! (我是啊,我是啊!) | Haha! Happy holiday! (哈哈!节日快乐!) |

4.3 Analysis of the Interactions between the Decoders

In TSRG, we use positional embedding to inject positional information into the words in the candidate response. The \(\lambda\) in Equation (7) controls the amount of positional information introduced. We find that the value of \(\lambda\) can also influence the interaction pattern between the two decoders, making the final generation results controllable and explainable.

4.3.1 Control the Interaction Pattern through \(\lambda\).

We represent the interaction patterns with the attention weight between the second-stage decoder and the candidate responses from the first-stage decoder, as shown in Figure 2.

Fig. 2.

Fig. 2. The different interaction patterns between the two generation stages when tuning \(\lambda\) .

The cases in Figure 2 are selected from the Weibo dataset. Within a case, each subfigure corresponds to a different value of \(\lambda\). The horizontal axis shows the words in the final output, and the vertical axis shows the tokens in the candidate response; for convenience, we refer to words in the final output as words and to tokens in the candidate response as tokens in the following discussion. The value in each cell denotes the attention weight between the corresponding word and token, which indicates how much the second-stage decoder draws on that token when generating the corresponding word in the final output.

Across the cases, there is a clear trend that the interaction patterns become sparser as \(\lambda\) increases. When \(\lambda\) is 0, the attention is distributed almost uniformly over the tokens, as shown by the first word (i.e., “我”) in the first subfigure (\(\lambda =0\)) of Case 1. In the last subfigure, \(\lambda\) increases to 1, the maximum attention value approaches 1, and the attention weights from each word to the tokens resemble one-hot vectors. In summary, when \(\lambda\) is larger, more positional information is available, and the second-stage decoder relies more on the tokens whose positions are similar to those of the words being decoded.
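Equation (7) is not reproduced in this excerpt; as a hedged sketch of \(\lambda\)'s role only, assume the candidate-token representations receive standard sinusoidal positional embeddings scaled by \(\lambda\) (the exact mixing form in the paper may differ):

```python
import numpy as np

def sinusoidal_positions(seq_len, dim):
    # Standard sinusoidal positional embeddings [35].
    pos = np.arange(seq_len)[:, None].astype(float)
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def inject_position(token_embeddings, lam):
    # lam = 0: no positional signal; lam = 1: full positional signal.
    seq_len, dim = token_embeddings.shape
    return token_embeddings + lam * sinusoidal_positions(seq_len, dim)

tokens = np.zeros((5, 8))  # toy candidate-token representations
assert np.allclose(inject_position(tokens, 0.0), tokens)      # unchanged
assert not np.allclose(inject_position(tokens, 1.0), tokens)  # shifted
```

Under this reading, a larger \(\lambda\) makes position a stronger cue in the second-stage attention, which matches the sharpening of the maps in Figure 2.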

We employ two quantitative metrics, entropy and standard deviation, to track the change in the interaction pattern. For each word in the final response, we calculate the entropy and the standard deviation of its attention weights over the tokens in the candidate response, i.e., the statistics of the distribution shown in each row of the attention map. The entropy measures how close a set of attention weights is to the uniform distribution, while the standard deviation shows the variation among the weights. As shown in Figure 3, the entropy changes greatly as \(\lambda\) takes different values, revealing the effectiveness of \(\lambda\) in controlling the attention patterns between the decoders.
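Both statistics are computed per row of the attention map; a small illustration with made-up weights (actual values come from the trained model):

```python
import numpy as np

def attention_row_stats(weights):
    # Entropy and standard deviation of one row of the attention map,
    # i.e., the weights from one output word to all candidate tokens.
    w = np.asarray(weights, dtype=float)
    entropy = float(-np.sum(w * np.log(w + 1e-12)))  # epsilon guards log(0)
    return entropy, float(w.std())

uniform = [0.25, 0.25, 0.25, 0.25]  # lambda = 0: attention spread over tokens
one_hot = [0.97, 0.01, 0.01, 0.01]  # lambda = 1: attention concentrated

e_u, s_u = attention_row_stats(uniform)
e_p, s_p = attention_row_stats(one_hot)
print(e_u > e_p, s_u < s_p)  # uniform row: higher entropy, lower deviation
```

A sparser interaction pattern thus shows up as lower entropy and higher standard deviation, matching the trends plotted in Figure 3.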

Fig. 3.

Fig. 3. Changes in the entropy and the standard deviation of the interaction patterns when tuning \(\lambda\) .

We analyze the underlying reason in detail. As a result of the teacher forcing strategy, the tokens in the candidate response align exactly with the words in the target response. Each token provides crucial guidance for the corresponding target word in the second-stage decoder, so the second-stage decoder learns to rely on tokens at similar positions. The positional embedding attaches positional information to the tokens under the control of \(\lambda\). In this way, \(\lambda\) controls how much the second-stage decoder relies on positionally similar tokens when generating words.

4.3.2 Effects of \(\lambda\) on the Generated Responses.

We can adjust the characteristics of the final generation to some extent by controlling the interaction pattern through \(\lambda\). Table 6 shows the automatic metrics of the responses generated with different \(\lambda\). The results with \(\lambda =0.1\) and \(\lambda =0.5\) are very similar. On the Weibo and Weibo-clean datasets, BLEU1 increases as \(\lambda\) increases, indicating that the generated responses become more relevant and more ordinary. Accordingly, we can increase \(\lambda\) to make the responses more relevant, or obtain freer responses with a small \(\lambda\).

Table 6. Performance of the TSRG Model When Setting \(\lambda\) to Different Values

| Dataset | \(\lambda\) | BLEU1 | BLEU2 | METEOR | ROUGE-L | DIS1 | DIS2 |
|---|---|---|---|---|---|---|---|
| Weibo | 0 | 35.3451 | 14.8745 | 9.2212 | 20.8986 | 3.2038 | 7.9647 |
| | 0.1 | 37.0253 | 15.2435 | 9.5491 | 21.5450 | 3.3830 | 8.4732 |
| | 0.5 | 37.1699 | 15.0955 | 9.5678 | 21.6796 | 2.9369 | 7.2519 |
| | 1 | 38.0487 | 14.3803 | 9.3030 | 21.8772 | 3.1415 | 7.7374 |
| Weibo-clean | 0 | 20.9173 | 6.8466 | 6.9097 | 17.6510 | 4.4868 | 9.4080 |
| | 0.1 | 22.9278 | 7.1547 | 7.4505 | 18.0727 | 3.9296 | 7.9861 |
| | 0.5 | 22.2460 | 6.8288 | 7.5380 | 18.1644 | 4.3228 | 8.9365 |
| | 1 | 23.8195 | 7.1672 | 6.9397 | 17.1762 | 2.6049 | 5.0699 |
| Twitter | 0 | 4.4994 | 1.5632 | 3.0276 | 8.4133 | 0.0365 | 0.2854 |
| | 0.1 | 4.3324 | 1.4376 | 3.0174 | 8.5586 | 0.0382 | 0.2910 |
| | 0.5 | 4.5472 | 1.5250 | 3.0684 | 8.6260 | 0.0388 | 0.2927 |
| | 1 | 4.7032 | 1.5838 | 3.1828 | 8.8671 | 0.0390 | 0.2962 |

Skip 5CONCLUSION AND FUTURE WORK Section

5 CONCLUSION AND FUTURE WORK

In this article, we proposed TSRG, a two-stage dialogue response generation model that contains a first-stage decoder and a second-stage decoder. The candidate response is generated by the first-stage decoder and then polished by the second-stage decoder into the output response. The two decoders interact through attention mechanisms, which can be used to obtain additional explanations of the two-stage generation procedure. Moreover, a resident token is introduced into the output of the first-stage decoder to help mitigate the exposure bias between the two decoders, a character-aware encoder is used to capture the information contained in the input message effectively, and positional information is aligned with the candidate response to complete the information from the first-stage decoder. TSRG provides an intuitive explanation, and the experiment results show that the proposed model can generate more fluent and diverse responses than the existing baseline models.

In the future, there are several directions to study. First, we will try to connect the two-stage generation process with reinforcement loss to form a more straightforward interaction between decoders. Second, we will try to apply the proposed model to other text generation tasks, such as text summarization, image captioning, and generative question answering. Third, we will study how to find a more appropriate way to build the cluster of raw words.

Skip AUTHOR CONTRIBUTION Section

AUTHOR CONTRIBUTION

Shaobo Li: Methodology, implementation, writing. Chengjie Sun: Methodology, validation, writing, supervision. Zhen Xu: Validation, writing, editing. Prayag Tiwari: Validation, writing. Bingquan Liu: Validation, writing, supervision. Deepak Gupta: Validation, writing. K. Shankar: Validation, writing. Zhenzhou Ji: Validation, writing, supervision. Mingjiang Wang: Validation, writing.


REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR'15). http://arxiv.org/abs/1409.0473.
[2] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (NeurIPS'15). 1171–1179.
[3] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3 (2003), 1137–1155.
[4] Dan Bohus and Alexander I. Rudnicky. 2005. A principled approach for rejection threshold optimization in spoken dialog systems. In Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH'05). ISCA, 2781–2784.
[5] Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao, and Dawei Yin. 2018. Hierarchical variational memory network for dialogue generation. In Proceedings of the 2018 World Wide Web Conference (WWW'18). 1653–1662.
[6] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1724–1734.
[7] Jan Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems (NeurIPS'15). 577–585.
[8] Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
[9] Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, and Shuming Shi. 2019. Generating multiple diverse responses for short-text conversation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI'19), Vol. 33. 6383–6390.
[10] Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI'18).
[11] Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2018. FRAGE: Frequency-agnostic word representation. In Advances in Neural Information Processing Systems (NeurIPS'18). 1334–1345.
[12] Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT'19). 1689–1701.
[13] Shizhu He, Cao Liu, Kang Liu, and Jun Zhao. 2017. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL'17). 199–208.
[14] Tianxing He, Jingzhao Zhang, Zhiming Zhou, and James R. Glass. 2019. Quantifying exposure bias for neural language generation. CoRR abs/1905.10617. http://arxiv.org/abs/1905.10617.
[15] Robert Hecht-Nielsen. 1992. Theory of the backpropagation neural network. In Neural Networks for Perception. Elsevier, 65–93.
[16] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1746–1751.
[17] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT'16). 110–119.
[18] Lu Li, Chenliang Li, and Donghong Ji. 2021. Deep context modeling for multi-turn response selection in dialogue systems. Information Processing & Management 58, 1 (2021), 102415.
[19] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81. https://aclanthology.org/W04-1013.
[20] Yanxiang Ling, Fei Cai, Xuejun Hu, Jun Liu, Wanyu Chen, and Honghui Chen. 2021. Context-controlled topic-aware neural response generation for open-domain dialog systems. Information Processing & Management 58, 1 (2021), 102392.
[21] Cao Liu, Shizhu He, Kang Liu, and Jun Zhao. 2019. Vocabulary pyramid network: Multi-pass encoding and decoding with multi-level vocabularies for response generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19). 3774–3783.
[22] Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP'16). 2122–2132.
[23] Ehsan Montahaei, Danial Alihosseini, and Mahdieh Soleymani Baghshah. 2019. Jointly measuring diversity and quality in text generation models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT'19), 90.
[24] Alice Oh and Alexander Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics Workshop: Conversational Systems (ANLP-NAACL'00).
[25] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02). 311–318.
[26] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1532–1543.
[27] Lisong Qiu, Juntao Li, Wei Bi, Dongyan Zhao, and Rui Yan. 2019. Are training samples correlated? Learning to generate dialogue responses with multiple references. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19). 3826–3835.
[28] Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP'11). 583–593.
[29] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of the 5th International Conference on Learning Representations (ICLR'17). https://openreview.net/forum?id=HJ0UKP9ge.
[30] Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP'15). 1577–1586.
[31] Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR abs/1706.09799. http://arxiv.org/abs/1706.09799.
[32] Shang-Yu Su, Kai-Ling Lo, Yi-Ting Yeh, and Yun-Nung Chen. 2018. Natural language generation by hierarchical decoding with linguistic patterns. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT'18). 61–66.
[33] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NeurIPS'14).
[34] Prayag Tiwari, Hongyin Zhu, and Hari Mohan Pandey. 2021. DAPath: Distance-aware knowledge graph reasoning based on deep reinforcement learning. Neural Networks 135 (2021), 1–12.
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS'17). 5998–6008.
[36] Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. CoRR abs/1506.05869. http://arxiv.org/abs/1506.05869.
[37] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15). 3156–3164.
[38] Benyou Wang, Donghao Zhao, Christina Lioma, Qiuchi Li, Peng Zhang, and Jakob Grue Simonsen. 2019. Encoding word order in complex embeddings. In Proceedings of the 8th International Conference on Learning Representations (ICLR'20). https://openreview.net/forum?id=Hke-WTVtwr.
[39] Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, 2 (1989), 270–280.
[40] Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP'16). 1296–1306.
[41] Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems (NeurIPS'17). 1784–1794.
[42] Han Xiao, Yidong Chen, Xiaodong Shi, and Ge Xu. 2019. Multi-perspective neural architecture for recommendation system. Neural Networks 118 (2019), 280–288.
[43] Qianqian Xie, Prayag Tiwari, Deepak Gupta, Jimin Huang, and Min Peng. 2021. Neural variational sparse topic model for sparse explainable text representation. Information Processing & Management 58, 5 (2021), 102614.
[44] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic aware neural response generation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI'17).
[45] Min Yang, Wenting Tu, Qiang Qu, Zhou Zhao, Xiaojun Chen, and Jia Zhu. 2018. Personalized response generation by dual-learning based domain adaptation. Neural Networks 103 (2018), 72–82.
[46] Yan Zeng and Jian-Yun Nie. 2020. Open-domain dialogue generation based on pre-trained language models. CoRR abs/2010.12780. https://arxiv.org/abs/2010.12780.
[47] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1996. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Record, Vol. 25. ACM, 103–114.
[48] Wen Zhang, Yang Feng, Fandong Meng, Di You, and Qun Liu. 2019. Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'19). 4334–4343.
[49] Guangyou Zhou, Yizhen Fang, Yehong Peng, and Jiaheng Lu. 2019. Neural conversation generation with auxiliary emotional supervised models. ACM Transactions on Asian and Low-Resource Language Information Processing 19, 2 (2019), 1–17.
[50] Peng Zhou, Jiaming Xu, Zhenyu Qi, Hongyun Bao, Zhineng Chen, and Bo Xu. 2018. Distant supervision for relation extraction with hierarchical selective attention. Neural Networks 108 (2018), 240–247.
[51] Hongyin Zhu, Prayag Tiwari, Ahmed Ghoneim, and M. Shamim Hossain. 2021. A collaborative AI-enabled pretrained language model for AIoT domain question answering. IEEE Transactions on Industrial Informatics 18, 5 (2021), 3387–3396.
[52] Qingfu Zhu, Weinan Zhang, Lei Cui, and Ting Liu. 2019. Order-sensitive keywords based response generation in open-domain conversational systems. ACM Transactions on Asian and Low-Resource Language Information Processing 19, 2 (2019), 1–18.

Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3 (March 2023), 570 pages. ISSN: 2375-4699. EISSN: 2375-4702. DOI: 10.1145/3579816.
Publisher: Association for Computing Machinery, New York, NY, United States.
Publication History
• Received: 8 April 2021
• Accepted: 27 October 2021
• Online AM: 18 August 2022
• Published: 10 March 2023
