Abstract
Phonetic features are indispensable in understanding spoken language. This holds especially in Korean, a wh-in-situ and head-final language, where the addressee sometimes finds it hard to discern the speaker's original intention without the sentence prosody. However, acoustic information is not guaranteed in all spoken language processing, owing to the difficulty of managing and computing speech data. This article proposes a corpus that aims to distinguish utterances with ambiguous intention from clear-cut ones, exploiting the prosodic ambiguity of the text input. In detail, the resulting classification system decides whether a given text input is a fragment, statement, question, command, rhetorical question, rhetorical command, or indecisive, taking into account the intonation-dependency of the text. Based on an intuitive understanding of the Korean language engaged in the data annotation, we construct a corpus with seven intention categories, train classification systems, and validate the utility of our dataset with quantitative and qualitative analyses.
1 INTRODUCTION
Understanding speech intention includes all aspects of syntax, semantics, and phonetics. For example, even when the transcript of a spoken utterance is given and the syntactic structure is apparent, the speech act may differ depending on dialogue or social context [45]. Besides, phonetic features such as prosody can influence the genuine intention, which can be different from the speech act that is grasped from the textual information [1].
In conventional spoken language processing pipelines, namely automatic speech recognition (ASR) followed by intention identification, phonetic features tend to be inadvertently removed during speech transcription. For instance, transcripts usually contain no punctuation, which is sometimes essential for a proper understanding of spoken utterances. While a more accurate understanding of the intention is expected if phonetic features are available, as observed in the co-utilization of audio and text [5, 13], such acoustic information is not always guaranteed in real-world spoken language processing, since handling speech data imposes a significant burden compared to lightweight text data. This sometimes incurs "prosodic ambiguity" [12], the possibility of diverse interpretations of a given text with respect to the acoustic features of speech.
Such a phenomenon may not threaten spoken language understanding (SLU) in many languages, especially if the lexical usage is straightforward given the sentence form and type (as in English, where interrogatives and imperatives are generally distinguished from declaratives). However, it matters where prosodic ambiguity significantly affects the intention understanding of transcribed utterances. Up-to-date end-to-end models tackle such weaknesses of pipeline approaches, but they are often vulnerable to unexplainable errors. Instead, we deemed it worthwhile to first tell apart utterances whose intentions are determined solely by the text from those that incorporate prosodic ambiguity, preserving the conventional pipeline structure. Further processing of the ambiguous texts with context or audio can then help the whole system reach the desired decision.
In this study, the language of interest is Korean, a wh-in-situ language with head-final syntax. Natural language processing in Korean is known to be burdensome, not only because the language is agglutinative and morphologically rich but also because of frequent pro-drop and high context-dependency. Moreover, making it challenging to understand utterance meaning from text alone, the intention of certain sentence types is significantly influenced by prosodic information such as the intonation of the sentence ender [20]. Consider the following sentence, whose meaning depends on the sentence-final intonation:
(1) 천천히 가고 있어
chen-chen-hi ka-ko iss-e
With a high rise intonation, the sentence becomes a question (Are you/they going slowly?), and, given a fall or fall-rise intonation, it becomes a statement ((I am) going slowly.). Also, given a low rise or level intonation, the sentence becomes a command (Go slowly.). This phenomenon partially originates in particular constituents of Korean utterances, such as multi-functional particle “-어 (-e),” or other sentence enders determining the sentence type [35]. Although similar tendencies are observed in other languages as well (e.g., declarative questions in English [14]), syntactical and morphological properties of the Korean language strengthen the ambiguity of spoken utterances.
Here, we propose a corpus that can help identify the intention of a spoken Korean utterance, particularly when there are textual utterances that can have diverse meanings depending on the intonation. The system trained on our corpus classifies an input utterance into seven speech act categories: fragment, statement, question, command, rhetorical question, rhetorical command, and intonation-dependent, where the final one indicates that the intention is indecisive and the decision requires further acoustic information. A total of 61,255 lines of text utterances were annotated or generated, including about 20K lines manually tagged with a fair agreement. We claim the following as our contributions:
- A new text annotation scheme that reflects the prosodic sensitivity of Korean utterances (effective for, but not restricted to, a head-final language)
- A freely available corpus of 61K Korean sentences with seven categories (including 20K human-annotated), validated with conventional and up-to-date pretrained language models
In the following section, we review the literature on intention classification and demonstrate the background of our categorization. In the next section, the corpus construction process is described with a detailed annotation scheme. Afterwards, the trained system is evaluated quantitatively and qualitatively on the test set. We also briefly explain how our methodology can be effective in real-world applications.
2 BACKGROUND
Our study is most significantly influenced by existing work on sentence-level semantics. Unlike the syntactic concept presented in Sadock and Zwicky [40], the speech act or intention has been studied in the area of pragmatics, especially along with the illocutionary act and dialogue act (DA) [42, 45].
Among all, we concentrate on lessening vague intersections between the classes, which can be observed between, e.g., statement and opinion [45], in the sense that some statements can be regarded as opinions and vice versa. Thus, slightly different from the apparent boundaries between the sentence forms declaratives, interrogatives, and imperatives [40], we extend them to the syntax-semantics level by adopting the notion of discourse components [38]: common ground, question set, and to-do-list, the constituents of the sentence types that comprise natural language. We interpret them in terms of speech acts, considering the obligation that the sentence imposes on the listener: whether to answer (question), to react (command), or neither (statement).
Building on the concept of the discourse component revisited, we take into account the rhetoricalness of questions and commands, which yields the additional categories of rhetorical question (RQ) [39] and rhetorical command (RC) [19]. We claim that a single utterance falls into one of the five categories or is classified as a fragment, given enough acoustic information. Our categorization is much simpler than conventional DA tagging schemes [3, 45] and is closer to tweet acts [49] or situation entity types [11], but relies less on the dialogue history and is more suitable for short-command settings. We will show that our scheme can be helpful in spoken language processing by discerning prosodic ambiguity from text, which is achieved by introducing a new class of intonation-dependent utterances.
Intention and speech act. At this point, it may be beneficial to point out that the terms intention and speech act are used as domain non-specific indicators of the utterance type. We argue that these two terms differ from intent, which denotes a specific action in the literature [15, 29, 44], along with the concepts of item, object, and argument, generally for domain-specific tasks. Also, unlike dialogue management, where a proper response is created from the dialogue history [27], the proposed system aims to find the genuine intention of a single input utterance and guide the subsequent processing.
Korean sentence types. For corpus annotation, research on Korean sentence types is essential. Although the annotation process partly depends on the intuition of annotators, we refer to work on the syntax-semantics and speech acts of Korean [16, 34, 43] to handle unclear cases regarding optatives, permissives, promisives, request/suggestions, and rhetorical questions. We expect this research to be extended cross-lingually, but we first concentrate on resolving the aforementioned language-specific problems.
3 ANNOTATION PROTOCOL
We clarify the annotation concept of the proposed corpus, which can be adopted to train a text-level classifier for spoken language. As motivated in Section 1, we primarily aim to discern the existence of prosodic ambiguity based solely on lexical features.
We assume that the Korean sentences in our study can each be assigned one of five intention categories, namely statement, question, command, rhetorical question, and rhetorical command. However, given only a single sentence (or a sentence-like text), one might not be able to determine the exact category. First, the annotator checks whether the given sentence is a fragment (FR), i.e., a single word or a chunk whose intention is underspecified under our criteria. Next, if the sentence is not determined to be a fragment, the annotator checks which intentions among the five candidates the sentence connotes, and whether the intention can be decided uniquely. If the decision is not feasible owing to prosodic ambiguity, the sentence is labeled an intonation-dependent utterance (IU). If the sentence is uniquely determined as one of the pre-defined categories, we call it a clear-cut case (CC); the CCs comprise the five utterance types above.
A brief illustration of the annotation process is depicted in Figure 1. Below, we describe each sentence type in the order of FR, CCs, and IU, with example sentences.
Fig. 1. A brief illustration on the proposed annotation protocol.
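The decision flow above can be sketched as a small Python routine. The helper predicates here are hypothetical stand-ins for the annotators' judgment, not the actual guideline:

```python
from typing import Callable, List

CLEAR_CUT = ["statement", "question", "command",
             "rhetorical question", "rhetorical command"]

def annotate(utterance: str,
             is_fragment: Callable[[str], bool],
             candidate_intentions: Callable[[str], List[str]]) -> str:
    """Assign one of the seven labels to a text-only utterance."""
    # Step 1: single words or chunks with underspecified intention.
    if is_fragment(utterance):
        return "fragment"
    # Step 2: plausible intentions among the five clear-cut candidates.
    candidates = [c for c in candidate_intentions(utterance) if c in CLEAR_CUT]
    # Step 3: a unique candidate is a clear-cut case (CC); otherwise the
    # decision needs prosody, i.e., an intonation-dependent utterance (IU).
    if len(candidates) == 1:
        return candidates[0]
    return "intonation-dependent"

# Toy predicates for illustration only (not the real annotation logic):
demo = annotate("천천히 가고 있어",
                is_fragment=lambda u: len(u.split()) == 1,
                candidate_intentions=lambda u: ["statement", "question", "command"])
print(demo)  # -> intonation-dependent, matching example (1)
```

The predicates would in practice be the trained classifier itself; the sketch only fixes the order of the three decisions.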
3.1 Fragments
From a linguistic viewpoint, fragments often refer to a single noun or verb phrase in which ellipsis has occurred [31]. However, colloquial expressions often show omission, replacement, and/or scrambling, hindering us from applying the same definition as for written language. Thus, in this study, we also count some sentence segments whose intention is underspecified. If the input sentence is not a fragment, then it is assumed to belong to the clear-cut cases or to be an intonation-dependent utterance.
Some might argue that fragments can be interpreted as command or question under some circumstances. For instance, simply uttering a noun in a rising intonation can be interpreted as an echo question, and loudly uttering some objects can be considered as a command to bring it on. We observed that a large portion of the intention concerning context is represented in the prosody, which leads us to define prosody-sensitive cases afterwards.
However, for fragments, we found it difficult to assign a specific intention even given audio, since they rely heavily on the dialogue or situational context. Interpreting a single noun as an echo question requires the existence of the original question, and uttering the name of an object as a command requires a circumstance in which the speaker urgently demands something of the addressee. That is, discerning such implications is not usually feasible, especially in a short-command context. Thus, we decided to leave the intention of fragments underspecified, and leave them to be resolved with the help of context in real-world usage. Here are some examples of fragments:
(2a) 마우스
mawusu
mouse
mouse
(2b) 키보드와 마우스
khipodu-wa mawusu
keyboard-AND mouse
keyboard and mouse
(2c) 마우스로
mawusu-lo
mouse-WITH
with mouse
A single word (2a) is a fragment, as is a noun phrase (2b) or a postposition phrase (2c). We concluded that determining the intention of such phrases requires the dialogue history even if the prosody is given.
3.2 Clear-Cut Cases
Clear-cut cases include utterances of five categories: statement, question, command, rhetorical question, and rhetorical command, as described in detail, with examples, in the annotation guideline. Questions are utterances that require the addressee to answer (3a,b), and commands are ones that require the addressee to act physically or psychologically (3c,d). Even if the sentence form is declarative, words such as wonder or should can make the sentence a question or command. Statements are descriptive and expressive sentences that apply to neither case (3e).
(3a) 너 집에 갈거니
ne cip-ey kal-ke-ni
Will you go home?
(3b) 내일 날씨 좀 알려줘
nayil nalssi com ally.e-cwu.e
tomorrow weather POL inform.PRT-give.SE
Please tell me tomorrow’s weather.
(3c) 세 시 반에 나 좀 깨워
sey si pan-ey na com kkaywu.e
three hour half-at I POL wake.SE
Please wake me up at three thirty.
(3d) 목소리 좀 낮추는 게 어때
moksoli com nacchwu-nun key ettay
voice POL lower-PRT thing.NOM how
How about lowering your voice?
(3e) 아무래도 내일 나스닥 떨어질 것 같아
amwulayto nayil nasudak tteleci.l kes kath-a
anyway tomorrow NASDAQ drop.FUT thing seem-SE
I have a feeling that NASDAQ may drop tomorrow.
Rhetorical questions are questions that do not require an answer, because the answer is already in the speaker's mind (4a) [39]. Similarly, RCs are idiomatic expressions in which the imperative structure does not convey a mandatory to-do-list (e.g., Have a nice day, (4b)) [16, 19]. Sentences in these categories are functionally similar to statements but are categorized as separate classes, since they usually carry a non-neutral tone.
(4a) 너 돈 벌기 싫니
ne ton pel-ki silh-ni
you money earn-PRT dislike-INT
Don’t you want to make money? (= It seems that you are not interested in making money.)
(4b) 쏠 테면 쏴 봐
sso.l tey-myen sso.a po.a
shoot.FUT thing.NOM-if shoot.PRT see.SE
Shoot me if you can. (= You won’t be able to shoot me.)
In making up the guideline, we carefully examined the dataset so that the annotation covers ambiguous cases. As stated in the previous section, we refer to Portner [38] in borrowing the concept of discourse components and extend the formal semantic property to the level of pragmatics. That is, we search for a question set (QS) or to-do-list (TDL) that makes an utterance directive in terms of speech act [42], taking into account non-canonical, conversation-style sentences that contain idiomatic expressions and jargon. If we cannot find such components (QS for asking a question and TDL for requesting an action), then the utterance is determined to display the discourse component of common ground (CG). We provide a simplified criterion in Table 1, where the discourse components (CG, QS, and TDL) convey the core concept of the sentence and the sentence forms denote the syntactic property of the sentence ender.
| Discourse component/Sentence form | Common Ground | Question Set | To-do List |
|---|---|---|---|
| Declaratives | Statements, RQ, RC | Question | Command |
| Interrogatives | RQ | Question | Command |
| Imperatives | RC | Question | Command |
Table 1. A Simplified Annotation Scheme Regarding Discourse Component and Sentence Form: Discourse Component in the Table Implies the Concept That Extends the Original Formal Semantic Property [38] to Speech Act Level
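Table 1 can be encoded directly as a lookup; the following is a minimal sketch assuming the label names used in this article, where cells carrying multiple labels (declaratives bearing common ground) remain ambiguous at this level:

```python
# Direct encoding of Table 1: (sentence form, discourse component) -> labels.
TABLE_1 = {
    ("declarative",   "common_ground"): ["statement", "rhetorical question", "rhetorical command"],
    ("declarative",   "question_set"):  ["question"],
    ("declarative",   "to_do_list"):    ["command"],
    ("interrogative", "common_ground"): ["rhetorical question"],
    ("interrogative", "question_set"):  ["question"],
    ("interrogative", "to_do_list"):    ["command"],
    ("imperative",    "common_ground"): ["rhetorical command"],
    ("imperative",    "question_set"):  ["question"],
    ("imperative",    "to_do_list"):    ["command"],
}

# An interrogative form conveying common ground is a rhetorical question.
print(TABLE_1[("interrogative", "common_ground")])  # -> ['rhetorical question']
```

Only the declarative/common-ground cell yields more than one candidate, which is where the rules of thumb in Section 3.3 come into play.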
3.3 Intonation-dependent Utterances
Given the decision criteria for clear-cut cases, we further investigate whether the intention of a given sentence can be determined without information on prosody or intonation. That is, we consider the potential interpretation of an utterance in case it is projected to a textual form, when even the punctuation is omitted or not adequately transcribed with an ASR system. Sentence (1) in Section 1, which is not accompanied by punctuation but is ambiguous, describes such cases.
Although there have been studies on Korean sentences that handle final particles and adverbs [5, 32], to the best of our knowledge, there has been no explicit guideline for text-based identification of utterances that incorporate prosodic ambiguity. On top of this, we set up some principles, or rules of thumb, grounded in the empirical results of our data analysis. Note that the last two, (5) and (6), are closely related to the maxims of conversation [26], e.g., "Do not say more than is required" or "What is generally said is stereotypically and specifically exemplified."
(1) Take into account the possible prosody/intonation of a text input, given no non-lexical information such as emojis and punctuation. Remember that the sentence-final part mainly determines the intonation-dependency of the intention.
(2) A wh-particle is interpreted as an existential quantifier in the case of wh-intervention, owing to Korean being wh-in-situ, changing wh-questions into another type of question or a statement.
(3) Since the subject is dropped in many Korean spoken utterances, one may have to assign all the agents (first to third person) when investigating the sentence type, which depends on the intention. In this process, an awkward combination can be ignored. For instance, …
(4) The presence of vocatives can sometimes restrict the role of the utterance. For instance, in the preceding example, if the vocative '누나 (nwuna, deixis for an older sister, used mainly by male speakers)' is added at the start of the sentence (5b), then it is much more plausible to interpret the sentence as …
(5) Adding adverbs or numeric polarity items may not always preserve the intention of the sentence; one should therefore be aware of the loss of felicity in the interpretation (as to a specific speech act) induced by introducing such components. For instance, in Korean, 좀 (com, slightly) and 하나 (hana, one) are respectively an adverb and a numeric polarity item that induce politeness, as seen in (3b,c). Again, in (5c,d), com and hana can come right after mwe to cautiously convey that the speaker wants to eat something today (and the addressee may feel an obligation to eat something together with the speaker).
(6) Some sentences can have both an underspecified sentence ender (which can make the sentence either a question or a statement) and excessively specific information. Although the sentence form is not a direct link to the intention, in that case the sentence is more likely to be a statement than a declarative question. This matches the intuition that it is not felicitous to request overly specific information as a question, except in some affirmative questions. For instance, if a specific cuisine comes in place of mwe (what) in (5a), then it becomes less felicitous to interpret it as a question, like …
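As a toy illustration of principle (1), a text-only pre-filter might flag candidate IUs by their sentence-final ender. The ender list below is a loose assumption for illustration only, not the paper's actual criteria:

```python
# Hypothetical, non-exhaustive list of multi-functional sentence enders
# (e.g., "-어" discussed with example (1) in Section 1).
AMBIGUOUS_ENDERS = ("어", "아", "지", "데")

def maybe_intonation_dependent(utterance: str) -> bool:
    # Punctuation is assumed absent (as in ASR transcripts, Section 3.3),
    # so only the final syllable of the text is inspected.
    return utterance.strip().endswith(AMBIGUOUS_ENDERS)

print(maybe_intonation_dependent("천천히 가고 있어"))  # -> True
print(maybe_intonation_dependent("너 집에 갈거니"))    # -> False
```

A real system would of course weigh the whole sentence, as principles (2) through (6) make clear; the point is only that the sentence-final part is the first place to look.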
4 CORPUS BUILDING
4.1 Source Scripts
To cover a variety of topics, utterances used for the annotation were collected from (i) a corpus provided by the Seoul National University Speech Language Processing Lab; (ii) a set of frequently used lexicons released by the National Institute of Korean Language; and (iii) manually created questions/commands. Specifically, (i) contains short utterances on topics covering e-mail, housework, weather, transportation, stock, and so on; (ii) is an official Korean word dictionary organized in lexicographical order; and (iii) was created by Seoul Korean speakers based on the annotation scheme of question and command.
4.2 Agreement
From (i), 20K lines were randomly selected, and three Seoul Korean L1 speakers classified them into the seven categories of fragments, intonation-dependent utterances, and five clear-cut cases (Table 2, Corpus 20K). Annotators were well informed of the guideline and discussed conflicts thoroughly during the annotation process. The resulting inter-annotator agreement was κ = 0.85 [10], and the final decision was made by majority voting and adjudication.
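The label-merging step can be sketched as follows; the adjudication callback is a hypothetical stand-in for the human adjudicator:

```python
from collections import Counter

def merge_labels(a1: str, a2: str, a3: str,
                 adjudicate=lambda labels: labels[0]) -> str:
    """Majority vote over three annotators; adjudicate on full disagreement."""
    votes = Counter([a1, a2, a3])
    label, count = votes.most_common(1)[0]
    if count >= 2:                      # at least two annotators agree
        return label
    return adjudicate([a1, a2, a3])     # three-way conflict: adjudication

print(merge_labels("question", "question", "IU"))  # -> question
```

With three annotators and seven classes, a majority exists unless all three disagree, so adjudication is only needed for three-way splits.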
| Categories (7 classes) | Intention | Instances (Corpus 20K) | Instances (Whole) |
|---|---|---|---|
| Fragment | — | 384 | 6,009 |
| Clear-cut cases | Statement | 8,032 | 18,300 |
| | Question | 3,563 | 17,869 |
| | Command | 4,571 | 12,968 |
| | Rhetorical Q. | 613 | 1,745 |
| | Rhetorical C. | 572 | 1,087 |
| Intonation-dependent utterance | Unknown (among 5 candidates) | 1,583 | 3,277 |
| Total | | 19,318 | 61,255 |
Table 2. Composition of the Constructed Corpus
4.3 Augmentation
Considering the shortage of certain utterance types in Corpus 20K, (i)–(iii) were used for data supplementation. First, we trained a simple classifier on Corpus 20K. Then we extracted rhetorical questions, rhetorical commands, and statements from the rest of (i), and checked and relabeled the outcome to supplement each category. Next, from (ii), about 6,000 Korean words were investigated, and only single nouns were collected and added to the fragments. Finally, for (iii), paid participants created questions and commands given the topics of e-mail, housework, weather, and schedule, the categories frequent in Corpus 20K. Of the total 20,000 sentences created, most belonged to questions or commands; the authors manually checked the outcome and relabeled some as statements or IUs. The composition of the final dataset is given in Table 2.
4.4 Train Split
The whole corpus was split into train, validation, and test sets for the model-based experiments. The seven utterance classes were distributed with balance across the sets, whose sizes are 49,620, 5,514, and 6,121, respectively. The dataset is available at https://huggingface.co/datasets/kor_3i4k; in the currently uploaded version, the validation set is obtained by splitting off the last 10% of the train set.
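The validation carve-out described above (last 10% of the train split) can be sketched with synthetic rows; with the real corpus, the pre-split train portion would hold 55,134 lines (49,620 + 5,514):

```python
def carve_validation(train_rows, ratio=0.1):
    """Split off the last `ratio` of the rows as the validation set."""
    cut = int(len(train_rows) * (1 - ratio))
    return train_rows[:cut], train_rows[cut:]   # (train, validation)

# Synthetic stand-in rows; the uploaded dataset would be loaded instead.
rows = [f"utt_{i}" for i in range(100)]
train, valid = carve_validation(rows)
print(len(train), len(valid))  # -> 90 10
```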
5 EXPERIMENT
5.1 Models
To check how our annotation scheme works with the machine learning-based classification algorithms, we investigate the training and validation process with conventional architectures such as convolutional neural network (CNN) [21] or bidirectional long short-term memory (BiLSTM) [41] along with fastText [2] word vectors and up-to-date pretrained language models (PLMs), such as bidirectional encoder representations from Transformers (BERT) [9] and ELECTRA [8].
5.1.1 Conventional Architectures.
Conventional architectures include CNN [21, 23], BiLSTM [41], and self-attentive BiLSTM (BiLSTM-Att [28]). For CNN, two convolution layers were stacked with max-pooling layers in between, summarizing the distributional information lying in an input vector sequence. For BiLSTM, the hidden layer of a specific timestep was fed together with the input of the next timestep to infer the subsequent hidden layer in an autoregressive manner. For a self-attentive embedding, the context vector whose length equals that of the hidden layer of BiLSTM was jointly trained along with the network to provide the weight assigned to each hidden layer. The input format of BiLSTM equals that of CNN except for the channel number, which was set to 1 (single channel) in the CNN model.
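The self-attentive pooling step can be sketched in NumPy, with random values standing in for the trained BiLSTM states and context vector:

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 50, 128                     # 50 timesteps; 64-dim LSTM x 2 directions
hidden = rng.normal(size=(T, H))   # stand-in for BiLSTM hidden states
context = rng.normal(size=(H,))    # stand-in for the trained context vector

scores = hidden @ context                 # (T,) alignment scores
weights = np.exp(scores - scores.max())   # numerically stable softmax
weights /= weights.sum()
sentence_emb = weights @ hidden           # (H,) attention-weighted sum

print(sentence_emb.shape)  # -> (128,)
```

In training, `context` is learned jointly with the network, so the pooling emphasizes the timesteps (here, characters) most indicative of the intention.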
For the input featurization of the conventional architectures, we tokenized sentences at the character level and adopted 100-dimensional fastText dense vectors [2] corresponding to each character. Although the featurization of conventional architectures may not fully match the data-driven representation of BERT-like models, we aimed to incorporate pretraining so that these models remain comparable with up-to-date PLMs. Thus, instead of one-hot vectors or TF-IDF, we exploited word vectors pretrained on 200M lines of drama scripts, publicly available in a GitHub repository, which were reported to perform satisfactorily on spoken language processing tasks such as word segmentation [4].
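A minimal sketch of this featurization, with a random lookup table standing in for the pretrained fastText vectors (and spaces kept as underscore characters, an assumption for illustration):

```python
import numpy as np

MAX_LEN, DIM = 50, 100             # padding length and vector size (Section 5.2)
rng = np.random.default_rng(42)
char_vecs = {}                     # stand-in for the pretrained fastText table

def featurize(sentence: str) -> np.ndarray:
    """Map characters to dense vectors, zero-padding up to MAX_LEN."""
    chars = list(sentence.replace(" ", "_"))[:MAX_LEN]
    mat = np.zeros((MAX_LEN, DIM))
    for i, ch in enumerate(chars):
        if ch not in char_vecs:    # lazily invent a vector for the demo
            char_vecs[ch] = rng.normal(size=DIM)
        mat[i] = char_vecs[ch]
    return mat

x = featurize("천천히 가고 있어")   # 9 characters incl. spaces
print(x.shape)  # -> (50, 100)
```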
5.1.2 Pretrained Language Models.
For BERT-like PLMs, we adopted multilingual BERT (mBERT), KoBERT [47], KcBERT [24], KoELECTRA [36], KcELECTRA [25], and KLUE-BERT [37], which are all currently available in the Hugging Face Transformers library [51]. mBERT, KoBERT, KcBERT, and KLUE-BERT follow Devlin et al. [9], which builds bidirectional encodings upon the Transformer [48], where pretraining optimizes the model on the two subtasks of masked language modeling and next sentence prediction. KoELECTRA and KcELECTRA utilize the replaced token detection (RTD) objective of ELECTRA [8], which strengthens the model from the perspective of logical reasoning. mBERT and KoBERT are pretrained on written-style texts such as Wikipedia. KoELECTRA and KLUE-BERT utilize a large amount of text available online, including a small number of spoken texts from messages and web data [33]. KcBERT and KcELECTRA are pretrained on online news comments, which are much more colloquial and informal than written text. Note that the input features of the PLMs are customized tokens, where the token set differs by model.
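The two pretraining objectives can be contrasted on a toy token sequence; the replacement below is hand-picked for illustration, not produced by a generator model:

```python
tokens = ["천천히", "가고", "있어"]

# MLM (BERT): hide a position; the model must reconstruct the original token.
masked = ["[MASK]" if i == 1 else t for i, t in enumerate(tokens)]

# RTD (ELECTRA): a generator substitutes a token, and the discriminator
# labels every position as original (0) or replaced (1).
corrupted = ["천천히", "빨리", "있어"]          # "가고" replaced by "빨리"
rtd_labels = [int(c != t) for c, t in zip(corrupted, tokens)]

print(masked)      # -> ['천천히', '[MASK]', '있어']
print(rtd_labels)  # -> [0, 1, 0]
```

RTD supervises every token rather than only the masked positions, which is one common explanation for ELECTRA's sample efficiency.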
5.2 Implementation
All conventional architectures were implemented with Keras Python library [7]. CNN includes two convolutional layers of window size 3, with one max pooling layer in between, and BiLSTM is made up of two 64-dimensional forward and backward LSTM layers. For both architectures, the maximum length was set to 50, and empty areas were padded with zeros. The word vector size was fixed to 100, while CNN had a single channel with 32 filters. For self-attentive BiLSTM, the context vector was set up with the same size as the LSTM hidden layers (64). The optimization was done with Adam (5e-4) [22], with batch size 16 and a dropout rate of 0.3. The model for the test was chosen as the best performing one with the validation set, after training for 50 epochs.
Up-to-date PLMs were adopted from the Hugging Face model hub, namely mBERT, KoBERT, KcBERT, KoELECTRA, KcELECTRA, and KLUE-BERT, following the default settings. mBERT is a multilingual model covering around 100 languages with a 119,547-token dictionary, while the other five are monolingual models with 8,002 (KoBERT), 30,000 (KcBERT), 35,000 (KoELECTRA), 50,135 (KcELECTRA), and 32,000 (KLUE-BERT) vocabulary entries, respectively. All tokens were projected to 768-dimensional output layers, and the maximum length was set to 512 following the original Transformer setting. The dropout rate was set to 0.1, with the Adam optimizer (1e-4) and a linear scheduler with 100 warm-up steps. Training with batch size 32 ran for three epochs, which is sufficient for fine-tuning on the created data, and the final trained model was directly used for the test.
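The learning-rate schedule can be sketched as follows, assuming linear warm-up for 100 steps to the 1e-4 peak and then linear decay to zero; the total step count here is synthetic:

```python
def lr_at(step: int, peak: float = 1e-4, warmup: int = 100,
          total: int = 1000) -> float:
    """Linear warm-up to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

print(lr_at(50))    # -> 5e-05 (halfway through warm-up)
print(lr_at(100))   # -> 0.0001 (peak)
print(lr_at(1000))  # -> 0.0 (fully decayed)
```

This mirrors what a linear scheduler with warm-up does inside the Transformers library, without depending on it.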
5.3 Result
Table 3 shows the performance of conventional architectures and up-to-date PLMs, where all results were obtained by inferring the test set.
Dictionary size for the conventional architectures indicates the number of character vectors. Epochs denote the training from scratch for conventional models and fine-tuning for large-scale PLMs. In pretraining, “Mono” denotes that the model pretraining was done with monolingual data, while “Multi” denotes the multilingual case. “Mono (Emb)” means that the pretraining was done only for the embedding vectors (with fastText), not the weight for the whole architecture.
Table 3. Test Result (accuracy) with Conventional Architectures and PLMs
5.3.1 Quantitative Analysis.
Among all the conventional architectures and up-to-date PLMs, KoELECTRA, which is pretrained on both colloquial and written texts with an adequately sized vocabulary, exhibited the highest accuracy. This suggests that both the language model pretraining strategy and the properties of the source corpora benefit classification performance on our dataset.
PLMs outperform conventional architectures in general, but not always. It is notable that not all the fine-tuned PLMs outperform the conventional architectures, which differs from recent reports that PLMs leveraging information from massive corpora have an advantage over models trained solely on the target task. In our experiment, the CNN and BiLSTM(-Att) modules showed performance competitive with some BERT modules, and KoBERT, with the smallest dictionary among the PLMs, failed to outperform the conventional architectures.
Pretraining corpus influences the result. We find that the result is also influenced by the type of source corpora used in pretraining the fastText word vectors or PLMs. Different from the other PLMs, whose pretraining corpora include colloquial texts, the training corpora for mBERT and KoBERT concentrate on written texts such as Wikipedia, which may not fit spoken language processing. In the proposed task, some utterances are more challenging to categorize because of prosodic cues that are not explicit in the textual form. Such a property may have made it difficult for mBERT and KoBERT to meet the desired standard, while guaranteeing the competitive performance of the conventional modules, whose fastText word vectors are trained on colloquial and non-normalized drama scripts [4].
Less sensitive to OOV and follows scaling laws. It is also noteworthy that mBERT, trained on a multilingual vocabulary and corpora, outperforms KoBERT, which is based on similar but monolingual corpora. This suggests that our dataset is less vulnerable to the out-of-vocabulary issues of mBERT, whose Korean Hangul vocabulary is small (about 3.3K). Instead, it can be inferred that the models follow the scaling laws for neural language models [18], as observed similarly between KcBERT and KcELECTRA, or KLUE-BERT and KoELECTRA (though only weakly significant).
Data fits the models. Despite some results beyond expectation, it is still encouraging that the PLMs show adequate performance with only a simple fine-tuning of three epochs. In the future, updated PLMs pretrained on more varied spoken language corpora with advanced strategies may reach higher performance with lightweight architectures, which would help the real-world application of the trained module.
5.3.2 Further Investigation using PLMs.
As using PLMs is the de facto standard in recent literature, we conducted a further investigation to help understand how the constructed dataset can be utilized in analysis and practice. In Table 4, we compare the size and domain of the pretraining corpora of the PLMs, referring to Hur et al. [17] and Yang [52], and their performance in various classification scenarios.
| Model | Pretraining corpus size | Pretraining corpus domain | Sevenfold (IU) | Error (%) | Threefold (IU) |
|---|---|---|---|---|---|
| mBERT | 2.5 B (words) | Wikipedia (of 104 languages) | 89.57 (66.65) | 0.19 | 93.12 (17.23) |
| KoBERT | 5.4 M (words) | Korean Wikipedia | 52.87 (22.23) | 20.19 | 92.40 (0) |
| KcBERT | 12 GB | Korean online news comments | 90.93 (69.92) | 0.11 | 94.76 (41.63) |
| KoELECTRA | 34 GB | Korean Wikipedia, Namu Wiki, … | 91.98 (72.86) | 0.36 | 96.37 (68.13) |
| KcELECTRA | 17 GB | Korean online news comments | 91.95 (72.16) | 0.11 | 96.72 (70.81) |
| KLUE-BERT | 63 GB | Modu Corpus [33], CC-100-Kor [50], Namu … | 91.72 (72.18) | 0.20 | 96.13 (65.09) |
Table 4. Comparison of Pretraining Corpora and Performance of Each PLM Module
Note that the size and domain given for mBERT denote the pretraining corpora of all the relevant languages, so the size cannot be specified for a single language.
For all the PLMs, the pretrained weights were frozen, and we additionally trained a single fully connected network added on top of the highest 768-dimensional layer for each classification scenario.
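The frozen-encoder probing setup above can be sketched as follows. This is a minimal PyTorch sketch, not the paper's released code: `DummyEncoder` is a hypothetical stand-in for the actual PLM (which would be loaded from its pretrained checkpoint), and only the hidden size (768) and the single trainable fully connected head follow the described setup.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained encoder emitting 768-dim
    token states; a real PLM checkpoint would be used in practice."""
    def __init__(self, vocab_size=1000, hidden=768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids):
        return self.emb(input_ids)           # (batch, seq_len, hidden)

class FrozenPLMClassifier(nn.Module):
    """Frozen encoder plus one trainable fully connected layer on the
    highest 768-dimensional representation."""
    def __init__(self, encoder, num_labels=7):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.head = nn.Linear(768, num_labels)

    def forward(self, input_ids):
        with torch.no_grad():                # no gradients through the encoder
            states = self.encoder(input_ids)
        return self.head(states[:, 0])       # classify from first-token state
```

Only the parameters of `head` receive gradient updates, so fine-tuning reduces to training a 768-by-7 linear probe over the fixed sentence representation.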
First, as discussed in the previous section, the size of the pretraining corpus seems to influence performance: mBERT outperforms KoBERT, and KoELECTRA outperforms KcBERT and KcELECTRA. However, given that KcELECTRA performs almost on par with KoELECTRA in the sevenfold scenario and even outperforms it in the threefold one despite a half-sized pretraining corpus, familiarity with colloquial text appears crucial to the practical utilization of the proposed dataset. In other words, effective fine-tuning on the dataset requires domain-specific (especially prosodic and phonetic) linguistic knowledge, such as spoken-language sentence structure, that helps disambiguate the role of polarity items or sentence enders that can completely change or diversify the meaning of utterances. Also, concentrating on a domain-specific dictionary seems to lessen the statistical uncertainty of training and inference, given the relatively stable results of KcBERT and KcELECTRA compared to the other written-text-based or general-domain models.
Next, the ELECTRA models (KoELECTRA, KcELECTRA) show higher overall and IU performance than the BERT-based ones. This suggests that ELECTRA's replaced token detection (RTD) training scheme fits the current downstream task better than BERT's masked language modeling, considering that RTD has conventionally been more suitable for logical or factoid problems such as natural language inference [8], which require a slightly different aspect of language understanding than indecisive tasks such as sentiment analysis. Finding the presence of ambiguity in a given text is closer to detecting an attribute than to deciding its intensity. In contrast, detecting rhetoricalness (as in RQ and RC) is a less clear-cut problem, and its greater dependence on context or other non-verbal cues may have yielded the lower accuracy in those classes.
Last, we examine how each module distinguishes intonation-dependent utterances from fragments or clear-cut cases, and how such an approach can be further utilized to promote model development. Unfortunately, we found that threefold classification is not yet practical for performance enhancement, since integrating the CCs into one class yields a severe imbalance among FRs, IUs, and CCs. However, because detecting IUs is promising in both scenarios with the ELECTRA models, we expect that balancing the dataset (beyond merely integrating classes) can boost performance and enable multi-stage classification, which would benefit both the detection of IUs and the classification of CCs. We leave the adequate sampling strategy and dataset reformulation as future work.
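The class integration behind the threefold scenario can be made concrete with a simple label mapping. The label strings below are hypothetical stand-ins for the corpus's actual tags; only the grouping into FR, IU, and a merged CC class follows the scheme described above.

```python
# Hypothetical label names for the seven intention types; the released
# corpus's actual tag strings may differ.
SEVENFOLD = ["fragment", "statement", "question", "command",
             "rhetorical_question", "rhetorical_command",
             "intonation_dependent"]

def to_threefold(label: str) -> str:
    """Collapse the sevenfold scheme into FR / IU / CC, merging all
    five clear-cut intention types into a single CC class."""
    if label == "fragment":
        return "FR"
    if label == "intonation_dependent":
        return "IU"
    if label in SEVENFOLD:
        return "CC"
    raise ValueError(f"unknown label: {label}")
```

Because five of the seven classes collapse into CC, the resulting three-class distribution is heavily skewed toward CC, which is exactly the imbalance noted above.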
5.3.3 Qualitative Analysis.
We built a confusion matrix from the results of the fine-tuned KoELECTRA module, which shows the most reliable performance (Table 5). Fragments, statements, questions, and commands show high accuracy (\( \gt \)92%), while the other classes fall lower (\( \lt \)80%).
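The per-class accuracies discussed below come from such a confusion matrix; a minimal pure-Python sketch of its computation (function names and the two-label example are illustrative, not from the paper):

```python
from collections import Counter

def confusion_matrix(gold, pred, labels):
    """Rows index the gold label, columns the predicted label."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]

def per_class_accuracy(matrix, labels):
    """Diagonal count over row sum, i.e., recall for each gold class."""
    return {lab: matrix[i][i] / sum(matrix[i]) if sum(matrix[i]) else 0.0
            for i, lab in enumerate(labels)}
```

Off-diagonal cells of a row show which classes a gold label is confused with, which is how the RQ-versus-statement confusions below are read off.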
Challenges. RQs show the lowest accuracy (73%), and a large portion of the errors involve utterances that are difficult even for a human to disambiguate, since nuance is involved. Such cases include questions without tags or wh-particles, for example, “난 버린 거예요” (Nan pelyn keyeyyo, Did you dump me?). At a glance, the sentence can be interpreted as either interrogative or declarative in Korean, since there is no subject or polarity item that determines the rhetoricalness of the sentence. However, people usually do not ask “Did you dump me?” because they are curious about the answer. The model found it hard to tell such rhetorical sentences from declarative statements.
RCs and IUs also show low accuracy. Nevertheless, it is encouraging that false alarms on RCs and IUs are generally infrequent (except for statements predicted as IU). For RCs, a false alarm might induce an excessive reaction from the addressee (e.g., an AI agent) in cases involving optatives (“Have a nice day!”) or greetings (“See you later!”). For IUs, an unnecessary analysis of the speech data would be performed if a clear-cut case were classified incorrectly as IU. The low false alarm rate of both categories sheds light on the further utilization of the trained system in circumstances with single short commands.
False alarms. Though less significant than the challenging cases above, we observed a tendency among the errors predicted as statements. Most of them are long sentences that can confuse the system into reading them as descriptive, especially those that are originally a question, command, or RQ. For example, some misclassified commands contained a modal phrase (e.g., -야 한다 (-ya hanta, should)) that is frequently used in prohibitions or requirements, which lets the utterance be recognized as descriptive. We also found errors incurred by the morphological ambiguity of Korean: “베란다” (peylanta, a terrace) was classified as a statement due to the presence of “란다” (lanta, a declarative sentence ender), although the word (a single noun) has nothing to do with descriptiveness.
6 DISCUSSION
6.1 Findings
In the experiment, we found that the proposed corpus, constructed with a satisfactory agreement (0.85), achieves accuracy that meets industrial needs (around 0.9) with both conventional architectures and up-to-date PLMs. Since we publicly release the corpus and training schemes to facilitate future research, we expect that our dataset can serve as a source of efficient SLU or natural language understanding (NLU) management and, at the same time, as a Korean sentence classification benchmark.
One concern is that adequate classification performance or agreement does not necessarily guarantee the optimality of our sentence categorization scheme. For instance, if we merely categorized sentences by their sentence form (declaratives, interrogatives, and imperatives), the scheme would be clearer and the classification performance might be far higher. However, that would not resolve the ambiguity frequently observed in SLU environments.
To attack this, we adopted the concept of discourse components, assuming that the genuine intention of a sentence can be categorized into one of CG, QS, and TDL regardless of its form. We also took into account SLU environments where only transcripts, possibly without punctuation marks, are available. This is the background against which we set a categorization scheme with broader coverage that includes fragments and intonation-dependent utterances, where the former are underspecified and the latter are indecisive without prosodic information. Although the experimental results do not guarantee that our categorization covers all Korean sentence types, a well-defined annotation guideline with examples and the resulting corpus may benefit the application of the trained modules.
6.2 Application
Applying our corpus and the trained system to the real world is an essential consideration for the broader impact of our research. We claim that our protocol enables conventional spoken language understanding systems that use an ASR-NLU pipeline to handle transcribed utterances more efficiently. First, the corpus can make the system function without wake-up words such as “Siri” or “Bixby,” with the proper aid of ASR and speaker verification technologies (free-running environment). Besides, the corpus can be exploited to make the system react only to utterances that require feedback, while simply generating chit-chat for other non-directive utterances (Omakase dialogue system).
6.2.1 Free-Running Environment.
In Table 1, the sentence types with the discourse component of common ground, namely statements, RQs, and RCs, are non-directive utterances. Such utterances may require the addressee’s reaction (answering or acting) in a specific context, but usually not when they are used to start a dialogue. In usual SLU environments, where the user’s command starts the conversation between human and agent, it is essential to discern directive intention from a single input utterance.
In this regard, given that an acoustic channel is open for the device, a system trained on our corpus may suggest which input utterances to accept as commands, instead of requiring wake-up words from the user. This simple detection prevents unnecessary wake-ups caused by false alarms (e.g., by non-directive sentences containing words pronounced similarly to “Siri”), and in the case of an IU, the device may provide acoustic information for further processing. At the same time, the system induces the agent’s reaction without requiring wake-up words, so agents need not interrupt users’ non-directive utterances in ordinary conversation.
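The wake-up-free gating just described can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the label names are hypothetical, and `request_audio` stands in for whatever routine fetches acoustic information for an IU.

```python
DIRECTIVE = {"question", "command"}

def should_react(intention: str, request_audio=None) -> bool:
    """Decide whether the agent reacts to a transcribed utterance,
    based on the predicted intention instead of a wake-up word."""
    if intention in DIRECTIVE:
        return True                     # directive: react immediately
    if intention == "intonation_dependent" and request_audio is not None:
        return request_audio()          # defer to acoustic analysis for IUs
    return False                        # fragments / non-directives: stay silent
```

The default `False` branch is what keeps the agent from interrupting non-directive utterances in ordinary conversation.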
6.2.2 Omakase Dialogue System.
The Omakase dialogue system is a coined term for a dialogue manager that adopts a module trained on our dataset. Figure 2 depicts a simplified architecture of the system.
Fig. 2. A brief illustration of the Omakase dialogue system.
For the transcript of a single utterance in spoken dialogue, the trained module first categorizes the intention into one of the seven sentence types. If the intention is discerned as directive, namely a question or command, the manager adds it to the array of instructions so that the following module can understand the instruction and take action (for commands) or give an answer (for questions). If the intention of the utterance is underspecified or non-directive, the manager checks whether the topic of the utterance is shared with any listed instruction and holds the instruction if relevant. If the topic is relevant to none of the listed instructions, the manager merely generates the following sentence, for instance, simple chit-chat for the user’s enjoyment. Even if the utterance is directive and instructional, such chit-chat is inserted to accommodate a smooth continuation of the dialogue.
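The routing logic above can be sketched as a small manager class. This is a hypothetical sketch of the control flow only: `classify`, `same_topic`, and `chit_chat` are assumed callables standing in for the trained classifier, a topic matcher, and a response generator.

```python
class OmakaseManager:
    """Minimal sketch of the dialogue manager's routing logic."""
    def __init__(self, classify, same_topic, chit_chat):
        self.classify = classify
        self.same_topic = same_topic
        self.chit_chat = chit_chat
        self.instructions = []          # array of pending instructions

    def handle(self, utterance):
        intent = self.classify(utterance)
        if intent in ("question", "command"):
            self.instructions.append((intent, utterance))
            return "listed"             # hand off to the instruction module
        for _, listed in self.instructions:
            if self.same_topic(utterance, listed):
                return "held"           # topic shared with a pending instruction
        return self.chit_chat(utterance)
```

Keeping the instruction array separate from the chit-chat path is what lets the system stay task-oriented while remaining conversational.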
Though only conceptual at this point, we named this system Omakase, since it aims at a well-serving, smart task-oriented agent that is also fluent at chit-chatting with the user, much like an Omakase chef who is a guru at making sushi and at the same time fluent at talking with the customers. The spirit is aligned with the approach recently suggested by Sun et al. [46]. However, our approach intends a more heuristic and less data-driven, but assistive and attachable, module. Also, since our approach incorporates Korean sentences of various syntax and sentence forms labeled with their intention, the resulting classifier may fit a wide range of users who are not used to talking to AI agents in a commanding manner. In other words, our approach heads toward more human-familiar and inclusive usage of SLU modules.
7 CONCLUSION
In this article, we proposed a textual classification scheme for spoken Korean that considers the intonation dependency of a given sentence. The corpus was created based on an annotation principle that first detects fragments and then categorizes the sentence into one of five intention types, considering whether such categorization is possible without prosodic information. For data-driven training of deep learning models, 61K sentences were collected, with a fairly high inter-annotator agreement on 20K manually tagged samples. The neural network-based classification yielded adequate accuracy, proving the validity of our approach. We also found that PLMs trained on colloquial texts fit our task better, suggesting that our corpus can be a new benchmark for Korean spoken language understanding, which is lacking in a literature dominated by tasks with written texts.
Though we could not investigate speech signal input in this article, direct usage of the trained systems might enhance the accuracy of spoken language processing. In particular, there are emerging needs and studies on end-to-end SLU systems [6, 30], which aim to reduce the error propagation and computation issues of conventional ASR-NLU pipelines. Up-to-date SLU modules are thus being used alongside, or in place of, conventional pipelines. We believe our scheme can benefit both pipeline and end-to-end modules by weighing the importance of each approach. For instance, the probability of predicting the input as an IU can be aligned with the output distribution of the end-to-end module, to tell how much that distribution should count in the final decision. This kind of application does not harm the power of the ensembled guess and, at the same time, allows efficient computation if the pipeline and end-to-end modules are computed sequentially.
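One way to realize this alignment, as a sketch: weight the end-to-end module's output distribution by the text module's IU probability, so that acoustic evidence counts more when the text looks intonation-dependent. The convex-combination weighting below is an illustrative assumption, not the paper's fixed design.

```python
def blend(pipeline_probs, e2e_probs, p_iu):
    """Convex combination of the pipeline and end-to-end output
    distributions, governed by the probability that the text is an IU."""
    assert abs(sum(pipeline_probs) - 1.0) < 1e-6
    assert abs(sum(e2e_probs) - 1.0) < 1e-6
    return [(1.0 - p_iu) * p + p_iu * q
            for p, q in zip(pipeline_probs, e2e_probs)]
```

With `p_iu = 0` the pipeline decision passes through unchanged, so the end-to-end module only needs to be evaluated when the text module flags prosodic ambiguity, which is the efficiency argument made above.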
A large portion of this article concentrates on verifying the validity of our corpus in a computational manner, but our goal in theoretical linguistics lies in building a new speech act categorization that aggregates potential prosodic cues. It proved successful computationally, but as discussed in Section 6.1, the promising result does not guarantee theoretical completeness. Challenges remain in handling jussives such as promissives and exhortatives, since utterances that require social context for disambiguation are not clearly categorized from a linguistic viewpoint, such as “It’s so hot here,” which asks the addressee to open the window. In our annotation scheme, such utterances were considered non-directive and may require dialogue history or multimodal input to be determined as instructions. These kinds of disambiguation are to be handled in our future research addressing social convention.
A promising application of the proposed system concerns spoken language understanding modules for smart agents, especially those targeting humanlike conversation with the user. This is why our categorization considers the directiveness and rhetoricalness of an utterance. We expect that identifying the rhetoricalness and non-directiveness of an utterance in dialogue management might help people who are unfamiliar with talking to intelligent agents, while at the same time preventing false alarms. It may widen the accessibility of speech-driven AI services and shed light on flexible dialogue management. For real-life application, we aim to check how the proposed scheme can be extended or distilled to speech understanding beyond the textual level. We provide the corpora25 and a PLM-based recipe26 freely online to encourage future research toward humanlike Korean spoken language understanding.
ACKNOWLEDGMENT
The annotation guideline and the initial version of the corpus were elaborately constructed with the great help of Ha Eun Park and Dae Ho Kook. Also, the authors appreciate Jong In Kim, Jio Chung, and \( \dagger \)Kyu Hwan Lee from SNU Spoken Language Processing laboratory (SNU SLP) for providing useful corpora for the analysis.
Footnotes
1 Denotes a progressive marker.
2 Denotes underspecified sentence enders; final particles whose roles vary.
3 In this article, intention and act are often used interchangeably. In principle, the intention of an utterance is the object of grasping, and the act of speech is a property of the utterance itself. However, we denote determining the act of a speech, such as question and demand, as inferring the intention.
4 A more elaborate definition of fragments and intonation dependency is discussed in the next section.
5 Throughout this article, text refers to the sequence of symbols (or letters) with the punctuation marks removed, which is a frequent output format of speech recognition. Also, sentence and utterance are used interchangeably to denote an input, where the latter usually implies an object with intention while the former does not necessarily.
6 Currently uploaded online in Korean. https://docs.google.com/document/d/1-dPL5MfsxLbWs7vfwczTKgBq_1DX9u1wxOgOPn1tOss.
7 Denotes a functional particle.
8 Denotes an interrogative ender.
9 Denotes a polarity item for politeness in asking something.
10 Denotes a nominative case.
11 Denotes a future tense.
14 https://github.com/warnikchow/raws.
15 https://github.com/huggingface/transformers.
16 https://huggingface.co/bert-base-multilingual-cased.
17 Originally provided at https://github.com/SKTBrain/KoBERT, and a version served for Hugging Face Transformers is available at https://huggingface.co/monologg/kobert.
18 https://huggingface.co/beomi/kcbert-base.
19 https://huggingface.co/monologg/koelectra-base-v3-discriminator.
20 https://huggingface.co/beomi/KcELECTRA-base.
21 https://huggingface.co/klue/bert-base.
22 We additionally set weight decay 0.01, Adam beta1 = 0.9, Adam beta2 = 0.95, and Adam epsilon 1e-8.
23 The optimization scheme for the PLMs was set more delicately due to the sensitivity of the models.
24 The performance for the sevenfold scenario slightly differs from Table 3, which recorded the best score, since the score here is averaged over five repetitions with different initializations.
25 https://github.com/warnikchow/3i4k.
26 https://colab.research.google.com/drive/13IQCnXkPykwxWioby3W0l7pwTpJzESsf#scrollTo=Vbhh6YZ2vbb5.
- [1] . 1999. Is that a real question? Final rises, final falls, and discourse function in yes-no question intonation. Clin. Lab. Sci. J. 35 (1999), 1–14.
- [2] . 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Ling. 5 (2017), 135–146.
- [3] . 2010. Towards an ISO standard for dialogue act annotation. In Proceedings of the 7th Conference on International Language Resources and Evaluation (LREC’10).
- [4] . 2021. Giving space to your message: Assistive word segmentation for the electronic typing of digital minorities. In Proceedings of the Designing Interactive Systems Conference (DIS’21). Association for Computing Machinery, New York, NY, 1739–1747.
- [5] . 2020. Text matters but speech influences: A computational analysis of syntactic ambiguity resolution. In Proceedings of the 42nd Annual Meeting of the Cognitive Science Society (CogSci’20).
- [6] . 2020. Speech to text adaptation: Towards an efficient cross-modal distillation. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech’20). 896–900.
- [7] . 2015. Keras. Retrieved from https://github.com/fchollet/keras.
- [8] . 2019. ELECTRA: Pre-training text encoders as discriminators rather than generators. In Proceedings of the International Conference on Learning Representations.
- [9] . 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
- [10] . 1971. Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 5 (1971), 378.
- [11] . 2016. Situation entity types: Automatic classification of clause-level aspect. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1757–1768.
- [12] . 2010. The ambiguity of ‘ambiguity’: Beauty, power, and understanding. In Ambiguity and the Search for Meaning: English and American Studies at the Beginning of the 21st Century (Volume 2: Language and Culture). Jagiellonian University Press, 33–52.
- [13] . 2017. Speech intention classification with multimodal deep learning. In Proceedings of the Canadian Conference on Artificial Intelligence. Springer, 260–271.
- [14] . 2002. Declarative questions. In Semantics and Linguistic Theory, Vol. 12. 124–143.
- [15] . 2018. From audio to semantics: Approaches to end-to-end spoken language understanding. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT’18). IEEE, 720–726.
- [16] . 2000. The Structure and Interpretation of Imperatives: Mood and Force in Universal Grammar. Psychology Press.
- [17] . 2021. K-EPIC: Entity-perceived context representation in Korean relation extraction. Appl. Sci. 11, 23 (2021), 11472.
- [18] . 2020. Scaling laws for neural language models. arXiv:2001.08361. Retrieved from https://arxiv.org/abs/2001.08361.
- [19] . 2019. Fine-tuning natural language imperatives. J. Logic Comput. 29, 3 (2019), 321–348.
- [20] . 2005. Evidentiality in achieving entitlement, objectivity, and detachment in Korean conversation. Discourse Stud. 7, 1 (2005), 87–108.
- [21] . 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, 1746–1751.
- [22] . 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15).
- [23] . 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
- [24] . 2020. KcBERT: Korean comments BERT. In Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology. 437–440.
- [25] . 2021. KcELECTRA: Korean Comments ELECTRA. Retrieved from https://github.com/Beomi/KcELECTRA.
- [26] . 2000. Presumptive Meanings: The Theory of Generalized Conversational Implicature. MIT Press.
- [27] . 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 986–995.
- [28] . 2017. A structured self-attentive sentence embedding. In Proceedings of the 5th International Conference on Learning Representations (ICLR’17). OpenReview.net. https://openreview.net/forum?id=BJC_jUqxe.
- [29] . 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech’16). 685–689.
- [30] . 2019. Speech model pre-training for end-to-end spoken language understanding. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech’19). 814–818.
- [31] . 2005. Fragments and ellipsis. Ling. Philos. 27, 6 (2005), 661–738.
- [32] . 2014. A novel dichotomy of the Korean adverb nemwu in opinion classification. Stud. Lang. 38, 1 (2014), 171–209.
- [33] . 2020. NIKL CORPORA 2020 (v.1.0). Retrieved from https://corpus.korean.go.kr.
- [34] . 2006. Jussive clauses and agreement of sentence final particles in Korean. Jpn/Kor. Ling. 14 (2006), 295–306.
- [35] . 2008. Types of clauses and sentence end particles in Korean. Kor. Ling. 14, 1 (2008), 113–156.
- [36] . 2020. KoELECTRA: Pretrained ELECTRA Model for Korean. Retrieved from https://github.com/monologg/KoELECTRA.
- [37] . 2021. KLUE: Korean language understanding evaluation. In Proceedings of the 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=q-8h8-LZiUm.
- [38] . 2004. The semantics of imperatives within a theory of clause types. In Semantics and Linguistic Theory, Vol. 14. 235–252.
- [39] . 2006. Rhetorical questions as redundant interrogatives. In San Diego Linguistics Papers. Department of Linguistics, UCSD, 134–168.
- [40] . 1985. Speech act distinctions in syntax. Lang. Typol. Syntact. Descript. 1 (1985), 155–196.
- [41] . 1997. Bidirectional recurrent neural networks. IEEE Trans. Sign. Process. 45, 11 (1997), 2673–2681.
- [42] . 1976. A classification of illocutionary acts. Lang. Soc. 5, 1 (1976), 1–23.
- [43] . 2017. The Syntax of Jussives: Speaker and Hearer at the Syntax-Discourse Interface. Ph.D. Dissertation. Seoul National University.
- [44] . 2007. A case study of comparison of several methods for corpus-based speech intention identification. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING’07). 255–262.
- [45] . 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Comput. Ling. 26, 3 (2000), 339–373.
- [46] . 2021. Adding chit-chat to enhance task-oriented dialogues. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1570–1583.
- [47] . 2019. Korean BERT Pre-trained Cased (KoBERT). Retrieved from https://github.com/SKTBrain/KoBERT.
- [48] . 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
- [49] . 2016. Tweet acts: A speech act classifier for Twitter. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 10.
- [50] . 2020. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4003–4012.
- [51] . 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 38–45.
- [52] . 2021. Transformer-based Korean pretrained language models: A survey on three years of progress. arXiv:2112.03014. Retrieved from https://arxiv.org/abs/2112.03014.
Text Implicates Prosodic Ambiguity: A Corpus for Intention Identification of the Korean Spoken Language