Research article (Open Access)

A Statistical Language Model for Pre-Trained Sequence Labeling: A Case Study on Vietnamese

Published: 13 December 2021


Abstract

By defining a computable word segmentation unit and studying its probability characteristics, we establish an unsupervised statistical language model (SLM) for a new pre-trained sequence labeling framework in this article. The proposed SLM is an optimization model whose objective is to maximize the total binding force of all candidate word segmentation units in sentences, without any annotated datasets or vocabularies. To solve SLM, we design a recursive divide-and-conquer dynamic programming algorithm. By integrating SLM with popular sequence labeling models, we perform Vietnamese word segmentation, part-of-speech tagging, and named entity recognition experiments. The experimental results show that our SLM can effectively improve the performance of sequence labeling tasks. Using less than 10% of the training data and no dictionary, our sequence labeling framework outperforms the state-of-the-art Vietnamese word segmentation toolkit VnCoreNLP on the cross-dataset test. SLM has no hyper-parameters to tune, is completely unsupervised, and is applicable to any other analytic language. Thus, it has good domain adaptability.


1 INTRODUCTION

Sequence labeling is a universal framework for natural language processing (NLP). It has been applied to many NLP tasks, including word segmentation (performed in analytic language processing), part-of-speech (POS) tagging, noun phrase chunking, and named entity recognition (NER) [5, 21, 40]. These tasks play very important roles in NLP and support many downstream applications, such as relation extraction, syntactic parsing, and entity linking [18, 19]. Due to the rapid development of deep learning in recent years, deep learning methods have replaced traditional statistical learning methods (e.g., CRF and HMM) as the mainstream approach to sequence labeling and have achieved much better performance on many tasks.

Training a neural network for sequence labeling requires a large amount of annotated data, and the amount of required data usually increases with the network size. Unfortunately, building large annotated datasets is expensive and requires well-trained annotators [2]. Even though some labeled datasets are available, most of them are domain specific, so the domain adaptability of a trained model is greatly limited, especially for low-resource languages [31] such as Vietnamese and Thai. To solve the problems of domain adaptability and insufficient labeled data, researchers have exploited many different methods to draw information from other easily available datasets to improve the performance of deep learning models. These methods can be roughly divided into four categories. The first is to train a neural network language model on a large amount of unlabeled text to obtain word embeddings containing rich semantic and grammatical information, such as Word2Vec and GloVe (static word embeddings) and ELMo and BERT (dynamic word embeddings) [8, 24, 28, 29]. The second is to use additional knowledge to improve the performance of sequence labeling models [10, 31, 40]. The third is transfer learning and meta-learning [2, 38]. The fourth is multi-task learning [19, 34].

In this article, we propose an unsupervised statistical language model (SLM) to improve the performance of sequence labeling models for low-resource analytic languages. We choose Vietnamese as a case study. Our SLM is an optimization model based on the co-occurrence of n-gram words. It aims at maximizing the total binding force of all candidate word segmentation units (WSUs) in sentences. The solution of SLM optimally divides the given sentence into relatively independent semantic and grammatical units, namely WSUs, as described in Section 4.1. Figure 1 illustrates a feasible solution of SLM for a specific input. Experiments on Vietnamese datasets show that these pre-divided WSUs can deliver useful information to sequence labeling models and improve their performance. The advantages of our proposed unsupervised SLM are as follows:


Fig. 1. A feasible solution of SLM under a specific input sentence. The objective value of SLM is calculated by . represents the candidate WSU set obtained by splitting the input sentence. hi and ni represent the binding force and the occurrence number of WSUi in the corpus.

  • It is completely unsupervised. It does not require any labeled data and other additional knowledge, such as vocabularies and gazetteers.

  • It is language agnostic: it handles analytic languages such as Chinese, Vietnamese, Thai, and Burmese uniformly.

  • It has good domain adaptability, as long as enough task-specific plain texts are collected.

  • It has no hyper-parameters.

  • Our proposed sequence labeling framework is task independent and loosely coupled. SLM can be integrated with a variety of machine learning models, such as SVM, CRF, BiLSTM, and BERT. Our code and data are available at https://github.com/Liaoxianwen/SLM_code_data.


2 RELATED WORK

How to improve the performance of sequence labeling tasks is a current research hotspot. These tasks typically include word segmentation, POS tagging, and NER [5, 7, 14, 17, 25, 39]. The means of improvement include semi-supervision, knowledge enhancement, multi-task learning, transfer learning, and meta-learning. At the AAAI 2020 conference, a total of 13 sequence labeling papers were accepted, 11 of which used comprehensive models and obtained state-of-the-art results on multiple datasets. Of these 11 papers, 7 use semi-supervised or knowledge-enhancement models [1, 10, 16, 20, 23, 30, 31], 2 use multi-task learning models [12, 34], and 2 use transfer learning and meta-learning models [2, 38]. The pre-trained sequence labeling framework proposed in this article belongs to the category of semi-supervision or knowledge enhancement.

In these semi-supervised or knowledge-enhancement deep learning models, the types of data used differ. In the work of Ali et al. [1], a knowledge graph was used to train an edge-weighted attentive graph convolution network to refine noisy mention representations. In the work of Liu et al. [20], distant supervision was generalized by extending dictionaries with headwords. Knowledge graph augmented word representation (KAWR) was introduced for NER by encoding the prior knowledge of entities from an external knowledge base [10]. A domain adaptation method using Bayesian sequence combination was proposed in the work of Simpson et al. [31] to exploit pre-trained models and unreliable crowdsourced data. In the work of Mayhew et al. [23], a standard BiLSTM-CRF model was integrated with a pre-trained truecaser for NER. A semi-supervised learning approach based on multi-view models through consensus promotion was proposed in the work of Lim et al. [16] for multi-task tagging. In the work of Safranchik et al. [30], multiple heuristic rules were used to establish a weak supervision framework for sequence labeling tasks.

At present, statistical language models have largely been replaced by neural language models in NLP. However, SLMs still play a role. A statistical model based on linguistic knowledge was proposed in the work of Somsap and Seresangtakul [32] to overcome word ambiguity and out-of-vocabulary words in Isarn Dharma word segmentation; the sum of probability values over possible word sequences was used to determine the best-segmented sentence. In the work of Ghodsi and DeNero [9], some key structural properties of language were studied, and it was concluded that RNNs have a structural capacity largely missing from n-gram models. n-Gram and statistical features were used to identify fake news spreaders on Twitter in the work of Buda and Bolonyai [4]. n-Gram statistics and location-related dictionaries were used to extract location names from target text streams [3]. In the work of Marie et al. [22], an unsupervised neural and statistical machine translation system was implemented. In the work of Varjokallio and Klakow [35], unsupervised morph segmentation and vocabulary expansion were performed by combining a trigram language model with an RNN.

In summary, under the trend of semi-supervision and knowledge enhancement in sequence labeling, we can conclude that the information extracted from a large corpus by an SLM is still useful. The combination of an SLM and a neural network can improve the performance of sequence labeling tasks, especially in low-resource settings. The experimental results of three Vietnamese sequence labeling tasks show that our pre-trained framework can effectively improve sequence labeling performance.


3 SLM FOR THE PRE-TRAINED SEQUENCE LABELING FRAMEWORK

The architecture of our proposed pre-trained sequence labeling framework is shown in Figure 2. The framework is composed of SLM and an upper machine learning model. For a sentence , its label sequence is , where xi is the ith word in x and yi is the label of xi. Before using to train a sequence labeling model (e.g., SVM, CRF, or BiLSTM+CRF), we input x into SLM and get a preliminary division, for instance, . This preliminary division marks the boundaries of the basic semantic and grammatical units (WSUs) in x. For a WSU , we use the BIES schema to mark the position of every xj in wi. More specifically, if xj is at the beginning or at the end of wi, we label it with B or E, respectively. If xj is in the middle of wi, we label it with I. If xj forms a single-word WSU wi, we label it with S. When we obtain the output of SLM, we convert the position symbols (a tag sequence made up of B, I, E, and S) into position encodings and splice them with the corresponding word embeddings to form the final embeddings. The final embeddings are input into the upper machine learning model.
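The BIES conversion described above can be sketched as follows. The WSU division here is a hypothetical SLM output; the function itself only encodes the labeling rule stated in the text.

```python
def bies_tags(wsu_division):
    """Convert a WSU division (a list of WSUs, each a list of words)
    into a BIES tag sequence, one tag per word."""
    tags = []
    for wsu in wsu_division:
        if len(wsu) == 1:
            tags.append("S")                   # single-word WSU
        else:
            tags.append("B")                   # first word of a multi-word WSU
            tags.extend("I" * (len(wsu) - 2))  # interior words, if any
            tags.append("E")                   # last word
    return tags

# Example division: [[w1, w2], [w3], [w4, w5, w6]]
division = [["thời", "gian"], ["và"], ["khoa", "học", "máy"]]
print(bies_tags(division))  # ['B', 'E', 'S', 'B', 'I', 'E']
```

The resulting tags are then mapped to position encodings and concatenated with the word embeddings before entering the upper model.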


Fig. 2. The pre-trained sequence labeling framework consisting of an unsupervised SLM and the upper machine learning model.


4 THE PROPOSED UNSUPERVISED SLM

In this section, we define the computable WSU and derive two theorems and one corollary from the definition. Based on the derived theorems, we establish an optimization model with the objective of maximizing the total binding force of all candidate WSUs in sentences. Finally a recursive divide-and-conquer algorithm is designed to solve SLM.

4.1 The Definition of WSU and the Derived Theorems

In ISO 24614-1:2010, a WSU is defined as a sequence of morphemes or other characters that is treated as a unit, such as "take care of," "blackboard," and "apple pie" in English, or "thời gian" and in Vietnamese. In the Chinese word segmentation specification GB/T 13715-92, a WSU is defined as a basic unit with a definite semantic or grammatical function, together with other qualified words and phrases. From the atomic character of these definitions, it can be inferred that a WSU is a morpheme (word) string that combines tightly and is used stably.

When annotated corpora and vocabularies are available, WSUs are defined by the annotated corpora, vocabularies, and word segmentation specifications [13]. Without any annotated corpora, vocabularies, or word segmentation specifications, we define the WSU according to its statistical characteristics.

Definition 4.1 (WSU).

Let Ω be the word set of an analytic language and Wi be a word, , . Let . Let be the probability of Wi and be the probability of , , where L is the maximum WSU length under consideration. If , then is called a WSU.

Definition 4.1 is equivalent to the following description. Let be a k-gram word set of the corpus. Suppose that when are independent the number of in is , and when are not independent the number of in is . If , then in is a WSU.
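Definition 4.1 reduces to comparing a k-gram's joint probability with the product of its parts' unigram probabilities. A minimal sketch, in which the probability tables and their values are hypothetical toy data:

```python
from math import prod

def is_wsu(words, p_joint, p_word):
    """Definition 4.1: (W1, ..., Wk) is a WSU iff its joint probability
    exceeds the product of its parts' probabilities, i.e., the words
    co-occur more often than independence would predict.
    p_joint maps word tuples to joint probabilities; p_word maps single
    words to unigram probabilities (both assumed pre-computed)."""
    return p_joint.get(tuple(words), 0.0) > prod(p_word[w] for w in words)

# Toy corpus statistics, for illustration only:
p_word = {"take": 0.01, "care": 0.005, "of": 0.03}
p_joint = {("take", "care"): 0.002}  # far above 0.01 * 0.005 = 5e-5
print(is_wsu(["take", "care"], p_joint, p_word))  # True
```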

Theorem 4.2 (The Probability Relation of Bigram WSU).

Let WA, . Then the necessary and sufficient condition for to be a bigram WSU is .

Proof.

Let us prove the necessity first. In other words, if is a bigram WSU, then .

Suppose there is a bigram word set 3, and WA and WB are independent in . Let , which can be calculated by Equation (10). Now we use m bigram word s to randomly replace m elements of . This is shown in Figure 3. After the replacement, we get another bigram word set . In , WA and WB are not independent and is a bigram WSU (because there are more after the replacement than before). The probability of in is calculated by

(1)
In Equation (1), is the number of in . It is calculated by (also shown in Figure 3)
(2)
In Equation (2), represents the probability of in . Because WA and WB are independent in , we get
and
(3)
By combining Equation (1), Equation (2), and Equation (3), we can obtain
Since and , we have . As described in Figure 3, we consider that and . So we obtain .


Fig. 3. The relationship between and . The operator represents using m bigram word to randomly replace m elements of . and represent the probability of WA and WB in the corpus before the replacement. and represent the probability of WA and WB in the corpus after the replacement. Since the corpus is large enough, randomly replacing m elements in with m s does not change the probability of WA and WB. So we consider that and .

Next, let us prove the sufficiency. In other words, if , then is a bigram WSU.

Definition 4.1 clearly makes the sufficiency true.□

Theorem 4.3 (The Probability Relation of Trigram WSU).

Let WA, WB, and , be a bigram word. The necessary and sufficient condition for to be a trigram WSU is .

Proof.

The proof of Theorem 4.3 is the same as the proof of Theorem 4.2. Let us prove the necessity first. In other words, if is a trigram WSU, then .

Suppose there is a trigram word set , and and WC are independent in . Let , which can be calculated by Equation (9). Now we use m trigram word s to randomly replace m elements of . After the replacement, we get another trigram word set . In , is a trigram WSU (because there are more after the replacement than before). The probability of in is calculated by

(4)
In Equation (4), is the number of in . It is calculated by (similar to Figure 3)
(5)
In Equation (5), represents the probability of in . Because and WC are independent in , we get
and
(6)
By combining Equation (4), Equation (5), and Equation (6), we can obtain
Since and , we have . As described in Figure 3, we consider that and . So we obtain .

Next, let us prove the sufficiency. In other words, if , then is a trigram WSU.

We treat as a unit, and Definition 4.1 clearly makes the sufficiency true.□

According to the proofs of Theorem 4.2 and Theorem 4.3, we can easily conclude that the necessary and sufficient condition for WA, WB, and WC to be a trigram WSU is , and for WA and to be a trigram WSU is . We can generalize the case to n-gram.

Corollary 4.4.

If is a k-gram WSU, then , where and , is an arbitrary end-to-end segment of —that is, , .

Corollary 4.4 is a generalization of Theorem 4.2 and Theorem 4.3. According to the proofs of the two theorems, if we treat si as a unit, Corollary 4.4 can also be proved. The function of Corollary 4.4 is to determine whether a k-gram word constitutes a WSU. It is used to construct one of SLM’s constraints.

4.2 SLM

In this section, we first define the binding force of k-gram word represented by , and describe how to construct k-gram co-occurrence word set that is used to estimate the size of k-gram word set of the corpus. Next, we define and construct SLM with the objective of maximizing the total binding force of all candidate WSUs in sentences.

Definition 4.5 (Binding Force).

For a word string , , its binding force is denoted by , which is defined as

(7)

The binding force indicates how tightly a word string combines. Generally, the more frequently a word string occurs in the corpus, the larger its binding force may be. Definition 4.5 generalizes the binding force defined in the work of Sproat and Shih [33], and we use 10 as the base of the logarithm in Equation (7).
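The exact form of Equation (7) is not recoverable from this copy. Since Definition 4.5 is stated as a base-10 generalization of Sproat and Shih's bigram mutual-information association measure, one plausible reading, which should be treated as an assumption, is the log-ratio of the joint probability to the product of the parts' probabilities:

```python
from math import log10, prod

def binding_force(words, p_joint, p_word):
    """Assumed reconstruction of Eq. (7): the base-10 log-ratio between a
    k-gram's joint probability and the product of its parts' unigram
    probabilities, generalizing Sproat and Shih's bigram mutual
    information. Positive values indicate tighter-than-chance binding."""
    return log10(p_joint[tuple(words)] / prod(p_word[w] for w in words))

# Toy statistics (hypothetical values):
p_word = {"take": 0.01, "care": 0.005}
p_joint = {("take", "care"): 0.002}
print(round(binding_force(["take", "care"], p_joint, p_word), 3))  # 1.602
```

Under this reading, the WSU condition of Definition 4.1 is exactly "binding force greater than zero."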

Since it is difficult to obtain all k-gram words of the corpus, it is almost impossible to estimate the exact size of the k-gram word set. As an alternative, we construct the k-gram co-occurrence word set of the corpus. The bigram and trigram co-occurrence word sets are constructed by intercepting two and three consecutive words, respectively, from sentences. For a Vietnamese sentence , its derived bigram co-occurrence word set is , and its derived trigram co-occurrence word set is . The k-gram co-occurrence word set is constructed in the same way, although k cannot exceed the sentence length. For languages whose words are naturally separated by spaces, such as Vietnamese and English, co-occurrence words are counted in units of words. In languages such as Chinese and Thai, there are no natural separators between syllables, and co-occurrence words are counted in units of characters; in other words, one character is considered one word.
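The sliding-window construction can be sketched as follows for a space-separated language; the sample sentence is illustrative only.

```python
from collections import Counter

def cooccurrence_set(sentences, k):
    """Build the k-gram co-occurrence multiset by sliding a window of k
    consecutive words over each sentence (k cannot exceed the sentence
    length, so short sentences simply contribute nothing)."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()  # word-separated languages (Vietnamese, English)
        for i in range(len(words) - k + 1):
            counts[tuple(words[i:i + k])] += 1
    return counts

print(cooccurrence_set(["học sinh học sinh học"], 2))
```

For Chinese or Thai, `sent.split()` would be replaced by iterating over characters.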

Definition 4.6 (The Probability of a k-Gram Word).

Let Sk be a k-gram co-occurrence word set, be a k-gram word, and . The probability of is defined as

(8)
where is the number of in Sk, and Nk is the number of all k-gram words in the corpus.

Since Nk cannot be accurately calculated, and the estimation of Nk is relatively complicated, we first derive the estimation formula of N3 and then that of Nk.

Figure 4 shows the distribution of trigram co-occurrence words in a sentence of length n, where Wj is a word. Suppose sj is a trigram word; then , , and are no longer trigram words. Therefore, only one-fifth of all trigram co-occurrence words are treated as trigram words. However, the blue marked , , sn, and are not real trigram co-occurrence words, so we need to add these four nonexistent trigram co-occurrence words before dividing by 5. Therefore, is estimated by

(9)  N3 ≈ (|S3| + 4M) / 5,
where |S3| and M represent the size of the trigram co-occurrence word set and the number of sentences in the corpus, respectively. Suppose there is only one sentence of length n in the corpus; we have .


Fig. 4. The distribution of trigram co-occurrence words in the sentence.

Figure 5 shows the distribution of k-gram co-occurrence words in a sentence of length n. According to the estimation of N3, if the red marked sj is a k-gram word, the black marked overlapping co-occurrence words can no longer be k-gram words; their number is 2k - 2. Therefore, only 1/(2k - 1) of all k-gram co-occurrence words are treated as k-gram words. Because the blue marked co-occurrence words are nonexistent, we need to add these co-occurrence words before dividing by 2k - 1. So Nk is estimated by

(10)  Nk ≈ (|Sk| + (2k - 2)M) / (2k - 1),
where |Sk| and M represent the size of the k-gram co-occurrence word set and the number of sentences in the corpus, respectively.
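Following the verbal description around Figures 4 and 5 (add the out-of-sentence window positions per sentence, then divide, which for trigrams reduces to "add 4, divide by 5"), the estimator can be sketched as below. The general form for arbitrary k is our reading of the elided formula and should be treated as an assumption.

```python
def estimate_Nk(size_Sk, num_sentences, k):
    """Assumed reconstruction of Eq. (10): if one position holds a k-gram
    word, the 2k-2 overlapping positions around it cannot, so only
    1/(2k-1) of positions count as k-gram words; the 2k-2 window
    positions that fall off each sentence's ends are added back first."""
    return (size_Sk + (2 * k - 2) * num_sentences) / (2 * k - 1)

# Trigram special case, Eq. (9): add 4 per sentence, divide by 5.
print(estimate_Nk(size_Sk=100, num_sentences=5, k=3))  # (100 + 4*5)/5 = 24.0
```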


Fig. 5. The distribution of k-gram co-occurrence words in the sentence.

Usually, WSUs of several different sizes occur in sentences. How to determine the boundaries of these WSUs is the key problem to be solved by SLM. The divisions of candidate WSUs in a sentence affect each other, so both locality and globality should be taken into account. In the feasible solution space of a sentence, we consider the division with the maximum total binding force as the optimal WSU division, that is,

(11)

In Equation (11), represents the feasible solution space of the sentence. represents a feasible WSU division. represents the number of candidate WSUs in . and , n is the length of the sentence. ni is the number of co-occurrence word in the corpus. So we establish the following optimization model:

(12)
The constraint is explained by Corollary 4.4, and is an arbitrary end-to-end segment of . The concept of is the same as the concept of si in Corollary 4.4. The interpretation of is shown in Figure 6.


Fig. 6. The relationship of binding force between two adjacent WSUs.


Fig. 7. The maximum length WSU in the sentence.

In Figure 6, if and are adjacent WSUs, the binding force of and Wj (denoted by ) must be smaller than the minimum of and ; otherwise, and Wj would form another candidate WSU. We consider only this simple situation; the actual situation is more complicated. and are the internal binding forces of and . We agree that .

4.3 The Recursive Divide-and-Conquer Algorithm for Solving SLM

Solving SLM is equivalent to inserting several separators into the input sentence. Word strings between adjacent separators are treated as candidate WSUs. The optimal WSU division of the input sentence is the one whose total binding force over all candidate WSUs is largest. Verifying the optimal WSU division of a sentence is as difficult as finding it, so this optimization problem is NP-hard. We could solve SLM by enumeration, whose time complexity is , but this is prohibitive, so we need a more efficient method. In this section, we prove the additivity of the optimal WSU division and design a recursive divide-and-conquer algorithm for solving SLM based on the dynamic programming principle.

Theorem 4.7 (The Additivity of the Optimal WSU Division).

Suppose S is a sentence of length n, . S’s optimal WSU division is . If we divide U into and , then and are the optimal WSU division of and . and .

Proof.

This theorem can be easily proved by contradiction. Suppose is not the optimal WSU division of , and the optimal WSU division of is . Then is the optimal WSU division of S, where indicates concatenation. This contradicts the premise that the optimal WSU division of S is U.□

Generally, there are very few long words in the corpus, so we only consider WSUs with . In this article, we set . For a word string of length , there must be at least one WSU boundary in it, but we do not know where. In the actual calculation, we assume in turn that each candidate position is a WSU boundary. Suppose the candidate position j is a WSU boundary; then j divides sentence S into two sub-sentences and . According to Theorem 4.7, we only need to find the optimal WSU divisions of and and concatenate them to obtain the optimal WSU division of S. We therefore design a recursive divide-and-conquer algorithm to solve SLM (see Algorithm 1).

In Algorithm 1, function gets the length of the input. Function calls Algorithm 1 recursively. Function getOptimalWSUDivisonIn_Set(D) selects the optimal WSU division from set D by comparing the total binding force of each WSU division; this is easy to achieve by enumeration. Function getSubSentenceAt(i, sentence, MaxWsuLen) gets the left and right sub-sentences separated by position i. It is implemented by Algorithm 2.
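Algorithm 1 itself is not reproduced in this copy, but by Theorem 4.7 an optimal division of a prefix extends an optimal division of a shorter prefix by one final WSU of bounded length. A memoized sketch in that spirit (the binding-force callable and toy scores are hypothetical, and the constraint of Corollary 4.4 is omitted for brevity):

```python
from functools import lru_cache

def optimal_division(words, force, max_len=8):
    """Divide-and-conquer/dynamic-programming sketch of Algorithm 1's
    idea: best(j) is the optimal division of words[:j], built from best(j-l)
    plus one final WSU of length l <= max_len (Theorem 4.7).
    `force` is a callable returning a candidate WSU's binding force."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def best(j):                      # returns (total score, division) of words[:j]
        if j == 0:
            return 0.0, ()
        candidates = []
        for l in range(1, min(max_len, j) + 1):
            score, div = best(j - l)
            wsu = words[j - l:j]
            candidates.append((score + force(wsu), div + (wsu,)))
        return max(candidates)        # maximize total binding force

    return list(best(len(words))[1])

# Toy binding force: reward one known pair, penalize other strings by length.
toy = lambda wsu: 2.0 if wsu == ("khoa", "học") else -0.1 * len(wsu) ** 2
print(optimal_division(["tôi", "thích", "khoa", "học"], toy))
# [('tôi',), ('thích',), ('khoa', 'học')]
```

Because each of the n prefixes tries at most max_len last-WSU lengths and results are memoized, this explores far fewer states than enumerating all divisions.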

Figure 8 (the abscissa represents x in , and the ordinate represents the sentence length) shows the compression effect of Algorithm 1 on the search space. The growth rate of Algorithm 1's search space is much smaller than that of enumeration, and the shorter the maximum WSU length, the smaller the growth rate of Algorithm 1's search space. For example, when and the sentence length is 60, 80, and 100, the search spaces of Algorithm 1 are , , and , respectively, whereas the search spaces of enumeration are , , and , respectively.
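The specific search-space figures are elided from this copy, but the growth-rate gap can be illustrated by counting divisions: an n-word sentence has 2^(n-1) unconstrained divisions, whereas bounding the WSU length at L restricts them to a generalized Fibonacci count. This is our own illustration, not the paper's exact search-space measure.

```python
def count_divisions(n, max_len):
    """Number of ways to split an n-word sentence into WSUs of length
    <= max_len: f(n) = f(n-1) + ... + f(n-max_len), with f(0) = 1.
    With max_len >= n this equals 2**(n-1), the unconstrained count."""
    f = [0] * (n + 1)
    f[0] = 1
    for i in range(1, n + 1):
        f[i] = sum(f[i - l] for l in range(1, min(max_len, i) + 1))
    return f[n]

n = 20
print(count_divisions(n, 8), "length-bounded vs", 2 ** (n - 1), "unconstrained")
```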


Fig. 8. The compression effect of Algorithm 1 on search space.


5 EXPERIMENT

In this section, we first introduce the storage of the k-gram co-occurrence word set. Next, we show the performance improvement of SLM on three Vietnamese sequence labeling tasks: word segmentation, POS tagging, and NER.

5.1 The Storage of Co-Occurrence Words

The co-occurrence word matrix is extremely sparse. From the 863-MB Vietnamese raw text downloaded from Wikidata, storing the bigram co-occurrence word matrix (considering only the first 10,000 words sorted by frequency in descending order) in .txt format costs 195 MB. However, storing the trigram co-occurrence word matrix would require 1.95 TB, so it is impossible to store all k-gram co-occurrence word matrices in this manner. In the bigram co-occurrence word matrix, about of the elements do not appear in the corpus, and the higher the order of the co-occurrence word matrix, the sparser it is. Considering the various technologies currently used to store sparse matrices, we believe that storing all the k-gram co-occurrence word matrices in a database is a good choice. A database makes co-occurrence word matrix management and probability calculation more convenient and efficient, especially when the order of the co-occurrence word matrices is high.
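A minimal sketch of such database storage: one table per k, with k word-index columns plus a count column, and only co-occurrence words that appear in the corpus stored. The table and column names and the toy rows are our own choices; the paper does not give its schema.

```python
import sqlite3

# In-memory database standing in for the on-disk store of sparse matrices.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cooc3 (w0 INTEGER, w1 INTEGER, w2 INTEGER, "
             "coe INTEGER, PRIMARY KEY (w0, w1, w2))")
conn.executemany("INSERT INTO cooc3 VALUES (?, ?, ?, ?)",
                 [(3, 1, 4, 26), (3, 1, 5, 9)])  # hypothetical rows

def coe(conn, key):
    """Count of a trigram co-occurrence word; absent rows count as 0,
    which is what makes the sparse representation work."""
    row = conn.execute("SELECT coe FROM cooc3 WHERE w0=? AND w1=? AND w2=?",
                       key).fetchone()
    return row[0] if row else 0

print(coe(conn, (3, 1, 4)), coe(conn, (0, 0, 0)))  # 26 0
```

Dividing such a count by the Nk estimate of Equation (10) yields the k-gram probability of Equation (8).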

Table 1 shows part of an 8-gram co-occurrence word matrix. The first eight columns of the table store the word indexes. The last column, coe, stores the number of occurrences of the corresponding 8-gram co-occurrence word in the corpus. Co-occurrence words that do not appear in the corpus are not stored.

W0    W1    W2    W3    W4    W5    W6    W7    coe
014239190196101662
014282011128513357
0142391901964053
01428201671296946
01491901961014844
0142889309229237
01428122851334330
...........................

Table 1. Parts of an 8-Gram Co-Occurrence Word Set

We consider only the first 10,000 Vietnamese words sorted by frequency in descending order and build 1- to 8-gram co-occurrence word matrices from the 863-MB raw text downloaded from Wikidata. The size, number of sentences, and Nk for each gram are shown in Table 2. Nk is calculated by Equation (10).

           Unary       Bigram      Trigram     4-Gram      5-Gram     6-Gram     7-Gram     8-Gram
Size       75,017,451  69,357,253  63,697,055  58,036,857  52,376,659 46,716,461 41,056,263 35,396,065
Sentences              14,169,993  11,980,799  10,311,927  9,005,608  7,841,471  6,827,199  5,660,224
Nk         75,017,451  26,892,549  17,267,569  13,142,577  10,850,915 9,392,585  8,382,972  7,642,589

Table 2. The Size, Number of Sentences, and Nk of a k-Gram Co-Occurrence Word Set

5.2 Vietnamese Word Segmentation Experiment

5.2.1 Dataset.

We use the Vnwordseg and Conll2017 datasets for the Vietnamese word segmentation experiment. Details of the two datasets are shown in Table 3. All sentences are pre-processed to be lowercase.

            Sentences              Tokens
            Train Set   Test Set   Train Set   Test Set
Vnwordseg   6,245       1,560      164k        42k
Conll2017   1,400       800        26k         15k

Table 3. Detail of Vietnamese Word Segmentation Datasets

5.2.2 Baselines and Parameter Settings.

We choose the most popular deep learning models and official Vietnamese word segmentation tools as baselines for this test. The most popular deep learning models for sequence labeling include BERT [8], LSTM+CRF [11], and BiLSTM+CRF and its variants [7, 14, 21].

VnTokenizer [15] is the official word segmentation tool of the Vietnam Language and Speech Processing project [36].

VnCoreNLP is a fast and accurate Vietnamese word segmenter that obtains state-of-the-art results [26, 37].

The parameter settings of these baselines are shown in Table 4. We use Word2Vec [24] and GloVe [28] to pre-train Vietnamese word embeddings.

Model          Pre-Trained      LSTM Hidden  Word's Character    Word's Character        Epoch  Dropout
               Word Embedding   Units        Sequence Embedding  Sequence Conv. Filters
LSTM+CRF       100              100          No                  No                      50     0.3
BiLSTM+CRF     100              50           No                  No                      50     0.3
Lample. [14]   100              50           15                  No                      50     0.3
Ma. [21]       100              50           No                  30                      50     0.3
Chen. [7]      100              150          No                  No                      50     0.3
BERT [8]       768              200          No                  No                      50     0.3

Table 4. Parameter Settings of Vietnamese Word Segmentation Baselines

5.2.3 Vietnamese Word Segmentation Experimental Results.

As shown in Figure 2, we first use SLM to determine the position symbol of each token in the sentence and convert the position symbols into one-hot position encodings. Next, we concatenate each token's position encoding with its word embedding to form the final input of the upper model. By comparing results with and without these position encodings added to the input of the upper word segmentation model, we can evaluate the performance improvement contributed by SLM.
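The splicing step can be sketched in pure Python; the embedding dimensions below are toy values (the real embeddings are 100- or 768-dimensional per Table 4).

```python
def final_embeddings(word_embeddings, bies_tags):
    """Concatenate each token's word embedding with a one-hot encoding of
    its SLM position symbol (B/I/E/S), as in the "+SLM" configurations."""
    index = {"B": 0, "I": 1, "E": 2, "S": 3}
    out = []
    for emb, tag in zip(word_embeddings, bies_tags):
        onehot = [0.0] * 4
        onehot[index[tag]] = 1.0
        out.append(list(emb) + onehot)  # splice position encoding onto embedding
    return out

emb = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy 2-dim word embeddings
print(final_embeddings(emb, ["B", "E", "S"]))
```

Each output vector has the embedding dimension plus 4, so the upper model needs no other change.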

Table 5 shows the cross-dataset experimental results of word segmentation models trained on the Conll2017 dataset. "+SLM" means adding position encodings to the input of the corresponding model. "(GloVe)" and "(Word2Vec)" indicate that word embeddings are pre-trained by GloVe [28] and Word2Vec [24], respectively. "Vnwordseg Test" means testing on the Vnwordseg test set. It can be seen from Table 5 that "+SLM" improves word segmentation performance. The train set of Conll2017 has only 1,400 sentences. The best F value we obtain in the cross-dataset test on Vnwordseg is 0.8833, which is 0.0113 higher than the best F value of VnTokenizer. The test result of VnTokenizer on the Conll2017 test set is . We rerun VnCoreNLP on the Conll2017 test set and get a good F value (). However, the train sets of VnTokenizer and VnCoreNLP are almost the same [26]. We therefore regard testing VnTokenizer and VnCoreNLP on Vnwordseg as a cross-dataset test.

                      Vnwordseg Test (GloVe)            Vnwordseg Test (Word2Vec)
                      P      R      F      ΔF           P      R      F      ΔF
LSTM_CRF              0.7811 0.8465 0.8125              0.7776 0.8455 0.8102
LSTM_CRF+SLM          0.8236 0.874  0.848  +0.0356      0.8197 0.868  0.8432 +0.033
biLSTM+CRF            0.79   0.866  0.8263              0.7872 0.8544 0.8194
biLSTM+CRF+SLM        0.8285 0.8797 0.8533 +0.027       0.8243 0.8728 0.8479 +0.0285
Chen. [7]             0.7823 0.8316 0.8062              0.7828 0.8289 0.8052
Chen. [7]+SLM         0.802  0.8464 0.8236 +0.0174      0.8017 0.8451 0.8228 +0.0176
Lample. [14]          0.7913 0.8551 0.822               0.7924 0.8548 0.8224
Lample. [14]+SLM      0.831  0.8514 0.8411 +0.019       0.8219 0.8734 0.8469 +0.0245
Ma. [21]              0.785  0.8533 0.8177              0.7848 0.8504 0.8163
Ma. [21]+SLM          0.8237 0.8739 0.8481 +0.0304      0.8256 0.8734 0.8488 +0.0325
BERT [8]              0.8386 0.8723 0.8551
BERT [8]+SLM          0.8752 0.8915 0.8833 +0.0282
VnTokenizer [15, 36]  0.928  0.822  0.872

Table 5. Vietnamese Word Segmentation Experimental Results on the Conll2017 Dataset

Table 6 shows the cross-dataset experimental results of word segmentation models trained on the Vnwordseg dataset. We can draw the same conclusion from Table 6 as from Table 5. The Vnwordseg dataset is also small, containing 6,245 training sentences. The best F value we obtain in the cross-dataset test on Conll2017 is 0.9061, which is 0.0341 and 0.0108 higher than the F values of VnTokenizer and VnCoreNLP, respectively (we rerun VnCoreNLP on Vnwordseg, and the cross-dataset F value we get is 0.8953). However, VnTokenizer's train set includes a Vietnamese dictionary and a 70,000-sentence Vietnamese treebank [36], and VnCoreNLP was trained on a 75,000-sentence Vietnamese treebank. When we conduct the experiment on the Vnwordseg dataset, the hidden units of LSTM and BiLSTM are 150 and 75, the word's character embedding has 40 dimensions, and the word's character sequence convolution filters are 40. Other parameters are as shown in Table 4.

                      Conll2017 Test (GloVe)            Conll2017 Test (Word2Vec)
                      P      R      F      ΔF           P      R      F      ΔF
LSTM_CRF              0.84   0.8465 0.8432              0.8304 0.8386 0.8344
LSTM_CRF+SLM          0.8628 0.8579 0.86   +0.017       0.8695 0.8461 0.8576 +0.0232
biLSTM+CRF            0.857  0.8378 0.8473              0.8583 0.831  0.8445
biLSTM+CRF+SLM        0.869  0.8536 0.8612 +0.014       0.8751 0.8418 0.8581 +0.0136
Chen. [7]             0.8491 0.8325 0.8407              0.8547 0.8238 0.839
Chen. [7]+SLM         0.8593 0.8464 0.8528 +0.012       0.8656 0.8248 0.8447 +0.0057
Lample. [14]          0.837  0.8255 0.8312              0.8393 0.8217 0.8393
Lample. [14]+SLM      0.876  0.849  0.8623 +0.031       0.8753 0.8456 0.8602 +0.0209
Ma. [21]              0.86   0.837  0.8483              0.8584 0.8300 0.844
Ma. [21]+SLM          0.8637 0.8584 0.861  +0.0127      0.8688 0.8546 0.8616 +0.0176
BERT [8]              0.8842 0.8583 0.871
BERT [8]+SLM          0.913  0.8994 0.9061 +0.0351

Table 6. Vietnamese Word Segmentation Experimental Results on the Vnwordseg Dataset

GloVe [28] and Word2Vec [24] are both static word embedding models, and the experimental results show no obvious difference in performance between them. Unsurprisingly, BERT performs better than all of the other models. In the first cross-dataset test, BERT+SLM surpasses the performance of VnTokenizer using a training set of only 1,400 sentences (only 2% of VnTokenizer's train set). In the second cross-dataset test, BERT+SLM uses less than 10% of the training data of VnCoreNLP (and no dictionary), and its best F value (0.9061) exceeds the state-of-the-art VnCoreNLP's F value (0.8953) by 0.0108.

From the word segmentation experimental results, we can draw a conclusion that the pre-divided WSUs output by SLM can deliver useful information to word segmentation models and improve their performance.

5.3 Vietnamese POS Labeling Experiment

5.3.1 Dataset.

We use Conll2017 for the Vietnamese POS labeling experiment. Its label distribution is shown in Table 7. All sentences are pre-processed to be lowercase.

Table 7. POS Labels' Distribution of the Conll2017 Vietnamese Dataset

| | NOUN | ADP | X | VERB | ADJ | PUNCT | SCONJ |
|---|---|---|---|---|---|---|---|
| Train Set | 6,495 | 1,189 | 1,595 | 4,123 | 1,104 | 2,960 | 245 |
| Test Set | 3,838 | 688 | 970 | 2,178 | 738 | 1,722 | 122 |

| | NUM | DET | CCONJ | PROPN | AUX | PART | INTJ |
|---|---|---|---|---|---|---|---|
| Train Set | 617 | 310 | 530 | 823 | 182 | 109 | 3 |
| Test Set | 412 | 232 | 335 | 494 | 132 | 87 | 7 |

5.3.2 Baselines and Parameter Settings.

We also choose BERT, LSTM+CRF, BiLSTM+CRF and their variants, as well as the state-of-the-art Vietnamese POS tagging and NER system PhoNLP [27], as baselines. These baselines' parameter settings are also shown in Table 4.

5.3.3 Vietnamese POS Labeling Experimental Results.

The Vietnamese POS labeling experimental results are shown in Table 8. “Model-A+SLM” means adding the position encodings output by SLM to the input of Model-A. We choose five popular sequence labeling models for this test. The results show that SLM improves the performance of all of the existing sequence labeling models tested; the minimum absolute improvement in F value is 0.008. Models [14] and [21] both extract the features of the character sequence within a word: the former uses a BiLSTM, and the latter uses a convolution network. BERT [8]+SLM has the best performance, and LSTM_CRF the worst. Ma [21]+SLM gains the least from SLM, and LSTM_CRF+SLM the most. The accuracies of BERT+SLM and PhoNLP are 82.33% and 93.88% (the latter on the Vietnam Language and Speech Processing 2013 POS tagging dataset) [27], respectively; however, PhoNLP uses joint training, and its training set includes 30,000 sentences, whereas the training set of BERT+SLM contains only 1,400 sentences.

Table 8. POS Labeling Experiment Results on the Conll2017 Vietnamese Dataset (left block: embeddings trained by GloVe; right block: embeddings trained by Word2Vec)

| Model | P | R | F | ΔF | P | R | F | ΔF |
|---|---|---|---|---|---|---|---|---|
| LSTM_CRF | 0.7362 | 0.7693 | 0.7524 | | 0.7321 | 0.7663 | 0.7488 | |
| LSTM_CRF+SLM | 0.7805 | 0.8 | 0.7901 | +0.038 | 0.772 | 0.794 | 0.7838 | +0.035 |
| biLSTM+CRF | 0.76 | 0.799 | 0.779 | | 0.7653 | 0.7926 | 0.7787 | |
| biLSTM+CRF+SLM | 0.806 | 0.8074 | 0.8067 | +0.028 | 0.7904 | 0.8084 | 0.7993 | +0.021 |
| Lample [14] | 0.759 | 0.783 | 0.7708 | | 0.7707 | 0.798 | 0.7841 | |
| Lample [14]+SLM | 0.798 | 0.7969 | 0.7974 | +0.027 | 0.7945 | 0.8109 | 0.8026 | +0.019 |
| Ma [21] | 0.783 | 0.815 | 0.7987 | | 0.7867 | 0.8045 | 0.7955 | |
| Ma [21]+SLM | 0.7975 | 0.816 | 0.8066 | +0.008 | 0.7955 | 0.8116 | 0.8035 | +0.008 |
| BERT [8] | 0.7923 | 0.8045 | 0.7984 | | | | | |
| BERT [8]+SLM | 0.8084 | 0.8172 | 0.8128 | +0.0144 | | | | |

From the POS labeling experiment, we can draw a conclusion that the pre-divided WSUs can also deliver useful information to the POS labeling models and improve their performance. The performance of BERT is better than all of the other models.

5.4 Vietnamese NER Experiment

5.4.1 Dataset.

Since there is no publicly available Vietnamese NER dataset, we hired two Vietnamese native speakers to annotate a small Vietnamese NER dataset.10 Its details are shown in Table 9. PER, LOC, and ORG stand for the tags of people, places, and organizations, respectively. We use the BIES labeling schema described in Section 3 to annotate the dataset. All sentences are pre-processed to be lowercase.
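As a minimal sketch of the BIES schema described in Section 3 (S marks a single-token segment; B/I/E mark the begin/inside/end tokens of a multi-token one; in the actual dataset each tag additionally carries an entity type, e.g., B-PER), a hypothetical helper mapping segment lengths to tags could look like this:

```python
def bies_tags(segment_lengths):
    """Map a sequence of segment lengths to BIES tags:
    a length-1 segment gets S; longer segments get B, I..., E."""
    tags = []
    for n in segment_lengths:
        if n == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (n - 2) + ["E"])
    return tags

tags = bies_tags([1, 3, 2])  # ['S', 'B', 'I', 'E', 'B', 'E']
```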

Table 9. Detail of Our Vietnamese NER Dataset

| | Sentences | Tokens | PER | LOC | ORG |
|---|---|---|---|---|---|
| Train Set | 1,060 | 39k | 845 | 834 | 834 |
| Test Set | 441 | 17k | 351 | 372 | 378 |

5.4.2 Baselines and Parameter Settings.

Baselines and their parameter settings are shown in Table 10. “Model-A+SLM” means adding the position encodings output by SLM to the input of Model-A. In “Model-A+SLM,” since the dimension of the input vector is increased by 4, the numbers of hidden units of BiLSTM and LSTM are increased to 54 and 104, respectively.

Table 10. Parameter Settings of Vietnamese NER Baselines

| Model | Pre-Trained Word Embedding | LSTM Hidden Units | Word's Character Sequence Embedding | Word's Character Sequence Convolution Filters | Epoch | Dropout |
|---|---|---|---|---|---|---|
| LSTM+CRF | 100 | 100 | No | No | 150 | 0.3 |
| LSTM+CRF+SLM | 100 | 104 | No | No | 150 | 0.3 |
| BiLSTM+CRF | 100 | 50 | No | No | 150 | 0.3 |
| BiLSTM+CRF+SLM | 100 | 54 | No | No | 150 | 0.3 |
| Lample [14] | 100 | 50 | 15 | No | 150 | 0.3 |
| Lample [14]+SLM | 100 | 54 | 15 | No | 150 | 0.3 |
| Ma [21] | 100 | 50 | No | 30 | 150 | 0.3 |
| Ma [21]+SLM | 100 | 54 | No | 30 | 150 | 0.3 |
| BERT [8] | 768 | 204 | No | No | 100 | 0.3 |
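The 4-unit increase in input dimension in the “+SLM” rows is consistent with concatenating a one-hot encoding of the BIES position tag that SLM assigns each token to its word embedding. The sketch below illustrates that reading; the one-hot scheme and the `augment_embeddings` helper are our assumptions, not the authors' code:

```python
import numpy as np

BIES = {"B": 0, "I": 1, "E": 2, "S": 3}  # assumed tag-to-index mapping

def augment_embeddings(word_embs, slm_tags):
    """Concatenate a 4-dim one-hot BIES position encoding to each word
    embedding, so a 100-dim input becomes the 104-dim input of Table 10."""
    one_hot = np.eye(4)[[BIES[t] for t in slm_tags]]
    return np.concatenate([word_embs, one_hot], axis=1)

x = augment_embeddings(np.zeros((3, 100)), ["B", "E", "S"])  # x.shape == (3, 104)
```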

5.4.3 Vietnamese NER Experimental Results.

Table 11 shows the Vietnamese NER experimental results. P, R, F, and ΔF in the “All Tags” columns represent the statistics computed over all tags. F_PER, F_LOC, and F_ORG represent the statistics computed considering only the PER, LOC, and ORG tags, respectively. The subscripts WV and GLO indicate using Word2Vec [24] and GloVe [28] to pre-train the Vietnamese word embeddings.

Table 11. Vietnamese NER Experimental Results (All Tags: P, R, F, ΔF; then per-tag F with its ΔF)

| Model | P | R | F | ΔF | F_PER | ΔF | F_LOC | ΔF | F_ORG | ΔF |
|---|---|---|---|---|---|---|---|---|---|---|
| LSTM+CRF_WV | 0.9199 | 0.9521 | 0.9357 | | 0.7204 | | 0.7355 | | 0.6071 | |
| LSTM+CRF_WV+SLM | 0.9252 | 0.9514 | 0.9381 | +0.002 | 0.7436 | +0.023 | 0.7637 | +0.028 | 0.6202 | +0.013 |
| LSTM+CRF_GLO | 0.918 | 0.9492 | 0.933 | | 0.72 | | 0.7328 | | 0.606 | |
| LSTM+CRF_GLO+SLM | 0.9308 | 0.9615 | 0.946 | +0.012 | 0.7511 | +0.031 | 0.766 | +0.033 | 0.6289 | +0.023 |
| biLSTM+CRF_WV | 0.9387 | 0.9587 | 0.9486 | | 0.7872 | | 0.7643 | | 0.649 | |
| biLSTM+CRF_WV+SLM | 0.9419 | 0.9627 | 0.9522 | +0.004 | 0.806 | +0.019 | 0.7898 | +0.026 | 0.6985 | +0.05 |
| biLSTM+CRF_GLO | 0.9313 | 0.9594 | 0.945 | | 0.7851 | | 0.762 | | 0.648 | |
| biLSTM+CRF_GLO+SLM | 0.945 | 0.9691 | 0.9569 | +0.012 | 0.81 | +0.025 | 0.806 | +0.044 | 0.689 | +0.041 |
| Lample [14]_WV | 0.9373 | 0.961 | 0.949 | | 0.757 | | 0.7746 | | 0.6496 | |
| Lample [14]_WV+SLM | 0.9435 | 0.9641 | 0.9537 | +0.005 | 0.7976 | +0.041 | 0.8068 | +0.032 | 0.6733 | +0.024 |
| Lample [14]_GLO | 0.9387 | 0.9561 | 0.947 | | 0.753 | | 0.779 | | 0.631 | |
| Lample [14]_GLO+SLM | 0.9489 | 0.9677 | 0.958 | +0.011 | 0.79 | +0.037 | 0.815 | +0.036 | 0.674 | +0.043 |
| Ma [21]_WV | 0.9369 | 0.9614 | 0.949 | | 0.779 | | 0.7667 | | 0.643 | |
| Ma [21]_WV+SLM | 0.9443 | 0.9641 | 0.9541 | +0.005 | 0.8157 | +0.037 | 0.7957 | +0.029 | 0.6889 | +0.046 |
| Ma [21]_GLO | 0.941 | 0.9606 | 0.9507 | | 0.78 | | 0.762 | | 0.659 | |
| Ma [21]_GLO+SLM | 0.946 | 0.9674 | 0.957 | +0.006 | 0.82 | +0.04 | 0.798 | +0.036 | 0.697 | +0.038 |
| BERT [8] | 0.952 | 0.973 | 0.962 | | 0.9264 | | 0.7867 | | 0.7197 | |
| BERT [8]+SLM | 0.9539 | 0.979 | 0.9663 | +0.004 | 0.9387 | +0.012 | 0.8012 | +0.015 | 0.745 | +0.025 |
| PhoNLP [27] | | | 0.9451 | | | | | | | |

From the NER experimental results, it can be seen that our SLM improves the performance of all of the popular deep learning models tested: the pre-divided WSUs in sentences also help the labeling of named entities, with absolute F improvements of up to 0.05. Overall, BiLSTM performs better than LSTM. Models [14] and [21] both extract the features of the character sequence within a word (one uses a convolution network, the other a BiLSTM), yet their performances are not much different. BERT+SLM performs best. The difference in NER performance between Word2Vec [24] and GloVe [28] is again not obvious. BERT+SLM and PhoNLP use their own test sets, so we can only roughly compare the two: the training sets of PhoNLP and BERT+SLM contain 14,861 and 1,060 sentences, respectively [27], so the former is trained on 14 times as many sentences as the latter.

5.5 The Control of Average WSU Length in the Optimal WSU Division

In Equation (12), if we change the objective function to

(13)

the average WSU length in the optimal WSU division will vary with b. The term involving b in Equation (13) is a penalty on the binding force: it reduces the differences between the candidate WSUs' binding forces. The influences of b on the average WSU length and on NER performance are shown in Figure 9. We use Model [21]+SLM to perform this experiment, and the epoch is set to 100.

Fig. 9.

Fig. 9. The influences of b on the Avg. Length in the optimal WSU division and NER performance.

In Figure 9, “Avg. Length” stands for the average WSU length in the optimal WSU division. F_all represents the overall F value, and F_PER, F_LOC, and F_ORG represent the F values of PER, LOC, and ORG, respectively. Figure 9 shows that as b increases, Avg. Length also increases, but F_all decreases slightly. Carefully comparing F_PER, F_LOC, and F_ORG under different values of Avg. Length, we find that the increase in Avg. Length slightly reduces NER performance.

In summary, we do not need to add the penalty b to SLM's objective function.

Skip 6CONCLUSION Section

6 CONCLUSION

We first defined the computable WSU and derived two theorems and one corollary at the beginning of this article. Next, we constructed the unsupervised SLM with the objective of maximizing the total binding force of all candidate WSUs in sentences. The solution of SLM is NP-hard, so we proved the additivity of the optimal WSU division and designed a recursive divide-and-conquer algorithm to solve SLM based on the dynamic programming principle. Our SLM is completely unsupervised, has no hyper-parameters, and treats all analytic languages uniformly, so it has good domain adaptability.
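The divide-and-conquer idea can be illustrated with a standard segmentation dynamic program. The `score` function below is only a toy stand-in for the binding force defined earlier in the article (the Vietnamese tokens and scores are invented for illustration):

```python
def best_division(tokens, score, max_len=4):
    """Find the division of `tokens` maximizing the total score of its
    segments, by dynamic programming over split points."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: max total score of tokens[:i]
    back = [0] * (n + 1)              # split point that achieves best[i]
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            s = best[j] + score(tuple(tokens[j:i]))
            if s > best[i]:
                best[i], back[i] = s, j
    # Recover the division by walking the back-pointers.
    division, i = [], n
    while i > 0:
        division.append(tokens[back[i]:i])
        i = back[i]
    return division[::-1]

# Toy score: single words get 1.0; one known cohesive bigram gets a bonus.
pairs = {("hà", "nội"): 2.5}
def score(w):
    return pairs.get(w, 1.0 if len(w) == 1 else 0.0)

division = best_division(["tôi", "yêu", "hà", "nội"], score)
# [['tôi'], ['yêu'], ['hà', 'nội']]
```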

Deep learning models have achieved great success in NLP; however, they require a large amount of high-quality annotated data. Extracting information from raw text to reduce deep learning models' dependency on annotated data and to improve their performance has recently been a research hotspot in NLP. In this article, we established the pre-trained sequence labeling framework shown in Figure 2. Based on the proposed framework, we integrated our SLM with several popular sequence labeling models and conducted three types of Vietnamese sequence labeling experiments. Whether in word segmentation, POS labeling, or NER tests, SLM efficiently improves the performance of the baselines. In particular, without using a dictionary and with only 10% of the training data, the word segmentation performance of BERT+SLM exceeds that of VnCoreNLP [37]. This shows that SLM can also effectively improve the performance of a dynamic word embedding model.

Our SLM can be improved in the following aspects to further promote the performance of sequence labeling tasks:

  • Use a vocabulary in SLM. A vocabulary includes dictionaries; names of persons, locations, and organizations; and proper nouns. First, we can use the vocabulary to determine the unambiguous WSU boundaries and thereby narrow the search space; then, we can use SLM to resolve the remaining ambiguities [15].

  • Add morphological information to the position encodings before they are concatenated with the word embeddings. Experiments in the work of Mayhew et al. [23] show that case information can improve the performance of NER.

  • Extract information from other knowledge bases and add it to word embedding [1, 10, 31].

  • Use regular expressions to pre-process the input. Experiments have shown that this may improve the performance of WSU division [5, 6, 7].
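As a sketch of the last point (the patterns and the helper below are illustrative; neither comes from the cited works), one could pre-merge URLs, dates, and numbers into indivisible spans before WSU division:

```python
import re

# Illustrative patterns; the cited works do not prescribe a specific set.
PATTERNS = [
    r"https?://\S+",               # URLs
    r"\d{1,2}/\d{1,2}/\d{2,4}",    # dates such as 13/12/2021
    r"\d+(?:[.,]\d+)*",            # numbers, possibly with separators
]
MERGE = re.compile("|".join(PATTERNS))

def protect_spans(sentence):
    """Return (start, end, text) character spans that should be treated
    as single, indivisible WSUs during segmentation."""
    return [(m.start(), m.end(), m.group()) for m in MERGE.finditer(sentence)]

spans = protect_spans("xem https://vlsp.org.vn/ ngày 13/12/2021")
# matches the URL and the date as whole spans
```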

Acknowledgments

We thank the anonymous reviewers for their insightful comments.

Footnotes

  1. For languages like English and Vietnamese, x_i represents a word, but for Chinese, x_i represents a Chinese character. Word and character are not distinguished in this article.

  2. The details of the tagging schema can be seen in the work of Xue [39].

  3. It is a virtual set, used to represent the bigram word space of the corpus; it only represents the relationship between two words. Some bigram words in it are bigram WSUs, and some are just random combinations of two words. The concept can be generalized to k-grams; it is different from the k-gram co-occurrence word set S_k, which can actually be constructed.

  4. The k-gram co-occurrence word matrix is a set of arbitrary ordered combinations of k words.

  5. http://www.jaist.ac.jp/~hieuxuan/vnwordseg/data/

  6. Train set: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983; Test set: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2184

  7. We add BiLSTM+CRF on top of BERT for sequence labeling tasks.

  8. https://vlsp.org.vn/

  9. https://vlsp.hpda.vn/demo/?page=resources

  10. https://github.com/Liaoxianwen/SLM_code_data

REFERENCES

  [1] Ali Muhammad Asif, Sun Yifang, Li Bing, and Wang Wei. 2020. Fine-grained named entity typing over distantly supervised data based on refined representations. In Proceedings of AAAI 2020, Vol. 34. 7391–7398. https://doi.org/10.1609/aaai.v34i05.6234
  [2] Bari M. Saiful, Joty Shafiq, and Jwalapuram Prathyusha. 2020. Zero-resource cross-lingual named entity recognition. In Proceedings of AAAI 2020, Vol. 34. 7415–7423. https://doi.org/10.1609/aaai.v34i05.6237
  [3] Buda Jakab and Bolonyai Flora. 2018. Location name extraction from targeted text streams using gazetteer-based statistical language models. In Proceedings of the 27th International Conference on Computational Linguistics. 1986–1997.
  [4] Buda Jakab and Bolonyai Flora. 2020. An ensemble model using n-grams and statistical features to identify fake news spreaders on Twitter. In Proceedings of CLEF.
  [5] Cai Deng and Zhao Hai. 2016. Neural word segmentation learning for Chinese. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 409–420. https://doi.org/10.18653/v1/P16-1039
  [6] Cai Deng, Zhao Hai, Zhang Zhisong, Xin Yuan, Wu Yongjian, and Huang Feiyue. 2017. Fast and accurate neural word segmentation for Chinese. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 608–615.
  [7] Chen Xinchi, Qiu Xipeng, Zhu Chenxi, Liu Pengfei, and Huang Xuanjing. 2015. Long short-term memory neural networks for Chinese word segmentation. In Proceedings of EMNLP. 1385–1394.
  [8] Devlin Jacob, Chang Mingwei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL 2019. 4171–4186. https://doi.org/10.18653/v1/N19-1423
  [9] Ghodsi Aneiss and DeNero John. 2016. An analysis of the ability of statistical language models to capture the structural properties of language. In Proceedings of the 9th International Natural Language Generation Conference. 227–231.
  [10] He Qizhen, Wu Liang, Yin Yida, and Cai Heming. 2020. Knowledge-graph augmented word representations for named entity recognition. In Proceedings of AAAI 2020, Vol. 34. 7919–7926. https://doi.org/10.1609/aaai.v34i05.6299
  [11] Hochreiter Sepp and Schmidhuber Jurgen. 1997. Long short-term memory. Neural Computation 9, 8 (Aug. 1997), 1735–1780.
  [12] Hu Anwen, Dou Zhicheng, Wen Jirong, and Nie Jianyun. 2020. Leveraging multi-token entities in document-level named entity recognition. In Proceedings of AAAI 2020, Vol. 34. 7961–7968. https://doi.org/10.1609/aaai.v34i05.6304
  [13] Huang Changning and Zhao Hai. 2007. Chinese word segmentation: A decade review. Journal of Chinese Information Processing 21, 3 (May 2007), 8–19.
  [14] Lample Guillaume, Ballesteros Miguel, Subramanian Sandeep, Kawakami Kazuya, and Dyer Chris. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016. 260–270.
  [15] Le Hong Phuong, Nguyen Thi Minh Huyen, Roussanaly Azim, and Ho Tuong Vinh. 2008. A hybrid approach to word segmentation of Vietnamese texts. In Proceedings of the 2nd International Conference on Language and Automata Theory and Applications. 240–249.
  [16] Lim KyungTae, Lee Jay Yoon, Carbonell Jaime, and Poibeau Thierry. 2020. Semi-supervised learning on meta structure: Multi-task tagging and parsing in low-resource scenarios. In Proceedings of AAAI 2020, Vol. 34. 8344–8351. https://doi.org/10.1609/aaai.v34i05.6351
  [17] Liu Junxin, Wu Fangzhao, Wu Chuhan, Huang Yongfeng, and Xie Xing. 2019. Neural Chinese word segmentation with dictionary. Neurocomputing 338 (April 2019), 46–54.
  [18] Liu Liyuan, Ren Xiang, Zhu Qi, Gui Huan, Zhi Shi, Ji Heng, and Han Jiawei. 2017. Heterogeneous supervision for relation extraction: A representation learning approach. In Proceedings of EMNLP. 46–56.
  [19] Liu Liyuan, Shang Jingbo, Ren Xiang, Xu Frank F., Gui Huan, Peng Jian, and Han Jiawei. 2018. Empower sequence labeling with task-aware neural language model. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 5253–5260.
  [20] Liu Shifeng, Sun Yifang, Li Bing, Wang Wei, and Zhao Xiang. 2020. HAMNER: Headword amplified multi-span distantly supervised method for domain specific named entity recognition. In Proceedings of AAAI 2020, Vol. 34. 8401–8408. https://doi.org/10.1609/aaai.v34i05.6358
  [21] Ma Xuezhe and Hovy Eduard. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 1064–1074.
  [22] Marie Benjamin, Sun Haipeng, Wang Rui, Chen Kehai, Fujita Atsushi, Utiyama Masao, and Sumita Eiichiro. 2019. NICT's unsupervised neural and statistical machine translation systems for the WMT19 news translation task. In Proceedings of the 4th Conference on Machine Translation (WMT'19). 294–301.
  [23] Mayhew Stephen, Gupta Nitish, and Roth Dan. 2020. Robust named entity recognition with truecasing pretraining. In Proceedings of AAAI 2020, Vol. 34. 8480–8487. https://doi.org/10.1609/aaai.v34i05.6368
  [24] Mikolov Tomas, Chen Kai, Corrado Greg, and Dean Jeffrey. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR'13). http://arxiv.org/abs/1301.3781
  [25] Nguyen A., Nguyen K., and Ngo V. 2019. Neural sequence labeling for Vietnamese POS tagging and NER. In Proceedings of the 2019 IEEE-RIVF International Conference on Computing and Communication Technologies (RIVF'19). 1–5. https://doi.org/10.1109/RIVF.2019.8713710
  [26] Nguyen Dat Quoc, Nguyen Dai Quoc, Vu Thanh, Dras Mark, and Johnson Mark. 2018. A fast and accurate Vietnamese word segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC'18). 2582–2587.
  [27] Nguyen Linh The and Nguyen Dat Quoc. 2021. PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing. In Proceedings of NAACL-HLT 2021. 1–7.
  [28] Pennington Jeffrey, Socher Richard, and Manning Christopher. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 1532–1543. https://doi.org/10.3115/v1/D14-1162
  [29] Peters Matthew E., Neumann Mark, Iyyer Mohit, Gardner Matt, Clark Christopher, Lee Kenton, and Zettlemoyer Luke. 2018. Deep contextualized word representations. In Proceedings of NAACL 2018. 2227–2237.
  [30] Safranchik Esteban, Luo Shiying, and Bach Stephen H. 2020. Weakly supervised sequence tagging from noisy rules. In Proceedings of AAAI 2020, Vol. 34. 5570–5578. https://doi.org/10.1609/aaai.v34i04.6009
  [31] Simpson Edwin, Pfeiffer Jonas, and Gurevych Iryna. 2020. Low resource sequence tagging with weak labels. In Proceedings of AAAI 2020, Vol. 34. 8862–8869. https://doi.org/10.1609/aaai.v34i05.6415
  [32] Somsap Sittichai and Seresangtakul Pusadee. 2020. Isarn Dharma word segmentation using a statistical approach with named entity recognition. ACM Transactions on Asian and Low-Resource Language Information Processing 19, 2 (Feb. 2020), Article 27, 16 pages. https://doi.org/10.1145/3359990
  [33] Sproat Richard and Shih Chilin. 1990. A statistical method for finding word boundaries in Chinese text. Computer Processing of Chinese and Oriental Languages 4, 4 (March 1990), 336–351.
  [34] Tan Chuanqi, Qiu Wei, Chen Mosha, Wang Rui, and Huang Fei. 2020. Boundary enhanced neural span classification for nested named entity recognition. In Proceedings of AAAI 2020, Vol. 34. 9016–9023. https://doi.org/10.1609/aaai.v34i05.6434
  [35] Varjokallio Matti and Klakow Dietrich. 2016. Unsupervised morph segmentation and statistical language models for vocabulary expansion. In Proceedings of ACL. 175–180.
  [36] VLSP Project. 2010. Vietnamese Language Resources. Retrieved August 5, 2020 from https://vlsp.hpda.vn/demo/?page=resources&lang=en
  [37] Vu Thanh, Nguyen Dat Quoc, Nguyen Dai Quoc, Dras Mark, and Johnson Mark. 2018. VnCoreNLP: A Vietnamese natural language processing toolkit. In Proceedings of NAACL-HLT 2018. 56–60.
  [38] Wu Qianhui, Lin Zijia, Wang Guoxin, Chen Hui, Karlsson Borje F., Huang Biqing, and Lin Chinyew. 2020. Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. In Proceedings of AAAI 2020, Vol. 34. 9274–9281. https://doi.org/10.1609/aaai.v34i05.6466
  [39] Xue Nianwen. 2003. Chinese word segmentation as character tagging. International Journal of Computational Linguistics and Chinese Language Processing 8, 6 (March 2003), 29–48.
  [40] Yang Jie, Zhang Yue, and Dong Fei. 2017. Neural word segmentation with rich pretraining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 839–849.


    • Published in

      ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3 (May 2022), 413 pages. ISSN 2375-4699, EISSN 2375-4702. DOI: 10.1145/3505182.


      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      • Received: 1 June 2019
      • Revised: 1 December 2020
      • Accepted: 1 August 2021
      • Published: 13 December 2021, in TALLIP Volume 21, Issue 3
