Abstract
Mongolian named entity recognition (NER) is not only one of the most crucial and fundamental tasks in Mongolian natural language processing, but also an important step to improve the performance of downstream tasks such as information retrieval, machine translation, and dialog system. However, traditional Mongolian NER models heavily rely on the feature engineering. Even worse, the complex morphological structure of Mongolian words makes the data sparser. To alleviate the feature engineering and data sparsity in Mongolian named entity recognition, we propose a novel NER framework with Multi-Knowledge Enhancement (MKE-NER). Specifically, we introduce both linguistic knowledge through Mongolian morpheme representation and cross-lingual knowledge from Mongolian-Chinese parallel corpus. Furthermore, we design two methods to exploit cross-lingual knowledge sufficiently, i.e., cross-lingual representation and cross-lingual annotation projection. Experimental results demonstrate the effectiveness of our MKE-NER model, which outperforms strong baselines and achieves the best performance (94.04% F1 score) on the traditional Mongolian benchmark. Particularly, extensive experiments with different data scales highlight the superiority of our method in low-resource scenarios.
1 INTRODUCTION
Named entity recognition (NER) is one of the fundamental tasks in natural language processing (NLP), which detects names of specified categories, such as person, location, and organization. It plays a significant role in the overall task of information extraction and serves as a vital step for other downstream NLP applications, e.g., information retrieval, machine translation, and dialog systems. Recently, neural-based methods have achieved state-of-the-art performance in both Chinese and English NER tasks [22, 36, 40, 41], where NER is formulated as sequence labeling. Nevertheless, these methods rely on a large amount of annotated data, which is scarce and difficult to obtain, especially in low-resource scenarios [12, 43]. Therefore, several studies have paid attention to low-resource scenarios. Zhang et al. [43] used a novel neural model for name tagging based solely on pseudo data. Cao et al. [7] proposed a new expectation-driven learning framework requiring very few resources. Cui et al. [10] proposed a template-based method for low-resource NER with the help of the generative pre-trained language model BART [21]. Nevertheless, studies on NER tasks in low-resource languages still lag far behind the leading edge due to sparse corpora and the neglect of language-specific characteristics, e.g., Mongolian morphemes.
In this paper, we explore the NER task in Mongolian, a low-resource language mainly used in northern China, Mongolia, and Russia. For Mongolian, there are two commonly used scripts, i.e., traditional Mongolian in China, and Cyrillic Mongolian in Mongolia and Russia. The two share a similar phonological system but differ in writing script. In this work, we mainly focus on traditional Mongolian NER, whose script and corresponding Latin letters are shown with an example in Figure 1. Note that traditional Mongolian is actually written from top to bottom, with columns proceeding left to right. To facilitate reading, we adopt the horizontal format shown in Figure 1.
Fig. 1. An example of Mongolian NER, whose meaning is “Wen Jiabao presided over the executive meeting of the State Council”.
First of all, the complex characteristics of Mongolian and its different encoding modes increase the difficulty of Mongolian NER research, which heavily restricts the further development of other downstream tasks in Mongolian. To date, most work on Mongolian NER still uses traditional rule-based or statistical methods [4, 19, 30], which require expensive hand-crafted features and high-quality corpora. Moreover, these methods cannot handle out-of-vocabulary (OOV) words.
In addition, due to the scarcity of training data in Mongolian, the performance of most neural-based NER models degrades significantly with deficient feature representation learning. Besides, the performance of the Mongolian NER task has an important impact on specific downstream tasks such as knowledge graphs (KG) [1, 5, 33] and automatic question answering (QA) [6, 27]. Therefore, effectively alleviating the data scarcity of Mongolian NER and fully exploiting Mongolian-specific features has far-reaching significance for Mongolian NLP research.
Inspired by recent work [14], we consider enhancing a Mongolian NER system with both Mongolian linguistic and cross-lingual knowledge. We introduce linguistic knowledge by using Mongolian morphemes to capture more fine-grained representations. Moreover, we transfer cross-lingual knowledge from a high-resource language (Chinese) to a low-resource language (Mongolian) with bilingual parallel corpora. Furthermore, we design two methods to exploit cross-lingual knowledge sufficiently, i.e., cross-lingual representation incorporation and cross-lingual annotation projection. Our approach is denoted as Multi-Knowledge Enhanced (MKE) Mongolian NER.
The contributions of this paper can be summarized as follows:
(1) We propose a novel framework, i.e., Multi-Knowledge Enhanced Named Entity Recognition (MKE-NER), which incorporates Mongolian linguistic knowledge with fine-grained morpheme embeddings.
(2) MKE-NER also incorporates cross-lingual knowledge, i.e., effective representation and annotation projection from a high-resource language (Chinese) to a low-resource language (Mongolian), to alleviate data scarcity and feature engineering in the Mongolian NER task in a language-independent manner, which could be readily applied to other low-resource languages.
(3) Experimental results demonstrate the effectiveness of multi-knowledge enhancement, and our proposed framework outperforms the state-of-the-art method (i.e., TENER) on the Mongolian NER dataset.
(4) Extensive experiments with different data scales illustrate that our method further surpasses other comparison systems; the scarcer the data, the more significant the impact our method exerts, which highlights its superiority in low-resource scenarios.
2 BACKGROUND
2.1 Task Definition
Named entity recognition (NER) is one of the most fundamental sub-tasks in information extraction, aiming to identify named entities of predefined types, like names of persons, locations, and organizations, in unstructured text. Formally, given a sentence of tokens \( S = \langle x_1, x_2,\ldots , x_N\rangle \), NER outputs a list of tuples \( \langle I_s, I_e, t\rangle \), each of which is a named entity in sentence S. \( I_s \) and \( I_e \) are the start and end indexes of a named entity. t is the entity type from the predefined category set, e.g., PER, LOC, and ORG for persons, locations, and organizations, respectively. For example, given the sentence “Peter will fly to California tomorrow to attend a conference held by ACM.”, an NER system needs to recognize three entities, i.e., <Peter, PER>, <California, LOC>, and <ACM, ORG>.
NER is usually modeled as a sequence labeling task, and therefore many solutions use conditional random fields (CRFs) as their backbones. Neural networks, e.g., Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, play the role of feature extractors in NER models. Recently, many researchers have tried to exploit the power of pre-trained language models like BERT to further improve NER performance and have yielded state-of-the-art results. However, how to maintain the effectiveness of NER models in low-resource scenarios is still an intractable problem.
2.2 Our Framework
To transfer the strong performance of Transformer [32] from other NLP tasks to NER tasks, our framework, named Multi-Knowledge Enhanced Named Entity Recognition (MKE-NER), is based on TENER [42] instead of traditional bidirectional LSTM. As shown in Figure 2, there are three components in MKE-NER: (a) Embedding Layer, which fuses not only Mongolian but also cross-lingual knowledge representations, (b) Adaptive Encoding Layer, which modifies the Transformer encoder to model the long-range and complicated interactions of sentence for NER, and (c) CRF Layer, from which we can decode the final predicted tag sequence for the input sentences. We will elaborate each part of the model as follows.
Fig. 2. The overview of MKE-NER, which contains three preliminary components: embedding layer, adaptive encoding layer, and CRF layer.
Embedding Layer. For each token \( x_i \) of the input sentence, its embedding \( {\bf e}_i \) is composed of the character representation \( {\bf e}_i^C \), our proposed Mongolian morpheme representation \( {\bf e}_i^M \), and the cross-lingual knowledge representation \( {\bf e}_i^{XL} \), which can be defined as follows: (1) \( \begin{equation} {\bf e}_i = \left[{\bf e}_i^C;{\bf e}_i^M;{\bf e}_i^{XL}\right]\!, \end{equation} \) where [;] denotes the concatenation operation. The character representation \( {\bf e}_i^C \) is the concatenation of the character features, which are extracted by a Transformer-based character encoder. The Mongolian morpheme representation \( {\bf e}_i^M \) is a fine-grained segmentation representation between the character level and the word level.
The cross-lingual knowledge representation \( {\bf e}_i^{XL} \) is learned with a Mongolian-Chinese parallel corpus, which alleviates the insufficiency of Mongolian representations due to data scarcity. Exploiting the Mongolian-Chinese parallel corpus, we fuse cross-lingual knowledge into our framework by two methods: representation projection enhancing and annotation projection enhancing. For representation enhancing, we extract a bilingual dictionary by aligning the parallel corpus and enhance the Mongolian word representations with the Chinese word representations selected by this dictionary. For annotation enhancing, we focus on NER data augmentation: we use existing NER tools to generate pseudo labels on the Chinese side of the parallel corpus and project the annotations onto the Mongolian sentences via the alignment. More details of Mongolian morpheme representation and cross-lingual knowledge representation are introduced in Section 3.
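As a minimal illustration of the concatenation in Eq. (1), the following sketch composes a token embedding from its three parts; the dimensions are hypothetical:

```python
import numpy as np

# Hypothetical dimensions for the three components of Eq. (1).
D_CHAR, D_MORPH, D_XL = 30, 30, 64

def token_embedding(e_char, e_morph, e_xl):
    """Concatenate character, morpheme, and cross-lingual embeddings."""
    return np.concatenate([e_char, e_morph, e_xl])
```

The resulting vector simply stacks the three feature spaces, so each knowledge source contributes an independent slice of the token representation.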
Adaptive Encoding Layer. The adaptive encoding layer takes token embeddings E as input to generate contextual word representations. Following [42], this component adapts Transformer for NER task with two improvements.
Direction- and Distance-Aware Attention. The original Transformer lacks directionality despite its positional encoding: it cannot distinguish whether information comes from the preceding or the following context. Even if the input vectors carry direction information, it is destroyed by the attention computation. Direction information is particularly important for named entity recognition. Therefore, TENER calculates the attention score as follows: (2) \( \begin{equation} Q,K,V=HW_q,\ H_{d_k},\ HW_v, \end{equation} \) (3) \( \begin{equation} R_{t-j}=\left[\cdots \sin \left(\frac{t-j}{10000^{\frac{2i}{d_k}}}\right)\ \cos \left(\frac{t-j}{10000^{\frac{2i}{d_k}}}\right)\cdots \right]^T, \end{equation} \) (4) \( \begin{equation} A^{rel}_{t,j}=Q_t K_j^T + Q_t R_{t-j}^T + u K_j^T + v R_{t-j}^T, \end{equation} \) where t is the index of the target token, j is the index of the context token, \( Q_t \) and \( K_j \) are the query vector of the t-th word and the key vector of the j-th word, respectively, \( R_{t-j} \) is the sinusoidal relative positional encoding, and u and v are learnable parameters.
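The relative positional encoding of Eq. (3) and the score of Eq. (4) can be sketched as follows; the even dimension and the interleaving of sine and cosine terms are assumptions for illustration. Because sine is an odd function, the sign of \( t-j \) is preserved, which is exactly what gives the attention its directionality:

```python
import numpy as np

def rel_pos_encoding(t_minus_j, d_k):
    """Sinusoidal relative position vector R_{t-j} of Eq. (3).

    Assumes d_k is even; sine (odd function) and cosine (even function)
    terms are interleaved, so the sign of t-j encodes direction.
    """
    i = np.arange(d_k // 2)
    angles = t_minus_j / (10000 ** (2 * i / d_k))
    r = np.empty(d_k)
    r[0::2] = np.sin(angles)
    r[1::2] = np.cos(angles)
    return r

def rel_attention_score(q_t, k_j, r, u, v):
    """Direction- and distance-aware score A^rel_{t,j} of Eq. (4);
    u and v stand for the learnable bias vectors."""
    return q_t @ k_j + q_t @ r + u @ k_j + v @ r
```

Note how the sine components flip sign when the context token moves from the left of the target to the right, while the cosine components stay the same, so both direction and distance are recoverable.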
Un-scaled Dot-Product Attention. The original Transformer uses scaled dot-product attention to make the output of the softmax function smoother. Since named entities occupy only a small fraction of the whole sentence, a Transformer without scaled dot-product attention can concentrate more on entities and their surrounding information.
CRF Layer. A CRF layer is utilized after the adaptive encoding layer to obtain sentence-level tag information, with which we can efficiently use past and future tags to predict the current tag. The CRF layer contains two key modules: the state emission matrix and the state transition matrix. For the emission matrix, we consider the output of the adaptive encoding layer as the matrix of scores \( f_\theta ([x]_1^T) \). Note that we drop the input \( [x]_1^T \) for notational simplicity. The element \( [f_\theta ]_i^t \) of the matrix is the score output by the adaptive encoding layer with parameters \( \theta \) for the sentence \( [x]_1^T \) and for the i-th tag at the t-th word. The transition score \( [A]_{i,j} \) models the transition from the i-th state to the j-th state for a pair of consecutive steps. Note that this transition matrix is position-independent. The score of a sentence \( [x]_1^T \) along with a path of tags \( [i]_1^T \) is then given by the sum of transition scores and network scores: (5) \( \begin{equation} s\left([x]_1^T, [i]_1^T, \tilde{\theta }\right) = \sum _{t=1}^T\left([A]_{[i]_{t-1},[i]_t}+[f_\theta ]_{[i]_t,t}\right), \end{equation} \) and at inference we apply dynamic programming (the Viterbi algorithm) to find the optimal tag sequence.
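The path score of Eq. (5) and the dynamic-programming decoding can be sketched as follows; `emissions` and `transitions` stand for \( [f_\theta] \) and \( [A] \), and the variable names are ours:

```python
import numpy as np

def path_score(emissions, transitions, tags):
    """Score of a tag path per Eq. (5).

    emissions: (T, n_tags) matrix of scores [f_theta].
    transitions: (n_tags, n_tags) matrix [A].
    """
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

def viterbi_decode(emissions, transitions):
    """Dynamic programming for the highest-scoring tag sequence."""
    T, n_tags = emissions.shape
    dp = emissions[0].copy()                    # best score ending in each tag
    back = np.zeros((T, n_tags), dtype=int)     # backpointers
    for t in range(1, T):
        cand = dp[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    best = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):               # follow backpointers
        best.append(int(back[t, best[-1]]))
    return best[::-1]
```

The decoded path maximizes Eq. (5) over all tag sequences in \( O(T \cdot n_{tags}^2) \) time rather than enumerating the exponentially many paths.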
3 OUR APPROACHES
Although the aforementioned model improves the performance of the Transformer on NER tasks, it is still difficult to obtain satisfying performance with scarce data. Therefore, based on the above architecture, we propose three extension components, which incorporate knowledge from three aspects: (1) linguistic knowledge, based on which a fine-grained Mongolian morphology segmentation is proposed to capture more semantic information; (2) cross-lingual representation knowledge, which improves low-resource (Mongolian) embeddings via entity-specific knowledge transfer from a high-resource language (e.g., Chinese); and (3) cross-lingual annotation knowledge, where labeled NER data is automatically created by annotation projection from parallel data. (2) and (3) are two kinds of projection-based knowledge, which can be obtained from bilingual data. The three knowledge-enhancing components are elaborated in order below.
3.1 Linguistic Knowledge (Morpheme Embedding)
Mongolian words are usually composed of roots, stems, and various affixes. The root is also regarded as the first stem. The stem of a Mongolian word, namely the morpheme, contains a root and several following affixes. Besides the stem, Mongolian words usually contain one or more ending suffixes that have no practical meaning. These meaningless suffixes significantly increase the morphological diversity of Mongolian words, thereby enlarging the vocabulary and making the data sparser.
Consequently, based on the aforementioned Mongolian word-construction characteristics, we take more fine-grained morpheme embedding in place of traditional word embedding with a new Mongolian word segmentation method. Specifically, we segment a Mongolian word into two parts, i.e., its stem and its affixes including contiguous affixes and split affixes. For the sake of clarity, we illustrate it with a Mongolian word “
” (its meaning is “on and on”), which can be segmented into the stem “
”, the contiguous affix “
”, and the split affix “
”. Compared with word embedding, the morpheme embedding can distinguish the semantic stems from the meaningless affixes, so that the model can learn to eliminate the interference of noises.
Meanwhile, to further exploit the morpheme embedding in our NER model and match the aforementioned morphological forms in Mongolian, we also design a novel label scheme, “BMESO\( \text{B}_1\text{B}_2 \)”, for the Mongolian NER task. In the traditional “BMESO” label scheme based on word segmentation, “B”, “M”, and “E” denote the first, middle, and last words of an entity, respectively; “S” denotes a single word as an entity; and “O” marks words not contained in any entity. Since the various Mongolian affixes would intensify the data scarcity problem, we tag the two kinds of functional, meaningless affixes in Mongolian words (i.e., contiguous affixes and split affixes) with two extra labels, \( B_1 \) and \( B_2 \), respectively, beyond “BMESO”. As shown in Table 1, there is an example for the location named entity “
”, meaning of “Ulaanbaatar City”. Note that if the continuous and split affixes appear in a word simultaneously, only the first affix is labeled as “\( \text{B}_1 \)” or “\( \text{B}_2 \)” and others are ignored. For example, “
” is the first word of location entity with label “B-LOC”. For the second word “
”, “
” is the stem with label “E-LOC”, “
” is the first continuous suffix with label “\( \text{B}_1 \)”, and “
” is the split suffix with label “O”.
In summary, we split Mongolian words into morphemes and affixes to obtain more fine-grained information at both the embedding level and the label level, which helps the model capture more accurate representations.
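The labeling rule described above can be sketched as follows for a single entity word; the input format (a list of morpheme kinds) and the function name are hypothetical, and the sketch covers the last word of an entity, whose stem receives an “E-…” label as in the Table 1 example:

```python
def morpheme_labels(kinds, entity_type):
    """Assign BMESO-B1B2 labels to the morphemes of one entity word.

    `kinds` marks each segmented piece as 'stem', 'contiguous', or
    'split' (a hypothetical input format). Only the first affix is
    labeled B1/B2; any later affixes are labeled 'O'.
    """
    labels, affix_seen = [], False
    for kind in kinds:
        if kind == 'stem':
            labels.append(f'E-{entity_type}')
        elif not affix_seen:
            labels.append('B1' if kind == 'contiguous' else 'B2')
            affix_seen = True
        else:
            labels.append('O')
    return labels
```

For the “Ulaanbaatar City” example, a stem followed by a contiguous affix and a split affix would be labeled `['E-LOC', 'B1', 'O']`, matching the rule that only the first affix keeps its \( \text{B}_1 \)/\( \text{B}_2 \) label.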
3.2 Cross-lingual Representation Knowledge
We incorporate cross-lingual knowledge into our framework, transferring from a high-resource language (Chinese) to a low-resource language (Mongolian), by two methods: representation projection enhancing and annotation projection enhancing.
Representation Projection Enhancing. As shown in Figure 3, to enhance Mongolian word representations, we not only consider Mongolian “character embedding” and “morpheme embedding”, but also incorporate a special embedding with cross-lingual knowledge, called Mongolian-Chinese bilingual lexicon representations. Specifically, we consider using Chinese words to capture more abundant semantic information of Mongolian words based on alignment information existing in Mongolian-Chinese parallel corpus.
Fig. 3. Cross-lingual representation.
To model this cross-lingual information, firstly we construct a Mongolian-Chinese dictionary with parallel sentence pairs and design a strategy for dictionary extension. Then, we propose two methods to learn bilingual lexicon representations based on Mongolian-Chinese bilingual dictionary: (1) through the bidirectional LSTM network [20]; and (2) through the attention mechanism [2]. We will elaborate the above process as follows.
Mongolian-Chinese Dictionary. In order to obtain the relationship between Mongolian and Chinese, we utilize the unlabeled bilingual parallel corpus to build the Mongolian-Chinese bilingual dictionary through bilingual word alignment. Bilingual word alignment is a fundamental task in statistical machine translation (SMT), and we use the FastAlign toolkit1 to conduct word alignment on the bilingual parallel data.
As shown in Figure 4, given a Mongolian sentence \( M=\lbrace m_1,m_2,\ldots ,m_i,\ldots , m_n\rbrace \) with n Mongolian words and a Chinese word sequence \( C=\lbrace c_1,c_2,\ldots ,c_j,\ldots ,c_l\rbrace \) with l Chinese words, each Mongolian word \( m_i \) may correspond to multiple Chinese words \( c_j \).
Fig. 4. An example of word alignment between Mongolian and Chinese.
Although the word alignment performs well, we further conduct data cleaning and manual correction to reduce the noise introduced by the alignment.
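A minimal sketch of extracting dictionary entries from fast_align's output (one line of `i-j` source-target index pairs per sentence pair) might look as follows; the frequency threshold `min_count` is an assumption, and the real pipeline additionally applies cleaning and manual correction:

```python
from collections import Counter, defaultdict

def extract_dictionary(mongolian_sents, chinese_sents, alignments, min_count=1):
    """Build a Mongolian-to-Chinese dictionary from word-alignment output.

    `alignments` holds one line per sentence pair in fast_align's
    "i-j" format, where source index i is aligned to target index j.
    """
    counts = defaultdict(Counter)
    for mon, chn, align in zip(mongolian_sents, chinese_sents, alignments):
        for pair in align.split():
            i, j = map(int, pair.split('-'))
            counts[mon[i]][chn[j]] += 1
    # keep translations seen at least `min_count` times, most frequent first
    return {w: [c for c, n in trans.most_common() if n >= min_count]
            for w, trans in counts.items()}
```

Counting link frequencies over the whole corpus lets rare, noisy alignments be filtered out before the dictionary is used for representation learning.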
Dictionary Extension Strategy. For words outside the Mongolian-Chinese dictionary, we propose a dictionary extension strategy. Let the set of in-dictionary Mongolian word embeddings be \( W=\lbrace w_1,\ldots ,w_i,\ldots ,w_f\rbrace \) and the corresponding bilingual lexicon representation of \( w_i \) be \( vec_i \). The calculation is as follows: (6) \( \begin{equation} vec_i=Mw_i, \end{equation} \) where M is the mapping matrix, which we optimize as follows: (7) \( \begin{equation} loss_M=\sum _{i=1}^f\Vert vec_i-Mw_i\Vert _2. \end{equation} \)
Finally, the bilingual representation of each OOV Mongolian word embedding \( o_i \) is calculated as: (8) \( \begin{equation} vec^o_i=Mo_i. \end{equation} \)
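Assuming the mapping is fit in closed form, a least-squares sketch of Eqs. (6)-(8) could look as follows (least squares minimizes the squared-norm variant of the loss in Eq. (7)):

```python
import numpy as np

def fit_mapping(W, V):
    """Fit the mapping matrix M of Eq. (6) by least squares.

    W: (f, d_w) matrix stacking Mongolian word embeddings w_i.
    V: (f, d_v) matrix stacking their bilingual lexicon vectors vec_i.
    """
    X, *_ = np.linalg.lstsq(W, V, rcond=None)  # solves W @ X ~= V
    return X.T  # M has shape (d_v, d_w), so vec_i ~= M @ w_i

def map_oov(M, o):
    """Bilingual representation for an OOV word embedding o (Eq. 8)."""
    return M @ o
```

Once M is learned from the in-dictionary pairs, any OOV Mongolian embedding can be projected into the bilingual representation space with a single matrix-vector product.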
Cross-lingual Representation with LSTM. First, we use the Stanford CoreNLP tools2 to obtain part-of-speech (POS) tags for the Chinese words and denote the POS-tag embeddings as P, which are initialized randomly and optimized during model training. We take the concatenation \( [c_j,p_j] \) of the Chinese word embedding and POS-tag embedding as input and use a bidirectional LSTM network to extract textual features. Finally, we concatenate the outputs of the bidirectional LSTM at the last time step as the bilingual lexicon representation.
Cross-lingual Representation with Attention. Since one Mongolian word \( m_i \) may correspond to multiple Chinese words \( c_j \), and \( m_i \) has a different degree of correlation with each \( c_j \), we use an attention mechanism. The output vector learned through the attention mechanism is denoted \( vec_i \) and calculated as follows: (9) \( \begin{equation} vec_i=\sum _{j=1}^l\alpha _j c_j, \end{equation} \) (10) \( \begin{equation} \sum _{j}\alpha _j=1, \end{equation} \) where \( \alpha _j \) is the normalized attention weight of \( c_j \).
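A sketch of Eqs. (9)-(10) follows; the dot-product scoring used to produce the weights \( \alpha_j \) is our assumption, since the scoring function is not specified here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bilingual_lexicon_rep(m_i, C):
    """Aggregate aligned Chinese word embeddings with attention.

    m_i: Mongolian word embedding; C: (l, d) aligned Chinese embeddings.
    alpha sums to 1 (Eq. 10) and vec_i = sum_j alpha_j c_j (Eq. 9).
    The dot-product scoring C @ m_i is an assumed scoring function.
    """
    alpha = softmax(C @ m_i)
    return alpha @ C
```

The softmax normalization enforces the constraint of Eq. (10), so the output is a convex combination of the aligned Chinese word embeddings.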
3.3 Cross-lingual Annotation Knowledge
To make full use of cross-lingual knowledge, we further conduct annotation projection to synthesize extra data based on a parallel corpus to pretrain the model. Figure 5 describes the overall framework of our proposed annotation projection and its training process. In particular, we denote our parallel corpus as \( (C,M)^N \), which contains N Chinese-Mongolian sentence pairs. We use the FastAlign3 toolkit for word alignment of the i-th sentence pair \( (C^{(i)},M^{(i)}) \), so that the labels on \( C^{(i)} \) can be projected to \( M^{(i)} \), \( 1\le i \le N \). We regard the Mongolian sentences with projected labels as projection-labeled Mongolian data in the following sections.
Fig. 5. Overall framework and training process of annotation projection.
Given a Chinese sentence \( C^{(i)}=\lbrace C^{(i)}_1,C^{(i)}_2,\ldots ,C^{(i)}_n\rbrace \) with n words and its Mongolian translation \( M^{(i)}=\lbrace M^{(i)}_1,M^{(i)}_2,\ldots ,M^{(i)}_l\rbrace \) with l words, we first use LTP NER tools4 on \( C^{(i)} \) to obtain the label sequence \( T^{C^{(i)}}=\lbrace T^{C^{(i)}}_1,T^{C^{(i)}}_2,\ldots ,T^{C^{(i)}}_n\rbrace \) on the Chinese sentence.
If the Mongolian word \( M^{(i)}_j \) is aligned to the Chinese word \( C^{(i)}_k \) with label \( T^{C^{(i)}}_k \), the pseudo label of \( M^{(i)}_j \) is defined as \( T^{M^{(i)}}_j = T^{C^{(i)}}_k \). Through this process, we obtain the corresponding label sequence of the Mongolian sentence, \( T^{M^{(i)}}=\lbrace T^{M^{(i)}}_1,T^{M^{(i)}}_2,\ldots ,T^{M^{(i)}}_l\rbrace \).
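The projection step can be sketched as follows; representing the alignment as a Mongolian-to-Chinese index map is our simplification:

```python
def project_labels(chinese_tags, alignment, mongolian_len):
    """Project NER labels from Chinese words onto aligned Mongolian words.

    `alignment` maps a Mongolian word index j to the Chinese word index k
    it aligns to (a simplified representation); unaligned words get 'O'.
    """
    return [chinese_tags[alignment[j]] if j in alignment else 'O'
            for j in range(mongolian_len)]
```

Mongolian words with no aligned Chinese counterpart default to “O”, which is one of the ways projection noise enters the pseudo-labeled data.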
To measure the annotation quality of the projection-labeled data, we design a heuristic, language-independent metric that selects a fixed amount of higher-quality data based on statistics. Specifically, we count the frequency of tags for each word in the dataset and assume that the more frequently a tag occurs with a word, the more likely it is the correct tag for that word.
For a Mongolian sentence \( M=\lbrace m_1,m_2,\ldots ,m_k,\ldots , m_n\rbrace \) which contains n words, its corresponding NER label sequence is \( T=\lbrace t_1,t_2,\ldots ,t_k,\ldots , t_n\rbrace \), and we calculate the score of each word as follows: (11) \( \begin{equation} score(m_k,t_k)=\frac{Count(m_k,t_k)}{\sum _i Count(m_k,t_i)}, \end{equation} \) where \( t_i \) ranges over the labels observed for word \( m_k \) in the dataset. The score of sentence M with its label sequence T is the average of the scores of all words in M: (12) \( \begin{equation} score(M,T)=\frac{1}{n}\sum _{k=1}^{n}score(m_k,t_k). \end{equation} \)
To minimize the bias in the projection-labeled data, we calculate \( Count(\cdot) \) only on the golden dataset unless a word exists solely in the projection-labeled data. In addition to the sentence score, we also impose a minimum sentence length to preserve sentences with richer sequence features.
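The selection metric of Eqs. (11)-(12) can be sketched as:

```python
from collections import Counter, defaultdict

def selection_score(words, tags, tag_counts):
    """Average per-word tag confidence over a sentence (Eqs. 11-12).

    `tag_counts[w]` is a Counter of tags observed for word w, collected
    from the golden data where possible (Section 3.3).
    """
    per_word = []
    for w, t in zip(words, tags):
        total = sum(tag_counts[w].values())
        per_word.append(tag_counts[w][t] / total if total else 0.0)
    return sum(per_word) / len(words)
```

Sentences are then ranked by this score, and a fixed-size subset with the highest values (above the minimum length) is kept for pre-training.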
4 EXPERIMENTAL SETTINGS
In this section, we first introduce the Mongolian NER dataset. Then we describe implementation details and comparison methods in our experiments.
4.1 Datasets & Experimental Setup
Mongolian NER dataset. Due to the scarcity of annotated Mongolian corpora, there is no openly available traditional Mongolian dataset. Our labeled data comes from the Latin-Mongolian corpus provided by Dr. Wu Jinxing of Inner Mongolia University [37]. Since our study focuses on traditional Mongolian, we automatically transform the Latin-Mongolian data into traditional Mongolian format using the Latin-Mongolian character comparison table. Through further manual correction, we obtain labeled data with 13,755 traditional Mongolian sentences and 446,153 Mongolian words. Table 2 shows the statistics of the train, development, and test splits of the Mongolian NER dataset.
Cross-Lingual Unlabeled data. To make full use of cross-lingual knowledge, we exploit Mongolian-Chinese parallel corpora in the following three forms. (1) Mongolian-Chinese dictionary. We use the cross-lingual unlabeled data to construct the Mongolian-Chinese dictionary introduced in Section 3.2, obtaining 50,000 word pairs. We also utilize the unlabeled data to train pre-trained models for our comparison experiments. Our unlabeled Mongolian data comes from CCMT2019 (The 15th China Conference on Machine Translation). Table 3 displays a summary of the unlabeled datasets.
(2) Projection-labeled data. As illustrated in Section 3.3, we construct a projection-labeled dataset with label projection. In particular, we first conduct NER with the LTP toolkit on every sentence in the Chinese corpus. Then, we use the FastAlign toolkit to perform word alignment between Chinese and Mongolian words so that we can project the labels on Chinese words onto the corresponding Mongolian words. After that, we obtain a coarse version of this first kind of pseudo data, and we then complement missing entities with an entity lexicon to improve the recall of entity labels.
(3) Pseudo Data. For comparison, we also construct another weakly labeled dataset, called pseudo data, by annotating unlabeled data with a pre-trained model. We first train the aforementioned NER model on golden data and then use it to annotate the unlabeled Mongolian sentences in the parallel corpus. Similar to the projection-labeled data introduced above, we change the labels of those entities predicted as “O” but present in the entity lexicon to their corresponding labels, and perform data selection for higher quality.
4.2 Morpheme Labeling Settings
We set up four morpheme labeling schemes in total for our experiments to produce more general and convincing results.
BASE means the vanilla method for NER annotation which only considers a set of four labels {“O”, “PER”, “LOC”, “ORG”};
BIO builds on BASE by distinguishing the beginning (“B”) from the inside (“I”) of an entity span;
BMESO further takes the end of an entity span into account;
BMESO\( \text{B}_1\text{B}_2 \) is a refinement of “BMESO” for morpheme settings which denotes the first contiguous suffix in an entity word as “\( \text{B}_1 \)” and the first split suffix as “\( \text{B}_2 \)”.
4.3 Implementation and Hyper-parameters
We use Adam as the optimizer. The batch size is set to 32, the number of attention heads to 12, the learning rate to 0.0001, and dropout to 0.02. The dimension of character embeddings is set to 30. To train embeddings with cross-lingual knowledge, we use the same settings: batch size 32, 12 attention heads, dropout 0.02, and learning rate 0.0001; the dimension of word embeddings is 30.
4.4 Systems
We compare our method with other relevant methods as follows:
(1) CRF: the CRF++ toolkit,5 an open-source implementation of CRFs for segmenting/labeling sequential data;
(2) Bi-LSTM+CRF [17]: contains a bidirectional LSTM to extract text features and a CRF layer to obtain the label sequence with the highest probability; one of the most widely used systems in the field of named entity recognition;
(3) Bi-LSTM-CNN+CRF [25]: compared with Bi-LSTM+CRF, it inserts several Convolutional Neural Network (CNN) layers between the Bi-LSTM and CRF layers to further extract high-dimensional semantic features;
(4) TENER: our baseline model, which modifies the Transformer to model the long-range and complicated interactions within a sentence for NER.
MKE-NER (ours): Our MKE-NER is a training framework for low-resource NER, which uses three kinds of knowledge to enhance NER models. The framework is flexible and can be applied to the above baseline models.
5 RESULTS AND ANALYSIS
We conduct experiments on traditional Mongolian NER dataset to compare our MKE-NER framework with other comparison systems introduced in Section 4.4. We apply our approach on the strongest baseline TENER among the aforementioned baseline systems to validate the effectiveness of our approach.
5.1 Overall Results
Table 4 shows the overall results of the comparison systems and our MKE-NER on the test set. The results demonstrate that our method significantly outperforms all comparison systems and achieves an F1 score of 94.04%. It is worth noting that all comparison systems achieve high precision but relatively low recall on the test set. We attribute this phenomenon to the nature of NER tasks, where inherent label imbalance problems exist. Nevertheless, our method can remarkably improve recall while maintaining high precision, which demonstrates that our approach is superior to the baseline approaches in relatively low-resource Mongolian NER.
Although NER models based on prevalent pre-trained language models like BERT achieve strong performance in many languages, we do not adopt any pre-trained language model in our framework for two reasons: (a) there are no off-the-shelf pre-trained language models for traditional Mongolian, and (b) when pre-training a language model for traditional Mongolian from scratch, it is too complicated to incorporate cross-lingual knowledge with sub-words generated by BPE, since segmenting a Mongolian word by the BPE algorithm and by Mongolian morphemes yields different sub-words, and the alignment between the two kinds of sub-words is particularly complex.
5.2 Ablation Study
To verify the effectiveness of each knowledge we exploit in MKE-NER, we conduct a series of ablation experiments. In Table 5 we can see that each kind of knowledge benefits the baseline and their contributions are additive. Besides, we also conduct exhaustive experiments and analysis on the effects of each kind of knowledge.
Effects of Mongolian Morpheme Knowledge. Firstly, to explore the effect of Mongolian Morpheme embedding, we compare different morpheme embeddings with different labeling schemes. The experimental results are shown in Figure 6.
Fig. 6. Comparison of three types of morpheme labeling schemes on different systems.
Clearly, whether in the CRF, Bi-LSTM+CRF, or TENER model, using morpheme embedding improves performance.6 We consider that morpheme embedding has two main advantages over word embedding. On the one hand, splitting words into morphemes notably reduces the vocabulary size, mitigating data sparsity and benefiting model training. On the other hand, morpheme embedding can be considered a kind of linguistic knowledge that indicates the lexical structure of Mongolian words. In addition, the proposed label scheme BMESO\( \text{B}_1\text{B}_2 \) introduces two extra labels for Mongolian suffixes, which further promotes the performance of our model on the Mongolian dataset.
Effects of Cross-Lingual Representation Knowledge. As for the cross-lingual representation knowledge, Table 6 shows the effects of two methods for cross-lingual representation knowledge and dictionary extension proposed in Section 3.2. For more detailed comparisons, we depict the benefit of cross-lingual knowledge under all morpheme labeling schemes in Figure 7.
Fig. 7. Comparison of different cross-lingual representation knowledge based on different Morpheme label schemes. “CR” denotes cross-lingual representation knowledge.
Table 6. Comparison of Cross-Lingual Representation Knowledge Effects
We find that our two methods for cross-lingual representation knowledge surpass the original model by 0.67% and 0.30% in F1 score, respectively. Note that without dictionary extension, cross-lingual representation brings only marginal improvement or even hurts performance. The reason is that most Mongolian words are absent from the bilingual dictionary, and adding a meaningless padding embedding to their representations may introduce noise, whereas dictionary extension provides meaningful cross-lingual representations through a transformation between semantic spaces. These results demonstrate that embeddings with Mongolian-Chinese cross-lingual knowledge play an important part and enrich the semantic representation of Mongolian word embeddings.
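A common way to realize such a dictionary extension is to learn a linear map between the two embedding spaces from dictionary pairs. The sketch below uses the orthogonal Procrustes solution on synthetic vectors; this is one standard choice for such a transformation, not necessarily the exact method of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(10, d))                 # Mongolian embeddings of dictionary words
W_true = np.linalg.qr(rng.normal(size=(d, d)))[0]
Y = X @ W_true                               # aligned Chinese embeddings (toy, noise-free)

# Orthogonal Procrustes: W = U V^T from the SVD of X^T Y minimizes ||XW - Y||_F
# over orthogonal W, giving a rotation between the two semantic spaces.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

oov = rng.normal(size=(1, d))                # a Mongolian word missing from the dictionary
projected = oov @ W                          # its induced cross-lingual representation
assert np.allclose(X @ W, Y, atol=1e-6)      # the map recovers dictionary pairs here
```

In the noise-free toy setting the learned map recovers the true rotation exactly; with real embeddings the fit is approximate, but out-of-dictionary words still receive a meaningful cross-lingual vector instead of a padding embedding.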
Effects of Cross-Lingual Annotation Knowledge. Lastly, we explore the effects of cross-lingual annotation knowledge. The results with the “BMESO” label scheme and the detailed results are shown in Table 7 and Figure 8, respectively. We compare our cross-lingual annotation knowledge (i.e., projection-labeled data) with the baseline and the pseudo-data enhancing method mentioned in Section 4.1. Both “Cross-lingual Annotation Knowledge” and “Pseudo data” benefit the baseline, and “Pseudo data” contributes slightly more to the model. One reason is that “Pseudo data” has a distribution more similar to the golden data, whereas “Cross-lingual Annotation Knowledge” is obtained independently of the golden data and may accumulate errors through the Chinese NER and alignment procedures. Another reason, we conjecture, is that the baseline may already perform better on Mongolian than the NER tools do on Chinese, so projection-labeled data may introduce some noise.
Fig. 8. Comparison of different cross-lingual annotation methods based on different morpheme label schemes. “CA” denotes cross-lingual annotation.
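A minimal sketch of the projection step, assuming fast_align-style (source index, target index) pairs and BIO tags produced by a Chinese NER tool; all tokens, tags, and alignments here are toy values:

```python
def project_labels(zh_labels, alignment, n_mn_tokens):
    """Transfer Chinese-side BIO tags onto Mongolian tokens via word alignment.
    zh_labels: per-Chinese-token tags; alignment: list of (zh_i, mn_j) pairs."""
    mn_labels = ["O"] * n_mn_tokens
    for zh_i, mn_j in alignment:
        tag = zh_labels[zh_i]
        if tag != "O":                # only entity tags are projected
            mn_labels[mn_j] = tag
    return mn_labels

zh = ["B-LOC", "I-LOC", "O"]          # tags from a Chinese NER tool (toy)
align = [(0, 1), (1, 2), (2, 0)]      # alignment pairs; word order differs
print(project_labels(zh, align, 3))   # → ['O', 'B-LOC', 'I-LOC']
```

Errors in either the Chinese NER tags or the alignment pairs propagate directly into the projected labels, which is the error accumulation discussed above.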
However, we argue that in small-data settings, where the baseline performs poorly, projection-labeled data may yield better results. To this end, we randomly sample 20%, 40%, 60%, and 80% of the full training instances to train the pre-trained models, respectively. The results in Figure 9 verify our assumption: projection-labeled data contributes more to the model when only 20% of the golden data is available, while pseudo data performs relatively worse. As more golden data is utilized, pseudo data performs better and eventually surpasses projection-labeled data when 100% of the golden data is used. This validates that projection-labeled data can significantly improve model performance when golden data is severely scarce. In a nutshell, we believe our projection-labeled data also provides a considerable amount of cross-lingual knowledge, on which pre-training yields a meaningful initialization of model parameters. Besides, it is effective to introduce noisy data in a relatively high-resource scenario (where golden data is sufficient).
Fig. 9. Comparison of different cross-lingual annotation methods under different low resource scenarios.
5.3 Comparison under Different Low Resource Scenarios
Due to the high expense of hand-crafted annotation and the scarcity of Mongolian NER corpora, it is important to verify the effectiveness of our method under different low-resource scenarios. Note that in each low-resource scenario, models are trained on a different amount of training data. Furthermore, as shown in Figure 9, different data scales may yield different results (for example, pseudo data benefits the model when 100% golden data is used yet hurts the model when there is only 20% golden data), so we investigate the impact of the training data size to check whether morpheme and cross-lingual knowledge can make a difference when the amount of training data is reduced.
Figure 10 compares our method with the baseline under different data scales. Experimental results demonstrate that our method outperforms the baseline at all data scales, verifying its effectiveness under various data-scale settings. Besides, the gap between the two methods gradually narrows as the data scale expands (i.e., from 14.63% to 2.74% in F1 score as the data scale grows from 1% to 100%), suggesting that the scarcer the data, the more significant the impact of our method.
Fig. 10. Comparison of baseline and our MKE-NER model under different low resource scenarios.
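For reference, the F1 scores compared throughout are entity-level: a prediction counts as correct only if both the entity type and the span boundaries match. A minimal sketch over BIO tags (the helper names are ours, not from the paper):

```python
def spans(tags):
    """Extract (type, start, end) entity spans from a BIO tag sequence."""
    out, start = set(), None
    for i, t in enumerate(tags + ["O"]):   # sentinel "O" closes a trailing span
        if t.startswith("B-") or t == "O":
            if start is not None:
                out.add((tags[start][2:], start, i))
                start = None
            if t.startswith("B-"):
                start = i
    return out

def f1(gold, pred):
    """Entity-level F1: exact match on both span boundaries and entity type."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]       # misses the LOC entity
print(round(f1(gold, pred), 2))           # → 0.67
```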
In the following, we detail the effects of morpheme and cross-lingual knowledge under different dataset scales. As shown in Figure 11, knowledge enhancement brings smaller performance improvements as the data scale expands. For “+Morpheme” knowledge, the improvement on F1 score falls from 5.25%, 4.18%, and 1.73% down to 1.24% as the training data scale increases. With the data scale increasing from 20% to 100%, the improvements on F1 score vary from 6.2% to 1.9%, 7.11% to 1.49%, and 4.52% to 0.94% for the combinations of “Morpheme” and cross-lingual knowledge, i.e., “CR” (cross-lingual representation) and “CA” (cross-lingual annotation), respectively. Therefore, we conjecture that when the training data for the Mongolian NER task is sparse (e.g., at the 20% data scale), introducing morpheme and cross-lingual knowledge can effectively alleviate data scarcity and achieve significant improvements. When the training data is relatively sufficient (e.g., at the 100% data scale), knowledge enhancement still improves Mongolian NER performance, though by a smaller margin.
Fig. 11. Effects of our proposed knowledge under different low resource scenarios. Note that we only choose “CR(LSTM)” rather than “CR(Attention)” here since the former performs slightly better than the latter under these data scales.
Additionally, we compare the effects of different knowledge enhancements under different data scales and observe that “CA” knowledge has the greatest impact on performance at relatively small data scales (i.e., 20%, 40%, and 60%). As the data scale expands to 80%, “CR” knowledge becomes the most significant factor. All results demonstrate that both kinds of cross-lingual knowledge can effectively boost Mongolian NER performance. Specifically, cross-lingual annotation knowledge performs well in relatively lower-resource scenarios, while cross-lingual representation knowledge does well at higher data scales.
6 RELATED WORK
In this section, we first review the mainstream development of neural NER with a focus on low-resource scenarios. We then introduce basic linguistic knowledge of Mongolian, and finally elaborate on studies of the Mongolian NER task.
Neural Named Entity Recognition. The CNN model proposed by Collobert et al. [8] first proved the effectiveness of deep neural networks for NER, successfully mapping text features to word embeddings. Recurrent neural networks (RNNs) [26, 31] brought another wave of deep neural network methods that improved NER performance, and various related models blossomed on this basis. Among them, LSTM [15, 16, 23, 24] partly solves problems such as gradient explosion and long-distance dependency in RNNs. Subsequently, Lample et al. [20] used an LSTM to extract text features and a CRF to obtain the optimal label sequence; the combination achieved the best NER results at the time and remains a mainstream model today. Yan et al. [42] proposed TENER, adapting the Transformer encoder to NER. Yamada et al. [41] introduced entities into pre-trained language models and achieved state-of-the-art performance on the CoNLL-2003 dataset.
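For concreteness, the decoding step shared by the LSTM+CRF line of models is Viterbi search over per-token emission scores (from the encoder) plus label-transition scores. The sketch below uses toy scores rather than a trained model:

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, L) per-token label scores; transitions: (L, L) scores for
    moving from label i to label j. Returns the highest-scoring label path."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # total[i, j] = best score ending at label i, then transitioning to j
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):         # follow backpointers to recover path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 3-token, 2-label example; transitions strongly discourage label 0 → 1
em = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.2]])
tr = np.array([[0.5, -2.0], [0.0, 0.5]])
print(viterbi(em, tr))   # → [0, 0, 0]
```

Note how the middle token's emissions alone prefer label 1, but the transition scores keep the globally best path on label 0 — the structured decoding that greedy per-token argmax misses.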
Low-resource Named Entity Recognition. Low-resource scenarios are one of the challenges for NER because of the limited scale of annotated data in many languages and domains, where general neural network methods usually perform poorly.
Cotterell and Duh [9] transferred knowledge from a high-resource language to a low-resource language by jointly training a CRF model on both languages. Kruengkrai et al. [18] applied multi-task learning to low-resource NER by jointly training a sentence classification task and an NER task. Wu et al. [38] proposed a teacher-student learning method for unlabeled data in the target language. Besides transfer learning, data augmentation is also widely used to enhance performance for low-resource NER. Ding et al. [11] introduced a generative data augmentation approach using a language model. Bari et al. [3] proposed a novel data augmentation framework for zero-resource cross-lingual task adaptation. Both methods have proved helpful for low-resource NER. Although built on transfer learning and data augmentation, our MKE-NER differs from these studies: we focus on a specific low-resource language, i.e., Mongolian, and exploit multiple kinds of language-specific knowledge to improve Mongolian NER.
Basic Linguistic Knowledge of Mongolian. Traditional Mongolian is an alphabetic language: Mongolian words are composed of Mongolian letters, of which there are eight vowels, ten prepositional consonants, and seventeen basic consonants. The glyph of each letter changes depending on its position within a word. Similar to English, Mongolian words are separated by spaces. Mongolian is written from top to bottom, with lines ordered from left to right. Additionally, Mongolian is an agglutinative language whose main word-building method is attaching word-forming suffixes to different roots and stems. Different from English, suffixes in Mongolian attach to both roots and stems, and some suffixes have no real meaning [13, 29]. In view of this word-building process, previous studies segment Mongolian words into morphemes and directly use morpheme representations to reduce word sparsity [35].
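Such morpheme segmentation can be sketched as iterative suffix stripping against a suffix list; the suffix list and the word below are hypothetical romanized stand-ins, not real Mongolian forms.

```python
SUFFIXES = ["aca", "iyer", "ud", "un"]   # toy word-forming suffixes

def segment(word, suffixes=SUFFIXES):
    """Repeatedly strip the longest known suffix from the word end; whatever
    remains is treated as the stem. Longest-match avoids premature splits."""
    parts = []
    changed = True
    while changed:
        changed = False
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) > len(suf):
                parts.insert(0, suf)
                word = word[: -len(suf)]
                changed = True
                break
    return [word] + parts

print(segment("qotaudaca"))   # → ['qota', 'ud', 'aca']  (stem + two suffixes)
```

Real systems additionally handle vowel harmony and glyph variants, but the stem-plus-suffix-chain structure this sketch produces is what makes morpheme tokens far less sparse than whole words.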
Mongolian Named Entity Recognition. The development of Mongolian NER has been relatively slow. Previous approaches mostly rely on lexicons or hand-crafted features [28]. With the development of deep learning, neural networks began to be applied in this field [34]. Building on this, Xiong and Nuo [39] introduced an attention mechanism into a Bi-LSTM and used contextual embeddings for Mongolian NER. Different from these methods, we tackle Mongolian NER with multi-knowledge enhancement, considering not only Mongolian linguistic knowledge via a novel morpheme segmentation method but also cross-lingual knowledge.
7 CONCLUSION
In this paper, we focus on the traditional Mongolian NER task. To alleviate the scarcity of training data, we propose a Multi-Knowledge Enhancement framework for Mongolian NER (MKE-NER) that exploits not only Mongolian linguistic knowledge through fine-grained morpheme segmentation but also cross-lingual knowledge from a Mongolian-Chinese parallel corpus. Specifically, we consider two kinds of knowledge in the parallel data, i.e., cross-lingual representation and cross-lingual annotation, to boost the performance of Mongolian NER. Thorough evaluations on the Mongolian NER dataset demonstrate the effectiveness of our model and, in particular, its superiority in low-resource scenarios.
Footnotes
1 https://github.com/clab/fast_align/
2 https://github.com/stanfordnlp/CoreNLP
3 https://github.com/clab/fast_align
4 https://github.com/HIT-SCIR/ltp/
5 https://github.com/taku910/crfpp/
6 Bi-LSTM-CNN+CRF is not included in the results since it has a similar architecture and performance to Bi-LSTM+CRF.
References
- [1] 2020. Named entity extraction for knowledge graphs: A literature overview. IEEE Access 8 (2020), 32862–32881.
- [2] 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
- [3] 2020. MultiMix: A robust data augmentation framework for cross-lingual NLP. arXiv preprint arXiv:2004.13240 (2020).
- [4] 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22, 1 (1996), 39–71.
- [5] 2013. Translating embeddings for modeling multi-relational data. In Neural Information Processing Systems (NIPS). 1–9.
- [6] 2019. What question answering can learn from trivia nerds. arXiv preprint arXiv:1910.14464 (2019).
- [7] 2019. Low-resource name tagging learned with weakly labeled data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 261–270.
- [8] 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
- [9] 2017. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, 91–96. https://www.aclweb.org/anthology/I17-2016.
- [10] 2021. Template-based named entity recognition using BART. arXiv preprint arXiv:2106.01760 (2021).
- [11] 2020. DAGA: Data augmentation with a generation approach for low-resource tagging tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6045–6057.
- [12] 2021. Few-NERD: A few-shot named entity recognition dataset. arXiv preprint arXiv:2105.07464 (2021).
- [13] 2017. Mongolian language morphology and its database structure. In 2017 International Conference on Green Informatics (ICGI). IEEE, 282–285.
- [14] 2018. Improving low resource named entity recognition using cross-lingual knowledge transfer. In IJCAI. 4071–4077.
- [15] 2003. Named entity recognition with long short-term memory. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 172–175.
- [16] 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
- [17] 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
- [18] 2020. Improving low-resource named entity recognition using joint sentence and token labeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5898–5905.
- [19] 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML'01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–289.
- [20] 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 260–270.
- [21] 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
- [22] 2020. Dice loss for data-imbalanced NLP tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 465–476.
- [23] 2018. Empower sequence labeling with task-aware neural language model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- [24] 2020. Hierarchical contextualized representation for named entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8441–8448.
- [25] 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1064–1074.
- [26] 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
- [27] 2006. Named entity recognition for question answering. (2006).
- [28] 2015. Named entity recognition for Mongolian language. In International Conference on Text, Speech, and Dialogue. Springer, 243–251.
- [29] 1974. Grammar of Written Mongolian. Otto Harrassowitz Verlag.
- [30] 1986. An introduction to hidden Markov models. IEEE ASSP Magazine 3, 1 (1986), 4–16.
- [31] 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (2014), 3104–3112.
- [32] 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000–6010.
- [33] 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.
- [34] 2016. Mongolian named entity recognition with bidirectional recurrent neural networks. In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 495–500.
- [35] 2019. Learning morpheme representation for Mongolian named entity recognition. Neural Processing Letters 50, 3 (2019), 2647–2664.
- [36] 2020. Automated concatenation of embeddings for structured prediction. arXiv preprint arXiv:2010.05006 (2020).
- [37] 2016. Recognition method of Mongolian person names based on conditional random fields. Application Research of Computers 33, 7 (2016), 2014–2017.
- [38] 2020. Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6505–6514.
- [39] 2018. Attention-based BLSTM-CRF architecture for Mongolian named entity recognition. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation.
- [40] 2020. FGN: Fusion glyph network for Chinese named entity recognition. arXiv preprint arXiv:2001.05272 (2020).
- [41] 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6442–6454.
- [42] 2019. TENER: Adapting Transformer encoder for named entity recognition. arXiv preprint arXiv:1911.04474 (2019).
- [43] 2016. Name tagging for low-resource incident languages based on expectation-driven learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 249–259.