Research Article
Open Access

Exploiting Morpheme and Cross-lingual Knowledge to Enhance Mongolian Named Entity Recognition

Published: 23 September 2022


Abstract

Mongolian named entity recognition (NER) is not only one of the most crucial and fundamental tasks in Mongolian natural language processing, but also an important step toward improving the performance of downstream tasks such as information retrieval, machine translation, and dialog systems. However, traditional Mongolian NER models rely heavily on feature engineering. Even worse, the complex morphological structure of Mongolian words makes the data sparser. To alleviate feature engineering and data sparsity in Mongolian named entity recognition, we propose a novel NER framework with Multi-Knowledge Enhancement (MKE-NER). Specifically, we introduce both linguistic knowledge through Mongolian morpheme representation and cross-lingual knowledge from a Mongolian-Chinese parallel corpus. Furthermore, we design two methods to exploit cross-lingual knowledge sufficiently, i.e., cross-lingual representation and cross-lingual annotation projection. Experimental results demonstrate the effectiveness of our MKE-NER model, which outperforms strong baselines and achieves the best performance (94.04% F1 score) on the traditional Mongolian benchmark. In particular, extensive experiments with different data scales highlight the superiority of our method in low-resource scenarios.


1 INTRODUCTION

Named entity recognition (NER) is one of the fundamental tasks in natural language processing (NLP), which detects names of specified categories, such as person, location, and organization. It plays a significant role in the overall task of information extraction and serves as a vital step for other downstream NLP applications, e.g., information retrieval, machine translation, and dialog systems. Recently, neural-based methods have achieved state-of-the-art performance in both Chinese and English NER tasks [22, 36, 40, 41], where NER is formulated as sequence labeling. Nevertheless, these methods rely on a large amount of annotated data, which is scarce and difficult to obtain, especially in low-resource scenarios [12, 43]. Therefore, several studies pay attention to low-resource scenarios. Zhang et al. [43] used a novel neural model for name tagging based solely on pseudo data. Cao et al. [7] proposed a new expectation-driven learning framework that requires very few resources. Cui et al. [10] proposed a template-based method for low-resource NER with the help of the generative pre-trained language model BART [21]. Nevertheless, studies on NER in low-resource languages still lag far behind the leading edge due to sparse corpora and the neglect of language-specific characteristics, e.g., Mongolian morphemes.

In this paper, we explore the NER task in Mongolian, a low-resource language mainly used in northern China, Mongolia, and Russia. There are two commonly used Mongolian scripts, i.e., traditional Mongolian in China, and Cyrillic Mongolian in Mongolia and Russia. The two share a similar phonological system but use different writing scripts. In this work, we mainly focus on traditional Mongolian NER, whose script and corresponding Latin letters are shown with an example in Figure 1. Note that traditional Mongolian is written from top to bottom, with columns proceeding left to right. To facilitate reading, we adopt the horizontal format shown in Figure 1.


Fig. 1. An example of Mongolian NER, whose meaning is “Wen Jiabao presided over the executive meeting of the State Council”.

First of all, the complex characteristics of Mongolian and its different encoding modes increase the difficulty of Mongolian NER research, which heavily restricts the further development of downstream tasks in Mongolian. To date, most work on Mongolian NER still uses traditional rule-based or statistical methods [4, 19, 30], which require expensive hand-crafted features and high-quality corpora. Moreover, these methods cannot handle out-of-vocabulary (OOV) words.

In addition, due to the scarcity of training data in Mongolian, the performance of most neural-based NER models degrades significantly because of deficient feature representation learning. Moreover, the performance of Mongolian NER has an important impact on specific downstream tasks such as knowledge graphs (KG) [1, 5, 33] and automatic question answering (QA) [6, 27]. Therefore, effectively alleviating the data scarcity of Mongolian NER and fully exploiting Mongolian-specific features has far-reaching significance for Mongolian NLP research.

Inspired by recent work [14], we consider enhancing a Mongolian NER system with both Mongolian linguistic and cross-lingual knowledge. We introduce linguistic knowledge by using Mongolian morphemes to capture more fine-grained representations. Moreover, we transfer cross-lingual knowledge from a high-resource language (Chinese) to a low-resource language (Mongolian) with bilingual parallel corpora. Furthermore, we design two methods to exploit cross-lingual knowledge sufficiently, i.e., cross-lingual representation incorporation and cross-lingual annotation projection. We denote our approach as Multi-Knowledge Enhanced (MKE) Mongolian NER.

The contributions of this paper can be summarized as follows:

(1)

We propose a novel framework, i.e., Multi-Knowledge Enhanced Named Entity Recognition (MKE-NER), which incorporates Mongolian linguistic knowledge with fine-grained morpheme embeddings.

(2)

MKE-NER also incorporates cross-lingual knowledge, i.e., effective representation and annotation projection from a high-resource language (Chinese) to a low-resource language (Mongolian), to alleviate data scarcity and feature engineering in the Mongolian NER task in a language-independent manner, which could be readily applied to other low-resource languages.

(3)

Experimental results demonstrate the effectiveness of multi-knowledge enhancement and our proposed framework outperforms the state-of-the-art method (i.e., TENER) on the Mongolian NER dataset.

(4)

Extensive experiments with different data scales illustrate that our method further surpasses the other comparison systems, and the scarcer the data, the more significant the improvement our method yields, which highlights the superiority of our method in low-resource scenarios.


2 BACKGROUND

2.1 Task Definition

Named entity recognition (NER) is one of the most fundamental sub-tasks in information extraction, aiming to identify named entities of predefined types, such as names of persons, locations, and organizations, in unstructured text. Formally, given a sentence of tokens \( S =\ \lt x_1, x_2,\ldots , x_N\gt \), NER outputs a list of tuples \( \lt I_s, I_e, t\gt \), each of which corresponds to a named entity in sentence S. \( I_s \) and \( I_e \) are the start and end indexes of a named entity. t is the entity type from the predefined category set, e.g., PER, LOC, and ORG for persons, locations, and organizations, respectively. For example, given the sentence “Peter will fly to California tomorrow to attend a conference held by ACM.”, NER systems need to recognize three entities, i.e., <Peter, PER>, <California, LOC>, and <ACM, ORG>.
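To make the output format concrete, the following sketch converts a per-token label sequence (here assumed to be in the common BIO scheme; the paper's own schemes are described in Section 4.2) into \( \lt I_s, I_e, t\gt \) tuples for the example sentence:

```python
def extract_entities(labels):
    """Convert a per-token BIO label sequence into <I_s, I_e, t> tuples
    (start index, end index, entity type), as in the task definition."""
    entities, start, etype = [], None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:           # close any open entity
                entities.append((start, i - 1, etype))
            start, etype = i, lab[2:]
        elif lab.startswith("I-") and start is not None and lab[2:] == etype:
            continue                        # entity continues
        else:
            if start is not None:
                entities.append((start, i - 1, etype))
            start, etype = None, None
    if start is not None:                   # entity at sentence end
        entities.append((start, len(labels) - 1, etype))
    return entities

labels = ["B-PER", "O", "O", "O", "B-LOC", "O", "O",
          "O", "O", "O", "O", "O", "B-ORG", "O"]
print(extract_entities(labels))  # [(0, 0, 'PER'), (4, 4, 'LOC'), (12, 12, 'ORG')]
```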

NER is usually modeled as a sequence labeling task, and therefore many solutions use conditional random fields (CRFs) as their backbones. Neural networks, e.g., Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, serve as feature extractors in NER models. Recently, many researchers have tried to exploit the power of pre-trained language models like BERT to further improve NER performance, yielding state-of-the-art results. However, maintaining the effectiveness of NER models in low-resource scenarios is still an intractable problem.

2.2 Our Framework

To transfer the strong performance of the Transformer [32] from other NLP tasks to NER, our framework, named Multi-Knowledge Enhanced Named Entity Recognition (MKE-NER), is based on TENER [42] instead of a traditional bidirectional LSTM. As shown in Figure 2, there are three components in MKE-NER: (a) Embedding Layer, which fuses not only Mongolian but also cross-lingual knowledge representations, (b) Adaptive Encoding Layer, which modifies the Transformer encoder to model the long-range and complicated interactions of a sentence for NER, and (c) CRF Layer, from which we decode the final predicted tag sequence for the input sentence. We elaborate on each part of the model as follows.


Fig. 2. The overview of MKE-NER, which contains three preliminary components: embedding layer, adaptive encoding layer, and CRF layer.

Embedding Layer. For each token \( x_i \) of the input sentence, its embedding \( {\bf e}_i \) is composed of the character representation \( {\bf e}_i^C \), our proposed Mongolian morpheme representation \( {\bf e}_i^M \), and the cross-lingual knowledge representation \( {\bf e}_i^{XL} \), defined as follows: (1) \( \begin{equation} {\bf e}_i = \left[{\bf e}_i^C;{\bf e}_i^M;{\bf e}_i^{XL}\right]\!, \end{equation} \) where [;] denotes the concatenation operation. The character representation \( {\bf e}_i^C \) is the concatenation of character features extracted by a Transformer-based character encoder. The Mongolian morpheme representation \( {\bf e}_i^M \) is a fine-grained segmentation representation between the character level and the word level.
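Eq. (1) is a plain concatenation of the three sub-embeddings, which can be sketched as follows; the dimensions used here are illustrative, not values from the paper:

```python
import numpy as np

# Toy dimensions for the three sub-embeddings (assumed for illustration).
D_CHAR, D_MORPH, D_XL = 30, 50, 20

def token_embedding(e_char, e_morph, e_xl):
    """Eq. (1): e_i = [e_i^C ; e_i^M ; e_i^XL] (concatenation along the
    feature dimension)."""
    return np.concatenate([e_char, e_morph, e_xl])

e_i = token_embedding(np.zeros(D_CHAR), np.ones(D_MORPH), np.zeros(D_XL))
print(e_i.shape)  # (100,)
```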

The cross-lingual knowledge representation \( {\bf e}_i^{XL} \) is learned from a Mongolian-Chinese parallel corpus, which alleviates the insufficiency of Mongolian representations caused by data scarcity. Exploiting the Mongolian-Chinese parallel corpus, we fuse cross-lingual knowledge into our framework by two methods: representation projection enhancing and annotation projection enhancing. For representation enhancing, we extract a bilingual dictionary by aligning the parallel corpus and enhance the Mongolian word representations with Chinese word representations selected by this dictionary. For annotation enhancing, we focus on NER data augmentation, which uses existing NER tools to generate pseudo labels on the Chinese side of the parallel corpus and projects the annotations onto the Mongolian sentences through alignment. More details of the Mongolian morpheme representation and cross-lingual knowledge representation are introduced in Section 3.

Adaptive Encoding Layer. The adaptive encoding layer takes token embeddings E as input to generate contextual word representations. Following [42], this component adapts Transformer for NER task with two improvements.

Direction- and Distance-Aware Attention. The original Transformer lacks directionality despite its positional encoding: it cannot distinguish whether contextual information comes from the preceding or the following text, and even if the input vectors carry direction information, it is destroyed by the attention computation. Direction information is particularly important for named entity recognition. Therefore, TENER calculates the attention score as follows: (2) \( \begin{equation} Q,K,V=H W_q,H W_k,H W_v, \end{equation} \) (3) \( \begin{equation} R_{t-j}=\left[\cdots \sin \left(\frac{t-j}{10000^{\frac{2i}{d_k}}}\right)\cos \left(\frac{t-j}{10000^{\frac{2i}{d_k}}}\right)\cdots \right]^T, \end{equation} \) (4) \( \begin{equation} A^{rel}_{t,j}=Q_t K_j^T + Q_t R_{t-j}^T + u K_j^T + v R_{t-j}^T, \end{equation} \) where t is the index of the target token, j is the index of the context token, \( Q_t \) and \( K_j \) are the query vector of the t-th word and the key vector of the j-th word, respectively, \( R_{t-j} \) is the sinusoidal relative positional encoding, and u and v are learnable parameters.
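The four terms of Eq. (4) can be sketched directly. In this illustration the sin and cos halves of \( R_{t-j} \) are concatenated rather than interleaved, an implementation choice rather than necessarily the paper's exact layout, and the loop form is for clarity, not efficiency:

```python
import numpy as np

def rel_pos_encoding(dist, d_k):
    """Sinusoidal relative positional encoding R_{t-j} of Eq. (3).
    Note: t - j is signed, so sin distinguishes direction."""
    i = np.arange(d_k // 2)
    angle = dist / (10000.0 ** (2 * i / d_k))
    return np.concatenate([np.sin(angle), np.cos(angle)])

def relative_attention_scores(H, W_q, W_k, u, v):
    """Un-normalized attention scores of Eq. (4):
    A^rel_{t,j} = Q_t K_j^T + Q_t R_{t-j}^T + u K_j^T + v R_{t-j}^T."""
    T, d_k = H.shape[0], W_q.shape[1]
    Q, K = H @ W_q, H @ W_k
    A = np.zeros((T, T))
    for t in range(T):
        for j in range(T):
            r = rel_pos_encoding(t - j, d_k)
            A[t, j] = Q[t] @ K[j] + Q[t] @ r + u @ K[j] + v @ r
    return A
```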

Un-scaled Dot-Product Attention. The original Transformer uses scaled dot-product attention to make the output of the softmax function smoother. Since named entities occupy only a small fraction of a sentence, the Transformer without scaled dot-product attention produces a sharper attention distribution and can concentrate more on entities and their surrounding information.

CRF Layer. A CRF layer is utilized to obtain sentence-level tag information following the adaptive encoding layer, with which we can efficiently use past and future tags to predict the current tag. The CRF layer contains two key modules: the state emission matrix and the state transition matrix. For the emission matrix, we consider the output of the adaptive encoding layer as the matrix of scores \( f_\theta ([x]_1^T) \). Note that we drop the input \( [x]_1^T \) for notational simplicity. The element \( [f_\theta ]_i^t \) of the matrix is the score output by the adaptive encoding layer with parameters \( \theta \) for the i-th tag at the t-th word of the sentence \( [x]_1^T \). The transition score \( [A]_{i,j} \) models the transition from the i-th state to the j-th state between a pair of consecutive steps. Note that this transition matrix is position independent. The score of a sentence \( [x]_1^T \) along with a path of tags \( [i]_1^T \) is then given by the sum of transition scores and network scores: (5) \( \begin{equation} s\left([x]_1^T, [i]_1^T, \tilde{\theta }\right) = \sum _{t=1}^T\left([A]_{[i]_{t-1},[i]_t}+[f_\theta ]_{[i]_t,t}\right). \end{equation} \) At inference, we use dynamic programming (the Viterbi algorithm) to find the optimal tag sequence.
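The path score of Eq. (5) can be sketched as below. The designated start state for \( i_0 \) is an assumption, since the paper leaves \( i_0 \) implicit:

```python
def path_score(f, A, tags, start=0):
    """Eq. (5): s = sum_t (A[i_{t-1}, i_t] + f[i_t, t]).
    f: emission scores, shape (num_tags, T); f[i][t] scores tag i at step t.
    A: transition scores, shape (num_tags, num_tags), position independent.
    tags: the tag path [i]_1^T; `start` stands in for i_0 (an assumption)."""
    score, prev = 0.0, start
    for t, tag in enumerate(tags):
        score += A[prev][tag] + f[tag][t]
        prev = tag
    return score

f = [[1.0, 2.0],   # emission scores for tag 0 at steps 0, 1
     [3.0, 4.0]]   # emission scores for tag 1 at steps 0, 1
A = [[0.0, 0.5],
     [0.5, 0.0]]
print(path_score(f, A, [0, 1]))  # 1.0 + (0.5 + 4.0) = 5.5
```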


3 OUR APPROACHES

Although the aforementioned model improves the performance of the Transformer on NER tasks, it is still difficult to obtain satisfying performance with scarce data. Therefore, based on the above architecture, we propose three extension components, which incorporate knowledge from three aspects: (1) linguistic knowledge, based on which a fine-grained Mongolian morphological segmentation is proposed to capture more semantic information; (2) cross-lingual representation knowledge, which improves low-resource (Mongolian) embeddings via entity-specific knowledge transfer from a high-resource language (e.g., Chinese); and (3) cross-lingual annotation knowledge, where labeled NER data is automatically created by annotation projection from parallel data. (2) and (3) are two kinds of projection-based knowledge, both obtained from bilingual data. In this section, we elaborate the three knowledge-enhancing components in order.

3.1 Linguistic Knowledge (Morpheme Embedding)

Mongolian words are usually composed of roots, stems, and various affixes. The root is also regarded as the first stem. The stem of a Mongolian word, namely the morpheme, contains a root and several following affixes. Besides the stem, Mongolian words usually contain one or more ending suffixes that have no practical meaning. These meaningless suffixes significantly increase the morphological diversity of Mongolian words, thereby enlarging the vocabulary and making the data sparser.

Consequently, based on the aforementioned Mongolian word-construction characteristics, we take more fine-grained morpheme embedding in place of traditional word embedding with a new Mongolian word segmentation method. Specifically, we segment a Mongolian word into two parts, i.e., its stem and its affixes including contiguous affixes and split affixes. For the sake of clarity, we illustrate it with a Mongolian word “” (its meaning is “on and on”), which can be segmented into the stem “”, the contiguous affix “ ”, and the split affix “”. Compared with word embedding, the morpheme embedding can distinguish the semantic stems from the meaningless affixes, so that the model can learn to eliminate the interference of noises.

Meanwhile, to further exploit the morpheme embedding in our NER model and match the aforementioned morphological forms in Mongolian, we also design a novel label scheme “BMESO\( \text{B}_1\text{B}_2 \)” for the Mongolian NER task. In the traditional “BMESO” label scheme based on word segmentation, “B”, “M”, and “E” denote the first, middle, and last words of an entity, respectively, “S” denotes a single word as an entity, and “O” denotes words not contained in any entity. Since various Mongolian affixes would intensify the data scarcity problem, compared with “BMESO” we tag the two kinds of functional, meaningless affixes in Mongolian words (i.e., contiguous affixes and split affixes) with two extra labels, \( B_1 \) and \( B_2 \), respectively. As shown in Table 1, there is an example for the location named entity “”, meaning “Ulaanbaatar City”. Note that if contiguous and split affixes appear in a word simultaneously, only the first affix is labeled as “\( \text{B}_1 \)” or “\( \text{B}_2 \)” and the others are labeled “O”. For example, “” is the first word of the location entity with label “B-LOC”. For the second word “”, “” is the stem with label “E-LOC”, “” is the first contiguous suffix with label “\( \text{B}_1 \)”, and “” is the split suffix with label “O”.


Table 1. Examples of Annotation Schemes based on Word and Morpheme Segmentation

In summary, we split Mongolian words into stems and affixes so as to obtain more fine-grained information at both the embedding level and the label level, which helps the model capture more accurate representations.
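The labeling rule of Section 3.1 can be sketched as follows. The string representation of affix kinds (`"contiguous"` / `"split"`) is our own device for illustration, not the paper's data format:

```python
def morpheme_labels(stem_label, affix_kinds):
    """Assign labels under the BMESO-B1B2 scheme (a sketch of the rule in
    Section 3.1): the stem keeps its entity label; only the FIRST affix is
    tagged 'B1' (contiguous) or 'B2' (split); all later affixes get 'O'."""
    labels = [stem_label]
    for n, kind in enumerate(affix_kinds):
        if n == 0:
            labels.append("B1" if kind == "contiguous" else "B2")
        else:
            labels.append("O")  # subsequent affixes are ignored
    return labels

# The Table 1 example: a word whose stem is labeled E-LOC, followed by a
# contiguous suffix and a split suffix.
print(morpheme_labels("E-LOC", ["contiguous", "split"]))  # ['E-LOC', 'B1', 'O']
```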

3.2 Cross-lingual Representation Knowledge

We incorporate cross-lingual knowledge into our framework from a high-resource language (Chinese) to a low-resource language (Mongolian) by two methods: representation projection enhancing and annotation projection enhancing.

Representation Projection Enhancing. As shown in Figure 3, to enhance Mongolian word representations, we not only consider Mongolian “character embedding” and “morpheme embedding”, but also incorporate a special embedding with cross-lingual knowledge, called the Mongolian-Chinese bilingual lexicon representation. Specifically, we use Chinese words to capture richer semantic information about Mongolian words based on the alignment information in the Mongolian-Chinese parallel corpus.


Fig. 3. Cross-lingual representation.

To model this cross-lingual information, we first construct a Mongolian-Chinese dictionary from parallel sentence pairs and design a strategy for dictionary extension. Then, we propose two methods to learn bilingual lexicon representations based on the Mongolian-Chinese bilingual dictionary: (1) through a bidirectional LSTM network [20]; and (2) through an attention mechanism [2]. We elaborate this process as follows.

Mongolian-Chinese Dictionary. To obtain the correspondence between Mongolian and Chinese, we utilize the unlabeled bilingual parallel corpus to build the Mongolian-Chinese bilingual dictionary through bilingual word alignment. Bilingual word alignment is an important basic task in statistical machine translation (SMT), and we use the FastAlign toolkit to conduct word alignment on the bilingual parallel data.

As shown in Figure 4, given a Mongolian sentence \( M=\lbrace m_1,m_2,\ldots ,m_i,\ldots , m_n\rbrace \) with n Mongolian words, and a Chinese word sequence \( C=\lbrace c_1,c_2,\ldots ,c_j,\ldots ,c_l\rbrace \) with l Chinese words, each Mongolian word \( m_i \) may correspond to multiple Chinese words \( c_j \).


Fig. 4. An example of word alignment between Mongolian and Chinese.

Although the word alignment performs well, we further conduct data cleaning and manual corrections to reduce the noise introduced by word alignment.
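A dictionary can be collected from FastAlign's Pharaoh-format output, in which `i-j` means source token i aligns to target token j. The frequency filter `min_count` below is our own assumption standing in for the data cleaning step, not a value from the paper:

```python
from collections import Counter, defaultdict

def build_dictionary(src_sents, tgt_sents, alignments, min_count=2):
    """Collect translation pairs from Pharaoh-format alignments, keeping
    the most frequent target word for each source word and discarding
    pairs seen fewer than min_count times (a crude noise filter)."""
    counts = defaultdict(Counter)
    for src, tgt, align in zip(src_sents, tgt_sents, alignments):
        for pair in align.split():
            i, j = map(int, pair.split("-"))
            counts[src[i]][tgt[j]] += 1
    return {w: c.most_common(1)[0][0]
            for w, c in counts.items()
            if c.most_common(1)[0][1] >= min_count}

src = [["a", "b"], ["a", "c"]]
tgt = [["x", "y"], ["x", "z"]]
aligns = ["0-0 1-1", "0-0 1-1"]
print(build_dictionary(src, tgt, aligns))  # {'a': 'x'}
```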

Dictionary Extension Strategy. For words outside the Mongolian-Chinese dictionary, we propose a dictionary extension strategy. Let the set of in-dictionary Mongolian words be \( W=\lbrace w_1,\ldots ,w_i,\ldots ,w_f\rbrace \), with corresponding bilingual lexicon representations \( vec_i \). We learn a mapping from Mongolian word embeddings to bilingual lexicon representations: (6) \( \begin{equation} vec_i=Mw_i, \end{equation} \) where M is the mapping matrix, optimized with the following loss: (7) \( \begin{equation} loss_M=\sum _{i=1}^f||vec_i-Mw_i||_2. \end{equation} \)

Finally, the bilingual lexicon representation of each OOV Mongolian word \( o_i \) is computed as: (8) \( \begin{equation} veo_i=Mo_i. \end{equation} \)
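One way to realize Eqs. (6)–(8) is to fit M by least squares over the in-dictionary pairs and then apply it to OOV embeddings; the least-squares solver is our choice, since the paper only states the objective:

```python
import numpy as np

def fit_mapping(W, V):
    """Learn the mapping matrix M of Eqs. (6)-(7): minimize
    sum_i ||vec_i - M w_i||. Rows of W are Mongolian word embeddings w_i;
    rows of V are their bilingual lexicon representations vec_i."""
    # W @ M.T ~= V, so M.T is the least-squares solution of W x = V.
    Mt, *_ = np.linalg.lstsq(W, V, rcond=None)
    return Mt.T

def extend(M, o):
    """Eq. (8): bilingual representation veo_i = M o_i for an OOV word."""
    return M @ o

W = np.eye(2)                       # toy in-dictionary embeddings
V = np.array([[2.0, 0.0], [0.0, 3.0]])
M = fit_mapping(W, V)
print(extend(M, np.array([1.0, 1.0])))  # [2. 3.]
```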

Cross-lingual Representation with LSTM. First, we use the Stanford CoreNLP toolkit to obtain part-of-speech (POS) tags for Chinese words and denote the POS-tag embeddings as P, which are initialized randomly and optimized during model training. We take the concatenation \( [c_j,p_j] \) of the Chinese word embedding and POS-tag embedding as input and use a bidirectional LSTM network to extract textual features. Finally, we concatenate the outputs of the bidirectional LSTM at the last time step as the bilingual lexicon representation.

Cross-lingual Representation with Attention. Since one Mongolian word \( m_i \) corresponds to multiple Chinese words \( c_j \) and has a different degree of correlation with each of them, it is natural to use an attention mechanism. The output vector of the attention mechanism is denoted as \( vec_i \) and computed as follows: (9) \( \begin{equation} vec_i=\sum _{j=1}^l\alpha _j c_j, \end{equation} \) (10) \( \begin{equation} \sum _{j}\alpha _j=1. \end{equation} \)
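Eqs. (9)–(10) only constrain the weights \( \alpha_j \) to sum to one; the paper does not spell out how they are computed. The sketch below assumes a softmax over dot products between the Mongolian embedding and each aligned Chinese embedding, which is one common choice:

```python
import numpy as np

def cross_lingual_rep(m, C):
    """Eq. (9): vec_i = sum_j alpha_j c_j, with sum_j alpha_j = 1 (Eq. 10).
    m: Mongolian word embedding, shape (d,).
    C: aligned Chinese word embeddings, shape (l, d).
    alpha_j = softmax(c_j . m) is an ASSUMED scoring function."""
    scores = C @ m
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ C

# If both aligned Chinese words have the same embedding, the weighted sum
# recovers that embedding regardless of alpha.
C = np.array([[1.0, 0.0], [1.0, 0.0]])
print(cross_lingual_rep(np.array([1.0, 0.0]), C))  # [1. 0.]
```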

3.3 Cross-lingual Annotation Knowledge

To make full use of cross-lingual knowledge, we further conduct annotation projection to synthesize extra data based on a parallel corpus to pretrain the model. Figure 5 describes the overall framework of our proposed annotation projection and its training process. In particular, we denote our parallel corpus as \( (C,M)^N \), which contains N Chinese-Mongolian sentence pairs. We use the FastAlign toolkit for word alignment of the i-th sentence pair \( (C^{(i)},M^{(i)}) \), so that the labels on \( C^{(i)} \) can be projected to \( M^{(i)} \), \( 1\le i \le N \). We refer to the Mongolian sentences with projected labels as projection-labeled Mongolian data in the following sections.


Fig. 5. Overall framework and training process of annotation projection.

Given a Chinese sentence \( C^{(i)}=\lbrace C^{(i)}_1,C^{(i)}_2,\ldots ,C^{(i)}_n\rbrace \) with n words and its Mongolian translation \( M^{(i)}=\lbrace M^{(i)}_1,M^{(i)}_2,\ldots ,M^{(i)}_l\rbrace \) with l words, we first use the LTP NER tools on \( C^{(i)} \) to obtain the label sequence \( T^{C^{(i)}}=\lbrace T^{C^{(i)}}_1,T^{C^{(i)}}_2,\ldots ,T^{C^{(i)}}_n\rbrace \) on the Chinese sentence.

If the Mongolian word \( M^{(i)}_j \) is aligned to the Chinese word \( C^{(i)}_k \) with label \( T^{C^{(i)}}_k \), the pseudo label of \( M^{(i)}_j \) is defined as \( T^{M^{(i)}}_j = T^{C^{(i)}}_k \). Through this process, we obtain the corresponding label sequence of the Mongolian sentence, \( T^{M^{(i)}}=\lbrace T^{M^{(i)}}_1,T^{M^{(i)}}_2,\ldots ,T^{M^{(i)}}_l\rbrace \).
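The projection step can be sketched as below. The dictionary form of the alignment and the default `"O"` for unaligned Mongolian words are our own assumptions for illustration:

```python
def project_labels(zh_labels, alignment, mn_len):
    """Project Chinese NER labels onto the Mongolian side (Section 3.3):
    if Mongolian word j aligns to Chinese word k, then T^M_j = T^C_k.
    `alignment` maps Mongolian indices to Chinese indices; unaligned
    Mongolian words default to 'O' (an assumption)."""
    return [zh_labels[alignment[j]] if j in alignment else "O"
            for j in range(mn_len)]

zh_labels = ["B-PER", "O", "B-LOC"]
alignment = {0: 2, 1: 0, 3: 1}   # Mongolian index -> Chinese index
print(project_labels(zh_labels, alignment, 4))  # ['B-LOC', 'B-PER', 'O', 'O']
```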

To measure the annotation quality of the projection-labeled data, we design a heuristic, language-independent metric to select a fixed amount of higher-quality data based on statistics. Specifically, we count the frequency of tags for each word in the dataset and assume that the more frequently a tag occurs with a word, the more likely it is the correct tag for that word.

For a Mongolian sentence \( M=\lbrace m_1,m_2,\ldots ,m_k,\ldots , m_n\rbrace \) which contains n words, its corresponding NER label sequence is \( T=\lbrace t_1,t_2,\ldots ,t_k,\ldots , t_n\rbrace \), and we calculate the score of each word as follows: (11) \( \begin{equation} score(m_k,t_k)=\frac{Count(m_k,t_k)}{\sum _i Count(m_k,t_i)}, \end{equation} \) where \( t_i \) ranges over the labels observed for word \( m_k \) in the dataset. The score of sentence M with its label sequence T is then calculated by averaging the scores of all words in M: (12) \( \begin{equation} score(M,T)=\frac{1}{n}\sum _{k=1}^{n}score(m_k,t_k). \end{equation} \)

To minimize bias in the projection-labeled data, we compute \( Count(\cdot) \) only on the gold dataset, unless the word appears exclusively in the projection-labeled data. In addition to the sentence score, we also apply a minimum sentence length threshold to preserve sentences with more sequential features.
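Eqs. (11)–(12) can be sketched as follows; the `(word, tag)`-keyed `Counter` is our own representation of the frequency statistics:

```python
from collections import Counter

def sentence_score(words, tags, count):
    """Eqs. (11)-(12): score(m_k, t_k) = Count(m_k, t_k) / sum_i Count(m_k, t_i),
    averaged over the sentence. `count` maps (word, tag) pairs to frequencies
    collected from the gold data, as described in the text."""
    total = 0.0
    for w, t in zip(words, tags):
        denom = sum(c for (w2, _), c in count.items() if w2 == w)
        total += count[(w, t)] / denom if denom else 0.0
    return total / len(words)

# Toy statistics: "alpha" is tagged B-ORG 3 times and O once in the gold
# data, "beta" is tagged E-LOC twice.
count = Counter({("alpha", "B-ORG"): 3, ("alpha", "O"): 1, ("beta", "E-LOC"): 2})
print(sentence_score(["alpha", "beta"], ["B-ORG", "E-LOC"], count))  # 0.875
```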


4 EXPERIMENTAL SETTINGS

In this section, we first introduce the Mongolian NER dataset. Then we describe implementation details and comparison methods in our experiments.

4.1 Datasets & Experimental Setup

Mongolian NER dataset. Due to the scarcity of annotated Mongolian corpora, there is no open resource of traditional Mongolian data. Our labeled data comes from the Latin-Mongolian corpus provided by Dr. Wu Jinxing of Inner Mongolia University [37]. Since our study focuses on traditional Mongolian, we automatically transform the Latin-Mongolian data into traditional Mongolian format using a Latin-Mongolian character correspondence table. After further manual correction, we obtain labeled data with 13,755 traditional Mongolian sentences and 446,153 Mongolian words. Table 2 shows the statistics of the train, development, and test splits of the Mongolian NER dataset.

Data     Sentence Num    Token Num
Train    12,000          388,988
Dev      875             28,185
Test     880             28,970
Total    13,755          446,153

Table 2. Details of Mongolian NER Datasets

Cross-Lingual Unlabeled Data. To make full use of cross-lingual knowledge, we exploit Mongolian-Chinese parallel corpora in three forms, as follows. (1) Mongolian-Chinese dictionary. We use the cross-lingual unlabeled data to construct the Mongolian-Chinese dictionary introduced in Section 3.2, obtaining 50,000 word pairs. We also utilize the unlabeled data to train a pre-trained model for our comparison experiments. Our unlabeled Mongolian data comes from CCMT2019 (The 15th China Conference on Machine Translation). Table 3 summarizes the unlabeled datasets.

Language     Sentence Num    Vocabulary Size
Mongolian    254,722         150,635
Chinese      254,722         106,556

Table 3. Details of Unlabeled Datasets

(2) Projection-labeled data. As illustrated in Section 3.3, we construct a projection-labeled dataset with label projection. In particular, we first conduct NER with the LTP toolkit on every sentence in the Chinese corpus. Then, we use the FastAlign toolkit to perform word alignment between Chinese and Mongolian words so that we can project the labels on Chinese words onto the corresponding Mongolian words. After that, we obtain a coarse version of this first kind of pseudo data, and we then complement missing entities with an entity lexicon to improve the recall of entity labels.

(3) Pseudo data. For comparison, we also construct another weakly labeled dataset, called pseudo data, by annotating unlabeled data with a pre-trained model. We first train the aforementioned NER model on the gold data and then use it to annotate the unlabeled Mongolian sentences in the parallel corpus. As with the projection-labeled data introduced above, we modify the labels of entities predicted as “O” but present in the entity lexicon to their corresponding labels, and we perform data selection for higher quality.

4.2 Morpheme Labeling Settings

We adopt four morpheme labeling schemes in total in our experiments to generate more general and convincing results.

  • BASE means the vanilla method for NER annotation which only considers a set of four labels {“O”, “PER”, “LOC”, “ORG”};

  • BIO distinguishes the beginning of an entity span from the rest of the span, building on the previous scheme;

  • BMESO further takes the end of an entity span into account;

  • BMESO\( \text{B}_1\text{B}_2 \) is a refinement of “BMESO” for morpheme settings which denotes the first contiguous suffix in an entity word as “\( \text{B}_1 \)” and the first split suffix as “\( \text{B}_2 \)”.
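The first three schemes can be contrasted on a single entity span; this sketch is our own illustration of the standard scheme definitions, not code from the paper:

```python
def span_labels(length, etype, scheme):
    """Labels for one entity span of `length` words under the schemes of
    Section 4.2 (BASE, BIO, BMESO)."""
    if scheme == "BASE":
        return [etype] * length
    if scheme == "BIO":
        return [f"B-{etype}"] + [f"I-{etype}"] * (length - 1)
    if scheme == "BMESO":
        if length == 1:
            return [f"S-{etype}"]
        return [f"B-{etype}"] + [f"M-{etype}"] * (length - 2) + [f"E-{etype}"]
    raise ValueError(scheme)

print(span_labels(3, "LOC", "BASE"))   # ['LOC', 'LOC', 'LOC']
print(span_labels(3, "LOC", "BIO"))    # ['B-LOC', 'I-LOC', 'I-LOC']
print(span_labels(3, "LOC", "BMESO"))  # ['B-LOC', 'M-LOC', 'E-LOC']
```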

4.3 Implementation and Hyper-parameters

We use Adam as the optimizer. The batch size is set to 32, the number of attention heads to 12, the learning rate to 0.0001, the dimension of character embeddings to 30, and dropout to 0.02. To train embeddings with cross-lingual knowledge, we use the same settings: batch size 32, 12 attention heads, dropout 0.02, learning rate 0.0001, and a word embedding dimension of 30.

4.4 Systems

We compare our method with other relevant methods as follows:

(1)

CRF: the CRF++ toolkit, an open-source implementation of CRFs for segmenting/labeling sequential data;

(2)

Bi-LSTM+CRF [17]: it contains a bidirectional LSTM to extract text features and a CRF layer to obtain the label sequence with the highest probability, and is one of the most widely used systems in named entity recognition;

(3)

Bi-LSTM-CNN+CRF [25]: compared with Bi-LSTM+CRF, it inserts several Convolutional Neural Network (CNN) layers between the Bi-LSTM and CRF layers to further extract high-level semantic features;

(4)

TENER: Our baseline model, which modifies the Transformer to model the long-range and complicated interactions of sentence for NER.

MKE-NER (ours): Our MKE-NER is a training framework for low-resource NER, which uses three kinds of knowledge to enhance NER models. The framework is flexible and can be applied to the above baseline models.


5 RESULTS AND ANALYSIS

We conduct experiments on the traditional Mongolian NER dataset to compare our MKE-NER framework with the comparison systems introduced in Section 4.4. We apply our approach to TENER, the strongest of the aforementioned baselines, to validate its effectiveness.

5.1 Overall Results

Table 4 shows the overall results of the comparison systems and our MKE-NER on the test set. The results demonstrate that our method significantly outperforms all comparison systems and achieves an F1 score of 94.04%. It is worth noting that all comparison systems achieve high precision but relatively low recall on the test set. We attribute this phenomenon to the nature of NER tasks, where inherent label imbalance problems exist. Nevertheless, our method remarkably improves recall while maintaining high precision, which demonstrates that our approach is superior to the baseline approaches for relatively low-resource Mongolian NER.

Model               Precision    Recall    F1
CRF                 94.54%       76.21%    84.39%
Bi-LSTM+CRF         94.16%       84.50%    89.07%
Bi-LSTM-CNN+CRF     95.31%       84.47%    89.56%
TENER               94.69%       88.25%    91.36%
TENER+MKE-NER       95.64%       92.49%    94.04%

Table 4. Overall Performances of Baselines and our Method on Mongolian NER Dataset

Although NER models based on prevalent pre-trained language models like BERT achieve strong performance in many languages, we do not adopt any pre-trained language model in our framework for two reasons: (a) there is no off-the-shelf pre-trained language model for traditional Mongolian, and (b) when pre-training a language model for traditional Mongolian from scratch, it is too complicated to incorporate cross-lingual knowledge with sub-words generated by BPE, since segmenting a Mongolian word by the BPE algorithm and by Mongolian morphemes yields different sub-words, and the alignment between the two kinds of sub-words is particularly complex.

5.2 Ablation Study

To verify the effectiveness of each kind of knowledge we exploit in MKE-NER, we conduct a series of ablation experiments. Table 5 shows that each kind of knowledge benefits the baseline and that their contributions are additive. Besides, we also conduct exhaustive experiments and analysis on the effects of each kind of knowledge.

Model                                           F1
MKE-NER                                         94.04%
w/o Linguistic Knowledge                        92.26% (-1.78%)
w/o Cross-lingual Representation Knowledge      93.43% (-0.61%)
w/o Cross-lingual Annotation Knowledge          93.78% (-0.26%)
w/o all three                                   91.36% (-2.68%)

Table 5. Ablation Study

Effects of Mongolian Morpheme Knowledge. Firstly, to explore the effect of Mongolian Morpheme embedding, we compare different morpheme embeddings with different labeling schemes. The experimental results are shown in Figure 6.


Fig. 6. Comparison of three types of morpheme labeling schemes on different systems.

Clearly, whether in the CRF, Bi-LSTM+CRF, or TENER model, using morpheme embeddings improves performance. We consider that morpheme embeddings have two main advantages over word embeddings. On the one hand, splitting words into morphemes notably reduces the vocabulary size, which mitigates data sparsity and benefits model training. On the other hand, morpheme embeddings can be regarded as a kind of linguistic knowledge that indicates the lexical structure of Mongolian words. In addition, the proposed label scheme BMESO-\( \text{B}_1\text{B}_2 \) introduces two additional labels for Mongolian suffixes, which further improves the performance of our model on the Mongolian dataset.
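The exact tag-assignment rules of the scheme are not spelled out in this section; as a rough sketch under our own reading (stems receive BMES-style entity tags, while the two extra labels \( \text{B}_1 \)/\( \text{B}_2 \) mark the first and subsequent suffixes — an assumption, not a confirmed specification), converting one morpheme-segmented word into tags could look like:

```python
def label_morphemes(stems, suffixes, entity_type):
    """Tag one morpheme-segmented Mongolian word (illustrative sketch).

    Stems get BMES-style entity tags; suffixes get the two extra labels
    B1 (first suffix) and B2 (subsequent suffixes). The tag names and the
    suffix convention here are assumptions made for illustration.
    """
    if entity_type == "O":
        # Non-entity words: every morpheme is tagged outside.
        return ["O"] * (len(stems) + len(suffixes))
    if len(stems) == 1:
        tags = [f"S-{entity_type}"]          # single-stem word
    else:
        tags = [f"B-{entity_type}"]          # first stem morpheme
        tags += [f"M-{entity_type}"] * (len(stems) - 2)  # middle morphemes
        tags.append(f"E-{entity_type}")      # last stem morpheme
    # Suffix morphemes receive the two scheme-specific labels.
    tags += ["B1" if i == 0 else "B2" for i in range(len(suffixes))]
    return tags
```

The hypothetical `label_morphemes` helper makes explicit why the extra suffix labels help: suffixes are kept out of the entity span instead of being forced into B/M/E/S positions.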

Effects of Cross-Lingual Representation Knowledge. As for cross-lingual representation knowledge, Table 6 shows the effects of the two methods for cross-lingual representation and of the dictionary extension proposed in Section 3.2. For a more detailed comparison, we depict the benefits of cross-lingual knowledge under all morpheme labeling schemes in Figure 7.

Fig. 7. Comparison of different cross-lingual representation knowledge based on different morpheme label schemes. “CR” denotes cross-lingual representation knowledge.

Table 6. Comparison of cross-lingual representation knowledge effects.

| Model | F1 |
| --- | --- |
| \( \triangle \) (only with Morpheme Embedding) | 93.29% |
| \( \triangle \) + Cross-Lingual Representation (Attention), w/ dictionary extension | 93.96% (+0.67%) |
| \( \triangle \) + Cross-Lingual Representation (Attention), w/o dictionary extension | 93.77% (+0.48%) |
| \( \triangle \) + Cross-Lingual Representation (LSTM), w/ dictionary extension | 93.59% (+0.30%) |
| \( \triangle \) + Cross-Lingual Representation (LSTM), w/o dictionary extension | 93.17% (−0.12%) |

“\( \triangle \)” denotes our baseline with morpheme embedding only.

We find that our two methods for cross-lingual representation knowledge surpass the original model by 0.67% and 0.30% in F1 score, respectively. Note that without dictionary extension, cross-lingual representation brings only marginal improvement or even hurts performance. The reason is that most Mongolian words are absent from the bilingual dictionary, and adding a meaningless padding embedding to their representations may introduce noise, while dictionary extension provides meaningful cross-lingual representations through a transformation between semantic spaces. These results demonstrate that embeddings enriched with Mongolian-Chinese cross-lingual knowledge play an important part and strengthen the semantic representation of Mongolian word embeddings.
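The dictionary-extension procedure itself is described in Section 3.2 rather than here; one common way to realize such a “transformation between semantic spaces” is to fit a linear map from the Mongolian embedding space to the Chinese one on the seed bilingual dictionary, then look up the nearest Chinese embedding for each out-of-dictionary Mongolian word. A minimal sketch under that assumption (function names are illustrative, not from the paper):

```python
import numpy as np

def learn_mapping(src_vecs, tgt_vecs):
    """Fit a least-squares linear map W such that src_vecs @ W ≈ tgt_vecs.

    Rows of src_vecs/tgt_vecs are the embeddings of Mongolian/Chinese
    word pairs taken from the seed bilingual dictionary.
    """
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def extend_dictionary(W, oov_src_vecs, tgt_matrix):
    """Map out-of-dictionary Mongolian embeddings into the Chinese space
    and return the index of the nearest Chinese embedding (cosine).
    Conflict resolution and thresholding are omitted for brevity."""
    mapped = oov_src_vecs @ W
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    return (mapped @ tgt.T).argmax(axis=1)
```

The mapped neighbors give previously uncovered Mongolian words a meaningful cross-lingual representation instead of a padding vector, which is consistent with the contrast reported in Table 6.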

Effects of Cross-Lingual Annotation Knowledge. Lastly, we explore the effects of cross-lingual annotation knowledge. The results with the “BMESO” label scheme and the detailed ones are shown in Table 7 and Figure 8, respectively. We compare our cross-lingual annotation knowledge (i.e., projection-labeled data) with the baseline and with the pseudo-data enhancement method mentioned in Section 4.1. It can be observed that both cross-lingual annotation knowledge and pseudo data benefit the baseline, and that pseudo data contributes slightly more to the model. One reason is that pseudo data has a distribution more similar to the golden data, while cross-lingual annotation knowledge is obtained independently of the golden data and may accumulate errors through the Chinese NER and alignment procedures. Another reason, we believe, is that the baseline may already perform better on Mongolian than the NER tools do on Chinese, so projection-labeled data may introduce some noise.
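The core of annotation projection can be stated compactly: run a Chinese NER tool on the Chinese side of the parallel corpus, then copy each token label across the word alignments to the Mongolian side. A minimal sketch (the helper name and the pair-list alignment format are assumptions for illustration; the paper's actual procedure may differ in its handling of conflicts and span consistency):

```python
def project_labels(zh_labels, alignments, mn_len):
    """Project Chinese token-level NER labels onto a Mongolian sentence.

    zh_labels: labels predicted by a Chinese NER tool, one per token.
    alignments: list of (zh_index, mn_index) word-alignment pairs.
    mn_len: number of Mongolian tokens; unaligned tokens default to 'O'.
    Conflict resolution and span-consistency repair are omitted here.
    """
    mn_labels = ["O"] * mn_len
    for zh_i, mn_i in alignments:
        if zh_labels[zh_i] != "O":
            mn_labels[mn_i] = zh_labels[zh_i]
    return mn_labels
```

The sketch also makes the error-accumulation argument above concrete: any mistake by the Chinese NER tool or the aligner is copied directly into the projected Mongolian labels.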

Fig. 8. Comparison of different cross-lingual annotation methods based on different morpheme label schemes. “CA” denotes cross-lingual annotation.

Table 7. Effects of cross-lingual annotation knowledge.

| Model | F1 |
| --- | --- |
| \( \triangle \) (only with Morpheme Embedding) | 93.29% |
| \( \triangle \) + Cross-Lingual Annotation Knowledge | 93.43% (+0.14%) |
| \( \triangle \) + Pseudo data | 93.70% (+0.41%) |

However, we argue that in small data-scale settings, where the baseline may perform poorly, projection-labeled data may yield better results. To this end, we randomly sample 20%, 40%, 60%, and 80% of the full training instances to train the pre-trained models, respectively. The results shown in Figure 9 verify our assumption. We observe that projection-labeled data contributes more to the model when only 20% of the golden data is available, while pseudo data performs relatively worse. As more golden data is utilized, pseudo data performs better and eventually surpasses projection-labeled data when 100% of the golden data is used. This validates that projection-labeled data can significantly promote model performance when golden data is severely scarce. In a nutshell, we believe that our projection-labeled data also provides a considerable amount of cross-lingual knowledge, on which pre-training makes a meaningful initialization of the model parameters. Besides, introducing noisy pseudo data is effective in a relatively high-resource scenario (where golden data is sufficient).

Fig. 9. Comparison of different cross-lingual annotation methods under different low resource scenarios.

5.3 Comparison under Different Low Resource Scenarios

Due to the high expense of hand-crafted annotations and the scarcity of Mongolian NER corpora, it is essential to verify the effectiveness of our method under different low-resource scenarios. Note that in each low-resource scenario, models are trained on a different amount of training data. Furthermore, as shown in Figure 9, different data scales may yield different results (for example, pseudo data benefits the model when 100% of the golden data is used yet hurts the model when only 20% is available), so we investigate the impact of training data size to check whether morpheme and cross-lingual knowledge can make a difference when the amount of training data is reduced.

Figure 10 compares our method with the baseline under different data scales. The experimental results demonstrate that our method outperforms the baseline at all data scales, which verifies its effectiveness under various data-scale settings. Besides, the gap between the two methods gradually narrows as the data scale expands (i.e., from 14.63% to 2.74% in F1 score as the data scale grows from 1% to 100%), which suggests that the scarcer the data, the greater the impact our method exerts.

Fig. 10. Comparison of baseline and our MKE-NER model under different low resource scenarios.

In the following, we detail the effects of morpheme and cross-lingual knowledge under different data scales. As shown in Figure 11, knowledge enhancement brings smaller improvements as the data scale expands. For “+Morpheme” knowledge, the improvement decreases from 5.25% to 4.18%, 1.73%, and 1.24% in F1 score as the training data scale increases. As the data scale grows from 20% to 100%, the improvements in F1 score vary from 6.2% to 1.9%, from 7.11% to 1.49%, and from 4.52% to 0.94% for the combinations of “Morpheme” with cross-lingual knowledge, i.e., “CR” (cross-lingual representation) and “CA” (cross-lingual annotation), respectively. Therefore, we conjecture that when the training data for the Mongolian NER task is sparse (e.g., at the 20% data scale), introducing morpheme and cross-lingual knowledge effectively alleviates data scarcity and achieves significant improvements. When the training data is relatively sufficient (e.g., at the 100% data scale), knowledge enhancement still improves Mongolian NER performance, albeit by a smaller margin.

Fig. 11. Effects of our proposed knowledge under different low resource scenarios. Note that we only choose “CR(LSTM)” rather than “CR(Attention)” here since the former performs slightly better than the latter under these data scales.

Additionally, we compare the effects of the different knowledge enhancements under different data scales and observe that “CA” knowledge has the greatest impact on performance at relatively small data scales (i.e., 20%, 40%, and 60%). As the data scale expands to 80%, “CR” knowledge becomes the most significant factor. All results demonstrate that both kinds of cross-lingual knowledge can effectively boost Mongolian NER performance. Specifically, cross-lingual annotation knowledge performs well in relatively lower-resource scenarios, while cross-lingual representation knowledge does well at higher data scales.


6 RELATED WORK

In this section, we first review the mainstream development of neural NER, with a focus on low-resource scenarios. Then we introduce basic linguistic knowledge of Mongolian, and finally elaborate on studies of the Mongolian NER task.

Neural Named Entity Recognition. The CNN model proposed by Collobert et al. [8] was among the first to prove the effectiveness of deep neural networks for NER, successfully mapping text features into word embeddings. The Recurrent Neural Network (RNN) [26, 31] brought another wave of deep neural networks that improved the performance of NER tasks, and a variety of related models based on RNN have blossomed. Among them, LSTM [15, 16, 23, 24] partly solves problems such as gradient explosion and long-distance dependence in RNNs. Subsequently, Lample et al. [20] used an LSTM to extract text features and a CRF to obtain the optimal label sequence, a combination that achieved the best NER results at the time and remains a mainstream model today. Yan et al. [42] proposed TENER to adapt the Transformer encoder to NER tasks. Yamada et al. [41] introduced entities into pre-trained language models and achieved state-of-the-art performance on the CoNLL-2003 dataset.

Low-resource Named Entity Recognition. Low-resource scenarios are a challenge for NER due to the limited scale of annotated data in many languages and domains, where general neural network methods usually perform poorly.

Cotterell and Duh [9] transferred knowledge from a high-resource language to a low-resource one by jointly training a CRF model on two languages. Kruengkrai et al. [18] applied multi-task learning to low-resource NER by jointly training a sentence classification task and an NER task. Wu et al. [38] proposed a teacher-student learning method for unlabeled data in the target language. Besides transfer learning, data augmentation is also widely used to enhance performance for low-resource NER. Ding et al. [11] introduced a generative data augmentation approach using a language model. Bari et al. [3] proposed a novel data augmentation framework for zero-resource cross-lingual task adaptation. Both methods have proved helpful for low-resource NER. Although built on transfer learning and data augmentation, our MKE-NER differs from these studies in that we focus on a specific low-resource language, i.e., Mongolian, and exploit multiple kinds of language-specific knowledge to improve Mongolian NER.

Basic Linguistic Knowledge of Mongolian. Traditional Mongolian is an alphabetic-writing language, and Mongolian words are composed of Mongolian letters: eight vowels, ten prepositional consonants, and seventeen basic consonants. The glyph of each letter changes when it appears at different positions in a word. Similar to English, Mongolian words are separated by spaces. Words are written from top to bottom, and lines run from left to right. Additionally, Mongolian is an agglutinative language whose main word-building mechanism is attaching word-forming suffixes to different roots and stems. Unlike in English, suffixes attach to both roots and stems in Mongolian, and some suffixes carry no real meaning [13, 29]. In view of this word-building process, there are studies on segmenting Mongolian words into morphemes and directly using the morpheme representations to reduce word sparsity [35].

Mongolian Named Entity Recognition. The development of Mongolian NER has been relatively slow. Previous approaches mostly relied on lexicons or hand-crafted features [28]. With the development of deep learning, neural networks began to be applied in this field [34]. Building on this, Xiong and Nuo [39] introduced an attention mechanism into Bi-LSTM and used contextual embeddings for Mongolian NER. Different from these methods, we tackle Mongolian NER with multi-knowledge enhancement, considering not only Mongolian linguistic knowledge with a novel morpheme segmentation method but also cross-lingual knowledge.


7 CONCLUSION

In this paper, we focus on the traditional Mongolian NER task. To alleviate the scarcity of training data, we propose a Multi-Knowledge Enhancement framework for Mongolian NER (MKE-NER) that exploits not only Mongolian linguistic knowledge through fine-grained morpheme segmentation but also cross-lingual knowledge from a Mongolian-Chinese parallel corpus. Specifically, we exploit two kinds of knowledge in the parallel data, i.e., cross-lingual representation and cross-lingual annotation, to boost the performance of Mongolian NER. Thorough evaluations on the Mongolian NER dataset demonstrate the effectiveness of our model and, especially, its superiority in low-resource scenarios.


REFERENCES

  [1] Al-Moslmi Tareq, Ocaña Marc Gallofré, Opdahl Andreas L., and Veres Csaba. 2020. Named entity extraction for knowledge graphs: A literature overview. IEEE Access 8 (2020), 32862–32881.
  [2] Bahdanau Dzmitry, Cho Kyunghyun, and Bengio Yoshua. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
  [3] Bari M. Saiful, Mohiuddin Tasnim, and Joty Shafiq. 2020. MultiMix: A robust data augmentation framework for cross-lingual NLP. arXiv preprint arXiv:2004.13240 (2020).
  [4] Berger Adam, Della Pietra Stephen A., and Della Pietra Vincent J. 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22, 1 (1996), 39–71.
  [5] Bordes Antoine, Usunier Nicolas, Garcia-Duran Alberto, Weston Jason, and Yakhnenko Oksana. 2013. Translating embeddings for modeling multi-relational data. In Neural Information Processing Systems (NIPS). 1–9.
  [6] Boyd-Graber Jordan and Börschinger Benjamin. 2019. What question answering can learn from trivia nerds. arXiv preprint arXiv:1910.14464 (2019).
  [7] Cao Yixin, Hu Zikun, Chua Tat-seng, Liu Zhiyuan, and Ji Heng. 2019. Low-resource name tagging learned with weakly labeled data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 261–270.
  [8] Collobert Ronan, Weston Jason, Bottou Léon, Karlen Michael, Kavukcuoglu Koray, and Kuksa Pavel. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (2011), 2493–2537.
  [9] Cotterell Ryan and Duh Kevin. 2017. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, 91–96. https://www.aclweb.org/anthology/I17-2016.
  [10] Cui Leyang, Wu Yu, Liu Jian, Yang Sen, and Zhang Yue. 2021. Template-based named entity recognition using BART. arXiv preprint arXiv:2106.01760 (2021).
  [11] Ding Bosheng, Liu Linlin, Bing Lidong, Kruengkrai Canasai, Nguyen Thien Hai, Joty Shafiq, Si Luo, and Miao Chunyan. 2020. DAGA: Data augmentation with a generation approach for low-resource tagging tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6045–6057.
  [12] Ding Ning, Xu Guangwei, Chen Yulin, Wang Xiaobin, Han Xu, Xie Pengjun, Zheng Hai-Tao, and Liu Zhiyuan. 2021. Few-NERD: A few-shot named entity recognition dataset. arXiv preprint arXiv:2105.07464 (2021).
  [13] Dulamragchaa Uuganbaatar, Chadraabal Sodoo, Ivanov Byambasuren, and Baatarkhuu Munkhbayar. 2017. Mongolian language morphology and its database structure. In 2017 International Conference on Green Informatics (ICGI). IEEE, 282–285.
  [14] Feng Xiaocheng, Feng Xiachong, Qin Bing, Feng Zhangyin, and Liu Ting. 2018. Improving low resource named entity recognition using cross-lingual knowledge transfer. In IJCAI. 4071–4077.
  [15] Hammerton James. 2003. Named entity recognition with long short-term memory. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003. 172–175.
  [16] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
  [17] Huang Zhiheng, Xu Wei, and Yu Kai. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015).
  [18] Kruengkrai Canasai, Nguyen Thien Hai, Aljunied Sharifah Mahani, and Bing Lidong. 2020. Improving low-resource named entity recognition using joint sentence and token labeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5898–5905.
  [19] Lafferty John D., McCallum Andrew, and Pereira Fernando C. N. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–289.
  [20] Lample Guillaume, Ballesteros Miguel, Subramanian Sandeep, Kawakami Kazuya, and Dyer Chris. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 260–270.
  [21] Lewis Mike, Liu Yinhan, Goyal Naman, Ghazvininejad Marjan, Mohamed Abdelrahman, Levy Omer, Stoyanov Ves, and Zettlemoyer Luke. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
  [22] Li Xiaoya, Sun Xiaofei, Meng Yuxian, Liang Junjun, Wu Fei, and Li Jiwei. 2020. Dice loss for data-imbalanced NLP tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 465–476.
  [23] Liu Liyuan, Shang Jingbo, Ren Xiang, Xu Frank, Gui Huan, Peng Jian, and Han Jiawei. 2018. Empower sequence labeling with task-aware neural language model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  [24] Luo Ying, Xiao Fengshun, and Zhao Hai. 2020. Hierarchical contextualized representation for named entity recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8441–8448.
  [25] Ma Xuezhe and Hovy Eduard. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1064–1074.
  [26] Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černockỳ Jan, and Khudanpur Sanjeev. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  [27] Mollá Diego, Van Zaanen Menno, Smith Daniel, et al. 2006. Named entity recognition for question answering. (2006).
  [28] Munkhjargal Zoljargal, Bella Gabor, Chagnaa Altangerel, and Giunchiglia Fausto. 2015. Named entity recognition for Mongolian language. In International Conference on Text, Speech, and Dialogue. Springer, 243–251.
  [29] Poppe Nicholas. 1974. Grammar of Written Mongolian. Otto Harrassowitz Verlag.
  [30] Rabiner Lawrence and Juang Biinghwang. 1986. An introduction to hidden Markov models. IEEE ASSP Magazine 3, 1 (1986), 4–16.
  [31] Sutskever Ilya, Vinyals Oriol, and Le Quoc V. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems 27 (2014), 3104–3112.
  [32] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6000–6010.
  [33] Wang Quan, Mao Zhendong, Wang Bin, and Guo Li. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.
  [34] Wang Weihua, Bao Feilong, and Gao Guanglai. 2016. Mongolian named entity recognition with bidirectional recurrent neural networks. In 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 495–500.
  [35] Wang Weihua, Bao Feilong, and Gao Guanglai. 2019. Learning morpheme representation for Mongolian named entity recognition. Neural Processing Letters 50, 3 (2019), 2647–2664.
  [36] Wang Xinyu, Jiang Yong, Bach Nguyen, Wang Tao, Huang Zhongqiang, Huang Fei, and Tu Kewei. 2020. Automated concatenation of embeddings for structured prediction. arXiv preprint arXiv:2010.05006 (2020).
  [37] Wu Jinxin, Nasun-Urtu, and Yang Zhenxin. 2016. Recognition method of Mongolian person names based on conditional random fields. Application Research of Computers 33, 7 (2016), 2014–2017.
  [38] Wu Qianhui, Lin Zijia, Karlsson Börje, Lou Jian-Guang, and Huang Biqing. 2020. Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6505–6514.
  [39] Xiong Yuzhu and Nuo Minghua. 2018. Attention-based BLSTM-CRF architecture for Mongolian named entity recognition. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation.
  [40] Xuan Zhenyu, Bao Rui, and Jiang Shengyi. 2020. FGN: Fusion glyph network for Chinese named entity recognition. arXiv preprint arXiv:2001.05272 (2020).
  [41] Yamada Ikuya, Asai Akari, Shindo Hiroyuki, Takeda Hideaki, and Matsumoto Yuji. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 6442–6454.
  [42] Yan Hang, Deng Bocao, Li Xiaonan, and Qiu Xipeng. 2019. TENER: Adapting transformer encoder for named entity recognition. arXiv preprint arXiv:1911.04474 (2019).
  [43] Zhang Boliang, Pan Xiaoman, Wang Tianlu, Vaswani Ashish, Ji Heng, Knight Kevin, and Marcu Daniel. 2016. Name tagging for low-resource incident languages based on expectation-driven learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 249–259.


Published in ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 5 (September 2022), 486 pages. ISSN: 2375-4699, EISSN: 2375-4702. DOI: 10.1145/3533669. Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History: Received 27 July 2021; Revised 29 October 2021; Accepted 10 January 2022; Online AM 29 April 2022; Published 23 September 2022.
