Abstract
Manually constructing a Named Entity (NE) annotated bilingual corpus is time-consuming, labor-intensive, and expensive, yet such corpora are necessary for natural language processing (NLP) tasks such as cross-lingual information retrieval, cross-lingual information extraction, and machine translation. In this article, we present an automatic approach to constructing an NE-annotated English-Vietnamese bilingual corpus from a bilingual parallel corpus by proposing an NE alignment method. Starting from a bilingual corpus in which the initial NEs are extracted separately on each language side, the approach tries to correct unrecognized or incorrectly recognized NEs before aligning the NEs, using a variety of bilingual constraints. The generated corpus not only improves the NE recognition results but also provides alignments between English NEs and Vietnamese NEs, which are necessary for training NE translation models. The experimental results show that the approach outperforms the baseline methods. In the English-Vietnamese NE alignment task, the F-measure increases from 68.58% to 79.77%. The proposed method also significantly improves NE recognition quality: the F-measure rises from 84.85% to 88.66% on the English side and from 75.71% to 85.55% on the Vietnamese side. By providing additional semantic information to machine translation systems, the approach raises the BLEU score from 33.04% to 45.11%.
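The F-measure figures reported for the NE alignment task can be read against a minimal sketch like the following. It assumes the balanced F-measure (F1) with exact-match scoring of (English NE, Vietnamese NE) pairs; the abstract does not state which F-measure variant or matching criterion the authors use, and the function name and sample pairs are purely illustrative.

```python
from typing import Set, Tuple

# An alignment is a pair (English NE, Vietnamese NE). Hypothetical representation.
Alignment = Tuple[str, str]


def f_measure(predicted: Set[Alignment], gold: Set[Alignment]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted NE alignments against a gold set.

    Assumption: a predicted pair counts as correct only if it matches a gold
    pair exactly on both sides.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only, not taken from the paper's corpus.
gold = {
    ("Vietnam", "Việt Nam"),
    ("Hanoi", "Hà Nội"),
    ("United Nations", "Liên Hợp Quốc"),
}
predicted = {
    ("Vietnam", "Việt Nam"),
    ("Hanoi", "Hà Nội"),
    ("United", "Liên Hợp Quốc"),  # wrong English boundary, so it counts as an error
}

precision, recall, f1 = f_measure(predicted, gold)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case a single boundary error on the English side costs both precision and recall, which is one way to see why correcting recognition errors before alignment could raise the alignment F-measure.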
Index Terms
An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus