Abstract
Research on the automatic construction of bilingual dictionaries has achieved impressive results. Bilingual dictionaries are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well.
In this article, we further pursue the idea of using Wikipedia as a corpus for bilingual terminology extraction. We propose a method that extracts term-translation pairs from different types of Wikipedia link information. An SVM classifier, trained on features of manually labeled term-translation pairs, then determines the correctness of unseen pairs.
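The two-stage idea described above (collect candidate pairs from link information, then classify them) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the features, the language pair, or the SVM configuration, so the three feature names, the example terms, and all values are invented. The classifier is a small linear SVM trained by subgradient descent on the hinge loss, swapped in so that the example runs without external libraries; it is not the authors' setup.

```python
# Illustrative sketch only. Feature names, terms, and values are hypothetical;
# the linear SVM below is a stand-in for whatever SVM the authors used.

FEATURES = ("has_interlanguage_link", "redirect_share", "anchor_text_share")

# (source term, candidate translation) -> (feature vector, manual label)
# Label +1 means "correct translation", -1 means "incorrect".
labeled_pairs = {
    ("house", "Haus"): ([1.0, 0.8, 0.9], +1),
    ("dog", "Hund"): ([1.0, 0.6, 0.7], +1),
    ("river", "Fluss"): ([0.0, 0.9, 0.8], +1),
    ("bread", "Brot"): ([1.0, 0.5, 0.6], +1),
    ("house", "Hund"): ([0.0, 0.1, 0.2], -1),
    ("dog", "Fluss"): ([0.0, 0.2, 0.1], -1),
    ("river", "Brot"): ([0.0, 0.0, 0.3], -1),
    ("bread", "Haus"): ([0.0, 0.3, 0.0], -1),
}


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Minimize L2-regularized hinge loss with plain subgradient descent."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (dot(w, x) + b)
            # Shrink the weights (gradient of the L2 penalty).
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(model, x):
    """Return +1 (accept pair) or -1 (reject pair)."""
    w, b = model
    return 1 if dot(w, x) + b >= 0.0 else -1


samples = [features for features, _ in labeled_pairs.values()]
labels = [label for _, label in labeled_pairs.values()]
model = train_linear_svm(samples, labels)

# Unseen candidate pairs with hypothetical feature values.
unseen_pairs = {
    ("tree", "Baum"): [1.0, 0.7, 0.8],
    ("tree", "Haus"): [0.0, 0.1, 0.1],
}
for pair, features in unseen_pairs.items():
    print(pair, "accepted" if predict(model, features) == 1 else "rejected")
```

In practice the feature vectors would be computed from the link structure itself, and an established SVM library would replace the hand-written trainer; the sketch only shows how labeled pairs, features, and the accept/reject decision fit together.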
Index Terms
Improving the extraction of bilingual terminology from Wikipedia