research-article

Improving the extraction of bilingual terminology from Wikipedia

Published: 06 November 2009

Abstract

Research on the automatic construction of bilingual dictionaries has achieved impressive results. Bilingual dictionaries are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well.

In this article, we further pursue the idea of using Wikipedia as a corpus for bilingual terminology extraction. We propose a method that extracts term-translation pairs from different types of Wikipedia link information. An SVM classifier, trained on features of manually labeled term-translation pairs, then determines the correctness of unseen pairs.
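The two stages named in the abstract, candidate extraction from link information followed by classification, can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the toy English-German data, the two binary features, the training labels, and the function names are hypothetical, and the small linear SVM trained by subgradient descent merely stands in for whatever SVM implementation and feature set the article actually uses.

```python
from itertools import product

# Hypothetical toy data: one interlanguage link plus redirect titles per side.
INTERLANGUAGE_LINKS = {"Car": "Automobil"}
REDIRECTS_EN = {"Car": ["Automobile", "Motor car"]}
REDIRECTS_DE = {"Automobil": ["Auto", "Kraftwagen"]}

# Hypothetical labeled feature vectors (+1 = correct pair, -1 = incorrect).
# The labels are arbitrary and say nothing about the article's findings.
TRAIN_X = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
TRAIN_Y = [1, 1, 1, -1]


def candidate_pairs(links, redirects_src, redirects_tgt):
    """Yield (source term, target term, features) for each linked article pair."""
    for src_title, tgt_title in links.items():
        src_terms = [src_title] + redirects_src.get(src_title, [])
        tgt_terms = [tgt_title] + redirects_tgt.get(tgt_title, [])
        for s, t in product(src_terms, tgt_terms):
            # Assumed features: [source is article title, target is article title]
            yield s, t, [float(s == src_title), float(t == tgt_title)]


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(len(w)):
                # Regularizer always contributes; hinge term only inside the margin.
                grad = reg * w[i] - (y * x[i] if margin < 1 else 0.0)
                w[i] -= lr * grad
            if margin < 1:
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 if the pair is classified as a correct translation, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


if __name__ == "__main__":
    w, b = train_linear_svm(TRAIN_X, TRAIN_Y)
    for s, t, feats in candidate_pairs(INTERLANGUAGE_LINKS, REDIRECTS_EN, REDIRECTS_DE):
        verdict = "keep" if predict(w, b, feats) == 1 else "drop"
        print(f"{s} -> {t}: {verdict}")
```

In this sketch every combination of a title or redirect on one side with a title or redirect on the other becomes a candidate, and the classifier then acts as a filter over those candidates. A realistic system would replace the toy dictionaries with data parsed from Wikipedia dumps and use a much richer feature set.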


Reviews

Jolanta Mizera-Pietraszko

Wikipedia is presumably the most popular encyclopedia on the Net. It is maintained by users who are not necessarily specialists in the respective fields, so the knowledge presented is not always reliable. On the other hand, because natural languages change over time, Wikipedia articles include expressions that are still current. From this perspective, its multilingualism is of a special kind: each article is linked to a different set of articles in other languages, and these are not necessarily translations of it. The advantage is that a user who speaks foreign languages can read the same article in many languages and learn new details of the subject from each language version. This substantiates some interest in exploiting Wikipedia for multilingual purposes.

This paper's stem-based approach relies on the assumption that two term translations of a Wikipedia title link to their parallel article. Erdmann et al. aim to complement a bilingual dictionary automatically, rather than create one. According to the authors, the measured translation accuracy increases when redirect page titles are added, when anchor text information is added, and when these are supplemented by translation candidates extracted from the forward and backward links of the article (although an incoming link for one user can be the outgoing link for another). Next, a support vector machine (SVM) classifier filters the full set of translation candidates.

Although the paper is very interesting, it has some minor flaws. For instance, in Section 2.2, "Automatic Dictionary Construction," "Wikipedia" is a subsection title, as are "Parallel Corpora" and "Comparable Corpora," even though Wikipedia is not a dictionary, but rather an encyclopedia of articles whose structures, organization, and content depend on the author. Also, for this particular experiment, the language pair of English and German is mentioned only on page 10, as if the choice of language pair were of no importance to the process of creating an automatic dictionary. Some of the results, such as the comparison of an SVM classifier based on two features with an SVM classifier based on 13 features, or the two languages' common derivations, are too obvious to include.

Nonetheless, the paper is really intriguing, and the results can be used by other researchers and ported to other language pairs.

Online Computing Reviews Service

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 5, Issue 4
October 2009
103 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/1596990

Copyright © 2009 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 6 November 2009
• Revised: 1 June 2009
• Accepted: 1 June 2009
• Received: 1 January 2009

Qualifiers

• research-article
• Research
• Refereed
