Corpus-Based Translation Induction in Indian Languages Using Auxiliary Language Corpora from Wikipedia

Published: 17 March 2017

Abstract

Identifying translations from comparable corpora is a well-known problem with several applications. Existing methods rely on linguistic tools or high-quality corpora. Absence of such resources, especially in Indian languages, makes this problem hard; for example, state-of-the-art techniques achieve a mean reciprocal rank of 0.66 for English-Italian, and a mere 0.187 for Telugu-Kannada. In this work, we address the problem of comparable corpora-based translation correspondence induction (CC-TCI) when the only resources available are small noisy comparable corpora extracted from Wikipedia. We observe that translations in the source and target languages have many topically related words in common in other “auxiliary” languages. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for CC-TCI. Extensive experiments on 35 comparable corpora showed dramatic improvements in performance. We extend these ideas to propose a method for measuring cross-lingual semantic relatedness (CLSR) between words. To stimulate further research in this area, we make publicly available two new high-quality human-annotated datasets for CLSR. Experiments on the CLSR datasets show more than 200% improvement in correlation on the CLSR task. We apply the method to the real-world problem of cross-lingual Wikipedia title suggestion and build the WikiTSu system. A user study on WikiTSu shows a 20% improvement in the quality of titles suggested.
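The abstract reports results as mean reciprocal rank (MRR). As a minimal sketch of how that metric is commonly computed for translation induction, assuming each source word comes with a ranked list of candidate translations and a set of gold translations: the score for a word is the reciprocal of the rank of its first correct candidate (0 if none is correct), averaged over all source words. The function name, argument names, and toy data below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
) -> float:
    """Average of 1/rank of the first correct candidate per source word.

    ranked: source word -> candidate translations, best first (assumed input).
    gold:   source word -> set of acceptable translations (assumed input).
    A source word with no correct candidate contributes 0.
    """
    if not ranked:
        return 0.0
    total = 0.0
    for source_word, candidates in ranked.items():
        correct = gold.get(source_word, set())
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break  # only the first correct candidate counts
    return total / len(ranked)


# Toy illustration with placeholder tokens, not real data.
toy_ranked = {
    "src_a": ["tgt_x", "tgt_y"],           # correct at rank 1
    "src_b": ["tgt_p", "tgt_q", "tgt_r"],  # correct at rank 3
    "src_c": ["tgt_m"],                    # no correct candidate
}
toy_gold = {"src_a": {"tgt_x"}, "src_b": {"tgt_r"}, "src_c": {"tgt_z"}}
toy_mrr = mean_reciprocal_rank(toy_ranked, toy_gold)
```

On this toy input the three words contribute 1, 1/3, and 0, so the average is 4/9. Read this way, an MRR of 0.66 means the correct translation typically appears at or near the top of the list, while 0.187 means it is usually ranked much lower or missed.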

