Abstract
The scarcity or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and by enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) under various combinations of heuristics and numbers of symmetry assumption cycles so as to obtain the highest F-score. Our proposed methods achieve statistically significant improvements in precision and F-score over our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
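The Cartesian-product baseline and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes one simple reading of the baseline, namely that every source word is paired with every target word sharing at least one pivot translation; the article may define the product differently (for example, over whole transgraphs). The function names, the toy words (`a1`, `p1`, `c1`, ...), and the gold standard are all hypothetical and are not taken from the article's datasets.

```python
from itertools import product


def cartesian_baseline(dict_a_pivot, dict_c_pivot):
    """Pair each A word with each C word that shares a pivot word.

    Assumption: inputs map a word to the set of its pivot-language
    translations. This is an illustrative reading of the baseline only.
    """
    pivot_to_a, pivot_to_c = {}, {}
    for word, pivots in dict_a_pivot.items():
        for p in pivots:
            pivot_to_a.setdefault(p, set()).add(word)
    for word, pivots in dict_c_pivot.items():
        for p in pivots:
            pivot_to_c.setdefault(p, set()).add(word)

    pairs = set()
    # Only pivot words present in both dictionaries can connect A and C.
    for p in pivot_to_a.keys() & pivot_to_c.keys():
        pairs.update(product(pivot_to_a[p], pivot_to_c[p]))
    return pairs


def precision_recall_f(predicted, gold):
    """Standard precision, recall, and balanced F-score over pair sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical toy dictionaries: language A -> pivot, language C -> pivot.
dict_a_pivot = {"a1": {"p1"}, "a2": {"p1", "p2"}}
dict_c_pivot = {"c1": {"p1"}, "c2": {"p2"}}
gold = {("a1", "c1"), ("a2", "c2")}  # hypothetical gold-standard pairs

pairs = cartesian_baseline(dict_a_pivot, dict_c_pivot)
precision, recall, f_score = precision_recall_f(pairs, gold)
print(sorted(pairs))
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

In this toy case the baseline recovers both gold pairs but also proposes the spurious pair `("a2", "c1")`, which is the kind of over-generation that constraint-based filtering is meant to reduce.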
Index Terms
A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families