Abstract
This article proposes a technique for mining bilingual lexicons from pairs of parallel short word sequences. The technique builds a generative model from a training corpus of such pairs. The model is a hierarchical nonparametric Bayesian model that induces a bilingual lexicon directly during training. It learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. In addition to the commonly used word-pair frequency information, the model can exploit the internal character alignments within the words themselves. It can therefore mine transliterations, and it can use reliably aligned transliteration pairs to support the mining of other words in their context. The model can also perform word reordering and word deletion during alignment, and it can operate in the absence of full segmentation information. In this work, we study two mining tasks based on the English-Japanese and English-Chinese language pairs, and compare the proposed approach to baselines based on simpler models that use only word-pair frequency information. Our results show that the proposed method mines bilingual word pairs at higher levels of precision and recall than the baselines.
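To make "word-pair frequency information" concrete, the following is a minimal sketch of a co-occurrence-based miner over short parallel word sequences. It is not the article's baseline or its proposed model: the Dice coefficient, the function name `mine_word_pairs`, the `min_score` threshold, and the toy corpus are all assumptions chosen for illustration. A scorer of this kind sees only how often two words appear in the same pair, and it has no access to the character-level evidence that the proposed model can use.

```python
from collections import Counter
from itertools import product


def mine_word_pairs(pairs, min_score=0.5):
    """Score candidate word pairs by the Dice coefficient of their co-occurrence.

    `pairs` is an iterable of (source_words, target_words) tuples, each a short
    parallel word sequence. Hypothetical helper, not taken from the article.
    """
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src_words, tgt_words in pairs:
        # Count each word at most once per sequence pair.
        src_set, tgt_set = set(src_words), set(tgt_words)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint_count.update(product(src_set, tgt_set))

    scored = {}
    for (s, t), c in joint_count.items():
        dice = 2.0 * c / (src_count[s] + tgt_count[t])
        if dice >= min_score:
            scored[(s, t)] = dice
    return scored


# Toy English-Japanese corpus (invented for this sketch).
toy_corpus = [
    (["tokyo", "tower"], ["東京", "タワー"]),
    (["tokyo", "station"], ["東京", "駅"]),
    (["kyoto", "station"], ["京都", "駅"]),
    (["kyoto", "tower"], ["京都", "タワー"]),
]
lexicon = mine_word_pairs(toy_corpus, min_score=0.9)
```

On this toy corpus the four correct pairs each co-occur in every sequence pair where either word appears, so they score 1.0, while mismatched pairs such as ("tokyo", "タワー") score 0.5 and are filtered out. With sparser data, frequency alone becomes much less reliable, which is the gap that character-level alignment evidence is meant to address.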
Inducing a Bilingual Lexicon from Short Parallel Multiword Sequences