Abstract
Comparable corpora are a valuable alternative to expensive parallel corpora. They contain informative parallel fragments that are useful resources for a range of natural language processing tasks. In this work, a generative model is proposed for efficient extraction of parallel fragments from a pair of comparable documents. The core of the proposed model is a graph called the Matching Graph. Because the Matching Graph can be trained on a small initial seed, it is well suited to language pairs with scarce resources. Experiments show that the Matching Graph performs significantly better than other recently published models. According to experiments on the English-Persian and Arabic-Persian language pairs, the extracted parallel fragments can be used in place of parallel data for training statistical machine translation systems. In the best case, the extracted fragments retrieve about 90% of the information of a statistical machine translation system trained on a parallel corpus. Moreover, using the extracted fragments as additional training data for statistical machine translation systems improves the BLEU score by about 2% for English-Persian and about 1% for Arabic-Persian translation.
Index Terms
Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora