ABSTRACT
A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an initial, low-quality batch translation. In contrast to other approaches which require specialized metadata, the system uses only the textual content of the documents. Results are presented for a corpus of over two billion web pages and for a large collection of digitized public-domain books.
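The idea of treating parallel-text mining as near-duplicate detection over translated documents can be illustrated with a small sketch. It is not the system described above: it assumes every document has already been machine-translated into English by some external step, compares word n-gram sets with Jaccard similarity, and scores all pairs exhaustively, which would not scale to billions of pages. The names `word_ngrams`, `jaccard`, `mine_pairs` and the threshold value are hypothetical choices made for illustration.

```python
from itertools import combinations


def word_ngrams(text, n=3):
    """Return the set of word n-grams of a (translated) document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mine_pairs(translated_docs, threshold=0.5, n=3):
    """Find candidate parallel documents.

    translated_docs maps doc_id -> (original_language, english_text),
    where english_text is assumed to come from a prior batch translation.
    Only documents with different original languages are compared.
    """
    shingles = {
        doc_id: word_ngrams(text, n)
        for doc_id, (_, text) in translated_docs.items()
    }
    pairs = []
    # Exhaustive pairwise comparison: fine for a toy example only.
    for a, b in combinations(sorted(translated_docs), 2):
        if translated_docs[a][0] == translated_docs[b][0]:
            continue  # same source language, not a translation pair
        score = jaccard(shingles[a], shingles[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


if __name__ == "__main__":
    docs = {
        "en1": ("en", "the cat sat on the mat today"),
        "de1": ("de", "the cat sat on the mat today"),
        "fr1": ("fr", "completely unrelated text about stock markets"),
    }
    print(mine_pairs(docs))
```

Only the textual content is used here, in line with the abstract; no URLs or other metadata enter the comparison.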
Index Terms
Large scale parallel document mining for machine translation