DOI: 10.5555/1873781.1873905
research-article

Large scale parallel document mining for machine translation

Published: 23 August 2010

ABSTRACT

A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an initial, low-quality batch translation. In contrast to other approaches which require specialized metadata, the system uses only the textual content of the documents. Results are presented for a corpus of over two billion web pages and for a large collection of digitized public-domain books.
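To make the idea of cross-language near-duplicate detection concrete, here is a minimal sketch. It is not the system described in the paper: the word-by-word lexicon stands in for the low-quality batch translation, the word-bigram Jaccard score stands in for whatever matching the authors use, and the names `LEXICON`, `translate`, `mine_pairs`, and the threshold value are all hypothetical choices made for illustration.

```python
# Toy sketch: translate foreign documents crudely, then look for
# near-duplicates among English documents using only their text.
# Everything here (lexicon, scoring, threshold) is an illustrative assumption.

# Hypothetical German-to-English lexicon standing in for batch translation.
LEXICON = {
    "das": "the", "haus": "house", "ist": "is", "klein": "small",
}

FOREIGN_DOCS = {"de1": "Das Haus ist klein"}
ENGLISH_DOCS = {"en1": "The house is small", "en2": "The dog sleeps"}


def translate(tokens, lexicon):
    """Replace each token with its lexicon entry; unknown tokens pass through."""
    return [lexicon.get(tok, tok) for tok in tokens]


def ngrams(tokens, n=2):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mine_pairs(foreign_docs, english_docs, lexicon, n=2, threshold=0.5):
    """Return (foreign_id, english_id, score) for pairs scoring >= threshold."""
    english_sets = {
        doc_id: ngrams(text.lower().split(), n)
        for doc_id, text in english_docs.items()
    }
    pairs = []
    for f_id, text in foreign_docs.items():
        translated = translate(text.lower().split(), lexicon)
        f_set = ngrams(translated, n)
        # All-pairs comparison: fine for a toy, not for billions of pages.
        for e_id, e_set in english_sets.items():
            score = jaccard(f_set, e_set)
            if score >= threshold:
                pairs.append((f_id, e_id, score))
    return pairs


PAIRS = mine_pairs(FOREIGN_DOCS, ENGLISH_DOCS, LEXICON)
print(PAIRS)
```

The sketch uses no URLs, language tags, or other metadata, which matches the abstract's point that only textual content is needed. The exhaustive pairwise loop is the part that would not survive at the scale the abstract mentions; a distributed setting would need some form of indexing or hashing to avoid it.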

Published in

COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics
August 2010
1408 pages

Publisher

Association for Computational Linguistics
United States

Acceptance Rates

Overall Acceptance Rate: 1,537 of 1,537 submissions, 100%
