research-article
Open Access

Expanding Paraphrase Lexicons by Exploiting Generalities

Published: 30 January 2018

Abstract

Techniques for generating and recognizing paraphrases, i.e., semantically equivalent expressions, play an important role in a wide range of natural language processing tasks. In the last decade, the task of automatic acquisition of subsentential paraphrases, i.e., words and phrases with (approximately) the same meaning, has been drawing much attention in the research community. The core problem is to obtain paraphrases of high quality in large quantity. This article presents a method for tackling this issue by systematically expanding an initial seed lexicon made up of high-quality paraphrases. This involves automatically capturing morpho-semantic and syntactic generalizations within the lexicon and using them to leverage the power of large-scale monolingual data. Given an input set of paraphrases, our method starts by inducing paraphrase patterns that constitute generalizations over corresponding pairs of lexical variants, such as “amending” and “amendment,” in a fully empirical way. It then searches large-scale monolingual data for new paraphrases matching those patterns. The results of our experiments on English, French, and Japanese demonstrate that our method manages to expand seed lexicons by a large multiple. Human evaluation based on paraphrase substitution tests reveals that the automatically acquired paraphrases are also of high quality.
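The two steps named in the abstract (inducing patterns from pairs of lexical variants, then searching monolingual data for new matches) can be illustrated with a toy Python sketch. This is not the article's algorithm: it assumes single-word paraphrases, reduces a "pattern" to a suffix rewrite such as "-ing" to "-ment", and stands in for large-scale monolingual data with a plain vocabulary set. The names `induce_pattern`, `expand_lexicon`, and `min_support` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def induce_pattern(a: str, b: str) -> tuple[str, str]:
    """Strip the longest common prefix of a and b; return the two remaining suffixes."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[i:], b[i:]


def expand_lexicon(seed_pairs, vocabulary, min_support=2):
    """Return new (word, variant) pairs licensed by suffix patterns seen in the seeds.

    Assumption: a pattern is kept only if at least `min_support` seed pairs share it.
    """
    vocab = set(vocabulary)
    seen = set(seed_pairs)
    # Count how many seed pairs support each suffix-rewrite pattern.
    support = Counter(induce_pattern(a, b) for a, b in seed_pairs)
    patterns = [p for p, n in support.items() if n >= min_support]
    found = set()
    for word in vocab:
        for src, tgt in patterns:
            # Skip empty source suffixes, which would match every word.
            if src and word.endswith(src):
                candidate = word[: len(word) - len(src)] + tgt
                if candidate in vocab and (word, candidate) not in seen:
                    found.add((word, candidate))
    return sorted(found)


seeds = [("amending", "amendment"), ("adjusting", "adjustment")]
vocab = {"amending", "amendment", "investing", "investment", "running"}
print(expand_lexicon(seeds, vocab))
```

Here both seed pairs yield the same suffix rewrite, so it passes the support threshold, and "investing" is paired with "investment" because both forms occur in the vocabulary. "running" produces no pair because the rewritten form is not attested, which is the role the monolingual data plays as a filter in this simplified picture.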

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 17, Issue 2
        June 2018
        134 pages
        ISSN:2375-4699
        EISSN:2375-4702
        DOI:10.1145/3160862

        Copyright © 2018 Owner/Author

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 30 January 2018
        • Revised: 1 October 2017
        • Accepted: 1 October 2017
        • Received: 1 November 2016

        Qualifiers

        • research-article
        • Research
        • Refereed