Abstract
Techniques for generating and recognizing paraphrases, i.e., semantically equivalent expressions, play an important role in a wide range of natural language processing tasks. In the last decade, the task of automatic acquisition of subsentential paraphrases, i.e., words and phrases with (approximately) the same meaning, has been drawing much attention in the research community. The core problem is to obtain paraphrases of high quality in large quantity. This article presents a method for tackling this issue by systematically expanding an initial seed lexicon made up of high-quality paraphrases. This involves automatically capturing morpho-semantic and syntactic generalizations within the lexicon and using them to leverage the power of large-scale monolingual data. Given an input set of paraphrases, our method starts by inducing paraphrase patterns that constitute generalizations over corresponding pairs of lexical variants, such as “amending” and “amendment,” in a fully empirical way. It then searches large-scale monolingual data for new paraphrases matching those patterns. The results of our experiments on English, French, and Japanese demonstrate that our method manages to expand seed lexicons by a large multiple. Human evaluation based on paraphrase substitution tests reveals that the automatically acquired paraphrases are also of high quality.
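The two steps described above, inducing a pattern from a seed pair and then searching monolingual text for new matches, can be illustrated with a toy sketch. It is not the article's algorithm: the shared-prefix heuristic, the small stopword list, and the names `induce_pattern` and `harvest` are all assumptions made for illustration. The sketch handles a single lexical-variant pair per seed and assumes each shared content word occurs once per side.

```python
import re
from os.path import commonprefix

# Toy stopword list (an assumption); shared words outside it become slots.
STOPWORDS = {"the", "of", "a", "an", "to"}


def induce_pattern(lhs, rhs, min_stem=4):
    """Generalize one seed pair into two regexes with shared slots, or None."""
    l_toks, r_toks = lhs.split(), rhs.split()
    # Lexical variants: distinct words sharing a prefix of >= min_stem chars.
    variants = [(lw, rw) for lw in l_toks for rw in r_toks
                if lw != rw and len(commonprefix([lw, rw])) >= min_stem]
    if len(variants) != 1:
        return None
    lw, rw = variants[0]
    stem = commonprefix([lw, rw])
    # Content words shared by both sides become numbered slots Y0, Y1, ...
    slots = {}
    for tok in l_toks:
        if tok in r_toks and tok not in STOPWORDS:
            slots.setdefault(tok, f"Y{len(slots)}")

    def compile_side(toks, variant):
        parts = []
        for tok in toks:
            if tok == variant:
                # Stem slot X followed by this side's literal suffix.
                parts.append(r"(?P<X>\w+)" + re.escape(tok[len(stem):]))
            elif tok in slots:
                parts.append(rf"(?P<{slots[tok]}>\w+)")
            else:
                parts.append(re.escape(tok))
        return re.compile(r"\b" + r"\s+".join(parts) + r"\b")

    return compile_side(l_toks, lw), compile_side(r_toks, rw)


def harvest(pattern, corpus):
    """Return phrase pairs whose slot fillers agree on both sides."""
    l_re, r_re = pattern
    left = {tuple(sorted(m.groupdict().items())): m.group(0)
            for m in l_re.finditer(corpus)}
    right = {tuple(sorted(m.groupdict().items())): m.group(0)
             for m in r_re.finditer(corpus)}
    return sorted((left[k], right[k]) for k in left.keys() & right.keys())


pattern = induce_pattern("amending the treaty", "amendment of the treaty")
corpus = ("they discussed adjusting the schedule . "
          "the adjustment of the schedule was late . "
          "treatment of the patient continued .")
pairs = harvest(pattern, corpus)
print(pairs)
```

In this toy corpus, the only candidate that survives is the pair whose stem and content slot match on both sides; "treatment of the patient" is discarded because no counterpart of the form "treating the patient" occurs. A real system would need far more robust variant detection and filtering than a shared-prefix test.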
Expanding Paraphrase Lexicons by Exploiting Generalities