research-article

Duplicate Detection in Programming Question Answering Communities

Published: 17 April 2018

Abstract

Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions occur frequently on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on one hand, reduces the laborious effort moderators spend before closing a question and, on the other hand, helps question issuers find answers quickly. A number of studies have looked into related problems, but very few target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing works frame the task as a supervised learning problem over question pairs and rely only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often left unaddressed. To tackle these issues, we model duplicate detection as a two-stage “ranking-classification” problem over question pairs. In the first stage, we rank the historical questions by their similarity to the newly issued question and select the top-ranked ones as candidates, which reduces the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics of question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method performs well, in some cases improving on state-of-the-art benchmarks by up to 25%.
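
The two-stage "ranking-classification" idea can be illustrated with a minimal sketch. This is not the authors' implementation: term-frequency cosine similarity stands in for the ranking model, two simple overlap scores stand in for the paper's textual and latent-semantic features, and a fixed threshold stands in for a trained classifier. All function names and parameter values here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word-like tokens."""
    return re.findall(r"[a-z0-9_+#]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(tokens_a, tokens_b):
    """Jaccard overlap between the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def rank_candidates(new_question, history, k=2):
    """Stage 1 (assumed ranking model): keep the k most similar past questions."""
    query = Counter(tokenize(new_question))
    scored = [(cosine(query, Counter(tokenize(h))), h) for h in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [h for _, h in scored[:k]]


def pair_features(q1, q2):
    """Stage 2 features (placeholders for the paper's richer feature set)."""
    t1, t2 = tokenize(q1), tokenize(q2)
    return [cosine(Counter(t1), Counter(t2)), jaccard(t1, t2)]


def is_duplicate(features, threshold=0.8):
    """Stand-in for a trained classifier: mean feature value vs. a threshold."""
    return sum(features) / len(features) >= threshold


def detect_duplicates(new_question, history, k=2, threshold=0.8):
    """Rank first to shrink the search space, then classify each candidate pair."""
    candidates = rank_candidates(new_question, history, k)
    return [c for c in candidates
            if is_duplicate(pair_features(new_question, c), threshold)]


history = [
    "How do I reverse a list in Python?",
    "What is a null pointer exception in Java?",
    "How to reverse a string in Python?",
]
duplicates = detect_duplicates("How do I reverse a list in Python", history)
```

Ranking before classifying means the pairwise step runs on only k candidates rather than on every historical question, which is the search-space reduction the abstract describes.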

Published in

ACM Transactions on Internet Technology, Volume 18, Issue 3
Special Issue on Artificial Intelligence for Security and Privacy and Regular Papers
August 2018
314 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3185332
Editor: Munindar P. Singh

Copyright © 2018 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 17 April 2018
• Accepted: 1 November 2017
• Revised: 1 October 2017
• Received: 1 June 2017

Qualifiers

• research-article
• Research
• Refereed
