Abstract
Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions occur frequently on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on one hand, spares moderators this laborious effort before they close a question and, on the other hand, helps question issuers find answers quickly. A number of studies have looked into related problems, but very few target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing works frame the task as a supervised learning problem over question pairs and rely only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often left unaddressed. To tackle these issues, we model duplicate detection as a two-stage “ranking-classification” problem over question pairs. In the first stage, we rank the historical questions by their similarity to the newly issued question and select the top-ranked ones as candidates, which reduces the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics of question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method is effective, in some cases improving on state-of-the-art benchmarks by up to 25%.
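The two-stage structure described above can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the term-frequency cosine ranking, the two pair features, the fixed threshold standing in for a trained classifier, and all function names are assumptions chosen only to show how ranking narrows the candidates before classification.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(new_question, history, k=2):
    """Stage 1: rank historical questions by similarity, keep the top k."""
    query = Counter(tokenize(new_question))
    scored = [(cosine(query, Counter(tokenize(h))), h) for h in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def pair_features(q1, q2):
    """Stage 2 features (illustrative): cosine and Jaccard overlap."""
    t1, t2 = set(tokenize(q1)), set(tokenize(q2))
    union = t1 | t2
    jaccard = len(t1 & t2) / len(union) if union else 0.0
    cos = cosine(Counter(tokenize(q1)), Counter(tokenize(q2)))
    return [cos, jaccard]


def is_duplicate(features, threshold=0.5):
    """Stand-in for a trained classifier: mean feature vs. a threshold."""
    return sum(features) / len(features) >= threshold


def detect_duplicates(new_question, history, k=2):
    """Run stage 1, then classify each surviving candidate pair."""
    candidates = rank_candidates(new_question, history, k)
    return [h for _, h in candidates
            if is_duplicate(pair_features(new_question, h))]
```

In a realistic setting the threshold rule would be replaced by a classifier trained on labeled question pairs, and the feature list would be extended with latent-semantic signals; only the candidates returned by the ranking stage ever reach that classifier.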
Index Terms
Duplicate Detection in Programming Question Answering Communities