research-article
Open Access

Web Content Classification Using Distributions of Subjective Quality Evaluations

Published: 15 November 2016

Abstract

Machine learning algorithms and recommender systems trained on human ratings are widely in use today. However, human ratings may carry a high level of uncertainty and are subjective, influenced by demographic or psychological factors. We propose a new approach to the design of object classes from human ratings: the use of entire rating distributions to construct classes. By avoiding aggregation for class definition, our approach loses no information and can deal with highly volatile or conflicting ratings. The approach is based on the concept of the Earth Mover's Distance (EMD), a measure of distance between distributions. We evaluate the proposed approach on four datasets obtained from diverse Web content and movie quality evaluation services and experiments. We show that clusters discovered in these datasets using the EMD measure have a consistent and simple interpretation. Quality classes defined using entire rating distributions can be fitted to clusters of distributions in the four datasets using two parameters, resulting in a good overall fit. We also consider the impact of the composition of small samples on the distributions that are the basis of our classification approach. We show that using distributions based on small samples of 10 evaluations remains robust to several demographic and psychological variables. This observation suggests that the proposed approach can be used in practice for quality evaluation, even for highly uncertain and subjective ratings.
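The core measure behind the proposed classification is the Earth Mover's Distance between rating histograms. For one-dimensional distributions over an ordered rating scale, the EMD reduces to the accumulated absolute difference of the cumulative distribution functions. The sketch below is a minimal illustration under that assumption; the rating histograms are hypothetical examples, not data from the paper's four datasets.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two normalized discrete
    distributions over the same ordered rating scale (e.g. 1-5 stars).
    In 1-D, EMD equals the sum of absolute CDF differences, scaled by
    the (unit) spacing between adjacent rating levels."""
    assert len(p) == len(q), "distributions must share the same scale"
    cum_diff, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cum_diff += pi - qi      # running difference of the CDFs
        total += abs(cum_diff)   # mass that must still be "moved"
    return total

# Hypothetical rating histograms on a 5-point scale (each sums to 1):
mostly_high = [0.05, 0.10, 0.20, 0.40, 0.25]  # unimodal, high ratings
polarized   = [0.40, 0.10, 0.00, 0.10, 0.40]  # conflicting ratings
centered    = [0.10, 0.20, 0.40, 0.20, 0.10]  # unimodal, centered

# A polarized distribution and a unimodal one can share the same mean,
# yet the EMD keeps them apart -- the information lost by averaging.
print(emd_1d(mostly_high, centered))
print(emd_1d(mostly_high, polarized))
```

Distances computed this way can feed any standard clustering algorithm that accepts a pairwise distance matrix, which is how clusters of distributions such as those described in the abstract could be discovered.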

