Abstract
Machine learning algorithms and recommender systems trained on human ratings are in widespread use today. However, human ratings can carry a high level of uncertainty and are subjective, influenced by demographic and psychological factors. We propose a new approach to the design of object classes from human ratings: the use of entire rating distributions to construct classes. By avoiding aggregation in class definition, our approach loses no information and can handle highly volatile or conflicting ratings. The approach is based on the Earth Mover's Distance (EMD), a measure of distance between distributions. We evaluate the proposed approach on four datasets obtained from diverse Web content and movie quality evaluation services and experiments. We show that clusters discovered in these datasets using the EMD measure have a consistent and simple interpretation. Quality classes defined using entire rating distributions can be fitted to clusters of distributions in the four datasets using two parameters, resulting in a good overall fit. We also consider the impact of the composition of small samples on the distributions that form the basis of our classification approach. We show that classification using distributions based on small samples of 10 evaluations remains robust to several demographic and psychological variables. This observation suggests that the proposed approach can be used in practice for quality evaluation, even with highly uncertain and subjective ratings.
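To make the central idea concrete, here is a minimal sketch of the EMD between two rating distributions on an ordered scale. For one-dimensional histograms with unit ground distance between adjacent rating categories, the EMD reduces to the sum of absolute differences of the cumulative distributions; the example distributions below (a polarized, U-shaped one versus a unimodal one) are illustrative and not taken from the paper's datasets.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two discrete distributions over an
    ordered rating scale, assuming unit distance between adjacent categories.
    Equals the sum of absolute differences of the cumulative distributions."""
    assert len(p) == len(q), "distributions must share the same rating scale"
    total = 0.0   # accumulated EMD
    acc = 0.0     # running CDF difference
    for pi, qi in zip(p, q):
        acc += pi - qi
        total += abs(acc)
    return total

# Hypothetical 5-point rating distributions:
u_shaped = [0.4, 0.1, 0.0, 0.1, 0.4]  # polarized (conflicting) ratings
unimodal = [0.0, 0.1, 0.8, 0.1, 0.0]  # consensual ratings
print(emd_1d(u_shaped, unimodal))  # → 1.6
```

Note that, unlike aggregate measures such as the mean (identical for both example distributions), the EMD distinguishes the polarized distribution from the consensual one, which is exactly what class definition from entire distributions exploits.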
Web Content Classification Using Distributions of Subjective Quality Evaluations