ABSTRACT
Massive Open Online Courses (MOOCs) have the potential to revolutionize higher education with their wide outreach and accessibility, but they require instructors to come up with scalable alternatives to traditional student evaluation. Peer grading, in which students assess each other's work, is a promising approach to tackling the problem of evaluation at scale, since the number of "graders" naturally scales with the number of students. However, students are not trained in grading, which means that one cannot expect the same level of grading skill as in traditional settings. Drawing on broad evidence that ordinal feedback is easier to provide and more reliable than cardinal feedback [5, 38, 29, 9], it is therefore desirable to allow peer graders to make ordinal statements (e.g., "project X is better than project Y") rather than requiring them to make cardinal statements (e.g., "project X is a B-"). Thus, in this paper we study the problem of automatically inferring student grades from ordinal peer feedback, in contrast to existing methods that require cardinal peer feedback. We formulate ordinal peer grading as a type of rank aggregation problem, and explore several probabilistic models under which to estimate student grades and grader reliability. We study the applicability of these methods using peer grading data collected from a real class, with instructor and TA grades as a baseline, and demonstrate the efficacy of ordinal feedback techniques in comparison to existing cardinal peer grading methods. Finally, we compare these peer grading techniques to traditional evaluation techniques.
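To make the rank aggregation formulation concrete, the sketch below infers an aggregate ranking of submissions from ordinal peer feedback using a plain Bradley-Terry model [8]: each submission gets a latent merit score, each grader's reported ordering is decomposed into its implied pairwise comparisons, and the scores are fit by maximum likelihood. This is only a minimal illustration under assumed inputs (the orderings, learning rate, iteration count, and regularization strength are invented for the example); it is not the estimator studied in the paper, and it omits the per-grader reliability parameters that the paper's models estimate.

```python
# Minimal sketch: rank aggregation from ordinal peer feedback via a plain
# Bradley-Terry model [8]. NOT the paper's estimator; per-grader reliability
# is ignored here, and all data/hyperparameters below are illustrative.
import math

def bradley_terry_scores(pairs, n_items, iters=500, lr=0.05, reg=0.1):
    """Fit latent merit scores s by gradient ascent on the regularized
    Bradley-Terry log-likelihood of observed (winner, loser) pairs, where
    P(w beats l) = exp(s_w) / (exp(s_w) + exp(s_l))."""
    s = [0.0] * n_items
    for _ in range(iters):
        grad = [-reg * x for x in s]  # Gaussian prior keeps the optimum finite
        for w, l in pairs:
            p = 1.0 / (1.0 + math.exp(s[l] - s[w]))  # model's P(w beats l)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        s = [x + lr * g for x, g in zip(s, grad)]
    return s

# Each peer grader orders a small bundle of submissions (best first);
# each reported ordering is decomposed into its implied pairwise comparisons.
peer_orderings = [
    [2, 0, 1],  # grader A: submission 2 best, then 0, then 1
    [2, 1, 0],  # grader B
    [0, 2, 1],  # grader C
]
pairs = [(o[i], o[j]) for o in peer_orderings
         for i in range(len(o)) for j in range(i + 1, len(o))]

scores = bradley_terry_scores(pairs, n_items=3)
print("inferred ranking (best first):",
      sorted(range(3), key=lambda i: -scores[i]))
```

Sorting submissions by the fitted scores yields the aggregate ranking; positions in that ranking can then be mapped to grades, for instance via a curve or instructor-graded anchor submissions (one plausible post-processing step, not necessarily the paper's).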
References
[1] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1--23:27, Nov. 2008.
[2] K. J. Arrow. Social Choice and Individual Values. Yale University Press, 2nd edition, Sept. 1970.
[3] J. A. Aslam and M. Montague. Models for metasearch. In SIGIR, pages 276--284, 2001.
[4] Y. Bachrach, T. Graepel, T. Minka, and J. Guiver. How to grade a test without knowing the answers: A Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In ICML, 2012.
[5] W. Barnett. The modern theory of consumer behavior: Ordinal or cardinal? The Quarterly Journal of Austrian Economics, 6(1):41--65, 2003.
[6] M. Bashir, J. Anderton, J. Wu, P. B. Golbus, V. Pavlu, and J. A. Aslam. A document rating system for preference judgements. In SIGIR, pages 909--912, 2013.
[7] L. Bouzidi and A. Jaillet. Can online peer assessment be trusted? Educational Technology & Society, 12(4):257--268, 2009.
[8] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324--345, 1952.
[9] B. Carterette, P. N. Bennett, D. M. Chickering, and S. T. Dumais. Here or there: Preference judgments for relevance. In ECIR, pages 16--27, 2008.
[10] C.-C. Chang, K.-H. Tseng, P.-N. Chou, and Y.-H. Chen. Reliability and validity of web-based portfolio peer assessment: A case study for a senior high school's students taking computer course. Comput. Educ., 57(1):1306--1316, Aug. 2011.
[11] X. Chen, P. N. Bennett, K. Collins-Thompson, and E. Horvitz. Pairwise ranking aggregation in a crowdsourced setting. In WSDM, pages 193--202, 2013.
[12] J. Diez, O. Luaces, A. Alonso-Betanzos, A. Troncoso, and A. Bahamonde. Peer assessment in MOOCs using preference learning via matrix factorization, 2013.
[13] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW, pages 613--622, 2001.
[14] S. Freeman and J. W. Parks. How accurate is peer grading? CBE-Life Sciences Education, 9(4):482--488, 2010.
[15] J. Guiver and E. Snelson. Bayesian inference for Plackett-Luce ranking models. In ICML, pages 377--384, 2009.
[16] J. Haber. http://degreeoffreedom.org/between-two-worlds-moocs-and-assessment.
[17] J. Haber. http://degreeoffreedom.org/mooc-assignments-screwing/, Oct. 2013.
[18] R. Herbrich, T. Minka, and T. Graepel. TrueSkill™: A Bayesian skill rating system. In NIPS, pages 569--576, 2007.
[19] P. G. Ipeirotis and P. K. Paritosh. Managing crowdsourced human computation: A tutorial. In WWW, pages 287--288, 2011.
[20] M. Kendall. Rank Correlation Methods. Griffin, London, 1948.
[21] C. Kenyon-Mathieu and W. Schudy. How to rank with few errors. In STOC, pages 95--103, 2007.
[22] C. Kulkarni, K. Wei, H. Le, D. Chia, K. Papadopoulos, J. Cheng, D. Koller, and S. Klemmer. Peer and self assessment in massive online classes. ACM Trans. Comput.-Hum. Interact., 20(6):33:1--33:31, Dec. 2013.
[23] G. Lebanon and J. D. Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In ICML, pages 363--370, 2002.
[24] T.-Y. Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225--331, Mar. 2009.
[25] T. Lu and C. Boutilier. Learning Mallows models with pairwise preferences. In ICML, pages 145--152, June 2011.
[26] T. Lu and C. E. Boutilier. The unavailable candidate model: A decision-theoretic view of social choice. In EC, pages 263--274, 2010.
[27] R. D. Luce. Individual Choice Behavior: A Theoretical Analysis. Wiley, 1959.
[28] C. L. Mallows. Non-null ranking models. Biometrika, 44(1/2):114--130, 1957.
[29] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. The Psychological Review, 63(2):81--97, March 1956.
[30] M. Mostert and J. D. Snowball. Where angels fear to tread: Online peer-assessment in a large first-year class. Assessment & Evaluation in Higher Education, 38(6):674--686, 2013.
[31] S. Niu, Y. Lan, J. Guo, and X. Cheng. Stochastic rank aggregation. CoRR, abs/1309.6852, 2013.
[32] C. Piech, J. Huang, Z. Chen, C. Do, A. Ng, and D. Koller. Tuned models of peer assessment in MOOCs. In EDM, 2013.
[33] R. L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society, Series C (Applied Statistics), 24(2):193--202, 1975.
[34] T. Qin, X. Geng, and T.-Y. Liu. A new probabilistic model for rank aggregation. In NIPS, pages 1948--1956, 2010.
[35] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR, 11:1297--1322, Aug. 2010.
[36] J. Rees. http://www.insidehighered.com/views/2013/03/05/essays-flaws-peer-grading-moocs.
[37] N. Shah, J. Bradley, A. Parekh, M. Wainwright, and K. Ramchandran. A case for ordinal peer-evaluation in MOOCs, 2013.
[38] N. Stewart, G. D. A. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112:881--911, 2005.
[39] L. L. Thurstone. The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 27:384--400, 1927.
[40] M. N. Volkovs and R. S. Zemel. A flexible generative model for preference aggregation. In WWW, pages 479--488, 2012.