Abstract
AI-enabled decision support systems have repeatedly failed in real-world applications even though the underlying model operated as designed. Often this was because the system was used in an unexpected manner. Our goal is to enable better prediction of how systems will be used prior to their implementation, and to improve existing designs, by taking human behavior into account. Collecting data on that behavior poses several challenges. Without access to an existing prediction engine, the behavior of such a system must be simulated. This simulation must include not just the behavior of the underlying model but also the context in which the decision will be made in the real world. Additionally, collecting statistically valid samples requires that test subjects make repeated choices under slightly varied conditions. Unfortunately, under such repetitious conditions fatigue can quickly set in. Games allow us to address both of these challenges by providing both systems context and narrative context. Systems context can be used to convey, within the game environment itself, some or all of the information the player needs to make a decision, which can help avoid the onset of fatigue. Narrative context can provide a broader environment within which the simulated system operates, adding a sense of progress, showing the effect of decisions, adding perceived social norms, and setting incentives and stakes. This broader environment can further prevent player fatigue while replicating many of the external factors that might affect choices in the real world. In this paper we describe the design of the Human-AI Decision Evaluation System (HADES), a test harness capable of interfacing with a game environment, simulating the behavior of an AI-enabled decision support system, and collecting the results of human decision making based upon such a system's predictions. Additionally, we present an analysis of data collected by HADES while it was interfaced with a visual novel game focused on software cyber-risk assessment.
Index Terms
Play for Real(ism) - Using Games to Predict Human-AI interactions in the Real World