Abstract
The rise of machine learning has fundamentally altered decision making: rather than being made solely by people, many important decisions are now made through an "algorithm-in-the-loop" process in which machine learning models inform people. Yet insufficient research has considered how the interactions between people and models actually influence human decisions. Society lacks both clear normative principles regarding how people should collaborate with algorithms and robust empirical evidence about how people do collaborate with algorithms. Given research suggesting that people struggle to interpret machine learning models and to incorporate them into their decisions (sometimes leading these models to produce unexpected outcomes), it is essential to consider how different ways of presenting models and structuring human-algorithm interactions affect the quality and type of decisions made. This paper contributes to such research in two ways. First, we posit three principles as essential to ethical and responsible algorithm-in-the-loop decision making. Second, through a controlled experiment on Amazon Mechanical Turk, we evaluate whether people satisfy these principles when making predictions with the aid of a risk assessment. We studied human predictions in two contexts (pretrial release and financial lending) and under several conditions for risk assessment presentation and structure. Although these conditions influenced participant behaviors and in some cases improved performance, only one desideratum was consistently satisfied. Under all conditions, our study participants 1) were unable to effectively evaluate the accuracy of their own or the risk assessment's predictions, 2) did not calibrate their reliance on the risk assessment based on its performance, and 3) exhibited bias in their interactions with the risk assessment.
These results highlight the urgent need to expand our analyses of algorithmic decision making aids beyond evaluating the models themselves to investigating the full sociotechnical contexts in which people and algorithms interact.
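To make the abstract's two quantitative failure modes concrete, the sketch below shows how studies in this area commonly operationalize them: prediction accuracy via the Brier score (the mean squared error of probabilistic predictions) and reliance on the algorithm via "weight on advice" (how far a participant's final estimate moves from their initial estimate toward the risk assessment's estimate). This is a hypothetical illustration, not the paper's analysis code; the function names and example numbers are invented.

```python
# Hypothetical illustration of two measures used in algorithm-in-the-loop
# studies (not the paper's actual analysis code).

def brier_score(predictions, outcomes):
    """Mean squared error between probabilistic predictions and 0/1 outcomes.
    Lower is better; always guessing 0.5 scores 0.25."""
    return sum((p - y) ** 2 for p, y in zip(predictions, outcomes)) / len(predictions)

def weight_on_advice(initial, advised, final):
    """Fraction of the gap between the participant's initial estimate and the
    risk assessment's estimate that the final estimate closes:
    0 = advice ignored, 1 = advice adopted fully."""
    if advised == initial:
        return 0.0
    return (final - initial) / (advised - initial)

# A participant initially estimates 30% risk, the risk assessment says 70%,
# and the participant settles on 50%: they moved halfway toward the advice.
print(weight_on_advice(0.30, 0.70, 0.50))  # 0.5

# Accuracy of three probabilistic predictions against observed outcomes.
print(brier_score([0.9, 0.2, 0.6], [1, 0, 1]))
```

Under this framing, "calibrating reliance" would mean weight on advice rising when the risk assessment's Brier score is better than the participant's own, and falling when it is worse; the paper's finding is that participants' reliance did not track this difference.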
The Principles and Limits of Algorithm-in-the-Loop Decision Making