The Principles and Limits of Algorithm-in-the-Loop Decision Making

Published: 07 November 2019

Abstract

The rise of machine learning has fundamentally altered decision making: rather than being made solely by people, many important decisions are now made through an "algorithm-in-the-loop" process where machine learning models inform people. Yet insufficient research has considered how the interactions between people and models actually influence human decisions. Society lacks both clear normative principles regarding how people should collaborate with algorithms and robust empirical evidence about how people do collaborate with algorithms. Given research suggesting that people struggle to interpret machine learning models and to incorporate them into their decisions---sometimes leading these models to produce unexpected outcomes---it is essential to consider how different ways of presenting models and structuring human-algorithm interactions affect the quality and type of decisions made. This paper contributes to such research in two ways. First, we posited three principles as essential to ethical and responsible algorithm-in-the-loop decision making. Second, through a controlled experimental study on Amazon Mechanical Turk, we evaluated whether people satisfy these principles when making predictions with the aid of a risk assessment. We studied human predictions in two contexts (pretrial release and financial lending) and under several conditions for risk assessment presentation and structure. Although these conditions did influence participant behaviors and in some cases improved performance, only one of the three principles was consistently satisfied. Under all conditions, our study participants 1) were unable to effectively evaluate the accuracy of their own or the risk assessment's predictions, 2) did not calibrate their reliance on the risk assessment based on the risk assessment's performance, and 3) exhibited bias in their interactions with the risk assessment.
These results highlight the urgent need to expand our analyses of algorithmic decision making aids beyond evaluating the models themselves to investigating the full sociotechnical contexts in which people and algorithms interact.

