Abstract
Machine learning (ML) has become a crucial component in software products, either as part of the user experience or used internally by software teams. Prior studies have explored how ML is affecting development team roles beyond data scientists, including user experience designers, program managers, developers, and operations engineers. However, there has been little investigation of how team members in different roles communicate about ML, in particular about the quality of models. We use the general term quality to look beyond technical issues of model evaluation, such as accuracy and overfitting, to any issue affecting whether a model is suitable for use, including ethical, engineering, operations, and legal considerations. What challenges do teams face in discussing the quality of ML models? What work practices mitigate those challenges? To address these questions, we conducted a mixed-methods study at a large software company, first interviewing 15 employees in a variety of roles, then surveying 168 employees to broaden our understanding. We found several challenges, including a mismatch between user-focused and model-focused notions of performance, misunderstandings about the capabilities and limitations of evolving ML technology, and difficulties in understanding concerns beyond one's own role. We also found several mitigation strategies, including the use of demos during discussions to keep the team customer-focused.
How Teams Communicate about the Quality of ML Models: A Case Study at an International Technology Company