Abstract
Data scientists often collaborate with clients to analyze data to meet a client's needs. What does the end-to-end workflow of a data scientist's collaboration with clients look like throughout the lifetime of a project? To investigate this question, we interviewed ten data scientists (5 female, 4 male, 1 non-binary) in diverse roles across industry and academia. We discovered that they work with clients in a six-stage outer-loop workflow, which involves 1) laying groundwork by building trust before a project begins, 2) orienting to the constraints of the client's environment, 3) collaboratively framing the problem, 4) bridging the gap between data science and domain expertise, 5) the inner loop of technical data analysis work, and 6) counseling to help clients emotionally cope with analysis results. This novel outer-loop workflow contributes to CSCW by expanding the notion of what collaboration means in data science beyond the widely known inner-loop technical workflow stages of acquiring, cleaning, analyzing, modeling, and visualizing data. We conclude by discussing the implications of our findings for data science education, parallels to design work, and unmet needs for tool development.
Orienting, Framing, Bridging, Magic, and Counseling: How Data Scientists Navigate the Outer Loop of Client Collaborations in Industry and Academia