DOI: 10.1145/3531146.3533231
research-article
Open Access

Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI

Published: 20 June 2022

ABSTRACT

As research and industry move towards large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset’s origins, development, intent, ethical considerations and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. It requires consistency and comparability across the documentation of all datasets involved, and as such documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset’s lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models, such as upstream sources, data collection and annotation methods; training and evaluation methods, intended use; or decisions affecting model performance. We also present frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over 20 Data Cards.

Published in

FAccT '22: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency
June 2022
2351 pages
ISBN: 9781450393522
DOI: 10.1145/3531146

Copyright © 2022 Owner/Author

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article, Research, Refereed limited
