ABSTRACT
As research and industry move towards large-scale models capable of numerous downstream tasks, the complexity of understanding multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset’s origins, development, intent, ethical considerations and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. It requires consistency and comparability across the documentation of all datasets involved, and as such documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset’s lifecycle for responsible AI development. These summaries provide explanations of the processes and rationales that shape the data and, consequently, the models: upstream sources; data collection and annotation methods; training and evaluation methods; intended use; and decisions that affect model performance. We also present frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over 20 Data Cards.