Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
DOI: https://doi.org/10.1145/3531146.3533231
FAccT '22: 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, June 2022
As research and industry move towards large-scale models capable of numerous downstream tasks, the complexity of understanding the multi-modal datasets that give nuance to models rapidly increases. A clear and thorough understanding of a dataset's origins, development, intent, ethical considerations, and evolution becomes a necessary step for the responsible and informed deployment of models, especially those in people-facing contexts and high-risk domains. However, the burden of this understanding often falls on the intelligibility, conciseness, and comprehensiveness of the documentation. It requires consistency and comparability across the documentation of all datasets involved, and as such, documentation must be treated as a user-centric product in and of itself. In this paper, we propose Data Cards for fostering transparent, purposeful, and human-centered documentation of datasets within the practical contexts of industry and research. Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of the processes and rationales that shape the data and consequently the models, such as upstream sources, data collection and annotation methods, training and evaluation methods, intended use, and decisions affecting model performance. We also present frameworks that ground Data Cards in real-world utility and human-centricity. Using two case studies, we report on desirable characteristics that support adoption across domains, organizational structures, and audience groups. Finally, we present lessons learned from deploying over 20 Data Cards.
ACM Reference Format:
Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. 2022. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT '22), June 21–24, 2022, Seoul, Republic of Korea. ACM, New York, NY, USA, 51 pages. https://doi.org/10.1145/3531146.3533231
1 INTRODUCTION
The challenge of transparency in machine learning (ML) models and datasets continues to receive increasing attention from academia and industry [1, 2]. Often, the goal has been to attain greater visibility into ML models and datasets by exposing source code [4] and contribution trails [8], introducing ML-driven data analysis methods [19], and introducing diverse oversight [18]. Transparency and explainability of model outcomes through the lens of datasets has become a central concern for regulators and government bodies internationally. However, attempts to introduce standardized, practical, and sustainable mechanisms for transparency that create value at scale have met with limited success in research and production contexts. This reflects the real-world constraints imposed by the diversity of goals, workflows, and backgrounds of the individual stakeholders participating in the life cycles of datasets and artificial intelligence (AI) systems [11, 13, 14].
As a step towards creating value that connects dataset success to research and production experiences, we propose a new framework for transparent and purposeful documentation of datasets, called Data Cards [26]. A Data Card contains a structured collection of summaries gathered over the life cycle of a dataset about observable (e.g., dataset attributes) and unobservable (e.g., intended use cases) aspects needed for decisions in organizational and practice-oriented contexts. Beyond metadata, Data Cards include explanations, rationales, and instructions pertaining to the provenance, representation, usage, and fairness-informed evaluations of datasets for ML models.
Data Cards emphasize information and context that shape the data, but cannot be inferred from the dataset directly. They are designed as boundary objects [28] that should be easily available in accessible formats at important steps of a user journey for a diverse set of readers. Data Cards encourage informed decision making about data usage when building and evaluating ML models for products, policy, and research. Data Cards complement other longer-form and domain-specific documentation frameworks for ethical reporting (see Appendix A), such as Model Cards [23], Data Statements [9], Datasheets for Datasets [15], and FactSheets [6].
Data Cards are accompanied by frameworks to adapt them to a variety of datasets and organizational contexts. These frameworks are pivotal to establishing common ground across stakeholders and enable diverse input into decisions. Our case studies demonstrate that creators of Data Cards were able to discover surprising future opportunities to improve their dataset design decisions, such as considering reasons for a high percentage of unknown values and the need to create a shared understanding of lexicons used in dataset labeling during problem formulation.
In summary, our contributions are four-fold:
- We explain our multi-pronged approach in the setting of a large-scale technology company and present a typology of stakeholders that span a typical dataset lifecycle. We translate outcomes from our development methodology into corresponding objectives and principles for the creation of Data Cards to systematically reduce the knowledge asymmetries across stakeholders.
- We introduce a transparency artifact for at-scale production and research environments, Data Cards— structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development, and describe the content (What information to present), design (How to present information), and evaluation (Assess the efficacy of information) of Data Cards.
- We propose three frameworks for the construction of Data Cards that focus on information organization, question framing, and answer evaluation, respectively. Specifically, we describe OFTEn, our novel knowledge acquisition framework to arm dataset producers with a robust, deliberate, and repeatable approach for producing transparent documentation.
- We present case studies on the creation of Data Cards for a computer vision dataset and a language dataset to demonstrate their impact as boundary objects in practice, and discuss epistemic and organizational lessons learned in scaling Data Cards.
Our collective efforts suggest that in addition to comprehensive transparency artifacts1, the creation of structured frameworks is not only beneficial in adding nuance to the dataset documentation process itself, but also transformational in introducing human-centric and responsible practices when using datasets in ML applications.
2 DEVELOPMENT METHODOLOGY
Over the course of 24 months, multiple efforts were employed to design Data Cards and its supporting frameworks, borrowing from methods in human-centered design, participatory design, and human-computer interaction. We worked with dataset and ML teams in a large technology company to iteratively create Data Cards, refining our design decisions to respond to challenges in production contexts. In parallel, we ran studies and workshops to identify opportunities and challenges in the implementation of Data Cards. In this section, we detail the various efforts and describe their impact on the development of Data Cards.
Specifically, we worked with 12 teams in a large technology company to create 22 Data Cards that describe image, language, tabular, video, audio, and relational datasets in production settings. Teams ranged in size from four to over 20 members, and comprised some combination of research software engineers, research scientists, data analysts, and data program managers. This allowed us to observe each team's documentation workflows, collaborative information gathering practices, information requests from downstream stakeholders, and review and assessment practices. Our co-creative approach, in conjunction with feedback received across other studies, yielded continuous improvements in the usability and utility of each new Data Card created.
As we worked with ML dataset and model owners to produce prototypical transparency artifacts, drafts were evaluated in an external focus group with nine participants. These participants represented non-expert, technical use cases from User Experience (UX) and Human-Computer Interaction (HCI) research, Policy, Product Design & Development, Academia, and Law. Participants were asked to complete a paper-based questionnaire to reflect on their ideals of transparency, used as a basis for broader discussions on transparency. Participants were then provided with printed drafts which they annotated with their feedback. This allowed us to capture specific feedback and establish relationships across themes and topics in the artifacts. We concluded with a discussion reflecting on their use of transparency artifacts and an offline survey to capture their overall expectations. Through this focus group, we were able to arrive at a working definition and values of transparency relevant to domains within AI product life cycles. We further synthesized feedback on the transparency artifacts into an initial set of recommendations to combat common reader-side challenges, which were then offered as guidance to teams creating Data Cards.
Based on our experience in co-creating Data Cards with teams, we were able to consolidate recurring and overlapping questions into a canonical template that documents 31 different aspects of datasets. Questions that were modality-specific were consolidated into appendable blocks, but largely left out of the canonical template. A follow-up internal MaxDiff survey (n=191) was conducted to understand the information needs in dataset documentation within our company. Through this survey, we learned the relative importance of the 31 aspects documented in a Data Card and how these vary by dataset modality and job function, and we incorporated these insights into our design of Data Cards. We observed the need for a generative framework that Data Card creators could use to add or tailor questions for new datasets without compromising the readability, navigability, comparability, and transparency intrinsic to the Data Card.
Our internal study recruited 30 experts spanning sixteen teams within our company. Participants represented stakeholders who (a) create datasets designed for ML use cases and (b) use or review datasets for applied and foundational model development. Over the course of three days, this group engaged in various participatory activities to articulate use cases for transparency artifacts, information requirements, and strategies for evaluating transparency artifacts. Participants were then invited to actively contribute to future discussions of Data Cards and their development as it related to the participants' specific data domains. We found that despite their deep expertise and experience, participants were unable to provide examples of exemplary documentation, but were quick to furnish "excellent" examples of poor documentation. This pointed us to the need for a set of dimensions that can be used to assess transparency and documentation without conflating the documentation with the dataset.
Further, we developed a structured, participatory, workshop-based approach to engage cross-functional stakeholders when creating transparent metadata schema for dataset documentation [25]. This methodology was open-sourced and tested in the data domains of human computation, geo-spatial ML, multi-modal data operations, healthcare data, community-engaged research, and large-scale multitask language models. Common to all workshops, we found that participating teams often started with an intuition about the benefits of transparency in dataset documentation. We found that teams necessarily needed to align on a shared definition of transparency, the audience, and the audience's requirements as pre-requisites to defining the content, infrastructure, and processes needed to scale Data Card creation. We observed organization-specific factors that can impact the long-term sustainability of scaling Data Cards, such as knowledge asymmetries between stakeholders, organizational processes that incentivize the creation and maintenance of documentation, infrastructure compatibility and readiness, and communication culture across and within stakeholder groups. While a detailed discussion of our participatory methodology for developing transparency metadata schemas and our survey is beyond the scope of this paper, we introduce relevant critical frameworks from our methodology.
2.1 Framing Transparency in the Context of Data Cards
Despite the diverse backgrounds of participants across studies, the shared dominant perception was that transparency artifacts were ironically opaque. The opacity of documentation, quite simply, increases when the language used is technical, dense, and presumptive of a reader's background, making it difficult for non-technical stakeholders to interpret. This, in turn, leads to sub-optimal decision making, and propagates asymmetries in power structures and myopic AI data practices. Further, focus group and workshop participants described transparency as "subjective", "audience-specific", and "contextual". To that end, we frame our definition of transparency as "a clear, easily understandable, and plain language explanation of what something is, what it does and why it does that", to emphasize the domain-agnostic and inclusive prerogative of transparency artifacts. In Table 1, we present eight characteristics of transparency that are vital for a robust discussion of the benefits, values, ethics, and limitations of AI datasets.
| Transparency Characteristic | Description |
|---|---|
| Balance opposites | For example, disclosing information about AI systems without leaving creators vulnerable beyond reason, reporting fairness analyses without legitimizing inequitable or unfair systems, introducing standards for transparency that are wholly automated or become checklists. |
| Increase in expectations | Any information included in a transparency artifact can be expected to receive greater scrutiny. |
| Constant availability | Users want access to transparency information at multiple levels, even if they don't need to use it. |
| Require checks and balances | Transparency artifacts and their creation must be amenable to third-party evaluation, with the caveat that excessive transparency can leave an AI system vulnerable to adversarial actors. |
| Subjective interpretations | Stakeholders have different definitions and unique ideas on what constitutes transparency. |
| Trust enabler | Accessible and relevant information about AI systems increases the willingness of a data consumer or user to take a risk based on the expectation of benefits from the data, algorithms, and the products they use. |
| Reduce knowledge asymmetries | Cross-disciplinary stakeholders are more effective when they possess a shared mental model and vocabulary to describe aspects of the AI system. |
| Reflects human values | Transparency arises from both technical and non-technical disclosure about assumptions, facts, and alternatives. |
Data Cards aim to provide a single, scalable artifact that allows non-traditional stakeholders across product, policy, and research to understand aspects of datasets and how they are used, in order to make informed decisions. We found that stakeholders review role-related topics in Data Cards with amplified scrutiny, and follow-up questions progressively increase in specificity, which suggests that transparency is attained when we establish a shared and Socratic understanding of datasets based on the ability to ask and answer questions over time.
2.2 A Typology of Stakeholders
At first, our audience for Data Cards was fairly broad, comprising a mix of experts and non-experts. Frameworks proposed by Suresh et al. [29] have distinguished higher-level domain goals and objectives from lower-level interpretability tasks, but are limited by their epistemological framing and vast scope. We created a broad yet decomposable typology describing three stakeholder groups in a dataset's life cycle, allowing us to consider how cross-functional stakeholders engage in decision-making on the basis of a single transparency artifact.
In our typology, Producers are upstream creators of the dataset and its documentation, responsible for dataset collection, ownership, launch, and maintenance. We observed that producers often subscribe to a single, informal notion of "users" of Data Cards—loosely characterized by high data domain expertise, familiarity with similar datasets, and deep technical knowledge. However, in practice, we find that only a few readers or Agents actually meet all these requirements.
Agents are stakeholders who read transparency reports and possess the agency to use the described datasets or AI systems, or to determine how they or others might use them. After testing prototypes and proofs of concept with different audience groups, it became clear that agents with operational needs and agents with reviewer needs were distinct categories. Reviewers include stakeholders who may never directly use the dataset, but will engage with the Data Card (e.g., reviewers or non-technical subject matter experts). Agents may or may not possess the technical expertise to navigate information presented in typical dataset documentation, but often have access to such expertise as required.
Additionally, agents are distinct from Users, who are individuals and representatives who interact with products that rely on models trained on the dataset. Users may consent to providing their data as a part of the product experience, and require a significantly different set of explanations and controls grounded within product experiences. We therefore suggest that Data Cards target agents with access to technical expertise, and encourage the use of alternative transparency artifacts, designed exclusively for that purpose, for users.
We further disaggregate these high-level groups to generate awareness and emphasize the unique decisions that each sub-group must make (Fig. 3). However, these groupings exist on a continuum and stakeholders may fall into more than one group concurrently, depending on their context. We used this typology to unearth assumptions that are often made about the rich intersectional attributes of individual stakeholders, such as expertise (e.g., novice or expert), data fluency (e.g., none to high), job roles (e.g., Data Scientist, Policy Maker), function performed vis-à-vis the data (Data Contributor, Rater), and goals or tasks (publishing a dataset, comparing datasets) when conceptualizing Data Cards. Usability studies across these groups revealed guidelines for the successful and appropriate adoption of Data Cards in practice and at scale. These are distilled into the following objectives for Data Cards:
2.2.1 O1. Consistent: Data Cards must be comparable to one another, regardless of data modality or domain, such that claims are easy to interpret and validate within the context of use. While deploying one-time Data Cards is relatively easy, we find that organizations need to preserve comparability when scaling adoption. A Data Card creation effort should solicit equitable information from all datasets.
2.2.2 O2. Comprehensive: Rather than being created as a last step in a dataset's lifecycle, it should be easy to create a Data Card concurrently with the dataset. Further, the responsibility of filling out fields in a Data Card should be distributed and assigned to the most appropriate individual. This requires standardized methods that extend beyond the Data Card, and apply to the various reports generated in the dataset's lifecycle.
2.2.3 O3. Intelligible and Concise: Readers have varying levels of proficiency2 which affects their interpretation of the Data Card. In scenarios where stakeholder proficiency differs, individuals with the strongest mental model of the dataset become de-facto decision makers. Finally, tasks that are more urgent or challenging can reduce the participation of non-traditional stakeholders (See 3) in decisions, which are left to “the expert”. This risks omitting critical perspectives that reflect the situated needs of downstream and lateral stakeholders. A Data Card should efficiently communicate to the reader with the least proficiency, while enabling readers with greater proficiency to find more information as needed. The content and design should advance a reader's deliberation process without overwhelming them, and encourage stakeholder cooperation towards a shared mental model of the dataset for decision-making.
2.2.4 O4. Explainability and Uncertainty: Workshop participants reported that 'known unknowns' were as important as known facets of the dataset in decision making. Communicating uncertainty along with meaningful metadata was considered a feature and not a bug, allowing readers to answer questions such as "Is a specific analysis irrelevant to the dataset or were the results insignificant?" or "Is information withheld because it is proprietary, or is it unknown?". Clear descriptions and justifications for uncertainty can lead to additional measures to mitigate risks, leading to opportunities for fairer and more equitable models. This builds greater trust in the dataset and, subsequently, its publishers [10].
3 DATA CARDS
Data Cards capture critical information about a dataset across its life cycle. Just as every dataset is unique, each Data Card is unique, and no single template satisfactorily captures the nuance of all datasets. In this section, we introduce our guiding principles, and elaborate on decisions towards the design, content, and evaluation of Data Cards. We introduce corresponding frameworks that allow Data Cards to be tailored while preserving their utility and intent.
3.1 Principles
In comparison to prior related documentation toolkits (Appendix A) that have been prescriptively adopted by producers, our novel contribution is the generative design of Data Cards: an underlying framework for transparency reporting that supports domain- and fluency-agnostic readability and scales in production contexts. To meet the objectives stated above, Data Cards have been designed along the following principles:
- P1. Flexible: Describe a wide range of datasets such as static datasets, datasets that are actively being curated from single or multiple sources, or those with multiple modalities.
- P2. Modular: Organize documentation into meaningful sections that are self-contained and well-structured units, capable of providing an end-to-end description of a single aspect of the dataset.
- P3. Extensible: Components that can be easily reconfigured or extended systematically for novel datasets, analyses, and platforms.
- P4. Accessible: Represent content at multiple granularities so readers can efficiently find and effectively navigate detailed descriptions of the dataset.
- P5. Content-agnostic: Support diverse media including multiple choice selections, long-form inputs, text, visualizations, images, code blocks, tables, and other interactive elements.
3.2 Design and Structure
The fundamental "display" unit of a Data Card is a block, which consists of a title, a question, space for additional instructions or descriptions, and an input space for answers. Answer inputs are reinforced with structure to create blocks that are specifically suited for long- or short-form text, multiple or single choice responses, tables, numbers, key-value pairs, code blocks, data visualizations, tags, links, and demos of the data itself, in alignment with principles (P1) and (P5). In our templates, we iteratively introduced structures for open-ended answers, predetermined responses for multiple choice questions, and demonstrative examples where responses could be complex (Fig. 2). Producers found these assistive efforts to be useful guides for setting expectations about consistency, clarity, and granularity in responses. When completed, blocks typically retained titles and answers (see Fig. 1) to reduce the gulf between the experience of producers and agents.
Blocks are arranged thematically and hierarchically on a grid to enable an "overview first, zoom-and-filter, details-on-demand" [27] presentation of the dataset, accomplishing principle (P4). In our template, blocks with related questions are organized into rows, and rows are stacked to create sections with meaningful and descriptive titles (Figure 2). Each row is thematically self-contained so readers can effectively navigate multiple facets of a dataset in a Data Card. Answers increase in both detail and specificity across columns in the direction of the language in which the Data Card is written, allowing readers to find information at the appropriate fidelity for their tasks and decisions. Where appropriate, a single block may span multiple columns. In the first Data Card [D], sections are vertically arranged based on functional importance in a nested hierarchy marked by section titles. Here, all necessary sections (dataset snapshot, motivations, extended use, collection and labeling methods) are established in order to provide greater context for interpreting sections that describe fairness-related analyses (fairness indicators, bounding box sizes). In contrast, sections in the second Data Card [E] are organized in a flat hierarchy, suggesting equal importance of all blocks. Variation within the formatting of the content communicates both denotative and connotative meaning, while preserving the fundamental unit of "blocks", illustrating principles (P2) and (P3).
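To make this block-row-section structure concrete, the following is a minimal sketch of how such a template could be represented programmatically. It is not the authors' implementation; all class, field, and enum names are illustrative assumptions that mirror the structures described above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class InputType(Enum):
    """Answer structures a block can accept (principles P1, P5)."""
    LONG_TEXT = "long_text"
    SHORT_TEXT = "short_text"
    MULTIPLE_CHOICE = "multiple_choice"
    TABLE = "table"
    KEY_VALUE = "key_value"
    CODE_BLOCK = "code_block"
    VISUALIZATION = "visualization"
    TAGS = "tags"
    LINK = "link"
    DEMO = "demo"


@dataclass
class Block:
    """Fundamental display unit: title, question, guidance, and answer input."""
    title: str
    question: str
    instructions: str = ""                 # additional guidance for producers
    input_type: InputType = InputType.LONG_TEXT
    choices: Optional[List[str]] = None    # predetermined responses, if any
    answer: Optional[str] = None           # completed blocks keep title + answer


@dataclass
class Row:
    """Thematically self-contained group of blocks; detail grows across columns."""
    blocks: List[Block] = field(default_factory=list)


@dataclass
class Section:
    """Rows stacked under a meaningful, descriptive title (principle P2)."""
    title: str
    rows: List[Row] = field(default_factory=list)
```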
3.2.1 Socratic Question-Asking Framework: Scopes. To ensure that agents with varying proficiency levels can progressively explore content with minimal barriers (principle P4), any new information in a Data Card needs to be introduced at multiple levels of abstraction. Further, the addition of ad-hoc blocks risks structurally compromising Data Cards for readers and producers alike, thereby reducing both the usability of the design and the integrity of the content. Pertinent to objectives O2 and O3, we provide a structured approach to framing and organizing questions to address common challenges in adapting Data Card templates for new datasets. Depending on the specificity desired, new themes are deconstructed into broad questions, which are then extrapolated into at least three questions framed at varying granularities. We characterize these as telescopes, periscopes, and microscopes. Depending on the topic documented, a Data Card may require an uneven distribution of telescopic, periscopic, or microscopic questions. Our aforementioned row-and-column design, combined with our organization principle, provides sufficient flexibility to intermix content hierarchies that cater to different combinations of scope types. For the purposes of demonstration, we consider the documentation of sensitive human attributes in the scope descriptions that follow.
| Documented aspects (1–16) | Documented aspects (17–31) |
|---|---|
| (1) The publishers of the dataset and access to them | (17) The data collection process (inclusion, exclusion, filtering criteria) |
| (2) The funding of the dataset | (18) How the data was cleaned, parsed, and processed (transformations, sampling, etc.) |
| (3) The access restrictions and policies of the dataset | (19) Data rating in the dataset, process, description and/or impact |
| (4) The wipeout and retention policies of the dataset | (20) Data labeling in the dataset, process, description and/or impact |
| (5) The updates, versions, refreshes, additions to the data of the dataset | (21) Data validation in the dataset, process, description and/or impact |
| (6) Detailed breakdowns of features of the dataset | (22) The past usage and associated performance of the dataset (e.g., models trained) |
| (7) Details about collected attributes which are absent from the dataset or the dataset's documentation | (23) Adjudication policies and processes related to the dataset (labeler instructions, inter-rater policy, etc.) |
| (8) The original upstream sources of the data | (24) Relevant associated regulatory or compliance policies (GDPR, licenses, etc.) |
| (9) The nature (data modality, domain, format, etc.) of the dataset | (25) Dataset infrastructure and/or pipeline implementation |
| (10) What typical and outlier examples in the dataset look like | (26) Descriptive statistics of the dataset (mean, standard deviations, etc.) |
| (11) Explanations and motivations for creating the dataset | (27) Any known patterns (correlations, biases, skews) within the dataset |
| (12) The intended applications of the dataset | (28) Human attributes (socio-cultural, geopolitical, or economic representation) |
| (13) The safety of using the dataset in practice (risks, limitations, and trade-offs) | (29) Fairness-related evaluations and considerations of the dataset |
| (14) Expectations around using the dataset with other datasets or tables (feature engineering, joining, etc.) | (30) Definitions and explanations for technical terms used in the Data Card (metrics, industry-specific terms, acronyms) |
| (15) The maintenance status and version of the dataset | (31) Domain-specific knowledge required to use the dataset |
| (16) Differences across previous and current versions of the dataset | |
Telescopes provide an overview of the dataset. These are questions about universal attributes applicable across multiple datasets, for example "Does this dataset contain Sensitive Human Attributes?". Telescopes can be binary (contains, does not contain) or multiple choice (Select all that apply: Race, Gender, Ethnicity, Socio-economic status, Geography, Language, Sexual Orientation, Religion, Age, Culture, Disability, Experience or Seniority, Others (please specify)). These serve three specific purposes. First, telescopic questions generate enumerations or tags that are useful for knowledge management, indexing, and filtering in a large repository of Data Cards. Second, they introduce and set context for additional information within a row, helping readers navigate larger or more complex Data Cards. Lastly, telescopic questions introduce conditional logic to streamline the experience of filling out a Data Card. When viewed together, telescopic questions offer a shallow but wide overview of the dataset.
Periscopes provide greater technical detail pertaining to the dataset. These are questions about attributes specific to the dataset that add nuance to telescopes. For example, "For each human attribute selected, specify whether this information was collected intentionally as a part of the dataset creation process, or unintentionally (not explicitly collected as a part of the dataset creation process, but inferable using additional methods)." A periscopic question can ask for operational information such as the dataset's shape and size, or functional information such as sources or intentions. Responses typically look like key-value pairs, short descriptions, tables, and visualizations. Since periscopes often describe analysis results, statistical summaries, and operational metadata, they are often reproducible and can be automated when automation generates results that are more accurate or precise than human input.
Microscopes offer fine-grained details. These are questions about the “unobservable” human processes, decisions, assumptions and policies that shape the dataset. These elicit detailed explanations of decisions or summarize longer process documents that governed responses to the corresponding periscopic questions. For example, “Briefly describe the motivation, rationale, considerations or approaches that caused this dataset to include the indicated human attributes. Summarize why or how this might affect the use of the dataset.” Necessarily, answers to these questions are difficult to automate in the absence of standardized terms and operating procedures. Answers to microscopes are typically long-form text with lists and links, data tables, and visualizations.
Telescopic questions are easiest to answer, but offer relatively low utility. Periscopic questions facilitate quick assessments of suitability and relevance of the dataset, essential for simple decision-making. We observed that microscopic questions were most challenging to answer since they require articulating implicit knowledge. We find that the interpretations of a Data Card are greatly influenced by the presence or absence of these levels of abstraction. These questions enabled agents and producers alike to assess risk, plan mitigations, and where relevant, identify opportunities for better dataset creation. Together, telescopes, periscopes, and microscopes layer useful details such that numerous readers can navigate without losing sight of the bigger picture.
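The sketch below restates the sensitive human attributes example as a scope-tagged row, to show how a telescope can gate the periscopic and microscopic questions that follow it. The question wording is paraphrased from this section; the dictionary keys and the conditional "unlocks" field are illustrative assumptions, not part of the published template.

```python
# Illustrative scope-tagged questions for the "sensitive human attributes" theme.
SENSITIVE_HUMAN_ATTRIBUTES_ROW = [
    {
        "scope": "telescope",
        "question": "Does this dataset contain Sensitive Human Attributes?",
        "input": "multiple_choice",
        "choices": [
            "Race", "Gender", "Ethnicity", "Socio-economic status", "Geography",
            "Language", "Sexual Orientation", "Religion", "Age", "Culture",
            "Disability", "Experience or Seniority", "Others (please specify)",
        ],
        # Telescopic answers can drive conditional logic: the follow-up
        # questions are only surfaced if at least one attribute is selected.
        "unlocks": ["periscope", "microscope"],
    },
    {
        "scope": "periscope",
        "question": ("For each human attribute selected, specify whether it was "
                     "collected intentionally or can be inferred unintentionally."),
        "input": "key_value",   # attribute -> intentional / unintentional
    },
    {
        "scope": "microscope",
        "question": ("Briefly describe the motivation, rationale, considerations "
                     "or approaches that caused this dataset to include the "
                     "indicated human attributes, and how this might affect use."),
        "input": "long_text",   # implicit knowledge; difficult to automate
    },
]
```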
3.3 Content and Schema
Our initial approach was to create a single template capable of capturing the provenance, intentions, essential facts, explanations and caveats in an accessible and understandable way. In co-creating Data Cards for different types of datasets, we identified 31 broad, generalizable themes (Table 2) that comprehensively describe any dataset (O2). However, themes vary in importance on a per-task basis to stakeholders. Sections in our template (F) capture these themes, further demonstrating how they are deconstructed into sets of scopes (3.2.1). To illustrate the differences in descriptions of a theme elicited per dataset, we include two Data Cards from our case studies (4.1, 4.2) in appendix D and E respectively.
3.3.1 OFTEn Framework. Over time, we found it necessary to develop a consistent and repeatable approach to identify and add new themes from dataset life cycles to a Data Card that are reportable by everyone in the organization. Additionally, certain topics, such as consent, can span entire dataset life cycles with different implications at each stage. We introduce OFTEn, a conceptual tool for systematically considering how topics propagate across all parts of a Data Card (P1, P3), through detailed inductive and deductive dataset transparency investigations.
OFTEn (Table 3) abbreviates common stages in the dataset life cycle ("Origins, Factuals, Transformations, Experience, and n=1 example"). Though ordered, stages are loosely defined to mirror typical non-linear dataset development practices. Notably, agents' use of the dataset is considered a distinct stage in OFTEn, affording the flexibility to incorporate feedback from downstream stakeholders (dataset consumers, product users, and even data contributors). This establishes a trail to track the performance of AI systems trained and evaluated on the dataset, and exposes any caveats or limitations that potential agents should be aware of.
An OFTEn analysis of the dataset can preemptively enable the discovery of insights that would otherwise not be generally evident. Inductively, OFTEn supports activities with agents to formulate questions about datasets and related models that are important for decision-making. At its simplest, it can be visualized as a matrix in which rows represent the dataset life cycle, and columns provide prompts to frame questions (who, what, when, where, why, and how) about a given topic in the dataset's lifecycle (Table 3). Its participatory use enables reporting both dataset attributes and implicit information that can affect outcomes in real-world deployment. Deductively, we use OFTEn to assess if a Data Card accurately represents the dataset, resulting in formative effects on both documentation and dataset. Lastly, we find that Data Cards with a clear underlying OFTEn structure are easy to expand and update. This structure allows Data Cards to capture information over time, such as feedback from downstream agents, notable differences across versions, and ad-hoc audits or investigations from producers or agents.
| Stage | Description | Themes |
|---|---|---|
| Origins | Various planning activities such as problem formulation, defining requirements, design decisions, collection or sourcing methods, and deciding policies which dictate dataset outcome | Authorship, Motivations, Intended Applications, Unacceptable uses, Licenses, Versions, Sources, Collection Methods, Errata, Accountable parties |
| Factuals | Statistical and other computable attributes that describe the dataset, deviations from the original plan, and any pre-wrangling analysis and investigations, including those pertaining to biases and skews | Number of Instances, Number of Features, Number of Labels, Breakdown of subgroups, Description of features, Taxonomies of labels, Missing/Duplicates, Inclusion and exclusion criteria |
| Transformations | Various operations such as filtering, validating, parsing, formatting, and cleaning through which raw data is transformed into a usable form including labeling or annotation policies, validation tasks, feature engineering and related modifications | Rating or Annotation, Filtering, Processing, Validation, Synthetic features, Handling of PII, Sensitive Variables, Fairness Analyses, Impact Assessments, Skews & Biases |
| Experience | The dataset is benchmarked or deployed in experimental, production, or research practice, including specific tasks, access and training requirements, modifications made to suit the task, analyses, unexpected behaviors, limitations, caveats, and comparisons to similar datasets | Intended Performance, Unintended Application, Unexpected Performance, Caveats, Extended Use Cases, Safety of Use, Downstream Outcomes, Use & Use Case Evaluation |
| N=1 (examples) | Examples in the dataset, including typical, outlier, raw, and transformed examples; concrete examples or links to additional artifacts of relevance; links to guided or unguided explorers of datapoints in the dataset | Examples or links to typical examples and outliers; examples that yield errors; examples that demonstrate handling of null or zero feature values; code blocks & scripts, extended documentation, web demos |
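To show how OFTEn can be used inductively, the following minimal sketch builds the stage-by-interrogative matrix described above for a given topic. The prompt phrasing it generates is an assumption for illustration; only the stage names and interrogatives come from the framework.

```python
from itertools import product

OFTEN_STAGES = ["Origins", "Factuals", "Transformations", "Experience", "n=1 (examples)"]
INTERROGATIVES = ["who", "what", "when", "where", "why", "how"]


def often_matrix(topic: str) -> dict:
    """Generate question stubs for a topic across the OFTEn life cycle stages.

    Each (stage, interrogative) cell is a starting point for framing a
    question; e.g., ("Transformations", "who") for the topic "consent" could
    become: who re-verified consent when the data was relabeled?
    """
    return {
        (stage, q): f"{q.capitalize()} ... regarding '{topic}' at the {stage} stage?"
        for stage, q in product(OFTEN_STAGES, INTERROGATIVES)
    }


# Inductive use: brainstorm questions about consent across the life cycle.
prompts = often_matrix("consent")
print(prompts[("Origins", "who")])
```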
3.4 Evaluation of Data Cards
We worked with over 18 producers to understand workflows of creating and maintaining Data Cards, and conducted an interview study (n=10) to validate our observations. While a detailed report of this study is out of the scope of this paper, we found that producers had a tendency to fork completed Data Cards (which described similar datasets) as a starter template instead of using the provided template. While this practice made Data Cards easier to complete, it resulted in an increase in inaccurate responses, the propagation of errors, and modifications to templates in forked Data Cards. Producers would delete blocks and sections that were irrelevant to their dataset, and in specific cases, producers would semantically modify questions to suit their datasets. Though justifiable in the context of a single Data Card, these practices led to the subsequent fragmentation of forked Data Cards. Deleted but relevant questions were irrecoverable, and reconciling updates to the original template was labor-intensive. Finally, we observed that producers resorted to answering "N/A" when they were unsure of the answer, or when uncertainty was high. These real-world constraints motivated us to identify mechanisms for assuring the quality of Data Cards, expand organizational vocabularies on uncertainty, and introduce low-barrier processes across the dataset lifecycle that can be easily adopted by organizations.
Initially, each new Data Card created was assigned two reviewers representing job functions typical of agents. Selected reviewers were always unfamiliar with the dataset, but typically fluent in manipulating data or in the domain of the dataset. Despite their expertise, feedback provided on these Data Cards was observational and speculative in nature ("The first two listed applications are commonly used and should be understood by both practitioners and laypeople, but I'm not sure about [application]"), and often not tactical enough for producers to incorporate into the Data Card. To make reviewer feedback actionable and holistic, we worked with a mix of subject matter experts, data reviewers, and functional and tactical roles at our company to identify 98 concepts used to assess datasets and their documentation. From these, we excluded 13 usability and 8 user-experience related concepts, which are captured in our objectives. We then consolidated the remaining concepts into 20 clusters using affinity mapping. Clusters were then classified into five umbrella topics or "dimensions" that represent contextual decision-making signals used by our experts to evaluate the rigor with which a Data Card describes a dataset, and its corresponding efficacy for the reader.
3.4.1 Dimensions. Dimensions are directional, pedagogic vectors that describe the Data Card's usefulness to the agents. They represent the different types of judgments readers might make, and yield qualitative insights into the consistency, comprehensiveness, utility, and readability of Data Card templates and completed Data Cards alike. Here, we briefly summarize these dimensions:
- Accountability: Demonstrates adequate ownership, reflection, reasoning, and systematic decision making by producers.
- Utility or Use: Provides details that satisfy the needs of the readers’ responsible decision-making process to establish the suitability of datasets for their tasks and goals.
- Quality: Summarizes the rigor, integrity and completeness of the dataset, communicated in a manner that is accessible and understandable to many readers.
- Impact or Consequences of Use: Sets expectations for positive and negative outcomes as well as subsequent consequences when using or managing the dataset in suitable contexts.
- Risk and Recommendations: Makes readers aware of known potential risks and limitations, stemming from provenance, representation, use, or context of use. Provides enough information and alternatives to help readers make responsible trade-offs.
Reviewers with varying levels of domain and data fluency were asked to test the aforementioned dimensions, set up as a grading rubric, during their evaluations of Data Cards and any associated Model Cards. Reviewers were asked to independently rate the completed Data Card on each dimension, using a 5-point scale with the choices Poor, Borderline, Average, Good, and Outstanding. In addition, they were asked to provide evidence in support of their ratings, and steps that producers could take to improve that specific rating. Reviewers found it easier to offer structured and actionable feedback using these dimensions ("Utility or Use: Average. Evidence: Data Card provides all necessary steps for users who may wish to access the dataset, but it's hard for me to determine what use cases are suitable for this dataset. I know the dataset was collected for the purpose of evaluating the performance of the [specific model], but what does the [specific model] do? Next Steps: Provide additional examples of suitable use cases, provide additional detail on what the [specific model] does under intended use case."). Multiple reviewers reported feeling more confident in their assessments. While these dimensions are primarily used to assess whether Data Cards help readers arrive at acceptable conclusions about datasets, feedback from expert reviewers revealed specific opportunities to enhance the datasets themselves.
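A minimal sketch of how such reviewer ratings could be recorded and aggregated is shown below, assuming a simple data structure for per-dimension ratings, evidence, and next steps. The numeric mapping of the 5-point scale and all names are assumptions; the paper does not prescribe an implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

DIMENSIONS = ["Accountability", "Utility or Use", "Quality",
              "Impact or Consequences of Use", "Risk and Recommendations"]
SCALE = {"Poor": 1, "Borderline": 2, "Average": 3, "Good": 4, "Outstanding": 5}


@dataclass
class DimensionRating:
    dimension: str   # one of DIMENSIONS
    rating: str      # one of SCALE's keys
    evidence: str    # why the reviewer gave this rating
    next_steps: str  # actionable guidance for producers


def summarize(reviews: List[List[DimensionRating]]) -> Dict[str, float]:
    """Average each dimension across independent reviewers of one Data Card."""
    scores: Dict[str, List[int]] = {d: [] for d in DIMENSIONS}
    for review in reviews:
        for r in review:
            scores[r.dimension].append(SCALE[r.rating])
    return {d: mean(v) for d, v in scores.items() if v}
```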
4 CASE STUDIES
4.1 A Computer Vision Dataset for Fairness Research
A research team created an ML training dataset for computer vision (CV) fairness techniques that described sensitive attributes about people, such as perceived gender and perceived age-range. Sampled from Open Images [20], the dataset included 100,000 bounding boxes over 30,000 images. Each bounding box was manually annotated with perceived gender and perceived age-range presentation attributes. Given the risks associated with sensitive labels describing personal attributes, weighed against the societal benefit of these labels for fairness analysis and bias mitigation, the team wanted an efficient way to provide an overview of the dataset's characteristics and limitations, and to communicate its acceptable uses to internal ethics reviewers and external audiences.
Three parties were involved in the creation of this Data Card [12], which started after the dataset was prepared. First were the dataset authors, who had deep tacit knowledge of the processes and decisions across the dataset's lifecycle. They also had explicit knowledge from extensive analysis performed for the dataset release. However, this was distributed across several documents, and the Data Card was an exercise in organizing knowledge into a "readable format" that could be consistently repeated for multiple datasets. This process occurred asynchronously over a few days.
The next group involved were internal reviewers of the dataset and an accompanying paper, who conducted an analysis of how the dataset aligns with responsible AI research and development practices. The analysis focused on subgroups in the labels, the trade-offs associated with each subgroup, and clarifying acceptable and unacceptable use cases of the dataset as a whole, in alignment with an established set of AI Principles [24]. The reviewers recommended that the team create a Data Card. Creating the Data Card as a result of the review process revealed differences in perception across experts. For example, in the Data Card, producers noted that nearly 40% of perceived age-range labels were 'unknown'. Reviewers were unable to ascertain whether this was acceptable, and subsequent conversations raised further questions about the criteria used to label a bounding box with an 'unknown' perceived age-range. It was found that 'high' levels of unknowns were relatively typical of datasets in this problem space, attributed to the fact that 30% of the bounding boxes covered less than 1% of the image area. As a result, producers added a custom section about bounding boxes to the Data Card, and created additional supporting visualizations. Further, producers uncovered and iterated on additional Data Card fields for future CV datasets.
The last group involved in the creation of the Data Card were the authors of this paper, who provided human-centered design perspectives on the Data Card. Feedback was primarily geared towards uncovering agent information needs for acceptable conclusions about the accountability, risk & recommendations, uses, consequences, and quality of the dataset (3.4.1). A post-launch retrospective revealed that though the producers did not have access to dataset consumers, downstream agents reported finding the Data Card useful, and requested Data Card templates for their own use.
4.2 A Geographically Diverse Dataset for Language Translation
A team of software engineers and a product manager noticed that certain models were attentive to names when classifying a person's perceived gender. Upon investigation, it was found that previous training datasets contained too few names that belonged to non-American geographies or that were uncommon in English. It was also found that model creators were making assumptions about these datasets. In response, the team decided to create a geographically diverse evaluation dataset from a limited set of publicly curated data from Wikipedia.
However, it became clear that a truly diverse dataset would need to consider race, age, gender, background and profession as well. While countries were acceptable proxies for geographic representation, gender would need to be inferred from the entity descriptions. Without an awareness of the goals of the dataset or the definitions of gender in the data design, the team was concerned that model creators could make assumptions leading to inappropriate dataset use. To communicate these two aspects, the team created a Data Card for readers with and without technical expertise.
Experts responsible for the design, data extraction, cleaning, and curation of the dataset worked with a human-centered designer in an iterative process to produce the Data Card [7]. While the documentation process itself took approximately 20 hours, the Data Card prompted the team to reflect on how the data was selected, reviewed, and created. They specifically considered what they did not know about the dataset, their assumptions, and the advantages and limitations of the dataset. In doing so, the team was forced to rethink design decisions, which increased the overall timeline but resulted in a more principled and intentional dataset of geographically diverse biographies.
The team utilized the Data Card to engage in clearer overall discussions with stakeholders. In particular, expert stakeholders pointed out that gender is difficult to ascertain in the dataset. These conversations helped the team agree on a definition of perceived gender that relied on gender-indicative terms within the text of the data, using the labels "masculine", "feminine", and "neutral" for biographies describing collections of individuals. The team found that some discussions around the Data Card were actually about the dataset, and noted how useful this feedback would have been had it been received during the design stage. The final Data Card describes the data selection criteria, sampling criteria, and sources of fields, and emphasizes the distribution of countries by continental regions. In addition, the team was able to clearly justify the reasons for not including non-binary individuals, for excluding collected data, and the limitations of this dataset.
5 DISCUSSION
5.0.1 Experiences and outcomes from Case Studies. While both teams appreciated the transparency added to their respective datasets, creating Data Cards as a final step significantly increased the perceived amount of work required. Rather than a post-implementation task, creating Data Cards alongside the dataset offers several benefits. First, it enables the inclusion of multiple perspectives (engineering, research, user experience, legal, and ethical) that enhance the readability and relevance of documentation, and the quality of the dataset over time. Second, it forces the aggregation of disparate documentation across the dataset lifecycle into a single, ground-truth document accessible to stakeholders. Lastly, it facilitates early feedback on responsible AI practices from experts and non-experts that can affect data design and analyses. Of note, teams that developed multiple Data Cards over a period of time started developing a nuanced vocabulary to express uncertainty that accurately reflected the status of the information.
5.0.2 Data Cards as Boundary Objects. Data Cards are designed to embody a high degree of interpretive flexibility [21]. A single Data Card can support tasks such as conducting reviews and audits, determining use in AI systems or research, comparing multiple datasets, reproducing research, or tracking dataset adoption by various groups. For example, data practitioners seek to evaluate the quality of a dataset for benchmarking or analysis; AI practitioners determine the use case suitability of a dataset for deployment in new or existing models; product managers assess downstream effects to make data-related decisions about model or product optimizations for the desired user experience; and policy stakeholders evaluate the representativeness of a dataset in relation to end users, and the role of the various agencies involved in creating the dataset. Importantly, while Data Cards are able to hold a common identity across these groups, they allow stakeholders to analytically make decisions using dimensions, constructs, and vocabulary that are meaningful to their own communities of practice. Data Cards are able to facilitate collaborative work across stakeholders, while supporting individual decision making without consensus.
Our design of Data Cards enables the embedding of relevant sections into transparency artifacts that describe ML models and AI systems. Conversely, sections in the Data Card are designed to capture documentation surrounding the use of datasets in ML models. This establishes a network of artifacts that stakeholders can examine when conducting fairness and accountability interrogations, helping them achieve better overall results on meta-problems across the domain such as knowledge transfer, dataset reusability, organizational governance, and oversight mechanisms. Data Cards therefore effectively act as boundary objects [28] and, where relevant, boundary infrastructures.
5.0.3 Path to Adoption. Following our initial Data Card release [5], public and private organizations have sought to adopt similar constructs ([16], [17], [3]). Within our organization, we observed an increase in non-mandated Data Cards created by individuals who organically came across completed Data Cards. While these speak to the utility of Data Cards as a documentation artifact, a Data Card's quality and comprehensiveness depend on the rigor of its producers, the nuance with which they express uncertainty, and their knowledge of the dataset. Organizational factors include the presence of minimum or mandatory content requirements, process incentives, training materials, and infrastructure for creating and sharing Data Cards. While we propose a relatively comprehensive template for documenting datasets in Data Cards, industry-wide adoption could be spurred by agreed-upon interoperability and content standards that serve as a means for producers and agents to develop more equitable mental models of datasets.
5.0.4 Infrastructure and Automation. Critical to an organization's success is its ability to tailor Data Cards to its datasets, models, and technological stack. Knowledge management infrastructures must be connected to data and model pipelines so new knowledge can be seamlessly incorporated into the Data Card, keeping it up to date. We find that blocks allow for easy implementation on interactive platforms (digital forms, repositories, dataset catalogs) and adaptation for non-interactive surfaces (PDFs, documents, physical papers, markdown files). While both of our case studies produced static PDFs, sections and fields can be easily implemented in a browser-based user interface, configured for views tailored to different stakeholders.
Centralized repositories that can perform search-and-filter operations over hundreds of Data Cards have long-tail benefits for agents in identifying the most suitable datasets for their tasks, measurably distributing the accountability for how datasets are used. We observed a marked preference for infrastructures that enable stakeholder collaboration and co-creation of Data Cards, the linking and storage of extraneous artifacts, and the partial automation of visualizations, tables, and analysis results. Interestingly, we observed that readers had strong opinions about not automating certain fields in the Data Card, especially when responses contain assumptions or rationales that help interpret results. Automated fields should guarantee accuracy and antifragility at all times, preventing the misrepresentation and subsequent legitimization of poor-quality datasets. Implicit knowledge is articulated by providing contextual, human-written explanations of methods, assumptions, decisions, and baselines. We find that adopting a co-creative approach that spans the entire dataset life cycle results in a deliberate approach to automation in documentation.
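The sketch below illustrates one way this selective automation could look in practice: blocks render to markdown for non-interactive surfaces, automatable fields are refreshed from a registered generator, and rationale-bearing fields keep their human-written answers. The field names, the AUTOMATABLE set, and the rendering format are assumptions for illustration, not part of the Data Cards tooling.

```python
from typing import Callable, Dict, Optional

# Fields that could plausibly be auto-populated from data/model pipelines
# (illustrative names; rationale-bearing fields are intentionally excluded).
AUTOMATABLE = {"dataset_snapshot", "descriptive_statistics", "label_taxonomy"}


def render_block(name: str, title: str, answer: str,
                 generators: Optional[Dict[str, Callable[[], str]]] = None) -> str:
    """Render one completed block to markdown for a non-interactive surface.

    Automatable fields are refreshed from a generator when one is registered;
    fields carrying assumptions or rationales keep the human-written answer.
    """
    generators = generators or {}
    if name in AUTOMATABLE and name in generators:
        answer = generators[name]()   # e.g., recompute statistics from the pipeline
        provenance = "auto-generated"
    else:
        provenance = "human-written"
    return f"### {title}\n\n{answer}\n\n_Source: {provenance}_\n"


# Usage: produce markdown for a dataset catalog entry or repository README.
markdown = render_block(
    "descriptive_statistics",
    "Descriptive Statistics",
    "(stale, manually entered values)",
    generators={"descriptive_statistics": lambda: "Instances: 30,000 images; 100,000 bounding boxes"},
)
print(markdown)
```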
6 CONCLUSION
We presented Data Cards, a framework for transparent and purposeful documentation of datasets at scale for responsible AI development. Our underlying approach advances the state of the art by surfacing transparency principles and establishing objectives for transparency; expanding existing paradigms of what constitutes dataset documentation; and enabling the human-centered design of frameworks for structuring, adapting or expanding, and evaluating Data Cards. We provide an in-depth discussion of each framework, and detail qualitative and anecdotal evidence for the efficacy of Data Cards towards creating responsible AI systems through two case studies. A limitation of our approach was the use of Google Docs for Data Card templates. While this allowed stakeholders to collaborate and preserved a forensic history of the Data Card's development, producers were limited to providing answers using text, tables, and images. Additionally, this format prevented us from improving template usability through design and automation, a much-requested feature among producers. Future work requires a more principled approach for extending and adapting Data Card templates without compromising comparability. Insights from our studies call for participatory approaches that engage diverse, non-traditional stakeholders early in the dataset and Data Card development process. Lastly, defining quantitative measures to assess the true value of Data Cards will require adoption at both breadth and depth in the industry. To address this, further investigation is needed into the perceived and actual importance of the content of Data Cards to the tasks of different stakeholder groups, which requires the expansion of user studies to a broader participant pool spanning multiple industries. Data Cards templates and frameworks encourage customized implementations that foster a culture of deep, detailed, and transparent documentation. Data Cards are capable of thoughtfully explaining the implications of datasets while highlighting unknowns appropriately. They reveal insights about inherent aspects of a dataset that cannot be determined by interacting with the dataset alone. Data Cards enable future industry standards of transparency and documentation that emphasize the ethical considerations of a dataset in ways that can be practically acted upon, support production and research decisions, and enable the well-informed development of large AI models with increasingly complex dataset dependencies.
ACKNOWLEDGMENTS
We are grateful to Aybuke Turker for research contributions; Romina Stella, Candice Schumann, Reena Jana and Susanna Ricco for the Data Cards and the two case studies presented in this paper; Emily Denton, Lauren Wilcox, Michael Terry, Negar Rostamzadeh, Kathy Meier-Hellstern, and Meredith Morris for their feedback and expertise; and Tulsee Doshi, Margaret Mitchell, Timnit Gebru, Martin Wattenberg, Fernanda Viegas, Parker Barnes, Dan Nanas, Nicole Maffeo, Will Carter, Sebastian Gehrmann, Catherine Xu, Vivian Tsai, Danielle Smalls, Anthony Keene, and Lora Aroyo for their constant guidance. We thank internal and external workshop and study participants, and attendees of the Data Cards Playbook workshop at CRAFT 2021, for their participation and insightful discussions. We also thank the Center for Responsible AI and Human Centered Technology at Google Research for enabling this work. This work was jointly conducted by the Ethical AI and People + AI Research teams, funded by Google Research. The authors declare no additional sources of funding. The legal department of Google participated in the review and approval of the manuscript and in the decision to submit the manuscript for publication. Aside from the authors and their collaborators, Google had no role in the design and conduct of the study; access and collection of data; analysis and interpretation of data; or preparation of the manuscript. The authors declare no other financial interests.
A RELATED DOCUMENTATION FRAMEWORKS & TOOLKITS
To standardize documentation procedures that convey the performance characteristics of AI systems or the processes that lead to the creation and distribution of datasets, many groups have created frameworks and toolkits to support transparency in AI. Each of these efforts was developed with particular stakeholders and issues in mind. The following is a summary of some of these efforts:
- Model Cards is a modular, ethics-informed framework for reporting details of trained ML models [23]. Model Cards consist of qualitative information, such as ethical considerations, target users, and use cases, as well as quantitative information, with an emphasis on model evaluation that is disaggregated (split across the different target subgroups) and intersectional (evaluated on combinations of subgroups, for example race and gender); an illustrative sketch of such an evaluation follows this list.
- Datasheets for Datasets is a set of questions designed to elicit information about a dataset that reflects key stages in the dataset's lifecycle [15]. Drawing critical analogies from the automobile industry, clinical trials in medicine, and the electronics industry, Datasheets for Datasets also serves as a workflow for: (1) dataset creators, to guide their thinking while creating, distributing, and maintaining a dataset; and (2) dataset consumers, to decide on a documented dataset's appropriateness for a task, its strengths and limitations, and its place in a broader system.
- FactSheets is an extensive set of declaration items intended to disclose information about the creation and deployment of an AI service [6]. Modeled after a supplier's declaration of conformity (SDoC) and similar artifacts used in telecommunications and transportation to demonstrate a service's conformity to regulation, FactSheets items cover purpose and audience, performance variation, safety and security aspects, and the provenance of training data, all intended to increase the trustworthiness of AI services.
- Data Statements, originally developed for documenting natural language processing systems, is a practice for characterizing a dataset using schema elements in a way that minimizes critical scientific and ethical issues, such as those that arise when datasets are used in contexts for which they are not well suited [9]. In its original form, the schema elements in Data Statements covered particular aspects of language datasets, including speech context, speaker demographics, and annotator demographics, all inspired by practices from psychology and medicine that require such disclosures about the populations being studied.
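To make the disaggregated and intersectional evaluation described for Model Cards above more concrete, the following is a minimal sketch with hypothetical records and a simple accuracy metric; it is not code from any of the cited frameworks.

```python
# Illustrative sketch only (hypothetical data and metric), showing the kind of
# disaggregated and intersectional evaluation a Model Card might report.
from collections import defaultdict

# Each record: (predicted_label, true_label, subgroup attributes)
records = [
    ("cat", "cat", {"region": "A", "age": "18-30"}),
    ("dog", "cat", {"region": "A", "age": "31-60"}),
    ("dog", "dog", {"region": "B", "age": "18-30"}),
    ("cat", "cat", {"region": "B", "age": "31-60"}),
]

def accuracy_by(keys):
    """Accuracy split across every combination of the given subgroup attributes."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, true, attrs in records:
        group = tuple(attrs[k] for k in keys)
        totals[group] += 1
        hits[group] += int(pred == true)
    return {group: hits[group] / totals[group] for group in totals}

print(accuracy_by(["region"]))         # disaggregated: one metric per region
print(accuracy_by(["region", "age"]))  # intersectional: region x age combinations
```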
B TYPOLOGY OF STAKEHOLDERS
C OFTEN FRAMEWORK AS A GENERATIVE TOOL
| | Who | What | When | Where | Why |
|---|---|---|---|---|---|
| O | Who was responsible for setting the terms of consent? | What were the terms of consent? | When do the terms of consent expire? | Where are the terms of consent applicable? Are there any exceptions? | Why were these specific terms of consent chosen? |
| F | How was consent delivered to the surveyed population? | How many data points accompanied consent? | When was the consent collected with respect to data creation or collection? | Where can the consent be accessed? How is it stored? | If exceptions were made, why? What happened in cases where consent was not provided, was provided conditionally, or was provided but later revoked? |
| T | Who tracks consent? | What manipulations of the data are permissible under the given consent? | When can consent be revoked? | X | Why are said transformations in direct conflict with consent? |
| E | Under the terms of the consent, who can use the dataset? | Under the terms of the consent, what are the permissible uses of the dataset? | When must consent be reacquired from individuals to sustain use of the dataset? | Geographically, where does the consent permit dataset use? | Summarize the conditions and rationales that justify the use of data without consent. |
| N=1 | Provide an example of a consent form. | Provide an example of a data point with partial consent. | X | X | X |
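As a rough illustration of how the grid above can act as a generative tool, the sketch below encodes a few of the consent questions as a dimension-by-lens lookup and retrieves the prompts for one OFTEN dimension. The structure and names are assumptions for illustration, not tooling from the Data Cards Playbook.

```python
# A minimal sketch, assuming a dimension-by-lens grid of the OFTEN consent questions above.
OFTEN_CONSENT_GRID = {
    ("O", "Who"):   "Who was responsible for setting the terms of consent?",
    ("O", "Why"):   "Why were these specific terms of consent chosen?",
    ("F", "When"):  "When was the consent collected with respect to data creation or collection?",
    ("T", "What"):  "What manipulations of the data are permissible under the given consent?",
    ("E", "Where"): "Geographically, where does the consent permit dataset use?",
    ("N=1", "Who"): "Provide an example of a consent form.",
}

def prompts_for(dimension: str):
    """Collect the question prompts generated for one OFTEN dimension."""
    return [question for (dim, _), question in OFTEN_CONSENT_GRID.items() if dim == dimension]

for question in prompts_for("O"):
    print(question)
```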
D DATA CARD FOR COMPUTER VISION DATASET
E DATA CARD FOR LANGUAGE TRANSLATION DATASET
F DATA CARD TEMPLATE
REFERENCES
- 2017. AI Now Institute. https://ainowinstitute.org/
- 2021. ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT). https://facctconference.org/
- Joint Artificial Intelligence Center Public Affairs. 2021. Enabling AI with Data Cards. https://www.ai.mil/blog_09_03_21_ai_enabling_ai_with_data_cards.html
- Nuno Antunes, Leandro Balby, Flavio Figueiredo, Nuno Lourenco, Wagner Meira, and Walter Santos. 2018. Fairness and transparency of machine learning for trustworthy cloud services. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE, 188–193.
- Parker Barnes and Anurag Batra. 2020. Open Images Extended - Crowdsourced Data Card. https://research.google/static/documents/datasets/open-images-extended-crowdsourced.pdf
- Matthew Arnold, Rachel K. E. Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, Karthikeyan Natesan Ramamurthy, Darrell Reimer, Alexandra Olteanu, David Piorkowski, Jason Tsay, and Kush R. Varshney. 2019. FactSheets: Increasing Trust in AI Services through Supplier's Declarations of Conformity. arXiv:1808.07261 [cs.CY]
- Anja Austermann, Michelle Linch, Romina Stella, and Kellie Webster. 2021. https://storage.googleapis.com/gresearch/translate-gender-challenge-sets/Data%20Card.pdf
- Iain Barclay, Harrison Taylor, Alun Preece, Ian Taylor, Dinesh Verma, and Geeth de Mel. 2020. A framework for fostering transparency in shared artificial intelligence models by increasing visibility of contributions. Concurrency and Computation: Practice and Experience (2020), e6129.
- Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6 (2018), 587–604.
- Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, et al. 2021. Uncertainty as a form of transparency: Measuring, communicating, and using uncertainty. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 401–413.
- Ajay Chander, Ramya Srinivasan, Suhas Chelian, Jun Wang, and Kanji Uchino. 2018. Working with beliefs: AI transparency in the enterprise. In IUI Workshops.
- Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. 2021. https://storage.googleapis.com/openimages/open_images_extended_miap/Open%20Images%20Extended%20-%20MIAP%20-%20Data%20Card.pdf
- Upol Ehsan, Q Vera Liao, Michael Muller, Mark O Riedl, and Justin D Weisz. 2021. Expanding explainability: Towards social transparency in ai systems. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–19.
- Heike Felzmann, Eduard Fosch-Villaronga, Christoph Lutz, and Aurelia Tamò-Larrieux. 2020. Towards transparency by design for artificial intelligence. Science and Engineering Ethics 26, 6 (2020), 3333–3361.
- Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2018).
- GEM. 2022. Natural Language Generation, its Evaluation and Metrics Data Cards. https://gem-benchmark.com/data_cards
- HuggingFace. 2021. HuggingFace - Create a Dataset Card. https://huggingface.co/docs/datasets/v1.12.0/dataset_card.html
- Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, and Margaret Mitchell. 2021. Towards accountability for machine learning datasets: Practices from software engineering and infrastructure. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 560–575.
- People + AI Research Initiative. 2022. Know Your Data. https://knowyourdata.withgoogle.com/
- Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. 2020. The open images dataset v4. International Journal of Computer Vision 128, 7 (2020), 1956–1981.
- Susan Leigh Star. 2010. This is not a boundary object: Reflections on the origin of a concept. Science, Technology, & Human Values 35, 5 (2010), 601–617.
- Colleen McCue. 2014. Data mining and predictive analysis: Intelligence gathering and crime analysis. Butterworth-Heinemann.
- Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency. 220–229.
- Sundar Pichai. 2018. AI at Google: our principles. The Keyword 7 (2018), 1–3.
- Mahima Pushkarna, Andrew Zaldivar, and Daniel Nanas. [n. d.]. Data Cards Playbook: Participatory Activities for Dataset Documentation. https://facctconference.org/2021/acceptedcraftsessions.html#data_cards
- Mahima Pushkarna, Andrew Zaldivar, and Vivian Tsai. [n. d.]. Data Cards GitHub Page. https://pair-code.github.io/datacardsplaybook/
- Ben Shneiderman. 2003. The eyes have it: A task by data type taxonomy for information visualizations. In The craft of information visualization. Elsevier, 364–371.
- Susan Leigh Star and James R Griesemer. 1989. Institutional ecology, 'translations' and boundary objects: Amateurs and professionals in Berkeley's Museum of Vertebrate Zoology, 1907–39. Social Studies of Science 19, 3 (1989), 387–420.
- Harini Suresh, Steven R Gomez, Kevin K Nam, and Arvind Satyanarayan. 2021. Beyond Expertise and Roles: A Framework to Characterize the Stakeholders of Interpretable Machine Learning and their Needs. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–16.
FOOTNOTE
1For the purposes of practicality, we use transparency artifacts as a general term to describe both Data and Model Cards [23] because of their inextricably linked nature. In this paper, we primarily focus on our insights and advances on datasets and correspondingly Data Cards, our novel contribution.
2Proficiency is a combination of data fluency and domain expertise. Data fluency describes the familiarity and comfort that readers have in working with data, whether within or outside their domain of expertise. The greater the comfort with understanding, manipulating, and using data, the greater the fluency. Domain expertise is defined as "knowledge and understanding of the essential aspects of a specific field of inquiry" [22], in reference to the domain of the dataset.