DOI: 10.1145/3287560.3287596

Model Cards for Model Reporting

Published: 29 January 2019

Abstract

Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: one trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related artificial intelligence technology, increasing transparency into how well artificial intelligence technology works. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation.
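The disaggregated, intersectional evaluation described above can be made concrete with a short sketch. The following Python snippet is illustrative only, not tooling from the paper: the records, the attribute names (sex, bucketed Fitzpatrick skin type), and the choice of false negative rate as the metric are all assumptions made for the example. It computes the same metric overall, per single attribute, and per intersectional group, which is the breakdown a model card's quantitative analysis reports.

    # Minimal sketch of disaggregated, intersectional evaluation for a model
    # card. All records and attribute values below are hypothetical.
    from collections import defaultdict

    # (y_true, y_pred, sex, Fitzpatrick skin-type bucket) per evaluation example.
    records = [
        (1, 1, "female", "I-III"),
        (1, 0, "female", "IV-VI"),
        (0, 0, "male",   "I-III"),
        (1, 1, "male",   "IV-VI"),
        (0, 1, "female", "IV-VI"),
        (1, 1, "male",   "I-III"),
    ]

    def false_negative_rate(rows):
        # FNR = FN / (FN + TP); None when the group has no positive examples.
        positives = [(yt, yp) for yt, yp, *_ in rows if yt == 1]
        if not positives:
            return None
        return sum(1 for _, yp in positives if yp == 0) / len(positives)

    def disaggregate(rows, keys):
        # Group rows by the requested attributes, then score each group.
        groups = defaultdict(list)
        for row in rows:
            attrs = {"sex": row[2], "skin_type": row[3]}
            groups[tuple(attrs[k] for k in keys)].append(row)
        return {k: false_negative_rate(v) for k, v in sorted(groups.items())}

    # Aggregate, single-attribute, and intersectional results, as a model
    # card would list them side by side.
    print("overall:", false_negative_rate(records))
    for keys in (("sex",), ("skin_type",), ("sex", "skin_type")):
        print(" x ".join(keys), "->", disaggregate(records, keys))

In this toy output, the (female, IV-VI) group has a higher false negative rate than either the female group or the IV-VI group alone; this is the kind of disparity that aggregate evaluation hides and that model cards are designed to surface.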

References

[1] Avrio AI. 2018. Avrio AI: AI Talent Platform. (2018). https://www.goavrio.com/
[2] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. (2016). https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[3] Emily M. Bender and Batya Friedman. 2018. Data Statements for NLP: Toward Mitigating System Bias and Enabling Better Science. Transactions of the ACL (TACL) (2018).
[4] Joy Buolamwini. 2016. How I'm Fighting Bias in Algorithms. (2016). https://www.ted.com/talks/joy_buolamwini_how_i_m_fighting_bias_in_algorithms#t-63664
[5] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research), Sorelle A. Friedler and Christo Wilson (Eds.), Vol. 81. PMLR, New York, NY, USA, 77--91. http://proceedings.mlr.press/v81/buolamwini18a.html
[6] Alexandra Chouldechova. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153--163.
[7] Federal Trade Commission. 2016. Big Data: A Tool for Inclusion or Exclusion? Understanding the Issues. (2016). https://www.ftc.gov/reports/big-data-tool-inclusion-or-exclusion-understanding-issues-ftc-report
[8] Kimberlé Crenshaw. 1989. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. U. Chi. Legal F. (1989), 139.
[9] Black Desi. 2009. HP computers are racist. (2009). https://www.youtube.com/watch?v=t4DT3tQqgRM
[10] William Dieterich, Christina Mendoza, and Tim Brennan. 2016. COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity. (2016). https://www.documentcloud.org/documents/2998391-ProPublica-Commentary-Final-070616.html
[11] Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (2018).
[12] Cynthia Dwork. 2008. Differential Privacy: A Survey of Results. In Theory and Applications of Models of Computation, Manindra Agrawal, Dingzhu Du, Zhenhua Duan, and Angsheng Li (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--19.
[13] Entelo. 2018. Recruitment Software | Entelo. (2018). https://www.entelo.com/
[14] Daniel Faggella. 2018. Follow the Data: Deep Learning Leads the Transformation of Enterprise - A Conversation with Naveen Rao. (2018).
[15] Thomas B. Fitzpatrick. 1988. The validity and practicality of sun-reactive skin types I through VI. Archives of Dermatology 124, 6 (1988), 869--871.
[16] U.S. Food and Drug Administration. 1989. Guidance for the Study of Drugs Likely to Be Used in the Elderly. (1989).
[17] U.S. Food and Drug Administration. 2013. FDA Drug Safety Communication: Risk of next-morning impairment after use of insomnia drugs; FDA requires lower recommended doses for certain drugs containing Zolpidem (Ambien, Ambien CR, Edluar, and Zolpimist). (2013). https://web.archive.org/web/20170428150213/https://www.fda.gov/drugs/drugsafety/ucm352085.htm
[18] IIHS (Insurance Institute for Highway Safety / Highway Loss Data Institute). 2003. Special Issue: Side Impact Crashworthiness. Status Report 38, 7 (2003).
[19] Institute for the Future and Omidyar Network's Tech and Society Solutions Lab. 2018. Ethical OS. (2018). https://ethicalos.org/
[20] Clare Garvie, Alvaro Bedoya, and Jonathan Frankle. 2016. The Perpetual Line-Up. (2016). https://www.perpetuallineup.org/
[21] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna M. Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for Datasets. CoRR abs/1803.09010 (2018). http://arxiv.org/abs/1803.09010
[22] Google. 2018. Responsible AI Practices. (2018). https://ai.google/education/responsible-ai-practices
[23] Gooru. 2018. Navigator for Teachers. (2018). http://gooru.org/about/teachers
[24] Cyril Goutte and Eric Gaussier. 2005. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval. Springer, 345--359.
[25] Gary S. Collins, Johannes B. Reitsma, Douglas G. Altman, and Karel G. M. Moons. 2015. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. Annals of Internal Medicine 162, 1 (2015), 55--63.
[26] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 3315--3323. http://papers.nips.cc/paper/6374-equality-of-opportunity-in-supervised-learning.pdf
[27] Michael Hind, Sameep Mehta, Aleksandra Mojsilovic, Ravi Nair, Karthikeyan Natesan Ramamurthy, Alexandra Olteanu, and Kush R. Varshney. 2018. Increasing Trust in AI Services through Supplier's Declarations of Conformity. CoRR abs/1808.07261 (2018).
[28] Sarah Holland, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2018. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. CoRR abs/1805.03677 (2018). http://arxiv.org/abs/1805.03677
[29] Ideal. 2018. AI For Recruiting Software | Talent Intelligence for High-Volume Hiring. (2018). https://ideal.com/
[30] DrivenData Inc. 2018. An Ethics Checklist for Data Scientists. (2018). http://deon.drivendata.org/
[31] Jigsaw. 2017. Conversation AI Research. (2017). https://conversationai.github.io/
[32] Jigsaw. 2017. Perspective API. (2017). https://www.perspectiveapi.com/
[33] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, and Rory Sayres. 2018. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). ICML (2018).
[34] Brendan F. Klare, Mark J. Burge, Joshua C. Klontz, Richard W. Vorder Bruegge, and Anil K. Jain. 2012. Face recognition performance: Role of demographic information. IEEE Transactions on Information Forensics and Security 7, 6 (2012), 1789--1801.
[35] Der-Chiang Li, Susan C. Hu, Liang-Sian Lin, and Chun-Wu Yeh. 2017. Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets. PLoS ONE 12, 8 (2017), e0181853.
[36] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV).
[37] Shira Mitchell, Eric Potash, and Solon Barocas. 2018. Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv:1811.07867 (2018).
[38] Pramod Kaushik Mudrakarta, Ankur Taly, Mukund Sundararajan, and Kedar Dhamdhere. 2018. Did the Model Understand the Question? Proceedings of the Association for Computational Linguistics (2018).
[39] AI Now. 2018. Litigating Algorithms: Challenging Government Use of Algorithmic Decision Systems. AI Now Institute.
[40] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 311--318.
[41] Inioluwa Raji. 2018. Black Panther Face Scorecard: Wakandans Under the Coded Gaze of AI. (2018).
[42] Microsoft Research. 2018. Project InnerEye - Medical Imaging AI to Empower Clinicians. (2018). https://www.microsoft.com/en-us/research/project/medical-image-analysis/
[43] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70. PMLR, Sydney, Australia.
[44] Digital Reasoning Systems. 2018. AI-Enabled Cancer Software | Healthcare AI: Digital Reasoning. (2018). https://digitalreasoning.com/solutions/healthcare/
[45] Turnitin. 2018. Revision Assistant. (2018). http://turnitin.com/en_us/what-we-offer/revision-assistant
[46] Shannon Vallor, Brian Green, and Irina Raicu. 2018. Ethics in Technology Practice: An Overview. (22 June 2018). https://www.scu.edu/ethics-in-technology-practice/overview-of-ethics-in-tech-practice/
[47] Lucy Vasserman, John Li, CJ Adams, and Lucas Dixon. 2018. Unintended bias and names of frequently targeted groups. Medium (2018). https://medium.com/the-false-positive/unintended-bias-and-names-of-frequently-targeted-groups-8e0b81f80a23
[48] Sahil Verma and Julia Rubin. 2018. Fairness Definitions Explained. (2018).
[49] Joz Wang. 2010. Flickr Image. (2010). https://www.flickr.com/photos/jozjozjoz/3529106844
[50] Amy Westervelt. 2018. The medical research gender gap: How excluding women from clinical trials is hurting our health. (2018).
[51] Mingyuan Zhou, Haiting Lin, S. Susan Young, and Jingyi Yu. 2018. Hybrid sensing face detection and registration for low-light and unconstrained conditions. Applied Optics 57, 1 (2018), 69--78.


Published In

FAT* '19: Proceedings of the Conference on Fairness, Accountability, and Transparency
January 2019
388 pages
ISBN:9781450361255
DOI:10.1145/3287560
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ML model evaluation
  2. datasheets
  3. disaggregated evaluation
  4. documentation
  5. ethical considerations
  6. fairness evaluation
  7. model cards

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

FAT* '19

Article Metrics

  • Downloads (Last 12 months)2,484
  • Downloads (Last 6 weeks)317
Reflects downloads up to 13 Jan 2025

Cited By
  • (2025) Fine-Tuning Large Language Models on Cultural Nuances for Linguistically Driven Hyper-Personalization: A Literature Review. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 11, 1 (3 Jan 2025), 53--60. https://doi.org/10.32628/CSEIT251112101
  • (2025) An Emerging Design Space of How Tools Support Collaborations in AI Design and Development. Proceedings of the ACM on Human-Computer Interaction 9, 1 (10 Jan 2025), 1--28. https://doi.org/10.1145/3701181
  • (2025) Machine Learning Lineage for Trustworthy Machine Learning Systems: Information Framework for MLOps Pipelines. IEEE Software 42, 1 (1 Jan 2025), 51--58. https://doi.org/10.1109/MS.2024.3414317
  • (2025) Machine learning derived retinal pigment score from ophthalmic imaging shows ethnicity is not biology. Nature Communications 16, 1 (2 Jan 2025). https://doi.org/10.1038/s41467-024-55198-7
  • (2025) AI product cards: a framework for code-bound formal documentation cards in the public administration. Data & Policy 7 (8 Jan 2025). https://doi.org/10.1017/dap.2024.55
  • (2025) An empirical study of developers' challenges in implementing Workflows as Code: A case study on Apache Airflow. Journal of Systems and Software 219 (Jan 2025), 112248. https://doi.org/10.1016/j.jss.2024.112248
  • (2025) Challenges of implementing ChatGPT on education: Systematic literature review. International Journal of Educational Research Open 8 (Jun 2025), 100401. https://doi.org/10.1016/j.ijedro.2024.100401
  • (2025) Correlation-based methods for representative fairness metric selection: An empirical study on efficiency and caveats in model evaluation. Expert Systems with Applications 268 (Apr 2025), 126344. https://doi.org/10.1016/j.eswa.2024.126344
  • (2025) On the reliability of Large Language Models to misinformed and demographically informed prompts. AI Magazine 46, 1 (8 Jan 2025). https://doi.org/10.1002/aaai.12208
  • (2024) Artificial Intelligence & the Capacity for Discrimination: The Imperative Need for Frameworks, Diverse Teams & Human Accountability. IgMin Research 2, 10 (10 Oct 2024), 801--806. https://doi.org/10.61927/igmin250
