Black-Box Access is Insufficient for Rigorous AI Audits
Casper, Ezell, et al. FAccT '24, June 03-06, 2024, Rio de Janeiro, Brazil

External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. Meanwhile, outside-the-box access to training and deployment information (e.g., methodology, code, documentation, data, deployment details, findings from internal evaluations) allows auditors to scrutinize the development process and design more targeted evaluations. In this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. We also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. Given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.

Historically, most academic work on evaluating AI systems has been conducted on models whose parameters, data, and methodology are openly available. AI systems that are not available to the public, including ones that are proprietary or in pre-deployment, pose challenges for oversight. AI audits are structured evaluations designed to identify risks and improve transparency by assessing how well models and methods meet specific desiderata [67,202]. Norms for AI audits are not yet well established, and their effectiveness can vary depending on the degree of system access granted to auditors [11,33,252]. This is crucial because existing calls for audits are often agnostic to the form of access, and industry actors have previously lobbied to limit the access given to auditors. (For example, while lobbying the European Union on a draft of the EU AI Act, Google made the unsubstantiated claim in 2021 that "There will always be better methods for verifying the performance of an AI system (ex: input/output auditing) than direct access to source code" [108].) Recently, some developers of prominent state-of-the-art AI systems have kept most details of their models private [29]. To public knowledge, voluntary external audits of these systems have primarily involved analysis of the input/output behavior of models [13,192,221,306]. This form of access, in which auditors are only able to see outputs for given inputs, is known as black-box. Unfortunately, black-box access is very limiting for auditors. Some problems, such as anomalous failures, are difficult to find with black-box access [150], and others, such as dataset biases, can be actively reinforced by testing data [278].
The ability to query a black-box system is useful, but many of today's evaluation techniques require access to weights, activations, gradients, or the ability to fine-tune the model [34]. White-box access refers to the unrestricted ability to observe a system's internal workings. It enables evaluators to apply more powerful attacks to automatically identify weaknesses [107,228], study internal mechanisms responsible for undesirable model behaviors [132,153,164], and identify harmful dormant capabilities through fine-tuning [242,342]. Meanwhile, outside-the-box access involves additional contextual information about a system's development or deployment, such as methodology, code, documentation, hyperparameters, data, deployment details, and findings from internal evaluations. It allows auditors to study risks that stem from methodology or data [24,46,195,278] and makes it easier to design useful tests. This paper makes four contributions: (1) We present shortcomings of black-box methods for evaluating AI systems (Section 3). (2) We overview the ways in which white-box methods involving attacks, model interpretability, and fine-tuning substantially expand the capabilities of evaluators (Section 4). (3) Similarly, we examine how outside-the-box access, including methodology, code, documentation, hyperparameters, data, deployment details, and findings from internal evaluations, allows for more thorough evaluations (Section 5). (4) Finally, we describe methods to conduct white- and outside-the-box audits securely to avoid leaks of sensitive information. These include technical solutions involving application programming interfaces, physical solutions involving secure research environments, and legal mechanisms that have precedent in other industries with audits (Section 6).
Given the growing evidence that different forms of access can facilitate very different levels of evaluation, we draw two conclusions. First, transparency regarding model access and evaluation methods is necessary to properly interpret the results of an AI audit. Second, white- and outside-the-box access allow for substantially more scrutiny than black-box access alone. When higher levels of scrutiny are desired, audits should be conducted with higher levels of access. (Jointly with this paper, we submitted a comment on policy questions related to white- and outside-the-box audits in response to a request for information from the US National Institute of Standards and Technology [207].)

Figure 1: Black-box access lets auditors query the system and analyze the resulting outputs. Grey-box access lets auditors access limited internal information. White-box access lets auditors access the full system. Outside-the-box access gives auditors contextual information. In this paper, we argue that white- and outside-the-box access are key for rigorous AI audits.

2 BACKGROUND

2.1 Black, Grey, White, and Outside-the-Box Access
In accordance with literature on security and software testing [144], we differentiate black-, grey-, and white-box access. We also introduce two new concepts: "de facto white-box" access and "outside-the-box" access. Figure 1 illustrates these categories, and Table 1 summarizes which techniques each type of access allows.
(1) Black-box access allows users to design inputs for a system, query it, and analyze the resulting outputs.
(2) Grey-box access offers users limited access to a system's inner workings. For neural networks, this can include information such as input embeddings, inner neuron activations, or sampling probabilities. There are many ways that users can be given information about a system's inner workings, and many corresponding shades of grey. De facto white-box access is a very light grey form of access that allows users to run arbitrary processes on a system indirectly, with the constraint that the system's parameters cannot be copied. We discuss this and other methods to minimize the possibility of leaks in Section 6.
(3) White-box access allows users full access to the system. This includes access to weights, activations, gradients, and the ability to fine-tune the model.
(4) Outside-the-box access grants users access to additional information about the system's development and deployment. This can include methodological details, source code, documentation, hyperparameters, training data, deployment details, and findings from internal evaluations. Different forms of outside-the-box access can vary greatly in their comprehensiveness. For example, possessing high-level details (such as a "model card" [195]) is less informative than having comprehensive documentation from training and testing.
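To make the distinctions above concrete, the following sketch contrasts what each level of access returns for a toy two-layer network. The model, function names, and interface are illustrative assumptions for exposition, not from the paper.

```python
import numpy as np

# Hypothetical toy model: a two-layer network with fixed random weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

def _forward(x):
    h = np.tanh(W1 @ x)  # hidden activations
    return W2 @ h, h

def query_black_box(x):
    """Black-box access: outputs for a given input, nothing else."""
    y, _ = _forward(x)
    return y

def query_grey_box(x):
    """Grey-box access: outputs plus limited internals (here, activations)."""
    return _forward(x)

def query_white_box(x):
    """White-box access: outputs, activations, and exact input gradients."""
    y, h = _forward(x)
    grad_x = W1.T @ (W2.flatten() * (1 - h ** 2))  # d y / d x via chain rule
    return y, h, grad_x
```

A black-box auditor could only approximate `grad_x` with many finite-difference queries; white-box access returns it exactly in a single call.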

2.2 Regulatory Frameworks' Reliance on Audits
Emerging frameworks for AI governance have been designed to rely on high-quality audits. Audits have been called for in the White House Executive Order on AI [217], European Union policy [85-87], other policy initiatives [61,68,211,311], general AI principles [60,94,208,216,312], voluntary standards [214,235,314], multilateral commitments [5], and position papers [209,236]. In particular, audits in these proposals are intended to provide trustworthy assessments of potential harm and explanations of system behaviors.
Regulatory frameworks have called for evaluations to accurately assess risks. Some jurisdictions may require risk assessment evaluations for AI systems used in certain contexts. These can include tests to ensure non-discrimination, such as New York City's requirement for bias audits of automated employment decision tools [68], or quality and performance evaluations [116]. The draft EU AI Act [87] has more recently harmonized quality assurance standards across several high-risk use cases, with provisions for external oversight. Regulators are also increasingly interested in external oversight of AI systems with potentially harmful capabilities. Recently, U.S. Executive Order 14110 [217] introduced requirements for developers of certain powerful models to report the results of red-team safety testing. In Appendix A, we expand on the variety of motivations for AI audits.
Beyond regulatory requirements, developers may also voluntarily subject their systems to external evaluations. For example, the NIST Risk Management Framework provides recommendations for audits related to system design and reliable operation [314]. In Section 3, Section 4, and Section 5, we overview advantages of white- and outside-the-box access over black-box access for rigorous assessments.
Assessing model explanations enables scrutiny of automated decisions. Regulatory frameworks also use audits to provide those affected by automated decision-making with explanations of the decisions [111,238]. In some jurisdictions, such as the European Union, when individuals are harmed by automated decision-making systems, they may have the right to an explanation, and the results of these explanations may entitle them to remediation [36,116]. Further, explanation requirements may exist for particularly high-risk systems, such as EU platform regulations (the Digital Markets Act and the Digital Services Act) that require transparency from large online platforms using AI systems (e.g., ranking algorithms) to protect against discrimination and abuse of market power [88,115]. Finally, disclosure of evidence may be required under liability rules, such as the Product and AI Liability Directives in the EU, to enable potential claimants to adequately support damage claims [35,114]. When producing explanations, a report from NIST emphasizes the importance of explanation accuracy: "[a]n explanation [that] correctly reflects the reason for generating the output and/or accurately reflects the system's process" [237]. In Section 3 and Section 4.2, we overview advantages of white-box access over black-box access for generating reliable explanations.

2.3 Audits in the Status Quo
Recent advancements in AI capabilities, especially from large generative models, have increased public attention on AI audits. As of January 2024, there are no widely adopted norms for conducting AI audits [25]. Details of AI audits can vary because they depend upon the system, how it will be used, and what risks it poses. Auditing frameworks and metrics have been proposed for specific use cases, including hiring [142,245], facial recognition [143,250], healthcare [177,183], recommender systems [57,260], and general-purpose language models [202]. However, Raji et al. [252] identify five general limitations of algorithmic audits: scope, independence, level of access, professionalism, and public disclosure of methods and results.
Currently, evaluations of proprietary or pre-deployment AI systems are predominantly performed in-house by developers with selective disclosure of methods and outcomes. Some developers have voluntarily partnered with external auditors and provided them with black-box access to state-of-the-art systems [13,147,192,222,306]. Additionally, some developers run programs for external researchers to support their internal evaluation process (e.g., OpenAI's Preparedness Challenge [223] and Red-Teaming Network [224]). However, to public knowledge, these industry-based efforts have involved black-box and limited outside-the-box access, such as model cards [195].

3 LIMITATIONS OF BLACK-BOX ACCESS
Black-box evaluations of AI systems are based on analysis of their inputs and outputs only. Such evaluations often involve assessing performance on test sets [121,172,269,297,299,320] or searching for inputs that elicit harmful outputs [51,147,233,234,271,325]. Generative AI audits often attempt to elicit undesirable capabilities or behaviors (e.g., [30,118,140,147,198,206,266,284,292,301]). However, black-box methods are inherently limited in their ability to identify harms or provide meaningful explanations. Readers with a computer science background can consider the analogy of attempting to evaluate the performance of software without reading or modifying its source code.
Black-box methods are not well suited to develop a generalizable understanding. Black-box access limits evaluators to analyzing a system using only inputs and outputs. However, the vast number of possible inputs to AI systems makes it intractable to develop a complete understanding from this alone. This forces evaluators to rely on heuristics to produce 'relevant' inputs for evaluation. For this reason, black-box methods have been shown to be unreliable for detecting failures that elude typical test sets, including jailbreaks, adversarial inputs, and backdoors [58,325,353].
Black-box access prevents system components from being studied separately. Analyzing components of a system separately is ubiquitous in science and engineering. It enables engineers to trace problems, supporting more targeted interventions. However, black-box access obscures which subsystems the AI system is composed of. For example, black-box access does not allow input or output filters to be studied separately from the rest of the system. Other issues can arise from a lack of outside-the-box access to data. Datasets can inform evaluations related to privacy and copyright [262], and can help to avoid problems from data contamination [72,105,131]. A lack of outside-the-box knowledge about how the system is deployed also prevents evaluators from making a more practical assessment of broader societal impacts [328].
Black-box evaluations can produce misleading results. Since black-box evaluations rely entirely on the queries made to the system, they are biased by how evaluators design inputs [138,205,326]. This can lead to mistaken conclusions about the system's characteristics. For example, systems may satisfy simple statistical tests for non-discrimination yet still have undesirable biases in their underlying reasoning [246]. Developers with information about the black-box tests can exacerbate this problem by modifying the model's output behavior on test cases despite unresolved flaws in its internal reasoning. In addition, Schaeffer et al. [270] provide examples of black-box prompt-based evaluation methods for language models that can lead to misunderstandings of their emergent capabilities.
Black-box explanation methods are often unreliable. Using black-box methods alone to produce explanations for an AI system's decisions is difficult [117,264]. Many black-box techniques for providing counterfactual explanations of model decisions are misleading because they fail to reliably identify causal relationships between the system's input features and outputs [62]. Explanation methods for black-box systems can also be exploited by adversaries to produce misleading explanations for harmful decisions [6,290]. Furthermore, when generative language models are asked to explain their decisions, their justifications tend not to be faithful to their actual reasoning [310].
Black-box evaluations offer limited insights to help address failures. Black-box evaluations offer little insight into ways to address the problems they discover. The main technique they enable is training on problematic examples, but this can fail to address the underlying problem [78,97], be sample-inefficient [322], and may introduce new issues. Corrective actions are not robust when they fail to address a problem at its root. For example, some recent works have shown that safety measures built into large language models can be almost entirely undone by fine-tuning on a small number of harmful examples [168,242,336,342]. In contrast, white-box methods reveal more about the nature of flaws, facilitating more precise debugging [322].

4 ADVANTAGES OF WHITE-BOX ACCESS
White-box access offers a wider range of techniques to detect symptoms, understand causes, and mitigate harms in a targeted manner [34]. Even for a system that will only be deployed as a black box, white-box audits are still more useful for finding problems. Here, we survey techniques for white-box evaluations and their advantages over black-box ones.
4.1 White-box attack algorithms are more effective and efficient.
In machine learning, adversarial attacks are inputs designed specifically to make a system fail. AI systems have a long history of unexpected failure modes that can be triggered by very subtle features of their inputs [7,346]. Attacks play a central role in evaluations because they help to assess a system's worst-case behavior.
White-box algorithms produce stronger attacks. White-box algorithms allow for gradient-based optimization of adversarial inputs, which is powerful compared to simpler search methods. For example, white-box adversarial attack algorithms against vision systems typically use the gradient of the adversarial objective with respect to the input pixels to design adversarial inputs [107,228]. This is much more effective for finding vulnerabilities than unguided black-box search methods. Consequently, white-box attacks are dominant in vision applications. In reinforcement learning, white-box access to a target agent also helps develop stronger adversarial attacks against it [49,323]. For language models, optimizing adversarial inputs with gradient-based methods is more challenging because text (unlike pixels) is discrete, which prevents gradient propagation. Nonetheless, there are various state-of-the-art white-box techniques for attacking language models. These include using a differentiable approximation to the process of sampling text [112,295,318], projecting adversarial embeddings onto text embeddings [329], and performing gradient-informed searches over textual modifications [80,136,170,175,258,287,355].
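As a minimal illustration of why gradient access matters, here is a hedged sketch of an FGSM-style attack on a toy linear scorer; the model and all names are assumptions for exposition, not any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)  # toy linear "classifier" weights

def score(x):
    # Higher score pushes the decision toward the attacker's target class.
    return float(w @ x)

def fgsm(x, eps):
    # White-box step: for a linear model the gradient of the score with
    # respect to x is exactly w, so stepping each input dimension in the
    # sign of the gradient maximally increases the score under an
    # L-infinity budget of eps.
    return x + eps * np.sign(w)

x = rng.normal(size=16)
x_adv = fgsm(x, eps=0.1)
# The score increase is exactly eps * sum(|w|): guaranteed, not searched for.
```

A black-box attacker can only hunt for such a perturbation by trial and error; the gradient delivers it in one step.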
Many black-box and grey-box attack algorithms are simply indirect or inefficient versions of white-box ones. Many black-box attacks against AI systems involve attacking a white-box model with a white-box algorithm and then testing the resulting attack on the target black-box model [179,350]. For vision models, the main motivation behind studying black-box attacks is that white-box access is not always available to attackers [21]. Additionally, several of the most effective attacks against state-of-the-art models, such as GPT-4 and Claude-2, have simply been the result of transferring a white-box attack generated against an open-source model to the intended black-box target model [355]. Other types of black- and grey-box attack algorithms involve inefficiently estimating gradients by analyzing outputs or sampling probabilities across many queries [110,128], when more precise gradients could be obtained trivially with white-box access.
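The inefficiency of query-based gradient estimation can be made concrete: a central finite-difference estimate of a d-dimensional gradient costs about 2d forward queries, versus a single backward pass with white-box access. The scalar function below is a toy stand-in for a model's loss, chosen only so the true gradient is known.

```python
import numpy as np

def f(x):
    # Toy stand-in for a black-box model's scalar loss.
    return float(np.sum(np.sin(x)))

def estimate_grad(f, x, eps=1e-5):
    # Central finite differences: 2 * len(x) queries in total,
    # compared to one backward pass with white-box access.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.linspace(0.0, 1.0, 8)
g_est = estimate_grad(f, x)  # 16 queries for an 8-dimensional input
```

For a model with high-dimensional inputs (images, long token sequences), this query cost quickly becomes prohibitive, which is exactly the gap white-box access closes.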
Latent space attacks help to make stronger assurances. Typically, AI systems are attacked by crafting inputs meant to make them exhibit undesirable behavior. However, input space attacks are not well suited to diagnosing certain hard-to-find issues, including high-level misconceptions [300], anomalous failures [353], backdoors [58,334], and deception [63,231]. A complementary technique is to relax the problem and attack the system's internal latent representations instead. The motivation for latent space attacks is that some failure modes are easier to find in the latent space than in the input space [53], because concepts important to the system's reasoning are represented at a higher level of abstraction inside the model [16,135,275,309,354]. Thus, robustness to latent attacks enables evaluators to make stronger assurances of safe worst-case performance. Latent space attacks are also more efficient to produce because they require less gradient propagation than input space attacks [229,243], allowing for more thorough debugging work on a limited time and computing budget. Latent space attacks are still an active area of research, but some works have shown that robustness to latent space attacks effectively indicates robustness to input space attacks in vision models [158,178,226,267,344,351]. Since textual inputs to language models are discrete, only latent space attacks allow for the direct use of gradient-based optimization, rendering them especially useful for language models [119,134,148,156,171,176,227,265,352].
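A minimal sketch of the idea, under illustrative assumptions (a toy two-layer network): the attack perturbs the hidden state directly, so no gradient propagation through earlier layers is needed, and the discreteness of the input space is irrelevant.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(8, 4))  # first layer (never touched by the attack)
w2 = rng.normal(size=8)       # readout weights

def output_from_latent(h):
    return float(w2 @ h)

x = rng.normal(size=4)
h = np.tanh(W1 @ x)  # latent state produced by a benign input

# Attack the latent state directly: the gradient of the output with
# respect to h is just w2 here, so a one-step latent attack requires
# no backpropagation through the earlier layer at all.
h_adv = h + 0.1 * np.sign(w2)
```

An input-space attack on the same network would need to propagate gradients through `W1` and the `tanh` nonlinearity, and for discrete inputs it could not use gradients directly at all.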
White-box methods expand the attack toolbox. Some black-box attack methods are competitive, particularly against language models. These include methods based on local search [240], rejection sampling at scale [96], Langevin dynamics [157,286], evolutionary algorithms [162], and reinforcement learning [51,74,233]. Additionally, some of the most effective methods for attacking language models involve human or human-guided generation of adversarial prompts [180,277,325]. However, even when black-box attacks are useful, white-box algorithms are complementary because they generate qualitatively different kinds of attacks. For example, many black-box techniques produce attacks that appear as natural language (e.g., [277]), while white-box algorithms are state-of-the-art for synthesizing adversarial prompts that appear to humans as unintelligible text (e.g., [355]). Relying only on black-box methods would likely miss these types of failures entirely because finding them with black-box searches or human inductive biases would be highly improbable. Black- and white-box methods can also be combined to conduct hybrid attacks by using the results of one method as an initialization for another. Combinations of attacks tend to be better at helping humans find vulnerabilities than any single method alone [50].

4.2 White-box interpretability tools aid in diagnostics.
While it is possible to infer properties of a system from studying inputs and outputs, understanding its internal processes allows evaluators to more thoroughly assess its trustworthiness [103,256].
Interpreting the inner mechanisms of models has been recognized as a key part of agendas for reducing harms from AI systems [126,137,212], and explaining how models make specific decisions has also been recognized as a way to protect the rights of individuals affected by AI [111,238]. Past diagnostic successes with interpretability tools have involved identifying novel attacks [353], internal representations of spurious features [95,184], brittle feature representations [44,48,50,52,101,123,199,335,341], and limitations of key-value memories in transformers [99,100,189]. As an added benefit, attributing a problem to specific parts of the system's architecture or representations allows developers to address it in a more precise way [322].
Studying internal representations can help to establish the presence or absence of specific capabilities. White-box methods allow for more precise identification of what knowledge and capabilities a system has [8,18]. Tools such as concept vectors [2,146] and probes [66] allow humans to assess the extent to which system internals can be understood in terms of familiar concepts. For example, these techniques have been used to study features related to fairness in visual classifiers [109], provide evidence that language models internally represent space and time [113], and show that networks sometimes represent truth-like features along linear directions [38,185]. Methods for this are imperfect and still an active area of research [14,84,257], but interpretations like these offer a potentially powerful way to identify whether a model represents specific concepts.
Consider an example. Suppose that an auditor wants to assess sycophancy: a language model's tendency to pander to the biases of users who chat with it [234,280]. For example, an evaluator might be concerned that the system will respond differently when the user says they are conservative or liberal in the chat. Black-box techniques could only be used to argue that the system is not sycophantic by producing examples and analyzing them for apparent sycophancy. However, a white-box interpretability-based approach could offer much more information. For example, if a classifier could not distinguish from the model's internal representations whether the user revealed themselves to be conservative or liberal, this would offer stronger evidence that the system will reliably not exhibit this type of sycophancy.
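A hedged sketch of the probing approach described above, using synthetic stand-ins for the model's activations; the "identity" feature direction, dimensions, and all names are assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 400
direction = rng.normal(size=d)        # hypothetical "user identity" feature
labels = rng.integers(0, 2, size=n)   # 0 = one persona, 1 = the other
# Synthetic stand-ins for the model's internal activations across n chats:
# noise plus a persona-dependent shift along the feature direction.
acts = rng.normal(size=(n, d)) + np.outer(labels - 0.5, direction)

# Fit a least-squares linear probe on half the data, evaluate on the rest.
X_tr, y_tr = acts[:200], labels[:200]
X_te, y_te = acts[200:], labels[200:]
w_probe, *_ = np.linalg.lstsq(X_tr, y_tr - 0.5, rcond=None)
acc = ((X_te @ w_probe > 0).astype(int) == y_te).mean()
# High probe accuracy: the persona is linearly decodable from the internals,
# so this form of sycophancy is at least possible. Near-chance accuracy
# would be evidence that the model does not encode the persona at all.
```

The probe's verdict is about what the model represents, not just what it happened to output on the tested prompts, which is the extra information white-box access buys.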
Mechanistic understanding helps to make stronger assurances. In general, it is impossible to make guarantees about black-box systems using a finite number of queries without additional assumptions. In contrast to black-box methods, which can only show the existence of failures by finding inputs that elicit them, thoroughly understanding the computations inside a model gives auditors a complementary way to find evidence against the existence of failure modes. A mechanistic understanding can help researchers develop a predictive model of how the system would act for broad classes of inputs. Some works have aimed to provide thorough investigations of how networks perform simple tasks [39,204,348]. Although scaling thorough analysis is an open challenge [303], it offers a strategy for making strong assurances. Recent works have attempted to make progress on this problem by using sparse autoencoders to allow evaluators to more thoroughly study the features represented inside large language models [31,69,184].
White-box methods expand the toolbox for explaining specific AI system decisions. As discussed in Section 2, existing regulatory frameworks have been designed with specific desiderata for model explanations in order to determine accountability and protect individual rights. Many techniques are used to provide explanations of model behaviors during audits [343]. Black-box techniques can only attribute decisions to input features by modifying inputs and analyzing how model outputs change [259]. However, these techniques are frequently misleading [264] and can fail to reliably identify causal relationships between the system's input features and outputs [62]. White-box access expands and strengthens the toolbox by allowing for gradient-based techniques [70,79,173,347]. It also allows explainability tools to be combined with interpretations of the model's mechanisms to explain its behaviors in terms of more abstract concepts.
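As a small illustration of a gradient-based attribution, consider input-times-gradient on a toy linear model; the model and names are illustrative assumptions, not a specific toolkit's API. For a linear model the attributions are exact and sum to the output, something input-perturbation heuristics can only approximate.

```python
import numpy as np

w = np.array([2.0, 0.0, -1.0, 0.5])  # toy linear model's weights

def model(x):
    return float(w @ x)

def input_x_gradient(x):
    # The gradient of the output with respect to x is w, so each
    # feature's attribution is w_i * x_i; the attributions sum exactly
    # to the model's output (a "completeness" property).
    return w * x

x = np.array([1.0, 5.0, 2.0, 0.0])
attr = input_x_gradient(x)
# Feature 1 gets zero attribution despite a large input value: the model
# provably ignores it, a fact a black-box perturbation method could only
# estimate from many queries.
```
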

4.3 Fine-tuning reveals risks from latent knowledge or post-deployment modifications.
State-of-the-art AI systems are typically trained on large amounts of internet data, often in multiple stages. This can cause them to learn undesirable capabilities, such as knowledge of how to perform illegal activities [277,354] or the ability to produce harmful content [23,24,98,182,253,278,304,305]. Developers attempt to remove harmful abilities through fine-tuning, but these abilities can unexpectedly resurface through "jailbreaks" [9,55,73,169,180,220,241,255,277,282,308,325,339,340,355] or further fine-tuning on a small number of new examples [168,242,336,342]. The existence of harmful dormant capabilities in models thus poses risks from attacks and fine-tuning, especially if models are leaked (e.g., Stable Diffusion [161]), open-sourced (e.g., Llama-2 [306]), or deployed with fine-tuning access via API (e.g., GPT-3.5 [221]). Consequently, the ability to fine-tune a model offers another strategy to search for evidence of undesirable capabilities and assess risks in deployment.
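The fragility of such safety measures can be caricatured with a toy example under loudly illustrative assumptions: a single "refusal" offset `b` stands in for a safety adjustment, and a handful of fine-tuning examples suffices for gradient descent to remove it almost entirely.

```python
import numpy as np

# Toy sketch (all numbers illustrative): safety fine-tuning installed a
# scalar offset b that shifts the model's outputs toward refusal.
b = 5.0
targets = np.zeros(5)  # 5 "harmful" fine-tuning examples with no refusal shift
lr = 0.2
for _ in range(50):
    preds = np.full(5, b)                      # model output on these examples
    grad_b = np.mean(2 * (preds - targets))    # d MSE / d b
    b -= lr * grad_b
# After a short fine-tuning run on just 5 examples, b is driven to ~0:
# the safety shift is effectively gone.
```

Real safety behaviors are of course not a single scalar, but the cited works [168,242,336,342] report the analogous empirical result: small fine-tuning runs can largely undo safety training.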

5 ADVANTAGES OF OUTSIDE-THE-BOX ACCESS
In addition to having access to AI systems themselves, giving auditors outside-the-box access to contextual information also helps to identify risks. This can include methodological details, source code, documentation, hyperparameters, training data, deployment details, and the findings of internal evaluations. While it can come in many types, all outside-the-box information is useful to auditors for three common reasons: (1) helping auditors more effectively design and implement tests, (2) offering clues about potential issues, and (3) helping auditors trace problems to their sources. See also Appendix B, where we discuss how outside-the-box access to technical assistance from developers can also be useful for auditors.

Code, documentation, and hyperparameters help auditors work more efficiently. As discussed in Section 4, audits can require a number of technical evaluations. Having code and documentation from developers can streamline the process of designing them. For example, consider fine-tuning evaluations. Fine-tuning a model typically requires precisely configured code and hyperparameters that have been carefully selected after extensive testing, often over the course of weeks or months. Using the developer's existing resources is a much more efficient option for auditors than re-implementing everything from scratch.
Access to methodological details helps to identify risks. Knowing methodological details can reveal shortcuts taken during development, which can guide evaluators toward discovering problems. For example, if a system was trained with human-generated data from a non-representative cohort of humans, this can suggest specific social biases that the system may have internalized [46,268,298]. Knowing the findings of internal evaluations is especially useful for helping auditors target their efforts toward a set of complementary evaluations. Furthermore, when developers attempt to mitigate flaws, auditors can better assess the effectiveness of these efforts if they have detailed information about the attempted mitigation (e.g., fine-tuning datasets, both old and new versions of model weights, etc.) [239].
Access to data helps auditors trace problems and assess fair use. Recent work has highlighted the ability of dataset audits to identify harmful and biased content used to train models [23,24,98,182,278,304,305]. For example, Birhane et al. [24] and Thiel [304] were able to identify previously overlooked examples of unintended hateful, sexual, and child-abuse-related content in widely used training datasets. Access to training data also helps to investigate risks of data-poisoning attacks, which is especially important for systems trained on internet data [41,254,319,321]. Meanwhile, legal questions are currently being debated involving the extent to which training generative AI systems on copyrighted content constitutes fair use [65,120,139,261]. Auditors may require access to training data to properly assess whether it was used in accordance with copyright law.
Contextual information makes it easier to hold developers accountable. Requirements to produce documentation place greater responsibility on developers to detail their methods, especially if they are subject to regulatory penalty defaults [332,338]. Contextual information also indicates whether developers made decisions in a responsible manner [200]. For example, documentation can provide insights into why certain design choices were made over others [190]. Datasets and training details can help trace risks to intentional choices, and internal evaluation reports provide insights into how the developer responded to findings. By increasing the scrutiny placed on decisions in the development process, requirements for greater methodological transparency to auditors can deter developers from taking risks in the first place [46,159].

METHODS TO ADDRESS SECURITY RISKS
A concern with white- and outside-the-box audits is an increased risk that a developer's models or intellectual property could be leaked [29]. In turn, leaks could compromise developers' trade secrets and pose risks to the public if they enable misuse [210]. Widespread norms for secure audits in AI do not yet exist, but there is precedent in other fields for navigating similar challenges to enable secure oversight. The risk of leaks can be minimized through several technical, physical, and legal mechanisms. With these measures, developers can provide white- and outside-the-box access to auditors without the system's parameters leaving their servers, reducing leakage risks to a level comparable to those posed by common existing practices.
Technical: API access can offer remote auditors de facto white-box access. Forms of structured access, particularly research application programming interfaces (APIs) [26,34,92,283], could enable auditors to analyze systems using some white-box tools without giving auditors direct access to model parameters. We refer to this form of access as de facto white-box access if it enables auditors to indirectly run arbitrary white-box processes on models while restricting direct access to model parameters. One example of running an algorithm that accesses a model's parameters via API is OpenAI's GPT-3.5 API, which allows for fine-tuning [221]. However, more customizable APIs (e.g., [92]) would be needed to allow for more flexible access. Another proposed paradigm is a flexible query API [225], in which auditors are given complete access to mock versions of a model and data. The auditors then develop evaluations using their complete access to these mock artifacts before submitting them to be run on the true model and dataset. This allows auditors to better customize their evaluations.
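To make the idea concrete, the following toy sketch (our own construction, not any existing auditing API) shows how a developer-side endpoint could serve white-box observables such as intermediate activations while refusing requests for raw parameters. The model, request schema, and function names are all invented for illustration.

```python
# Minimal sketch of "de facto white-box" structured access (illustrative only).
import math

# Toy two-layer model held by the developer; auditors never see these values.
_WEIGHTS = {"layer0": [[0.5, -0.2], [0.1, 0.8]], "layer1": [[1.0, -1.0]]}

def _forward(x):
    """Run the toy model, recording per-layer activations."""
    acts = {"input": x}
    h = [sum(w * xi for w, xi in zip(row, x)) for row in _WEIGHTS["layer0"]]
    h = [math.tanh(v) for v in h]      # nonlinearity between layers
    acts["layer0"] = h
    y = [sum(w * hi for w, hi in zip(row, h)) for row in _WEIGHTS["layer1"]]
    acts["layer1"] = y
    return acts

ALLOWED = {"output", "activations"}    # parameter dumps are refused

def audit_api(request):
    """Developer-side endpoint: serve white-box observables, not weights."""
    if request["type"] not in ALLOWED:
        return {"error": "request type not permitted"}
    acts = _forward(request["x"])
    if request["type"] == "output":
        return {"output": acts["layer1"]}
    return {"activations": acts[request["layer"]]}

# Auditor's view: they can probe internals without holding the parameters.
print(audit_api({"type": "output", "x": [1.0, 2.0]}))
print(audit_api({"type": "activations", "x": [1.0, 2.0], "layer": "layer0"}))
print(audit_api({"type": "weights"}))  # refused
```

A real system would add authentication, rate limits, and logging; the point here is only the separation between what the auditor can compute with and what they can hold.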
The goal of API access is to ensure that the system cannot easily be reconstructed. However, prohibiting the sharing of weights is neither necessary nor sufficient for this. For example, sharing a small subset of weights with auditors is unlikely to pose significant security risks [187], but sharing other information, such as the product of weights with their pre-synaptic neurons' activations, may allow for parameter reconstruction. This suggests the need for a process by which a developer can raise grievances about specific requests from auditors and have them adjudicated. Overall, while conceptually simple, designing APIs that simultaneously provide the comprehensiveness, flexibility, and security required for rigorous auditing is an open area of research [34]. Greater clarity is required regarding how to balance different desiderata. For example, more comprehensive access may impact security by facilitating model reconstruction, as discussed above. In Appendix C, we also discuss how investment, research, and development in secure auditing infrastructure can help with progress toward improved techniques.
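The reconstruction risk mentioned above can be illustrated with a toy linear layer: if an API returns the pre-activation products W·a for auditor-chosen activation vectors a, then linearly independent probes fully determine the weight matrix. A minimal sketch (our own illustration, not a real attack on any deployed API):

```python
# Sketch of why returning (activation, W @ activation) pairs can leak
# parameters: with linearly independent probes, W is fully determined.

HIDDEN_W = [[2.0, -1.0], [0.5, 3.0]]   # "secret" 2x2 weight matrix

def preactivation(a):
    """What a leaky API might return: W @ a for an auditor-chosen vector a."""
    return [sum(w * ai for w, ai in zip(row, a)) for row in HIDDEN_W]

# Probe with standard basis vectors e_1, e_2: each response is a column of W.
cols = [preactivation(e) for e in ([1.0, 0.0], [0.0, 1.0])]
reconstructed = [[cols[j][i] for j in range(2)] for i in range(2)]

print(reconstructed)  # → [[2.0, -1.0], [0.5, 3.0]], exactly HIDDEN_W
```

With a nonlinearity or added noise, exact recovery of this kind is no longer immediate, which is why adjudicating specific access requests case-by-case is needed.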
Physical: Secure research environments can be used for auditors given unrestricted white-box access. Auditing personnel could securely be given white-box access to a system by hosting them on-site at the developer's facilities in a secure research environment. This is a common practice in other industries despite the costs of requiring auditors to be physically on-site [225,307,315]. For example, the International Atomic Energy Agency employs over 300 expert inspectors from approximately 80 countries who do on-site inspections of nuclear facilities [129,130]. Compared to API access, secure research environments could allow auditors to access systems more flexibly and efficiently while minimizing the risk that the model is leaked or reconstructed. Safeguards for limiting information leakage through lab employees (such as NDAs) are already common practice and could be adapted for application to on-site auditors. However, it is unclear whether, absent external legal structures, labs could incentivize adherence to protective measures to the same extent as with employees.
Legal: Other industries have developed practices to address the risk of leaks from audits. Across many industries, auditors require privileged access to systems and data in order to perform effective assessments. There are established mechanisms from other fields, such as financial auditing, employed to reduce the risk of leaks. In the finance industry, this manifests in three main ways. First, policies for confidentiality and handling of sensitive information are enforced by formal training and non-disclosure clauses in contracts to hold auditors accountable for violations [89]. Second, there are clear terms of engagement that govern the relationship between auditor and auditee. These typically include specific restrictions on confidentiality expectations tailored to a particular client [15]. While some specifics of auditing are managed through contracting, the Public Company Accounting Oversight Board (PCAOB) requires all registered auditors to adhere to common standards in the US [232]. Finally, auditors can be legally required to avoid conflicts of interest. In the US, this is done through a regime specifying general provisions for auditor independence (such as reporting requirements) outlined in Title II of the Sarbanes-Oxley Act [232].
Provisions are enforced through various agencies, including the Securities and Exchange Commission (SEC), which prevents auditor manipulation or the use of financial information for personal gain [215]. This type of enforcement could allow auditors to be held accountable in a way that reduces the risk of leakage to a level comparable to the risks posed by the developer's own employees. In fact, employees may pose greater risks of sharing tacit knowledge with competitors than auditors do, because developers regularly attempt to recruit AI researchers from competitors.

DISCUSSION
White- and outside-the-box audits offer several benefits to developers. One potential benefit of more rigorous audits for developers is increased credibility by creating the perception that their systems are of higher quality. Meanwhile, white- and outside-the-box evaluations also offer developers greater insight into addressing problems with the systems they build [248]. We discuss in Section 3 how black-box evaluations can only establish when problems exist, while white- and outside-the-box methods can provide a clearer diagnosis to help address them. For example, if a flaw in a system can be attributed to a specific set of components, this enables more targeted interventions to fix it. However, absent legal requirements, developers have strong incentives to limit the access granted to external auditors. Thus far, external audits of state-of-the-art AI systems, when they have occurred, have been black-box (to public knowledge) [13,147,192,221,306], suggesting a failure of existing incentive structures to provide greater access to auditors. Developers are typically reluctant to provide more permissive access to their models and related resources [252]. This may stem from concerns that information collected through white- and outside-the-box audits could be leaked [29] (though this is addressable; see Section 6). Furthermore, such audits could expose a system's lackluster performance, vulnerabilities, or poor risk management processes, which could lead to reputational harm and possibly legal liability for developers [166].
Current black-box audits may set a precedent for future norms. Established norms frequently become "sticky" and entrenched in regulatory regimes [213]. Accordingly, the current norm of black-box audits [13,147,192,221,306] may set the future standard. Furthermore, in policy debates about audits, industry actors have also lobbied for limiting external auditors to black-box access [108]. Without sufficient access and resources, non-industry researchers will struggle to iterate upon methods for more thorough audits [34]. Over time, this could limit or bias public understanding of AI systems. A lack of open research on transparency tools and the view that there is little social benefit from greater transparency are mutually reinforcing. Appendix C expands on how investment, research, and development into auditing techniques and infrastructure can facilitate further progress.
Low-quality audits can be counterproductive. Poor (e.g., black-box) audits can have counterproductive effects: they can increase public or regulatory trust in systems on false grounds, preventing appropriate levels of external scrutiny [93,174]. They also enable safety- or ethics-washing by developers [90,152,330] who make AI systems that contribute to risks without sufficiently investing in methods to address them.

CONCLUSION
We have argued that providing auditors with white-box and thorough outside-the-box access to systems is feasible and allows for more meaningful oversight from audits. We draw two conclusions. First, transparency regarding model access and evaluation methods is necessary to properly interpret the results of an AI audit. Second, white- and outside-the-box access allow for substantially more scrutiny than black-box access alone. When higher levels of scrutiny are desired, audits should be conducted with higher levels of access. Finally, we emphasize that white-box and thorough outside-the-box access are necessary but not sufficient for rigor. Audits can and do fail for many reasons. Without careful institutional design, the incentives of developers and auditors may result in audits that do not consistently align with the public interest [11,25,67,218,251,252]. In Appendix D, we examine additional ways in which the quality of audits can be compromised.

ETHICS STATEMENT
Given the role of rigorous audits in improving accountability and representing public interests, we expect the foreseeable impacts of this paper to be positive. However, audits can fail to benefit the public for a variety of reasons. We discuss these challenges in Appendix D.

A MOTIVATIONS FOR EXTERNAL AUDITS
Audits involve formally evaluating systems to assess risks, compliance with standards and regulations, and other desiderata of interest to stakeholders. High-quality audits from independent, external auditors have been motivated in multiple ways [279]:
• Identifying problems: The most direct purpose of audits is to identify risks from unsound systems or practices.
• Incentivizing responsible development: When individual components of the development process are insufficiently documented, information necessary to contextually assess risks is lost [145]. Audits can assess the sufficiency of internal controls, risk assessment, and documentation [33,149,251,273]. Greater accountability for internal practices incentivizes auditees to spend more effort on risk mitigation and documentation [316], especially when facing penalties or public scrutiny [104,338].
• Increasing transparency: Publicly shared information from audits can help regulators and the scientific community develop a better understanding of system behaviors and limitations.
• Enabling fixes to technical problems: When problems are found during an audit, developers can then work to address them [248]. External audits can also identify risk factors that might merit further guardrails on deployment, closer monitoring of deployed systems, or follow-up studies of user impacts.
• Balancing transparency and security: Keeping systems entirely secret is maximally secure but prevents external scrutiny. Open-sourcing them allows for maximal scrutiny but can proliferate proprietary or misusable systems [276]. Audits offer a middle ground that allows for some transparency and independent risk assessment with high security.
• Providing greater credibility to responsible developers: Passing audits increases trust in developers and their systems. Hence, the public can better calibrate their trust in developers and systems.

B TECHNICAL ASSISTANCE AS A FORM OF OUTSIDE-THE-BOX ACCESS
In Section 5, we discuss how outside-the-box access to information can help auditors conduct audits more effectively. However, for similar reasons, access to technical assistance can also be useful. For example, one resource that auditors will often need, especially for large language models, is computing infrastructure [11,218]. Further, additional technical assistance from the developers' engineers may also help because they have unique practical knowledge of working effectively with their models. This may include assistance with fine-tuning, developing realistic test cases (e.g., [349]), or integrating models with external tools that enhance capabilities to resemble real-world usage [71,165,203,230]. Past experience with AI audits has highlighted the value of technical assistance from developers.
After seeing the final audit report, we realized that we could have helped [METR, (formerly ARC Evals)] be more successful in identifying concerning behavior if we had known more details about their (clever and well-designed) audit approach. This is because getting models to perform near the limits of their capabilities is a fundamentally difficult research endeavor. Prompt engineering and fine-tuning language models are active research areas, with most expertise residing within AI companies. With more collaboration, we could have leveraged our deep technical knowledge of our models to help [METR] execute the evaluation more effectively.
- Anthropic on their audit by METR (formerly ARC Evals) [13]

Allowing developers to arbitrarily influence audits undermines their independence, so incorporating requirements for developers to provide technical assistance into legal auditing frameworks may be difficult and is beyond the scope of this paper. However, auditors may find it helpful if specific requests for technical assistance are answered in good faith by auditees.

C INNOVATION ON AUDITING TOOLS
White-box tools for studying AI systems have long been a topic of technical interest, but research on methods often struggles to keep up with the scale and capabilities of leading AI systems [20]. There are gaps between the capabilities of white-box evaluation tools and what auditors may need from them. More progress on both foundational research and practical tools will be useful for auditors, especially for state-of-the-art large language models because of their unique versatility and complexity.
Basic research: Current methods have provided useful insights. However, developing a detailed mechanistic understanding is not yet possible in state-of-the-art models. More progress in the basic science of neural networks and efforts to study their inner workings will help further research on evaluation techniques. This will require progress both on developing more intrinsically understandable systems and on techniques to interpret trained ones [45,256].
Practical tools: The goal of research on evaluation techniques is to produce methods that can be used effectively off-the-shelf by auditors. In the adversarial attack literature, benchmarks have largely focused on fooling networks with small perturbations to inputs instead of eliciting harm via more realistic features [122,141,263,345]. In the interpretability literature, few benchmarks connected to practical tasks exist, and it is common to judge techniques based on researcher intuition [3,50,77,155,193,256]. Given the increasing scale and complexity of modern AI systems, developing more effective evaluation tools poses a challenge. Fortunately, open-source and API access to advanced AI systems has enabled progress on evaluation tools. However, no technique for benchmarking evaluation tools is more directly informative than application to real systems. Partnerships between researchers and developers can facilitate this.
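The small-perturbation paradigm mentioned above can be illustrated with a one-step, FGSM-style attack on a toy linear model. This is our own construction purely for exposition; real adversarial benchmarks attack neural networks on image or text data, and the gradient is computed by backpropagation rather than read off analytically.

```python
# Minimal FGSM-style sketch on a toy linear "classifier", showing why
# white-box gradient access enables the small-perturbation attacks that
# dominate adversarial benchmarks.

def score(x, w):
    """Toy model: higher score = more confident 'safe' classification."""
    return sum(wi * xi for wi, xi in zip(w, x))

def fgsm(x, w, eps):
    """One-step attack: move each coordinate of x against the gradient.
    For a linear model, the gradient of the score w.r.t. x is just w."""
    sign = lambda v: 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)
    return [xi - eps * sign(wi) for xi, wi in zip(x, w)]

w = [0.7, -0.4, 0.2]
x = [1.0, 1.0, 1.0]
x_adv = fgsm(x, w, eps=0.3)        # each coordinate moves by at most 0.3

print(score(x, w), score(x_adv, w))  # perturbed score is strictly lower
```

A black-box attacker must estimate the gradient from queries instead, which is the gap between attack strength under the two access regimes.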
Secure auditing infrastructure: As discussed in Section 6, granting auditors white-box access to systems via application programming interfaces or secure research environments can reduce the risk of leaks. However, because norms for AI audits have not yet been established, there is little infrastructure for conducting audits securely. For example, efforts like the US National Deep Inference Facility project [313] could make more resources available to auditors. Establishing better tools and protocols is another priority [181]. At the same time, it will be key to establish norms and a regulatory framework around AI audits, as has been done with audits in other industries.
D BEYOND ACCESS: OTHER ASPECTS OF RIGOROUS AUDITS
White- and outside-the-box access is necessary but not sufficient for rigorous audits. Many factors can undermine or degrade the quality of audits. We overview these challenges here.
Poorly-resourced audits: Working with state-of-the-art AI systems and effectively evaluating them requires compute and technical expertise. While developing and commercializing advanced AI systems can be lucrative, searching for problems with them might not be profitable or financially sustainable. Existing audits have largely relied on private funding (e.g., [13,147,192,222]) rather than public funding or other more sustainable, reliable, and diversified sources of funding.
Limitations of technical tools: As discussed in Appendix C, there is a gap between existing technical tools for evaluations and the kind of tooling needed to reliably assess the safety and trustworthiness of advanced systems. Until this gap is closed, audits will be limited in their ability to identify risks.
Narrowly-scoped audits: Audits may omit important evaluations. For example, early audits of GPT-4 focused on risk-related capabilities [147,192,222] but did not appear to include external evaluation regarding other concerns such as robustness to adversarial attacks; potential for misinformation; demographic representation; or impacts on societal welfare, democracy, discrimination, and equality. Another way in which auditing can be narrow in scope is if it only occurs pre-deployment. A "black cloud" system with ever-changing components is even more difficult to evaluate than a black box.
Conflicts of interest: Auditors may face pressure to refrain from insisting on sufficient access or from conducting sufficiently rigorous audits. Auditor conflicts of interest, including collusion with auditees [27,331], are well-known and long-standing problems [106,197]. They stem in part from the typical payment structure of auditing: auditors that produce more favorable evaluations, including due to receiving inadequate or incomplete information from audit targets, are often preferred over other auditors, leading companies to "opinion-shop" [167] for comparatively lax evaluations. This can trigger a race to the bottom in which audits become progressively less rigorous and less informative [11,17]. This type of dynamic could emerge in the absence of adequate regulatory structures. For example, recent audits of state-of-the-art language models from OpenAI [222] and Anthropic [13] were conducted on a voluntary basis by the Model Evaluation and Threat Research organization (METR, formerly named ARC Evals) [192], which maintains a close relationship with both companies, the details of which are not publicly disclosed.
Exclusion of under-represented viewpoints: Not all people agree on what behaviors from AI systems are harmful. As a result, audits can exclude under-represented groups if they are designed in a way that fails to take a wide range of interests into account [22,91,163]. Improving meaningful participation [288,296] and dialogue [76] among diverse stakeholders plays an indispensable role in improving fairness and representation.
Cosmetic compliance: Absent clear legal requirements, companies have an incentive to prioritize cosmetic compliance with good practices [152], a form of cheap talk [90] or virtue signaling [330] in which audit targets create a superficial (yet misleading) appearance of good faith cooperation.
Regulatory capture: While governance regimes that bolster auditing standards and procedures may appear promising, they, too, can be undermined. Studies in the field of organizational science demonstrate that companies respond strategically to interventions, employing a variety of operational, political, and legal tactics [219], including supporting biased research [1]. In its simplest form, companies may selectively disclose audit-relevant information [124,186], enabling them to game outcomes, including in AI audits [25,252]. Meanwhile, more sophisticated and well-resourced companies can shape the underlying audit criteria, metrics, and institutions, including by selecting which auditors have privileged access to information and which do not. Legal sociologists describe this symbiotic relationship between regulators and regulated entities as "legal endogeneity": it is precisely the actors that law seeks to control that end up controlling the law [1,82,83,317]. AI audits are especially susceptible to these dynamics because the relevant standards are currently unclear [67] and audit tools are bespoke and applied inconsistently across different developers and domains [67].

Table 1: A summary of what evaluation techniques are possible with which types of access. A ✔ means that a technique is possible, while an ✗ means it is not. Many levels of grey-box access are possible, but we highlight sampling-probability attacks because they are a common example.

required developers of certain foundation models to share test results with the Federal Government. It also instructed the National Institute of Standards and Technology (NIST) to develop evaluation guidelines for harmful AI capabilities, and it tasked the Department of Energy with developing tools and testbeds to evaluate threats from AI systems to security and critical infrastructure. Companies may also voluntarily

Table 2: White-box interpretability tools help evaluators discover novel failure modes. White-box algorithms and interpretability tools have aided researchers in finding vulnerabilities. Examples

Carson Ezell and Stephen Casper were the central writers and organizers. Charlotte Siegmann, Kevin Wei, Andreas Haupt, and Taylor Curtis contributed primarily to Section 2. Noam Kolt contributed primarily to Section 7 and Appendix D. Jérémy Scheurer, Marius Hobbhahn, and Lee Sharkey contributed primarily to Section 3, Section 4.2, Section 7, Appendix C, and Appendix D. Satyapriya Krishna contributed primarily to Section 4.1. Marvin von Hagen and Silas Alberti contributed primarily to Appendix A and Section 2.3. Qinyi Sun, Michael Gerovitch, and Benjamin Bucknall contributed primarily to Section 2.1, Section 6, and Section 7. Alan Chan, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell offered high-level feedback and guidance.