Rethinking Machine Learning Benchmarks in the Context of Professional Codes of Conduct

Benchmarking efforts for machine learning have often mimicked (or even explicitly used) professional licensing exams to assess capabilities in a given area, focusing primarily on accuracy as the metric of choice. However, this approach neglects a variety of essential skills required in professional settings. We propose that professional codes of conduct and rules can guide machine learning researchers to address potential gaps in benchmark construction. These guidelines frequently account for situations professionals may encounter and must handle with care. A model may excel on an exam but still fall short in critical scenarios deemed unacceptable under professional codes or rules. To motivate this idea, we conduct a case study and comparative examination of machine translation in legal settings. We point out several areas where standard deployments and benchmarks do not assess key requirements under professional rules. We suggest further refinements that would bring the two closer together, including requiring a measurement of uncertainty so that models opt out of uncertain translations. We then share broader insights on constructing and deploying foundation models, particularly in critical domains like law and legal translation.


INTRODUCTION
Artificial Intelligence (AI) has been rapidly evolving. Within the vast AI landscape, foundation models [Bommasani et al. 2021] have emerged as a pivotal class of machine learning models. These are large-scale pre-trained models that serve as the basis for various downstream tasks through fine-tuning or other adaptation techniques. Benchmarking in AI, particularly in the realm of foundation models, has traditionally emphasized achieving high accuracy on a broad range of highly specific tasks, frequently mirroring licensing exams in various professional fields. In fact, many language foundation models are evaluated on the MMLU benchmark [Hendrycks et al. 2020], which is composed of educational and entrance exams, as well as professional licensing exams. Scoring highly on this benchmark and other similarly situated ones inevitably leads to eye-catching results like, "GPT-4 has passed the bar exam" [Katz et al. 2023]. In some cases, such performance also translates to generally helpful capabilities that could improve services and access to justice in legal settings; see the discussion by Bommasani et al. [2021]. However, just as licensing exams do not necessarily reflect the nuances of actual practice [Curcio 2002], benchmarks might also offer an overly rosy picture of model capabilities [Raji et al. 2021, 2022].
Within professional fields like law, specific ethical codes and professional conduct guidelines impose additional constraints, assessing a different skill set than what is evaluated in licensing exams [Curcio 2002; MacCrate 1992]. For example, while part of the licensing exam process might test whether a candidate can answer multiple-choice questions about what is written in the professional rules, it does not test a candidate's ability to abide by them, as Curcio [2002, at p. 380] notes. These professional and ethical rules are grounded in real-world scenarios and challenges that practitioners encounter. They also describe the most egregious scenarios, violations of which could result in professional repercussions, including ejection from the profession. 1 In this way, they offer a window into the complexity of decision-making that arises in professional settings. Consequently, while a machine learning model might shine on a benchmark based on a professional exam, it could falter when faced with situations encapsulated by these rules.
This points to a need to rethink and expand benchmarking efforts for machine learning models, using these professional rules as guiding lights. To highlight this gap and propose potential solutions, we focus on the more restricted domain of automated machine translation. Current benchmarks, while comprehensive, often fall short of capturing the intricate demands of professional translation standards, especially in legal settings. Our case study dives into the historical use of general translation APIs, especially within the U.S. legal system. It underscores the pitfalls of over-reliance on machine translation APIs in high-stakes contexts and contrasts this with the rigorous standards expected of human translators in similar settings. Despite their problems, benchmarks motivate enhancements in model performance, often resulting in improved scores on those same benchmarks. Aligning these incentives more closely with professional standards, we argue, is a desirable property. This includes consistently assessing mechanisms for models to express uncertainty as part of standard benchmarks. This incentivizes improving the ability of a model to opt out of high-stakes and low-certainty scenarios, a key skill of most professionals. And it means providing specific tests for handling the situations outlined in the code of conduct for legal interpreters (such as how to handle filler words, double negatives, and much more), or other professional codes of conduct.
We suggest that the lessons from our machine translation case study can be generalized to the broader foundation model context. Furthermore, as foundation models continue to advance, displaying a heightened ability to follow rules provided via in-context learning [Bai et al. 2022], we suggest that it may be worthwhile to incorporate professional rules directly into their training and evaluation, while providing users with more transparency into otherwise stochastic and inconsistent decisions.
The paper proceeds as follows. First, in Section 2, we give historical background about the opportunities and risks of machine translation in legal settings, along with a comparative perspective noting that recent trends in general foundation model uses mirror the narrower machine translation setting. Second, in Section 3 we provide a comparative study of machine translation benchmarks and codes of ethics, noting potential gaps in the former that make evaluations misaligned with professional rules. We also provide a detailed set of recommendations for future benchmark creators, as well as both qualitative and quantitative supplements as demonstrative examples. Finally, we conclude with a broader discussion of the implications of this perspective and generalizable lessons for the broader foundation model setting.

CASE STUDY: GENERAL-PURPOSE AUTOMATED MACHINE TRANSLATION AND THE UNITED STATES LEGAL SYSTEM
We turn our attention to the history of general-purpose automated machine translation Application Programming Interfaces ("APIs"), like Google Translate, to provide context for our case study. General-purpose machine translation APIs are those that any user can access easily to translate any content without indicating restrictions on scope (for example, general-purpose machine translation APIs do not claim to only work well on legal text). The legal system presents both considerable potential for the beneficial application of these APIs and a domain in which their failure could lead to dire consequences. It serves as a salient example of how general systems, despite performing satisfactorily in some tasks, can be harmful without clear delineation of performance boundaries.

The Benefits of General-purpose Machine Translation Systems In Legal Contexts
The allure of using state-of-the-art machine translation is understandable. Under budget and time constraints, a system that seems to perform sufficiently well as a whole can significantly improve the quality of government services, help access to justice, and even expand international trade [Brynjolfsson et al. 2019]. In California, court systems try to provide services in every language that residents might need [Cuéllar 2019]. Yet, there is a shortage of qualified translators to meet demand for multilingual services [Dolan 2017].
The federal government, too, has mandates to provide multilingual content to millions of people across hundreds of languages.
Executive Order 13166 and the Justice Department's 2011 Renewed Commitment Memo, for example, require agencies to translate their web content.
In response to these mandates, many turn to automated systems, such as Google Translate, to make their websites multilingual. The General Services Administration (GSA) in the United States points out: "Many Web managers are tasked with installing the 'magic button' solution on their websites to make them multilingual and comply with current mandates, such as Executive Order 13166 and the Justice Department's 2011 Renewed Commitment Memo. So they turn to Google translate or some other automated system" [Godfrey 2012]. This pressing need underscores the importance of high-quality machine translation systems, as they could potentially help with regulatory compliance and accessibility in overburdened administrative settings. However, these deployments in legal contexts entail risks that offer crucial lessons for the safety of general foundation model API deployments today, which we turn to next.

The Risks
While it may seem like a huge benefit to use automated machine translation to make government websites multilingual in compliance with regulatory requirements, automated translation tools can have critical failures when providing important information to the public. The Department of Health and Human Services and the Agency for Healthcare Research and Quality commissioned a study and found limited accuracy of Google Translate in relevant agency settings [Balk et al. 2012]. And the General Services Administration pointed out that government websites giving critical advice with unknown accuracy can lead to potential harms [Godfrey 2012]. To illustrate exactly how such failures can lead to real-world harms, we examine a few cases and scenarios.
Immigration. In the United States, there are no state-provided translators for filing certain types of immigration paperwork. And language access in American immigration proceedings remains a fundamental problem [Benton 2019], leading some to turn to machine translation [Schroeder 2017].
This practice can lead to failures, such as discrepancies in paperwork that can make an asylum claimant appear deceitful. Consider the following story. A young woman fled Russia after becoming the "victim of egregious racial violence" [Schroeder 2017, p. 320]. She did not have an attorney, nor a professional translator, but all documents filed with United States Citizenship and Immigration Services ("USCIS") must be filed in English. So she resorted to using Google Translate to complete the required forms, "resulting in [a] number of mistranslations and incomplete answers to questions that later contributed to a finding by USCIS that her testimony was not credible" [Schroeder 2017]. It was only after hiring legal representation that the mixup was cleared and she was granted asylum. This young woman's story is a concrete example of both the allure and the harm that machine translation can have. It is also a repeated lesson that has been learned in the immigration system.
More recently, asylum seekers from Afghanistan relied on machine translation software for their applications. In one case, machine translation software incorrectly translated "I" to "we" in Dari [Rogin and Corkery 2023]. This led to a discrepancy between the initial interview of the asylum seeker and the contents of their application document. Consequently, the judge dismissed the case due to this inconsistency. Similarly, machine translation software reportedly could not properly translate military ranks in Pashto and Dari, resulting in problematic errors in the asylum review process [Rogin and Corkery 2023]. As advocates pointed out, these mistakes may seem harmless in some contexts, but in asylum claims even such small inconsistencies can jeopardize the claim.
These key mistranslations are not a one-off scenario. In a recent study, researchers pointed out that legal translation can critically misrepresent key terms. Consider, for example, the sentence: "The trial court enjoined the violence but specifically exempted peaceful picketing from the scope of the injunction." When using MT to translate this sentence into Kannada, the MT system instead produced something more like "The trial court ordered the violence but exempted peaceful picketing from jurisdiction" [Prabhu et al. 2021].
Police Stops. There have been instances where police departments in the United States used machine translation during stops to ask for vehicle searches. This practice has led to confusion and Fourth Amendment violations due to inaccuracies in translations. For example, in United States v. Ramirez-Mendoza [2021], a police officer stopped a motorist and asked if there was anything illegal in their vehicle, to which the motorist answered no. The officer used Google Translate and verbally queried the app to translate, "would like to search the car to make sure, okay?" The API used the word "registrar" for the verb "to search." There was a disagreement among the parties as to whether this was a proper translation. A linguist for the government translated the API's output as, "I like to search the car to make sure that it is all right." On the other hand, the motorist stated that they did not understand the translation and thought it meant, "I like check the cars, revise the cars." At the time they did not understand that it meant the officer would search the car and simply answered "uh-huh." The court found that the government did not meet the burden of proof to show that this was in fact an accurate translation that would not have caused such a miscommunication, resulting in a successful Fourth Amendment challenge by the motorist.

Evidence and use in courts. Due to potential mistranslations, courts often disallow machine translations as evidence in proceedings. For example, in [ABC Corp. v. The Partnerships and Unincorporated Associations Identified on Schedule A 2022], plaintiffs tried to rely on automatic translations by Microsoft Edge to make a claim of mass infringement by a Chinese-language website. Because no one with proficiency in Chinese reviewed the translations, the court found that this evidence was inadmissible under the "well-established rule that a document in a foreign language is generally inadmissible unless accompanied by a certified English translation." The court also noted that there were clear issues with the credibility of the evidence since some translations seemed nonsensical in nature. The court noted, for example, that "it does not know what the emphasized portion of the following passage means: 'TWO NEW BRANDS FILED! A POPULAR ANIMATION AND AN ILLUSTRATION WORK, TAKE ADVANTAGE OF THE FACT THAT IT IS NOT FROZEN AND QUICKLY REMOVE THE SHELF TO WITHDRAW CASH.' ... What these statements mean is not apparent and has not been explained. These statements could indicate either that the translation is unreliable or that the translation is reliable, but the website is ambiguous in context." In other settings courts themselves have relied on machine translation to interpret the law, with potentially fraught results. For example, in Avelino Cruz Martinez v. United States [2016], the majority opinion used Google Translate to interpret a treaty, while the dissent points out that the translation the majority relies on is unreliable.

Machine Translation: The Tip of the Iceberg
The increasing prevalence of general foundation model APIs reflects the benefits and risks inherent in general translation APIs.
Filing for immigration in the United States is a complex and challenging process, often necessitating the services of a lawyer. However, legal representation can be costly. Consequently, just as people turned to tools like Google Translate, the public is using general foundation model APIs for assistance with visa applications and other immigration matters (in addition to using them for the same translation tasks as previously described). For instance, some individuals have published guides on how to employ ChatGPT in drafting an O1/EB1 reference letter [Mor 2023]. ChatGPT has also been used for other legal purposes, leading to a class action litigation over the potential unauthorized practice of law [Faridian v. DoNotPay, Inc. 2023]. News headlines like, "Will AI chatbots power the future of police language translation?" [Seki 2023] are becoming increasingly common. Similar to general translation APIs, the use of general foundation model APIs for tasks with significant legal implications can be advantageous when performed effectively, but could also result in high-stakes failure modes if the task falls outside the model's capabilities. Therefore, it is crucial to glean insights from the experiences with machine translation APIs to improve the broad deployment of these models.
This phenomenon can be attributed to the benchmarking uncanny valley. Models demonstrate proficiency over a large, though unspecified and undefined, task space. Benchmarks indicate that these models can handle a wide range of tasks. But their capability is not universal. Models give the impression of being sufficiently competent to execute specific tasks in a generalized and satisfactory manner, but they exhibit unique failure patterns in critical contexts.

WHAT CAN MACHINE LEARNING BENCHMARKS LEARN FROM PROFESSIONAL RULES AND CODES OF CONDUCT?
Certified court interpreters are bound by a code of conduct with highly specific rules about how to handle various situations they might encounter. These rules describe everything from conflicts of interest to how to handle filler words to ensure accurate translations. Machine learning models can draw lessons from how courts regulate interpreters through both certifications and the code of ethics. In this section we first examine the legal interpreter's code of ethics, highlighting key aspects that directly contradict the structure of machine translation benchmarks. We then highlight suggested changes to future benchmarks and machine translation challenges to better align with the code of ethics, accompanied by demonstrative experiments.
Interpreter certification can be viewed as a benchmark that assesses their capability to perform a specific task within a particular domain. For example, in New York the examination might include multiple-choice questions like the one shown in Figure 1. Attaining a certain performance level on this benchmark might imply the model's suitability for accomplishing this specific task, but it is not necessarily the same as certifying a human for the same task. There is an underlying assumption that humans can generalize and follow additional rules, like codes of conduct or ethics. The exam also does not certify interpreters to practice law or translate in other contexts, nor does it cover all the regulations an interpreter must follow. In this way, though a machine translation model might pass this exam, passing does not certify that the model can follow all these other external guidelines or has these additional capabilities. As a result, a more accurate picture of model performance in a given area of translation, though still not dispositive, would consider these other codes of conduct.

Figure 1 (example item). A statement in English is presented followed by four statements in Spanish. For each question, select the option that most closely matches the translation into Spanish.
The speaker cleared his throat before starting his speech.
A. El orador carraspeó antes de comenzar a su discurso. [The speaker cleared his throat before starting his speech.]
B. El orador se aclaró la garganta después de dar comienzo a su discurso. [The speaker cleared his throat after starting his speech.]
C. El orador limpió su garganta antes de dar comienzo a su discurso. [The speaker cleaned his throat before starting his speech.]
D. El orador aclara la garganta antes de dar comienzo a su discurso. [The speaker clears his throat before starting his speech.]
So, how do the evaluation process and ethical guidelines for court interpreters differ from machine translation evaluations? We summarize a subset of rules in one such code of conduct for California court interpreters in Table 1 and highlight some key differences here.
Perhaps one of the most important rules, and one of the most omitted from machine translation benchmarks and products, is that the code of ethics mandates that interpreters should not attempt to speculate on a translation. Instead, they should ask for clarification or provide notice if there are words they do not comprehend [Judicial Council of California Court Interpreters Program 2013]. In effect, interpreters must be aware of their limits, abstain from interpreting when they've exceeded their understanding, and seek clarification from (or notify) the court and their client when they've reached this point [Judicial Council of California Court Interpreters Program 2013]. We term this the abstain-and-notify principle. Google Translate and most other machine translation APIs, to our knowledge, never abstain from answering any query nor provide a confidence metric in their output. While the "always on" nature of these models is beneficial for identifying new use-cases, it also exposes them to failures, as previously discussed. Numerous interventions for responsible AI deployments, such as Model Cards [Mitchell et al. 2019], Datasheets [Gebru et al. 2021], Reward Reports [Gilbert et al. 2022], Holistic Assessments [Liang et al. 2022], Internal Audits [Raji et al. 2020], and others, have sought to improve transparency into the limitations of language-based foundation models (including translation models). And they are important mechanisms for experts to understand the scope of operation. Sometimes, this information is even prominently displayed to users, such as on the website for a machine translation modeling effort called No Language Left Behind (NLLB) [Costa-jussà et al. 2022], which presents the estimated proficiency of the model for a particular language (based on the language's performance on an accompanying benchmark).
Yet, these documentation mechanisms are fundamentally different from what is required of legal interpreters. Legal interpreters must notify their clients of their uncertainty and potential for mistakes during the normal course of their work, not via ex ante disclaimers, which might be the equivalent of documentation mechanisms in machine learning. We might consider an alternative deployment mode that emphasizes runtime or live documentation, offering assessments at the per-input level. This would mirror a translator adhering to the abstain-and-notify principle or refraining from operating in areas that they have not been certified for. However, the ability to recognize the extent of their knowledge and opt out is an ongoing research challenge and not prevalent in machine translation or other general foundation model deployments. And this capability is not often assessed as a component of standard benchmarks. We argue for this ability to become a standard benchmarked skill for general systems, reflecting professional rules like those for legal interpreters.
In Appendix A we provide extended explanations with demonstrative experiments on how this might be incorporated into benchmarks. Generally, however, we suggest that benchmarks allow models to abstain from answering a question. The goal of benchmark users is then to maintain a given high level of performance for their models, while minimizing the number of prompts that the system abstains from answering. For example, one might require that 90% of accepted translations must have a chrf++ score [Popović 2017] of > 70, and the model creator should then aim to maintain that 90% target even if it means abstaining from most queries. Over time, the goal of "leaderboard climbing" is to maintain the high level of performance while expanding the number of prompts that are accepted by the model. We call the general class of runtime mechanisms that determine whether or not to abstain from a prompt "Know-What-You-Know" (KWYK) checks. In Appendix A, we conduct experiments demonstrating such approaches for the legal translation task, noting that this is a non-trivial research problem. Nonetheless, these KWYK checks are a path forward for aligning model deployments with real-world professional codes of conduct. This is not the only mechanism by which benchmarks could be aligned with rules of professional conduct. Other rules could be incorporated as explicit tests in the benchmark. For example, consider English to Ukrainian translations in which features of the source (such as speaker errors) must be rendered faithfully under the interpreter rules: ensuring that metrics account for these potential errors is important for aligning with the real-world expectations of judicial interpreters. As a small demonstrative experiment for these additional rules and metrics, we create 100 prompt examples across a number of different rules covered in Table 1 and test whether gpt-3.5-turbo complies with the rules by default. These can be thought of as small unit tests for certain professional rules.
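To make the abstention-aware evaluation concrete, the following is a minimal sketch (not the harness used in our Appendix A experiments) of a benchmark loop in which a hypothetical translate_or_abstain system may decline a query; acceptability is scored with the chrF++ threshold described above.

```python
# Minimal sketch of a benchmark loop that permits abstention.
# `translate_or_abstain` is a hypothetical system under test: it returns
# None to abstain, otherwise a candidate translation string.
from sacrebleu.metrics import CHRF

chrf = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

def evaluate(pairs, translate_or_abstain, chrf_threshold=70.0):
    accepted, acceptable, abstained = 0, 0, 0
    for source, reference in pairs:
        hypothesis = translate_or_abstain(source)
        if hypothesis is None:
            abstained += 1
            continue
        accepted += 1
        score = chrf.sentence_score(hypothesis, [reference]).score
        acceptable += int(score > chrf_threshold)
    return {
        "abstain_rate": abstained / len(pairs),
        # Acceptability is computed only over the queries the system served.
        "acceptability_rate": acceptable / max(accepted, 1),
    }
```

Under this kind of scoring, "leaderboard climbing" means lowering the abstain rate while keeping the acceptability rate above the agreed target (90% in the example above).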
Unit Preservation (En-Es). Court interpreters must preserve units of the original text. For example, they should not convert pounds to kilograms. We test this by generating (assisted by GPT-4) 100 English prompts that contain units of measurement, such as, "This room is about 20 square meters in size." We then prompt gpt-3.5-turbo to translate this into Spanish and test that the unit remained the same as in the English prompt.
Filler Word Preservation (En-Es). Court interpreters must strive to preserve filler words in the original text, potentially to help convey a sense of uncertainty from the speaker. We generate English text with "uh" and other filler words inserted. We then test if translations contain the same amount of filler words. We find that 77% of examples have all filler words preserved, while others are made more formal with the removal of filler words.
Idiom Identification (Zh-En, En-De). Court interpreters should strive to "render [idiomatic expressions] using an equivalent idiomatic expression in the target language"; otherwise, if they are not certain of either the meaning of an idiom or what its equivalent would be in the target language, they must identify and spell out the expression to the judge to deliberate on it [Judicial Council of California Court Interpreters Program 2013]. For this, we sample 100 examples from two datasets of idiomatic expressions, from Chinese to English [Tang 2022] and from English to German [Fadaee et al. 2018]. We then prompt gpt-3.5-turbo for both the translation and the identification of any idiomatic expressions in the original text (prompt available in Appendix C). We found that no English idioms were flagged when asking for English to German translations, but 38% of sampled Chinese idioms were flagged when asking for Chinese to English translations.

Error Preservation (En-Es). Even when the speaker makes a mistake, the translation should preserve the mistake according to the ethics guides. We create 100 English sentences with mistakes based on homonyms and other grammatical errors. When prompted to translate these into Spanish, the system tends to correct mistakes in translations, only preserving errors 12% of the time.
Word Repetition Preservation (En-Es). Even when a speaker repeats words, this repetition should be preserved. We create 100 English sentences with repeated words. Then we check whether the Spanish translation preserves the repeated word. Only 10% of sentences preserved the repeated word; other translations silently removed the repetition.
Clarification (Formality, It, En-Es). Pilault et al. [2023] provide a dataset of situations where it might not be clear how to translate a given text into a target language. We select two of these and test whether gpt-3.5-turbo raises a clarification flag. We examine whether it clarifies the level of formality that it should use (in languages like Spanish one might use a formal "you" ("usted") or an informal "you" ("tú")). Similarly, it may need to clarify whether to use the masculine or feminine form of the word "it" depending on what the word is referring to. We find that it never makes use of the clarification flag, but as we will note later, we do not test a comprehensive set of prompts, which may change this result.

Double Negative Preservation (En-Zh). Finally, the professional rules for court interpreters state that double negatives should be preserved. This is important, for example, for witness statements by speakers who may be trying to avoid perjury but want to evade the question through the use of double negatives. We test English to Chinese translation with 100 double negatives and find that only around 61% of the double negatives are preserved in the translation.
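As an illustration of how one of these rule-specific "unit tests" might be implemented, the sketch below checks filler-word preservation by counting filler tokens in the English source and in the Spanish translation. The filler inventories and the translate function are hypothetical stand-ins for this sketch, not the exact prompts or checks used in our experiments (see Appendix C).

```python
import re

# Hypothetical filler inventories for this sketch; a real test would need a
# carefully curated, language-specific list.
EN_FILLERS = {"uh", "um", "er", "like", "you know"}
ES_FILLERS = {"eh", "este", "pues", "o sea"}

def count_fillers(text, fillers):
    # Normalize to lowercase word tokens, then count filler occurrences.
    tokens = re.findall(r"[a-záéíóúñü']+", text.lower())
    joined = " ".join(tokens)
    return sum(joined.count(f) for f in fillers)

def filler_preservation_test(source_en, translate):
    """Pass if the Spanish translation keeps at least as many fillers as the source."""
    translation_es = translate(source_en)  # hypothetical MT call under test
    return count_fillers(translation_es, ES_FILLERS) >= count_fillers(source_en, EN_FILLERS)
```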
Limitations. These are meant to be demonstrative "unit tests" of a particular deployment approach (in this case, a prompted language model). We seek only to provide some inspiration for benchmark creators on how they might modify assessments to include professional rules. Further variations on prompts, techniques, and datasets might change the results we provide here and improve the overall ability of gpt-3.5-turbo to align with the court interpreter rules. We provide more details in Appendix C.
There may, of course, be cases where deployed translation services should follow a different set of rules that would remove such filler words, but it is nonetheless important to create mechanisms for transparency, insight, and control over these behaviors, rather than allowing the stochastic processes of machine translation to pick at random among many options. If a model is silently following some rule (for example, in some cases it preserves filler words and in others it does not), it is better to identify this and surface it to the user explicitly, giving them control and transparency. In such cases, perhaps the preferred default could adhere to professional rules in a given domain with an option to modify this. We discuss this further in Section 4.

ALIGNING MODELS WITH PROFESSIONAL CODES OF CONDUCT
The lessons from machine translation and codes of ethics for judicial interpreters are more general than the machine translation setting. Codes of ethics provide valuable insights for how we might design tests to identify failure modes of much more general systems, like foundation models. They provide useful guidelines for human evaluations and annotations, but also for guiding the models themselves via so-called "constitutional AI" mechanisms.
Why is it important to align with codes of ethics or professional rules? While professional rules and codes of ethics are not a panacea, they can help machine learning developers identify potential failure modes that licensing exams simply won't capture. For example, while uncertainty quantification and the ability to clarify uncertainty are sometimes evaluated in isolated benchmarks, they are far from standard across benchmark settings, especially in machine translation. We found that few production systems explicitly reveal an easy-to-understand mechanism to users about potential uncertainty (though ChatGPT occasionally does provide free-form explanations when it disambiguates between multiple meanings in a translation). And few benchmarks assessing holistic language model performance include uncertainty quantification or other KWYK checks as part of the assessment (e.g., [Liang et al. 2022]).
Which rules and codes? Of course, there are a number of different ethical codes and professional rules depending on the deployment context. The reality is that general-purpose systems should capture all of these because they can potentially be used in a general way. It is possible that some will conflict with one another, but then it is important to resolve this conflict ex ante and provide transparency as to which rules the agent is being constrained to. Building out this rule-set is a much bigger project, however, beyond the scope of this paper. One benefit of general-purpose foundation models is that we can provide them with the codes of conduct and rules directly. Consider the following example:

Input: i dont think i want to go to the movies, ty tho
ChatGPT-4 (Aug. 3 version): No creo que quiera ir al cine, gracias de todos modos.
Google Translate: No creo que quiera ir al cine, aunque

The Google Translate version does not pick up the informal abbreviations, like "ty" (thank you), and the GPT-4 translation makes the writing much more formal with capitalization and a lack of texting abbreviations. But we can then directly provide the code of ethics rule that states: "Maintain the register or level of language from the source text in the target language. The interpreter's task is to accurately convey factors like word choice, style, and tone, without modifying them to cater to the comprehension level of the listener. Do not intervene or express opinions about the listener's ability to understand." When this rule is directly provided in the context, GPT-4 changes its translation to "no creo que quiera ir al cine, gracias de todas maneras." Similarly, consider another example:

Input: i dont not understand
ChatGPT-4 (8/3): no entiendo [I don't understand.]
Google Translate: no entiendo [I don't understand.]

In this case, the double negative in the English prompt is removed in the translated Spanish. But if you modify the prompt for GPT-4, giving it the rule on preserving double negatives and telling it directly to follow this rule, you get the more correct and aligned translation, "No no entiendo." (I don't not understand.) As we suggest earlier, it may be possible to catalog the professional rules and codes of conduct and provide them to the model to provide consistency in its behavior across particular domains. This can also help provide more transparency into why certain decisions were made, rather than relying purely on stochasticity and data biases.
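A minimal sketch of this idea is below: the relevant rule is prepended to the translation request as a system instruction. The RULES text is paraphrased from the California interpreter standards discussed above, and call_chat_model is a hypothetical wrapper around whatever chat-completion API is being used; we are not claiming any particular vendor interface.

```python
# Hypothetical in-context "rule following" for translation prompts.
RULES = {
    "register": (
        "Maintain the register or level of language of the source text in the "
        "target language; preserve word choice, style, and tone without "
        "adjusting them for the listener."
    ),
    "double_negative": "Preserve double negatives exactly as spoken.",
}

def build_messages(source_text, target_lang, rule_keys):
    # System message carries the professional rules; user message carries the query.
    system = "You are a court interpreter. Follow these rules strictly:\n" + \
        "\n".join(f"- {RULES[k]}" for k in rule_keys)
    user = f"Translate into {target_lang}: {source_text}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]

# Example usage (call_chat_model is a hypothetical API wrapper):
# messages = build_messages("i dont not understand", "Spanish", ["double_negative"])
# translation = call_chat_model(messages)
```

Cataloging rules in this explicit form also makes the constraints applied to a given translation inspectable by the user, rather than leaving them implicit in the model's training data.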
How does this generalize to other professional rules? Not every set of professional rules is directly applicable to benchmarking efforts. Nonetheless, as general-purpose AI systems attempt to cover all professions, it is important to consider parallels when possible. For example, the Code of Ethics for Educators put forth by the National Education Association states that educators "[s]hall not unreasonably deny the student's access to varying points of view" and they "[s]hall not deliberately suppress or distort subject matter relevant to the student's progress" [The National Education Association [n. d.]]. This directly parallels questions about how model responses should be aligned with different viewpoints and values. It also reflects risks around models outputting "hallucinated" or distorted facts. Similarly, the International Federation of Accountants in their Code of Ethics for Professional Accountants states that "[w]here appropriate, a professional accountant shall make clients, employers or other users of the accountant's professional services aware of the limitations inherent in the services" [International Ethics Standards Board for Accountants 2009]. This reflects similar standards on abstaining from responses in legal interpretation, though it allows more flexibility in the presentation of limitations. And only recently have researchers begun assessing models' likelihood of hallucinating fake case citations and facts, despite this already resulting in professional sanctions against attorneys who used foundation models as part of their legal work [Dahl et al. 2024]. Finally, refraining from speculation under high uncertainty is an essential component of most professional standards, whether in translation or elsewhere. We suggest that KWYK checks should be assessed as part of standard general-purpose benchmarks in a more integrated way, as opposed to separate standalone assessments of model uncertainty quantification.

RELATED WORK
We provide an extended related work in Appendix B but here briefly examine some relevant core research. Some have pointed to general challenges in evaluation [Ethayarajh and Jurafsky 2020; Raji et al. 2021, 2022]. Other works have also looked to the law to inform responsible model deployments [Henderson et al. 2022; Nay 2022]. And yet other works have pointed to a growing need for examining heterogeneity and uncertainty in model performance [Dixon et al. 2018; El-Yaniv et al. 2010; Geifman and El-Yaniv 2017; Liang et al. 2022; Simig et al. 2022; Smith et al. 2022]. And a growing body of work has called for improved evaluation of AI-supported legal services [Hagan 2023; Kapoor et al. 2024; Linna 2021; Linna Jr 2021].
In complying with legal interpreter professional codes of conduct, we identify "Know-What-You-Know" (KWYK) approaches as a potential path forward. A wide range of approaches have been tested in isolation, but few (if any) are publicly acknowledged as being deployed in general-purpose foundation models or automated machine translation systems. Nonetheless, a large body of work may be relevant in considering potential KWYK checks [Ahuja et al. 2022; Chelba et al. 2020; Cobbe et al. 2021; Kadavath et al. 2022; Li et al. 2022; Srinivasan et al. 2021; Ye et al. 2021; Zhu et al. 2022].
Finally, in recent years, machine translation has been used in legal systems, with mixed results. Others have also explored this to some extent [Ali 2016; Elnaggar et al. 2018; Muravev 2020; Prabhu et al. 2021; Vieira et al. 2021; Wahler 2018], though not in the context of comparative lessons for machine learning benchmarks nor in the context of professional rules.

CONCLUSION
Overall, in this paper we argue that machine learning benchmarks, particularly for general-purpose systems like machine translation and foundation model APIs, should look to professional codes of conduct for inspiration in assessing particular behaviors. As general-purpose systems are more frequently utilized in high-risk settings like the law, it will be important to leverage existing sources of knowledge to identify pitfalls and ensure deployment structures that account for particular failure modes. In our case study, for example, we note that a key requirement for legal interpreters is not guessing at translations. Yet, machine translation APIs like Google Translate are "always on," taking a guess even when it could lead to harms. By adjusting how we benchmark systems, we can begin to align models more closely with the high standards of conduct expected of people performing similar tasks. This is obviously not the whole answer, but it is another resource for improving machine learning evaluation.

A HOW CAN WE INCORPORATE UNCERTAINTY INTO BENCHMARKS?
In the main text, we discuss how to express uncertainty to the user. In this section we describe a demonstration of how such uncertainty mechanisms can be incorporated into benchmarks, modifying existing open-source translation tools to account for uncertainty.

A.1 Know What You Know (KWYK) Checks
We define a KWYK check as follows. A model takes as input a user query or prompt x and generates a response y. Given a reference ground truth y*, assume there exists an acceptability metric A(x, y, y*) ∈ {0, 1} that assesses whether a model output is acceptable. An acceptability metric can include several sub-metrics, incorporating aspects of safety, accuracy, privacy, etc. The goal is to assess whether the model's output is acceptable for presenting to the user. For the purposes of the following discussion we will only consider metrics of acceptability from the perspective of accuracy or quality. For example, in the multiple choice setting this is simply whether the model output the correct answer (1) or not (0). In the translation setting this might be whether the chrf++ score is over some threshold (e.g., 1 if chrf++ > 50). The KWYK check's goal is to abstain or notify if the model is likely to yield an unacceptable output. This reflects the constrained optimization problem:

$$\max_{a} \; \sum_{i=1}^{|D|} a_i \quad \text{subject to} \quad \frac{\sum_{i} a_i \, A(x_i, y_i, y_i^*)}{\sum_{i} a_i} \ge C,$$

where $a_i = 1$ indicates that the KWYK check outputs a deterministic decision to serve the query for a given input $x_i$ in the dataset $D$, and $C$ is some threshold for acceptability. That is, we want to maximize the number of times we accept a user's request while making sure that the model remains above some average acceptability C for those queries that the KWYK check does not abstain from.
The KWYK check can take many forms. It can be a calibrated model that takes only the input and predicts the model's likely level of performance. This acts as an ex ante KWYK check that never queries the true model and could be useful when there is a giant, computationally expensive model that is only acceptable for some small portion of the queries it will receive. A smaller KWYK check can be trained to determine whether to serve the query to the model in the first place. This is closer to performance prediction work [Ye et al. 2021]. An ex post KWYK check might act as a verifier [Cobbe et al. 2021], taking the model's output and checking the result. Alternatively, it could be a mechanism that quantifies uncertainty in a calibrated fashion [Kadavath et al. 2022], abstaining in a way that maintains a certain level of accuracy. Or it could combine all of the above methods. The goal is to abstain for the minimal amount of time necessary to maintain the implicit agreement of performance for users, or to notify them when this level of performance is unlikely to be achieved for a given input.
We note that KWYK checks may not be well-calibrated throughout the state space, so it may also be important to include an assessment of how out-of-domain an input is as part of the KWYK check itself. That is, when the KWYK check cannot make a reliable assessment of an input because it is too far out of distribution, it may be wiser to abstain rather than rely on the KWYK check.
Metrics. We suggest two initial metrics in this paradigm: the acceptability rate and the abstain rate. The acceptability rate should be maintained above some threshold C while lowering the abstain rate. The threshold C is dependent on the sensitivity of the task. We note, again, that there are many similarities between this framework and prior work that we will discuss later. Nonetheless, this is not a common feature of benchmarks, and our goal is to emphasize the need for more integration between KWYK checks (broadly defined), benchmarks, and deployments. In any structure of this task, however, we emphasize that this paradigm calls for at least one additional data split: the calibration set. The calibration set is used to fit any KWYK check. We define a KWYK check as something that can and should be deployed with the model, so the combined assessment must fit the KWYK check to held-out data to prevent a false picture of KWYK check performance.
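As a concrete, simplified instance of this setup, the sketch below assumes each calibration example already has a KWYK confidence score and a binary acceptability label. It chooses the lowest confidence threshold whose conditional acceptability on the calibration split stays above C, and then reports the abstain and acceptability rates that threshold yields on held-out evaluation data. This is a sketch of the paradigm, not the exact procedure used in our experiments.

```python
import numpy as np

def fit_kwyk_threshold(cal_conf, cal_acceptable, C=0.7):
    """Pick the smallest confidence cutoff whose accepted subset has
    mean acceptability >= C on the calibration split."""
    cal_conf = np.asarray(cal_conf)
    cal_acceptable = np.asarray(cal_acceptable)
    for t in np.sort(np.unique(cal_conf)):
        served = cal_conf >= t
        if served.any() and cal_acceptable[served].mean() >= C:
            return float(t)
    return float("inf")  # abstain on everything if C is unattainable

def report(eval_conf, eval_acceptable, threshold):
    # Apply the calibrated threshold to held-out data and report both metrics.
    eval_conf = np.asarray(eval_conf)
    eval_acceptable = np.asarray(eval_acceptable)
    served = eval_conf >= threshold
    return {
        "abstain_rate": 1.0 - served.mean(),
        "acceptability_rate": eval_acceptable[served].mean() if served.any() else float("nan"),
    }
```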
Incentives. Importantly, the structure of a KWYK check evaluation and objective changes benchmarking incentives. As depicted in Figure 3, the current incentive structure emphasizes improving tasks on average across all benchmark tasks relatively evenly. A shift toward a benchmark with KWYK checks would instead incentivize hitting a particular level of performance for some tasks, while abstaining elsewhere. Improvements would seek to maintain a high level of performance while adding new capabilities at that level, rather than evenly improving performance throughout the distribution.
Limitations and future work. Future work can improve upon the exact formulation of the metrics and paradigm. Our goal is mainly to emphasize the importance of this setting through a comparative study and to provide an initial proposed structure for benchmarks to consider. For example, benchmark creators can explore further the trade-offs between acceptability rates and abstain rates, as well as improved ways to select data splits. We note that just as there are trade-offs between Type I and Type II errors, there are trade-offs between acceptability and abstention. While our goal is to hit a threshold C after which one can look to lower the abstain rate, other situations may call for a combined metric closer in spirit to an F1 score. Furthermore, we do not provide a strict suggestion for the user experience in how KWYK checks are presented. For example, if KWYK checks are presented as warning labels but are poorly calibrated, they might give users false confidence in model outputs. Examination of how KWYK checks should be used both encompasses existing human-computer interaction research and poses interesting future research questions.

A.2 Demonstrative Experiments
In this section, we provide a demonstrative experiment showing how one might consider integrating KWYK checks into benchmarks. This is not meant to be a panacea, but rather a small example of how results could be reported in a way that encourages KWYK check development and representation in benchmarks. We provide all code and additional details in the Supplementary Material.
A.2.1 Translation Models. Our first example setting is machine translation models, motivated by much of our analysis. In an ideal setting, translation models would identify if a given piece of text is less likely to yield a quality output, either abstaining from the translation or warning the user. We define acceptability as achieving a chrf++ score [Popović 2017] of > 50. In this setting, we set C, our acceptability threshold, to > 70%. We use the Opus100 dataset [Zhang et al. 2020] to fit the KWYK checks, and we examine only English-Spanish translation for simplicity. This simulates a model that has been evaluated on Opus100 and is getting ready for deployment. We then use the Europa Education and Culture Translation Memory (EAC-TM) dataset [Steinberger et al. 2014], which consists of government forms that must be translated, mimicking a legal-adjacent setting, as the evaluation set and the Flores 200 dataset [Costa-jussà et al. 2022], consisting of Wikipedia passages, as the calibration set. We use the NLLB distilled 1.3B parameter model [Costa-jussà et al. 2022] as our model-under-evaluation. We first run the model-under-evaluation to generate translations for all Opus100 passages. Then we fit different models to function as KWYK checks, classifying whether the model-under-evaluation will achieve an acceptable output for a given input text. We test roberta-large [Liu et al. 2019] and xlm-roberta-base [Conneau et al. 2019] models fit only to model inputs. We then also use the xlm-roberta-base model as a verifier, taking the output of the NLLB model as well as its input. We find that the xlm-roberta-base model, functioning as a verifier KWYK check, was able to achieve a target of 75% acceptability with an 18.9% abstain rate.

A.2.2 The Bar Exam. Beyond translation, we might consider a general-purpose system deployed where one potential use case is to answer bar exam questions or similar legal tasks. However, if users start to rely on these answers, it could lead to potential problems. In a simplified setting, we examine what a KWYK check might look like here. Noting that few if any open-source models can achieve above 60% accuracy on the bar exam task as of writing, we start with the flan-t5-xl model [Chung et al. 2022], a relatively strong baseline model, and feed the multiple choice bar exam questions from Hendrycks et al. [2020] to the model. We create a probability distribution over multiple choice answers by predicting the likelihood of the corresponding choice letters. We first use an uncalibrated KWYK check, which takes the model's uncertainty and only answers questions with over 50% confidence. We then use another calibration method, which takes as input the uncertainty, as well as a feature vector of linguistic features from the Text Characterization Toolkit (TCT) [Simig et al. 2022]. An XGBoost [Chen and Guestrin 2016] model is then fit to a hold-out calibration set to predict whether the model will correctly answer the query, given the model's uncertainty and the linguistic features of the text. We then only answer questions over the 50% threshold for this KWYK check on the test set. In this setting, the threshold C target might be between 60% and 70%, which is the percentage required to pass the bar. However, none of the KWYK checks are well-calibrated enough to achieve this level even at high abstain rates. This suggests that both the base model and the KWYK check mechanism must be improved for this model.
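The uncalibrated check described above can be sketched as follows: score each answer letter with the sequence-to-sequence model, convert the scores to a distribution, and abstain when the top choice falls below 50% confidence. This is a simplified illustration rather than our exact experimental code, and it assumes the question text already contains the formatted answer options.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

def answer_or_abstain(question, choices=("A", "B", "C", "D"), min_conf=0.5):
    enc = tok(question, return_tensors="pt")
    log_liks = []
    for letter in choices:
        labels = tok(letter, return_tensors="pt").input_ids
        # model(...).loss is the mean token cross-entropy for this target,
        # so multiplying by the label length gives a total log-likelihood.
        loss = model(**enc, labels=labels).loss
        log_liks.append(-loss.item() * labels.shape[1])
    probs = torch.softmax(torch.tensor(log_liks), dim=0)
    best = int(torch.argmax(probs))
    if probs[best].item() < min_conf:
        return None  # the uncalibrated KWYK check abstains
    return choices[best]
```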

B EXTENDED RELATED WORK
Responsible Evaluation. Raji et al. [2021] suggest that the goal of evaluating "everything in the whole wide world" is a fallacy. Raji et al. [2022] also suggest that policymakers and stakeholders hold false beliefs that AI works sufficiently well in their setting. Our work can be thought of as investigating the components of evaluation that would allow model evaluators to make informed and limited claims about exactly what setting they are evaluating and what range of performance should be expected in this setting. Other works have also looked to the law to inform responsible model deployments [Henderson et al. 2022; Nay 2022].
Evaluation. A set of other work has examined heterogeneity in model performance across highly specific dimensions. For example, Smith et al. [2022] and Dixon et al. [2018] break down text classifier performance by mentions of identity categories. And Bao and Barzilay [2022] provide a path forward for vision systems in automatically splitting datasets to identify biases in performance. Ethayarajh and Jurafsky [2020] point out that benchmarks should carefully consider how performance is aggregated and what dimensions of performance are considered. Our work is distinct from, but also related to, selective classification evaluations [El-Yaniv et al. 2010; Geifman and El-Yaniv 2017; Liang et al. 2022]. Liang et al. [2022] also consider model uncertainty during evaluation by implementing a selective classification method. In this method, the accuracy is evaluated for the fraction, denoted as c, of instances to which the model allocates the highest probability, while the model refrains from acting on the remaining 1 − c instances. The researchers report the selective classification accuracy for increments of c ranging from 0 to 1.
Know-What-You-Know (KWYK) Check Approaches. A number of other approaches attempt to provide uncertainty estimates on the quality of a model's output. These could help bring models more into alignment with professional codes of conduct, especially in the machine translation setting. They include verifier models, uncertainty estimation and calibration, performance prediction, and more. This class of approaches seeks to identify when a model is confident in its response and when it should opt out. This is essential for aligning with professional standards, particularly in legal translation. We briefly cover some approaches here. First, "verifier" models have been used to improve the performance of models by re-ranking answers. Cobbe et al. [2021] show that training a verifier model to judge the correctness of the model's output significantly improves model performance on GSM8K, a dataset of grade school math word problems. Li et al. [2022] use a verifier model to adjust the weighting for each sample output during majority voting, and show an improvement in model performance on GSM8K. These verifiers can also be thought of as KWYK checks that abstain if the verifier believes an answer is incorrect. Another class of approaches seeks to calibrate the uncertainty of the model's confidence in an output. For example, Kadavath et al. [2022] investigate whether language models can assess the accuracy of their own outputs.

Figure 1: Example multiple-choice question from the New York State Unified Court System [2018]. Note: we provide English translations in brackets that are not in the exam to assist the reader.

Figure 2: gpt-3.5-turbo alignment with professional rules for court interpreters using the prompt in the Appendix. We note that other prompts and deployment mechanisms may cause stochasticity in these exact assessments. This is to demonstrate how one might begin to incorporate aspects of professional standards into benchmarking.

Figure 3: Two hypothetical models and incentive structures for leaderboard climbing. In the standard setting, improving average performance evenly across all tasks is incentivized. In an alternative structure, KWYK checks aim to keep model performance high for any given task that the model does not abstain from, incentivizing adding new tasks only when they are well-calibrated.

Table 1: A subset of Interpreting and Translating Rules extracted from "Professional Standards and Ethics for California Court Interpreters" [Judicial Council of California Court Interpreters Program 2013]. We choose a subset of rules that are particularly suited for evaluation in benchmarks, but others could likely be further incorporated as well.

Table 2: Results of fitting each KWYK check on the Opus-100 dataset. We then calibrate on Flores 200 and evaluate on Europa EAC-TM. IC Accuracy is how accurate the KWYK check was in abstaining. Abstain rate is the overall % of EAC-TM samples abstained on. Acceptability rate is the % of accepted queries that were acceptable. Oracle is a model that perfectly accepts or abstains from queries. Never Abstain is the same as a model with no KWYK check.

Table 3: Professional Law (MMLU) test set accuracy after using the validation set and the auxiliary training set for calibration.