(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs

Large Language Models (LLMs) are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. This can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. Based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving LLM APIs. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in LLM APIs.


INTRODUCTION
Large Language Models (LLMs) are increasingly integrated into software applications [16]. Due to the high cost of developing and maintaining in-house LLMs, many applications rely on LLM APIs provided by companies like OpenAI, Anthropic, and Google [3,13,26]. Although LLM APIs provide easy access to state-of-the-art models, they also introduce uncertainty for downstream applications: It is not uncommon for application developers to find that carefully engineered prompts that worked yesterday perform less well after updates on the LLM provider's side [7,30]. In Figure 1, we highlight such an example from our case study on a toxicity detection task, where the LLM API update from text-davinci-003 to gpt-3.5-turbo-instruct causes a major performance downgrade and changes the good choices for prompt selection. Similar to traditional web service APIs [18] and more conventional ML APIs [9], updates to server-side LLM APIs controlled by a different party are hard to deal with. First, LLM APIs can be updated silently: OpenAI's gpt-3.5-turbo model has been updated twice (by Nov 2023), but the updates are not visible to downstream developers. Such silent API updates change only the underlying LLM but not the API signature¹, causing unexpected behavioral changes (e.g., formatting of generated code) [7] for application developers. Second, LLM APIs are scheduled to be deprecated and discontinued [27], effectively forcing application developers to adopt newer API versions. For example, the text-davinci-003 model will be deprecated in Jan 2024. The forced transition to gpt-3.5-turbo-instruct can cause unexpected prompt performance changes, including the performance downgrade illustrated in our example in Figure 1.
To cope with evolving LLM APIs, application developers need support for monitoring and analyzing how their prompts perform differently when the LLM API changes. Existing software engineering practice suggests that regression testing is essential for identifying changes between software versions, particularly for ensuring that fixed bugs are not reintroduced [34]. We argue that LLM application developers should take a similar approach. However, existing regression testing practices cannot directly translate to the LLM context, as we will illustrate. Based on our observations in the case study, we highlight three fundamental changes for regression testing LLM APIs: First, LLM regression tests should be defined at a different granularity. In traditional software engineering, a single breaking regression test indicates a bug in the software implementation. In contrast, it is common for ML models to change predictions on individual data points after updates. The common practice is to examine overall model accuracy, which has been criticized for being coarse-grained [33]. To gain a more nuanced understanding than overall model accuracy, LLM regression tests should be defined over data slices rather than on single predictions or the entire dataset. This calls for a different correctness notion, where "regression" is defined over slice-level aggregated metrics and a slice-level test only fails when the metrics change beyond a threshold.
Second, LLM regression tests need to monitor both model and prompt updates. It is well known that prompt engineering can greatly influence LLMs' performance [19]. As we will show, different prompt designs regress or improve differently on the same API update, making the optimal prompt design change from API version to API version. We argue that tracking both LLM and prompt versions is essential for LLM regression tests.
Third, LLM regression tests need to deal with the non-determinism of LLM APIs. LLMs are known to produce non-deterministic outputs: Non-determinism is often introduced intentionally for generating high-quality outputs with a non-zero temperature [e.g., 12], but can even be observed with a zero temperature setting [29], where the LLM should deterministically predict the next most likely token. It is necessary to deal with flakiness in LLM regression tests by accounting for this inherent non-determinism.
In summary, our paper has the following contributions:
• An exploratory case study on toxicity detection with the GPT-3.5 model family, showing that API upgrades can cause significant performance deterioration and that prompt design is an important factor in behavioral changes.
• A re-examination of the concept of regression testing for LLM APIs and the fundamental changes it requires, due to different correctness notions, prompting brittleness, and non-determinism in LLMs.
• A vision on research opportunities in supporting systematic regression testing for prompting LLM APIs.

BACKGROUND AND RELATED WORK

Evolving AI APIs
As ML models are increasingly provided as cloud services through APIs (e.g., Perspective API [14], ChatGPT [26], Amazon Rekognition [23]), it has been noticed that these models evolve over time without clear documentation [10,37], similar to traditional web service APIs [18]. This can pose risks to downstream application developers, who have no control over model updates and can suffer from performance regression [7,9]. Beyond demonstrating the problem, there has been only limited work on actually supporting developers facing evolving APIs not under their control. The most prominent example for ML models is by Cummaudo et al. [9], who focus on detecting changes in the label space and prediction confidence of vision APIs. Our work extends the existing literature by explicitly adapting the concept of regression testing to the LLM context and highlighting the need for more nuanced regression test suites.

The Rise of Prompting LLMs
LLMs present a fundamental shift in NLP applications through the prompting interface, which allows rapid prototyping and iteration [19]: Application developers can easily tweak and validate prompts on a few examples without the need to curate data and build models. In a sense, an LLM together with a specific prompt can be considered equivalent to a traditional ML model trained specifically for one task, such as toxicity detection. However, the prompting paradigm also brings the risk of prompt brittleness, as prompts can be sensitive to small changes [21] and the good choices for prompts change when the LLM changes. Our work highlights prompts as an additional factor to consider when regression testing LLMs.

ML Model Testing
ML models are usually evaluated on model fit using aggregated metrics like accuracy, as models are expected to make occasional mistakes [17]. However, traditional model evaluation has been criticized for being coarse-grained [33] and for suffering from issues like spurious correlations [1]. Therefore, recent work has proposed nuanced behavioral model testing as an alternative [25,33], where testers explore nuanced model behaviors beyond a single score.
Prior work has explored different methods to explore and test model behaviors [e.g., 32,33,41], as well as ways to automate testing specific model behaviors [e.g., 35,36] (see Yang et al. [40] for a detailed survey). Another line of work on data slicing [e.g., 4,6,11] focuses on identifying data regions where a model under-performs. Our work introduces a new scenario for ML model testing: regression testing over evolving LLM APIs.

CASE STUDY: TOXICITY DETECTION
Since regression of evolving LLM APIs is an emerging problem of which we have little understanding, we first explored the problem with an exploratory case study. We picked a paradigmatic case [42] of toxicity detection, a task widely used for online content moderation and long performed by models trained specifically for that task [15]; recently, LLMs with a suitable prompt have shown similar or better performance [38]. Our case study aims to explore (a) how prompt behaviors change (regress) over LLM updates and (b) where regressions can be detected.²

Table 1: Representative models from OpenAI's GPT-3.5 family [28], sorted by release date.

Experiment Setup

Datasets.
We selected two toxicity detection datasets for our case study: Civil Comments [8] and GitHub Discussion [22], covering different content (generic vs. specialized) and text lengths (short vs. long) for toxicity detection.
The Civil Comments dataset is collected from the Civil Comments platform, representing a wide range of comments on the Internet. We sampled 1000 comments from the dataset, among which 41 are toxic and 959 are non-toxic.
The GitHub Discussion dataset contains 174 discussions, among which 74 are toxic and 100 are non-toxic. The 74 toxic discussions were collected using the links provided by an existing study [22], and we randomly sampled another 100 non-toxic discussions from GitHub.
Models.
We experiment with the five representative models from OpenAI's GPT-3.5 family listed in Table 1. Notably, four out of these five models are already scheduled to be deprecated in 2024, effectively forcing application developers to switch to one of the newer models. The models are also updated silently: gpt-3.5-turbo-0301 and gpt-3.5-turbo-0613 are snapshots of the gpt-3.5-turbo model, which will soon point to gpt-3.5-turbo-1106.

Prompts.
We employed four prompting strategies to explore how they behave differently across model updates:
• Simple instruction (P1): The prompt instructs the model to classify the text as "toxic" or "non-toxic", followed by the text to classify. This serves as a simple baseline a developer might first try.
• Simple instruction, placed last (P2): The same as above, but with the instruction placed after the text. This design follows the insight that LLMs have recency bias [45] and that stating instructions last makes LLMs less likely to ramble [20].
• Detailed instruction (P3): A longer instruction that explains what toxic comments can look like (e.g., curse words, a condescending tone, meanness toward others, or making people feel angry without offensive words) before asking for the label.
• Few-shot (P4): The prompt additionally includes labeled examples before the text to classify.
We share the prompt templates in Figure 2.

Metrics.
To evaluate the accuracy of each model + prompt combination and monitor their changes, we use the standard performance metrics accuracy and F1, and set the model temperature to 0 to obtain the most likely predictions.

Model updates frequently cause accuracy drops across model + prompt combinations (Table 2); among them, 70.2% drop accuracy by more than 5%. Notably, across all prompts, the model update from text-davinci-002 to text-davinci-003 causes a consistent performance drop (16.8% on average) on the GitHub Discussion dataset but a consistent performance increase (11.8% on average) on the Civil Comments dataset. We hypothesize that these large performance differences are due to the new training method used for text-davinci-003, which causes major inconsistency across the two versions.

Model updates affect different prompting strategies differently.
We observed that among all model updates, 55% do not cause a consistent performance drop or increase across prompts, i.e., the same model update helps some prompts but hurts others on the same task. Specifically, we found that the simplest prompt, P1, drops accuracy in 75% of model updates, while the few-shot prompt, P4, drops accuracy in only 45% of updates. Zooming in, the update from gpt-3.5-turbo-0301 to gpt-3.5-turbo-0613 caused a 9.6% accuracy drop for P1 but a 5.1% increase for P4 on the Civil Comments dataset. This is particularly concerning, as the update is silent when a developer uses the main API gpt-3.5-turbo, which updates the underlying model from time to time. Such non-uniform performance changes cause a major problem for prompt engineering: developers may find that their carefully engineered prompt is no longer the best choice after a silent API update. For example, the detailed instruction prompt (P3) was the best-performing prompt up to the last model update, but falls behind the few-shot prompt (P4) by 8.7% on the latest model (gpt-3.5-turbo-instruct). This indicates that prompt engineering is not a one-time effort, and calls for prompt versioning and prompt monitoring (see the detailed discussion in Section 4.2).

Regressions happen even when prompt performance improves.
We also found that, overall, 10.9% of individual predictions regress (from correct to wrong) over API updates. Almost always (87.9% of the time) when overall accuracy improves in an update, at least one previously correct prediction regresses. For example, the model update from text-davinci-002 to text-davinci-003 improves P3's accuracy on the Civil Comments dataset by 7.7%, but 1.8% of the previously correct predictions now fail.
As such regressions are invisible in aggregated accuracy scores, it would be particularly concerning if the improvements and regressions are not uniform: the prompt may work better on some data slices but worse on others, causing fairness implications even when overall accuracy stays stable or improves.

Regressions happen beyond the decision boundary.
A natural hypothesis is that regressions happen on data points that models are less confident about (i.e., near the decision boundary). To explore this hypothesis, following existing work [44], we use information entropy to measure the model's confidence on a data point:

H(x) = -\sum_{y} p(y \mid x) \log p(y \mid x)

where H(x) is the entropy on input x, and p(y|x) is the model's probability of predicting label y on input x. Intuitively, when the model's prediction probabilities are more evenly distributed across labels, the entropy is higher and the model is more uncertain about the input. Since many LLM APIs do not expose the actual prediction probabilities, we approximate a model's prediction probabilities by running it on the same input multiple (n=20) times with a non-zero temperature (t=0.7). Overall, we found that models are indeed more uncertain about flipping data points on average (Table 3). However, we also found that 63.8% of regressions happen when models are very confident about their results (i.e., entropy = 0). This implies that model updates can drastically change predictions on data points far away from the decision boundary.
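Because many LLM APIs expose only sampled outputs, the entropy above has to be estimated empirically from repeated calls. A minimal sketch in Python (our own helper, not from the paper's artifacts, assuming labels collected from n repeated API calls on the same input):

```python
import math
from collections import Counter

def empirical_entropy(predictions):
    """Estimate prediction entropy from repeated samples of an LLM API,
    e.g., n=20 calls on the same input at temperature 0.7."""
    n = len(predictions)
    counts = Counter(predictions)
    # H(x) = -sum_y p(y|x) log2 p(y|x), with p(y|x) estimated by counts/n
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A fully self-consistent model has entropy 0; a 50/50 split has entropy 1.
empirical_entropy(["toxic"] * 20)                       # -> 0.0
empirical_entropy(["toxic"] * 10 + ["non-toxic"] * 10)  # -> 1.0
```

With this estimator, "entropy = 0" corresponds to the model returning the same label on all repeated runs.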
Across the models, we also found that different models show different levels of self-consistency: gpt-3.5-turbo-0301 seems to be the most self-consistent, while the update to gpt-3.5-turbo-0613 makes it much less self-consistent. This indicates another form of regression: While the two models' accuracy is comparable, the update can affect model calibration [45] and make the model less self-consistent (or over-confident).

Regressions are not uniform across data slices.
We next explore whether regressions happen systematically on specific data slices, using the metadata provided by the authors of the GitHub Discussion dataset [22].
We found that 90% of regressions happen on toxic discussions, despite only 42.5% of discussions in the dataset being toxic. Breaking down the regressions on toxic discussions by the provided metadata (Figure 3), we found that regressions are disproportionately common when the toxicity is triggered by politics (25.7% overall vs. 33.3% among regressions), targets code (21.6% vs. 33.3%), or is severe (54.1% vs. 66.7%), suggesting that model updates can cause systematically worse performance on these specific data slices.

Limitations.
Readers should be careful when generalizing the results beyond the current experiment settings: We used specific prompt formats and sent the prompts as a single user request.The optimal prompt design may change when the LLM API varies.

TOWARDS REGRESSION TESTING FOR PROMPTING LLMS
Our exploratory case study highlights that model regression is a real problem, deeply affected by prompting and LLM non-determinism. Based on our observations, we conclude with a discussion of how researchers can support regression testing for LLMs.

Identifying Data Slices as Regression Test Suites
In our case study, we found that individual predictions regress frequently (10.9%). Therefore, treating each data point as a regression test would simply be intractable. An alternative would be to look at aggregated metrics over the entire dataset. However, this level of monitoring is too coarse-grained and cannot inform developers how to debug and adjust their prompts. We argue that LLM regression tests should be defined at the level of slices. Our preliminary results show that it is possible to look at semantic slices and localize where regressions happen (e.g., toxicity targeting code for GitHub toxicity). However, our slicing relies on extra metadata, which may not be available for many datasets. Future research should further scaffold developers in identifying data slices as regression test suites, possibly by transferring existing approaches like slice discovery [11] and error analysis [6,39] on a single model to regression testing.
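A slice-level regression test can be sketched as follows (a minimal illustration; the function, toy data, and 5% default threshold are our own choices, not from the case study):

```python
def slice_regression_test(old_preds, new_preds, labels, in_slice, threshold=0.05):
    """Slice-level regression test: instead of failing on any flipped
    prediction, fail only when accuracy on a data slice drops by more
    than `threshold` between two LLM API versions."""
    def slice_accuracy(preds):
        hits = [p == y for p, y, m in zip(preds, labels, in_slice) if m]
        return sum(hits) / len(hits)
    drop = slice_accuracy(old_preds) - slice_accuracy(new_preds)
    return drop <= threshold  # True = pass, False = regression on this slice

# Toy example: a 4-point slice where the new API version loses 2 predictions.
labels    = ["toxic", "toxic", "non-toxic", "toxic"]
old_preds = ["toxic", "toxic", "non-toxic", "toxic"]          # 100% accurate
new_preds = ["toxic", "non-toxic", "non-toxic", "non-toxic"]  # 50% accurate
in_slice  = [True, True, True, True]
slice_regression_test(old_preds, new_preds, labels, in_slice)  # -> False
```

The threshold makes the correctness notion aggregate rather than per-prediction, which tolerates the routine prediction churn observed in the case study while still flagging systematic slice-level drops.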

Tracking Prompts for Regression Testing
Our case study points out that prompt performance can be unstable across different APIs and that each API has different best-performing prompts. Therefore, developers need to track and update their prompts (possibly reverting to a historical version) to maintain or improve prompt+LLM performance.
However, existing prompt engineering practices provide insufficient support for prompt versioning and monitoring [43]: information and knowledge are often lost in the iterative prompt engineering process. Future research can design systems for prompt+LLM tracking [e.g., 2,24] to help developers explore behavioral changes, debug regressions, and update their prompts.
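As a minimal sketch of what such tracking could look like (the PromptRegistry class and the accuracy numbers are hypothetical illustrations, not an existing tool or our measured results):

```python
class PromptRegistry:
    """Hypothetical prompt + LLM version tracker: record evaluation
    results per (prompt_version, model_version) pair so regressions can
    be localized to a prompt change, a model change, or both."""

    def __init__(self):
        self.results = {}  # (prompt_version, model_version) -> accuracy

    def record(self, prompt_version, model_version, accuracy):
        self.results[(prompt_version, model_version)] = accuracy

    def best_prompt(self, model_version):
        """Which tracked prompt version performs best on this model?"""
        candidates = {p: acc for (p, m), acc in self.results.items()
                      if m == model_version}
        return max(candidates, key=candidates.get)

# Illustrative numbers only: the best prompt can flip across model versions.
registry = PromptRegistry()
registry.record("P3", "gpt-3.5-turbo-0613", 0.91)
registry.record("P4", "gpt-3.5-turbo-0613", 0.88)
registry.record("P3", "gpt-3.5-turbo-instruct", 0.82)
registry.record("P4", "gpt-3.5-turbo-instruct", 0.90)
registry.best_prompt("gpt-3.5-turbo-0613")      # -> "P3"
registry.best_prompt("gpt-3.5-turbo-instruct")  # -> "P4"
```

Keying results by both versions lets a developer answer "which historical prompt should I revert to on the new model?" directly from the tracked history.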

Tackling Non-determinism in LLM Regression Testing
Our case study shows that LLM predictions can flip frequently with a non-zero temperature. This can cause substantial flakiness when performing regression testing for LLMs. Future research on LLM regression testing should explicitly consider such non-determinism in its research design. For example, to avoid a large sample size for each regression test, researchers can develop suitable statistical tests and test minimization strategies. While our work focused on classification tasks, regressions can also happen in generative tasks, where non-determinism is even more common for generating high-quality outputs. To support regression testing LLMs on generative tasks, future research should consider incorporating multi-dimensional metrics [46] and supporting developers in testing output properties specific to their requirements [31].
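One possible shape of such a statistical test, sketched below with a standard two-proportion z-test (the function, sample sizes, and thresholds are our illustrative choices, not a method from this paper):

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-proportion z-test statistic: is the pass-rate difference
    between two API versions larger than sampling noise alone would
    explain, given n1 and n2 repeated runs per version?"""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

# 20 repeated runs per version, mirroring the case study's n=20 sampling:
z = two_proportion_z(0.90, 20, 0.60, 20)  # old vs. new pass rate
regressed = z > 1.96  # one-sided flag at roughly 5% significance
```

Only flagging a regression when the drop exceeds sampling noise keeps flaky, temperature-induced flips from failing the test suite on every run.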

Figure 1: An LLM API update from text-davinci-003 to gpt-3.5-turbo-instruct causes a major performance downgrade on classifying toxic comments. The API update also changes the prompt choice: Prompt A (left) now outperforms Prompt B (right) by 8.7% accuracy.

Figure 2: Prompt templates for our experiments on the GitHub Discussion dataset. Templates for the Civil Comments dataset are similar with some adaptations.

Table 2: Accuracy for prompt (Pn) and model combinations on the Civil Comments and GitHub Discussion datasets. The best-performing prompt(s) for each LLM API are highlighted in bold. We observed similar results for F1 scores.

Table 3: Model entropy on the Civil Comments and GitHub Discussion datasets, averaged across all prompts.