WorldBench: Quantifying Geographic Disparities in LLM Factual Recall

As large language models (LLMs) continue to improve and gain popularity, some may use the models to recall facts, despite well documented limitations with LLM factuality. Towards ensuring that models work reliably for all, we seek to uncover if geographic disparities emerge when asking an LLM the same question about different countries. To this end, we present WorldBench, a dynamic and flexible benchmark composed of per-country data from the World Bank. In extensive experiments on state of the art open and closed source models, including GPT-4, Gemini, Llama-2, and Vicuna, to name a few, we find significant biases based on region and income level. For example, error rates are 1.5 times higher for countries from Sub-Saharan Africa compared to North American countries. We observe these disparities to be consistent over 20 LLMs and 11 individual World Bank indicators (i.e. specific statistics, such as population or CO2 emissions). WorldBench also enables automatic detection of citation hallucination, where models cite the World Bank itself while providing false statistics, and a manner to assess when an LLM’s stored facts begin to go out of date. We hope our benchmark will draw attention to geographic disparities in existing LLMs and facilitate the remedying of these biases.

for countries from Sub-Saharan Africa compared to North American countries.We observe these disparities to be consistent over 20 LLMs and 11 individual World Bank indicators (i.e.specific statistics, such as population or CO 2 emissions).WorldBench also enables automatic detection of citation hallucination, where models cite the World Bank itself while providing false statistics, and a manner to assess when an LLM's stored facts begin to go out of date.We hope our benchmark will draw attention to geographic disparities in existing LLMs and facilitate the remedying of these biases.

INTRODUCTION
Large language models (LLM) exhibit remarkable performance on a wide array of tasks, from summarizing the news to writing code to answering trivia questions [22,24,30].Impressively, LLMs have also been effective on real-world benchmarks.For example, GPT-4 [21] has been shown to pass the licensing exams for both legal [16] and medical professions [15,19].However, LLMs are also known to hallucinate, where they generate inaccurate text in a plausible manner [12].This can pose particular risks for factual recall tasks.Given the black box nature of LLMs, continued development and application of diverse benchmarks is instrumental in understanding when LLMs can be trusted to answer reliably.
In addition to issues with correctness, AI in general has well documented challenges with performance disparities, in which seemingly strong models fail more frequently for some subset of inputs than others.Performance disparities can manifest as fairness issues when the subset of inputs where the model underperforms is characterized by sharing a sensitive attribute.For example, Buolamwini and Gebru [7] identified widespread performance disparities along race and gender lines across commercial facial recognition systems, while others have shown that object recognition models suffer performance drops when images originate in lower income countries [9,10].Similarly, Ojo et al. [20] show LLMs are less performant when tasks are posed using African languages instead of English.A key first step to building models that work for all is creating benchmarks to quantify not only performance, but also performance disparities.
To this end, in this work, we introduce a novel benchmark called WorldBench to uncover if geographic disparities emerge in LLM factual recall.In other words, we ask, are LLMs more accurate in answering questions about some parts of the world than others?To systematically tackle this question, we compute LLM performance on a country-wise level, by way of utilizing per-country indicators (i.e.statistics) from the World Bank [6].We build and validate (via human inspection) an automated, indicator-agnostic prompting and parsing pipeline to interface with the World Bank data, summarized in Figure 2.This way, any set of indicators can be used in future variations of WorldBench, without having to change our code, which we will make public.In our study, we incorporate 11 diverse indicators, each having data for about 200 countries, resulting in a total of 2, 225 questions per LLM.
We evaluate 20 state of the art LLMs released in 2023, ranging from open-source models like Llama-2 and Vicuna [27,32], to private commercial ones accessible via API, including GPT-4 and Gemini [21,26].As visualized in Figure 1, when averaging over all LLMs and indicators, we observe substantial differences in per-country error, with African countries seemingly incurring the largest errors.Using country categorizations defined by the World Bank, we quantify disparities across 7 regions and 4 income groups, finding that LLMs are most accurate for countries from Western regions and the high income category.Problematically, these error rates rise by a factor of about 1.5× when moving to the region (Sub-Saharan Africa) and income group (low income) for which models are least accurate.Moreover, we find these disparities and their order (i.e. which groups have most/least error) to be consistent when inspecting LLMs or indicators individually.That is, all 20 LLMs exhibit geographic disparities in factual recall.
In addition to our main result, we utilize the temporal aspect of the World Bank data to conduct extra analyses, such as automatically cross-checking LLM generated "citations" which turn out to be hallucinated, and inspecting error as a function of the groundtruth year, finding that some LLMs in our suite may already be slightly out of date.
In summary, we present WorldBench, a flexible benchmark for understanding LLM factual recall abilities on a per-country basis.With WorldBench, we conduct a large scale evaluation of 20 LLMs, and find pervasive geographic disparities across regions and income levels.We hope our benchmark can facilitate further research on the fairness of LLMs, towards building models that work well for all.

RELATED WORK
Evaluating Factual Recall.Recent works have documented the performance of LLMs in factual recall: [17], [14], [23], [25].The general conclusion to these works is that while existing LLMs appear capable in answering certain factual question, their factual recall is less than perfect, as models can hallucinate completely fabricated information [12].Zhang et al. [31] specifically investigated the recall of geographic information, though their study is limited to GPT-4 and does not inspect disparities.Some works (e.g., [17], [23]) linked factual recall to 'popularity', showing that error rate increases for less popular entities.While those studies categorize facts by popularity, each question in our benchmark has an associated country, as well as Region and Income group.These additional annotations enable going beyond overall error, so to assess geographic performance disparities in factuality.
Bias.The issues of bias and fairness in AI are of immense societal impact.Several studies have observed computer vision models to exhibit disparate performance when grouping inputs by race, gender, and across income levels and geographies, for tasks like facial recognition, object classification, and diverse image generation [8][9][10][11].In the realm of language processing, Ojo et al. [20] observed a performance gap when tasks are presented in African languages.To the best of our knowledge, our study is the first to propose an automated and systematic examination of country-wise disparities in LLM factual recall, which in turn enables inspection of disparities across regions and income groups.
Benchmarks.Other works have noted and sought to improve challenges associated with evaluating factuality, primarily for tasks like summarization, where constructing a similarity metric between generated and reference texts is nontrivial.In our case, we design our benchmark to obtain numeric answers from LLM repsonses, with which we can compare to groundtruth values with the simple metric of absolute relative error.Further, we utilize a reputable third party (the World Bank), so that (i) the questions asked are relevant, (ii) inputs are grouped into salient cateogries, and (iii) groundtruth answers are accurate and up-to-date.Our benchmark provides a manner to quantify the performance of large language models (LLMs) on a per-country basis.We disentangle data collection from evaluation by utilizing the World Bank's data bank, which contains statistics (called indicators) pertaining to numerous diverse aspects of global development.Crucially, the data is available for nearly all countries and is updated year to year.With WorldBench, one can flexibly select specific statistics of interest, and dynamically re-evaluate models as time passes to see if they remain up to date.In this work, we uncover substantial geographic disparities in LLM performance for a wide range of models released by industry leaders, revealing the inequities pervasive across state of the art LLMs.Our benchmark offers a few unique advantages to most existing benchmarks.First, and most importantly, WorldBench equitably represents all countries.Thus, we can query a language model for the same exact statistic for completely different countries, enabling direct comparisons across countries to uncover disparities in performance.Next, data quality and licensing is assured, as it comes from a globally reputable source which explicitly allows for its use by the public.Third, our benchmark is dynamic and flexible.The dynamic nature comes from the fact that the statistics are updated on a yearly basis, enabling the longevity of our benchmark, as well as analysis of LLM factual recall along a temporal dimension (see §6.2).The flexibility is borne out of the vast number of indicators one could choose from.In other words, if one sought to better examine the ability of language model to recall facts about the environment, they can elect to choose indicators from the Climate category.In contrast, if a language model is being developed for financial purposes, one could focus on indicators from the Economy and Growth categories.In this study, we select 11 indicators, as shown in Table 1.The indicators are chosen to represent multiple different categories, and qualitatively are amongst the indicators that are easier to understand for lay people (i.e.non-experts in global development, like AI researchers).In total, there are 2, 225 questions, reflecting an average of 202 countries with groundtruth data per indicator studied.
Country categorization.The World Bank also provides various categorizations of countries, based on geographic or economic reasons [5].We focus on two high level categorizations, visualized in Figure 3, which divide the world into 7 Regions and 4 Income groups.We note that, like the collection and maintenance of the groundtruth data for our benchmark, country categorization is carried out by an external body (i.e. the World Bank) to the model producers and evaluators.We hope that the disentanglement of these three parties enables a more objective comparative analysis, informed by experts on global development.

Language Model Evaluation
While the World Bank's open data is crucial to our analysis, additional steps are needed to interface with the available data scalably.To enable large scale evaluation of LLMs, we design a procedure to obtain a numeric answer given an arbitrary indicator, country, and LLM of interest.Namely, we utilize a template prompt to guide models to provide answers in a mostly uniform fashion, and then apply an automated parsing method to extract the numeric value from the raw LLM output.We detail these steps below, as well as results from human studies to validate the correctness of our pipeline.We also explain how we compute errors, given numeric answers from LLMs and the World Bank's groundtruth data.
Prompting.Our standard prompt consists of a base instruction, an example, and a template question filled in with values for the indicator and country of interest.Figure 4 displays the base instruction and example.For consistency, we fix our choice of example country, electing Switzerland, as it has groundtruth data for all indicators in our study; we confirm results are similar when using alternate example countries in Appendix E. Importantly, we prompt the model to only provide the number in its response.Without this instruction, models generate longer free-from responses, increasing the difficulty of automatically extracting numeric values and the the computational cost of our benchmark.For every question (i.e.combination of an indicator and country), we first initialize the chat history of the LLM of interest with the base instruction and example, and then ask the question.Notably, all three components are modular with respect to the country and indicator of interest, allowing for them to work for any World Bank indicator.
Parsing.Despite the instruction to 'only provide the number', LLMs at times exhibit undesirable, like including other text (e.g.special tokens) or repeating the question with new countries and responding to itself again and again.We design an automated parsing method to scalably extract a numeric value from the raw LLM outputs.The parsing method removes special characters, and in most cases, extracts the first numeric value provided.We also account for special cases like, for example, where a suffix (e.g.'million' or 'billion') is used.In a small number of cases, the LLM either provides no output, an invalid output (e.g. a number with two decimal points), or abstains from answering.For these outputs and any others where the parsed number cannot be converted to a float, we exclude them from further analysis.
Error metric.To compare numeric values, we utilize absolute relative error, computed as follows: given two scalars , , we define Absolute Relative Error as |− | max(, ) .Essentially, this metric conveys by what percent two measures are different from one another.For example, an absolute relative error of 0.1 means that one value was 10% larger or smaller than the other.Notice that absolute relative error always falls between 0 (because all values we encounter are non-negative) and 1 (because the denominator is the maximum of the two positive values).We elect to use relative error over absolute error because the ranges of values varies dramatically across indicators, with the population indicator having some groundtruth values in the millions and billions, while others (e.g.unemployment) take on values under 10.Each question is defined by a query (i.e.Before asking a language model a question, we prompt it with a base instruction and example.Then, we automatically parse the raw output to obtain a numeric value which can be compared to the groundtruth data. our parsing is mostly complete, as we obtain a numeric value in 98.2% of cases where an answer can be parsed.To verify the correctness of the parsing, we first check 945 randomly selected raw LLM outputs where a numeric value was parsed.In 98.7% of these cases, the parsed value was correct (details in appendix C).Then, we take a closer look at parsed responses that incurred high (over 0.85) absolute relative error compared to the groundtruth value.For 825 randomly selected high error cases, the parsing was manually verified to be correct 93.7% of the time.Motivated by this slightly lower correctness rate, we also analyze median errors over groups in Appendix B, where observed trends are consistent (and disparities over Regions and Income groups are even larger).We conclude that our prompting and parsing pipeline is largely complete and correct.Nonetheless, when evaluating a new LLM, we recommend verifying the parsing behavior using the four validations we outline above, as individual LLMs can have unique idiosyncracies (e.g.special tokens or output patterns) that potentially could affect parsing.Along with all code, we will also publicly release methods to facilitate automatic and manual verification of parsing.
Groundtruth selection.For each indicator and country, data is available over a span of many years, though certain values are missing.To define a single groundtruth value for per country per indicator, we average the statistic over the past three years.The primary motivation for this strategy is to maximize the number of countries included in our study.Alternatively, one could select a specific year to draw all groundtruths from, though the number of countries considered would be lower than the averaging strategy.In Appendix D, we compare groundtruth values obtained via different selection methods, and observe groundtruths to only vary by a small amount.We also explore specifying a year when querying LLMs, and observe consistent results with respect to performance disparities to those observed without year specification in the query.Lastly, we more closely inspect overall error rates between LLM responses and groundtruths selected by specifying a year in section 6.2, to gain insight on if LLM responses are dated (i.e. more accurate for a prior year than the most recent year).

EVALUATION SUITE
We seek to evaluate a wide array of language models, including both open source and private.For the open source models, we utilize Huggingface's transformers library [29] to obtain and operate 15 models (and respective tokenizers).Namely, from Meta's LLama-2 [27], we include both base and chat-tuned versions of the 7 and 13 models, where 7 indicates 7 billion parameters.We also include two Vicuna models (7 and 13), which are fine-tuned from Llama-2.From Microsoft, we have 7 and 13 Orca-2 models [18], as well as Phi-2, the smallest model in our suite with just 2.7 parameters.From Mistral-AI, we include the 7 instruction-tuned model [13].We also study Zephyr-7  [28], tuned from a Mistral-AI model.Lastly, we include 7 and 14 Qwen models from Alibaba Cloud, both with and without chat-tuning [4].For closed source models, we include the following LLMs.From OpenAI, we evaluate gpt-3.5-turboand gpt-4 [21].From Google, we evalute Gemini [26].From Cohere, we evaluate the 'command' model, as well as the same model equipped with retrieval augmented generation (RAG) [2].RAG is a procedure where a langauge model can retrieve relevant documents (in this case, from the internet) and look over them before generating a response.Error rates are lower for western and high income countries.Mean absolute relative error rate per region and income group reported over all 11 queries and 20 language models studied.When computing median instead of mean, similar trends hold, with even larger disparities (see Figure 15).We note that the best performing LLMs have much lower error rates than the averages presented above (see figure 7).

RESULTS: PERVASIVE AND CONSISTENT GEOGRAPHIC DISPARITIES 5.1 Large disparities across Regions and Income groups
Figure 6: Error rates can vary significantly across countries, with some countries experiencing nearly 3× higher absolute relative error than others.Strikingly, all of the 15 countries with the lowest error rates fall in the high income category, while all of the 15 countries with the highest error rates fall in the low income category.
respectively.In contrast, the mean absolute relative error rises to 0.461 for countries from Sub-Saharan Africa, which is about 1.5× higher than the error for North America.For Income groups, mean absolute error rises steadily as the income level drops, with the lowest error being 0.346 for high income countries, and the highest error being 0.480 for low income countries.

Error nearly triples between some countries
On a per country basis, disparities can become even more pronounced.Figure 6 visualizes mean absolute error rate per country for the countries that, when asked about, language models (on average) have the most and least amount of error.We observe that 13 of the 15 countries that incur the least amount of error are European, while all 15 of these countries fall are categorized as high income.On the other hand, countries that incur the most error are all categorized as low income.Strikingly, error rises by a factor of nearly 3 across the two groups.

Consistent disparities across LLMs and indicators
Previously, we presented results averaged over all LLMs and indicators, grouped either by country or category (i.e.Region or Income group).We now inspect performance along the axes of LLMs and indicators separately, starting with LLMs.In addition to absolute relative error, we also employ a second metric to summarize differences in performance across certain categories.Namely, we define Disparity as max   ,  ∈   −   , where  is the set of mean absolute relative errors for each category of a given categorization.In other words, for example, Disparity over Regions is the gap between the mean absolute relative errors for the region with the greatest error and the region with the least error.Disparity also always falls between 0 and 1.To contextualize disparity scores, we compute a baseline corresponding to the disparity achieved using a random categorization of countries into  groups; we set  = 7 for Regions and  = 4 for Income groups.We approximate the baseline disparity given a set (i.e. for one LLM of interest) of per-country errors by applying ten random country categorizations and averaging the observed disparity over all trials.Figure 7 visualizes average error and disparities per LLM.From the left most panel, we see that the lowest mean absolute relative error achieved is 0.19, and the value for most models is near 0.4, indicating that there is substantial room for improvement for this task.Shifting from error to disparity (middle and right panels), we observe that all models exhibit disparate performance over regions, with gaps of at least 0.1 between the regions with the most and least error per LLM.Across income groups, disparities are also consistently present, though to a lesser degree, with only 4 of the 20 models studied achieving a disparity below 0.1.Nonetheless, both over Regions and Income groups, observed disparity almost always far exceeds the expected disparity for a random categorization (blue dashed lines).
A few expected trends emerge: base models are outperformed by their chat-tuned versions; smaller models are outperformed by their larger versions.One such trend we highlight is the impact of retrieval augmented generation (RAG), which is utilized for to augment the Cohere LLM.Incorporating RAG reduces mean absolute error by nearly a factor of two, reducing it from 0.416 to 0.231.Impressively, RAG causes disparity across Income groups to nearly vanish, going from 0.15 to 0.02, the lowest such disparity observed across our model suite, and on par with a random categorization of countries.However, it is worth noting that RAG comes at the cost of latency, as internet searches are required and the LLM must review retrieved documents in addition to the provided prompt.Nonetheless, RAG appears to be a promising direction for reducing errors and also potentially disparities.
Turning our attention now to indicators, Figure 8 shows errors and disparities per indicator.Mean absolute relative error exceeds 0.3 for all but two of the indicators.Again, disparities are present for most cases, though they are more pronounced across Regions   than across Income groups.Moreover, over both Regions and Income groups, observed per-Indicator disparity far exceeds the random baseline in almost all cases.Indicators that seem to be driving the observed disparities include CO 2 Emissions, Renewable Energy Ratio, and Unemployment.For a complete breakdown of performance and disparities for each (LLM, indicator) pair, we refer to Appendix A.

Ordering of Regions and Income groups by
error are consistent per-LLM and per-Indicator: Lowest error is with Western and high income groups Having demonstrated that significant disparities are present for each LLM and each indicator separately, we now show that the order of disparity is consistent as well.that is, the regions and income groups with highest and lowest error (respectively) are the same within each subset.Namely, LLMs achieve the lowest error when answering questions about Western or high income countries, and they suffer the greatest error when answering questions about countries from the low income category.In figure 9, we show the distribution of error ranks.That is, e.g. in the top right heatmap, for each LLM, we rank the regions by their mean absolute relative erros, and then report the fraction of LLMs for which a region obtains a specific rank.Thus, we see that for 75% of the LLMs, the highest error occurs for Sub-Saharan African countries.Strikingly, the pattern across income groups is strongly pronounced.Error ranks are almost perfectly inversely related to amount of income, wiht the high income group having lowest error for 95% of LLMs and the low income group having highest error for 90% of LLMs.Again, the same trends emerge when inspecting error rates per indicator.Thus, the original trends we observe when averaging over all LLMs and indicators, visualized in Figure 5, appear to hold when we inspect each LLM individually and each indicator individually.These results suggest that geographic and income-based disparities in LLM factul recall are pervasive throughout existing LLMs.

NOTEWORTHY OBSERVATIONS
In analyzing per-country performance and geographic disparities in LLM factual recall, we additionally came across a number of noteworthy observations made possible by our benchmark.First, we found that LLMs occasionally offer what resembles citations in their responses, including instances where the WorldBank itself was mentioned.Since we have that exact data, we were able to cross-check the LLM "citations".Second, because we have data per-country per-year, we could compute error rates while selecting groundtruths from specific years, so to see how up-to-date LLM responses are.We explore these observations in more detail below.

Citation Hallucination
Despite being prompted to only return a numeric value, the LLMs we studied still would often produce additional text.Interestingly, sometimes generated text would resemble a citation1 , claiming the provided answer was sourced from institutes like the World Health Organization, the International Monetary Fund, and even, the World Bank.In the last case, we cross-checked the provided responses to see if the numeric response matched the groundtruth World Bank data, contained in WorldBench.Overall, responses with "citations" were no more accurate than those without "citations", still incurring substantial mean absolute relative errors.Specifically, in 650 instances where the string "World Bank" (case insensitive) was mentioned, mean absolute error rate was 0.465.This suggests that the LLM-produced "citations" are hallucinated, as the provided responses do not actually come from the sources listed.Figure 10 displays a few examples of LLM produced "citations".For each example, we highlight the "citation", and provide the absolute relative error of the parsed answer compared to (1) the groundtruth value from the specific year cited, and (2) the lowest absolute relative error to groundtruths for any of the past ten years.In the first example, the LLM answer is way off, despite the arguably convincing Figure 11: Error rate of LLM outputs compared to year from which groundtruth is extracted.Many models show the lowest error rate when their outputs are compared to groundtruths from 2021, indicating that models may already be slightly out of date.
"citation".Interestingly, we also observe an instance where the provided answer does not match the groundtruth from the cited year, yielding an error of 0.383, but it does match the groundruth from the following year, with error dropping to 3.97%.Finally, we see an example where the provided answer is off by almost exactly a factor of 10 (relative error of ∼ 0.9).This highlights a pitfall in using LLMs to return numeric information, as the difference in tokens between two numbers can be very small, while the resultant encoded value can be very large.
In summary, hallucinated citations pose a serious challenge in LLM reliability.On one hand, producing false citations obfuscates model errors, and generally denigrates the overall trust the end user has in the system.On the other, that the LLMs appear to know what sources would contain the answer seem to be an encouraging sign to the potential benefits of retrieval-augmented systems.

Are some LLMs already out of date?
Now, we compare LLM responses to groundtruths from specific years for all LLM responses, not just the rare few where "citations" are present.Figure 11 shows the mean absolute relative error over indicators and all countries per LLM, computed using groundtruths selected in a variety of ways.The orange dashed line corresponds to the default groundtruth selection (averaging over any available data from the past three years), while the light blue one corresponds to using data from the most recent year (per country; details in D.2).The solid blue lines correspond to using the groundtruth value from the year on the x-axis.A trend that emerges in 13 of the 20 LLMs is that the lowest error occurs when comparing to data from 2021.In one extreme, error increases from 0.5 to 0.54 when changing the groundtruth year from 2021 to 2022.These results suggest that the facts internally stored in some LLMs may already be out of date, reporting statistics closer to previous years, especially if their training data was curated in years past.Of course, an LLM cannot recall a fact that did not exist at the time of its training.Nonetheless, as the use of LLMs continues to grow, the ability to stay up to date will be paramount.We hope WorldBench can aide in this pursuit.

What kinds of countries experience high error rates?
We now present a purely correlational study to better understand what countries experience the highest error rates.Using the percountry data for each indicator studied, we compute the correlation between these values and per-country error.We also compare the normalized (by mean) standard deviation of responses per country per indicator, with responses taken over five trials.The hypothesis here is that LLMs will have greater variance in answering questions about countries they are less accurate for, similar to [1]; we call this self-consistency.We compute correlation to country-wise errors for each (LLM, indicator) pair separately, as the values can take on substantially different ranges as either LLM or indicator changes, and then average over all such pairs.Results are reported in table 2. We find that most indicators are not correlated with per-country error.The strongest correlation is −0.396 for GDP PPP per person employed, suggesting that LLMs perform worse on countries with lower per-person wealth.Notably, neither population nor GDP are correlated well with error.As for self-consistency, in most cases, correlations are within 0.3 − 0.4.In a couple instances, high correlations are observed, suggesting that sampling multiple outputs and inspecting variance can sometimes (but not always reliably) aide in estimating the uncertainty of the LLM.In summary, our simple correlational analyses do not shed much insight in to why particular countries incur higher error rates for LLMs.We conjecture that the availability of training data plays a large role.However, the groundtruths are available for all countries, and World Bank data is likely in the training sets of many LLMs, as indicated by the hallucinated citations to them.We leave investigation to the cause of the geographic disparities we observe to future work.

LIMITATIONS
Is it reasonable to expect language models to perform this task?LLMs are not directly optimized for information retrieval, and developers often caution that LLMs many not always provide factual answers.Furthermore, retrieving specific numbers can be challenging, given the fact that many sequences of numbers are feasible/reasonably likely to appear in natural language, where as the distribution of words has far less entropy.Nonetheless, LLMs have been observed to produce factual responses to certain queries, achieving as high as 86% exact match on TriviaQA [3].Indeed, in our experiments, we observe mean absolute error rates as low as 3.6% for the Population indicator and 5.8% for the Electricity Access indicator (see Appendix A), suggesting that LLM-based factual recall is feasible.We emphasize that the point of our benchmark is to enable comparison in LLM performance across countries, so to uncover systemic disparities.Moreover, despite warnings from developers, as LLMs become more ubiquitous, end users will likely still make factual queries, to which we'd hope language models respond accurately, and importantly, without substantial differences in performance due to factors like geography or wealth of the country of interest.Thus, we hope our benchmark aide in assuring that LLMs exhibit fair performance when deployed.
Can LLMs ever ace this task?Some of the indicators studied are volatile, in the sense that they change non-trivially from year to year.Also, some metrics can take on slightly different values based on which organization measured them (e.g. the World Bank's numbers may differ from the United Nation's numbers).Thus, we do not expect LLMs to achieve perfect performance on this metric.Nonetheless, we believe our benchmark can offer valuable signal in measuring geographic disparities.That is, even though error rates may never be exactly zero, we can hope that they will not vary substantially across countries.

CONCLUSION
We present WorldBench, a benchmark to quantify geographic disparities in LLM factual recall.We find pervasive and consistent biases across 20 evaluated LLMs, with Western and higher income countries experiencing lower error rates.By utilizing World Bank data, our benchmark is flexible and will remain up to date.Thus, we hope our benchmark can aide in reducing geographic disparities of future generations of LLMs, towards models that work well for all.

A COMPLETE RESULTS BREAKDOWN
We now present the results as completely as possible.In Figure 12, we present mean absolute relative error per LLM per indicator.In Figure 13, we present disparities over regions per LLM per indicator, and in Figure 14 we show the same for disparities over income groups.In general, the indicators that are most challenging are challenging for all LLMs.

B LARGER DISPARITIES WHEN USING MEDIAN INSTEAD OF MEAN ERROR
We now present results when aggregating with median instead of mean. Figure 15 shows that disparities grow larger when inspecting median absolute relative error instead of mean.We attribute this difference to some outlier countries, such as Bermuda for North America and Greenland for Europe & Central Asia.

D ALTERNATE GROUNDTRUTH SELECTION STRATEGIES D.1 Variance across groundtruth values selected from different years
We confirm that variance due to alternate groundtruth selection strategies is minimal.Groundtruths can be selected by specifying a particular year, or by averaging over the past three years, as we do in the main text.Table 3 shows the mean absolute relative error obtained by comparing the groundtruth value obtained by selecting a specific year and the groundtruth value obtained by averaging over the past three years.We find that, averaged over all indicators, the absolute relative error between two different groundtruth values  year to year, with some indicators being more volatile.Nonetheless, our benchmark can still offer valuable signal for measuring disparities (its intended purpose), as volatilities are present for all countries.
D.2 Clarification on using 'most recent year' In Figure 11, we plot the error incurred when comparing to groundtruth values selected over different years, so to investigate if LLM reported statistics are closer to values from previous years (see section 6.2).One baseline selection strategy was termed 'most recent year'.We now clarify how this value is computed.We pick the most recent available statistic per-country, as some countries may have more recent statistics than others.We exclude any countries that have no statistics for each of the past five years.Note that at the time of this study (December '23), the most recent available statistics for any country was from 2022.Thus, for some countries, the 'most recent year' groundtruth was be drawn from as early as 2017, though in the vast majority of cases, it was drawn from 2022.

D.3 Specifying a year in the question
We also investigate if observed errors or disparities by LLMs could be caused by ambiguity in our prompt.Namely, in our prompt, we do not specify the year from which we desire the LLM to provide the requested metric for the given country.In the absence of a specification, we believe it is reasonable to assume that the most recent value is desired.Nonetheless, we conduct extra experiments where a specific year is mentioned in the prompt.We ask for values from 2021 and from 2016.Table 4 shows the results.Trends are very similar for both cases where a year is specified, and the case where no year is specified (matching the results we present in the main text).Note: GPT-4 was excluded in this ablation, purely for reasons of reducing cost.

E SIMILAR RESULTS WHEN USING DIFFERENT EXAMPLE COUNTRIES
We also verify that changing the choice of example country does not alter our main findings.Recall that we provide an example in our standard prompt.We originally chose Switzerland, as it had data for all indicators in the study.Now, we also inspect results when using Colombia and Mali as example countries.We choose these countries as they pertain to Regions that experience different levels of error (Colombia incurs around an average level of error, while Mali incurs high error).Table 5 shows the results.Again, main trends are consistent, with Western and High income countries

Figure 2 :
Figure 2: Overview of WorldBench.Our benchmark provides a manner to quantify the performance of large language models (LLMs) on a per-country basis.We disentangle data collection from evaluation by utilizing the World Bank's data bank, which contains statistics (called indicators) pertaining to numerous diverse aspects of global development.Crucially, the data is available for nearly all countries and is updated year to year.With WorldBench, one can flexibly select specific statistics of interest, and dynamically re-evaluate models as time passes to see if they remain up to date.In this work, we uncover substantial geographic disparities in LLM performance for a wide range of models released by industry leaders, revealing the inequities pervasive across state of the art LLMs.

Figure 4 :
Figure4: Standard pipeline for extracting numeric answers from LLMs.Each question is defined by a query (i.e.Before asking a language model a question, we prompt it with a base instruction and example.Then, we automatically parse the raw output to obtain a numeric value which can be compared to the groundtruth data.

Figure 5
Figure 5 visualizes our central finding.Over 20 LLMs and 11 World Bank indicators, we observe substantially disparate average performance based on the Region and Income group of the country of interest.Namely, the mean absolute relative error is 0.316 and 0.321 for countries from North America and Europe & Central Asia

Figure 5 :
Figure5: Language models exhibit disparate performance for countries from different regions and income groups.Error rates are lower for western and high income countries.Mean absolute relative error rate per region and income group reported over all 11 queries and 20 language models studied.When computing median instead of mean, similar trends hold, with even larger disparities (see Figure15).We note that the best performing LLMs have much lower error rates than the averages presented above (see figure7).

Figure 7 :
Figure 7: Performance of 20 LLMs averaged over 11 indicators from WorldBench.We present the absolute relative error (left), as well as disparities across regions (middle) and income groups (right).For disparities, the blue dashed lines correspond to the disparity incurred using a random categorization of countries (into 7 groups for Regions and 4 for Income groups), averaged over ten trials.Observed disparities far exceed the amount expected for a random categorization of countries across nearly all LLMs.

Figure 8 :
Figure 8: Error rates and disparities per indicator, averaged over LLMs.For disparities, the blue dashed lines correspond to the disparity incurred using a random categorization of countries (into 7 groups for Regions and 4 for Income groups), averaged over 10 trials.

Figure 9 :
Figure 9: The order of regions and income groups by absolute relative error is largely consistent per LLM (top) and per indicator (bottom).For both LLMs and indicators, the regions with the lowest errors are most frequently North America and Europe & Central Asia, while the regions with the highest error are most frequently Sub-Saharan Africa and East Asia & Pacific.For Income groups, error nearly always increases as income decreases.

Figure 10 :
Figure 10: In addition to hallucinating false answers, we also observe LLMs to occassionally hallucinate citations.Above, a few examples of hallucination citation are shown.

Figure 12 :
Figure 12: Absolute relative error averaged over countries per LLM and Indicator.Language models and indicators are each sorted by overall average error respectively.

Figure 13 :
Figure 13: Disparities over Regions per LLM and Indicator.Language models and indicators each sorted by overall average error.

Figure 14 :
Figure 14: Disparities over income groups per LLM and Indicator.Language models and indicators each sorted by overall average error.

Figure 15 :
Figure 15: Median absolute relative error per region and income group.See figure 5 for mean errors.

Table 1 :
Global development indicators in WorldBench, each defined and maintained by the World Bank.

Table 2 :
(Left) Correlation between per-country mean absolute relative error and individual indicator values.(Right) Perindicator, correlation between per-country mean absolute relative error and normalized standard deviation of responses obtained over five trials.
We propose a general (i.e. for any LLM) pipeline for prompting LLMs responses to flexible (with respect to the country or indicator in question) queries.We seek to validate two aspects of this pipeline: completeness, where the parsing successfully extracts numeric answers in all instances where a numeric answer was provided, and correctness, where the parsed number should match the original numeric value embedded in the text.By simply running our parsing method, we can obtain our first statistic: parsing extracted a numeric answer for 88.9% of responses.For the 11.1% of responses where parsing failed, failures are either due to the LLM not providing