Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class

Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real-world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a single vector based on the class label. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well as finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst-class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.

Fig. 1. To address this issue, we infer attributes and embed multiple vectors, reducing disparities and enhancing interpretability. (Right) Our method scales better than prior works as we include more attributes (Section 6.1), enabling us to account for the many ways in which diversity can arise.

INTRODUCTION
A pivotal advance in machine learning is the advent of foundation models. A single foundation model trained on large-scale data can supplant multiple task-specific models. Vision-Language models (VLMs) are popular foundation models capable of encoding text and images in the same representation space. Compared to standard classifiers, which can only classify objects from a predefined list of classes with examples, VLMs are capable of open-world, zero-shot classification -- meaning, VLMs can classify any object using text descriptions without any additional training. This zero-shot paradigm has spurred the development of many VLMs [15,24,36] with impressive classification performance.
Despite their remarkable performance, even today's best models exhibit skewed performance for certain groups of images. For example, [28] show models such as CLIP have exacerbated the gap in performance between regions such as Africa and Europe (as well as the gap across income levels). We find similar biases arise when an object is visually dissimilar from its typical depiction. For example, Figure 1 (left) shows CLIP's 97.3% accuracy on typical pears drops dramatically when a pear is peeled (45.2%) or puréed (30.3%). Addressing such biases is crucial to the reliability of classifiers in the real world, where instances within a class can vary significantly.
Zero-shot classifiers, like standard models, use a single vector in deep embedding space to describe an entire class. For standard zero-shot classification, a vision-language model (i) encodes the image along with 80 hand-crafted prompts per class name (e.g., "a photo of a pear" or "a drawing of a pear"), (ii) averages the 80 embeddings per class to obtain a single vector, and (iii) predicts the class whose vector maximizes cosine similarity to the image embedding [24]. Prompt averaging encourages all instances of a class to be mapped to the same vector in the model's embedding, inherently limiting the model's ability to infer the innumerable diversity within a class. A pear can be diced, sliced, whole, in one's hand, or in a bowl. In each case, the image of the pear would be markedly different, and its embedding may not always be well aligned with the single vector that is supposed to represent the entire class. Thus, there is a natural tension between the one vector per class paradigm and performing consistently across a class with high diversity, which we empirically validate. While many strategies exist to mitigate performance disparities when labeled data is available, these methods do not transfer to the data-free setting of zero-shot classification. Fortunately, unlike standard classifiers, the open-world nature of VLMs enables them to represent any attribute using the text encoder. VLMs can enrich the single per-class vector with attributes to more faithfully capture the variety with which a class can appear, pinpointing whether a pear is peeled or puréed. Thus, we argue that instead of learning one vector per class that is invariant to diversity, we should leverage the open-world nature of VLMs to explicitly account for the diversity within a class (i.e., via multiple vectors).
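As a concrete reference point, the three steps above can be sketched as follows (a minimal NumPy sketch, assuming embeddings have already been computed and L2-normalized; the function name and array layout are ours, not part of CLIP's actual API):

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs_per_class):
    """Standard zero-shot classification: one averaged vector per class.

    image_emb: (d,) L2-normalized image embedding.
    prompt_embs_per_class: list of (n_prompts, d) arrays, one per class,
        holding L2-normalized text embeddings of the prompt templates.
    Returns the index of the predicted class.
    """
    scores = []
    for embs in prompt_embs_per_class:
        class_vec = embs.mean(axis=0)            # average over the ~80 prompts
        class_vec /= np.linalg.norm(class_vec)   # re-normalize the mean vector
        scores.append(image_emb @ class_vec)     # cosine similarity
    return int(np.argmax(scores))
```

Note that every instance of a class is scored against the same `class_vec`, which is exactly the one-vector-per-class constraint discussed above.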
Recent works offer promising signs that zero-shot classification can be improved by incorporating attributes beyond the class name, such as subclasses [21] or visual descriptors [18,23]. However, the former is limited to datasets with hierarchical label sets, and the latter reverts back to the one vector per class paradigm via simple averaging, limiting the benefits of incorporating more attributes (Section 6.1). Importantly, diversity comes in many forms that generic descriptors or subclasses alone may not adequately capture.

Fig. 2. We test models on datasets that provide groundtruth attributes (shown in bold) annotating hierarchies, diverse states, and real-world shifts (e.g., Rojas et al. [29] labels the income level and country of origin of each image, towards promoting AI models that reduce bias) within a class. We find that standard zero-shot accuracy ('Base Acc.' above) drops significantly when certain attributes are present, namely when the attribute manifests in visual differences from what the model considers 'typical' for the class. We design our method to improve performance on these 'atypical' instances.
In this work, we propose a zero-shot method for enriching classes with open-ended attributes to boost zero-shot classification. Our method consists of two steps: (i) an attribute inference step, in which we use generative language modeling (an inherent, under-utilized capability of some modern VLMs) to enumerate relevant attributes along many axes of diversity, and (ii) a prediction consolidation step, where we flexibly attend only to subpopulations (i.e., instances within a class sharing an attribute) that are most relevant to the image. By enriching and carefully consolidating attributes to describe diversity within a class, our method more faithfully encodes atypical instances. Furthermore, by introducing interpretable intermediate outputs (i.e., the inferred attributes), our method affords greater transparency, as each inference comes with the specific list of fine-grained attributes used to predict the class, and attribute overlaps across classes can help anticipate and articulate potential failures before they happen.
In experiments over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, we observe our method matches and in most cases exceeds the performance of existing methods, showing that transparency can be achieved without compromising on performance. Our method yields consistent gains on a second dataset suite with finer-grained classes and no labeled diversity, showing that our method still works well when intra-class diversity may be less present. Encouragingly, we find larger improvements occurring for the hardest classes and subpopulations, where atypical instances are usually found, resulting in reduced performance disparities. Compared to existing methods, we find that our approach can effectively scale to a much larger number of attributes to cover broader axes of diversity, as shown in the right panel of Figure 1. Our method also offers a principled trade-off between accuracy overall vs. on the worst classes, all without additional training. In summary, we (i) identify a limitation of the one-vector-per-class paradigm in adequately representing classes with diverse subpopulations, (ii) propose to go beyond one vector per class, leveraging under-utilized abilities of VLMs to explicitly account for intra-class diversity, and (iii) extensively validate the effectiveness of our method to perform zero-shot classification in both a more transparent and accurate way, especially for diverse subpopulations that are often overlooked.

REVIEW OF LITERATURE
Despite impressive overall accuracy, modern classifiers still suffer from biases. That is, they under-perform on some parts of the data, often due to spurious correlations or data imbalances in the training set. These biases can result in significant negative real-world impact. For example, Buolamwini and Gebru [3] exposed significant bias along demographic lines for facial recognition systems, and more recently, Richards et al. [28] demonstrated that despite steady progress on typical benchmarks, today's best models still generalize poorly to images from lower-income households and certain geographic regions. Namely, VLM-based zero-shot classifiers were shown to have even larger performance disparities across geographic and economic shifts than their supervised counterparts.
However, the promise of open-world zero-shot classification rightfully draws much attention to VLMs, which operate by mapping images and text to a shared latent space. CLIP [24], a seminal VLM, achieves this via joint contrastive training of image and text encoders on 400 million image-caption pairs. Recent models such as BLIP-2 [15] bootstrap the training of more powerful VLMs by taking larger pretrained vision and language backbones and fusing their outputs to a single space, which in turn can even be used to generate text; that is, some modern VLMs contain a fully functional LLM with (often under-utilized) generative abilities. To perform zero-shot classification with VLMs, one computes the class that has the highest cosine similarity between a test image's embedding and the embedding of a class name, often averaged over many (80 for CLIP) handcrafted prompt templates. While many efforts have improved VLM-based classification via prompt-tuning [8,12,17,19,38-40], nearly all require some labeled data. Other works focus more closely on the task of debiasing VLM-based classifiers [6,14,32,37], though they too utilize labeled data, placing them out of scope of the true zero-shot setting.
Compared to previous classifiers, the key novelty of VLMs is their ability to encode any text. However, standard zero-shot classifiers only embed classnames, either alone or averaged over prompts. We propose to leverage the open-vocabulary capabilities of VLMs to improve coverage of intra-class diversity by embedding more than just the class name. One effort along these lines is PerceptionCLIP [1], which infers contextual attributes per image as generative factors and does class inference conditioned on them. Other works utilize LLM-generated class descriptors, towards creating a concept bottleneck [35] or rationales for inference [9], though these methods use data to train a linear layer atop descriptor similarities. DCLIP [18] shows including descriptors can also improve performance in the zero-shot setting, and Pratt et al. [23] extend the gains using additional handcrafted queries. WaffleCLIP [30] shows that appending random characters or words can achieve similar performance to descriptor-based methods like DCLIP, without the need for an external language model. Importantly, although these works obtain more than one vector per class, they ultimately average over them. Thus, decision boundaries remain linear and biases may linger, as atypical instances are still suboptimally covered (see Sections 4.2 and 6.2). In contrast, like us, CHiLS [21] introduces a non-linearity in three steps: they (i) define subclasses with groundtruth label hierarchies or by querying GPT-3, (ii) do zero-shot classification on this extended set of classes (subclasses) and original classes, and (iii) reweight the standard zero-shot score for each class with the max score from step (ii) over subclasses within the class. However, CHiLS is designed specifically for hierarchical label sets, which limits the types of diversity it can capture (see Section 6.1).

MOTIVATION
We hypothesize that the standard one-vector-per-class paradigm poses a tension for highly diverse classes. We investigate this by measuring classification performance as a function of class diversity. Indeed, we find classes with higher diversity suffer worse performance under the one-vector-per-class classification paradigm. Then we illustrate how newfound open-vocabulary capabilities of VLMs can enrich the single class vector to encompass diverse instances without additional training. That is, we show that incorporating attribute information can substantially improve VLM recognition of atypical subpopulations.

Fig. 3. The average precision (AP) of a classname embedding is often much lower than the average precision of a subpopulation (i.e., classname with attribute) embedding. Subpopulations that see large increases in AP by including the attribute tend to be atypical. We design our method to improve accuracy on these diverse subpopulations, by inferring and explicitly accounting for them.

A single vector inadequately represents diverse classes
A standard VLM classifier is most effective when it aligns all instances of a class to their class vector (and away from vectors for other classes). Intuitively, aligning instances with high diversity is challenging as their image embeddings are more dispersed -- and particularly tough for fixed open-vocabulary VLMs that do not benefit from knowing the specific classes of interest during their pre-training (see Appendix G.1). We see in Figure 2, for example, that the less typical Arctic fox is far harder to recognize than a typical fox (52.0% versus 84.5% accuracy). We observed similar drops in accuracy for a deflated balloon versus a regular balloon and an unpaved street versus a paved one. To systematically quantify this tension, both for VLMs and for the one vector per class paradigm generally, we examine class accuracies on ImageNet [7] relative to the diversity of each class across several models with varying levels of supervision. To proxy diversity, we measure the variance of image embeddings within a class. In all cases, we observe a strong negative correlation between class-wise accuracy and diversity (see Table 4 and details in Appendix C). That is, classes with higher diversity have lower accuracy in the one vector per class paradigm.
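The diversity proxy above (variance of a class's image embeddings) can be sketched as follows; this is our illustrative formulation, as the paper defers exact details to Appendix C:

```python
import numpy as np

def class_diversity(image_embs):
    """Proxy for intra-class diversity: mean squared distance of a class's
    (L2-normalized) image embeddings from their centroid.

    image_embs: (n_images, d) array of image embeddings for one class.
    """
    embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    centroid = embs.mean(axis=0)
    return float(((embs - centroid) ** 2).sum(axis=1).mean())
```

Computing this per class and correlating it with per-class accuracy is what reveals the negative accuracy-diversity relationship reported above.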

A path forward: VLMs can recognize diversity with relevant attributes
Although standard VLMs use solely the classname in zero-shot classification, their shared embedding space allows them to encode relations to any other text. In turn we ask: can the open-vocabulary encoder of VLMs better situate diverse classes given relevant attributes? Specifically, we assess whether enriching classes with attributes can improve zero-shot classification on a suite of datasets with ground-truth attributes per class (details in Appendix B). We form a subpopulation by taking instances within a class that share an attribute. For each subpopulation, we compute the similarity of image embeddings with the text embedding of (i) the classname and (ii) the classname with the corresponding attribute, using CLIP ViT-B/16. We then measure the average precision of the two similarity scores for distinguishing instances within the subpopulation from instances outside of the class. We find, as shown in Figure 3, that for the vast majority of cases, incorporating attributes leads to more precise recognition, and often by large margins: adding molten to cake improves average precision by over 40 points. Upon inspection, the highest gains in average precision tend to occur for atypical subpopulations (see Appendix B). Thus, VLMs can recognize instances in a class even when they are atypical, but this ability is restricted under the one vector per class paradigm.
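The average-precision comparison above treats each similarity score as a ranking of subpopulation members over non-class instances. A minimal sketch of that metric (a standard AP computation in NumPy; one would apply it once to classname similarities and once to classname-plus-attribute similarities):

```python
import numpy as np

def average_precision(scores, labels):
    """Average precision of `scores` for ranking positives first.

    scores: (n,) similarity scores (higher = more similar).
    labels: (n,) binary array; 1 marks subpopulation members,
        0 marks instances outside the class.
    """
    order = np.argsort(-scores)          # rank by descending similarity
    ranked = labels[order]
    cum_pos = np.cumsum(ranked)          # positives retrieved so far
    hit_ranks = np.nonzero(ranked)[0]    # 0-indexed ranks of positives
    precision_at_hit = cum_pos[ranked == 1] / (hit_ranks + 1)
    return float(precision_at_hit.mean())
```

A gain in AP when scoring with "molten cake" instead of "cake" corresponds exactly to the per-subpopulation improvements plotted in Figure 3.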

METHOD
We now propose a method to better utilize the ability of VLMs to recognize diverse subpopulations. Our method consists of attribute inference and prediction consolidation. First, we query a large language model (LLM) for diverse per-class attributes that span many (often overlapping) subpopulations. Then, after computing the similarity of an image to each subpopulation, we non-linearly consolidate these similarities to obtain one score per class. We elaborate on these two steps below.

Attribute Inference Along Many Axes of Diversity
To better cover the diverse subpopulations that may exist within a class, we incorporate attribute information. However, diversity can come in many forms. That is, the way in which two instances of a class differ can itself vary. Consider the examples in Figure 2. The Arctic fox case shows how a class can contain distinct finer-grained categories. In a related manner, the state or condition the class instance is in can also substantially change its appearance: a balloon looks much different when it is deflated. Further, there exist generic attributes that can lead to substantial visual differences regardless of the class, such as the region or income level of the country where an image is taken, exemplified by the two Street View images. Thus, to capture the many ways in which diversity can arise, we employ multiple distinct queries, in contrast to prior work. Namely, we infer:
• Class specific attributes, such as the possible states of an object (e.g., diced or sliced for pear). We also obtain descriptions of, and different kinds of, each class, as in DCLIP and CHiLS respectively.
• Class adjacent attributes, like co-occurring objects or backgrounds, to get useful context.
• Class agnostic attributes that describe how objects vary in general. For example, towards improving geographic fairness, we list potential choices for the income level, region, and country of origin of the image. We also introduce a novel two-step LLM query, where we first ask the LLM to list generic axes of diversity, and then have it populate those axes. We name this auto-global, as it automatically generates many global attributes. Appendix D.2 contains the exact LLM prompts and example inferred attributes for each query above.
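Once attributes are inferred along these axes, they must be turned into text the VLM can embed. A minimal sketch of that step, with a hypothetical template (the paper's exact prompt phrasings live in Appendix D.2; the function name and template here are ours):

```python
def build_subpopulation_prompts(classname, attributes):
    """Combine a class name with inferred attributes into subpopulation
    descriptions to embed alongside the plain classname.

    attributes: list of attribute strings from the LLM queries
        (states, kinds, contexts, class-agnostic attributes, ...).
    """
    # Keep the bare classname so typical instances remain well covered.
    prompts = [f"a photo of a {classname}"]
    # One subpopulation description per inferred attribute (hypothetical template).
    for attr in attributes:
        prompts.append(f"a photo of a {attr} {classname}")
    return prompts
```

Each returned string is embedded by the VLM's text encoder, yielding the multiple vectors per class that the consolidation step below operates on.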

Nonlinear Prediction Consolidation
Enumerating attributes along various axes of variation results in descriptions of many diverse subpopulations per class.
Since VLMs have open-vocabulary text encoders, we can directly embed these subpopulation descriptions, in addition to the class name. Given a test image, we compute similarities to each of these embeddings. We then must consolidate them to obtain a single score per class. Since there is only one vector per class (the classname-based embedding), the decision boundary is linear, as shown in the middle panel of Figure 4. The edge of the hypersphere is colored (orange for wolf, blue for fox) to indicate the predicted class for an image embedding at that location. Notably, the Arctic fox is misclassified as wolf, as its appearance more closely resembles a typical wolf than a typical fox, and so, the embeddings of Arctic fox images fall closer to the text embedding of "wolf" (and vice-versa for the red wolf). Methods like DCLIP and WaffleCLIP embed more than just the classname, but they consolidate similarities via averaging, again resulting in a linear decision boundary. Even if atypical subpopulations are included at first, averaging can narrow the initial diverse coverage, as most embeddings for a class may better describe a typical instance.
In contrast, we propose the following nonlinear consolidation: we compute the single score per class for a given test image as the average of the similarities of the image embedding to only the k closest subpopulation embeddings for the class, where k is typically small (we use k = 16). This way, an image can have a high class score even if it is only similar to a small subset of subpopulations, as is the case for atypical instances. Thus, the Arctic fox and red wolf can be correctly classified despite being far from the classname and most subpopulation embeddings for their respective classes, as shown on the right panel of Figure 4, where we use k = 1 for simplicity (i.e., images are mapped to the class of the closest dotted or solid line, leading to a non-linear boundary). We shed insight on the effect of varying the hyperparameter k in Section 6.2, revealing a tunable accuracy-fairness trade-off.
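The consolidation described above can be sketched as follows (a minimal NumPy sketch, assuming normalized embeddings; the function name is ours):

```python
import numpy as np

def topk_consolidate(image_emb, subpop_embs_per_class, k=16):
    """Nonlinear consolidation: score each class by the mean similarity
    of the image to its k closest subpopulation embeddings.

    image_emb: (d,) L2-normalized image embedding.
    subpop_embs_per_class: list of (n_subpops, d) L2-normalized arrays.
    Returns (predicted_class_index, per_class_scores).
    """
    scores = []
    for embs in subpop_embs_per_class:
        sims = embs @ image_emb                    # cosine similarities
        top = np.sort(sims)[-min(k, len(sims)):]   # k most similar subpopulations
        scores.append(top.mean())
    scores = np.asarray(scores)
    return int(np.argmax(scores)), scores
```

With k equal to the number of subpopulations, this reduces to plain averaging; with small k, an atypical image close to just one subpopulation can still win its true class.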

ANALYSIS
We now empirically validate our method's effectiveness and enhanced interpretability over two dataset suites. Our method performs on par with (and usually surpasses) existing approaches in overall accuracy. Notably, we see larger gains for the hardest classes and subpopulations, which are likely more diverse and atypical, respectively (precisely the samples on which our method is intended to improve performance). Furthermore, while matching or exceeding the performance of existing methods, we offer unique interpretability benefits, such as fine-grained and faithful explanations, as well as the potential for error anticipation; we detail these below. We also report worst region and worst income group accuracy. Our baselines include: standard zero-shot (only one vector per class, corresponding to the classname embedding), which we call Vanilla; DCLIP (averages over class descriptors); WaffleCLIP (averages over random descriptors sampled over ten trials); and CHiLS (reweights the standard zero-shot class score with the max probability over different kinds of the class). Notably, we average all text embeddings over the 80 prompts crafted for CLIP, so as to report the best possible baseline results.

Consistent Gains
Datasets. We curate a suite of eight attributed datasets (so as to have groundtruth subpopulations) spanning different axes of diversity. We use the four Breeds datasets [31] for their hierarchical label sets, as used in the CHiLS paper; in fact, the Breeds datasets were the ones where CHiLS was most effective. Next, we devise two classification tasks (coarse- and fine-grained) from the MIT States dataset [13] to track performance over labeled states (e.g., sliced or diced for pear). Importantly, we also include the datasets Dollarstreet [29] and GeoDE [26], which contain images from varied geographic regions and income levels. As the diversity in these datasets occurs naturally, they can encompass many axes of variation, as opposed to our other datasets that only vary along known axes, like object state or kind.
We note that we strived to minimally fit our method to the evaluation suite.That is, we do not optimize our query set to maximize performance on the datasets we selected, which can be challenging for zero-shot classification methods.
One specific measure we took toward this end was fixing our method completely before evaluating on the second dataset suite. Thus, the second dataset suite serves as a held-out challenge set, intended to test the generalizability of our method to settings where intra-class diversity may not be present. See Appendix D.1 for complete details on our dataset suite.

Table 3. Zero-shot classification performance on finer-grained held-out datasets without attributes, using CLIP with a ViT-B/16 encoder. We observe similar results for BLIP-2 (Table 10). We discuss reasons for the failure of CHiLS on ImageNet-scale tasks in Appendix F. Our method effectively generalizes to new settings without tuning the set of queries for attribute inference.

Results
Table 1 shows results for datasets with diversity along hierarchical and state axes, and Table 2 shows results for geographic diversity. Our method consistently matches (and often improves upon) the accuracy of existing methods, even over CHiLS in the hierarchical setting it was specifically designed for. Notably, CHiLS becomes less effective for other datasets, while our method remains strong. We observe larger gains for worst class and subpopulation metrics, especially over baselines that consolidate via averaging (Vanilla, DCLIP, Waffle), supporting the claim that our method improves coverage of the most atypical instances, and that moving beyond the one vector per class paradigm helps in this regard. For example, compared to baselines that consolidate via averaging to obtain one vector per class, our method improves accuracy for the worst classes and subpopulations by 2-3% in most cases. For Dollarstreet, these gains manifest in a 9% average relative gain over baselines for the accuracy over the worst income group metric (and an even larger gain for the worst 20% of subpopulations), showing that our methodology can facilitate progress on real-world fairness indicators.
Turning our attention to the held-out challenge datasets, Table 3 shows our method can generalize effectively to finer-grained classification tasks where intra-class diversity is not explicitly known to be present. Our method improves accuracy on the hardest classes by an average of 1.5% over the closest baseline. Similarly, our method exceeds all baselines by about 1% in overall accuracy in nearly all cases, suggesting that embracing diversity does not come at a cost of overall performance. Moreover, the effectiveness of our method in these new settings shows that the queries we select (for inferring attributes) generalize beyond our original dataset suite. That is, we do not need to tune the LLM queries for each new classification task of interest. Nonetheless, the ability to add and remove LLM queries can be seen as a strength, as a practitioner is provided more control than in standard zero-shot classification.

Fig. 5. Instances where our method corrects mistakes of the standard approach. The attributes used in inference also serve as faithful fine-grained explanations. Notably, these samples are atypical, suggesting that inspecting samples where our method and standard classification disagree can enable automatic surfacing of atypical cases, towards better understanding the task at hand.

Faithful Fine-grained Interpretations for Free

Having shown that our method is equally (and usually more) performant than existing approaches, we now discuss the enhanced interpretability of our method. Namely, each inference comes with a list of the k subpopulations specifically relevant to the test image for free. Figure 5 shows a few examples where our method corrects misclassifications from the standard approach (see Appendix A for more).
These interpretations are faithful, as they are exactly the subpopulations used to compute the class score. Also, since we include attributes along various axes of diversity, our interpretations are finer-grained than prior work: DCLIP yields the same set of general descriptors for any image predicted to a given class, and WaffleCLIP offers no interpretability at all. This interpretability can enable model debugging, as erroneous predictions can be traced back to attributes that either do not match the intended class (i.e., an LLM mistake) or cannot be recognized well (i.e., a VLM mistake). At a high level, while standard zero-shot classification is a complete black box, the LLM-inferred attributes of our method provide interpretable intermediate outputs, increasing the transparency of the system overall. Moreover, our inference strategy results in concise explanations, which have more utility than explanations that are too long for a human to digest [25].

Anticipating and Articulating Potential Failures

In addition to explaining each inference, the interpretable intermediate outputs of our pipeline also allow for error anticipation. Namely, by comparing the inferred attributes for each class, one can anticipate and describe similar subpopulations from different classes, which may correspond to inputs where the model is less effective. For example, for the Living-17 task in the Breeds datasets, the LLM lists gibbon as a kind of both the ape and monkey classes. While gibbons are apes, they are smaller than most apes, which makes them resemble monkeys. Indeed, standard zero-shot accuracy for gibbons is only 14%, whereas other apes are classified at an accuracy of 93.5%. In another case, rug is listed as a co-occurring object for the bed class, when rug itself is another class in the dataset. While anticipating potential failure modes is intuitive for humans, it is challenging to do so at scale. By incorporating an auxiliary model (LLM) with interpretable intermediate outputs (inferred attributes), practitioners can both more easily audit and verify the zero-shot classification pipeline, and better understand potential challenges with the task of interest. We hope that the greater transparency of our system (importantly, achieved without compromising on overall accuracy) can result in increased trust and more responsible use.

ABLATIONS
We now detail additional ablation studies to shed insight on the source of our method's improvements over existing art and how a practitioner can apply our method with greater control. First, we study how performance varies for both our method and baselines as the number of attributes grows, so as to demonstrate the value of our flexible consolidation strategy, specifically for inputs from the hardest classes. Then, we identify a principled trade-off between accuracy overall and on the worst classes using our method, controlled by the hyperparameter k.
Fig. 6. Accuracy, overall and for the worst classes, as new types of attributes are added. Performance for our consolidation scheme continuously improves, while it saturates or deteriorates for others. Figure 11 shows similar trends for accuracy on the worst 20% of classes and subpopulations.

Scaling with the Many Axes of Diversity
One source of gains for our method is that we infer attributes of many types, while prior works only include one. We argue that our flexible consolidation (of subpopulation similarities to a single class score) also provides improvements over naive averaging or the nonlinear consolidation of CHiLS. To test this, we sequentially add each type of attribute, and inspect performance using the three methods. Figure 6 shows our consolidation scales best as more attributes are added, with sizable gains for accuracy over the worst classes. In contrast, performance saturates with averaging, and actually deteriorates with CHiLS. The latter occurs since CHiLS assumes that subpopulations are mutually exclusive, as is the case in hierarchical label sets. When adding attributes along the many axes of diversity, resultant subpopulations overlap, making a zero-shot classification over all subpopulations (as done in CHiLS) unreliable. Averaging is also suboptimal, as the impact of each attribute diminishes as the number of attributes added increases: we see this in the left plot, as accuracy barely increases for the final three added attribute types. Also, samples that are close to only a few subpopulations but far from most (i.e., atypical instances) ultimately receive a low score when all scores are averaged. Thus, while averaging over subpopulations can improve accuracy (to an extent), it is less suited to improving performance on atypical instances than our method. We explore this further in the next section.

Tunable Trade-off between Accuracy Overall and On Worst Classes
Recall that our method computes the similarity of a given test image's embedding to the embeddings of numerous (on the order of hundreds) subpopulations per class, before averaging over only the top k similarities, where k is small. Note that when k = ∞, our consolidation reduces to simple averaging over all vectors per class. To shed insight on how our consolidation differs from averaging, we sweep k while keeping our attribute inference fixed.
Additionally, we explore linearly interpolating class scores between our consolidation (top-k) and full averaging via a second hyperparameter λ, so that λ = 0 recovers our method and λ = 1 is averaging. We jointly sweep λ from 0 to 1 and k from 1 to 128 to pinpoint the way in which our consolidation improves upon averaging.
Figure 7 shows overall accuracy vs. accuracy on the worst 5% of classes for both k and λ. The trend is identical for the two parameters: first, both accuracy metrics increase as we move away from full averaging, with much larger gains occurring for the worst classes. Then, overall accuracy begins to drop, while accuracy on the worst classes continues to improve. To understand this trade-off, consider an instance that has high similarity to one subpopulation embedding for a class and low similarity to all others. In the k = 1 case, this instance is given a high score for the class. This can benefit atypical instances of the class, as they may be visually dissimilar from most other instances (recall the Arctic fox). However, it can also introduce errors: the correct prediction for an instance mostly close to embeddings from its true class can be flipped by the presence of just one highly similar (perhaps unreliable) subpopulation embedding from a different class. Thus, lower choices of k may benefit more atypical instances, leading to improved accuracy on the worst classes (which are the most diverse; see 3.1), potentially at the cost of overall accuracy. With this insight, practitioners can choose how to tune our method based on their end goals. Also, since λ is continuous, it offers closer control of this trade-off: indeed, accuracy on the worst classes can be improved by a larger margin when varying λ, and varying k and λ together can lead to the best numbers for both metrics.
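The two hyperparameters can be sketched in a single scoring function, writing k for the top-k cutoff and λ (`lam`) for the interpolation weight; the symbol names are assumed from context, and the toy similarities are illustrative.

```python
import numpy as np

def class_score(sims, k=8, lam=0.0):
    """Score one class from its subpopulation similarities `sims` (1-D array).

    lam = 0 gives pure top-k consolidation; lam = 1 gives full averaging;
    intermediate values interpolate continuously between the two.
    """
    topk = np.sort(sims)[-min(k, len(sims)):].mean()
    return (1.0 - lam) * topk + lam * sims.mean()

s = np.array([0.9, 0.1, 0.1, 0.1])
class_score(s, k=1, lam=0.0)  # driven entirely by the single best subpopulation
class_score(s, k=1, lam=1.0)  # plain mean: the strong atypical match is diluted
```

Note that k = len(sims) and λ = 1 both collapse to plain averaging, which is why the two parameters trace out the same qualitative trade-off in Figure 7.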

LIMITATIONS
On utilizing auxiliary models. Our method adds an LLM to the zero-shot classification pipeline, which can increase computational cost and introduce a source of error. We note that the added compute for inferring attributes is incurred only once per task, so the asymptotic cost per inference differs only marginally from the standard approach (due to computing similarities to more vectors per class, which is a very fast operation). To inspect the reliability of LLM outputs, we manually verify 300 randomly selected LLM outputs. We find only 2.7% of responses are uninformative, and none of the 300 to be inaccurate. Moreover, our flexible consolidation scheme offers a kind of robustness to irrelevant LLM outputs: recall, only the similarities to a small number of subpopulations per class contribute to each logit. Thus, irrelevant subpopulations (i.e., those not appearing in the data) are effectively ignored and do not affect the logit. However, LLMs remain capable of providing inaccurate outputs, and we even detected one such instance (the gibbon example from section 5.2.2). We find the automatic detection of unreliable LLM outputs to be an interesting avenue for future work, both to improve accuracy and to gain insight into potential complexities in the given task.
VLMs are not always reliable. Our work assumes that VLMs are capable of recognizing subpopulations within a class when named. While this is often true, VLMs can still fail, especially for composite concepts. We aim to keep our subpopulations simple, ascribing only one attribute to each. Nonetheless, it is currently not possible to know a priori whether a VLM can recognize a subpopulation in a zero-shot manner. We hope further work on uncertainty estimation can enrich our method, by way of automatically flagging and removing subpopulations that the VLM cannot reliably detect.

CONCLUSION
To represent classes with diverse instances, which can come in many forms, one vector per class may not be enough.
Moreover, VLMs have impressive abilities that are restricted when we use only one vector per class. Thus, instead of ignoring intra-class diversity, we embrace it, by explicitly inferring and encoding as much of it as we can. We propose a simple nonlinear consolidation scheme that flexibly attends to subpopulations present in an image while ignoring those that are irrelevant. We find that our method consistently matches or improves over strong baselines, and careful ablations indicate that our method's gains come from improving performance on the hardest classes and subpopulations. Thus, embracing diversity can help reduce performance disparities, including on real-world fairness benchmarks, towards models that work well for all. Our approach allows powerful models to work together in a transparent way via intermediate interpretable outputs, facilitating inferences with explanations, as well as greater tools to understand and debug potential failures. We hope our work spurs further curiosity around how existing paradigms may limit the capabilities of our modern models, towards the development of new AI systems that overcome the fairness and transparency limitations of today.

A EXAMPLE INTERPRETABLE INFERENCES
We show additional examples of interpretable inferences in figure 8.

B CASES WHERE ATTRIBUTES HELP MOST
We now provide more examples of instances where the prevailing paradigm for zero-shot classification results in disparate performance, and consequently, where our method yields the largest improvements. The crux of the issue with the existing paradigm is that the classname embedding struggles to be close to embeddings for images from all subpopulations of the class, particularly when the class contains many visually diverse subpopulations. For example, a penguin looks very different from most birds, so embeddings of penguin images will be some distance away from embeddings of most birds. Similarly, the penguin images may not reside close to the text embedding of the caption 'a photo of a bird'. Indeed, we find standard zero-shot classification accuracy for King Penguin birds is only 46%, while accuracy for the bird class overall is 96%. Figure 9 shows this example along with other instances where standard zero-shot classification leads to biased performance. We highlight examples where our method leads to improvements. Notice that the subpopulations tend to be atypical.
How then does our method result in improvements? We leverage the fact that despite poor standard zero-shot accuracy on subpopulations that lie far from their classname embedding, VLMs are still capable of recognizing these atypical subpopulations. That is, penguin images may be far from the 'bird' text embedding, but they are actually quite close to the 'penguin' text embedding. In other words, standard zero-shot classification does not take advantage of the ability of VLMs to recognize objects at a deeper level than that of the classification task. Thus, by including the right attributes, we can enable accurate recognition of atypical subpopulations.

Table 4. Correlation between diversity and accuracy by class on ImageNet. We study three models: vision transformers trained with CLIP, DINO, or traditional label supervision. Diversity refers to the variance of image embeddings within a class, with embeddings obtained with the 'encoder' model.

                  Classifier
Encoder        CLIP    DINO    Sup.
CLIP          -0.28   -0.51   -0.43
DINO          -0.37   -0.54   -0.48
Sup.          -0.47   -0.72   -0.65
Figure 10 shows more examples of subpopulations where including the groundtruth attribute results in significant gains in average precision, indicating that including the attribute allows for recognition of atypical subpopulations.
In this figure, AP corresponds to the average precision score obtained when using the similarity of an image to (a) the classname embedding or (b) the embedding of the classname with the attribute (e.g. 'bird' vs. 'King Penguin bird') to detect images belonging to that subpopulation. Again, these subpopulations generally appear different from a typical instance of their class, making the classname embedding an imprecise probe for that subpopulation. However, evidently, when given the attribute, VLMs are still capable of recognizing the subpopulation.

C DETAILS ON CORRELATION BETWEEN DIVERSITY AND ACCURACY PER CLASS
We compute ImageNet accuracy per class using three models: CLIP ViT-B/16 via standard zero-shot classification, DINO ViT-S/16 with a linear classification head fit to ImageNet over fixed features [4], and a ViT-S/16 trained with traditional class-label supervision on ImageNet [33]. Notably, all of these models utilize a linear classification head; that is, they operate under the one-vector-per-class paradigm. To proxy diversity, we measure the variance of embeddings per class: for each class, we compute the average squared distance between the mean embedding and the embedding of each class instance. Note that our measure of diversity depends on the image encoder; we explore using each of the three aforementioned models. Table 4 shows the results. All correlations are strongly negative, indicating that across classifiers and various measures of diversity, classes with higher diversity are predicted with lower accuracy. This supports the intuitive hypothesis that consistently representing an entire class with one vector becomes challenging when the class contains diverse instances.
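The diversity statistic and its correlation with per-class accuracy can be sketched as follows; Pearson correlation is an assumed choice, as the exact estimator is not pinned down above.

```python
import numpy as np

def diversity_per_class(embeddings, labels):
    """Per-class diversity: mean squared distance from each instance's
    embedding to its class-mean embedding (variance of embeddings)."""
    return {
        c: ((embeddings[labels == c] - embeddings[labels == c].mean(axis=0)) ** 2)
           .sum(axis=1).mean()
        for c in np.unique(labels)
    }

def diversity_accuracy_corr(embeddings, labels, acc_per_class):
    """Correlate per-class diversity with per-class accuracy
    (Pearson correlation; an assumption for illustration)."""
    div = diversity_per_class(embeddings, labels)
    classes = sorted(div)
    d = np.array([div[c] for c in classes])
    a = np.array([acc_per_class[c] for c in classes])
    return np.corrcoef(d, a)[0, 1]
```

On synthetic data where a tight class is predicted accurately and a diffuse class poorly, the correlation comes out negative, mirroring the trend in Table 4.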

D ADDITIONAL EXPERIMENTAL DETAILS
Note that we will provide all code, so that further details are easily accessible.

D.1 Datasets
The four hierarchical datasets we utilize are subsets of ImageNet [7] curated by [31]. We also utilize the attributed dataset of MIT States [13], deriving two classification tasks from its annotations. Finally, we utilize the geographic fairness benchmarks of Dollarstreet [29] and GeoDE [26]. When reporting subpopulation accuracies, we use income level as the ground truth attribute for Dollarstreet. Note that for MIT States and Dollarstreet, we filter classnames: we compute the cosine similarity of CLIP embeddings for each pair of classnames, and for any pair exceeding a threshold, we remove one classname from consideration. We do this because MIT States was not originally intended to be a classification dataset, and we observed highly similar classnames in Dollarstreet (e.g. 'toilet' and 'bathroom/toilet'). We use thresholds of 0.8 and 0.9 to generate the coarse and fine-grained MIT States datasets respectively, and a threshold of 0.9 for Dollarstreet.
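The filtering step can be sketched as a greedy deduplication over precomputed text embeddings; the helper name, the toy embeddings, and the tie-breaking (keep the earlier classname) are illustrative assumptions.

```python
import numpy as np

def filter_classnames(names, text_embeds, thresh=0.9):
    """Greedily keep a classname only if its text embedding has cosine
    similarity <= thresh with every classname kept so far (a sketch of
    the dedup step; the exact tie-breaking may differ)."""
    e = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    kept = []
    for i in range(len(names)):
        if all(float(e[i] @ e[j]) <= thresh for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

# near-duplicate classnames collapse to a single entry
filter_classnames(
    ["toilet", "bathroom/toilet", "pear"],
    np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]),
)
```

With the toy embeddings above, 'bathroom/toilet' is within the 0.9 threshold of 'toilet' and is dropped, while 'pear' survives.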

D.2 Inferring Attributes
We now provide details on our exact LLM queries. First, for class-specific and class-adjacent queries, table 5 shows the precise prompt shown to the LLM along with example outputs, both for the class pear. For all queries, we append Only use up to three words per list item so that the LLM does not drone on. We sample from the LLM (Vicuna-13b-v1.5) with a temperature of 0.7, a repetition penalty of 1, and a maximum of 512 new tokens.
We now provide more information on class-agnostic queries.We use continents as regions, and the five most populous countries per continent as our list of countries.These can both be obtained via prompting an LLM or searching the internet.

D.3 Auto Global
We now show more details for the auto-global query, which we found quite impressive: it was consistently amongst the attribute types providing the largest accuracy gains across datasets. The first prompt to the LLM was: List 16 common general ways in which two instances of the same object may look different. For example, size, age, or cleanliness. Only use one word per list item.

The next prompt was:
For each of those items, list up to four different general adjectives related to the item.
Please use common words.
Then, finally, out of laziness, we included a third prompt of: Thanks. Please organize your output as a python dictionary. The resultant axes of variation and attributes per axis can be found in Table 6.

E ADDITIONAL RESULTS
In the main text, we presented results using CLIP. Results for BLIP-2 can be found in Tables 8 and 9; trends are consistent with the results for CLIP. For a global picture, we present results averaged over both VLMs and all datasets in table 7. Our method performs best over all metrics, again with the largest gains occurring over the worst classes and subpopulations.
We also show results for each dataset individually in table 11. We find it encouraging that our results are consistent across both VLMs and each of our eight datasets.
Further, for the analysis in Section 6.1, we show performance using the similar metrics of accuracy over the worst 20% of classes and subpopulations, as reported in most tables; see figure 11. Trends are the same as in the main text, though slightly less pronounced: our consolidation still yields the best performance, while the others either saturate or deteriorate.
Lastly, we also show additional plots for the analysis in Section 6.2. In the main text, we plotted accuracy overall vs. over the worst 5% of classes, as this most clearly conveys the trade-off.

Fig. 11. Accuracy for the worst 20% of classes and subpops, averaged over our dataset suite as we sequentially add new types of attributes using different consolidation schemes. See figure 6 in the main text for accuracy overall and over the worst 10% of classes, along with more discussion. As shown in the main text, our method scales the best as attributes are added sequentially.

Fig. 12. We replicate figure 7 using metrics that look at a larger portion of the worst classes. A similar trade-off emerges, though in a slightly less pronounced way. We note that this is expected, as increasing the number of classes considered likely also increases the number of less diverse classes included.

On BLIP-2, while CHiLS does not fail catastrophically, it still underperforms compared to our method, with accuracy about 1% lower.
One could likely fix this problem by changing the temperature of the softmax, but we opt to faithfully follow the original method. Indeed, a modified version of CHiLS without the softmax (which amounts to our method using only the Kinds query (see table 5) with k = 1) does not fail catastrophically on ImageNet, though the overall accuracy and worst-class accuracy for this 'fixed' CHiLS do not exceed our method's results.
While this change seems small, we believe it encapsulates a difference in philosophy between CHiLS and our method.
CHiLS is designed for datasets with clear hierarchy, where each input fits neatly into one of many mutually exclusive subpopulations. In contrast, we argue that diversity emerges in many ways, with overlapping subpopulations arising from attributes drawn along various axes. By taking a softmax, CHiLS requires that an input not only be similar to one subpopulation within a class, but also be dissimilar from the other subpopulations. In our method, instead of seeking to explicitly name all subpopulations in a mutually exclusive way, we enumerate many potential attributes and create a flexible consolidation that only requires an input to be close to a few subpopulations within its class to be classified correctly.

G WHEN CAN WE CRAM AN ENTIRE CLASS IN ONE VECTOR, AND WHEN CAN WE NOT?
Arguably, diversity within classes is unavoidable, as two instances can vary in numerous ways (discussed further in Section 4.1). How, then, have classifiers enjoyed success under the one-vector-per-class paradigm, despite its tension with intra-class diversity? First, we note these performance disparities are often obfuscated in metrics like overall accuracy; indeed, the supervised classifiers studied above each achieve impressive overall accuracies. Nonetheless, the tension can be somewhat resolved if (i) one learns embeddings that reduce the diversity present in input space, and/or (ii) the single vector learned per class contains features that are unique to the class and present across class instances, despite intra-class variance that persists in the embedding space. We expand on these below. We find that the bias of the zero-shot vector is on par with having only 3% of the training images in the fox class be Arctic foxes in the supervised setting, suggesting that the limitations of the one-vector-per-class paradigm may be exacerbated in the zero-shot setting.

H ONE FINAL TRADE-OFF
In section 6.2, we showed two hyperparameters that can trade overall accuracy for accuracy over the worst classes. We now present one more, along with a theoretical explanation. Throughout the paper, we consider 'averaging' to mean computing similarities to multiple vectors and then averaging those similarities; this is how DCLIP and WaffleCLIP average, and we will refer to it as Average Sims. However, averaging over prompts as done originally in CLIP consists of averaging the vectors first and then computing similarity to the single averaged vector; we call this Average Vecs. The difference is subtle: in the latter case, an additional normalization occurs when the cosine similarity is taken.
We now show theoretically that when all embeddings are normalized (i.e., for CLIP), Average Vecs simply rescales the class score yielded by Average Sims by a factor that measures how diffuse the vectors for the class are. Let x be an image embedding and {v_1, v_2, ..., v_m} be the subpopulation vectors for a given class. We assume all vectors are normalized.

Fig. 14. Averaging subpopulation vectors before computing similarity to an image embedding proves to be another way to trade overall accuracy for accuracy on the worst classes. That is, when we first compute similarity to each subpopulation and then average, we obtain higher overall accuracy but lower accuracy on the worst classes, compared to when we first average subpopulation vectors and then compute similarity to the average vector.
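The rescaling claim follows from a two-line derivation, writing x for the image embedding, v_1, ..., v_m for the class's unit-norm subpopulation vectors, and v̄ for their mean (symbol names assumed from context):

```latex
\bar{v} = \frac{1}{m}\sum_{i=1}^{m} v_i, \qquad
\mathrm{AvgSims}(x) = \frac{1}{m}\sum_{i=1}^{m} x^\top v_i = x^\top \bar{v}, \qquad
\mathrm{AvgVecs}(x) = x^\top \frac{\bar{v}}{\|\bar{v}\|} = \frac{\mathrm{AvgSims}(x)}{\|\bar{v}\|}.
```

Since each v_i is unit-norm, ‖v̄‖ is close to 1 when the subpopulation vectors are tightly clustered and shrinks as they spread apart, so Average Vecs inflates scores for classes with more diffuse subpopulation vectors, consistent with its improved worst-class accuracy in Fig. 14.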

Fig. 1 .
Fig. 1. (Left) Instances of a class can appear in many diverse ways, like the pears above. Using one vector (the classname embedding) to represent the whole class results in disparate performance, particularly for atypical instances. (Middle) To address this issue, we infer attributes and embed multiple vectors, reducing disparities and enhancing interpretability. (Right) Our method scales better than prior works as we include more attributes (Section 6.1), enabling us to account for the many ways in which diversity can arise.

Figure 4
Figure 4 illustrates the simple case of fox vs. wolf classification, where solid/dotted lines correspond to classname/subpopulation embeddings on the hypersphere (shown here in 2D). The left-most panel shows examples from the two classes near where their image embeddings would lie. Text embeddings for the subpopulations (dotted lines) are close to corresponding image embeddings, as VLMs are capable of recognizing even diverse subpopulations (see Section 3.2). Standard zero-shot inference maps a test-time image to the class of the nearest classname text embedding.

Fig. 4 .
Fig. 4. An Arctic fox can more closely resemble a typical wolf than a typical fox. Standard zero-shot classification using one vector per class (the classname embedding) is ill-suited for this case. We address this issue by nonlinearly consolidating similarities to multiple vectors per class that explicitly encode the diverse subpopulations within the class. See section 4.2 for a full explanation.

Fig. 7 .
Fig. 7. Right: As k decreases, accuracy overall and on the worst classes both increase at first. Then, overall accuracy begins to decrease while accuracy on the worst classes continues to improve. Thus, we can control this trade-off via k. Left: λ, the continuous analog of k, allows for greater control.

Fig. 9 .
Fig. 9. Example subpopulations where our method exhibits sizable accuracy gains compared to standard zero-shot classification (i.e., classname embedding only).

Fig. 10 .
Fig. 10. Example subpopulations where the classname embedding is imprecise, but including the attribute leads to large boosts in average precision. Notably, these subpopulations reflect instances atypical of the class.

Fig. 13 .
Fig. 13. Arctic fox bias is amplified in the zero-shot classifier compared to supervised linear probes.

Table 1.
Zero-shot classification on datasets with known variation types for CLIP with a ViT-B/16 encoder. States averages results over the two categorizations of MIT States data, while Hierarchical averages results over four Breeds datasets. We observe similar results for BLIP-2 (Table 8).

…Diverse Datasets, Particularly for the Hardest Classes and Subpopulations. 5.1.1 Baselines and Metrics. We measure the performance of zero-shot classifiers using the popular CLIP ViT-B/16 and BLIP-2 VLMs [15, 24]. To infer attributes, we utilize the open-source Vicuna-13b-v1.5 language model [5], which notably is already contained in the BLIP-2 model we use. We report accuracy overall as well as averaged over the worst 20% of classes and subpopulations. Note that we only use groundtruth attributes when computing metrics; our method exclusively uses attributes inferred via the queries listed in Section 4.1. We also compute the lowest subpopulation accuracy per class and average it, so as to obtain the metric denoted 'Avg Worst Subpop'. For the real-world shifts, we …

Table 2 .
Zero-shot classification performance on geographically diverse images from DollarStreet and GeoDE using CLIP with a ViT-B/16 encoder. We observe similar results for BLIP-2 (Table 9).

Table 5 .
Example LLM prompts and outputs for class-specific and class-adjacent queries.

Table 6 .
Attributes and axes of diversity inferred via the auto-global query.See D.3 for more information.

Table 7 .
Average performance over eight datasets and two VLMs.