Content-Based Search for Deep Generative Models

The growing proliferation of customized and pretrained generative models has made it infeasible for a user to be fully cognizant of every model in existence. To help users find relevant models, we introduce the task of content-based model search: given a query and a large set of generative models, finding the models that best match the query. As each generative model produces a distribution of images, we formulate the search task as an optimization problem to select the model with the highest probability of generating content similar to the query. We introduce a formulation to approximate this probability given queries from different modalities, e.g., image, sketch, and text. Furthermore, we propose a contrastive learning framework for model retrieval, which learns to adapt features for various query modalities. We demonstrate that our method outperforms several baselines on Generative Model Zoo, a new benchmark we create for the model retrieval task.


INTRODUCTION
We introduce the task of content-based model search, which aims to find the most relevant deep image generative models that satisfy a user's input query. For example, as shown in Figure 1, we enable a user to retrieve a model capable of synthesizing images that match a text query (e.g., Miniatures), an image query (e.g., a landscape photo), or a sketch query (e.g., a Mario sketch), or to retrieve models similar to a given model. Users can leverage retrieved models in various ways: they might further fine-tune a model with a unique concept, use it to generate an image with a more concrete prompt, or combine multiple retrieved models into one.
Content-based model search is becoming a vital task as the number and diversity of intriguing generative models continue to explode. These models have formed a new form of media content that requires efficient indexing and searching. While one might have expected large-scale text-to-image models [Ramesh et al. 2022; Rombach et al. 2022; Saharia et al. 2022] to have led to a shrinking universe of useful models, instead, this powerful class of content generators has touched off an accelerating proliferation of new and customized models. That is because powerful generative models are capable of being adapted to capture a wide range of personalized subjects [Gal et al. 2023; Kawar et al. 2023; Kumari et al. 2023; Ruiz et al. 2022]. Everyday users are routinely creating new generative models, and the community has collectively shared tens of thousands of custom models on community-driven platforms, such as Civitai [civ 2022] and HuggingFace [hug 2022], in the past year alone. Many popular models arise from the model creators' unique creative processes, involving careful selections of subjects, algorithms, hyperparameters, and often proprietary artistic training data. Furthermore, some model creators receive compensation when other users download or use their models.
Aside from personalized models, a variety of deep generative models are being created as backbones for computer graphics applications [Bermano et al. 2022; Tewari et al. 2020], and as works of art that explore a wide range of themes [Elgammal 2019; Hertzmann 2020]. Each model captures a small universe of curated subjects, which can range from the realistic rendering of faces and landscapes [Karras et al. 2021] to photos of historic pottery [Au 2019] to cartoon caricatures [Jang et al. 2021] to single-artist stylistic elements [Schultz 2020]. Various methods also enable creative modifications of existing models via human-in-the-loop interfaces [Bau et al. 2020; Gal et al. 2022b; Wang et al. 2022].
Given the accelerating pace at which generative models are being created and uploaded to the internet, the ability to conduct content-based model search will be instrumental in supporting the effective use of large model collections. Existing model-sharing platforms [civ 2022; hug 2022] primarily rely on matching human-created model names for model search. However, it is difficult to choose sufficiently detailed names to describe the unique, complex, and highly specific images produced by a generative model. Our work aims to define a new approach, searching for models directly based on their content rather than based on manual definitions alone.
Content-based model search is a challenging task: even the simplified question of whether a specific image can be produced by a single model can be computationally difficult. Unfortunately, many generative models do not offer an efficient or exact way to estimate density, nor do they natively support assessing cross-modal similarity (e.g., text and image).
To address the above challenges, we first curate a benchmark retrieval dataset, the Generative Model Zoo, consisting of (1) 259 publicly available generative models that vary in content as well as model architectures, including GANs (e.g., StyleGAN-family models [Karras et al. 2020b]), diffusion models (e.g., DreamBooth [Ruiz et al. 2022]), and auto-regressive models (e.g., VQGAN [Esser et al. 2021]), and (2) 1,000 customized text-to-image diffusion models, each one fine-tuned on a single instantiation of an object class or a single artistic image. As part of our benchmark, we define ground truth image, text, and sketch queries for evaluating model retrieval.
We present a general probabilistic formulation of the model search problem and propose a learning-based method given this formulation. We summarize our contributions as follows:
• We introduce a new task of content-based search over deep generative models. Given a piece of text, an image, a sketch, a generative model, or a combination of them as a query, we aim to return the most relevant generative models that can synthesize similar content.
• We formulate the task of content-based model retrieval as estimating the probability of generating an image that matches the query content. We propose a new contrastive learning method to estimate this probability given the model's image distribution statistics and the query. Our learning-based method outperforms several baseline algorithms.
• We curate a benchmark dataset, the Generative Model Zoo, which includes a diverse compilation of more than 250 community-created generative models, 1,000 single-image fine-tuned models, and a set of ground truth (query, model) pairs for evaluating the retrieval algorithm on different query modalities.

RELATED WORKS
Deep generative models. Generative models are open-sourced at a rate of thousands per month. They use different learning objectives [Goodfellow et al. 2014; Ho et al. 2020; Kingma and Welling 2014; Oord et al. 2016; Song et al. 2021b], training techniques [Karras et al. 2020a; Mokady et al. 2022; Rombach et al. 2022; Sauer et al. 2021], and network architectures [Brock et al. 2019; Esser et al. 2021; Karras et al. 2019; Razavi et al. 2019]. They are also trained on different datasets [Choi et al. 2020; Mokady et al. 2022; Schultz 2020; Yu et al. 2015] for different applications [Albahar et al. 2021; Chen et al. 2020; Ha and Schmidhuber 2018; Lewis et al. 2021; Patashnik et al. 2021; Zhang et al. 2021; Zhu et al. 2021]. This diversity leads to the question: among all the models, which one shall we use? Our goal is not to introduce a new model. Instead, we want to help researchers, students, and artists find existing models more easily.
Model editing and fine-tuning. As methods for editing and fine-tuning generative models become more accessible and efficient, they have also contributed to a proliferation of models. Several works edit a pre-trained generative model with simple user interfaces like sketching [Wang et al. 2021], warping [Wang et al. 2022], blending [Bau et al. 2020], and text prompts [Gal et al. 2022b]. Other works fine-tune pre-trained models to match a small collection of images [Karras et al. 2020a; Li et al. 2020; Mo et al. 2020; Nitzan et al. 2022; Noguchi and Harada 2019; Ojha et al. 2021; Wang et al. 2018, 2020; Zhao et al. 2020a,b].
Customization of large-scale diffusion models. Recently, large-scale text-to-image models [Ramesh et al. 2022; Rombach et al. 2022; Saharia et al. 2022] have shown exemplary performance in generating diverse styles and compositions given a text prompt, with versatile downstream image editing capabilities [Avrahami et al. 2022; Brooks et al. 2023; Hertz et al. 2022; Tumanyan et al. 2023; Zhang et al. 2023]. But these models still cannot synthesize personalized and artistic concepts that are hard to describe via text. To enable that, various works have proposed model customization techniques [Gal et al. 2022a; Han et al. 2023; Kawar et al. 2023; Kumari et al. 2023; Ruiz et al. 2022]. This has given rise to tens of thousands of fine-tuned models for different styles and concepts [civ 2022; hug 2022], thus making model search increasingly relevant. Recently, several works use retrieval-augmented generative models to improve the fidelity of rare entities [Blattmann et al. 2022; Casanova et al. 2021; Chen et al. 2023; Ma et al. 2023]. Different from these, our goal is simply to find a model relevant to user queries, rather than generating an exact image given the query.
Content-based retrieval. Building upon classical information retrieval [Baeza-Yates et al. 1999; Manning et al. 2010], content-based retrieval deals with queries over image, video, or other media [Datta et al. 2008; Gudivada and Raghavan 1995; Hu et al. 2011]. Content-based image retrieval methods use robust visual descriptors [Dalal and Triggs 2005; Lowe 2004; Oliva and Torralba 2001] to match objects within video or images [Arandjelović and Zisserman 2012; Sivic and Zisserman 2003]. Methods have been developed to compress visual features to scale retrieval [Gong et al. 2012; Jégou et al. 2010; Torralba et al. 2008; Weiss et al. 2008], and deep learning has enabled compact vector representations for retrieval [Babenko et al. 2014; Krizhevsky and Hinton 2011; Torralba et al. 2008; Zheng et al. 2017]. In addition to image queries, various works have proposed methods for sketch-based retrieval [Eitz et al. 2010; Lin et al. 2013; Liu et al. 2017; Radenovic et al. 2018; Ribeiro et al. 2020; Sangkloy et al. 2016; Yu et al. 2016]. There is a growing interest in joint visual-language embeddings [Faghri et al. 2017; Frome et al. 2013; Jia et al. 2021; Karpathy et al. 2014; Radford et al. 2021; Socher et al. 2014] that enable text queries for image content. We also adopt deep image representations for our setting, but unlike single-image retrieval, we index distributions of images that cannot be fully materialized. Concurrent to our work, HuggingGPT [Shen et al. 2023] also explores model retrieval but for user-defined tasks such as object detection.

METHODS
In this section, we develop a retrieval framework for deep generative models. When a user specifies an image, sketch, or text query, we would like to retrieve a model that best matches the query. We denote a collection of $N$ models by $\{c_1, c_2, \dots, c_N\}$ and the user query by $q$, and we assume a uniform prior over models (i.e., $c \sim \mathrm{unif}\{c_1, c_2, \dots, c_N\}$). Every model $c$ captures a distribution $p(x \mid c)$ over images $x$. While prior retrieval methods [Manning et al. 2010; Smeulders et al. 2000] search for single instances, we aim to construct a method for retrieving distributions.
To achieve this, we introduce a probabilistic formulation for generative model retrieval. Figure 2 shows an overview of our approach. Our formulation is general to different query modalities and various types of generative models, and can be extended to different algorithms. In Section 3.1, we derive our model retrieval formulation from a Maximum Likelihood Estimation (MLE) objective using pre-trained deep features. In Section 3.2, we further propose a contrastive learning method that adapts features to different query modalities and the model search task. We present our model retrieval algorithms for an image, a text, and a sketch query, respectively. In Sections 4 and 5, we discuss our new benchmark and user interface for model search.

Probabilistic Retrieval for Generative Models
Our goal is to quantify the posterior probability of each model $c$ given the user query $q$, and retrieve the model with the maximum $p(c \mid q)$. We define the probabilistic model retrieval objective as:
$$c^* = \arg\max_{c}\, p(c \mid q). \quad (1)$$
As we assume models are uniformly distributed, it is equivalent to finding the likelihood of the query under each model, $p(q \mid c)$.
There are two scenarios for inferring the query likelihood $p(q \mid c)$: (1) when the query $q$ shares the same modality with the model $c$ (e.g., searching models with image queries), we directly reduce the problem to estimating the model's density; (2) when the query $q$ has a different modality from the model $c$ (e.g., searching models with text queries), we use cross-modal similarity to estimate $p(q \mid c)$. We discuss both cases in the following.
Image-based model retrieval. Given an image query $q$, we directly estimate the likelihood of the query $p(q \mid c)$ from each model. In other words, we seek to find the model that is most likely to generate the query image. Since density estimation is intractable for many generative models (e.g., GANs [Goodfellow et al. 2014], VAEs [Kingma and Welling 2014]), we approximate each model as a Gaussian distribution over image features [Heusel et al. 2017]. After we sample an image from each model, denoted by $x$, we obtain its image feature $z = f(x)$, where $f$ is a frozen feature extractor. We now express $p(q \mid c)$ in terms of image features $z$, so Equation 1 becomes:
$$c^* = \arg\max_{c}\, \mathcal{N}(z_q \mid \mu_c, \Sigma_c), \quad (2)$$
where the query image feature is denoted by $z_q = f(q)$, and each model $c$ is approximated by $p(z \mid c) \sim \mathcal{N}(\mu_c, \Sigma_c)$. We refer to this method as Gaussian Density.
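As a concrete sketch of the Gaussian Density scoring, the following assumes each model's feature-space mean and covariance have already been estimated from its samples; all function and variable names here are illustrative, not taken from the paper's released code:

```python
import numpy as np

def gaussian_log_density(query_feat, mu, cov, eps=1e-4):
    """Log-density of N(mu, cov) at query_feat, with a small ridge on cov."""
    d = mu.shape[0]
    cov = cov + eps * np.eye(d)
    diff = query_feat - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # Mahalanobis term
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def rank_models(query_feat, model_stats):
    """model_stats: list of (name, mu, cov). Highest-likelihood model first."""
    scores = [(name, gaussian_log_density(query_feat, mu, cov))
              for name, mu, cov in model_stats]
    return [name for name, _ in sorted(scores, key=lambda t: -t[1])]
```

In practice the feature extractor would be a frozen network such as CLIP; here the features are plain vectors.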
Text-based model retrieval. Given a text query $q$ and a generative model $p(x \mid c)$ capturing a distribution of images $x$, we want to estimate the conditional probability $p(q \mid c)$.
Here we assume conditional independence between query $q$ and model $c$ given image $x$, so $p(q \mid x, c) = p(q \mid x)$, and we apply Bayes' rule to get the final expression:
$$p(q \mid c) = \int p(q \mid x)\, p(x \mid c)\, dx. \quad (3)$$
In Equation 3, a text query $q$ may correspond to multiple possible image matches $p(x \mid q)$, so the expression cannot be simplified the same way as image-based model retrieval. Instead, we can view Equation 3 as an integral of the mutual information between $q$ and $x$, which can be estimated with a contrastive representation [Oord et al. 2018]. Thus, we use CLIP similarity [Radford et al. 2021] to approximate
$$p(q \mid x) \propto \exp\!\big(h_{\mathrm{txt}}(q)^\top h_{\mathrm{im}}(x) / \tau\big), \quad (4)$$
where $h_{\mathrm{im}}(x)$ and $h_{\mathrm{txt}}(q)$ are the normalized CLIP image and text features; $\tau$ is a temperature term.
To approximate the integration, one can sample images $x$ from $p(x \mid c)$ and then take the average of the CLIP similarities between images and query. We refer to this method as Monte-Carlo. To further speed up computation, we can pre-compute the mean of CLIP image embeddings for each model, and at inference time, we directly evaluate similarities between the query embedding and the mean embedding, as follows:
$$p(q \mid c) \approx \exp\!\big(h_{\mathrm{txt}}(q)^\top \bar{h}_c / \tau\big), \quad \text{where } \bar{h}_c = \mathbb{E}_{x \sim p(x \mid c)}\big[h_{\mathrm{im}}(x)\big]. \quad (5)$$
We find that this approximation works well in practice, and we refer to this method as 1st Moment. Derivation details for Monte-Carlo and 1st Moment can be found in Appendix C.
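A minimal sketch of the 1st Moment scoring for a text query, assuming the query embedding and per-model mean image embeddings are computed externally (e.g., by CLIP encoders); names are illustrative:

```python
import numpy as np

def first_moment_scores(query_emb, model_means, tau=0.07):
    """query_emb: (d,) query feature; model_means: (n, d) per-model mean
    image features. Returns one score per model, exp(cosine sim / tau)."""
    q = query_emb / np.linalg.norm(query_emb)
    means = model_means / np.linalg.norm(model_means, axis=1, keepdims=True)
    return np.exp(means @ q / tau)
```

Because only one mean vector per model is stored, scoring a large collection reduces to a single matrix-vector product.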
1st Moment method for image queries. Likewise, we find that applying 1st Moment for image-based model retrieval yields performance close to Gaussian Density. Interestingly, for sketch queries, the 1st Moment method outperforms Gaussian Density. While CLIP shows strong cross-modal matching performance between sketch queries and image models, the domain gap makes density estimation a less favorable option for sketch queries compared to the 1st Moment method. We provide more discussion in Section 6.

Learning to Retrieve Models
We have outlined several ways to approximate $p(c \mid q)$ in Section 3.1 to perform content-based model retrieval. However, these approximations involve only pre-trained feature extractors and may not be optimal for model retrieval with certain query modalities. For example, frozen features that work well for images may struggle with sketches. To address this issue, we introduce a set of learnable parameters to fine-tune the approximations of $p(c \mid q)$ on our Generative Model Zoo dataset (Section 4).
Under the contrastive learning framework, we maximize $I(q; c)$, the mutual information between the query $q$ and the model $c$. Since we assume a uniform prior over the models, the mutual information is effectively our retrieval objective $\mathbb{E}[\log p(c \mid q)]$.
The training dataset $\mathcal{D}$ consists of sample statistics for $N$ models and a number of ground-truth query-model pairs. We optimize a matching function $s_\theta$ that yields a query-to-model similarity score, by minimizing the InfoNCE loss [Oord et al. 2018]:
$$\mathcal{L} = -\,\mathbb{E}_{(q,\,c) \sim \mathcal{D}}\left[\log \frac{s_\theta(q, c)}{\sum_{c'} s_\theta(q, c')}\right], \quad (6)$$
where $W_\theta$ is a matrix parameterized by $\theta$, $(\mu_c, \Sigma_c)$ are the feature-space sample mean and covariance of model $c$, and $h_q$ is as calculated in the previous section using a pre-trained feature extractor.
The contrastive learning objective maximizes the similarity score between query $q$ and the ground truth model and minimizes the similarity between the query and other models. We define the matching function based on two formulations described in Section 3.1: 1st Moment and Gaussian Density. For 1st Moment, we project the original feature using the matrix $W_\theta$. For Gaussian Density, $\mathcal{N}(W_\theta h_q \mid W_\theta \mu_c,\, W_\theta \Sigma_c W_\theta^\top)$ denotes the Gaussian PDF with parameters $\{W_\theta \mu_c,\, W_\theta \Sigma_c W_\theta^\top\}$, evaluated at the transformed query feature. We experiment with different $W_\theta$ parameterizations, including full, triangular, and diagonal matrices. For the diagonal case, we limit the range of $W_\theta$ such that $W_\theta = \mathrm{diag}(\mathrm{sigmoid}(\theta))$. In Section 6, we show that our method can retrieve models that share similar visual concepts with the query.
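The contrastive objective above can be sketched in NumPy as follows. This is an illustrative rendering, not the paper's training code: we assume the diagonal parameterization diag(sigmoid(theta)) and, as one plausible choice, apply it to both query and model-mean features before a temperature-scaled dot product.

```python
import numpy as np

def info_nce_loss(query_feats, model_means, theta, tau=0.07):
    """query_feats: (B, d) query features; model_means: (B, d), where row i
    is the ground-truth model mean for query i; theta: (d,) learnable."""
    W = 1.0 / (1.0 + np.exp(-theta))              # diag(sigmoid(theta))
    q = query_feats * W                           # re-weight features
    m = model_means * W
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    logits = q @ m.T / tau                        # (B, B); diagonal = positives
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy on diagonal
```

In an actual training loop this would be written in an autodiff framework so that theta receives gradients.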

THE GENERATIVE MODEL ZOO
We introduce a new benchmark, the Generative Model Zoo, consisting of a set of generative models capturing a variety of architectures and subject areas for evaluating retrieval performance.
Internet model zoo. It consists of a total of 259 publicly available generative models trained using different techniques, including GANs [Gal et al. 2022b; Karras et al. 2018, 2020b, 2021; Kumari et al. 2022; lucid layers 2022; Mokady et al. 2022; Pinkney 2020; Sauer et al. 2022; Wang et al. 2021, 2022], diffusion models [Dhariwal and Nichol 2021; Ho et al. 2020; Ruiz et al. 2022; Song et al. 2021a], the MLP-based generative model CIPS [Anokhin et al. 2021], and the autoregressive model VQGAN [Esser et al. 2021]. Out of the 259 models, 23 were created via fine-tuning by individual users, 133 were obtained from the GitHub repositories of academic papers, and the remaining models were collected from a public model-sharing website [hug 2022]. This includes models like DreamBooth [Ruiz et al. 2022] that are fine-tuned to generate a specific concept (e.g., a unique toy) with a fixed text prompt as input. Hence, to index these models, we generate samples conditioned on the unique text prompt used to fine-tune the model. A comprehensive list of the models and their respective sources is included in Appendix B.
We create a set of benchmark queries for each model in multiple modalities, i.e., text, image, and sketch. The text queries are human-written descriptions of the images in a model. Image queries are created by sampling images from models, and sketch queries are created using the method of Chan et al. [2022].
Synthetic model zoo. To further test our method at scale, we create a synthetic model zoo of 1,000 models. Each model is fine-tuned from the pretrained Stable Diffusion v1.4 checkpoint [Rombach et al. 2022] on a single image. The models are fine-tuned on instances from animal classes in ImageNet [Deng et al. 2009] or artistic images downloaded from Unsplash [Unsplash 2022]. We manually pick 50 ImageNet classes and 10 instance images from each class, thus training 500 models with instance images from the ImageNet dataset. For the artistic models, we select 500 images from Unsplash that match the keyword "art". To fine-tune each model, we randomly select between DreamBooth [Ruiz et al. 2022] with LoRA [Hu et al. 2021] and Custom Diffusion [Kumari et al. 2023]. In this case, we only create image and sketch queries, as describing a specific instantiation of a general category from ImageNet classes is difficult via text alone. Similarly, describing an artwork using text is also non-trivial.

USER INTERFACE
Our baseline method is fast enough for interactive use, and we create a web-based UI for the search algorithm.The user can search for models by entering text or uploading an image/sketch.The interface displays models that best match the query, where clicking on a model's thumbnail shows more image samples.The website utilizes a backend GPU server to enable real-time model search on any client device.Figure 3 shows a screenshot of our UI.

EXPERIMENTS
For our baseline retrieval methods, we benchmark their performance over text, image, and sketch modalities and discuss several algorithmic design choices.
Implementation details. We test our method with the commonly used pretrained feature extractor, CLIP [Radford et al. 2021]. We use the ℓ2-normalized CLIP features, as discussed in Section 3.1.
We pre-calculate each model's generated distribution mean and covariance in the CLIP feature space, following the pre-processing steps described in clean-fid [Parmar et al. 2022]. For ImageNet fine-tuned models, we use ChatGPT to generate class-specific prompts for sampling images. For artistic fine-tuned models, we use general prompts, e.g., "a painting in the style of <specific art style>". More implementation details are in Appendix B.
To evaluate model search, we use the top-k accuracy metric, i.e., the frequency of finding the unique ground truth model within the top-k retrieved models using the query corresponding to that model.
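This metric can be computed directly from a query-by-model score matrix; a short sketch with illustrative names:

```python
import numpy as np

def top_k_accuracy(score_matrix, gt_indices, k):
    """score_matrix: (num_queries, num_models) retrieval scores;
    gt_indices[i]: ground-truth model index for query i."""
    topk = np.argsort(-score_matrix, axis=1)[:, :k]  # k best models per query
    hits = [gt in row for gt, row in zip(gt_indices, topk)]
    return float(np.mean(hits))
```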
Table 1. Image- and sketch-based model retrieval. We compare retrieval performance with pre-trained and fine-tuned scoring functions utilizing the CLIP network backbone. We observe the best performance with CLIP and fine-tuning. The (FT) suffix denotes the learning-based method (Equation 6) using the optimal parameterization selected via cross-validation on the Synthetic Model Zoo, as shown in Table 2.

Image- and sketch-based retrieval. We show the performance of different formulations with CLIP features. We train the transformation matrix $W_\theta$ in the learning-to-retrieve method using the Synthetic Model Zoo. The retrieval performance is best with CLIP features combined with our learning method, especially for sketch-based retrieval. The learning method is denoted with the suffix (FT) in Table 1. For comparison, we also include a modified k-NN algorithm in which we find the query's feature-space k-nearest neighbors among each model's image samples and sort the models by the mean distance to those k samples. We refer to this baseline as "best-k neighbors" or simply "best-k". Figure 7 shows qualitative examples of model retrieval for different image and sketch queries.
Learning to retrieve models. We conduct 5-fold cross-validation on the Synthetic Model Zoo to determine the best parameterization of the transformation matrix $W_\theta$ for the learnable matching function (Equation 6). Table 2 shows the test-set top-1 accuracy of each method compared to the baseline, which uses the pre-trained feature extractor. We find that our contrastive learning method consistently improves the pretrained features for various query modalities and formulations. We select the best parameterization via cross-validation for training the matching function from scratch on the full Synthetic Model Zoo and then test on the Internet Model Zoo. The learned matching function generalizes to the Internet Model Zoo, as shown in Table 1. We also train our method on the Internet Model Zoo; please refer to Appendix A for a quantitative analysis.
Table 2. We fine-tune different versions of the similarity function and transformation matrix parameterization with 5-fold cross-validation on the Synthetic Model Zoo. We use it to determine the optimal parameterization of $W_\theta$ (marked with a "★") for each method and modality.
Text-based model retrieval. We show the results of text-based retrieval on the Internet Model Zoo, i.e., the collected publicly available generative models, which we manually labeled with corresponding text descriptions. We use CLIP [Radford et al. 2021] as the pre-trained feature extractor since CLIP has both text and image encoders. Since our learning-based method requires a training and test split, we perform 5-fold cross-validation and report the mean top-1 accuracy for all methods, including baselines, for consistency. Table 3 shows the retrieval performance of both methods. As with image- and sketch-based retrieval, we also include results for the best-k neighbors baseline. The proposed baseline method achieves an accuracy of 77%, while the best learning-based method outperforms it with 81% accuracy.
Figure 7 shows qualitative examples of the top three and lowest-ranked retrievals given a text query with the 1st Moment method. Both quantitative numbers and visual inspection of results show that our method retrieves relevant generative models. We also analyze the retrieval score of all models corresponding to a given query. For object categories, such as "dogs" or "buses", we observe a clear drop in retrieval score for irrelevant models. For broader queries, such as "indoors", "modern art", and "painting", the drop-off is gradual. We show a detailed analysis in Appendix A.
Comparison with metadata-based search.We compare against an alternative where we index models using user-defined descriptions and search for models by text-matching the query and the model descriptions.To test this, we use the text query associated with each model as its description (e.g., "portraits with Botero's style" for one of the StyleGAN-NADA models).
We test metadata-based search on image- and sketch-based retrieval, with two ways of description-matching. (1) Description (text): we caption image/sketch queries using BLIP [Li et al. 2022] and retrieve models whose descriptions contain any nouns, verbs, or adjectives (i.e., any non-filler words) that appear in the caption.
(2) Description (CLIP): we embed model descriptions into CLIP features and compare the similarity between the image/sketch query feature and the description feature.
As shown in Table 4, the content-based search methods outperform the baselines while also obviating the need for model descriptions. Metadata-based search typically fails when captions are incorrect or when the object-centric tags cannot describe the model fully. Creating comprehensive descriptions might reduce the gap, but it is time-consuming to describe every visual aspect and anticipate users' queries in advance. Our evaluation does not include text-based queries, as we use the model descriptions as the ground truth. However, we expect the baselines to have the same limitation for text queries when a query requests a visual concept that the tags fail to enumerate.
Running time and memory. Time and memory efficiency are crucial for supporting many concurrent user searches over a large-scale model collection. While the 1st Moment method obtains competitive top-k accuracies, it runs 3.2-7.5× faster than Monte-Carlo or Gaussian Density. The fine-tuned 1st Moment method further improves accuracy but runs at speeds more on par with Monte-Carlo or Gaussian Density. The 1st Moment method is also extremely memory-efficient. Further analysis is in Appendix A.

EXTENSIONS
We describe two extensions: searching with multimodal queries and searching with a given model.
Multimodal user query. We show qualitatively that our search method can be extended to multimodal queries based on the Product-of-Experts formulation [Hinton 2002; Huang et al. 2022]. Given a multimodal query (e.g., a text-image pair), the retrieval score is the product of the individual query likelihoods. We demonstrate how one can leverage multiple input modalities to perform nuanced searches in Figure 4 using the 1st Moment method with pre-trained features. We show additional multimodal retrieval results in Appendix A.
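The Product-of-Experts combination amounts to multiplying per-modality scores, or equivalently summing their logs; a minimal sketch over precomputed per-modality score arrays (names illustrative):

```python
import numpy as np

def product_of_experts(score_lists):
    """score_lists: list of (num_models,) positive per-modality score arrays.
    Returns the combined log-score per model (sum of logs = log of product)."""
    return np.sum([np.log(s) for s in score_lists], axis=0)
```

Working in log space avoids underflow when many modalities (or very small likelihoods) are combined.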
Generative model query. Given the generative model collection, we can also retrieve similar models based on the cosine similarity between the feature-space means of query models and other generative models. Figure 5 shows qualitative examples of similar-model retrieval using the CLIP feature space.
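A sketch of this similar-model retrieval, given precomputed feature-space means (all names illustrative):

```python
import numpy as np

def similar_models(query_mean, model_means):
    """Rank candidate models by cosine similarity of feature-space means.
    query_mean: (d,); model_means: (n, d). Returns indices, most similar first."""
    q = query_mean / np.linalg.norm(query_mean)
    m = model_means / np.linalg.norm(model_means, axis=1, keepdims=True)
    return np.argsort(-(m @ q))
```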

DISCUSSION AND LIMITATIONS
We have introduced the new task of content-based generative model search. We have developed a dataset for benchmarking model retrieval algorithms, and we have described several baselines and a contrastive learning method for further improving features.

Fig. 4. Multi-modal user query. Image and text queries combined can define nuanced concepts. When querying with the phrase "pointy eyes" plus a cat photo, we can retrieve "Triangle Eyes Cat" and "Alien Eyes Cat". Photo by cocoparisienne on Pixabay.
Limitations. While we can retrieve models in real-time based on text, image, or sketch queries, our method has several limitations. First, our method currently succeeds at retrieving models that capture a single concept (e.g., StyleGAN3 [Karras et al. 2021] and DreamBooth [Ruiz et al. 2022]); there remain many future directions, including searching for personalized models that contain multiple subjects [Kumari et al. 2023], 3D neural objects [Mildenhall et al. 2021; Poole et al. 2022], and other media such as text, audio, and videos.
Second, our method can sometimes fail to match the user intent. A typical failure case is illustrated in Figure 6, where a query for a sketch of a left-facing horse retrieves general horse models instead of the most similar GANSketch "Left-facing Horse" model; it fails to respect the spatial orientation of the object in the query image. We show further analysis of this in Appendix A. Nevertheless, our experiments have shown that it is realistic, possible, and useful to search an indexed collection of models by matching them against the output behavior of the models. As the number of customized and pre-trained models continues to balloon, we anticipate that search algorithms that can effectively locate relevant customized models will be an increasingly valuable resource for researchers and content creators.
Fig. 9. Similarity score drop-off. Top: 1st Moment similarity scores of more specific queries like "a bird that talks", "edvard munch", or "buses" show a clear drop between relevant and irrelevant models. In the first case, the parrot model was ranked top, closely followed by the bird model. In "edvard munch", the StyleGAN-NADA Edvard Munch-style face model was ranked top; subsequent ranks consist of other photorealistic and artistic face models. In "buses", the ProGAN bus model was ranked top; subsequent ranks consisted of models that generate trains, faces, and bridges. Bottom: for more general queries like "animals" or "animated faces", or a query like "coffee cup" that does not have any good matches, the similarity scores appear more uniform. In "animals", the 10 retrieved models generate animals, including sheep, dogs, cows, horses, cats, and giraffes. In "cartoon faces", we find 8 painting/sketch/cartoon-style face models and 2 photorealistic face models. In "coffee cup", since our collection does not include any highly relevant models, we find a broad variety of models. The retrieved models consist of a "bottle" model, a modern art painting model, a cat model, a "trypophobia" model, two modern art painting models, two anime/cartoon character models, and two face models.
Table 8. Model search with fewer-sample statistics. We run model search while using fewer samples to compute the feature-space image statistics for each model at test-time. "ALL" refers to using all 50K or 2,400 available samples depending on the model, which is the default (see Section B). In other tests we randomly sample a subset of the given size.
For one million models, sorting takes 0.12 ms and 0.062 GB of VRAM on a GPU. When we use the 1st Moment method, we enable a user to retrieve models from a 1-million-model collection with a text, sketch, or image query in under 10 ms.
Model Search with Fewer-sample Model Statistics. To study how well model search works when model statistics are gathered from fewer generated images, we use only a portion of the generated images at test time to compute the feature-space mean and covariance statistics, keeping all else the same. We show the results in Table 8. Overall, while a greater sample size helps, we obtain adequate performance with as few as 100 samples.
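The fewer-sample setting can be mimicked directly: estimate the feature-space mean and covariance from a random subset of a model's generated-image features, adding a small multiple of the identity for numerical stability (the 1e-4 factor follows the appendix; the random 64-dimensional features below stand in for real image embeddings). The Gaussian log-density scorer that consumes these statistics is also sketched:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in features for one model's generated images (e.g., CLIP embeddings).
all_feats = rng.standard_normal((5_000, 64))

def model_statistics(feats, n_samples=None, eps=1e-4, seed=0):
    """Mean and covariance of a model's image features, optionally estimated
    from a random subset; eps * I keeps the covariance well-conditioned."""
    if n_samples is not None and n_samples < len(feats):
        idx = np.random.default_rng(seed).choice(len(feats), n_samples,
                                                 replace=False)
        feats = feats[idx]
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + eps * np.eye(feats.shape[1])
    return mu, cov

def gaussian_log_density(query, mu, cov):
    """Log-density of a query feature under N(mu, cov); the Gaussian Density
    method retrieves the model maximizing this score."""
    d = query - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(cov, d))

mu_all, cov_all = model_statistics(all_feats)            # "ALL" setting
mu_100, cov_100 = model_statistics(all_feats, n_samples=100)
```

With only 100 samples the estimates are noisier, but the regularized covariance remains invertible, which is consistent with the adequate 100-sample performance reported above.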
Evaluation over multi-modal queries. To test retrieval with multimodal (image + text) queries, we select a subset of 36 models from the Internet Model Zoo whose descriptions can be split into a text component and a generic image (e.g., "purple fur" + an image of a regular horse, for a purple-horse model). Our CLIP + 1st Moment method achieves 0.47 top-1 and 0.86 top-5 accuracy. The qualitative results are presented in Figure 10.
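Multi-modal querying can be sketched as fusing the two query embeddings before scoring. The weighted average below is an illustrative fusion rule, assuming a CLIP-style joint space in which text and image features are comparable; it is not necessarily the paper's exact formulation:

```python
import numpy as np

def fuse_query(text_feat, image_feat, w_text=0.5):
    """Combine text and image query features from a shared embedding space
    via a weighted average of unit-normalized vectors (illustrative rule)."""
    t = text_feat / np.linalg.norm(text_feat)
    v = image_feat / np.linalg.norm(image_feat)
    q = w_text * t + (1 - w_text) * v
    return q / np.linalg.norm(q)

rng = np.random.default_rng(2)
text_feat = rng.standard_normal(512)     # stand-in for a CLIP text feature
image_feat = rng.standard_normal(512)    # stand-in for a CLIP image feature
query = fuse_query(text_feat, image_feat)
```

The fused vector can then be scored against the cached model statistics exactly as a single-modality query would be; setting w_text to 0 or 1 recovers image-only or text-only search.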
Refined search using spatial features. As shown in Figure 6 in the main paper, our method sometimes fails when there is a directional dependency, e.g., left-facing vs. right-facing horse models. To address this, we experiment with spatial features. Given 50 image queries of left- and right-facing horses (identical images, flipped), we take 7×7 spatial features from the last layer of ConvNeXt [Liu et al. 2022] pre-trained on ImageNet-22K and check whether each query retrieves the left- or the right-facing-horse model. Using spatial features with the 1st Moment method, the horse model facing the same direction as the query is favored 100% of the time. In contrast, the 1st Moment (pre-trained), 1st Moment (fine-tuned), Gaussian Density (pre-trained), and Gaussian Density (fine-tuned) methods favor the correct model 52%, 52%, 49%, and 54% of the time, respectively.

Fig. 10. Evaluation over multi-modal queries. We show qualitative results of retrieval using text only, image only, and image + text as the query. We search over a subset of the Internet Model Zoo; specifically, we selected 36 models whose descriptions can be split into text and image. We can utilize direction from both text and image to retrieve models more consistently. Photos by Himanshu Choudhary, Alexander Andrews, and Qijin Xu on Unsplash. Rendering by @ForgottenWorld on Blender Artists (bottom right).

Model statistics calculation. For the Internet Model Zoo, we use 50K generated images to pre-calculate the sample mean and covariance, except for the 108 DreamBooth fine-tuned Stable Diffusion models, where we use 2,400 generated images. For the Synthetic Model Zoo, we sample 2,400 images to calculate the model statistics, as it consists of 1,000 customized diffusion models, from which sampling is computationally expensive. We always add a small factor (1e-4) of the identity matrix to the covariance estimate to improve numerical stability. For ImageNet fine-tuned models, consumed by training redundant models when a good model for a problem already exists.
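The intuition behind the spatial-feature fix can be seen with a toy example: global average pooling is invariant to a left-right flip of a feature map, so it cannot distinguish orientation, whereas the flattened 7×7 spatial features can. (Flipping a real photo does not exactly flip a CNN's feature map, but the pooling argument is the same; the random feature map below stands in for ConvNeXt activations.)

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 7x7 feature map with 16 channels, standing in for ConvNeXt features.
feat = rng.standard_normal((16, 7, 7))
feat_flip = feat[:, :, ::-1]           # horizontally mirrored feature map

# Global pooling collapses the spatial axes: it is flip-invariant.
global_orig = feat.mean(axis=(1, 2))
global_flip = feat_flip.mean(axis=(1, 2))

# Flattened spatial features preserve the layout: they are flip-sensitive.
spatial_orig = feat.reshape(-1)
spatial_flip = feat_flip.reshape(-1)
```

This is why the spatial-feature variant separates the left- and right-facing horse models perfectly while the globally pooled variants hover near chance.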

Fig. 2. Overview. Given a collection of models c ∼ unif{c_1, c_2, . . ., c_N}, (a) we first generate samples for each model c, and (b) encode the samples into features and compute the 1st- and 2nd-order feature statistics for each model. The statistics are cached for efficiency. (c) We then learn a matching function using a contrastive loss that takes as input the query feature and the model statistics and returns the similarity score between the model and the query. The models with the best similarity scores are retrieved. We support queries of different modalities (text, image, or sketch). Photo by @cedric_photography (right).
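Step (c) can be sketched as a batch contrastive objective: each query should score highest against its own model's statistics, with the other models in the batch acting as negatives. The InfoNCE-style loss below is a minimal illustration; the batch size, dimension, and temperature are arbitrary choices, and a real matching function would also include learned parameters that adapt the query features:

```python
import numpy as np

def info_nce(query_feats, model_means, temperature=0.07):
    """Contrastive (InfoNCE) loss: the i-th query should match the i-th
    model's mean feature; other models in the batch are negatives."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    m = model_means / np.linalg.norm(model_means, axis=1, keepdims=True)
    logits = (q @ m.T) / temperature                 # (batch, batch) scores
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # -log p(correct model)

rng = np.random.default_rng(4)
batch, dim = 8, 32
queries = rng.standard_normal((batch, dim))
means = queries + 0.1 * rng.standard_normal((batch, dim))  # matched pairs
loss_matched = info_nce(queries, means)
loss_mismatched = info_nce(queries, -queries)              # worst-case pairs
```

Minimizing this loss pulls each query feature toward its paired model statistics and pushes it away from the other models', which is what allows a single similarity function to serve text, image, and sketch queries.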

Fig. 3. User interface. In our model search platform, a user can submit a text query, an image query, or a combination of both to retrieve the generative models that best match the query. Here we show the top 6 retrievals for the text query "painting". The user can further explore additional samples from the models or search for models similar to a particular model.

Fig. 5. Finding similar models. Our method can find models that share similar characteristics with the query model.

Fig. 9. Similarity score drop-off. Top: 1st Moment similarity scores of more specific queries like "a bird that talks", "edvard munch", or "buses" show a clear drop between relevant and irrelevant models. In the first case, the parrot model was ranked top, closely followed by the bird model. For "edvard munch", the StyleGAN-NADA Edvard Munch-style face model was ranked top; subsequent ranks consist of other photorealistic and artistic face models. For "buses", the ProGAN bus model was ranked top; subsequent ranks consist of models that generate trains, faces, and bridges. Bottom: for more general queries like "animals" or "animated faces", or a query like "coffee cup" that does not have any good matches, the similarity scores appear more uniform. For "animals", the 10 retrieved models generate animals, including sheep, dogs, cows, horses, cats, and giraffes. For "cartoon faces", we find 8 painting/sketch/cartoon-style face models and 2 photorealistic face models. For "coffee cup", since our collection does not include any highly relevant models, we find a broad variety of models: a "bottle" model, a modern art painting model, a cat model, a "trypophobia" model, two modern art painting models, two anime/cartoon character models, and two face models.

Figure 11 shows a qualitative comparison between using spatial features vs. global CLIP features when retrieving left- and right-facing horse models with our Gaussian Density (fine-tuned) method. Model search with global CLIP features retrieves the right-facing horse model even though the query depicts a left-facing horse (Figure 11, last column), whereas we retrieve the correct model with spatial features.

B DATASET DETAILS

Distribution of Models within Generative Model Zoo. The 259 models in the Internet Model Zoo consist of 1 ADM model [Dhariwal and Nichol 2021], 3 CIPS [Anokhin et al. 2021], 2 DDIM [Song et al. 2021a], 2 DDPM [Ho et al. 2020], 1 FastGAN [Liu et al. 2021], 18 GANWarping [Wang et al. 2022], 14 GANSketching [Wang et al. 2021], 31 ProGAN [Karras et al. 2018], 7 Self-Distilled GAN [Mokady

Table 2. Selecting the best parameterization for the transformation matrix with cross-validation.

Table 3. Learning to retrieve with text queries. We show the results of the learning-based method with text queries using 5-fold cross-validation on the Internet Model Zoo itself, since we have ground-truth text queries only for that dataset, as discussed in Section 4. We optimize different parameterizations of the transformation matrix on the training split and report retrieval performance on the testing split of 51 models.

Table 4. Comparison with metadata-based search. We compare our Gaussian Density content-based search method with two metadata-based search methods: (1) Description (text): we caption image/sketch queries using BLIP