Content Based Deep Learning Image Retrieval: A Survey

With the development of digital technology, many fields generate and share large amounts of visual content. Image retrieval is an active research direction in computer vision, and its ultimate goal is to retrieve query content efficiently and accurately from massive data. In recent years, the rise of deep learning has driven rapid progress in computer vision. Because deep features express image content powerfully, image retrieval based on deep learning has become the leading research direction in content-based image retrieval (CBIR). This paper summarizes recent research on classic deep learning image retrieval techniques. It first introduces the form of the CBIR problem and then lists the classic datasets in this field. Content-based deep image retrieval methods are then reviewed from the perspectives of network models, deep feature extraction, and retrieval types. Finally, the paper summarizes the problems that most urgently need to be solved in current research and looks ahead to future research directions.


INTRODUCTION

Background
With the rapid growth of digital content in the modern Internet, retrieving image content in the wide-area Internet has become

Image Retrieval Task
The purpose of image retrieval is to search an image database for images that are similar or homologous to an input image. Image retrieval can be divided into text-based image retrieval (TBIR) and content-based image retrieval (CBIR) according to how the image content is described [65].
TBIR describes image content through manual annotation or semi-automatic annotation based on image recognition, producing keywords for each image. In the retrieval phase, the user retrieves annotated images from the image library through keywords. This method is easy to implement. Because the annotations come from humans or from image recognition technology, its accuracy is relatively high, and it is well suited to small and medium-scale image search problems.
However, the manual annotation required by TBIR is time-consuming and labor-intensive, and the process is easily affected by factors such as the annotator's knowledge level, language use, and subjective judgment, which leads to mismatches between text descriptions and image content. To bridge the semantic gap between the high-level semantics and the low-level visual features of retrieved images, both academia and industry have worked to develop CBIR. With the continuous improvement of deep learning theory, CBIR has made great progress. In large-scale image retrieval, the CBIR task is to search a large image collection for the content most relevant to a given query, which mainly involves two stages: feature extraction and similarity measurement. Compared with TBIR, which uses unstructured data (text) for annotation, the use of deep features enables CBIR to overcome the shortcomings of TBIR and improve retrieval efficiency.

… various algorithms. Commonly used datasets are listed in Table 1.

Table 1: Commonly used datasets.

Dataset            Classes    Images       Retrieval type
…                  …          …            instance-level
MS-COCO [32]       80         1.2 × 10^5   multimodal retrieval
Flickr30k [41]     -          3 × 10^4     multimodal retrieval
GLD v2 [61]        2 × 10^5   5 × 10^6     instance-level
XMarket [6]        5471       1.8 × 10^5   category-level
CUB200-2011 [55]   200        1.2 × 10^4   category-level
Aircraft [34]      102        1 × 10^4     fine-grained
Paris-6k [40]      12         6,000        instance-level
Oxford5k [39]      11         5,000        instance-level
UKBench [37]       2550       1 × 10^4     instance-level
Holidays [20]      500        1,500        instance-level
Sketchy [46]       125        8.8 × 10^4   sketch retrieval
Fashion-IQ [18]    3          7.8 × 10^4   interactive retrieval
The Google Landmarks Dataset v2 (GLDv2) [61] contains more than 5 × 10^6 images and 2 × 10^5 distinct instance labels, with more than 4 × 10^6 images in the training set, 7 × 10^5 images in the reference set, and 1 × 10^5 images in the test set. GLDv2 is the largest landmark dataset, containing annotated images of man-made and natural landmarks. NUS-WIDE [10] is a multi-label dataset for image-text matching that contains 2.7 × 10^5 images, each with 2 to 5 labels on average. The MS-COCO [32] dataset contains 1.2 × 10^5 images, each with at least 5 sentence annotations. Flickr30k [41] contains more than 30,000 images, each with 5 sentence annotations. Oxford-5k [39] consists of more than 5,000 images of 11 Oxford buildings. Sketchy [46] contains sketch-image pairs from 125 categories, with 100 images per category.

Evaluation Methods
Choosing an appropriate evaluation metric for an image retrieval task depends on two factors: the algorithm itself and the problem domain. At present, the commonly used evaluation metrics for CBIR include recall, precision, and F-score. Recall (R) is the percentage of images correctly retrieved by the retrieval system relative to the total number of relevant images in the dataset, as shown in Equation 1:

$$R = \frac{T}{T + M} \tag{1}$$

where T is the number of correctly retrieved samples and M is the number of samples in the dataset that are relevant to the query image but were not returned. Precision (P) is the percentage of images correctly retrieved by the retrieval system relative to the total number of retrieved images, as shown in Equation 2:

$$P = \frac{T}{T + F} \tag{2}$$

where F is the number of retrieved samples that are not relevant to the query sample. In general, R and P trade off against each other, and their relative importance is decided by the requirements of the retrieval task in the specific field. The F-score is the weighted harmonic mean of recall and precision, as shown in Equation 3:

$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R} \tag{3}$$

where β is a parameter that adjusts the relative weight of recall and precision. If higher precision is required, β is reduced; if higher recall is required, β is increased. When β = 1, R and P are equally important, which gives the F1-score. The higher the F1 value, the better the retrieval performance of the system. In addition to F1, mAP (mean Average Precision) is another important indicator of the overall performance of a retrieval system.
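The metrics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the recall, precision, and F-score functions follow the definitions of T, M, and F given in the text, and the average precision function assumes the commonly used definition (precision averaged over the ranks at which relevant images appear), which the text does not spell out.

```python
def recall(t, m):
    """R = T / (T + M): share of relevant images that were retrieved."""
    return t / (t + m) if (t + m) else 0.0


def precision(t, f):
    """P = T / (T + F): share of retrieved images that are relevant."""
    return t / (t + f) if (t + f) else 0.0


def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


def average_precision(ranked_relevance, num_relevant):
    """AP for one query (assumed standard definition).

    ranked_relevance[i] is True when the i-th returned image is relevant;
    num_relevant is the number of relevant images in the whole dataset.
    """
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant if num_relevant else 0.0


def mean_average_precision(queries):
    """mAP: mean of AP over (ranked_relevance, num_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in queries) / len(queries)


# Toy numbers: 8 correct results, 2 relevant images missed, 4 false returns.
p, r = precision(8, 4), recall(8, 2)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))  # 0.667 0.8 0.727
```

Choosing beta below 1 (for example 0.5) makes the score more sensitive to precision, matching the description of β in the text.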

DEEP CBIR
Deep image retrieval generally performs vector retrieval on image features extracted by a deep neural network. Because these features capture the semantic content of the image, deep image retrieval is a form of content-based image retrieval [65].
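As a rough illustration of vector retrieval over deep features, the sketch below ranks database images by cosine similarity to a query vector. It assumes the features have already been extracted by some network; the tiny three-dimensional vectors, the image ids, and the function names are all hypothetical, and cosine similarity is only one possible similarity measure.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, database, k=3):
    """Return the ids of the k database vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), image_id)
              for image_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:k]]


# Hypothetical pre-extracted features for three database images.
database = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
print(retrieve([1.0, 0.1, 0.0], database, k=2))  # ['cat_1', 'cat_2']
```

This exhaustive scan is linear in the database size; large-scale systems typically replace it with an approximate index.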

Deep Image Retrieval
3.1.1 Category-level Retrieval. The main task of category-level image retrieval is to retrieve any image of the same category as the query image. Sharma et al. [47] proposed a supervised discriminative distance learning method that outperforms baselines in category-based image retrieval tasks. Meng et al. [35] performed feature extraction and matching at the class level and proposed a new image retrieval method based on merged regions. The work in [63] proposed a cross-domain representation learning framework, which achieved strong performance in category-level image retrieval.

3.1.2 Instance-level Retrieval. The goal of instance-level image retrieval is to find images that contain the specific instance shown in the query image, which may have been captured under different background conditions. To achieve accurate and efficient retrieval in large-scale image databases, the core task of instance-level image retrieval is to obtain compact and discriminative feature representations of images. The work in [44] developed a deep CNN-based baseline for instance retrieval using local feature extraction based on CNN representations.
Other approaches to instance retrieval include bags of local convolutional features [36], instance-aware image representations [25], and hashing models for deep multi-instance ranking [9]. Amato et al. [2] introduced a deep feature representation method based on scalar quantization and demonstrated its effectiveness on instance-level retrieval benchmarks. Krishna et al. found that models trained with contrastive methods outperformed baselines pretrained on ImageNet in retrieval tasks. Bai et al. [4] proposed an unsupervised framework that focuses on instance objects in images, called adversarial instance-level image retrieval. It was the first to use adversarial training in the retrieval process of instance-level image retrieval, and it can significantly improve retrieval accuracy without increasing time cost.
3.1.3 Fine-grained Retrieval. Xie et al. [62] proposed the concept of fine-grained image search. Driven by deep learning, a growing number of fine-grained image retrieval methods have been proposed [31,75,76]. The work in [56] proposed a deep ranking model that learns a fine-grained image similarity model directly from images. Ahmad et al. [1] proposed an object-oriented feature selection mechanism for the deep convolutional features of a pre-trained CNN. The model uses a locality-sensitive hashing method to enable fine-grained retrieval in large-scale surveillance datasets.
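To give a feel for how locality-sensitive hashing speeds up retrieval, the sketch below uses random-hyperplane hashing, one well-known LSH family for cosine similarity. It is an illustrative assumption and not necessarily the variant used in [1]; all function names, the seed, and the bit count are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def build_index(database, planes):
    """Group image ids into buckets keyed by their binary code."""
    buckets = {}
    for image_id, vec in database.items():
        buckets.setdefault(hash_code(vec, planes), []).append(image_id)
    return buckets


def candidates(query, buckets, planes):
    """Images in the query's bucket; only these need exact comparison."""
    return buckets.get(hash_code(query, planes), [])


def hamming(code_a, code_b):
    """Number of differing bits between two codes."""
    return sum(a != b for a, b in zip(code_a, code_b))
```

Vectors separated by a small angle agree on most bits, so comparing short binary codes (or looking up a bucket) replaces most of the expensive floating-point comparisons; the price is that some true neighbors can land in a different bucket.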
3.1.4 Cross-modal Retrieval. With the application of deep neural networks to image retrieval research, cross-modal retrieval has received extensive attention. Image and text are two very common modalities in retrieval. Given data in one modality, the cross-modal retrieval task is to find the corresponding or closest items in the space of another modality.
Multimodal retrieval methods include deep visual semantic hashing [7], self-supervised adversarial hashing [27], a deep cascaded cross-modal ranking model [59], and a deep mutual information maximization algorithm [16]. Dey et al. [11] proposed a cross-modal deep network structure that allows text and sketches to be used as query input and uses an attention model to retrieve multiple objects in the query. Lee et al. [26] studied the image-text matching problem and proposed a stacked cross-attention mechanism that uses image regions and words in sentences as context to discover the full latent alignments and infer image-text similarity. Wang et al. [60] proposed a cross-modal adaptive information transfer model, consisting of cross-modal information aggregation and cross-modal gated fusion, to adaptively explore the interaction between images and sentences in text-image matching. Chaudhuri et al. [8] proposed a remote sensing cross-modal retrieval framework based on deep neural networks. Sumbul et al. [50] proposed a new self-supervised cross-modal image retrieval method that requires no labeled training images yet effectively preserves the similarity within each modality and between modalities while eliminating the discrepancies between modalities.
3.1.5 Sketch-based Retrieval. Sketch-based image retrieval (SBIR) is essentially cross-modal information retrieval. Researchers have built effective SBIR algorithms from three aspects: deep multimodal feature generation, cross-modal correlation modeling, and similarity function optimization. Eitz et al. [13] established a benchmark for SBIR. Qi et al. [42] proposed SBIR based on a Siamese CNN architecture. Song et al. [49] constructed a new fine-grained SBIR (FG-SBIR) model by introducing attention modules, shortcut connection fusion blocks, and high-order learnable energy functions. Pang et al. [38] were the first to identify and address the cross-category FG-SBIR generalization problem. They framed it as a domain generalization problem and proposed an unsupervised learning method that models a general manifold of visual sketch features and automatically adapts to new categories. The work in [67] proposed a zero-shot SBIR (ZS-SBIR) benchmark for retrieving classes that were not seen during training. Dey et al. contributed a large-scale ZS-SBIR dataset, QuickDraw-Extended [12], to the community.
Other approaches to SBIR include a cross-domain representation learning framework [63], a CNN-based semantic reranking system [57], and semantically aligned paired cycle-consistent generative networks [169]. Bhunia et al. [5] designed a reinforcement-learning-based cross-modal FG-SBIR framework to address the long time users need to draw a sketch. Torres et al. [54] used uniform manifold approximation and projection (UMAP) for dimensionality reduction, proposing the use of compact feature representations in the SBIR setting. Sain et al. [45] proposed a style-agnostic SBIR model that adapts to the diverse drawing styles of different users. Yu et al. [68] were the first to define and address the problem of fine-grained instance-level image retrieval using freehand sketches, and provided a large-scale fine-grained sketch dataset.

Conversational Image Retrieval.
Conversational image retrieval gradually clarifies the user's retrieval intention based on interactive user responses and thereby achieves more accurate retrieval. Liao et al. [30] proposed a knowledge-aware multimodal dialogue model that considers the semantic and domain knowledge contained in visual content. Guo et al. [17] introduced an interactive image search method based on deep learning that enables users to provide feedback through natural language. On this basis, Zhang et al. [71] proposed a constraint-enhanced reinforcement learning framework to effectively incorporate users' preferences over time. Zhang et al. [72] proposed a reward-constrained recommendation framework for text-based interactive recommendation. Yuan et al. [69] proposed a multi-turn natural language feedback framework that can effectively handle conversational fashion image retrieval. Kaushik et al. [23] introduced a multi-view conversational image search system, developed a reinforcement learning model based on the initial running state, incentives, and sessions, and predicted the images provided to the user through a customized search algorithm.

DNNs For CBIR
The most representative models for feature extraction in image retrieval include VGG [48], GoogLeNet [51], ResNet [19], and EfficientNet [52].

VGG [48] has more convolutional layers than AlexNet [24]. VGG-16 and VGG-19 are the most widely used versions, with 13 and 16 convolutional layers, respectively. The strategy of VGG is to deepen the convolutional neural network, and experimental results show that, within a certain range, deepening the network effectively improves model performance.

GoogLeNet [51] introduces the inception module, which makes it possible to construct a sparser CNN structure. It uses convolution kernels of different sizes to capture receptive fields of different sizes, and it replaces the final fully connected layer with a global average pooling layer, which reduces the number of model parameters. Compared with AlexNet and VGGNet, GoogLeNet is deeper and wider, with fewer parameters and higher learning efficiency. Deeper architectures help learn higher-level abstract features, thereby reducing the semantic gap.

ResNet [19] converts a plain CNN into a residual network using skip connections, and ResNets have fewer convolution filters than VGGNets. The skip connections let the signal bypass some layers, which mitigates the vanishing gradient problem: they act as gradient highways that allow gradients to flow undisturbed.
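The "gradient highway" effect of skip connections can be seen in a deliberately simplified toy model. The sketch below assumes every layer is a single scalar linear map with the same weight, which is far simpler than a real ResNet block; the function names are hypothetical and the numbers only illustrate the trend.

```python
def plain_gradient(weight, depth):
    """d(output)/d(input) when each layer computes x_next = weight * x."""
    return weight ** depth


def residual_gradient(weight, depth):
    """d(output)/d(input) when each layer computes x_next = x + weight * x."""
    return (1.0 + weight) ** depth


def residual_forward(x, weight, depth):
    """Forward pass through `depth` toy residual layers."""
    for _ in range(depth):
        x = x + weight * x  # identity shortcut plus the learned branch
    return x


# With a small weight, the plain stack's gradient collapses toward zero,
# while the residual stack keeps a gradient close to 1.
print(plain_gradient(0.01, 20))
print(residual_gradient(0.01, 20))
```

In this toy setting the identity term contributes a constant 1 to every layer's derivative, so the product over layers cannot shrink to zero just because the learned weights are small.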

EfficientNet.
In contrast to conventional arbitrary model scaling, EfficientNet [52] uses a compound coefficient to scale the three dimensions of width, depth, and image resolution in a balanced way. Seven versions of different scales have been developed, and experiments show that they outperform most convolutional neural networks while being more efficient.
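Compound scaling can be sketched as follows. A single coefficient φ scales depth by α^φ, width by β^φ, and resolution by γ^φ. The constants α = 1.2, β = 1.1, γ = 1.15 are the values reported in the EfficientNet paper for its baseline network and are not stated in this survey, so treat them as an assumption here; the function names and the base resolution of 224 are likewise illustrative, and the sketch does not reproduce the exact configurations of the released model variants.

```python
# Scaling constants as reported for the EfficientNet baseline (assumption:
# not given in this survey). They roughly satisfy ALPHA * BETA^2 * GAMMA^2 = 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Scale depth, width and input resolution with one coefficient phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = base_resolution * GAMMA ** phi
    return depth, width, resolution


def flops_multiplier(phi):
    """Approximate growth in compute: depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r), round(flops_multiplier(phi), 2))
```

Because compute grows linearly with depth but quadratically with width and resolution, tying all three to one coefficient means each increment of φ roughly doubles the cost, instead of leaving the three dimensions to be tuned independently.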

Deep Feature
Feature extraction based on deep learning is mainly carried out by fully connected layers or convolutional layers. A model can extract global features from a fully connected layer, local features from a convolutional layer, or a combination of the two. Feature fusion, in turn, can be performed at the layer level or the model level [65].

Deep Feature Selection.
Convolutional layers extract local features, and fully connected layers reassemble these local features into complete features through a weight matrix, thus representing the global features of the image. After the features extracted by a fully connected layer are reduced in dimension by PCA and normalized, the similarity between images can be measured. However, using fully connected layer features alone may limit image retrieval accuracy. Song et al. pointed out that establishing a direct connection between the first fully connected layer and the last one can achieve a coarse-to-fine improvement [49]. Furthermore, since fully connected layers represent image-level features, they lack local geometric invariance, which reduces the robustness of the features to image transformations such as cropping and occlusion. To address this, Song et al. also extract local features at a finer scale to handle background clutter, and other researchers proposed using intermediate convolutional layers [3,44,70].
Features are usually aggregated using pooling operations, of which sum/average pooling and max pooling are the two simplest. Pooling the features extracted by a convolutional layer effectively reduces the dimensionality of the representation and enhances its robustness. In addition, pooling methods such as R-MAC [53], CroW [22], SPoC [3], and GeM pooling [43] can also effectively improve the retrieval performance of image features.
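The pooling operations named above can be compared on a toy feature map. The sketch below treats each channel as a flat list of non-negative activations and implements sum, average, max, and generalized-mean (GeM) pooling. The GeM formula used is the commonly cited one, (mean of x^p)^(1/p), and the default p = 3 is an assumption; R-MAC, CroW, and SPoC involve region or weighting schemes that are not shown. Function names are hypothetical.

```python
def sum_pool(channel):
    """Sum of all spatial activations in one channel."""
    return sum(channel)


def avg_pool(channel):
    """Mean of all spatial activations in one channel."""
    return sum(channel) / len(channel)


def max_pool(channel):
    """Largest spatial activation in one channel."""
    return max(channel)


def gem_pool(channel, p=3.0, eps=1e-6):
    """Generalized-mean pooling: p=1 is average pooling, large p nears max."""
    clamped = [max(x, eps) for x in channel]  # keep values positive
    return (sum(x ** p for x in clamped) / len(clamped)) ** (1.0 / p)


def aggregate(feature_map, pool):
    """Turn a list of channels into one descriptor value per channel."""
    return [pool(channel) for channel in feature_map]


# Two channels of a hypothetical 2x2 convolutional feature map, flattened.
feature_map = [[1.0, 2.0, 3.0, 2.0], [0.0, 0.5, 4.0, 0.5]]
print(aggregate(feature_map, avg_pool))
print(aggregate(feature_map, max_pool))
print(aggregate(feature_map, gem_pool))
```

GeM sits between average and max pooling, so the exponent p controls how strongly the descriptor focuses on the most salient activations.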

Deep Feature Fusion.
Feature fusion combines the strengths of different features so that they complement one another. The work in [33] merges multiple deep global features from different fully connected layers. Li et al. [29] applied the R-MAC coding scheme to the 5 convolutional layers of VGG-16 and concatenated them into multi-scale feature vectors. Wang et al. [58] selected all convolutional layers of VGG-16 to extract image feature representations and achieve multi-feature fusion, and this method is more robust than using single-layer features alone.
In fine-grained image retrieval, in order to emphasize the decisive role of local features, Yu et al. used low-level features to refine the ranking results of high-level features instead of directly concatenating multi-layer features. Through a mapping function, low-level features are used to measure the fine-grained similarity between the query image and its nearest-neighbor images that share the same semantics. Gong et al. [15] proposed a multi-scale orderless pooling CNN, which extracts and encodes CNN features from different layers and then concatenates the aggregated features of the different layers to compare images. Li et al. [73] proposed a multi-layer orderless fusion (MOF) algorithm on the basis of multi-scale orderless pooling, and experiments on the Holidays and UKBench datasets showed improved performance. Zhang et al. [28] fused the index matrices generated by two features extracted from the same CNN, which has low computational complexity. Yang et al. [66] abandoned two-stage retrieval and proposed a deep orthogonal local and global (DOLG) feature fusion framework for end-to-end image retrieval. The image retrieval performance of this method was verified on the Oxford and Paris datasets.
Fusing the features of different models requires the models to be complementary. Simonyan et al. [48] introduced a fusion strategy within the convolutional model, fusing VGG-16 and VGG-19 to improve the feature learning ability of VGG. Yang et al. [64] introduced dual-stream attention in a CNN for image retrieval. Like humans, this method computes image similarity by retaining salient content and suppressing irrelevant regions, and it achieved strong image retrieval performance. Zheng et al. [74] argued that fusion between models can bridge the gap between intermediate and high-level features, and therefore combined VGG-19 and AlexNet to learn combined features. Ge et al. [14] proposed a multilevel feature fusion method to improve the feature representation for high-resolution remote sensing image retrieval. Jiang et al. [21] proposed an image retrieval method based on image feature fusion and the discrete cosine transform. They compared methods based on shallow feature fusion and deep feature fusion, and experiments on the Oxford dataset show that both can improve the performance of the retrieval system. According to the order of fusion and prediction, feature fusion can be divided into early fusion and late fusion. Early fusion first fuses the features and then performs image retrieval on the single fused feature representation [33,64,66]. Late fusion improves retrieval performance by combining the retrieval results obtained with different features [28].
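The difference between early and late fusion can be sketched on toy data. In the example below, each image is described by two hypothetical feature vectors. Early fusion concatenates the normalized features and ranks once on the fused vector; late fusion ranks separately with each feature and then merges the ranked lists with a reciprocal-rank sum. The merging rule, the function names, and the data are illustrative assumptions, not the schemes used in the cited works.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse(features):
    """Early fusion: concatenate the normalized features, then renormalize."""
    return l2_normalize([x for feat in features for x in l2_normalize(feat)])


def early_fusion_scores(query_feats, db_feats):
    """Similarity of the single fused query vector to each fused image vector."""
    q = fuse(query_feats)
    return {image_id: dot(q, fuse(feats)) for image_id, feats in db_feats.items()}


def late_fusion_scores(query_feats, db_feats):
    """Rank with each feature separately, then sum reciprocal ranks."""
    scores = {image_id: 0.0 for image_id in db_feats}
    for i, q in enumerate(query_feats):
        q = l2_normalize(q)
        sims = {image_id: dot(q, l2_normalize(feats[i]))
                for image_id, feats in db_feats.items()}
        ranking = sorted(sims, key=sims.get, reverse=True)
        for position, image_id in enumerate(ranking, start=1):
            scores[image_id] += 1.0 / position
    return scores


def rank(scores):
    """Image ids ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


# Two feature types per image, e.g. one global and one local descriptor.
query = [[1.0, 0.0], [0.0, 1.0]]
database = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.9, 0.1], [0.1, 0.9]],
    "c": [[0.0, 1.0], [1.0, 0.0]],
}
print(rank(early_fusion_scores(query, database)))
print(rank(late_fusion_scores(query, database)))
```

Early fusion needs only one index but produces longer vectors, whereas late fusion keeps each feature's index separate and pays for several searches plus a merging step.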

CONCLUSION
This paper reviews the research progress of CBIR based on deep learning, explains the connections between the methods, and summarizes the representative ones. CBIR based on deep learning has become an active research direction. Researchers have produced a great deal of innovative work and made significant progress in retrieval accuracy and efficiency, but many new problems have also emerged. First, feature selection and extraction are the basis of CBIR. How to select features that reflect the semantics contained in images has been, and will remain, the primary problem. In addition, because feature fusion increases the dimensionality of feature vectors, dimensionality reduction deserves further study, since only low-dimensional and highly discriminative features can guarantee both retrieval performance and efficiency. How to represent images with feature vectors of low to medium dimensionality is still an open problem. Second, being data-driven is one of the characteristics of deep learning. Specific retrieval tasks require specific datasets as benchmarks, and researchers urgently need more diverse datasets. At this stage, CBIR methods focus on static datasets and are difficult to apply to incremental scenarios. As new data arrives, how to make a trained system learn incrementally is a problem worth considering. Finally, image retrieval ultimately serves people, and how to use feedback techniques to satisfy users with minimal iterations still needs further research [65].