Blind Image Super-resolution with Rich Texture-Aware Codebooks

Blind super-resolution (BSR) methods based on high-resolution (HR) reconstruction codebooks have achieved promising results in recent years. However, we find that a codebook learned through HR reconstruction alone may not effectively capture the complex correlations between low-resolution (LR) and HR images. Specifically, multiple HR images may produce similar LR versions under complex blind degradations, so HR-only codebooks offer limited texture diversity when faced with confusing LR inputs. To alleviate this problem, we propose the Rich Texture-aware Codebook-based Network (RTCNet), which consists of the Degradation-robust Texture Prior Module (DTPM) and the Patch-aware Texture Prior Module (PTPM). DTPM effectively mines the cross-resolution correspondence of textures between LR and HR images. PTPM uses patch-wise semantic pre-training to correct the misperception of texture similarity caused by high-level semantic regularization. Taking advantage of both modules, RTCNet effectively resolves the misalignment of confusing textures between HR and LR in BSR scenarios. Experiments show that RTCNet outperforms state-of-the-art methods on various benchmarks by up to 0.16-0.46 dB.


INTRODUCTION
Blind Super-Resolution (BSR) aims to realistically reconstruct high-resolution (HR) images from low-resolution (LR) images with unknown degradation [7,8,21,30,34,55,57]. To avoid Generative Adversarial Network (GAN) artifacts, codebook-based BSR approaches [7,8], inspired by VQVAE [47,48] and VQGAN [63], model high-resolution textures in a discrete feature space created by a pre-trained feature codebook and reconstruct HR images from it. These methods have shown promising results, as the codebook effectively constrains the output to a valid solution space.
One of the major challenges in BSR is the complex blind degradation, which leads to similar LR versions of different HR inputs and disrupts the LR-HR matching correlation [7,8,21,34,55,57]. For example, in Fig. 1, we sample two HR patches from the DIV2K [1] dataset and degrade them using the widely used blind degradation procedure of [66]. We compute the Mean Square Error (MSE) as the similarity measure and find that complex degradation reduces the distinction between LR patches compared to the distinction between their HR counterparts. In detail, HR 1 has a smaller MSE (40.1) with LR 2 than with its own corresponding LR patch (LR 1), whose MSE is 40.3. In addition, similar LR patches tend to match the same HR patch rather than their individual HR versions. Such phenomena complicate the handling of LR data.
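The patch-confusion measurement described above can be sketched as follows. This is an illustrative toy version with synthetic patches, and a box blur plus Gaussian noise stands in for the full blind degradation pipeline of [66]; it is not the paper's actual code.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equally sized patches."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def degrade(patch, blur=3, noise_sigma=8.0, seed=0):
    """Toy blind degradation: box blur followed by Gaussian noise."""
    rng = np.random.default_rng(seed)
    H, W, C = patch.shape
    pad = blur // 2
    padded = np.pad(patch.astype(np.float64),
                    ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros((H, W, C))
    for i in range(H):          # naive box blur; clarity over speed
        for j in range(W):
            blurred[i, j] = padded[i:i + blur, j:j + blur].mean(axis=(0, 1))
    return blurred + rng.normal(0.0, noise_sigma, blurred.shape)

# Two distinct HR patches become harder to tell apart after degradation:
rng = np.random.default_rng(1)
hr1 = rng.uniform(0, 255, (32, 32, 3))
hr2 = hr1 + rng.normal(0, 20, hr1.shape)   # a different but related texture
lr1, lr2 = degrade(hr1), degrade(hr2, seed=2)
print(mse(hr1, hr2), mse(lr1, lr2))        # the LR pair is much closer
```

Blurring averages away exactly the high-frequency differences that distinguish the two HR patches, which is why the LR distance collapses toward the noise floor.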
To address this issue, recent codebook-based methods [7,8] incorporate an additional LR encoder to model the LR-HR relationship on top of a texture codebook learned from HR data (Fig. 2.a). While this technique is effective on mildly degraded data, it produces lower-quality results in severely degraded areas. We identify two main factors that limit its performance on complex degraded data. 1) First, a codebook space built from distinct HR data struggles with confusing LR inputs. Unlike the clear relationships among HR textures, HR-LR relationships in BSR are confusing and often many-to-one. This makes it difficult for codebooks pre-trained only for HR reconstruction [7,8] to distinguish different textures from similar degraded versions, limiting the diversity of texture restoration. Besides, to simplify learning, these methods apply the codebook only at the network bottleneck, which effectively captures larger textures but may miss mid-to-low-level details. 2) Second, they use image classification-based features (often from backbones such as VGG [50] pre-trained on ImageNet) as additional semantic regularization during codebook learning. However, high-level tasks that prioritize global semantics may neglect the local information [22,29] crucial for low-level tasks, causing inconsistency between the pre-trained features and local texture perception (e.g., Fig. 2.b). Developing a texture-friendly and efficient prior on top of existing global priors is therefore worthwhile for BSR, but remains underexplored.

RELATED WORK

2.1 Codebook-based SISR
Traditional codebook-based methods [5,61] have been effective in modeling low-resolution (LR) and high-resolution (HR) patches in color spaces, especially under light degradation. However, in blind super-resolution (BSR) with severe and unknown degradation, their effectiveness decreases due to the complex correspondence between resolutions. Recent advances in deep learning [42,47,48,63] have enabled vector quantization-based methods [7,8,75] that move patch matching from pixel space to feature space, showing notable improvements in BSR scenarios. Specifically, these methods use a high-resolution VQVAE [47,48] generation model (vector codebook and decoder) to model HR textures and an additional LR encoder for cross-resolution feature matching. Despite these advantages, as mentioned in Sec. 1, recent codebook-based methods still struggle with limited texture diversity and coarse modeling of fine textures. Therefore, we design the DTPM to alleviate codebook collapse and achieve hierarchical texture modeling.

2.2 Prior-based SISR
Since SR is inherently an ill-posed problem, additional image priors can effectively improve restoration performance. Prior-based super-resolution methods can be roughly divided into explicit and implicit methods. Explicit methods [22,30-32,60,71,73,76], which use HR reference images, can restore realistic textures but perform poorly when reference images are limited. Implicit methods [14,24,40,44] use generative model-based priors [3] and achieve superior results on domain-specific images such as faces [4,54,62]. Several methods learn a posterior distribution with a pre-trained StyleGAN [3] and use another encoder to project LR images into StyleGAN's latent space. However, since learning a generative prior on generic images is challenging, recent methods instead use high-level task-based priors for image texture reconstruction [7,8,59]. These priors tend to overlook local textures in favor of global semantics, making them less suitable for texture-sensitive image restoration.

METHOD

3.1 Overview
The framework of our method is shown in Fig. 3 and briefly described herein.
Training. During training, RTCNet takes both low-resolution (LR) images \(I_{LR}\) and high-resolution (HR) images \(I_{HR}\) as input, with CNN encoders \(E_{LR}\), \(E_{HR}\) used to extract hierarchical features \(F_{LR} = E_{LR}(I_{LR})\) and \(F_{HR} = E_{HR}(I_{HR})\), respectively. Following prior work [8], additional RSTB [34] layers are added to \(E_{LR}\) for stronger learning ability. The extracted features are quantized via the hierarchical codebooks \(\mathcal{C}\) in the Degradation-robust Texture Prior Module (DTPM). Finally, we pair the quantized features of all resolutions with decoders \(D_{LR}\), \(D_{HR}\) for cross-resolution reconstruction and compute the reconstruction loss against the ground-truth images.
Inference. Different from training, we apply only the LR input to the HR reconstruction process to obtain the super-resolution result \(I_{SR}\):
\[I_{SR} = D_{HR}(\mathrm{VQ}_{\mathcal{C}}(E_{LR}(I_{LR}))), \tag{2}\]
where \(\mathrm{VQ}_{\mathcal{C}}(\cdot)\) denotes quantization with the codebooks \(\mathcal{C}\).

3.2 Degradation-robust Texture Prior Module
In this section, we introduce our Degradation-robust Texture Prior Module in detail, including vector quantization, cross-resolution consistency constraints, and the hierarchical codebook structure.
Vector Quantization. For each point feature \(f \in \mathbb{R}^{d}\), its quantized result \(\hat{f} \in \mathbb{R}^{d}\) is the nearest neighbor under the \(\ell_2\) distance in the codebook \(C \in \mathbb{R}^{N \times d}\):
\[\hat{f} = c_{k}, \quad k = \arg\min_{i} \lVert f - c_{i} \rVert_{2},\]
where \(N\) denotes the size of the codebook. Given the input feature \(F \in \mathbb{R}^{H \times W \times d}\), its quantized feature \(\hat{F}\) is the combination of the quantized results of all point features within \(F\). Following previous work [12,52], we directly copy the gradient from \(\hat{F}\) to \(F\) for backpropagation and use the following loss function \(\mathcal{L}_{code}\) to optimize the codebooks:
\[\mathcal{L}_{code} = \lVert \mathrm{sg}(F) - \hat{F} \rVert_{2}^{2} + \beta \lVert F - \mathrm{sg}(\hat{F}) \rVert_{2}^{2},\]
where \(\mathrm{sg}(\cdot)\) denotes the stop-gradient operation and \(\beta = 0.25\) [12,52].
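The quantization step and codebook loss above can be sketched in a few lines of numpy. Shapes and variable names are illustrative; in the actual model this runs inside an autograd framework, where the two stop-gradient terms differ in which parameters they update.

```python
import numpy as np

def quantize(F, C):
    """F: (H, W, d) feature map; C: (N, d) codebook.
    Returns the quantized map and the selected code indices."""
    H, W, d = F.shape
    flat = F.reshape(-1, d)                                  # (H*W, d)
    d2 = ((flat[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # squared l2 to every code
    idx = d2.argmin(axis=1)                                  # nearest-neighbor lookup
    return C[idx].reshape(H, W, d), idx.reshape(H, W)

def codebook_loss(F, F_q, beta=0.25):
    """||sg(F) - F_q||^2 + beta * ||F - sg(F_q)||^2.
    Outside autograd, sg() is a no-op, so both terms share one value."""
    e = ((F - F_q) ** 2).mean()
    return e + beta * e

rng = np.random.default_rng(0)
C = rng.normal(size=(512, 8))     # N = 512 codes, as in the implementation details
F = rng.normal(size=(16, 16, 8))
F_q, idx = quantize(F, C)
```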
During training, DTPM quantizes HR and LR features simultaneously. Its loss \(\mathcal{L}_{DTPM}\) is the sum of \(\mathcal{L}_{code}\) over the hierarchical codebooks:
\[\mathcal{L}_{DTPM} = \sum_{s \in \{l, g\}} \mathcal{L}_{code}^{s}. \tag{5}\]

Cross-Resolution Correlation Constraints. We investigate the texture correlation between HR and LR images, focusing on cross-resolution consistency. We decompose the texture consistency between LR and HR data into two components:

1) Reconstruction consistency constraint in RGB space. Similar code representations should have similar texture content across resolutions. Since paired HR and LR images share the same content, their quantized features \(\hat{F}_{HR}\) and \(\hat{F}_{LR}\) should be able to reconstruct both the \(I_{HR}\) and \(I_{LR}\) inputs using the decoders of both resolutions. The generated images should align with the inputs of their corresponding resolution, to which image reconstruction supervision is applied:
\[\mathcal{L}_{rec.c} = \mathcal{L}_{rec}(D_{HR}(\hat{F}_{LR}), I_{HR}) + \mathcal{L}_{rec}(D_{LR}(\hat{F}_{HR}), I_{LR}),\]
where \(\mathcal{L}_{rec}\) denotes the image reconstruction loss function in Sec. 3.4.

2) Representation consistency constraint in codebook space. Images with similar texture content across resolutions should have similar representations in the codebook space. Specifically, we constrain the features extracted from paired HR-LR images to be consistent with each other:
\[\mathcal{L}_{rep.c} = \lVert F_{LR} - F_{HR} \rVert_{2}^{2}.\]

Multi-scale Codebook Structure. The hierarchical codebook structure is based on the assumption that textures of different sizes can be characterized by codebooks of different scales. In our implementation, we employ two scales of ×4 and ×8 downsampling, hereafter referred to as the local scale \(l\) and the global scale \(g\). We apply codebooks to both scales for feature quantization. In contrast to previous bottleneck-only methods, the additional shallow codebook can represent diverse and minute texture information at the smaller scale, which helps generate finer textures. To mitigate the convergence difficulties of training multi-scale codebooks from scratch, we propose a deep-to-shallow training strategy: the codebooks are trained sequentially, starting from the deepest scale and progressing toward the shallowest. Fig. 4 shows the detailed training strategy. First, the global codebook is trained from scratch, with a temporary decoder used in place of the local codebook and the multi-scale decoders. In this phase, the multi-scale encoder and the global codebook are trained well. Second, we introduce the local codebook, replace the temporary decoder with the multi-scale decoders, and freeze the well-trained modules from Stage 1. The well-trained encoder and global codebook allow more effective and stable optimization of the local codebook.
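The two-stage deep-to-shallow schedule can be summarized schematically. Module names are our own shorthand for the components in Fig. 4, and the `trainable` flag stands in for freezing parameters in a real framework; this is a sketch, not the training code.

```python
class Module:
    """Minimal stand-in for a network component with frozen/trainable state."""
    def __init__(self, name):
        self.name, self.trainable = name, True

encoder      = Module("multi-scale encoder")
global_cb    = Module("global codebook (x8)")
local_cb     = Module("local codebook (x4)")
temp_decoder = Module("temporary decoder")
ms_decoders  = Module("multi-scale decoders")

def stage1():
    # Train the encoder and the deepest (global) codebook first,
    # with a temporary decoder in place of the local-scale parts.
    return [encoder, global_cb, temp_decoder]

def stage2():
    # Freeze the stage-1 modules, then introduce the shallow (local)
    # codebook and the real multi-scale decoders.
    for m in (encoder, global_cb):
        m.trainable = False
    return [local_cb, ms_decoders]
```

The point of the schedule is that the shallow codebook is optimized against an already-stable encoder and global codebook, rather than chasing a moving target.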

3.3 Patch-aware Texture Prior Module
To obtain a low-level-friendly prior that emphasizes local details over global semantics for enhanced texture perception (Sec. 1), we build our Patch-aware Texture Prior Module (PTPM) upon patch-level classification pre-training, drawing insights from multiple instance learning [45,46] and fine-grained image classification [19]. This section details the creation of PTPM, covering data generation and agent-task pre-training.
Patch Data Generation. In general, non-overlapping patches are extracted from the images; those without sufficient segmentation labels are discarded, and the rest are assigned their respective labels. Fig. 5(a) illustrates the process of extracting a patch \(p \in \mathbb{R}^{s \times s \times 3}\) from an image \(I\) with segmentation map \(M\) in a non-overlapping manner. For each patch \(p\), we consider it valid and assign it the category label \(y = c\) if the proportion of its segmentation pixels belonging to class \(c\) exceeds a threshold \(\tau\); otherwise, it is deemed invalid.

Agent Task Pre-training. We initialize the backbone network with pre-trained weights and add an additional linear classifier \(f_{cls}\) for pre-training. To ensure compatibility of the learned prior with the \(\ell_2\) distance in the codebook space, we add a contrastive supervision \(\mathcal{L}_{con}\) [43] to the cross-entropy loss \(\mathcal{L}_{ce}\). Patches within the same category are treated as positive samples, while patches from different categories are negative samples in \(\mathcal{L}_{con}\). Given patch samples \(P = \{p_i \mid i = 1, \dots, n\}\) and labels \(Y = \{y_i \mid i = 1, \dots, n\}\), the total prior training loss function is:
\[\mathcal{L}_{prior} = \mathcal{L}_{ce}(\hat{y}, Y) + \mathcal{L}_{con}(v, Y),\]
where \(v_i = f_{\theta}(p_i)\) denotes the feature embedding of \(p_i\) after GAP and \(\hat{y}_i = f_{cls}(v_i)\) denotes the prediction result.
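A minimal sketch of the patch-labeling rule above, on a toy segmentation map. The threshold value `tau=0.7` and the tiny patch size are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def patch_labels(seg, patch=4, tau=0.7):
    """seg: (H, W) integer segmentation map.
    Returns {(row, col): label} for valid non-overlapping patches."""
    H, W = seg.shape
    labels = {}
    for i in range(0, H - patch + 1, patch):       # non-overlapping grid
        for j in range(0, W - patch + 1, patch):
            window = seg[i:i + patch, j:j + patch]
            vals, counts = np.unique(window, return_counts=True)
            k = counts.argmax()
            if counts[k] / window.size > tau:      # dominant class must exceed tau
                labels[(i, j)] = int(vals[k])      # valid: assign dominant label
            # otherwise the patch is discarded as invalid
    return labels

seg = np.zeros((8, 8), dtype=int)
seg[:, 4:] = 1                  # left half class 0, right half class 1
print(patch_labels(seg))        # → {(0, 0): 0, (0, 4): 1, (4, 0): 0, (4, 4): 1}
```

A patch straddling the class boundary (50/50 mix) would fail the threshold and be dropped, which is exactly the filtering the text describes.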
Texture-oriented Label Reorganization and Prior Refinement. Coarse pre-training with patch data and the original image-level labels may still be affected by global label influence. We mitigate this problem by reorganizing the class labels based on the coarse pre-training results. This process, shown in the right part of Fig. 5, entails feature visualization with t-SNE [53], merging similar-texture data with different labels, and separating discrete clusters. To further expand our data, we incorporate an edge-sensitive image matting dataset, resulting in 20,181 samples assigned 35 reorganized labels. We then fine-tune the PTPM network on this restructured data to obtain the final PTPM. For an intuitive comparison, we show the feature distributions of our PTPM and the image classification-based prior in the supplement. The PTPM features show better clustering performance than the image classification-based features, signifying increased sensitivity to texture changes. Additionally, in Fig. 6, the L2 nearest neighbors of selected samples in the different prior spaces show that PTPM measures texture similarity in a way that aligns more closely with human perception.

3.4 Training losses
Codebook Loss. This loss optimizes DTPM and includes the codebook loss and the two correlation constraints:
\[\mathcal{L}_{cb} = \mathcal{L}_{DTPM} + \mathcal{L}_{rec.c} + \mathcal{L}_{rep.c}.\]
Image Reconstruction Loss. We use the \(\ell_1\) loss and the Perceptual Loss [23] as the main reconstruction losses. Following previous work [7,8], we use a U-Net discriminator as in [55] with a hinge loss as the adversarial loss to obtain more realistic textures. Given a reconstructed image \(\hat{I}\) and its ground-truth image \(I\), the image reconstruction loss can be formulated as
\[\mathcal{L}_{rec} = \lVert \hat{I} - I \rVert_{1} + \lVert \phi(\hat{I}) - \phi(I) \rVert_{2}^{2} + \mathcal{L}_{adv}(\hat{I}),\]
where \(\phi\) denotes a pre-trained VGG16 [50] network.
PTPM Loss. We integrate the PTPM prior into DTPM's training by applying scale-matched texture prior regularization. Specifically, the global texture priors are the activations of the 5th ReLU of the ImageNet-pretrained VGG19 [50] network \(\phi_{g}\), and the local-friendly priors are the activations of the 2nd max-pooling layer of our PTPM network \(\phi_{l}\). We extract the texture priors from the HR images. The PTPM supervision is computed between the quantized features \(\hat{F}\) and the corresponding texture priors, which can be formulated as
\[\mathcal{L}_{PTPM}^{s} = \lVert \psi_{s}(\hat{F}^{s}) - \phi_{s}(I_{HR}) \rVert_{2}^{2}, \quad s \in \{l, g\},\]
where \(\psi_{l}\), \(\psi_{g}\) are single convolution layers that map from the codebook space to the prior space. The total PTPM supervision \(\mathcal{L}_{PTPM}\) is the sum of the supervision on the quantized features of the two scales:
\[\mathcal{L}_{PTPM} = \mathcal{L}_{PTPM}^{l} + \mathcal{L}_{PTPM}^{g}.\]
Overall Loss. The overall loss is then defined as
\[\mathcal{L} = \mathcal{L}_{cb} + \mathcal{L}_{rec} + \mathcal{L}_{PTPM}.\]
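The scale-matched regularization can be sketched as an l2 loss between linearly projected quantized features and the prior activations. Here a per-pixel linear map stands in for the single convolution layer (a 1x1 conv is exactly that), and all shapes are illustrative.

```python
import numpy as np

def prior_loss(F_q, prior, W, b):
    """F_q: (H, W, d) quantized features; prior: (H, W, p) prior activations;
    W: (d, p), b: (p,) -- a per-pixel linear map standing in for a 1x1 conv."""
    projected = F_q @ W + b                     # map codebook space -> prior space
    return float(np.mean((projected - prior) ** 2))

rng = np.random.default_rng(0)
F_q = rng.normal(size=(8, 8, 16))
W, b = rng.normal(size=(16, 32)), np.zeros(32)
prior = F_q @ W                                 # a prior the projection can match exactly
```

Because the projection layer is trainable, the loss pulls the quantized features toward whatever structure the (frozen) prior network exposes, without forcing the codebook to live in the prior's feature space directly.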

EXPERIMENTS

4.1 Datasets and Evaluation Metrics
Prior Pre-training Datasets. Our coarse patch classification dataset is based on the ADE20K [74] semantic segmentation dataset and expanded with the SIM [51] image matting dataset. Following the strategy in Sec. 3.3, we generate a final dataset with 17,880 training samples and 2,301 validation samples.

Super-Resolution Training Dataset.
We build an overall training dataset from the DIV2K [1], DIV8K [15], Flickr2K [35], and OST [59] datasets. HR patches are generated as follows: 1) crop large images into non-overlapping 512 × 512 patches; 2) apply the blur detection method of [25] to each patch and filter out patches with a blurred area greater than 95%. Our final training dataset contains 123,395 HR patches, and we generate LR patches on the fly each iteration using the widely used degradation model proposed in [66].
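The two preparation steps can be sketched as follows. A simple variance-of-Laplacian heuristic stands in for the blur detection method of [25], and the sharpness threshold is an illustrative assumption.

```python
import numpy as np

def crop_patches(img, size=512):
    """Non-overlapping size x size crops; incomplete border crops are dropped."""
    H, W = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, size)
            for j in range(0, W - size + 1, size)]

def laplacian_var(gray):
    """Variance of a 4-neighbor Laplacian response; low values suggest blur."""
    lap = (-4 * gray[1:-1, 1:-1] + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def keep_sharp(patches, thresh=10.0):
    """Filter out patches the heuristic considers blurred (illustrative threshold)."""
    return [p for p in patches if laplacian_var(p) > thresh]

rng = np.random.default_rng(0)
img = rng.uniform(0, 255, (1024, 1536))     # grayscale stand-in for a large image
patches = crop_patches(img)                 # 2 x 3 grid -> 6 patches
sharp = keep_sharp(patches)                 # noisy patches have high Laplacian variance
```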
Evaluation Metrics. We use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) to evaluate the quality of the generated images. In addition, for better perceptual evaluation, we also use the Learned Perceptual Image Patch Similarity (LPIPS) [67].
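For reference, PSNR as used above can be computed as follows; this is the standard definition for images in [0, 255], and the paper's evaluation code may differ in detail (e.g., color space or border cropping).

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images of equal shape."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```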
Implementation Details. We implement our model in the PyTorch framework. In both low-level prior pre-training and SR training, we use the Adam [27] optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.99\), and a learning rate of \(1 \times 10^{-4}\). The number of codes in both scale codebooks is set to 512. RTCNet is trained with a batch size of 16 and an HR patch size of 256 × 256 on 4 NVIDIA V100 GPUs for about 4 days.
As shown in Tab. 1, our method achieves the best PSNR/SSIM performance on almost all 6 datasets. In Fig. 7, we compare the restored images of different BSR methods. First, consistent with the results in Tab. 1 and Tab. 2, DAN and CDC have limited recovery ability on complexly degraded images. Second, Real-ESRGAN and SwinIR-GAN tend to confuse noise with texture: they erase texture details as noise, causing over-smoothing. Besides, FeMaSR mistakes some noise for texture, resulting in noisy texture generation. On the contrary, since our DTPM effectively maintains the cross-resolution consistency of the texture codebooks, it is more robust to low-resolution degradation. It can reasonably distinguish between texture and degradation, restoring realistic textures while denoising. The multi-scale structure and low-level-friendly priors further improve the restoration of fine local textures. In general, our RTCNet achieves state-of-the-art performance in both quantitative metrics and human perception.

Ablation Study
Effectiveness of Cross-Resolution Correlation. To verify the effectiveness of the cross-resolution constraints, we conduct an ablation study on the two strategies used: cross-resolution representation consistency (Rep.C.) and cross-resolution reconstruction consistency (Rec.C.). As shown in Tab. 3, both of them effectively improve the performance of DTPM. This is because the cross-resolution constraint forces the LR representation in the codebook to be closer to its HR counterpart, making it as distinguishable as the HR representation in the codebook space. Combining the two further enhances the improvement.
Effectiveness of Hierarchical Structure. We validate the effectiveness of the prior feature regularization and the deep-to-shallow training strategy for multi-scale codebook training in Tab. 3. Training a multi-scale model from scratch leads to insufficient texture learning due to the more diverse and degradation-sensitive textures at the local scale, making its performance even worse than that of the single-scale model (see rows 4 and 5, Tab. 3). Adding the deep-to-shallow training strategy stabilizes the learning of the multi-scale codebooks. Notably, the deep-to-shallow strategy alone is not significantly better than the single-scale model, while adding the PTPM regularization on top of it surpasses the single-scale model. This shows that the full multi-scale model can restore fine textures better than the single-scale model through hierarchical texture learning, but learning local-scale textures is challenging and requires the assistance of a low-level, texture-friendly prior.
Comparison with Previous Codebooks. To verify the superiority of our proposed DTPM, we compare it with the high-resolution reconstruction-based codebook of FeMaSR [8]. For fairness, both use the bottleneck model structure (codebook at ×8 downsampling) and are trained with our overall loss except for the local-scale PTPM loss. As shown in Tab. 4 and Fig. 8, DTPM outperforms FeMaSR [8] in both quantitative and qualitative results. Compared to FeMaSR [8], our single-scale DTPM achieves more stable and realistic restoration and, in contrast, uses a wider range of the codebook. This is because the codebook space is trained with LR data and contains more discriminative LR representations under the cross-resolution consistency constraint. Benefiting from this, DTPM achieves more diverse texture generation and stronger stability (less noise in row 2 and clearer texture in row 1 of Fig. 8).
Effectiveness of PTPM in Blind Super-Resolution. To verify the superiority of our PTPM for low-level texture learning, we compare the impact of different semantic features on the learning of the local codebook in RTCNet. As shown at the bottom of Fig. 9 and in Tab. 5, while all pre-trained priors improve texture restoration performance, our PTPM prior outperforms the ImageNet-based pre-trained prior. This superiority can be attributed to PTPM's better perception of low-level texture correlations. In addition, patch-level pre-training and texture-oriented label reorganization both improve the performance of PTPM. To demonstrate the superior ability of the PTPM prior to distinguish different types of textures compared to the image classification-based prior, we analyze the frequency distribution of the codes used for super-resolution on the OST dataset in the Supplement. Compared to the ImageNet classification pre-trained prior, PTPM shows larger distribution differences between the "grass" and "plant" categories, whose semantic labels overlap more, and smaller differences between the "sky" and "water" categories, which belong to different semantic categories but have relatively similar textures. This shows that our pre-training strategy enables PTPM Net to pay more attention to local texture correlations by excluding high-level information from the pre-training.

Limitation and Discussion
First, by observing the results, we find that RTCNet has some limitations in restoring regular textures, especially for data rich in such textures, such as buildings (examples in the Supplement). This problem also occurs with previous codebook-based methods, and we will investigate it in future work. Second, we find that the improvement of RTCNet is more obvious on heavily degraded samples than on lightly degraded ones (with perhaps no improvement on some light samples; Fig. 11.c). We speculate that the notable improvements on heavily degraded data are due to the increased matching confusion, a scenario in which RTCNet performs optimally. Conversely, light degradation with less confusion can also be handled by previous methods, leading to marginal improvements. Although both data types are common in applications, we argue that the restoration of complexly degraded data poses great challenges and value for super-resolution (SR) tasks. Third, the improvements brought by PTPM are not very considerable or stable, indicating that larger valid datasets and a more refined pre-training strategy are valuable for better performance. Furthermore, in our experience, pre-training features tend to adapt more effectively to data with strong domain priors, suggesting that combining the codebook method with pre-training strategies on specific types of data may be a direction worth investigating.

CONCLUSION
In this paper, we have presented the Rich Texture-aware Codebook Network (RTCNet) framework for blind image super-resolution. With our proposed Degradation-robust Texture Prior Module (DTPM), we model LR-HR correspondences more effectively than previous HR-reconstruction-only pre-training. The hierarchical architecture of DTPM allows it to model large and fine textures separately. In addition, we build the low-level-friendly Patch-aware Texture Prior Module (PTPM), which further improves the performance of DTPM. Extensive experiments on different benchmarks show that our RTCNet achieves state-of-the-art performance.

A SUPPLEMENTARY MATERIALS

A.1 LR Confusion in BSR data
This section presents a statistical analysis of the LR data from all validation datasets, showing the universal confusion phenomenon in LR data compared to HR data. We densely crop all HR images and their corresponding LR versions (bicubic-upsampled to the HR size) into 128 × 128 patches (26,753 patches in total). We then compute the mean squared error (MSE) between all HR and LR patches. First, we analyze the MSE distributions of both HR and LR patches, as shown in Fig. 10.a. The plot highlights that HR patches have a more concentrated MSE distribution, while LR patches have a more dispersed one, indicating that the LR data are more prone to confusion. Second, for each LR patch, we extract the rank of its corresponding HR patch in the MSE-sorted list of nearest HR patches. As shown in Fig. 10.b, the ranks are dispersed, with a significant proportion not being top-1. This implies that many LR patches are closer in MSE to other HR patches than to their own. Furthermore, Fig. 10.c shows how often each HR patch is selected as the nearest neighbor of different LR patches. A large fraction of these frequencies differ from one, indicating substantial LR-HR mismatch. Although the MSE statistic is not entirely suitable for evaluating patch similarity, the considerable fraction of mismatches between HR patches and their LR counterparts illustrates the confusion caused by blind degradation and the complex correlation it introduces.
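The nearest-neighbor rank statistic of Fig. 10.b can be sketched as follows on synthetic data; random vectors stand in for the real flattened patches, and row i of the LR and HR arrays forms a corresponding pair.

```python
import numpy as np

def own_hr_rank(lr_feats, hr_feats):
    """lr_feats, hr_feats: (P, d) flattened patches; row i of each is an LR-HR pair.
    Returns, for each LR patch, the rank of its own HR patch (0 = nearest)."""
    d2 = ((lr_feats[:, None, :] - hr_feats[None, :, :]) ** 2).mean(-1)  # (P, P) MSE
    order = d2.argsort(axis=1)          # HR patches sorted by distance, per LR patch
    return (order == np.arange(len(lr_feats))[:, None]).argmax(axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 64))
hr = base + 0.5 * rng.normal(size=(50, 64))   # similar HR "textures"
lr = hr + 2.0 * rng.normal(size=hr.shape)     # heavy "degradation"
ranks = own_hr_rank(lr, hr)                   # ranks are typically nonzero for many patches
```

With no degradation, every LR patch ranks its own HR patch first; as the degradation noise grows relative to the spread between HR patches, the rank distribution disperses, mirroring Fig. 10.b.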

A.2 Detailed Comparison between DTPM and FeMaSR
In this section, we perform a comparative analysis of our DTPM and previous high-resolution reconstruction-based codebooks (using FeMaSR as an example). We statistically investigate the performance improvement of DTPM over FeMaSR on samples of varying difficulty in the DIV2K validation set. Specifically, we divide the high-resolution (HR), low-resolution (LR), and super-resolution (SR) results into 128 × 128 patches (15,585 in total). We use the mean squared error (MSE) between LR and HR as a simple measure of sample difficulty, with smaller values indicating easier samples and larger values indicating more difficult ones. We compare and plot several measurements of the SR results of DTPM and FeMaSR under different levels of difficulty (Fig. 11).

A.3 Validation of Hierarchical Texture Learning in the Multi-scale Structure

To better understand the advantage of hierarchical codebooks for texture learning, we explore the texture content learned at different scales of the hierarchical codebook architecture. Specifically, during the quantization process of the local-scale DTPM, we replace the quantized features obtained from the low-resolution input with noise features generated from random indexes, thereby erasing the influence of the local-scale features during reconstruction. Quantitative and qualitative results are shown in Tab. 6 and Fig. 15, respectively. In Fig. 15, when the local-scale information is missing, the detailed texture restoration is heavily affected, causing unrealistic results.

A.4 More Analysis Experiment of PTPM Prior Features
We present a detailed visual comparison, using t-SNE dimensionality reduction, between image classification-based priors and our PTPM priors in Fig. 14. Compared with the image classification-based features, our PTPM priors show better clustering performance, meaning more sensitivity to local texture similarity. To further illustrate the difference between our PTPM prior and the ImageNet prior in learning low-level textures, we conduct super-resolution statistics on the OST dataset. The high-resolution images in the OST dataset are divided into seven categories according to rough texture type: animal, building, plant, grass, sky, water, and mountain. We degrade the HR data in the OST dataset and perform BSR on them. We then count, per category, the usage frequency of each code in the codebook during super-resolution. By comparing the code-usage distributions for different textures, we assess how reasonably the learned code spaces perceive texture. As observed in Fig. 13, compared to the ImageNet classification pre-trained prior, PTPM exhibits larger distribution differences between the "grass" and "plant" categories, whose semantic labels overlap more, and smaller differences between the "sky" and "water" categories, which belong to different semantic categories but have relatively similar textures. This shows that our pre-training strategy enables PTPM Net to pay more attention to local texture correlations by excluding high-level information from the pre-training.
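The per-category code-usage comparison can be sketched as follows. Category names, code ranges, and the use of total-variation distance are illustrative choices for the sketch, not the paper's exact protocol.

```python
import numpy as np

def usage_histogram(code_indices, n_codes=512):
    """code_indices: 1-D array of codebook indices selected for one category.
    Returns a normalized usage distribution over all codes."""
    hist = np.bincount(code_indices, minlength=n_codes).astype(np.float64)
    return hist / hist.sum()

def distribution_distance(p, q):
    """Total-variation distance between two code-usage distributions."""
    return 0.5 * float(np.abs(p - q).sum())

# Synthetic stand-ins: overlapping code ranges mimic texturally similar
# categories; a disjoint range mimics a texturally distinct one.
rng = np.random.default_rng(0)
grass = usage_histogram(rng.integers(0, 100, 1000))      # uses codes 0..99
plant = usage_histogram(rng.integers(50, 150, 1000))     # overlapping range
sky   = usage_histogram(rng.integers(300, 400, 1000))    # disjoint range
```

Categories drawing on overlapping code ranges yield a small distance, while disjoint ranges yield the maximum distance of 1, which is the kind of contrast Fig. 13 visualizes.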
A.5 Detailed Structure of RTCNet

Complementing the RTCNet framework described in the paper, we provide the hyperparameter details of the encoders and decoders in Fig. 12.

A.6 More Results
Extra Qualitative Comparison. We show more qualitative comparison results in Fig. 16.
Results on Real LR Images.We also provide super-resolution results on real low-resolution images in Fig.

Figure 1 :
Figure 1: Confusing LR samples with different HR textures processed with the same random degradation used in [7, 8, 66] (including various noise, blur, and compression). The MSE in RGB space on the line indicates the patch distance.

Figure 2 :
Figure 2: An illustration of our motivation. (a) Left: the previous HR-reconstruction-based codebook, trained only on HR reconstruction and requiring a second training stage for the LR encoder; right: our DTPM, which incorporates both resolutions and cross-resolution consistency during training. (b) Top: image classification-based features, susceptible to global factors such as class labels, object shapes, and contours; bottom: our PTPM prior, free from the influence of global information.

Figure 3 :
Figure 3: The RTCNet framework. (1) During training, LR and HR input images are encoded using multi-scale encoders. These features are quantized in multi-scale codebooks via DTPM. The LR and HR decoders then perform dual-resolution reconstruction. (2) During inference, only LR images are used as input; these are processed by the LR encoder and DTPM to obtain multi-scale quantized features, which are then used by the HR decoders to reconstruct super-resolution images.

Figure 4 :
Figure 4: Hierarchical structure and its training strategy, using the LR parts as an example due to the symmetry between the LR and HR pipelines except for the RSTBs in the LR Encoder.

Figure 5 :
Figure 5: PTPM consists of two main blocks: (a) patch data generation; (b) patch classification training and label refinement.

Figure 6 :
Figure 6: L2 nearest neighbors of several selected samples in different prior spaces.

Figure 7 :
Figure 7: Visual comparison with other blind super-resolution methods. The PSNR/SSIM/LPIPS values are shown at the bottom of the images. The captions in the images below have the same meaning as the descriptions provided here.

Figure 9 :
Figure 9: Visual comparison of local-scale codebooks trained with different prior features.

Figure 10 :
Figure 10: The MSE statistics of LR-HR data in validation datasets.
Fig. 11 plots the MSE (Fig. 11.a), Peak Signal-to-Noise Ratio (PSNR, Fig. 11.b), and Structural Similarity (SSIM, Fig. 11.d) of the SR results of DTPM and FeMaSR under different levels of difficulty. To better illustrate the advantages of DTPM on difficult samples, we also investigate the performance gain of DTPM over FeMaSR at different difficulty levels (Fig. 11.c). As shown in Fig. 11, compared to FeMaSR, our DTPM achieves improvements at all difficulty levels, especially on the more difficult samples. This verifies the good adaptability of DTPM to LR data: thanks to its mining of cross-resolution texture consistency, DTPM can better distinguish different types of textures and perform diverse reconstructions on harder samples.

Figure 11 :
Figure 11: Detailed comparison between DTPM and FeMaSR. (a) Distribution of MSE between LR and HR. (b) Distribution of PSNR between SR and HR. (c) Distribution of PSNR gain of DTPM over FeMaSR. (d) Distribution of SSIM between SR and HR. (e) Number distribution of image patches with different LR-HR MSE.

Figure 12 :
Figure 12: Detailed structure of the encoders and decoders of RTCNet. The HR encoder is the same as the LR encoder but without the bicubic upsampling layer and RSTBs.

Figure 13 :
Figure 13: The frequency distribution of different codes used during super-resolution on the OST dataset.

Figure 15 :
Figure 15: Reconstruction Comparison of the local-scale quantized features generated by random noise and the matching local-scale quantized features obtained from the input.

Table 4 :
Comparison of DTPM with the high-resolution reconstruction-based codebook of FeMaSR [8] on the DIV2K validation set. The '*' indicates a single-scale codebook without the hierarchical structure.

Table 5 :
Comparison of different priors used to learn the local-scale codebook in the DIV2K validation set.( † denotes the coarse prior before label refinement and fine-tuning).
LR: 14.30/0.1767/0.3771; Image-Classification Prior: 14.25/0.1682/0.3324; PTPM: 17.80/0.2147/0.5002; HR: 17.78/0.2134/0.4828

Table 6 :
Comparison between full hierarchical structure DTPM and noisy-local DTPM.