A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation

In multimedia understanding tasks, corrupted samples pose a critical challenge: when fed to machine learning models, they lead to severe performance degradation. In the past, three groups of approaches have been proposed to handle noisy data: i) enhancer and denoiser modules that improve the quality of the noisy data, ii) data augmentation approaches, and iii) domain adaptation strategies. All of these approaches come with drawbacks that limit their applicability; the first has high computational costs and requires pairs of clean-corrupted data for training, while the others can only be deployed on the same task/network they were trained on (\ie, when the upstream and downstream task/network coincide). In this paper, we propose SyMPIE to solve these shortcomings. To this end, we design a small, modular, and efficient system (just 2 GFLOPs to process a Full HD image) that enhances input data for robust downstream multimedia understanding with minimal computational cost. Our SyMPIE is pre-trained on an upstream task/network that need not match the downstream ones and does not need paired clean-corrupted samples. Our key insight is that most input corruptions found in real-world tasks can be modeled through global operations on the color channels of images or through spatial filters with small kernels. We validate our approach on multiple datasets and tasks, such as image classification (on ImageNetC, ImageNetC-Bar, VizWiz, and a newly proposed mixed corruption benchmark named ImageNetC-mixed) and semantic segmentation (on Cityscapes, ACDC, and DarkZurich), with consistent improvements of about 5\% relative accuracy across the board. The code of our approach and the new ImageNetC-mixed benchmark will be made available upon publication.

Figure 1: An illustration of our modular system (SyMPIE) for efficient image enhancement targeting increased model robustness to corruptions in different multimedia tasks. SyMPIE contains two modules, namely, NEM and DWM (see Fig. 2 for their details). SyMPIE estimates parameters to clean input samples and can be integrated into any deep network architecture for multimedia understanding. SyMPIE is fully differentiable and can be trained end-to-end on an upstream task to enhance images automatically, without the need for paired clean/corrupted images. We use upstream task to identify the one used to pre-train our system (e.g., classification) and downstream task to identify the deployment one.

INTRODUCTION
Deep learning models have been widely used in several multimedia systems and applications. Nonetheless, recent studies have explored their robustness to corrupted input data [19,20], showing critical performance degradation affecting the quality of experience of many downstream applications such as extended reality [2] and video streaming [60]. Indeed, the data-hungry nature of deep neural network (DNN) models, as well as the ever-growing complexity of network architectures, make the generated models sensitive to even mild distribution shifts, causing a severe degradation in performance [44,45,53]. These eventualities are often naturally met in many real-world applications, where data may unavoidably encounter natural alterations or corruptions [20,36], such as sensor degradation (e.g., shot noise, defocus blur) or compression artifacts, to mention a few.
Considering the increasing adoption of deep models, this issue has become of paramount relevance. Therefore, a new research field has emerged that attempts to make models more robust under different perspectives. Existing approaches to improve model robustness can be categorized into three main branches: (i) data augmentation approaches, (ii) test-time adaptation approaches, and (iii) enhancing and denoising approaches, such as autoencoder-based, generative, and adversarial denoisers (e.g., GANs, diffusion models, etc.).
Data augmentation techniques [6,22,65] either re-train AI models from scratch or fine-tune them by applying a large set of general-purpose augmentations (often synthetic) that mimic common corruptions encountered in the real world. In other words, this paradigm aims to produce a model that is robust against corrupted images and can be very effective in improving a model's ability to generalize under data distribution shift. The main drawbacks of this approach are that the trained model can only perform the task it was trained on with supervision; the trained model tends to perform well only on images with training-like distortions; and the types of augmentations must be decided a priori and cannot be easily tuned to a scenario with variable corruptions (unless via expensive retraining).
To improve model robustness when the corruptions encountered at inference time differ from those seen at training time, Test-Time Adaptation (TTA) methods have been proposed [25,26,67]. TTA methods focus on resolving data distribution shifts directly at test time via dynamic updates of pre-trained models based on the specific characteristics of the target test data [14,42,57]. TTA methods enable a model to improve its performance when encountering variations, unseen examples, or changing conditions at test time. Unlike data augmentation techniques, TTA methods do not involve expensive training procedures, but they are still bound to the same task across the pre-training and testing/adaptation phases.
Other popular techniques are enhancement and denoising, i.e., the process of improving input samples before they are fed to the downstream network [54]. The most promising techniques are nowadays based on deep learning models. For instance, autoencoders are used to encode compact representations of input samples and then decode them to minimize a reconstruction loss with respect to ground truth samples [1,32,66]. Generative and/or adversarial models (e.g., diffusion models [11,56,63] or GANs [8,29,40,41,58]) modify input samples by iterative processes or style transfer techniques. The main drawbacks of these methods are their expensive inference and their need for paired clean-corrupted samples at training time, while being able to handle in-domain corruptions only (i.e., the particular corruption types seen at training time).
Overall, current techniques all come with the key limitations we briefly discussed. In this paper, we propose a System for Modular Parametric Image Enhancement (SyMPIE) to overcome some of these limitations. SyMPIE implements a hybrid strategy combining the best of data augmentation and enhancement approaches while being computationally efficient and fully compatible with any model and any downstream task.
We propose to tackle the issue from a completely different direction, designing a modular system that predicts parameters used by an ad-hoc module to enhance the received input (see Fig. 1). We utilize generic data augmentation pipelines and a common upstream task/network (e.g., image classification with ResNet50) to pre-train our model (System for Modular Parametric Image Enhancement, SyMPIE). Our key insight is that a large portion of input corruptions found in the real world can be modeled by applying either global operations on color channels (for example, a night scene can be approximated by a darkened daytime scene) or spatial noise that can be filtered out by a fast convolution with a small kernel (for example, a Laplacian filter or a blurring operation can be used to sharpen an image or to reduce its noise). Additionally, our model is pre-trained by minimizing a standard classification task loss, alleviating the need for paired clean-corrupted samples during pre-training. In other words, the downstream task and model do not necessarily need to match the upstream task and model used for pre-training our model. For example, we may train our model on a corrupted ImageNet [21,46] benchmark for image classification using a Convolutional Neural Network (CNN), and deploy it on semantic segmentation benchmarks [49] with adverse conditions to support a transformer architecture.
Therefore, our SyMPIE is similar to enhancement and denoising approaches, with the key advantage of containing much smaller modules that are pre-trained without paired clean and corrupted ground truth.
Our contributions can be summarized as follows.
• We propose a lightweight modular image enhancement system, named SyMPIE, which predicts parameters of ad-hoc operators that are applied to input samples to improve content understanding.
• In our experimental analyses, SyMPIE consistently improves the accuracy of downstream models for the corresponding task with minimal impact on the number of compute operations (about 2 GFLOPs on top of the 343 GFLOPs of a classical ResNet50 architecture for a Full HD input) and can be seamlessly applied on top of any competing approach, improving its accuracy (up to 8% relative gain) without retraining. Unlike data augmentation methods, SyMPIE does not require long training procedures.
• Unlike existing denoising approaches, SyMPIE is fast, does not require paired clean-corrupted samples during pre-training, and can be reused in multiple setups without losing its efficacy. In particular, SyMPIE proves to be fully compatible with any convolutional and transformer-based architecture, including recent foundation models (e.g., CLIP). SyMPIE enhances the accuracy of the aforementioned models on several downstream tasks (e.g., image classification, semantic segmentation) in the presence of corrupted test data, improving the quality of experience of the final end users.

RELATED WORKS
Robust generalization. The robustness of computer vision models against distribution shifts is of vital importance in multimedia applications. Starting with the discovery of adversarial examples [52], many different robustness venues have been explored by the community, such as common image corruptions [21], images with conflicting shapes/textures [13], and style variations [30]. The core idea behind robust generalization is to make multimedia models increasingly invariant to such shifts; e.g., a model trained on a training set should generalize well to samples with unseen styles, corruptions, and perturbations. Arguably, the most common effects found on real data are image corruptions, which have been standardized and categorized in special benchmark datasets [21] for the evaluation of deep learning architectures. This is the scenario we focus on in our paper, due to its practical importance.
As mentioned in Section 1, there are various methods that aim to provide robustness to computer vision models. These methods can be categorized into three groups: i) data augmentation methods, ii) test-time adaptation methods, and iii) preprocessing (e.g., denoising and enhancement) methods. Data augmentation methods are built on a simple premise: one can achieve robustness by retraining or fine-tuning a model on a training set updated with the samples that the original model failed on (e.g., unseen distributions). Adversarial training [34] has been one of the most prominent examples of such methods, where adversarially corrupted examples are included in the training set during retraining of the model. Since then, the same core idea has been used by a plethora of methods that addressed robustness from a data augmentation perspective. AugMix and its variants [22,59] proposed to use randomly sampled augmentations in a cascaded manner to diversify the training distribution. DeepAugment proposed to use image translation models to produce new training images [18]. Learned augmentation policies tailored for improved robust generalization are proposed in [10]. The use of fractals and feature maps to create new samples [23], frequency-specific expert model ensembles [47], max-entropy augmentations [39], and frequency-spectrum perturbations [6,65] are some examples of data augmentation methods which have proven successful in combating the issues caused by distribution shifts.

Test-time adaptation. Although they currently hold the state-of-the-art on many benchmarks, data augmentation methods have a key disadvantage: they require retraining or fine-tuning every time they encounter (and fail against) new distributions, which may well be unknown in practical scenarios. Assuming one can perform these expensive updates periodically, model capacity issues as well as catastrophic forgetting [28] are likely to become new accuracy bottlenecks, which will inevitably lead to practical issues in deployment scenarios.
In order to avoid such problems, another branch of methods focuses on analyzing input images to detect whether they are of an unseen distribution. Using network prediction consistency [12,33], specialized detector models [15], and sample statistics [61] have proven to be successful for adversarial examples, and similar ideas have been extended to common image corruptions [14,25,26,42,57,67].
Going further than just detecting critical samples, more recent methods aim to improve robustness at test time. Tent [57] minimizes the test-time model prediction entropy via optimizing channel-wise affine parameters per batch. MixNorm [25] points out the inherent assumption of having samples from a single distribution in a batch, and its detriment to models with batch-norm layers. They then propose MixNorm layers, a replacement for batch-norm layers during test time that adapts to query image statistics by using novel views of the same image. MEMO [67] minimizes the entropy of predictions of the input sample and its augmented views, which improves robustness even with a single image. NOTE [14] proposes instance-aware batch-norm layers to improve robustness with a single image without the overhead of additional forward passes (e.g., for augmented views of the input sample) required by other methods. SAR [42] proposes a new entropy minimization method that takes into account samples with large gradients and avoids collapsing to trivial solutions, thus improving overall robustness performance.
Preprocessing. Test-time adaptation methods largely alleviate the issues inherent to data augmentation approaches; they perform partial updates at most, and are therefore fast and cheap. However, they still require a form of training, which demands periodic updates in deployment scenarios where data shift is ever-present.
Another branch of methods that aims to achieve robust generalization can be roughly categorized as preprocessing methods. Unlike test-time adaptation methods, these approaches do not tackle the problem as a normalization issue. Instead, they attempt to bridge the distribution gap between the query sample and the model's training set by trying to enhance and denoise the query image. Some naive preprocessing strategies, such as label smoothing, spatial smoothing, and total-variance minimization, are shown to help improve robustness [61,64]. Thinking of the corruption over the image as noise, conventional auto-encoding mechanisms can be used to purge the input image of its corruption-induced elements [1]. Essentially, these methods aim to learn an approximation of a transformation over the query image, which makes the underlying issue an image-to-image translation problem. There are many image translation methods using GANs [8,29,41,58] or more powerful diffusion models [11,56] that can be either used off-the-shelf or updated to work well in our scenario. Although these methods do not necessarily update the actual downstream task model, they still require per-distribution updates, which makes them expensive. Furthermore, the enhancement or denoising model adds an expensive step that might not be feasible for certain deployment scenarios.
The proposed approach aims to combine the best characteristics of these methods: high accuracy, fast execution time, cross-distribution and cross-task generalization, and no need for paired data during training.

A MODULAR SYSTEM FOR IMPROVING MODEL ROBUSTNESS
In this section, we introduce the main contributions of this work. We detail our two-module network architecture in Sec. 3.1. Its training procedure, which exploits exponentially smoothed versions of our modules for improved generalization, is discussed in Sec. 3.2. Finally, the inference process is described in Sec. 3.3. A detailed overview of our modules is shown in Fig. 2: SyMPIE can be inserted between the input data and any generic downstream model. Three key features distinguish our approach from existing image enhancement approaches, namely: (i) the removal of paired clean-corrupted images as a requirement for training; (ii) the prediction of enhancing parameters rather than of the cleaned images directly; (iii) the ability to generalize across several upstream and downstream tasks and networks.
Existing works, indeed, train enhancer models as a regression task on the clean images, effectively optimizing the reconstruction PSNR (Peak Signal-to-Noise Ratio). We tackle the issue from a completely different point of view, focusing instead on efficiency and downstream accuracy. Incidentally, this allows our model to be trained and fine-tuned on any available dataset, especially on real-world datasets that contain naturally corrupted images with no clean counterpart ground truth.

Noise Estimation Module (NEM).
To estimate the parameters needed for enhancement, we designed a small residual CNN, $\mathcal{N}$, which explicitly estimates the set of parameters used by the differentiable warping module $\mathcal{W}$. Motivated by the observation that most of the distortions of the input samples affect the color space only [35,38,68], we identify a set of filters and linear transformations to be applied to the input samples for their improvement.
In our setup, we aim at estimating the parameters of i) a filter kernel $K \in \mathbb{R}^{k \times k}$ (the kernel size $k$ used in the experiments is given in Sec. 4.1); ii) a linear transformation of the color channels, represented by the matrix $C_M \in \mathbb{R}^{c \times c}$; and iii) a global color shift $C_b \in \mathbb{R}^{c}$, where $c$ is the number of input channels (e.g., $c = 3$ for RGB images). This set of parameters allows our architecture to model several naturally occurring corruptions, such as under/over-exposed images, sensor noise (e.g., Gaussian, impulse, shot noise), an unbalanced white point, etc., without the need to employ expensive deep learning models for enhancement.
To estimate the above parameters while keeping the additional footprint minimal, we implement the NEM as a small residual CNN $\mathcal{N}$ (see the top half of Fig. 2). This allows the NEM module to efficiently extract global information from an input image and predict suitable parameters for the subsequent module. More in detail, $\mathcal{N}: X \mapsto (K, C_M, C_b)$ maps an input image to the set of enhancement parameters. Moreover, in the vanilla deep learning-based image enhancement formulation [1,32], the network weights used to process the images are fixed. Therefore, the models learn to partition their own parameter spaces to deal with different situations, reducing the overall efficacy and parameter efficiency. Our model, instead, explicitly predicts the parameters depending on the input; therefore, it can seamlessly adapt to multiple types of corruptions without the need to partition its parameter space.

Differentiable Warping Module (DWM).
To enhance input images using predictions obtained from the NEM, we use our Differentiable Warping Module (DWM), $\mathcal{W}$, depicted in the bottom half of Fig. 2. The module receives as input a corrupted image and the parameters estimated by the NEM, and applies the parametric operators to the former to enhance the image. Formally, we have $\mathcal{W}: (X, (K, C_M, C_b)) \mapsto \hat{X}$, where $\hat{X} \in \mathcal{X}$ is the improved image.
The operations of $\mathcal{W}$ are completely differentiable, allowing backpropagation of gradients through the DWM and optimizing the predictions of the NEM in an unsupervised manner. This enables our method to be trained without any paired data, since the gradients propagated through the DWM come directly from predictions of a frozen upstream model. In other words, the NEM is trained to enhance samples in order to maximize the performance of a frozen upstream network. This is a great advantage compared to existing works, which either cannot enhance the inputs to improve the performance of downstream models (e.g., diffusion models) or cannot effectively leverage the gradient flow to produce cleaner inputs (e.g., adversarial approaches). Furthermore, when modifying input samples directly, several approaches [37] introduce distortions that appear random and unnoticeable to the human eye but can completely change the output prediction of the model. In our setup, we avoid this by directly estimating the parameters of transformations whose kernels are translation-invariant (that is, given a translation $\tau$, the parameters are generated by a multidimensional function $\vec{\theta}(p) = \vec{\theta}(\tau(p))$ for all pixel locations $p$). In particular, our modules apply transformations to the whole image at the same time rather than multiple transformations on local subsets of pixels. Therefore, our modules enforce a stronger and visible change in the input image to allow a change in the downstream prediction. In general, our approach is optimized to enhance the content of an input and improve the performance of the upstream task, not just the perceived image quality, unlike the other approaches. The use of global operations also stabilizes the gradients received by the NEM, enabling its training with fewer samples than other approaches (e.g., compared to augmentation strategies, our approach requires up to 60 times fewer training samples).
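A minimal NumPy sketch of how the DWM might apply its three operators is given below. The symbol names ($K$ for the spatial kernel, $C_M$ and $C_b$ for the color matrix and shift) mirror the ones above, but the composition order (spatial filter first, then global color transform) and the edge padding are assumptions, not the paper's exact implementation:

```python
import numpy as np

def dwm_apply(x, K, C_m, C_b):
    """Apply the DWM's parametric operators to an image x of shape (H, W, C)
    with values in [0, 1]. K: (k, k) spatial filter shared across channels;
    C_m: (C, C) linear color-mixing matrix; C_b: (C,) global color shift."""
    k = K.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), mode="edge")
    H, W, _ = x.shape
    out = np.zeros_like(x)
    # spatial filtering: the same small learned kernel is applied to every channel
    for i in range(k):
        for j in range(k):
            out += K[i, j] * xp[i:i + H, j:j + W, :]
    # global color transform: per-pixel channel mixing plus a constant shift
    out = out @ C_m.T + C_b
    return np.clip(out, 0.0, 1.0)
```

With a delta kernel, an identity color matrix, and a zero shift, `dwm_apply` leaves the image unchanged, which matches the intuition that the DWM reduces to a no-op when no correction is needed.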

Training Process
A key benefit of our module is its end-to-end training on any given upstream task (we train it on image classification in our experiments) for its deployment on any downstream task (e.g., image classification, semantic segmentation, etc.) without further need to fine-tune it. In the following, we describe the training procedure used to optimize our model (summarized in Fig. 3) and the data augmentation strategy used during the tuning. A detailed description is provided as pseudo-code in Algorithm 1.
We use a frozen model $f$ pre-trained on a source upstream task (e.g., classification) to provide class predictions from the cleaned input samples ($\hat{P} = f(\hat{X})$), and we compare them to the labels $Y \in \mathcal{Y}$ using the cross-entropy loss $\mathcal{L}_{CE}$, to obtain $\ell_1 = \mathcal{L}_{CE}(\hat{P}, Y)$.
As will be shown in Sec. 4.4.2, a common failing point of denoisers is mode collapse upon their iterative application to the same image. To avoid this, we introduce an additional regularization term to the training loop, which employs the exponentially smoothed version of our modules (we call this $S_s$). The parameters of $S_s$ at each training iteration $t > 0$ are computed by $\theta_{s,t} = \alpha\,\theta_{s,t-1} + (1 - \alpha)\,\theta_t$, where $\alpha$ is the exponential smoothing rate and $\theta_{s,0} = \theta_0$, with $\theta_t$ being the parameters of $S$ at iteration $t$. During the optimization, $S_s$ is used to generate an intermediate image $X_s = S_s(X)$, which will show different visual cues than $X$, aiding in the generalization.
The intermediate sample is then fed to the modules currently being optimized (i.e., $S$) to obtain $\hat{X}_s = S(X_s)$, which is used in the computation of the regularization loss term. The two-step sample is then processed as normal, obtaining a prediction from the classifier model $\hat{P}_s = f(\hat{X}_s)$ to compute a new loss term $\ell_2 = \mathcal{L}_{CE}(\hat{P}_s, Y)$, weighted by $\lambda_s$.
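The two-term objective described above can be sketched as a single training iteration. This is an illustrative skeleton only: `S`, `S_s`, `f`, and `ce_loss` stand for SyMPIE, its smoothed copy, the frozen upstream classifier, and the cross-entropy loss, and are assumed names rather than the paper's code:

```python
def ema_update(theta_s, theta, alpha=0.9):
    """Exponential smoothing of parameter dicts:
    theta_s[t] = alpha * theta_s[t-1] + (1 - alpha) * theta[t]."""
    return {k: alpha * theta_s[k] + (1.0 - alpha) * theta[k] for k in theta}

def training_step(X, Y, S, S_s, f, ce_loss, lam_s=0.5):
    """One iteration of the two-term objective (illustrative sketch)."""
    l1 = ce_loss(f(S(X)), Y)        # standard term on the enhanced image
    X_s = S_s(X)                    # intermediate image from the smoothed copy
    l2 = ce_loss(f(S(X_s)), Y)      # regularization term on the two-step sample
    return l1 + lam_s * l2
```

After each optimizer step, the smoothed parameters would be refreshed with `ema_update`, so that the intermediate images gradually track the current enhancer without collapsing onto it.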
Our modules attempt to enhance input samples that could not otherwise be handled effectively by popular downstream networks. For this purpose, we design a set of augmentations that serves as a proxy for the distortions experienced at deployment time, as we describe next.

Data Augmentation Pipeline.
To train our modules, we employ a varied data augmentation pipeline serving two main purposes: (i) to encourage generalization of our architecture over a vast array of possible input corruptions, and (ii) to reflect conditions that are seen in the real world during deployment. In particular, we trained our architecture on the ImageNet-1k [46] dataset augmented using the corruptions proposed in [21] together with four additional ones that mimic adverse weather conditions at variable degrees of severity.
The four additional corruptions are the following: (1) The darken corruption mimics the effect of under-exposure of the scene by reducing the intensities of all pixels in a consistent fashion. For example, this situation may happen after encountering glare that forces cameras to reduce the exposure before re-adjusting. A visual example of the resulting images can be seen in the first row of
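The darken corruption above amounts to a uniform intensity scaling. A minimal sketch follows; the specific severity-to-factor mapping is an assumption for illustration, not the paper's exact parameterization:

```python
import numpy as np

def darken(x, severity=1):
    """Uniformly scale pixel intensities of image x (values in [0, 1]) down,
    mimicking a camera dropping its exposure after glare.
    The 0.15-per-severity-level factor is an assumed, illustrative mapping."""
    factor = max(0.0, 1.0 - 0.15 * severity)
    return np.clip(x * factor, 0.0, 1.0)
```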

Inference Process
The goal of our system is to be as easy to use and modular as possible. Therefore, we strove to keep the inference process footprint minimal. In Fig. 1, we report a comparison between the standard inference practice on a downstream task (top half) and the modified pipeline using our modules (bottom half). Notice how our modules can be seamlessly embedded in any architecture to improve its final accuracy.
In detail, to use our system on a given input sample, we first normalize and standardize it (as detailed in Sec. 4.1). Then, for faster processing, we downsize a copy of the input image so that its smallest dimension equals 232 pixels and extract a central square crop of 224 pixels from it. This low-resolution cropped version is fed to the NEM, which predicts the parameters to enhance the image. Finally, the predicted parameters and the full-resolution image are fed to the DWM to clean the image and enhance its content.
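The resize-and-crop step that prepares the NEM's low-resolution input can be sketched as follows. Nearest-neighbour resizing is used here purely for self-containment; the interpolation actually used is an assumption, and in practice a library resize (e.g., bilinear) would be preferred:

```python
import numpy as np

def nem_input(x, short_side=232, crop=224):
    """Downsize image x (H, W, C) so its smallest dimension equals
    `short_side`, then take a central `crop` x `crop` square."""
    H, W, _ = x.shape
    s = short_side / min(H, W)
    nH, nW = round(H * s), round(W * s)
    # nearest-neighbour resize via index sampling (illustrative)
    rows = np.clip((np.arange(nH) / s).astype(int), 0, H - 1)
    cols = np.clip((np.arange(nW) / s).astype(int), 0, W - 1)
    r = x[rows][:, cols]
    # central square crop
    t, l = (nH - crop) // 2, (nW - crop) // 2
    return r[t:t + crop, l:l + crop]
```

For a Full HD frame (1080×1920), this yields a 232×412 resized copy and a 224×224 central crop, regardless of the input aspect ratio.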
The resulting output can either be used as-is, if the final objective is attaining a cleaned image, or fed to any downstream model to attain a more accurate prediction on the considered task. Overall, the additional computational overhead is minimal, with our modules supporting up to about 300 fps at Full HD resolution (see Sec. 4

RESULTS AND DISCUSSION
To validate the generalization capability of our modular system, we evaluate its performance on two main tasks: image classification (Sec. 4.2) and semantic segmentation (Sec. 4.3). In the following, we report the quantitative and qualitative results attained in various tasks, discussing them and drawing comparisons with competing strategies.

Implementation Details
Experimental Setup. To highlight the generalization capability of our approach, we trained our modules only once, on the upstream classification task on the ImageNet dataset. The optimization lasted for 50 steps, using batch size 384 and the Adam [27] optimizer with learning rate $10^{-3}$, scheduled according to a cosine annealing strategy. We remark that when optimizing SyMPIE, the upstream module can be completely frozen, reducing the computational complexity of training. This is the scenario we consider in our work. Furthermore, the training is done once, and the same pre-trained weights are used for all the downstream models and tasks.
Our NEM is implemented using a 3-block architecture with strong downsampling (i.e., with stride 4) between layers. A detailed breakdown is reported in Tab. 1. In total, our module uses seven 2D convolutions and a single fully connected layer to project the downsampled features into the space of parameters $(K, C_M, C_b) \in \mathbb{R}^{37}$. We considered a kernel size $k = 5$, which yields $k^2 + c(c+1) = 37$ parameters in our setup. During training, the exponential moving average rate was set to $\alpha = 0.9$, and the weight factor for the regularization loss was set to $\lambda_s = 0.5$. Note that the normalization of the input images at inference time must match the normalization seen during training. In our case, $X$ is normalized in the range $[0, 1]$ and standardized using mean $\mu = [0.485, 0.456, 0.406]$ and standard deviation $\sigma = [0.229, 0.224, 0.225]$.
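The parameter-space dimensionality above can be checked directly: the kernel contributes $k^2$ values, the color matrix $c^2$, and the color shift $c$, i.e., $k^2 + c(c+1)$ in total:

```python
k, c = 5, 3                  # kernel size and number of color channels (RGB)
dim = k * k + c * (c + 1)    # |K| + |C_M| + |C_b| = 25 + 9 + 3
print(dim)                   # → 37
```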
Datasets. We employed our system on several real and synthetic datasets to verify its effectiveness in different scenarios for the image classification task (Sec. 4.2). In particular, we used the ImageNetC [21] and ImageNetC-Bar [38] datasets featuring synthetic corruptions (both based on ImageNet [46]) and the real dataset VizWiz [3] featuring natural corruptions. We also introduced ImageNetC-mixed, a new benchmark to investigate the reliability of models when presented with multiple corruptions at once.

Table 2: Results for the image classification task on the ImageNetC dataset (higher is better).
We built it by randomly applying 1-to-3 corruptions to the same image, sampled from the joint pool of augmentations of our data augmentation pipeline and those proposed in ImageNetC.
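The construction of a mixed-corruption sample can be sketched as below. The function and pool names are illustrative assumptions; the key points from the text are that 1-to-3 corruptions are drawn from the joint pool and composed on the same image:

```python
import random

def mix_corruptions(x, pool, rng):
    """Apply 1-to-3 corruptions, drawn without replacement from the joint
    augmentation pool, to the same image (illustrative sketch)."""
    for op in rng.sample(pool, rng.randint(1, 3)):
        x = op(x)
    return x
```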
For what concerns semantic segmentation (Sec. 4.3), we analyzed the models using driving scenes, tackling the domain adaptation problem for clear-to-adverse weather conditions. In this case, we employed three real-world datasets: i) Cityscapes [9] as the training (source) domain; ii) ACDC [49] and iii) DarkZurich [48] as the testing downstream (target) domains.
Metrics. For image classification, we report the per-corruption accuracies (Acc, ↑), their mean (Corr. Avg., ↑), and the accuracy on corruption-free data (Clean, ↑). For semantic segmentation, we report the results as per-class IoU (Intersection over Union) or its mean (mIoU, ↑). In all cases, Δ (%Δ) refers to the absolute (relative) gain with respect to the considered baseline.

Results on Single Synthetic Corruptions (ImageNetC).
For this discussion, we refer to Table 2, where we report the final accuracy attained by our approach when mounted on multiple different backbones. In particular, we use a ResNet50 [16] with different pre-training weights (TorchVision [55] V2 and V1, HA [65], PRIME [39], and PIXMIX [24]), a VGG16 [50], a Swin-Tiny [31], and finally the CLIP [43] foundation model. Notably, we train our SyMPIE model only once, using ResNet50 with TorchVision's V2 weights, and we use the trained model as-is in all other experiments, highlighting the generalization capability of our approach.
Our SyMPIE consistently improves the average accuracy across the considered models, with an average gain of 2.2% in absolute terms and an average relative gain of 5.0%. As anticipated, a remarkable feature of SyMPIE is its capability to improve the performance even of architectures designed to work well on the ImageNetC task, such as HA [65], PRIME [39], and PIXMIX [24]. These are three strong state-of-the-art data augmentation approaches and, therefore, can handle corrupted samples from ImageNetC better than weaker data augmentation approaches. Keeping this in mind, the improvement brought by our approach when used jointly with these architectures is even more striking, signifying that the effect of our module is complementary to existing state-of-the-art approaches, and using both strategies together can improve the absolute accuracy significantly. Numerically, when using our system together with HA, we are able to improve the performance over TorchVision's V1 baseline by almost 20% in absolute terms (49.4% relative).
Moreover, the gain is well spread across the various corruptions and, in cases particularly suited to our constrained model, we can appreciate some higher gains. For example, on Motion Blur and Snow, we get significant performance gains (up to 9% in absolute terms) even on already-robust backbones like those pre-trained via HA. We believe that the motion blur kernel can be estimated accurately with our learned filter, whereas our affine color transforms handle the low contrast brought by snow, regardless of the localized bright spots. We experience slight performance drops on the Fog corruption, due to changes to the frequency distribution that our system may introduce via the global filter. Methods such as HA are designed with frequency-spectrum changes in mind; therefore, they expect a distribution of input images with specific frequency characteristics. Another performance drop is observed on the Brightness corruption. This corruption is modeled as a non-linear change in the HSV color space, and our affine color transforms could not approximate the underlying function accurately. Note that these limitations introduce only marginal drops in accuracy and do not change the overall improvements brought by our system.

Lastly, we discuss the results when employing our approach with downstream architectures different from the one used for pre-training our modules. We verify the usefulness of our method on the widely used VGG16 convolutional architecture and on a transformer-based architecture (i.e., Swin-Tiny). Then, we considered the CLIP foundation model [43], which was pre-trained to align image-text embeddings to the same semantic value, rather than to explicitly recognize the input image category. Our modules bring improvements even in this case, highlighting how the approach is able not only to change the graphical appearance but also to highlight the semantic content, making it easier for the downstream network.

Results on Mixed Synthetic Corruptions (ImageNetC-mixed).
Following the tests on a single corruption at a time, we consider the more challenging setting of multiple simultaneous corruptions using the hereby proposed ImageNetC-mixed dataset (refer to the datasets section in Sec. 4.1 for more information). The results of these analyses are shown in Table 3. While the existing data augmentation approaches lose about 10% accuracy compared with the single-corruption case, our approach maintains a stable gain of 2.0% in absolute terms, corresponding to a 5.0% relative improvement. This proves the ability of our approach to handle the composition of input corruptions, as often encountered in practice.
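The idea of composing corruptions can be illustrated with a small sketch. Here `low_contrast` and `darken` are toy stand-ins for illustration only (not the actual ImageNetC corruption implementations), chained sequentially as in the construction of a mixed-corruption sample:

```python
import numpy as np

# Toy corruption functions; illustrative stand-ins, not the ImageNetC code.
def low_contrast(img, factor=0.5):
    # Compress pixel values around the image mean.
    mean = img.mean()
    return (img - mean) * factor + mean

def darken(img, amount=40.0):
    # Shift all pixels toward black, clipping to the valid range.
    return np.clip(img - amount, 0.0, 255.0)

def compose(*corruptions):
    """Chain several corruptions into one, applied left to right."""
    def apply(img):
        for corruption in corruptions:
            img = corruption(img)
        return img
    return apply

mixed = compose(low_contrast, darken)  # contrast reduction, then darkening
```

Because corruptions do not commute in general, the composed distortion differs from either corruption alone, which is what makes this setting harder than the single-corruption benchmark.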

Results on Real-World Corruptions (VizWiz).
In the previous sections, we investigated the performance of our modules on synthetic corruptions, following the standard evaluation pipeline [21,22]. However, it is important to verify that the performance improvement is maintained on real data as well. To this end, we used the corrupted real-world VizWiz dataset [3]. The numerical results are reported in Table 4, where we observe a 1.2% relative gain on corrupted images, while also improving the accuracy on clean images by a relative 0.4%.

Computational Cost.
We report the complexity of our approach in FLOPs (Floating Point Operations) in Table 5, comparing it to other architectures that can be used as (or converted into) input-level denoisers. We computed the cost of our approach using the PTFlops library [51] at a resolution of 1920×1080 px. Our module is one to four orders of magnitude faster than the diffusion-model competitors. Moreover, its computational complexity does not scale significantly with the input resolution, since a resizing stage is performed before parameter estimation and the only operations applied to the full-resolution image are the spatial convolution and a matrix multiplication over the channels (see Sec. 3.3). This means that we can handle images of arbitrary resolution, contrary to, e.g., fixed-size diffusion models. Similar considerations also hold for auto-encoders and GANs. This confirms the efficiency of our strategy.
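To see why the cost scales mildly with resolution, note that the only full-resolution operations are one small spatial convolution and one per-pixel channel matrix multiplication. A rough FLOP count can be sketched as follows; the kernel size and the multiply-add accounting are illustrative assumptions, not the exact configuration of our module:

```python
def full_res_flops(height, width, kernel=5, channels=3):
    """Approximate FLOPs of the full-resolution stage of a SyMPIE-like enhancer.

    Assumes one kernel x kernel spatial filter over `channels` channels plus a
    per-pixel channels x channels color-matrix multiply (both are assumptions).
    """
    pixels = height * width
    conv_flops = pixels * channels * 2 * kernel * kernel  # multiply-adds of the spatial filter
    color_flops = pixels * 2 * channels * channels        # per-pixel color-matrix multiply
    return conv_flops + color_flops

# Full HD input: under these assumptions the full-resolution stage costs
# roughly 0.35 GFLOPs, well within the ~2 GFLOPs total budget.
print(full_res_flops(1080, 1920) / 1e9)
```

The count is linear in the pixel count and independent of any backbone depth, which is why arbitrary input resolutions remain affordable.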
Finally, we computed the throughput of our module on an NVIDIA GTX 1080Ti GPU, obtaining a speed of 289.4 images/second, corresponding to a per-image inference time of about 3 ms. This means that our approach can be easily employed in real-time applications with minimal impact on the inference time of downstream architectures.

Results for Semantic Segmentation
The analyses up to this point have been on the same task used for training (i.e., image classification), with some changes in the downstream architecture. However, to truly show the generalization capabilities of our approach, we change the downstream task completely and verify that the accuracy improvement is preserved. For this investigation, we choose the semantic segmentation task and, in particular, domain adaptation to adverse weather conditions. We train the baseline architecture used in [4,5] (DeepLabV2 [7] with a ResNet50 backbone [17]) on the Cityscapes [9] dataset and deploy it on two adverse-weather datasets, namely ACDC [49] and DarkZurich [48].
The quantitative results of this study are reported in Table 6, where our architecture improves the mIoU score by a relative 4.1% and 3.8% on ACDC and DarkZurich, respectively. We report some qualitative results on ACDC in Fig. 6, showing the original RGB image, the image warped by our architecture, and the corresponding predictions made by the segmentation model. In the first image (snow scenario), our model improves the overall contrast. This aids in distinguishing the sidewalk on the right of the scene (light blue box), which was missed almost completely by the segmentation system when shown the original data. Moreover, it helps to recognize the building more accurately (light green box). In the second image (night scene), instead, our method brightens the scene, allowing the person on the side (light green box) to be identified, whereas they were not recognized by the source-only architecture.

CONCLUSIONS
In this paper, we introduced a novel, modular, and efficient system that predicts explicit parameters for content enhancement of an input image, targeting improved accuracy on downstream tasks. The computational footprint of the modules is minimal, being more than 10x faster than competing approaches (2 GFLOPs in total), with a throughput of about 300 images per second. A key feature of our approach is the capability of training without any paired (clean/corrupted) input samples, instead learning automatically the most suitable transformations of an input image using supervision on an upstream task. Even more remarkably, once the system has been trained on a given upstream task, it can generalize to arbitrary downstream tasks without fine-tuning. To confirm the generalization claim, we validated our approach on three classification datasets and two semantic segmentation datasets, achieving noticeable improvements in all of them.
In future work, we plan to introduce new parameterizations and operators to capture a richer representation of the input and estimate the enhancement parameters more accurately (e.g., features of frequency components, or non-spatially-uniform operators), while maintaining high efficiency and full differentiability, which are fundamental for our approach. Another avenue of research will target compressed video data, investigating the effect of re-using parameters across multiple frames with the goal of further improving the computational efficiency. A final plan involves the investigation of a multi-task training scenario, allowing the model to learn more general transformations.

3.1 A System for Modular Parametric Image Enhancement (SyMPIE)
Our System for Modular Parametric Image Enhancement (SyMPIE) is made of two modules. The first (Sec. 3.1.1) is the Noise Estimation Module (NEM), implemented by a small CNN that estimates from training samples a set of parameters used by the second module. The second (Sec. 3.1.2) is the Differentiable Warping Module (DWM), which processes input samples in order to remove distortions from them. To obtain the final SyMPIE architecture, we combine the two modules into a single block as SyMPIE = DWM ∘ NEM, where ∘ denotes module composition.
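A minimal PyTorch sketch of this composition is given below. The layer sizes, the downsampling resolution, and the parameter layout (a 3×3 color matrix C1, a color bias C2, and a k×k spatial kernel K) are illustrative assumptions; the actual NEM and DWM designs are described in Secs. 3.1.1 and 3.1.2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NEM(nn.Module):
    """Noise Estimation Module: a small CNN regressing warping parameters
    from a downsized copy of the input (architecture here is illustrative)."""
    def __init__(self, k=5):
        super().__init__()
        self.k = k
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # 9 entries for the color matrix C1, 3 for the bias C2, k*k for kernel K
        self.head = nn.Linear(32, 9 + 3 + k * k)

    def forward(self, x):
        # Parameter estimation runs at a fixed small resolution.
        p = self.head(self.backbone(F.interpolate(x, size=(64, 64))))
        C1 = p[:, :9].view(-1, 3, 3)
        C2 = p[:, 9:12].view(-1, 3, 1, 1)
        K = p[:, 12:].view(-1, 1, self.k, self.k)
        return C1, C2, K

def dwm(x, C1, C2, K):
    """Differentiable Warping Module: per-image affine color transform
    followed by spatial filtering with the predicted kernel."""
    b, c, h, w = x.shape
    x = torch.einsum('bij,bjhw->bihw', C1, x) + C2           # global color transform
    K = K.repeat(1, c, 1, 1).reshape(b * c, 1, *K.shape[-2:])
    x = F.conv2d(x.reshape(1, b * c, h, w), K,
                 padding=K.shape[-1] // 2, groups=b * c)      # per-image spatial filter
    return x.reshape(b, c, h, w)

def sympie(x, nem):
    # SyMPIE = DWM ∘ NEM
    return dwm(x, *nem(x))
```

Every operation in the chain is differentiable, so a call like `sympie(x, NEM())` can be trained end-to-end through the downstream-task loss, with no paired clean/corrupted supervision required.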

Figure 2: A detailed scheme of our modules working together to enhance the content of an image. The Noise Estimation Module (NEM) receives a corrupted input and predicts a triple of parameters (C1, C2, K). These parameters are used by the Differentiable Warping Module (DWM) to enhance the image using parametric operators.
where X is a corrupted input, 𝒳 denotes the space of input images, and (K, C1, C2) are the parameters used for the cleaning process. More implementation details are given in Sec. 4.1.

Figure 3: An overview of the training procedure of our modular system (see also Algorithm 1).

(2) The horizon corruption shifts the white point of the input image to mimic haze in sunset/sunrise scenes. The new white point (RGB) is sampled uniformly in the range [255, 192+δ, 192−δ], δ ∼ U[−32, 32], as shown in Fig. 5a. An example is reported in the second row of Fig. 4. (3) The night corruption simulates nighttime acquisition. We tackle this in two ways: (i) changing the white point to a dark blue (uniformly sampled in the color range [32, 32, 64+δ], δ ∼ U[−24, 24], as shown in Fig. 5b) and (ii)
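These white-point sampling rules can be written down directly. A small sketch (the offset variable is named `d` here; RGB values in [0, 255]):

```python
import random

def sample_horizon_white_point(rng=random):
    """Horizon: warm white point [255, 192+d, 192-d], d ~ U[-32, 32]."""
    d = rng.uniform(-32.0, 32.0)
    return (255.0, 192.0 + d, 192.0 - d)

def sample_night_white_point(rng=random):
    """Night: dark-blue white point [32, 32, 64+d], d ~ U[-24, 24]."""
    d = rng.uniform(-24.0, 24.0)
    return (32.0, 32.0, 64.0 + d)
```

Note that the horizon rule perturbs the green and blue channels symmetrically around 192, so the sampled white point shifts between warmer and cooler tints while red stays saturated.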

Figure 4: Qualitative examples of our new input corruptions.
(a) Range for horizon. (b) Range for night.

Figure 5: Color ranges used to sample new white points when applying horizon and night corruptions.
[Table: per-corruption accuracy of our approach — columns: Defocus, Glass, Motion, Zoom, Contrast, Elastic, JPEG, Pixelate, Gaussian, Impulse, Shot, Brightness, Fog, Frost, Snow, Clean, Corr. Avg., Δ, %Δ]