Asymmetric Dual-Decoder U-Net for Joint Rain and Haze Removal

This work studies the joint rain and haze removal problem. In real-life scenarios, rain and haze, two often co-occurring weather phenomena, can greatly degrade the clarity and quality of scene images, leading to a performance drop in visual applications such as autonomous driving. However, jointly removing rain and haze from scene images is ill-posed and challenging: the presence of haze and rain and the change of atmospheric light can both degrade the scene information. Current methods focus on the contamination removal part, ignoring the restoration of the scene information affected by the change of atmospheric light. We propose a novel deep neural network, named Asymmetric Dual-decoder U-Net (ADU-Net), to address this challenge. The ADU-Net produces both a contamination residual and a scene residual to efficiently remove rain and haze while preserving the fidelity of the scene information. Extensive experiments show our work outperforms existing state-of-the-art methods by a considerable margin on both synthetic and real-world benchmarks, including RainCityscapes, BID Rain, and SPA-Data. For instance, we improve the state-of-the-art PSNR value by 2.26/4.57 on RainCityscapes/SPA-Data, respectively. Code will be made freely available to the research community.


I. INTRODUCTION
When photographing in bad weather, the quality of outdoor scene images can be greatly degraded by contamination, i.e., rain, haze, snow, etc., distributed in the air. Such contamination absorbs or disperses the scene light, thereby reducing the contrast and color fidelity of the scene image. Hence, the existence of contamination significantly affects many real-world vision systems, such as scene recognition, object tracking, and semantic segmentation, all of which are essential for autonomous driving [1]-[3]. In other words, such outdoor vision systems, which work efficiently in ideal weather conditions, suffer a sharp performance drop under complex real-world weather conditions. Therefore, it is essential to develop algorithms that restore images contaminated by different contaminants as a pre-processor for such outdoor vision systems.
In this work, we focus on a real yet less-investigated scenario, the co-occurrence of rain and haze in scenes. Both image rain removal and haze removal are challenging low-level computer vision tasks. Many efforts have been made to solve the individual rain removal and haze removal tasks [4]-[6]. However, only a few works consider removing rain and haze jointly in scene images [7]-[9]. In real-world scenarios, it is very common for rain and haze to co-occur in a rainfall environment (see Fig. 1a) [10]. Along with rain streaks and raindrops, uneven haze will also obscure the image, interfering with the perception of the environment. Such a scenario brings challenges to outdoor vision systems that are required to jointly remove rain and haze from images.
Existing methods for single-image rain and haze removal can be roughly divided into two categories: prior-knowledge-oriented approaches and data-driven approaches. The prior-knowledge-based image rain removal [11]-[13] and haze removal methods [14]-[16] are mostly based on physical imaging models. However, such solutions suffer from robustness issues when deployed in real-world scenarios [17], [18]. Recent advances in deep learning demonstrate dramatic success in haze removal [19]-[21] and rain removal [22]-[24].
Learning-based methods in both fields have achieved cutting-edge performance on synthetic datasets. However, methods designed for a certain contamination cannot handle the complex real-world scenario where rain and haze co-occur in natural scenes. Recent studies have also pointed out the necessity of joint removal: Han et al. [9] decompose rain and haze with a Blind Image Decomposition Network, and Kim et al. [8] remove rain and haze with a frequency-based model. A new dataset for benchmarking joint rain and haze removal, named RainCityscapes, has also been proposed to facilitate research on this important task [7]. Thus, this joint-removal task is an open problem in the community and calls for further study.
Recent advances in low-level computer vision have made remarkable progress, where a well-trained deep neural network can almost perfectly remove the contamination in outdoor scene images. However, no existing work pays attention to the scene difference in the restoration process. We observe that the true residual, obtained by (Input − Ground Truth) (see Fig. 1c), contains scene information. That is, a neural network designed to focus on contamination may suffer from a gap in recovering the scene. Such a gap motivates us to develop a unified method to remove the contamination and compensate for the scene information in one go.
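The observation above can be reproduced with a toy example (a hypothetical sketch with synthetic arrays, not the paper's data): the true residual is the pixel-wise difference between the degraded input and the ground truth, and it carries both the contamination and a global scene-level (photometric) change.

```python
import numpy as np

# Synthetic stand-ins for a degraded input and its ground truth (values in [0, 1]).
rng = np.random.default_rng(0)
gt = rng.uniform(0.2, 0.8, size=(3, 4, 4))    # clean scene, C x H x W
contamination = np.zeros_like(gt)
contamination[:, 1, 1] = 0.15                 # a "rain streak" at one pixel
airlight_shift = 0.05                         # global atmospheric-light change
degraded = gt + contamination + airlight_shift

# True residual = Input - Ground Truth: it contains both the local
# contamination and the global photometric change of the scene.
residual = degraded - gt
print(residual[:, 1, 1])   # contamination + airlight shift at the streak
print(residual[0, 0, 0])   # airlight shift only, elsewhere
```

This is why a decoder trained to predict only the contamination leaves a photometric gap behind, motivating the separate scene residual branch.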
In real-world scenarios, the weather condition is complex; different components, such as rain streaks and haze, may co-occur in the scene. The occurrence of some components, e.g., heavy haze, impacts the atmospheric light. As a consequence, the scene information at the photometric level can be degraded. Physically speaking, along with removing contamination in the image, it is also necessary to restore the scene information affected by the change of atmospheric light. To address this issue, we propose a novel dual-branch architecture, called Asymmetric Dual-decoder U-Net (ADU-Net). The ADU-Net consists of a single-branch encoder and asymmetric dual-branch decoders. In the asymmetric dual-branch architecture, one branch, the contamination residual branch, is designed to remove the contamination (see Fig. 1d). The other branch, the scene residual branch, performs the recovery of scene information (see Fig. 1e). The contamination residual branch, equipped with a novel channel feature fusion (CFF) module and window multi-head self-attention (W-MSA), produces the contamination residual. The scene branch, powered by a novel global channel feature fusion (GCFF) module and the shift-window multi-head self-attention (SW-MSA) mechanism, aims to preserve the scene information via the scene residual. The joint efforts of the contamination residual and the scene residual separate the rain and haze from the input scene image while preserving the scene of the image (see Fig. 1f). The proposed ADU-Net effectively removes different contamination in images and compensates for the scene information on multiple benchmark datasets, including RainCityscapes [7], BID Rain [9], and SPA-Data [24].
Our contributions can be summarized as follows:
• We propose a novel yet efficient neural architecture, ADU-Net, to jointly remove rain and haze in scene images.
• We present an asymmetric dual-decoder, which removes the contamination while compensating for the scene information of the image. To the best of our knowledge, this is the first work to consider the recovery of scene information in deraining and dehazing tasks.
• Extensive experiments, including quantitative and qualitative studies, are conducted to evaluate the effectiveness of the ADU-Net. Empirical evaluation shows our method outperforms the current state-of-the-art methods by a considerable margin.

II. RELATED WORK

A. Single-image Rain Removal
The earliest single-image rain removal methods were based on prior knowledge. Morphological component analysis (MCA) [11] employs bilateral filters to extract high-frequency components from rain images, where the high-frequency components are further decomposed into "rain components" and "non-rain components" through dictionary learning and sparse coding. Luo et al. [12] proposed a single-image rain removal algorithm based on mutual exclusion dictionary learning. Gaussian mixture model prior knowledge [13] was utilized to accommodate multiple orientations and scales of rain streaks. In [17], Zhu et al. detected the approximate regions where the rain streaks were located to guide the separation of the rain layer from the background layer.
However, early models based on prior knowledge often suffer from a lack of stability in real scenarios [11]-[13]. Since 2017, deep learning approaches have been developed for rain removal. Deep detail networks [25] narrowed the mapping from input to output and combined prior knowledge to capture high-frequency details, keeping the model focused on rain streak information. By adding an iterative information feedback network, JORDER [26] used a binary mapping to locate rain streaks. A non-locally enhanced encoder-decoder structure [27] was proposed to capture long-range dependencies and leverage the hierarchical features of the convolutional layers. In [28], Li et al. proposed a deep recurrent convolutional neural network to progressively remove rain marks located at different depths. A density-aware multi-stream connectivity network was introduced for rain removal in [29]. By adding constraints to the cGAN [30], Zhang et al. [22] generated more photo-realistic results. A progressive contextual aggregation network [23] was proposed as a baseline for rain removal. A real-world rain dataset was constructed by Wang et al. [24], who also incorporated spatial perception mechanisms into deraining networks. Recently, Zhu et al. [31] proposed a gated non-local depth residual network for image rain removal.

B. Single-image Haze Removal
Similar to image rain removal methods, early work on image dehazing tended to employ statistical methods to acquire prior information by capturing patterns in haze-free images. Representative methods include the dark channel prior [14], the color-line prior [15], and the color attenuation prior [16]. However, prior-based methods tend to distort colors and thus produce undesirable artifacts [14]-[16]. In the deep learning era, methods no longer rely on prior knowledge but instead estimate the atmospheric light and the transmission map directly. For example, Cai et al. [32] proposed an end-to-end dehazing model named DehazeNet, where haze-free images are produced by learning the transmission rate. Similarly, Ren et al. [33] employed multi-scale deep neural networks to learn the mapping between foggy images and their corresponding transmission maps, aiming to reduce the error in estimating the transmission maps. AODNet [19] reconstructed the atmospheric scattering model by leveraging an improved convolutional neural network to learn the mapping between foggy and clean pairs. In [34], a single network was proposed to simultaneously learn the intrinsic relationship between transmission maps, atmospheric light, and clean images. Ren et al. [35] built an encoder-decoder neural network to enhance the dehazing process. A network with an enhancer and two generators was proposed by Qu et al. [36]. Chen et al. [21] proposed a patch-map-based PMS-Net to effectively suppress the color distortion issue.

C. Joint Rain and Haze Removal
In this line of research, Hu et al. [7] built an imaging model for rain streaks and haze based on the visual effect of rain and the scene depth map to synthesize a realistic dataset named RainCityscapes. Han et al. [9] constructed a superimposed image dataset and proposed a simple yet general Blind Image Decomposition Network to decompose rain streaks, raindrops, and haze in a blind image decomposition setting. Kim et al. [8] proposed a frequency-based model for removing rain and haze, which divides the input image into high-frequency and low-frequency parts with a guided filter and then employs a symmetric encoder-decoder network to remove rain and haze separately. Based on prior knowledge, Liang et al. [18] proposed a three-stage model, which (1) utilizes the dark channel prior and depth information to dehaze the low-frequency part of the input image, (2) employs a residual network to remove rain streaks in the high-frequency part, and (3) introduces a cGAN [30] to refine the local details of the restored image.

III. METHOD
This section details the proposed method in a top-down fashion: starting from the problem formulation of our application, followed by the architecture of the proposed Asymmetric Dual-decoder U-Net (ADU-Net) and its building block, namely asymmetric dual-decoder block (ADB).
Notations. Throughout the paper, we use bold capital letters to denote matrices or tensors (e.g., X), and bold lower-case letters to denote vectors (e.g., x).

A. Problem Formulation
Let a third-order tensor, I ∈ R^(C×H×W), denote an input image, where C, H, and W denote the channel, height, and width of the image, respectively. In our application, both rain and haze are synthesized into the original scene images to form the input images. Each input image I is labelled with its ground-truth image I_gt without rain and haze in the scene. Our ADU-Net f_θ, consisting of a single-branch encoder f_E and an asymmetric dual-decoder f_AD, removes the rain and haze in the input image, such that the output of the ADU-Net, Y = f_θ(I), restores its ground-truth scene I_gt. The ADU-Net is trained to learn a set of parameters, θ*, with minimum empirical objective value L(I_gt, Y).
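In symbols, the training described above can be written as empirical risk minimization over the paired training set, consistent with the definitions in this subsection (N, the number of training pairs, is introduced here for notation only):

```latex
\theta^{*} = \arg\min_{\theta}\; \frac{1}{N}\sum_{n=1}^{N}
\mathcal{L}\big(\mathbf{I}^{(n)}_{gt},\, f_{\theta}(\mathbf{I}^{(n)})\big),
\qquad f_{\theta} = f_{AD} \circ f_{E}.
```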

B. Network Overview
We first give a sketch of the proposed ADU-Net. In rain and haze removal applications, one ideal option is to employ a deep neural network to understand the scene of the input image and separate the rain and haze from it. In our work, we develop the ADU-Net to remove the rain and haze jointly. As shown in Fig. 2, the ADU-Net is stacked of a single-branch encoder and an asymmetric dual-decoder. In the encoder f_E, we have five convolutional blocks, each denoted by Conv_i, 0 ≤ i ≤ 4, and the output of each convolutional block is denoted by F_i. A following asymmetric dual-decoder f_AD then aims to recover the scene image without rain and haze (see Fig. 2). The proposed asymmetric dual-decoder is stacked of a set of ADBs, which produce two streams of latent features, denoted by Z_j^c and Z_j^s in the j-th ADB. Specifically, the processing can be formulated as (Z_0^c, Z_0^s) = ADB_0(F_3, F_4) or (Z_j^c, Z_j^s) = ADB_j(Z_{j-1}^c, Z_{j-1}^s, F_{3-j}), j = 1, 2, 3. After the last ADB, each stream of latent features is fed to a convolutional block to produce the contamination residual and the scene residual, which are combined with the input image to form the output Y. The network is optimized by the negative SSIM loss [37], L_SSIM = −SSIM(I_gt, Y). Note that the common practice uses both the negative SSIM loss and the MSE loss as the objective. Empirically, we observed that the negative SSIM loss alone works better in the proposed ADU-Net, which is justified in § IV-D.
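As a hedged illustration of the objective, the negative SSIM loss can be sketched as below. This is a simplified global SSIM computed over the whole image, not the windowed, Gaussian-weighted SSIM of [37]; the constants C1 and C2 use the standard values for intensities in [0, 1].

```python
import numpy as np

def neg_ssim_loss(y, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """Negative SSIM loss, L_SSIM = -SSIM(gt, y).

    Simplified: statistics are computed globally over the image rather
    than in local Gaussian windows as in the full SSIM formulation.
    """
    mu_y, mu_g = y.mean(), gt.mean()
    var_y, var_g = y.var(), gt.var()
    cov = ((y - mu_y) * (gt - mu_g)).mean()
    ssim = ((2 * mu_y * mu_g + c1) * (2 * cov + c2)) / (
        (mu_y ** 2 + mu_g ** 2 + c1) * (var_y + var_g + c2))
    return -ssim

rng = np.random.default_rng(1)
img = rng.uniform(0.0, 1.0, size=(3, 8, 8))
print(neg_ssim_loss(img, img))               # identical images -> -1.0
noisy = np.clip(img + 0.1 * rng.standard_normal(img.shape), 0.0, 1.0)
loss = neg_ssim_loss(noisy, img)             # degraded image -> loss > -1.0
```

Minimizing this loss drives the restored image toward structural agreement with the ground truth, which is why it is a common deraining objective.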

C. Asymmetric Dual-decoder Block
In this part, we describe the asymmetric dual-decoder f_AD in ADU-Net. As shown in Fig. 2, f_AD consists of four ADBs and a convolutional block, while the ADBs come in two different instantiations (e.g., ADB_0 vs. ADB_j, j = 1, 2, 3). In the following, we first describe ADB_0, the simple form of the block. Then, with minor modifications, we realize ADB_j, j = 1, 2, 3, on top of ADB_0. The ADB_0 is a two-branch architecture (see Fig. 3), which receives F_3 and F_4 as input and produces two latent features Z_0^c and Z_0^s. In ADB_0, the two latent features are respectively encoded by two branches of the network, namely the contamination residual net (denoted by g^c) and the scene residual net (denoted by g^s), i.e., Z_0^c = g^c(F_3, F_4) and Z_0^s = g^s(F_3, F_4). Contamination Residual Net. In the contamination residual net (g^c), F_3 and F_4 are fed to a channel feature fusion (CFF) module to localize the rain and haze areas in the scene image. The details of CFF are illustrated in Fig. 4. Given the two feature maps F_3 and F_4 as input, CFF first fuses the two inputs by element-wise addition and then feeds the fused feature maps to a 2-layer convolutional block to obtain the attention weights, m_0^c = σ(Conv(ReLU(BN(Conv(F_3 + F_4))))), where σ, BN, and ReLU are the sigmoid function, batch normalization, and the rectified linear unit activation, respectively. Here, the kernel size of Conv is 1 × 1, which can be understood as applying a fully-connected layer to the channel features.
Then we apply the attention weights to the input feature maps and obtain the fused output G_0^c. The CFF module thus fuses the input feature maps, with the fusion weights produced from the channel patterns. We further employ a self-attention mechanism to build spatially long-range dependencies of the fused feature maps G_0^c, where W-MSA is the window multi-head self-attention from the Swin Transformer [38].
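The CFF computation can be sketched as follows. This is a hedged numpy re-implementation under stated assumptions: BatchNorm is omitted for brevity, the 1 × 1 convolutions are expressed as per-pixel linear maps over channels, and the way the weights m are applied to F_3/F_4 is our assumption (a soft gate m·F_3 + (1−m)·F_4), since the exact combination is specified in the paper's Fig. 4 rather than here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cff(f3, f4, w1, w2):
    """Channel Feature Fusion (CFF) sketch.

    Fuses two (d, h, w) feature maps by element-wise addition, then a
    2-layer 1x1 "convolution" produces sigmoid attention weights m,
    which gate the two inputs (gating form is an assumption).
    """
    fused = f3 + f4                                              # element-wise addition
    h = np.maximum(np.einsum('oc,chw->ohw', w1, fused), 0.0)     # 1x1 conv + ReLU
    m = sigmoid(np.einsum('oc,chw->ohw', w2, h))                 # 1x1 conv + sigmoid
    return m * f3 + (1.0 - m) * f4                               # assumed soft gate

rng = np.random.default_rng(0)
d = 8
f3, f4 = rng.standard_normal((d, 4, 4)), rng.standard_normal((d, 4, 4))
w1, w2 = 0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal((d, d))
out = cff(f3, f4, w1, w2)
assert out.shape == (d, 4, 4)
```

Because m lies in (0, 1), each output element is a convex combination of the corresponding elements of F_3 and F_4, i.e., a spatially varying blend of the two encoder scales.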
Having fused the input feature maps and processed them with the attention mechanism, we obtain the contamination residual feature maps Z_0^c. The contamination residual net (g^c) aims to attend to the rainy and hazy regions, thereby highlighting the rain and haze components in the contamination residual feature maps. Scene Residual Net. Since we can observe from the contamination residual (Y^c) that it contains scene information along with the rain and haze, we develop a scene residual net (g^s) that can compensate for the removed scene information in the image. To this end, the global channel feature fusion (GCFF) module is proposed to capture valuable global scene information of the image and fuse the features. As shown in Fig. 5, F_3 and F_4 are first fused and summarized into a global feature, m_0^s = GAP(F_3 + F_4), where GAP indicates global average pooling. Then a 2-layer convolutional block is used to modulate each element of the global feature m_0^s, and we thereby fuse the input feature maps with the modulated global weights. After GCFF, we employ the shift-window multi-head self-attention (SW-MSA) to enhance the spatial interaction of the feature maps and obtain the scene residual features Z_0^s. Instantiation of ADB_j. The difference between ADB_j, j ≠ 0, and ADB_0 is that ADB_0 receives two feature maps as input, while ADB_j, j ≠ 0, takes three feature maps as input. To adapt the architecture of ADB_0 to ADB_j, j ≠ 0, we make minor modifications (see Fig. 6). Specifically, for any block ADB_j, its input includes the output of the (j−1)-th ADB block, i.e., Z_{j−1}^c, Z_{j−1}^s ∈ R^(d×h×w), and the output of the (3−j)-th convolutional encoder, i.e., F_{3−j}. We first concatenate Z_{j−1}^c and Z_{j−1}^s and reduce the dimension from 2d × h × w to d × h × w; with F_{3−j}, the output of ADB_j is then obtained as in ADB_0. In this work, we propose a novel architecture for the rain and haze removal task. Considering the network capacity and hardware overhead, we propose two sizes of networks: a light network, called ADU-Net, and a large network, called ADU-Net-plus. In § IV, we present the details of the two architectures and evaluate their performance.
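The GCFF step can likewise be sketched in numpy, highlighting its contrast with CFF: the attention is computed from a globally pooled channel descriptor rather than per pixel. As before, this is a hedged sketch; BatchNorm is omitted and the channel-wise gating form g·F_3 + (1−g)·F_4 is our assumption about how the modulated global weights recombine the inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcff(f3, f4, w1, w2):
    """Global Channel Feature Fusion (GCFF) sketch.

    The fused map is summarized by global average pooling (GAP) into a
    per-channel descriptor m, which a 2-layer block modulates into
    channel weights g; g gates the inputs channel-wise (an assumption).
    """
    fused = f3 + f4
    m = fused.mean(axis=(1, 2))                  # GAP: (d,) global channel feature
    g = sigmoid(w2 @ np.maximum(w1 @ m, 0.0))    # 2-layer modulation, (d,)
    g = g[:, None, None]                         # broadcast over spatial dims
    return g * f3 + (1.0 - g) * f4               # assumed channel-wise gate

rng = np.random.default_rng(2)
d = 8
f3, f4 = rng.standard_normal((d, 4, 4)), rng.standard_normal((d, 4, 4))
w1, w2 = 0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal((d, d))
out = gcff(f3, f4, w1, w2)
assert out.shape == (d, 4, 4)
```

Because the gate is constant across spatial positions, GCFF emphasizes global, scene-level statistics, matching the scene branch's goal of restoring photometric information.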
Remark 1. The residual U-Net architecture has been used extensively for rain or haze removal tasks [39], as shown in Fig. 7a. Based on the observation that the contamination residual produced by the decoder contains scene information, we aim to develop a dual-decoder U-Net, with one decoder producing the contamination residual and the other producing the scene residual as a scene compensator. Its initial design is shown in Fig. 7b. Considering the physical properties of the contamination and scene information in the input image, we propose a novel network architecture, ADU-Net, where we integrate two decoders with non-identical architectures (see Fig. 7c). We justify our design in § IV-D.

IV. EXPERIMENTS
In this section, we first give the implementation details of the proposed ADU-Net and ADU-Net-plus. Then the benchmark datasets and evaluation protocol are introduced. We further compare our network to the state-of-the-art methods and conduct ablation studies to evaluate the superiority of the proposed network and each of its components. In the final part, we present substantial qualitative results to analyze the superior performance of our network.

A. Implementation Details
Network Architecture. The overall neural architecture of the proposed network is shown in Fig. 2. Table I lists the kernel sizes of the convolutional layers. In the encoder block, the feature maps are processed by Batch Normalization [40] and ReLU [41] after each convolutional layer, i.e., Conv_0, Conv_1, Conv_2, Conv_3, Conv_4. A max-pooling layer is then employed to down-sample the feature maps in each layer. In the decoder block, we also list the kernel sizes of the convolutional layers (see Table I) and employ the Leaky ReLU as the activation function. With computational efficiency in mind, we develop two neural networks of different scales. The light one is denoted as ADU-Net, while the large one is denoted as ADU-Net-plus. As shown in Table I, the difference between the two networks is merely a modification of the channel dimensions. The superiority of our networks is evaluated in § IV-C.

TABLE I DETAILS OF THE KERNEL SIZE IN CONVOLUTION LAYERS. H AND W
Network Training. We implement our method using the PyTorch deep learning package [42]. All experiments are run on NVIDIA RTX 2080 Ti GPUs. In the experiments on the RainCityscapes [7] and BID Rain [9] datasets, the input images are resized to 512 × 256. For SPA-Data, we follow the practice in [24] and use the original images of size 256 × 256. The Adam optimization scheme with an initial learning rate of 0.001 is used to optimize the network. We train the network for 100 epochs on the RainCityscapes and BID Rain datasets, and 20 epochs on SPA-Data. A learning rate adjustment strategy realizes the learning rate decay, where the learning rate is decayed by a factor of 0.1 when the accuracy of the network does not improve for 5 epochs.
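The decay schedule described above behaves like the following minimal sketch (a hypothetical pure-Python re-implementation of a reduce-on-plateau policy; in PyTorch one would typically reach for `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.1` and `patience=5`):

```python
class PlateauDecay:
    """Decay the learning rate by `factor` when the monitored metric
    has not improved for `patience` consecutive epochs."""

    def __init__(self, lr=0.001, factor=0.1, patience=5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float('-inf')
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:          # improvement: reset the counter
            self.best = metric
            self.bad_epochs = 0
        else:                           # plateau: count, then decay
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

sched = PlateauDecay()
# Metric improves for 3 epochs, then stalls for 5 epochs -> lr decays once.
for m in [0.90, 0.91, 0.92] + [0.92] * 5:
    lr = sched.step(m)
print(lr)   # ~ 1e-4 after the single decay
```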

B. Datasets and Evaluation Protocol
We evaluate the proposed methods on two synthetic datasets, i.e., RainCityscapes [7] and BID Rain [9], and a real-world dataset, SPA-Data [24]. In the following, we introduce these datasets; the statistics of each dataset are given in Table II.
RainCityscapes. The RainCityscapes dataset is synthesized from the Cityscapes dataset [43]. It takes 9,432 images synthesized from 262 Cityscapes images as the training set and 1,188 images synthesized from 33 Cityscapes images as the test set. All the selected Cityscapes images are overcast, without obvious shadow. Rain streaks and haze are synthesized with different intensity maps. By adjusting the intensity of the rain streaks and haze, each original image can produce 36 different synthesized images. The results of different methods are reported in Table III.
BID Rain. The BID Rain dataset is also synthesized from the Cityscapes dataset. It samples 2,975 images from the validation set of the Cityscapes dataset as the training set and 500 images from the test set of the Cityscapes dataset as its test set. This is a complicated dataset, as the images contain rain streaks, haze, snow, and raindrops. The rain streak masks are sampled from Rain100L and Rain100H [26], and the snow masks are sampled from Snow 100K [44]. The haze masks include three different intensities originating from FoggyCityScape [45]. The raindrops are produced with the metaball model [46]. These weather components are mixed with the images in the Cityscapes dataset using the physical imaging models [14], [26], [44]-[46]. In the training set, every image can be mixed with each weather component with random probabilities, and we evaluate our model in six different cases. The combinations of the weather components in each case are as follows: (1) rain streaks; (2) rain streaks and snow; (3) rain streaks and light haze; (4) rain streaks and heavy haze; (5) rain streaks, moderate haze, and raindrops; and (6) rain streaks, snow, moderate haze, and raindrops. Refer to [9] for more details of the six settings. The results of the different cases are shown in Table IV.
SPA-Data. SPA-Data is a real-world dataset cropped from 170 real rain videos, of which 86 are collected from StoryBlocks or YouTube, and 84 are captured with an iPhone X or iPhone 6SP. These videos cover outdoor fields, suburb scenes, and common urban scenes. The dataset contains 638,492 image pairs for training and 1,000 for testing. The results on SPA-Data are reported in Table V.
In our experiments, the network performance is quantitatively evaluated by the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) metrics. Higher values of PSNR and SSIM indicate better image recovery performance of the network.
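For reference, PSNR is computed from the mean squared error as 10·log10(MAX²/MSE); a minimal sketch, assuming 8-bit images with MAX = 255:

```python
import math

def psnr(pred, gt, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-size images,
    given here as flat lists of pixel intensities."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt)
    if mse == 0:
        return math.inf              # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

gt = [50, 100, 150, 200]
pred = [51, 99, 151, 199]            # off by 1 everywhere -> MSE = 1
print(psnr(pred, gt))                # 10 * log10(255^2) ~= 48.13 dB
```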

C. Comparison to the State-of-the-Arts
To verify the advantage of our method, we compare its performance with current state-of-the-art methods across three datasets.
RainCityscapes. On the RainCityscapes dataset, we compare our methods to the state-of-the-art rain removal methods, including RESCAN [27], PReNet [23], DuRN [47], RCDNet [4], SPANet [24], and MPRNet [5]. We also compare our methods with approaches that jointly remove rain and haze, i.e., DAF-Net [48] and DGNL-Net [7]. A comparison with haze removal methods, i.e., EPDN [36], DCPDN [34], and AECR-Net [6], is also conducted. The results are reported in Table III. We find that our vanilla solution, i.e., ADU-Net, outperforms the existing state-of-the-art methods. In particular, it improves the PSNR/SSIM values of DGNL-Net by 1.45/0.0041, indicating the superior design of our method. The plus version of our method, i.e., ADU-Net-plus, brings a further performance gain over the ADU-Net, improving the PSNR/SSIM values by 0.81/0.0021.

BID Rain. Since the scenes in the RainCityscapes dataset only contain rain and haze, we further evaluate our methods on the more challenging BID Rain dataset to verify their generalization to complicated weather conditions. Table IV compares the model performance in each weather condition. We observe that the proposed ADU-Net outperforms BIDeN [9] in each of the cases. Especially in cases (2) and (3), the ADU-Net brings the maximum performance gain. One possible explanation is that the proposed ADU-Net is designed with a dual-decoder, which is tailored for the images in case (2), including rain streaks and snow, and in case (3), including rain streaks and a light haze. The improvement in the other cases, however, reveals the generalization of our proposal. Along with the ADU-Net, its plus version can significantly improve both PSNR/SSIM values, showing the superiority of our network architecture. In case (4), the performance of ADU-Net is lower than that of BIDeN. One possible explanation is that the "heavy haze" covers the scenes, which makes it difficult for our network to produce the scene residual. Nevertheless, this issue is addressed by increasing the parameter size, as supported by the performance of ADU-Net-plus.
SPA-Data. We also evaluate our methods on the large-scale SPA-Data dataset. We compare our methods to the existing state-of-the-art methods in Table V, including RESCAN [28], PReNet [23], SPANet [24], and RCDNet [4]. As shown in Table V, the proposed methods outperform the existing methods by a large margin. For example, the improvements read 2.72/0.0051 (PSNR/SSIM) for ADU-Net and 4.57/0.0090 for ADU-Net-plus, compared to RCDNet, showing the strong performance of our network architecture.

D. Ablation Study
In this section, we conduct thorough ablation studies to verify the effectiveness of each component of the proposed network. All studies in this section are conducted using ADU-Net on the RainCityscapes dataset.
Loss Function. In our implementation, the network is optimized by the negative SSIM loss, i.e., L_SSIM, while in many practices of low-level computer vision tasks, the MSE loss, i.e., L_MSE, is employed [49]. In this study, we evaluate the effectiveness of each loss function. As shown in Table VI, each of the loss functions works well for our rain and haze removal task, and the network performance under the two loss functions is similar. However, multi-task training, which optimizes the two loss functions jointly, degrades the network performance, indicating that the network may be saturated using one loss function, and that the joint training harms the network.
Effect of Dual-branch Architecture. Our work proposes a dual-branch architecture, i.e., the asymmetric dual-decoder U-Net, for the rain and haze removal task. In this study, we justify the effectiveness of the dual-branch design in our task (see Fig. 7). The above study shows our design flow is reasonable. We further evaluate the effectiveness of the contamination residual branch and the scene residual branch in ADU-Net (see the results in Table VIII). Compared to the residual U-Net, each branch improves its performance, showing the effectiveness of the proposed residual branches. Moreover, the combination of the proposed residual branches achieves a further improvement, indicating that the two decoders learn complementary features of the image.
Effect of Self-attention Module. In this study, we evaluate the effectiveness of the self-attention mechanism in the proposed ADU-Net. The results are reported in Table IX, which reveals the effectiveness of the self-attention mechanism in the proposed network. We also observe that the W-MSA module and the SW-MSA module help the network learn complementary information in each branch, justifying our design assumption.

Effect of Feature Fusion Module. In the proposed architecture of the ADU-Net, each decoder block has two information flows, respectively encoding the contamination residual and the scene residual (see Fig. 3 and Fig. 6). Each information flow performs feature fusion with respect to the corresponding physical properties. In this study, we evaluate this design. Table X ablates the effectiveness of the feature fusion blocks. Each of the CFF and GCFF modules improves the accuracy by about 0.2 in PSNR. Combining the two blocks brings a further performance gain on top of the individual ones, of around 0.6 in PSNR. This verifies the good practice of the feature fusion blocks in our design.

E. Visualization
Along with the quantitative analysis above, we further conduct qualitative analysis to verify the superiority of our work. We first illustrate the rain and haze removal performance of our work and existing SOTA methods on synthetic datasets (see Fig. 8). Various real-world outdoor scenes are also evaluated (see Fig. 9). The generalization of the proposed ADU-Net is further evaluated by removing other contamination, e.g., rain only in Fig. 10, or rain and snow in Fig. 11.
The first study is conducted on the RainCityscapes dataset. We compare our method with state-of-the-art methods, including PReNet [23], AECR-Net [6], and DGNL-Net [7]. As shown in Fig. 8, our method produces a much clearer scene image (see the red box for details). For example, in the fourth row of Fig. 8, our method removes most of the haze and produces a clear shape of the tree branches, while the other methods fail to recover the tree branches. This clearly shows the superiority of our method.
In the second study, we conduct the analysis on real-world images used in [24] to justify the potential of our method in real scenarios. We again compare our method to PReNet, AECR-Net, and DGNL-Net. For a fair comparison, each method adopts the publicly available fine-tuned weights trained on its own dataset. As can be observed from Fig. 9, the scene images generated by our method are clearer and more realistic than those from the other methods. For example, compared to the rain removal network PReNet, our method also removes the haze in real-world scenes. The hues of the scenes recovered by our method are more realistic than those from the dehazing network AECR-Net, and the reflective details of the scenes are maintained by our method. Compared to DGNL-Net, the closest work to ours, our ADU-Net removes more rain streaks (the second row) and haze (the third row) and retains more scene details (the first row). This study vividly shows the effectiveness of our method in real scenarios.
To demonstrate the generalization of our dual-decoder architecture in separating different contamination, we show the residuals produced by the two branches. Fig. 10 shows the results of our method on the BID Rain dataset. The first row is the input image. The second and third rows present the contamination residual and scene residual. The fourth and fifth rows are the generated images and the ground truth. We find that our method separates the contamination (e.g., snow or haze) from the scene clearly and produces high-quality scene images. A similar observation is made on the real-world images from Internet-Data in Fig. 11. This study also verifies our motivation: most of the contamination components in the image are captured by the contamination residual, while the scene residual contains more details of the scene, including building structures and driveway lines. This analysis again illustrates the strong generalization of the proposed method.

V. CONCLUSION
In this paper, we propose ADU-Net, the first model with two residual branches for the joint rain and haze removal task. Unlike previous work that focuses on contamination removal only, ADU-Net highlights the importance of restoring the scene information affected by the change of atmospheric light. By leveraging the proposed scene residual and contamination residual, ADU-Net produces clear scene images. The superiority of ADU-Net is evaluated by extensive experiments, where it significantly outperforms the current state-of-the-art approaches across three benchmark datasets and tasks. We believe our study will serve as a strong baseline for future work and inspire more research on the joint rain and haze removal task.

Fig. 1. Example of a scene image and its residual maps. (a) is the input image and (b) is the ground truth from the RainCityscapes dataset. Image (c) is the difference between (a) and (b). (d) and (e) are the contamination residual and scene residual. (f) is the result of (a)+(b). "Res" indicates "Residual". The contamination and scene details are highlighted in the red and yellow boxes, respectively (zoom in to see the details).

Z^c_3 or Z^s_3 is encoded by a convolutional block to recover the channel dimension of the image space (e.g., C = 3), as Y^c = Conv_5(Z^c_3) and Y^s = Conv_5(Z^s_3). We denote Y^c as the contamination residual and Y^s as the scene residual. Having Y^c and Y^s at hand, one can obtain the restored scene image Y by adding both residuals to the input image X, i.e., Y = X + Y^c + Y^s.
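The residual recomposition described above can be sketched as follows. This is a minimal NumPy sketch under our reading of the text (input plus the two predicted residuals, clipped to a valid intensity range); the function and variable names are ours, not the paper's:

```python
import numpy as np

def restore(x, y_c, y_s):
    """Recompose the restored scene image from the input and the two residuals.

    x   : input image in [0, 1], shape (H, W, C)
    y_c : contamination residual predicted by one decoder branch
    y_s : scene residual predicted by the other decoder branch
    """
    # Restored image Y = X + Y^c + Y^s, clipped back to the valid range.
    return np.clip(x + y_c + y_s, 0.0, 1.0)
```

In practice the two residuals would come from the two decoder branches of the network; here they are just arrays of the same shape as the input.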

Fig. 2. The network architecture of the proposed ADU-Net, which consists of an encoder f_E and an asymmetric dual-decoder f_AD. f_E has five Conv_i blocks, and f_AD has four ADB_j blocks and a Conv block. The network is optimized with the SSIM loss function.
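Since the network is trained with an SSIM loss, a single-window sketch of that loss may help fix ideas. This is a simplified version, not the paper's exact implementation: standard SSIM uses a sliding (often 11x11 Gaussian) window, and all names here are ours. The constants C1 and C2 are the usual stabilizers for images in [0, 1]:

```python
import numpy as np

C1, C2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stability constants for [0, 1] images

def ssim_global(x, y):
    # Global (single-window) SSIM between two images in [0, 1].
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

def ssim_loss(pred, target):
    # SSIM loss: 1 - SSIM, which is zero when the images match exactly.
    return 1.0 - ssim_global(pred, target)
```

A windowed version would average this quantity over local patches instead of computing one global statistic per image.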

Fig. 4. Architecture of the channel feature fusion module.

Fig. 5. Architecture of the global channel feature fusion module.

Fig. 7. Schematic comparison of the ADU-Net architecture and U-Net-based architectures. (a) is a vanilla residual U-Net. (b) is a simple form of the residual U-Net with dual decoders. (c) is the diagram of our method.

Fig. 10. Visualization of the contamination removal on the BID Rain dataset. The images in BID Rain are synthesized with rain streaks, raindrops, snow, and haze. The first row is the input image. The second and third rows are the contamination residual and scene residual. The fourth and fifth rows are the clean image and the ground truth.

Fig. 11. Visualization of the contamination removal on real-world rain images. The first row is the input image. The second and third rows are the contamination residual and scene residual. The fourth row is the clean image.

TABLE II. STATISTICS OF THE DATASETS.

TABLE III. COMPARISON WITH THE STATE-OF-THE-ART METHODS OF RAIN REMOVAL AND HAZE REMOVAL ON THE RAINCITYSCAPES DATASET. † INDICATES THE NETWORK WAS TRAINED ON THE RAINCITYSCAPES DATASET. ‡ INDICATES THE RESULTS OF THE ALGORITHMS AS REPORTED IN [7]. 1st/2nd BEST IN RED/BLUE.
Table VII presents the empirical comparison of three architectures, i.e., the residual U-Net, the dual-decoder U-Net, and the proposed ADU-Net. Table VII verifies that our design is reasonable: the dual-decoder U-Net outperforms the vanilla residual U-Net, and the proposed ADU-Net achieves the best results among the three.

TABLE IV. COMPARISON WITH THE STATE-OF-THE-ART METHODS ON THE BID RAIN DATASET. † INDICATES THE NETWORK WAS TRAINED ON THE BID RAIN DATASET. ‡ INDICATES THE RESULTS OF THE ALGORITHMS AS REPORTED IN [4]. 1st/2nd BEST IN RED/BLUE.

TABLE VII. EFFECT OF THE DUAL-BRANCH ARCHITECTURE IN RAIN AND HAZE REMOVAL. WE USE BOLD TO INDICATE THE BEST RESULT.

TABLE VIII. EFFECT OF THE DUAL-BRANCH DECODER IN ADU-NET. WE USE BOLD TO INDICATE THE BEST RESULT.

TABLE IX. EFFECT OF THE SELF-ATTENTION MODULE. WE USE BOLD TO INDICATE THE BEST RESULT.

TABLE X. EFFECT OF THE FEATURE FUSION MODULE. WE USE BOLD TO INDICATE THE BEST RESULT.