Attentive Partial Convolution for RGBD Image Inpainting

In this work, we demonstrate the use of Partial Convolutions for RGBD image inpainting. We propose two models, L-PConv and Attn-PConv, both of which outperform the baseline partial convolution model, with Attn-PConv performing best. The proposed Attn-PConv model infills missing pixels with considerably less training time than comparable GAN-based models. To the best of our knowledge, this is the first time a partial convolution model has been used successfully for RGBD image inpainting. The results (Image SSIM: 0.9787, Image PSNR: 30.9665, Depth SSIM: 0.9818, Depth PSNR: 35.7311) indicate that our model succeeds at RGBD image inpainting; the added loss terms and the Attentive Normalization technique each contribute significantly to this performance. We envision practical applications in augmented reality, where frequent infilling of missing pixels is required for both RGB and Depth images. In future work, we aim to scale the model to higher resolutions, in line with modern cameras capable of capturing 4K images.


INTRODUCTION
Inpainting [2], the process of reconstructing missing pixels in an image, plays a crucial role in various digital media applications, including the restoration of damaged photographs and the editing of visual content in advertising. The technique allows for the seamless removal or addition of elements within an image, often referred to as "filling in the holes." A particularly compelling application is in the online advertisement industry, where inpainting facilitates the removal of unwanted objects, ensuring that advertisements meet aesthetic and brand standards without compromising visual integrity.
The significance of inpainting extends beyond aesthetic enhancements to functional improvements in augmented reality (AR), diminished reality (DR), and depth sensing technologies. (1) In augmented reality, the integration of digital content with the actual environment is crucial for a seamless experience. For this reason, smartphones and AR headsets are fitted with depth sensors that record the distance between the device and points in the real world as a depth map corresponding to the scene's pixels. The information in a depth image can be used to precisely place virtual objects in front of or behind real-world objects, creating immersive and realistic user experiences. Time-of-Flight (ToF) sensors are widely used to capture depth maps because they are cheap and can be found in consumer smartphones such as the Samsung Galaxy A80. Unfortunately, ToF cameras suffer from issues that degrade the quality of the depth data they capture and cause the output to have missing pixels [7, 41, 42].
(2) Real content should be digitally removable as well as addable within the digital environment. Diminished Reality (DR), a component of augmented reality, is the perception of an object being removed from its surroundings. For the successful removal of an object, it is vital to fill in the missing pixels resulting from the removal with plausible pixels: digital content that emulates the area behind the object is displayed after the object has been removed, so the user believes the object is no longer present. (3) Many Head Mounted Displays (HMDs) also suffer from missing pixels. There is a delay between when the image is rendered by the GPU and when it is displayed on the HMD. When the head moves smoothly this delay is not an issue, but rapid, jerky head motions can result in missing pixels.
In the realm of online advertising, the ability to modify image content dynamically allows advertisers to tailor their campaigns to specific audiences, improve the visual appeal of ads, and even personalize content in real time based on user demographics or behaviors. However, the effectiveness of these strategies is contingent upon the quality of the inpainting, which must seamlessly blend edited areas with the surrounding imagery to maintain the authenticity and attractiveness of the advertisement.
The majority of current state-of-the-art image inpainting techniques concentrate only on RGB images rather than RGBD ones, and the rest focus on inpainting Depth images separately rather than RGB and Depth together. Note that an RGBD image offers pixel-level depth information matched to the relevant image pixels. An image or image channel that provides data on the distance of scene objects' surfaces from a vantage point is called a depth map or depth image. Accurate depth estimation is required for understanding the geometric relationships in a scene. Depth images have a wide range of applications in 3D computer graphics and computer vision. Depth maps can be used to simulate shallow depth of field, where some elements of a scene are made to appear out of focus. Another application is detecting objects hidden from view so they can be ignored for particular rendering requirements; this method, called z-buffering, is essential for real-time applications like video games, which call for a quick sequence of completed renders displayed at a predictable and steady rate. Depth maps are also used to model or reconstruct 3D shapes [30]. Lee et al. [17] used depth maps along with RGB images for real-time 3D object detection for augmented reality; their approach demonstrated the advantages of combining depth and RGB by detecting objects with complex structures and remaining robust under varying lighting conditions. A reliable depth-image-based plane detection technique was put forth by Jin et al. [15], who could find planes with complex patterns and textures using depth data where texture-based plane detection methods typically fail. Shin et al. [28] used Depth images for 3D scene reconstruction, which can create a 3D model of a scene from a single RGB image.
As the era of Deep Learning exploded, researchers shifted towards neural-network-based inpainting methods [3, 11, 19, 33, 36, 38, 39, 43]. To notify the model about the hole region and the unmasked region, a binary mask with pixel intensity values of "0" and "1" is given along with the input image. One such deep-learning-based approach is Partial Convolution [19], one of the state-of-the-art models for RGB image inpainting. However, the Partial Convolution model cannot be directly applied to RGBD image inpainting.
Considering that most existing image inpainting models focus on either RGB image inpainting [19, 32, 35, 37-39] or Depth inpainting [5, 12, 22, 23, 40] separately, and that depth information plays a crucial role in AR/VR applications [16, 21, 34], we aim to inpaint not only RGB images but depth images as well. We leverage Deep Learning techniques to solve the missing-pixel problem for RGBD images rather than RGB images alone. One intuitive approach would be to train two networks independently, one restoring textures from an RGB input and the other restoring geometry from a depth input, allowing missing pixels in RGB and Depth images to be restored separately. However, depth contains vital geometric information about the three-dimensional structure of the scene, which can complement RGB inpainting. Moreover, many state-of-the-art models that solve the missing-pixel problem for RGB images are highly complex and take days or weeks to train. Our goal is therefore to develop a relatively lightweight model that can be trained in hours and can make inferences on mobile or edge devices.
Our contributions are listed as follows:
• We developed partial-convolution-based deep learning models for RGBD image inpainting. The model takes as input a 4-channel RGBD image along with a 4-channel binary mask. The binary mask indicates whether each pixel's or feature's value is valid (0 for missing pixels and 1 for valid pixels). The model's output is a 4-channel RGBD image with no missing pixels.
• Our proposed Attn-PConv model builds upon the foundation of partial convolutions, integrating a tailored loss function and introducing attentive normalization to elevate its performance beyond that of the baseline partial convolution model.

RELATED WORKS
In the past, Diminished Reality (DR) [4] inpainting techniques have captured missing pixels using multiple-view observations. These approaches require real-time observations made with multiple cameras. Mori et al. [24] illustrated a similar methodology with 3D PixMix, which uses both RGB and Depth information in the inpainting process. However, a multi-view technique is ineffective if the target's background is obscured in the other views. Other approaches use inpainting-based methods that do not rely on multiple-view observations and can replace missing pixels using pixels already available in the image. Deep-learning-based methods have been used for inpainting missing pixels in RGB and Depth images separately. Using two networks, a coarse network and a refinement network, Yu et al. [39] proposed an RGB inpainting model that ensures the color and texture coherency of generated regions with their surroundings. They deployed a modified version of the WGAN-GP loss [10] to guarantee global and local consistency and to stabilize training, and included a contextual attention module that enables the network to use data from remote spatial regions. Improving upon this work, Yu et al. [38] also proposed a GAN-based [9] RGB image inpainting model with gated convolution for irregular masks. Liu et al. [19] proposed Partial Convolutions for RGB inpainting of irregular holes; their partial convolution layer comprises masked and re-normalized convolution operations followed by a mask-update step. Huang et al. [12] performed Depth inpainting using a self-attention mechanism and boundary consistency. Makarov et al. [22] took a vision-transformer-based approach, performing Depth inpainting from RGB images and sparse depth maps using a restrictive convolutional tokenization model. From the above, we can see that most previous works address RGB inpainting or Depth inpainting separately rather than RGBD inpainting. We therefore propose a partial-convolution-based model for RGBD inpainting which not only works well but also takes less time to train.

BACKGROUND

Partial Convolution Model
The Partial Conv model was published by NVIDIA in 2018 [20]. It predicts and fixes the missing portions of an image using an input image and a mask image. The authors use a U-Net-like network [26] with skip connections, replacing all of the conventional convolutional layers with the proposed partial convolutional layers. Partial convolution determines a pixel's participation in the convolution based on the mask value. In a typical convolution, the weights are multiplied by the entire input, which leads to poor image quality because the missing pixels are also used for convolution. Image quality is considerably enhanced when only pixels with a mask value of 1 are convolved.
Let us define $W$ and $b$ as the convolution filter's weights and bias. The associated binary mask $M$ indicates the validity of each pixel/feature value (0 for missing pixels and 1 for valid pixels), whereas $X$ represents the pixel values (or feature activation values) being convolved. The partial convolution is represented as

$$x' = \begin{cases} W^{\top} (X \odot M) \dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \text{if } \operatorname{sum}(M) > 0 \\ 0, & \text{otherwise} \end{cases}$$

where $\odot$ denotes element-wise multiplication and $\mathbf{1}$ is a matrix of ones with the same shape as the mask $M$. The scaling factor $\operatorname{sum}(\mathbf{1})/\operatorname{sum}(M)$ adjusts the result as the number of valid input values varies from window to window.
The binary mask is updated after each partial convolution layer: if the current convolution's output depends on at least one valid input value, the corresponding location is marked valid for the following partial convolutional layer.
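To make the masked, re-normalized convolution and the mask update concrete, the following is a minimal PyTorch sketch of a partial convolution layer. The class and variable names are ours, not the authors' code, and layer hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Conv2d):
    def forward(self, x, mask):
        # Convolve only valid pixels: zero out the hole region first.
        out = F.conv2d(x * mask, self.weight, bias=None,
                       stride=self.stride, padding=self.padding)

        with torch.no_grad():
            # sum(M): number of valid inputs under each filter window,
            # computed by convolving the mask with an all-ones kernel.
            ones = torch.ones(1, self.in_channels, *self.kernel_size,
                              device=x.device)
            valid = F.conv2d(mask, ones, stride=self.stride,
                             padding=self.padding)
            # Mask update: 1 wherever at least one valid input contributed.
            new_mask = (valid > 0).float()

        # Re-normalize by sum(1)/sum(M); zero out still-invalid outputs.
        scale = self.in_channels * self.kernel_size[0] * self.kernel_size[1]
        out = out * (scale / valid.clamp(min=1)) * new_mask
        if self.bias is not None:
            # Bias is only added at valid output locations.
            out = out + self.bias.view(1, -1, 1, 1) * new_mask
        return out, new_mask.expand_as(out)

# Example: a 4-channel RGBD input with a binary mask of the same shape.
x = torch.randn(1, 4, 64, 64)
m = torch.ones(1, 4, 64, 64)
m[:, :, 20:40, 20:40] = 0          # a square hole
layer = PartialConv2d(4, 32, kernel_size=3, padding=1)
y, m2 = layer(x, m)                # m2 marks outputs with >= 1 valid input
```

Chaining such layers shrinks the hole region after every convolution, which is exactly the mask-update behavior described above.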

Loss functions
The original partial convolution model was designed for processing only RGB images; therefore, its loss functions operate only on 3-channel RGB images. The original model uses four loss functions: the $L_1$ or per-pixel loss, the Perceptual or VGG loss, the Style loss, and the Total Variation loss. Each loss term is briefly described below.
$L_1$ Loss: The $L_1$ loss is given as

$$L_{hole} = \frac{1}{N_{I_{gt}}} \left\lVert (1 - M) \odot (I_{out} - I_{gt}) \right\rVert_1, \qquad L_{valid} = \frac{1}{N_{I_{gt}}} \left\lVert M \odot (I_{out} - I_{gt}) \right\rVert_1$$

where $M$ is the binary mask, $I_{out}$ is the network prediction, $I_{gt}$ is the ground truth image, and $N_{I_{gt}}$ is the number of elements in $I_{gt}$. $L_{hole}$ and $L_{valid}$ are the losses for the hole and valid pixels respectively.
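For illustration, a minimal PyTorch sketch of these two per-pixel terms, assuming the prediction, ground truth, and mask are tensors of the same shape:

```python
import torch

def l1_losses(i_out, i_gt, mask):
    n = i_gt.numel()  # N_Igt: number of elements in the ground truth
    l_hole  = torch.sum(torch.abs((1 - mask) * (i_out - i_gt))) / n
    l_valid = torch.sum(torch.abs(mask * (i_out - i_gt))) / n
    return l_hole, l_valid
```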
Perceptual or VGG loss: The VGG loss proposed by Gatys et al. [8], as the name suggests, uses a VGG-16 [29] model with pre-trained ImageNet weights to extract higher-level feature representations of the predicted image $I_{out}$ and ground truth image $I_{gt}$ at different layers, and then computes the $L_1$ distance between them. This ensures that the filled image and the ground truth image have similar feature representations. The loss is defined as

$$L_{perceptual} = \sum_{p} \frac{\left\lVert \Psi_p(I_{out}) - \Psi_p(I_{gt}) \right\rVert_1}{N_{\Psi_p(I_{gt})}} + \sum_{p} \frac{\left\lVert \Psi_p(I_{comp}) - \Psi_p(I_{gt}) \right\rVert_1}{N_{\Psi_p(I_{gt})}}$$

where $I_{comp}$ is the same as $I_{out}$ except that the valid pixels are directly replaced by the ground truth pixels, and $\Psi_p$ is the feature map of the $p$-th selected layer computed by the pre-trained VGG-16. The outputs of pooling layers 1, 2 and 3 are used for the loss.
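A sketch of this perceptual loss using torchvision's pre-trained VGG-16 is shown below. It operates on the 3-channel RGB input that VGG-16 expects, and the feature-extractor indices for the three pooling layers are our assumption about torchvision's vgg16 layout.

```python
import torch
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)        # the loss network is frozen

POOL_IDX = [4, 9, 16]              # pool1, pool2, pool3 in vgg16.features

def vgg_features(x):
    """Collect feature maps at the selected pooling layers."""
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in POOL_IDX:
            feats.append(x)
    return feats

def perceptual_loss(i_out, i_comp, i_gt):
    loss = 0.0
    for f_out, f_comp, f_gt in zip(vgg_features(i_out),
                                   vgg_features(i_comp),
                                   vgg_features(i_gt)):
        loss = loss + torch.mean(torch.abs(f_out - f_gt))
        loss = loss + torch.mean(torch.abs(f_comp - f_gt))
    return loss
```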
Style Loss: Like the Perceptual loss, the Style loss is computed from the feature maps of a pre-trained VGG-16 model. However, the Gram matrix (autocorrelation) of each feature map is first calculated for the predicted ($I_{out}$), composed ($I_{comp}$), and ground truth ($I_{gt}$) images before the $L_1$ distance is taken:

$$L_{style_{out}} = \sum_{p} K_p \left\lVert \left(\Psi_p(I_{out})\right)^{\!\top} \Psi_p(I_{out}) - \left(\Psi_p(I_{gt})\right)^{\!\top} \Psi_p(I_{gt}) \right\rVert_1$$

$$L_{style_{comp}} = \sum_{p} K_p \left\lVert \left(\Psi_p(I_{comp})\right)^{\!\top} \Psi_p(I_{comp}) - \left(\Psi_p(I_{gt})\right)^{\!\top} \Psi_p(I_{gt}) \right\rVert_1$$

Here the matrix operations assume that the high-level feature map $\Psi_p(\cdot)$ is of shape $(H_p W_p) \times C_p$, which results in a $C_p \times C_p$ Gram matrix, and $K_p = 1/(C_p H_p W_p)$ is the normalization factor for the $p$-th selected layer.
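Reusing `vgg_features` from the previous sketch, the style loss can be sketched as follows; folding the normalization factor into the Gram computation is a simplification on our part.

```python
import torch

def gram(feat):
    """C_p x C_p autocorrelation, normalized by 1 / (C_p * H_p * W_p)."""
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(i_out, i_comp, i_gt):
    loss = 0.0
    for f_out, f_comp, f_gt in zip(vgg_features(i_out),
                                   vgg_features(i_comp),
                                   vgg_features(i_gt)):
        g_gt = gram(f_gt)
        loss = loss + torch.mean(torch.abs(gram(f_out) - g_gt))
        loss = loss + torch.mean(torch.abs(gram(f_comp) - g_gt))
    return loss
```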
Total Variation Loss: The total variation loss [14] is used to ensure the smoothness of the completed images. It is given as

$$L_{tv} = \sum_{(i,j) \in P} \frac{\left\lVert I_{comp}^{i,j+1} - I_{comp}^{i,j} \right\rVert_1}{N_{I_{comp}}} + \sum_{(i,j) \in P} \frac{\left\lVert I_{comp}^{i+1,j} - I_{comp}^{i,j} \right\rVert_1}{N_{I_{comp}}}$$

where $P$ is the region of the hole dilated by one pixel.
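A short sketch of this smoothness term; for simplicity it is computed over the whole image rather than restricted to the dilated hole region $P$.

```python
import torch

def tv_loss(i_comp):
    # Horizontal and vertical first differences of the composed image.
    dh = torch.abs(i_comp[:, :, :, 1:] - i_comp[:, :, :, :-1])
    dv = torch.abs(i_comp[:, :, 1:, :] - i_comp[:, :, :-1, :])
    return torch.mean(dh) + torch.mean(dv)
```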
Total Loss: The total loss for the original partial convolution model is

$$L_{total} = L_{valid} + 6 L_{hole} + 0.05 L_{perceptual} + 120 \left(L_{style_{out}} + L_{style_{comp}}\right) + 0.1 L_{tv}$$

The loss term weights were determined via a hyperparameter search on 100 validation images.
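Putting the sketched terms together with these weights (note that our `style_loss` sketch already sums the out and comp variants, matching the parenthesized pair above):

```python
def total_loss(i_out, i_comp, i_gt, mask):
    l_hole, l_valid = l1_losses(i_out, i_gt, mask)
    return (l_valid + 6 * l_hole
            + 0.05 * perceptual_loss(i_out, i_comp, i_gt)
            + 120 * style_loss(i_out, i_comp, i_gt)  # style_out + style_comp
            + 0.1 * tv_loss(i_comp))
```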

MODEL AND NETWORK ARCHITECTURE
In this section, we present our model and deep network architecture. Our model is based on the Partial Convolution model. When using the baseline Partial Convolution model for RGBD image inpainting, we discovered that even though the model performed well for inpainting the RGB channels, it performed poorly for the Depth images. We therefore improve the model and propose two new models, L-PConv and Attn-PConv.

L-PConv
The partial convolution model uses a pre-trained model such as VGG-16 to compute the loss values, which greatly reduces training time and avoids training an additional Discriminator network, as is generally required when training GANs. To improve the partial convolution model, we therefore first improved the loss function for RGBD inpainting. From our analysis, we attribute the low performance of the Partial Convolution model to the loss function being designed to operate only on 3-channel RGB images and not 4-channel RGBD images. We therefore propose L-PConv, a Partial Convolution model with an updated loss function, adding loss terms to the original loss. Specifically, to generate not only the RGB but also the Depth image, we made the following changes. First, we add another Perceptual loss term for the Depth images, calculated similarly using a pre-trained VGG-16 model, except that the loss is computed only between the predicted image and the ground truth image:

$$L_{perceptual_{Depth}} = \sum_{p} \frac{\left\lVert \Psi_p(D_{out}) - \Psi_p(D_{gt}) \right\rVert_1}{N_{\Psi_p(D_{gt})}}$$

Second, we add another Style loss term, $L_{style_{out_{Depth}}}$, for the Depth images: we first compute the Gram matrices of the predicted and ground truth depth images and then take the $L_1$ distance between them.
The final loss for the modified Partial Convolution model is

$$L_{total} = L_{valid} + 6 L_{hole} + 0.05 L_{perceptual} + 120 \left(L_{style_{out}} + L_{style_{comp}}\right) + 0.1 L_{tv} + 0.025 L_{perceptual_{Depth}} + 60 L_{style_{out_{Depth}}}$$

The loss term weights were calculated using a hyperparameter search on 100 validation images.
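A sketch of the two depth-specific terms, reusing `vgg_features` and `gram` from the earlier sketches. Repeating the single depth channel to three channels before the VGG-16 extractor is our assumption; the exact depth preprocessing is not spelled out here.

```python
import torch

def depth_losses(d_out, d_gt):
    """d_out, d_gt: single-channel depth maps of shape (N, 1, H, W)."""
    # Assumption: tile the depth channel so VGG-16 sees a 3-channel input.
    f_out = vgg_features(d_out.repeat(1, 3, 1, 1))
    f_gt  = vgg_features(d_gt.repeat(1, 3, 1, 1))
    # Perceptual term: prediction vs. ground truth only (no composed image).
    l_perc = sum(torch.mean(torch.abs(a - b)) for a, b in zip(f_out, f_gt))
    # Style term: L1 distance between Gram matrices.
    l_style = sum(torch.mean(torch.abs(gram(a) - gram(b)))
                  for a, b in zip(f_out, f_gt))
    return l_perc, l_style

# These feed into the total as: + 0.025 * l_perc + 60 * l_style
```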
By adding these additional loss terms, which account for the Depth channel as well, we obtained good results not only for the RGB images but also for the Depth images. The results of this model are shown in the evaluation section.

Attn-PConv
To further improve our L-PConv model, we incorporate the attention mechanism [31], which can gather spatial information and learn long-range dependencies from distant spatial locations, thus improving performance. We propose an Attentive Partial Convolution model (Attn-PConv) with the updated loss function and Attentive Normalization. Specifically, we replaced all the Batch Normalization layers in the U-Net architecture with Attentive Normalization to create Attn-PConv. Li et al. [18] proposed Attentive Normalization (AN) to combine feature normalization and feature attention. Instead of learning a single affine transformation, AN learns a set of affine transformations and re-calibrates features in an instance-specific manner using their weighted sum, with the weights learned through channel-wise feature attention.
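The following is a minimal PyTorch sketch of how such a layer can be structured, as we understand Attentive Normalization [18]. The number of affine transformations K and the sigmoid attention head are illustrative choices, not necessarily those of Li et al.

```python
import torch
import torch.nn as nn

class AttentiveNorm2d(nn.Module):
    def __init__(self, channels, k=5):
        super().__init__()
        # Standardize only; the affine part is handled attentively below.
        self.norm = nn.BatchNorm2d(channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(k, channels))   # K learned scales
        self.beta = nn.Parameter(torch.zeros(k, channels))   # K learned shifts
        # Channel-wise attention head: pooled features -> K mixing weights.
        self.attn = nn.Sequential(nn.Linear(channels, k), nn.Sigmoid())

    def forward(self, x):
        x_hat = self.norm(x)
        w = self.attn(x.mean(dim=(2, 3)))                    # (N, K)
        # Instance-specific affine = weighted sum of the K learned affines.
        gamma = (w @ self.gamma).unsqueeze(-1).unsqueeze(-1)  # (N, C, 1, 1)
        beta = (w @ self.beta).unsqueeze(-1).unsqueeze(-1)
        return x_hat * gamma + beta
```

In Attn-PConv, a layer of this kind replaces each Batch Normalization layer in the U-Net encoder and decoder blocks.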
The network architecture of our proposed Attn-PConv is shown in Figure 1.

EVALUATION
To evaluate our model, we chose the NYU Depth Dataset V2 [25], from which we obtain 256x256 RGB and Depth images. For the masks, instead of randomly removing pixels from the image, we use a set of sketches from the QD-IMD: Quick Draw Irregular Mask Dataset [13], a subset of the Quick Draw dataset (a collection of 50 million human drawings) [1].
• NYU Depth Dataset V2: This dataset consists of 1449 densely labeled pairs of aligned RGB and depth images. We first crop and resize the images to 256x256; we chose this resolution because it allows us to run our model with the relevant batch sizes on one GPU. Since our model takes 4-channel RGBD images as input, after bringing the images to the desired dimensions we stack the RGB and Depth images into a 4-channel RGBD image. We performed data augmentation by flipping the images horizontally and vertically, yielding 4200 images for training and 40 images for testing.
• QD-IMD: Quick Draw Irregular Mask Dataset: We use the QD-IMD dataset from GitHub, which contains about 10,000 binary images in the form of sketches. Some of the masks have large areas of missing pixels, which is generally not practical, so we select a subset of 600 masks for our model: 500 for training and 100 for testing.
The binary mask images were also stacked to obtain 4-channel mask images, as sketched below.
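A small sketch of this input preparation, assuming the RGB, depth, and mask arrays are already cropped and resized to 256x256 (array and function names are illustrative):

```python
import numpy as np

def make_rgbd_input(rgb, depth, mask):
    """rgb: (256, 256, 3), depth: (256, 256), mask: (256, 256) binary."""
    rgbd = np.concatenate([rgb, depth[..., None]], axis=-1)  # (256, 256, 4)
    mask4 = np.repeat(mask[..., None], 4, axis=-1)           # (256, 256, 4)
    masked = rgbd * mask4       # zero out the hole pixels (mask == 0)
    return masked, mask4, rgbd  # network input, mask, ground truth
```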
Our model takes as input 4-channel RGBD images with 4-channel binary masks and outputs 4-channel RGBD images with no missing pixels. The NYU dataset provides RGB and Depth images separately; we crop and stack them to obtain 4-channel images, and similarly stack the masks to create 4-channel mask images. As previously mentioned, a mask update step follows the masked and re-normalized convolution operation in the partial convolution process: any masking where the partial convolution was able to operate on at least one unmasked value is removed during this automated update. We train all models on one Nvidia Titan Xp GPU. Training our proposed Partial Convolution model with the modified loss and attentive normalization takes about 20 hours for 50 epochs, and inference runs in real time.
Figure 2 shows the performance of our Attn-PConv model on the NYU test dataset. We can observe that our modified model learns the complex structures in the RGBD image and successfully infills the missing pixels. Taking a closer look at the generated RGB image in the first row, the model is able to generate the missing pixels for the books kept on the shelf; this is difficult because the books have sharp boundaries, and infilling missing pixels in such regions is always challenging. The mask used for the first row is also quite thick, making the prediction task more challenging. The second row shows a much more complex and irregular mask along with an image that also has considerable structure; based on the generated RGB and Depth results, the proposed model performs well for complex irregular masks. Finally, the last row shows the model again infilling missing pixels successfully. The enhanced loss function and the attentive normalization module both contribute to this performance: the additional loss terms penalize poor reconstruction, while attentive normalization enables learning long-range dependencies across image regions.

In Figure 3 we compare our two proposed partial convolution models with the baseline partial convolution model on the NYU test dataset. In the first row, the baseline model is able to infill missing pixels for the RGB image, but since the Partial Convolution model was designed with RGB images in mind, it does not perform well in generating the Depth image. This problem is resolved by L-PConv, our Partial Convolution model with the updated loss, which generates Depth images correctly. However, L-PConv can sometimes produce a Depth image with minor errors, which can be observed by looking closely at the sofa chair in the generated and ground truth images: the dark color at the front of the sofa indicates it is very close to the viewer, which the ground truth shows is not the case. With our Attn-PConv model we do not observe such errors or artifacts, as the bottom row of images shows. We attribute this to the attention in the proposed model, which gathers spatial information and learns long-range dependencies from distant spatial locations.
In Table 1 we compare quantitative evaluation metrics for the proposed models and the baseline partial convolution model. We use two widely used image evaluation metrics, the Structural Similarity Index Measure (SSIM) [44] and Peak Signal-to-Noise Ratio (PSNR) [27], to compare the generated RGBD images with the ground truth. From the table, we observe that the baseline Partial Convolution model (Baseline PConv) performs well for inpainting RGB images, with an Image SSIM of 0.9640 and PSNR of 25.7998, but as expected it does not perform well for Depth images, with a low SSIM of 0.7007 and PSNR of 11.8196. Our first proposed model (L-PConv), with the modified loss function, alleviates this problem by performing better than the baseline not only for RGB images but also for Depth images.

Regarding training time, our Attn-PConv model takes about 20 hours to train for RGBD images, whereas most GAN-based models take more than a week to train for RGBD images [6]. Our model's efficiency stems from employing a single U-Net-type generator for RGBD inpainting: we train only one network, whereas GAN-based models must train both the Generator and the Discriminator, which significantly increases the number of training parameters and consequently the training time.
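For reference, a sketch of how these metrics can be computed with scikit-image (our tooling choice; the authors' evaluation code may differ), splitting the 4-channel output into its RGB and depth parts:

```python
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(pred_rgbd, gt_rgbd):
    """pred_rgbd, gt_rgbd: (H, W, 4) arrays with values in [0, 1]."""
    rgb_p, d_p = pred_rgbd[..., :3], pred_rgbd[..., 3]
    rgb_g, d_g = gt_rgbd[..., :3], gt_rgbd[..., 3]
    return {
        "image_ssim": structural_similarity(rgb_g, rgb_p, channel_axis=-1,
                                            data_range=1.0),
        "image_psnr": peak_signal_noise_ratio(rgb_g, rgb_p, data_range=1.0),
        "depth_ssim": structural_similarity(d_g, d_p, data_range=1.0),
        "depth_psnr": peak_signal_noise_ratio(d_g, d_p, data_range=1.0),
    }
```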

CONCLUSION
In this work, we demonstrated the use of Partial Convolutions for RGBD image inpainting. We proposed two models, L-PConv and Attn-PConv. The baseline partial convolution model is outperformed by both of our proposed models, with the Attn-PConv model performing the best. The proposed Attn-PConv model is able to infill missing pixels with relatively little training time compared to other GAN-based models. As far as we know, this is the first time a partial convolution model has been used successfully for RGBD image inpainting. The results (Image SSIM: 0.9787, Image PSNR: 30.9665, Depth SSIM: 0.9818, Depth PSNR: 35.7311) indicate that our model is successful in RGBD image inpainting, and the additional loss terms and the Attentive Normalization technique improve the performance of the model significantly. We believe our model can be used in AR-related applications, where infilling of missing pixels is performed frequently for both RGB and Depth images together. In the future, we aim to support higher resolutions, aligning with the capabilities of modern cameras capable of capturing images at 4K resolution.

ACKNOWLEDGMENTS
We gratefully acknowledge that this research is partially supported by the Federal Highway Administration Exploratory Advanced Research program, FHWA EAR 693JJ320C000021.

Figure 1: Similar to the baseline model, we use a U-Net-like generator network with all convolution layers replaced by Partial Convolutions. The Encoder block consists of Partial Convolutions with ReLU activations and Attentive Normalizations; the Decoder block consists of Up-Sampling layers with Partial Convolutions, LeakyReLU activations, and Attentive Normalizations.

Figure 3: Comparison of RGBD inpainting results for all the models

Table 1: Comparison of evaluation metrics for the different models on the NYU dataset