Neural Partitioning Pyramids for Denoising Monte Carlo Renderings

Recent advancements in hardware-accelerated raytracing have made it possible to achieve interactive framerates even for algorithms previously considered offline, such as path tracing. Interactive path tracing pipelines rely heavily on spatiotemporal denoising to produce a high-quality output from low-sample-count renderings. Such denoising is typically implemented as multiscale-kernel-based filters driven by lightweight U-Nets operating on pixels, and encoders operating on samples. In this work, we present a novel kernel architecture in the line of low-pass pyramid filters. Our architecture avoids the issues with the low-frequency response of previous such filters, resolving ringing, blotchiness, and box-shaped artefacts while improving overall detail. Instead of using classical downsampling and upsampling approaches, which are prone to aliasing, we let our weight predictor networks learn to partition the input radiance between pyramidal layers, predict kernels for denoising


INTRODUCTION
Path tracing, despite its high computational cost, has become the primary algorithm used for physically-based rendering throughout the animation and visual effects industry due to its accuracy and flexibility. Current GPUs can produce noisy low-sample-count renderings at interactive rates thanks to hardware-accelerated raytracing and reconstruct clean images using neural denoising filters, also accelerated using linear algebra hardware. While such efficient pipelines have enormous potential to improve creative workflows and unlock new applications of path tracing in the real-time domain, the quality of spatiotemporal denoisers remains a bottleneck.
Many recent works in neural denoising rely on kernel prediction; instead of predicting an output image directly, they map the final activations of their neural networks to local, per-pixel filtering kernels. Then, they compute their output by applying these kernels to the noisy input image. Kernel-predicting methods have proven to hold some significant advantages; they are more robust, train faster and offer improved performance [Bako et al. 2017].
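To make the kernel-prediction idea concrete, the following is a minimal NumPy sketch of applying per-pixel kernels to a noisy image. The function name `apply_predicted_kernels`, the softmax normalisation of the raw network outputs, and the edge padding at the image border are assumptions made for illustration; they are not taken from any specific method cited above.

```python
import numpy as np

def apply_predicted_kernels(noisy, logits):
    """Filter `noisy` (H, W) with per-pixel k x k kernels given as raw `logits` (H, W, k, k)."""
    h, w = noisy.shape
    k = logits.shape[-1]
    r = k // 2
    # Softmax over each pixel's k*k taps: weights are positive and sum to one.
    flat = logits.reshape(h, w, k * k)
    flat = np.exp(flat - flat.max(axis=-1, keepdims=True))
    weights = (flat / flat.sum(axis=-1, keepdims=True)).reshape(h, w, k, k)
    padded = np.pad(noisy, r, mode="edge")  # border handling is an assumption
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            # Each output pixel gathers a weighted neighbour at offset (dy - r, dx - r).
            out += weights[:, :, dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out
```

Because the output is a convex combination of input pixels, it cannot leave the range of its input, which is one intuition for the robustness of kernel-predicting methods.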
Denoising benefits from larger kernels as weighting and averaging more pixels reduces variance. If the variance is sufficiently low, larger kernels can be more selective, only picking pixels that form local structures, thereby preserving details. Unfortunately, enlarging kernels quickly becomes impractical as the computational cost of kernel prediction and application scales quadratically with the kernel size. Hierarchical pyramid kernels provide a natural solution; by efficiently applying small, cascaded filtering kernels, these methods can achieve large footprints at low computational costs [Vogels et al. 2018].
Although pyramidal kernel prediction holds immense potential, predicting parameters for such filters with neural networks is challenging. We identify some shortcomings of scale-composition, which limit its overall performance. We propose a novel pyramidal filter that avoids these issues by introducing jointly learnable downsampling and upsampling stages. Our downsampler learns to partition radiance between layers of our filtering pyramid, providing each layer with inputs matching their ideal noise characteristics. In conjunction, our upsampler learns to reconstruct edges and high-frequency details lost during downsampling, accurately aligning each layer for recombination. By their combined effect, each layer of our pyramid filter learns to denoise image structures of corresponding sizes, which can be recombined by simply summing up the upsampled output of each layer. Our filter is inherently energy-preserving and numerically stable. Furthermore, we take special care to use robust activation functions to predict our filtering parameters.
We also present considerable improvements to previous Monte Carlo denoising pipelines by applying prominent ideas from related fields. By adapting backpropagation through time [Graves 2012] from recurrent neural networks, we train on 64-frame-long sequences instead of 8, simultaneously increasing the patch size to 256 pixels compared to 128 in previous works. By motion compensation of our training patches, we train our filter to better utilize temporal coherency. We adapt architectures from image restoration [Zamir et al. 2022] and large kernel convolution [Liu et al. 2022a] methods to scale our weight predictor network to 30 million parameters, further improving denoising performance at interactive to offline rates.
To summarise our contributions:
• We propose a pyramidal filter with learnable partitioning and upsampling stages. Through Fourier analysis, we explore how our filter resolves previous shortcomings and provides further advantages.
• We adopt training techniques previously used for recurrent networks and image restoration networks, proposing small and large weight predictor networks to drive our method. We demonstrate state-of-the-art performance in multiple timing categories.
• Our implementation and dataset are available under the MIT license on our project website.

RELATED WORK
Algorithms for denoising Monte Carlo renderings have actively been developed in recent years, with numerous solutions successful at different performance budgets. We group methods into three categories: real-time denoisers processing 1-4 samples within 30 milliseconds, interactive denoisers processing 4-8 samples in a performance budget between 30 and 100 milliseconds, and offline denoisers processing over 8 samples in over 100 milliseconds. We focus on recent works relevant to our work; for a wider overview, we refer the reader to excellent surveys on deep learning-based methods and to Zwicker et al. [2015] on classical methods. Non-machine-learning-driven filters [Koskela et al. 2019; Kozlowski and Cheblokov 2021; Nvidia 2022a; Zhdan 2021] are still commonly used for high framerate real-time applications. However, optimised kernel-predicting filters [Fan et al. 2021; Meng et al. 2020; Thomas et al. 2022] take prominence when the budget allows for inference using a lightweight neural network. Most notably, Thomas et al. [2022] jointly solve denoising and supersampling using a U-Net [Ronneberger et al. 2015] inspired multiscale filter.
Neural kernel-predicting filters dominate interactive denoising methods, with two general approaches emerging: pyramidal filters and generalised bilateral filters [Tomasi and Manduchi 1998]. Our work focuses on pyramidal filtering, with a detailed discussion of its background in Section 3. As an alternative, some recent works generalise bilateral filters and combine them with learning components, building effective neural edge-preserving smoothing filters. Meng et al. [2020] use a lightweight convolutional network to guide a bilateral grid filter. Işık et al. [2021] further generalise bilateral filtering, calculating the range kernel for eight-dimensional per-pixel affinity features. Their approach avoids the issues with pyramid methods but is computationally more expensive due to higher memory bandwidth requirements and less cache-friendly memory access patterns. In addition, orthogonal extensions such as path-based filtering [Cho et al. 2021; Lin et al. 2021], compositional filtering, and self-supervised post-correction denoising [Back et al. 2022] have been proposed.
Recent offline filters abandon kernel prediction for directly predicting the output image and instead focus on improving the underlying network architecture. The proposed ideas include transformer blocks [Lu et al. 2020; Yu et al. 2021], deformable convolution [Wei et al. 2021], conditioned feature modulation [Xu et al. 2019], dual-residual connections, and generative-adversarial training [Lu et al. 2020; Xu et al. 2019; Yu et al. 2021]. Zheng et al. [2021] apply ensembles of such denoisers to mitigate the artefacts produced by each individual denoiser. These methods offer much larger receptive fields than those of kernel-predicting architectures. However, we argue that denoising is still a substantially easier task when done in the parameter space of predicted linear denoising kernels; we show that our multiscale filter does not incur the same limitations in terms of kernel size and scales when driven by powerful networks, outperforming direct predicting architectures.

PYRAMIDAL DENOISING
Pyramid hierarchical kernels allow for an efficient, cascaded application of small denoising kernels [Vogels et al. 2018] while achieving the footprint of large, computationally expensive kernels [Bako et al. 2017]. In this section, we describe the background of previous filter architectures and the shortcomings of their neural adaptations. These limitations serve as the core motivation for our filter design, which we describe in Section 4.
We use various symbols to denote radiance processed at different pipeline stages. Supplementary, Table 1 and Figure 4 offer an overview to ensure clarity for our readers.

Low-pass filtering and downsampling
Pyramidal filters first need to generate a decimated low-pass pyramid [Burt and Adelson 1987] serving as input for per-layer denoising. Although the original formulation uses Gaussian filters, learning-based methods favour box filters or average pooling. Introducing our notation, we can formulate this operation in a closed form:

$$\bar{L}^{k}_{p,q} = \frac{1}{4^{k}} \sum_{x=2^{k}p}^{2^{k}(p+1)-1} \; \sum_{y=2^{k}q}^{2^{k}(q+1)-1} \bar{L}_{x,y}, \qquad (1)$$

where $\bar{L}_{x,y}$ denotes the radiance of pixel $(x, y)$ averaged over samples rendered for frame $t$ (and temporally accumulated for recent frames, Equation 10), and $\bar{L}^{k}_{p,q}$ denotes the radiance of the coarse pixel $(p, q)$ of the $k$-th layer of the low-pass pyramid, where $k = 0$ stands for the full-resolution and $k = K$ the coarsest layer. We use zero-based indexing for pixel coordinates in our notation. We use a non-recursive pyramid formulation for reasons that will become clear in Section 4.1.
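The non-recursive average-pooling pyramid described above can be sketched in a few lines of NumPy. The helper names `average_pool` and `lowpass_pyramid` are hypothetical, and the sketch assumes image dimensions divisible by the largest pooling factor.

```python
import numpy as np

def average_pool(image, factor):
    """Average non-overlapping factor x factor blocks of a (H, W) image."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def lowpass_pyramid(image, num_layers):
    """Layer k pools the full-resolution image directly with 2^k x 2^k blocks (non-recursive)."""
    return [average_pool(image, 2 ** k) for k in range(num_layers)]
```

For a 16 × 16 input and five layers, this yields layers of 16, 8, 4, 2, and 1 pixels per side, all sharing the mean of the input.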

Denoising kernels
Next, pyramidal filters suppress noise in each layer of the pyramid by applying neurally predicted kernels $K^{k}_{p,q}$, which are unique for each pixel $(p, q)$ and cover its neighbourhood $N(p, q)$:

$$\hat{L}^{k}_{u,v} = \sum_{(p,q)\,:\,(u,v) \in N(p,q)} \sigma_{u,v}\!\left(K^{k}_{p,q}\right)_{u,v} \bar{L}^{k}_{p,q}, \qquad (2)$$

where $\hat{L}^{k}$ denotes the layered denoised radiance and $\sigma_{u,v}$ denotes softmax normalisation of the kernels along $(u, v)$. Equation 2 corresponds to a kernel-splatting operation [Gharbi et al. 2019], as $u$ and $v$ index the denoised layers. Conversely, swapping $(p, q)$ and $(u, v)$, essentially transposing the operation, would denote a kernel-gathering operation, which is also commonly used by previous work [Vogels et al. 2018].

Figure 2: The denoised layers contain residual low-frequency noise. Converting the representation to a Laplacian pyramid by subtracting subsequent layers cancels this residual noise. However, small mismatches between the denoised layers cause overshoots and ringing artefacts in the Laplacian pyramid. The predicted blending weights control composition to reduce these artefacts (lower weights give more conservative composition). Zero-Day © 2023 Mike Winkelmann
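The difference between splatting and gathering is easiest to see in code. Below is a minimal NumPy sketch of kernel splatting on a single layer: every source pixel owns a kernel and distributes its radiance to its neighbours. The function name `splat`, the 5 × 5 default implied by the logits shape, and the choice to discard radiance splatted outside the image are illustrative assumptions.

```python
import numpy as np

def splat(layer, logits):
    """Splat each pixel of `layer` (H, W) to its k x k neighbourhood using its own kernel.

    `logits` has shape (H, W, k, k); the kernel of each source pixel is softmax-normalised
    over the destination offsets, so every pixel distributes exactly its own radiance.
    """
    h, w = layer.shape
    k = logits.shape[-1]
    r = k // 2
    flat = logits.reshape(h, w, k * k)
    flat = np.exp(flat - flat.max(axis=-1, keepdims=True))
    weights = (flat / flat.sum(axis=-1, keepdims=True)).reshape(h, w, k, k)
    out = np.zeros((h + 2 * r, w + 2 * r))
    for dy in range(k):
        for dx in range(k):
            # The source pixel at (p, q) adds to the destination at (p + dy - r, q + dx - r).
            out[dy:dy + h, dx:dx + w] += weights[:, :, dy, dx] * layer
    return out[r:r + h, r:r + w]  # radiance splatted outside the image is dropped (assumption)
```

A gathering variant would instead index the kernel by the destination pixel and read from shifted copies of the input, which corresponds to the transposed operation.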

Upsampling and composition
After applying the predicted denoising kernels separately at each scale, pyramidal filters need to upscale the coarse layers and compose an output image. Unfortunately, each layer of the denoised low-pass pyramid contains some residual low-frequency noise that falls below the receptive field of the small denoising kernels. Thus, denoising filters must solve a challenging task: fusing the clean frequency bands from each layer. Two approaches have been proposed in previous works. First, Delbracio et al. [2014] convert the low-pass pyramid to a Laplacian pyramid by subtracting subsequent layers to isolate the clean frequency bands (shown in brackets in Equation 3). The sum of these bands then gives a clean, full-resolution output. We formulate this operation recursively, where $\tilde{L}^{K} = \hat{L}^{K}$ starts with the coarsest layer, yielding the full-resolution output as $\tilde{L}^{0}$, by applying the following equation:

$$\tilde{L}^{k} = \mathrm{U}\tilde{L}^{k+1} + \left[\hat{L}^{k} - \mathrm{U}\mathrm{D}\hat{L}^{k}\right], \qquad (3)$$

where U and D are shorthands for the two-by-two upsampling and downsampling operators. Delbracio et al. [2014] proposed this formulation of scale-composition for a classical method, applying their ray histogram fusion filter to all layers. Matching the response of the filters applied to each layer is critical for Laplacian pyramids, as inconsistencies between the layers where the filters' response overlaps can cause ringing and over-blurring artefacts. Such consistency is difficult to guarantee between neurally predicted kernels; thus, Vogels et al. [2018], when adapting scale-composition for neural kernel prediction, proposed a neurally predicted per-pixel blending weight, $\alpha^{k}$, that can control composition in specific regions to reduce artefacts:

$$\tilde{L}^{k} = \alpha^{k}\,\mathrm{U}\tilde{L}^{k+1} + \left[\hat{L}^{k} - \alpha^{k}\,\mathrm{U}\mathrm{D}\hat{L}^{k}\right]. \qquad (4)$$

While they show this workaround to be reasonably effective, it illustrates a fundamental incompatibility between Laplacian pyramidal denoising and neural kernel prediction. Figure 2 shows the Laplacian pyramid layers and blending weights for an example scene.
Recently, upscaling the layers to full resolution and taking their weighted sum by neurally predicted weights has also been proposed. While these weights can cancel the residual noise without resorting to Laplacian pyramids, learning the residual noise's characteristics still poses a challenging task for the weight predictor network. In the following section, we present our pyramidal filter design that prevents the creation of such residual noise in the first place.
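The recursive scale-composition can be sketched as follows. This is an illustrative NumPy sketch, not the implementation of any cited method: `down` and `up` stand in for the operators D and U (here average pooling and nearest-neighbour upsampling, chosen for brevity), and the optional `alphas` argument models a per-layer blending weight under the assumption that it scales both the upsampled coarse result and the subtracted low-pass term.

```python
import numpy as np

def down(image):
    """D: 2 x 2 average pooling."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(image):
    """U: nearest-neighbour 2x upsampling (a smoother interpolation could be used instead)."""
    return image.repeat(2, axis=0).repeat(2, axis=1)

def compose(denoised, alphas=None):
    """Compose denoised layers (finest first) from coarse to fine.

    With alphas=None this is plain Laplacian composition; otherwise alphas[k] is a scalar
    or per-pixel blending weight for layer k (assumed form of learnable composition).
    """
    out = denoised[-1]
    for k in range(len(denoised) - 2, -1, -1):
        a = 1.0 if alphas is None else alphas[k]
        out = denoised[k] + a * (up(out) - up(down(denoised[k])))
    return out
```

When the layers are perfectly consistent, that is, each layer equals the downsampled version of the next finer one, the composition returns the finest layer unchanged. Any mismatch between layers enters the output directly, which is the source of the ringing discussed above; a blending weight of zero falls back to the finest denoised layer alone.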

PARTITIONING PYRAMIDS
An overview of our pyramidal denoising is shown in Figure 4 as a grey-colored inset. While its three-stage structure is similar to existing solutions (Section 3), we propose the following innovations.
By letting our downsampler partition the input pixel radiance in a learnable manner, our weight predictor network learns to provide denoising kernels with radiance better suited for their receptive field. This way, compositing becomes as simple as summing up each layer. Our upsampler applies neurally predicted splatting kernels instead of bilinear upsampling, enhancing coarse layers with edge-preserving capabilities, which are utilised thanks to our neural partitioning stage, enhancing the reconstructed detail in noisy areas.
In this section, we introduce our partitioning downsampler and our splatting upsampler, explaining the unique properties of each and how they work in tandem to improve denoising performance.

Partitioning downsampler
Previous works combine their pyramidal layers after denoising through learnable composition [Vogels et al. 2018]. We aim to simplify this weighting task by bringing it to the front of our denoising filter. We use the per-layer weights to partition the radiance of each pixel before downsampling and denoising. Compared to learnable composition, our partitioning task is more straightforward as it is independent of the following denoising kernels and is not prone to ringing and blurring artefacts.
The softmax function is widely used as the final activation layer of neural networks to partition probability spaces as it provides a smooth mapping from a vector with arbitrary components to a vector whose components fall on the interval (0, 1) and which sum up to one [Sutton and Barto 2018]. Similarly, we can multiply the resulting weights with a signal to partition said signal; previous work used softmax-normalised weights to partition sample radiances between depth-wise layers for denoising defocus and motion blur, albeit without improvements beyond two layers. Nevertheless, we use a similar operation and do not observe such limitations in our application.

Figure 3: We splat each coarse pixel (blue) to its four-by-four neighbourhood of one-layer finer-resolution pixels (red) according to predicted kernels, unique to each coarse pixel.
First, we predict partitioning weights $\pi^{k}_{x,y}$, all at full-resolution, which we then softmax normalise along pyramid layers. Then, we partition the radiance across the layers and downsample by average pooling the resulting partitions to produce the multiscale images ready for denoising. We formulate our operation by adding the partitioning weights to Equation 1:

$$\bar{L}^{k}_{p,q} = \frac{1}{4^{k}} \sum_{x=2^{k}p}^{2^{k}(p+1)-1} \; \sum_{y=2^{k}q}^{2^{k}(q+1)-1} \sigma_{k}\!\left(\pi_{x,y}\right)^{k} \bar{L}_{x,y}, \qquad (5)$$

where $\bar{L}_{x,y}$ is the full-resolution pixel radiance, $\bar{L}^{k}_{p,q}$ is the coarse pixel $(p, q)$ of layer $k$, and $\sigma_{k}$ denotes softmax normalisation along layers. The leftmost side of Figure 6 shows the predicted partition weights for a demonstration scene and the resulting radiance partitions. Intuitively, the partitions separate image features based on scale and noise characteristics, directing smaller or less noisy features to finer layers. Denoising each feature at the finest layer with the minimum required kernel size is beneficial as it frees up coarser layers for larger, noisier features. We show partitions for further test images in Supplementary, Fig. 3.
Once radiance is partitioned, we apply the learned kernels to the partitioned radiance in the denoising kernel stage, as described in Section 3.2 and shown in Figure 4 (grey-colored area). Specifically, we perform a 5 × 5 kernel-splatting operation (Equation 2) as proposed by Gharbi et al. [2019].
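A minimal NumPy sketch of the partitioning downsampler follows. The function name `partition_pyramid` and the layout of the logits array (layers first) are assumptions for illustration, and image dimensions are assumed divisible by the largest pooling factor.

```python
import numpy as np

def partition_pyramid(radiance, logits):
    """Partition `radiance` (H, W) between layers and average-pool each partition.

    `logits` holds one full-resolution weight map per layer, shape (num_layers, H, W).
    """
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)  # softmax along layers: sums to one per pixel
    layers = []
    for k, layer_weights in enumerate(weights):
        part = layer_weights * radiance          # this layer's share of every pixel
        f = 2 ** k
        h, w = part.shape
        layers.append(part.reshape(h // f, f, w // f, f).mean(axis=(1, 3)))
    return layers
```

Since the weights sum to one at every pixel, no radiance is created or lost by the partitioning: scaling each layer's sum by the number of full-resolution pixels per coarse pixel, 4^k, and adding the results recovers the total input radiance.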

Splatting upsampler
If the coarse layers of the pyramid are naively upsampled, for example, using bicubic interpolation, their utility is limited: they cannot convey information beyond the Nyquist frequency of the given layer. Previous works [Vogels et al. 2018] address this problem by using learnable composition (Equation 4), as discussed in Section 3.3.
Here, we propose learnable upsampling kernels, which can better preserve edges, are not prone to ringing and blurring, and are convenient to predict using neural networks. Our upsampling and compositing is performed in a coarse-to-fine order. A finer layer is the sum of the corresponding denoised layer and an upsampled coarser layer:

$$\tilde{L}^{k}_{u,v} = \hat{L}^{k}_{u,v} + \sum_{(p,q)\,:\,(u,v) \in N_{4}(p,q)} 4\,\tilde{L}^{k+1}_{p,q}\,\sigma_{u,v}\!\left(S^{k+1}_{p,q}\right)_{u,v} \quad \text{for } k = 3, 2, 1, 0, \qquad (7)$$

where $S^{k+1}_{p,q}$ is a 4 × 4 learned upsampling kernel belonging to the coarse pixel $(p, q)$, $N_{4}(p, q)$ is the four-by-four neighbourhood of finer pixels it covers, $\hat{L}^{k}$ are the denoised layers, and $\tilde{L}^{k}$ are the composed layers. $\sigma_{u,v}$ denotes softmax normalisation performed over the entries of each kernel. Our filter uses 5 layers, making $K = 4$ our coarsest layer. We ablate this choice in Section 6.3. The result of the operation is $\tilde{L}^{0}$. Figure 3 shows the geometry of our kernel application step. The prediction of the kernel $S^{k+1}$ will be explained in Section 5.1.
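The coarse-to-fine splatting composition can be sketched in NumPy as below. The sketch assumes that each coarse pixel splats four times its radiance through a softmax-normalised 4 × 4 kernel centred on its 2 × 2 block of finer pixels, so that the mean radiance is preserved after average pooling; the wrap-around boundary handling via `np.roll` and the names `splat_upsample` and `compose` are illustrative choices only.

```python
import numpy as np

def splat_upsample(coarse, logits):
    """Splat each pixel of `coarse` (H, W) to a 4 x 4 block of the (2H, 2W) finer layer.

    `logits` has shape (H, W, 4, 4): one raw upsampling kernel per coarse pixel.
    """
    h, w = coarse.shape
    flat = logits.reshape(h, w, 16)
    flat = np.exp(flat - flat.max(axis=-1, keepdims=True))
    weights = (flat / flat.sum(axis=-1, keepdims=True)).reshape(h, w, 4, 4)
    fine = np.zeros((2 * h, 2 * w))
    for dy in range(4):
        for dx in range(4):
            oy, ox = dy - 1, dx - 1                   # offsets -1..2 from fine pixel (2p, 2q)
            ry, rx = oy % 2, ox % 2                   # parity of the destination row/column
            sy, sx = (oy - ry) // 2, (ox - rx) // 2   # shift measured in coarse pixels
            contrib = 4.0 * coarse * weights[:, :, dy, dx]
            # Wrap-around at the image border is an assumption made to keep the sketch short.
            fine[ry::2, rx::2] += np.roll(contrib, (sy, sx), axis=(0, 1))
    return fine

def compose(denoised, upsample_logits):
    """denoised: layers finest first; upsample_logits[k] upsamples layer k+1 into layer k."""
    out = denoised[-1]
    for k in range(len(denoised) - 2, -1, -1):
        out = denoised[k] + splat_upsample(out, upsample_logits[k])
    return out
```

With uniform kernels, a constant coarse layer maps to the same constant at the finer resolution; with arbitrary kernels, the sum of the upsampled layer is four times the sum of the coarse layer, so its mean is unchanged.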
Making the upsampling kernels learnable allows our method to adopt edge-preserving strategies. These strategies are crucial considering our method's effective footprint; at the lowest resolution layer, one pixel corresponds to 16 × 16 full-resolution pixels, and after applying the 5 × 5 denoising kernels and successive 4 × 4 upsampling kernels, our method reaches an effective footprint of 110 × 110 pixels. The edge-preserving properties of our filter let it mitigate low-frequency noise and transport radiance over large distances. Figure 5 and rightmost column of Figure 6 demonstrate the edge-preserving capabilities of our upsampler.
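The 110 × 110 figure can be reproduced with a short calculation. The sketch below reflects one reading of the filter geometry, stated here as an assumption: the 5 × 5 kernel at the coarsest layer spans five coarse pixels of 16 full-resolution pixels each, and every 4 × 4 upsampling step extends the footprint by one pixel of the target layer on each side.

```python
def effective_footprint(num_layers=5, denoise_kernel=5, upsample_kernel=4):
    """Width in full-resolution pixels reached by one pixel of the coarsest layer."""
    coarsest = num_layers - 1
    footprint = denoise_kernel * 2 ** coarsest   # 5 coarse pixels x 16 px = 80 px
    margin = (upsample_kernel - 2) // 2          # extra target-layer pixels per side
    for k in range(coarsest - 1, -1, -1):        # upsampling into layers 3, 2, 1, 0
        footprint += 2 * margin * 2 ** k         # adds 16, 8, 4, then 2 px
    return footprint
```

Under these assumptions the total is 80 + 16 + 8 + 4 + 2 = 110 pixels per side.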

IMPLEMENTATION
Our work mainly focuses on the filtering stage, so we adapt most of our overall pipeline from previous methods [Işık et al. 2021]. However, we note some changes regarding linear radiance and temporal processing. We propose two weight predictor networks to drive our filter; first, we build a 15 million parameter convolutional network (OursSmall), similar to previous work [Işık et al. 2021], that we optimise to run in real-time using TensorRT (https://developer.nvidia.com/tensorrt) and XLA [Sabne 2020]. Second, we build a 30 million parameter network (OursLarge) from ConvNext [Liu et al. 2022a] blocks in the Restormer [Zamir et al. 2022] configuration, optimised for interactive to offline performance. In the following section, we provide an overview of our pipeline, with Supplementary, Sec. A including all the details of our weight predictor networks and feature engineering.

Pipeline
Figure 4 presents major components of our denoising pipeline. Its input is per-sample linear radiance and a feature vector including base colour, normal in camera coordinates, and depth. We find that additional features degrade test-time performance. First, following Işık et al. [2021], we use a small fully-connected network, $E$, to encode per-sample data and then we average encodings for each pixel:

$$e_{t,x,y} = \frac{1}{S} \sum_{s} E\!\left(f_{t,x,y,s}\right), \qquad (8)$$

where $f_{t,x,y,s}$ is the per-sample data, $s$ denotes the sample index, and $S$ stands for the total number of samples per pixel. Similarly, we average the sample radiance values for each pixel in the current frame $t$:

$$L_{t,x,y} = \frac{1}{S} \sum_{s} L_{t,x,y,s}. \qquad (9)$$

We compute all features at the primary ray intersection. Consequently, samples visible through specular reflections and transmissions do not contribute to these features, making their image more difficult to denoise. Robustly handling such cases is an active, orthogonal area of research that we touch on in Section 7.
We compute the accumulated radiance and embeddings over time (refer to Figure 4). Neural temporal blending weights control such accumulation (Equations 10 and 11), blending the current frame with the warped accumulation of the previous frame. Here, W is the warping operator, which warps and bilinearly interpolates frame $t - 1$ to frame $t$ using backwards motion vectors calculated at the centre of each pixel in frame $t$. We restrict the blending weights to the range (0, 1) using sigmoid mapping.
Next, we predict parameters for our denoising filter using a weight predictor network (refer to Figure 4). As input, we provide the radiance and embeddings from the current frame, and the accumulated radiance and embeddings warped from the previous frame (Equation 12). Then we apply our denoising filter as described in Equations 5, 2, and 7. Finally, we apply a secondary temporal loop to improve temporal stability (Equation 13). This loop uses a secondary set of blending weights, similar to those of the temporal accumulation, and a 5 × 5 temporal kernel, applied similarly to the denoising kernels of the full-resolution layer. We adopt this mechanism from NTASD, as we find it is necessary to achieve good temporal stability; every denoising filter we tested (see Section 6) produced intense flickering without this secondary temporal loop.
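The temporal accumulation step can be illustrated with a short NumPy sketch. It assumes the common convention in which the sigmoid-mapped weight multiplies the current frame and its complement multiplies the warped history; the function name `accumulate` and this convention are assumptions, and the warping itself is taken as already applied to `warped_previous`.

```python
import numpy as np

def accumulate(current, warped_previous, blend_logits):
    """Blend the current frame with the motion-compensated accumulation of the previous frame."""
    lam = 1.0 / (1.0 + np.exp(-blend_logits))  # sigmoid keeps the weight in (0, 1)
    return lam * current + (1.0 - lam) * warped_previous
```

A weight near one discards the history, for example at disocclusions, while a weight near zero keeps accumulating and further reduces variance in static regions.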

Training procedure
Spatiotemporal processing. Many video super-resolution methods rely on recurrent convolutional neural networks as their weight predictors [Liu et al. 2022b]. Inspired by their success, we adopt one of their key components; we use Backpropagation Through Time (BPTT) [Graves 2012], taking staggered training iterations over every two consecutive frames. Denoising does not offer delayed rewards; our objective is to output the highest possible quality frame in every iteration. Therefore, the greedy two-iteration BPTT is well suited for training the temporal element of our denoiser.
Previous methods [Işık et al. 2021] process entire sequences in each training iteration. Backpropagating through such iterations is extremely memory-consuming, limiting these approaches to 8-frame sequences of 128 × 128 patches. In contrast, our staggered two-iteration BPTT lets us train on 64-frame sequences of 256 × 256 patches, exposing our model to substantially more spatiotemporal information.

Figure 4: Processing diagram for our denoiser. Sample radiance and geometric/material features are encoded and averaged over each pixel. The per-pixel radiance and features for the current frame are then combined with those of the previous frame (after motion compensation, Eqs. 10 and 11). The kernels of the per-pixel denoising filters and other parameters are predicted by the weight predictor network (Eq. 12). To apply denoising kernels, the radiance is partitioned into a 5-layer pyramid (Section 4.1), then convolved with the kernels (Eq. 2), and finally the layers are combined using the adaptive splatting upsampler (Section 4.2). For better temporal stability, the final radiance values are filtered using predicted temporal kernels (Eq. 13).
Dataset. Inspired by Hypersim [Roberts et al. 2021], we leverage Evermotion's Archinteriors and Archexteriors collections to build our production-quality training dataset that exceeds the quality and diversity of datasets used in previous works. Refer to Supplementary, Sec. B for more details on our training dataset.
Loss function. Our loss function employs a perceptual component as proposed by Thomas et al. [2022], complemented with SMAPE. Both components also contribute to the temporal loss, which additionally involves the warped previous frame, as we detail in Supplementary, Sec. C.
Training. We implement our pipeline in TensorFlow 2 [Abadi et al. 2016]. We use the Adam [Kingma and Ba 2014] optimiser with a batch size of 8. We begin training at a learning rate of $10^{-4}$ and exponentially decay the learning rate, halving it every 11 epochs. We train on the full 256 × 256 patches included in our dataset, augmented with flips and rotations. Training typically converges after 50 epochs, taking 2-3 days on a single NVIDIA A40 GPU for our small weight predictor, and 4-5 days on 4 A40s for our large weight predictor.
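For readers unfamiliar with SMAPE (symmetric mean absolute percentage error), the following NumPy sketch shows the form commonly used for high-dynamic-range images. The stabilising constant `eps` and its value are assumptions; the exact loss configuration is described in the supplementary material.

```python
import numpy as np

def smape(denoised, reference, eps=1e-2):
    """Symmetric mean absolute percentage error between two radiance images."""
    num = np.abs(denoised - reference)
    den = np.abs(denoised) + np.abs(reference) + eps  # eps avoids division by zero (assumed value)
    return float(np.mean(num / den))
```

Because the error is normalised by the magnitude of both images, bright regions do not dominate the loss the way they would with a plain L1 or L2 error on linear radiance.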
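The learning-rate schedule described above can be written as a one-line function. This sketch assumes a continuous exponential decay; a staircase variant that halves the rate only at epoch boundaries would match the description equally well.

```python
def learning_rate(epoch, base=1e-4, half_life_epochs=11):
    """Exponentially decayed learning rate that halves every `half_life_epochs` epochs."""
    return base * 0.5 ** (epoch / half_life_epochs)
```

Under this schedule the rate drops to half the base value after 11 epochs and to a quarter after 22 epochs.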

EVALUATION
In this section, we compare our pipeline performance to state-of-the-art techniques and run several ablations of our pyramidal denoising.
Our test scenes include Kitchen, Dining-room, Bedroom [Bitterli 2016], Bistro [Nvidia 2017], and Zero-Day [Winkelmann 2019] (all excluded from the training dataset). As our method and some compared methods rely on temporal information, we allow a 16-frame warm-up phase at the start of each sequence. We tonemap the output frames based on the ACES guidelines [Hill 2022] to match professional animations. We use the Falcor [Kallweit et al. 2022] renderer to render our input samples and 6144 samples per pixel for reference frames. We compute our reference frames as the median of three uncorrelated 2048 spp estimates to suppress fireflies.
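The median-of-three reference construction can be sketched as follows; the function name `reference_from_estimates` is hypothetical. Taking the per-pixel median means a firefly present in only one of the three estimates does not reach the reference, whereas a plain mean would retain a third of its energy.

```python
import numpy as np

def reference_from_estimates(estimates):
    """Per-pixel median over independently rendered estimates of the same frame."""
    return np.median(np.stack(estimates, axis=0), axis=0)
```

For example, if one estimate contains an outlier of 1000 at a pixel where the other two estimates are 1, the median is 1 while the mean would be 334.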

Baseline comparisons
We compare our full pipeline to state-of-the-art denoisers. We render our training dataset using Falcor. We compare against AFGSA [Yu et al. 2021], Intel Open Image Denoise (OIDN) [Intel 2022] version 1.4.3, and Nvidia OptiX AI-accelerated Denoiser (ONND) [Nvidia 2022b] version 7.6, with temporal and kernel-based [Bako et al. 2017] extensions enabled. OIDN and AFGSA are high-quality offline denoisers with inference times of 500 milliseconds or more, while ONND is an optimised interactive denoiser taking roughly 100 milliseconds. We average our per-sample albedo, normal and depth data as each competitor requires per-pixel inputs. We retrain AFGSA and OIDN and use the pre-trained, proprietary ONND included in OptiX. While version 5 of ONND was retrainable, it was based on early work that has been surpassed by more recent works [Bako et al. 2017; Xu et al. 2019; Yu et al. 2021], and is unsupported on recent GPUs.
In our comparison, we consider commonly used image quality metrics such as per-pixel PSNR, structure-oriented SSIM and MS-SSIM [Wang et al. 2003], perception-informed FovVideoVDP [Mantiuk et al. 2021], Flip [Andersson et al. 2020], and tPSNR [Banitalebi-Dehkordi et al. 2016]. Commonly used PSNR and SSIM measure error per pixel or in a small sliding window. Meanwhile, MS-SSIM and FovVideoVDP apply filter banks, capturing distortions over large areas. Moreover, FovVideoVDP is a video quality metric that captures temporal distortions, including popping and flickering artefacts perceivable by human observers. Table 1 presents the obtained results where the OursLarge weight predictor (refer to Section 5) consistently outperforms all competing solutions for all considered metrics. This observation holds for all numbers of input samples per pixel. OursSmall takes 32 milliseconds to denoise a full HD frame (1920 × 1080), which is three times faster than ONND; OursLarge, similar to OIDN, requires around 500 milliseconds, while AFGSA requires 5.5 seconds. Figure 7 shows a visual comparison of denoising performance for selected frames. We refer the reader to our supplementary material for full-resolution images and videos of our animated sequences.

Table 1: Comparison of the baseline methods. Denoising quality is measured using a number of well-established full-reference metrics. We denote in bold the best quality score and with a gray background the second-best score for a given number of samples per pixel (spp). ↑/↓ indicate that higher/lower scores are better.

Table 3: Replacements of our denoising filter. We evaluate alternatives to our proposed denoising filter: weighted sum, scale-composition [Vogels et al. 2018], and affinity kernels [Işık et al. 2021]. We leave the rest of our Small and Large pipelines, driven by OursSmall and OursLarge respectively, unchanged. The best and second-best scores are marked separately for the Small and Large weight predictors.

Choice of denoising filter
We compare our denoising filter against several filters proposed in previous works. We swap our denoising filter in our pipeline with implementations of other filters, keeping the rest of our pipeline the same for a fair comparison. We test each filter with 4 spp inputs using the OursSmall weight predictor and the OursLarge weight predictor to analyse each filter's scalability. Table 3 shows several metrics averaged for our test scenes. Although metrics mostly show small differences, our denoising filter scales well with respect to weight predictor size. As only our filter shows the best overall performance with both weight predictors, it is a reliable choice for general Monte Carlo noise filtering.

Pyramidal filter ablations
To analyse each component's role, we perform ablation studies. We use the OursSmall weight predictor network with 4 spp inputs. We test our denoising filter with 3 and 4 layers instead of 5, and we substitute our learnable upsampling stage with bilinear upsampling. Table 2 shows several metrics averaged for our test scenes, and Figure 5 shows side-by-side qualitative comparisons.

Figure 5: Pyramidal filter ablations (panels: Reference, Our Small, w/o Upsampler, 4 layers, 3 layers, Noisy). The OursSmall weight predictor is considered and the role of: (1) learnable upsampling (Section 4.2) that is replaced by a simple bilinear upsampling (the column "w/o Upsampler") and (2) the number of filter layers as shown in Figure 6 (3rd and 4th columns) are considered. Our complete filter uses 5 layers. The last two columns show the reference image and the noisy input. Bistro © 2023 Amazon Lumberyard, Kitchen © 2023 Jay-Artist, Bedroom © 2023 SlykDrako
Our learnable partitioning downsampler is a crucial component of our denoising filter. The learnable upsampling stage, while having a small impact on metrics, improves the denoising of low frequencies and enhances sharp details. Dropping to 3 or 4 layers also has little effect on metrics but limits the low-frequency denoising capabilities of our filter. The computational cost savings in these cases are negligible; therefore, we favour our full scheme. Important visual consequences of dropping learnable upsampling can be observed in the 1st and 2nd row of Figure 5, where bilinear upsampling leads to ringing and overshoot artifacts. In general, reconstructed edges are more blurry in the second column. Reducing the number of layers might lead to excessive washing out of noisy low-contrast patterns and residual low-frequency noise as can be seen in the 3rd and 4th row. The 5th layer is rarely utilised but does occasionally help resolve artefacts as reflected by metrics.

LIMITATIONS AND FUTURE WORK
Our filter is affected by over-blurring and synthetic structural artefacts common to neural networks. When our filter cannot reconstruct a clean, detailed image from the available samples, it inevitably has to trade off between these artefacts. While perceptual loss functions can somewhat control the learned strategy, our network occasionally produces unsatisfactory results.
Relying on primary ray intersections diminishes our denoiser's performance in scenes with specular objects. However, path-based denoising [Cho et al. 2021;Lin et al. 2021] and temporally reliable motion vectors [Thomas et al. 2022] are orthogonal research areas with good potential to resolve these issues.
Finally, we do not consider motion blur, depth of field, or other distributed effects. However, the depth-wise layer decomposition proposed in previous work is orthogonal to our work and could potentially alleviate this limitation.

CONCLUSION
In this work, we proposed a novel pyramidal filter that, combined with a lightweight network, achieves close to real-time performance while offering denoising quality comparable to previous offline solutions. Combining the same pyramidal filter with a large network, we obtain substantial improvements over previous work. Visually comparing our denoised frames reveals better reproduction of high-frequency details and a significant reduction of low-frequency noise and artefacts.
Some of the improvements can be attributed to our training strategies that adapt backpropagation through time, handling longer frame sequences and larger patch sizes. Nevertheless, embedding previously considered state-of-the-art filters into our training pipelines and network architectures still diminishes performance.

Figure 6 (partial caption): The blue line represents the signal and the orange line represents noise (calculated by subtracting the reference image). The range of values on both axes is identical in all the plots. Note that the rightmost column does not show the composed layers, but each layer upsampled separately. The predicted partitions contain insignificant low-frequency noise below the denoising kernel's effective band in each layer (1), a mix of useful radiance and noise in the effective band (2), and primarily white noise in the higher frequencies (3), later removed by downsampling (4). The noise filtering stage reduces the noise in the effective band (5), and the upsampling stage reconstructs the small amount of useful high-frequency radiance lost during downsampling (6). Zero-Day © 2023 Mike Winkelmann