A Fast GPU Schedule For À-Trous Wavelet-Based Denoisers

Given limitations of contemporary graphics hardware, real-time ray-traced global illumination is only estimated using a few samples per pixel. This consequently causes stochastic noise in the resulting frame sequences which requires wide filter support during denoising for temporally stable estimates. The edge avoiding à-trous wavelet transform amortizes runtime cost by hierarchical filtering using a constant number of increasingly dilated taps in each iteration. While the number of taps stays constant, the runtime of each iteration increases in these usually memory-throughput bound shaders with increasing dilation, because the increasing non-locality negatively impacts cache hit rates. We present a scheduling approach that optimizes usage of the memory subsystem by permutating global invocation indices in such a way that each wavelet filter iteration is applied through undilated taps. In contrast to prior approaches, our method has identical performance characteristics in each iteration, effectively decreasing maintenance cost and improving performance predictability. Furthermore, we are able to leverage on-chip memory and hardware texture interpolation. Our permutation strategy is trivial to integrate into existing wavelet filters as a permutation before and after each level of the wavelet filter. We achieve speedups between 1.3 and 3.8 for usual wavelet configurations in Monte Carlo denoising and computational photography.

Further, permitting shared memory caching in all iterations.(d) Subgroups within a workgroup are reshaped to rectangular regions.These are constructed in such a way that shifting them anywhere in the shared memory resident pixel tile does not result in a bank conflict.
Given limitations of contemporary graphics hardware, real-time ray-traced global illumination is only estimated using a few samples per pixel.This consequently causes stochastic noise in the resulting frame sequences which requires wide filter support during denoising for temporally stable estimates.The edge avoiding à-trous wavelet transform amortizes runtime cost by hierarchical filtering using a constant number of increasingly dilated taps in each iteration.While the number of taps stays constant, the runtime of each iteration increases in these usually memory-throughput bound shaders with increasing dilation, because the increasing non-locality negatively impacts cache hit rates.We present a scheduling approach that optimizes usage of the memory subsystem by permutating global invocation indices in such a way that each wavelet filter iteration is applied through undilated taps.In contrast to prior approaches, our method has identical performance characteristics in each iteration, effectively decreasing maintenance cost and improving performance predictability.Furthermore, we are able to leverage on-chip memory and hardware texture interpolation.Our permutation strategy is trivial to integrate into existing wavelet filters as a permutation before and after each level of the wavelet filter.We achieve speedups between 1.3 and 3.8 for usual wavelet configurations in Monte Carlo denoising and computational photography.

INTRODUCTION
Image-space denoisers operate on fully accumulated samples.They are a popular choice as they are easy to integrate as a black-box post-processing step.However, even the runtime cost of denoisers explicitly designed for real-time applications is often prohibitive, leading people to amortize cost by combining denoising with upscaling [Kelly et al. 2021;Thomas et al. 2022], through sparsesampling [Willberger et al. 2019;Zhdan 2021], quantization [Thomas et al. 2020], or by reducing the support of their filter kernels [Schied et al. 2018].
A common building block of real-time denoisers are àtrous wavelets [Dammertz et al. 2010], popularized in realtime rendering through spatio-temporal variance guided filtering (SVGF) [Schied et al. 2017].À-trous wavelets maintain a constant number of filtering taps, but iteratively increase support of the filtering kernel by dilation (see fig. 1(c)).However, this constant number of filtering taps does not lead to a constant runtime across all wavelet iterations, due to decreasing reuse between dilated taps with each recursion level and consequently, lower cache hit rates and higher memory instruction latency (see section 4.2).
Image-space denoisers are usually memory-bound gathering operations parallelized by ping-pong buffering.Hence, computations between pixels are embarrassingly parallel, providing maximal flexibility to optimize memory throughput through scheduling, i.e. more efficient ordering of (memory) instructions without changes to the algorithm.More concretely, such a scheduling approach may be implemented by altering the mapping between pixel indices and invocations by adding index access functions  , as shown in fig. 2.
Recent works [Hasselgren et al. 2020;Thomas et al. 2022Thomas et al. , 2020] ] compare favorably against SVGF when their reconstruction quality to reconstruction time tradeoff is considered.However, these implementations rely on non-portable optimizers and runtimes, while comparing against straightforward, portable wavelet filters implemented without considering efficient usage of GPU hardware resources through scheduling.
In this work, we propose several performance-motivated changes the schedule of à-trous wavelets.These changes may be implemented through index access functions  , .The proposed index access functions are loop invariant.Hence, our method is trivial to integrate into existing waveletbased denoisers by addition of a preamble and postamble to the baseline schedule (cf.fig.4).Our contributions are: • A scheduling strategy without memory overhead localizing global memory accesses in an à-trous wavelet scheme (Section 3.1) through a 2D embedding (Section 3.2) that is oblivious to signal boundaries (Section 3.4).• Apart from optimizing cache hit rates and thus shader runtimes, this coalesced schedule enables usage of shared memory, i.e. fast on-chip memory with a user-controlled caching strategy, resulting in further runtime improvements (Section 3.5).• Motivated by the high memory consumption of auxiliary feature-guided denoisers, a shared memory permutation strategy for 2D stencils to eliminate shared memory bank conflicts without introducing unused memory for padding (Section 3.5).

PRIOR WORK
Research in Monte Carlo denoising has focused on the design of novel denoising architectures targeting different sample counts and consequently varying runtime budgets for real-time [Schied et al. 2017;Thomas et al. 2022], interactive [Chaitanya et al. 2017] or offline rendering [Vogels et al. 2018].As our work focuses on scheduling, we review prior work from a structural perspective, i.e. using their data dependencies.For a general survey of denoising algorithms in Monte Carlo rendering, we refer to the surveys of Zwicker et al. [Zwicker et al. 2015] and Sen et al. [Sen et al. 2015].
À-trous Wavelets For Denoising.Edge-avoiding, undecimated wavelets were introduced to computer graphics by Fattal [Fattal 2009] in the context of computational photography.À-trous wavelets were first employed for Monte Carlo Denoising by Dammertz et al. [Dammertz et al. 2010].Schied et al. [Schied et al. 2017] extended this approach to real-time rendering through temporal reprojection.By estimating variance using moments during reprojection and within the downward pass of the wavelet, they are able to estimate results using two alternating buffers without an upwards pass assembling all wavelets to a final estimate.Their following work [Schied et al. 2018] reduces the support of the wavelet as a runtime optimization.Hanika et al. [Hanika et al. 2011] use à-trous wavelets for image denoising without feature channels.Their approach preserves all intermediate buffers of the recursive spatial (downward) pass and executes an upward pass to assemble the final estimate.Given the upward pass only relies on the previously computed wavelets from the downward pass, the whole upward pass can be implemented in a single shader that concurrently reads all results from the downward pass.The implementations proposed by these works [Dammertz et al. 2010;Schied et al. 2017Schied et al. , 2018] ] and applied in industry implementations [Nvidia Inc. 2023] corresponds to the baseline implementation shown in fig. 2. In contrast to undilated filters, à-trous wavelets in the baseline schedule cannot exploit commonly applied optimizations for gathering based post-processing effects, such as hardware texture interpolation [Kawase 2003] and on-chip memory resident caching [Nvidia Inc. 2023] after the first filter iteration.Our method enables these optimizations as undecimated wavelets are applied through undilated filters.
Convolutional Neural Networks in Denoising.More recently research focus has shifted towards learned approaches using convolutional neural networks (CNNs) [Bako et al. 2017;Chaitanya et al. 2017;Kalantari et al. 2015;Thomas et al. 2022Thomas et al. , 2020;;Vogels et al. 2018;Xu et al. 2019].A commonly employed architecture is the U-Net [Ronneberger et al. 2015].Instead of direct prediction of the output image, kernel prediction [Bako et al. 2017;Vogels et al. 2018] computes a pairwise pixel similarity and then applies the similarity score encoded in a filter kernel to the input image.In the context of neural networks, undecimated wavelets may be modelled as increasingly dilated convolutions without pooling; while layers in pooling architectures such as U-Nets are dual to decimated wavelet iterations.Our method arranges undecimated wavelets as a tensor stacking postamble  o outer schedule  i preamble Fig. 3. Overview of our scheduling approach for a single (inner) filter iteration, which is repeatedly applied to form the full à-trous wavelet-based denoiser in fig.1(a).A shader running an undilated filter kernel is augmented with loop-invariant bijections applied to the invocation indices before (preamble) and after (postamble) the filter kernel loop to implement an à-trous filter.The postamble scatters write locations in order to localize global memory read-locations of subsequent iterations without prior application of a bijection in the preamble.Subsequent iterations of the shader are able to ignore boundaries in their input since our embedding approximates texture mirroring for out of bounds memory accesses.decimated wavelets along the feature dimension.As such our method arranges undecimated wavelets in the memory layout commonly seen in neural network architectures.
Input Features.There is a large variability in denoisers, as denoisers proposed in literature are usually heavily adapted to fit the rendering budgets and properties of the rendering engine or task [Barré-Brisebois et al. 2019;Boksansky et al. 2019;Kelly et al. 2021;Willberger et al. 2019].Denoisers usually receive auxiliary features computed as a byproduct of the rendering processsuch as a G-buffer-packed in an engine specific datagram.Further, samples may be separated into effects [Hofmann et al. 2023;Schied et al. 2017;Zhdan 2021] as more efficient denoisers are known for a path subspace.À-trous wavelets are popular for indirect diffuse lighting, where usually large filter radii are required.
Scheduling of Array Processing Code.Analysis and optimization through affine functions has a long tradition in compiler theory.Our proposed bank conflict resolution strategy is designed around the greatest common divisor (GCD) test [Muchnick 1997], a test for loop-carried data dependencies.Modern automatic parallelization and auto-vectorization leverages the polyhedral model [Barré-Brisebois et al. 2019;Grosser et al. 2011;Vasilache et al. 2018], which lifts loop nests into a system of affine constraints (Z-polyhedra) representing every possible dynamic execution instance.While motivated by these approaches, our objective is to specifically design a manual optimization method as detailed in the next section.

METHOD
We give an overview of our complete method as a sequence of pixel coordinate transformations implementing the proposed index access functions along with an example implementation, before discussing each step of this sequence in detail (Sections 3.1 to 3.5).
Model.Independent of their distribution to buffers and images, our method transforms all perpixel data identically.Thus, we assume input to the denoiser is given as a 3-dimensional array  ∈  ×  ×  where  ×  are the dimensions of the input image and  are the image channels along with auxiliary features An iterative denoiser reconstructs  by applying multiple iterations of the filter  .Each iteration is a stencil computation, that integrates pixels from a neighborhood   into each pixel  ∈ 1 × 1 × -for example, through a weighted average  ∈    (, ) using a pixelwise similarity function   as shown in fig. 2. We first describe the application of our method to a shader implementing an undilated filter iteration  , resp. the first iteration of an à-trous wavelet.We then show that an iterative application of this shader results in the desired à-trous filtering scheme.
Undilated Filter Iteration.The shader  2↓ modified through our scheduling method computes ( • ) ( ), which is the filtered image  ( ) with an additional memory layout transformation  applied to the output image.The bijection  takes the array index of a pixel in the input  and maps it to another index in the output array.From a high-level, the proposed  subdivides the output of  into four subimages.In contrast to the baseline, this allows pixels in the next iteration to be accessed without dilation increasing memory subsystem performance.This  is compatible with translation.Consequently, allowing us to transform all accessed memory indices at once, by inserting a preamble  i before, and a postamble  o after the stencil loop.This is equivalent to transforming the invocation index x directly instead of each memory access-reducing the number of computations of  by a factor of || x ||.The adapted shader  2↓ is shown in fig. 3. The (optional) preamble  i optimizes shared memory throughput using a bijection  shm to resolve bank conflicts without row padding (Section 3.5), which is especially desirable if many auxiliary features  are cached for each pixel.In essence,  shm reshapes subgroups within each workgroup to rectangular regions.These are constructed in such a way that their translation within the stencil loop does not result in bank conflicts within the shared memory resident pixel tile (see Figure 1(d)).The postamble  o is defined as the composition of multiple bijective functions The bijection  2↓ performs the mapping between undilated and à-trous filters by contracting even and odd signals into four independent output images (Section 3.1).The bijection   permutes the order of the output images to influence   , which arranges the output images as subimages at collision-free positions in the output (Section 3.2).The (optional) bijection   flips the coordinate systems of the arranged subimages in order to approximate texture mirroring (Section 3.4).This approximation ensures memory accesses falling outside of the current subimage fall into similar regions in neighboring subimages, allowing us to omit the overhead of additional boundary checks in the stencil loop.
Iterative Filtering.An iterative application of  is shown in Figure 1(a).Apart from the last iteration  max , which restores the original image layout through   max ↑ (Section 3.3), each iteration applies  • , iteratively partitioning each subimage embeded in the input image.As a result, contracting memory accesses in each iteration (Figure 1(b+c)) to an undilated filter.As all iterations are applied through an undilated filter, each iteration has identical performance characteristics (cf.section 4.2).
Discussion.Our method delocalizes write operations to localize read operations as read operations in a stencil filter have higher multiplicity.At the same time, this places most of the computational cost of  into the postamble  o , which is independent of the stencil loop.Consequently, overhead of  o may be hidden through instruction-level parallelism.Opposed to a naive recursive application of  2↓ to each output image, we embed subimages for two reasons.First, to allow memory reuse by ping-pong buffering.Second, to optimize occupancy.Through the embedding, a single shader  dispatch processes all subimages concurrently without introducing additional inactive invocations through quantization to the workgroup size at subimage boundaries (see fig. 1(d)).In some variations, such as the example discussed next, the embedding makes our method invariant to the iteration index .
Example.In the remainder of this paper, we discuss each bijection in detail along with implementation variants and tradeoffs.As a motivational example, consider the concrete implementation variant of our method in fig. 4. The original implementation of  is shown in lines 5 to 10. Assuming the original implementation is an à-trous filter in the baseline schedule, we remove the filter dilation  = 2  increasing with each iteration  to contract the neighborhood  x (Line 7).Since our method ensures a constant working set size equal to the first undilated wavelet iteration, we further access neighboring pixels through shared memory.Preloading a tile from global to shared memory is a common operation and elided for brevity (Line 3).As a result, we arrive at a reasonably optimized implementation of an undilated filter (Lines 3 to 10).The inserted preamble implements a special case of our permutation-based bank conflict resolution  shm , which for the assumed workgroup shape of 8 × 8 can be implemented as a transposition of local invocation indices (Lines 1 to 2).More generally,  shm reshapes subgroups from the row-major order of the baseline schedule to a rectangular tiling.In the postamble, the bijection  2↓ (Line 13) decomposes the output signal into even and odd signals along each dimension to form a set of shifted, decimated signals differing in their offset, i.e. the four subimages per input (sub)image.The bijection   (Line 15) embeds the decomposed signals on a rectilinear grid.In this implementation variant,   arranges subimages in ascending order, sorted by their offset in the input image, in which case its sufficient to apply the bijection   (Line 14) approximating texture mirroring for the embedded subimages in the first iteration only.In the last iteration, we instead invert all applied bijections to restore the original image layout.We discuss the shown optimized bijection (Lines 16 to 22) to invert multiple applications of  2↓ in section 3.3.

Signal Decomposition
The core of our method is an iterative subdivision of the input image into four independent subimages through the bijection  2↓ .We show the correctness of this transformation by modelling local data dependencies of each output pixel as affine functions.We then extend this notion from individual pixels to affine dependencies for the whole image.
Data Dependency of Affine Memory Accesses.As demonstrated in fig. 4 lines 5 to 10, stencil computations can usually be modeled through a loop nest centered around the output element.Given a single loop for(i=;i<;i+=) with loop counter  ∈ Z, initialization , condition  < , and loop increment , all dynamic instances of this loop nest can be modeled through an affine function i +  along with inequalities modeling the constraints imposed by the initialization and condition, which we conveniently express as  ∈ [, ).As can be seen in fig.5(b), the initialization  partitions the domain Z into  disjunct subsets.As this describes a congruence relation modulo , we denote []  as the subset of all values described by the affine function with initialization  ∈ [0, ).On a bounded domain of width , the size of the induced subsets may differ by one if the value range  is not divisible by the loop increment .There is a prefix of  = ( mod ) large congruence classes, and suffix of  −  small congruence classes.Thus, the maximal wasted memory for padding to a single cardinality for all subsets is at most the number of small subsets  < .
Data Dependencies of À-trous Wavelets.We will now consider how the global data dependencies of a dilated stencil loop centered on each pixel can be modeled analogously as an affine function.Given a stencil with radius  centered around pixel  with a dilation factor of , we can uniquely identify each memory access of  with an iteration vector (, n), where  ∈ [−,  ] describes the relative offset of a non-zero tap integrated into .The memory locations accessed by  are given by   (black and green in fig.5).The union of all center pixels  ∈ Z with a non-empty intersection of dependent memory accesses   can be described by an unbounded joint affine function (gray in fig.5(a)).Consequently, a stencil with dilation  and arbitrary support  > 0, decomposes the domain Z into  independent subsets.Figure 5 Delaying discussions of varying subimage sizes until section 3.2, we define the bijection , which takes  ×  wavelet images as a 5-dimensional array  ×  ×  ×  ×  and decomposes each subimage according to its data dependencies into 2  subimages represented by an array of shape 2   × 2   ×  × ⌈ /2  ⌉ × ⌈ /2  ⌉.In figure 3, we illustrate the application of  2↓ where a single subimage [(0, 0)] 0 is decomposed into four subimages.Mapping this array representation to a memory layout varying the innermost dimensions  ×  most quickly, data dependencies to neighboring stencil taps  x are localized.

Image Domain Embedding
We define the embedding   •   , which reshapes the output of  2↓ back to the input shape 1 × 1 × × ×  .Such a mapping reduces memory consumption and maximizes device occupancy.
Motivation.In a baseline schedule, each center pixel of the iterative filter is mapped to a single GPU invocation.Consequently, the decomposition into congruence classes [o]  induces a partition of invocations into independent sets.While this suggests that each subimage could be mapped to a single shader, embedding of all subimages into a single domain after each iteration has two benefits.First, we can reuse opaque image memory through ping-pong buffering.Second, we can dispatch a single shader that processes all subimages while being oblivious to the dimensions of subimages apart from optional weight modifications for out of bounds taps.Using a single shader dispatch for all subimages of an iteration trivially works around the quantization of subimage bounds to integer multiplies of the workgroup size.Otherwise, the occupancy of later iterations is impacted through inactive invocations and dispatch grid sizes that insufficiently saturate the hardware.Figure 1(d), shows an example where a single workgroup processes multiple subimages at once, which would otherwise reduce occupancy through inactive invocations if unembedded subimages are directly mapped to the device.
Local Subdivision Order.Local subdivision naturally results from an iterative application of  2↓ to each subimage.Each subimage [o]  stored in a coordinate range in the input image, is partitioned by its least significant bit of the local pixel indices   (o) − x and written back to the same coordinate range in the output image.As illustrated in fig.6(d), this leads to a reversal of bits as the -th bit is inserted as the lowest bit to form the -bit wide subimage identifier.Consequently, local subdivision is defined as i.e. reversal of all  significant bits   of the subimage identifier o.In a practical implementation, the function rev 2 reverses the bits of the bitsize(•) ≥ -wide unsigned integer format used to store o.
Ascending Order.An iterative application using an identity function   (o) = o partitions global pixel indices.As shown in fig.6(b), iteration  partitions the embedding at the -th bit, inserting it as the highest bit into the subimage identifier, effectively partitioning the original pixel position at  with the lower bits representing , and higher bits the local subimage position.
Discussion.Notably, an iterative global subdivision may be implemented as a single application of  2↓ to the whole input image, i.e. the embedded subimages.Further, as the order between small and large subimages is maintained, Euclidian division automatically packs subimages on a rectilinear grid without padding.In an iterative application of local subdivision   , iteration  permanently partitions memory regions based on the -th bit for all subsequent iterations  ′ > .In contrast to   , an embedding in ascending order introduces additional data dependencies between subimages as potentially reused memory regions are not disjoint.(As an example, consider memory dependencies between subimage [(1, 0)] 1 and [(0, 0)] 0 in fig.6.)However, in the proposed schedule, where subimages of a level are processed in a single shader dispatch with ping-pong buffering, these additional data dependencies are already satisfied.

Interleaving
To restore the original image layout after multiple iterations of our method, we define which first unembeds the subimages (  •   •   ) −1 , before (re)interleaving The bijection  −1  computes grid positions and subimage-local pixel indices from global pixel indices.The bijections  −1  and  −1  are self-inverse, with  −1  restoring the subimage identifier from the embedded grid position and  −1  normalizing the pixel coordinate system of each subimage.

Treatment of Signal Boundaries
In this section, we define the boundary permutation   , which approximates texture mirroring for the embedded subimages in order to reduce overhead of out-of-bounds memory accesses.
Motivation.We strive for permutations to be pure post-and preprocesses not causing any additional arithmetic or control overhead within the filter loop.In the baseline schedule, the overhead of boundary checks can often be mitigated by exploiting texture mirroring or by setting out-of-bounds accesses to a special value with zero weight, i.e.   (•) = 0-either using the texture sampler, or by resolving out-of-bounds accesses once while preloading data from global to local memory.However, our embedding   •  prevents these optimizations as boundary checks between embedded subimages are required.
Boundary Bijection.We flip the axes of subimages embedded in the two dimensional domain in a checkerboard as shown fig.1(a).Given the position o of the subimage on the embedding grid, the boundary bijection is flips the axis of every other subimage of shape  Discussion.As it allows cross-talk between neighboring subimages,   is non-semantics preserving as the computed result   () differs from  () for pixels close to the image boundaries.However, as discussed in more detail in section 4.1, taps from neighboring subimages fall within the footprint of the filter support quantized to the dilation factor, effectively converting the à-trous filter to a undilated filter in boundary regions.Furthermore, it increases the number of taps integrated from within the current subimage, as the boundary may be crossed multiple times during iterative filtering.
We can implement the boundary permutation as a pure postprocess in the postamble without additionally computing the coordinate system orientation as a side product of (  •   ) −1 in the preamble as our exposition assumes either a symmetric filter kernel or a filter kernel deriving its weights in a data-driven way from the pixel values.In these cases, the filter kernel is independent of the coordinate system orientation of subimages.Otherwise, either the filter kernel or the data would have to be flipped to match orientation.
Iterative Filtering.In an iterative application of a symmetric or data-driven filter kernel, the output of the prior iteration still would have to be unembedded to determine the coordinate system orientation of the output in the current iteration.We propose the following optimizations for iterative filtering.For ascending order   , uneven pixels are always embedded at uneven positions since subimages are positioned using the LSB  0 first.Thus, we can statically specialize a shader that flips half of the input image in the postamble of the first iteration, ignoring the orientation of subimages in all subsequent iterations.For local subdivision order   the boundary permutation is applied in each iteration.The orientation stays the same for pixels coordinates, where the most significant bit (MSB) of the prior subimage identifier   −1 and the new MSB   are identical.
The difference between both image domain embedding schemes with mirroring is illustrated in fig.6.

Shared Memory Bijection
A common optimization technique for stencils is shared memory resident overlapped tiling, i.e. for a workgroup size of   ×  , a rectangular region  ×  ×  =  ×  +2 ×  +2 of the input image containing all memory locations accessed by a filter kernel with radius  is cached in fast on-chip memory (see fig. 1(d)).Efficient addressing of on-chip memory requires bank conflict resolution, which is commonly achieved through row-padding.As the memory accessed by a workgroup grows exponentially with each à-trous iteration, the baseline schedule is unable to leverage scarce on-chip memory, leading state-of-the-art implementations [Nvidia Inc. 2023] to only exploit on-chip memory in the first iteration.Our method converts wavelets to an undilated filter with constant filter support across all iterations.However, even for the first iteration, mapping an overlapped tile to shared memory can be non trivial, if many features  are cached with row-padding.Therefore, in this section, we propose a bijection  shm to resolve bank conflicts without padding.In essence, we derive the shape of a sliding window, which is bank conflict-free for any memory offset within the shared memory resident tile (Figure 1(d)).We then use this shape to tile the subgroups within a workgroup.
The GCD-Test.The core idea of our method is to design a permutation  shm that avoids concurrent access to the same bank through the GCD-Test [Muchnick 1997], an alias analysis technique.To do so, we model the affine constraint imposed by bank conflicts explicitly by comparing the stride  of an affine memory access with the stride implied by the number of shared memory banks   in the system.Assuming successive words are stored in successive banks in a cyclic pattern, this implies  0 4 8 12 16 20 24 28 0 4 8 12 16 20 24 28 0 4 8 12 16 20 24 28 0 4 8 12 16 20 24 28 0 4 8 12 16 20 24 28   0  0 1 2 3 4 5 6 7  9  8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7  and ĉ is the number of shared memory banks after correcting for the size of the array elements stored in shared memory.We assume shared memory banks and array elements to be of equal size (  = ĉ ) and refer to the supplementary material for differing bit widths.Solving this linear Diophantine equation, for example using the extended Euclidean algorithm, gives us the greatest common divisor  = gcd(, −ĉ  ) = gcd(, ĉ ) and the coefficients describing the smallest positive non-trivial solution  Construction.Setting the stride  of the affine memory access to the width   of the shared memory region allows an analysis of bank conflicts within a column of a row-major shared memory layout.In this case,  gives the number of rows after which the periodic mapping between array indices and bank indices repeats.As a consequence, resolving bank conflicts in a region   ×   =   ×  resolves bank conflicts for shared memory footprint   ×   -effectively bounding the height of our analysis.We call this region, shown in fig.7(c), a bank pattern.At the same time,  gives the maximal height of a column 1 ×  that is bank conflict free.Analog to our discussion section 3.1, the greatest common divisor  induces a partition of these columns into disjunct affine subsets of shared memory banks, which is shown in fig.7(b).Consequently, reducing the problem from a two-dimensional conflict resolution problem to a one-dimensional problem over columns.Selecting an arbitrary column from each subset forms a bank conflict free access pattern for a subgroup.If adjacent columns are picked to form a subgroup of shape   ×   =  ×  as shown in fig.7(d), we call the region a bank pattern tile.
Properties.Bank pattern tiles are translation invariant and periodic at column boundaries as  divides   by construction.Rows are periodic if  divides   .We consider translation by one to construct a proof by induction.Shifting a bank pattern tile vertically by one, a row  will be replaced by a row (  ± )%  with an identical sequence of bank indices from a neighboring bank pattern.Shifting a bank pattern tile horizontally by one, a column  will be replaced by another column (  ± )%  from the same affine subset, which is backed by the same set of banks-albeit in a different sequence.We discuss normalization of banks within columns in the supplementary.
Limitations.As the bank pattern height may not divide the height of the workgroup   , reshaped subgroups within a workgroup may be unable to cover the workgroup shape.Any odd shared memory width   , for example, yields an elongated bank pattern 1 × ĉ .While for any stencil radius  , odd shared memory widths   =   + 2 may not occur for usual choices of   , we discuss strategies to alleviate these limitations by reshaping bank pattern tiles in the supplementary.

EVALUATION
We evaluate the effects of our scheduling method on reconstruction quality and execution efficiency.
In section 4.1 we discuss the effect of the non-semantics preserving treatment of image boundaries on reconstruction quality.In section 4.2 we evaluate the throughput of key GPU subsystems and its effect on execution time.(second column).This crosstalk between wavelets visually manifests as light leaking to the opposing side of the input image.In contrast, the proposed approximation of mirroring   (third column) automatically retrieves close-by filter taps from neighboring subimages, effectively converting the à-trous filter to a dense filter in boundary regions.More precisely, let  be a tiling of the input image  into 2  × 2  tiles, which contain one pixel from each subimage-except for partial tiles at boundaries if  is not divisible by 2  .Then all out of bounds memory taps fall into zero weight taps within the support of the à-trous kernel in  expanded to the tiles in  intersected by the support.This invariant holds for any positioning of wavelets  on a rectilinear of the embedding permutation   •   .This remapping of out-of-bounds memory accesses has two side effects.First, structural artifacts of the wavelet filter, especially prominent in the box filter (first and second row), are reduced along the axis orthogonal to the boundary, whereas the non-locality of the ascending embedding   benefits this smoothing (last column).Second, in the baseline, pixels close to the boundary have low contribution as the iterative filter is truncated at the first out-of-bounds memory access.Consequently, taps within bounds are only integrated in early iterations where all ancestor taps are within bounds.As   does not truncate the filter if a tap is out-of-bounds, filter taps may cross the boundary multiple times, preserving the correct weight for contributions from pixels within bounds.As a result, weight of pixels close to the boundary is increased for both   and   .

Performance Evaluation
Hardware & Software.We evaluate our method on an NVIDIA RTX 3070.Measurements are taken in stable power mode.Given the regularity of the variance between measurements is minor and consequently not reported.We evaluate our method using two à-trous algorithms.First, edge avoiding wavelets (EAW) [Hanika et al. 2011], which are designed for computational photography and denoise a color image without auxiliary buffers.Second, spatio-temporal variance guided filtering (SVGF) [Schied et al. 2017], which is a real-time denoising algorithm designed for removal of Monte Carlo noise.Our baseline implementation [Alber 2024] has slightly adapted edge stopping functions with depth gradients explicitly stored in the G-buffer and an elided Gaussian blur of luminance variance.We refer to the code provided in the supplementary material for details.Performance Characteristics.Figure 10 compares the throughput of hardware units and their cache hit rates in the baseline schedule to our proposed schedule.Figure 9 plots shader runtimes.We note that the runtime of the denoiser is independent of the framebuffer contents.
Independent of the iteration, the workload is memory-bounded.In the baseline schedule, the active working set of each workgroup increases with each iteration .Consequently, we observe the workload gradually shifting to increasingly distant units of the memory hierarchy.The workload is initially L1Tex limited.L2 bandwidth from L1 and L2 bandwidth to VRAM increase with each iteration and is saturated for  ≥ 4, at which point the workload is limited by L2.The increasing non-locality of the workload is also observable in the cache hit rates.L1 hit rates gradually decrease from 96% to 31%.Data is increasingly served from L2 with hit rates increasing to 96% in iteration  = 4.As a result of poor memory subsystem performance, multiprocessor throughput is low, further decreasing as latency increases.The most prominent stall reasons are long scoreboard and TEX throttle-indicating that a high number of texture fetches with high latency are inflight.If the decreasing performance of the memory subsystem manifests in runtime variation depends on the specific combination of à-trous filter and device.We especially observe increasing runtimes for iterations saturating L2 bandwidth for both evaluated algorithms on an NVIDIA RTX 3070.However, in other proprietary implementations, not further evaluated in this paper, we either observe a near-linear increase in runtime with each iteration; or a constant runtime as decompression of the G-buffer on the special function unit masks poor memory subsystem performance [Willberger et al. 2019].
In contrast, our method has near-constant throughput and cache hit rates for each filter iteration, resulting in an almost uniform shader runtime for each iteration.Thus, indicating the dynamically varying output pattern of our localization is not integral for performance.Comparing an implementation of our method without shared memory to the baseline, throughputs of the memory subsystem are near-identical to the first iteration of the baseline.SM throughput, however, is increased in our variants-27.2%for ascending order, 29.6% for local subdivision order, compared to 25.6%-17.4% in the baseline schedule.While our permuted filter iterations issue 15.8%, resp.8.1%, more instructions for subdivision, resp.ascending order, their runtime is near-identical to the baseline schedule, indicating overhead of our method is hidden by instruction-level parallelism.Memory localization without shared memory usage does not result in a speedup for the first iteration, only reducing runtime for later iterations with an approximate speedup of 1.5.Preloading to shared memory results in a speedup of 2.5 for the first iteration.The speedup over later iterations of the baseline is 3.8.We note that our method with shared memory enabled has lower L1Tex hit rates, as memory transactions after data preloading are served from shared memory.
The last SVGF iteration restores the original image layout.The non-locality of this write operation negatively impacts runtimes, increasing the runtime from 0.30ms to 0.46ms.Given the low multiplicity of write to read operations, we do however still observe a speedup of 2.19 over the baseline.The last iteration of EAW merges the result of all iterations.Our method delocalizes read operations in this upward pass negatively impacting runtimes-increasing the runtime of our shared method to 3.88ms compared to 1.05ms in the baseline.As a result, the overall speedup for the whole algorithm is reduced to 1.2.
Staged Decompression.Auxiliary features in the G-buffer are usually packed and encoded in a rendering engine specific format.Decoding this data has high arithmetic cost within the stencil loop-especially if data is non-linearly compressed.For SVGF, we decompress data once during preloading to shared memory, utilizing 40 byte per pixel to store the data of each neighbor.This accounts for 0.16ms of the reported runtime reduction, equating to an additional speedup of 1.24.
Portability.We report runtime speedups for other devices to demonstrate hardware portability.For SVGF, on a NVIDIA RTX 2070, the baseline has a runtime of 0.88 to 1.62ms.In contrast, our method has a constant runtime of 0.66ms-equivalent to a speedup of 1.33 to 2.42.For EAW, on an AMD Radeon RX 7900 XT with variable clock rates, we measure a runtime from 0.09 to 0.16ms for the baseline, 0.11 to 0.12ms for our method without shared memory and local subdivision, 0.068 to 0.071ms with shared memory.For the ascending variant, we measure 0.10 to 0.11ms without shared memory, 0.052 to 0.59ms with shared memory-equivalent to a speedup of 1.72 to 2.85.Delocalization in the upward pass results in a runtime of 0.575ms compared to 0.464ms in the baseline.

CONCLUSION
We introduced a scheduling technique to accelerate à-trous wavelet-based denoisers.Our optimization approach follows the general idea that on modern hardware with deep memory hierarchies, the arithmetic overhead of memory index computations can either be hidden by memory latency through instruction-level parallelism or amortized by the increased memory throughput.We derived a permutation of GPU invocation indices localizing image coordinates to coalesce global memory accesses and to enable shared memory usage.The permutation is loop-invariant avoiding arithmetic and control overhead within the stencil loop.We optionally eliminate boundary checks and improve neighborhood sampling at image boundaries without any overhead through an embedding of shifted decimated image signals into the original two-dimensional image domain.Since our method enables the usage of shared memory, a promising future research direction is to decouple shader dispatches from wavelet iterations.Given the exponentially growing filter support of à-trous wavelets, non-overlapped tiling [Bondhugula et al. 2016;Grosser et al. 2014Grosser et al. , 2013] ] seems especially promising.Our method improves the memory efficiency of undecimated à-trous wavelets independent of the hyperparameters of the denoiser, such as the number of wavelet levels and auxiliary features.Furthermore, our schedule has constant cost in each iteration, which we believe will help others to design efficient denoisers tailored to the needs of their rendering architecture.

Fig. 1 .
Fig. 1. (a+e) Our schedule for à-trous wavelets iteratively partitions the input image into four subimages to improve cache hit rates and consequently shader runtimes.(b) The baseline maintains the original pixel coordinates ,  causing neighboring threads to accesses a growing number of increasingly disjoint and dilated pixels.(c) In contrast, ours updates pixel coordinates to apply each iteration with an undilated kernel.Further, permitting shared memory caching in all iterations.(d) Subgroups within a workgroup are reshaped to rectangular regions.These are constructed in such a way that shifting them anywhere in the shared memory resident pixel tile does not result in a bank conflict.

Fig. 4 .
Fig. 4. Minimal implementation of our method for a filter radius  = 2 and workgroup size of 8 × 8, in which case our bank conflict resolution strategy simplifies to a transposition of the thread indices.The embedding is arranging subimages in ascending order on a uniform grid (Section 3.2).

Fig. 5 .
Fig. 5. (a) Spatial filtering schema of the one-dimensional à-trous wavelet transform for a single input pixel and filter radius  = 2.The number of sampled neighbors with non-zero coefficients (black) is constant in each level , while the dilation  increases.Arrows show data dependencies between levels .(b) Interpretation of an affine index access function  () =  + .The dilation , as shown here for a one dimensional à-trous wavelet in iteration  = 2, partitions space into  disjunct subsets.The offset o, resp.loop initialization, uniquely identifies each subset [o].Each chain of arrows represents a single subset, whereas subset [0] is highlighted in color.The cardinality of each subset differs by at most one if the length of the data set is not an integer multiple of .
(b)  shows an example for a one-dimensional stencil with dilation  = 4 as it occurs in level  = 2 of an à-trous wavelet.We only consider rectangular schedules, which apply the identical transformation to both orthogonal image signals along width  and height  of an image  , and can thus directly transfer observations for one-dimensional affine functions to the two-dimensional à-trous transform.We define[o]  = { |  = 2   + o,  ∈  } as the o = (  ,   )-th subimage at level  of the input image  .Thus,  = [(0, 0)] 0 .In a hierarchical application, each wavelet level  partitions the subimages further into even and odd signals along each axis.Therefore, level  consists out of  ×  = (2  ) × (2  ) subimages.Data dependencies between subimages [o]  decompose recursively.There are no data dependencies between two taps at level  ′ , if the taps were partitioned into different subimages in a previous iteration  <  ′ .Consequently, each subimage [o]  is independently schedulable, only being directly dependent on its parent subimage [⌊o/2⌋]  −1 .

𝑏 0
Fig.6.Example of our method for both subimage orderings (Section 3.2) with and without the boundary bijection (Section 3.4) on an 16 × 1 image with the maximal number of wavelet iterations 4. The bits of an pixel index   , ...,  2 ,  1 ,  0 are given with the most significant bit   (MSB) first.Mirroring ensures out of bound taps are close to the center tap, preventing visual light leaking.Representants o of each subimage are colored green.Shading indicates the orientation of the coordinate system of each subimage.Invocation indices indicate the mapping of the stencil centers to the pixel indices in each level-the invocation index reads the pixel in its column and writes it to the position as indicated by the order in the subsequent level below.

Fig. 7 .
Fig. 7. Overview of our padding-free bank-conflict resolution strategy.The example uses a shared memory region with width   = 10 and a component size of 128-bit in a system with 32 32-bit shared memory banks.The same graphic can also serve as the solution for a system with eight banks and a component size of 32-bit if each shown bank index is divided by four.(a) The core idea is to compare the stride of an affine column access with the stride of the bank to find the point at which both affine functions conflict and thus realign to repeat the mapping.(b) This induces an affine function partitioning banks into disjunct subsets.(c) Shows the affine access and shared memory of (a) in two-dimensional row-major order.(d) Shows (b) in two dimensions.(e) A mapping of invocation indices within a subgroup to any set of columns from different affine subsets resolves all bank conflicts.We call the resulting region a bank pattern tile if adjacent columns are picked to form a subgroup.

Figure 7
Figure 7(a) provides a visual intuition of the above equations with ĉ = 8 and  = 10.The coefficient  = 10/ = 5 gives the number of cycles through each bank required until the array accesses and the shared memory banks realign, and thus the point of the first bank conflict.Since the number of banks coincides with the subgroup size, this is equal to the number of subgroups.The coefficient  = 8/ = 4 gives the number of array cells accessed before the bank conflict occurs.

Fig. 8 .
Fig. 8. Effect of different boundary strategies in the two-dimensional embedding.The image size is 32 × 32.Even rows show the weight contribution of the neighboring pixels to the center pixel  = (16, 30).Difference images in odd rows encode deviations from the baseline.

Fig. 9 .
Fig. 9. Runtime of the baseline schedule and our proposed schedules.
Quality At Image BoundariesThe weighted contribution of neighboring pixels to a pixel in the boundary region for different boundary strategies is shown in fig.8.If out-of-bounds memory accesses are set to zero weight, our scheduling method preserves the semantics of the baseline schedule (first column).If out-ofbounds memory accesses are not prevented, a naive embedding   without checkerboarding of the coordinate axis   will retrieve filter taps from the opposite side of neighboring subimages Proc.ACM Comput.Graph.Interact.Tech., Vol. 7, No. 1, Article 15.Publication date: May 2024.