Online Neural Path Guiding with Normalized Anisotropic Spherical Gaussians

The variance reduction speed of physically-based rendering is heavily affected by the adopted importance sampling technique. In this paper we propose a novel online framework to learn the spatial-varying density model with a single small neural network using stochastic ray samples. To achieve this task, we propose a novel closed-form density model called the normalized anisotropic spherical gaussian mixture, that can express complex irradiance fields with a small number of parameters. Our framework learns the distribution in a progressive manner and does not need any warm-up phases. Due to the compact and expressive representation of our density model, our framework can be implemented entirely on the GPU, allowing it produce high quality images with limited computational resources.


INTRODUCTION
Unbiased physically-based rendering (PBR) is achieved by launching a light transport simulation to solve the rendering equation with Monte Carlo methods.Over the past few decades, unidirectional path tracing has become the dominant method of PBR in the film and design industries due to its flexibility and simplicity.To reduce the variance of the Monte Carlo integral efficiently, many importance sampling methods have been proposed (e.g., [Hart et al. 2020;Ureña et al. 2013;Veach 1998]).However, currently, only some components in the rendering equation (typically, the BSDF term, or direct lighting) can be importance-sampled.When indirect lighting is dominant, the sampling efficiency becomes poor.
Path guiding is a promising genre of importance sampling approaches to overcome this challenge [Müller et al. 2017;Vorba et al. 2014].A path guiding method usually learns distributions over the 3D scene that fit the rendering equation more closely, either during the rendering process (online) or with precomputation (offline).Then, the path tracer can sample the scattering directions using the learned distribution to reduce the variance.Previous approaches [Müller et al. 2017;Vorba et al. 2014] have partitioned the 3D space and approximated the incident radiance of each zone using statistical methods (typically, histogramming or expectationmaximization) instead of modeling the whole product of the rendering equation for each shading point; consequently, they could only learn one distribution for each spatial partition and usually suffered from the parallax issue.
Recently, neural networks are starting to be used for learning spatially varying, per-shading-point product distribution of the rendering equation [Müller et al. 2020;Zhu et al. 2021a,b].Ideally, an explicit model that directly fits the continuous product distribution over 3D space is desired; however, this has been difficult due to the lack of (1) a representation that can reconstruct both high-and low-frequency features with a small number of parameters and (2) a closed-form density model that can provide an analytical solution for integrating the distribution for normalization.As a result, existing models either (1) learn an implicit model that only provides a mapping between a uniform distribution and the shading-point product distribution, but with expensive computation [Müller et al. 2020]; (2) learn a coarse, low-resolution distribution with limited accuracy [Zhu et al. 2021a]; or (3) require costly offline precomputation [Zhu et al. 2021b].
In this paper, we propose a novel, neural-network-based path guiding framework that learns an explicit spatial-varying distribution of scattered radiance over the surfaces in a 3D scene online.Using the shading point's 3D position and auxiliary information as input, our network estimates the distribution of full scattered radiance product, which can be used for efficient importance sampling of light transport.Our key insight is a novel closed-form density representation, namely the Normalized Anisotropic Spherical Gaussian mixture (NASG).The NASG is a 5-parameter anisotropic distribution model that comes with an easy-to-compute integral and a closed-form sampling algorithm, making it more feasible for Monte Carlo rendering compared to Kent distribution.We also show that normalization is an important factor for successful learning of scattered radiance distribution with sparse online training samples.The simplicity and expressiveness of NASG, along with its analytical normalization formula, lead to successful online learning of the complex radiance distribution via a tiny multilayer perceptron (MLP).
Our framework is robust enough to handle a variety of lighting setups, ranging from normal indirectly illuminated scenes to caustics.We also propose the corresponding training workflow, including specialized loss function and online acquisition of training samples.Because our network can be easily stored and executed in parallel on a GPU, we are able to integrate it with a wavefront path tracer.In this way, we can implement a high-performance, neural-guided, unidirectional path tracer on a single GPU, making neural guiding affordable for conventional personal computers.Our framework outperforms existing neural path guiding approaches both in terms of sampling efficiency and raw performance, and it provides comparable or even better performance than state-of-the-art statistical methods.
The contributions of this paper can be summarized as • Normalized Anisotropic Spherical Gaussian mixture, a novel density model that can effectively represent spatial-varying distribution for path tracing, and • A novel, lightweight, online neural path guiding framework that can be integrated with either CPU-or GPU-based conventional production path tracers.

RELATED WORK 2.1 Physically-based Rendering and Path Guiding
The radiance of a shading point can be calculated using the rendering equation [Kajiya 1986]: which consists of the emitted radiance   (x) at the point and the integral over the sphere 2 , which aggregates the contributions of the incident radiance   (x,   ) from all directions   .For each   , the product of incident radiance, the bidirectional scattering distribution function (BSDF)   (x,   ,   ), and the geometry term | cos   | signify its contribution to the outgoing radiance.In 3D scenes, due to light bouncing off surfaces, the incoming radiance   can be considered as originating from another point on a different surface.This implies that   also follows the integral form described by the rendering equation Eq. ( 1) and needs to be evaluated at the point from which the scattered light originates.This makes the problem highly complex, and no analytical solution can be formed in general.
Path tracers solve the rendering equation via the Monte Carlo method, which draws random direction samples  ′  , and the average of samples is the estimation of the integral: where  ( ′  ) is the probability density function (PDF) from which the integrator samples at point x.Due to the complexity of light transport over the 3D space, the variance of samples is high, and many importance sampling methods have been proposed to reduce this variance (e.g., [Conty Estevez and Kulla 2018;Hanika et al. 2015;Veach 1998]).These methods usually focus on importance sampling of single components of the rendering equation.Here, a universal importance sampling technique for the whole product of the rendering equation can help further improve sampling efficiency.Path guiding, which we describe next, is a genre of methods developed along this direction.
Path guiding methods aim to find sufficient approximation of the incident radiance distribution, which serves to achieve better importance sampling that follows the global illumination distribution rather than local BSDF distribution.Jensen [1995] and Lafortune et al. [1995] first proposed to fit the incident light distribution for more efficient indirect lighting sampling.Vorba et al. [2014] fit a Gaussian mixture to model the distribution estimated by an explicit photon tracing pass.Müller et al. [2019;2017] proposed the "Practical Path Guiding (PPG)" algorithm to model the distribution using an SD-tree approach.Based on these techniques, full product guiding methods have also been proposed.Herholz et al. [2016] calculated an approximation of full product on top of an earlier work [Vorba et al. 2014], and Diolatzis et al. [2020] calculated an approximation on top of another study [Müller et al. 2017] to achieve product guiding.Both of these methods operate at the cost of higher overhead as the approximated product needs to be calculated multiple times from the corresponding learned incident radiance distribution.The above techniques share the same idea of partitioning the space and progressively learning discrete distributions.Each discrete distribution is shared among points within the spatial partition.A major issue of this approach is parallax error: the above techniques fail to accurately learn the distribution for close-distance incident radiance where the incoming light quickly changes within the same spatial partition.Ruppert et al. [2020] proposed a parallax-aware robust fitting method to address this issue with a discrete approach.We compared their approach with our network-based approach in § 7.3.Recent research shows that good path guiding requires a proper blending weight between BSDF and product-driven distributions [Müller et al. 2020], and the learned distribution should be variance-aware since the samples are not zero-variance [Rath et al. 2020].These findings can help to achieve a more robust path guiding framework.
Recently, neural networks have been used to model scattered radiance distribution; both online and offline learning methods have been proposed.Zhu et al. [2021b] used neural networks to estimate a quad-tree representation of incident radiance distribution using nearest photons as input.Currius et al. [2020] used convolutional neural networks (CNN) to estimate incident radiance represented by spherical Gaussians; this work is for real-time rendering, but the estimated radiance distribution could be used for path guiding.Zhu et al. [2021a] applied an offline-trained neural network to efficiently sample complex scenes with lamps.This technique requires training a U-Net for more than 10 hours for just a single light source, while the estimated distribution is only a 16 × 16 2D map, which is not sufficient for representing general indirect lighting distributions.A much higher resolution is required for robustly guiding over the entire 3D scene; for instance, PPG's quad-tree approach can represent a resolution of 2 16 × 2 16 ).These methods are categorized as offline learning, since they require training a network offline with massive training samples using ground truth distributions.
Our framework, however, is categorized as online learning, which learns the distribution on the fly, without a ground truth distribution as reference.Previously, Müller et al. [2020] adopted normalizing flow [Kobyzev et al. 2021] to model the full product of incident radiance distribution.However, with an implicit density model like normalizing flow, each sample/density evaluation requires a full forward pass of multiple neural networks, which introduces heavy computation costs.In a modern path tracer with multiple importance sampling (typically, BSDF sampling and next event estimation (NEE)), this means full forward pass needs to be executed for each surface sample at least two times.Moreover, the training process requires dense usage of differentiable transforms, which makes the training slower than that for regular neural networks.Indeed, Müller et al. [2020] used two GPUs specifically for the normalizing flow's neural network computation along with the CPU-based path tracing implementation; nevertheless, the sampling speed was still only 1/4 of PPG.Our work, in contrast, proposes the use of an explicitly parameterized density model in closed-form that can be learned using a small MLP.The neural network is used to generate the closedform distribution model, rather than actual samples; therefore, we are able to freely generate samples or evaluate the density after a single forward pass.With careful implementation, we show that our method can reach a similar sampling rate to that of PPG, while the result has lower variance.In research that is parallel with our work, Dong et al. [2023] used a small neural network to estimate per-shading-point distribution, which is similar to our approach.However, we also propose the use of NASG, a novel anisotropic model, in place of classic von Mises-Fisher distribution.NASG helps to further improve the fitting accuracy and guiding efficiency.We highlight the benefits through experiments in § 7.

GPU Path Tracing
GPU path tracing has gained significant attention in recent years due to its high performance, supported by the GPU's ability to concurrently render many pixels.While many recent research works have focused on constructing real-time path tracers [Bitterli et al. 2020;Chaitanya et al. 2017;Lin et al. 2022;Müller et al. 2021;Ouyang et al. 2021]), GPU-based production renderers leverage GPU to render fully converged, noise-free results faster [Maxon 2023;OTOY 2023].Compared to CPU architecture, GPU requires a sophisticated programming design for better concurrency, which is needed for overcoming the large memory latency of GPU.One of the general idea of this is ray re-ordering [Lü et al. 2017;Meister et al. 2020].Laine et al. [2013] proposed a new architecture that splits full path tracing computation into stages (small kernels) and processes threads in the same stage to maximize performance.Recently, Zheng et al. [2022] proposed a more sophisticated framework for GPU path tracing, which leverages run-time compilation for high performance and flexible implementation.While several GPU-based renderer products are available and it has been proven that path guiding can improve sampling efficiency, few products have provided GPUbased path guiding implementation.This is because existing path guiding approaches are designed for CPUs, and specialization for GPUs only emerged recently at the research level [Dittebrandt et al. 2020;Pantaleoni 2020].Our work provides a low-cost continuous path guiding option for GPU-based path tracers.

Density Models
Parametric density models have been extensively explored in statistics.Exponential distribution is a representative family, among which Gaussian mixture is the most widely used density model.Vorba et al. [2014] used a 2D Gaussian mixture to model incident radiance distribution; however, since the domain of a 2D Gaussian is the entire 2D plane, when mapping it to unit sphere, it is necessary to discard samples outside the domain, which affects computation efficiency.Dodik et al. [2022] further proposed using 5D Gaussians to model incident radiance distribution over the space.By using a tangent space formulation, it greatly reduced the number of discarded samples.
In the context of physically-based rendering, spherical models have been widely leveraged.The spherical Gaussian (SG) is widely used for radiance representation and density modelling (e.g., [Wang et al. 2009]).Actually, a normalized SG is equivalent to the von Mises-Fisher (vMF) distribution in 3D.It has several desirable properties, such as computational efficiency and analytical tractability for integrals.However, its expressiveness is limited due to its isotropic nature, which restricts the shape of the distribution on the sphere.The Kent distribution [1982] presents a possible remedy by generalizing the 3D vMF model to a five-parameter anisotopic spherical exponential model, providing higher expressiveness.However, its application within our framework is impeded by three significant limitations: high computational cost, the absence of a direct sampling method, and issues with numerical precision, particularly when the parameters for controlling concentration are high.These limitations are elaborated upon in § 4.1.
Other recent works proposed spherical models specifically for graphics.Xu et al. [2013] proposed an anisotropic spherical Gaussian model to achieve a larger variety of shapes; however, no analytic solution was found for its integral, which is problematic for normalization.Heitz et al. [2016] proposed another anisotripic model with a closed-form expression called linearly transformed cosines (LTC).However, the integral of LTC requires computation of the inverse matrix, and one component requires in total 12 scalars to parameterize a component.
Another family of density models widely adopted in the rendering context is the polynomial family, among which spherical harmonics is the most commonly used model.Polynomial models have been extensively used in graphics [Moon et al. 2016;Sloan et al. 2002].They can be easily mapped to unit spheres to represent spherical distributions.However, polynomial models are generally limited to capturing low-frequency distributions.To represent high-frequency distributions, a very high degree is required, leading to a significant increase in the number of parameters.Furthermore, there is currently no efficient approach to directly sample a polynomial distribution, although there does exist some relatively expensive importance sampling schemes (e.g., [Jarosz et al. 2009]).These were the main challenges in our pilot study that made us give up the idea of adopting them for our path guiding scheme.
Learning-based density models have been drawing more attention recently [Gilboa et al. 2021;Müller et al. 2020].The model based on normalizing flow can successfully learn complex distributions [Müller et al. 2020] but suffers from the implicitness and heavy computation mentioned earlier.Marginalizable Density Model Approximation (MDMA) [Gilboa et al. 2021] has been proposed as a closed-form learning-based density model, which is essentially a linear combination of multiple sets of 1D distributions.MDMA showed that closed-form normalization is an essential factor in learning density models, and inspired by their work, we propose such a model for our application.

OVERVIEW
The goal of our work is to model the distribution of the product of incident light over the unit sphere for every unique shading point x so that the product can be fully importance-sampled (see § 4 for details of our density model).In other words, we map the shading point's 3D position and auxiliary data to a parameterized spherical distribution.
As shown in Fig. 2, our framework learns the distribution online, in a progressive manner.In every rendering iteration, the image is rendered at one sample per pixel, and a subset of the pixel samples are collected to train an MLP (see § 6 for details of collecting strategy).After the image is rendered and training samples are collected, we train the MLP using an optimizer implemented with Libtorch (C++ frontend of Pytorch [2017]) as described in § 5.When rendering an image, every time a ray hits a shading point x, our framework uses a learned MLP (whose weights are updated every  iterations as described in § 6) to estimate the product distribution, and it draws the sample from either the BSDF distribution or the learned one, based on the learned selection probability.

NORMALIZED ANISOTROPIC SPHERICAL GAUSSIAN MIXTURE
In this section, we describe our proposed density model we call the Normalized Anisotropic Spherical Gaussian mixture (NASG), which allows us to efficiently learn the light distribution at each shading point.We first review the existing density models that inspired our model in § 4.1.We then describe our model itself in § 4.2.

Background
One of the main requirements for progressive learning of a distribution with neural networks is that models be easily normalizable, implying that a model must have a closed-form integral.
Marginalizable Density Model Approximation.Gilboa et al. [2021] proposed Marginable Density Model Approximation (MDMA), which can be effectively optimized via deep learning.A bivariate dimensional distribution can be modeled as where  1 and  2 are 1D normalized distributions and    are normalized coefficients that sum to 1.
In practice, the 1D distributions in MDMA can be piece-wise linear or piece-wise quadratic spline models; however, with its limited number of components, MDMA can learn only a coarse distribution.Therefore, it always gives poor results for distributions with high-frequency spots.Despite the limited accuracy, MDMA works well in learning distributions from sparse noisy samples, compared with non-normalizable models (e.g., polynomial model, spherical harmonics).During training, samples with high energy can lower the contribution of other areas, and eventually the training converges.This feature is crucial in learning distributions from sparse training samples.
Normalized Gaussian Mixtures.Normalized Gaussian mixtures can address the accuracy limitations of MDMA: where   (x;   ) are Gaussian distributions parameterized by   ,   are normalizing factors, and   are normalized weights of Gaussians such that    = 1.Gaussian mixtures are highly expressive and also easily normalizable: we look into two common Gaussian distributions that can represent the spherical distributions needed for modeling the lighting: spherical Gaussian and anisotropic spherical Gaussian.
Spherical Gaussian.Spherical Gaussian (SG) is a variant of the univariate Gaussian function defined in the spherical domain: where v is a unit vector specifying the direction,  is the lobe axis, and  is a parameter that controls the "sharpness" of the distribution.
The normalizing term of an SG can be computed by Normalized SG is equivalent to the vMF distribution in 3D.Since SG is defined in the spherical domain and is suitable for representing allfrequency light distribution at each sample point, mixture models based on SG have been applied in representing the light map used for precomputed radiance transfer (e.g., [Currius et al. 2020;Wang et al. 2009]).However, SG is an isotropic univariate model, and thus less expressive; for example, it cannot represent anisotropic distributions.
Anisotropic Spherical Gaussian.To model anisotropic distributions, Xu et al. [2013] proposed Anisotropic Spherical Gaussian (ASG) to model the light map.ASG can be written as where , , and  are the lobe, tangent, and bi-tangent axes, respectively, and [, , ] form an orthonormal frame;  and  are the bandwidths for the and -axes, respectively, and  is the lobe amplitude.Xu et al. [2013] successfully modeled complex lighting conditions and rendered anisotropic metal dishes using ASG.On the other hand, ASG does not have a closed-form solution for the integral, which inhibits its usage for our purpose.
Kent Distribution.Another candidate is the Kent distribution [Kent 1982], which is also an anisotropic spherical distribution: where ,  are parameters that control anisotropic concentration with 0 < 2 < , and [, , ] form an orthonormal frame in the same way as ASG.The Kent distribution has a closed-form integral, i.e., its normalizing constant: where Γ(•) is the gamma function, and   () is the modified Bessel function of the first kind.However, since this integral entails the evaluation of a nested infinite series (since   () is also an infinite series), sufficient precision necessitates the computation of a significant number of terms, thus imposing a substantial computational overhead.In the context of computer science, the precision of the Kent distribution is further constrained by the limitations on floating-point number precision.Specifically, when the value of  is large (e.g.,  > 20), both the non-normalized density and the normalizing constant approach the representational boundary of floating-point numbers, resulting in a lack of precision.This leads to either a biased result or disrupted learning (where an NaN loss is generated and corrupts the learning process).Within the realm of Monte Carlo rendering, an additional limitation emerges, since Kent distribution lacks a direct sampling method, which necessitates the use of a more computationally expensive rejection method.

Normalized Anisotropic Spherical Gaussian
Inspired by the models in § 4.1, we believe a spherical anisotropic exponential distribution can help to achieve more accurate results in our framework.While the Kent distribution exists as a potential candidate, its inherent limitations -namely, the necessity of approximating its integral, the precision issue, and the absence of a direct sampling algorithm -prompt us to develop our own distribution, which we call the Normalized Anisotropic Spherical Gaussian: ) where [, , ] forms an orthogonal frame as above,  is sharpness, and  controls the eccentricity.Fig. 3 shows examples of the NASG component with different parameters.
The derivation of NASG proceeds as follows.First, we employ the standard spherical coordinate expression of v ≠ ± in terms of the polar angle  and the azimuthal angle  (with respect to the orthonormal frame [, , ]) to obtain the equation which introduces the anisotropic nature of the distribution through the exponent  cos2 .Next, we effect a change of variables in the integral of SG (cf.Eq. ( 6)) so that the Jacobian in Eq. ( 11) emerges as part of the corresponding Jacobian.In this way, we arrive at NASG, which satisfies all requirements as a model for learning the distribution needed for unbiased rendering.Different from ASG [Xu et al. 2013], NASG has an analytical closed-form solution for its integral (see § B), which makes it easy to normalize: NASG is also expressive, since it can model anisotropic distributions, as well as numerically stable to represent high-and lowfrequency distributions accurately 1 .It is also compact because it is parameterized by only 7 scalars, making it highly efficient in GPU-based computation due to its low bandwidth requirement.Additionally, the sampling algorithm of NASG is remarkably efficient (see § C).These characteristics make NASG a highly feasible and practical option for our framework.
1 NASG is not continuous at v = − when  > 0 in this form.Indeed, it approaches 0 if we let v tend to − along any meridian, except for the ones passing through ± (i.e.,  =  /2 and  = 3 /2), in which case NASG approaches exp(−2) > 0. However, this discontinuity does not affect our application, and it can be resolved easily by introducing an auxiliary parameter (see § A for details)

ONLINE LEARNING OF DENSITY MODEL
We now describe our neural network structure for learning the parameters of our NASG model, as well as the training scheme.

Network Architecture
We intend our neural network to be as simple as possible, considering the performance requirements for practical use.In this work, we use a 4-layer MLP, where each layer has 128 units without bias.
Network Inputs.The input to the model consists of the location of the shading point x, outgoing ray direction   , and surface normal n.The location of the shading point is first encoded into a 57 dimensional vector by positional encoding.Thus, the input size is 63 (57 + 3 + 3), but is padded to 64 for hardware acceleration purposes 2 .We adopt the simple one-blob encoding [Müller et al. 2019] instead of complex encoding mechanics, such as Multiresolution Hash Encoding [Müller et al. 2022], because there was very little improvement in our pilot study to justify the overhead such mechanics introduce.Details of the encoding scheme can be found in Tab. 1.
Network Outputs.The output of our neural network is a  dimensional vector , where each component corresponds to the NASG parameter at the shading point x and a selection probability .Assuming the number of mixture components is  ,  = 5 ×  + 2 ×  +  + 1 = 8 + 1 (see Tab. 2).These parameters are further decoded to represent the actual parameters as summarized in Tab. 2. The mixture weights of NASG A are normalized with a softmax function.

Training
We now describe our loss function and the training process.As shown in Fig. 4, we utilize automatic differentiation to train the network effectively with sparse online rendering samples.Since the training data are noisy online samples, we need a loss function that is more robust than regular L1 or L2 losses.
Kullback-Leibler Divergence.The design of a loss function for distribution learning in PBR context has been studied in several previous works [Müller et al. 2019[Müller et al. , 2020;;Zhu et al. 2021a,b].By using (  ; ) to denote the NASG distribution (with NASG parameters ) estimated by the neural network, to achieve importance sampling of the rendering equation, the optimal distribution should be proportional to the product: Therefore, we use  (  ) =    (x,   ,   )  (x,   )| cos   | as our target distribution, where  is a normalizing term whose value is unknown.We can still train the system without knowing  as shown below.
In our framework, we use the Kullback-Leibler Divergence (KL-Divergence) to represent the likelihood between our estimated distribution (  ; ) and the target one: Then, the integral can be replaced with our one-sample estimation: where q(  ; ) denotes the PDF of the distribution the sample obeys during rendering (see below).Although the right-hand side still involves the global scaling factor  , it is canceled in moment-based optimizers such as Adam [Kingma and Ba 2014]; accordingly, we can train the system even though there is an unknown scaling factor in  (  ) [Müller et al. 2019].
Selection Probability.Following [Müller et al. 2020], we do not directly use a distribution estimated by our neural network.Instead, we regard our scattering sampling process as an MIS process that blends the learned distribution and the BSDF distribution    (  ).
As we also learn a selection probability , as described in § 5.1, the MIS PDF at shading point x is q(  ; ) = (  ; ) + (1 − )   (  ).However, optimizing q(  ; ) naively leads to falling into local optima with degenerate selection probability; therefore, our final loss function blends q(  ; ) with (  ; ): where  is a fixed fraction that we set to 0.2.

IMPLEMENTATION DETAILS
We integrate our framework into Clight, our in-house wavefront GPU path tracer for production [Void Dimensions 2024].In this section we discuss implementation details as well as design considerations.
Progressive Learning and Sampling.Our framework collects samples to train the network online, and we use the network to generate a distribution that is then blended with the BSDF sampling distribution using the learned selection probability .We further multiply an extra coefficient  to compute an actual blending weight  ′ = , where  is initially set to 0 and gradually increased to 1, such that the framework initially relies on the BSDF distribution but gradually switches to the learned distribution q(  ; ).We use a fixed-step strategy to ensure that our guided sampling uses a sufficient number of samples for learning the distribution: for every  images rendered, we increase  by 1  .In our implementation, we set  = 4 and  = 64.This could require adjustments according to the rendering task, and further discussions are given in § 8.3.
For rendering images while progressively learning the distribution, previous works [Müller et al. 2017[Müller et al. , 2019[Müller et al. , 2020] ] blended the rendered image samples based on their inverse variance.Such a process introduces extra memory and computation overhead, especially for a renderer that produces arbitrary output variables (AOVs).Based on the observation that the variance of individual image samples decreases as the learning proceeds, we use a simplified method that scales the weights of accumulated results according to the training steps.This essentially gives more weight to later samples in a progressive manner.The weight scaling process stops when the training step count reaches  × .Since the rendering result at each step can be seen as an unbiased estimation of the ground truth, and the weighted average can be seen as an MIS process with different sampling techniques (in this case, different distributions), the result remains unbiased.
Network Optimization.The process of distribution learning runs on GPU after every rendering iteration.We use Libtorch, the C++ frontend of Pytorch [Paszke et al. 2017], for the implementation, along with the Adam optimizer with its learning rate set to 0.002.When rendering an image, we split the image into smaller tiles of size  × ( is a real number, and  ≥ 1) and collect data from one pixel per tile in each render iteration.After the iteration of rendering, we update  based on collected sample size  and maximum size  as  = max(1,  √︃   ); consequently, after rendering the full image plane, we can get an  close to .Note that if  exceeds , we only record and keep  samples for training.
After each render iteration, the collected data are directly used for progressively training the model, and the training step is determined by  =  • ⌈   ⌉ so that every sample will be trained at least  times.Larger  and  help to improve the learning accuracy, but the training overhead introduces a significant trade-off.Here, we set the following parameters: maximum size of training samples in each iteration  = 2 16 , training batch size  = 2 12 , and step factor  = 1.The differences in the results obtained with different configurations are revealed by experiment in § 7.2.
Direct Lighting with NEE.Our implementation uses MIS to blend guided scattering directions with Next Event Estimation (NEE) to sample direct lighting.NEE helps to improve the quality of the images, especially at the beginning of the training stage.When recording radiance to a training sample, we use the MIS-weighted value.It should be noted that MIS is only practical because our model is an explicit model that allows us to freely sample the distribution and evaluate PDF in a certain direction, which could be computationally expensive for implicit models such as [Müller et al. 2019[Müller et al. , 2020]].
Wavefront Architecture.Neural network computation can be greatly accelerated with hardware (e.g., Tensor Core).However, such hardware is usually designed for batched execution, in which a single matrix multiplication involves multiple threads.This conflicts with the classic thread-independent "Mega-kernel" path tracing implementation.Our implementation adopts "Wavefront path tracing" architecture [Laine et al. 2013] to solve this problem.In a wavefront path tracer, all of the pixels at the same stage are executed concurrently.Under this condition, we are able to "insert" our neural network calculation seamlessly right before the sampling/shading stage and minimize the computation overhead for best performance.The forward pass of our MLP does not require gradient calculation, so we implement it on the GPU with CUBLAS, and the resulting implementation is substantially faster than when using existing deep learning frameworks.Moreover, this implementation achieves a low overhead that makes our framework practical with conventional hardware, the performance of which we evaluate in § 7.2.

EVALUATION
We evaluated our framework in three ways.First, we integrated our framework into our in-house GPU production renderer.We evaluated its variance reduction efficiency by comparing it to that of plain path tracing.To demonstrate the benefits of our NASG and product guiding, we compared the results rendered with incident radiance guiding and other density models.Second, we evaluated its raw performance through a comparison with the results produced by path tracing in the same amount of time; here, our method was executed multiple times under different meta-parameter configurations.Finally, we implemented our framework in Mitsuba and compared the results produced with the same number of samples by several state-of-the-art methods, including Practical Path Guiding (PPG, [Müller 2019]), Robust Fitting of Parallax-Aware Mixtures for Path Guiding (PAVMM, [Ruppert et al. 2020]), and Neural Importance Sampling (NIS, [Müller et al. 2019]).We used mean absolute percentage error (MAPE) for quality measurement.MAPE is calculated by dividing the absolute error by the value of the ground truth.We added a small  = 0.01 to the denominator to avoid divisions by zero.When calculating the MAPE from all of the rendering results, we discarded 0.1% of the pixels with the highest error before computing the average.

Variance Reduction
We implemented our framework on top of our in-house GPU production renderer and evaluated the quality of the images by comparing them with those produced by plain path tracing using the same number of samples per pixel (SPP).We implemented both the sampling algorithm and the optimizer on the GPU, leaving the CPU to only perform scheduling tasks.We also compared different guiding schemes, i.e., guiding with incident radiance distribution, cosine-weighted radiance distribution, and full product distribution.In addition, we compared NASG with SG, MDMA, 2D Gaussians, and Kent distribution.Here, we used 8-component NASG (64 parameters).For other density models, we adjusted the number of parameters of each Gaussian to match that of NASG: SG uses 14 components (70 parameters), 2D Gaussians uses 12 components (72 parameters), Kent distribution uses 8 components (64 parameters), and MDMA uses 8 1D distributions for both dimensions (80 parameters).For Kent distribution, even when calculations are performed in the logarithmic space to mitigate the precision issues, we found it necessary to restrict the range of  to 0 <  < 20, and accordingly constrained 0 < 2 <  to prevent biased results or unstable learning.
We run our GPU path tracer on a conventional PC with a 4core i7 9700K CPU and an Nvidia 2080ti GPU.The implementation of our framework introduces roughly 600 MB memory overhead on the GPU, mainly for buffering batched network execution.Our Pytorch optimizer takes another 600 MB of GPU memory to cache tensors and gradients.The test 3D scenes are shown in Fig. 5, and we render the scenes with NEE enabled.All of the rendered images are included in the supplemental material.Tab. 3 shows the results of the quantitative evaluation.
Learned distribution.Our framework can easily learn a full product.In all of the test scenes, guiding with full product distribution gives lower variance.Guiding with full product helps to effectively reduce variance when rendering smooth surfaces.This is because learning full product not only improves the accuracy of the learned distribution but also helps to learn an optimal selection probability.The difference is more obvious in scenes such as Ajar and Corridor, where the floors are large surfaces with low roughness.When guiding with incident radiance distribution, the lack of BSDF distribution approximation results in many more outliers, as shown in Fig. 6 (a).
Distribution Models.Compared to traditional distribution models such as spherical Gaussians, our NASG model is more expressive and achieves lower variance in most scenes, and this is especially the case for scenes with anisotropic lighting conditions, such as Table 3. Errors of the results from variance reduction experiment, reported in MAPE.We rendered a variety of 3D scenes at 1024 samples per pixel (SPP), guided with our framework and different learned distributions.Scenes were rendered in different resolutions for a larger variety.We compare the learning distribution of incident radiance using NASG (NASG Li) with both the cosine-weighted incident radiance (+cos ) and the full product distribution.We also compare the learning distribution of full product with different density models, i.e., spherical Gaussians (SG), MDMA (MDMA), 2D Gaussians (G2D), and Kent distribution (KENT).The time cost of each method is provided for completeness.reconstructed when using Kent distribution.In addition the rendered images, several learned distributions with different models are compared in Fig. 7.

Performance
An importance sampling method is more practical when it achieves a lower level of error with the same time budget.We conducted an experiment to compare the equal-time performance of our framework with different configurations, as well as with plain path tracing, using the same set of test scenes as in our previous experiments.
There are several meta-parameters that can influence performance and accuracy, including the maximum sample size (), number of hidden units (HU), and number of components (C).In our default configuration, we set  = 2 16 , HU = 128, and C = 8.We show the results under different configurations in Tab. 4. In general, the default configuration achieves a good balance between learning accuracy and sampling speed: it achieves the lowest MAPE in four scenes out of eight.Even in the scenes where it does not perform the best, its MAPE is close to the best results.With 32 components (C=32), the result of Box has the lowest MAPE, while the sample count is only 58% of our default configuration.Actually in a simple scene such as Box, where performance is not a bottleneck and all methods achieve significant number of samples, a more precise distribution can be learned with more components, resulting in a lower error.However, in complex scenes where higher variance is introduced by multiple light sources or complex materials, sufficient samples are required for convergence for Monte Carlo rendering.With C=32, sample count is low under limited computation budget, resulting in larger error.
Neural network overhead.In the experiment we measured the time cost of three parts of computation: forward execution for every shading point, training of the neural network, and path tracing.While changing through scenes, we found that typically 15% of the time is used for forward execution, 35% for training, and 50% for path tracing.The training overhead is almost constant among different image resolutions, which makes our framework relatively costly when rendering small images.For example, in the Bidir scene, which uses a small 512 × 512 resolution, the ratio of neural network overhead appears to be larger.From our experience, increasing the training steps (i.e., more training batches in each iteration) leads to higher accuracy in the learned distribution, but with significant overhead.We conjecture that using meta-learning to accelerate the training process would be a promising direction, and leave this as a future work.
Network scale.Intuitively, a larger network, with either more weight parameters or more components, could achieve higher accuracy.However, as neural network computation is the primary overhead in our framework and complex scenes require a significant sample count to converge, it is still worthwhile to trade off accuracy for computation overhead given our limited time budget.

State-of-the-Art Comparison
To compare our framework with various state-of-the-art path guiding techniques, we implemented it on the Mitsuba renderer.In our Mitsuba implementation, we set  = 2 18 and  = 4 to better show the potential of our framework.The test scenes are shown in Fig. 8, and we rendered all of the images in this experiment with two Intel Xeon E5-2680v2 20-thread CPUs (40 threads in total).We compared The thin curve distribution in the Box scene is difficult to fit with isotropic models: the approximated curve is much thicker than the reference.In contrast, our NASG effectively preserves the thin nature of the curve, achieving lower KL-divergence.The Kent distribution fails to learn sharp distributions due the limited range of .However, raising the limit leads to a biased result or disrupted learning.
Table 4. Equal-time comparison results, using a fixed 300-second budget.We compare the rendering results of plain path tracing (PT) with our framework in its default configuration, as well as several different configurations: 4 components (C=4), 16 components (C=16), 32 components (C=32), 64 hidden units per network layer (HU=64), 256 hidden units per network layer (HU=256), and a larger sample size (S=2 18 ).the results of our framework with the authors' implementations of PPG [Müller 2019] and PAVMM [Ruppert et al. 2020], along with our implementation of NIS [Müller et al. 2019].For all techniques, we enabled NEE.To render with PPG, we disabled inverse-variance-based weighting and selection probability optimizaton, while enabling tree splatting using default configuration.For PAVMM, we used BSDF product sampling when available and cosine weight sampling otherwise (the authors' implementation requires specific scene structure, which conflicts with scene setups), and we set maxSamplesPerLeafNode to 16k, with all other options left as default.We implemented NIS following the original paper; based on our framework, we replaced the sampling-density evaluation component with 2-layer quadratic linear normalizing flow, and we added another MLP to learn selection probability.We used the encoded input of our NASG framework as extra features for the transform layer, and the feature dimensions are encoded with one-blob encoding (we used 64 bins).The training sample size and training step's configuration are aligned with our framework.The major difference of our implementation from that in the original paper is that we only use lightweight MLPs, at the same scale used for our framework, rather than ResNets.The source code of our framework and our NIS implementation are given in the supplemental material.The rendering speed of of a CPU-based neural path guiding implementation is strongly affected by the actual implementation, so we mainly compared equal-sample results rather than raw performance.

Variance reduction of guiding method.
To objectively evaluate the efficiency of different guiding algorithms, we first disabled orthogonal features and rendered scenes with our framework and other methods.To ensure a fair comparison, we used the same sampling parameters in all of the methods.The selection probability is set to the fixed fraction 0.5 for all methods.All of the methods have a 128-sample training budget and 384 samples for rendering (512 samples in total).We achieved this by clearing the rendering accumulation after the 128th sample, and we accumulated the results for the rest of the samples.
Fig. 9 reports statistical results and Fig. 10 shows the convergence behavior during the experiment.From the results, we can observe that each previous work has both pros and cons; consequently, one technique may perform very well in one kind of scene but poor in others (details discussed below).Our framework, on the other hand, while achieving overall low error also provides a more balanced performance in all test scenes.
Comparison with PPG.PPG uses quad-tree for directional representation, which can closely model distributions with multiple high-energy points.Hence, in the Torus scene it achieves low variance, while PAVMM struggles due to the limited components used.However, in the Maze scene, the parallax issue makes the PPG result very noisy, as shown in Fig. 9 (c).Parallax error occurs because PPG's spatial partition algorithm has difficulty learning from narrow and close incident radiance.Our framework learns the continuous spatial-varying distribution and does not suffer from such issues.
Comparison with PAVMM.PAVMM performs generally better than PPG.With its parallax-aware fitting method, it learns distributions in the Maze scene most accurately.However, PAVMM produces a rather noisy result in the Ajar scene, as shown in Fig. 9 (d).This is due to two reasons: First, the major illumination is a lineshaped lit surface on the wall at the right side of the scene (lit by light source outside the door via a small crack), which is anisotropic and difficult for vMF to fit.Second, the fitting algorithm of PAVMM encounters a challenge in modeling the indirect lighting from the specular checkbox floor.In contrast, our proposed framework exhibits greater robustness by leveraging NASG's ability to accurately capture the anisotropic distribution with a parsimonious set of components, thus achieving low variance.
Comparison with NIS.With the same configuration of training samples and steps, NIS in general achieves similar variance, except in the Maze scene (as shown in Fig. 9 (c), in which the learning of the 2D distribution suffers from discontinuity at the edge of the azimuth axis).Compared to the results reported in the original NIS paper [Müller et al. 2019], our NIS implementation has a slightly higher variance.We attribute this to two main factors: (a) Our implementation employs a lightweight 2-layer MLP-based transform as opposed to the original, relatively more complex, 4-layer ResNetbased version and (b) we limit our training to a subset of samples at each iteration.However, even with this relatively lightweight implementation, NIS takes roughly 2× longer than our framework: in Tab. 5 we compare the rendering time of NIS and our NASG framework.The extra overhead is due to the usage of coupled networks, and the normalizing flow needs to be evaluated two times at each non-delta shading point (one for scattering sampling and the other for NEE).The intrinsic characteristics of normalizing flow as an implicit density model, characterized by the utilization of multi-layer transformations, imply that a larger network architecture may be imperative to effectively capture complex distributions with accuracy.Moreover, it is plausible that a greater number of training samples or iterations might be essential to ensure convergence.These adjustments are likely requisite for attaining the results reported in the original paper.However, they also inherently bring a significant increase in computational overhead.In other words, with limited computation resources, our framework can learn more accurate distributions.PPG [Müller 2019] proposed similar features yet achieved them with different approaches.To compare the sampling efficiency of fully integrated systems, we render the above scenes with all features enabled.For PPG [Müller 2019] we additionally set the sample budget to 511 in order to benefit from inverse-variance-weighted blending.
While PAVMM [Ruppert et al. 2020] does not implement similar features, similar improvement could be expected if the features were implemented with it.
As shown in Tab.6, by enabling selection probability optimization and sample re-weighting, the MAPE of the results produced by both methods decreases in a generally varying range.PPG achieves a more obvious boost, which we attribute to its lack of product guiding: selection probability optimization thus has a stronger impact on reducing variance.While both methods are further improved by these features, it is also notable that our framework can optimize a selection probability almost for free, making our framework an attractive option when integrating it into a high-performance rendering system.

DISCUSSION 8.1 Performance in GPU Path Tracing
A GPU-based brute-force path tracer can be 20× faster than a CPUbased one.If an importance sampling method provides low persample variance yet high overhead, it cannot defeat a brute-force path tracer on GPU.As a network-based guiding method, the major overhead is the network itself, therefore, we must adopt a sampling strategy that can be implemented in a small network.This motivates us to find a solution for learning an explicit distribution, rather than an implicit model such as one based on a normalizing-flow.Despite its powerful computation ability, a GPU's large memory bandwidth can only be leveraged by carefully designed high-concurrency algorithms.This raises the necessity of reducing code branching as much as possible, which conflicts with the nature of path tracing.Otherwise, it suffers from a much more obvious memory latency than CPU, which causes many CPU algorithms to perform relatively slowly on the GPU (e.g., PPG's quad-tree approach requires multiple random accesses to GPU memory, thus causing long stalls).Recently, with hardware improvement, we see some acceleration hardware that is specifically designed for neural networks, which our framework could potentially benefit from.

Learning rate
In our GPU implementation, we used a relatively large learning rate (0.002).This decision was strategically aimed at enabling our model to swiftly navigate the parameter space, thus facilitating rapid convergence.This feature is particularly crucial in leveraging early samples to their fullest potential in our context.However, a large learning rate limits our model's ability to reach the deepest minima.That being said, at the latter stage of training, the resulting "jitters" remain close to the ground truth, ensuring that our learned distributions are valid approximations.We leave the improvement of optimization strategy as future work.

Limitations and Future Work
Although in general our configuration for training works well, sometimes we encounter cases where we need to use a smaller learning rate to avoid degenerated distributions.Such degenerated distributions have been observed only a few times when rendering caustics.They could be due to the complexity of the distribution, where only a small amount of light hits the point from a narrow direction.When the early samples based on BSDF fail to produce a representative initial distribution, the system could fail to sample important directions, which could eventually lead to degeneration.Adopting an adaptive learning strategy could possibly improve robustness -we leave this as a future work.
Another interesting extension of our framework could be path guiding in scenes with participating media.This could be achieved in theory (similar to what was done previously [Herholz et al. 2019]), yet the ability to learn 3D distributions remains an area to be further investigated.

CONCLUSION
In this paper we proposed an effective online neural path guiding framework for unbiased physically-based rendering.We tackled the major challenges of neural-based online path guiding methods by proposing a novel closed-form density model, NASG.The simplicity and expressiveness of NASG allow it to be efficiently trained online via a tiny, MLP-based neural network with spare ray samples.
Through experiments, we show that under the same sample budget, our framework outperforms existing neural path guiding techniques and achieves comparable results to state-of-the-art statistical path guiding techniques.With proper implementation on GPU, our framework helps to improve the raw performance of a GPU path  tracer via neural path guiding.This was achieved because our framework can effectively learn the spatial-varying distribution and guide a unidirectional path tracer with low overhead, allowing the path tracer to produce high-quality images with limited computational resources.Our work also shows that learning-based importance sampling has great potential in practical rendering tasks.We hope our work can help pave the path for research communities in industry and academia to adopt neural technologies for path tracing.where  is the lobe amplitude and  is an auxiliary parameter introduced so that  is continuous whenever  > 0. In practice, it is possible to set  = 0, although this breaks continuity at v = − unless  = 0.For the rest of the content, we set  = 1 and omit it from the notation.

C SAMPLING NASG
Here, we discuss how to sample from the distribution on  2 obtained by normalizing  (cf.Eq. ( 30)).To this end, we essentially reverse the discussions in the previous section.Define two functions ] (here, "E" and "W" stand for east and west, respectively).
Let us now argue that the above sampling method serves our purpose.Observe that both Φ  and Φ  are bijections.Their inverses are given by Let  be a random variable on  2 , which we also view as a function of (, ).Then, the expected value of  over the points Φ  (, ), where the (, ) are uniformly sampled from [ −2 , 1] × [−  2 ,  2 ], is given by • Φ  (, ).(37) See Eq. ( 20).It follows that our sampled directions obey the distribution obtained by normalizing , as desired.

Fig. 2 .
Fig. 2. Overview of proposed framework.Area in blue frame is computation of a classic path tracer.Area in yellow frame shows how we learn the light distribution with a neural network and use the estimated explicit density model to guide further sampling.
Fig. 3. Visualization of NASG component  with different parameters.Note that  agrees with spherical Gaussian when  = 0.
3 n Parameterization of NASG.To improve the efficiency of learning, we reduce the number of parameters used to represent the NASG model.First, we introduce , , , the Euler angles representing the orientation of the orthogonal basis with respect to the global axes.Accordingly, z and x of the orthogonal basis can be represented by z = cos  sin  sin  sin  cos  , x = cos  cos  cos  − sin  sin  cos  sin  cos  + cos  sin  − sin  cos  .

Fig. 4 .
Fig. 4. Training process of proposed network.We heavily apply automatic differentiation to adjust network weights based on estimated distribution and rendering estimation ⟨⟩ from the sampled scattering direction   .

Fig. 6 .
Fig.6.A subset of rendering results in variance reduction comparison, rendered at 1024 samples per pixel (SPP).(a) Compared with incident radiance guiding, our full product guiding significantly improves results on specular surface: our "Full Product" result shows clear reflection of the metallic shelf, while "Li" or cosine weighted guiding only produces noise.MDMA and Kent distribution fail to learn an accurate distribution for narrow incident light, leading to higher variance.(b) Compared to traditional distributions such as SG, our NASG is more expressive and can learn a more accurate distribution for anisotropic lighting conditions, leading to lower variance.

Fig. 7 .
Fig.7.Visualization images of learned distributions using different models, including 14-component spherical Gaussians (SG), 12-component 2D Gaussians (G2D), MDMA with 8 components for both dimensions, 8-component Kent distribution (KENT, due to the precision issue, we apply a constraint 0 <  < 20), and our 8-component NASG.Each row shows the learned distributions at the red-dotted locations.The reference is generated by rendering a panorama image at given locations, to which a cosine weight is applied.We measured the difference from the reference using KL-divergence (lower is better).(a) Our NASG achieves slightly more accurate distribution in an ordinary indoor indirect lighting scene.(b) Ajar scene features a vertical indirect lighting source, and our NASG can fit the shape well with fewer components.(c) The thin curve distribution in the Box scene is difficult to fit with isotropic models: the approximated curve is much thicker than the reference.In contrast, our NASG effectively preserves the thin nature of the curve, achieving lower KL-divergence.The Kent distribution fails to learn sharp distributions due the limited range of .However, raising the limit leads to a biased result or disrupted learning.

Fig. 8 .
Fig. 8. Nine scenes used to evaluate proposed Mitsuba implementation with state-of-the-art techniques.From left to right: Ajar, Bathroom, Wine, Bidir, Kitchen, Maze, Box, Staircase, Torus.Several scenes apply different lighting setups for more indirect and anisotropic illumination.

Fig. 9 .
Fig.9.Results of state-of-the-art comparison, visualized in logarithmic scale (lower is better).We render a variety of 3D scenes with 512 SPP, guided with different techniques including our framework, PPG[Müller 2019], PAVMM[Ruppert et al. 2020], and NIS[Müller et al. 2019].Each technique uses the authors' implementations without modification except NIS.A subset of rendering results that demonstrates: (b) The Box scene, which features a circle luminary lighting the scene via a glass ceiling.The lighting is indirect and anisotropic, which is challenging for isotropic models such as vMF used in PAVMM.Our NASG helps to fit the distribution accurately and achieve lower variance.(c) The Maze scene with strong indirect lighting from a cage, which is challenging for PPG due to the parallax issue, while PAVMM achieves a clean result with its parallax-aware fitting algorithm, closely followed by ours.However, in (d) the Ajar scene, PAVMM has difficulty learning and fitting accurate distribution due to anisotropic indirect lighting and the near-specular floor, while our framework guides robustly.Although every method has its advantages in special circumstances, our framework performs well in general, making it an attractive option as a path-guiding technique.

Table 1 .
Encoding scheme of network inputs

Table 2 .
Decoding scheme for NASGM with N components Parameter Symbol Decoding cos  , sin , cos , sin , cos   0

Table 5 .
Rendering time (minutes) of proposed NASG framework and NIS implementation in Mitsuba.NIS uses a normalizing flow to model distribution, which takes roughly twice as much time to calculate, compared to our explicit NASG model.

Table 6 .
MAPE of rendering results, with and without enabling selection probability optimization and sample re-weighting schemes, in PPG and proposed framework.A general improvement can be observed in most cases for both methods.