Joint Sampling and Optimisation for Inverse Rendering

When dealing with difficult inverse problems such as inverse rendering, using Monte Carlo estimated gradients to optimise parameters can slow down convergence due to variance. Averaging many gradient samples in each iteration reduces this variance trivially. However, for problems that require thousands of optimisation iterations, the computational cost of this approach rises quickly. We derive a theoretical framework for interleaving sampling and optimisation. We update and reuse past samples with low-variance finite-difference estimators that describe the change in the estimated gradients between each iteration. By combining proportional and finite-difference samples, we continuously reduce the variance of our novel gradient meta-estimators throughout the optimisation process. We investigate how our estimator interlinks with Adam and derive a stable combination. We implement our method for inverse path tracing and demonstrate how our estimator speeds up convergence on difficult optimisation tasks.


INTRODUCTION
Forward rendering involves solving the light transport integrals with given scene parameters, e.g., geometry, materials, textures, by numerically estimating the rendering equation [Kajiya 1986;Pharr et al. 2016].Inverse rendering reverses this process by estimating the scene parameters starting from a given target image.This process involves inverting the rendering equation [Kajiya 1986].
Such inversion tasks are typically solved by gradient descent.Physically-based differentiable renderers [Jakob et al. 2022;Li et al. 2018;Zhang et al. 2020] facilitate these gradient-based optimisation methods.The process involves backpropagating from an underlying loss function, quantifying the disparity between an image generated with the current parameters and the target image, resulting in gradients w.r.t. the scene parameters.These gradient values are approximated from a given set of samples, and subsequently, the scene parameters are adjusted using these gradients to minimise the loss.The ultimate goal is to converge to a parameter set that produces the target image.Due to the nature of Monte Carlo integration, the estimated gradients can be extremely noisy, hampering the performance of gradient-based optimisers.In inverse rendering, gradients are estimated with tens to hundreds of rays traced per pixel [Nimier-David et al. 2020;Zhang et al. 2019] to minimise noise.Usually, inverse rendering requires hundreds to thousands of iterations to converge; recomputing these gradient estimates in every iteration comes at a large cost.
In this paper, we propose a theoretical framework that jointly considers sampling and optimisation by deriving a theoretical framework for interleaving them.We reuse past samples without introducing bias thanks to finite-difference estimators that describe the change in the estimated gradients between each iteration.By combining proportional and finite-difference samples, we continuously reduce the variance of our novel gradient meta-estimators throughout the optimisation process.
First, we introduce our meta-estimation theory and then discuss our variance estimation strategies used to derive coefficients for our meta-estimator.We investigate how our estimator interlinks with Adam and derive a stable combination.We run experiments to evaluate our method in the context of inverse rendering.Finally, we discuss our method concerning future and concurrent work and give our conclusions.
Our contributions include: • Meta-estimation theory on combining proportional and finitedifference estimators.• Practical variance approximation techniques to effectively implement meta-estimation.• Implementation and evaluation of meta-estimation for inverse rendering.(We will release our code upon acceptance.)[Jakob et al. 2022;Li et al. 2018;Nimier-David et al. 2020;Zeltner et al. 2021;Zhang et al. 2020].While our work applies to any method using gradient descent on Monte Carlo estimated gradients, we mainly experiment with Path Replay Backpropagation [Vicini et al. 2021]; a well-established state-of-the-art method for inverse path tracing, implemented in Mitsuba 3 [Jakob et al. 2022].

RELATED WORK
Previous works have focused on sampling strategies [Bangaru et al. 2020;Yan et al. 2022;Zhang et al. 2021] and improving the optimisation itself [Nimier-David et al. 2020;Vicini et al. 2021] to reduce noise in the gradients.Particularly relevant, concurrent work by Chang et al. [2023] applies ReSTIR [Bitterli et al. 2020] in parameter-space with the same goal of reducing the variance of the estimated gradients.
Ray Differentials.Igehy [1999] first proposed ray differentials to approximate derivatives for texture interpolation and anisotropic filtering.Kettunen et al. [2015] combine ray differentials with gradient-domain MLT [Lehtinen et al. 2013] to build unbiased image-space gradient estimators for gradient-domain path tracing.Manzi et al. [2016] extend their work to the spatiotemporal domain.
We apply the general idea of finite-difference estimation to temporal gradient averaging on a set of parameters.As we do not assume any structure between individual parameters, we forgo Poisson reconstruction and instead statistically average proportional and integrate finite-difference samples.
Gradient averaging.Iterating with the arithmetic mean of gradient samples is well-understood to improve the convergence of optimisers.Several recursive schemes [Nesterov 1983;Polyak and Juditsky 1992] achieve fast convergence on convex problems [Moulines and Bach 2011], with some proving particularly useful in deep learning [Sutskever et al. 2013].Kingma and Ba [2014] propose start-up bias-corrected exponential moving averaging on gradients; Adam remains the de-facto optimisation algorithm for deep learning and inverse rendering applications.
In recent work, Gower et al. [2020] analyse gradient averaging methods based on finite sums; they show improvements in convergence analogous to our work, although limited to convex problems.Unfortunately, the finite-sum setting of algorithms like SAGA [Defazio et al. 2014] and SVRG [Johnson and Zhang 2013] does not generalise to Monte Carlo integration [Nicolet et al. 2023].
Reducing the gradient variance is well understood to improve convergence speed and stability.Previous works on optimising neural networks increase the batch size to reduce this variance, which is often preferable over slower learning rates [Smith et al. 2018].Variates. Fieller and Hartley [1954] first propose control variates as a weighted combination of correlated estimators, one of which must be of a closed-form integral.Owen [2013] shows that the optimal control weight is proportional to the covariance of the estimators.Rousselle et al. [2016] generalise control variates to any pair of correlated estimators through two-level Monte Carlo integration; they apply their work to spatiotemporal gradient-domain rendering.Concurrent with our work, Nicolet et al. [2023] further generalise control variates to recursive estimation, applying it to primal renderings in the context of inverse path tracing.

Control
Our work is distinctively different from control variates in that we build on an independent finite-difference estimator rather than a pair of correlated estimators.In particular, this formulation lets us avoid covariance terms in our weighting scheme.

DIFFERENTIAL META-ESTIMATORS
Various Monte Carlo methods estimate a sequence of integrals.Often, each integral is a function of the previous one, with the sequence converging to a solution.Optimisation via inverse Monte Carlo is a prime example; we estimate gradients in each iteration, adjust parameters accordingly, and repeat the process.
Our work is focused on improving the convergence speed and stability of the optimisation process by reducing the variance of the estimated gradients.We draw inspiration from control theory, specifically noise reduction through the combination of proportional and differential signals.These methods assume that samples are drawn from known probability distributions, usually normal distributions with known variances [Kalman 1960].Unfortunately, we cannot make such assumptions when dealing with Monte Carlo noise.
We combine two estimators: a proportional estimator ⟨  ⟩ -any Monte Carlo gradient estimator sampled independently between iterations -and a finite-difference estimator ⟨Δ  ⟩ that estimates the change of a gradient between two consecutive iterations.
Notation.Let   denote the integral of function  over the domain X, given parameters   for the current iteration  ∈ [0, ∞): (1) Let ⟨  ⟩ denotes the (proportional) Monte Carlo estimator of   , meaning E[⟨ ⟩] =   .For example, an estimator may sample  given a density  over X: Finite-difference estimation.We write the change of   between consecutive steps as: (3) A finite-difference estimator (⟨Δ  ⟩) estimates this change, ideally with a low variance.For example, we can substitute Equation (1) into Equation (3): and sample with a density  like in Equation ( 2): Here we assume that  is continuous w.r.t.  .Although our theory may apply to any Monte Carlo integral   , we analyse the case where  (,   ) = /  is the gradient at the -th iteration, for some objective .

Meta-estimation
Our meta-estimator aims to optimally combine each proportional ⟨  ⟩ and finite-difference estimator ⟨Δ  ⟩ available until the current step .In this subsection, we establish the theoretical conditions required for a variance-optimal, unbiased combination of both estimators.
A finite-difference estimator ⟨Δ  ⟩, by its definition in Equation (3), lets us update any proportional estimator from the previous step (⟨  −1 ⟩) to the current step .This update can be done simply by addition without introducing any bias.We can easily show this by expanding the expected value of the sum: We define our meta estimator as a weighted sum over the combination of all previous estimators up until step -1 i.e., ⟨  −1 ⟩  and the current proportional and finite-difference estimators: We initialise ⟨ 0 ⟩  = ⟨ 0 ⟩.As we sample all ⟨  ⟩ and ⟨Δ  ⟩ independently, the optimal   coefficients are given by inverse variance weighting [Sinha et al. 2011]: This simple recurrent relation captures the variance optimal combination of all previously sampled proportional and finite-difference estimators.
To summarise, we introduce the optimal and unbiased metaestimator in Equation ( 7).However, in practice, we use a more efficient implementation which suffers from some start-up bias.We describe this version in the following section.To visualise the estimators mentioned above and to motivate the design of our optimiser, we show a simple example in Figure 2.

VARIANCE ESTIMATION
Implementing Equation ( 7) in practice presents a challenge due to the unknown variances of our estimators.In this section, we describe how we approximate each variance term.We must balance three main objectives: the efficiency of our variance approximation methods, the optimality of our approximated   coefficients, and any bias potentially introduced to ⟨  ⟩  .

Proportional estimator variance
We approximate Var[⟨  ⟩] as a zero-centred raw moment [Papoulis and Pillai 1984], computed using an exponential moving average (EMA) with coefficient   :

Finite-difference estimator variance
As the proportional estimator reaches steady-state, we can safely assume that ⟨  ⟩ are identically distributed over consecutive iterations.Unfortunately, the same observation does not apply to finite-difference estimation as this finite-difference depends on the optimisation step we take in the previous iteration.For example, a larger step will cause a larger shift in the per-parameter gradients.
To resolve this issue we propose to decouple the optimisation step size (|Δ  |) from the approximated finite-difference estimator variance (Var[⟨Δ  ⟩]).We begin the derivation of this decoupling Parameter value Parameter gradient

Iteration
Figure 2: We optimise the rate parameter of an exponential distribution such that the mean of the distribution matches our target value of 2.0.We take 32 samples of the distribution in each iteration and compute an L2 loss between their mean and the target value.The bottom row shows insets of the graphs in the top row, indicated by grey regions.On the left, we show how Adam and our method reach the ground truth rate parameter 0.5.Error bars show the run-to-run variation of the optimised parameter.Our method converges significantly faster and is more stable than Adam.On the right, we show the estimators we use for our method; the proportional estimator has a large variance, while the finite-difference estimator is much less noisy.
Our meta-estimator combines both, with its variance reducing over time.
by expanding the fraction in Equation ( 5) by the Euclidean step size ||Δ  || 2 of the previous iteration: For sufficiently small step sizes, we can rearrange the terms in Equation (10) to approximate the finite-difference of gradients  as the second-order gradient ( /), times a unit-directional vector, times the left over terms: Applying the variance operator to Equation (11) gives us the decoupled finite-difference variance (Var[⟨Δ  ⟩]  ): 12) We use a zero-centred EMA, with a coefficient  Δ , to approximate this decoupled variance as: We generally use a small  Δ coefficient as Var We found this decoupled variance more closely distributed between iterations, better suited for approximation via moving averages.

Meta-estimator variance
We approximate the variance of our meta estimator in Equation ( 7) by recurrently applying the variance operator: Here we assume   to be a non-random value to simplify our mathematical derivation.Later in Section 6, we show the choice of   is less significant as long as its correlation with the gradient samples diminishes.

Alpha clipping
Meta-estimation is most vulnerable to underestimated Var[⟨  ⟩  ]; unless a significant Var[⟨Δ  ⟩] indicates a shift, ⟨  ⟩  will only slowly correct its overconfidently estimated value by averaging ⟨  ⟩ over many iterations.The risk of underestimation is the greatest while our exponential moving averages accumulate their initial samples.Clipping alpha based on the iteration resolves this risk: Intuitively, Equation ( 16) constrains alpha by the perfect average of all previous estimates; any value above this must be overestimated.We generalise this observation to the entire optimisation process, assuming that Var[⟨  ⟩] is similar in subsequent steps:

OPTIMISATION
Combining meta-estimation and optimisation creates a complex feedback loop; the meta-estimated gradients ⟨  ⟩  depend on their finite-difference estimates ⟨Δ  ⟩, which depend on the optimiser's steps Δ  , which, in turn, depend on the gradients estimated in the previous step ⟨  −1 ⟩  .It becomes crucial that the meta-estimator provides the optimiser with reliable gradients and that the optimiser makes steps that let the meta-estimator converge.We aim to combine Adam with meta-estimated gradients.We explain Adam's variance approximation and update rule to show where we can integrate meta-estimation.Adam [Kingma and Ba 2014] is well known for its robustness to outlier gradient samples; upon encountering an outlier, the estimated second moments adjust in the same step, swiftly pulling down the step size.This mechanism works because Adam first updates its second-moment estimate: corrects the EMA startup bias: and then divides the step size by its square root: where   and   refer to Adam's moment estimates,  to the learning rate,  2 to Adam's second moment coefficient, and  to a small value to ensure numerical stability.  and Var[⟨  ⟩] behave similarly in our case; first, we update Var[⟨  ⟩] for the current step (Equation ( 9)), compute   (Equation ( 8)), and add the outlier gradient ⟨  ⟩ to our meta-estimator ⟨  ⟩  weighted by (1−  ) (Equation ( 7)).Therefore, just like  2 for Adam,   offers a tradeoff between outlier robustness and estimation bias.When using optimisers like RMSProp [Graves 2014] and Adam [Kingma and Ba 2014], lower variance gradients naturally accelerate convergence since these optimisers divide their step size by the standard deviation of the gradients (Equation ( 20)).Additionally, Momentum [Sutskever et al. 2013] helps these optimisers handle tricky non-linear, multivariate curvatures such as ravines.The optimisers' effectiveness is greatly reduced if the noise in the gradients overpowers the variance arising from non-linearities in the estimated moments required for these mechanisms.
Naively feeding the meta-estimated gradients to Adam is problematic; Adam computes its moment estimates, assuming the input gradients in each iteration to be independent.Meanwhile, our metaestimator outputs an already averaged gradient (Equation ( 7)) with a strong positive correlation to previous averages.Adam's moment estimates are also redundant since we already estimate the variance of our meta-estimator Var[⟨  ⟩  ].Therefore, we formulate the update step in terms of our estimates: Dividing by √︁ Var[⟨  ⟩  ] sets the step size based on our metaestimator.As Var[⟨  ⟩  ] responds to changes in the estimated gradients much more quickly than Adam's second-moment estimate with the suggested  2 = 0.999 parameter, the stability of our method may seem uncertain.We observe that the responsivity of our method actually improves convergence, especially when combined with the decoupled estimation of Var[⟨Δ  ⟩].Optimisation speeds up quickly when low-noise gradients are available and slows down naturally when approaching a minimum.

EXPERIMENTS
We run several experiments to confirm our method's behaviour and verify its theory.We also compare our method against Adam, as it is used in state-of-the-art inverse rendering pipelines.We implement our method in Mitsuba 3 [Jakob et al. 2022] and use Path Replay Backpropagation [Vicini et al. 2021] to sample gradients computed with the unbiased Mean Relative Squared Error loss [Deng et al. 2022;Pidhorskyi et al. 2022].For texture optimisation tasks, we use gradient preconditioning as proposed by Nicolet et al. [2021].We compute ⟨Δ  ⟩ with a simplified form of the shift mapping proposed by Kettunen et al. [2015], only accounting for the BRDF sampling.While this implementation is sufficient for our proof-ofconcept demonstrations, a full implementation of shift mapping can also account for changes in geometry at an insignificant cost compared to proportional samples.Unless mentioned otherwise, we tune learning rates of each method in each experiment.
Variance reduction without lag.We investigate the variance reduction our method can achieve while the scene parameters are changing.We run a fixed linear interpolation of the parameters without an optimiser to prevent any effects from the feedback of the gradients.
Forward gradients of several pixels in Figure 3 show that our method avoids the lag in gradients typical of EMAs.Our metaestimate's actual variances and estimate variances are much tighter than the estimates computed by Adam.Furthermore, our method remains more stable upon encountering outliers.
Approximation accuracy.We repeat the previous setup in Figure 4, only now we test an exponentially decaying change in the gradients.Again, our meta-estimator stays within 0.5 to 2 times its predicted standard deviation.As the gradients settle, our method provides a consistent variance reduction (Row 1), averaging a large number of samples wherever possible.Meanwhile, Adam struggles with high-variance gradients (Row 2) and is thrown off by outliers.
We also show the approximated variances compared to ground truth variances computed over 1000 independent runs.Our approximation methods perform reasonably, only overestimating Var[⟨  ⟩] (Row 3).This overestimation results in generally conservative   values, erring on the side of robustness rather than maximising variance reduction (Row 5).On the other hand, we approximate Var[⟨Δ  ⟩] (Row 4) with little bias, although with often a large run-to-run variance.
Multivariate optimisation.We simultaneously optimise an object's colour, metalness, and roughness, as shown in Figure 1.Thanks to our meta-estimator, our method can traverse the loss surface without losing past samples.Furthermore, our finite-difference estimates let our meta-estimator adjust rapidly, avoiding the overshoots typical of Momentum-based methods.Even when tuning Adam's hyperparameters for the specific problem, it can only match our method at an over 20 times increase in computational cost, not counting the time spent on hyperparameter tuning.
Texture optimisation.We show a difficult texture optimisation case in Figure 6.Texture optimisation requires disentangling global illumination with very few gradient samples per texel.Adam can only take a few steps within a fixed budget at a high sample count, Figure 3: We estimate forward gradients of the left wall's colour's blue channel while linearly changing the scene from the initial state (top left) to the target state over 100 iterations.The dashed line represents the actual gradient, dots the gradient samples, the solid line the estimated gradient, and the shaded area the estimated standard deviation.Error bars every 10 iterations show the run-to-run variation of the estimated gradient.Meta-estimation eliminates lag, improves robustness to outliers, and offers lower variance while more accurately estimating this variance.We select the three pixels w.r.t. the actual gradient variance; blue is the noisiest, orange is at 75'th percentile, and green is the median.
requiring a high learning rate that skips over the intricate loss surface necessary to navigate for disentangling various effects.At a lower sample count, however, Adam struggles to progress as steps devolve into a random walk as the scale of the gradients shrinks close to minima.
High-dimensional optimisation.In Figure 7, we optimise an emissiveabsorptive volume of size 256×256×256 voxels, totalling 70 million parameters.Perfectly fitting such a non-physical volume to rendered images is impossible.Thus, the optimiser needs to balance per-pixel losses for a good approximation, further needing to disentangle the small subset of parameters visible through any given pixel.Previous works avoid convergence to local minima by upsampling the optimised volume in several stages; our method does not need this workaround.On the other hand, Figure 8 shows that our method provides less benefit when our finite-difference estimator's sampling is too sparse across the volume.3, we show the estimated gradients and standard deviations of our metaestimator and Adam.Our meta-estimators achieve significantly lower actual variance in each case, while also providing much more accurate approximations.Dashed lines represent actual or optimal values, solid lines a randomly sampled run, while error bars show run-to-run variation.In addition, we show violin plots for alpha, demonstrating how our method is more likely to be conservative and not to be swayed by outliers.
Zero-centred EMAs.We chose to use zero-centred moving averages so that we do not need to approximate the mean of our estimators directly.This approach is generally more robust and memory efficient, though it overestimates variance at large signal-to-noise ratios.However, gradients generally have a low signal-to-noise ratio, so this tradeoff works in our favour.Figure 5 (top-left) demonstrates how non-zero-centred variance approximation is unstable, providing unreliable alphas, thus causing optimisation to diverge.Each modification of our method is less robust than our results in Figure 6.Potted Plant © 2023 James Ray Cock Alpha clipping.Alpha clipping helps resolve cases when our approximated variances are inaccurate, improving robustness.It may hinder the variance reduction of our method in the special case when Var[⟨  ⟩] is sharply decreasing over iterations.However, we have not encountered this behaviour with our tested proportional estimators.Figure 5 (bottom-left) shows an ablation without alpha clipping for a scene from Figure 6, demonstrating rapid divergence as the initial variance approximations are unreliable.
Sample reuse.We use the same samples for rendering and variance approximation.This correlation introduces some bias at the start of the optimisation process, which diminishes over time.shows an unbiased ablation using uncorrelated samples.Although this independently approximated variance eliminates bias, it misses outliers in the samples used for gradient estimation, causing the parameters receiving these outliers to diverge.

LIMITATIONS
Estimators ⟨Δ  ⟩ are not generally available for many problems.Kettunen et al. [2015] propose shift mapping for path tracing, which we use in our work.Our meta-estimators rely heavily on ⟨Δ  ⟩; as we recurrently sum Var[⟨Δ  ⟩] in Equation ( 15), it inherently bounds the variance of our meta-estimator.Doing so is fine as long as Var[⟨Δ  ⟩] is quadratic w.r.t. the step size (Equation ( 14)).Thus, we need to ensure this property when building finite-difference estimators while also aiming for the lowest variance to achieve the best stability and convergence with meta-estimation.Zeltner et al. [2021] show that gradient estimators benefit from specialised differential sampling strategies.The same is true of finite-difference estimators; our naive toy formulation in Equation (5) glosses over this problem where  (x,   ) is usually only optimised by importance sampling  (x,   ), not the difference between  (x,   ) and  (x,   −1 ).
Suboptimal sampling strategies of ⟨  ⟩ compound the issue.As our work focuses on gradient estimation, meaning   are gradients, sampling of ⟨  ⟩ is not yet well established.For example, Zeltner et al. [2021] show the poor performance of roughness gradient estimators.We experience these issues first-hand, as we show in Figure 9.

CONCLUSION
Our proposed meta-estimation technique and corresponding adaptation of the Adam update rule can substantially improve convergence when descending on noisy gradients, reducing computation costs by several orders of magnitude.We solve cases where lowsample-count gradients are too noisy for fast convergence while high-sample-count gradients are prohibitively expensive to compute for the required number of iterations on difficult non-linear, multivariate problems.
Future work.We look forward to applications of meta-estimation to various inverse Monte Carlo problems, especially as MC gradient estimators become prominent in machine learning [Mohamed et al. 2020].Building good gradient and finite-difference estimators may seem challenging -and are the main limitation of our methodbut it is undoubtedly a fruitful direction for future work.We did not investigate training deep neural networks in this work but see it as the next step once low-variance finite-difference estimators become available.

Figure 4 :
Figure4: Following the same setup as Figure3, we show the estimated gradients and standard deviations of our metaestimator and Adam.Our meta-estimators achieve significantly lower actual variance in each case, while also providing much more accurate approximations.Dashed lines represent actual or optimal values, solid lines a randomly sampled run, while error bars show run-to-run variation.In addition, we show violin plots for alpha, demonstrating how our method is more likely to be conservative and not to be swayed by outliers.

Figure 5 :
Figure 5: We ablate our design choices, demonstrating their importance for robust gradient estimation and optimisation.Each modification of our method is less robust than our results in Figure 6.Potted Plant © 2023 James Ray Cock

Figure 6 :
Figure6: We optimise diffuse textures, placing a potted plant in the extremely noisy Veach Ajar scene.At 64 spp, Adam can only afford to take a few steps at a large learning rate and cannot match small details, further failing on complex interactions such as the leaf's reflection on the pot's side, where navigating a difficult loss surface is necessary.At 3 spp, Adam reduces to a random walk after reaching vaguely accurate parameters where gradients vanish against the high noise level.Our estimator shows good convergence.Textures remain a little blurry as a known consequence of regularisation[Nicolet et al. 2021].The texture of the glossy pot (third inset) shows relatively worse convergence due to the concerns discussed in Section 7. Potted Plant © 2023 James Ray Cock, Veach Ajar © 2023 Benedikt Bitterli Differentiable Path Tracing.Path tracing accounts for global illumination through physically accurate light transport by Monte Carlo integration of the rendering equation [Kajiya 1986].Recent works proposed various approaches to differentiate such Monte Carlo integrals and estimate derivatives w.r.t.scene parameters.