Self-Calibrating, Fully Differentiable NLOS Inverse Rendering

Existing time-resolved non-line-of-sight (NLOS) imaging methods reconstruct hidden scenes by inverting the optical paths of indirect illumination measured at visible relay surfaces. These methods are prone to reconstruction artifacts due to inversion ambiguities and capture noise, which are typically mitigated through the manual selection of filtering functions and parameters. We introduce a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates the imaging parameters during the reconstruction of hidden scenes, using as input only the measured illumination while working both in the time and frequency domains. Our pipeline extracts a geometric representation of the hidden scene from NLOS volumetric intensities and estimates the time-resolved illumination at the relay wall produced by such geometric information using differentiable transient rendering. We then use gradient descent to optimize imaging parameters by minimizing the error between our simulated time-resolved illumination and the measured illumination. Our end-to-end differentiable pipeline couples diffraction-based volumetric NLOS reconstruction with path-space light transport and a simple ray marching technique to extract detailed, dense sets of surface points and normals of hidden scenes. We demonstrate the robustness of our method to consistently reconstruct geometry and albedo, even under significant noise levels.

Figure 1: We present a self-calibrating, fully-differentiable NLOS inverse rendering pipeline for the reconstruction of hidden scenes. Our method only requires transient measurements as input, and relies on differentiable rendering and implicit surface estimation from NLOS volumetric outputs to obtain the optimal NLOS imaging parameters that yield accurate surface points, normals, and albedo reconstructions of the hidden scene. The top row shows the reconstructed volumetric intensity, albedo, and 3D geometry of a real scene [Liu et al. 2020], where noise interference causes the geometry estimation to fail. The bottom row shows our results after optimization of the imaging parameters.

INTRODUCTION
Time-gated non-line-of-sight (NLOS) imaging algorithms aim to reconstruct hidden scenes by analyzing time-resolved indirect illumination on a visible relay surface [Faccio et al. 2020; Jarabo et al. 2017; Satat et al. 2016]. These methods typically emit ultra-short illumination pulses on the relay surface, and estimate the hidden scene based on the time of flight of third-bounce illumination reaching the sensor [Lindell et al. 2019; Liu et al. 2019; O'Toole et al. 2018; Velten et al. 2012; Xin et al. 2019].
The majority of existing methods estimate hidden geometry by backprojecting captured third-bounce illumination into a voxelized space that represents the hidden scene [Laurenzis and Velten 2014], lacking information about surface orientation and self-occlusions [Iseringhausen and Hullin 2020]. Moreover, captured data contains higher-order indirect illumination and high-frequency noise from different sources, which introduce undesired artifacts in the reconstructions. Filtering the data or the reconstructed volume is the most common solution to mitigate errors and enhance geometric features [Arellano et al. 2017; Buttafava et al. 2015; Liu et al. 2019; O'Toole et al. 2018; Velten et al. 2012]; however, this requires manual design and selection of filter parameters, as their impact on reconstruction quality is highly dependent on scene complexity, environment conditions, and hardware limitations.
Recent physically-based methods have proposed an alternative technique that avoids the issues linked to backprojection. These methods merge a simplified but efficient three-bounce transient rendering model with an optimization loop, comparing the time-resolved illumination computed at the relay wall from an optimized geometry reconstruction against the measured illumination. However, the geometric representations introduced by existing works limit the detail in the reconstructions [Iseringhausen and Hullin 2020] or fail to reproduce the boundaries of hidden objects [Tsai et al. 2019].
Alternatively, the recent development of accurate transient rendering methods [Jarabo et al. 2014; Pediredla et al. 2019; Royo et al. 2022] has fostered differentiable rendering pipelines in path space [Wu et al. 2021; Yi et al. 2021], which have the potential to become key tools in optimization schemes. However, differentiable methods are currently bounded by memory limitations, since the need to compute derivatives of time-resolved radiometric data severely limits the number of unknown parameters that can be handled. The difficulty of handling visibility changes in a differentiable manner, as well as the large number of parameters that need to be taken into account, are two limiting factors shared with steady-state differentiable rendering [Li et al. 2018; Zhao et al. 2020] that are further aggravated in the transient regime. As a result, NLOS imaging methods that rely on differentiable rendering are limited to simple operations, such as tracking the motion of a single hidden object with a known shape [Yi et al. 2021].
To address these problems, we propose a novel self-calibrated, fully differentiable pipeline for NLOS inverse rendering that jointly optimizes system parameters and scene information to extract surface points, normals, and albedo of the hidden geometry. To this end, we combine diffractive phasor-field imaging in the frequency domain [Liu et al. 2019, 2020] with differentiable third-bounce transient rendering in the temporal domain. We leverage the volumetric output of phasor-field NLOS imaging to estimate geometric information of the hidden scene, which we then use in a transient rendering step to simulate time-resolved illumination at the relay wall. By minimizing the error between simulated and captured illumination, we provide a fully-differentiable pipeline that self-calibrates NLOS imaging parameters in an end-to-end manner.
Our optimized parameters provide accurate volumetric outputs from which we estimate surface points, normals, and albedos of hidden objects, with more geometric detail than previous surface-based methods. Our method is robust in the presence of noise, providing consistent geometric estimations under varying capture conditions. Our code is freely available for research purposes¹.

RELATED WORK
Active-light NLOS imaging methods provide 3D reconstructions of general NLOS scenes by leveraging temporal information of light propagation, by means of time-gated illumination and sensors [Faccio et al. 2020; Jarabo et al. 2017].
Scene representation. While existing methods rely on inverting third-bounce transport, they differ in their particular representation of scene geometry as volumetric density or surfaces. Volumetric approaches estimate geometric density by backprojecting third-bounce light paths onto a voxelized space [Ahn et al. 2019; Arellano et al. 2017; Buttafava et al. 2015; Gariepy et al. 2015; Gupta et al. 2012; La Manna et al. 2018; Velten et al. 2012]. Efficiently inverting the resulting discrete light transport matrix is not trivial; many dimensionality reduction methods have been proposed [Heide et al. 2019; Lindell et al. 2019; O'Toole et al. 2018; Xin et al. 2019; Young et al. 2020], but they are often limited in spatial resolution (as low as 64 × 64 in some cases) due to memory constraints. Surface methods, in contrast, rely on inverting third-bounce light transport onto explicit representations of the geometry [Iseringhausen and Hullin 2020; Plack et al. 2023; Tsai et al. 2019], usually starting with simple blob shapes and progressively optimizing the geometry until the loss converges. In contrast, we estimate implicit geometric representations of the hidden scene, based on surface points and normals, by ray marching the volumetric output of NLOS imaging, inspired by recent work on neural rendering [Barron et al. 2021; Mildenhall et al. 2020; Niemeyer et al. 2022]. The combination of NLOS imaging with differentiable transient rendering over the estimated geometry allows us to self-calibrate imaging parameters in an end-to-end manner. For clarity, in this paper the term explicit surface refers to a polygonal surface mesh, while implicit surface denotes a representation based on surface points and their normals, without a defined surface mesh. Please refer to Section 4.2 for a more detailed discussion on explicit/implicit surface representations.
Learning-based approaches. Other methods leverage neural networks instead, such as U-Net [Grau Chopite et al. 2020], convolutional neural networks [Chen et al. 2020], or neural radiance fields [Mu et al. 2022]. These learning-based methods are trained on object databases such as ShapeNet [Chang et al. 2015]. However, their parameters are trained with steady-state renderings of synthetic scenes composed of a single object behind an occluder in an otherwise empty space. As such, their performance often degrades on real scenes, as they overfit to the training dataset and become susceptible to noise. Our method does not rely on a pre-trained deep network to extract high-level features from synthetic steady-state rendering data; instead, we explicitly optimize virtual illumination functions and scene information by evaluating actual transient observations, without relying on neural networks. Recent works by Shen et al. [2021] and Fujimura et al. [2023] leverage transient observations similar to ours to optimize multi-layer perceptrons for imaging. However, these methods cannot calibrate the filtering parameters of volumetric NLOS methods, since they do not evaluate the physical observation of the transient measurements through an NLOS imaging and light transport model.
Wave-based NLOS imaging. Recent works have shifted the paradigm of third-bounce reconstruction approaches to the domain of wave optics [Lindell et al. 2019; Liu et al. 2019]. In particular, the phasor-field framework [Liu et al. 2019] computationally transforms the data captured on the relay surface into illumination arriving at a virtual imaging aperture. This has enabled more complex imaging models (e.g., [Dove and Shapiro 2020a,b; Guillén et al. 2020; Marco et al. 2021; Reza et al. 2019]) and boosted the efficiency of NLOS imaging to interactive and real-time reconstruction rates [Liao et al. 2021; Liu et al. 2020; Mu et al. 2022; Nam et al. 2021]. However, these systems require careful calibration of all their parameters, including the definition of the phasor field and the particular characteristics of lasers and sensors, which makes using them a cumbersome process. Our fully self-calibrated system overcomes this limitation.

TIME-GATED NLOS IMAGING MODEL
We propose a differentiable end-to-end inverse rendering pipeline (shown in Figure 2) that improves the reconstruction quality of hidden scenes by optimizing the parameters of NLOS imaging algorithms without prior knowledge of the hidden scene. In the following, we describe our NLOS imaging model; Section 4 describes our optimization pipeline built on this model.

Phasor-based NLOS imaging
In a standard NLOS imaging setup (see Figure 3), a laser beam is emitted towards a point $x_l$ on a visible relay wall, which reflects light towards the hidden scene; the hidden scene in turn reflects light back to the wall. The hidden scene is imaged based on the time of flight of the time-resolved illumination, captured at points $x_s$ on the relay wall in the form of a measurement matrix $H(x_l, x_s, t)$.
The recent diffractive phasor-based framework by Liu et al. [2019; 2020] intuitively turns the grid of measured points $x_s$ on the relay wall into a virtual aperture; this makes it possible to formulate the reconstruction of NLOS scenes as a virtual line-of-sight (LOS) problem.
We define $H(x_l, x_s, \Omega)$ as a set of phasors at the relay wall, obtained by the Fourier transform of the measurement matrix $H(x_l, x_s, t)$. In practice, since this function $H$ is noisy, we apply a filtering operation as
$$H_{pf}(x_l, x_s, \Omega) = \mathcal{P}(x_l, x_s, \Omega)\, H(x_l, x_s, \Omega), \quad (1)$$
where $\mathcal{P}(x_l, x_s, \Omega)$ represents a virtual illumination function that acts as a filter over $H$, typically defined as a spatially-invariant illumination function [Liu et al. 2019, 2020]. The hidden scene can then be imaged as an intensity function $I_{pf}(x_v, t)$ on a voxelized space via Rayleigh-Sommerfeld diffraction (RSD) operators as
$$I_{pf}(x_v, t) = \left| \int_L \int_S \int_\Omega H_{pf}(x_l, x_s, \Omega)\, \frac{e^{i \Omega d_l / c}}{d_l}\, \frac{e^{i \Omega d_s / c}}{d_s}\, e^{-i \Omega t}\, d\Omega\, dx_s\, dx_l \right|, \quad (2)$$
where $L$ and $S$ define the illuminated and measured regions on the relay wall, respectively; $d_l = \|x_v - x_l\|$ and $d_s = \|x_v - x_s\|$ are voxel-laser and voxel-sensor distances (see Figure 3); and $\Omega$ represents frequency. Classic NLOS reconstruction methods reconstruct hidden geometry by evaluating $H(x_l, x_s, t)$ at the time of flight of third-bounce illumination paths between scene locations and points on the relay surface [Arellano et al. 2017; Gupta et al. 2012; O'Toole et al. 2018]. This is analogous to evaluating $I_{pf}(x_v, t)$ at $t = 0$, where the RSD propagators have traversed an optical distance $\|\bar{x}\| = d_l + d_s$. We incorporate a similar third-bounce strategy in our path integral formulation, as described in the following. Since estimating surface albedo is challenging due to diffraction effects during the NLOS imaging process [Guillén et al. 2020; Marco et al. 2021], we assume an albedo term per surface point that approximates the averaged reflectance observed from all sensor points.
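To make the filtering and propagation steps of Equations 1 and 2 concrete, the following minimal numpy sketch reconstructs a confocal toy scene: it Fourier-transforms the measurements, applies a Gaussian virtual illumination filter, and back-propagates the phasors with RSD-style phase factors evaluated at $t = 0$. It omits the $1/d$ falloff and aperture weights of the full RSD operator, and all function and parameter names are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def phasor_field_intensity(H, times, sensors, voxels, omega_c, sigma_pf):
    """Toy confocal phasor-field reconstruction (Equations 1 and 2 at t = 0).

    H        : (S, T) transient measurements, one histogram per sensor point
    times    : (T,) optical path length of each time bin (c = 1)
    sensors  : (S, 3) sensor positions on the relay wall
    voxels   : (V, 3) reconstruction voxel centres
    omega_c  : central frequency of the virtual illumination
    sigma_pf : std of its Gaussian envelope in the frequency domain
    """
    dt = times[1] - times[0]
    freqs = 2.0 * np.pi * np.fft.fftfreq(len(times), d=dt)  # angular frequencies
    H_f = np.fft.fft(H, axis=1)                             # phasors H(x_s, Omega)
    # Equation 1: Gaussian virtual illumination acting as a frequency filter
    P = np.exp(-0.5 * ((freqs - omega_c) / sigma_pf) ** 2)
    H_pf = H_f * P[None, :]
    # Equation 2 at t = 0: back-propagation; confocal light travels 2*d per sensor
    d = np.linalg.norm(voxels[:, None, :] - sensors[None, :, :], axis=2)  # (V, S)
    phase = np.exp(1j * freqs[None, None, :] * 2.0 * d[:, :, None])      # (V, S, T)
    return np.abs((H_pf[None, :, :] * phase).sum(axis=(1, 2)))
```

For an ideal pulse reflected by a single hidden scatterer, the intensity peaks at the voxel whose round-trip distance matches the measured time of flight; voxels away from the true surface decohere under the Gaussian frequency envelope.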

Path-space light transport in NLOS scenes
To formally describe transient light transport in an efficient manner, we rely on the transient path integral formulation [Jarabo et al. 2014; Royo et al. 2022]. Transient light transport $H(x_l, x_s, t) \in \mathbb{R}$ can then be expressed as
$$H(x_l, x_s, t) = \int_{\mathcal{P}} \int_{\mathcal{T}} K(\bar{x}, \bar{t})\, d\mu(\bar{t})\, d\mu(\bar{x}), \quad (3)$$
where $K$ is the radiometric contribution in transient path space; $d\mu(\bar{x})$ is the differential measure of path $\bar{x}$; $\mathcal{T}$ represents the domain of temporal measurements; $\bar{t} = t_0 \ldots t_k$ is the sequence of time-resolved measurements at each vertex; $d\mu(\bar{t})$ denotes temporal integration at each vertex; $\bar{x} = x_0 \ldots x_k$ is a transient path of $k+1$ vertices; and $\mathcal{P} = \cup_{k=1}^{\infty} \mathcal{P}_k$ is the entire space of paths with any number of vertices, with $\mathcal{P}_k$ being the space of all paths with $k$ vertices. For convenience and without loss of generality, we ignore the fixed vertices at the laser and sensor device in our formulae.
In practice, $H$ is obtained by the spatio-temporal integration of transient measurements during a time interval $\Delta t$, which accounts for the contribution of all paths $\bar{x}$ with time of flight
$$t(\bar{x}) = \frac{1}{c} \sum_{i=1}^{k} \|x_i - x_{i-1}\|, \quad (4)$$
where $c$ is the speed of light, $x_0 \equiv x_l$, and $x_k \equiv x_s$. We assume no scattering delays at the vertices.
Incorporating the third-bounce strategy of NLOS reconstruction methods into our path integral formulation, we can express $K$ in closed form as
$$K(\bar{x}, \bar{t}) = \Lambda(t_0)\, \Phi(t_k)\, \rho(x_v)\, f(\bar{x}, \bar{t}), \quad (5)$$
where $\Lambda$ is the emitted light from the laser, $\Phi$ is the time-dependent sensor sensitivity function, $\rho$ represents surface reflectance, and $f(\bar{x}, \bar{t})$ is the path throughput, defined by
$$f(\bar{x}, \bar{t}) = v(x_l, x_v)\, v(x_v, x_s)\, \frac{\cos\theta_1 \cos\theta_2 \cos\theta_3 \cos\theta_4}{d_l^2\, d_s^2}, \quad (6)$$
where $v$ is the binary visibility function between two vertices, $d_l = \|x_l - x_v\|$ and $d_s = \|x_v - x_s\|$, and $\theta_{1\text{-}4}$ refer to the angles between the normals of both the relay wall and the surface geometry and the path segments in $\bar{x}$ (see Figure 3). Note that the three-bounce illumination is expressed in path space as $\bar{x} \equiv x_l \to x_v \to x_s$.
Neither the emitted light $\Lambda$ nor the sensor sensitivity $\Phi$ are ideal Dirac delta functions. Yi et al. [2021] and Hernandez et al. [2017] provide the following models for the laser and sensor behavior:
$$\Lambda(t) = I_l\, e^{-t^2 / (2\sigma_l^2)}, \quad (7)$$
$$\Phi(t) = e^{-\lambda_s t} * \mathcal{N}(t; \mu_s, \sigma_s^2), \quad (8)$$
where $\sigma_l$ is the standard deviation of the Gaussian laser pulse, $I_l$ is the laser intensity, $\lambda_s$ is the sensor sensitivity decay rate, $\sigma_s$ is the standard deviation of the sensor jitter, and $\mu_s$ is the offset of the sensor jitter. Since we are only interested in reproducing the combined effect of the laser and sensor models $\Lambda$ and $\Phi$ on the path throughput (Equation 6), we replace them by a single joint laser-sensor correction function $\Lambda_{ls} = \Lambda * \Phi$. Note that the convolution of the two Gaussian functions of Equations 7 and 8 yields a single Gaussian with the joint model parameter
$$\sigma_{ls} = \sqrt{\sigma_l^2 + \sigma_s^2}. \quad (9)$$
We set the sensor jitter offset to $\mu_s = 0$, under the assumption that a uniform distribution of shifts is equally present in all transient measurements. Please refer to the supplemental material for more details on the derivation. Our inverse rendering optimization automatically seeks the optimal parameters of this model based on physically-based transient rendering.
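The joint Gaussian parameter of Equation 9 can be checked numerically: convolving a zero-mean Gaussian laser pulse of width $\sigma_l$ with a zero-mean Gaussian jitter of width $\sigma_s$ yields a single Gaussian of width $\sqrt{\sigma_l^2 + \sigma_s^2}$. A minimal numpy sketch (the widths below are illustrative, not calibrated values):

```python
import numpy as np

def gaussian(t, sigma):
    """Zero-mean normalised Gaussian."""
    return np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# illustrative laser-pulse and sensor-jitter widths (our own choice)
sigma_l, sigma_j = 0.8, 0.6
t = np.linspace(-10.0, 10.0, 2001)
dt = t[1] - t[0]

# numerical convolution of the two Gaussian factors
numeric = np.convolve(gaussian(t, sigma_l), gaussian(t, sigma_j), mode="same") * dt

# closed-form joint model: a single Gaussian with sigma_ls = sqrt(sigma_l^2 + sigma_j^2)
joint = gaussian(t, np.hypot(sigma_l, sigma_j))
```

The two curves agree to numerical precision, which is why a single parameter $\sigma_{ls}$ suffices during optimization.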

DIFFERENTIABLE TIME-GATED NLOS INVERSE RENDERING
In the following, we describe in detail our self-calibrated, end-to-end differentiable inverse rendering pipeline, where the forward pass provides highly detailed reconstructions of the geometry $G$, while the backward pass optimizes per-voxel surface reflectance as albedo $\Theta_\rho$, as well as the system parameters $\Theta_{pf}$ and $\Theta_{ls}$, to improve the forward pass reconstruction. For clarity, from here on we redefine our functions in terms of their parameters to be optimized. Refer to the supplemental material for a summary of the different symbols.

Virtual illumination for RSD propagation
The inputs to our system are the known locations of the illumination $x_l$ and the sensor $x_s$, a matrix $H$ of transient measurements, and an arbitrary virtual illumination function $\mathcal{P}(\Theta_{pf}) \equiv \mathcal{P}(x_l, x_s, \Omega)$ (Equation 1), where $\Theta_{pf}$ represents the optimized parameter space of $\mathcal{P}$. Based on previous works [Liu et al. 2019, 2020; Marco et al. 2021], we define $\Theta_{pf} = \{\sigma_{pf}, \Omega_{pf}\}$ to model a central frequency with a zero-mean Gaussian envelope as
$$\mathcal{P}(\Omega; \Theta_{pf}) = e^{-(\Omega - \Omega_{pf})^2 / (2\sigma_{pf}^2)},$$
where $\sigma_{pf}$ and $\Omega_{pf}$ represent the standard deviation and central frequency, respectively. Note that this equation is fully differentiable.
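A central frequency under a Gaussian envelope corresponds, in the time domain, to a Gaussian-windowed complex exponential pulse, the standard phasor-field illumination shape. A small numpy sketch (with illustrative parameters of our own choosing) verifies that the spectrum of such a pulse peaks at the chosen central frequency:

```python
import numpy as np

def phasor_kernel(t, omega_pf, sigma_t):
    """Complex pulse: central frequency omega_pf under a Gaussian envelope."""
    return np.exp(-0.5 * (t / sigma_t) ** 2) * np.exp(1j * omega_pf * t)

dt = 0.01
t = np.arange(-5.0, 5.0, dt)
omega_pf = 20.0
kernel = phasor_kernel(t, omega_pf, sigma_t=0.5)

# the magnitude spectrum is a Gaussian centred at the chosen frequency
freqs = 2.0 * np.pi * np.fft.fftfreq(len(t), d=dt)
peak_omega = freqs[np.argmax(np.abs(np.fft.fft(kernel)))]
```

Because both the envelope and the central frequency enter the expression smoothly, gradients with respect to $\sigma_{pf}$ and $\Omega_{pf}$ are well defined, which is what the optimization in the backward pass relies on.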
In the forward pass, we first compute the filtered matrix $H_{pf}$ (Equation 1) using the optimized virtual illumination $\mathcal{P}(\Theta_{pf})$ (Figure 2a). We then compute a first estimation of the volumetric intensity $I_{pf}$ of the hidden scene by evaluating RSD propagation (Equation 2) at $t = 0$. Next, we show how to estimate both the geometry $G$ and the time-resolved transport $H'$ at the relay wall.

Implicit surface geometry
Our next goal is to estimate an implicit surface representation $G$ (points $x_v$ and normals $n_v$) by means of a differentiable function $g$ that takes our volumetric intensity function $I_{pf}$ as input, as $G = g(I_{pf})$ (Figure 2b).
We keep an implicit representation of our hidden surface geometry $G$, without creating meshed (explicit) surface geometry, during the whole optimization. The key idea is to use the volumetric data computed at each forward pass to estimate projections of the geometry (i.e., points and normals) visible from the perspective of each sensor point $x_s$ on the relay wall, and to use those to perform path-space differentiable transient rendering at $x_s$.
We first estimate the geometry observed by $x_s$ by sampling rays towards our volumetric intensity $I_{pf}$, and build an implicit representation of the closest surface along each ray. Using information from neighboring rays, we then estimate the normals required to compute the path-space throughput $f$ (Equation 6). Using the implicit geometry computed for every sensing point $x_s$, we then compute time-resolved illumination at $x_s$, as described later in this subsection.
Points. As Figure 4a shows, for each sensor point $x_s$ we sample rays uniformly using concentric hemispherical mapping [Shirley and Chiu 1997]. We then sample points along each ray with ray marching, and estimate the intensity at each sampled point (blue in Figure 4a) by trilinear interpolation of the neighboring voxel intensities of $I_{pf}$ (red). From the interpolated volumetric intensities $I_{pf,i}$ (Figure 4b, left), we estimate the distance $d_v$ between $x_s$ and the hidden surface vertex $x_v$ (Figure 4b, right), assuming $x_v$ is located at the maximum intensity along the ray. To find $d_v$ in free space from the ray-marched intensities in a differentiable manner, we use a soft-argmax function:
$$d_v = \sum_i p_i\, d_i, \qquad p_i = \frac{e^{\gamma I_{pf,i}}}{\sum_j e^{\gamma I_{pf,j}}},$$
where $d_i$ is a ray-marched distance from $x_s$, $p_i$ is a probability density over the distances $d_i$, and $I_{pf,i}$ is the volumetric intensity at distance $d_i$ along the ray. The hyperparameter $\gamma$ determines the sensitivity in blending neighboring probabilities, and is set to $10^3$ in all our experiments. If $I_{pf}$ falls below a threshold, we assume that no surface has been found; we set this threshold to 0.05 for synthetic scenes and 0.2 for real scenes throughout the paper. Our procedure implicitly estimates surface points $x_v$ at distances $d_v = \|x_v - x_s\|$ by observing, via ray marching, the grid of phasor-field intensities $I_{pf}$ from the perspective of the sensing points $x_s$.
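The soft-argmax depth estimate can be sketched in a few lines of numpy (our own minimal version; the actual pipeline evaluates it on PyTorch tensors so that gradients flow through $\gamma$ and $I_{pf}$):

```python
import numpy as np

def soft_argmax_depth(intensities, depths, gamma=1e3):
    """Differentiable estimate of the depth of the intensity maximum along a ray."""
    logits = gamma * intensities
    w = np.exp(logits - logits.max())  # numerically stable softmax weights
    w /= w.sum()
    return float((w * depths).sum())   # expected depth under the softmax density

depths = np.linspace(0.0, 2.0, 21)                  # ray-marched distances d_i
intensities = np.exp(-((depths - 1.2) / 0.1) ** 2)  # synthetic peak at d = 1.2
d_v = soft_argmax_depth(intensities, depths)
```

With a large $\gamma$ the softmax concentrates on the strongest sample, so the estimate approaches a hard argmax while remaining differentiable everywhere.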
Normals. As shown in Figure 4c, we estimate the normal $n_v$ at vertex $x_v$ based on the distances at neighboring ray samples in the concentric hemispherical mapping. We compute the normals of the two triangles formed by these neighboring points via the cross product of two of their edges, and compute $n_v$ as the normalized sum of the normals of those two triangles.
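This normal estimation step can be sketched as follows, assuming four neighboring surface points recovered by adjacent rays (the point layout, triangle split, and function names are our own illustration):

```python
import numpy as np

def triangle_normal(a, b, c):
    """Unit normal of triangle (a, b, c) via the cross product of two edges."""
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n)

def vertex_normal(p1, p2, p3, p4):
    """Normal at x_v from four neighbouring surface points, as the normalised
    sum of the normals of triangles (p1, p2, p3) and (p3, p4, p1)."""
    n = triangle_normal(p1, p2, p3) + triangle_normal(p3, p4, p1)
    return n / np.linalg.norm(n)
```

Since cross products and normalization are smooth in the point positions, gradients propagate from the normals back to the soft-argmax depths.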
Surface albedo. Besides points and normals (updated implicitly during each forward pass), computing the path contribution $K$ (Equation 5) at sensor points $x_s$ requires a per-point monochromatic albedo $\rho$. We estimate albedos by evaluating the physical observation of the transient measurements in the backward pass.

Differentiable transient rendering
The next step during the forward pass is to obtain time-resolved illumination $H'$ at $x_s$ through transient rendering. In our pipeline (Figure 2c), we represent this step as $H' = \mathcal{S}(G; \Theta_G, \Theta_{ls})$, where $\mathcal{S}(\cdot)$ computes third-bounce time-resolved light transport at the sensing points $x_s$. We use the rays sampled from $x_s$ (Figure 4b) to compute the radiometric contribution $K(\bar{x}, \bar{t})$ of the implicit surface points $x_v$ estimated by those rays, following Equations 5 through 9.
Visibility. Differentiating the binary visibility function $v$, necessary to compute the path throughput $f$ (Equation 6), is challenging.
However, note that we estimate an implicit surface at $x_v$ based on volumetric intensities, which strongly depend on the illumination from the laser reaching the surface and traveling back to the sensor without encountering any occluder. Based on this, we avoid computing the visibility term by assuming that the volumetric intensities are a good estimator of the geometry visible from the perspective of both the laser and sensor positions on the relay wall.
Transient rendering. The radiometric contribution $K(\bar{x}, \bar{t})$ (Equation 5) yields time-resolved transport in path space for a single path $\bar{x} \equiv x_l \to x_v \to x_s$. Our goal is to obtain a set of discrete transient measurements $H'$ from all paths arriving at each sensing point $x_s$, such that $H'$ is comparable to the captured matrix $H$. To this end, we first discretize $|K(\bar{x}, \bar{t})|$ into neighboring bins using a differentiable Gaussian distribution function as
$$w_b(t) = e^{-(t_b - t)^2 / (2\sigma_b^2)},$$
where $b$ is a transient bin index, $t$ is the continuous time of flight of $\bar{x}$ (Equation 4), and $\sigma_b$ is set to 0.62 so that the FWHM of the Gaussian distribution covers a unit time bin. The time-resolved measurement $H'(x_l, x_s, b)$ at temporal index $b$ is then approximated as the sum of the discrete path contributions $K(\bar{x}, b)$ sampled through the concentric disk mapping as
$$H'(x_l, x_s, b) = \sum_{\bar{x} \in X} K(\bar{x}, b),$$
where $X$ is the set of paths $\bar{x}$ that start at $x_l$ and end at $x_s$. After generating the rendered transient data $H'$, we apply our joint laser-sensor model to obtain the sensed transient data $H''$:
$$H''(x_l, x_s, b) = (\Lambda_{ls} * H')(x_l, x_s, b) + I_o,$$
where $I_o$ is an intensity offset parameter that accounts for ambient light and the dark count rate of the sensor.
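The Gaussian binning and per-path accumulation can be sketched as follows (the normalization of the bin weights is our own choice for this sketch; the text only specifies the Gaussian with $\sigma_b = 0.62$):

```python
import numpy as np

def bin_weights(t, n_bins, sigma_b=0.62):
    """Differentiable weights spreading a continuous arrival time t over
    discrete transient bins with a Gaussian of std sigma_b."""
    bins = np.arange(n_bins)
    w = np.exp(-0.5 * ((bins - t) / sigma_b) ** 2)
    return w / w.sum()  # normalised so each path deposits its full energy

def render_histogram(path_times, path_values, n_bins):
    """Accumulate per-path contributions |K| into a transient histogram H'."""
    H = np.zeros(n_bins)
    for t, v in zip(path_times, path_values):
        H += v * bin_weights(t, n_bins)
    return H
```

Unlike hard binning (a non-differentiable floor operation), each path's contribution varies smoothly with its time of flight, so gradients can flow from the histogram back to the surface point positions.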

Optimization of system parameters
Our final goal is to estimate the system parameters $\Theta = \{\Theta_{pf}, \Theta_{ls}, \Theta_\rho\}$ that minimize the loss between the measured matrix $H$ and the rendered matrix $H''$ (Figure 2, red). We define this as
$$\Theta^* = \arg\min_\Theta \mathcal{L}(H, H''(\Theta)), \quad (12)$$
which we minimize by gradient descent. The transient cost function $\mathcal{L}$ consists of a data term and regularization terms as
$$\mathcal{L} = \mathcal{L}_H + \mathcal{L}_{I_{pf}} + \mathcal{L}_\rho. \quad (13)$$
The data term $\mathcal{L}_H$ computes an $\ell_2$ norm between the transient measurements $H$ and $H''$:
$$\mathcal{L}_H = \frac{1}{N_H} \|H - H''\|_2^2, \quad (14)$$
where $N_H$ is the total number of elements of $H$. The key insight behind this loss term is that $H''$ is the byproduct of time-resolved illumination computed from our implicit geometry $G$, which was itself generated from the volumetric intensities $I_{pf}$ by means of RSD propagation of the ground truth $H$. The difference between $H$ and $H''$ is therefore a critical measure of the accuracy of the geometry $G$ and of $I_{pf}$. By backpropagating this loss through our pipeline, we optimize all system parameters, which improves the estimation of $I_{pf}$, $G$, and therefore $H''$.
The term $\mathcal{L}_{I_{pf}}$ in Equation 13 is a volumetric intensity regularization term that imposes sparsity, pursuing a clean image:
$$\mathcal{L}_{I_{pf}} = \lambda_1 \frac{1}{N_z} \|I_{pf,z}\|_1, \quad (15)$$
where $I_{pf,z}$ is the maximum-intensity projection of $I_{pf}$ onto the $z$ plane, $N_z$ is the number of pixels of $I_{pf,z}$, and $\lambda_1$ is a loss-scale balance hyperparameter, set to $10^2$ in all our experiments. The term $\mathcal{L}_\rho$ in Equation 13 is a regularization term that imposes smoothness, suppressing surface reflectance noise:
$$\mathcal{L}_\rho = \lambda_2 \frac{1}{N_v} \sum_{x_v} \|\nabla \rho(x_v)\|_2^2, \quad (16)$$
where $N_v$ is the number of voxels $x_v$, and $\lambda_2$ is a loss-scale balance hyperparameter, set to $5 \times 10^{-3}$ in all our experiments. All terms $\mathcal{L}_H$, $\mathcal{L}_{I_{pf}}$, and $\mathcal{L}_\rho$ of the loss function are computed over batches of transients and voxels at every iteration.
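A minimal numpy sketch of this cost function follows, under our own reading of the regularizers (L1 norm of the max-projected intensity for sparsity, squared finite differences of the albedo volume for smoothness); the paper's exact definitions may differ:

```python
import numpy as np

def nlos_loss(H, H_rendered, I_pf, albedo, lam1=1e2, lam2=5e-3):
    """Sketch of the transient cost: data term plus two regularisers.

    H, H_rendered : measured and rendered transient matrices
    I_pf          : (X, Y, Z) volumetric intensity
    albedo        : per-voxel reflectance volume
    """
    # data term: mean squared error over all transient elements
    data = np.mean((H - H_rendered) ** 2)
    # sparsity: L1 norm of the maximum-intensity projection along z
    proj = I_pf.max(axis=2)
    sparsity = lam1 * np.mean(np.abs(proj))
    # smoothness: squared finite differences of the albedo volume (our assumption)
    smooth = lam2 * sum(np.mean(np.diff(albedo, axis=a) ** 2)
                        for a in range(albedo.ndim))
    return data + sparsity + smooth
```

Every operation is differentiable, so the same expression written with PyTorch tensors backpropagates to $\Theta_{pf}$, $\Theta_{ls}$, and the per-voxel albedo.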

RESULTS
We implement our pipeline in PyTorch. Our code runs on a 2.45 GHz AMD 7763 CPU equipped with a single NVIDIA A100 GPU. 3D geometry is obtained from points and normals using Poisson surface reconstruction [Kazhdan and Hoppe 2013]. Please note that we do not perform any thresholding or masking of the data prior to this step. We evaluate our method on four real confocal datasets, Bike, Resolution, SU, and 34, provided by O'Toole et al. [2018].
The real datasets include all illumination bounces and different levels of noise depending on their exposure time. The synthetic datasets include up to third-bounce illumination. In specific cases, we manually add Poisson noise to the synthetic datasets to evaluate our robustness to signal degradation.

Convergence of system parameters
In Figure 5, we show the convergence of our system parameters in a full optimization of the real Bike scene, together with the final reconstruction of both volumetric intensity and geometry. The phasor-field kernel parameters $\Omega_{pf}$ and $\sigma_{pf}$ (first column) are responsible for improving the reconstruction quality by constructing a phasor kernel (fourth column, top) that yields highly detailed geometry. The laser and sensor parameters (second and third columns) improve the reconstruction of the transient measurements, so that the transient simulation (fourth column, bottom, orange) resembles the input data (blue) as closely as possible. Refer to the supplemental material for more results of the progressive optimization.
We evaluate the impact of each component in our optimization pipeline (phasor kernel, albedo, and laser-sensor model) using a 256 × 256 × 201 voxel volume. As Table 1 shows, adding albedo and laser-sensor parameters improves the result over just using the phasor parameters, while including all three components yields the best results. The impact of optimizing the albedo is the most significant in this experiment. (In Table 1, a lower signal-to-noise ratio (SNR) value indicates an exponentially higher level of noise.)

Robustness to noise
To illustrate the robustness of our method to signal degradation, in Figure 6 we show reconstructions of the synthetic Bunny dataset under increasing levels of Poisson noise (from left to right) applied to the input transient data. The first row shows the final volumetric reconstruction after the optimization, while the second row shows the resulting surface estimation. The third row shows a comparison between the input transient illumination (blue) and our converged transient illumination at the same location, as produced by our estimated geometry (orange). The parameters optimized by our pipeline produce a volumetric reconstruction robust enough for our surface estimation method to obtain reliable 3D geometry under a broad spectrum of noise levels. Note that while the volumetric outputs may show noticeable noise (first row), our pipeline optimizes the imaging parameters so that these outputs provide a good baseline for our geometry estimation method, which yields surface reconstructions that consistently preserve geometric details across varying noise levels (second row).
In Figure 7, we compare our method with existing volumetric approaches on two real confocal scenes, Resolution and Bike, captured under different exposure times. For each scene, the first to fourth columns show the compared methods: O'Toole et al. [2018], Lindell et al. [2019], Liu et al. [2020], and ours, respectively. The first to fourth rows show the resulting volumetric intensity images under increasing exposure times of 10, 30, 60, and 180 minutes, respectively. Our method converges to imaging parameters that produce the sharpest results while significantly removing noise, even under the lowest exposure time (top row). Other methods degrade notably at lower exposure times, failing to reproduce details in the resolution chart or yielding noisy outputs in the Bike scene.
While LCT [O'Toole et al. 2018] allows manually selecting an SNR filtering parameter to improve results in low-SNR conditions, our experiments with different parameter values from 0.001 to 1.0 at different exposure levels confirm that our automated calibration approach outperforms LCT, reproducing detailed geometric features (see supplemental material).

Inverse rendering
Our optimization pipeline estimates surface points, normals, and albedo using only the input transient measurements. Figure 8 illustrates our volumetric intensity, as well as surface points, normals, and albedo in the confocal synthetic Bunny scene, composed of two different surface albedos: 1.0 (top) and 0.3 (bottom). Our method is consistent when estimating spatially-varying albedo, without affecting the estimation of detailed surface points and normals.
Figure 9 demonstrates our inverse rendering results on real scenes. As shown in the confocal scene SU (first row) and the two non-confocal scenes 44i (second row) and NLOS (third row), we correctly estimate the albedo of objects with uniform reflectance properties (second column), even though they undergo different attenuation factors due to being at different distances from the relay wall. The result for the non-confocal NLOS scene (third row) shows that the estimated albedo is almost identical throughout the entire surface. Our estimation of surface points and normals (third and fourth columns) accurately reproduces the structure of the hidden geometry.
In Figure 1, we illustrate the benefits of our inverse rendering optimization on the real Bike scene. The first row shows the first iteration of the optimization, which uses the volumetric output of Liu et al. [2020] with the default parameters of the illumination function. The resulting noise heavily degrades the geometry and normal estimation (top right), and albedo is wrongly estimated at empty locations in the scene despite the lack of any surface there (top center). After our optimization converges (bottom row), the albedo is estimated only at surface locations, yielding a clean reconstruction of the bike's surface points and normals.

Geometry accuracy
In Figure 10, we compare the reconstructed geometry with surface normals in two real scenes (34 and SU) using D-LCT [Young et al. 2020], NeTF [Shen et al. 2021], a differentiable rendering approach [Plack et al. 2023], and our method. Existing methods fail to reproduce detailed surface features in both scenes, such as the subtle changes in depth of the numbers. Plack et al.'s method (fourth column) fails to reproduce the partially occluded U-shaped object and some regions of the S-shaped object in the SU scene. D-LCT (second column) succeeds in reproducing the U-shaped object but fails to reconstruct the detailed geometry of the boundaries of the letters. While NeTF (third column) is capable of reproducing the U-shaped object, its methodology, based on positional encoding and neural rendering, suppresses geometric details significantly, producing coarse geometry. Plack et al.'s method faces similar challenges in reproducing geometric details due to the constraints imposed by the resolution of its explicit proxy geometry. Previous optimization-based methods that also rely on explicit geometry [Iseringhausen and Hullin 2020; Tsai et al. 2019] share similar limitations. Our method, based on implicit surface representations, is able to handle partial occlusions while reproducing detailed surface features, such as the depth changes on the numbers and the narrow segments of the letters.
In Figure 11, we provide quantitative comparisons between our estimated geometry and the geometry obtained from D-LCT [Young et al. 2020], NeTF [Shen et al. 2021], and Plack et al. [2023] for three synthetic scenes, Dragon, Erato, and Indonesian, using the Hausdorff distance map as an objective metric. In terms of geometric accuracy, we outperform all three methods in Erato and Dragon, as shown in the RMSE table. Our improvements are especially noticeable in self-occluded regions and in the reproduction of detailed features. While Plack et al. [2023] yields a lower RMSE in the Indonesian scene, note that it fails to reproduce large regions on the sides of the geometry. Since RMSE is only computed on the reconstructed regions, it may not fully represent the overall accuracy of the reconstruction.
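For reference, the symmetric Hausdorff distance between two point sets can be sketched in a few lines of NumPy. This is a brute-force toy implementation (names and the toy data are ours, not the paper's evaluation code):

```python
import numpy as np

def directed_hausdorff(a, b):
    """Max over points of a of the distance to the closest point of b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).max()

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a and b."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy check: shifting a point set by 0.1 yields a distance of 0.1.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
est = gt + np.array([0.1, 0.0])
print(hausdorff(gt, est))
```

The pairwise-distance matrix makes this O(NM) in memory, which is fine for toy sets; practical evaluations on dense reconstructions typically use spatial acceleration structures instead.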

DISCUSSION AND FUTURE WORK
We have presented an efficient and fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates the imaging parameters using only the measured transient illumination as input.
Our method is robust in the presence of noise while achieving enhanced scene reconstruction accuracy.
Even though forward-mode automatic differentiation (AD) is known to be memory-efficient, we implemented our pipeline using reverse-mode AD: we found it to be 20 times faster, it performs better when optimizing a large number of parameters (such as per-voxel albedo), and it supports a wider set of differentiable functions required in our context.
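To make the forward/reverse trade-off concrete, here is a deliberately tiny reverse-mode AD sketch in pure Python (a toy illustration, not our pipeline's AD engine): a single backward pass from the scalar loss produces the gradient for every parameter at once, whereas forward mode would need one pass per parameter.

```python
class Var:
    """Minimal reverse-mode AD node for scalar values."""
    def __init__(self, value, parents=()):
        self.value = value      # forward value
        self.parents = parents  # (parent, local derivative) pairs
        self.grad = 0.0         # accumulated adjoint

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Propagate the adjoint to every ancestor (naive recursion;
        # fine for this small, tree-shaped graph).
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# Many parameters, one scalar loss: loss = sum of squared parameters.
params = [Var(v) for v in (1.0, -2.0, 3.0)]
loss = Var(0.0)
for p in params:
    loss = loss + p * p

loss.backward()  # ONE pass yields d(loss)/dp for all parameters
print([p.grad for p in params])  # [2.0, -4.0, 6.0]
```

With many parameters (e.g. per-voxel albedo) and a single scalar loss, one reverse pass per loss evaluation is far cheaper than one forward-mode pass per parameter; that asymmetry is what the paragraph above refers to.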
Phasor-field NLOS imaging can be performed analogously using temporal- or frequency-domain operators [Liu et al. 2019, 2020]. However, operating in the temporal domain introduces large memory constraints that are impractical in a differentiable pipeline. Our pipeline therefore operates in the frequency domain to perform NLOS imaging, which provides a practical implementation of convolutions with complex-valued phasor-field kernels within GPU memory constraints. While we based our volumetric NLOS imaging on phasor-based operators and kernels, an interesting avenue for future work is optimizing alternative kernel parameterizations or implementing other differentiable NLOS imaging approaches.
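The equivalence the frequency-domain implementation relies on is the convolution theorem. The following NumPy sketch (a toy complex kernel with arbitrary widths, only loosely shaped like a phasor-field kernel) checks that a time-domain convolution matches a frequency-domain pointwise product:

```python
import numpy as np

rng = np.random.default_rng(0)
transient = rng.random(256)        # stand-in for a measured transient
t = np.arange(-32, 32)
# Toy complex kernel: Gaussian envelope times a complex carrier.
kernel = np.exp(-t**2 / 64.0) * np.exp(2j * np.pi * t / 16.0)

# Time-domain (direct) convolution
direct = np.convolve(transient, kernel)

# Frequency-domain convolution: zero-pad both to the full output
# length, multiply spectra pointwise, and invert.
n = len(transient) + len(kernel) - 1
freq = np.fft.ifft(np.fft.fft(transient, n) * np.fft.fft(kernel, n))

print(np.allclose(direct, freq))  # True
```

On the GPU, the pointwise product also avoids materializing intermediate sliding-window buffers, which is the memory advantage referred to above.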

JOINT LASER-SENSOR CORRELATION MODEL
The joint laser-sensor model is derived as Equation 1, following previous related work [Chen et al. 2020; Hernandez et al. 2017].
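One step of this derivation relies on the fact that the convolution of two Gaussians (the laser pulse and the Gaussian part of the sensor jitter) is again a Gaussian whose variance is the sum of the input variances. A small NumPy check with toy widths (the values 0.5 and 1.2 are illustrative, not calibrated parameters):

```python
import numpy as np

dt = 0.01
t = np.arange(-10.0, 10.0, dt)

def gaussian(t, sigma):
    return np.exp(-t**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

g_laser = gaussian(t, 0.5)    # toy laser pulse width
g_sensor = gaussian(t, 1.2)   # toy Gaussian part of the sensor jitter

merged = np.convolve(g_laser, g_sensor) * dt   # discrete convolution

# Empirical variance of the merged profile
tc = np.arange(len(merged)) * dt
w = merged / merged.sum()
mean = (tc * w).sum()
var = ((tc - mean)**2 * w).sum()
print(var)  # ~1.69 = 0.5**2 + 1.2**2
```

This is why the two Gaussian factors in the model can be collapsed into a single Gaussian with one width parameter before being convolved with the exponential sensor response.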

ADDITIONAL RESULTS
This section provides additional validations and results.
Manual parameter adjustment vs. our self-calibration. Figure 1 compares the estimated volumetric intensities of the Bike and Resolution scenes by two different methods: the light cone transform (LCT) [O'Toole et al. 2018] and ours. To handle noise in the input dataset, we manually tweak the SNR parameter of the LCT method over a very wide range, from 0.001 to 1.0. Our method yields clearer results than any of the results under the explored values of the SNR parameter of LCT, throughout all exposure levels.
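For context, an SNR parameter of this kind typically enters through a Wiener-style inverse filter. The 1D sketch below (our own toy signal and kernel, not the LCT implementation) shows the trade-off the parameter controls: large values trust the measurement and recover detail, small values over-regularize:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128
signal = np.zeros(n)
signal[40], signal[80] = 1.0, 0.5              # toy hidden impulses
kernel = np.exp(-np.arange(-8, 9)**2 / 8.0)    # toy Gaussian blur
K = np.fft.fft(kernel, n)

blurred = np.real(np.fft.ifft(np.fft.fft(signal) * K))  # circular blur
noisy = blurred + 0.01 * rng.standard_normal(n)         # capture noise

def wiener_deconv(y, K, snr):
    # Acts like 1/K where |K|^2 dominates 1/snr, and rolls off where
    # noise dominates; small snr values over-suppress detail.
    W = np.conj(K) / (np.abs(K)**2 + 1.0 / snr)
    return np.real(np.fft.ifft(np.fft.fft(y) * W))

sharp = wiener_deconv(noisy, K, snr=100.0)   # trusts the measurement
smooth = wiener_deconv(noisy, K, snr=0.001)  # heavy regularization

err_sharp = np.linalg.norm(sharp - signal)
err_smooth = np.linalg.norm(smooth - signal)
print(err_sharp < err_smooth)  # True
```

The point of self-calibration is that this scalar need not be hand-tuned per capture: it can be recovered by gradient descent on the transient reconstruction error.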
Progressive optimization results. Figure 2 shows the detailed progress of the optimization in the Dragon and Erato scenes, displaying the evolution of the phasor-field kernel until the converged state. While the full optimization takes 100 iterations (1.28 hours), after only 50 iterations (39 minutes) the phasor-field kernel parameters already yield volumetric and geometric reconstructions very close to the converged result; the remaining iterations refine local details.

Figure 3: NLOS imaging setup. A laser emits a pulse of light, which travels to the relay wall, then to the hidden geometry, back to the relay wall, and reaches the sensor after a total travel time t = t_1 + t_2 + t_3 + t_4. The inset shows the sensor response; the peak at t indicates the presence of a hidden object.

Figure 4: Geometry estimation procedure. (a) We ray-march from sensor points x_s and estimate the intensity at each point along the ray by trilinear interpolation of the volumetric intensity. (b) From the discrete ray-marching samples, we obtain a continuous depth function. (c) Normals are computed from the distances at neighboring ray samples in the concentric hemispherical mapping.

Figure 5: Convergence of the imaging parameters optimized by our method in the Bike real scene. From left to right: phasor kernel parameters (Ω_pf, σ_pf), the four laser-sensor joint model parameters (Θ_ls), the converged phasor kernel (purple and green for the real and imaginary parts), the measured transients compared to our reconstructed ones, and our reconstruction results after the optimization. The yellow line indicates when the optimization converges. The converged phasor kernel yields a high-quality reconstruction, while the laser and sensor parameters provide an accurate estimation of the transient illumination.

Figure 6: Evaluation of our surface reconstruction under increasing levels of Poisson noise (left to right). From top to bottom: intensity volume, reconstructed geometry, and measured vs. optimized transport. Our method reconstructs geometry reliably across a broad spectrum of noise levels. A lower signal-to-noise ratio (SNR) value indicates a higher, exponentially increasing level of noise.
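A toy version of this noise model (illustrative, not our capture simulation): a clean transient is scaled to an expected photon count and corrupted with Poisson-distributed shot noise, so a smaller photon budget means a lower SNR:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)
transient = np.exp(-(t - 4.0)**2 / 0.1)   # toy clean transient peak

def add_poisson_noise(rng, signal, photons):
    """Scale to an expected photon count, draw Poisson-distributed
    counts, and rescale; fewer photons means a lower SNR because
    shot noise grows as the square root of the mean."""
    return rng.poisson(signal * photons) / photons

low_noise = add_poisson_noise(rng, transient, 1e6)   # generous budget
high_noise = add_poisson_noise(rng, transient, 5.0)  # photon-starved

print(np.abs(low_noise - transient).max())   # small
print(np.abs(high_noise - transient).max())  # large
```

Since the relative shot noise scales as one over the square root of the expected count, sweeping the photon budget over orders of magnitude produces the exponentially increasing noise levels used in the evaluation.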

Figure 8: Our optimization scheme estimates spatially-varying albedo in a consistent manner, without affecting the surface and normal estimation. From left to right: synthetic Bunny scene with two different albedos (0.3 and 1.0), our converged volumetric intensity, the optimized albedo, and the estimated geometry.

Figure 10: Reconstructed geometry with surface normals in the real scenes 34 and SU. Columns: photograph, D-LCT [Young et al. 2020], NeTF [Shen et al. 2021], Plack et al. [2023], and ours.

Figure 11: Quantitative comparison of our estimated geometry against D-LCT [Young et al. 2020], NeTF [Shen et al. 2021], and Plack et al. [2023] using synthetic transient data with ground-truth geometries Dragon, Erato, and Indonesian. Errors are quantified using the Hausdorff distance between the ground-truth and estimated geometries.

Pipeline overview: (a) Phasor-field NLOS imaging takes a set of transient measurements as input, outputting the volumetric intensity. (b) We estimate G, an implicit geometric representation of the hidden scene, from the volumetric intensity. (c) We obtain the time-resolved illumination from G using differentiable path-space transient rendering. (d) We optimize the imaging parameters until the error between the measured and rendered transients converges, together with regularization terms Γ. Geometry G is computed during the forward pass, while Θ_pf, Θ_ls, and Θ_ρ are updated during the backward pass.

Table 1: Ablation study of the impact of each component.
MSE transient loss comparison between different configurations of the Bunny scene with two different albedos (Figure 8).

In Equation 1, η denotes the photon detection efficiency, B is the ambient light, and D is the dark count rate. Φ is the sensor model function, which can be expressed as the convolution of an exponential function f_e and a Gaussian f_g. Λ is the laser function, which has the shape of a Gaussian f_l. Since the convolution of the two Gaussians f_g and f_l can be merged into a single Gaussian f_G, the convolution of f_e and f_G can be expressed as a single function Ψ with three parameters:

η (Φ ∗ (Λ ∗ H + B)) + D
  = η (Φ ∗ Λ ∗ H + Φ ∗ B) + D
  = η ((f_e ∗ f_g) ∗ f_l ∗ H + (f_e ∗ f_g) ∗ B) + D
  = η (f_e ∗ (f_g ∗ f_l) ∗ H + (f_e ∗ f_g) ∗ B) + D
  = η (f_e ∗ f_G ∗ H + (f_e ∗ f_g) ∗ B) + D
  = η (Ψ ∗ H + B′) + D
  = Ψ ∗ H + C,    (1)

where H denotes the incident transient illumination, the scaled ambient term B′ and the dark count rate D are summed into a single offset C, and the detection efficiency η is absorbed into the amplitude of Ψ. Our joint laser-sensor correlation model thus finally has four parameters, and these values are optimized in our self-calibrating pipeline.

SA Conference Papers '23, December 12-15, 2023, Sydney, NSW, Australia. © 2023 Copyright held by the owner/author(s).
Table 2 summarizes the type of data (confocal or non-confocal), as well as the dimensions of the transient data, the dimensions of the reconstructed volume, the total reconstruction time, and the number of iterations before convergence; note that most of our scenes are significantly larger than results previously reported by transient optimization methods.

Table 1: Main notations and symbols used in the paper.

x = x_0 … x_k: Light path of k+1 vertices
x_l: Light source point on the relay wall
x_h: Surface point in the hidden scene
x_s: Sensor point on the relay wall
x_v: Voxel in a volumetric grid
n_h: Surface normal in the hidden scene
G: Scene geometry parameters: points x_h and normals n_h
t = t_0 … t_k: Time delays on k+1 vertices
d: Distance between the hidden surface and the relay wall
Θ_pf: Parameters of the phasor-field kernel: Ω_pf, σ_pf
Θ_ls: Parameters of the laser and sensor models (four parameters)
Θ_ρ: Parameters of the per-voxel albedo ρ
Θ: Set of optimized variables: Θ = {Θ_pf, Θ_ls, Θ_ρ}

Table 2: Configurations of our input datasets, including convergence time and the number of iterations needed. Columns: confocal (yes/no), transient measurement dimensions, volume dimensions, and time [hr. (#iterations)].