Adaptive Shells for Efficient Neural Radiance Field Rendering

Neural radiance fields achieve unprecedented quality for novel view synthesis, but their volumetric formulation remains expensive, requiring a huge number of samples to render high-resolution images. Volumetric encodings are essential to represent fuzzy geometry such as foliage and hair, and they are well-suited for stochastic optimization. Yet, many scenes ultimately consist largely of solid surfaces which can be accurately rendered by a single sample per pixel. Based on this insight, we propose a neural radiance formulation that smoothly transitions between volumetric- and surface-based rendering, greatly accelerating rendering speed and even improving visual fidelity. Our method constructs an explicit mesh envelope which spatially bounds a neural volumetric representation. In solid regions, the envelope nearly converges to a surface and can often be rendered with a single sample. To this end, we generalize the NeuS [Wang et al. 2021] formulation with a learned spatially-varying kernel size which encodes the spread of the density, fitting a wide kernel to volume-like regions and a tight kernel to surface-like regions. We then extract an explicit mesh of a narrow band around the surface, with width determined by the kernel size, and fine-tune the radiance field within this band. At inference time, we cast rays against the mesh and evaluate the radiance field only within the enclosed region, greatly reducing the number of samples required. Experiments show that our approach enables efficient rendering at very high fidelity. We also demonstrate that the extracted envelope enables downstream applications such as animation and simulation.


INTRODUCTION
Neural radiance fields, which we will refer to as NeRFs, have recently emerged as a powerful 3D representation enabling photorealistic novel-view synthesis and reconstruction. Unlike traditional explicit methods for novel-view synthesis, NeRFs forego high-quality mesh reconstruction and explicit surface geometry in favor of neural networks, which encode the volumetric density and appearance of a scene as a function of 3D spatial coordinates and viewing direction. However, the high visual fidelity of NeRFs comes at a great computational cost, as the volume rendering formulation requires a large number of samples along each ray and ultimately prevents real-time synthesis of high-resolution novel views. In tandem, explicit reconstruction and novel-view synthesis have continued to make great progress by leveraging advances in inverse rendering and data-driven priors, but a fidelity gap remains. The goal of this work is to close this gap by developing a neural volumetric formulation that leverages explicit geometry to accelerate rendering, without sacrificing quality.
Much recent and concurrent work has likewise sought to improve the efficiency of NeRF representations and volume rendering. An important step towards this goal was the evolution from a single global multi-layer perceptron (MLP) representation [Mildenhall et al. 2020] to local sparse feature fields combined with shallow MLP decoders [Müller et al. 2022; Sara Fridovich-Keil and Alex Yu et al. 2022]. This resulted in several orders-of-magnitude speed-ups. Complementary research to improve the efficiency of NeRFs proposed replacing the neural networks with simpler functions such as spherical harmonics, or baking the volumetric representation onto proxy geometry that accelerates rendering [Chen et al. 2023; Yariv et al. 2023]. The latter formulation enables especially large speedups and facilitates real-time rendering even on commodity devices [Chen et al. 2023]. Yet, doing so compromises quality, as the scene content is projected onto proxy geometry.
In this work, we instead aim to make NeRF rendering more efficient while maintaining or even improving the perceptual quality. To this end, we propose a narrow-band rendering formulation that enables efficient novel-view synthesis while enjoying the desirable properties of the volumetric representation (Figure 1, left). Our method is inspired by the insight that different regions of the scene benefit from different styles of rendering. Indeed, fuzzy surfaces with intricate geometry and complex transparency patterns benefit greatly from exhaustive volume rendering, while conversely, opaque smooth surfaces can be well- or potentially even better-represented by a single sample where the ray intersects the surface. This observation allows us to better distribute the computational cost across the rays by assigning only as many samples as needed to faithfully represent the ground-truth appearance.
With the introduction of auxiliary acceleration data structures that promote empty-space skipping [Müller et al. 2022], NeRFs can already render images with a varying number of samples per ray. Still, several challenges prevent current formulations from efficiently adapting to the local complexity of the scene (Figure 2). First, the memory footprint of grid-based acceleration structures scales poorly with resolution. Second, the smooth inductive bias of MLPs hinders learning a sharp impulse or step function for the volume density, and even if such an impulse were learned, it would be difficult to sample efficiently. Finally, due to the lack of constraints, the implicit volume density field fails to accurately represent the underlying surfaces [Wang et al. 2021], which often limits its use in downstream tasks that rely on mesh extraction.

Fig. 2. One state-of-the-art approach to accelerate volumetric rendering is to skip empty voxels; however, this still requires multiple samples within occupied voxels (left). Our approach extracts a narrow-band mesh, for which a single sample at the midpoint is a very good approximation of the surface (right).
To remedy the last point, [Wang et al. 2021, 2022a; Yariv et al. 2021] propose to optimize a signed distance function (SDF) along with a kernel size encoding the spread of the density, rather than optimizing the density directly. While this is effective for improving surface representations, the use of a global kernel size contradicts the observation that different regions of the scene demand adaptive treatment.
To address the above challenges, we propose a new volumetric neural radiance field representation. In particular: i) We generalize the NeuS [Wang et al. 2021] formulation with a spatially-varying kernel width that converges to a wide kernel for fuzzy surfaces, while collapsing to an impulse function for solid opaque surfaces, without additional supervision. This improvement alone results in increased rendering quality across all scenes in our experiments. ii) We use the learned spatially-varying kernel width to extract a mesh envelope of a narrow band around the surface. The width of the extracted envelope adapts to the complexity of the scene, and the envelope serves as an efficient auxiliary acceleration data structure. iii) At inference time, we cast rays against the envelope in order to skip empty space and sample the radiance field only in regions which contribute significantly to the rendering. In surface-like regions, the narrow band enables rendering from a single sample, while progressing to a wider kernel and local volumetric rendering for fuzzy surfaces.
The experiments of Section 4 validate the effectiveness of our formulation across several data sets. In addition, the applications of Section 5 demonstrate the benefits of our representation.

RELATED WORK
Synthesizing novel views from a set of images is a longstanding problem in the fields of computer vision and graphics. Classical approaches to novel-view synthesis can be roughly categorized based on the coverage density of the input images. In particular, light field interpolation methods [Davis et al. 2012; Gortler et al. 1996; Levoy and Hanrahan 1996] assume that the input views are sampled densely and close to the target view. When the input views are sparse, classical methods usually follow a two-stage approach: in the first stage, they construct a proxy geometry from the images using a combination of a multi-view stereo pipeline [Schönberger and Frahm 2016; Schönberger et al. 2016] and point cloud reconstruction methods [Kazhdan et al. 2006; Kazhdan and Hoppe 2013].
In the second stage, the input images are unprojected onto the geometry, either directly in terms of RGB colors [Buehler et al. 2001; Debevec et al. 1996; Waechter et al. 2014; Wood et al. 2000] or, more recently, latent features [Riegler and Koltun 2020, 2021]. Other lines of research have developed specialized methods for certain classes of objects, such as faces (e.g., Bi et al. [2021]); although we show results on synthetic human and animal data (Figure 1), the approach presented here is entirely general.
Neural Radiance Fields (NeRFs). NeRF [Mildenhall et al. 2020] has revolutionized the prevailing paradigm of novel-view synthesis by using a neural network to represent a scene as a volumetric (radiance) field that may be queried at any location to return the view-dependent radiance and volume density. Mildenhall et al. [2020] synthesize novel views by querying the radiance field along the image rays and accumulating the appearance using volume rendering. The photorealistic quality of NeRF has inspired a large body of follow-up work. NeRF++ [Zhang et al. 2020] analyzed the difficulties of NeRF in representing unbounded scenes and proposed a background formulation based on an inverted sphere representation. MipNeRF [Barron et al. 2021] addressed aliasing effects with an integrated positional encoding. This work was later extended to unbounded scenes [Barron et al. 2022] by contracting the volume and using an additional proposal network. [Deng et al. 2022] and [Niemeyer et al. 2022] tackled the challenging setting of sparse input views and proposed to regularize the volumetric representation using depth supervision or smoothness constraints and data priors based on normalizing flows, respectively. NeRF-W [Martin-Brualla et al. 2021] showed how NeRF can be extended to unstructured collections of images captured in the wild by using per-frame learnable latent codes to compensate for appearance differences and a transient embedding to remove dynamic objects. Alternative representations to neural fields include point clouds [Kopanas et al. 2021; Rückert et al. 2021], spheres [Lassner and Zollhöfer 2021], and 3D Gaussians [Kerbl et al. 2023].
Implicit surface representation. The NeRF formulation has two main shortcomings when it comes to modeling surfaces: i) the density field lacks regularization, and ii) surface extraction has to be performed at an arbitrary level set of the density field. In combination, these lead to noisy and low-fidelity surface reconstruction. However, with small changes in the formulation, implicit representations combined with volume rendering [Oechsle et al. 2021; Wang et al. 2021, 2022b; Yariv et al. 2020, 2021; Zhang et al. 2021] still appear as a promising alternative to classical surface reconstruction approaches from image data [Schönberger et al. 2016]. For example, instead of directly optimizing the density field, [Wang et al. 2021; Yariv et al. 2021] proposed to decompose it into an SDF and a global kernel size that defines the spread of the density. This allows for extracting accurate surfaces from the zero-level set of the SDF, which can also be regularized using the Eikonal constraint. Similar to NeRFs, implicit surface representations have also been combined with local feature fields and auxiliary acceleration data structures [Li et al. 2023; Rosu and Behnke 2023; Tang et al. 2023; Wang et al. 2022a; Zhao et al. 2022] with the goal of improved efficiency and representation capacity. While our method builds on the NeuS [Wang et al. 2021] formulation, our main goal is not to improve the accuracy of the extracted surface. Instead, we utilize the SDF to extract a narrow shell that allows us to adapt the representation to the local complexity of the scene and, in turn, to accelerate rendering.
Accelerating neural volume rendering. One of the main limitations of NeRFs is the computational complexity of neural volume rendering, which slows down both training and inference. Recently, various directions to accelerate NeRFs have been explored: replacing a global MLP with a (sparse) local feature field combined with a shallow MLP [Chen et al. 2022; Liu et al. 2020; Müller et al. 2022; Sun et al. 2022; Yu et al. 2021] or a spherical harmonics embedding [Chen et al. 2022; Karnewar et al. 2022; Sara Fridovich-Keil and Alex Yu et al. 2022], partitioning the volume into a large number of local (shallow) MLPs [Rebain et al. 2020; Reiser et al. 2021], using efficient sampling strategies [Hu et al. 2022; Kurz et al. 2022; Lin et al. 2022; Neff et al. 2021], or applying image-space convolutions [Cao et al. 2022; Wan et al. 2023]. However, even the most optimized volumetric representations [Müller et al. 2022] are still much slower than pure surface-based approaches such as NvDiffRec [Munkberg et al. 2022].
To further increase the efficiency of the inference phase, volumetric representations can be baked onto a proxy surface geometry [Chen et al. 2023; Wan et al. 2023; Yariv et al. 2023] that can be efficiently rendered using high-performance rasterization pipelines. An alternative "baking" strategy is to precompute the outputs of the neural network and store them on a (sparse) discrete grid that acts as a lookup table during inference [Hedman et al. 2021; Reiser et al. 2023]. In this work, we investigate an alternative approach to speeding up (volumetric) rendering, by adapting the number of samples required to render each pixel to the underlying local complexity of the scene. Note that our formulation is complementary to the "baking" approaches, and we consider the combination of both an interesting avenue for future research.

METHOD
Our method (see Figure 3) builds on NeRF [Mildenhall et al. 2020] and NeuS [Wang et al. 2021]. Specifically, we generalize NeuS [Wang et al. 2021] with a new spatially-varying kernel (Section 3.2), which improves quality and guides the extraction of a narrow-band shell (Section 3.3). The neural representation is then fine-tuned (Section 3.5) within the shell, which significantly accelerates rendering (Section 3.4).

Preliminaries
NeRF [Mildenhall et al. 2020] represents a scene as a volumetric radiance field that maps a 3D point x ∈ R^3 and a viewing direction d ∈ R^3 to a volume density σ and an emitted view-dependent color c ∈ R^3. This volumetric field is represented by a neural network NN_θ(·) with parameters θ, such that (c, σ) = NN_θ(x, d). The scene can then be rendered along a ray r(t) = o + t d with origin o ∈ R^3 and direction d ∈ R^3 from t_near to t_far via standard volume rendering,

$$C(\mathbf{r}) = \int_{t_\text{near}}^{t_\text{far}} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,\mathrm{d}t, \qquad T(t) = \exp\!\Big(-\!\int_{t_\text{near}}^{t} \sigma(\mathbf{r}(u))\,\mathrm{d}u\Big), \tag{1}$$

which is approximated by numerical integration,

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i\,\big(1 - e^{-\sigma_i \delta_i}\big)\,\mathbf{c}_i, \qquad T_i = \exp\!\Big(-\!\sum_{j=1}^{i-1} \sigma_j \delta_j\Big), \tag{2}$$

where N denotes the number of samples along the ray r and δ_i is the distance between two adjacent samples.
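For reference, the quadrature of Equation 2 amounts to standard alpha compositing. The sketch below is a minimal NumPy illustration, assuming per-ray arrays of densities, colors, and sample spacings (the array names are ours, not taken from our implementation):

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Numerical volume rendering along one ray (Equation 2).

    sigmas: (N,)   densities at the samples
    colors: (N, 3) emitted colors at the samples
    deltas: (N,)   distances between adjacent samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                         # per-sample opacity
    # Transmittance T_i = prod_{j<i} (1 - alpha_j)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                                         # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)                   # accumulated RGB
```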
To improve geometric surface quality in NeRF-like scene reconstructions, NeuS [Wang et al. 2021] and VolSDF [Yariv et al. 2021] propose to replace the learned density σ by a learned signed distance field f, and then transform f to σ for rendering via a sigmoid-shaped map. The formulation of NeuS optimizes an SDF (c, f) = NN_θ(x, d) along with a global kernel size s that controls the sharpness of the implied density. To evaluate volume rendering (Equation 2), the SDF value f at x is transformed to a density σ by

$$\sigma(t) = \max\!\left(\frac{-\tfrac{\mathrm{d}}{\mathrm{d}t}\,\Phi\!\big(f/s\big)}{\Phi\!\big(f/s\big)},\; 0\right), \qquad \Phi(u) = \big(1 + e^{-u}\big)^{-1}, \tag{3}$$

where f is implicitly f(r(t)) along a ray. Intuitively, a large s results in a wide kernel with a fuzzy density, while in the limit s → 0 the kernel approximates a sharp impulse function at the surface (see inset). This SDF-based formulation allows for the use of an Eikonal regularizer during training, which encourages the learned f to be an actual distance function, resulting in a more accurate surface reconstruction. The relevant losses are discussed in Section 3.5.

Spatially-Varying Kernel Size
The NeuS SDF formulation is highly effective, yet it relies on a single global kernel size. In combination with the Eikonal regularization, this implies a constant spread of the volume density across the whole scene. However, a one-size-fits-all approach does not adapt well to scenes that contain a mixture of "sharp" surfaces (e.g., furniture or cars) and "fuzzy" volumetric regions (e.g., hair or grass).
Our first contribution is to augment the NeuS formulation with a spatially-varying, locally learned kernel size s as an additional neural output that depends on the input 3D position x. The extended network becomes (c, f, s) = NN_θ(x, d) (see the implementation details in Section 4.1). During training, we additionally include a regularizer that promotes smoothness of the kernel size field (Section 3.5). This neural field can still be fit from color image supervision alone, and the resulting spatially-varying kernel size automatically adapts to the sharpness of the scene content (Figure 7). This enhanced representation is independently valuable, improving reconstruction quality in difficult scenes, but importantly it also guides our explicit shell extraction in Section 3.3, which greatly accelerates rendering.
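To make the role of the per-point kernel size concrete, the sketch below shows a discrete NeuS-style opacity computed with a per-sample kernel size instead of a global constant. This is an illustrative simplification following the section-based discretization of Wang et al. [2021]; the tensor shapes and names are assumptions, not our exact implementation:

```python
import torch

def neus_alpha(f, s):
    """Discrete NeuS-style opacity with a spatially-varying kernel size.

    f: (N+1,) SDF values at the section boundaries along one ray
    s: (N,)   learned kernel size per section (large = fuzzy, small = sharp)
    returns:  (N,) per-section opacities to be composited as in Equation 2
    """
    phi_in = torch.sigmoid(f[:-1] / s)    # Phi(f/s) entering each section
    phi_out = torch.sigmoid(f[1:] / s)    # Phi(f/s) leaving each section
    # Opacity is the relative drop of Phi across the section, clamped to [0, 1]
    return ((phi_in - phi_out) / phi_in.clamp(min=1e-6)).clamp(0.0, 1.0)
```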

Extracting an Explicit Shell
The adaptive shell delimits the region of space which contributes significantly to the rendered appearance, and is represented by two explicit triangle meshes. Where s is large the shell is thick, corresponding to volumetric scene content, and where s is small the shell is thin, corresponding to surfaces. After the implicit fields f and s have been fit as described in Section 3.2, we extract this adaptive shell once as a post-process.
In Equation 3, the magnitude of the quantity f/s in the sigmoid exponent determines the rendering contribution along a ray (see the inset figure in Section 3.1). It is tempting to simply extract a band where |f/s| < τ for some threshold τ as the region that makes a significant contribution to the rendering. However, the learned functions quickly become noisy away from the f = 0 level set, and cannot be sufficiently regularized without destroying fine details. Our solution is to separately extract an inner boundary as an erosion of the f = 0 level set, and an outer boundary as its dilation (Figure 4), both implemented via a regularized, constrained level set evolution tailored to the task.
In detail, we first sample the fields f and s at the vertices of a regular grid. We then apply a level set evolution to f, producing a new eroded field SDF−, and extract the SDF− = 0 level set as the inner shell boundary via marching cubes. A separate, similar evolution yields the dilated field SDF+, and the SDF+ = 0 level set forms the outer shell boundary. We define the two level sets separately: the dilated outer surface should be smooth to avoid visible boundary artifacts, while the eroded inner surface need not be smooth, but must only exclude regions which certainly do not contribute to the rendered appearance.
Recall that a basic level set evolution of a field φ is given by ∂φ/∂t = −v |∇φ|, where v is the desired scalar outward-normal velocity of the level set. Our constrained, regularized flow on φ is then

$$\frac{\partial \varphi}{\partial t} = \mu(\varphi_0)\left(-v + \lambda_\text{curv}\,\nabla\!\cdot\!\frac{\nabla \varphi}{|\nabla \varphi|}\right)|\nabla \varphi|, \tag{4}$$

where φ₀ denotes the initial learned SDF, and the divergence term is a curvature smoothness regularizer with weight λ_curv.
The soft falloff μ(φ₀) (see inset) limits the flow to a window of width w around the initial level set. To dilate the level set, the velocity is chosen to fill all regions with density σ > σ_min for a ray incoming in the normal direction, with v_d as a scaling coefficient; we use w = 0.1 and λ_curv = 0.01. To erode the level set, the velocity is inversely proportional to the density, so that the shell expands inward quickly in low-density regions and slowly in high-density regions; here we use w = 0.05 and λ_curv = 0. These velocities lead to a short-distance flow, and thus a narrow shell, where s is small and the content is surface-like; they lead to a long-distance flow, and hence a wide shell, where s is large and the content is volume-like. We compute the dilated field SDF+ and eroded field SDF− by forward-Euler integrating this flow on the grid for 50 integration steps, computing derivatives via spatial finite differences. We do not find numerical redistancing to be necessary. Finally, we clamp the results, SDF− ← max(φ₀, SDF−) and SDF+ ← min(φ₀, SDF+), to ensure that the eroded field only shrinks the level set and the dilated field only grows it. The SDF+ = 0 and SDF− = 0 level sets are extracted via marching cubes as the outer and inner shell boundary meshes M+ and M−, respectively. Figure 5 visualizes the resulting fields. Further details are provided in Procedures 1 and 2 of the Appendix.
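A minimal sketch of the grid-based evolution is shown below, assuming central finite differences and a Gaussian-shaped falloff around the initial level set; the exact velocity fields and falloff of our implementation follow the description above, and the helper names are illustrative.

```python
import numpy as np

def grad_norm(phi, h=1.0):
    """|grad(phi)| via central finite differences on a regular grid with spacing h."""
    grads = np.gradient(phi, h)
    return np.sqrt(sum(g * g for g in grads))

def evolve_level_set(phi0, velocity, w=0.1, lam_curv=0.01, h=1.0, dt=0.1, steps=50):
    """Forward-Euler integration of the regularized level set flow (Equation 4).

    phi0:     initial SDF sampled on a regular grid
    velocity: outward-normal speed v on the same grid (positive values dilate)
    w:        width of the soft falloff window around the phi0 = 0 level set
    """
    falloff = np.exp(-(phi0 / w) ** 2)  # soft window around the initial surface (assumed Gaussian)
    phi = phi0.copy()
    for _ in range(steps):
        gn = grad_norm(phi, h) + 1e-8
        grads = np.gradient(phi, h)
        # mean-curvature smoothing term: div(grad(phi) / |grad(phi)|) * |grad(phi)|
        curv = sum(np.gradient(g / gn, h)[i] for i, g in enumerate(grads)) * gn
        phi = phi + dt * falloff * (-velocity * gn + lam_curv * curv)
    return phi

# Usage sketch: dilate with a positive velocity field, then clamp so the level set only grows.
# sdf_plus = np.minimum(phi0, evolve_level_set(phi0, v_dilate))
```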

Narrow-Band Rendering
The extracted adaptive shell serves as an auxiliary acceleration data structure to guide the sampling of points along a ray (Equation 2), enabling us to efficiently skip empty space and sample points only where necessary for high perceptual quality. For each ray, we use hardware-accelerated ray tracing to efficiently enumerate the ordered intervals defined by the intersection of the ray and the shell. In detail, we first build ray tracing acceleration data structures for both the outer mesh M+ and the inner mesh M−. We then cast each ray against the meshes, yielding a series of intersections where the ray enters or exits a mesh, partitioning the ray into zero or more intervals contained in the shell (see inset). For each interval with width ℓ, a target inter-sample spacing Δ, and a single-sample threshold τ_s, we compute the number of samples as ceil(max(ℓ − τ_s, 0)/Δ) + 1. We cap the maximum number of samples at n_max and sample the interval equidistantly. Note that if the interval has ℓ < τ_s, a single sample is taken at the center of the interval. When an interval ends because the ray hits the inner mesh M−, we do not process any subsequent samples, as this represents the interior of a solid object. Otherwise, we process intervals until the ray exits the scene or we hit a maximum cap, accumulating the contributions as in Equation 2.

Fig. 6. Pure surface-based representations struggle to represent fuzzy surfaces such as the tail. On the other hand, our method adapts the narrow shell to the local complexity of the scene, using a single sample for the sharp skin surface and up to 16 samples for the tail.
Note that this procedure can be implemented by first generating all samples within all intervals and then performing a single batched MLP inference pass, which improves throughput. For surfaces, our narrow-band sampling often amounts to just hardware-accelerated ray tracing followed by a single network evaluation, while for fuzzy regions it densely samples only where necessary; in either case, performance is greatly accelerated (Table 1). More algorithmic details are included in Procedure 3 of the Appendix.
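The per-interval sample placement described above is simple enough to sketch in a few lines. The routine below mirrors that rule, with defaults following the hyperparameters listed in the Appendix; the function and argument names are illustrative rather than taken from our implementation:

```python
import math

def interval_samples(t_enter, t_exit, delta=0.01, tau_s=0.02, n_max=16):
    """Place sample locations inside one shell interval along a ray.

    t_enter, t_exit: ray parameters where the ray enters and exits the shell
    delta:           target inter-sample spacing
    tau_s:           single-sample threshold for surface-like intervals
    n_max:           cap on the number of samples per interval
    """
    length = t_exit - t_enter
    if length < tau_s:                                  # surface-like: one sample at the midpoint
        return [0.5 * (t_enter + t_exit)]
    n = min(math.ceil(max(length - tau_s, 0.0) / delta) + 1, n_max)
    # equidistant samples covering the interval
    return [t_enter + (i + 0.5) * length / n for i in range(n)]
```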

Losses and Training
We optimize the parameters of our representation in two stages. In the first stage, we use the fully volumetric formulation described in Sections 3.1 and 3.2 and minimize the objective

$$\mathcal{L} = \lambda_\text{c}\,\mathcal{L}_\text{c} + \lambda_\text{e}\,\mathcal{L}_\text{eik} + \lambda_\text{n}\,\mathcal{L}_\text{n} + \lambda_\text{s}\,\mathcal{L}_\text{s}, \tag{5}$$

with the weights λ_c = 1, λ_e = 0.1, λ_n = 0.1, and λ_s = 0.01 for all experiments. Here, L_c is the standard pixel-wise color loss against calibrated ground-truth images, and L_eik is the Eikonal regularizer as in [Wang et al. 2021],

$$\mathcal{L}_\text{eik} = \frac{1}{|\mathcal{X}|} \sum_{\mathbf{x} \in \mathcal{X}} \big( \lVert \nabla f(\mathbf{x}) \rVert_2 - 1 \big)^2, \tag{6}$$

where R and X denote the set of rays and the set of samples along the rays, respectively. ∇f(x) can be obtained either analytically [Wang et al. 2021, 2022a; Yariv et al. 2021] or through finite differences [Li et al. 2023; Wang et al. 2023]; we use the latter approach. The loss L_s regularizes the spatially-varying kernel size introduced in our formulation for smoothness, penalizing differences between the kernel size at a sample and at a nearby point perturbed by an offset drawn from N(0, ε²), where N(0, ε²) denotes the normal distribution with standard deviation ε. Lastly, we incorporate the loss L_n internal to our network architecture (Section 4.1). Like NeuS, we leverage geometric normals as an input to a shading subnetwork, but we find that predicting these normals internally improves inference performance compared to gradient evaluation. L_n serves to train these predicted normals to remain approximately faithful to the ones obtained through finite differences of the underlying SDF field ∇f(x) [Li et al. 2023; Wang et al. 2023].
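As a rough illustration of how the first-stage objective is assembled, the sketch below combines the four terms with the weights listed above. The finite-difference Eikonal term and the kernel-smoothness term are written in simplified form, the callables (query_sdf, query_kernel, predicted_normals) are hypothetical stand-ins for the networks of Section 4.1, and the exact loss norms are assumptions:

```python
import torch
import torch.nn.functional as F

def first_stage_loss(render_rgb, gt_rgb, pts, query_sdf, query_kernel, predicted_normals,
                     eps=1e-2, lc=1.0, le=0.1, ln=0.1, ls=0.01):
    """Weighted objective of the first (fully volumetric) training stage (Equation 5)."""
    loss_c = F.l1_loss(render_rgb, gt_rgb)                       # pixel-wise color loss

    # SDF gradients via central finite differences, reused for the Eikonal and normal terms
    offsets = eps * torch.eye(3, device=pts.device)
    grads = torch.stack([(query_sdf(pts + o) - query_sdf(pts - o)) / (2 * eps)
                         for o in offsets], dim=-1)
    loss_eik = ((grads.norm(dim=-1) - 1.0) ** 2).mean()          # Eikonal regularizer (Equation 6)

    normals_fd = F.normalize(grads, dim=-1)
    loss_n = (predicted_normals(pts) - normals_fd).abs().mean()  # predicted-normal consistency

    # Smoothness of the spatially-varying kernel size under small Gaussian perturbations
    loss_s = (query_kernel(pts) - query_kernel(pts + eps * torch.randn_like(pts))).abs().mean()

    return lc * loss_c + le * loss_eik + ln * loss_n + ls * loss_s
```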
After the implicit field has been fit, we extract the adaptive shell as in Section 3.3. While initial training requires dense sampling along rays, our explicit shell now allows narrow-band rendering to concentrate samples only in significant regions. We therefore fine-tune the representation within the narrow band, now with only L_c; it is no longer necessary to encourage a geometrically nice representation, as we have already extracted the shell and restricted the sampling to a small band around the scene content. Disabling regularization enables the network to devote its whole capacity to fitting the visual appearance, which leads to improved visual fidelity (Table 4). In Procedure 4 of the Appendix, we also present the training pipeline with algorithmic details.

EXPERIMENTS
In this section, we provide low-level details of our implementation and evaluate our method in terms of rendering quality and efficiency on four data sets that range from synthetic object-level data (Shelly, NeRF-Synthetic [Mildenhall et al. 2020]) and tabletop data (DTU [Jensen et al. 2014]) to large unbounded real-world scenes (MipNeRF360 [Barron et al. 2022]). For comparisons, we treat Instant NGP [Müller et al. 2022] as our volumetric baseline, due to its balance between high fidelity and efficiency. In addition, we compare to prior methods that were optimized either for fidelity [Barron et al. 2021, 2022; Wang et al. 2021; Yuan et al. 2022] or for rendering efficiency [Chen et al. 2023; Guo et al. 2023; Yariv et al. 2023].
When running NeRF [Mildenhall et al. 2020] and Mip-NeRF [Barron et al. 2021] on DTU and Shelly, we use the implementation from Nerfstudio [Tancik et al. 2023]. For other methods and experiment settings, we use their official implementations.

Architecture Details
Following the state of the art in neural volumetric rendering, we represent our neural field as a combination of a feature field and a small (decoder) neural network. Specifically, we use a multi-resolution hash encoding [Müller et al. 2022] Ψ(·) with 14 levels, where each level is represented by a hash table with 2^22 entries of two-dimensional features. The voxel grid resolution of our feature field grows from 16^3 → 4096^3 for Shelly and NeRFSynthetic, and from 16^3 → 8192^3 for the other data sets. The features at each level are obtained through tri-linear interpolation before being concatenated to form the feature embedding Ψ(·) ∈ R^28. This is further concatenated with the sample coordinates and input to the geometry network NN_geo, which is an MLP with a single hidden layer and 31 → 64 → 31 dimensions. Apart from the SDF value f and kernel size s, NN_geo also outputs a geometry latent feature f_geo ∈ R^26 and the normal vector n ∈ R^3, which are combined with x and the encoded view direction d as input to the radiance network c = NN_rad([γ(d), f_geo, n, x]) that predicts the emitted color. Here, NN_rad is an MLP with two hidden layers and 48 → 64 → 64 → 3 dimensions. To reduce computational cost, we directly predict the normal vector n with an MLP rather than computing it as the gradient of the underlying SDF field as done in NeuS [Wang et al. 2021]. Finally, to encode the input direction d, we use the spherical harmonic basis up to degree 4, such that γ(d) ∈ R^16. The dimensions of all layers in both networks and the feature field were selected for high throughput on modern GPU devices.
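Concretely, a network with the stated dimensions could be assembled roughly as in the PyTorch sketch below. The hash and spherical-harmonics encodings are abstracted behind placeholder modules, and the output ordering, activations, and initialization are assumptions rather than a faithful reproduction of our implementation:

```python
import torch
import torch.nn as nn

class AdaptiveShellField(nn.Module):
    def __init__(self, hash_encoding, sh_encoding):
        super().__init__()
        self.hash_encoding = hash_encoding   # x -> R^28 multi-resolution hash features
        self.sh_encoding = sh_encoding       # d -> R^16 spherical harmonics up to degree 4
        # Geometry MLP: [hash features (28) + x (3)] = 31 -> 64 -> 31
        self.geo = nn.Sequential(nn.Linear(31, 64), nn.ReLU(), nn.Linear(64, 31))
        # Radiance MLP: [sh(d) (16) + f_geo (26) + n (3) + x (3)] = 48 -> 64 -> 64 -> 3
        self.rad = nn.Sequential(nn.Linear(48, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x, d):
        geo_out = self.geo(torch.cat([self.hash_encoding(x), x], dim=-1))
        sdf, kernel = geo_out[..., 0], geo_out[..., 1]        # f and s (assumed output slots)
        normal, f_geo = geo_out[..., 2:5], geo_out[..., 5:]   # predicted n and latent feature
        rgb_in = torch.cat([self.sh_encoding(d), f_geo, normal, x], dim=-1)
        return torch.sigmoid(self.rad(rgb_in)), sdf, kernel, normal
```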

Implementation
The training stage of our method is implemented in PyTorch [Paszke et al. 2017], while the inference stage is implemented in Dr.Jit [Jakob et al. 2022]. To achieve real-time inference rates, we rely on the automatic kernel fusion performed by Dr.Jit as well as GPU-accelerated ray-mesh intersection provided by OptiX [Parker et al. 2010]. While the inference pass is implemented in high-level Python code, the asynchronous execution of large fused kernels hides virtually all of the interpreter's overhead. Combined with the algorithmic improvements described above, we achieve frame rates from 40 fps (25 ms/frame) on complex outdoor scenes to 300 fps (3.33 ms/frame) on object-level scenes, at 1080p resolution on a single RTX 4090 GPU. A performance comparison to Instant NGP [Müller et al. 2022] on four data sets is given in Table 1. Note that in this work we focused on inference performance only, and have not yet applied these performance optimizations to the training procedure. Detailed pseudo-code is given in Procedures 1, 2, 3 and 4 of the Appendix.

Evaluation Metrics
To evaluate rendering quality, we report the commonly used peak signal-to-noise ratio (PSNR), learned perceptual image patch similarity (LPIPS), and structural similarity (SSIM) metrics. Unfortunately, evaluating the efficiency of the methods is less straightforward, as the complexity of a method is often intertwined with the selected hardware and low-level implementation details. Indeed, reporting only the number of frames per second (FPS) or the time needed to render a single frame may paint an incomplete picture. We therefore additionally report the number of samples along the ray that are required to render each pixel. While the number of samples along the ray also does not tell the whole story, as the per-sample evaluation can have different computational complexity, combining all metrics provides a good assessment of the computational complexity of the individual methods.

Shelly Data Set
The NeRFSynthetic data set introduced in [Mildenhall et al. 2020] is still one of the most widely used data sets to evaluate novel-view synthesis methods. Yet, it mainly consists of objects with sharp surfaces that can be well-represented by surface rendering methods [Munkberg et al. 2022], and thus does not represent the challenge of general scene reconstruction. This motivated us to introduce a new synthetic data set, which we name Shelly. It covers a wider variety of appearance, including fuzzy surfaces such as hair, fur, and foliage. Shelly contains six object-level scenes: Khady, Pug, Kitty, Horse, Fernvase, and Woolly. For each scene, we have rendered 128 training and 32 test views from random camera positions distributed on a sphere with a fixed radius. We are grateful to the original artists of these models: jhon maycon, Pierre-Louis Baril, abdoubouam, ckat609, the BlenderKit team, and textures.xyz.
Table 2 shows quantitative results, while novel views are qualitatively compared in Figure 8. Our method significantly outperforms prior methods across all quality metrics, achieving more than 2 dB higher PSNR than Instant NGP. Figure 8 demonstrates that surface-based rendering methods (MobileNerf) struggle to represent fuzzy surfaces. In contrast, our method aligns its representation to the complexity of the scene. For example, Figure 6 shows that our method represents the skin of the horse as a sharp surface, while using a wider kernel for its tail, which benefits from volumetric rendering.

DTU Data Set
We consider 15 tabletop scenes from the DTU data set [Jensen et al. 2014]. These scenes were captured by a robot-held monocular RGB camera and are commonly used to evaluate implicit surface representations. We follow prior work [Wang et al. 2021; Yariv et al. 2021] and task the methods to represent the full scene, but evaluate the performance only within the provided object masks.
Table 2 shows that our method outperforms all baselines across all evaluation metrics. Qualitative results are provided in Figure 9.

Different from the Shelly data set, the performance of Ours on the DTU data set is slightly lower than that of Ours (full ray) in terms of PSNR. We hypothesize that this is due to the distribution of the camera poses, which observe the scene only from a single direction. This hinders constraining the neural field and hence also the adaptive shell extraction. The same reason also contributes to a significant increase in the sample count for Instant NGP (see Table 1).

NeRFSynthetic Data Set
The NeRFSynthetic data set introduced in [Mildenhall et al. 2020] contains 8 synthetic objects rendered in Blender and is widely adopted to evaluate the quality of novel view synthesis methods.
As shown in Table 2, our method achieves quality comparable to the state-of-the-art methods Mip-NeRF [Barron et al. 2021] and I-NGP [Müller et al. 2022], but with much faster runtime performance (Table 1). Our method also achieves better image quality compared to recent works optimized for rendering efficiency [Chen et al. 2023; Guo et al. 2023].

MipNeRF360 Data Set
The MipNeRF-360 data set [Barron et al. 2022] is a challenging real-world data set that contains large indoor and outdoor scenes captured from 360° camera views. The scenes feature a complex central object accompanied by a highly detailed background. To better represent the background, we follow [Yariv et al. 2023] and extend our method with the scene contraction proposed in [Barron et al. 2022] (more details are provided in the supplemental document).
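For reference, the contraction of [Barron et al. 2022] maps points outside the unit sphere into a ball of radius 2 and leaves the interior untouched; a minimal sketch is:

```python
import torch

def contract(x):
    """MipNeRF360 scene contraction [Barron et al. 2022]: identity inside the unit ball,
    points outside are mapped into a ball of radius 2."""
    norm = x.norm(dim=-1, keepdim=True).clamp(min=1e-9)
    return torch.where(norm <= 1.0, x, (2.0 - 1.0 / norm) * x / norm)
```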
Table 3 provides the quantitative results, and the qualitative comparison is depicted in Figure 10. Our method achieves comparable performance to other interactive methods. Directly compared to I-NGP, our proposed narrow-band formulation reduces the number of samples by a factor of three, resulting in five-times higher average frame rates at comparable rendering quality. We note that on this data set, performance and quality depend significantly on the background, which our approach is not specialized to handle.

Performance Evaluation
We compare the performance of our method to the most efficient volumetric baseline, Instant NGP [Müller et al. 2022], in Table 1. To ensure a fair comparison, we render the same test views for both methods at 1080p resolution and remove the GUI overhead. The comparison was run on a single RTX 4090 GPU. Our narrow-band rendering formulation efficiently reduces the number of samples along the ray (by up to 10 times), which results in significantly reduced inference time per frame. On the challenging unbounded 360° scenes, our method already runs at real-time rates. Yet, additional speedups could be achieved by further studying the interaction of our adaptive sample placement with the spatial remapping employed in these scenes.

Table 3. Quantitative results on the MipNeRF360 data set. We report the PSNR, LPIPS and SSIM results for each scene and compare them to baselines. Our method achieves performance comparable to the baselines while being significantly faster during inference (see Table 1). In our comparison, we exclude the two scenes with license issues: Flowers and Treehill.

Ablation Study
We ablate our design choices on the Shelly data set in Table 4. In line with our motivation in Section 3.2, the spatially-varying kernel size provides the flexibility required to adapt to the local complexity of the scene, which results in improvements across all metrics. Using a fixed SDF threshold to extract the band requires seeking a compromise between an adaptive shell that is too narrow to represent fuzzy surfaces (threshold 0.02) and an increased sample count (threshold 0.05). Instead, our formulation automatically adapts to the local complexity of the scene, leading to higher quality metrics and a lower sample count. As described in Section 3.5, we disable the regularization terms after shell extraction to devote more capacity to fitting the visual appearance. Comparing Ours (keep regularization) with Ours, this leads to improved visual fidelity.
In Figure 11, we ablate our method and study how image quality and runtime change with different sample counts. We vary the sample step size Δ in narrow-band rendering (Section 3.4) to produce varying sample counts and keep other hyperparameters unchanged. The PSNR is sensitive to the sample count when the samples are insufficient (0.25×-1×), and the image quality starts to saturate as the sample count grows (1×-4×). In most scenes, the runtime performance is approximately linear with respect to the sample count. In scenes such as Kitten and Fernvase, smaller sample counts (0.25×-1×) do not further reduce the runtime due to a mixture of fixed overheads (e.g., Python interpreter and Dr.Jit tracing) and underutilization of the GPU.

APPLICATIONS
Our method directly constructs an explicit outer shell mesh M+ which, by definition, contains all regions of space that contribute to the rendered appearance. This property has great utility for downstream applications. So far our scenes have represented entirely static content, yet the world is full of motion. Cage-based deformation methods have shown promise for enabling dynamic, non-rigid motion in NeRF and other volumetric representations [Garbin et al. 2022; Joshi et al. 2007; Lee et al. 2018; Xu and Harada 2022; Yuan et al. 2022]. The basic idea is to construct a coarse tetrahedral cage around a neural volume, deform the cage, and use it to render the deformed appearance of the underlying volume. Our approach perfectly supports such techniques, as the outer shell mesh M+ guides the construction of a cage which is guaranteed to contain the content.
We first dilate and tetrahedralize the outer mesh M+ with fTetWild [Hu et al. 2020] to produce a tetrahedral mesh that encapsulates the scene. This mesh acts as a proxy for performing physics simulations, animations, editing, and other operations. To render our representation after deforming the tetrahedral cage, any deformation is transferred to M+ and M− via barycentric interpolation, using barycentric coordinates generated as a preprocess. Ray directions are likewise transformed via finite differences. After the transformation, we proceed with rendering as usual in the reference space of our representation, as described in Section 3.4. Note that even in the presence of deformations, the rendering process still benefits from our efficient adaptive shell representation and is able to efficiently sample the underlying neural volume.
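A minimal sketch of the deformation transfer is given below: given precomputed barycentric coordinates of each shell vertex with respect to its enclosing cage tetrahedron, deformed shell vertices are obtained by interpolating the deformed cage vertices. The array names are illustrative assumptions:

```python
import numpy as np

def deform_shell_vertices(cage_verts_deformed, tet_indices, bary_coords):
    """Transfer a cage deformation to the shell meshes via barycentric interpolation.

    cage_verts_deformed: (V, 3) deformed positions of the tetrahedral cage vertices
    tet_indices:         (S, 4) per shell vertex, the ids of its enclosing tet's vertices
    bary_coords:         (S, 4) precomputed barycentric coordinates of each shell vertex
    returns:             (S, 3) deformed shell vertex positions
    """
    corners = cage_verts_deformed[tet_indices]            # (S, 4, 3)
    return (bary_coords[..., None] * corners).sum(axis=1)
```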
We show two examples of applying physical simulation and animation to the reconstructed objects in Figure 12; please see the supplemental video for the dynamic motion. In the animation example, we manually drive the motion of plants in a vase according to an analytical wind-like spatial function. Other animation schemes, such as blend shapes or character rigs, could potentially be substituted to drive the motion. In the physical simulation example, we simulate the reconstructed asset via finite-element elastic simulation on the cage tetrahedra, including collision penalties [Jatavallabhula et al. 2021].

DISCUSSION
Recent work has developed schemes to accelerate and improve the quality of NeRF-like scene representations. Section 4 provides comparisons to selected, particularly relevant methods. Note that due to the high research activity in the field, it is impossible to compare to all techniques, and for many approaches implementations are not available. Hence, we offer additional comments on some related work:
• MobileNeRF [Chen et al. 2023], BakedSDF [Yariv et al. 2023], NeRFMeshing [Rakotosaona et al. 2023], and nerf2mesh [Tang et al. 2023] post-process NeRF-like models and extract meshes to accelerate inference, similar to this work. However, these approaches constrain appearance to surfaces, sacrificing quality.
Our method instead retains a full volumetric representation and nearly full-NeRF quality, at the cost of moderately more expensive inference (though still real-time on modern hardware).
• DuplexRF [Wan et al. 2023] also extracts an explicit shell from the underlying neural field and uses it to accelerate rendering, although it does so with a very different neural representation, prioritizing performance. Their shell is directly extracted from two thresholds of the radiance field, which requires careful selection of the thresholds and results in a noisy shell that is not adapted to the local complexity of the scene, in contrast to our approach.
• VMesh [Guo et al. 2023] builds on the similar insight that different parts of the scene require different treatment. However, their formulation assumes an additional voxel-grid data structure to mark the volumetric areas that contribute to the final rendering. This approach suffers from poor complexity scaling, as with the auxiliary acceleration data structure of [Müller et al. 2022]. Instead, our method uses an explicit, adaptive shell to delimit the areas that contribute to the rendering. Apart from lower complexity, our formulation seamlessly enables further applications, as discussed in Section 5.

CONCLUSION AND FUTURE WORK
In this work we focused on efficiently rendering NeRFs. Our first stage of training (Section 3.2) is largely similar to that of [Li et al. 2023], and can likely be accelerated by algorithmic advancements and low-level tuning similar to our inference pipeline [Wang et al. 2022a].
Although our method offers large speedups for high-fidelity neural rendering and runs at real-time rates on modern hardware (Table 1), it is still significantly more expensive than approaches such as MeRF [Reiser et al. 2023] that precompute the neural field outputs and bake them onto a discrete grid representation. Our formulation is complementary to that of MeRF [Reiser et al. 2023], and we hypothesize that combining both approaches will lead to further speedups, potentially reaching the performance, at high quality, of methods that bake the volumetric representation to explicit meshes and can run in real time even on commodity hardware (e.g., [Chen et al. 2023]).
Our method does not guarantee to capture thin structures: if the extracted adaptive shell omits a geometric region, it can never be recovered during fine-tuning and will always be absent from the reconstruction. Artifacts of this form are visible in some MipNeRF360 scenes. Future work will explore an iterative procedure, in which we alternately tune our reconstruction and adapt the shell to ensure that no significant geometry is missed. Other artifacts occasionally present in our reconstructions include spurious floating geometry and poorly-resolved backgrounds; both are common challenges in neural reconstructions, and our approach may borrow solutions from other work across the field (e.g., [Niemeyer et al. 2022]).
More broadly, there is great potential in combining recent neural representations with high-performance techniques honed for real-time performance in computer graphics. Here, we have shown how ray tracing and adaptive shells can be used to greatly improve performance.

APPENDIX
We provide additional algorithmic details and pseudocode. In the first phase of training, our method adaptively extracts an explicit mesh envelope which spatially bounds the neural volumetric representation: level set evolution and shell extraction are shown in Procedures 1 and 2. In the second phase of training, as well as at inference, we leverage the extracted shells to sample query points only where they are needed. We cast rays against the shell meshes and compute query locations in the narrow band between the outer and inner shell. This is detailed in Procedure 3. Note that one ray may intersect multiple narrow bands; however, it always terminates when encountering an inner shell. Finally, we include the overall training pipeline in Procedure 4. In this supplementary material, we provide additional results (Section 1) and implementation details (Section 2).

ADDITIONAL RESULTS
We provide additional qualitative results on the MipNeRF360 data set (Figure 1) and the DTU data set (Figure ??, ??), as well as per-scene quantitative numbers for all data sets (Tables 1-11).

IMPLEMENTATION DETAILS
Level set evolution. To perform the level set evolution, we extract the initial SDF values on a 512^3 grid and separately dilate and erode the zero level set for 50 iterations with a timestep of 0.1. For dilation, we use v_d = 1, and the density threshold σ_min is set to 0.01 for all data sets. These values were determined empirically such that the dilated level set sufficiently covers thin structures. We set the erosion hyperparameters to 0.001 and a maximum of 100. The evolution process takes approximately 2 seconds, which is negligible compared to the other steps of our training pipeline.
Narrow-band rendering. When sampling query points within the shells, we set the maximum number of samples per interval to n_max = 16, and the maximum cap for depth peeling to 20. For the Shelly, DTU, and MipNeRF360 data sets, we use a single-sample threshold τ_s = 0.02 and an inter-sample spacing Δ = 0.01. For the NeRF-Synthetic data set, we use a single-sample threshold τ_s = 0.005 and an inter-sample spacing Δ = 0.0025.
Representing the background. For the DTU data set, we combine the main volume representation with a spherical background placed at infinity (dependent only on the ray direction). Similar to our main network, the spherical background is represented using a combination of a hash encoding (4 levels, 2D features per level) and a small MLP, such that c = NN_bckg([Ψ(d)]). All rays that completely miss the extracted shell obtain their color from the background, which requires a single sample evaluation.
When training on the MipNeRF360 data set, we follow prior work [Yariv et al. 2023] and extend our method with the scene contraction proposed in [Barron et al. 2022]. Specifically, we map the scene outside the unit sphere into a sphere with radius 2 using the scene contraction function.

Training details. During training, we linearly warm up the learning rate to 1 × 10^-2 in the first 5k iterations, and then exponentially decay it to 1 × 10^-4 at the end of training. In all experiments, we use the AdamW optimizer with a weight decay of 1e-2. For the Shelly, DTU, and NeRFSynthetic data sets, we train for a total of 300k iterations, where the first 200k iterations are used for the first stage (full-ray formulation) and the remaining 100k for the second stage (fine-tuning within the narrow band). We use a batch size of 4096 rays. For each scene, training takes 8h on a single A100 GPU. For the larger MipNeRF360 scenes, we increase the full-ray training to 500k iterations. We adopt the progressive training scheme [Li et al. 2023; Wang et al. 2022], where we initially enable the 8 coarsest levels of features (with the remaining feature channels set to zero) and then add one level every 5k iterations until reaching the maximum number of levels, which equals 14.

Table 5. Per-scene quantitative results on the DTU data set. We report the PSNR, LPIPS and SSIM results for each scene and compare them with baselines.

Fig. 1 .
Fig. 1. This work presents an approach for efficiently rendering neural radiance fields by restricting volumetric rendering to a narrow band around the object. Left: We first fit a dense neural volume using a new spatially-varying kernel that automatically adapts to be large in volumetric regions such as hair or grass, and small in sharp-surface regions such as skin or furniture. We then extract an explicit bounding mesh of the region to be rendered, whose width is determined by the kernel, and render at real-time rates. Right: the proposed method is general and effective across a wide range of data and well-suited for downstream applications such as simulation and animation. The face model of the Khady synthetic human shown left is courtesy of texturing.xyz.

Fig. 3 .
Fig. 3. Overview of the proposed approach. We demonstrate high-fidelity, efficient neural implicit scene reconstruction by efficiently sampling volumetric rendering inside an explicit thin shell, which is automatically fit from visual objectives.

Fig. 4 .
Fig. 4. After fitting an initial SDF and spatially-varying kernel, we apply level set flows to extract an adaptive shell via dilation and erosion. For illustrative purposes, the adaptive shell is enlarged; in practice it very tightly encloses sharp surfaces.

Fig. 5 .
Fig. 5. Given the kernel size s and SDF f learned in Section 3.2 (top), we apply erosion and dilation flows to f (bottom middle and left) to extract a narrow shell in which we efficiently render (bottom right). Here, we visualize each quantity on a 2D slice through a scene. For clarity, we show the fields only near the adaptive shell that is ultimately rendered.

Fig. 7 .
Fig. 7. The original NeuS [Wang et al. 2021] formulation uses a single global kernel size s. On complex scenes with varying content, the global s value converges to an average which is too small for volumetric parts and too large for sharp surfaces. Instead, our locally varying kernel size adapts to the scene, in turn allowing us to reduce the number of samples to a single sample for sharp surfaces and up to 32 samples for the fern (top right). NeuS uses a constant 384 samples per pixel (bottom right).

Fig. 8 .
Fig. 8. A gallery of results on the test views of our Shelly data set.

Fig. 9 .
Fig. 9. A gallery of results on the DTU data set.

Fig. 10 .
Fig. 10. A gallery of results on the test views of the MipNerf360 data set.

Fig. 11 .
Fig. 11. Ablating the effect of sample count on image quality and runtime performance. We vary the sample count, and plot the PSNR change (left) and relative runtime performance (right) compared to the default hyperparameters, denoted as "1× sample count". We experiment with six scenes from the Shelly (fernvase, khady, kitten) and MipNeRF360 (bicycle, garden, room) data sets.

Fig. 12 .
Fig. 12. Our representation is well-suited for animation (top) and physical simulation (bottom). The visual quality is preserved under deformation: the original shape is shown in the leftmost column, with deformations in the middle and rightmost columns. For details, please zoom into the fuzzy regions (e.g., fur, leaves), and refer to the supplemental video.

Fig. 1 .
Fig. 1. Qualitative visualization of geometry and kernel size on the MipNeRF360 data set. The kernel size is re-scaled to between 0 and 1. Our method automatically converges to a large kernel size for fuzzy regions such as grass and a small kernel size for sharp surfaces.

Table 1 .
Performance comparisons on all four data sets, measured at 1080p without GUI overhead using an RTX 4090 GPU. Our adaptive sample placement and mesh-based empty-space skipping allow us to outperform Instant NGP [Müller et al. 2022] without compromising visual fidelity. For each method we report the sample count (↓), ms/frame (↓), and FPS (↑). Note that Instant NGP's performance on the DTU data set was hindered by a large number of background samples, and is therefore not necessarily indicative of a real use case: the user may specify a tighter scene bounding box to focus the samples on the main scene contents.

Table 2 .
Quantitative comparison on the Shelly, DTU, and NeRFSynthetic data sets. We report PSNR, LPIPS, and SSIM. Our method achieves better results across all metrics on Shelly and DTU and comparable results on NeRFSynthetic. Real-time denotes methods that achieve >30 FPS at 1080p. On Shelly and DTU, we run NeRF and Mip-NeRF with Nerfstudio [Tancik et al. 2023], and use the official implementations for the other methods. Baselines for NeRFSynthetic are from the original papers. Detailed results for each object/scene are provided in the Supplement.

Table 1 .
Per-scene quantitative PSNR comparison on Shelly data set.

Table 2 .
Per-scene quantitative LPIPS comparison on Shelly data set.

Table 4 .
Per-scene sample count of our method on Shelly data set.

Table 7 .
Per-scene results of our method on NeRFSynthetic data set.

Table 8 .
Per-scene quantitative PSNR comparison on MipNeRF360 data set.