BakedSDF: Meshing Neural SDFs for Real-Time View Synthesis

We present a method for reconstructing high-quality meshes of large unbounded real-world scenes suitable for photorealistic novel view synthesis. We first optimize a hybrid neural volume-surface scene representation designed to have well-behaved level sets that correspond to surfaces in the scene. We then bake this representation into a high-quality triangle mesh, which we equip with a simple and fast view-dependent appearance model based on spherical Gaussians. Finally, we optimize this baked representation to best reproduce the captured viewpoints, resulting in a model that can leverage accelerated polygon rasterization pipelines for real-time view synthesis on commodity hardware. Our approach outperforms previous scene representations for real-time rendering in terms of accuracy, speed, and power consumption, and produces high quality meshes that enable applications such as appearance editing and physical simulation.

INTRODUCTION
Current top-performing approaches for novel view synthesis (the task of using captured images to recover a 3D representation that can be rendered from unobserved viewpoints) are largely based on Neural Radiance Fields (NeRF) [Mildenhall et al. 2020]. By representing a scene as a continuous volumetric function parameterized by a multilayer perceptron (MLP), NeRF is able to produce photorealistic renderings that exhibit detailed geometry and view-dependent effects. Because the MLP underlying a NeRF is expensive to evaluate and must be queried hundreds of times per pixel, rendering a high-resolution image from a NeRF is typically slow.
Recent work has improved NeRF rendering performance by trading compute-heavy MLPs for discretized volumetric representations such as voxel grids. However, these approaches require substantial GPU memory and custom volumetric raymarching code and are not amenable to real-time rendering on commodity hardware, since modern graphics hardware and software is oriented towards rendering polygonal surfaces rather than volumetric fields.
While current NeRF-like approaches are able to recover high-quality real-time-renderable meshes of individual objects with simple geometry [Boss et al. 2022], reconstructing detailed and well-behaved meshes from captures of real-world unbounded scenes (such as the "360 degree captures" of Barron et al. [2022]) has proven to be more difficult. Recently, MobileNeRF [Chen et al. 2022a] addressed this problem by training a NeRF whose volumetric content is restricted to lie on the faces of a polygon mesh, then baking that NeRF into a texture map. Though this approach yields reasonable image quality, MobileNeRF initializes the scene geometry as a collection of axis-aligned tiles that turns into a textured polygon "soup" after optimization. The resulting geometry is less suitable for common graphics applications such as texture editing, relighting, and physical simulation.
In this work, we demonstrate how to extract high-quality meshes from a NeRF-like neural volumetric representation. Our system, which we call BakedSDF, extends the hybrid volume-surface neural representation of VolSDF [Yariv et al. 2021] to represent unbounded real-world scenes. This representation is designed to have a well-behaved zero level set corresponding to surfaces in the scene, which lets us extract high-resolution triangle meshes using marching cubes.
Our key idea is to define the SDF in contracted coordinate space [Barron et al. 2022], which has two advantages: it more strongly regularizes distant content, and it allows us to extract the mesh in contracted space, which distributes the triangle budget better (more triangles in the center, fewer in the periphery).
We then equip this mesh with a fast and efficient view-dependent appearance model based on spherical Gaussians, which is fine-tuned to reproduce the input images of the scene. The output of our system can be rendered at real-time frame rates on commodity devices, and we show that our real-time rendering system outperforms prior work in terms of realism, speed, and power consumption. Additionally, we show that (unlike comparable prior work) the mesh produced by our model is accurate and detailed, enabling standard graphics applications such as appearance editing and physics simulation.
To summarize, our key contributions are: (1) high-quality neural surface reconstruction of unbounded real-world scenes, (2) a framework for real-time rendering of these scenes in a browser, and (3) a demonstration that spherical Gaussians are a practical representation of view-dependent appearance for view synthesis.

RELATED WORK
View synthesis, i.e., the task of rendering novel views of a scene given a set of captured images, is a longstanding problem in the fields of computer vision and graphics. In scenarios where the observed viewpoints are sampled densely, synthesizing new views can be done with light field rendering: straightforward interpolation into the set of observed rays [Gortler et al. 1996; Levoy and Hanrahan 1996]. However, in practical settings where observed viewpoints are captured more sparsely, reconstructing a 3D representation of the scene is crucial for rendering convincing novel views. Most classical approaches for view synthesis use triangle meshes (typically reconstructed using a pipeline consisting of multi-view stereo [Furukawa and Hernández 2015; Schönberger et al. 2016], Poisson surface reconstruction [Kazhdan et al. 2006; Kazhdan and Hoppe 2013], and marching cubes [Lorensen and Cline 1987]) as the underlying 3D scene representation, and render novel views by reprojecting observed images into each novel viewpoint and blending them together using either heuristically defined [Buehler et al. 2001; Debevec et al. 1996; Wood et al. 2000] or learned [Hedman et al. 2018; Riegler and Koltun 2020, 2021] blending weights. Although mesh-based representations are well-suited for real-time rendering with accelerated graphics pipelines, the meshes produced by these approaches tend to have inaccurate geometry in regions with fine details or complex materials, which leads to errors in rendered novel views. Alternatively, point-based representations [Kopanas et al. 2021; Rückert et al. 2022] are better suited for modeling thin geometry, but cannot be rendered efficiently without visible cracks or unstable results when the camera moves.
Most recent approaches to view synthesis sidestep the difficulty of high-quality mesh reconstruction by using volumetric representations of geometry and appearance, such as voxel grids [Lombardi et al. 2019; Penner and Zhang 2017; Szeliski and Golland 1999; Vogiatzis et al. 2007] or multiplane images [Srinivasan et al. 2019; Wizadwongsa et al. 2021; Zhou et al. 2018]. These representations are well-suited to gradient-based optimization of a rendering loss, so they can be effectively optimized to reconstruct detailed geometry seen in the input images. The most successful of these volumetric approaches is Neural Radiance Fields (NeRF) [Mildenhall et al. 2020], which forms the basis for many state-of-the-art view synthesis methods (see Tewari et al. [2022] for a review). NeRF represents a scene as a continuous volumetric field of matter that emits and absorbs light, and renders an image using volumetric ray-tracing. NeRF uses an MLP to map from a spatial coordinate to a volumetric density and emitted radiance, and that MLP must be evaluated at a set of sampled coordinates along a ray to yield a final color.
Subsequent works have proposed modifying NeRF's representation of scene geometry and appearance for improved quality and editability. Ref-NeRF [Verbin et al. 2022] reparameterizes NeRF's view-dependent appearance to enable appearance editing and improve the reconstruction and rendering of specular materials. Other works [Boss et al. 2021; Kuang et al. 2022; Srinivasan et al. 2021; Zhang et al. 2021a,b] attempt to decompose a scene's view-dependent appearance into material and lighting properties. In addition to modifying NeRF's representation of appearance, papers including UNISURF [Oechsle et al. 2021], VolSDF [Yariv et al. 2021], NeuS [Wang et al. 2021], MetaNLR++ [?], and NeuMesh [?] augment NeRF's fully volumetric representation with hybrid volume-surface models, but do not target real-time rendering and show results only for objects and bounded scenes.
The MLP NeRF uses to represent a scene is usually large and expensive to evaluate, and this means that a NeRF is slow to train (hours or days per scene) and slow to render (seconds or minutes per megapixel). While rendering can be accelerated with a sampling network that reduces the MLP queries per ray [Neff et al. 2021], recent methods have improved both training and render time by replacing the large MLP with a voxel grid [Karnewar et al. 2022; Sun et al. 2022], a grid of small MLPs [Reiser et al. 2021], low-rank [Chen et al. 2022b] or sparse [Yu et al. 2022] grid representations, or a multiscale hash encoding with a small MLP [Müller et al. 2022].
While these representations reduce the computation required for both training and rendering (at the cost of increased storage), rendering can be further accelerated by precomputing and storing, i.e., "baking", a trained NeRF into a more efficient representation. SNeRG [Hedman et al. 2021], FastNeRF [Garbin et al. 2021], PlenOctrees [Yu et al. 2021], and Scalable Neural Indoor Scene Rendering [Wu et al. 2022] all bake trained NeRFs into sparse volumetric structures and use simplified models of view-dependent appearance to avoid evaluating an MLP at each sample along each ray. These methods have enabled real-time rendering of NeRFs on high-end hardware, but their use of volumetric raymarching precludes real-time performance on commodity hardware. Concurrent to our work, [?] developed Memory-Efficient Radiance Fields (MERF), a compressed volumetric representation for unbounded scenes that facilitates fast rendering on commodity hardware. When compared with our meshes, this volumetric representation achieves higher quality scores, but requires more memory, needs a complex renderer, and is not straightforward to use for downstream graphics applications such as physics simulation. Please refer to the MERF paper for a direct comparison with our method.

PRELIMINARIES
In this section, we describe the neural volumetric representation that NeRF [Mildenhall et al. 2020] uses for view synthesis as well as improvements introduced by mip-NeRF 360 [Barron et al. 2022] for representing unbounded "360 degree" scenes.
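A NeRF is a 3D scene representation consisting of a learned function that maps a position $\mathbf{x}$ and outgoing ray direction $\mathbf{d}$ to a volumetric density $\tau$ and color $\mathbf{c}$. To render the color of a single pixel in a target camera view, we first compute the ray corresponding to that pixel, $\mathbf{r} = \mathbf{o} + t\mathbf{d}$, and then evaluate the NeRF at a series of points $\{t_i\}$ along the ray. The resulting outputs $\tau_i, \mathbf{c}_i$ at each point are composited together into a single output color value $\mathbf{C}$:
$$\mathbf{C} = \sum_i \exp\Big(-\sum_{j<i} \tau_j \delta_j\Big)\big(1 - \exp(-\tau_i \delta_i)\big)\,\mathbf{c}_i, \qquad \delta_i = t_i - t_{i-1}. \tag{1}$$
This definition of $\mathbf{C}$ is a quadrature-based approximation of the volume rendering equation [Max 1995].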
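As an illustration of the quadrature in Equation 1, the following minimal Python sketch composites per-sample densities and colors along a single ray. It is not the authors' implementation: the function name `composite` and the convention that `t_values` holds one more entry than `densities` (so that every sample owns the interval between two consecutive distances) are assumptions made for this example.

```python
import math

def composite(densities, colors, t_values):
    """Quadrature approximation of volume rendering along one ray.

    densities: per-sample density tau_i (non-negative floats)
    colors:    per-sample RGB tuples c_i
    t_values:  increasing ray distances, len(densities) + 1 entries
               (assumed convention: sample i covers [t_values[i], t_values[i+1]])
    Returns (composited RGB tuple, list of per-sample rendering weights).
    """
    transmittance = 1.0          # exp(-sum_{j<i} tau_j * delta_j)
    out = [0.0, 0.0, 0.0]
    weights = []
    for i, (tau, c) in enumerate(zip(densities, colors)):
        delta = t_values[i + 1] - t_values[i]
        alpha = 1.0 - math.exp(-tau * delta)   # opacity of this interval
        w = transmittance * alpha              # contribution to the pixel
        weights.append(w)
        for k in range(3):
            out[k] += w * c[k]
        transmittance *= 1.0 - alpha
    return tuple(out), weights
```

The returned weights are the same per-sample rendering weights that are later thresholded during visibility culling; they always sum to at most one.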
NeRF parametrizes this learned function using an MLP whose weights are optimized to implicitly encode the geometry and color of the scene: A set of training input images and their camera poses are converted into a set of (ray, color) pairs, and gradient descent is used to optimize the MLP weights such that the rendering of each ray resembles its corresponding input color. Formally, NeRF minimizes a loss between the ground truth color $\mathbf{C}_{\text{gt}}$ and the color $\mathbf{C}$ produced in Equation 1, averaged over all training rays:
$$\mathcal{L}_{\text{data}} = \mathbb{E}\big[\lVert \mathbf{C} - \mathbf{C}_{\text{gt}} \rVert^2\big]. \tag{2}$$
If the input images provide sufficient coverage of the scene (in terms of multiview 3D constraints), this simple process yields a set of MLP weights that accurately describe the scene's 3D volumetric density and appearance.
Fig. 2. An illustration of the three stages of our method. We first reconstruct the scene using a surface-based volumetric representation (Section 4.1), then bake it into a high-quality mesh (Section 4.2), and finally optimize a view-dependent appearance model based on spherical Gaussians (Section 4.3). [Panel titles: "Baking a high-resolution triangle mesh", "Modeling appearance with spherical Gaussians".]
Mip-NeRF 360 [Barron et al. 2022] extends the basic NeRF formulation to reconstruct and render real-world "360 degree" scenes where cameras can observe unbounded scene content in all directions. Two improvements introduced in mip-NeRF 360 are the use of a contraction function and a proposal MLP. The contraction function maps unbounded scene points in $\mathbb{R}^3$ to a bounded domain:
$$\operatorname{contract}(\mathbf{x}) = \begin{cases} \mathbf{x} & \lVert\mathbf{x}\rVert \le 1 \\ \left(2 - \dfrac{1}{\lVert\mathbf{x}\rVert}\right)\dfrac{\mathbf{x}}{\lVert\mathbf{x}\rVert} & \lVert\mathbf{x}\rVert > 1 \end{cases} \tag{3}$$
which produces contracted coordinates that are well-suited to be positionally encoded as inputs to the MLP. Additionally, mip-NeRF 360 showed that large unbounded scenes with detailed geometry require prohibitively large MLPs and many more samples along each ray than is tractable in the original NeRF framework. Mip-NeRF 360 therefore introduced a proposal MLP: a much smaller MLP that is trained to bound the density of the actual NeRF MLP. This proposal MLP is used in a hierarchical sampling procedure that efficiently generates a set of input samples for the NeRF MLP that are tightly focused around non-empty content in the scene.
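The contraction function of mip-NeRF 360 can be sketched in a few lines. This is a minimal illustration assuming plain Python tuples for 3D points; the function name `contract` is chosen for this example. Points inside the unit ball are left untouched, and everything else is squeezed into the shell between radius 1 and radius 2.

```python
import math

def contract(x):
    """Map an unbounded 3D point into the ball of radius 2."""
    n = math.sqrt(sum(v * v for v in x))
    if n <= 1.0:
        return tuple(x)                 # central region is unchanged
    s = (2.0 - 1.0 / n) / n             # rescale so the new norm is 2 - 1/n
    return tuple(s * v for v in x)
```

For example, a point at distance 2 from the origin lands at distance 1.5, and arbitrarily distant points approach (but never reach) distance 2.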

METHOD
Our method is composed of three stages, which are visualized in Figure 2. First, we optimize a surface-based representation of the geometry and appearance of a scene using NeRF-like volume rendering. Then, we "bake" that geometry into a mesh, which we show is accurate enough to support convincing appearance editing and physics simulation. Finally, we train a new appearance model that uses spherical Gaussians (SGs) embedded within each vertex of the mesh, which replaces the expensive NeRF-like appearance model from the first step. The resulting 3D representation can be rendered in real time on commodity devices, as rendering simply requires rasterizing a mesh and querying a small number of spherical Gaussians.

Fig. 3. Our method produces an accurate mesh and decomposes appearance into diffuse and specular color. [Panels: mesh, diffuse color, specular, full appearance.]

Modeling density with an SDF
Our representation combines the benefits of mip-NeRF 360 for representing unbounded scenes with the well-behaved surface properties of VolSDF's hybrid volume-surface representation [Yariv et al. 2021].
VolSDF models the volumetric density of the scene as a parametric function of an MLP-parameterized signed distance function (SDF) $f$ that returns the signed distance $f(\mathbf{x})$ from each point $\mathbf{x} \in \mathbb{R}^3$ to the surface. Because our focus is reconstructing unbounded real-world scenes, we parameterize $f$ in contracted space (Equation 3) rather than world space. The underlying surface of the scene is the zero level set of $f$, i.e., the set of points at distance zero from the surface:
$$\mathcal{S} = \{\mathbf{x} : f(\mathbf{x}) = 0\}. \tag{4}$$
Following VolSDF, we define the volume density $\tau$ as:
$$\tau(\mathbf{x}) = \alpha \Psi_\beta\big(f(\mathbf{x})\big), \tag{5}$$
where $\Psi_\beta$ is the cumulative distribution function of a zero-mean Laplace distribution with scale parameter $\beta > 0$. Note that as $\beta$ approaches 0, the volumetric density approaches a function that returns $\alpha$ inside any object and 0 in free space. To encourage $f$ to approximate a valid signed distance function (i.e., one where $f(\mathbf{x})$ returns the signed Euclidean distance to the level set of $f$ for all $\mathbf{x}$), we penalize the deviation of $f$ from satisfying the Eikonal equation [Gropp et al. 2020]:
$$\mathcal{L}_{\text{SDF}} = \mathbb{E}_{\mathbf{x}}\big[(\lVert \nabla f(\mathbf{x}) \rVert - 1)^2\big]. \tag{6}$$
Note that as $f$ is defined in contracted space, this constraint also operates in contracted space.
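The density of Equation 5 and the Eikonal penalty can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `s` denotes the SDF value signed so that it is positive inside objects (which is what the limiting behavior described above requires; VolSDF itself writes the argument as the negated distance, with distances negative inside), `beta` is the Laplace scale, the density scale is taken as `1 / beta`, and the gradient is estimated with central finite differences rather than automatic differentiation. The helper names are hypothetical.

```python
import math

def laplace_cdf(s, beta):
    """CDF of a zero-mean Laplace distribution with scale beta > 0."""
    if s <= 0.0:
        return 0.5 * math.exp(s / beta)
    return 1.0 - 0.5 * math.exp(-s / beta)

def density(s, beta):
    """Volume density alpha * Psi_beta(s), with alpha = 1 / beta.

    Assumption: s > 0 means "inside an object".
    """
    return (1.0 / beta) * laplace_cdf(s, beta)

def eikonal_penalty(f, x, h=1e-4):
    """(||grad f(x)|| - 1)^2, with the gradient from central differences."""
    grad = []
    for k in range(3):
        xp, xm = list(x), list(x)
        xp[k] += h
        xm[k] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - 1.0) ** 2
```

With a small `beta`, `density` is close to `1 / beta` for positive `s` and close to zero for negative `s`, and `eikonal_penalty` vanishes for a true distance function such as that of a sphere.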
Recently, Ref-NeRF [Verbin et al. 2022] improved view-dependent appearance by parameterizing it as a function of the view direction reflected about the surface normal. Our use of an SDF-parameterized density allows this to be easily adopted, as SDFs have well-defined surface normals: $\mathbf{n}(\mathbf{x}) = \nabla f(\mathbf{x}) / \lVert \nabla f(\mathbf{x}) \rVert$. Therefore, when training this stage of our model we adopt Ref-NeRF's appearance model and compute color using separate diffuse and specular components, where the specular component is parameterized by the concatenation of the view direction reflected about the normal direction, the dot product between the normal and view direction, and a 256-element bottleneck vector output by the MLP that parametrizes $f$.
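The inputs to the specular component can be assembled as in the sketch below. It assumes the Ref-NeRF convention in which the direction points from the surface toward the camera and is reflected as $2(\boldsymbol{\omega}\cdot\mathbf{n})\mathbf{n} - \boldsymbol{\omega}$; the function names and the use of plain tuples are choices made for this example, and the SDF gradient is taken as given.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)

def reflect(to_camera, normal):
    """Reflect the unit direction pointing toward the camera about the normal."""
    d = sum(a * b for a, b in zip(to_camera, normal))
    return tuple(2.0 * d * n - w for n, w in zip(normal, to_camera))

def specular_inputs(sdf_gradient, to_camera, bottleneck):
    """Concatenate reflected direction, normal-view dot product, and bottleneck."""
    n = normalize(sdf_gradient)          # surface normal from the SDF gradient
    w = normalize(to_camera)
    n_dot_w = sum(a * b for a, b in zip(n, w))
    return reflect(w, n) + (n_dot_w,) + tuple(bottleneck)
```

In the model described above the bottleneck has 256 elements; any length works in this sketch.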
We use a variant of mip-NeRF 360 as our model (see Appendix A in the supplementary material for specific training details). Similarly to VolSDF [Yariv et al. 2021], we parameterize the density scale factor as $\alpha = \beta^{-1}$ in Equation 5. However, we find that scheduling $\beta$ rather than leaving it as a free optimizable parameter results in more stable training. We therefore anneal $\beta$ according to a schedule $\beta_t$ that moves from $\beta_0$ to $\beta_1$, where $t$ goes from 0 to 1 during training, $\beta_0 = 0.1$, and $\beta_1$ for the three hierarchical sampling stages is 0.015, 0.003, and 0.001 respectively. Because the Eikonal regularization needed for an SDF parameterization of density already removes floaters and results in well-behaved normals, we do not find it necessary to use the orientation loss or predicted normals from Ref-NeRF, or the distortion loss from mip-NeRF 360.

Baking a high-resolution mesh
After optimizing our neural volumetric representation, we create a triangle mesh from the recovered MLP-parameterized SDF by querying it on a regular 3D grid and then running Marching Cubes [Lorensen and Cline 1987]. Note that VolSDF models boundaries using a density fall-off that extends beyond the SDF zero crossing (parameterized by $\beta$). We account for this spread when extracting the mesh and choose 0.001 as the iso-value for surface crossings, as otherwise we find the scene geometry to be slightly eroded.
Visibility and free-space culling. When running Marching Cubes, the MLP-parameterized SDF may contain spurious surface crossings in regions that are occluded from the observed viewpoints as well as regions that the proposal MLP marks as "free space". The SDF MLP's values in both of these types of regions are not supervised during training, so we must cull any surface crossings that would show up as spurious content in the reconstructed mesh. To address this, we inspect the 3D samples taken along the rays in our training data. We compute the volumetric rendering weight for each sample, i.e., how much it contributes to the training pixel color. We then splat any sample with a sufficiently large rendering weight (> 0.005) into the 3D grid and mark the corresponding cell as a candidate for surface extraction.
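The candidate-marking step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that samples are already expressed in contracted coordinates, that the grid spans the cube $[-2, 2]^3$ enclosing the contracted domain, and that each sample is splatted only into its nearest cell. The function name `mark_candidates` and the sparse `set` representation are hypothetical.

```python
def mark_candidates(samples, resolution, threshold=0.005, lo=-2.0, hi=2.0):
    """Return the set of grid cells that are candidates for surface extraction.

    samples:    iterable of ((x, y, z), rendering_weight) pairs, with the
                position in contracted coordinates (assumption)
    resolution: number of cells along each grid axis
    threshold:  minimum rendering weight for a sample to count as visible
    """
    scale = resolution / (hi - lo)
    active = set()
    for position, weight in samples:
        if weight <= threshold:
            continue                      # occluded or free-space sample
        cell = tuple(
            min(resolution - 1, max(0, int((v - lo) * scale)))
            for v in position
        )
        active.add(cell)
    return active
```

Marching cubes is then run only on cells in the returned set, so unsupervised regions of the SDF never produce geometry.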
Mesh extraction. We sample our SDF grid at evenly spaced coordinates in the contracted space, which yields unevenly spaced non-axis-aligned coordinates in world space. This has the desirable property of creating smaller triangles (in world space) for foreground content close to the origin and larger triangles for distant content. Effectively, we leverage the contraction operator as a level-of-detail strategy: as our desired rendered views are close to the scene origin, and because the shape of the contraction is designed to undo the effects of perspective projection, all triangles will have approximately equal areas when projected onto the image plane.
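The level-of-detail effect can be checked numerically by inverting the contraction and measuring the world-space distance between neighboring grid samples. The sketch below assumes a grid spanning $[-2, 2]$ along each axis and measures spacing only along a radial line; `uncontract` and `world_spacing` are names introduced for this example.

```python
import math

def uncontract(z):
    """Inverse of the mip-NeRF 360 contraction for points with norm < 2."""
    n = math.sqrt(sum(v * v for v in z))
    if n <= 1.0:
        return tuple(z)
    if n >= 2.0:
        raise ValueError("contracted points must lie strictly inside radius 2")
    s = 1.0 / ((2.0 - n) * n)           # world-space radius is 1 / (2 - n)
    return tuple(s * v for v in z)

def world_spacing(radius_contracted, resolution=2048):
    """World-space gap between two radially adjacent grid samples."""
    step = 4.0 / resolution             # uniform step in contracted space
    a = uncontract((radius_contracted, 0.0, 0.0))
    b = uncontract((radius_contracted + step, 0.0, 0.0))
    return b[0] - a[0]
```

Inside the unit ball the spacing equals the grid step, and it grows steadily toward the boundary of the contracted domain, which is why distant content receives larger triangles.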
Region growing. After extracting the triangle mesh, we use a region growing procedure to fill small holes that might exist in regions that were either unobserved by input viewpoints or missed by the proposal MLP during the baking procedure. We iteratively mark voxels in a neighborhood around the current mesh and extract any surface crossings that exist in these newly active voxels. This region-growing strategy effectively remedies situations where a surface exists in the SDF MLP but was not extracted by marching cubes due to insufficient training view coverage or errors in the proposal MLP. We then transform the mesh into world space so it is ready for rasterization by a conventional rendering engine that operates in Euclidean space.
Implementation. We use a $2048^3$ grid for both visibility and free-space culling and marching cubes. Initially, we run marching cubes only on voxels that were not culled, i.e., visible and non-empty. We then complete the mesh with 32 region-growing iterations, where we re-run marching cubes in an $8^3$ voxel neighborhood around the vertices in the current mesh. Finally, we post-process the mesh using vertex order optimization [Sander et al. 2007], which speeds up rendering performance on modern hardware by allowing vertex shader outputs to be cached and reused between neighboring triangles. In Appendix B we detail additional steps for mesh extraction which do not strictly improve reconstruction accuracy, but enable a more pleasing interactive viewing experience.
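The region-growing loop can be sketched at the level of voxels. This is a simplified stand-in for the procedure described above: it tracks voxels rather than mesh vertices, and it replaces marching cubes with a hypothetical predicate `has_crossing(voxel)` that reports whether the SDF crosses the iso-value inside a voxel. The choice of offsets for the $8^3$ neighborhood is also an assumption.

```python
from itertools import product

def grow_regions(surface_voxels, has_crossing, iterations=32, size=8):
    """Grow a set of surface voxels into neighboring voxels with crossings.

    surface_voxels: iterable of (i, j, k) voxels that already hold geometry
    has_crossing:   callable (i, j, k) -> bool (stand-in for marching cubes)
    iterations:     maximum number of growing passes
    size:           edge length of the cubic neighborhood searched per voxel
    """
    half = size // 2
    offsets = list(product(range(-half, size - half), repeat=3))
    mesh = set(surface_voxels)
    frontier = set(mesh)
    for _ in range(iterations):
        new = set()
        for i, j, k in frontier:
            for di, dj, dk in offsets:
                v = (i + di, j + dj, k + dk)
                if v not in mesh and v not in new and has_crossing(v):
                    new.add(v)
        if not new:
            break                        # nothing left to fill
        mesh |= new
        frontier = new                   # only grow from fresh voxels
    return mesh
```

Growing only from the newly added voxels keeps each pass proportional to the size of the frontier rather than the whole mesh.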

Modeling view-dependent appearance
The baking procedure described above extracts high-quality triangle mesh geometry from our MLP-based scene representation. To model the scene's appearance, including view-dependent effects such as specularities, we equip each mesh vertex with a diffuse color $\mathbf{c}_d$ and a set of spherical Gaussian lobes. As far-away regions are only observed from a limited set of view directions, we do not need to model view dependence with the same fidelity everywhere in the scene. In our experiments, we use three spherical Gaussian lobes in the central regions ($\lVert\mathbf{x}\rVert \le 1$) and one lobe in the periphery. Figure 3 demonstrates our appearance decomposition.
This appearance representation satisfies our efficiency goal for both compute and memory and can thus be rendered in real time. Each spherical Gaussian lobe has seven parameters: a 3D unit vector $\boldsymbol{\mu}$ for the lobe mean, a 3D vector $\mathbf{c}$ for the lobe color, and a scalar $\lambda$ for the width of the lobe. These lobes are parameterized by the view direction vector $\mathbf{d}$, so the rendered color $\mathbf{C}$ for a ray intersecting any given vertex can be computed as:
$$\mathbf{C} = \mathbf{c}_d + \sum_{i=1}^{N} \mathbf{c}_i \exp\big(\lambda_i (\boldsymbol{\mu}_i \cdot \mathbf{d} - 1)\big).$$
Table 1. [Columns: Outdoor Scenes (PSNR ↑, SSIM ↑, LPIPS ↓) and Indoor Scenes (PSNR ↑, SSIM ↑, LPIPS ↓); the offline rows begin with NeRF [Mildenhall et al. 2020].]
To optimize this representation, we first rasterize the mesh into all training views and store the vertex indices and barycentric coordinates associated with each pixel. After this preprocessing, we can easily render a pixel by applying barycentric interpolation to the learned per-vertex parameters and then running our view-dependent appearance model (simulating the operation of a fragment shader). We can therefore optimize the per-vertex parameters by minimizing a per-pixel color loss as in Equation 2. As detailed in Appendix B, we also optimize for a background clear color to provide a more pleasing experience with the interactive viewer. To prevent that optimization from being biased by pixels that are not well-modeled by mesh geometry (e.g., pixels at soft object boundaries and semi-transparent objects), instead of the L2 loss that was minimized by VolSDF we use a robust loss $\rho(\cdot, \alpha, c)$ with hyperparameters $\alpha = 0$, $c = 1/5$ during training, which allows optimization to be more robust to outliers [Barron 2019]. We also model quantization with a straight-through estimator [Bengio et al. 2013], ensuring that the optimized values for view-dependent appearance are well represented by 8 bits of precision.
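The per-vertex appearance model and the robust loss can be sketched as below. This is an illustrative sketch rather than the shader or training code used for the paper: it assumes the standard spherical Gaussian form $\mathbf{c}\exp(\lambda(\boldsymbol{\mu}\cdot\mathbf{d} - 1))$ for each lobe, unit-length `mu` and `view_dir`, and the $\alpha = 0$ special case of the general robust loss of Barron [2019], $\log(\tfrac{1}{2}(x/c)^2 + 1)$. The function names are chosen for this example.

```python
import math

def sg_color(diffuse, lobes, view_dir):
    """Diffuse color plus a sum of spherical Gaussian lobes.

    diffuse:  RGB tuple c_d
    lobes:    list of (mu, color, lam) with unit mean mu, RGB color, width lam
    view_dir: unit view direction d
    """
    out = list(diffuse)
    for mu, color, lam in lobes:
        cos = sum(m * d for m, d in zip(mu, view_dir))
        g = math.exp(lam * (cos - 1.0))   # equals 1 when d aligns with mu
        for k in range(3):
            out[k] += color[k] * g
    return tuple(out)

def robust_loss(x, c=0.2):
    """General robust loss at alpha = 0: log(0.5 * (x / c)^2 + 1)."""
    return math.log(0.5 * (x / c) ** 2 + 1.0)
```

Each lobe costs one dot product and one exponential per pixel, and the logarithmic growth of `robust_loss` limits the influence of pixels that the opaque mesh cannot explain.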
We find that directly optimizing this per-vertex representation saturates GPU memory, which prevents us from scaling up to high-resolution meshes. We instead optimize a compressed neural hash grid model based on Instant NGP [Müller et al. 2022] (see Appendix A in the supplemental material). During optimization, we query this model at each 3D vertex location within a training batch to produce our diffuse colors and spherical Gaussian parameters.
After optimization is complete, we bake out the compressed scene representation contained in the hash grids by querying the NGP model at each vertex location for the appearance-related parameters. Finally, we export the resulting mesh and per-vertex appearance parameters using the glTF format [ISO/IEC 12113:2022 2022] and compress it with gzip, a format natively supported by web protocols.

EXPERIMENTS
We evaluate our method's performance both in terms of the accuracy of its output renderings and in terms of its speed, energy, and memory requirements. For accuracy, we test two versions of our model: the intermediate volume rendering results described in Section 4.1, which we refer to as our "offline" model, and the baked real-time model described in Sections 4.2 and 4.3, which we call the "real-time" model. As baselines we use prior offline models [Barron et al. 2022; Mildenhall et al. 2020; Müller et al. 2022; Riegler and Koltun 2021; Zhang et al. 2020] designed for fidelity, as well as prior real-time methods [Chen et al. 2022a; Hedman et al. 2018] designed for performance. We additionally compare our method's recovered meshes with those extracted by COLMAP [Schönberger et al. 2016], mip-NeRF 360 [Barron et al. 2022], and MobileNeRF [Chen et al. 2022a]. All FPS (frames-per-second) measurements are for rendering at 1920 × 1080 resolution.

Real-time rendering of unbounded scenes
We evaluate our method on the dataset of real-world scenes from mip-NeRF 360 [Barron et al. 2022], which contains complicated indoor and outdoor scenes captured from all viewing angles. In Table 1 we present a quantitative evaluation of both the offline and real-time versions of our model against our baselines. Though our offline model is outperformed by some prior works (as we might expect, given that our focus is performance), our real-time method outperforms the two recent state-of-the-art real-time baselines we evaluate against across all three error metrics used by this benchmark. In Figure 4 we show a qualitative comparison of renderings from our model and these two state-of-the-art real-time baselines, and we observe that our approach exhibits more detail and fewer artifacts than prior work.
In Table 2 we evaluate our method's rendering performance by comparing against Instant NGP (the fastest "offline" model we evaluate against) and MobileNeRF (the real-time model that produces the highest quality renderings after our own). We measure performance of all methods at 1920 × 1080. Both MobileNeRF and our method are running in-browser on a 16-inch MacBook Pro with a Radeon 5500M GPU, while Instant NGP is running on a workstation equipped with a powerful NVIDIA RTX 3090 GPU. Though our approach requires more on-disk storage than MobileNeRF (1.27×) and Instant NGP (4.07×), we see that our model is significantly more efficient than both baselines: our model yields FPS/Watt metrics that are 1.44× and 77× greater respectively, in addition to producing higher quality renderings.
Table 2. The performance (Watts consumed, frames per second, and their ratio) and storage requirements for our real-time method and two baselines. FPS is measured when rendering at 1920 × 1080 resolution. [Columns: W ↓, FPS ↑, FPS/W ↑, MB (disk) ↓; rows include Instant NGP [Müller et al. 2022].]
Our appreciably improved performance relative to MobileNeRF may seem unusual at first glance, as our approach and MobileNeRF both yield optimized meshes that can be easily and quickly rasterized. This discrepancy is likely due to MobileNeRF's reliance on alpha masking (which results in a significant amount of compute-intensive overdraw) and MobileNeRF's use of an MLP to model view-dependent radiance (which requires more compute to evaluate than our spherical Gaussian approach).
Compared to Deep Blending [Hedman et al. 2018], we see from Table 1 that our method achieves higher quality. It is also worth noting that our representation is much simpler: while our meshes can be rendered in a browser, Deep Blending relies on carefully tuned CUDA rendering and must store both color and geometry for all training images in the scene. As a result, total storage cost for Deep Blending in the outdoor scenes is 2.66× higher (1154.78 MB on average) than for our corresponding meshes.

Fig. 5. Comparing the meshes produced by our technique with baselines that yield meshes. Our meshes are higher in quality compared to those of COLMAP, MobileNeRF, and Mip-NeRF 360. COLMAP's mesh contains noise, floaters, and irregular object boundaries, MobileNeRF's mesh is a "polygon soup" that may not accurately represent scene geometry, and isosurfaces from Mip-NeRF 360's density field tend to be noisy and represent reflections with inaccurate geometry. [Panels: COLMAP, MobileNeRF, Mip-NeRF 360, Ours.]

Mesh extraction
In Figure 5 we present a qualitative comparison of our mesh with those obtained using COLMAP [Schönberger et al. 2016], MobileNeRF [Chen et al. 2022a], and an iso-surface of Mip-NeRF 360 [Barron et al. 2022]. We evaluate against COLMAP not only because it represents a mature structure-from-motion software package, but also because the geometry produced by COLMAP is used as input by Stable View Synthesis and Deep Blending. COLMAP uses volumetric graph cuts on a tetrahedralization of the scene [Jancosek and Pajdla 2011; Labatut et al. 2007] to obtain a binary segmentation of the scene and then forms a triangle mesh as the surface between these regions. Because this binary segmentation does not allow for any averaging of the surface, small noise in the initial reconstruction tends to result in noisy reconstructed meshes, which results in a "bumpy" appearance. MobileNeRF represents the scene as a disconnected collection of triangles, as its sole focus is view synthesis. As a result, its optimized and pruned "triangle soup" is highly noisy and may not be ideal for downstream tasks such as appearance editing.
As recently shown [Oechsle et al. 2021; Wang et al. 2021; Yariv et al. 2021], extracting an iso-surface directly from the density field predicted by NeRF can sometimes fail to faithfully capture the geometry of the scene. In Figure 5 we show this effect using Mip-NeRF 360 and extract the iso-surface where its density field exceeds a value of 50. Note how the surface of the table is no longer flat, as the reflection of the vase is modeled using mirror-world geometry. In contrast, our method produces a smooth and high-fidelity mesh, which is better suited for appearance and illumination editing, as demonstrated in Figure 1.

Appearance model ablation
In Table 3 we present the results of an ablation study of our spherical Gaussian appearance model. We see that reducing the number of SGs to 2, 1, and 0 (i.e., a diffuse model) causes accuracy to degrade monotonically. However, when using 3 SGs in the periphery our model tends to overfit to the training views, causing a slight drop in quality compared to our proposed model with just a single peripheral SG. Furthermore, compared to 3 SGs everywhere, using a single SG in the periphery reduces the average vertex size by 1.52× (from 36 to 23.76 bytes), which significantly reduces the memory bandwidth consumption (a major performance bottleneck for rendering). Perhaps surprisingly, replacing our SG appearance model with the small view-dependent MLP used by both SNeRG [Hedman et al. 2021] and MobileNeRF [Chen et al. 2022a] significantly reduces rendering quality and yields error metrics that are roughly comparable to the "1 Spherical Gaussian" ablation. This is especially counter-intuitive given the significant cost of evaluating a small MLP (∼ 2070 FLOPs per pixel) compared to a single spherical Gaussian (21 FLOPs per pixel). Additionally, we ablate the robust loss used to train our appearance representation by replacing it with a simple L2 loss, which unsurprisingly boosts PSNR (a monotonically decreasing function of MSE) at the expense of the other metrics.
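The link between the L2 loss and PSNR follows directly from the definition of PSNR. The sketch below assumes images stored as flat sequences of intensities in $[0, 1]$ and uses the usual definition $10 \log_{10}(\text{peak}^2 / \text{MSE})$; the helper names are chosen for this example.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    m = mse(a, b)
    if m == 0.0:
        return math.inf
    return 10.0 * math.log10(peak * peak / m)
```

Because PSNR depends on the images only through MSE, a model trained to minimize the L2 loss is optimizing PSNR directly, whereas the robust loss trades some PSNR for better behavior on the other metrics.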

Fig. 6. Panels: ground truth, our rendering, our mesh. Our framework is based on a neural SDF representation, which struggles to represent semi-transparent objects or thin structures. These limitations can in turn degrade the quality of our renderings.

Limitations
Although our model achieves state-of-the-art speed and accuracy for the established task of real-time rendering of unbounded scenes, there are several limitations that represent opportunities for future improvement. We represent the scene using a fully opaque mesh, and as such our model may struggle to represent semi-transparent content (glass, fog, etc.). As is common for mesh-based approaches, our model also sometimes fails to accurately represent areas with small or detailed geometry (dense foliage, thin structures, etc.). Figure 6 shows additional visualizations of extracted meshes that illustrate these surface reconstruction limitations and their effect on rendering quality. These concerns could perhaps be addressed by augmenting the mesh with opacity values, but allowing for continuous opacity would require a complex polygon sorting procedure that is difficult to integrate into a real-time rasterization pipeline. One additional limitation of our technique is that our model's output meshes occupy a significant amount of on-disk space (∼430 megabytes per scene), which may prove challenging to store or stream for some applications. This could be ameliorated through mesh simplification followed by UV atlasing. However, we found that existing tools for simplification and atlasing, which are mostly designed for artist-made 3D assets, did not work well for our meshes extracted by marching cubes.

CONCLUSION
We have presented a system that produces a high-quality mesh for real-time rendering of large unbounded real-world scenes. Our technique first optimizes a hybrid neural volume-surface representation of the scene that is designed for accurate surface reconstruction. From this hybrid representation, we extract a triangle mesh whose vertices contain an efficient representation of view-dependent appearance, then optimize this meshed representation to best reproduce the captured input images. This results in a mesh that yields state-of-the-art results for real-time view synthesis in terms of both speed and accuracy, and is of a high enough quality to enable downstream applications.

A TRAINING AND OPTIMIZATION DETAILS
SDF model definition and optimization. As stated in Section 4.1, we model our SDF using a variant of mip-NeRF 360. We train our model using the same optimization settings as mip-NeRF 360 (250k iterations of Adam [Kingma and Ba 2015] with a batch size of $2^{14}$ and a learning rate that is warm-started and then log-linearly interpolated from $2 \cdot 10^{-3}$ to $2 \cdot 10^{-5}$, with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-6}$) and similar MLP architectures (a proposal MLP with 4 layers and 256 hidden units, and a NeRF MLP with 8 layers and 1024 hidden units, both using swish/SiLU rectifiers [Hendrycks and Gimpel 2016] and 8 scales of positional encoding). Following the hierarchical sampling procedure of mip-NeRF 360, we perform two resampling stages using 64 samples evaluated using the proposal MLP, and then one evaluation stage using 32 samples of the NeRF MLP. The proposal MLP is optimized by minimizing $\mathcal{L}_{\text{prop}} + 0.1\,\mathcal{L}_{\text{SDF}}$, where $\mathcal{L}_{\text{prop}}$ is the proposal loss described in [Barron et al. 2022], designed to bound the weights output by the NeRF MLP density.
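The learning rate schedule described above can be sketched as follows. This is a minimal illustration of log-linear interpolation between the stated endpoints over 250k iterations; the text only says that the schedule is "warm-started", so the warm-up length (`warmup_steps`), the initial multiplier (`warmup_mult`), and the sinusoidal ramp shape are assumptions, as is the function name `log_linear_lr`.

```python
import math


def log_linear_lr(step, max_steps=250_000, lr_init=2e-3, lr_final=2e-5,
                  warmup_steps=512, warmup_mult=0.01):
    """Log-linearly interpolated learning rate with an assumed warm-up ramp."""
    t = min(max(step / max_steps, 0.0), 1.0)
    # Interpolating linearly in log space gives a geometric decay.
    lr = math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))
    if warmup_steps > 0:
        # Assumed warm-up: smoothly scale from warmup_mult up to 1.
        ramp = min(max(step / warmup_steps, 0.0), 1.0)
        lr *= warmup_mult + (1.0 - warmup_mult) * math.sin(0.5 * math.pi * ramp)
    return lr


for step in (0, 512, 125_000, 250_000):
    print(step, log_linear_lr(step))
```

Halfway through training the rate equals the geometric mean of the two endpoints, which is what distinguishes this schedule from a linear decay.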
Optimizing for per-vertex attributes via a compressed hash grid. As stated in Section 4.3, during optimization we use Instant NGP [Müller et al. 2022] as the underlying representation for our vertex attributes. We use the following hyperparameters: $L = 18$, $T = 2^{21}$, and $N_{\max} = 8192$. We remove the view-direction input from the NGP model, as we incorporate it later in Equation 7. We use a weight decay of 0.1 for the hash grids but not the MLP, and optimize using Adam [Kingma and Ba 2015] for 150k iterations with a batch size of $2^{14}$ and an initial learning rate of 0.001 that we drop by 10× every 50k iterations.
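For reference, the settings in the preceding paragraph can be collected into a small configuration together with the stepwise learning rate schedule. This is only a sketch: the dictionary keys and the function name `step_decay_lr` are hypothetical and do not correspond to any particular Instant NGP configuration format; the numeric values are those stated in the text.

```python
# Hypothetical key names; values are taken from the text above.
APPEARANCE_OPTIMIZATION = {
    "hash_levels_L": 18,
    "hash_table_size_T": 2 ** 21,
    "max_resolution_N_max": 8192,
    "hash_grid_weight_decay": 0.1,  # applied to the hash grids only, not the MLP
    "iterations": 150_000,
    "batch_size": 2 ** 14,
}


def step_decay_lr(step, lr_init=1e-3, drop_every=50_000, drop_factor=10.0):
    """Learning rate that is divided by `drop_factor` every `drop_every` steps."""
    return lr_init / (drop_factor ** (step // drop_every))


for step in (0, 50_000, 100_000):
    print(step, step_decay_lr(step))
```

Over the 150k iterations this yields three plateaus, at 1e-3, 1e-4, and 1e-5.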

B TWEAKS FOR A COMPELLING VIEWER
Here we detail a few tweaks to the pipeline which do not strictly improve reconstruction accuracy, but rather make for a more compelling viewing experience. With this in mind, we found it important to alleviate jarring transitions between the reconstructed scene content and the background color. To this end, we also include a global clear color in the appearance parameters we optimize for in Section 4.3. That is, we assign this color to any pixel in the training data which does not have a valid triangle index.
To further mask the transition between geometry and background, we enclose the SDF with bounding geometry before extracting the mesh in Section 4.2. We compute a convex hull as the intersection of 32 randomly oriented planes, where the location of each plane has been set to bound 99.75% of the voxels that have been marked as candidates for surface extraction. We then make this hull more conservative by inflating it by a factor of 1.025. However, since the extracted mesh needs to be transformed into world space for rendering, we must take care to avoid numerical precision issues that may arise from using unbounded vertex coordinates during rasterization. We solve this by bounding the scene with a distant sphere with a radius of 500 world-space units. These two operations are easily implemented by setting the SDF value in each grid cell to the pointwise minimum of the MLP-parameterized SDF and the SDF of the defined bounding geometry.
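The bounding-geometry construction can be sketched in a few lines. The code below is one possible reading of the description rather than the authors' implementation, and it rests on several assumptions: it works in a single coordinate frame and ignores the scene contraction used by mip-NeRF 360, it treats the bounding geometry as "inside-out" (positive inside the hull and sphere, negative outside) so that a pointwise minimum encloses the scene, it inflates the hull by scaling each plane offset about the origin, and it uses max-of-planes as an approximate hull distance. All function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_unit_vector(rng):
    # Normalizing an isotropic Gaussian sample gives a uniformly random direction.
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        length = math.sqrt(dot(v, v))
        if length > 1e-9:
            return [c / length for c in v]


def fit_bounding_planes(points, num_planes=32, coverage=0.9975,
                        inflate=1.025, seed=0):
    """Fit planes (normal, offset) so each bounds `coverage` of the points."""
    rng = random.Random(seed)
    planes = []
    for _ in range(num_planes):
        normal = random_unit_vector(rng)
        projections = sorted(dot(normal, p) for p in points)
        index = min(len(projections) - 1,
                    math.ceil(coverage * len(projections)) - 1)
        # Assumes the origin lies inside the hull, so scaling offsets inflates it.
        planes.append((normal, inflate * projections[index]))
    return planes


def bounding_sdf(x, planes, sphere_radius=500.0):
    """Inside-out SDF: positive inside the hull and sphere, negative outside."""
    hull = min(offset - dot(normal, x) for normal, offset in planes)
    sphere = sphere_radius - math.sqrt(dot(x, x))
    return min(hull, sphere)


def enclosed_sdf(scene_sdf_value, x, planes, sphere_radius=500.0):
    """Pointwise minimum of the scene SDF and the bounding-geometry SDF."""
    return min(scene_sdf_value, bounding_sdf(x, planes, sphere_radius))
```

Here `points` stands in for the centers of the voxels marked as surface candidates. Taking the minimum leaves the scene SDF untouched wherever it is already smaller than the bounding value, and closes off the volume everywhere else.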

C BASELINES DETAILS
MobileNeRF viewer configuration. Note that by default the MobileNeRF viewer runs at a reduced resolution to achieve high frame rates across a variety of devices. For our comparisons we modify it to run at different resolutions. When we compute image quality metrics, we use the resolution of the test set images. When we measure run-time performance, we use 1920 × 1080, a resolution that is representative of most modern displays.
Instant NGP. Table 1 reports quality results for the Instant NGP [Müller et al. 2022] method, which we carefully adapted to work on large unbounded scenes. We asked the authors of Instant NGP for help with tuning their method and made the following changes:
• we use the big.json configuration file provided with the official code release,
• we increased the batch size by 4× to $2^{20}$, and
• we increased the scene scale from 16 to 32.
Note that none of these changes has a significant impact on the render time for Instant NGP.
By default, the Instant NGP viewer is equipped with a dynamic upscaling implementation, which renders images at a lower resolution and then applies smart upscaling. For a fair comparison we turn this off when measuring performance, as these dynamic upscalers can be applied to any renderer. More importantly, we want the performance numbers to correspond with the test set quality metrics, and none of the test-set images were computed using upscaling.

Fig. 1. Our method, BakedSDF, optimizes a neural surface-volume representation of a complex real-world scene and (a) "bakes" that representation into a high-resolution mesh. These meshes (b) can be rendered in real time on commodity hardware, and support other applications such as (c) separating material components, (d) appearance editing with accurate cast shadows, and (e) physics simulation for inserted objects. Interactive demo at https://bakedsdf.github.io/.

Fig. 4. Test-set renderings (with insets) for our model and the two state-of-the-art real-time baselines we evaluate against, using scenes from the mip-NeRF 360 dataset. Deep Blending [Hedman et al. 2018] produces posterized renderings when the proxy geometry used as input is incorrect (such as in the background of the bicycle scene) and renderings from MobileNeRF [Chen et al. 2022a] tend to exhibit aliasing artifacts or oversmoothing.

Table 1. Results of our model on the "outdoor" and "indoor" scenes from mip-NeRF 360 [Barron et al. 2022], with evaluation split for "offline" and "real-time" algorithms. Red, orange, and yellow indicate the first, second, and third best performing algorithms for each metric. Metrics not provided by a baseline are denoted with "−".

Table 3. An ablation study of our view-dependent appearance model on all scenes from the mip-NeRF 360 dataset. Columns include SSIM ↑, LPIPS ↓, and MB (GPU) ↓.