VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis

With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to traditional mesh-based assets, this volumetric representation is more powerful in expressing scene geometry, but it inevitably suffers from high rendering costs and is hard to use in downstream processes such as editing, which makes it difficult to combine with the existing graphics pipeline. In this paper, we present a hybrid volume-mesh representation, VMesh, which depicts an object with a textured mesh along with an auxiliary sparse volume. VMesh retains the advantages of mesh-based assets, such as efficient rendering and compact storage, while also incorporating the ability to represent subtle geometric structures provided by the volumetric counterpart. VMesh can be obtained from multi-view images of an object and renders at 2K 60FPS on common consumer devices with high fidelity, unleashing new opportunities for real-time immersive applications.


INTRODUCTION
Modern graphics engines mostly rely on polygonal meshes for efficient scene representation and rendering. However, when it comes to the problem of novel view synthesis, the rendering quality of mesh-based representations is greatly limited by the quality of the reconstructed geometry. On the other hand, volumetric representations like NeRF [Mildenhall et al. 2022] have gained more and more attention for their superior view synthesis quality, but at the cost of slow rendering. Although researchers have devoted extensive efforts to enable real-time rendering of volumetric representations [Chen et al. 2022; Garbin et al. 2021; Hedman et al. 2021; Müller et al. 2022; Yu et al. 2021], the storage of large amounts of volumetric data often causes severe memory issues, which prevents them from displaying on low-end devices or scaling to high-resolution scenes.
Fig. 1. Our hybrid volume-mesh representation, VMesh, depicts object geometry with a triangular mesh and a sparse volume (cubes in the wireframe rendering). In the bubble chart, the area of the circle is proportional to the view synthesis quality on the NeRF-Synthetic dataset. VMesh achieves comparable view synthesis quality with existing real-time NeRF variants while being significantly more efficient in rendering speed and storage cost.

Authors' addresses: Yuan-Chen Guo, Tsinghua University, Beijing, China, guoyc19@mails.tsinghua.edu.cn; Yan-Pei Cao, ARC Lab, Tencent PCG, Beijing, China, caoyanpei@gmail.com; Chen Wang, Tsinghua University, Beijing, China, wchen20@mails.tsinghua.edu.cn; Yu He, BIMSA, Beijing, China, hooyeeevan2511@gmail.com; Ying Shan, ARC Lab, Tencent PCG, Shenzhen, China, yingsshan@tencent.com; Xiaohu Qie, Tencent PCG, Shenzhen, China, tiger.qie@gmail.com; Song-Hai Zhang, Tsinghua University, Beijing, China, shz@tsinghua.edu.cn.

arXiv:2303.16184v1 [cs.CV] 28 Mar 2023

In this paper, we pursue a better representation of the geometry and appearance of an object for view synthesis from multi-view images. The representation should have the following characteristics:
• Able to be rendered at the resolution of modern screens (1080P) in real time (30FPS) on consumer-grade devices (laptops, tablets, mobile phones).
• Efficient in storage, occupying moderate disk space (<50MB) to model an object.
• Capable of modeling detailed geometry (such as thin structures) and view-dependent effects.
Besides these primary goals, we also expect the representation to be editable in some ways and to have the potential to be integrated into modern graphics engines. Mesh-based assets commonly adopted in graphics pipelines would meet most of these goals, i.e., high efficiency in rendering and storage, and great editability. However, obtaining high-quality meshes and accurate textures of an object from casually captured images is still an open problem. On the other hand, the volumetric representation used by NeRF-based methods achieves remarkable performance for modeling object appearance, while being less competitive in all other aspects. Therefore, to achieve our goals, instead of improving the volumetric representation to have the abilities it initially does not have, it is more natural to build on the mesh representation and enhance its capability in geometry and appearance modeling with volumetric primitives. The key here is that the volumetric primitives are expected to be very sparse, modeling only the parts that the mesh struggles to model, e.g., thin structures, and existing only where they have enough impact on image quality. In this way, the overhead brought by the volumetric part can be minimized, making it possible for the representation to achieve high efficiency and expressiveness at the same time.
In practice, if we already have a mesh and a volume, combining the two to render jointly could simply be done by raymarching where the ray terminates at mesh surfaces, and compositing the volume color and surface color. However, the main difficulty lies in obtaining the mesh and volume from multi-view images, which is a non-trivial process, as direct optimization of either a mesh or an explicit volumetric grid can be very challenging under the multi-view setting [Fridovich-Keil et al. 2022; Munkberg et al. 2022; Sun et al. 2022]. We thus draw inspiration from recent implicit neural surface reconstruction methods [Wang et al. 2021; Yariv et al. 2021] where signed distance values can be converted to volume densities, and propose to obtain our hybrid representation from an implicit signed distance-density field. In this way, the signed distance field, which represents a surface, and the density field, which represents a volume, can be jointly optimized by volume rendering. Through subsequent quality-preserving quantization steps, the signed distance field and the density field are converted to a triangular mesh and a sparse volume respectively. Furthermore, to model view-dependent effects while enabling efficient rendering and storage, we propose a RefBasis texture representation that mimics physically-based rendering. By representing the view-dependent specular color as the multiplication of a specular tint, a Fresnel attenuation factor, and an environment light given by learned basis functions, our texture representation achieves the best visual quality among existing real-time alternatives with significantly lower storage demand.
We implement a WebGL renderer for our VMesh representation and show that VMesh assets can be rendered at real-time frame rates at HD resolution on consumer-grade devices like mobile phones, tablets, and laptops. It achieves comparable view synthesis quality to existing real-time NeRF variants on various challenging scenes while offering very low storage costs. The explicit mesh surface also facilitates a wide range of applications, including shape deformation and texture editing.

RELATED WORK

2.1 Real-Time View Synthesis
Real-time rendering has long been an important goal pursued by view synthesis techniques, especially the ones based on neural rendering. Since NeRF [Mildenhall et al. 2022] emerged as the new quality standard for view synthesis, many attempts have been made to achieve efficient rendering with NeRF while preserving its high quality. Although many of them succeed in efficient rendering on high-end GPUs, few can maintain real-time performance on consumer-grade devices. PlenOctrees [Yu et al. 2021] utilizes spherical harmonics to represent the view-dependent color, which allows volume density and color information to be stored in a sparse grid organized by octrees and therefore eliminates the massive computational cost of MLP evaluations. SNeRG [Hedman et al. 2021] applies a deferred rendering scheme so that the view-dependent color only needs to be computed once for each pixel. It uses a very small MLP for color computation to balance quality and rendering speed. To further improve rendering efficiency, MobileNeRF [Chen et al. 2022] optimizes a set of triangle faces instead of a volumetric field, taking advantage of the high efficiency of the graphics pipeline. It adopts a texture representation similar to that of SNeRG and is applicable on common devices like laptops and mobile phones. However, the improvement in rendering efficiency introduces trade-offs with storage. These methods often require hundreds or even thousands of megabytes of disk storage for a single object, as storing densely sampled volumetric data or irregular triangles is not efficient. Moreover, they all lack a surface representation, which makes it hard to support further processing such as editing, simulation, shadow casting, and collision detection, to name a few. VMesh proposed in this paper is based on a textured mesh and only relies on volumetric primitives to enhance its representational ability in certain regions, which makes it both memory-efficient and processing-friendly.

Our work is different from mesh-based inverse rendering methods in that we do not assume a strict physically-based rendering model. Even so, we take inspiration from these methods to design our texture representation, which is demonstrated to be powerful enough for high-quality view synthesis while being editable.

Hybrid Volume-Mesh Representation
Meshes have been widely utilized as geometry guidance for NeRF variants to achieve geometry editing or handle deformable objects. NeRF-Editing [Yuan et al. 2022] allows controllable shape deformation on a NeRF representation by first deforming the extracted mesh and then training a deformation field based on the mesh deformation. NeuMesh [Yang et al. 2022] learns vertex features for an extracted mesh and uses it for volume rendering, which also allows view synthesis of deformed objects by directly deforming the proxy mesh. Methods for building head avatars [Zielonka et al. 2022] often take advantage of mesh-based parametric face models to guide the ray marching process. EyeNeRF [Li et al. 2022] uses an explicit eyeball surface to compute reflection and refraction rays for ray marching. Neural Assets [Bozic et al. 2022] extracts a proxy mesh from a trained NeRF to support shadow casting and collision detection. These methods only make use of the geometry of the mesh and solely rely on volume rendering to produce the visual appearance. In contrast, our VMesh representation uses mesh and volume to represent separate geometries, and the mesh can be directly used for rendering. The work of Loubet and Neyret [Loubet and Neyret 2017] is most related to ours, where a hybrid volume-mesh representation is used to build LoDs for complex 3D assets, as the mesh is inefficient for pre-filtering sub-resolution structures, especially at large LoD scales. The authors introduce a heuristic approach to automatically find sub-resolution geometry in a mesh, and perform pre-filtering by voxelization. We share the same intuition to represent subtle structures with volume but manage to obtain such a representation from multi-view images for view synthesis instead of processing existing assets.

PRELIMINARIES

3.1 Neural Radiance Fields
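Neural Radiance Fields (NeRF) [Mildenhall et al. 2022] is the state-of-the-art method for novel view synthesis. It represents a scene as a continuous volumetric field modeled by a multi-layer perceptron (MLP). For each spatial location $\mathbf{x} = (x, y, z)$ and viewing direction $\mathbf{d} = (\theta, \phi)$, the network outputs the volume density $\sigma$ and view-dependent color $\mathbf{c}$:

$$F_\Theta: (\mathbf{x}, \mathbf{d}) \to (\sigma, \mathbf{c}) \tag{1}$$

For each pixel to be rendered, the network is first evaluated on all sampled points $\mathbf{x}_i = \mathbf{r}(t_i)$ along the camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ to get their densities and colors. The pixel color $\hat{C}$ is estimated by the quadrature approximation of the volume rendering equation:

$$\hat{C} = \sum_i T_i \alpha_i \mathbf{c}_i, \quad T_i = \prod_{j<i} (1 - \alpha_j), \quad \alpha_i = 1 - \exp(-\sigma_i \delta_i) \tag{2}$$

where $\delta_i = t_{i+1} - t_i$. NeRF is optimized by minimizing the error between the rendered color $\hat{C}$ and ground truth color $C$:

$$\mathcal{L}_{pm} = \| \hat{C} - C \|_2^2 \tag{3}$$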
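The quadrature in Eq. (2) can be sketched for a single ray as follows. This is a minimal illustration in plain Python, not the training implementation; the function name `composite` and its argument layout are assumptions made for this example.

```python
import math

def composite(sigmas, colors, ts):
    """Quadrature of the volume rendering equation along one ray.

    sigmas: densities at the N samples
    colors: RGB tuples at the N samples
    ts:     N + 1 increasing ray distances bounding the N intervals
    Returns the pixel color and the accumulated opacity sum_i T_i * alpha_i.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # T_i: fraction of light that survives up to sample i
    opacity = 0.0
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        delta = ts[i + 1] - ts[i]               # delta_i = t_{i+1} - t_i
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of interval i
        weight = transmittance * alpha          # contribution of sample i
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        opacity += weight
        transmittance *= 1.0 - alpha
    return pixel, opacity
```

The per-sample `weight` is the same quantity that is later reused as the contribution of a voxel and as the estimated opacity along a ray.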

NeuS
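NeuS [Wang et al. 2021] is a surface reconstruction method based on neural implicit representation. It represents the surface of an object as the zero-level set of a signed distance field (SDF) modeled by an MLP with parameters $\Theta$:

$$f_{sdf}: \mathbf{x} \to d \tag{4}$$

To effectively train the SDF, NeuS derives a mapping, unbiased in the first order of approximation, from the signed distance value $d$ to an opaque density $\rho$, the counterpart of the volume density $\sigma$:

$$\rho(t) = \max\left( \frac{-\frac{\mathrm{d}\Phi_s}{\mathrm{d}t}\big(f_{sdf}(\mathbf{r}(t))\big)}{\Phi_s\big(f_{sdf}(\mathbf{r}(t))\big)}, 0 \right) \tag{5}$$

where $\Phi_s$ is the Sigmoid function with $s$ as a scene-wise learnable parameter. $1/s$ can be viewed as the uncertainty of the surface, with $1/s = 0$ being equivalent to performing surface rendering. The same volume rendering equation (Eq. (2)) can be applied to get the pixel color with a discretized version of volume opacity $\alpha_i$ given by:

$$\alpha_i = \max\left( \frac{\Phi_s\big(f_{sdf}(\mathbf{r}(t_i))\big) - \Phi_s\big(f_{sdf}(\mathbf{r}(t_{i+1}))\big)}{\Phi_s\big(f_{sdf}(\mathbf{r}(t_i))\big)}, 0 \right) \tag{6}$$

Apart from the photometric loss in Eq. (3), NeuS also utilizes an Eikonal term to regularize the SDF:

$$\mathcal{L}_{eik} = \frac{1}{K} \sum_i \big( \| \nabla f_{sdf}(\mathbf{x}_i) \|_2 - 1 \big)^2 \tag{7}$$

where $K$ is the number of sampled points, and an optional object mask loss:

$$\mathcal{L}_{mask} = \mathrm{BCE}(\hat{M}, M) \tag{8}$$

where $\hat{M} = \sum_i T_i \alpha_i$ is the estimated opacity along a ray, $M$ is the ground truth mask, and BCE is binary cross entropy.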
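To make the discretized opacity concrete, the sketch below evaluates it for a pair of consecutive samples given their signed distances. It is a minimal illustration under stated assumptions: the function names are hypothetical, the Sigmoid is written in a numerically stable form, and the small `eps` in the denominator is a guard added for this example rather than part of the formulation.

```python
import math

def sigmoid(x, s):
    """Phi_s(x) = 1 / (1 + exp(-s * x)), evaluated without overflow."""
    z = s * x
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def neus_alpha(sdf_a, sdf_b, s, eps=1e-12):
    """Opacity of the interval between two consecutive samples on a ray.

    sdf_a, sdf_b: signed distances at the near and far end of the interval
    s:            sharpness; larger s means a more confident surface
    """
    phi_a = sigmoid(sdf_a, s)
    phi_b = sigmoid(sdf_b, s)
    # Opacity is positive only where the ray moves toward the inside.
    return max((phi_a - phi_b) / (phi_a + eps), 0.0)
```

With a large `s`, an interval that crosses the zero-level set from outside to inside receives an opacity close to 1, while intervals far from the surface or leaving the object receive 0, which matches the interpretation of 1/s as surface uncertainty.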

HYBRID VOLUME-MESH (VMESH)
Our hybrid volume-mesh representation, or VMesh, depicts object geometry with a triangular mesh surface and a sparse volume. The two types of geometric representation share the same texture formulation and can be jointly rendered by raymarching. We propose to obtain such a representation from multi-view images of an object. To start with, we train a contiguous form of the representation (Sec. 4.1), where the surface part is modeled by a neural signed distance field, and the volume part is modeled by a neural density field. Following NeuS, we first convert the signed distance value to volume opacity and naturally combine the two parts by volume rendering. Then we fix the learned signed distance field and extract a triangular mesh from it as a substitution to be rendered jointly with the neural density field (Sec. 4.2). We utilize differentiable isosurface and rasterization techniques to get high-quality meshes that align well with the implicit geometry. Lastly, we drop all the neural networks and perform discretization to get the final assets for efficient storage and rendering (Sec. 4.3). Concretely, the triangular mesh is simplified and UV-parametrized, and the neural density field is first voxelized and pruned to a sparse volume, which is then organized by perfect spatial hashing to support fast indexing and compact storage.
The training pipeline is illustrated in Fig. 2. We further implement a WebGL2 renderer to enable real-time rendering in modern web browsers on various kinds of consumer-grade devices (Sec. 4.4).

Contiguous Stage
We start from an implicit neural representation where each point in space holds two geometric properties: a signed distance value $d$ for the surface and a density value $\sigma$ for the volume. According to Eq. (5), $d$ can be converted to the opaque density $\rho$, allowing us to treat the surface part as a volume as well. To make the two parts work together, we simply overlay the two volumes, summing up their densities:

$$\sigma_{hyb} = \rho + \sigma$$

According to the definition of volume opacity $\alpha$, the above equation is equivalent to:

$$\alpha_{hyb} = \alpha_{surf} + \alpha_{vol} - \alpha_{surf}\,\alpha_{vol}$$

where $\alpha_{surf}$ and $\alpha_{vol}$ can be obtained from $d$ and $\sigma$ following Eq. (5) and Eq. (2) respectively. The equivalence holds because $1 - \alpha = \exp(-\sigma\delta)$, so summing densities multiplies the per-interval transparencies: $1 - \alpha_{hyb} = (1 - \alpha_{surf})(1 - \alpha_{vol})$. We use $\alpha_{hyb}$ to render the hybrid scene composed of both the surface and the volume and also use $\alpha_{surf}$ to render the surface only.

Implicit neural scene representations often rely on a large MLP to model object appearances. Given the feature of a spatial location and a viewing direction, the MLP predicts the view-dependent color, which can be time-consuming. Existing real-time view synthesis methods improve the efficiency of color queries mainly by:
(1) replacing the MLP with fixed basis functions, such as spherical harmonics or spherical Gaussians [Yu et al. 2021; Zhang et al. 2021].
(2) replacing the MLP with learned basis functions [Garbin et al. 2021; Wizadwongsa et al. 2021]. The functions are also modeled by an MLP, but they can be discretized under acceptable memory consumption as they only depend on viewing directions.
(3) applying deferred shading and using a tiny MLP [Chen et al. 2022; Hedman et al. 2021]. In this way, the MLP is only evaluated per pixel instead of per sample point.

These solutions all have their disadvantages. Fixed basis functions (1) and learned basis functions (2) often require numerous coefficients to be stored at each point to achieve enough representational ability. For example, PlenOctrees [Yu et al. 2021] adopts spherical harmonics up to degree 3, resulting in 16 coefficients per color channel, 48 in total; FastNeRF [Garbin et al. 2021] models 8 basis functions, which corresponds to 8 coefficients per color channel, 24 in total. Tiny MLP (3) struggles to model high-frequency view-dependent effects due to its limited capacity and is much more computationally expensive than the other two.

In this work, we present a new texture representation named RefBasis for real-time view synthesis that is both expressive and efficient in storage. We make the color depend on the reflected ray direction $\omega_r$ instead of the incident ray direction, or viewing direction, $\omega_i$. This makes view dependencies easier to model, as demonstrated in Ref-NeRF [Verbin et al. 2022]. $\omega_r$ can be computed as:

$$\omega_r = \omega_i - 2(\omega_i \cdot \mathbf{n})\,\mathbf{n}$$

where the normal direction $\mathbf{n}$ can be obtained as the gradient direction of the signed distance field by automatic differentiation:

$$\mathbf{n} = \frac{\nabla f_{sdf}(\mathbf{x})}{\| \nabla f_{sdf}(\mathbf{x}) \|_2}$$

We adopt a color formulation similar to that of Ref-NeRF, where the view-dependent color $\mathbf{c}$ is the composition of a diffuse color $\mathbf{c}_d$ and the product of a specular tint $\mathbf{k}$ and a specular color $\mathbf{c}_s$. In Ref-NeRF, $\mathbf{c}_s$ is modeled by an MLP, taking $\omega_r$ as input, as well as a per-point feature vector to bring additional degrees of freedom, and the incident angle to compensate for Fresnel effects. To enable real-time rendering, we instead model $\mathbf{c}_s$ as learned basis functions. Specifically, we predict a set of $N$ base specular colors $\{\mathbf{c}_s^1, \mathbf{c}_s^2, ..., \mathbf{c}_s^N\}$ for each $\omega_r$ and an $N$-dimensional weighting vector $(w_1, w_2, ..., w_N)$ for each point. $\mathbf{c}_s$ is computed as the weighted summation of these base colors:

$$\mathbf{c}_s = \sum_{j=1}^{N} w_j\, \mathbf{c}_s^j(\omega_r)$$

To account for Fresnel effects, we also predict an attenuation factor $A$ from the incident angle and a per-point "metallic" property $m$. The final color $\mathbf{c}$ can be formulated as:

$$\mathbf{c} = \gamma\big(\mathbf{c}_d + A\,\mathbf{k} \odot \mathbf{c}_s\big)$$

where $\gamma$ is a fixed tone-mapping function and $\odot$ denotes elementwise multiplication. The base specular colors and the attenuation factor are modeled by MLPs and will be discretized as look-up tables in later stages.
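The color computation described above can be sketched per shading point as follows. This is a hedged illustration, not the actual shader: it assumes the incident direction points from the camera toward the surface, treats the attenuation factor as a scalar, and uses a simple clamp as a stand-in for the fixed tone-mapping function. The names `reflect` and `refbasis_color` are hypothetical, and `base_colors` stands for the base specular colors already looked up for the reflected direction.

```python
def reflect(omega_i, n):
    """Mirror the incident direction omega_i about the unit normal n."""
    d = sum(a * b for a, b in zip(omega_i, n))
    return tuple(a - 2.0 * d * b for a, b in zip(omega_i, n))

def refbasis_color(diffuse, tint, weights, base_colors, attenuation):
    """Diffuse color plus attenuated, tinted, basis-weighted specular color.

    diffuse, tint: RGB tuples stored per point
    weights:       N per-point weights
    base_colors:   N RGB tuples fetched for the reflected direction
    attenuation:   scalar Fresnel attenuation (assumed scalar here)
    """
    specular = [
        sum(w * base[ch] for w, base in zip(weights, base_colors))
        for ch in range(3)
    ]
    linear = [d + attenuation * k * s for d, k, s in zip(diffuse, tint, specular)]
    # Placeholder tone map: clamp to [0, 1]. The actual function may differ.
    return tuple(min(max(v, 0.0), 1.0) for v in linear)
```

Only the N weights, the tint, the diffuse color, and the "metallic" property need to be stored per point; the base colors depend on direction alone, which is what allows them to be baked into shared look-up tables later.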
In practice, we also predict a normal direction for each location and use this normal to compute $\omega_r$. By forcing the predicted normal $\hat{\mathbf{n}}$ to be close to the analytic normal $\mathbf{n}$, $\hat{\mathbf{n}}$ acts as a smoothness prior at the beginning of training and a low-pass filter to improve the normal quality:

$$\mathcal{L}_{norm} = \| \hat{\mathbf{n}} - \mathbf{n} \|_2^2$$

Since our goal is to represent most parts of the scene with surfaces, and only to model the "hard" areas with volumetric matter, we apply the photometric loss simultaneously on the hybrid rendered color $\hat{C}_{hyb}$, rendered with $\alpha_{hyb}$ in Eq. (11), and the color $\hat{C}_{surf}$ rendered only from the surface model with $\alpha_{surf}$:

$$\mathcal{L}_{pm} = \| \hat{C}_{hyb} - C \|_2^2 + \| \hat{C}_{surf} - C \|_2^2$$

This prioritizes the usage of surfaces over volume wherever possible.
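As a small worked example of how the surface and volume opacities combine and how both renderings are supervised, consider the sketch below. It assumes a squared-error photometric term and uses hypothetical helper names; it only illustrates the arithmetic, not the training code.

```python
import math

def alpha_from_density(sigma, delta):
    """Opacity of an interval of length delta with constant density sigma."""
    return 1.0 - math.exp(-sigma * delta)

def hybrid_alpha(alpha_surf, alpha_vol):
    """Opacity of two overlaid media: transparencies multiply."""
    return alpha_surf + alpha_vol - alpha_surf * alpha_vol

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def photometric_loss(c_hyb, c_surf, c_gt):
    """Supervise the hybrid rendering and the surface-only rendering together."""
    return squared_error(c_hyb, c_gt) + squared_error(c_surf, c_gt)
```

Because the surface-only color is also compared against the ground truth, any pixel that the surface alone can explain incurs no extra benefit from adding volume, which is one way to read the stated preference for surfaces.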
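We also penalize volume densities as in [Yu et al. 2021] to further encourage volume sparsity:

$$\mathcal{L}_{sp} = \frac{1}{K} \sum_i \big| 1 - \exp(-\lambda \sigma_i) \big|$$

where $\sigma_i$ are volume densities at the $K$ sampled locations, and $\lambda$ is a hyperparameter.

We utilize an opaque loss which regularizes the rendered opacity of the surface model $\hat{M}_{surf}$ to be either 0 or 1:

$$\mathcal{L}_{opaque} = \mathrm{BCE}(\hat{M}_{surf}, \hat{M}_{surf})$$

This encourages $s$ in Eq. (6) to be as large as possible, resulting in confident surfaces which benefit surface extraction in later stages.

We optimize the weighted summation of these loss terms in the contiguous stage:

$$\mathcal{L} = \lambda_{pm}\mathcal{L}_{pm} + \lambda_{mask}\mathcal{L}_{mask} + \lambda_{eik}\mathcal{L}_{eik} + \lambda_{norm}\mathcal{L}_{norm} + \lambda_{sp}\mathcal{L}_{sp} + \lambda_{opaque}\mathcal{L}_{opaque}$$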
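The regularizers and their weighted combination can be sketched as below. The exact functional forms are assumptions made for illustration: the sparsity term follows the PlenOctrees-style penalty on densities, and the opaque term is written as a binary entropy of the rendered opacity, which is minimal at 0 and 1. The function names and the dictionary-based weighting are hypothetical.

```python
import math

def sparsity_loss(sigmas, lam):
    """Mean of |1 - exp(-lam * sigma)|; zero only when all densities vanish."""
    return sum(abs(1.0 - math.exp(-lam * s)) for s in sigmas) / len(sigmas)

def opaque_loss(opacity, eps=1e-6):
    """Binary entropy of the rendered opacity (assumed form), clamped for log."""
    m = min(max(opacity, eps), 1.0 - eps)
    return -(m * math.log(m) + (1.0 - m) * math.log(1.0 - m))

def total_loss(terms, weights):
    """Weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())
```

For instance, `opaque_loss(0.5)` equals ln 2, the maximum of the binary entropy, while opacities near 0 or 1 contribute almost nothing, so semi-transparent surface renderings are the ones being penalized.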

Mesh Optimization Stage
Once the contiguous representation is trained, we move to a mesh optimization stage where we extract a triangular mesh from the signed distance field and render it jointly with the neural density field for optimization. A simple way to achieve this is to use isosurface techniques such as Marching Cubes [Lorensen and Cline 1987] to extract the mesh and directly use it for further optimizations. However, due to the limited marching resolution, the extracted mesh cannot perfectly align with the implicit geometry, causing missing structures and biased surfaces. Redundant volume would appear to compensate for the inaccuracies, bringing unnecessary storage overhead. We solve this problem by optimizing the geometry of the extracted mesh with Deep Marching Tetrahedra [Shen et al. 2021] and differentiable rasterization [Laine et al. 2020; Ravi et al. 2020]. Specifically, we fix $f_{sdf}$ from the contiguous stage and initialize a new signed distance field $f'_{sdf}$ from $f_{sdf}$. We apply Deep Marching Tetrahedra on a dense grid with learnable grid vertex positions, and render the extracted mesh to get its silhouette $\hat{M}_{mesh}$ and depth map $\hat{D}_{mesh}$ using a differentiable rasterizer [Laine et al. 2020]. The silhouette and depth map are compared with the opacity $\hat{M}_{surf}$ and depth value $\hat{D}_{surf}$ rendered from $f_{sdf}$ to make the mesh geometry closer to the implicit surface. The constraints can be formulated as:

$$\mathcal{L}_{mesh} = \| \hat{M}_{mesh} - \hat{M}_{surf} \| + \| \hat{D}_{mesh} - \hat{D}_{surf} \|$$

To render the extracted mesh jointly with the neural density field, we apply raymarching that terminates at the mesh surface. For each pixel to be rendered, if it is occupied by the mesh surface, we only sample points in front of the surface to accumulate densities and colors. Then we alpha-blend the volume-rendered color $\hat{C}_{vol}$ with the rasterized pixel color $\hat{C}_{mesh}$, regarding the mesh surface as totally opaque ($\alpha_{mesh} = 1$):

$$\hat{C}_{hyb} = \hat{C}_{vol} + (1 - \hat{M}_{vol})\, \hat{C}_{mesh}$$

where $\hat{M}_{vol} = \sum_i T_i \alpha_i$ is the accumulated opacity of the volume in front of the surface, computed as in Eq. (2).
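Note that up to this point we still rely on the MLPs to obtain the predicted normals and view-dependent colors. In this stage, we supervise on $\hat{C}_{hyb}$ to optimize the appearance of this semi-contiguous representation:

$$\mathcal{L}_{pm} = \| \hat{C}_{hyb} - C \|_2^2$$

The overall loss function is:

$$\mathcal{L} = \lambda_{pm}\mathcal{L}_{pm} + \lambda_{mesh}\mathcal{L}_{mesh} + \lambda_{norm}\mathcal{L}_{norm} + \lambda_{sp}\mathcal{L}_{sp}$$

where $\mathcal{L}_{norm}$ and $\mathcal{L}_{sp}$ are the same as in the contiguous stage.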
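The joint rendering of mesh and volume for one pixel can be sketched as follows, assuming uniformly spaced samples and treating the mesh as fully opaque. The function name `render_hybrid` and its argument layout are hypothetical, and the blend assumes the accumulated volume opacity weights down the mesh color.

```python
import math

def render_hybrid(samples, step, t_surface, c_mesh):
    """Composite volume samples in front of the mesh, then blend in the mesh.

    samples:   (t, sigma, rgb) tuples sorted by ray distance t
    step:      spacing between consecutive samples
    t_surface: ray distance of the mesh hit, or None if the pixel misses it
    c_mesh:    rasterized mesh color for this pixel (ignored on a miss)
    """
    c_vol = [0.0, 0.0, 0.0]
    m_vol = 0.0          # accumulated opacity of the volume
    transmittance = 1.0
    for t, sigma, rgb in samples:
        if t_surface is not None and t >= t_surface:
            break        # the ray terminates at the mesh surface
        alpha = 1.0 - math.exp(-sigma * step)
        weight = transmittance * alpha
        c_vol = [c + weight * v for c, v in zip(c_vol, rgb)]
        m_vol += weight
        transmittance *= 1.0 - alpha
    if t_surface is None:
        return tuple(c_vol)
    # Light from the opaque mesh is attenuated by the volume in front of it.
    return tuple(c + (1.0 - m_vol) * cm for c, cm in zip(c_vol, c_mesh))
```

When the volume in front of the surface is empty, the result reduces to the rasterized mesh color, so pixels that need no volumetric detail cost little more than plain rasterization.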

Discretization Stage
In this stage, we convert all the neural representations to explicit assets for real-time rendering and efficient storage.
4.3.1 Texture Space Discretization. Earlier stages rely on MLPs to compute the specular color for each reflected ray direction and the corresponding attenuation factor. To enable fast color computation, we first densely sample all directions and convert the specular color MLP to a set of cube maps. We call these cube maps "base environment maps" as they act like the environment map in image-based lighting techniques. For the attenuation factor MLP, we densely sample the two-dimensional space of incident angle and "metallic" property and convert it to a 2D look-up table.

Mesh Discretization.
Simplification. The mesh extracted from dense tetrahedron grids has an excess number of vertices and faces even in flat areas, which brings computation and storage overhead. To cope with this problem, we simplify the generated mesh by Quadric Edge Collapse Decimation [Garland and Heckbert 1997]. Experiment results in Tab. 3 demonstrate that appropriate levels of mesh simplification do not undermine view synthesis quality but can, in fact, enhance it.
Parametrization. We perform UV unwrapping to get texture coordinates for the mesh vertices, and sample densely in the UV space to retrieve the 2D normal map and texture maps.

Volume Discretization.
Voxelization. We densely evaluate the volume density for an $R^3$ voxel grid, and then track the max contribution of each voxel by shooting rays from all pixels of the training images. The contribution is calculated as the weight $T_i \alpha_i$ in Eq. (2). Voxels with contributions lower than a threshold are pruned. For the remaining voxels, we uniformly sample 64 points inside each voxel and evaluate their average density, normal, and textures as an anti-aliased estimation of the properties of the voxel.
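The contribution-based pruning can be sketched as follows. This is a minimal illustration in which each ray is assumed to be already reduced to a list of (voxel index, rendering weight) pairs; the function names and the dictionary used to hold per-voxel maxima are hypothetical.

```python
def track_max_contribution(max_weight, ray_samples):
    """Update the per-voxel maximum rendering weight with one ray.

    max_weight:  dict mapping voxel index -> largest weight seen so far
    ray_samples: iterable of (voxel_index, weight) pairs along the ray
    """
    for voxel, weight in ray_samples:
        if weight > max_weight.get(voxel, 0.0):
            max_weight[voxel] = weight

def prune(max_weight, threshold):
    """Keep only voxels whose maximum contribution reaches the threshold."""
    return {voxel for voxel, weight in max_weight.items() if weight >= threshold}
```

Using the maximum over all training rays, rather than an average, keeps a voxel as long as it matters noticeably for at least one pixel.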
Hashing. To enable efficient access to volume properties at arbitrary spatial locations while optimizing storage cost, existing works mainly adopt advanced data structures like octrees [Yu et al. 2021], or pack occupied voxels into smaller dense blocks [Hedman et al. 2021]. Considering that the volume in our representation is very sparse (typically < 0.1%), we choose the latter approach and further utilize Perfect Spatial Hashing (PSH) [Lefebvre and Hoppe 2006] for compact storage and efficient indexing. Denote the positions of occupied blocks as set $S$, with $|S| = n$. PSH constructs
• a 3D hash table $H$ of size $m = \bar{m}^3 \approx n$ to store the volume data
• a 3D offset table $\Phi$ of size $r = \bar{r}^3 = \sigma n$, where $1/6 \le \sigma < 1$ (here $\sigma$ denotes the ratio $r/n$, not a density)
so that there exists a perfect spatial hash function $h(\mathbf{p})$ that is an injective map when $\mathbf{p} \in S$, mapping $\mathbf{p}$ to a unique slot in the hash table $H$. The hash function $h$ is defined as

$$h(\mathbf{p}) = \big( h_0(\mathbf{p}) + \Phi[h_1(\mathbf{p})] \big) \bmod \bar{m}, \quad h_0(\mathbf{p}) = \mathbf{p} \bmod \bar{m}, \quad h_1(\mathbf{p}) = \mathbf{p} \bmod \bar{r}$$

In this way, we can access volume data at any location in constant $O(1)$ time while only requiring the extra storage of a much smaller offset table in addition to the data itself.
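To illustrate how an offset table makes the hash collision-free, the sketch below builds one greedily for a small set of occupied positions and then performs a lookup. It is a simplified stand-in for the construction of Lefebvre and Hoppe, not the custom extension used in this work: it assumes the hash is the position modulo the table width plus a per-bucket offset, tries offsets by brute force, and stores the original position in each slot so that a lookup can reject unoccupied queries. All names are hypothetical.

```python
from itertools import product

def build_psh(points, m_bar, r_bar):
    """Greedily assign one offset per offset-table cell so no two points collide."""
    buckets = {}
    for p in points:
        buckets.setdefault(tuple(c % r_bar for c in p), []).append(p)
    offsets, table = {}, {}
    # Place the most crowded offset cells first; they are the hardest to fit.
    for key, members in sorted(buckets.items(), key=lambda kv: -len(kv[1])):
        for offset in product(range(m_bar), repeat=3):
            slots = [tuple((c + o) % m_bar for c, o in zip(p, offset))
                     for p in members]
            if len(set(slots)) == len(slots) and not any(s in table for s in slots):
                offsets[key] = offset
                table.update(zip(slots, members))
                break
        else:
            raise ValueError("no collision-free offset found; enlarge the tables")
    return offsets, table

def psh_lookup(p, offsets, table, m_bar, r_bar):
    """Return the hash-table slot of p, or None if p is not an occupied position."""
    offset = offsets.get(tuple(c % r_bar for c in p))
    if offset is None:
        return None
    slot = tuple((c + o) % m_bar for c, o in zip(p, offset))
    return slot if table.get(slot) == p else None
```

A lookup costs two modulo operations, one offset fetch, and one table fetch regardless of how many positions are stored, which is the constant-time access property referred to above.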

Fine-tuning.
To mitigate the quality loss brought by the quantization process, especially the texture seams caused by the UV-parametrization and the aliasing caused by voxelization (see Fig. 6 for visual examples), we directly fine-tune the extracted assets. However, the optimization of explicit geometry and textures could easily fall into local minima as they are no longer constrained by the inductive bias or local smoothness of the neural networks. Therefore, we fix the mesh geometry and use the rendered normal and texture images from the mesh optimization stage as pseudo ground truth references. In addition to the pixel-wise MSE loss, we also adopt a VGG perceptual loss [Johnson et al. 2016] to optimize for image quality. To further reduce storage costs, it is a common practice to re-scale and quantize the values in the assets to 8-bit integers and store them as images. Since the quantization operation is not differentiable, existing methods mainly make it a post-processing step, which may bring quality loss. We instead propose to optimize the assets with a differentiable quantization module, which gives the quantized value in the forward pass and keeps the gradient untouched during the backward pass. In this way, we can ensure the rendering outputs are the same during optimization and in the real-time renderer. After fine-tuning, we store the mesh as a .obj file compressed using Draco, and all other assets as PNG files, including the normal map, texture maps, volume data in the hash table, and the offset table. For unconstrained access to the volume data, we also store the occupancy of the $R^3$ grid as an occupancy image, where the occupancy status of a voxel corresponds to a single bit in the image.
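The behaviour of such a quantization module can be sketched without an autograd framework by writing its forward and backward rules explicitly. The sketch assumes uniform 8-bit quantization over a known value range; the function names and the range parameters `lo` and `hi` are hypothetical.

```python
def quantize_forward(x, lo, hi, levels=255):
    """Forward pass: snap x to the nearest of 256 uniformly spaced levels."""
    x = min(max(x, lo), hi)
    code = round((x - lo) / (hi - lo) * levels)  # integer in [0, 255]
    return lo + code / levels * (hi - lo)

def quantize_backward(grad_output):
    """Backward pass: the incoming gradient is passed through unchanged."""
    return grad_output
```

In an autograd framework this pair would be wrapped as a custom differentiable function, so that the loss is always computed from the values the real-time renderer will actually read from the 8-bit images.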

Real-Time Rendering
We implement the rendering process of VMesh assets in WebGL2 using the three.js library [mrdoob 2023]. The full process requires four render passes:
• In the first and second passes, we rasterize the front and back faces of the object bounding box to get the valid interval for raymarching. This could also be done by ray-box intersection.
• In the third pass, we rasterize the mesh to get the normal image, feature image, and depth image, and calculate the mesh rendering output based on the texture model.
• In the final pass, we first determine the start and termination points for raymarching. The start point is the front face position from the first pass. The termination point is the closer one of the back face position from the second pass and the surface position from the third pass. Then we uniformly sample points in this interval to get the volume rendering output. The volume-rendered color is alpha-composited with the mesh color to get the final rendering result.
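The interval selection in the final pass can be sketched as below, written in Python for readability rather than as shader code. The names are hypothetical, depths are expressed as distances along the ray, and samples are placed at interval midpoints as one possible choice of uniform sampling.

```python
def march_interval(t_front, t_back, t_mesh=None):
    """Start at the box front face; stop at the nearer of box back face and mesh."""
    t_end = t_back if t_mesh is None else min(t_back, t_mesh)
    return t_front, max(t_end, t_front)  # empty interval if the mesh is in front

def uniform_samples(t_start, t_end, n):
    """Midpoints of n equal sub-intervals of [t_start, t_end]."""
    step = (t_end - t_start) / n
    return [t_start + (i + 0.5) * step for i in range(n)]
```

For example, `march_interval(1.0, 3.0, 2.0)` yields the interval from 1.0 to 2.0, so no samples are spent behind the opaque mesh.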

Implementation Details
We implement the training process using the PyTorch [Paszke et al. 2019] framework. In the contiguous and mesh optimization stages, we adopt the hash grid encoding and acceleration techniques proposed in Instant-NGP [Müller et al. 2022]. To stabilize training, we adopt a progressive training strategy that masks features from high-level hash grids in the early stage of training. We show in Sec. 5.2 that this simple strategy effectively alleviates shape-radiance ambiguity and improves geometry quality. We choose $N = 4$ for the RefBasis texture. The contiguous stage is trained with a batch size of 8,192 rays for 80,000 iterations. We set $\lambda_{pm} = 10$, $\lambda_{mask} = 0.1$, $\lambda_{eik} = 0.1$, $\lambda_{norm} = 0.1$, linearly increase $\lambda_{sp}$ from 0.01 to 0.1, and only apply $\lambda_{opaque} = 0.1$ after 50,000 iterations. The mesh optimization stage is trained with a batch size of 8,192 rays for 10,000 iterations. The resolution of the dense grid for marching tetrahedra is set to 350. We set $\lambda_{pm} = 10$, $\lambda_{mesh} = 1$, $\lambda_{norm} = 0.1$, and $\lambda_{sp} = 0.1$ in this stage.
In the discretization stage, we reduce the number of mesh faces to 1/4, apply UV-parametrization using xatlas [jpcy 2022] and render the normal and texture maps at 1024 × 1024. We store the base environment maps as cube maps of resolution 512 × 512, and create the look-up table at 256 × 256. We voxelize the volume part using a $512^3$ dense grid and implement a custom PyTorch C++ extension for the Perfect Spatial Hashing algorithm. It takes around 2 hours to optimize for a scene on a single RTX 3090.

EXPERIMENTS
We mainly conduct experiments on the NeRF-Synthetic [Mildenhall et al. 2022] dataset, which contains 8 challenging scenes with thin structures, highly reflective materials, and complex texture patterns. As we aim at real-time free-viewpoint rendering on consumer devices, we select the baseline methods as follows:
• Real-time NeRF variants, including PlenOctrees [Yu et al. 2021], SNeRG [Hedman et al. 2021] and MobileNeRF [Chen et al. 2022]. These methods all work without the need for high-end GPUs. We also compare with a compressed version of PlenOctrees, denoted as PlenOctrees-Web, and two SNeRG variants with lower voxel grid resolution, denoted as SNeRG-750 and SNeRG-500 respectively.
• nvdiffrec [Munkberg et al. 2022], which extracts the 3D mesh, PBR material and lighting from images. The extracted assets can be directly rendered in traditional graphics pipelines.
In Sec. 5.1 we first compare VMesh with these alternatives on image quality, rendering speed at different resolutions on various kinds of consumer devices, and disk storage cost. We compare with nvdiffrec and MobileNeRF on geometry quality as they also produce polygonal meshes. Then we perform thorough ablation studies in Sec. 5.2 to show how different design choices affect the image quality and efficiency. We show some qualitative results on real-captured scenes in Sec. 5.3 and present several applications based on our representation in Sec. 5.4.

Comparisons
We compare our VMesh representation with other alternatives on NeRF-Synthetic according to three aspects: (1) view synthesis quality on the test set in PSNR, SSIM and LPIPS; (2) rendering speed on a MacBook Pro (2020, M1) laptop measured with frames per second (FPS); (3) disk storage cost in megabytes.The results are shown in Tab. 1, where the numbers are averaged across all 8 scenes.We also provide a bubble chart in Fig. 1 for more intuitive comparisons.
In general, VMesh acts as a trade-off between volume rendering approaches (PlenOctrees, SNeRG, MobileNeRF) and mesh rendering approaches (nvdiffrec). Compared with volume rendering approaches, VMesh achieves competitive view synthesis quality with significantly higher rendering speed (2x faster than MobileNeRF, 5x faster than SNeRG) and lower storage cost (1/10 of MobileNeRF, 1/5 of SNeRG). The improvements can be attributed to the use of mesh for representing macro structures of the object, which is compact in storage and can fully utilize the efficiency of the graphics pipeline. Combined with our efficient texture representation, VMesh is able to render high-resolution images at real-time frame rates on various consumer-grade devices. In Tab. 2 we evaluate the rendering speed at different resolutions on three types of common devices: mobile phones, tablets, and laptops. nvdiffrec produces mesh assets with PBR materials, therefore serving as the upper bound for rendering efficiency. We can see from the table that the rendering speed of both SNeRG and MobileNeRF drops significantly as the resolution goes up, while VMesh maintains acceptable frame rates even at 4K resolution. The reason for this discrepancy is that the computationally expensive raymarching processes or MLP evaluations are necessary for each pixel in SNeRG and MobileNeRF, which hinders efficient rendering at high resolutions. In contrast, our VMesh representation contains minimal amounts of volume, and our RefBasis texture representation is considerably more computationally efficient than a tiny MLP. Some qualitative comparisons are shown in Fig. 3, where we can see that VMesh does well in modeling view-dependent effects, and has enough representational ability to recover subtle structures thanks to the volumetric counterpart. In Fig. 4, we qualitatively compare the geometry quality with MobileNeRF and nvdiffrec as they also produce explicit triangular meshes. We also report the average number of vertices and faces on NeRF-Synthetic in Tab. 3. It can be seen that our method achieves the best geometry quality among the three. MobileNeRF optimizes for separate triangles without any connection, which greatly limits its capability in tasks other than view synthesis. nvdiffrec produces meshes with unexpected holes. We believe this originates from the strict PBR assumption, which cannot explain global illumination and therefore introduces geometry artifacts. VMesh creates high-quality meshes that can be drastically simplified without compromising rendering quality. This is because we use meshes to represent larger, macro-level structures, which can be represented accurately using a smaller number of faces.

Ablation Studies
5.2.1 The necessity of volume. We first evaluate the necessity of the volumetric part of our VMesh representation. In Fig. 5 we show results on the ficus and ship scenes of the NeRF-Synthetic dataset. These two scenes contain objects with very subtle structures, such as twigs and thin ropes. With only the mesh part, these subtle structures are largely missing, as they are hard (and inefficient) to model with surfaces. Volumetric primitives, on the other hand, handle these areas effectively. Combining the two, VMesh represents these challenging scenes very well.
5.2.2 Texture representation. We evaluate different choices of texture representations in Tab. 4 to demonstrate the effectiveness of our RefBasis texture. We investigate five common types of texture representations:
• NeRF [Mildenhall et al. 2022] texture, which takes the geometry feature vector and the viewing direction as input, and models color by an MLP.
• IDR [Yariv et al. 2020] texture, which takes the geometry feature vector, the viewing direction, and the local normal direction as input, and models color by an MLP.
• Ref-NeRF [Verbin et al. 2022] texture, which takes the roughness value, the incident angle, the reflected ray direction, and the geometry feature vector as input, models the specular color by an MLP, and combines it with the diffuse color and the specular tint to get the final color.
• Spherical Harmonics (SH) texture, which models color by coefficients of SH functions.
• Learned Basis (LB) texture, which models color by coefficients of a set of learned basis functions. The basis functions are modeled by an MLP, taking the viewing direction as input.
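To make the SH texture in the list above concrete, here is a minimal sketch of decoding a view-dependent color from stored SH coefficients. It uses real spherical harmonics up to degree 1 (4 coefficients per color channel, so 12 features per location); the degree, the sign convention of the degree-1 terms, and the names `sh_basis_deg1` and `sh_color` are assumptions made for illustration and are not the settings used in Tab. 4.

```python
import numpy as np

C0 = 0.28209479177387814  # value of the constant degree-0 basis function
C1 = 0.4886025119029199   # magnitude of the degree-1 basis functions


def sh_basis_deg1(direction) -> np.ndarray:
    """Real SH basis up to degree 1 for direction(s) of shape (..., 3).

    Sign conventions for the degree-1 terms vary between implementations;
    the all-positive convention is assumed here.
    """
    d = np.asarray(direction, dtype=np.float64)
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    x, y, z = d[..., 0], d[..., 1], d[..., 2]
    return np.stack([np.full_like(x, C0), C1 * y, C1 * z, C1 * x], axis=-1)


def sh_color(coeffs: np.ndarray, direction) -> np.ndarray:
    """Decode RGB from coefficients of shape (..., 3, 4) and directions (..., 3)."""
    basis = sh_basis_deg1(direction)  # shape (..., 4)
    return np.einsum("...ck,...k->...c", coeffs, basis)
```

With only the degree-0 coefficient set, the decoded color is the same from every direction, which is why view-dependent effects require the higher-degree terms and hence more stored features per location.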
For a fair comparison, we also experiment with variants of these representations (marked with -R in the table) where the viewing direction is replaced with the reflected ray direction. We compare view synthesis quality on the NeRF-Synthetic test set in the contiguous stage, the number of features N stored at each location (indicating storage cost), and whether the representation can be quantized for real-time applications (the Q? column in the table, where ✓ marks quantizable representations). For representations that cannot be quantized, we choose N to be close to the one we use. For SH and LB, we use commonly adopted settings from existing works. As shown in the table, among all quantizable representations, our RefBasis representation achieves the best visual quality with significantly lower storage costs. We achieve such efficiency by storing most of the common information in the neural environment map, which only has to be stored per scene instead of per point. In comparison to non-quantizable representations, RefBasis places greater emphasis on rendering efficiency at the expense of a minor decrease in visual quality.
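The -R variants replace the viewing direction with the reflected ray direction. A minimal sketch of that substitution follows, using the standard mirror reflection about the surface normal as in Ref-NeRF, r = 2(w_o · n)n - w_o, where w_o points from the surface toward the camera. The function name `reflect_dir` and the direction convention are assumptions for this sketch.

```python
import numpy as np


def _normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def reflect_dir(to_camera, normal) -> np.ndarray:
    """Mirror the unit direction toward the camera about the unit normal.

    to_camera, normal: arrays of shape (..., 3). Both are normalized here,
    so callers may pass unnormalized vectors.
    """
    w_o = _normalize(np.asarray(to_camera, dtype=np.float64))
    n = _normalize(np.asarray(normal, dtype=np.float64))
    cos_theta = np.sum(w_o * n, axis=-1, keepdims=True)
    return 2.0 * cos_theta * n - w_o
```

Because the result depends on the normal, noisy normals translate directly into noisy reflection directions.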

Training strategies.
We conduct ablation studies on some of the training strategies to demonstrate their importance in achieving promising view synthesis quality, as shown in Fig. 6 and Tab. 5. Without the predicted normal (w/o pred. normal), the normal calculated by automatic differentiation can be noisy, leading to inaccurate reflection directions. Removing the volume sparsity regularization in Eq. (18) (w/o L_sp) allows the model to use volume to explain view-dependent effects, resulting in foggy blobs, as can be seen in the figure. It also incurs unnecessary storage costs for the redundant volume content. Without the progressive training strategy for hash encoding described in Sec. 4.5 (w/o prog. training), shape-radiance ambiguity is more likely to occur, causing incorrect geometry. In the mesh optimization stage, we show that optimizing the extracted mesh is crucial for high-quality view synthesis. As can be seen in the figure, the extracted mesh can be far from ideal, with missing structures and inaccurate surfaces. After optimization using silhouette and depth constraints from the contiguous stage, the mesh quality is largely improved. Even so, nearly all the quality loss comes from this stage, further demonstrating the importance of accurate geometry. In the discretization stage, the fine-tuning step helps remove seams for better visual quality.

Storage optimization.
For the organization of volume data, we find that using a block size of 16 generally gives the best storage efficiency. We evaluate the impact of different mesh texture sizes on storage cost and show the results in Tab. 6. Using a texture resolution of 2048 consumes significantly more storage (2.5x) but brings only a marginal improvement in rendering quality (+0.1 dB in PSNR). As a result, we employ a texture resolution of 1024 for all the experiments.
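The block-size trade-off can be illustrated with a small sketch: larger blocks need fewer index entries but store more empty voxels around sparse content, while smaller blocks do the opposite. The code below counts occupied blocks and stored voxels for a cubic occupancy grid. The dense boolean grid, the rule that a block is stored if any voxel in it is occupied, and the name `blocked_voxel_count` are assumptions for illustration; the actual data layout is not specified in this section.

```python
from typing import Tuple

import numpy as np


def blocked_voxel_count(occupancy: np.ndarray, block: int) -> Tuple[int, int]:
    """Return (occupied block count, stored voxel count) for a cubic grid.

    The grid is split into block^3 tiles; a tile is stored in full if it
    contains at least one occupied voxel (assumed storage rule).
    """
    r = occupancy.shape[0]
    if occupancy.shape != (r, r, r) or r % block != 0:
        raise ValueError("grid must be cubic and divisible by the block size")
    b = r // block
    # Axes 1, 3, 5 index voxels inside a tile; axes 0, 2, 4 index the tiles.
    tiles = occupancy.reshape(b, block, b, block, b, block)
    occupied = tiles.any(axis=(1, 3, 5))
    n_blocks = int(occupied.sum())
    return n_blocks, n_blocks * block ** 3
```

Running this over several block sizes on a given occupancy grid shows how the number of index entries and the amount of padded empty space move in opposite directions.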

Results on Real-Captured Scenes
In Fig. 7 we show qualitative results on real-captured scenes from Mip-NeRF 360 [Barron et al. 2022]. It is essential to point out that this paper only focuses on achieving efficient view synthesis of the foreground object. The background is modeled using an approach similar to NeRF++ [Zhang et al. 2020] and is excluded from the real-time renderer. VMesh successfully generalizes to real-world scenes and is able to recover the subtle structures of the foreground objects, such as the spokes on the bicycle wheels (top) and the fine filaments on dried flowers (bottom).

Applications
Benefiting from the mesh-based representation, we can intuitively perform various geometry and texture editing tasks that can be hard for volumetric methods. In Fig. 8 we show some examples of editing existing VMesh assets. In the shape deformation example (left), we use the ARAP (as-rigid-as-possible) method to deform the triangular mesh. Once the deformation is done, the deformed object can be instantly put into real-time view synthesis, without the need for re-training as in NeRF-Editing [Yuan et al. 2022]. In the texture painting (middle) and appearance editing (right) examples, we directly manipulate the diffuse and specular tint maps. This also shows that our RefBasis texture formulation is more flexible than existing alternatives like Spherical Harmonics or learned basis functions.
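The following sketch illustrates why separate diffuse and tint maps make such edits simple. It assumes the final color is composed as diffuse + tint * specular, in the spirit of the Ref-NeRF-style combination described in Sec. 5.2.2; this composition rule, the [0, 1] value range, and the names `shade` and `paint_diffuse` are assumptions and are not the paper's exact RefBasis formulation.

```python
import numpy as np


def shade(diffuse: np.ndarray, tint: np.ndarray, specular: np.ndarray) -> np.ndarray:
    """Compose a color map as diffuse + tint * specular (assumed rule), clipped to [0, 1]."""
    return np.clip(diffuse + tint * specular, 0.0, 1.0)


def paint_diffuse(diffuse: np.ndarray, mask: np.ndarray, color) -> np.ndarray:
    """Return a copy of the (H, W, 3) diffuse map with masked texels set to `color`.

    mask: boolean array of shape (H, W). The input map is left unchanged.
    """
    edited = diffuse.copy()
    edited[mask] = np.asarray(color, dtype=diffuse.dtype)
    return edited
```

Under this assumed composition, painting the diffuse map changes the base color while the view-dependent specular term is untouched, and scaling the tint map changes how strongly reflections appear.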

LIMITATIONS
The current VMesh representation has several limitations that present opportunities for future improvement:
(1) The representation can produce inaccurate geometry due to global illumination effects caused by complex materials. For instance, in Fig. 7, the tabletop displays unexpected holes as a result of the highly reflective, glass-like surface near its center. Our RefBasis texture formulation assumes approximately distant illumination and is therefore unable to accurately account for global illumination effects like self-reflections.
(2) VMesh is unable to precisely capture the appearance of transparent or furry objects. Ideally, these properties should be modeled by the volume part. However, we find that in practice they are baked onto an inaccurate surface rather than being modeled by the volume. This may result from using a surface rendering loss for optimization during the contiguous stage (see Eq. (17)); solving this problem requires a better surface-volume separation strategy.
(3) VMesh currently only facilitates real-time rendering of foreground objects, but there is potential to extend this hybrid formulation to efficiently model both foreground and background.

CONCLUSIONS
In summary, we present a new hybrid mesh-volume representation, VMesh, for efficient view synthesis of objects. VMesh combines the efficiency and flexibility of traditional mesh-based assets and the expressiveness of volumetric representations. We show competitive visual quality with state-of-the-art real-time view synthesis methods while being significantly faster and more efficient in storage. We believe VMesh may inspire future research into the potential of combining surface rendering and volume rendering techniques to produce high-quality view synthesis at a low cost.

Fig. 2. The 3-stage pipeline of training the VMesh representation from multi-view images. We propose to start with an implicit signed distance-density field and gradually discretize the geometry and texture to obtain assets for real-time rendering.

Fig. 5. Qualitative explanation of the necessity of the volume part. Volumetric primitives can easily recover thin structures that are difficult to model with mesh surfaces, like twigs and thin ropes.

Fig. 7. Qualitative results on two real-captured scenes. VMesh is able to model the thin structures of the foreground objects.

Fig. 8. Applications on shape deformation, texture painting, and appearance editing. These tasks can all be easily accomplished thanks to the mesh geometry and the flexible RefBasis texture model.

Table 1. Comparisons with state-of-the-art real-time view synthesis methods on the NeRF-Synthetic dataset regarding image quality, rendering speed, and storage cost. The rendering speed is tested on a MacBook Pro (2020, M1) laptop.
*not supported on the tested device.

Table 2. Rendering efficiency comparison on different devices. VMesh is able to render at 2K 60FPS on a MacBook Pro (2020, M1). Note that the numbers are averaged across all scenes. Rendering framerates are capped at 60FPS on iPhone13 and iPad8.
*not supported on the tested device.

Table 3. Comparison of mesh complexity with MobileNeRF and nvdiffrec. The meshes extracted by VMesh can be drastically simplified without affecting visual quality.

Table 4. Visual quality of different texture representations. We report the image quality metrics from the contiguous stage. N denotes the number of features stored at each location. Q? denotes whether this representation can be quantized/cached for real-time rendering.

Fig. 4. Qualitative comparison of the extracted mesh from MobileNeRF, nvdiffrec, and VMesh. VMesh is able to extract meshes with high-quality surfaces.

Table 5. Ablation study on how design choices in different stages affect image quality.

Table 6. Ablation study on how different mesh texture resolutions affect image quality and storage.