DE-NeRF: DEcoupled Neural Radiance Fields for View-Consistent Appearance Editing and High-Frequency Environmental Relighting



ABSTRACT
Neural Radiance Fields (NeRF) have shown promising results in novel view synthesis. While achieving state-of-the-art rendering results, NeRF usually encodes all properties related to the geometry and appearance of the scene together into several MLP (Multi-Layer Perceptron) networks, which hinders downstream manipulation of geometry, appearance and illumination. Recently, researchers made attempts to edit geometry, appearance and lighting for NeRF. However, they fail to render view-consistent results after editing the appearance of the input scene. Moreover, high-frequency environmental relighting is also beyond their capability, as lighting is modeled as Spherical Gaussian (SG) and Spherical Harmonic (SH) functions or a low-resolution environment map. To solve the above problems, we propose DE-NeRF to decouple view-independent appearance and view-dependent appearance in the scene with a hybrid lighting representation. Specifically, we first train a signed distance function to reconstruct an explicit mesh for the input scene. Then a decoupled NeRF learns to attach view-independent appearance to the reconstructed mesh by defining learnable disentangled features representing geometry and view-independent appearance on its vertices. For lighting, we approximate it with an explicit learnable environment map and an implicit lighting network to support both low-frequency and high-frequency relighting. By modifying the view-independent appearance, rendered results are consistent across different viewpoints. Our method also supports high-frequency environmental relighting by replacing the explicit environment map with a novel one and fitting the implicit lighting network to the novel environment map. Experiments show that our method achieves better editing and relighting performance both quantitatively and qualitatively compared to previous methods.

Figure 1: Given a set of input images, we train a neural radiance field that decouples geometry, appearance, and lighting (panels: Input Images, Geometry, Appearance, Lighting). Our method supports not only geometry manipulation and appearance editing but also rendering of the captured or modified scene under a novel lighting condition.

INTRODUCTION
Neural Radiance Fields (NeRF) [Mildenhall et al. 2020] have shown promising results in scene reconstruction and novel view synthesis. Compared with traditional geometry and appearance representations, such as textured meshes, NeRF does not require precise geometry and texture reconstruction and can produce realistic rendering results. However, besides visualization, editing is also an important task in computer graphics. Traditional 3D modeling applications allow users to edit mesh geometry by modifying face connections or vertex locations, and to edit the appearance by painting from a given viewpoint. Lighting conditions are also changeable by replacing the environment map. But in a conventional NeRF, the geometry is represented by a density function that does not well reflect the real geometry, and its appearance is an entanglement of material and lighting, which increases the difficulty of editing.
On the geometry editing side, a few methods propose to deform neural radiance fields by deforming sample points on a ray [Garbin et al. 2022; Peng et al. 2022; Xu and Harada 2022; Yuan et al. 2022]. For appearance editing, researchers try to decompose geometry, material and lighting from 2D images in an implicit way so that each component can be edited independently. PhySG [Zhang et al. 2021a] and NeRD [Boss et al. 2021a] use MLP (Multi-Layer Perceptron) networks to predict BRDF (Bidirectional Reflectance Distribution Function) materials and approximate lighting with Spherical Gaussian functions. But their geometry is still in an implicit form and their lighting representation is smooth, so high-frequency environmental relighting is beyond their capability. For better material estimation, NeRFactor [Zhang et al. 2021b] predicts material parameters with a pre-trained BRDF decoder and represents lighting with a low-resolution image, which prevents it from representing high-frequency lighting. RefNeRF [Verbin et al. 2022] proposes not to explicitly decompose BRDF materials but instead learns view-dependent and view-independent appearance simultaneously. Although achieving high-quality reconstruction results, RefNeRF can only edit a scene by adjusting its color network's outputs and is unable to deform the geometry or relight the input scene.
To transfer editing from one viewpoint to other viewpoints seamlessly, NeuTex [Xiang et al. 2021] maps sample points to a unified 2D texture space and uses traditional UV mapping to query the corresponding colors. After training, the appearance of the scene is baked into the 2D texture image, and users can edit the neural radiance field by painting it. However, the 2D texture generated by NeuTex is usually distorted and hard to edit. To resolve this issue, NeuMesh [Bao et al. 2022] defines learnable geometry and appearance features on a pre-reconstructed mesh for the scene and learns to decompose the geometry and appearance using two MLP networks. Unfortunately, its appearance is still an entanglement of material and lighting, so rendered results can be inconsistent with the input editing when viewed from novel viewpoints, and its lighting conditions cannot be changed.
To allow view-consistent appearance editing and high-frequency environmental relighting, we propose DE-NeRF, which decouples the geometry, appearance and lighting of the input scene. Given a set of captured 2D images of a scene, we first reconstruct its geometry with an SDF (Signed Distance Field) network. Then we define geometry and view-independent appearance features on the reconstructed mesh's vertices and use the corresponding geometry network and appearance network to predict signed distance values and appearance parameters. By baking geometry and view-independent appearance features onto mesh vertices, DE-NeRF can seamlessly transfer appearance editing from one viewpoint to other viewpoints, and the edited appearance is consistent across different viewpoints. For lighting, we propose a hybrid representation composed of an explicit low-resolution environment map, for efficiency, and an implicit lighting network. The explicit environment map is responsible for low-frequency diffuse lighting, and the implicit lighting network is trained to represent specular lighting. After training, geometry, view-independent appearance and lighting are disentangled and can be edited separately without influencing the other components.
Our contributions can be summarized as follows:
• A neural radiance field editing method that allows editing of geometry, appearance and lighting. Appearance editing from one viewpoint can be seamlessly transferred to other viewpoints, and the rendered results are view-consistent after editing.
• A hybrid lighting representation that supports high-frequency environmental relighting and produces more faithful relighting results compared to previous methods.

RELATED WORK

Neural Geometry Reconstruction
With the development of neural rendering [Oechsle et al. 2019; Thies et al. 2019] and implicit geometry representations [Chen and Zhang 2019; Mescheder et al. 2019; Park et al. 2019], surface-based rendering methods [Niemeyer et al. 2020; Yariv et al. 2020; Zhang et al. 2021c] are proposed to learn geometry and appearance separately and to reconstruct an object's geometry from 2D images by minimizing the difference between rendered images and input images. Later, with the emergence of Neural Radiance Fields (NeRF) [Mildenhall et al. 2020], researchers started to work on geometry reconstruction with volume rendering. A pioneering work that builds the connection between implicit geometry representations and neural radiance fields is NeuS [Wang et al. 2021], which derives an unbiased and occlusion-aware formulation for the neural radiance field's density function from a signed distance function (SDF). UNISURF [Oechsle et al. 2021] instead treats geometry as an occupancy field that predicts whether a sampled point is on the object surface and replaces the alpha value in volume rendering with the occupancy value. Yariv et al. [2021] also transform the SDF to a density function in volume rendering; their transformation function is the Cumulative Distribution Function (CDF) of a learnable Laplace distribution. To reduce the required number of input images, SparseNeuS [Long et al. 2022] extracts 2D features from images to provide extra information for sample points in space via projection. To accelerate the training process of geometry reconstruction, VOXURF [Wu et al. 2022] defines learnable features on voxel grids, similar to [Fridovich-Keil et al. 2022; Liu et al. 2020], to speed up training.

NeRF Decomposition
Recently, researchers started to disentangle geometry, material and lighting from neural radiance fields. NeRV [Srinivasan et al. 2021] decomposes BRDF materials under a given lighting condition. It models direct illumination and one-bounce indirect illumination and uses a network to predict the visibility of each sample point. NeRD [Boss et al. 2021a] approximates lighting with Spherical Gaussian (SG) functions and reduces the learning difficulty by first extracting view-independent material parameters and density functions and then applying them to the learning of view-dependent material parameters. For more accurate material estimation, Boss et al. [2021b] predict BRDF materials with a material autoencoder pre-trained on a BRDF material dataset [Matusik et al. 2003]. NeROIC [Kuang et al. 2022] approximates lighting with Spherical Harmonic (SH) coefficients and decomposes static appearance and transient appearance. NeRFactor [Zhang et al. 2021b] is the first work to learn shadow decomposition under unknown lighting conditions. Similar to [Boss et al. 2021a], it first trains a standard NeRF network to determine the geometry. Then it predicts material with a pre-trained BRDF decoder and optimizes its lighting, which is represented by a low-resolution image. More recently, RefNeRF [Verbin et al. 2022] implicitly decomposes view-dependent appearance and view-independent appearance via two separate networks and can learn high-frequency specular reflections, but it does not decompose shadow or lighting. Besides the works mentioned above, there are works that decompose scenes based on other representations. PhySG [Zhang et al. 2021a] models geometry as an SDF network, and its lighting is approximated by a composition of several Spherical Gaussian (SG) functions [Wang et al. 2009]. It utilizes the Disney BRDF model [Bi et al. 2020] and assumes that the scene has only one single specular BRDF material, causing a performance drop on more complex scenes. InvRender [Zhang et al. 2022b] further models indirect illumination with another set of SG functions to handle more complicated appearances like inter-reflection. NvdiffRec [Munkberg et al. 2022] and NvdiffRecMC [Hasselgren et al. 2022] use Deep Marching Tetrahedra [Shen et al. 2021] as their geometry representation and learn to decompose the input scene with a differentiable rasterization renderer [Laine et al. 2020] and a differentiable Monte Carlo renderer, respectively. They handle high-frequency lighting but struggle when objects have highly glossy surfaces.

Neural Radiance Field Editing
Classified by editing targets, previous works can be roughly divided into geometry editing and appearance editing. In terms of geometry, several works [Garbin et al. 2022; Xu and Harada 2022; Yuan et al. 2022] share a similar idea: reconstruct an explicit mesh as a proxy for a static scene and build a correspondence between the mesh and the NeRF. By editing the mesh using As-Rigid-As-Possible deformation [Sorkine-Hornung and Alexa 2007], sample points in the rendering process are transformed along with the mesh via barycentric coordinate interpolation. For appearance editing, several methods [Huang et al. 2022; Wang et al. 2022; Zhang et al. 2022a] propose to edit the appearance of a NeRF by stylizing it with an image or text prompt. EditNeRF [Liu et al. 2021] is the first work that allows users to edit a NeRF by editing 2D images, which greatly reduces the editing difficulty. It models a scene with a shape code and a color code, and editing is performed by optimizing the color code. But it requires a large dataset from the same category to generate plausible editing results. NeuTex [Xiang et al. 2021] maps sample points in a single scene to UV coordinates and gets their colors from a learnable UV map. After training, the appearance of the NeRF can be edited by painting the UV texture. However, the learned UV mapping is usually distorted and hard to edit. NeuMesh [Bao et al. 2022] reconstructs the geometry of the scene using NeuS [Wang et al. 2021] and defines learnable geometry and appearance features on mesh vertices. It allows users to edit the NeRF's appearance from 2D images by optimizing appearance features, similar to EditNeRF [Liu et al. 2021]. Since its appearance features do not disentangle material and lighting, artifacts may occur when the scene is observed from a different viewpoint after editing. Our method focuses on decoupling NeRF into geometry, appearance and lighting for independent editing, where the geometry and view-independent appearance are encoded on mesh vertices to ensure view consistency, and a hybrid lighting representation is proposed to support relighting with high-frequency environmental lighting.

METHOD
We propose DE-NeRF, a decoupled geometry, appearance and lighting editing method for NeRF that allows view-consistent appearance editing and high-frequency environmental relighting. The pipeline of our method is illustrated in Fig. 2. We first reconstruct the geometry of the input scene (Sec. 3.1). To enable geometry and appearance editing, we define learnable features for geometry and appearance on the vertices of the reconstructed mesh, baking view-independent information onto the mesh to ensure view consistency. For lighting, we propose a hybrid lighting representation that supports both low-frequency and high-frequency lighting. The low-frequency lighting is modeled by an explicit environment map, where each pixel represents a light and all lights in the environment map are integrated at every sample point in the scene. For high-frequency lighting, it would be costly to represent it with a large environment map; instead, we model it with an implicit lighting network and encourage it to be consistent with the explicit environment map. Under the guidance of the reconstructed geometry and the input images, we decouple the geometry, appearance and lighting of the scene by optimizing the learnable features on the mesh vertices, the learnable environment map, and the lighting network (Sec. 3.2). After decoupling, users can edit the geometry, appearance, and lighting of the input scene (Sec. 3.3).

Geometry Reconstruction
Recent neural implicit representations [Chen and Zhang 2019; Mescheder et al. 2019; Park et al. 2019] and neural rendering techniques [Mildenhall et al. 2020] have achieved great success in the scene reconstruction task. In this work, we use the Signed Distance Function (SDF) as our geometry representation for smooth geometry reconstruction. The SDF can be parameterized as an MLP network $d = f(\mathbf{x})$. It takes a sample point $\mathbf{r}(t) = \mathbf{o} + t\mathbf{v}$ as input and outputs its signed distance $d$ to the surface, where $\mathbf{o}$ is the origin of a camera ray, $\mathbf{v}$ is the ray direction, and $t$ is the parameter that determines the sample point on the ray.

Figure 2: Pipeline overview. A sample point's geometry feature $f_g$ and its distance $h$ to the mesh are fed into an SDF decoder to predict its signed distance value $s$. Similarly, appearance features $f_d$, $f_s$, $f_r$ and the distance $h$ go through several appearance decoders to predict the diffuse albedo $a$, roughness value $r$, and specular tint $p$. A learnable environment map $E_l$ is integrated with the diffuse albedo to get the diffuse color $c_d$. We also train a specular lighting decoder $F_l$ to predict the specular lighting $c_l$, which is multiplied by the specular tint $p$ to produce the specular color $c_s$. Combining $c_d$ and $c_s$, we get the color $c$ for this point.

To learn the SDF from multi-view images of the scene, we adopt the occlusion-aware and unbiased volume rendering technique from NeuS [Wang et al. 2021] to render the SDF of the scene. Same as NeuS, we define the density based on the SDF as
$$\rho(t) = \max\!\left(\frac{-\frac{\mathrm{d}\Phi_s}{\mathrm{d}t}\big(f(\mathbf{r}(t))\big)}{\Phi_s\big(f(\mathbf{r}(t))\big)},\; 0\right),$$
where $\Phi_s(x) = (1 + e^{-sx})^{-1}$ and $s$ is a trainable deviation parameter.
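As a concrete illustration, the SDF-to-density conversion above can be sketched in a few lines of NumPy (our own illustration, not the paper's code), approximating the derivative of Φ_s along the ray with finite differences:

```python
import numpy as np

def phi_s(x, s):
    """Sigmoid CDF Phi_s(x) = (1 + e^{-s x})^{-1} with trainable sharpness s."""
    return 1.0 / (1.0 + np.exp(-s * x))

def neus_density(sdf_vals, t_vals, s):
    """NeuS density rho(t) = max(-(d/dt) Phi_s(f(r(t))) / Phi_s(f(r(t))), 0),
    with the time derivative approximated by finite differences along the ray."""
    phi = phi_s(sdf_vals, s)
    dphi_dt = np.gradient(phi, t_vals)
    return np.maximum(-dphi_dt / np.clip(phi, 1e-6, None), 0.0)
```

For a ray that crosses the surface (SDF changing sign), the density is zero-clamped in free space and rises sharply around the zero crossing, which is what makes the formulation occlusion-aware.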
Generally, this formulation works well. However, for scenes with specular reflection, a point on the surface can present totally different colors when observed from different viewpoints, making it hard to learn with a single color network conditioned on the viewpoint, as NeuS does. To fake the complicated view-dependent effects, NeuS tends to wrongly construct a concave surface so that, from different viewpoints, the camera does not see the same surface point but different points with different colors. To address this issue, we divide the color network into two branches following RefNeRF [Verbin et al. 2022] to model view-independent appearance and view-dependent appearance respectively, which reduces the learning difficulty of the color network. The view-independent branch takes a sample point as input and outputs its view-independent color $c_d$ and its specular tint $p$. Both the sample point and the ray direction $\mathbf{v}$ are fed into the view-dependent branch to predict the view-dependent color $c_l$. The final color of a sample point can be formulated as $c = c_d + p \cdot c_l$.
To calculate the color of each camera ray $\mathbf{r}(t)$, we integrate the colors of the sample points on the ray via the volume rendering equation $C(\mathbf{r}) = \sum_{i=1}^{N} T_i \alpha_i c_i$, where $T_i$ is the accumulated transmittance defined as $T_i = \prod_{j=1}^{i-1}(1 - \alpha_j)$ and $\alpha_i$ represents the opacity value at point $x_i$. We learn to reconstruct the input scene's geometry and appearance by optimizing the following loss function:
$$\mathcal{L} = \lambda_c \sum_{\mathbf{r} \in \mathcal{R}} \big\| C(\mathbf{r}) - C_{gt}(\mathbf{r}) \big\| + \lambda_e \sum_{\mathbf{r}, i} \big( \|\nabla f_{\mathbf{r},i}\| - 1 \big)^2, \qquad (1)$$
where $\mathcal{R}$ denotes the camera rays in a training batch, $C_{gt}(\mathbf{r})$ represents the ground-truth pixel color for a ray $\mathbf{r}$, $x_{\mathbf{r},i}$ is the $i$th sample point on the ray $\mathbf{r}$, and $\|\nabla f_{\mathbf{r},i}\|$ is the spatial norm of the SDF network $f(\cdot)$'s gradient at point $x_{\mathbf{r},i}$ (the Eikonal regularization).
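The discrete compositing above can be sketched in a few lines of NumPy (an illustration under our own conventions, not the paper's implementation):

```python
import numpy as np

def composite(alphas, colors):
    """Volume rendering: C(r) = sum_i T_i * alpha_i * c_i,
    with accumulated transmittance T_i = prod_{j<i} (1 - alpha_j)."""
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas  # per-sample contribution to the pixel
    return weights @ colors, weights
```

A fully opaque sample occludes everything behind it: with opacities [0, 1, 0.5] only the second sample's color reaches the pixel, since the transmittance drops to zero afterwards.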

Scene Decoupling
After reconstructing the scene's geometry and appearance, we extract a mesh using the marching cubes algorithm [Lorensen and Cline 1987]. To decouple the geometry, appearance and lighting components for editing, we define learnable features on the vertices of the mesh, denoted as $f_g$ for geometry features, $f_d$ for diffuse features, $f_s$ for specular features, and $f_r$ for roughness features. For a sample point $x$, its features $f_g(x)$, $f_d(x)$, $f_s(x)$, $f_r(x)$ are defined as the weighted averages of those of its $K$ nearest neighbors among the reconstructed mesh vertices, similar to NeuMesh [Bao et al. 2022] and Point-NeRF [Xu et al. 2022]. Here $f_*(x)$ denotes any of the interpolated learnable features, i.e., $f_g(x)$, $f_d(x)$, $f_s(x)$ and $f_r(x)$, and the weight $w_k(x)$ is the inverse of the distance between $x$ and its $k$th nearest neighbor $v_k$:
$$w_k(x) = \frac{1}{\|x - v_k\|}. \qquad (2)$$
Next, we use a geometry network that takes the geometry feature $f_g(x)$ and the distance $h(x)$ from $x$ to the reconstructed mesh as input to predict the signed distance value $s$ of point $x$. The distance $h(x)$ is also calculated as the weighted average of the distances to the $K$ nearest neighbors, with the weights defined in Eqn. 2. Similarly, we feed the features $f_d$, $f_s$, $f_r$ into separate MLPs to infer the diffuse albedo $a$, specular tint $p$, and roughness value $r$. The signed distance value and appearance parameter predictions can be formulated as follows:
$$s = F_g(f_g, h); \quad a = F_d(f_d, h); \quad p = F_s(f_s, h); \quad r = F_r(f_r, h). \qquad (3)$$
On the lighting side, the diffuse lighting is represented by an explicit environment map $E_l$ where each pixel can be seen as a light, so the diffuse color $c_d$ for a point is obtained by integrating all lights in the environment map $E_l$ at this point via
$$c_d = a \sum_{j} E_l(\omega_j)\,(n \cdot \omega_j),$$
where $\omega_j$ is the direction of the incident light $E_l(\omega_j)$, $n$ is the normal direction at point $x$ derived from the gradient of the geometry network $F_g$, and $\cdot$ denotes the dot product.
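The two ingredients above — inverse-distance feature interpolation over the K nearest vertices (Eqn. 2) and the discrete diffuse integral over the environment-map lights — can be sketched as follows (our own simplified illustration; the function names and array shapes are assumptions):

```python
import numpy as np

def interpolate(x, verts, feats, k=3):
    """Inverse-distance weighted average of the K nearest vertices' features,
    plus the averaged distance h(x) that the decoders also consume."""
    d = np.linalg.norm(verts - x, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[idx], 1e-8)  # w_k(x) = 1 / ||x - v_k||
    w = w / w.sum()
    return w @ feats[idx], w @ d[idx]

def diffuse_color(albedo, normal, light_dirs, light_colors):
    """c_d = a * sum_j E_l(omega_j) * max(n . omega_j, 0): each environment-map
    pixel acts as a light integrated at the sample point (back-facing lights
    clamped to zero in this sketch)."""
    cos = np.clip(light_dirs @ normal, 0.0, None)
    return albedo * (cos @ light_colors)
```

A point lying exactly on a vertex recovers that vertex's feature, since the inverse-distance weight of the coincident vertex dominates.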
For specular lighting that may contain high-frequency details, it is costly to represent it with a high-resolution environment map and to integrate the environment map and the material parameters with the rendering equation. Inspired by the split-sum approximation [Karis 2013] in real-time rendering and the recent RefNeRF [Verbin et al. 2022], which decouples lighting from the rendering equation, we model a sample point's specular color $c_s = p \cdot c_l$ as the product of its specular tint $p$ and the light color $c_l$ that comes from the reflected direction $\omega_r = 2(\omega_o \cdot n)n - \omega_o$ of the view direction $\omega_o = -\mathbf{v}$ w.r.t. its normal direction $n$. Here, the light color $c_l$ is predicted by a specular lighting decoder $F_l(\cdot)$ that takes the sample point's roughness $r$, the dot product $\cos\theta = n \cdot \omega_o$ of the normal direction $n$ and the view direction $\omega_o$, and the reflected direction $\omega_r$ as input:
$$c_l = F_l(r, \cos\theta, \omega_r). \qquad (4)$$
Combining the diffuse color $c_d$ and the specular color $c_s$, we get the sample point's color $c = c_d + c_s$ and render a pixel color using volume rendering. For training, we minimize the following loss:
$$\mathcal{L} = \lambda_c \mathcal{L}_{color} + \lambda_e \mathcal{L}_{eik} + \lambda_{sdf} \mathcal{L}_{sdf} + \lambda_{smooth} \mathcal{L}_{smooth} + \lambda_{env} \mathcal{L}_{env},$$
where $\lambda_c$ and $\lambda_e$ are the same as those in Eqn. 1, and $\mathcal{L}_{sdf}$ is the loss between the predicted signed distance value $s$ at a sample point and the ground-truth signed distance value $s_{gt}$ to the reconstructed mesh.
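The reflected-direction computation feeding Eqn. 4 is standard mirror reflection; a small sketch (ours, with all vectors assumed unit length):

```python
import numpy as np

def reflect(view_dir, normal):
    """omega_r = 2 (omega_o . n) n - omega_o, with omega_o = -v pointing from
    the surface back toward the camera; inputs assumed normalized."""
    omega_o = -view_dir
    return 2.0 * np.dot(omega_o, normal) * normal - omega_o
```

A ray arriving at 45 degrees leaves at 45 degrees on the other side of the normal; the specular color is then assembled as c_s = p * F_l(r, n · omega_o, omega_r).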
$\mathcal{L}_{smooth}$ is a smoothness loss that penalizes differences between adjacent vertices' geometry features and is defined as:
$$\mathcal{L}_{smooth} = \sum_{i} \sum_{j \in \mathcal{N}(i)} \big\| f_g^{i} - f_g^{j} \big\|,$$
where $\mathcal{N}(i)$ is the set of indices of the adjacent vertices of the $i$th vertex. $\mathcal{L}_{env}$ denotes the environment map consistency loss, which enforces the environment map $E_l$ to be consistent with the specular lighting generated by the specular lighting decoder $F_l$. However, our specular lighting has an implicit representation, so it is impossible to compare it with the diffuse lighting directly. Recall from Eqn. 4 that the light color $c_l$ from the reflected direction $\omega_r$ at a point with roughness $r$ is $F_l(r, \cos\theta, \omega_r)$. Following the approximation in split-sum [Karis 2013], when the reflected direction $\omega_r$ is the same as the view direction $\omega_o$, the normal direction $n$ is also the same as $\omega_r$ and $\omega_o$, so $\cos\theta = \cos 0 = 1$. In this case, the output of the specular lighting decoder $F_l$ is an approximation of the environment map, as shown in Fig. 3. Thus $\mathcal{L}_{env}$ is defined as:
$$\mathcal{L}_{env} = \frac{1}{M} \sum_{j=1}^{M} \big\| E_l(j) - F_l(r, 1, \omega_j) \big\|, \qquad (8)$$
where $M$ is the number of pixels in the environment map $E_l$, $r$ is the roughness value of a randomly sampled point on the mesh surface, and $\omega_j$ is the $j$th unit vector pointing from the origin to the $j$th pixel's location of $E_l$ on an extremely large sky sphere.

Figure 3: We treat these directions as view directions and feed them, along with the roughness value of the sample point, into the specular lighting decoder to get the specular lighting colors from different view directions. These predicted specular lighting colors are unwrapped to the 2D image space as an environment map.
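Mapping an environment-map pixel index to the unit direction $\omega_j$ on the sky sphere can be sketched as follows (our illustration; an equirectangular layout is assumed here, since the paper does not specify its parameterization):

```python
import numpy as np

def pixel_to_direction(u, v, width, height):
    """Map an environment-map pixel (u, v) to the unit vector omega_j pointing
    from the origin to that pixel on a large sky sphere (equirectangular
    convention assumed)."""
    theta = (v + 0.5) / height * np.pi       # polar angle measured from +z
    phi = (u + 0.5) / width * 2.0 * np.pi    # azimuth
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
```

Each pixel thus yields one query direction for the specular lighting decoder when evaluating the consistency loss of Eqn. 8.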

Scene Editing
With geometry, appearance and lighting decoupled by the network, our method allows users to edit each component individually without affecting the others. For example, lighting can be changed without influencing the geometry or appearance. At a finer level, we can also edit appearance parameters like diffuse albedo, roughness, and specular tint independently. In the following, we elaborate on how to edit each component.

Geometry Editing. Following NeuMesh [Bao et al. 2022], we apply As-Rigid-As-Possible deformation [Sorkine-Hornung and Alexa 2007] to the reconstructed mesh to deform the scene.

Appearance Editing.
Our appearance editing supports editing all appearance components, including the diffuse, specular and roughness components, by painting a rendered image of the scene. Given a painted image, we locate the corresponding mesh vertices for editing by casting rays from the camera to the reconstructed mesh. The appearance features $f_*$ of these vertices are then treated as trainable parameters while the features of the other vertices remain fixed. The optimization target can be formulated as follows:
$$\arg\min_{f_*} \sum_{\mathbf{r} \in \mathcal{R}_e} \big\| C^{\#}(\mathbf{r}) - C_e(\mathbf{r}) \big\|, \qquad (9)$$
where $\mathcal{R}_e$ denotes the corresponding camera rays of the painted pixels, $C_e(\mathbf{r})$ stands for the color of a painted pixel, and $C^{\#}(\mathbf{r})$ is a rendered component's pixel color after volume rendering, e.g., the diffuse color $c_d$.
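The optimization of Eqn. 9 can be mimicked in a toy setting (our illustration only): features of the located vertices are trainable, all others stay frozen, and a linear map stands in for the appearance decoder plus volume rendering.

```python
import numpy as np

def optimize_features(feats, decoder_w, target, edit_idx, lr=0.1, steps=300):
    """Gradient descent on || C#(r) - C_e(r) ||^2 for the painted pixels;
    only feats[edit_idx] are updated, mirroring the selective optimization."""
    f = feats.copy()
    for _ in range(steps):
        pred = f[edit_idx] @ decoder_w            # stand-in for rendering
        grad = 2.0 * (pred - target) @ decoder_w.T
        f[edit_idx] -= lr * grad
    return f
```

After convergence the edited vertices reproduce the painted colors while every untouched vertex keeps its original feature, which is what makes the edit local.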

Relighting.
As mentioned in Sec. 3.2, our lighting mechanism has two parts, namely diffuse lighting and specular lighting. The diffuse lighting is represented by an explicit environment map and the specular lighting is represented by an MLP network. For relighting, the diffuse lighting can be easily changed by replacing the environment map with the target environment map. However, as the specular lighting has an implicit representation, it cannot be changed directly. Instead, we optimize the specular lighting network $F_l$ to fit the target environment map $E_t$ by minimizing the following loss:
$$\mathcal{L}_{relight} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \big\| E_t(j) - F_l(r_i, 1, \omega_j) \big\|, \qquad (10)$$
where $N$ denotes the number of sample points on the mesh surface, $M$ is the number of pixels in the target environment map image $E_t$, $r_i$ is the roughness value of the $i$th sample point, and $\omega_j$ is the $j$th unit vector pointing from the origin to the $j$th light's location of $E_t$ on an extremely large sphere. Note that we make the same assumption as in Eqn. 8 that the normal direction $n$ is the same as the view direction $\omega_o$, so $\cos\theta = \cos 0 = 1$. However, Eqn. 10 only works well for sample points with small roughness values, where the lighting from the environment map is well preserved. Directly applying Eqn. 10 to sample points with large roughness values may produce unexpected results, such as a rough surface looking like a mirror after relighting (please refer to Fig. 7). Thus, we construct a mipmap of the target environment map by computing pre-filtered environment maps at different roughness levels via Monte Carlo sampling:
$$E_{r}(\omega) = \frac{1}{S} \sum_{s=1}^{S} E_t(\omega_s), \qquad (11)$$
where $E_t(\omega_s)$ is the light coming from the sampled direction $\omega_s$ and $S$ is the number of sampled incident light directions. The sampling process is determined by the roughness value and can be quickly performed using [Krivánek and Colbert 2008]. After integrating the incoming lighting at different roughness levels, we obtain a mipmap of the environment map with a fixed roughness value at each mip level. The specular lighting can then be quickly queried from the mipmap based on the sample point's roughness $r$ and the view direction $\omega_o$. So Eqn. 10 can be further improved:
$$\mathcal{L}_{relight} = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \big\| \mathcal{M}(r_i, \omega_j) - F_l(r_i, 1, \omega_j) \big\|, \qquad (12)$$
where $\mathcal{M}$ is the pre-filtered environment mipmap computed by Eqn. 11 and $\mathcal{M}(r_i, \omega_j)$ is the light color viewed from direction $\omega_j$ and interpolated by roughness $r_i$.
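A crude NumPy stand-in for Eqn. 11's Monte Carlo pre-filtering (our illustration: a uniform pixel jitter whose radius grows with roughness replaces proper roughness-driven importance sampling):

```python
import numpy as np

def prefilter_level(env, roughness, samples=64, seed=0):
    """One mip level: each output pixel averages `samples` jittered environment
    lookups; roughness 0 reproduces the input, larger roughness blurs it."""
    rng = np.random.default_rng(seed)
    h, w, _ = env.shape
    rv = int(round(roughness * h * 0.25))  # jitter radius grows with roughness
    ru = int(round(roughness * w * 0.25))
    out = np.zeros_like(env)
    for v in range(h):
        for u in range(w):
            dv = rng.integers(-rv, rv + 1, size=samples)
            du = rng.integers(-ru, ru + 1, size=samples)
            out[v, u] = env[(v + dv) % h, (u + du) % w].mean(axis=0)
    return out
```

Stacking such levels for increasing roughness values yields the mipmap queried in Eqn. 12; at roughness 0 the level is the unfiltered target map, so mirror-like points still see sharp lighting.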

RESULTS AND EVALUATIONS

Datasets and Evaluation Metrics
We conduct our experiments on two synthetic datasets, NeRF Synthetic [Mildenhall et al. 2020] and Shiny Blender [Verbin et al. 2022], and the real DTU dataset [Jensen et al. 2014]. To evaluate the quality of the reconstructed meshes, we use the Chamfer Distance between the reconstructed meshes and the corresponding ground-truth geometry. Regarding rendering quality, we use the SSIM [Wang et al. 2004], PSNR, and LPIPS [Zhang et al. 2018] metrics to evaluate the similarity between the rendered images and the corresponding ground-truth images. For editing results, we evaluate the image quality by calculating the Fréchet Inception Distance (FID) [Heusel et al. 2017] between the image sets before and after editing, which has been widely used in image generation and editing tasks. For training details and the network architecture, please refer to the supplementary material.

Scene Reconstruction
As shown in Fig. 4, unlike NeuS [Wang et al. 2021], PhySG [Zhang et al. 2021a] and NvDiffRec [Munkberg et al. 2022], our method avoids concave surfaces in geometry reconstruction for scenes with specular reflection by learning view-dependent and view-independent appearance separately. Quantitative results in Table 1 also show that our method outperforms these baselines.
We present novel view synthesis results in Fig. 5 and compare them with PhySG [Zhang et al. 2021a], NeRFactor [Zhang et al. 2021b], NvDiffRec [Munkberg et al. 2022], and NeuMesh [Bao et al. 2022]. PhySG fails to recover the details in the scene due to its smooth lighting representation and its assumption that the whole scene shares a single specular BRDF material. NeRFactor uses an environment map of size 32×16 as its lighting representation, which prevents it from representing high-frequency lighting.

Scene Editing
As mentioned in Sec. 3, we support editing of geometry, appearance and lighting. Our geometry editing is similar to NeuMesh [Bao et al. 2022], and we show geometry editing results in the supplementary material. In this section, we focus on the appearance and lighting editing tasks.

4.3.1 Appearance Editing. We show appearance editing comparisons with NeuMesh [Bao et al. 2022] in Fig. 8. NeuMesh renders plausible results from the editing viewpoint after optimization, but the rendered results from another viewpoint become inconsistent with the input editing. Our method optimizes the learnable diffuse features $f_d$ to minimize the difference between the rendered diffuse color and the editing target using Eqn. 9, so the edited appearance matches the input editing when viewed from other viewpoints and the view-dependent appearance is preserved. We compare the image sets before and after editing using the Fréchet Inception Distance (FID) [Heusel et al. 2017] metric to evaluate the image quality after editing in Table 3. Compared with NeuMesh, our rendered images achieve better FID scores on all datasets, indicating higher image quality after editing. We show specular and roughness editing results, which NeuMesh does not support, in the supplementary material.
4.3.2 Relighting. We compare with the recent PhySG [Zhang et al. 2021a], InvRender [Zhang et al. 2022b], NeRFactor [Zhang et al. 2021b], NvDiffRec [Munkberg et al. 2022], and NvDiffRecMC [Hasselgren et al. 2022], which learn to decompose geometry, material and lighting, in Fig. 9. PhySG, InvRender and NeRFactor fail to express high-frequency environmental lighting due to their smooth or low-resolution lighting representations. NvDiffRec and NvDiffRecMC can handle high-frequency lighting with their high-resolution environment maps but may fail to reconstruct correct geometry or material, leading to less faithful results. Our method extracts more accurate geometry and produces better relighting results with the hybrid lighting representation. We also evaluate the relighting results using PSNR, SSIM, and LPIPS metrics in Table 4 by comparing them with ground-truth images generated by Blender. Overall, our relighting results have higher quality.

Ablation studies
In this subsection, we evaluate several design choices in our pipeline by conducting ablation studies on them.

4.4.1 Hybrid Lighting. We use a hybrid lighting representation consisting of an environment map and a lighting network. To evaluate this representation, we compare it with a baseline that renders both the diffuse color and the specular color using the explicit environment map with the microfacet model [Walter et al. 2007] in Fig. 6 and Table 5. The baseline struggles to reconstruct high-frequency lighting effects, and our hybrid lighting representation outperforms it in terms of reconstruction quality.
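The hybrid representation can be sketched roughly as follows. This is a minimal PyTorch illustration with assumed module names, sizes, and an equirectangular nearest-neighbor lookup; the paper's actual networks differ. The diffuse term reads the explicit learnable environment map, while the specular term is predicted by the implicit lighting network.

```python
import torch
import torch.nn as nn

class HybridLighting(nn.Module):
    def __init__(self, env_h=16, env_w=32, hidden=64):
        super().__init__()
        # Explicit, learnable environment map (H, W, 3).
        self.env_map = nn.Parameter(torch.rand(env_h, env_w, 3))
        # Implicit lighting network: (reflection dir, roughness) -> RGB.
        self.specular_net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Softplus())

    def query_env(self, dirs):
        """Nearest-neighbor fetch from the explicit map via
        equirectangular (latitude-longitude) coordinates."""
        theta = torch.acos(dirs[..., 2].clamp(-1, 1))      # polar angle
        phi = torch.atan2(dirs[..., 1], dirs[..., 0])      # azimuth
        u = ((phi / (2 * torch.pi)) % 1.0 * self.env_map.shape[1]).long()
        v = (theta / torch.pi * (self.env_map.shape[0] - 1)).long()
        return self.env_map[v.clamp(max=self.env_map.shape[0] - 1),
                            u.clamp(max=self.env_map.shape[1] - 1)]

    def forward(self, albedo, normal, refl_dir, roughness):
        diffuse = albedo * self.query_env(normal)          # explicit part
        spec_in = torch.cat([refl_dir, roughness], dim=-1)
        specular = self.specular_net(spec_in)              # implicit part
        return diffuse + specular
```

The explicit map keeps lighting editable and swappable at relighting time, while the MLP is free to fit the sharp, high-frequency specular effects that a smooth analytic basis cannot express.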

4.4.2 Mipmap Relighting. We construct a mipmap of the target environment map based on the roughness values for the relighting task. We show comparisons between the relighting results with and without mipmap interpolation in Fig. 7. The relit scenes may have a mirror-like appearance if the mipmap is not applied, while the roughness can be well preserved when we utilize mipmap interpolation, leading to more faithful relighting results. We also evaluate it quantitatively in Table 6: the rendered images of the relit scenes have higher image quality when the mipmap strategy is applied.
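The idea behind the mipmap strategy can be sketched as follows: each mip level is a progressively blurred copy of the target environment map, a fragment's roughness selects a fractional level, and the two nearest levels are linearly blended, so rough surfaces sample blurrier lighting instead of mirror-sharp reflections. All function names and the averaging filter are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def build_mipmap(env_map, levels=5):
    """env_map: (3, H, W). Return a list of progressively
    down-filtered (blurrier) copies, one per mip level."""
    mips = [env_map]
    for _ in range(levels - 1):
        mips.append(F.avg_pool2d(mips[-1].unsqueeze(0), 2).squeeze(0))
    return mips

def lookup(mip, uv):
    """Nearest-neighbor fetch from one mip level (3, H, W); uv in [0, 1]."""
    _, h, w = mip.shape
    x = (uv[0] * (w - 1)).long()
    y = (uv[1] * (h - 1)).long()
    return mip[:, y, x]

def sample_by_roughness(mips, uv, roughness):
    """uv: (N, 2); roughness in [0, 1] selects a fractional mip level,
    blended linearly between the two nearest levels."""
    level = roughness.squeeze(-1) * (len(mips) - 1)
    lo, frac = level.floor().long(), level.frac()
    out = torch.empty(uv.shape[0], 3)
    for i in range(uv.shape[0]):
        li = int(lo[i])
        c0 = lookup(mips[li], uv[i])
        c1 = lookup(mips[min(li + 1, len(mips) - 1)], uv[i])
        out[i] = (1 - frac[i]) * c0 + frac[i] * c1
    return out
```

With roughness 0 the sharpest level is sampled unchanged (mirror-like), while larger roughness pulls in the blurred levels, which is why the relit scene preserves the material's roughness.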

DISCUSSION AND CONCLUSION
In this paper, we present a geometry, appearance and lighting editing method for neural radiance fields. The technical core is a geometry, appearance and lighting decoupling network that optimizes the learnable geometry and appearance features defined on mesh vertices, the environment map, and the specular lighting network all at once. Building upon this decoupling network, appearance editing from a given viewpoint can be seamlessly transferred to other viewpoints. In addition, our hybrid lighting representation, composed of an explicit environment map and an implicit lighting network, can well simulate the lighting effects in the scene and supports high-frequency environmental relighting. Nevertheless, our approach still has the following limitations. Firstly, our method does not jointly optimize the geometry in the decoupling step, which may lead to poor reconstruction of thin structures (see the first row of Fig. 8). Secondly, our method works better on relatively convex objects since it does not consider shadows or inter-reflections and produces wrong decoupled results when shadows and inter-reflections exist in the scene, as shown in Fig. 10. For future exploration, we would like to learn geometry, appearance and lighting in an end-to-end manner so that all components can be optimized jointly. It is also possible to combine generative models and neural radiance field editing; for example, we can leverage newly developed diffusion models [Rombach et al. 2022] to help with image editing or apply deep geometric generative models [Gao et al. 2019] and texture generative models [Gao et al. 2021; Wang et al. 2014] to neural radiance fields.

Figure 3: Given a sample point in the scene (the red point), we sample multiple directions from the sample point to points (black points on the blue frame) on the sky sphere. We treat these directions as view directions and feed them along with the roughness value of the sample point into the specular lighting decoder to get the specular lighting colors from different view directions. These predicted specular lighting colors are unwrapped to the 2D image space as an environment map.
Figure 6: Qualitative comparison of reconstruction results with the explicit lighting baseline.
Figure 7: Qualitative comparisons between relighting results without and with mipmap interpolation. The relit scene can better preserve the roughness of the input scene when the mipmap is applied.

Figure 8: Scene appearance editing comparison with NeuMesh [Bao et al. 2022]. NeuMesh [Bao et al. 2022] can generate plausible rendering results from the editing viewpoint, but rendered results from another viewpoint may be inconsistent with the input editing. Our method produces more faithful editing results from both the editing viewpoint and novel viewpoints.

Figure 9: Scene relighting comparisons with PhySG [Zhang et al. 2021a], InvRender [Zhang et al. 2022b], NeRFactor [Zhang et al. 2021b], NvDiffRec [Munkberg et al. 2022], and NvDiffRecMC [Hasselgren et al. 2022]. In each row, the input scene and target environment map are shown in the first column. In the other columns, we show relighting results by different methods and the ground-truth relighting result. With the help of our reconstructed geometry and hybrid lighting representation, our method can produce more faithful relighting results with high-frequency details. We show more relighting comparisons in the supplementary material.

Figure 10: Failure case: for an input scene with inter-reflections (the first row) and shadows (the second row), our decomposition network may produce wrong decomposition results and bake them into the appearance.

Table 1: Quantitative comparison of geometric reconstruction quality using the Chamfer distance metric. All values have been multiplied by 10 for easier reading.

Table 2: Quantitative comparison of novel view synthesis results using the SSIM, PSNR, and LPIPS metrics.

Table 3: Quantitative comparison of appearance editing results with NeuMesh [Bao et al. 2022] using the FID metric (the lower the better).

Table 4: Quantitative comparison of novel view synthesis results after relighting using the SSIM, PSNR, and LPIPS metrics. Results are averaged over ten different viewpoints with eight different environment maps.

Table 5: Quantitative comparison of reconstruction results with the explicit lighting baseline on the Shiny Blender dataset.

Table 6: Quantitative comparison of relighting results with and without mipmap on the Shiny Blender dataset.