FLARE: Fast Learning of Animatable and Relightable Mesh Avatars

Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the prefiltered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.


INTRODUCTION
There has been remarkable progress on learning personalized 3D facial assets, moving from complex and expensive high-end systems [Beeler et al. 2011; Debevec et al. 2000; Ghosh et al. 2011] to using single commodity sensors as input [Grassal et al. 2022; Zheng et al. 2022; Zielonka et al. 2023]. Although a quality gap still exists, it is being gradually bridged by neural methods that leverage implicit or explicit shape representations. In particular, signed distance fields [Zheng et al. 2022] and point clouds [Zheng et al. 2023] have been used to obtain impressive 3D reconstructions, while NeRF-based [Mildenhall et al. 2020] approaches [Gafni et al. 2021; Zielonka et al. 2023] have shown an outstanding ability to synthesize novel views of the subject. Further, these methods are trained such that the learned avatars can be controlled with novel poses and expressions, making them appropriate for entertainment and telecommunication.

Fig. 2. Method pipeline. Top: Given an input video, we optimize for vertex displacements to obtain a canonical geometry. A deformation network $\mathcal{D}$ (green) then predicts FLAME [Li et al. 2017] expression blendshapes $E$, pose correctives $P$ and blend skinning weights $W$ given canonical vertices, which are used to deform the mesh into the corresponding expression and pose. The deformed mesh is rasterized following a deferred shading pipeline to obtain per-pixel canonical coordinates $\mathbf{x}_c$ and deformed normals $\mathbf{n}$. Bottom: $\mathbf{x}_c$ is used to query the material network $\mathcal{M}$ (purple) to obtain the albedo, roughness, and specular intensity. Next, the lighting network $\mathcal{L}$ (pink) obtains an estimate of the diffuse shading and specular reflection from the normal and reflection vectors, while taking roughness into account. We use physically-based rendering to compute the final color, which is compared with the ground-truth frame during training.

Table 1. Compared to other methods, FLARE converges rapidly, reconstructs high-fidelity geometry, is compatible with graphics pipelines since it employs a mesh representation, and produces head avatars that can be relit. Columns: Method | Converges within 15 mins | High-fidelity geometry | Compatible with graphics pipelines | Relightable.
There are several challenges that remain for existing head avatars to be widely applicable in industry. First, the majority of methods are slow to train and/or to render, taking hours [Zheng et al. 2023] or days [Grassal et al. 2022; Zheng et al. 2022] of processing to obtain a single, scene-dependent avatar. This limits the scope of applications and hinders the creation of immersive experiences. Fast approaches have recently been proposed [Gao et al. 2022; Xu et al. 2023; Zielonka et al. 2023], but they suffer from low-quality geometry and often do not generalize well to novel views. Second, to achieve high-quality reconstructions, current methods use shape representations that are not compatible with standard graphics pipelines. Ideally, a mesh representation should enable easy asset extraction and integration. However, recent neural methods that are built on triangulated meshes [Grassal et al. 2022; Khakhulin et al. 2022] do not achieve the same geometric quality as methods based on more flexible representations. Finally, most neural approaches generate avatars that can only be rendered in the same environment in which they were captured since they do not disentangle the light from intrinsic material properties. What is still missing is an efficient method to extract head avatars that have high-fidelity geometry and can be animated and relit.
In this work we present a new method, FLARE (Fast Learning of Animatable and RElightable mesh avatars), for building 3D facial avatars from monocular videos that addresses all of these challenges, as shown in Table 1 and Figure 1. We use a mesh representation to allow easy integration as well as fast computation during training and at inference time. We represent the canonical head geometry as a triangular mesh with optimizable vertex locations and learn blendshapes as well as skinning-weight fields to deform the canonical mesh given FLAME [Li et al. 2017] expression and pose parameters. To disentangle the intrinsic material properties and extrinsic light conditions, we leverage physically-based rendering [Cook and Torrance 1982; Walter et al. 2007] where materials and lighting are represented by multi-layer perceptrons (MLPs). Specifically, we use the Disney material model [Burley 2012] and represent albedo, roughness, and specular intensity as hash-encoded spatial MLPs [Müller et al. 2022]. To render color efficiently we adopt the split-sum approximation proposed in [Karis 2013]. However, explicitly computing the environment light is challenging with monocular head videos given their narrow field of view. To address this, we approximate the pre-filtered environment map in the split-sum approximation with a neural network, together with an Integrated Directional Encoding (IDE) [Verbin et al. 2022] that accounts for different roughness levels. Our networks are trained using a two-stage approach, where the first stage is focused on geometry and then the second stage refines the color by leveraging the hash-grid encoding [Müller et al. 2022]. This allows FLARE to control the pace at which geometry and color are learned relative to each other, achieving detailed results in both areas.

While maintaining high accuracy, our method is carefully designed to improve training and rendering efficiency: (1) The canonical material estimation MLPs are fueled by hash-grid encoding [Müller et al. 2022], which effectively represents high-resolution mappings with shallow MLPs, boosting query speed significantly; (2) The neural split-sum approximation reduces the evaluation of expensive integrals into one look-up in the preintegrated texture [Karis 2013; Munkberg et al. 2022], as well as one forward pass of an MLP; (3) Our morphable mesh representation enables efficient differentiable rasterization with existing tools [Laine et al. 2020], in contrast to implicit representations that require hundreds of queries per pixel. Thanks to the above components, our method reconstructs detailed relightable avatars in around 15 minutes. Our experiments show that our proposed approach achieves high-fidelity geometry as well as realistic renderings, which are on par with, or superior to, existing approaches while being much faster to train, as demonstrated in Figure 3. Code is available for research purposes at https://flare.is.tue.mpg.de.

RELATED WORK

2.1 3D head avatars from videos
Creating animatable 3D head avatars from videos is a popular research topic because it replaces the need for complex capture equipment [Beeler et al. 2011; Debevec et al. 2000; Ghosh et al. 2011; Riviere et al. 2020] with more easily accessible commodity sensors. Classic approaches [Garrido et al. 2016; Thies et al. 2016] employ statistical models [Blanz and Vetter 1999] to recover the 3D shape and appearance, but only focus on the facial area and produce relatively coarse reconstructions. NerFACE [Gafni et al. 2021] was the first to use dynamic neural radiance fields (NeRF) [Mildenhall et al. 2020] to represent head avatars. IMavatar [Zheng et al. 2022] recovers accurate geometry using implicit surfaces by jointly learning canonical head geometry and expression deformations. However, methods based on implicit representations can be inefficient to train and render. PointAvatar [Zheng et al. 2023] uses a similar deformation model but employs a point cloud representation, enabling faster rasterization and better image quality. Recently, several methods [Gao et al. 2022; Xu et al. 2023; Zielonka et al. 2023] employ InstantNGP [Müller et al. 2022] to speed up radiance field queries and can reconstruct avatars within 5 to 20 minutes. To the best of our knowledge, none of these fast avatar reconstruction methods produce high-quality surface normals. Neural Head Avatar (NHA) [Grassal et al. 2022] reconstructs mesh-based avatars with complete head and hair geometry. However, the reconstructed geometry is relatively coarse, with many details represented in the texture space. None of these recent neural methods factorize light and albedo, with the exception of PointAvatar, which performs a rudimentary factorization using a diffuse shading model. In contrast, our method reconstructs mesh-based avatars with high-quality geometry within 15 minutes, and factorizes appearance into albedo, roughness, and extrinsic illumination. This enables our avatars to be readily rendered in new scenes.

Fig. 3. Our method is nearly as quick as INSTA while performing on par with or better than competitors. Lower is better for (a) and higher is better for (b).

Relightable 3D reconstruction from multi-view images
The ability to learn relightable 3D assets from 2D observations has extensive applications in AR and VR content creation. Several previous methods [Bi et al. 2020; Boss et al. 2021a; Srinivasan et al. 2021; Verbin et al. 2022; Zhang et al. 2021a,b] leverage neural implicit representations such as NeRF [Mildenhall et al. 2020] or neural SDF [Mescheder et al. 2019; Park et al. 2019], which benefit from unconstrained topology but are inefficient to render. Recently, [Munkberg et al. 2022] convert neural SDFs to meshes with differentiable marching tetrahedrons [Shen et al. 2021] and employ physically-based rendering to reconstruct high-quality relightable 3D assets in less than an hour. Neural-PIL [Boss et al. 2021b] leverages an idea similar to ours and approximates parts of the split-sum formula [Karis 2013] with neural networks. However, the method requires pre-training on a large dataset, which hinders generalization, and is only tested with multi-view images that have full coverage of the scene.
To obtain 3D head avatars that are both animatable and relightable, recent methods [Dib et al. 2021; Feng et al. 2022] leverage the deformable geometry of pretrained 3DMMs [Li et al. 2017; Paysan et al. 2009], and predict albedo and lighting from a single image. SIRA [Caselles et al. 2023] improves the coarse 3DMM geometry by learning a deformable SDF but requires a large number of 3D scans for training. In contrast, our method reconstructs relightable 3D head avatars from a single monocular video, achieving high-quality geometry without requiring expensive 3D scans for prior training.

Geometry
To achieve high train- and test-time efficiency, and for compatibility with standard graphics pipelines, we use triangle meshes as the geometric representation. As shown in Figure 2, we learn a single canonical mesh that best explains all views, along with deformation fields that transform the canonical mesh given FLAME pose and expression parameters. We describe each of these below.
3.1.2 Deformation field. We deform the canonical geometry using the FLAME parameters computed during the pre-processing step. Specifically, given a canonical vertex $\mathbf{v} \in \mathbb{R}^3$, we deform it as follows:

$$ \tilde{\mathbf{v}} = \mathrm{FLAME}(\mathbf{v}, E, P, W, \theta, \psi) = \mathrm{LBS}\big(\mathbf{v} + B_P(\theta; P) + B_E(\psi; E),\ J(\psi),\ \theta,\ W\big), \qquad (1) $$

where $J(\psi)$ is the joint regressor, LBS is the standard linear blend-skinning function with blend-skinning weights $W$, $\theta$ and $\psi$ are the FLAME pose and expression parameters, and $B_P(\cdot)$ and $B_E(\cdot)$ compute the pose and expression offsets using the blendshape components $P$ and $E$. Similar to IMavatar [Zheng et al. 2022], we train a deformation network $\mathcal{D}$ parameterized by an MLP that, given a canonical vertex location $\mathbf{v}$, returns the expression blendshapes $E \in \mathbb{R}^{n_e \times 3}$, the pose correctives $P \in \mathbb{R}^{n_j \times 9 \times 3}$, and the linear blend skinning weights $W \in \mathbb{R}^{n_j}$ of the vertex (with $n_e$ and $n_j$ the number of expression parameters and bone transformations, respectively):

$$ \mathcal{D}(\mathbf{v}) \rightarrow (E, P, W). \qquad (2) $$

Note that, while IMavatar requires a costly root-finding process to search for canonical correspondences given deformed ray samples, our mesh formulation avoids this by directly deforming the canonical mesh and rasterizing it.
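To make the deformation step concrete, the following is a minimal sketch of blendshape offsets followed by linear blend skinning for a single vertex. It is an illustration under simplifying assumptions, not the FLARE implementation: the pose-corrective offset is passed in already evaluated, the per-bone transforms are assumed to be pre-composed 3x4 matrices (so the joint regressor and kinematic chain are skipped), and the names `deform_vertex` and `apply_transform` are hypothetical.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat34 = Sequence[Sequence[float]]  # rows of a 3x4 transform [R | t]


def apply_transform(T: Mat34, p: Vec3) -> List[float]:
    """Apply a 3x4 affine transform [R | t] to a 3D point."""
    return [T[r][0] * p[0] + T[r][1] * p[1] + T[r][2] * p[2] + T[r][3]
            for r in range(3)]


def deform_vertex(v: Vec3,
                  expr_basis: Sequence[Vec3],
                  psi: Sequence[float],
                  pose_offset: Vec3,
                  bone_transforms: Sequence[Mat34],
                  weights: Sequence[float]) -> List[float]:
    """Blendshape offsets followed by linear blend skinning (sketch).

    expr_basis[k] is the per-vertex expression blendshape for coefficient
    psi[k]; pose_offset is an already-evaluated pose corrective (assumption).
    """
    # 1) Canonical vertex plus expression and pose-corrective offsets.
    shaped = [
        v[d] + pose_offset[d] + sum(c * e[d] for c, e in zip(psi, expr_basis))
        for d in range(3)
    ]
    # 2) Linear blend skinning: weighted sum of per-bone transformed points.
    out = [0.0, 0.0, 0.0]
    for w, T in zip(weights, bone_transforms):
        q = apply_transform(T, shaped)
        for d in range(3):
            out[d] += w * q[d]
    return out
```

Because the canonical vertex is pushed forward directly, no inverse (deformed-to-canonical) search is needed, which is the efficiency argument made above.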

Reflectance
To make our avatars relightable, we factorize the observed colors into learned albedo, roughness, and specular intensity, as well as a neural representation of the environment illumination. We adopt the Disney shading model [Burley 2012] and modify it to better suit our input data. We elaborate on the appearance model in the following.

Physically-based rendering.
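According to the classic rendering equation [Kajiya 1986], the radiance $L_o(\mathbf{x}, \omega_o) \in \mathbb{R}^3$ leaving a surface point $\mathbf{x} \in \mathbb{R}^3$ with normal vector $\mathbf{n}$ in the direction $\omega_o$ is modeled as

$$ L_o(\mathbf{x}, \omega_o) = \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i, \qquad (3) $$

where the integral is over the hemisphere $\Omega$, $f_r(\mathbf{x}, \omega_i, \omega_o)$ is the bi-directional reflectance distribution function (BRDF), and $L_i$ is the incoming light intensity from direction $\omega_i$.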
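Following the dichromatic reflection model [Shafer 1985], the BRDF is decomposed into a diffuse term $f_d$ and a specular term $f_s$, and the total reflectance is calculated as $f_r(\mathbf{x}, \omega_i, \omega_o) = f_d(\mathbf{x}, \omega_i, \omega_o) + k_s(\mathbf{x})\, f_s(\mathbf{x}, \omega_i, \omega_o)$, where $k_s(\mathbf{x})$ is a spatially-varying specular intensity factor that weighs the contribution of the specular BRDF, similar to [Riviere et al. 2020]. The rendering equation then becomes:

$$ L_o(\mathbf{x}, \omega_o) = \int_{\Omega} f_d(\mathbf{x}, \omega_i, \omega_o)\, L_i(\omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i + k_s(\mathbf{x}) \int_{\Omega} f_s(\mathbf{x}, \omega_i, \omega_o)\, L_i(\omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i. \qquad (4) $$

We use a simple Lambertian model for the diffuse term:

$$ f_d(\mathbf{x}) = \frac{\rho(\mathbf{x})}{\pi}, \qquad (5) $$

where $\rho$ is the spatially-varying RGB albedo. For the specular term we use the Cook-Torrance microfacet model [Cook and Torrance 1982]:

$$ f_s(\mathbf{x}, \omega_i, \omega_o) = \frac{D\, F\, G}{4\, (\mathbf{n} \cdot \omega_i)\, (\mathbf{n} \cdot \omega_o)}. \qquad (6) $$

Here, the surface roughness $r$ modulates the microfacet normal distribution function $D$ and the geometry attenuation function $G$ that accounts for self-shadowing. $F$ denotes the Fresnel equation that describes the proportion of light reflected at different surface angles. We follow the Disney material model for the specific choice of the $D$, $F$ and $G$ functions; see [Burley 2012; Karis 2013].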
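As an illustration of the two BRDF terms, the sketch below evaluates a Lambertian diffuse term and a Cook-Torrance specular term from precomputed cosines. The specific choices are assumptions in the spirit of [Karis 2013] rather than a statement of the exact functions used by FLARE: a GGX distribution with $\alpha = r^2$, Schlick's Fresnel approximation with a hypothetical default `f0 = 0.04`, and a Smith-Schlick geometry term.

```python
import math
from typing import List, Sequence


def lambert(albedo: Sequence[float]) -> List[float]:
    """Lambertian diffuse BRDF: albedo / pi per color channel."""
    return [c / math.pi for c in albedo]


def ggx_ndf(n_dot_h: float, roughness: float) -> float:
    """GGX normal distribution function, assuming alpha = roughness^2."""
    a2 = (roughness * roughness) ** 2
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)


def schlick_fresnel(v_dot_h: float, f0: float) -> float:
    """Schlick approximation of the Fresnel term."""
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5


def smith_schlick_g(n_dot_l: float, n_dot_v: float, roughness: float) -> float:
    """Smith geometry term with the Schlick-GGX approximation."""
    k = (roughness + 1.0) ** 2 / 8.0

    def g1(c: float) -> float:
        return c / (c * (1.0 - k) + k)

    return g1(n_dot_l) * g1(n_dot_v)


def cook_torrance(n_dot_l: float, n_dot_v: float, n_dot_h: float,
                  v_dot_h: float, roughness: float, f0: float = 0.04) -> float:
    """Scalar Cook-Torrance specular BRDF: D * F * G / (4 (n.l)(n.v))."""
    if n_dot_l <= 0.0 or n_dot_v <= 0.0:
        return 0.0  # light or viewer below the surface
    d = ggx_ndf(n_dot_h, roughness)
    f = schlick_fresnel(v_dot_h, f0)
    g = smith_schlick_g(n_dot_l, n_dot_v, roughness)
    return d * f * g / (4.0 * n_dot_l * n_dot_v)
```

With this parameterization, `ggx_ndf(c, 1.0)` evaluates to $1/\pi$ for any cosine `c`, i.e. the distribution becomes constant at maximum roughness.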

Estimating intrinsic materials.
We optimize the albedo and roughness of our head model, as well as specular intensity values. These properties are canonical properties of the surface and remain constant during facial deformation. Therefore, we employ an MLP, $\mathcal{M}$, that receives canonical surface points $\mathbf{x}_c$ as input and predicts albedo $\rho$, roughness $r$ and specular intensity $k_s$:

$$ \mathcal{M}(\mathbf{x}_c) \rightarrow (\rho, r, k_s). \qquad (7) $$

3.2.3 Split-sum approximation. The split-sum approximation [Karis 2013] was proposed to efficiently evaluate the specular reflectance by splitting it into two integrals that can be pre-computed:

$$ \int_{\Omega} f_s(\mathbf{x}, \omega_i, \omega_o)\, L_i(\omega_i)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \approx \int_{\Omega} L_i(\omega_i)\, D(\omega_i, \omega_r, r)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \cdot \int_{\Omega} f_s(\mathbf{x}, \omega_i, \omega_o)\, (\omega_i \cdot \mathbf{n})\, d\omega_i. \qquad (8) $$

The first term corresponds to a pre-filtered environment map, where the environment light $L_i$ is convolved with the normal distribution function $D$. This term is pre-computed for a set of roughness values and stored as a series of 2D look-up textures (LUT), where each mipmap level is selected based on roughness, and each texture is indexed by the reflection vector $\omega_r = 2(\omega_o \cdot \mathbf{n})\mathbf{n} - \omega_o$. The second integral, known as the BRDF integration map, contains the rest of the terms, and it is equivalent to integrating Equation 6 with a white environment map ($L_i(\omega_i) = 1$) [Karis 2013]. This term depends on the roughness $r$ and the cosine angle $(\omega_o \cdot \mathbf{n})$, and it is also stored as a LUT, which will be referred to as the BRDF integration LUT.
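The two look-ups of the split-sum approximation can be sketched as follows. The reflection vector follows the formula given above; the combination of the two terms follows the scale-and-bias form described by [Karis 2013], where the BRDF integration map stores two values per (roughness, cosine) entry. The function names, the argument layout, and the `f0` reflectance parameter are illustrative assumptions.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def reflect(w_o: Sequence[float], n: Sequence[float]) -> List[float]:
    """Reflection vector w_r = 2 (w_o . n) n - w_o for unit vectors."""
    d = dot(w_o, n)
    return [2.0 * d * n[i] - w_o[i] for i in range(3)]


def split_sum_specular(prefiltered_light: Sequence[float],
                       lut_scale: float,
                       lut_bias: float,
                       f0: float) -> List[float]:
    """Combine the two split-sum terms (sketch).

    prefiltered_light: RGB value fetched from the pre-filtered environment
        map at direction w_r and the mip level matching the roughness.
    lut_scale, lut_bias: the two channels of the BRDF integration map,
        looked up at (roughness, w_o . n).
    """
    brdf_term = f0 * lut_scale + lut_bias
    return [c * brdf_term for c in prefiltered_light]
```

Both look-ups are constant-time per pixel, which is what makes the approximation attractive inside a training loop.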

3.2.4 Neural split-sum approximation. The split-sum approximation can help to efficiently learn a rich model of illumination, and has been used to disentangle the light and materials from multi-view images [Munkberg et al. 2022]. However, our setting considers as input a fixed-viewpoint video, which is a more challenging scenario for light disentanglement. We found through experiments that optimizing environment maps directly often leads to sub-optimal results (see Figure 10). To address this, we approximate the pre-filtered environment map in Eq. 8 with a neural network $\mathcal{L}$ that takes the reflection vector $\omega_r$ and the roughness $r$ as input:

$$ \int_{\Omega} L_i(\omega_i)\, D(\omega_i, \omega_r, r)\, (\omega_i \cdot \mathbf{n})\, d\omega_i \approx \mathcal{L}(\omega_r, r). $$

To design this neural network, we observe that the roughness parameter $r$ influences the output via the normal distribution function $D$, i.e., a larger roughness corresponds to a wider distribution and leads to blurrier filtered light maps. In 2D LUTs, the pre-filtered environment maps for different roughness values are represented as mipmaps. A key challenge for the neural split-sum approximation is to model this behavior for different roughness levels. To address this, we propose to adapt the Integrated Directional Encoding (IDE) [Verbin et al. 2022] to represent different mip levels of neural fields. The IDE encodes the input reflection vector $\omega_r$ through the expected value of a set of spherical harmonics under a von Mises-Fisher (vMF) distribution centered at $\omega_r$, where the concentration parameter $\kappa$ is defined as the inverse roughness $1/r$:

$$ \mathrm{IDE}(\omega_r, \kappa) = \left\{ \mathbb{E}_{\omega \sim \mathrm{vMF}(\omega_r, \kappa)}\left[ Y_\ell^m(\omega) \right] : (\ell, m) \in \mathcal{M}_L \right\}, $$

with $\mathcal{M}_L = \{(\ell, m) : \ell = 1, \ldots, 2^L,\ m = 0, \ldots, \ell\}$, $Y_\ell^m$ the spherical harmonics basis functions, and $L = 4$. In practice, this positional encoding limits the representational power of the neural network when using larger roughness values, which essentially mimics the behavior of mipmap levels in a continuous manner.

Note that the incident illumination $L_i$, now represented as part of the pre-filtered light MLP $\mathcal{L}$, also determines the diffuse shading. We observe that setting the roughness to its maximum value $r = 1$ within the GGX distribution for $D$ (employed by the Disney material model) makes $D$ equal to $1/\pi$, and the pre-filtered environment map term becomes the diffuse shading of Equation 5. Hence, we can use $\mathcal{L}$ to represent both the diffuse shading and the specular pre-filtered environment map: the diffuse shading is obtained by querying $\mathcal{L}(\mathbf{n}, r = 1)$ with the surface normal $\mathbf{n}$, and the specular term by querying $\mathcal{L}(\omega_r, r)$ with the reflection vector. We thus replace the explicit integration of a scene environment map with a single query over the pre-filtered light MLP, while still grounding the formulation on a physics-based model via the BRDF integration LUT term. At test time we relight the avatar by simply replacing $\mathcal{L}$ with a pre-filtered environment map.
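To illustrate why the IDE behaves like a continuous mipmap, the sketch below computes the per-degree attenuation that the vMF expectation applies to each spherical harmonic. It relies on the closed-form approximation reported by [Verbin et al. 2022], $\mathbb{E}[Y_\ell^m(\omega)] \approx \exp(-\ell(\ell+1)/(2\kappa))\, Y_\ell^m(\omega_r)$, and on $\kappa = 1/r$ as defined above. The function name and the choice to enumerate every degree from 1 to $2^L$ are assumptions made for illustration.

```python
import math
from typing import Dict


def ide_attenuation(roughness: float, num_levels: int = 4) -> Dict[int, float]:
    """Attenuation of each spherical-harmonic degree under a vMF lobe.

    Returns a mapping degree -> exp(-l (l + 1) / (2 kappa)) with
    kappa = 1 / roughness, for degrees l = 1 .. 2**num_levels.
    """
    if roughness <= 0.0:
        raise ValueError("roughness must be positive")
    kappa = 1.0 / roughness
    return {
        degree: math.exp(-degree * (degree + 1) / (2.0 * kappa))
        for degree in range(1, 2 ** num_levels + 1)
    }


# High-frequency degrees survive for smooth surfaces and vanish for rough ones.
smooth = ide_attenuation(0.01)
rough = ide_attenuation(1.0)
```

Higher roughness suppresses the high-degree components first, so the network only sees low-frequency directional features for rough surfaces, analogous to sampling a coarser mip level.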

Color prediction.
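The final outgoing radiance is calculated by combining Equations 4, 5 and 8 with the neural approximation of the pre-filtered light: the diffuse term is the albedo $\rho$ multiplied by the diffuse shading $\mathcal{L}(\mathbf{n}, r = 1)$, and the specular term is the specular intensity $k_s$ multiplied by the pre-filtered light $\mathcal{L}(\omega_r, r)$ and by the value $B(r, \omega_o \cdot \mathbf{n})$ of the BRDF integration LUT:

$$ L_o(\mathbf{x}, \omega_o) = \rho\, \mathcal{L}(\mathbf{n}, r = 1) + k_s\, \mathcal{L}(\omega_r, r)\, B(r, \omega_o \cdot \mathbf{n}). \qquad (13) $$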

TRAINING

4.1 Loss Functions
In this section, we discuss the loss functions employed by FLARE, grouped by image-related losses, geometry-related losses, deformation-related losses, and material regularizers.
4.1.1 Image. Given a ground-truth frame $I$ and a predicted image $\tilde{I}$, we compute (1) the L2 loss in log space between the masked ground-truth and the predicted image, following [Munkberg et al. 2022]; (2) an L2 loss between the ground-truth and predicted binary masks; and (3) a perceptual loss [Johnson et al. 2016] computed on the features extracted from the first four layers of a pre-trained VGG [Simonyan and Zisserman 2015] network.
4.1.2 Geometry. During the optimization of mesh vertices, it is necessary to constrain them in order to avoid self-intersections and to obtain a coherent shape. We follow [Luan et al. 2021; Worchel et al. 2022] and use a Laplacian smoothness regularizer where the magnitude of the differential coordinates of each vertex is minimized.
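For canonical vertices given by $\{\mathbf{v}_i + \Delta\mathbf{v}_i \mid i = 1, \ldots, n\}$, the regularizer is defined as:

$$ \mathcal{L}_{\text{laplacian}} = \frac{1}{n} \sum_{i=1}^{n} \lVert \delta_i \rVert_2^2, $$

where $\delta_i = (L V)_i$ are the differential coordinates of the $i$-th vertex, $V$ is the matrix of canonical vertex positions, and $L \in \mathbb{R}^{n \times n}$ is the graph Laplacian of the mesh [Sorkine 2005].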
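Additionally, we apply a normal consistency term [Luan et al. 2021; Worchel et al. 2022] that enforces cosine similarity between neighboring face normals and is given by:

$$ \mathcal{L}_{\text{normal}} = \frac{1}{|\bar{F}|} \sum_{(i, j) \in \bar{F}} \left(1 - \mathbf{n}_i \cdot \mathbf{n}_j\right)^2, $$

where $\bar{F}$ is the set of triangle pairs that share an edge, and $\mathbf{n}_i$ is the normal of triangle $i$.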
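The two geometric regularizers can be sketched on a plain indexed triangle mesh as follows. This is a minimal illustration, not the training code: it assumes a uniform (neighbor-averaging) graph Laplacian, whereas other weightings are possible, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Vec3 = Sequence[float]
Face = Tuple[int, int, int]


def laplacian_loss(vertices: Sequence[Vec3], faces: Sequence[Face]) -> float:
    """Mean squared norm of uniform-Laplacian differential coordinates."""
    n = len(vertices)
    neighbors = [set() for _ in range(n)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    total = 0.0
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            continue
        mean = [sum(vertices[j][d] for j in nbrs) / len(nbrs) for d in range(3)]
        total += sum((vertices[i][d] - mean[d]) ** 2 for d in range(3))
    return total / n


def face_normal(vertices: Sequence[Vec3], face: Face) -> List[float]:
    """Unit normal of a triangle (zero vector if the triangle is degenerate)."""
    a, b, c = (vertices[k] for k in face)
    e1 = [b[d] - a[d] for d in range(3)]
    e2 = [c[d] - a[d] for d in range(3)]
    nrm = [e1[1] * e2[2] - e1[2] * e2[1],
           e1[2] * e2[0] - e1[0] * e2[2],
           e1[0] * e2[1] - e1[1] * e2[0]]
    length = math.sqrt(sum(x * x for x in nrm))
    return [x / length for x in nrm] if length > 0.0 else [0.0, 0.0, 0.0]


def normal_consistency_loss(vertices: Sequence[Vec3], faces: Sequence[Face]) -> float:
    """Mean of (1 - n_i . n_j)^2 over pairs of faces sharing an edge."""
    edge_to_faces: Dict[Tuple[int, int], List[int]] = {}
    for fi, (a, b, c) in enumerate(faces):
        for edge in ((a, b), (b, c), (c, a)):
            edge_to_faces.setdefault(tuple(sorted(edge)), []).append(fi)
    normals = [face_normal(vertices, f) for f in faces]
    pairs = [p for fs in edge_to_faces.values() for p in combinations(fs, 2)]
    if not pairs:
        return 0.0
    return sum(
        (1.0 - sum(normals[i][d] * normals[j][d] for d in range(3))) ** 2
        for i, j in pairs
    ) / len(pairs)
```

A flat quad made of two triangles yields zero normal-consistency loss, while folding one triangle by 90 degrees yields a loss of one.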

Material Regularization.
We apply a white light regularization over the diffuse shading as in [Munkberg et al. 2022]:

$$ \mathcal{L}_{\text{light}} = \frac{1}{3} \sum_{i=1}^{3} \left| \bar{c}_i - \frac{1}{3} \sum_{j=1}^{3} \bar{c}_j \right|, $$

where $\bar{c}_i$ is the per-channel average intensity. Additionally, we regularize the specular intensity $k_s$ by computing the z-score of our predicted specular intensities relative to a Gaussian distribution based on the MERL / ETH Skin Reflectance Database [Weyrich et al. 2006]. The dataset provides specular intensity measurements for 156 faces, with a mean value of 0.3753 and a standard deviation of 0.1655. We employ a similar strategy to regularize the roughness. However, since the statistics reported in [Weyrich et al. 2006] are computed for the Torrance-Sparrow model, we empirically set the mean to 0.5 and the standard deviation to 0.1 through visual evaluation. We provide an ablation study in Section 5.4.1 to support the choice of this hyper-parameter. Finally, we enforce a smoothness constraint for both albedo and roughness similar to [Munkberg et al. 2022], with an additional robust term [Barron 2019] that helps preserve high-frequency details. Specifically, for each canonical point $\mathbf{x}_c \in \mathbb{R}^3$ we compute a random displacement vector $\epsilon \in \mathbb{R}^3$ sampled from a Gaussian distribution, and compute the albedo and roughness for both $\mathbf{x}_c$ and $\mathbf{x}_c + \epsilon$. We apply an L1 loss between the two predictions, combined with the adaptive robust loss function of [Barron 2019], to enforce smoothness between neighboring points.
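The statistical regularizers above can be sketched as follows. The mean and standard deviation for the specular intensity are the values quoted in the text; the choice of a mean squared z-score as the penalty, and the function names, are assumptions for illustration, since the exact functional form is not reproduced here.

```python
from typing import Sequence, Tuple

SPEC_MEAN, SPEC_STD = 0.3753, 0.1655   # specular intensity statistics from the text
ROUGH_MEAN, ROUGH_STD = 0.5, 0.1       # empirical roughness prior from the text


def zscore_penalty(values: Sequence[float], mean: float, std: float) -> float:
    """Mean squared z-score of predictions under a Gaussian prior (assumed form)."""
    return sum(((v - mean) / std) ** 2 for v in values) / len(values)


def white_light_penalty(shading: Sequence[Tuple[float, float, float]]) -> float:
    """Penalize color shifts of the diffuse shading.

    Compares each per-channel average intensity with the mean of the three
    channel averages, so neutral (gray) lighting incurs no penalty.
    """
    n = len(shading)
    channel_mean = [sum(p[c] for p in shading) / n for c in range(3)]
    gray = sum(channel_mean) / 3.0
    return sum(abs(m - gray) for m in channel_mean) / 3.0
```

For example, `zscore_penalty(k_s_values, SPEC_MEAN, SPEC_STD)` is zero when every predicted specular intensity equals the database mean and grows quadratically as predictions drift away from it.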

Training Details
We train FLARE using differentiable rendering to compare the predicted images with ground-truth frames. Given the current canonical mesh $\{\mathbf{v}_i + \Delta\mathbf{v}_i \mid i = 1, \ldots, n\}$, we first estimate expression blendshapes $E$, pose correctives $P$ and blend skinning weights $W$ through a forward pass of the deformation network, $\mathcal{D}(\mathbf{v}_i + \Delta\mathbf{v}_i) \rightarrow (E, P, W)$.
With the pose and expression parameters $\theta$, $\psi$, we obtain deformed vertex positions $\tilde{\mathbf{v}}$ using the FLAME function in Equation 1, $\tilde{\mathbf{v}} = \mathrm{FLAME}(\mathbf{v}_i + \Delta\mathbf{v}_i, E, P, W, \theta, \psi)$. Following a deferred shading pipeline, the deformed vertices are rasterized to obtain triangle indices and barycentric coordinates per pixel. We then interpolate and obtain the corresponding canonical point locations $\mathbf{x}_c$, deformed point locations $\mathbf{x}_d$ (used to compute the view direction $\omega_o$), and deformed normals $\mathbf{n}_d$ for each pixel. Next, we compute material properties by querying the material network $\mathcal{M}(\mathbf{x}_c)$, which returns the albedo, roughness, and specular intensity (Equation 7). Finally, we query the lighting MLP $\mathcal{L}$ with the reflection vector and the roughness, using the deformed normals $\mathbf{n}_d$, to obtain the pre-filtered environment term of Equation 8 and the diffuse shading of Equation 5. The final color for the pixel is computed using Equation 13. Our framework is implemented in PyTorch and trained using a single NVIDIA A100 GPU with 80GB of memory and a batch size of 4 images per iteration.

Two-stage training.
To learn high-frequency facial features and to enable fast rendering, we incorporate hash-grid encoding [Müller et al. 2022] for the material MLP, $\mathcal{M}$. However, we found that this approach overfits to colors quickly, learning texture much faster than the geometry, resulting in smoother shapes of lower quality.
To address this, we design a two-stage training approach. During the first stage, we equip the material MLP with positional encoding [Mildenhall et al. 2020] and jointly optimize the geometry $\Delta\mathbf{v}_i$, deformation $\mathcal{D}$, material $\mathcal{M}$, and lighting $\mathcal{L}$. The first stage can achieve detailed geometry but often learns blurry texture. During the second stage of training, we leverage the pre-trained mesh geometry and deformation from the previous stage and re-optimize both material and lighting MLPs, where $\mathcal{M}$ is now equipped with high-resolution hash-grid encoding [Müller et al. 2022]. With the proposed two-stage training, our method can achieve both high-fidelity geometry and realistic texture (see Fig. 11).
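For reference, the frequency-based positional encoding of [Mildenhall et al. 2020] used in the first stage can be sketched as below. The number of frequency bands is a free parameter here; the value used by FLARE is not stated in this section, so `num_freqs` is an illustrative assumption, as is the ordering of the output features.

```python
import math
from typing import List, Sequence


def positional_encoding(p: Sequence[float], num_freqs: int) -> List[float]:
    """Map each coordinate to sin/cos features at frequencies 2^k * pi.

    For a 3D point the output has 3 * 2 * num_freqs entries, ordered per
    frequency band as (sin x, sin y, sin z, cos x, cos y, cos z).
    """
    features: List[float] = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        features.extend(math.sin(freq * x) for x in p)
        features.extend(math.cos(freq * x) for x in p)
    return features
```

Because the highest frequency is bounded by `num_freqs`, this encoding limits how sharp the predicted texture can become, in contrast to a high-resolution hash grid.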

EVALUATION
In this section, we present qualitative and quantitative results of FLARE. First, we show qualitative examples of the individual components, including geometry, albedo, roughness, diffuse and specular shading, as well as relit images (Sec. 5.2). Next, we compare our results with the state-of-the-art (SOTA) baselines in terms of image quality and albedo, as well as geometric accuracy (Sec. 5.3). Finally, we conduct an ablation study to evaluate our design choices in Sec. 5.4. All the results in this section are generated using frames from the test set. For each test frame, we obtain FLAME parameters (pose and expression) from the pre-processing step and animate the personalized canonical representation (geometry) estimated by each baseline method. These animated renderings are relit with novel environment maps. We include additional results in the supplementary video.

Dataset
We use 2 subjects released by [Zheng et al. 2022], 2 by [Zheng et al. 2023] (where 1 subject is captured by a webcam), and 1 by [Zielonka et al. 2023]. We additionally capture 15 subjects with a smartphone to demonstrate the robustness of FLARE for diverse skin tones and shapes. We follow the protocol of [Zheng et al. 2022, 2023] for the capture and obtain an average of 3000-4000 frames for training and around 1000-3000 frames for testing. These new subjects gave prior informed written consent for their data to be used for academic research purposes. In total, we conduct the evaluations for 20 subjects. To measure geometric accuracy we use a dataset

Figure panel labels: Input | Albedo | Roughness | Spec. Intensity

Qualitative Evaluation: Intrinsic Materials and Relighting
The intrinsic material properties (albedo, roughness, and specular intensity) and relit faces are visualized in Figures 4 and 5. The rendered albedo images in Figure 4 illustrate that FLARE is capable of removing evident shadows and specular highlights in the face region; e.g., see the subject in the second row. The influence of the predicted roughness values can be visualized in the relit images: the teeth of the subject in the first row and the hair of the subject in the second row are correctly predicted as shiny surfaces (lower roughness values), which results in realistic reflections when relit with new environment lighting. Finally, the robustness of FLARE is demonstrated across different skin tones, skin textures, hair types, hair styles, facial hair, and even accessories such as a cap. Despite having a single monocular video as input, FLARE computes geometries and materials that enable realistic and plausible relighting.

Comparison with PointAvatar.
To the best of our knowledge, PointAvatar [Zheng et al. 2023] is the only other neural avatar method trained from a monocular video that disentangles diffuse shading from albedo. Thus, we qualitatively evaluate the albedo and shading of FLARE in Figure 5 by comparing it with PointAvatar. We relight the renderings of PointAvatar using a Lambertian shading model, where we use the predicted surface normals to obtain diffuse shading. From Figure 5, we observe that the albedo estimated by PointAvatar is biased towards light skin tones and fails to capture the skin color of the subjects. In comparison, the albedo estimated by FLARE resembles the color of the subject, and much of the shading is removed. Relighting FLARE's estimated materials results in natural-looking images. This is due, in part, to the estimated specular highlights, which are absent in PointAvatar's formulation. The specular highlights are visible in the 5th row of Figure 5, on the left and right cheeks of the first subject from the left, and in the rightmost subject, who has smooth and shiny hair that reflects the environment's light.

Comparisons with State-of-the-Art
In this section, we compare the results of FLARE with the following state-of-the-art (SOTA) methods for neural head avatar estimation from videos: (1) IMavatar [Zheng et al. 2022] and (2) PointAvatar [Zheng et al. 2023], which use a deformation module similar to ours, with a signed distance function (SDF) and point cloud representation for geometry, respectively; (3) NHA [Grassal et al. 2022], which employs a mesh representation along with an alternating training strategy between geometry and color; and (4) INSTA [Zielonka et al. 2023], which learns an animatable avatar using a NeRF [Mildenhall et al. 2020] representation and leverages the InstantNGP framework [Müller et al. 2022] for faster optimization. NHA and PointAvatar employ test-time optimization of the expression and pose parameters due to noisy pre-processing estimates. Hence, we report the optimized quantitative evaluations for both NHA and PointAvatar to retain their best performance. However, it must be noted that the reported results of FLARE, IMavatar, and INSTA are not optimized at test time.

Image quality.
First, we compare FLARE with SOTA methods with respect to image quality. We use the same FLAME parameters sampled from the test set on all baselines and measure the accuracy against the ground-truth frames by using mean absolute error (L1 distance), PSNR, structural similarity index measure (SSIM) [...] 2. We make the following observations: [...] 6, 4th row and 5th column, we can observe the artifacts that occur at extreme poses, producing a salt-and-pepper-like noise. In comparison, our mesh representation inherently solves the sparsity issue with approximately 11k vertices (we evaluate mesh resolution in Sec. 5.4.4). (3) FLARE is able to capture high-frequency texture details better than IMavatar and this is evidenced qualitatively as well as quantitatively, where IMavatar has the weakest LPIPS score. (4) INSTA can converge quickly and produces visually convincing expressions and high-quality texture with forward-facing poses. However, at extreme neck poses we observe noisy texture, possibly due to the volumetric representation that fails to extrapolate well. (5) NHA also employs a mesh-based representation to learn the geometry. However, the predicted mesh is unable to capture high-fidelity details as well as the baselines and produces an over-smoothed representation. We believe that this is a result of their training strategy in which the geometry is primarily supervised with pseudo-normals from [Abrevaya et al. 2020], unlike the rest of the baselines, which learn geometry exclusively via inverse rendering. Instead, we carefully consider how fast the texture network is trained compared to the geometry network, and we observe that this was helpful in achieving high-fidelity geometry. We evaluate our training strategy further in Sec. [...] relatively close scores quantitatively. However, IMavatar is approximately 200 times slower to train than FLARE, mainly due to the root-finding step during ray tracing between deformed and canonical points. Moreover, it uses an SDF representation that requires a post-processing step to obtain a mesh, while FLARE can be trained in approximately 15 minutes and directly produces a canonical mesh that can be animated. INSTA, on the other hand, exhibits noisy shapes that can be observed in both Figure 7 and Figure 6, and the normals of NHA do not completely capture the identity. Note that the synthetic heads have smooth geometry and, consequently, most methods do well with only small numerical differences between methods.
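The two pixel-wise metrics named in the image-quality comparison, mean absolute error and PSNR, follow standard definitions and can be sketched as below for flattened images with intensities in $[0, 1]$. SSIM and LPIPS are omitted because they require windowed statistics and a pre-trained network, respectively. The function names and the `max_value` default are illustrative assumptions.

```python
import math
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """L1 distance averaged over all pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A uniform error of 0.1 on unit-range intensities corresponds to a PSNR of about 20 dB.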

Training Time.
Figure 3 plots the training time of each method against image quality (LPIPS) and geometric quality (cosine similarity).The plot is measured over the same data as Tables 2 and 3. We find that FLARE can be trained almost as quickly as INSTA but with better performance in terms of image quality and state-of-the-art results in terms of geometry.

Ablation Study
5.4.1 Loss Functions. We evaluate the contribution of the terms in the loss function that are not adopted by prior avatar methods but are crucial in our setting.
Specular Intensity Regularization. The specular intensity controls the intensity of the specular highlights. In Figure 8 we qualitatively evaluate the effectiveness of using the regularizer and show relighting results for one subject with and without the specular intensity regularization. We observe the occurrence of unnaturally sharp highlights that have high intensity around the subject's lower lip and cheek regions when specular intensity is left unconstrained. Constraining it with the regularizer makes the non-Lambertian effects more subtle and natural.
Roughness Regularization. To regularize the roughness we employ a statistical approach similar to specular intensity. However, we know of no suitable database of statistical values for roughness that can be used to regularize the appearance model. Hence, we employ an empirical mean with a fixed standard deviation of 0.1, and evaluate the results of using different mean values in Table 4. Additionally, we evaluate the results of not using this regularization and show qualitative results in Figure 9. We observe a similar behavior as with specular intensity when roughness is left unconstrained. The final numerical prediction of each subject is not affected by a large margin since the network learns to compensate for wrong predictions with other estimations. However, Figure 9 reveals that the regularizer helps produce visually realistic renderings.
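One plausible way to read the statistical regularization described above is as a Gaussian prior on the predicted material value. The sketch below penalizes the mean squared z-score under a prior with a chosen mean and a standard deviation of 0.1. The exact functional form is not given in this section, so the formula, the function name `gaussian_prior_penalty`, and the example mean of 0.5 are all assumptions.

```python
import numpy as np

def gaussian_prior_penalty(values, mean, std=0.1):
    """Mean squared z-score of predicted values under N(mean, std^2).

    Assumed form of a statistical regularizer: zero when every prediction
    equals the prior mean, growing quadratically with the deviation.
    """
    z = (np.asarray(values, dtype=np.float64) - mean) / std
    return float(np.mean(z ** 2))

# Hypothetical per-pixel roughness predictions and an illustrative prior mean.
roughness = np.array([0.5, 0.6, 0.4])
penalty = gaussian_prior_penalty(roughness, mean=0.5)
```

With a fixed standard deviation, the prior mean is the only free parameter, which matches the kind of ablation over mean values reported in Table 4.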
5.4.2 Standard PBR vs. Lighting MLP. We compare our proposed approach with a method that estimates a standard texture-based environment map for training, which will be referred to as "Standard PBR". For this experiment we use the same hyper-parameters, loss functions, training protocol, and geometric representation of our method, and replace the pre-filtered light MLP L with a learnable texture of the environment map, where the integral is solved with the approach proposed by [Munkberg et al. 2022]. In Figure 10 we visualize an example of geometry and relighting obtained with both methods. We observe that the standard PBR results in noisy texture and geometry predictions that are evident after relighting the subject. This is probably due to the redundant calculations of the regions in the environment map that are never observed in our monocular setting, creating instability in the optimization process. Further, we can observe that the input image in Figure 10 (bottom left) is captured such that the main light source is from the right of the subject. However, the environment maps have the main light source coming from the left. Here, PBR exhibits shadows in the texture that are retained from the original input data; for instance, see the shadowing on the nose. This is also observed in the estimated albedo, and it is not prominent in our results.
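To make the contrast with a texture-based environment map concrete, here is a toy sketch of a roughness-conditioned lighting MLP: a direction and a roughness value go in, an RGB radiance comes out, so no environment texture is ever stored. Everything here is an assumption for illustration: the class `TinyLightMLP`, its random untrained weights, the layer sizes, the choice to query the diffuse term with the normal at roughness 1, and the simplified shading sum, which omits the BRDF lookup term of the full split-sum approximation.

```python
import numpy as np

def reflect(view_dir, normal):
    """Mirror the unit surface-to-camera direction about the unit normal."""
    n_dot_v = np.sum(view_dir * normal, axis=-1, keepdims=True)
    return 2.0 * n_dot_v * normal - view_dir

class TinyLightMLP:
    """Toy stand-in for a prefiltered-light network (random, untrained weights)."""

    def __init__(self, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.5, (4, hidden))  # input: 3-D direction + roughness
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.5, (hidden, 3))  # output: RGB radiance
        self.b2 = np.zeros(3)

    def __call__(self, direction, roughness):
        x = np.concatenate([direction, roughness], axis=-1)
        h = np.maximum(x @ self.w1 + self.b1, 0.0)        # ReLU
        return np.logaddexp(0.0, h @ self.w2 + self.b2)   # softplus keeps radiance positive

def shade(albedo, roughness, spec_intensity, normal, view_dir, light):
    """Simplified shading: albedo * diffuse light + specular intensity * specular light."""
    diffuse = light(normal, np.ones_like(roughness))        # assumption: roughest query
    specular = light(reflect(view_dir, normal), roughness)  # roughness-modulated query
    return albedo * diffuse + spec_intensity * specular
```

Because the network is only ever queried at directions that actually occur in the video, it never has to represent unobserved regions of the environment, which is the intuition given above for why it optimizes more stably.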

Two-stage training.
Through the course of our experiments, we noticed that it is necessary to control the speed at which the texture is learned in order to obtain both good geometry and albedo.
In particular, using a hash-grid positional encoding [Müller et al. 2022] results in better image quality, and the method converges very fast. However, this results in noisier geometries since there is not enough gradient signal coming from the color supervision. This behavior can be observed in the first column of Figure 11, where a high-quality rendered image corresponds to a relatively noisy geometry. On the other hand, using a standard positional encoding [Mildenhall et al. 2020] (second column in Figure 11) converges slower and leads to blurry textures, but learns geometric details from the observed images. Our two-stage training approach achieves the best of both options, as shown in the last column of Figure 11.
5.4.4 Mesh Upsampling. The FLAME mesh contains 5023 vertices that model the face and neck region, without hair or shoulders. In our setting, we learn the geometry of the subjects including diverse hair types and hairstyles, facial hair, head accessories, and part of the shoulder. However, optimizing with only 5023 vertices results in a smooth coarse geometry, as illustrated in Figure 12. The output mesh appears smooth as the vertices around the shoulder and hair region are stretched out to form triangles occupying a large area. To capture the high-fidelity geometric details of the subject, we increase the resolution by upsampling the mesh [Botsch and Kobbelt 2004] to around 11k vertices. This improves the quality in the hair, neck, and shoulder regions. Note that our resolution is lower than the roughly 16K vertices used by NHA, yet our geometric quality is higher.
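The standard positional encoding referred to in the two-stage training discussion maps each input coordinate to sines and cosines at geometrically increasing frequencies. The sketch below is a minimal NumPy version; the function name, the default of 6 frequency bands, and the grouping of all sines before all cosines are illustrative choices (the ordering differs from some implementations only by a permutation of features). The hash-grid encoding is not shown because it relies on a learned multi-resolution table.

```python
import numpy as np

def positional_encoding(x, num_freqs=6):
    """Frequency encoding of coordinates.

    x: array of shape (..., D). Returns shape (..., D * 2 * num_freqs), where
    each coordinate is expanded to sin(2^k * pi * x) and cos(2^k * pi * x)
    for k = 0 .. num_freqs - 1.
    """
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi             # (L,)
    angles = x[..., None] * freqs                              # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*x.shape[:-1], -1)

# Encode five hypothetical canonical 3-D points.
points = np.zeros((5, 3))
features = positional_encoding(points)
```

A small number of frequency bands limits how quickly fine texture can be fit, which is consistent with the slower, blurrier but geometry-friendly behavior described for this encoding.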

Fig. 13. Limitations. Modeling the mouth interior and eyes is challenging due to their complex material properties, variation in appearance (e.g. subjects 1, 2, and 3 have different-sized teeth), and the fact that we do not model eye blinks (e.g. subject 3). Capturing sharp specular highlights is also challenging due to the approximations made by our lighting model (subjects 3 and 4). (Panel labels: FLARE, GT.)

LIMITATIONS AND FUTURE WORK
FLARE can be trained in around 15 minutes and produces competitive results compared to methods that generate high-fidelity geometry at the expense of longer training times (on the order of days). However, there are still limitations, as shown in Figure 13.
First, the quality of the eyes and mouth interior needs improvement. These are challenging areas due to their complex material properties, and most neural avatar methods currently struggle with modeling these. For the mouth interior, an additional challenge comes from the fact that the teeth are exposed to varying degrees during training and this varies significantly between subjects. When a person does not smile with their teeth or does not articulate sufficiently, then the model does not have enough information to correctly reproduce the tooth color and geometry. Further, the FLAME mesh does not have vertices in the mouth interior and thus, during rasterization, there are no vertices projected onto the image of the mouth, resulting in no gradient being propagated there. Our remeshing step partly addresses this problem and, for some subjects, there are vertices formed around the teeth. However, modeling the teeth remains a challenging task due to the constant motion of the lips and limited supervision. Similarly, the eye area exhibits challenging photometric properties that are not always captured by our method. In addition, our pre-processing step does not track eye blinking, resulting in inevitable errors during optimization that yield a noisy geometry around the eyes. Future work should develop techniques that can enhance the estimation of the mouth and eye area, in both photometric and geometric respects.
Second, capturing harsh neck shadows, self-shadows, and sharp specular highlights is difficult as demonstrated in Figure 13. We can remove shadows cast on the face region as the subject moves their head in various directions. However, the shoulder and neck areas remain mostly static and shadows are baked in. Additionally, although non-Lambertian reflections that look plausible can be captured by our method during relighting due to the estimated materials, we miss reproducing the sharp specular highlights of the ground truth. This is due to the several approximations that we make to model the pre-filtering of the environment and to simplify the integral of the rendering equation.
Finally, our method does not model more subtle skin properties such as sub-surface scattering, or time-dependent appearance changes. We hypothesize that this could enhance realism and, consequently, the estimated geometry. We believe this is an interesting direction to pursue in the future.

ETHICS
The goal of FLARE is to enable fast, subject-specific, avatar creation that can be used to generate novel expressions and to place the avatars in different scenes. This capability, however, opens the door to potential misuse, where new malicious content of the training subject can be generated without their consent. Although the quality of FLARE still exhibits identifiable artifacts signaling its AI origin, the rapid progression of the field suggests these cues may diminish over time. Addressing this remains an important technical and legal challenge.

CONCLUSION
In this work we presented FLARE, a new method for building animatable and relightable head avatars from monocular video in 15 minutes. Our approach directly produces a mesh representation that can be efficiently rendered and animated, along with material parameters that allow the avatars to be placed in scenes under novel illumination. This is achieved by combining traditional computer graphics methods for rendering with neural networks that approximate some of the components. More specifically, we optimize a canonical mesh geometry while approximating the expression deformations, albedo, roughness and specular intensity values using coordinate-based MLPs. Further, we avoid explicitly computing an environment map from a narrow field of view by approximating the pre-filtered environment map in the split-sum formulation with a neural network. Finally, we propose a two-stage approach designed to control the pace at which geometry and texture are learned relative to one another. Our experimental results show that we can obtain mesh avatars of high geometric and image fidelity. Once learned, the avatars can be readily inserted and rendered in arbitrary scenes using standard graphics pipelines, enabling downstream applications in gaming, film production and telepresence.

A IMPLEMENTATION DETAILS
FLAME Mesh and Deformation Network. We manually add mesh faces to the FLAME template mesh between the upper and lower lips to close the mouth cavity, similar to NHA [Grassal et al. 2022]. Additionally, we also simplify the tessellated eye region of the FLAME template following [Zielonka et al. 2023]. Similar to PointAvatar [Zheng et al. 2023], during training, we map the optimized mesh vertices to a canonical pose with jaw open and a neutral expression and then proceed to perform LBS to deform the mesh. This additional step encourages the canonical mesh to have an open-mouth expression, which facilitates the learning of mouth movements. We train the deformation MLP D only during the first stage with a learning rate of 10^-3 and use the Adam optimizer [Kingma and Ba 2015]. We adopt the network architecture of PointAvatar which is similar to [Zheng et al. 2022], except we do not predict additional vertex displacements (only skinning weights, expression and pose blendshapes). During the second stage, we freeze the deformation network and use the weights from the first stage.
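The deformation step described above (blendshape offsets followed by LBS) can be sketched as follows. This is a simplified illustration rather than the paper's implementation: the function name `deform`, the array shapes, and the use of precomputed 4x4 joint transforms are assumptions, and pose correctives are left out for brevity. In the method itself, the blendshapes and skinning weights come from the deformation network D rather than from fixed arrays.

```python
import numpy as np

def deform(canon_verts, expr_basis, expr_coeffs, skin_weights, joint_transforms):
    """Apply expression blendshapes, then linear blend skinning.

    canon_verts:      (V, 3) canonical vertex positions
    expr_basis:       (V, 3, K) per-vertex expression blendshapes
    expr_coeffs:      (K,) expression coefficients
    skin_weights:     (V, J) skinning weights, each row summing to 1
    joint_transforms: (J, 4, 4) rigid transforms of the joints
    """
    # Blendshape offsets: linear combination of the basis per vertex.
    shaped = canon_verts + expr_basis @ expr_coeffs                      # (V, 3)
    homo = np.concatenate([shaped, np.ones((len(shaped), 1))], axis=1)   # (V, 4)
    # Blend the joint transforms per vertex, then apply them.
    blended = np.einsum("vj,jab->vab", skin_weights, joint_transforms)   # (V, 4, 4)
    posed = np.einsum("vab,vb->va", blended, homo)                       # (V, 4)
    return posed[:, :3]
```

Because every operation is a linear combination or a matrix product, the whole step is differentiable with respect to the vertices, blendshapes, and weights, which is what allows them to be optimized through rendering.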
Optimization of Mesh Vertices. The canonical mesh is upsampled once during training, resulting in a final mesh of approximately 11K vertices. During the first stage of training, when the number of vertices increases, we reduce the learning rate of the vertex offsets from 10^-3 to 0.75 × 10^-3 and increase the weight of the Laplacian and normal regularizer by 4 times following [Worchel et al. 2022]. This helps in learning a smoother mesh and prevents the vertices from diverging after the upsampling step. During the second stage, we set the learning rate of the vertex offsets to a very small value (10^-5) and initialize the training with the canonical mesh from the previous stage.
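The hyper-parameter schedule described in this appendix can be summarized as a small lookup. The numeric values below come from the text; the names `StageConfig` and `training_config`, the convention that a deformation learning rate of 0.0 means "frozen", and the choice to keep the 4x regularizer weight in the second stage are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    vertex_lr: float   # learning rate of the vertex offsets
    deform_lr: float   # learning rate of the deformation MLP (0.0 = frozen)
    reg_scale: float   # multiplier on the Laplacian and normal regularizer weights

def training_config(stage: int, upsampled: bool) -> StageConfig:
    """Return the hyper-parameters for a training stage.

    stage:     1 or 2
    upsampled: whether the canonical mesh has already been upsampled
    """
    if stage == 1:
        if upsampled:
            # After upsampling: lower vertex learning rate, 4x stronger regularizers.
            return StageConfig(vertex_lr=0.75e-3, deform_lr=1e-3, reg_scale=4.0)
        return StageConfig(vertex_lr=1e-3, deform_lr=1e-3, reg_scale=1.0)
    if stage == 2:
        # Deformation network frozen; vertices move only slightly.
        # Assumption: the regularizer scale is carried over from stage 1.
        return StageConfig(vertex_lr=1e-5, deform_lr=0.0, reg_scale=4.0)
    raise ValueError(f"unknown stage: {stage}")
```

Keeping the schedule in one function makes the two changes triggered by upsampling (smaller steps, stronger smoothing) explicit and easy to vary in an ablation.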

Fig. 3. Training time vs image quality (a) and geometric quality (b) for SOTA methods. Our method is nearly as quick as INSTA while performing on-par or better than competitors. Lower is better for (a) and higher is better for (b).

Fig. 4. Qualitative results. The first three rows illustrate our intrinsic material estimates (albedo, roughness, and specular intensity) for three different subjects. The next three rows show the above subjects in the same pose and expression under 4 different environment maps.

Fig. 5. Qualitative comparison with PointAvatar. The first two rows show albedo and diffuse shading estimated by FLARE compared with PointAvatar [Zheng et al. 2023]. The next row shows the roughness and specular intensity (Spec. Intensity) estimated by FLARE for the same subjects as above. The bottom two rows contain relighting results of FLARE and PointAvatar for the same subjects, animated with test poses and expressions.

Fig. 6. Qualitative comparison between FLARE and state-of-the-art methods. The canonical representation of each baseline method is animated using test poses and expressions. Odd columns: generated images. Even columns: generated normals.

Fig. 8. Ablation of the specular intensity regularizer. Qualitative comparison of relighting results with and without the specular intensity regularizer. Results indicate that it is necessary to constrain the specular intensity statistically to avoid unrealistically sharp highlights.

Fig. 9. Ablation of the roughness regularizer. Qualitative comparison of relighting results with and without the roughness regularizer. Results indicate that it is necessary to constrain the roughness to ensure non-Lambertian reflections on the skin look plausible.

Fig. 10. Ablation study. Comparison against learning a full environment map ("standard PBR"), as in [Munkberg et al. 2022]. Using this representation typically results in noisier geometry and color. From left to right: predicted geometry, predicted albedo, relighting under two different environment maps. The bottom row shows the input test image, and the two environment maps.

Table 2. Quantitative comparisons in terms of image quality on real data. The evaluations are performed only on the face region. Red color indicates the best value, yellow second best, and light yellow is the third best.

Table 3. Quantitative comparisons in terms of geometric accuracy on a synthetic dataset. Showing cosine similarity compared to ground-truth normals (higher is better). Red color indicates the highest value, yellow second highest and light yellow is third.

Table 4. Ablation of the roughness regularizer. Influence of the regularizer on image quality when using different mean roughness values.