Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization

There is a growing demand for the accessible creation of high-quality 3D avatars that are animatable and customizable. Although 3D morphable models provide intuitive control for editing and animation, and robustness for single-view face reconstruction, they cannot easily capture geometric and appearance details. Methods based on neural implicit representations, such as signed distance functions (SDF) or neural radiance fields, approach photo-realism, but are difficult to animate and do not generalize well to unseen data. To tackle this problem, we propose a novel method for constructing implicit 3D morphable face models that are both generalizable and intuitive for editing. Trained from a collection of high-quality 3D scans, our face model is parameterized by geometry, expression, and texture latent codes with a learned SDF and explicit UV texture parameterization. Once trained, we can reconstruct an avatar from a single in-the-wild image by leveraging the learned prior to project the image into the latent space of our model. Our implicit morphable face models can be used to render an avatar from novel views, animate facial expressions by modifying expression codes, and edit textures by directly painting on the learned UV-texture maps. We demonstrate quantitatively and qualitatively that our method improves upon photo-realism, geometry, and expression accuracy compared to state-of-the-art methods.

Fig. 1. Given a single input image, our method reconstructs a high-quality editable 3D digital avatar (columns 2 and 3) by combining implicit geometry representations with explicit texture maps. The proposed approach naturally supports novel view synthesis from large pose shifts, an expressive and non-linear facial animation space (columns 4 through 6), direct user access to texture map editing (column 7), and 3D asset extraction for further downstream applications such as relighting (column 8). Original image courtesy of COD Newsroom/flickr (top) and Malcolm Slaney/flickr (bottom).

INTRODUCTION
Personalized avatar creation, the ability to map one's facial features to a 3D virtual replica that can be animated, customized, and rendered, is an emerging technology with great promise for cinema, the metaverse, and telepresence. Advances in this area may lead to digital twins with greater verisimilitude in detail and in animation that are more easily integrated into downstream applications and pipelines. Single-shot personalized avatar creation enables reconstructing face avatars from individual RGB images with greater convenience and flexibility than methods that require more specialized capture setups or procedures.
Traditional approaches to animatable 3D avatar creation are often based on 3D Morphable Models (3DMM) [Blanz and Vetter 1999], which disentangle shape and appearance variation into a low-dimensional face representation. Building on these, more recent approaches often leverage either explicit (textured) template meshes [Daněček et al. 2022; Feng et al. 2021; Grassal et al. 2022; Khakhulin et al. 2022; Li et al. 2017; Tran and Liu 2019] or neural implicit representations [Mildenhall et al. 2021; Park et al. 2019; Sitzmann et al. 2019]. Template-based approaches enable easy asset extraction and intuitive editing, but are often unable to capture high-quality geometry and textures. Emerging implicit face models can achieve greater realism by modeling more complex geometric features such as hair [Cao et al. 2022b; Giebenhain et al. 2022; Zheng et al. 2022a]. However, implicit face representations often compromise on interpretability and are less intuitive to control; the entangled latent spaces learned by these highly parameterized models are difficult to edit.
Our approach aims to combine the interpretability and editability advantages of template-based 3DMMs with the quality and topological flexibility of implicit 3D representations. Crucially, we decouple appearance and geometry into two branches of our network architecture. By incorporating a UV parameterization network to learn continuous and consistent texture maps, we can export avatars as textured meshes to support downstream applications such as texture map editing and relighting in a traditional graphics pipeline (see Figure 1). On the other hand, by representing geometry with an implicit signed distance field (SDF), our facial shape is less limited by resolution and topology compared to mesh-based approaches.
We show that our proposed hybrid representation effectively captures the geometry, appearance, and expression space of faces. We demonstrate that single-shot in-the-wild portrait images can be effectively mapped to avatars based on our proposed representation, and that these avatars improve upon the previous state-of-the-art in photo-realism, geometry, and monocular expression transfer. Moreover, we demonstrate direct texture editing and disentangled editing of attributes such as facial geometry and appearance.
In summary, the contributions of our work include:
• We propose a hybrid morphable face model combining the high-quality geometry and flexible topology of implicit representations with the editability of explicit UV texture maps.
• We present a single-shot inversion framework to map a single in-the-wild RGB image to our implicit 3D morphable model representation. The inverted avatar supports novel view rendering, non-linear facial reanimation, disentangled shape and appearance control, direct texture map editing, and textured mesh extraction for downstream applications.
• We demonstrate state-of-the-art reconstruction accuracy for photo-realistic rendering, geometry, and expression accuracy in the single-view reconstruction setting.
The seminal work by Blanz and Vetter proposed a linear 3D Morphable Model (3DMM) [Blanz and Vetter 1999] that models facial shape and textures on a template mesh using linear subspaces computed by principal component analysis (PCA) from 200 facial scans. This low-dimensional facial shape and texture space makes 3DMMs suitable for robustly capturing facial animation as well as reconstructing 3D faces in monocular settings. To reconstruct shape, texture, and lighting from a photo, previous work employed continuous optimization using constraints such as facial landmarks and pixel colors [Cao et al. 2014, 2016; Garrido et al. 2013, 2016; Ichim et al. 2015; Li et al. 2017; Romdhani and Vetter 2005; Shi et al. 2014; Thies et al. 2016] and, more recently, deep learning-based inference [B R et al. 2021; Daněček et al. 2022; Deng et al. 2019b; Dib et al. 2021a,b; Dou et al. 2017; Feng et al. 2021; Genova et al. 2018; Luo et al. 2021; Tewari et al. 2019; Tewari et al. 2017; Tuan Tran et al. 2017; Wu et al. 2019]. While approaches relying on 3DMMs tend to be robust, they are ineffective for reconstructing high-fidelity geometry and texture details due to the linearity and low dimensionality of the model. Various other methods extended 3DMMs to capture non-linear shapes [Chandran et al. 2020; Li et al. 2020; Tewari et al. 2018; Tran et al. 2019; Tran and Liu 2018, 2019; Wang et al. 2022b], photo-realistic appearance using neural rendering or optimization [Gecer et al. 2019; Nagano et al. 2018; Saito et al. 2017; Thies et al. 2019], or reflectance and geometry details for relightable avatar generation [Chen et al. 2019; Huynh et al. 2018; Lattas et al. 2020; Yamaguchi et al. 2018]. Recent approaches predict geometry offsets over the template mesh to reconstruct non-facial regions such as hair [Grassal et al. 2022; Khakhulin et al. 2022]. We refer the reader to Egger et al. [2020] for an in-depth survey of 3DMM techniques and Tewari et al. [2022] for a report of recent advancements in neural rendering.
Since mesh-based 3DMMs represent geometry with a shared template mesh, their fixed topology limits the ability to scale the model to capture complex geometry such as hair or fine-scale details. Additionally, their ability to synthesize photo-realistic facial textures may be limited by the resolution of the template mesh and discrete texture map. By parameterizing geometry with a signed distance function and color with a continuous texture map, our method is able to avoid such resolution issues and scale more efficiently with model capacity while retaining 3DMM-like intuitive parameters to individually control geometry and textures. Our consistent texture parameterization enables not only direct texture editing in UV space, but also semantic correspondence between our face model and an input image via facial landmarks, which can be leveraged to improve single-shot reconstruction quality.

Implicit Representations for Modeling and Rendering
While single-shot 3D reconstruction methods have explored various explicit 3D representations such as voxels [Girdhar et al. 2016; Tulsiani et al. 2017; Wu et al. 2018; Yan et al. 2016; Yang et al. 2018; Zhu et al. 2017], point clouds [Fan et al. 2017], meshes [Xu et al. 2019], geometric primitives [Niu et al. 2018; Zou et al. 2017], and depth maps [Wu et al. 2020], implicit representations have recently been leveraged to achieve higher resolution reconstruction using occupancy or signed distance fields (SDFs) [Chen and Zhang 2019; Mescheder et al. 2019; Xu et al. 2019]. Implicit representations such as neural radiance fields (NeRFs) [Mildenhall et al. 2021] and signed distance fields (SDFs) [Park et al. 2019] have demonstrated high reconstruction quality for 3D shapes and volumetric scenes. PIFu [Saito et al. 2019] and follow-up works [Cao et al. 2022a; Saito et al. 2020] use implicit fields to model human bodies and clothing. AtlasNet [Groueix et al. 2018] demonstrated 3D shape generation by predicting a set of parametric surface elements given an input image or point cloud. NeuTex [Xiang et al. 2021] replaces the radiance prediction of NeRFs with a learned UV texture parameterization conditioned on lighting direction. Although our method also employs a UV cycle consistency loss, we 1) operate in an SDF setting and condition our parameterization on geometry and expression latent codes to generalize across samples rather than overfit to a single scene, 2) employ sparse facial landmark constraints to facilitate learning a semantically intuitive and consistent parameterization, and 3) explicitly leverage 2D to 3D facial landmark correspondences enabled by the learned consistent parameterization during single-image reconstruction. Implicit representations have also given rise to higher quality 3D generative models [Chan et al. 2022; Or-El et al. 2022; Xue et al. 2022], and follow-up work has studied inverting an image into the latent space of a pre-trained 3D GAN [Ko et al. 2023; Lin et al. 2022; Roich et al. 2022] for single-view 3D reconstruction. However, without careful optimization and additional priors [Xie et al. 2022; Yin et al. 2022], this 3D GAN inversion tends to be less robust due to unknown camera poses [Ko et al. 2023] and the multi-view nature of NeRF training, which is at odds with the monocular setting. On the other hand, the compact face representation of our model provides robust initialization in the single-shot reconstruction setting.

Implicit Face Models
Compared to traditional mesh-based 3DMMs for face modeling, implicit representations naturally offer flexible topology and non-linear expression animation through latent code conditioning. While some approaches learn to reconstruct an implicit 3DMM from an input 3D face scan [Alldieck et al. 2021; Cao et al. 2022b; Giebenhain et al. 2022; Yenamandra et al. 2021; Zanfir et al. 2022; Zheng et al. 2022b], other works have explored modeling an implicit face model from RGB videos [Grassal et al. 2022; Ma et al. 2022; Zheng et al. 2022a,c]. However, the above approaches either do not support or cannot generalize to single-shot in-the-wild images. Multi-view methods have also been used to reconstruct implicit head models [Athar et al. 2021, 2022; Hong et al. 2022; Kellnhofer et al. 2021; Li et al. 2022; Ramon et al. 2021; Wang et al. 2022a]. HeadNeRF [Hong et al. 2022] is the closest to our work and learns a parametric head model from multi-view images during training; at test time, an input image can be inverted for 3D reconstruction. However, HeadNeRF performs volumetric rendering at a limited image resolution and relies on upsampling CNN modules, resulting in flickering artifacts from depth error during novel view synthesis. Furthermore, existing implicit morphable models do not support texture manipulation beyond interpolation; by contrast, our learned explicit texture parameterization enables intuitive and out-of-domain edits such as adding tattoos or mustaches (see Fig. 1).

METHOD

Implicit Morphable Face Parameterization
We disentangle each facial avatar into identity and expression, where identity is encoded by geometry and color latent codes while expression is captured by an expression latent code. To attain both high-quality geometry and interpretable texture, our model consists of an implicit geometry branch and a UV texture parameterization branch. The geometry branch contains a multilayer perceptron (MLP) $f$ that maps 3D points $x$ to SDF values $f(x)$ during sphere tracing. The UV texture branch consists of a parameterization MLP $g$ that maps $x$ to spherical coordinates $g(x)$, a parameterization regularizer MLP $g^{-1}$ that learns the inverse mapping from $g(x)$ back to $x$, and a color network $h$ that predicts the output RGB at $g(x)$. See Figure 2 for a diagram of our model pipeline. Please refer to the supplement for model architecture details.
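To make the role of sphere tracing concrete, the following is a minimal sketch of the generic algorithm, not the paper's renderer. The analytic `sphere_sdf` is an assumption that stands in for the learned SDF MLP, and the tolerance `eps` and cutoff `t_max` are illustrative values; only the 50-step budget comes from the implementation details reported later.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Analytic SDF of a sphere; a stand-in for the learned SDF network."""
    return math.dist(p, center) - radius

def sphere_trace(origin, direction, sdf, max_steps=50, eps=1e-4, t_max=10.0):
    """March along a unit-length ray, advancing by the SDF value each step.

    Returns the distance t to the surface, or None if no hit is found.
    """
    t = 0.0
    for _ in range(max_steps):
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        if dist < eps:       # close enough to the zero level set
            return t
        t += dist            # safe step: no surface is nearer than dist
        if t > t_max:        # the ray has left the region of interest
            return None
    return None

hit = sphere_trace((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
miss = sphere_trace((0.0, 2.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
```

In the model described above, the surface point found this way would then be passed to the UV and color networks; here the sketch stops at the intersection.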
We train our model on the Triplegangers [2022] 3D scan dataset for its volume and diversity of subjects and expressions. Although the RenderPeople [2022] dataset additionally models hair and clothing, it only contains 120 neutral expression subjects, making it less suitable for reconstructing an avatar from unconstrained in-the-wild photos. Our training samples consist of a 3D head mesh, UV diffuse texture map, and six diffusely lit frontal RGB images. The dataset contains 515 different subjects each with 20 expressions, for a total of 10,300 data samples. Our full model learns an AutoDecoder dictionary of 515 geometry codes, 515 color codes, and 10,300 expression codes, as subjects express the same sentiment differently. Different expressions for the same training subject share the same geometry and color codes, allowing the model to disentangle expression from the underlying geometry and texture. Please refer to the supplement for examples of our training data.
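The code-sharing scheme can be illustrated with a small lookup sketch. The counts (515 subjects, 20 expressions) come from the text; the subject-major sample ordering, the random initialization, and the reduced `CODE_DIM` are assumptions made to keep the example light (the paper reports 512-dimensional codes).

```python
import random

NUM_SUBJECTS = 515        # from the text
EXPR_PER_SUBJECT = 20     # from the text
CODE_DIM = 4              # reduced for the sketch; the paper uses 512

def init_codes(count, dim, rng):
    """Small random codes; the initialization scheme is an assumption."""
    return [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
geometry_codes = init_codes(NUM_SUBJECTS, CODE_DIM, rng)
color_codes = init_codes(NUM_SUBJECTS, CODE_DIM, rng)
expression_codes = init_codes(NUM_SUBJECTS * EXPR_PER_SUBJECT, CODE_DIM, rng)

def lookup(sample_index):
    """Return (geometry, color, expression) codes for one training sample.

    Assumes samples are ordered subject-major, so all 20 expressions of a
    subject share one geometry code and one color code.
    """
    subject = sample_index // EXPR_PER_SUBJECT
    return (geometry_codes[subject],
            color_codes[subject],
            expression_codes[sample_index])
```

Because every sample of a subject indexes the same geometry and color entries, gradients from all of that subject's expressions accumulate in those shared codes, which is what pushes expression-specific variation into the per-sample expression code.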

Training Losses
Our model is trained on geometry, color, and regularization losses. Following Figure 2, let $f$ be the SDF MLP, $g$ the UV parameterization MLP, and $g^{-1}$ the inverse UV parameterization MLP, with the surface terms evaluated on a set of randomly sampled surface points during training. The geometry loss consists of surface, Eikonal [Gropp et al. 2020], normal, and UV losses. The surface loss optimizes the SDF zero level set, the Eikonal loss regularizes the SDF gradients, and the normal loss aligns the SDF gradients with the ground truth mesh normals $n$. The UV loss regularizes the learned mapping to follow an invertible surface parameterization, which enables the correspondences between texture and geometry used in our single-shot inversion pipeline, described in Section 3.5. The color loss consists of a reconstruction loss on the ground truth texture $T$, as well as perceptual [Zhang et al. 2018] and reconstruction losses over the facial region between the ground truth image $\hat{I}$ and the rendered image $I$ obtained via sphere tracing. Finally, we enforce compactness in the learned latent space by penalizing the magnitude of the geometry, color, and expression codes.
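As a rough illustration of what the three SDF-related geometry terms measure, the sketch below evaluates them at a single surface sample. It is not the paper's formulation: the exact norms and weights are not given in this text, so the absolute value, squared deviation, and Euclidean distance used here are assumptions, as are the analytic unit-sphere `sdf` and the finite-difference gradient (a trained model would use automatic differentiation).

```python
import math

def sdf(p):
    """Unit-sphere SDF standing in for the learned network."""
    return math.sqrt(sum(c * c for c in p)) - 1.0

def gradient(f, p, h=1e-5):
    """Central finite-difference gradient of f at p."""
    grad = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        grad.append((f(hi) - f(lo)) / (2.0 * h))
    return grad

def geometry_terms(f, p, normal_gt):
    """Return (surface, eikonal, normal) penalties at one surface sample."""
    g = gradient(f, p)
    grad_norm = math.sqrt(sum(c * c for c in g))
    surface = abs(f(p))                    # sample should lie on the zero level set
    eikonal = (grad_norm - 1.0) ** 2       # SDF gradient should have unit norm
    normal = math.dist(g, normal_gt)       # gradient should match the mesh normal
    return surface, eikonal, normal

point = (0.0, 0.6, 0.8)                    # lies on the unit sphere
terms = geometry_terms(sdf, point, normal_gt=point)
```

For an exact SDF all three terms vanish at surface points, which is why they act as complementary constraints: the first pins the level set, the second keeps the field a valid distance function, and the third orients it.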

Learning UV Parameterizations
To learn an interpretable texture space and coherent semantic correspondence across subjects, we add an auxiliary loss term to the training losses that enforces the parameterization to be consistent through a sparse set of facial landmark constraints. The first term enforces the learned UV mapping $g$ to match the ground truth UV mapping $\hat{g}$ for the set of 3D facial landmark points, and the second term enforces this mapping to be invertible. Fig. 8 demonstrates the consistency of our learned UV parameterization.
Although the learned parameterization is mostly consistent, it is difficult to obtain perfect registrations around the inner mouth and eyes due to the billboard geometry and errors originating from the ground truth data.
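The two-term structure of the landmark constraint can be sketched as follows. This is only an illustration under stated assumptions: `to_uv` and `from_uv` are analytic spherical mappings standing in for the learned networks $g$ and $g^{-1}$, and the Euclidean distances and equal weighting of the two terms are hypothetical choices, since the original equation is not reproduced in this text.

```python
import math

def to_uv(p):
    """Map a 3D point to spherical angles (theta, phi); stand-in for g."""
    x, y, z = p
    r = math.sqrt(x * x + y * y + z * z)
    return (math.acos(z / r), math.atan2(y, x))

def from_uv(uv):
    """Map angles back onto the unit sphere; stand-in for the inverse of g."""
    theta, phi = uv
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def landmark_uv_loss(landmarks, uv_gt, g=to_uv, g_inv=from_uv):
    """Average of a UV-matching term and a cycle (invertibility) term."""
    total = 0.0
    for p, uv_hat in zip(landmarks, uv_gt):
        uv = g(p)
        match = math.dist(uv, uv_hat)       # agree with the ground truth UV
        cycle = math.dist(g_inv(uv), p)     # 3D -> UV -> 3D returns to p
        total += match + cycle
    return total / len(landmarks)

landmarks = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.6, 0.8)]
uv_targets = [to_uv(p) for p in landmarks]
loss_consistent = landmark_uv_loss(landmarks, uv_targets)
loss_shifted = landmark_uv_loss(landmarks, [(t + 0.1, f) for t, f in uv_targets])
```

The matching term is what ties a given UV location to the same semantic facial feature across subjects, while the cycle term is what makes the learned map usable as a surface parameterization.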

Animation
After training, an avatar can be animated by manipulating its expression latent code. For a source subject with expression code $w_e$, target expression code $w'_e$, and animation timestep $t \in [0, 1]$, we define the expression animation trajectory by linear interpolation between the two codes: $w_e(t) = (1 - t)\,w_e + t\,w'_e$. Unlike traditional linear 3DMM approaches, our expression space follows non-linear trajectories learned from high-quality 3D scans, as shown in Fig. 4.
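A short sketch of the latent-space interpolation follows, based on the caption of Fig. 4, which states that source and target expression codes are linearly interpolated. The function names and the evenly spaced timesteps are assumptions for illustration.

```python
def interpolate_codes(w_src, w_tgt, t):
    """Linearly blend two expression codes for a timestep t in [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * a + t * b for a, b in zip(w_src, w_tgt)]

def trajectory(w_src, w_tgt, num_frames):
    """Expression codes for evenly spaced animation frames."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [interpolate_codes(w_src, w_tgt, i / (num_frames - 1))
            for i in range(num_frames)]

frames = trajectory([0.0, 2.0], [1.0, 0.0], num_frames=5)
```

The interpolation is linear only in the latent code. Each interpolated code is decoded by the non-linear SDF network, so the resulting surface points need not move along straight lines.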

Single-Shot Inversion
In order to reconstruct and animate unseen subjects, we project an input RGB image into the latent space of our pre-trained model and lightly fine-tune the model weights, similar to Pivotal Tuning Inversion (PTI) [Roich et al. 2022]. To handle unseen lighting conditions, we de-light the input image using LUMOS [Yeh et al. 2022] and initialize the geometry, color, and expression codes through a separately trained encoder. We empirically find this encoder initialization to be important in obtaining robust results for in-the-wild input images (see Figure 9).
Image Encoder. We attain latent code initializations by training a DeepLabV3+ [Chen et al. 2018] encoder to reconstruct each training image $\hat{I}$ and its corresponding latent codes $\hat{w}$ already computed from the previous AutoDecoder training stage. One major challenge when inverting in-the-wild images is handling the unseen identities, accessories, hairstyles, and occlusions present in real-world images, as Triplegangers contains limited identities with no variations in hairstyles or background. Therefore, we augment the encoder's training dataset with synthetically augmented Triplegangers images from [Yeh et al. 2022], which improves the robustness of the initialization and final inversion reconstruction, as shown in Fig. 9.
Optimization. After initializing the latent codes for an input image $\hat{I}$ using our encoder, we freeze the model weights and optimize the latent codes while minimizing image, silhouette, multi-view consistency, facial landmark, and regularization losses. The silhouette loss iterates over points contained in the ground truth face region, but not in the predicted face region, to bring the points closer to the SDF zero level set.
For the multi-view consistency loss, ArcFace [Deng et al. 2019a] measures the face similarity between different views, where the second view is a predicted render from a randomly perturbed camera pose. The facial landmark loss uses an off-the-shelf facial landmark detector [King 2009] together with the ground truth facial landmark UV mapping enforced in Eq. 11. Note that our consistent UV parameterization directly enables correspondences for the facial landmark alignment loss; Fig. 10 demonstrates the benefits of incorporating this loss. The regularization loss is important to ensure that the optimized codes stay near the manifold of the pre-trained latent space for expression animation. We obtain face masks using a pre-trained BiSeNet [Yu et al. 2018] and optimize for 800 steps.
Fine-tuning. To reconstruct finer details in the input image, we freeze the latent codes after optimization and fine-tune the model weights on the above losses. We omit the silhouette loss, as we find it tends to bloat the geometry when the model weights are unfrozen. Although fine-tuning the model improves reconstruction quality, it may also hinder its capability for animation or novel view synthesis. Therefore, we only perform model fine-tuning for 60 steps.
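The freeze-then-unfreeze schedule can be illustrated on a toy problem. The sketch below is a deliberately simplified stand-in, assuming a scalar "model" `weight * code`, a squared-error image term, plain gradient descent, and hypothetical values for `lr` and `reg`; only the step counts (800 and 60) come from the text.

```python
def two_stage_inversion(target, weight=1.0, code=0.0,
                        latent_steps=800, finetune_steps=60,
                        lr=0.05, reg=1e-3):
    """Toy two-stage inversion: optimize the code, then briefly the weight."""
    # Stage 1: model weight frozen, latent code optimized with a
    # magnitude penalty that keeps it near the origin of the latent space.
    for _ in range(latent_steps):
        residual = weight * code - target
        code -= lr * (2.0 * residual * weight + 2.0 * reg * code)
    error_after_latent = abs(weight * code - target)

    # Stage 2: latent code frozen, model weight fine-tuned for a few steps.
    for _ in range(finetune_steps):
        residual = weight * code - target
        weight -= lr * (2.0 * residual * code)
    error_after_finetune = abs(weight * code - target)
    return weight, code, error_after_latent, error_after_finetune

w, c, err_latent, err_finetune = two_stage_inversion(target=2.0)
```

In this toy setting the regularized code cannot fit the target exactly, and the short fine-tuning stage absorbs the remaining residual into the weight. Keeping that stage short mirrors the stated motivation of limiting how far the model drifts from its pre-trained state.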

RESULTS
We present results of our proposed method with comparisons to EMOCA [Daněček et al. 2022], ROME [Khakhulin et al. 2022], and FaceVerse [Wang et al. 2022b], three recent mesh-based approaches for single-shot 3D avatar generation, and HeadNeRF [Hong et al. 2022], an implicit approach using neural radiance fields. Our method achieves higher fidelity texture and geometry reconstruction in the facial region compared to the baselines. Qualitatively and quantitatively, our method also demonstrates more faithful expression and pose transfer between in-the-wild source and target images. Finally, our learned texture map is intuitive to edit and propagates naturally during animation.

Implementation Details
Our model is trained in two stages. In the first stage, we withhold the ground truth multi-view images, as we find that supervising with both texture maps and multi-view images negatively impacts the model's ability to learn a consistent UV mapping. In the second stage, we freeze the UV networks $\{g, g^{-1}\}$ and supervise using the multi-view images to fine-tune the learned texture maps while rendering image reconstructions at 768 × 512 resolution. Camera poses are provided with ground truth training data and we estimate camera poses for in-the-wild FFHQ images using Deep3DFaceRecon [Deng et al. 2019b]. We perform sphere tracing for 50 steps per ray and use a dimensionality of 512 for the geometry, color, and expression latent codes. We train our AutoDecoder for 1000 epochs (approx. one week) and our inversion encoder for 200 epochs (approx. one day) across 8 NVIDIA A40 GPUs. We use a Triplegangers training/test split of 386/129 for the quantitative expression experiments. Sphere tracing takes 8.5 seconds and inversion takes 3 hours per image. See supplemental material for more details on training and model architectures.

Table 2. Quantitative results on single-shot in-the-wild reconstruction (left) and self-expression retargeting (right). Left: image, pose, and identity metrics are computed on 500 images sampled from FFHQ. Depth metrics are computed on the H3DS dataset. Image, identity, and depth metrics are computed only on the facial region. EMOCA is evaluated using its smaller face crop. Right: FACS coefficients and facial landmarks are computed after expression and pose transfer on 32 expression pairs sampled from the Triplegangers test split.

Single-Shot 3D Face Reconstruction and Animation
Qualitative Results. We show qualitative comparisons for single-shot reconstruction followed by expression and pose transfer on FFHQ [Karras et al. 2019] images between the proposed method, EMOCA, ROME, and HeadNeRF in Fig. 5 and Fig. 13.
Overall, our method is more photo-realistic and achieves higher expression accuracy in facial reconstruction. EMOCA does not model the mouth interior and relies on a pre-trained FLAME [Li et al. 2017] albedo model for texture. Our model produces the most faithful expression transfer, demonstrating the diversity of its learned expression space and the generalization capabilities of our method to in-the-wild data. HeadNeRF exhibits a large amount of identity shift during pose transfer, whereas our method remains view-consistent after large pose changes.
We also show a ground truth comparison of reconstructed geometry on the H3DS [Ramon et al. 2021] dataset between our method and the baselines in Fig. 6. HeadNeRF performs volumetric rendering at a low resolution and therefore produces noisy depth results. Our method captures higher fidelity facial geometry than ROME and captures the expression more faithfully (e.g., eye blink) compared to EMOCA.
Quantitative Results. We report quantitative reconstruction and self-reenactment expression transfer results in Table 2 and Table 3. The photometric (LPIPS [Zhang et al. 2018], DISTS [Ding et al. 2020], SSIM [Wang et al. 2004]), pose error, and MagFace [Meng et al. 2021] identity consistency (ID) metrics are calculated over a dataset of 500 images from FFHQ. We compute L1 and RMSE depth error over all subjects in the H3DS dataset. To evaluate self-reenactment expression error, we randomly sample 32 source-target expression pairs over a test split of the Triplegangers dataset and measure the L2 error for FACS [Ekman and Friesen 1978] coefficients and facial landmarks. For details related to how each metric is computed, please refer to the supplemental material.
On the FFHQ dataset, our proposed method achieves the best accuracy in terms of LPIPS, DISTS, SSIM, and ID score. The optimization-free ablation struggles to handle the considerably large domain shift between Triplegangers training data and FFHQ in-the-wild images. Our model also exhibits the lowest depth error on the H3DS dataset without relying on a 3D template mesh prior. Finally, our model has the lowest FACS and facial landmark errors, demonstrating the diversity of its learned expression space.
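For readers unfamiliar with the depth metrics, the following sketch shows one common way to compute L1 and RMSE error restricted to a face mask. The flattened-list inputs and the function name are assumptions for illustration; the paper's exact evaluation protocol (alignment, scaling, masking) is described in its supplemental material and is not reproduced here.

```python
import math

def depth_errors(pred, gt, mask):
    """Mean absolute (L1) and root-mean-square depth error inside a mask.

    pred, gt: flattened depth maps of equal length.
    mask: truthy entries mark pixels that belong to the facial region.
    """
    diffs = [p - g for p, g, m in zip(pred, gt, mask) if m]
    if not diffs:
        raise ValueError("mask selects no pixels")
    l1 = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return l1, rmse

# The last pixel is outside the mask, so its large error is ignored.
l1, rmse = depth_errors(pred=[1.0, 2.0, 5.0, 9.0],
                        gt=[1.0, 4.0, 5.0, 0.0],
                        mask=[1, 1, 1, 0])
```

RMSE weights large deviations more heavily than L1, so reporting both helps distinguish uniformly small errors from occasional large outliers such as noisy depth estimates.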

Ablations
In addition to the baselines mentioned, we compare our method to two ablations for single-shot reconstruction. The first ablation is an optimization-free inversion approach that only uses the learned encoder to directly map an input image to the geometry, color, and expression codes $\{w_g, w_c, w_e\}$. The second ablation is an encoder-free inversion approach that omits the encoder and instead uses a mean initialization for $\{w_g, w_c, w_e\}$ over the learned AutoDecoder dictionary of latent codes.
Quantitative results for the ablations are reported in Table 2. The optimization-free approach produces significantly worse photometric and depth results, as there is a large domain gap between Triplegangers training data and in-the-wild images; this causes the encoder to produce a coarse reconstruction. The encoder-free approach performs better than the optimization-free approach but is still worse than our full method in image and geometry quality, demonstrating that the encoder initialization improves the optimization reconstruction. Both ablations and our full method perform similarly on pose accuracy.
Applications. As demonstrated in Fig. 5, our method directly supports monocular facial performance capture and expression retargeting. Our hybrid representation provides direct control over an intuitive texture map with a consistent layout. Fig. 7 demonstrates an example workflow: a user reconstructs an input image and modifies the learned texture map. The edits then persist smoothly across different facial animations. Textured meshes can be extracted for further downstream applications such as relighting, as shown in the teaser. Fig. 11 and Fig. 12 further demonstrate our model's disentanglement between geometry, texture, and expression with its capability of shape and facial appearance transfer.
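The attribute transfers in Fig. 11 and Fig. 12 amount to swapping one latent code while keeping the other two fixed. A minimal sketch follows; the `AvatarCodes` container and the function names are hypothetical and only illustrate the bookkeeping, not the decoding of the codes into an avatar.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class AvatarCodes:
    """Hypothetical container for the three latent codes of one avatar."""
    geometry: Tuple[float, ...]
    color: Tuple[float, ...]
    expression: Tuple[float, ...]

def transfer_shape(source: AvatarCodes, target: AvatarCodes) -> AvatarCodes:
    """Keep the source color and expression; take the target geometry."""
    return replace(source, geometry=target.geometry)

def transfer_appearance(source: AvatarCodes, target: AvatarCodes) -> AvatarCodes:
    """Keep the source geometry and expression; take the target color."""
    return replace(source, color=target.color)

src = AvatarCodes(geometry=(0.1, 0.2), color=(0.3, 0.4), expression=(0.5, 0.6))
tgt = AvatarCodes(geometry=(1.1, 1.2), color=(1.3, 1.4), expression=(1.5, 1.6))
shape_swapped = transfer_shape(src, tgt)
appearance_swapped = transfer_appearance(src, tgt)
```

Such a swap is only meaningful when the codes are disentangled, which is the property the shared per-subject geometry and color codes are meant to encourage during training.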

DISCUSSION
We have presented a new method for reconstructing 3D animatable and textured faces from a single RGB image. The proposed approach combines implicit representations with explicit texture maps to support explicit editing while achieving better photo-realistic rendering, geometry, and expression reconstruction than previous methods. We believe the proposed method makes important contributions towards accessible creation of high-fidelity avatars from in-the-wild images that are animatable, editable, and customizable for downstream applications.
However, there are still limitations to the method. Firstly, the current optimization process during inversion is significantly slower than encoder-based methods. For real-time applications, more expressive representations such as neural feature fields can be explored to enable optimization-free inversion methods. Furthermore, the method relies on a de-lighting module from LUMOS to process in-the-wild images and generate a diffusely lit input image, which may cause subjects to appear paler than expected. These limitations may be alleviated through lighting augmentations of the training dataset to reduce the domain gap and by incorporating a lighting model such as spherical harmonics into the representation. Finally, the results shown in this paper do not capture hair or accessories due to limitations of the training dataset. We refer the reader to the supplemental material for a preliminary, still imperfect demonstration of our representation's capacity to handle hair and clothing on the smaller RenderPeople dataset. As implicit representations such as neural radiance fields excel at capturing the geometry and texture of thin structures, it may be fruitful to combine our method with recent sparse view implicit hair models [Kuang et al. 2022; Wu et al. 2022].


Fig. 2. Our Pipeline. Avatars are represented by geometry, expression, and color latent codes $\{w_g, w_e, w_c\}$, each of which is 512-dimensional. At each 3D coordinate $x$ during sphere tracing, the SDF network $f$ and UV parameterization network $g$ are conditioned on $w_g$, $w_e$, and a positional encoding of $x$ to predict the signed distance and UV coordinates, respectively. The inverse UV parameterization network $g^{-1}$ regularizes the learned mapping to be a surface parameterization by mapping the predicted UV coordinates back to $x$, while the color network $h$ predicts the associated RGB texture from the UV coordinates. After training, the avatar can be rendered freely with direct control over its texture and facial expression, or extracted as a stand-alone textured mesh asset.

Fig. 3. Single-shot inversion pipeline. We de-light the input image and initialize the latent codes using a pre-trained encoder (top row). We then perform PTI [Roich et al. 2022] to get the final reconstruction (bottom row). Original image courtesy of Brett Jordan/flickr.

Fig. 4. Non-linear animation space. By linearly interpolating between source and target expression codes, our model exhibits non-linear deformation trajectories on the 3D mouth vertices visualized. Original image courtesy of David Shankbone/flickr.

Fig. 5. Single-shot reconstruction on FFHQ with expression and pose transfer. On the left, we show the input FFHQ source image, de-lit input image using LUMOS [Yeh et al. 2022], and reconstruction results for each method. On the right, we show monocular performance capture and retargeting, where we reconstruct and transfer the expression and pose from a target image (right-most column) to the source image identity (left-most column). On the left from top to bottom, original images are courtesy of José Carlos Cortizo Pérez/flickr, Montclair Film/flickr, Pham Toan/flickr, Javier Morales/flickr, Khiet Nguyen/flickr, and Malcolm Slaney/flickr. On the right from top to bottom, original images are courtesy of Adam Charnock/flickr, Daughterville Festival/flickr, Delaney Turner/flickr, South African Tourism/flickr, Pat (Cletch) Williams/flickr, and Collision Conf/flickr.

Fig. 6. Ground truth geometry comparison on the H3DS dataset in the single-view setting.

Fig. 7. Texture editing. Top row: input image, learned texture map, and user-edited texture map. The learned texture map layout is intuitive and edits propagate naturally during facial animation, as shown in the bottom row. Original image courtesy of Ed Kohler/flickr.

Fig. 8. UV parameterization consistency. We measure the mean L2 error over 32 FFHQ subjects between the learned texture map (top left) and the cycle texture map (bottom left) obtained by mapping from UV → 3D → UV.

Fig. 9. Encoder training data augmentation ablation. Training the encoder with the synthetically augmented Triplegangers dataset [Yeh et al. 2022] significantly improves our initialization, which is important for converging to a high-quality inversion result. Note the difference in the final reconstructed geometry. Original image courtesy of David Geitgey Sierralupe/flickr.

Fig. 10. Facial landmarks loss ablation. Removing the facial landmarks loss during inversion reduces reconstruction quality of the face contour (left and right jaws) and facial features such as the eyes (right). Original image courtesy of Cena Mineira (left) and BigBrother Junkie (right).

Fig. 11. Shape attribute transfer. We fix the color and expression codes for the source subject and directly replace the source geometry code with the target geometry code. Original images are courtesy of Francesco Pierantoni/flickr (left col, top), Tim Regan (left col, bottom), Bob n Renee/flickr (top row, left), and Sarah & Austin Houghton-Bird/flickr (top row, right).

Fig. 12. Facial appearance attribute transfer. We fix the geometry and expression codes for the source subject and directly replace the source color code with the target color code. Original images are courtesy of Lord Jim/flickr (left col, top), xiǎo cháo zhù/flickr (left col, bottom), U.S. Army/flickr (top row, left), and U.S. Department of Energy/flickr (top row, right).

Table 1. Comparison to recent prior work. To the best of our knowledge, our method is the first implicit 3D face model to generalize across single-image inputs while supporting flexible topology and explicit texture map control.