FaceFolds: Meshed Radiance Manifolds for Efficient Volumetric Rendering of Dynamic Faces

3D rendering of dynamic face captures is a challenging problem, and it demands improvements on several fronts: photorealism, efficiency, compatibility, and configurability. We present a novel representation that enables high-quality volumetric rendering of an actor's dynamic facial performances with minimal compute and memory footprint. It runs natively on commodity graphics software and hardware, and allows for a graceful trade-off between quality and efficiency. Our method utilizes recent advances in neural rendering, particularly learning discrete radiance manifolds to sparsely sample the scene when modeling volumetric effects. We achieve efficient modeling by learning a single set of manifolds for the entire dynamic sequence, while implicitly modeling appearance changes as a temporal canonical texture. We export a single layered mesh and a view-independent RGBA texture video that are compatible with legacy graphics renderers without additional ML integration. We demonstrate our method by rendering dynamic face captures of real actors in a game engine, at photorealism comparable to state-of-the-art neural rendering techniques and at previously unseen frame rates.


INTRODUCTION
Facial expressions are our primary means of communication; video streaming of our faces is a frequent part of our daily digital lives. Perceptually lossless and efficient video compression algorithms have enabled this application at consumer scale by achieving compute and memory efficiency. Other key enablers of this ubiquity of video streaming are 1) compatibility of compressed video playback with existing platform infrastructure such as operating systems and web browsers, and 2) an easy trade-off of quality vs. memory through variable image resolution, allowing for seamless streaming across dynamically varying data bandwidth. But the same cannot be said for streaming and playback of personalized 3D face animation, which is a formidable task with major algorithmic challenges such as 3D reconstruction, animation, and transmission. While photorealism and compute/memory efficiency are obvious underlying challenges to each of these steps, other hurdles include compatibility with legacy software infrastructure and the ability to smoothly trade off quality vs. bandwidth. Solving these challenges promises widespread adoption of immersive experiences in 3D media content [22,25], immersive AR/VR communication, and 3D telepresence [10,17,18,48].
Traditional graphics-based acquisition techniques for facial performances reconstruct 3D meshes and texture maps [13,28] for each individual frame, which can be rendered very efficiently on commodity hardware. However, such traditional mesh-based representations encounter significant challenges in accurately modeling and rendering the fine-scale detailed geometry and complex appearance of hair and skin [17,28], resulting in limited photorealism. On the other hand, recent advances in implicit volumetric representations such as neural radiance fields [44] and Gaussian splatting [32] have enabled high-quality acquisition and photorealistic rendering of dynamic human faces [41,49,51], including hair and skin at unprecedented quality [9,53], without requiring explicit geometry reconstruction. But such techniques require deep ML integration for inference and are not natively compatible with existing graphics rendering platforms such as game engines. They also do not always provide means to trade off quality with compute or memory efficiency.
We present a layered-mesh-based volumetric representation for 3D view synthesis of dynamic face performances that works efficiently on legacy renderers. At training time, our method takes inspiration from radiance manifolds [16] and models the scene density for the entire sequence using a set of static spatial manifolds of alpha values, and the temporal appearance changes as a time-conditioned UV-mapped radiance over these manifolds. The alpha manifolds and corresponding temporal UV appearance maps are parameterized by dense neural networks, and the appearance is further decomposed into view-conditioned components. From this trained model, we export the radiance manifolds as a single layered mesh for the entire sequence, and the corresponding view-independent UV-space appearance as RGBA texture maps, encoded as a video. This exported representation is then rendered efficiently through simple alpha-blending of the textured mesh layers in any renderer. Unlike previous methods that cannot change the resolution or quality once trained, our layered mesh representation also allows for trading off image quality for efficiency through standard operations like mesh decimation and subsampling of texture resolution. Our view-independent texture maps allow for easy rendering through texture look-up without the need to evaluate complex view-dependent reflectance or radiance transfer.
We demonstrate the efficacy of our representation using the Multiface dataset [61], consisting of multi-view real-world captures of face performances of several actors. We qualitatively and quantitatively compare our method with state-of-the-art neural rendering techniques on both image quality and efficiency, and show previously unseen high-frame-rate rendering of these sequences in the Unity game engine on a consumer device.

RELATED WORK
Impressive results have been achieved for editing and animation of face images and videos using purely 2D or hybrid approaches [8,34,43,59,64], but such methods do not guarantee consistent rendering when changing perspectives, which is crucial for gaming or XR applications.
Traditional 3D performance capture techniques [13,22,25,28] have relied on textured meshes both for ease of use in playback and editing, and to leverage rasterization for fast rendering. Parametric face models such as 3D morphable face models (3DMMs) [6,21,38] compress the dimensionality of mesh representations and make them differentiable, enabling efficient optimization frameworks that achieve the difficult task of canonical performance capture and playback. However, they suffer from low representational capacity and cannot model high-frequency effects in appearance and geometry. Some recent ML-based approaches build on such surface-based representations, e.g., Neural Head Avatars employs a surface mesh with a dynamic texture [26], while IMAvatar [66] opts for an implicit surface. While efficient, these methods need to generate expression-specific appearance via feed-forward neural networks and have issues representing semi-transparent effects such as hair and beards. Adding a deferred rendering network, as in the single-shot model [33], only partially mitigates these issues and jeopardizes multi-view consistency.
Volumetric representations such as NeRFs parameterize a compressed emission-absorption volume via a multi-layer perceptron (MLP) trained with frequency encoding [45,56]. Since the MLP has to be invoked for decompression at every sample location, inference performance stands out as one of the main limitations of the original work. Subsequently, hybrid representations have replaced the need for an MLP at inference time with (sparse) volumetric grids [11,20,24,29,40,52,54], explicit geometry [12,16,37], and hash grids [46]. Many recent works have exploited such volumetric representations to track or animate head models, but typically with computational requirements far exceeding our target application. Nerfies [49] and HyperNeRF [50] propose a continuous deformation field conditioned on a frame-specific latent code, which enables replay but requires an additional deformation MLP.
A popular class of methods relies on combining the low-dimensional tracking capabilities of 3DMMs with the high representational power of volumetric radiance. GNARF [5] and Next3D [55] use a tri-plane representation with a form of mesh-based deformation but still require 2D super-resolution modules to generate images at the desired resolution. INSTA [69] increases the efficiency of this type of approach by using hash grids, while NeRSemble replaces the explicit head model with a hash ensemble for increased generality [36]. Other methods [23,30,63] employ tetrahedral fields to directly deform a volumetric representation. MonoAvatar [2] uses a full volumetric NeRF model, where deformation is a function of the nearest points on the driving surface mesh. Ray tracing and the use of MLPs mean that these methods are not nearly as fast as rasterization-based techniques. MVP [42] and its generative extension [10] offer a more efficient rendering pipeline by employing volumetric primitives attached to a guide mesh. More recent works have used other alternatives to NeRF-like volumes, such as point-based representations [67] and Gaussian splatting [51,53] driven by combinations of explicit models like FLAME [38] and neural functions. While these models offer improved performance compared to volumetric methods, they still require custom rasterization components and are thus not trivial to deploy in existing software. For all of these methods, reliance on 3DMMs makes deployment challenging without additional software components, including iterative optimizers.
An alternate way to mitigate issues of predefined mesh topology and increase the flexibility of the representation is to use one [39,57] or a collection of (implicit) surfaces [16]. In GRAM [16], the authors propose a discretized volumetric representation using a set of learned non-intersecting implicit surfaces which can be efficiently used for sampling radiance. GRAMInverter [15] and GRAM-HD [62] propose high-resolution variants of GRAM, albeit at increased computational cost and model complexity not suitable for real-time applications. BakedAvatar [19] is a concurrently developed avatar animation technique that also uses radiance manifolds and, like previous methods, employs deformation fields conditioned on tracked 3DMM coefficients. However, their model still relies on an ML model to estimate appearance at inference time. During training, our method builds on the layered implicit surfaces used in GRAM [16], but is not generative, focusing instead on high-quality, person-specific models. We show that a static set of manifolds can model an entire performance sequence, and we offer improved performance by decomposing the radiance into view-dependent and view-independent components, which allows us to export our results to a single explicit triangle mesh with a video texture. This enables real-time playback in commonly available software packages at low computational cost.

Fig. 2. Training and inference pipelines. Given a set of rays from the training cameras, we determine the intersections of these rays with a set of implicit manifolds predicted by a single manifold predictor. After transforming these intersections to UV-space coordinates, a texture predictor outputs RGBA texture maps conditioned on the video frame index. At inference time, we shoot rays from the surface of a designated hemisphere around the scene towards its center, obtaining a single geometry and a video texture. The view-dependent branch is bypassed to ensure that the appearance is fully diffuse.

METHOD
In this section, we first formulate the objective problem and present an overview of our pipeline.
After describing how we process our datasets, we elaborate on how we leverage radiance manifolds to learn efficient 3D representations from multi-view videos. Finally, we describe how we export our representation to a single set of textured meshes that can be rendered natively on traditional graphics software while maintaining the rendering quality.

Definitions and Overview
Our objective is to learn a volumetric 3D representation of a subject that can be played back on game engines without any special neural network integration. Given a multi-view video of a subject with T frames, we learn a static geometry and a dynamic appearance model in an end-to-end fashion. We take inspiration from recent advances in implicit geometry representations [39,57] that significantly outperform explicit 3D reconstruction techniques that rely on mesh or point cloud representations. In our pipeline, similar to GRAM [16], the geometry is modeled by a set of 2D manifolds, embodied as a set of implicit surfaces. But unlike GRAM, the appearance is learned as a UV-mapped dynamic radiance over these manifolds, instead of over the 3D space. We learn N distinct manifolds defined implicitly by a single manifold predictor M: given a set of fixed scalars {s_i ∈ ℝ | i ∈ I ≜ {1, 2, …, N}}, which we refer to as s-values, the manifold predictor defines a set of N isosurfaces that represent our static geometry:

S_i = {x ∈ ℝ³ | M(x) = s_i}, i ∈ I.

In our appearance model, we first transform points on each manifold to UV-space coordinates via a fixed function Π: ℝ³ → [−1, 1]². For each manifold i ∈ I and each frame j ∈ J ≜ {1, 2, …, T}, a texture predictor T defines RGB and transparency fields:

T(Π(x), s_i, j) = (r, g, b, α).

Note that we do not estimate volume density as is the case for traditional volume rendering, but instead model 3D point transparency with an alpha value. This makes the radiance accumulation independent of the ray path, which is crucial for enabling the next step of exporting the learned manifolds as textured mesh layers. Our approach can be treated as a generalization of the multi-plane image representation [68], where we optimize arbitrary 2D surfaces instead of planes.
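Because each layer carries an alpha value rather than a volume density, rendering reduces to plain front-to-back alpha compositing of the N layers, independent of the ray path. A minimal NumPy sketch of this accumulation (array shapes are our own illustrative convention):

```python
import numpy as np

def composite_layers(rgb, alpha):
    """Front-to-back alpha compositing of N textured manifold layers.

    rgb:   (N, H, W, 3) view-independent colors, layer 0 nearest the camera
    alpha: (N, H, W)    per-layer alpha in [0, 1]
    Returns an (H, W, 3) image.
    """
    out = np.zeros(rgb.shape[1:])
    transmittance = np.ones(alpha.shape[1:])
    for i in range(rgb.shape[0]):
        weight = transmittance * alpha[i]       # visible contribution of layer i
        out += weight[..., None] * rgb[i]
        transmittance *= 1.0 - alpha[i]         # light remaining for deeper layers
    return out
```

The same arithmetic is what a standard rasterizer performs when blending the exported textured mesh layers, which is why no ML runtime is needed at playback.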
Once training is completed, we collect samples across each manifold at a specific resolution and export these collections of 3D points, UV-coordinates, and RGBA values as a single set of topologized triangle meshes with UV-textures that can be efficiently rendered on legacy graphics renderers. We illustrate our training and inference pipelines in Figure 2.

Dataset
We use the publicly available Multiface dataset [61], from which we gather multi-view videos of 3 subjects from V1 of the dataset (subject IDs 002914589, 002643814, 5372021), and 2 subjects from V2 (subject IDs 002421669, 002645310). We pick T = 60 consecutive frames from each video sequence where subjects perform facial expressions freely. For all subjects, we scale and transform the scene parameters such that the subjects are centered at the origin and oriented along the positive z-axis with the up-vector aligned with the positive y-axis, and such that the camera centers are distributed roughly 1 unit away from the subjects. We discard the cameras with elevation angles of more than 45° and azimuth angles of more than 90°, which yields a set of cameras in the z > 0 half-space. For each subject, we also hold out 2 cameras to perform quantitative evaluations. This yields 23 training cameras for V1 subjects and 50 training cameras for V2. Finally, we downsample all images to 768 × 500 resolution while adjusting the camera parameters accordingly. We do not perform any background masking, as our method is able to largely separate the foreground either by restricting the scene volume or by placing the background into the view-dependent component of radiance, which is discarded at inference time.
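The camera selection above amounts to simple thresholding on elevation and azimuth. A sketch, assuming each camera record exposes precomputed `elev` and `az` angles in degrees (an illustrative convention of ours, not the Multiface format):

```python
def select_training_cameras(cameras, max_elev_deg=45.0, max_az_deg=90.0):
    """Keep cameras whose elevation and azimuth fall within the thresholds.

    `cameras` is a list of dicts with angles in degrees relative to the
    subject-centered coordinate frame described above.
    """
    return [c for c in cameras
            if abs(c["elev"]) <= max_elev_deg and abs(c["az"]) <= max_az_deg]
```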

Model Architecture and Training
Given T frames from a multi-view video with corresponding camera parameters, we sample points uniformly along each camera ray and compute the intersections between these rays and the manifolds using the differentiable ray-manifold intersection algorithm [47] adopted in GRAM [16]. In our method, we sample 256 points along each ray, set the number of manifolds to N = 12, and train our model using videos consisting of T = 60 frames for each subject.
Previous techniques that model dynamic scenes with radiance manifolds [19,60] have used explicit learned deformation of the manifolds to model scene animation. In contrast, we model all frames of a dynamic sequence with a single set of static manifolds, which poses a nontrivial challenge. To achieve this, our technique uses a unique sequence of steps, where we 1) transform intersection points to UV-space, 2) separate RGB predictions into view-independent and view-dependent components, and 3) estimate the transparency of each intersection directly without computing volume densities. Given an intersection p = (x, y, z) ∈ ℝ³ and a fixed center c ∈ ℝ³ of a unit sphere, we first project the intersection to the surface of the sphere and obtain p′ ≜ (x′, y′, z′) = (p − c)/∥p − c∥. We then calculate the UV-space coordinates as u = (2/π) sin⁻¹(y′) and v = (2/π) tan⁻¹(x′/z′). The texture predictor receives the UV-coordinates and the s-values of the intersections, as well as the frame index, which is mapped to a learned latent code that conditions the predictions. The texture predictor then branches into two layers that predict a single-channel view-dependent component and a three-channel view-independent component, the former of which is conditioned on the view direction. The outputs of these branches are added together to produce the final RGB prediction. This architecture allows us to discount view-directional effects at inference time and achieve view-consistent rendering of exported meshes. Furthermore, it also helps us separate most of the background from the foreground by attributing the background to the view-dependent component, particularly if the background is primarily grayscale, thus eliminating the need for explicit background removal. Finally, the alpha values are predicted as the raw output of our texture predictor, which can be directly used to alpha-composite our N-layered representation.
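The projection step can be sketched as follows. We use `arctan2` in place of the plain arctangent so the mapping stays well-defined when z′ = 0 (for the frontal hemisphere, z′ > 0, the two agree); the component assignment follows the formulas above:

```python
import numpy as np

def spherical_uv(p, c):
    """Map a manifold intersection p to UV coordinates via unit-sphere projection.

    Implements u = (2/pi) * asin(y') and v = (2/pi) * atan(x'/z'),
    where p' = (p - c) / ||p - c||.
    """
    x, y, z = (p - c) / np.linalg.norm(p - c)   # p' on the unit sphere
    u = (2.0 / np.pi) * np.arcsin(y)
    v = (2.0 / np.pi) * np.arctan2(x, z)
    return u, v
```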

Training details.
To promote training stability, we adopt the manifold initialization technique [1] used in GRAM [16] and begin training with sphere-like manifolds centered at c. We optimize our model in an end-to-end fashion by adopting an ℓ1 loss between the predicted and ground-truth pixel values. To ensure that the appearance is mostly explained by the view-independent component, we penalize the output of the view-dependent branch with an ℓ2 penalty. To further promote stability, we apply ℓ2 regularization to all manifold predictor layers except for the final one. The manifold and texture predictors are optimized jointly by minimizing the loss function

L = L_rec + λ_vd L_vd + λ_reg L_reg,

where L_rec is the ℓ1 reconstruction loss, L_vd is the view-dependent branch penalty, and L_reg is the manifold regularization, with λ_vd = 1.0 and λ_reg = 0.0001. The two predictors are jointly optimized using the Adam optimizer [35] with initial learning rates of 0.0007 and 0.0010, and exponential decay rates of 0.05 and 0.20 per 200 000 iterations, respectively. Using a batch size of 32 768 rays sampled across all training frames and views, we perform training for 500 000 iterations for each subject.
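As a minimal NumPy sketch of the objective above (the argument names and shapes are illustrative, not our actual training code):

```python
import numpy as np

def total_loss(pred_rgb, gt_rgb, vd_out, manifold_weights,
               lam_vd=1.0, lam_reg=1e-4):
    """L = L_rec + lam_vd * L_vd + lam_reg * L_reg.

    pred_rgb / gt_rgb: predicted and ground-truth pixel colors
    vd_out:            view-dependent branch output, penalized so appearance
                       stays mostly view-independent
    manifold_weights:  weight arrays of all manifold-predictor layers
                       except the final one
    """
    l_rec = np.abs(pred_rgb - gt_rgb).mean()          # l1 reconstruction
    l_vd = np.square(vd_out).mean()                   # l2 branch penalty
    l_reg = sum(np.square(w).sum() for w in manifold_weights)
    return l_rec + lam_vd * l_vd + lam_reg * l_reg
```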

Architecture Details.
The manifold predictor is implemented as an MLP with 3 hidden layers of width 128 and a final layer, where we choose the set of fixed scalars {s_i | i ∈ I} so that the initial concentric, sphere-like surfaces roughly fall within ±0.03 units of the surface of the face. These scalars are tuned slightly for each subject according to the size of the face inferred from the tracked meshes provided in the Multiface dataset [61]. The texture predictor is implemented as an MLP with 8 hidden layers of width 256 and 2 final layers corresponding to the view-independent and view-dependent branches. Both input points and view directions undergo positional encoding, and the frame indices are mapped to 32-dimensional learned latent codes through an embedding layer. Encoded input points and frame indices are fed into the MLP at its first layer, whereas the encoded view directions are concatenated to the input of the view-dependent branch.

Exporting Layered Meshes and Textures
At test time, we gather points across the unit hemisphere by collecting azimuth and elevation angles in [−π/2, π/2] uniformly at resolution R × R and shoot rays towards the sphere center c. We set this center 0.25 units away from the scene center in the direction of the negative z-axis to ensure that the entire scene is encompassed by the hemisphere. This gives us R × R samples across each manifold, with UV-coordinates that are distributed uniformly in UV space. For each of the T frames, these points are used to query the texture predictor to yield N RGBA texture maps at resolution R × R, where the view-dependent branch is bypassed to ensure the texture maps are fully diffuse.
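The hemisphere sampling above can be sketched as follows; the axis convention (subject facing +z, up along +y) matches our dataset preprocessing, and the function names are illustrative:

```python
import numpy as np

def hemisphere_rays(R, c, radius=1.0):
    """Shoot an R x R grid of rays from a hemisphere toward its center c.

    Azimuth and elevation are sampled uniformly in [-pi/2, pi/2].
    Returns ray origins on the shell and unit directions pointing at c.
    """
    az = np.linspace(-np.pi / 2, np.pi / 2, R)
    el = np.linspace(-np.pi / 2, np.pi / 2, R)
    A, E = np.meshgrid(az, el, indexing="ij")
    dirs = np.stack([np.cos(E) * np.sin(A),    # x
                     np.sin(E),                # y
                     np.cos(E) * np.cos(A)],   # z
                    axis=-1)
    origins = c + radius * dirs
    return origins, -dirs   # each ray points from the shell toward c
```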
Our simple spherical projection yields reasonable texture mapping despite slight distortions near the edges of the maps [17]. We emphasize that we export the alpha channel as 8-bit images and can hence store it very efficiently without sacrificing visual quality. Finally, while the texture maps vary according to their respective frame indices, the geometry is constant across all frames.
To export our manifold-based representation to explicit surfaces, we reconstruct meshes from each of the N point clouds of size R × R via Poisson surface reconstruction [31], where the normal for each point is computed with respect to its neighboring points. We then simplify these meshes using a mesh decimation algorithm to reduce the number of vertices to a target mesh resolution R_m × R_m. Finally, for each vertex in the simplified mesh, we determine the nearest point in the original point cloud and assign its corresponding UV-coordinate. The texture maps, on the other hand, can be downsampled to a specific target resolution R_t × R_t to meet the memory requirements of the renderer. To summarize, our final assets are: 1) a single set of N triangle meshes, each with a number of vertices no greater than the target resolution R_m × R_m, and 2) T sets of N RGBA texture maps at resolution R_t × R_t that form a UV texture video. We illustrate 3 frames from an example texture video along with composited texture maps in Figure 3.
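The nearest-point UV assignment for decimated vertices can be sketched as below. We use a brute-force nearest-neighbor search for clarity; in practice a KD-tree (e.g. SciPy's `cKDTree`) is the sensible choice for R × R point clouds:

```python
import numpy as np

def assign_uvs(verts, points, uvs):
    """Assign each decimated-mesh vertex the UV of its nearest sample point.

    verts:  (V, 3) vertices of the simplified mesh
    points: (P, 3) original manifold sample points
    uvs:    (P, 2) UV-coordinates of those sample points
    """
    d2 = np.square(verts[:, None, :] - points[None, :, :]).sum(axis=-1)
    return uvs[d2.argmin(axis=1)]
```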

Rendering on Game Engines
Our rendering pipeline in Unity using the exported layered mesh and texture sequences runs in real time. We leverage two-pass deferred shading [14] on the GPU. Given a camera pose at runtime, we generate N G-buffers by shading each mesh layer and its opacity in a single render pass. Modern game engines use multiple render targets (MRT) for this purpose, and we used culling masks to achieve this in Unity. For a small number of layers (e.g., N < 16, where 16 is the maximum texture sampler count supported in Unity), we composite G-buffers on the GPU by tracing a ray through all layers in one pass, similar to accumulating luminance in the traditional volume rendering pipeline. For more than 16 layers, we suggest using a prefix-sum algorithm [7] on the GPU for efficient layer compositing.
In our experiments with N = 12, we achieved real-time performance on a MacBook Pro with an M1 Max chip and Unity 2021.3. This was consistent across five datasets and over 1,000 frames. The average rendering time per frame was under 17 ms (above 60 FPS) at a rendering resolution of 2560 × 1440, even for our largest reconstructed mesh of 6.3M triangles.

EXPERIMENTS AND RESULTS
We evaluate the performance of our method on several subjects from the Multiface dataset [61], where we provide qualitative and quantitative comparisons against state-of-the-art neural rendering methods. We then further analyze the configurability of our representation by assessing its performance with respect to varying numbers of manifolds, mesh resolutions, and texture resolutions.

Qualitative Results
We train our pipeline individually on 5 multi-view video sequences from [61], and illustrate our novel view synthesis results in Figure 5. Here, we provide comparisons against 4 state-of-the-art neural rendering methods: MonoAvatar [2], MVP [42], HyperNeRF [50], and Nerfies [49]. Despite discretizing the 3D volume into only N = 12 manifolds, and hence sampling far fewer points across the scene during both training and evaluation, our approach achieves performance comparable to the other techniques. Furthermore, our technique does not require any MLP queries or a custom pipeline during rendering and thus can be exported into a game engine, where we can perform free-viewpoint rendering of a dynamic 3D scene. We import our layered meshes and UV-textures into Unity and achieve the results demonstrated in Figure 4. We encourage the reader to refer to the supplementary material for video visualizations.
By interpolating between the learned latent codes of different frames at inference time, we can render our representation at higher frame rates to enhance the overall visual quality. We depict our frame interpolation results and provide comparisons in Figure 6, where we observe comparable performance against other methods. Please refer to the supplementary material for the videos.
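The frame interpolation amounts to linearly blending the learned per-frame latent codes before conditioning the texture predictor. A sketch (the function name and array layout are illustrative):

```python
import numpy as np

def interpolate_latent(z, t):
    """Linearly blend learned per-frame latent codes at fractional time t.

    z: (T, D) array of frame embeddings; t in [0, T - 1]. The blended code
    conditions the texture predictor to render an in-between frame.
    """
    i = min(int(np.floor(t)), len(z) - 2)
    f = t - i
    return (1.0 - f) * z[i] + f * z[i + 1]
```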

Quantitative Results
For each subject, we perform quantitative evaluations on 2 held-out cameras across all T = 60 frames, totaling 120 images. In Table 1, we report average image quality metrics for our method and other methods in PSNR, SSIM [58], and LPIPS [65], where we consistently observe comparable performance across all methods. We also report VRAM usage, required disk storage, and frame rates for each of the methods, where we compress individual texture maps into a video and apply mesh compression to individual meshes using Draco with no quantization of vertex positions and texture coordinates, and using the lowest compression amount. From our results, we observe that our method is able to achieve higher frame rates despite utilizing a comparable amount of storage against other methods. Note that other methods can be run with much lower VRAM usage by reducing the batch size down to a single ray per batch. All ML training, including ours and the state-of-the-art methods, was done on a workstation with an NVIDIA V100 GPU. Since the state-of-the-art methods also require a Linux workstation with an NVIDIA GPU for inference, they were evaluated and profiled on this same workstation. Our method does not require such special ML integration, and we perform our evaluation on the Unity game engine on a MacBook Pro laptop.

Table 1. Quantitative comparisons. Our method attains comparable visual quality across various metrics while utilizing significantly less VRAM and enabling much higher frame rates. The image quality metrics are averaged over a total of 600 test images of all 5 subjects.

Ablation Studies
Since radiance manifolds [16] constrain 3D volumes to a fixed number of implicit surfaces, the rendering quality is strongly influenced by the number of manifolds chosen before training. We provide qualitative and quantitative comparisons across different numbers of manifolds in Figure 7 and Table 2. Here, we observe that not only does the rendering quality suffer with a decreasing number of manifolds, but capturing volumetric effects also requires a sufficient number of samples.
Our exported representation allows for trading off image quality against memory efficiency by performing standard operations such as mesh simplification and texture downsampling. After reconstructing a single set of meshes via Poisson surface reconstruction [31], we decimate each of these meshes to meet a target number of vertices. We illustrate qualitative and quantitative evaluations for varying mesh resolutions in Figure 7 and Table 3. We observe that the image quality does not undergo a significant drop until 16 × 16 resolution per surface. This is because our layered mesh representation does not manifest high-frequency changes while still allowing for state-of-the-art rendering quality via the learned alpha-manifolds. This provides us with an extremely lightweight geometry representation without sacrificing visual quality or volumetric effects.
The texture resolution, on the other hand, naturally plays a vital role in rendering quality. To compare, we individually subsample each of the texture maps across all manifolds and frames bilinearly, and render each frame at the original training resolution of 768 × 500. We illustrate our results in Figure 7 and Table 4, where we observe a notable reduction in quality at 128 × 128 resolution.

LIMITATIONS AND FUTURE WORK
Since we export view-independent texture maps, our results do not exhibit view-directional effects, as shown in Figure 8(a). Our pipeline could be extended to estimate specular or roughness maps from the view-dependent component to enable plausible specular relighting in graphics pipelines.
While we demonstrate that radiance manifolds can be leveraged to export lightweight geometry and appearance models, joint learning of these models poses challenges in training stability and requires careful tuning of the relative learning rates of the two models. Besides these challenges, we observe that sampling across discrete manifolds instead of the entire 3D volume causes shell artifacts in extreme poses, as illustrated in Figure 8(b). Our experiments suggest that these artifacts can be mitigated by initializing the manifolds according to the size of the scene and keeping the distance between consecutive manifolds sufficiently small, while also ensuring that these distances are large enough to allow for volumetric effects. In addition to these heuristics and the regularization of the manifold predictor weights, more sophisticated regularization techniques [27] can be utilized to promote training stability and improve geometry predictions. We should note that our method involves sampling across the 3D volume by casting a single ray per pixel and querying MLPs for each point along these rays, which is prone to producing aliasing artifacts [3] and leads to slow training [46]. Using more recent neural rendering frameworks that, for example, combine anti-aliasing techniques and fast grid-based representations [4] would be a possible next step towards improving the overall performance and visual quality of our method, although incorporating radiance manifolds into such pipelines can present nontrivial challenges.
While our spherical mapping to UV-space coordinates works well since our learned radiance manifolds are smooth and mostly convex, such a mapping could be problematic for non-convex regions such as the nose. Also, our spherical sampling technique at inference time could be modified to place denser samples around more detailed regions such as the eyes and the hair. A non-linear UV mapping technique for the face in the context of radiance manifolds would be an exciting direction for future research.

CONCLUSION
In this work, we introduce FaceFolds, a novel representation for high-quality and memory-efficient volumetric rendering of dynamic facial performances in legacy renderers. We achieve this by leveraging radiance manifolds to model the animated performance. Our novel contribution includes a unique sequence of operations and design choices required to make the radiance manifold framework view-independent, enabling the export of the layered mesh and video textures. Once these assets are obtained, our representation does not require any ML-based operations or complex computations, and hence can be easily rendered in standard graphics software on consumer hardware at high frame rates. Our results demonstrate that we still achieve state-of-the-art rendering quality despite securing notable gains in memory and compute footprint.

Fig. 3 .
Fig. 3. Video texture visualization. We illustrate 3 frames (the 15th, 30th, and 45th) from the learned texture video of subject 002914589. For each frame, we show full RGBA and alpha-only UV-space texture maps in the top and bottom rows, respectively.

Fig. 4 .
Fig. 4. Free-viewpoint rendering on Unity. Our representation allows for free-viewpoint rendering of dynamic 3D volumes on consumer hardware. Please refer to the supplementary material for the videos.

Fig. 5 .Fig. 6 .
Fig. 5. Qualitative comparisons. Our method achieves comparable visual quality to state-of-the-art neural rendering techniques while facilitating very efficient rendering of dynamic sequences on legacy graphics software without any custom integration of ML pipelines.

Fig. 7 .
Fig. 7. Ablation results. Number of manifolds: using a sufficient number of manifolds is essential to attain photorealism and volumetric effects. Mesh resolution: we can decimate the exported meshes to much lower resolutions without sacrificing significant visual quality. Texture resolution: we can modify the texture resolution arbitrarily at inference time to trade off image quality with rendering efficiency as desired.

Fig. 8 .
Fig. 8. Limitations. While our method achieves good image quality in general, it suffers from some drawbacks. (a) Since we export only view-independent radiance to the texture, we cannot render specularities such as the ones on the nose, teeth, and cheeks. (b) At extreme out-of-training-distribution viewpoints, our method sometimes exhibits shell artifacts due to the transparency of the layered mesh at grazing views. Please refer to the supplementary material for video visualizations.

Table 2 .
Ablation on number of manifolds. While significant gains in memory efficiency can be achieved by reducing the number of manifolds, doing so has a notable effect on visual quality. Numbers are averaged over a total of 240 test images of two subjects with IDs 002914589 and 002421669. We report the total disk storage required by the meshes and the video texture, as well as the total number of triangles in all meshes.

Table 3 .
Ablation on mesh resolution. Despite reducing the memory footprint of the geometry, visual quality is maintained for resolutions as low as 32 × 32. For all mesh resolutions, the texture resolution is set to 1024 × 1024 and requires 5.9 MB of storage after video compression. Numbers are averaged over 120 test images of a single subject with ID 002914589.

Table 4 .
Ablation on texture resolution. We can reduce the memory footprint of the video texture by simply subsampling each frame at inference time. For renders of resolution 768 × 500, a notable drop in quality occurs at 256 × 256 texture resolution. For all texture resolutions, the mesh resolution is set to 512 × 512; hence the number of triangles and disk storage for the meshes are constant at 6,261,920 and 118.1 MB, respectively. Numbers are averaged over 120 test images of a single subject with ID 002914589. Note that the texture video size will increase with the number of frames in the input video.