HDHumans: A Hybrid Approach for High-fidelity Digital Humans

Photo-real digital human avatars are of enormous importance in graphics, as they enable immersive communication across the globe, improve gaming and entertainment experiences, and can be particularly beneficial for AR and VR settings. However, current avatar generation approaches either fall short in high-fidelity novel view synthesis, generalization to novel motions, reproduction of loose clothing, or they cannot render characters at the high resolution offered by modern displays. To this end, we propose HDHumans, which is the first method for HD human character synthesis that jointly produces an accurate and temporally coherent 3D deforming surface and highly photo-realistic images of arbitrary novel views and of motions not seen at training time. At the technical core, our method tightly integrates a classical deforming character template with neural radiance fields (NeRF). Our method is carefully designed to achieve a synergy between classical surface deformation and NeRF. First, the template guides the NeRF, which allows synthesizing novel views of a highly dynamic and articulated character and even enables the synthesis of novel motions. Second, we also leverage the dense pointclouds resulting from NeRF to further improve the deforming surface via 3D-to-3D supervision. We outperform the state of the art quantitatively and qualitatively in terms of synthesis quality and resolution, as well as the quality of 3D surface reconstruction.


INTRODUCTION
Photo-realistic synthesis of digital humans is a very important research topic in graphics and computer vision. Especially with the recent developments of VR and AR headsets, it has become even more important, since photo-real human avatars can be used to populate virtual scenes or augment real ones. The classical approach to achieve this goal would be the manual creation of human avatars by means of 3D modeling including meshing, texturing, designing material properties, and many more manual steps. However, this process is not only tedious and time-consuming, but it also requires expert knowledge, preventing these techniques from being adopted by non-expert users. A promising alternative is to create such digital human avatars from video captures of real humans. The goal of our approach is to create controllable and highly photo-realistic characters at high resolution solely from multi-view video.
This idea was already the subject of previous research works, which can be broadly categorized based on the employed representation. Some approaches explicitly model the human's surface as a mesh and employ texture retrieval techniques [Casas et al. 2014; Xu et al. 2011] or deep learning [Habermann et al. 2021] to generate realistic appearance effects. However, the synthesis quality is still limited and the recovered surface deformations are of insufficient quality because they are driven purely by image-based supervision. Other works solely synthesize humans in image space [Chan et al. 2019; Liu et al. 2019b, 2020b]. These approaches, however, suffer from 3D inconsistency when changing the viewpoint. Recently, first attempts have also been proposed to combine a neural radiance field with a human body model [Chen et al. 2021; Liu et al. 2021; Peng et al. 2021a,b; Xu et al. 2021]. These works have demonstrated that a classical mesh-based surface can guide a neural radiance field (NeRF) [Mildenhall et al. 2020] for image synthesis. However, since they rely on a human body model or skeleton representation, they do not model the underlying deforming surface well. In consequence, they only work for subjects wearing tight clothing. In stark contrast, we for the first time demonstrate how a NeRF can be conditioned on a densely deforming template, and we even show that improvements can be achieved in the other direction as well, where the NeRF guides the mesh deformation.
In contrast to prior work, we propose a tightly coupled hybrid representation consisting of a classical deforming surface mesh and a neural radiance field defined in a thin shell around the surface. On the one hand, the surface mesh guides the learning of the neural radiance field, enables the method to handle large motions and loose clothing, and leads to a more efficient sampling strategy along the camera rays. On the other hand, the radiance field achieves a higher synthesis quality than pure surface-based approaches, produces explicit 3D constraints for better supervision of explicit surface deformation networks, and helps in overcoming local minima due to the local nature of color gradients in image space. This tight coupling between explicit surface deformation and neural radiance fields creates a two-way synergy between both representations. We are able to jointly capture the detailed underlying deforming surface of the clothed human and also employ this surface to drive a neural radiance field, which captures high-frequency detail and texture. More precisely, our method takes skeletal motion as input and predicts a motion-dependent deforming surface as well as a motion- and view-dependent neural radiance field that is parameterized in a thin shell around the surface. In this way, the deforming surface acts as an initializer for the sampling and the feature accumulation of the neural radiance field, making it significantly (6 times) more efficient and, thus, enabling training on 4K multi-view videos. The deforming surface mesh and the neural radiance field are tightly coupled during training such that the mesh drives the neural radiance field, making it efficient and robust to dynamic changes. Furthermore, not only is the neural radiance field improved based on the tracked surface mesh, but it can also be used to refine the surface mesh, since the neural radiance field drives the mesh towards reconstructing finer-scale detail, such as cloth wrinkles, which are difficult to capture with image-based supervision alone. Thus, a two-way synergy between the employed classical and neural scene representations is created that leads to significantly improved fidelity. Compared to previous work, our approach not only reconstructs deforming surface geometry of higher quality, but also renders human images at much higher fidelity (see Figure 1). In summary, our technical contributions are:
• A novel approach for high-fidelity character synthesis that enables novel view and motion synthesis at a very high resolution, which cannot be achieved by previous work.
• A synergistic integration of a classical mesh-based and a neural scene representation for virtual humans that produces higher-quality geometry, motion, and appearance than either of the two components in isolation.
To the best of our knowledge, this is the first approach that tightly couples a deforming explicit mesh and a NeRF, enabling photo-realistic rendering of neural humans wearing loose clothing.

RELATED WORK
Mesh-based synthesis. Photo-realistic image synthesis of controllable characters is challenging due to the difficulty of capturing or predicting high-quality pose-dependent geometry deformation and appearance. Some works [Carranza et al. 2003; Collet et al. 2015; Hilsmann et al. 2020; Li et al. 2014; Zitnick et al. 2004] focus on free-viewpoint replay of the captured human performance sequence. Other works [Casas et al. 2014; Volino et al. 2014; Xu et al. 2011] aim at the more challenging task of photo-realistic free-viewpoint synthesis for new body poses. However, the method of Xu et al. [2011] needs several seconds to generate a single frame. Casas et al. [2014] and Volino et al. [2014] accelerate the image synthesis process with a temporally coherent layered representation of appearance in texture space. These classical methods struggle with producing high-quality results due to the coarse geometric proxy, and have limited generalizability to new poses and viewpoints that are very different from those in the database. To improve the synthesis quality and generalizability, Habermann et al. [2021] propose a method for learning a 3D virtual character model with pose-dependent geometry deformations and pose- and view-dependent textures in a weakly supervised way from multi-view videos. While great improvements have been made, some fine-scale details are missing in the results because of the difficulty of optimizing deforming polygon meshes with only images as supervision. In this work, we observed that deforming implicit fields is more flexible (e.g., no need for regularization terms to preserve the mesh topology), thus leading to more stable and efficient training. However, the rendering of implicit fields is time-consuming, and editing implicit representations is much more difficult than editing explicit representations, e.g., meshes. Hence, our method unifies implicit fields and explicit polygon meshes, joining the advantages of both worlds.
Image-based synthesis. GANs have achieved great progress in image synthesis in recent years. To close the gap between the rendering of a coarse geometric proxy and realistic renderings, many works formulate the mapping from the coarse rendering to a photo-realistic rendering as an image-to-image translation problem. These works take the renderings of a skeleton [Chan et al. 2019; Kappel et al. 2020; Li et al. 2019; Pumarola et al. 2018; Shysheya et al. 2019; Zhu et al. 2019], a dense mesh [Grigor'ev et al. 2019; Liu et al. 2019a,b, 2020b; Neverova et al. 2018; Prokudin et al. 2021; Raj et al. 2021; Sarkar et al. 2020; Wang et al. 2018], or a joint position heatmap [Aberman et al. 2019; Ma et al. 2017, 2018] as the input to image-to-image translation and output realistic renderings. While these methods can produce high-quality images from a single view, they are not able to synthesize view-consistent videos when changing camera viewpoints. In contrast, our method directly optimizes the geometry deformations and appearance in 3D space, so it is able to produce temporally- and view-consistent photo-realistic animations of characters.
Volume-based and hybrid approaches. Recently, some methods have demonstrated impressive results on novel view synthesis of static scenes by using neural implicit fields [Mildenhall et al. 2020; Niemeyer et al. 2020; Oechsle et al. 2021; Sitzmann et al. 2019; Wang et al. 2021; Yariv et al. 2020, 2021] or hybrid representations [DeVries et al. 2021; Hedman et al. 2021; Liu et al. 2020a; Reiser et al. 2021; Yu et al. 2021] as scene representations. Great efforts have been made to extend neural representations to dynamic scenes. Neural Volumes [Lombardi et al. 2019] and its follow-up work [Wang et al. 2020] use an encoder-decoder network to learn a mapping from reference images to 3D volumes for each frame of the scene, followed by a volume rendering technique to render the scene. Several works extend NeRF [Mildenhall et al. 2020] to dynamic scene modeling with a dedicated deformation network [Park et al. 2020, 2021; Pumarola et al. 2020; Tretschk et al. 2021], scene flow fields [Li et al. 2020], or space-time neural irradiance fields [Xian et al. 2020]. Many works focus on human character modeling. Peng et al. [2021b] and Kwon et al. [2021] assign latent features to the vertices of the SMPL model and use them as anchors to link different frames. Lombardi et al. [2021] introduce a mixture of volume primitives for the efficient rendering of human actors. These methods can only play back a dynamic scene from novel views but are not able to generate images for novel poses. To address this issue, several methods propose articulated implicit representations for human characters. A-NeRF [Su et al. 2021] proposes an articulated NeRF representation based on a human skeleton for human pose refinement. Recent works [Anonymous 2022; Chen et al. 2021; Jiakai et al. 2021; Li et al. 2022; Liu et al. 2021; Noguchi et al. 2021; Peng et al. 2021a; Wang et al. 2022; Xu et al. 2021] present a deformable NeRF representation, which unwarps different poses to a shared canonical space with inverse kinematic transformations and residual deformations. Moreover, HumanNeRF [Weng et al. 2022] has shown view synthesis for human characters given only a monocular RGB video for training. Most of these works cannot synthesize pose-dependent dynamic appearance, are not applicable to large-scale datasets that include severe pose variations, and have limited generalizability to new poses. The most related work to our proposed method is Neural Actor [Liu et al. 2021], which uses a texture map as a structure-aware local pose representation to infer dynamic deformation and appearance. In contrast to our method, they only use a human body model as a mesh proxy, and thus cannot model characters in loose clothes. Furthermore, they only employ the mesh proxy to guide the warping of the NeRF but do not optimize the mesh. In consequence, this method cannot extract high-quality surface geometry. Further, since the mesh proxy is not very close to the actual surface, it still needs to sample many points around the surface, which prevents training on 4K resolution. Instead, we infer the dense deformation of a template that is assumed to be given, which is more efficient and enables the tracking of loose clothing. More importantly, our recovered NeRF even further refines the template deformations.

METHOD
The goal of our approach is to learn a unified representation of a dynamic human from multi-view video, which on the one hand allows synthesizing motion-dependent deforming geometry and on the other hand also enables photo-real synthesis of images displaying the human under novel viewpoints and novel motions. To this end, we propose an end-to-end approach, which solely takes a skeletal motion and a camera pose as input and outputs a posed and deformed mesh as well as the respective photo-real rendering of the human. Figure 2 shows an overview of the proposed method. In the following, some fundamentals are provided (Section 3.1). Then, we introduce our mesh-guided neural radiance field, which allows synthesizing a dynamic performance of the actor from novel views and for unseen motions (Section 3.2). This proposed mesh guidance assumes a highly detailed, accurately tracked, and space-time coherent surface of the human actor. We, however, found that previous weakly-supervised performance capture approaches [Habermann et al. 2020, 2021] struggle with capturing high-fidelity geometry. At the same time, volume-based surface representations [Mildenhall et al. 2020] seem to recover such geometric details when visualizing their view-dependent pointclouds, but they lack space-time coherence, which is essential for the proposed mesh guidance. To overcome this limitation, we propose a NeRF-guided pointcloud loss, which further improves the motion-dependent and deformable human mesh model (Section 3.3).

Fig. 2. Overview of the proposed approach. Our method takes as input a skeletal motion of the actor and predicts high-quality appearance as well as space-time coherent and deforming geometry.
Data assumptions. For each actor, we employ C calibrated and synchronized cameras to collect a segmented multi-view video of the person performing various types of motions. The skeletal motion that is input to our method is recovered using a markerless motion capture software [TheCaptury 2020]. Finally, we acquire a static textured mesh template of the person using a scanner [Treedys 2020], which is manually rigged to the skeleton. Note that our approach does not assume any 4D geometry in terms of per-frame scans or pointclouds as input.

Human Model and Neural Radiance Fields
3.1.1 Deformable Human Mesh Model. We aim at having a deformation model of the human body and clothing, which only depends on the skeletal motion M = {(θ_{t−T}, α_{t−T}, z_{t−T}), ..., (θ_t, α_t, z_t)} and deforms the person-specific template such that motion-dependent clothing and body deformations can be modeled, e.g., the swinging of a skirt induced by the motion of the hips. Here, θ_t ∈ R^57, α_t ∈ R^3, and z_t ∈ R^3 refer to the skeletal joint angles, the root rotation, and the root translation, respectively. (•)_t refers to the t-th frame of the video. In practice, the time window is set to 3 (T = 2), and for the current character pose (θ_t, α_t, z_t) we drop the frame subscript for readability in the following and assume the motion window is fixed. We leverage the character representation of Habermann et al. [2021], which takes the current skeletal pose, embedded deformation parameters A, T ∈ R^(K×3), where K denotes the number of nodes of an underlying deformation graph, and the per-vertex displacement d_i ∈ R^3 for vertex i as input. It returns the posed and deformed vertex v_i.
In order to model the skeletal motion-dependent template deformation, embedded deformations [Sorkine and Alexa 2007; Sumner et al. 2007] are learned by a motion-conditioned graph convolutional network F_eg(f(M)), where f(•) is their proposed skeletal motion-to-graph encoding. To capture dynamic details beyond the resolution of the embedded graph, a second network F_delta(g(M)) learns the additional per-vertex displacements as a function of the skeletal motion, where g(•) is their proposed motion-to-vertex encoding. Now, inserting both networks into Equation 1 leads to the generative character model, which is solely parameterized by the skeletal pose (θ, α, z) and the network outputs that are conditioned on the skeletal motion M. The entire posed and deformed mesh can be derived by stacking the individual vertices into a matrix V ∈ R^(N×3), where N denotes the number of template vertices. Interestingly, the deformation networks can be trained purely from image data using a multi-view silhouette loss L_sil and a dense rendering loss L_chroma, as well as some spatial regularizers L_spatial. We follow their proposed training procedure to obtain pre-trained deformation networks.
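For illustration, the embedded-deformation step [Sumner et al. 2007] can be sketched in pure Python: each vertex is deformed by blending the per-node rigid transforms of the deformation graph and is then offset by the per-vertex displacement d_i. The function name, the explicit blending weights, and the omission of the subsequent skeletal posing are illustrative assumptions, not the exact implementation of the character model.

```python
def embedded_deformation(v, nodes, rotations, translations, weights, displacement):
    """Deform vertex v by blending per-node rigid transforms (embedded
    deformation) and adding the per-vertex displacement d_i.

    v, nodes[k], translations[k], displacement: 3D points/vectors as lists.
    rotations[k]: 3x3 rotation matrix (nested lists) of graph node k.
    weights[k]: blending weight of node k for this vertex (sums to 1).
    """
    out = [0.0, 0.0, 0.0]
    for g, R, t, w in zip(nodes, rotations, translations, weights):
        # Express the vertex relative to the node, rotate, and move it back.
        local = [v[k] - g[k] for k in range(3)]
        rotated = [sum(R[r][k] * local[k] for k in range(3)) for r in range(3)]
        for k in range(3):
            out[k] += w * (rotated[k] + g[k] + t[k])
    # Per-vertex displacement refines details beyond the graph resolution.
    return [out[k] + displacement[k] for k in range(3)]
```

With identity rotations and zero translations, the blend leaves the vertex in place and only the displacement acts, which matches the intuition that the graph handles coarse deformation while d_i adds fine detail.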

3.1.2 Neural Radiance Fields.
A neural radiance field (NeRF) [Mildenhall et al. 2020] is a deep, volumetric scene representation of a static scene, which enables photo-realistic novel view synthesis.
In detail, for rendering an image, each pixel is represented by a camera ray with a normalized direction d ∈ R^3 and an origin o ∈ R^3. Then, i ∈ {0, ..., N_s} samples along the ray at positions x_i = o + t_i d, where x_i ∈ R^3 and t_i is the depth along the ray, are drawn and fed into a Multi-layer Perceptron (MLP), which takes the positional encoding γ(•) of x_i and d as input. Then, the network predicts a color c_i ∈ R^3 and a density value σ_i ∈ R.
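The positional encoding γ(•) maps each coordinate to sines and cosines of exponentially growing frequencies. The following sketch follows the frequency convention of the original NeRF paper, which is an assumption here:

```python
import math

def positional_encoding(x, num_freqs):
    """NeRF-style gamma(.): map every coordinate v of x to
    [sin(2^k * pi * v), cos(2^k * pi * v)] for k = 0 .. num_freqs - 1."""
    out = []
    for v in x:
        for k in range(num_freqs):
            freq = (2.0 ** k) * math.pi
            out.append(math.sin(freq * v))
            out.append(math.cos(freq * v))
    return out

# A 3D point encoded with 10 frequencies yields 3 * 2 * 10 = 60 values.
enc = positional_encoding([0.1, -0.5, 0.3], 10)
```

This encoding lets the MLP represent high-frequency appearance detail that a raw coordinate input cannot capture.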
To obtain the final pixel color, the individual colors and densities are accumulated using volume rendering [Levoy 1990] according to
c = Σ_i T_i (1 − exp(−σ_i δ_i)) c_i, with T_i = exp(−Σ_{j<i} σ_j δ_j),
where δ_i is the Euclidean distance between x_{i+1} and x_i.
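The accumulation above is the standard NeRF quadrature; a minimal, purely illustrative sketch of the per-ray compositing:

```python
import math

def volume_render(colors, sigmas, deltas):
    """Composite per-sample colors c_i and densities sigma_i along one ray:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i, where the
    transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j) is tracked
    incrementally."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for c, sigma, delta in zip(colors, sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for ch in range(3):
            pixel[ch] += weight * c[ch]
        transmittance *= 1.0 - alpha  # light remaining behind this sample
    return pixel
```

A single fully opaque sample returns its own color, while zero density leaves the pixel black, matching the behavior of the integral.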
3.1.3 Discussion. The main advantage of the proposed geometry-based character representation [Habermann et al. 2021] is that it can represent dynamically moving humans, and that the reconstructed and synthesized geometry matches image silhouettes well and shows some plausible coarse deformations. However, there is still a gap in terms of surface accuracy, and the approach suffers from baked-in geometric details originating from the scanned template mesh, which remain almost rigid throughout the deformation. We suspect that this comes from the purely image-based supervision strategy, which prevents the template from being deformed more drastically with respect to the scan. While they enable the synthesis of loose clothing such as skirts, we also found that the learned dynamic texture and the final appearance in image space cannot reach the quality of volume-based neural rendering approaches for static scenes.
NeRF [Mildenhall et al. 2020] has shown state-of-the-art synthesis quality on static scenes. Interestingly, when training a NeRF on humans, the recovered depth maps show detailed wrinkle patterns, despite some noise and outliers. However, the long compute time and the fact that the original NeRF formulation is limited to static scenes prevent it from being directly used on the long dynamic sequences (around 20,000 frames) that we are targeting. Recently, human-specific follow-up works [Chen et al. 2021; Liu et al. 2021; Peng et al. 2021a,b; Xu et al. 2021] have been introduced. However, they usually rely on a human body model and do not account for non-rigid surface deformations, such as the dynamic movement of clothing. As a consequence, these works are limited to types of apparel that tightly align with the human body, while loose clothes such as skirts are beyond their reach.
In the following, we address these problems of 1) limited geometric deformations caused by the purely image-based supervision, 2) limited synthesis quality of geometry-based representations, and 3) the limited types of apparel that can be synthesized by recent NeRF-based approaches. To this end, we propose a novel and non-trivial combination of a deformable mesh-based representation and a neural radiance field, and show that one can overcome the above limitations by tightly coupling these two representations.

Mesh-guided Neural Radiance Fields
First, we establish a tight connection between the deforming human mesh model and a surrounding neural radiance field. Here, we assume the posed and deformed vertex positions V are given by our pre-trained character model (Equation 4).

3.2.1 Motion-dependent Neural Texture. When thinking about defining motion-dependent features on the mesh surface, there usually is the problem of a one-to-many mapping [Bagautdinov et al. 2021; Liu et al. 2021], since (almost) similar motions can lead to various different images, i.e., wrinkle patterns on the clothing. This is due to the fact that the state of the clothing does not only depend on the last few poses of the actor; there exist also other factors. Examples are the initial state of the clothing when the performance starts, external forces such as wind, and second-order dynamics. On the one hand, reliably modeling all of these is intractable, and on the other hand, ignoring them leads to a blurred appearance [Habermann et al. 2021]. Similar to Liu et al. [2021], we encode the actor's pose (θ_t, α_t, z_t) of frame t in the form of a normal map T_norm,t ∈ R^(1024×1024×3). However, different to them, we use the posed and deformed geometry for creating the normal maps. Thus, higher-frequency geometric details are explicitly represented in the normal maps. Then a texture-to-texture translation network [Wang et al. 2018] converts them into dynamic texture maps T_tex,t ∈ R^(1024×1024×3), which contain realistic cloth wrinkle patterns. For creating the training pairs, the posed normal maps can be trivially computed from the posed and deformed mesh. For generating the ground-truth texture maps T_tex,gt,t ∈ R^(1024×1024×3), we leverage the multi-view texture stitching approach proposed by Alldieck et al. [2018]. Simply using an ℓ1 loss L_tex between T_tex,gt,t and T_tex,t would still lead to blurry results for the above-mentioned reasons. Thus, we employ a discriminator loss as proposed in the vid2vid architecture [Wang et al. 2018]. This greatly reduces the problem of the one-to-many mapping.
Finally, we have a UNet-based feature encoder [Isola et al. 2017], F_enc(T_tex,t−T, ..., T_tex,t) = T_enc,t, which takes the generated textures T_tex,t' of the motion window t' ∈ {t − T, ..., t}, concatenating them along the last channel, resulting in an input tensor of dimension 1024 × 1024 × 3(T + 1). The output is a feature texture T_enc,t ∈ R^(1024×1024×32) that will later be used as an input to the NeRF.
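The channel-wise stacking of the motion window can be sketched as follows; nested lists stand in for the actual 1024 × 1024 tensors, and the function name is illustrative:

```python
def concat_texture_window(textures):
    """Concatenate the T+1 per-pose RGB textures of the motion window along
    the channel axis, giving an H x W x 3*(T+1) input for the feature
    encoder. Each texture is a nested list: textures[i][y][x] = [r, g, b]."""
    h, w = len(textures[0]), len(textures[0][0])
    out = [[[] for _ in range(w)] for _ in range(h)]
    for tex in textures:
        for y in range(h):
            for x in range(w):
                out[y][x].extend(tex[y][x])  # append this pose's channels
    return out
```

For a window of two 1 × 1 RGB textures, the result is a single pixel with six channels, mirroring the 3(T + 1)-channel encoder input described above.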
Previous work [Liu et al. 2021] showed that encoding the texture into a feature space, rather than directly using the generated texture as a conditioning input to the NeRF, improves the synthesis quality. Thus, we follow this design choice; however, we choose a UNet-based [Isola et al. 2017] encoder rather than a ResNet-based [He et al. 2016] architecture. This allows us to efficiently encode a higher-resolution feature map (1024 × 1024 vs. 512 × 512 [Liu et al. 2021]). Further, since we predict appearance from motion rather than from a single pose [Liu et al. 2021], we also concatenate the per-pose textures T_tex,t' in the feature channel.
3.2.2 Geometry-guided Sampling. Next, we describe how the NeRF sampling process can be represented as a function of the deforming mesh V given a ray r, and therefore tightly connects the two representations. Assuming the training camera and the ray r are fixed, the i-th sample x_i along the ray is originally defined by Equation 5, where the depth t_i is first sampled uniformly and later based on importance sampling. As r is fixed, o and d are pre-defined. However, we replace t_i with a geometry-dependent function. Here, Φ_u(V) is a depth renderer, which renders the depth map of the mesh with respect to the camera, and u indicates the specific pixel that was rendered. The function δ(•) represents the dilation operator, which computes the maximum depth value in the depth map around the sampled location u. Similarly, ε(•) computes the eroded value, or minimum value, around the sampled location u. We choose a kernel size of 9 × 9 for both operators. The erosion and dilation ensure that the NeRF is also sampled on the foreground when the underlying mesh erroneously does not exactly overlay the ground-truth foreground mask. Moreover, d_min defines the volume that is sampled in front of the actual surface, and similarly d_max defines the volume that is sampled behind the actual surface, by ensuring that the distance between the rendered depth and the depth of the sample point does not exceed d_min and d_max, respectively. We set d_min = d_max = 4 for all results. Lastly, N_s defines the number of samples along the ray. When sampling u, only pixels that project onto the eroded/dilated depth maps are considered. Otherwise, they are discarded during the NeRF evaluation described later. This allows a more effective sampling of the neural radiance field, since most samples are very close to the actual surface. In practice, we only need a single NeRF MLP, in contrast to Mildenhall et al. [2020], who employ a coarse and a fine MLP. Moreover, they draw 64 samples for the coarse MLP and 128 for the fine MLP. Since our mesh provides accurate sampling guidance, we only require 32 samples for generating photo-realistic results. This effectively means our mesh-guided sampling is 6 times more efficient than the baseline, which is especially important when training on 4K multi-view videos.
In contrast to such static-scene NeRFs, we aim at synthesizing novel views and performances of dynamic scenes. Fortunately, the posed and deformed template can also help to enable the synthesis of dynamic scenes using our motion-dependent feature texture T_enc. The main idea is that the motion-dependent deep features attached to the mesh can be propagated to the 3D ray samples. Then, instead of evaluating the NeRF MLP on global coordinates, we condition the MLP on a surface-relative encoding using the signed distance and the texture features. More specifically, Equation 6 is modified to
(c_i, σ_i) = F_nf( γ(dist(V, x_i)), γ(d), Λ(V, x_i, T_enc) ).   (11)
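The geometry-guided sampling can be sketched as follows, assuming the rendered depth map is a nested list with zeros marking background. The helper names and the uniform spacing within the shell are our assumptions; the full method additionally draws the pixel u by projection and applies importance sampling:

```python
def depth_bounds(depth_map, px, py, kernel=9):
    """Eroded (min) and dilated (max) depth in a kernel x kernel window
    around pixel (px, py). Background pixels (depth 0) are ignored; if the
    whole window is background, the ray is discarded (None)."""
    half = kernel // 2
    vals = [depth_map[y][x]
            for y in range(max(0, py - half), min(len(depth_map), py + half + 1))
            for x in range(max(0, px - half), min(len(depth_map[0]), px + half + 1))
            if depth_map[y][x] > 0.0]
    if not vals:
        return None
    return min(vals), max(vals)

def shell_samples(d_eroded, d_dilated, d_min, d_max, num_samples=32):
    """Depth values t_i drawn uniformly inside the thin shell
    [d_eroded - d_min, d_dilated + d_max] around the rendered surface."""
    lo, hi = d_eroded - d_min, d_dilated + d_max
    return [lo + (hi - lo) * i / (num_samples - 1) for i in range(num_samples)]
```

Because the shell is anchored to the rendered mesh depth, 32 samples suffice to cover the region where the surface can actually lie, which is the source of the claimed 6x efficiency gain over unguided sampling.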

Here, Λ(•) takes the mesh V, the sample x_i along the ray, and the feature texture T_enc as input, and samples the feature texture at the UV coordinate of the closest point from x_i to the mesh, resulting in a 32-dimensional feature. dist(•) computes the signed and normalized distance between the mesh and the sample point. Here, points in the interior of the mesh have a negative sign, and points outside the mesh have a positive sign. The term normalized means that the actual distance is divided either by d_min or d_max, depending on whether the sample point is inside or outside the mesh surface. A positional encoding [Mildenhall et al. 2020] is then applied to the distance value using 10 frequencies. Finally, the MLP also takes the positional encoding of the viewing direction using 4 frequencies. Note that Equation 11 only depends on the skeletal motion M (which then defines V and T_enc) and the camera pose (which then defines x_i and d). Thus, this reformulation allows the network to encode the dynamic motion of the actor and allows the NeRF to handle dynamically moving humans.
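The signed, normalized distance term can be sketched as below. The inside/outside test via the surface normal at the closest point is our assumption of one plausible implementation (the closest-point search itself is omitted):

```python
import math

def surface_relative_distance(x, closest_point, surface_normal, d_min, d_max):
    """Signed, normalized point-to-surface distance. The sign comes from the
    surface normal at the closest point (negative inside the mesh), and the
    distance is divided by d_min (inside) or d_max (outside) so the encoding
    lies roughly in [-1, 1] within the sampled shell."""
    diff = [x[k] - closest_point[k] for k in range(3)]
    dist = math.sqrt(sum(d * d for d in diff))
    # Points on the normal side of the surface count as outside.
    outside = sum(diff[k] * surface_normal[k] for k in range(3)) >= 0.0
    signed = dist if outside else -dist
    return signed / (d_max if outside else d_min)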

Supervision.
During training, we fix the pre-trained deformation networks F_eg and F_delta and the texture translation network F_tex, and only train the feature encoder F_enc as well as the NeRF MLP F_nf. Assuming C images of calibrated cameras for a fixed frame are given, a random foreground pixel u from a random camera is chosen, which has the ground-truth color c_gt,u. We employ an ℓ1 loss between the ground-truth color and the estimated one.

NeRF-guided Deformation Refinement
So far, we have discussed how the NeRF representation can leverage the advantages of the underlying 3D template mesh. However, the geometry can also be improved using the neural radiance field.
The key observation is that a weakly supervised setup, as proposed by Habermann et al. [2021], struggles with recovering the finer wrinkles on the clothing (see Figure 3) due to the limited supervision. For the silhouette loss L_sil, the main limiting factor is that it can at most recover details up to the visual hull, which is carved into the 3D volume by the multi-view foreground masks. For the dense rendering loss L_chroma, there are three limitations: 1) the rendering loss is very sensitive to local minima, as gradients of the input image are computed with finite differences on the ground-truth image; 2) this loss struggles with deformations that are out of the camera plane; and 3) the rendering loss cannot account for shadows and view-dependent effects. Fortunately, it can be observed that the per-view pointclouds that can be recovered from the proposed NeRF contain small-scale wrinkles (see Figure 3). Thus, we supervise the template mesh by a 3D-to-3D constraint between the posed and deformed template and the per-view pointcloud, which has the advantage that no explicit per-frame multi-view stereo reconstruction is required.

Pointcloud Extraction.
The per-view pointcloud for a ray r from any given camera of the current frame can be computed as p_r = o_r + t̄_r d_r, where t̄_r denotes the density-weighted average depth of the samples along the ray.

Fig. 3. Here, we visualize the influence of the NeRF-guided geometry refinement using the proposed pointcloud loss. Note that the NeRF pointcloud contains much more geometric detail than the mesh before refinement. Once we have refined F_delta using the view-dependent pointcloud, these geometric details are also contained in the deformed mesh.
Here, we use our mesh-guided NeRF representation (Equation 11) and volume rendering (Equation 7) to acquire the density per ray sample. This density is used to weight each depth sample along the ray, yielding the average depth, which is then multiplied with the viewing direction d_r. Adding the camera origin o_r leads to the final point p_r in global 3D space. We only sample rays for pixels where the dilated depth map is non-zero (foreground pixels). For each frame and view, we sample P = 8192 points, which we found is a good compromise between accuracy and training speed.
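The per-ray lifting step can be sketched as follows; whether the weights are re-normalized is not specified above, so the normalization here is our assumption:

```python
def expected_surface_point(origin, direction, depths, weights):
    """Lift a ray to a 3D surface point: take the (density-)weighted average
    of the sample depths t_i and compute p = o + t_avg * d. Returns None when
    the weights carry no mass, i.e., the ray hit empty space and is skipped."""
    total = sum(weights)
    if total == 0.0:
        return None
    t_avg = sum(w * t for w, t in zip(weights, depths)) / total
    return [origin[k] + t_avg * direction[k] for k in range(3)]
```

Running this for all foreground rays of one camera yields the per-view pointcloud used for the 3D-to-3D supervision.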

Mesh Deformation Refinement. Now, we further refine the mesh deformation network F_delta (while keeping the embedded deformation network F_eg fixed) using an additional pointcloud loss, where N is the number of template vertices and ρ(•) is a robust function that sets the value to zero when it exceeds a certain threshold τ = 4, to ensure robustness with respect to outliers. DeltaNet is then refined with this pointcloud loss together with L_edge, L_sil, and L_laplacian. Here, L_edge is an isometry or edge-length constraint imposed similarly to the one proposed by Habermann et al. [2019]. This constraint has the advantage that it allows local rotations, in contrast to Laplacian regularization, which is important when trying to reproduce wrinkle patterns. L_sil and L_laplacian are a multi-view silhouette loss and a Laplacian regularizer [Habermann et al. 2021]. Figure 3 shows that the proposed NeRF pointcloud loss helps to recover finer wrinkles and ensures that the deformed and posed template better matches the ground truth.
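The robust 3D-to-3D supervision can be sketched as a thresholded nearest-neighbor loss. The brute-force correspondence search and the averaging are illustrative assumptions; the exact matching in the method may differ:

```python
import math

def robust_point_loss(points, vertices, tau=4.0):
    """Match each NeRF point to its nearest template vertex and penalize the
    distance; distances beyond the threshold tau are zeroed out (the robust
    cutoff rho), so outlier points cannot drag the surface away."""
    total = 0.0
    for p in points:
        d2_min = min(sum((p[k] - v[k]) ** 2 for k in range(3)) for v in vertices)
        d = math.sqrt(d2_min)
        if d <= tau:  # rho(.): zero out values exceeding tau
            total += d
    return total / max(len(points), 1)
```

Points already on the surface contribute nothing, and far-away outliers are ignored entirely, which is exactly the behavior the robust function ρ(•) is meant to provide.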
Once the mesh is refined, the whole process can be iterated. We found that the refined geometry further improves the synthesis quality, which ultimately means that a synergy effect between the deformable mesh and the neural radiance field arises, and both improve each other over each iteration until convergence is reached. For more details concerning the training procedure and the implementation, we refer the reader to the supplemental document.

RESULTS
Dataset. We evaluate our proposed approach on the publicly available DynaCap dataset [Habermann et al. 2021], which contains 5 different actors. Three actors wear loose clothes, i.e., two dresses and one skirt. The other two actors wear tighter clothing, i.e., short and long pants and long and short sleeves. Each actor performs a large variety of motions for the training and testing sequences. Further, the motions in the test split significantly differ from the ones contained in the training set. We follow the proposed train/test split of the dataset. The original released dataset has an image resolution of 1285 × 940. However, the authors of the dataset also provided us the full-resolution videos (4112 × 3008) for all sequences.

Qualitative Results
First, we show qualitative results for the image synthesis quality of our approach in Figure 4 for all subjects in the dataset. We visualize novel view synthesis results for training motions (left column) as well as novel motion and view results (right column) and provide a reference image for each actor. Note that in both modes, the results look highly photo-realistic and even small clothing wrinkles can be realistically synthesized. View-dependent appearance effects, such as view-dependent specular highlights on the skin of the actors, are also synthesized realistically. Notably, our method consistently achieves a high synthesis quality irrespective of the clothing type such that even loose clothing can be synthesized well.
We further show qualitative results of our space-time coherent geometry and synthesized motion-dependent deformations in Figure 5. The 3D wrinkle patterns are nicely recovered in the geometry, and the mesh also aligns well to the reference views. Thus, our method is very versatile in the sense that it 1) faithfully reconstructs the geometry of the training sequence and 2) re-renders the training sequences from novel views. Moreover, our method is also capable of 3) synthesizing motion-dependent surface deformations for unseen skeletal motions and 4) synthesizing photo-real images of the actor performing unseen skeletal motions. For more qualitative results, we refer to our supplemental video.

Evaluation Sequence.
We compare to other methods on the challenging D2 sequence of the DynaCap dataset. We do not evaluate on subjects with looser clothing since, apart from Habermann et al. [2021] and our method, no previous work is able to track loose clothing. Every metric is averaged across the entire sequence using every 10th frame. We hold out four cameras (with indices 7, 18, 27, and 40) for testing, which are uniformly sampled in space. For quantitative evaluation, we also reconstruct pseudo ground truth geometry per frame using an off-the-shelf multi-view stereo reconstruction software [Agisoft 2016]. While the overall quality of this geometry is very high, slight reconstruction errors are unavoidable. However, testing on real data is preferable as it is hard to faithfully simulate the complex and dynamic deformations and appearance effects that can be observed in this dataset. For more qualitative comparisons, we also refer to our supplemental video.

Previous Methods and Baselines.
We compare to Neural Actor (NA) [Liu et al. 2021] and A-NeRF [Su et al. 2021], which are also hybrid approaches in the sense that they attach a NeRF to a human body model [Loper et al. 2015] or skeleton. We further compare to Neural Volumes (NV) [Lombardi et al. 2019], which is a neural volume rendering approach, and Neural Body (NB) [Peng et al. 2021b], which leverages structured latent codes that can be posed using an underlying skeleton structure. We also compare to the surface and neural texture-based approach Deep Dynamic Characters (DDC) [Habermann et al. 2021], which is the only related work that also tracks and synthesizes the underlying surface deformation. Last, we compare to NHR [Wu et al. 2020], which uses a point-based scene representation.

Metrics.
To measure image synthesis quality, we first mask all results using the eroded ground truth foreground masks, since even the ground truth masks still contain segmentation errors. Otherwise, falsely classified background pixels that are correctly recovered by the respective methods would erroneously lead to high errors. Then, we evaluate the peak signal-to-noise ratio (PSNR). However, this metric does not reflect the visual perception of humans, i.e., blurry results can have a low error although they appear very unrealistic to the human eye [Zhang et al. 2018]. Hence, we also report the learned perceptual image patch similarity (LPIPS) [Zhang et al. 2018] and the Fréchet inception distance (FID) [Heusel et al. 2017], which are human perception-based metrics. Since, in contrast to our approach, other methods cannot generate results at 4K resolution (4112 × 3008) in a reasonable time, we evaluate all metrics on the downsampled versions (1285 × 940) if not specified otherwise. To measure geometry quality, we report the Chamfer and Hausdorff distances between the pseudo ground truth and the reconstructed results.
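The masked PSNR evaluation described above can be sketched as follows; this is a minimal illustration assuming 8-bit images and a binary foreground mask, with a function name of our own choosing.

```python
import numpy as np

def masked_psnr(pred, gt, mask, max_val=255.0):
    """PSNR evaluated only on (eroded) foreground pixels, so background
    segmentation errors do not dominate the score."""
    m = mask.astype(bool)
    diff = pred[m].astype(np.float64) - gt[m].astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float('inf')  # identical images inside the mask
    return 10.0 * np.log10(max_val ** 2 / mse)
```

In practice, the mask would be the eroded ground truth segmentation, applied identically to every method under comparison.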

Image Synthesis
Accuracy. First, we evaluate the image synthesis quality of our approach and compare it to previous works. In Figure 6, we show a qualitative comparison to previous works. For NB [Peng et al. 2021b], NV [Lombardi et al. 2019], and NHR [Wu et al. 2020], the results are very blurry and contain obvious visual artifacts. Compared to their original results, we found that when using the larger and more challenging DynaCap training dataset, which also contains more variations, the quality of these methods significantly degrades. Thus, these methods seem to be inherently limited to shorter sequences and exhibit limited generalization ability in terms of unseen poses and views. The results of DDC [Habermann et al. 2021] are less blurry compared to the aforementioned methods, but high-frequency wrinkles are still not recovered well. In contrast, NA [Liu et al. 2021] captures such wrinkles, but as mentioned earlier this work can only handle tight types of clothing. In contrast to that, our method is able to synthesize arbitrary types of apparel and also produces the sharpest and most detailed results.
Table 1 also quantitatively confirms that our method achieves the best view synthesis results in terms of perceptual metrics. We provide the numbers for our method when we did not use 4K videos during training (referred to as Ours w/o 4k). Importantly, other methods cannot be trained on 4K video data in a reasonable amount of time, while our method design allows training on such data in general. Notably, we outperform other approaches in terms of LPIPS and FID by 27.2% and 73.4%, respectively. The difference in PSNR, however, is less prominent since this metric is less sensitive to blurry results and, thus, even if results are more blurry, the PSNR can be higher [Zhang et al. 2018]. This explains why DDC has a slightly better score even though our results are significantly more plausible. As stated in Section 3.2.1, we found that the motion-to-appearance mapping is a one-to-many mapping [Bagautdinov et al. 2021; Liu et al. 2021]. While others [Habermann et al. 2021; Lombardi et al. 2019; Peng et al. 2021b; Su et al. 2021; Wu et al. 2020] ignore this, we explicitly account for it by leveraging a discriminator during the training of the texture network. Thus, our generated textures are sharp and plausible, but the exact wrinkles might vary from the ground truth since there is no unique mapping. Consequently, they will not fully align with the ground truth, resulting in a lower PSNR compared to a blurred result. To confirm this, we also evaluate our method when using the ground truth texture maps instead of the synthesized texture maps.

Fig. 6. Here, we visually compare our approach in terms of image synthesis quality to two recent neural human rendering methods, Neural Actor [Liu et al. 2021] and Deep Dynamic Characters [Habermann et al. 2021]. Note that our approach renders sharper images, and smaller details in the image can be much better recovered by our method compared to the previous works.
Table 1. View synthesis error of the D2 sequence. Note that we achieve by far the highest scores for the perception-based metrics, and also in terms of PSNR our method performs better than the previous state-of-the-art methods.

The same tendency can be observed when comparing to other works in terms of motion synthesis (see Table 2). Again, our method achieves the best perceptual results due to the high-quality synthesis of our approach. In terms of PSNR, some methods achieve a higher score although their results are notably very blurred and/or not photo-real. The reason is the same as before, which also for this setting can be confirmed when evaluating the PSNR value for Ours w/ GT textures. However, as confirmed by our qualitative results and the supplemental video, our method clearly outperforms previous works in terms of perceived image quality and photorealism.
4.2.5 Geometry Deformation Accuracy. In Figure 7 and Table 3, we qualitatively and quantitatively evaluate the surface deformation accuracy of our approach and compare it to DDC [Habermann et al. 2021] and NA [Liu et al. 2021]. For NA, we used Marching Cubes to retrieve per-frame reconstructions. The recovered geometries contain a lot of noise since the density field is not regularized, thus resulting in worse performance. DDC is the only previous work that also tracks the space-time coherent deformation of the template. One can see that our method has an overall lower error in terms of surface quality, which is due to our NeRF-guided supervision. Further, new wrinkle patterns, which appear while the actor is performing different motions, cannot be tracked well by DDC, i.e., the wrinkles in the geometry often do not match the ones in the images, as indicated by the red boxes in Figure 7. In contrast, we demonstrate that our NeRF guidance helps the deforming geometry to recover these details.

Note that we show a consistent improvement in terms of perceptual metrics over NA [Liu et al. 2021] and DDC [Habermann et al. 2021]. NA can be considered as a method that leverages an explicit piece-wise rigid mesh to guide a neural radiance field. In contrast, DDC explicitly accounts for non-rigid mesh deformations, but it does not leverage a NeRF representation. Thus, this baseline comparison clearly shows the advantage of uniting a deformable mesh representation with a neural radiance field.
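The symmetric Chamfer distance used in the geometry evaluation can be sketched as follows; this is a brute-force minimal version (the paper does not specify its exact implementation), suitable for small point sets.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    the sum of average nearest-neighbour distances in both directions."""
    # (N, M) matrix of pairwise Euclidean distances via broadcasting.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For large reconstructions, the pairwise matrix would be replaced by a KD-tree nearest-neighbour query, but the metric itself is the same.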

Geometry
Refinement. Next, we evaluate the influence of the NeRF-guided geometry refinement (Section 3.3). In Figure 8, we show how the better geometry tracking helps to achieve a higher synthesis quality, as wrinkle patterns appear sharper. This is also quantitatively confirmed in Table 4, where the result with our proposed refinement (Ours w/o 4k) is consistently better than our result without refinement (Ours w/o ref. and 4k). In terms of geometry error, we found that, compared to the baseline (DDC), the NeRF-guided loss also helps to recover geometry that is closer to the ground truth (see Table 3). Thus, for both tasks, image synthesis and surface recovery, the proposed NeRF-guided geometry refinement improves the results. Importantly, the joint consideration of deformation tracking and synthesis, for the first time, allows us to achieve such photo-real quality for loose types of apparel.

4K Supervision.
We further study the influence of the 4K image resolution in Figure 8. Again, one can see that more details in the image can be preserved when 4K images are used for training, i.e., the black stripes on the shoe are sharper. This is also quantitatively confirmed in Table 4, where we evaluate the metrics on the higher-resolution (4112 × 3008) images instead of the downsampled ones. The error can be further reduced by using the 4K supervision, and training on such data is only possible in an acceptable time due to our more efficient mesh-based sampling and feature attachment strategy. In fact, the original NeRF architecture (using the fine MLP as well) requires 64 + 128 samples per ray, whereas our proposed architecture and sampling only requires 32 samples. This reduces our training time to 10 days compared to 29 days when using the original architecture.

Fig. 9. Our method also enables exciting applications. Here, we show a motion re-targeting result of our method where we apply the motion of the actor with the blue dress to the actor with the yellow skirt. Moreover, our method can be used for video synthesis. To this end, we overlay our actors onto a dynamic video. For both applications, our method achieves photo-realistic results.

Applications
In addition to view and motion synthesis, our method also enables other exciting applications such as motion re-targeting. Figure 9 shows such an application setting where the actor with the blue dress is driving the actor with the yellow skirt. Note that the resulting images are very sharp and even small-scale wrinkles can be synthesized. Further, our method enables video synthesis, where we augment an existing video with our virtual and photo-realistic characters.

LIMITATIONS AND FUTURE WORK
Although our method achieves state-of-the-art results in terms of view and motion synthesis, it still has some limitations, which require future research. First, we currently do not capture and model hand gestures and facial expressions. This sometimes leads to blurry results in these regions, and these body parts can also not be explicitly controlled by the user. In the future, we plan to explore this direction to provide a fully controllable digital avatar. Moreover, our method does not model the incoming light independently from the reflectance properties of the surface. This comes with the limitation that the light is "baked into" the appearance and novel lighting conditions cannot be synthesized. Here, a more explicit decomposition of light and material could potentially solve the problem. We rely on the motion tracking quality of our marker-less motion capture system, and failures in the tracking can lead to artifacts in our results. To overcome this, one could jointly refine the skeletal pose by backpropagating the dense color losses through the entire neural rendering pipeline. Moreover, some artifacts around the boundary between the actor and the background arise from wrong ground truth segmentations during training. Here, we plan to investigate whether the pixel-wise classification can be jointly estimated during training. Last, even though our method is significantly more efficient than the baseline and capable of rendering 4K images within a few seconds, the training time could still be improved and the inference is not yet real-time. In the future, we would like to investigate alternative (potentially more lightweight) network designs [Chan et al. 2022; Fridovich-Keil and Yu et al. 2022; Garbin et al. 2021] and further explore the promising idea of hybrid representations.

CONCLUSION
We proposed HDHumans, a method for view and motion synthesis of digital human characters from multi-view videos. Our method solely takes a skeletal motion and a camera pose as input and produces high-resolution images and videos of unprecedented quality. At the technical core, we propose to jointly learn the surface deformation of the human and the appearance in the form of a neural radiance field. We showed that this has a synergy effect: the combination of both scene representations improves each of them. Our results demonstrate that our method is a clear step forward towards more photo-realistic and higher-resolution digital avatars, which will be an important part of the upcoming era of AR and VR. We also believe that our work can be a solid basis for future research in this direction, which potentially tackles the challenges of real-time compute, relighting, and face and hand gesture synthesis.

Fig. 4. Qualitative results. We show qualitative results for novel views and skeletal motions. Our method achieves a photo-realistic rendering quality, and even individual clothing wrinkles appear sharp in the images.

Fig. 5. Qualitative results showing our reconstructed/synthesized geometry on the DynaCap dataset. Note that, due to the novel NeRF-guided supervision of surface deformations, geometric details (such as clothing wrinkles) can be recovered nicely.

Fig. 7. We compare our surface deformation quality to previous work [Habermann et al. 2021]. Due to our NeRF-guided supervision, geometric details can be better tracked compared to DDC, which solely supervises the geometry in image space.
Proc. ACM Comput. Graph. Interact. Tech., Vol. 6, No. 2. Publication date: August 2023.

Table 2. We also evaluate the motion synthesis quality of our approach and compare it to previous methods. Again, we outperform other works in terms of the perception-based metrics and are comparable to earlier works in terms of PSNR.

Table 3. Here, we evaluate the 3D error on the training skeletal motion of the D2 sequence. Note that the proposed method outperforms previous work in terms of tracking the surface deformation. We further evaluate the 3D error on the test skeletal motion of the challenging D2 sequence. Note that the proposed method also outperforms previous work in terms of deformation synthesis. The error is reported in .

Table 4. We evaluate the influence of our proposed mesh refinement and the influence of training on 4K resolution instead of 1K resolution. Note that the refinement greatly improves the result across all metrics. When testing on 4K resolution, one can see that also the training on 4K improves all metrics.

Deformable Mesh Guidance. First, we evaluate the design choice of uniting an explicit and deformable mesh representation with a neural radiance field. To this end, we leverage the pre-trained deformation networks of DDC [Habermann et al. 2021] to obtain the deformed geometry and apply our mesh-guided radiance field (Section 3.2); this method is referred to as Ours w/o ref. and 4k in Table 4.