Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic Clothing Driven by Sparse RGB-D Input

Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm that can efficiently track the coarse garment shape given sparse depth input. Given the coarse tracking results, the input RGB-D images are then remapped to texel-aligned features, which are fed into the drivable avatar models to faithfully reconstruct appearance details. We evaluate our method against recent image-driven synthesis baselines, and conduct a comprehensive analysis of the N-ICP algorithm. We demonstrate that our method can generalize to a novel testing environment, while preserving the ability to produce high-fidelity and faithful clothing dynamics and appearance.


Figure 1: We present photorealistic full-body avatars that can be driven by sparse RGB-D views (along with body pose and facial keypoints) and faithfully reproduce the appearance and dynamics of challenging loose clothing from the input views. We show the input views, our output and the ground truth reference images in each group of results.

INTRODUCTION
Photorealistic avatars are important for enabling truly immersive and believable telepresence experiences. An ideal telepresence application should not only produce plausible-looking results, but also be complete and accurate: all salient aspects of human appearance should be captured and resynthesized to fully match the real-world states of the subject. However, these properties are particularly challenging to achieve with clothing, an integral part of human appearance, due to its complex movements on the human body.
To deal with this challenge, some previous methods go beyond treating clothing effectively as a part of the human body [Bagautdinov et al. 2021; Liu et al. 2021] and perform explicit modeling of clothing as a separate layer on top of the underlying body [Xiang et al. 2021, 2022]. These methods can work well for pose-driven animation, i.e., synthesizing plausible clothing deformation and photorealistic appearance that are perceptually compatible with the input pose signal. However, there is no guarantee that the animation output will faithfully reproduce the actual states of clothing (Fig. 6), which can distort the conveyed social signals. In fact, the dynamics of clothing cannot be fully explained by the body pose of the current frame or a few previous frames. Given two distinct initial states of clothing, the same body motion can result in completely different trajectories of clothing deformations, especially for loose garments like skirts or dresses. Therefore, it is impossible to infer accurate clothing states given such incomplete input signals.
An alternate approach for telepresence relies heavily on the availability of sensory inputs without a strong human prior. For example, volumetric fusion methods [Dou et al. 2016, 2017; Newcombe et al. 2015] produce a complete geometrical representation of a scene by tracking and fusing observations from sparse RGB-D cameras. Neural implicit functions can also be utilized to reconstruct a dynamically moving human surface from sparse camera inputs [Shao et al. 2022a; Yu et al. 2021b], or even to directly model the radiance field of clothed human appearance [Lin et al. 2022; Shao et al. 2022b]. In theory, these methods are flexible enough to reconstruct arbitrary shapes from the given input streams. However, due to a lack of model constraints, it is generally more challenging for these methods to achieve high-fidelity temporal coherency, especially with noisy or incomplete input, and the output quality is heavily tied to the sensory input. For example, it is hard to produce the sharp and subtle detail in hands [Shao et al. 2022b] given the limited resolution and noise, and the observed and unobserved regions of the input can differ noticeably in output quality [Shao et al. 2022a]. Human priors have been introduced to regularize the predictions [Kwon et al. 2021, 2023], but the ability to reliably handle loose clothing has not been clearly demonstrated.
To leverage the benefits of both families of approaches, we can rely on explicit avatar models as a prior, but expand the driving signal to include the denser input in addition to the body pose. We build avatars with dynamic clothing that can be driven from a sparse set of RGB-D cameras (usually three unless otherwise stated). This formulation allows for more faithful resynthesis of the human appearance, including clothing details. We build on top of DVA [Remelli et al. 2022], which proposes the texel-conditioned avatar, an encoder-decoder model that takes in UV-aligned driving features and predicts geometry and appearance for rendering. However, DVA only works well for tight clothing that closely follows the underlying body, because it relies on a generic body shape prior.
To better handle loose clothing, our insight is to introduce a tracking stage that coarsely aligns the loose clothing surface with the input depth. More specifically, we propose a simple-yet-effective Neural Iterative Closest Point (N-ICP) algorithm to iteratively update a clothing deformation model given the feedback from surface error in a data-driven manner. N-ICP combines the flexibility of classical ICP methods, which allows us to handle large clothing deformations, with learning for more efficient inference and reliable geometry estimates. In contrast to DVA, which uses coarse body geometry to extract features, the N-ICP tracking allows us to extract more accurate and meaningful texel-aligned features. It also eases the burden on the encoder-decoder model, since large deformations and misalignments are handled by the coarse tracking, and ultimately leads to better quality and generalization. In addition, several technical components further improve the texel-conditioned avatars. To aid geometry prediction, we augment texel-aligned features with geometry features computed from depth and coarsely tracked geometry. To improve appearance, we adopt a specific perceptual loss to encourage high-frequency texture detail on the predicted clothing.
Our contributions can be summarized as follows:
• We develop photorealistic full-body avatars with dynamic clothing that faithfully reproduce the original states of the subject's appearance and geometry. These avatars are driven by sparse (up to 3) RGB-D inputs and enable free-viewpoint rendering.
• As an important component of our system, we introduce a Neural Iterative Closest Point algorithm that learns to iteratively find the most effective parameter update to track the input point cloud efficiently with a deformation model.
• Our avatars can directly generalize to a novel testing environment with a different background and illumination, while capturing the complex clothing dynamics and preserving the original high-quality appearance from the training data. We also provide an option to fine-tune the model to adapt to the new appearance in the novel environment.

RELATED WORK
We focus on related work in four areas: photorealistic clothed avatars, sensing-based telepresence, image-conditioned novel-view synthesis, and learning to optimize for non-rigid tracking.
Photorealistic Clothed Avatars. Here we discuss pose-driven avatars that model both the shape and photorealistic appearance of clothed humans, learned directly from captured images in a data-driven manner. Depending on the appearance model, these methods can be roughly divided into three categories: methods that combine coarse human geometry with Deferred Neural Rendering [Grigorev et al. 2021; Raj et al. 2021; Thies et al. 2019], methods that incorporate a human prior into Neural Radiance Fields [Li et al. 2022; Liu et al. 2021; Mildenhall et al. 2020; Peng et al. 2021a; Su et al. 2021; Zheng et al. 2022], and methods based on mesh and dynamic texture [Habermann et al. 2021; Shysheya et al. 2019]. Our method is most closely related to the Full-Body Codec Avatars [Bagautdinov et al. 2021] in the third category, as well as the follow-up work that can handle dynamic clothing [Xiang et al. 2021, 2022].
Most of these methods learn a mapping from body pose or motion to clothed body appearance, without resolving the ambiguity in clothing states for similar body poses. Drivable Volumetric Avatars (DVA) [Remelli et al. 2022] is an exception: it is additionally conditioned on a dynamic texture unwrapped from sparse driving views to incorporate the clothing information. However, it is limited to tight clothing due to a constraint in its body representation. Our method can further track and render dynamically moving loose clothing realistically.
Sensing-Based Telepresence Approaches. Another category of approaches for telepresence relies more heavily on the sensing input for surface reconstruction, usually from one or several RGB(-D) cameras. Volumetric fusion [Dou et al. 2016, 2017; Newcombe et al. 2015] and multiview-conditioned implicit functions [Shao et al. 2022a; Suo et al. 2021; Yu et al. 2021b] have been utilized to reconstruct scene geometry from the sensor input. Early work only reconstructs the geometry [Newcombe et al. 2015], but later work also predicts color by warping and blending RGB information from input views [Dou et al. 2016; Lawrence et al. 2021], possibly assisted by deep neural networks [Shao et al. 2022a; Suo et al. 2021; Yu et al. 2021b]. Neural rendering has also been introduced [Martin-Brualla et al. 2018; Nguyen-Ha et al. 2022] to compensate for artifacts in the reconstructed geometry. These sensing-based methods enjoy flexibility in handling varying topology, but are generally more sensitive to noisy or missing input than the model-based approaches described above.
Image-Conditioned Novel-View Synthesis. Our task can also be regarded as Novel-View Synthesis (NVS) based on sparse input images. Generalizable NeRF is a group of methods that extend the Neural Radiance Field (NeRF) [Mildenhall et al. 2020] and allow the reconstruction of a scene with sparse images as input without per-scene optimization [Chen et al. 2021; Lin et al. 2022; Shao et al. 2022b; Wang et al. 2021; Yu et al. 2021a]. This formulation is thus more suitable for telepresence than the original NeRF, but tends to perform less well when the target view is far away from the sparse input views due to the inherent 3D ambiguity. To alleviate this problem, some methods incorporate prior knowledge of the human body to achieve better quality. Neural Body [Peng et al. 2021b] utilizes the SMPL model to aggregate the temporal information over the multiview videos. It still does not allow direct use of novel images as input, but this limitation is addressed in Neural Human Performer [Kwon et al. 2021] and GP-NeRF [Chen et al. 2022]. Several other methods [Gao et al. 2022, 2023; Kwon et al. 2023; Zhao et al. 2022] predict radiance fields in a canonical space with the help of forward and inverse skinning transformations. These methods additionally allow generalization across identities, which is not the focus of our work. KeypointNeRF [Mihajlovic et al. 2022] utilizes 3D body keypoints to encode the spatial information in the rendered volume. All of these methods have shown limited capability to handle dynamic loose clothing due to their body representation.
Learning to Optimize for Non-Rigid Tracking. Many non-rigid tracking and reconstruction problems are traditionally formulated as optimization. Examples include SMPL model fitting for 3D human pose estimation [Bogo et al. 2016], and non-rigid ICP for deformable surface tracking [Guo et al. 2015]. In recent years, researchers have attempted to incorporate deep neural networks into these problems [Bhatnagar et al. 2020; Bozic et al. 2020, 2021; Li et al. 2020]. Our N-ICP formulation essentially treats the neural network as an optimization solver that iteratively generates a parameter update. Some work [Corona et al. 2022; Song et al. 2020] explores a similar idea for monocular human pose reconstruction in a supervised setting. RMA-Net [Feng et al. 2021] also addresses the non-rigid registration problem with recurrent updates. In addition to the difference in deformation model and loss function, our method integrates learning into optimization more deeply by explicitly feeding the error and gradient into the network for update prediction, which the ablation study shows to be the key to its effectiveness.

METHOD OVERVIEW
Our method takes as input RGB-D images from sparse (up to 3) views, as well as 3D body pose in the form of joint angles (and facial keypoints if available), and generates photorealistic renderings of the subject from an arbitrary viewpoint. The model is trained on images of the subject captured in a dense camera system. We adopt the two-layer representation that has proven effective in previous work on loose clothing [Xiang et al. 2021, 2022]. Our method consists of two major modules. First, in the Neural Iterative Closest Point module, we coarsely track the loose clothing surface given the fused point cloud from the input depth maps using a deformation graph model. Second, we convert the sparse driving RGB-D images into texel-aligned features and feed them into texel-conditioned clothed avatars to infer detailed geometry and view-dependent texture, which are then rasterized to form the output image. The overall pipeline is illustrated in Fig. 2.

NEURAL ITERATIVE CLOSEST POINT
As the first step of our approach, we introduce a Neural Iterative Closest Point (N-ICP) algorithm to coarsely track the dynamic clothing surface using a deformation graph representation of the clothing geometry. Such a module is needed for two reasons. First, coarse tracking provides rough correspondences on the clothing surface across different frames. Previous work [Xiang et al. 2021, 2022] shows that such canonicalization reduces the variance in appearance that needs to be modelled by the downstream module and leads to improved quality. Second, compared with skeleton-level tracking [Remelli et al. 2022], the deformation graph has the flexibility to track the surface at a higher accuracy from the input depth, so that the following stage (Sec. 5) only needs to predict a small geometry correction. This concept of coarse-to-fine modeling has also proven useful in previous work, e.g. [Habermann et al. 2021].
Figure 3: The Neural Iterative Closest Point algorithm. In each iteration, we perform the closest point query between the input point cloud $\mathbf{P}$ and the deformed clothing model $\mathcal{D}(\Theta^{(i)})$. The residual $\mathbf{r}$ and gradient $\mathbf{g}$ are then passed to a neural network $\mathcal{M}$, which finds the best update $\Delta\Theta$ to the parameters. This process is repeated for $N$ iterations.

Non-Rigid ICP. The classical approach to track a deforming surface is the non-rigid Iterative Closest Point (ICP) algorithm [Li et al. 2008]. Given a deformation model $\mathcal{D}$ controlled by deformation parameters $\Theta$, the non-rigid ICP algorithm tracks an input point cloud $\mathbf{P}$ by solving the following minimization problem

$$\min_{\Theta} \; \mathcal{L}_{\mathrm{ICP}}(\Theta, \mathbf{P}) = \sum_{\mathbf{p} \in \mathbf{P}} \left\| \mathcal{C}(\mathbf{p}, \mathcal{D}(\Theta)) - \mathbf{p} \right\|_2^2, \qquad (1)$$

where $\mathcal{C}$ queries the closest point on the deformed mesh $\mathcal{D}(\Theta)$ from a point $\mathbf{p}$ in the point cloud based on Euclidean or projective distance. To keep the deformed mesh in a smooth shape, some regularization terms are often used together with Eq. (1) to penalize extreme distortion. This non-linear optimization problem is usually solved by the Gauss-Newton method, especially its Levenberg-Marquardt variant. Concretely, in each iteration of optimization, the algorithm evaluates the residual vector $\mathbf{r}$, which consists of the offset $\mathbf{r}(\mathbf{p}) = \mathcal{C}(\mathbf{p}, \mathcal{D}(\Theta)) - \mathbf{p}$ for each point $\mathbf{p}$ in the point cloud, and the Jacobian $\mathbf{J}$ of each residual term with respect to the parameters. Then the algorithm searches for the update $\Delta\Theta$ by solving the linear system $(\mathbf{J}^\top \mathbf{J} + \lambda \mathbf{I})\,\Delta\Theta = -\mathbf{J}^\top \mathbf{r}$, where $\lambda > 0$ is a damping coefficient.
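As an illustration of the damped Gauss-Newton (Levenberg-Marquardt) update described above, the following minimal sketch solves the linear system $(\mathbf{J}^\top \mathbf{J} + \lambda \mathbf{I})\,\Delta = -\mathbf{J}^\top \mathbf{r}$ with NumPy. The function name `lm_step` and the toy linear least-squares problem are assumptions made for illustration only; they stand in for the ICP residual and are not the tracking code used in this work.

```python
import numpy as np

def lm_step(J, r, lam=1e-3):
    """Solve (J^T J + lam * I) delta = -J^T r for the parameter update delta."""
    n_params = J.shape[1]
    A = J.T @ J + lam * np.eye(n_params)   # damped normal matrix
    g = J.T @ r                            # gradient of 0.5 * ||r||^2
    return np.linalg.solve(A, -g)

# Toy problem (illustrative): residual r(theta) = A_toy @ theta - b, so J = A_toy.
rng = np.random.default_rng(0)
A_toy = rng.normal(size=(20, 3))
theta_true = np.array([0.5, -1.0, 2.0])
b = A_toy @ theta_true

theta = np.zeros(3)
for _ in range(50):
    r = A_toy @ theta - b
    theta = theta + lm_step(A_toy, r, lam=1e-3)
```

A larger `lam` makes each step shorter and closer to gradient descent, while a smaller value approaches the plain Gauss-Newton step.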
N-ICP. The problem of driving avatars for online telepresence poses a challenge in terms of both robustness and speed for the non-rigid tracking algorithm. The nonlinear minimization problem in Eq. (1) relies heavily on good initialization, so classical methods usually require sequential tracking, from which it is hard to recover after failure, and a technically demanding GPU implementation to meet the runtime constraint [Newcombe et al. 2015; Zollhöfer et al. 2014].
This challenge motivates us to introduce a Neural Iterative Closest Point (N-ICP) algorithm, where the goal is to leverage the prior learned by a deep neural network to make an efficient and robust prediction of the update direction $\Delta\Theta$. We utilize a PointNet [Qi et al. 2017] architecture to operate on the input point cloud $\mathbf{P}$. In addition to the coordinates of each point $\mathbf{p}$, we also use the offset $\mathbf{r}(\mathbf{p}) = \mathcal{C}(\mathbf{p}, \mathcal{D}(\Theta)) - \mathbf{p}$ as a feature, which includes the essential closest point information for solving the non-rigid ICP problem. Inspired by the classical optimization paradigm, we also provide a first-order derivative to the network. For ease of computation, we use the gradient $\mathbf{g} = \mathbf{J}^\top \mathbf{r}$ of the total loss with respect to the parameters, which can be derived automatically in most modern neural network libraries [Paszke et al. 2019], or derived manually for better computational efficiency. The formulation of the network $\mathcal{M}$ can be written as $\Delta\Theta = \mathcal{M}(\mathbf{P}, \mathbf{r}, \mathbf{g})$. The update $\Theta^{(i+1)} = \Theta^{(i)} + \Delta\Theta$ is then performed $N = 3$ times.
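The iterative structure of N-ICP (closest-point query, residual, gradient, predicted update, repeated three times) can be sketched as follows. This is a toy illustration under strong assumptions: the deformation model is reduced to a global translation, the closest-point query is a brute-force nearest-vertex search, and `update_net` is a hand-written placeholder for the learned PointNet, which is not reproduced here. All function names are hypothetical.

```python
import numpy as np

def deform(template, theta):
    """Toy deformation model: a global translation of the template vertices."""
    return template + theta

def closest_points(points, surface_pts):
    """Nearest deformed vertex for each input point (brute-force search)."""
    d2 = ((points[:, None, :] - surface_pts[None, :, :]) ** 2).sum(-1)
    return surface_pts[d2.argmin(axis=1)]

def update_net(points, r, g):
    """Hypothetical stand-in for the learned update network M(P, r, g).
    Here it is just a gradient step scaled by the number of points."""
    return -g / len(points)

def n_icp(points, template, theta0, n_iters=3):
    theta = theta0.copy()
    for _ in range(n_iters):
        r = closest_points(points, deform(template, theta)) - points  # residuals
        g = r.sum(axis=0)  # J^T r; each residual has identity Jacobian for a translation
        theta = theta + update_net(points, r, g)
    return theta

# Toy data (illustrative): a 4x4x4 grid template and a shifted, partial observation.
axis = np.arange(4.0)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
offset = np.array([0.2, -0.1, 0.1])
theta_est = n_icp(grid[::2] + offset, grid, np.zeros(3))
```

In the method described in the text, the placeholder `update_net` is replaced by a trained network, which is what allows a small, fixed number of iterations to suffice.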
Weakly-Supervised Learning. We train the network in a weakly-supervised manner without requiring the ground truth deformation parameters $\Theta$. Instead, we only use the clothing geometry $\bar{\mathbf{P}}$ reconstructed by Multi-View Stereo (MVS). Compared with $\mathbf{P}$, which has missing parts due to occlusion in the sparse depth input, $\bar{\mathbf{P}}$ includes the complete clothing geometry reconstructed from all cameras in the capture studio, and is only used at training time for supervision. Our loss function for training the network is written as $\mathcal{L}_{\text{N-ICP}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{ICP}}(\Theta^{(i)}, \bar{\mathbf{P}})$ plus regularization (see supplementary document). In other words, the network is trained to find an update to the parameters so that the deformed mesh tracks the clothing surface as closely as possible after every ICP iteration.
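The training objective described above averages the ICP loss over the intermediate estimates of every iteration. A minimal sketch, assuming a point-to-point nearest-neighbour distance in place of the closest-point-on-mesh query and omitting the regularization terms (which are only specified in the supplementary document), could look like this. The function names are hypothetical.

```python
import numpy as np

def icp_loss(surface_pts, target_pts):
    """Sum of squared distances from each target point to its nearest surface point."""
    d2 = ((target_pts[:, None, :] - surface_pts[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def n_icp_loss(surfaces_per_iter, target_pts):
    """Mean ICP loss over the deformed surfaces produced at iterations 1..N."""
    return float(np.mean([icp_loss(s, target_pts) for s in surfaces_per_iter]))
```

Supervising every iteration rather than only the last one rewards updates that make progress at each step.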
Clothing Deformation Model. ICP-style methods require good initialization to converge to the right local minimum. We observe that the underlying body pose can provide coarse information about body orientation and articulation as a good starting point. Thus we adopt a hierarchical deformation model for dynamic loose clothing: an outer layer of Linear Blend Skinning (LBS) $\mathcal{W}$ with respect to body pose, and an inner layer of deformation graphs $\mathcal{E}$ [Sumner et al. 2007]: $\mathcal{D}(\Theta) = \mathcal{W}(\mathcal{E}(\mathbf{M}, \Theta), \theta)$, where $\mathbf{M}$ is the template shape of the clothing defined in the canonical space of LBS. The underlying body pose $\theta$ can be obtained from sparse RGB-D input by vision-based keypoint detection followed by inverse kinematics [Mehta et al. 2017], and it remains fixed during the N-ICP process. In this formulation, $\Theta$ is defined as the rotation and translation of deformation graph nodes, and we set its initialization to be the rest pose $\Theta^{(0)} = 0$. Thanks to the global transformation and body articulation encoded in $\theta$, $\mathcal{D}(\Theta^{(0)}) = \mathcal{W}(\mathbf{M}, \theta)$ is close enough to the target $\mathbf{P}$ for N-ICP to converge well. This formulation also makes it efficient to perform per-frame tracking, preventing the failure caused by error accumulation in sequential tracking. The complete N-ICP process is illustrated in Fig. 3.
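The two-layer structure of the deformation model (an embedded deformation graph applied in canonical space, followed by linear blend skinning) can be sketched in NumPy as below. This is an illustrative sketch, not the implementation used in this work: node and bone transforms are passed directly as rotation matrices and translations, the blending weights are assumed to be given and to sum to one per vertex, and all function and argument names are hypothetical.

```python
import numpy as np

def embedded_deformation(verts, nodes, node_w, node_R, node_t):
    """Inner layer: blend per-node rigid transforms (in the style of Sumner et al. 2007).
    verts (V,3), nodes (K,3), node_w (V,K), node_R (K,3,3), node_t (K,3)."""
    local = verts[:, None, :] - nodes[None, :, :]                       # (V,K,3)
    moved = np.einsum("kij,vkj->vki", node_R, local) + nodes + node_t   # (V,K,3)
    return (node_w[:, :, None] * moved).sum(axis=1)

def linear_blend_skinning(verts, skin_w, bone_R, bone_t):
    """Outer layer: blend per-bone rigid transforms.
    skin_w (V,B), bone_R (B,3,3), bone_t (B,3)."""
    moved = np.einsum("bij,vj->vbi", bone_R, verts) + bone_t            # (V,B,3)
    return (skin_w[:, :, None] * moved).sum(axis=1)

def clothing_model(template, nodes, node_w, node_R, node_t, skin_w, bone_R, bone_t):
    """Compose the two layers: skinning applied to the graph-deformed template."""
    deformed = embedded_deformation(template, nodes, node_w, node_R, node_t)
    return linear_blend_skinning(deformed, skin_w, bone_R, bone_t)
```

With identity node rotations and zero node translations, the inner layer leaves the template unchanged, so the output reduces to the skinned template; this corresponds to the rest-pose initialization described in the text.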

TEXEL-CONDITIONED CLOTHED AVATARS
The N-ICP algorithm in Sec. 4 can track large clothing dynamics, providing a good starting point for rendering the clothed body appearance. However, the underlying deformation model is designed to only capture large geometry deformations, and does not model fine geometrical detail or appearance. Therefore, in the next step, we develop a clothed avatar that can produce high-fidelity geometry and appearance when conditioned on both sparse RGB-D views as well as the output from the previous N-ICP stage. The critical question here is how to faithfully reconstruct the appearance detail from the sparse driving views for dynamic clothing.
We build upon the texel-conditioned avatars from Drivable Volumetric Avatars (DVA) [Remelli et al. 2022]. DVA takes in driving signals of several RGB images mapped to texel-aligned features, as opposed to being conditioned primarily on pose [Bagautdinov et al. 2021], and predicts geometry and view-dependent appearance that can reproduce the full-body appearance. Formally, in DVA, the input feature $\mathbf{F}_b$ is the mean unwrapped image from multiple RGB driving views based on the skinned mesh $\mathbf{W}_b = \mathcal{W}(\mathbf{M}_b, \theta)$ of the body template $\mathbf{M}_b$. This unwrapping process can be written as

$$\mathbf{F}_b = \mathcal{U}(\{\mathbf{I}_k\}, \mathbf{W}_b), \qquad (2)$$

where $\mathcal{U}$ denotes the unwrapping operation and $\{\mathbf{I}_k\}$ denotes the set of driving images. The neural avatar $\mathcal{A}_b$ is a convolutional encoder-decoder that takes in $\mathbf{F}_b$, viewpoint $\mathbf{v}$ and body pose $\theta$, and predicts the geometry corrective $\delta\mathbf{G}_b$ and appearance $\mathbf{T}_b$ with

$$[\delta\mathbf{G}_b, \mathbf{T}_b] = \mathcal{A}_b(\mathbf{F}_b, \mathbf{v}, \theta). \qquad (3)$$

The final geometry $\mathbf{G}_b$ is obtained by a function $\mathcal{G}$ that applies $\delta\mathbf{G}_b$ on top of the LBS mesh $\mathbf{W}_b$ with a pre-defined coordinate transformation, $\mathbf{G}_b = \mathcal{G}(\mathbf{W}_b, \delta\mathbf{G}_b)$, which is then used to render the output image together with the view-conditioned appearance $\mathbf{T}_b$.
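To make the idea of a mean unwrapped image concrete, the sketch below projects the 3D position of every texel on the base mesh into each driving view, samples a colour, and averages over the views in which the texel projects inside the image. It is a simplified illustration under stated assumptions: nearest-neighbour sampling, a pinhole camera with world-to-camera convention x_cam = R x + t, intrinsics whose last row is [0, 0, 1], and no occlusion test. The function `unwrap_mean` and its arguments are hypothetical and do not reproduce the DVA implementation.

```python
import numpy as np

def unwrap_mean(texel_xyz, images, Ks, Rs, ts):
    """Per-texel mean of the colours sampled from all views that see the texel.
    texel_xyz (H,W,3): world-space texel positions; images: list of (h,w,3) arrays;
    Ks, Rs, ts: per-view intrinsics, rotations and translations."""
    H, W, _ = texel_xyz.shape
    pts = texel_xyz.reshape(-1, 3)
    acc = np.zeros((H * W, 3))
    cnt = np.zeros((H * W, 1))
    for img, K, R, t in zip(images, Ks, Rs, ts):
        cam = pts @ R.T + t                      # world -> camera frame
        z = cam[:, 2]
        in_front = z > 1e-6
        uvw = cam @ K.T                          # homogeneous pixel coordinates
        safe_z = np.where(in_front, z, 1.0)      # avoid dividing by non-positive depth
        px = np.round(uvw[:, 0] / safe_z).astype(int)
        py = np.round(uvw[:, 1] / safe_z).astype(int)
        h, w = img.shape[:2]
        ok = in_front & (px >= 0) & (px < w) & (py >= 0) & (py < h)
        acc[ok] += img[py[ok], px[ok]]
        cnt[ok] += 1
    return (acc / np.maximum(cnt, 1)).reshape(H, W, 3)
```

Texels that no view observes are left at zero in this sketch. The quality of the sampled colours depends directly on how well the base mesh aligns with the true surface.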
Texel-Conditioned Clothing Avatars. The DVA baseline, however, struggles to handle the large deformation of dynamic clothing. The root of the problem is that the LBS mesh $\mathbf{W}_b$, which encodes only the skeleton-level motion, is too coarse to serve as the base geometry for dynamic clothing. The large deviation of the LBS mesh $\mathbf{W}_b$ from the true clothing surface has two consequences: first, the unwrapping operation in Eq. (2) cannot effectively capture the appearance detail; second, it places a heavy burden on the network $\mathcal{A}_b$, which must predict a large offset $\delta\mathbf{G}_b$ to update the geometry in Eq. (3).
One of the key ideas of our clothed avatar model is to use the non-rigid tracking result $\hat{\mathbf{D}} = \mathcal{D}(\Theta^{(N)})$ from the final N-ICP iteration as the starting point. Because $\hat{\mathbf{D}}$ is already well-aligned with the clothing surface, we can obtain better texel-aligned features with more appearance detail from the unwrapping operation $\mathbf{F}_I = \mathcal{U}(\{\mathbf{I}_k\}, \hat{\mathbf{D}})$. In order to further guide the estimation of the geometry corrective using the driving depth images, we also provide a "depth offset" feature $\mathbf{F}_\Delta$ as input. For each pixel $[x, y]$ in camera $k$ with depth value $d_k$, we associate it with the rendered depth $\tilde{d}_k$ from the tracked geometry $\hat{\mathbf{D}}$ at the same pixel location and compute the offset as

$$\Delta_k[x, y] = \left(\mathbf{R}_k \mathbf{K}_k^{-1} d_k\,[x, y, 1]^\top + \mathbf{t}_k\right) - \left(\mathbf{R}_k \mathbf{K}_k^{-1} \tilde{d}_k\,[x, y, 1]^\top + \mathbf{t}_k\right),$$

where $\mathbf{K}_k$ is the camera projection matrix, and $\{\mathbf{R}_k, \mathbf{t}_k\}$ is the rigid transformation from each camera frame to a unified body root coordinate frame. The depth offset feature is then a 3-channel average tensor obtained by unwrapping $\{\Delta_k\}$ for each driving camera $k$, $\mathbf{F}_\Delta = \mathcal{U}(\{\Delta_k\}, \hat{\mathbf{D}})$. Thus, $\mathbf{F}_\Delta$ can be regarded as the offset to be corrected on top of $\hat{\mathbf{D}}$ in order to match the sensor depth for each UV location. Our clothing model takes in the texel-aligned features $\mathbf{F}_I$ and $\mathbf{F}_\Delta$ as well as the viewpoint $\mathbf{v}$, and predicts the geometry corrective $\delta\mathbf{G}$ and texture $\mathbf{T}$ with $[\delta\mathbf{G}, \mathbf{T}] = \mathcal{A}(\mathbf{F}_I, \mathbf{F}_\Delta, \mathbf{v})$. Finally, the geometry $\mathbf{G}$ and texture $\mathbf{T}$ are used to render the output image. Following [Xiang et al. 2021], the clothing is modelled as a separate layer from the underlying body avatar. The body avatar can be conditioned on body pose $\theta$ or additionally on texel-aligned features [Remelli et al. 2022]. To train the model, besides the standard image reconstruction loss and mesh regularization, we use a part segmentation loss and an ID-MRF perceptual loss [Wang et al. 2018], which are detailed in the supplementary document.
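The per-pixel depth offset described above can be sketched as follows: both the sensor depth and the depth rendered from the tracked geometry are back-projected along the same pixel ray, and their difference is rotated into the body root frame. Because both points share the same camera translation, the translation cancels in the difference and is therefore not an argument here. The function name, the use of zero to mark invalid depth, and the sign convention (sensor minus rendered) are assumptions made for this illustration.

```python
import numpy as np

def depth_offset(depth_sensor, depth_render, K, R):
    """3D offset per pixel between sensor depth and rendered depth, in the root frame.
    depth_sensor, depth_render: (h,w) arrays, zeros mark invalid pixels;
    K: (3,3) intrinsics; R: (3,3) camera-to-root rotation."""
    h, w = depth_sensor.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)  # (h,w,3)
    rays = pix @ np.linalg.inv(K).T                                    # K^{-1} [x, y, 1]^T
    diff = (depth_sensor - depth_render)[..., None] * rays             # camera-frame offset
    valid = (depth_sensor > 0) & (depth_render > 0)
    return np.where(valid[..., None], diff @ R.T, 0.0)
```

The resulting 3-channel image can then be unwrapped into UV space in the same way as the colour images.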

Capture Setup and Detail
We capture a total of three sets of garments: (1) a red dress with a short full skirt; (2) a flared, short skirt (see footnote 3) in a floral pattern with a bottom ruffle; (3) a loose T-shirt and a long skirt. The training data are captured in a multi-view capture studio equipped with roughly 150 RGB cameras. This dense capture setup enables us to use Multi-View Stereo (MVS) to reconstruct geometry, which can be rasterized to depth and used as input to drive our avatars. We split out a small segment of each sequence for testing purposes and use the multi-view captured images as ground truth for evaluation.
To demonstrate the application of our method for telepresence, we additionally capture the subjects with the same garments in a novel environment with different background and illumination. This is designed to be a simpler capture setup, where nine Kinect RGB-D cameras are deployed, and we select three cameras as driving input to our model. We also split the data in the novel environment for fine-tuning and testing.

Evaluation of Neural Iterative Closest Point
We conduct an evaluation of the N-ICP algorithm using the dress sequence. As ground truth, we use the offline registration by non-rigid ICP from [Xiang et al. 2022]. For this evaluation, we adopt the same input as our full method: a point cloud fusing the depth maps from three driving views. As the evaluation metric, we report the mean squared point-to-triangle distance in both directions: from prediction to ground truth and from ground truth to prediction.
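The bidirectional surface error can be illustrated with the simplified sketch below. Note the deliberate simplification: it measures the squared distance to the nearest point (point-to-point) rather than the nearest point on a triangle, so it only approximates the point-to-triangle metric when both surfaces are densely sampled. The function names are hypothetical.

```python
import numpy as np

def mean_sq_nn_dist(src, dst):
    """Mean squared distance from each point in src to its nearest point in dst."""
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()

def bidirectional_error(pred, gt):
    """Return (prediction -> ground truth, ground truth -> prediction) errors."""
    return mean_sq_nn_dist(pred, gt), mean_sq_nn_dist(gt, pred)
```

Reporting both directions matters because each one alone can hide a failure: a prediction that covers only part of the ground truth scores well in the first direction but poorly in the second.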
We compare N-ICP with classical optimization solvers, including L-BFGS and nonlinear Conjugate Gradient (CG) with strong Wolfe line search. Our method converges at a faster rate than the baseline methods, which include both the other gradient-based methods (L-BFGS and CG) and the more complicated Jacobian-based Levenberg-Marquardt method. For ablation studies on the formulation of our N-ICP network, please refer to our supplementary document.

Footnote 3: The paired upper-body garments are tight and modelled in the body layer.

Table 1: Quantitative comparison with various baselines and ablation studies using the loose T-shirt and long skirt sequence captured in the multi-view studio. The metrics are computed on the whole rendered images with a plain background from two different views. We compare with DVA [Remelli et al. 2022] and its two-layer variant, ENeRF [Lin et al. 2022], KeypointNeRF [Mihajlovic et al. 2022], and sensing-based baselines. Ablation studies are also included.

Full Method Evaluation
In this section, we present evaluation and comparison using high-quality data from the dense multi-view capture studio. For comparison with most methods, we report quantitative results on the challenging loose T-shirt and long skirt sequence in Tab. 1. The exception is pose-driven clothed avatars, because the tucked T-shirt is extremely challenging to simulate in [Xiang et al. 2022]. For this category we therefore report results on the skirt with floral pattern, shown in Tab. 2.
Comparison w/ DVA [Remelli et al. 2022]. Although DVA adopts the Mixture of Volumetric Primitives (MVP) [Lombardi et al. 2021] formulation that can render any structure in theory, the spatial arrangement of the primitives relies on the guidance of the base body mesh as initialization. Both the original method and its variant with body and clothing layers are guided by the LBS transformation, which is too coarse to capture the motion of loose clothing. Qualitative comparison is shown on the left of Fig. 7.
Comparison w/ NeRF-based methods. NeRF has been used to model dynamically moving human appearance, but many such methods require per-scene or per-frame fitting [Işık et al. 2023; Shao et al. 2023] and are not directly applicable to our telepresence setting. We only compare our method with representative generalizable NeRF approaches that directly take in sparse image input. ENeRF [Lin et al. 2022] focuses on interpolating a target view from the several nearest source views without a human prior, but does not perform well in our setting due to the difficulty of associating views given wide camera baselines. KeypointNeRF [Mihajlovic et al. 2022] provides relative encoding of body keypoints as an additional input feature for radiance prediction, but cannot handle clothing motion that does not closely follow the body motion. Qualitative comparison is shown on the right side of Fig. 7.
Comparison w/ sensing-based baselines. Most sensing-based approaches [Dou et al. 2017; Martin-Brualla et al. 2018; Shao et al. 2022a; Yu et al. 2021b] are proprietary and tightly integrated with their capture system, without available open-source implementations. Therefore, we instead compare our method with baselines that can be implemented with moderate effort, listed in the third group of results in Tab. 1. These include image warping based on TSDF fusion, optionally followed by a U-Net following the idea of LookinGood [Martin-Brualla et al. 2018]. For details please refer to our supplementary document.
Ablation studies. We conduct ablation studies on several components of our framework; the results are shown in Fig. 8 and the fourth group of Tab. 1. First, the initial tracking by N-ICP provides a basis for the whole framework to estimate the correct clothing geometry and appearance. Without this component, the initialization by only body pose is too coarse and leads to obvious artifacts. Second, we demonstrate that the additional depth offset input to the encoder-decoder allows our method to predict a more accurate overall clothing shape. Third, we verify that the part segmentation loss helps to produce a correct body-clothing boundary. Last, we compare the results with and without the ID-MRF loss, which preserves the high-frequency texture detail in our output when the predicted geometry is not completely photometrically aligned with the driving images.
Comparison w/ pose-driven avatars. A major motivation of our framework is that the underlying body motion does not contain enough information to fully determine the loose clothing states. We validate this intuition by comparison with Clothing Codec Avatars [Xiang et al. 2021] and Dressing Avatars [Xiang et al. 2022], shown in Fig. 6 and Tab. 2. Clothing Codec Avatars struggle to learn the mapping from a sequence of body poses to large clothing dynamics. Dressing Avatars produce clothing motion that looks realistic with the help of physics-based simulation, but the output is not faithful to the actual motion because there is no efficient approach for estimating the underlying physical parameters for simulation. Our method utilizes the driving signal from sparse RGB-D input to achieve faithful clothing telepresence, which is verified by the low error on the evaluation metrics in Tab. 2. Please refer to the supplementary document for a more detailed discussion of the differences in formulation (e.g. supervised vs. self-supervised).

Results in the Novel Capture Environment
For the novel capture environment, we test our model in two different scenarios: without and with fine-tuning, both shown in Fig. 9. The first scenario refers to a direct application of the model trained in the dense capture studio to the same subject in the novel environment. For this scenario, our model needs to handle the differences in the input RGB-D images between the training and testing environments, such as illumination and sensor setup. Note that our model is trained to be robust to these variations in the input and to preserve the appearance style of the original training data in its output (see supplementary document), while following the unseen body and clothing motion captured in the novel environment. This experiment demonstrates the ability of our method to directly generalize to a novel environment.
In the second scenario, we test our model after fine-tuning it with the training data captured in the new environment. Note that we use the same three Kinects for both driving input and ground truth supervision. During fine-tuning, the output of our model adapts to the new appearance caused by the illumination and sensor specification in the environment, as well as changes in the body over time, such as hair style. We also test our model with an even more aggressive change in the body-layer appearance, from the tank top to the green capture suit, in the second row of Fig. 9. The fine-tuning step can be viewed as an option to further boost the output quality if time and computation budgets permit.

CONCLUSION
We presented a framework for building photorealistic full-body avatars that can be driven by sparse RGB-D inputs and faithfully reproduce the motion of loose clothing. Our method accurately reconstructs the challenging clothing appearance of the subjects, thus tackling a major drawback of existing pose-driven avatars. As for limitations, our model is still person- and garment-specific and cannot handle clothing motion that falls far outside the deformation space of the deformation graph model, such as topology change.
Interesting future directions would be to extend our method to a multi-identity setting and to develop a formulation that can handle more generic garment categories, e.g., with implicit representations.

RELATED WORK (CONTINUED)
Template-Based Performance Capture. Our work is also related to a group of methods that track the human surface by deforming a person-specific or category-specific template or avatar, using either classical optimization [Habermann et al. 2019; Robertini et al. 2016; Xiang et al. 2020; Xu et al. 2018] or network prediction [Habermann et al. 2021, 2020; Jiang et al. 2023; Li et al. 2022]. They achieve better temporal coherency than template-free methods that regress human shape for each frame [Li et al. 2020; Saito et al. 2019, 2020; Xiu et al. 2023, 2022], but focus on reconstructing human geometry rather than modeling dynamic appearance.

ABLATION STUDIES ON N-ICP
We conduct ablation studies on our design of the N-ICP algorithm. The results are shown in Tab. 1. The most naive baseline is to simply use the point cloud as the input feature, shown in the first row of the table. In the second row, we add the closest point residual to the input feature, which provides useful information for surface alignment and enables an iterative update of the deformation parameters. The following rows suggest that the energy gradient derived from the residuals can provide more effective guidance, similar to its critical role in traditional nonlinear optimization. The last two rows verify the benefit of iterative parameter updates compared with a one-shot prediction by the network.
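To make the roles of the residual, the energy gradient and the iterative update concrete, the following is a minimal sketch under strong simplifying assumptions: the deformation model is reduced to a single global translation `theta`, correspondences are found by brute force, and a plain gradient step stands in for the update that N-ICP predicts with a network. The function names and the step size are hypothetical and not taken from the paper.

```python
def nearest(point, cloud):
    """Brute-force closest point in the cloud (a real system would use a spatial index)."""
    return min(cloud, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))


def icp_features(surface, cloud, theta):
    """Return per-vertex closest-point residuals and the energy gradient w.r.t. theta."""
    moved = [tuple(a + t for a, t in zip(v, theta)) for v in surface]
    residuals = [tuple(a - b for a, b in zip(m, nearest(m, cloud))) for m in moved]
    n = len(residuals)
    # With correspondences held fixed, E(theta) = 1/(2n) * sum ||v + theta - c||^2,
    # so dE/dtheta is simply the mean residual.
    gradient = tuple(sum(r[k] for r in residuals) / n for k in range(3))
    return residuals, gradient


def iterative_fit(surface, cloud, steps, step_size=0.5):
    """Refine theta over several update iterations; steps=1 mimics a one-shot prediction."""
    theta = (0.0, 0.0, 0.0)
    for _ in range(steps):
        _, gradient = icp_features(surface, cloud, theta)
        # Stand-in for the learned update: a gradient step instead of a network output.
        theta = tuple(t - step_size * g for t, g in zip(theta, gradient))
    return theta
```

In this toy setting, running more update iterations brings `theta` closer to the true offset than a single step, which mirrors the qualitative trend reported for iterative versus one-shot prediction.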

DETAIL OF COMPARISON WITH SENSING-BASED BASELINES (SEC. 6.3)
Here, we provide the implementation details of the sensing-based baselines for the experiment in Sec. 6.3 of the main paper. We first fuse the sparse input depth maps into a single Truncated Signed Distance Field (TSDF) volume [Curless and Levoy 1996; Dong et al. 2022], and then extract from it an explicit mesh representation.
Using the fused geometry, we can then warp the input RGB images from the source views to any target view. However, the warped image is usually imperfect because the fused geometry is often incomplete and noisy. Therefore, we follow the idea of LookinGood [Martin-Brualla et al. 2018] and train a U-Net to complete the warped image. This baseline essentially learns to inpaint the complete human appearance from partial input in screen space only, and struggles to achieve 3D-aware temporal consistency in the output.
As explained in the main paper, this experiment is not intended to be a full-scale comparison against state-of-the-art sensing-based approaches, but to better understand our method in comparison to a modest baseline along this line of work given similar input.
Work done when DX was a visiting researcher at Meta. ZC and CW are now at Google.

COMPARISON WITH CLOTHING CODEC AVATARS AND DRESSING AVATARS
We highlight the differences in formulation between our method, Clothing Codec Avatars (CCA) [Xiang et al. 2021] and Dressing Avatars (DA) [Xiang et al. 2022]. DA and our method can generate richer and more realistic dynamics for loose clothing than CCA, but DA requires a proprietary implementation of real-time cloth simulation. CCA and DA utilize ground truth clothing registration to train their models, while our method does not require such pre-processing. Finally, thanks to the additional RGB-D input and our model design, our output is more faithful to the actual clothing motion than those two previous methods.

IMPLEMENTATION DETAIL

5.1 Clothing Deformation Graph
In Fig. 1, we provide a visual illustration of the deformation graph E in the inner layer of the clothing deformation model D (Sec. 4 of the main paper) for the dress example. The parameters of the deformation graph comprise the rotation and translation of each node:
$$\theta = \{(\mathbf{r}_i, \mathbf{t}_i)\}_{i=1}^{N}, \qquad (1)$$
where $\mathbf{r}_i$ is the axis-angle representation of a 3D rotation and $\mathbf{t}_i$ is the translation of the $i$-th node. We use a total of $N = 125$ nodes for each example.
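As an illustration of the per-node parameterization, the sketch below converts an axis-angle vector to a rotation matrix with Rodrigues' formula and applies one node's transformation to a point. The convention of rotating about the node's rest position before translating follows the common embedded-deformation formulation and is an assumption here, as are the function names; the blending of multiple nodes is omitted.

```python
import math


def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vector (axis scaled by angle) to a 3x3 rotation."""
    angle = math.sqrt(sum(component * component for component in r))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (component / angle for component in r)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    k = 1.0 - cos_a
    return [
        [cos_a + x * x * k, x * y * k - z * sin_a, x * z * k + y * sin_a],
        [y * x * k + z * sin_a, cos_a + y * y * k, y * z * k - x * sin_a],
        [z * x * k - y * sin_a, z * y * k + x * sin_a, cos_a + z * z * k],
    ]


def transform_point(p, node, r, t):
    """Apply one node's transformation: rotate p about the node position, then translate.

    Assumed convention (not confirmed by the paper): R (p - node) + node + t.
    """
    rot = axis_angle_to_matrix(r)
    local = [a - b for a, b in zip(p, node)]
    rotated = [sum(rot[row][col] * local[col] for col in range(3)) for row in range(3)]
    return tuple(rotated[k] + node[k] + t[k] for k in range(3))
```

With 125 nodes, this parameterization amounts to 125 × 6 = 750 scalar parameters per frame.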

Training Setup
The balancing weight of the deformation graph regularization is set to $\lambda_{\text{DG-Reg}} = 1 \times 10^{-3}$. The trainable parameters in N-ICP are those of the PointNet M. The input and output of the PointNet M are converted to the root body coordinate frame of the subject given the tracked body pose, so that they are invariant to the global orientation and translation. We use the AdamW optimizer with an initial learning rate of $1 \times 10^{-5}$.
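The deformation graph regularizer used when training N-ICP compares, for each pair of nodes, where the two node transformations send the midpoint of their rest positions. A minimal sketch of that computation is given below, assuming each SE(3) transformation is stored as a pair of a 3x3 rotation matrix and a translation; the function names and this storage layout are hypothetical.

```python
def apply_transform(transform, point):
    """Apply an SE(3) transformation given as (3x3 rotation rows, translation)."""
    rotation, translation = transform
    return tuple(
        sum(rotation[row][col] * point[col] for col in range(3)) + translation[row]
        for row in range(3)
    )


def dg_reg(transforms, rest_positions):
    """Mean squared disagreement of node transformations at pairwise midpoints."""
    n = len(transforms)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Midpoint between the rest positions of nodes i and j.
            mid = tuple((a + b) / 2.0 for a, b in zip(rest_positions[i], rest_positions[j]))
            moved_i = apply_transform(transforms[i], mid)
            moved_j = apply_transform(transforms[j], mid)
            total += sum((a - b) ** 2 for a, b in zip(moved_i, moved_j))
    # Average over all ordered pairs of distinct nodes.
    return total / (n * (n - 1))
```

The term vanishes when all nodes share the same rigid transformation and grows as neighboring transformations disagree, which is why a small weight such as the one above is enough to discourage implausible local distortion.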
Initialization. We find it crucial to initialize the parameters in the last layer of the PointNet with values close to zero, so that $\theta^{(s)} \approx 0$ for $s = 1, \dots, S$ at the first training iteration, with $\theta^{(0)}$ set to 0. In this way, thanks to the two-layer clothing deformation model (Sec. 4 in the main paper), $D(\theta^{(s)})$ is close enough to the ICP target to generate meaningful gradients at the beginning of the training process, and gradually converges to the desired minimum. In practice, we initialize the last layer of the network by random sampling from a uniform distribution $U[-\epsilon, \epsilon]$, where $\epsilon = 1 \times 10^{-6}$.
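The effect of this initialization can be sketched with a toy linear output layer. The snippet below is only an illustration: the layer sizes, the zero bias and the helper names are assumptions, and the real network is a PointNet rather than a single linear map.

```python
import random


def init_last_layer(n_in, n_out, eps=1e-6, seed=0):
    """Sample last-layer weights uniformly from [-eps, eps] (bias assumed to be zero)."""
    rng = random.Random(seed)
    return [[rng.uniform(-eps, eps) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights, features):
    """Apply the bias-free linear layer to one feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


if __name__ == "__main__":
    weights = init_last_layer(n_in=128, n_out=6)
    update = linear(weights, [1.0] * 128)
    # Each output is bounded by n_in * eps, so the first predicted updates are tiny.
    print(max(abs(u) for u in update))
```

Because every output is bounded in magnitude by the input width times eps for inputs in [-1, 1], the first predictions leave the deformation parameters essentially at their zero starting point.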
Discussion on supervision. N-ICP is trained in a self-supervised manner, because the loss function $\mathcal{L}_{\text{N-ICP}}$ does not involve "ground truth" deformation parameters. The reasons are two-fold. First, it takes extra processing time and effort to obtain the ground truth. Second, the problem of estimating reliable "ground truth" deformation parameters is challenging by itself. Unless the garment under capture has been specially designed to encode correspondences in a printed pattern [Halimi et al. 2022], the principal approach is to run offline ICP between the deformation model and the MVS geometry. In this way, the "ground truth" essentially offers no more information than directly supervising N-ICP by MVS. The self-supervised formulation, instead, allows solving a global optimization by sharing information across all frames.

For the texel-conditioned clothed avatars, $\mathcal{L}_{\text{RGB}}$ and $\mathcal{L}_{\text{mask}}$ are the standard $\ell_1$ losses for RGB and mask respectively; $\mathcal{L}_{\text{reg}}$ is the Laplacian regularization term for the body and clothing meshes. $\mathcal{L}_{\text{part}}$ is similar to $\mathcal{L}_{\text{mask}}$ but identifies background, body and clothing as three different categories. Following [Feng et al. 2022], we use the ID-MRF loss [Wang et al. 2018], a stronger form of perceptual loss, to encourage sharpness of high-frequency texture in the clothing region. We use $\lambda_{\text{RGB}} = 0.2$, $\lambda_{\text{mask}} = \lambda_{\text{part}} = 500.0$, $\lambda_{\text{reg}} = 100.0$, $\lambda_{\text{ID-MRF}} = 1.0$. The gradients of the loss functions defined in image space (RGB, mask, part and ID-MRF) with respect to the network parameters are back-propagated through a differentiable rasterizer. We use the AdamW optimizer with a learning rate of $1 \times 10^{-3}$.

Color Augmentation. In order to deal with the domain gap in illumination and color when directly applying the avatars to the novel capture environment (Sec. 6.4 in the main paper), we apply a random color augmentation to the texel-aligned RGB features using the 'ColorJitter' function in TorchVision at training time. Notice that we leave the ground truth images used for supervision in $\mathcal{L}_{\text{avatars}}$ unchanged, so that the network always preserves the original appearance in the output, despite a different color mode in the input features when we directly apply the model to the novel environment. The output appearance only changes after fine-tuning with ground truth images in the novel environment.
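The key point of the color augmentation is that only the input features are perturbed while the supervision target stays fixed. The sketch below illustrates this with a simplified brightness and contrast jitter written in plain Python as a stand-in for TorchVision's 'ColorJitter'; the jitter ranges and the function name are hypothetical, since the paper does not state the augmentation strengths.

```python
import random


def jitter_colors(pixels, brightness=0.2, contrast=0.2, rng=None):
    """Return a randomly brightness- and contrast-jittered copy of RGB pixels in [0, 1]."""
    rng = rng or random.Random()
    b = rng.uniform(1.0 - brightness, 1.0 + brightness)
    c = rng.uniform(1.0 - contrast, 1.0 + contrast)
    scaled = [[channel * b for channel in px] for px in pixels]
    mean = sum(sum(px) for px in scaled) / (3 * len(scaled))

    def clamp(value):
        return min(1.0, max(0.0, value))

    # Contrast: move each channel away from (or towards) the mean intensity.
    return [tuple(clamp(mean + c * (channel - mean)) for channel in px) for px in scaled]


if __name__ == "__main__":
    features = [(0.2, 0.4, 0.6), (0.8, 0.1, 0.3)]  # texel-aligned RGB input (toy data)
    target = [(0.2, 0.4, 0.6), (0.8, 0.1, 0.3)]    # ground truth used in the loss
    augmented = jitter_colors(features, rng=random.Random(0))
    # Only `augmented` would be fed to the network; `target` is left untouched.
    print(augmented, target)
```

Training against an unperturbed target is what pushes the network to ignore global color shifts in its input.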

Preprocessing and Postprocessing
5.3.1 Input Preprocessing. Our method takes RGB and depth images as input. When training and testing using data from the dense capture studio, we run image-based part segmentation and transfer the result to the MVS mesh by projection and visibility check. This operation allows us to extract the clothing region. The MVS mesh may include floating noise, which we remove by checking the mesh connectivity and setting a threshold on the minimal number of vertices in a connected component. Then, we rasterize the segmented mesh to RGB views to "simulate" a depth image.
When training and testing in the novel environment, we use the RGB-D images from calibrated Kinect sensors as input. We also run image-based part segmentation to extract the clothing regions. Then we use TSDF fusion [Curless and Levoy 1996] and Marching Cubes [Lorensen and Cline 1987] to form a mesh from the extracted depth images, which allows us to perform a similar connectivity check as above to remove noise from the depth sensors.
5.3.2 Temporal Smoothing. Due to the unstructured point cloud input, the output of N-ICP may exhibit undesirable jitter. We apply temporal smoothing to the output of N-ICP by averaging the vertex positions in a small temporal window, which is feasible because the N-ICP output shares a consistent registered topology across all frames. The filtered meshes are then used to unwrap texel-aligned features and as input to the texel-conditioned avatars, as shown in Fig. 2 of the main paper. We find no need to apply additional smoothing to the final output of the texel-conditioned avatars if the provided initial tracking is temporally stable and the floating depth noise has been removed in the preprocessing step.
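The connectivity check used to remove floating noise can be sketched with a union-find over triangle vertices, as below. This is an illustrative implementation, not the one used in the paper; the function name and the threshold value in the usage are hypothetical.

```python
def remove_small_components(num_vertices, faces, min_vertices):
    """Keep only faces whose connected component has at least min_vertices vertices."""
    parent = list(range(num_vertices))

    def find(v):
        # Path-halving find.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Vertices that share a triangle belong to the same component.
    for a, b, c in faces:
        parent[find(a)] = find(b)
        parent[find(b)] = find(c)

    sizes = {}
    for v in range(num_vertices):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1

    return [face for face in faces if sizes[find(face[0])] >= min_vertices]


if __name__ == "__main__":
    # Two triangles sharing an edge (4 vertices) plus one isolated triangle (3 vertices).
    faces = [(0, 1, 2), (1, 2, 3), (4, 5, 6)]
    print(remove_small_components(7, faces, min_vertices=4))
```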
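Since every frame shares the same registered topology, the temporal smoothing reduces to a per-vertex moving average. The sketch below assumes a centered window of a given radius, clipped at the sequence boundaries; the window shape and size are assumptions, as the paper only states that the window is small.

```python
def smooth_vertices(frames, radius=1):
    """Average each vertex over a centered temporal window.

    frames: list over time of lists of (x, y, z) tuples with identical vertex order.
    """
    smoothed = []
    for t in range(len(frames)):
        lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
        window = frames[lo:hi]
        smoothed.append([
            tuple(sum(frame[i][k] for frame in window) / len(window) for k in range(3))
            for i in range(len(frames[t]))
        ])
    return smoothed
```

A centered window uses future frames and therefore adds latency of `radius` frames in a live setting; a causal window would avoid this at the cost of some lag in the smoothed motion.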

Collision.
To resolve the collision between the body and clothing layers, which is usually slight in the results, we follow Clothing Codec Avatars [Xiang et al. 2021] (Sec. 6) and project the clothing vertices in collision beyond the nearest body points by a slight margin. More sophisticated ways to handle collision based on geometry or learning [Tan et al. 2022] may be incorporated, which we leave for future work.
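A minimal sketch of this projection step is shown below. It assumes that each body point comes with an outward unit normal, that the nearest body point is found by brute force, and that the margin is 2 mm in metric units; none of these details are specified in the text, and the function name is hypothetical.

```python
def resolve_collisions(cloth, body, normals, margin=0.002):
    """Push clothing vertices that penetrate the body out along the body normal."""
    resolved = []
    for v in cloth:
        # Brute-force nearest body point.
        idx = min(
            range(len(body)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(v, body[i])),
        )
        p, n = body[idx], normals[idx]
        # Signed distance of the clothing vertex along the outward body normal.
        signed = sum((a - b) * c for a, b, c in zip(v, p, n))
        if signed < margin:
            v = tuple(a + (margin - signed) * c for a, c in zip(v, n))
        resolved.append(tuple(v))
    return resolved
```

Vertices already outside the body by more than the margin are left untouched, so the correction only affects the small set of vertices in collision.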

5.4.2 Texel-Conditioned Clothed Avatars. The overall architecture of the texel-conditioned avatar models is shown in Fig. 2. Given the texel-aligned RGB and depth features unwrapped from the initial tracking results D as input, the encoder produces a feature map that is spatially aligned with the input. The encoded feature maps are then decoded into a vertex offset map, from which the offsets are extracted and then applied on top of the initial tracking to obtain the output geometry G. The geometry G and the viewpoint v are used together to compute the view conditioning, including the viewing vector expressed in the local Tangent-Bitangent-Normal (TBN) frame [Xiang et al. 2022] as well as its reflected direction. The view-dependent U-Net takes in the view conditioning and the view-independent texture to produce an additive view-dependent offset. With the final geometry G, we also compute the ambient occlusion, which is fed into the shadow U-Net to produce a multiplicative shadow map. The view-dependent texture is then upsampled to 2k resolution by an upsampling network.
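The view conditioning can be illustrated for a single surface point as follows. The sketch assumes an orthonormal TBN frame and a unit viewing vector pointing from the surface towards the camera; these conventions and the function name are assumptions, not details confirmed by the paper.

```python
def view_conditioning(view_dir, tangent, bitangent, normal):
    """Express the viewing vector in the local TBN frame and reflect it about the normal."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Coordinates of the viewing vector in the (tangent, bitangent, normal) basis.
    local = (dot(view_dir, tangent), dot(view_dir, bitangent), dot(view_dir, normal))
    # Reflection r = 2 (n . v) n - v; in the TBN frame the normal is (0, 0, 1),
    # so the tangential components flip sign and the normal component is kept.
    reflected = (-local[0], -local[1], local[2])
    return local, reflected
```

Working in the local frame makes the conditioning independent of the global orientation of the surface, so the same view-dependent effect is encoded identically wherever it occurs on the garment.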
To specify the architecture of the individual networks above, we define the blocks shown in Fig. 3.
(1) Convolutional encoder consists of the network blocks in the following table.Following DVA [Remelli et al. 2022], we find that using a U-Net at 64 × 64 resolution instead of a bottleneck structure helps to preserve the UV-space detail in the output.

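[Table: encoder blocks and their output sizes. Columns: Block, Output. Only the end of the last row survives: "... 64, 32)" with output 32 × 64 × 64.]

(2) View-independent decoder consists of the network blocks in the following table. Here, the "RepeatChannels" operation repeats the channels of the input feature for the geometry and texture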

Figure 2: An overview of our method. The Neural Iterative Closest Point module efficiently tracks the clothing surface from the input point cloud P with a clothing deformation model D; the initial tracking result D is then used to unwrap the driving images and depth maps into texel-aligned features, which are then fed into the texel-conditioned avatar, together with body pose, facial keypoints and viewpoint v, to predict the output image.

Figure 4: Comparison between our method and classical non-rigid ICP with different types of optimization solvers. For each method, we plot the runtime in seconds vs. the Mean Squared Surface Error (MSE) in mm$^2$ for the surface tracking results. The square markers on the curves denote individual steps in the optimization.

Figure 5: Comparison with basic sensing-based baselines. Given the depth input from 3 views, we run TSDF fusion [Curless and Levoy 1996; Dong et al. 2022] to obtain a proxy geometry (first column), and then warp the driving RGB images to the target view (second column). We train a U-Net following LookinGood [Martin-Brualla et al. 2018] to inpaint the missing regions (third column).

Figure 6: Qualitative comparison with Clothing Codec Avatars [Xiang et al. 2021] and Dressing Avatars [Xiang et al. 2022]. The results are shown on the top row; the input views and ground truth are shown on the bottom. The $\ell_1$ error in the skirt region is reported beside the names of the methods.

Figure 8: Ablation studies on components in our framework, including N-ICP, depth offset as input to the encoder-decoder, part segmentation loss, and ID-MRF loss. Each result is shown in comparison to our full output and the ground truth.

Figure 1: A visualization of the deformation graph E used in the dress example. On the left side, we show the coordinate frame at each graph node and their connectivity by the red lines. On the right side, the region of influence of a node located in the center is shown in red.
5.2.1 N-ICP. When training the N-ICP module, we adopt a regularization term for the deformation graph that compares the difference in transformation between adjacent nodes:
$$\mathcal{L}_{\text{DG-Reg}} = \frac{1}{N(N-1)} \sum_{1 \le i \ne j \le N} \left\| \mathbf{T}_i \mathbf{m}_{ij} - \mathbf{T}_j \mathbf{m}_{ij} \right\|^2, \qquad (2)$$
where $\mathbf{T}_i$ and $\mathbf{T}_j$ denote the SE(3) transformations of the $i$-th and $j$-th nodes respectively, and $\mathbf{m}_{ij}$ denotes the middle point between the rest positions of the $i$-th and $j$-th nodes. Then the total loss function for training N-ICP is written as
$$\mathcal{L}_{\text{N-ICP}} = \frac{1}{S} \sum_{s=1}^{S} \mathcal{L}_{\text{ICP}}(\theta^{(s)}, \mathbf{P}) + \lambda_{\text{DG-Reg}} \mathcal{L}_{\text{DG-Reg}}. \qquad (3)$$

5.2.2 Texel-Conditioned Clothed Avatars. We use the following loss functions to train the texel-conditioned clothed avatars (Sec. 5 of the main paper):
$$\mathcal{L}_{\text{avatars}} = \sum_{i} \lambda_i \mathcal{L}_i, \quad i \in \{\text{RGB}, \text{mask}, \text{reg}, \text{part}, \text{ID-MRF}\}. \qquad (4)$$

5.4.1 N-ICP. N-ICP takes an unstructured point cloud as input, so we adopt the PointNet++ [Qi et al. 2017] architecture. To specify the architecture, we reuse the notation of the Set Abstraction function from [Qi et al. 2017]: SA($K$, $r$, [$l_1$, ..., $l_d$]),

Table 2: Quantitative comparison with Clothing Codec Avatars [Xiang et al. 2021] and Dressing Avatars [Xiang et al. 2022] on the skirt sequence. The metrics are computed in the skirt region.

Table 1: Ablation studies on different types of input for N-ICP. The evaluation metric is the Mean Squared Error (MSE) in mm$^2$ between surfaces. P, r and g refer to the point cloud, residual and gradient defined in Sec. 4.2. $S$ denotes the number of update iterations. When $S = 1$, the network makes a one-shot prediction. Our full method is shown in the last row.
in Tab. 2. In terms of driving signal, CCA and DA take body and face motion as input, while our method additionally uses sparse RGB-D views. DA and our

arXiv:2310.05917v2 [cs.GR] 11 Oct 2023