Object Motion Guided Human Motion Synthesis

Modeling human behaviors in contextual environments has a wide range of applications in character animation, embodied AI, VR/AR, and robotics. In real-world scenarios, humans frequently interact with the environment and manipulate various objects to complete daily tasks. In this work, we study the problem of full-body human motion synthesis for the manipulation of large-sized objects. We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework that can generate full-body manipulation behaviors from only the object motion. Since naively applying diffusion models fails to precisely enforce contact constraints between the hands and the object, OMOMO learns two separate denoising processes to first predict hand positions from object motion and subsequently synthesize full-body poses based on the predicted hand positions. By employing the hand positions as an intermediate representation between the two denoising processes, we can explicitly enforce contact constraints, resulting in more physically plausible manipulation motions. With the learned model, we develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated. Through extensive experiments, we demonstrate the effectiveness of our proposed pipeline and its ability to generalize to unseen objects. Additionally, as high-quality human-object interaction datasets are scarce, we collect a large-scale dataset consisting of 3D object geometry, object motion, and human motion. Our dataset contains human-object interaction motion for 15 objects, with a total duration of approximately 10 hours.


INTRODUCTION
Capturing and synthesizing human movements in contextual environments is critical to progressing embodied AI, character animation, VR/AR, and robotics. The real world in which humans live is complex and highly dynamic. Humans routinely interact with dynamic objects to accomplish everyday tasks, demonstrating a diverse range of full-body manipulations. For example, humans pull and push a mop to tidy a floor, reposition a floor lamp to illuminate a specific area, drag a chair toward a desk, and place a monitor on a desk. Realistically simulating such complex manipulation behaviors is a fundamental problem in computer graphics with many downstream applications.
Prior works have made significant progress in addressing the contextual human motion synthesis problem for activities such as navigating through a 3D scene or sitting on a chair [Hassan et al. 2021; Mir et al. 2023; Wang et al. 2021a,b; Zhang et al. 2022a; Zhao et al. 2023]. They model interactions with static 3D scenes or static objects based on large-scale human motion datasets. In comparison, datasets containing full-body interaction with moving objects are scarce. Prior works rely on reinforcement learning to model such behaviors [Hassan et al. 2023; Merel et al. 2020; Xie et al. 2023], but the learned policies are often limited to manipulating the specific types of geometry used for training.
We present a new approach to synthesizing the dynamic interactions between humans and large-sized objects, particularly in manipulation tasks requiring full-body movements and precise coordination between hands and objects. We aim to bridge the gap between current research and real-world manipulation behaviors by introducing a large-scale dataset and developing a robust approach to synthesize full-body motion from object motion.
We present a new framework, Object MOtion guided human MOtion synthesis (OMOMO). We leverage a conditional diffusion formulation to predict plausible full-body poses with a sequence of object geometry as input. One key observation is that hand position is a deciding factor for full-body movement during manipulation. Thus, we devise a two-stage approach that first generates hand positions conditioned on object geometry features and then synthesizes full-body poses based on the predicted hand positions. The two-stage design enables us to apply contact constraints to our predicted hand joint positions, which significantly enhances the contact realism of the generated results. We demonstrate the effectiveness of our proposed method on our dataset and showcase its generalizability to unseen objects.
Moreover, we introduce an innovative application that generates full-body human poses based on object motion captured by an iPhone. In particular, we mount an iPhone on an unseen object and employ the iPhone ARKit to obtain camera poses and deduce the motion of the object. Subsequently, we apply these object poses to 3D geometry reconstructed using Luma [AI 2023]. Our pipeline takes the sequence of object geometry as input and generates the corresponding full-body human motion. This application demonstrates an affordable and user-friendly method for capturing human interaction motions during everyday tasks.
An additional contribution of this work is a new dataset with paired object motion and human motion to facilitate the learning of full-body human manipulation behaviors. We leverage an advanced 3D reconstruction technique to extract 3D object geometry from a monocular video. We then use motion capture devices to capture human and moving objects simultaneously. To capture motions that resemble real-world scenarios, we provide language descriptions to guide our volunteers to perform meaningful interactions with various objects. Our dataset can be used for different tasks to model full-body human manipulation behaviors.
To summarize, the contributions of this work include:
(1) A novel approach to full-body manipulation synthesis by generating full-body motion from object motion. We introduce an effective framework based on conditional diffusion to synthesize full-body movements from object motion.
(2) A novel application that employs an iPhone to capture object motion from the egocentric view of the object, enabling the synthesis of full-body movements by simply attaching an iPhone to various objects.
(3) A large-scale high-quality dataset consisting of 3D object geometry, object motion, and full-body motion.

RELATED WORK
Human Motion and Interaction Datasets. Human motion modeling has been extensively studied with motion capture datasets [Mahmood et al. 2019]. Recently, there has been a surge of interest in human scene interactions. PROX [Hassan et al. 2019] provides paired 3D scenes and human motions extracted from RGB videos. HPS [Guzov et al. 2021] contributes a dataset of paired scenes, egocentric video, and human motion captured with an IMU-based suit. EgoBody [Zhang et al. 2022c] collects a dataset consisting of 3D scenes, egocentric video, eye gaze, and human motions extracted from multiview RGBD frames with a focus on social interactions. GIMO [Zheng et al. 2022] explores the problem of gaze-guided motion prediction using a similar data modality. Synthetic datasets [Li et al. 2023; Wang et al. 2022] combine scene datasets [Dai et al. 2017; Straub et al. 2019] with motion datasets [Mahmood et al. 2019] to produce paired human motions in 3D environments. CIRCLE [Araujo et al. 2023] integrates VR and MoCap techniques to collect high-quality motion within virtual scenes.
A couple of datasets focus on human-object interactions. For example, SAMP [Hassan et al. 2021] contains sitting and lying down motions while interacting with chairs and sofas. COUCH [Zhang et al. 2022a] is dedicated to data collection for sitting on different chairs. These datasets primarily contain motions interacting with static objects.
Moreover, some datasets collect both human motion and object motion [Bhatnagar et al. 2022; Fan et al. 2023; Guzov et al. 2023; Taheri et al. 2020]. GRAB [Taheri et al. 2020] focuses on the interaction between humans and small-sized objects, involving mostly hand motions. BEHAVE [Bhatnagar et al. 2022] records interactions with larger-sized objects, making it closely related to our dataset. However, it relies on multi-view RGBD input to extract human and object motion, which does not yield motion of sufficient quality for motion synthesis tasks. Furthermore, the limited data for each object impedes its capacity for training a motion generative model. In contrast, our work focuses on synthesizing dynamic human interactions with large-sized objects, and we introduce a large-scale dataset consisting of high-quality human motion and object motion.
Contextual Human Motion Synthesis. Motion synthesis is a longstanding problem in computer graphics, and here we survey prior works centered on motion synthesis in 3D environments. Leveraging the dataset with paired scenes and human motions [Hassan et al. 2019], a couple of works [Wang et al. 2021a,b] learn separate modules that first predict the root trajectory and then generate full-body poses conditioned on both the scene and the planned path. However, constrained by the scale and motion quality of the dataset, these methods struggle to synthesize realistic human motions. To improve the motion quality of generation results, SAMP [Hassan et al. 2021] collects a high-quality dataset consisting of walking, sitting, and lying down motions. They present a pipeline that first produces a collision-free path based on the A* algorithm, generates full-body motion following the path, and then synthesizes interaction motions to sit on chairs and sofas. A recent work [Mir et al. 2023] introduces action keypoints as scene abstraction, enabling continual motion synthesis across various scenes. In order to produce physically plausible movements, several works employ reinforcement learning techniques to learn interaction policies through meticulously designed task rewards [Chao et al. 2021; Hassan et al. 2023; Lee and Joo 2023].
Another line of work focuses on reaching motion synthesis within contextual environments. GOAL [Taheri et al. 2022] and SAGA [Wu et al. 2022] generate full-body poses aimed at grasping a specific object. IMoS [Ghosh et al. 2023] further synthesizes human and object motions simultaneously after grasping an object. However, these works only consider the target object and do not involve navigation in cluttered scenes. Meanwhile, CIRCLE [Araujo et al. 2023] incorporates human-scene interaction features and formulates the problem with a scene-aware motion refinement model, enabling reaching synthesis in complex static scenes.
While most existing work that involves interaction with dynamic objects aims to synthesize dexterous hand motions [Li et al. 2007; Ye and Liu 2012; Zhang et al. 2021], our work diverges from this line of work. Instead, we focus on the synthesis of full-body movements for manipulation without synthesizing detailed hand movements.
Full-body human motion synthesis for manipulation has been explored in both kinematic-based [Starke et al. 2019] and physics-based methods [Hassan et al. 2023; Merel et al. 2020; Xie et al. 2023]. NSM [Starke et al. 2019] learns a gating network and a motion prediction network to synthesize interaction movements including sitting and carrying objects. As for physics-based character animation, reinforcement learning has been widely used to learn different skills [Liu and Hodgins 2018; Peng et al. 2018, 2021; Xie et al. 2022]. In terms of manipulation, Merel et al. [2020] devise a hierarchical reinforcement learning framework to synthesize box catching and carrying movements with egocentric observations. More recently, Hassan et al. [2023] propose to learn policies based on the Adversarial Motion Priors framework [Peng et al. 2021] for the box manipulation task.
In summary, most prior research has not considered the dynamic interaction between humans and large-sized objects. A few works studied the problem of full-body manipulation but were constrained to interactions with boxes. In contrast, our work examines contextual environments with diverse dynamic objects. Leveraging our large-scale dataset, we develop an approach to synthesize manipulation movements for diverse objects. Inspired by the success of diffusion in motion modeling [Dabral et al. 2023; Huang et al. 2023; Li et al. 2023; Tevet et al. 2023; Tseng et al. 2023; Zhang et al. 2022b], we design our framework based on conditional diffusion.

METHOD
Our goal is to generate full-body poses $X \in \mathbb{R}^{T \times D}$ from a sequence of object geometry $V \in \mathbb{R}^{T \times K \times 3}$, where $T$ denotes the number of time steps in the sequence, and $D$ and $K$ represent the dimension of the human pose state and the number of vertices on the object mesh, respectively. This problem presents two significant challenges. First, there is inherent uncertainty in predicting full-body poses from object motions, as humans can produce the same object motion with varying movements. Second, the generated human poses need to maintain correct contact with the given object when it is being manipulated. The first challenge can be addressed by using a generative model, such as a diffusion model [Ho et al. 2020]. However, naively applying diffusion models would not address the second challenge of precisely enforcing contact constraints between the hands and the object. We develop a two-stage method based on a diffusion framework with hand positions as an intermediate representation. The first stage predicts both right and left hand positions $H \in \mathbb{R}^{T \times 6}$ from the object geometry. The second stage generates full-body poses $X \in \mathbb{R}^{T \times D}$ conditioned on the predicted hand joint positions. Our pipeline is shown in Figure 2.

Data Representation
Human Pose Representation. Our pose state representation at time step $t$ consists of global joint positions $J_t \in \mathbb{R}^{24 \times 3}$ and global joint rotations $Q_t \in \mathbb{R}^{22 \times 6}$ represented using the 6D continuous rotation representation [Zhou et al. 2019]. We adopt a widely used parametric human model, SMPL-X [Pavlakos et al. 2019], to reconstruct the human mesh from pose and shape parameters.
Object Representation. Given a sequence of object geometry $V \in \mathbb{R}^{T \times K \times 3}$, we adopt the Basis Point Set (BPS) representation [Prokudin et al. 2019] to encode object geometry. We use the BPS representation for two reasons. First, it gives us a lightweight and compact representation using fixed-length vectors. Second, BPS does not rely on a special model architecture, such as PointNet [Qi et al. 2017], for processing and can be encoded with an MLP to learn downstream tasks effectively, as demonstrated in previous work [Prokudin et al. 2019]. We define a ball with radius $r = 1$, a value chosen to encompass all objects in our dataset. The ball is centered at the centroid of the object, $o_t \in \mathbb{R}^{3}$, at time step $t$. We sample 1024 points from the volume of the ball, $B_t \in \mathbb{R}^{1024 \times 3}$. The BPS representation is computed by calculating the difference between each sampled point and its nearest neighbor vertex on the object mesh $V_t$, and is denoted as $\mathrm{BPS}(B_t, V_t) \in \mathbb{R}^{1024 \times 3}$. As the global position is not encoded in the BPS representation, we concatenate the 3D location of the object at time step $t$ with it to yield the object geometry features $[o_t, \mathrm{BPS}(B_t, V_t)]$. Then we employ a Multilayer Perceptron (MLP) to project the high-dimensional features onto a lower-dimensional space. The projected geometry features are denoted as $G_t$, $G_t \in \mathbb{R}^{256}$.
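The encoding described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names (`sample_ball`, `object_features`) are hypothetical, nearest neighbors are found by brute force, the basis points are assumed to be sampled once and re-centered on the object centroid at every frame, and the sign convention of the offset (nearest vertex minus basis point) is also an assumption. The MLP projection to 256 dimensions is omitted.

```python
import math
import random


def sample_ball(num_points, radius, rng):
    """Sample points uniformly from the volume of a ball centered at the origin."""
    points = []
    while len(points) < num_points:
        p = [rng.uniform(-radius, radius) for _ in range(3)]
        if sum(c * c for c in p) <= radius * radius:  # rejection sampling
            points.append(p)
    return points


def centroid(vertices):
    """Mean of the mesh vertices."""
    n = len(vertices)
    return [sum(v[i] for v in vertices) / n for i in range(3)]


def object_features(vertices, basis_offsets):
    """Centroid (3 values) followed by flattened BPS offsets (3 per basis point)."""
    center = centroid(vertices)
    features = list(center)
    for offset in basis_offsets:
        # Assumption: the same basis offsets are re-centered on the object each frame.
        basis_point = [center[i] + offset[i] for i in range(3)]
        nearest = min(vertices, key=lambda v: math.dist(basis_point, v))
        # Assumed sign convention: nearest vertex minus basis point.
        features.extend(nearest[i] - basis_point[i] for i in range(3))
    return features


rng = random.Random(0)
basis_offsets = sample_ball(1024, 1.0, rng)
# Toy "mesh": the eight corners of a cube with side length 1.
cube = [[x, y, z] for x in (-0.5, 0.5) for y in (-0.5, 0.5) for z in (-0.5, 0.5)]
features = object_features(cube, basis_offsets)
print(len(features))  # 3 + 1024 * 3 = 3075
```

Because the basis points follow the centroid in this sketch, translating the object changes only the first three entries of the feature vector, which is why the centroid is concatenated explicitly.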

Conditional Diffusion Formulation
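The diffusion model consists of a forward diffusion process and a reverse diffusion process. The forward diffusion process gradually adds noise to the data representation $x_0$ for $N$ steps and is formulated as a Markov chain,

$$q(x_{1:N} \mid x_0) := \prod_{n=1}^{N} q(x_n \mid x_{n-1}). \quad (1)$$

Each transition of the forward diffusion is modeled by the conditional distribution $q(x_n \mid x_{n-1})$. Each step is determined by a fixed variance schedule $\beta_n$ and is defined as

$$q(x_n \mid x_{n-1}) := \mathcal{N}\left(x_n; \sqrt{1 - \beta_n}\, x_{n-1}, \beta_n I\right), \quad (2)$$

where $I$ represents the identity matrix.
The reverse diffusion process generates the desired data representation from random noise $x_N \sim \mathcal{N}(0, I)$. This is achieved by learning a neural network $p_\theta$ to denoise recursively. Specifically, at noise level $n$, we use $c$ to represent the conditions, and the reverse diffusion process is represented as follows:

$$p_\theta(x_{n-1} \mid x_n, c) := \mathcal{N}\left(x_{n-1}; \mu_\theta(x_n, n, c), \sigma_n^2 I\right), \quad (3)$$

where $\mu_\theta(x_n, n, c)$ is the learned mean and $\sigma_n^2$ is the fixed variance.
$\mu_\theta(x_n, n, c)$ (we write $\mu_\theta$ in the following equation for brevity) can be formulated as

$$\mu_\theta = \frac{\sqrt{\alpha_n}\,(1 - \bar{\alpha}_{n-1})}{1 - \bar{\alpha}_n}\, x_n + \frac{\sqrt{\bar{\alpha}_{n-1}}\,\beta_n}{1 - \bar{\alpha}_n}\, \hat{x}_\theta(x_n, n, c), \quad (4)$$

where $\hat{x}_\theta(x_n, n, c)$ is the prediction of $x_0$, and $\alpha_n = 1 - \beta_n$ and $\bar{\alpha}_n$ are fixed parameters that satisfy $\bar{\alpha}_n = \prod_{i=1}^{n} \alpha_i$. Learning the mean can therefore be reparameterized as learning to predict the original data $x_0$. We use a reconstruction loss on $x_0$ during training:

$$\mathcal{L} = \mathbb{E}_{x_0, n} \left\| \hat{x}_\theta(x_n, n, c) - x_0 \right\|_1. \quad (5)$$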
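As a small numerical illustration of the mean parameterization in standard DDPM-style diffusion [Ho et al. 2020], the sketch below computes the cumulative products of the noise schedule and the mean of the reverse step from a predicted clean sample. The linear schedule, its endpoints, the number of steps, and all function names are assumptions made for illustration; the text above does not specify them.

```python
import math


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule; the endpoints are illustrative only."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[n] = prod_{i=1..n} (1 - beta_i), with alpha_bar[0] = 1."""
    alpha_bar = [1.0]
    for beta in betas:
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar


def posterior_mean(x_n, x0_pred, n, betas, alpha_bar):
    """Mean of the reverse step at noise level n (1-indexed), given a prediction of x_0."""
    beta_n = betas[n - 1]
    alpha_n = 1.0 - beta_n
    coef_xn = math.sqrt(alpha_n) * (1.0 - alpha_bar[n - 1]) / (1.0 - alpha_bar[n])
    coef_x0 = math.sqrt(alpha_bar[n - 1]) * beta_n / (1.0 - alpha_bar[n])
    return [coef_xn * a + coef_x0 * b for a, b in zip(x_n, x0_pred)]


betas = linear_beta_schedule(1000)
alpha_bar = cumulative_alphas(betas)
# At the last denoising step (n = 1) the mean collapses onto the predicted x_0.
mu = posterior_mean([5.0, -3.0], [1.0, 2.0], 1, betas, alpha_bar)
```

At $n = 1$ the coefficient on $x_n$ vanishes because $\bar{\alpha}_0 = 1$, which is one way to see why predicting $x_0$ is an equivalent training target to predicting the mean.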

Our Pipeline
Generating Hand Positions from Object Geometry. In the first stage, we employ conditional diffusion to generate hand joint positions $H_1, H_2, ..., H_T$ from object geometry features $G_1, G_2, ..., G_T$. Here, the conditions $c$ are represented by $G_1, G_2, ..., G_T$. We adopt a transformer architecture [Vaswani et al. 2017] as our denoising network, which consists of four self-attention blocks. Each self-attention block contains a multi-head attention layer followed by a position-wise feedforward layer. As shown in Figure 3, we introduce an additional step to include the noise level embedding as an input to our transformer model.
Apply Hand Contact Constraints. The hand joint positions generated in the first stage may not always be precise. They may occasionally deviate from the object, resulting in perceived non-contact at certain time steps. To mitigate this, we propose a post-processing strategy based on the observation that human hands typically maintain consistent relative positions with respect to objects during contact. Given a sequence of hand joint positions $H_1, H_2, ..., H_T$, we begin by computing the minimum distance from the hand joints to the corresponding object mesh $V_1, V_2, ..., V_T$ at each time step, denoted as $d_1, d_2, ..., d_T$. We then traverse the sequence $d_1, d_2, ..., d_T$ starting from the first frame. We set an empirical contact threshold $th = 0.03$ and record the first time step $k$ at which $d_k < th$.
Next, we calculate the difference vector $\delta = H_k - V_k^{j}$ at step $k$, where $V_k^{j}$ is the nearest neighbor vertex of the hand joint on the object mesh. The difference vector $\delta \in \mathbb{R}^{3}$ is then used to compute updated hand joint positions in subsequent time steps. We denote the object rotation sequence as $R_1, R_2, ..., R_T$, and for $t > k$, we compute the updated hand joint position as $\hat{H}_t = V_t^{j} + R_t R_k^{-1} \delta$. This ensures the generated hand joint positions maintain a realistic, consistent contact with the object across the entire sequence. From the input object geometry, the first stage determines the joint positions for both the left and right hands. If both hands are in close proximity to the object, the result is a two-handed manipulation. If not, a single-handed manipulation is established. Specifically, close proximity is determined by computing the Euclidean distance between the hand position and its nearest neighbor point on the object mesh. If this distance is smaller than a predefined threshold (we empirically set the threshold to 0.03), it is inferred that there is contact.
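The post-processing step above can be sketched for a single hand joint as follows. This is a simplified illustration rather than the authors' code: the helper names are hypothetical, the mesh is assumed to keep the same vertex ordering across frames so that the contact vertex can be tracked by index, object rotations are assumed to be 3×3 rotation matrices (so the inverse is the transpose), and the sequence is left unchanged if no frame falls below the threshold.

```python
import math


def nearest_vertex(point, vertices):
    """Index of and distance to the mesh vertex closest to `point`."""
    idx = min(range(len(vertices)), key=lambda i: math.dist(point, vertices[i]))
    return idx, math.dist(point, vertices[idx])


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def apply_contact_constraint(hands, meshes, rotations, threshold=0.03):
    """Rigidly attach the hand to the object after the first detected contact."""
    updated = [list(h) for h in hands]
    for k, (hand, verts) in enumerate(zip(hands, meshes)):
        j, dist = nearest_vertex(hand, verts)
        if dist < threshold:
            break
    else:
        return updated  # no contact detected anywhere in the sequence
    # Offset from the contact vertex to the hand at the contact frame k.
    delta = [hand[i] - verts[j][i] for i in range(3)]
    rot_k_inv = transpose(rotations[k])  # inverse of a rotation matrix
    for t in range(k + 1, len(hands)):
        rotated = matvec(rotations[t], matvec(rot_k_inv, delta))
        updated[t] = [meshes[t][j][i] + rotated[i] for i in range(3)]
    return updated


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# Toy two-vertex "mesh" sliding along x; the hand first touches it at frame 1.
meshes = [[[0.1 * t, 0.0, 0.0], [1.0 + 0.1 * t, 0.0, 0.0]] for t in range(4)]
rotations = [identity, identity, identity, rot_z_90]
hands = [[0.0, 1.0, 0.0], [0.1, 0.02, 0.0], [0.5, 0.5, 0.5], [0.9, 0.9, 0.9]]
fixed_hands = apply_contact_constraint(hands, meshes, rotations)
```

In the toy input, the noisy predictions at frames 2 and 3 are replaced by positions that keep the contact-frame offset fixed in the object's rotating frame.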
Generating Full-body Poses from Hand Positions. In the second stage, we utilize the same denoising network architecture as in stage one to generate full-body poses from the hand joint positions. The conditions in this stage are the hand joint positions $\hat{H}_1, \hat{H}_2, ..., \hat{H}_T$ that have been rectified using the contact constraints. The model is trained using human motion data only.

By integrating these three components, we establish a complete pipeline to generate full-body poses from object motion. This pipeline models the one-to-many mapping from object motion to human poses and ensures that the generated poses maintain realistic contact with the object.

DATASET
We collected a large-scale high-quality dataset consisting of 3D object geometry, human motion, and object motion. In this section, we elaborate on our object geometry acquisition, motion capture, and data processing.
Object Geometry Capture. We selected 15 objects commonly used in everyday tasks, which include a vacuum, mop, floor lamp, clothes stand, tripod, suitcase, plastic container, wooden chair, white chair, large table, small table, large box, small box, trashcan, and monitor. For each object, we filmed a video circling the object and employed Luma [AI 2023] to reconstruct the 3D object geometry from this monocular video. We then utilized Meshlab to manually remove noisy points and downsample object meshes to contain a reasonable number of points for training.
Motion Capture. We utilized a Vicon system comprising 12 cameras controlled by Vicon Shogun, which record at a rate of 120 FPS. For each object, we attached 5 markers and captured the object and human motion simultaneously. We invited 17 subjects (13 males, 4 females) to participate in our motion capture sessions. During each mocap session, the volunteer was provided with verbal instructions on how to interact with each object to avoid meaningless interactions. We show some examples of our language guidance in Figure 4. Each mocap session lasted approximately 1.5 to 2 hours. The total duration of captured motion for each object is shown in Table 1.
Data Processing. For the object geometry data, we employed a public Python library [Kleineberg 2023] to compute the SDF for objects. In cases where objects contained noisy SDFs, we used SIREN [Sitzmann et al. 2020] to train neural networks and extract the SDF.
In terms of motion data processing, we used Mosh++ [Loper et al. 2014; Mahmood et al. 2019] to process our raw mocap files and extract SMPL-X model [Pavlakos et al. 2019] parameters for each sequence. In order to compute object transformations based on marker positions, we first manually annotated the marker positions on the reconstructed object mesh. Subsequently, we utilized the analytical solution of the orthogonal Procrustes problem to compute the scale, rotation, and translation needed to align the annotated points with the marker positions. Furthermore, we visualized the object and human meshes and conducted a manual verification of our collected dataset, discarding any sequences that failed to meet our high-quality standard.

EXPERIMENT
We first introduce the dataset and evaluation metrics used for this task. Then we describe the chosen baselines and showcase comparisons against them. Additionally, we conduct an ablation study to investigate the effects of hand positions on overall performance. We encourage readers to watch our supplementary video for more qualitative evaluations.

Dataset and Evaluation Metrics
Dataset. We conduct all experiments using our collected dataset. This dataset consists of motion capture data from 17 subjects, with 15 subjects used for training and 2 subjects for testing. We adopt two data partitions for evaluation. In the first setting, we use all 15 objects for both training and testing. To further evaluate the model's generalization ability to new objects, we divide the 15 objects into 10 for training and 5 for testing, as shown in Figure 5.
Evaluation Metrics. We evaluate the synthesized results from two perspectives. First, we compare the generated poses against the ground truth motion data. Additionally, we assess the physical plausibility of the results, considering contact correctness, object penetration, and foot sliding. We detail our evaluation metrics as follows.
• HandJPE, MPJPE, and MPVPE represent mean hand joint position errors, mean per-joint position errors, and mean per-vertex position errors in centimeters (cm).
• $T_{root}$ and $O_{root}$ represent the root translation error, computed using the Euclidean distance in centimeters (cm), and the root orientation error, defined by the Frobenius norm of the difference between the predicted and ground truth $3 \times 3$ rotation matrices.
• FS represents the foot sliding metric and is computed following previous work [He et al. 2022].
• Collision Percentage. At time step $t$, for the $i$-th vertex on the reconstructed human mesh, we query the object SDF and acquire a signed distance value $d_t^i$. We use a threshold (4 cm) to compute collision. If there exists a vertex that satisfies $d_t^i < 0$ and $|d_t^i| > 4$, we increment the collision count. By traversing the sequence, we can compute the collision percentage.
• Contact Metrics. We adopt the metrics precision ($C_{prec}$), recall ($C_{rec}$), and F1 score from the object detection task to evaluate contact performance. We first compute the distance between hand positions and object meshes. We empirically set a contact threshold (5 cm) and use it to extract contact labels for each frame. We perform the same calculation for ground truth hand positions. Then we count true/false positive/negative cases to compute precision, recall, and F1 score.
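The collision percentage and contact metrics listed above reduce to simple per-frame counting. The sketch below shows one plausible reading, assuming distances are already given in centimeters, that the collision percentage is the share of frames with at least one sufficiently deep penetration, and that undefined ratios fall back to 0. Function names are hypothetical.

```python
def collision_percentage(signed_distances_cm, threshold_cm=4.0):
    """Percentage of frames where some vertex penetrates deeper than the threshold.

    signed_distances_cm[t] holds per-vertex signed distances at frame t
    (negative values lie inside the object).
    """
    colliding = sum(
        1 for frame in signed_distances_cm if any(d < -threshold_cm for d in frame)
    )
    return 100.0 * colliding / len(signed_distances_cm)


def contact_labels(hand_to_object_cm, threshold_cm=5.0):
    """Per-frame contact label: True when the hand is within the threshold."""
    return [d < threshold_cm for d in hand_to_object_cm]


def contact_precision_recall_f1(pred_labels, gt_labels):
    """Precision, recall, and F1 of predicted contact against ground truth contact."""
    tp = sum(1 for p, g in zip(pred_labels, gt_labels) if p and g)
    fp = sum(1 for p, g in zip(pred_labels, gt_labels) if p and not g)
    fn = sum(1 for p, g in zip(pred_labels, gt_labels) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


pred = contact_labels([1.0, 10.0, 3.0, 8.0])
gt = contact_labels([2.0, 2.0, 9.0, 9.0])
print(contact_precision_recall_f1(pred, gt))  # (0.5, 0.5, 0.5)
print(collision_percentage([[1.0, -5.0], [2.0, 3.0], [-1.0, 0.0], [-4.5, 1.0]]))  # 50.0
```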

Evaluations
Baselines. Since no existing work specifically addresses the task of object motion-guided human motion synthesis, we adapt a prior work on object-reaching motion synthesis, GOAL [Taheri et al. 2022], as our baseline. GOAL proposed an autoregressive model that predicts the next 10 frames conditioned on the past 5 poses, the hand distance between the current frame and the target goal frame, and the BPS representation which encodes hand-to-object distance at the target frame. In our problem setting, the input is a sequence of object geometry that guides the motion generation instead of a single target frame as in GOAL. Thus, we make changes to the input features and use the next frame as the target frame. Specifically, the input features in our modified version consist of the past 5 poses and the BPS representation that encodes the distance features between the current human mesh and the object mesh at the next frame. We use their default model setting, which consists of four learning blocks.
Results. Since our approach is based on conditional diffusion, there can be multiple plausible generation results given the same object motion. To make a quantitative comparison, we sample 20 times for the same object motion input and select the one with the smallest MPJPE. We show quantitative evaluations in Table 2 and Table 3 for two different data splits. One trains and tests on all 15 objects. The other uses 10 objects for training and the other 5 unseen objects for testing. For each configuration, there is only one random seed used to train our model, and the low collision percentage. As for the smaller FS scores, we observed that the feet in the baseline results usually drift above the floor, which is not counted as foot sliding according to the foot sliding metric. In addition, it is worth mentioning that applying the contact constraints to GOAL is not straightforward. Since GOAL predicts all the joints' rotations, it requires inverse kinematics to rectify the human pose based on the corrected hand positions. We also showcase qualitative results in Figure 6. OMOMO produces better contacts than the setting without hand joint positions as an intermediate representation, as evidenced by Figure 6 and higher contact F1 scores. For more qualitative comparisons, please watch our supplementary video.
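The best-of-20 protocol used for the quantitative comparison can be sketched as below. The sampler is a stand-in for the generative model (here a hypothetical callable that returns one candidate pose sequence per call), and poses are reduced to joint positions; only the selection logic by smallest MPJPE is illustrated.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error over all frames and joints.

    pred and gt are nested lists shaped [frames][joints][3].
    """
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        for pred_joint, gt_joint in zip(pred_frame, gt_frame):
            total += math.dist(pred_joint, gt_joint)
            count += 1
    return total / count


def best_of_k(sample_fn, gt, k=20):
    """Draw k samples and keep the one with the smallest MPJPE against gt."""
    best, best_err = None, math.inf
    for _ in range(k):
        candidate = sample_fn()
        err = mpjpe(candidate, gt)
        if err < best_err:
            best, best_err = candidate, err
    return best, best_err


# Toy example: one frame, two joints; the "model" returns shifted copies of gt.
gt_motion = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
shifts = iter([3.0, 0.5, 2.0])


def toy_sampler():
    s = next(shifts)
    return [[[x + s, y, z] for x, y, z in frame] for frame in gt_motion]


best_motion, best_error = best_of_k(toy_sampler, gt_motion, k=3)
print(best_error)  # 0.5
```

Note that this selection uses the ground truth, so it measures how close the sampled distribution can get to the reference motion rather than the quality of a single draw.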
Human Perceptual Study. We further conduct a human perceptual study to complement the evaluations. The goal is to evaluate the motion quality and contact realism. We randomly sample 100 generated sequences for each approach, including OMOMO, OMOMO-single-stage, GOAL, and ground truth, covering all 15 objects. We show some generated results of each approach in Figure 8. We compare OMOMO with each of the other three settings, forming 300 pairs in total for evaluation. For each question, we ask Amazon Mechanical Turk workers which sequence looks more natural and interacts with objects more realistically. Each question is evaluated by 20 different workers (Figure 7).
We show that OMOMO clearly outperforms the baseline GOAL and OMOMO-single-stage. When compared with ground truth, 31% preferred our results (the upper bound would be 50%). It is worth noting that the results of OMOMO are produced via a single forward pass, without any optimization or post-processing for the full-body poses. Therefore, certain artifacts such as penetration may be produced in the generated motion, which leads to the ground truth motion being preferred in some sequences.

Ablation Study
To investigate the effects of hand positions on our overall performance, we compare the full-body human pose generation results that use the predicted hand joint positions as input (OMOMO) with those that use ground truth hand positions as input (OMOMO-GT). In Table 4, we show that the synthesis results can be further improved by feeding more accurate hand joint positions.

Test on Manually Animated Object Trajectory
We further evaluated our pipeline using manually crafted animations of previously unseen objects. In this process, we began by reconstructing the 3D geometry of the object with the aid of Luma [AI 2023]. Once reconstructed, the 3D object was imported into Blender. Within Blender, we manually established keyframes at 15-frame intervals. Based on these keyframes, Blender then produced a complete object motion sequence. This sequence, exported from Blender, served as the input for OMOMO. The resulting outputs are shown in Figure 9.

APPLICATION
Fig. 11. Limitations. Our contact constraint cannot produce generations that involve intermittent contacts with the object. From top to bottom, we show the generated results and the corresponding ground truth motion. In the generated results, the hand positions are processed to be fixed on the object, which introduces implausible human motions that penetrate the objects.

We introduce our novel approach to capturing human motion interacting with objects using a single smartphone attached to the object. Specifically, we mount an iPhone XR on the target object and ask the subject to interact with the object while the iPhone camera is filming the environment. We leverage the ARWorldTrackingConfiguration API provided by iPhone ARKit to extract camera poses. This feature is based on visual-inertial odometry techniques that combine visual information and sensor information to estimate accurate camera poses in the world coordinate system. Since the camera is rigidly mounted on the object, we can derive object motion from camera poses. Similar to the data collection process, we film a video and use Luma [AI 2023] to reconstruct the 3D geometry of the target object. From the resulting sequence of moving object geometry, we can generate full-body human poses with our proposed pipeline. We showcase some results in Figure 10. Note that these objects are not used during model training.
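Because the phone is rigidly attached, the object pose at every frame differs from the camera pose by one fixed transform. The sketch below illustrates this with 4×4 homogeneous matrices in plain Python. It assumes the object's pose in the first frame is known (how the mounting offset is calibrated is not described in the text), and the function names are hypothetical; it does not call ARKit.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def rigid_inverse(pose):
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    rot_t = [[pose[j][i] for j in range(3)] for i in range(3)]
    trans = [pose[i][3] for i in range(3)]
    inv_trans = [-sum(rot_t[i][j] * trans[j] for j in range(3)) for i in range(3)]
    return [rot_t[i] + [inv_trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def translation(x, y, z):
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def object_poses_from_camera(camera_poses, object_pose_0):
    """Propagate a known first-frame object pose using world-frame camera poses."""
    # Fixed camera-to-object transform implied by the rigid mount (assumed calibration).
    cam_to_obj = matmul(rigid_inverse(camera_poses[0]), object_pose_0)
    return [matmul(cam_pose, cam_to_obj) for cam_pose in camera_poses]


rot_z_90 = [[0.0, -1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
# Camera mounted 1 unit above the object origin along y; it slides, then rotates.
camera_poses = [translation(0.0, 1.0, 0.0), translation(2.0, 1.0, 0.0), rot_z_90]
object_poses = object_poses_from_camera(camera_poses, translation(0.0, 0.0, 0.0))
```

The resulting per-frame object poses can then be applied to the reconstructed mesh vertices to obtain the sequence of object geometry that the pipeline takes as input.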

CONCLUSION
In summary, we presented a novel approach for synthesizing human motion guided by moving objects. Specifically, we proposed a framework based on a two-stage paradigm to enforce contact constraints, demonstrating its effectiveness in generating realistic human motions in interaction. Moreover, we introduced a novel application that enables capturing human interaction motion using only a smartphone. To facilitate research on human-object interactions, we also introduced a large-scale dataset consisting of 3D object geometry, high-quality object motion, and human motion.
Limitations. Our current dataset falls short of accurately representing dexterous hand movements, which often results in implausible hand motions. A promising avenue for future research would be incorporating hand priors and optimization techniques, enhancing the realism of hand motions in our full-body pose generations. Furthermore, the contact constraints in our current framework cannot effectively address scenarios with intermittent contacts with the object, as shown in Figure 11. This could be addressed by identifying and predicting contact states to enable the generation of more complex, long-term manipulation of the objects. Lastly, while our methodology is based on kinematics, future efforts could benefit from integrating physics-based components to mitigate the occurrence of artifacts.

Fig. 2. Method Overview. Given a sequence of object geometry, we use the BPS representation to encode geometry features and project the representation to a low-dimensional vector at each time step using an MLP. We use conditional diffusion to synthesize hand joint positions and apply contact constraints. Then we feed the updated hand joint positions to our full-body synthesis module and produce human poses in contact with the given dynamic object.
(a) Pull the chair to move. (b) Grab one of the chair's legs and tilt it. (c) Lift the chair over your head, walk and place the chair onto the floor. (d) Lift the chair, flip it upside down and place it on top of the table.
(a) Pull the table to the desired location. (b) Lift and move the table. (c) Kick the table to move across the room. (d) Lift two legs, slide your feet and rotate the table, and lower the table.

Fig. 5. Training objects are annotated in blue, and testing objects are annotated in purple.

Fig. 6. Qualitative Results. We compare our single-stage model, our two-stage model with contact constraints, and ground truth motion. For more qualitative comparisons with GOAL, please watch our supplementary video.
Fig. 10. Application. We mount an iPhone on an object (shown in (a)) and use iPhone ARKit to capture object motion. (b) shows the synthesized human motion.

Table 1. Duration of 15 objects in our dataset.

Table 4. Ablation Study. * represents the setting that tests on unseen objects.