AdaptNet: Policy Adaptation for Physics-Based Character Control

Motivated by humans' ability to adapt existing skills when learning new ones, this paper presents AdaptNet, an approach for modifying the latent space of existing policies so that new behaviors can be learned from similar tasks far more quickly than by learning from scratch. Building on top of a given reinforcement learning controller, AdaptNet uses a two-tier hierarchy that augments the original state embedding to support modest changes in a behavior and further modifies the policy network layers to make more substantive changes. The technique is shown to be effective for adapting existing physics-based controllers to a wide range of new styles for locomotion, new task targets, changes in character morphology and extensive changes in environment. Furthermore, it exhibits a significant increase in learning efficiency, as indicated by greatly reduced training times when compared to training from scratch or using other approaches that modify existing policies. Code is available at https://motion-lab.github.io/AdaptNet.


INTRODUCTION
Research on physically-based character animation has received a great deal of attention recently, especially using reinforcement learning (RL) to develop control policies that produce a wide spectrum of motion behaviors and styles with few or no manual inputs. Most techniques rely on reference human motion to either provide direct tracking or indirect comparison to constrain movement, along with additional targets and rewards to shape task success (e.g., [Liu and Hodgins 2018; Peng et al. 2018a; Xu and Karamouzas 2021]). However, methods to date largely develop policies or controllers for a known behavior, which must be relearned (usually from scratch) to produce a new behavior. While curriculum-style learning and warm-start approaches may be used to migrate policies to targeted goal tasks [Tao et al. 2022; Yin et al. 2021], we instead aim to broadly adapt previously trained policies to make them usable in a wide spectrum of new scenarios without the need for full retraining.
Inspired by recent work in conditioning existing models in image-based stable diffusion and large language models [Hu et al. 2021; Zhang and Agrawala 2023], we introduce AdaptNet as an approach for controlling physically based characters that modifies an existing policy to produce behavior in a variety of new settings. The main novelty of our work is the ability to control the motion generation process through editing the latent space. In physics-based character control tasks, there is an opportunity to better understand and exploit the latent space representation of control policies obtained using reinforcement learning frameworks. AdaptNet provides an initial step in this direction.
Specifically, our approach relies on the training of weights for new network components that are injected into a previously trained policy network. Building on top of a pre-existing multi-objective reinforcement learning controller, we propose a two-tier architecture for AdaptNet that augments the latent state embedding while adding modifications to the remaining layers for control refinement. The first tier modifies the latent space projected from the association of the task and character state. It supports adding elements to the control state, as well as changing the imitation and task rewards. Meanwhile, the deeper, control-level refinement augments the policy's action derived from the latent state, supporting more substantive changes to the task control. Together, the two tiers let AdaptNet train quickly from a previously trained policy and make a wide spectrum of adaptations from a single behavior.
As shown in Figure 1, we showcase our learning framework with numerous controller adaptation examples, including changes in the style of locomotion derived from very short reference motions. AdaptNet can perform this "few-shot style transfer" using only the embedding layer augmentation in a fraction of the time it takes to learn the original locomotion policy. Furthermore, through interpolating in the latent space, it is possible to adjust the generated control dynamically and smoothly transition from the original behavior to the new style. We further experiment with changes to the character morphology by "locking" joints and changing limb lengths. While these changes lead to failure in the original policy, AdaptNet augments the policy easily to account for the various changes. We also investigate changes in the environment, exploring adaptation for locomotion on rough and slick (low-friction) terrains, as well as in obstacle-filled environments. In each case, AdaptNet provides significant improvement leading to characters that robustly traverse a range of new settings (see Figure 1 and accompanying video).
We evaluate the effectiveness of AdaptNet on various tasks, including its ability to adapt to changes in imitation learning, goal rewards, and environmental states. We compare our approach against training from scratch, as well as training-continuation (finetuning). Training with AdaptNet can typically be carried out within 10-30 minutes for simple adaptation tasks, and up to 4 hours for complex locomotion tasks and environment changes. Within such modest training time budgets, in most cases it is impossible to obtain a working controller that can adhere to imitation and goal-task objectives when training from scratch or finetuning a pre-existing policy. Additional ablation studies support the specific architecture we propose over several alternatives and highlight AdaptNet's ability to successfully control and modify the latent space.
The contributions of our work are summarized as follows:
• We show how the latent space representation of an RL policy can be modified for motion synthesis in physics-based motor control tasks.
• Based on this, we introduce AdaptNet as a framework to efficiently modify a pre-trained physics-based character controller to new tasks.
• We showcase the applicability of AdaptNet on a variety of multi-objective adaptation tasks, including few-shot motion style transfer, motion interpolation, character morphology adaptation, and terrain adaptation.

RELATED WORK
Our approach follows a wide set of previous related work stemming from general disciplines in computer animation, robotics, machine learning and image generation. We focus on the background that is most relevant, categorized in physically based character skill control, transfer learning, and latent space adaptation.

Deep Reinforcement Learning for Skilled Motion
Deep learning neural network control policies have become the staple for physics-based character animation research due to their ability to synthesize a range of skilled motions. In recent years, techniques have trained control policies to animate physics-based humanoid characters for agile motions [Yin et al. 2021], team sports [Liu and Hodgins 2018; Xie et al. 2022], martial arts [Won et al. 2021], juggling [Chemin and Lee 2018; Luo et al. 2021; Xu et al. 2023], performing complex environment interactions [Merel et al. 2020], as well as general locomotion tasks [Bergamin et al. 2019; Peng et al. 2018a]. The recent survey by Kwiatkowski et al. [2022] provides a comprehensive overview of approaches that have been developed for motion synthesis and control of animated characters.
Training skill-specific policies often requires extended training time, necessitating years of simulated learning [Peng et al. 2022]. Skill re-use and combining pre-trained policies to perform more complex tasks offer an alternative that can create needed savings from this extensive training. To this end, a number of papers have proposed ways to reuse and/or combine policies. For example, DeepMimic [Peng et al. 2018a] trains a composite policy that transitions between a collection of different skills. Liu and Hodgins [2017] experiment with hierarchical models that sequence a set of pre-trained control fragments. Hejna et al. [2020] explore a hierarchical approach to decouple low and high-level policies to transfer skills from agents with simple morphologies to more complex ones, and found that it helps to reduce overall sampling. Likewise, we demonstrate that the proposed AdaptNet approach is effective when adapting pre-trained policies to new character morphologies and motion styles with relatively little additional training time.
Curriculum learning is also related to skill adaptation since the agent is trained on tasks with increasing difficulty [Karpathy and van de Panne 2012; Yu et al. 2018]. The approach is demonstrated to be effective for training controllers that allow agents to traverse environments of increasing complexity [Heess et al. 2017; Xie et al. 2020] and recover to standing [Frezzato et al. 2022] under increasingly challenging conditions. In comparison, we demonstrate that our approach efficiently allows a physically simulated humanoid to adapt pre-trained walking and running skills to new terrains as well. However, the aim of curriculum learning is somewhat different from our own in that it is usually used as a means to develop a single advanced skill while we focus on the ability to generalize from one behavior to many.

Transfer Learning
In machine learning, a common approach for model adaptation is to start with a pre-trained model and fine tune it on a new task. Over the years a number of architectures have been proposed to overcome the overfitting and expressivity issues of finetuning, including GAN-inspired approaches for domain adaptation [Ganin et al. 2016; Tzeng et al. 2017] and adding new models to previously learnt ones through lateral connections [Rusu et al. 2016b, 2017]. To facilitate better model transfer, algorithms have been explored that account for entropy optimization [Haarnoja et al. 2017; Wang et al. 2021]. As well, others directly manipulate the source task domain through randomizing physical parameters of the agent and/or environment while adapting the source domain to the target one [Ganin et al. 2016; Peng et al. 2018b; Rajeswaran et al. 2017]. To encourage diversity during early training, recent work on transfer learning has also explored a multi-task paradigm where a model is pre-trained on many tasks before being transferred to a new target domain [Alet et al. 2018; Devin et al. 2017]. Some multi-task transfer learning solutions include policy distillation that seeks to "distill" knowledge from expert policies to a target policy [Parisotto et al. 2016; Rusu et al. 2016a]. Another approach with a similar goal is residual policy learning, which learns a residual around given expert policies [Silver et al. 2019].
Meta learning has also gained popularity recently in computer vision and robotics, seeking to leverage past experiences obtained from many tasks to acquire a more generalizable and faster model that can be quickly adapted to new tasks [Andrychowicz et al. 2016; Ravi and Larochelle 2017]. The related formulations can be broadly classified into models that ingest a history of past experiences through recurrent architectures [Duan et al. 2016; Heess et al. 2015], model-agnostic meta-learning methods [Finn et al. 2017; Nichol et al. 2018], and approaches for meta-learning hyperparameters, loss functions, and task-dependent exploration strategies [Gupta et al. 2018; Houthooft et al. 2018; Xu et al. 2018].
While some of the aforementioned approaches have shown great promise for agent control problems, in this paper, we propose an approach that can quickly adapt RL policies for physically simulated humanoids through fine control tuning as well as augmentation injected in the latent space, loosely inspired by recent findings in image diffusion [Hu et al. 2021; Mou et al. 2023; Zhang and Agrawala 2023]. In character animation, related work has focused on motion style transfer tasks for kinematic characters [Aberman et al. 2020; Mason et al. 2018] and the recent work of Starke et al. [2022] shows exciting results about how a well-learned latent space can aid motion synthesis. However, in physics-based character control tasks, there is still little investigation about the latent space representation of the control policy obtained using reinforcement learning frameworks. We believe that AdaptNet provides a promising step in bridging that gap.

Latent Space Adaptation
We are inspired by research in image and 3D model generation that shows it is possible to control the synthesis process to generate targeted artifacts through purposeful modification of the latent space [Abdal et al. 2019; Berthelot et al. 2017; Bojanowski et al. 2018; Epstein et al. 2022; Karras et al. 2020; Radford et al. 2016; Shen et al. 2020; Wu et al. 2016; Zhuang et al. 2021]. While we have seen related work in RL for character control, AdaptNet offers a unique approach to latent space adaptation, drawn from these adjacent works' successes. Related works in physics-based character control, such as [Juravsky et al. 2022; Ling et al. 2020; Peng et al. 2019, 2022; Tessler et al. 2023; Won et al. 2021], explore using pre-trained latent space models to facilitate the training of a control policy. These methods intend to adapt the pre-trained multi-skill model for downstream tasks by controlling skill latent embeddings, focusing on reusing skills for motion generation. In contrast, our approach does not break down the latent space by task and character state and instead allows the policy to be adapted to heterogeneous tasks that require learning new (out-of-distribution) motions/skills. Further, previous methods discard the pre-trained latent encoder during adaptation and rely on re-training to obtain a new encoder. In contrast, our approach directly edits the latent space projected from the association of the task and character state via the pre-trained policy. To do this, we use a gated recurrent unit (GRU) [Chung et al. 2014] layer as the encoder and initialize it by duplicating the original encoder parameters. Next, a fully connected layer is applied after the GRU to ensure zero initialization and convert the encoded state to a latent modification. In sum, the training for our adaptation starts from modifying the pre-trained policy rather than from scratch, which benefits adaptation in comparison to previous work in sample efficiency and, at times, overall effectiveness.

ADAPTNET FRAMEWORK
An overview of the AdaptNet framework is shown in Figure 2. The GAN-style control framework (top), described below, produces an original (pre-trained) policy (bottom, left) while AdaptNet is used to adapt that pre-trained control policy to a new task controller (bottom, right). Notably, the adaptation process could involve changes to the reward function (e.g., motion stylization) or the state and dynamics model (e.g., character morphology and terrain adaptation). Two components of AdaptNet for policy adaptation are shown: a latent space injection component and an internal adaptation component. The latent space injection performs policy adaptation by editing the latent space, which is conditioned on the pre-trained policy's state as well as any additional state information, for example, for new tasks. This component is trained to cooperate with the pre-trained policy by generating offsets to the original latent space instead of trying to learn how to generate latent variables for new tasks from scratch during adaptation. This leads to an efficient state-action exploration that starts from the pre-trained policy, instead of complete random exploration. Internal adaptation further tunes the policy by adding a branch to each internal fully-connected layer in the policy network. This allows for more flexibility, enabling AdaptNet to shift away from the pre-trained policy and generate refinement through control actions that the pre-trained policy may not reach easily.
Fig. 2. Overview of our approach for adapting motor control policies for physics-based characters. Top: We model both pre-training and adapted tasks using a multi-critic reinforcement learning framework that balances the training of imitation and goal-directed control objectives. After a policy is trained, we can quickly adapt it to a new task using AdaptNet. Bottom: AdaptNet starts with a copy of the pre-trained policy network and modifies it through editing the latent space conditioned on the character's state and introducing optional adaptation modules for further finetuning.
In our implementation, both the pre-trained policy and the adaptation are produced using a multi-objective learning framework [Xu et al. 2023] combining reinforcement learning with a GAN-like structure for effective policy learning that accounts for both motion imitation and goal-directed control (see Figure 2, top). During runtime, AdaptNet can be activated flexibly and dynamically, allowing us to control the level of adaptation of the original control policy. The control policy $\pi(\mathbf{a}_t|\mathbf{s}_t)$ is a neural network taking the agent state $\mathbf{s}_t$ as input and outputting a probability distribution from which a control $\mathbf{a}_t$ can be drawn from the action space $\mathcal{A}$.
For physics-based character control tasks with dynamic goals, we consider $\mathbf{s}_t := \{\mathbf{o}_t, \mathbf{g}_t\}$, where $\mathbf{o}_t$ denotes the current state of the character, e.g., joint or body link positions and velocities, and $\mathbf{g}_t$ is an optional task-related goal state or an encoding variable that indicates desired motion parameters, such as target speed and direction, end-effector positions, motion style, etc. The action vector $\mathbf{a}_t$ is the target posture fed to a PD servo through which the simulated character is controlled at a higher frequency. As shown in Figure 2, $\mathbf{a}_t$ is expressed as a multivariate Gaussian distribution.
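Under the framework of reinforcement learning, our goal is to find the policy $\pi$ that maximizes the discounted cumulative reward:
$$\max_{\pi}\ \mathbb{E}_{\tau \sim p(\tau|\pi)}\left[\sum_{t=0}^{H-1} \gamma^t r_t\right], \tag{1}$$
where $p(\tau|\pi) = p(\mathbf{s}_0)\prod_{t=0}^{H-1} p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)\,\pi(\mathbf{a}_t|\mathbf{s}_t)$ is the state-action visitation distribution for the trajectory $\tau = \{\mathbf{s}_t, \mathbf{a}_t\}$ over a horizon of $H$ time steps, $\gamma \in [0, 1]$ denotes the discount factor, $r(\cdot)$ is the reward function with $r_t$ the reward received at time step $t$, and $p(\cdot)$ is the state-transition probability of the underlying Markov decision process. In our domain, when the character faces a new task, $r(\cdot)$ and/or $p(\cdot)$ may change. AdaptNet seeks to efficiently modify $\pi$ and adapt it to the new task by editing the latent space and finetuning the policy.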
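As a small illustration of the objective above, the sketch below evaluates the discounted cumulative reward of a single sampled trajectory. The function name `discounted_return` and the reward values are made up for the example; an actual training loop would average this quantity over many trajectories drawn from the policy.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one finite-horizon trajectory."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-step rewards over a horizon of H = 3 steps.
example_rewards = [1.0, 1.0, 1.0]
print(discounted_return(example_rewards, gamma=0.5))
```

With `gamma=1.0` the function reduces to the plain sum of rewards, and with `gamma=0.0` only the first reward counts.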

POLICY ADAPTATION USING LATENT SPACE INJECTION
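If we consider the first layer, or first several layers, in the policy network $\pi$ as an encoder to embed the state $\mathbf{s}_t$ into a latent space $\mathcal{Z}$, the control policy can be rewritten as
$$\pi(\mathbf{a}_t|\mathbf{s}_t) := \pi_\theta(\mathbf{a}_t \mid \mathcal{E}_\xi(\mathbf{s}_t)), \tag{2}$$
where $\mathcal{E}_\xi$ is the encoding layers with parameters $\xi$, $\theta$ are the parameters for the layers in the policy network that follow the encoder, and $(\theta, \xi)$ denote the weights of $\pi$. In this formulation, the policy network $\pi_\theta$ decides the projection from the latent $\mathbf{z}_t = \mathcal{E}_\xi(\mathbf{s}_t)$ into the action space $\mathcal{A}$. Assuming that $\pi_\theta$ is optimized by a typical on-policy policy gradient algorithm, the optimization objective with the introduction of the latent becomes
$$\max_{\theta, \xi}\ \mathbb{E}_t\left[A_t \log \pi_\theta(\mathbf{a}_t \mid \mathbf{z}_t)\right], \tag{3}$$
where $A(\cdot)$ provides an advantage function estimation based on the received rewards $\{r_\tau\}_{\tau \geq t}$ during the interaction with the environment and represents how good an action sample $\mathbf{a}_t$ is given the conditional state $\mathbf{s}_t$.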
Given the generalization of neural networks, the latent space $\mathcal{Z}$ can be considered as a superset covering all the possible latent states, which could lie outside of the domain that $\pi_\theta$ can reach during its training. Based on this observation, when $\pi_\theta$ needs to be adapted to a new task, we propose to edit $\mathbf{z}_t = \mathcal{E}_\xi(\mathbf{s}_t) \subset \mathcal{Z}$ instead of discarding the original encoder $\mathcal{E}_\xi$ and training a new one from scratch. The intuition is that for similar tasks, adjusting the current encoder provides better efficiency, allowing the desired control policy to be learned by a modified projection function from $\mathbf{s}_t$ to $\mathbf{z}_t$.
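Our approach manipulates the full latent space projected from both the character state $\mathbf{o}_t$ and the goal state $\mathbf{g}_t$. Specifically, as shown in Figure 2, we perform latent space injection by introducing a new conditional encoder $\mathcal{I}_\phi$ with parameters $\phi$ after the first encoding layer, where the character state $\mathbf{o}_t$ and the goal state $\mathbf{g}_t$ are concatenated to generate $\mathcal{E}_\xi(\mathbf{s}_t)$. This latent space is modified via
$$\mathbf{z}_t = \mathcal{E}_\xi(\mathbf{s}_t) + \mathcal{I}_\phi(\mathbf{s}_t, \mathbf{c}_t), \tag{4}$$
where $\mathbf{c}_t$ is an additional, optional control input for the new task. The injector module is
$$\mathcal{I}_\phi(\mathbf{s}_t, \mathbf{c}_t) = \mathcal{F}_\phi\big(\mathcal{E}_\phi(\mathbf{s}_t), \mathcal{G}_\phi(\mathbf{c}_t)\big), \tag{5}$$
where $\mathcal{G}_\phi$ is an optional module to process the additional control input $\mathbf{c}_t$, $\mathcal{E}_\phi$ is a state encoder that has exactly the same structure as the original encoder $\mathcal{E}_\xi$, and $\mathcal{F}_\phi$ is a final embedding module, which can be a fully-connected layer or a stack of multiple fully-connected layers.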
During retraining for adaptation, we perform policy optimization as in Eq. 3, but only optimize the new parameters $\phi$ while keeping the parameters $\theta$ and $\xi$ fixed:
$$\max_{\phi}\ \mathbb{E}_t\left[A_t \log \pi_\theta\big(\mathbf{a}_t \mid \mathcal{E}_\xi(\mathbf{s}_t) + \mathcal{I}_\phi(\mathbf{s}_t, \mathbf{c}_t)\big)\right]. \tag{6}$$
We begin with copying the original encoder parameters $\xi$ into the new encoder $\mathcal{E}_\phi$ and initializing the last fully-connected layer inside $\mathcal{F}_\phi$ with zero weight and bias. In this way, the new encoder $\mathcal{E}_\phi$ is optimized by finetuning a set of parameters that are already optimized for state feature extraction during pre-training. The zero initialization of $\mathcal{F}_\phi$ lets the control policy give exactly the same action output as the original pre-trained one, i.e., $\pi_\theta(\mathbf{a}_t \mid \mathcal{E}_\xi(\mathbf{s}_t))$, at the beginning of re-training. It guides the adaptation to start from the state-action trajectory generated by the original policy rather than from a completely random exploration. We refer to Figure 2 for the default implementation of AdaptNet, where the latent space injection is performed right after the concatenation of $\mathbf{o}_t$ and $\mathbf{g}_t$. We denote this latent space as $\mathcal{Z}_0$, and the following ones after each fully-connected layer but before the final action layer as $\mathcal{Z}_i$ where $i = 1, 2, \cdots$. Empirically, we note that it is more challenging to perform optimization when the injection occurs at a deeper layer in the policy network, leading typically to unstable training and low-fidelity controllers. An extreme case is to perform injection directly at the action space, which makes the whole system similar to directly finetuning the pre-trained policy network. We refer to Section 9 for related sensitivity analysis on introducing latent space injection at different network layers and for comparisons with directly finetuning a pre-trained policy network for new tasks.
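The initialization described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the authors' implementation: a single tanh layer stands in for the GRU encoder, there is no additional control input, the embedding module is one linear layer, and all names (`linear`, `encode`, `adapted_latent`, `enc_xi`, `enc_phi`, `f_phi`) are hypothetical. It shows that copying the encoder and zero-initializing the final embedding layer leaves the latent unchanged before any adaptation training.

```python
import copy
import math
import random


def linear(W, b, x):
    """Fully-connected layer y = W x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def random_layer(n_out, n_in, rng):
    """Random (W, b) pair standing in for pre-trained weights."""
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return W, b


def zero_layer(n_out, n_in):
    """(W, b) pair with all-zero weight and bias."""
    return [[0.0] * n_in for _ in range(n_out)], [0.0] * n_out


def encode(layer, s):
    """Stand-in state encoder: one tanh layer (an assumption, not a GRU)."""
    W, b = layer
    return [math.tanh(v) for v in linear(W, b, s)]


def adapted_latent(enc_xi, enc_phi, f_phi, s):
    """Latent after injection: frozen encoder output plus a learned offset."""
    z = encode(enc_xi, s)                          # frozen pre-trained latent
    W, b = f_phi
    injection = linear(W, b, encode(enc_phi, s))   # offset from the injector
    return [zi + di for zi, di in zip(z, injection)]


rng = random.Random(0)
state_dim, latent_dim = 6, 4
enc_xi = random_layer(latent_dim, state_dim, rng)  # pre-trained encoder (frozen)
enc_phi = copy.deepcopy(enc_xi)                    # new encoder starts as a copy
f_phi = zero_layer(latent_dim, latent_dim)         # zero-initialized embedding layer

s = [rng.uniform(-1.0, 1.0) for _ in range(state_dim)]
z_pretrained = encode(enc_xi, s)
z_adapted = adapted_latent(enc_xi, enc_phi, f_phi, s)
print(z_adapted == z_pretrained)
```

Because the downstream policy layers see the same latent, the adapted policy initially reproduces the pre-trained actions; only once the embedding layer receives non-zero updates does the latent start to move.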
During runtime, we can further introduce an extra scaling coefficient to the injection term in Eq. 4. Since our approach changes neither the original encoder $\mathcal{E}_\xi$ nor the policy $\pi_\theta$, the scale coefficient allows us to turn the injection on and off, or control the transition from the original policy to the fully adapted one. In such a way, we can perform motion style or behavior transitions (e.g., walk to skip) by interpolation in the latent space, as we will show in Section 8.1.
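A minimal sketch of this runtime scaling follows, assuming the coefficient simply multiplies the injection term before it is added to the pre-trained latent. The helper names `blend_latent` and `linear_ramp`, the linear schedule, and the numeric latents are all illustrative assumptions rather than details given in the text.

```python
def blend_latent(z_pretrained, injection, scale):
    """Scaled injection: scale=0 gives the original latent, scale=1 the adapted one."""
    return [z + scale * d for z, d in zip(z_pretrained, injection)]


def linear_ramp(step, n_steps):
    """Hypothetical schedule that moves the scale from 0 to 1 over n_steps."""
    return min(1.0, max(0.0, step / n_steps))


# Made-up latent of the original policy and the offset produced by the injector.
z_walk = [1.0, -2.0]
injection = [0.5, 1.0]

# Transition from the original behavior to the adapted style over 4 control steps.
trajectory = [blend_latent(z_walk, injection, linear_ramp(k, 4)) for k in range(5)]
for k, z in enumerate(trajectory):
    print(k, z)
```

In practice the pre-trained latent and the injection would be recomputed from the current state at every control step; they are held fixed here only to make the effect of the scale visible.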

INTERNAL ADAPTATION FOR CONTROL LAYERS
The latent space injection component of AdaptNet edits the latent space based on the input state and further allows us to introduce additional control input for new tasks. However, the expressive ability of the action policy is still constrained by the pre-trained layers after the state encoder in the policy network, i.e., $\pi_\theta$. While utilizing the pre-trained $\pi_\theta$ for fast adaptation to new tasks, we introduce an internal adaptation component through which we can finetune $\pi_\theta$, overcoming the bias it introduces and allowing for more flexibility in the types of generated controls compared to the ones obtained from the original training domain. The goal of the finetuning is to find a small increment $\Delta\mathbf{z}^i_t$ to the original latent $\mathbf{z}^i_t$ in each latent space $\mathcal{Z}_i$, $i \geq 1$, to help optimize the objective function in Eq. 6 during adaptation training, but without changing $\pi_\theta$ so much that the policy drifts far from the pre-trained one and overfits during adaptation. To do so, we add a branch to each fully-connected layer between two latent spaces. As shown in the red block of Figure 2, the corresponding latent is generated as:
$$\mathbf{z}^i_t = \mathcal{F}^i_\theta(\mathbf{z}^{i-1}_t) + \tilde{\mathcal{F}}^i_\eta(\mathbf{z}^{i-1}_t). \tag{7}$$
Here, $\mathcal{F}^i_\theta$ denotes the fully-connected layer between the latent space $\mathcal{Z}_{i-1}$ and $\mathcal{Z}_i$ in the policy network $\pi_\theta$, and $\tilde{\mathcal{F}}^i_\eta$ is the newly introduced adaptor that generates $\Delta\mathbf{z}^i_t$ and is modeled as a fully-connected layer in the added branch. The parameter $\eta$ is defined as
$$\eta = \{\Delta\mathbf{W}_i, \Delta\mathbf{b}_i\}, \tag{8}$$
with $\Delta\mathbf{W}_i$ and $\Delta\mathbf{b}_i$ being the weight and bias parameters in $\tilde{\mathcal{F}}^i_\eta$ respectively. Similarly to the embedding module $\mathcal{F}_\phi$ in the latent space injection component, $\tilde{\mathcal{F}}^i_\eta$ is initialized as zero and will not influence the output of the policy network at the beginning of policy adaptation. We lock $\theta$ in $\mathcal{F}^i_\theta$ during adaptation training and introduce the parameter $\eta$ into the optimization function in Eq. 6.
Our approach is different from directly finetuning $\pi_\theta$. When directly finetuning $\pi_\theta$, the gradient from $\mathbf{z}^i_t$ with respect to $\mathbf{z}^{i-1}_t$ is decided by the weight $\mathbf{W}_i$ in the layer $\mathcal{F}^i_\theta$, which may be highly biased and have relatively large or very small values given it was fully trained. Therefore, finetuning $\pi_\theta$ directly for new tasks may lead to unstable training compared to only finetuning the newly introduced parameter set $\eta$ which is initialized with zero. Furthermore, we can easily apply regularization on $\Delta\mathbf{W}_i$ and $\Delta\mathbf{b}_i$ to prevent aggressive finetuning regardless of the value of the parameters $\mathbf{W}_i$ and $\mathbf{b}_i$ in the pre-trained layer $\mathcal{F}^i_\theta$. This will limit the possible change that the internal adaptation can bring about in order to prevent overfitting. We can also introduce an extra scaling weight to control the adaptation level during runtime, as discussed in Section 4.
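The adaptor branch and its regularization can be sketched as below. This is a hedged illustration only: the layers are treated as purely linear (any activation is left out), the penalty is a plain sum of squares, and the names `adapted_layer` and `l2_penalty` and all numeric values are hypothetical.

```python
def linear(W, b, x):
    """Fully-connected layer y = W x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def adapted_layer(W, b, dW, db, z_prev):
    """Frozen pre-trained layer plus a trainable adaptor branch on the same input."""
    base = linear(W, b, z_prev)      # pre-trained layer, parameters locked
    delta = linear(dW, db, z_prev)   # adaptor output, i.e. the latent increment
    return [u + v for u, v in zip(base, delta)]


def l2_penalty(dW, db):
    """Sum-of-squares penalty on the adaptor parameters only."""
    return sum(w * w for row in dW for w in row) + sum(v * v for v in db)


# Made-up pre-trained weights and an input latent.
W = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
z_prev = [1.0, 1.0]

# Zero-initialized adaptor: the layer output equals the pre-trained output.
dW0, db0 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
print(adapted_layer(W, b, dW0, db0, z_prev))

# After some hypothetical updates, the adaptor shifts the latent.
dW1, db1 = [[0.5, 0.0], [0.0, 0.0]], [0.0, 1.0]
print(adapted_layer(W, b, dW1, db1, z_prev), l2_penalty(dW1, db1))
```

The penalty depends only on the adaptor parameters, so its strength can be chosen without regard to how large or small the pre-trained weights happen to be.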
Our proposed internal adaptation component is similar to the approach of low-rank adaptation (LoRA) proposed by Hu et al. [2021]. The major difference is that instead of directly employing a fully-connected layer, LoRA decomposes the weight matrix $\Delta\mathbf{W}_i$ into two low-rank matrices, i.e., $\Delta\mathbf{W}_i = \mathbf{B}_i\mathbf{A}_i$, where $\mathbf{B}_i$ is a $|\mathcal{Z}_{i-1}|$-by-$r$ matrix, $\mathbf{A}_i$ is an $r$-by-$|\mathcal{Z}_i|$ matrix, and $r \ll \min(|\mathcal{Z}_{i-1}|, |\mathcal{Z}_i|)$. In contrast, our approach can be considered a full-rank adaptation. LoRA has been demonstrated as an effective way to fine tune large language and image generation models, reducing the number of parameters that need to be optimized during model adaptation. However, as shown in Section 9.3, we found that LoRA does not work well for physics-based character control tasks. A possible reason is that the related policy networks are markedly smaller compared to large language and image generation models that may have more than 12K dimensions. The latent spaces of our policy network have a typical size of 512 or 1024 dimensions and may not exhibit the lower intrinsic ranks that larger models do [Aghajanyan et al. 2021; Li et al. 2018; Pope et al. 2021].
ALGORITHM 1: Policy Adaptation using AdaptNet
Obtain the policy $\pi_\theta$ and the state encoder $\mathcal{E}_\xi$ by performing training to optimize Eq. 9 in a general or default environment setting.
1 Build up the latent space injection component $\mathcal{I}_\phi$ based on Eq. 5 and the internal adaptation component $\{\tilde{\mathcal{F}}^i_\eta\}$ based on Eq. 7.
2 Lock the parameters $\theta$ and $\xi$.
3 Initialize $\mathcal{E}_\phi$ using the pre-trained parameter $\xi$.
4 Initialize the last layer inside $\mathcal{F}_\phi$ and each $\tilde{\mathcal{F}}^i_\eta$ using zero weight and bias.
5 Adapt the policy for a new task by only optimizing the parameters $\phi$ and $\eta$ using Eq. 12.
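To put the low-rank versus full-rank comparison in numbers, the sketch below counts trainable parameters for one adaptor between latent spaces of the sizes mentioned in the text (1024 and 512). The rank of 8 is an arbitrary illustrative choice, and the function names are hypothetical; the low-rank count assumes no bias term.

```python
def full_rank_params(d_in, d_out):
    """Trainable parameters of a full fully-connected adaptor (weight plus bias)."""
    return d_in * d_out + d_out


def low_rank_params(d_in, d_out, rank):
    """Trainable parameters of a rank-r factorization (d_in x r and r x d_out)."""
    return rank * (d_in + d_out)


d_in, d_out = 1024, 512      # latent sizes of the order quoted in the text
rank = 8                     # assumed rank, chosen only for illustration

print(full_rank_params(d_in, d_out))        # full-rank adaptor
print(low_rank_params(d_in, d_out, rank))   # low-rank adaptor
```

At these sizes the full-rank adaptor is still small in absolute terms, so the parameter savings that motivate low-rank factorization in very large models matter much less here.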

POLICY TRAINING
We use the multi-objective learning framework for physics-based character control proposed by Xu et al. [2023] to perform both the original (pre-)training and adaptation training. The framework leverages a multi-critic structure where the objectives of motion imitation and goal-directed control are considered independent tasks during policy updating. In Figure 2, for example, the imitation objective is associated with a critic network labeled in blue, and the goal-directed objective is associated with a critic in magenta. The advantage (cf. Eqs. 3, 6) with respect to each objective is estimated only by its associated reward and critic network. To ensure that the policy can be updated in a balanced way taking into account both the imitation and goal-directed control objectives, all estimated advantages are standardized independently before policy updating.
During pre-training, we seek to find a basic motor control policy $\pi_\theta(\mathbf{a}_t \mid \mathcal{E}_\xi(\mathbf{s}_t))$, which we can later adapt to new tasks. In this work, we focus on locomotion tasks, and thus $\pi_\theta$ involves two objectives: a motion imitation objective given a batch of reference motions of walking and running, and a goal-directed objective involving a given target direction and speed. Using the multi-objective learning framework, the optimization objective function during pre-training shown in Eq. 3 can be written as
$$\max_{\theta, \xi}\ \mathbb{E}_t\left[\sum_k \omega_k \bar{A}^k_t \log \pi_\theta(\mathbf{a}_t \mid \mathcal{E}_\xi(\mathbf{s}_t))\right], \tag{9}$$
where $\bar{A}^k_t$ is the standardization of the estimated advantage associated with the objective $k$ and $\omega_k$ satisfies $\sum_k \omega_k = 1$, providing additional control to adjust the policy updating in a preferred manner when conflicts between multiple objectives occur.
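The per-objective standardization and weighting described above can be sketched as follows. This is an illustration under assumptions the text does not fix: standardization is taken to be zero-mean, unit-variance scaling over a batch (population standard deviation with a small epsilon), and the function names, objective labels, weights, and advantage values are all made up.

```python
def standardize(values, eps=1e-8):
    """Zero-mean, unit-variance scaling of a batch of advantage estimates."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / (std + eps) for v in values]


def combined_advantage(advantages, weights):
    """Weighted sum of independently standardized per-objective advantages."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("objective weights must sum to 1")
    standardized = {k: standardize(a) for k, a in advantages.items()}
    n = len(next(iter(standardized.values())))
    return [sum(weights[k] * standardized[k][t] for k in standardized)
            for t in range(n)]


# Made-up advantage estimates on very different scales for the two objectives.
advantages = {"imitation": [0.1, 0.2, 0.3, 0.4], "goal": [40.0, 10.0, 30.0, 20.0]}
weights = {"imitation": 0.5, "goal": 0.5}
mixed = combined_advantage(advantages, weights)
print(mixed)
```

Because each objective is standardized on its own, rescaling one reward (for example multiplying all goal advantages by 100) leaves the combined signal essentially unchanged, so neither objective dominates the update merely through its units.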
We employ a GAN-like structure [Ho and Ermon 2016; Merel et al. 2017] that relies on an ensemble of discriminators [Xu and Karamouzas 2021] to evaluate the imitation performance and generate the corresponding reward signals for advantage estimation and policy updating. In particular, we take an ensemble of $N$ discriminators and use a hinge loss [Lim and Ye 2017] with gradient penalty [Gulrajani et al. 2017] for discriminator training, resulting in the loss function of Eq. 10. In that loss, $D_n$ denotes a discriminator network, $\hat{\mathbf{o}}_t = \alpha \mathbf{o}_t + (1 - \alpha)\tilde{\mathbf{o}}_t$ with $\alpha \sim \text{Uniform}(0, 1)$, and $\lambda^{GP}$ is the gradient penalty coefficient. The reward function to evaluate the imitation performance is defined in Eq. 11. The reward for the goal-related task is computed heuristically. We refer to the appendix for the representation of the goal state $\mathbf{g}_t$ and the definition of the goal-related task reward.
After obtaining $\pi_\theta$ and $\mathcal{E}_\xi$ in pre-training, we introduce the proposed AdaptNet to perform policy adaptation for new tasks that are related to, but have different reward definitions and/or environment settings from, the one in the pre-training phase. Before the adaptation training starts, we lock the parameters $\theta$ and $\xi$. We then initialize $\mathcal{E}_\phi$ inside the latent space injection component $\mathcal{I}_\phi$ using the weights $\xi$, and initialize with zero weight and bias the last layer of $\mathcal{F}_\phi$ inside $\mathcal{I}_\phi$ along with each full-rank adaptor $\tilde{\mathcal{F}}^i_\eta$, $i > 0$. To stabilize the training, besides applying a common weight decay to the parameter set $\eta$ (Eq. 7) via L2 regularization, we introduce an additional regularization on the latent injection generated by $\mathcal{I}_\phi$. The adaptation training is still performed under the aforementioned multi-objective learning framework in the same way as the pre-training phase. The optimization objective for policy adaptation is given in Eq. 12, in which each of the two regularization terms is weighted by its own regularization coefficient. In Section 10, we give a detailed analysis of the regularization on the latent space injection. We refer to Algorithm 1 for the outline of the whole training process. Adaptation with the proposed AdaptNet can be done very quickly, within 10-30 minutes for simple control tasks and up to 4 hours for challenging terrain adaptation tasks with new control input processed by an additional convolutional neural network $\mathcal{G}_\phi$, as defined in Eq. 5.

EXPERIMENTAL SETUP
Our experiments were run using IsaacGym [Makoviychuk et al. 2021] with 512 environments running in parallel during training. The simulated character has 15 body links and 28 degrees of freedom, where the elbow and knee joints are implemented as 1-dimensional revolute joints, and the hands are fused with the forearms and uncontrollable. All simulations run at 120 Hz with a normal PD controller employed as the low-level actuator to directly manipulate the simulated character, while the control policy runs at 30 Hz, as shown in Figure 2. We run policy optimization using PPO [Schulman et al. 2017] and update policy parameters using the Adam optimizer [Kingma and Ba 2017]. To encode the character's state, we take the position, orientation, and velocities of all the body links relative to the pelvis (root link) in the last four frames as the state representation $\mathbf{o}_t$ and employ a gated recurrent unit (GRU) [Chung et al. 2014] with a 256-dimension hidden state to process this temporal state. For discriminator training, we take the character's pose at five consecutive frames as the representation of $\{\mathbf{o}_t, \mathbf{o}_{t+1}\}$ to evaluate the policy's imitation performance during the transition from timestep $t$ to $t + 1$. We employ an ensemble of 32 discriminators and model it by a multi-head network, as shown in Figure 3. The critic network has a similar structure to the policy network, but with a 2-dimensional output for the value estimations of the imitation objective and goal-directed objective respectively. We refer to the appendix for the hyperparameters used for policy training and the representation of the goal state $\mathbf{g}_t$ in the locomotion task.
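The two rates quoted above imply four simulation steps per policy decision. The sketch below shows that nesting with toy stand-ins; `run_control_loop`, `toy_policy`, and `toy_pd_step` are hypothetical names, and the one-dimensional "PD step" is only a placeholder for the physics engine and PD servo, not the simulator's API.

```python
def run_control_loop(policy, pd_step, state, n_policy_steps,
                     sim_hz=120, policy_hz=30):
    """Query the policy at policy_hz and advance the simulation at sim_hz."""
    if sim_hz % policy_hz != 0:
        raise ValueError("simulation rate must be a multiple of the policy rate")
    substeps = sim_hz // policy_hz           # 120 / 30 = 4 sim steps per action
    for _ in range(n_policy_steps):
        target = policy(state)               # new PD target posture
        for _ in range(substeps):            # PD servo tracks it between decisions
            state = pd_step(state, target)
    return state


calls = {"policy": 0, "sim": 0}


def toy_policy(state):
    """Placeholder policy: always asks for the same target value."""
    calls["policy"] += 1
    return 1.0


def toy_pd_step(state, target):
    """Placeholder dynamics: move halfway towards the target each sim step."""
    calls["sim"] += 1
    return state + 0.5 * (target - state)


# One simulated second: 30 policy queries and 120 simulation steps.
final_state = run_control_loop(toy_policy, toy_pd_step, 0.0, n_policy_steps=30)
print(calls, final_state)
```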
Rewards for both task and imitation are employed during policy adaptation. To avoid bias from the pre-trained policy, we discard the discriminators used for imitation in the original policy and train new discriminators from scratch. Intuitively, in tasks such as motion style transfer, the original discriminator will not work well for the newly given reference style, and thus a new one is needed. Even for other adaptation tasks, we found utilizing old discriminators to be problematic, as the optimal action in the new task can change dramatically from the original in the context of how it employs the reference motion. Empirically, when we experimented with reusing the old discriminators, we found they introduce too much bias towards the old task. Finally, along with training new discriminators for a new task, we also perform value estimation by training a new critic from scratch.
All our tests were run on machines with a V100 or A100 GPU. To achieve a good locomotion policy on which to perform further adaptation, the pre-training took around 26 hours and consumed 4 × 10^8 training samples. The reference motions are around

APPLICATIONS OF ADAPTNET
In this section, we apply the AdaptNet technique to demonstrate the success and efficiency of learning new physics-based controllers through adaptation. Our experiments use two pre-trained locomotion policies (walking and running) that account for two objectives: motion imitation based on a batch of walking or running reference motions, respectively, and a goal objective as defined by a target direction of motion and speed. We adapt the pre-trained policies to a range of new tasks, highlighting applications of AdaptNet to style transfer, character morphology changes and adaptation to different terrains. Figure 1 shows snapshots from different outcomes. Please refer to the supplementary video for related animation results.

Motion Style Transfer and Interpolation
We consider a variety of motion style transfer tasks where a pre-trained walking locomotion policy is adapted to a particular style. Note that this is not a simple motion imitation task, since all the style reference motions are very short (see Table 1, bottom), containing only one or two gait cycles. It is therefore impossible to train an equivalent locomotion policy that supports goal-directed steering using the target reference motion alone. Instead, the nature of this test is few-shot learning, where AdaptNet is expected to learn how to perform locomotion in the style shown by the short example in the new reference, while relying on the pre-trained policy to perform turning and goal-directed steering.
Figure 5 depicts related qualitative results. AdaptNet can effectively learn how to do goal-directed turning in the provided style. Further, adaptation training can be done very quickly, within 10-30 minutes, whereas the original policy took about one day to train during pre-training. We refer to the supplementary video for animation results, and to Section 9 for a comparison of AdaptNet to learning stylized locomotion from scratch.
As discussed in Sections 4 and 5, we can perform motion interpolation in the latent space by introducing a scale variable to control the adaptation level. This process can be described by modifying Eqs. 4 and 7 so that the AdaptNet contributions are multiplied by the introduced scale variable, which takes values in [0, 1] (Eq. 13). In Figure 4, we show interpolation results. As shown in the figure, we can achieve motions with different style intensity, which can transition between the base walking motion and the stylized ones in a smooth manner. We can further extend Eq. 13 to perform interpolation between any two AdaptNet models, where the singly primed parameters are from one AdaptNet model and the doubly primed parameters are from the other one. Such an interpolation scheme can be regarded as applying two independently trained AdaptNet models simultaneously on the same pre-trained policy, with an example shown in Figure 6. The above interpolation results demonstrate that during adaptation training, AdaptNet can effectively learn structured information about the latent space with respect to the desired motion styles. We refer to Section 10 for more details on controlling the latent space and related visualizations, along with an analysis of the training difficulty (time consumption) when learning different styles.
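A minimal sketch of the two interpolation schemes is given below. The exact equations are not reproduced here, so the linear forms, the function names, and the symbol `alpha` for the scale variable are assumptions. The sketch also covers only the latent injection; the same scaling would apply to the outputs of the internal adaptors.

```python
def scale_injection(z_base, injection, alpha):
    """alpha = 0 keeps the base latent, alpha = 1 applies the full injection."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [z + alpha * d for z, d in zip(z_base, injection)]

def blend_two_models(z_base, injection_a, injection_b, alpha):
    """Linearly mix the injections of two adaptation models on the same base latent."""
    mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(injection_a, injection_b)]
    return [z + d for z, d in zip(z_base, mixed)]

z_base = [1.0, -2.0]        # hypothetical latent from the pre-trained policy
style_a = [0.5, 0.5]        # hypothetical injection of a first style model
style_b = [-1.0, 2.0]       # hypothetical injection of a second style model

halfway_to_a = scale_injection(z_base, style_a, alpha=0.5)
only_a = blend_two_models(z_base, style_a, style_b, alpha=0.0)
only_b = blend_two_models(z_base, style_a, style_b, alpha=1.0)
```

With the scale at 0 the character is driven purely by the original policy, and sweeping it toward 1 increases the style intensity, matching the behavior described for Figure 4.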

Morphological Adaptation
We consider two kinds of morphological changes: body shape and joint lock. Due to physical constraints, morphological changes in the character model will cause the same action a_t to lead to different resulting states compared to the ones observed in the pre-training phase. Without adaptation, the pre-trained policy does not perform well, if it is able to keep the character balanced at all, especially when the lower body is modified.
We tested eight body-shape variants of the original character model, as shown in Figure 7. In the LongBody variant, we extend the abdomen length by 50%, while in the BigBody variant we increase the torso size by 50%. The latter leads to an increase in the torso mass of over 300%. In the LongUpperArms and LongLowerArms variants, the lengths of the upper and lower arms are extended by 25%, respectively, while in AsymmetricUpperArms, we increase the length of the right upper arm but decrease the length of the left upper arm. In the LongThighs and LongShins variants, the lengths of the upper and lower legs are extended by 50%, respectively, the latter akin to a human walking on stilts. In the SuperLongLegs model, both the thighs and shins are extended, resulting in a character that is over 2 m tall.
We also experimented with different configurations, as shown in Figure 8, where some of the joints (in orange) are 'locked'. The locked joints are removed from the character model such that the linked body parts are fused together. This reduces the number of dimensions of the action space. To make the pre-trained policy compatible with the new action space, we simply prune the weight and bias matrices of the last layers in the policy network and remove the output neurons corresponding to the locked joints.
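The pruning step can be sketched as dropping the rows of the final layer that drive the locked degrees of freedom. The function name, the list-of-rows weight layout, and the toy dimensions are hypothetical; a real policy head may also carry other per-action outputs that would need the same treatment.

```python
def prune_output_layer(weights, bias, locked_dims):
    """Remove the output neurons (rows and biases) that drive locked joint DoFs."""
    locked = set(locked_dims)
    keep = [i for i in range(len(bias)) if i not in locked]
    return [weights[i] for i in keep], [bias[i] for i in keep]

# Hypothetical final layer: 4 action dimensions, 2 hidden features.
w_out = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
b_out = [0.0, 0.1, 0.2, 0.3]

# Lock the joint DoF at action index 2.
w_pruned, b_pruned = prune_output_layer(w_out, b_out, locked_dims=[2])
```

All remaining rows are copied unchanged, so the pruned policy produces the same outputs as before for the joints that are still controllable.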
Even though the pre-trained policy would not completely lose control of the character when the torso or arms are modified, the character still loses balance quite often. As more challenging examples, the morphological changes in the lower body parts and joints leave the pre-trained policy unable to control the character without falling. For example, when the knee joint is locked, the policy needs to adjust the output of the hip and ankle in order to compensate for the 'disability' of the knee. This requirement leaves the pre-trained policy incapable of suitably controlling the modified character model.
During adaptation, we did not do any retargeting to generate new reference motions for AdaptNet to learn. Instead, we simply modify the character's model while relying on the reference motions used to pre-train the original policy, which were retargeted to the character model without any morphological changes. We found it takes 15-30 minutes to finish the adaptation training, depending on the difficulty of the morphology change task. The character controlled by the AdaptNet policy can maintain its balance and walk or run without falling down. An interesting observation is that, in order to match the provided height of the root link (pelvis) in the reference motions, the AdaptNet policy will control the character to walk or run in a crouch, with the body at a relatively low position compared to the leg length. We show some representative results in Figure 9, and refer to the supplementary video for animations.

Terrain Adaptation
Next we discuss policy adaptation for character locomotion on low friction and rough terrains as well as obstacle-filled scenes that require extra control input.

Friction Adaptation.
To simulate an icy surface, we significantly reduce the ground friction. In particular, we decrease the friction coefficient from 1 to 0.15 for walking and to 0.35 for running. Figure 10 compares results obtained for the running policy with and without using AdaptNet. Note that AdaptNet can effectively control the character to change its moving direction by sliding on its feet, as shown in the left example of the figure. In addition, using AdaptNet, the character lowers its center of mass and takes quick steps to maintain its balance. In contrast, with the original policy, the character cannot run on the icy ground without falling down. For walking, the AdaptNet controller is more cautious, with the character preferring to stop and change its direction in place. Without using AdaptNet, the character tends to turn around with a bigger radius, but does not slow down. This demonstrates the ability of AdaptNet to change the behavior provided by the original policy to make it better suited to new environmental settings.

Terrain Adaptation with Additional Control Input.
To test AdaptNet with extra control input, we designed several experiments where the character is asked to do goal-steering navigation in challenging environments with procedurally generated terrains. A local heightmap is provided as the additional control input c_t, through which the character is expected to adjust its motions to prevent falling down during walking. The heightmap is extracted locally based on the character's root position and aligned with the orientation of the root, with a left and right horizon of 1.7 m, a backward horizon of 1 m and a forward horizon of 2.4 m. To process the heightmap c_t, we introduce a convolutional neural network (CNN) as the encoding module G (see Eq. 5) for AdaptNet. We refer to the appendix for the network structure of the CNN. An extra map encoding module having the same structure as G is added to the critic network for value estimation during adaptation. We show representative examples of our tested terrains in Figure 11; the appendix gives more detail on the terrains.
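As a rough illustration of how such a root-aligned heightmap could be sampled, the sketch below generates the world-space query points of the local grid. The horizons come from the text; the 0.1 m grid spacing, the function name, and the convention that the heading is a yaw angle measured from the world x-axis are assumptions.

```python
import math

def heightmap_sample_points(root_xy, heading, spacing=0.1,
                            left=1.7, right=1.7, back=1.0, front=2.4):
    """World-space XY sample points of a grid aligned with the root's facing direction."""
    cos_h, sin_h = math.cos(heading), math.sin(heading)
    n_fwd = int(round((front + back) / spacing)) + 1
    n_lat = int(round((left + right) / spacing)) + 1
    points = []
    for i in range(n_fwd):
        fwd = -back + i * spacing          # offset along the facing direction
        row = []
        for j in range(n_lat):
            lat = -right + j * spacing     # offset toward the character's left
            x = root_xy[0] + fwd * cos_h - lat * sin_h
            y = root_xy[1] + fwd * sin_h + lat * cos_h
            row.append((x, y))
        points.append(row)
    return points

grid = heightmap_sample_points(root_xy=(0.0, 0.0), heading=0.0)
```

Terrain heights queried at these points would form a single-channel image that a CNN encoder can consume. Under the assumed spacing the grid has 35 by 35 samples.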
We refer to the companion video for the navigation performance of the character when walking on the designed terrains after adaptation training. Even in terrains where the height changes smoothly, the character teeters under the control of the pre-trained policy, and a minor change in the terrain slope is enough to make the character stumble. After adaptation training, AdaptNet enables the character to smoothly walk and turn on the uneven terrains without falling. Besides being able to step over low-height obstacles, the AdaptNet character exhibits intelligent local decision making, trying not to step on the edge of the rocks on the rough terrain and avoiding overly rugged paths by altering its moving trajectory to an easy-to-follow one.
To further demonstrate the ability of AdaptNet to perform local path planning, we designed a more challenging environment with uncrossable obstacles randomly placed on the ground. We qualitatively show the results in Figure 12. As seen in the figure, the character controlled with AdaptNet (blue) can successfully walk around the obstacles. Without accounting for collisions, the character controlled solely by the initially trained policy (green) crosses through the regions where obstacles are placed. Unsurprisingly, the introduction of the CNN (detailed in Appendix B) increases the time needed to perform policy optimization iterations in the training for rough terrains. Still, for the easier terrains, training can be done within 1.5 hours. The more rugged terrain took around 4 hours for training. Finally, it took around 22 hours to train adaptation for the local obstacle avoidance test case. We note that this is still less time than is needed for training the original flat-ground locomotion policy from scratch (26 hours).

Perturbation Adaptation
In a final experimental foray, we investigate AdaptNet's ability to improve the handling of perturbations. Although the original policy can handle small perturbations, the character will still fall under larger impulses. In order to achieve more robust control, we adapt the control policy's ability to maintain balance in the presence of large disturbances. We begin with pre-trained policies for target-directed locomotion for walking and running. During the training process, we randomly apply perturbations (1000 N, lasting for 0.2 seconds) in different directions on the character's torso. With adaptation training of around 5 hours, the character is able to stay balanced against comparable impulses for both running and walking tasks. In contrast, the original controllers are not able to handle such perturbations consistently, and they often lead to the characters falling over. Furthermore, we also observe that the AdaptNet controller adjusts the character's footsteps to recover balance when the character is highly out of balance due to perturbations. A comparison of the original policy and our results can be seen in the supplementary video.
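For a sense of scale, the stated perturbation corresponds to an impulse of 200 N·s spread over 24 simulation steps at the 120 Hz simulation rate. The sketch below computes these quantities and samples a push direction; restricting pushes to the horizontal plane and the function name `sample_push` are assumptions, since the text only says the directions are random.

```python
import math
import random

SIM_HZ = 120          # simulation frequency stated in the experimental setup
FORCE_N = 1000.0      # perturbation magnitude (from the text)
DURATION_S = 0.2      # perturbation duration (from the text)

impulse = FORCE_N * DURATION_S              # total impulse in N*s
force_steps = round(DURATION_S * SIM_HZ)    # simulation steps during which the force acts

def sample_push(rng):
    """Sample a force vector of fixed magnitude in a random horizontal direction (assumed)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (FORCE_N * math.cos(angle), FORCE_N * math.sin(angle), 0.0)

push = sample_push(random.Random(0))
```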

ABLATION STUDIES
In this section, we compare the performance of AdaptNet to different baselines and perform a sensitivity analysis on the two components of the proposed AdaptNet technique.

Baseline Comparisons
We consider the following baselines: Scratch where a new policy is trained from scratch on a given task; FT where we directly finetune the pre-trained policy network to the newly given task; FT + Reg where we apply regularization on the weights of the policy network during finetuning; and PNet where policy adaptation is performed using a progressive neural network approach [Rusu et al. 2016b].
Figure 13 compares the learning curves for the goal-task performance between the baselines and AdaptNet on three style-transfer tasks (top row) and three adaptation tasks (bottom row), two involving changes in the character's morphology and one for lowered ground friction. For fair comparison, we employ the same training setup for all baselines, where the reward function of the new policy accounts for both a task objective and an imitation objective using an automatic weighting scheme [Xu et al. 2023]. In the motion style transfer experiments, the imitation term is computed using a new discriminator that takes only the stylized motions as the reference, similar to Section 8.1.
As can be seen from the learning curves in Figure 13, Scratch fails to attain the desired goals in the considered benchmarks, achieving a very low goal task reward within the given budget of 8M training samples. FT can effectively modify the locomotion policy in the bottom three tasks, where the character's morphology or environmental friction changes. However, in the motion style transfer tasks, the reward curve of FT noticeably drops after the training begins, as FT overfits to imitating the newly provided stylized reference motion and ignores the goal direction signal. In contrast, AdaptNet provides a stable task reward curve during the adaptation training, with the character being able to imitate the newly provided style without forgetting the previously learned locomotion behaviors, as seen in Figure 14. The above findings are in line with previous works [Peng et al. 2019; Rusu et al. 2016b] that have shown finetuning to be efficient when the parameters of a pre-trained model need to be slightly adjusted to a new target domain. However, FT can be susceptible to catastrophic forgetting when the imitation objective is significantly changed, as in the motion style transfer tasks. FT + Reg leads to poor training and low-fidelity controllers in all tasks. While, in theory, adding regularization can improve the navigation performance, in practice it is hard to regulate the weights during finetuning due to the presence of both significantly large and small weights in the pre-trained policy.
PNet shares similarities with AdaptNet, as both approaches add new weights to the original policy network and freeze the old weights during transfer learning. However, despite these similarities, the architectures of the two approaches are significantly different. AdaptNet uses a residual structure that supports merging, resulting in a single policy network which allows forward propagation in one pass during inference. In contrast, PNet does not support merging and requires the original network to be present and run first to compute the values of the hidden neurons in the added network. This adds significant complexity and memory overhead, with the network structure becoming larger and slower. Importantly, during training, the added network in PNet cannot start from zero, unlike in AdaptNet. In essence, the zero initialization in AdaptNet allows us to guide the adaptation starting from the original policy. This is clear in the style-transfer tasks, where AdaptNet begins training with a much higher reward than PNet due to the locomotion ability provided by the original policy. Despite its competitive final performance in several of the adaptation tasks, PNet is sample inefficient. Finally, we note that it can lead to forgetting the prior knowledge provided by the pre-trained policy, as the added network can significantly change the output of the whole model in some cases. This can be seen in the Penguin Walk task, where the navigation performance drops after 5M samples. Overall, AdaptNet consistently outperforms all four baselines in terms of final performance and sample efficiency. In terms of memory efficiency, Scratch and FT do not add any overhead. AdaptNet introduces additional parameters, but since the original network is frozen, the number of trainable parameters is still on the same scale as the original neural network when no conditional input, i.e., c_t and G, is needed. While the total number of parameters increases during training, the effective number of parameters is the same as in the original policy because AdaptNet can be merged into the original network. In contrast, PNet requires both networks to be present and effectively doubles the number of parameters.
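The merging property can be checked on a single linear layer: evaluating a frozen layer plus a parallel residual adaptor gives the same result as one pass through the summed weights. This is a minimal sketch under the assumption that each adaptor acts as a parallel linear map added before the activation; the matrices and helper names are hypothetical.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def add_matrices(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

w_frozen = [[0.5, -1.0], [2.0, 0.25]]    # hypothetical pre-trained layer weights
w_adaptor = [[0.1, 0.0], [-0.5, 0.75]]   # hypothetical learned full-rank residual
x = [1.0, 2.0]

# Two-branch evaluation: frozen layer plus parallel adaptor.
y_two_branch = [a + b for a, b in zip(matvec(w_frozen, x), matvec(w_adaptor, x))]

# Merged evaluation: fold the adaptor into the frozen weights once, then one pass.
w_merged = add_matrices(w_frozen, w_adaptor)
y_merged = matvec(w_merged, x)
```

Because the merged matrix has the same shape as the frozen one, inference cost and parameter count after merging match those of the original layer, which is the contrast drawn with PNet above.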

Latent Space Injection
Our default implementation performs injection on the latent space Z_0 right after the goal state g_t and character state o_t are encoded and concatenated together. Here, we test the application of the injection module to other latent spaces after Z_0 but before reaching the action space, along with applying injection on all possible latent spaces simultaneously. To solely study the performance of latent space injection, we also remove the full-rank adaptation modules for these tests. The tested network structures are shown in Figure 15.
To explore how the injection schemes perform differently in generating new policies, we run tests on several motion style transfer tasks. During our experiments, we observe qualitatively that injection at the lower space Z_2, or at all the latent spaces Z_{0:2}, which also includes the lower one, can easily produce jerky motions with stiff movements of the torso and legs. It can also lead to failures in training where the character falls repeatedly after a few training iterations. In Figure 16, we plot the trajectory of the foot height in two of our tested cases. While injection at Z_0 (blue) leads to a smooth repeatable trajectory, the curves become more irregular as the injected latent space changes from Z_1 (green) to Z_2 (orange) and then to Z_{0:2} (red). We also see some sharp jumps in the curves of Z_2 and Z_{0:2}, which represent fast motion transitions. We refer to the supplementary video for the animation results, including examples where injection at Z_2 and Z_{0:2} fails.
Overall, our tests show that as the chosen target latent space gets closer to the action space, it becomes more difficult for AdaptNet to generate desired motions, with Z_0 both intuitively and empirically giving the best results. This observation is in agreement with recent work in image synthesis, where the target space for manipulation is usually chosen nearer to the input of the generator rather than near the final output [Abdal et al. 2019; Karras et al. 2020; Zhuang et al. 2021]. In terms of the network structure in our implementation, the input state s_t ∈ R^784 is encoded into the first latent space Z_0 ∈ R^260 and then projected to Z_1 ∈ R^1024. The whole network, therefore, can be regarded as an encoder-decoder structure where the bottleneck is at Z_0. As we will show in Section 10, Z_0 is well-structured, which makes it amenable to manipulation for motion generation.

Comparison of Adaptation Methods
We quantitatively compare the imitation performance of AdaptNet against other adaptation approaches, including alternate methods with and without its internal adaptation component. As in prior work [Harada et al. 2004; Peng et al. 2021; Tang et al. 2008; Xu and Karamouzas 2021], we measure the imitation error via

$$e_t = \frac{1}{N_{\text{link}}} \sum_{l=1}^{N_{\text{link}}} \left\lVert p_l - \hat{p}_l \right\rVert,$$

where $N_{\text{link}}$ is the total number of body links, $p_l \in \mathbb{R}^3$ is the position of the body link $l$ in the world space at the time step $t$, and $\hat{p}_l$ is the body link's position in the reference motion. The evaluation results are shown in Table 2. We find that our proposed approach of combining latent space adaptation (LSA) and internal adaptation (IA) results in the best performance. While the results in Table 2 imply LSA alone is sufficient in many cases, IA appears to help most in the difficult motion style transfer tasks, e.g., Goose Step, Jaunty Skip and Joyful Walk, where the stylized motions are relatively far away from the pre-trained walking motions. In these tasks, adding IA improves the visual quality as well as motion smoothness, foot height, and gait frequency, as shown in the supplementary video. It is important to note, however, that IA alone produces subpar performance. In addition, it cannot account for the additional control input needed in other adaptation tasks such as terrain adaptation. Further, even when no additional [...] 2). As shown in Table 2, utilizing just the old latent space embedding is useful but no more valuable than using internal adaptation.
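One reading of the definitions above (the number of body links, and the simulated and reference world-space link positions) is a mean Euclidean distance over links at each time step. The sketch below implements that reading; the function name and the toy two-link poses are hypothetical, and averaging over time steps or trials, if any, is left out.

```python
import math

def imitation_error(link_positions, reference_positions):
    """Mean Euclidean distance between simulated and reference link positions."""
    if len(link_positions) != len(reference_positions):
        raise ValueError("both poses must contain the same number of links")
    n_link = len(link_positions)
    total = sum(math.dist(p, r) for p, r in zip(link_positions, reference_positions))
    return total / n_link

# Hypothetical two-link pose: one link is off by 1 m, the other matches exactly.
simulated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
error = imitation_error(simulated, reference)
```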
In addition to ablations of our own architecture, we also compare our IA component, which can be regarded as a full-rank adaptation scheme, to the low-rank adaptation (LoRA) scheme [Hu et al. 2021]. LoRA typically works well for adapting large language models with a low rank ≤ 8. However, we did not find any evident improvement over just using LSA when an intrinsic rank of 8 was employed. Even after increasing the rank to 64, the performance gap between the full-rank adaptation scheme and LoRA still remains, as listed in Table 2. Though using a low-rank decomposition can reduce the total number of parameters, it increases the computation cost since one more matrix multiplication is needed for each adaptor. Given the small size of our policy network, we conclude from our findings that the full-rank adaptation offers desirable benefits over LoRA.
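The parameter trade-off can be made concrete with a quick count for one adaptor. The 1024 by 1024 layer shape is a hypothetical choice motivated by the 1024-dimensional latent mentioned earlier, and the helper names are not from the paper.

```python
def full_rank_params(d_in, d_out):
    """Parameters of a full-rank adaptor: one dense d_out x d_in matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Parameters of a low-rank adaptor: a d_out x rank and a rank x d_in matrix."""
    return rank * (d_in + d_out)

d = 1024  # hypothetical layer width
counts = {
    "full": full_rank_params(d, d),
    "lora_r8": lora_params(d, d, rank=8),
    "lora_r64": lora_params(d, d, rank=64),
}
```

For this shape, rank 8 needs 64 times fewer parameters than the full-rank adaptor and rank 64 needs 8 times fewer, but for a policy network of this size the saving is small in absolute terms, which is consistent with the preference for full-rank adaptation stated above.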

LATENT SPACE ANALYSIS
In this section, we provide more insights on the ability of AdaptNet to successfully control and modify the latent space.

Latent Space Visualization
Figure 17 visualizes the latent space for different motion style transfer tasks. For each task, a controller was trained using AdaptNet starting from the same pre-trained walking policy. During adaptation training here, we use only the latent space injection component, applied at Z_0, for all models. We also remove the regularization term on the latent injection I in Eq. 12 and prolong the training time to let AdaptNet fit the style motions as much as possible. After training, we collect samples for each stylized motion from the simulated character following a straight path without any goal-direction changes. We use a multidimensional scaling technique to reduce the dimension of the collected latent samples.
As seen in the figure, the 2D projection of the latent space exhibits a circular shape with the pre-trained walking policy (dark purple) located near the center. There is a clear and roughly continuous transition when the motion style changes from one to the other, which demonstrates the well-structured nature of the latent space with the different motion styles. The distribution of the stylized motions in the visualized space is roughly consistent with the imitation error distribution listed in Table 2 when no internal adaptor is employed. Motions with smaller imitation errors are distributed generally closer to the pre-trained policy, while Joyful Walk (light green) has the largest error and is located the farthest away from the center of the circle. We also note that Penguin Walk (red) and Pace (light purple) show greater differences in frequency and speed and appear farther away from the center of the figure. This indicates that the distribution in the latent space not only reflects the pose similarity between motions but also some semantic information, like motion rhythm and gait frequency. Similar conclusions have been drawn by recent work in the field of image generation, where the latent space for image generation is considered to capture semantic information more than just simple color transformations [Epstein et al. 2022; Jahanian et al. 2020; Shen et al. 2020].

Latent Injection Regularization
In Figure 18, we show the latent visualizations of several motions generated by AdaptNet when L2 regularization is applied on the injected latent. For comparison, we highlight in white each motion's distribution in the full latent space shown in Figure 17. In the lower figures, the dark purple points represent the latent embedding of the pre-trained walking, while the gray points are generated by the pre-trained encoder E when the simulated character performs stylized motions. Other colors represent varying levels of regularization, as shown. The goal of regularization is to ensure that the generated latent falls into the manifold composed of the gray dots. This represents a relatively safe region where the latent is expected to be handled properly by the pre-trained policy.
In the Stoop task, there is almost no difference with and without using the L2 regularization. All visualized samples overlap and are covered by the gray region. This is expected given that the style motion of Stoop is close to the walking motion in the latent space. However, in the example of Pace, there is a clear separation when different regularization coefficients are employed. Note that when a coefficient of 0.1 is used, the generated stylized motion (orange) overlaps with the walking motion (dark purple). AdaptNet, in this case, is over-regularized. It yields to the pre-trained policy and fails to adapt it to perform the desired stylized motion. In contrast, without regularization (a coefficient of 0), the latent is already outside of the safe, gray region. AdaptNet, in this case, simply overfits to imitating the style motion and loses the ability to perform goal-steering navigation. While in Jaunty Skip any value can be employed, in Limp a coefficient of 0.01 best ensures that the latent stays within the gray manifold while attaining high imitation performance. In all adaptation tasks detailed in the paper, we found a coefficient of 0.01 to be sufficient. We note that such regularization is not necessary in other tested adaptation tasks without motion style transfer. In such cases, the new expected motions are close to the original policy and already lie in the safe region. We refer to the supplementary video for a visual comparison of the generated motions when different regularization coefficients are employed.

CONCLUSIONS
This paper presents AdaptNet, an approach for adapting existing RL-based character control policies to new motion tasks. Our approach applies two strategies. The first adapts the latent space by conditioning on the character's state and allowing the addition of new control inputs that will allow the control policy to perform new tasks. The second aims at control refinement, which allows policy adaptation by shifting the original policy and generating new control actions based on new training. Importantly, AdaptNet training always begins with having no (zero) influence, starting from the existing policy and increasing its influence as training proceeds.
We demonstrate that a previously trained control policy for locomotion can be adapted to support diverse style transfer, morphological changes including limb length variation and locked joints, and terrain adaptation including varied friction and geometry. These adaptations are also very efficient to learn. While the original locomotion policy requires 26 hours of training, our style adaptations take less than thirty minutes to produce a full controller that is capable of goal-directed steering while adhering to a specified walking style. More extreme adaptations require more time, but training is still far more efficient than the cost of learning the initial policy.
A core limitation of this work is that policy adaptation requires an existing pre-trained policy, and thus it cannot act to produce new motions on its own. While it is capable of migrating the policy to many new behaviors and conditions, extreme adaptations (e.g., training a jumping action with a long flight phase from a walking controller) do not produce the expected results. We believe this is due to the distinct characteristics of the two behaviors, and we see such 'deep' adaptation as a direction for future work. Also, while we demonstrate smooth interpolation between latent space embeddings, when we employ control-layer refinement, interpolation does not always produce coherent in-between behaviors. As we show in Section 9.2, an improper choice of the target latent space could lead to undesired control results. As such, we found that starting with a proper latent space is important for obtaining high-quality controllers.
In the current work, we use the recent approach of Xu et al. [2023] for pre-training an initial policy that is then modified by AdaptNet. In the future, we would like to see how well other recent approaches for training physics-based controllers [Peng et al. 2022, 2021; Yao et al. 2022] can work with our proposed approach. We would also like to investigate how our approach can be extended to generate a well-represented latent space that can be further exploited for motion synthesis. This opens up many avenues for further research, including latent space disentanglement, inversion, and shaping.

Fig. 3. Network structures. Here, ⊙ denotes the concatenation operator and ⊖ denotes the average operator. The state encoder E is shown in the dashed block. An optional control input encoding module G is included if the additional control input c_t is provided during adaptation training.

Fig. 4. Motion interpolation between walking (pre-trained policy), shown at the top-left corner, and different stylized motions by controlling the adaptation level of the associated AdaptNet model (cf. Eq. 13). Snapshots on the left show the learned stylized motions of Stoop walking and Jaunty Skip. When the scale variable is 0, the character is controlled only by the original walking policy. When the scale variable is 1, the character is controlled with a full injection of AdaptNet.

Fig. 6. Motion interpolation in the latent space by activating and switching between multiple AdaptNet models to let the character perform style transition interactively during goal-steering navigation.

Fig. 8. Character models with joints being locked.From left to right, the locked joints are abdomen, elbows, ankles, and right knee respectively (shown in red).Corresponding body parts between a locked joint are highlighted in orange.

Fig. 9. Adapting the locomotion policy of running to characters with different body shapes and locked joints.

Fig. 10. Comparison of characters controlled with and without AdaptNet running on an ice floor with very low friction. Left: character controlled with AdaptNet slides and skids on the icy ground while running. Right: character without AdaptNet slips and falls.

Fig. 11. Character controlled with AdaptNet navigates in the environment with procedurally generated terrains.

Fig. 12. Local collision avoidance in an obstacle-filled environment using AdaptNet.Green characters show the movement trajectory generated by the original walking policy without AdaptNet.

Fig. 13. Learning performance of our adaptation scheme using AdaptNet, training from scratch for each task (Scratch), using a progressive network (PNet), adaptation via directly finetuning the pre-trained policy (FT), and finetuning with regularization (FT + Reg). Colored regions denote the mean ± one standard deviation over 5 trials. The top row consists of motion style transfer tasks, while the bottom row focuses on morphological and terrain adaptation tasks.
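
The shaded regions in Fig. 13 can be computed as in the following minimal sketch. It assumes each trial is stored as one row of per-step performance values and uses the population standard deviation (`ddof=0`); the paper does not state which estimator was used, and the helper name `mean_std_band` and the synthetic curves are hypothetical.

```python
import numpy as np

def mean_std_band(curves):
    """Hypothetical helper: per-step mean and the mean +/- one std band.

    curves has shape (num_trials, num_steps).
    """
    curves = np.asarray(curves, dtype=float)
    mean = curves.mean(axis=0)  # average over trials at each step
    std = curves.std(axis=0)    # population std (ddof=0), an assumption
    return mean, mean - std, mean + std

# Synthetic stand-in for 5 trials of a learning curve with 100 steps.
rng = np.random.default_rng(0)
steps = np.linspace(0.0, 1.0, 100)
trials = np.stack([steps + 0.05 * rng.normal(size=steps.size) for _ in range(5)])

mean, lower, upper = mean_std_band(trials)
```

The three returned arrays map directly onto a line plot of `mean` with a filled region between `lower` and `upper`.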

Fig. 14. Top: AdaptNet successfully controls the character to turn while walking in the Pace style. Bottom: The character controlled by the FT policy keeps imitating the reference motion, pacing straight ahead, and fails to turn due to overfitting. Green arrows indicate the dynamically generated target directions for locomotion control.

Fig. 15. Injection at different latent spaces. Gray blocks represent the original policy network, which is locked during adaptation. Green blocks are the state encoder E, and blue ones are F. From left to right, the manipulated latent spaces are Z_0 (the default implementation of AdaptNet), Z_1, Z_2, and Z_{0:2}, respectively. We ignore G, given that there is no extra control input in the tested examples here.

Fig. 17. Latent space visualization with respect to different styles of walk-related motions. The latent representations of the stylized motions are obtained by AdaptNet without using the internal adaptation component. The walk motion (dark purple) near the center is provided by the pre-trained policy, on which AdaptNet performs adaptation to learn the stylized motions. The visualization uses multidimensional scaling to project the latent representations from 260 dimensions to 2 dimensions.

Fig. 18. Latent space distributions of Stoop, Pace, Jaunty Walk, and Limp (left to right). The top figures show the distribution of the stylized motions in the full latent space without regularization, and the bottom figures show the distribution with regularization applied during adaptation training. Gray points shown in the bottom figures are the latent embeddings generated by the pre-trained encoder E while the character performs the stylized motions. The regularization coefficient is the one from Eq. 12.
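
The projection from 260 to 2 dimensions described for Fig. 17 can be sketched as follows. The paper does not state which variant of multidimensional scaling was used, so this example assumes classical (Torgerson) MDS on Euclidean distances; the function name `classical_mds` and the random stand-in latents are hypothetical.

```python
import numpy as np

def classical_mds(points, out_dim=2):
    """Classical (Torgerson) MDS: embed points so that pairwise Euclidean
    distances are preserved as well as possible in out_dim dimensions."""
    x = np.asarray(points, dtype=float)
    n = x.shape[0]
    # Squared pairwise Euclidean distances.
    sq_norms = (x ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    # Double centering turns squared distances into a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:out_dim]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Random stand-in for 50 latent vectors of dimension 260.
rng = np.random.default_rng(0)
latents = rng.normal(size=(50, 260))
embedding = classical_mds(latents)  # shape (50, 2), ready for a scatter plot
```

Classical MDS is deterministic and needs only an eigendecomposition, which makes it a convenient baseline; iterative stress-minimizing variants may produce different layouts from the same latents.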

Table 1. Reference motions for policy pre-training (top) and stylized motion learning (bottom).

Motion   Length     Description
Walk     334.07 s   normal walking motions for pre-training
Run      282.87 s   normal running motions for pre-training

including normal walking and running motions with turning poses and various speeds (cf. Table 1, top). All the reference motions used during pre-training and adaptation training are recorded at 30 Hz and extracted from the publicly available dataset LAFAN1 [Harvey et al. 2020].