Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

Synthesizing visual content that meets users’ needs often requires flexible and precise controllability of the pose, shape, expression, and layout of the generated objects. Existing approaches gain controllability of generative adversarial networks (GANs) via manually annotated training data or a prior 3D model, which often lack flexibility, precision, and generality. In this work, we study a powerful yet much less explored way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner, as shown in Fig. 1. To achieve this, we propose DragGAN, which consists of two main components: 1) a feature-based motion supervision that drives the handle point to move towards the target position, and 2) a new point tracking approach that leverages the discriminative generator features to keep localizing the position of the handle points. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc. As these manipulations are performed on the learned generative image manifold of a GAN, they tend to produce realistic outputs even for challenging scenarios such as hallucinating occluded content and deforming shapes that consistently follow the object’s rigidity. Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking. We also showcase the manipulation of real images through GAN inversion.


INTRODUCTION
Deep generative models such as generative adversarial networks (GANs) [Goodfellow et al. 2014] have achieved unprecedented success in synthesizing random photorealistic images. In real-world applications, a critical functionality requirement of such learning-based image synthesis methods is the controllability over the synthesized visual content. For example, social-media users might want to adjust the position, shape, expression, and body pose of a human or animal in a casually-captured photo; professional movie pre-visualization and media editing may require efficiently creating sketches of scenes with certain layouts; and car designers may want to interactively modify the shape of their creations. To satisfy these diverse user requirements, an ideal controllable image synthesis approach should possess the following properties: 1) Flexibility: it should be able to control different spatial attributes including position, pose, shape, expression, and layout of the generated objects or animals; 2) Precision: it should be able to control the spatial attributes with high precision; 3) Generality: it should be applicable to different object categories rather than limited to a certain category. While previous works satisfy only one or two of these properties, we aim to achieve them all in this work.
Most previous approaches gain controllability of GANs via prior 3D models [Deng et al. 2020; Ghosh et al. 2020; Tewari et al. 2020] or supervised learning that relies on manually annotated data [Abdal et al. 2021; Isola et al. 2017; Ling et al. 2021; Park et al. 2019; Shen et al. 2020]. Thus, these approaches fail to generalize to new object categories, often control a limited range of spatial attributes, or provide little control over the editing process. Recently, text-guided image synthesis has attracted attention [Ramesh et al. 2022; Rombach et al. 2021; Saharia et al. 2022]. However, text guidance lacks precision and flexibility in terms of editing spatial attributes. For example, it cannot be used to move an object by a specific number of pixels.
To achieve flexible, precise, and generic controllability of GANs, in this work, we study a powerful yet much less explored form of control: interactive point-based manipulation. Specifically, we allow users to click any number of handle points and target points on the image, and the goal is to drive the handle points to reach their corresponding target points. As shown in Fig. 1, this point-based manipulation allows users to control diverse spatial attributes and is agnostic to object categories. The approach with the closest setting to ours is UserControllableLT [Endo 2022], which also studies dragging-based manipulation. Compared to it, the problem studied in this paper has two more challenges: 1) we consider the control of more than one point, which their approach does not handle well; 2) we require the handle points to precisely reach the target points, while their approach does not. As we will show in experiments, handling more than one point with precise position control enables much more diverse and accurate image manipulation.
To achieve such interactive point-based manipulation, we propose DragGAN, which addresses two sub-problems: 1) supervising the handle points to move towards the targets and 2) tracking the handle points so that their positions are known at each editing step. Our technique is built on the key insight that the feature space of a GAN is sufficiently discriminative to enable both motion supervision and precise point tracking. Specifically, the motion supervision is achieved via a shifted feature patch loss that optimizes the latent code. Each optimization step shifts the handle points closer to the targets; point tracking is then performed through nearest neighbor search in the feature space. This optimization process is repeated until the handle points reach the targets. DragGAN also allows users to optionally draw a region of interest to perform region-specific editing. Since DragGAN does not rely on any additional networks like RAFT [Teed and Deng 2020], it achieves efficient manipulation, only taking a few seconds on a single RTX 3090 GPU in most cases. This allows for live, interactive editing sessions, in which the user can quickly iterate on different layouts until the desired output is achieved.
We conduct an extensive evaluation of DragGAN on diverse datasets including animals (lions, dogs, cats, and horses), humans (face and whole body), cars, and landscapes. As shown in Fig. 1, our approach effectively moves the user-defined handle points to the target points, achieving diverse manipulation effects across many object categories. Unlike conventional shape deformation approaches that simply apply warping [Igarashi et al. 2005], our deformation is performed on the learned image manifold of a GAN, which tends to obey the underlying object structures. For example, our approach can hallucinate occluded content, like the teeth inside a lion's mouth, and can deform following the object's rigidity, like the bending of a horse leg. We also develop a GUI for users to interactively perform the manipulation by simply clicking on the image. Both qualitative and quantitative comparisons confirm the advantage of our approach over UserControllableLT. Furthermore, our GAN-based point tracking algorithm also outperforms existing point tracking approaches such as RAFT [Teed and Deng 2020] and PIPs [Harley et al. 2022] for GAN-generated frames. In addition, by combining with GAN inversion techniques, our approach also serves as a powerful tool for real image editing.

RELATED WORK

Generative Models for Interactive Content Creation
Most current methods use generative adversarial networks (GANs) or diffusion models for controllable image synthesis.
Unconditional GANs. GANs are generative models that transform low-dimensional randomly sampled latent vectors into photorealistic images. They are trained using adversarial learning and can be used to generate high-resolution photorealistic images [Creswell et al. 2018; Goodfellow et al. 2014; Karras et al. 2019, 2021]. Most GAN models like StyleGAN [Karras et al. 2019] do not directly enable controllable editing of the generated images.
Conditional GANs. Several methods have proposed conditional GANs to address this limitation. Here, the network receives a conditional input, such as a segmentation map [Isola et al. 2017; Park et al. 2019] or 3D variables [Deng et al. 2020; Ghosh et al. 2020], in addition to the randomly sampled latent vector to generate photorealistic images. Instead of modeling the conditional distribution, EditGAN [Ling et al. 2021] enables editing by first modeling a joint distribution of images and segmentation maps, and then computing new images corresponding to edited segmentation maps.
Controllability using Unconditional GANs. Several methods have been proposed for editing unconditional GANs by manipulating the input latent vectors. Some approaches find meaningful latent directions via supervised learning from manual annotations or prior 3D models [Abdal et al. 2021; Leimkühler and Drettakis 2021; Patashnik et al. 2021; Shen et al. 2020; Tewari et al. 2020]. Other approaches compute the important semantic directions in the latent space in an unsupervised manner [Härkönen et al. 2020; Shen and Zhou 2020; Zhu et al. 2023]. Recently, control over coarse object position has been achieved by introducing intermediate "blobs" [Epstein et al. 2022] or heatmaps [Wang et al. 2022b]. All of these approaches enable editing of either image-aligned semantic attributes such as appearance, or coarse geometric attributes such as object position and pose. While Editing-in-Style [Collins et al. 2020] showcases some capability for editing spatial attributes, it can only achieve this by transferring local semantics between different samples. In contrast to these methods, our approach allows users to perform fine-grained control over the spatial attributes using point-based editing.
GANWarping [Wang et al. 2022a] also uses point-based editing; however, it only enables out-of-distribution image editing. A few warped images can be used to update the generative model such that all generated images demonstrate similar warps. However, this method does not ensure that the warps lead to realistic images. Further, it does not enable controls such as changing the 3D pose of the object. Similar to us, UserControllableLT [Endo 2022] enables point-based editing by transforming latent vectors of a GAN. However, this approach only supports editing using a single point being dragged on the image and does not handle multiple-point constraints well. In addition, the control is not precise, i.e., after editing, the target point is often not reached.
3D-aware GANs. Several methods modify the architecture of the GAN to enable 3D control [Chan et al. 2021, 2022; Chen et al. 2022; Gu et al. 2022; Pan et al. 2021; Schwarz et al. 2020; Tewari et al. 2022; Xu et al. 2022]. Here, the model generates 3D representations that can be rendered using a physically-based analytic renderer. However, unlike our approach, control is limited to global pose or lighting.
Diffusion Models. More recently, diffusion models [Sohl-Dickstein et al. 2015] have enabled image synthesis at high quality [Ho et al. 2020; Song et al. 2020, 2021]. These models iteratively denoise randomly sampled noise to create a photorealistic image. Recent models have shown expressive image synthesis conditioned on text inputs [Ramesh et al. 2022; Rombach et al. 2021; Saharia et al. 2022]. However, natural language does not enable fine-grained control over the spatial attributes of images, and thus, all text-conditional methods are restricted to high-level semantic editing. In addition, current diffusion models are slow since they require multiple denoising steps. While progress has been made toward efficient sampling, GANs are still significantly more efficient.

Image Deformation
Deforming images according to users' point-drag commands is a classic problem in computer graphics. Conventional approaches [Botsch and Sorkine 2007] typically convert images into meshes, and then deform the mesh subject to geometric constraints such as rigidity [Igarashi et al. 2005; Sorkine and Alexa 2007] and Laplacian smoothness [Lipman et al. 2004, 2005; Sorkine et al. 2004]. An earlier work proposed the notion of shape deformation based on hand-crafted features [Beier and Neely 2023]. However, these geometric constraints and hand-crafted features lack knowledge of the underlying structure and rigidity of the edited objects, often producing sub-optimal deformation. Additionally, they cannot hallucinate new content when needed, such as synthesizing occluded regions. It has also been shown that point-drag editing can be approximated by navigating in a video, but the need for video data limits its applicability [Goldman et al. 2008]. In contrast, this work studies image deformation based on a strong generative image prior that captures rich information about object structure and appearance.

Point Tracking
To track points in videos, an obvious approach is through optical flow estimation between consecutive frames. Optical flow estimation is a classic problem that estimates motion fields between two images. Conventional approaches solve optimization problems with handcrafted criteria [Brox and Malik 2010; Sundaram et al. 2010], while deep learning-based approaches started to dominate the field in recent years due to better performance [Dosovitskiy et al. 2015; Ilg et al. 2017; Teed and Deng 2020]. These deep learning-based approaches typically use synthetic data with ground truth optical flow to train the deep neural networks. Among them, the most widely used method now is RAFT [Teed and Deng 2020], which estimates optical flow via an iterative algorithm. Recently, Harley et al. [2022] combine this iterative algorithm with a conventional "particle video" approach, giving rise to a new point tracking method named PIPs. PIPs considers information across multiple frames and thus handles long-range tracking better than previous approaches.
In this work, we show that point tracking on GAN-generated images can be performed without using any of the aforementioned approaches or additional neural networks. We reveal that the feature spaces of GANs are discriminative enough such that tracking can be achieved simply via feature matching. While some previous works also leverage the discriminative features in semantic segmentation [Tritrong et al. 2021; Zhang et al. 2021], we are the first to connect the point-based editing problem to the intuition of discriminative GAN features and design a concrete method. Getting rid of additional tracking models allows our approach to run much more efficiently to support interactive editing. Despite the simplicity of our approach, we show that it outperforms the state-of-the-art point tracking approaches including RAFT and PIPs in our experiments.

METHOD
This work aims to develop an interactive image manipulation method for GANs where users only need to click on the images to define some pairs of (handle point, target point) and drive the handle points to reach their corresponding target points. Our study is based on the StyleGAN2 architecture [Karras et al. 2020]. Here we briefly introduce the basics of this architecture.
StyleGAN Terminology. In the StyleGAN2 architecture, a 512-dimensional latent code $z \in \mathcal{N}(0, I)$ is mapped to an intermediate latent code $w \in \mathbb{R}^{512}$ via a mapping network. The space of $w$ is commonly referred to as $\mathcal{W}$. $w$ is then sent to the generator $G$ to produce the output image $\mathbf{I} = G(w)$. In this process, $w$ is copied several times and sent to different layers of the generator $G$ to control different levels of attributes. Alternatively, one can also use different $w$ for different layers, in which case the input would be $w \in \mathbb{R}^{l \times 512} = \mathcal{W}^+$, where $l$ is the number of layers. This less constrained $\mathcal{W}^+$ space is shown to be more expressive [Abdal et al. 2019]. As the generator $G$ learns a mapping from a low-dimensional latent space to a much higher dimensional image space, it can be seen as modelling an image manifold [Zhu et al. 2016].
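The relation between the two latent spaces can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors, and the helper name `to_w_plus` and the layer count of 14 are hypothetical choices made for illustration; they are not taken from StyleGAN2 itself.

```python
def to_w_plus(w, num_layers):
    """Copy one W-space code to every generator layer, giving a W+ code."""
    return [list(w) for _ in range(num_layers)]


# Hypothetical sizes: a 512-dimensional code and 14 generator layers.
w = [0.0] * 512
w_plus = to_w_plus(w, num_layers=14)

# In W+ each layer's code can be changed independently of the others,
# which is what makes this space less constrained than W.
w_plus[0][0] = 1.0
```

Starting from identical per-layer copies keeps the initial image unchanged; the extra freedom only matters once individual layers are edited.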

Interactive Point-based Manipulation
An overview of our image manipulation pipeline is shown in Fig. 2. For any image $\mathbf{I} \in \mathbb{R}^{3 \times H \times W}$ generated by a GAN with latent code $w$, we allow the user to input a number of handle points $\{p_i = (x_{p,i}, y_{p,i}) \mid i = 1, 2, \ldots, n\}$ and their corresponding target points $\{t_i = (x_{t,i}, y_{t,i}) \mid i = 1, 2, \ldots, n\}$ (i.e., the corresponding target point of $p_i$ is $t_i$). The goal is to move the object in the image such that the semantic positions (e.g., the nose and the jaw in Fig. 2) of the handle points reach their corresponding target points. We also allow the user to optionally draw a binary mask $\mathbf{M}$ denoting which region of the image is movable. Given these user inputs, we perform image manipulation in an optimization manner. As shown in Fig. 2, each optimization step consists of two sub-steps: 1) motion supervision and 2) point tracking. In motion supervision, a loss that enforces handle points to move towards target points is used to optimize the latent code $w$. After one optimization step, we get a new latent code $w'$ and a new image $\mathbf{I}'$. The update causes a slight movement of the object in the image. Note that the motion supervision step only moves each handle point towards its target by a small step, but the exact length of the step is unclear as it is subject to complex optimization dynamics and therefore varies for different objects and parts. Thus, we then update the positions of the handle points $\{p_i\}$ to track the corresponding points on the object. This tracking process is necessary because if the handle points (e.g., nose of the lion) are not accurately tracked, then in the next motion supervision step, wrong points (e.g., face of the lion) will be supervised, leading to undesired results.
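The alternation between the two sub-steps can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the actual pipeline: `drag_loop`, `make_toy`, `step_fn`, and `track_fn` are hypothetical names, the generator is replaced by a 2D offset that shifts by a fixed amount per step, and the stopping threshold is an illustrative default.

```python
import math


def drag_loop(handle, target, step_fn, track_fn, threshold=1.0, max_steps=300):
    """Alternate motion supervision and point tracking for one handle point."""
    state = None
    for step in range(max_steps):
        if math.dist(handle, target) <= threshold:
            return handle, step                  # handle reached the target
        state = step_fn(state, handle, target)   # sub-step 1: motion supervision
        handle = track_fn(state, handle)         # sub-step 2: point tracking
    return handle, max_steps


def make_toy(start, gain=0.7):
    """Toy stand-ins: the 'object' shifts by `gain` pixels per step.

    The loop never reads `gain`; it only learns the new position through
    tracking, mirroring the fact that the real step length is unknown.
    """
    def step_fn(offset, handle, target):
        ox, oy = offset or (0.0, 0.0)
        dx, dy = target[0] - handle[0], target[1] - handle[1]
        norm = math.hypot(dx, dy)
        return (ox + gain * dx / norm, oy + gain * dy / norm)

    def track_fn(offset, handle):
        return (start[0] + offset[0], start[1] + offset[1])

    return step_fn, track_fn


step_fn, track_fn = make_toy(start=(10.0, 10.0))
final_handle, steps = drag_loop((10.0, 10.0), (20.0, 10.0), step_fn, track_fn)
```

Because the loop re-reads the handle position after every update, it stops correctly even though it never knows how far each optimization step moved the content.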

Motion Supervision
Fig. 4. Qualitative comparison of our approach to UserControllableLT [Endo 2022] on the task of moving handle points (red dots) to target points (blue dots). Our approach achieves more natural and superior results on various datasets. More examples are provided in Fig. 10. (Columns: Inputs, Ours, UserControllableLT.)

How to supervise the point motion for a GAN-generated image has not been much explored before. In this work, we propose a motion supervision loss that does not rely on any additional neural networks. The key idea is that the intermediate features of the generator are very discriminative such that a simple loss suffices to supervise motion. Specifically, we consider the feature maps $\mathbf{F}$ after the 6th block of StyleGAN2, which performs the best among all features due to a good trade-off between resolution and discriminativeness. We resize $\mathbf{F}$ to have the same resolution as the final image via bilinear interpolation. As shown in Fig. 3, to move a handle point $p_i$ to the target point $t_i$, our idea is to supervise a small patch around $p_i$ (red circle) to move towards $t_i$ by a small step (blue circle). We use $\Omega_1(p_i, r_1)$ to denote the pixels whose distance to $p_i$ is less than $r_1$; our motion supervision loss is then:

$$\mathcal{L} = \sum_{i=1}^{n} \sum_{q_i \in \Omega_1(p_i, r_1)} \left\| \mathbf{F}(q_i) - \mathbf{F}(q_i + d_i) \right\|_1 + \lambda \left\| (\mathbf{F} - \mathbf{F}_0) \cdot (1 - \mathbf{M}) \right\|_1, \quad (1)$$

where $\mathbf{F}(q)$ denotes the feature values of $\mathbf{F}$ at pixel $q$, $d_i = \frac{t_i - p_i}{\| t_i - p_i \|_2}$ is a normalized vector pointing from $p_i$ to $t_i$ ($d_i = 0$ if $t_i = p_i$), and $\mathbf{F}_0$ is the feature maps corresponding to the initial image. Note that the first term is summed up over all handle points $\{p_i\}$. As the components of $q_i + d_i$ are not integers, we obtain $\mathbf{F}(q_i + d_i)$ via bilinear interpolation. Importantly, when performing back-propagation using this loss, the gradient is not back-propagated through $\mathbf{F}(q_i)$. This will motivate $p_i$ to move to $p_i + d_i$ but not vice versa. In case the binary mask $\mathbf{M}$ is given, we keep the unmasked region fixed with a reconstruction loss shown as the second term. At each motion supervision step, this loss is used to optimize the latent code $w$ for one step. $w$ can be optimized either in the $\mathcal{W}$ space or in the $\mathcal{W}^+$ space, depending on whether the user wants a more constrained image manifold or not. As out-of-distribution manipulations are easier to achieve in the $\mathcal{W}^+$ space (e.g., the cat in Fig. 16), we use $\mathcal{W}^+$ in this work for better editability. In practice, we observe that the spatial attributes of the image are mainly affected by the $w$ for the first 6 layers while the remaining ones only affect appearance. Thus, inspired by the style-mixing technique [Karras et al. 2019], we only update the $w$ for the first 6 layers while fixing the others to preserve the appearance. This selective optimization leads to the desired slight movement of image content.
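The shifted-patch term of the motion supervision loss described in this section can be sketched in plain Python. This is only an illustration under assumptions: feature maps are nested lists indexed as `feat[y][x]` (a list of channel values), the helper names `bilinear`, `patch`, and `motion_loss` are hypothetical, an L1 distance between feature vectors is assumed, and the mask reconstruction term is omitted. The sketch computes the loss value only; in an autodiff framework the unshifted features would additionally be excluded from the gradient.

```python
import math


def bilinear(feat, x, y):
    """Sample feat[y][x] (a list of channel values) at a real-valued location."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)   # clamp to the feature map
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return [
        (1 - ay) * ((1 - ax) * a + ax * b) + ay * ((1 - ax) * c + ax * d)
        for a, b, c, d in zip(feat[y0][x0], feat[y0][x1], feat[y1][x0], feat[y1][x1])
    ]


def patch(p, r):
    """Integer pixels whose Euclidean distance to p is less than r."""
    span = range(-int(r), int(r) + 1)
    return [(p[0] + dx, p[1] + dy) for dy in span for dx in span
            if math.hypot(dx, dy) < r]


def motion_loss(feat, handles, targets, r1):
    """Sum of L1 distances between each patch feature and its shifted copy."""
    total = 0.0
    for p, t in zip(handles, targets):
        dist = math.dist(p, t)
        if dist == 0:
            continue  # direction is zero, so this handle contributes nothing
        d = ((t[0] - p[0]) / dist, (t[1] - p[1]) / dist)
        for qx, qy in patch(p, r1):
            anchor = bilinear(feat, qx, qy)                 # treated as a constant
            shifted = bilinear(feat, qx + d[0], qy + d[1])  # one unit step towards t
            total += sum(abs(a - s) for a, s in zip(anchor, shifted))
    return total
```

On a feature map that is constant along the drag direction the loss is zero, which matches the intuition that there is nothing to pull; on a ramp every patch pixel contributes the size of one unit step.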

Point Tracking
The previous motion supervision results in a new latent code $w'$, new feature maps $\mathbf{F}'$, and a new image $\mathbf{I}'$. As the motion supervision step does not readily provide the precise new locations of the handle points, our goal here is to update each handle point $p_i$ such that it tracks the corresponding point on the object. Point tracking is typically performed via optical flow estimation models or particle video approaches [Harley et al. 2022]. Again, these additional models can significantly harm efficiency and may suffer from accumulation error, especially in the presence of alias artifacts in GANs. We thus present a new point tracking approach for GANs. The insight is that the discriminative features of GANs capture dense correspondence well, and thus tracking can be effectively performed via nearest neighbor search in a feature patch. Specifically, we denote the feature of the initial handle point as $f_i = \mathbf{F}_0(p_i)$. We denote the patch around $p_i$ as $\Omega_2(p_i, r_2) = \{(x, y) \mid |x - x_{p,i}| < r_2, |y - y_{p,i}| < r_2\}$. Then the tracked point is obtained by searching for the nearest neighbor of $f_i$ in $\Omega_2(p_i, r_2)$:

$$p_i := \arg\min_{q_i \in \Omega_2(p_i, r_2)} \left\| \mathbf{F}'(q_i) - f_i \right\|_1. \quad (2)$$

In this way, $p_i$ is updated to track the object. For more than one handle point, we apply the same process for each point. Note that here we are also considering the feature maps $\mathbf{F}'$ after the 6th block of StyleGAN2. The feature maps have a resolution of 256 × 256 and are bilinearly interpolated to the same size as the image if needed, which is sufficient to perform accurate tracking in our experiments. We analyze this choice in Sec. 4.2.
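The nearest neighbor search used for tracking can be sketched as follows. This is a minimal illustration under assumptions: feature maps are nested lists indexed as `feat[y][x]`, the search window is a square patch clipped to the map, an L1 feature distance is assumed, and `track_point` is a hypothetical helper name.

```python
def track_point(feat_new, f_init, p, r2):
    """Return the pixel near p whose new feature is closest (L1) to f_init.

    The search covers all (x, y) with |x - p[0]| < r2 and |y - p[1]| < r2,
    clipped to the bounds of the feature map.
    """
    h, w = len(feat_new), len(feat_new[0])
    best, best_dist = p, float("inf")
    for y in range(max(p[1] - r2 + 1, 0), min(p[1] + r2, h)):
        for x in range(max(p[0] - r2 + 1, 0), min(p[0] + r2, w)):
            dist = sum(abs(a - b) for a, b in zip(feat_new[y][x], f_init))
            if dist < best_dist:
                best, best_dist = (x, y), dist
    return best


# Toy check: features encode position, and the content shifts by (+2, +1).
feat0 = [[[float(x), float(y)] for x in range(12)] for y in range(12)]
f_init = feat0[5][5]
feat_new = [[[float(x - 2), float(y - 1)] for x in range(12)] for y in range(12)]
tracked = track_point(feat_new, f_init, (5, 5), r2=4)
```

Note that the reference feature comes from the initial feature maps and stays fixed across steps, so matching errors are not compounded from one step to the next.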

Implementation Details
Fig. 5. Real image manipulation. Given a real image, we apply GAN inversion to map it to the latent space of StyleGAN, then edit the pose, hair, shape, and expression, respectively. (Panels: Real image, GAN Inversion, 1st Edit (pose), 2nd Edit (hair), 3rd Edit (shape), 4th Edit (expression).)

We implement our approach based on PyTorch [Paszke et al. 2017]. We use the Adam optimizer [Kingma and Ba 2014] to optimize the latent code $w$ with a step size of 2e-3 for the FFHQ [Karras et al. 2019], AFHQCat [Choi et al. 2020], and LSUN Car [Yu et al. 2015] datasets and 1e-3 for others. The hyper-parameters are set to $\lambda = 20$, $r_1$ = round(3/512 × resolution), and $r_2$ = round(12/512 × resolution), where the resolution is that of the generated image. In our implementation, we stop the optimization process when all the handle points are within a distance threshold of their corresponding target points, where the threshold is 1 pixel for no more than 5 handle points and 2 pixels otherwise. We also develop a GUI to support interactive image manipulation. Thanks to the computational efficiency of our approach, users only need to wait for a few seconds for each edit and can continue the editing until satisfied. We highly recommend readers refer to the supplemental video for live recordings of interactive sessions.
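The resolution-dependent patch radii and the stopping rule stated above can be written down in a few lines. The function names `patch_radii` and `should_stop` are hypothetical, and the sketch assumes square images described by a single resolution value.

```python
import math


def patch_radii(resolution):
    """Scale the two patch radii from their values at a 512-pixel resolution.

    Python's round() rounds exact halves to the nearest even integer.
    """
    return round(3 / 512 * resolution), round(12 / 512 * resolution)


def should_stop(handles, targets):
    """Stop once every handle is within the pixel threshold of its target."""
    threshold = 1 if len(handles) <= 5 else 2
    return all(math.dist(p, t) <= threshold for p, t in zip(handles, targets))
```

For example, `patch_radii(512)` gives the base radii and `patch_radii(1024)` doubles them, so the patches cover the same fraction of the image at either resolution.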
Baselines. Our main baseline is UserControllableLT [Endo 2022], which has the closest setting to our method. UserControllableLT does not support a mask input but allows users to define a number of fixed points. Thus, for testing cases with a mask input, we sample a regular 16 × 16 grid on the image and use the points outside the mask as the fixed points for UserControllableLT. In addition, we compare with RAFT [Teed and Deng 2020] and PIPs [Harley et al. 2022] for point tracking. To do so, we create two variants of our approach where the point tracking part (Sec. 3.3) is replaced with these two tracking methods.
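The conversion from a mask to fixed points for the baseline can be sketched as below. The exact grid placement is not specified in the text, so placing grid points at cell centres is an assumption, as are the helper name `fixed_points_from_mask` and the convention that a truthy mask value marks the movable region.

```python
def fixed_points_from_mask(mask, grid=16):
    """Sample a regular grid and keep the points that lie outside the mask.

    `mask[y][x]` is truthy where the image is movable; grid points are
    placed at cell centres (an assumption of this sketch).
    """
    h, w = len(mask), len(mask[0])
    points = []
    for gy in range(grid):
        for gx in range(grid):
            x = int((gx + 0.5) * w / grid)
            y = int((gy + 0.5) * h / grid)
            if not mask[y][x]:
                points.append((x, y))
    return points


# Toy mask: the left half of a 32 x 32 image is movable.
toy_mask = [[x < 16 for x in range(32)] for _ in range(32)]
fixed = fixed_points_from_mask(toy_mask)
```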

Qualitative Evaluation
Fig. 4 shows the qualitative comparison between our method and UserControllableLT. We show the image manipulation results for several different object categories and user inputs. Our approach accurately moves the handle points to reach the target points, achieving diverse and natural manipulation effects such as changing the pose of animals, the shape of a car, and the layout of a landscape. In contrast, UserControllableLT cannot faithfully move the handle points to the targets and often leads to undesired changes in the images, e.g., the clothes of the human and the background of the car. It also does not keep the unmasked region fixed as well as ours, as shown in the cat images. We show more comparisons in Fig. 10. A comparison of our approach with PIPs and RAFT is provided in Fig. 6. Our approach accurately tracks the handle point above the nose of the lion, thus successfully driving it to the target position. With PIPs and RAFT, the tracked point starts to deviate from the nose during the manipulation process. Consequently, they move the wrong part to the target position. When no tracking is performed, the fixed handle point soon starts to drive another part of the image (e.g., background) after a few steps and never knows when to stop, which fails to achieve the editing goal.
Real image editing. Using GAN inversion techniques that embed a real image in the latent space of StyleGAN, we can also apply our approach to manipulate real images. Fig. 5 shows an example, where we apply PTI inversion [Roich et al. 2022] to the real image and then perform a series of manipulations to edit the pose, hair, shape, and expression of the face in the image. We show more real image editing examples in Fig. 13.

Quantitative Evaluation
We quantitatively evaluate our method under two settings, including face landmark manipulation and paired image reconstruction.
Face landmark manipulation. Since face landmark detection is very reliable using an off-the-shelf tool [King 2009], we use its prediction as ground truth landmarks. Specifically, we randomly generate two face images using the StyleGAN trained on FFHQ and detect their landmarks. The goal is to manipulate the landmarks of the first image to match the landmarks of the second image. After manipulation, we detect the landmarks of the final image and compute the mean distance (MD) to the target landmarks. The results are averaged over 1000 tests. The same set of test samples is used to evaluate all methods. In this way, the final MD score reflects how well the method can move the landmarks to the target positions.
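As a concrete reading of the MD metric, the sketch below averages the Euclidean distance between corresponding landmarks. The use of the Euclidean distance and the helper name `mean_distance` are assumptions of this illustration.

```python
import math


def mean_distance(predicted, target):
    """Average Euclidean distance between corresponding 2D landmarks."""
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


# Two landmarks: one matches exactly, the other is 5 pixels off.
md = mean_distance([(0, 0), (3, 4)], [(0, 0), (0, 0)])
```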
We perform the evaluation under 3 settings with different numbers of landmarks including 1, 5, and 68 to show the robustness of our approach under different numbers of handle points.We also report the FID score between the edited images and the initial images as an indication of image quality.In our approach and its variants, the maximum optimization step is set to 300.
The results are provided in Table 1. Our approach significantly outperforms UserControllableLT under different numbers of points. A qualitative comparison is shown in Fig. 7, where our method opens the mouth and adjusts the shape of the jaw to match the target face while UserControllableLT fails to do so. Furthermore, our approach preserves better image quality as indicated by the FID scores. Thanks to a better tracking capability, we also achieve more accurate manipulation than RAFT and PIPs. Inaccurate tracking also leads to excessive manipulation, which deteriorates the image quality as shown in FID scores. Although UserControllableLT is faster, our approach largely pushes the upper bound of this task, achieving much more faithful manipulation while maintaining a comfortable running time for users.
Paired image reconstruction. In this evaluation, we follow the same setting as UserControllableLT [Endo 2022]. Specifically, we sample a latent code $w_1$ and randomly perturb it to get $w_2$ in the same way as in [Endo 2022]. Let $\mathbf{I}_1$ and $\mathbf{I}_2$ be the StyleGAN images generated from the two latent codes. We then compute the optical flow between $\mathbf{I}_1$ and $\mathbf{I}_2$ and randomly sample 32 pixels from the flow field as the user input $\mathcal{U}$. The goal is to reconstruct $\mathbf{I}_2$ from $\mathbf{I}_1$ and $\mathcal{U}$. We report MSE and LPIPS [Zhang et al. 2018] and average the results over 1000 samples. The maximum optimization step is set to 100 in our approach and its variants. As shown in Table 2, our approach outperforms all the baselines in different object categories, which is consistent with previous results.
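The construction of the user input from a flow field, together with the MSE metric, can be sketched as follows. The sampling details of [Endo 2022] are not given here, so uniform sampling without replacement is an assumption, as are the helper names `sample_user_input` and `mse`, the fixed seed, and the representation of the flow as `flow[y][x] = (dx, dy)`.

```python
import random


def sample_user_input(flow, k=32, seed=0):
    """Pick k random pixels of a dense flow field as (handle, target) pairs."""
    h, w = len(flow), len(flow[0])
    rng = random.Random(seed)
    pixels = rng.sample([(x, y) for y in range(h) for x in range(w)], k)
    return [((x, y), (x + flow[y][x][0], y + flow[y][x][1])) for x, y in pixels]


def mse(img_a, img_b):
    """Mean squared error between two equally sized images ([row][col] values)."""
    diffs = [(a - b) ** 2 for ra, rb in zip(img_a, img_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy flow field: every pixel moves by (+2, -1).
toy_flow = [[(2.0, -1.0)] * 16 for _ in range(16)]
pairs = sample_user_input(toy_flow)
```

Each sampled pixel acts as a handle point and its flow-displaced position as the target, so the flow field is turned into the same kind of input a user would provide by clicking.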
Ablation Study. Here we study the effects of which feature to use in motion supervision and point tracking. We report the performance (MD) of face landmark manipulation using different features. As Table 3 shows, in both motion supervision and point tracking, the feature maps after the 6th block of StyleGAN perform the best, showing the best balance between resolution and discriminativeness. We also provide the effects of $r_1$ in Table 4. It can be observed that the performance is not very sensitive to the choice of $r_1$, and $r_1 = 3$ performs slightly better.

Discussions
Effects of mask. Our approach allows users to input a binary mask denoting the movable region. We show its effects in Fig. 8. When a mask over the head of the dog is given, the other regions are almost fixed and only the head moves. Without the mask, the manipulation moves the dog's whole body. This also shows that point-based manipulation often has multiple possible solutions and the GAN will tend to find the closest solution in the image manifold learned from the training data. The mask function can help to reduce ambiguity and keep certain regions fixed.
Out-of-distribution manipulation. So far, the point-based manipulations we have shown are "in-distribution" manipulations, i.e., it is possible to satisfy the manipulation requirements with a natural image inside the image distribution of the training dataset. Here we showcase some out-of-distribution manipulations in Fig. 9. It can be seen that our approach has some extrapolation capability, creating images outside the training image distribution, e.g., an extremely opened mouth and a large wheel. In some cases, users may want to always keep the image in the training distribution and prevent it from reaching such out-of-distribution manipulations. A potential way to achieve this is to add additional regularization to the latent code $w$, which is not the main focus of this paper.
Limitations. Despite some extrapolation capability, our editing quality is still affected by the diversity of training data. As exemplified in Fig. 14 (a), creating a human pose that deviates from the training distribution can lead to artifacts. Besides, handle points in texture-less regions sometimes suffer from more drift in tracking, as shown in Fig. 14 (b)(c). We thus suggest picking texture-rich handle points if possible.
Social impacts. As our method can change the spatial attributes of images, it could be misused to create images of a real person with a fake pose, expression, or shape. Thus, any application or research that uses our approach has to strictly respect personality rights and privacy regulations.

CONCLUSION
We have presented DragGAN, an interactive approach for intuitive point-based image editing. Our method leverages a pre-trained GAN to synthesize images that not only precisely follow user input, but also stay on the manifold of realistic images. In contrast to many previous approaches, we present a general framework by not relying on domain-specific modeling or auxiliary networks. This is achieved using two novel ingredients: an optimization of latent codes that incrementally moves multiple handle points towards their target locations, and a point tracking procedure to faithfully trace the trajectory of the handle points. Both components utilize the discriminative quality of intermediate feature maps of the GAN to yield pixel-precise image deformations and interactive performance. We have demonstrated that our approach outperforms the state of the art in GAN-based manipulation and opens new directions for powerful image editing using generative priors. As for future work, we plan to extend point-based editing to 3D generative models.

Fig. 2. Overview of our pipeline. Given a GAN-generated image, the user only needs to set several handle points (red dots), target points (blue dots), and optionally a mask denoting the movable region during editing (brighter area). Our approach iteratively performs motion supervision (Sec. 3.2) and point tracking (Sec. 3.3). The motion supervision step drives the handle points (red dots) to move towards the target points (blue dots) and the point tracking step updates the handle points to track the object in the image. This process continues until the handle points reach their corresponding target points.

Fig. 3. Method. Our motion supervision is achieved via a shifted patch loss on the feature maps of the generator. We perform point tracking on the same feature space via the nearest neighbor search.

Fig. 6. Qualitative tracking comparison of our approach to RAFT [Teed and Deng 2020], PIPs [Harley et al. 2022], and without tracking. Our approach tracks the handle point more accurately than baselines, thus producing more precise editing.

Fig. 7. Face landmark manipulation. Compared to UserControllableLT [Endo 2022], our method can manipulate the landmarks detected from the input image to match the landmarks detected from the target image with less matching error.

Fig. 8. Effects of the mask. Our approach allows masking the movable region. After masking the head region of the dog, the rest of the image is almost unchanged.

Fig. 9. Out-of-distribution manipulations. Our approach has extrapolation capability for creating images out of the training image distribution, for example, an extremely opened mouth and a greatly enlarged wheel.

Fig. 10. Qualitative comparison. This is an extension of Fig. 4.

Fig. 14. Limitations. (a) StyleGAN-human [Fu et al. 2022] is trained on a fashion dataset where most arms and legs point downward. Editing toward out-of-distribution poses can cause distortion artifacts, as shown in the legs and hands. (b)&(c) The handle point (red) in texture-less regions may suffer from more drift during tracking, as can be observed from its position relative to the rearview mirror.

Fig. 15. Effects of the mask. By masking the foreground object, we can fix the background. The details of the trees and grasses are kept nearly unchanged. Better background preservation could potentially be achieved via feature blending [Suzuki et al. 2018].

Table 1. Quantitative evaluation on face keypoint manipulation. We compute the mean distance between edited points and target points. The FID and Time are reported based on the '1 point' setting.

Table 3. Effects of which feature to use. x+y means the concatenation of two features. We report the performance (MD) of face landmark manipulation (1 point).