CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable and Controllable Text-Guided Face Manipulation

Recently introduced Contrastive Language-Image Pre-Training (CLIP) [Radford et al. 2021] bridges images and text by embedding them into a joint latent space. This has opened the door to a growing body of work that aims to manipulate an input image by providing a textual explanation. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee for manipulation. To alleviate these problems, we propose to define corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce the CLIP projection-augmentation embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we utilize the method for text-guided semantic face editing. We quantitatively and qualitatively demonstrate that PAE facilitates a more disentangled, interpretable, and controllable face image manipulation with state-of-the-art quality and accuracy.


INTRODUCTION
In text-guided image manipulation, the system receives an image and a text prompt and is tasked with editing the image according to the text prompt. Such tasks have received high research attention due to the great expressive power of natural language. The recently introduced and increasingly popular Contrastive Language-Image Pre-Training (CLIP) [Radford et al. 2021] is a technique to achieve this by embedding images and texts into a joint latent space. Combined with generative techniques such as Generative Adversarial Networks (GANs) [Goodfellow et al. 2014] and diffusion models [Dhariwal and Nichol 2021; Ho et al. 2020], CLIP has been utilized to develop several high-quality image manipulation methods [e.g., Antal and Bodó 2021; Khodadadeh et al. 2022; Ling et al. 2021; Zhang et al. 2021], where the image is optimized to be similar to the text prompt in the CLIP joint space.
There are three important but difficult-to-satisfy properties when performing image manipulation: disentanglement, interpretability, and controllability. Disentanglement means that the manipulation should only change the features referred to by the text prompt and should not affect other irrelevant attributes [Wu et al. 2021; Xu et al. 2022]. Interpretability means that we know why/how an edit to the latent code affects the output image and thus understand the reasoning behind each model decision [Doshi-Velez and Kim 2018; Miller 2019], or that the model can extract relevant information from the given data [Murdoch et al. 2019]. Finally, controllability is the ability to control the intensity of the edit [Li et al. 2019; Park et al. 2020; You et al. 2021] for individual factors and hence is tightly related to disentanglement.
In CLIP-based text-guided image manipulation methods, both the latent space of the generative network and the embedding space of CLIP extensively compress information (e.g., a 1024×1024 image contains 3 × 1024² dimensions, whereas the latent spaces of a typical StyleGAN [Karras et al. 2019] and of CLIP are both only 512-dimensional), so the manipulation in the latent space is akin to a black box. [Gabbay et al. 2021] argue that a GAN fails at disentanglement because it only focuses on localized features. [Zindancıoğlu and Sezgin 2021] also show that several action units in StyleGAN [Karras et al. 2019] are correlated even if they are responsible for semantically distinct attributes in the image. In addition, we found that although CLIP embeddings of images and texts share the same space, they actually reside far away from each other (see Section 3.1), which can lead to undesired artifacts in the generated images such as unintended editing or distortion of facial identity (see Figure 1). Finally, most existing methods for CLIP-based text-guided image manipulation do not allow free control of the magnitude of the edit. As a result, how to perform an accurate image manipulation in a disentangled, interpretable, and controllable way remains a hard and open problem.
In this paper, we introduce a technique that can serve as an alternative optimization procedure for CLIP-based text-guided image manipulation. Rather than optimizing an image directly towards the text prompt in the CLIP joint latent space, we propose a novel type of CLIP embedding, the projection-augmentation embedding (PAE), as the optimization target. PAE was motivated by two empirical observations. First, the images do not overlap with their corresponding texts in the joint space. This means that a text embedding does not represent the embedding of the true target image that should be optimized for. Therefore, directly optimizing an image to be similar to the text in the CLIP space results in undesirable artifacts or changes in irrelevant attributes. Second, a CLIP subspace constructed via embeddings of a set of relevant texts can constrain the changes of an image to only relevant attributes. As demonstrated in a case study of manipulating face images, our proposed approach yields a more disentangled, interpretable, and controllable performance.
To construct a PAE, we first project the embedding of the input image onto a corpus subspace constructed by relevant texts describing the attributes we aim to disentangle, and record a residual vector.Next, we augment this projected embedding in the subspace with the target text prompt, allowing for a user-specified augmenting power to control the intensity of change.Finally, we add back the residual to reconstruct a vector close to the "image region" of the joint space.We demonstrate that the PAE is a better approximation to the embedding of the true target image.With PAE, we achieve better interpretability via an explicit construction of the corpus subspace.We achieve better disentanglement since the subspace constrains the changes of the original image with only relevant attributes.We achieve free control of the magnitude of the edit via a user-specified augmenting power.
In short, we highlight the three major contributions of this paper: (1) (Section 3) We perform a series of empirical analyses of the CLIP space and its subspaces, identifying i) the limitations of using a naive CLIP loss for text-guided image editing, and ii) several unique properties of the CLIP subspace. (2) (Section 4) Based on our findings in Section 3, we propose the projection-augmentation embedding (PAE) as a better approximation to the embedding of the true target image. (3) (Section 5) We demonstrate that employing PAE as an alternative optimization target facilitates a more disentangled, interpretable, and controllable text-guided face image manipulation. This is validated through several text-guided semantic face editing experiments where we integrate PAE into a set of state-of-the-art models. We quantitatively and qualitatively demonstrate that PAE boosts the performance of all chosen models with high quality and accuracy.

RELATED WORK
Latent Manipulation for Image Editing. One popular approach to image manipulation is based on its latent code: the input image is first embedded into the latent space of a pre-trained generative network such as a GAN [Goodfellow et al. 2014], and then, to modify the image, one updates either the latent code of the image [e.g., Antal and Bodó 2021; Creswell and Bharath 2018; Hou et al. 2022b; Khodadadeh et al. 2022; Kocasari et al. 2022; Ling et al. 2021; Patashnik et al. 2021; Perarnau et al. 2016; Pidhorskyi et al. 2020; Shen et al. 2020; Xia et al. 2021; Zhang et al. 2021; Zhu et al. 2016] or the weights of the network [e.g., Cherepkov et al. 2021; Nitzan et al. 2022; Reddy et al. 2021] to obtain the desired image editing. However, these methods can only alter a set of pre-defined attributes and thus lack flexibility and generalizability.
CLIP for Text-Guided Image Manipulation. In 2021, Radford et al. proposed Contrastive Language-Image Pre-Training (CLIP), where an image encoder and a text encoder are trained such that semantically similar images and texts are also similar in the joint embedding space. The insight of connecting images and texts in the same space brings up a wide spectrum of applications in computer vision tasks. For example, Radford et al. provide applications for image captioning, image class prediction, and zero-shot transfer. More sophisticated tasks include language-driven image generation [Ramesh et al. 2022], zero-shot semantic segmentation [Li et al. 2022], image emotion classification (IEC) [Deng et al. 2022], large-scale detection of specific content [González-Pizarro and Zannettou 2022], object proposal generation [Shi et al. 2022b], object sketching [Vinker et al. 2022], and referring expression grounding [Shi et al. 2022a].
CLIP has also been applied to text-guided image manipulation tasks. In this domain, one approach is to edit the latent code of a certain generative network so that the embedding of the generated image is similar to the embedding of the given text in the CLIP space [Hou et al. 2022a; Kocasari et al. 2022; Patashnik et al. 2021; Xia et al. 2021] (see Figure 4). However, this straightforward approach sometimes fails to change the desired attributes, or fails to change them in a disentangled way: other unrelated features are also affected (see the comparative study in Sections 5.2 and 5.3). In addition, the proposed methods often fail to exhibit enough interpretability and controllability. This is probably attributable to the separation of images and texts in the CLIP embedding space, so that optimizing an image embedding towards a text embedding naturally introduces some undesired effects. DALL·E 2 [Ramesh et al. 2022] and CLIP-Mesh [Mohammad Khalid et al. 2022] propose to use a diffusion prior that transforms a text embedding to its corresponding image embedding. Despite its effectiveness, there is no mechanism for users to explicitly impose disentangled generation or to adjust the modification intensity. Directional CLIP loss [Gal et al. 2022; Patashnik et al. 2021] optimizes an image so that the difference between the generated and the original image in CLIP space aligns with the difference between the text prompt and a neutral text. However, as pointed out by Patashnik et al., the manipulation generated by this method is less pronounced, as the direction in CLIP space is computed by averaging over many images. There also exist other remedies, such as penalizing the change of features in the editing (e.g., [Canfes et al. 2022; Kocasari et al. 2022]), separating the image into different granule levels [Patashnik et al. 2021], formulating a constrained optimization problem [Zhu et al. 2016], partially labeling a set of features [Gabbay et al. 2021], or updating the parameters of the underlying GAN to preserve features (e.g., [Cherepkov et al. 2021; Nitzan et al. 2022; Reddy et al. 2021]); however, they do not collectively achieve a disentangled, interpretable, and controllable manipulation. Additionally, some of them target all features indifferently: they still do not separate the related features from the irrelevant ones. Moreover, most of these techniques only work for specific methods or attributes and lack generalizability, and some of them are too complicated and time-consuming to be used in practical large-scale applications.

ANALYSIS OF THE CLIP JOINT SPACE AND SUBSPACE
Most CLIP-based text-guided image manipulation algorithms [e.g., Hou et al. 2022a; Kocasari et al. 2022; Patashnik et al. 2021; Xia et al. 2021] follow a general paradigm where certain parameters of an image editing process (such as the latent code or the weights of a generative network) are trained to minimize a cosine similarity loss between the resulting image and a text prompt in the CLIP joint space (see Figure 4). However, naively using this loss may introduce undesirable artifacts or unintended changes unrelated to the text prompt, as shown in Figure 1. We hypothesize that this is due to the discrepancy between the images and the texts in the CLIP joint space, as we will demonstrate in Section 3.1. Therefore, the embedding of the text prompt in the CLIP joint space does not actually represent the embedding of the true target image that should be optimized for. To alleviate this issue, we construct CLIP subspaces with desirable properties (see Section 3.2), leading to our proposed projection-augmentation embedding (see Section 4).

Non-overlapping image and text embeddings
As an empirical demonstration, we collect over 1500 face images and 1500 textual descriptions of faces (e.g., emotion, hairstyle, or general) and visualize their embeddings in the CLIP joint space using PCA [Jolliffe 2002] in Figure 2. Note that the visualization is given using Euclidean distance; although CLIP is trained with cosine similarity, if we normalize all the embeddings, the Euclidean distance and the cosine distance give exactly the same ranking because, for unit vectors u and v,

‖u − v‖² = 2 − 2 CosSim(u, v). (1)

As shown in Figure 2, the image and the text embeddings lie in two non-overlapping regions. They also exhibit a lower inter-modality cosine similarity compared to the intra-modality similarity, regardless of their semantic meanings. For example, the cosine similarity between an image of a dog and the text "dog" is 0.253, and that between an image of a cat and the text "cat" is 0.275. On the other hand, the similarity between a dog image and a cat image reaches 0.841, and that between the two texts "dog" and "cat" is as high as 0.936, much higher than the inter-modality similarity. We record additional evaluations of inter-modality and intra-modality cosine similarities in Appendix A.1. In the general paradigm described above (and see Figure 4), the CLIP embedding of the ideal target image is essentially approximated by the embedding of the text description. However, the separation of text and image embeddings clearly invalidates such an approximation and thus leads to artifacts (see Figure 1 and more in Sections 5.2 and 5.3).
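The ranking equivalence for normalized embeddings can be checked numerically. The sketch below uses random vectors standing in for real CLIP embeddings; `cosine_sim` and the variable names are illustrative only.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
query = rng.normal(size=512)
candidates = rng.normal(size=(10, 512))

# Normalize everything onto the unit sphere.
q = query / np.linalg.norm(query)
C = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)

# For unit vectors, ||q - c||^2 = 2 - 2*CosSim(q, c), a monotone relation,
# so ranking by Euclidean distance equals ranking by cosine similarity.
rank_by_cos = np.argsort([-cosine_sim(q, c) for c in C])
rank_by_euc = np.argsort([np.linalg.norm(q - c) for c in C])
```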

Subspaces distilling relevant information from image embeddings
Since the joint space is a vector space, we can construct a subspace of it using a set of relevant text prompts as basis vectors. For example, we can construct an "emotion subspace" using relevant emotion texts. In this section, we explain how such a subspace distills relevant information from images, which can be used to constrain the changes of an image. We also include some experiments in Appendix A.5 to show that the subspace can distill information from texts. We leave the mathematical details of the formulation of the subspace to Section 4, but preview certain properties of the subspace here, as these properties inspire the formulation of our method. We invited ten participants to record five-second videos of their faces changing from neutral to laughing out loud (LOL). We compute the cosine similarity (averaged over all videos) of each frame to the first frame or to the text "a happy face" in the CLIP space, or after projecting them onto the emotion or hairstyle subspace. The results are shown in Figure 3.
In Figure 3a, the similarity decreases much faster in the emotion subspace compared to the CLIP space and the hairstyle subspace. A similar pattern shows up in Figure 3b: the similarity increases much faster in the emotion subspace. Since the subspace has a lower dimensionality than the original space, these observations indicate that the information regarding the irrelevant attributes is discarded during the projection. Hence the changes of the hairstyle in the facial images have little influence on the similarity: the emotion subspace only distills emotional attributes from the image and discards the others. This inspires the formulation of our method: if we make changes to the embedding of the original image within a subspace, this only induces a change of the attributes related to the subspace. As a result, these changes are disentangled.

PROJECTION-AUGMENTATION EMBEDDING
Motivated by our findings in Section 3, we propose the projection-augmentation embedding (PAE) as a better approximation to the embedding of the true target image. There are three objectives we aim to fulfill when constructing such an embedding: i) it should be closer to the image region than the text is; ii) it should be guided by the target text prompt; iii) the guidance should be provided within the subspace so the changes are disentangled.

Overview
Given an input image x and a text prompt t, we construct the PAE, denoted as E_S(x, t, α), as follows. First, we obtain the embeddings of x and t, denoted as e_x and e_t, via the CLIP image and text encoders [Radford et al. 2021]. Then we project e_x onto a corpus subspace S constructed with texts describing the attributes we aim to disentangle:

p = P_S(e_x),

where P_S is the projection operation onto the subspace S. We will explain the details of P_S and the construction of S in Section 4.2.
Next, we record the residual vector r of the projection:

r = e_x − P_S(e_x).

After that, we augment the influence of the text prompt t on the projected embedding p, and finally add back the residual vector r to construct the final embedding:

E_S(x, t, α) = A_{t,α}(P_S(e_x)) + r,

where A_{t,α} is the augmentation operation in S according to t with a controllable augmenting power α, as will be explained in detail in Section 4.3.
A graphic illustration of PAE can be found in Figure 6. In response to our aforementioned three objectives, we highlight that we add back the residual to ensure that the PAE is close to the image region rather than the text region, that we apply the augmentation to ensure that the PAE is guided by the target text prompt, and that we apply the projection to ensure that the guidance is provided via the subspace so the changes of the image are disentangled (recall Section 3.2). As we will see in Section 4.2, the construction of the subspace with explicit selections of relevant prompts makes our approach more interpretable compared to black-box latent manipulations. Our introduction of the augmenting power (see Section 4.3) makes it possible to freely control the magnitude of the change to the image. PAE can be integrated into any CLIP-based text-guided image manipulation algorithm in place of the text embedding in the final loss function, as shown in Figure 5. In Figure 2 we also include the corresponding PAEs generated by the same face images and facial descriptions as in Section 3.1. We see that compared to the text embeddings, PAEs are indeed closer to the image embeddings, and that the PAE distribution has a lower variation, suggesting that they contain more specific information. Note that the formulation of PAE does not explicitly include any supervision losses (such as an identity loss [Deng et al. 2019]) or other trainable parameters, but as we will see in Section 5, PAE can achieve better identity preservation even without the identity loss.
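As a concrete illustration, the project/augment/add-back-residual construction can be sketched in a few lines of NumPy. The names `make_pae` and `gram_schmidt` are ours, and the simple additive augmentation inside the subspace stands in for the refined, sum-preserving operation described later; this is a sketch, not the exact implementation.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of basis vectors (rows of a 2-D array)."""
    basis = []
    for v in vectors:
        w = v - sum((v @ b) * b for b in basis)
        if np.linalg.norm(w) > 1e-10:
            basis.append(w / np.linalg.norm(w))
    return np.stack(basis)

def make_pae(e_x, e_t, corpus, alpha):
    """Project the image embedding onto the corpus subspace, augment it
    toward the text inside the subspace, and add back the residual."""
    B = gram_schmidt(corpus)           # orthonormal basis of subspace S
    project = lambda e: (e @ B.T) @ B  # P_S: projection onto S
    p = project(e_x)                   # projected image embedding
    r = e_x - p                        # residual, orthogonal to S
    p_aug = p + alpha * project(e_t)   # simple in-subspace augmentation
    return p_aug + r                   # close to the image region again
```

Because the guidance term lies entirely inside the subspace, the component of the embedding orthogonal to the subspace (the residual) is untouched, which is the mechanism behind the disentanglement claim.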
In the following subsections, we introduce the details of the projection and augmentation operations.

Projection
We introduce two options for the projection operation: P_S^GS and P_S^PCA. P_S^GS aims to find semantic basis vectors for S, as the subspace is a semantically meaningful structure. For example, if we aim to change/disentangle the facial emotions, we can use the text embeddings of a set of basic emotions as the basis vectors.
After selecting the basis vectors {b_i}, we apply the Gram-Schmidt process to obtain an orthonormal basis {b̂_i}, and then project e_x onto S by computing dot products with the basis vectors:

P_S^GS(e_x) = Σ_i (e_x · b̂_i) b̂_i. (5)

P_S^GS will fail if there is no apparent semantic basis (e.g., for hairstyle, it is hard to find "basic hairstyles"). P_S^PCA aims to find the basis of S by constructing a set T consisting of a corpus of relevant text embeddings and performing principal component analysis (PCA) [Jolliffe 2002] to extract a pre-defined number k of principal components as the basis {b_i}_{i=1}^k. Effectively, the space spanned by {b_i}_{i=1}^k approximates an S that encompasses all the related texts in the corpus of interest. After we obtain the basis, the projection is performed the same way as in Eq. (5). Other dimension-reduction techniques, such as kernel PCA [Schölkopf et al. 1997] or t-SNE [Van der Maaten and Hinton 2008], can also be used in place of PCA.
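A minimal sketch of the PCA variant, assuming the corpus embeddings are rows of an array; random data stands in for real CLIP text embeddings, and the function names are illustrative.

```python
import numpy as np

def pca_basis(corpus, k):
    """Extract k principal components of a text corpus as an orthonormal
    basis for the subspace (cf. the PCA projection option)."""
    X = corpus - corpus.mean(axis=0)   # center the corpus embeddings
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]                      # k orthonormal principal directions

def project(e, basis):
    """Project an embedding onto the span of an orthonormal basis."""
    return (e @ basis.T) @ basis
```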
We also experimented with a simpler idea of projecting both e_x and e_t onto S and performing the optimization in S, instead of augmenting in S and adding back the residual. However, this approach introduced significant artifacts and entangled changes, possibly due to the loss of information during the projection (see Appendix A.4 for more details).

Augmentation
The augmentation operation directs the embedding of the original image toward the target. As discussed above, the augmentation of the projected image embedding p should be guided by the target text prompt. A simple augmentation operation can be

A_{t,α}(p) = p + α e_t. (6)

However, in our pilot studies, we found that this simple operation results in a PAE too similar to the original embedding, leading to an optimization process that barely does anything (see Appendix A.3). We strengthen the influence of e_t on p by weakening the unintended attributes of p (the components of p where P_S(e_t) has a low value) while preserving the sum of coefficients of p.
Mathematically, we first calculate the coefficients c_i of p and the coefficients d_i of the projected text P_S(e_t), expressed under the basis {b̂_i}_{i=1}^k of S established in the previous section:

c_i = p · b̂_i, (7)
d_i = P_S(e_t) · b̂_i. (8)

Next, we weaken (realized by subtraction) all components of p and then add back the projected text P_S(e_t), with a coefficient chosen such that the sum of coefficients is preserved so that the embedding does not deviate too much:

A_{t,α}^+(p) = p + α P_S(e_t) − (α/k) (Σ_{j=1}^k d_j) Σ_{i=1}^k b̂_i, (9)

where α ∈ R+ controls the augmenting power, contributing to a controllable latent manipulation, as demonstrated in Figure 7b. The two terms in Eq. (9) are equivalent to weakening the components that are small in the projected text: the net change of the i-th coefficient is α(d_i − d̄), where d̄ is the mean of the d_j, so the sum of coefficients is unchanged.
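A sketch of such a sum-preserving augmentation, assuming an orthonormal subspace basis stored as rows of `basis`; the function name and the exact weakening scheme are illustrative. Each coefficient of the projected image is shifted toward the corresponding coefficient of the projected text minus the mean shift, so the coefficient sum is unchanged and components where the projected text is small are weakened.

```python
import numpy as np

def augment_plus(p, t_proj, basis, alpha):
    """Push the projected image p toward the projected text t_proj inside
    the subspace while keeping the sum of coefficients unchanged; this
    weakens the components where the projected text is small."""
    c = p @ basis.T            # coefficients of the projected image
    d = t_proj @ basis.T       # coefficients of the projected text
    c_new = c + alpha * (d - d.mean())
    return c_new @ basis
```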
We also include two more options for projection and three more for augmentation in Appendix A.2, including the case when S is not a linear subspace but a general manifold (so that there is no notion of a basis). Future researchers can also develop more effective options suitable for their own specific tasks; this is where the extensibility of our PAE lies. Preliminary criteria for the evaluation and selection of different PAEs are presented in Appendix A.3.

EXPERIMENT
As a case study, we utilize PAE in a series of text-guided semantic face editing experiments, where we manipulate a high-level (emotion) and a low-level (hairstyle) facial attribute [Lyons et al. 2000; Patashnik et al. 2021]. We also show the results of manipulating other facial attributes, such as the size of the eyes and the size of the mouth, in Appendix A.6, and the results of manipulating non-facial images such as dogs in Appendix A.7.

Data and procedure
Given randomly generated latent codes, we synthesize facial images using an implementation of StyleGAN2 [Karras et al. 2020] with adaptive discriminator augmentation (ADA) and the pre-trained model weights for the FFHQ face dataset [Karras et al. 2017]. We compare four image manipulation methods based on latent optimization: Naive (directly optimizing the latent code of the StyleGAN2 generator), StyleMC (SM) [Kocasari et al. 2022], StyleCLIP (SC) [Patashnik et al. 2021], and TediGAN (TG) [Xia et al. 2021].
To construct the PAEs, for emotion editing, we use P_emotion^GS, defined via six basic human emotions [Ekman 1992] (happy, sad, angry, fearful, surprised, and disgusted) as the semantic basis; for hairstyle editing, we use P_hairstyle^PCA, defined via 68 hairstyle texts and ten principal components. In this experiment, we used α = 7.0 in PAE across all methods, while criteria to automatically select the augmenting power are reported in Appendix A.3.

Qualitative comparison
A qualitative comparison is shown in Figure 7a. The text prompts are written to the left of each row. For more comprehensive results, please refer to Appendix A.6. Furthermore, in Figure 7b, we fix the text prompt to be "happy" and vary the augmenting power α, showing the controllability of our PAE. Please refer to Appendix A.6 for full results.

Quantitative comparison
In Table 1, we measure the performance of the aforementioned sixteen models with a quantitative comparison using seven metrics. Fréchet Inception Distance (FID) [Heusel et al. 2017] measures the quality of the manipulated images. Learned Perceptual Image Patch Similarity (LPIPS) evaluates the perceptual similarity of the manipulated images to the original ones. Identity loss (IL) employs ArcFace [Deng et al. 2019] to assess the degree of facial identity preservation. We utilize a facial-attribute classifier [Serengil and Ozpinar 2021] to evaluate the disentanglement and accuracy of the manipulation. Disentanglement of manipulation (Dis-C) is measured by the percentage of images whose model classification of irrelevant attributes (e.g., age and race) remains unchanged after the manipulation. Accuracy of manipulation (Acc-C) is measured by the percentage of output images whose model-predicted relevant attributes (e.g., emotion) are the same as the text prompt. The first five scores are computed over the editing results of 5,880 randomly generated images. We also conducted a user study (see details about the survey protocol in Appendix A.8) involving 50 participants from various backgrounds to subjectively evaluate the quality of manipulation of 36 randomly generated images. Dis-S reports the degree of disentanglement as indicated by the survey. Acc-S reports the degree of conformity of the resulting image to the text prompt as indicated by the survey.
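Dis-C and Acc-C are plain fractions over classifier outputs; a toy sketch with illustrative attribute labels and function names:

```python
def dis_c(irrelevant_before, irrelevant_after):
    """Fraction of images whose irrelevant-attribute predictions
    (e.g. age, race) are unchanged after the manipulation."""
    same = [b == a for b, a in zip(irrelevant_before, irrelevant_after)]
    return sum(same) / len(same)

def acc_c(predicted, target):
    """Fraction of outputs whose predicted relevant attribute
    matches the text prompt."""
    return sum(p == target for p in predicted) / len(predicted)
```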
From Table 1, we see that Nv+PAE already outperforms the four original models in almost all scores. More importantly, we see that almost all scores of the four models improved after being equipped with PAE. It is also worth noting that IL is included in the training loss in SC and SM but not in Nv+PAE; nevertheless, Nv+PAE outperforms SC and SM in terms of IL. These observations strongly indicate that PAE can improve image quality, identity preservation, disentanglement, and accuracy in image editing tasks. Compared to all DP models and the majority of Dir models, PAE yields higher scores in most metrics. In several cases, Dir results in less pronounced changes (as also observed by Patashnik et al.), leading to higher scores in measures of image quality (FID), perceptual similarity (LPIPS), identity preservation (IL), and disentanglement (Dis-C) but lower scores in terms of conformity to the text prompt (Acc-C). This can also be observed in Figure 7a.

DISCUSSION AND LIMITATION
An interesting observation from our experiment is that the difficulty of manipulation complies with real-world situations. For example, all models find it very hard to change a woman's hairstyle to be bald, whereas making a man bald is very easy. This suggests that the performance of such image manipulation methods is limited by the real-world datasets that our upstream models (StyleGAN2 and CLIP) are trained on. Another limitation of our method is that the concrete projection and augmentation operations require manual selection for the best result. We automated a coarse selection via the criteria introduced in Appendix A.3, but a rigorously defined numeric metric for disentanglement, accuracy, etc. could further benefit the performance. Also, texts that are not associated with an obvious attribute (e.g., a celebrity's name) are harder to project to the subspace.

CONCLUSION
In this paper, we considered the problem of CLIP-based text-guided image manipulation. We introduced a novel projection-augmentation embedding (PAE) to facilitate more disentangled, interpretable, and controllable image editing. PAE is a general technique that can be seamlessly integrated into any other CLIP-based model; it is simple to use and has the advantages of computational efficiency, extensibility, and generalizability. We demonstrated its effectiveness in a text-guided semantic face editing experiment with both high-level (emotion) and low-level (hairstyle) editing. We integrated PAE with other state-of-the-art CLIP-based face editing models, showing the improvements brought by PAE from various perspectives. We can see that without PAE, changes are not made in a disentangled way. For example, the hair color and the lighting condition of the face in the "happy" rows changed; the identity of the face in the "surprised" rows changed; etc. These problems are present in the original models and their DP versions. Meanwhile, some other edits do not conform well to the text prompts. For example, most Dir versions did not result in any notable changes. When comparing models before and after incorporating PAE, it is evident that PAE facilitates a more disentangled, realistic, and accurate face manipulation.

A APPENDIX A.1 Similarity of different CLIP embeddings
As a complement to Figure 2, we record the averaged cosine similarity of the embeddings from different modalities in Table 2. We see that image embeddings and text embeddings lie in different regions in ℭ and have lower inter-modality similarity compared to the intra-modality similarity, regardless of their semantic meanings. From the table we can also see that PAEs have higher cosine similarity to images (0.491 and 0.493) than texts have to images (0.200).
Note that this discrepancy between image embeddings and text embeddings in the CLIP space is general and not restricted to faces. We plot the embeddings of 100 random images and 100 random texts visualized using PCA in Figure 8 and also record the similarity in Table 3. We note the same observation: image embeddings and text embeddings lie in different regions in ℭ and have lower inter-modality similarity compared to the intra-modality similarity.

A.2 More options for projection and augmentation
In this section we present two more options for projection and three more for augmentation. About notation: we give each option a short identifier (or none at all) and add it as a superscript to P_S, A_{t,α}, or E_S. For example, E_S^{GS,+} means the projection-augmentation embedding with a projection onto an orthonormal basis (P_S^GS) and an augmentation that preserves the coefficients (A_{t,α}^+).
A.2.1 Projection. A simpler type of projection, P_S, also assumes S to be a linear subspace of the CLIP space ℭ and is very similar to P_S^GS, except that it does not orthonormalize the basis vectors.
Note that in this case, the dot product with the basis vectors is not a strict projection onto a linear subspace. However, this simple option also worked well in our prior experiments. This is possibly due to the high dimensionality of the CLIP space (512): if we do not have many basis vectors (e.g., fewer than 20), their pairwise dot products tend to be very small, and thus they are nearly orthogonal to each other already.
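This near-orthogonality is easy to check numerically for random vectors (real CLIP text embeddings are not Gaussian, so this only illustrates the high-dimensional intuition, not a property of CLIP itself):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(20, 512))                 # 20 random "basis" vectors
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalize the rows
G = V @ V.T - np.eye(20)                       # off-diagonal pairwise cosines
max_abs_cos = float(np.abs(G).max())
# in 512 dimensions the pairwise cosines concentrate near 0 (std ~ 1/sqrt(512))
```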
The second option, P_S^All, does not assume that S is necessarily a linear subspace. Instead, S can be any manifold. P_S^All directly stores the set T. By storing the embeddings of all related texts, it effectively samples and stores points on the manifold S, and when performing the projection onto this surface, the best approximation is to pick the point on the surface closest to the vector being projected:

P_S^All(e_x) = argmax_{v ∈ T} CosSim(e_x, v).

Naturally, if we obtain a larger T, we sample more points from the manifold and better approximate the projection.
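A sketch of this nearest-point lookup over the stored set T, using cosine similarity as the closeness measure (the function name is illustrative):

```python
import numpy as np

def project_all(e, corpus_T):
    """Return the stored corpus embedding with the highest cosine
    similarity to e, i.e. the nearest sampled point on the manifold."""
    C = corpus_T / np.linalg.norm(corpus_T, axis=1, keepdims=True)
    q = e / np.linalg.norm(e)
    return corpus_T[int(np.argmax(C @ q))]
```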
A.2.2 Augmentation. We propose three more options for the augmentation operation: A_{t,α}, A_{t,α}^Ex, and A_{t,α}^ExD. The first is used with the projections P_S^GS, P_S, and P_S^PCA, and the last two with P_S^All.
The simplest, A_{t,α}, adds the text and image together, with a coefficient α controlling the augmenting power:

A_{t,α}(p) := p + α e_t. (12)

However, simply adding α e_t may result in a vector too similar to the original embedding, likely resulting in an optimization process that barely does anything. In that case, the augmentation introduced in Eq. (9) is a better option because it additionally weakens the other components by a certain amount (while preserving the sum of coefficients). To avoid misunderstanding, we denote by A_{t,α} this simplest augmentation and by A_{t,α}^+ the augmentation in Eq. (9).
The last two types of augmentation, $A^{Ex}_{T,\alpha}$ and $A^{ExD}_{T,\alpha}$, are for $P^{All}_{\mathbb{S}}$. Since $P^{All}_{\mathbb{S}}$ returns an embedding $e_t \in T$, which is expected to indicate the current feature of the image, if we want to change that feature to the one specified by the text $T$, we can simply exchange $e_t$ for the text embedding $e_T$:
$$A^{Ex}_{T,\alpha}(v) := v + \alpha \, (e_T - e_t),$$
where $\alpha$ is the augmenting power.
The final option, $A^{ExD}_{T,\alpha}$, is a more robust version of $A^{Ex}_{T,\alpha}$. Instead of the one-for-one exchange above, we perform a one-for-$\alpha$ exchange. More precisely, we still add in $\alpha$ copies of $e_T$, but instead of subtracting $\alpha$ copies of the single most similar text embedding, we subtract the $\alpha$ different most similar text embeddings (each embedding subtracted only once).
Naturally, between the one-for-one exchange in $A^{Ex}_{T,\alpha}$ and the one-for-$\alpha$ exchange in $A^{ExD}_{T,\alpha}$, there exist many other options, one-for-$k$ for any $k$ between 1 and $\alpha$. We leave this exploration to future work.
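The two exchange augmentations can be sketched as follows, with toy stand-in vectors; since the paper's exact equations are summarized here in prose, the precise normalization may differ:

```python
import numpy as np

def augment_exchange(v, e_new, corpus, alpha):
    """A^Ex: add alpha copies of the target text embedding and subtract
    alpha copies of the single most similar corpus embedding."""
    T = np.asarray(corpus)
    sims = (T @ v) / (np.linalg.norm(T, axis=1) * np.linalg.norm(v))
    nearest = T[np.argmax(sims)]  # e_t, the image's current feature
    return v + alpha * (e_new - nearest)

def augment_exchange_diverse(v, e_new, corpus, alpha):
    """A^ExD: subtract the alpha *different* most similar corpus
    embeddings, each only once, instead of alpha copies of one."""
    T = np.asarray(corpus)
    sims = (T @ v) / (np.linalg.norm(T, axis=1) * np.linalg.norm(v))
    top = T[np.argsort(sims)[::-1][:int(alpha)]]
    return v + alpha * e_new - top.sum(axis=0)

corpus = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # stored text embeddings
v = np.array([1.0, 0.0, 0.0])      # projected image embedding stand-in
e_new = np.array([0.0, 0.0, 1.0])  # target text embedding stand-in
print(augment_exchange(v, e_new, corpus, 1.0))  # -> [0. 0. 1.]
```

With $\alpha = 1$ the two variants coincide; they differ once $\alpha > 1$, where $A^{ExD}_{T,\alpha}$ spreads the subtraction over several similar embeddings.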

A.3 Evaluation Criteria
In the future, other researchers may want to develop new types of PAE for their specific tasks. In this section we provide two straightforward criteria to evaluate and select among different options of PAE: (1) We need CosSim($E_{\mathbb{S}}$, $e_{I^*}$) to be as large as possible, where $e_{I^*}$ is the embedding of the target image, i.e., the ideal target; this is because our goal is to approximate the inaccessible $e_{I^*}$. At the least, we need CosSim($E_{\mathbb{S}}$, $e_{I^*}$) > CosSim($e_I$, $e_{I^*}$), so that we gain something by replacing $e_I$ with $E_{\mathbb{S}}$ as the optimization target. (2) We need CosSim($E_{\mathbb{S}}$, $e_T$) > CosSim($E_{\mathbb{S}}$, $e_I$); otherwise the PAE is more similar to the original image, and no change towards $e_T$ will be made during the optimization.

We evaluate all eight options of PAE (two proposed in Section 4.1 and six in Appendix A.2) in the text-guided facial emotion editing experiment described in Section 5. $P^{GS}_{\mathbb{S}}$ is realized by using the six basic human emotions [Ekman 1992] as a semantic basis: "happy", "sad", "angry", "fearful", "surprised", "disgusted"; $P^{PCA}_{\mathbb{S}}$ and $P^{All}_{\mathbb{S}}$ are implemented with a corpus $T$ consisting of 277 emotion texts found online and ten principal components. We plot the above quantities against $\alpha$ in Figure 9, where each color corresponds to one option of $E_{\mathbb{S}}$. Restating the above criteria, within each color we need (1) the solid line to be as large as possible, and at least above the orange horizontal line; and (2) the dash-dotted line to be above the black horizontal line. Note that as the + versions perturb the embedding more, the dash-dotted lines for $E^{+}_{\mathbb{S}}$, $E^{GS,+}_{\mathbb{S}}$, and $E^{PCA,+}_{\mathbb{S}}$ are higher than their non-+ counterparts.
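The two selection criteria can be checked mechanically. The sketch below uses toy 2-D stand-in vectors (the true target embedding $e_{I^*}$ is only available in controlled experiments; the function name is illustrative):

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passes_criteria(pae, e_img, e_text, e_target):
    """Check the two PAE selection criteria:
    (1) the PAE should approximate the ideal target better than the
        original image embedding does;
    (2) the PAE should be closer to the text than to the original
        image, or the optimization will not move."""
    c1 = cos_sim(pae, e_target) > cos_sim(e_img, e_target)
    c2 = cos_sim(pae, e_text) > cos_sim(pae, e_img)
    return c1 and c2

e_img = np.array([1.0, 0.0])    # original image embedding stand-in
e_text = np.array([0.0, 1.0])   # text embedding stand-in
e_star = np.array([0.6, 0.8])   # hypothetical ideal target
pae = np.array([0.5, 0.9])      # candidate PAE
print(passes_criteria(pae, e_img, e_text, e_star))  # -> True
```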

A.4 A Failed Case: Double-Projected Embedding
This section presents the double-projected embedding (DPE), which shares the idea of using CLIP subspaces to extract relevant information from the embeddings. DPE is simpler than PAE in that it omits the second step of augmenting the projected image embedding; instead, it projects both the image and text embeddings onto the subspace and directly optimizes the image towards the text within the subspace.
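The quantity DPE optimizes can be sketched as follows, with toy stand-in vectors; note that it discards everything outside the subspace:

```python
import numpy as np

def dpe_similarity(e_img, e_text, basis):
    """DPE objective: project both image and text embeddings onto the
    subspace (rows of `basis` assumed orthonormal) and measure their
    cosine similarity there instead of in the full CLIP space."""
    B = np.asarray(basis)
    p_img, p_text = B @ e_img, B @ e_text
    return float(p_img @ p_text /
                 (np.linalg.norm(p_img) * np.linalg.norm(p_text)))

# Toy 3-D example: axis 2 carries information outside the subspace,
# which the DPE objective ignores entirely.
basis = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
e_img = np.array([1.0, 1.0, 5.0])
e_text = np.array([1.0, 1.0, -5.0])
print(dpe_similarity(e_img, e_text, basis))  # -> 1.0 inside the subspace
```

Because all components orthogonal to the subspace are invisible to this objective, nothing constrains them during optimization.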
We conduct the same text-guided emotion editing experiment as in Section 5 using $\mathrm{DPE}^{GS}$ and $\mathrm{DPE}^{PCA}$ with 20 principal components. The results are summarized in Figure 10 and Figure 11, respectively.
We can see from the results that although most faces indeed changed their emotion according to the text prompts, we lost disentanglement: in some manipulations, the lighting condition, the background, the hair, or even the facial identity changed.
One possible explanation is that by projecting faces onto the emotion subspace, most information except the emotion has been lost, so we cannot preserve irrelevant attributes. Recall that in constructing PAE, the final step is to add back the residual to return to the "image region" (see Eq. (4)). We hypothesize that this residual contains exactly the irrelevant information that we need to preserve. This example also shows the necessity and importance of using an optimization target that is close enough to the embedding $e_I$ of the original image.

A.5 CLIP Subspace Extracts Relevant Information from Texts
In this section we conduct two experiments to demonstrate that our emotion subspace $\mathbb{S}$ can extract the relevant information from texts. The emotion subspace $\mathbb{S}$ is created using either $P^{GS}_{\mathbb{S}}$ or $P^{PCA}_{\mathbb{S}}$; both result in similar observations.

A.5.1 Extract Emotion from Emotion-Animal Phrases. We compare the averaged similarity of emotion-animal phrases in the original CLIP space $\mathfrak{C}$ and in the emotion space $\mathbb{S}$ created by $P^{GS}_{\mathbb{S}}$. Each emotion-animal phrase consists of an adjective for an emotion (e.g., "happy", "sad") qualifying a noun for an animal (e.g., "horse", "man").
If $\mathbb{S}$ is able to capture the emotion information, then we expect the similarity of phrases with the same animal but different emotions (e.g., "happy horse", "sad horse") to be high in $\mathfrak{C}$ (because they describe the same animal) but low in $\mathbb{S}$ (because the emotions differ). Conversely, if the emotion is the same but the animals differ (e.g., "happy horse", "happy cat"), then we expect the similarity to be high in $\mathbb{S}$ but low in $\mathfrak{C}$.
The result is tabulated in Table 4 and coincides with our hypothesis: in $\mathbb{S}$, the emotions rather than the animals dominate the similarity, showing that $\mathbb{S}$ can extract emotion information from the phrases.

Table 4: Averaged similarity of emotion-animal phrases. In $\mathbb{S}$, emotions dominate the similarity rather than the animals, showing that $\mathbb{S}$ can capture emotion information from the phrases.

                                    $\mathfrak{C}$    $\mathbb{S}$
same animal, different emotions     0.863             0.439
same emotion, different animals     0.834             0.926

A.5.2 Extract Emotion from Emotion Texts. We collect five emotions from each of the five groups happy, sad, angry, disgusted, fearful (25 emotions in total) and use heat maps to visualize their pairwise similarity in $\mathfrak{C}$ and in the emotion subspaces created by $P^{GS}_{\mathbb{S}}$ and $P^{PCA}_{\mathbb{S}}$. The result is shown in Figure 12. We see that emotions in the same group have higher similarity in both emotion subspaces than in $\mathfrak{C}$.
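As a toy illustration of this effect, with hand-made 4-D stand-ins for CLIP embeddings (not real CLIP outputs), projecting onto an "emotion" subspace makes similarity depend only on the emotion components:

```python
import numpy as np

def project_gs(v, basis):
    """Return the coordinates of v in the subspace spanned by an
    orthonormal basis (as produced by Gram-Schmidt)."""
    return np.asarray(basis) @ v

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-D stand-ins: axes 0-1 encode "emotion", axes 2-3 "animal".
happy_horse = np.array([1.0, 0.0, 1.0, 0.0])
sad_horse = np.array([0.0, 1.0, 1.0, 0.0])
happy_cat = np.array([1.0, 0.0, 0.0, 1.0])
emotion_basis = [np.array([1.0, 0.0, 0.0, 0.0]),
                 np.array([0.0, 1.0, 0.0, 0.0])]

# Same animal, different emotions: similar in the full space...
print(cos_sim(happy_horse, sad_horse))                    # 0.5
# ...but dissimilar once projected onto the emotion subspace.
print(cos_sim(project_gs(happy_horse, emotion_basis),
              project_gs(sad_horse, emotion_basis)))      # 0.0
# Same emotion, different animals: similar in the emotion subspace.
print(cos_sim(project_gs(happy_horse, emotion_basis),
              project_gs(happy_cat, emotion_basis)))      # 1.0
```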

A.6 Full Result of Text-Guided Face Editing
In this section we present the full results of the text-guided face editing that could not be included in Section 5 due to the space limit. Figures 13 to 15 show the results of emotion, hairstyle, and physical characteristic editing, respectively, using different PAEs; the text prompts as well as the type of PAE used are written to the left of each row. Figure 16 to Figure 28 compare the eight models in face editing for different text prompts in the aforementioned three editing categories (for emotion editing, we also include DP and Dir targets, totalling sixteen models). Figure 29 shows the controllability of our method by varying the augmenting power.

A.7 Text-Guided Editing on Non-Facial Images
We also conduct text-guided editing on non-facial images, namely images of dogs from the AFHQ-Dog dataset [Choi et al. 2020]. The experimental procedure is very similar to Section 5 except for three differences:
• StyleGAN2 is pre-trained on the AFHQ-Dog dataset instead of the FFHQ dataset;
• Since StyleCLIP [Patashnik et al. 2021] and TediGAN [Xia et al. 2021]

A.8 Survey Protocol
In order to quantitatively examine the performance of PAE, we conducted a user survey. We invited 50 participants of various backgrounds to evaluate the edits of 36 randomly generated images. Each question in the survey provides an original image, a text prompt, and two edited images: one produced by a model (Naive, StyleCLIP, StyleMC, or TediGAN) and one by its +PAE version. In each question, the participant was asked (1) which edited image is more disentangled; and (2) which edited image conforms more to the text prompt (participants were first introduced to the concepts of disentanglement and conformity to text prompts with examples). We created nine questions for each pair of models, totalling 4 × 9 = 36 questions. The nine text prompts are "happy", "sad", "angry", "surprised", "bald", "curly hair", "blonde", "black hair", and "grey hair".

Figure 1: Using text embedding as the optimization target results in unsatisfactory outputs. Note the entangled changes (e.g., the clothing of the left child, the glasses of the lady), inaccurate output, and artifacts.
Figure 2: PCA visualization of face images and text descriptions (Section 3.1), and their corresponding PAEs in the CLIP space (Section 4). Note that image and text embeddings are non-overlapping.
Figure 3: Similarity of video frames to the first frame and to the text "a happy face" in different spaces. The changes in the emotion subspace are the most significant, as it distills the relevant information.

Figure 6: A graphical demonstration of the calculation of PAE. It is calculated by (1) projecting the embedded image $e_I$ onto a pre-defined subspace $\mathbb{S}$ of interest; (2) augmenting the projected vector in a way that the effect of the text is amplified; and finally (3) adding back the residual to return to the "image region" in $\mathfrak{C}$.
Comparison of sixteen methods in text-guided face editing.

Figure 9: Comparison of eight options of PAE in a text-guided facial emotion editing experiment. Each option is color-coded. In order for $E_{\mathbb{S}}$ to be effective, we need the solid line to be as large as possible, and at least above the orange horizontal line, and the dash-dotted line to be above the black horizontal line. In this experiment, the best three options are $E^{+}_{\mathbb{S}}$ ($\alpha \in [5, 15]$), $E^{GS,+}_{\mathbb{S}}$ ($\alpha \in [10, 15]$), and $E^{PCA,+}_{\mathbb{S}}$ ($\alpha \in [2.5, 5]$). Note also that as the + versions perturb the embedding more, the dash-dotted lines for + versions are generally higher than their non-+ counterparts.

Figure 10: Text-guided emotion editing using $\mathrm{DPE}^{GS}$. Note that the changes are not disentangled.

Figure 11: Text-guided emotion editing using $\mathrm{DPE}^{PCA}$. Note that the changes are not disentangled.
[Figure 12 axis labels, 25 emotions in five groups: happy, satisfied, joyful, content, amused; sad, unhappy, lonely, hopeless, grieved; annoyed, furious, irritated, mad, angry; fearful, anxious, terrified, panicked, horrified; disliking, offended, uncomfortable, disgusted, disturbed]

Figure 12: Pairwise similarity of 25 emotions in five groups in three different spaces. We see that emotions in the same group have higher similarity in both emotion subspaces than in $\mathfrak{C}$.

Figure 14: Text-guided hairstyle editing.
Figure 15: Text-guided physical characteristic editing.

Figure 30 shows the aggregated result of the Naive+PAE approach. The text prompts are written to the left of each row. Figure 31 to Figure 33 compare the four models for different text prompts.

Table 1: Seven metrics for sixteen image editing models. The ↓ beside a metric means that a lower score is better, and the ↑ means the opposite.

Table 2: Averaged cosine similarity of different CLIP embeddings. Note that image embeddings and text embeddings have lower inter-modality similarity. Also note that PAEs have higher intra-modality similarity (lower variation) than images, showing that they discard some noisy information.

Figure 8: PCA visualization of 100 random images and 100 random texts in the CLIP space. Note that image and text embeddings lie in different regions.

Table 3: Averaged cosine similarity of CLIP embeddings of 100 random images and texts. Note that image embeddings and text embeddings have lower inter-modality similarity.