DOI: 10.1145/3528233.3530747
Research Article · Open Access

CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions

Published: 24 July 2022

ABSTRACT

The success of StyleGAN has enabled unprecedented semantic editing capabilities on both synthesized and real images. However, such editing operations are either trained with semantic supervision or annotated manually by users. Separately, the CLIP architecture has been trained on internet-scale, loosely paired image and text data, and has proven useful in several zero-shot learning settings. In this work, we investigate how to effectively link the pretrained latent spaces of StyleGAN and CLIP, which in turn allows us to automatically extract semantically labeled edit directions from StyleGAN, finding and naming meaningful edit operations in a fully unsupervised setup, without additional human guidance. Technically, we propose two novel building blocks: one for discovering interesting CLIP directions and one for semantically labeling arbitrary directions in CLIP latent space. The setup does not assume any pre-determined labels, so we do not require any additional supervised text or attributes to build the editing framework. We evaluate the effectiveness of the proposed method and demonstrate that extracting disentangled, labeled StyleGAN edit directions is indeed possible, revealing interesting and non-trivial edit directions.
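
To make the high-level pipeline more concrete, below is a minimal sketch of the two building blocks the abstract describes: surfacing candidate directions in CLIP space from StyleGAN-generated images, and attaching a text label to an arbitrary CLIP-space direction. This is not the authors' exact algorithm. It assumes the OpenAI clip package, a user-supplied source of batches of CLIP-preprocessed StyleGAN samples, a hypothetical candidate vocabulary `vocab`, and PCA as one plausible way to discover "interesting" directions.

```python
# Sketch only: approximates the two building blocks described in the abstract,
# not the paper's exact method. Assumptions (not from this page): the OpenAI
# `clip` package, preprocessed batches of StyleGAN-generated images, and a
# small candidate word list for labeling.
import torch
import clip
from sklearn.decomposition import PCA

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def clip_image_embeddings(images):
    """Embed a batch of already-preprocessed images into CLIP space (L2-normalized)."""
    feats = model.encode_image(images.to(device)).float()
    return feats / feats.norm(dim=-1, keepdim=True)


def discover_clip_directions(image_batches, n_directions=10):
    """Block 1 (sketch): treat principal components of CLIP embeddings of
    generated images as candidate 'interesting' directions."""
    feats = torch.cat([clip_image_embeddings(b) for b in image_batches]).cpu().numpy()
    pca = PCA(n_components=n_directions).fit(feats)
    return torch.tensor(pca.components_, dtype=torch.float32)  # (n_directions, 512)


@torch.no_grad()
def label_direction(direction, vocab):
    """Block 2 (sketch): name a CLIP-space direction by the vocabulary word
    whose text embedding is most aligned with it."""
    tokens = clip.tokenize(vocab).to(device)
    text = model.encode_text(tokens).float()
    text = text / text.norm(dim=-1, keepdim=True)
    d = direction.to(device) / direction.norm()
    scores = text @ d                      # cosine similarity per word
    best = scores.argmax().item()
    return vocab[best], scores[best].item()
```

A direction returned by discover_clip_directions could then be passed to label_direction with a word list such as ["smile", "glasses", "beard", "age"] to obtain a tentative name and an alignment score; the corresponding StyleGAN edit direction would still have to be found by relating latent-space changes to CLIP-space changes, which the sketch does not cover.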

Supplemental Material

clip2stylegan_vid.mp4 (supplemental video)

  • Published in

    SIGGRAPH '22: ACM SIGGRAPH 2022 Conference Proceedings
    July 2022, 553 pages
    ISBN: 9781450393379
    DOI: 10.1145/3528233

    Copyright © 2022 Owner/Author

    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 1,822 of 8,601 submissions, 21%
