Research Article

ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions

Published: 22 July 2022

Abstract

We present ASSET, a neural architecture for automatically modifying an input high-resolution image according to a user's edits on its semantic segmentation map. Our architecture is based on a transformer with a novel attention mechanism. Our key idea is to sparsify the transformer's attention matrix at high resolutions, guided by dense attention extracted at lower image resolutions. Whereas previous attention mechanisms are either computationally too expensive to handle high-resolution images or overly constrained within specific image regions, hampering long-range interactions, our novel attention mechanism is both computationally efficient and effective. Because it captures long-range interactions and context, our sparsified attention mechanism can synthesize interesting phenomena in scenes, such as reflections of landscapes onto water or flora consistent with the rest of the landscape, that previous convnets and transformer approaches could not generate reliably. We present qualitative and quantitative results, along with user studies, demonstrating the effectiveness of our method. Our code and dataset are available at our project page: https://github.com/DifanLiu/ASSET
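The sparsification idea above lends itself to a compact illustration: run dense attention on a downsampled token grid, let each low-resolution query keep only its top-scoring key blocks, and restrict each high-resolution token's attention to the contents of its parent cell's chosen blocks. The PyTorch sketch below is ours, not the authors' released code: the function name, tensor shapes, the top-k block-selection rule, and the omission of causal masking for autoregressive decoding are all simplifying assumptions.

```python
# A minimal PyTorch sketch of low-resolution-guided sparse attention
# (illustrative only; see the project repository for the real implementation).
import torch


def guided_sparse_attention(q, k, v, low_attn, scale, top_k=8):
    """Attend over only the high-res keys whose low-res blocks score highest.

    q, k, v:  (B, N, D) high-resolution tokens on a square (H, H) grid, N = H*H.
    low_attn: (B, M, M) dense attention from a low-resolution pass,
              M = N // scale**2 (one low-res token per scale x scale block).
    scale:    downsampling factor between the two grids (H % scale == 0).
    top_k:    number of low-res key blocks kept per query.
    Names and shapes are assumptions; causal masking is omitted for brevity.
    """
    B, N, D = q.shape
    H = int(N ** 0.5)          # assume a square token grid
    h = H // scale             # low-res grid side
    M = h * h

    # Parent low-res cell of every high-res token.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(H), indexing="ij")
    parent = ((ys // scale) * h + xs // scale).reshape(N)            # (N,)

    # High-res token ids contained in each low-res cell.
    cy, cx = torch.meshgrid(torch.arange(h), torch.arange(h), indexing="ij")
    oy, ox = torch.meshgrid(torch.arange(scale), torch.arange(scale),
                            indexing="ij")
    members = ((cy[..., None, None] * scale + oy) * H
               + cx[..., None, None] * scale + ox).reshape(M, scale * scale)

    # 1. The low-res attention picks top_k key blocks per low-res query.
    blocks = low_attn.topk(top_k, dim=-1).indices                    # (B, M, top_k)

    # 2. Each high-res query inherits its parent's blocks, expanded to tokens.
    sel = members[blocks[:, parent]].reshape(B, N, -1)               # (B, N, K)

    # 3. Sparse attention: O(N * K) with K = top_k * scale**2, not O(N**2).
    bidx = torch.arange(B)[:, None, None]
    k_sel, v_sel = k[bidx, sel], v[bidx, sel]                        # (B, N, K, D)
    scores = torch.einsum("bnd,bnkd->bnk", q, k_sel) / D ** 0.5
    return torch.einsum("bnk,bnkd->bnd", scores.softmax(dim=-1), v_sel)


# Tiny smoke test: 32x32 token grid guided by an 8x8 low-res pass.
if __name__ == "__main__":
    B, H, scale, D = 2, 32, 4, 64
    N, M = H * H, (H // scale) ** 2
    q, k, v = (torch.randn(B, N, D) for _ in range(3))
    low_attn = torch.rand(B, M, M)       # stand-in for real low-res attention
    out = guided_sparse_attention(q, k, v, low_attn, scale)
    print(out.shape)                     # torch.Size([2, 1024, 64])
```

Because the guide can select blocks anywhere in the image, each query keeps access to distant context (for example, the sky region that a water token should reflect) while the per-query cost drops from N keys to K = top_k · scale² keys; the repository linked in the abstract remains the authoritative implementation.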


Supplemental Material

3528223.3530172.mp4 (presentation video)



Published in

ACM Transactions on Graphics, Volume 41, Issue 4
July 2022, 1978 pages
ISSN: 0730-0301
EISSN: 1557-7368
DOI: 10.1145/3528223

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

