Text-Guided Synthesis of Eulerian Cinemagraphs

Figure 1: Our method can synthesize artistic cinemagraphs given text prompts, bringing to life motion effects such as "waterfall falling", "river flowing", and "turbulent ocean." These visual effects may be challenging to depict in a static photograph, but they flourish in the medium of cinemagraphs. To view the teaser images as videos, we recommend using Adobe Acrobat.

Abstract
We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions, an especially challenging task when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. We focus on cinemagraphs of fluid elements, such as flowing rivers and drifting clouds, which exhibit continuous motion and repetitive textures. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: a pair consisting of an artistic image and its pixel-aligned, natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text. Please find code and video results on our project website.

Introduction
Cinemagraphs are captivating visuals where certain elements exhibit repeated, continuous motions while the rest remains static [7]. They offer a unique way to highlight scene dynamics while capturing a specific moment in time [10]. Since their inception, cinemagraphs have become popular as short videos and animated GIFs on social media and photo-sharing platforms. They are also prevalent in online newspapers, commercial websites, and virtual meetings. However, creating a cinemagraph is remarkably challenging, as it involves capturing videos or images with a camera and using semi-automated methods to produce seamless looping videos. This process often requires considerable user effort [66,2], including capturing suitable footage, stabilizing video frames, selecting animated and static regions, and specifying motion directions.
In this work, we explore a new research problem of text-based cinemagraph synthesis, significantly reducing the need for data capture and tedious manual efforts. As shown in Figure 1, our method captures motion effects such as "water falling" and "flowing river", which cannot be easily expressed with still photographs and existing text-to-image methods. More importantly, our approach broadens the spectrum of styles and compositions available in cinemagraphs, allowing content creators to specify various artistic styles and describe imaginative visual elements. Our method can synthesize realistic cinemagraphs as well as creative or otherworldly scenes. Similar to prior works on single-image cinemagraph generation [28,44,15], we focus on generating cinemagraphs with predominantly fluid elements like water, smoke, and clouds.
This new task presents significant challenges to current methods. One straightforward approach is to use a text-to-image model to generate an artistic image and then animate it. Unfortunately, existing single-image animation methods struggle to predict meaningful motions for artistic inputs, as the models are typically trained on real video datasets. Curating a large-scale dataset of artistic looping videos is impractical, given the difficulty of producing individual cinemagraphs and the wide range of artistic styles involved. Alternatively, text-based video models can be used to generate videos directly. However, our experiments reveal that these methods often introduce noticeable temporal flickering artifacts in static regions and fail to produce the desired semi-periodic motions.
To close the gap between artistic images and animation models developed for real videos, we propose an algorithm based on the concept of twin image synthesis. Our method generates two images from a user-provided text prompt, one artistic and one realistic, that share the same semantic layout. The artistic image represents the style and appearance of the final output, while the realistic counterpart provides an input that is much easier for current motion prediction models to process. After we predict the motion for the realistic image, we can transfer this information to its artistic counterpart and synthesize the final cinemagraph. Though the realistic image is not displayed as the final output, it serves as a critical intermediate layer that resembles the semantic layout of the artistic image while remaining amenable to existing models. To improve motion prediction, we further leverage additional information from text prompts and semantic segmentation of the realistic image.
Our experiments show that our method outperforms existing single-image animation methods on both artistic and natural images in terms of automated metrics such as Fréchet Video Distance [68]. We also demonstrate that our method synthesizes more visually appealing artistic cinemagraphs compared to existing single-image animation and zero-shot text-to-video methods, according to a user study. We further include an ablation study regarding different algorithmic designs. Finally, we extend our method to two applications: animating existing paintings and text-based control of the motion directions.

Related Work
Video Looping and Cinemagraph. Classic methods for creating cinemagraphs often involve reusing frames from a real video with periodic motions, as demonstrated by Video Texture [60]. This seminal work uses a graph-based formulation to find seamless transitions between frames with similar appearance and motions. To accommodate varying motion patterns across different regions, several works [37,66,2,31] propose separating looping dynamic elements from the static background, which often requires user guidance to mask dynamic regions and stabilize videos. Other works [41,40] further allow users to animate each region with different looping periods and handle videos with moving cameras [61]. These techniques have also been applied to process specific visual data such as portraits [3] and panoramas [1,23]. In contrast to these methods, our work does not require capturing a periodic real video or additional user annotation.
Single Image Animation. Another way of creating cinemagraphs is to start from a single image and add periodic motions. The first notable work in this area was by Chuang et al. [9]. They manually define different types of motion for different subject classes such as water and leaves. More recent works have employed deep networks to predict motion [13,43,28,44,15,39,4]. Animating Landscape [13] predicts motion autoregressively and generates the resulting video using backward warping. Holynski et al. [28] instead predict a single optical flow that describes the motion between consecutive frames and use forward warping to reduce stretching artifacts. Mahapatra et al. [44] and Fan et al. [15] offer controllable cinemagraph generation methods given user-provided masks and direction hints. DeepLandscape [43] adopts a StyleGAN model [33] to learn motion with random frame pairs from a video. Although all these methods can animate real images, they struggle with artistic scenes. In contrast, our method can support both natural scenes and imaginative, artistic styles. Recently, Endless-Loops [22] provides a controllable, non-deep-learning-based method to animate a single image that can work on a wide range of scenes, but requires extensive user input in the form of arrows and precise masks. Our method is the first text-based cinemagraph synthesis approach, significantly reducing the manual efforts involved in image capture and user annotations.
Figure 2: Overview. Given a text prompt c, we generate twin images with Stable Diffusion: an artistic image x in the style described in the text prompt, and a realistic counterpart x̂ using the modified prompt ĉ. Twin images share a similar semantic layout. We then extract a binary mask M of the moving regions from the self-attention maps obtained during the artistic image's generation process. We use the mask and the realistic image to predict the optical flow F̂ with the flow prediction model G_flow. Since the twin images have a very similar semantic layout, we can use the flow F̂ to animate the artistic image with our video generator G_frame.
Text-to-image Synthesis. Text-based interfaces have gained widespread usage in image synthesis with deep generative models, such as GANs [51,18,32,59], diffusion models [58,53,17], autoregressive models [73], and their hybrids [54,11,14]. These approaches offer two major advantages. First, text provides a widely accessible medium to many users. Second, recent text-based models demonstrate the ability to synthesize high-quality images with diverse styles and content. We aim to harness these advantages in our cinemagraph synthesis method. However, visualizing motion effects like "falling water" presents inherent challenges with still photographs. To overcome this limitation, our work depicts these visual effects with cinemagraphs.
More recently, several works have focused on text-based image editing, aiming to preserve the structure and content of the input image while incorporating desired changes specified by a text prompt [6,57,70,74,67,34,49]. We adopt some of them to bridge the gap between real video training datasets and artistic text prompts.
Text-to-video Models. Another line of work attempts to directly synthesize videos based on text descriptions. Similar to the image domain, the most prominent directions are autoregressive [29,69] and diffusion models [27,62,19,5,24]. ImagenVideo [27] and Make-a-Video [62] propose using a cascade of diffusion models for different resolutions. Alternatively, several methods [5,24] use the pretrained Stable Diffusion [55]. All these works require a significant amount of data and computational resources for training. A notable exception is Text2Video-Zero [35], which uses only a pretrained Stable Diffusion model, without any additional fine-tuning, to generate a video. Nonetheless, none of these methods explicitly model motion. As a result, our experiments demonstrate that the temporal consistency of their generated videos, a key aspect of plausible cinemagraphs, falls short compared to our method.

Method
Given a user-provided artistic text prompt c, describing an artistic or imaginative scene, along with a user-specified region name that the user wants to animate (e.g., waterfall), our goal is to generate a cinemagraph that faithfully reflects the text description.
As demonstrated in Section 4, existing text-to-video models often fail to generate visually pleasing cinemagraphs due to several factors. First, these models have not been trained on cinemagraph datasets. Curating such datasets becomes a chicken-and-egg problem, as creating a single artistic cinemagraph usually requires the expertise of a professional artist with existing tools. Second, text-to-video models frequently generate moving-camera effects and struggle to maintain temporal coherence in predominantly static scenes.
Instead, one intuitive method involves first generating an artistic image x using Stable Diffusion [54] in the same style as described by the text prompt. This is followed by predicting plausible motion corresponding to the image, and finally animating the image. Unfortunately, a naïve implementation of this idea fails in practice due to several challenges.
The first critical challenge lies in predicting plausible motion for artistic images or imaginative content. It is a daunting task, as motion prediction models are typically trained only on real video datasets. To tackle this, we propose generating a twin realistic image x̂, sharing the same semantic structure as the artistic image x, but with a natural photography style. Conveniently, we can automatically derive the realistic image x̂ by leveraging intermediate diffusion model features produced during the artistic image generation process. We elaborate on this in Section 3.1.
The second challenge is that single-image motion prediction remains hard even for natural images. Recent methods [28,13] often hallucinate movements for stationary regions (like rocks and hills), leading to noticeable artifacts. To address this challenge, we propose using a binary mask to direct the prediction model toward the dynamic regions. Yet again, the question arises: "How do we obtain such a mask for an artistic image?" Once more, the twin realistic image comes to the rescue. We leverage ODISE [72], a pretrained segmentation model trained on real images, together with the user-specified region name, to predict a binary mask for the realistic image x̂. To reduce the misalignment of region boundaries between the twin images x and x̂, we further refine this mask using intermediate diffusion model features. The whole mask and motion prediction algorithm is described in Section 3.2.
Finally, in Section 3.3, we transfer the predicted motion from the realistic image to the artistic image and generate a coherent cinemagraph. Figure 2 illustrates our method.

Twin Image Generation
We start by generating an artistic image, x, that corresponds to the input text prompt using a pretrained Stable Diffusion model. We assume that this image can have some unnatural style, such as "pixel art" or "Monet painting". As mentioned before, for our motion prediction network to work, we need a twin version of the artistic image, denoted as x̂, with a similar semantic layout.
To generate the natural image, we need a corresponding natural-styled prompt, ĉ, distinct from the user-provided artistic prompt c. Rather than expecting users to provide ĉ separately, we generate it automatically. Using the NLTK natural language toolkit [47], we collect all the common nouns ('NN', 'NNS') in the input prompt c that correspond to COCO-stuff [8] or DTDB [21] classes or their synonyms, and append "nature, bright, realistic, photography, 4k, 8k" at the end. For example, given c as "a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k", its corresponding ĉ would be "waterfall, hills, nature, bright, realistic, photography, 4k, 8k".
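As an illustration only, the prompt rewriting described above can be sketched as follows. This sketch replaces the NLTK part-of-speech tagging with a plain vocabulary lookup so that it runs without external data; the set `SCENE_CLASSES` is a hypothetical stand-in for the COCO-stuff and DTDB class names and their synonyms, and the function name `build_realistic_prompt` and the naive plural handling are our assumptions, not the implementation used in the experiments.

```python
import string

# Hypothetical stand-in for COCO-stuff / DTDB class names and their synonyms.
SCENE_CLASSES = {
    "waterfall", "river", "lake", "ocean", "sea",
    "cloud", "hill", "mountain", "rock", "beach",
}

# Suffix appended to every natural-styled prompt (taken from the text above).
REALISTIC_SUFFIX = "nature, bright, realistic, photography, 4k, 8k"


def build_realistic_prompt(artistic_prompt, classes=SCENE_CLASSES):
    """Keep only scene nouns from the artistic prompt and append the suffix."""
    kept = []
    for raw in artistic_prompt.lower().split():
        word = raw.strip(string.punctuation)
        # Naive singularization (assumption): "hills" -> "hill".
        singular = word[:-1] if word.endswith("s") else word
        if (word in classes or singular in classes) and word not in kept:
            kept.append(word)
    return ", ".join(kept + [REALISTIC_SUFFIX])


prompt = ("a large waterfall falling between hills in the style "
          "Van Gogh painting during sunset, 4k")
realistic_prompt = build_realistic_prompt(prompt)
```

Note that nouns such as "painting" or "sunset" are dropped here because they are not in the class vocabulary, which is what removes the style description from ĉ.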
Simply using the natural-styled prompt ĉ along with the same initial seed that was used to generate the artistic image does not guarantee that the twin images will always share the same semantic layout. In fact, this is rarely the case. Following recent works [49,67,25], we can further use the diffusion model's internal features to enforce structural correspondence between the two images.
Similar to Plug-and-Play [67], we observe that the self-attention maps A_t, at any timestep t of the denoising process in Stable Diffusion, control the spatial structure of the resulting image. Thus, during the generation of the artistic image x, we store the intermediate self-attention maps A_t for all timesteps t. As noted by Plug-and-Play [67], the structural alignment can be made even stronger by injecting the output residual block features f_t. We store the features of the 4th layer in the output blocks:
$$x_{t-1},\; A_t,\; f_t = \epsilon_\theta(x_t, c, t),$$
where ϵ_θ is a standard denoising UNet. To enforce structural similarity when generating the realistic image x̂, we inject the self-attention maps A_t and residual block features f_t into the UNet module:
$$\hat{x}_{t-1} = \hat{\epsilon}_\theta(\hat{x}_t, \hat{c}, t;\; A_t, f_t),$$
where x_t and x̂_t are the noisy images at timestep t corresponding to the artistic and realistic image, respectively, and ϵ̂_θ is the modified UNet, which also takes the injected features as input.
Note that Stable Diffusion operates in the latent space, but for notational brevity, we use x and x̂ instead of introducing new notation for latents. By leveraging the stored attention maps and residual block features, we ensure that the generated realistic image x̂ maintains a similar semantic layout to its artistic counterpart x. This enables us to establish a meaningful correspondence between the twin images, which is crucial for subsequent steps.
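To make the injection idea concrete, the following toy sketch shows a single-head self-attention layer in NumPy whose attention map can either be computed from its own input or overridden by a stored map. It is a simplified illustration under our own assumptions (random weights, one layer, no residual-feature injection, hypothetical names such as `self_attention` and `injected_attn`), not the Stable Diffusion UNet itself.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(feats, w_q, w_k, w_v, injected_attn=None):
    """Single-head self-attention over (num_pixels, dim) features.

    If `injected_attn` is given, it replaces the attention map that would
    otherwise be computed from `feats`, so the spatial structure comes
    from the stored map while the values come from the current input.
    """
    if injected_attn is None:
        q, k = feats @ w_q, feats @ w_k
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    else:
        attn = injected_attn
    return attn @ (feats @ w_v), attn


rng = np.random.default_rng(0)
dim, num_pixels = 8, 16
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
artistic_feats = rng.normal(size=(num_pixels, dim))
realistic_feats = rng.normal(size=(num_pixels, dim))

# Artistic pass: compute and store the attention map.
_, stored_attn = self_attention(artistic_feats, w_q, w_k, w_v)
# Realistic pass: reuse the stored map instead of recomputing it.
realistic_out, used_attn = self_attention(
    realistic_feats, w_q, w_k, w_v, injected_attn=stored_attn
)
```

In the realistic pass, which pixels attend to which is dictated entirely by the artistic pass, which is the intuition behind the shared semantic layout.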

Mask-guided Flow Prediction
As mentioned previously, predicting plausible optical flow requires an input binary mask that defines the regions to animate, in addition to an image. We use a pretrained open-set panoptic segmentation module, such as ODISE [72], to generate the mask. For example, for the prompt "a large river flowing in front of a mountain in the style of starry nights painting" and a user-specified region name 'river', we would use the segmentation map corresponding to 'river' as the binary mask. However, reliable masks can only be predicted for the realistic image, since the segmentation network is trained on real-world data. Although the images are twins, we cannot use the mask obtained for the realistic image directly for predicting flow, as there may be inconsistencies at the boundaries of semantic regions between the two images. In Figure 3, we illustrate this issue with an example. This will result in motion prediction at the boundaries of static regions in the realistic image, creating noticeable artifacts.
To this end, we use the self-attention maps A_t to generate the mask. We first average A_t across different timesteps t to obtain the average self-attention map Ā. We only use the self-attention maps after a certain number of steps, as earlier maps are noisy. We then cluster the pixels by applying spectral clustering [46] to the average self-attention map.
We then predict a binary mask M̂ for the realistic image x̂ using ODISE. Finally, we use the predicted mask M̂ to select which clusters to retain in Ā, based on the Intersection-over-Union (IoU) between each cluster and M̂, to obtain the final binary mask M. Figure 4 shows the different stages in our mask generation method.
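The cluster selection step can be sketched as below. The sketch assumes the clustering has already produced an integer label map (the spectral clustering itself is omitted), and the threshold value `iou_threshold=0.5` as well as the function names are hypothetical choices for illustration; the actual selection criterion may differ.

```python
import numpy as np


def iou(a, b):
    """Intersection-over-Union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0


def select_clusters(cluster_labels, seg_mask, iou_threshold=0.5):
    """Keep clusters whose IoU with the segmentation mask is high enough.

    cluster_labels: (H, W) integer labels from clustering the averaged
                    self-attention map.
    seg_mask:       (H, W) binary mask predicted on the realistic image.
    Returns the union of retained clusters as a boolean mask.
    """
    seg_mask = seg_mask.astype(bool)
    final_mask = np.zeros_like(seg_mask)
    for label in np.unique(cluster_labels):
        region = cluster_labels == label
        if iou(region, seg_mask) >= iou_threshold:
            final_mask |= region
    return final_mask


# Toy example: cluster 2 is the "river"; the segmentation mask leaks by
# one pixel into cluster 1 at a region boundary.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 2, 2],
                   [2, 2, 2, 2]])
odise_mask = np.array([[0, 0, 0, 0],
                       [0, 0, 0, 1],
                       [1, 1, 1, 1],
                       [1, 1, 1, 1]])
final_mask = select_clusters(labels, odise_mask)
```

Because the final mask is a union of self-attention clusters, its boundaries follow the artistic image's own structure, and the leaked boundary pixel of the segmentation mask is discarded.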
Unlike concurrent works [50,20], we use the ODISE mask instead of the cross-attention map (for the 'river' token) to select self-attention map regions. We found that ODISE produces more precise masks for natural scenes (like oceans, rivers, etc.). For example, for the prompt "a view of a rocky beach on the shore of the ocean", the cross-attention map for the 'ocean' token highlights both the ocean and the surrounding rocky beach. More details are provided in the supplementary material.
Following this, we predict the optical flow F̂ on the realistic image x̂, using a generator G_flow conditioned on M. In addition, we condition the flow prediction on the CLIP [52] embedding of the input text prompt c, which we compute before feeding it to the flow prediction network. The text conditions the optical flow prediction model through cross-attention layers. We use the same spatial transformer blocks as in Stable Diffusion, but omit the self-attention layers for text conditioning. Due to the limited dataset size, we initialize the spatial transformer blocks with the weights of Stable Diffusion and freeze the key and query weight matrices. We train the flow prediction network G_flow with End-Point-Error (EPE), a conditional GAN loss [30], and a feature matching loss [71]. Please refer to our supplement for more training details. We hypothesize that text inherently contains class information, like 'waterfall' or 'river', which can be useful to determine the natural direction in the predicted flow.
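For reference, the End-Point-Error term can be sketched with its standard definition, the mean Euclidean distance between predicted and ground-truth flow vectors. The function name and the (H, W, 2) array layout are our assumptions, and any weighting against the GAN and feature matching terms is not shown.

```python
import numpy as np


def end_point_error(pred_flow, gt_flow):
    """Mean Euclidean distance between two (H, W, 2) flow fields."""
    return float(np.linalg.norm(pred_flow - gt_flow, axis=-1).mean())


# Toy example: every predicted vector is off by (3, 4) pixels.
gt = np.zeros((4, 4, 2))
pred = gt + np.array([3.0, 4.0])
epe = end_point_error(pred, gt)
```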

Flow-guided Video Generation
Although in Section 3.2 we predict the flow F̂ for the realistic image x̂, we can use this flow to animate the artistic image x, as both share a similar semantic layout. We therefore define F = F̂, assuming that we can directly transfer the motion F̂ of the natural image to its artistic counterpart.
Given an accurate optical flow F, we can animate x using a method similar to Holynski et al. [28] for animating a single image, wherein each frame in the cinemagraph is generated separately. Since cinemagraphs have the property of looping in time, the artistic image x serves as both the first and the last frame. Considering a total of N frames in the generated cinemagraph, we generate a frame n ∈ [0, N] by symmetric splatting, introduced in Holynski et al. [28]. For this, we need the cumulative flows F_{0→n} and F_{N→N−n} in the forward and backward directions, respectively, where F_{0→n} and F_{N→N−n} are the optical flows between frames (0, n) and (N, N − n). The predicted optical flow F, however, defines the flow between any two consecutive frames n and n + 1. To get the cumulative flows in the forward and backward directions, we perform Euler integration of the predicted optical flow F and its reverse flow −F, for n and N − n times, respectively:
$$F_{0\to n}(p) = F_{0\to n-1}(p) + F\big(p + F_{0\to n-1}(p)\big),$$
$$F_{N\to N-n}(p) = F_{N\to N-n+1}(p) - F\big(p + F_{N\to N-n+1}(p)\big),$$
where p is the 2D pixel coordinate in the flow F. G_frame is trained to predict an intermediate frame v_n, given the cumulative flows F_{0→n} and F_{N→N−n}, the first frame, and the last frame. We perform symmetric splatting in the feature space of the encoder part of the generator, at multiple feature resolutions, similar to Mahapatra et al. [44], and then use the decoder part to generate the RGB image. We train this model on pairs of cumulative ground-truth flows F^gt_{0→n−1} and F^gt_{N→N−n}, the first frame v_0, and the last frame v_N of ground-truth natural videos to predict an intermediate frame v_n. Our training objective involves an L1 reconstruction loss, a conditional GAN loss [30], a feature matching loss [71], and a VGG-based perceptual loss [75].
Figure 4: [...] and a user-specified region name 'river', we generate a binary segmentation mask (c). In (d), we use PCA [16] to visualize the self-attention maps averaged across different timesteps. Finally, we apply spectral clustering [46] on the average self-attention maps (e) and select regions with high Intersection-over-Union (IoU) scores with respect to the ODISE mask (c).
Since at test time we want to generate looping videos, we use the artistic image x as both the first and the last frame. Surprisingly, although this model is trained on a dataset of real-domain videos, we can use it to animate the artistic image without any further modification, as it essentially performs inpainting of the small holes in feature space generated during symmetric splatting, using the repetitive surrounding textures.
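The Euler integration of a single, time-invariant flow field into cumulative flows can be sketched as follows. This is a minimal NumPy illustration under our own simplifying assumptions: the flow is stored as an (H, W, 2) array in (x, y) order, it is sampled with nearest-neighbor lookup, and coordinates are clamped to the image border; a practical implementation would likely use bilinear sampling on the GPU. The names `sample_flow` and `euler_integrate` are hypothetical.

```python
import numpy as np


def sample_flow(flow, coords):
    """Nearest-neighbor lookup of `flow` at (x, y) coords, clamped to bounds."""
    h, w, _ = flow.shape
    x = np.clip(np.rint(coords[..., 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(coords[..., 1]).astype(int), 0, h - 1)
    return flow[y, x]


def euler_integrate(flow, n_steps):
    """Cumulative displacement after following `flow` for `n_steps` steps."""
    flow = np.asarray(flow, dtype=np.float64)
    h, w, _ = flow.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).astype(np.float64)
    cumulative = np.zeros_like(flow)
    for _ in range(n_steps):
        # One Euler step: add the flow found at the currently advected position.
        cumulative = cumulative + sample_flow(flow, grid + cumulative)
    return cumulative


# Toy example: uniform motion of one pixel per frame to the right.
flow = np.zeros((4, 5, 2))
flow[..., 0] = 1.0
num_frames, n = 6, 2
forward = euler_integrate(flow, n)                 # frame 0 -> frame n
backward = euler_integrate(-flow, num_frames - n)  # frame N -> frame N - n
```

The backward cumulative flow is obtained by integrating the reversed field, so the same routine serves both directions needed for symmetric splatting.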

Experiments
In this section, we extensively compare our method against recent methods, both qualitatively and quantitatively, on real-world single-image animation (termed real domain) and text-to-cinemagraph for the artistic text prompts (termed artistic domain).
Dataset. Our dataset contains two domains. (1) Real Domain: We train our optical flow prediction and animation models, G_flow and G_frame, on the dataset provided by Holynski et al. [28]. This dataset contains real-life videos of waterfall, lake, river, and ocean scenes, with ground-truth average optical flow calculated for each video. Each video consists of 60 frames with a resolution of 720 × 1280. The training set contains 4750 videos, and the test set contains 162 videos. We use BLIP [38] to generate captions based on the first frame of each video. (2) Artistic Domain: Since there is no existing dataset for text-to-cinemagraph, we generate 20 different captions corresponding to waterfalls, rivers, lakes, clouds, sea, and ocean using different artistic styles, with 5-6 seeds each, generating 102 artistic images. Our data will be available upon publication.
Training Details. For the flow prediction network G_flow, we adopt a UNet backbone [56] with SPADE ResBlocks [48] for image and mask conditioning and cross-attention layers for text conditioning [45]. For the animation network G_frame, we also use a UNet backbone. In our experiments, we tried using both ODISE [72] and Segment Anything [36], and observed that Segment Anything produces a large number of holes in the segmentation masks, while ODISE produces much more consistent masks for our type of scene. Similar to Mahapatra et al. [44], we perform symmetric splatting across different hierarchical features of the UNet. To ensure stable training, we first train the two networks, G_flow and G_frame, separately, each for 200 epochs. We then train both models end-to-end for 50 epochs. Since the test videos in the real video dataset contain 60 frames, we generate videos of 60 frames. For artistic prompts, we generate videos of 120 frames. More details are provided in the supplementary material.
Figure 5: We compare our method with two single-image animation methods, Holynski et al. [28] and SLR-SFS [15], on real videos (a) and artistic images (b). For real videos, we also show the ground-truth optical flow averaged across all frames. Overall, our method predicts more plausible motions that align better with target regions. Please see our supplement for more video comparisons with more baselines.
Evaluation Metrics. We compare the quality of generated cinemagraphs on real-domain data using Fréchet Video Distance (FVD) [68], a commonly used metric in video generation [27,63]. We adopt two FVD variants, as suggested by Skorokhodov et al. [63]: (1) FVD_16 calculates the FVD score on 16 frames sampled at a factor of 3, and (2) FVD_60 computes the FVD score using all generated frames.
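As a small illustration of the frame selection for the two variants, the sketch below reads "sampled at a factor of 3" as taking every third frame, which is our interpretation rather than a detail confirmed by the text; the function name and the `start` offset are likewise hypothetical.

```python
def sample_frame_indices(num_frames, clip_length=16, stride=3, start=0):
    """Indices of `clip_length` frames taken every `stride` frames."""
    last = start + (clip_length - 1) * stride
    if last >= num_frames:
        raise ValueError("video is too short for the requested clip")
    return list(range(start, last + 1, stride))


fvd16_indices = sample_frame_indices(60)   # 16 frames, every third frame
fvd60_indices = list(range(60))            # all generated frames
```

Under this reading, the 16-frame variant spans frames 0 through 45 of a 60-frame video.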
User Study. We also conduct a user study to assess the quality of generated cinemagraphs on both real and artistic domains. We ask the Amazon MTurk participants to choose which animation they prefer. For real videos, we ask them to make choices based on two criteria: (1) [...] videos, one generated by our method and the other by one of the baselines or our ablation method. Each paired comparison is annotated by 5 annotators. More details about the user study are provided in the supplementary material.
For artistic cinemagraph generation, we rely solely on the user preference study, as we do not possess ground-truth data.

Baselines.
Single Image Animation. (1) Animating Landscape [13] predicts time-varying optical flow and generates the next frame using backward warping, autoregressively. They post-process the generated video to create a looping effect. (2) Holynski et al. [28] predict a constant optical flow from a given image and use this optical flow to generate a video using Euler integration and symmetric splatting. Their method inherently generates looping videos. (3) SLR-SFS [15] is designed for controllable animation that requires user-specified flow hints and masks. In our case, we repurpose their method for single-image animation without user controls for a fair comparison.
For text-to-cinemagraph, given a text prompt, we first generate an output image using Stable Diffusion [54]. We then animate the generated image using the above methods.
Text-to-Video. (1) CogVideo [29] is based on the transformer text-to-image model CogView [12]. (2) Text2Video-Zero [35] generates videos from text using a pretrained Stable Diffusion directly. (3) VideoCrafter [24] trains a text-to-video diffusion model to generate a video from text. For our comparison, we use the pretrained models provided by the authors. The above are recently released open-source models, while other text-to-video models [27,62,76] are closed-source. Additionally, as the videos produced by these models are non-looping, we implement a postprocessing method [13] to make them loop.

Real Domain Results
In this section, we compare the quality of predicted optical flow and generated cinemagraphs from a single image on a real video dataset [28].
Qualitative Comparison. As shown in Figure 5a, our method predicts more plausible flows than the baselines. Due to the lack of mask conditioning, both Holynski et al. [28] and SLR-SFS [15] predict flow in static regions and ignore many parts of the dynamic regions. The flow visualizations of Holynski et al. [28] also exhibit checkerboard artifacts. In contrast, our method predicts flows spanning the entire dynamic regions (like 'river'). Although the directions can sometimes vary from the ground truth, the flows remain physically plausible. For instance, a river can move from left to right or right to left, as shown in Figure 6a (row 2). Additionally, as shown in Figure 5a (row 3) and Figure 6a (rows 1 and 2), our flow sometimes covers more regions than the ground truth due to the precise masks predicted by ODISE, while the ground-truth flow is derived from the real video using the RAFT optical flow algorithm [65], which may neglect flow in regions of less movement. Please check out all the video comparison results on the project website.
Quantitative Comparison. We further compare the quality of generated cinemagraphs against ground-truth videos with the FVD_16 and FVD_60 metrics [68]. From Table 1, we see that our method achieves significantly lower FVD scores than the baselines, meaning that our generated cinemagraphs more closely match the data distribution and fidelity of ground-truth videos. Our user study, in Table 2 (row 1), further suggests that users prefer the visual quality of our results by a large margin compared to the baselines.

Artistic Domain Results
Qualitative Comparison. Figure 5b shows that our predicted flows are cleaner and focus on the desired regions compared to the baselines, which predict inaccurate flows and introduce more artifacts. CogVideo [29] struggles to capture all the details in the text prompt. Similarly, VideoCrafter [24] fails to embody the style mentioned in the text prompts, as it has not been trained with these artistic captions. Text2Video-Zero [35] fails to preserve temporal consistency across frames, though it can capture the details with the pretrained Stable Diffusion. Additionally, all these text-to-video methods generate fewer frames, and animating them by postprocessing introduces cross-fade artifacts. Please see more video results of our method on the project website.
Quantitative Comparison. Due to the lack of ground-truth artistic cinemagraphs, we compare our method with the baselines through a user study. Most users prefer our results to the baselines; see Table 2 (row 2).

Ablation Study
Text and Mask Conditioning. We evaluate the design choices of our flow prediction method and compare them against the following variants: [...]
Twin Image Synthesis. Here, we study the role of twin image synthesis. We compare our full method, which predicts optical flow on the realistic image, to Ours (w/o twin), which directly predicts optical flow using the artistic image. We use the same mask in both cases (generated using the method described in Section 3.2). Figure 6b shows that our predicted flow (using a realistic image) is significantly smoother and more consistent. In our user study, 64.59 ± 9.86% of users prefer our full method to Ours (w/o twin), which suggests the crucial role of twin image synthesis.

Mask Generation
We evaluate the design choices in our mask generation method against the following alternatives: (1) using K-Means [42] clustering instead of spectral clustering [46], and (2) using the self-attention map from a single sampling timestep instead of the average of self-attention maps across timesteps. Figure 8 highlights that masks generated by our method have better overlap with ODISE [72] masks and are restricted to regions of motion (like water) compared to both (1) and (2). For quantitative evaluation, we manually annotated the masks for 10 randomly selected artistic images and generated a mask using all three methods. Our method achieves a better average IoU (0.84) than (1) (0.81) and (2) (0.79).
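The average IoU reported above can be computed as in the following minimal sketch. It assumes binary masks of equal shape stored as NumPy arrays; the helper names `mask_iou` and `mean_iou` are illustrative and not taken from our implementation.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-Union between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty: treat as perfect agreement (a convention, not from the paper).
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets):
    """Average IoU over paired lists of predicted and annotated masks."""
    return float(np.mean([mask_iou(p, t) for p, t in zip(preds, targets)]))
```

Here `targets` would hold the manually annotated masks and `preds` the masks produced by one of the three compared methods.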

Extensions
Text-Guided Direction Control. We can also control the motion based on text directions. Users can indicate the direction with the template phrase "in ... direction" following each object. For example, in Figure 9, a user can add "in left to right, downwards direction" after "river". We divide the 360° of possible directions into 12 quadrants and associate each with a direction phrase. We then randomly sample an angle, θ, from the corresponding quadrant. This angle, combined with the binary mask M, generates a flow hint map where the y component is −sin θ and the x component is cos θ. Inspired by [44], we extend our model to accept these hints, in addition to the realistic image x, mask M, and text prompt c, to predict the flow. Figure 9 shows two distinct examples where we can synthesize the same scene with different motion directions. We attempted end-to-end training to directly condition optical flow on full sentences containing direction phrases. However, this approach was unsuccessful, likely due to the limited scale of the dataset.
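The construction of the flow hint map can be sketched as follows. This is an illustration under stated assumptions rather than our exact implementation: the sector indexing (sector 0 starting at 0°, angles increasing counter-clockwise), uniform sampling within a sector, and a dense hint over the whole mask are all assumed, and the function names are hypothetical.

```python
import numpy as np

NUM_SECTORS = 12  # 360 degrees split into 12 equal sectors of 30 degrees each

def sample_angle(sector, rng):
    """Sample an angle in radians uniformly inside the given sector index (assumed 0-based)."""
    width = 2.0 * np.pi / NUM_SECTORS
    return rng.uniform(sector * width, (sector + 1) * width)

def flow_hint_map(mask, theta):
    """Return an (H, W, 2) hint map holding (cos θ, −sin θ) inside the mask and 0 outside.

    The y component is negated, presumably because image rows grow downward,
    so θ = 90° corresponds to upward motion.
    """
    mask = np.asarray(mask, dtype=np.float32)
    direction = np.array([np.cos(theta), -np.sin(theta)], dtype=np.float32)
    return mask[..., None] * direction  # broadcast the direction over masked pixels
```

A typical call would be `flow_hint_map(M, sample_angle(k, np.random.default_rng()))`, where `k` is the sector matched to the user's direction phrase.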
Real Painting Cinemagraphs. Our method can also create cinemagraphs for real paintings drawn by artists, like Ivan Aivazovsky's 'The Ninth Wave'. We convert the painting into a natural-looking version using a diffusion-based image editing method, Plug-and-Play [67]. Specifically, we perform DDIM inversion [64] for 1000 steps to achieve high-quality reconstruction, followed by sampling and saving the intermediate features of the Stable Diffusion model. The rest of the steps are similar to our core algorithm. Figure 7 shows two cinemagraphs created from historic paintings.

Limitations and Discussion
In summary, we have introduced the problem of creating cinemagraphs from a text description. We have presented the first fully automatic method that works well for both natural and artistic scenes. Our method can generate cinemagraphs not only for text descriptions of real-life structures in different artistic styles but also for entirely imaginative scenes, covering different types of fluid elements like clouds, water, and smoke. Our key concept, twin image synthesis, holds potential for other image synthesis tasks. By leveraging the concept of twin images, we separate the visual recognition process, which can be performed in an easier domain with abundant datasets and pre-trained models, from the more challenging visual synthesis process.
Limitations. Our method bears several limitations. First, the artistic and realistic images generated by Stable Diffusion may not always correspond to the input text prompt, or may contain few or no dynamic regions. As shown in Figure 10a, the generated image does not contain the "waterfall" region mentioned in the text prompt. Second, our method occasionally alters the artistic image's structure even though it shares the same self-attention maps with the realistic image. For example, as seen in Figure 10b, the sky region in the artistic image has inaccurately transformed into a wave. Using more advanced content-preserving image editing methods might be an option to alleviate the issue.
Third, our pre-trained segmentation model, like ODISE, can struggle with complex natural images. In Figure 10c, ODISE has difficulty separating small, isolated rocks from the river. Similarly, the optical flow network may introduce errors for natural images with unusual compositions and layouts, such as a waterfall appearing in the sky.
Fourth, for significant changes in the flow direction, like repeated zig-zag movement of water (Figure 10d), the optical flow model may fail to predict a plausible flow. Controllable image animation models [15] could potentially mitigate the issue, at the cost of user annotation of flow directions and object masks.
Finally, while our method is effective for scenes with repetitive textures, such as water, it tends to generate monotonic movements due to symmetric splatting and the constant optical flow assumption. Additionally, our results sometimes contain greyish artifact regions because the video generator is unable to fill the holes created during symmetric splatting.

Appendix
Please refer to our project website for more video results and comparisons. In Section A, we begin by describing additional implementation details of our method. Section B describes why data augmentation to train G_flow on artistic style data is not an effective approach. In Section C, we describe the details of our user study experiment. In Section D, we describe why we need to use different masks at train and test times for optical flow prediction. Section E mentions some of the limitations of our current text-guided direction control approach. Section F describes additional limitations in the quality of cinemagraphs obtained from the baseline methods. Finally, in Section G, we describe why we use self-attention maps rather than cross-attention maps for generating the binary mask.

A. Additional Implementation Details
Mask Generation. For generating the average self-attention map A, we only use the self-attention maps after the 25th step of the denoising process, given that we perform DDIM sampling for 50 steps. This is because the self-attention maps generated during the early phase of the denoising process are very noisy, and using them dilutes the semantic information that matters for clustering. In our implementation, we use maps of resolution 32 × 32, following Patashnik et al. [50]. Lower-resolution maps would generate very coarse clusterings and hence very coarse binary masks, which, when upsampled to the image resolution of 512 × 512, would be extremely coarse. Ideally, we would have preferred to use self-attention maps of the highest resolution possible, i.e., 64 × 64, but we observe that they do not contain very useful information for clustering. For clustering, we use 10 clusters, which works well in most cases. For text prompts that generate very fine animatable regions (like thin waterfalls), a user might need to use a higher number of clusters. To select which clusters to include in the mask, we require ≥ 70% pixel-level overlap for most scenes (like river, ocean, sea, and clouds), and ≥ 90% pixel-level overlap for waterfall scenes, which contain very fine structures.
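The averaging and cluster selection steps above can be sketched as follows. This is a minimal illustration under several assumptions: the per-step attention maps are stacked in an array indexed from step 0, the cluster labels (e.g., from spectral clustering) and the ODISE mask have already been brought to the same resolution, and "overlap" is taken to mean the fraction of a cluster's pixels that fall inside the ODISE mask. The function names are hypothetical.

```python
import numpy as np

def average_late_attention(attn_maps, start_step=25):
    """Average self-attention maps over the late denoising steps.

    attn_maps: array of shape (T, ...) with one map per DDIM step (T = 50 in our setting).
    Steps before `start_step` are discarded because they are noisy.
    """
    attn_maps = np.asarray(attn_maps, dtype=np.float32)
    return attn_maps[start_step:].mean(axis=0)

def select_clusters(cluster_labels, odise_mask, min_overlap=0.7):
    """Keep every cluster whose pixels mostly lie inside the ODISE mask.

    cluster_labels: (H, W) integer cluster index per pixel.
    odise_mask:     (H, W) binary segmentation mask at the same resolution.
    min_overlap:    0.7 for most scenes, 0.9 for waterfall scenes.
    """
    cluster_labels = np.asarray(cluster_labels)
    odise_mask = np.asarray(odise_mask, dtype=bool)
    final_mask = np.zeros(cluster_labels.shape, dtype=bool)
    for label in np.unique(cluster_labels):
        region = cluster_labels == label
        overlap = np.logical_and(region, odise_mask).sum() / region.sum()
        if overlap >= min_overlap:
            final_mask |= region
    return final_mask
```

Raising `min_overlap` to 0.9 makes the selection stricter, which matches the intent of keeping only clusters that sit almost entirely inside thin structures such as waterfalls.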
Training. During the independent first-stage training of both the flow prediction and frame generation networks, we use a learning rate of 2 × 10^−3 and use the TTUR method to update the learning rate [26]. For the end-to-end training, we reduce the learning rate to 1 × 10^−3 and train for an additional 50 epochs. We perform data augmentation by randomly cropping the frames and optical flow to 512 × 512 resolution and applying random horizontal flips. We normalize the ground-truth flows to [−1, 1] by dividing all the optical flows in the training data by a constant factor (64 in our case). We saw that normalizing the flows and using tanh in G_flow produces much more accurate optical flows compared to unbounded regression. We train all the networks with a batch size of 16 (both during independent and end-to-end training). During end-to-end training, we freeze the flow prediction network, G_flow, and only fine-tune the frame generation network, G_frame.
Inference. For inference, we clip the predicted optical flow values to [−1, 1] for both artistic and real domain experiments, including for the baseline methods. For the artistic domain, we also scale the predicted optical flow values by a factor of 0.5. This avoids very large optical flow values, which might result in large holes after symmetric splatting that even G_frame would not be able to inpaint. It also yields slow-moving, aesthetically pleasing motion in the generated cinemagraphs.
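The inference-time post-processing of the predicted flow amounts to a clip followed by an optional rescale, as in this small sketch. The order of the two operations (clip first, then scale) is an assumption, and `postprocess_flow` is an illustrative name.

```python
import numpy as np

def postprocess_flow(flow, artistic=False, scale=0.5):
    """Clip normalized flow to [-1, 1]; for artistic inputs, additionally damp it by `scale`."""
    flow = np.clip(np.asarray(flow, dtype=np.float32), -1.0, 1.0)
    if artistic:
        # Smaller displacements leave smaller holes after symmetric splatting.
        flow = flow * scale
    return flow
```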

B. Why not train with Artistic Data Augmentation?
In our method, we have to generate a twin realistic image that shares the same semantic structure as the artistic image, which is essential for precise flow prediction, as the optical flow network is trained on real videos. Instead of going via this route, can we just create an augmented dataset of artistic-style images and optical flow by editing the real images into various artistic styles, using diffusion-based, semantically aligned editing methods? These augmented artistic-style images would essentially have the same ground-truth flow as the real images from which they were created, as these editing methods, in most cases, preserve the semantic layout in the edit. We tried this approach, but it failed to generate accurate flows at inference for artistic images. We believe there are two main reasons: (1) the number of artistic styles can be extremely large, so generating augmentations and training with all possible augmentations would be extremely resource- and time-intensive, and at inference the text prompt might still contain an artistic style very different from the training augmentations; (2) the augmentations are limited by the structures of the real-domain images. The text prompts at inference can contain not only artistic styles but also other-worldly structures, in which case this method also fails.

C. User Study Details
For both real and artistic domain experiments, we perform a user study to evaluate the quality of the generated cinemagraphs. We believe that human evaluation is the best available metric for assessing the quality of generated videos (and also cinemagraphs). We perform human evaluations on Amazon Mechanical Turk. We performed a paired test, where the users were asked to compare two videos, one generated by our method and the other by one of the baselines or our ablation methods, based on either a single image (for the real domain) or a text prompt (for the artistic domain). Each paired comparison was annotated by 5 annotators. Therefore, we collect a total of 810 responses for the real domain (considering 162 test images) and 500 responses for the artistic domain (considering 100 unique text-seed pairs). Only annotators who had more than 10000 accepted HITs and an accuracy of more than 95% could participate in the task. Each annotator received $0.03 for one annotation (or HIT). Based on the obtained HITs, for each of our ours vs. baseline (or ours vs. ablation) tests, we bootstrap the samples 1000 times and report the mean and standard error in Table 2 in the main paper.
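The bootstrap procedure can be sketched as below. It assumes each response is encoded as 1 if the annotator preferred our method and 0 otherwise, and that the standard error is estimated as the standard deviation of the bootstrap distribution; the function name and the fixed seed are illustrative choices.

```python
import random
import statistics

def bootstrap_preference(votes, n_boot=1000, seed=0):
    """Return (mean, standard error) of the preference rate in percent.

    votes: list of 0/1 responses for one paired comparison (e.g., 810 for the real domain).
    """
    rng = random.Random(seed)
    n = len(votes)
    rates = []
    for _ in range(n_boot):
        # Resample the responses with replacement, keeping the sample size fixed.
        resample = rng.choices(votes, k=n)
        rates.append(100.0 * sum(resample) / n)
    return statistics.mean(rates), statistics.stdev(rates)
```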

D. Mask for training
As mentioned in Section 3.2 in the main paper, we need a mask in addition to the single image to train the flow prediction network G_flow. Since we train G_flow only on real domain data, one option could be to use a mask obtained from a pretrained image segmentation model, like ODISE [72], similar to how we generate masks at test time for the real domain. Instead, at training time, we use masks generated by thresholding the ground-truth average optical flow (like [44, 15]). We find that training with such masks makes G_flow learn to predict non-zero optical flow in all regions inside the mask. In contrast, we observe that for the real domain training dataset, the masks generated by ODISE often include regions that have zero ground-truth average optical flow. This might be due to the inability of the pretrained optical flow model, used to estimate optical flow in ground-truth videos, to capture regions of (relatively) little movement. Thus, using ODISE-generated masks during training would not ensure that G_flow learns to predict non-zero optical flow in all regions inside the mask. In our experiments, we saw that this leads to inferior results compared to training with masks obtained by thresholding the ground-truth average optical flow.
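A training mask of this kind can be derived from the ground-truth flow as in the following sketch. The use of the flow magnitude and the value of `threshold` are assumptions made for illustration (the threshold is not specified here), and `training_mask` is a hypothetical name.

```python
import numpy as np

def training_mask(flows, threshold):
    """Binary mask of pixels whose time-averaged flow magnitude exceeds `threshold`.

    flows: array of shape (T, H, W, 2) holding per-frame ground-truth optical flow.
    """
    avg_flow = np.asarray(flows, dtype=np.float32).mean(axis=0)  # (H, W, 2)
    magnitude = np.linalg.norm(avg_flow, axis=-1)                # (H, W)
    return magnitude > threshold
```

By construction, every pixel inside such a mask has non-zero average ground-truth flow, which is the property that ODISE-generated masks do not guarantee.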

E. Limitations of Text-Guided Direction Control
In Section 4.5 of the main paper, we show that our method of generating cinemagraphs from text prompts can be extended to controlling the direction of motion in the generated cinemagraph using text directions in the input prompt. However, our method currently only allows the user to specify a global direction. This works well for most scenarios where there is a single body that moves in roughly the same direction (like an ocean, clouds, a single waterfall, or a river). However, our method of text-guided direction control will not work in scenarios where the user wants to (1) generate two or more objects (like rivers) moving in different directions, or (2) assign different local directions of motion within the same object. We leave it as future work to investigate more fine-grained and end-to-end text-guided direction control.

F. Limitations of Baselines
The text-to-video baselines CogVideo [29], Text2Video-Zero [35], and VideoCrafter [24] natively generate videos of 32, 8, and 16 frames at resolutions of 480 × 480, 512 × 512, and 256 × 256, respectively. Although they can generate videos of 120 frames (the same number of frames as our method), we found that increasing the number of generated frames greatly reduces the temporal consistency of the output videos from these methods. Thus, we compare the videos generated by these baseline methods against our method at their respective native resolutions and frame counts.

G. Cross-Attention Maps v/s Self-Attention maps for Mask Generation
We use self-attention maps instead of cross-attention maps for generating the binary mask M. Although Prompt-to-Prompt [25] shows that the cross-attention map from the intermediate layers of Stable Diffusion for a particular token yields a mask representative of what that token corresponds to in the output image, Figure 12 shows that this is not always the case. For scenes with larger or spread-out structures, like oceans and rivers, the cross-attention map also tends to highlight neighboring regions.

Figure 3: ODISE vs. Self-Attention Mask. We aim to derive a mask, given twin images: (a) the generated artistic image and (b) its realistic counterpart. Directly applying a segmentation model (e.g., ODISE [72]) to the realistic image might introduce segmentation errors. For example, in the ODISE-generated mask on the artistic image (c), some regions in the mask overlap with hills, which can result in the movement of the hills. We use the diffusion model's self-attention maps to further refine the mask (d).
(a) Artistic Image (b) Realistic Image (c) ODISE Mask (d) Self Attention Map (e) Spectral Cluster (f) Self Attention Mask

Figure 4: Mask generation pipeline. To generate our final binary mask M, we first generate a realistic twin (b) corresponding to the artistic image (a). Using a pre-trained open-set panoptic segmentation module, ODISE [72], and a user-specified region name 'river', we generate a binary segmentation mask (c). In (d), we use PCA [16] to visualize the self-attention maps averaged across different timesteps. Finally, we apply spectral clustering [46] on the average self-attention maps (e) and select regions with high Intersection-over-Union (IoU) scores with respect to the ODISE mask (c).

Figure 5: Visual comparisons for optical flow prediction. We compare our method with two single-image animation methods, Holynski et al. [28] and SLR-SFS [15], on real videos (a) and artistic images (b). For real videos, we also show the ground truth optical flow averaged across all frames. Overall, our method predicts more plausible motions that align better with target regions. Please see our supplement for more video comparisons with more baselines.

Figure 6: Ablation Study (flow prediction). (a) The role of text and mask conditioning: we compare our flow prediction method with three variants: (1) w/o mask, (2) w/o text, and (3) w/o mask and text conditioning. Both text embedding vectors and semantic masks contribute to predicting more plausible flows. For the 1st row, our predicted flow covers more water regions than the ground truth flow. For the 2nd row, our predicted flow is physically plausible, even when it differs from the ground truth flow. (b) The role of twin image synthesis: directly predicting flows given artistic images incurs significant errors.

Figure 7: Real Painting Cinemagraphs. Examples of cinemagraph generation from two real paintings. To view this figure as a video, we recommend using Adobe Acrobat.
In the text and mask conditioning ablation, we compare against three variants: (1) w/o mask, (2) w/o text, and (3) w/o mask and text conditioning. As shown in Table 1, our full method performs the best on both FVD scores, slightly outperforming Ours (w/o text). We hypothesize that the text description contains class information (like 'waterfall'), making it easier for the model to generate plausible flow corresponding to each class category. Ours (w/o mask) and Ours (w/o mask and text) perform the worst, underscoring the mask's critical role. These findings are corroborated by the qualitative comparisons (Figure 6). In our user study, 64.11 ± 3.43%, 63.38 ± 3.32%, and 58.26 ± 3.27% of participants prefer our method to Ours (w/o text and mask), Ours (w/o mask), and Ours (w/o text), respectively.

Figure 8: Ablation Study (mask generation). Mask generation with different design choices: we compare the quality of masks generated from our method (a) with alternative design choices like (1) using K-Means [42] clustering instead of spectral clustering [46], and (2) using the self-attention map from a single sampling timestep instead of the average of self-attention maps across timesteps. Compared to both (1) and (2), our masks have better overlap with ODISE [72] masks while remaining confined to the plausible regions of motion (like water).

Figure 9: Text-Guided Direction Control. Video results of our text-guided direction control, showing two distinct examples in which the same scene is synthesized with different text-specified directions of motion. To view this figure as a video, we recommend using Adobe Acrobat. Please see our supplementary website for more video results.

Figure 10: Limitations. Our method may fail due to several reasons, including (a) incorrect text-to-image generation result, (b) inconsistency between artistic and natural images, (c) imperfect segmentation for challenging natural images, and (d) scenes with complex fluid dynamics.


Figure 11: Text-guided direction control. The wheel shows the division of all possible directions of motion into 12 quadrants and the text direction templates associated with each of the quadrants.

Figure 12: Cross Attention Maps Limitations. This figure visualizes images synthesized using Stable Diffusion (left) and their corresponding average cross-attention maps for each token in the input prompt (right). Even though cross-attention maps can generate a proper mask for the 'waterfall' region corresponding to the 'waterfall' token (bottom), for 'ocean' (top) and 'river' (middle), they also highlight the neighboring regions of 'rocky beach' and 'mountain', respectively. This reduces their effectiveness in mask generation.

Table 1: Quantitative comparisons regarding video quality. Here we compare the generated videos with ground truth videos regarding Fréchet Video Distance (FVD) on real video datasets.

… which video has more natural movement, according to what occurs in the real world, and (2) looks better visually. For artistic images, we keep (1) and (2), and add a third criterion: (3) the video describes what is written in the text accurately. We perform paired tests, where the users are asked to compare two …