Text-Guided Synthesis of Eulerian Cinemagraphs

1 CMU    2 Snap Research

ACM Transactions on Graphics (TOG), SIGGRAPH Asia 2023



Abstract

We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions. This task is especially challenging when prompts feature imaginary elements and artistic styles, because the semantics and motion of such images are hard to interpret. We focus on cinemagraphs of fluid elements, such as flowing rivers and drifting clouds, which exhibit continuous motion and repetitive textures. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies and struggle to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: an artistic image and its pixel-aligned, natural-looking twin. While the artistic image depicts the style and appearance detailed in the text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.


Method Pipeline

Given a text prompt \( \boldsymbol{c} \), we generate twin images with Stable Diffusion: an artistic image \( \boldsymbol{x} \) in the style described in the text prompt, and a realistic counterpart \( \hat{\boldsymbol{x}} \) generated from the modified prompt \( \hat{\boldsymbol{c}} \). The twin images share a similar semantic layout. We then extract a binary mask \( \boldsymbol{M} \) of the moving regions from the self-attention maps obtained while generating the artistic image. We use the mask and the realistic image to predict the optical flow \( \hat{\boldsymbol{F}} \) with the flow prediction model \( \boldsymbol{G_{flow}} \). Because the twin images have a very similar semantic layout, we can use the flow \( \hat{\boldsymbol{F}} \) to animate the artistic image with our video generator \( \boldsymbol{G_{frame}} \). All our experiments are based on Stable Diffusion.
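As a rough illustration of the last two steps, the sketch below restricts a per-pixel flow field to the masked region and integrates it over time with Euler steps, which is the usual way a time-invariant (Eulerian) flow field is turned into per-frame displacements. It is a minimal pure-Python sketch under stated assumptions, not the actual implementation: the helper names (`mask_flow`, `sample`, `integrate`), the nearest-neighbour sampling, and the list-of-lists flow layout are all chosen for illustration, and the learned models \( \boldsymbol{G_{flow}} \) and \( \boldsymbol{G_{frame}} \) are not reproduced.

```python
from typing import List, Tuple

# Assumed layout: flow[y][x] = (dx, dy), in pixels per frame.
Flow = List[List[Tuple[float, float]]]


def mask_flow(flow: Flow, mask: List[List[int]]) -> Flow:
    """Zero the flow outside the moving region (mask value 1 means 'moving')."""
    return [
        [flow[y][x] if mask[y][x] else (0.0, 0.0) for x in range(len(flow[0]))]
        for y in range(len(flow))
    ]


def sample(flow: Flow, x: float, y: float) -> Tuple[float, float]:
    """Nearest-neighbour lookup, clamped to the image border."""
    h, w = len(flow), len(flow[0])
    xi = min(max(int(round(x)), 0), w - 1)
    yi = min(max(int(round(y)), 0), h - 1)
    return flow[yi][xi]


def integrate(flow: Flow, steps: int) -> Flow:
    """Euler integration: D_t(p) = D_{t-1}(p) + flow(p + D_{t-1}(p)), with D_0 = 0."""
    h, w = len(flow), len(flow[0])
    disp = [[(0.0, 0.0) for _ in range(w)] for _ in range(h)]
    for _ in range(steps):
        nxt = []
        for y in range(h):
            row = []
            for x in range(w):
                dx, dy = disp[y][x]
                # Look up the flow at the pixel's current (displaced) position.
                fx, fy = sample(flow, x + dx, y + dy)
                row.append((dx + fx, dy + fy))
            nxt.append(row)
        disp = nxt
    return disp
```

For a 1x4 image whose two left pixels move one pixel to the right per frame, `integrate(mask_flow(flow, mask), 3)` displaces the leftmost pixel by two pixels in total: it stops accumulating motion once it reaches the static region.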


Our Results

Below we show a grid of input text prompts (top) paired with their corresponding generated cinemagraphs (bottom). To view the results in full screen, see our Gallery (Our Results) page.

"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"

"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"

"a photo of a waterfall falling from a hill covered in snow during winter"


"a small wooden house, with a boat floating on lake, in the style of Camille Pissarro"

"pirate ships in turbulant ocean, ancient photo, brown tint"

"ultra realistic illustration, a lighthouse by night being attacked by a kraken, the tentacles are wrapped ..."


"world war scene, fighter planes in the sky, clouds, cinematic ancient photo, brown tint"

"Old Victorian architecture in a Victorian valley, dramatic sky, cloudy sky, digital art, 4k, 8k, trending on ArtStation"

"always the sun, beautiful strange detailed summer seascape painting 8k resolution deviantart trending on Artstation concept art digital illustration"


"a large waterfall falling from hills during sunset in the style of Leonid Afremov"

"a large river flowing in front of a mountain in the style of starry nights painting"

"a watercolor painting of a waterfall, high definition, 4k"


"oil painting of a waterfall falling from the hills in the essence of renaissance style, no humans"

"pirate ships in turbulant ocean, ancient photo, brown tint"

"always the sun, beautiful strange detailed summer seascape painting 8k resolution deviantart trending on Artstation concept art digital illustration"


"super detailed color lowpoly art, northern sunset on a lake, monochrome photorealistic, lake, unreal engine, high contrast color palette, 3 d render, lowpoly, colorful, digital art, perspective, robb cobb"

"super detailed color lowpoly art, northern sunset on a lake, monochrome photorealistic, lake, unreal engine, high contrast color palette, 3 d render, lowpoly, colorful, digital art, perspective, robb cobb"

"a surreal and dreamlike waterfall scene, using imaginative colors and shapes to create a fantastical image"



Animating Real Paintings

We show that our method can animate existing paintings created by renowned artists. To view the results in full screen, see our Gallery (Real Painting Cinemagraphs) page.

Minnehaha Falls (undated), oil on canvas, by Albert Bierstadt


The Ninth Wave (1850) by Ivan Aivazovsky; Russian Museum, Public domain, via Wikimedia Commons



Text Guided Direction Control

We show results of text-guided direction control for cinemagraph generation: the direction of motion in the cinemagraph follows the direction phrases in the text prompt. To view the results in full screen, see our Gallery (Text Guided Direction Control) page.

a large river flowing in left to right, downwards direction in front of a mountain in the style of starry nights painting

a large river flowing in upwards, right to left direction in front of a mountain in the style of starry nights painting


Dead river, flowing in right to left, downwards direction, red color, highly detailed, 8 k, artstation, beutifull, masterpiece.

Dead river, flowing in downwards, left to right direction, red color, highly detailed, 8 k, artstation, beutifull, masterpiece.


always the sun, beautiful strange detailed summer seascape moving in right to left direction painting 8k resolution deviantart trending on Artstation concept art digital illustration.

always the sun, beautiful strange detailed summer seascape moving in left to right, upwards direction painting 8k resolution deviantart trending on Artstation concept art digital illustration.
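
The prompts above encode direction with a small vocabulary of phrases ("left to right", "right to left", "upwards", "downwards"). How the method conditions flow prediction on these phrases is not described on this page. Purely as an illustration of what the phrases denote, the hypothetical helper below maps a prompt to a unit direction vector in image coordinates (x to the right, y downwards). The phrase table and the function name `prompt_direction` are assumptions, not part of the method.

```python
import math
from typing import Tuple

# Hypothetical phrase table. Image coordinates: x grows to the right, y grows downwards.
DIRECTION_PHRASES = {
    "left to right": (1.0, 0.0),
    "right to left": (-1.0, 0.0),
    "upwards": (0.0, -1.0),
    "downwards": (0.0, 1.0),
}


def prompt_direction(prompt: str) -> Tuple[float, float]:
    """Sum the vectors of all direction phrases found in the prompt and normalize."""
    text = prompt.lower()
    dx = dy = 0.0
    for phrase, (px, py) in DIRECTION_PHRASES.items():
        if phrase in text:
            dx += px
            dy += py
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        # No direction phrase found (or the phrases cancel out).
        return (0.0, 0.0)
    return (dx / norm, dy / norm)
```

With this table, "flowing in left to right, downwards direction" maps to a diagonal vector pointing right and down, while a prompt without any direction phrase maps to the zero vector.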



Comparison to Baseline Methods (Real Domain)

Given the same input image (leftmost column), we compare our method with existing single-image animation methods on a real video dataset [Holynski et al. 2021]. To view the results in full screen, see our Gallery (Baseline Comparison - Real Domain) page.

Six example rows. Columns, left to right: Single Image, Ours, Animating Landscape, Holynski et al., SLR-SFS.



Comparison to Baseline Methods (Artistic Domain)

Given the input text prompt (below each row of results), we compare our method with both single-image animation methods and text-to-video models in the artistic domain. For single-image animation methods, we use the same static image as our method, generated by Stable Diffusion. As the videos produced by these baselines are non-looping, we apply a postprocessing method [Endo et al. 2019] to make them loop. To view the results in full screen, see our Gallery (Baseline Comparison - Artistic Domain) page.
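
The exact looping procedure of [Endo et al. 2019] is not described on this page. As a generic stand-in, the sketch below shows one common way to turn a non-looping clip into a loop: cross-fading the last few frames into the first few. The function name `make_loop`, the flattened-frame representation, and the linear blend weights are assumptions made for illustration.

```python
from typing import List

Frame = List[float]  # a frame flattened to a list of pixel intensities (assumed layout)


def make_loop(frames: List[Frame], overlap: int) -> List[Frame]:
    """Cross-fade the last `overlap` frames into the first `overlap` frames.

    The result has len(frames) - overlap frames. Its first frame equals
    frames[len(frames) - overlap], so it directly continues the last output
    frame, frames[len(frames) - overlap - 1], when playback wraps around.
    """
    n = len(frames)
    if overlap < 1 or 2 * overlap > n:
        raise ValueError("overlap must be between 1 and len(frames) // 2")
    loop = []
    for i in range(n - overlap):
        if i < overlap:
            alpha = i / overlap  # 0 at the seam, approaching 1 towards the clip body
            tail = frames[n - overlap + i]
            head = frames[i]
            loop.append([(1.0 - alpha) * t + alpha * h for t, h in zip(tail, head)])
        else:
            loop.append(list(frames[i]))
    return loop
```

Cross-fading hides the seam but can cause ghosting where the blended frames differ strongly, so it is best read as a baseline postprocess rather than a substitute for a method that produces looping motion by construction.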

Five example rows. Columns, left to right: Ours, Animating Landscape, Eulerian, SLR-SFS, CogVideo, Text2VideoZero, VideoCrafter. The rows use the following prompts, in order:

"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"

"a photo of a waterfall falling from a hill covered in snow during winter"

"a small wooden house, with a boat floating on lake, in the style of Camille Pissarro"

"world war scene, fighter planes in the sky, clouds, cinematic ancient photo, brown tint"

"always the sun, beautiful strange detailed summer seascape painting 8k resolution deviantart trending on Artstation concept art digital illustration"



Ablation - Role of Mask and Text for Optical Flow Prediction

We compare optical flow prediction and cinemagraph generation without mask and/or text conditioning against our full method. The results highlight the importance of both for accurate flow prediction and, in turn, plausible cinemagraph generation. To view the results in full screen, see our Gallery (Ablation - Role of Mask and Text) page.

Four examples, each with two rows. Video row, left to right: Single Image, Ours (w/o text and mask), Ours (w/o mask), Ours (w/o text), Ours. Optical flow row, left to right: Ground-Truth Flow, Ours (w/o text and mask), Ours (w/o mask), Ours (w/o text), Ours.
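
One common way to quantify how close a predicted flow is to the ground-truth flow is the average endpoint error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below is a minimal pure-Python version, optionally restricted to the moving region. The function name `endpoint_error` and the list-of-lists layout are assumptions, and this page does not state which metric the paper reports.

```python
import math
from typing import List, Optional, Tuple

# Assumed layout: flow[y][x] = (dx, dy), in pixels per frame.
Flow = List[List[Tuple[float, float]]]


def endpoint_error(pred: Flow, gt: Flow, mask: Optional[List[List[int]]] = None) -> float:
    """Mean Euclidean distance between flow vectors, over masked pixels if a mask is given."""
    total, count = 0.0, 0
    for y, (pred_row, gt_row) in enumerate(zip(pred, gt)):
        for x, ((px, py), (gx, gy)) in enumerate(zip(pred_row, gt_row)):
            if mask is not None and not mask[y][x]:
                continue  # skip pixels outside the moving region
            total += math.hypot(px - gx, py - gy)
            count += 1
    if count == 0:
        raise ValueError("no pixels selected")
    return total / count
```

Restricting the average to the mask avoids rewarding a model for the large static area, where predicting zero flow is trivially correct.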



Ablation - Importance of Twin Image Generation

We study the role of twin image synthesis. We compare our full method, which predicts optical flow on the realistic image, to Ours (w/o twin), which predicts optical flow directly from the artistic image. The results show that the flow predicted by our full method (using the realistic image) is significantly smoother and more consistent. To view the results in full screen, see our Gallery (Ablation - Importance of Twin Image Generation) page.

Five example rows. Columns, left to right: Ours (w/o twin) [Optical Flow], Ours (w/o twin) [Video], Ours [Optical Flow], Ours [Video]. The rows use the following prompts, in order:

"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"

"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"

"oil painting of a waterfall falling from the hills in the essence of renaissance style, no humans"

"oil painting of a waterfall falling from the hills in the essence of renaissance style, no humans"

"world war scene, fighter planes in the sky, clouds, cinematic ancient photo, brown tint"



Citation

@inproceedings{mahapatra2023synthesizing,
  title={Text-Guided Synthesis of Eulerian Cinemagraphs},
  author={Mahapatra, Aniruddha and Siarohin, Aliaksandr and Lee, Hsin-Ying and Tulyakov, Sergey and Zhu, Jun-Yan},
  journal={arXiv preprint arXiv:2307.03190},
  year={2023},
}

Acknowledgements

We are grateful to Nupur Kumari, Gaurav Parmar, Or Patashnik, Songwei Ge, Sheng-Yu Wang, Chonghyuk (Andrew) Song, Daohan (Fred) Lu, Richard Zhang, and Phillip Isola for fruitful discussions. This work is partly supported by Snap Inc. and was partly done while Aniruddha was an intern at Snap Inc.
The website template is taken from Custom Diffusion (which was built on DreamFusion's project page). The text editor used in the demo video is taken from Rich Text-to-Image.