We introduce Text2Cinemagraph, a fully automated method for creating cinemagraphs from text descriptions. This task is especially challenging when prompts feature imaginary elements and artistic styles, given the complexity of interpreting the semantics and motions of these images. We focus on cinemagraphs of fluid elements, such as flowing rivers and drifting clouds, which exhibit continuous motion and repetitive textures. Existing single-image animation methods fall short on artistic inputs, and recent text-based video methods frequently introduce temporal inconsistencies, struggling to keep certain regions static. To address these challenges, we propose synthesizing image twins from a single text prompt: an artistic image and its pixel-aligned, natural-looking twin. While the artistic image depicts the style and appearance detailed in our text prompt, the realistic counterpart greatly simplifies layout and motion analysis. Leveraging existing natural image and video datasets, we can accurately segment the realistic image and predict plausible motion given the semantic information. The predicted motion can then be transferred to the artistic image to create the final cinemagraph. Our method outperforms existing approaches in creating cinemagraphs for natural landscapes as well as artistic and other-worldly scenes, as validated by automated metrics and user studies. Finally, we demonstrate two extensions: animating existing paintings and controlling motion directions using text.
Given a text prompt \( \boldsymbol{c} \), we generate twin images with Stable Diffusion: an artistic image \( \boldsymbol{x} \) in the style described in the text prompt, and a realistic counterpart \( \hat{\boldsymbol{x}} \) generated from the modified prompt \( \hat{\boldsymbol{c}} \). The twin images share a similar semantic layout. We then extract a binary mask \( \boldsymbol{M} \) of the moving regions from the self-attention maps obtained during the artistic image's generation process. We use the mask and the realistic image to predict the optical flow \( \hat{\boldsymbol{F}} \) with the flow prediction model \( \boldsymbol{G_{flow}} \). Because the twin images have a very similar semantic layout, we can use the flow \( \hat{\boldsymbol{F}} \) to animate the artistic image with our video generator \( \boldsymbol{G_{frame}} \). All our experiments are based on Stable Diffusion.
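The page does not spell out how the video generator consumes the predicted flow. In Eulerian single-image animation (for example, Holynski et al. 2021), one static flow field is integrated over time to obtain the displacement of every pixel at each frame. The following is a minimal NumPy sketch of that idea under this assumption; the function name `integrate_flow`, the nearest-neighbour lookup, and the way the mask is applied are illustrative choices, not the authors' implementation.

```python
import numpy as np


def integrate_flow(flow, mask, num_frames):
    """Euler-integrate a static flow field into per-frame displacements.

    flow: (H, W, 2) array holding per-pixel (dx, dy) motion for one time step.
    mask: (H, W) boolean array, True where the region is allowed to move.
    Returns an array of shape (num_frames, H, W, 2) whose entry t holds the
    displacement of every pixel from frame 0 to frame t.
    """
    h, w, _ = flow.shape
    flow = flow * mask[..., None]          # static regions get zero motion
    ys, xs = np.mgrid[0:h, 0:w]            # pixel coordinates at frame 0
    disp = np.zeros((num_frames, h, w, 2), dtype=np.float64)
    for t in range(1, num_frames):
        prev = disp[t - 1]
        # Where each pixel currently is (nearest pixel, clamped to the image).
        px = np.clip(np.rint(xs + prev[..., 0]).astype(int), 0, w - 1)
        py = np.clip(np.rint(ys + prev[..., 1]).astype(int), 0, h - 1)
        # Advance by the flow sampled at the current position.
        disp[t] = prev + flow[py, px]
    return disp
```

Because the flow is zeroed outside the mask, pixels in static regions keep a zero displacement at every frame, which is the property a cinemagraph needs.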
Below we show a grid of pairs of input text prompts (top) and their corresponding generated cinemagraphs (bottom). To view the results in full screen, please refer to our Gallery (Our Results) page.
"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"
"a photo of a waterfall falling from a hill covered in snow during winter"
"a small wooden house, with a boat floating on lake, in the style of Camille Pissarro"
"pirate ships in turbulant ocean, ancient photo, brown tint"
"ultra realistic illustration, a lighthouse by night being attacked by a kraken, the tentacles are wrapped ..."
"world war scene, fighter planes in the sky, clouds, cinematic ancient photo, brown tint"
"Old Victorian architecture in a Victorian valley, dramatic sky, cloudy sky, digital art, 4k, 8k, trending on ArtStation"
"always the sun, beautiful strange detailed summer seascape painting 8k resolution deviantart trending on Artstation concept art digital illustration"
"a large waterfall falling from hills during sunset in the style of Leonid Afremov"
"a large river flowing in front of a mountain in the style of starry nights painting"
"a watercolor painting of a waterfall, high definition, 4k"
"oil painting of a waterfall falling from the hills in the essence of renaissance style, no humans"
"super detailed color lowpoly art, northern sunset on a lake, monochrome photorealistic, lake, unreal engine, high contrast color palette, 3 d render, lowpoly, colorful, digital art, perspective, robb cobb"
"a surreal and dreamlike waterfall scene, using imaginative colors and shapes to create a fantastical image"
We show that our method can animate existing paintings created by renowned artists. To view the results in full screen, please refer to our Gallery (Real Painting Cinemagraphs) page.
Minnehaha Falls (undated), oil on canvas, by Albert Bierstadt
The Ninth Wave (1850) by Ivan Aivazovsky; Russian Museum, Public domain, via Wikimedia Commons
We show results of text-guided direction control for cinemagraph generation: the movement direction in the cinemagraph can be manipulated through the text prompt. To view the results in full screen, please refer to our Gallery (Text Guided Direction Control) page.
a large river flowing in left to right, downwards direction in front of a mountain in the style of starry nights painting
a large river flowing in upwards, right to left direction in front of a mountain in the style of starry nights painting
Dead river, flowing in right to left, downwards direction, red color, highly detailed, 8 k, artstation, beutifull, masterpiece.
Dead river, flowing in downwards, left to right direction, red color, highly detailed, 8 k, artstation, beutifull, masterpiece.
always the sun, beautiful strange detailed summer seascape moving in right to left direction painting 8k resolution deviantart trending on Artstation concept art digital illustration.
always the sun, beautiful strange detailed summer seascape moving in left to right, upwards direction painting 8k resolution deviantart trending on Artstation concept art digital illustration.
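To make the direction phrases in these prompts concrete, the sketch below maps them to a unit vector in image coordinates (x grows to the right, y grows downwards). It is purely illustrative: the phrase table and the helper name `direction_from_prompt` are hypothetical, and the page does not describe how the method itself interprets direction words.

```python
import math

# Unit steps in image coordinates: x grows rightwards, y grows downwards.
# The phrases are the ones that appear in the example prompts above.
_DIRECTION_STEPS = {
    "left to right": (1.0, 0.0),
    "right to left": (-1.0, 0.0),
    "downwards": (0.0, 1.0),
    "upwards": (0.0, -1.0),
}


def direction_from_prompt(prompt):
    """Return a unit (dx, dy) vector for the direction phrases in a prompt.

    Returns None when the prompt names no direction, or when the named
    directions cancel out.
    """
    text = prompt.lower()
    dx = dy = 0.0
    for phrase, (step_x, step_y) in _DIRECTION_STEPS.items():
        if phrase in text:
            dx += step_x
            dy += step_y
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return None
    return (dx / norm, dy / norm)
```

For instance, "left to right, downwards" resolves to a diagonal pointing towards the bottom-right of the image, while "right to left" alone resolves to a purely horizontal vector.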
Given the same input image (leftmost column), we compare our method with existing single-image animation methods on a real video dataset [Holynski et al. 2021]. To view the results in full screen, please refer to our Gallery (Baseline Comparison - Real Domain) page.
Columns, left to right (six examples): Single Image, Ours, Animating Landscape, Holynski et al., SLR-SFS
Given the input text prompt (below each row of results), we compare our method with both single-image animation methods and text-to-video models in the artistic domain. For single-image animation methods, we use the same static image as our method, generated by Stable Diffusion. As the videos produced by these baselines are non-looping, we apply a postprocessing method [Endo et al. 2019] to make the generated videos loop. To view the results in full screen, please refer to our Gallery (Baseline Comparison - Artistic Domain) page.
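The page does not detail the looping postprocess. One common way to turn a non-looping clip into a seamless loop is to cross-fade its last few frames into its first few; the sketch below illustrates that generic technique and should not be read as the procedure of Endo et al. 2019. The function name `crossfade_loop` and the `overlap` parameter are assumptions made for this example.

```python
import numpy as np


def crossfade_loop(frames, overlap):
    """Make a clip loop by cross-fading its tail into its head.

    frames: (T, H, W, C) array of video frames.
    overlap: number of frames to blend; must satisfy 0 < overlap <= T // 2.
    Returns a (T - overlap, H, W, C) float array that can be played on repeat.
    """
    frames = np.asarray(frames, dtype=np.float64)
    total = frames.shape[0]
    if not 0 < overlap <= total // 2:
        raise ValueError("overlap must be in (0, T // 2]")
    keep = total - overlap
    out = frames[:keep].copy()
    tail = frames[keep:]                       # the frames that get folded back
    # Weight of the original head frame rises from 0 towards 1 over the overlap.
    alpha = (np.arange(overlap) / overlap).reshape(-1, 1, 1, 1)
    out[:overlap] = (1.0 - alpha) * tail + alpha * frames[:overlap]
    return out
```

The first output frame equals the frame that originally followed the last kept frame, so playback wraps around without a visible jump, and the blend then fades back into the untouched middle of the clip.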
Columns, left to right: Ours, Animating Landscape, Eulerian, SLR-SFS, CogVideo, Text2VideoZero, VideoCrafter
Prompts, one per row of results:
"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"
"a photo of a waterfall falling from a hill covered in snow during winter"
"a small wooden house, with a boat floating on lake, in the style of Camille Pissarro"
"world war scene, fighter planes in the sky, clouds, cinematic ancient photo, brown tint"
"always the sun, beautiful strange detailed summer seascape painting 8k resolution deviantart trending on Artstation concept art digital illustration"
We compare optical flow prediction and cinemagraph generation without mask and/or text conditioning against our full method. The results highlight the importance of both mask and text conditioning for accurate flow prediction and, in turn, plausible cinemagraph generation. To view the results in full screen, please refer to our Gallery (Ablation - Role of Mask and Text) page.
Video panels (four examples): Single Image, Ours (w/o text and mask), Ours (w/o mask), Ours (w/o text), Ours
Optical flow panels (four examples): Ground-Truth Flow, Ours (w/o text and mask), Ours (w/o mask), Ours (w/o text), Ours
We study the role of twin image synthesis. We compare our full method, which predicts optical flow on the realistic image, to Ours (w/o twin), which predicts optical flow directly from the artistic image. The results show that the flow predicted from the realistic image is significantly smoother and more consistent. To view the results in full screen, please refer to our Gallery (Ablation - Importance of Twin Image Generation) page.
Panels per example: Ours (w/o twin) [Optical Flow], Ours (w/o twin) [Video], Ours [Optical Flow], Ours [Video]
Prompts:
"a large waterfall falling between hills in the style Van Gogh painting during sunset, 4k"
"oil painting of a waterfall falling from the hills in the essence of renaissance style, no humans"
"world war scene, fighter planes in the sky, clouds, cinematic ancient photo, brown tint"
@inproceedings{mahapatra2023synthesizing,
title={Text-Guided Synthesis of Eulerian Cinemagraphs},
author={Mahapatra, Aniruddha and Siarohin, Aliaksandr and Lee, Hsin-Ying and Tulyakov, Sergey and Zhu, Jun-Yan},
journal={arXiv preprint arXiv:2307.03190},
year = {2023},
}
We are also grateful to Nupur Kumari, Gaurav Parmar, Or Patashnik, Songwei Ge, Sheng-Yu Wang, Chonghyuk (Andrew) Song,
Daohan (Fred) Lu, Richard Zhang, and Phillip Isola for fruitful discussions. This work is partly supported by Snap Inc.
and was partly done while Aniruddha was an intern at Snap Inc.
The website template is taken from Custom Diffusion (which was built on DreamFusion's project page). The text editor used in the demo video is taken from Rich Text-to-Image.