Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo

Institute for Intelligent Computing, Alibaba Tongyi Lab

By constructing a 3D-aware motion representation according to various user intentions and taking its perception results as motion control signals, the proposed fine-grained motion-controllable image animation framework can be applied to a wide range of motion-related video synthesis tasks.

Abstract


Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via shared 2D motion representations or separate control signals, yet they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce a 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct the 3D-aware motion representation from a reference image, manipulate it based on interpreted user intentions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive, consistent visual changes. The proposed framework then leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed method.

Method


In our data curation pipeline, an in-the-wild video is taken as input, and off-the-shelf visual odometry and 3D point tracking algorithms are applied to obtain the camera pose sequence and 3D point positions. From these results, the 3D-aware motion representation is constructed and used to render camera and object control signals. During training, the control signals are fed into separate encoders to avoid RGB-level interference and are merged afterward, and a three-stage training strategy is adopted to achieve fine-grained collaborative motion control. During inference, user intentions are first interpreted as the 3D-aware motion representation, which enables our framework to support a variety of motion-related applications.
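
As a concrete illustration, the sketch below shows how perception results could be rendered from such a representation: per-frame camera poses (as from visual odometry) and tracked 3D point positions are combined through a standard pinhole projection to yield per-frame 2D control points. All names, shapes, and conventions here are our own simplifications, not the paper's actual implementation.

import numpy as np

def perceive_points(points_3d, extrinsics, K):
    """points_3d: (T, N, 3) world-space point positions per frame.
    extrinsics: (T, 4, 4) world-to-camera transforms per frame.
    K: (3, 3) pinhole intrinsics.
    Returns (T, N, 2) pixel coordinates, one control point per frame."""
    T, N, _ = points_3d.shape
    homo = np.concatenate([points_3d, np.ones((T, N, 1))], axis=-1)  # (T, N, 4)
    cam = np.einsum('tij,tnj->tni', extrinsics, homo)[..., :3]       # to camera space
    pix = np.einsum('ij,tnj->tni', K, cam)                           # apply intrinsics
    return pix[..., :2] / pix[..., 2:3]                              # perspective divide

# Example: a static 3D point observed by a camera dollying forward.
K = np.array([[500., 0., 256.], [0., 500., 256.], [0., 0., 1.]])
T = 8
points = np.tile(np.array([[[0.2, 0.1, 3.0]]]), (T, 1, 1))           # (T, 1, 3)
extrinsics = np.tile(np.eye(4), (T, 1, 1))
extrinsics[:, 2, 3] = -np.linspace(0.0, 1.0, T)                      # camera moves toward the point
print(perceive_points(points, extrinsics, K)[:, 0])                  # point drifts away from the principal point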

Qualitative Results


Camera-only Motion Control

We select both basic and arbitrary camera motions and visualize them in 3D.
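
For reference, here is a hedged sketch of how such basic camera movements can be encoded as pose sequences; the parameterization (pan as a yaw rotation, dolly as a translation along the optical axis) is an illustrative convention, not necessarily the one used in the paper.

import numpy as np

def pan(num_frames, max_deg=20.0):
    """Yaw the camera about its vertical axis."""
    poses = []
    for a in np.deg2rad(np.linspace(0.0, max_deg, num_frames)):
        pose = np.eye(4)
        pose[:3, :3] = [[np.cos(a), 0.0, np.sin(a)],
                        [0.0,       1.0, 0.0      ],
                        [-np.sin(a), 0.0, np.cos(a)]]
        poses.append(pose)
    return np.stack(poses)                       # (num_frames, 4, 4)

def dolly(num_frames, depth=1.0):
    """Translate the camera forward along its optical axis."""
    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    poses[:, 2, 3] = -np.linspace(0.0, depth, num_frames)
    return poses

trajectory = pan(16)                             # usable as the camera control input
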
Basic Camera Movements

[Videos: Camera Movements | Reference Image (three examples)]

Arbitrary Camera Movements

[Videos: Camera Movements | Reference Image]

Object-only Motion Control

Since there is no camera motion, we project 3D movements of unit spheres onto the pixel plane and visualize how they change over time, using colors to represent the direction of movement.
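
The direction-to-color mapping can be implemented with the optical-flow-style HSV convention sketched below, where hue encodes direction and brightness encodes magnitude; the exact palette used on this page is an assumption.

import colorsys
import numpy as np

def direction_to_rgb(displacements):
    """displacements: (N, 2) 2D movement vectors -> (N, 3) RGB colors in [0, 1]."""
    angles = np.arctan2(displacements[:, 1], displacements[:, 0])    # [-pi, pi]
    hues = (angles + np.pi) / (2.0 * np.pi)                          # direction -> hue
    mags = np.linalg.norm(displacements, axis=1)
    vals = mags / (mags.max() + 1e-8)                                # magnitude -> brightness
    return np.array([colorsys.hsv_to_rgb(h, 1.0, v) for h, v in zip(hues, vals)])

print(direction_to_rgb(np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])))
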
Multi-instance Object Control

[Videos: Object Movements | Reference Image]

Fine-grained Object Control

[Videos: Object Movements | Reference Image]

Collaborative Motion Control

We control both camera and object motions, and visualize the perception process of 3D-aware motion representation.

[Videos: Camera & Object Movements | Reference Image (two examples)]

Potential Applications


Motion Generation

Draw 2D trajectories on the reference image and create an animation based on them. We visualize how 2D trajectories change over time, using colors to represent the direction of movement.
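
Below is a minimal sketch of one way to convert a drawn stroke into per-frame control points, assuming uniform arc-length resampling so the anchored point moves at constant speed; any monotone interpolation scheme would work.

import numpy as np

def resample_trajectory(polyline, num_frames):
    """polyline: (K, 2) user-drawn points -> (num_frames, 2) per-frame targets."""
    seg = np.linalg.norm(np.diff(polyline, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative arc length
    t = np.linspace(0.0, s[-1], num_frames)
    x = np.interp(t, s, polyline[:, 0])
    y = np.interp(t, s, polyline[:, 1])
    return np.stack([x, y], axis=1)

stroke = np.array([[100., 200.], [150., 180.], [260., 220.]])  # a drawn stroke
targets = resample_trajectory(stroke, num_frames=16)           # one target per frame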

[Videos: 2D Trajectories | Animation Results]

Motion Clone

Mimic all motions in the source video.

[Videos: Driven Video | Motion Clone Result]

Motion Transfer

Transfer motions from the source video to the reference image based on semantic correspondence.
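
Here is a hedged sketch of the correspondence step, assuming per-point semantic descriptors (e.g., features from a pretrained vision backbone) are available for both the source subject and the reference image; cosine-similarity nearest-neighbor matching is an illustrative choice, not the paper's stated procedure.

import numpy as np

def match_points(src_feats, ref_feats):
    """src_feats: (N, D), ref_feats: (M, D) -> best reference index per source point."""
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return np.argmax(src @ ref.T, axis=1)   # (N,) cosine-nearest reference point

# Motions tracked on the source subject are then replayed on the matched
# reference points before rendering the control signals.
matches = match_points(np.random.randn(6, 64), np.random.randn(10, 64))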

[Videos: First Frame | Driven Video]

Motion Editing

Edit any motion present in the driving video.

Freeze the motions outside the segmentation mask.
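
A minimal sketch of the freezing operation, assuming the control signals are derived from tracked 2D point positions: points whose first-frame location falls outside the mask are pinned to that location for the whole clip. Shapes and names are illustrative, not the paper's implementation.

import numpy as np

def freeze_outside_mask(tracks, mask):
    """tracks: (T, N, 2) tracked point positions; mask: (H, W) boolean array.
    Returns edited tracks with out-of-mask points pinned to frame 0."""
    x0 = tracks[0, :, 0].astype(int).clip(0, mask.shape[1] - 1)
    y0 = tracks[0, :, 1].astype(int).clip(0, mask.shape[0] - 1)
    inside = mask[y0, x0]                          # (N,) mask membership at frame 0
    edited = tracks.copy()
    edited[:, ~inside, :] = tracks[0, ~inside, :]  # pin outside points in place
    return edited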

[Videos: Driven Video | Segmentation Masks 1-4 | Edited Videos 1-4]

Modify the motions outside the segmentation mask.

[Videos: Driven Video | Segmentation Mask | Camera Movements 1-2 | Edited Videos 1-2 (two examples)]

Modify the motions inside the segmentation mask.

[Videos: Driven Video | Segmentation Mask | Edited Video (two masks)]

Demo Video




Citation



  @article{chen2025perception,
    title={Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation},
    author={Chen, Yingjie and Men, Yifang and Yao, Yuan and Cui, Miaomiao and Bo, Liefeng},
    journal={arXiv preprint arXiv:2501.05020},
    year={2025}
  }