Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo

Institute for Intelligent Computing, Alibaba Tongyi Lab

By constructing a 3D-aware motion representation based on user intentions and taking its perception results as motion control signals, the proposed fine-grained motion-controllable image animation framework can be applied to a wide range of motion-related video synthesis tasks.

Abstract


Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion, either through a shared 2D motion representation or through separate control signals, yet they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce a 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct the 3D-aware motion representation from a reference image, manipulate it according to interpreted user intentions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive, consistent visual changes. The proposed framework then leverages the perception results as motion control signals, allowing it to support various motion-related video synthesis tasks in a unified and flexible manner. Experiments demonstrate the superiority of the proposed method.

Method


In our data curation pipeline, an in-the-wild video is taken as input, and off-the-shelf visual odometry and 3D point tracking algorithms are applied to obtain the camera pose sequence and 3D point positions. Based on these results, a 3D-aware motion representation is constructed, from which camera and object control signals are rendered. During training, the control signals are fed into separate encoders to avoid RGB-level interference and are merged afterward. A three-stage training strategy is introduced to achieve fine-grained collaborative motion control. During inference, user intentions are first interpreted as a 3D-aware motion representation, which enables our framework to support a variety of motion-related applications.
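To make the perception step concrete, the following is a minimal sketch (not the actual implementation) of how per-frame camera poses and tracked 3D point positions could be rendered into 2D motion control signals via pinhole perspective projection; all array shapes, function names, and the synthetic toy data are assumptions for illustration.

    # A minimal sketch (assumption, not the paper's code) of rendering motion
    # control signals: tracked 3D points are projected into each frame's camera.
    import numpy as np

    def project_points(points_3d, extrinsic, intrinsic):
        """Project (N, 3) world-space points into pixel coordinates for one frame."""
        pts_h = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
        cam = (extrinsic @ pts_h.T).T                 # (N, 3) camera-space coordinates
        uv = (intrinsic @ (cam / cam[:, 2:3]).T).T    # perspective division + intrinsics
        return uv[:, :2]

    def render_control_signals(point_tracks, extrinsics, intrinsic):
        """Perceive (T, N, 3) 3D tracks from (T, 3, 4) camera poses -> (T, N, 2) signals."""
        return np.stack([
            project_points(point_tracks[t], extrinsics[t], intrinsic)
            for t in range(len(point_tracks))
        ])

    # Toy usage with synthetic data: 8 frames, 5 tracked points in front of the camera.
    T, N = 8, 5
    tracks = np.random.rand(T, N, 3) + np.array([0.0, 0.0, 3.0])   # keep z > 0
    poses = np.tile(np.eye(3, 4), (T, 1, 1))                        # static camera here
    K = np.array([[500.0, 0.0, 256.0], [0.0, 500.0, 256.0], [0.0, 0.0, 1.0]])
    print(render_control_signals(tracks, poses, K).shape)           # (8, 5, 2)

In this view, camera-only control varies the poses while keeping the 3D points fixed, object-only control moves the points under a fixed camera, and collaborative control varies both.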





Fine-grained Collaborative Motion Control


Camera-only Motion Control

We select both basic and arbitrary camera motions and visualize them in 3D.
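As an illustration of how a basic camera movement could be specified, here is a minimal sketch (an assumption, not the paper's settings) that writes a dolly as a sequence of world-to-camera extrinsic matrices; the identity rotation, frame count, and step size are illustrative only.

    # A minimal sketch (assumption) of a basic camera movement: a dolly expressed
    # as per-frame world-to-camera extrinsics with a growing forward translation.
    import numpy as np

    def dolly_trajectory(num_frames, step):
        """Return (T, 3, 4) extrinsics whose camera center advances along +z."""
        poses = np.tile(np.eye(3, 4), (num_frames, 1, 1))
        # With identity rotation, the camera center in world space is -t, so
        # t_z = -step * k moves the camera forward by `step` per frame.
        poses[:, 2, 3] = -step * np.arange(num_frames)
        return poses

    print(dolly_trajectory(num_frames=4, step=0.1)[:, :, 3])   # per-frame translations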
Basic Camera Movements

[Video grid: camera movements, reference images, and generated results]





Arbitrary Camera Movements

[Video grid: camera movements and reference images]





Object-only Motion Control

Since there is no camera motion, we project 3D movements of unit spheres onto the pixel plane and visualize how they change over time, using colors to represent the direction of movement.
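A minimal sketch (an assumption, modeled on the standard optical-flow color wheel rather than the paper's exact scheme) of how the projected sphere displacements could be mapped to direction-coded colors:

    # A minimal sketch (assumption): hue encodes the 2D direction of each projected
    # sphere's displacement, value encodes its relative magnitude.
    import colorsys
    import numpy as np

    def direction_colors(uv_prev, uv_next):
        """Map (N, 2) pixel displacements between two frames to (N, 3) RGB colors."""
        flow = uv_next - uv_prev
        hue = (np.arctan2(flow[:, 1], flow[:, 0]) + np.pi) / (2.0 * np.pi)
        mag = np.linalg.norm(flow, axis=1)
        val = mag / (mag.max() + 1e-6)
        return np.array([colorsys.hsv_to_rgb(h, 1.0, v) for h, v in zip(hue, val)])

    # Toy usage: five projected sphere centers all moving right and slightly up.
    uv_prev = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0], [40.0, 40.0], [50.0, 50.0]])
    print(direction_colors(uv_prev, uv_prev + np.array([5.0, -5.0])))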
Multi-instance Object Control

[Video grid: object movements and reference images]





Fine-grained Object Control

[Video grid: object movements and reference images]





Collaborative Motion Control

We control both camera and object motions, and visualize the perception process of the 3D-aware motion representation.

[Video grids: camera & object movements and reference images]



Comparison with State-of-the-Art Methods


Camera Motion Control

We compare Perception-as-Control to the state-of-the-art methods, CameraCtrl and MotionCtrl, in terms of camera motion control.

[Video grids: camera movement, point cloud reference, and results from MotionCtrl, CameraCtrl, and ours]

Object Motion Control & Collaborative Motion Control

We compare Perception-as-Control to the state-of-the-art methods, Motion-I2V and MOFA-Video, in terms of object motion control and simple collaborative motion control.

[Video grid: user drags and results from Motion-I2V, MOFA-Video, and ours]



Potential Applications


Illustration of motion-related applications.


Motion Generation

Users draw 2D/3D trajectories on the reference image and create an animation based on them. We visualize how these trajectories change over time, using colors to represent the direction of movement.
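As a concrete illustration of turning a user drag into a per-frame motion signal, here is a minimal sketch (an assumption) that linearly interpolates between the drag endpoints; the interpolation scheme, frame count, and coordinates are illustrative only.

    # A minimal sketch (assumption): expand a single 2D drag into one trajectory
    # point per output frame by linear interpolation between its endpoints.
    import numpy as np

    def drag_to_trajectory(start_xy, end_xy, num_frames):
        """Return a (T, 2) trajectory interpolated from start_xy to end_xy."""
        t = np.linspace(0.0, 1.0, num_frames)[:, None]
        return (1.0 - t) * np.asarray(start_xy, float) + t * np.asarray(end_xy, float)

    print(drag_to_trajectory((50, 80), (120, 40), num_frames=5))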

[Video grid: user drags and animation results]





Motion Clone

Mimic the entire motion of the source video.

[Video grid: source videos and motion clone results]





Motion Transfer

Transfer local motions from the source video to the reference image by relocating and rescaling them based on semantic correspondence.
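A minimal sketch (an assumption) of the relocate-and-rescale step: a 2D trajectory anchored at a source region is re-anchored at the semantically corresponding target region and scaled by the ratio of the region sizes. The scalar sizes and centers stand in for whatever the semantic correspondence actually provides.

    # A minimal sketch (assumption): move a local trajectory from a source region
    # to its semantically corresponding target region and rescale its extent.
    import numpy as np

    def transfer_trajectory(traj, src_center, src_size, tgt_center, tgt_size):
        """Re-anchor a (T, 2) trajectory at tgt_center with scale tgt_size / src_size."""
        scale = tgt_size / src_size
        return (np.asarray(traj, float) - src_center) * scale + tgt_center

    # Toy usage: a small motion around (100, 100) transferred to a larger region at (300, 200).
    traj = np.array([[100.0, 100.0], [110.0, 105.0], [120.0, 115.0]])
    print(transfer_trajectory(traj, np.array([100.0, 100.0]), 50.0,
                              np.array([300.0, 200.0]), 100.0))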

[Video grid: first frames and source videos]





Motion Editing

Edit fine-grained scene and object motions in user-specified regions.
Freeze the motions outside the segmentation mask.
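A minimal sketch (an assumption) of what freezing could mean at the level of tracked control points: points whose first-frame position lies outside the mask are pinned to that position, so only the masked region keeps its motion. The point-based formulation and the (row, column) mask indexing are assumptions.

    # A minimal sketch (assumption): freeze control-point motion outside a mask
    # by pinning those points to their first-frame positions.
    import numpy as np

    def freeze_outside_mask(tracks, mask):
        """tracks: (T, N, 2) per-frame (x, y) positions; mask: (H, W) boolean."""
        first = tracks[0]
        cols = np.clip(first[:, 0].astype(int), 0, mask.shape[1] - 1)   # x -> column
        rows = np.clip(first[:, 1].astype(int), 0, mask.shape[0] - 1)   # y -> row
        inside = mask[rows, cols]                                       # (N,) bool
        return np.where(inside[None, :, None], tracks, first[None])

    # Toy usage: only the first of three points starts inside the masked corner.
    start = np.array([[10.0, 10.0], [60.0, 60.0], [100.0, 20.0]])
    tracks = start[None] + np.linspace(0.0, 30.0, 4)[:, None, None]     # shared drift
    mask = np.zeros((128, 128), dtype=bool)
    mask[:32, :32] = True
    print(freeze_outside_mask(tracks, mask)[:, :, 0])                   # x over time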

[Video grid: source video, segmentation masks 1-4, and edited videos 1-4]





Modify the motions outside the segmentation mask.

[Video grids: source videos, segmentation masks, camera movements 1-2, and edited videos 1-2]





Modify the motions inside the segmentation mask.

[Video grid: source video, segmentation masks, and edited videos]





Demo Video




Citation



  @article{chen2025perception,
    title={Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation},
    author={Chen, Yingjie and Men, Yifang and Yao, Yuan and Cui, Miaomiao and Bo, Liefeng},
    journal={arXiv preprint arXiv:2501.05020},
    year={2025}}