WeChat Vision, Tencent Inc.
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, no openly accessible framework offers fine-grained control over facial appearance and vocal timbre across multiple identities. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation that enables high-fidelity, consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across the audio and visual modalities, covering scenarios ranging from single subjects to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject settings, in which both facial appearance and vocal timbre serve as identity-bearing control signals. Moreover, in light of the disparity between modalities, we design a multi-stage training strategy that accelerates convergence and enforces cross-modal coherence. Experiments demonstrate the superiority of the proposed framework.
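The identity injection mechanism can be illustrated with a minimal PyTorch sketch, assuming a design in which per-subject face and voice embeddings are projected into the DiT token space and prepended to the latent sequence. All class, parameter, and variable names below (IdentityInjection, face_proj, voice_proj, subject_emb) are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class IdentityInjection(nn.Module):
    def __init__(self, id_dim: int, token_dim: int, max_subjects: int = 4):
        super().__init__()
        # Hypothetical sketch: separate projections map facial-appearance and
        # vocal-timbre embeddings into the DiT token space.
        self.face_proj = nn.Linear(id_dim, token_dim)
        self.voice_proj = nn.Linear(id_dim, token_dim)
        # A learned subject-index embedding keeps multiple identities distinguishable.
        self.subject_emb = nn.Embedding(max_subjects, token_dim)

    def forward(self, latents, face_ids, voice_ids):
        # latents:   (B, N, token_dim) video or audio latent tokens
        # face_ids:  (B, S, id_dim)    one face embedding per subject
        # voice_ids: (B, S, id_dim)    one voice embedding per subject
        S = face_ids.shape[1]
        subj = self.subject_emb(torch.arange(S, device=latents.device))  # (S, token_dim)
        face_tokens = self.face_proj(face_ids) + subj     # (B, S, token_dim)
        voice_tokens = self.voice_proj(voice_ids) + subj  # (B, S, token_dim)
        # Prepend identity tokens so self-attention can condition on them.
        return torch.cat([face_tokens, voice_tokens, latents], dim=1)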
The overall dual-tower DiT architecture and training framework. The model takes five inputs: video, audio, video identity, audio identity, and a structured caption. We first extract latents for each modality, apply identity embedding to the identity latents, and then organize the latents with structured position embedding. Within the DiT, we use asymmetric self-attention for decoupled parameterization. Training proceeds in three stages: stage 1 for unimodal identity learning, stage 2 for joint multimodal identity training, and stage 3 for multi-view identity fine-tuning.
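A minimal sketch of the asymmetric self-attention named in the caption, under the assumption that video and audio tokens attend jointly over one concatenated sequence while keeping decoupled (per-modality) QKV and output projections; the class name, dimensions, and head count are hypothetical, not the paper's actual hyperparameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Decoupled parameterization: one QKV projection per modality tower.
        self.qkv_video = nn.Linear(dim, 3 * dim)
        self.qkv_audio = nn.Linear(dim, 3 * dim)
        self.out_video = nn.Linear(dim, dim)
        self.out_audio = nn.Linear(dim, dim)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Nv, dim), audio_tokens: (B, Na, dim)
        B, Nv, dim = video_tokens.shape
        # Project each modality with its own parameters, then concatenate
        # along the sequence so both modalities attend jointly.
        qkv = torch.cat([self.qkv_video(video_tokens),
                         self.qkv_audio(audio_tokens)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)

        def split(x):  # (B, N, dim) -> (B, heads, N, head_dim)
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(B, -1, dim)
        # Route each modality's tokens back through its own output projection.
        return self.out_video(out[:, :Nv]), self.out_audio(out[:, Nv:])

This keeps a single joint attention map, so cross-modal coherence is enforced at every layer, while the per-tower weights respect the modality disparity the training strategy is designed around.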
@article{chen2026identity,
  title={Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation},
  author={Yingjie Chen and Shilun Lin and Xing Cai and Qixin Yan and Wenjing Wang and Dingming Liu and Hao Liu and Chen Li and Jing LYU},
  journal={arXiv preprint},
  year={2026}
}