WeChat Vision, Tencent Inc.
Recent advances have demonstrated compelling capabilities in synthesizing real individuals into generated videos, reflecting the growing demand for identity-aware content creation. Nevertheless, no openly accessible framework offers fine-grained control over facial appearance and vocal timbre across multiple identities. In this work, we present a unified and scalable framework for identity-aware joint audio-video generation that enables high-fidelity, consistent personalization. Specifically, we introduce a data curation pipeline that automatically extracts identity-bearing information with paired annotations across the audio and visual modalities, covering scenarios ranging from single subjects to multi-subject interactions. We further propose a flexible and scalable identity injection mechanism for single- and multi-subject settings, in which both facial appearance and vocal timbre serve as identity-bearing control signals. Moreover, in light of the disparity between modalities, we design a multi-stage training strategy that accelerates convergence and enforces cross-modal coherence. Experiments demonstrate the superiority of the proposed framework.
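The identity injection mechanism can be illustrated with a minimal PyTorch sketch, assuming a design in which per-subject face and voice embeddings are projected into the DiT token space and prepended to the latent sequence. All class, parameter, and variable names below (IdentityInjection, face_proj, voice_proj, subject_emb) are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class IdentityInjection(nn.Module):
    def __init__(self, id_dim: int, token_dim: int, max_subjects: int = 4):
        super().__init__()
        # Hypothetical sketch: separate projections map facial-appearance and
        # vocal-timbre embeddings into the DiT token space.
        self.face_proj = nn.Linear(id_dim, token_dim)
        self.voice_proj = nn.Linear(id_dim, token_dim)
        # A learned subject-index embedding keeps multiple identities distinguishable.
        self.subject_emb = nn.Embedding(max_subjects, token_dim)

    def forward(self, latents, face_ids, voice_ids):
        # latents:   (B, N, token_dim) video or audio latent tokens
        # face_ids:  (B, S, id_dim)    one face embedding per subject
        # voice_ids: (B, S, id_dim)    one voice embedding per subject
        S = face_ids.shape[1]
        subj = self.subject_emb(torch.arange(S, device=latents.device))  # (S, token_dim)
        face_tokens = self.face_proj(face_ids) + subj     # (B, S, token_dim)
        voice_tokens = self.voice_proj(voice_ids) + subj  # (B, S, token_dim)
        # Prepend identity tokens so self-attention can condition on them.
        return torch.cat([face_tokens, voice_tokens, latents], dim=1)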
The overall dual-tower DiT architecture and training framework. The model takes five inputs: video, audio, video identity, audio identity, and a structured caption. We first extract latents for each modality, apply identity embedding to the identity latents, and then organize the latents with structured position embedding. Within the DiT, we use asymmetric self-attention for decoupled parameterization. Training proceeds in three stages: stage 1 for unimodal identity learning, stage 2 for joint multimodal identity training, and stage 3 for multi-view identity fine-tuning.
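A minimal sketch of the asymmetric self-attention named in the caption, under the assumption that video and audio tokens attend jointly over one concatenated sequence while keeping decoupled (per-modality) QKV and output projections; the class name, dimensions, and head count are hypothetical, not the paper's actual hyperparameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Decoupled parameterization: one QKV projection per modality tower.
        self.qkv_video = nn.Linear(dim, 3 * dim)
        self.qkv_audio = nn.Linear(dim, 3 * dim)
        self.out_video = nn.Linear(dim, dim)
        self.out_audio = nn.Linear(dim, dim)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B, Nv, dim), audio_tokens: (B, Na, dim)
        B, Nv, dim = video_tokens.shape
        # Project each modality with its own parameters, then concatenate
        # along the sequence so both modalities attend jointly.
        qkv = torch.cat([self.qkv_video(video_tokens),
                         self.qkv_audio(audio_tokens)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)

        def split(x):  # (B, N, dim) -> (B, heads, N, head_dim)
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(B, -1, dim)
        # Route each modality's tokens back through its own output projection.
        return self.out_video(out[:, :Nv]), self.out_audio(out[:, Nv:])

This keeps a single joint attention map, so cross-modal coherence is enforced at every layer, while the per-tower weights respect the modality disparity the training strategy is designed around.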
@article{chen2026identity,
  title={Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation},
  author={Yingjie Chen and Shilun Lin and Xing Cai and Qixin Yan and Wenjing Wang and Dingming Liu and Hao Liu and Chen Li and Jing LYU},
  journal={arXiv preprint},
  year={2026}
}