
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Generation

Yupeng Zhou¹,², Lianghua Huang², Zhifan Wu², Jiabao Wang¹, Yupeng Shi², Biao Jiang²,³, Daquan Zhou³, Yu Liu², Ming-Ming Cheng¹, Qibin Hou¹

¹ VCIP, School of Computer Science, Nankai University
² Tongyi Lab
³ Peking University

Demo video of Mutual Forcing. The preview is compressed for web display.

Abstract

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train unimodal generators and then couple them into a unified audio-video model for joint training on paired data.

For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency.

The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than from a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality.
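To make the self-distillation direction concrete, one plausible formalization is sketched below; the notation is ours and need not match the paper's exact objective. With shared weights θ, a few-step branch f_θ^few, and a multi-step branch f_θ^multi, a stop-gradient distillation loss over self-generated history c reads:

\mathcal{L}_{\mathrm{distill}}(\theta)
  = \mathbb{E}_{c \sim \pi^{\mathrm{few}}_{\theta}}
    \Big[ D\big( f^{\mathrm{few}}_{\theta}(x \mid c),\ \mathrm{sg}\big[ f^{\mathrm{multi}}_{\theta}(x \mid c) \big] \big) \Big]

Here π^few_θ is the few-step rollout that produces the history context c (so training conditions on the same kind of context as inference), D is a distance between denoised outputs, and sg[·] is a stop-gradient that lets the multi-step branch act as the weight-shared teacher; because both branches are the same θ, improving one improves the other.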

Method Overview

Mutual Forcing addresses two core challenges in fast autoregressive audio-video generation: joint multimodal modeling and efficient streaming generation. Instead of relying on a bidirectional-to-causal multi-stage distillation pipeline, it starts directly from a native causal model and improves it through a dual-mode self-evolution framework.

🎧🎬

Two-Stage Audio-Video Modeling

To ease joint multimodal optimization, we first train strong unimodal generators and then couple them into a unified audio-video model for joint training on paired data; a minimal coupling sketch is given after the overview figure below.

⚡

Native Causal Streaming Training

Rather than first training a bidirectional model and then converting it into a streaming generator through multiple distillation stages, Mutual Forcing directly builds on a native fast causal audio-video model.

🔀

Dual-Mode Weight-Shared Design

A single weight-shared model supports both few-step and multi-step generation. This unified design allows efficient inference while preserving strong generation quality.

🔁

Mutual Self-Evolution

The few-step mode generates historical context during training to reduce training-inference mismatch, while the multi-step mode improves the few-step mode via self-distillation; because they share parameters, the two modes reinforce each other within one model.
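To ground the four components above, the following is a minimal, runnable PyTorch-style sketch of the dual-mode loop. Every name here (MutualForcingNet, rollout, the 4-vs-8 step counts, the toy MSE distillation distance) is an illustrative assumption rather than the paper's actual implementation, and the real method would additionally train on real paired audio-video data.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualForcingNet(nn.Module):
    """Toy stand-in for the weight-shared causal audio-video denoiser."""
    def __init__(self, dim=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_chunk, history):
        # Each chunk is denoised conditioned on the preceding history.
        return self.body(torch.cat([noisy_chunk, history], dim=-1))

def rollout(model, init_history, noises, num_steps, grad):
    """Autoregressively denoise one chunk per noise tensor, num_steps each."""
    history, chunks = init_history, []
    with torch.enable_grad() if grad else torch.no_grad():
        for eps in noises:
            x = eps
            for _ in range(num_steps):  # few (e.g. 4) or many steps
                x = model(x, history)
            chunks.append(x)
            history = x  # self-generated context, as at inference time
    return chunks

model = MutualForcingNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for it in range(100):
    init = torch.zeros(8, 32)
    noises = [torch.randn(8, 32) for _ in range(3)]  # shared noise per chunk
    # Few-step mode rolls out its own history, so training matches inference.
    few = rollout(model, init, noises, num_steps=4, grad=True)
    # Multi-step mode, same weights, provides the stop-gradient target.
    multi = rollout(model, init, noises, num_steps=8, grad=False)
    # A loss on real paired audio-video data would be added here in practice.
    loss = sum(F.mse_loss(f, m) for f, m in zip(few, multi)) / len(few)
    opt.zero_grad()
    loss.backward()
    opt.step()

Because both rollouts call the same model, the multi-step "teacher" improves automatically as the few-step "student" improves, which is the mutual part of the design.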

Method overview teaser

Illustration of Mutual Forcing. Compared with prior paradigms that either rely on real history and suffer from training-inference mismatch or require an additional bidirectional teacher for distillation, Mutual Forcing starts from a native causal model and uses a weight-shared dual-mode design to unify few-step and multi-step generation in a teacher-free framework.
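For the two-stage strategy described above, the sketch below shows one plausible way to couple two pretrained unimodal backbones before joint training on paired data. The cross-modal-attention coupling and all names are our own assumptions for illustration, not the paper's confirmed architecture.

import torch
import torch.nn as nn

class UnimodalBackbone(nn.Module):
    """Toy stand-in for a pretrained audio-only or video-only generator."""
    def __init__(self, dim=32, depth=2):
        super().__init__()
        # In practice these layers would carry Stage-1 pretrained weights.
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))

class JointAVModel(nn.Module):
    """Stage 2: couple the two backbones with cross-modal attention."""
    def __init__(self, audio, video, dim=32):
        super().__init__()
        self.audio, self.video = audio, video
        self.v2a = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, a, v):
        # Interleave unimodal layers with bidirectional cross-modal attention.
        for la, lv in zip(self.audio.layers, self.video.layers):
            a, v = la(a), lv(v)
            a = a + self.v2a(a, v, v)[0]  # audio tokens attend to video
            v = v + self.a2v(v, a, a)[0]  # video tokens attend to audio
        return a, v

joint = JointAVModel(UnimodalBackbone(), UnimodalBackbone())
a = torch.randn(2, 10, 32)  # (batch, audio tokens, dim)
v = torch.randn(2, 6, 32)   # (batch, video tokens, dim)
a_out, v_out = joint(a, v)  # joint training on paired data starts from here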

Qualitative Results

We showcase Mutual Forcing across diverse audio-video generation scenarios, including speech, music-related scenes, non-speech audio, and long-horizon streaming generation.

๐Ÿ—ฃ๏ธ Speaking / Dialogue

Mutual Forcing supports synchronized speech generation with stable visual identity, coherent lip motion, and temporally consistent long-horizon generation.

Two men in camouflage in a mossy forest, with synchronized dialogue and stable long-range generation.

A two-person conversation with natural turn-taking, speech synchronization, and smooth facial motion.

An elderly couple conversation with stable identity, coherent speech, and consistent audiovisual generation.

🎤 Singing / Music-Related Scenes

The model generates synchronized music-related audio-video content with coherent vocal delivery, instrument sounds, and visually aligned motion.

Blonde woman at a white piano with synchronized singing and soft accompaniment.

A dark-stage female vocalist with sustained singing performance and temporally aligned motion.

An elderly man playing bamboo flute with natural breath, hand motion, and audio-video synchronization.

🎼 Scene Audio / Background Music

Mutual Forcing can synthesize scene-consistent audio that matches visual content, motion dynamics, and temporal progression.

A seaside scene with coherent atmosphere, motion, and scene-aligned audio.

A walking scene with calm scene-consistent audio and smooth visual progression.

A suspenseful scene with matched audio cues and coherent camera-aware generation.

๐Ÿพ Non-Speech Audio

Beyond speech, the model also generalizes to non-verbal audio domains such as animal vocalization and eating sounds.

Animal vocalization with synchronized visual behavior.

Eating scene with coherent foley-like audio and matching mouth motion.

โฑ๏ธ Long-Horizon Streaming Generation

Our autoregressive framework naturally scales to long-form generation, maintaining identity consistency, synchronized audiovisual dynamics, and temporal stability over extended durations.

Long-form sample 1 – sustained audio-video coherence over a long horizon.

Long-form sample 2 – stable visual identity and synchronized speech generation.

Long-form sample 3 – robust long-horizon streaming generation with stable audiovisual alignment.

Citation

If you find this work useful, please consider citing:

@article{zhou2025mutualforcing,
  title   = {Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Generation},
  author  = {Yupeng Zhou and Lianghua Huang and Zhifan Wu and Jiabao Wang and Yupeng Shi and Biao Jiang and Daquan Zhou and Yu Liu and Ming-Ming Cheng and Qibin Hou},
  year    = {2026}
}