arxiv:2512.21734

Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

Published on Dec 25 · Submitted by Zihan Wang on Dec 30

Abstract

Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.
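The chunk-wise streaming rollout described in the abstract can be pictured as a loop that computes the reference image's context once and then generates frames chunk by chunk, attending only to that cached context plus a sliding window of recent frames. The sketch below is a toy illustration under assumptions, not the paper's implementation: encode_reference, denoise_frame, and the chunk/window sizes are hypothetical stand-ins, and simple latent averaging replaces the model's attention over cached KV states.

```python
# Minimal sketch (not the authors' code): chunk-wise causal rollout with a
# cached reference context and a sliding window of recent frames.
import numpy as np

CHUNK, WINDOW, LATENT_DIM = 4, 12, 64      # frames per chunk, window length, toy latent size

def encode_reference(image_latent):
    """Stand-in for caching the reference image's KV states once, up front."""
    return image_latent.copy()             # a real model would store per-layer K/V tensors

def denoise_frame(ref_kv, context, driving):
    """Stand-in denoiser: mixes the fixed reference context with the local window."""
    base = 0.5 * ref_kv + 0.5 * context.mean(axis=0)   # toy mixing in place of attention
    return base + driving                               # driving signal perturbs each new frame

def stream(ref_latent, driving_signals):
    ref_kv = encode_reference(ref_latent)  # global identity context, computed once
    history = [ref_latent]                 # local temporal context
    for step in range(0, len(driving_signals), CHUNK):
        window = np.stack(history[-WINDOW:])            # sliding-window attention context
        for d in driving_signals[step:step + CHUNK]:
            frame = denoise_frame(ref_kv, window, d)
            history.append(frame)
            yield frame                    # frames stream out as soon as they are ready

frames = list(stream(np.zeros(LATENT_DIM),
                     [0.01 * np.random.randn(LATENT_DIM) for _ in range(16)]))
print(len(frames), frames[0].shape)
```

In the real model the cached reference states and the sliding window would live inside the transformer's attention layers, which is what keeps the per-frame cost roughly constant no matter how long the stream runs.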

Community

Paper submitter

(Figure: framework overview)

We propose Knot Forcing, a streaming framework for real-time portrait animation that enables high-fidelity, temporally consistent, and interactive video generation from dynamic inputs such as reference images and driving signals. Unlike diffusion-based models that are non-causal and latency-heavy, or autoregressive methods that suffer from error accumulation and motion discontinuities, our approach supports efficient frame-by-frame synthesis while maintaining long-term visual and temporal coherence on consumer-grade hardware.

Our method introduces three key innovations:

  • Chunk-wise causal generation with hybrid memory: We preserve global identity by caching KV states of the reference image, while modeling local dynamics using sliding window attention for efficient temporal coherence.
  • Temporal knot module: By overlapping adjacent video chunks and propagating spatio-temporal cues via image-to-video conditioning, we smooth transitions and reduce motion jitter at chunk boundaries.
  • Global context running ahead: During inference, we dynamically update the temporal coordinate of the reference frame to keep its semantic context ahead of the current generation step, enabling stable long-term rollout (see the sketch after this list).
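
A rough sketch of how the second and third mechanisms could slot into the rollout loop is below. It is an assumption-laden illustration, not the released code: generate_chunk, OVERLAP, and LEAD are invented names, latent averaging stands in for the model's image-to-video conditioning on the overlapped frames, and the reference's temporal coordinate is only tracked here rather than fed into a positional embedding.

```python
# Minimal sketch (assumptions, not the paper's implementation): chunk overlap
# ("temporal knot") plus a running-ahead reference coordinate.
import numpy as np

CHUNK, OVERLAP, LEAD, DIM = 6, 2, 8, 64

def generate_chunk(ref_latent, ref_time, knot_frames, length):
    # Toy chunk generator: conditions on the overlapped "knot" frames the way an
    # image-to-video model would; in a real model ref_time would set the
    # reference tokens' temporal position embedding (here it is only tracked).
    seed = knot_frames.mean(axis=0) if len(knot_frames) else ref_latent
    return [0.9 * seed + 0.1 * ref_latent + 0.01 * np.random.randn(DIM)
            for _ in range(length)]

def rollout(ref_latent, total_frames):
    frames, ref_time = [], LEAD                      # reference starts ahead of frame 0
    while len(frames) < total_frames:
        knot = np.stack(frames[-OVERLAP:]) if frames else np.empty((0, DIM))
        chunk = generate_chunk(ref_latent, ref_time, knot, CHUNK)
        # keep only the newly generated frames; the overlapped ones tie chunks together
        frames.extend(chunk if not frames else chunk[OVERLAP:])
        ref_time = len(frames) + LEAD                # "running ahead": reference stays in the future
    return frames[:total_frames]

video = rollout(np.zeros(DIM), 24)
print(len(video))
```

In this reading, the overlapped frames are regenerated at the start of each chunk and then dropped, so consecutive chunks share context at the boundary, while keeping ref_time ahead of the rollout position mirrors the idea of letting the reference frame act as future context for long-term coherence.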

Together, these designs enable Knot Forcing to deliver real-time, high-quality portrait animation over infinite sequences with strong visual stability and responsiveness.

Project page: this url

