Streaming Video Generation · USTC · FrameX.AI

Stream-R1 Reliability-Perplexity Aware Reward Distillation for streaming video generation

Bin Wu1 Mengqi Huang1†‡ Shaojin Wu3 Weinan Jia1 Yuxin Wang2 Zhendong Mao1 Yongdong Zhang1
1 University of Science and Technology of China 2 FrameX.AI 3 Independent Researcher
† Corresponding author · ‡ Project lead


Abstract

Stream-R1 reweights distribution-matching distillation along two complementary axes — Inter-Reliability across rollouts and Intra-Perplexity across spatiotemporal regions — using a single shared video reward model.

Existing DMD methods for streaming video diffusion treat every rollout, frame, and pixel as equally reliable supervision. We argue this caps the quality of the distilled student, because it overlooks two complementary axes of variance in the DMD signal. Stream-R1 rescales each rollout's loss by an exponential of a pretrained reward score, so that rollouts whose gradient genuinely points toward the teacher's high-quality mode dominate, and back-propagates through the same reward model to obtain a per-pixel saliency that concentrates optimization where the local reward landscape has not yet flattened. An adaptive balancing mechanism keeps visual quality, motion quality, and text alignment improving in lockstep. The resulting student surpasses its multi-step Wan2.1 teacher on VBench Total and Semantic at 23.1 FPS, with no architectural change and zero inference overhead.

01

Inter-Reliability Weighting

The DMD gradient g = f_fake − f_real varies in reliability across rollouts. Stream-R1 rescales each rollout's loss by exp(β · r_final), so rollouts whose gradient genuinely points toward the teacher's high-quality mode dominate the supervision.
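This reliability weighting can be sketched in a few lines. The max-subtraction (for numerical stability) and the mean-one normalization (to preserve the batch-average loss scale) are illustrative choices of this sketch, not details stated above:

```python
import numpy as np

def inter_reliability_weights(rewards, beta=1.0):
    """Weight each rollout's DMD loss by exp(beta * r_final).

    `rewards` holds one scalar per rollout from the pretrained video
    reward model. Subtracting the max before exponentiating avoids
    overflow; mean-one normalization (an illustrative choice) keeps
    the batch-average loss scale unchanged.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    w = np.exp(beta * (rewards - rewards.max()))
    return w / w.mean()

# Rollouts with higher final reward dominate the supervision.
weights = inter_reliability_weights([0.2, 0.8, 0.5], beta=2.0)
```

Larger β sharpens the weighting toward the best-rewarded rollouts; β → 0 recovers the uniform weighting of standard DMD.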

02

Intra-Perplexity Weighting

Back-propagating the reward model yields a per-pixel saliency volume S ∈ ℝ^{F×H×W}; factorized into spatial and temporal weights, it concentrates optimization where further refinement yields the largest expected gain — regions whose local reward landscape has not yet flattened.
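A minimal sketch of this factorization, assuming the saliency volume is the absolute reward gradient |∂R/∂x| and normalizing each weight to mean one so the overall DMD loss scale is preserved (both assumptions are ours, for illustration):

```python
import numpy as np

def factorize_saliency(saliency, eps=1e-8):
    """Factorize a per-pixel saliency volume S of shape (F, H, W)
    into spatial and temporal weights.

    `saliency` is assumed to be |dR/dx|, the absolute gradient from
    back-propagating the reward R to the pixels. Mean-one normalization
    (our illustrative choice) keeps the DMD loss scale intact.
    """
    s = np.abs(saliency) + eps
    w_temporal = s.mean(axis=(1, 2))                    # (F,): one weight per frame
    w_temporal = w_temporal / w_temporal.mean()
    w_spatial = s / s.mean(axis=(1, 2), keepdims=True)  # (F, H, W), mean one per frame
    return w_spatial, w_temporal
```

Frames with large aggregate saliency receive temporal weight above one; within each frame, the spatial map redistributes the loss toward high-saliency pixels.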

03

Adaptive Reward Balancing

A sliding window tracks per-axis (Visual / Motion / Text-align) improvement and softly down-weights axes that are already improving, keeping all three quality dimensions advancing together.
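One way such a balancer could look. The half-window improvement estimate and the softmax form are illustrative choices, not the paper's exact rule:

```python
from collections import deque
import numpy as np

class AdaptiveRewardBalancer:
    """Sliding-window balancer over the three reward axes
    (visual / motion / text-alignment). Sketch only: the half-window
    improvement estimate and softmax temperature are assumptions.
    """

    def __init__(self, n_axes=3, window=50, temperature=1.0):
        self.history = [deque(maxlen=window) for _ in range(n_axes)]
        self.temperature = temperature

    def update(self, scores):
        # Record the latest reward score on each axis.
        for hist, s in zip(self.history, scores):
            hist.append(s)

    def weights(self):
        # Improvement per axis: mean of the recent half-window
        # minus mean of the older half-window.
        deltas = []
        for hist in self.history:
            h = np.asarray(hist, dtype=np.float64)
            if len(h) < 2:
                deltas.append(0.0)
            else:
                half = len(h) // 2
                deltas.append(h[half:].mean() - h[:half].mean())
        deltas = np.asarray(deltas)
        # Axes already improving are softly down-weighted (negative sign),
        # so lagging axes receive more of the reward signal.
        w = np.exp(-deltas / self.temperature)
        return w / w.sum() * len(w)   # weights sum to n_axes
```

An axis whose score has plateaued keeps weight near one, while a rapidly improving axis is damped, pushing optimization toward whichever dimension is lagging.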

Motivation

Two complementary axes of variance

The DMD supervision signal varies across rollouts — the gradient on rollouts far from the teacher’s high-quality mode encodes within-low-quality refinement rather than a path toward it — and across spatiotemporal regions, since some areas have already saturated under the reward while others still lie in the steepest part of the reward landscape. Treating them uniformly dilutes the supervision and wastes optimization budget on already-saturated regions while leaving high-perplexity content under-supervised.

Motivation of Stream-R1
Figure 1. The DMD supervision signal exhibits Inter-Reliability across rollouts and Intra-Perplexity across spatiotemporal regions. Existing DMD applies a uniform weight; Stream-R1 upweights rollouts on which the supervision is reliable and concentrates optimization on regions where further refinement yields the largest expected gain, all driven by a single reward model.

Method

A single reward model, two complementary weights

Stream-R1 retains the tractability of the DMD objective while replacing its uniform weighting with reliability- and perplexity-aware guidance. The same reward model produces both a rollout-level scalar and a per-pixel gradient saliency volume, with adaptive fusion across visual, motion, and text-alignment axes.

Stream-R1 method overview
Figure 2. Inter-Reliability score extraction yields W_inter; adaptive saliency aggregation factorizes the back-propagated reward into spatial and temporal weights forming W_intra. The composed loss is L_Stream-R1 = W_inter · (W_intra ⊙ L_DMD).
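The composed loss in the caption can be sketched as follows for a single rollout; treating W_intra as the product of a spatial map and a per-frame temporal weight is an assumption carried over from the factorization described above:

```python
import numpy as np

def stream_r1_loss(loss_dmd, w_inter, w_spatial, w_temporal):
    """Compose L_Stream-R1 = W_inter * (W_intra ⊙ L_DMD) for one rollout.

    `loss_dmd` is the per-pixel DMD loss of shape (F, H, W). W_intra is
    assumed (for illustration) to factor into a spatial map of shape
    (F, H, W) and a per-frame temporal weight of shape (F,).
    """
    w_intra = w_spatial * w_temporal[:, None, None]   # broadcast (F,) over H, W
    return w_inter * float((w_intra * loss_dmd).mean())
```

With all weights equal to one this reduces to the plain mean DMD loss, which is why the weighting adds no inference-time cost: it only reshapes the training gradient.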

Long Video Generation

Four durations, no quality drift

Each row plays clips that were independently rolled out at a different duration — 30 seconds, 60 seconds, 2 minutes, 3 minutes.

30 seconds · 60 seconds · 2 minutes · 3 minutes

Method Comparison

Reward Forcing vs. Stream-R1

Same prompt, same seed. Reward Forcing applies a single global scalar reward; Stream-R1 additionally redistributes the optimization signal across reliable rollouts and high-perplexity regions.

Reward Forcing Stream-R1 (ours)

Spatiotemporal Saliency · Controlled Stress Test

Where the reward gradient looks

To probe whether the spatiotemporal weights actually respond to local quality deficiency, we run a controlled experiment: Gaussian blur is injected only into the lower half of each frame, so every frame contains a clean (top) versus degraded (bottom) contrast. Across the four frames, the corrupted area progressively expands from left to right. We then back-propagate the reward score through the vision encoder to recover a per-pixel gradient saliency.
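The degradation step of this stress test could be reproduced along these lines; a horizontal box blur stands in for the Gaussian blur, and the function name and region widths are illustrative:

```python
import numpy as np

def inject_lower_half_blur(frames, widths, k=5):
    """Blur only the lower half of each frame, over a region whose
    width grows across frames (each width must be >= k).

    `frames` has shape (F, H, W). A 1-D horizontal box blur of size `k`
    stands in for the Gaussian blur used in the experiment.
    """
    out = frames.astype(np.float64).copy()
    n_frames, height, _ = frames.shape
    kernel = np.ones(k) / k
    for f in range(n_frames):
        w = widths[f]
        for row in range(height // 2, height):
            # Box-blur only the leftmost `w` pixels of this lower-half row.
            out[f, row, :w] = np.convolve(out[f, row, :w], kernel, mode="same")
    return out
```

The saliency side of the test would then back-propagate the reward score through the vision encoder on these frames and take the absolute pixel gradient, as described above.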

Figure: four degraded frames (blur region in the lower half, expanding from Frame 1 to Frame 4) paired with their reward-saliency overlays and per-frame temporal weights w_t.

Within each frame, saliency is biased toward the lower (blurred) half rather than the visually intact upper half, and as the blurred region grows, the spatial saliency tightens onto its interior. Per-frame temporal weights climb monotonically from 0.587 to 2.117. None of this is hand-engineered: the behavior emerges purely from the reward-model gradient, which automatically allocates more learning signal to the frames that need it most.

Quantitative Results

Beating the multi-step teacher at 30× speed

Despite being distilled into a 4-step model, Stream-R1 surpasses its own diffusion teacher Wan2.1 on VBench Total and Semantic, while improving on Reward Forcing across every axis at the same throughput.

VBench — 5-second video, 832×480

Best results are highlighted. Stream-R1 achieves the highest Total of any compared method, surpassing both its diffusion teacher (84.26) and the previous best distilled model, Reward Forcing (84.13).

Model | Params | FPS↑ | Total↑ | Quality↑ | Semantic↑
Diffusion models
LTX-Video | 1.9B | 8.98 | 80.00 | 82.30 | 70.79
Wan2.1 (teacher) | 1.3B | 0.78 | 84.26 | 85.30 | 80.09
Autoregressive / streaming models
SkyReels-V2 | 1.3B | 0.49 | 82.67 | 84.70 | 74.53
MAGI-1 | 4.5B | 0.19 | 79.18 | 82.04 | 67.74
NOVA | 0.6B | 0.88 | 80.12 | 80.39 | 79.05
Pyramid Flow | 2B | 6.7 | 81.72 | 84.74 | 69.62
CausVid | 1.3B | 17.0 | 82.88 | 83.93 | 78.69
Self Forcing | 1.3B | 17.0 | 83.80 | 84.59 | 80.64
LongLive | 1.3B | 20.7 | 83.22 | 83.68 | 81.37
Rolling Forcing | 1.3B | 17.5 | 81.22 | 84.08 | 69.78
Reward-guided distillation
Reward Forcing | 1.3B | 23.1 | 84.13 | 84.84 | 81.32
Stream-R1 (ours) | 1.3B | 23.1 | 84.40 | 85.14 | 81.44

Qwen3-VL judgement — 60-second video

Each video scored 1–5 on three axes by a strong VLM judge. Stream-R1 attains the best Visual Quality and Text Alignment; we trade a small Dynamic margin for gains on the other two axes, consistent with the balanced multi-dimensional design.

Model | Visual↑ | Dynamic↑ | Text↑
SkyReels-V2 | 3.30 | 3.05 | 2.70
CausVid | 4.66 | 3.16 | 3.32
Self Forcing | 3.89 | 3.44 | 3.11
LongLive | 4.79 | 3.81 | 3.98
Reward Forcing | 4.82 | 4.18 | 4.04
Stream-R1 (ours) | 4.92 | 4.04 | 4.11
Per-metric quality versus video length
Figure 3. Per-metric quality vs. video length. Stream-R1 widens its lead over Reward Forcing as duration grows — especially at 120s and 180s, confirming that spatiotemporal reward-guided weighting mitigates the quality drift accumulated during long autoregressive rollouts.

Ablation

Each component contributes

We progressively add each component on top of the standard DMD baseline, evaluating short videos on VBench (5s) and long videos on a 60-second protocol. Drift measures autoregressive quality decay (lower is better).

Variant | VBench (5 s) Total↑ | Quality↑ | Semantic↑ | Long Video (60 s) Total↑ | Drift↓
DMD baseline | 83.44 | 84.16 | 80.55 | 79.45 | 2.479
+ Spatial reward W_s (min = 0.15) | 83.67 | 84.46 | 80.51 | 80.71 | 2.653
+ Balanced multi-dim reward | 83.68 | 84.44 | 80.62 | 80.73 | 2.697
+ Temporal reward W_t (full) | 84.40 | 85.14 | 81.44 | 80.86 | 2.417

BibTeX

Cite Stream-R1

Citation forthcoming — we’ll publish a BibTeX entry once the preprint is available on arXiv.