LiteVSR: Unleashing the Potential of Frozen Diffusion Transformers for Video Super-Resolution

Abstract

Video Super-Resolution (VSR) has benefited significantly from large-scale pre-trained video generators, which provide powerful priors for realistic detail synthesis. Existing methods, however, rely on fine-tuning the generative backbone, incurring substantial computational costs and risking catastrophic forgetting of learned priors. We reconsider how these priors can be exploited from a frequency perspective: pre-trained generators are inherently capable of synthesizing high-frequency details given structural guidance, while low-quality videos supply reliable low-frequency information. Since low-frequency content is largely domain-agnostic, a frozen generator can perform VSR directly when the input structure is properly aligned to its embedding space. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer (DiT) with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that jointly processes static structural cues from the low-quality input and dynamic cues from intermediate denoising states through time-dependent cross-attention, enabling adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves state-of-the-art restoration quality with only 12.68% trainable parameters and 12 GPU-hours of training on a single A100, while preserving compatibility with off-the-shelf fast sampling algorithms.

Method

LiteVSR keeps the pre-trained DiT backbone entirely frozen and steers generation through a lightweight State-Aware Adapter. The adapter contains two parallel streams: a Structural Stream that extracts domain-agnostic layout from the low-quality input, and a Refinement Stream that reads the clean estimate predicted at the current denoising step. A time-modulated cross-attention layer fuses the two streams, so that attention shifts progressively from structural alignment to texture refinement as denoising proceeds.
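The fusion described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the authors' implementation: the token shapes, the single-head attention without learned projections, the convention that `t = 1` is pure noise and `t = 0` is clean, and the additive logit bias controlled by the hypothetical `sharpness` parameter are all assumptions. The source states only that the two streams are fused by a time-modulated cross-attention layer.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def time_modulated_cross_attention(query, struct_tokens, refine_tokens, t,
                                   sharpness=4.0):
    """Toy time-modulated cross-attention over two token streams.

    query:         (N, D) tokens that attend to the adapter streams (assumed)
    struct_tokens: (S, D) tokens from the low-quality input
    refine_tokens: (R, D) tokens from the current clean estimate
    t:             scalar in [0, 1]; 1 = pure noise, 0 = clean (assumed)

    Returns the fused (N, D) features and, per query, the total attention
    mass placed on the structural stream.
    """
    d = query.shape[-1]
    n_struct = struct_tokens.shape[0]
    keys = np.concatenate([struct_tokens, refine_tokens], axis=0)  # (S+R, D)
    logits = query @ keys.T / np.sqrt(d)                           # (N, S+R)

    # Assumed modulation: favor structural tokens early (t near 1) and
    # refinement tokens late (t near 0) via a signed additive bias.
    bias = sharpness * (2.0 * t - 1.0)
    logits[:, :n_struct] += bias
    logits[:, n_struct:] -= bias

    weights = softmax(logits, axis=-1)
    fused = weights @ keys  # keys double as values to keep the sketch short
    struct_mass = weights[:, :n_struct].sum(axis=-1)
    return fused, struct_mass


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
s = rng.standard_normal((8, 16))
r = rng.standard_normal((8, 16))

_, early = time_modulated_cross_attention(q, s, r, t=0.9)
_, late = time_modulated_cross_attention(q, s, r, t=0.1)
print("mean structural attention, early step:", early.mean())
print("mean structural attention, late step: ", late.mean())
```

In this sketch the attention mass on the structural stream is higher at the early step than at the late step for every query, because the bias enters the softmax monotonically. A learned adapter would presumably realize the same shift through trained, timestep-conditioned parameters rather than a fixed bias.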

Frozen DiT: all backbone blocks remain entirely frozen.

12.68%: trainable parameters only.

12 GPU-h: training on a single A100.

Plug-and-play: compatible with off-the-shelf fast sampling.

Quantitative Results

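Quantitative comparison on the synthetic benchmarks REDS4, UDM10, SPMCS, and YouHQ40, and on the real-world benchmark VideoLQ, for which only no-reference metrics are reported. ↑ indicates that higher is better; ↓ indicates that lower is better. Best results are in bold; second-best are underlined.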

Dataset	Metric	Upscale-A-Video	MGLD-VSR	STAR	FlashVSR	DOVE	DiffVSR	LiteVSR
REDS4	PSNR ↑	20.22	21.90	21.37	20.67	23.08	21.08	21.10
	LPIPS ↓	0.4731	0.3190	0.4349	0.3202	0.3732	0.3677	0.3081
	DISTS ↓	0.2539	0.1325	0.1763	0.1315	0.1982	0.1552	0.1359
	CLIPIQA ↑	0.2042	0.2970	0.2045	0.3186	0.3017	0.2877	0.3748
	DOVER ↑	0.2853	0.3376	0.3320	0.3451	0.3402	0.3019	0.3622
	NIQE ↓	5.2102	3.5366	4.5904	2.9378	4.9108	3.1590	2.6938
	MUSIQ ↑	39.95	60.87	43.15	62.74	57.07	64.71	65.99
UDM10	PSNR ↑	22.76	23.96	24.15	23.32	25.74	22.34	23.01
	LPIPS ↓	0.4246	0.3231	0.4069	0.2738	0.2759	0.3341	0.3266
	DISTS ↓	0.2427	0.1533	0.2107	0.1354	0.1537	0.1799	0.1640
	CLIPIQA ↑	0.2515	0.4286	0.2214	0.4958	0.5348	0.3550	0.5580
	DOVER ↑	0.2484	0.3899	0.2270	0.4618	0.4673	0.4400	0.5150
	NIQE ↓	6.3404	3.9219	6.0595	3.9426	5.1821	4.8054	3.8333
	MUSIQ ↑	35.89	60.71	32.56	67.51	65.11	57.40	70.02
SPMCS	PSNR ↑	19.09	20.78	20.44	20.33	21.75	19.93	19.76
	LPIPS ↓	0.5230	0.4046	0.4826	0.3536	0.3682	0.4232	0.3808
	DISTS ↓	0.3151	0.2074	0.2546	0.1949	0.1973	0.2978	0.1917
	CLIPIQA ↑	0.3190	0.4616	0.3206	0.4823	0.5681	0.4021	0.5726
	DOVER ↑	0.2126	0.3091	0.2745	0.4065	0.3800	0.3448	0.4093
	NIQE ↓	5.7175	3.7654	5.7116	3.5318	4.9439	4.5756	3.4324
	MUSIQ ↑	41.52	65.41	44.72	70.33	69.83	67.24	70.42
YouHQ40	PSNR ↑	20.99	22.12	22.66	21.21	23.67	20.59	21.28
	LPIPS ↓	0.4964	0.3781	0.4747	0.3049	0.3377	0.3909	0.3842
	DISTS ↓	0.2529	0.1570	0.2120	0.1248	0.1639	0.1854	0.1816
	CLIPIQA ↑	0.2846	0.4413	0.2560	0.5278	0.4919	0.3976	0.5741
	DOVER ↑	0.3747	0.5019	0.3521	0.5766	0.5805	0.4769	0.5984
	NIQE ↓	6.5980	3.6783	6.3965	3.8682	4.9591	4.7449	3.5094
	MUSIQ ↑	31.40	59.33	27.67	69.51	62.86	55.60	68.67
VideoLQ	CLIPIQA ↑	0.2496	0.4524	0.2629	0.4236	0.3228	0.2895	0.4681
	DOVER ↑	0.3107	0.3389	0.3961	0.5037	0.4592	0.4202	0.4846
	NIQE ↓	6.0349	3.8245	6.2112	3.8623	5.3030	4.7311	3.7600
	MUSIQ ↑	27.07	49.07	33.94	56.14	44.69	44.94	59.05

Qualitative Comparisons

Video Comparisons

High PSNR does not imply high perceptual quality. The examples above illustrate how methods that optimize fidelity (e.g. STAR, DOVE) often produce over-smoothed textures, while LiteVSR, by leveraging a frozen generative prior, synthesizes faithful high-frequency details.
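A toy calculation makes the first sentence concrete. The sketch below uses synthetic Gaussian "texture" as a stand-in for high-frequency detail; the data, image size, and variable names are assumptions for illustration and are unrelated to the benchmarks above. A flat, over-smoothed estimate scores a higher PSNR than an estimate whose texture has the correct statistics but does not match the reference pixel for pixel.

```python
import numpy as np


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(max_value ** 2 / mse)


rng = np.random.default_rng(0)
mean, std = 0.5, 0.1

# Reference patch: a flat gray level plus fine random texture.
ground_truth = mean + std * rng.standard_normal((64, 64))

# Over-smoothed estimate: the texture is averaged away entirely.
over_smoothed = np.full_like(ground_truth, mean)

# Textured estimate: same statistics as the reference, but independently
# sampled, so it looks equally detailed without matching pixel-wise.
resampled_texture = mean + std * rng.standard_normal((64, 64))

psnr_smooth = psnr(ground_truth, over_smoothed)
psnr_texture = psnr(ground_truth, resampled_texture)
print(f"over-smoothed PSNR:     {psnr_smooth:.2f} dB")
print(f"resampled-texture PSNR: {psnr_texture:.2f} dB")
```

The flat estimate has an expected squared error equal to the texture variance, while the independently textured estimate has twice that, so the smooth result wins by about 3 dB in expectation despite containing no detail at all. This is why the table reports perceptual and no-reference metrics alongside PSNR.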

BibTeX

@article{cao2026litevsr,
  author  = {Cao, Yu and Liu, Ziquan and Zhang, Zhensong and Deng, Jiankang and Gong, Shaogang and Song, Jifei},
  title   = {LiteVSR: Unleashing the Potential of Frozen Diffusion Transformers for Video Super-Resolution},
  journal = {arXiv preprint arXiv:2606.09250},
  year    = {2026},
}