LiteVSR: Unleashing the Potential of Frozen Diffusion Transformers for Video Super-Resolution

1 Queen Mary University of London, 2 Huawei Darwin Research Center, 3 Imperial College London
ICML 2026

LiteVSR turns low-quality videos into high-fidelity reconstructions using a completely frozen Diffusion Transformer.

Abstract

Video Super-Resolution (VSR) has benefited significantly from large-scale pre-trained video generators, which provide powerful priors for realistic detail synthesis. Existing methods, however, rely on fine-tuning the generative backbone, incurring substantial computational costs and risking catastrophic forgetting of learned priors. We reconsider how these priors can be exploited from a frequency perspective: pre-trained generators are inherently capable of synthesizing high-frequency details given structural guidance, while low-quality videos supply reliable low-frequency information. Since low-frequency content is largely domain-agnostic, a frozen generator can perform VSR directly when the input structure is properly aligned to its embedding space. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer (DiT) with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that jointly processes static structural cues from the low-quality input and dynamic cues from intermediate denoising states through time-dependent cross-attention, enabling adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves state-of-the-art restoration quality with only 12.68% trainable parameters and 12 GPU-hours of training on a single A100, while preserving compatibility with off-the-shelf fast sampling algorithms.

Method

LiteVSR keeps the pre-trained DiT backbone entirely frozen and steers generation through a lightweight State-Aware Adapter. The adapter contains two parallel streams: a Structural Stream that extracts domain-agnostic layout from the low-quality input, and a Refinement Stream that reads the current clean estimate derived from the intermediate denoising state. A time-modulated cross-attention layer fuses the two streams, so that attention shifts progressively from structural alignment to texture refinement as denoising proceeds.
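To make the dual-stream idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. The class name `AdapterSketch`, the random weights, the `tanh` projections, and in particular the log-bias gating rule are assumptions chosen for illustration; the page does not specify how the cross-attention is modulated by time. The sketch assumes `t` runs from 1 (noisiest step) to 0 (cleanest step) and reuses the same inputs at both steps so that only the effect of `t` is visible.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AdapterSketch:
    """Toy dual-stream adapter; weights and gating rule are illustrative only."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        names = ("struct", "refine", "q", "k", "v")
        self.w = {n: rng.normal(0.0, scale, (dim, dim)) for n in names}
        self.dim = dim

    def __call__(self, hidden, lq_tokens, x0_tokens, t):
        # t in (0, 1): 1 is the noisiest step, 0 the cleanest (assumed convention).
        struct = np.tanh(lq_tokens @ self.w["struct"])  # Structural Stream
        refine = np.tanh(x0_tokens @ self.w["refine"])  # Refinement Stream
        context = np.concatenate([struct, refine], axis=0)

        # Queries come from the frozen backbone's tokens, keys/values from both streams.
        q = hidden @ self.w["q"]
        k = context @ self.w["k"]
        v = context @ self.w["v"]
        logits = q @ k.T / np.sqrt(self.dim)

        # Hypothetical time modulation: an additive log-bias that favors
        # structural keys at large t and refinement keys at small t.
        eps = 1e-6
        bias = np.concatenate([
            np.full(len(struct), np.log(max(t, eps))),
            np.full(len(refine), np.log(max(1.0 - t, eps))),
        ])
        attn = softmax(logits + bias, axis=-1)
        struct_mass = attn[:, :len(struct)].sum(axis=1).mean()

        # The guidance is added as a residual; the backbone weights are never touched.
        return hidden + attn @ v, struct_mass


rng = np.random.default_rng(1)
hidden = rng.normal(size=(4, 8))  # stand-in for frozen DiT tokens
lq = rng.normal(size=(6, 8))      # stand-in for low-quality input tokens
x0 = rng.normal(size=(6, 8))      # stand-in for the current clean estimate

adapter = AdapterSketch(dim=8)
out_early, mass_early = adapter(hidden, lq, x0, t=0.9)
out_late, mass_late = adapter(hidden, lq, x0, t=0.1)
print("structural attention mass early vs late:", mass_early, mass_late)
```

With this particular bias, the share of attention on the structural keys grows monotonically with `t` for fixed inputs, which mirrors the described shift from structural alignment early in denoising to texture refinement later. A learned modulation would replace the fixed `log(t)` rule.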

LiteVSR pipeline

Frozen DiT: all backbone blocks remain entirely frozen.

12.68%: trainable parameters only.

12 GPU-h: training on a single A100.

Plug-and-play: compatible with off-the-shelf fast sampling.

Quantitative Results

Quantitative comparison on REDS4, UDM10, SPMCS, YouHQ40 (synthetic), and VideoLQ (real-world). Best results are in bold; second-best are underlined.

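| Dataset | Metric | Upscale-A-Video | MGLD-VSR | STAR | FlashVSR | DOVE | DiffVSR | LiteVSR |
|---|---|---|---|---|---|---|---|---|
| REDS4 | PSNR ↑ | 20.22 | 21.90 | 21.37 | 20.67 | 23.08 | 21.08 | 21.10 |
| REDS4 | LPIPS ↓ | 0.4731 | 0.3190 | 0.4349 | 0.3202 | 0.3732 | 0.3677 | 0.3081 |
| REDS4 | DISTS ↓ | 0.2539 | 0.1325 | 0.1763 | 0.1315 | 0.1982 | 0.1552 | 0.1359 |
| REDS4 | CLIPIQA ↑ | 0.2042 | 0.2970 | 0.2045 | 0.3186 | 0.3017 | 0.2877 | 0.3748 |
| REDS4 | DOVER ↑ | 0.2853 | 0.3376 | 0.3320 | 0.3451 | 0.3402 | 0.3019 | 0.3622 |
| REDS4 | NIQE ↓ | 5.2102 | 3.5366 | 4.5904 | 2.9378 | 4.9108 | 3.1590 | 2.6938 |
| REDS4 | MUSIQ ↑ | 39.95 | 60.87 | 43.15 | 62.74 | 57.07 | 64.71 | 65.99 |
| UDM10 | PSNR ↑ | 22.76 | 23.96 | 24.15 | 23.32 | 25.74 | 22.34 | 23.01 |
| UDM10 | LPIPS ↓ | 0.4246 | 0.3231 | 0.4069 | 0.2738 | 0.2759 | 0.3341 | 0.3266 |
| UDM10 | DISTS ↓ | 0.2427 | 0.1533 | 0.2107 | 0.1354 | 0.1537 | 0.1799 | 0.1640 |
| UDM10 | CLIPIQA ↑ | 0.2515 | 0.4286 | 0.2214 | 0.4958 | 0.5348 | 0.3550 | 0.5580 |
| UDM10 | DOVER ↑ | 0.2484 | 0.3899 | 0.2270 | 0.4618 | 0.4673 | 0.4400 | 0.5150 |
| UDM10 | NIQE ↓ | 6.3404 | 3.9219 | 6.0595 | 3.9426 | 5.1821 | 4.8054 | 3.8333 |
| UDM10 | MUSIQ ↑ | 35.89 | 60.71 | 32.56 | 67.51 | 65.11 | 57.40 | 70.02 |
| SPMCS | PSNR ↑ | 19.09 | 20.78 | 20.44 | 20.33 | 21.75 | 19.93 | 19.76 |
| SPMCS | LPIPS ↓ | 0.5230 | 0.4046 | 0.4826 | 0.3536 | 0.3682 | 0.4232 | 0.3808 |
| SPMCS | DISTS ↓ | 0.3151 | 0.2074 | 0.2546 | 0.1949 | 0.1973 | 0.2978 | 0.1917 |
| SPMCS | CLIPIQA ↑ | 0.3190 | 0.4616 | 0.3206 | 0.4823 | 0.5681 | 0.4021 | 0.5726 |
| SPMCS | DOVER ↑ | 0.2126 | 0.3091 | 0.2745 | 0.4065 | 0.3800 | 0.3448 | 0.4093 |
| SPMCS | NIQE ↓ | 5.7175 | 3.7654 | 5.7116 | 3.5318 | 4.9439 | 4.5756 | 3.4324 |
| SPMCS | MUSIQ ↑ | 41.52 | 65.41 | 44.72 | 70.33 | 69.83 | 67.24 | 70.42 |
| YouHQ40 | PSNR ↑ | 20.99 | 22.12 | 22.66 | 21.21 | 23.67 | 20.59 | 21.28 |
| YouHQ40 | LPIPS ↓ | 0.4964 | 0.3781 | 0.4747 | 0.3049 | 0.3377 | 0.3909 | 0.3842 |
| YouHQ40 | DISTS ↓ | 0.2529 | 0.1570 | 0.2120 | 0.1248 | 0.1639 | 0.1854 | 0.1816 |
| YouHQ40 | CLIPIQA ↑ | 0.2846 | 0.4413 | 0.2560 | 0.5278 | 0.4919 | 0.3976 | 0.5741 |
| YouHQ40 | DOVER ↑ | 0.3747 | 0.5019 | 0.3521 | 0.5766 | 0.5805 | 0.4769 | 0.5984 |
| YouHQ40 | NIQE ↓ | 6.5980 | 3.6783 | 6.3965 | 3.8682 | 4.9591 | 4.7449 | 3.5094 |
| YouHQ40 | MUSIQ ↑ | 31.40 | 59.33 | 27.67 | 69.51 | 62.86 | 55.60 | 68.67 |
| VideoLQ | CLIPIQA ↑ | 0.2496 | 0.4524 | 0.2629 | 0.4236 | 0.3228 | 0.2895 | 0.4681 |
| VideoLQ | DOVER ↑ | 0.3107 | 0.3389 | 0.3961 | 0.5037 | 0.4592 | 0.4202 | 0.4846 |
| VideoLQ | NIQE ↓ | 6.0349 | 3.8245 | 6.2112 | 3.8623 | 5.3030 | 4.7311 | 3.7600 |
| VideoLQ | MUSIQ ↑ | 27.07 | 49.07 | 33.94 | 56.14 | 44.69 | 44.94 | 59.05 |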
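The best and second-best entries in each row follow from the arrow next to the metric: ↑ means higher is better, ↓ means lower is better. As a small sketch of that rule, the snippet below ranks the seven methods for two REDS4 rows copied from the table. The helper name `top_two` and the dictionary layout are illustrative choices, not part of the released code.

```python
METHODS = [
    "Upscale-A-Video", "MGLD-VSR", "STAR", "FlashVSR",
    "DOVE", "DiffVSR", "LiteVSR",
]

# Two REDS4 rows copied from the table; True means higher is better.
ROWS = {
    "PSNR": (True, [20.22, 21.90, 21.37, 20.67, 23.08, 21.08, 21.10]),
    "NIQE": (False, [5.2102, 3.5366, 4.5904, 2.9378, 4.9108, 3.1590, 2.6938]),
}


def top_two(values, higher_is_better):
    """Return the (best, second-best) method names for one table row."""
    ranked = sorted(zip(values, METHODS), reverse=higher_is_better)
    return ranked[0][1], ranked[1][1]


for metric, (higher_is_better, values) in ROWS.items():
    best, second = top_two(values, higher_is_better)
    print(f"REDS4 {metric}: best={best}, second-best={second}")
# prints:
# REDS4 PSNR: best=DOVE, second-best=MGLD-VSR
# REDS4 NIQE: best=LiteVSR, second-best=FlashVSR
```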

Qualitative Comparisons

Qualitative comparison across methods

Video Comparisons

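High PSNR does not imply high perceptual quality. In the quantitative results, DOVE attains the highest PSNR on all four synthetic benchmarks, yet LiteVSR scores better on CLIPIQA, DOVER, NIQE, and MUSIQ on each of them. The examples above illustrate how methods that optimize fidelity (e.g., STAR, DOVE) often produce over-smoothed textures, while LiteVSR, by leveraging a frozen generative prior, synthesizes faithful high-frequency details.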
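A toy example can show why PSNR favors smoothing. The sketch below is a one-dimensional illustration under simplifying assumptions, not the evaluation protocol used for the table: the signals are made up, and `psnr` is a plain implementation of the standard definition with a peak value of 1.0. An estimate that averages the texture away scores higher than one that keeps the full texture but displaces it by a single sample.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


reference = [0.0, 1.0] * 8  # fine alternating texture
smooth = [0.5] * 16         # texture averaged away entirely
shifted = [1.0, 0.0] * 8    # same texture, displaced by one sample

psnr_smooth = psnr(reference, smooth)    # MSE = 0.25
psnr_shifted = psnr(reference, shifted)  # MSE = 1.0

print(f"smooth:  {psnr_smooth:.2f} dB")   # smooth:  6.02 dB
print(f"shifted: {psnr_shifted:.2f} dB")  # shifted: 0.00 dB
```

The flat signal contains no texture at all, yet it wins by about 6 dB, which is why perceptual metrics are reported alongside PSNR.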

BibTeX

@article{cao2026litevsr,
  author  = {Cao, Yu and Liu, Ziquan and Zhang, Zhensong and Deng, Jiankang and Gong, Shaogang and Song, Jifei},
  title   = {LiteVSR: Unleashing the Potential of Frozen Diffusion Transformers for Video Super-Resolution},
  journal = {arXiv preprint arXiv:2606.09250},
  year    = {2026},
}