Video Super-Resolution (VSR) has benefited significantly from large-scale pre-trained video generators, which provide powerful priors for realistic detail synthesis. Existing methods, however, rely on fine-tuning the generative backbone, incurring substantial computational costs and risking catastrophic forgetting of learned priors. We reconsider how these priors can be exploited from a frequency perspective: pre-trained generators are inherently capable of synthesizing high-frequency details given structural guidance, while low-quality videos supply reliable low-frequency information. Since low-frequency content is largely domain-agnostic, a frozen generator can perform VSR directly when the input structure is properly aligned to its embedding space. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer (DiT) with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that jointly processes static structural cues from the low-quality input and dynamic cues from intermediate denoising states through time-dependent cross-attention, enabling adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves state-of-the-art restoration quality with only 12.68% trainable parameters and 12 GPU-hours of training on a single A100, while preserving compatibility with off-the-shelf fast sampling algorithms.
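The frequency argument above can be illustrated with a toy 1-D signal. This is only a sketch under stated assumptions: the signal, the blur-only degradation, and the helper names `box_blur` and `mean_abs_gap` are hypothetical and are not the paper's degradation pipeline or analysis. It compares how far a degraded signal is from the clean one overall versus within the low-frequency band.

```python
import math


def box_blur(signal, taps):
    """Circular moving average; an even tap count cancels a period-2 texture."""
    n = len(signal)
    return [
        sum(signal[(i + k - taps // 2) % n] for k in range(taps)) / taps
        for i in range(n)
    ]


def mean_abs_gap(a, b):
    """Mean absolute difference between two equally long signals."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


n = 64
# Hypothetical clean signal: slow "structure" plus fast alternating "texture".
structure = [0.5 * math.sin(2 * math.pi * i / n) for i in range(n)]
texture = [0.2 * (-1) ** i for i in range(n)]
high_quality = [s + d for s, d in zip(structure, texture)]

# Assumed degradation: a 2-tap blur that wipes out the texture.
low_quality = box_blur(high_quality, taps=2)

# Gap over the full band vs. gap after keeping only low frequencies (8-tap blur).
full_gap = mean_abs_gap(high_quality, low_quality)
low_freq_gap = mean_abs_gap(
    box_blur(high_quality, taps=8), box_blur(low_quality, taps=8)
)
```

In this toy setting `low_freq_gap` is much smaller than `full_gap`: the degraded input loses the texture but still agrees with the clean signal on coarse structure, which is the part a frozen generator would need as guidance.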
LiteVSR keeps the pre-trained DiT backbone entirely frozen and steers generation through a lightweight State-Aware Adapter. The adapter contains two parallel streams — a Structural Stream that extracts domain-agnostic layout from the low-quality input, and a Refinement Stream that reads the current clean estimate — fused by a time-modulated cross-attention layer. As denoising proceeds, attention shifts progressively from structural alignment to texture refinement.
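One way such a time-modulated fusion could behave is sketched below. The mechanism here is an assumption chosen for illustration, not the authors' layer: a log-weight bias on the attention logits, with `t` running from 1 (early, noisy) to 0 (late, clean), tokens reused as both keys and values, and the hypothetical function name `time_modulated_cross_attention`.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def time_modulated_cross_attention(query, structural, refinement, t):
    """Fuse two token streams for one query vector.

    Assumption: t in [0, 1], with 1 = start of denoising and 0 = end.
    Structural tokens get a log(t) bias, refinement tokens a log(1 - t) bias,
    so attention mass moves from structure to refinement as t decreases.
    """
    eps = 1e-6
    w = min(max(t, eps), 1.0 - eps)  # keep both logs finite
    tokens = structural + refinement  # tokens act as keys and values here
    bias = [math.log(w)] * len(structural) + [math.log(1.0 - w)] * len(refinement)
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, tok)) + b
        for tok, b in zip(tokens, bias)
    ]
    weights = softmax(logits)
    fused = [
        sum(wgt * tok[d] for wgt, tok in zip(weights, tokens))
        for d in range(len(query))
    ]
    structural_mass = sum(weights[: len(structural)])
    return fused, structural_mass


# Hypothetical two-dimensional tokens, identical in both streams.
query = [1.0, 0.0]
structural_tokens = [[1.0, 0.0], [0.0, 1.0]]
refinement_tokens = [[1.0, 0.0], [0.0, 1.0]]

_, early_mass = time_modulated_cross_attention(
    query, structural_tokens, refinement_tokens, t=0.9
)
_, late_mass = time_modulated_cross_attention(
    query, structural_tokens, refinement_tokens, t=0.1
)
```

Because both streams carry identical tokens in this example, the share of attention on the structural stream equals `t` itself, so it is larger early in denoising (`early_mass`) than late (`late_mass`). A real adapter would use learned projections and a learned timestep embedding instead of this fixed schedule.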
- Frozen DiT: All backbone blocks remain entirely frozen.
- 12.68%: Trainable parameters only.
- 12 GPU-h: Training on a single A100.
- Plug-and-play: Compatible with off-the-shelf fast sampling.
Quantitative comparison on REDS4, UDM10, SPMCS, YouHQ40 (synthetic), and VideoLQ (real-world). Best results are in bold; second-best are underlined.
| Dataset | Metric | Upscale-A-Video | MGLD-VSR | STAR | FlashVSR | DOVE | DiffVSR | LiteVSR |
|---|---|---|---|---|---|---|---|---|
| REDS4 | PSNR ↑ | 20.22 | 21.90 | 21.37 | 20.67 | 23.08 | 21.08 | 21.10 |
| | LPIPS ↓ | 0.4731 | 0.3190 | 0.4349 | 0.3202 | 0.3732 | 0.3677 | 0.3081 |
| | DISTS ↓ | 0.2539 | 0.1325 | 0.1763 | 0.1315 | 0.1982 | 0.1552 | 0.1359 |
| | CLIPIQA ↑ | 0.2042 | 0.2970 | 0.2045 | 0.3186 | 0.3017 | 0.2877 | 0.3748 |
| | DOVER ↑ | 0.2853 | 0.3376 | 0.3320 | 0.3451 | 0.3402 | 0.3019 | 0.3622 |
| | NIQE ↓ | 5.2102 | 3.5366 | 4.5904 | 2.9378 | 4.9108 | 3.1590 | 2.6938 |
| | MUSIQ ↑ | 39.95 | 60.87 | 43.15 | 62.74 | 57.07 | 64.71 | 65.99 |
| UDM10 | PSNR ↑ | 22.76 | 23.96 | 24.15 | 23.32 | 25.74 | 22.34 | 23.01 |
| | LPIPS ↓ | 0.4246 | 0.3231 | 0.4069 | 0.2738 | 0.2759 | 0.3341 | 0.3266 |
| | DISTS ↓ | 0.2427 | 0.1533 | 0.2107 | 0.1354 | 0.1537 | 0.1799 | 0.1640 |
| | CLIPIQA ↑ | 0.2515 | 0.4286 | 0.2214 | 0.4958 | 0.5348 | 0.3550 | 0.5580 |
| | DOVER ↑ | 0.2484 | 0.3899 | 0.2270 | 0.4618 | 0.4673 | 0.4400 | 0.5150 |
| | NIQE ↓ | 6.3404 | 3.9219 | 6.0595 | 3.9426 | 5.1821 | 4.8054 | 3.8333 |
| | MUSIQ ↑ | 35.89 | 60.71 | 32.56 | 67.51 | 65.11 | 57.40 | 70.02 |
| SPMCS | PSNR ↑ | 19.09 | 20.78 | 20.44 | 20.33 | 21.75 | 19.93 | 19.76 |
| | LPIPS ↓ | 0.5230 | 0.4046 | 0.4826 | 0.3536 | 0.3682 | 0.4232 | 0.3808 |
| | DISTS ↓ | 0.3151 | 0.2074 | 0.2546 | 0.1949 | 0.1973 | 0.2978 | 0.1917 |
| | CLIPIQA ↑ | 0.3190 | 0.4616 | 0.3206 | 0.4823 | 0.5681 | 0.4021 | 0.5726 |
| | DOVER ↑ | 0.2126 | 0.3091 | 0.2745 | 0.4065 | 0.3800 | 0.3448 | 0.4093 |
| | NIQE ↓ | 5.7175 | 3.7654 | 5.7116 | 3.5318 | 4.9439 | 4.5756 | 3.4324 |
| | MUSIQ ↑ | 41.52 | 65.41 | 44.72 | 70.33 | 69.83 | 67.24 | 70.42 |
| YouHQ40 | PSNR ↑ | 20.99 | 22.12 | 22.66 | 21.21 | 23.67 | 20.59 | 21.28 |
| | LPIPS ↓ | 0.4964 | 0.3781 | 0.4747 | 0.3049 | 0.3377 | 0.3909 | 0.3842 |
| | DISTS ↓ | 0.2529 | 0.1570 | 0.2120 | 0.1248 | 0.1639 | 0.1854 | 0.1816 |
| | CLIPIQA ↑ | 0.2846 | 0.4413 | 0.2560 | 0.5278 | 0.4919 | 0.3976 | 0.5741 |
| | DOVER ↑ | 0.3747 | 0.5019 | 0.3521 | 0.5766 | 0.5805 | 0.4769 | 0.5984 |
| | NIQE ↓ | 6.5980 | 3.6783 | 6.3965 | 3.8682 | 4.9591 | 4.7449 | 3.5094 |
| | MUSIQ ↑ | 31.40 | 59.33 | 27.67 | 69.51 | 62.86 | 55.60 | 68.67 |
| VideoLQ | CLIPIQA ↑ | 0.2496 | 0.4524 | 0.2629 | 0.4236 | 0.3228 | 0.2895 | 0.4681 |
| | DOVER ↑ | 0.3107 | 0.3389 | 0.3961 | 0.5037 | 0.4592 | 0.4202 | 0.4846 |
| | NIQE ↓ | 6.0349 | 3.8245 | 6.2112 | 3.8623 | 5.3030 | 4.7311 | 3.7600 |
| | MUSIQ ↑ | 27.07 | 49.07 | 33.94 | 56.14 | 44.69 | 44.94 | 59.05 |
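The best and second-best marking described in the caption depends on each metric's direction arrow. A minimal sketch of that ranking, using the REDS4 PSNR and LPIPS rows copied from the table; the helper name `top_two` is hypothetical, and ties are not handled.

```python
METHODS = [
    "Upscale-A-Video", "MGLD-VSR", "STAR", "FlashVSR",
    "DOVE", "DiffVSR", "LiteVSR",
]


def top_two(scores, higher_is_better):
    """Return (best, second_best) method names for one table row."""
    ranked = sorted(
        zip(METHODS, scores), key=lambda pair: pair[1], reverse=higher_is_better
    )
    return ranked[0][0], ranked[1][0]


# REDS4 rows from the table above, in column order.
reds4_psnr = [20.22, 21.90, 21.37, 20.67, 23.08, 21.08, 21.10]   # higher is better
reds4_lpips = [0.4731, 0.3190, 0.4349, 0.3202, 0.3732, 0.3677, 0.3081]  # lower is better

best_psnr, second_psnr = top_two(reds4_psnr, higher_is_better=True)
best_lpips, second_lpips = top_two(reds4_lpips, higher_is_better=False)
```

On these two rows the ranking gives DOVE then MGLD-VSR for PSNR, and LiteVSR then MGLD-VSR for LPIPS.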
High PSNR does not imply high perceptual quality. In the visual comparisons, methods that optimize fidelity (e.g., STAR, DOVE) often produce over-smoothed textures, while LiteVSR, by leveraging a frozen generative prior, synthesizes faithful high-frequency details.
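A toy calculation shows why PSNR can favor over-smoothed output. This is a sketch on a hypothetical 1-D signal, not the paper's evaluation: it uses the standard definition PSNR = 10 log10(peak² / MSE) and compares a flat estimate against one that has the right texture amplitude but is shifted by one sample.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals in [0, peak]."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)


n = 64
# Hypothetical ground truth: a fine alternating texture around mid-gray.
reference = [0.5 + 0.2 * (-1) ** i for i in range(n)]

# Over-smoothed estimate: texture removed entirely.
smoothed = [0.5] * n

# Textured estimate: correct amplitude, misaligned by one sample.
textured = [0.5 + 0.2 * (-1) ** (i + 1) for i in range(n)]

psnr_smoothed = psnr(reference, smoothed)
psnr_textured = psnr(reference, textured)
```

Here the flat estimate scores about 14 dB and the textured one about 8 dB, even though only the latter contains any detail. Pixel-wise metrics penalize misplaced texture more than missing texture, which is why the table reports perceptual and no-reference metrics alongside PSNR.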
@article{cao2026litevsr,
author = {Cao, Yu and Liu, Ziquan and Zhang, Zhensong and Deng, Jiankang and Gong, Shaogang and Song, Jifei},
title = {LiteVSR: Unleashing the Potential of Frozen Diffusion Transformers for Video Super-Resolution},
journal = {arXiv preprint arXiv:2606.09250},
year = {2026},
}