Diffusion models have significantly advanced video super-resolution (VSR) by enhancing perceptual quality, largely through elaborately designed temporal modeling to ensure inter-frame consistency. However, existing methods usually suffer from limited temporal coherence and prohibitively high computational costs (e.g., typically requiring over 8 NVIDIA A100-80G GPUs), especially for long videos. In this work, we propose LiftVSR, an efficient VSR framework that leverages and elevates the image-wise diffusion prior from PixArt-$\alpha$, achieving state-of-the-art results using only 4$\times$RTX 4090 GPUs. To balance long-term consistency and efficiency, we introduce a hybrid temporal modeling mechanism that decomposes temporal learning into two complementary components: (i) Dynamic Temporal Attention (DTA) for fine-grained temporal modeling within short frame segment (\ie, low complexity), and (ii) Attention Memory Cache (AMC) for long-term temporal modeling across segments (\ie, consistency). Specifically, DTA identifies multiple token flows across frames within multi-head query and key tokens to warp inter-frame contexts in the value tokens. AMC adaptively aggregates historical segment information via a cache unit, ensuring long-term coherence with minimal overhead. To further stabilize the cache interaction during inference, we introduce an asymmetric sampling strategy that mitigates feature mismatches arising from different diffusion sampling steps. Extensive experiments on several typical VSR benchmarks have demonstrated that LiftVSR achieves impressive performance with significantly lower computational costs.
In this paper, we propose a novel pipeline, LiftVSR, with hybrid temporal modeling mechanisms and asymmetric sampling strategy. The LiftVSR is built upon a pre-trained DiT model, including Dynamic Temporal Attention (DTA) module for fine-grained temporal modeling within short-term segments, and Attention Memory Cache (AMC) module for long-term temporal feature propagation. We further introduce an asymmetric sampling strategy to mitigate feature mismatches arising from different diffusion sampling steps to enable flexible cache interaction across segments. With the proposed innovations, LiftVSR achieves significant improvements in temporal consistency and visual quality while maintaining low training and inference costs.