Edit Temporal-Consistent Videos with Image Diffusion Model

  • Yuanzhi WANG
  • Yong LI
  • Xiaoya ZHANG
  • Xin LIU
  • Anbo DAI
  • Antoni B. CHAN
  • Zhen CUI

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

9 Downloads (CityUHK Scholars)

Abstract

Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing, yielding impressive zero-shot video editing performance. Nonetheless, the generated videos usually show spatial irregularities and temporal inconsistencies as the temporal characteristics of videos have not been faithfully modeled. In this article, we propose an elegant yet effective Temporal-Consistent Video Editing (TCVE) method to mitigate the temporal inconsistency challenge for robust text-guided video editing. In addition to the utilization of a pretrained T2I 2D Unet for spatial content manipulation, we establish a dedicated temporal Unet architecture to faithfully capture the temporal coherence of the input video sequences. Furthermore, to establish coherence and interrelation between the spatial-focused and temporal-focused components, a cohesive spatial-temporal modeling unit is formulated. This unit effectively interconnects the temporal Unet with the pretrained 2D Unet, thereby enhancing the temporal consistency of the generated videos while preserving the capacity for video content manipulation. Quantitative experimental results and visualization results demonstrate that TCVE achieves state-of-the-art performance in both video temporal consistency and video editing capability, surpassing existing benchmarks in the field. Codes are released at https://github.com/mdswyz/TCVE. © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
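The abstract's core idea, pairing a per-frame (spatial) pathway with a cross-frame (temporal) pathway and then combining them, can be illustrated with a toy sketch. This is not the TCVE architecture and is not taken from the released code: the function names `temporal_branch`, `fuse`, and `flicker`, the 3-frame averaging window, and the blend weight `alpha` are all assumptions made for illustration. Here the "spatial" output is just a flickering sequence standing in for independently edited frames, and the "temporal" pathway is a plain moving average over time rather than a learned Unet.

```python
from typing import List

Frame = List[float]   # one frame, flattened to a list of pixel values
Video = List[Frame]   # a video is an ordered list of frames


def temporal_branch(video: Video) -> Video:
    """Toy stand-in for a temporal pathway: average each pixel with its
    neighbours along the time axis (window of up to 3 frames)."""
    n = len(video)
    out = []
    for t in range(n):
        window = video[max(0, t - 1):min(n, t + 2)]
        out.append([sum(px) / len(window) for px in zip(*window)])
    return out


def fuse(spatial: Video, temporal: Video, alpha: float = 0.5) -> Video:
    """Hypothetical fusion step: blend per-frame and cross-frame outputs.
    alpha = 0 keeps only the spatial result, alpha = 1 only the temporal one."""
    return [
        [(1 - alpha) * s + alpha * t for s, t in zip(fs, ft)]
        for fs, ft in zip(spatial, temporal)
    ]


def flicker(video: Video) -> float:
    """Mean absolute difference between consecutive frames (lower = smoother)."""
    diffs = [
        abs(a - b)
        for f0, f1 in zip(video, video[1:])
        for a, b in zip(f0, f1)
    ]
    return sum(diffs) / len(diffs)


# Frames edited independently of each other, so brightness flips every frame.
edited: Video = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
smoothed = temporal_branch(edited)
fused = fuse(edited, smoothed, alpha=0.5)

print("flicker, per-frame only:", flicker(edited))
print("flicker, fused:", flicker(fused))
```

In this toy setting the fused sequence changes less from frame to frame than the per-frame result, while `alpha` controls how much of the per-frame edit is retained. The paper's actual unit connects a learned temporal Unet to the pretrained 2D Unet; see the authors' repository for the real implementation.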
Original language: English
Article number: 368
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 20
Issue number: 12
Online published: 26 Nov 2024
DOI: https://doi.org/10.1145/3691344
Publication status: Published - Dec 2024

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62476133 and 62102180), the Research Grants Council of Hong Kong (Collaborative Research Fund No. C7055-21GF), the Hong Kong Scholars Program, the Natural Science Foundation of Shandong Province (Grant No. ZR2022LZH003), and the Natural Science Foundation of Jiangsu Province (Grant No. BK20210328).

Research Keywords

  • spatial-temporal modeling
  • temporal Unet
  • Text-guided video editing
  • text-to-image diffusion model

Publisher's Copyright Statement

  • COPYRIGHT TERMS OF DEPOSITED POSTPRINT FILE: © Authors | ACM 2024. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Multimedia Computing, Communications, and Applications, https://doi.org/10.1145/3691344.

RGC Funding Information

  • RGC-funded
