
End-to-End Dense Video Captioning with Parallel Decoding

  • Teng Wang
  • Ruimao Zhang
  • Zhichao Lu
  • Feng Zheng*
  • Ran Cheng
  • Ping Luo

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works | RGC 32 - Refereed conference paper (with host publication) | peer-review

Abstract

Dense video captioning aims to generate multiple associated captions with their temporal locations from the video. Previous methods follow a sophisticated “localize-then-describe” scheme, which heavily relies on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), by formulating the dense caption generation as a set prediction task. In practice, through stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content, which effectively increases the coherence and readability of predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) Without relying on heuristic non-maximum suppression or a recurrent event sequence selection network to remove redundancy, PDVC directly produces an event set with an appropriate size; (2) In contrast to adopting the two-stage scheme, we feed the enhanced representations of event queries into the localization head and caption head in parallel, making these two sub-tasks deeply interrelated and mutually promoted through the optimization; (3) Without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results, surpassing the state-of-the-art two-stage methods when its localization accuracy is on par with them. Code is available at https://github.com/ttengwang/PDVC. © 2021 IEEE
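Advantage (1) in the abstract can be illustrated with a minimal sketch: instead of pruning overlapping proposals with non-maximum suppression, an event counter predicts how many events the video contains, and that many decoded queries are kept. The sketch below is not taken from the PDVC repository. The `EventQuery` type, its fields, the `select_events` function, and the choice to rank by a single confidence score and then order by start time are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class EventQuery(NamedTuple):
    """One decoded event query (hypothetical fields, for illustration only)."""
    start: float       # predicted start time in seconds
    end: float         # predicted end time in seconds
    caption: str       # caption decoded for this query
    confidence: float  # assumed per-query confidence score


def select_events(queries: List[EventQuery], predicted_count: int) -> List[EventQuery]:
    """Keep the `predicted_count` most confident queries, ordered by start time.

    No non-maximum suppression is applied: the size of the output set is
    decided only by the (assumed) event counter output `predicted_count`.
    """
    # Clamp the counter output to the valid range [0, len(queries)].
    n = max(0, min(predicted_count, len(queries)))
    # Rank all queries by confidence and keep the top n.
    top = sorted(queries, key=lambda q: q.confidence, reverse=True)[:n]
    # Present the selected events in temporal order.
    return sorted(top, key=lambda q: q.start)
```

In this sketch the localization and caption outputs travel together in one `EventQuery`, which loosely mirrors the parallel heads described in advantage (2): both are read from the same query representation rather than produced in two stages.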
Original language: English
Title of host publication: Proceedings - 2021 IEEE/CVF International Conference on Computer Vision
Subtitle of host publication: ICCV 2021
Publisher: IEEE
Pages: 6827-6837
ISBN (Electronic): 9781665428125
ISBN (Print): 9781665428132
DOIs
Publication status: Published - Oct 2021
Externally published: Yes
Event: 18th IEEE/CVF International Conference on Computer Vision (ICCV 2021) - Virtual, Montreal, Canada
Duration: 11 Oct 2021 to 17 Oct 2021
https://iccv2021.thecvf.com/home

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499
ISSN (Electronic): 2380-7504

Conference

Conference: 18th IEEE/CVF International Conference on Computer Vision (ICCV 2021)
Abbreviated title: ICCV2021
Place: Canada
City: Montreal
Period: 11/10/21 to 17/10/21
Internet address
