CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

4 Scopus Citations

Author(s)

  • Zhijian Hou
  • Wanjun Zhong
  • Lei Ji
  • Difei Gao
  • Kun Yan
  • Mike Zheng Shou
  • Nan Duan

Related Research Unit(s)

Detail(s)

Original language: English
Title of host publication: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
Subtitle of host publication: Volume 1: Long Papers
Place of Publication: Toronto
Publisher: Association for Computational Linguistics
Pages: 8013–8028
Number of pages: 16
Volume: 1
ISBN (electronic): 978-1-959429-72-2
Publication status: Published - 9 Jul 2023

Conference

Title: 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)
Location: Westin Harbour Castle
Place: Canada
City: Toronto
Period: 9 - 14 July 2023

Abstract

This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are equally in demand but far less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism speeds up inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at https://github.com/houzhijian/CONE. © 2023 Association for Computational Linguistics
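The two-stage pipeline sketched in the abstract (coarse query-guided window selection over a sliding window, then fine-grained moment scoring inside the surviving windows) can be illustrated with a short, self-contained sketch. Everything below is a hypothetical illustration of the coarse-to-fine idea under assumed pre-extracted frame and query features: the function names, the window_size/stride/top_k parameters, the mean-pooled window scoring, and the toy frame-level fine scorer are all assumptions of this sketch, not the authors' released implementation (see the GitHub link above for that).

    # Minimal, hypothetical sketch of coarse-to-fine grounding over a long video.
    # Assumes video_feats is an (num_frames, dim) array of frame features and
    # query_feat is a (dim,) query embedding; all names/parameters are illustrative.
    import numpy as np

    def cosine_sim(query_vec, mat):
        # Cosine similarity between one query vector and each row of `mat`.
        query_vec = query_vec / np.linalg.norm(query_vec)
        mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        return mat @ query_vec

    def toy_fine_scorer(window_feats, query_feat):
        # Stand-in for a fine-grained VTG model: score each frame against the
        # query and return a one-frame span plus its score.
        frame_scores = cosine_sim(query_feat, window_feats)
        best = int(np.argmax(frame_scores))
        return (best, best + 1), float(frame_scores[best])

    def coarse_to_fine_grounding(video_feats, query_feat, window_size=90,
                                 stride=45, top_k=5, fine_scorer=toy_fine_scorer):
        # Stage 1 (coarse): slide a window over the frame features and score each
        # window against the query; only the top-k windows survive, so the
        # expensive fine-grained model never sees the rest of the long video.
        n = len(video_feats)
        starts = range(0, max(1, n - window_size + 1), stride)
        windows = [(s, min(s + window_size, n)) for s in starts]
        window_vecs = np.stack([video_feats[s:e].mean(axis=0) for s, e in windows])
        coarse_scores = cosine_sim(query_feat, window_vecs)
        selected = [windows[i] for i in np.argsort(coarse_scores)[::-1][:top_k]]

        # Stage 2 (fine): run the fine-grained moment scorer inside each selected
        # window and keep the globally best (start, end) proposal.
        best_span, best_score = None, -np.inf
        for s, e in selected:
            (lo, hi), score = fine_scorer(video_feats[s:e], query_feat)
            if score > best_score:
                best_span, best_score = (s + lo, s + hi), score
        return best_span

    # Example: 1,000 frames of 256-d features and a 256-d query embedding.
    rng = np.random.default_rng(0)
    video = rng.standard_normal((1000, 256)).astype(np.float32)
    query = rng.standard_normal(256).astype(np.float32)
    print(coarse_to_fine_grounding(video, query))

The sketch makes the design point concrete: only the top-k windows ever reach the fine-grained stage, which is where the 2x (Ego4D-NLQ) and 15x (MAD) inference speedups reported in the abstract come from. The actual CONE fine stage additionally incorporates contrastive learning to improve multi-modal alignment and ranking, which this toy frame-level scorer does not model.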

Research Area(s)

  • cs.CV, cs.CL, cs.IR

Bibliographic Note

Research Unit(s) information for this publication is provided by the author(s) concerned.

Citation Format(s)

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding. / Hou, Zhijian; Zhong, Wanjun; Ji, Lei et al.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Volume 1: Long Papers. Vol. 1. Toronto: Association for Computational Linguistics, 2023. p. 8013–8028.
