Cone: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

52 Downloads (CityUHK Scholars)

Abstract

This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are in equally high demand but less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at https://github.com/houzhijian/CONE.

©2023 Association for Computational Linguistics
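
The abstract describes a sliding window mechanism combined with query-guided window selection. The following is a minimal sketch of that general idea, not the authors' implementation (the released repository is the reference for that). It assumes per-frame feature vectors and a query vector already live in a shared embedding space, and it scores each window by cosine similarity between the query and the mean-pooled window features. The function names, the mean pooling, and the `top_k` cutoff are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sliding_windows(num_frames: int, size: int, stride: int) -> List[Tuple[int, int]]:
    """Half-open (start, end) frame ranges that cover the whole video."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if num_frames <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + size, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return windows


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average the frame vectors of one window (assumed pooling choice)."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def select_windows(
    frame_feats: Sequence[Vector],
    query_feat: Vector,
    size: int,
    stride: int,
    top_k: int,
) -> List[Tuple[int, int]]:
    """Coarse stage: keep only the top_k windows most similar to the query.

    A fine-grained grounding model would then run on these windows only,
    which is where the inference savings would come from.
    """
    windows = sliding_windows(len(frame_feats), size, stride)
    scored = [
        (cosine(mean_pool(frame_feats[s:e]), query_feat), (s, e))
        for s, e in windows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [window for _, window in scored[:top_k]]


# Toy example: the second half of the video matches the query direction.
toy_frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
toy_query = [0.0, 1.0]
selected = select_windows(toy_frames, toy_query, size=4, stride=2, top_k=2)
```

In this toy setup the windows are ranked by how much of the query-aligned half they cover, so the window spanning frames 4 to 8 ranks first. Dropping low-scoring windows before any fine-grained model runs is the kind of pre-filtering that could account for the speedups the abstract reports, though the actual scoring function in CONE may differ.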
Original language: English
Title of host publication: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
Subtitle of host publication: Volume 1: Long Papers
Place of Publication: Toronto
Publisher: Association for Computational Linguistics
Pages: 8013–8028
Number of pages: 16
Volume: 1
ISBN (Electronic): 978-1-959429-72-2
DOIs
Publication status: Published - 9 Jul 2023
Event: 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) - Westin Harbour Castle, Toronto, Canada
Duration: 9 Jul 2023 to 14 Jul 2023
https://2023.aclweb.org/
https://aclanthology.org/events/acl-2023/

Conference

Conference: 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)
Abbreviated title: ACL’23
Place: Canada
City: Toronto
Period: 9/07/23 to 14/07/23
Internet address

Bibliographical note

Research Unit(s) information for this publication is provided by the author(s) concerned.

Research Keywords

  • cs.CV
  • cs.CL
  • cs.IR

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
