Abstract
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Long videos are in as much demand as short videos but are less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework that sits on top of existing VTG models and handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism, via a novel incorporation of contrastive learning, to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping state-of-the-art results. Code has been released at https://github.com/houzhijian/CONE.
©2023 Association for Computational Linguistics
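The sliding-window and query-guided window selection ideas mentioned in the abstract can be illustrated with a minimal sketch. This is not the released CONE implementation: the function names, the mean-pooled window representation, the cosine-similarity scoring, and the `top_k` cutoff are all assumptions chosen for illustration. It only shows the general pattern of cutting a long video into overlapping windows and keeping the few windows that best match the query before running a more expensive grounding model.

```python
import math

def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame index pairs; end is exclusive."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # extra window so the tail is covered
    return [(s, s + window) for s in starts]

def mean_pool(vectors):
    """Average a list of equal-length feature vectors (assumed pooling choice)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_windows(frame_feats, query_feat, window, stride, top_k):
    """Hypothetical coarse filter: keep the top_k windows most similar to the query."""
    scored = []
    for start, end in sliding_windows(len(frame_feats), window, stride):
        score = cosine(mean_pool(frame_feats[start:end]), query_feat)
        scored.append((score, start, end))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(start, end) for _, start, end in scored[:top_k]]

# Toy data: the first four frames are unrelated to the query, the last four match it.
frame_feats = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
query_feat = [0.0, 1.0]
selected = select_windows(frame_feats, query_feat, window=4, stride=2, top_k=1)
print(selected)
```

Under these assumptions, only the selected windows would be passed on to a fine-grained grounding model, which is where the inference savings would come from.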
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics |
| Subtitle of host publication | Volume 1: Long Papers |
| Place of Publication | Toronto |
| Publisher | Association for Computational Linguistics |
| Pages | 8013–8028 |
| Number of pages | 16 |
| Volume | 1 |
| ISBN (Electronic) | 978-1-959429-72-2 |
| Publication status | Published - 9 Jul 2023 |
| Event | 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) - Westin Harbour Castle, Toronto, Canada Duration: 9 Jul 2023 → 14 Jul 2023 https://2023.aclweb.org/ https://aclanthology.org/events/acl-2023/ |
Conference
| Conference | 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) |
|---|---|
| Abbreviated title | ACL’23 |
| Place | Canada |
| City | Toronto |
| Period | 9/07/23 → 14/07/23 |
Bibliographical note
Research Unit(s) information for this publication is provided by the author(s) concerned.
Research Keywords
- cs.CV
- cs.CL
- cs.IR
Publisher's Copyright Statement
- This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
Fingerprint
Dive into the research topics of 'Cone: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding'. Together they form a unique fingerprint.