CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review
Author(s)
Hou, Zhijian; Zhong, Wanjun; Ji, Lei et al.
Related Research Unit(s)
Detail(s)
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics |
| Subtitle of host publication | Volume 1: Long Papers |
| Place of Publication | Toronto |
| Publisher | Association for Computational Linguistics |
| Pages | 8013–8028 |
| Number of pages | 16 |
| Volume | 1 |
| ISBN (electronic) | 978-1-959429-72-2 |
| Publication status | Published - 9 Jul 2023 |
Conference
| Title | 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) |
|---|---|
| Location | Westin Harbour Castle |
| Country | Canada |
| City | Toronto |
| Period | 9 - 14 July 2023 |
Link(s)
| DOI | DOI |
|---|---|
| Attachment(s) | Documents: Publisher's Copyright Statement |
| Document Link | Links |
| Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85174386560&origin=recordpage |
| Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(ea4a4de0-71e8-44e4-b34e-39e8365def70).html |
Abstract
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are in high demand but less explored, and they bring new challenges of higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism speeds up inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at https://github.com/houzhijian/CONE. ©2023 Association for Computational Linguistics
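The abstract describes a two-stage pipeline: a cheap coarse stage that pre-selects candidate sliding windows with the query, and an expensive fine stage that grounds the query only within the selected windows. Below is a minimal Python/NumPy sketch of that control flow; it is illustrative only, not the authors' implementation (see the GitHub repository above). The function names, the mean-pooled window embeddings, and the stubbed fine stage are all assumptions made for this sketch.

```python
import numpy as np

def select_windows(query_emb, window_embs, top_k=5):
    """Coarse stage: rank candidate windows by query-window cosine
    similarity and keep only the top-k for fine-grained grounding."""
    q = query_emb / np.linalg.norm(query_emb)
    w = window_embs / np.linalg.norm(window_embs, axis=1, keepdims=True)
    scores = w @ q                      # one similarity score per window
    return np.argsort(-scores)[:top_k]  # indices of the most relevant windows

def coarse_to_fine_grounding(query_emb, clip_embs, window_size=64,
                             stride=32, top_k=5):
    """Slide fixed-size windows over the clip features, pre-select the
    top-k windows with the query (cheap), then run proposal scoring
    only inside those windows (stubbed out in this sketch)."""
    n = len(clip_embs)
    starts = range(0, max(n - window_size, 0) + 1, stride)
    windows = [(s, min(s + window_size, n)) for s in starts]
    # Window embedding by mean-pooling clip features: an assumption for
    # this sketch; CONE itself learns query-window matching contrastively.
    window_embs = np.stack([clip_embs[s:e].mean(axis=0) for s, e in windows])
    selected = []
    for idx in select_windows(query_emb, window_embs, top_k):
        s, e = windows[idx]
        # Fine stage: a base VTG model would score (start, end) moment
        # proposals within clips s..e here; omitted in this sketch.
        selected.append((s, e))
    return selected
```

The reported speed-ups (2x on Ego4D-NLQ, 15x on MAD) come from exactly this structure: scoring one embedding per window is far cheaper than running the full grounding model over every window, so most windows never reach the fine stage.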
Research Area(s)
- cs.CV, cs.CL, cs.IR
Bibliographic Note
Research Unit(s) information for this publication is provided by the author(s) concerned.
Citation Format(s)
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding. / Hou, Zhijian; Zhong, Wanjun; Ji, Lei et al.
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: Volume 1: Long Papers. Vol. 1 Toronto: Association for Computational Linguistics, 2023. p. 8013–8028.