Scalable Video Object Segmentation with Simplified Framework
Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review
Detail(s)
Original language | English |
---|---|
Title of host publication | Proceedings - 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023) |
Publisher | Institute of Electrical and Electronics Engineers, Inc. |
Pages | 13833-13843 |
ISBN (electronic) | 979-8-3503-0718-4 |
Publication status | Published - Oct 2023 |
Conference
Title | IEEE International Conference on Computer Vision 2023 (ICCV 2023) |
---|---|
Location | Paris Convention Center |
Place | France |
City | Paris |
Period | 2 - 6 October 2023 |
Link(s)
DOI | DOI |
---|---|
Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85175138104&origin=recordpage |
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(1612bc32-8a86-4ef5-9d8c-325ef8aa9c08).html |
Abstract
Currently popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that perform feature extraction and matching separately. Empirically, these hand-crafted designs cause insufficient target interaction, which limits dynamic target-aware feature learning in VOS. To address these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework that performs joint feature extraction and matching with a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-aware features for accurate mask prediction. More importantly, SimVOS can directly apply well-pretrained ViT backbones (e.g., MAE [21]) to VOS, which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore within-frame attention and propose a new token refinement module to improve running speed and reduce computational cost. Experimentally, SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F) and YouTube-VOS 2019 (84.2% J&F), without any of the synthetic video or BL30K pre-training used in previous VOS approaches. Our code and models are available at https://github.com/jimmy-dq/SimVOS.git.
©2023 IEEE
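The joint feature extraction and matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is in the repository linked in the abstract). It assumes a single attention head, plain Python lists in place of tensors, and random weights. The names `matmul`, `softmax`, `joint_attention` and `random_matrix` are hypothetical helpers introduced only for this example. The sketch omits how the reference mask is embedded, as well as the within-frame attention and token refinement module mentioned in the abstract. It shows only the core idea: reference and query tokens are concatenated and passed through the same self-attention, so feature extraction and query-to-reference matching happen in one step.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(ref_tokens, query_tokens, w_q, w_k, w_v):
    """Single-head self-attention over concatenated reference and query tokens.

    Returns the updated query tokens and the query-to-reference attention
    weights (the part of the attention map that acts as feature matching).
    """
    tokens = ref_tokens + query_tokens  # concatenate along the token axis
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(k[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    attn = [softmax([s / scale for s in row]) for row in scores]
    out = matmul(attn, v)
    n_ref = len(ref_tokens)
    return out[n_ref:], [row[:n_ref] for row in attn[n_ref:]]


def random_matrix(rng, rows, cols, scale=1.0):
    """Hypothetical helper: Gaussian matrix used as stand-in data and weights."""
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 16
ref = random_matrix(rng, 6, dim)   # 6 reference-frame tokens (assumed count)
qry = random_matrix(rng, 9, dim)   # 9 query-frame tokens (assumed count)
w_q, w_k, w_v = (random_matrix(rng, dim, dim, 1.0 / math.sqrt(dim)) for _ in range(3))

query_out, query_to_ref = joint_attention(ref, qry, w_q, w_k, w_v)
print(len(query_out), len(query_out[0]))        # query tokens keep their shape
print(len(query_to_ref), len(query_to_ref[0]))  # one matching row per query token
```

Each row of `query_to_ref` holds the attention mass a query token places on the reference tokens. The remaining mass in that row goes to the other query tokens, which is what distinguishes this joint scheme from a separate matching module applied after independent feature extraction.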
Bibliographic Note
Research Unit(s) information for this publication is provided by the author(s) concerned
Citation Format(s)
Scalable Video Object Segmentation with Simplified Framework. / Wu, Qiangqiang; Yang, Tianyu; Wu, Wei et al.
Proceedings - 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023). Institute of Electrical and Electronics Engineers, Inc., 2023. p. 13833-13843.