Abstract
The goal of visual grounding is to establish connections between target objects and textual descriptions. Large Language Models (LLMs) have demonstrated strong comprehension abilities across a variety of visual tasks. To establish precise associations between the text and the corresponding visual region, we propose a Task-aware Cross-modal feature Refinement Transformer with LLMs for visual grounding, referred to as TCRT. To enable an LLM trained solely on text to understand images, we introduce an LLM adaptation module that extracts text-related visual features to bridge the domain discrepancy between the textual and visual modalities. We feed the text and visual features into the LLM to obtain task-aware priors. To enable these priors to guide the feature fusion process, we further incorporate a cross-modal feature fusion module, which allows task-aware embeddings to refine visual features and facilitate information interaction between the Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) tasks. Extensive experiments verify the effectiveness of the main components and demonstrate the superior performance of the proposed TCRT over state-of-the-art end-to-end visual grounding methods on RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame.
©2025 IEEE
| Original language | English |
|---|---|
| Title of host publication | Proceedings IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2025 |
| Publisher | IEEE |
| Pages | 3931-3941 |
| Number of pages | 11 |
| ISBN (Electronic) | 979-8-3315-4364-8 |
| ISBN (Print) | 979-8-3315-4365-5 |
| Publication status | Published - 13 Aug 2025 |
| Event | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Music City Center, Nashville, United States. Duration: 11 Jun 2025 → 15 Jun 2025. https://cvpr.thecvf.com/Conferences/2025, https://cvpr.thecvf.com/ |
Publication series
| Name | |
|---|---|
| ISSN (Print) | 1063-6919 |
| ISSN (Electronic) | 2575-7075 |
Conference
| Conference | 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) |
|---|---|
| Abbreviated title | CVPR2025 |
| Place | United States |
| City | Nashville |
| Period | 11/06/25 → 15/06/25 |
Funding
This work was supported in part by the TCL Science and Technology Innovation Fund (Project No. 20231752), in part by the Research Grants Council of the Hong Kong Special Administrative Region (Project No. CityU 11206622), and in part by the Guangdong Basic and Applied Basic Research Foundation (Project No. 2024A1515011437).
RGC Funding Information
- RGC-funded