
Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding

  • Wenbo Chen (Co-first Author)
  • Zhen Xu (Co-first Author)
  • Ruotao Xu
  • Si Wu*
  • Hau-San Wong

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

The goal of visual grounding is to establish connections between target objects and textual descriptions. Large Language Models (LLMs) have demonstrated strong comprehension abilities across a variety of visual tasks. To establish precise associations between the text and the corresponding visual region, we propose a Task-aware Cross-modal feature Refinement Transformer with LLMs for visual grounding, and our model is referred to as TCRT. To enable the LLM trained solely on text to understand images, we introduce an LLM adaptation module that extracts text-related visual features to bridge the domain discrepancy between the textual and visual modalities. We feed the text and visual features into the LLM to obtain task-aware priors. To enable the priors to guide the feature fusion process, we further incorporate a cross-modal feature fusion module, which allows task-aware embeddings to refine visual features and facilitate information interaction between the Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) tasks. We have performed extensive experiments to verify the effectiveness of the main components and demonstrate the superior performance of the proposed TCRT over state-of-the-art end-to-end visual grounding methods on RefCOCO, RefCOCOg, RefCOCO+ and ReferItGame.
©2025 IEEE
Original language: English
Title of host publication: Proceedings IEEE/CVF Conference on Computer Vision and Pattern Recognition CVPR 2025
Publisher: IEEE
Pages: 3931-3941
Number of pages: 11
ISBN (Electronic): 979-8-3315-4364-8
ISBN (Print): 979-8-3315-4365-5
Publication status: Published - 13 Aug 2025
Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) - Music City Center, Nashville, United States
Duration: 11 Jun 2025 to 15 Jun 2025
https://cvpr.thecvf.com/Conferences/2025
https://cvpr.thecvf.com/

Publication series

ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)
Abbreviated title: CVPR2025
Place: United States
City: Nashville
Period: 11/06/25 to 15/06/25

Funding

This work was supported in part by the TCL Science and Technology Innovation Fund (Project No. 20231752), in part by the Research Grants Council of the Hong Kong Special Administrative Region (Project No. CityU 11206622), and in part by the Guangdong Basic and Applied Basic Research Foundation (Project No. 2024A1515011437).

RGC Funding Information

  • RGC-funded
