Abstract
The visual question localized-answering (VQLA) system has garnered increasing attention due to its potential as a knowledgeable assistant in surgical education. Apart from providing text-based answers, VQLA can also pinpoint the specific region of interest for better surgical scene understanding. Although recent Transformer-based models for VQLA have obtained promising results, they (1) conduct vanilla text-to-image cross attention, leading to unidirectional and coarse-grained alignment; (2) ignore exploiting the semantics of answers to further boost performance. In this paper, we propose a novel model termed OTAS, which first introduces optimal transport to achieve bidirectional and fine-grained alignment between images and questions, enabling more precise localization. Besides, OTAS incorporates a set of learnable candidate answer embeddings to query the probability of each answer class for a given image-question pair. Through Transformer attention, the candidate answer embeddings interact with the fused features of the image-question pair to make the answer decision. Extensive experiments on two widely-used benchmark datasets demonstrate the superiority of our model over state-of-the-art methods. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) |
| Publisher | European Language Resources Association (ELRA) |
| Pages | 711-721 |
| ISBN (Print) | 9782493814104 |
| Publication status | Published - May 2024 |
| Event | Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy Duration: 20 May 2024 → 25 May 2024 https://aclanthology.org/2024.lrec-main |
Publication series
| Name | Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING - Main Conference Proceedings |
|---|
Conference
| Conference | Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 |
|---|---|
| Place | Italy |
| City | Hybrid, Torino |
| Period | 20/05/24 → 25/05/24 |
| Internet address |
Bibliographical note
Full text of this publication does not contain sufficient affiliation information. With consent from the author(s) concerned, the Research Unit(s) information for this record is based on the existing academic department affiliation of the author(s).Research Keywords
- Answer Semantics
- Optimal Transport
- Visual Question Localized-Answering
Publisher's Copyright Statement
- This full text is made available under CC-BY-NC 4.0. https://creativecommons.org/licenses/by-nc/4.0/