TY - CONF
T1 - Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery
AU - Bou, Xavier
AU - Facciolo, Gabriele
AU - Grompone von Gioi, Rafael
AU - Morel, Jean-Michel
AU - Ehret, Thibaud
PY - 2024
Y1 - 2024
N2 - The goal of this paper is to perform object detection in satellite imagery with only a few examples, thus enabling users to specify any object class with minimal annotation. To this end, we explore recent methods and ideas from open-vocabulary detection for the remote sensing domain. We develop a few-shot object detector based on a traditional two-stage architecture, where the classification block is replaced by a prototype-based classifier. A large-scale pre-trained model is used to build class-reference embeddings or prototypes, which are compared to region proposal contents for label prediction. In addition, we propose to fine-tune prototypes on available training images to boost performance and learn differences between similar classes, such as aircraft types. We perform extensive evaluations on two remote sensing datasets containing challenging and rare objects. Moreover, we study the performance of both visual and image-text features, namely DINOv2 and CLIP, including two CLIP models specifically tailored for remote sensing applications. Results indicate that visual features are largely superior to vision-language models, as the latter lack the necessary domain-specific vocabulary. Lastly, the developed detector outperforms fully supervised and few-shot methods evaluated on the SIMD and DIOR datasets, despite minimal training parameters. © 2024 IEEE.
UR - http://www.scopus.com/inward/record.url?scp=85200825876&partnerID=8YFLogxK
DO - 10.1109/CVPRW63382.2024.00048
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9798350365481
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 430
EP - 439
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
PB - IEEE Computer Society
CY - Los Alamitos, Calif.
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2024)
Y2 - 16 June 2024 through 22 June 2024
ER -