
MoDE: CLIP Data Experts via Clustering

  • Jiawei Ma
  • Po-Yao Huang
  • Saining Xie
  • Shang-Wen Li
  • Luke Zettlemoyer
  • Shih-Fu Chang
  • Wen-Tau Yih
  • Hu Xu

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

Abstract

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification, at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc.
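The inference-time ensembling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that each fine-grained cluster center is an embedding vector assigned to one coarse-grained data expert, that the "correlation" between task metadata and cluster conditions is approximated by a temperature-scaled softmax over dot-product similarities, and that expert outputs are combined as a weighted sum of logits. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def expert_weights(metadata_emb, fine_centers, center_to_expert,
                   num_experts, temperature=0.1):
    """Turn similarity to fine-grained centers into per-expert weights.

    Assumption: the weight of an expert is the total softmax mass of the
    fine-grained centers that belong to it.
    """
    sims = [dot(metadata_emb, center) for center in fine_centers]
    probs = softmax(sims, temperature)
    weights = [0.0] * num_experts
    for prob, expert_id in zip(probs, center_to_expert):
        weights[expert_id] += prob
    return weights


def ensemble_logits(per_expert_logits, weights):
    """Weighted sum of the logits produced by each data expert."""
    num_classes = len(per_expert_logits[0])
    return [
        sum(w * logits[i] for w, logits in zip(weights, per_expert_logits))
        for i in range(num_classes)
    ]


# Toy setup: four fine-grained centers, two per data expert.
fine_centers = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
center_to_expert = [0, 0, 1, 1]

# Hypothetical embedding of the task metadata (e.g. a class name).
weights = expert_weights([1.0, 0.0], fine_centers, center_to_expert, 2)
combined = ensemble_logits([[2.0, 0.5], [0.1, 0.3]], weights)
```

In this toy setup the metadata embedding is closest to the centers owned by expert 0, so that expert dominates the combined logits. Representing each expert by several fine-grained centers, rather than one coarse center, is what lets the weighting stay sensitive to semantic detail while the number of trained experts stays small.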
Original language: English
Title of host publication: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Subtitle of host publication: CVPR 2024
Publisher: IEEE
Pages: 26354-26363
ISBN (Electronic): 979-8-3503-5300-6
ISBN (Print): 979-8-3503-5301-3
Publication status: Published - 2024
Externally published: Yes
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
- Seattle Convention Center, Seattle, United States
Duration: 17 Jun 2024 to 21 Jun 2024
https://cvpr.thecvf.com/Conferences/2024
https://ieeexplore.ieee.org/xpl/conhome/1000147/all-proceedings
https://cvpr.thecvf.com/virtual/2024/index.html

Publication series

ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Place: United States
City: Seattle
Period: 17/06/24 to 21/06/24

Research Keywords

  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
  • Computation and Language
  • Machine Learning
