MAKE: Vision-Language Pre-training based Product Retrieval in Taobao Search

Xiaoyang Zheng, Zilong Wang, Sen Li, Ke Xu*, Tao Zhuang, Qingwen Liu, Xiaoyi Zeng

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

7 Citations (Scopus)

Abstract

Taobao Search consists of two phases: the retrieval phase and the ranking phase. Given a user query, the retrieval phase returns a subset of candidate products for the subsequent ranking phase. Recently, the paradigm of pre-training and fine-tuning has shown its potential for incorporating visual clues into retrieval tasks. In this paper, we focus on solving the problem of text-to-multimodal retrieval in Taobao Search. We observe that users' attention to titles or images varies across products. Hence, we propose a novel Modal Adaptation module for cross-modal fusion, which helps assign appropriate weights to texts and images across products. Furthermore, in e-commerce search, user queries tend to be brief, leading to a significant semantic imbalance between user queries and product titles. Therefore, we design a separate text encoder and a Keyword Enhancement mechanism to enrich the query representations and improve text-to-multimodal matching. To this end, we present a novel vision-language (V+L) pre-training method that exploits the multimodal information of (user query, product title, product image) triples. Extensive experiments demonstrate that our retrieval-specific pre-training model (referred to as MAKE) outperforms existing V+L pre-training methods on the text-to-multimodal retrieval task. MAKE has been deployed online and brings major improvements to the retrieval system of Taobao Search. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
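The abstract's Modal Adaptation idea, assigning product-specific weights to the text and image modalities before fusion, can be illustrated with a minimal sketch. The paper does not publish its exact architecture here, so the gating function, projection matrix `W`, and embedding sizes below are all hypothetical placeholders, not the authors' implementation:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def modal_adaptation_fuse(text_emb, image_emb, W):
    """Hypothetical gated fusion: derive per-product weights for the
    text and image embeddings from their concatenation, then take the
    weighted sum. W (shape (2, 2d)) is an assumed learned projection."""
    joint = np.concatenate([text_emb, image_emb])
    w_text, w_img = softmax(W @ joint)   # two weights summing to 1
    return w_text * text_emb + w_img * image_emb

# Toy example with random embeddings of dimension d = 4.
rng = np.random.default_rng(0)
d = 4
text_emb = rng.normal(size=d)
image_emb = rng.normal(size=d)
W = rng.normal(size=(2, 2 * d))
fused = modal_adaptation_fuse(text_emb, image_emb, W)
print(fused.shape)  # (4,)
```

Because the two weights are a softmax output, the fused vector is a convex combination of the text and image embeddings, so a product whose image carries more signal can receive a larger image weight without changing the embedding dimensionality.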
Original language: English
Title of host publication: WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023
Publisher: Association for Computing Machinery
Pages: 356-360
ISBN (Print): 9781450394161
DOIs
Publication status: Published - Apr 2023
Event: ACM Web Conference 2023 (WWW '23) - Hybrid, Austin, United States
Duration: 30 Apr 2023 - 4 May 2023
https://www2023.thewebconf.org/

Publication series

Name: ACM Web Conference - Companion of the World Wide Web Conference, WWW

Conference

Conference: ACM Web Conference 2023 (WWW '23)
Abbreviated title: WWW '23
Country/Territory: United States
City: Austin
Period: 30/04/23 - 4/05/23
Internet address: https://www2023.thewebconf.org/

Bibliographical note

Full text of this publication does not contain sufficient affiliation information. With consent from the author(s) concerned, the Research Unit(s) information for this record is based on the existing academic department affiliation of the author(s).

Research Keywords

  • Multimodal Pre-training
  • Representation Learning
  • Semantic Retrieval
