Abstract
Taobao Search consists of two phases: the retrieval phase and the ranking phase. Given a user query, the retrieval phase returns a subset of candidate products for the subsequent ranking phase. Recently, the paradigm of pre-training and fine-tuning has shown its potential for incorporating visual clues into retrieval tasks. In this paper, we focus on solving the problem of text-to-multimodal retrieval in Taobao Search. We consider that users' attention to titles or images varies across products. Hence, we propose a novel Modal Adaptation module for cross-modal fusion, which helps assign appropriate weights to texts and images across products. Furthermore, in e-commerce search, user queries tend to be brief, which leads to a significant semantic imbalance between user queries and product titles. Therefore, we design a separate text encoder and a Keyword Enhancement mechanism to enrich the query representations and improve text-to-multimodal matching. To this end, we present a novel vision-language (V+L) pre-training method to exploit the multimodal information of (user query, product title, product image). Extensive experiments demonstrate that our retrieval-specific pre-training model (referred to as MAKE) outperforms existing V+L pre-training methods on the text-to-multimodal retrieval task. MAKE has been deployed online and brings major improvements to the retrieval system of Taobao Search.
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
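The idea of weighting title and image differently for each product can be illustrated with a toy sketch. This is not the Modal Adaptation module from the paper, whose architecture the abstract does not specify; it is a generic softmax-gated fusion, and the function names (`softmax`, `fuse`, `score`), the `gate_logits` parameter, and the 2-d embeddings are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a sequence of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(text_vec, image_vec, gate_logits):
    """Blend a title embedding and an image embedding with per-product weights.

    gate_logits is a hypothetical (text_logit, image_logit) pair; in a trained
    model it would be predicted from the product rather than set by hand.
    """
    w_text, w_image = softmax(gate_logits)
    return [w_text * t + w_image * i for t, i in zip(text_vec, image_vec)]


def score(query_vec, product_vec):
    """Dot-product relevance between a query and a fused product embedding."""
    return sum(q * p for q, p in zip(query_vec, product_vec))


# Toy 2-d embeddings, made up for illustration.
title_emb = [1.0, 0.0]
image_emb = [0.0, 1.0]
query_emb = [1.0, 0.0]  # this query happens to align with the title

# Same product content, two different modality weightings.
text_heavy = fuse(title_emb, image_emb, gate_logits=(2.0, 0.0))
image_heavy = fuse(title_emb, image_emb, gate_logits=(0.0, 2.0))
```

In this toy setup, a query aligned with the title scores higher against the text-weighted embedding than against the image-weighted one, which is the kind of per-product adjustment the abstract describes at a high level.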
Original language | English |
---|---|
Title of host publication | WWW '23 Companion: Companion Proceedings of the ACM Web Conference 2023 |
Publisher | Association for Computing Machinery |
Pages | 356-360 |
ISBN (Print) | 9781450394161 |
Publication status | Published - Apr 2023 |
Event | ACM Web Conference 2023 (WWW '23) - Hybrid, Austin, United States Duration: 30 Apr 2023 → 4 May 2023 https://www2023.thewebconf.org/ |
Publication series
Name | ACM Web Conference - Companion of the World Wide Web Conference, WWW |
---|
Conference
Conference | ACM Web Conference 2023 (WWW '23) |
---|---|
Abbreviated title | WWW '23 |
Country/Territory | United States |
City | Austin |
Period | 30/04/23 → 4/05/23 |
Bibliographical note
Full text of this publication does not contain sufficient affiliation information. With consent from the author(s) concerned, the Research Unit(s) information for this record is based on the existing academic department affiliation of the author(s).
Research Keywords
- Multimodal Pre-training
- Representation Learning
- Semantic Retrieval