Learning Compatible, Self-Explainable, Transferable Features for Cross-Modal Recipe Retrieval


Student thesis: Doctoral Thesis

Award date: 17 Aug 2021


This thesis investigates the problem of cross-modal recipe retrieval from three perspectives: (1) learning compatible cross-modal recipe features in an adversarial way and explaining retrieval results by generating food images from recipes, (2) extending from cross-modal to cross-domain recipe retrieval to learn transferable features, and (3) exploring cross-lingual adaptation in the context of image-to-recipe retrieval.

We first study a new GAN variant, the Recipe Retrieval Generative Adversarial Network (R2GAN), for cross-modal recipe retrieval. By using a GAN to generate thumbnail images from recipes, the cross-modal features can be learnt to be compatible in an adversarial way, and the search results can be self-explained by showing the generated images alongside the recipe rankings. R2GAN is designed with one generator and dual discriminators, which makes generating images from recipes feasible. To further equip R2GAN with high-resolution synthesized images for explaining search results, CookGAN is tailor-made for photo-realistic food image generation. CookGAN aims to mimic visual effects along the causality chain, preserve fine-grained details, and progressively up-sample images. A cooking simulator sub-network is proposed to incrementally change food images based on the interaction between ingredients and cooking methods over a series of steps.

We then extend from cross-modal to cross-domain recipe retrieval to give recipe retrieval models better generalization ability. Leveraging image-recipe pairs in a source domain, we frame food transfer as recognizing food in a target domain with new food categories and attributes. We address the challenge of resource scarcity in the scenario where only partial data, rather than a complete view of the data, is accessible for model transfer. Partial data refers to missing information, such as the absence of the image modality or the cooking instructions from an image-recipe pair. To cope with partial data, we propose a novel generic model for cross-domain transfer learning, equipped with various loss functions including cross-modal metric learning, recipe residual loss, semantic regularization, and adversarial learning.
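To make the cross-modal metric learning term more concrete, here is a minimal sketch of one common formulation, a bidirectional triplet ranking loss over a batch of paired image and recipe embeddings. This is an assumption for illustration only: the abstract does not state which metric learning objective the thesis uses, and the function names, the cosine distance, and the `margin` value are hypothetical. The other losses (recipe residual, semantic regularization, adversarial) and the handling of partial data are not covered.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge loss: the positive should be closer to the anchor than the
    negative by at least `margin` (an illustrative value)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


def bidirectional_triplet_loss(image_embs, recipe_embs, margin=0.3):
    """Average triplet loss in both retrieval directions.

    image_embs[i] and recipe_embs[i] are assumed to form a matching pair;
    every other item in the batch serves as a negative. Requires at least
    two pairs.
    """
    n = len(image_embs)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image-to-recipe direction: image i anchors, recipe j is negative.
            total += triplet_loss(image_embs[i], recipe_embs[i], recipe_embs[j], margin)
            # Recipe-to-image direction: recipe i anchors, image j is negative.
            total += triplet_loss(recipe_embs[i], image_embs[i], image_embs[j], margin)
            count += 2
    return total / count
```

With perfectly aligned pairs the loss is zero, and it grows when a non-matching item sits closer to the anchor than its true partner.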

To relax the assumption that recipes in the source and target domains are written in the same language, we further explore cross-lingual adaptation in the context of image-to-recipe retrieval. We design a novel self-supervised learning method to learn transferable embedding features across different languages and to bridge the domain gap in an unsupervised manner, without requiring any paired data in the target domain. Concretely, we first introduce an intermediate domain obtained by exchanging one or more sections between source and target recipes. Self-supervised learning is then achieved by imposing the constraint that the distance between recipes in the source (or target) domain and the intermediate domain should be smaller than the distance between the source and target domains. In this way, the domain gap can be effectively narrowed.
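The distance constraint above can be read as a margin-based ranking condition. The following is a minimal sketch under stated assumptions, not the thesis's implementation: recipes are modeled as dictionaries of section vectors, and the section names, the mean-pooling `embed` stand-in for a learned recipe encoder, the Euclidean distance, and the `margin` value are all hypothetical.

```python
import math


def exchange_sections(source, target, sections):
    """Build an intermediate recipe: a copy of `source` whose listed
    sections are replaced by the corresponding sections of `target`."""
    mixed = dict(source)
    for name in sections:
        mixed[name] = target[name]
    return mixed


def embed(recipe):
    """Hypothetical stand-in for a learned recipe encoder: the element-wise
    mean of the section vectors."""
    vectors = list(recipe.values())
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def distance(a, b):
    """Euclidean distance between two embeddings (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def intermediate_domain_loss(source, target, sections, margin=0.1):
    """Penalize cases where the intermediate recipe is not closer to the
    source (and to the target) than source and target are to each other."""
    s, t = embed(source), embed(target)
    m = embed(exchange_sections(source, target, sections))
    d_st = distance(s, t)
    loss_source = max(0.0, distance(s, m) - d_st + margin)
    loss_target = max(0.0, distance(t, m) - d_st + margin)
    return loss_source + loss_target
```

In this toy version the loss needs no labels or image-recipe pairs from the target domain, which is what makes the constraint self-supervised; with a trainable encoder in place of `embed`, minimizing it would pull the source and target embeddings toward the shared intermediate recipes.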

We evaluate the proposed techniques on large-scale food datasets, including Recipe1M and Vireo-FoodTransfer. Experimental evaluations demonstrate promising results for the proposed techniques, which shed light on various real-world multimedia applications such as massive food data search, food recognition, nutrition estimation, and food logging.

Research areas: Cross-modal retrieval, Food recognition, Domain adaptation, Generative models