Cross-modal Cooking Recipe Retrieval


Student thesis: Doctoral Thesis



Award date: 29 Jun 2018


This thesis investigates the problem of cross-modal cooking recipe retrieval from four aspects: (1) recognizing ingredients in food images and building an ingredient graph for zero-shot recipe retrieval, (2) recognizing rich food attributes, covering not only ingredients but also cooking and cutting methods, (3) learning a joint embedding space between ingredients (extracted from recipes) and food images with attention modeling, and (4) deeply understanding cooking instructions for cross-modal learning.

We first focus on the recognition of ingredients for recipe retrieval in the domain of Chinese dishes. Different from food categorization, which identifies the name of a dish, ingredient recognition uncovers the ingredients inside a dish. As the size, shape and color of ingredients can exhibit large visual differences due to diverse ways of cutting and cooking, in addition to changes in viewpoint and lighting conditions, recognizing ingredients is much more challenging than food categorization. We propose deep architectures for simultaneous learning of ingredient recognition and food categorization, exploiting the mutual but fuzzy relationship between them. The learnt deep features and semantic ingredient labels are then innovatively applied to zero-shot retrieval of recipes. In addition, to boost retrieval performance, a graph encoding the contextual relationships among ingredients is learnt from the recipe corpus. Using this graph, a conditional random field (CRF) is employed to tune the probability distribution over ingredients, reducing recognition errors caused by unseen food categories.
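The graph-based refinement described above can be illustrated with a toy sketch. This is not the thesis's exact CRF formulation; it is a simplified mean-field-style update in which an ingredient's probability gains contextual support from ingredients it frequently co-occurs with in the recipe corpus. All names, dimensions and values below are illustrative.

```python
import numpy as np

def refine_ingredient_probs(unary, cooccur, alpha=0.5, iters=3):
    """Mean-field-style refinement of per-ingredient probabilities.

    unary   : (K,) independent probabilities from the visual classifier
    cooccur : (K, K) pairwise co-occurrence weights mined from recipes
    alpha   : trade-off between visual evidence and contextual support
    """
    p = unary.copy()
    for _ in range(iters):
        # ingredients that co-occur with likely ingredients gain mass
        context = cooccur @ p
        context = context / (context.max() + 1e-8)
        p = alpha * unary + (1 - alpha) * context
    return p

# toy example with 3 ingredients: the first two co-occur often,
# so strong evidence for ingredient 0 boosts ingredient 1
unary = np.array([0.8, 0.3, 0.1])
cooccur = np.array([[0.0, 0.9, 0.1],
                    [0.9, 0.0, 0.1],
                    [0.1, 0.1, 0.0]])
refined = refine_ingredient_probs(unary, cooccur)
```

After refinement, the weakly detected second ingredient is boosted by its strong co-occurrence with the first, while the unrelated third ingredient stays low.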

As similar ingredient compositions can end up as wildly different dishes depending on the cooking and cutting procedures, the difficulty of retrieval originates from fine-grained recognition of rich attributes from pictures. We therefore propose multi-task learning to recognize not only the ingredient composition but also the applied cooking and cutting methods. The proposed model requires fewer training samples and is easier to train, with a smaller number of network parameters. With a multi-task deep learning model, we provide insights into the feasibility of predicting ingredient, cutting and cooking attributes for food recognition and recipe retrieval. Moreover, as learning happens at the region level, localizing ingredients is possible even when region-level training examples are not provided.
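The multi-task idea can be sketched as one shared feature feeding several lightweight task heads whose losses are summed. This is only a schematic of the general technique, not the thesis's architecture; class counts, weights and the linear heads are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# shared backbone feature for one image region (dimension is illustrative)
feat = rng.standard_normal(512)

# one small head per task, all sharing the same feature
W_ingredient = rng.standard_normal((100, 512)) * 0.01  # ingredient classes
W_cutting    = rng.standard_normal((8, 512)) * 0.01    # e.g. dice, slice, shred
W_cooking    = rng.standard_normal((10, 512)) * 0.01   # e.g. stir-fry, steam

p_ing  = softmax(W_ingredient @ feat)
p_cut  = softmax(W_cutting @ feat)
p_cook = softmax(W_cooking @ feat)

def nll(p, label):
    """Negative log-likelihood of the ground-truth label."""
    return -np.log(p[label] + 1e-12)

# joint objective: weighted sum of per-task losses (labels are toy values)
loss = nll(p_ing, 3) + 0.5 * nll(p_cut, 1) + 0.5 * nll(p_cook, 2)
```

Because the heads are small relative to the shared backbone, adding the cutting and cooking tasks costs few extra parameters, which is the parameter-efficiency point made above.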

Training deep models for ingredient recognition requires manually labelled ingredients, which are expensive and time-consuming to obtain. As millions of food-recipe pairs can already be acquired from the Internet, a more feasible means of saving labelling effort is to learn a joint space between recipes and food images for cross-modal retrieval. We therefore exploit and revise a deep model, the stacked attention network, for joint embedding feature learning between dish images and the ingredients extracted from cooking recipes. Given a large number of image-recipe pairs acquired from the Internet, a joint space is learnt that locally captures the ingredient correspondence between images and recipes. As learning happens at the region level for images and the ingredient level for recipes, the model can generalize recognition to unseen food categories.
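Retrieval in a joint embedding space can be sketched as follows: image region features are pooled with a multi-hop ("stacked") attention query, and recipes are ranked by cosine similarity to the resulting image vector. This toy version uses random features and a much simplified attention step; it only illustrates the shapes and the ranking mechanics, not the trained model.

```python
import numpy as np

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def stacked_attention(regions, query, hops=2):
    """Toy multi-hop attention: the query attends over image regions,
    and the attended context refines the query at each hop."""
    q = query.copy()
    for _ in range(hops):
        scores = regions @ q                   # relevance of each region
        att = np.exp(scores - scores.max())
        att = att / att.sum()                  # soft attention weights
        q = l2norm(q + att @ regions)          # refine query with context
    return q

rng = np.random.default_rng(1)
regions = l2norm(rng.standard_normal((9, 64)))  # 3x3 grid of region features
recipes = l2norm(rng.standard_normal((5, 64)))  # ingredient embeddings of 5 recipes

img_vec = stacked_attention(regions, regions.mean(axis=0))
sims = recipes @ img_vec                        # cosine similarity in joint space
ranked = np.argsort(-sims)                      # retrieval ranking, best first
```

In training, image-recipe pairs scraped from the web would pull matching vectors together in this space, so no manual ingredient labels are needed.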

To further improve overall retrieval performance, we explore utilizing cooking instructions for cross-modal learning. On the one hand, cooking instructions give clues to the multimedia presentation of a dish (e.g., taste, color, shape). On the other hand, they describe the process implicitly, implying only the cause of the dish's presentation rather than the visual effect that can be vividly observed in a picture. Therefore, different from other cross-modal retrieval problems in the literature, recipe search requires understanding a textually described procedure to predict its likely consequences for visual appearance. We approach this problem from the perspective of attention modeling. Specifically, we model the attention of words and sentences in a recipe and align them with the image feature such that text and visual features share high similarity in the multi-dimensional joint space. Furthermore, with attention modeling, we show that language-specific named-entity extraction based on domain knowledge becomes optional.
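The word- and sentence-level attention can be sketched as a two-stage pooling: word attention summarizes each instruction sentence, sentence attention summarizes the recipe, and the result is compared with the image feature by cosine similarity. The context vectors, dimensions and random embeddings below are placeholders, not the thesis's trained parameters.

```python
import numpy as np

def attend(vectors, context):
    """Soft attention: pool vectors, weighted by similarity to a context."""
    scores = vectors @ context
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ vectors

rng = np.random.default_rng(2)
d = 32
# a recipe as 3 instruction sentences of word embeddings (lengths vary)
sentences = [rng.standard_normal((n, d)) for n in (5, 8, 4)]
word_ctx, sent_ctx = rng.standard_normal(d), rng.standard_normal(d)

# word-level attention pools each sentence into a sentence vector,
# then sentence-level attention pools sentences into a recipe vector
sent_vecs = np.stack([attend(words, word_ctx) for words in sentences])
recipe_vec = attend(sent_vecs, sent_ctx)

# align with an image feature in the shared space (cosine similarity)
img_vec = rng.standard_normal(d)
sim = (recipe_vec @ img_vec) / (np.linalg.norm(recipe_vec) * np.linalg.norm(img_vec))
```

Because the attention weights are learnt end-to-end from image-recipe alignment, salient words (e.g. ingredients and actions) can emerge without a hand-built, language-specific named-entity extractor.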

The proposed techniques are evaluated on large-scale real-world food image and recipe datasets, including VireoFood-172, UEC-Food100 and Recipe1M. Experimental evaluations demonstrate promising results and show good potential for real-world multimedia applications.

Research areas

  • Recipe retrieval, Food recognition, Cross-modal learning, Cross-modal retrieval, Ingredient recognition