Learning to Predict Scene Contexts
Description
A scene context refers to how the environment surrounding some objects of interest correlates with those objects. Following the recent success of deep-learning models in extracting scene context information from images, much research has examined how to apply this information to computer vision tasks, including object detection, recognition and segmentation. State-of-the-art performance is reported when scene contexts are taken into consideration. However, all these prior works use context information gathered from existing images to infer the properties of objects of interest in an image.

In this work, we propose to investigate the inverse problem: given the properties of multiple objects, we want to predict the unknown scene context. Being able to predict the scene context has many applications, including a new approach to image synthesis and video synthesis, as well as scene type classification. The main objective of the proposed research is to answer an important question: can we synthesize a meaningful scene context from the properties of just a few input objects?

As preliminary work, we demonstrated in our CVPR 2019 (oral) paper that, given just one or two foreground objects (in the form of semantic maps), we can synthesize the surrounding objects and the background of the image. Since that work was meant to serve as a proof of concept, it studied only a small part of this unexplored problem. In particular, the fundamental problem of how the properties of the input objects affect the inferred scene context was not considered. In the proposed work, we will investigate three main issues along this direction:

1. We observe that, when multiple objects are given, it is not only the existence of these objects that can be used to infer the surrounding scene context; how the objects interact with each other is equally important. For example, the context for two persons jogging can be very different from that for two persons fighting with each other. This is a very high-dimensional learning problem that involves learning the class, size, shape and location interactions of multiple objects. In this project, we will explore different ways of modeling this multi-dimensional object interaction problem, in particular graph-based approaches.

2. Our preliminary work did not consider small input objects, since the information learned from small objects is typically very weak as a result of the convolution operations. To handle input objects of different sizes, we will explore pyramidal learning to enrich our input object representation.

3. In our preliminary work, users specify the input objects in the form of semantic maps, which indicate the class, size, shape and location of each object. However, a user may not always be concerned with all these properties of every object. In our recent work published in SIGGRAPH 2019, we explored text encoding for graphic design layouts. In the proposed project, we will explore integrating this text encoding idea to allow users to specify partial object properties through text.

We will conduct thorough evaluations of the models developed and explore the image synthesis application to demonstrate the effectiveness of the idea.
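
To make the graph-based view of object interactions in issue 1 more concrete, the following is a minimal sketch, not the project's actual model. It assumes each input object is summarized by a class label and a normalized bounding box, and it builds a fully connected graph whose edges carry simple class, location and size relations. The names `SceneObject`, `pairwise_features` and `build_object_graph` are hypothetical and introduced only for illustration; shape is omitted for brevity.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class SceneObject:
    """Hypothetical summary of one input object (an assumption, not the project's format)."""
    cls: str    # semantic class, e.g. "person"
    cx: float   # box center x, normalized to [0, 1]
    cy: float   # box center y, normalized to [0, 1]
    w: float    # box width, normalized to [0, 1]
    h: float    # box height, normalized to [0, 1]


def pairwise_features(a: SceneObject, b: SceneObject) -> dict:
    """Describe how object b relates to object a in class, location and size."""
    dx, dy = b.cx - a.cx, b.cy - a.cy
    return {
        "classes": (a.cls, b.cls),
        "offset": (dx, dy),
        "distance": math.hypot(dx, dy),
        # Log ratio of box areas: 0 when both objects have the same size.
        "log_area_ratio": math.log((b.w * b.h) / (a.w * a.h)),
    }


def build_object_graph(objects):
    """Return (nodes, edges) for a fully connected graph over the input objects."""
    nodes = list(objects)
    edges = {
        (i, j): pairwise_features(nodes[i], nodes[j])
        for i, j in combinations(range(len(nodes)), 2)
    }
    return nodes, edges


# Two persons of equal size placed at different locations.
nodes, edges = build_object_graph([
    SceneObject("person", 0.3, 0.5, 0.1, 0.4),
    SceneObject("person", 0.6, 0.9, 0.1, 0.4),
])
```

In a learned model, hand-written edge features like these would presumably be replaced or supplemented by learned embeddings passed through a graph neural network; the sketch only shows the kind of pairwise structure such a model could consume.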
Effective start/end date: 1/01/21 → …