The Detection of Saliency, Objectness and Actionness in Media Data


Student thesis: Doctoral Thesis



Supervisors
  • Rynson W H LAU (Supervisor)
  • Weiqiang Wang (External person) (External Supervisor)
Award date: 6 Feb 2018


Research in cognitive psychology and neurobiology suggests that humans have a strong ability to perceive objects before recognizing them. Human attention theories also hypothesize that the human vision system (HVS) processes only parts of an image in detail, while leaving the rest nearly unprocessed. Motivated by these findings, we propose to understand images and videos intelligently by focusing on three related concepts: saliency, objectness and actionness. These concepts reflect subjective properties; for example, a region of an image that is attractive and salient to some viewers may be easily ignored by others. In this work, we investigate these concepts and apply them to real-life scenarios.

Saliency is the state or quality by which something stands out relative to its surroundings, from the human perspective. Many researchers have contributed to saliency detection, and impressive progress has been made in recent years. Meanwhile, as tremendous numbers of images are created, stored and transmitted, there is a growing demand for accurate and efficient image retrieval. Hence, we explore the feasibility of applying saliency detection to image matching, under the assumption that salient regions matter more to humans when measuring the similarity of an image pair. As a result, our matching results are closer to human subjective perception than those based on low-level features alone. To accomplish this objective, we first propose saliency detection methods that formulate the center-surround hypothesis in different ways. We then extract salient regions of an image based on the computed saliency map. Finally, we formulate distance metrics that measure the similarity between images using integrated region matching (IRM) or a fully connected graph. Experimental results on publicly available datasets show that the proposed methods achieve satisfactory performance on both saliency detection and pairwise matching.
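The center-surround hypothesis can be illustrated with a minimal sketch: a pixel is salient when a small center window around it differs strongly from a larger surround window. The window sizes, the mean-intensity contrast, and the max normalization below are illustrative assumptions, not the thesis's actual formulations.

```python
import numpy as np

def center_surround_saliency(img, center=3, surround=9):
    """Toy center-surround saliency: each pixel's score is the absolute
    difference between the mean of a small center window and the mean of
    a larger surround window (illustrative sketch only)."""
    h, w = img.shape
    pad = surround // 2
    cpad = center // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    sal = np.zeros((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            cy, cx = y + pad, x + pad
            c = padded[cy - cpad:cy + cpad + 1, cx - cpad:cx + cpad + 1].mean()
            s = padded[cy - pad:cy + pad + 1, cx - pad:cx + pad + 1].mean()
            sal[y, x] = abs(c - s)
    if sal.max() > 0:
        sal /= sal.max()  # normalize to [0, 1]
    return sal

# A bright square on a dark background: interior pixels of the square
# contrast most strongly with their surround.
img = np.zeros((20, 20))
img[8:12, 8:12] = 1.0
sal = center_surround_saliency(img)
```

A thresholded version of such a map is one simple way to obtain the salient regions that the region-matching step consumes.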

Similar to saliency, which indicates the likelihood that a region attracts attention, objectness indicates the likelihood that a region contains a generic object. The candidate regions are called object proposals; they are allowed to overlap one another so that all objects can be covered. Object proposal detection is an effective way of accelerating object recognition, replacing exhaustive search with a small number of proposals. Existing proposal methods are mostly based on detecting object boundaries, which may not be effective in cluttered backgrounds. In our work, we leverage stereopsis as a robust and effective cue for generating object proposals. We first obtain a set of candidate bounding boxes through an adaptive transformation, which fits the boxes tightly to object boundaries detected from rough depth and color information. A two-level hierarchy composed of cluster and proposal levels is then constructed to estimate object locations efficiently and accurately. Three stereo-based cues, "exactness", "focus" and "distribution", are proposed for objectness estimation, and a two-level hierarchical ranking technique produces accurately ranked object proposals. We construct a stereo dataset with 400 labeled stereo image pairs to evaluate the proposed method in both indoor and outdoor scenes. Extensive experimental evaluations show that the proposed stereo-based approach outperforms state-of-the-art methods with either a small or a large number of object proposals. As stereopsis complements color information, the proposed method can also be integrated with existing proposal methods to obtain superior results.
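Proposal quality in this setting is conventionally measured by detection recall: a ground-truth object counts as found if at least one proposal overlaps it above an intersection-over-union (IoU) threshold. The sketch below shows that standard protocol; the `(x1, y1, x2, y2)` box format and the 0.5 threshold are common conventions, not details taken from the thesis.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def recall_at_iou(proposals, ground_truth, thresh=0.5):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    hit = sum(1 for g in ground_truth
              if any(iou(p, g) >= thresh for p in proposals))
    return hit / float(len(ground_truth))

# Two proposals, two ground-truth objects: only the first object is covered.
proposals = [(0, 0, 10, 10), (20, 20, 30, 30)]
ground_truth = [(1, 1, 10, 10), (50, 50, 60, 60)]
r = recall_at_iou(proposals, ground_truth)
```

Plotting such recall against the number of top-ranked proposals is how methods are typically compared with "either a small or a large number of object proposals".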

Inspired by the idea of object proposals for localizing generic objects in images, we further study the localization of generic actions in egocentric videos, called temporal action proposals (TAPs), which accelerate action recognition by replacing the popular sliding-window strategy. Egocentric videos, which mainly record the activities of the wearers of wearable cameras, have drawn much research attention in recent years. In our work, an egocentric TAP refers to a sequence of frames that may contain a generic action performed by the wearer of a head-mounted camera, e.g., taking a knife, spreading jam, pouring milk, or cutting carrots. We first temporally segment the input video into action atoms, the smallest units that may contain an action. We then apply a hierarchical clustering algorithm with four egocentric cues, hand position, eye gaze, motion blur and TAP length, to generate TAPs. To accurately detect the wearer's hands, we present two techniques, seed generation and dynamic region growing, that jointly consider hand-related motion, appearance and location. Finally, we propose two actionness networks to score the likelihood that each TAP contains a generic action. The top-ranked candidates are returned as the output TAPs. Experimental results show that the proposed TAP detection framework performs significantly better than relevant approaches for egocentric action detection.
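The atom-to-proposal step can be sketched as greedy agglomerative clustering over temporally adjacent atoms: the most similar neighboring segments merge first, and every intermediate segment is kept as a candidate proposal. The one-dimensional cue features, the averaged merge rule, and the stopping threshold here are placeholders; the thesis's clustering uses four egocentric cues and a learned actionness score for ranking.

```python
import numpy as np

def merge_atoms(features, max_dist=0.5):
    """Greedy agglomerative merging of temporally adjacent action atoms.
    `features` holds one descriptor per atom (e.g. pooled cue values).
    Adjacent segments with the most similar descriptors merge first;
    each segment ever formed is kept as a candidate temporal proposal."""
    segments = [[i] for i in range(len(features))]
    feats = [np.asarray(f, float) for f in features]
    proposals = [tuple(s) for s in segments]  # single atoms are proposals too
    while len(segments) > 1:
        dists = [np.linalg.norm(feats[i] - feats[i + 1])
                 for i in range(len(feats) - 1)]
        i = int(np.argmin(dists))
        if dists[i] > max_dist:  # stop when neighbors are too dissimilar
            break
        segments[i] = segments[i] + segments.pop(i + 1)
        feats[i] = (feats[i] + feats.pop(i + 1)) / 2.0  # merged descriptor
        proposals.append(tuple(segments[i]))
    return proposals

# Four atoms forming two similar pairs yield two merged proposals.
atom_feats = [[0.0], [0.1], [1.0], [1.05]]
taps = merge_atoms(atom_feats, max_dist=0.3)
```

In the full framework, each candidate segment would then be scored by the actionness networks, and only the top-ranked candidates returned.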

Research areas
  • Saliency detection, Objectness estimation, Stereo object proposals, Actionness estimation, Temporal action proposals