Abstract
Interactive object selection and extraction is an important research problem: it extracts foreground objects from single static images or multiple video frames with very little user interaction, and it has many useful applications such as image/video annotation, localized editing, image composition, object tracking, and video surveillance. However, traditional algorithms are extremely time-consuming and require substantial user interaction to accurately estimate the foreground and background distributions. Recently, deep learning based approaches have shown a better understanding of the semantic information of objects and thus perform well in semantic segmentation. Benefiting from the power of the Convolutional Neural Network (CNN) for solving computer vision problems, we propose to combine low-level feature (color, edge, etc.) based approaches, which have real-time computational capability, with higher-level semantic representations learned by a CNN, reducing the human interaction to just a single touch point.
In this thesis, we investigate fast interactive object extraction from images and videos with minimal user interaction - a single touch point. In particular, we identify three research problems concerning the proposed solution in different scenarios:
1) Interactive object proposals in a single image. We first investigate how to generate object proposals with low-level features in an image, and then explore an architecture that integrates human interaction into a CNN to rank the generated object proposals so that the intended object can be accurately identified.
2) Background subtraction with a fixed camera. Extracting the selected moving object becomes much easier once the background is subtracted well from the video frames. Thus, we propose a robust background subtraction approach that runs in real time.
3) Interactive motion segmentation with a moving camera. When the background itself exhibits obvious motion, it becomes very challenging to segment a moving object well with traditional background subtraction algorithms. Instead, we propose to generate an interactive motion segmentation with respect to the touch point. With this interactive motion cue, we can enhance the localization of the object extracted from a single image.
Object proposal algorithms for single images have proven very successful in accelerating the object detection process. High object localization quality and detection recall can be obtained using thousands of proposals, but the performance with only a small number of proposals remains unsatisfactory. In this thesis, we demonstrate that the performance with a few proposals can be significantly improved with minimal human interaction - a single touch point. To this end, we first generate hierarchical superpixels using an efficient tree-organized structure as our initial object proposals, and then select only a few of them by learning an effective CNN for objectness ranking. We design an architecture that integrates human interaction with the global information of the whole image for objectness scoring, which significantly improves the performance with a minimum number of object proposals. Extensive experiments show that the proposed method outperforms all state-of-the-art methods at finding the intended object under the touch-point constraint.
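The touch-point-constrained ranking above can be illustrated with a toy version: keep only candidate boxes that contain the touch point, then order them by an objectness score with a small preference for boxes centred near the touch. The function name, the `(x0, y0, x1, y1)` box format, and the linear distance penalty are illustrative assumptions; the thesis performs this scoring with a learned CNN, not a hand-crafted rule.

```python
import math

def rank_proposals(proposals, touch_point, objectness):
    """Rank candidate boxes by objectness under a touch-point constraint.

    A simplified stand-in for the CNN-based interactive scoring: boxes
    that miss the touch point are discarded outright, and remaining
    boxes are ordered by objectness minus a small distance penalty.
    """
    tx, ty = touch_point
    scored = []
    for box, score in zip(proposals, objectness):
        x0, y0, x1, y1 = box
        if not (x0 <= tx <= x1 and y0 <= ty <= y1):
            continue  # proposal does not contain the touch point
        # prefer boxes whose centre lies near the touch point
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        penalty = 0.01 * math.hypot(cx - tx, cy - ty)
        scored.append((score - penalty, box))
    scored.sort(reverse=True)
    return [box for _, box in scored]
```

With this scheme a high-objectness box far from the touch point is rejected, so only a handful of touch-consistent proposals need to be examined.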
For interactive moving object extraction with a fixed camera, we propose a background subtraction algorithm based on hierarchical superpixel segmentation, spanning trees, and optical flow. First, we generate superpixel segmentation trees using a number of Gaussian Mixture Models (GMMs), treating each GMM as one vertex when constructing the spanning trees. Next, we use an M-smoother to enhance spatial consistency on the spanning trees, and we estimate optical flow to extend the M-smoother to the temporal domain. Experimental results on benchmark datasets show that the proposed algorithm performs favorably against state-of-the-art background subtraction methods, despite frequent and sudden changes in pixel values.
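As a rough illustration of statistical per-pixel background modelling, the sketch below maintains a single running Gaussian per pixel and flags pixels that deviate from the model by more than k standard deviations. This is a deliberate simplification of the GMM/spanning-tree model described above (one Gaussian instead of a mixture, no spatial smoothing); the class name and parameter defaults are assumptions.

```python
import numpy as np

class RunningGaussianBG:
    """Per-pixel single-Gaussian background model: a much-simplified
    stand-in for the GMM-based model described in the text."""

    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = first_frame.astype(np.float64)
        self.var = np.full(first_frame.shape, 15.0 ** 2)  # initial variance
        self.alpha = alpha  # learning rate for model updates
        self.k = k          # foreground threshold in standard deviations

    def apply(self, frame):
        """Return a boolean foreground mask and update the model."""
        frame = frame.astype(np.float64)
        diff = frame - self.mean
        fg = diff ** 2 > (self.k ** 2) * self.var  # outside k-sigma band
        # update mean/variance only where the pixel matched the background
        bg = ~fg
        self.mean[bg] += self.alpha * diff[bg]
        self.var[bg] += self.alpha * (diff[bg] ** 2 - self.var[bg])
        return fg
```

In the thesis's setting, the M-smoother on the spanning trees would then enforce spatial consistency on such a raw per-pixel mask, and optical flow would propagate it over time.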
For interactive moving object extraction with a moving camera, we propose to generate an interactive motion segmentation by estimating the geometric view transformation between consecutive video frames and labeling the regions that deviate from the estimated transformation, according to the touch point, using a distance transform. We then use the edge of the interactive motion map as a new constraint for the generation of the Superpixel Hierarchy (SH) and adopt the same network to rank the object proposals. This greatly improves the performance of interactive object detection for moving objects with fast computation, and it can be used in interactive vision systems, for example to select the input of a real-time tracking system for video surveillance.
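The core idea of labeling regions that disagree with the estimated view transformation can be sketched as follows: fit a global affine motion to point correspondences between consecutive frames by least squares, then flag the points whose displacement deviates from the fitted motion - those belong to independently moving objects rather than the camera-induced background motion. This is a simplified stand-in (the thesis estimates the full geometric transformation and then applies a distance transform around the touch point); the function name and threshold are assumptions.

```python
import numpy as np

def motion_outliers(pts_prev, pts_next, thresh=10.0):
    """Flag correspondences inconsistent with the global camera motion.

    Fits an affine transform A mapping pts_prev to pts_next by least
    squares, then marks points whose residual exceeds `thresh` pixels.
    """
    n = len(pts_prev)
    # homogeneous design matrix: each row is [x, y, 1]
    X = np.hstack([pts_prev, np.ones((n, 1))])
    A, *_ = np.linalg.lstsq(X, pts_next, rcond=None)  # A is 3x2
    residual = np.linalg.norm(X @ A - pts_next, axis=1)
    return residual > thresh  # True where motion disagrees with the fit
```

A robust estimator (e.g. RANSAC) would normally replace the plain least-squares fit so that the moving object does not bias the estimated camera motion; the sketch keeps the simplest version for clarity.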
| Date of Award | 18 Jan 2018 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Qing LI (Supervisor) & Qingxiong YANG (Supervisor) |
Keywords
- Object Selection
- Object Proposal
- Transfer Learning
- Superpixel Hierarchy
- Background Modelling
- Minimum Spanning Tree
- Optical Flow
- Tracking