Image/Video Restoration and Object Tracking in Outdoor Environments


Student thesis: Doctoral Thesis



Award date: 30 Jan 2020


Computer vision algorithms (e.g., detection, segmentation, and tracking) have been applied in many fields, such as autonomous driving, intelligent transportation, and public surveillance. However, they often fail in outdoor environments, which feature varying illumination, adverse weather, or crowded scenes. Varying illumination and adverse weather mainly degrade the visibility of an image, while crowded scenes usually cause occlusion or distraction. In this thesis, we address two main problems: image/video restoration and object tracking in outdoor environments. For image/video restoration, we focus on highlight removal and video snow/rain removal; for object tracking, we aim to solve single-object and multi-object tracking in crowded scenes. Our experiments show that image/video restoration benefits object tracking. In summary, the main contributions of this thesis are as follows:

1). Based on the dichromatic reflection model, recent methods for specular reflection separation (highlight removal) typically separate the specular reflection from a single image using patch-based priors. Lacking global information, these methods often cannot completely remove the specular component of an image and tend to degrade image textures. In this thesis, we derive a global color-lines constraint from the dichromatic reflection model to effectively separate specular and diffuse reflections. Our key observation is that each image pixel lies along a color line in normalized RGB space, and that different color lines, each representing a distinct diffuse chromaticity, intersect at one point (i.e., the illumination chromaticity). Pixels along the same color line spread over the entire image, and their distances to the illumination chromaticity reflect the amount of the specular reflection component. With global (non-local) information from these color lines, our method can effectively separate the specular and diffuse reflections of a single image pixel-wise, and it is suitable for real-time applications. Experimental results on synthetic and real images show that our method outperforms state-of-the-art methods.
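The color-line geometry described above can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: the function names and the assumption that the diffuse and illumination chromaticities are already known are mine; the thesis estimates these globally from the image.

```python
import math

def normalized_rgb(pixel):
    """Project an (R, G, B) pixel onto the chromaticity plane r + g + b = 1."""
    s = sum(pixel)
    return tuple(c / s for c in pixel) if s > 0 else (1 / 3, 1 / 3, 1 / 3)

def specular_fraction(pixel, diffuse_chroma, illum_chroma):
    """How far the pixel's chromaticity has moved from the diffuse end of
    its color line toward the illumination chromaticity
    (0 = pure diffuse, 1 = pure specular)."""
    p = normalized_rgb(pixel)
    line_len = math.dist(diffuse_chroma, illum_chroma)
    if line_len == 0:
        return 0.0
    return min(1.0, math.dist(p, diffuse_chroma) / line_len)
```

For example, with a neutral illumination chromaticity (1/3, 1/3, 1/3) and a diffuse chromaticity (0.6, 0.3, 0.1), a purely diffuse pixel such as (60, 30, 10) yields 0, while a pixel whose chromaticity sits at the midpoint of the color line yields 0.5.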

2). Existing snow/rain removal methods often fail for heavy snow/rain and dynamic scenes. One reason is the assumption that all snowflakes/rain streaks are sparse in snow/rain scenes. Another is that existing methods often cannot differentiate moving objects from snowflakes/rain streaks. In this thesis, we propose a model based on matrix decomposition for video desnowing and deraining that addresses these problems. We divide snowflakes/rain streaks into two categories: sparse ones and dense ones. Using background fluctuations and optical flow information, we formulate the detection of moving objects and sparse snowflakes/rain streaks as a multi-label Markov Random Field (MRF). Dense snowflakes/rain streaks are modeled with a Gaussian distribution. Snowflakes/rain streaks in scene backgrounds, both sparse and dense, are removed via a low-rank representation of the backgrounds, while a group sparsity term in our model filters snow/rain pixels within the moving objects. Experimental results show that our proposed model outperforms state-of-the-art methods for snow and rain removal.
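The background-subtraction intuition behind streak removal can be sketched as follows. Note the substitutions: a per-pixel temporal median stands in for the thesis's low-rank background representation, and a simple brightness threshold stands in for the Gaussian model of dense streaks and the MRF labeling; all names are illustrative.

```python
from statistics import median

def background_model(frames):
    """Per-pixel temporal median over a list of grayscale frames
    (each frame is a list of rows of intensities) -- a crude stand-in
    for a low-rank background."""
    h, w = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(w)]
            for y in range(h)]

def remove_streaks(frame, background, thresh=30):
    """Replace pixels much brighter than the background (candidate
    snow/rain streaks) with the background value."""
    return [[bg if px - bg > thresh else px
             for px, bg in zip(frow, brow)]
            for frow, brow in zip(frame, background)]
```

Because snow/rain pixels are transient, the temporal median recovers the background, and bright residuals over it are treated as streaks; the thesis's full model additionally protects moving objects via the MRF and group sparsity term, which this toy omits.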

3). While people tracking has greatly improved in recent years, crowded scenes remain particularly challenging due to heavy occlusions, high crowd density, and significant appearance variation. To address these challenges, we first design a Sparse Kernelized Correlation Filter (S-KCF) to suppress target response variations caused by occlusions and spurious responses caused by similar distractors. We then propose a people tracking framework that fuses the S-KCF response map with an estimated crowd density map using a convolutional neural network (CNN), yielding a refined response map. To train the fusion CNN, we propose a two-stage strategy that gradually optimizes the parameters: the first stage trains a preliminary model in batch mode with image patches selected around targets, and the second stage fine-tunes the preliminary model using the real frame-by-frame tracking process. Our density fusion framework significantly improves single-person tracking in crowded scenes and can also be combined with other visual trackers to improve their tracking performance. For multiple people tracking, we further extend our fusion CNN by incorporating the responses of distractors. We validate our framework on five crowd video datasets: UCSD, PETS2009, LHI, DukeMTMC, and MOT17.
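The data flow of the fusion step can be illustrated with a toy. The thesis learns the fusion with a CNN trained in two stages; here a fixed pointwise convex combination stands in for the learned fusion (a named simplification), just to show how a density map can suppress a spurious response peak.

```python
def fuse_response(response, density, alpha=0.7):
    """Pointwise convex combination of a tracker response map and a
    crowd density map (same shape, values in [0, 1]); stand-in for
    the learned fusion CNN."""
    return [[alpha * r + (1 - alpha) * d
             for r, d in zip(rrow, drow)]
            for rrow, drow in zip(response, density)]

def argmax2d(m):
    """Target location = (row, col) of the peak of the refined map."""
    best = max((v, y, x) for y, row in enumerate(m) for x, v in enumerate(row))
    return best[1], best[2]
```

For instance, a spurious peak at (0, 0) in the raw response can be overtaken by a true-target peak at (1, 1) once the density map, which is high only near real people, is fused in.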

4). State-of-the-art multi-object tracking (MOT) methods, including our fusion tracker above, follow the tracking-by-detection paradigm, where object trajectories are obtained by associating the per-frame outputs of object detectors. In crowded scenes, however, detectors often fail to obtain accurate detections due to heavy occlusions and high crowd density. In this thesis, we propose a new MOT paradigm, tracking-by-counting, tailored for crowded scenes. Using crowd density maps, we jointly model the detection, counting, and tracking of multiple targets as a network flow program, which simultaneously finds the globally optimal detections and trajectories of multiple targets over the whole video. This is in contrast to prior MOT methods that either ignore crowd density, and are thus prone to errors in crowded scenes, or rely on a suboptimal two-step process using heuristic density-aware point tracks for matching targets. Our approach yields promising results on public benchmarks from various domains, including people tracking, cell tracking, and fish tracking.
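To make the association idea concrete, here is a heavily simplified toy: instead of the thesis's joint network-flow program over the whole video, it links detections between just two consecutive frames by brute-force minimum-total-distance matching (a per-frame stand-in, feasible only for a handful of targets; all names are illustrative).

```python
import math
from itertools import permutations

def link_frames(dets_a, dets_b):
    """Min-total-distance one-to-one matching between two equal-sized
    lists of (x, y) detections, found by brute force over permutations.
    Returns (list of (index_a, index_b) pairs, total cost)."""
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(dets_b))):
        cost = sum(math.dist(dets_a[i], dets_b[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost
```

A global network-flow formulation generalizes this: each detection becomes a node, candidate links become arcs with costs, and a min-cost flow over all frames at once yields trajectories that are jointly optimal rather than greedily chained frame by frame.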

    Research areas

  • highlight removal, matrix decomposition, video desnowing and deraining, visual object tracking, crowd density maps