Wide-Area Crowd Counting via Deep Learning Based Multi-View Fusion
深度學習下多視角融合的寬廣場景人群計數算法
Student thesis: Doctoral Thesis
Detail(s)
Award date | 26 Feb 2021 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(8e7e23eb-610a-4216-8599-f5f70fbabe09).html |
---|---|
Abstract
Deep learning based crowd counting in single-view images has achieved outstanding performance on existing counting datasets. However, single-view counting is not applicable to large and wide scenes (e.g., public parks, long subway platforms, or event spaces), because a single camera cannot capture the whole scene in adequate detail for counting: the scene may be too large to fit into the camera's field-of-view, too long so that the resolution on faraway crowds is too low, or contain too many large objects that occlude large portions of the crowd. Therefore, solving the wide-area counting task requires multiple cameras with overlapping fields-of-view. Traditional multi-view counting methods rely on foreground extraction techniques and hand-crafted features, which limit counting performance. To address these problems, in this thesis we explore deep learning based multi-view fusion methods for better wide-area crowd counting performance.
First, we put forward a deep neural network framework for multi-view crowd counting, which fuses information from multiple camera views to predict a scene-level density map on the ground-plane of the 3D world. We consider three versions of the fusion framework: the late fusion model fuses camera-view density maps; the naïve early fusion model fuses camera-view feature maps; and the multi-view multi-scale early fusion model ensures that features aligned to the same ground-plane point have consistent scales. We also collect a real-world wide-area counting dataset consisting of multiple camera views, which will advance research on multi-view wide-area counting. Second, considering the variable height of people in the 3D world, we propose to solve the multi-view counting task by predicting 3D density maps, instead of 2D ones. We also exploit the projection consistency between the 3D prediction and the ground-truth in the 2D views to further enhance the counting performance. Third, when designing models for multi-camera based tasks, it is usually assumed that the cameras are all temporally synchronized. However, this assumption is not always valid, especially for multi-camera systems with network transmission delay and low frame-rates due to limited network bandwidth, which desynchronize the captured frames across cameras. We therefore propose a synchronization model that works in conjunction with existing DNN-based multi-view models, avoiding a redesign of the whole model. Fourth, unlike previous multi-view counting methods, which are trained and tested on the same scene, we propose a cross-view cross-scene (CVCS) multi-view counting model that attentively selects and fuses multiple views using camera layout geometry, together with a noise-view regularization method that trains the model to handle non-correspondence errors. Besides, we generate a synthetic multi-camera crowd counting dataset with a large number of scenes and camera views for training.
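The ground-plane projection underlying the late fusion model can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the homography `H_inv`, the grid size, nearest-neighbour sampling, and simple averaging are all stand-in assumptions (the thesis learns the fusion with a neural network, and the projections come from real camera calibration):

```python
import numpy as np

def project_to_ground(density, H_inv, grid_h, grid_w):
    """Warp a camera-view density map onto a ground-plane grid.

    density: (h, w) camera-view density map.
    H_inv:   3x3 homography mapping ground-plane coordinates to image
             coordinates (assumed given; real systems derive it from
             camera intrinsics/extrinsics and a ground-plane assumption).
    """
    ground = np.zeros((grid_h, grid_w))
    for gy in range(grid_h):
        for gx in range(grid_w):
            # Map the ground-plane cell into the image and sample.
            p = H_inv @ np.array([gx, gy, 1.0])
            u, v = p[0] / p[2], p[1] / p[2]
            ui, vi = int(round(u)), int(round(v))
            if 0 <= vi < density.shape[0] and 0 <= ui < density.shape[1]:
                ground[gy, gx] = density[vi, ui]  # nearest-neighbour sample
    return ground

def late_fusion(densities, homographies, grid_h, grid_w):
    """Naive late fusion: average the projected per-view density maps.

    The thesis fuses the projected maps with a learned CNN; plain
    averaging here is only a placeholder for that fusion step.
    """
    maps = [project_to_ground(d, H, grid_h, grid_w)
            for d, H in zip(densities, homographies)]
    return np.mean(maps, axis=0)
```

Early fusion follows the same projection step, but warps intermediate feature maps rather than the final density maps, letting the fusion network see richer per-view evidence before committing to a count.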
In summary, this thesis focuses on the deep learning based multi-view counting task and puts forward several end-to-end multi-view fusion models for practical situations, e.g., unsynchronized cameras and cross-view cross-scene settings. The thesis should advance research on multi-view crowd counting and contribute new ideas to the related research community.