Deep Neural Network Based Crowd Density Estimation for Counting, Detection and Tracking


Student thesis: Doctoral Thesis

Award date: 11 Jun 2019


For crowded scenes, the accuracy of object-based computer vision methods declines when the images are low-resolution and objects have severe occlusions. Taking counting methods as an example, almost all recent state-of-the-art counting methods bypass explicit detection and adopt regression-based methods to directly count the objects of interest. Among regression-based methods, density map estimation, where the number of objects inside a subregion is the integral of the density map over that subregion, is especially promising because it preserves spatial information, which makes it useful for both counting and localization (detection and tracking).
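As an illustration of the density map formulation, the sketch below builds a density map from point annotations by placing a normalized Gaussian at each annotated head location, so the map integrates to the object count and any subregion sum estimates the count there. The function name and the fixed bandwidth are illustrative assumptions, not the thesis's exact procedure (practical pipelines often use geometry-adaptive kernel widths):

```python
import numpy as np

def gaussian_density_map(points, shape, sigma=4.0):
    """Build a crowd density map by placing a normalized 2D Gaussian at
    each annotated head location (x, y). Each person contributes exactly 1
    to the map's integral, so density.sum() equals the total count and
    summing over any subregion estimates the count inside it.
    Illustrative sketch with a fixed sigma (an assumption)."""
    h, w = shape
    density = np.zeros((h, w), dtype=np.float64)
    # Precompute one truncated, normalized Gaussian kernel.
    radius = int(3 * sigma)
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    kernel /= kernel.sum()  # unit mass per person
    for (x, y) in points:
        x, y = int(round(x)), int(round(y))
        # Clip the kernel at the image borders.
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        kx0 = x0 - (x - radius)
        kx1 = (2 * radius + 1) - ((x + radius + 1) - x1)
        ky0 = y0 - (y - radius)
        ky1 = (2 * radius + 1) - ((y + radius + 1) - y1)
        density[y0:y1, x0:x1] += kernel[ky0:ky1, kx0:kx1]
    return density
```

For annotations away from the image border, the map's total sum equals the number of annotated people exactly; near borders a small amount of mass is clipped.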

In this thesis, we evaluate density maps generated by various density estimation methods on a variety of crowd analysis tasks, including counting, detection, and tracking. A simple yet effective dense per-pixel density prediction method, which has not been tested before, is proposed for completeness. Previously, most CNN methods produced density maps with resolution smaller than that of the original images, due to the downsampling strides in the convolution/pooling operations. In contrast, our proposed CNN uses a sliding window regressor to predict the density for every pixel in the image. We also consider a fully convolutional adaptation, with skip connections from lower convolutional layers to compensate for the loss of spatial information during upsampling. In our experiments, we found that the lower-resolution density maps sometimes yield better counting performance. In contrast, the original-resolution density maps improve localization tasks, such as detection and tracking, compared to bilinearly upsampling the lower-resolution density maps. We also propose several metrics for measuring the quality of a density map, and relate them to experimental results on counting and localization.
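One practical detail when comparing low-resolution density maps against bilinearly upsampled ones: plain bilinear upsampling inflates the integral of the map by the square of the scale factor, so the upsampled map must be rescaled to keep the count unchanged. A minimal NumPy sketch (the function name and the sampling convention are assumptions, not the thesis's code):

```python
import numpy as np

def upsample_density(density, factor):
    """Bilinearly upsample a low-resolution density map by an integer
    factor, then divide by factor**2 so the total count (the integral
    of the map) is preserved. Without the division the upsampled map
    would over-count by factor**2. Illustrative sketch."""
    h, w = density.shape
    H, W = h * factor, w * factor
    # Source coordinates in the low-res grid (pixel-center convention).
    ys = np.clip((np.arange(H) + 0.5) / factor - 0.5, 0, h - 1)
    xs = np.clip((np.arange(W) + 0.5) / factor - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Standard separable bilinear interpolation.
    top = density[np.ix_(y0, x0)] * (1 - wx) + density[np.ix_(y0, x1)] * wx
    bot = density[np.ix_(y1, x0)] * (1 - wx) + density[np.ix_(y1, x1)] * wx
    up = top * (1 - wy) + bot * wy
    return up / factor**2  # rescale so up.sum() matches density.sum()
```

For mass located away from the map borders the rescaled upsampling preserves the count exactly, because the bilinear weights form a partition of unity.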

Since the performance of localization tasks powered or assisted by density maps depends heavily on the quality of the estimated density maps, the remainder of this thesis focuses on improving density estimation (evaluated via counting performance, the most widely used measure). Because of the powerful learning capability of deep neural networks, counting performance via density map estimation has improved significantly over the past several years. However, it remains very challenging due to severe occlusion, large scale variations, and perspective distortion.

Firstly, we notice that side information such as the camera perspective (e.g., camera angle and height), which gives a clue about the appearance and scale of people, has not been fully utilized in deep-learning-based counting systems, although such information is useful for counting systems based on traditional hand-crafted features. This under-utilization of available side information is widespread in computer vision, due to the lack of methods that can effectively incorporate such 1D information together with the 2D image input into the CNN pipeline. To incorporate the available side information, we propose an adaptive convolutional neural network (ACNN), where the convolution filter weights adapt to the current scene context via the side information. In particular, we model the filter weights as a low-dimensional manifold within the high-dimensional space of filter weights. The filter weights are generated by a learned "filter manifold" sub-network, whose input is the side information. With the help of side information and adaptive weights, the ACNN can disentangle the variations related to the side information and extract discriminative features related to the current context (e.g., camera perspective, noise level, blur kernel parameters). We demonstrate the effectiveness of the ACNN on three tasks that incorporate side information: crowd counting, corrupted digit recognition, and image deblurring. Our experiments show that the ACNN improves performance compared to a plain CNN with a similar number of parameters, and achieves performance similar to or better than the state of the art on the crowd counting task. Since existing crowd counting datasets do not contain ground-truth side information, we collect a new dataset with the ground-truth camera angle and height as the side information.
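The filter-manifold idea can be sketched as follows: a small sub-network maps the 1D side information to the weights of a convolution filter, so the filter changes with the scene context. Everything below is an illustrative assumption (a tiny tanh MLP as the sub-network, a single 3x3 adaptive layer, random untrained parameters), not the thesis's architecture:

```python
import numpy as np

def filter_manifold(side_info, W1, b1, W2, b2):
    """Hypothetical 'filter manifold' sub-network: a tiny MLP mapping 1D
    side information (e.g. [camera angle, height]) to the flattened
    weights of a KxK convolution filter. As the side information moves,
    the generated filter traces a low-dimensional manifold in weight
    space."""
    h = np.tanh(side_info @ W1 + b1)  # hidden layer
    return h @ W2 + b2                # flattened KxK filter

def adaptive_conv(image, side_info, params, k=3):
    """One adaptive convolution layer: generate the filter from the side
    information, then run a naive 'valid' convolution. A real ACNN would
    learn `params` end-to-end; here they are just given."""
    W1, b1, W2, b2 = params
    filt = filter_manifold(side_info, W1, b1, W2, b2).reshape(k, k)
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * filt)
    return out
```

The key design point is that the image branch and the side-information branch meet inside the convolution itself, rather than by concatenating the scalar side information onto a feature map.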

Although side information can be very helpful, it is sometimes difficult or impractical to collect, especially for large-scale datasets gathered from the Internet. During the ACNN experiments, we found that perspective (indicating the scale of the object) is the most helpful side information. Scale variations from image to image, coupled with perspective distortion within an image, result in huge changes in object size, making scale variation a major obstacle to accurate estimation. When side information is not available, scale variations can be handled effectively using image pyramids, but at the cost of more computation. Earlier methods based on convolutional neural networks (CNNs) typically did not handle this scale variation explicitly, until Hydra-CNN and MCNN. MCNN uses three columns, each with different filter sizes, to extract features at different scales. Our algorithm differs in two ways. First, instead of using filters of different sizes, we utilize an image pyramid to deal with scale variations; resizing the input fed into the network is more effective and efficient than using larger filter sizes. Second, we adaptively fuse the predictions from different scales (using adaptively changing per-pixel weights), which makes our method adapt to scale changes within an image. The adaptive fusion is achieved by generating an across-scale attention map, which softly selects a suitable scale for each pixel, followed by a 1x1 convolution. Extensive experiments on three popular datasets show very compelling results.
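The across-scale fusion step can be sketched as a per-pixel softmax over scale-wise predictions. In the sketch below the attention logits are taken as given, whereas in the actual network they would come from a learned attention branch and the weighted maps would be further mixed by a 1x1 convolution; the function name and shapes are illustrative assumptions:

```python
import numpy as np

def fuse_scales(preds, attn_logits):
    """Adaptively fuse density predictions from S pyramid scales, each
    already resized to a common (H, W) resolution. A per-pixel softmax
    over the scale axis yields an across-scale attention map that softly
    selects the most suitable scale at every location.

    preds, attn_logits: arrays of shape (S, H, W)."""
    a = attn_logits - attn_logits.max(axis=0, keepdims=True)  # stable softmax
    attn = np.exp(a)
    attn /= attn.sum(axis=0, keepdims=True)  # weights sum to 1 per pixel
    return (attn * preds).sum(axis=0)        # weighted sum across scales
```

Because the weights sum to one at every pixel, the fused map interpolates between the scale-wise predictions rather than hard-selecting a single scale, which lets different image regions use different effective scales.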