Robust Representation Learning for Crowd Counting and Its Applications


Student thesis: Doctoral Thesis





Award date: 23 Aug 2021


Crowd counting is an essential topic in computer vision because of its practical use in surveillance systems. Crowd counting algorithms are typically designed in two steps. First, ground-truth density maps of crowd images are generated from the ground-truth dot maps (density map generation), e.g., by convolving them with a Gaussian kernel. Second, deep learning models are designed to predict a density map from an input image (density map estimation). Density-map-based counting methods, which use the density map as an intermediate representation, have improved counting performance dramatically. However, real-world applications of crowd counting remain limited by several challenges. First, from the perspective of end-to-end training, the hand-crafted methods used to generate the density maps may not be optimal for the particular network or dataset used. Second, annotation noise is not effectively modeled, which harms the robustness and generalization ability of the networks. Finally, current crowd counting algorithms are concerned only with the number of people in an image and therefore lack low-level, fine-grained information about the crowd.
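
As an illustration of the density map generation step mentioned above, the following is a minimal sketch of the conventional hand-crafted approach, not the adaptive generator proposed in the thesis. The function names, the fixed bandwidth `sigma`, and the per-kernel renormalization are assumptions made for this example; plain Python lists are used only to keep the sketch self-contained.

```python
import math


def gaussian_density_map(dots, height, width, sigma=2.0):
    """Build a density map by placing one 2-D Gaussian on each annotated dot.

    dots:  list of (row, col) head annotations lying inside the image.
    sigma: fixed kernel bandwidth in pixels (an assumed value).
    Each kernel is renormalized over the image grid so that it sums to 1,
    which makes the whole map sum to the number of annotated people.
    """
    density = [[0.0] * width for _ in range(height)]
    for r0, c0 in dots:
        # Unnormalized Gaussian centered on this annotation.
        kernel = [
            [
                math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2.0 * sigma ** 2))
                for c in range(width)
            ]
            for r in range(height)
        ]
        total = sum(map(sum, kernel))
        for r in range(height):
            for c in range(width):
                density[r][c] += kernel[r][c] / total
    return density


def count_from_density(density):
    """The predicted count is the sum (discrete integral) of the density map."""
    return sum(map(sum, density))
```

With three annotated dots, `count_from_density(gaussian_density_map(dots, 16, 32))` returns 3 up to floating-point error, which is the property that lets a density map serve as an intermediate representation for counting.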

In this thesis, we first propose an adaptive density map generator, which takes the annotation dot map as input and learns a density map representation for a counter. The counter and generator are trained jointly within an end-to-end framework. We also propose a noisy annotation modeling approach that represents the annotation noise with a Gaussian-distributed random variable, and we derive the probability density function (pdf) of the crowd density value at each spatial location in the image to improve the robustness of the representation. To improve generalization ability, we propose a residual regression approach that models the correlation information between samples. Finally, we investigate learning the density map representation through an unbalanced optimal transport problem, which avoids the intermediate representation by directly evaluating the distance between a density map prediction and the dot annotations. We prove that traditional loss functions are special cases of, and suboptimal solutions to, our proposed loss function.
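
To make the noisy annotation idea concrete, the sketch below looks only at the mean of the density value at one location, not at the full pdf derived in the thesis. It assumes each annotation is shifted by isotropic Gaussian noise with standard deviation `alpha` and that the density kernel is an isotropic Gaussian with standard deviation `beta`; these names and the helper functions are hypothetical. Under those assumptions, the standard result that convolving two Gaussians adds their variances gives a closed-form mean, which the Monte Carlo estimate should approach.

```python
import math
import random


def gaussian2d(x, mean, var):
    """Isotropic 2-D Gaussian density with variance `var` along each axis."""
    sq_dist = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-sq_dist / (2.0 * var)) / (2.0 * math.pi * var)


def expected_density(x, dots, alpha, beta):
    """Closed-form mean of the density at x under Gaussian annotation noise.

    Averaging a kernel of variance beta^2 over noise of variance alpha^2
    yields a Gaussian of variance alpha^2 + beta^2 around each annotation.
    """
    return sum(gaussian2d(x, d, alpha ** 2 + beta ** 2) for d in dots)


def sampled_density(x, dots, alpha, beta, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same mean, jittering every annotation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for d in dots:
            noisy = (d[0] + rng.gauss(0.0, alpha), d[1] + rng.gauss(0.0, alpha))
            total += gaussian2d(x, noisy, beta ** 2)
    return total / n_samples
```

Comparing `expected_density` with `sampled_density` at a few locations shows how annotation noise effectively widens the kernel; the variance and higher moments of the density value, which this sketch does not cover, are where a full probabilistic treatment is needed.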

For practical applications, the total number of people in an image is not as useful as the number of people in each sub-category. For example, knowing the number of people waiting in line or browsing can help retail stores. In this thesis, we propose fine-grained crowd counting, which differentiates a crowd into categories based on the low-level behavior attributes of the individuals (e.g., standing/sitting or violent behavior) and then counts the number of people in each category. To enable research in this area, we construct a new dataset of four real-world fine-grained counting tasks: traveling direction on a sidewalk, standing or sitting, waiting in line or not, and exhibiting violent behavior or not. Since the appearance features of different crowd categories are similar, the challenge of fine-grained crowd counting is to use contextual information effectively to distinguish between categories. We propose feature propagation guided by the density map prediction, which eliminates the effect of background features during propagation to encode contextual information. Experimental results confirm the effectiveness of our method.
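
The counting step of fine-grained crowd counting can be sketched as follows. This is an illustration only and does not reproduce the feature propagation method of the thesis: it assumes, hypothetically, that a model outputs one overall density map plus a per-pixel probability map for each category, with the probabilities summing to 1 at every pixel. The function name and data layout are choices made for this example.

```python
def per_category_counts(density, category_probs):
    """Split a density map into per-category counts.

    density:        H x W nested list of non-negative density values.
    category_probs: dict mapping category name -> H x W nested list of
                    per-pixel probabilities (assumed to sum to 1 across
                    categories at every pixel).
    Returns a dict mapping category name -> estimated number of people.
    """
    counts = {}
    for name, probs in category_probs.items():
        counts[name] = sum(
            d * p
            for d_row, p_row in zip(density, probs)
            for d, p in zip(d_row, p_row)
        )
    return counts
```

Because the probabilities sum to 1 at each pixel, the per-category counts add up to the total count given by the overall density map, so the fine-grained output stays consistent with ordinary crowd counting.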