Scalable and accurate dynamic texture models with applications to computer vision
Student thesis: Doctoral Thesis
Related Research Unit(s)
Dynamic texture modeling of motion sequences is one of the most active and inspiring research areas in computer vision and machine learning. Recently, challenging applications such as time-series analysis, video clustering, motion segmentation, background estimation, and video annotation have been addressed with great success using dynamic texture models. Traditional approaches such as parametric models and optical flow fall short on these problems, for example sequences containing fire, smoke, or highway traffic, because they run contrary to the way biological vision perceives such scenes. Rather than modeling individual objects or particles of motion, dynamic textures model the motion as a whole process, in a manner closer to biological vision. In this thesis, we propose several extensions of traditional dynamic texture models, i.e., dynamic texture mixture models (DTMs), with applications to existing and novel domains in computer vision, improving both accuracy and scalability. First, the hierarchical EM (HEM) algorithm for clustering the DTM model is derived. The HEM-DTM algorithm both clusters DTs and learns novel DT cluster centers that are representative of the cluster members, in a manner consistent with the underlying generative probabilistic model of the DT. We also derive an efficient recursive algorithm for sensitivity analysis of the discrete-time Kalman smoothing filter, which serves as the basis for computing expectations in the E-step of the HEM algorithm. Finally, we demonstrate the efficacy of the clustering algorithm on several applications in motion analysis, including hierarchical motion clustering, semantic motion annotation, and learning bag-of-systems codebooks for dynamic texture recognition.
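A dynamic texture is commonly formulated as a linear dynamical system (LDS): a hidden state evolves under linear Gaussian dynamics while frames are linear Gaussian observations of that state. As a minimal illustration (not the thesis code; the names A, C, Q, R and all dimensions below are arbitrary), the following sketch samples a short synthetic "video" from such an LDS:

```python
import numpy as np

# Dynamic texture as a linear dynamical system (LDS):
#   x_{t+1} = A x_t + v_t,  v_t ~ N(0, Q)   (hidden state dynamics)
#   y_t     = C x_t + w_t,  w_t ~ N(0, R)   (observed frame pixels)

rng = np.random.default_rng(0)

n, m, T = 4, 16, 50            # state dim, pixel dim, number of frames
A = 0.9 * np.eye(n)            # stable state-transition matrix
C = rng.standard_normal((m, n))  # observation (appearance) matrix
Q = 0.01 * np.eye(n)           # state noise covariance
R = 0.05 * np.eye(m)           # observation noise covariance

x = rng.standard_normal(n)
frames = []
for _ in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    y = C @ x + rng.multivariate_normal(np.zeros(m), R)
    frames.append(y)

video = np.stack(frames)       # synthetic "video": T frames of m pixels
print(video.shape)
```

Learning a DT from a real clip amounts to estimating A, C, Q, R (and the initial state) from the observed frames; the Kalman smoother mentioned above computes the state expectations needed in the E-step.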
The bag-of-systems (BoS) representation is a descriptor of motion in a video, where dynamic texture (DT) codewords represent the typical motion patterns in spatiotemporal patches extracted from the video. The efficacy of the BoS descriptor depends on the richness of the codebook, which in turn depends on the number of codewords. However, even for modest-sized codebooks, mapping videos onto the codebook incurs a heavy computational load. In this part of the thesis, we propose the BoS Tree, which constructs a bottom-up hierarchy of codewords that enables efficient mapping of videos to the BoS codebook. By leveraging the tree structure to efficiently index the codewords, the BoS Tree allows fast look-ups in the codebook and enables the practical use of larger, richer codebooks. We demonstrate the effectiveness of BoS Trees on classification of four video datasets, as well as on annotation of a video dataset and a music dataset. Finally, we show that, although the fast look-ups of the BoS Tree yield different descriptors than the flat BoS for the same video, the overall distance (and kernel) matrices are highly correlated, resulting in similar classification performance. We next present a data structure for fast nearest-neighbor retrieval of generative models of documents based on the Kullback-Leibler (KL) divergence. Our data structure, which shares some similarity with Bregman Ball Trees, consists of a hierarchical partition of the database and uses a novel branch-and-bound methodology for search. The main technical contribution of this work is a novel and efficient algorithm, based on a variational approximation, for deciding whether to explore nodes during backtracking. This reduces the number of computations per node and overcomes the limitations of Bregman Ball Trees on high-dimensional data.
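The speedup behind a hierarchical codebook can be illustrated with a toy two-level tree: a query is first compared against one representative per group of codewords, and the full comparison is then restricted to the winning group. The sketch below is only an analogy, using Euclidean distance between plain feature vectors as a stand-in for the likelihood/KL-based comparison between DT codewords; all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

codewords = rng.standard_normal((64, 8))  # 64 leaf codewords (toy features)

# Two-level bottom-up hierarchy: partition the leaves into groups and
# represent each group by its mean (a "novel cluster center", HEM-style).
groups = codewords.reshape(8, 8, 8)       # 8 groups of 8 codewords each
centers = groups.mean(axis=1)             # one representative per group

def dist(a, x):
    return np.linalg.norm(a - x, axis=-1)

def tree_lookup(x):
    g = int(np.argmin(dist(centers, x)))       # compare to 8 centers...
    leaf = int(np.argmin(dist(groups[g], x)))  # ...then 8 leaves, not 64
    return g * 8 + leaf                        # index into the flat codebook

def brute_lookup(x):
    return int(np.argmin(dist(codewords, x)))  # exhaustive baseline

x = codewords[37] + 0.01 * rng.standard_normal(8)  # noisy query
print(tree_lookup(x), brute_lookup(x))
```

As in the thesis results, the tree look-up can return a different codeword than the exhaustive search (the greedy descent may commit to the wrong group), which is why the resulting descriptors differ while remaining highly correlated overall.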
In addition, our strategy applies to probability distributions with hidden state variables and is not limited to regular exponential-family distributions. Experiments demonstrate substantial speedups over both Bregman Ball Trees and brute-force search, on both moderate- and high-dimensional histogram data. Further experiments on dynamic textures demonstrate the flexibility of our approach on latent variable models. Next, we propose a joint foreground-background mixture model (FBM) that simultaneously performs background subtraction and motion segmentation in complex dynamic scenes. The FBM consists of a set of location-specific dynamic texture (DT) components, for modeling local background motion, and a set of global DT components, for modeling consistent foreground motion. We derive an EM algorithm for estimating the parameters of the FBM. We also apply spatial constraints to the FBM using a Markov random field grid, and derive a corresponding variational approximation for inference. Unlike existing approaches to background subtraction, the FBM does not require a manually selected threshold or a separate training video. Unlike existing motion segmentation techniques, the FBM can segment foreground motions over complex backgrounds with mixed motions, and can detect stopped objects. Since most dynamic scene datasets contain only videos with a single foreground object over a simple background, we develop a new, challenging dataset with multiple foreground objects over complex dynamic backgrounds. In experiments, we show that jointly modeling the background and foreground segments with the FBM adds flexibility and power beyond existing background subtraction techniques.
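The joint background/foreground idea can be shown in a deliberately simplified form: static Gaussian components stand in for dynamic textures, each location keeps its own background model, a single global component models foreground appearance, and each pixel is assigned to whichever component explains it better (the responsibility an E-step would compute). Every parameter and value below is illustrative, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(2)

H, W = 8, 8
bg_mean = rng.uniform(0.4, 0.6, (H, W))  # location-specific background means
bg_var = 0.01                            # shared background variance
fg_mean, fg_var = 0.9, 0.02              # one global foreground component

# Synthetic frame: background everywhere, a bright 3x3 foreground object.
frame = bg_mean + rng.normal(0.0, 0.05, (H, W))
frame[2:5, 2:5] = fg_mean + rng.normal(0.0, 0.05, (3, 3))

def log_gauss(x, mu, var):
    # Gaussian log-likelihood, elementwise
    return -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))

# Hard E-step-like assignment with equal priors: a pixel is foreground
# when the global foreground component explains it better than the
# location-specific background component.
fg_mask = log_gauss(frame, fg_mean, fg_var) > log_gauss(frame, bg_mean, bg_var)
print(fg_mask.sum())  # count of pixels labeled foreground
```

The full FBM replaces these static Gaussians with DT components over spatiotemporal patches, makes the assignment soft within EM, and couples neighboring assignments through the MRF grid, so no hand-tuned threshold is needed.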
Because the EM and HEM algorithms for dynamic texture models are memory- and compute-intensive, experiments on annotation, classification, and segmentation over large-scale datasets using complex protocols, such as leave-one-out cross-validation, require a significant amount of time to complete. Finally, we contribute fast C++ and OpenCV implementations of the proposed DT algorithms and application frameworks.
- Computer vision, Visual texture recognition