Learning Model Updates for Visual Tracking


Student thesis: Doctoral Thesis





Award date: 30 Jan 2020


Visual tracking is the task of localizing an object in a video given the bounding box of the target in the first frame. It is widely used in applications such as smart surveillance, autonomous driving and human-computer interaction.

Convolutional neural networks (CNNs) have recently gained popularity in visual tracking because of their robust feature representations of images. Recent methods perform online tracking by fine-tuning a pre-trained CNN model to the specific target object using stochastic gradient descent (SGD) with back-propagation, which is usually time-consuming. In this thesis, we propose a recurrent filter learning (RFL) method for visual tracking. We directly feed the target’s image patch to a recurrent neural network (RNN) to estimate an object-specific filter for tracking. As a video sequence is spatio-temporal data, we extend the matrix multiplications of the fully-connected layers of the RNN to convolution operations on feature maps, which preserves the target’s spatial structure and is also memory-efficient. The tracked object in subsequent frames is fed into the RNN to adapt the generated filters to appearance variations of the target. Note that once the offline training of our network is finished, there is no need to fine-tune the network for specific objects, which makes our approach more efficient than methods that use iterative fine-tuning to learn the target online. Extensive experiments conducted on the widely used OTB and VOT benchmarks demonstrate encouraging results compared to other recent methods.
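As a rough illustration of how an object-specific filter is applied to localize the target, the following minimal Python sketch correlates a filter with a toy search map and reads off the peak of the response. It is not the RFL network itself: the function names are hypothetical, the "feature maps" are tiny hand-written grids, and `blend_filter` is a fixed linear interpolation that merely stands in for the learned recurrent update.

```python
def cross_correlate(search, filt):
    """Valid-mode 2-D cross-correlation of a filter over a search map."""
    fh, fw = len(filt), len(filt[0])
    rows = len(search) - fh + 1
    cols = len(search[0]) - fw + 1
    return [
        [
            sum(search[i + a][j + b] * filt[a][b]
                for a in range(fh) for b in range(fw))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def peak_location(response):
    """Return (row, col) of the highest response value."""
    best_value, best_pos = None, None
    for i, row in enumerate(response):
        for j, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (i, j)
    return best_pos


def blend_filter(old_filter, new_template, rate):
    """Hypothetical stand-in for the learned recurrent update:
    a fixed linear interpolation between the old filter and a new template."""
    return [
        [(1.0 - rate) * o + rate * n for o, n in zip(old_row, new_row)]
        for old_row, new_row in zip(old_filter, new_template)
    ]


# Toy 5x5 search map containing the 2x2 target pattern at row 2, column 1.
target = [[1.0, 2.0], [3.0, 4.0]]
search = [[0.0] * 5 for _ in range(5)]
for a in range(2):
    for b in range(2):
        search[2 + a][1 + b] = target[a][b]

response = cross_correlate(search, target)
location = peak_location(response)
```

In this toy case the response peaks where the pattern was placed. In RFL, the filter would instead be produced by the convolutional RNN from the target patches seen so far.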

Updating the tracking model by interpolating historical object information (stored in the memory states of the RNN) with the newly arriving target template, using a convolutional RNN as in RFL, has shown promising results. To further improve tracking performance, we propose a dynamic memory network that adapts the template to the target’s appearance variations during tracking. The reading and writing of the external memory are controlled by an LSTM network that takes the search feature map as input. Because the location of the target is initially unknown, a spatial attention mechanism is applied to concentrate the LSTM input on the potential target. To prevent overly aggressive model adaptation, we apply gated residual template learning to control how much of the retrieved memory is combined with the initial template. To alleviate the drift problem, we also design a “negative” memory unit that stores templates for distractors, which are used to cancel out wrong responses from the object template. To further boost tracking performance, an auxiliary classification loss is added after the feature extractor. Unlike tracking-by-detection methods, where the object’s information is maintained by the weight parameters of neural networks and adaptation requires expensive online fine-tuning, our tracker runs completely feed-forward and adapts to the target’s appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as with other trackers: the capacity can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on the OTB and VOT datasets demonstrate that our trackers perform favorably against state-of-the-art tracking methods while retaining real-time speed.
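The memory read and the gated residual template can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: templates are flattened to short vectors, the read key and gate are given by hand (in the thesis they come from the LSTM controller), similarity is a plain dot product, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(memory, key, sharpness):
    """Attention-weighted read: slots more similar to the key weigh more."""
    scores = [sharpness * sum(m * k for m, k in zip(slot, key))
              for slot in memory]
    weights = softmax(scores)
    size = len(memory[0])
    retrieved = [sum(w * slot[d] for w, slot in zip(weights, memory))
                 for d in range(size)]
    return retrieved, weights


def gated_residual_template(initial, retrieved, gate):
    """Final template = initial template + gate * retrieved memory
    (element-wise), so the initial template is never overwritten."""
    return [t0 + g * r for t0, r, g in zip(initial, retrieved, gate)]


memory = [[1.0, 0.0], [0.0, 1.0]]  # two stored templates (toy size)
key = [1.0, 0.0]                   # assumed read key, set by hand here
retrieved, weights = read_memory(memory, key, sharpness=10.0)
final_template = gated_residual_template([1.0, 1.0], retrieved,
                                         gate=[0.5, 0.5])
```

Because the retrieved memory enters only as a gated residual, a gate near zero falls back to the initial template, which is one way to read the claim that gating limits aggressive adaptation.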

In the tracking community, most algorithms focus on developing powerful classifiers or templates to determine the center of the object while overlooking accurate bounding box estimation. They usually adopt a multi-scale search scheme to predict the scale change of the object, assuming that the aspect ratio of the target is fixed during tracking. However, this assumption does not always hold in real scenarios, which limits the precision of the predicted bounding box. Moreover, updating the tracking model online to adapt to object appearance variations is also crucial for improving tracking performance. For SGD-based model optimization, a large learning rate may help the model converge faster but risks making the loss fluctuate wildly. Traditional optimization methods therefore usually choose a relatively small learning rate and iterate for more steps until the model converges, which is time-consuming. We design a tracking model consisting of response generation and bounding box regression, where the first component produces a heat map indicating the presence of the object at different positions and the second regresses the bounding box shifts relative to anchors mounted on sliding-window locations. Thanks to the resizable convolutional filters used in both components to adapt to the shape changes of objects, our tracking model does not need to enumerate anchors of different sizes, thus saving model parameters. To effectively adapt the model to appearance variations, we propose to train a recurrent neural optimizer offline to update the tracking model in a meta-learning setting, which can converge the model in a few gradient steps. This substantially improves the convergence speed of updating the tracking model while achieving better performance. We also propose a simple yet effective training trick called Random Filter Scaling (RFS) to prevent overfitting, which boosts performance greatly.
We conduct comprehensive experiments on large-scale datasets, including the commonly used OTB and VOT as well as the recently proposed LaSOT, GOT10k and TrackingNet, and our trackers achieve favorable performance compared with the state of the art.
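The learning-rate trade-off that motivates the learned optimizer can be seen on a toy quadratic loss. The Python sketch below is an illustration only: the loss, the step sizes and the function names are assumptions, and the per-parameter step sizes are set by hand to the inverse curvature, standing in for what a trained recurrent optimizer would have to predict from the gradients.

```python
def gradient(weights, curvature):
    """Gradient of the toy loss 0.5 * sum(c_i * w_i ** 2)."""
    return [c * w for c, w in zip(curvature, weights)]


def loss(weights, curvature):
    """Toy quadratic loss with a different curvature per parameter."""
    return 0.5 * sum(c * w * w for c, w in zip(curvature, weights))


def descend(weights, curvature, step_sizes, steps):
    """Run `steps` gradient updates with one step size per parameter."""
    for _ in range(steps):
        grads = gradient(weights, curvature)
        weights = [w - s * g for w, s, g in zip(weights, step_sizes, grads)]
    return weights


curvature = [1.0, 10.0]  # one flat direction, one steep direction
start = [1.0, 1.0]

# A small shared learning rate is stable but still far from the optimum.
small = descend(start, curvature, [0.05, 0.05], steps=5)
# A larger shared learning rate overshoots in the steep direction.
large = descend(start, curvature, [0.25, 0.25], steps=5)
# Hand-set per-parameter steps (inverse curvature) converge in one update.
tuned = descend(start, curvature, [1.0, 0.1], steps=1)
```

With a single shared learning rate, the steep direction limits how large the step can be while the flat direction needs many steps. Step sizes adapted per parameter avoid this trade-off, which is the behaviour the offline-trained optimizer is meant to provide in a few gradient steps.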