Abstract
Visual tracking is the task of continuously localizing a target throughout a video, given only the target's initial state in the first frame. This limited target information makes the problem extremely challenging. Existing tracking methods either perform matching-based similarity learning or optimization-based discriminative reasoning. However, the former is often ineffective at distinguishing target objects from background distractors, while the latter is insufficient for maintaining spatio-temporal consistency across successive frames. In this paper, we design a joint spatio-temporal similarity and discrimination learning (STSDL) framework for accurate and robust tracking. The framework comprises two complementary branches: a similarity learning branch and a discrimination learning branch. The similarity learning branch uses an effective transformer encoder-decoder to gather rich spatio-temporal context information and generate a similarity map. In parallel, the discrimination learning branch exploits an efficient model predictor to train a target model that produces a discriminative map. Finally, the similarity map and the discriminative map are adaptively fused for accurate and robust target localization. Experimental results on six prevalent datasets demonstrate that the proposed STSDL obtains satisfactory results while retaining a real-time tracking speed of 50 FPS on a single GPU. © 2024 IEEE
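The adaptive fusion step described above can be illustrated with a minimal sketch. Note that this is an assumption-laden illustration, not the paper's actual method: the confidence measure (peak-to-mean ratio) and the function names (`peak_confidence`, `fuse_maps`, `localize`) are hypothetical stand-ins for whatever weighting scheme STSDL actually uses.

```python
# Hypothetical sketch of adaptive response-map fusion: a similarity map (from a
# transformer branch) and a discriminative map (from a model-predictor branch)
# are combined with confidence-derived weights before localization. The
# peak-to-mean confidence heuristic is an assumption for illustration only.
import numpy as np

def peak_confidence(response: np.ndarray) -> float:
    """Peak-to-mean ratio as a simple per-map confidence score."""
    return float(response.max() / (response.mean() + 1e-8))

def fuse_maps(sim_map: np.ndarray, disc_map: np.ndarray) -> np.ndarray:
    """Adaptively fuse two response maps using normalized confidences."""
    c_sim = peak_confidence(sim_map)
    c_disc = peak_confidence(disc_map)
    w_sim = c_sim / (c_sim + c_disc)          # weight of the similarity branch
    return w_sim * sim_map + (1.0 - w_sim) * disc_map

def localize(fused: np.ndarray) -> tuple:
    """Target position = argmax of the fused response map."""
    return tuple(np.unravel_index(np.argmax(fused), fused.shape))
```

In this sketch the branch whose response map has a sharper, more confident peak receives the larger fusion weight, so localization degrades gracefully when one branch is distracted.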
| Original language | English |
|---|---|
| Pages (from-to) | 7284–7300 |
| Journal | IEEE Transactions on Circuits and Systems for Video Technology |
| Volume | 34 |
| Issue number | 8 |
| Online published | 13 Mar 2024 |
| DOIs | |
| Publication status | Published - Aug 2024 |
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62132002, Grant 62202249, and Grant 62102206; in part by the Postdoctoral Science Foundation of China under Grant 2022M721732; and in part by the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation (CPSF) under Grant GZC20233362.
Research Keywords
- adaptive response map fusion
- joint learning
- spatio-temporal discrimination
- spatio-temporal similarity
- video object tracking