Joint Spatio-Temporal Similarity and Discrimination Learning for Visual Tracking

Yanjie Liang, Haosheng Chen, Qiangqiang Wu, Changqun Xia, Jia Li*

*Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

16 Citations (Scopus)

Abstract

Visual tracking is the task of continuously localizing a target throughout a video, given only its initial state in the first frame. This limited target information makes the problem extremely challenging. Existing tracking methods perform either matching-based similarity learning or optimization-based discrimination reasoning. However, the former is often ineffective at distinguishing target objects from background distractors, while the latter is often insufficient for maintaining spatio-temporal consistency across successive frames. In this paper, we design a joint spatio-temporal similarity and discrimination learning (STSDL) framework for accurate and robust tracking. The designed framework is composed of two complementary branches: a similarity learning branch and a discrimination learning branch. The similarity learning branch uses an effective transformer encoder-decoder to gather rich spatio-temporal context information and generate a similarity map. In parallel, the discrimination learning branch exploits an efficient model predictor to train a target model that produces a discriminative map. Finally, the similarity map and the discriminative map are adaptively fused for accurate and robust target localization. Experimental results on six prevalent datasets demonstrate that the proposed STSDL obtains satisfactory results while retaining a real-time tracking speed of 50 FPS on a single GPU. © 2024 IEEE
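The abstract describes fusing a similarity map and a discriminative map into one response map for localization. The sketch below illustrates one simple way such an adaptive fusion could work; it is not the paper's implementation, and the peak-to-mean confidence proxy and the `adaptive_fuse` helper are assumptions made for illustration only.

```python
import numpy as np

def adaptive_fuse(similarity_map, discriminative_map, eps=1e-8):
    """Fuse two response maps with confidence-derived weights.

    Illustrative sketch (not the STSDL implementation): each map's
    weight is its peak-to-mean ratio, a crude confidence proxy,
    normalized so the two weights sum to one.
    """
    confs = []
    for m in (similarity_map, discriminative_map):
        confs.append(m.max() / (m.mean() + eps))  # peak-to-mean ratio
    w = np.array(confs) / (sum(confs) + eps)      # normalize weights
    fused = w[0] * similarity_map + w[1] * discriminative_map
    return fused, w

# Toy 5x5 response maps: similarity peaks at (1, 1) with value 1.0,
# discrimination peaks at (3, 3) with the stronger value 2.0.
sim = np.zeros((5, 5)); sim[1, 1] = 1.0
dis = np.zeros((5, 5)); dis[3, 3] = 2.0
fused, w = adaptive_fuse(sim, dis)
peak = np.unravel_index(fused.argmax(), fused.shape)  # → (3, 3)
```

Here both maps are equally peaky, so the weights are roughly equal and the fused map's maximum follows the stronger discriminative response; a sharper confidence measure (e.g. peak-to-sidelobe ratio) would let one branch dominate when the other is ambiguous.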
Original language: English
Pages (from-to): 7284-7300
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 8
Online published: 13 Mar 2024
DOIs
Publication status: Published - Aug 2024

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62132002, Grant 62202249, and Grant 62102206; in part by the Postdoctoral Science Foundation of China under Grant 2022M721732; and in part by the Postdoctoral Fellowship Program of China Postdoctoral Science Foundation (CPSF) under Grant GZC20233362.

Research Keywords

  • adaptive response map fusion
  • joint learning
  • spatio-temporal discrimination
  • spatio-temporal similarity
  • video object tracking
