Variable Temporal Length Training for Action Recognition CNNs

Tan-Kun Li, Kwok-Leung Chan*, Tardi Tjahjadi

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

Abstract

Most current deep learning models are suboptimal in terms of the flexibility of their input shape. Usually, computer vision models work well only on the one fixed shape used during training; otherwise, their performance degrades significantly. For video-related tasks, the length of each video (i.e., the number of video frames) can vary widely; therefore, video frames are sampled to ensure that every video has the same temporal length. This training method has drawbacks in both the training and testing phases. For instance, a universal temporal length can damage the features in longer videos, preventing the model from flexibly adapting to variable lengths for the purposes of on-demand inference. To address this, we propose a simple yet effective training paradigm for 3D convolutional neural networks (3D-CNNs) that enables them to process video inputs of variable temporal length, i.e., variable length training (VLT). Compared with the standard video training paradigm, our method introduces three extra operations during training: sampling twice, temporal packing, and subvideo-independent 3D convolution. These operations are efficient and can be integrated into any 3D-CNN. In addition, we introduce a consistency loss to regularize the representation space. After training, the model can successfully process videos of varying temporal length without any modification in the inference phase. Our experiments on various popular action recognition datasets demonstrate the superior performance of the proposed method compared to the conventional training paradigm and other state-of-the-art training paradigms. © 2024 by the authors.
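
The abstract names the training-time operations but does not detail them. The following is a minimal, illustrative sketch of one plausible reading, not the authors' implementation. It assumes that each frame is reduced to a single scalar feature, that "sampling twice" means drawing two clips of different lengths from the same video, that "temporal packing" means concatenating them along the time axis, and that "subvideo-independent" convolution means zero-padding at each clip boundary so the kernel never mixes frames from different clips. All function names (`sample_indices`, `pack`, `subvideo_conv1d`, `consistency_loss`) are hypothetical.

```python
import random


def sample_indices(num_frames, length, rng):
    """Pick `length` non-decreasing frame indices, one per equal-width segment.

    Assumption: segment-based sampling; the abstract does not name a sampler.
    """
    edges = [i * num_frames / length for i in range(length + 1)]
    return [
        min(num_frames - 1, int(rng.uniform(edges[i], edges[i + 1])))
        for i in range(length)
    ]


def pack(clips):
    """Concatenate clips along time and record each clip's (start, end) span."""
    packed, spans = [], []
    for clip in clips:
        spans.append((len(packed), len(packed) + len(clip)))
        packed.extend(clip)
    return packed, spans


def subvideo_conv1d(packed, spans, kernel):
    """Temporal cross-correlation that treats every span as its own sequence.

    Taps that would fall outside the current span are skipped, which is
    equivalent to zero-padding each subvideo separately.
    """
    half = len(kernel) // 2
    out = [0.0] * len(packed)
    for start, end in spans:
        for t in range(start, end):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = t + k - half
                if start <= src < end:  # never read across a clip boundary
                    acc += w * packed[src]
            out[t] = acc
    return out


def consistency_loss(features, spans):
    """Squared gap between the time-averaged features of the two clips.

    Assumption: a simple squared-error form; the paper's loss may differ.
    """
    means = [sum(features[s:e]) / (e - s) for s, e in spans]
    return (means[0] - means[1]) ** 2


rng = random.Random(0)
video = [float(i) for i in range(40)]  # toy video: one scalar per frame
short_clip = [video[i] for i in sample_indices(len(video), 4, rng)]
long_clip = [video[i] for i in sample_indices(len(video), 8, rng)]

packed, spans = pack([short_clip, long_clip])
features = subvideo_conv1d(packed, spans, [0.25, 0.5, 0.25])
loss = consistency_loss(features, spans)
```

Under these assumptions, convolving the packed sequence gives the same result as convolving each clip on its own, which is the property that lets clips of different lengths share one forward pass.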
Original language: English
Article number: 3403
Journal: Sensors
Volume: 24
Issue number: 11
Online published: 25 May 2024
Publication status: Published - Jun 2024

Funding

The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11202319).

Research Keywords

  • action recognition
  • deep learning
  • representation learning
  • video classification

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/

RGC Funding Information

  • RGC-funded
