Variable Temporal Length Training for Action Recognition CNNs

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

Author(s)

Li, Tan-Kun; Chan, Kwok-Leung; Tjahjadi, Tardi

Detail(s)

Original language: English
Article number: 3403
Journal / Publication: Sensors
Volume: 24
Issue number: 11
Online published: 25 May 2024
Publication status: Published - Jun 2024

Abstract

Most current deep learning models are suboptimal in terms of the flexibility of their input shape. Usually, computer vision models only work on the one fixed shape used during training; otherwise, their performance degrades significantly. For video-related tasks, the length of each video (i.e., the number of video frames) can vary widely; therefore, sampling of video frames is employed to ensure that every video has the same temporal length. This training method introduces drawbacks in both the training and testing phases. For instance, a universal temporal length can damage the features in longer videos and prevents the model from flexibly adapting to variable lengths for on-demand inference. To address this, we propose a simple yet effective training paradigm for 3D convolutional neural networks (3D-CNNs) that enables them to process inputs with variable temporal length, i.e., variable length training (VLT). Compared with the standard video training paradigm, our method introduces three extra operations during training: sampling twice, temporal packing, and subvideo-independent 3D convolution. These operations are efficient and can be integrated into any 3D-CNN. In addition, we introduce a consistency loss to regularize the representation space. After training, the model can successfully process videos with varying temporal lengths without any modification in the inference phase. Our experiments on various popular action recognition datasets demonstrate the superior performance of the proposed method compared to the conventional training paradigm and other state-of-the-art training paradigms. © 2024 by the authors.
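
The sketch below is a minimal illustration of the ideas named in the abstract (sampling twice, training on clips of different temporal lengths, and a consistency loss), assuming a PyTorch 3D-CNN backbone. It is not the authors' implementation: the paper's temporal packing and subvideo-independent 3D convolution are approximated here by running each sampled subvideo through the backbone in a separate forward pass, and all names (sample_clip, VLTWrapper, lambda_consistency, the choice of MSE between softmax outputs) are hypothetical.

```python
# Hypothetical sketch of variable length training (VLT); PyTorch assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


def sample_clip(video: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Uniformly sample `num_frames` frames from a video of shape (C, T, H, W)."""
    total = video.shape[1]
    idx = torch.linspace(0, total - 1, num_frames).long()
    return video[:, idx]


class VLTWrapper(nn.Module):
    """Wraps any 3D-CNN backbone: samples two clips of different temporal
    lengths from the same video ("sampling twice"), processes each subvideo
    independently (a stand-in for the paper's temporal packing with
    subvideo-independent 3D convolution), and adds a consistency loss."""

    def __init__(self, backbone: nn.Module, lambda_consistency: float = 1.0):
        super().__init__()
        self.backbone = backbone              # any 3D-CNN returning class logits
        self.lambda_consistency = lambda_consistency

    def forward(self, video: torch.Tensor, label: torch.Tensor,
                lengths=(8, 16)) -> torch.Tensor:
        # Sampling twice: draw two clips with different numbers of frames.
        clips = [sample_clip(video, t).unsqueeze(0) for t in lengths]
        logits = [self.backbone(c) for c in clips]   # independent forward passes

        # Supervised classification loss on both temporal lengths.
        cls_loss = sum(F.cross_entropy(z, label) for z in logits)

        # Consistency loss regularizing the representation space: encourage the
        # short-clip and long-clip predictions to agree (MSE is one plausible choice).
        cons_loss = F.mse_loss(logits[0].softmax(-1), logits[1].softmax(-1))
        return cls_loss + self.lambda_consistency * cons_loss
```

At inference, such a wrapper would simply call the backbone on a clip of whatever temporal length is available, which matches the abstract's claim that no modification is needed in the inference phase.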

Research Area(s)

  • action recognition, deep learning, representation learning, video classification

Citation Format(s)

Variable Temporal Length Training for Action Recognition CNNs. / Li, Tan-Kun; Chan, Kwok-Leung; Tjahjadi, Tardi.
In: Sensors, Vol. 24, No. 11, 3403, 06.2024.
