TY - JOUR
T1 - Not all samples are equal
T2 - Boosting action segmentation via selective incremental learning
AU - Huang, Feng
AU - Chen, Xiao-Diao
AU - Wu, Wen
AU - Ma, Weiyin
PY - 2025/5/1
Y1 - 2025/5/1
N2 - Temporal action segmentation (TAS) seeks to perform classification for each frame in a video. Existing methods tend to design diverse network architectures, while overlooking the intrinsic characteristics of training samples. Notably, two key issues arise: (1) Frames around action boundaries are more ambiguous and thus pose greater difficulties for training compared to other frames; and (2) beyond the commonly used categorical labels, the total number of action instances within a video may serve as an additional, potentially vital, supervision cue. To address these issues, this paper introduces a novel method that combines a model-agnostic training strategy with an instance number alignment loss, designed to enhance the performance of existing models. Specifically, a selective incremental learning (SIL) strategy is proposed to alleviate the impact of noisy samples by progressively training the model in an easy-to-difficult manner through a dynamic sample selection mechanism. Furthermore, an instance number alignment loss (INAL) is developed to capture both global and local features simultaneously by incorporating a multi-task learning module. Extensive evaluations are conducted on three benchmark datasets, namely 50Salads, Georgia Tech egocentric activities (GTEA), and Breakfast. The experimental results demonstrate that the proposed method achieves substantial performance improvements over state-of-the-art approaches. © 2025 Elsevier Ltd.
KW - Incremental learning
KW - Instance number learning
KW - Multi-task learning
KW - Noisy sample
KW - Sample selection
KW - Temporal action segmentation
UR - http://www.scopus.com/inward/record.url?scp=85218627802&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2025.110334
DO - 10.1016/j.engappai.2025.110334
M3 - RGC 21 - Publication in refereed journal
SN - 0952-1976
VL - 147
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 110334
ER -