
Semantically-Enhanced Feature Extraction with CLIP and Transformer Networks for Driver Fatigue Detection

Zhen Gao, Xiaowen Chen, Jingning Xu*, Rongjie Yu, Heng Zhang, Jinqiu Yang

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review


Abstract

Drowsy driving is a leading cause of commercial vehicle traffic crashes. The current trend is to train fatigue detection models with deep neural networks on driver video data, but two challenges remain: high-level feature extraction is coarse and incomplete, and network architectures are not well optimized. This paper pioneers the use of the CLIP (Contrastive Language-Image Pre-training) model for fatigue detection. It also uses a Transformer architecture to extract complex, long-term temporal features from video sequences, enabling more nuanced and accurate fatigue analysis. The proposed CT-Net (CLIP-Transformer Network) achieves an AUC (Area Under the Curve) of 0.892, a 36% accuracy improvement over the prevalent CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory) end-to-end model, reaching state-of-the-art performance. Experiments show that the CLIP pre-trained model extracts facial and behavioral features from driver video frames more accurately, improving the model’s AUC by 7% over the ImageNet-based pre-trained model. Moreover, compared with LSTM, the Transformer captures long-term dependencies among temporal features more flexibly, further improving the model’s AUC by 4%. © 2024 by the authors.
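
The abstract reports results as AUC. As a minimal sketch of what that metric measures, the snippet below computes AUC as the fraction of (fatigued, alert) sample pairs in which the fatigued sample receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations; this is not the authors' evaluation code or data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise ranking: P(score of positive > score of negative).

    labels: 1 for the positive class (assumed here: fatigued), 0 otherwise.
    scores: model outputs, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the paper.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))  # 8 of 9 pairs are ranked correctly -> 0.889
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; a sort-based computation would be the usual choice for large evaluation sets.
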
Original language: English
Article number: 7948
Journal: Sensors
Volume: 24
Issue number: 24
Online published: 12 Dec 2024
Publication status: Published - Dec 2024

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Research Keywords

  • CLIP pre-trained model
  • fatigue detection
  • instance normalization
  • semantic analysis
  • Transformer

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
