Spatiotemporal Representation Learning for Event-Based Visual Recognition
面向事件驅動視覺識別的時空表徵學習 (Spatiotemporal Representation Learning for Event-Driven Visual Recognition)
Student thesis: Doctoral Thesis
Detail(s)
Award date: 2 Sept 2024
Link(s)
Permanent Link: https://scholars.cityu.edu.hk/en/theses/theses(4a5a3911-cdb9-421a-957b-58fa1bf21793).html
Abstract
Visual recognition is a fundamental research topic in computer vision, enabling machines to automatically identify and understand objects, scenes, and activities within visual data. With the development of deep learning techniques over the past decade, recognition algorithms designed for standard frame-based cameras have achieved remarkable progress. However, owing to their limited dynamic range and fixed frame rate, standard cameras may fail to provide high-quality recordings in challenging real-world scenarios, such as poor lighting and high-speed motion.
Event cameras are novel bio-inspired vision sensors that can tackle the above problems. Unlike standard cameras, which capture images at a fixed rate, event cameras asynchronously measure per-pixel brightness changes and output a space-time stream of events. This imaging mechanism gives such cameras many attractive advantages, including high dynamic range, microsecond temporal resolution, and low power consumption. Because event data constitute a new visual modality that is sparse and asynchronous, developing specialized algorithms that unlock the potential of event cameras for visual recognition is a challenging topic. This thesis focuses on developing efficient event-based recognition methods that use deep neural networks to learn spatiotemporal representations from event streams. Considering the characteristics of event data, we conduct research from three perspectives: (i) designing sparse processing frameworks that learn discriminative features from voxel-wise representations, (ii) boosting conventional vision models with frame-based representations for high-performance event-based action recognition, and (iii) exploiting event-image complementarity for robust recognition via multi-modal fusion.
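For concreteness, the output of an event camera can be viewed as a stream of tuples (x, y, t, p): a pixel location, a microsecond timestamp, and a polarity indicating whether the brightness increased or decreased. A common way to make this sparse, asynchronous stream digestible for a neural network is to bin it into a spatiotemporal voxel grid, which is the kind of voxel-wise representation referred to in perspective (i). The sketch below is a minimal, generic illustration in Python/NumPy under assumptions of my own; it is not the thesis's implementation, and the function name, binning scheme, and sensor resolution are illustrative only.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate an event stream into a (num_bins, H, W) voxel grid.

    events: float array of shape (N, 4) with columns (x, y, t, p),
            where p is the polarity in {-1, +1}.
    A generic sketch of the voxel-wise representation idea; real
    pipelines often use bilinear temporal interpolation rather than
    the hard binning used here.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid

    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    p = events[:, 3]

    # Normalize timestamps to [0, 1) and assign each event to a temporal bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    bins = np.clip((t_norm * num_bins).astype(np.int64), 0, num_bins - 1)

    # Signed accumulation: positive and negative polarities cancel out.
    np.add.at(grid, (bins, y, x), p)
    return grid

# Example: 10,000 synthetic events on a 346x260 sensor (DAVIS346-like resolution).
rng = np.random.default_rng(0)
events = np.stack([
    rng.integers(0, 346, 10_000),      # x
    rng.integers(0, 260, 10_000),      # y
    np.sort(rng.random(10_000)),       # t (seconds, sorted)
    rng.choice([-1.0, 1.0], 10_000),   # polarity
], axis=1).astype(np.float32)
voxels = events_to_voxel_grid(events, num_bins=5, height=260, width=346)
print(voxels.shape)  # (5, 260, 346)
```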
Guided by these perspectives, this thesis introduces four spatiotemporal representation learning frameworks for visual recognition with event cameras. (1) Following the first perspective, a voxel-wise graph learning model (VMV-GCN) is proposed for efficient event processing that exploits the sparsity of the modality. VMV-GCN achieves competitive accuracy on object classification and action recognition while maintaining low model complexity. (2) Also under the first perspective, a voxel set transformer model (EVSTr) is introduced to efficiently learn spatiotemporal representations via the attention mechanism for object classification and action recognition. To enable a convincing evaluation, we present a new action recognition dataset (NeuroHAR) recorded in challenging scenarios. (3) Following the second perspective, a compact event representation method (EVTC) is proposed to summarize the long-range temporal dynamics of events into informative frames. This representation can be combined with conventional vision models for high-performance action recognition. (4) Following the third perspective, an event-image fusion network (EISNet) is introduced to improve semantic segmentation in challenging visual scenarios through a high-confidence event representation and adaptive multi-modal complementary fusion.
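To illustrate the general pattern behind the second perspective (pairing frame-based event representations with conventional vision models), the sketch below stacks temporally binned event frames as input channels of an off-the-shelf ResNet-18. This is a generic PyTorch illustration under assumptions of my own, not the EVTC module described in the thesis; the class name, channel count, and other hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EventFrameClassifier(nn.Module):
    """Toy wrapper: feed T stacked event frames to a standard ResNet-18.

    A generic sketch of the 'frame-based representation + conventional
    vision model' idea; it is not the thesis's EVTC method.
    """
    def __init__(self, num_frames: int, num_classes: int):
        super().__init__()
        self.backbone = resnet18(weights=None)
        # Replace the 3-channel RGB stem so it accepts num_frames event-frame channels.
        self.backbone.conv1 = nn.Conv2d(
            num_frames, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, event_frames: torch.Tensor) -> torch.Tensor:
        # event_frames: (batch, num_frames, H, W), e.g. polarity histograms
        # or the voxel grid from the previous sketch.
        return self.backbone(event_frames)

# Example: classify 8 temporal bins of 224x224 event frames into 10 action classes.
model = EventFrameClassifier(num_frames=8, num_classes=10)
logits = model(torch.randn(2, 8, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```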
This thesis systematically studies representation methods, spatiotemporal feature learning, and modality fusion strategies for event camera data. Comprehensive experiments demonstrate that the proposed methods achieve state-of-the-art performance on multiple recognition tasks by unlocking the potential benefits of event cameras.
Keywords: Event camera, Visual recognition, Representation learning, Multi-modal fusion