Object Classification with Event-Based Cameras

Student thesis: Doctoral Thesis

Award date: 18 Aug 2021

Abstract

Enabling computers to understand real-world scenes is a long-standing goal of computer vision. As an essential component of visual perception, object classification must be performed robustly and efficiently to achieve this goal. With the rapid development of computer vision technology in recent years, object classification with conventional frame-based cameras, such as complementary metal-oxide-semiconductor (CMOS) cameras, has achieved great success. However, its real-life application remains challenging because traditional cameras suffer from low dynamic range, high power consumption, and susceptibility to motion blur.

Event-based cameras operate asynchronously at the pixel level rather than being constrained by a fixed frame rate. Low power consumption, extremely high temporal resolution (on the order of μs), and high dynamic range (140 dB versus the 60 dB of traditional cameras) allow event cameras to overcome these hardware limitations in object classification applications. However, these advantages come at a substantial price: the asynchronous and sparse output of event-based sensors, known as event point clouds or event data, is noisy and has relatively low resolution. Owing to this unconventional output, dedicated event-based object classification algorithms that leverage the rich spatio-temporal information of event data are required to unlock the potential of event cameras.
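To make the nature of this output concrete, the following is a minimal sketch (not taken from the dissertation) of how an event stream is commonly handled: each event is a tuple (x, y, t, p) of pixel coordinates, a microsecond timestamp, and a polarity, and a simple count-based accumulation turns the stream into a frame that a conventional classifier can consume. The sensor resolution, array layout, and accumulation scheme below are illustrative assumptions, not the representations proposed in this thesis.

```python
# Illustrative sketch: event data as (x, y, t, p) tuples and a naive
# two-channel count-image representation. All sizes are assumptions.
import numpy as np

H, W = 260, 346  # assumed sensor resolution (DAVIS-style camera)

# A toy stream of N events: columns are x, y, timestamp (us), polarity {0, 1}.
rng = np.random.default_rng(0)
N = 10_000
events = np.stack([
    rng.integers(0, W, N),                    # x coordinate
    rng.integers(0, H, N),                    # y coordinate
    np.sort(rng.integers(0, 1_000_000, N)),   # timestamp in microseconds
    rng.integers(0, 2, N),                    # polarity (ON / OFF)
], axis=1)

# Accumulate events into a two-channel count image (one channel per polarity),
# one common way to feed sparse event data to a frame-based CNN classifier.
frame = np.zeros((2, H, W), dtype=np.float32)
np.add.at(frame, (events[:, 3], events[:, 1], events[:, 0]), 1.0)
print(frame.shape, frame.sum())  # (2, 260, 346) 10000.0
```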

This dissertation utilizes deep learning to develop event-based object classification algorithms from three perspectives. (i) Making event-based features more compatible with traditional vision models by eliminating the negative effect of motion information in event data on the classification model (Chapter 3): specifically, a speed-invariant kernel is proposed to alleviate the 2D semantic deformation introduced by different motion conditions. (ii) Fully exploiting spatio-temporal information and making the extracted motion and semantic cues serve downstream tasks cooperatively by customizing representations and learning models for event data: first, a multi-view fusion network is introduced for event-based object classification (Chapter 4); a novel learning architecture is then proposed to further exploit the spatio-temporal relationships across patches in event-based representations by leveraging the self-attention mechanism (Chapter 5); and a graph-based lightweight framework is developed to address classification accuracy and model complexity simultaneously (Chapter 6). (iii) Improving the feature extraction quality of event-based networks by introducing supervision beyond category labels (Chapter 7): a distillation framework is presented that improves the performance of event-based methods on vision tasks by exploiting the supervision provided by traditional images. We capture an image-event paired dataset (CEP-DVS), consisting of samples with random motion trajectories, to expand the diversity of existing classification datasets.
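As a generic illustration of the graph-based direction mentioned above (not the framework proposed in Chapter 6), a subsampled event stream can be treated as a point cloud in (x, y, t) and each event connected to its k nearest spatio-temporal neighbours; the resulting edge list is what a lightweight graph neural network would consume. The subsample size, time-scaling factor, and k below are arbitrary assumptions.

```python
# Illustrative sketch: building a k-NN spatio-temporal graph from events.
import numpy as np

def build_knn_graph(events: np.ndarray, k: int = 8, time_scale: float = 1e-3):
    """Return an edge list of shape (2, num_events * k) for (x, y, t, p) rows."""
    # Scale the timestamp axis so time and pixel distances are comparable.
    pts = events[:, :3].astype(np.float64)
    pts[:, 2] *= time_scale
    # Pairwise squared distances (fine for a few thousand subsampled events).
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]    # k nearest neighbours per event
    src = np.repeat(np.arange(len(pts)), k)
    dst = nbrs.reshape(-1)
    return np.stack([src, dst])             # edges usable by a GNN library

# Usage: subsample the stream, then hand node features and edges to a GNN.
rng = np.random.default_rng(0)
sample = rng.integers(0, 100_000, size=(2_000, 4))  # toy (x, y, t, p) events
edges = build_knn_graph(sample)
print(edges.shape)  # (2, 16000)
```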

Comprehensive experiments demonstrate that the proposed methods can exploit the benefits of event cameras, allowing us to realize high-performance object classification in high-speed scenarios under challenging lighting conditions.