This project aims to build a visual scene-perception system for complex environments, using event-based data processing to address the challenges posed by the irregularity, asynchrony, and sparsity of raw event-cloud data. Technically, we propose a new deep-learning paradigm for object detection and tracking that operates directly on the raw event stream/cloud; we quantify the ambiguity of manually annotated ground-truth bounding boxes in the training data with a variational Bayesian model; and we explore a cross-modal processing pipeline between event clouds and RGB data driven by advanced large pre-trained models. The results of this project stand to benefit public-safety monitoring, transportation management, autonomous driving, and environmental monitoring.
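As an illustration only (not the project's actual architecture), the sketch below shows one plausible reading of two ideas named above: raw events treated as an (N, 4) point cloud of (x, y, t, polarity) tuples encoded by a PointNet-style network, and a box head that predicts a per-coordinate mean and variance trained with a Gaussian negative log-likelihood, a lightweight stand-in for the variational Bayesian treatment of annotation ambiguity. All names (EventCloudEncoder, GaussianBoxHead) and design choices here are hypothetical assumptions.

```python
# Illustrative sketch, not the proposed model: events as a point cloud, plus a
# box head whose predicted variance serves as a proxy for label ambiguity.
import torch
import torch.nn as nn

class EventCloudEncoder(nn.Module):
    """PointNet-style encoder: per-event MLP followed by order-invariant max-pooling."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, events: torch.Tensor) -> torch.Tensor:
        # events: (B, N, 4) with columns (x, y, t, polarity)
        per_event = self.mlp(events)          # (B, N, feat_dim)
        return per_event.max(dim=1).values    # (B, feat_dim)

class GaussianBoxHead(nn.Module):
    """Predicts a mean and log-variance for each box coordinate (cx, cy, w, h).

    Training with the Gaussian NLL lets the network report high variance for
    coordinates whose hand annotation is inherently ambiguous.
    """
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.mean = nn.Linear(feat_dim, 4)
        self.log_var = nn.Linear(feat_dim, 4)

    def forward(self, feat: torch.Tensor):
        return self.mean(feat), self.log_var(feat)

def gaussian_nll(mean, log_var, target):
    # Negative log-likelihood of target under N(mean, exp(log_var)), up to a constant.
    return (0.5 * (log_var + (target - mean) ** 2 / log_var.exp())).mean()

if __name__ == "__main__":
    encoder, head = EventCloudEncoder(), GaussianBoxHead()
    events = torch.rand(2, 1024, 4)   # 2 samples, 1024 raw events each
    gt_boxes = torch.rand(2, 4)       # annotated (cx, cy, w, h)
    mean, log_var = head(encoder(events))
    loss = gaussian_nll(mean, log_var, gt_boxes)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```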