Leveraging Spatio-Temporal Structure across Video Streams for Self-Supervised Depth-Pose Learning and Shape Estimation of Deformable Objects


Student thesis: Doctoral Thesis





Awarding Institution
  • Yajing SHEN (Supervisor)
  • Jia Pan (External person) (External Co-Supervisor)
Award date: 29 Aug 2022


The past decades have witnessed tremendous progress in computer vision for applications such as facial recognition, autonomous vehicles, and robotic manipulation. This thesis presents novel approaches to two essential computer vision problems underlying these applications: image depth-pose estimation and shape estimation of deformable objects.

To tackle image depth-pose estimation, I follow recent trends and apply deep learning. Specifically, I explore various convolutional neural network (CNN) architectures that infer scene depth and camera ego-motion from image input. I make three observations: (i) it is impractical to collect enormous quantities of ground-truth depth or pose data as supervision for training, (ii) most existing methods perceive the world as isolated images rather than sequences, ignoring the intrinsic temporal dependency in video streams, and (iii) previous work tends to learn depth and pose estimation jointly from scratch, resulting in degraded performance, limited generalization, and depth-pose scale inconsistency. To address these problems, I first design a self-supervised training framework that learns the depth and pose models from unlabeled data. Second, I present a unified recurrent feature extraction architecture that leverages the spatio-temporal structure in video data for both depth and pose estimation. Third, instead of learning everything from scratch, I implement a pipeline that infers image correspondences with the CNN model, then recovers pose and aligns the depth-pose scale using multi-view geometry. Together, these contributions improve the system in both accuracy and generalization.
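
As an illustration only, the geometric relation that self-supervised depth-pose training commonly relies on can be sketched as follows: a pixel with known depth is back-projected to 3D, moved by the relative camera pose, and projected into a neighbouring frame, so that photometric agreement between frames can stand in for ground-truth labels. The second function shows one common way to bring two depth estimates onto a shared scale using a median ratio. The function names `reproject` and `align_scale`, the pinhole intrinsics `K`, and the median-ratio rule are assumptions made for this sketch; the abstract does not specify the thesis's actual formulation.

```python
import numpy as np


def reproject(pixels, depth, K, R, t):
    """Map source-view pixels into a target view (hypothetical helper).

    pixels: (N, 2) pixel coordinates in the source image.
    depth:  (N,) depth of each pixel along the source camera's z-axis.
    K:      (3, 3) pinhole intrinsics, assumed shared by both views.
    R, t:   rotation (3, 3) and translation (3,) taking source-camera
            coordinates to target-camera coordinates.
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])          # homogeneous pixels, (N, 3)
    rays = (np.linalg.inv(K) @ homog.T).T      # back-projected rays with z = 1
    points = rays * depth[:, None]             # 3D points in the source frame
    moved = points @ R.T + t                   # same points in the target frame
    proj = (K @ moved.T).T                     # project with the intrinsics
    return proj[:, :2] / proj[:, 2:3]          # perspective divide


def align_scale(pred_depth, ref_depth):
    """Median-ratio factor that brings pred_depth onto ref_depth's scale.

    This is an assumed alignment rule for illustration, not necessarily
    the one used in the thesis.
    """
    return np.median(ref_depth / pred_depth)


# Toy setup: focal length 100 px, principal point (50, 50).
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[50.0, 50.0], [60.0, 40.0]])
depth = np.array([2.0, 4.0])

same_view = reproject(pixels, depth, K, np.eye(3), np.zeros(3))
shifted = reproject(pixels, depth, K, np.eye(3), np.array([0.2, 0.0, 0.0]))
scale = align_scale(np.array([1.0, 2.0, 4.0]), np.array([2.0, 4.0, 8.0]))
```

With an identity pose the pixels map onto themselves, and with a pure sideways translation each pixel shifts horizontally by an amount inversely proportional to its depth. That dependence of image motion on depth is what lets depth and pose be learned jointly from unlabeled video.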

As for shape estimation of deformable objects, I focus on its application in robotic manipulation. Existing shape estimation methods for deformable object manipulation are off-line, model-dependent, noise-sensitive, or occlusion-sensitive, and are therefore unsuitable for manipulation tasks that require high precision. This thesis presents a real-time shape estimation approach for autonomous robotic manipulation of 3D deformable objects. My method exploits temporal connections across RGB-D video streams and meets the requirements for high-quality deformable object manipulation: it is real-time, model-free, and robust to noise and occlusion. These advantages are achieved with a joint tracking and reconstruction framework, in which I track the object deformation by aligning a reference shape model with the input stream from the RGB-D camera, and simultaneously update the reference shape model according to the newly captured RGB-D data. I evaluate the quality and robustness of the presented real-time shape estimation pipeline on a set of deformable object manipulation tasks implemented on physical robots.