Abstract
Minimally invasive surgery (MIS) has flourished over the past decade owing to its reduced surgical trauma, less pain, and shorter recovery times. Surgical robotic systems for MIS have also been extensively explored to help surgeons reduce their workload. However, existing surgical techniques still pose challenges to both surgeons and robotic systems: MIS suffers from the lack of direct depth information about the surgical scene, the narrow field of view of endoscopes, and the invisibility of critical structures below the tissue surface. Moreover, surgical robots lack geometric knowledge of the environment, which is important feedback for precise and safe automated surgery. Surgical navigation systems that exploit multi-modal information, such as computed tomography (CT), magnetic resonance imaging (MRI), endoscopic video, and robot data, to guide surgeons' operations have therefore shown great potential. In this thesis, we work on topics related to endoscopic navigation with vision-based intelligence. Specifically, our work revolves around endoscope reconstruction and localization, where many challenges, such as texture scarcity, monochromatic surfaces, scale ambiguity, and multi-modality, prevent prior methods from performing robustly and effectively. To this end, we combine the strengths of deep learning and multi-modal learning to develop a series of frameworks that address these problems towards image-based intelligent endoscopic navigation.

First, a learning-driven framework is designed, in which vision-guided endoscope localization with 3D reconstruction of complex anatomy is achieved. To enable this, a learning-based stereo depth estimation module is fine-tuned on surgical data, using supervised and unsupervised methods, for dense per-frame depth computation. It can be applied to challenging surgical scenarios such as tissues with texture-less and monochromatic surfaces.
Then, a dense visual reconstruction algorithm is developed that represents the scene with surfels, estimates the endoscope poses, and fuses the depth maps into a unified reference coordinate frame for tissue reconstruction. It uses only the stereoscopic images from the endoscope, completing the entire pipeline from online depth estimation to dense reconstruction of surgical scenes. In addition, a coarse-to-fine localization method that incorporates the reconstruction results is developed to estimate the poses of new views. Extensive experiments demonstrate the accuracy and effectiveness of the proposed learning-driven 3D reconstruction and camera localization framework.
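The core of the depth-fusion step can be sketched as back-projecting each depth map with the camera intrinsics and transforming the resulting points into the unified reference frame by the estimated endoscope pose. This is only a minimal illustration under assumed pinhole intrinsics; the function names and toy values are hypothetical, and the actual surfel-based pipeline additionally maintains per-surfel normals, radii, and confidences:

```python
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H, W) into camera-frame 3D points
    (H*W, 3) using pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def fuse_into_reference(depth, K, T_cam_to_ref):
    """Transform back-projected points into the unified reference
    coordinate frame via a 4x4 camera-to-reference pose."""
    pts = backproject(depth, K)
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return (T_cam_to_ref @ pts_h.T).T[:, :3]

# Toy example: a flat surface 0.5 m in front of a camera that sits
# 0.1 m along the z-axis of the reference frame.
K = np.array([[100.0, 0.0, 1.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((2, 2), 0.5)
T = np.eye(4)
T[2, 3] = 0.1
cloud = fuse_into_reference(depth, K, T)
print(cloud.shape)  # (4, 3)
```

Repeating this for every frame and accumulating the transformed points (or surfels) yields the dense tissue reconstruction in one shared coordinate system.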
Second, a novel framework, SADER, which exploits vision and kinematics data to estimate high-quality absolute depth for monocular surgical scenes, is presented. To jointly learn from the multi-modal data, a self-distillation-based two-stage training policy is proposed. In the first stage, a boosting depth module based on the vision transformer (ViT) is designed to improve a relative depth estimation network trained in a self-supervised manner. An algorithm is then developed to compute the scale automatically from robot kinematics. By coupling the scale with the relative depth, pseudo absolute depth labels are produced for all images. In the second stage, the network is re-trained with a 3D loss supervised by these pseudo labels. To generalize the method to different endoscopes, the learning of endoscopic intrinsics is integrated into the network. In addition, a cadaver experiment has been performed to collect new depth estimation data for robotic laparoscopy for evaluation. Experimental results demonstrate that SADER outperforms previous state-of-the-art methods, even stereo-based ones, with a depth error under 1.90 mm, proving the feasibility of recovering absolute depth from monocular inputs.
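The scale-coupling idea above can be illustrated with a minimal sketch: a single scale factor is fitted between the unitless relative depth and sparse metric anchors derived from robot kinematics, and that scale turns relative depth maps into pseudo absolute labels. The least-squares formulation and all names here are illustrative assumptions, not the exact SADER algorithm:

```python
import numpy as np

def align_scale(relative_depth, kinematic_depth):
    """Fit one scale factor s minimizing ||s * d_rel - d_metric||^2,
    where d_metric comes from sparse kinematics-derived anchors
    (e.g. instrument-tip positions) and d_rel is the network's
    relative depth at the same pixels."""
    s = np.dot(relative_depth, kinematic_depth) / np.dot(relative_depth,
                                                         relative_depth)
    return s

def pseudo_absolute_labels(relative_depth_map, scale):
    # Couple the scale with relative depth to yield pseudo absolute labels.
    return scale * relative_depth_map

# Toy example: relative depths differ from metric truth by a factor of 20.
d_rel = np.array([1.0, 2.0, 3.0])
d_metric = np.array([20.0, 40.0, 60.0])
s = align_scale(d_rel, d_metric)
print(round(s, 3))  # 20.0
```

The resulting pseudo labels can then supervise a re-training stage with a metric 3D loss, as the two-stage policy describes.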
Third, inspired by the recently proposed neural radiance fields (NeRF), a novel pipeline, KV-EndoNeRF, is presented, which exploits limited multi-modal data (i.e., robot kinematics and monocular endoscopic video) for surgical scene reconstruction with absolute scale. Scale information extracted from robot kinematics is integrated into the sparse depth from structure from motion (SfM). Under this sparse depth supervision, a monocular depth estimation network is adapted to the current surgical scene to obtain scene-specific coarse depth. After adjusting its scale, the coarse depth guides the optimization of NeRF, yielding absolute depth estimation. The 3D model of the tissue surface at real scale is recovered by fusing the fine depth maps. Experimental results on the SCARED robotic endoscope dataset demonstrate that KV-EndoNeRF not only learns absolute scale from kinematics data but also achieves 3D reconstruction with rich surface texture detail and high accuracy, outperforming existing reconstruction methods.
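One common way to realize depth-guided NeRF optimization is to add a depth term to the photometric objective, pulling rendered ray depths toward the scale-adjusted coarse depth. The sketch below assumes this generic formulation (the loss shape, weight `lambda_d`, and names are illustrative assumptions, not the exact KV-EndoNeRF objective):

```python
import numpy as np

def depth_guided_loss(rendered_rgb, gt_rgb, rendered_depth, coarse_depth,
                      lambda_d=0.1):
    """Photometric MSE plus an L1 depth term that anchors NeRF's rendered
    ray depths to the scale-adjusted coarse depth prior."""
    photo = np.mean((rendered_rgb - gt_rgb) ** 2)
    depth = np.mean(np.abs(rendered_depth - coarse_depth))
    return photo + lambda_d * depth

# Toy example: colors already match, rendered depths are 0.2 m too far.
rgb_r = np.zeros((4, 3))
rgb_g = np.zeros((4, 3))
d_r = np.full(4, 1.2)
d_c = np.full(4, 1.0)
loss = depth_guided_loss(rgb_r, rgb_g, d_r, d_c)
print(round(loss, 4))  # 0.02
```

Because the coarse depth carries the kinematics-derived absolute scale, minimizing such an objective leaves the optimized radiance field metrically scaled, so the fused fine depth maps recover the tissue surface at real scale.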
In summary, the originality of this thesis lies in combining artificial intelligence and endoscopy to achieve image-based intelligent endoscopic navigation. This work not only provides significant insight into achieving endoscopic navigation from visual cues, but also offers a meaningful reference for future automated surgery and navigation.
| Date of Award | 24 Jul 2023 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Dong SUN (Supervisor) |