TY - JOUR
T1 - Absolute Monocular Depth Estimation on Robotic Visual and Kinematics Data via Self-Supervised Learning
AU - Wei, Ruofeng
AU - Li, Bin
AU - Zhong, Fangxun
AU - Mo, Hangjie
AU - Dou, Qi
AU - Liu, Yun-Hui
AU - Sun, Dong
PY - 2025
Y1 - 2025
N2 - Accurate estimation of absolute depth from a monocular endoscope is a fundamental task for automatic navigation systems in robotic surgery. Previous works rely solely on uni-modal data (i.e., monocular images) and can therefore only estimate depth values at an arbitrary scale relative to the real world. In this paper, we present a novel framework, SADER, which exploits vision and robot kinematics to estimate high-quality absolute depth for monocular surgical scenes. To jointly learn from the multi-modal data, we introduce a self-distillation-based two-stage training policy in the framework. In the first stage, a boosting depth module based on a vision transformer is proposed to improve the relative depth estimation network, which is trained in a self-supervised manner. Then, we develop an algorithm to automatically compute the scale from robot kinematics. By coupling the scale with the relative depth data, pseudo absolute depth labels are generated for all images. In the second stage, we re-train the network with a 3D loss supervised by the pseudo labels. To make our method generalize to different endoscopes, the learning of endoscopic intrinsics is integrated into the network. In addition, we conducted cadaver experiments to collect new surgical depth estimation data for robotic laparoscopy for evaluation. Experimental results on the public SCARED dataset and the cadaver data demonstrate that SADER outperforms previous state-of-the-art methods, even stereo-based ones, with an accuracy error under 1.90 mm, proving the feasibility of our approach for recovering absolute depth from monocular inputs. © 2024 IEEE.
KW - absolute depth estimation
KW - Boosting
KW - endoscope
KW - Endoscopes
KW - Estimation
KW - Kinematics
KW - monocular images
KW - multi-modal learning
KW - Robot kinematics
KW - Surgical robotics
KW - Training
KW - Visualization
UR - http://www.scopus.com/inward/record.url?scp=85195407802&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85195407802&origin=recordpage
U2 - 10.1109/TASE.2024.3409392
DO - 10.1109/TASE.2024.3409392
M3 - RGC 21 - Publication in refereed journal
SN - 1545-5955
VL - 22
SP - 4269
EP - 4282
JO - IEEE Transactions on Automation Science and Engineering
JF - IEEE Transactions on Automation Science and Engineering
ER -