Towards Flexible 3D Gaze Estimation Based on Head-mounted Eye Tracking Systems


Student thesis: Doctoral Thesis

Award date: 29 Jun 2021


In contrast to remote eye tracking systems, head-mounted devices (HMD) offer a unique first-person perspective of what the subject sees, which allows gaze prediction in a range of applications. Because they are flexible and increasingly miniaturized, lightweight HMD have become a useful tool in human-robot interaction. Such systems extend the user's working field of view to wider scenes instead of only the area in front of a computer screen, which significantly broadens the applications of gaze technologies. This thesis focuses on achieving flexible and accurate 3D gaze estimation for HMD. Previous work on this task usually suffers from three challenges: (a) Estimation accuracy degrades significantly when the eyeball rotates through large angles. (b) The traditional calibration process, which relies on calibration markers, can be troublesome or even infeasible in some conditions. (c) Large errors in the depth of 3D gaze points occur easily, because depth is determined by triangulating the gaze vectors from both eyes.

In this thesis, a 3D gaze estimation framework with two kinds of calibration methods is proposed to address the aforementioned problems. The polynomial model-based method is proposed to improve the accuracy of 3D gaze estimation. After eye gaze vectors and 3D target locations are collected as the gaze data, the eyeball centers in the scene camera coordinates are determined through a global nonlinear optimization. Accordingly, a target vector can be defined as a ray that starts at the eyeball center and points to the target. The collected eye gaze vectors are then divided into several sub-regions by the K-means algorithm, based on the distribution of calibration targets. In each sub-region, the eye gaze vectors and their corresponding target vectors are used as training data for a region-specific regression model. After all models are trained, a new eye gaze vector in the estimation procedure is assigned to a sub-region according to its angular distance to the center of each sub-region, and the real gaze vector is obtained through that sub-region's model.
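The estimation step described above (assign a new gaze vector to the nearest sub-region by angular distance, then apply that region's model) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the representation of sub-region centers as 3D direction vectors, and the use of plain callables in place of the trained polynomial regression models are all assumptions made for the example.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def angular_distance(u, v):
    """Angle in radians between two 3D direction vectors."""
    u, v = normalize(u), normalize(v)
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, dot)))

def assign_region(gaze, centers):
    """Index of the sub-region center with the smallest angular distance."""
    return min(range(len(centers)),
               key=lambda i: angular_distance(gaze, centers[i]))

def correct_gaze(gaze, centers, models):
    """Apply the model of the assigned sub-region to a gaze vector.

    `models` is a hypothetical list of callables, one per sub-region,
    standing in for the trained regression models.
    """
    region = assign_region(gaze, centers)
    return normalize(models[region](gaze))
```

For example, with centers `[(0, 0, 1), (1, 0, 1), (-1, 0, 1)]`, a gaze vector `(0.9, 0, 1)` is assigned to the second sub-region, because it lies within a few degrees of `(1, 0, 1)` and roughly 42 degrees from `(0, 0, 1)`.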

The saliency-based auto-calibration method is proposed to remove explicit user calibration and achieve robust 3D gaze estimation. The method treats salient regions in the scene as possible 3D locations of gaze points. To improve the efficiency of predicting 3D gaze from visual saliency, a bag-of-words algorithm is applied to eliminate redundant scene images based on their similarity. After the elimination, saliency maps are generated from the remaining scene images, and the extrinsic parameters between the eye cameras and the scene camera are determined by aggregating 3D salient targets with eye gaze vectors. Finally, the real gaze vector is determined from the extrinsic parameters.

After the calibration, eye gaze vectors can be transformed from the eye cameras into the scene camera. Instead of calculating the 3D gaze point as the midpoint of the shortest segment between the two gaze vectors, a point cloud is generated from the 3D structure of the scene, and the gaze point is estimated as the point that is closest to the transformed gaze vectors. Meanwhile, the PnP technique is employed to compute the scene camera's trajectory and pose in the world coordinates. The estimated gaze points can then be converted into the world coordinates, which makes the gaze estimation procedure more connected and better suited to shared tasks in human-robot interaction.
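The difference between the two gaze point estimators mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions: each gaze vector is modeled as a ray (origin plus direction) in scene camera coordinates, "closest" is taken to mean the smallest sum of squared point-to-ray distances, and all function names are hypothetical. The abstract does not specify the distance criterion actually used in the thesis.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _along(o, d, s):
    """Point at parameter s along the ray o + s * d."""
    return tuple(a + s * b for a, b in zip(o, d))

def midpoint_of_shortest_segment(o1, d1, o2, d2):
    """Classic triangulation: midpoint of the shortest segment
    between the lines o1 + s*d1 and o2 + t*d2."""
    w = _sub(o1, o2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("gaze rays are (nearly) parallel")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1, p2 = _along(o1, d1, s), _along(o2, d2, t)
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))

def squared_distance_to_ray(p, o, d):
    """Squared distance from point p to the ray o + s*d, s >= 0."""
    n = math.sqrt(_dot(d, d))
    unit = tuple(c / n for c in d)
    v = _sub(p, o)
    s = max(0.0, _dot(v, unit))  # clamp so points behind the origin use the origin
    perp = _sub(v, tuple(s * c for c in unit))
    return _dot(perp, perp)

def closest_cloud_point(cloud, rays):
    """Point of the scene point cloud with the smallest summed squared
    distance to all gaze rays (the criterion is an assumption)."""
    return min(cloud,
               key=lambda p: sum(squared_distance_to_ray(p, o, d)
                                 for o, d in rays))
```

Restricting the estimate to points that exist in the reconstructed scene ties the gaze depth to a real surface, whereas the midpoint estimator can place the gaze point in empty space when the two rays are nearly parallel.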