Gaze Tracking with Head-mounted Trackers: System Modeling, Calibration and Estimation of 3D Point of Regard


Student thesis: Doctoral Thesis





Award date: 14 Jun 2019


This dissertation presents a gaze estimation framework for head-mounted gaze tracking (HMGT) systems. Current HMGT systems face the following challenges: (a) the conventional calibration process is tedious, and the distributions of the acquired training data tend to be insufficiently informative; (b) 2D gaze estimation often suffers from interpolation/extrapolation and parallax errors; and (c) large 3D gaze prediction errors are likely to occur due to over-reliance on simplified 3D eye models. Coping with these challenges raises three questions: (a) How can a reliable and representative training data set be collected? (b) How can a regression model be formed to efficiently encode the real gaze mapping function? (c) How can the constructed regression model be effectively trained given the training data? This dissertation is pioneering research aimed at answering these three questions. First, a novel data acquisition method is proposed to collect training gaze data. Instead of successively fixating on a grid of calibration points, users gaze at a single point while rotating their heads, so that the calibration points are densely distributed over a wide field of view. To avoid computational degeneracy in recovering the epipolar geometry, this head-rotation motion should be performed twice at two different depths. A smooth-pursuit-based outlier removal method is then proposed to identify users' distractions during calibration, notably improving the reliability of the training gaze data.
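During head-rotation calibration the tracked trajectory should vary smoothly over time, so distracted samples show up as large deviations from a smooth fit. The sketch below illustrates one such rejection rule; the polynomial order, the robust (median/MAD) threshold, and the function name are illustrative assumptions, not the dissertation's exact procedure.

```python
import numpy as np

def pursuit_inlier_mask(t, traj, degree=3, k=3.0):
    """Return a boolean inlier mask for a calibration trajectory.

    t    : (N,) sample timestamps
    traj : (N, 2) pupil-center (or image-gaze) positions recorded while
           the user fixates one point and rotates the head

    A low-order polynomial is fitted to each coordinate over time; samples
    whose residual exceeds a robust threshold are flagged as distractions.
    """
    resid_sq = np.zeros(len(t))
    for d in range(traj.shape[1]):
        coeffs = np.polyfit(t, traj[:, d], degree)
        resid_sq += (traj[:, d] - np.polyval(coeffs, t)) ** 2
    resid = np.sqrt(resid_sq)
    # Robust scale estimate (MAD scaled to approximate a std. deviation).
    sigma = 1.4826 * np.median(np.abs(resid - np.median(resid))) + 1e-12
    return resid < np.median(resid) + k * sigma
```

A sample far off the smooth path is rejected while the bulk of the trajectory is retained.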

After obtaining reliable and representative gaze data, an adequate regression model should be constructed to encode the 2D gaze mapping effectively. This dissertation applies a point-to-line relation to characterize the HMGT epipolar geometry. Specifically, an epipolar line (i.e., the image line of the visual axis) is predicted from each pupil center, so that a point of regard (POR) free from parallax error can be calculated as the intersection of the two epipolar lines inferred from both eyes. To this end, a homography mapping is first learned to precisely infer the image gaze point projected from one calibration depth, in which a sparse Gaussian process using pseudo-inputs captures the smooth residual field left unmodeled by the polynomial function. Combined with the resolved homography-like relation, the parallax errors observed at the other calibration depth are leveraged to estimate the position of the epipolar point. The point-to-line relation is then obtained from the epipolar point and the homography-like mapping.
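The final intersection step admits a compact closed form in homogeneous image coordinates: two lines intersect at the cross product of their coefficient vectors. A minimal sketch, assuming each epipolar line is represented as a homogeneous 3-vector (a, b, c) for a*x + b*y + c = 0 (the function name is an assumption):

```python
import numpy as np

def intersect_epipolar_lines(line_left, line_right, eps=1e-12):
    """Intersect the two predicted epipolar lines to obtain the
    parallax-free 2D POR in the scene image.

    Each line is a homogeneous 3-vector (a, b, c) with a*x + b*y + c = 0;
    the intersection is their cross product, dehomogenized.
    """
    p = np.cross(line_left, line_right)
    if abs(p[2]) < eps:
        raise ValueError("epipolar lines are (near-)parallel")
    return p[:2] / p[2]
```

For example, the lines x = 1 and y = 2 intersect at (1, 2).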

To predict 3D gaze points, a simple back-projection model is designed to associate the pupil center with its visual direction in a local manner. The partition structure of the input space is determined via the leave-one-out cross-validation criterion. A nonlinear optimization problem is then formulated to minimize the angular disparities between the visual axes calculated from the back-projection matrices and the vectors originating from the eyeball center to the 3D gaze points. Given the approximate distance between the eyeball center and the scene camera, the 3D position of the eyeball center can be initialized from the epipolar points recovered in the 2D prediction model. The visual axes of both eyes are then estimated from the optimization results, and the 3D gaze point is inferred as the point closest to the two visual axes.
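The point closest to two visual axes is commonly taken as the midpoint of the shortest segment between the two 3D lines; the sketch below implements that standard construction (the parameterization by eyeball-center origins and unit directions, and the function name, are assumptions):

```python
import numpy as np

def triangulate_gaze(o1, d1, o2, d2, eps=1e-12):
    """Midpoint of the shortest segment between two visual axes.

    Each axis is a 3D line given by an origin o (eyeball center) and a
    direction d. Solves the least-squares closest points on both lines
    and averages them.
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    b = o2 - o1
    a = d1 @ d2
    denom = 1.0 - a * a
    if denom < eps:  # parallel axes: no unique closest point
        raise ValueError("visual axes are parallel")
    t1 = (b @ d1 - a * (b @ d2)) / denom
    t2 = (a * (b @ d1) - b @ d2) / denom
    return 0.5 * ((o1 + t1 * d1) + (o2 + t2 * d2))
```

When the two axes actually intersect, the midpoint coincides with the intersection point.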

This dissertation also discusses why the commonly used single-depth calibration results in computational degeneracy and how the selection of the two calibration depths affects the precision of the model estimation. These discussions provide practical guidance for implementing the proposed framework. In addition, other regression models, such as the region-wise hybrid fundamental-matrix model, are proposed and compared on the two-depth calibration data.

Lastly, image registration and perspective-n-point (PnP) techniques are leveraged to estimate the position and orientation of the scene camera in world coordinates. The POR calculated in scene-camera coordinates is accordingly transformed into world coordinates, where it can be used directly in shared manipulation tasks and other human-robot interaction applications.
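Once the camera pose is known, the coordinate change is a single rigid transform. A minimal sketch, assuming the usual PnP convention X_cam = R @ X_world + t (so the inverse mapping is X_world = R.T @ (X_cam - t)); the function name is an assumption:

```python
import numpy as np

def por_camera_to_world(por_cam, R, t):
    """Map a 3D POR from scene-camera to world coordinates.

    Assumes the PnP pose convention X_cam = R @ X_world + t, so the
    inverse is X_world = R.T @ (X_cam - t).
    """
    return R.T @ (por_cam - t)
```

Round-tripping a point through the forward pose and back recovers the original world coordinates.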