The use of visual information in automatic speech recognition has attracted
considerable research interest in recent years because visual cues can make a
recognition system more robust. This thesis presents the development of a real-time
automatic speech recognition system based solely on visual cues from lip shape and
movement. Lip segmentation, lip modeling, visual feature extraction and recognition
are the major components of the system, and new algorithms for each of them are
presented in this thesis.
For lip segmentation, most widely used methods rely on color or intensity
information. For lip images with low contrast, however, these methods do not
achieve satisfactory results. In our study, both color information and spatial
location are integrated into a fuzzy clustering framework. The lip image is
represented in the CIELAB and CIELUV color spaces, where luminance and
chromaticity information are separated and the distance between any two points in
the color space is approximately proportional to their perceptual color difference.
By integrating the spatial information, the proposed algorithm can differentiate
pixels that have similar color but lie in different regions. Experimental results
show that the proposed algorithm outperforms other lip segmentation techniques,
especially for images with low color contrast. An extension of the algorithm has also
been developed to handle lip segmentation in the presence of beards, which is
regarded as a difficult problem for lip region segmentation.
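
The idea of clustering on color and location together can be illustrated with a minimal sketch. It is not the formulation used in the thesis: it simply runs standard fuzzy c-means on feature vectors that append weighted pixel coordinates to a color triple. The color values are assumed to be already converted to CIELAB, and the names `pixel_features`, `memberships_for`, `fuzzy_c_means` and `spatial_weight` are hypothetical.

```python
import math

def pixel_features(color, x, y, spatial_weight):
    """Append weighted pixel coordinates to a color triple (assumed CIELAB)."""
    return [*color, spatial_weight * x, spatial_weight * y]

def memberships_for(points, centers, m):
    """Fuzzy c-means membership update: u_ik is proportional to d_ik^(-2/(m-1))."""
    result = []
    for p in points:
        # Clamp the distance so a point sitting exactly on a center is safe.
        inv = [max(math.dist(p, c), 1e-12) ** (-2.0 / (m - 1.0)) for c in centers]
        total = sum(inv)
        result.append([v / total for v in inv])
    return result

def fuzzy_c_means(points, centers, m=2.0, iters=20):
    """Standard fuzzy c-means; returns (memberships, centers)."""
    centers = [list(c) for c in centers]
    dims = len(points[0])
    for _ in range(iters):
        u = memberships_for(points, centers, m)
        # Center update: mean of the points weighted by u_ik^m.
        for k in range(len(centers)):
            weights = [row[k] ** m for row in u]
            wsum = sum(weights)
            centers[k] = [
                sum(w * p[j] for w, p in zip(weights, points)) / wsum
                for j in range(dims)
            ]
    return memberships_for(points, centers, m), centers

# Toy data (made up): six pixels with nearly identical color in two distant regions.
pixels = [
    ((50.0, 20.0, 10.0), 0, 0), ((51.0, 21.0, 10.0), 1, 0), ((50.5, 20.5, 10.0), 0, 1),
    ((50.0, 19.0, 10.0), 20, 20), ((51.0, 20.0, 10.0), 21, 20), ((50.5, 19.5, 10.0), 20, 21),
]
features = [pixel_features(c, x, y, spatial_weight=1.0) for c, x, y in pixels]
memberships, centers = fuzzy_c_means(features, [features[0], features[-1]])
```

In this toy case the color values alone barely separate the two groups; the coordinate terms are what pull the memberships apart. How strongly location should count is controlled here by the assumed `spatial_weight` parameter.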
For lip contour modeling and extraction, accuracy, robustness and efficiency
are the primary concerns. A 16-point model is employed to describe the lip contour,
and geometric constraints are applied to ensure that the extracted lip contour is
physically meaningful. Based on the membership distribution derived from the lip
segmentation procedure, a region-based cost function is defined, which is much more
robust than edge-based and intensity-based cost functions. A point-driven
optimization procedure with fast implementation techniques is used for model
fitting, so that the lip contour can be obtained efficiently.
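
A region-based cost of this kind can be sketched as follows. The abstract does not give the exact cost function, so the form below is an assumption: it penalizes low lip membership inside a candidate region and high lip membership outside it. An ellipse stands in for the 16-point contour model, and `region_cost` and `ellipse` are hypothetical names.

```python
def region_cost(lip_membership, inside):
    """Sum of disagreements between a membership map and a candidate region.

    lip_membership: 2-D list of lip memberships in [0, 1], indexed [y][x].
    inside: function (x, y) -> bool, True if the pixel lies in the candidate region.
    """
    cost = 0.0
    for y, row in enumerate(lip_membership):
        for x, u in enumerate(row):
            # Inside the region we want u near 1; outside we want u near 0.
            cost += (1.0 - u) if inside(x, y) else u
    return cost

def ellipse(cx, cy, rx, ry):
    """Return an inside-test for an axis-aligned ellipse (stand-in for a lip contour)."""
    return lambda x, y: ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0

# Toy membership map (made up): 0.9 inside the "true" lip region, 0.1 elsewhere.
true_region = ellipse(10, 5, 6, 3)
membership_map = [
    [0.9 if true_region(x, y) else 0.1 for x in range(21)]
    for y in range(11)
]

aligned_cost = region_cost(membership_map, ellipse(10, 5, 6, 3))
shifted_cost = region_cost(membership_map, ellipse(13, 5, 6, 3))
```

Because every pixel contributes to the sum, a misplaced candidate is penalized over its whole area rather than only along its boundary, which is the usual argument for region-based costs over edge-based ones.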
A visual feature set containing geometric parameters, lip shape descriptors
and inner mouth information is obtained from the lip model for visual speech
recognition. A spline representation is employed to translate the discretely
sampled visual features into the continuous domain. The spline coefficients within
the same word class are constrained to share the same mean and covariance matrix,
which can be estimated from the training data by the expectation-maximization (EM)
algorithm. In the speaker-independent recognition task, a multi-model approach is
proposed to overcome the difficulty caused by the large variation among different
speakers. Compared with the hidden Markov model (HMM), the proposed method gives
better results, especially when only limited training data are available.
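
Moving a sampled feature track into the continuous domain can be illustrated with a small sketch. The abstract does not specify which spline is used, so a uniform Catmull-Rom spline is swapped in here purely for illustration. The function names and the mouth-width values are made up.

```python
def catmull_rom(samples, t):
    """Evaluate a uniform Catmull-Rom spline through `samples` at time t.

    t is a real-valued frame index in [0, len(samples) - 1]; the curve passes
    through every sample. Requires at least two samples.
    """
    n = len(samples)
    i = min(int(t), n - 2)          # index of the segment containing t
    s = t - i                       # position within the segment, in [0, 1]
    p0 = samples[max(i - 1, 0)]     # end points are repeated at the boundaries
    p1 = samples[i]
    p2 = samples[i + 1]
    p3 = samples[min(i + 2, n - 1)]
    return 0.5 * (
        2.0 * p1
        + (-p0 + p2) * s
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * s ** 2
        + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * s ** 3
    )

def resample(samples, count):
    """Sample the continuous curve at `count` evenly spaced times (count >= 2)."""
    last = len(samples) - 1
    return [catmull_rom(samples, k * last / (count - 1)) for k in range(count)]

# Made-up mouth-width track over five frames, resampled to nine points.
mouth_width = [30.0, 34.0, 41.0, 38.0, 31.0]
resampled = resample(mouth_width, 9)
```

Once a track is a continuous curve, utterances of different lengths can be compared at a fixed number of points, or through their spline coefficients as the abstract describes.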
An automatic lipreading system has been implemented and runs on a 1.9 GHz PC.
Accuracies of 96% for speaker-dependent recognition and 88% for speaker-independent
recognition have been achieved. With efficient implementation of all the
algorithms, the system is able to process images at a rate higher than 25
frames/sec, leaving room for additional tasks in real-time applications.
| Date of Award | 4 Oct 2004 |
|---|---|
| Original language | English |
| Awarding Institution | City University of Hong Kong |
| Supervisor | Wing Hong Ricky LAU (Supervisor) |
- Lipreading
- Automatic speech recognition
- Computer simulation
The development of an automatic real-time lipreading system
WANG, S. (Author). 4 Oct 2004
Student thesis: Doctoral Thesis