Discriminative training for speech and speaker recognition
Student thesis: Doctoral Thesis
A spoken utterance carries both linguistic information (the text content) and information about the identity of the speaker. Speech recognition aims to recover the text content, whereas speaker recognition aims to determine or verify the identity of the speaker. The two tasks are mostly based on the hidden Markov model (HMM) and the Gaussian mixture model (GMM), respectively. Before an HMM or GMM can be used for recognition, its parameters must be trained to describe the observation sequences of the utterances. Many optimization criteria can be used for this training. Maximum likelihood (ML) estimation is considered a good choice because of its simplicity and mathematical tractability. However, the ML criterion considers only the likelihood of the labeled data; that is, each model is estimated separately from its own assigned training utterances. When models are confusable or the training data is limited, ML training is very likely to reach only a locally optimal solution. To address this problem, discriminative training is studied in this thesis.
The maximum model distance (MMD) criterion was first proposed for HMM-based speech recognition, where it maximizes a dissimilarity measure over the whole model set. In the MMD-based algorithm, each HMM represents the stochastic characteristics of a class of acoustic signals, and differences between those characteristics are mapped to dissimilarities between the corresponding HMMs. Maximizing the dissimilarities among the HMMs is expected to improve the performance of the speech recognizer. In this thesis, we study the MMD criterion in detail and extend it to speaker recognition. We propose an efficient discriminative algorithm for GMM training. We also investigate the characteristics of speaker recognition and propose a novel competitive selection strategy that further improves the performance of discriminative training. Experimental results on speaker identification and verification show that our training approach greatly improves performance over the conventional ML method, especially when only limited training data is available.
Model parameters are usually estimated starting from an initialized model. Under the ML criterion, the parameters are updated iteratively and the probability of the observation sequences increases until it reaches a limiting point. Because of this hill-climbing behavior, however, an arbitrary choice of initial model parameters usually leads to a suboptimal model in practice. The genetic algorithm (GA) brings a global search capability to the problem, with the aim of finding the best solution. We first present a hybrid GA for HMM-based speech recognition, which performed better than the conventional Baum-Welch method when sufficient training data was available. We then extend the hybrid GA to GMM training for speaker identification, using ML re-estimation as a heuristic operator to speed up the convergence of the GA. Experimental results show that the proposed GA obtains better-optimized GMMs than both the simple GA and the conventional ML estimation method.
When there is great acoustic variability between the training and test data, an adaptation procedure is needed to minimize the mismatch. Based on the MMD criterion, we propose a novel discriminative adaptation scheme. This approach works effectively with any amount of adaptation data, and all parameters of every HMM can be adapted, whether or not adaptation data is available for that HMM. Compared with the conventional approach, MMD adaptation is a suitable alternative that provides an individual transformation matrix for each Gaussian component when there is little adaptation data. Because it can adapt all model parameters with any amount of adaptation data, the whole HMM is forced to match the new environment. Furthermore, MMD adaptation can exploit the discriminative information in the adaptation data to enhance the discriminative capability of the recognizer.
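As a point of reference for the ML baseline that the abstract compares against, the following is a minimal sketch of GMM-based speaker identification: each speaker has one diagonal-covariance GMM, and a test utterance is assigned to the speaker whose model gives the highest average log-likelihood. It illustrates only the conventional decision rule, not the MMD training or adaptation proposed in the thesis. The function names, the toy models, and the two-dimensional features are assumptions made for illustration.

```python
import numpy as np


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature vectors; weights: (M,) mixture weights;
    means, variances: (M, D) per-component parameters.
    """
    diff = frames[:, None, :] - means[None, :, :]  # shape (T, M, D)
    # Log density of every frame under every Gaussian component, shape (T, M).
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )
    log_weighted = log_gauss + np.log(weights)
    # Numerically stable log-sum-exp over the mixture components.
    peak = log_weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(frame_ll.mean())


def identify_speaker(frames, speaker_models):
    """Return the best-scoring speaker name and all per-speaker scores."""
    scores = {
        name: gmm_log_likelihood(frames, *params)
        for name, params in speaker_models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy models: two speakers, two mixture components each.
toy_models = {
    "speaker_a": (
        np.array([0.5, 0.5]),
        np.array([[0.0, 0.0], [1.0, 1.0]]),
        np.ones((2, 2)),
    ),
    "speaker_b": (
        np.array([0.5, 0.5]),
        np.array([[3.0, 3.0], [4.0, 4.0]]),
        np.ones((2, 2)),
    ),
}

# Synthetic test utterance drawn near speaker_b's components.
rng = np.random.default_rng(0)
test_frames = rng.normal(loc=3.0, scale=1.0, size=(50, 2))
best, scores = identify_speaker(test_frames, toy_models)
```

Each model here is scored in isolation, which mirrors the limitation of ML training noted in the abstract; a discriminative criterion would additionally take the competing speakers' models into account.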
- Speech processing systems, Automatic speech recognition