This dissertation details the development of a robust speech feature extraction technique called spectral compression and its relation to human loudness perception. The function of spectral compression is to reduce the mismatch between reference models trained on clean speech data and testing patterns derived from noisy speech. In simple terms, spectral compression compresses the components of the speech power spectrum by taking a root, i.e. raising each component to a positive exponent smaller than one. Frame-based DFT speech analysis is adopted in this study to investigate the robustness of this technique. In the literature, constant (uniform) spectral compression has already been used in the Root Cepstral Analysis (RCA) and Perceptual Linear Prediction (PLP) speech analysis methods. In the first part of the thesis, a compression function is proposed in which each spectral component has its own root, instead of one constant root or exponent for the whole spectrum. We call this approach Non-uniform Spectral Compression (NSC). A decaying exponential curve is used as the compression function in the NSC. Experimental results show that commonly used feature extraction methods, such as MFCC, PLP and LPC, show significant improvement under white, pink and factory noise when combined with the NSC, compared with the conventional uncompressed and constant-root approaches. Interestingly, the human auditory system employs a similar intensity compression operation, which can be approximated by the “power law of hearing” in psychophysics, to convert physical sound intensity into perceived loudness. Psychoacoustic studies have shown that the compression root or exponent in the power law of hearing is smaller for a broadband signal than for a narrowband 1 kHz tone. This finding inspires the incorporation of human perception into the NSC scheme and leads to the development of Perceptual Non-uniform Spectral Compression (PNSC).
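The difference between constant-root compression and the non-uniform variant can be illustrated with a minimal sketch. The abstract does not give the exact compression function, so the curve form `floor + (start - floor) * exp(-decay * k)`, the choice of bin index `k` as its variable, and all parameter values and function names below are assumptions made purely for illustration, not the thesis's actual formulation.

```python
import math


def uniform_compress(power_spectrum, root=0.5):
    """Constant-root compression: every bin is raised to the same exponent."""
    if not 0.0 < root < 1.0:
        raise ValueError("root must be a positive exponent smaller than one")
    return [p ** root for p in power_spectrum]


def decaying_roots(num_bins, start=0.9, floor=0.3, decay=0.1):
    """Hypothetical per-bin exponents on a decaying exponential curve.

    Assumed form (not taken from the thesis):
        root(k) = floor + (start - floor) * exp(-decay * k)
    where k is the spectral bin index.
    """
    return [floor + (start - floor) * math.exp(-decay * k)
            for k in range(num_bins)]


def nonuniform_compress(power_spectrum, roots):
    """Non-uniform compression: each bin is raised to its own exponent."""
    if len(power_spectrum) != len(roots):
        raise ValueError("need exactly one root per spectral component")
    return [p ** r for p, r in zip(power_spectrum, roots)]


if __name__ == "__main__":
    # Toy power spectrum of one frame (made-up values).
    frame = [400.0, 250.0, 90.0, 40.0, 12.0]
    print(uniform_compress(frame, root=0.5))
    print(nonuniform_compress(frame, decaying_roots(len(frame))))
```

In a full front end, the compressed spectrum would then feed the usual MFCC, PLP or LPC processing in place of the uncompressed power spectrum.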
In the PNSC scheme, sound segments and speech components that have a large bandwidth are given a small exponent, and hence a large degree of compression. Narrowband signals, on the other hand, are compressed only slightly and are largely retained for recognition. Using the PNSC, substantial improvement over the NSC scheme is obtained under white, pink and factory noise. Under babble and Volvo noise, the improvement of both NSC and PNSC over conventional features is less significant. Both the NSC and the PNSC use a compression function based on a decaying exponential, and such schemes may not be effective when the noise has speech-like or coloured characteristics. A radically different compression scheme called SNR-dependent Non-uniform Spectral Compression (SNSC) is therefore proposed to deal with this problem. In the SNSC scheme, speech components that are corrupted by noise are de-emphasized according to their estimated SNR. Since different noise types have their own spectral characteristics and corrupt different regions of the speech spectrum, an SNR-dependent compression scheme is well suited to handling them. Moreover, the principle of SNSC is supported by psychoacoustic evidence that background noise produces a partial masking effect on a sound, where the SNR determines the perceived loudness. Simulation results show that the SNSC further boosts recognition accuracy over the PNSC and is able to deal with different noise types.
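The SNR-dependent idea can be sketched in the same spirit. The abstract does not specify how the SNR is estimated or how it maps to an exponent, so everything here is an assumption for illustration: the power-subtraction SNR estimate, the clamped linear mapping in `snr_to_root`, the parameter values, and the function names are all hypothetical. The sketch also assumes spectral values are scaled above one, so that a smaller exponent yields a smaller output.

```python
import math


def snr_db(noisy_power, noise_power, eps=1e-12):
    """Per-bin SNR estimate in dB using simple power subtraction (assumed)."""
    speech = max(noisy_power - noise_power, eps)
    return 10.0 * math.log10(speech / max(noise_power, eps))


def snr_to_root(snr, low_db=0.0, high_db=20.0, min_root=0.2, max_root=0.8):
    """Hypothetical clamped linear map from SNR (dB) to an exponent.

    Bins at or below low_db get min_root (strongest compression);
    bins at or above high_db get max_root (weakest compression).
    """
    t = (snr - low_db) / (high_db - low_db)
    t = min(1.0, max(0.0, t))
    return min_root + t * (max_root - min_root)


def snr_dependent_compress(noisy_spectrum, noise_spectrum):
    """Raise each bin to an exponent chosen from its estimated SNR."""
    if len(noisy_spectrum) != len(noise_spectrum):
        raise ValueError("spectra must have the same number of bins")
    return [p ** snr_to_root(snr_db(p, n))
            for p, n in zip(noisy_spectrum, noise_spectrum)]


if __name__ == "__main__":
    # Made-up values: the second bin is dominated by noise.
    noisy = [1000.0, 1000.0, 500.0]
    noise = [10.0, 900.0, 100.0]
    print(snr_dependent_compress(noisy, noise))
```

Because the exponent is chosen per bin from the local SNR, the same routine adapts to noise that occupies different spectral regions, which is the property the abstract attributes to the SNSC.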
| Date of Award | 15 Jul 2005 |
|---|---|
| Original language | English |
| Awarding Institution | City University of Hong Kong |
| Supervisor | Shu Hung LEUNG (Supervisor) |
- Automatic speech recognition
- Speech processing systems
Feature extraction based on perceptual non-uniform spectral compression for noisy speech recognition
CHU, K. K. (Author). 15 Jul 2005
Student thesis: Master's Thesis