Head-related transfer function modeling and customization

  • Zhixin WANG

Student thesis: Doctoral Thesis

Abstract

Virtual 3D sound synthesis aims to create a three-dimensional perception of sound using only two earphones. Compared with physical 3D sound synthesis, which requires multiple loudspeakers placed at designated positions, virtual 3D sound synthesis has advantages in a wide range of applications, such as mobile entertainment devices (MP3 and MP4 players, etc.), human aid systems, computer games and military simulations. The head-related impulse response (HRIR), which captures the filtering effects of the human torso, head and pinna on a sound propagating from a specific spatial position to the eardrum of a listener, is the core of virtual 3D sound synthesis. Using the measured HRIR, a vivid 3D sound illusion can be created with sounds produced by two transducers positioned at the listener's ears. However, the measured HRIR, which is a function of time, elevation and azimuth and varies across subjects, forms a large dataset. Moreover, the tedious measurement procedure and the special equipment required make experimental measurement impractical for commercial applications. It has long been desirable to generate individualized HRIRs more efficiently and to reduce the storage requirement and computational complexity of real-time virtual 3D sound synthesis. In this thesis, a hybrid implementation scheme combining principal component analysis (PCA) with balanced model truncation (BMT) is proposed to reduce both the computational complexity and the storage requirement. A grouping strategy divides the HRIRs into several groups according to their similarities so that PCA performs better. This implementation scheme is particularly advantageous when there are multiple sound sources: its computational complexity hardly increases as sound sources are added. 
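The core idea of the PCA stage above, representing every HRIR as a few weights on a small set of shared basis vectors, can be sketched as follows. This is a minimal illustration with synthetic data; the array sizes and function names are assumptions, not the thesis's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a measured HRIR dataset:
# 200 directions, 128-tap impulse responses (sizes are illustrative).
hrirs = rng.standard_normal((200, 128))

def pca_compress(data, n_components):
    """Represent each HRIR by n_components weights on shared basis vectors."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal components of the centered dataset.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]       # shared across all directions
    weights = centered @ basis.T    # per-direction weights
    return mean, basis, weights

def pca_reconstruct(mean, basis, weights):
    """Approximate each HRIR from its weights and the shared basis."""
    return mean + weights @ basis

mean, basis, weights = pca_compress(hrirs, n_components=20)
approx = pca_reconstruct(mean, basis, weights)
```

Storage drops from one full impulse response per direction to a shared basis plus a short weight vector per direction, which is what makes the scheme cheap to extend to multiple simultaneous sources.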
A common factor decomposition (CFD) algorithm with IIR modeling of the directional factor is also proposed to improve the performance of the virtual sound system. A two-dimensional common factor decomposition (2D-CFD) algorithm is further developed to represent the three-dimensional (time, elevation and azimuth) HRIR dataset with a set of elevation-dependent impulse responses and a set of azimuth-dependent impulse responses, reducing the storage requirement. Common-pole IIR (CP-IIR) filter modeling is then applied to simplify computation. The proposed algorithm is much more efficient and yields lower distortion than other algorithms in the literature. However, because the size of the dataset obtained by 2D-CFD and CP-IIR modeling depends on the spatial resolution of measurement, the storage requirement is still large if HRIRs are measured at high resolution. To avoid the growth in dataset size caused by increasing the spatial resolution of measurement, a continuous function model is proposed that represents the measured HRIR as an IIR filter whose coefficients are low-order harmonic functions of elevation and azimuth. The continuous function model reduces HRIR storage dramatically, and the memory requirement does not increase even if HRIRs are measured at a higher spatial resolution. With an efficient method for harmonic function evaluation, the proposed model also has comparatively low computational complexity. For HRIR customization, the 2D-CFD algorithm is further applied to an HRIR dataset containing HRIRs of multiple subjects at multiple directions to extract a set of direction-dependent impulse responses (DDIRs) that are common to all subjects. A subject-dependent impulse response (SDIR) is extracted for each subject simultaneously to capture the subject-dependent information contained in the HRIR. 
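The continuous-function idea, that a filter coefficient varies smoothly with direction and can therefore be captured by a low-order harmonic series rather than stored at every measured angle, can be sketched as below. The data and the harmonic order are synthetic assumptions for illustration only; the thesis fits harmonics of both elevation and azimuth, while this sketch shows azimuth alone.

```python
import numpy as np

# Suppose one IIR filter coefficient has been estimated at 72 measured
# azimuths (smooth synthetic values standing in for real estimates).
azimuths = np.linspace(0, 2 * np.pi, 72, endpoint=False)
coeff = 0.3 + 0.2 * np.cos(azimuths) - 0.1 * np.sin(2 * azimuths)

def fit_harmonics(theta, values, order):
    """Least-squares fit of c0 + sum_k (a_k cos k*theta + b_k sin k*theta)."""
    cols = [np.ones_like(theta)]
    for k in range(1, order + 1):
        cols.append(np.cos(k * theta))
        cols.append(np.sin(k * theta))
    design = np.column_stack(cols)
    params, *_ = np.linalg.lstsq(design, values, rcond=None)
    return design, params

design, params = fit_harmonics(azimuths, coeff, order=2)
fitted = design @ params

# The coefficient at ANY azimuth can now be evaluated from five harmonic
# parameters instead of being stored at every measured direction.
```

Because the model is continuous in angle, measuring HRIRs at a finer spatial resolution changes only the fitting data, not the number of stored parameters.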
Such modeling not only reduces the dimensionality of the HRIR dataset but also allows the customization of a whole set of HRIRs via the customization of an SDIR. Two methods are proposed to calculate a target subject's SDIR. In the first method, joint support vector regression (JSVR) is used to train a nonlinear model that predicts a target subject's SDIR from his/her anthropometric parameters. In the second method, the target subject's SDIR is extracted from a few sampled HRIR measurements of the subject. The derived SDIR is then convolved with the trained DDIRs to construct the whole set of HRIRs for the target subject. Listening tests show that both methods generate HRIRs similar to the measured ones.
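The customization pipeline, regress from anthropometric parameters to an SDIR, then convolve the SDIR with each DDIR, can be sketched as follows. All data here is synthetic, and a plain ridge regression is used as a simple linear stand-in for the thesis's JSVR model; sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative training data: 30 subjects, 8 anthropometric parameters
# each, and a 32-tap SDIR per subject (generated from a hidden linear map).
anthro = rng.standard_normal((30, 8))
true_map = rng.standard_normal((8, 32))
sdirs = anthro @ true_map + 0.01 * rng.standard_normal((30, 32))

def fit_ridge(x, y, alpha=1e-3):
    """Ridge regression: a linear stand-in for the JSVR mapping."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + alpha * np.eye(d), x.T @ y)

mapping = fit_ridge(anthro, sdirs)

# Predict a new subject's SDIR from measured anthropometric parameters,
# then convolve it with each trained DDIR to obtain that direction's HRIR.
new_anthro = rng.standard_normal(8)
predicted_sdir = new_anthro @ mapping

ddir = rng.standard_normal(16)  # one direction-dependent impulse response
predicted_hrir = np.convolve(predicted_sdir, ddir)
```

Only the short SDIR is subject-specific; the DDIRs are trained once and shared, so personalizing the full HRIR set reduces to predicting one impulse response.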
Date of Award: 3 Oct 2014
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Cheung Fat CHAN

Keywords

  • Auditory perception
  • Surround-sound systems
