Abstract
This paper presents a voice conversion framework that incorporates phonetic information into an exemplar-based voice conversion approach. The idea is motivated by the observation that phone-dependent exemplars lead to better estimation of the activation matrix and therefore, possibly, to better conversion. We propose to use the phone segmentation results from automatic speech recognition (ASR) to construct a sub-dictionary for each phone. The proposed framework can work with or without parallel training data. With parallel training data, we found that the phonetic sub-dictionary outperforms the state-of-the-art baseline in both objective and subjective evaluations. Without parallel training data, we use Phonetic PosteriorGrams (PPGs) as speaker-independent exemplars in the phonetic sub-dictionary to serve as a bridge between speakers. We report that this technique achieves competitive performance without the need for parallel training data.
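To make the phonetic sub-dictionary idea concrete, the following is a minimal sketch of the parallel-data case, not the authors' implementation. It assumes non-negative spectral features, per-frame phone labels already obtained from an ASR segmentation, and source/target sub-dictionaries whose columns are paired exemplars. The function names (`estimate_activations`, `convert`), the sparsity weight, and the choice of KL-divergence multiplicative updates are all illustrative assumptions; the abstract does not specify the solver.

```python
import numpy as np


def estimate_activations(X, A, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H such that X ~= A @ H, with A fixed.

    Assumption: KL-divergence multiplicative updates with an L1 sparsity
    penalty, a common choice for exemplar-based sparse representation.
    X: (features x frames), A: (features x exemplars), both non-negative.
    """
    rng = np.random.default_rng(0)
    H = rng.random((A.shape[1], X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        # Multiplicative update keeps H non-negative at every step.
        H *= (A.T @ (X / (A @ H + eps))) / (A.T @ ones + sparsity + eps)
    return H


def convert(X, phones, src_dict, tgt_dict):
    """Convert source frames using one sub-dictionary per phone.

    phones:   phone label for each frame (length = number of frames).
    src_dict: phone -> (src_features x exemplars) source exemplars.
    tgt_dict: phone -> (tgt_features x exemplars) target exemplars,
              paired column-by-column with src_dict (hypothetical layout).
    """
    phones = np.asarray(phones)
    tgt_dim = next(iter(tgt_dict.values())).shape[0]
    Y = np.zeros((tgt_dim, X.shape[1]))
    for p in np.unique(phones):
        key = p.item()  # plain Python scalar for dictionary lookup
        idx = np.flatnonzero(phones == p)
        # Activations are estimated against this phone's exemplars only.
        H = estimate_activations(X[:, idx], src_dict[key])
        # The same activations are applied to the paired target exemplars.
        Y[:, idx] = tgt_dict[key] @ H
    return Y
```

Restricting each frame to the exemplars of its own phone shrinks the candidate set for the activation estimate, which is the intuition the abstract gives for the improved activation matrix.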
| Original language | English |
|---|---|
| Title of host publication | 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) - Proceedings |
| Publisher | IEEE |
| Pages | 677-684 |
| ISBN (Electronic) | 978-1-5090-4788-8 |
| Publication status | Published - Dec 2017 |
| Event | 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan (16 Dec 2017 → 20 Dec 2017) |
Conference
| Conference | 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 |
|---|---|
| Place | Japan |
| City | Okinawa |
| Period | 16/12/17 → 20/12/17 |
Bibliographical note
Full text of this publication does not contain sufficient affiliation information. With consent from the author(s) concerned, the Research Unit(s) information for this record is based on the existing academic department affiliation of the author(s).
Research Keywords
- phonetic exemplars
- Phonetic PosteriorGrams
- sparse representation
- Voice conversion