Sparse representation of phonetic features for voice conversion with and without parallel data

Research output: Chapters, Conference Papers, Creative and Literary Works (RGC: 12, 32, 41, 45) | 32_Refereed conference paper (with ISBN/ISSN) | peer-review

29 Scopus Citations
Author(s)

Sisman, Berrak; Li, Haizhou; Tan, Kay Chen

Detail(s)

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) - Proceedings
Publisher: IEEE
Pages: 677-684
ISBN (Electronic): 978-1-5090-4788-8
Publication status: Published - Dec 2017

Conference

Title: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Place: Japan
City: Okinawa
Period: 16 - 20 December 2017

Abstract

This paper presents a voice conversion framework that incorporates phonetic information into an exemplar-based voice conversion approach. The idea is motivated by the observation that phone-dependent exemplars lead to a better estimate of the activation matrix and therefore, possibly, to better conversion. We propose using the phone segmentation results from automatic speech recognition (ASR) to construct a sub-dictionary for each phone. The proposed framework can work with or without parallel training data. With parallel training data, we found that the phonetic sub-dictionary outperforms the state-of-the-art baseline in both objective and subjective evaluations. Without parallel training data, we use Phonetic PosteriorGrams (PPGs) as the speaker-independent exemplars in the phonetic sub-dictionary, which serve as a bridge between speakers. We report that this technique achieves competitive performance without the need for parallel training data.
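
The abstract does not give the algorithmic details, so the following is only a rough sketch of how a phone-dependent sub-dictionary could be used in exemplar-based conversion. It assumes non-negative spectral frames, a per-frame phone label from ASR, and paired source/target sub-dictionaries per phone. The activation estimate uses the standard multiplicative update for KL-divergence non-negative matrix factorization with an L1 sparsity penalty, which may differ from the update used in the paper. The names `estimate_activations`, `convert`, `src_dicts`, `tgt_dicts`, and the `sparsity` weight are hypothetical and chosen for illustration.

```python
import numpy as np


def estimate_activations(X, A, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate non-negative activations H with X ~= A @ H, keeping A fixed.

    X: (F, T) non-negative frames, A: (F, K) dictionary of exemplars.
    Uses KL-divergence multiplicative updates with an L1 penalty on H
    (an assumed, generic choice, not necessarily the paper's).
    """
    K, T = A.shape[1], X.shape[1]
    H = np.ones((K, T))
    col_sums = A.sum(axis=0)[:, None]  # A^T 1, shape (K, 1)
    for _ in range(n_iter):
        ratio = X / (A @ H + eps)
        H *= (A.T @ ratio) / (col_sums + sparsity)
    return H


def convert(X, phones, src_dicts, tgt_dicts, **kwargs):
    """Convert source frames phone by phone.

    phones: (T,) integer phone ID per frame (assumed to come from ASR).
    src_dicts / tgt_dicts: dict mapping phone ID -> (F_src, K_p) and
    (F_tgt, K_p) sub-dictionaries whose columns are paired exemplars.
    """
    n_out = next(iter(tgt_dicts.values())).shape[0]
    Y = np.zeros((n_out, X.shape[1]))
    for p in np.unique(phones):
        idx = np.flatnonzero(phones == p)
        # Activations are estimated only over this phone's exemplars.
        H = estimate_activations(X[:, idx], src_dicts[int(p)], **kwargs)
        # The same activations are applied to the paired target exemplars.
        Y[:, idx] = tgt_dicts[int(p)] @ H
    return Y
```

Under the same assumptions, the non-parallel setting described in the abstract could be approximated by passing PPG frames as `X` and PPG exemplars as `src_dicts`, with the target speaker's spectral exemplars in `tgt_dicts`; how the paper actually pairs PPGs with spectral exemplars is not stated here.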

Research Area(s)

  • phonetic exemplars, Phonetic PosteriorGrams, sparse representation, Voice conversion

Bibliographic Note

Full text of this publication does not contain sufficient affiliation information. With consent from the author(s) concerned, the Research Unit(s) information for this record is based on the existing academic department affiliation of the author(s).

Citation Format(s)

Sparse representation of phonetic features for voice conversion with and without parallel data. / Sisman, Berrak; Li, Haizhou; Tan, Kay Chen.

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) - Proceedings. IEEE, 2017. p. 677-684.
