Meta-Learning in Smart Voice Control Systems


Student thesis: Doctoral Thesis



Awarding Institution
  • Jianping WANG (Supervisor)
  • Qing Li (External person) (External Co-Supervisor)
Award date: 23 Dec 2020


Voice control systems, which have become popular in Human-Computer Interaction (HCI), aim to build a voice-based connection between humans and computers. Spoken term classification and speaker verification are two typical tasks in building such systems. Spoken term classification aims to recognize user commands, and speaker verification aims to identify who is issuing them. Conventional approaches to these two tasks require massive datasets to train deep learning models. However, it is not easy to apply such approaches directly to smart voice control systems, which aim to interact with humans intelligently and personalize the interactions, for example by introducing new commands and new command issuers, without large amounts of training data (i.e., new voice recordings).

Researchers have proposed various concepts and algorithms to enable effective machine learning from limited data, such as data augmentation, unsupervised learning, semi-supervised learning, and few-shot learning. The few-shot learning problem is defined as learning with limited labeled data for target tasks, using prior knowledge learned from external source data with a different distribution. Its characteristics fully match the requirements of building smart voice control systems. Among the different approaches to the few-shot learning problem, meta-learning aims to learn how to adapt quickly to new tasks from only a few examples.

In this thesis, we propose extended meta-learning approaches to solve few-shot classification tasks in smart voice control systems, such as speaker (i.e., command issuer) verification and user-defined spoken term classification. Although many applications in smart voice control systems are few-shot classification tasks, most of them in practice do not follow the N-way, K-shot setting that is usual in meta-learning. We therefore try to bridge this gap by extending traditional meta-learning approaches to fit the practical applications.

Conventional approaches to the speaker verification task learn a representation model that extracts speaker embeddings for verification. Among meta-learning approaches, prototypical networks learn a non-linear mapping from the input space to an embedding space with a predefined distance metric. We investigate the use of prototypical networks in a small-footprint, text-independent speaker verification task and find that they outperform the conventional method when the amount of data per training speaker is limited.
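The prototype idea can be illustrated with a minimal sketch. It assumes embeddings have already been produced by some trained encoder, uses squared Euclidean distance as the predefined metric, and applies a fixed acceptance threshold; the function names, toy vectors, and threshold are hypothetical and are not taken from the thesis.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_euclidean(a, b):
    """Squared Euclidean distance between two embeddings (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_prototypes(support):
    """Map each speaker id to the mean of that speaker's support embeddings."""
    return {speaker: mean_vector(embs) for speaker, embs in support.items()}

def classify(query, prototypes):
    """Return the speaker whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda s: squared_euclidean(query, prototypes[s]))

def verify(query, prototype, threshold):
    """Accept a claimed identity if the query lies within the threshold."""
    return squared_euclidean(query, prototype) <= threshold

# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
support = {
    "alice": [[0.0, 0.0], [0.2, 0.0]],
    "bob": [[1.0, 1.0], [1.0, 0.8]],
}
prototypes = build_prototypes(support)
```

In a real system the encoder would be trained episodically so that distances to prototypes are discriminative; the sketch covers only the inference step.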

For the spoken term classification task, we formulate a user-defined scenario as a relaxed few-shot classification problem, which is N+M-way, K-shot, where N and M are the numbers of new classes and fixed classes, respectively. We propose a modification to the Model-Agnostic Meta-Learning (MAML) algorithm to solve the problem. The modified algorithm outperforms both the conventional supervised learning approach and the original MAML.
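One way to picture the N+M-way, K-shot setting is through episode sampling: N classes are drawn at random for each episode, while the M fixed classes appear in every episode. The sketch below is only an illustration under that assumption. The sampler, the class names (including "silence" and "unknown" as fixed classes), and the toy data are hypothetical, and the thesis's modification to MAML itself is not shown.

```python
import random

def sample_episode(new_pool, fixed_pool, n_new, k_shot, n_query, rng):
    """Sample one (N+M)-way, K-shot episode.

    new_pool / fixed_pool map a class name to its list of examples.
    N = n_new classes are drawn from new_pool; all M fixed classes are kept.
    Returns (class names, support set, query set) with integer labels.
    """
    new_classes = rng.sample(sorted(new_pool), n_new)
    chosen = [(name, new_pool[name]) for name in new_classes]
    chosen += [(name, fixed_pool[name]) for name in sorted(fixed_pool)]

    classes, support, query = [], [], []
    for label, (name, examples) in enumerate(chosen):
        picked = rng.sample(examples, k_shot + n_query)
        classes.append(name)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return classes, support, query

# Toy data: strings stand in for audio features (hypothetical).
new_pool = {w: [f"{w}_{i}" for i in range(5)]
            for w in ["lights", "music", "timer", "volume"]}
fixed_pool = {w: [f"{w}_{i}" for i in range(5)]
              for w in ["silence", "unknown"]}

classes, support, query = sample_episode(
    new_pool, fixed_pool, n_new=3, k_shot=1, n_query=2, rng=random.Random(0)
)
```

With N = 3, M = 2, and K = 1, each episode has five classes, five support examples, and ten query examples.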

Moreover, in this thesis, we try to improve the performance of some typical meta-learning algorithms and generalize them to a broader range of few-shot learning tasks in Artificial Intelligence (AI) fields other than speech and language processing. AI's remarkable success depends heavily on training deep neural networks on large-scale human-annotated data, which is quite expensive. Two challenges arise in preparing enough data in real applications: data diversity and data specialization. As the number of AI applications grows, different applications need customized data. Moreover, in professional areas such as the medical field, barriers exist to collecting and annotating data. These challenges lead to a few-shot learning problem in many AI tasks.

The applications in smart voice control systems show that meta-learning is an effective solution to the few-shot learning problem. However, current meta-learning algorithms have weaknesses. For example, MAML and its variants are popular optimization-based meta-learning algorithms. They train an initializer across various sampled learning tasks (i.e., episodes) such that the initialized model can adapt quickly to new tasks. However, current MAML-based algorithms have limitations in forming generalizable decision boundaries. We propose an approach called MetaMix, which generates virtual feature-target pairs within each episode to regularize the backbone models. MetaMix can be integrated with any MAML-based algorithm to learn decision boundaries that generalize better to new tasks. We apply it to computer vision tasks, conduct experiments on different image datasets, and find that it outperforms conventional MAML-based algorithms.
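The abstract does not state how the virtual feature-target pairs are constructed. As a rough illustration only, the sketch below assumes a mixup-style rule: two examples from the same episode are combined by convex interpolation of their features and one-hot targets, with the mixing weight drawn from a Beta(alpha, alpha) distribution. The function names and the `alpha` parameter are hypothetical, and the integration with the MAML inner and outer loops is omitted.

```python
import random

def one_hot(label, num_classes):
    """Encode an integer label as a one-hot target vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def mix_pair(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two feature-target pairs with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

def virtual_pairs(examples, num_classes, alpha, rng):
    """Build one virtual pair per example of a single episode.

    examples: list of (feature_vector, integer_label).
    Each example is mixed with a randomly chosen partner from the episode;
    lam ~ Beta(alpha, alpha) is an assumed mixing distribution.
    """
    partners = list(examples)
    rng.shuffle(partners)
    mixed = []
    for (x_a, l_a), (x_b, l_b) in zip(examples, partners):
        lam = rng.betavariate(alpha, alpha)
        mixed.append(mix_pair(x_a, one_hot(l_a, num_classes),
                              x_b, one_hot(l_b, num_classes), lam))
    return mixed

# Toy episode with 2-D features and three classes (hypothetical values).
episode = [([0.0, 2.0], 0), ([2.0, 0.0], 1), ([1.0, 1.0], 2)]
virtual = virtual_pairs(episode, num_classes=3, alpha=2.0, rng=random.Random(0))
```

Because the targets are interpolated along with the features, a model trained on such pairs is pushed toward smoother transitions between classes, which is one plausible reading of how virtual pairs could regularize the decision boundaries.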

To summarize, in this thesis, we propose extended meta-learning approaches to solve two few-shot classification tasks in smart voice control systems: speaker verification and user-defined spoken term classification. Moreover, we propose improved meta-learning algorithms and generalize them to more few-shot learning tasks in AI fields. Considering the problems left open in our work, we plan to further improve meta-learning approaches in smart voice control systems and apply them to more AI tasks.