Contrastive Learning for Target Speaker Extraction with Attention-based Fusion

Xiao Li*, Ruirui Liu, Huichou Huang*, Qingyao Wu*

*Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

Abstract

Given a reference speech clip from the target speaker, Target Speaker Extraction (TSE) is a challenging task that involves extracting the signal of the target speaker from a multi-speaker environment. TSE networks typically comprise a main network and an auxiliary network. The former uses the obtained target speaker embedding to generate an appropriate mask for isolating the target speaker's signal from those of other speakers, while the latter learns deep discriminative embeddings from the target speaker's signal. However, TSE networks often suffer performance degradation when dealing with unseen speakers or short reference speech. In this paper, we propose a novel approach that leverages contrastive learning in the auxiliary network to obtain better representations of unseen speakers and short reference speech. Specifically, we employ contrastive learning to bridge the gap between short and long speech features, so that the auxiliary network, given a short speech clip as input, generates feature embeddings as rich as those obtained from a long speech clip. This improves the recognition of unseen speakers and short speech clips. Moreover, we introduce an attention-based fusion method that integrates the speaker embedding into the main network adaptively to enhance mask generation. Experimental results demonstrate the effectiveness of our proposed method in improving the performance of TSE tasks in realistic open scenarios. © 2023 IEEE.
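The core idea of bridging short and long speech features can be illustrated with an InfoNCE-style contrastive objective. The sketch below is a minimal, hypothetical illustration (the paper's exact loss, batch layout, and temperature are not specified in the abstract): each short-clip embedding is pulled toward the long-clip embedding of the same speaker and pushed away from other speakers' embeddings in the batch.

```python
import numpy as np

def info_nce_loss(short_emb, long_emb, temperature=0.1):
    """Contrastive (InfoNCE-style) loss between short- and long-clip embeddings.

    short_emb, long_emb: (batch, dim) arrays where row i of each array is
    assumed to come from the same speaker (a hypothetical batch layout,
    not taken from the paper).
    """
    # L2-normalise rows so dot products become cosine similarities
    s = short_emb / np.linalg.norm(short_emb, axis=1, keepdims=True)
    l = long_emb / np.linalg.norm(long_emb, axis=1, keepdims=True)
    logits = s @ l.T / temperature                   # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: short clip i matches long clip i
    return float(-np.mean(np.diag(log_prob)))
```

Minimising this loss encourages the auxiliary network to map a short reference clip close to the embedding it would have produced from a long clip of the same speaker, which is the gap-bridging behaviour the abstract describes.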
Original language: English
Pages (from-to): 178-188
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 32
Online published: 13 Oct 2023
DOIs
Publication status: Published - 2024
Externally published: Yes

Research Keywords

  • Attention
  • Contrastive Learning
  • Decoding
  • Degradation
  • Feature extraction
  • Self-supervised Learning
  • Speaker Extraction
  • Speech enhancement
  • Task analysis
  • Transformers
  • Visualization
