Abstract
To address the poor performance of speaker verification systems in scenarios involving cross-domain, long-duration, and noisy utterances, this paper proposes PMS-Conformer, a real-time, robust speaker recognition model based on Conformer. Its architecture is inspired by the state-of-the-art MFA-Conformer. PMS-Conformer improves on the acoustic feature extractor, network components, and loss calculation module of MFA-Conformer, yielding a novel and effective acoustic feature extractor and a robust speaker embedding extractor with high generalization capability. PMS-Conformer is trained on the VoxCeleb1 and VoxCeleb2 datasets and is compared with the baseline MFA-Conformer and with ECAPA-TDNN in extensive speaker verification experiments. The results show that on VoxMovies (cross-domain utterances), SITW (long-duration utterances), and VoxCeleb-O with noise added to its utterances, the automatic speaker verification (ASV) system built with PMS-Conformer is more competitive than those built with MFA-Conformer and ECAPA-TDNN. Moreover, the number of trainable parameters and the real-time factor (RTF) of the PMS-Conformer speaker embedding extractor are significantly lower than those of ECAPA-TDNN. Overall, the evaluation results demonstrate that PMS-Conformer performs well in real-time, multi-scenario settings.
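The abstract reports RTF without defining it. As a reminder of the conventional definition (processing time divided by the duration of the audio processed, so values below 1 mean faster than real time), here is a minimal sketch. The function name `real_time_factor` and the example numbers are illustrative assumptions and are not taken from the paper, which may measure RTF under its own conditions.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return the real-time factor: processing time / audio duration.

    Hypothetical helper for illustration; not the paper's measurement code.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds


# Illustrative numbers only: 0.5 s spent extracting an embedding
# from a 10 s utterance.
rtf = real_time_factor(0.5, 10.0)
print(f"RTF = {rtf:.3f}")
```

A lower RTF means that less compute time is needed per second of speech, which is why it is reported alongside the parameter count when arguing that a model suits real-time use.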
| Translated title of the contribution | Conformer-Based Speaker Recognition Model for Real-Time Multi-Scenarios |
|---|---|
| Original language | Chinese (Simplified) |
| Pages (from-to) | 147-156 |
| Journal | 计算机工程与应用 Computer Engineering and Applications |
| Volume | 60 |
| Issue number | 7 |
| DOIs | |
| Publication status | Published - Apr 2024 |
| Externally published | Yes |
Research Keywords
- speaker verification
- MFA-Conformer
- Sub-center AAM-Softmax
- speaker embedding
- acoustic feature extraction