Unified deep convolutional network for robust and highly generalized speaker clustering
Date
2025
Publisher
IEEE
Abstract
Speaker clustering (SC) is the task of grouping utterances by speaker without prior knowledge of the number or identity of the speakers. In this paper, we elaborate on the application of transfer learning to a modified Visual Geometry Group (VGGish) network trained on AudioSet data for large-scale audio classification. We transfer the knowledge from VGGish, integrate a Micro CNN architecture, and enhance the voice feature modeling for the SC task. With our hybrid embedding extraction method (VGGish-SC), we outperform state-of-the-art SC methods in terms of misclassification rate (MR) on the TIMIT and VCTK datasets. Our experiments show that the proposed method improves on state-of-the-art approaches by 25% in the in-domain setting and by 75% in the out-of-domain setting. We also report, for the first time, baseline SC results on noisy utterances, speaker accent variations, and language variations.
