Abstract:
This paper proposes a new feature aggregation mechanism for deep-neural-network-based speaker embedding in text-independent speaker verification. In speaker verification models, frame-level features are fed into a pooling layer, or feature aggregation component, to obtain fixed-length utterance-level features. Our method exploits the correlation between frame-level features, so that dependencies among speaker-discriminative information are captured as weights, and outputs a fixed-length weighted mean feature. The proposed pooling mechanism is applied to the ECAPA-TDNN baseline architecture. Compared with Attentive Statistics Pooling on the same baseline, training on the VoxCeleb1-dev set and evaluating on the VoxCeleb1-test set shows that it reduces the equal error rate (EER) by 7.32% and the minimum normalized detection cost function (MinDCF, P_target = 10^-2) by 7.34%.
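The idea of turning frame-level correlations into pooling weights can be illustrated with a minimal NumPy sketch. This is not the paper's exact mechanism: the correlation measure (cosine similarity here), the score aggregation, and the softmax normalization are all illustrative assumptions; the function name and shapes are hypothetical.

```python
import numpy as np

def correlation_weighted_pooling(frames):
    """Illustrative correlation-based weighted mean pooling.

    frames: (T, D) array of frame-level features.
    Returns a fixed-length (D,) utterance-level embedding.
    """
    # Pairwise similarity between frames (T x T); cosine similarity
    # stands in for the paper's correlation measure (assumption).
    norms = np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8
    unit = frames / norms
    sim = unit @ unit.T                      # (T, T)

    # Per-frame score: average similarity to all other frames,
    # normalized to pooling weights with a softmax (assumption).
    scores = sim.mean(axis=1)                # (T,)
    w = np.exp(scores - scores.max())
    w /= w.sum()

    # Weighted mean over time yields a fixed-length embedding,
    # regardless of the number of frames T.
    return w @ frames                        # (D,)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 192))          # e.g. 200 frames, 192-dim features
emb = correlation_weighted_pooling(X)
print(emb.shape)                             # (192,)
```

Because the weights sum to one, the output stays on the same scale as a plain temporal mean while emphasizing frames that agree with the rest of the utterance.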
Citation:
R. Thevagumaran, T. Sivaneswaran and B. Karunarathne, "Enhanced Feature Aggregation for Deep Neural Network Based Speaker Embedding," 2022 Moratuwa Engineering Research Conference (MERCon), 2022, pp. 1-5, doi: 10.1109/MERCon55799.2022.9906175.