Abstract:
Deep learning has gained widespread adoption in speaker identification by outperforming Gaussian mixture model (GMM) and i-vector systems. Neural network approaches have obtained promising results when fed raw speech samples directly. SincNet, a modified Convolutional Neural Network (CNN) architecture, is based on parameterized sinc functions, which offer a very compact way to derive a customized filter bank from short utterances. This paper proposes an attention-based Long Short-Term Memory (LSTM) architecture that encourages the discovery of more meaningful speaker-related features with minimal training data. The attention layer, built using neural networks, offers a unique and efficient representation of speaker characteristics by exploring the connection between an aspect and the content of short utterances. In the speaker identification experiments carried out, the proposed approach converges faster and performs better than SincNet.
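To make the "parameterized sinc functions" of the SincNet baseline concrete, the following is a minimal sketch of a single band-pass filter defined by only two parameters, a low and a high cutoff frequency. It follows the standard sinc band-pass form, g[n] = 2*f_high*sinc(2*pi*f_high*n) - 2*f_low*sinc(2*pi*f_low*n), smoothed with a Hamming window. The function names, the cutoff values, and the filter length are illustrative assumptions, not details taken from this paper, and the sketch does not represent the proposed attention-based LSTM model.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_bandpass(f_low, f_high, length):
    """Impulse response of a windowed sinc band-pass filter.

    f_low and f_high are cutoffs in cycles per sample (0 < f_low < f_high < 0.5).
    In a SincNet-style layer these two values would be the only learnable
    parameters of the filter; here they are fixed numbers (an assumption).
    """
    if length < 3 or length % 2 == 0:
        raise ValueError("length must be an odd number >= 3")
    half = length // 2
    taps = []
    for i in range(length):
        n = i - half  # centre the filter on n = 0
        # Difference of two ideal low-pass responses gives a band-pass response.
        ideal = (2 * f_high * sinc(2 * math.pi * f_high * n)
                 - 2 * f_low * sinc(2 * math.pi * f_low * n))
        # Hamming window reduces ripple caused by truncating the ideal filter.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
        taps.append(ideal * window)
    return taps


def magnitude_response(taps, freq):
    """Magnitude of the filter's frequency response at freq (cycles per sample)."""
    re = sum(t * math.cos(2 * math.pi * freq * k) for k, t in enumerate(taps))
    im = -sum(t * math.sin(2 * math.pi * freq * k) for k, t in enumerate(taps))
    return math.hypot(re, im)


# Illustrative values only: pass band from 0.1 to 0.2 cycles per sample.
taps = sinc_bandpass(0.1, 0.2, 101)
in_band = magnitude_response(taps, 0.15)
out_of_band = magnitude_response(taps, 0.02)
```

With these illustrative settings, `in_band` should be close to 1 and `out_of_band` close to 0, which shows how two cutoff values alone determine the shape of each filter in the bank.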
Citation:
S. Balakrishnan, K. Jathusan and U. Thayasivam, "End To End Model For Speaker Identification With Minimal Training Data," 2021 Moratuwa Engineering Research Conference (MERCon), 2021, pp. 456-461, doi: 10.1109/MERCon52712.2021.9525740.