Abstract:
Spoken language understanding (SLU) has several applications, such as topic identification and intent detection. One of the primary underlying components in SLU studies is automatic speech recognition (ASR). In recent years, ASR systems have improved markedly at recognizing spoken utterances, but the task remains challenging for low-resource languages, since training an ASR model requires hundreds of hours of audio. Recent studies have used transfer learning techniques to overcome this limitation. However, the errors produced by ASR models significantly affect the downstream natural language understanding (NLU) models used for intent or topic identification. In this work, we propose a multi-ASR setup to address this problem. We show that combining outputs from multiple ASR models yields significantly higher accuracy on low-resource speech-command transfer-learning tasks than using the output of a single ASR model. We present CNN-based setups that utilize outputs from pre-trained ASR models such as DeepSpeech2 and Wav2Vec 2.0. Experimental results show an 8% increase in accuracy over the current state-of-the-art phoneme-based speech intent classification methodology for low-resource speech commands.
Citation:
I. Mohamed and U. Thayasivam, "Low Resource Multi-ASR Speech Command Recognition," 2022 Moratuwa Engineering Research Conference (MERCon), 2022, pp. 1-6, doi: 10.1109/MERCon55799.2022.9906230.
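As an illustration of the multi-ASR idea summarized in the abstract, below is a minimal PyTorch sketch of a two-branch CNN intent classifier that fuses tokenized transcriptions from two ASR models. The class name, hyperparameters, and the assumption that the DeepSpeech2 and Wav2Vec 2.0 outputs have already been converted to phoneme/character id sequences are illustrative only; this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class MultiASRIntentClassifier(nn.Module):
    """Hypothetical two-branch CNN over transcriptions from two ASR models.

    Each branch embeds a token (phoneme/character id) sequence produced by one
    ASR model, applies 1-D convolutions with max-over-time pooling, and the
    pooled features are concatenated before the final intent classifier.
    """

    def __init__(self, vocab_size: int, num_intents: int,
                 embed_dim: int = 64, num_filters: int = 128,
                 kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One stack of 1-D convolutions per ASR branch.
        self.branch_a = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.branch_b = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        fused_dim = 2 * num_filters * len(kernel_sizes)
        self.classifier = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(fused_dim, num_intents),
        )

    @staticmethod
    def _encode(convs, x):
        # x: (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len)
        x = x.transpose(1, 2)
        # Max-over-time pooling for each kernel size, then concatenate.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in convs]
        return torch.cat(pooled, dim=1)

    def forward(self, tokens_a, tokens_b):
        # tokens_a / tokens_b: padded token-id batches from ASR model A and B
        # (e.g. DeepSpeech2 and Wav2Vec 2.0 transcriptions, tokenized upstream).
        feat_a = self._encode(self.branch_a, self.embedding(tokens_a))
        feat_b = self._encode(self.branch_b, self.embedding(tokens_b))
        return self.classifier(torch.cat([feat_a, feat_b], dim=1))


if __name__ == "__main__":
    model = MultiASRIntentClassifier(vocab_size=50, num_intents=6)
    # Dummy batches standing in for the two tokenized ASR outputs.
    tokens_a = torch.randint(1, 50, (4, 20))
    tokens_b = torch.randint(1, 50, (4, 25))
    logits = model(tokens_a, tokens_b)
    print(logits.shape)  # torch.Size([4, 6])
```

The two branches are kept separate so that each can learn filters matched to the error patterns of its own ASR model; simple feature concatenation is only one plausible fusion choice among several.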