Abstract:
Sentence similarity is useful in many Natural Language Processing tasks, such as plagiarism checking and paraphrasing. So far, only conventional unsupervised techniques (knowledge-based, corpus-based, string similarity-based, and hybrid) have been used to measure sentence similarity for the Tamil and Sinhala languages. In this paper, we present a Deep Learning technique for measuring sentence similarity in these two languages. It uses a Siamese Neural Network consisting of two Long Short-Term Memory (LSTM) networks, with neural word embeddings as the input representation. Compared with the conventional unsupervised techniques, this approach achieved a 3.07% higher Pearson correlation coefficient on a dataset of 2500 Tamil sentence pairs and a 3.61% higher Pearson correlation coefficient on a dataset of 5000 Sinhala sentence pairs.
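As a rough illustration of the architecture summarized above, the following minimal sketch encodes two sentences with an LSTM and compares the final hidden states. Several details are assumptions and are not taken from the paper: the two branches share one set of weights (as is common for Siamese networks), similarity is computed as exp(-L1 distance) between the final hidden states, the weights are random rather than trained, and the names `TinyLSTM`, `siamese_similarity`, and `pearson` are hypothetical. The word embeddings are stand-in vectors, not real Tamil or Sinhala embeddings.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM encoder (hypothetical; weights are random, not trained)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(hidden_dim)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_dim for g in "ifoc"}

    def _gate(self, g, z, act):
        # Affine transform of the concatenated [x_t, h_{t-1}] followed by an activation.
        return [act(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(self.W[g], self.b[g])]

    def encode(self, vectors):
        """Run the LSTM over a list of word vectors; return the final hidden state."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in vectors:
            z = list(x) + h
            i = self._gate("i", z, sigmoid)
            f = self._gate("f", z, sigmoid)
            o = self._gate("o", z, sigmoid)
            g = self._gate("c", z, math.tanh)
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h


def siamese_similarity(encoder, sent_a, sent_b):
    """Encode both sentences with the SAME encoder (assumed weight sharing)."""
    h_a = encoder.encode(sent_a)
    h_b = encoder.encode(sent_b)
    l1 = sum(abs(a - b) for a, b in zip(h_a, h_b))
    # Assumed similarity function: exp(-L1), which lies in (0, 1].
    return math.exp(-l1)


def pearson(xs, ys):
    """Pearson correlation between predicted and gold similarity scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Stand-in 3-dimensional "word embeddings" for two sentences of different lengths.
encoder = TinyLSTM(input_dim=3, hidden_dim=4)
sent_a = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
sent_b = [[0.1, 0.2, 0.3], [0.5, 0.1, -0.2], [0.3, 0.3, 0.3]]
score = siamese_similarity(encoder, sent_a, sent_b)
print(score)
```

In a trained system of this kind, the encoder weights would be learned from annotated sentence pairs, and `pearson` would be applied to the predicted scores and the human-assigned similarity scores of a held-out set.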