Abstract:
Sinhala is a low-resource Indo-Aryan language used by approximately 16 million people, mainly in Sri Lanka. Because of the complexity of the Sinhala language, detecting spelling errors is difficult. A real-word error occurs when a word is in the vocabulary but is not valid in the context in which it appears. Checking a sentence for real-word errors is more difficult than checking for non-word errors, in which the misspelled word is not in the vocabulary. We present the implementation of a neural-network-based system for identifying real-word errors and non-word errors in Sinhala. We prepared a candidate list of real-word errors, then selected a suitable model and trained it on several different datasets. This paper thus sets a new baseline for the detection and correction of real-word errors in Sinhala documents. Our product, source code, candidate error list, training datasets, and evaluation dataset are publicly released.
Citation:
P. Sudesh, D. Dashintha, R. Lakshan and G. Dias, "Erroff: A Tool to Identify and Correct Real-word Errors in Sinhala Documents," 2022 Moratuwa Engineering Research Conference (MERCon), 2022, pp. 1-6, doi: 10.1109/MERCon55799.2022.9906294.