Abstract:
This paper presents a comparative evaluation of
three state-of-the-art classifiers for Sinhala Part-of-Speech
(POS) tagging. POS taggers based on Support Vector Machines
(SVM), Hidden Markov Models (HMM) and Conditional Random
Fields (CRF) are trained and tested
using different combinations of a corpus of news articles and
a corpus of official government documents. Because CRF is
applied to Sinhala POS tagging for the first time, its best
feature set is derived experimentally. To further improve the
accuracy of POS tagging, a majority-voting ensemble tagger
is built from the three individual taggers. This ensemble
tagger achieved higher POS tagging accuracy than any
individual tagger. The two domains (news,
and official government documents) used in this study have
noticeable differences in writing style and vocabulary.
Building domain-specific POS taggers is time-consuming and
costly because domain-specific corpora must be created and
manually tagged, particularly for low-resource languages.
Therefore, this study also evaluates the feasibility and
effectiveness of using corpora from different domains in the
training and testing phases of the aforementioned machine
learning techniques.
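
The majority-voting ensemble mentioned above can be illustrated with a minimal sketch. It assumes each tagger outputs one tag per token for the same sentence. The function name `majority_vote`, the `fallback_index` tie-breaking rule and the tag labels are hypothetical and are not taken from the paper, which does not state in this abstract how disagreements among all three taggers are resolved.

```python
from collections import Counter


def majority_vote(predictions, fallback_index=0):
    """Combine per-token tags from several taggers by majority vote.

    predictions: list of tag sequences, one per tagger, all the same length.
    fallback_index: which tagger to trust when no tag has a strict majority
                    (an assumed tie-breaking rule, not the paper's).
    """
    if not predictions:
        raise ValueError("at least one tagger output is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must tag the same number of tokens")

    voted = []
    for tags in zip(*predictions):
        # Most frequent tag proposed for this token and its vote count.
        tag, count = Counter(tags).most_common(1)[0]
        if count > len(tags) / 2:
            voted.append(tag)
        else:
            # No strict majority: fall back to the designated tagger.
            voted.append(tags[fallback_index])
    return voted


# Hypothetical outputs for a three-token sentence (placeholder tag labels).
svm_tags = ["NN", "VB", "JJ"]
hmm_tags = ["NN", "NN", "RB"]
crf_tags = ["NN", "VB", "DT"]

ensemble_tags = majority_vote([svm_tags, hmm_tags, crf_tags])
print(ensemble_tags)
```

With three taggers, a strict majority exists whenever at least two agree, so the fallback only applies when all three propose different tags for a token.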
Citation:
S. Fernando and S. Ranathunga, "Evaluation of Different Classifiers for Sinhala POS Tagging," 2018 Moratuwa Engineering Research Conference (MERCon), 2018, pp. 96-101, doi: 10.1109/MERCon.2018.8421997.