Abstract:
An analysis of existing machine translation approaches for Sinhala-Tamil official government
documents has revealed shortcomings in translating named entities.
The diverse nature of the domain, the lack of resources, and
morphological complexity are the key reasons for this problem. Our research focuses
on identifying and translating named entities in official government documents between
Tamil and Sinhala in order to improve translation performance. We present a novel tag
set specific to official government documents and also propose a graph-based
semi-supervised approach that works better than state-of-the-art approaches for
low-resource settings. Using this approach, we built annotated corpora of over
200K words each for Tamil and Sinhala in a cost-effective manner from a small
amount of seed data.
We also implemented a deep-learning approach to named entity recognition
that gave the best output on the completed corpus. Since the deep-learning approach
was a generic solution for sequential tagging, we also employed it to build
a Part-of-Speech tagger that outperforms existing systems. The University of
Moratuwa already has a system for translating official government documents
called SiTa. Finally, we incorporated the aforementioned models into a module
that translates named entities and integrated it into SiTa. We empirically show
that our modules improve over the baseline for Tamil → Sinhala and Sinhala →
Tamil translation tasks by up to 0.5 and 1.4 BLEU points, respectively.