Abstract:
Sinhala and Tamil are the official languages of Sri Lanka. This requires
every government-related communication to be issued in both
languages. Although the demand for translation is high, the number of available
human translators is limited. One feasible way to boost productivity is to
assist human translators with machine translation output: the translators
post-edit the machine translation output rather than translating
from scratch. However, the Sinhala-Tamil pair does not have any well-performing
machine translation system. Therefore, the focus of this research is to develop a machine
translation system for short official government documents.
This thesis presents two main contributions towards building ‘Si-Ta’, the first domain-adapted
machine translation system for Sinhala-Tamil. The first contribution is building
the baseline translation system. The second is implementing data pre-processing
techniques to improve the translation quality of the baseline system.
The baseline system was built using Moses, a phrase-based statistical translation system.
This was the most feasible option given the available resources.
To improve the quality of the translation, three main approaches were explored. They
are: (a) domain adaptation, (b) integration of terminology, dictionary, and name lists,
and (c) addressing the out-of-vocabulary (OOV) problem using word-embedding-based
paraphrasing.
To adapt the system to the domain of official government documents, different
language model design techniques and a data filtering technique were tested.
Under terminology integration, experiments were carried out to evaluate the effect of
incorporating bilingual terminology lists into the system. Moreover, a novel data augmentation
technique was tested, which generates parallel data from bilingual lists and the
available parallel data. Further, open-domain dictionary entries, as well as a list of person
names and addresses, were integrated and evaluated. In addition, word-embedding-based
paraphrasing was used along with a novel heuristic-based filtering step to address the
out-of-vocabulary issue.
All of the above approaches improved over the baseline, apart from the
data filtering technique. Even so, all of these scores were above those of the existing
machine translation systems for this language pair. Although our techniques
were evaluated only on the Sinhala-Tamil pair, they can feasibly be applied to other
low-resource, highly inflectional language pairs.