Institutional-Repository, University of Moratuwa.  

Sinhala-Tamil statistical machine translation (SMT) for official documents

dc.contributor.advisor Ranathunga, S
dc.contributor.advisor Jayasena, S
dc.contributor.author Farhath, FF
dc.date.accessioned 2018
dc.date.available 2018
dc.date.issued 2018
dc.identifier.uri http://dl.lib.mrt.ac.lk/handle/123/15814
dc.description.abstract Sinhala and Tamil are the official languages of Sri Lanka. This requires every government-related dissemination or communication to be made in both languages. Although the demand for translation is high, the number of available human translators is limited. One feasible way to boost productivity is to assist human translators with machine translation output: translators post-edit the machine output rather than translating from scratch. However, the Sinhala-Tamil pair does not have any well-performing machine translation system. Therefore, the focus of this research is to develop a machine translation system for short official government documents. This thesis presents two main contributions towards building ‘Si-Ta’, the first domain-adapted machine translation system for Sinhala-Tamil. The first contribution is building the baseline translation system. The second is implementing data pre-processing techniques to improve the translation quality of the baseline system. The baseline system was built using Moses, a phrase-based statistical translation system; this was the feasible option given the available resources. To improve translation quality, three main approaches were explored: (a) domain adaptation, (b) integration of terminology, dictionary, and name lists, and (c) addressing the out-of-vocabulary (OOV) problem using word-embedding-based paraphrasing. To adapt the system to the domain of official government documents, different language model design techniques and a data filtration technique were tested. Under terminology integration, experiments were carried out to evaluate the effect of incorporating bilingual terminology lists into the system. Moreover, a novel data augmentation technique was tested that generates parallel data from bilingual lists and the available parallel data.
Further, open-domain dictionary entries, as well as a list of person names and addresses, were integrated and evaluated. In addition, word-embedding-based paraphrasing was used along with a novel heuristic-based filtering technique to address the out-of-vocabulary issue. All of the above approaches gave an improvement over the baseline, apart from the data filtering technique. Even so, all the resulting scores were above those of the already available machine translation systems for this language pair. Although our techniques were evaluated only on the Sinhala-Tamil pair, they can feasibly be applied to other low-resource, highly inflectional language pairs. en_US
dc.language.iso en en_US
dc.subject COMPUTER SCIENCE AND ENGINEERING-Dissertations en_US
dc.subject MACHINE TRANSLATION SYSTEMS en_US
dc.subject SINHALA LANGUAGE-Translation en_US
dc.subject TAMIL LANGUAGE-Translation en_US
dc.subject STATISTICAL MACHINE TRANSLATION en_US
dc.title Sinhala-Tamil statistical machine translation (SMT) for official documents en_US
dc.type Thesis-Full-text en_US
dc.identifier.faculty Engineering en_US
dc.identifier.degree Master of Philosophy en_US
dc.identifier.department Department of Computer Science & Engineering en_US
dc.date.accept 2018
dc.identifier.accno TH3871 en_US

