Abstract:
The availability of high-quality parallel data is a major
requirement for building a reasonably well-performing statistical
machine translation (SMT) system. Thus, developing a decent
SMT system for a low-resource language pair such as Sinhala and
Tamil, which lacks a large parallel corpus, is rather
challenging. Past research on other language pairs has
shown that various terminology and bilingual list integration
methodologies can improve the quality of SMT systems,
particularly for domain-specific SMT. In this paper, we
explore whether such integration is effective for Sinhala-Tamil
machine translation in the domain of official government documents. We
evaluate the impact of three types of bilingual lists, namely, a list
of government organizations and official designations, a glossary
related to government administration and operations, and a
general bilingual dictionary, using four different integration
methodologies (three static and one dynamic). Of the four, one
methodology gave notable improvements over the baseline for all
three types of list.
Citation:
F. Farhath, S. Ranathunga, S. Jayasena and G. Dias, "Integration of Bilingual Lists for Domain-Specific Statistical Machine Translation for Sinhala-Tamil," 2018 Moratuwa Engineering Research Conference (MERCon), 2018, pp. 538-543, doi: 10.1109/MERCon.2018.8421901.