Abstract:
The availability of high-quality parallel data is a major
requirement for building a reasonably well-performing statistical
machine translation (SMT) system. Thus, developing a decent
SMT system for a low-resource language pair such as Sinhala and
Tamil, which lacks a large parallel corpus, is rather
challenging. Past research on other language pairs has
shown that different terminology / bilingual list integration
methodologies can improve the quality of SMT
systems, particularly for domain-specific SMT. In this paper, we
explore whether this approach is effective for Sinhala-Tamil machine
translation in the domain of official government documents. We
evaluate the impact of three types of bilingual lists, namely, a list
of government organizations and official designations, a glossary
related to government administration and operations, and a
general bilingual dictionary, using four different
methodologies (three static and one dynamic). Of the four, one
methodology gave notable improvements over the baseline for all
three types of list.