Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus

Hameed, RA; Pathirennehelage, N; Ihalapathirana, A; Mohamed, MZ; Ranathunga, VSD; Jayasena, S; Dias, G; Fernando, S

dc.contributor.author	Hameed, RA
dc.contributor.author	Pathirennehelage, N
dc.contributor.author	Ihalapathirana, A
dc.contributor.author	Mohamed, MZ
dc.contributor.author	Ranathunga, VSD
dc.contributor.author	Jayasena, S
dc.contributor.author	Dias, G
dc.contributor.author	Fernando, S
dc.date.accessioned	2017-01-16T04:01:11Z
dc.date.available	2017-01-16T04:01:11Z
dc.identifier.uri	http://dl.lib.mrt.ac.lk/handle/123/12221
dc.description.abstract	A sentence-aligned parallel corpus is an important prerequisite for statistical machine translation. However, creating such a corpus manually is time-consuming and requires experts fluent in both languages. Automatically creating a sentence-aligned parallel corpus from parallel text addresses this problem. In this paper, we present the first empirical evaluation carried out to identify the best method for automatically creating a sentence-aligned Sinhala-Tamil parallel corpus. Annual reports from Sri Lankan government institutions were used as the parallel text for alignment. Although both Sinhala and Tamil are under-resourced languages, we achieved an F-score of 0.791 using a hybrid approach that makes use of a bilingual dictionary.	en_US
dc.relation.uri	http://www.aclweb.org/anthology/W/W16/W16-37.pdf	en_US
dc.source.uri	http://www.aclweb.org/anthology/W/W16/W16-37.pdf	en_US
dc.title	Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus	en_US
dc.type	Article-Abstract	en_US
dc.identifier.year	2016	en_US
dc.identifier.journal	WSSANLP	en_US
dc.identifier.pgnos	124	en_US
dc.identifier.email	gihan@uom.lk	en_US