IYAL: input text normalizer for Tamil language

dc.contributor.author: Premkumar, S
dc.contributor.author: Sivarasa, N
dc.contributor.author: Yogendrarajah, S
dc.contributor.author: Dias, G
dc.contributor.author: Sarveswaran, K
dc.date.accessioned: 2025-12-05T04:13:28Z
dc.date.issued: 2025
dc.description.abstract: Natural Language Processing (NLP) systems have advanced significantly for high-resource languages, yet they continue to face major challenges when applied to low-resource languages such as Tamil. Tamil’s rich linguistic structure, dialectal differences that are not properly studied and annotated, diverse input methods, and legacy encodings pose unique preprocessing challenges, making Tamil language processing particularly difficult. In this paper, we present IYAL, an input text normalizer that addresses two major gaps in Tamil NLP: the lack of robust encoding normalization and the limited handling of informal or colloquial input variations. We propose a modular pipeline that handles various input methods to standardize input to Unicode, detects and converts legacy encodings and fonts, and optionally transforms colloquial Tamil into its literary equivalent to facilitate preprocessing for downstream tasks. Through empirical evaluation, we demonstrate the system’s effectiveness in handling real-world noisy text inputs, offering a scalable solution for Tamil and potentially many other languages in the region that face the same set of challenges.
dc.identifier.conference: Moratuwa Engineering Research Conference 2025
dc.identifier.department: Engineering Research Unit, University of Moratuwa
dc.identifier.email: sanujen.20@cse.mrt.ac.lk
dc.identifier.email: nisanthan.20@cse.mrt.ac.lk
dc.identifier.email: sathveegan.20@cse.mrt.ac.lk
dc.identifier.email: gihan@uom.lk
dc.identifier.email: sarves@univ.jfn.ac.lk
dc.identifier.faculty: Engineering
dc.identifier.isbn: 979-8-3315-6724-8
dc.identifier.pgnos: pp. 852-857
dc.identifier.proceeding: Proceedings of Moratuwa Engineering Research Conference 2025
dc.identifier.uri: https://dl.lib.uom.lk/handle/123/24497
dc.language.iso: en
dc.publisher: IEEE
dc.subject: Tamil NLP
dc.subject: Text Normalization
dc.subject: Character Encoding
dc.subject: Fonts
dc.subject: Style Transfer
dc.subject: Unicode Processing
dc.subject: Natural Language Processing
dc.subject: Low-resource Languages
dc.title: IYAL: input text normalizer for Tamil language
dc.type: Conference-Full-text
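
The abstract describes a modular pipeline with three stages: standardizing input to Unicode, detecting and converting legacy encodings, and optionally mapping colloquial Tamil to its literary form. The sketch below illustrates how such a chain of stages could be composed; it is not the IYAL implementation. All names (`to_nfc`, `looks_like_legacy`, `make_legacy_converter`, `make_style_mapper`, `normalize`) are hypothetical, the legacy detector is a crude placeholder heuristic, and the conversion table and lexicon are caller-supplied assumptions rather than real font mappings or data from the paper.

```python
import unicodedata
from typing import Callable, Dict, List

# A stage takes text and returns transformed text.
Stage = Callable[[str], str]

# Code points of the Unicode Tamil block (U+0B80..U+0BFF).
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def to_nfc(text: str) -> str:
    """Unicode stage: compose canonically equivalent sequences (NFC)."""
    return unicodedata.normalize("NFC", text)


def looks_like_legacy(text: str) -> bool:
    """Placeholder heuristic (assumption): text with letters but no
    Tamil-block letters *might* be legacy-font output. This also flags
    plain English, so a real detector would need far more evidence."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and not any(ord(ch) in TAMIL_BLOCK for ch in letters)


def make_legacy_converter(table: Dict[str, str]) -> Stage:
    """Build a stage that maps legacy characters through a caller-supplied
    table. The table is hypothetical; no real font mapping is assumed."""
    def convert(text: str) -> str:
        if not looks_like_legacy(text):
            return text
        return "".join(table.get(ch, ch) for ch in text)
    return convert


def make_style_mapper(lexicon: Dict[str, str]) -> Stage:
    """Build an optional stage that replaces whitespace-separated colloquial
    tokens with literary forms from a caller-supplied lexicon."""
    def convert(text: str) -> str:
        return " ".join(lexicon.get(tok, tok) for tok in text.split())
    return convert


def normalize(text: str, stages: List[Stage]) -> str:
    """Run the text through each stage in order."""
    for stage in stages:
        text = stage(text)
    return text


if __name__ == "__main__":
    # Illustrative one-entry lexicon (assumption, not data from the paper).
    lexicon = {to_nfc("வந்துட்டேன்"): to_nfc("வந்துவிட்டேன்")}
    stages = [make_legacy_converter({}), to_nfc, make_style_mapper(lexicon)]
    print(normalize("நான் வந்துட்டேன்", stages))
```

Keeping each stage as a plain function makes the colloquial-to-literary step easy to leave out, which matches the "optionally" in the abstract: omit `make_style_mapper(...)` from the `stages` list.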

Files

Original bundle

Name: 1571154600.pdf
Size: 1.58 MB
Format: Adobe Portable Document Format
