IYAL: input text normalizer for Tamil language

Premkumar, S; Sivarasa, N; Yogendrarajah, S; Dias, G; Sarveswaran, K

Files

1571154600.pdf (1.58 MB)

Date

2025

Publisher

IEEE

Abstract

Natural Language Processing (NLP) systems have advanced significantly for high-resource languages, yet they still face major challenges when applied to low-resource languages such as Tamil. Tamil’s rich linguistic structure, dialectal differences that have not been adequately studied or annotated, diverse input methods, and legacy encodings pose distinctive preprocessing challenges that make Tamil language processing particularly difficult. In this paper, we present IYAL, an input text normalizer that addresses two major gaps in Tamil NLP: the lack of robust encoding normalization and the limited handling of informal or colloquial input variations. We propose a modular pipeline that standardizes text from various input methods to Unicode, detects and converts legacy encodings and fonts, and optionally transforms colloquial Tamil into its literary equivalent to facilitate preprocessing for downstream tasks. Through empirical evaluation, we demonstrate the system’s effectiveness in handling real-world noisy text, offering a scalable solution for Tamil and potentially for other languages in the region that face the same challenges.
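
The modular pipeline described in the abstract can be pictured roughly as a chain of independent text-to-text stages. The sketch below is illustrative only and is not the IYAL implementation: all function names, the two-entry legacy table, the "no Tamil code points" detection heuristic, and the one-entry colloquial lexicon are assumptions made here for the example. In particular, the dictionary lookup merely stands in for whatever colloquial-to-literary method the paper actually uses, and real legacy font conversion also needs glyph reordering that this toy table ignores.

```python
import unicodedata
from typing import Callable, Dict, List

Stage = Callable[[str], str]

# Toy placeholder table (assumption): not a complete or verified font mapping.
TOY_LEGACY_TABLE: Dict[str, str] = {"f": "\u0b95", "k": "\u0bae"}

# Toy lexicon (assumption): a single colloquial -> literary pair.
TOY_LEXICON: Dict[str, str] = {"இருக்கு": "இருக்கிறது"}


def has_tamil(text: str) -> bool:
    """True if any character falls in the Unicode Tamil block (U+0B80..U+0BFF)."""
    return any("\u0b80" <= ch <= "\u0bff" for ch in text)


def make_legacy_converter(table: Dict[str, str]) -> Stage:
    """Build a stage that maps legacy-font characters to Unicode via a table."""
    def convert(text: str) -> str:
        # Crude detection heuristic (assumption): text that already contains
        # Tamil code points is treated as Unicode and passed through unchanged.
        if has_tamil(text):
            return text
        return "".join(table.get(ch, ch) for ch in text)
    return convert


def normalize_unicode(text: str) -> str:
    """Apply NFC so canonically equivalent sequences share one representation."""
    return unicodedata.normalize("NFC", text)


def make_literary_converter(lexicon: Dict[str, str]) -> Stage:
    """Build an optional stage that replaces whole words found in the lexicon."""
    def convert(text: str) -> str:
        # Splitting on whitespace collapses runs of spaces; fine for a sketch.
        return " ".join(lexicon.get(word, word) for word in text.split())
    return convert


def build_pipeline(stages: List[Stage]) -> Stage:
    """Compose stages left to right into a single callable."""
    def run(text: str) -> str:
        for stage in stages:
            text = stage(text)
        return text
    return run


pipeline = build_pipeline([
    make_legacy_converter(TOY_LEGACY_TABLE),
    normalize_unicode,
    make_literary_converter(TOY_LEXICON),  # omit this stage to skip style conversion
])
```

Because each stage is a plain function from string to string, the optional colloquial-to-literary step can be left out of the list without touching the encoding stages, which is the property the word "modular" suggests.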

Keywords

Tamil NLP, Text Normalization, Character Encoding, Fonts, Style Transfer, Unicode Processing, Natural Language Processing, Low-resource Languages

URI

https://dl.lib.uom.lk/handle/123/24497

Collections

MERCon - 2025
