IYAL: input text normalizer for Tamil language
Date
2025
Publisher
IEEE
Abstract
Natural Language Processing (NLP) systems have advanced significantly for high-resource languages, yet they still face major challenges when applied to low-resource languages such as Tamil. Tamil’s rich linguistic structure, dialectal variation that has not been properly studied or annotated, diverse input methods, and legacy encodings pose distinct preprocessing challenges. In this paper, we present IYAL, an input text normalizer that addresses two major gaps in Tamil NLP: the lack of robust encoding normalization and the limited handling of informal or colloquial input variations. We propose a modular pipeline that standardizes text from various input methods to Unicode, detects and converts legacy encodings and fonts, and optionally transforms colloquial Tamil into its literary equivalent to facilitate preprocessing for downstream tasks. Through empirical evaluation, we demonstrate the system’s effectiveness in handling real-world noisy text, offering a scalable solution for Tamil and potentially for other languages in the region that face the same challenges.
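
As a rough illustration of the modular pipeline described in the abstract, the following minimal Python sketch chains three stages: legacy-encoding conversion, Unicode standardization, and optional colloquial-to-literary rewriting. It is not the IYAL implementation. The names `TOY_LEGACY_MAP`, `TOY_COLLOQUIAL_LEXICON`, `looks_legacy`, and `normalize` are hypothetical, the legacy table is a placeholder rather than a real font mapping, and the detection heuristic is deliberately crude. Only the Unicode stage relies on standard behaviour (NFC composition via Python's `unicodedata`).

```python
import unicodedata
from typing import Callable, Dict, List

Stage = Callable[[str], str]

# ASSUMPTION: placeholder table, NOT a real legacy-font mapping
# (actual tables for legacy Tamil fonts are far larger and context-sensitive).
TOY_LEGACY_MAP: Dict[str, str] = {"f": "\u0b95", "k": "\u0bae"}

# ASSUMPTION: one-entry illustrative lexicon (colloquial -> literary).
TOY_COLLOQUIAL_LEXICON: Dict[str, str] = {"இருக்கு": "இருக்கிறது"}


def looks_legacy(text: str) -> bool:
    """Crude heuristic: every non-space character is a key of the toy table."""
    chars = [ch for ch in text if not ch.isspace()]
    return bool(chars) and all(ch in TOY_LEGACY_MAP for ch in chars)


def convert_legacy(text: str) -> str:
    """Map legacy-font characters to Unicode Tamil when the text looks legacy."""
    if not looks_legacy(text):
        return text
    return "".join(TOY_LEGACY_MAP.get(ch, ch) for ch in text)


def normalize_unicode(text: str) -> str:
    """Compose canonically equivalent sequences into a single form (NFC)."""
    return unicodedata.normalize("NFC", text)


def to_literary(text: str) -> str:
    """Token-level lexicon lookup; note that it collapses runs of whitespace."""
    return " ".join(TOY_COLLOQUIAL_LEXICON.get(tok, tok) for tok in text.split())


def normalize(text: str, literary: bool = False) -> str:
    """Run the stages in order; the literary stage is opt-in."""
    stages: List[Stage] = [convert_legacy, normalize_unicode]
    if literary:
        stages.append(to_literary)
    for stage in stages:
        text = stage(text)
    return text
```

Keeping each stage as a plain `str -> str` function means stages can be reordered, replaced, or switched off independently, which is one simple way to realize the optional colloquial-to-literary step mentioned above.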
