Benchmarking OCR models for Sinhala and Tamil document digitization

Velayuthan, P; Ambegoda, TD

dc.contributor.author	Velayuthan, P
dc.contributor.author	Ambegoda, TD
dc.contributor.editor	Gamage, JR
dc.contributor.editor	Nandasiri, GK
dc.contributor.editor	Mawathage, SA
dc.contributor.editor	Herath, RP
dc.date.accessioned	2025-05-29T09:31:58Z
dc.date.issued	2024
dc.description.abstract	The digitization of documents in low-resource languages, such as Sinhala and Tamil, presents significant challenges due to the unique complexities of these scripts and the scarcity of high-quality training data. While traditional OCR systems have made strides in converting printed text to digital formats, they struggle with the intricate layouts and linguistic nuances of underrepresented languages. Recent advancements in Vision-Language Models (VLMs), like UDOP [1] and HRVDA [2], have integrated visual and textual data for improved document understanding. However, the application of these models to low-resource languages remains limited, leaving a gap in accurate document digitization. This research benchmarks several prominent OCR models, including Surya-OCR, TR-OCR [3], EasyOCR [4], and Tesseract OCR [5], focusing on their performance in digitizing documents in Sinhala and Tamil. We evaluate these models using key metrics, namely Character Error Rate (CER) [6], Word Error Rate (WER) [7], BLEU Score [8], METEOR [9], and Edit Distance (ED) [6], to determine the most effective solutions for low-resource languages.
dc.identifier.conference	ERU Symposium - 2024
dc.identifier.department	Department of Computer Science & Engineering, University of Moratuwa
dc.identifier.doi	https://doi.org/10.31705/ERU.2024.7
dc.identifier.email	purushothv@cse.mrt.ac.lk
dc.identifier.email	thanuja@cse.mrt.ac.lk
dc.identifier.faculty	Engineering
dc.identifier.issn	3051-4894
dc.identifier.pgnos	pp. 17-18
dc.identifier.place	Sri Lanka
dc.identifier.proceeding	Proceedings of the ERU Symposium 2024
dc.identifier.uri	https://dl.lib.uom.lk/handle/123/23574
dc.language.iso	en
dc.publisher	Engineering Research Unit
dc.subject	Vision Language Models (VLMs)
dc.subject	Optical Character Recognition (OCR)
dc.subject	Benchmarks
dc.subject	Low-resource languages
dc.subject	Sinhala and Tamil
dc.title	Benchmarking OCR models for Sinhala and Tamil document digitization
dc.type	Conference-Extended-Abstract
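
The abstract names Character Error Rate, Word Error Rate, and Edit Distance among its evaluation metrics but this record does not say how they were computed. The following is a minimal sketch of the usual definitions, not the authors' implementation: edit distance is taken as Levenshtein distance with unit costs, CER divides it by the number of reference characters, and WER divides the word-level distance by the number of reference words. The function names, the NFC normalization step, and the choice to count Unicode code points rather than grapheme clusters are all assumptions made for illustration; because Sinhala and Tamil write vowel signs as combining marks, a grapheme-level count could give different CER values.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    Works on strings (character level) or lists of words (word level).
    """
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate, counted over Unicode code points (assumption)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word Error Rate, with words split on whitespace (assumption)."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Both rates can exceed 1.0 when the OCR output contains many more insertions than the reference has characters or words, so they are error ratios rather than bounded percentages.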

Files

Original bundle

Name: 7.Benchmarking OCRModelsforSinhalaandTamil.pdf
Size: 178.83 KB
Format: Adobe Portable Document Format

Collections

ERU - 2024