Benchmarking OCR models for Sinhala and Tamil document digitization

dc.contributor.author Velayuthan, P
dc.contributor.author Ambegoda, TD
dc.contributor.editor Gamage, JR
dc.contributor.editor Nandasiri, GK
dc.contributor.editor Mawathage, SA
dc.contributor.editor Herath, RP
dc.date.accessioned 2025-05-29T09:31:58Z
dc.date.issued 2024
dc.description.abstract The digitization of documents in low-resource languages, such as Sinhala and Tamil, presents significant challenges due to the unique complexities of these scripts and the scarcity of high-quality training data. While traditional OCR systems have made strides in converting printed text to digital formats, they struggle with the intricate layouts and linguistic nuances of underrepresented languages. Recent advancements in Vision-Language Models (VLMs), like UDOP [1] and HRVDA [2], have integrated visual and textual data for improved document understanding. However, the application of these models to low-resource languages remains limited, leaving a gap in accurate document digitization. This research benchmarks several prominent OCR models, including Surya-OCR, TR-OCR [3], EasyOCR [4], and Tesseract OCR [5], focusing on their performance in digitizing documents in Sinhala and Tamil. We evaluate these models using key metrics, namely Character Error Rate [6] (CER), Word Error Rate [7] (WER), BLEU Score [8], METEOR [9], and Edit Distance [6] (ED), to determine the most effective solutions for low-resource languages.
dc.identifier.conference ERU Symposium - 2024
dc.identifier.department Department of Computer Science & Engineering, University of Moratuwa
dc.identifier.doi https://doi.org/10.31705/ERU.2024.7
dc.identifier.email purushothv@cse.mrt.ac.lk
dc.identifier.email thanuja@cse.mrt.ac.lk
dc.identifier.faculty Engineering
dc.identifier.issn 3051-4894
dc.identifier.pgnos pp. 17-18
dc.identifier.place Sri Lanka
dc.identifier.proceeding Proceedings of the ERU Symposium 2024
dc.identifier.uri https://dl.lib.uom.lk/handle/123/23574
dc.language.iso en
dc.publisher Engineering Research Unit
dc.subject Vision Language Models (VLMs)
dc.subject Optical Character Recognition (OCR)
dc.subject Benchmarks
dc.subject Low-resource languages
dc.subject Sinhala and Tamil
dc.title Benchmarking OCR models for Sinhala and Tamil document digitization
dc.type Conference-Extended-Abstract
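
Illustrative note on the metrics named in the abstract: the record does not describe how the authors computed CER, WER, or Edit Distance, so the following is only a minimal sketch of the standard definitions, not the paper's implementation. The function names are hypothetical. It assumes CER is the Levenshtein distance over characters divided by the reference length, and WER is the same ratio over whitespace-separated words. It also assumes characters are counted as Unicode code points, which is how Python iterates strings; Sinhala and Tamil letters often span several code points, so a grapheme-cluster segmentation could give different values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty prefix of hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length (code points)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the OCR output is much longer than the reference, since insertions count as errors.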

Files

Original bundle

Name:
7.Benchmarking OCRModelsforSinhalaandTamil.pdf
Size:
178.83 KB
Format:
Adobe Portable Document Format

License bundle

Name:
license.txt
Size:
1.71 KB
Format:
Item-specific license agreed upon to submission

Collections