Benchmarking OCR models for Sinhala and Tamil document digitization
dc.contributor.author | Velayuthan, P | |
dc.contributor.author | Ambegoda, TD | |
dc.contributor.editor | Gamage, JR | |
dc.contributor.editor | Nandasiri, GK | |
dc.contributor.editor | Mawathage, SA | |
dc.contributor.editor | Herath, RP | |
dc.date.accessioned | 2025-05-29T09:31:58Z | |
dc.date.issued | 2024 | |
dc.description.abstract | The digitization of documents in low-resource languages, such as Sinhala and Tamil, presents significant challenges due to the unique complexities of these scripts and the scarcity of high-quality training data. While traditional OCR systems have made strides in converting printed text to digital formats, they struggle with the intricate layouts and linguistic nuances of underrepresented languages. Recent advancements in Vision-Language Models (VLMs), like UDOP [1] and HRVDA [2], have integrated visual and textual data for improved document understanding. However, the application of these models to low-resource languages remains limited, leaving a gap in accurate document digitization. This research benchmarks several prominent OCR models, including Surya-OCR, TR-OCR [3], EasyOCR [4], and Tesseract OCR [5], focusing on their performance in digitizing documents in Sinhala and Tamil. We evaluate these models using key metrics, namely Character Error Rate [6] (CER), Word Error Rate [7] (WER), BLEU Score [8], METEOR [9], and Edit Distance [6] (ED), to determine the most effective solutions for low-resource languages. | |
dc.identifier.conference | ERU Symposium - 2024 | |
dc.identifier.department | Department of Computer Science & Engineering, University of Moratuwa | |
dc.identifier.doi | https://doi.org/10.31705/ERU.2024.7 | |
dc.identifier.email | purushothv@cse.mrt.ac.lk | |
dc.identifier.email | thanuja@cse.mrt.ac.lk | |
dc.identifier.faculty | Engineering | |
dc.identifier.issn | 3051-4894 | |
dc.identifier.pgnos | pp. 17-18 | |
dc.identifier.place | Sri Lanka | |
dc.identifier.proceeding | Proceedings of the ERU Symposium 2024 | |
dc.identifier.uri | https://dl.lib.uom.lk/handle/123/23574 | |
dc.language.iso | en | |
dc.publisher | Engineering Research Unit | |
dc.subject | Vision Language Models (VLMs) | |
dc.subject | Optical Character Recognition (OCR) | |
dc.subject | Benchmarks | |
dc.subject | Low-resource languages | |
dc.subject | Sinhala and Tamil | |
dc.title | Benchmarking OCR models for Sinhala and Tamil document digitization | |
dc.type | Conference-Extended-Abstract |
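
The abstract names Edit Distance, CER, and WER among its evaluation metrics but does not say how they were computed. The following is a minimal sketch of the standard definitions, not the authors' implementation: ED is taken as Levenshtein distance, CER as ED over characters divided by the reference length, and WER as the same ratio over whitespace-separated words. The function names are hypothetical. One assumption matters for Sinhala and Tamil in particular: Python strings are compared here by Unicode code point, so a consonant and its combining vowel sign count as separate characters; a benchmark could reasonably choose grapheme clusters instead, which would give different CER values.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance over code points / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: edit distance over whitespace-split words / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the OCR output contains many more tokens than the reference, since insertions count as errors.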