Content extraction from PDF invoices on business document archives

dc.contributor.advisor: Perera I
dc.contributor.author: Bandara RMCV
dc.date.accept: 2020
dc.date.accessioned: 2020
dc.date.available: 2020
dc.date.issued: 2020
dc.description.abstract: [...] organization better control over their information processes. As a business expands, it produces more documents, which must be carefully handled and tracked to be put to good use. Output management systems that work with ERP systems contain thousands of business documents, and Portable Document Format (PDF) is the common output format for these documents. These systems need to execute document search operations frequently, so indexing PDF documents is critical in this context: it boosts search engine efficiency by reducing the search space. Content extraction from PDF documents goes a step further and allows more structured search queries. Extracting the content of a PDF file is therefore very important, but it is also very challenging, because PDF is a layout-based format that defines the fonts and positions of individual characters rather than the semantic units of the text and their roles within the document. In this research, I developed a technique to extract content from a PDF file. It can be used to allow more structured search queries on large document archives in output management systems that typically work with world-leading ERP systems. The research mainly considered four aspects: correctly identifying words, word order within a paragraph, clear separation of paragraph boundaries, and the semantic role of each word. After the content is extracted from the PDF file, the extracted text is written to an XML document. The XML file contains tags that identify the pages, the rotation angle, and the number of images on each page. To evaluate the accuracy of the technique, content was extracted from a sample set of PDF invoices and the percentage of extracted words was calculated. According to the results, the tool achieves an accuracy rate of 94.27%. [en_US]
dc.identifier.accno: TH4255 [en_US]
dc.identifier.degree: MSc in Computer Science [en_US]
dc.identifier.department: Department of Computer Science & Engineering [en_US]
dc.identifier.faculty: Engineering [en_US]
dc.identifier.uri: http://dl.lib.uom.lk/handle/123/16780
dc.language.iso: en [en_US]
dc.subject: COMPUTER SCIENCE AND ENGINEERING-Dissertations [en_US]
dc.subject: COMPUTER SCIENCE-Dissertations [en_US]
dc.subject: DATA PROCESSING, BUSINESS [en_US]
dc.subject: BUSINESS COMMUNICATION-Portable Document Format [en_US]
dc.subject: AUTOMATIC CONTENT EXTRACTION [en_US]
dc.title: Content extraction from PDF invoices on business document archives [en_US]
dc.type: Thesis-Full-text [en_US]

Files

Original bundle

Name: TH4255-1.pdf
Size: 119.48 KB
Format: Adobe Portable Document Format
Description: Pre-text

Name: TH4255-2.pdf
Size: 106.29 KB
Format: Adobe Portable Document Format
Description: Post-text

Name: TH4255.pdf
Size: 2.16 MB
Format: Adobe Portable Document Format
Description: Full-thesis