Advancing retrieval-augmented generation for financial question answering

dc.contributor.author: Jasmin, AA
dc.contributor.author: Perera, I
dc.contributor.author: Mohamed, M
dc.contributor.author: Mushraf, M
dc.date.accessioned: 2025-12-09T05:45:31Z
dc.date.issued: 2025
dc.description.abstract: Retrieval-Augmented Generation (RAG) systems show promise for financial question answering, yet high accuracy on benchmarks such as FinanceBench (19% baseline, 32% updated) remains challenging [1] [8]. This paper presents a systematic, multistage approach to improving the performance of the RAG pipeline for financial QA. We first established a robust curated baseline using Gemini-2.0, the Docling parser, Google's text-embedding-004, and a vector database, achieving an initial accuracy of 43%. Architectural and component-wise optimizations were then implemented iteratively. First, a metadata filtering strategy, which uses a fine-tuned NER model to extract company names and years from queries, improved accuracy to 72%, demonstrating that targeted retrieval can simulate the benefits of a single-store per-filing approach [1]. Second, a hybrid chunking technique, which preserves the structure of the document and applies tokenization-sensitive refinements, increased accuracy to 80%. Third, a hybrid search mechanism combining dense and sparse retrieval methods raised accuracy to 84%. Finally, LLM-based query expansion, which transforms user queries into answer formats, yielded a final accuracy of 88%. This research demonstrates that a carefully designed RAG pipeline, incorporating intelligent metadata filtering, layout-aware chunking, advanced similarity search, and query semantics enhancement, substantially improves financial QA and significantly outperforms existing baselines.
dc.identifier.conference: Moratuwa Engineering Research Conference 2025
dc.identifier.department: Engineering Research Unit, University of Moratuwa
dc.identifier.email: akmal.20@cse.mrt.ac.lk
dc.identifier.email: indika@cse.mrt.ac.lk
dc.identifier.email: muaadh.20@cse.mrt.ac.lk
dc.identifier.email: ismail.20@cse.mrt.ac.lk
dc.identifier.faculty: Engineering
dc.identifier.isbn: 979-8-3315-6724-8
dc.identifier.pgnos: pp. 658-663
dc.identifier.proceeding: Proceedings of Moratuwa Engineering Research Conference 2025
dc.identifier.uri: https://dl.lib.uom.lk/handle/123/24543
dc.language.iso: en
dc.publisher: IEEE
dc.subject: Financial insight engine
dc.subject: transformer-based models
dc.subject: Retrieval-Augmented Generation
dc.subject: annual reports
dc.subject: regulatory filings
dc.subject: hybrid chunking
dc.subject: metadata filtering
dc.subject: query reformulation
dc.subject: real-time analysis
dc.subject: financial decision-making
dc.title: Advancing retrieval-augmented generation for financial question answering
dc.type: Conference-Full-text
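
Illustrative note on the abstract: two of the stages it describes, metadata filtering and hybrid search, can be sketched in a few lines. The record does not say how the authors fuse dense and sparse results, so the sketch below assumes reciprocal rank fusion (RRF) as a stand-in. The function names, the `company` and `year` metadata keys, and the constant `k=60` are hypothetical choices for illustration, not the paper's implementation.

```python
from collections import defaultdict


def filter_by_metadata(chunks, company, year):
    """Keep only chunks whose metadata matches the company and year.

    Assumes the company and year were already extracted from the query
    (the abstract uses a fine-tuned NER model for that step) and that
    each chunk is a dict with hypothetical 'company' and 'year' keys.
    """
    return [c for c in chunks if c["company"] == company and c["year"] == year]


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every ID it contains, so
    IDs ranked highly by both the dense and the sparse retriever rise
    to the top. RRF is an assumed fusion method, not confirmed by the
    source.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    chunks = [
        {"id": "c1", "company": "ExampleCorp", "year": 2022},
        {"id": "c2", "company": "ExampleCorp", "year": 2021},
        {"id": "c3", "company": "OtherCorp", "year": 2022},
    ]
    candidates = filter_by_metadata(chunks, "ExampleCorp", 2022)
    print([c["id"] for c in candidates])

    dense_ranking = ["a", "b", "c"]   # e.g. from embedding similarity
    sparse_ranking = ["b", "d", "a"]  # e.g. from keyword (BM25-style) search
    print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```

Filtering before retrieval narrows the candidate pool to a single filing, which is the effect the abstract credits for the jump from 43% to 72%. Rank-based fusion avoids having to normalize dense and sparse scores onto a common scale.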

Files

Original bundle

Name: 1571154315.pdf
Size: 2.03 MB
Format: Adobe Portable Document Format