Abstract:
With the expansion of Semantic Web technologies, the Resource Description Framework (RDF) and triple stores have become more prevalent. Because a huge amount of RDF data is now available, managing it properly and efficiently is challenging. Many triple stores have been implemented to support Semantic Web queries. These queries are written in SPARQL and are read-dominant, and they need to be answered quickly and efficiently. RDF data is stored in <subject, predicate, object> form, and each such statement is called a triple. A typical triple store contains billions of triples in this form.
Much work has been devoted to handling RDF data efficiently, but state-of-the-art systems still cannot handle web-scale RDF data effectively. Most existing systems store and index data in particular ways. For example, some systems use relational tables or bitmap matrices to optimize SPARQL query processing on RDF data. The relational approach suffers from high join cost and large intermediate results. Other systems use a Prolog inference engine to handle RDF data, which also has limitations when the amount of RDF data is huge.
A modern approach is to model RDF data in its native graph form. This approach requires new algorithms to build the graph and graph exploration techniques to answer SPARQL queries. It incurs no join cost and produces very small intermediate results. It also yields lower query execution times for complex SPARQL queries.
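As a rough illustration of the graph exploration idea, and not the implementation described in this work, the following Python sketch indexes a few triples as adjacency lists and answers a two-pattern query by walking edges instead of joining tables. The sample data and the names `build_graph` and `explore` are hypothetical, chosen only for this example.

```python
from collections import defaultdict

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "livesIn", "Colombo"),
    ("carol", "livesIn", "Kandy"),
    ("dave", "livesIn", "Colombo"),
]


def build_graph(triples):
    """Index triples as adjacency lists: subject -> predicate -> [objects]."""
    graph = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        graph[s][p].append(o)
    return graph


def explore(graph, start, path):
    """Follow a chain of predicates from `start` and return the nodes reached."""
    frontier = [start]
    for predicate in path:
        next_frontier = []
        for node in frontier:
            # .get() avoids creating empty entries for nodes with no outgoing edges.
            next_frontier.extend(graph.get(node, {}).get(predicate, []))
        frontier = next_frontier
    return frontier


graph = build_graph(TRIPLES)

# Roughly corresponds to:
#   SELECT ?city WHERE { :alice :knows ?x . ?x :livesIn ?city }
cities = explore(graph, "alice", ["knows", "livesIn"])
```

In this sketch the only intermediate state is the current frontier of nodes, whereas a relational plan would typically match both triple patterns separately and then join the two result sets on `?x`.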
The objective of this research is to build a graph-based triple store for Apache Cassandra. It uses the Apache Jena graph processing framework to build and explore the RDF graph. Finally, it benchmarks this RDF store against other RDF store implementations using the DBpedia dataset and sample queries, and shows that the graph-based approach outperforms those implementations.