Abstract:
Modern scientific experiments generate vast volumes
of data which are hard to keep track of. Consequently, scientists
find it difficult to reuse and share these data sets. We address
this problem by developing a schema-independent data cataloging
framework for efficient management of scientific data. The proposed
solution consists of an agent which automatically identifies
new data products and extracts metadata from them, as well as a
server which indexes the metadata using a NoSQL database and
provides a REST API for querying, sharing, and reusing the data
sets. The novelty of our solution lies in the pluggable metadata
extraction logic, extensible data product generation monitors, use
of a NoSQL database, and the ability to dynamically add new
metadata fields. The use of Apache Solr as the backend database
enables the proposed solution to index and search data products
much faster than a solution based on relational databases. For
example, our Apache Solr-based implementation can resolve
full-text, substring, prefix, and suffix queries 91% to 99% faster
than a MySQL-based implementation.
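
As an illustration of what pluggable metadata extraction could look like, the following is a minimal sketch and not the implementation described in this work. It assumes extractors are selected by file extension. All names (`EXTRACTORS`, `register`, `extract_csv`, `extract_metadata`) and the field names in the resulting record are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical plug-in registry: maps a file extension to an extractor.
Extractor = Callable[[Path], Dict[str, object]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(extension: str) -> Callable[[Extractor], Extractor]:
    """Decorator that plugs an extractor in for one file extension."""
    def decorator(func: Extractor) -> Extractor:
        EXTRACTORS[extension.lower()] = func
        return func
    return decorator


@register(".csv")
def extract_csv(path: Path) -> Dict[str, object]:
    """Example plug-in: report the column names found in the header row."""
    with path.open(encoding="utf-8") as handle:
        header = handle.readline().strip()
    return {"columns": header.split(",") if header else []}


def extract_metadata(path: Path) -> Dict[str, object]:
    """Build a schema-free metadata record for one data product."""
    # Fields assumed here to be common to every data product.
    record: Dict[str, object] = {
        "name": path.name,
        "size_bytes": path.stat().st_size,
    }
    # Format-specific fields are added only when a plug-in is registered,
    # so records for different formats may carry different fields.
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is not None:
        record.update(extractor(path))
    return record
```

In this sketch, supporting a new file format means registering one more function, and the record it returns may introduce fields that no earlier record had. A schema-independent index is what would let such records be stored without a migration.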