Abstract:
The amount of data available in textual form has increased rapidly in the recent past. Such data includes web sites and the documents of business organizations, and it contains a large amount of information. Information Retrieval (IR) is the field concerned with identifying the documents relevant to a given query from among all available documents. Information Extraction (IE) takes this a step further: instead of returning the set of documents that contain the relevant information, IE recognizes and returns the information itself from the natural-language text in these documents.
An ontology is defined as a “formal, explicit specification of a shared conceptualization”. It contains classes, properties, individuals and values that represent data in a certain domain. In most Ontology-Based Information Extraction systems, an IE technique is used to discover individuals for classes and values for properties in order to build an ontology for a given domain. Sometimes, however, the classes and properties themselves are also identified as part of the IE technique, rather than taken from a template of pre-identified classes and properties in the ontology.
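As an illustration of the four building blocks named above, the following is a minimal sketch of an ontology as an in-memory data structure. It is not the representation used by any particular ontology language or by the system described here; the names `Ontology`, `Individual`, `add_individual` and `add_value`, and the example class `University` with the property `location`, are hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Individual:
    """An instance of a class, carrying values for that class's properties."""
    name: str
    cls: str
    values: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Ontology:
    """Toy ontology: classes, properties, and the individuals found so far."""
    classes: Set[str] = field(default_factory=set)
    # Maps a property name to the class it applies to (an assumption of this sketch).
    properties: Dict[str, str] = field(default_factory=dict)
    individuals: Dict[str, Individual] = field(default_factory=dict)

    def add_individual(self, name: str, cls: str) -> Individual:
        if cls not in self.classes:
            raise ValueError(f"unknown class: {cls}")
        # Reuse the individual if it was already discovered.
        return self.individuals.setdefault(name, Individual(name, cls))

    def add_value(self, individual: str, prop: str, value: str) -> None:
        if prop not in self.properties:
            raise ValueError(f"unknown property: {prop}")
        self.individuals[individual].values.setdefault(prop, []).append(value)


# Hypothetical domain: the classes and properties are given in advance
# (the "template" case), and extraction supplies individuals and values.
ontology = Ontology(classes={"University"}, properties={"location": "University"})
ontology.add_individual("Example University", "University")
ontology.add_value("Example University", "location", "Springfield")
```

In this sketch, populating the ontology amounts to calling `add_individual` and `add_value` with whatever an IE technique discovers in the text.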
A traditional Ontology-Based Information Extraction system contains two main operations: ontology construction and ontology population. In the component-based approach defined in “Ontology-Based Components for Information Extraction (OBCIE)”, the ontology construction operation is unchanged. However, the ontology population operation is refined into a pipeline of three separate components: pre-processors, information extractors and aggregators.
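The three-stage pipeline can be sketched as follows. This is a toy illustration under stated assumptions, not the OBCIE implementation: we assume that a pre-processor turns raw text into a form the extractors can consume (here, a list of sentences), that each information extractor targets one property, and that an aggregator merges the extractors' outputs. The function names, the regular-expression patterns and the property names `location` and `founded` are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# (individual, property, value); a hypothetical result format for this sketch.
Extraction = Tuple[str, str, str]
Extractor = Callable[[List[str]], List[Extraction]]

# One or more capitalised words, used as a crude stand-in for an entity name.
NAME = r"(?P<name>[A-Z]\w*(?: [A-Z]\w*)*)"


def preprocess(document: str) -> List[str]:
    """Pre-processor: normalise whitespace and split the text into sentences."""
    text = re.sub(r"\s+", " ", document).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def extract_location(sentences: List[str]) -> List[Extraction]:
    """Information extractor for the hypothetical 'location' property."""
    pattern = re.compile(NAME + r" is located in (?P<value>[A-Z]\w*)")
    return [(m["name"], "location", m["value"])
            for s in sentences for m in pattern.finditer(s)]


def extract_founded(sentences: List[str]) -> List[Extraction]:
    """Information extractor for the hypothetical 'founded' property."""
    pattern = re.compile(NAME + r" was founded in (?P<value>\d{4})")
    return [(m["name"], "founded", m["value"])
            for s in sentences for m in pattern.finditer(s)]


def aggregate(results: List[List[Extraction]]) -> Dict[str, Dict[str, List[str]]]:
    """Aggregator: merge all extractor outputs, dropping duplicate values."""
    populated: Dict[str, Dict[str, List[str]]] = {}
    for extractions in results:
        for individual, prop, value in extractions:
            values = populated.setdefault(individual, {}).setdefault(prop, [])
            if value not in values:
                values.append(value)
    return populated


def populate(documents: List[str], extractors: List[Extractor]):
    """Run pre-processor -> information extractors -> aggregator."""
    results = []
    for document in documents:
        sentences = preprocess(document)
        results.extend(extractor(sentences) for extractor in extractors)
    return aggregate(results)


documents = [
    "Example University is located in Springfield.  "
    "Example University was founded in 1901."
]
populated = populate(documents, [extract_location, extract_founded])
```

Because each stage only depends on the output format of the previous one, an extractor for a further property can be added to the list passed to `populate` without changing the pre-processor or the aggregator.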
By developing these components as web services, we enable other applications to use them to extract information from any text-based document. To demonstrate this concept, we have developed an application that accepts a set of text documents and extracts useful information from them. It uses “metadata files”, which depend on the domain for which the ontology is created, and populates the given ontology.