2. Information Extraction: State Of Art
- The FASTUS system, developed at SRI International, was able to process English and Japanese text.
- The SPPC system can process German text.
3. Information Extraction Architecture
- Metadata analysis: This step extracts the metadata components of a document, i.e. data that describes the document, such as its title.
- Tokenization: In this step the text is broken into words or units known as tokens, which are then classified into groups according to their characteristics and attributes.
- Morphological analysis: Here morphological information is extracted from the tokens, a step closely related to part-of-speech disambiguation.
- Sentence/Utterance boundary detection: In this step the text is segmented into a sequence of sentences, each carrying its lexical content together with the respective features.
- Common Named-entity extraction: Named entities like names of persons, organizations, locations, expressions of times, numerical and currency expressions, etc. are detected irrespective of their domain.
- Phrase recognition: Several kinds of phrases, such as noun phrases, verb groups, prepositional phrases, acronyms, and abbreviations, are identified in this stage.
- Syntactic analysis: Here all possible interpretations of a sentence are computed from the lexical sequence of tokens produced by the preceding steps.
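The early stages of this architecture can be sketched in a few lines of Python. This is a minimal, illustrative toy, not a production pipeline: the function names are hypothetical, sentence boundaries are detected with a naive punctuation rule, and the "named-entity" pass simply flags non-sentence-initial capitalized tokens, where a real system would use gazetteers and trained models.

```python
import re

def split_sentences(text):
    # Naive sentence boundary detection: split after ., ! or ?
    # followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def tokenize(sentence):
    # Break a sentence into word tokens and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag_named_entities(tokens):
    # Toy named-entity pass: capitalized tokens that are not
    # sentence-initial are flagged as candidate entities.
    return [(tok, 'ENTITY' if i > 0 and tok[0].isupper() else 'O')
            for i, tok in enumerate(tokens)]

text = "FASTUS was built at SRI International. It processed English text."
for sent in split_sentences(text):
    print(tag_named_entities(tokenize(sent)))
```

Each step feeds the next, mirroring the pipeline above: boundary detection produces sentences, tokenization produces tokens, and later stages annotate those tokens.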
4. Information Extraction Types
- Named Entity Recognition: Also known as entity identification, entity chunking, and entity extraction, this addresses the identification (detection) and classification of text into pre-defined categories of named entities, such as the names of persons (e.g., S. R. Ranganathan), organizations (e.g., DRTC), locations (e.g., Bangalore), expressions of times (e.g., April 1962), numerical and currency expressions (e.g., 5 Thousand INR), etc.
- Co-reference Resolution: Co-reference resolution means identifying all the expressions in a given text that refer to the same entity, e.g., in "Ranganathan joined DRTC, where he taught", both "Ranganathan" and "he" denote the same person.
- Relation Extraction: This is the method of identifying and classifying predefined relations between entities mentioned in a text, e.g., a "works-for" relation between a person and an organization.
- Event Extraction: This is a method of detecting events in a text and deriving comprehensive, structured details about them, e.g., the date, venue, and participants of a meeting.
5. Applications of Information Extraction Systems
- News Tracking: This is a classical application area of information extraction and has attracted considerable attention from researchers in the NLP community. It is the task of automatically tracking a specific type of event in news content. The popular MUC competitions are based on extracting structured entities, such as names of people and organizations, and relations between them, such as "is-Head-of". One of its applications is connecting background information on people, locations, and organizations with the related content in news articles using hyperlinks.
- Customer Care: Extraction techniques of this type are used in scenarios where various kinds of data or facts are collected from customers and information must be extracted efficiently from this unstructured source. The method identifies the relevant product categories from the details provided by customers, and on this basis personalized services are offered. Such applications also involve cleaning the data stored in database records, e.g. extracting structured fields such as road name, city, and state from a flat string containing an address.
- Classified Ads: Classified ads and other listings, such as lists of restaurants, paying-guest facilities, or apartments on lease, form another vital area where unstructured data, once exposed as structure, becomes invaluable for querying. This extraction method deals with extracting information from record-oriented data.
- Citation Databases: Several citation databases have been created by extracting information from a vast range of sources, from conference web pages to the personal home pages of researchers. Some of them are CiteSeer, Google Scholar, and Cora. These citation databases require several levels of extraction to obtain the desired results: navigating web pages to find publication records, extracting them from either HTML pages or PDF files, and then extracting associated metadata such as title, authors, venue, and year. Citation databases also support statistical analyses such as author-level citation counts.
- Opinion Databases: Several websites collect opinions from users on a range of topics, such as sports, politics, products, books, movies, music, and people. Many of these opinions are in free-text form in blogs, newsgroup posts, review sites, etc. Such reviews are not directly reusable as collected, but their value can be enhanced by organizing them along structured fields.
- Community Websites: Community websites extract information about the events associated with a particular community or association. Some implementations are DBLife and Rexa, which find information about researchers, talks, workshops, conferences, projects, and events relevant to the community.
- Comparison Shopping: With the growth of the web, many merchants have launched commercial websites, but since the web is a vast ocean, enhancing their visibility to customers, and letting users compare prices of a product across different websites, is a challenging task. One implementation of this kind, which compares the prices of books across different merchant websites, is IndiaBookStore.
- Structured Web Searches: This is the most challenging setting for effective information extraction: searching over entities and the relationships among them instead of keywords. Keyword searches are effective in providing information about entities that are nouns or noun phrases, but they fail when the query depends on relations between entities. For example, for documents on "artists born in Italy between 1450 and 1600", keyword-based search is not effective.
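The customer-care example above mentions segmenting a flat address string into structured fields such as road name, city, and state. A minimal sketch of that idea is shown below; the fixed "road, city, state PIN" format and the function name are assumptions for illustration, since real pipelines use learned models to cope with format variation.

```python
import re

def segment_address(flat):
    # Assumes a flat string of the form "<road>, <city>, <state> <pin>",
    # with the 6-digit PIN optional; returns a dict of fields or None.
    m = re.match(
        r"(?P<road>[^,]+),\s*(?P<city>[^,]+),"
        r"\s*(?P<state>[A-Za-z ]+?)\s*(?P<pin>\d{6})?$",
        flat,
    )
    return m.groupdict() if m else None

print(segment_address("12 MG Road, Bangalore, Karnataka 560001"))
```

Once the fields are separated, the records can be stored in database columns and queried directly, which is exactly the "exposing structure for querying" benefit described for classified ads and listings.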
7. Tools and Services for IE
- General Architecture for Text Engineering: The General Architecture for Text Engineering (GATE) is bundled with a free information extraction system and is used by groups ranging from large corporations to small startups, and from multi-million-euro research consortia to undergraduate projects.
- Apache OpenNLP: Apache OpenNLP is a Java machine learning toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, parsing, and co-reference resolution.
- OpenCalais: OpenCalais is an automated information extraction web service from Thomson Reuters (Free limited version).
- Machine Learning for Language Toolkit (Mallet): Mallet is a Java-based package for a variety of natural language processing tasks, including information extraction.
- DBpedia Spotlight: DBpedia Spotlight is an open source tool in Java/Scala (and free web service) that can be used for named entity recognition and name resolution.
- Web Miner: Web Miner is commercial software used for extracting specific information, images and files from websites.
- Semantics3: Semantics3 is an e-commerce product and pricing database that obtains its data through information extraction from thousands of online retailers.