Friday, February 13, 2015

13. Advanced course in information storage and retrieval - II P- 06. Information Storage and Retrieval

Suggestions from all of you are respectfully invited for the creation of this blog. Please send your suggestions and entries. Its scope is the entire world knowledge community, and it will make an important contribution to building the careers of all candidates. You can send your suggestions to this email address -

By: Dr P.M Devika, Paper Coordinator


1. Introduction

The goal of an Information Retrieval (IR) system, as we know, is to respond to a user's request by retrieving documents. The aim is to retrieve documents whose contents match the user's information need. Users usually express their information requirements in natural language, either as a statement or as part of a natural language dialogue. After retrieval, the user examines the retrieved documents by going through the text and determines whether they are relevant or not. However, as we know from experience, the retrieved documents often do not match the user's information need. This is because of the ambiguous nature of natural languages (discussed in detail in the succeeding sections).
 Natural Language Processing (NLP) is an area of research and application that explores how natural language text, entered into a computer system, can be manipulated and transformed into a form more suitable for further processing. The aim of NLP techniques is to intelligently analyse documents and capture their meaning. The goal is to determine the structure of sentences, to derive their meaning, and to interpret that meaning in context. This has led many to suggest that NLP techniques can be productively applied to information retrieval problems to produce representations of documents and queries for efficient retrieval [1].
 This module is divided into two parts. In the first part, we discuss natural language processing and information retrieval, the linguistic phenomena of natural language, and some NLP techniques and tasks; in the second part, we discuss Semantic Web (SW) technologies and the use of NLP in the Semantic Web.

2. Natural Language Processing in Information Retrieval

Natural Language Processing (NLP) is an area of research and application that explores how natural language text entered into a computer system can be manipulated and transformed into a form more suitable for further processing. It was formed in 1960 as a sub-field of Artificial Intelligence and Linguistics. The aim was to study problems in the automatic generation and understanding of natural language [8]. The primary goal of NLP is to process text of any type in the same way that we, as humans, do, and to extract what is meant at the different levels at which meaning is conveyed in language [9].

Automatic NLP techniques have been considered a desirable feature of information retrieval. The techniques can be used to build descriptions of both document content and the user's query. The aim is to compare the descriptions of document content and the user's query and to retrieve the documents that best suit the user's information needs [10]. In the following, we briefly illustrate the tasks of NLP-based automatic information retrieval systems.

1. Indexing the collection of documents: NLP techniques are applied to generate the index consisting of document descriptions. Usually a document is described through a set of terms that, in theory, best represents its content.
2. Indexing the user's query: when a user formulates a query, the system analyses it and transforms it so that it represents the user's information need. The process is the same as for document content representation.
3. Descriptions matching: the system compares the description of each document with the query, and presents the user with those documents whose descriptions are closest to the query description.
4. Displaying the results: the results are, usually, listed in order of relevancy, i.e., by the level of similarity between the document and query descriptions. 
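
The four tasks above can be sketched in a few lines of Python. This is only an illustrative sketch, not the design of any particular system: the sample documents, the whitespace tokenizer, the raw term-frequency weights and the cosine similarity measure are all assumptions chosen for brevity.

```python
import math
from collections import Counter

def describe(text):
    """Describe a text as a bag of lowercase terms with raw frequencies."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency descriptions."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, documents):
    # Task 1: index the collection (one description per document).
    index = {doc_id: describe(text) for doc_id, text in documents.items()}
    # Task 2: index the query in the same way.
    query_description = describe(query)
    # Task 3: match the query description against every document description.
    scored = [(doc_id, cosine(query_description, d)) for doc_id, d in index.items()]
    # Task 4: list the matching documents in order of similarity.
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)

# Hypothetical sample collection.
DOCS = {
    "d1": "the car was parked near the bank",
    "d2": "fruit bat colonies roost near the river",
    "d3": "a cricket bat and a cricket ball",
}

results = search("cricket bat", DOCS)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Note that d2 is still returned for the query, only with a lower score: a purely term-based description cannot tell the two senses of 'bat' apart.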

3. Natural Language Understanding

Before discussing the NLP techniques, we discuss the features of natural language, that is, the linguistic phenomena that influence the recall and precision of information retrieval. Understanding natural language is very important, as it lies at the core of NLP. Natural language understanding is concerned with the process of comprehending and using language once the words are recognized. The objective is to specify a computational model that matches human performance in linguistic tasks such as reading, writing, hearing, and speaking [2].

The two main characteristics of natural language are:
  1. Linguistic variation: different words (aka terms) are used to express the same meaning. For instance, the words 'car', 'auto', 'automobile', and 'motorcar' communicate the same meaning: 'a motor vehicle with four wheels; usually propelled by an internal combustion engine'.
  2. Linguistic ambiguity: the same word allows more than one meaning or more than one interpretation. For example, 'bat' can mean 'sports equipment' (as in 'cricket bat') or 'nocturnal mammal' (as in 'fruit bat').

The above characteristics of natural language seriously affect the information retrieval process. For instance, linguistic variation can cause silence in document retrieval: the search term may not match the term used in a document, even though a semantically equivalent term is present in it. On the other hand, linguistic ambiguity adds noise to the result set, because documents are retrieved that use the same term with a different meaning.
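
A small Python sketch can make silence and noise concrete. The documents and the synonym set below are hypothetical, and exact word matching stands in for a real retrieval model.

```python
# Hypothetical sample collection.
DOCS = {
    "d1": "the automobile needs a new engine",
    "d2": "he bought a cricket bat",
    "d3": "the fruit bat sleeps by day",
}

# Hypothetical synonym set for one concept (linguistic variation).
SYNONYMS = {"car": {"car", "auto", "automobile", "motorcar"}}

def exact_match(term, documents):
    """Return ids of documents that contain the term literally."""
    return sorted(d for d, text in documents.items() if term in text.split())

def expanded_match(term, documents):
    """Return ids of documents that contain the term or one of its variants."""
    variants = SYNONYMS.get(term, {term})
    return sorted(d for d, text in documents.items() if variants & set(text.split()))

silence = exact_match("car", DOCS)        # variation: the relevant d1 is missed
recovered = expanded_match("car", DOCS)   # variants recover d1
noise = exact_match("bat", DOCS)          # ambiguity: both senses are returned

print(silence, recovered, noise)
```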

The effects of these phenomena on information retrieval are further illustrated below. The repercussions can be observed mainly at three different levels: the syntactic level, the semantic level and the pragmatic level.

  1. At the syntactic level: the focus is on the relations established between words to form larger linguistic units, phrases and sentences. Ambiguities arise from the possibility of associating a sentence with more than one syntactic structure. For instance, 'John read the pamphlet in the train' could mean two things: John read the pamphlet that was in the train, or John read the pamphlet when he was traveling by train.
  2. At the semantic level: the focus is on the meaning of a sentence, obtained by studying the meaning of each word in it. For instance, in 'John was reading a book in the bank', the word 'bank' could refer to two different meanings: a financial institution, or sloping land (especially the slope beside a body of water).
  3. At the pragmatic level: the focus is on the language's relationship to its context, where we often cannot use a literal and automated interpretation of the terms used. The idea is that, in specific circumstances, the meaning of the words in a sentence must be interpreted at a level that includes the context in which the sentence is found [8]. For instance, 'John enjoyed the book' can be interpreted in different ways: John enjoyed reading the book, or John enjoyed writing the book.

4. Natural Language Processing Techniques

There are two fundamental NLP techniques that are generally practiced in IR. They are:
  1. Statistical approach; and
  2. Linguistic approach.

  1. Statistical Approach
Statistical processing of natural language represents the classical model of information retrieval systems. It is a simple approach whose key focus is the 'bag of words' [8]. In this approach, all words in a document are treated as its index terms. Each term is assigned a weight as a function of its importance, which is usually determined by its frequency of appearance within the document. However, this 'bag of words' model is not ideal for processing natural language documents, because it fails to take into consideration the characteristics of natural language, especially word order, structure, meaning, etc.
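
The following Python sketch illustrates the 'bag of words' idea and its main weakness. Raw term frequency is used as the weight and whitespace splitting as the tokenizer; both are simplifying assumptions.

```python
from collections import Counter

def bag_of_words(text):
    """Treat every word as an index term, weighted by its frequency."""
    return Counter(text.lower().split())

# Two sentences with different meanings but identical words.
a = bag_of_words("John liked the book")
b = bag_of_words("the book liked John")
same_description = (a == b)   # word order is lost in the description

# Frequency as weight: repeated terms count more.
weights = bag_of_words("the bank by the river bank")

print(same_description, weights)
```

Because both sentences receive the same description, a system built only on this model cannot distinguish them.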

  2. Linguistic Approach
Linguistic processing of natural language is based on the application of different techniques and rules that explicitly encode linguistic knowledge [11]. The documents are analysed at different linguistic levels, such as the syntactic, semantic and pragmatic levels (discussed above).

5. Natural Language Processing Tasks

In the following, we discuss some of the widely used linguistic processing techniques at the syntactic and semantic levels. Note that most of today's NLP systems follow a mixed approach, i.e., a combination of techniques from both the statistical and linguistic approaches [8].

5.1 Syntactic analysis

Generally speaking, syntax deals with the structural properties of texts. It is about the grammatical arrangement of words in sentences. In syntactic analysis, valid sentences are recognized and their underlying structures are determined [6]. The syntactic analysis process involves analyzing and decomposing sentences into parts of speech, with an explanation of the form, function, and syntactical relationship of each part.

The syntactic structure of a sentence is governed by the syntactical rules (aka grammar). Generally speaking, a grammar (formal grammar) is a set of rules for rewriting strings, along with a start symbol from which rewriting starts. The grammar is the means of formalizing our knowledge, and as a result, generates legal sentences of the language.

5.1.1  Context-free Grammar
Context-free grammar was developed by Noam Chomsky in the mid-1950s [12]. A grammar is called context-free when its production rules can be applied regardless of the context of a nonterminal. In such a grammar, each production must have exactly one nonterminal symbol on its left-hand side. A context-free grammar with an example is shown in Table 1. Figure 1 shows a context-free derivation tree for the example sentence John liked the book. A nonterminal node (a node that appears only in the interior of the tree structure for the given sentence [6]) is the starting point and forms the root. As we can see from Figure 1, the single nonterminal on the left-hand side of a rule can always be replaced by the right-hand side [13], and this process continues until only terminal nodes remain.
Table 1: Context-free grammar
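
As a rough illustration of how such rules derive a sentence, the sketch below parses John liked the book with a small top-down parser in Python. The grammar in the code is an assumption made for this example (it is not a copy of Table 1), and the parser only works for grammars without left recursion.

```python
# Assumed toy grammar: each nonterminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"]],
    "N": [["John"], ["book"]],
    "V": [["liked"]],
}

def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` derives words[pos:next_pos]."""
    if symbol not in GRAMMAR:                      # terminal symbol
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:             # try each rewriting rule
        yield from expand(symbol, production, words, pos)

def expand(symbol, production, words, pos):
    """Replace `symbol` by one right-hand side, parsing its parts left to right."""
    partial = [([], pos)]
    for part in production:
        partial = [
            (children + [subtree], nxt)
            for children, p in partial
            for subtree, nxt in parse(part, words, p)
        ]
    for children, end in partial:
        yield (symbol, children), end

def parse_sentence(sentence):
    """Return all derivation trees that cover the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]

trees = parse_sentence("John liked the book")
print(trees[0])
```

The root of the resulting tree is the start symbol S, the interior nodes are nonterminals, and the leaves are the words of the sentence. A word sequence that the grammar cannot derive yields no tree.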

5.1.2 Transformational Grammar

Transformational grammar is a generative grammar. It was first introduced by Noam Chomsky. In [14], Chomsky developed the idea that each sentence in a language has two levels of representation: a deep structure and a surface structure. The deep structure represents the core semantic relations of a sentence, and is mapped on to the surface structure via transformations. It is to be noted that context-free grammars (discussed in the previous section) fail to represent subject-verb agreement in all cases [6].
The transformational grammar starts out with context-free rules to build up the basics of the sentence, but then modifies the basic sentences with transformational rules [15]. Here, the tree structure produced by the context-free rules is called the deep structure. The tree structure produced after applying the transformational rules is called the surface structure. Transformational grammars specify the legal sentences of a language by giving rules. For instance, in the rule s → np vp, the transformational rule specifies that the aux should be replaced by an aux that has a feature giving it the same number as the subject of the sentence [6]. Figure 2 presents a transformational grammar for the example John is walking. Here, Figure 2(a) shows the deep structure, which is generated using the context-free rules, and Figure 2(b) shows the surface structure, which is generated using the transformational rules.

6. Semantic Web

Tim Berners-Lee, the inventor of the World Wide Web (WWW), first envisioned the Semantic Web (SW), which provides automated information access based on machine-processable semantics of data. The SW is an extension of the current Web in which information is given well-defined meaning to enable computers and people to work in cooperation [28]. Antoniou et al [29] defined the SW as a vision of the next-generation Web, which enables Web applications to automatically collect Web documents from diverse sources, integrate and process information, and interoperate with other applications in order to execute sophisticated tasks for humans. The aim of the SW is to develop languages for expressing information in a machine-processable way. The explicit representation of the semantics of data, accompanied by domain theories (i.e. ontologies), will enable the Web to provide a qualitatively new level of services [31].

Furthermore, semantic technologies and techniques are meant to allow machines to process logically connected data on the Web automatically and to infer new information. Through a rich knowledge representation model such as the Resource Description Framework (RDF), the Semantic Web provides highly structured data. It is now possible for application developers to share their rich structured data on the Web, and software agents can infer knowledge based upon the different kinds of structured and logically connected data available on the Web. It is important to mention that RDF is built on an elementary pointer mechanism, the Uniform Resource Identifier (URI) (discussed in detail in the following Sections). In the traditional Web, a URI is mainly used to refer to documents and their parts through the hypertext mechanism. The emerging Semantic Web shows a new face of the URI by using it to name anything, from abstract concepts (color, test, dream, etc.) to physical objects (person, location, mountain, etc.) to electronic objects, also known as information objects (for example, the home page of an institution). RDF is also used to name the relationships between objects as well as the objects themselves [34].

In the following Sections we discuss the Semantic Web techniques and technologies.

6.1  Semantic Web Components
Figure 6 shows the semantic web technology stack, which describes the semantic web design and vision. It is built on a layered structure. The goal of the layered structure is to implement the semantic web vision step by step. The pragmatic justification is that it is easier to achieve consensus on small steps, whereas it is much harder to get everyone on board if too much is attempted [29] [35]. Another reason is that achieving the vision of the semantic web does not require implementing the entire technology stack. Instead, the decision about which technologies to implement is guided by the overall system objective.
Figure 6: Layered Approach to Semantic Web [31]
In building the semantic Web in a layered manner, two principles should be followed:  
  1. Downward Compatibility: agents (pieces of software that work autonomously and proactively [29]) fully aware of one layer should also be able to interpret and use information written at lower levels. E.g. agents aware of the semantics of OWL can take full advantage of information written in RDF and RDF Schema.
  2. Upward Partial Understanding: agents fully aware of one layer should also be able to take at least partial advantage of information at higher levels. E.g. an agent aware of only RDF and RDF Schema semantics can interpret partial knowledge written in OWL, by disregarding those elements that go beyond RDF and RDF Schema.

6.1.1  Extensible Markup Language (XML)
We have seen in Figure 6 that at the bottom of the Semantic Web stack are XML (eXtensible Markup Language) and XML Schema. XML is a subset of Standard Generalized Markup Language (SGML). XML lets everyone create their own tags, such as those used to annotate Web pages or sections of text on a page. But it says nothing about what the structures mean. XML is particularly suitable for sending documents across the Web.

Salient Features of XML
Some of the salient features of XML are [29]:
  1. Extensible: tags can be defined; XML can be extended to many different applications.
  2. Machine accessibility: an XML document is more easily accessible to machines because every piece of information is described. Moreover, the relations between pieces of information are defined through the nesting structure. For example, the <author> tags appear within the <book> tags, so they describe properties of the particular book. A machine processing the XML document would be able to deduce that the author element refers to the enclosing book element, rather than having to infer this fact from proximity considerations, as in HTML.
  3. Separates content from formatting: the same information can be displayed in different ways without requiring multiple copies of the same content; moreover, the content may be used for purposes other than display.
  4. A meta-language for markup: it does not have a fixed set of tags but allows users to define tags of their own.

Issues with XML

XML is a universal meta-language for defining markup. It provides a uniform framework, and a set of tools such as parsers, for the interchange of data and metadata between applications. But it also has some limitations, such as [33]:

  1. XML does not ensure a standard vocabulary and is subject to interpretation. For example, one person can name an element 'Author' while another names it 'Writer'. Humans can see that both mean the same, but a machine has no way to decide this. This creates confusion when machines try to share data with each other.
  2. The nesting of tags does not have a standard meaning. It is up to each application to interpret the nesting. For example, consider the sentence 'David John is a lecturer of Thermodynamics'. There are various ways of representing this sentence in XML. Two of the possibilities are:

<course name="Thermodynamics">
    <lecturer>David John</lecturer>
</course>

<lecturer name="David John">
    <teaches>Thermodynamics</teaches>
</lecturer>

The above two formalizations use essentially opposite nesting although they represent the same information. In the first case, the course is the primary element and the lecturer element is nested within it. In the second case, the lecturer is the primary element and the nested element, teaches, refers to the course name. So there is no standard way of assigning meaning to tag nesting.
  3. Domain-Specific Markup Languages: since users are free to define their own tags, many domain-specific markup languages have been developed, for example MathML [36] and CML (Chemical Markup Language) [37]. The problem with the various domain-specific markup languages is the lack of standardization in describing resources on the Web. At the same time, preventing this kind of flexibility and extensibility would result in inadequate resource description. Hence, there should be a common model/framework that can bridge the gap between these various schemas. It is at this stage that RDF came into the picture, which is also the next layer in the Semantic Web pyramid of Figure 6.
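
The nesting issue described in item 2 can be seen when a program tries to read the two representations. In the Python sketch below (standard library only), the closing tags and the two reader functions are assumptions added for illustration. Each reader hard-codes one interpretation of the nesting, and a reader written for one form does not recover the fact from the other.

```python
import xml.etree.ElementTree as ET

COURSE_FIRST = """
<course name="Thermodynamics">
    <lecturer>David John</lecturer>
</course>
"""

LECTURER_FIRST = """
<lecturer name="David John">
    <teaches>Thermodynamics</teaches>
</lecturer>
"""

def read_course_first(xml_text):
    """Interpretation 1: the course is the outer element, the lecturer is nested."""
    root = ET.fromstring(xml_text)
    return (root.findtext("lecturer"), root.get("name"))

def read_lecturer_first(xml_text):
    """Interpretation 2: the lecturer is the outer element, the course is nested."""
    root = ET.fromstring(xml_text)
    return (root.get("name"), root.findtext("teaches"))

# Each reader recovers (lecturer, course) from the form it was written for.
fact_a = read_course_first(COURSE_FIRST)
fact_b = read_lecturer_first(LECTURER_FIRST)

# Applying a reader to the other form silently gives a wrong answer.
wrong = read_course_first(LECTURER_FIRST)

print(fact_a, fact_b, wrong)
```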

6.1.2  Resource Description Framework (RDF)
RDF is a basic data model, not a language. The RDF model provides the description of Web documents (in other words rendering of metadata to the documents) in a natural manner so that the metadata can be shared across different applications. RDF expresses the meaning, encoded in sets of triplets (resource/subject, predicate/property and object/value), each triplet being rather like the subject, verb and object of an elementary sentence. These triplets can be written using XML tags.

RDF Triplets

A simple RDF model has three parts [38]:  
  1. Subject/Resource: Any entity, which has to be described, is known as resource, also known as subject. For instance, it can be a webpage in Internet or a person in a society.  
  2. Predicate/Property: Any characteristic or attribute of a resource that is used to describe it is known as a property or predicate. For example, a webpage can be recognized by its Title, and a person can be recognized by his or her Name. Here, both are attributes for recognizing the resources Webpage and Person.
  3. Object/Value: The value of a property is termed the object. For example, the title of the DRTC Webpage is Documentation Research and Training Centre, and the name of a Person is S. R. Ranganathan. Here, Documentation Research and Training Centre and S. R. Ranganathan are the values of the properties title and name respectively.

The combination of subject, predicate and object is said to be a Statement. For example, consider the statement 'David John is the author of the webpage'. This statement can be represented diagrammatically as follows [30]:

Figure 7: RDF statement

The XML representation of the above statement is:
<?xml version="1.0" encoding="UTF-16"?>
<rdf:RDF
    xmlns:rdf=" syntax-ns#"
    xmlns:mydomain="">

    <rdf:Description rdf:about="">
        <mydomain:author>David John</mydomain:author>
    </rdf:Description>
</rdf:RDF>

The first line specifies that we are using XML version 1.0. xmlns:rdf=" syntax-ns#" specifies the XML namespace. An XML namespace is a collection of names, identified by a URI reference [RFC2396], which are used in XML documents as element types and attribute names. The syntax for declaring an XML namespace is xmlns:namespace-prefix="namespace". The rdf:Description element makes a statement about the resource. Within the description, the property is used as a tag, and the content is the value of the property.
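
To show how a program might recover the (subject, predicate, object) triplet from such a serialization, here is a minimal Python sketch using only the standard library. The rdf namespace URI is the standard W3C one; the mydomain namespace and the page URI are hypothetical placeholders under example.org. The sketch handles only the simple pattern shown above (an rdf:Description with literal-valued property elements), not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # standard RDF namespace
MY_NS = "http://example.org/mydomain#"                   # hypothetical namespace

RDF_XML = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:mydomain="{MY_NS}">
  <rdf:Description rdf:about="http://example.org/page">
    <mydomain:author>David John</mydomain:author>
  </rdf:Description>
</rdf:RDF>"""

def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for simple descriptions."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree writes qualified names as "{namespace}local";
            # joining the two parts gives the property URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            triples.append((subject, predicate, prop.text))
    return triples

triples = extract_triples(RDF_XML)
print(triples)
```

The resource is the subject, the property tag is the predicate, and the element content is the object, which matches the three parts of the RDF model described earlier.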

The most important feature of RDF is that it is developed to be domain-independent. It is very general in nature and does not restrict/apply any constraint on any one particular domain. It can be used to describe information about any domain. The RDF model imitates the class system of object-oriented programming. A collection of classes (as defined for a specific purpose or domain) is called a schema in RDF. These classes are extensible through subclass refinement [38]. Thus, various related schemas can be made using the base schema. RDF also supports metadata reuse by allowing transmission or sharing between various schemas.

RDF vs. RDF Schema
An illustration of different layers involved in RDF and RDFS [38] can be represented in the following way for a statement: Networking is taught by David John. The schema for this statement may contain classes such as lecturers, academic staff members, staff members, courses and properties such as is taught by, involves, etc. The above statement can be illustrated as follows. In the following Figure 8, rectangles are properties, ellipses above the dashed line are classes, and ellipses below the dashed line are instances.
Figure 8: RDF and RDFS layers

Issues with RDF Schema
RDF and RDFS allow the representation of some ontological knowledge. The main modeling primitives of RDF/RDFS concern the organization of vocabularies in typed hierarchies: subclass and subproperty relationships, domain and range restrictions, and instances of classes. However, a number of other features are missing. As noted in [44], these are:
  1. Local scope of properties: rdfs:range defines the range of a property for all the classes. Hence, in RDF Schema we cannot declare range restrictions that apply only to some classes, and not all. For example, we cannot say that Cows eat only Plants, while other Animals may eat Meat, too.  
  2. Disjointness of classes: Sometimes we wish to say that classes are disjoint. For example, Male and Female are disjoint. But in RDF Schema we cannot do this.  
  3. Boolean combinations of classes: Sometimes we wish to build new classes by combining other classes using union, intersection, and complement. For example, we may wish to define the class Person to be the disjoint union of the classes Male and Female. RDFS does not allow such descriptions.  
  4. Cardinality restrictions: Sometimes we wish to place restrictions on how many distinct values a property may or must take. For example, we would like to say that a Person has exactly two Parents, or that a Course is taught by at least one Lecturer. Again, such restrictions are not possible to express in RDFS.  
  5. Special characteristics of properties: Sometimes it is useful to say that a property is transitive, unique, or the inverse of another property (e.g., eats and is eaten by).

Thus we need an ontology language that is richer than RDF Schema, a language that offers the above features and more. In designing such a language one should be aware of the trade-off between expressive power and efficient reasoning support. Generally speaking, the richer the language is, the more inefficient the reasoning support becomes, often crossing the border of non-computability. Thus we need a compromise, a language that can be supported by reasonably efficient reasoners, while being sufficiently expressive to express large classes of ontologies and knowledge.
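
To make two of the missing features listed above more concrete, the Python sketch below checks class disjointness and an exact cardinality restriction over a handful of statements. The data, the property names and the closed-world checking style are all assumptions for illustration; this is plain Python, not an ontology language, and it is not how a logic-based reasoner treats such restrictions.

```python
# Hypothetical instance data as (subject, property, value) statements.
FACTS = [
    ("alice", "type", "Female"),
    ("alice", "hasParent", "carol"),
    ("alice", "hasParent", "dave"),
    ("bob", "type", "Male"),
    ("bob", "type", "Female"),     # conflicts with disjointness
    ("bob", "hasParent", "erin"),  # only one parent recorded
]

# Constraints that RDF Schema cannot state (assumed examples).
DISJOINT = [("Male", "Female")]
EXACT_CARDINALITY = {"hasParent": 2}

def values(subject, prop):
    """All distinct values recorded for one property of one subject."""
    return {v for s, p, v in FACTS if s == subject and p == prop}

def violations():
    """List (subject, problem) pairs found in the recorded statements."""
    found = []
    for subject in sorted({s for s, _, _ in FACTS}):
        types = values(subject, "type")
        for a, b in DISJOINT:
            if a in types and b in types:
                found.append((subject, f"disjoint:{a}/{b}"))
        for prop, n in EXACT_CARDINALITY.items():
            if len(values(subject, prop)) != n:
                found.append((subject, f"cardinality:{prop}"))
    return found

problems = violations()
print(problems)
```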

6.1.3  Ontology
The concept originated more than two thousand years ago in philosophy, more specifically in Aristotle’s theory of categories [45]. The original purpose was to provide a categorization of all existing things in the world. Ontologies have lately been adopted in several other fields, such as Library and Information Science (LIS), Artificial Intelligence (AI), and more recently Computer Science (CS), as the main means for describing how classes of objects are correlated, or for categorizing document resources [46]. Many definitions of ontology have been provided. According to Gruber, an ontology is “an explicit specification of a conceptualization” [47]. Later, Studer et al [48] extended the definition and defined an ontology as "a formal, explicit specification of a shared conceptualisation". Studer’s definition adds the ideas that the conceptualization is shared and that the relations among the concepts are formal. The explicit, formal representation of a shared conceptualization involves a perspective on a specific reality, and is constituted in the conceptual structure of a knowledge base.

The ultimate objective of an ontology is to share the knowledge it represents. An ontology defines the terms and their formal relations within a given knowledge area. The main features of ontologies are [29]:

  1. Ontologies provide a shared understanding of domains;
  2. Ontologies are useful for representing domain knowledge and for facilitating its sharing between human and automatic agents;
  3. Ontologies are useful for the organization and navigation of websites;
  4. Ontologies are useful for improving the accuracy of Web searches. Web searches can exploit the generalization and/or specialization of information.

6.1.4  Logic and Ontology Language

In representing knowledge, logic plays an important role. Logic enhances the ontology language further. It helps to establish the consistency and correctness of data sets and to infer conclusions that are not explicitly stated but are required by or consistent with a known set of data. We list here some of the important features of logic [29] [34]:

  1. Language: logic provides a high-level language in which knowledge can be expressed in a transparent way and will have a high expressive power.
  2. Formal semantics: it has a well-understood formal semantics, which assigns an unambiguous meaning to logical statements.
  3. Reasoning: automated reasoners can deduce (i.e., infer) conclusions from the given knowledge, thus making implicit knowledge explicit. For example,
       i. X is a Cat
       ii. a Cat is a Mammal
       iii. a Mammal gives birth to young ones
     Therefore, X gives birth to young ones.
  4. Inferred knowledge explanation: with proof systems, it is possible to trace the proof that leads to a logical consequence. In this sense, the logic can provide explanations for answers.
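
The reasoning example above (X is a Cat, a Cat is a Mammal, a Mammal gives birth to young ones) can be sketched as simple forward chaining in Python. The tuple representation and the two hard-coded inference rules are assumptions made for this illustration; actual reasoners are considerably more general.

```python
# Given knowledge, written as tuples (representation assumed for this sketch).
FACTS = {("X", "is_a", "Cat")}
SUBCLASS = {("Cat", "Mammal")}                            # a Cat is a Mammal
CLASS_PROPERTIES = {("Mammal", "gives_birth_to", "young ones")}

def infer(facts):
    """Apply the two rules repeatedly until no new statement is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in derived:
            if p != "is_a":
                continue
            # Rule 1: an instance of a class is an instance of its superclass.
            for sub, sup in SUBCLASS:
                if o == sub:
                    new.add((s, "is_a", sup))
            # Rule 2: an instance inherits the properties stated for its class.
            for cls, prop, val in CLASS_PROPERTIES:
                if o == cls:
                    new.add((s, prop, val))
        if not new <= derived:
            derived |= new
            changed = True
    return derived

conclusions = infer(FACTS)
print(sorted(conclusions))
```

The statement that X gives birth to young ones is not in the given facts; it becomes explicit only through inference.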

However, adding logic to the Web needs care: because of the Web's particular characteristics, using existing logics as they are can lead to problems [34]. Adding logic to the Web presupposes the use of rules to make inferences, choose courses of action, etc. The logic deployed must be powerful enough to describe complex objects, but it must not be so complex and inflexible that software agents run into contradictions while inferring knowledge.

A number of different knowledge representation paradigms have emerged to provide languages for representing ontologies, in particular description logics (discussed below) and frame logics. Web Ontology Language (OWL) is one such language, based on Description Logics (DL). Other logic-based knowledge representation languages include Knowledge Interchange Format (KIF) [26] and Simple Common Logic (SCL) [39].

6.1.5  Description Logics
Description Logics (DL) are closely related to First Order Logic (FOL) and Modal Logic (ML). Research on DL began as a way to deal with the computational problems, of varying complexity, that arise when reasoning in different fragments of FOL. The research started under the label terminological systems, to emphasize that the representation language was used to establish the basic terminology adopted in the modeled domain [40], followed by the label concept languages. DL has now become a cornerstone of the Semantic Web because of its use in designing ontologies.

DL became popular since the focus moved towards the properties of the underlying logical systems. Research on DL covered the theoretical foundation as well as the implementation of knowledge representation systems and the development of applications in several fields. For example, reasoning about database conceptual models; for schema representation in information integration system, or for metadata management; as logical foundation of ontology languages, etc. [40].

Description logics are formal logics with well-defined semantics. The semantics of DL is defined through model-theoretic semantics, which formulates the relationships between the language syntax and the models of a domain. In designing DL, emphasis is placed on the decidability of key reasoning problems and on the provision of sound and complete reasoning algorithms. A key feature of DL is the ability to represent relationships beyond the is-a relationships that can hold between concepts [40].

In DL, the important notions of a domain are described by concept descriptions that are built from concepts (i.e., unary predicates) and roles (i.e., binary predicates) by the use of various concept and role constructors. In addition, it is also possible to state facts about the domain in the form of axioms, which act as constraints on the interpretations in a DL knowledge base [41].

In a DL knowledge base, a distinction is drawn between the TBox (Terminological Box) and the ABox (Assertional Box), which are its two main components. The TBox contains intensional knowledge in the form of a terminology and is built through declarations that describe general properties of concepts. In other words, it contains sentences describing concept hierarchies, i.e. relations between concepts. The ABox contains extensional knowledge, or assertional knowledge, that is specific to the individuals of the domain of discourse [40].

Web Ontology Language (OWL) and its Family Members
The Web Ontology Working Group of W3C identified a number of characteristic use-cases for the semantic Web that would require much more expressiveness than RDF and RDF Schema offer. Researchers in the United States (US) and in Europe identified the need for a more powerful language for building ontologies. In Europe, an ontology language called OIL (Ontology Inference Layer) was developed. In the US, DARPA (Defense Advanced Research Projects Agency) initiated a similar project called DAML (DARPA Agent Markup Language). Later, these two were merged into a single ontology language, DAML+OIL.

DAML+OIL in turn was taken as the starting point by the W3C Web Ontology Working Group in defining OWL (Web Ontology Language), which is intended to be the standardized and broadly accepted ontology language of the Semantic Web. DL is the logical foundation of the OWL ontology language. OWL is built on top of RDF and RDF Schema and adds more vocabulary for describing properties and classes, including relations between classes (e.g. disjointness), cardinality (e.g. exactly one), equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes [42].

The intent of the OWL language is to provide additional machine-processable semantics for resources, that is, to make the machine representations of resources more closely resemble their intended real-world counterparts [43]. To add the capabilities listed below to ontologies, OWL uses both URIs for naming and the description framework for the Web provided by RDF [42]. The added capabilities are:

  1. Ability to be distributed across many systems
  2. Scalability to Web needs
  3. Compatibility with Web standards for accessibility and internationalization
  4. Openness and extensibility

The OWL 1.0 ontology language consists of three sub-languages: OWL Full, OWL DL and OWL Lite. These sub-languages differ in their expressive power. Below we discuss each OWL species along with its advantages and disadvantages.

OWL Full: It is the complete language and it uses all the OWL language primitives. It allows the combination of these primitives in arbitrary ways with RDF and RDF Schema. 

Advantage: It is fully upward-compatible with RDF, both syntactically and semantically. Any legal RDF document is also a legal OWL Full document, and any valid RDF/RDF Schema conclusion is also a valid OWL Full conclusion.  

Disadvantage: Due to its greater expressive power, OWL Full is undecidable and therefore impractical for applications that require complete and efficient reasoning support. A more expressive knowledge base makes reasoning more complex: software agents need more time to process a query (the time can grow exponentially), and complete reasoning is not guaranteed to terminate.

OWL DL: It supports users who want the maximum expressiveness while retaining computational completeness and decidability: all conclusions are guaranteed to be computable, and all computations will finish in finite time. OWL DL includes all OWL language constructs, but they can be used only under certain restrictions (for example, while a class may be a subclass of many classes, a class cannot be an instance of another class). OWL DL corresponds to the SHOIN(D) description logic [40], which is somewhat less expressive than OWL Full.

Advantage: It supports efficient reasoning. 

Disadvantage: We lose full compatibility with RDF. An RDF document will in general have to be extended in some ways and restricted in others before it is a legal OWL DL document. Every legal OWL DL document is, however, a legal RDF document.

OWL Lite: OWL Lite is OWL DL with more restrictions. It corresponds to the less expressive SHIF(D) description logic. For example, OWL Lite excludes enumerated classes, disjointness statements, and arbitrary cardinality. The idea is to make the language easy to start with and easy to implement processors for, so that people can begin with OWL Lite and later graduate to more complicated uses.

Advantage: It is easier to grasp (for users) and easier to implement (for tool builders).

Disadvantage: The expressiveness is more restricted.
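
The differences between the three species can be illustrated with a deliberately simplified check. The sketch below is an assumption-laden toy, not a real species validator: it looks only at the restrictions mentioned above (enumerated classes, disjointness statements, cardinality values other than 0 or 1, and classes used as instances), and the function and parameter names are hypothetical.

```python
# Constructs that OWL Lite excludes according to the text above
# (enumerated classes and disjointness statements), as plain strings.
NOT_IN_LITE = {"owl:oneOf", "owl:disjointWith"}

CARDINALITY_CONSTRUCTS = {
    "owl:cardinality", "owl:minCardinality", "owl:maxCardinality"
}


def uses_arbitrary_cardinality(construct, value):
    """OWL Lite only permits the cardinality values 0 and 1."""
    return construct in CARDINALITY_CONSTRUCTS and value not in (0, 1)


def guess_species(constructs, classes_used_as_instances=False):
    """Return the smallest sub-language consistent with this simplified check.

    constructs: iterable of (construct_name, value) pairs; value may be None.
    classes_used_as_instances: True if some class is an instance of another class,
    which OWL DL forbids.
    """
    if classes_used_as_instances:
        return "OWL Full"
    for name, value in constructs:
        if name in NOT_IN_LITE or uses_arbitrary_cardinality(name, value):
            return "OWL DL"
    return "OWL Lite"
```

For instance, `guess_species([("owl:disjointWith", None)])` returns `"OWL DL"`, because disjointness statements fall outside OWL Lite. Real species checking involves many more conditions than the handful tested here.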

Table 3 shows part of the OWL DL code (expressed in RDF/XML) for the African wildlife ontology drawn in Figure 9 [30].
Figure 9: Classes and subclasses of the African wildlife ontology
Table 3: OWL Ontology

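<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Ontology rdf:about="xml:base"/>
  <!-- Partial listing: the properties eats and is_part_of are referenced below, but their declarations are not shown. -->
  <owl:Class rdf:ID="Animal">
    <rdfs:comment>Animals form a class.</rdfs:comment>
  </owl:Class>
  <owl:Class rdf:ID="Plant">
    <rdfs:comment>Plants form a class disjoint from animals.</rdfs:comment>
    <owl:disjointWith rdf:resource="#Animal"/>
  </owl:Class>
  <owl:Class rdf:ID="Tree">
    <rdfs:comment>Trees are a type of plant.</rdfs:comment>
    <rdfs:subClassOf rdf:resource="#Plant"/>
  </owl:Class>
  <owl:Class rdf:ID="Herbivore">
    <rdfs:comment>Herbivores are exactly those animals that eat only plants or parts of plants.</rdfs:comment>
    <owl:intersectionOf rdf:parseType="Collection">
      <owl:Class rdf:about="#Animal"/>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#eats"/>
        <owl:allValuesFrom>
          <owl:Class>
            <owl:unionOf rdf:parseType="Collection">
              <owl:Class rdf:about="#Plant"/>
              <owl:Restriction>
                <owl:onProperty rdf:resource="#is_part_of"/>
                <owl:allValuesFrom rdf:resource="#Plant"/>
              </owl:Restriction>
            </owl:unionOf>
          </owl:Class>
        </owl:allValuesFrom>
      </owl:Restriction>
    </owl:intersectionOf>
  </owl:Class>
  <owl:Class rdf:ID="Carnivore">
    <rdfs:comment>Carnivores are exactly those animals that eat animals.</rdfs:comment>
    <owl:intersectionOf rdf:parseType="Collection">
      <owl:Class rdf:about="#Animal"/>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#eats"/>
        <owl:someValuesFrom rdf:resource="#Animal"/>
      </owl:Restriction>
    </owl:intersectionOf>
  </owl:Class>
  <owl:Class rdf:ID="Lion">
    <rdfs:comment>Lions are animals that eat only herbivores.</rdfs:comment>
    <rdfs:subClassOf rdf:resource="#Carnivore"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="#eats"/>
        <owl:allValuesFrom rdf:resource="#Herbivore"/>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>
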
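
As a rough illustration of how software might read an ontology such as the one in Table 3, the following Python sketch parses a small, self-contained RDF/XML fragment using only the standard library. It is a minimal sketch under stated assumptions: the fragment is a hand-written stand-in (not the full ontology), the function name `read_classes` is hypothetical, and anonymous restriction classes are ignored. A real application would more likely use a dedicated RDF or OWL library.

```python
import xml.etree.ElementTree as ET

# Standard W3C namespace URIs for RDF, RDF Schema and OWL.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# A small fragment in the spirit of Table 3 (assumed for illustration only).
FRAGMENT = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:ID="Animal"/>
  <owl:Class rdf:ID="Plant">
    <owl:disjointWith rdf:resource="#Animal"/>
  </owl:Class>
  <owl:Class rdf:ID="Tree">
    <rdfs:subClassOf rdf:resource="#Plant"/>
  </owl:Class>
</rdf:RDF>
""".strip()


def referenced_names(cls, tag):
    """Collect the local names referenced via rdf:resource by child elements `tag`."""
    names = []
    for element in cls.findall(tag):
        resource = element.get(f"{{{RDF}}}resource")
        if resource:  # skip anonymous (nested) class expressions
            names.append(resource.lstrip("#"))
    return names


def read_classes(rdf_xml):
    """Map each named class to its directly stated superclasses and disjoint classes."""
    root = ET.fromstring(rdf_xml)
    result = {}
    for cls in root.findall(f"{{{OWL}}}Class"):
        name = cls.get(f"{{{RDF}}}ID")
        result[name] = {
            "subClassOf": referenced_names(cls, f"{{{RDFS}}}subClassOf"),
            "disjointWith": referenced_names(cls, f"{{{OWL}}}disjointWith"),
        }
    return result


classes = read_classes(FRAGMENT)
```

Here `classes["Tree"]["subClassOf"]` is `["Plant"]` and `classes["Plant"]["disjointWith"]` is `["Animal"]`. The sketch only reads what is stated explicitly; it performs no reasoning.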
6.1.6  Trust Layer
At the top of the pyramid is the trust layer, which is a high-level and crucial concept. The Web will achieve its full potential only when users have trust in its operation (security) and in the quality of information provided. The trust layer can emerge through the use of digital signatures and other kinds of knowledge, based on recommendations by trusted agents or on rating and certification agencies and customer bodies [44].

Each layer in the Semantic Web layer cake is seen as building on the layer below. Each layer is progressively more specialized and also tends to be more complex than the layers below it. The layers can be developed and made operational relatively independently [44]. 

7. Semantic Web and Natural Language Processing

In this section we explore the possibility of using NLP in Semantic Web.
 As stated in [49], it is entirely appropriate, indeed highly desirable, to apply NLP methods to the foundations of the Semantic Web. If this happens, the dream of the Semantic Web may soon become a reality. Dini [50] stated that NLP can help the Semantic Web in two phases: in the acquisition phase (i.e., at the time of building the Semantic Web) and in the retrieval phase (i.e., at the time of accessing the Semantic Web). The acquisition phase matters because building the Semantic Web requires very accurate tagging algorithms. The retrieval phase matters because, when querying the Semantic Web, NLP could help to expose semantic resources through simple but smart search interfaces.
 A number of recent papers focus on the possibility of automatically tagging Web pages with RDF descriptions. Tagging has always been one of the most popular tasks in NLP experiments, and it is obviously tempting to assume that a completely tagged Web could be achieved simply by applying tagging algorithms [50]. Automatic classification has now reached a satisfying degree of accuracy and can be a valuable help in extracting RDF descriptions, but it is definitely not enough. From the Semantic Web perspective, it is not sufficient to say that a certain web page is about an institution. The resource described in that page (e.g., an Organization) needs to be qualified: the year of establishment, the courses it offers, the place where it is located, etc. To do that, a tagging application should also be able to gather missing information from different websites and create links to other resources [50].
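
 The kind of output such a tagging application would produce can be sketched as subject-predicate-object triples. The Python example below is a naive, pattern-based stand-in for real NLP tagging (which would typically involve named entity recognition and parsing); the patterns, the property names such as `yearOfEstablishment`, and the sample text are all hypothetical.

```python
import re

# Hypothetical surface patterns, each paired with an illustrative property name.
PATTERNS = [
    (re.compile(r"(?P<s>[A-Z][\w ]+?) was established in (?P<o>\d{4})"),
     "yearOfEstablishment"),
    (re.compile(r"(?P<s>[A-Z][\w ]+?) is located in (?P<o>[A-Z]\w+)"),
     "locatedIn"),
    (re.compile(r"(?P<s>[A-Z][\w ]+?) offers courses in (?P<o>[\w ]+)"),
     "offersCourse"),
]


def extract_triples(text):
    """Return (subject, property, object) triples found by the patterns above."""
    triples = []
    # Split into sentences at whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for pattern, prop in PATTERNS:
            match = pattern.search(sentence)
            if match:
                triples.append(
                    (match.group("s").strip(), prop, match.group("o").strip())
                )
    return triples


sample = (
    "Example Institute was established in 1962. "
    "Example Institute is located in Kolkata. "
    "Example Institute offers courses in library science."
)
triples = extract_triples(sample)
```

 Each triple qualifies the same resource with one more property, which is the step that plain page classification ("this page is about an institution") does not provide. Gathering the missing values from other websites and linking to other resources is beyond this sketch.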
 In summary, NLP can be applied in the Semantic Web to build knowledge bases, to construct ontologies, and to support ontology learning. Note that research on the use of natural language processing in the Semantic Web is still at an early stage, and a lot of research is currently going on in this area.

8. Conclusion

We know that natural language is the most practical means for users to interact with an information retrieval system. Users feel comfortable constructing queries in natural language. However, the system often fails to meet the user’s information need: it retrieves many irrelevant documents and very few that actually meet the need, so the user has to spend a lot of time filtering out the relevant documents before using them. This happens mainly because of the ambiguous nature of natural language. Common phenomena of natural language that cause such ambiguity include homography, complementary polysemy, metonymy and metaphor.

In this module we discussed the use of natural language processing techniques in information retrieval. We also discussed some of the important natural language processing tasks, mainly, the tasks carried out at the syntactic and semantic levels. We discussed the semantic techniques and technologies in view of the Semantic Web, an extension of the present World Wide Web. We also discussed the use of natural language processing in Semantic Web.


9. References
  1. Information retrieval: searching in the 21st century. Ayse Goker and John Davies (ed.). UK: Wiley, 2009.
  2. Robin (2010). Natural language understanding.
  3. Manning, Christopher D., Raghavan, Prabhakar and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.
  4. What is Tokenization?
  5. Huang, C., Simon, P., Hsieh, S., & Prevot, L. (2007). Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word break Identification. In the Proceeding of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 69-72.
  6. Chowdhury, G. G. Introduction to modern information retrieval. London: Library Association Publishing, 1999.
  7. Aho, A. V., Sethi, R. and Ullman , J. Compilers: principles, techniques, and tools. Boston, USA: Addison-Wesley Longman Pub. Co., 1986.
  8. Vallez, Mari, and Pedraza-Jimenez, Rafael (2007). Natural language processing in textual information retrieval and related topics., n. 5.
  9. Liddy, Elizabeth D. Natural language processing for information retrieval and knowledge discovery.
  10. Allan, J. (2000). NLP for IR – Natural Language Processing for Information Retrieval. NAACL/ANLP language technology joint conference, Washington, USA.
  11. Sanderson, M. (2000). Retrieving with good sense. Information Retrieval, vol. 2, pp. 49-69.
  12. Hopcroft, John E. and Ullman, Jeffrey D. Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, 1979.
  13. Context-free grammar.
  14. Chomsky, Noam. Syntactic structures. Mouton & Co., 1957.
  15. Charniak, E. and McDermott, D. Introduction to artificial intelligence. Vol. 1. London: Pitman, 1981.
  16. Aho, Alfred, Lam, Monica, Sethi, Ravi and Ullman, Jeffrey. Compilers: Principles, Techniques, and Tools, ed. 2, Prentice Hall, 2006.
  17. Porter, Martin F. (1980). An algorithm for suffix stripping. Program, 14 (3), pp. 130-137.
  18. Lovins, Julie Beth (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11 (1), pp. 22-31.
  19. Paice, Chris D. (1990). Another stemmer. SIGIR Forum 24 (3), pp. 56-61.
  20. Lemmatisation.
  21. Syntactic and Semantic Analysis and Knowledge Representation.
  22. Kemp, D. Computer-based knowledge retrieval. London: Aslib, 1988.
  23. Syntactic and Semantic Analysis and Knowledge Representation.
  24. Markoff, John (2007). Start-Up Aims for Database to Automate Web Searching. The New York Times.
  25. Sowa, J. F. (1987). Semantic Networks
  26. Garnham, A. Artificial intelligence: an introduction. London: Routledge & Kegan Paul, 1988.
  27. Salton, G. Automatic text processing: the transformation, analysis and retrieval of information by computer. MA: Addison-Wesley, 1989.
  28. Semantic Web Made Easy.
  29. Antoniou, Grigoris and Harmelen, Frank van. A semantic web primer. London: MIT Press, 2004.
  30. Dutta, B. and Prasad, A. R. D. Semantic e-learning system: theory, implementation and applications. Germany: LAP, 2013, pp. 216, ISBN 978-3-659-18318-8.
  31. Berners-Lee, T., Hendler, J. and Lassila, O. (2001). The Semantic Web: a new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American
  32. Dutta, B. (2006). Semantic Web Technology: Towards Meaningful Retrieval. SRELS Journal of Information Management, 43 (2), pp. 149-154.
  33. Dutta, B. (2008). Semantic Web Services: A Study of Existing Technologies, Tools and Projects. DESIDOC Journal of Library and Information Technology, 28 (3), pp. 47-55.
  34. Berners-Lee, T., Connolly, D., Kagal, L., Scharf, Y. and Hendler, J. (2006). N3Logic: a logical framework for the World Wide Web.
  35. Davies, J., Fensel, D. and Harmelen, Frank van. Towards the semantic web. West Sussex: John Wiley, 2003.
  36. MathML.  
  37. Chemical Markup Language (CML).
  38. Resource Description Framework (RDF) Model and Syntax Specification: W3C Recommendation, 22 Feb. 1999.
  39. Altheim, M., Anderson, B., Hayes, P., Menzel, C., Sowa, J. F., and Tammet, T. SCL: Simple Common Logic.
  40. Description Logic Handbook: Theory, Implementation and Applications. Ed. by F. Baader, D. Calvanese, D.L. McGuinness, D. Nardi, P.F. Patel-Schneider. Cambridge University Press, 2003.
  41. Agarwal, S. (2007). Formal Description of Web Services for Expressive Matchmaking. Doctoral thesis. http://www.
  42. Web Ontology Language.
  43. RDF primer, 2004.
  44. Lassila, O. Towards the semantic web. 
  45. Aristotle's Categories, 2007.
  46. Giunchiglia, F., Dutta, B. and Maltese, V. (2009). Faceted lightweight ontologies. Conceptual Modeling: Foundations and Applications, Alex Borgida, Vinay Chaudhri, Paolo Giorgini and Eric Yu (Eds.), LNCS 5600 Springer.
  47. Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), pp. 199–220.
  48. Studer, R., Benjamins, V. R. and Fensel, D. (1998). Knowledge engineering: principles and methods.
  49. Wilks, Yorick  and Brewster, Christopher (2009). Natural Language Processing as a Foundation of the Semantic Web. Foundations and Trends in Web Science, 1(3–4), 199‐327. doi:
  50. Dini, Luca (2004). NLP technologies and the semantic web: risks, opportunities and challenges. Intelligenza Artificiale 1(1), pp. 67-71. 
