Thursday, February 12, 2015

07. Major ISAR models P- 06. Information Storage and Retrieval

Suggestions from all of you are respectfully invited for the creation of this blog; please send in your suggestions and entries. Its scope is the entire world knowledge community, and it aims to make an important contribution to the career building of all competitive-exam candidates. You can send your suggestions to this e-mail address: chandrashekhar.malav@yahoo.com


By: Dr P.M Devika, Paper Coordinator


1. Introduction

In the present era of ICT, information has become an important entity for every individual, organization and institution that needs to keep pace with a changing environment. Information plays a central role in the development of any organization: in the products and services it provides, in its marketing, and in user satisfaction. Information is also a key entity for the growth and development of society. Every fact can be considered information. Information is produced according to the needs of society and supplied on the basis of users' needs.
Definitions (Data, Information, Knowledge, Wisdom)
What is data?
Data is a collection of facts or figures. The word comes from Latin and literally means ‘anything that is given’. The singular form of data is ‘datum’. Data is the smallest unit of any kind of information. According to the Oxford Encyclopedic English Dictionary, ‘data are known facts or things used as a basis for inference or reckoning’.
UNESCO defines data as ‘facts, concepts or instructions in a formalized manner suitable for communication, interpretation or processing by humans or automatic means’.
The McGraw-Hill Encyclopedia of Science and Technology defines data as ‘numerical or qualitative values derived from scientific experiments’.
NIWCDEL (1996) defines data as “facts or figures from which conclusions may be drawn.”
In the field of information science, Shuman defines data as ‘quantitative facts derived from experimentation, calculations or direct observations’. Shuman holds that a more meaningful definition of data is ‘the symbolization of knowledge’.
Data by itself has no shape or meaning relevant to a particular viewpoint. It must be given relevance, arrangement, coherence and usefulness within a definite framework of meaning, intent or interest.
What is Information?
Since the beginnings of human civilization, information has been an important entity for growth and development, including the improvement of living standards. In the present era information plays a central and pivotal role in the socio-economic development of any country, and the growth of a society depends on its information richness. Information is the component of any area (subject, field, society etc.) that allows it to keep pace with a changing scenario. It is among the most demanded items of the present day, and as a result many agencies, organizations and associations are working to satisfy this demand. Information has been defined by different experts, for example:
Fritz Machlup, the well-known scholar, says that information is a process: a flow of messages, involving the act of telling or being told.
What is knowledge?
Knowledge is the state of knowing. Everything that we know is considered knowledge. Knowledge is an organized set of statements of facts or ideas, presenting a reasoned judgment or an experimental result, which is transmitted to others through some communication medium in a systematic form. Knowledge can be acquired by thinking, observing, reading, listening, research etc. What we grasp or perceive becomes our knowledge; because our own mind is involved, knowledge is considered personal. NIWCDEL (1996) defines knowledge as:
“(i) all that has been perceived or grasped by mind; learning; enlightenment;
  (ii) the body of facts, principles etc. accumulated by mankind.”
What is wisdom?
Wisdom is the ability to judge correctly and to follow the best course of action, based on knowledge and understanding (Lockyer p.1103). Wisdom means understanding the consequences of our actions or words before we act or speak. It also refers to having the knowledge and understanding to recognize the right course of action, and the will and courage to follow it. (Ref. http://www.christianbiblereference.org/introduc.htm)
The terms defined above are interrelated and are arranged in the following order:
Data → Information → Knowledge → Wisdom
The concepts of data, information, knowledge and wisdom are the building blocks of library and information science. Discussions and definitions of these terms pervade the literature, from introductory textbooks to theoretical research articles (see Zins, 2007). Expressions linking some of the concepts predate the development of information science as a field of study (Sharma 2008), but the first to put all the terms into a single formula was Russell Lincoln Ackoff, in 1989. Ackoff arranged data, information, knowledge and wisdom in a hierarchy, with wisdom at the top and data at the bottom.
[Figure: the data-information-knowledge-wisdom hierarchy]

(Reference: The Data-Information-Knowledge–Wisdom Hierarchy and Its Antithesis, Article; Bernstein, Jay H.; Kingsborough Community College)

What is Information Retrieval (IR)?
The process of seeking or searching for particular information is known as information retrieval. People often assume that IR means searching for documents or searching for information on the internet, but IR is broader than that: it covers the searching of any kind of relevant information. IR is the art and science of retrieving, from a collection of items, a subset that serves the user’s purpose. A general definition of IR that can be applied to many types of information and search applications was given by Salton in 1968:
“Information Retrieval is a field concerned with the structure, analysis, organization, storage, searching and retrieval of information.”
Advantages of Information Retrieval:
Information retrieval is the process of searching for particular information in a particular source.
Brief history of Information Retrieval:
The term information retrieval was coined by Calvin Mooers in the year 1952 and gained popularity in the research community from 1961 onwards. In a narrow sense, information retrieval is the searching of specific information, such as the marks of a particular student. More broadly, it is concerned with all activities related to the organization and processing of, and access to, information of all forms and formats.
Issues in IRS Design and Usage
An information retrieval system (IRS) is designed to enable users to find relevant information in a stored and organized collection of documents. It deals mainly with unstructured data. The main aim of an IRS is to retrieve either the actual information or the documents containing that information. We use a number of IRSs in our daily life, for example when searching a library OPAC, searching for information on the web, or retrieving information on an institutional intranet. An IRS covers a broad category of information sources (documents, multimedia information etc.) which are indexed by the system for retrieval. An IRS facilitates fast access to and easy maintenance of information because the documents are stored at one or more places in the form of databases. A conceptual overview of an IRS is given below:

[Figure: conceptual overview of an IRS]
In an IRS, the retrieval process is strongly influenced by the relevance and appropriateness of the information. The major issue is query formulation: sometimes users cannot express their need in the form of a query, so the system cannot provide results that match their demand. An IRS is a set of interacting components under human control, operating together to achieve an intended purpose. Its design is based on a series of choices, in which the creator chooses the appropriate elements and fits them to the proposed objective of the system. If an IRS is not designed carefully, it will not work properly and will not satisfy the information needs of its users.

2. Basic Concepts and Components of IR Systems

Data: It is the smallest unit of information which can be processed.
The database: A database is a large, persistent, integrated collection of dynamic data, together with operations to describe, establish, manipulate and access this data. The main aim of a database is to record and maintain information. The Chambers Science and Technology Dictionary defines the database as “a collection of structured data independent of any particular application”. The Macmillan Dictionary of Information Technology defines a database as “a collection of interrelated data stored so that it may be accessed by users with simple user-friendly dialogues.”
Records and Fields:
Records- A record is a group of related information and is the unit of information in a database. It is generally what users want when searching a database. An example is a book card in a library catalogue, which describes the title, author, subject etc. of a book. A record is composed of fields and subfields.
Fields- A field holds a simple data value such as an integer, a real number or text. In other words, a field is a named part of a record: fields are the elements or particular segments of information that make up a record. The items described on the book card of a library catalogue, such as title and author, are examples of fields.
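The relationship between a record and its fields can be sketched in a few lines of Python. This is only an illustration: the class name `BookRecord` and the sample values are hypothetical, chosen to mirror the book-card example above.

```python
from dataclasses import dataclass, fields


@dataclass
class BookRecord:
    """One catalogue record; each attribute below is a field (hypothetical layout)."""
    title: str
    author: str
    subject: str


# A single record: a group of related field values describing one book.
record = BookRecord(title="Sample Title", author="Sample Author", subject="Sample Subject")

# The field names that make up the record, in declaration order.
field_names = [f.name for f in fields(record)]
```

A database would hold many such records, and a search typically matches a query against one or more of their fields.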
Properties of database:
A database permits retrieval of information to satisfy a wide variety of users' information needs. It also avoids the duplication of data. Some properties of a database are discussed below:
Improved availability- A database is a collection of a huge amount of interlinked data, which makes it easy to find any information within it. The availability of information therefore increases.
 Reduced redundancy- A database reduces the duplicated storage of the same data.
Accuracy- A database can provide exact and accurate data. Links to related information can be found within the same database.
Program and file consistency- A database facilitates centralized control over access to information.
User-friendly- A user interacts with a database through an interface in which information needs are expressed in the form of queries.
Improved security- A database can provide improved security control over access to information.
Different types of Database
Hierarchical Database: It is a collection of records that are interrelated through links. Data is organized in a tree-like structure, in which one-to-one and one-to-many relationships exist between records, as in the following diagram:
[Figure: example of a hierarchical database]

Relational Database: It is a collection of tables, all of which are formally described and organized according to the relational model.
[Figure: example of a relational database]
Network Database:
A database which represents objects and their relationships. Its schema can be viewed as a graph in which object types are nodes and relationship types are arcs. In a network database, data are interrelated and organized in the form of a net.
[Figure: example of a network database]

Object-oriented Database: In an object-oriented database, data is organized as a graph of objects, where each object has a number of attributes. Attributes can be simple values or complex values that reference other objects.
[Figure: example of an object-oriented database; recoverable label: Customer]
Growth of Database in an IR Environment


3. Database Management Systems


What is DBMS (Database Management System)?
A DBMS is a collection of interrelated data and a set of programs to access those data. The collection of related data, which has an implicit meaning, is the database.
Advantages of DBMS:
i) Data independence: The DBMS provides an abstract view of the data that insulates application code from the details of data representation and storage.
ii) Efficient data access: The DBMS uses a variety of sophisticated techniques to store and retrieve data efficiently. This feature is especially important if the data is stored on external storage devices.
iii) Data integrity and security: The DBMS can enforce integrity constraints on the data if the data is always accessed through the DBMS. It also enforces access controls that govern what data is visible to different classes of users.
iv) Data administration
v) Reduced application development time
ACID Property: ACID stands for Atomicity, Consistency, Isolation and Durability. In the context of databases, ACID is the set of properties that guarantees the reliable processing of database transactions. These properties were defined by Jim Gray in the 1970s, and he also developed technologies for achieving them automatically.
Atomicity: It is an all-or-none proposition: either every operation in a transaction takes effect or none does.
Consistency: It guarantees that a transaction never leaves the database in a half-finished or invalid state.
Isolation: It keeps transactions separated from each other until they are finished.
Durability: It guarantees that the database keeps track of pending changes in such a way that the server can recover from an abnormal termination.
Database developers keep these rules and characteristics in mind while developing a transaction system. The developers of the components that take part in a transaction are assured that these characteristics are in place; they do not need to manage them themselves.
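The all-or-none behaviour of a transaction can be illustrated with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the `account` table, the `transfer` helper and the balances are hypothetical, and SQLite is used only because it ships with Python, not because the original text refers to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK constraint is the integrity rule to be enforced.
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()


def transfer(connection, src, dst, amount):
    """Move `amount` from src to dst as one transaction; return True on success."""
    try:
        # The connection context manager commits if the block succeeds
        # and rolls back every statement in it if an exception is raised.
        with connection:
            connection.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            connection.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "a", "b", 30)       # both updates are applied
failed = transfer(conn, "a", "b", 500)  # the debit violates the CHECK; the credit is undone
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

In the failed transfer the credit to `b` has already been executed when the debit from `a` is rejected, yet neither change survives. That is atomicity, and the rejected negative balance is an example of a consistency rule being enforced.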
Data models:
A data model is a collection of tools for describing data, data relationships, data semantics and constraints. There are basically two types of data models, which are further divided as follows:
1. Object-based logical model
(a) Entity-relationship model
[Figure: entity-relationship model for a national parks database (Ref. Database Management System by R.G. Healey)]
(b) Object-oriented model
(c) Functional model
[Figure: functional model]
2. Record-based logical model
(a) Relational model
(b) Network model
[Figure: parts of a national park database in the network model (Ref. Database Management System by R.G. Healey)]
(c) Hierarchical model
[Figure: hierarchical model]
DBMS Languages:
Database systems provide different languages to specify the database schema and to express queries and updates.
Data Definition Language (DDL):
It is the database language that specifies the schema of the database. Executing a DDL statement also updates a special set of tables called the data dictionary or data directory. The data dictionary consists of metadata (data about data), and it is consulted whenever data is read or modified. The DDL defines the structure of the database, brings out the relationships between records and the indexing strategies, and forms the link between the logical and physical views of the data. A schema is the logical structure of the database, and a subschema is the part of the database used by an application program. The DDL is used to construct subschemas, and a single database can serve more than one subschema at a time.
Data Manipulation Language (DML)
It provides a set of procedural commands to process the data. It also provides the linkage between the logical view of the data and its physical location: the DML is used to access data by its logical names rather than its physical storage locations. A DML can generally be used from several high-level programming languages.
Data Control Language (DCL)
It is a subset of Structured Query Language that controls access to database objects and data.
Brief Overview of Structured Query Language (SQL)
Structured Query Language (SQL) is a declarative programming language designed to create, transform and retrieve information in a database. It was developed by IBM in the early 1970s. It is used for creating and querying relational databases. SQL uses a set of commands to manipulate the data in a database: it can insert, modify and delete data.
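The difference between DDL and DML statements can be seen in a short session run through Python's built-in `sqlite3` module. The `book` table, its columns and the sample rows are hypothetical and serve only as an illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the schema (hypothetical catalogue table).
conn.execute(
    "CREATE TABLE book ("
    "id INTEGER PRIMARY KEY, "
    "title TEXT NOT NULL, "
    "author TEXT NOT NULL, "
    "year INTEGER)"
)

# DML: insert, modify and delete data.
conn.executemany(
    "INSERT INTO book (title, author, year) VALUES (?, ?, ?)",
    [
        ("Sample Title A", "Author One", 2001),
        ("Sample Title B", "Author Two", 2010),
        ("Sample Title C", "Author One", 2015),
    ],
)
conn.execute("UPDATE book SET year = 2011 WHERE title = ?", ("Sample Title B",))
conn.execute("DELETE FROM book WHERE title = ?", ("Sample Title A",))
conn.commit()

# Query: a declarative statement of which rows are wanted, not how to find them.
rows = conn.execute(
    "SELECT title, year FROM book WHERE year > ? ORDER BY year", (2005,)
).fetchall()
```

DCL statements such as GRANT and REVOKE are not shown because SQLite, an embedded single-user engine, does not implement them; multi-user relational systems generally do.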

4. The Physical Organization of Data


  • Record Structure
A record structure arranges all the elements of a record in an organized manner so that the data is structured. The record is the simplest unit of organization, and records that are organized meaningfully yield meaningful data. Common ways of structuring records are to use fixed-length records, to give every record a fixed number of fields, to begin each record with a length indicator, to use an index to keep track of record addresses, or to place a delimiter at the end of each record.
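Two of the record structures listed above, fixed-length records and records that begin with a length indicator, can be sketched with Python's `struct` module. The field widths (20 bytes for the title, 10 for the author, 2 for the year) and the function names are assumptions made for the example only.

```python
import struct

# Hypothetical fixed-length layout: 20-byte title, 10-byte author, 2-byte unsigned year.
FIXED = struct.Struct("<20s10sH")


def pack_fixed(title, author, year):
    """Encode one record; short strings are padded with zero bytes by struct."""
    return FIXED.pack(title.encode("utf-8"), author.encode("utf-8"), year)


def unpack_fixed(raw):
    """Decode one fixed-length record and strip the zero-byte padding."""
    title, author, year = FIXED.unpack(raw)
    return (
        title.rstrip(b"\x00").decode("utf-8"),
        author.rstrip(b"\x00").decode("utf-8"),
        year,
    )


def pack_length_prefixed(text):
    """Encode a variable-length record preceded by a 2-byte length indicator."""
    body = text.encode("utf-8")
    return struct.pack("<H", len(body)) + body


def read_length_prefixed(buffer):
    """Walk a buffer of length-prefixed records and return them as strings."""
    records, pos = [], 0
    while pos < len(buffer):
        (length,) = struct.unpack_from("<H", buffer, pos)
        pos += 2
        records.append(buffer[pos:pos + length].decode("utf-8"))
        pos += length
    return records
```

With fixed-length records the n-th record always starts at byte `n * FIXED.size`, so it can be reached directly; with a length indicator the records are compact but must be walked one after another unless an index is kept.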
  • File Structure
A file is a collection of records. File structure deals with the order and arrangement of files by computers within a database. There are several file formats for arranging data, some of which are used only for specific types of files. For example, PNG is a file format that is used only for the storage of bitmapped images.
  • Order of Record
It refers to the arrangement of records in a file within a database: how the data is recorded or stored, and the methods, sequence and placement used. Ordered records make it easier to find data in a database. For example, the records of the employees of a library may be arranged according to their job positions.
  • Finding Record
Record finding is the process of retrieving a particular record, or particular information within a record. A record may contain information about a book, such as its title, author, ISBN and publication details.

  • Organizational Method
    • Sequential File Structure
It is a format for the storage of records in which all records, usually of the same length, are arranged in a physical order. Sequential files are read from beginning to end, which makes them efficient for operations that touch every record, such as finding averages. Unordered sequential files are known as pile files.
  • Index File Structure
An index file is an auxiliary file that makes it more efficient to search for a record in the data file. An index, also known as an access path, is usually specified on a single field of the file, although it can be specified on several fields. An index file occupies less space than the data file because its entries are much smaller.
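The idea of an index as an access path can be sketched as a mapping from a key field to the byte offset of the record in the data file. The `key|value` record layout, the sample keys and the function names below are hypothetical; an in-memory `io.BytesIO` object stands in for a file on disk.

```python
import io


def build_index(data_file):
    """Scan a newline-delimited file of 'key|value' records once,
    mapping each key to the byte offset where its record starts."""
    index = {}
    data_file.seek(0)
    while True:
        offset = data_file.tell()
        line = data_file.readline()
        if not line:
            break
        key = line.split(b"|", 1)[0].decode("utf-8")
        index[key] = offset
    return index


def fetch(data_file, index, key):
    """Jump straight to the record for `key` instead of reading sequentially."""
    data_file.seek(index[key])
    return data_file.readline().rstrip(b"\n").decode("utf-8")


# Hypothetical data file with three records of 20 bytes each.
data = io.BytesIO(
    b"B001|Sample Title A\n"
    b"B002|Sample Title B\n"
    b"B003|Sample Title C\n"
)
index = build_index(data)
second = fetch(data, index, "B002")
```

Each index entry holds only a key and an offset, which is why an index is much smaller than the data file it points into.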
  • Lists
  • Trees
A tree is a structure consisting of nodes (vertices), each holding node information together with pointers that give access to further nodes of the tree. A tree organization supports operations such as searching for a record, inserting a new record and deleting a record. A tree search starts at the root and compares the search key with the key values attached to the nodes it visits. Two common types of tree are the binary search tree and the balanced tree.
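The search and insert operations described above can be sketched for a binary search tree as follows. This is a minimal, unbalanced version; the class and function names are illustrative, and deletion and rebalancing are left out.

```python
class Node:
    """A tree node: a key, the record stored under it, and two child pointers."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.left = None
        self.right = None


def insert(root, key, value):
    """Insert a record and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    elif key > root.key:
        root.right = insert(root.right, key, value)
    else:
        root.value = value  # existing key: overwrite the stored record
    return root


def search(root, key):
    """Compare the search key with node keys, starting at the root."""
    node = root
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


root = None
for k in (50, 30, 70, 20, 40):
    root = insert(root, k, f"record-{k}")
```

Each comparison discards one subtree, so a search visits at most as many nodes as the tree is high. A balanced tree keeps that height small even when keys arrive in sorted order.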
  • Parsing of Data Element
    • Phrase Parsing: To resolve a sentence into its component parts of speech and describe them grammatically, by stating each part of speech and explaining its inflections and syntactical relationships.
    • Word Parsing: To analyze and examine a word minutely with regard to its grammatical structure, including its phonology.

5. Querying of the Information Retrieval System


The process of formulating and processing queries is referred to as querying. Querying involves the following concepts:
  • Sets and Subsets
An IRS should treat a query in terms of sets and subsets and arrange the retrieved items according to their relevance. The relevant items form a set, and the partially relevant items are kept in subsets.
  • Relational Statements
Related search results provided by an IRS are considered relational statements. Relational statements may also be used as search queries. The results may be relevant, partially relevant or irrelevant.
  • Boolean Query Logic
Here queries are formulated with the logical operators AND, OR and NOT. It is the most widely used method of query formulation in an IRS. With Boolean logic users can specify their requirements precisely, and the IRS can respond to them in a specific manner.
  • Ranked and Fuzzy Sets
  • Similarity Measures

6. ISAR Systems: Functions and Design


  • User Interface System
A user interface is a system through which users interact with machines; in other words, it is where users can interact with the software or hardware in a natural, intuitive way. A good user interface is user-friendly.
User interfaces can be designed for either hardware or software, but in most cases they combine both.
  • Query Processing System
In this system the search terms entered in natural language are processed into a form that the machine can understand and are matched against the index terms; the results are then displayed in natural language that the user can recognize. Query processing is the most challenging task of an information retrieval system, because it must meet the users' information need in exactly the form in which they want the information.
  • Database Modeling System
It is the system concerned with the design of a database. A database modeling system is able to determine the requirements of a database design. Database design is the first step in developing a text retrieval system, and it involves a number of decisions, on the following points:
i) Nature of data
ii) Nature of fields and subfields
iii) Number of fields and subfields
iv) Format for display and printing of data
v) Sorting of data while printing
vi) The entry and editing of data

  • Information Retrieval from Database

  • Sampling of Information Retrieval Systems (OPAC, Dialog, Google, EBSCO, PubMed)

7. System Computed Relevance and Ranking


  • Retrieval Status Value
  • Ranking
 Ranking refers to the order in which relevant items are presented when records are retrieved. The retrieved items are ranked according to whether they are relevant, partially relevant or non-relevant to the query. Ranking is done by matching the query against each retrieved item.
  • Methods of Evaluating the RSV
    • Vector Space Model
This model assumes that an available term set, called a term vector, is used for both the stored records and the information requests. Collectively, the terms assigned to a given text represent its content. Consider a collection of documents in which each document is characterized by one or more index terms. The documents are then the objects in the collection, each of which is represented by a number of properties (its terms).
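A small sketch of how term vectors can be compared is given below. It makes two assumptions that the text above does not specify: terms are weighted by their raw frequency, and similarity is measured by the cosine of the angle between the query vector and the document vector. The sample documents and function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a vector of raw term frequencies (an assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term vectors; 0.0 if either vector is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return the ids of matching documents, most similar first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), doc_id) for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


documents = {
    "d1": "information retrieval systems",
    "d2": "database management systems",
    "d3": "information retrieval and information storage",
}
ranking = rank("information retrieval", documents)
```

Unlike a Boolean match, the similarity score is graded, so the retrieved documents can be presented in decreasing order of similarity to the query.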
  • Probabilistic Model
The model was proposed by Maron and Kuhns in 1960 for probabilistic retrieval and is based on the mathematical theory of probability. It holds that the relevance of a document to a user can be assessed by calculating, for each document in the collection, the probability that a user submitting a particular query would judge that document relevant. The probabilistic approach is based on two parameters:
i) The probability of relevance: Pr(relevance)
ii) The probability of non-relevance: Pr(non-relevance)
 Pr(non-relevance) = 1 - Pr(relevance)
  • Boolean Model
George Boole (1815-64) devised a system of symbolic logic in which he used three operators, +, x and -, to combine statements in symbolic form. John Venn later expressed Boolean logic relationships through what are known as Venn diagrams. The three operators of Boolean logic are the logical sum (+), the logical product (x) and the logical difference (-). The logical product, also known as AND, allows users to specify the coincidence of two or more concepts. For example:
(COMPUTERS)       AND      (INFORMATION RETRIEVAL)
The logical sum, also known as OR, allows users to specify alternatives among search terms, such as:
(COMPUTERS)       OR        (INFORMATION RETRIEVAL)
The logical difference, also known as NOT, facilitates the exclusion of items from a set. For example:
(INFORMATION RETRIEVAL)     AND     NOT   (DBMS)
The Boolean model is the simplest retrieval model, which is why it forms the basis of most DBMS and IR systems. Information retrieval systems of all kinds, including OPACs, CD-ROM and online databases, web search engines and digital libraries, make extensive use of Boolean search operators.
Boolean searching also has some limitations:
i) It is difficult for users to formulate a search query by combining Boolean operators.
ii) Users cannot predict a priori exactly how many items will be retrieved for a given query.
iii) Boolean searching judges an item as relevant strictly by whether a given query term is present or absent in the record.
iv) Retrieved items cannot be ranked in decreasing order of relevance.
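The three Boolean operators map directly onto set operations over an inverted index, a table that lists, for each term, the documents containing it. The sketch below uses Python sets; the three sample documents and the helper names are hypothetical.

```python
def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def postings(index, term):
    """Documents containing `term`; an empty set if the term is not indexed."""
    return index.get(term.lower(), set())


documents = {
    1: "computers and information retrieval",
    2: "information retrieval and dbms",
    3: "computers in libraries",
}
index = build_inverted_index(documents)

# AND is set intersection, OR is set union, AND NOT is set difference.
and_result = postings(index, "computers") & postings(index, "retrieval")
or_result = postings(index, "computers") | postings(index, "retrieval")
not_result = postings(index, "retrieval") - postings(index, "dbms")
```

Each result is a plain set of document ids with no score attached, which illustrates limitations iii) and iv): a document is either in the set or not, and the members of the set cannot be ranked.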

8. Evaluation and Measurement of Information Retrieval System

  • Need for Evaluation
An information retrieval system is evaluated and measured to determine how efficiently and effectively it performs retrieval.
  • Different Evaluation Criteria
Different scientists have proposed different criteria for evaluating an information retrieval system. In 1966 Cleverdon suggested six criteria:
i) Recall - the ability of the system to present all the relevant items.
ii) Precision - the ability of the system to present only those items that are relevant.
iii) Time lag - the average interval between the time the search request is made and the time an answer is provided.
iv) Effort - intellectual as well as physical, required from users in obtaining answers to their search requests.
v) Form of presentation - of the search output, which affects the user’s ability to make use of the retrieved items.
vi) Coverage of the collection - the extent to which the system includes relevant matter.
Lancaster proposed five evaluation criteria in 1971:
i) Coverage of the system
ii) Ability of the system to retrieve wanted items
iii) Ability of the system to avoid retrieval of unwanted items
iv) The response time of the system
v) The amount of effort required by users
Vickery also proposed six criteria, in two groups:
Group 1:
  • Coverage - the proportion of the total potentially useful literature that has been analyzed
  • Recall - the proportion of such references that are retrieved in a search
  • Response time - the average time needed to obtain a response from the system
Group 2:
  • Precision - the ability of the system to screen out irrelevant references
  • Usability - the value of the references retrieved in terms of such factors as their reliability, comprehensibility and currency
  • Presentation - the form in which search results are presented to the users
    • Evaluation of Outcome
      • Recall:
It refers to the ability of a system to retrieve all the relevant items. It is calculated with the following general formula:
Recall = (Number of relevant items retrieved / Total number of relevant items in the collection) x 100
      • Precision:
It refers to the ability of a system to present only those items which are relevant. It is calculated with the following formula:
Precision = (Number of relevant items retrieved / Total number of items retrieved) x 100
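The two formulas can be applied to a single search as follows. The document identifiers are hypothetical, and returning 0.0 for an empty denominator is a convention assumed for this sketch.

```python
def recall(retrieved, relevant):
    """Percentage of the relevant items in the collection that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumed convention when nothing in the collection is relevant
    return 100.0 * len(set(retrieved) & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Percentage of the retrieved items that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # assumed convention when nothing was retrieved
    return 100.0 * len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical search: 4 items retrieved, 5 relevant items exist, 2 are in both sets.
retrieved_items = {1, 2, 3, 4}
relevant_items = {2, 4, 5, 6, 7}
r = recall(retrieved_items, relevant_items)     # 2 / 5 x 100
p = precision(retrieved_items, relevant_items)  # 2 / 4 x 100
```

Both measures share the numerator (relevant items retrieved) and differ only in the denominator, which is why a search that retrieves more items tends to raise recall while lowering precision.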
      • Efficiency
It refers to how efficiently an information retrieval system performs, that is, how well it answers users' queries by providing the most relevant results.
  • Overall user Evaluation
User evaluation takes into account the nature and type of users as well as the nature of the information they require. It is the evaluation of the information retrieval system by its users, on the basis of the system's performance: how efficiently the IRS performs, to what extent it satisfies the users' needs, and how many relevant items it provides during retrieval. All these issues are considered under user evaluation.
  • Types of Evaluation Experiments
  • The Cranfield Test
It was the first evaluation experiment on retrieval systems, undertaken at Cranfield, UK, under the direction of C.W. Cleverdon. The test was performed in two phases: Cranfield project 1 and Cranfield project 2. The first Cranfield study began in 1957 and was reported by Cleverdon.
The main objective of the project was to compare the effectiveness of four indexing systems:
i) An alphabetical subject catalogue based on a subject heading list.
ii) A UDC classified catalogue with an alphabetical chain index to the class headings constructed.
iii) A catalogue based on a faceted classification and an alphabetical index to the class headings.
iv) A catalogue compiled by a Uniterm co-ordinate index.
The study was based on 18000 indexed items and 1200 search topics. Three indexers were chosen, and each source item was indexed five times, each within a given time period. This gave a set of 6000 indexed items (100 documents x 3 indexers x 4 systems x 5 times). The test was conducted in three phases, with a view to finding out whether the level of performance increased with the increasing experience of the system personnel; 6000 indexed items in each of the three phases gives the total of 18000. The project used manufactured queries, which were formulated before the actual search.
The project produced the following results:
All four systems operated with an effectiveness that could be expressed by a recall ratio of between 60% and 90%. Increased indexing time increased recall. No significant difference was seen in the performance of the different indexers.
  • Cranfield 2:
It was a controlled experiment that attempted to assess the effect of the components of index languages on the performance of retrieval systems. In this study 1400 articles and reports were collected from the fields of aircraft structures and aerodynamics. Queries were formulated as questions by the authors of the collected articles. The authors were also asked to point out the references they had cited in their papers, as well as references they had not cited but considered relevant. Finally, 221 queries were formulated and 1400 documents were selected for the test. The success of the system was calculated on the basis of the relevant papers retrieved by a given search.
  • MEDLARS
This test was conducted on MEDLARS, a database of biomedical articles. MEDLARS stands for Medical Literature Analysis and Retrieval System. Its index entries are drawn from a thesaurus of medical subject headings (MeSH). The system was developed by the US National Library of Medicine, and its performance was analyzed between August 1966 and July 1967. The main objective of the test was to evaluate the existing system and to find out how it could be improved.
For the test, 21 user groups were selected to provide test questions, and 302 search requests were selected. Queries were formulated by the system operators in terms of MeSH and the searches were conducted; precision was then calculated with the following formula:

Precision ratio = ((H1 + H2) / L) x 100
 where H1 = number of items of major value; H2 = number of items of minor value; L = number of sample items retrieved.

  • The SMART Retrieval experiment
SMART (System for the Mechanical Analysis and Retrieval of Text) was designed in 1964 to evaluate the effectiveness of different types of search procedures. Salton described the functioning of the system in the following steps:
  • Take documents and search queries posed in English.
  • Perform a fully automatic content analysis of the texts.
  • Match the analyzed search statements against the contents of the documents.
  • Retrieve the stored items which are most similar to the queries.

  • The STAIRS Project
The STAIRS (Storage And Information Retrieval System) project was an experiment to evaluate the effectiveness of a full-text search and retrieval system. Blair and Maron presented the project in their 1985 report.
The study covered 40,000 documents. The full text of all pages was available online, and documents could be retrieved wherever specified words appeared, either singly or in Boolean combination. Searches could be manipulated by using the Thesaurus Linguistic System (TLS). The main aim of the study was to find out how effectively the system could retrieve the documents relevant to a given request. Queries were generated by lawyers, and paralegals searched the database. The lawyers evaluated the retrieved documents by sorting them into groups such as ‘vital’, ‘satisfactory’, ‘marginally relevant’ or ‘irrelevant’. Precision was calculated by dividing the number of documents in the relevant groups by the total number of documents retrieved. A sampling technique was adopted to calculate recall.
The results show that about 79 out of every 100 retrieved documents were relevant, but only 20% of the relevant documents in the collection were retrieved. When the recall and precision values were plotted on a graph for each request, it was found that in 50% of cases precision was above 80% while recall was 20% or less.
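Estimating recall by sampling can be sketched as follows. This is an assumed, generic sampling estimate rather than the exact procedure of the STAIRS study: a random sample of the unretrieved documents is judged for relevance, the result is scaled up to all unretrieved documents, and recall is computed from that estimate. The function name `estimate_recall` and all the numbers in the example are hypothetical.

```python
def estimate_recall(relevant_retrieved: int, unretrieved_total: int,
                    sample_size: int, relevant_in_sample: int) -> float:
    """Estimate recall when the unretrieved documents cannot all be judged.

    The share of relevant documents found in a random sample of the
    unretrieved set is scaled up to the whole unretrieved set.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    estimated_missed = unretrieved_total * relevant_in_sample / sample_size
    total_relevant = relevant_retrieved + estimated_missed
    return relevant_retrieved / total_relevant if total_relevant else 0.0


# Hypothetical search: 79 relevant documents retrieved, 39,900 documents left
# unretrieved, and 4 relevant documents found in a random sample of 500 of them.
print(round(estimate_recall(79, 39_900, 500, 4), 3))
```

With these made-up figures the estimate comes out a little under 0.2, which shows how a search can look precise while still missing most of the relevant material.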

  • TREC: The Text Retrieval Conference

9. Multimedia Information Retrieval


  • Introduction to Multimedia Information Retrieval
Multimedia information retrieval refers to a system in which users can search for and find multimedia information. Multimedia information consists of text, graphics, audio, video, images or any other kind of information held in a system. Search engines such as Google, Yahoo and Bing can facilitate multimedia information retrieval.
  • Text Information Retrieval
Searching for information in the form of text, or searching for a particular piece of text, is text information retrieval. Search engines facilitate text searching.
  • Audio and Music Information Retrieval
Retrieval of information in the form of audio or music is known as audio and music information retrieval. Users can retrieve audio of interest from a number of websites, such as songs.pk, hungama mp3 and dhingana.com.
  • Video Information Retrieval
Searching for a particular video in an available system is video information retrieval. YouTube and keepvid are examples of systems in which users can search for and find video information.
  • Technology
  • Application Development Tools.

10. Users of Information Retrieval


  • Users and their nature
Information basically depends on the needs, interests and queries of users. It is developed and supplied according to the requirements and interests of users. Language and region are also major factors that influence information. There are a number of information systems that provide information on various aspects. Users are categorized by their organization, the nature of the work they perform, age group, other social groups and so on. In an academic organization the users are students, research scholars, teachers and other staff of the organization. For example, in a college library the users are the students, the faculty, the administrative staff and the other staff of the college.
  • Different types of Information needs
Information need is a relative concept that depends on several factors and changes over time. It also varies from person to person, from organization to organization and from subject to subject. Information need depends on the environment and surroundings of users. It arises from unsolved problems, when the current state of knowledge seems insufficient to cope with the task in hand or to resolve conflicts in a subject area.
In the context of library searching, Taylor identifies four types of information need that lead the user from a purely conceptual need to one that is formally expressed and constrained by the environment:
i) Visceral need: the unconscious need of the user.
ii) Conscious need: a need the user is aware of but has not yet clearly defined.
iii) Formalized need: the need as formally expressed by the user.
iv) Compromised need: the expressed need as shaped by internal and external constraints.

  • Information needs in different areas of activity
i) Information need in scientific and technological research
ii) Information need in business
iii) Information need in enterprises
iv) Information need for supporting community development planning

  • Information seeking behavior of users
Information seeking is a form of human behavior that involves seeking information by actively examining information sources or information retrieval systems in order to satisfy an information need or to solve a problem (Ingwersen and Jarvelin 2005: 386). Information seeking is an interactive process that depends on initiatives on the part of the user, feedback from the information system, and the user’s decisions about subsequent actions based on this feedback. In order to acquire information, users have to select it from a particular source, system, channel or service. There are a number of steps and models of information seeking. Ellis proposed the following activities in the information seeking process:
Starting, Chaining, Browsing, Differentiating, Monitoring
Information seeking behavior depends on a number of factors, including the general level of education, the awareness of people in society and the overall context.

  • User Studies
The study of users, their nature and their information needs is known as a user study. A user study is carried out in several steps, which include drawing up a research plan and applying it to the problem. Methods used in user studies include surveys (questionnaires, interviews and case studies), critical incident examination (observation of events), diary reviews, and qualitative methods (study of internal processes).

11. Evolutions in Information Retrieval


  • Types of Information Retrieval Standards and Protocols
The notion of interoperability between different database systems is so attractive that it has generated many different attempts to achieve standards. The aim is to enable machines and information systems to communicate with one another by sharing and exchanging data, so that users can access more than one information system using the same techniques and interface. Standards such as MARC, MARC21 and AACR2 allow library catalogues to exchange data and help ensure interoperability among libraries.
There are a number of information retrieval standards and protocols that are used to ensure information exchange among information systems. The American National Standard Z39.50, the information retrieval service definition and protocol specification for library applications, first came out in 1988, with subsequent versions in 1992 and 1995. The main purpose of the standard is to encode the messages required for two machines to communicate for the purpose of information searching and retrieval. The protocol is defined to serve as a search and retrieval service that is completely independent of the underlying structure of the data. It is designed to allow searching on remote systems without prior knowledge of the other system’s syntax, strategies or data content. Z39.50 has equivalent international standards, ISO 10162 and ISO 10163.

  • Digital Global Library
Libraries are storehouses of information, and at present the infrastructure of libraries is changing from the traditional form to other types such as the digital library. A digital library is a library in which all the resources are available in digitized form. Library services now reach remote areas, and users can access information on their screens without going to the library. Digital libraries provide globalized services, and users anywhere in the world can retrieve information from them.

  • Intelligent Information Retrieval
It refers to information retrieval carried out by a machine without human mental effort. In intelligent information retrieval the search processing is done by the machine: the user only provides the query terms, and the rest is performed by the computer system.
  • Intelligent hypertext and hypermedia systems
Hypermedia refers to the combination of different media, in which text, images, audio, video, animation and so on are available together.
  • Advancement in User Interface
User interfaces have been largely standardized by the use of common browsers. Many user interfaces now have attractive clustering and visualization features; for example, the interfaces of search engines such as Kartoo and Clusty present search results in clusters of related search topics. Online search engine interfaces have improved significantly and have become more intuitive and less demanding. Most advanced user interfaces allow users to formulate fairly complex search queries without having to learn a particular search syntax. Some online databases, such as Factiva, present search results in a variety of ways, including a visual display of result sets by categories such as companies, industries, keywords and subjects. Wolfram Alpha produces a number of charts and figures along with the search results, and often allows users to generate figures on the fly at the click of a button.

12. Advanced Course in ISAR

                                
a) Natural Language Processing in Information Retrieval
Natural language processing is an area of research and application that explores how natural language text entered into a computer system can be manipulated and transformed into a form suitable for further processing. The main aim of an IR system is to retrieve relevant documents in response to a user's query, and the query is an expression of an information need in a natural language statement. The retrieved information is in turn examined by users as natural language text, and on that basis they decide whether it is relevant or not. Natural language processing techniques should therefore help in representing both information and queries for efficient retrieval.
  • Natural Language Understanding
Before developing a natural language system it is necessary to understand natural language. Building a computer program that understands natural language involves three major issues:
i) Thought process
ii) Representation and meaning of the linguistic input
iii) World knowledge
According to Liddy and Feldman, people use seven interdependent levels to extract meaning from text or spoken language in order to understand natural language:
i) Phonetic or phonological level
ii) Morphological level
iii) Lexical level
iv) Syntactic level
v) Semantic level
vi) Discourse level
vii) Pragmatic level
  • Syntactic Analysis
Syntax refers to the formation of sentences, that is, how words are combined into larger units such as phrases and sentences. It deals with the structural properties of texts. Syntactic analysis means the decomposition of a sentence into simpler phrases. The rules of syntax characterize the relations between the components of a sentence and specify the legal syntactic structures for a sentence.
  • Parsing
Parsing means recognizing the valid sentences of a language and determining their underlying structure. In other words, parsing is the delinearization of linguistic input. Parsing is done with the help of a parser (the component of the syntactic phase in computational processing), which converts sentences into a representational structure useful for further processing. Parsing is the transposition of a potentially ambiguous phrase into an internal representation. There are two types of parsing:
i) Top-down parsing: starts from the sentence symbol of the grammar and expands the rules downward until they match the words of the input.
ii) Bottom-up parsing: starts from the words of the input and combines them upward into larger constituents until the sentence symbol is reached.
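Top-down parsing, which starts from the sentence symbol and works down to the words, can be sketched with a small recursive-descent parser. Everything in this example is an illustrative assumption: the toy grammar (S -> NP VP, NP -> Det N, VP -> V NP), the `LEXICON` and the function names are invented for the sketch and do not come from any particular system.

```python
# Hypothetical lexicon mapping words to parts of speech.
LEXICON = {
    "the": "Det", "a": "Det",
    "user": "N", "system": "N", "query": "N",
    "submits": "V", "answers": "V",
}


def parse_np(words, pos):
    """NP -> Det N. Return (tree, next position) or (None, pos) on failure."""
    if (pos + 1 < len(words)
            and LEXICON.get(words[pos]) == "Det"
            and LEXICON.get(words[pos + 1]) == "N"):
        return ("NP", ("Det", words[pos]), ("N", words[pos + 1])), pos + 2
    return None, pos


def parse_vp(words, pos):
    """VP -> V NP. Return (tree, next position) or (None, pos) on failure."""
    if pos < len(words) and LEXICON.get(words[pos]) == "V":
        np, new_pos = parse_np(words, pos + 1)
        if np is not None:
            return ("VP", ("V", words[pos]), np), new_pos
    return None, pos


def parse_sentence(words):
    """S -> NP VP. Return a parse tree, or None if the input is not a valid sentence."""
    np, pos = parse_np(words, 0)
    if np is None:
        return None
    vp, pos = parse_vp(words, pos)
    if vp is None or pos != len(words):
        return None
    return ("S", np, vp)


print(parse_sentence("the user submits a query".split()))
print(parse_sentence("user the submits".split()))  # None: not a valid sentence
```

The nested tuples returned for a valid sentence are one possible form of the "representational structure" that a parser hands on to later processing.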


  • Tokenization
The process of breaking a sentence into words, phrases, symbols and other meaningful elements is called tokenization. The meaningful elements of the sentence are known as tokens.
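A minimal tokenizer can be written with Python's standard `re` module. The pattern below is one assumed convention among many: it keeps internal hyphens and apostrophes inside a word and treats every other punctuation mark as its own token.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens and single punctuation tokens."""
    # A word may contain internal hyphens or apostrophes; anything else that
    # is neither a word character nor whitespace becomes a token of its own.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


print(tokenize("Salton's system retrieves full-text documents."))
# ["Salton's", 'system', 'retrieves', 'full-text', 'documents', '.']
```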
  • Stop Word Removal
Stop word removal is the removal of common words that carry little meaning for retrieval from a document, either before or after the text is processed. It is done to keep useless terms from adding noise to the search.
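Stop word removal amounts to filtering tokens against a list. In this sketch the `STOP_WORDS` set is a small hypothetical sample; real systems use much longer, language-specific lists.

```python
# Hypothetical, deliberately short stop list.
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to", "for"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words (comparison ignores case)."""
    return [token for token in tokens if token.lower() not in stop_words]


print(remove_stop_words(["The", "retrieval", "of", "documents", "in", "a", "library"]))
# ['retrieval', 'documents', 'library']
```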
  • Stemming
Stemming is the process of reducing inflected words to their stem or root form. It is used to conflate word forms and so avoid mismatches that may undermine recall.
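The idea can be illustrated with a crude suffix-stripping stemmer. This is a simplified sketch and not a published algorithm such as Porter's: the `SUFFIXES` list and the minimum stem length are assumptions chosen for the example.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "ed", "es", "s")


def stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for form in ("retrieving", "retrieved", "retrieves", "retrieval"):
    print(form, "->", stem(form))
```

The first three forms are conflated to "retriev", while "retrieval" is left untouched, which shows both what stemming achieves and where such a simple rule set falls short.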
  • Lemmatization
Lemmatization is the process of grouping together the different inflected forms of a word so that they can be analyzed as a single item.
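Unlike suffix stripping, lemmatization needs knowledge of the vocabulary. The sketch below assumes a tiny hand-made lookup table, `LEMMAS`, in place of the full dictionary and morphological analysis a real lemmatizer would use.

```python
from collections import defaultdict

# Hypothetical lookup table from inflected form to dictionary form (lemma).
LEMMAS = {
    "went": "go", "gone": "go", "goes": "go",
    "mice": "mouse",
    "indices": "index", "indexes": "index",
}


def lemmatize(word: str) -> str:
    """Return the lemma of a word, or the lowercased word if it is not listed."""
    return LEMMAS.get(word.lower(), word.lower())


def group_by_lemma(words):
    """Group word forms under their shared lemma."""
    groups = defaultdict(list)
    for word in words:
        groups[lemmatize(word)].append(word)
    return dict(groups)


print(group_by_lemma(["went", "goes", "gone", "mice", "mouse"]))
# {'go': ['went', 'goes', 'gone'], 'mouse': ['mice', 'mouse']}
```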
  • Semantic Analysis
Semantic analysis is the step in which a program goes from the syntactic tree to an internal representation of meaning, known as the semantic representation. Semantic analysis relies on a knowledge base.
  • Knowledge Base
A knowledge base is the collection of different kinds of knowledge held in memory and used by a program during natural language processing. A knowledge base can be declarative or procedural in form. The difference between the two forms is like the difference between a database and a program that acts on the database. Declarative representation seems more natural than procedural representation to non-programmers, because it reads like statements of fact.
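The contrast between the two forms can be made concrete with a small sketch. The names `FACTS`, `lookup` and `search_type` are hypothetical; the two facts stored are taken from the descriptions of MEDLARS and STAIRS earlier in this module.

```python
# Declarative form: knowledge stored as data that reads like statements of fact.
FACTS = {
    ("MEDLARS", "indexed_with"): "MeSH",
    ("STAIRS", "search_type"): "full text",
}


def lookup(subject, attribute, facts=FACTS):
    """A generic program that acts on the fact base."""
    return facts.get((subject, attribute))


# Procedural form: the same knowledge embedded in the logic of a program.
def search_type(system):
    if system == "STAIRS":
        return "full text"
    return None


print(lookup("MEDLARS", "indexed_with"))  # MeSH
print(search_type("STAIRS"))              # full text
```

Adding a new fact to the declarative form means adding one entry to `FACTS`, whereas the procedural form needs a change to the code itself.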
  • Knowledge Representation
Knowledge representation refers to the internal representation created from natural language statements. It is not tied to the language of the input text and can be used for further processing. The main purposes of knowledge representation are:
i) to help people understand the system they are working with
ii) to enable the system to process the representation
b) Semantic Web:
The semantic web, in the sense used here, is a semantic network: a knowledge representation formalism used in AI research. It may be described as a drawing in which the hierarchies of all the relevant facets are represented, with lines joining the classes and subclasses. It is a useful tool for those who need to form a conceptual schema of a domain. Such networks can represent not only the relations between concepts but also the relations between individuals, and hence facts about the world.
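A drawing of classes joined by lines maps naturally onto a list of (subject, relation, object) triples. The following sketch assumes that representation; the `TRIPLES` data, the relation names `is_a` and `instance_of`, and the function `ancestors` are all invented for illustration.

```python
# Hypothetical network: each triple is one labelled line in the drawing.
TRIPLES = [
    ("dictionary", "is_a", "reference source"),
    ("encyclopedia", "is_a", "reference source"),
    ("reference source", "is_a", "document"),
    ("Oxford English Dictionary", "instance_of", "dictionary"),
]


def ancestors(node, triples=TRIPLES):
    """Follow is_a and instance_of links upward and return all broader classes."""
    found = []
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in triples:
            if (subject == current
                    and relation in ("is_a", "instance_of")
                    and obj not in found):
                found.append(obj)
                frontier.append(obj)
    return found


print(ancestors("Oxford English Dictionary"))
# ['dictionary', 'reference source', 'document']
```

The first three triples relate concepts to one another, while the last relates an individual to a concept, so the same structure holds both kinds of relation described above.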

