Wednesday, February 11, 2015

05. Querying of the information retrieval system P- 06. Information Storage and Retrieval

Suggestions from all of you are respectfully invited for the creation of this blog. Please send your suggestions and entries. Its entire scope is the world knowledge community, and it is intended to make an important contribution to the career building of all competitive examination candidates. You can send your suggestions to this email address: chandrashekhar.malav@yahoo.com


By: Dr P.M Devika, Paper Coordinator



Introduction:


Information search and retrieval involves finding useful documents in a store of information. In any information search and retrieval system, the search and selection process plays an important role. An Information Retrieval System (IRS) allows a user to find useful documents in a large volume of information by submitting a query to the IRS.
A query can be presented to the IRS through an intermediary or directly by the user. In general terms, a query is a statement or series of statements made by a user to a retrieval system to specify what information is to be retrieved and in what form.
Most of the time the query is expressed in an artificial language, which is therefore called a query language. A query language is the means by which the user tells the IRS what to do and what is wanted. A query is distinct from the documents that the user is trying to retrieve, but the document and the query undergo parallel processes within the retrieval system. On the document side, someone generates or gathers data and formulates it into a document. The document is then converted into an internal representation, which in turn is converted into the format used for the matching process. On the query side, the user begins with an information need, which is formulated as a query and likewise converted into a form that can be matched against the documents.
There are two broad types of query language: procedural, and non-procedural or descriptive. A procedural language uses commands. If the query is written in a typical procedural query language, often known as a command language, the IRS needs little or no interpretation to find and retrieve what was asked for. Command language queries are structured, and to the IRS they are unambiguous. Non-procedural queries, such as natural language queries, tend to be ambiguous in syntax and meaning, so an intermediary is generally required to formulate them into a query that the IRS can process.

Sets and Subsets:

Most text information retrieval systems are designed on the assumption that the user will frequently revise the query statement. The IRS therefore creates and maintains a set for each query and a subset for each component statement of the query, which generally includes Boolean combinations. A set is generally a list of the accession numbers that identify the retrieved records satisfying the query or component statement. The user is informed only of the set number assigned by the IRS and the number of records found in the set. A set with no records is termed a null set. To modify the search, the user modifies the query statement and carries out a new search.
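
The set mechanism described above can be sketched in Python. This is only an illustration under stated assumptions, not the behaviour of any particular IRS: the inverted index, the accession numbers, and the SearchSession class are all hypothetical.

```python
# Hypothetical inverted index: term -> accession numbers of matching records.
INDEX = {
    "library": {101, 102, 105},
    "automation": {102, 105, 109},
    "cataloguing": {103},
}


class SearchSession:
    """Keeps every set created during a search so it can be reused later."""

    def __init__(self, index):
        self.index = index
        self.sets = {}  # set number -> accession numbers in that set

    def _store(self, records):
        number = len(self.sets) + 1
        self.sets[number] = set(records)
        # The user is told only the set number and the record count.
        return number, len(records)

    def select(self, term):
        """Form a set from a single term."""
        return self._store(self.index.get(term, set()))

    def combine(self, left, operator, right):
        """Form a new set as a Boolean combination of two earlier sets."""
        a, b = self.sets[left], self.sets[right]
        if operator == "AND":
            result = a & b
        elif operator == "OR":
            result = a | b
        elif operator == "NOT":
            result = a - b
        else:
            raise ValueError(f"unknown operator: {operator}")
        return self._store(result)


session = SearchSession(INDEX)
s1 = session.select("library")       # (1, 3)
s2 = session.select("automation")    # (2, 3)
s3 = session.combine(1, "AND", 2)    # (3, 2)
s4 = session.select("cataloguing")   # (4, 1)
s5 = session.combine(3, "AND", 4)    # (5, 0): a null set
```

Because every intermediate set is kept under its own number, a revised query can reuse earlier sets instead of repeating the whole search.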

Relational Statements:

Relational statements specify the characteristics of the records in a set to be formed, that is, the characteristics of the records that compose a subset of the database. Sets are defined by specifying one or more combinations of an attribute, a relationship, and a value, e.g. publication date = 2000, author = Croft, B., or salary > 10,000. The relational statements in any IRS include:
Equality (=), e.g. subject = library science.
Inequality (>, <, <=, >=, <>), e.g. date > 10102012 or subject <> library science. The symbol <> means "not equal" in many query and programming languages, such as SQL.
Equality is therefore not the only relationship that can be expressed; a relational statement can express inequality as well.
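
As a rough sketch of how an attribute, a relationship, and a value define a set, the following Python fragment filters a few hypothetical records. The record contents, the RELATIONS table, and the form_set function are assumptions made for illustration; a real IRS would evaluate such statements against its own indexes.

```python
import operator

# Relationship symbols mapped to Python comparison functions.
RELATIONS = {
    "=": operator.eq,
    "<>": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

# Hypothetical records, invented for this example.
RECORDS = [
    {"id": 1, "author": "Croft, B.", "year": 2000, "subject": "information retrieval"},
    {"id": 2, "author": "Example, A.", "year": 2007, "subject": "information retrieval"},
    {"id": 3, "author": "Sample, C.", "year": 1997, "subject": "library science"},
]


def form_set(records, attribute, relation, value):
    """Return the ids of records for which 'attribute relation value' holds."""
    compare = RELATIONS[relation]
    return {
        r["id"]
        for r in records
        if attribute in r and compare(r[attribute], value)
    }


from_2000_on = form_set(RECORDS, "year", ">=", 2000)                   # {1, 2}
not_library = form_set(RECORDS, "subject", "<>", "library science")   # {1, 2}
by_croft = form_set(RECORDS, "author", "=", "Croft, B.")              # {1}
```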

Ranked and Fuzzy Sets:


Ranking means assigning each record a measure of the closeness of the record's content to the query, or of the extent to which the record matches the query. If ranking logic is used, Boolean operators can still be used to define a set.
In fuzzy information retrieval, a user cannot accurately tell whether a given document will meet the information need; this uncertainty is expressed as a “fuzzy” evaluation of the document with respect to the query. The concept of fuzzy information retrieval allows for both fuzzy document evaluations and fuzzy queries.
The point of ranking is to acknowledge that there is uncertainty as to whether the query exactly expressed the user's needs. If the user knows the question is imperfect, there is little to be gained in getting back an answer that claims, in effect, “Here is the exact information you wanted”.
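
One simple way to picture ranking is to give each record a degree of match between 0 and 1 instead of a yes/no answer. The sketch below is an assumption made for illustration: it uses the fraction of query terms found in a record as the degree of match, which is only one of many possible scoring rules, and the records themselves are hypothetical.

```python
# Hypothetical records: accession number -> index terms.
RECORDS = {
    101: {"fuzzy", "retrieval", "ranking"},
    102: {"retrieval", "boolean"},
    103: {"cataloguing"},
    104: {"fuzzy", "retrieval"},
}


def membership(query_terms, record_terms):
    """Degree (0.0 to 1.0) to which a record matches the query.

    Assumed scoring rule: the fraction of query terms present in the record.
    """
    if not query_terms:
        return 0.0
    return len(query_terms & record_terms) / len(query_terms)


def ranked_search(query_terms, records):
    # Boolean step: the set is still defined with OR over the query terms.
    candidates = {n: t for n, t in records.items() if query_terms & t}
    # Ranking step: order the set by degree of match, best first.
    scored = [(membership(query_terms, t), n) for n, t in candidates.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


query = {"fuzzy", "retrieval", "ranking"}
results = ranked_search(query, RECORDS)
# Record 101 (all three terms) comes first, then 104, then 102.
```

The output is an ordered list of candidates with their scores, which leaves the final judgement of usefulness to the user.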

Similarity Measures:


Similarity is the key concept behind information storage and retrieval. The aim of an IRS is to retrieve those documents whose contents are similar to the information need expressed in the query. To aid this process, catalogers and indexers try to organize the document collection so that similar documents are in some sense close together and can be retrieved as a group with little effort.
Any document can be represented by a list of the terms that occur in it. A common way to define document similarity is to relate it to the key terms that two documents have in common. To handle the documents within a collection, it is assumed that all the terms in the collection are in a fixed order, say alphabetical order.
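
The term-list representation can be illustrated with a short Python sketch. The two documents and the function names are hypothetical; the only idea taken from the text is that every document is described against the same alphabetically ordered list of terms.

```python
def build_vocabulary(documents):
    """All distinct terms in the collection, in alphabetical order."""
    return sorted(set().union(*documents.values()))


def to_vector(terms, vocabulary):
    """1 where the document contains the term, 0 where it does not."""
    return [1 if term in terms else 0 for term in vocabulary]


# Hypothetical two-document collection.
DOCS = {
    "d1": {"query", "retrieval", "system"},
    "d2": {"index", "retrieval"},
}

vocab = build_vocabulary(DOCS)       # ['index', 'query', 'retrieval', 'system']
v1 = to_vector(DOCS["d1"], vocab)    # [0, 1, 1, 1]
v2 = to_vector(DOCS["d2"], vocab)    # [1, 0, 1, 0]

# With a fixed term order, counting shared terms is a position-by-position sum.
shared = sum(a * b for a, b in zip(v1, v2))   # 1
```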
When retrieving records, the number of co-occurring words serves as a measure of similarity between two texts, and the percentage of co-occurring words serves the purpose of searching better. Precision of retrieval can be improved by considering both the number of terms in common and the number not in common. There are several ways to measure how similar two texts are. They all use the number of terms common to the two texts, but other factors also play an important role, such as the sizes of the documents involved, the number of terms not in common, and the weights that may be assigned to the terms. Weights allow greater importance to be given to the co-occurrence of highly weighted terms, but their assignment is highly subjective. Similarity measures are generally applied to pairs of documents or to a document and a query.
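
The text does not name specific measures, so the following sketch uses three widely known ones as examples: the Jaccard coefficient, the Dice coefficient, and the cosine measure over term weights. The sample term sets and weights are invented for illustration.

```python
import math


def jaccard(a, b):
    """Terms in common divided by all distinct terms in either text."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Twice the terms in common divided by the combined sizes of the texts."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def weighted_cosine(a, b):
    """Cosine of the angle between two term -> weight dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical document and query, as sets of terms.
doc = {"information", "retrieval", "query", "language"}
query = {"query", "language", "syntax"}

j = jaccard(doc, query)   # 2 shared terms / 5 distinct terms = 0.4
d = dice(doc, query)      # 2 * 2 / (4 + 3) = 4/7

# Hypothetical weights: "query" matters more in the document than "language".
doc_weights = {"query": 3.0, "language": 1.0}
query_weights = {"query": 1.0, "syntax": 1.0}
c = weighted_cosine(doc_weights, query_weights)   # 3 / sqrt(20)
```

Jaccard and Dice differ only in how they account for the sizes of the texts and the terms not in common, while the cosine measure shows how weights let a highly weighted shared term dominate the score.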


