Saturday, January 17, 2015

Search Engines P- 04. Information Communication Technology for Libraries

Suggestions from all of you are respectfully invited in creating this blog. Please send your suggestions and entries. Its entire field of work is the world knowledge community, and it will make an important contribution to the career building of all competitors. You can send your suggestions to this email address: chandrashekhar.malav@yahoo.com


By: Usha Munshi, Paper Coordinator

Content
    Objectives
    1. Introduction
    2. Search Engines: Definition
    3. Evolution of Search Engines
    4. How Do Search Engines Work?
    5. Categories of Search Engines
    6. Choosing a Search Engine
    7. Searching the Web: Search Techniques
    8. Evaluation of Search Engines
    9. Important Search Engines
    10. Summary
    References and Readings


True or False

Question 1: True or False

Advanced search isn't helpful.
  •  True
  •  False (correct answer)

Question 2: True or False

Dogpile is a search engine.
  •  True (correct answer)
  •  False

Multiple Choice Question

Question 1: Multiple Choice

A program used to look through web sites for keywords and/or links.
  •  Indexer
  •  Searching
  •  Robot
  •  Web Crawler (correct answer)

Question 2: Multiple Choice

A(n) ____________ search is a search whose hits are restricted to Web pages within the current Web site.
  •  Exploratory
  •  Site
  •  Global
  •  None of these

Question 3: Multiple Choice

How many search engines are used throughout the world today?
  •  500
  •  187
  •  300
  •  55

Question 4: Multiple Choice

Search engines are used to ___________
  •  Search Videos
  •  Search Documents
  •  Software System that is designed to search for information on the World Wide Web
  •  All of these

Question 5: Multiple Choice

Search engines use _________________ to reach thousands of pages per day and use an index to help find them later.
  •  Robot
  •  Web Crawler
  •  Spiders
  •  Wildcards

Question 6: Multiple Choice

A subject-oriented search engine is:
  •  Web search engine
  •  Web Crawler
  •  Indexer A
  •  Dogpile

Question 7: Multiple Choice

The task of crawling is performed by complex software which is called:
  •  Crawler
  •  Spider
  •  Boat
  •  All of these

Question 8: Multiple Choice

The first search engine ever developed was _________________, which was used to search FTP sites.
  •  Jughead
  •  Veronica
  •  Archie (correct answer)
  •  Reggie

Question 9: Multiple Choice

The first text-based search engine was:
  •  MSN
  •  Google
  •  Veronica
  •  Jughead

Question 10: Multiple Choice

What does a metasearch engine do?
  •  It returns a list of sites that have been reviewed by humans
  •  It sends the specified query to several search engines simultaneously (correct answer)
  •  It uses the databases of another search engine to show the results
  •  It allows all users to change its content

Question 11: Multiple Choice

What is a search engine spider?
  •  Adware and malware downloaded to your computer.
  •  A program that follows, or "crawls", links throughout the Internet, grabbing content from sites and adding it to search engine indexes. (correct answer)
  •  A programming error.
  •  Something that infects your computer.

Question 12: Multiple Choice

What is the best way to search for a word and to include pages containing words with similar meanings?
  •  liking AND similar
  •  hike:liking
  •  liking+
  •  ~liking

Question 13: Multiple Choice

What was the first big search engine?
  •  Yahoo
  •  Google
  •  Lycos
  •  MSN

Question 14: Multiple Choice

Which of the following is the correct way to search for an exact phrase?
  •  +Bradford city
  •  exact:(Bradford city)
  •  (Bradford city)
  •  "Bradford city"

Question 15: Multiple Choice

Which of the following will produce the best results?
  •  Bradford City
  •  BRADFORD CITY
  •  bradford city
  •  It doesn't matter, only special query keywords are case sensitive.

Question 16: Multiple Choice

Why are search engines important?
  •  They sort through information; kind of like a library card catalog.
  •  They make it easier to find information rather than linking to it yourself.
  •  You can narrow your search down to the exact information you need with a search engine.
  •  All of the above. (correct answer)

Question 17: Multiple Choice

______, ________, ______ don't really matter when you put them in the search query.
  •  are, for, to
  •  a, an, the (correct answer)
  •  who, where, why
  •  from, whose, who

Objectives

This lesson is designed to impart knowledge on the following aspects of Internet search engines:

  • Search engines and their evolution
  • How search engines work, and the components of a search engine
  • Categories of search engines
  • Search techniques
  • Metadata and search engines
  • Evaluation of search engines
  • Important search engines

1. Introduction

The growth of the Internet has led to a paradoxical situation. On the one hand, a huge amount of information is available on the Internet; on the other hand, the sheer volume of unorganized information makes it difficult for users to find relevant and accurate information quickly and efficiently. The first Google index in 1998 had 26 million pages, and it reached one billion by 2000. In mid-2008 it reached a new milestone of 1 trillion (1,000,000,000,000) unique URLs. The Internet can be said to be the most exhaustive, important and useful source of information on almost all aspects of knowledge, hosted on millions of servers connected to the Internet around the world. For many users, searching for specific information is the main purpose of using the Internet. However, with so much information available, it has become very difficult for an ordinary user to find precise and relevant information. To tackle this situation, computer scientists came up with search tools that sift through the information on the Internet to retrieve what a user requires. A variety of search, resource discovery and browsing tools have been developed to support more efficient information retrieval. Search engines are one such discovery tool.


Search engines use automated programs, variously called bots, robots, spiders, crawlers, wanderers and worms, to search the web. The robots traverse the web in order to index web sites. Some index web sites by title, some by uniform resource locators (URLs), some by the words in each document on a web site, and some by combinations of these. Search engines function in different ways and search different parts of the Internet.

2. Search Engines: Definition

“Search engine” is a generic term for software that “searches” the web for pages relating to a specific query. Google, Yahoo and Bing are a few examples of common search engines that index and search a significant part of the web. Several web sites have their own search engines to index their own content. The World Wide Web also has several sites dedicated to indexing information on all other sites.

A search engine can be defined as a tool for finding, classifying and storing information on various websites on the Internet. It can help in locating relevant information on a particular subject by using various search methods. It is a service that indexes, organizes, and often rates and reviews Web sites. It helps users to find the proverbial needle in the Internet haystack. Different search engines work in different ways. Some rely on people to maintain a catalogue of Web sites or web pages; others use software to identify key information on sites across the Internet. Some combine both types of service. Searching the Internet for the same topic with different search engines therefore gives different results.

3. Evolution of Search Engines

Archie, developed in 1990 by Alan Emtage, a student at McGill University in Montreal, can be considered the first search engine; it was used for indexing and searching files on FTP servers. Archie built a database of file names, which it matched against users' queries. Inspired by the success of Archie, the University of Nevada developed Veronica (Very Easy Rodent-Oriented Netwide Index to Computerized Archives) in 1993 to search all menu items on Gopher servers. Soon another tool named Jughead appeared with the same purpose as Veronica. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a powerful Gopher search tool written by Rhett “Jonzy” Jones. It was a computer program that searched a specified Gopher site (not all sites). It searched directory titles only, not the text of resources that appeared on the Gopher submenus. Archie, Veronica and Jughead have now disappeared, but before the web's spectacular growth, these tools were real workhorses for searchers on the Internet.

In 1993, soon after the launch of the World Wide Web, the first robot, called the World Wide Web Wanderer, was introduced by Matthew Gray to search the Web. In October 1993, Martijn Koster developed an Archie-like indexing tool for the Web, called ALIWEB. It did not use a robot to collect metadata; instead, it allowed users to submit the Web sites they wanted ALIWEB to index, with their own descriptions and keywords. By December 1993, three full-fledged robot-fed search engines had surfaced on the web: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. JumpStation gathered information about the title and header of Web pages and retrieved them using a simple linear search. As the web grew, JumpStation slowed to a stop. The WWW Worm indexed titles and URLs. JumpStation and the World Wide Web Worm did not use any ranking method to list their search results; results were listed in the order in which they were found. The RBSE spider did implement a ranking system.

Excite was a by-product of a project called Architext, started in 1993 by six Stanford undergraduates. They used statistical analysis of word relationships to make searching more efficient. The Excite search software was released by mid-1993. However, the technique used by Excite seemed of little relevance, because the spiders were not intelligent enough to understand what all the links meant. The EINet Galaxy Web Directory was launched in January 1994. EINet Galaxy became a success since it also offered Gopher and Telnet search features in addition to its web search feature.

In April 1994, David Filo and Jerry Yang created Yahoo as a collection of their favourite web pages. As their number of links grew, they had to reorganize and develop a searchable directory. The Yahoo directory provided a description with each URL, an improvement over the Wanderer. Brian Pinkerton of the University of Washington launched WebCrawler on April 20, 1994. It was the first crawler that indexed entire pages. In 1997, Excite bought out WebCrawler, and AOL began using Excite to power its NetFind. WebCrawler opened the door for many other services to follow suit.

Three important search engines, namely Lycos, Infoseek and OpenText, appeared soon after WebCrawler was launched. Lycos, the next major search engine, was developed at Carnegie Mellon University and launched on July 20, 1994, with a catalogue of 54,000 documents. By August 1994, Lycos had identified 394,000 documents, and by November 1996 it had indexed over 60 million documents, more than any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search engines by finding the most hits on the word “surf”. Infoseek was also launched in 1994. In December 1995, Netscape started using Infoseek as its default search engine. AltaVista was also launched in December 1995. It brought many important features to web searching and was the first to allow natural language queries and advanced searching techniques.

The LookSmart directory commenced functioning in 1996. The Inktomi Corporation came about in May 1996 with its search engine called HotBot. Inktomi was later bought by Yahoo. Ask Jeeves was launched in April 1997, followed by Northern Light.

1998 witnessed the launch of Google, the most powerful search engine to date. Google ranks pages based on the number of inbound links to a page. Google became so popular that major portals such as AOL and Yahoo have used it to search their directories. Three other major search engines and directories were also launched in 1998: MSN Search, Open Directory and Direct Hit.

Disney released the Go Network in 1999. Fast released its search technology in the same year, and was considered the closest competitor to Google. In 2000, the Teoma search engine was released, which uses clustering to organize sites by subject-specific popularity. In 2001, Ask Jeeves bought Teoma to replace the Direct Hit search engine.

LookSmart bought the WiseNut search engine in 2002 to power its new search product. In 2003, Google began to introduce semantic elements into its search product, thereby improving its search results. In the same year, Overture purchased AllTheWeb and AltaVista, and Yahoo bought Inktomi and Overture. In 2004, MSN dropped LookSmart in favour of Inktomi, and Yahoo dumped Google in favour of its own search engine. Yahoo! built a new database, separate from the Inktomi database, that replaced both AltaVista and AllTheWeb. Also in 2004, Microsoft began a transition to its own search technology, powered by its own web crawler. Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal under which Yahoo! Search would be powered by Microsoft Bing technology.

The brief history given here does not include the many smaller search engines that came, were marginalized and disappeared.

4. How Do Search Engines Work?

Search engines do not really search the World Wide Web directly. Instead, they search their own database, consisting of the keywords or full text of web pages that were earlier selected and picked up from billions of web pages residing on servers all over the world. When a user searches the web using a search engine, the engine always searches an older copy of the real web page, stored on the search engine's own server. When the user clicks on a link in the search results, he or she is directed to the current version of the page. A typical search engine has the following three components:
  • Robots/Spiders
  • Database
  • User Interface


4.1. The Robot or Spider

Variously known as bots, robots, spiders, crawlers, web wanderers or indexers, these are automatic computer programs that traverse the World Wide Web information space. They move from one web page to another by following the links embedded in each page they find, and in the process build an index of the pages visited. This process can be compared to citation searching, where a user follows a reference in a journal article to another article on the same topic. Unlike a virus, a robot does not physically move from computer to computer; it simply visits sites, like a user, and requests documents to be indexed. The main functions of a robot or spider are indexing web pages, HTML validation, link validation, identifying new information and mirroring web sites.


Fig.2: Thematic Presentation of Functioning of Web Robots
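The link-following behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" here is a small hypothetical in-memory dictionary (`TOY_WEB`) rather than real HTTP requests, and the function name `crawl` is invented for this example.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": ("library automation basics",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("search engines for libraries",
                          ["http://c.example/"]),
    "http://c.example/": ("automation of library catalogues",
                          ["http://a.example/"]),
    "http://d.example/": ("a page nobody links to", []),
}


def crawl(seed_url, web):
    """Follow links outward from seed_url, visiting each page only once."""
    visited = {}                  # URL -> text of every page fetched so far
    queue = deque([seed_url])     # pages waiting to be visited
    while queue:
        url = queue.popleft()
        if url in visited or url not in web:
            continue              # skip pages already seen or unreachable
        text, links = web[url]
        visited[url] = text
        for link in links:
            if link not in visited:
                queue.append(link)
    return visited


pages = crawl("http://a.example/", TOY_WEB)
print(list(pages))
```

Note that the page at `http://d.example/` is never reached, because no visited page links to it; a robot can only find what the link structure exposes.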

 

Different robots use different strategies to index the web. In general, most search engines start from a meta resource site or a subject portal that contains links to several other resources. They scan the web constantly, keeping track of newly appearing documents and deleting duplicates. Search engines use their own spider software, designed to harvest indexing information from web sites according to specified criteria. However, search engines generally cannot index database-driven sites. Such sites are referred to as the “Invisible Web” or “Hidden Web”: information that is priced or hidden behind databases.

 

After spiders find pages on the web, they pass them on to another computer program for "indexing". This program identifies the text, links and other content in each page and stores it in the search engine's database files.
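As a rough illustration of the indexing step, the sketch below builds a simple inverted index: a mapping from each word to the set of pages containing it. The sample pages and the function name `build_index` are assumptions made for this example; real indexers also store links, word positions and other content.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lower-cased word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical pages handed over by a spider.
pages = {
    "http://a.example/": "Library automation basics",
    "http://b.example/": "Search engines for libraries",
    "http://c.example/": "Automation of library catalogues",
}
index = build_index(pages)
print(sorted(index["automation"]))
```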


4.2. The Database

A robot or spider harvests indexing information from the web pages it visits and stores it in a database or catalogue, which lists URLs, titles, headers, words from the title and text, first lines, abstracts, and sometimes even full text. The resulting database, which stores millions of web pages, forms the index that users search. The size of this database determines the comprehensiveness of a search engine. Most search tools also create a separate database of records, each consisting of a web page's URL, title and a summary. When a user retrieves results from a search tool, this summary record is displayed.

Search engines either update their databases cumulatively or rebuild them completely at definite intervals.

4.3. The User Interface or the Agent

The user interface or agent is a software program that accepts queries from users and searches them against the database, which consists of an index of millions of pages. The agent matches the query with the database, finds hits and ranks them in order of relevance. The results, consisting of web links and brief descriptions, are arranged in order of relevance and presented to the user. Amongst items with the same relevance, the most popular sites are listed first.
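A minimal sketch of such an agent follows. It assumes a toy index and a hypothetical `popularity` table, counts how many query terms each page matches as a stand-in for relevance, and breaks ties by popularity as described above. Real engines use far richer relevance measures.

```python
def search(query, index, popularity):
    """Rank URLs by number of query terms matched; break ties by popularity."""
    scores = {}
    for term in query.lower().split():
        for url in index.get(term, set()):
            scores[url] = scores.get(url, 0) + 1
    # Higher score first, then higher popularity, then URL for a stable order.
    return sorted(scores,
                  key=lambda url: (-scores[url], -popularity.get(url, 0), url))


# Hypothetical index and popularity figures.
index = {
    "library": {"http://a.example/", "http://c.example/"},
    "automation": {"http://a.example/", "http://c.example/"},
    "search": {"http://b.example/"},
}
popularity = {"http://a.example/": 5, "http://b.example/": 9, "http://c.example/": 7}

results = search("library automation search", index, popularity)
print(results)
```

Here the two pages matching both "library" and "automation" outrank the more popular page that matches only "search", and popularity decides the order between the two equally relevant pages.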


5. Categories of Search Engines

Most search engines offer several functions besides being a tool for finding Web sites. They provide information such as recent news stories, newsgroup postings, reference material (such as dictionary entries and maps), and e-mail addresses, street addresses and telephone numbers of businesses and individuals. Search engines can be divided into the following categories:

5.1. Primary Search Engines

Primary search engines deploy computer programs called web crawlers, spiders, web wanderers or web worms to traverse the Web and scan web sites for words, phrases, or the whole site, so as to generate a database of Web pages. Search engines do not actually search the Web in real time; they use a database of web pages, collected by their robots following hypertext links, that is updated on a regular basis. Primary search engines are the most commonly used search engines. They vary to a great extent in terms of:

  • database size: width and depth of web sites indexed by their "spiders"
  • database content: Full-text or metadata, i.e. URL, keywords, title, description, etc.
  • syntax used: word search, Boolean search, phrase search and other advanced features
  • ranking of results: paid sites, recent update, popularity, etc.

Primary search engines can also be divided into the following four categories according to the method their robots use to collect information for their indexing databases:

Automated Robots: The automated robots scan a large part of the web, wherever they are allowed.

Designated Robots: The designated robots (like those used by ALIWEB or W3 Catalogue) are programmed to scan only specific sites on the web rather than traverse the entire web. Sites using “designated robots” allow users to submit their web sites to the search engine. On submission, the new URL is added to the robot's queue of websites to be visited on its next foray onto the web.

Breadth-oriented Search Engines: Some robots concentrate more on top-level resources, which tend to hold larger subject-oriented indexes, as in JumpStation II. A query on a “breadth-oriented search engine” returns fewer hits, with a higher percentage of them being meta resources or subject portals.

Depth-oriented Search Engines: “Depth-oriented robots” (like WebCrawler) follow links to deeper levels. They pull out individual items located on a server’s indexes and follow links to other servers. Depth-oriented robots have a tendency to retrieve duplicates or false hits. Search engines that deploy depth-oriented robots have a tendency to catalogue too much information.

5.2. Meta Search Engines

A meta search engine is a search engine that searches the databases of several other search engines at the same time to locate web pages that match the search terms given by a user. Unlike primary search engines and directories, meta search engines do not have their own databases, i.e. they do not collect web pages, do not accept URL additions, and do not classify or review web sites. Instead, they send queries simultaneously to multiple search engines and/or Web directories. Many meta search engines integrate the search results: duplicate findings are merged into one entry, some rank the results according to various criteria, and some allow the user to select which search engines are searched. Ask Jeeves, MetaCrawler, Savvy Search, @Once!, All-in-One Search Page, Galaxy, Internet Sleuth, Magellan, Net Search, Dogpile, Metafind, Metasearch and ixquick.com are some of the better-known meta search engines.

Successful use of a meta search engine depends on the current status of each of the primary search engines being used. Some primary search engines may be too busy at the time and some may be unreachable. A query submitted to a meta search engine, with its uniform search interface and syntax, has to be applied against a diversity of individual search engines. It is, therefore, impossible for one meta search engine to take advantage of all the features of the individual search engines. Boolean searches, for example, may produce varied results. Phrase searches may not be supported. Other features, such as query refinement, are sacrificed in a meta search engine.

Moreover, meta search engines generally do not conduct exhaustive searches and do not bring back all the pages from each of the individual search engines. They only make use of the top 10 to 100 hits from each of them. While this is sufficient for most searches, individual search engines must be consulted if one needs to go beyond the top hits as determined by the meta search engine. Some meta search engines facilitate this by providing query links back to the individual search engines.
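The behaviour described in this section (sending one query to several engines, keeping only the top hits of each, skipping unreachable engines and merging duplicates) can be sketched as follows. The three engine functions are hypothetical stand-ins that return fixed lists, and the merge rule (pages found by more engines first) is just one possible ranking criterion, not the method of any particular service.

```python
def meta_search(query, engines, top_n=10):
    """Query every engine, keep its top_n hits and merge duplicates by URL."""
    merged = {}
    for name, engine in engines.items():
        try:
            hits = engine(query)[:top_n]
        except ConnectionError:
            continue              # engine busy or unreachable: skip it
        for rank, url in enumerate(hits):
            entry = merged.setdefault(url, {"sources": [], "best_rank": rank})
            entry["sources"].append(name)
            entry["best_rank"] = min(entry["best_rank"], rank)
    # Pages reported by more engines come first, then the best individual rank.
    return sorted(merged,
                  key=lambda u: (-len(merged[u]["sources"]),
                                 merged[u]["best_rank"], u))


# Hypothetical primary engines returning fixed result lists.
def engine_a(query):
    return ["http://x.example/", "http://y.example/"]


def engine_b(query):
    return ["http://y.example/", "http://z.example/"]


def engine_down(query):
    raise ConnectionError("engine unreachable")


results = meta_search("library automation",
                      {"A": engine_a, "B": engine_b, "C": engine_down},
                      top_n=2)
print(results)
```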

5.3. Specialised Search Engines

These are primary search engines that focus on a small or specialized segment of the Internet. Examples of specialised search engines are Hoovers Online (http://www.hoovers.com/) and Sirus (http://www.sirus.com/).

5.4. Subject or Web Directories

Directories are the Yellow Pages of the Internet. They contain information that has been submitted to them by their indexers or by users who submit entries. Subject directories are often manually maintained, browsable and searchable web-based interfaces. Yahoo! is the most famous subject directory and has several subject headings. A subject directory contains information that is organized into categories and subcategories, or topics and subtopics. Directories differ from indexes in the way they organize information: web indexes simply compile a growing number of records, while directories organize information into groups of related records. Besides Yahoo!, the best-known directories include the Open Directory Project (Dmoz.org) and LookSmart.

Directories contain fewer resources than search engine databases, because the resources in a directory are manually selected, maintained and updated. This, in turn, can be to the advantage of users, especially those searching for a general topic. Directories, therefore, increase the likelihood of retrieving relevant results and of finding high-quality, reliable web sites.

Directories also have some drawbacks. Items on similar topics may be placed under two different subject categories in a directory. Directories may not be as current as search engine databases: search engines update their databases automatically using robots or spiders, whereas directories are updated manually after new entries are selected, rated and categorized. Directories may also miss some important resources, since resources are selected manually.


5.5. Hybrid Search Engines

In the early days of the web, a search engine presented either crawler-based results or human-powered listings. Today it is extremely common for search engines to present both types of results. Usually, a hybrid search engine will favour one type of listing over the other.

5.6. Subject Gateways or Subject Portals

Subject gateways, variously called meta resources, subject-based information gateways (SBIGs), subject-based gateways, subject index gateways, virtual libraries, clearing houses, subject trees, pathfinders and guides to Internet resources, are facilities that allow easier access to network-based resources in a defined subject area.

A subject portal can be defined as an organized and structured guide to Internet-based electronic information resources that are carefully selected, after a predefined process of evaluation and filtration, in a subject area or specialty. Subject portals are often independent web sites, or part of an institution's or library’s web site, that serve as a guide to Internet resources considered appropriate for their target audiences. Some examples of subject portals are LibrarySpot, Librarian’s Index to Internet, and Intute.
6. Choosing a Search Engine

6.1. Ease of Use

Most search engines provide a single search window for entering search terms. The search engine queries its database for the search terms keyed in by the user. Some search engines are capable of understanding natural language. Quite often the same controls are available from the standard search window through rather less user-friendly operators and symbols such as AND, OR, NOT, NEAR, (), +, -, "", fieldname:, etc.

6.2. Comprehensiveness

Several well-established search services index hundreds of millions of Web documents. Generally they index the full text of documents. No existing search engine indexes the complete spectrum of information available on the Internet. As the content available on the web increases, the proportion indexed is likely to decrease.

6.3. Quality of Content

Search engines are increasingly paying attention to techniques that address the quality of content. Relevant, high-quality search results may be given greater weight than the speed and size of a search engine. The results of such processing, combined with the matching of search terms, generate a relevance score that is used to rank search results. The currency of the information in the database, particularly working links, is another aspect of quality.

6.4. Control Over the Search

For complex queries, the ability to specify search parameters in detail becomes important. A user should be able to combine multiple search terms with ease. Most search engines allow users to specify how search terms are combined, either by typing the search string with the Boolean terms AND, OR and NOT into the search window, or by offering equivalent functionality via drop-down menus.

6.5. Flexibility in Searching

Advanced search features in some search engines provide the following flexibility:

  • Provision to restrict the search to specific fields (i.e. title, description, keywords, links, body, etc.);
  • Provision to restrict a search to a specified time period;
  • Provision to search for similar documents (if a user finds something useful); and
  • Provision to search within the results of a previous search.

6.6. Assessment of Relevance

A good search engine should take the quality of resources into account while ranking search results. Search engines rank search results using the frequency of keywords in the web page, the position of keywords on the web page (title, description, body, etc.) and the number of inbound links from other web sites. Irrespective of the ranking mechanism, a user would prefer to see documents relevant to his or her search among the first few results.
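To make the three ranking signals concrete, the sketch below combines keyword frequency, keyword position (title versus body) and inbound-link count into a single score. The weights (3.0, 1.0 and 0.5), the page structure and the function name `relevance_score` are arbitrary assumptions chosen for illustration; actual search engines do not publish their formulas.

```python
def relevance_score(query_terms, page, inbound_links):
    """Combine keyword frequency, keyword position and inbound links."""
    title_words = page["title"].lower().split()
    body_words = page["body"].lower().split()
    score = 0.0
    for term in query_terms:
        term = term.lower()
        score += 3.0 * title_words.count(term)   # title matches weigh more (assumed)
        score += 1.0 * body_words.count(term)    # each body occurrence counts once
    score += 0.5 * inbound_links                 # assumed weight per inbound link
    return score


# Hypothetical pages and inbound-link counts.
page_a = {"title": "Library automation",
          "body": "automation software for the library"}
page_b = {"title": "Software news",
          "body": "library automation automation"}

query = ["library", "automation"]
score_a = relevance_score(query, page_a, inbound_links=2)
score_b = relevance_score(query, page_b, inbound_links=4)
print(score_a, score_b)
```

With these assumed weights, the page with the query terms in its title scores higher even though the other page has more inbound links.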

6.7. Informative Presentation of Results

Search engines should report the number of hits found for a search. A user would prefer to get enough information to judge the usefulness of a link before following it. Most search engines take some portion of the text to provide an abstract. Dates are often useful.

7. Searching the Web: Search Techniques

When a user submits a query to a search engine, the query is checked against the search engine's index of web pages, and relevant documents are returned with their URLs as hits. These hits are ranked in order of relevance.

Most search engines offer two types of interface for searching their databases: basic search and advanced search. In a basic search, a user simply keys in his or her search terms without sifting through pull-down menus for additional options. Full-featured search engines have options to expand or limit searches in a variety of ways.

7.1. Basic Search

Most search engines offer a dialog box, pane or line where search terms can be keyed in, followed by options to either submit or clear the search. A user enters a word or words, called “keywords” or “search terms”, that he or she would like to search for. The search engine then looks through the indexes in its database for matches. It might look in the title, description or entire text of a Web page. After the search is conducted, a list of sites that match the requested search terms is produced as the result.

7.2. Advanced Search or Refining Your Search

Different search engines have different methods of refining queries. Options for advanced search differ from one search engine to another, but common features include the ability to search on more than one word, to confine the search to a specified field and to exclude words that the user does not want in the results. A user may also search for proper names, for phrases, and for words that are found within a certain proximity of other search terms. Several search engines allow the use of Boolean operators or signs like “+” or “-” to refine the search. Some of the popular techniques are as follows:

7.2.1.  Boolean Operators

Many search engines allow the use of “AND”, “OR” and “NOT” to narrow or broaden a request. Boolean operators connect the concepts of a search query in a meaningful way so that a user can retrieve relevant search results.

The operator "AND" narrows the search to results that contain both of the terms it joins. The operator "OR" broadens the search to results that contain either of the terms it joins. The operator "NOT" excludes from the search results any item containing the term that follows it. "NOT" should be used with caution, as it might eliminate relevant results from a search.
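The effect of the three operators can be illustrated with Python sets. The tiny index below is hypothetical; each term maps to the set of documents that contain it.

```python
# Hypothetical index: term -> set of documents containing that term.
index = {
    "library": {"doc1", "doc2", "doc3"},
    "automation": {"doc2", "doc3"},
    "software": {"doc3", "doc4"},
}

# library AND automation: only documents containing both terms (narrows).
and_hits = index["library"] & index["automation"]

# library OR software: documents containing either term (broadens).
or_hits = index["library"] | index["software"]

# library NOT software: documents with "library" but without "software".
not_hits = index["library"] - index["software"]

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

In this example, "NOT software" drops doc3 even though it also mentions library automation, which shows why the operator should be used with caution.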

7.2.2. Phrase Searching

The ability to query on phrases is very important in a search engine. A phrase is a group of words that must appear next to each other in a specified order. Phrase searches are especially useful when searching for famous sayings or proper names, where the search terms must appear in an exact order. Most search engines support this feature. To indicate a phrase, surround it with double quotation marks. Phrase searching is one of the best search features for increasing the chance of retrieving relevant results.
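A phrase match can be sketched as a check that the words of the phrase occur consecutively and in order in the text. The function name contains_phrase and the simple whitespace tokenisation are assumptions made for this illustration.

```python
def contains_phrase(text, phrase):
    """True if the words of `phrase` occur next to each other, in order, in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return False
    # Slide a window of n words over the text and compare it with the phrase.
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

print(contains_phrase("To be or not to be", "not to be"))                # True
print(contains_phrase("Automation of the library", "library automation"))  # False
```

The second call fails because both words are present but not adjacent in the requested order, which is exactly the distinction between a phrase search and a plain keyword search.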

7.2.3. Proximity Searching

Proximity operators are used to specify the relative location of words in a document. These operators facilitate searching for words that must be in the same phrase, sentence or paragraph of a record. Proximity operators help us to search for words within a certain distance of one another in databases. For example, a search may require that two concepts be in the same sentence but not necessarily next to each other, as in a phrase. One such operator is NEAR, which means that the terms entered should be within a certain number of words of each other. Typically, the distance between two concepts can be 10-25 words. NEAR allows the terms to be in any order. Different search engines may use different proximity operators. The symbols generally used in this type of search are “w” (with/within) and “n” (near). For example, a search for “library 'near' automation” would retrieve documents containing “Library automation” and “Automation of Library”.

Another operator is “followed by”, which means that one term must follow the other. ADJ (adjacent) serves the same function. A search engine that allows searching on phrases essentially uses the same method, i.e., determining the adjacency of keywords.
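The two kinds of proximity operator can be sketched using word positions. The function names near and followed_by, the default distance of 10 words and the reading of “followed by” as immediate adjacency are assumptions for illustration; actual operators and their defaults vary between search engines.

```python
def positions(words, term):
    """Return every index at which `term` occurs in the word list."""
    return [i for i, word in enumerate(words) if word == term]

def near(text, a, b, distance=10):
    """True if a and b occur within `distance` words of each other, in any order."""
    words = text.lower().split()
    return any(
        i != j and abs(i - j) <= distance
        for i in positions(words, a.lower())
        for j in positions(words, b.lower())
    )

def followed_by(text, a, b):
    """True if b occurs immediately after a (adjacency in a fixed order)."""
    words = text.lower().split()
    a, b = a.lower(), b.lower()
    return any(words[i] == a and words[i + 1] == b for i in range(len(words) - 1))

print(near("Library automation", "library", "automation"))            # True
print(near("Automation of Library", "library", "automation"))         # True
print(followed_by("Automation of Library", "library", "automation"))  # False
```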

7.2.4. Parentheses

Most search engines permit the use of parentheses to group related terms. This is particularly useful for clustering synonyms or for searching specific terms together before other terms are searched. Parentheses may be used in combination with other search techniques.
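The effect of grouping can be seen by evaluating the same three terms with the parentheses in different places. The sets below are made-up illustrative data; each holds the pages on which a term occurs.

```python
# Hypothetical sets of pages containing each term.
college = {"p1", "p2"}
university = {"p3", "p4"}
library = {"p2", "p3", "p5"}

# (college OR university) AND library: the synonyms are clustered first.
grouped = (college | university) & library

# college OR (university AND library): a different and broader query.
regrouped = college | (university & library)

print(sorted(grouped))    # ['p2', 'p3']
print(sorted(regrouped))  # ['p1', 'p2', 'p3']
```

The first form returns only library pages about either kind of institution, while the second also returns p1, a college page that never mentions a library.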

7.2.5. Truncation and Wildcards

Truncation is a technique for searching on multiple endings of a word. It is also called stemming. Most search engines that support this feature use certain symbols (called “wildcards”) such as *, ? or # at the end of the word root to indicate a truncated search. For example, in an engine that uses * for truncation, librar* would match library, libraries and librarian.
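Truncation can be sketched by expanding a wildcard pattern against a list of indexed words. The vocabulary and the function name expand are hypothetical, and the meaning of the symbols follows Python's fnmatch module (* for any run of characters, ? for exactly one character) rather than the syntax of any particular search engine.

```python
from fnmatch import fnmatchcase

# Hypothetical list of words found in an index.
VOCABULARY = ["librarian", "libraries", "library", "liberty", "woman", "women"]

def expand(pattern, vocabulary=VOCABULARY):
    """Return the indexed words that match a wildcard pattern."""
    return [word for word in vocabulary if fnmatchcase(word, pattern)]

print(expand("librar*"))  # ['librarian', 'libraries', 'library']
print(expand("wom?n"))    # ['woman', 'women']
```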

7.2.6.  Case Sensitivity

Case sensitivity refers to the ability of search engines to distinguish between upper-case and lower-case letters. Most search engines are not case sensitive and simply read all letters as lower case. Others may distinguish between the word “aids” and the disease “AIDS”. Using lower case is advised, because a lower-case query will generally retrieve upper-case matches as well.
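The difference between the two behaviours can be sketched as follows. The function name find and the sample words are assumptions for illustration only.

```python
def find(term, words, case_sensitive=False):
    """Return the words that match `term`, with or without regard to case."""
    if case_sensitive:
        return [word for word in words if word == term]
    return [word for word in words if word.lower() == term.lower()]

WORDS = ["AIDS", "research", "teaching", "aids"]

print(find("aids", WORDS))                       # ['AIDS', 'aids']
print(find("AIDS", WORDS, case_sensitive=True))  # ['AIDS']
```

An engine that is not case sensitive behaves like the first call, which is why a lower-case query also retrieves the upper-case form.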

7.2.7.  Field Searching

Web pages are made up of different parts or fields. Several search engines can limit a search to a specific area of a web page. This technique helps to increase the relevance of search results. The help section of a search engine that supports field searching describes which fields may be searched. The actual field names may differ among search engines.
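Field searching can be sketched by storing each page as a record with named parts and matching a term against one part only. The field names title and body, the sample records and the function name field_search are hypothetical.

```python
# Hypothetical page records with two searchable fields.
PAGES = [
    {"url": "http://lib.example.edu/guide",
     "title": "Library Automation Guide",
     "body": "An overview of circulation software"},
    {"url": "http://news.example.org/today",
     "title": "Campus News",
     "body": "The library automation project was approved"},
]

def field_search(pages, field, term):
    """Return the addresses of pages whose given field contains the term as a word."""
    term = term.lower()
    return [page["url"] for page in pages if term in page[field].lower().split()]

print(field_search(PAGES, "title", "automation"))
print(field_search(PAGES, "body", "automation"))
```

The same keyword retrieves a different page depending on the field searched: the first call returns the guide, the second the news item.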

8.0 Evaluation of Search Engines

Evaluation of search engines is the process of identifying and collecting data about search engines and establishing criteria by which their success can be assessed. Evaluation of search engines should determine both their quality and the degree to which they have accomplished their stated goals and objectives.

As discussed previously, search engines consist of three parts: i) a robot or web crawler; ii) a database of web documents; and iii) an agent or search engine operating on that database, together with a series of programs that determine how search results are displayed. Joe Barker of the University of California has identified the following criteria for evaluating the three different components of a search engine:

8.1. Database of Web Documents

Size of Database and Method of Compilation:
  • How many documents does the search engine claim to have?
  • How much of the total web can be searched using a search engine?
  • How is the index compiled (collection method)? Automated or human input?

Currency or Up-to-date

  • Search engine databases consist of copies of web pages and other documents that were made when their crawlers or spiders last visited each site. How often is the database refreshed to find new pages?
  • Does the search engine claim to revisit sites based to some degree on how often those sites add to or change their information?
  • How often do its crawlers update the copies of the web pages in its database?
  • How soon are pages included in the index after they are collected either by spiders or through submissions?
  • Pages crawled per day? How long does it take for the search engine to update its entire index?
  • Is there evidence of link checking (i.e. dead or out-of-date links)?

Indexing

  • Are there any provisions for use of controlled vocabulary?
  • Does it use a stop word list? How extensive is it? Is it documented what it identifies as a stop word?
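The effect of a stop word list on indexing can be sketched as below. The list shown is a tiny made-up example; real lists differ from engine to engine, which is why their extent and documentation are worth checking.

```python
# A tiny hypothetical stop word list; real lists vary between search engines.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def index_terms(text):
    """Return the words of a text that would be indexed after stop word removal."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(index_terms("The history of the library"))  # ['history', 'library']
print(index_terms("To be or not to be"))          # ['be', 'or', 'not', 'be']
```

The second call shows the cost of an aggressive list: part of a famous phrase is lost, so a phrase search for it could fail.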

Coverage

  • Types of resources indexed in the database of search engine (ftp, www, newsgroups, etc.)
  • Are there any special criteria for inclusion? 

Completeness of Text

  • Is the database really “full text”, or are only parts of the pages collected?
  • What elements (e.g. title, keywords, descriptors, body) of a page are indexed?
  • Is every word indexed?

Types of Documents Covered

  • All search engines search web pages. Do they also have extensive PDF, Word, Excel, PowerPoint, and other formats like WordPerfect?
  • Are they full-text searchable?

Speed and Consistency

  • How fast is it?
  • How consistent is it? Do you get different results at different times?
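One simple way to put a number on consistency is to run the same query at two different times and measure how much the two result lists overlap. The sketch below uses the Jaccard measure (shared results divided by all distinct results); the choice of measure, the function name overlap and the sample lists are assumptions for illustration.

```python
def overlap(first_run, second_run):
    """Jaccard overlap between two result lists, ignoring rank order."""
    a, b = set(first_run), set(second_run)
    if not a and not b:
        return 1.0  # two empty result lists are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical results for the same query on two different days.
monday = ["u1", "u2", "u3", "u4"]
friday = ["u2", "u3", "u4", "u5"]

print(overlap(monday, friday))  # 3 shared results out of 5 distinct ones
```

A value of 1.0 means the two runs returned the same set of pages; lower values indicate less consistent results.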

8.2. Capabilities of a Search Engine

  • Does the search engine spider have access to password-protected sites?
  • Is the spider able to follow frame links and image maps?
  • Where can the search engine not reach?
  • All search engines let you enter some keywords and search on them. What happens inside?
  • Can you limit the search in ways that will increase your chances of finding what you are looking for?

8.3. Basic Search Options and Limitations

  • Automatic default of AND assumed between words?
  • Does the search engine accept “ ” to search phrases?
  • Is there an easy way to allow for synonyms and equivalent terms (OR searching)? 

8.4 Advanced Search Options and Limitations:

  • Can you restrict your search terms to specific fields, such as the document title?
  • Can you restrict some words in certain fields and some others in other fields?
  • Can you restrict to documents only from a certain domain (org, edu, gov, etc.)?
  • Can it be limited to more than one domain, or only one?
  • Can you limit by type of document (PDF or excel, etc.)? More than one?
  • Can you limit by language?
  • How reliably and easily can you limit to date last updated?

8.5 General Limitations and Features

  • What do you have to do to make it search on common or stop words?
  • Maximum limit on search terms or on search complexity?
  • Ability to search within previous results?
  • Can you count on consistent results from search to search and from day to day?
  • Can you customize the search or display?
  • Is there a "family" filter? Does it work well? Is it easy to turn it on or off?

8.6. Results Display

  • All search engines return a list of results they “think” are relevant. How relevant are the results actually?
  • Ranking: Do pages with search terms juxtaposed (like a phrase) rank highest? Do you get pages with only some of your words, perhaps in addition to pages with them all?
  • Display: Are your keywords highlighted in context, showing the excerpts from the web pages that caused the match? Or is some other excerpt from the page shown?
  • Collapse Pages from the Same Site: Does it show only one or a few pages from the same site? Does it show the one(s) containing the terms that were searched? How easy is it to see all results from the same site? Can this be changed and saved as your preferred search method?
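Collapsing pages from the same site can be sketched by grouping a ranked result list by host name and keeping only the first few pages per host. The function name collapse_by_site, the per_site parameter and the sample addresses are hypothetical.

```python
from urllib.parse import urlparse

def collapse_by_site(ranked_urls, per_site=1):
    """Keep at most `per_site` results from each host, preserving rank order."""
    seen = {}
    kept = []
    for url in ranked_urls:
        host = urlparse(url).netloc
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= per_site:
            kept.append(url)
    return kept

# Hypothetical ranked results for one query.
RESULTS = [
    "http://lib.example.edu/guide",
    "http://lib.example.edu/faq",
    "http://news.example.org/today",
    "http://lib.example.edu/contact",
]

print(collapse_by_site(RESULTS))
```

With the default of one page per site, only the highest-ranked page from lib.example.edu is kept, followed by the page from news.example.org.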

9.0 Important Search Engines

There are hundreds of search engines on the Internet, and more are being created every day. According to the comScore report (April 2014), 66.6% of searches were powered by Google, 18.5% by Microsoft and 10% by Yahoo.[1]
Details about a few important search engines are given below:



9.1 Primary Search Engines

9.1.1.  Google (http://www.google.com/)

Google is the largest and most preferred search engine for today's internet users, handling billions of search queries every day. It was developed by Larry Page and Sergey Brin, two PhD students at Stanford University. Its relevance ranking uses two factors not generally included in other search engine rankings, i.e. the number of links to the page from elsewhere and the “importance” of the pages that link to it. Other ranking factors are the number of hits on the search words in the title and the text, and the proximity of the search terms to each other.
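The idea that a page's rank depends both on how many pages link to it and on how important those pages are can be sketched with a small PageRank-style calculation. This is an illustrative simplification, not Google's actual ranking code: the link graph is made up, the function name link_rank is hypothetical, and the damping value of 0.85 is an assumed conventional setting.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D link to C, and C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}

RANKS = link_rank(LINKS)
for page in sorted(RANKS, key=RANKS.get, reverse=True):
    print(page, round(RANKS[page], 3))
```

C scores highest because three pages link to it. A has only one incoming link, but it comes from the important page C, so A scores well above B and D, which have no incoming links at all.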

Google has become for many the pre-eminent Web search engine since it was launched in 1999. Its database, besides Web pages, includes additional file types such as PDF, .ps, .doc, .xls, .txt, .ppt, .rtf, .asp, .wpd and more. Google also has databases of images, Usenet newsgroups and videos.

Google supports Boolean searching, proximity searching, field searching, limits by file type, language and domain, character searching, and number and number-range searching. Results are sorted by relevance, which is determined by links from other pages, with weightage given to authoritative sites. Pages are also clustered by site. The display includes the title, the URL, a brief extract showing text near the search terms, the date if available and, for many hits, a link to a cached copy of the page. The default output is 10 hits per screen, but the searcher can also choose 20, 30, 50 or 100 hits at a time on the preferences page.

9.1.2.  Bing Search (http://www.bing.com/)

Bing is the search engine launched by Microsoft in May 2009. It was known previously as Live Search, Windows Live Search and MSN Search. The early version provided results from Inktomi, LookSmart and AltaVista. Later, Microsoft updated it to provide search results from its own search engine. In early 2006, Microsoft launched Windows Live Search, which replaced MSN Search by late 2006.

Bing is available in many languages and can be integrated in various other sites like Hotmail, Facebook, Apple etc.

9.1.3.  Yahoo! (http://www.yahoo.com/)

Yahoo! is one of the best known and most popular Internet portals. Originally just a subject directory, it is now a search engine, directory and portal. A search on Yahoo provides results that include a few categories from the directory and from Inside Yahoo!. Originally, in 2001, Yahoo used the Inktomi database for search results and later used Google as the back end for search till 2004. In 2003 it created its own independent web-crawler-based search engine. In July 2009, Yahoo signed an agreement with Microsoft to show search results powered by Bing on its web pages. Yahoo supports Boolean searching, proximity searching, field searching, and limits by language, domain, date, file type, country and adult content.

10. Summary

Search engines are websites that deploy tools to search the diverse and disorganized sources of information available on the Internet. A variety of search, resource discovery and browsing tools have been developed to support more efficient information retrieval. Search engines use automated programs variously called spiders, robots, crawlers, wanderers and worms. Search engines are defined as tools for finding, classifying and storing information about various websites on the Internet.

The chapter traces the evolution of search engines from Archie, developed in 1990, to Bing, launched in 2009. The evolution of search engines is checkered with companies in the business of web search technology acquiring other companies to strengthen their own position.

Describing the functioning of a search engine, the chapter elaborates on the following three components of a search engine:
The Robots: programs that traverse the web using the links embedded in web pages to find information and build indexes of the web pages visited;
Databases: a database of indexing information harvested by the robots / spiders from web pages;
User Interface or Agent: software that searches through the database, which consists of an index of millions of pages, to find matches to a search and rank them in order of relevance. The agent also displays the results of the search to users in convenient ways.

Search engines have their own methods of organizing information. The chapter divides search engines into five categories, viz. primary search engines, meta search engines, specialized search engines, web directories and hybrid search engines.
The chapter elaborates on the criteria for choosing search engines and the method for evaluating them. It explains the use of meta tags to describe web pages and how search engines use them for indexing a web site. It describes criteria that can be deployed to assess the success of a search engine. Lastly, the chapter describes important search engines in brief.

References and Readings

  • Animated Internet: How Search Engines. (http://www.learnthenet.com/english/animate/search.html)

  • Balas, J. Beyond Veronica and Yahoo! More Internet search tools. Computers in Libraries, 16(3), 34-35, 1996.

  • Barlow, Linda. The spider's apprentice: A helpful guide to search engines. March 2, 2001. (http://www.monash.com/spidap.html)

  • Bates, M.E. The Internet: Part of a professional searcher’s toolkit. Online, 21(1), Jan/Feb., 1997.

  • Brandt, D.S. Relevancy and searching the Internet. Computers and Libraries, 16(8), 1996.

  • Cohen, Laura. Searching the Internet: recommended sites and search techniques. May 23, 2001. (http://library.albany.edu/internet/search.html)

  • Cohen, Laura. Conducting research on the Internet. May 2001. (http://library.albany.edu/internet/research.html)

  • Cohen, Laura. Quick reference guide to search engine syntax. May 2, 2001. (http://library.albany.edu/internet/syntax.html)

  • Cohen, Laura. Second generation searching on the Web. June 12, 2001. (http://library.albany.edu/internet/second.html)

  • Dong, L.T. and Su, L.T. Search engines on the World Wide Web and information retrieval from the Internet: A review and evaluation. Online and CD ROM Reviews, 21(2), 1997.

  • Falk, H. World Wide Web and retrieval. Electronic Library, 15(1), 1997.

  • Grossan, Bruce. What they are, how they work, and practical suggestions for getting the most out of them. (http://webreference.com/content/search/refer.html)

  • Hill, Brad. Google for dummies. New York, John Wiley & Sons, Inc., 2003

  • How to use Web search engines (http://www.monash.com/spidap4.html)

  • Kent, Peter. Search engine optimisation for dummies. New York, John Wiley & Sons, Inc., 2004

  • Levine, John R., Baroudi, Carol, Young, M.L. The Internet for dummies : Starter Kit, 7th ed. New York, John Wiley & Sons, Inc., 2000

  • Liu, Jian. Guide to meta search engines. (http://www.indiana.edu/~librcsd/search/meta.html)

  • Lowe, Doug. Internet Explorer 6 for dummies. New York, John Wiley & Sons, Inc., 2001

  • Notess, Greg R. Search engine statistics: Relative size showdown: Data from search engine analysis run on Dec. 31, 2002. (http://searchengineshowdown.com/stats/size.shtml)

  • Search engine show down (http://searchengineshowdown.com/)

  • SearchEngines.com - search engine rankings and search engine optimization tips. (www.searchengines.com/)

  • Vine, Rita. Real people don't do Boolean: How to teach end users to find high-quality information on the Internet. Information Outlook,  March, 2001  (http://www.findarticles.com/p/articles/mi_m0FWE/is_3_5/ai_71965499)

  • Tyner, Ross. Sink or swim: Internet search tools and techniques, 2001 (http://www.sci.ouc.bc.ca/libr/connect96/search.htm)
  • Search Engine Basics (http://seoenthusiasts.com/resources/search-engine-basics/)
  • We Knew the Web Was Big (http://googleblog.blogspot.in/2008/07/we-knew-web-was-big.html)

