Monday, December 8, 2014

21. Digital library protocols and standards

Suggestions from all of you are cordially invited for the creation of this blog. Please send your suggestions and entries; its field of coverage is the worldwide knowledge community, and it will make an important contribution to the career development of all aspirants. You may send your suggestions to this e-mail address: chandrashekhar.malav@yahoo.com



P- 01. Digital Libraries*

By: Jagdish Arora, Paper Coordinator

Multiple Choice Questions

Question 1: Multiple Choice

The http protocol is used for carrying requests from clients to the server over the:
  • E-mail
  • World Wide Web (WWW) (Correct Answer)
  • Gopher and Archie
  • Voice over Internet

Question 2: Multiple Choice

The OAI-PMH protocol has its application in harvesting metadata from:
  • Institutional repositories (Correct Answer)
  • Internet web sites
  • FTP servers
  • Online databases

Question 3: Multiple Choice

PREMIS and OAIS are two protocols for:
  • Digital preservation (Correct Answer)
  • Information retrieval
  • Bibliographic records
  • Record structure

Question 4: Multiple Choice

SGML, XML and Office Document Architecture are examples of:
  • Unstructured text format
  • Structured text format (Correct Answer)
  • Page description language
  • Page image format

Question 5: Multiple Choice

The TCP/IP protocol supports:
  • Packet switched networks (Correct Answer)
  • Circuit switched networks
  • Message switched networks
  • Asynchronous Transfer Mode (ATM)

Question 6: Multiple Choice

Transmission Control Protocol (TCP) is responsible for:
  • Breaking down information into packets
  • Collection and reassembly of packets (Correct Answer)
  • Dissemination of packets to the right destination
  • Embedding control bits

Question 7: Multiple Choice

Which one of the following is not a bibliographic standard?
  • MARC
  • METS
  • MODS
  • ASCII (Correct Answer)

Question 8: Multiple Choice

Which one of the following standards is not maintained by OCLC?
  • Dublin Core (Correct Answer)
  • MODS
  • METS
  • BIB-1

Question 9: Multiple Choice

Which one of the following protocols is not applicable for information retrieval?
  • SRW/SRU
  • REST
  • SOAP
  • OAIS (Correct Answer)

Question 10: Multiple Choice

________ is an organization that is specifically assigned the task of preparing standards in Library & Information Science:
  • ISO
  • IEEE
  • ANSI
  • NISO (Correct Answer)
..................................................................................................................................................................



0. Objectives

  • Introduce standards and protocols and their need and importance in digital libraries
  • Impart knowledge on interoperability and data exchange in digital libraries and the roles of standards and protocols
  • Impart knowledge on important standards and protocols applicable to digital libraries.


1. Introduction

Building a digital library requires a number of infrastructural software and hardware components that are not available off-the-shelf as a packaged solution in the marketplace. There are no turn-key, monolithic systems for digital libraries; instead, digital libraries are collections of disparate systems and resources connected through a network, made interoperable using open system architecture and open protocols and standards, and integrated within one web-based interface. The use of open architecture and open standards makes it possible to gather the pieces of required infrastructure, be it hardware, software or accessories, from different vendors in the marketplace and integrate them into a working digital library environment. Several components required for setting up a digital library are internal to the institution, but several others are distributed across the Internet, owned and controlled by a large number of independent players. The task of building a digital library, therefore, requires a great deal of seamless integration of various components for its efficient functioning. As such, standards and protocols have a major role to play in building a digital library.

Standards and protocols are the backbone of a digital library: they are instrumental in its efficient implementation with quality and consistency, facilitating interoperability, data transfer and exchange. Uniform standards and protocols are a pre-requisite for data transfer, exchange and interoperability amongst digital libraries.

This chapter introduces standards and protocols, their role and importance in building digital libraries. The chapter describes important digital library standards and protocols used for digital communication, bibliographic data rendering, record structure, encoding standards to handle multi-lingual records, information retrieval standards, formats and media types used in digital library and digital preservation.


2. Standards and Protocols: Definition and Importance

A protocol is a series of prescribed steps to be taken, usually in order to allow for the coordinated action of multiple parties. In the world of computers, protocols are used to allow different computers and/or software applications to work and communicate with one another. Because computer protocols are frequently formalized by national and international standards organizations such as ISO and ITU, they are also considered standards. As such, a protocol that is accepted by most of the parties that implement it can be considered a standard. However, not every protocol is a standard, just as not every standard is a protocol. Standards are generally agreed-upon models for comparison. In the world of computers, standards are often used to define syntactic or other rule sets, and occasionally protocols, that are used as a basis for comparison (Ascher Interactive, 2007).

Standards support cooperative relationships amongst multiple vendors and implementors and provide a common base from which individual developments may emerge. Standards make it possible to share and collaborate in the development of products and processes across institutional and political boundaries. However, too many standards for the same product or process undermine the utility of having any standard at all. The citing of bibliographic references is a good example, since there are numerous rival and incompatible standards that can be used to cite a document, for example those of the American Psychological Association, the Modern Language Association, the Chicago Manual of Style, Indian standards, ANSI Z39.29 (American National Standard for Bibliographic References) and several other well-known standards that editors or publishers may adopt.

Standards are supported by a range of national and international organizations, including professional associations such as the Institute of Electrical and Electronics Engineers (IEEE), national standards institutions such as the American National Standards Institute (ANSI), the British Standards Institution (BSI) and the Bureau of Indian Standards, and international bodies such as the International Organization for Standardization (ISO). The US National Information Standards Organization (NISO), accredited by ANSI, is specifically assigned the task of preparing standards for library and information science.

A number of important institutions and organizations are actively involved in the development and promotion of standards relevant to digital libraries. For example, the Digital Library Federation (DLF), a consortium of libraries and related agencies, identifies standards for digital collections and network access as one of its objectives (http://www.diglib.org). The DLF operates under the administrative umbrella of the Council on Library and Information Resources (http://www.clir.org) located in Washington, DC. The Library of Congress (http://www.loc.gov) plays an important role in maintaining several key standards such as MARC, and in the development of MARC within an XML environment. The International Federation of Library Associations and Institutions (IFLA) maintains a gateway, IFLANET Digital Libraries, to resources about a variety of relevant standards (http://www.ifla.org/II/metadata.htm).



3. Communication Protocols

Communication protocols are predefined sets of prompts and responses which two computers follow while communicating with each other. Since digital libraries are designed around Internet and Web technologies, communication protocols such as Transmission Control Protocol / Internet Protocol (TCP/IP), Hyper Text Transfer Protocol (http) and File Transfer Protocol (ftp) that are used by the Internet are also used for establishing communication between clients and servers in a digital library. 

3.1. Transmission Control Protocol / Internet Protocol (TCP/IP)

The Internet is a packet-switched network, wherein information to be communicated, is broken down into small packets. These packets are sent individually using several different routes at the same time and then reassembled at the receiving end. TCP is the component that collects and reassembles the packets of data, while IP is responsible for assuring that the packets are sent to the right destination. TCP/IP was developed in the 1970s and adopted as the protocol standard for ARPANET, the predecessor to the Internet, in 1983.

TCP/IP is the protocol that controls the creation of transmission paths between computers on a single network as well as between different networks. The standard defines how electronic devices (like computers) should be connected to the Internet, and how data should be transmitted between them. This protocol  is used universally for public networks and many in-house local area networks. Originally designed for the UNIX operating system, TCP/IP software is now available for every major kind of computer operating system and is a de facto standard for transmitting data over networks.

Moreover, TCP/IP includes commands and facilities that facilitate the transfer of files between systems, logging in to remote systems, running commands on remote systems, printing files on remote systems, sending electronic mail to remote users, conversing interactively with remote users, managing a network, and so on. Fig. 1 and Fig. 2 below are pictorial depictions of the TCP/IP model.

Fig. 1. TCP/IP Model Used for Connecting different Nodes in a Network

 
Fig. 2. TCP/IP Layers Involved in Transmission of a Mail
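
To see the stack in action, the short sketch below opens a TCP connection with Python's standard socket module and sends a few bytes; the host name is purely illustrative. The operating system's TCP/IP implementation performs all of the segmentation, routing and reassembly described above.

```python
# A minimal sketch: open a TCP connection and exchange a few bytes.
import socket

HOST, PORT = "www.example.com", 80  # illustrative web server

with socket.create_connection((HOST, PORT), timeout=10) as sock:
    # TCP presents a reliable byte stream; the packets underneath
    # may travel by different routes and arrive out of order.
    sock.sendall(b"HEAD / HTTP/1.1\r\nHost: www.example.com\r\n"
                 b"Connection: close\r\n\r\n")
    print(sock.recv(4096).decode("latin-1"))
```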


3.2. Hyper Text Transfer Protocol (http)

The http is the underlying protocol used by the WWW to define how messages are formatted and transmitted. It needs an http client program (an Internet browser) on one end and an http server program on the other. The protocol is used for carrying requests from clients to the server and returning pages to the client. It is also used for sending requests from one server to another. Http is the most important protocol used on the World Wide Web (WWW).

HTTP runs on top of the TCP/IP protocol. Web browsers are HTTP clients that send file requests to web servers, which, in turn, handle the requests via an HTTP service. HTTP was originally proposed in 1989 by Tim Berners-Lee, who was a coauthor of the 1.0 specification. HTTP 1.0 was "stateless", i.e. each new request from a client required setting up a new connection instead of handling all requests from the same client through the same connection. Moreover, version 1.0 of the protocol provided only for raw data transfer across the Internet. Version 1.1 was an improved protocol that included persistent connections, decompression of HTML files by client browsers, multiple domain names sharing the same IP address, and handling of MIME-like messages. Fig. 3 is a pictorial depiction of client-server interaction using the http protocol.
Fig. 3: Client-server Interaction using http Protocol
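
As a minimal illustration of this request-response cycle, the following sketch issues a GET request with Python's standard http.client module; the host name is illustrative, and any web server would do.

```python
# A minimal http client sketch: one request, one response.
import http.client

conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
conn.request("GET", "/")                 # request line + headers
response = conn.getresponse()            # status line, headers, body
print(response.status, response.reason)  # e.g. "200 OK"
body = response.read()                   # the returned page
conn.close()
```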



3.3. File Transfer Protocol (ftp)

The File Transfer Protocol (FTP), as its name indicates, is a protocol for transferring files from one computer to another over a local area network (LAN) or a wide area network (WAN) such as the Internet. It is a common method of moving files between client and server over a TCP/IP network. The protocol has been in existence since 1971, when the file transfer system was first implemented between MIT machines. FTP provides for reliable and swift exchange of files between computers with different operating systems and machine architectures. Many Internet sites have established publicly accessible repositories of material that can be obtained using FTP by logging in with the account name anonymous; these sites are therefore called anonymous ftp servers. Fig. 4 is a pictorial depiction of the FTP process model.
 
                                                         Fig. 4: FTP Process Model
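
The sketch below illustrates an anonymous FTP session using Python's standard ftplib module; the server name, directory and file name are hypothetical placeholders.

```python
# A sketch of an anonymous FTP session with placeholder names.
from ftplib import FTP

with FTP("ftp.example.org") as ftp:
    ftp.login()                          # no arguments = "anonymous"
    ftp.cwd("/pub")                      # change to a public directory
    print(ftp.nlst())                    # list the files available
    with open("readme.txt", "wb") as fh:
        ftp.retrbinary("RETR readme.txt", fh.write)
```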


4. Bibliographic Standards

Bibliographic standards are concerned with the description of the contents as well as the physical attributes of documents and non-documents in a library. They are generally very complex (MARC has some 700 field definitions) and cover the most difficult and intellectual part of the object definition (Day, 2001). These definitions are necessary for processing the material and also for searching it. Most digital library software supports the Dublin Core metadata element set for bibliographic records.

4.1. Machine Readable Catalogue (MARC)

MARC (MAchine-Readable Cataloging) standards are a set of formats for the description of documents catalogued by libraries, including books, journals, conference proceedings, CD-ROMs, etc. 'Machine-readable' essentially means that a computer can read and interpret the data given in the cataloguing record. MARC was developed in the 1960s by the US Library of Congress to create records that could be used by computers and shared among libraries. MARC contains bibliographic elements for content, physical and process description. By 1971, MARC formats had become the US national standard, and by 1973 an international standard. There are several versions of MARC in use around the world, the most predominant being MARC 21, created in 1999 as a result of the harmonization of US and Canadian MARC formats, and UNIMARC, widely used in Europe. The MARC 21 family of standards now includes formats for authority records, holdings records, classification schedules, and community information, in addition to the format for bibliographic records (Furrie, 2003).


4.2. Dublin Core

The Dublin Core refers to a set of metadata elements that may be assigned to web pages so as to facilitate discovery of electronic resources. Originally conceived for author-generated description of web resources at the OCLC/NCSA Metadata Workshop held at Dublin, Ohio in 1995, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations. The Dublin Core Workshop Series has gathered experts from the library world, the networking and digital library research communities, and a variety of content specialists in a series of invitational workshops. The building of an interdisciplinary, international consensus around a core element set is the central feature of the Dublin Core. The set of 15 core elements in the Dublin Core comprises: Title, Creator, Subject and Keywords, Description, Publisher, Contributor, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage, and Rights Management (Baker, 1998).
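
As a small illustration, the following sketch builds an unqualified Dublin Core record as XML using Python's standard library; the element values are sample data, and only the namespace URI (http://purl.org/dc/elements/1.1/) comes from the standard.

```python
# Build a simple Dublin Core record as XML.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")
for element, value in [
    ("title", "Digital Library Protocols and Standards"),
    ("creator", "Arora, Jagdish"),
    ("date", "2014"),
    ("language", "en"),
    ("format", "text/html"),
]:
    ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```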


4.3. BIB-1

BIB-1 is a simplified record structure for online transmission, essentially a sub-set of MARC. It is the original format for transmission of records within a Z39.50 dialogue between two systems. It has elements that are mappable to both MARC and the Dublin Core (Library of Congress, 2007).


4.4. Text Encoding Initiative (TEI)

The initiative provides a scheme for encoding text so that parts of it, such as the start and end of lines, paragraphs, pages, chapters, acts, and so on, can be marked. Text encoded in this way can be processed to produce accurate indexes for searching. Other features of the text, both grammatical and linguistic, and also content-indicating features such as the actors in a play, can be identified, allowing for rich analysis. These rules require that the actual text be marked up with SGML encoding (TEI, 2013).

4.5. Encoded Archival Description (EAD)

An encoding scheme devised within the SGML framework to define the content description of documents and other archival objects. It is defined with a minimum number of descriptive elements, but in an extensible fashion. It is designed to create descriptive records that assist in searching for the original material in a number of ways (Library of Congress, 2007).



4.6. Metadata Encoding and Transmission Standard (METS)

METS has the task of encoding descriptive, administrative and structural metadata for objects in a digital library to facilitate the management of such documents within a repository and their exchange between repositories. It is maintained by the Network Development and MARC Standards Office of the Library of Congress (http://www.loc.gov/standards/mets) and is an initiative of the Digital Library Federation, mentioned earlier in the chapter (Library of Congress, 2013). The METS format has seven major sections, listed below; a skeleton document is sketched in the code example that follows the list:

i)   The METS Header contains metadata describing the METS document itself, including such information as creator or editor.

ii)  The Descriptive Metadata section points to descriptive metadata external to the METS document (such as a MARC record in an OPAC or an EAD finding aid on a web server), or contains internally embedded descriptive metadata, or both.

iii)  The Administrative Metadata section provides information about how the files were created and stored, intellectual property rights, the original source from which the digital library object document derives, and information regarding the provenance of the files comprising the digital library object (that is master/derivative file relationships, and migration / transformation information). As with Descriptive Metadata, Administrative Metadata may be either external to the METS document, or encoded internally.

iv)  The File section lists all the files containing content that form part of the digital document.

v)  The Structural Map provides a hierarchical structure for the digital library document or object, and links the elements of that structure to content files and metadata that pertain to each element.

vi) The Structural Links section of METS allows METS’ creators to record the existence of hyperlinks between nodes in the hierarchy outlined in the Structural Map. This is of particular value when using METS to archive websites.

vii)  The Behaviour section associates executable behaviours with content in the document.
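
The following sketch, a minimal and hypothetical skeleton rather than a complete METS document, shows the seven sections as empty XML elements under the official METS namespace, built with Python's standard library:

```python
# A skeleton METS document: the seven major sections as empty
# elements, in the order the standard lists them.
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)

mets = ET.Element(f"{{{METS_NS}}}mets")
for section in ("metsHdr", "dmdSec", "amdSec", "fileSec",
                "structMap", "structLink", "behaviorSec"):
    ET.SubElement(mets, f"{{{METS_NS}}}{section}")

print(ET.tostring(mets, encoding="unicode"))
```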

4.7. Metadata Object Description Schema (MODS)

The Metadata Object Description Schema was developed as a descriptive metadata scheme oriented toward digital objects and drawing from the MAchine-Readable Cataloging (MARC 21) format. The scheme is reasonably usable and fairly refined: it provides descriptive metadata for digital objects by regrouping the MARC fields, adding a few new ones, and translating the numeric tags into readable English in XML.

MODS has gone through intense development: version 3.1 was released on 27 July 2005 and version 3.2 on 1 June 2006. In addition, MODS was adopted by the Digital Library Federation (DLF) for its Aquifer Project, which seeks to develop the best possible methods and services for federated access to digital resources. The DLF intends to use MODS to replace Dublin Core for descriptive metadata for digital objects in the digital library world, for MODS allows more specific description of contents and better clarification of the various elements than does Dublin Core (Library of Congress, 2013).


5. Record Structure

Record structure defines the physical and logical structure of the record which holds the data. A typical bibliographic record may contain multiple fields of variable length, some of which may occur more than once (repeatable fields). Except for proprietary structures, there is really only one structure used for bibliographic data of any complexity. These formats facilitate the exchange of data between systems and are not intended for human consumption. Most digital library software supports ISO 2709 as the structure for individual records, which is well-suited to handling the MARC format.

5.1. ISO 2709 / Z39.2

ISO 2709 / Z39.2 defines a very flexible structure for individual records (originally for tape storage) and is exceptionally well-suited to handling the MARC format. The two standards were developed together, but 2709 can be the structure of almost any type of record. The main strength of 2709 is its ability to handle variable-length fields and records where the occurrence of fields is also variable.
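
The sketch below illustrates this layout by decoding the leader and directory of an ISO 2709 record; the offsets follow the standard, and the record bytes would come from a real MARC exchange file.

```python
# Decode an ISO 2709 record: a 24-byte leader (record length at
# positions 0-4, base address of data at 12-16), then 12-byte
# directory entries (3-byte tag, 4-byte field length, 5-byte start).
def parse_iso2709(record: bytes):
    leader = record[:24]
    record_length = int(leader[0:5])
    base_address = int(leader[12:17])
    directory = record[24:base_address - 1]   # trailing terminator excluded
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        data = record[base_address + start:
                      base_address + start + length - 1]  # strip terminator
        fields.append((tag, data))            # tags may repeat
    return record_length, fields
```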


6. Encoding Standards

Encoding deals with the way individual characters are represented in files and records. It is concerned almost exclusively with text within records. Most digital library software supports ASCII as well as Unicode to meet multilingual requirements.


6.1. Unicode

Unicode is a universal encoding scheme that, in its basic form, uses 16 bits to represent each character. It has the advantages of being simple and complete, and it is being widely adopted. Its disadvantage is that all characters take twice as much space, even for single-language data. However, disk storage is getting cheaper (text is very small compared to images and video) and there are ways of simply and speedily compressing the data in storage. Unicode is controlled by the Unicode Consortium and is the operational equivalent of the ISO 10646 standard. Note that ISO 10646 also defines 32-bit characters, but these are not in any general use (Unicode Consortium, 2013).

6.2. ASCII

There is a wide variety of 8-bit character encodings in use around the world, but the most common is based on the American Standard Code for Information Interchange (ASCII). ASCII defines all the characters necessary for English and many special characters. This code has been used as the basis for most other 8-bit codes: the "lower" 128 characters are left alone (they contain the Latin alphabet, numbers, control codes and some special characters) and the "top" 128 characters are used for a second language. Thus there is universal compatibility at the "low" 128 level and almost none for the rest. IBM/Microsoft produced a number of "national variants" for the PC DOS operating system, and these have gained a large measure of acceptance through wide distribution; however, they are only a manufacturer's standard.
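
The difference between the two encodings is easy to demonstrate. The snippet below, a minimal sketch using only built-in string methods, shows the per-character storage cost of ASCII, UTF-16 and UTF-8; the Hindi sample word simply illustrates a non-ASCII script.

```python
text = "library"        # plain ASCII text
hindi = "पुस्तकालय"        # "library" in Hindi, outside ASCII's range

print(len(text.encode("ascii")))      # 7  bytes: one per character
print(len(text.encode("utf-16-le")))  # 14 bytes: two per character
print(len(hindi.encode("utf-8")))     # multi-byte sequences
# hindi.encode("ascii") would raise UnicodeEncodeError, since ASCII
# has no code points for Devanagari characters.
```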

7. Information Retrieval Protocols

7.1. Z39.50 or ISO 23950

Z39.50 is an ANSI/NISO standard for information storage and retrieval. It is a protocol which specifies data structures and interchange rules that allow a client machine to search databases on a server machine and retrieve records that are identified as a result of such a search. The Z39.50 protocol is used for searching and retrieving bibliographic records across more than one library system. The protocol is not used by Internet search engines (they use http); it is more complex, comprehensive and powerful than searching through http. Z39.50 has been extended to allow system feedback and inter-system dialogue. Like most applications working in a client-server environment, Z39.50 needs a Z39.50 client program on one end and a Z39.50 server program on the other.

Z39.50 protocol was originally designed to facilitate searching of very large bibliographic databases like OCLC and the Library of Congress. However, the protocol is now used for a wide range of library applications involving multiple database searching, cataloguing, inter-library loan, online items ordering, document delivery and reference services. With the rapid growth of Internet, the Z39.50 standard has become widely accepted as a solution to the challenge of retrieving multimedia information including text, images, and digitized documents (Wikipedia, 2009).

The name Z39 came from the ANSI committee on libraries, publishing and information services, which was named Z39. NISO standards are numbered sequentially, and Z39.50 is the 50th standard developed by NISO. The current version of Z39.50 was adopted in 1995, superseding earlier versions adopted in 1992 and 1988. Fig. 5 is a pictorial depiction of the Z39.50 model of information retrieval.

                              Fig. 5: Z39.50 Model of Information Retrieval


7.2. Search/Retrieve Web Service (SRW) and Search/Retrieve via URL (SRU)

SRW and SRU are web services-based protocols for querying Internet indexes or databases and retrieving search results. Web services essentially send requests for information from a client to a server. The server reads the input, processes it, and returns the results as an XML stream back to the client essentially in two flavours: REST (Representational State Transfer) and SOAP (Simple Object Access Protocol). SRW provides a SOAP interface to queries, to augment the URL interface provided by its companion protocol Search / Retrieve via URL (SRU). Queries in SRU and SRW are expressed using the Contextual Query Language (CQL). (Morgan, 2004).

A REST-based web service usually encodes commands from a client to a server in the query string of a URL. Each name / value pair of the query string specifies a set of input parameters for the server. Once received, the server parses these name / value pairs, does some processing using them as input, and returns the results as an XML stream. The shape of the query string as well as the shape of the XML stream are dictated by the protocol. By definition, the communication process between the client and the server is facilitated over an HTTP connection.

SOAP-based web services work in a similar manner, except the name / value pairs of SOAP requests are encoded in an XML SOAP 'envelope'. Similarly, SOAP servers return responses using the SOAP XML vocabulary. The biggest difference between REST-based web services and SOAP requests is the transport mechanism. REST-based web services are always transmitted via HTTP.

SRW is a SOAP-based web service, while SRU is a REST-based web service. REST-based web services usually encode the input in the shape of URLs, whereas SOAP requests are marked up in a SOAP XML vocabulary.
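
As an illustration of the REST flavour, the sketch below sends an SRU 1.1 searchRetrieve request with Python's standard urllib. The endpoint shown is the Library of Congress SRU server as historically published; treating it as reachable is an assumption here, and any SRU 1.1 endpoint would serve.

```python
# An SRU 1.1 searchRetrieve request over plain HTTP.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://lx2.loc.gov:210/LCDB"   # assumed SRU endpoint
params = urlencode({
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": 'dc.title = "digital libraries"',  # a CQL query
    "maximumRecords": "5",
})

with urlopen(f"{BASE}?{params}", timeout=30) as resp:
    print(resp.read()[:500])   # the results come back as XML
```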


7.3. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) (Open Archives, 2008) supports the creation of interoperable digital libraries by allowing remote services to access archives' metadata using an "open" standard. The OAI-PMH supports streaming of metadata from one repository to another and its harvesting by a service provider. A service provider can harvest metadata from various digital repositories distributed across universities and institutions to provide services such as browsing, searching, alerts or annotation. In essence, the OAI-PMH works in a distributed mode with two classes of participants, i.e. data providers (domain-specific digital repositories and institutional repositories) and service providers or specialized search engines:

  • Data Providers: Data providers are OAI-compliant institutional repositories or domain-specific digital repositories set up by institutions and universities. They use OAI-compliant software that supports the OAI-PMH. The protocol enables data providers (repositories) to expose structured metadata of the publications stored in their repositories to the Internet, so that it can be harvested by service providers.

  • Service Providers: Service providers, or harvesters issue OAI-PMH requests to data providers (i.e. OAI-compliant digital repositories) in order to harvest metadata so as to provide value-added services. The metadata stored in the data providers’ database is transferred in bulk to the metadata database of the service providers.
Fig. 6 is a pictorial depiction of the OAI-PMH architecture.


                                       
                                              Fig. 6: The OAI-PMH Architecture
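
A harvesting request is just an HTTP GET with OAI-PMH parameters. The sketch below, with a hypothetical repository URL, issues a ListRecords request for unqualified Dublin Core using Python's standard urllib:

```python
# An OAI-PMH ListRecords request; the repository URL is hypothetical.
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://repository.example.edu/oai"
params = urlencode({
    "verb": "ListRecords",        # harvest full records
    "metadataPrefix": "oai_dc",   # unqualified Dublin Core
    "from": "2014-01-01",         # optional selective harvesting
})

with urlopen(f"{BASE}?{params}", timeout=30) as resp:
    print(resp.read()[:500])      # an XML stream of metadata records
```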


7.4. OpenURL

OpenURL (Wikipedia, 2013) is a versatile linking standard that uses metadata (instead of an object identifier such as a DOI) to generate dynamic links by passing metadata about a resource to a resolver program. An OpenURL consists of two components, i.e. the URL of an OpenURL resolver followed by a description of the information object consisting of a set of metadata elements (e.g. author, journal, issue no., volume, year, etc.).

For OpenURL to work, a library is required to set up a resolution server with information on the full-text journals accessible to the library and their links, as well as on how to link to local print holdings and other local services. The information provider (or publisher) must also be OpenURL-enabled to redirect the linking request to the local resolution server. A "link resolver" or "link server" parses the elements of an OpenURL and provides links to appropriate services as identified by the library. An OpenURL link allows access to multiple information services from multiple resources, including full-text repositories; abstracting, indexing and citation databases; online library catalogues; document delivery services; and other web resources and services.

When a user clicks an OpenURL link, he or she is directed to the OpenURL resolver. The resolver, based on the services subscribed to by the library, returns an HTML page consisting of a set of links to resources from which the user can access the item (full text from the publisher's site, document delivery services, aggregators, etc.). The user selects an appropriate service and clicks on the link, which leads to the site of the service provider. Fig. 7 is a pictorial depiction of the functioning of OpenURL.

Fig. 7: How OpenURL Works
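
The sketch below assembles a simple OpenURL 0.1-style link in Python. The resolver address is hypothetical, while the metadata keys (genre, issn, atitle, etc.) follow the 0.1 convention and the sample values describe the Baker (1998) article cited in this chapter.

```python
# Assemble an OpenURL 0.1-style link; the resolver is hypothetical.
from urllib.parse import urlencode

RESOLVER = "http://resolver.example-library.org/openurl"
metadata = {
    "genre": "article",
    "aulast": "Baker",
    "atitle": "Languages for Dublin Core",
    "title": "D-Lib Magazine",
    "issn": "1082-9873",
    "volume": "4",
    "issue": "12",
    "date": "1998",
}
print(f"{RESOLVER}?{urlencode(metadata)}")  # link handed to the resolver
```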
  
OpenURL was developed by Herbert van de Sompel, a librarian at the University of Ghent. His link-server software, SFX, was purchased by the library automation company Ex Libris which popularized OpenURL in the information industry. Many other companies now market link server systems, including Openly Informatics (1Cate -acquired by OCLC in 2006), Endeavor Information Systems, Inc. (Discovery: Resolver), SerialsSolutions (ArticleLinker), Innovative Interfaces, Inc. (WebBridge), EBSCO (LinkSource), Ovid (LinkSolver), SirsiDynix (Resolver), Fretwell-Downing (OL2), TDNet (TOUR), Bowker (Ulrichs Resource Linker) and KINS (K-Link).

The National Information Standards Organization (NISO) has developed OpenURL and its data container (the ContextObject) as ANSI/NISO standard Z39.88.

8. Formats and Media Types in Digital Library

A defined arrangement for discrete sets of data that allows a computer and software to interpret the data is called a file format. Different file formats are used to store different media types such as text, images, graphics, pictures, musical works, computer programs, databases, models and designs, video programs and compound works combining many types of information. Although almost every type of information can be represented in digital form, a few important file formats for text and images typically applicable to a library-based digital library are described here. Every object in a digital library needs to have a name or identifier which distinctly identifies its type and format. This is achieved by assigning file extensions to the digital objects. The file extensions in a digital library typically denote formats, protocols and rights management appropriate for the type of material.

The information contents of a digital library, depending on the media types they contain, may include a combination of structured and unstructured text, numerical data, scanned images, graphics, and audio and video recordings.

8.1. Formats and Encoding used for Text

Text-based contents of a digital library can be stored and presented as i) simple text or ASCII (American Standard Code for Information Interchange); ii) unstructured text; and iii) structured text (SGML, HTML or XML).

8.1.2. Structured Text Format

Structured text attempts to capture the essence of documents by "marking up" the text so that the original form can be recreated, or other forms, such as ASCII, produced. Structured text formats have provision to embed images, graphics and other multimedia formats in the text. SGML (Standard Generalized Markup Language) is one of the most important and popular structured text formats. ODA (Office Document Architecture) is a similar and competing standard. SGML is an international standard (ISO, 1986) around which several related standards are built. SGML is a flexible language that gave birth to HTML (HyperText Markup Language), the de facto markup language of the World Wide Web, to control the display format of documents and even the appearance of the user interface for interacting with the documents. Extensible Markup Language (XML) is derived from SGML to interchange structured documents on the web. Like SGML, XML deals with the structure of a document and not its formatting; the Cascading Style Sheets (CSS) developed for HTML also function for XML to take care of formatting and appearance. Unlike HTML, XML allows the invention of new tags. Unlike SGML, XML always requires explicit end tags, which makes it much easier to write tools and browsers.
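
The end-tag requirement is easy to demonstrate with Python's built-in XML parser; the document strings below are invented samples.

```python
import xml.etree.ElementTree as ET

well_formed = "<article><title>Digital Libraries</title></article>"
ET.fromstring(well_formed)        # parses cleanly

broken = "<article><p>Some text.</article>"   # missing </p> end tag
try:
    ET.fromstring(broken)
except ET.ParseError as err:
    print("not well-formed XML:", err)        # reports a mismatched tag
```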

Like simple text or ASCII, structured text can be searched and manipulated. It is highly flexible and suitable both for electronic and paper production. Well-formatted text improves the visual presentation of textual, graphical and pictorial information. Structured formats can easily display complex tables and equations. Moreover, structured text is compact in comparison to image-based formats, even after including embedded graphics and pictures.

Besides SGML and HTML, there are other formats used in digital library implementation. TeX, used for formatting highly mathematical text, is one such format; it allows greater control over the resulting display of a document, including review of formatting errors.

8.1.3. Page Description Language (PDL)

Page Description Languages (PDLs), such as Adobe's PostScript and PDF (Portable Document Format), are similar to images, but the formatted pages displayed to the user are text-based rather than image-based. PostScript and PDF formats can easily be captured during the typesetting process. PostScript is especially easy to capture, since most systems generate it automatically, and a conversion program, Acrobat Distiller, can be used to convert PostScript files into PDF files. Documents stored as PDF require Acrobat Reader at the user's end to read or print the document; Acrobat Reader can be downloaded free of cost from Adobe's web site.

Acrobat's Portable Document Format (PDF) is a by-product of PostScript, Adobe's page-description language that has become the standard way to describe pages electronically in the graphics world. While PostScript is a programming language, PDF is a page-description format. 


8.2. Page Image Format

Digitally scanned images are stored in a file as a bit-mapped page image, irrespective of whether the scanned page contains a photograph, a line drawing or text. The bit-mapped page image can be created in dozens of different formats depending upon the scanner and its software. National and international standards for image-file formats and compression methods exist to ensure that data will be interchangeable amongst systems. An image file stores discrete sets of data and information allowing a computing system to display, interpret and print the image in a pre-defined fashion. An image file format consists of three distinct components: a header, which stores the file identifier and image specifications; the image data, consisting of a look-up table and the image raster; and a footer, which signals file-termination information. While the bit-mapped portion of a raster image is standardized, it is the file header that differentiates one format from another.

TIFF (Tagged Image File Format) is the most commonly used page image file format and is considered the de facto standard for bitonal images. Some image formats are proprietary, developed by a commercial vendor, and require specific software or hardware for display and printing. Images can be coloured, grey-scale or black and white (called bitonal). They can be uncompressed (raw) or compressed using several different compression algorithms.

Image files are much larger than text files, so compression is necessary for their economic storage. A compression algorithm reduces a redundant string, such as one or more rows of white bits, to a single code. The standard compression scheme for black-and-white bitonal images is the one developed by the International Telecommunication Union (formerly the Consultative Committee for International Telephony and Telegraphy, CCITT) for Group 4 fax images, commonly referred to as CCITT Group 4 (G-4) or ITU G-4. An image created as a TIFF and compressed using CCITT Group 4 is called a TIFF G4, which is the de facto standard for storing bitonal images.
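
As a sketch of producing such a file, the snippet below uses the third-party Pillow imaging library (an assumption, not part of any digital library standard) to save a bitonal image with Group 4 compression; the file names are placeholders, and Pillow must be built with libtiff support.

```python
# Convert a scan to a bitonal TIFF with CCITT Group 4 compression.
from PIL import Image   # third-party Pillow library

img = Image.open("scan.png").convert("1")   # "1" = bitonal, 1 bit/pixel
# Group 4 compression is only defined for bitonal images.
img.save("scan.tif", format="TIFF", compression="group4")
```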

Some of the formats mentioned above are maintained and developed by international organizations such as the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU).

9. Preservation Standards

Digital preservation metadata is the subset of metadata that describes the attributes of digital resources essential for their long-term accessibility. Preservation metadata provides structured ways to describe and record the information needed to manage the preservation of digital resources. In contrast to descriptive metadata schemas (e.g. MARC, Dublin Core), which are used in the discovery and identification of digital objects, preservation metadata is sometimes considered a subset of administrative metadata designed to assist in the management of technical metadata supporting continued access to digital content. Preservation metadata is intended to store technical details on the format, structure and use of the digital content; the history of all actions performed on the resource, including changes and decisions; authenticity information such as technical features or custody history; and the responsibilities and rights information applicable to preservation actions. The scope and depth of the preservation metadata required for a given digital preservation activity will vary according to numerous factors, such as the "intensity" of preservation, the length of archival retention, or even the knowledge base of the intended user community.

9.1. PREMIS (PREservation Metadata: Implementation Strategies)

The OAIS framework prompted interest in moving preservation metadata toward a more implementable status. To achieve this objective, OCLC and RLG sponsored a second working group called PREMIS (PREservation Metadata: Implementation Strategies). Composed of more than thirty international experts in preservation metadata, PREMIS sought to: i) define a core set of implementable, broadly applicable preservation metadata elements, supported by a data dictionary; and ii) identify and evaluate alternative strategies for encoding, storing, managing, and exchanging preservation metadata in digital archiving systems. In September 2004, PREMIS released a survey report describing current practice and emerging trends associated with the management and use of preservation metadata to support repository functions and policies. The final report of the PREMIS Working Group was released in May 2005. The PREMIS Data Dictionary is a comprehensive, practical resource for implementing preservation metadata in digital archiving systems. It defines implementable, core preservation metadata, along with guidelines and recommendations for its management and use. PREMIS also developed a set of XML schemas to support use of the Data Dictionary by institutions managing and exchanging PREMIS-conformant preservation metadata.

9.2. Open Archival Information System (OAIS)

OAIS describes all the functions of a digital repository: how digital objects can be prepared, submitted to an archive, stored for long periods, maintained, and retrieved as needed (Library of Congress, 2013). It does not address specific technologies, archiving techniques, or types of content. RLG built on the OAIS model in its digital preservation projects, such as PREMIS and Digital Repository Certification. The OAIS Reference Model was developed by the Consultative Committee for Space Data Systems (CCSDS) as a conceptual framework describing the environment, functional components and information objects associated with a system responsible for the long-term preservation of digital materials. Metadata in the OAIS model plays an essential role in preserving digital content and supporting its use over the long term; the OAIS information model implicitly establishes the link between metadata and digital preservation, i.e. preservation metadata. The OAIS reference model provides a high-level overview of the types of information needed to support digital preservation, which can broadly be grouped under two major umbrella terms: i) Preservation Description Information (PDI); and ii) Representation and Descriptive Information.

Summary

The chapter introduces standards and protocols and their role and importance in building digital libraries. It elaborates upon important digital library standards and protocols used for digital communication (i.e. TCP/IP, http and FTP), bibliographic standards (i.e. MARC, Dublin Core, BIB-1, TEI, EAD, METS and MODS), record structure (i.e. ISO 2709 / Z39.2), encoding standards (Unicode and ASCII), information retrieval standards (i.e. Z39.50, SRW/SRU, SOAP, REST, OAI-PMH and OpenURL), formats and media types used in digital libraries, including unstructured text formats (ASCII), structured text formats (i.e. SGML, XML, ODA, HTML and TeX), Page Description Languages (PDLs) (i.e. Adobe's PostScript and PDF) and page image formats (i.e. TIFF, image PDF, JPEG, etc.), and preservation standards such as PREMIS and OAIS.

References

(All Internet URLs were checked on 18th Dec., 2013)

Ascher Interactive (2007).  What is the difference between a protocol and a standard? (http://ascherconsulting.com/what/is/the/difference/between/a/protocol/and/a/standard/).

Baker, T. (1998). Languages for Dublin Core. D-Lib Magazine, 4 (12). Available at: http://www.dlib.org/dlib/december98/12baker.html

Day, M. (2001). Metadata in a Nutshell.

Furrie, B. (2003) Understanding MARC Bibliographic Machine-Readable Cataloguing. 7th ed. McHenry, ILL: Follett Software

Library of Congress (2007). Bib-1 Attribute Set.

Library of Congress (2007). Encoded Archival Description (EAD) Version 2002 Official Site. (http://www.loc.gov/ead/)

Library of Congress (2013). Metadata Encoding and Transmission Standard (METS). (http://www.loc.gov/standards/mets/)

Library of Congress (2013). Metadata Object Description Schema Official web Site. (http://www.loc.gov/standards/mods/)

Library of Congress (2013). Preservation Metadata Maintenance Activity Version 2.0. (http://www.loc.gov/standards/premis/)

Moore, Brian (2001). An introduction to the Simple Object Access Protocol (SOAP). (http://www.techrepublic.com/article/an-introduction-to-the-simple-object-access-protocol-soap/)

Morgan, Eric Lease (2004). An Introduction to the Search/Retrieve URL Service (SRU).  Ariadne, (40).  (http://www.ariadne.ac.uk/issue40/morgan/)

Open Archives (2008). The Open Archives Initiative Protocol for Metadata Harvesting.  (http://www.openarchives.org/OAI/openarchivesprotocol.html)

SearchSOA (2005). Representational State Transfer. 
(http://searchsoa.techtarget.com/definition/REST)

TEI: Text Encoding Initiative (2013). (http://www.tei-c.org/index.xml)

Unicode Consortium (2013). (http://www.unicode.org/)

Wikipedia (2009). Z39.50. (http://en.wikipedia.org/wiki/Z39.50)

Wikipedia (2013). OpenURL. (http://en.wikipedia.org/wiki/OpenURL)

Wikipedia (2013). Open Archival Information System.
Glossary

File Transfer
File transfer is a common method of moving files between two Internet sites.
GIF

An acronym for Graphics Interchange Format, one of the most commonly used file formats for storing graphic images displayed on the World Wide Web (the others being JPEG and TIFF). GIF is based on the LZW compression algorithm patented by Unisys, but in practice the company has not required ordinary users to obtain a license. The most recent version of GIF supports color, animation, and data compression.

Interoperability
Interoperability is the ability of digital library components and services to be functionally and logically interchangeable by virtue of having been implemented in accordance with a set of well-defined, publicly known interfaces.
JPEG

An acronym for Joint Photographic Experts Group, a standard for compressing still images in digital format at ratios of 100:1 and higher. Compression is accomplished by dividing the image into small blocks of pixels, which are compressed more and more aggressively until the desired ratio is reached; data is lost each time the compression ratio increases. Pronounced "jay-peg." Compare with MPEG.

Open Standards
An open standard is a standard that is publicly available and has various rights to use associated with it, and may also have various properties of how it was designed.
Open System Architecture
Open architecture is a type of computer architecture or software architecture that is designed to make adding, upgrading and swapping components easy. For example, most of the PCs today have an open architecture supporting plug-in cards, USB, etc.
Protocol
A protocol is a series of prescribed steps to be taken, usually in order to allow for the coordinated action of multiple parties.  In the world of computers, protocols are used to allow different computers and/or software applications to work and communicate with one another.   
Standards
Standards are generally agreed-upon models for comparison.  In the world of computers, standards are often used to define syntactic or other rule sets. Because computer protocols are frequently formalized by national and international standard organizations such as ISO and ITU, they are also considered as standards. As such, a protocol that is accepted by most of the parties that implement it can be considered as standard.
TCP / IP
Stands for Transmission Control Protocol / Internet Protocol, the suite of protocols that defines the Internet. Originally designed for the UNIX operating system, TCP/IP software is now available for every major computer operating system.
TEI (Text Encoding Initiative)
Introduced in 1987, TEI is an international interdisciplinary standard intended to assist libraries, museums, publishers, and scholars in representing literary and linguistic texts in digital form to facilitate research and teaching. The encoding scheme is designed to maximize expressivity and minimize obsolescence.

TIFF(Tagged Image File Format)

An acronym for Tagged Image File Format, a widely supported data format developed by Aldus and Microsoft for storing black-and-white, gray-scale, or color bitmapped images. Files in TIFF format may be uncompressed or compressed using LZW or a variety of other compression schemes. They usually have the extension .tif or .tiff added to the filename.

XML (Extensible Markup Language)
XML is an extremely simple dialect of SGML. The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.
WWW (World Wide Web)
A global network of Internet servers providing access to documents written in Hypertext Markup Language (HTML), which allows content to be interlinked, locally and remotely. The "Web" was designed in 1989 by Sir Timothy Berners-Lee, working at the CERN high-energy physics lab in Geneva. Marc Andreessen, a student at the University of Illinois, later devised a simple point-and-click browser called Mosaic that subsequently evolved into the Netscape web browser.

Points to Ponder

  1. Acrobat's Portable Document Format (PDF) is a by-product of PostScript, Adobe's page-description language that has become the standard way to describe pages electronically in the graphics world.
  2. For OpenURL to work, a library is required to set up a resolution server with information on the full-text journals accessible to the library and their links, as well as on how to link to local print holdings and other local services.
  3. OpenURL (Wikipedia, 2013) is a versatile linking standard that uses metadata (instead of an object identifier such as a DOI) to generate dynamic links by passing metadata about a resource to a resolver program.
  4. SRW and SRU are web services-based protocols for querying Internet indexes or databases and retrieving search results.
  5. Z39.50 is an ANSI / NISO standard for information storage and retrieval.
  6. The set of 15 core elements in the Dublin Core comprises: Title, Creator, Subject and Keywords, Description, Publisher, Contributor, Date, Resource Type, Format, Resource Identifier, Source, Language, Relation, Coverage, and Rights Management (Baker, 1998).

Did you know?

  1. The bit-mapped page image can be created in dozens of different formats depending upon the scanner and its software.
  2. The biggest difference between REST-based web services and SOAP requests is the transport mechanism.
  3. The Z39.50 standard has become widely accepted as a solution to the challenge of retrieving multimedia information including text, images, and digitized documents (Wikipedia, 2009).
  4. By 1971, MARC formats had become the US national standard and international standard by 1973.
  5. Most digital library software supports ISO 2709 as the structure for individual records, which is well-suited to handling the MARC format.
  6. ISO 2709 and Z39.2 were developed together, but 2709 can be the structure of almost any type of record.
  7. A disadvantage of Unicode is that all characters take twice as much space, even for single-language data.

Interesting Facts

  • The National Information Standards Organization (NISO), has developed OpenURL and its data container (the ContextObject) as international ANSI standard Z39.88.
  • The main strength of 2709 is its ability to handle variable length fields and records where the occurrence of fields is also variable. 
  • OpenURL was developed by Herbert van de Sompel, a librarian at the University of Ghent.
  • SRW is a SOAP-based web service, SRU is a REST-based web service.
  • The name Z39 came from the ANSI committee on libraries, publishing and information services which was named Z39.
  • Z39.50 protocol was originally designed to facilitate searching of very large bibliographic databases like OCLC and the Library of Congress.
  • HTTP was originally proposed in 1989 by Tim Berners-Lee
  • MARC was developed in the 1960s by the US Library of Congress to create records that can be used by computers and shared among libraries.


Match the following

Question 1: Matching (Simple)

Match the following (correct matches shown):
  • A. Record structure: B. ISO 2709
  • B. Encoding standard: I. Unicode
  • C. Information retrieval: F. Z39.50
  • D. Communication protocol: G. TCP/IP
  • E. Page description language: D. PDF
  • F. Structured text format: E. SGML
  • G. Simple or unstructured text: H. ASCII
  • H. Page image format: C. TIFF
  • I. Preservation standard: A. OAIS















