Carnegie Mellon University
Pittsburgh, Pa. 15213
December 1, 2001
412-268-2597
rr@cmu.edu
Creating a universal, free-to-read digital library containing over one
million scanned books, with optical character recognition where possible to
support full-text searching, is the goal of the Million Book Digital Library
Project. Such a resource will lead to the democratization of knowledge by
making a unique library resource available on the web to scholars, students,
and citizens around the world. Online search allows users to locate relevant
information quickly and reliably, thus enhancing students' willingness to
undertake research and their success in it.
This 24x7x365 resource would also provide an excellent testbed for
language processing research in areas such as machine translation,
summarization, intelligent indexing, and information mining.
A portion of the content would
include out of copyright, pre-1920 materials. A “best books” feature of the project would involve
requesting permission to scan titles in the core collection development tool Books for
College Libraries. A
preliminary Carnegie Mellon University Libraries pilot suggests that 22% of the
80,000 titles might become available. Further, when 80% of the million books
are finished, scholars will be recruited to review collections in their
disciplines and to select remaining books of importance.
Mirroring the site at several locations worldwide will protect the
integrity and availability of the data. Several models for sustainability are
being explored and are discussed in this report. Usability studies would also
be conducted to ensure that the materials are easy to locate, navigate, and
use. Appropriate metadata for navigation and management would also be created.
The National Science Foundation is providing funding for scanners, computers,
servers, and software. These NSF resources are augmented by almost twenty to
one, since China and India will provide the necessary manpower (2,000
man-years each over a four-year period) as their contribution to this project,
to assist in the selection of documents, software development, and the
digitizing of these materials. Indigenous Chinese and Indian materials would
form a portion of the content scanned, as would English-language materials
already resident in those countries.
In addition, U.S. libraries, primarily members of the Digital Library
Federation, would ship materials to be scanned and returned.
The primary long-term objective is to capture all books in digital format.
Some believe such a task is impossible. Thus, as a first step, we are planning
to demonstrate the feasibility by undertaking to digitize 1 million books (less
than 1% of all books in all languages ever published) by 2005. We believe such
a project has the potential to change how education is conducted in much of the
world. The project hopes to create a universal digital library, free to read
anytime, anywhere, by anyone.
Each of the million books
is scanned. If it is in a language
for which optical character recognition software is available, the text is
converted to ASCII/Unicode format to allow full-text search to guide students,
scholars, and citizens to the relevant portions of the work. Scanner operators
create metadata, based on existing cataloging records for these books and
journals, to accompany each book.
This project enhances
research, learning, and teaching by making a critical mass of scholarly
information freely available to read online. It has been observed that the
result will be like Vannevar Bush’s Memex. In addition to being searchable
through its own indexes, the collection will be indexed by major indexers such
as Google, and others, including libraries participating in the project, will
hyperlink to it.
A secondary objective of
this project will be to provide a test bed that will support other researchers
who are working on improved scanning techniques, improved optical character
recognition, and improved indexing.
The corpus this project creates will be at least ten times as large as
any existing free resource.
The primary benefit is to
supplement the formal education system by making knowledge available to anyone
who can read and has access. Libraries have played a vital role in the
advancement of human society. Societal advance depends on young people having
access to books via libraries and other means. We expect that making this unique web resource available
free to everyone in the U.S. and around the world will lead to a further
democratization of access to knowledge.
Libraries are unevenly
distributed around the world and within countries. In the U.S., the NCES Survey noted that in 1996, 3,408 of
3,792 institutions of higher education had libraries holding 806.7 million volumes. The 112 largest university libraries in
the United States and Canada each have at least 1.8 million books; they are
members of the Association of Research Libraries (ARL). Massachusetts has about 25 million volumes, New York has
about 31 million volumes, and California has about 40 million volumes in their
ARL libraries (Association of Research Libraries, 1999/2000). Other states, such as North and South
Dakota, have no large libraries. A
few large public libraries have several million volumes. However, most junior colleges, high
schools, and public libraries have much smaller collections. Making this large
knowledge repository available, with the convenience of online access and the
benefit of word and phrase full-text searching, can revolutionize research at
all levels of education and give a much-needed boost at minimal cost to our
national educational infrastructure.
Secondary benefit: Online search makes locating the
relevant information inside of books far more reliable and much easier. Student success in finding exactly what
they seek will increase and increased success will enhance student willingness
to perform research in this large resource. NCES reports that 84 percent of libraries around the country
are open between 60 and 80 hours a week.
This digital library would be open 24 hours a day, seven days a week,
and 365 days a year for a total of 168 hours a week, over twice the time most
libraries are open. More than one
individual will be able to use the same book at the same time, so popular
works will never be checked out and unavailable to others.
This million-book project
will produce an extensive and rich testbed for use in further textual language
processing research. It is hoped
that at least 10,000 books among the million will be available in more than one
language, providing a key testing area for problems in example based machine
translation. In the last stage of the project, books in multiple languages will
be reviewed to ensure that this testbed feature is accomplished.
Many believe that knowledge
is now doubling every two to three years. Machine summarization,
intelligent indexing, and information mining are tools that will be needed for
individuals to keep up in their discipline work, in their businesses, and in
their personal interests. This
large digitization project will support research in these areas.
The preliminary work
described below has been used to establish a protocol, to select standards to
be used, and to address issues of indexing and retrieval. Workflow and training programs to
support the larger project are being developed. Both the content and the mechanisms for using it will be
made available in open source code.
The National Science
Foundation’s 2000 ITR grant cycle provided $500,000 for equipment to begin a
large pilot. That grant will allow
the purchase of 18 Minolta book scanners to be located in India and China. Some machines have already been
deployed to begin the scanning process. Strong discounts from Minolta have
expanded the number of machines that can be purchased. Earlier pilot projects, a 100-book
scanning project and a 1000-book scanning project, that aided in the selection
of the scanners and the establishment of processes used are described more
fully below.
Chinese University
Presidents, a Ministry of Education official, and Chinese Academy of Sciences
leaders visited the U.S. to reach agreements and to form a steering
committee.
Dr. Michael Lesk and Dr.
Stephen Griffin from NSF attended the Carnegie Mellon meeting and also hosted
the Chinese delegation at the National Science Foundation. Professor PAN Yunhe,
President of Zhejiang University; Dr. GAO Wen, Deputy President of the Graduate
School, Chinese Academy of Science; Professor CHI Huisheng, Vice President of
Beijing University; Professor HU Dongcheng, Vice President of Tsinghua
University; Professor XU Zhong, Vice President of Fudan University; Professor
ZHANG Yibin, Assistant to the President, Nanjing University; Mr. GUO Xinli,
Vice General Director, Ministry of Education of China; Mr. CHEN Jianping, Vice
Director, State Planning Commission of China; and Dr. Ching-Chih Chen of
Simmons College attended. The
National Science Foundation funded this summit.
Indian university and
government officials are scheduled to visit on May 26, 2002,
and similar agreements are expected to be reached.
U.S. Digital Library
Federation members met on November 15 and 16, 2001 to work out the logistics of
selecting and transporting materials from U.S. collections under a grant from
NSF. Drs. Lesk and Griffin were
joined in Pittsburgh by representatives from OCLC, the Center for Research
Libraries, and collection development officers and other librarians from the
Library of Congress, the University of Washington, the University of California
Berkeley, Stanford University, University of Illinois, University of Chicago,
Penn State University, and the University of Pittsburgh. The Digital Library Federation’s
Executive Director also attended the meeting.
The collection development librarians discussed:
· Collection focus: to achieve a consensus on how to select the million
books to be digitized.
· Involvement of outside scholars in selection issues: to consider how
non-librarian scholars might participate in selection.
· Copyright considerations: to consider seeking permission for a set of
in-copyright “best books”, such as those in Books for College Libraries.
· Standards for the work: to review the current Digital Library Federation
standards with a view to rapid adoption.
· Registry issues: to move forward with OCLC in establishing a registry for
books selected.
· Methods of transport: to consider alternative means of transport and
return.
· Timing: to weigh the advantages of air containers and sea containers.
· Level of participation: to determine minimum levels for contributors to
the project.
· Incentives for participation: to establish means of recognition for
contributions through screen display and copies of the archives.
This meeting
will result in a plan for the selection and transmission of almost a million
books to China and India over a multiyear period and a plan for assessing the
success of the project annually.
Creating a
scalable database to support this project is a related research proposal. Drs. Christos Faloutsos, Jeffrey
Eppinger, and Natalia Allamachi are submitting a proposal to NSF to address
these issues. Their globally distributed database will appear to be a virtual
central database from any place around the world. Mirroring the database in several countries will ensure
security and availability.
The
database will house both an image file and a text file at about 10-20 megabytes
per book. The aggregate of 20
terabytes will be affordable to store because the costs of storage continue to
decline substantially. By 2010, a
terabyte of storage is expected to cost as little as $10.
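The storage figure above follows from simple arithmetic. As a minimal sketch, assuming decimal units (1 terabyte = 1,000,000 megabytes) and using only the per-book range quoted above, the aggregate can be checked as follows; the function and constant names are illustrative only.

```python
# Back-of-the-envelope storage estimate using the figures quoted in the text.
BOOKS = 1_000_000
MB_PER_BOOK_LOW, MB_PER_BOOK_HIGH = 10, 20  # image file plus text file, per book
MB_PER_TB = 1_000_000  # assumption: decimal units (1 TB = 10**6 MB)


def total_terabytes(books: int, mb_per_book: float) -> float:
    """Return the aggregate collection size in terabytes."""
    return books * mb_per_book / MB_PER_TB


low_tb = total_terabytes(BOOKS, MB_PER_BOOK_LOW)
high_tb = total_terabytes(BOOKS, MB_PER_BOOK_HIGH)
print(f"Estimated collection size: {low_tb:.0f} to {high_tb:.0f} TB")
```

The 20-terabyte aggregate cited above corresponds to the upper end of the per-book range.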
100-book pilot: Two years ago, we funded a pilot experiment to scan 100 books
so that the practical difficulties of a million-book project could be
assessed. Carnegie Mellon University Libraries
faculty and staff assisted in the pilot.
The scanner of choice was an inexpensive duplex scanner that required
the books to be disbound so that the pages could be fed through in
batches. While the economy and
speed of this technique were most attractive, several technical problems
occurred.
· The pages had to be cut on all four edges for smooth feeding. The project
required the purchase of a $10,000 guillotine to accomplish this. The
guillotine was somewhat dangerous, required in-depth training in use and
safety, slowed the process, was a public relations nightmare for the library
community, and negated the economy of the inexpensive scanner.
· Dust, an inevitable accompaniment to older books, proved to be a formidable
opponent. Dust caused frequent jamming, and the scanner then had to be
cleaned. Paper fixatives were employed to counteract the dust. Spraying on the
fixative slowed the project and was not entirely satisfactory.
At the end
of the first hundred books, the scanner operators and their supervisor sought
another approach.
1000-book project: Books 200 through 1000 were scanned using a Minolta
overhead scanner. Although this scanner was five times more expensive than the
roll-feed, double-sided scanner we had used, it proved to be more reliable.
Books did not have to be disbound. The image processing software for curvature
correction, deskewing, despeckling, and cropping allows thick books to be
scanned either flat or in an angled cradle that reduces wear on the spine.
Thorough training is required to
operate the scanner, but several different employees were successfully trained
to use it during the period of the project. The results of this 1000 book project can be viewed at www.ulib.cs.cmu.edu under 1000 book
project. This scanner and the
processes are the ones that are recommended for the million book project. The advantages of the Minolta approach
include:
· Disbinding via a guillotine is not necessary.
· Books can be reused in their original form.
· Dust, thick paper, and long books can be easily accommodated.
· Training requirements are reasonable.
· Equipment is reliable.
3. Data Production
· Bitonal images with a pixel depth of 1 bit-per-pixel were scanned at a
resolution of 600 dots per inch (dpi). Images stored as "Intel" TIFF
(Tagged Image File Format) files, with the header content specified. The
compression algorithm used is ITU (formerly CCITT) Group 4.
· TIFF version 5.0 is acceptable. Subject to testing, version 6.0 (or
later) may also be acceptable.
· Initial-capture system includes dynamic thresholding or a similar feature
to capture variability of darkness in the imprint and possibly darker (e.g.,
foxed) backgrounds from decay. Images should be as readable as the original
pages.
· "Typical" or "expected" data to be provided for most TIFF tags (normally,
the data supplied by software default settings). A specification for the TIFF
header to be produced to include scanner technical information, filename, and
other data, but to be in no way a burden on the production service.
· Images written in sequential order, with corresponding 8.3 file names,
e.g., 00000001.tif as first image in volume sequence and 00000341.tif as 341st
image in volume sequence.
· Volumes to be provided to Million Book Project by libraries with unique
identifiers that conform to 8.3 format; images should be in directories named
with corresponding identifier (e.g., akf3435.001 as identifier for volume will
result in directory with same name, and 00000001.tif through 0000000N.tif
within that directory).
· Images and directories (as specified above) to be written by Million Book
Project to gold CD-ROM meeting agreed upon specifications, and using ISO9660
format.
· Skew to be within a specified range of degrees allowed.
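The naming rules in the list above can be illustrated with a short sketch. It is not the project's production software: the helper names `is_8_3` and `image_path` are hypothetical, and the sketch assumes that an 8.3 name means a base of one to eight characters with an optional extension of one to three characters.

```python
from pathlib import PurePosixPath


def is_8_3(name: str) -> bool:
    """True if name has a 1-8 character base and an optional 1-3 character extension."""
    base, dot, ext = name.partition(".")
    if not 1 <= len(base) <= 8:
        return False
    if dot and not 1 <= len(ext) <= 3:
        return False
    return "." not in ext  # reject names with more than one dot


def image_path(volume_id: str, sequence: int) -> PurePosixPath:
    """Relative path of the image at a 1-based position within a volume."""
    if not is_8_3(volume_id):
        raise ValueError(f"volume identifier {volume_id!r} is not in 8.3 format")
    if not 1 <= sequence <= 99_999_999:
        raise ValueError("sequence number must fit in eight digits")
    # Zero-pad to eight digits so that files sort in page order.
    return PurePosixPath(volume_id) / f"{sequence:08d}.tif"


print(image_path("akf3435.001", 1))    # akf3435.001/00000001.tif
print(image_path("akf3435.001", 341))  # akf3435.001/00000341.tif
```

Zero-padded sequence numbers keep a plain alphabetical directory listing in the same order as the pages of the volume.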
4. Optical Character Recognition (OCR)
The primary function of OCR is to allow searching inside the text. Because
words are often repeated, a 98% success rate will allow students and scholars
to find relevant passages. In the pilot projects, the OCR program ABBYY
FineReader was run after the scanning was completed. ABBYY FineReader was
selected for its ability to keep words intact if they were hyphenated between
two pages. On English-language texts with print that has few broken letters,
the OCR accuracy of ABBYY FineReader is about 98% of text.
We do not plan to correct the OCR output as part of this project.
More sophisticated programs that use voting systems to resolve different
interpretations are available, but licenses are too expensive. Chinese and Japanese OCR programs are
also available and will be used whenever possible. Providing a testbed that
will allow for the creation of even better OCR programs is a secondary goal of
this project. Scholars may wish to run newer OCR programs over the scans and
even to correct the output.
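The argument that repetition compensates for OCR errors can be made concrete. The sketch below assumes, for illustration only, that the 98% figure is read as a per-word recognition rate and that errors on separate occurrences of a word are independent; neither assumption was tested in the pilots. Under those assumptions, the chance that a search finds at least one of k occurrences is 1 - (1 - 0.98)^k.

```python
def chance_of_a_hit(word_accuracy: float, occurrences: int) -> float:
    """Probability that at least one occurrence of a word is recognized correctly.

    Assumes each occurrence is recognized independently with the same accuracy.
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must not be negative")
    return 1.0 - (1.0 - word_accuracy) ** occurrences


for k in (1, 2, 3):
    print(f"{k} occurrence(s): {chance_of_a_hit(0.98, k):.4%}")
```

With two occurrences the chance of a hit is already about 99.96%, which is why uncorrected OCR output is expected to be adequate for locating passages.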
5. Metadata
Digital Library Federation standards and metadata best practices will be
used throughout this project.
Bibliographic metadata for the pilot project will be derived from
existing library catalog records.
Carnegie Mellon libraries developed software that uses the standard
Z39.50 protocol to search and retrieve relevant metadata from catalog record
fields. Thus, author, title, and
publication data do not have to be rekeyed.
Another research project associated with this project will be the
creation of software that automatically creates "document structure"
metadata. This metadata allows
users to navigate through the chapters and other parts of a book
successfully. Entering such
information manually is too time consuming for this project, but automatic
metadata creation programs can be utilized subsequently.
Administrative metadata supports the maintenance and archiving of the
paper or digital objects and ensures their long-term availability by providing information
about how the files were created and stored. Administrative metadata will be maintained internally as
file descriptions in the project databases and externally as part of the
copyright permission database.
The Digital Library Federation, a supporter of this project, has several
initiatives underway that will allow commercial browsers to harvest metadata
more aggressively. The results of DLF’s metadata harvesting project will be
explored for possible application to the resources produced in this project
(www.diglib.org).
6. Quality control
The standards established for quality control are those currently
endorsed by the Digital Library Federation, whose missions include the
establishment of best practices and the development of standards. The project must maintain a 98%
accuracy rate for the quality of images and the inclusion of all pages. Nevertheless, a process must be
developed to allow users to report missing pages and for those missing pages
to be scanned and inserted into the existing scanned text. Because the owning library will have to
pull the book, scan the pages, and transport the file, this process will be
expensive. Maintaining high quality
the first time the book is scanned will be essential. A demonstration of high quality, reliable work done on
materials currently in China and India will give U.S. libraries confidence that
their collections should be shared.
E. Content
Seeking to develop a collection of
one million digital books, the Million Book Project envisages a staged approach
as described below. The
Million Book project will adhere to copyright law. U.S. collections will primarily include the following types
of materials.
1. Coordination of Selection
Creating one digital copy, which
can then be easily mirrored in different locations, will suffice and will
support the multiple uses an item may receive. Preliminary discussions with OCLC as a host for a registry
of scanned items are underway.
Certain key projects, such as the Making of America project, are already
represented in the OCLC database as digital books. Other large digitization projects may require some data
entry of their content in order to avoid duplication.
2. Non-copyrighted materials
Materials
published before 1920 are in the public domain and may be scanned for this
project. Several large academic libraries are considering shipping materials
from their depositories of little used material to India/China. These materials will be scanned there
and then returned. To reduce the costs of selection, the project will probably
develop a strategy of selecting key topics and then removing large runs of
books and journals from a selected depository. Having a reasonable turn around time will be essential to
the success of the project. A test
will be devised to understand the logistics of shipping the materials and the
impact of their absence from the home library.
The 1909 copyright law granted
copyright for 28 years. Rights
holders could then renew the copyright for another 28 years; many publishers and
authors did not exercise that renewal option. Thus, some materials published after 1922 (56 years prior to
the 1978 effective date of the 1976 act) may be out of copyright. To allow efficient
checking of these books’ status, copyright renewal records for books for these
years have been scanned and made available online at www.ulib.org. Similar records for other formats, such
as serials and audiovisual material, will also be made available as a part of
this resource.
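The renewal rule described above suggests a simple first-pass triage of titles before anyone consults the scanned renewal records. The sketch below is illustrative only: the function name and category labels are hypothetical, the year thresholds are taken from the figures quoted in this section (a 28-year first term and the 1922 and 1978 dates), and the result is a prompt for further checking rather than a legal determination.

```python
PUBLIC_DOMAIN_CUTOFF = 1922  # 56 years before the 1978 effective date of the 1976 act
FIRST_TERM_YEARS = 28        # initial term under the 1909 law
ACT_EFFECTIVE_YEAR = 1978


def triage(publication_year: int, renewal_found: bool) -> str:
    """Sort a title into a rough copyright-clearance category."""
    if publication_year <= PUBLIC_DOMAIN_CUTOFF:
        return "public domain"
    # Conservative assumption: only titles whose first term ended before 1978
    # are flagged as possibly lapsed when no renewal record is found.
    first_term_ended = publication_year + FIRST_TERM_YEARS < ACT_EFFECTIVE_YEAR
    if first_term_ended and not renewal_found:
        return "possibly out of copyright: verify"
    return "seek permission"


print(triage(1915, renewal_found=False))
print(triage(1935, renewal_found=False))
print(triage(1935, renewal_found=True))
```

Titles in the middle category are the ones for which the online renewal records matter most.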
Government documents are also in
the public domain and may be included in this project. Many participating libraries are
depositories for full runs of government documents and could supply them to the
project, as could the Library of Congress. The inclusion of documents will allow for more recent
material to enter the project legally and to become available to a broader
audience and in a more accessible manner.
Many government documents are currently available in digital form. The creation of these back files would
enhance those resources.
The Chinese delegation is most
eager to have technical reports and science and technology dissertations as a
part of this project. The
producing scholar and the university have copyright interests in these formats. Gaining university permission might be
fairly straightforward. A good
faith attempt would also have to be made to win the permission of the
scholar. That could be a part of
an externally funded copyright clearance project, but no pilot has been done to
allow for an estimate of contact rate and subsequent success. If some arrangement could be made with
University Microfilms to scan dissertations of selected universities from
microfilm, which would be cheaper and easier to transport, such an initiative
might satisfy a strong desire among all participants to increase science
content.
3. Copyrighted materials
The 1998 Copyright law grants
copyright to authors for their lifetimes plus 70 years or for 95 years. Patent law, by contrast, gives 20
years. A.W. Mellon’s JSTOR project
developed the concept of a moving wall that allowed the inclusion of materials
over five years old. Journal
publishers generally agreed that the economic value of that material was
greatly reduced and granted permission for its inclusion in this most successful
project. A similar broad publisher
agreement about the point at which economic value of a print book declines is
greatly needed because books often go out of print in two or three years and
can then remain in copyright but unavailable for over 90 years.
Dr. Raj Reddy and Dr. Peter Shane,
Director of the Institute for the Study of Information, Technology and Society
recently had a conversation with a major book publisher to explore the
possibility of taking a broad publisher approach to receiving copyright
permissions. Certain publishers,
including the National Academy Press, have had the experience that when they
digitized their books, sales increased because attention was focused on the
material and the scholars were not yet ready to read the books online. Authors' guilds will also be contacted
to see if they would be interested in granting permissions.
Three conditions seem to be
necessary to attract publishers to the scanning of their out of print but in
copyright titles:
· Publisher should receive a tax deduction for contributing
the title to this project. The tax
deduction might reflect revenues previously generated by the title.
· When a print on demand feature becomes a part of this
project, publishers should collect royalties on books printed.
· If a book were to return to general popularity, as happened with
out-of-print titles after the movie Titanic, the
publisher should be able to withdraw the permission for a fee. The publisher might be expected to
reimburse the project for the costs of digitizing the title and maintaining it
online.
Dr. Michael Shamos, a Director of
Carnegie Mellon’s Universal Library project and an intellectual property
attorney, recommends the following approach to copyright clearance. The million book project will make a
good faith effort to clear copyright on appropriate materials by sending the
publisher of record a letter asking for permission. Replies will be recorded in the administrative metadata. If the publisher has returned the
rights to the author, the author will be contacted. Subsequent copyright holders will be contacted as
needed. If the permission letter
receives no response, then materials will be digitized as a part of the
project. If rights holders
subsequently identify themselves and request that the material be removed from
the project, that request will be complied with immediately.
4. Best books approach
The project will seek publisher
permission to scan books from Books for College Libraries (BCL), one
source for core academic books in English. A previous study done at Carnegie Mellon University
Libraries indicates that 22% of publishers granted permission for scanning and
mounting on the web. The materials
in the study were a random sample of Carnegie Mellon libraries’ books and
included a broad range of dates, publishers, and in and out of print statuses.
Numerous difficulties from out of business publishers, lack of publisher
records, return of copyright to authors, and other circumstances were
identified. Subsequently, Carol
Hughes, the collections development officer for Questia, corroborated Carnegie
Mellon’s experience.
OCLC owns a database of books from
the latest edition of Books for College Libraries. OCLC representatives attended the
November 15 and 16 meeting to discuss using the database to support the
project. BCL contains about 50,000
titles. A 22% success rate in
clearing copyright would result in about 11,000 of the best books for college
students being included in the project.
Clearing copyright is labor intensive and expensive. Bradd Burningham’s recent article
estimated those costs (“Copyright Permissions” in Journal of Interlibrary Loan, Document
Delivery, and Information Supply, 11:2 (2000), 95-111). The BCL database, however, will allow
for sorting by publisher so that permission requests can contain the names of
several books. A quick sample
indicates that as many as 25,000 publishers may be represented there. Despite the expense, this commitment to
quality should be attempted.
Carnegie Mellon University Libraries will seek private foundation
funding to undertake this project.
Publishers increasingly see that
digital presentation of their works can attract buyers. They are interested in exploring ways
in which their out of print titles may be returned to profitability. Continued work with publishers through
the course of this project may attract many of them to it. That would be most beneficial in enriching
the content to be made available.
F. Sustainability
Sustainability is a long-term
issue for this project; further research will be done on developing economic
models to support this major contribution to education. Partial answers to these significant
challenges are discussed below. Three general alternatives have potential for
offering a sustainable model for this project—the Library of Congress and
similar national libraries, OCLC, and other commercial concerns. Several major philanthropists
have computer industry fortunes and might be interested in sustaining this
project.
Library of Congress: The million-book project
will be a public good and as such must have a suitable repository that will
continue to make it available to the public at no charge. That responsibility belongs most
clearly to the national library in each country. The Library of Congress should be motivated to respond to
this challenge because the national interest is so clearly served. However, the Library of Congress is not
the national library of the United States, although many people assume that it
is. In the LOC’s own words in its
mission statement: “THE FIRST PRIORITY
of the Library of Congress is to make knowledge and creativity available to the
United States Congress.” It is
only a lesser goal to make knowledge available to the public, and that is why
we have to undertake the million-book project in the first place -- LOC won’t
do it. LOC is also the guardian of
the copyright office and is extremely nervous about digitizing anything to
which there might be a copyright claim.
Having a network of national libraries mirroring the resource around the
world would be an appropriate and desired outcome.
In addition, last year, Congress
appropriated $100 million for Digital Preservation, contingent on LC’s
raising $75 million in matching resources. The law allows the acceptance of gifts in kind as a part of
the matching funding. Perhaps the
best solution to the sustainability issue would be to pledge the million-book
project to LC as a part of the Digital Preservation initiative. Even if the value of the project were
only assessed on its inputs (equipment and labor), it does represent a
significant investment. Initial
overtures have already been made for this alternative.
OCLC: Another alternative might be for
OCLC to maintain a free version of the resource. OCLC is a non-profit organization whose member libraries are
committed to enhancing access to information. OCLC might cover its costs by charging member libraries a
small fee when the million-book project is accessed through the 48 million-title
database. For the millions of OCLC
users, that convenience would be worth a small payment in an already existing
fee relationship. OCLC’s recent
strategic planning initiatives identified the addition of more full text to the
database, exploring archiving responsibilities, and becoming more international
as important thrusts. OCLC would also be able to cover partial costs through
some of the strategies listed below for publishers.
Commercial alternatives: The marketplace for
electronic books is chaotic at this moment. Questia, designed to be an online source with at least
50,000 of the best books with sophisticated software to support searching and
the creation of footnotes, marketed itself directly to students at a $20-30
monthly fee. Although the project
was well capitalized and attracted a great deal of media attention, it has gone
out of business. Librarians have long observed that charging for resources in
the academic environment reduces use.
Student desire for the convenience of online information sends them to
the web and to the much-used electronic resources of their own libraries. That love of convenience apparently
does not extend to purchasing Questia under current pricing models.
During the same period, the
company netLibrary announced that it would provide new full-text books
online. NetLibrary marketed itself
to libraries through consortia.
Use of materials, thus, was at no direct cost to students and faculty.
While students appreciated the convenience of being able to use the resource
online, they had many complaints about its functionality—in particular they
resented the fact that books could only be printed one page at a time and that
books were unavailable if another individual were using them. The economic models behind netLibrary
charges also seemed to reflect an adherence to those of paper books rather than
recognizing the economies of digital materials. The assets of netLibrary have since been sold to OCLC.
At this time, the marketplace
responses suggest that turning the million book project into a private,
revenue-generating source would not offer a sustainable model. JSTOR, Project Muse, and other digital
journal projects that offer online materials with superior functionality and
sustainability continue to flourish.
At some future time, an enhanced version of the resource might be
marketed commercially if it offered sufficient added functionality to encourage
a user to pay for using the commercial version rather than the free one.
Another commercial alternative
might revolve around relationships with publishers. As publishers find that making a book available increases
sales, they might be required to contribute to the support of the project by
paying subscriptions that support buy buttons and by paying a part of their revenues from
print-on-demand sales of out-of-print materials.
G. Logistical Challenges
Many logistical challenges face the project: 1) throughput on each scanner, 2) time
to completion, and 3) movement of books to and from India and China.
Optimum scanner throughput:
One Minolta scanner running two shifts daily      = 16 books per day
250 work days per year                            = 4,000 books/year
With currently supplied 18 scanners               = 72,000 books/year
With a total of 100 scanners
(@ 250 days/year at 16 books per day)             = 400,000 books/year
Allowing a generous 50%
deterioration in throughput, 100 scanners can complete the project in five
years.
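The throughput figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the project plan; the function and parameter names are illustrative assumptions, and the 50% deterioration is modeled as a simple efficiency factor.

```python
def books_per_year(scanners, books_per_day=16, work_days=250, efficiency=1.0):
    """Books scanned per year by a fleet of scanners.

    books_per_day=16 corresponds to one scanner running two shifts daily;
    efficiency below 1.0 models a deterioration in throughput.
    """
    return scanners * books_per_day * work_days * efficiency


def years_to_complete(total_books, scanners, efficiency=1.0):
    """Years needed to scan total_books with the given fleet and efficiency."""
    return total_books / books_per_year(scanners, efficiency=efficiency)


print(books_per_year(1))     # one scanner
print(books_per_year(18))    # currently supplied scanners
print(books_per_year(100))   # planned total of scanners
print(years_to_complete(1_000_000, 100, efficiency=0.5))
```

Under these assumptions the sketch reproduces the stated 4,000, 72,000, and 400,000 books per year, and a five-year completion time for one million books at half the optimum throughput.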
China and India have demonstrated
time and again that they are the best destinations for tasks that depend on
skilled manpower, from both an economic and a technical-complexity point of view.
Time to completion: The time to completion depends on decisions about how many
shifts are running (one to three), how many days are worked annually, and how
aggressively the operators are able to maintain the one-book-per-hour
schedule. Nevertheless, the
equipment provided by NSF and industry should allow the project to be completed
within five years.
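How strongly the completion date depends on these decisions can be illustrated by tabulating a few scenarios. This is a hedged sketch: the eight-hour shift is an assumption implied by "two shifts daily = 16 books per day", the 300-day work year is a purely hypothetical alternative, and `pace` is an illustrative name for the fraction of the one-book-per-hour schedule that operators actually maintain.

```python
from itertools import product

BOOKS_PER_HOUR = 1       # the one-book-per-hour schedule from the text
HOURS_PER_SHIFT = 8      # assumption: implied by two shifts = 16 books/day
TOTAL_BOOKS = 1_000_000


def completion_years(scanners, shifts, work_days, pace=1.0):
    """Years to scan TOTAL_BOOKS; pace is the fraction of the schedule maintained."""
    per_year = scanners * shifts * HOURS_PER_SHIFT * BOOKS_PER_HOUR * work_days * pace
    return TOTAL_BOOKS / per_year


def scenario_table(scanners=100, pace=0.5):
    """Completion time for one to three shifts and two work-year lengths."""
    rows = []
    for shifts, days in product((1, 2, 3), (250, 300)):  # 300 days is hypothetical
        years = round(completion_years(scanners, shifts, days, pace), 2)
        rows.append((shifts, days, years))
    return rows


for shifts, days, years in scenario_table():
    print(f"{shifts} shift(s), {days} days/year: {years} years")
```

With 100 scanners at half pace, two shifts over 250 days give the five-year figure, while a single shift doubles it; the other rows are only as good as the assumptions above.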
Movement of books from the U.S. to India and China: Indian and Chinese
academic libraries are large and will contain some of the material to be
included in this project; over 700 of these libraries are OCLC members whose
holdings can be easily ascertained. Because the scanning centers will be
distributed around the two countries, they can easily be established in places
where transporting materials from libraries to them poses minimal
difficulty. It may even be feasible to
locate some of the centers in academic libraries.
The task of shipping materials
from the U.S. to India will be a monumental one. The current plan is to use air containers rather than
shipping containers to reduce the time that materials would be away from their
owning libraries.
20 x 20 air containers are shipped
by weight. The estimated cost would be
about $7,500 per trip. Such a
container will hold about 20,000-23,000 volumes, which can be packed at originating
libraries in small vendor boxes.
It may be desirable to shrink-wrap the books to ensure their intact
arrival. The basic shipping cost
per book will be about seventy-five cents per round trip. Some libraries may select some
materials that do not need to be returned because they are in the process of
being discarded, but the default will be that materials will return to their
originating libraries. Some
additional funding is being sought from other sources to cover library costs in
selecting, packing, and transporting the books. The Center for Research Libraries might supply some of its
own materials to the project. It
might also serve as a collection place towards the end of the project when the
quantity of material to be shipped falls below the whole container level.
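The per-book shipping figure above follows from the container cost and capacity. The sketch below is illustrative only: it assumes that "per trip" means one direction, so a round trip costs two trips, and the function names are hypothetical.

```python
import math


def round_trip_cost_per_book(cost_per_trip=7500, volumes_per_container=20000):
    """Shipping cost per book for a round trip (two one-way container trips)."""
    one_way = cost_per_trip / volumes_per_container
    return 2 * one_way


def containers_needed(volumes, volumes_per_container=20000):
    """Whole containers required to ship the given number of volumes one way."""
    return math.ceil(volumes / volumes_per_container)


print(round_trip_cost_per_book())                             # 20,000 volumes per container
print(round_trip_cost_per_book(volumes_per_container=23000))  # 23,000 volumes per container
print(containers_needed(1_000_000))
```

At 20,000 volumes per container the round-trip cost is 75 cents per book, matching the text; at 23,000 volumes it falls to roughly 65 cents. The container count for one million volumes is an upper bound, since some material will come from libraries in India and China.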
H. People at Carnegie Mellon University
The directors of the existing Universal Library Project will serve as the
main consultants for the project.
They are:
· Dr. Raj Reddy, Principal Investigator, Herbert A. Simon Professor of
Computer Science, cochair of the President’s Information Technology Advisory
Committee (PITAC), holder of numerous awards and prizes.
· Dr. Michael Shamos, Distinguished Career Professor in computer science
and intellectual property attorney, director of the Universal Library project,
and codirector of the e-commerce program.
· Dr. Jaime Carbonell, Allen Newell Professor of Computer Science and Director
of the Language Technologies Institute, whose research interests include
machine translation, intelligent indexing, auto-summarization, and information
mining. This project will provide the infrastructure needed for many of the
research activities of Carbonell and associates in LTI.
· Dr. Robert Thibadeau, Principal research scientist with areas of
specialization in scanning equipment and in areas of privacy.
· Dr. Gloriana St. Clair, Principal Investigator, University Librarian,
editor of portal: Libraries and the Academy, and an active member of the Digital
Library Federation.
Additional university libraries personnel include:
· Ms. Gabrielle Michalek, Digital library projects coordinator, successful
leader of the scanning of 1 million pages of digital archival materials, the
100 book project and the 1000 book project.
· Ms. Erika Linke, Associate University Librarian, with expertise in
collection management, digital libraries, and intellectual property issues.
· Ms. Denise Troll, Associate University Librarian and Distinguished Fellow
of the Digital Library Federation, with broad competencies in digital libraries
and special interests in user studies.
People elsewhere
· Dr. Ching-chih Chen, Professor of the Graduate School of Library and
Information Sciences, Simmons College, member of PITAC.
· Dr. Gao Wen, Professor of Computer Science, Vice President of the University
of Science and Technology of China, Deputy President of the Graduate School of
the Chinese Academy of Sciences.
· Dr. N. Balakrishnan, head of the Indian Institute of Science’s Division of
Information Sciences.
· Dr. Daniel Greenstein, Executive Director of the Digital Library
Federation.
Appendix: Chinese Collections
Chinese collections: Several
of the Chinese universities participating in the project have identified
collections they will want to scan initially. China uses a different system for intellectual
property. Appropriate permissions
will be secured to scan the materials included in this project. Six other universities may also
contribute materials.
Beijing University
· Ancient rare books including Song and Yuan Dynasty rare books, family trees,
paintings, and inscription rubbings.
· Chinese periodicals before 1949 in politics, law, culture, education, finance,
economics, students, women, academics, technology, religion, folk customs, and
natural sciences.
· Ancient engineering technology history and study in China
· China’s contribution to Science and Technology, including engraved bone texts.
· Full text of documents in the Chinese Culture Documents Database
· Full text of documents in the Chinese Classical Literature Database
· Full text of documents in the Chinese Classical art vision database
· Full text of documents in the Jinling University Technical Periodical database
· “Six-dynasties Culture” multimedia database in cooperation with the library of
Nanjing normal school
· Taiping Heavenly Kingdom materials
· Dunhuang documents, including handwritten and carved ones.
· China’s Southeast countryside area with local area geography, commercial town
materials, and history and technical articles.
· Tea culture materials, ancient documents, periodicals, texts
· Silkworm and silk materials, including ancient documents, technical articles,
texts
Appendix – Indian Collections:
The Indian agencies participating
in the project have done extensive studies and identified documents that
are precious, unique to their regions, and beyond all copyright issues. Close
to 1,000 documents and books have already been digitized at the three
operational centers.
· Indian Institute of Science: One of the oldest and largest S&T libraries, with
more than 400,000 holdings, of which nearly 40,000 are estimated to be out of
copyright.
· International Institute for Information Technology, Hyderabad and the
Government of Andhra Pradesh, Hyderabad: Telugu textbooks
· Indian Institute of Information Technology: Sanskrit literature and S&T books
in English and Indian languages available from Bose Library, Allahabad.
· Pune University: Maharashtrian literature and books
· Goa University: Portuguese literature and books
· Tirupathi and Tirumala Devasthanam: Sanskrit and Telugu literature and Vedic
documents, palm leaves
· Anna University: Tamil literature, palm leaves containing ancient Indian
medical practices (Ayurveda)
· National Centre for Software development and the Government of Maharashtra:
Textbooks in Marathi and S&T books
· SASTRA: Sanskrit and Tamil literature from Tanjore library dating back to the
4th century BC
· Avinashalingam College: Books and manuscripts from old libraries in the
Tamil Nadu region in Tamil, Telugu, English and Sanskrit.