The Million Book Digital Library Project

Raj Reddy and Gloriana StClair

Carnegie Mellon University

Pittsburgh, Pa. 15213

December 1, 2001



The objective of this project is to create a free-to-read, searchable collection of one million books, primarily in the English language, available to everyone over the Internet.  This task is accomplished by scanning the books and indexing their full text.  The text file is created, where possible, through optical character recognition.  The result will be a unique resource accessible to anyone in the world 24x7x365, without regard to nationality or socioeconomic background.


Typical large high-school libraries house fewer than 30,000 volumes.  One million volumes is the approximate size of the combined libraries at Carnegie Mellon University.  The total number of different titles indexed in OCLC’s WorldCat is about 48 million.  One million books, therefore, is more than the holdings of any high-school library, equivalent to the library at a substantial university, and a significant fraction (roughly 2%) of all available books.


I.        Executive Summary


The goal of the Million Book Digital Library Project is to create a universal, free-to-read digital library containing over one million scanned books, with optical character recognition where possible to support full-text searching.  Such a resource will lead to the democratization of knowledge by making a unique library resource available on the web to scholars, students, and citizens around the world.  Online search allows users to locate relevant information quickly and reliably, which increases students’ willingness to undertake research and their success in it.  This 24x7x365 resource would also provide an excellent testbed for language processing research in areas such as machine translation, summarization, intelligent indexing, and information mining.


A portion of the content would consist of out-of-copyright, pre-1920 materials.  A “best books” feature of the project would involve requesting permission to scan titles in the core collection development tool Books for College Libraries.  A preliminary Carnegie Mellon University Libraries pilot suggests that 22% of the 80,000 titles might become available.  Further, when 80% of the million books are finished, scholars will be recruited to review collections in their disciplines and to select remaining books of importance.


Mirroring the site at several locations worldwide will protect the integrity and availability of the data. Several models for sustainability are being explored and are discussed in this report. Usability studies would also be conducted to ensure that the materials are easy to locate, navigate, and use. Appropriate metadata for navigation and management would also be created.


The National Science Foundation is providing funding for scanners, computers, servers, and software.  These NSF resources are augmented almost twenty to one because China and India will provide the necessary manpower (2,000 man-years each over a four-year period) as their contribution to the project, assisting in the selection of documents, software development, and digitization of the materials.  Indigenous Chinese and Indian materials would form a portion of the content scanned, as would English-language materials already resident in those countries.  In addition, U.S. libraries, primarily members of the Digital Library Federation, would ship materials to be scanned and returned.


II.        Technical Description


A.        Primary Objective


The primary long-term objective is to capture all books in digital format.  Some believe such a task is impossible.  As a first step, we therefore plan to demonstrate feasibility by digitizing 1 million books (less than 1% of all books in all languages ever published) by 2005.  We believe such a project has the potential to change how education is conducted in much of the world.  The project hopes to create a universal digital library that is free to read anytime, anywhere, by anyone.


Each of the million books is scanned.  If it is in a language for which optical character recognition software is available, the text is converted to ASCII/Unicode format to allow full-text search to guide students, scholars, and citizens to the relevant portions of the work.  Scanner operators create metadata, based on existing cataloging records for these books and journals, to accompany each book.


This project enhances research, learning, and teaching by making a critical mass of scholarly information freely available to read online.  It has been observed that the result will be like Vannevar Bush’s Memex.  In addition to the project’s own indexes, major indexers such as Google will index the collection, and others, including libraries participating in the project, will hyperlink to it.


A secondary objective of this project is to provide a testbed that will support other researchers who are working on improved scanning techniques, improved optical character recognition, and improved indexing.  The corpus this project creates will be at least ten times as large as any existing free resource.


B.        Primary Benefit


The primary benefit is to supplement the formal education system by making knowledge available to anyone who can read and has access.  Libraries have played a vital role in the advancement of human society.  Societal advance depends on young people having access to books via libraries and other means.  We expect that making this unique web resource available free to everyone in the U.S. and around the world will lead to a further democratization of access to knowledge.


Libraries are unevenly distributed around the world and within countries.  In the U.S., the NCES Survey noted that in 1996, 3,408 of 3,792 institutions of higher education had libraries holding 806.7 million volumes.  The 112 largest university libraries in the United States and Canada each have at least 1.8 million books; they are members of the Association of Research Libraries (ARL).  Massachusetts has about 25 million volumes, New York about 31 million, and California about 40 million in their ARL libraries (Association of Research Libraries, 1999/2000).  Other states, such as North and South Dakota, have no large libraries.  A few large public libraries have several million volumes.  However, most junior colleges, high schools, and public libraries have much smaller collections.  Making this large knowledge repository available with the convenience of online access and the benefit of word and phrase full-text searching can revolutionize research at all levels of education and give a much-needed boost at minimal cost to our national educational infrastructure.


Secondary benefit:  Online search makes locating the relevant information inside books far more reliable and much easier.  Students’ success in finding exactly what they seek will increase, and that success will enhance their willingness to perform research in this large resource.  NCES reports that 84 percent of libraries around the country are open between 60 and 80 hours a week.  This digital library would be open 24 hours a day, seven days a week, 365 days a year, for a total of 168 hours a week, over twice the time most libraries are open.  More than one individual will be able to use the same book at the same time.  Thus, popular works will never be checked out and unavailable to others.


This million-book project will produce an extensive and rich testbed for use in further textual language processing research.  It is hoped that at least 10,000 books among the million will be available in more than one language, providing a key testing area for problems in example-based machine translation.  In the last stage of the project, books in multiple languages will be reviewed to ensure that this goal is met.


Many believe that knowledge is now doubling every two to three years.  Machine summarization, intelligent indexing, and information mining are tools that individuals will need to keep up with their disciplines, their businesses, and their personal interests.  This large digitization project will support research in these areas.


C.       Status to Date


The preliminary work described below has been used to establish a protocol, to select standards to be used, and to address issues of indexing and retrieval.  Workflow and training programs to support the larger project are being developed.  Both the content and the mechanisms for using it will be made available in open source code.


The National Science Foundation’s 2000 ITR grant cycle provided $500,000 for equipment to begin a large pilot.  That grant will allow the purchase of 18 Minolta book scanners to be located in India and China.  Some machines have already been deployed to begin the scanning process.  Strong discounts from Minolta have expanded the number of machines that can be purchased.  Two earlier pilot projects, a 100-book scanning project and a 1000-book scanning project, aided in the selection of the scanners and the establishment of the processes used; they are described more fully below.


Chinese University Presidents, a Ministry of Education official, and Chinese Academy of Sciences leaders visited the U.S. to reach agreements and to form a steering committee. 

Dr. Michael Lesk and Dr. Stephen Griffin from NSF attended the Carnegie Mellon meeting and also hosted the Chinese delegation at the National Science Foundation.  Professor PAN Yunhe, President of Zhejiang University; Dr. GAO Wen, Deputy President of the Graduate School, Chinese Academy of Sciences; Professor CHI Huisheng, Vice President of Beijing University; Professor HU Dongcheng, Vice President of Tsinghua University; Professor XU Zhong, Vice President of Fudan University; Professor ZHANG Yibin, Assistant to the President, Nanjing University; Mr. GUO Xinli, Vice General Director, Ministry of Education of China; Mr. CHEN Jianping, Vice Director, State Planning Commission of China; and Dr. Ching-Chih Chen of Simmons College attended.  The National Science Foundation funded this summit.


Indian university and government officials are scheduled to visit on May 26, 2002, and similar agreements are expected to be reached.


U.S. Digital Library Federation members met on November 15 and 16, 2001, under a grant from NSF, to work out the logistics of selecting and transporting materials from U.S. collections.  Drs. Lesk and Griffin were joined in Pittsburgh by representatives from OCLC and the Center for Research Libraries, and by collection development officers and other librarians from the Library of Congress, the University of Washington, the University of California, Berkeley, Stanford University, the University of Illinois, the University of Chicago, Penn State University, and the University of Pittsburgh.  The Digital Library Federation’s Executive Director also attended the meeting.


       The collection development librarians discussed:


·          Collection focus: to achieve a consensus on how to select the million books to be digitized.

·          Involvement of outside scholars in selection issues: to consider how non-librarian scholars might participate in selection.

·          Copyright considerations: to consider seeking permission for a set of in-copyright “best books”, such as those in Books for College Libraries.

·          Standards for the work: to review the current Digital Library Federation standards with a view to rapid adoption.

·          Registry issues: to move forward with OCLC in establishing a registry for books selected.

·          Methods of transport: to consider alternative means of transport and return.

·          Timing: to weigh the advantages of air containers and sea containers.

·          Level of participation: to determine minimum levels for contributors to the project.

·          Incentives for participation: to establish means of recognition for contributions through screen display and copies of the archives.


This meeting will result in a plan for the selection and transmission of almost a million books to China and India over a multiyear period and a plan for assessing the success of the project annually.


D.       Technical Approach


1.      Database creation


Creating a scalable database to support this project is the subject of a related research proposal.  Drs. Christos Faloutsos, Jeffrey Eppinger, and Natalia Allamachi are submitting a proposal to NSF to address these issues.  Their globally distributed database will appear to be a virtual central database from any place around the world.  Mirroring the database in several countries will ensure security and availability.


The database will house both an image file and a text file at about 10-20 megabytes per book.  The aggregate of up to 20 terabytes for one million books will be affordable to store because the costs of storage continue to decline substantially.  By 2010, a terabyte of storage is expected to cost as little as $10.
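The arithmetic behind these figures can be sketched as follows.  This is only a rough illustration: the use of decimal units (one terabyte taken as one million megabytes) and the optional number of mirror copies are assumptions, not figures from the project.

```python
# Rough storage estimate for the per-book sizes quoted above.
BOOKS = 1_000_000      # project target
MB_PER_TB = 1_000_000  # decimal units; an assumption


def collection_terabytes(mb_per_book, books=BOOKS, mirrors=1):
    """Return the aggregate size in terabytes.

    `mirrors` is a hypothetical count of full copies kept at mirror sites.
    """
    return mb_per_book * books * mirrors / MB_PER_TB


def storage_cost(terabytes, dollars_per_tb=10.0):
    """Cost at a given price per terabyte (the $10 figure is the 2010 projection)."""
    return terabytes * dollars_per_tb


low = collection_terabytes(10)
high = collection_terabytes(20)
print(f"One copy: {low:.0f} to {high:.0f} TB")
print(f"Projected cost of the larger estimate: ${storage_cost(high):.0f}")
```

Under these assumptions a single copy occupies 10 to 20 terabytes, and each additional mirror multiplies that figure.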


2.      Scanning


100 book pilot: Two years ago, we funded a pilot experiment to scan 100 books so that the practical difficulties of a million book project could be assessed.  Carnegie Mellon University Libraries faculty and staff assisted in the pilot.  The scanner of choice was an inexpensive duplex scanner that required the books to be disbound so that the pages could be fed through in batches.  While the economy and speed of this technique were most attractive, several technical problems occurred.


·    The pages had to be cut on all four edges for smooth feeding.  The project required the purchase of a $10,000 guillotine to accomplish this.  The guillotine was somewhat dangerous, required in-depth training in use and safety, slowed the process, was a public relations nightmare for the library community, and negated the economy of the inexpensive scanner.

·    Dust, an inevitable accompaniment to older books, proved to be a formidable opponent.  Dust caused frequent jams, each of which required the scanner to be cleaned.  Paper fixatives were employed to counteract the dust, but spraying on the fixative slowed the project and was not entirely satisfactory.


At the end of the first hundred books, the scanner operators and their supervisor sought another approach.


1000 book project.  Books 200 through 1000 were scanned using a Minolta overhead scanner.  Although this scanner was five times more expensive than the roll-feed, double-sided scanner we had used, it proved to be more reliable.  Books did not have to be disbound.  The image processing software for curvature correction, deskewing, despeckling, and cropping allows thick books to be scanned either flat or in an angled cradle that reduces wear on the spine.  Thorough training is required to operate the scanner, but several different employees were successfully trained to use it during the period of the project.  The results of this 1000 book project can be viewed online under “1000 book project.”  This scanner and these processes are the ones recommended for the million book project.  The advantages of the Minolta approach include:


·  Disbinding via a guillotine is not necessary.

·  Books can be reused in their original form.

·  Dust, thick paper, and long books can be easily accommodated.

·  Training requirements are reasonable.

·  Equipment is reliable.


3.      Data Production


·    Bitonal images with a pixel depth of 1 bit per pixel are scanned at a resolution of 600 dots per inch (dpi).  Images are stored as "Intel" TIFF (Tagged Image File Format) files, with the header content specified.  The compression algorithm used is ITU (formerly CCITT) Group 4.

·   TIFF version 5.0 is acceptable.  Subject to testing, version 6.0 (or later) may also be acceptable.

·    The initial-capture system includes dynamic thresholding or a similar feature to capture variability of darkness in the imprint and possibly darker (e.g., foxed) backgrounds from decay.  Images should be as readable as the original pages.

·    "Typical" or "expected" data are to be provided for most TIFF tags (normally, the data supplied by software default settings).  A specification for the TIFF header is to be produced that includes scanner technical information, filename, and other data but is in no way a burden on the production service.

·  Images are to be written in sequential order, with corresponding 8.3 file names, e.g., 00000001.tif as the first image in the volume sequence and 00000341.tif as the 341st.

·  Volumes are to be provided to the Million Book Project by libraries with unique identifiers that conform to the 8.3 format; images should be in directories named with the corresponding identifier (e.g., akf3435.001 as the identifier for a volume will result in a directory with the same name, containing 00000001.tif through 0000000N.tif).

·  Images and directories (as specified above) are to be written by the Million Book Project to gold CD-ROM meeting agreed-upon specifications and using the ISO 9660 format.

·  Skew is to be within a specified allowable range of degrees.
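The naming rules in the list above can be illustrated with a minimal sketch.  The function names, the exact character set permitted in a volume identifier, and the use of "/" as the directory separator are assumptions made for illustration; the project specification itself is not reproduced here.

```python
import re

# Assumed shape of an 8.3 name: 1-8 alphanumeric characters, a dot,
# then a 1-3 character extension (e.g. "akf3435.001").
EIGHT_DOT_THREE = re.compile(r"[A-Za-z0-9]{1,8}\.[A-Za-z0-9]{1,3}")


def is_8_3(name):
    """Return True if `name` fits the assumed 8.3 pattern."""
    return EIGHT_DOT_THREE.fullmatch(name) is not None


def image_name(sequence):
    """Return the file name for the Nth image in a volume (1-based)."""
    if not 1 <= sequence <= 99_999_999:
        raise ValueError("sequence must fit in eight digits")
    return f"{sequence:08d}.tif"


def image_path(volume_id, sequence):
    """Return 'volume directory/image file' for one scanned page image."""
    if not is_8_3(volume_id):
        raise ValueError(f"volume identifier is not in 8.3 form: {volume_id!r}")
    return f"{volume_id}/{image_name(sequence)}"


print(image_path("akf3435.001", 1))
print(image_path("akf3435.001", 341))
```

Zero-padding the sequence number to eight digits keeps a plain alphabetical directory listing in page order, which is presumably why the convention was chosen.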


4.      Optical Character Recognition (OCR)


The primary function of OCR is to allow searching inside the text.  In the pilot projects, the OCR program ABBYY FineReader was run after the scanning was completed.  FineReader was selected for its ability to keep words intact when they are hyphenated across two pages.  On English-language texts whose print has few broken letters, its accuracy is about 98% of the text.  Because words are often repeated, this 98% success rate will allow students and scholars to find relevant passages.  We do not plan to correct the OCR output as part of this project.
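The argument that repetition compensates for OCR errors can be made concrete with a small calculation.  This sketch assumes that the 98% figure applies per word and that recognition errors on separate occurrences are independent; neither assumption was measured in the pilots.

```python
def chance_of_hit(occurrences, word_accuracy=0.98):
    """Probability that at least one occurrence of a search term
    was recognized correctly, assuming independent errors."""
    if occurrences < 1:
        raise ValueError("occurrences must be at least 1")
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    return 1.0 - (1.0 - word_accuracy) ** occurrences


for k in (1, 2, 3):
    print(k, f"{chance_of_hit(k):.6f}")
```

Under these assumptions, a term that appears only once in a book is found 98% of the time, while a term that appears three times is missed on every occurrence only about eight times in a million.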


More sophisticated programs that use voting systems to resolve different interpretations are available, but their licenses are too expensive.  Chinese and Japanese OCR programs are also available and will be used whenever possible.  Providing a testbed that will allow for the creation of even better OCR programs is a secondary goal of this project.  Scholars may wish to run newer OCR programs over the scans and even to correct the output.


5.       Metadata


Digital Library Federation standards and metadata best practices will be used throughout this project.  Bibliographic metadata for the pilot project will be derived from existing library catalog records.  Carnegie Mellon libraries developed software that uses the standard Z39.50 protocol to search for and retrieve the relevant fields of catalog records.  Thus, author, title, and publication data do not have to be rekeyed.


Another research project associated with this project will be the creation of software that automatically creates "document structure" metadata.  This metadata allows users to navigate through the chapters and other parts of a book successfully.  Entering such information manually is too time consuming for this project, but automatic metadata creation programs can be utilized subsequently.


Administrative metadata supports the maintenance and archiving of the paper or digital objects and ensures their long-term availability by providing information about how the files were created and stored.  Administrative metadata will be maintained internally as file descriptions in the project databases and externally as part of the copyright permission database.


The Digital Library Federation, a supporter of this project, has several initiatives underway that will allow commercial browsers to harvest metadata more aggressively.  The results of DLF’s metadata harvesting project will be explored for possible application to the resources produced in this project.


6.       Quality control


The standards established for quality control are those currently endorsed by the Digital Library Federation, whose missions include the establishment of best practices and the development of standards.  The project must maintain a 98% accuracy rate for the quality of images and the inclusion of all pages.  Nevertheless, a process must be developed to allow users to report missing pages and for those missing pages to be scanned and dropped back into the existing scanned text.  Because the owning library will have to pull the book, scan the pages, and transport the file, this process will be expensive.  Maintaining high quality the first time the book is scanned will be essential.  A demonstration of high-quality, reliable work done on materials currently in China and India will give U.S. libraries confidence that their collections should be shared.
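One inexpensive check that could run before a volume leaves the scanning center is a search for gaps in the image numbering.  The following is a minimal sketch that assumes the sequential eight-digit .tif names described under Data Production; the function name is hypothetical and this is not the project's quality control procedure.

```python
def missing_images(filenames):
    """Return the sequence numbers absent from a volume's image files.

    Only names of the form NNNNNNNN.tif are considered; anything else
    in the directory listing is ignored.
    """
    present = set()
    for name in filenames:
        stem, _, extension = name.rpartition(".")
        if (extension.lower() == "tif" and len(stem) == 8
                and stem.isascii() and stem.isdigit()):
            present.add(int(stem))
    if not present:
        return []
    # Every number from 1 up to the highest one seen should be present.
    return [n for n in range(1, max(present) + 1) if n not in present]


listing = ["00000001.tif", "00000002.tif", "00000004.tif"]
print(missing_images(listing))
```

A check of this kind finds only files lost after scanning.  It cannot detect a page that the operator skipped, because the numbering would then continue without a gap, nor pages missing from the end of a volume, so page counts from the catalog record or a visual review would still be needed.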


E.       Content


Seeking to develop a collection of one million digital books, the Million Book Project envisages a staged approach as described below.   The Million Book project will adhere to copyright law.  U.S. collections will primarily include the following types of materials.


1.      Coordination of Selection


Creating one digital copy, which can then be easily mirrored in different locations, will suffice and will support the multiple uses an item may receive.  Preliminary discussions with OCLC as a host for a registry of scanned items are underway.  Certain key projects, such as the Making of America project, are already represented in the OCLC database as digital books.  Other large digitization projects may require some data entry of their content in order to avoid duplication.


2.      Non-copyrighted materials


Materials published before 1920 are in the public domain and may be scanned for this project.  Several large academic libraries are considering shipping materials from their depositories of little-used material to India or China.  These materials will be scanned there and then returned.  To reduce the costs of selection, the project will probably develop a strategy of selecting key topics and then removing large runs of books and journals from a selected depository.  Having a reasonable turnaround time will be essential to the success of the project.  A test will be devised to understand the logistics of shipping the materials and the impact of their absence from the home library.


The 1909 copyright law granted copyright for 28 years.  Rights holders could then renew the copyright for another 28 years; many publishers and authors did not exercise that renewal option.  Thus, some materials published after 1922 (56 years prior to the 1978 effective date of the 1976 act) may be out of copyright.  In order to provide for the efficient checking of these books’ status, copyright renewal records for books for these years have been scanned and made available online.  Similar records for other formats, such as serials and audiovisual material, will also be made available as a part of this resource.


Government documents are also in the public domain and may be included in this project.   Many participating libraries are depositories for full runs of government documents and could supply them to the project, as could the Library of Congress.  The inclusion of documents will allow for more recent material to enter the project legally and to become available to a broader audience and in a more accessible manner.  Many government documents are currently available in digital form.  The creation of these back files would enhance those resources.


The Chinese delegation is most eager to have technical reports and science and technology dissertations as a part of this project.  The producing scholar and the university have copyright interests in these formats.  Gaining university permission might be fairly straightforward.  A good faith attempt would also have to be made to win the permission of the scholar.  That could be a part of an externally funded copyright clearance project, but no pilot has been done to allow for an estimate of contact rate and subsequent success.  If some arrangement could be made with University Microfilms to scan dissertations of selected universities from microfilm, which would be cheaper and easier to transport, such an initiative might satisfy a strong desire among all participants to increase science content. 


3.       Copyrighted materials


The 1998 copyright law grants copyright to authors for their lifetimes plus 70 years or, for works made for hire, for 95 years from publication.  Patent law, by contrast, gives 20 years.  The Andrew W. Mellon Foundation’s JSTOR project developed the concept of a moving wall that allowed the inclusion of materials over five years old.  Journal publishers generally agreed that the economic value of that material was greatly reduced and granted permission for its inclusion in this most successful project.  A similar broad publisher agreement about the point at which the economic value of a print book declines is greatly needed because books often go out of print in two or three years and can then remain in copyright but unavailable for over 90 years.


Dr. Raj Reddy and Dr. Peter Shane, Director of the Institute for the Study of Information, Technology and Society, recently had a conversation with a major book publisher to explore the possibility of taking a broad publisher approach to receiving copyright permissions.  Certain publishers, including the National Academy Press, have had the experience that when they digitized their books, sales increased because attention was focused on the material and the scholars were not yet ready to read the books online.  Authors' guilds will also be contacted to see if they would be interested in granting permissions.


Three conditions seem to be necessary to attract publishers to the scanning of their out-of-print but in-copyright titles:


·  The publisher should receive a tax deduction for contributing the title to this project.  The tax deduction might reflect revenues previously generated by the title.

·  When a print-on-demand feature becomes a part of this project, publishers should collect royalties on books printed.

·  If a book were to return to general popularity, as happened to out-of-print titles after the movie Titanic, the publisher should be able to withdraw the permission for a fee.  The publisher might be expected to reimburse the project for the costs of digitizing the title and maintaining it online.


Dr. Michael Shamos, a Director of Carnegie Mellon’s Universal Library project and an intellectual property attorney, recommends the following approach to copyright clearance.  The million book project will make a good faith effort to clear copyright on appropriate materials by sending the publisher of record a letter asking for permission.  Replies will be recorded in the administrative metadata.  If the publisher has returned the rights to the author, the author will be contacted.  Subsequent copyright holders will be contacted as needed.  If the permission letter receives no response, then materials will be digitized as a part of the project.  If rights holders subsequently identify themselves and request that the material be removed from the project, that request will be complied with immediately. 


4.       Best books approach


The project will seek publisher permission to scan books from Books for College Libraries (BCL), one source for core academic books in English.  A previous study done at Carnegie Mellon University Libraries indicates that 22% of publishers granted permission for scanning and mounting on the web.  The materials in the study were a random sample of Carnegie Mellon libraries’ books and included a broad range of dates and publishers and both in-print and out-of-print titles.  Numerous difficulties arising from out-of-business publishers, lack of publisher records, return of copyright to authors, and other circumstances were identified.  Subsequently, Carol Hughes, the collections development officer for Questia, corroborated Carnegie Mellon’s experience.


OCLC owns a database of books from the latest edition of Books for College Libraries.  OCLC representatives attended the November 15 and 16 meeting to discuss using the database to support the project.  BCL contains about 50,000 titles.  A 22% success rate in clearing copyright would result in about 11,000 of the best books for college students being included in the project.  Clearing copyright is labor intensive and expensive.  Bradd Burningham’s recent article estimated those costs (“Copyright Permissions” in Journal of Interlibrary Loan, Document Delivery, and Information Supply, 11:2 (2000), 95-111).  The BCL database, however, will allow for sorting by publisher so that a single permission request can name several books.  A quick sample indicates that as many as 25,000 publishers may be represented there.  Despite the expense, this commitment to quality should be attempted.  Carnegie Mellon University Libraries will seek private foundation funding to undertake this project.
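The idea of sorting by publisher so that one letter covers several titles can be sketched as follows.  The record layout (title and publisher pairs), the function names, and the simple normalization of publisher names are assumptions made for illustration; the actual structure of the BCL database is not described here.

```python
from collections import defaultdict


def batch_by_publisher(records):
    """Group (title, publisher) pairs so one request can list every title
    a publisher holds.  Publisher names are compared case-insensitively
    with surrounding whitespace removed."""
    batches = defaultdict(list)
    for title, publisher in records:
        batches[publisher.strip().casefold()].append(title)
    return dict(batches)


def expected_cleared(total_titles, success_rate):
    """Titles expected to clear at a given permission success rate."""
    return round(total_titles * success_rate)


sample = [
    ("Title A", "Example Press"),
    ("Title B", "example press "),
    ("Title C", "Another House"),
]
letters = batch_by_publisher(sample)
print(len(letters), "letters for", len(sample), "titles")
print(expected_cleared(50_000, 0.22), "titles expected to clear")
```

In practice, variant spellings, imprints, and mergers would require more careful matching of publisher names than the simple normalization shown.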


Publishers increasingly see that digital presentation of their works can attract buyers.  They are interested in exploring ways in which their out of print titles may be returned to profitability.  Continued work with publishers through the course of this project may attract many of them to it.  That would be most beneficial in enriching the content to be made available.


F.  Sustainability


Sustainability is a long-term issue for this project; further research will be done on developing economic models to support this major contribution to education.  Partial answers to these significant challenges are discussed below.  Three general alternatives have potential for offering a sustainable model for this project: the Library of Congress and similar national libraries, OCLC, and commercial concerns.  In addition, several major philanthropists have computer industry fortunes and might be interested in sustaining this project.


Library of Congress:  The million-book project will be a public good and as such must have a suitable repository that will continue to make it available to the public at no charge.  That responsibility belongs most clearly to the national library in each country.  The Library of Congress should be motivated to respond to this challenge because the national interest is so clearly served.  However, the Library of Congress is not the national library of the United States, although many people assume that it is.  In the LOC’s own words in its mission statement: “THE FIRST PRIORITY of the Library of Congress is to make knowledge and creativity available to the United States Congress.”  Making knowledge available to the public is only a lesser goal, which is why we have to undertake the million-book project in the first place: LOC will not do it.  LOC is also the guardian of the Copyright Office and is extremely nervous about digitizing anything to which there might be a copyright claim.  Having a network of national libraries mirroring the resource around the world would be an appropriate and desired outcome.


In addition, last year Congress appropriated $100 million for Digital Preservation, contingent on LOC’s raising $75 million in matching resources.  The law allows the acceptance of gifts in kind as a part of the matching funding.  Perhaps the best solution to the sustainability issue would be to pledge the million-book project to LOC as a part of the Digital Preservation initiative.  Even if the value of the project were assessed only on its inputs (equipment and labor), it would represent a significant investment.  Initial overtures have already been made for this alternative.


OCLC:   Another alternative might be for OCLC to maintain a free version of the resource.  OCLC is a non-profit organization whose member libraries are committed to enhancing access to information.  OCLC might cover its costs by charging member libraries a small fee when the million-book project is accessed through the 48-million-title database.  For the millions of OCLC users, that convenience would be worth a small payment within an already existing fee relationship.  OCLC’s recent strategic planning initiatives identified adding more full text to the database, exploring archiving responsibilities, and becoming more international as important thrusts.  OCLC would also be able to cover partial costs through some of the strategies listed above for publishers.


Commercial alternatives:  The marketplace for electronic books is chaotic at this moment.  Questia, designed to be an online source of at least 50,000 of the best books, with sophisticated software to support searching and the creation of footnotes, marketed itself directly to students at a $20-30 monthly fee.  Although the project was well capitalized and attracted a great deal of media attention, it has gone out of business.  Librarians have long observed that charging for resources in the academic environment reduces use.  Students' desire for the convenience of online information sends them to the web and to the much-used electronic resources of their own libraries.  That love of convenience apparently does not extend to purchasing Questia under current pricing models.


During the same period, the company netLibrary announced that it would provide new full-text books online.  netLibrary marketed itself to libraries through consortia.  Use of materials, thus, was at no direct cost to students and faculty.  While students appreciated the convenience of being able to use the resource online, they had many complaints about its functionality; in particular, they resented the fact that books could only be printed one page at a time and that books were unavailable if another individual was using them.  The economic models behind netLibrary's charges also seemed to adhere to those of paper books rather than recognizing the economies of digital materials.  The assets of netLibrary have since been sold to OCLC.


At this time, the marketplace responses suggest that turning the million-book project into a private, revenue-generating source would not offer a sustainable model.  JSTOR, Project Muse, and other digital journal projects that offer online materials with superior functionality and sustainability continue to flourish.  At some future time, an enhanced version of the resource might be marketed commercially if it offered sufficient added functionality to encourage a user to pay for the commercial version rather than use the free one.


Another commercial alternative might revolve around relationships with publishers.  As publishers find that making a book available increases sales, they might be required to contribute to the support of the project by subscribing to support buy buttons and by paying a part of their revenues from print-on-demand sales of out-of-print materials.


G.       Logistical Challenges


Three logistical challenges face the project: 1) throughput on each scanner, 2) time to completion, and 3) movement of books to and from India and China.


Optimum scanner throughput: 


One Minolta scanner running two shifts daily                 =                16 books per day

One scanner @ 250 work days per year                         =                4,000 books/year

With currently supplied 18 scanners                          =                72,000 books/year

With a total of 100 scanners @ 250 days/year                 =                400,000 books/year


Allowing for a generous 50% deterioration in throughput (to 200,000 books/year), 100 scanners can complete the project in five years.
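

The throughput figures above can be checked with a short calculation.  The following is only a minimal sketch in Python: the function names and the `efficiency` parameter are illustrative assumptions, not part of the project plan, while the 16 books per day and 250 work days per year are the figures stated above.

```python
import math

BOOKS_PER_SCANNER_PER_DAY = 16   # one scanner, two shifts daily (figure from the text)
WORK_DAYS_PER_YEAR = 250         # figure from the text
TARGET_BOOKS = 1_000_000


def books_per_year(scanners, efficiency=1.0):
    """Annual output for a given number of scanners.

    `efficiency` is a hypothetical knob: 1.0 is the optimum rate,
    0.5 models the 50% deterioration allowed for in the text.
    """
    return scanners * BOOKS_PER_SCANNER_PER_DAY * WORK_DAYS_PER_YEAR * efficiency


def years_to_complete(scanners, efficiency=1.0, target=TARGET_BOOKS):
    """Whole years needed to reach the target at a constant annual rate."""
    return math.ceil(target / books_per_year(scanners, efficiency))


if __name__ == "__main__":
    for n in (1, 18, 100):
        print(f"{n:>3} scanners: {books_per_year(n):,.0f} books/year")
    print("100 scanners at 50% throughput:",
          years_to_complete(100, efficiency=0.5), "years")
```

Under these assumptions, 100 scanners at half the optimum rate produce 200,000 books per year, which reaches one million books in five years.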



China and India have demonstrated time and again that they are the best destinations for tasks requiring skilled manpower, from both an economic and a technical-complexity point of view.


Time to completion:  The time to completion depends on decisions about how many shifts are running (one to three), how many days are worked annually, and how consistently the operators are able to maintain the one-book-per-hour schedule.  Nevertheless, the equipment provided by NSF and industry should allow the project to be completed within five years.


Movement of books from the U.S. to India and China:  Indian and Chinese academic libraries are large and will contain some of the material to be included in this project; over 700 of these libraries are OCLC members whose holdings can be easily ascertained.  Because the scanning centers will be distributed around the countries, they can easily be established in places where transporting materials from libraries to them will pose minimal difficulty.  It may even be feasible to locate some of the centers in academic libraries.


The task of shipping materials from the U.S. to India will be a monumental one.  The current plan is to use air containers rather than shipping containers to reduce the time that materials would be away from their owning libraries. 


20 x 20 air containers are shipped by weight.  The estimated cost would be about $7,500 per trip.  Such a container will hold about 20,000-23,000 volumes, which can be packed at originating libraries in small vendor boxes.  It may be desirable to shrink-wrap the books to ensure their intact arrival.  The basic shipping cost per book will be about seventy-five cents per round trip.  Some libraries may select some materials that do not need to be returned because they are in the process of being discarded, but the default will be that materials will return to their originating libraries.  Some additional funding is being sought from other sources to cover library costs in selecting, packing, and transporting the books.  The Center for Research Libraries might supply some of its own materials to the project.  It might also serve as a collection place towards the end of the project, when the quantity of material to be shipped falls below the whole-container level.
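

The per-book shipping figure follows from the container cost and capacity given above.  The sketch below is illustrative only: the function name is hypothetical, and it assumes that the return trip costs the same $7,500 and carries the same number of volumes as the outbound trip.

```python
COST_PER_TRIP_USD = 7_500   # estimated cost of one container trip (figure from the text)


def round_trip_cost_per_book(volumes_per_container, cost_per_trip=COST_PER_TRIP_USD):
    """Outbound plus return shipping cost per volume, in dollars.

    Assumes (hypothetically) that both legs cost the same and that the
    container is equally full in each direction.
    """
    return 2 * cost_per_trip / volumes_per_container


if __name__ == "__main__":
    # Capacity range quoted in the text: about 20,000 to 23,000 volumes.
    for volumes in (20_000, 23_000):
        cost = round_trip_cost_per_book(volumes)
        print(f"{volumes:,} volumes per container: ${cost:.2f} per book, round trip")
```

At 20,000 volumes per container this gives $0.75 per book, matching the seventy-five cent estimate; at 23,000 volumes the cost falls to about $0.65.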


H.       People at Carnegie Mellon University


The directors of the existing Universal Library Project will serve as the main consultants for the project.  They are:


·  Dr. Raj Reddy, Principal Investigator, Herbert A. Simon Professor of Computer Science, cochair of the President’s Information Technology Advisory Committee (PITAC), holder of numerous awards and prizes.

·  Dr. Michael Shamos, Distinguished Career Professor in computer science and intellectual property attorney, director of the Universal Library project, and codirector of the e-commerce program.

·  Dr. Jaime Carbonell, Allen Newell Professor of Computer Science and Director of the Language Technologies Institute, whose research interests include machine translation, intelligent indexing, auto-summarization, and information mining. This project will provide the infrastructure needed for many of the research activities of Carbonell and associates in LTI.

·  Dr. Robert Thibadeau, Principal Research Scientist, with areas of specialization in scanning equipment and privacy.

·  Dr. Gloriana St. Clair, Principal Investigator, University Librarian, and editor of portal: Libraries and the Academy and an active member of the Digital Library Federation.


Additional university libraries personnel include:


·  Ms. Gabrielle Michalek, Digital library projects coordinator, successful leader of the scanning of 1 million pages of digital archival materials, the 100 book project and the 1000 book project.

·  Ms. Erika Linke, Associate University Librarian, with expertise in collection management, digital libraries, and intellectual property issues.

·  Ms. Denise Troll, Associate University Librarian and Distinguished Fellow of the Digital Library Federation, with broad competencies in digital libraries and special interests in user studies.


People elsewhere


·  Dr. Ching-chih Chen, Professor of the Graduate School of Library and Information Sciences, Simmons College, member of PITAC.

·  Dr. Gao Wen, Professor of Computer Science, Vice President of University of Science and Technology of China, Deputy President of the Graduate School of Chinese Academy of Sciences.

·  Dr. N. Balakrishnan, head of the Indian Institute of Sciences Division of Information Sciences.

·  Dr. Daniel Greenstein, Executive Director of the Digital Library Federation.


Appendix:  Chinese Collections


Chinese collections: Several of the Chinese Universities participating in the project have identified collections they will want to scan initially.  China uses a different system for intellectual property.  Appropriate permissions will be secured to scan the materials included in this project.  Six other universities may also contribute materials.


Beijing University

·          Ancient rare books including Song and Yuan Dynasty rare books, family trees, paintings, and inscription rubbings.

·          Chinese periodicals before 1949 in politics, law, culture, education, finance, economics, students, women, academics, technology, religion, folk customs, and natural sciences.


Tsinghua University

·          Ancient engineering technology history and study in China

·          China’s contribution to Science and Technology, including engraved bone texts.


Fudan University

·          Full text of documents in the Chinese Culture Documents Database

·          Full text of documents in the Chinese Classical Literature Database

·          Full text of documents in the Chinese Classical art vision database


Nanjing University

·          Full text of documents in the Jinling University Technical Periodical database

·          “Six-dynasties Culture” multimedia database in cooperation with the library of Nanjing Normal School

·          Taiping Heavenly Kingdom materials


Zhejiang University

·          Dunhuang documents, including hand written and carved ones.

·          China’s Southeast countryside area with local area geography, commercial town materials, and history and technical articles.

·          Tea culture materials, ancient documents, periodicals, texts

·          Silkworm and silk materials, including ancient documents, technical articles, texts


Appendix:  Indian Collections


The Indian agencies participating in the project have done extensive studies and identified documents that are precious, unique to their regions, and free of all copyright issues.  Close to 1,000 documents and books have already been digitized at the three operational centers.

·         Indian Institute of Science:

       ·       One of the oldest and largest S&T libraries, with more than 400,000 holdings, of which nearly 40,000 are estimated to be out of copyright.

·         International Institute for Information Technology, Hyderabad and the Government of Andhra Pradesh, Hyderabad:

       ·       Telugu Text books

·         Indian Institute of Information Technology:

       ·       Sanskrit Literature and S&T Books in English and Indian Languages available from Bose Library, Allahabad.

·         Pune University:

       ·       Maharashtrian Literature and books

·         Goa University:

       ·       Portuguese Literature and Books

·         Tirupathi and Tirumala Devasthanam:

       ·       Sanskrit and Telugu Literature and Vedic documents, palm leaves

·         Anna University:

       ·       Tamil Literature, palm leaves containing ancient Indian medical practices (Ayurveda)

·         National Centre for Software development and the Government of Maharashtra:

       ·       Text Books in Marathi and S&T Books

·         SASTRA:

       ·       Sanskrit and Tamil Literature from Tanjore library dating back to the 4th Century BC

·         Avinashalingam College:

       ·       Books and manuscripts from old libraries in the Tamilnadu region in Tamil, Telugu, English and Sanskrit.