
1 The Million Book Project: The Mini-UL Digital Library Platform
Carnegie Mellon University, School of Computer Science
Raj Reddy, Eric Burns

2 What is the Million Book Project?
- Free-to-read, open-platform digital library
  - Worldwide distribution and mirroring
  - Public domain works
  - Out of print but in copyright
  - Rare materials
- Collaborative content acquisition
  - India
    - 20 mini scanning centers, 3 mega scanning centers
    - Over 80,000 books to date
  - China
    - Over 30,000 books to date
  - USA / Carnegie Mellon (Hunt Library/SCS)
    - 1200 books, technology contributor
- Truly multi-lingual corpus
  - Several Indian languages
  - Mandarin Chinese
  - Most European languages

3 MBP offers unique systems challenges
- Multiple deployments
  - China
  - India
  - Partners in US
- Human-intensive scanning process
  - Error-prone
    - DC XML entered by hand
    - Operator error on scanning devices
  - Difficult to standardize
  - Multiple QA passes required
- Everyone wants autonomy and customization
  - System-level solution must satisfy small and large data sets
  - CMU must provide a framework for remote sites to extend
  - Equipment budget is limited
- Developing nations’ networks are limited
  - China, India output must be shipped to US

4 Core Problems
- Multiple scanning centers, each with:
  - Distinct values and goals
  - Limited connectivity
  - Varying IT infrastructure
- Common base requirements
  - Searching
  - Browsing
  - Viewing
  - File-system compatibility
- Basic standard for acquiring and storing scanned books
  - Data preservation
  - Quality assurance
  - Flexibility
  - Openness
- Fault-tolerant storage at all sites
- Data movement via physical shipment
- Standardized OS and base software

5 Our Solution: Mini-UL
- Embedded digital library on a CD
  - OS (Knoppix Linux), servers (Apache, PaperSight ImageServer), code (Perl) on single ISO
  - Boots single systems or whole clusters
  - Ensures standardization, eases upgrades
    - To use new software, admins burn CD and reboot
- Commodity PC and disk hardware spec
  - Software RAID: use low-end PC as network-attach storage
  - Sub-$1000 PC = 1 TB NASD
    - Barebones economy PC
    - 250 GB OEM disk x 4
  - Add storage PCs as needs grow
  - 1 processor per storage unit
- CD + PC(s) = Embedded digital library
  - “Black box” approach
    - Dump MBP-format books into upload bucket
    - Easily search, browse, view, and download all books added

6 The MBP Book Format
- Dir w/ five subdirectories:
  - OTIFF
    - “Original TIFF”: exactly as scanned
    - Eight-digit, zero-padded page numbers (00000123.tif)
    - 1-bit color at 600 DPI, lossless
  - PTIFF
    - “Processed TIFF”: current best batch image processing
    - Eight-digit zero-padded numbers match OTIFF
  - TXT
    - ASCII, UTF-8, or UTF-16 text
    - Numbers match OTIFF/PTIFF
  - HTML
    - UTF-8 HTML w/ low-res JPEG images
    - Numbers match OTIFF/PTIFF
  - [MARC|DC]
    - Binary MARC record
    - Dublin Core XML
- Flexible: other format directories can be added
- Internal storage format:
  - OTIFF/PTIFF -> multipage
  - TXT/HTML -> zip
  - 500 page book = 2001 files
  - Converted at addition time to 5 files
  - Speeds copying
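
The naming rule and the "500 page book = 2001 files" arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration, not Mini-UL code: the helper names, the `.txt` and `.html` extensions, the metadata file name, and page numbering starting at 1 are all assumptions. The count follows the slide's figure of one file per page in each of the four page formats plus one metadata record.

```python
# Hypothetical helpers illustrating the MBP book layout described on the slide.
# Only the ".tif" extension appears in the source; the others are assumed.
PAGE_FORMATS = {"OTIFF": "tif", "PTIFF": "tif", "TXT": "txt", "HTML": "html"}


def page_filename(page: int, ext: str) -> str:
    """Return the eight-digit, zero-padded file name for a page."""
    if not 0 <= page <= 99_999_999:
        raise ValueError("page number does not fit in eight digits")
    return f"{page:08d}.{ext}"


def expected_paths(pages: int, metadata_name: str = "metadata.xml") -> list[str]:
    """List every file a book with `pages` pages would contain.

    Assumes pages are numbered from 1 and that metadata is a single
    Dublin Core file (the name is made up for this sketch).
    """
    paths = [
        f"{subdir}/{page_filename(p, ext)}"
        for subdir, ext in PAGE_FORMATS.items()
        for p in range(1, pages + 1)
    ]
    paths.append(f"DC/{metadata_name}")
    return paths


if __name__ == "__main__":
    print(page_filename(123, "tif"))   # 00000123.tif
    print(len(expected_paths(500)))    # 4 formats x 500 pages + 1 metadata file
```

Because every format uses the same page number in its file name, a page image, its text, and its HTML can be matched by name alone.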

7 High-level Cluster Architecture
[Diagram: web traffic reaches the Head Node (NASD 0), which connects over an internal network subnet to NASD 1, NASD 2, ..., NASD n, the Network-Attach Storage Devices (SATA RAID PCs).]

8 Adding a Book
- Head node has SMB share “Upload”
- User moves one or more MBP-format books into Upload share
- System automatically checks each book for completeness/correctness:
  - All formats present
  - Contiguous page numbers
  - Metadata present and parseable
  - Errors presented to user for correction
- Converts to internal storage format
- Assigns serial number
- Moves to NASD node with most free space
- Incremental search index
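
A minimal Python sketch of the upload checks and the placement rule listed on this slide. It is not the Mini-UL implementation: the function names, the in-memory directory listing, and the error messages are assumptions made for illustration, and metadata is only checked for presence, not parsed.

```python
import re

# Page-bearing subdirectories from the MBP book format; metadata is checked separately.
REQUIRED_DIRS = ("OTIFF", "PTIFF", "TXT", "HTML")
PAGE_NAME = re.compile(r"^(\d{8})\.\w+$")  # e.g. 00000123.tif


def check_book(listing: dict[str, list[str]]) -> list[str]:
    """Return error messages for a book given as {subdirectory: [file names]}.

    An empty list means the book passed every check in this sketch.
    """
    errors = []
    page_sets = {}
    for subdir in REQUIRED_DIRS:
        names = listing.get(subdir)
        if not names:
            errors.append(f"{subdir}: format missing")
            continue
        pages = set()
        for name in names:
            match = PAGE_NAME.match(name)
            if match:
                pages.add(int(match.group(1)))
            else:
                errors.append(f"{subdir}/{name}: not an eight-digit page name")
        page_sets[subdir] = pages
        if pages:
            # Contiguity: every number between the lowest and highest page exists.
            missing = sorted(set(range(min(pages), max(pages) + 1)) - pages)
            if missing:
                errors.append(f"{subdir}: missing pages {missing}")
    # Page numbers must match across all formats that are present.
    if len({frozenset(p) for p in page_sets.values()}) > 1:
        errors.append("page numbers differ between formats")
    if not (listing.get("MARC") or listing.get("DC")):
        errors.append("metadata missing (expected MARC or DC)")
    return errors


def pick_node(free_bytes: dict[str, int]) -> str:
    """Choose the storage node with the most free space."""
    return max(free_bytes, key=free_bytes.get)
```

Returning a list of messages rather than raising on the first problem matches the slide's "errors presented to user for correction": the uploader sees everything wrong with a book in one pass.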

9 Viewing a book
- Users view original page images
  - HTML, raw TXT as option
- Intra-book searching
  - Seeks to matching page
  - Highlights token match
  - Rapidly seek from one token match to the next
  - Boolean queries, phrase matching
- PaperSight ImageServer
  - Convert 600 DPI 1-bit TIFF to ~96 DPI 8-bit GIF
  - Real-time conversion performance is faster than human response
  - Anti-aliased grey-scale image is ideal for monitor reading
  - Significant reduction in bandwidth
  - Conversion happens on hosting NASD node, not head
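
The idea behind turning a 1-bit scan into an anti-aliased grey-scale image can be sketched with a simple box filter. This is not PaperSight ImageServer's algorithm, which the slides do not describe; it is a generic illustration that assumes an integer reduction factor of 6 (600 DPI to 100 DPI, close to the slide's ~96 DPI) and a bitmap given as rows of 0/1 values where 1 means ink.

```python
def downsample_bilevel(bitmap: list[list[int]], factor: int = 6) -> list[list[int]]:
    """Reduce a 1-bit bitmap to 8-bit grey by averaging factor x factor blocks.

    bitmap: rows of 0/1 ints, 1 = black ink (an assumed convention).
    Returns rows of grey levels, 0 = black and 255 = white.
    Rows and columns that do not fill a whole block are dropped.
    """
    height = len(bitmap) // factor
    width = len(bitmap[0]) // factor if bitmap else 0
    block_area = factor * factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            # Count ink pixels in the block; partial coverage becomes grey.
            ink = sum(
                bitmap[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            row.append(255 - round(255 * ink / block_area))
        out.append(row)
    return out
```

With a factor of 6, each block of 36 one-bit pixels becomes a single 8-bit pixel, and edges that cover only part of a block come out grey instead of jagged black or white.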

10 Browsing
- Simple alphabetic browse
- Keep list sizes small

11 The Missing Piece: Search
- Searching the full text of tens of thousands of books is computationally intensive
- Solution: parallelize
  - Each NASD node indexes and searches content it stores
  - Results are unified and sorted at head node
  - NASD cluster architecture maintains parity between processors and storage
    - Grow from n to 2n nodes? Search speed remains constant (assuming homogeneous corpus)
    - Search too slow? Increase machine count and redistribute data.
- Search features
  - Fast! 0.1 sec per-token response in most cases (AMD 1400+).
  - Joint bibliographic and full-text search with single query
  - Phrase matching, boolean queries, cross-page phrases
  - Context display for full-text matches
  - Rich scoring system:
    - Metadata matches
    - Token proximity scoring (multi-token queries only)
  - Direct-to-page matching
    - Full text matches yield actual matching page, with highlighting
  - Full search API (Perl)
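
The "each node searches its own content, the head node unifies and sorts" pattern can be sketched as a scatter-gather in Python. This is an illustration under stated assumptions, not the Mini-UL search engine (which is written in Perl): threads stand in for network calls to NASD nodes, each shard is an in-memory `{book_id: {page: text}}` dictionary, and the score is a plain occurrence count rather than the metadata and proximity scoring the slide describes.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def _rank(hit):
    """Sort key: highest score first, then book id and page for stable order."""
    score, book_id, page = hit
    return (-score, book_id, page)


def search_node(shard: dict, token: str) -> list[tuple]:
    """Search one node's books; return (score, book_id, page) hits, best first."""
    needle = token.lower()
    hits = []
    for book_id, pages in shard.items():
        for page, text in pages.items():
            score = text.lower().split().count(needle)  # stand-in scoring
            if score:
                hits.append((score, book_id, page))
    hits.sort(key=_rank)
    return hits


def search_cluster(shards: list[dict], token: str, limit: int = 10) -> list[tuple]:
    """Query every shard in parallel, then merge the sorted results at the head."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_node = list(pool.map(lambda shard: search_node(shard, token), shards))
    # Each per-node list is already sorted, so a k-way merge suffices.
    return list(islice(heapq.merge(*per_node, key=_rank), limit))
```

Because each node only scans the books it stores, adding nodes alongside new books keeps the per-node workload roughly flat, which is the scaling argument the slide makes.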

13 Customization
- APIs provided for all major components:
  - Search
  - Book Reader
  - Metadata processing and conversion
- All HTML lives in read-write space on head node
  - Development sites can create rich HTML hierarchies
- Scripting is not limited to CD contents
  - cgi-bin and site_perl can be extended
- CD/core upgrades leave extensions untouched

14 Future Directions
- Search engine in wider distribution
  - GPL
  - Perl CPAN
- “Phone Home” capability
  - Individual Mini-UL systems with slow but persistent links relay manifests
    - Metadata + text
  - Master site to search all sites
- IIIT Hyderabad contributions
  - MySQL-based metadata search
- Separate search and storage clusters
  - 9 TB hardware RAID servers
  - Multiple diskless search nodes

15 Embedded Digital Library Uses
- Gives MBP sites foundation on which to build
  - Allows convergence on standards as sites contribute new extensions to main distribution
  - Gives basic search, browse, view, and audit capability to any site, regardless of development staff
- Uses extend beyond MBP deployments
  - Any site with archives of multi-page text documents can benefit
  - Only requirements are a scanner and a PC
  - Virtually no administration required

16 Questions?

