1 HONG KONG UNIVERSITY JANUARY 2003 © 2003 MICHAEL I. SHAMOS
The Million Book Project
Michael I. Shamos, Ph.D., J.D.
Director, Universal Library
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA

2 Where is Pittsburgh?

3 The Universal Library Project of Carnegie Mellon University
All published works of mankind digitized and online
Instantly available
Free to read
In any language
Anywhere in the world
Searchable and browsable by humans and machines
DEMO

4 Why Digitize?
Books are inefficient carriers of information
Heavy, expensive
Environmentally harmful
Linear, not hyperlinked
Poorly indexed
Not searchable
Not easily transported
MOST IMPORTANT: not everyone has every book
IN FACT, no one has every book

5 How Do We Convey Information?
Books
Orally
Observation
Teaching (a combination of the above)
The book is
–Information
–AND a physical carrier
The information can be conveyed digitally
We don’t CARE about the carrier

6 Objections to Digital Books
People can’t read books from a screen
Books are convenient
–You can carry them
–You can write in them
–You can put a place marker in them
–You can lend them to people
Books are beautiful
Books smell nice

7 How Many Books Are There?
1996 World published output: 800,000 books
Total book titles ever published ~ 100M
1 book = 500 pp., 2000 char/page = 1 megabyte uncompressed (about 1 floppy disk)
–10^8 books = 10^14 bytes = 100 terabytes
–Disk costs HK$10 per gigabyte
–100 terabytes costs about HK$1 million
Total books in WorldCat = 41,000,000
–Requires only 41 terabytes, HK$410,000
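The storage arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it uses the slide's own figures, and it assumes one byte per character and decimal units (1 gigabyte = 10^9 bytes, 1 terabyte = 10^12 bytes). The function names are made up for illustration.

```python
# Back-of-the-envelope storage estimate using the figures quoted on the slide.
# Assumptions (not stated on the slide): 1 byte per character, decimal units.
PAGES_PER_BOOK = 500
CHARS_PER_PAGE = 2000
HKD_PER_GIGABYTE = 10  # 2003 disk price quoted on the slide

BYTES_PER_BOOK = PAGES_PER_BOOK * CHARS_PER_PAGE  # 1,000,000 bytes = 1 megabyte


def storage_terabytes(num_books: int) -> float:
    """Uncompressed size of num_books books, in terabytes."""
    return num_books * BYTES_PER_BOOK / 10**12


def storage_cost_hkd(num_books: int) -> float:
    """Disk cost in Hong Kong dollars at the quoted price per gigabyte."""
    gigabytes = num_books * BYTES_PER_BOOK / 10**9
    return gigabytes * HKD_PER_GIGABYTE


if __name__ == "__main__":
    for label, count in [("All titles ever published", 10**8),
                         ("WorldCat", 41_000_000)]:
        print(f"{label}: {storage_terabytes(count):.0f} TB, "
              f"HK${storage_cost_hkd(count):,.0f}")
```

Under these assumptions the two cases come out at 100 terabytes for HK$1,000,000 and 41 terabytes for HK$410,000, matching the slide.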

8 We Can Store Everything
100 terabytes can store:
3,000,000,000 photographs (compressed)
100,000,000 books
10,000 movies
300 years of music
100 terabytes occupies 240 cubic feet on DVD = 1 van 6 x 4 x 10 feet

9 We Can Send Everything
Human speech: 30 bits/sec
Gigabit Internet: 1,000,000,000 bits/sec
(This talk: < 1 millisecond including slides)
Feb. 2002 Fujitsu achieved 5 terabits per second on one optical fiber
100 terabytes = 800 terabits
It would take less than 3 minutes to transmit every book ever published
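The "less than 3 minutes" claim follows directly from the two numbers on the slide. A minimal check, assuming decimal units and ignoring protocol overhead (the helper name `transmit_seconds` is illustrative only):

```python
# Transmission-time estimate from the slide's figures.
# Assumptions: decimal units, no protocol overhead, sustained peak rate.
BITS_PER_BYTE = 8


def transmit_seconds(num_bytes: int, bits_per_second: int) -> float:
    """Time to send num_bytes over a link running at bits_per_second."""
    return num_bytes * BITS_PER_BYTE / bits_per_second


LIBRARY_BYTES = 100 * 10**12   # 100 terabytes = 800 terabits
FIBER_BPS = 5 * 10**12         # 5 terabits per second (Fujitsu, Feb. 2002)
GIGABIT_BPS = 10**9            # gigabit Internet

if __name__ == "__main__":
    fiber = transmit_seconds(LIBRARY_BYTES, FIBER_BPS)
    gigabit = transmit_seconds(LIBRARY_BYTES, GIGABIT_BPS)
    print(f"5 Tbit/s fiber: {fiber:.0f} s ({fiber / 60:.1f} min)")
    print(f"1 Gbit/s link:  {gigabit / 86400:.1f} days")
```

At 5 terabits per second the transfer takes 160 seconds, which is where the slide's bound comes from. The gigabit comparison is not on the slide; it is included only to show the scale of the difference.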

10 Why a Universal Library?
The largest library in the world (U.S. Library of Congress) has less than 20% of all books
–Two hours to retrieve one book
–Must travel to Washington, DC
–No copying allowed
Largest university library: 14 million (Harvard)
Hong Kong University: 3 million
Typical large U.S. university: 1 million
Largest high school: 130,000 (Phillips Andover)
Largest public high schools: 30,000 (U.S.)
Average high school: 5,000 (U.S.)

11 Universal Library Goals
Democratization of information
–Knowledge is power
Education, distance learning
–“Library” for distance education
Research, technology transfer
Promotion of understanding
Preservation of human culture

12 The Million Book Project
A million books is a lot. CMU just reached 1 million.
Idea: scan 1 million books in each of several countries. Make them available to everyone
NSF provided $3 million to buy scanners for China and India
China and India are each providing 500 full-time people for scanning
Each country is scanning 1 million books over the next 3 years
CMU is hosting, indexing, building infrastructure

13-18 Million Book Project Operation (six slides with this title; no other text)

19 Effect of the Million Book Project
All books scanned (in many languages) will be available free to read to everyone over the Internet
Many cultural artifacts and treasures are being scanned
All works are fully keyword-indexed and searchable
All participating countries will have complete copies (mirrors) of all content
Knowledge will be available to all

20 Partners
China
–Beijing University
–Chinese Academy of Science
–Fudan University
–Ministry of Education of China
–Nanjing University
–Shanghai Jiaotong University
–State Planning Commission of China
–Tsinghua University
–Zhejiang University

21 Partners
India
–Arulmigu Kalasalingam College Of Engineering
–Goa University
–Indian Institute of Information Technology - Allahabad
–Indian Institute of Science
–International Institute of Information Technology - Hyderabad
–Shanmugha Arts, Science, Technology & Research Academy
–Tirumala Tirupati Devasthanams
–Maharashtra Industrial Development Corporation
–University of Pune

22 The Copyright Problem
Compulsory License
–Owner CAN’T refuse; user MUST pay
–Limited in US (Music: 1.55¢/min, 8.0¢/song)
–Extensive compulsory licensing in Japan
Flat-fee subscription (e.g. HBO)
Free (subsidized by government)
Public Lending Right (UK)
“Buy” button
Metered use (electric company)
Micropayments

23 Roadblocks
Biggest obstacle: librarians
Belief that the project is too large
No funding
–In the U.S., everyone assumes it is being done
–Outside the U.S., everyone assumes the U.S. is doing it
Copyright
Myriad of small independent digital libraries

24 Policy Challenges
Convenience displaces quality (Gresham)
What to digitize first?
Suitable copyright law
Economics (Who pays? Who gets?)
Privacy
Reliability of information
Change in the nature of teaching, learning

25 LAYERED UL MODEL
(Diagram labels: UNIVERSAL LIBRARY: DIGITIZED ITEMS; NAVIGATION TOOLS; RETRIEVER SERVICE; CUSTOM CATALOGS; HYPERTEXT GENERATORS; SEARCHERS; TRANSLATORS; NEWS AGENTS; ENCYCLOPEDIA; HUMAN USERS; DIRECT MACHINE USERS; VALUE-ADDED SERVICES; BASELINE UL SERVICES)

26 The Universal Dictionary
A glossary containing every word in every language, with a translation
Use: indexing the Universal Library
Now has 1 million words (26 languages)
2 million by February (50 languages)
3 million by May (80 languages)

27 Q & A

28 Multilingual Searching
Find all documents containing “elephant”
Find all documents about elephants
–Even if the word “elephant” does not occur in the document
Translation, transliteration
–Book titles, works of art, proper names
–Idioms, colloquial phrases

29 Use of © Content
Philosophy: must pay for use
–Authors, publishers must not lose
Implied license
Bulk licensing
Compulsory licensing

30 The Universal Dictionary
Lexicon of all words in all languages, with English translations, e.g.
Obtained from
–Web dictionaries
–Scanning + OCR
–Publishers’ machine-readable form
Uses:
–Indexing the Universal Library
–Machine translation
–Spelling correction
–Linguistic studies

31 Technological Challenges
Input (scanning, digitizing, OCR)
Data representation
–text, kset, notations, images, web pages
Navigation and Search
Multilingual Issues
Output (voice, pictures, virtual reality)
Synthetic documents

32 Navigation
Keyword searching does not scale
–Imagine 10^6 hits
Browsing, finding, searching, flying
Fractal view
–Keys are granularity and connectivity
View whole collections or one glyph
–Hyperbolic trees, virtual reality, discovered similarities

33 Hyperbolic Tree Navigation

34 Multilingual Issues
Character sets
Representations
Íîäà ôèçè÷åñêè íàõîäèòñÿ â çäàíèè Èçâåñòèé
Нода физически находится в здании Известий
Multilingual navigation
Translation assistance
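The two "Representations" lines look like the same bytes read under two different character sets. The sketch below reproduces the effect in Python. It rests on an assumption the slide does not state: that the Russian sentence (roughly, "The node is physically located in the Izvestia building") was stored in Windows-1251 and then displayed as ISO 8859-1 (Latin-1).

```python
# Reproducing the character-set mismatch shown on the slide.
# Assumption: the text was encoded as Windows-1251 (cp1251) and then
# misread as ISO 8859-1 (latin-1); the slide does not name either encoding.
original = "Нода физически находится в здании Известий"

raw_bytes = original.encode("cp1251")   # one byte per Cyrillic letter
garbled = raw_bytes.decode("latin-1")   # same bytes, wrong character set

# The damage is reversible as long as the bytes themselves are intact.
repaired = garbled.encode("latin-1").decode("cp1251")

if __name__ == "__main__":
    print(garbled)
    print(repaired)
```

The bytes never change; only their interpretation does. That is why a digital library has to record the character set alongside the text, or be able to guess it.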

35 UNIVERSAL LIBRARY STATUS
>10,000 digital volumes
Public-domain issues of the New York Times
Portal to hundreds of other collections
Art, music, video, Internet radio
Magazines, newspapers, journals
Installing 1.25 terabytes
Visit www.ulib.org

36 Language Identification
Given a string x, which language(s) is it from?
–What language is “peogwir” from?
Given x, which language(s) does it seem to be from?
–“contrefaçon” “dazs” “chalupa” “mbwewe”
Character set may be unknown
Brief input (e.g. single word)
Intermixed languages
–“Zeitgeist Fever”
Neologisms, slang, abbreviations

37 Generative Approach
Assume that the lexicon of a language L is generated by a probabilistic finite-state machine $M_L$
(Diagram: a tree of states growing from START OF WORD, written <, through the letters a, b, ..., z, to the end-of-word marker >. Edge labels: PROB THAT WORD STARTS WITH A; PROB THAT WORD STARTS WITH Z; PROB (a|<a); PROB (>|<a); PROB (a|<z); PROB (z|<z); PROB (z|<za). At each > the PRODUCT of the edge probabilities along the path is the PROB of the word spelled by that path.)

38 Problems
Where do all the required probabilities come from?
How can they all be stored?
If string x does not actually occur in a language, its probability will be zero. Won’t work for neologisms or misspellings.
“Moving trigrams” work

39 Generative Approach
Let $p_L(y \mid x)$ be the probability that string $x$ is followed by string $y$ in language $L$ (i.e. the probability that, given a prefix $x$, the suffix is $y$)
Then $p_L(x)$, the probability that $x = {<}x_1 x_2 \cdots x_n{>}$ was generated by $L$, is
$$p_L(x) = p_L(x_1 \mid {<}) \, p_L(x_2 \mid {<}x_1) \cdots p_L({>} \mid {<}x_1 x_2 x_3 \cdots x_{n-1} x_n)$$
This computation requires huge memory, so approximate:
Assume $p_L(x_n \mid {<}x_1 x_2 x_3 \cdots x_{n-1}) \approx p_L(x_n \mid x_{n-2} x_{n-1})$
So $p_L(x) \approx p_L(x_3 \mid x_1 x_2) \cdots p_L({>} \mid x_{n-1} x_n)$
Try it
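One way the trigram approximation might be tried out is sketched below. It is not the Universal Library's implementation. The word lists are tiny made-up samples, the add-one smoothing with `alphabet_size=64` is an assumption added so that unseen trigrams (neologisms, misspellings) do not get probability zero, and the names `TrigramModel` and `identify` are hypothetical. The product here also includes the term for the second letter given `<` and the first letter, which the slide's formula omits.

```python
import math
from collections import Counter


def trigrams(word: str) -> list[str]:
    """Moving trigrams over the word wrapped in start (<) and end (>) markers."""
    padded = "<" + word.lower() + ">"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


class TrigramModel:
    """Estimates p_L(c | a b) from a word list for one language L."""

    def __init__(self, words: list[str], alphabet_size: int = 64):
        self.trigram_counts = Counter()
        self.context_counts = Counter()   # counts of the two-letter prefix
        self.alphabet_size = alphabet_size  # smoothing constant (assumed)
        for word in words:
            for t in trigrams(word):
                self.trigram_counts[t] += 1
                self.context_counts[t[:2]] += 1

    def log_prob(self, word: str) -> float:
        """Sum of log p_L(c | a b) over the word's trigrams, add-one smoothed."""
        total = 0.0
        for t in trigrams(word):
            numerator = self.trigram_counts[t] + 1
            denominator = self.context_counts[t[:2]] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


def identify(word: str, models: dict[str, TrigramModel]) -> str:
    """Return the language whose model gives the word the highest probability."""
    return max(models, key=lambda lang: models[lang].log_prob(word))


# Toy training data, far too small for real use.
models = {
    "english": TrigramModel(["the", "there", "then", "think", "thing",
                             "with", "which", "water", "other", "mother"]),
    "german": TrigramModel(["der", "die", "das", "nicht", "schon",
                            "schule", "zwischen", "ich", "auch", "machen"]),
}

if __name__ == "__main__":
    for w in ["that", "schnell", "peogwir"]:
        print(w, "->", identify(w, models))
```

Storing only trigram counts addresses the memory problem, and working in log-probabilities avoids underflow on long words. With word lists this small the answers are only illustrative; a usable model would need a large lexicon per language.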

40 Searching Mathematics
Has this integral ever been evaluated?

41 Searching Mathematics
MATHEMATICA C.F.:
Integrate[Times[Power[E, Times[-1, Power[V1, 2]]], Sin[Power[V1, 2]]], {V1, 0, Infinity}]

42 Hierarchical Nature of Aboutness
What does it mean to say that a book is “about” chemistry?
Can a word be about chemistry?
If one paragraph is about chemistry, is the book about chemistry?
If the book is about chemistry, is every sentence in it about chemistry?
Aboutness is central to cataloging and retrieval

43 Aboutness Hierarchy
(Diagram labels: Universe; Word; Sentence; Paragraph; Section; Chapter; Collection; Book; Newspaper; Article; Photograph; Object; 3D Artifact; Glyph. Annotations: KEYWORD SEARCHING OCCURS HERE; SUBJECT SEARCHING OCCURS HERE)

44 Thesauri and Aboutness
A set of numbered thesaurus entries defines a topic
Thesaurus is topic-hierarchical
1011 Hindrance
–1011.5 barrier, bar, gate, fence, wall, rampart, dam, moat …
A word is “about” any topic to which it belongs
Dam:
–241.1 lake
–293.7 close (v.)
–560.11 mother
–757.2 horse
–856.11 put a stop to (v.)
–1011.5 barrier
Thesaurus + aboutness hierarchy can be used to disambiguate meanings without “understanding”
Note: topic numbers are language independent
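A rough sketch of how topic numbers could disambiguate "dam" without any understanding: pick the sense of the target word that shares a topic number with the most surrounding words. The thesaurus excerpt below uses only the numbers and words shown on the slide. Treating "lake", "mother" and "horse" as members of the topics they label is an assumption, as is the simple voting rule; `disambiguate` is a hypothetical helper, not a function from the project.

```python
from collections import Counter
from typing import Optional

# Tiny excerpt: word -> set of thesaurus topic numbers it belongs to.
# Numbers are copied from the slide; membership of the label words is assumed.
THESAURUS = {
    "dam": {"241.1", "293.7", "560.11", "757.2", "856.11", "1011.5"},
    "barrier": {"1011.5"},
    "fence": {"1011.5"},
    "wall": {"1011.5"},
    "moat": {"1011.5"},
    "lake": {"241.1"},
    "mother": {"560.11"},
    "horse": {"757.2"},
}


def disambiguate(word: str, context: list[str],
                 thesaurus: dict[str, set[str]]) -> Optional[str]:
    """Return the topic of `word` shared by the most context words, if any."""
    candidates = thesaurus.get(word, set())
    votes = Counter()
    for other in context:
        if other == word:
            continue
        # A context word votes for every candidate sense it also belongs to.
        for topic in thesaurus.get(other, set()) & candidates:
            votes[topic] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(disambiguate("dam", "the dam and the wall and the moat".split(),
                       THESAURUS))
    print(disambiguate("dam", "the horse and its dam".split(), THESAURUS))
```

Because the votes are cast on topic numbers rather than on words, the same procedure would work across languages once each language's words are mapped to the shared numbering. This sketch ignores the hierarchy itself (for example, that 1011.5 sits under 1011), which a fuller version would also exploit.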

45 Set Theory of Aboutness
Given a finite universe $W$ of objects (e.g. all words)
Define a topic $T \subseteq W$ to be a subset of $W$ (a wordlist)
Topic inclusion (defines the hierarchy):
–Topic $T$ includes topic $S$ iff $S \subseteq T$
Definition of aboutness:
–A subset $P \subseteq W$ of the universe (e.g., a book) is about topic $T$ iff $P \cap T \neq \emptyset$ (intersection is nonempty)
Hierarchical nature of aboutness:
–If $P$ is about $S$ and $T$ includes $S$, then $P$ is also about $T$
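These definitions map directly onto Python sets. The sketch below states them as two small functions and then brute-force checks the hierarchical property over every triple of subsets of a four-word universe. The word lists are invented for illustration. The property holds because any element shared by P and S also lies in T whenever S is a subset of T.

```python
from itertools import combinations


def includes(t: frozenset, s: frozenset) -> bool:
    """Topic t includes topic s iff s is a subset of t."""
    return s <= t


def is_about(p: frozenset, t: frozenset) -> bool:
    """Object p is about topic t iff their intersection is nonempty."""
    return bool(p & t)


def all_subsets(universe: frozenset) -> list[frozenset]:
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def hierarchy_holds(universe: frozenset) -> bool:
    """Check: p about s, and t includes s, implies p about t (all triples)."""
    subsets = all_subsets(universe)
    return all(is_about(p, t)
               for p in subsets for s in subsets for t in subsets
               if is_about(p, s) and includes(t, s))


# Invented example: a narrow topic inside a broader one, and a short "book".
W = frozenset({"acid", "atom", "poem", "the"})
ACIDS = frozenset({"acid"})
CHEMISTRY = frozenset({"acid", "atom"})
BOOK = frozenset({"the", "acid"})

if __name__ == "__main__":
    print(is_about(BOOK, ACIDS), includes(CHEMISTRY, ACIDS),
          is_about(BOOK, CHEMISTRY))
    print(hierarchy_holds(W))
```

Note how weak the definition is: a single shared word makes a book "about" a topic, so a practical system would likely add a threshold or weighting on top of it.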

46 We Can Search a Few Things
Text
In the Roman alphabet
“Hidden” databases effectively unsearchable
No images or two-dimensional structures
–math
–music
–dance notation...
No subject index of photographs or art
–Corbis is one of the “best”

