The Indexer’s Legacy: Promoting Access to a Million Books Michael Huggett Edie Rasmussen ICDL 2010.

Overview
Problem statement
Background to study
  – Indexers and Indexes
  – From Print to Digital Book Collections
  – Searching Digital Collections
Research Project
  – Pilot Study
  – Phase I: Building the Collection
  – Phase II: Deconstructing the Indexes
  – Phase III: Building a meta-index
  – Phase IV: Index-augmented search

Digital Book Projects
Project Gutenberg (1971+)
Million Book Project (2002+)
  – Universal Digital Library
Google Books Library Project (2004+)
Open Content Alliance (2005+)
  – Universal Digital Library, Internet Archive
And many others...

Searching Digital Collections
Combination of ‘dirty OCR’ of text plus page image
Standard IR techniques: a query leads to relevance-ranked output
Text-level vs. passage-level retrieval (e.g. INEX Book Track)
Adequate for many purposes
Problems with heterogeneity of text, ambiguity of terms

Problem Statement I
The “million books problem”
  – “…the human life contains only about 30,000 days; reading a book a day we would finish a million books only after 30 lifetimes of reading…No longer a distant probability, a digital representation of [the vast written record of humanity] is taking shape before us…”
  – “digitization does provide scale (or quantity) but does so at the price of rich, largely manual encoding”
(Many More Than A Million, 2007)

Problem Statement II
Role of indexes: the index is one of the oldest known information retrieval devices, representing a network of interrelationships among concepts in a text
Intellectual effort: an index represents hours of interpretation and analysis
Intellectual content: an index includes information about a book’s content but also incorporates the structure of knowledge in a given field
Standard information retrieval techniques reduce index terms (and all text terms) to a ‘bag of words’ model

Research Goal
As we move from print to digital collections of scholarly works, how can we retain, extract and use the knowledge that is embedded in the indexes?
The goal of this research is to develop techniques that will help to capture, visualize and access the world’s digital knowledge by applying text processing techniques to digital indexes of legacy materials.

The Indexing Process
Read → identify indexable concepts → (mark) → create vocabulary → invert? → sort and format (s/ware) → add cross references → edit for consistency
Reduces contents of a book to its essentials (5–10%)
Vocabulary is author’s plus indexer’s
Goal is to facilitate access to material in the text

Knowledge in Indexes
Premises:
  – The index identifies the most significant topics in the book
  – The index expresses the topics in the author’s vocabulary and in the vocabulary of the field (i.e. that of the reader)
  – The index provides links between concepts, showing how they are related
  – As indexes on a topic are aggregated, significant concepts related to that topic, and the relationships between them, are reinforced, creating both a vocabulary and a guide to the collection

Challenges
Not all books are indexed
Indexing conventions have changed over time
Books in the public domain are older; quality of index may be lower
Quality of OCR, errors in text
No markup; index structure is indicated visually (e.g. indents, punctuation)
Matching page numbers in index to physical pages in text

Related Research
‘key ideas’ (Schilit and Kolak, Google Research, 2008)
  – Mining and linking ideas in digital books
  – Quotation extraction (quote plus context)
‘Searching in a book’ (Liesaputra, Witten and Bainbridge, NZDL, 2009)
E-book usability with indexes (Noorhidawata, 2007)
Reorganizing indexes (Chi et al., 2004)
  – Creating mini-indexes ‘on the fly’

Pilot Study I
Work on a small number of digital items
  – 3 biographies of Charles Darwin
  – 12 books on BC history from UBC University Press
Software to parse indexes
  – From PDF to index structures
Operator driven: scan and correct OCR errors; key indicators in database
Parse index terms and entries by shared references
Identify common words on shared page references

Pilot Study II
Preliminary results:
  – Measure of coherence
      Rank terms by frequency and normalize
      Deviation = ∑(average rank – term rank)
      Calculated for content, index entries, index words
      Calculated for all terms, and for shared terms only
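The coherence measure on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the study's software: the function names are hypothetical, ranks are normalized by dividing by each book's vocabulary size, absolute differences are used (a signed sum would cancel to zero), a term missing from a book gets the worst rank, and the result is averaged over terms.

```python
from collections import Counter


def normalized_ranks(tokens):
    """Map each term to a rank in (0, 1]; the most frequent term gets the smallest rank."""
    counts = Counter(tokens)
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    n = len(ordered)
    return {term: (i + 1) / n for i, term in enumerate(ordered)}


def rank_deviation(books, shared_only=False):
    """Mean absolute deviation of each term's rank from its average rank across books.

    books: list of token lists, one per book (content words, index entries,
    or index words). Lower values mean the books agree more on term ranks.
    """
    if not books:
        return 0.0
    ranks = [normalized_ranks(tokens) for tokens in books]
    vocabularies = [set(r) for r in ranks]
    if shared_only:
        vocab = set.intersection(*vocabularies)
    else:
        vocab = set.union(*vocabularies)
    if not vocab:
        return 0.0
    total = 0.0
    for term in vocab:
        # Assumption: a term absent from a book is assigned the worst rank, 1.0.
        term_ranks = [r.get(term, 1.0) for r in ranks]
        average = sum(term_ranks) / len(term_ranks)
        total += sum(abs(average - x) for x in term_ranks) / len(term_ranks)
    return total / len(vocab)
```

Calling `rank_deviation(books)` covers the "all terms" case, and `rank_deviation(books, shared_only=True)` restricts the calculation to terms that occur in every book.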

Pilot Study III
Preliminary results:
  – Index terms show more coherence than corpus terms
  – Suggests that BoB (back-of-book) indexes are a good source of corpus-level keywords

               Corpus   Index Entries   Index Words
All terms      0.5361   0.3163          0.1913
Shared terms   0.3792   0.0129          0.0278

Phase I: Building a Test Collection
Needed:
  – General collection
      Collection of 1000 books
      With indexes!
      In the public domain
  – Topic-oriented collections (5-6?)
      Collections of 100(?) books in a topic
GRAs to identify and download target books
Result: a test collection for this project (and others)

Phase II: Deconstructing the Indexes
No BoB indexing standards
No controlled vocabulary
A few indexing conventions
  – Headings, subheadings, sub-subheadings...
  – Structure is indicated by spacing and punctuation
Need to parse the index to identify entries and page references
Parsing software written and tested
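As a rough illustration of parsing an index into entries and page references, the sketch below handles one simple layout only. It is not the project's parsing software. It assumes an indented (not run-on) index, subheadings indented by a fixed number of spaces, page references following the heading after a comma, and abbreviated ranges such as 33-5 meaning 33 to 35. The names `parse_index` and `indent_width` are hypothetical.

```python
import re

# One page reference: a number, optionally followed by a hyphen or en dash and an end page.
PAGE_REF = re.compile(r"(\d+)(?:\s*[-–]\s*(\d+))?")


def parse_index(text, indent_width=2):
    """Parse an indented index into a list of {"path": headings, "pages": page numbers}."""
    entries = []
    stack = []  # headings on the path from the top-level heading to the current entry
    for raw in text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        level = indent // indent_width
        line = raw.strip()
        # The heading ends at the first comma that is followed by a digit.
        match = re.search(r",\s*(?=\d)", line)
        if match:
            heading, refs = line[:match.start()], line[match.end():]
        else:
            heading, refs = line.rstrip(","), ""
        del stack[level:]  # drop headings at this level or deeper
        stack.append(heading.strip())
        pages = []
        for start, end in PAGE_REF.findall(refs):
            if end and len(end) < len(start):
                # Expand an abbreviated range end, e.g. "33-5" becomes 33 to 35.
                end = start[: len(start) - len(end)] + end
            first = int(start)
            last = int(end) if end else first
            pages.extend(range(first, last + 1))
        entries.append({"path": tuple(stack), "pages": pages})
    return entries
```

Real indexes would also need handling for run-on subentries, cross references ("see", "see also"), and OCR errors in the page numbers, none of which this sketch attempts.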

Phase II: Research Questions
How can index structure (run-on or indented, heading hierarchies) be extracted?
Can keyphrases be extracted (proper nouns, concepts)?
What are the syntax and semantics of indexes?
Can we identify the historical development of indexes? How have they changed over time?
Can we use XML to create a useful intermediate product?

Phase III: Building a Meta-Index
Meta-index: a digital, collection-level aggregation of the BoB indexes for a digital collection
Merging/concatenating index entries
May be a standard index format (alphabetical, hierarchical entries), i.e., a digital browsable index
Or may use new formats, e.g. visualizations, topic maps
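The merging step can be pictured with a minimal sketch, assuming each book's index has already been parsed into a mapping from heading to page numbers. The function names, the input shape, and the normalization rule (case-folding and collapsing whitespace) are assumptions made for illustration; a real meta-index would need far more careful matching of headings.

```python
from collections import defaultdict


def normalize(heading):
    """Case-fold a heading and collapse runs of whitespace so near-identical headings merge."""
    return " ".join(heading.casefold().split())


def build_meta_index(book_indexes):
    """Merge per-book indexes into one alphabetical, collection-level index.

    book_indexes: {book_id: {heading: [page numbers]}}
    returns: {normalized heading: {book_id: sorted page numbers}}
    """
    meta = defaultdict(dict)
    for book_id, index in book_indexes.items():
        for heading, pages in index.items():
            key = normalize(heading)
            # Two headings in one book may normalize to the same key; union their pages.
            merged = set(meta[key].get(book_id, [])) | set(pages)
            meta[key][book_id] = sorted(merged)
    return dict(sorted(meta.items()))
```

For example, a heading "Beagle" in one biography and "beagle" in another would end up under a single entry that lists the pages in each book.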

[Figure: diagram with the labels Index, Book, Meta-Index 1, Digital Collection, Meta-Index 2]

Phase III: Research Questions
Can digital versions of BoB indexes be used to facilitate access to digital collections?
What form should these indexes take?
  – Conventional index format (alphabetical/searchable with headings and subheadings)
  – Index visualization
How do these meta-indexes compare to a standard search engine when searching a digital collection?
Evaluation: task-oriented evaluation with human subjects (e.g. humanities scholars)

Phase IV: Index-Augmented Search
Using the index information in new ways
  – Building an ontology in domain areas
  – Identifying concept relationships between index vocabulary and term vocabulary
  – Use for
      Query expansion
      Question answering
      Summarization
      Categorization

Phase IV: Research Questions
Based on standard text processing procedures, i.e. stemming, use of stopwords, keyphrase extraction, term weighting such as tf*idf or BM25
How strong is the relationship between the index entry and the words on the page(s) referred to?
Assume that for a single entry, this relationship is weak; over multiple similar entries in many books, do real relationships emerge and false ones disappear?
Evaluation: using external collections, e.g. TREC or INEX, to measure contribution of index term relationships to retrieval performance
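One way to picture the entry-to-page relationship is to weight the words on an entry's referenced pages with tf*idf, treating each page of the book as a document, and then sum the weights over similar entries in many books. The sketch below is a hypothetical illustration under those assumptions (whitespace tokenization, no stemming, the plain log(N/df) form of idf); the function names are not from the project.

```python
import math
from collections import Counter


def tfidf_for_entry(pages, referenced, stopwords=frozenset()):
    """Score words on the pages cited by one index entry.

    pages: {page number: page text} for one book
    referenced: page numbers cited by the index entry
    returns: {word: tf * idf}, where tf is counted over the referenced pages
    and idf = log(number of pages / number of pages containing the word).
    """
    tokenized = {
        number: [w for w in text.lower().split() if w not in stopwords]
        for number, text in pages.items()
    }
    n_pages = len(tokenized)
    document_frequency = Counter()
    for words in tokenized.values():
        document_frequency.update(set(words))
    term_frequency = Counter()
    for number in referenced:
        term_frequency.update(tokenized.get(number, []))
    return {
        word: count * math.log(n_pages / document_frequency[word])
        for word, count in term_frequency.items()
    }


def aggregate(score_dicts):
    """Sum word scores over the same entry in many books (Counter.update adds values)."""
    total = Counter()
    for scores in score_dicts:
        total.update(scores)
    return total
```

If the slide's hypothesis holds, words that are genuinely related to an entry would accumulate weight in `aggregate` across books, while incidental words would stay low. BM25 could be swapped in for the tf*idf weighting without changing the overall shape.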

Further Research
Building themed or personalized collections (using index for book similarity measures)
Ability to mine large multidisciplinary collections for references (historical, economic, etc.)
Ability to mine collections and build special-format indexes and browsers (e.g. images, figures)
Changes in topics over time, evolution of thinking on a subject
Knowledge discovery: detecting previously undiscovered links between topics

The Indexer’s Legacy…
(a) an archaic addendum to an obsolete medium?
OR
(b) value-added knowledge in electronic text that enhances access to digital collections?

Thank you!
