Indexing
LBSC 796/CMSC 828o
Session 9, March 29, 2004
Doug Oard

Agenda
  Questions
  Finish up evaluation from last time
  Computational complexity
  Inverted indexes
  Project planning

User Studies
  Goal is to account for interface issues
    –By studying the interface component
    –By studying the complete system
  Formative evaluation
    –Provide a basis for system development
  Summative evaluation
    –Designed to assess performance

Quantitative User Studies
  Select independent variable(s)
    –e.g., what info to display in selection interface
  Select dependent variable(s)
    –e.g., time to find a known relevant document
  Run subjects in different orders
    –Average out learning and fatigue effects
  Compute statistical significance
    –Null hypothesis: independent variable has no effect
    –Rejected if p<0.05

Variation in Automatic Measures
  System
    –What we seek to measure
  Topic
    –Sample topic space, compute expected value
  Topic+System
    –Pair by topic and compute statistical significance
  Collection
    –Repeat the experiment using several collections

Additional Effects in User Studies
  Learning
    –Vary topic presentation order
  Fatigue
    –Vary system presentation order
  Topic+User (Expertise)
    –Ask about prior knowledge of each topic

Presentation Order

Document Selection Experiments
  [Diagram labels: Topic Description; Standard Ranked List; Interactive Selection; F (α = 0.8)]

Measures of Effectiveness
  Query Formulation: Uninterpolated Average Precision
    –Expected value of precision [over relevant document positions]
    –Interpreted based on query content at each iteration
  Document Selection: Unbalanced F-Measure
    –P = precision
    –R = recall
    –α = 0.8 favors precision
      Models expensive human translation
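
The formula for the unbalanced F-measure did not survive on this slide. As a minimal sketch, assuming the usual van Rijsbergen weighted harmonic mean F = 1 / (α/P + (1 − α)/R), the effect of α = 0.8 can be checked numerically. The function name `f_measure` and the sample values are illustrative, not taken from the lecture.

```python
def f_measure(precision: float, recall: float, alpha: float = 0.8) -> float:
    """Weighted harmonic mean of precision and recall.

    ASSUMPTION: F = 1 / (alpha / P + (1 - alpha) / R), where alpha is the
    weight placed on precision. alpha = 0.5 gives the balanced F1 measure.
    """
    if precision <= 0.0 or recall <= 0.0:
        return 0.0  # the harmonic mean is taken to be 0 if either part is 0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Illustrative values only: a precise but incomplete selection scores
# higher than a complete but imprecise one when alpha = 0.8.
high_precision = f_measure(precision=1.0, recall=0.5)
high_recall = f_measure(precision=0.5, recall=1.0)
```

With α = 0.8, selecting a non-relevant document costs more than missing a relevant one, which fits the slide's remark that every selected document would go on to expensive human translation.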

End-to-End Experiments
  [Diagram labels: Topic Description; Query Formulation; Automatic Retrieval; Interactive Selection; Average Precision; F (α = 0.8)]

End-to-End Experiment Results
  [Chart: F, α = 0.8]
  English queries, German documents
  4 searchers, 20 minutes per topic

Summary
  Qualitative user studies suggest what to build
  Design decomposes task into components
  Automated evaluation helps to refine components
  Quantitative user studies show how well it works

Supporting the Search Process
  [Diagram labels: Source Selection; Search; Query; Selection; Ranked List; Examination; Document; Delivery; Document; Query Formulation; IR System; Indexing; Index; Acquisition; Collection]

Some Questions for Today
  How long will it take to find a document?
    –Is there any work we can do in advance?
      If so, how long will that take?
  How big a computer will I need?
    –How much disk space? How much RAM?
  What if more documents arrive?
    –How much of the advance work must be repeated?
    –Will searching become slower?
    –How much more disk space will be needed?

A Cautionary Tale
  Searching is easy - just ask Microsoft!
    –“Find” can search my hard drive in a few minutes
      If it only looks at the file names...
  How long would it take for the Web?
    –A 100 GB disk?
    –For the World Wide Web?
  Computers are getting faster, but…
    –How does Google give answers in 3 seconds?

Find “complex” in the dictionary
  [Figure: dictionary words marsupial, belligerent, complex, arcade, astronomical, mastiff, relatively, relaxation, resplendent]

Computational Complexity
  Time complexity: how long will it take?
  Space complexity: how much memory is needed?
  Things you need to know to assess complexity:
    –What is the “size” of the input? (“n”)
      What aspects of the input are we paying attention to?
    –How is the input represented?
    –How is the output represented?
    –What are the internal data structures?
    –What is the algorithm?

Worst Case Complexity

10n: O(n)
100n: O(n)
100n+25263: O(n)
n^2: O(n^2)
n^2+45662: O(n^2)

“Asymptotic” Complexity
  Constant, i.e. O(1)
    n doesn’t matter
  Sublinear, e.g. O(log n)
    n = 65536 → log n = 16
  Linear, i.e. O(n)
    n = 65536 → n = 65536
  Polynomial, e.g. O(n^3)
    n = 65536 → n^3 = 281,474,976,710,656
  Exponential, e.g. O(2^n)
    n = 65536 → beyond astronomical
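
The figures on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `growth` is made up here, the logarithm is taken base 2 (which is what gives 16 for n = 65536), and for the exponential case the code reports just the number of decimal digits in 2^n, since the value itself is too large to be worth printing.

```python
import math


def growth(n: int) -> dict:
    """Return the value of several growth functions for an input size n."""
    return {
        "constant": 1,
        "log2 n": math.log2(n),
        "n": n,
        "n^3": n ** 3,
        # Number of decimal digits in 2^n, computed without building 2^n.
        "digits in 2^n": math.floor(n * math.log10(2)) + 1,
    }


table = growth(65536)
for name, value in table.items():
    print(f"{name:>14}: {value}")
```

For n = 65536, 2^n is a number with close to twenty thousand decimal digits, which is one way to read "beyond astronomical".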

The “Inverted File” Trick
  Organize the bag of words matrix by terms
    –You know the terms that you are looking for
  Look up terms like you search dictionaries
    –For each letter, jump directly to the right spot
      For terms of reasonable length, this is very fast
    –For each term, store the document identifiers
      For every document that contains that term
  At query time, use the document identifiers
    –Consult a “postings file”

An Example

  Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8   Postings
  quick     0      0      0      1      0      0      0      1     4, 8
  brown     0      1      0      1      0      1      0      0     2, 4, 6
  fox       1      0      1      0      0      0      1      0     1, 3, 7
  over      1      0      1      0      1      0      1      0     1, 3, 5, 7
  lazy      0      1      0      1      0      1      0      1     2, 4, 6, 8
  dog       0      0      1      0      1      0      0      0     3, 5
  back      0      0      1      0      1      0      1      0     3, 5, 7
  now       0      1      0      1      0      1      0      1     2, 4, 6, 8
  time      0      0      1      0      0      0      0      0     3
  all       1      0      1      0      1      0      1      0     1, 3, 5, 7
  good      0      1      0      1      0      0      0      1     2, 4, 8
  men       0      1      0      0      0      1      0      1     2, 6, 8
  come      1      0      1      0      1      0      1      1     1, 3, 5, 7, 8
  jump      0      0      0      0      0      1      0      1     6, 8
  aid       1      0      1      0      0      0      0      0     1, 3
  their     1      0      0      0      1      0      1      0     1, 5, 7
  party     0      1      0      1      0      1      0      0     2, 4, 6

  Inverted File [figure labels]: A B C F D G J L M N O P Q T AI AL BA BR TH TI

The Finished Product

  Term    Postings
  quick   4, 8
  brown   2, 4, 6
  fox     1, 3, 7
  over    1, 3, 5, 7
  lazy    2, 4, 6, 8
  dog     3, 5
  back    3, 5, 7
  now     2, 4, 6, 8
  time    3
  all     1, 3, 5, 7
  good    2, 4, 8
  men     2, 6, 8
  come    1, 3, 5, 7, 8
  jump    6, 8
  aid     1, 3
  their   1, 5, 7
  party   2, 4, 6

  Inverted File [figure labels]: A B C F D G J L M N O P Q T AI AL BA BR TH TI
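
The step from the term-document matrix to the postings lists can be sketched in a few lines. The per-document term sets below are read off the 0/1 matrix in the example; the names `DOCS`, `build_index` and `INDEX` are illustrative. A plain dictionary stands in for the inverted file here, so this does not reproduce the letter-by-letter lookup structure shown in the figure.

```python
from collections import defaultdict

# Terms present in each of the eight example documents (from the matrix).
DOCS = {
    1: ["fox", "over", "all", "come", "aid", "their"],
    2: ["brown", "lazy", "now", "good", "men", "party"],
    3: ["fox", "over", "dog", "back", "time", "all", "come", "aid"],
    4: ["quick", "brown", "lazy", "now", "good", "party"],
    5: ["over", "dog", "back", "all", "come", "their"],
    6: ["brown", "lazy", "now", "men", "jump", "party"],
    7: ["fox", "over", "back", "all", "come", "their"],
    8: ["quick", "lazy", "now", "good", "men", "come", "jump"],
}


def build_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):          # visit documents in id order
        for term in set(docs[doc_id]):   # count each term once per document
            postings[term].append(doc_id)
    return dict(postings)


INDEX = build_index(DOCS)
print(INDEX["quick"])  # [4, 8]
print(INDEX["come"])   # [1, 3, 5, 7, 8]
```

Because documents are visited in increasing id order, every postings list comes out sorted without a separate sort step.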

What Goes in a Postings File?
  Boolean retrieval
    –Just the document number
  Ranked Retrieval
    –Document number and term weight (TF*IDF,...)
  Proximity operators
    –Word offsets for each occurrence of the term
      Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
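
To make the proximity case concrete, here is a minimal sketch of postings that carry word offsets, plus a simple "within k words" check. The two sample texts, the whitespace tokenization, and the names `build_positional_index` and `within` are all assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [word offsets]} using whitespace tokenization."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(offset)
    return index


def within(index, term_a, term_b, k):
    """Return ids of documents where the two terms occur at most k words apart."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    hits = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        if any(abs(i - j) <= k for i in docs_a[doc_id] for j in docs_b[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical sample documents, not from the lecture.
SAMPLE = {
    1: "the quick brown fox",
    2: "the fox was not quick or brown",
}
POSITIONAL = build_positional_index(SAMPLE)
print(within(POSITIONAL, "quick", "fox", 2))  # [1]
```

Storing one offset per occurrence rather than one entry per document is what makes this kind of postings file so much larger than the Boolean one.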

How Big Is the Postings File?
  Very compact for Boolean retrieval
    –About 10% of the size of the documents
      If an aggressive stopword list is used!
  Not much larger for ranked retrieval
    –Perhaps 20%
  Enormous for proximity operators
    –Sometimes larger than the documents!

Building an Inverted Index
  Simplest solution is a single sorted array
    –Fast lookup using binary search
    –But sorting large files on disk is very slow
    –And adding one document means starting over
  Tree structures allow easy insertion
    –But the worst case lookup time is linear
  Balanced trees provide the best of both
    –Fast lookup and easy insertion
    –But they require 45% more disk space
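
The sorted-array option can be sketched in memory with Python's standard `bisect` module. The class name `SortedTermArray` is illustrative, and the sketch ignores the on-disk sorting cost the slide mentions; it only shows why lookup is cheap while adding a new term forces later entries to shift.

```python
import bisect


class SortedTermArray:
    """Inverted index kept as two parallel lists sorted by term."""

    def __init__(self):
        self.terms = []     # sorted list of distinct terms
        self.postings = []  # postings[i] is the sorted doc-id list for terms[i]

    def add(self, term, doc_id):
        i = bisect.bisect_left(self.terms, term)  # binary search: O(log n)
        if i == len(self.terms) or self.terms[i] != term:
            # A new term: list.insert shifts every later entry, O(n).
            self.terms.insert(i, term)
            self.postings.insert(i, [])
        if doc_id not in self.postings[i]:
            bisect.insort(self.postings[i], doc_id)

    def lookup(self, term):
        i = bisect.bisect_left(self.terms, term)  # binary search: O(log n)
        if i < len(self.terms) and self.terms[i] == term:
            return self.postings[i]
        return []


index = SortedTermArray()
for doc_id, text in [(1, "now is the time"), (2, "time for all good men")]:
    for token in text.split():
        index.add(token, doc_id)
print(index.lookup("time"))  # [1, 2]
```

A balanced tree such as a B+ tree keeps the logarithmic lookup while avoiding the linear shift on insertion, at the price of extra space for partially filled nodes.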

Starting a B+ Tree Inverted File
  [Figure: tree nodes holding the terms now, time, good, all and the keys aaaaa, now; input text: “Now is the time for all good …”]

Adding a New Term
  [Figure: tree nodes holding the terms now, time, good, all and the keys aaaaa, now; input text: “Now is the time for all good men …”; added keys aaaaa, men]

How Big is the Inverted Index?
  Typically smaller than the postings file
    –Depends on number of terms, not documents
  Eventually, most terms will already be indexed
    –But the postings file will continue to grow
  Postings dominate asymptotic space complexity
    –Linear in the number of documents

Index Compression
  CPUs are much faster than disks
    –A disk can transfer 1,000 bytes in ~20 ms
    –The CPU can do ~10 million instructions in that time
  Compressing the postings file is a big win
    –Trade decompression time for fewer disk reads
  Key idea: reduce redundancy
    –Trick 1: store relative offsets (some will be the same)
    –Trick 2: use an optimal coding scheme

Compression Example
  Postings (one byte each = 7 bytes = 56 bits)
    –37, 42, 43, 48, 97, 98, 243
  Difference
    –37, 5, 1, 5, 49, 1, 145
  Optimal Huffman Code
    –0:1, 10:5, 110:37, 1110:49, 1111:145
  Compressed (17 bits)
    –11010010111001111
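
The arithmetic on this slide can be replayed directly. The sketch below takes the slide's code table as given rather than deriving it, stores the bit string as text for readability, and assumes a non-empty postings list. The names `CODE`, `gaps`, `encode` and `decode` are illustrative.

```python
# Code table copied from the slide: gap value -> bit pattern.
CODE = {1: "0", 5: "10", 37: "110", 49: "1110", 145: "1111"}


def gaps(postings):
    """First document id, then the difference between successive ids."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def encode(postings, code=CODE):
    """Concatenate the code word for each gap."""
    return "".join(code[g] for g in gaps(postings))


def decode(bits, code=CODE):
    """Read code words left to right and add the gaps back up."""
    inverse = {pattern: gap for gap, pattern in code.items()}
    postings, total, buffer = [], 0, ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:  # no code word is a prefix of another
            total += inverse[buffer]
            postings.append(total)
            buffer = ""
    return postings


POSTINGS = [37, 42, 43, 48, 97, 98, 243]
BITS = encode(POSTINGS)
print(gaps(POSTINGS))  # [37, 5, 1, 5, 49, 1, 145]
print(BITS, len(BITS))  # 11010010111001111 17
```

Decoding works bit by bit because the table is prefix-free, so the reader never needs a separator between code words. The 17-bit figure counts only the encoded gaps, not the space needed to store the code table itself.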

Indexing and Searching
  Indexing
    –Walk the inverted file, splitting if needed
    –Insert into the postings file in sorted order
    –Hours or days for large collections
  Query processing
    –Walk the inverted file
    –Read the postings file
    –Manipulate postings based on query
    –Seconds, even for enormous collections
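
"Manipulate postings based on query" can be illustrated for a Boolean AND: because postings are kept in sorted order, two lists can be intersected in a single pass. This is a sketch of that standard merge, using postings lists from the eight-document example earlier in the lecture; the name `intersect` is illustrative.

```python
def intersect(a, b):
    """Return the ids present in both sorted postings lists (Boolean AND)."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return result


# Postings lists copied from the example slides.
QUICK = [4, 8]
LAZY = [2, 4, 6, 8]
FOX = [1, 3, 7]
DOG = [3, 5]

print(intersect(QUICK, LAZY))  # [4, 8]
print(intersect(FOX, DOG))     # [3]
```

The loop touches each entry at most once, so the cost is linear in the combined length of the two lists rather than in the size of the collection.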

Summary
  Slow indexing yields fast query processing
    –Key fact: most terms don’t appear in most documents
  We use extra disk space to save query time
    –Index space is in addition to document space
    –Time and space complexity must be balanced
  Disk block reads are the critical resource
    –This makes index compression a big win

Project Options
  LBSC 796 MLS/MIM
    –Option 1: TREC-like IR evaluation (team of 2)
    –Option 2: Design and run a user study (team of 3)
  LBSC 796 Ph.D.
    –Research paper
  CMSC 828o
    –Program a new capability

One Minute Paper
  What was the muddiest point in today’s lecture?
