Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.


Agenda
Questions
Finish up evaluation from last time
Computational complexity
Inverted indexes
Project planning

User Studies
Goal is to account for interface issues
–By studying the interface component
–By studying the complete system
Formative evaluation
–Provide a basis for system development
Summative evaluation
–Designed to assess performance

Quantitative User Studies
Select independent variable(s)
–e.g., what info to display in the selection interface
Select dependent variable(s)
–e.g., time to find a known relevant document
Run subjects in different orders
–Average out learning and fatigue effects
Compute statistical significance
–Null hypothesis: independent variable has no effect
–Rejected if p < 0.05

Variation in Automatic Measures
System
–What we seek to measure
Topic
–Sample the topic space, compute expected value
Topic+System
–Pair by topic and compute statistical significance
Collection
–Repeat the experiment using several collections

Additional Effects in User Studies
Learning
–Vary topic presentation order
Fatigue
–Vary system presentation order
Topic+User (Expertise)
–Ask about prior knowledge of each topic

Presentation Order

Document Selection Experiments (diagram: a Topic Description and a Standard Ranked List feed Interactive Selection, which is evaluated with F 0.8)

Measures of Effectiveness
Query Formulation: Uninterpolated Average Precision
–Expected value of precision [over relevant document positions]
–Interpreted based on query content at each iteration
Document Selection: Unbalanced F-Measure
–P = precision, R = recall
–α = 0.8 favors precision
–Models expensive human translation
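The unbalanced F-measure here is the standard weighted harmonic mean, F = 1 / (α/P + (1−α)/R); a minimal Python sketch (function name illustrative):

```python
def f_measure(precision, recall, alpha=0.8):
    """Unbalanced F: weighted harmonic mean of precision and recall.

    alpha close to 1 weights precision more heavily.
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# With alpha = 0.8, high precision is rewarded more than high recall
print(f_measure(0.9, 0.5))  # higher than f_measure(0.5, 0.9)
```

With α = 0.8 the same P/R values swapped give a lower score when recall is the larger one, which is the "favors precision" behavior the slide describes.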

End-to-End Experiments (diagram: a Topic Description drives Query Formulation, then Automatic Retrieval evaluated with Average Precision, then Interactive Selection evaluated with F 0.8)

End-to-End Experiment Results (chart of F α=0.8; English queries, German documents; 4 searchers, 20 minutes per topic)

Summary
Qualitative user studies suggest what to build
Design decomposes the task into components
Automated evaluation helps to refine components
Quantitative user studies show how well it works

Supporting the Search Process (diagram: on the user side, Source Selection, Query Formulation, Search, Selection from a Ranked List, Examination, and Document Delivery; on the system side, Acquisition builds a Collection, and Indexing builds the Index the IR System searches)

Some Questions for Today
How long will it take to find a document?
–Is there any work we can do in advance? If so, how long will that take?
How big a computer will I need?
–How much disk space? How much RAM?
What if more documents arrive?
–How much of the advance work must be repeated?
–Will searching become slower?
–How much more disk space will be needed?

A Cautionary Tale
Searching is easy - just ask Microsoft!
–“Find” can search my hard drive in a few minutes, if it only looks at the file names...
How long would it take...
–For a 100 GB disk?
–For the World Wide Web?
Computers are getting faster, but…
–How does Google give answers in 3 seconds?

Find “complex” in the dictionary (diagram: binary search over an alphabetized word list — arcade, astronomical, belligerent, complex, marsupial, mastiff, relatively, relaxation, resplendent — repeatedly halving the search range)

Computational Complexity
Time complexity: how long will it take?
Space complexity: how much memory is needed?
Things you need to know to assess complexity:
–What is the “size” of the input? (“n”)
–What aspects of the input are we paying attention to?
–How is the input represented?
–How is the output represented?
–What are the internal data structures?
–What is the algorithm?

Worst Case Complexity

10n: O(n)
100n: O(n)
100n + 25263: O(n)
n²: O(n²)
n² + 100n: O(n²)
Constant factors and lower-order terms don’t change the order.

“Asymptotic” Complexity
Constant, i.e. O(1)
–n doesn’t matter
Sublinear, e.g. O(log n)
–n = 65,536 → log n = 16
Linear, i.e. O(n)
–n = 65,536 → n = 65,536
Polynomial, e.g. O(n³)
–n = 65,536 → n³ = 281,474,976,710,656
Exponential, e.g. O(2^n)
–n = 65,536 → beyond astronomical
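The comparison above can be reproduced with a short script (names illustrative):

```python
import math

def growth(n):
    """Work done at input size n under each asymptotic class (illustrative)."""
    return {
        "O(1)": 1,
        "O(log n)": int(math.log2(n)),
        "O(n)": n,
        "O(n^3)": n ** 3,
        # O(2**n) at n = 65,536 is astronomically large, so it is left uncomputed
    }

print(growth(65536))
```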

The “Inverted File” Trick
Organize the bag-of-words matrix by terms
–You know the terms that you are looking for
Look up terms like you search dictionaries
–For each letter, jump directly to the right spot
–For terms of reasonable length, this is very fast
For each term, store the document identifiers
–For every document that contains that term
At query time, use the document identifiers
–Consult a “postings file”
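A minimal Python sketch of the trick, assuming simple whitespace tokenization (names and sample documents are illustrative):

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted postings lists support fast merging at query time
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "now is the time",
    3: "time for all good men",
    5: "the quick brown fox",
}
index = build_inverted_file(docs)
print(index["time"])  # [1, 3]
```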

An Example (slide: terms from eight short documents — quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party — reorganized from a term-by-document matrix into an inverted file keyed by term prefix, with each entry pointing to a postings list of document numbers such as 1, 3, 5, 7)

The Finished Product (slide: the same inverted file and postings lists, shown as the completed data structure with the term-by-document matrix no longer needed)

What Goes in a Postings File?
Boolean retrieval
–Just the document number
Ranked retrieval
–Document number and term weight (TF*IDF, ...)
Proximity operators
–Word offsets for each occurrence of the term
–Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
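A postings file that supports proximity operators could record word offsets per occurrence; a minimal sketch (names illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word offset of each occurrence]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term][doc_id].append(offset)
    return {term: dict(postings) for term, postings in index.items()}

idx = build_positional_index({3: "to be or not to be"})
print(idx["be"][3])  # [1, 5]
```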

How Big Is the Postings File?
Very compact for Boolean retrieval
–About 10% of the size of the documents, if an aggressive stopword list is used!
Not much larger for ranked retrieval
–Perhaps 20%
Enormous for proximity operators
–Sometimes larger than the documents!

Building an Inverted Index
Simplest solution is a single sorted array
–Fast lookup using binary search
–But sorting large files on disk is very slow
–And adding one document means starting over
Tree structures allow easy insertion
–But the worst-case lookup time is linear
Balanced trees provide the best of both
–Fast lookup and easy insertion
–But they require 45% more disk space
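The sorted-array option with binary-search lookup can be sketched with Python's bisect module (the term list is illustrative):

```python
import bisect

# Dictionary of index terms, kept in sorted order
terms = ["aid", "all", "back", "brown", "come", "dog", "fox", "good"]

def lookup(term):
    """Binary search: O(log n) comparisons to find a term's slot."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return i  # position of the term (and of its postings pointer)
    return None  # term not in the index

print(lookup("dog"))  # 5
```

Insertion is the weak point: keeping the array sorted means shifting entries, which is why the slide turns to balanced trees next.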

Starting a B+ Tree Inverted File (diagram: indexing “Now is the time for all good …” yields leaf entries for all, good, now, time under a root node)

Adding a New Term (diagram: indexing “Now is the time for all good men …” inserts a leaf entry for men into the B+ tree)

How Big is the Inverted Index?
Typically smaller than the postings file
–Depends on the number of terms, not documents
Eventually, most terms will already be indexed
–But the postings file will continue to grow
Postings dominate asymptotic space complexity
–Linear in the number of documents

Index Compression
CPUs are much faster than disks
–A disk can transfer 1,000 bytes in ~20 ms
–The CPU can do ~10 million instructions in that time
Compressing the postings file is a big win
–Trade decompression time for fewer disk reads
Key idea: reduce redundancy
–Trick 1: store relative offsets (some will be the same)
–Trick 2: use an optimal coding scheme

Compression Example
Postings (one byte each = 7 bytes = 56 bits)
–37, 42, 43, 48, 97, 98, 243
Differences
–37, 5, 1, 5, 49, 1, 145
Optimal Huffman code (code: value)
–0: 1, 10: 5, 110: 37, 1110: 49, 1111: 145
Compressed (17 bits)
–110 10 0 10 1110 0 1111
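The two tricks can be sketched in Python using the slide's numbers and Huffman code (a sketch; a real system would also pack the bits into bytes):

```python
def gaps(postings):
    """Trick 1: store each document id as a difference from the previous one."""
    return postings[:1] + [b - a for a, b in zip(postings, postings[1:])]

def ungaps(deltas):
    """Decode by accumulating the differences back into absolute ids."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

postings = [37, 42, 43, 48, 97, 98, 243]
# Trick 2: the slide's Huffman code, mapping each difference to a bit string
code = {1: "0", 5: "10", 37: "110", 49: "1110", 145: "1111"}
bits = "".join(code[d] for d in gaps(postings))
print(len(bits))  # 17 bits, down from 56
```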

Indexing and Searching
Indexing
–Walk the inverted file, splitting if needed
–Insert into the postings file in sorted order
–Hours or days for large collections
Query processing
–Walk the inverted file
–Read the postings file
–Manipulate postings based on query
–Seconds, even for enormous collections
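For a Boolean AND query, "manipulate postings based on query" amounts to intersecting sorted postings lists; a minimal linear-merge sketch:

```python
def intersect(p1, p2):
    """Linear merge of two sorted postings lists (Boolean AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])  # document contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 3, 5, 7], [2, 3, 5, 8]))  # [3, 5]
```

Because the lists are kept sorted, the merge makes one pass over each list, which is why query processing stays fast even for enormous collections.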

Summary
Slow indexing yields fast query processing
–Key fact: most terms don’t appear in most documents
We use extra disk space to save query time
–Index space is in addition to document space
–Time and space complexity must be balanced
Disk block reads are the critical resource
–This makes index compression a big win

Project Options
LBSC 796 MLS/MIM
–Option 1: TREC-like IR evaluation (team of 2)
–Option 2: Design and run a user study (team of 3)
LBSC 796 Ph.D.
–Research paper
LBSC 828o
–Program a new capability

One Minute Paper What was the muddiest point in today’s lecture?