
1 Incremental Indexing Dr. Susan Gauch

2 Indexing
 Current indexing algorithms are essentially batch processing
 They start from scratch every time
 What happens if we have already indexed a million documents and add 1 document to the collection?
 Do not want to index 1,000,001 documents from scratch
 Web search engines have spiders/crawlers/robots continually collecting new content
 Need a way to add a new document to existing inverted files

3 Adding a document
 This can cause two types of changes
    Add a new word
    Add an occurrence of an existing word

4 Adding a New Word
 This is the easiest type of change
 Fill in a new entry in the dict file for the word
 Append its postings to the end of the post file
 If the dict file is 1/3 full after the indexing phase
    We can add many words before the dict file's blank records are used up
    Over time, the probability of a collision increases, slowing down retrieval
 When the dict file is > 2/3 full, rehash on disk
    Essentially, create a new dict file twice as big
    Rehash all dict file records to their new locations
    Lots of I/O, but it can be done in the background or on a separate computer
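A minimal in-memory sketch of this slot-filling and rehash idea (the record fields, linear probing, and the vector standing in for the on-disk dict file are all illustrative assumptions):

    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical dict-file record: an empty token marks a blank slot.
    struct DictRecord {
        std::string token;
        int numdocs = 0;
        long start = -1;   // offset of the word's first posting in the post file
    };

    struct DictFile {
        std::vector<DictRecord> slots;
        size_t used = 0;

        explicit DictFile(size_t n) : slots(n) {}

        size_t probe(const std::string& token) const {
            size_t h = std::hash<std::string>{}(token) % slots.size();
            while (!slots[h].token.empty() && slots[h].token != token)
                h = (h + 1) % slots.size();            // linear probing
            return h;
        }

        // Adding a brand-new word: fill in a blank dict slot, rehashing into
        // a file twice as big once the dict passes 2/3 full.
        void addNewWord(const std::string& token, long postStart) {
            if (3 * (used + 1) > 2 * slots.size()) rehash();
            size_t h = probe(token);
            if (slots[h].token.empty()) ++used;
            slots[h] = {token, 1, postStart};
        }

        void rehash() {                                // lots of I/O on disk;
            DictFile bigger(slots.size() * 2);         // here just copied in memory
            for (const auto& r : slots)
                if (!r.token.empty()) {
                    bigger.slots[bigger.probe(r.token)] = r;
                    ++bigger.used;
                }
            *this = std::move(bigger);
        }
    };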

5 Adding a New Occurrence
 The change to the dict file is trivial
    Just increment numdocs
 The change to the post file is catastrophic
    Need to add a new posting record, but cannot insert a record in the middle of a file
    The idf for the word is now different (idf = log(N / numdocs), and numdocs just changed)
    All existing postings for that word have the wrong term weights

6 Adding Posting Records Option 1: Blank records
 Write blank records after each word's existing postings
 The number of blank records should be proportional to the number of existing postings
    E.g., if "dog" has 3 postings, write scale_factor * 3 blank records after the 3 real postings; if "many" has 100 postings, write scale_factor * 100 blank records after the 100 real ones
 Allows each word's postings to expand by a factor of scale_factor
 The first word to accumulate more than numdocs * scale_factor new postings causes the entire postings file to be rewritten with new blanks inserted
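A rough sketch of Option 1, assuming fixed-size binary posting records with docid == -1 marking a blank slot (the record layout and scale_factor value are illustrative):

    #include <cstdio>
    #include <vector>

    // Hypothetical fixed-size posting record; docid == -1 marks a blank slot.
    struct Posting { int docid = -1; float wt = 0.0f; };

    const int scale_factor = 3;

    // Write a word's postings followed by scale_factor * n blank records,
    // so the word's postings can grow by that factor before the file
    // has to be rebuilt.
    void writePostingsWithBlanks(std::FILE* post, const std::vector<Posting>& real) {
        std::fwrite(real.data(), sizeof(Posting), real.size(), post);
        std::vector<Posting> blanks(scale_factor * real.size());  // all docid == -1
        std::fwrite(blanks.data(), sizeof(Posting), blanks.size(), post);
    }

A later new occurrence then overwrites the first blank slot after the word's real postings; only when no blank is left does the whole postings file have to be rewritten.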

7 Adding Posting Records Option 2: Move Postings
 Copy the word's existing postings to the end of the post file
 Append the new posting there
 Update the dict record's "start" index to the new location
 Causes a lot of data movement
 The post file becomes fragmented
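A sketch of Option 2 under the same fixed-size-record assumption; the DictRecord fields are hypothetical:

    #include <cstdio>
    #include <vector>

    struct Posting { int docid; float wt; };
    struct DictRecord { long start; int numdocs; };   // hypothetical dict entry

    // Copy the word's existing postings to the end of the post file,
    // append the new posting there, and point the dict entry at the new block.
    void movePostings(std::FILE* post, DictRecord& d, const Posting& added) {
        std::vector<Posting> old(d.numdocs);
        std::fseek(post, d.start, SEEK_SET);
        std::fread(old.data(), sizeof(Posting), old.size(), post);

        std::fseek(post, 0, SEEK_END);
        d.start = std::ftell(post);                    // dict now points at the copy
        std::fwrite(old.data(), sizeof(Posting), old.size(), post);
        std::fwrite(&added, sizeof(Posting), 1, post);
        d.numdocs += 1;
        // The old block is simply abandoned, which is why the post file fragments.
    }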

8 Adding Posting Records Option 3: Overflow pointer
 Change the post record format to include an overflow pointer (record number / block address)
 Add new postings at the end of the post file or in a separate overflow file
 While processing post records:
    Loop over the numdocs records for the word
        If overflow is null
            Next = i++
        Else
            Next = overflow_location
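A sketch of the Option 3 read loop (the record layout is hypothetical): each posting record carries an overflow field, and the reader either advances to the next sequential record or jumps to the overflow location.

    #include <cstdio>
    #include <vector>

    // Hypothetical record: overflow == -1 means "the next record follows sequentially".
    struct Posting { int docid; float wt; long overflow; };

    std::vector<Posting> readPostings(std::FILE* post, long start, int numdocs) {
        std::vector<Posting> result;
        long recno = start;                            // record number, not byte offset
        for (int i = 0; i < numdocs; ++i) {
            Posting p;
            std::fseek(post, recno * (long)sizeof(Posting), SEEK_SET);
            std::fread(&p, sizeof(Posting), 1, post);
            result.push_back(p);
            recno = (p.overflow == -1) ? recno + 1     // Next = i++
                                       : p.overflow;   // Next = overflow_location
        }
        return result;
    }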

9 Adding Posting Records Option 4: Next pointers
 A variation of Option 3
 While processing post records:
    Seek to start
    Read >> docid >> wt >> next
    While next != -1
        Seek to next
        Read >> docid >> wt >> next
 Allows infinite expandability
 Can degenerate into the equivalent of a linked list on disk, with one seek per post record
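A sketch of the Option 4 chain traversal, mirroring the read loop on the slide; it assumes a hypothetical record of the form "docid wt next", where next is the offset of the word's next posting record or -1 at the end of the chain:

    #include <fstream>
    #include <vector>

    // Follow the chain of next pointers for one word, accumulating its weights.
    void accumulate(std::ifstream& post, long start, std::vector<float>& acc) {
        long next = start;
        while (next != -1) {
            post.seekg(next);                          // one seek per posting record
            int docid; float wt;
            post >> docid >> wt >> next;
            acc[docid] += wt;
        }
    }

This is where the linked-list degeneration shows up: every posting costs a seek, so a long chain of single-posting records is the worst case.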

10 Handling idf
 Updating numdocs changes idf, which in turn changes wt for every posting of the term
    Read all postings for the term, change wt, rewrite the postings
 If doing proper document length normalization,
    Every document containing this term now has a new length
    Must recalculate the normalization factor and rewrite the postings for all terms in each affected document
    Infeasible: we don't have a way to find all postings for a document without reading the whole post file or adding a new file that maps docid -> postings (doubling the size of the inverted index)

11 Better idea
 Calculate term weights on the fly
 Store rtf in the posting record
    Prenormalized by document length
 The old loop
    Loop over postings
        Acc[docid] += wt
 becomes
    Calc idf from the current value of numdocs
    Loop over postings
        Acc[docid] += rtf * idf
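A sketch of this on-the-fly weighting at query time, assuming the stored rtf is already divided by the document length:

    #include <cmath>
    #include <vector>

    struct Posting { int docid; float rtf; };          // rtf prenormalized by doc length

    // Weight postings at query time: idf reflects the current numdocs,
    // so nothing stored in the post file goes stale when documents are added.
    void scorePostings(const std::vector<Posting>& postings,
                       long N,                          // current collection size
                       std::vector<float>& acc) {       // accumulator indexed by docid
        float idf = std::log((float)N / postings.size());   // numdocs == postings.size()
        for (const Posting& p : postings)
            acc[p.docid] += p.rtf * idf;
    }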

12 Scalability (or how Google does it)
 Create overflow areas that hold more than one posting
 Make them variable sizes
 Store a few postings directly in the dict file
 The dict record becomes
    Token, numdocs, idf, P postings, Next
 Pick P so that a dict record is 0.5 or 1 block in size (e.g., P = 100)
 Create Small, Medium, and Large overflow files
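A sketch of what such a block-sized dict record might look like; the field layout and P = 100 come from the slide, everything else is an assumption:

    struct Posting { int docid; float rtf; };

    const int P = 100;                 // chosen so one dict record fills roughly 1 block

    // Dict record holding the first P postings inline; Next points into the
    // Small overflow file (or -1 if the word has no more than P postings).
    struct DictRecord {
        char    token[32];
        int     numdocs;
        float   idf;
        Posting postings[P];
        long    next;
    };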

13 Variable Overflows
 If have > P postings
    Allocate a record in the "Small" overflow file
    Record format: S postings, Next
    Pick S so that the record fits in 1 block
    Or pick S so that 50% of all tokens can be processed without going to the Medium overflow file
 If have > P + S postings
    Allocate a record in the "Medium" overflow file
    Record format: M postings, Next
    Or pick M so that 90% of all tokens can be processed without going to the Large overflow file

14 Variable Overflows
 If have > P + S + M postings
    Allocate a record in the "Large" overflow file
    Record format: L postings, Next
    Pick L so that 99% of all tokens can be processed without going to a second overflow record
 If have > P + S + M + L postings
    Allocate another record at the end of the Large file
    The Next pointer just points to the next Large record
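A sketch of the resulting tiered allocation decision for a word with n postings; the S, M, and L values are made-up examples, and the percentages on the slides would really drive their choice:

    #include <cstdio>

    const int P = 100, S = 400, M = 2000, L = 20000;   // illustrative sizes only

    // Where does a word with n postings live?
    void placePostings(int n) {
        if (n <= P)                  std::puts("dict record only");
        else if (n <= P + S)         std::puts("dict + one Small overflow record");
        else if (n <= P + S + M)     std::puts("dict + Small + one Medium record");
        else if (n <= P + S + M + L) std::puts("dict + Small + Medium + one Large record");
        else                         std::puts("chain additional Large records via Next");
    }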

