Discussion 5 Sara Javanmardi.


1 Discussion 5 Sara Javanmardi

2 Assignment 3 Demo
Friday, Feb 4th:
* 8am-10am
* 11am-1pm
* 2pm-4pm
* In my office

3 Microsoft Spell Checker Contest
Speller Contest Dataset

4 Assignment 4 Indexing Enron Emails

5 How to
Download the compressed file
Unzip it

6 Part 1: General Questions

7 Part 2: Quantifying the Data
Listing the Files or Subdirectories in a Directory
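
A minimal sketch of listing a directory in Java, assuming the unzipped dataset is a tree of folders with one file per message. The class name `DirectoryLister` and its methods are hypothetical helpers, not part of the assignment code; only `java.nio.file` calls from the standard library are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for counting files and subdirectories.
final class DirectoryLister {
    /** Returns the immediate children (files and subdirectories) of dir, sorted by path. */
    static List<Path> listChildren(Path dir) throws IOException {
        // Files.list holds an open directory handle, so close the stream when done.
        try (Stream<Path> children = Files.list(dir)) {
            return children.sorted().collect(Collectors.toList());
        }
    }

    /** Counts regular files anywhere under root, descending into subdirectories. */
    static long countFiles(Path root) throws IOException {
        try (Stream<Path> all = Files.walk(root)) {
            return all.filter(Files::isRegularFile).count();
        }
    }
}
```

For example, `DirectoryLister.countFiles(Paths.get("some/unzipped/folder"))` would give the total number of files under that folder (the path is a placeholder).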

8 Part 3: Index the data

9 Posting List
To create the posting lists, you have 4 options:
1) [term : docID \t]+
2) [term : docID:termFrequency \t]+
3) [term : docID:position of the term in the document \t]+
4) [term : docID:frequency, position of the term in the document \t]+

10 Example
Abandon\tdoc1:4:3,301,400,700\tdoc3:2:102,105\n
Bail\tdoc2:1:21\tdoc3:2:100,1012\n
...
Postings on each line are sorted by doc ID.
Terms are sorted alphabetically.
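
A short sketch of writing one line in the option 4 layout (term, then docID:frequency:positions for each document). `PostingLineFormatter` is a hypothetical name, and it assumes the positions for each document have already been collected in increasing order.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.stream.Collectors;

// Hypothetical formatter for one line of the index file.
final class PostingLineFormatter {
    /**
     * Formats one term as: term \t docID:frequency:pos1,pos2,... \t ... \n
     * The SortedMap keeps postings ordered by docID. String docIDs sort
     * lexicographically, so "doc10" would come before "doc2".
     */
    static String format(String term, SortedMap<String, List<Integer>> positionsByDoc) {
        StringBuilder line = new StringBuilder(term);
        for (Map.Entry<String, List<Integer>> e : positionsByDoc.entrySet()) {
            List<Integer> positions = e.getValue();
            String joined = positions.stream()
                    .map(p -> Integer.toString(p))
                    .collect(Collectors.joining(","));
            // The frequency is simply the number of recorded positions.
            line.append('\t').append(e.getKey())
                .append(':').append(positions.size())
                .append(':').append(joined);
        }
        return line.append('\n').toString();
    }
}
```

Deriving the frequency from the position list keeps the two values from drifting apart.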

11 Index construction (Ch. 4)
How do we construct an index?
What strategies can we use with limited main memory?

12 The basic steps to construct your index:
1) Make a pass through the collection, assembling term-docID pairs.
2) To make index construction more efficient, represent terms as termIDs, where a termID is a unique serial number. We can do this in 2 ways:
   a. On the fly, while we are processing the collection.
   b. Compile the vocabulary in a first pass and construct the inverted index in a second pass.
3) Sort the pairs by term.
4) Finally, organize the docIDs for each term into a postings list.
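
The four steps can be sketched in a few lines of Java. This is an illustration under assumptions, not the course's InvertedIndex.java: `SortBasedIndexer` and its methods are hypothetical names, documents arrive already tokenized, termIDs are assigned on the fly (option 2a), and everything fits in memory.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory indexer following steps 1-4.
final class SortBasedIndexer {
    /** Maps each term to a serial termID, assigned on first sight (step 2a). */
    private final Map<String, Integer> termIds = new HashMap<>();
    /** Collected {termID, docID} pairs (step 1). */
    private final List<int[]> pairs = new ArrayList<>();

    void addDocument(int docId, List<String> terms) {
        for (String term : terms) {
            Integer id = termIds.get(term);
            if (id == null) {
                id = termIds.size();
                termIds.put(term, id);
            }
            pairs.add(new int[] {id, docId});
        }
    }

    /** Steps 3 and 4: sort by termID then docID, and group docIDs per termID. */
    Map<Integer, List<Integer>> buildPostings() {
        pairs.sort(Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]));
        Map<Integer, List<Integer>> postings = new TreeMap<>();
        for (int[] p : pairs) {
            List<Integer> docs = postings.computeIfAbsent(p[0], k -> new ArrayList<>());
            // Pairs are sorted, so a repeated docID is always adjacent; skip it.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != p[1]) {
                docs.add(p[1]);
            }
        }
        return postings;
    }

    /** Returns the termID for a term, or null if the term was never seen. */
    Integer termId(String term) {
        return termIds.get(term);
    }
}
```

Sorting integer pairs rather than strings is the efficiency gain that step 2 refers to. The postings come out ordered by termID, so an alphabetical listing would need a final pass over the vocabulary.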

13 Index construction (Sec. 4.2)
Documents are parsed to extract words, and these are saved with the document ID.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
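
A minimal parsing sketch for the two documents above. The tokenization rule (lowercase, split on anything that is not a letter or an apostrophe) is an assumption chosen for illustration, and `TermDocPairs` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical parser that turns a document into term-docID pairs.
final class TermDocPairs {
    /** Lowercases the text and splits it on runs of characters other than a-z and '. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Emits one "term<TAB>docID" pair per token, in document order. */
    static List<String> pairs(int docId, String text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            out.add(t + "\t" + docId);
        }
        return out;
    }
}
```

Under this rule Doc 1 yields 14 pairs, and "i'" survives as its own token because the apostrophe is kept.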

14 Key step (Sec. 4.2)
After all documents have been parsed, the inverted file is sorted by terms.
We focus on this sort step.
We have 100M items to sort.
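
With 100M items, the pairs may not fit in main memory at once. One common strategy, which is an assumption here since the slide does not name one, is to sort blocks that do fit and then merge the sorted runs. The sketch below keeps the runs in memory for brevity; a real version would stream them from disk. `RunMerger` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical block sort and two-way merge over {termID, docID} pairs.
final class RunMerger {
    /** Orders {termID, docID} pairs by termID, then by docID. */
    static int compare(int[] a, int[] b) {
        return a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]);
    }

    /** Sorts one block of pairs; each block is assumed to fit in memory. */
    static List<int[]> sortBlock(List<int[]> block) {
        List<int[]> run = new ArrayList<>(block);
        run.sort(RunMerger::compare);
        return run;
    }

    /** Merges two sorted runs into one sorted run in a single linear pass. */
    static List<int[]> merge(List<int[]> left, List<int[]> right) {
        List<int[]> merged = new ArrayList<>(left.size() + right.size());
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            // Take the smaller head; ties go to the left run.
            merged.add(compare(left.get(i), right.get(j)) <= 0 ? left.get(i++) : right.get(j++));
        }
        while (i < left.size()) {
            merged.add(left.get(i++));
        }
        while (j < right.size()) {
            merged.add(right.get(j++));
        }
        return merged;
    }
}
```

The merge only looks at the head of each run, which is why it can work from disk with a small buffer per run.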

15 Problems? How to update? See InvertedIndex.java

