IR Homework #1 By J. H. Wang Mar. 25, 2009. Programming Exercise #1: Indexing Goal: to build an index for a text collection using inverted files Input:


1 IR Homework #1 By J. H. Wang Mar. 25, 2009

2 Programming Exercise #1: Indexing
Goal: to build an index for a text collection using inverted files
Input: a set of text documents
–(to be described later)
Output: inverted index files
–(exact format to be described later)

3 Input: the Test Collection
Reuters-RCV1: http://trec.nist.gov/data/reuters/reuters.html
–About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
–Requires signing agreements
Reuters-21578: http://www.daviddlewis.com/resources/testcollections/reuters21578/
–21,578 news stories from 1987 (28.0MB uncompressed)
Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
–LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
–Ex: The Time Collection: 423 documents (1.5MB)
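As a rough illustration of reading the minimum-requirement collection, the sketch below pulls document IDs and body text out of Reuters-21578 SGML. It assumes each story is wrapped in a REUTERS element carrying a NEWID attribute, with the text inside a BODY element; the function name iter_documents and the sample string are made up for this example, and real files may need extra handling (character encoding, entities, stories with no body).

```python
import re

# Assumption: stories look like <REUTERS ... NEWID="n"> ... <BODY>text</BODY> ... </REUTERS>
STORY_RE = re.compile(r'<REUTERS[^>]*?NEWID="(\d+)"[^>]*>(.*?)</REUTERS>', re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def iter_documents(sgml_text):
    """Yield (doc_id, body_text) pairs; stories without a BODY are skipped."""
    for match in STORY_RE.finditer(sgml_text):
        body = BODY_RE.search(match.group(2))
        if body:
            yield int(match.group(1)), body.group(1).strip()


# Tiny hand-written sample in the assumed layout (not real collection data).
sample = (
    '<REUTERS TOPICS="YES" NEWID="1">\n<TEXT><TITLE>COCOA</TITLE>\n'
    "<BODY>Showers continued in Bahia.</BODY></TEXT>\n</REUTERS>\n"
    '<REUTERS TOPICS="NO" NEWID="2">\n<TEXT><TITLE>NO BODY</TITLE></TEXT>\n</REUTERS>\n'
)
docs = list(iter_documents(sample))
print(docs)
```

Other collections (for example the Time Collection) use different layouts, so a parser like this would have to be written per collection.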

4 Output: Inverted Index
Using the standard inverted index (Chap. 1 & 2)
Output format:
–Dictionary file: a sorted list of vocabulary terms (one per line)
–Postings list: for each term, a list of its occurrences in the original text
term_i, df_i: <doc1, tf_i1: <pos1, pos2, …>; doc2, tf_i2: <pos1, pos2, …>; …> (as in Fig. 2.11, Sec. 2.4)
–df_i: document frequency of term i
–tf_ij: term frequency of term i in doc j
to, 993427: <1, 6: <7, 18, 33, 72, 86, 231>; 2, 5: <1, 17, 74, 222, 255>; …>
…
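One possible way to produce the dictionary and postings output is sketched below. It assumes documents are already tokenized, that token positions start at 1, and that each postings line is written with nested angle brackets in the style of Fig. 2.11; the names build_index and format_postings are invented for this example, not part of the assignment.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> {doc_id: [positions]}; docs is {doc_id: list of tokens}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        # Assumption: positions are counted from 1 within each document body.
        for pos, term in enumerate(docs[doc_id], start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def format_postings(term, postings):
    """Render one line as: term, df: <doc, tf: <pos, ...>; doc, tf: <pos, ...>>"""
    entries = []
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_text = ", ".join(str(p) for p in positions)
        entries.append(f"{doc_id}, {len(positions)}: <{pos_text}>")
    return f"{term}, {len(postings)}: <{'; '.join(entries)}>"


docs = {1: ["to", "be", "or", "not", "to", "be"], 2: ["to", "do"]}
index = build_index(docs)
dictionary = sorted(index)  # dictionary file: one term per line
postings_lines = [format_postings(t, index[t]) for t in dictionary]
print("\n".join(dictionary))
print("\n".join(postings_lines))
```

Here df is the number of documents in a term's postings and tf is the length of that document's position list, so neither needs to be stored separately while indexing.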

5 Implementation Issues
Note: pos means the position of a token in the body of a document
–Storing positions makes later steps after indexing, such as proximity search, easier to implement
Document preprocessing should be handled with care
–Different formats for different collections
–Digits, hyphens, punctuation marks, …
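To make both points concrete, here is a small sketch, under assumptions of my own, of a tokenizer and a proximity check over stored positions. The token rule (runs of letters and digits, optionally joined by hyphens) is just one possible choice, and tokenize and within_k are hypothetical helper names.

```python
import re

# Assumption: a token is a run of ASCII letters/digits, optionally hyphen-joined.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into tokens; punctuation other than inner hyphens is dropped."""
    return TOKEN_RE.findall(text)


def within_k(positions_a, positions_b, k):
    """Return True if some position in a lies within k tokens of one in b.

    Both lists must be sorted in ascending order, as they are when positions
    are appended while scanning a document from start to end.
    """
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance whichever list currently points at the smaller position.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False


print(tokenize("U.S. output rose 3.5 pct in mid-1987."))
print(within_k([7, 18, 33], [1, 17, 74], 1))
```

Note how this particular rule splits "U.S." and "3.5" into separate tokens while keeping "mid-1987" whole; whether that is desirable is exactly the kind of preprocessing decision the slide warns about.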

6 Implementation Issues
You can use a separate data structure (e.g. a trie, which is more efficient) to store the vocabulary and occurrences in your program and speed up the indexing process, but the output must be in the designated format
Optional functionality
–Case folding
–Stopword removal
–Stemming
–Each should be switchable on or off by a parameter
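The optional steps could be wired to command-line switches roughly as follows. This is only a sketch: the flag names, the tiny stopword set, and naive_stem (a crude suffix stripper standing in for a real stemmer such as Porter's) are all assumptions made for illustration.

```python
import argparse

# Illustrative stopword list only; a real run would load a fuller list.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def naive_stem(token):
    """Placeholder suffix stripper; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tokens, case_folding=False, remove_stopwords=False, stemming=False):
    """Apply only the steps that are switched on; all are off by default."""
    out = []
    for token in tokens:
        if case_folding:
            token = token.lower()
        if remove_stopwords and token.lower() in STOPWORDS:
            continue
        if stemming:
            token = naive_stem(token)
        out.append(token)
    return out


def parse_args(argv=None):
    """Hypothetical flags; each optional feature is enabled by its own switch."""
    parser = argparse.ArgumentParser(description="Inverted index builder (sketch)")
    parser.add_argument("--case-folding", action="store_true")
    parser.add_argument("--remove-stopwords", action="store_true")
    parser.add_argument("--stemming", action="store_true")
    return parser.parse_args(argv)


args = parse_args(["--case-folding", "--stemming"])
tokens = ["The", "Markets", "traded", "in", "Tokyo"]
result = preprocess(tokens, args.case_folding, args.remove_stopwords, args.stemming)
print(result)
```

One design point to watch: if stopwords are dropped before positions are assigned, the recorded positions no longer match the original text, so it may be safer to number tokens first and filter afterwards.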

7 Submission
Your submission *should* include
–The source code (and optionally your executable file)
–A one-page description that includes the following
  Major features in your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
  Major difficulties encountered
  Special requirements for execution environments (ex: Java Runtime Environment, special compilers, …)
  For team work, the name and responsibilities of each member should be clearly identified
Due: extended to three weeks (Apr. 1, 2009)

8 Submission Instructions
Programs or homework in electronic files must be submitted directly to the TA as follows
–Team members list: please e-mail your team members list to the TA (t6598006@ntut.edu.tw) even if you’re the only team member
–Preparing submission file: one single compressed file named, for example, IR0901-HW1.ZIP
  Remember to specify the names and student IDs of your team members in the files and documentation
–E-mail or online submission: TBD
–If you cannot successfully submit your work, please contact the TA or the instructor

9 Evaluation
Minimum requirement: the Reuters-21578 Test Collection as the input; the inverted index generated by your program will be checked
Optional features such as case folding, stemming and stopword removal will be counted as bonus
You may be asked to give a demo if the TA is unable to compile or run your submitted program

10 Questions?

