IR Homework #1 By J. H. Wang Mar. 5, 2008. Programming Exercise #1: Indexing Goal: to build an index for a text collection using inverted files Input:


1 IR Homework #1 By J. H. Wang Mar. 5, 2008

2 Programming Exercise #1: Indexing
Goal: to build an index for a text collection using inverted files
Input: a set of documents concatenated into a single large file
–(to be described later)
Output: inverted index files
–(exact format to be described later)

3 Input: the Test Collection
Test collections held at University of Glasgow: http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/
–LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI, each in different formats
Ex: The Time Collection: 423 documents (1.5MB)
–You have to do some preprocessing for different test collections
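The preprocessing step for a concatenated collection file can be sketched as follows. This is only an illustration: the `*TEXT` header marker, the `split_documents` name, and the sample text are assumptions, so check the actual delimiter used by the collection you download and adjust the pattern.

```python
import re
from typing import Iterator, Tuple

# Assumption: every document starts with a header line beginning with "*TEXT".
# Replace this pattern with whatever delimiter your collection really uses.
DOC_HEADER = re.compile(r"^\*TEXT\b.*$", re.MULTILINE)


def split_documents(
    text: str, header: "re.Pattern[str]" = DOC_HEADER
) -> Iterator[Tuple[int, int, str]]:
    """Yield (doc_number, body_offset, body) for each document in the file.

    body_offset is the position of the body's first character in the whole
    file, so later steps can report file-level character positions.
    """
    headers = list(header.finditer(text))
    for doc_no, match in enumerate(headers, start=1):
        body_start = match.end()
        # The body runs up to the next header, or to the end of the file.
        body_end = headers[doc_no].start() if doc_no < len(headers) else len(text)
        yield doc_no, body_start, text[body_start:body_end]


# Hypothetical two-document file, for illustration only.
sample = "*TEXT 001\nfirst doc\n*TEXT 002\nsecond doc\n"
docs = list(split_documents(sample))
```

Keeping the body offset next to each document avoids re-scanning the file when occurrence positions are computed later.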

4 Output: Inverted Index
Two files
–Vocabulary file: a sorted list of words (each word on a separate line)
–Occurrences file: for each word, a list of occurrences in the original text
[word#] [term freq.] [(doc#, char#) pairs]
1 7 (1, 12) (1, 28) (3, 31) (8, 39) (8, 65) (10, 16) (11, 91)
2 2 (3, 44) (8, 72)
…
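A minimal sketch of writing the two output files, under two assumptions that the slide does not state outright: word# is the 1-based line number of the word in the vocabulary file, and term freq. is the total number of occurrences (which matches the counts in the example above). The function names and sample tokens are illustrative.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, List, TextIO, Tuple

Occurrence = Tuple[int, int]  # (doc#, char#)


def build_index(tokens: Iterable[Tuple[str, int, int]]) -> Dict[str, List[Occurrence]]:
    """Group (word, doc#, char#) triples into word -> list of occurrences."""
    index: Dict[str, List[Occurrence]] = defaultdict(list)
    for word, doc_no, char_no in tokens:
        index[word].append((doc_no, char_no))
    return index


def write_index(
    index: Dict[str, List[Occurrence]], vocab_out: TextIO, occ_out: TextIO
) -> None:
    """Write the vocabulary file and the occurrences file.

    Assumption: word# is the 1-based rank of the word in the sorted vocabulary.
    """
    for word_no, word in enumerate(sorted(index), start=1):
        occurrences = sorted(index[word])
        vocab_out.write(word + "\n")
        pairs = " ".join(f"({d}, {c})" for d, c in occurrences)
        occ_out.write(f"{word_no} {len(occurrences)} {pairs}\n")


# Hypothetical tokens; in-memory files stand in for the real output files.
tokens = [("index", 1, 12), ("file", 1, 18), ("index", 3, 31)]
vocab, occ = io.StringIO(), io.StringIO()
write_index(build_index(tokens), vocab, occ)
```

In a real run the two `StringIO` objects would be replaced by files opened for writing.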

5 Implementation Issues
Note: char# means the character position in the FILE (not the document)
–This makes the later steps after indexing easier to implement
Document preprocessing should be handled with care
–Digits, hyphens, punctuation marks, …
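One possible way to produce file-level character positions is to add each document's starting offset to the position of a token inside that document. The sketch below makes several assumptions that the assignment leaves open: positions are 0-based, a token is a run of ASCII letters, words are lowercased, and digits, hyphens, and punctuation act as separators.

```python
import re
from typing import Iterator, Tuple

# Assumption: a token is a run of ASCII letters; everything else separates tokens.
TOKEN = re.compile(r"[A-Za-z]+")


def tokenize(doc_no: int, body: str, body_offset: int) -> Iterator[Tuple[str, int, int]]:
    """Yield (word, doc#, char#), where char# is a 0-based position in the FILE."""
    for match in TOKEN.finditer(body):
        yield match.group().lower(), doc_no, body_offset + match.start()


# Hypothetical file: the document body is assumed to start at offset 9.
file_text = "Doc one. Re-index in 2008!"
tokens = list(tokenize(1, file_text[9:], 9))
# tokens == [("re", 1, 9), ("index", 1, 12), ("in", 1, 18)]
```

Note how the hyphen splits "Re-index" into two tokens and the digits in "2008" are dropped; a different policy (for example, keeping hyphenated words whole) only requires changing the `TOKEN` pattern.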

6 Implementation Issues
You can use a separate data structure in your program (e.g. a trie, which is more efficient) to store the vocabulary and occurrences and speed up the indexing process, but the output should be in the designated format
Optional functionality
–Stopword removal
–Stemming
–It should be possible to turn them off with a parameter
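As an illustration of the trie idea, here is a minimal sketch in which each node at the end of a word holds that word's occurrence list. The class and method names are made up for this example, and a plain dictionary would also satisfy the assignment; the point is only that a sorted traversal yields the vocabulary in the order the output format needs.

```python
from typing import Dict, Iterator, List, Tuple

Occurrence = Tuple[int, int]  # (doc#, char#)


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.occurrences: List[Occurrence] = []  # non-empty only where a word ends


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str, doc_no: int, char_no: int) -> None:
        """Record one occurrence of word, creating nodes along the path as needed."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.occurrences.append((doc_no, char_no))

    def items(self) -> Iterator[Tuple[str, List[Occurrence]]]:
        """Yield (word, occurrences) in sorted word order (pre-order traversal)."""
        stack = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.occurrences:
                yield prefix, node.occurrences
            # Push children in reverse order so the smallest is popped first.
            for ch in sorted(node.children, reverse=True):
                stack.append((prefix + ch, node.children[ch]))


trie = Trie()
for word, doc_no, char_no in [("re", 1, 9), ("index", 1, 12), ("in", 1, 18), ("index", 3, 31)]:
    trie.add(word, doc_no, char_no)
entries = list(trie.items())
```

Because a prefix is always visited before its extensions, "in" comes out before "index", so the vocabulary file can be written directly from `items()` without a separate sort.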

7 Submission
Your submission *should* include
–The source code (and optionally your executable file)
–A one-page description that includes the following
  Major features in your work (ex: high efficiency, low storage, able to deal with multiple formats, …)
  Major difficulties encountered
  Special requirements for execution environments (ex: Java Runtime Environment)
  For team work, the name of each member and the parts each member was responsible for, clearly identified
Due: two weeks (Mar. 19, 2008)

8 Submission Instructions
Programs or homework in electronic files must be submitted directly to the TA by e-mail as follows
–Before submission: pack everything (source code and documentation) into one single compressed file, for example, 9659xxxx-HW1.ZIP
  Remember to specify your name and student ID in the files and documentation
–E-mail of TA: alowblow @ hotmail. com
  You will get a confirmation e-mail from the TA after your submission is received
–If you cannot successfully e-mail your work, please contact the TA or the instructor

9 Evaluation
Minimum requirement: the Time Collection as provided on the Web page will be used as input, and the inverted index generated by your program will be checked for correctness
Optional features such as stemming and stopword removal will be considered as bonus
You may be required to give a demo if the TA is unable to run your submitted program
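For the optional bonus features, one way to make them switchable from the command line is sketched below. Everything here is illustrative: the flag names, the tiny stopword list, and `toy_stem` are assumptions, and `toy_stem` is a crude suffix stripper standing in for a real stemmer, not the Porter algorithm.

```python
import argparse
from typing import Iterable, Iterator, List, Optional

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def toy_stem(word: str) -> str:
    """Strip a few common suffixes. A placeholder, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(words: Iterable[str], remove_stopwords: bool, stem: bool) -> Iterator[str]:
    """Apply the optional steps that are switched on."""
    for word in words:
        if remove_stopwords and word in STOPWORDS:
            continue
        yield toy_stem(word) if stem else word


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Build an inverted index.")
    # Both features default to on; each flag turns one of them off.
    parser.add_argument("--no-stopwords", dest="remove_stopwords",
                        action="store_false", help="disable stopword removal")
    parser.add_argument("--no-stemming", dest="stem",
                        action="store_false", help="disable stemming")
    return parser.parse_args(argv)


words = ["the", "indexing", "documents"]
args = parse_args(["--no-stemming"])
result = list(normalize(words, args.remove_stopwords, args.stem))
# result == ["indexing", "documents"]
```

With both features off the words pass through unchanged, which is the behavior the correctness check on the plain inverted index would need.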

10 Questions?

