IR Homework #1, by J. H. Wang, Mar. 25, 2009. Programming Exercise #1: Indexing. Goal: to build an index for a text collection using inverted files.


IR Homework #1 By J. H. Wang Mar. 25, 2009

Programming Exercise #1: Indexing
Goal: to build an index for a text collection using inverted files
Input: a set of text documents
– (to be described later)
Output: inverted index files
– (exact format to be described later)

Input: the Test Collection
Reuters-RCV1:
– About 810,000 English news stories from 1996/08/20 to 1997/08/19 (2.5GB uncompressed)
– Requires signing an agreement
Reuters-21578: …s/reuters21578/
– 21,578 news stories from 1987 (28.0MB uncompressed)
Test collections held at the University of Glasgow: …ections/
– LISA, NPL, CACM, CISI, Cranfield, Time, Medline, ADI
– Ex: The Time Collection: 423 documents (1.5MB)

Output: Inverted Index
Using the standard inverted index (Chap. 1 & 2)
Output format:
– Dictionary file: a sorted list of vocabulary terms (one per line)
– Postings list: for each word, a list of its occurrences in the original text
term i, df i: <doc1, tf i1: <pos1, pos2, …>; doc2, tf i2: <pos1, pos2, …>; …> (as in Fig. 2.11, Sec. 2.4)
– df i: document frequency of term i
– tf ij: term frequency of term i in doc j
Ex: to, …: <1, …: <…>; 2, 5: <…>; …>
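The output format above can be illustrated with a minimal sketch of a positional inverted index. The function names (`build_index`, `format_postings`, `dictionary_lines`), the 1-based token positions, and the exact angle-bracket layout are assumptions made for illustration; the designated format announced for the assignment takes precedence.

```python
from collections import defaultdict


def build_index(docs):
    """Build a positional index.

    docs: dict mapping doc_id -> list of tokens (already preprocessed).
    Returns: dict mapping term -> {doc_id: [positions]}.
    Positions are 1-based here; that choice is an assumption.
    """
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id], start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def format_postings(term, postings):
    """Serialize one term as: term, df: <doc, tf: <pos, ...>; ...>"""
    df = len(postings)  # document frequency = number of docs containing the term
    parts = []
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_text = ", ".join(str(p) for p in positions)
        # term frequency in this doc = number of recorded positions
        parts.append(f"{doc_id}, {len(positions)}: <{pos_text}>")
    return f"{term}, {df}: <" + "; ".join(parts) + ">"


def dictionary_lines(index):
    """Return the vocabulary as a sorted list, one entry per output line."""
    return sorted(index)
```

For example, indexing the two toy documents `{1: ["to", "be", "or", "not", "to", "be"], 2: ["to", "do"]}` gives a postings line for `to` with df 2, tf 2 in document 1 (positions 1 and 5), and tf 1 in document 2.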

Implementation Issues
Note: pos means the token positions in the body of documents
– This makes later steps after indexing, such as proximity search, easier to implement
Document preprocessing should be handled with care
– Different formats for different collections
– Digits, hyphens, punctuation marks, …
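As a rough sketch of the two points above, the code below shows one possible tokenization policy and how stored token positions support a proximity check. The policy (keep internal hyphens, treat digit runs as tokens, drop other punctuation) and the names `tokenize` and `within_k` are assumptions, not requirements of the assignment.

```python
import re

# Assumed policy: a token is a run of ASCII letters/digits, optionally joined
# by single internal hyphens (e.g. "state-of-the-art"). Everything else,
# including slashes and other punctuation, acts as a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Split raw text into tokens according to TOKEN_RE."""
    return TOKEN_RE.findall(text)


def within_k(positions_a, positions_b, k):
    """Return True if some position in positions_a and some position in
    positions_b (both sorted, from the same document) differ by at most k."""
    i = j = 0
    while i < len(positions_a) and j < len(positions_b):
        if abs(positions_a[i] - positions_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to look for a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return False
```

Because both position lists are sorted, the check runs in time linear in their combined length, which is why storing positions at indexing time pays off later.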

Implementation Issues
You can use a separate data structure (e.g. a trie, which is more efficient) to store the vocabulary and occurrences in your program and speed up indexing, but the output must be in the designated format
Optional functionality
– Case folding
– Stopword removal
– Stemming
– It should be possible to turn each of these off with a parameter
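One way to make the optional steps switchable is sketched below. The flag names (`--no-case-folding`, `--no-stopword-removal`, `--no-stemming`), the tiny stopword list, and `toy_stem` are all hypothetical; in particular `toy_stem` is only a placeholder and not a real stemming algorithm such as Porter's.

```python
import argparse

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def toy_stem(token):
    """Placeholder stemmer: strips a few common suffixes.
    NOT a real stemming algorithm; substitute a proper one in practice."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def normalize(tokens, case_folding=True, stopword_removal=True, stemming=True):
    """Apply the optional steps, each controlled by its own switch."""
    out = []
    for tok in tokens:
        if case_folding:
            tok = tok.lower()
        if stopword_removal and tok.lower() in STOPWORDS:
            continue
        if stemming:
            tok = toy_stem(tok)
        out.append(tok)
    return out


def parse_args(argv=None):
    """Parse the (hypothetical) command-line switches for the indexer."""
    parser = argparse.ArgumentParser(description="Indexer options (hypothetical flags)")
    parser.add_argument("--no-case-folding", action="store_true")
    parser.add_argument("--no-stopword-removal", action="store_true")
    parser.add_argument("--no-stemming", action="store_true")
    return parser.parse_args(argv)
```

A caller could then pass `case_folding=not args.no_case_folding` and so on, so that every optional step is on by default and can be disabled individually.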

Submission
Your submission *should* include
– The source code (and optionally your executable file)
– A one-page description that includes the following
  – Major features of your work (ex: high efficiency, low storage, multiple input formats, huge corpus, …)
  – Major difficulties encountered
  – Special requirements for the execution environment (ex: Java Runtime Environment, special compilers, …)
  – For team work, the name and responsibilities of each member should be clearly identified
Due: extended to three weeks (Apr. 1, 2009)

Submission Instructions
Programs or homework in electronic files must be submitted directly to the TA as follows
– Team members list: please e-mail your team members list to the TA (…ntut.edu.tw) even if you’re the only team member
– Preparing the submission file: one single compressed file named, for example, IR0901-HW1.ZIP
  Remember to specify the names and student IDs of your team members in the files and documentation
– E-mail or online submission: TBD
– If you cannot successfully submit your work, please contact the TA or the instructor

Evaluation
Minimum requirement: with the Reuters Test Collection as input, the inverted index generated by your program will be checked
Optional features such as case folding, stemming, and stopword removal will be considered as bonus
You might be required to give a demo if the TA is unable to compile/run the submitted program

Questions?