IR Homework #2 By J. H. Wang Mar. 31, 2015. Programming Exercise #2: Query Processing and Searching Goal: to search relevant documents for a given query.


IR Homework #2 By J. H. Wang Mar. 31, 2015

Programming Exercise #2: Query Processing and Searching
Goal: to search for documents relevant to a given query
Input: a query (and the inverted index)
  – (simple search: keyword, Boolean)
Output: a ranked list of search results from the ClueWeb09 collection
  – (details to be described later)

Input: User Query and Inverted Index
Simple queries
  – Single keywords. Ex: Microsoft, airplanes, …
  – Free text. Ex: United States, non-profit organization, …
  – Simple Boolean search. Ex: open source AND Linux, software engineer OR project manager, …
Inverted index
  – As generated in HW#1
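The query types above can be sketched against a toy inverted index. This is a minimal illustration under stated assumptions, not the required design: the index shape (term mapped to a sorted list of document IDs), the function names, and the choice to treat a multi-word operand as the AND of its words are all hypothetical, and only a single AND or OR operator per query is handled.

```python
from typing import Dict, List

# Hypothetical index shape: term -> sorted list of document IDs.
# The real format is whatever your HW#1 program produced.
InvertedIndex = Dict[str, List[int]]


def intersect(p1: List[int], p2: List[int]) -> List[int]:
    """Merge two sorted posting lists, keeping IDs present in both (AND)."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def union(p1: List[int], p2: List[int]) -> List[int]:
    """Return the sorted IDs present in either posting list (OR)."""
    return sorted(set(p1) | set(p2))


def free_text(index: InvertedIndex, text: str) -> List[int]:
    """Assumption: a multi-word operand matches documents containing all its words."""
    terms = text.lower().split()
    if not terms:
        return []
    result = index.get(terms[0], [])
    for term in terms[1:]:
        result = intersect(result, index.get(term, []))
    return result


def boolean_search(index: InvertedIndex, query: str) -> List[int]:
    """Evaluate 'A AND B', 'A OR B', or a plain keyword / free-text query."""
    for op, combine in ((" AND ", intersect), (" OR ", union)):
        if op in query:
            left, right = query.split(op, 1)
            return combine(free_text(index, left), free_text(index, right))
    return free_text(index, query)
```

For ranked retrieval it may be preferable to collect documents matching any query word and let the scoring step order them; the conjunctive reading here is only one possible choice.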

Output: Ranked Search Results
A ranked list of search results from the ClueWeb09 collection
  – Ranking: vector space model
      Term weighting scheme: TF-IDF
      Similarity estimation: cosine similarity between query and document vectors
      $w_{ij} = (1 + \log \mathrm{tf}_{ij}) \cdot \log (N / \mathrm{df}_i)$
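The weighting and similarity steps above might be sketched as follows. This assumes tf is the count of a term in a document, df is the number of documents containing the term, and N is the collection size. The slide does not fix the logarithm base, so base 10 is an assumption here, as are the function names and the toy in-memory collection (the real input would be the HW#1 index).

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tf_idf_weight(tf: int, df: int, n_docs: int) -> float:
    """w = (1 + log tf) * log(N / df); 0 when the term is absent. Base 10 is assumed."""
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens: List[str], docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs with non-zero score, best first."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    # Query vector uses the same weighting as the documents.
    q_vec = {t: tf_idf_weight(c, df.get(t, 0), n_docs)
             for t, c in Counter(query_tokens).items()}
    scored = []
    for doc_id, tokens in docs.items():
        d_vec = {t: tf_idf_weight(c, df[t], n_docs)
                 for t, c in Counter(tokens).items()}
        score = cosine(q_vec, d_vec)
        if score > 0.0:
            scored.append((doc_id, score))
    # Sort by descending score, breaking ties by document ID.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Note that a term appearing in every document gets weight log(N/N) = 0, so it contributes nothing to the ranking under this scheme.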

Example Output
Ex:
  – Query: "Hong Kong"
  – Result: …

Optional Features
Optional functionalities
  – Better user interface for search
  – Complex queries: phrase, wildcard, substring, proximity search, combinations of Boolean operators, … (Ch.2 & 3)
  – Query processing: spelling correction, phonetic correction, … (Ch.3)
  – Different term weighting schemes: variants of TF-IDF, … (Ch.6)
  – Inexact top-k retrieval: index elimination, champion lists, impact ordering, tiered index, … (Ch.7)
  – Each optional feature should be able to be turned on/off by a parameter
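As one illustration of an optional feature that can be switched on or off by a parameter, here is a rough sketch of inexact top-k retrieval with champion lists. The postings shape (term mapped to a dictionary of document weights), the function names, and the use of summed term weights as the score are assumptions made for this example only. Passing `champions=None` falls back to exact scoring over all matching documents.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical postings shape: term -> {doc_id: precomputed term weight}.
Postings = Dict[str, Dict[str, float]]


def build_champion_lists(postings: Postings, r: int) -> Dict[str, List[str]]:
    """For each term, keep the r documents with the highest weight for that term."""
    return {
        term: sorted(docs, key=lambda d: (-docs[d], d))[:r]
        for term, docs in postings.items()
    }


def top_k(
    query_terms: List[str],
    postings: Postings,
    k: int,
    champions: Optional[Dict[str, List[str]]] = None,
) -> List[Tuple[str, float]]:
    """Return the k best (doc_id, score) pairs.

    With champions=None every matching document is scored (exact).
    Otherwise only documents on a query term's champion list are scored (inexact).
    """
    if champions is None:
        candidates = {d for t in query_terms for d in postings.get(t, {})}
    else:
        candidates = {d for t in query_terms for d in champions.get(t, [])}
    # Score each candidate with its full set of query-term weights.
    scores = {
        d: sum(postings.get(t, {}).get(d, 0.0) for t in query_terms)
        for d in candidates
    }
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
```

With a small `r`, fewer documents are scored, at the cost of possibly missing documents that would have made the exact top k.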

Submission
Your submission *should* include
  – The source code (and your configurations of extra libraries)
      If you use open source tools, please also submit your source code that calls their APIs or libraries
  – A one-page document including
      Major features: ex: high efficiency, low storage, multiple input formats, huge corpus, …
      Major difficulties encountered
      Special requirements for the execution environment (ex: Java Runtime Environment, special compilers, …)
      Team member list: the name and responsible parts of each individual member should be clearly identified
Due: three weeks (Apr. 27, 2015)

Submission Instructions
Programs and related electronic files in your homework must be submitted directly on the submission site:
  – Submission site:
  – Preparing your submission file: as one single compressed file
      Name your file according to your ID, such as _HW2.zip
      Remember to specify the names and student IDs of your team members in the files and documentation
  – If you cannot successfully submit your work, please contact the TA (R1424, Technology Building)

Evaluation
Minimum requirement: correctness for simple queries
  – Some example queries from the ClueWeb09 test collection will be submitted to your program, and the ranked list will be checked for effectiveness
Optional features will be considered as bonus
  – Various query types, weighting schemes, efficient scoring and ranking, …
You may be required to give a demo if the TA is unable to run the submitted program
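The slides do not say which effectiveness measure will be applied. As a self-check before submitting, one common option is precision at k, sketched below; the function name and the idea of comparing against a hand-made set of relevant document IDs are assumptions, not part of the assignment.

```python
from typing import Iterable, List, Set


def precision_at_k(ranked: List[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the top-k ranked document IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set: Set[str] = set(relevant)
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant_set)
    # Dividing by k (not by the number returned) penalizes short result lists.
    return hits / k
```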

Any Questions or Comments?