Information Retrieval and Web Search


Information Retrieval and Web Search. IR Models: Vector Space Model. Instructor: Rada Mihalcea. [Note: Some slides in this set were adapted from an IR course taught by Ray Mooney at UT Austin (who in turn adapted them from Joydeep Ghosh), and from an IR course taught by Chris Manning at Stanford.]

Naïve Implementation
1. Convert all documents in collection D to tf-idf weighted vectors dj over the keyword vocabulary V.
2. Convert the query to a tf-idf weighted vector q.
3. For each dj in D, compute the score sj = cosSim(dj, q).
4. Sort documents by decreasing score.
5. Present the top-ranked documents to the user.
Time complexity: O(|V|·|D|), which is bad for large V and D! With |V| = 10,000 and |D| = 100,000, |V|·|D| = 1,000,000,000.
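The naïve procedure above can be sketched in a few lines of Python. This is a minimal illustration with sparse vectors represented as term-to-weight dictionaries; the function names are illustrative, not part of any library.

```python
import math

def cos_sim(d, q):
    """Cosine similarity between two sparse vectors (term -> weight dicts)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

def naive_retrieve(doc_vectors, query_vector):
    """Score every document against the query, then sort: O(|V|*|D|)."""
    scored = [(doc_id, cos_sim(vec, query_vector))
              for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)
```

Note that every document is scored, even those sharing no terms with the query; this is exactly the inefficiency the practical implementation avoids.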

Practical Implementation
Based on the observation that documents containing none of the query keywords do not affect the final ranking, try to identify only those documents that contain at least one query keyword. This is the role of an inverted index in an actual implementation.

Step 1: Preprocessing
Implement the preprocessing functions: tokenization, stop word removal, and stemming.
Input: documents, read one by one from the collection.
Output: tokens to be added to the index (no punctuation, no stop words, stemmed).
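A minimal sketch of the preprocessing pipeline, assuming a tiny hand-picked stop word list and a toy suffix-stripping rule standing in for a real stemmer (such as Porter's); both are illustrative simplifications.

```python
import re

# Small illustrative stop word list; a real system would use a larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def stem(token):
    # Toy suffix stripping; a stand-in for a real stemmer (e.g. Porter's).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize (dropping punctuation), remove stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```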

Step 2: Indexing
Build an inverted index, with an entry for each word in the vocabulary.
Input: tokens obtained from the preprocessing module.
Output: an inverted index for fast access.

Step 2 (continued)
Many data structures are appropriate for fast access: B-trees, skip lists, hashtables.
We need one entry for each word in the vocabulary. For each such entry:
Keep a list of all the documents where the word appears, together with the corresponding frequency (tf).
Keep the total number of documents in which the word occurs, i.e. its document frequency (from which idf is computed).
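The structure just described can be sketched with Python dictionaries (a hashtable-based choice among the data structures mentioned above); the names are illustrative.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index from preprocessed documents.

    `docs` maps doc_id -> list of tokens.
    Returns:
      postings: term -> list of (doc_id, tf) pairs
      df:       term -> number of documents containing the term
    """
    postings = defaultdict(list)
    df = defaultdict(int)
    for doc_id, tokens in docs.items():
        counts = defaultdict(int)
        for t in tokens:
            counts[t] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
            df[term] += 1
    return postings, df
```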

Step 2 (continued)
[Figure: structure of the inverted index file. Each index term (e.g., computer, database, science) is stored with its df value and points to a postings list of (Dj, tfj) pairs, such as (D7, 4), (D2, 4), (D5, 2), (D1, 3).]

Step 2 (continued)
tf and df (and hence idf) for each token can be computed in one pass.
Cosine similarity also requires document lengths, so a second pass is needed to compute document vector lengths.
Remember that the length of a document vector is the square root of the sum of the squares of the weights of its tokens, and that the weight of a token is its tf-idf. Therefore, we must wait until the idf values are known (i.e., until all documents are indexed) before document lengths can be determined.
Do a second pass over all documents: keep a list or hashtable of all document ids, and for each document determine its length.
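The second pass can iterate over the postings rather than re-reading the documents, since the postings already record every (document, tf) pair. A minimal sketch, assuming the postings/df structures from the indexing step and a raw tf * idf weighting with idf = log(N/df):

```python
import math

def doc_lengths(postings, df, n_docs):
    """Second pass: accumulate each document's squared tf-idf weights,
    then take square roots to obtain vector lengths."""
    sq = {}
    for term, plist in postings.items():
        idf = math.log(n_docs / df[term])
        for doc_id, tf in plist:
            w = tf * idf
            sq[doc_id] = sq.get(doc_id, 0.0) + w * w
    return {doc_id: math.sqrt(s) for doc_id, s in sq.items()}
```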

Time Complexity of Indexing
The complexity of creating the vector for, and indexing, a document of n tokens is O(n), so indexing |D| such documents is O(|D| n).
Computing token idf values can be done during the same first pass.
Computing vector lengths is also O(|D| n).
The complete process is O(|D| n), which is also the complexity of just reading in the corpus.

Step 3: Retrieval
Use the inverted index (from Step 2) to find the limited set of documents that contain at least one of the query words.
Incrementally compute the cosine similarity of each indexed document as the query words are processed one by one.
To accumulate a total score for each retrieved document, store the retrieved documents in a hashtable (or another search data structure), where the document id is the key and the partial accumulated score is the value.
Input: query and inverted index (from Step 2).
Output: similarity values between the query and documents.
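The retrieval step above can be sketched as follows: a plain dict plays the role of the score-accumulating hashtable, and only documents appearing in some query term's postings list are ever touched. This sketch assumes idf = log(N/df), a query term weight of idf (i.e., query tf of 1), and precomputed document lengths from the second indexing pass; normalizing by the query length is omitted since it is constant across documents and does not change the ranking.

```python
import math

def retrieve(query_terms, postings, df, lengths, n_docs):
    """Accumulate partial cosine scores per document via a hashtable,
    touching only documents containing at least one query term."""
    scores = {}
    for term in query_terms:
        if term not in postings:
            continue
        idf = math.log(n_docs / df[term])
        q_weight = idf  # query tf assumed to be 1 for a short query
        for doc_id, tf in postings[term]:
            scores[doc_id] = scores.get(doc_id, 0.0) + (tf * idf) * q_weight
    # Normalize by the precomputed document vector lengths.
    return {d: s / lengths[d] for d, s in scores.items() if lengths[d] > 0}
```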

Step 4: Ranking
Sort the search structure holding the retrieved documents by the value of their cosine similarity, and return the documents in descending order of relevance.
Input: similarity values between the query and documents.
Output: ranked list of documents, in decreasing order of relevance.

What Weighting Methods?
Weights are applied to both document terms and query terms. The choice of weighting scheme has a direct impact on the final ranking, and therefore on the results and on the overall quality of the IR system.

Standard Evaluation Measures
Start with a contingency table:

                 retrieved   not retrieved
  relevant           w             x          n1 = w + x
  not relevant       y             z
                 n2 = w + y                   total: N

Precision and Recall
Recall: from all the documents that are relevant out there, how many did the IR system retrieve?
  Recall = w / (w + x)
Precision: from all the documents retrieved by the IR system, how many are relevant?
  Precision = w / (w + y)
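In set terms, w is the intersection of the retrieved and relevant sets, so both measures follow directly; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved & relevant| / |retrieved|
       Recall    = |retrieved & relevant| / |relevant|"""
    w = len(retrieved & relevant)
    precision = w / len(retrieved) if retrieved else 0.0
    recall = w / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are relevant, out of three relevant documents overall, gives precision 1/2 and recall 2/3.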

Precision and Recall for a Set of Queries
For each query, determine the set of retrieved documents and the set of relevant documents.
Macro-average: compute P/R/F for each individual query, then average the values across queries.
Micro-average: sum the retrieved and relevant document counts over all queries, then calculate P/R/F only once.
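The two averaging schemes can be contrasted in a short sketch; each query is represented as a (retrieved set, relevant set) pair, and the helper name is illustrative.

```python
def macro_micro(per_query):
    """per_query: list of (retrieved_set, relevant_set) pairs.

    Macro: average the per-query precision/recall values.
    Micro: pool the counts over all queries, then compute once."""
    n = len(per_query)
    macro_p = sum(len(r & g) / len(r) for r, g in per_query if r) / n
    macro_r = sum(len(r & g) / len(g) for r, g in per_query if g) / n
    tp = sum(len(r & g) for r, g in per_query)  # total relevant retrieved
    micro_p = tp / sum(len(r) for r, _ in per_query)
    micro_r = tp / sum(len(g) for _, g in per_query)
    return (macro_p, macro_r), (micro_p, micro_r)
```

Macro-averaging weights every query equally; micro-averaging weights queries by how many documents they retrieve (or have relevant), so the two can differ noticeably when queries vary in size.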