
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Searching Full Text 2

2 Course Administration Web site: Notices: See the home page on the course Web site. Sign-up sheet: If you did not sign up at the first class, please sign up now. Programming assignments: The assignments described on the Web site require Java or C++. Other languages are under consideration.

3 Course Administration Please send all questions about the course to: The message will be sent to William Arms and the Teaching Assistants.

4 Course Administration Discussion class, Wednesday, August 30 Phillips Hall 203, 7:30 to 8:30 p.m. Prepare for the class as instructed on the course Web site. Participation in the discussion classes is one third of the grade, but tomorrow's class will not be included in the grade calculation.

5 Discussion Classes Format: Questions. Ask a member of the class to answer. Provide opportunity for others to comment. When answering: Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear. Suggestions: Do not be shy about presenting partial answers. Differing viewpoints are welcome.

6 Discussion Class: Preparation You are given two problems to explore: What is the medical evidence that red wine is good or bad for your health? What in history led to the current turmoil in Palestine and the neighboring countries? In preparing for the class, focus on the question: What characteristics of the three search services are helpful or lead to difficulties in addressing these two problems? The aim of your preparation is to explore the search services, not to solve these two problems. Take care. Many of the documents that you might find are written from a one-sided viewpoint.

7 Discussion Class: Preparation In preparing for the discussion classes, you may find it useful to look at the slides from last year's class on the old Web site:

8 Similarity Ranking Methods Similarity ranking methods: measure the degree of similarity between a query and a document. [Diagram: the query is compared with the documents; "Similar": how similar is a document to a query?] [Contrast with methods that look for exact matches (e.g., Boolean). Those methods assume that a document is either relevant to a query or not relevant.]

9 Similarity Ranking Methods: Use of Indexes [Diagram: the query and the documents are matched through an index database; a mechanism determines the similarity of the query to each document and returns the set of documents ranked by how similar they are to the query.]

10 Term Similarity: Example Problem: Given two text documents, how similar are they? A document can be any length, from one word to thousands. The following examples use very short artificial documents. Example: Here are three documents. How similar are they?
d1: ant ant bee
d2: dog bee dog hog dog ant dog
d3: cat gnu dog eel fox

11 Term Similarity: Basic Concept Concept: Two documents are similar if they contain some of the same terms.

12 Term Vector Space: No Weighting Term vector space: an n-dimensional space, where n is the number of different terms used to index a set of documents (i.e., the size of the word list). Vector: Document i is represented by a vector. Its magnitude in dimension j is t_ij, where: t_ij = 1 if term j occurs in document i, t_ij = 0 otherwise. [This is the basic method with no term weighting.]

13 A Document Represented in a 3-Dimensional Term Vector Space [Figure: document d1 drawn as a vector in a space with axes t1, t2, t3; its components along the axes are t11, t12, t13.]

14 Basic Method: Term Incidence Matrix (No Weighting)

document   text                           terms
d1         ant ant bee                    ant bee
d2         dog bee dog hog dog ant dog    ant bee dog hog
d3         cat gnu dog eel fox            cat dog eel fox gnu

       ant  bee  cat  dog  eel  fox  gnu  hog
d1      1    1    0    0    0    0    0    0
d2      1    1    0    1    0    0    0    1
d3      0    0    1    1    1    1    1    0

t_ij = 1 if document i contains term j and zero otherwise. These are 3 vectors in an 8-dimensional term vector space.
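A minimal Python sketch (not part of the original lecture) of how the incidence vectors in this matrix could be built; the document texts are the example texts above and all names (docs, terms, incidence) are illustrative.

```python
# Illustrative sketch: build the binary term incidence vectors of slide 14.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# The word list: every distinct term, in alphabetical order.
terms = sorted({t for text in docs.values() for t in text.split()})

# t_ij = 1 if term j occurs in document i, 0 otherwise.
incidence = {
    name: [1 if term in text.split() else 0 for term in terms]
    for name, text in docs.items()
}

print(terms)            # ['ant', 'bee', 'cat', 'dog', 'eel', 'fox', 'gnu', 'hog']
print(incidence["d2"])  # [1, 1, 0, 1, 0, 0, 0, 1]
```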

15 Similarity between Two Documents Similarity: The similarity between two documents, d1 and d2, is a function of the angle between their vectors in the term vector space.

16 Two Documents Represented in 3-Dimensional Term Vector Space [Figure: documents d1 and d2 drawn as vectors in a space with axes t1, t2, t3; the angle θ between them is marked.]

17 Vector Space Revision x = (x1, x2, x3, ..., xn) is a vector in an n-dimensional vector space. The length of x is given by (an extension of Pythagoras's theorem): |x|^2 = x1^2 + x2^2 + x3^2 + ... + xn^2. If x1 and x2 are vectors, their inner product (or dot product) is given by: x1.x2 = x11*x21 + x12*x22 + x13*x23 + ... + x1n*x2n. The cosine of the angle θ between the vectors x1 and x2 is: cos(θ) = x1.x2 / (|x1| |x2|).
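The same definitions can be written as a short Python sketch (illustrative only; the function names are our own):

```python
import math

def length(x):
    """|x| = sqrt(x1^2 + x2^2 + ... + xn^2)"""
    return math.sqrt(sum(v * v for v in x))

def dot(x1, x2):
    """Inner product x1.x2 = x11*x21 + x12*x22 + ... + x1n*x2n"""
    return sum(a * b for a, b in zip(x1, x2))

def cosine(x1, x2):
    """cos(theta) = x1.x2 / (|x1| |x2|)"""
    return dot(x1, x2) / (length(x1) * length(x2))
```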

18 Example: Comparing Documents (No Weighting)

       ant  bee  cat  dog  eel  fox  gnu  hog   length
d1      1    1    0    0    0    0    0    0    √2
d2      1    1    0    1    0    0    0    1    √4
d3      0    0    1    1    1    1    1    0    √5

19 Example: Comparing Documents Similarity of documents in example:

       d1     d2     d3
d1     1     0.71    0
d2    0.71    1     0.22
d3     0     0.22    1
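Continuing the illustrative sketches above, these pairwise similarities can be checked directly; the values are computed from the example vectors, not quoted from the slides:

```python
d1 = incidence["d1"]  # [1, 1, 0, 0, 0, 0, 0, 0]
d2 = incidence["d2"]  # [1, 1, 0, 1, 0, 0, 0, 1]
d3 = incidence["d3"]  # [0, 0, 1, 1, 1, 1, 1, 0]

print(round(cosine(d1, d2), 2))  # 0.71 = 2 / (sqrt(2) * sqrt(4))
print(round(cosine(d1, d3), 2))  # 0.0  (no terms in common)
print(round(cosine(d2, d3), 2))  # 0.22 = 1 / (sqrt(4) * sqrt(5))
```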

20 Similarity between a Query and a Document Consider a query as another vector in the term vector space. The similarity between a query, q, and a document, d, is a function of the angle between their vectors in the term vector space.

21 Similarity between a Query and a Document in 3-Dimensional Term Vector Space [Figure: the query q and a document d drawn as vectors in a space with axes t1, t2, t3; the angle θ between them is marked.] cos(θ) is used as a measure of similarity.

22 Similarity of Query to Documents (Term Incidence Matrix: No Weighting)

query q: ant dog

document   text                           terms
d1         ant ant bee                    ant bee
d2         dog bee dog hog dog ant dog    ant bee dog hog
d3         cat gnu dog eel fox            cat dog eel fox gnu

       ant  bee  cat  dog  eel  fox  gnu  hog
q       1    0    0    1    0    0    0    0
d1      1    1    0    0    0    0    0    0
d2      1    1    0    1    0    0    0    1
d3      0    0    1    1    1    1    1    0

23 Calculate Ranking Similarity of query to documents in example:

       d1     d2      d3
q      1/2    1/√2    1/√10

If the query q is searched against this document set, the ranked results are: d2, d1, d3.
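This ranking can be reproduced with a short continuation of the illustrative sketches above (it reuses the terms, incidence, and cosine names defined earlier):

```python
# Rank the example documents against the query "ant dog" (slides 22-23).
query = "ant dog"
q = [1 if term in query.split() else 0 for term in terms]

scores = {name: cosine(q, vec) for name, vec in incidence.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 2))
# d2 0.71
# d1 0.5
# d3 0.32
```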

24 Simple Uses of Vector Similarity in Information Retrieval Threshold: For query q, retrieve all documents with similarity above a chosen threshold. Ranking: For query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]
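A small continuation of the sketch shows both uses; the threshold 0.4 and n = 2 are arbitrary choices made only for illustration:

```python
# Two ways of using the similarity scores (slide 24).
above_threshold = [name for name, s in scores.items() if s > 0.4]
top_n = sorted(scores, key=scores.get, reverse=True)[:2]  # n = 2 here
print(above_threshold)  # ['d1', 'd2']
print(top_n)            # ['d2', 'd1']
```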

25 Extending the Basic Concept with Term Weighting An improved measure of similarity might take account of: (a) whether the terms are common or unusual; (b) how many times each term appears in a document; (c) the lengths of the documents; (d) the place in the document that a term appears; (e) terms that are adjacent to each other (phrases).

26 Term Vector Space with Weighting Term vector space: an n-dimensional space, where n is the number of different terms used to index a set of documents (i.e., the size of the word list). Vector: Document i is represented by a vector. Its magnitude in dimension j is t_ij, where: t_ij > 0 if term j occurs in document i, t_ij = 0 otherwise. t_ij is the weight of term j in document i.
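As one possible weighting (our illustration, not a choice prescribed by the slide), t_ij could be the raw term frequency; a minimal sketch continuing the example:

```python
from collections import Counter

# t_ij = number of times term j occurs in document i (raw term frequency).
def tf_vector(text, terms):
    counts = Counter(text.split())
    return [counts[term] for term in terms]

print(tf_vector(docs["d2"], terms))  # [1, 1, 0, 4, 0, 0, 0, 1]
```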