1 Vector Space Model Rong Jin

2 Basic Issues in a Retrieval Model
- How to represent text objects?
- What similarity function should be used?
- How to refine a query according to users' feedback?

3 Basic Issues in IR
- How to represent queries?
- How to represent documents?
- How to compute the similarity between documents and queries?
- How to use users' feedback to improve retrieval performance?

4 IR: Formal Formulation
- Vocabulary V = {w_1, w_2, ..., w_n} of the language
- Query q = q_1, ..., q_m, where each q_i ∈ V
- Collection C = {d_1, ..., d_k}; document d_i = (d_{i,1}, ..., d_{i,m_i}), where each d_{i,j} ∈ V
- Set of relevant documents R(q) ⊆ C
  - Generally unknown and user-dependent
  - The query is a "hint" about which documents are in R(q)
- Task: compute R'(q), an approximation of R(q)

5 Computing R(q)
Strategy 1: Document selection
- Classification function f(d, q) ∈ {0, 1}: outputs 1 for relevant, 0 for irrelevant
- R(q) is determined as the set {d ∈ C | f(d, q) = 1}
- The system must decide whether each document is relevant or not ("absolute relevance")
- Example: Boolean retrieval

6 Document Selection Approach
[Figure: the true relevant set R(q) compared against the set accepted by the classifier C(q)]

7 Computing R(q)
Strategy 2: Document ranking
- Similarity function f(d, q) ∈ ℝ: outputs a similarity score between document d and query q
- Cutoff θ: the minimum similarity for a document to be considered relevant to the query
- R(q) is determined as the set {d ∈ C | f(d, q) > θ}
- The system must decide whether one document is more likely to be relevant than another ("relative relevance")
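To make the contrast between the two strategies concrete, here is a minimal Python sketch; the document scores, the classifier threshold, and the cutoff θ are all invented for illustration, not values from the slides.

```python
# Hypothetical scores f(d, q) for a small collection; a real system
# would compute these with a similarity function such as cosine.
scores = {"d1": 0.98, "d2": 0.75, "d3": 0.40, "d4": 0.12}

# Strategy 1: document selection -- a hard binary decision per document.
selected = {d for d, s in scores.items() if s > 0.5}

# Strategy 2: document ranking -- order by score, then cut off at theta.
theta = 0.3
ranked = sorted(scores, key=scores.get, reverse=True)
retrieved = [d for d in ranked if scores[d] > theta]

print(selected)   # {'d1', 'd2'}
print(retrieved)  # ['d1', 'd2', 'd3']
```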

8-9 Document Selection vs. Ranking
[Figure: documents d_1 ... d_9 ordered by ranking scores f(d, q) (e.g., 0.98 at the top), with the cutoff θ separating the retrieved set R'(q) from the rest; document selection instead makes a binary in/out decision. Both are compared against the true R(q).]

10 Ranking Is Often Preferred
- A similarity function is more general than a classification function
- A classifier is unlikely to be accurate: information needs are ambiguous and queries are short
- Relevance is a subjective concept: absolute relevance vs. relative relevance

11 Probability Ranking Principle
- As stated by Cooper, ranking documents by probability of relevance maximizes the utility of an IR system:
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

12 Vector Space Model
- Any text object can be represented by a term vector
  - Examples: documents, queries, sentences, ...
  - A query is viewed as a short document
- Similarity is determined by the relationship between two vectors, e.g., the cosine of the angle between them, or the distance between them
- The SMART system: developed at Cornell University; still widely used

13 Vector Space Model: Illustration

            Java   Starbucks   Microsoft
  D1          1        1           0
  D2          0        1           1
  D3          1        0           1
  D4          1        1           1
  Query       1        0.1         1

14 Vector Space Model: Illustration
[Figure: documents D1-D4 and the query plotted as vectors in the three-dimensional term space with axes Java, Microsoft, and Starbucks]

15 Vector Space Model: Similarity
- Represent both documents and queries by word-histogram vectors over the n unique words
  - A query q = (q_1, q_2, ..., q_n), where q_i is the occurrence count of the i-th word in the query
  - A document d_k = (d_{k,1}, d_{k,2}, ..., d_{k,n}), where d_{k,i} is the occurrence count of the i-th word in the document
- The similarity of a query q to a document d_k is measured by the relationship between the two vectors

16 Some Background in Linear Algebra
- Dot product (scalar product): a · b = a_1 b_1 + a_2 b_2 + ... + a_n b_n
- Example: (1, 2) · (3, 4) = 1*3 + 2*4 = 11
- The similarity of two vectors can be measured by their dot product

17 Some Background in Linear Algebra
- Length of a vector: |a| = sqrt(a_1^2 + a_2^2 + ... + a_n^2)
- Angle between two vectors q and d_k: cos θ = (q · d_k) / (|q| |d_k|)

18 Some Background in Linear Algebra
- Example: measuring similarity by the angle between a query vector q and a document vector d_k (a smaller angle means a larger cosine and thus higher similarity)

19 Vector Space Model: Similarity
- Given a query q = (q_1, q_2, ..., q_n) and a document d_k = (d_{k,1}, d_{k,2}, ..., d_{k,n}), where q_i and d_{k,i} are the occurrence counts of the i-th word
- The similarity of q to d_k is the cosine of the angle between the vectors:
  sim(q, d_k) = (q · d_k) / (|q| |d_k|)

20-21 Vector Space Model: Similarity
[Equation slides: worked examples of computing the cosine similarity between a query q and a document d_k; the original figures are not recoverable]
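As a worked example in place of the lost figures, here is a minimal Python/NumPy sketch that scores the query against D1-D4 from the slide-13 table, using both the dot product and the cosine of the angle.

```python
import numpy as np

# Toy term-document matrix from slide 13:
# columns are the terms Java, Starbucks, Microsoft.
docs = np.array([
    [1, 1, 0],   # D1
    [0, 1, 1],   # D2
    [1, 0, 1],   # D3
    [1, 1, 1],   # D4
], dtype=float)
query = np.array([1, 0.1, 1])

# Dot-product similarity (unbounded) and cosine similarity
# (in [0, 1] for non-negative vectors).
dot = docs @ query
cosine = dot / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

for name, d, c in zip(["D1", "D2", "D3", "D4"], dot, cosine):
    print(f"{name}: dot={d:.2f} cosine={c:.3f}")
```

By the cosine measure, D3 (Java and Microsoft, no Starbucks) comes out most similar to the query, which matches the intuition that the query barely weights Starbucks.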

22 Term Weighting
- w_{k,i}: the importance of the i-th word for document d_k
- Why weighting? Some query terms carry more information than others
- TF.IDF weighting:
  - TF (Term Frequency) = within-document frequency
  - IDF (Inverse Document Frequency)
  - TF normalization: avoids a bias toward long documents

23 TF Weighting
- A term is important if it occurs frequently in a document
- Formulas, with f(t, d) denoting the occurrence count of word t in document d:
  - Maximum frequency normalization, e.g., TF(t, d) = 0.5 + 0.5 * f(t, d) / max_t' f(t', d)
  - Term frequency normalization (continued on the next slide)

24 TF Weighting
- A term is important if it occurs frequently in a document
- "Okapi/BM25 TF" term frequency normalization:
  TF(t, d) = ((k + 1) * f(t, d)) / (f(t, d) + k * (1 - b + b * doclen(d) / avg_doclen))
  - f(t, d): occurrence count of word t in document d
  - doclen(d): the length of document d; avg_doclen: the average document length
  - k, b: predefined constants
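A small sketch of the Okapi/BM25 TF component as reconstructed above; the constants k = 1.2 and b = 0.75 are common defaults, not values given in the slides.

```python
def bm25_tf(f_td, doclen, avg_doclen, k=1.2, b=0.75):
    """Okapi/BM25 term-frequency component: saturates with raw
    frequency and penalizes long documents via the b parameter."""
    norm = 1 - b + b * doclen / avg_doclen
    return (k + 1) * f_td / (f_td + k * norm)

# Repeated occurrences add less and less weight:
for f in (1, 2, 5, 20):
    print(f, round(bm25_tf(f, doclen=100, avg_doclen=100), 3))
```

The printed values grow from 1.0 toward the ceiling of k + 1, illustrating the saturation that motivates the next slide's discussion of repeated occurrences.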

25 TF Normalization
- Why?
  - Document length varies
  - "Repeated occurrences" are less informative than the "first occurrence"
- Two views of document length:
  - A document is long because it uses more words
  - A document is long because it has more content
- Generally penalize long documents, but avoid over-penalizing them (pivoted normalization)

26 TF Normalization
[Figure: normalized TF plotted against raw TF, illustrating "pivoted normalization"]

27 IDF Weighting
- A term is discriminative if it occurs in only a few documents
- Formula: IDF(t) = 1 + log(n / m), where n is the total number of documents and m is the number of documents containing term t (its document frequency)
- Can be interpreted as mutual information

28 TF-IDF Weighting
- TF-IDF weighting: the importance of a term t to a document d is weight(t, d) = TF(t, d) * IDF(t)
  - Frequent in the document → high TF → high weight
  - Rare in the collection → high IDF → high weight

29 TF-IDF Weighting
- TF-IDF weighting: the importance of a term t to a document d is weight(t, d) = TF(t, d) * IDF(t)
  - Frequent in the document → high TF → high weight
  - Rare in the collection → high IDF → high weight
- In the simplest case, both q_i and d_{k,i} are binary values, i.e., the presence or absence of a word in the query and the document
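A minimal sketch combining the pieces into weight(t, d) = TF(t, d) * IDF(t), with IDF(t) = 1 + log(n/m) from slide 27; the three-document toy corpus is invented for illustration, and raw counts are used as the TF component.

```python
import math
from collections import Counter

docs = [
    "java coffee starbucks",
    "microsoft java windows",
    "starbucks coffee shop",
]
tokenized = [d.split() for d in docs]
n = len(tokenized)

# Document frequency m for each term, then IDF(t) = 1 + log(n / m).
df = Counter(t for doc in tokenized for t in set(doc))
idf = {t: 1 + math.log(n / m) for t, m in df.items()}

def tfidf(doc):
    """weight(t, d) = TF(t, d) * IDF(t), with raw counts as TF."""
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf}

print(tfidf(tokenized[0]))
```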

30 Problems with the Vector Space Model
- Still limited to word-based matching: a document will never be retrieved if it does not contain any query word
- How can the vector space model be modified to fix this?

31-35 Choice of Bases
[Figure sequence: document D1 and query Q drawn as vectors in the Java/Microsoft/Starbucks term space; the axes are then rotated so that D1 and Q are re-expressed as D' and Q' in a new basis, showing that the choice of basis determines how similarity is measured]

36 Choosing Bases for VSM
- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a vector in the concept space
[Table: documents A1 and A2 represented over the terms c1-c5 and m1-m4; the original values are not recoverable]

37 Choosing Bases for VSM
- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a mixture of concepts

38 Choosing Bases for VSM
- Modify the bases of the vector space
  - Each basis is a concept: a group of words
  - Every document is a mixture of concepts
- How should the "basic concepts" be defined or selected? In the standard VS model, each term is viewed as an independent concept

39-40 Basics: Matrix Multiplication
[Equation slides: worked examples of matrix multiplication; the original figures are not recoverable]

41-42 Linear Algebra Basics: Eigen Analysis
- Eigenvectors (for a square m × m matrix S): a vector v and scalar λ such that S v = λ v, where v is a (right) eigenvector and λ the corresponding eigenvalue
[Example slide: a worked eigenvalue/eigenvector computation; the original figure is not recoverable]

43-45 Linear Algebra Basics: Eigen Decomposition
- S = U Λ U^T
- This holds for any symmetric square matrix S
- The columns of U are the eigenvectors of S
- The diagonal elements of Λ are the eigenvalues of S
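A quick NumPy check of the decomposition on a small symmetric matrix; the matrix is an arbitrary example, not one from the slides.

```python
import numpy as np

# A symmetric matrix, so S = U @ Lam @ U.T with orthonormal U.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, U = np.linalg.eigh(S)  # eigh handles symmetric matrices
Lam = np.diag(eigenvalues)

# Reconstruction check: U * Lambda * U^T recovers S.
print(np.allclose(U @ Lam @ U.T, S))  # True
print(eigenvalues)                     # [1. 3.]
```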

46 Singular Value Decomposition
- For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD): A = U Σ V^T
  - U is m × m; its columns are the left singular vectors
  - V is n × n; its columns are the right singular vectors
  - Σ is an m × n diagonal matrix whose entries are the singular values

47-49 Singular Value Decomposition
[Figure sequence: illustration of the dimensions and sparseness of U, Σ, and V^T in the SVD]
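In place of the lost dimension diagrams, here is a short NumPy sketch showing the shapes of the SVD factors for an arbitrary 5 × 3 matrix and verifying the reconstruction A = U Σ V^T.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # an m x n matrix, m=5, n=3

U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)        # (5, 5) (3,) (3, 3)

# Rebuild the m x n diagonal Sigma and verify A = U Sigma V^T.
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
print(np.allclose(U @ Sigma @ Vt, A))    # True
```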

50-52 Low Rank Approximation
- Approximate matrix A with the k largest singular values and the corresponding singular vectors: A ≈ A_k = U_k Σ_k V_k^T
[Figure sequence: the truncated factors U_k, Σ_k, and V_k^T shrinking as smaller singular values are dropped]
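A sketch of rank-k truncation with NumPy; the test matrix is random, and the Frobenius error shrinks to zero as k reaches the full rank.

```python
import numpy as np

def low_rank(A, k):
    """Rank-k approximation of A built from its k largest
    singular values and the corresponding singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
for k in (1, 2, 3, 4):
    err = np.linalg.norm(A - low_rank(A, k))   # Frobenius error
    print(k, round(err, 4))                    # shrinks as k grows
```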

53 Latent Semantic Indexing (LSI)
- Computation: use the singular value decomposition (SVD), keeping the m largest singular values and the corresponding singular vectors, where m is the number of concepts: X ≈ U_m Σ_m V_m^T
  - U_m: the representation of the concepts in term space
  - V_m: the representation of the concepts in document space
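A minimal LSI sketch under these definitions; the 4 × 4 term-document matrix and the choice m = 2 are invented for illustration, and the query is folded into concept space via q̂ = Σ_m^{-1} U_m^T q, the standard LSI folding-in formula.

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

m = 2  # number of latent concepts to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_m, S_m, Vt_m = U[:, :m], np.diag(s[:m]), Vt[:m, :]

# Documents in concept space: the columns of S_m @ Vt_m.
doc_concepts = S_m @ Vt_m

# Fold a query (term vector) into the same space.
q = np.array([1, 0, 1, 0], dtype=float)
q_hat = np.linalg.inv(S_m) @ U_m.T @ q
print(q_hat, doc_concepts.shape)
```

Once query and documents live in the same m-dimensional concept space, similarity can be computed with the cosine measure exactly as in the plain vector space model.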

54 Finding “Good Concepts”

55-58 SVD Example (m = 2)
[Figure sequence: a term-document matrix X, its SVD factors, and the rank-2 reconstruction X', built up step by step; the original figures are not recoverable]

59 SVD: Orthogonality
- The left singular vectors are mutually orthogonal: u_1 · u_2 = 0
- The right singular vectors are mutually orthogonal: v_1 · v_2 = 0

60 SVD: Properties
- rank(S): the maximum number of row (or column) vectors of a matrix S that are linearly independent
- SVD produces the best low-rank approximation
- Example: X has rank(X) = 9, while its truncated reconstruction X' has rank(X') = 2

61-62 SVD: Visualization
[Figure: a term-document matrix X and its documents plotted in the low-dimensional concept space]
- SVD tries to preserve the Euclidean distances between document vectors
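A small NumPy sketch of this property: pairwise document distances in the full term space versus in a 2-concept projection. The random 10 × 50 matrix is invented for illustration, and the projection gives the closest rank-2 geometry rather than exact distances.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 50))   # 10 documents in 50-dim term space

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :2] * s[:2]                # document coordinates in 2 concepts

def pairwise(M):
    """Euclidean distance between every pair of rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

# Mean absolute gap between the original and projected distances:
# the rank-2 projection approximates the geometry as well as any
# 2-dimensional representation can.
print(np.abs(pairwise(X) - pairwise(Z)).mean())
```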