Vector Space Model


Vector Space Model
- Any text object can be represented by a term vector. Examples: documents, queries, sentences, …
- A query is viewed as a short document.
- Similarity is determined by distance in a vector space. Example: the cosine of the angle between the vectors.
- The SMART system: developed at Cornell University, still widely used.

Vector Space Model
- Documents are represented as vectors in a multi-dimensional Euclidean space.
- Each axis corresponds to a term (token).
- The coordinate of document $d$ in the direction of term $t$ is determined by:
  - Term frequency $\mathrm{TF}(d,t)$: the number of times $t$ occurs in document $d$, scaled in a variety of ways to normalize for document length.
  - Inverse document frequency $\mathrm{IDF}(t)$: scales down the terms that occur in many documents.

Term Frequency: Scaling
- Let $n(d,t)$ be the number of times term $t$ occurs in document $d$.
- The Cornell SMART system uses a damped term frequency:
$$\mathrm{TF}(d,t) = \begin{cases} 0 & \text{if } n(d,t) = 0 \\ 1 + \log\bigl(1 + \log n(d,t)\bigr) & \text{otherwise} \end{cases}$$
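A minimal Python sketch of this damped TF, assuming the formula reconstructed above; the name `smart_tf` is ours, not part of SMART:

```python
import math

def smart_tf(n_dt: int) -> float:
    """Damped term frequency in the style of Cornell SMART.

    n_dt is the raw count of term t in document d; absent terms get 0,
    everything else is squashed by a double logarithm.
    """
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))
```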

Inverse Document Frequency
- Not all axes (terms) in the vector space are equally important.
- IDF seeks to scale down the coordinates of terms that occur in many documents.
- The Cornell SMART system uses
$$\mathrm{IDF}(t) = \log\frac{1 + |D|}{|D_t|}$$
where $|D|$ is the number of documents in the corpus and $|D_t|$ is the number of documents containing term $t$.
- If $|D_t| \ll |D|$, the term $t$ will enjoy a large IDF scale, and vice versa.
- Other variants are also used; these are mostly dampened functions of $|D|/|D_t|$.
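A matching sketch for the IDF above; `smart_idf` is again a hypothetical helper, and `doc_freq` is assumed to be at least 1:

```python
import math

def smart_idf(num_docs: int, doc_freq: int) -> float:
    """SMART-style IDF: log((1 + |D|) / |D_t|).

    num_docs is |D|, the corpus size; doc_freq is |D_t|, the number of
    documents containing term t (assumed >= 1, so no division by zero).
    """
    return math.log((1.0 + num_docs) / doc_freq)
```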

TFIDF-space
- An obvious way to combine TF and IDF: the coordinate of document $d$ in axis $t$ is given by $d_t = \mathrm{TF}(d,t) \cdot \mathrm{IDF}(t)$.
- The general form of $d_t$ consists of three parts:
  - a local weight for term $t$ occurring in document $d$ (e.g., TF),
  - a global weight for term $t$ occurring in the corpus (e.g., IDF),
  - a document normalization factor.

Term-by-Document Matrix
- A document collection (corpus) composed of $n$ documents that are indexed by $m$ terms (tokens) can be represented as an $m \times n$ matrix $A$, whose entry $a_{ij}$ is the weight of term $i$ in document $j$.
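As an illustration (not the SMART implementation), one way to assemble such a matrix from pre-tokenized documents; `term_doc_matrix` is our own name, and it reuses the `smart_tf` and `smart_idf` sketches above:

```python
from collections import Counter
import numpy as np

def term_doc_matrix(docs):
    """Build an m x n TF-IDF term-by-document matrix.

    docs: list of documents, each a list of preprocessed tokens.
    Returns the matrix A (terms x documents) and the sorted term list.
    """
    vocab = sorted({t for doc in docs for t in doc})
    row = {t: i for i, t in enumerate(vocab)}
    df = Counter(t for doc in docs for t in set(doc))  # document frequencies
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for t, cnt in Counter(doc).items():
            A[row[t], j] = smart_tf(cnt) * smart_idf(len(docs), df[t])
    return A, vocab
```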

Summary
- Tokenization
- Removing stopwords
- Stemming
- Term weighting
  - TF: local
  - IDF: global
  - Normalization
- TF-IDF vector space
- Term-by-document matrix

The Reuters-21578 collection: documents, terms, and 135 classes
- The collection contains 21,578 documents, divided into a training set and a testing set.
- Reuters-21578 includes 135 categories; we use the ApteMod version of the TOPICS set.
- This results in 90 categories, with 7,770 training documents and 3,019 testing documents.

Preprocessing Procedures (cont.)
- [Example text after stopword elimination]
- [Example text after the Porter stemming algorithm]

[Diagram: the data mining process cycle: Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment, all revolving around the DATA]

Problems with the Vector Space Model
- How to define/select the 'basic concepts'?
  - The VS model treats each term as a basic vector, e.g., q = ('microsoft', 'software'), d = ('windows_xp').
- How to assign weights to different terms?
  - Need to distinguish common words from uninformative words.
  - A weight in the query indicates the importance of the term.
  - A weight in a document indicates how well the term characterizes the document.
- How to define the similarity/distance function?
- How to store the term-by-document matrix?

Choice of 'Basic Concepts'
[Figure: a document $D_1$ represented along candidate concept axes such as 'Java', 'Microsoft', 'Starbucks'] Which one is better?

Vector Space Model: Similarity
- Given:
  - a query $q = (q_1, q_2, \ldots, q_n)$, where $q_i$ is the term frequency of the $i$-th word;
  - a document $d_k = (d_{k,1}, d_{k,2}, \ldots, d_{k,n})$, where $d_{k,i}$ is the term frequency of the $i$-th word.
- The similarity of a query $q$ to a document $d_k$ is the cosine of the angle between them:
$$\mathrm{sim}(q, d_k) = \frac{q \cdot d_k}{\|q\| \, \|d_k\|}$$
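A small sketch of this scoring against every column of a term-by-document matrix; `cosine_scores` is our own name:

```python
import numpy as np

def cosine_scores(A, q):
    """Cosine of the angle between query vector q and each column of A.

    A: m x n term-by-document matrix, q: length-m query vector.
    All-zero documents receive a score of 0 instead of NaN.
    """
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.nan_to_num((A.T @ q) / norms)
```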

Terms and Documents

Terms:
- T1: Bab(y, ies, y's)
- T2: Child(ren's)
- T3: Guide
- T4: Health
- T5: Home
- T6: Infant
- T7: Proofing
- T8: Safety
- T9: Toddler

Documents:
- D1: Infant & Toddler First Aid
- D2: Babies & Children's Room (For Your Home)
- D3: Child Safety at Home
- D4: Your Baby's Health and Safety: From Infant to Toddler
- D5: Baby Proofing Basics
- D6: Your Guide to Easy Rust Proofing
- D7: Beanie Babies Collector's Guide

The 9 x 7 term-by-document matrix before normalization, where element $a_{ij}$ is the number of times term $i$ appears in document title $j$:

        D1  D2  D3  D4  D5  D6  D7
  T1     0   1   0   1   1   0   1
  T2     0   1   1   0   0   0   0
  T3     0   0   0   0   0   1   1
  T4     0   0   0   1   0   0   0
  T5     0   1   1   0   0   0   0
  T6     1   0   0   1   0   0   0
  T7     0   0   0   0   1   1   0
  T8     0   0   1   1   0   0   0
  T9     1   0   0   1   0   0   0

The 9 x 7 term-by-document matrix with unit columns (each column divided by its Euclidean norm):

        D1      D2      D3      D4      D5      D6      D7
  T1    0       0.5774  0       0.4472  0.7071  0       0.7071
  T2    0       0.5774  0.5774  0       0       0       0
  T3    0       0       0       0       0       0.7071  0.7071
  T4    0       0       0       0.4472  0       0       0
  T5    0       0.5774  0.5774  0       0       0       0
  T6    0.7071  0       0       0.4472  0       0       0
  T7    0       0       0       0       0.7071  0.7071  0
  T8    0       0       0.5774  0.4472  0       0       0
  T9    0.7071  0       0       0.4472  0       0       0
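Column normalization like this is a one-liner in NumPy; a small sketch (`unit_columns` is our own name) that leaves all-zero columns untouched:

```python
import numpy as np

def unit_columns(A):
    """Divide each column of A by its Euclidean norm."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0  # leave all-zero columns as they are
    return A / norms
```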

[Figure: Compressed Row Storage (CRS) of the matrix, using the arrays val, col_ind, and row_ptr]

[Figure: Compressed Column Storage (CCS) of the matrix, using the arrays val, row_ind, and col_ptr]
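To make the two storage schemes concrete, here is a tiny example using SciPy's sparse matrices, whose data/indices/indptr attributes correspond to the val/col_ind/row_ptr arrays of CRS and the val/row_ind/col_ptr arrays of CCS:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

A = np.array([[0, 1, 0],
              [2, 0, 3]])

crs = csr_matrix(A)  # compressed row storage
print(crs.data, crs.indices, crs.indptr)  # val=[1 2 3] col_ind=[1 0 2] row_ptr=[0 1 3]

ccs = csc_matrix(A)  # compressed column storage
print(ccs.data, ccs.indices, ccs.indptr)  # val=[2 1 3] row_ind=[1 0 1] col_ptr=[0 1 2 3]
```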

Short Review of Linear Algebra

The Terms that You Have to Know!
- Basis, linear independence, orthogonality
- Column space, row space, rank
- Linear combination
- Linear transformation
- Inner product
- Eigenvalue, eigenvector
- Projection

Matrix Factorization
- LU-factorization: $A = LU$
  - Very useful for solving linear systems of equations.
  - Some row exchanges may be required: $PA = LU$ with a permutation matrix $P$.
- QR-factorization: $A = QR$
  - Every matrix $A$ with linearly independent columns can be factored into $A = QR$, where the columns of $Q$ are orthonormal and $R$ is upper triangular and invertible.
  - When $A$ is square, $Q$ becomes an orthogonal matrix ($Q^T Q = QQ^T = I$).
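Both factorizations are available off the shelf in NumPy/SciPy; a quick sanity check of the properties listed above:

```python
import numpy as np
from scipy.linalg import lu

A = np.random.default_rng(0).random((4, 4))

P, L, U = lu(A)                          # A = P L U (row exchanges in P)
assert np.allclose(P @ L @ U, A)

Q, R = np.linalg.qr(A)                   # A = Q R
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(4))   # square Q is orthogonal
assert np.allclose(R, np.triu(R))        # R is upper triangular
```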

QR Factorization Simplifies the Least Squares Problem
- The normal equations for the LS problem $\min_x \|Ax - b\|_2$: $A^T A x = A^T b$.
- Substituting $A = QR$ gives $R^T Q^T Q R x = R^T Q^T b$, i.e., $R^T R x = R^T Q^T b$; since $R$ is invertible, this reduces to $Rx = Q^T b$, which is solved by back-substitution.
- Note: the orthonormal columns of $Q$ form a basis of the column space of the matrix $A$.
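A sketch of this route to the LS solution, checked against NumPy's built-in solver (the matrix here is random and assumed to have full column rank):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
A, b = rng.random((6, 3)), rng.random(6)

Q, R = np.linalg.qr(A)            # reduced QR of a tall matrix
x = solve_triangular(R, Q.T @ b)  # back-substitution for Rx = Q^T b

assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```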

Motivation for Computing the QR of the Term-by-Document Matrix
- The basis vectors of the column space of $A$ can be used to describe the semantic content of the corresponding text collection.
- Let $\theta_j$ be the angle between a query $q$ and the document vector $a_j$ (the $j$-th column of $A$). With $A = QR$ and $r_j$ the $j$-th column of $R$:
$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2} = \frac{r_j^T (Q^T q)}{\|r_j\|_2 \, \|q\|_2}$$
- That means we can keep $R$ and $Q$ instead of $A$.
- QR can also be applied to dimension reduction.
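A sketch of scoring from the factors alone, assuming nonzero document columns and a nonzero query; `qr_cosines` is a hypothetical name:

```python
import numpy as np

def qr_cosines(A, q):
    """Cosines between query q and each document column of A, via QR.

    Only R and Q^T q enter the final formula:
    cos(theta_j) = r_j^T (Q^T q) / (||r_j|| ||q||).
    """
    Q, R = np.linalg.qr(A)
    return (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))
```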