Vector Space Model
Any text object can be represented by a term vector
- Examples: documents, queries, sentences, ...
- A query is viewed as a short document
Similarity is determined by distance in a vector space
- Example: the cosine of the angle between the vectors
The SMART system: developed at Cornell University, still widely used
Vector Space Model
Documents are represented as vectors in a multi-dimensional Euclidean space
- Each axis = a term (token)
The coordinate of document d in the direction of term t is determined by:
- Term frequency TF(d,t): the number of times t occurs in document d, scaled in a variety of ways to normalize for document length
- Inverse document frequency IDF(t): scales down terms that occur in many documents
Term Frequency: Scaling
Let n(d,t) be the number of times term t occurs in document d. The raw count is usually dampened; the Cornell SMART system uses:
  TF(d,t) = 0                            if n(d,t) = 0
  TF(d,t) = 1 + log(1 + log n(d,t))      otherwise
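The SMART damping above can be sketched as a small Python function (a minimal illustration; the function name is ours, not from the SMART system):

```python
import math

def smart_tf(n_dt: int) -> float:
    """Cornell SMART damped term frequency:
    0 if the term is absent, else 1 + log(1 + log(n(d,t)))."""
    if n_dt == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(n_dt))

# Raw counts of 1, 10, and 1000 map to roughly 1.0, 2.19, and 3.07:
# the double logarithm keeps very frequent terms from dominating.
print(smart_tf(1), smart_tf(10), smart_tf(1000))
```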
Inverse Document Frequency
Not all axes (terms) in the vector space are equally important. IDF seeks to scale down the coordinates of terms that occur in many documents. Let |D| be the number of documents in the corpus and |D_t| the number of documents containing term t. The Cornell SMART system uses:
  IDF(t) = log( (1 + |D|) / |D_t| )
If |D_t| << |D|, the term t will enjoy a large IDF scale, and vice versa. Other variants are also used; these are mostly dampened functions of |D| / |D_t|.
TF-IDF Space
An obvious way to combine TF and IDF: the coordinate of document d on axis t is given by
  w(d,t) = TF(d,t) * IDF(t)
The general form of w(d,t) consists of three parts:
- A local weight for term t occurring in document d
- A global weight for term t occurring in the corpus
- A document normalization factor
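The local-times-global combination can be sketched end to end on a toy corpus (the three documents and helper names below are illustrative assumptions, not from the slides; IDF follows the SMART-style log((1 + |D|) / |D_t|) form):

```python
import math
from collections import Counter

# Hypothetical toy corpus of pre-tokenized documents
docs = [
    "baby infant toddler".split(),
    "child safety home".split(),
    "baby health safety".split(),
]

def idf(term, docs):
    """Global weight: log((1 + |D|) / |D_t|)."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / df)

def tfidf_vector(doc, docs):
    """Local weight (raw count) times global weight, per term."""
    tf = Counter(doc)
    return {t: n * idf(t, docs) for t, n in tf.items()}

# 'baby' occurs in two documents, so its IDF is smaller than
# that of 'infant', which occurs in only one.
print(tfidf_vector(docs[0], docs))
```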
Term-by-Document Matrix
A document collection (corpus) composed of n documents that are indexed by m terms (tokens) can be represented as an m x n matrix A, where entry a_ij is the weight of term i in document j.
Summary
- Tokenization
- Stopword removal
- Stemming
- Term weighting: TF (local), IDF (global), normalization
- TF-IDF vector space
- Term-by-document matrix
Reuters Corpus
The Reuters collection consists of documents indexed by terms, with 135 classes (categories) that documents belong to; the documents are split into a training set and a testing set. Using the ApteMod version of the TOPICS set results in 90 categories with 7,770 training documents and 3,019 testing documents.
Preprocessing Procedures (cont.)
[Figure: term statistics after stopword elimination and after applying the Porter stemming algorithm]
[Diagram: process cycle with phases Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, centered on the data]
Problems with the Vector Space Model
- How to define/select the 'basic concepts'? The VS model treats each term as a basic vector, e.g., q = ('microsoft', 'software'), d = ('windows_xp')
- How to assign weights to different terms? We need to distinguish common words from uninformative words; a weight in the query indicates the importance of the term, while a weight in the document indicates how well the term characterizes the document
- How to define the similarity/distance function?
- How to store the term-by-document matrix?
Choice of 'Basic Concepts'
[Figure: a document D1 represented against candidate concept axes such as 'Java', 'Microsoft', and 'Starbucks']
Which one is better?
Vector Space Model: Similarity
Given:
- A query q = (q_1, q_2, ..., q_n), where q_i is the term frequency of the i-th word
- A document d_k = (d_k,1, d_k,2, ..., d_k,n), where d_k,i is the term frequency of the i-th word
The similarity of a query q to a document d_k is the cosine of the angle between them:
  sim(q, d_k) = (q . d_k) / (||q|| ||d_k||)
              = sum_i q_i d_k,i / ( sqrt(sum_i q_i^2) * sqrt(sum_i d_k,i^2) )
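The cosine formula translates directly into Python (a minimal sketch over plain lists; in practice the vectors would be sparse):

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document term vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

# Vectors sharing one of two terms have cosine 0.5;
# orthogonal vectors have cosine 0; parallel vectors have cosine 1.
print(cosine([1, 1, 0], [1, 0, 1]))  # 0.5
print(cosine([1, 0], [0, 1]))        # 0.0
print(cosine([2, 0], [1, 0]))        # 1.0
```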
Terms and Documents
Terms:
  T1: Bab(y, ies, y's)
  T2: Child(ren's)
  T3: Guide
  T4: Health
  T5: Home
  T6: Infant
  T7: Proofing
  T8: Safety
  T9: Toddler
Documents:
  D1: Infant & Toddler First Aid
  D2: Babies & Children's Room (For Your Home)
  D3: Child Safety at Home
  D4: Your Baby's Health and Safety: From Infant to Toddler
  D5: Baby Proofing Basics
  D6: Your Guide to Easy Rust Proofing
  D7: Beanie Babies Collector's Guide
The 9 x 7 term-by-document matrix before normalization, where element a_ij is the number of times term i appears in the title of document j:

        D1  D2  D3  D4  D5  D6  D7
  T1     0   1   0   1   1   0   1
  T2     0   1   1   0   0   0   0
  T3     0   0   0   0   0   1   1
  T4     0   0   0   1   0   0   0
  T5     0   1   1   0   0   0   0
  T6     1   0   0   1   0   0   0
  T7     0   0   0   0   1   1   0
  T8     0   0   1   1   0   0   0
  T9     1   0   0   1   0   0   0
The 9 x 7 term-by-document matrix with unit columns, obtained by dividing each column by its Euclidean norm (so each nonzero entry becomes 1/sqrt(2) ~ 0.7071, 1/sqrt(3) ~ 0.5774, or 1/sqrt(5) ~ 0.4472, depending on how many terms the title contains):
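Both matrices can be built and checked with NumPy (a sketch assuming the raw counts are taken from the baby/child/guide titles above):

```python
import numpy as np

# 9 x 7 term-by-document matrix: rows are terms T1..T9,
# columns are document titles D1..D7
A = np.array([
    [0, 1, 0, 1, 1, 0, 1],  # T1 bab(y/ies/y's)
    [0, 1, 1, 0, 0, 0, 0],  # T2 child(ren's)
    [0, 0, 0, 0, 0, 1, 1],  # T3 guide
    [0, 0, 0, 1, 0, 0, 0],  # T4 health
    [0, 1, 1, 0, 0, 0, 0],  # T5 home
    [1, 0, 0, 1, 0, 0, 0],  # T6 infant
    [0, 0, 0, 0, 1, 1, 0],  # T7 proofing
    [0, 0, 1, 1, 0, 0, 0],  # T8 safety
    [1, 0, 0, 1, 0, 0, 0],  # T9 toddler
], dtype=float)

# Divide each column by its 2-norm to get unit columns
A_unit = A / np.linalg.norm(A, axis=0)
print(np.linalg.norm(A_unit, axis=0))  # every column now has length 1
```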
Sparse storage of the term-by-document matrix:
- Compressed Row Storage (CRS): three arrays, val (the nonzero values, row by row), col_ind (their column indices), and row_ptr (the position in val where each row starts)
- Compressed Column Storage (CCS): three arrays, val (the nonzero values, column by column), row_ind (their row indices), and col_ptr (the position in val where each column starts)
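SciPy's sparse matrix classes expose exactly these three arrays (as data, indices, and indptr); a small example on an assumed 3 x 3 matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

A = np.array([[0, 1, 0],
              [2, 0, 3],
              [0, 0, 4]], dtype=float)

# Compressed Row Storage: values row by row, their column indices,
# and pointers to where each row starts in the value array
R = csr_matrix(A)
print(R.data, R.indices, R.indptr)   # [1. 2. 3. 4.] [1 0 2 2] [0 1 3 4]

# Compressed Column Storage: values column by column, their row indices,
# and pointers to where each column starts
C = csc_matrix(A)
print(C.data, C.indices, C.indptr)   # [2. 1. 3. 4.] [1 0 1 2] [0 1 2 4]
```

Term-by-document matrices are very sparse, so only the nonzeros plus two index arrays are stored instead of all m x n entries.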
Short Review of Linear Algebra
The Terms that You Have to Know!
- Basis, linear independence, orthogonality
- Column space, row space, rank
- Linear combination
- Linear transformation
- Inner product
- Eigenvalue, eigenvector
- Projection
Matrix Factorization
LU-Factorization: A = LU
- Very useful for solving systems of linear equations
- Some row exchanges may be required: PA = LU
QR-Factorization: every matrix A with linearly independent columns can be factored into A = QR. The columns of Q are orthonormal, and R is upper triangular and invertible. When A is square, Q becomes an orthogonal matrix (Q^T = Q^-1).
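The stated QR properties can be verified numerically with NumPy's built-in factorization (a sketch on a random tall matrix, which has linearly independent columns with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

Q, R = np.linalg.qr(A)                  # thin QR: Q is 5x3, R is 3x3
print(np.allclose(Q.T @ Q, np.eye(3)))  # columns of Q are orthonormal
print(np.allclose(Q @ R, A))            # A = QR
print(np.allclose(R, np.triu(R)))       # R is upper triangular
```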
QR Factorization Simplifies the Least Squares Problem
The normal equation for the LS problem min ||Ax - b|| is A^T A x = A^T b. Substituting A = QR gives R^T Q^T Q R x = R^T Q^T b; since Q^T Q = I and R is invertible, this reduces to R x = Q^T b, which is solved by back substitution.
Note: the columns of the orthogonal matrix Q form an orthonormal basis for the column space of the matrix A.
Motivation for Computing the QR Factorization of the Term-by-Document Matrix
The basis vectors of the column space of A can be used to describe the semantic content of the corresponding text collection. Let theta_j be the angle between a query q and the document vector a_j:
  cos theta_j = (a_j^T q) / (||a_j|| ||q||)
With A = QR, each column is a_j = Q r_j, so ||a_j|| = ||r_j|| and a_j^T q = r_j^T (Q^T q):
  cos theta_j = (r_j^T (Q^T q)) / (||r_j|| ||q||)
That means we can keep R and Q^T q instead of A. QR can also be applied to dimension reduction.
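The equivalence between scoring against A directly and scoring via R and Q^T q can be checked numerically (a sketch on an assumed toy 4 x 3 term-by-document matrix and query vector, not data from the slides):

```python
import numpy as np

A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])        # toy term-by-document matrix
q = np.array([1., 1., 0., 0.])      # toy query vector

Q, R = np.linalg.qr(A)              # thin QR

# Direct cosines: cos(theta_j) = (a_j . q) / (||a_j|| ||q||)
direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Via QR: a_j = Q r_j, so ||a_j|| = ||r_j|| and a_j . q = r_j . (Q^T q);
# only R and the small vector Q^T q are needed per query
qt_q = Q.T @ q
via_qr = (R.T @ qt_q) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

print(np.allclose(direct, via_qr))  # True
```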