
1 TF-IDF Space  An obvious way to combine TF and IDF: the coordinate of document $d_j$ on the axis of term $t_i$ is given by $a_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i$. The general form of $a_{ij}$ consists of three parts: a local weight $l_{ij}$ for term $i$ occurring in document $j$, a global weight $g_i$ for term $i$ occurring in the corpus, and a document normalization factor $n_j$, so that $a_{ij} = l_{ij}\, g_i\, n_j$.

2 Term-by-Document Matrix  A document collection (corpus) composed of $n$ documents indexed by $m$ terms (tokens) can be represented as an $m \times n$ matrix $A = [a_{ij}]$.
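To make this construction concrete, here is a minimal Python sketch that builds the $m \times n$ term-by-document matrix with TF-IDF weights (raw term frequency as the local weight, log inverse document frequency as the global weight, and no normalization factor). The three-document corpus and whitespace tokenizer are illustrative stand-ins, not from the slides.

```python
import math
from collections import Counter

# Toy corpus and whitespace tokenizer (illustrative stand-ins).
docs = [
    "microsoft releases windows software",
    "java software runs on windows",
    "starbucks coffee java",
]
tokenized = [d.split() for d in docs]
counts = [Counter(doc) for doc in tokenized]

# The m index terms and the n documents.
terms = sorted({t for doc in tokenized for t in doc})
n = len(docs)

# Global weight: idf_i = log(n / df_i), where df_i counts docs containing term i.
idf = {t: math.log(n / sum(t in doc for doc in tokenized)) for t in terms}

# m x n term-by-document matrix: a_ij = tf_ij * idf_i (no normalization factor).
A = [[counts[j][t] * idf[t] for j in range(n)] for t in terms]

for t, row in zip(terms, A):
    print(f"{t:>10}", ["%.2f" % w for w in row])
```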

3 Summary  Tokenization; stopword removal; stemming; term weighting (TF: local, IDF: global, normalization); the TF-IDF vector space; the term-by-document matrix.

4 Problems with the Vector Space Model  How to define/select the 'basic concepts'? The VS model treats each term as a basis vector, e.g. q = ('microsoft', 'software'), d = ('windows_xp'), so related but distinct terms do not match. How to assign weights to different terms? We need to distinguish common words from uninformative words; a weight in the query indicates the importance of a term, while a weight in a document indicates how well the term characterizes that document. How to define the similarity/distance function? How to store the term-by-document matrix?

5 Choice of 'Basic Concepts'  (figure: a document $D_1$ plotted in a space whose axes are the terms Java, Microsoft, and Starbucks)

6 Short Review of Linear Algebra

7 The Terms that You Have to Know!  Basis, linear independence, orthogonality; column space, row space, rank; linear combination; linear transformation; inner product; eigenvalue, eigenvector; projection.

8 Least Squares  Problem: given $A$ with full column rank and a vector $b$, find $x$ minimizing $\|Ax - b\|$; geometrically, this means finding the projection of $b$ onto the column space of $A$. The normal equations for the LS problem: $A^{\top}Ax = A^{\top}b$. The projection matrix: $P = A(A^{\top}A)^{-1}A^{\top}$. If $A$ has orthonormal columns, then the LS problem becomes easy: $x = A^{\top}b$; think of an orthonormal axis system.
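A small numpy sketch of these statements: solve the normal equations, form the projection matrix, and check that the residual $b - Pb$ is orthogonal to the column space of $A$. The random data are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # full column rank (almost surely)
b = rng.standard_normal(6)

# Normal equations: A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Projection matrix P = A (A^T A)^{-1} A^T; p = Pb is the projection of b.
P = A @ np.linalg.inv(A.T @ A) @ A.T
p = P @ b

print(np.allclose(p, A @ x))          # the projection is Ax
print(np.allclose(A.T @ (b - p), 0))  # residual is orthogonal to col(A)
```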

9 Matrix Factorization  LU-Factorization: $PA = LU$, very useful for solving systems of linear equations; some row exchanges may be required, hence the permutation matrix $P$. QR-Factorization: every matrix $A$ with linearly independent columns can be factored into $A = QR$, where the columns of $Q$ are orthonormal and $R$ is upper triangular and invertible. When $m = n$ and all matrices are square, $Q$ becomes an orthogonal matrix ($Q^{\top} = Q^{-1}$).
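A quick numpy check of the stated QR properties on an arbitrary random matrix (the LU factorization is available as scipy.linalg.lu; it is omitted here to keep the sketch numpy-only):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))   # columns linearly independent (almost surely)

# Reduced QR: Q is 5x3 with orthonormal columns, R is 3x3 upper triangular.
Q, R = np.linalg.qr(A)

print(np.allclose(Q.T @ Q, np.eye(3)))  # orthonormal columns
print(np.allclose(R, np.triu(R)))       # R is upper triangular
print(np.allclose(Q @ R, A))            # A = QR
```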

10 QR Factorization Simplifies the Least Squares Problem  The normal equations for the LS problem: $A^{\top}Ax = A^{\top}b$. Substituting $A = QR$ gives $R^{\top}Q^{\top}QRx = R^{\top}Q^{\top}b$, which reduces to the triangular system $Rx = Q^{\top}b$. Note: the orthonormal columns of $Q$ span the column space of matrix $A$, so the LS problem of finding the projection of $b$ onto that column space is solved by back-substitution.
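Sketch of this reduction on random placeholder data; the QR-based solution matches numpy's built-in least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)

# R x = Q^T b: R is upper triangular, so this is back-substitution
# (np.linalg.solve used here for brevity).
x = np.linalg.solve(R, Q.T @ b)

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```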

11 Motivation for Computing QR of the Term-by-Document Matrix  The basis vectors of the column space of $A$ can be used to describe the semantic content of the corresponding text collection. Let $\theta_j$ be the angle between a query $q$ and the document vector $a_j$; then $\cos\theta_j = \frac{a_j^{\top}q}{\|a_j\|\,\|q\|} = \frac{(Re_j)^{\top}(Q^{\top}q)}{\|Re_j\|\,\|q\|}$, since $a_j = QRe_j$ and $Q$ has orthonormal columns. That means we can keep $Q$ and $R$ instead of $A$. QR can also be applied to dimension reduction.
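A minimal check, with a random stand-in for the term-by-document matrix, that the query-document cosines computed from $Q$ and $R$ alone agree with those computed from $A$:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((8, 4))   # term-by-document matrix: 8 terms, 4 documents
q = rng.random(8)        # query vector in term space

Q, R = np.linalg.qr(A)

# a_j = Q r_j, so a_j . q = r_j . (Q^T q) and |a_j| = |r_j|
# (Q has orthonormal columns); A itself is never touched below.
qt = Q.T @ q
cos_qr = (R.T @ qt) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

cos_direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
print(np.allclose(cos_qr, cos_direct))  # True
```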

12 Singular Value Decomposition (SVD)  $A = U\Sigma V^{\top}$. The columns of $U$ are eigenvectors of $AA^{\top}$ and the columns of $V$ are eigenvectors of $A^{\top}A$; the singular values on the diagonal of $\Sigma$ are the square roots of the nonzero eigenvalues of both $AA^{\top}$ and $A^{\top}A$.
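These eigenvector/eigenvalue relationships are easy to verify numerically; a small sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Squared singular values = eigenvalues of A^T A (eigvalsh returns them ascending).
print(np.allclose(s**2, np.linalg.eigvalsh(A.T @ A)[::-1]))

# Columns of V are eigenvectors of A^T A, columns of U of A A^T.
print(np.allclose((A.T @ A) @ V, V @ np.diag(s**2)))
print(np.allclose((A @ A.T) @ U, U @ np.diag(s**2)))
print(np.allclose(U @ np.diag(s) @ Vt, A))   # A = U Sigma V^T
```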

13 Singular Value Decomposition (SVD)

14 Latent Semantic Indexing (LSI)  Basic idea: explore the correlation between words and documents. Two words are correlated when they co-occur together many times; two documents are correlated when they share many words.

15 Latent Semantic Indexing (LSI)  Computation: use the singular value decomposition (SVD), keeping the top $m$ factors: $X \approx U_m \Sigma_m V_m^{\top}$, where $m$ is the number of concepts/topics. The columns of $U_m$ give the representation of the concepts in term space; the columns of $V_m$ give the representation of the concepts in document space.
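A minimal LSI sketch along these lines, assuming a toy term-by-document matrix and $m = 2$ concepts:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.random((9, 6))   # toy term-by-document matrix: 9 terms, 6 documents

U, s, Vt = np.linalg.svd(X, full_matrices=False)

m = 2                    # number of latent concepts/topics
Um, Sm, Vmt = U[:, :m], np.diag(s[:m]), Vt[:m, :]

X_m = Um @ Sm @ Vmt      # rank-m approximation X ~ U_m Sigma_m V_m^T

# Concepts in term space: columns of Um.  Documents in concept space:
# columns of Sigma_m V_m^T (each document becomes an m-dimensional vector).
docs_in_concept_space = Sm @ Vmt
print(docs_in_concept_space.shape)  # (2, 6)
```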

16 SVD: Example (m = 2)  (worked numeric example, continued on slides 17-19: the term-by-document matrix $X$ and its rank-2 factors $U_2$, $\Sigma_2$, $V_2^{\top}$ were shown as figures, which are not preserved in this transcript)

20 SVD: Eigenvalues  Determining $m$ (how many singular values to keep) is usually difficult.
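Since choosing $m$ is difficult, here is one common heuristic, offered as an assumption rather than something the slides prescribe: keep the smallest $m$ whose singular values capture a fixed fraction of the total energy $\sum_i \sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random((9, 6))
s = np.linalg.svd(X, compute_uv=False)   # singular values, descending

# Smallest m capturing at least 90% of the energy (threshold is a judgment call).
energy = np.cumsum(s**2) / np.sum(s**2)
m = int(np.searchsorted(energy, 0.90)) + 1
print(energy.round(3), m)
```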

21 SVD: Orthogonality  The left singular vectors are mutually orthogonal, $u_1 \cdot u_2 = 0$, and so are the right singular vectors, $v_1 \cdot v_2 = 0$.

22 SVD: Properties  rank(S): the maximum number of row (equivalently, column) vectors of a matrix $S$ that are linearly independent. SVD produces the best low-rank approximation: in the example, the original matrix has rank(X) = 9, while the truncated reconstruction $X'$ has rank(X') = 2.
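A numerical illustration of the best-low-rank-approximation property (the Eckart-Young theorem), on a random stand-in rather than the example matrix from the slides: the Frobenius error of the rank-$k$ truncation equals the energy in the discarded singular values, and no rank-$k$ matrix achieves less.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.random((9, 6))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation X'

print(np.linalg.matrix_rank(Xk))             # 2
# Eckart-Young: |X - X'|_F = sqrt(sum of the discarded sigma_i^2).
print(np.linalg.norm(X - Xk, "fro"))
print(np.sqrt(np.sum(s[k:] ** 2)))           # same value
```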

23 SVD: Visualization  (figure: the decomposition $X = U\Sigma V^{\top}$ shown graphically)

24 SVD tries to preserve the Euclidean distances between document vectors.

25 Principal Components Analysis  An unsupervised method for dimension reduction. The first principal component is the direction $w$ such that the projections of all data points onto this direction are most spread out. An important fact: if the data are centered (zero mean), then the variance of the projections is $\mathrm{Var}(w^{\top}x) = w^{\top}Cw$, where $C$ is the sample covariance matrix. We are looking for the direction $w$ with $\|w\| = 1$ such that $w^{\top}Cw$ is maximized.

26 Principal Components Analysis  An unsupervised method for dimension reduction. Maximizing $w^{\top}Cw$ subject to $\|w\| = 1$ leads to the eigenvalue problem $Cw = \lambda w$. Don't forget your purpose: the eigenvector with the largest eigenvalue is the right choice!
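A compact sketch of this recipe on synthetic data: center, form the covariance matrix, take the top eigenvector, and confirm that the variance of the projections equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.3])  # toy data

Xc = X - X.mean(axis=0)              # center: PCA assumes zero-mean data
C = (Xc.T @ Xc) / (len(Xc) - 1)      # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
w1 = eigvecs[:, -1]                   # first principal component

proj = Xc @ w1
print(np.allclose(proj.var(ddof=1), eigvals[-1]))  # variance = top eigenvalue
```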

27 The Second Principal Component  Add one more constraint for the second principal component: it must be orthogonal to the first. The solution is the eigenvector of $C$ with the second-largest eigenvalue.

28 Singular Value Decomposition (SVD)  Assume $m > n$ and $A = U\Sigma V^{\top}$. The columns of $U$ are eigenvectors of $AA^{\top}$ and the columns of $V$ are eigenvectors of $A^{\top}A$; the singular values are the square roots of the nonzero eigenvalues of both $AA^{\top}$ and $A^{\top}A$.

29 How to Compute SVD?  Q1: Which of $A^{\top}A$ ($n \times n$) and $AA^{\top}$ ($m \times m$) is easier to compute? Q2: Is there any relation between the eigenvectors of $A^{\top}A$ and those of $AA^{\top}$?
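One way to answer both questions, as a sketch under the slide's assumption $m > n$: eigendecompose the smaller Gram matrix $A^{\top}A$ to get $V$ and the singular values (Q1), then recover $U = AV\Sigma^{-1}$, which shows how eigenvectors of $A^{\top}A$ yield eigenvectors of $AA^{\top}$ for the shared nonzero eigenvalues (Q2).

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 8, 3                          # m > n: A^T A (n x n) is the smaller one
A = rng.standard_normal((m, n))

# Q1: eigendecompose the cheaper n x n Gram matrix A^T A.
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]    # sort descending
eigvals, V = eigvals[order], V[:, order]
s = np.sqrt(eigvals)                 # singular values

# Q2: U = A V Sigma^{-1} maps eigenvectors of A^T A to eigenvectors
# of A A^T with the same nonzero eigenvalues.
U = (A @ V) / s
print(np.allclose(U @ np.diag(s) @ V.T, A))            # A = U Sigma V^T
print(np.allclose((A @ A.T) @ U, U @ np.diag(s**2)))   # eigvec check
```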

