
1 Recap: maximum likelihood estimation
Data: a collection of words $w_1, w_2, \dots, w_n$. Model: multinomial distribution $p(W)$ with parameters $\theta_i = p(w_i)$, i.e., a unigram language model. Maximum likelihood estimator: $\hat{\theta} = \arg\max_\theta p(W|\theta)$.

$p(W|\theta) = \binom{N}{c(w_1), \dots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$

$\log p(W|\theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i$

Using the Lagrange multiplier approach, we tune $\theta_i$ to maximize
$L(W,\theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$

Set partial derivatives to zero: $\frac{\partial L}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda}$

Since we require $\sum_{i=1}^{N} \theta_i = 1$ (a probability constraint), we get $\lambda = -\sum_{i=1}^{N} c(w_i)$

ML estimate: $\hat{\theta}_i = \frac{c(w_i)}{\sum_{i=1}^{N} c(w_i)}$

CS 6501: Text Mining
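A minimal sketch of this closed-form ML estimate in Python; the toy word list and variable names are illustrative, not from the slides:

```python
from collections import Counter

# Toy "data" W: a bag of word tokens (illustrative example).
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

counts = Counter(words)          # c(w_i) for each distinct word
total = sum(counts.values())     # sum_i c(w_i)

# ML estimate: theta_i = c(w_i) / sum_i c(w_i)
theta = {w: c / total for w, c in counts.items()}

print(theta)                                      # e.g. {'the': 0.375, 'cat': 0.25, ...}
assert abs(sum(theta.values()) - 1.0) < 1e-12     # estimated probabilities sum to one
```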

2 Recap: illustration of N-gram language model smoothing
Maximum likelihood estimate: $p(w_i \mid w_{i-1}, \dots, w_{i-n+1}) = \frac{c(w_i, w_{i-1}, \dots, w_{i-n+1})}{c(w_{i-1}, \dots, w_{i-n+1})}$

Smoothed LM: discount probability mass from the seen words and assign nonzero probabilities to the unseen words. [Figure: ML vs. smoothed word-probability distributions over the vocabulary.] CS 6501: Text Mining
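As one concrete, illustrative instance of the smoothing idea above (not necessarily the scheme in the slide's figure), here is a small add-delta (additive) smoothing sketch for a bigram model; the toy corpus, the delta value, and the simplified vocabulary handling are assumptions made for the example:

```python
from collections import Counter

def additive_bigram_prob(w, history, bigram_counts, unigram_counts, vocab_size, delta=0.5):
    """Add-delta smoothed estimate of p(w | history) for a bigram LM.

    Discounts the ML estimate for seen bigrams and reserves probability
    mass for unseen ones, as in the smoothing picture above.
    """
    return (bigram_counts[(history, w)] + delta) / (unigram_counts[history] + delta * vocab_size)

# Tiny illustrative corpus; a real setup would handle vocabulary and sentence boundaries properly.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

print(additive_bigram_prob("cat", "the", bigram_counts, unigram_counts, V))  # seen bigram
print(additive_bigram_prob("dog", "the", bigram_counts, unigram_counts, V))  # unseen bigram, still > 0
```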

3 Recap: perplexity
The inverse of the likelihood of the test set as assigned by the language model, normalized by the number of words:

$PP(w_1, \dots, w_N) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_{i-1}, \dots, w_{i-n+1})}}$

where $p(w_i \mid w_{i-1}, \dots, w_{i-n+1})$ comes from the N-gram language model. CS 6501: Text Mining
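A small helper computing this perplexity from per-token model probabilities, working in log space for numerical stability; the example probabilities are made up purely for illustration:

```python
import math

def perplexity(token_probs):
    """Perplexity of a test sequence given the model probabilities
    p(w_i | w_{i-1}, ..., w_{i-n+1}) of its tokens."""
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)   # log-likelihood of the test set
    return math.exp(-log_sum / n)                      # inverse likelihood, N-th root

# Illustrative values: probabilities a model assigns to a 4-token test set
print(perplexity([0.2, 0.1, 0.25, 0.05]))              # lower is better
```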

4 Latent Semantic Analysis
Hongning Wang

5 VS model in practice Documents and queries are represented as term vectors, but terms are not necessarily orthogonal to each other. Synonymy: car vs. automobile. Polysemy: fly (action vs. insect). CS6501: Text Mining

6 Choosing basis for VS model
A concept space is preferred, so that the semantic gap can be bridged. [Figure: documents D1 through D5 and a query placed in a concept space with axes Sports, Education, and Finance.] CS6501: Text Mining

7 How to build such a space
Automatic term expansion: construction of a thesaurus (e.g., WordNet) or clustering of words. Word sense disambiguation: dictionary-based (the relation between a pair of words in text should be similar to the relation in the dictionary's description), or by exploring word usage context. CS6501: Text Mining

8 How to build such a space
Latent Semantic Analysis. Assumption: there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice during text generation. In other words, the observed term-document association data is contaminated by random noise. CS6501: Text Mining

9 How to build such a space
Solution: low-rank matrix approximation. [Figure: the observed term-document matrix is viewed as a *true* concept-document matrix corrupted by random noise from the word selection in each document.] CS6501: Text Mining

10 Latent Semantic Analysis (LSA)
Low-rank approximation of the term-document matrix $C_{M \times N}$. Goal: remove noise in the observed term-document association data. Solution: find the rank-$k$ matrix closest to the original matrix in Frobenius norm:

$\hat{Z} = \arg\min_{Z:\, \mathrm{rank}(Z)=k} \|C - Z\|_F = \arg\min_{Z:\, \mathrm{rank}(Z)=k} \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} (C_{ij} - Z_{ij})^2}$

CS6501: Text Mining

11 Basic concepts in linear algebra
Symmetric matrix: $C = C^T$. Rank of a matrix: the number of linearly independent rows (columns) in a matrix $C_{M \times N}$; $\mathrm{rank}(C_{M \times N}) \le \min(M, N)$. CS6501: Text Mining

12 Basic concepts in linear algebra
Eigen system: for a square matrix $C_{M \times M}$, if $Cx = \lambda x$, then $x$ is called a right eigenvector of $C$ and $\lambda$ is the corresponding eigenvalue. For a symmetric full-rank matrix $C_{M \times M}$, we have the eigen-decomposition $C = Q \Lambda Q^T$, where the columns of $Q$ are the orthonormal eigenvectors of $C$ and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $C$. CS6501: Text Mining
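A quick numerical check of this eigen-decomposition with NumPy; the small symmetric matrix is arbitrary and only serves to verify $C = Q \Lambda Q^T$:

```python
import numpy as np

# A small symmetric matrix (illustrative values)
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

eigvals, Q = np.linalg.eigh(C)   # eigh: eigen-decomposition for symmetric matrices
Lam = np.diag(eigvals)

# Columns of Q are orthonormal eigenvectors; C is recovered as Q Lam Q^T
assert np.allclose(Q @ Q.T, np.eye(3))
assert np.allclose(Q @ Lam @ Q.T, C)
```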

13 Basic concepts in linear algebra
Singular value decomposition (SVD): for a matrix $C_{M \times N}$ with rank $r$, we have $C = U \Sigma V^T$, where $U_{M \times r}$ and $V_{N \times r}$ have orthonormal columns and $\Sigma$ is an $r \times r$ diagonal matrix with $\Sigma_{ii} = \sqrt{\lambda_i}$, where $\lambda_1 \ge \dots \ge \lambda_r$ are the nonzero eigenvalues of $C C^T$. We define $C^k_{M \times N} = U_{M \times k} \Sigma_{k \times k} V^T_{N \times k}$, where we place the $\Sigma_{ii}$ in descending order and set $\Sigma_{ii} = \sqrt{\lambda_i}$ for $i \le k$ and $\Sigma_{ii} = 0$ for $i > k$. CS6501: Text Mining
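A short NumPy sketch verifying these relationships on a random matrix (the matrix itself is arbitrary): the squared singular values match the nonzero eigenvalues of $C C^T$, and $U \Sigma V^T$ reconstructs $C$:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((5, 3))                     # any M x N matrix (illustrative)

U, s, Vt = np.linalg.svd(C, full_matrices=False)    # C = U diag(s) V^T

# Singular values are the square roots of the eigenvalues of C C^T
eigvals = np.linalg.eigvalsh(C @ C.T)               # returned in ascending order
assert np.allclose(np.sort(s**2), np.sort(eigvals)[-len(s):])

# Reconstruction check
assert np.allclose(U @ np.diag(s) @ Vt, C)
```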

14 Latent Semantic Analysis (LSA)
Solve LSA by SVD. Procedure of LSA: perform SVD on the document-term adjacency matrix; construct $C^k_{M \times N}$ by keeping only the largest $k$ singular values in $\Sigma$ non-zero; map to a lower-dimensional space:

$\hat{Z} = \arg\min_{Z:\, \mathrm{rank}(Z)=k} \|C - Z\|_F = \arg\min_{Z:\, \mathrm{rank}(Z)=k} \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} (C_{ij} - Z_{ij})^2} = C^k_{M \times N}$

CS6501: Text Mining
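A minimal sketch of this procedure with NumPy's SVD; the toy count matrix and the choice $k = 2$ are illustrative only. Keeping the top $k$ singular values yields the Frobenius-norm-optimal rank-$k$ approximation (the Eckart-Young theorem), which is exactly the LSA objective above:

```python
import numpy as np

def lsa_low_rank(C, k):
    """Rank-k approximation C^k = U_k Sigma_k V_k^T via truncated SVD."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy document-term count matrix (values are illustrative only)
C = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 2.0]])

Ck = lsa_low_rank(C, k=2)
print(np.round(Ck, 2))
print("Frobenius error:", np.linalg.norm(C - Ck, "fro"))
```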

15 Latent Semantic Analysis (LSA)
Another interpretation: $C_{M \times N}$ is the document-term adjacency matrix. Let $D_{M \times M} = C_{M \times N} C^T_{M \times N}$, where $D_{ij}$ is a document-document similarity obtained by counting how many terms co-occur in $d_i$ and $d_j$. Then $D = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma^2 U^T$, which is the eigen-decomposition of the document-document similarity matrix. Document $d_i$'s new representation is then $(U\Sigma)_i$ in this space; in the lower-dimensional space, we use only the first $k$ elements of $(U\Sigma)_i$ to represent $d_i$. The same analysis applies to $T_{N \times N} = C^T_{M \times N} C_{M \times N}$. CS6501: Text Mining
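A numerical sanity check of this interpretation on a random, purely illustrative matrix: $D = C C^T$ equals $U \Sigma^2 U^T$, and the rows of $U\Sigma$ reproduce the document-document similarities:

```python
import numpy as np

rng = np.random.default_rng(1)
C = rng.standard_normal((4, 6))            # rows = documents, columns = terms (illustrative)

U, s, Vt = np.linalg.svd(C, full_matrices=False)

D = C @ C.T                                # document-document co-occurrence similarity
assert np.allclose(D, U @ np.diag(s**2) @ U.T)   # eigen-decomposition D = U Sigma^2 U^T

doc_coords = U @ np.diag(s)                # row i is document d_i's new representation (U Sigma)_i
assert np.allclose(doc_coords @ doc_coords.T, D)
```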

16 Geometric interpretation of LSA
$C^k_{M \times N}(i,j)$ measures the relatedness between $d_i$ and $w_j$ in the $k$-dimensional space. Therefore, as $C^k_{M \times N} = U_{M \times k} \Sigma_{k \times k} V^T_{N \times k}$, document $d_i$ is represented as $(U_{M \times k} \Sigma_{k \times k})_i$ and term $w_j$ is represented as $(V_{N \times k} \Sigma_{k \times k})_j$. CS6501: Text Mining
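A small helper that extracts these $k$-dimensional document and term coordinates; the count matrix and $k$ are illustrative assumptions:

```python
import numpy as np

def lsa_coordinates(C, k):
    """k-dimensional LSA coordinates: documents (rows of C) -> rows of U_k Sigma_k,
    terms (columns of C) -> rows of V_k Sigma_k."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    doc_vecs = U[:, :k] * s[:k]            # (U_k Sigma_k): one k-dim vector per document
    term_vecs = Vt[:k, :].T * s[:k]        # (V_k Sigma_k): one k-dim vector per term
    return doc_vecs, term_vecs

# Illustrative document-term count matrix
C = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

docs, terms = lsa_coordinates(C, k=2)
print(docs.shape, terms.shape)             # (3, 2) and (3, 2)
```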

17 Latent Semantic Analysis (LSA)
[Figure: example LSA result on documents about HCI, visualization, and graph theory.] CS6501: Text Mining

18 What are those dimensions in LSA
Principal component analysis. CS6501: Text Mining

19 Latent Semantic Analysis (LSA)
What we have achieved via LSA: terms/documents that are closely associated are placed near one another in the new space, and terms that do not occur in a document may still be close to it if that is consistent with the major patterns of association in the data. A good choice of concept space for the VS model! CS6501: Text Mining

20 LSA for retrieval Project queries into the new document space
$\hat{q} = q\, V_{N \times k} \Sigma^{-1}_{k \times k}$: treat the query as a pseudo-document given by its term vector, then compute cosine similarity between the query and the documents in this lower-dimensional space. CS6501: Text Mining
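A sketch of this query folding-in step, assuming a toy document-term matrix and a one-term query (both made up for illustration); it computes $\hat{q} = q V_k \Sigma_k^{-1}$ and then the cosine similarities to the documents:

```python
import numpy as np

def fold_in_query(q, C, k):
    """Project a query term vector q into the k-dimensional LSA document space
    (q_hat = q V_k Sigma_k^{-1}) and rank documents by cosine similarity."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    Vk = Vt[:k, :].T                        # N x k term subspace
    q_hat = q @ Vk @ np.diag(1.0 / s[:k])   # k-dim query representation
    doc_vecs = U[:, :k] * s[:k]             # k-dim document representations (U_k Sigma_k)
    sims = (doc_vecs @ q_hat) / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    return q_hat, sims

# Illustrative document-term matrix (3 documents, 4 terms) and a one-term query
C = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 2.0]])
q = np.array([0.0, 1.0, 0.0, 0.0])          # pseudo-document for the query

q_hat, sims = fold_in_query(q, C, k=2)
print(np.round(sims, 3))                     # cosine similarity to each document
```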

21 LSA for retrieval
[Figure: the query q: "human computer interaction" projected into the LSA space alongside the HCI and graph theory documents.] CS6501: Text Mining

22 Discussions
Computationally expensive: time complexity $O(MN^2)$. Optimal choice of $k$ is hard to determine. Difficult to handle a dynamic corpus. Difficult to interpret the decomposition results (we will come back to this later!). CS6501: Text Mining

23 LSA beyond text Collaborative filtering CS6501: Text Mining
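The slide only names collaborative filtering; as an illustrative sketch (not the instructor's formulation), the same truncated-SVD idea can be applied to a toy user-item rating matrix, treating the rank-$k$ reconstruction as naive rating predictions:

```python
import numpy as np

# Toy user-item rating matrix: rows = users, columns = items, 0 = unobserved
# (all values are illustrative only).
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [1.0, 0.0, 5.0, 4.0],
              [0.0, 1.0, 4.0, 5.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k "denoised" rating matrix

# Reconstructed entries at the zero positions serve as naive rating predictions
print(np.round(R_hat, 2))
```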

24 LSA beyond text Eigenfaces CS6501: Text Mining

25 LSA beyond text Cat neuron from a deep neural network
One of the neurons in the artificial neural network, trained from still frames from unlabeled YouTube videos, learned to detect cats. CS6501: Text Mining

26 What you should know Assumptions in LSA Interpretation of LSA
Low rank matrix approximation Eigen-decomposition of co-occurrence matrix for documents and terms CS6501: Text Mining

27 Today's reading
Introduction to Information Retrieval, Chapter 18: Matrix decompositions and latent semantic indexing. Deerwester, Scott C., et al. "Indexing by latent semantic analysis." Journal of the American Society for Information Science 41.6 (1990): 391-407. CS6501: Text Mining

28 Happy Lunar New Year! CS6501: Text Mining

