
1 Latent Semantic Analysis (LSA) Jed Crandall 16 June 2009

2 What is LSA? “A technique in natural language processing of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms” -Paraphrasing Wikipedia Based on the “bag of words” model

3 Timeline LSA patented in 1988 ◦ Deerwester et al., mostly psychology types from the University of Colorado pLSA by Hofmann in 1999 ◦ Models term-document co-occurrence with a multinomial distribution instead of LSA's implicit Gaussian noise assumption Latent Dirichlet Allocation, Blei et al. in 2002 ◦ More of a graphical model

4 What you can do with LSA Start with a corpus of text, e.g., Wikipedia Create a term frequency matrix Do some fancy math (not too fancy, though) Output is a matrix you can use to project terms (or documents) into a lower-dimensional concept space ◦ “Mao Zedong” and “communism” should have a high dot product ◦ “Mao Zedong” and “pocket watch” should not

5 Intuition behind LSA “A is 5 furlongs away from B” “A is 5 furlongs away from C” “B is 8 furlongs away from C”

6 In two dimensions [figure: triangle with A 5 furlongs from B, 5 furlongs from C, and B 8 furlongs from C]

7 Noise in the measurements “A, B, and C are all on a straight, flat road”

8 Dimension reduction [figure: A, B, and C projected onto a single line]

9 Dimension reduction
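To make the intuition concrete, here is a rough numeric sketch in Python (not from the original slides; it assumes numpy) that embeds A, B, and C from their pairwise distances via classical multidimensional scaling and then keeps only the largest component, the same "discard the small directions as noise" idea that LSA applies to text:

    import numpy as np

    # Pairwise distances in furlongs, order A, B, C: A-B = 5, A-C = 5, B-C = 8
    D = np.array([[0.0, 5.0, 5.0],
                  [5.0, 0.0, 8.0],
                  [5.0, 8.0, 0.0]])

    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ (D ** 2) @ J           # classical MDS Gram matrix

    vals, vecs = np.linalg.eigh(G)        # eigenvalues in ascending order
    top = np.argmax(vals)                 # keep only the largest component
    coords_1d = vecs[:, top] * np.sqrt(vals[top])

    print(coords_1d)                      # A, B, C positions on the best-fit line

The small remaining eigenvalue (the "height" of the triangle above the road) is treated as measurement noise and discarded, just as LSA discards the small singular values.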

10 Assumptions Suppose we want to do LSA on Wikipedia, with k=600 We’re assuming all authors draw from two sources when choosing words while writing an article ◦ A 600-dimensional “true” concept space ◦ Their freedom of choice, which we model as white Gaussian noise

11 Process Build a term frequency matrix Do tf-idf weighting Calculate a singular value decomposition (SVD) Do a rank reduction by chopping off all but the k = 600 largest singular values Can now map terms (or documents) into the space defined by the rank-600 approximation of our original matrix and compare them with a dot product or cosine similarity
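A minimal sketch of this pipeline, assuming scikit-learn is available; the three-document toy corpus, k = 2, and the example terms are made-up placeholders (a real run on Wikipedia would use k ≈ 600):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "mao zedong led the communist party of china",
        "communism is a political and economic ideology",
        "a pocket watch is a small watch meant to be carried in a pocket",
    ]

    # Term frequency matrix with tf-idf weighting (documents x terms)
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    # Truncated SVD = rank reduction; k must stay small for this toy corpus
    k = 2
    svd = TruncatedSVD(n_components=k)
    docs_k = svd.fit_transform(X)         # documents mapped into concept space

    # One common choice for term vectors: the columns of V^T (svd.components_)
    terms_k = svd.components_.T           # terms x k
    vocab = list(vectorizer.get_feature_names_out())

    # Compare two terms in concept space
    i, j = vocab.index("mao"), vocab.index("communism")
    print(cosine_similarity(terms_k[[i]], terms_k[[j]]))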

12 Build a term frequency matrix
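For illustration (a made-up two-document example, not from the original slides), a term frequency matrix has one row per term and one column per document, and entry (i, j) counts how often term i occurs in document j:

                  doc1   doc2
    mao             2      0
    communism       1      1
    watch           0      3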

13

14 Do tf-idf weighting (optional) For term frequency: n i,j is the # of occurrences of term i in document j, and the denominator is the total # of terms in document j For inverse document frequency: the numerator is the total # of documents and the denominator is the # of documents where term i appears at least once (like entropy)
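Written out explicitly (the formulas on the original slide were images, so this is the standard reconstruction, with |D| the total number of documents):

    \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad
    \mathrm{idf}_i = \log\frac{|D|}{|\{\, d \in D : \text{term } i \text{ occurs in } d \,\}|}, \qquad
    w_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i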

15 Calculate an SVD Why “an” SVD, not “the” SVD? (the decomposition is not unique) Unitary matrix ◦ Normal ◦ Determinant has absolute value 1, and the transform preserves lengths The number of nonzero singular values is the rank of the matrix An SVD exists for every matrix Fast, numerically stable, can stop at k, etc.
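In symbols (standard form; the equation on the original slide was an image): for an m x n term-document matrix X,

    X = U \Sigma V^{T}

where the columns of U and V are orthonormal (U and V are unitary in the complex case), \Sigma is diagonal with the nonnegative singular values on its diagonal, and the number of nonzero singular values equals the rank of X.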

16

17 Rank reduction Rank reduction = dimension reduction Will give us the rank-k matrix that is the optimal approximation of our original matrix in terms of the Frobenius norm Rank reduction has the effect of reducing white Gaussian noise
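Concretely, this is the Eckart-Young result: keeping only the k largest singular values gives

    X_k = U_k \Sigma_k V_k^{T}, \qquad \|X - X_k\|_F \le \|X - Y\|_F \ \text{for every } Y \text{ with } \mathrm{rank}(Y) \le k

so the truncated SVD is the best rank-k approximation of X in the Frobenius norm.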

18

19 Map terms (or documents) into concept space Using V instead of P (notation differs because these slides borrow from multiple sources) Can compare terms using, e.g., cosine similarity
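With a term-by-document matrix X ≈ U_k Σ_k V_k^T, one common convention (a sketch; the original slides may have used different notation) is to take row i of U_k Σ_k as the concept-space vector t_i for term i, and row j of V_k Σ_k for document j; two terms are then compared with

    \cos(t_i, t_j) = \frac{t_i \cdot t_j}{\|t_i\|\,\|t_j\|}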

20

21 Example application ConceptDoppler 我的奋斗 (Mein Kampf), 转化率 (conversion rate), 绝食 (hunger strike) List changes dramatically at times ◦ 19 September 2007 – 122 out of ? ◦ 6 March 2008 – 108 out of ? ◦ 18 June 2008 – 133 words out of ? ◦ As of February 2009 法轮功 (Falun Gong) not blocked

22 Questions? Sources I plagiarized from: ◦ Wikipedia article on latent semantic analysis ◦ http://lsa.colorado.edu/papers/dp1.LSAintro.pdf ◦ ConceptDoppler, Crandall et al., CCS 2007

