
1 UNSUPERVISED TOPIC MODELING — Daphna Weinshall, B-530. Slides credit: Thomas Huffman, Tom Landauer, Peter Foltz, Melanie Martin, Hsuan-Sheng Chiu, Haiyan Qiao, Jonathan Huang and others

2 OUTLINE
The subject of this section: unsupervised organization of large collections of documents.
Goals:
- Organize data for meta-analysis
- Retrieve documents relevant to a given query by matching documents in "semantic space"
- Identify novel documents
Today we discuss three methods:
- LSA: originated in IR and psychology; seeks retrieval that captures semantic meaning
- pLSA: similar goal, but with a probabilistic interpretation
- LDA: a Bayesian elaboration of pLSA that allows the modeling of new documents

3 SALTON'S VECTOR SPACE MODEL
Represent each document as a vector: each entry corresponds to a different word, and the value at that entry is the number of times the word occurs in the document (or some function of it).
In practice, select and use a smaller set of words of interest; the set of remaining distinct words is called the dictionary or vocabulary.

4 THE VECTOR SPACE METHOD
Representation: a matrix of terms (rows) by documents (columns).
- Row: vector of a term's occurrences across all documents
- Column: vector of all terms' occurrences in one document
Assigned meaning: one column vector per document, one row vector per word. The cosine of the angle between the normalized vectors [≈ inner product, or correlation] measures similarity between documents or between words.
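As a small illustration (not part of the original slides), here is a minimal numpy sketch of the cosine similarity used to compare two such vectors; the vectors and vocabulary are made up for the example:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two count vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Two toy document vectors over a 4-word vocabulary.
d1 = np.array([2, 1, 0, 1], dtype=float)
d2 = np.array([1, 0, 3, 1], dtype=float)
print(cosine(d1, d2))  # ~0.37: the documents share part of their vocabulary
```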

5 SMALL EXAMPLE — Technical Memo Titles
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
This example is taken from: Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6), 391-407.


7 MATRIX REPRESENTATION
[figure: the term-by-document count matrix for the nine titles above]
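A sketch (my reconstruction, not from the slides) of how this term-document count matrix can be built for the nine titles; the 12 index terms assumed here are the content words that occur in more than one title, as in the Deerwester et al. example:

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "Human machine interface for ABC computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random, binary, ordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV: Widths of trees and well-quasi-ordering",
    "Graph minors: A survey",
]

# Index terms: content words occurring in more than one title.
vocab = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]

vec = CountVectorizer(vocabulary=vocab, lowercase=True)
A = vec.fit_transform(titles).toarray().T   # terms x documents, here 12 x 9
print(A.shape)
```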

8 PROBLEMS WHEN USING THE VECTOR SPACE MODEL
- Synonymy: different words may have an identical or similar meaning, e.g. car and automobile ⇨ leads to poor recall. Synonymous documents will have a small cosine although they are in fact related.
- Polysemy: a word often has multiple meanings, e.g. model, python, chip ⇨ leads to poor precision. Such documents will have a large cosine although they are not truly related.
[figure: example word lists — auto, engine, bonnet, tyres, lorry, boot vs. car, emissions, hood, make, model, truck (synonymy); make, hidden Markov model, emissions, normalize (polysemy)]

9 POLYSEMY AND CONTEXT
The way to overcome ambiguity due to synonymy and polysemy is to use context.
[figure: the word "saturn" with two contexts — meaning 1: ring, jupiter, space, voyager, planet; meaning 2: car, company, dodge, ford — a context word contributes to similarity if the word is used in the first meaning, but not if it is used in the second]

10 LSA
Latent Semantic Analysis originated in the domains of psychology and information retrieval:
- Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988). "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI '88: Conference on Human Factors in Computing, New York: ACM, 281-285.
- Foltz, P. W. (1990). "Using Latent Semantic Indexing for Information Filtering." In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.
LSA is designed to address these two problems with the vector space model. It does so by mapping both words and documents into a "semantic" space, where synonymous terms look more similar and a polysemous word is considered in context.

11 THE SETTING
- Corpus: a set of n documents D = {d_1, ..., d_n}
- Vocabulary: a set of m words W = {w_1, ..., w_m}
- A matrix of size m × n represents the data via the occurrence of words in documents [henceforth the term-document matrix]

12 LSA BASIC STEPS
- Perform a rank-reduced Singular Value Decomposition (SVD) of the term-document matrix: all but the k highest singular values are set to 0, yielding the rank-k approximation of the original matrix (in the least-squares sense). This defines a "semantic space" in which terms and documents are approximated by k-dimensional vectors (see the sketch after this list).
- Compute similarities between entities in the semantic space (usually with the cosine).
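A minimal numpy sketch of these steps (an illustration, not the original course code):

```python
import numpy as np

def lsa(A, k):
    """Rank-k LSA: truncated SVD of the term-document matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = U_k @ np.diag(s_k) @ Vt_k        # best rank-k approximation of A
    term_vecs = U_k * s_k                  # k-dimensional term representations (rows)
    doc_vecs = (np.diag(s_k) @ Vt_k).T     # k-dimensional document representations (rows)
    return A_k, term_vecs, doc_vecs
```

Similarities in the semantic space are then computed with the cosine between rows of `term_vecs` (terms) or rows of `doc_vecs` (documents).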

13 LINEAR ALGEBRA REFRESHER
Eigenvectors of a square m × m matrix S satisfy S v = λ v, where λ is the eigenvalue and v is the (right) eigenvector.

14 EIGEN DECOMPOSITION
Let S be a square m × m matrix with m linearly independent eigenvectors.
Theorem: there exists an eigen decomposition S = U Λ U⁻¹, where Λ is diagonal. The columns of U are the eigenvectors of S, and the diagonal elements of Λ are the eigenvalues of S.

15 SINGULAR VALUE DECOMPOSITION
For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) A = U Σ V^T, where U is m × m, Σ is m × n and diagonal, V is n × n, and U and V are orthonormal.
The diagonal entries Σ_ii of Σ are called the singular values of A. The m columns of U and the n columns of V are called the left singular vectors and right singular vectors of A, respectively.

16 SVD AND EIGEN DECOMPOSITION
From the above: the columns of V are orthonormal eigenvectors of A^T A, the columns of U are orthonormal eigenvectors of A A^T, and the squared singular values are the corresponding eigenvalues.

17 SINGULAR VALUE DECOMPOSITION
[figure: illustration of SVD dimensions and sparseness]

18 LOW-RANK APPROXIMATION
Solution via SVD: set the smallest r − k singular values to zero, A_k = U diag(σ_1, ..., σ_k, 0, ..., 0) V^T.
In column notation this is a sum of k rank-1 matrices: A_k = Σ_{i=1}^{k} σ_i u_i v_i^T.

19 LOW-RANK APPROXIMATION
Approximation problem: given k, find the matrix A_k of rank k that minimizes ‖A − A_k‖_F (the Frobenius norm).
Result (Eckart-Young): the minimizer A_k is the matrix obtained from the SVD of A with the r − k smallest singular values set to 0 (see the numeric check below).
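A quick numeric check of this result (my own sketch on a random matrix): the Frobenius error of the rank-k SVD truncation equals the root of the sum of squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 9))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.norm(A - A_k, "fro"))   # Frobenius error of the rank-k truncation
print(np.sqrt(np.sum(s[k:] ** 2)))      # same value: sqrt of sum of dropped singular values squared
```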

20 LOW-RANK APPROXIMATION AND LSA
From the term-document matrix A we compute the approximation A_k. A_k still has a row for each term and a column for each document, but terms and documents now live in a space of k << r dimensions.

21 LOW-RANK APPROXIMATION AND LSA
- The low-rank approximation A_k does not exactly match A; it gets closer as more singular values are kept. This is what we want — we do not want a perfect fit. A_k reflects the major associative patterns in the data and ignores smaller, less important influences and noise.
- Empirically, precision and recall improve as the dimension k is increased until it reaches an optimum, then slowly decrease until performance matches the standard vector space model.
- Comparing two terms: the inner product between two row vectors of A_k reflects the extent to which the two terms have a similar pattern of occurrence across the set of documents.
- Comparing two documents: the inner product between two normalized column vectors of A_k.

22 HOW IT WORKS — SUMMARY
Singular value decomposition: A = U S V^T
Dimension reduction: A ≈ Ã = Ũ S̃ Ṽ^T (keep only the k largest singular values and the corresponding singular vectors)

23 TERM-DOCUMENT MATRIX
[figure: the raw term-document matrix of the small example]
r(human, user) = -.38    r(human, minors) = -.29

24 SMALL EXAMPLE CONT. — {U} = [matrix of left singular vectors, shown as a figure]

25 SMALL EXAMPLE CONT. — {S} = [diagonal matrix of singular values, shown as a figure]

26 SMALL EXAMPLE CONT. — {V} = [matrix of right singular vectors, shown as a figure]

27 SMALL EXAMPLE CONT.
After the rank-2 LSA reconstruction: r(human, user) = .94    r(human, minors) = -.83 (a sketch reproducing these values follows).
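These correlations can be reproduced with a short sketch (assuming the matrix `A` and term list `vocab` built in the earlier CountVectorizer example, and a rank-2 reconstruction as in Deerwester et al.):

```python
import numpy as np

def term_corr(M, vocab, w1, w2):
    """Pearson correlation between the row vectors of two terms in M."""
    return np.corrcoef(M[vocab.index(w1)], M[vocab.index(w2)])[0, 1]

print(term_corr(A, vocab, "human", "user"))     # ~ -0.38 in the raw counts

U, s, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]      # rank-2 LSA reconstruction

print(term_corr(A2, vocab, "human", "user"))    # ~ 0.94 after LSA
print(term_corr(A2, vocab, "human", "minors"))  # ~ -0.83 after LSA
```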

28 CORRELATION — RAW DATA
[figure: term-term correlation table; only a few values survive in the transcript: 0.92, -0.72, 1.00]

29 LSA SUMMARY
LSA has proven to be a valuable tool in many areas of NLP as well as IR: summarization, cross-language IR, topic segmentation, text classification.
Disadvantages: a statistical foundation is missing, and the matrix factors are not always easy to interpret.

30 SMALL EXAMPLE — INTERPRETABILITY
[figure: the {U} and {V} matrices from the example, illustrating that their entries are not easy to interpret]

31 PROBABILISTIC LATENT SEMANTIC ANALYSIS

32 PLSA — PROBABILISTIC LSA
- Originated in the domain of statistics & machine learning (e.g., Hofmann, 2001)
- Extracts aspects (topics) from large collections of text
- Topics are interpretable, unlike the arbitrary dimensions of LSA
- Essential difference from LSA: instead of an algebraic decomposition of the term-document matrix via SVD, a decomposition with a probabilistic interpretation is computed

33 THE MODEL IS GENERATIVE
Data: a corpus of text, i.e., word counts for each document. Topic model: find parameters that "reconstruct" the data.
- Each document is a probability distribution over topics
- Each topic is a probability distribution over words

34 DOCUMENT GENERATION AS A PROBABILISTIC PROCESS
[figure: topic mixture → topic → word, for every word slot]
1. For each document, choose a mixture of topics
2. For every word slot, sample a topic [1..T] from the mixture
3. Sample a word from the sampled topic

35 EXAMPLE
[figure: two mixture components (topics) and two documents generated from them]
- TOPIC 1 (finance): money, loan, bank
- TOPIC 2 (rivers): river, stream, bank
- DOCUMENT 1 is generated mostly from topic 1 and DOCUMENT 2 mostly from topic 2; each word token is tagged with the topic (1 or 2) that generated it, and the mixture weights differ per document (roughly .8/.2 for one document and .3/.7 for the other).
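A small sketch of this generative process (the word distributions and mixture weights below are made-up stand-ins for the slide's two topics):

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["money", "loan", "bank", "river", "stream"]
topics = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0],   # TOPIC 1: finance words
    [0.0, 0.0, 0.3, 0.4, 0.3],   # TOPIC 2: river words
])

def generate_doc(theta, n_words=30):
    """Sample a document: pick a topic for each word slot, then a word from that topic."""
    z = rng.choice(len(topics), size=n_words, p=theta)
    return [vocab[rng.choice(len(vocab), p=topics[zi])] for zi in z]

print(" ".join(generate_doc(theta=[0.8, 0.2])))   # mostly finance words
print(" ".join(generate_doc(theta=[0.3, 0.7])))   # mostly river words
```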

36 INVERTING ("FITTING") THE MODEL
[figure: the same two documents, but now the topic assignment of every word token, the mixture components (topics), and the mixture weights are all unknown ("?") and must be inferred from the observed words]

37 EXAMPLE: TOPICS FROM AN EDUCATIONAL CORPUS (TASA)
37K docs, 26K-word vocabulary, 1700 topics. Top words of a few topics:
- PRINTING PAPER PRINT PRINTED TYPE PROCESS INK PRESS IMAGE PRINTER PRINTS PRINTERS COPY COPIES FORM OFFSET GRAPHIC SURFACE PRODUCED CHARACTERS
- PLAY PLAYS STAGE AUDIENCE THEATER ACTORS DRAMA SHAKESPEARE ACTOR THEATRE PLAYWRIGHT PERFORMANCE DRAMATIC COSTUMES COMEDY TRAGEDY CHARACTERS SCENES OPERA PERFORMED
- TEAM GAME BASKETBALL PLAYERS PLAYER PLAY PLAYING SOCCER PLAYED BALL TEAMS BASKET FOOTBALL SCORE COURT GAMES TRY COACH GYM SHOT
- JUDGE TRIAL COURT CASE JURY ACCUSED GUILTY DEFENDANT JUSTICE EVIDENCE WITNESSES CRIME LAWYER WITNESS ATTORNEY HEARING INNOCENT DEFENSE CHARGE CRIMINAL
- HYPOTHESIS EXPERIMENT SCIENTIFIC OBSERVATIONS SCIENTISTS EXPERIMENTS SCIENTIST EXPERIMENTAL TEST METHOD HYPOTHESES TESTED EVIDENCE BASED OBSERVATION SCIENCE FACTS DATA RESULTS EXPLANATION
- STUDY TEST STUDYING HOMEWORK NEED CLASS MATH TRY TEACHER WRITE PLAN ARITHMETIC ASSIGNMENT PLACE STUDIED CAREFULLY DECIDE IMPORTANT NOTEBOOK REVIEW

38 POLYSEMY
The same topics as on the previous slide, now highlighting polysemous words that appear in several topics: e.g., PLAY in both the theater and the sports topics, COURT in both the sports and the trial topics, CHARACTERS in both the printing and the theater topics.

39 THREE DOCUMENTS WITH THE WORD "PLAY" (numbers & colors ⇨ topic assignments)
[figure: three annotated documents, not reproduced in this transcript]

40 MORE FORMALLY... THE ASPECT MODEL
Aspect model: a document is a mixture of K underlying (latent) aspects; each aspect is represented by a distribution over words P(w|z).
Generative model:
1. Select a document d with probability P(d)
2. Pick a latent class z with probability P(z|d)
3. Generate a word w with probability P(w|z)
[graphical model: d → z → w, with edges labeled P(d), P(z|d), P(w|z)]

41 THE PROBABILISTIC MODEL
The joint probability model, and its symmetric form obtained using Bayes' rule (equations below).
Inference problem: compute P(z), P(z|d), P(w|z) from observations of documents (d) and words (w), summarized in the matrix of co-occurrence counts.
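For reference (the slide's equation images are not reproduced in this transcript), the standard aspect-model formulas read:

```latex
P(d, w) = P(d)\, P(w \mid d), \qquad
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)
```

and, rewriting with Bayes' rule, the symmetric parameterization

```latex
P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z).
```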

42 THE PROBABILISTIC MODEL
In matrix form: the observed word distributions (per document) are expressed as the product of the word distributions per topic and the topic distributions per document. [Slide credit: Josef Sivic]
There is an obvious similarity to the LSA decomposition; the main difference is that here the decomposition requires non-negative values, since the factors are probability distributions.

43 MODEL FITTING WITH EM
Maximize the log-likelihood function L = Σ_{d,w} n(d, w) log P(d, w), where n(d, w) is the observed count of word w in document d [a multinomial distribution].
Use Expectation Maximization (EM) to estimate the model parameters.

44 PROBABILISTIC LATENT SEMANTIC SPACE
Reminder: the multinomial distribution gives the probability of an experiment with K possible outcomes, each with its own probability, producing each outcome a specific number of times.
Each "topic" z defines a point on the word simplex via the multinomial distribution P(w|z). The modeling assumption is that P(w|d) is a convex combination (all factors non-negative) of the P(w|z), where the mixing factors are P(z|d).

45 EM STEPS
- E-step: compute the posterior probabilities of the latent variables given the current parameter estimates, P(z | d, w) ∝ P(z|d) P(w|z), and the expected complete-data log-likelihood.
- M-step: maximize the expected complete-data log-likelihood with respect to the parameters.
A sketch of both steps is given below.
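A minimal EM sketch for the aspect model (my own illustration in the asymmetric P(z|d), P(w|z) parameterization; not the original course code):

```python
import numpy as np

def plsa_em(N, K, n_iter=50, seed=0):
    """Fit pLSA with EM. N: (n_docs, n_words) matrix of counts n(d, w); K: number of aspects."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)   # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)   # P(w|z)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]           # shape (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from the expected counts n(d,w) P(z|d,w)
        exp_counts = N[:, None, :] * post                       # shape (D, K, W)
        p_w_z = exp_counts.sum(axis=0) + 1e-12
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = exp_counts.sum(axis=2) + 1e-12
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```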

46 EXAMPLE: POLYSEMY
Abstracts of 1,568 documents, 128 latent classes. The figure shows word stems co-occurring with the same word "power" under two aspects, as P(w|z): Power1 — astronomy; Power2 — electrical.

47 COMPARING PLSA AND LSA
- Both LSA and PLSA perform dimensionality reduction: in LSA by keeping only k singular values, in PLSA by having k aspects.
- Comparison to the SVD factors: the U matrix relates to P(d|z) (document to aspect), the V matrix relates to P(w|z) (aspect to term), and the S matrix relates to P(z) (aspect strength).
- The main difference is how the approximation is computed: PLSA defines a generative (aspect) model and maximizes its predictive power.
- Selecting the proper value of k is heuristic in LSA; in PLSA, statistical model selection can determine the optimal k.
- Main disadvantage of PLSA: we essentially model an existing corpus; what happens when a new document appears?

48 LATENT DIRICHLET ALLOCATION

49 MOTIVATION FOR LDA
- In PLSA, the observed variable d is an index into the training set, so there is no natural way for the model to handle previously unseen documents.
- The number of parameters in PLSA grows linearly with M (the number of documents in the training set).
- The fix: be Bayesian about the topic mixture proportions.
Ref: Blei, D., Ng, A. and Jordan, M. (2003). "Latent Dirichlet allocation." Journal of Machine Learning Research, 3:993-1022.

50 GRAPHICAL MODEL REPRESENTATION
[figure: plate diagrams of PLSA and LDA; in both, a multinomial distribution over topics z and a multinomial distribution over words w]

51 DIRICHLET DISTRIBUTIONS
- In the LDA model, we would like to say that the topic mixture proportions for each document are drawn from some distribution. So we need a distribution on multinomials, i.e., on k-tuples of non-negative numbers that sum to one.
- The space of all such multinomials is the (k−1)-simplex, a generalization of a triangle to (k−1) dimensions.
- Criteria for selecting our prior: it must be defined over the (k−1)-simplex and, algebraically speaking, it should interact with the multinomial distribution in a "nice" way.

52 DIRICHLET DISTRIBUTION
[the slide shows the density formula; it is reproduced below for reference]
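The Dirichlet density over a point θ = (θ_1, ..., θ_k) on the (k−1)-simplex, with parameters α_1, ..., α_k > 0, is:

```latex
p(\theta \mid \alpha) =
\frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}
\;\prod_{i=1}^{k} \theta_i^{\alpha_i - 1},
\qquad \theta_i \ge 0,\; \sum_{i=1}^{k}\theta_i = 1 .
```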

53 DIRICHLET DISTRIBUTION — USEFUL FACTS
- The distribution is defined over the (k−1)-simplex: it takes k non-negative arguments that sum to one. Consequently it is a natural distribution to use over multinomial distributions.
- The Dirichlet distribution is the conjugate prior to the multinomial distribution: if the likelihood is multinomial with a Dirichlet prior, the posterior is also Dirichlet.
- The Dirichlet parameter α_i can be thought of as a prior count of the i-th class.
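A tiny numpy sketch of the conjugacy and prior-count facts above (the numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0, 2.0])        # Dirichlet prior: "pseudo-counts" for 3 topics
counts = np.array([10, 0, 3])            # observed topic counts in one document

# Conjugacy: multinomial likelihood + Dirichlet prior -> Dirichlet posterior
posterior = alpha + counts               # the prior parameters act as extra counts
print(rng.dirichlet(posterior, size=3))  # samples of plausible topic proportions
```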

54 GRAPHICAL MODEL REPRESENTATION
[figure: the PLSA vs. LDA plate diagrams again, shown before introducing the full LDA model]

55 THE LDA MODEL
For each document:
1. Choose θ ~ Dirichlet(α)
2. For each of the N words w_n:
   - Choose a topic z_n ~ Multinomial(θ)
   - Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n
[figure: per-document graphical models with θ at the root, topic nodes z_1..z_4 and word nodes w_1..w_4]
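A short sketch of this generative process (illustrative only; `alpha` and `beta` would come from a fitted model or be chosen by hand):

```python
import numpy as np

rng = np.random.default_rng(2)

def generate_lda_doc(alpha, beta, n_words):
    """Generate one document from the LDA generative process.
    alpha: (K,) Dirichlet parameter; beta: (K, V) topic-word distributions."""
    theta = rng.dirichlet(alpha)                    # topic proportions for this document
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)         # topic for this word slot
        w = rng.choice(beta.shape[1], p=beta[z])    # word index drawn from that topic
        words.append(w)
    return theta, words
```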

56 INFERENCE
The inference problem in LDA is to compute the posterior of the hidden variables given a document and the corpus parameters α and β, i.e., compute p(θ, z | w, α, β). Unfortunately, exact inference is intractable, so approximate alternatives are used.

57 VARIATIONAL INFERENCE
In variational inference we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational distribution and the true posterior.

58 LATENT DIRICHLET ALLOCATION (CONT.)
The joint distribution of the topic proportions θ, a set of N topics z, and a set of N words w; the marginal distribution of a document; and the probability of a corpus (equations below).
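For reference, these equations (as given in Blei et al., 2003) are:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```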

59 INFERENCE AND PARAMETER ESTIMATION
The key inferential problem is computing the posterior distribution of the hidden variables given a document, p(θ, z | w, α, β). Unfortunately, this distribution is intractable to compute in general, due to the coupling between the hidden variables.

60 INFERENCE AND PARAMETER ESTIMATION (CONT.)
The variational approximation: drop some edges and the w nodes from the graphical model, yielding a simpler family of distributions over the hidden variables. [figure: the simplified variational graphical model with free parameters γ and φ]

61 INFERENCE AND PARAMETER ESTIMATION (CONT.)
Variational distribution: q(θ, z | γ, φ) = q(θ | γ) Π_{n=1..N} q(z_n | φ_n).
The log-likelihood decomposes as log p(w | α, β) = L(γ, φ; α, β) + KL(q(θ, z | γ, φ) ‖ p(θ, z | w, α, β)): a lower bound L plus the KL divergence between the variational posterior and the true posterior.

62 INFERENCE AND PARAMETER ESTIMATION (CONT.)
We thus seek a tight lower bound on the log-likelihood: maximizing the lower bound L with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior and the true posterior.

63 INFERENCE AND PARAMETER ESTIMATION (CONT.)
Expand the lower bound: L(γ, φ; α, β) = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(w | z, β)] − E_q[log q(θ)] − E_q[log q(z)].

64 INFERENCE AND PARAMETER ESTIMATION (CONT.)
Each expectation is then written out in closed form (in terms of the digamma function); the full expansion is not reproduced in this transcript.

65 INFERENCE AND PARAMETER ESTIMATION (CONT.)
We find the variational parameters by adding Lagrange multipliers (for the normalization constraints) and setting all partial derivatives to zero, which yields the updates below.
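For reference, the resulting coordinate-ascent updates (as in Blei et al., 2003, with Ψ the digamma function) are:

```latex
\phi_{ni} \;\propto\; \beta_{i w_n}\,
   \exp\!\Big(\Psi(\gamma_i) - \Psi\big(\textstyle\sum_{j=1}^{k}\gamma_j\big)\Big),
\qquad
\gamma_i \;=\; \alpha_i + \sum_{n=1}^{N} \phi_{ni}
```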

66 PARAMETER ESTIMATION
Given a corpus of documents, we would like to find the parameters α and β that maximize the likelihood of the observed data.
Strategy (variational EM): lower bound log p(w | α, β) by the function L(γ, φ; α, β), then repeat until convergence:
1. (E-step) Maximize L(γ, φ; α, β) with respect to the variational parameters γ, φ.
2. (M-step) Maximize the bound with respect to the model parameters α and β.

67 SOME RESULTS
Training set: 10,000 text articles posted to 20 online newsgroups, 40 iterations of EM, number of topics set to 50. Top words of five topics:
- "politics": Political, Party, Business, Convention, Institute, Committee, States, Rights
- "sports": Team, Game, Play, Year, Games, Win, Hockey, Season
- "space": Space, NASA, Research, Center, Earth, Health, Medical, Gov
- "computers": Drive, Windows, Card, DOS, SCSI, Disk, System, Memory
- "christianity": God, Jesus, His, Bible, Christian, Christ, Him, Christians

68 DISCUSSION
- LDA is a flexible generative probabilistic model for collections of discrete data.
- Exact inference is intractable for LDA, but a variety of approximate algorithms for inference and parameter estimation can be used within the LDA framework; one can also use Monte Carlo methods (e.g., Gibbs sampling) to estimate the posterior.
- LDA is a relatively simple model that has been extended to more complex scenarios, such as hierarchical models or temporal evolution.
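As a practical note (not part of the slides), the whole pipeline is available in off-the-shelf libraries; a minimal sketch with gensim, where `raw_documents` is a placeholder for your own list of strings:

```python
from gensim import corpora, models

texts = [doc.lower().split() for doc in raw_documents]       # raw_documents: list of strings
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow_corpus, num_topics=50, id2word=dictionary,
                      alpha="auto", passes=10)
print(lda.print_topics(num_topics=5, num_words=8))

# Unlike pLSA, a previously unseen document can be folded in directly:
new_bow = dictionary.doc2bow("graph minors and trees".split())
print(lda.get_document_topics(new_bow))
```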

