
1 An Introduction To Matrix Decomposition and Graphical Model Lei Zhang/Lead Researcher Microsoft Research Asia 2012-04-17

2 Outline Matrix Decomposition – PCA, SVD, NMF – LDA, ICA, Sparse Coding, etc. Graphical Model – Basic concepts in probabilistic machine learning – EM – pLSA – LDA Two Applications – Document decomposition for “long query” retrieval – Modeling Threaded Discussions

3 What Is Matrix Decomposition? We wish to decompose the matrix A by writing it as a product of two or more matrices: A_{n×m} = B_{n×k} C_{k×m}. Suppose A, B, C are matrices of column vectors – A_{n×m} = (a_1, a_2, …, a_m), each a_i is an n-dim data sample – B_{n×k} = (b_1, b_2, …, b_k), each b_j is an n-dim basis vector, and the space B consists of k bases – C_{k×m} = (c_1, c_2, …, c_m), each c_i is the k-dim coordinates of a_i projected onto space B
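To make the shapes concrete, here is a minimal NumPy sketch (a toy example, not from the slides) that writes a small A (5×4) as B (5×2) times C (2×4), using a truncated SVD purely for illustration:

```python
import numpy as np

# A hypothetical 5x4 data matrix: 4 samples (columns), each 5-dimensional.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))

# Decompose A (5x4) into B (5x2) and C (2x4) with k = 2 bases.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
B = U[:, :k]                      # each column b_j is a 5-dim basis vector
C = np.diag(s[:k]) @ Vt[:k, :]    # each column c_i holds the k-dim coordinates of a_i

print(np.allclose(A, B @ C))      # False in general: this is a rank-2 approximation
print(np.linalg.norm(A - B @ C))  # reconstruction error
```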

4 Why We Need Matrix Decomposition? Given one data sample: a_1 = B_{n×k} c_1, i.e. (a_11, a_12, …, a_1n)^T = (b_1, b_2, …, b_k)(c_11, c_12, …, c_1k)^T. Another data sample: a_2 = B_{n×k} c_2. More data samples: a_m = B_{n×k} c_m. Together (m data samples): (a_1, a_2, …, a_m) = B_{n×k} (c_1, c_2, …, c_m), i.e. A_{n×m} = B_{n×k} C_{k×m}.

5 Why We Need Matrix Decomposition? (a_1, a_2, …, a_m) = B_{n×k} (c_1, c_2, …, c_m), i.e. A_{n×m} = B_{n×k} C_{k×m}. We wish to find a set of new bases B to represent the data samples A, and A will become C in the new space. In general, B captures the common features in A, while C carries the specific characteristics of the original samples. In PCA: B is the eigenvectors. In SVD: B is the left (column) singular vectors, i.e. the columns of U. In LDA: B is the discriminant directions. In NMF: B is the local features.

6 PRINCIPAL COMPONENT ANALYSIS

7 Definition – Eigenvalue & Eigenvector Given an m × m matrix C, if for a scalar λ and a nonzero vector w we have Cw = λw, then λ is called an eigenvalue and w an eigenvector of C.

8 Definition – Principal Component Analysis – Principal Component Analysis (PCA) – Karhunen-Loève transformation (KL transformation). Let A be an n × m data matrix in which the rows represent data samples: each row is a data vector, each column represents a variable. A is centered: the estimated mean is subtracted from each column, so each column has zero mean. Covariance matrix C (m × m): C = A^T A (used here without the 1/(n−1) normalization, as on the following slides).

9 Principal Component Analysis C can be decomposed as C = UΛU^T, where Λ is a diagonal matrix diag(λ_1, λ_2, …, λ_m), each λ_i an eigenvalue, and U is an orthogonal matrix whose columns are eigenvectors, so U^T U = I and U^{-1} = U^T.

10 Maximizing Variance The objective of the rotation transformation is to find the direction of maximal variance. The projection of the data along w is Aw, with variance σ_w² = (Aw)^T (Aw) = w^T A^T A w = w^T C w, where C = A^T A is the covariance matrix of the data (A is centered!). Task: maximize the variance subject to the constraint w^T w = 1.

11 Optimization Problem Maximize w^T C w − λ(w^T w − 1), where λ is the Lagrange multiplier. Differentiating with respect to w yields the eigenvalue equation Cw = λw, where C = A^T A. Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.
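A short sketch of this recipe on toy data of my own (not from the slides): center A, form C = A^T A, take the leading eigenvector, and check that the variance of the projection Aw equals the leading eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 200 samples (rows) x 3 variables (columns), as in the slides' convention.
X = rng.standard_normal((200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                              [1.0, 1.0, 0.0],
                                              [0.5, 0.2, 0.1]])
A = X - X.mean(axis=0)          # center each column
C = A.T @ A                     # covariance matrix (no 1/(n-1) factor, as on slide 8)

eigvals, U = np.linalg.eigh(C)  # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

w = U[:, 0]                                  # first principal direction
print(np.allclose(C @ w, eigvals[0] * w))    # Cw = lambda * w
print(np.isclose(w @ w, 1.0))                # constraint w^T w = 1
# Variance of the projection Aw equals the leading eigenvalue:
print(np.allclose((A @ w) @ (A @ w), eigvals[0]))
```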

12 Property: Data Decomposition PCA can be treated as data decomposition: a = UU^T a = (u_1, u_2, …, u_n)(u_1, u_2, …, u_n)^T a = (u_1, u_2, …, u_n)(u_1^T a, …, u_n^T a)^T = (u_1, u_2, …, u_n)(b_1, b_2, …, b_n)^T = Σ_i b_i·u_i, with b_i = u_i^T a.

13 Face Recognition – Eigenface Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces, CVPR 1991 (Citation: 2654) The eigenface approach – images are points in a vector space – use PCA to reduce dimensionality – face space – compare projections onto face space to recognize faces

14

15

16 PageRank – Power Iteration Define the link matrix Q: Q_{ij} = 1/N_j if page j links to page i, and 0 otherwise. Column j has nonzero elements in the positions corresponding to the outlinks of j (N_j in total); row i has nonzero elements in the positions corresponding to the inlinks I_i.

17 Column-Stochastic & Irreducible Column-stochastic: every column of Q sums to one. Irreducible: a uniform teleportation (random jump) term is added, A = αQ + (1 − α)(1/n) e e^T, which makes the matrix irreducible so that the PageRank vector is unique.

18 Iterative PageRank Calculation For k = 1, 2, …: r_k = A r_{k−1}. Equivalently, the PageRank vector r is the eigenvector of A for λ = 1 (A is a Markov chain transition matrix): Ar = r. Why can we use power iteration to find the first eigenvector?
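A minimal power-iteration sketch, assuming a tiny hypothetical 4-page link graph and the teleported matrix A = αQ + (1 − α)(1/n)ee^T from the previous slide; the graph and α = 0.85 are illustrative choices of mine:

```python
import numpy as np

# Hypothetical link graph; Q[i, j] = 1/N_j if page j links to page i.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}     # j -> list of its outlinks
n = 4
Q = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        Q[i, j] = 1.0 / len(outs)                  # column-stochastic

alpha = 0.85
A = alpha * Q + (1 - alpha) * np.ones((n, n)) / n  # teleportation makes A irreducible

r = np.ones(n) / n
for _ in range(100):
    r_new = A @ r
    r_new /= r_new.sum()        # keep r a probability vector
    if np.linalg.norm(r_new - r, 1) < 1e-12:
        break
    r = r_new
print(r)                        # dominant eigenvector of A (eigenvalue 1)
```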

19 Convergence of the Power Iteration Expand the initial approximation r_0 in terms of the eigenvectors: r_0 = c_1 v_1 + c_2 v_2 + … + c_n v_n. Then r_k = A^k r_0 = c_1 λ_1^k v_1 + c_2 λ_2^k v_2 + …; since λ_1 = 1 > |λ_2| ≥ …, the components along the non-dominant eigenvectors decay like |λ_2|^k, so r_k converges to the direction of v_1.

20 SINGULAR VALUE DECOMPOSITION

21 SVD - Definition Any m × n matrix A, with m ≥ n, can be factorized as A = U [Σ; 0] V^T, where U (m × m) and V (n × n) are orthogonal and Σ = diag(σ_1, σ_2, …, σ_n) with σ_1 ≥ σ_2 ≥ … ≥ σ_n ≥ 0.

22 Singular Values And Singular Vectors The diagonal elements σ_j of Σ are the singular values of the matrix A. The columns of U and V are the left singular vectors and right singular vectors, respectively. Equivalent (outer-product) form of the SVD: A = Σ_j σ_j u_j v_j^T.

23 Matrix Approximation Theorem: Let U_k = (u_1 u_2 … u_k), V_k = (v_1 v_2 … v_k) and Σ_k = diag(σ_1, σ_2, …, σ_k), and define A_k = U_k Σ_k V_k^T. Then min_{rank(B)≤k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}. It means that the best approximation of rank k for the matrix A is A_k.
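A quick NumPy check of this statement on random data (my own example): the 2-norm error of the rank-k truncation equals σ_{k+1}:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # best rank-k approximation

# The 2-norm error of the best rank-k approximation equals sigma_{k+1}.
print(np.allclose(np.linalg.norm(A - A_k, 2), s[k]))
```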

24 SVD and PCA From A = UΣV^T we can write A^T A = VΣ²V^T and AA^T = UΣ²U^T. Remember that in PCA we treat A as a row matrix (rows are samples). V is just the eigenvectors for A – each column in V is an eigenvector of the row matrix A (i.e. of A^T A) – we use V to approximate a row in A. Equivalently, U is just the eigenvectors for A^T – each column in U is an eigenvector of the column matrix A (i.e. of AA^T) – we use U to approximate a column in A.

25 Example - LSI Build a term-by-document matrix A. Compute the SVD of A: A = UΣV^T. Approximate A by A_k = U_k Σ_k V_k^T = U_k D_k, where – U_k: orthogonal basis that we use to approximate all the documents – D_k = Σ_k V_k^T: column j holds the coordinates of document j in the new basis – D_k is the projection of A onto the subspace spanned by U_k.

26 SVD and PCA For symmetric A, SVD is closely related to PCA. PCA: A = UΛU^T – U and Λ are the eigenvectors and eigenvalues. SVD: A = UΛV^T – U is the left (column) eigenvectors – V is the right (row) eigenvectors – Λ is the same eigenvalues. For symmetric A, the column eigenvectors equal the row eigenvectors. Note the difference of A in PCA and SVD – SVD: A is directly the data, e.g. a term-by-document matrix – PCA: A is a covariance matrix, A = X^T X, where each row in X is a sample.

27 Latent Semantic Indexing (LSI) 1. Document file preparation / preprocessing: – Indexing: collecting terms – Use a stop list: eliminate "meaningless" words – Stemming. 2. Construct the term-by-document matrix; sparse matrix storage. 3. Query matching: distance measures. 4. Data compression by low-rank approximation: SVD. 5. Ranking and relevance feedback.

28 Latent Semantic Indexing Assumption: there is some underlying latent semantic structure in the data. E.g. car and automobile occur in similar documents, as do cows and sheep. This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD.

29 Similarity Measures Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T; UΣ are the coordinates of A (rows) projected to space V. Document to document: A^T A = VΣ²V^T = (VΣ)(VΣ)^T; VΣ are the coordinates of A (columns) projected to space U.

30 Similarity Measures Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T; UΣ^{1/2} are the coordinates of A (rows) projected to space V, and VΣ^{1/2} are the coordinates of A (columns) projected to space U.

31 HITS (Hyperlink Induced Topic Search) Idea: the Web includes two flavors of prominent pages: – authorities contain high-quality information – hubs are comprehensive lists of links to authorities. A page is a good authority if many hubs point to it; a page is a good hub if it points to many authorities. Good authorities are pointed to by good hubs, and good hubs point to good authorities.

32 Power Iteration Each page i has both a hub score h_i and an authority score a_i. HITS successively refines these scores by computing a_i ← Σ_{j: j→i} h_j and h_i ← Σ_{j: i→j} a_j. Define the adjacency matrix L of the directed web graph: L_{ij} = 1 if page i links to page j, and 0 otherwise. Now a ← L^T h and h ← L a.

33 HITS and SVD L: rows are outlinks, columns are inlinks. a will be the dominant eigenvector of the authority matrix L^T L; h will be the dominant eigenvector of the hub matrix LL^T. They are in fact the first right and left singular vectors of L! We are in fact running SVD on the adjacency matrix.
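A sketch with a toy adjacency matrix of my own choosing: run the HITS updates and confirm that the converged scores match the first singular vectors of L:

```python
import numpy as np

# Hypothetical adjacency matrix: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(L.shape[1])         # authority scores
h = np.ones(L.shape[0])         # hub scores
for _ in range(100):
    a = L.T @ h                 # good authorities are pointed to by good hubs
    h = L @ a                   # good hubs point to good authorities
    a /= np.linalg.norm(a)
    h /= np.linalg.norm(h)

# Compare with the first singular vectors of L (up to sign).
U, s, Vt = np.linalg.svd(L)
print(np.allclose(np.abs(a), np.abs(Vt[0])))    # dominant eigenvector of L^T L
print(np.allclose(np.abs(h), np.abs(U[:, 0])))  # dominant eigenvector of L L^T
```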

34 HITS vs PageRank PageRank may be computed once; HITS is computed per query. HITS takes the query into account, PageRank doesn't. PageRank has no concept of hubs. HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot. PageRank is more stable, because of its random jump step.

35 NMF – NON-NEGATIVE MATRIX FACTORIZATION

36 Definition Given a nonnegative matrix V_{n×m}, find nonnegative matrix factors W_{n×k} and H_{k×m} such that V_{n×m} ≈ W_{n×k} H_{k×m}. V: column matrix, each column is a data sample (n-dimensional). W: each of the k columns is a basis vector. H: coordinates of V projected onto W, so that v_j ≈ W_{n×k} h_j.

37 Motivation Non-negativity is natural in many applications... Probability is also non-negative Additive model to capture local structure

38 Multiplicative Update Algorithm Cost function: Euclidean distance ‖V − WH‖². Multiplicative update (Lee & Seung): H ← H ⊙ (W^T V) ⊘ (W^T W H), W ← W ⊙ (V H^T) ⊘ (W H H^T), where ⊙ and ⊘ denote element-wise multiplication and division.
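A minimal sketch of these updates on random nonnegative data (the sizes and iteration count are illustrative choices of mine); it follows the Lee & Seung multiplicative rules for the Euclidean cost:

```python
import numpy as np

rng = np.random.default_rng(3)
V = rng.random((20, 30))        # nonnegative data: 30 samples, 20 dimensions
k = 5
W = rng.random((20, k))
H = rng.random((k, 30))

eps = 1e-10                     # avoid division by zero
for _ in range(500):
    # Multiplicative updates for the Euclidean cost ||V - WH||^2
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print(np.linalg.norm(V - W @ H))           # cost is non-increasing over iterations
print((W >= 0).all() and (H >= 0).all())   # factors stay nonnegative
```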

39 Multiplicative Update Algorithm Cost function: divergence D(A‖B) = Σ_{ij} (A_{ij} log(A_{ij}/B_{ij}) − A_{ij} + B_{ij}) – reduces to the Kullback–Leibler divergence when Σ_{ij} A_{ij} = Σ_{ij} B_{ij} = 1 – A and B can then be regarded as normalized probability distributions. Multiplicative updates for this cost are given in the same Lee & Seung paper. pLSA is NMF with KL divergence.

40 NMF vs PCA n = 2429 faces, m = 19×19 pixels. Positive values are illustrated with black pixels and negative values with red pixels. NMF → parts-based representation; PCA → holistic representation.

41 Reference D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001. D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).

42 Major Reference Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended).

43 Outline Basic concepts – Likelihood, i.i.d. – ML, MAP and Bayesian inference – Expectation-Maximization – Mixture-of-Gaussians parameter estimation pLSA – Motivation – Derivation & geometric properties – Applications LDA – Motivation: why add a hyperparameter – Dirichlet distribution – Variational EM – Relations with other topic models – Incorporating category information Summary

44 Not Included General graphical model theories Markov random field (belief propagation) Detailed derivation of LDA

45 BASIC CONCEPTS

46 What Is Machine Learning? Data: let x = (x_1, x_2, ..., x_D)^T denote a data point, and D = {x^(1), x^(2), ..., x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, .... Predictions: we are generally interested in predicting something based on the observed data set. Given D, what can we say about x^(N+1)? Model: to make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ. Given data D, we learn the model parameters, from which we can predict new data points. The model can often be expressed as a probability distribution over data points.

47 Likelihood Function Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data. Inversely, given the observed data and a model of interest, the likelihood function is defined as L(θ) = f_θ(x) = p(x|θ). That is, the likelihood function L(θ) shows that some parameter values are more likely to have produced the data.

48 Maximum Likelihood (ML) Maximum likelihood finds the model parameters that make the data "most likely" to have been generated from the model. Suppose we are given n data samples (x_1, x_2, …, x_n); maximum likelihood finds θ_ML = argmax_θ L(θ) = argmax_θ p(x_1, …, x_n | θ). Predictive distribution: p(x_new | θ_ML).

49 I.I.D. – Independent, Identically Distributed I.I.D. means p(x_1, …, x_n | θ) = Π_i p(x_i | θ), so the problem is considerably simplified. Usually the log likelihood is used: log L(θ) = Σ_i log p(x_i | θ).
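As a small illustration (a toy example of mine), the i.i.d. log likelihood of a 1-D Gaussian and its closed-form ML estimates:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. samples

def log_likelihood(mu, sigma, x):
    # i.i.d. assumption: log p(x_1..x_n | theta) = sum_i log p(x_i | theta)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# For a Gaussian the ML estimates have closed form: sample mean and std.
mu_ml, sigma_ml = x.mean(), x.std()
print(mu_ml, sigma_ml)
print(log_likelihood(mu_ml, sigma_ml, x) >= log_likelihood(1.0, 1.0, x))
```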

50 Reference Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1–2 slides). Gregor Heinrich, Parameter Estimation for Text Analysis, Technical Note, 2005–2008.

51 EXPECTATION MAXIMIZATION

52 Why We Need EM? The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models. Why do we need latent variables? To describe complex models, e.g. the Gaussian mixture model, and to discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA.

53 More General Data set D = {x_n}; likelihood p(D|θ). Goal: learn maximum likelihood (ML) parameter values. The maximum likelihood procedure finds parameters θ such that θ_ML = argmax_θ p(D|θ) = argmax_θ Σ_X p(D, X|θ), with a sum or integral over the latent variables X. Because of this integral (or sum) over latent variables, the likelihood can be a very complicated function that is hard to optimize.

54 The Expectation Maximization (EM) Algorithm The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters, and iterates two steps: E step: Fill in values of latent variables according to posterior given data. M step: Maximize likelihood as if latent variables were not hidden. Decomposes difficult problems into series of tractable steps.

55 Jensen's Inequality For a concave function f (such as log) and weights λ_i ≥ 0 with Σ_i λ_i = 1: f(Σ_i λ_i x_i) ≥ Σ_i λ_i f(x_i); in particular, log E[x] ≥ E[log x].

56 Lower Bounding the Log Likelihood Observed data D = {y_n}; latent variables X = {x_n}; parameters θ. Goal: maximize the log likelihood (i.e. ML learning) wrt θ: L(θ) = log p(D|θ) = log Σ_X p(D, X|θ). Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality: L(θ) ≥ Σ_X q(X) log [p(D, X|θ) / q(X)] = Σ_X q(X) log p(D, X|θ) + H[q] =: F(q, θ), where H[q] is the entropy of q(X).

57 The E and M Steps of EM The lower bound on the log likelihood is F(q, θ) = Σ_X q(X) log p(D, X|θ) + H[q]. EM alternates between: E step: optimize F wrt the distribution over hidden variables, holding the parameters fixed: q^(k)(X) = argmax_q F(q, θ^(k−1)). M step: maximize F wrt the parameters, holding the hidden distribution fixed: θ^(k) = argmax_θ F(q^(k), θ).

58 The E Step E step: for fixed θ, F(q, θ) = L(θ) − KL(q(X) ‖ p(X|D, θ)). The second term is the Kullback–Leibler divergence. This means that, for fixed θ, F is bounded above by L, and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0. So the E step simply sets q^(k)(X) = p(X|D, θ^(k−1)).

59 The M Step M step: maximize F wrt the parameters, holding the hidden distribution q fixed: θ^(k) = argmax_θ F(q^(k), θ) = argmax_θ Σ_X q^(k)(X) log p(D, X|θ). The second equality comes from the fact that the entropy of q(X) does not depend on θ. The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically.
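To ground these two steps, here is a compact sketch of EM for a 1-D two-component Gaussian mixture (a toy example of mine, not code from the lecture): the E step computes the posterior responsibilities q(z) = p(z|x, θ), the M step re-estimates the weights, means and standard deviations:

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy data from a two-component 1-D Gaussian mixture.
x = np.concatenate([rng.normal(-2, 0.7, 300), rng.normal(3, 1.2, 200)])

# Initial parameters theta = (pi, mu, sigma) for K = 2 components.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

for _ in range(100):
    # E step: responsibilities r[n, k] = p(z_n = k | x_n, theta)
    r = pi * gauss(x[:, None], mu, sigma)
    r /= r.sum(axis=1, keepdims=True)

    # M step: maximize the expected complete-data log likelihood wrt theta
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu)**2).sum(axis=0) / Nk)

print(pi, mu, sigma)   # should approach the generating parameters
```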

60 EM Never Decreases the Likelihood The E and M steps together never decrease the log likelihood: L(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ L(θ^(k)). The E step brings F(q, θ) up to touch the likelihood L(θ); the M step then lifts F(q, θ) by maximizing it wrt θ. F(q, θ) ≤ L(θ) by Jensen – or, equivalently, from the non-negativity of the KL divergence.

61 Reference Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides). Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.

62 WHY DO WE NEED GRAPHICAL MODEL?

63 Why Do We Need Graphical Models? Cons – A graphical model can become complex, even with only a few nodes. – We have to make many assumptions. Pros – We do need probability to explain our world, but the joint probability is hard to compute directly. – A graphical model can help us analyze and understand our problems. – Graphs are an intuitive way of representing and visualizing the relationships between many variables. – With a graphical model, we can decompose the joint probability into conditional probabilities, which are usually easier.

64 Directed Acyclic Graphical Models (Bayesian Networks) A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution: p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D). In general, p(x_1, …, x_n) = Π_i p(x_i | x_pa(i)), where pa(i) are the parents of node i.
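As a small illustration, here is a made-up binary network inspired by the example above (the graph is simplified and the conditional probability tables are invented for the sketch); the joint is just the product of each node's conditional given its parents:

```python
import numpy as np

# Toy binary network A -> C <- B, C -> D with made-up conditional tables.
pA = np.array([0.6, 0.4])                       # p(A)
pB = np.array([0.7, 0.3])                       # p(B)
pC_AB = np.array([[[0.9, 0.1], [0.4, 0.6]],     # p(C | A, B), indexed [a][b][c]
                  [[0.5, 0.5], [0.2, 0.8]]])
pD_C = np.array([[0.8, 0.2], [0.3, 0.7]])       # p(D | C), indexed [c][d]

# Joint distribution as the product of each node's conditional given its parents.
joint = np.zeros((2, 2, 2, 2))
for a in range(2):
    for b in range(2):
        for c in range(2):
            for d in range(2):
                joint[a, b, c, d] = pA[a] * pB[b] * pC_AB[a, b, c] * pD_C[c, d]

print(joint.sum())    # a valid joint distribution: sums to 1
```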

65 Directed Graphs for Statistical Models: Plate Notation A data set of N points generated from a Gaussian: p(x_1, …, x_N | μ, σ²) = Π_{n=1}^N N(x_n | μ, σ²). In plate notation, the node x_n is drawn once and enclosed in a plate replicated N times.

66 PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

67 Latent Semantic Indexing (LSI) Review For natural language queries, simple term matching does not work effectively – terms are ambiguous – queries for the same information vary with personal style. Latent semantic indexing creates a 'latent semantic space' (hidden meaning). LSI puts documents together even if they don't have common words, if the docs share frequently co-occurring terms. Disadvantage: the statistical foundation is missing.

68 pLSA – Probabilistic Latent Semantic Analysis Automated document indexing and information retrieval. Identification of latent classes using an Expectation-Maximization (EM) algorithm. Shown to solve – Polysemy: "Java" could mean coffee and also the programming language; "cricket" is a game and also an insect. – Synonymy: "computer", "pc", "desktop" all could mean the same thing. Has a better statistical foundation than LSA.

69 pLSA Plate notation: for each of the M documents d, each of its N_d words w is generated from a latent topic z (d → z → w). z_1, …, z_N are latent variables, z_i ∈ [1, K], where K is the number of latent topics.

70 pLSA Unrolled over documents: each document d_1, …, d_M generates its words w_1, …, w_{N_m} through latent topics z; the topic-word distributions p(w|z = 1), …, p(w|z = K) are shared for all documents. Likelihood: L = Π_d Π_w p(d, w)^{n(d,w)}.

71 Joint Probability vs Likelihood Joint probability: p(d, w) = p(d) Σ_z p(z|d) p(w|z). Likelihood (only for the observed variables): L = Σ_d Σ_w n(d, w) log p(d, w), where p(d) is assumed to be uniform.

72 Document Decomposition Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d). This is similar to the matrix decomposition, if we consider each discrete distribution as a vector: p(w|d) = Z_{V×K} p(z|d), where the columns of Z are the topic distributions p(w|z). With many documents, we hope to find latent topics as a common basis.

73 pLSA – Objective Function pLSA tries to maximize the log likelihood L = Σ_d Σ_w n(d, w) log Σ_z p(w|z) p(z|d) (up to a constant coming from the uniform p(d)). Due to the summation over z inside the log, we have to resort to EM.

74 EM Steps E-Step – the expectation of the likelihood function is calculated with the current parameter values. M-Step – update the parameters with the calculated posterior probabilities – find the parameters that maximize the likelihood function.

75 Lower Bounding the Log Likelihood As on slide 56, introduce a distribution q(z|d, w) over the latent topic of each (d, w) occurrence and apply Jensen's inequality: L ≥ Σ_d Σ_w n(d, w) Σ_z q(z|d, w) log [p(w|z) p(z|d) / q(z|d, w)].

76 EM Steps The E-Step: p(z|d, w) = p(w|z) p(z|d) / Σ_{z'} p(w|z') p(z'|d). The M-Step: p(w|z) ∝ Σ_d n(d, w) p(z|d, w) and p(z|d) ∝ Σ_w n(d, w) p(z|d, w).
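A minimal sketch of these updates on a random toy term-document matrix (the sizes, counts and initialization are illustrative choices of mine); it implements exactly the E and M updates above:

```python
import numpy as np

rng = np.random.default_rng(6)
V, M, K = 50, 10, 3                  # vocabulary size, documents, latent topics
n = rng.integers(0, 5, size=(M, V))  # n(d, w): toy term counts

p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # p(w|z)
p_z_d = rng.random((M, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # p(z|d)

eps = 1e-12
for _ in range(100):
    # E step: p(z | d, w) proportional to p(w|z) p(z|d)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (M, K, V)
    post /= post.sum(axis=1, keepdims=True) + eps

    # M step: re-estimate p(w|z) and p(z|d) from expected counts n(d,w) p(z|d,w)
    nz = n[:, None, :] * post                           # shape (M, K, V)
    p_w_z = nz.sum(axis=0); p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
    p_z_d = nz.sum(axis=2); p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps

# Log likelihood: sum_{d,w} n(d,w) log sum_z p(w|z) p(z|d)
print((n * np.log(p_z_d @ p_w_z + eps)).sum())
```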

77 Latent Subspace

78 pLSA vs LSA LSA and pLSA perform dimensionality reduction – in LSA, by keeping only K singular values – in pLSA, by having K aspects. Comparison to SVD: – the U matrix is related to P(z|d) (doc to aspect) – the V matrix is related to P(w|z) (aspect to term) – the Σ matrix is related to P(z) (aspect strength).

79 pLSA vs LSA The main difference is the way the approximation is done: pLSA generates a model (the aspect model) and maximizes its predictive power. Selecting the proper value of K is heuristic in LSA, whereas model selection in statistics can determine the optimal K in pLSA.

80 Applications Text mining / topic discovery. Scene classification.

81 Text Mining

82 Scene Classification

83 Example Images

84 Classification Result

85 Reference Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999. Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision (2006). Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo AIM-2005-005, February 2005.

86 LDA – LATENT DIRICHLET ALLOCATION

87 Problems in pLSA pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion. The number of parameters in the model grows linearly with M (the number of documents in the training set).

88 Problems in pLSA There is no constraint on the distributions p(z|d_i): each document d_1, …, d_m has its own free mixture p(z|d_1), …, p(z|d_m), which easily leads to serious over-fitting.

89 Dirichlet Distribution In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution. Requirements for such a distribution: – the samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one, i.e. the samples are multinomial parameter vectors – it is easy to optimize. The Dirichlet distribution is one such distribution. The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

90 Dirichlet Distribution Definition: p(θ|α) = [Γ(Σ_{i=1}^K α_i) / Π_{i=1}^K Γ(α_i)] Π_{i=1}^K θ_i^{α_i − 1}, for θ_i ≥ 0 and Σ_i θ_i = 1. The density is zero outside this open (K − 1)-dimensional simplex.
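A quick illustration using NumPy's Dirichlet sampler; the parameter α = (6, 2, 2) is borrowed from the examples on the next slide, everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = np.array([6.0, 2.0, 2.0])      # K = 3
theta = rng.dirichlet(alpha, size=5)   # 5 samples from Dir(alpha)

print(theta)
print(theta.sum(axis=1))               # each sample lies on the 2-simplex: sums to 1
print(alpha / alpha.sum())             # the mean of Dir(alpha) is alpha / alpha_0
```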

91 Example Dirichlet Distributions (K = 3) Density plots for various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6).

92 Example Dirichlet Distributions (K = 3) Equal α_i, different α_0 = Σ_i α_i: α_0 = 0.1, α_0 = 1, α_0 = 10.

93 The LDA Model (Graphical model: α → θ → z_n → w_n ← β, with θ drawn per document and z_n, w_n drawn per word.) For each document: choose θ ~ Dirichlet(α). For each of the N words w_n: – choose a topic z_n ~ Multinomial(θ) – choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

94 The LDA Model For each document: choose θ ~ Dirichlet(α). For each of the N words w_n: – choose a topic z_n ~ Multinomial(θ) – choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
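A small sketch of this generative process; the hyperparameters, vocabulary size, and document length are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)
K, V, N = 3, 20, 15                        # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)                    # Dirichlet hyperparameter (assumed symmetric)
beta = rng.dirichlet(np.ones(V), size=K)   # beta[k] = p(w | z = k), one multinomial per topic

def generate_document():
    theta = rng.dirichlet(alpha)           # 1. theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)         # 2. z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])       # 3. w_n ~ p(w | z_n, beta)
        words.append(w)
    return words

print(generate_document())
```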

95 Joint Probability Given parameters α and β: p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^N p(z_n|θ) p(w_n | z_n, β), where p(z_n|θ) is simply θ_i for the unique i such that z_n^i = 1.

96 Likelihood Joint probability: p(θ, z, w | α, β) as above. Marginal distribution of a document: p(w | α, β) = ∫ p(θ|α) [Π_{n=1}^N Σ_{z_n} p(z_n|θ) p(w_n|z_n, β)] dθ. Likelihood over all the documents: p(D | α, β) = Π_{d=1}^M p(w_d | α, β).

97 Inference The log likelihood of the corpus is a sum over documents, so each document can be handled separately. As in EM, we would like to lower-bound it using Jensen's inequality.

98 Inference In the E-step, we need to compute the posterior distribution of the hidden variables, p(θ, z | w, α, β). Unfortunately, this distribution is intractable to compute in general, so we have to resort to a variational approach.

99 Variational Inference In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

100 Variational Inference The difference between the lower bound and the likelihood is the KL divergence. Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

101 VBEM vs EM They differ only in the E-step. In standard EM, q(X) is directly set to p(X|D, θ), which makes KL = 0. In VBEM, it is intractable to compute p(X|D, θ); instead, we approximate p(X|D, θ) by a variational distribution q(X) obtained by minimizing KL(q(X) ‖ p(X|D, θ)). This is also equivalent to maximizing the lower bound L(θ).

102 Parameter Estimation Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data. Strategy (variational EM): lower-bound log p(w|α, β) by a function L(γ, φ; α, β) and repeat until convergence: – E: maximize L(γ, φ; α, β) with respect to the variational parameters γ, φ. – M: maximize the bound with respect to the model parameters α and β.

103 Parameter Estimation E-Step: variational inference – repeat until convergence: φ_{ni} ∝ β_{i,w_n} exp(Ψ(γ_i)); γ_i = α_i + Σ_n φ_{ni}. M-Step: parameter estimation – β: β_{ij} ∝ Σ_d Σ_n φ_{dni} w_{dn}^j; α: can be estimated using the Newton–Raphson method.

104 Topic Examples in a 100-topic LDA Model 16,000 documents from a subset of the TREC AP corpus.

105 Classification (50-topic LDA + SVM) Reuters-21578 dataset – contains 8000 documents and 15,818 words. (a) EARN vs. NOT EARN. (b) GRAIN vs. NOT GRAIN.

106 Problems in LDA The Dirichlet distribution is helpful to avoid over-fitting, but the assumption might be too strong.

107 A Bayesian Hierarchical Model for Learning Natural Scene Categories Incorporating category information. (Plate diagram over M images and N_d patches: a per-image mixture π over topics z generating patches x, with parameters θ and β.)

108 Codebook A codebook of 174 codewords built from local image patches. Detection: evenly sampled grid, random sampling, saliency detector, Lowe's DoG detector. Representation: normalized 11×11 gray values, 128-dim SIFT.

109 Topic Distribution in Different Categories

110 Topic Hierarchical Clustering

111 More Topic Models Dynamic topic models, ICML 2006. Correlated topic model, NIPS 2005. Hierarchical Dirichlet process, Journal of the American Statistical Association 2003. Nonparametric Bayes pachinko allocation, UAI 2007. Supervised LDA, NIPS 2007. MedLDA – maximum margin discriminant LDA, ICML 2009. …

112 Are you really into Graphical Models? Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.

113 Reference David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003. Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.

114 Outline Matrix Decomposition – PCA, SVD, NMF – LDA, ICA, Sparse Coding, etc. Graphical Model – Basic concepts in probabilistic machine learning – EM – pLSA – LDA Two Applications – Document decomposition for “long query” retrieval, ICCV 2009 – Modeling Threaded Discussions, SIGIR 2009

115 Large-Scale Indexing for “Long Query” Retrieval (Similarity Search) Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum ICCV 2009

116 The Long Query Problem If a query contains 1000 keywords: – need to access 1000 inverted lists – the intersection of 1000 inverted lists may be empty – the union of 1000 inverted lists may be the whole corpus. Dimension reduction? Topic projection maps the term vector (Term_1, Term_2, …, Term_N; dim = 1 million) to a low-dimensional feature vector (f_1, f_2, …, f_M; dim = 200).

117 Key Idea: Dimension Reduction + Residual Error Preservation p ≈ Xw + ε, where p is the original TF-IDF vector in vocabulary space, X is the projection matrix for dimension reduction, w is the low-dimensional feature vector, and ε is the residual error. An image = a low-dimensional representation + a few words (about 10 words).

118 Orthogonal Decomposition p = Σ_i w_i X_i + ε: the base vectors X_1, X_2, X_3, …, X_k give the low-dimensional representation w, and the residual ε is kept as 'a few words' (about 10 words) per image.
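A toy sketch of the decomposition; the dimensions, the random orthonormal basis, and the "keep the 10 largest residual entries" rule are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(9)
V, k = 1000, 20                          # vocabulary size, reduced dimension (toy numbers)
X, _ = np.linalg.qr(rng.standard_normal((V, k)))   # orthonormal basis X_1..X_k (columns)

p = rng.random(V); p /= np.linalg.norm(p)          # a toy TF-IDF vector

w = X.T @ p                  # low-dimensional representation
residual = p - X @ w         # residual error, orthogonal to the span of X

# Keep only the few largest residual entries ("a few words") and drop the rest.
keep = np.argsort(np.abs(residual))[-10:]
residual_sparse = np.zeros_like(residual)
residual_sparse[keep] = residual[keep]

p_approx = X @ w + residual_sparse
print(np.linalg.norm(p - X @ w), np.linalg.norm(p - p_approx))  # error shrinks
```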

119 A Probabilistic Implementation x is a switch variable: it controls whether a word is generated from a topic-specific distribution, a document-specific distribution, or a background distribution. C. Chemudugunta, et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006.

120 Search (Online) A query = a low-dimensional vector + a few words. The low-dimensional vector is matched against an LSH index to retrieve candidate documents, which are then re-ranked using the residual words and document metadata. Index for 10M images: 4.6 GB. Search speed: < 100 ms.

121 Search Example Query Image

122 Search Example Query Image

123 Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang SIGIR 2009

124 Semantic & Structure Semantic: topics. Structure: who replies to whom.

125 Optimize Them Together Model the semantics and model the structure within a single objective.

126 Reply Reconstruction Based on document similarity, topic similarity, and structure similarity.

127 Baselines NP – reply to nearest post. RR – reply to root. DS – document similarity. LDA – latent Dirichlet allocation: project documents into topic space. SWB – special words topic model with background distribution: project documents into topic and junk-topic space.

128 Evaluation

Method | Slashdot, All Posts | Slashdot, Good Posts | Apple, All Posts | Apple, Good Posts
NP     | 0.021 | 0.012 | 0.289 | 0.239
RR     | 0.183 | 0.319 | 0.269 | 0.474
DS     | 0.463 | 0.643 | 0.409 | 0.628
LDA    | 0.465 | 0.644 | 0.410 | 0.648
SWB    | 0.463 | 0.644 | 0.410 | 0.641
SMSS   | 0.524 | 0.737 | 0.517 | 0.772

129 Expert Finding Methods: HITS, PageRank, …

130 Baselines LM – Formal Models for Expert Finding in Enterprise Corpora, SIGIR '06: achieves stable performance in the expert finding task using a language model. PageRank – benchmark node-ranking method. HITS – finds hub nodes and authority nodes. EABIF – Personalized Recommendation Driven by Information Flow, SIGIR '06: finds the most influential nodes.

131 Evaluation (Bayesian estimate)

Method          | MRR   | MAP   | P@10
LM              | 0.821 | 0.698 | 0.800
EABIF (ori.)    | 0.674 | 0.362 | 0.243
EABIF (rec.)    | 0.742 | 0.318 | 0.281
PageRank (ori.) | 0.675 | 0.377 | 0.263
PageRank (rec.) | 0.743 | 0.321 | 0.266
HITS (ori.)     | 0.906 | 0.832 | 0.900
HITS (rec.)     | 0.938 | 0.822 | 0.906

132 Summary Matrices and probability are fundamental mathematics in information retrieval and computer vision: – matrix decomposition is a good practice for learning matrices – graphical models are a good practice for learning probability. A graphical model is a good tool to analyze and understand problems. The essence of decomposition is to discover a set of mid-level features to describe the original documents/images. Graphical models are more adaptable to various applications than matrix decomposition.

