Download presentation
Presentation is loading. Please wait.
Published byEverett Pearson Modified over 8 years ago
1
An Introduction To Matrix Decomposition and Graphical Model Lei Zhang/Lead Researcher Microsoft Research Asia 2012-04-17
2
Outline Matrix Decomposition – PCA, SVD, NMF – LDA, ICA, Sparse Coding, etc. Graphical Model – Basic concepts in probabilistic machine learning – EM – pLSA – LDA Two Applications – Document decomposition for “long query” retrieval – Modeling Threaded Discussions
3
What Is Matrix Decomposition? We wish to decompose the matrix A by writing it as a product of two or more matrices: A n × m = B n × k C k × m Suppose A, B, C are column matrices – A n×m = (a 1, a 2, …, a m ), each a i is a n-dim data sample – B n × k = (b 1, b 2, …, b k ), each b j is a n-dim basis, and space B consists of k bases. – C k × m = (c 1, c 2, …, c m ), each c i is the k -dim coordinates of a i projected to space B
4
Why We Need Matrix Decomposition? Given one data sample: a 1 = B n × k c 1 (a 11, a 12, …, a 1n ) T = (b 1, b 2, …, b k ) (c 11, c 12, …, c 1k ) T Another data sample: a 2 = B n × k c 2 More data sample: a m = B n × k c m Together (m data samples): (a 1, a 2, …, a m ) = B n × k (c 1, c 2, …, c m ) A n × m = B n × k C k × m
5
Why We Need Matrix Decomposition? (a 1, a 2, …, a m ) = B n × k (c 1, c 2, …, c m ) A n × m = B n × k C k × m We wish to find a set of new basis B to represent data samples A, and A will become C in the new space. In general, B captures the common features in A, while C carries specific characteristics of the original samples. In PCA: B is eigenvectors In SVD: B is right (column) eigenvectors In LDA: B is discriminant directions In NMF: B is local features
6
PRINCIPLE COMPONENT ANALYSIS
7
Definition – Eigenvalue & Eigenvector Given a m x m matrix C, for any λ and w, if Then λ is called eigenvalue, and w is called eigenvector.
8
Definition – Principle Component Analysis – Principle Component Analysis (PCA) – Karhunen-Loeve transformation (KL transformation) Let A be a n × m data matrix in which the rows represent data samples Each row is a data vector, each column represents a variable A is centered: the estimated mean is subtracted from each column, so each column has zero mean. Covariance matrix C ( m x m ):
9
Principle Component Analysis C can be decomposed as follows: C=UΛU T Λ is a diagonal matrix diag(λ 1 λ 2,…,λ n ), each λ i is an eigenvalue U is an orthogonal matrix, each column is an eigenvector U T U=I U -1 =U T
10
Maximizing Variance The objective of the rotation transformation is to find the maximal variance Projection of data along w is Aw. Variance: σ 2 w = (Aw) T (Aw) = w T A T Aw = w T Cw where C = A T A is the covariance matrix of the data ( A is centered!) Task: maximize variance subject to constraint w T w = 1.
11
Optimization Problem Maximize λ is the Lagrange multiplier Differentiating with respect to w yields Eigenvalue equation: Cw = λw, where C = A T A. Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.
12
Property: Data Decomposition PCA can be treated as data decomposition a=UU T a =(u 1,u 2,…,u n ) (u 1,u 2,…,u n ) T a =(u 1,u 2,…,u n ) (, …, ) T =(u 1,u 2,…,u n ) (b 1, b 2, …, b n ) T = Σ b i ·u i
13
Face Recognition – Eigenface Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces, CVPR 1991 (Citation: 2654) The eigenface approach – images are points in a vector space – use PCA to reduce dimensionality – face space – compare projections onto face space to recognize faces
16
PageRank – Power Iteration Column j has nonzero elements in positions corresponding to outlinks of j ( N j in total) Row i has nonzero element in positions corresponding to inlinks I i.
17
Column-Stochastic & Irreducible Column-Stochastic where Irreducible
18
Iterative PageRank Calculation For k=1,2,… Equivalently ( λ =1, A is a Markov chain transition matrix) Why can we use power iteration to find the first eigenvector?
19
Convergence of the power iteration Expand the initial approximation r 0 in terms of the eigenvectors
20
SINGULAR VALUE DECOMPOSITION
21
SVD - Definition Any m x n matrix A, with m ≥ n, can be factorized
22
Singular Values And Singular Vectors The diagonal elements σ j of are the singular values of the matrix A. The columns of U and V are the left singular vectors and right singular vectors respectively. Equivalent form of SVD:
23
Matrix approximation Theorem: Let U k = ( u 1 u 2 … u k ), V k = ( v 1 v 2 … v k ) and Σ k = diag(σ 1, σ 2, …, σ k ), and define Then It means, that the best approximation of rank k for the matrix A is
24
SVD and PCA We can write: Remember that in PCA, we treat A as a row matrix V is just eigenvectors for A – Each column in V is an eigenvector of row matrix A – we use V to approximate a row in A Equivalently, we can write: U is just eigenvectors for A T – Each column in U is an eigenvector of column matrix A – We use U to approximate a column in A
25
Example - LSI Build a term-by-document matrix A Compute the SVD of A : A = UΣV T Approximate A by – U k : Orthogonal basis, that we use to approximate all the documents – D k : Column j hold the coordinates of document j in the new basis – D k is the projection of A onto the subspace spanned by U k.
26
SVD and PCA For symmetric A, SVD is closely related to PCA PCA: A = UΛU T – U and Λ are eigenvectors and eigenvalues. SVD: A = UΛV T – U is left(column) eigenvectors – V is right(row) eigenvectors – Λ is the same eigenvalues For symmetric A, column eigenvectors equal to row eigenvectors Note the difference of A in PCA and SVD – SVD: A is directly the data, e.g. term-by-document matrix – PCA: A is covariance matrix, A = X T X, each row in X is a sample
27
Latent Semantic Indexing (LSI) 1.Document file preparation/ preprocessing: – Indexing: collecting terms – Use stop list: eliminate ”meaningless” words – Stemming 2.Construction term-by-document matrix, sparse matrix storage. 3.Query matching: distance measures. 4.Data compression by low rank approximation: SVD 5.Ranking and relevance feedback.
28
Latent Semantic Indexing Assumption: there is some underlying latent semantic structure in the data. E.g. car and automobile occur in similar documents, as do cows and sheep. This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD.
29
Similarity Measures Term to term AA T = UΣ 2 U T = (UΣ)(UΣ) T UΣ are the coordinates of A (rows) projected to space V Document to document A T A = VΣ 2 V T = (VΣ)(VΣ) T VΣ are the coordinates of A (columns) projected to space U
30
Similarity Measures Term to document A = UΣV T = (UΣ ½ )(VΣ ½ ) T UΣ ½ are the coordinates of A (rows) projected to space V VΣ ½ are the coordinates of A (columns) projected to space U
31
HITS (Hyperlink Induced Topic Search) Idea: Web includes two flavors of prominent pages: – authorities contain high-quality information, – hubs are comprehensive lists of links to authorities A page is a good authority, if many hubs point to it. A page is a good hub if it points to many authorities. Good authorities are pointed to by good hubs and good hubs point to good authorities. HubsAuthorities
32
Power Iteration Each page i has both a hub score h i and an authority score a i. HITS successively refines these scores by computing Define the adjacency matrix L of the directed web graph: Now
33
HITS and SVD L : rows are outlinks, columns are inlinks. a will be the dominant eigenvector of the authority matrix L T L h will be the dominant eigenvector of the hub matrix LL T They are in fact the first left and right singular vectors of L!! We are in fact running SVD on the adjacency matrix.
34
HITS vs PageRank PageRank may be computed once, HITS is computed per query. HITS takes query into account, PageRank doesn’t. PageRank has no concept of hubs HITS is sensitive to local topology: insertion or deletion of a small number of nodes may change the scores a lot. PageRank more stable, because of its random jump step.
35
NMF – NON-NEGATIVE MATRIX FACTORIZATION
36
Definition Given a nonnegative matrix V n × m, find non-negative matrix factors W n × k and H k × m such that V n × m ≈W n × k H k × m V : column matrix, each column is a data sample (n-dimension) W i : k -basis represents one base H : coordinates of V projected to W v j ≈ W n × k h j
37
Motivation Non-negativity is natural in many applications... Probability is also non-negative Additive model to capture local structure
38
Multiplicative Update Algorithm Cost function Euclidean distance Multiplicative Update
39
Multiplicative Update Algorithm Cost function Divergence – Reduce to Kullback-Leibler divergence when – A and B can be regarded as normalized probability distributions. Multiplicative update PLSA is NMF with KL divergence
40
NMF vs PCA n = 2429 faces, m = 19x19 pixels Positive values are illustrated with black pixels and negative values with red pixels NMF Parts-based representation PCA Holistic representations
41
Reference D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. (pdf) NIPS, 2001.pdf D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. (pdf) Nature 401, 788-791 (1999).pdf
42
Major Reference Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (Highly recommend)Linear Algebra Methods for Data Mining
43
Outline Basic concepts – Likelihood, i.i.d. – ML, MAP and Bayesian Inference – Expectation-Maximization – Mixture Gaussian Parameter estimation pLSA – Motivation – Derivation & Geometry properties – Applications LDA – Motivation - Why to add a hyper parameter – Dirichlet Distribution – Variational EM – Relations with other topic modals – Incorporating category information Summary
44
Not Included General graphical model theories Markov random field (belief propagation) Detailed derivation of LDA
45
BASIC CONCEPTS
46
What Is Machine Learning? Data Let x = (x 1, x 2,..., x D ) T denote a data point, and D = {x (1), x (2)..., x (N) }, a data set. D is sometimes associated with desired outputs y 1, y 2,.... Predictions We are generally interested in predicting something based on the observed data set. Given D what can we say about x (N+1) ? Model To make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model, with some parameters, θ Given data D, we learn the model parameters, from which we can predict new data points. The model can often be expressed as a probability distribution over data points
47
Likelihood Function Given a set of parameter values, probability density function (PDF) will show that some data are more probable than other data. Inversely, given the observed data and a model of interest, Likelihood function is defined as: L(θ) = f θ (x|θ) = p(x|θ) That is, likelihood function L(θ) will show that some parameters are more likely to have produced the data
48
Maximum Likelihood (ML) Maximum likelihood will find the best model parameters that make the data “most likely” generated from this model. Suppose we are given n data samples (x 1, x 2, …, x n ), Maximum likelihood will find θ that maximize L(θ) Predictive distribution
49
I.I.D. – Independent, Identically Distributed I.I.D. means The problem is considerably simplified as: Usually, log likehood is used
50
Reference Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1-2 Slides)Machine Learning Lectures 1-2 Slides Gregor Heinrich, Parameter estimation for text analysis, Technical Note, 2005-2008Parameter estimation for text analysis
51
EXPECTATION MAXIMIZATION
52
Why We Need EM? The Expectation–Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models. Why we need latent variables? To describe complex model Gaussian Mixture Model To discover the intrinsic structure inside a data set Topic Models, such as pLSA, LDA.
53
More General Data set, Likelihood: Goal: learn maximum likelihood (ML) parameter values The maximum likelihood procedure finds parameters θ such that: Because of the integral (or sum) over latent variables, the likelihood can be a very complicated, and hard to optimize function.
54
The Expectation Maximization (EM) Algorithm The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters, and iterates two steps: E step: Fill in values of latent variables according to posterior given data. M step: Maximize likelihood as if latent variables were not hidden. Decomposes difficult problems into series of tractable steps.
55
Jensen’s Inequality
56
Lower Bounding the Log Likelihood Observed data D = {y n } ; Latent variables X = {x n } ; Parameters θ. Goal: Maximize the log likelihood (i.e. ML learning) wrt θ : Any distribution, q(X), over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen’s inequality: where H[q] is the entropy of q(X).
57
The E and M Steps of EM The lower bound on the log likelihood is given by: EM alternates between: E step: optimize wrt distribution over hidden variables holding params fixed: M step: maximize wrt parameters holding hidden distribution fixed:
58
The E Step E step: for fixed θ, The second term is the Kullback-Leibler divergence. This means that, for fixed θ, F is bounded above by L, and achieves that bound when So, the E step simply sets
59
The M Step M step: maximize wrt parameters holding hidden distribution q fixed: The second equality comes from fact that entropy of q(X) does not depend directly on θ. The specific form of the M step depends on the model. Often the maximum wrt θ can be found analytically.
60
EM Never Decreases the Likelihood The E and M steps together never decrease the log likelihood: The E step brings F(q, θ) to the likelihood L(θ). 顶 The M-step maximizes F(q, θ) wrt θ. 抬 F(q, θ) < L(θ) by Jensen – or, equivalently, from the non- negativity of KL
61
Reference Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised learning, Lecture 5 Slides)Machine Learning Lecture 5 Slides Christopher M. Bishop (2006) Pattern Recognition and Machine Learning. SpringerPattern Recognition and Machine Learning
62
WHY DO WE NEED GRAPHICAL MODEL?
63
Why Do We Need Graphical Models? Cons – Graphical model is so complex, even with a few circles… – We have to make too many assumptions. Pros – We do need probability to explain our world. But joint probability is hard to compute. – Graphical model can help us analyze and understand our problems. – Graphs are an intuitive way of representing and visualizing the relationships between many variables. – With a graphical model, we can decouple joint probability to conditional probabilities, which are usually easier.
64
Directed Acyclic Graphical Models (Bayesian Networks) A DAG Model / Bayesian network corresponds to a factorization of the joint probability distribution: p(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D) In general where pa (i) are the parents of node i.
65
Directed Graphs for Statistical Models: Plate Notation A data set of N points generated from a Gaussian:
66
PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS
67
Latent Semantic Indexing (LSI) Review For natural Language Queries, simple term matching does not work effectively – Ambiguous terms – Same Queries vary due to personal styles Latent semantic indexing – Creates this ‘latent semantic space’ (hidden meaning) LSI puts documents together even if they don’t have common words, if the docs share frequently co-occurring terms Disadvantages: – Statistical foundation is missing
68
pLSA – Probabilistic Latent Semantic Analysis Automated Document Indexing and Information retrieval Identification of Latent Classes using an Expectation Maximization (EM) Algorithm Shown to solve – Polysemy Java could mean “coffee” and also the “PL Java” Cricket is a “game” and also an “insect” – Synonymy “computer”, “pc”, “desktop” all could mean the same Has a better statistical foundation than LSA
69
pLSA M Nd d d z w w M d d z1z1 w1w1 w1w1 z2z2 w2w2 w2w2 z3z3 w3w3 w3w3 zNzN wNwN wNwN … z 1, …, z N are variables. z i є[1,K]. K is the number of latent topics.
70
pLSA d1d1 d1d1 z1z1 w1w1 w1w1 z2z2 w2w2 w2w2 z N1 w N1 … d2d2 d2d2 z1z1 w1w1 w1w1 z2z2 w2w2 w2w2 z N2 w N2 … dMdM dMdM z1z1 w1w1 w1w1 z2z2 w2w2 w2w2 z Nm w Nm … p(w|z=1), p(w|z=2), p(w | z=N m ) are shared for all documents. Likelihood:
71
Joint Probability vs Likelihood Joint probability Likelihood (only for observed variables) p(d) is assumed to be uniform
72
Document Decomposition Each document can be decomposed as: This is similar to the matrix decomposition, if we consider each discrete distribution as a vector. p(w|d) = Z V × k p(z|d) With many documents, we hope to find latent topics as common basis.
73
pLSA – Objective Function pLSA tries to maximize the log likelihood: Due to the summation over z inside log, we have to resort to EM.
74
EM Steps E-Step – Expectation of the likelihood function is calculated with the current parameter values M-Step – Update the parameters with the calculated posterior probabilities – Find the parameters that maximizes the likelihood function
75
Lower Bounding the Log Likelihood
76
EM Steps The E-Step: The M-Step:
77
Latent Subspace
78
pLSA vs LSA LSA and PLSA perform dimensionality reduction – In LSA, by keeping only K singular values – In PLSA, by having K aspects Comparison to SVD – U Matrix related to P(z|d) (doc to aspect) – V Matrix related to P(w|z) (aspect to term) – E Matrix related to P(z) (aspect strength)
79
pLSA vs LSA The main difference is the way the approximation is done PLSA generates a model (aspect model) and maximizes its predictive power Selecting the proper value of K is heuristic in LSA Model selection in statistics can determine optimal K in PLSA
80
Applications Text mining / topic discovering Scene Classification
81
Text Mining
82
Scene Classification
83
Example Images
84
Classification Result
85
Reference Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999Probabilistic latent semantic analysis Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA Proceedings of the European Conference on Computer Vision (2006)Scene Classification via pLSA Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T., Discovering Object Categories in Image Collections, MIT AI Lab Memo AIM-2005-005, February, 2005. Discovering Object Categories in Image Collections
86
LDA – LATENT DIRICHILET ALLOCATION
87
Problems in pLSA pLSA provides no probabilistic model at the document level. Each doc has its own topic mixture proportion. The number of parameters in the model grows linearly with M (the number of documents in the training set).
88
Problems in pLSA There is no constraint for distributions p(z|d i ). Easy to lead to serious problems with over-fitting. d1d1 d1d1 z1z1 w1w1 w1w1 z2z2 w2w2 w2w2 z N1 w N1 … d2d2 d2d2 z1z1 w1w1 w1w1 z2z2 w2w2 w2w2 z N2 w N2 … dmdm dmdm z1z1 w1w1 w1w1 z2z2 w2w2 w2w2 z Nm w Nm … p(z|d 1 )p(z|d 2 )p(z|d m )
89
Dirichlet Distribution In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution. Requirement for such a distribution: – The samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one. That is, the samples are multinormials. – Easy to optimize. Dirichlet Distribution is one of such distributions. The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex.
90
Dirichlet Distribution Definition: The density is zero outside this open (K − 1)-dimensional simplex.
91
Various parameter α. (6, 2, 2) (3, 7, 5) (2, 3, 4) (6, 2, 6) Example Dirichlet Distributions (K=3)
92
Equal α i, different α 0 =0.1α 0 =1α 0 =10
93
The LDA Model z4z4 z3z3 z2z2 z1z1 w4w4 w3w3 w2w2 w1w1 z4z4 z3z3 z2z2 z1z1 w4w4 w3w3 w2w2 w1w1 z4z4 z3z3 z2z2 z1z1 w4w4 w3w3 w2w2 w1w1 For each document, Choose ~Dirichlet( ) For each of the N words w n : – Choose a topic z n » Multinomial( ) – Choose a word w n from p(w n |z n, ), a multinomial probability conditioned on the topic z n.
94
The LDA Model For each document, Choose ~Dirichlet( ) For each of the N words w n : – Choose a topic z n » Multinomial( ) – Choose a word w n from p(w n |z n, ), a multinomial probability conditioned on the topic z n.
95
Joint Probability Given parameter α and β where
96
Likelihood Joint Probability Marginal distribution of a document Likelihood over all the documents
97
Inference The likelihood can be computed by summing each document. Jansen’s inequality in EM.
98
Inference In E-Step, we need to compute the posterior distribution of the hidden variables. Unfortunately, this distribution is intractable to compute in general. We have to resort to variational approach
99
Variational Inference In variational inference, we consider a simplified graphical model with variational parameters , and minimize the KL Divergence between the variational and posterior distributions.
100
Variantional Inference The difference between the lower bound and the likelihood is the KL divergence. Maximizing the lower bound L( , ; , ) with respect to and is equivalent to minimizing the KL divergence.
101
VBEM vs EM Only different in the E-Step. In standard EM, q(X) is directly set to p(X|D,θ), and let KL=0. In VBEM, it is intractable to compute p(X|D,θ). Instead, it approximates p(X|D,θ) by a variational distribution q(X) by minimizing KL(q(X) | P(X|D, θ). This is also equivalent to maximizing the lower bound L(θ).
102
Parameter Estimation Given a corpus of documents, we would like to find the parameters and which maximize the likelihood of the observed data. Strategy (Variational EM): Lower bound log p(w| , ) by a function L( , ; , ) Repeat until convergence: – E: Maximize L( , ; , ) with respect to the variational parameters , . – M: Maximize the bound with respect to parameters and .
103
Parameter Estimation E-Step: Variational Inference – repeat until convergence. M-Step: Parameter estimation β: α: can be implemented using the Newton-Raphson method.
104
Topic Examples in a 100-topic LDA Model) 16,000 documents from a subset of the TREC AP corpus.
105
Classification (50-topic LDA + SVM) Reuters-21578 dataset – contains 8000 documents and 15,818 words. (a) EARN vs. NOT EARN. (b) GRAIN vs. NOT GRAIN.
106
Problems in LDA Dirichlet Distribution is helpful to avoid over-fitting. But the assumption might be too strong. z4z4 z3z3 z2z2 z1z1 w4w4 w3w3 w2w2 w1w1 z4z4 z3z3 z2z2 z1z1 w4w4 w3w3 w2w2 w1w1 z4z4 z3z3 z2z2 z1z1 w4w4 w3w3 w2w2 w1w1
107
A Bayesian Hierarchical Model for Learning Natural Scene Categories Incorporating category information M Nd π π z x x θ β
108
Codebook 174 Local Image Patches Detection: Evenly Sampled Grid Random Sampling Saliency Detector Lowe’s DoG Detector Representation: Normalized 11x11 gray values 128-dim SIFT
109
Topic Distribution in Different Categories
110
Topic Hierarchical Clustering
111
More Topic Models Dynamic topic models, ICML 2006 Correlated Topic Model, NIPS 2005 Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003 Nonparametric Bayes pachinko allocation, UAI 2007 Supervised LDA, NIPS 2007 MedLDA – Maximum Margin Discrimant LDA, ICML 2009 …
112
Are you really into Graphical Models? Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005. Describing Visual Scenes using Transformed Dirichlet Processes
113
Reference David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research (JMLR), 2003Latent Dirichlet Allocation Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference, PhD. Thesis, University of Cambridge, 1998Variational Algorithms for Approximate Bayesian Inference L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.A Bayesian Hierarchical Model for Learning Natural Scene Categories
114
Outline Matrix Decomposition – PCA, SVD, NMF – LDA, ICA, Sparse Coding, etc. Graphical Model – Basic concepts in probabilistic machine learning – EM – pLSA – LDA Two Applications – Document decomposition for “long query” retrieval, ICCV 2009 – Modeling Threaded Discussions, SIGIR 2009
115
Large-Scale Indexing for “Long Query” Retrieval (Similarity Search) Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum ICCV 2009
116
The Long Query Problem If a query contains 1000 keywords: – Need to access 1000 inverted lists – The intersection of 1000 inverted lists may be empty – The union of 1000 inverted list may be the whole corpus Dimension reduction? Term1Term2Term3Term4…TermN Img11200…2 f1f1 f2f2 …fMfM 0.20.1…0.03 Topic Projection Dim = 1 million Dim = 200
117
Key Idea: Dimension Reduction + Residual Error Preservation p : original TF-IDF vector in vocabulary space X : projection matrix for dimension reduction w: low dimensional feature vector : residual error An image = + a few words (10 words) a few words (10 words)
118
Orthogonal Decomposition Base vector Low dimensional representation Residual An image = + a few words (10 words) a few words (10 words) X 1, X 2, X 3, …, X k
119
A Probabilistic Implementation x is a switch variable. It controls a word generated from: a topic specific distribution a document specific distribution a background distribution C. Chemudugunta, et.al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model, NIPS 2006
120
Search (Online) DS1DS2 … DS1DS2 … DS1DS2 … DS1DS2 … DS1DS2 … … … … … … LSH Index Doc 300Doc 401 … A query = + a few words Re-rankingRe-ranking Doc 401 … Doc 1 Doc 2 Doc 300 Doc 401 Doc N Doc 300 Index 10M Images: 4.6GB Search Speed: < 100ms Doc Meta
121
Search Example Query Image
122
Search Example Query Image
123
Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications Chen LIN, Jiang-Ming YANG, Rui CAI, Xin-jing WANG, Wei WANG, Lei ZHANG SIGIR 2009 123
124
Semantic & structure 124 Semantic: Topics Structure: Who reply to who
125
Optimize them together Model semantic Model structure 125
126
Reply reconstruction 126 Document Similarity Topic Similarity Structure Similarity
127
Baselines NP Reply to Nearest Post RR Reply to Root DS Document Similarity LDA Latent Dirichlet Allocation Project documents to topic space SWB Special Words Topic Model with Background distribution Project documents to topic and junk topic space 127
128
Evaluation methodSlashdotApple All PostsGood PostsAll PostsGood Posts NP0.0210.0120.2890.239 RR0.1830.3190.2690.474 DS0.4630.6430.4090.628 LDA0.4650.6440.4100.648 SWB0.4630.6440.4100.641 SMSS0.5240.7370.5170.772 128
129
Expert finding Methods HITS PageRank … 129
130
Baselines LM Formal Models for Expert Finding in Enterprise Corpora. SIGIR 06 Achieves stable performance in expert finding task using a language model PageRank Benchmark nodal ranking method HITS Find hub nodes and authority node EABIF Personalized Recommendation Driven by Information Flow. SIGIR ’06 Find most influential node 130
131
Evaluation 131 Bayesian estimate MethodMRRMAPP@10 LM0.8210.6980.800 EABIF(ori.)0.6740.3620.243 EABIF(rec.)0.7420.3180.281 PageRank(ori.)0.6750.3770.263 PageRank(rec.)0.7430.3210.266 HITS(ori.)0.9060.8320.900 HITS(rec.)0.9380.8220.906
132
Summary Matrix and probability are fundamental mathematics in information retrieval and computer vision. – Matrix decomposition – a good practice to learn matrix. – Graphical model – a good practice to learn probability. Graphical model is a good tool to analyze problems The essence of decomposition is to discover a set of mid-level features to describe original documents/images. It is more adaptable for various applications than matrix decomposition.
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.