An Introduction to Matrix Decomposition and Graphical Model. Lei Zhang, Lead Researcher, Microsoft Research Asia. 2012-04-17.


An Introduction to Matrix Decomposition and Graphical Model. Lei Zhang, Lead Researcher, Microsoft Research Asia.

Outline Matrix Decomposition – PCA, SVD, NMF – LDA, ICA, Sparse Coding, etc. Graphical Model – Basic concepts in probabilistic machine learning – EM – pLSA – LDA Two Applications – Document decomposition for “long query” retrieval – Modeling Threaded Discussions

What Is Matrix Decomposition? We wish to decompose the matrix A by writing it as a product of two or more matrices: A_{n×m} = B_{n×k} C_{k×m}. Suppose A, B, C are column matrices: – A_{n×m} = (a_1, a_2, …, a_m), each a_i is an n-dimensional data sample – B_{n×k} = (b_1, b_2, …, b_k), each b_j is an n-dimensional basis vector, and the space B consists of k bases – C_{k×m} = (c_1, c_2, …, c_m), each c_i is the k-dimensional coordinate vector of a_i projected onto the space B

Why Do We Need Matrix Decomposition? Given one data sample: a_1 = B_{n×k} c_1, i.e., (a_11, a_12, …, a_1n)^T = (b_1, b_2, …, b_k)(c_11, c_12, …, c_1k)^T. Another data sample: a_2 = B_{n×k} c_2. More data samples: a_m = B_{n×k} c_m. Together (m data samples): (a_1, a_2, …, a_m) = B_{n×k}(c_1, c_2, …, c_m), i.e., A_{n×m} = B_{n×k} C_{k×m}.

Why Do We Need Matrix Decomposition? (a_1, a_2, …, a_m) = B_{n×k}(c_1, c_2, …, c_m), i.e., A_{n×m} = B_{n×k} C_{k×m}. We wish to find a new set of bases B to represent the data samples A, and A becomes C in the new space. In general, B captures the common features in A, while C carries the specific characteristics of the original samples. In PCA: B is the eigenvectors. In SVD: B is the left (column) singular vectors. In LDA: B is the discriminant directions. In NMF: B is the local features.

PRINCIPAL COMPONENT ANALYSIS

Definition – Eigenvalue & Eigenvector. Given an m×m matrix C, if for some scalar λ and non-zero vector w we have Cw = λw, then λ is called an eigenvalue and w an eigenvector of C.

Definition – Principal Component Analysis – Principal Component Analysis (PCA) – Karhunen-Loève transformation (KL transformation). Let A be an n×m data matrix in which the rows represent data samples: each row is a data vector, each column represents a variable. A is centered: the estimated mean is subtracted from each column, so each column has zero mean. Covariance matrix C (m×m): C = A^T A (ignoring the 1/(n−1) normalization).

Principal Component Analysis. C can be decomposed as follows: C = UΛU^T. Λ is a diagonal matrix diag(λ_1, λ_2, …, λ_m); each λ_i is an eigenvalue. U is an orthogonal matrix; each column is an eigenvector. U^T U = I, so U^{-1} = U^T.

Maximizing Variance. The objective of the rotation transformation is to find the direction of maximal variance. The projection of the data along w is Aw. Variance: σ²_w = (Aw)^T(Aw) = w^T A^T A w = w^T C w, where C = A^T A is the covariance matrix of the data (A is centered!). Task: maximize the variance subject to the constraint w^T w = 1.

Optimization Problem. Maximize J(w) = w^T C w − λ(w^T w − 1), where λ is the Lagrange multiplier. Differentiating with respect to w yields the eigenvalue equation Cw = λw, where C = A^T A. Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.

Property: Data Decomposition. PCA can be treated as data decomposition: a = UU^T a = (u_1, u_2, …, u_n)(u_1, u_2, …, u_n)^T a = (u_1, u_2, …, u_n)(u_1^T a, …, u_n^T a)^T = (u_1, u_2, …, u_n)(b_1, b_2, …, b_n)^T = Σ_i b_i · u_i, where b_i = u_i^T a.
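As a concrete illustration, here is a minimal NumPy sketch of PCA by eigendecomposition of the covariance matrix; the variable names (X, n_components) and the random data are illustrative choices, not from the slides.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigendecomposition of the covariance matrix.

    X: (n_samples, n_features) data matrix, rows are samples.
    Returns (components, projected): top eigenvectors as columns, and
    the coordinates of the samples in the new basis.
    """
    # Center each column (variable) so it has zero mean.
    Xc = X - X.mean(axis=0)
    # Covariance matrix C = A^T A (with 1/(n-1) normalization).
    C = Xc.T @ Xc / (X.shape[0] - 1)
    # Eigendecomposition C = U Lambda U^T (eigh, since C is symmetric).
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort eigenvectors by decreasing eigenvalue and keep the top k.
    order = np.argsort(eigvals)[::-1][:n_components]
    U = eigvecs[:, order]
    return U, Xc @ U

# Tiny usage example with random data.
X = np.random.randn(100, 5)
U, Z = pca(X, 2)
print(U.shape, Z.shape)   # (5, 2) (100, 2)
```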

Face Recognition – Eigenface Turk, M.A.; Pentland, A.P. Face recognition using eigenfaces, CVPR 1991 (Citation: 2654) The eigenface approach – images are points in a vector space – use PCA to reduce dimensionality – face space – compare projections onto face space to recognize faces

PageRank – Power Iteration. Column j has nonzero elements in the positions corresponding to the outlinks of j (N_j in total); row i has nonzero elements in the positions corresponding to the inlinks I_i. The link matrix Q has Q_ij = 1/N_j if page j links to page i, and 0 otherwise.

Column-Stochastic & Irreducible. Column-stochastic: every column of the link matrix sums to one. Irreducible: a random-jump (teleportation) term is added so that every page is reachable from every other page, which guarantees a unique stationary distribution.

Iterative PageRank Calculation. For k = 1, 2, …: r_k = A r_{k−1}. Equivalently, we are solving Ar = r (λ = 1; A is a Markov chain transition matrix). Why can we use power iteration to find the first eigenvector?

Convergence of the Power Iteration. Expand the initial approximation r_0 in terms of the eigenvectors: r_0 = c_1 v_1 + c_2 v_2 + … + c_n v_n. Then r_k = A^k r_0 = c_1 λ_1^k v_1 + c_2 λ_2^k v_2 + … + c_n λ_n^k v_n. Since λ_1 = 1 and |λ_i| < 1 for i > 1, all terms except the first die out, and r_k converges to the direction of the dominant eigenvector v_1.
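A minimal NumPy sketch of the power iteration for PageRank is shown below; the damping factor alpha and the small column-stochastic link matrix are illustrative assumptions, not values from the slides.

```python
import numpy as np

def pagerank(Q, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration on A = alpha*Q + (1-alpha)*(1/n)*e*e^T.

    Q: column-stochastic link matrix (column j spreads 1/N_j over j's outlinks).
    Returns an approximation of the stationary distribution r with A r = r.
    """
    n = Q.shape[0]
    r = np.full(n, 1.0 / n)                           # uniform initial approximation r_0
    for _ in range(max_iter):
        r_new = alpha * (Q @ r) + (1 - alpha) / n     # random-jump term keeps A irreducible
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (columns sum to one).
Q = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
print(pagerank(Q))
```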

SINGULAR VALUE DECOMPOSITION

SVD – Definition. Any m×n matrix A, with m ≥ n, can be factorized as A = UΣV^T, where U (m×n) has orthonormal columns, Σ = diag(σ_1, σ_2, …, σ_n) with σ_1 ≥ σ_2 ≥ … ≥ σ_n ≥ 0, and V (n×n) is orthogonal.

Singular Values and Singular Vectors. The diagonal elements σ_j of Σ are the singular values of the matrix A. The columns of U and V are the left singular vectors and right singular vectors, respectively. Equivalent form of the SVD: A = Σ_j σ_j u_j v_j^T (equivalently, Av_j = σ_j u_j and A^T u_j = σ_j v_j).

Matrix Approximation. Theorem: let U_k = (u_1 u_2 … u_k), V_k = (v_1 v_2 … v_k) and Σ_k = diag(σ_1, σ_2, …, σ_k), and define A_k = U_k Σ_k V_k^T. Then min_{rank(B)=k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}. This means that the best rank-k approximation of the matrix A is A_k = U_k Σ_k V_k^T.
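A short NumPy sketch of the best rank-k approximation via truncated SVD; the random matrix A and the choice k = 2 are made-up illustrations.

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation A_k = U_k Sigma_k V_k^T (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.randn(8, 5)
A2 = best_rank_k(A, 2)
# The spectral-norm error equals the (k+1)-th singular value.
print(np.linalg.norm(A - A2, 2), np.linalg.svd(A, compute_uv=False)[2])
```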

SVD and PCA. We can write A^T A = VΣ²V^T and AA^T = UΣ²U^T. Remember that in PCA we treat A as a row matrix (rows are samples). V holds the eigenvectors of A^T A: each column of V is an eigenvector of the row-sample covariance, and we use V to approximate a row of A. Equivalently, U holds the eigenvectors of AA^T: each column of U is an eigenvector of the column-sample covariance, and we use U to approximate a column of A.

Example – LSI. Build a term-by-document matrix A. Compute the SVD of A: A = UΣV^T. Approximate A by A_k = U_k (Σ_k V_k^T) = U_k D_k: – U_k: orthogonal basis that we use to approximate all the documents – D_k: column j holds the coordinates of document j in the new basis – D_k is the projection of A onto the subspace spanned by U_k.

SVD and PCA. For symmetric A, SVD is closely related to PCA. PCA: A = UΛU^T, where U and Λ hold the eigenvectors and eigenvalues. SVD: A = UΣV^T, where U holds the left (column) singular vectors, V the right (row) singular vectors, and Σ the singular values. For symmetric A, the column eigenvectors equal the row eigenvectors, and the singular values coincide with the eigenvalues (up to sign). Note the different role of A in PCA and SVD: – SVD: A is directly the data, e.g. a term-by-document matrix – PCA: A is the covariance matrix, A = X^T X, where each row of X is a sample.

Latent Semantic Indexing (LSI). 1. Document file preparation/preprocessing: – indexing: collecting terms – use a stop list to eliminate "meaningless" words – stemming. 2. Construct the term-by-document matrix (sparse matrix storage). 3. Query matching: distance measures. 4. Data compression by low-rank approximation: SVD. 5. Ranking and relevance feedback.

Latent Semantic Indexing Assumption: there is some underlying latent semantic structure in the data. E.g. car and automobile occur in similar documents, as do cows and sheep. This structure can be enhanced by projecting the data (the term-by-document matrix and the queries) onto a lower dimensional space using SVD.

Similarity Measures. Term to term: AA^T = UΣ²U^T = (UΣ)(UΣ)^T; the rows of UΣ are the coordinates of the terms (rows of A) projected onto the space V. Document to document: A^T A = VΣ²V^T = (VΣ)(VΣ)^T; the rows of VΣ are the coordinates of the documents (columns of A) projected onto the space U.

Similarity Measures. Term to document: A = UΣV^T = (UΣ^{1/2})(VΣ^{1/2})^T; UΣ^{1/2} gives the coordinates of the terms (rows of A) projected onto the space V, and VΣ^{1/2} gives the coordinates of the documents (columns of A) projected onto the space U.
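The sketch below illustrates LSI-style retrieval in NumPy: a tiny term-by-document matrix is reduced with a truncated SVD and a query is folded into the latent space before ranking by cosine similarity; the toy matrix, query, and fold-in formula q_hat = Σ_k^{-1} U_k^T q are standard illustrative choices, not taken from the slides.

```python
import numpy as np

# Tiny term-by-document matrix A (rows: terms, columns: documents).
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk = U[:, :k], np.diag(s[:k])

# Fold a query (term vector) into the latent space: q_hat = Sigma_k^{-1} U_k^T q.
q = np.array([1., 0., 1., 0.])
q_hat = np.linalg.inv(Sk) @ Uk.T @ q

# Rank documents by cosine similarity in the latent space.
docs = Vt[:k, :].T            # document coordinates (one row per document)
cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-cos))       # documents ordered by similarity to the query
```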

HITS (Hyperlink Induced Topic Search). Idea: the Web includes two flavors of prominent pages: – authorities contain high-quality information – hubs are comprehensive lists of links to authorities. A page is a good authority if many hubs point to it; a page is a good hub if it points to many authorities. Good authorities are pointed to by good hubs, and good hubs point to good authorities.

Power Iteration. Each page i has both a hub score h_i and an authority score a_i. HITS successively refines these scores by computing a_i = Σ_{j: j→i} h_j and h_i = Σ_{j: i→j} a_j. Define the adjacency matrix L of the directed web graph: L_ij = 1 if page i links to page j, and 0 otherwise. Now a = L^T h and h = L a.

HITS and SVD. L: rows are outlinks, columns are inlinks. a will be the dominant eigenvector of the authority matrix L^T L; h will be the dominant eigenvector of the hub matrix LL^T. They are in fact the first right and left singular vectors of L: running HITS amounts to computing the SVD of the adjacency matrix.
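A brief NumPy sketch of the HITS power iteration on a small made-up adjacency matrix; normalizing the scores at each step is an assumption for numerical stability, not something stated on the slide.

```python
import numpy as np

def hits(L, iters=100):
    """HITS power iteration: a = L^T h, h = L a, with L2 normalization."""
    n = L.shape[0]
    h = np.ones(n)
    a = np.ones(n)
    for _ in range(iters):
        a = L.T @ h
        a /= np.linalg.norm(a)
        h = L @ a
        h /= np.linalg.norm(h)
    return h, a

# Toy adjacency matrix: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)
h, a = hits(L)
print("hubs:", h.round(3), "authorities:", a.round(3))
```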

HITS vs PageRank. PageRank may be computed once; HITS is computed per query. HITS takes the query into account, PageRank doesn't. PageRank has no concept of hubs. HITS is sensitive to local topology: the insertion or deletion of a small number of nodes may change the scores a lot. PageRank is more stable, because of its random-jump step.

NMF – NON-NEGATIVE MATRIX FACTORIZATION

Definition. Given a non-negative matrix V_{n×m}, find non-negative matrix factors W_{n×k} and H_{k×m} such that V_{n×m} ≈ W_{n×k} H_{k×m}. V: column matrix, each column is an n-dimensional data sample. W: each of the k columns is a basis vector. H: coordinates of V projected onto W, i.e. v_j ≈ W_{n×k} h_j.

Motivation. Non-negativity is natural in many applications (e.g. pixel intensities, word counts). Probabilities are also non-negative. An additive model captures local structure.

Multiplicative Update Algorithm Cost function  Euclidean distance Multiplicative Update

Multiplicative Update Algorithm Cost function  Divergence – Reduce to Kullback-Leibler divergence when – A and B can be regarded as normalized probability distributions. Multiplicative update PLSA is NMF with KL divergence

NMF vs PCA. n = 2429 faces, m = 19×19 pixels. Positive values are illustrated with black pixels and negative values with red pixels. NMF yields a parts-based representation; PCA yields holistic representations.

Reference. D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. NIPS, 2001. D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401 (1999).

Major Reference. Saara Hyvönen, Linear Algebra Methods for Data Mining, Spring 2007, University of Helsinki (highly recommended).

Outline. Basic concepts – likelihood, i.i.d. – ML, MAP and Bayesian inference – Expectation-Maximization – mixture-of-Gaussians parameter estimation. pLSA – motivation – derivation & geometric properties – applications. LDA – motivation: why add a hyperparameter – Dirichlet distribution – variational EM – relations with other topic models – incorporating category information. Summary.

Not Included General graphical model theories Markov random field (belief propagation) Detailed derivation of LDA

BASIC CONCEPTS

What Is Machine Learning? Data: let x = (x_1, x_2, ..., x_D)^T denote a data point, and D = {x^(1), x^(2), ..., x^(N)} a data set. D is sometimes associated with desired outputs y_1, y_2, .... Predictions: we are generally interested in predicting something based on the observed data set. Given D, what can we say about x^(N+1)? Model: to make predictions, we need to make some assumptions. We can often express these assumptions in the form of a model with some parameters θ. Given data D, we learn the model parameters, from which we can predict new data points. The model can often be expressed as a probability distribution over data points.

Likelihood Function. Given a set of parameter values, the probability density function (PDF) shows that some data are more probable than other data. Conversely, given the observed data and a model of interest, the likelihood function is defined as L(θ) = f_θ(x) = p(x|θ). That is, the likelihood function L(θ) shows that some parameter values are more likely to have produced the data than others.

Maximum Likelihood (ML). Maximum likelihood finds the model parameters that make the data "most likely" to have been generated by the model. Suppose we are given n data samples (x_1, x_2, …, x_n); maximum likelihood finds θ* = argmax_θ L(θ) = argmax_θ p(x_1, …, x_n | θ). Predictive distribution: p(x_new | θ*).

I.I.D. – Independent, Identically Distributed. I.I.D. means p(x_1, x_2, …, x_n | θ) = Π_{i=1}^{n} p(x_i | θ), so the problem is considerably simplified: the likelihood factorizes over samples. Usually the log likelihood is used: log L(θ) = Σ_{i=1}^{n} log p(x_i | θ).
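As a small worked example, the NumPy sketch below computes the maximum-likelihood mean and variance of a 1-D Gaussian by maximizing the i.i.d. log likelihood (which has the familiar closed-form solution); the simulated data are made up.

```python
import numpy as np

# Simulated i.i.d. data from a Gaussian with unknown mean/variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# For a Gaussian, maximizing sum_i log p(x_i | mu, sigma^2) gives
# the sample mean and the (biased) sample variance in closed form.
mu_ml = x.mean()
var_ml = ((x - mu_ml) ** 2).mean()

# Log likelihood evaluated at the ML estimate.
loglik = np.sum(-0.5 * np.log(2 * np.pi * var_ml) - (x - mu_ml) ** 2 / (2 * var_ml))
print(mu_ml, var_ml, loglik)
```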

Reference. Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Introduction to Machine Learning, Lectures 1–2 slides). Gregor Heinrich, Parameter Estimation for Text Analysis, technical note.

EXPECTATION MAXIMIZATION

Why We Need EM? The Expectation-Maximization (EM) algorithm is a method for ML learning of parameters in latent variable models. Why do we need latent variables? To describe complex models, e.g. the Gaussian mixture model. To discover the intrinsic structure inside a data set, e.g. topic models such as pLSA and LDA.

More Generally. Data set D, with likelihood p(D|θ) = ∫ p(D, X|θ) dX, summing or integrating over the latent variables X. Goal: learn maximum likelihood (ML) parameter values. The maximum likelihood procedure finds parameters θ such that θ* = argmax_θ p(D|θ). Because of the integral (or sum) over latent variables, the likelihood can be a very complicated function of θ and hard to optimize.

The Expectation Maximization (EM) Algorithm. The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps: E step: fill in values of the latent variables according to their posterior given the data. M step: maximize the likelihood as if the latent variables were not hidden. This decomposes difficult problems into a series of tractable steps.
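To make the two steps concrete, here is a minimal NumPy sketch of EM for a two-component 1-D Gaussian mixture; the synthetic data, fixed component count, initialization, and iteration count are illustrative assumptions, not details from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.7, 200)])

# Initialize mixture weights, means, and standard deviations.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sd = np.array([1.0, 1.0])

def gauss(x, m, s):
    # Gaussian density N(x | m, s^2).
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

for _ in range(50):
    # E step: posterior responsibility of each component for each point.
    dens = np.stack([pi[k] * gauss(x, mu[k], sd[k]) for k in range(2)])  # (2, N)
    resp = dens / dens.sum(axis=0, keepdims=True)
    # M step: re-estimate parameters as if the responsibilities were observed.
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp @ x) / Nk
    sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)

print(pi.round(3), mu.round(3), sd.round(3))
```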

Jensen's Inequality (special case). For a concave function f such as log, f(E[x]) ≥ E[f(x)]; in particular, log Σ_i λ_i x_i ≥ Σ_i λ_i log x_i for any λ_i ≥ 0 with Σ_i λ_i = 1. This is the form used to lower-bound the log likelihood below.

Lower Bounding the Log Likelihood. Observed data D = {y_n}; latent variables X = {x_n}; parameters θ. Goal: maximize the log likelihood (i.e. ML learning) w.r.t. θ: L(θ) = log p(D|θ) = log ∫ p(D, X|θ) dX. Any distribution q(X) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality: L(θ) ≥ ∫ q(X) log [p(D, X|θ)/q(X)] dX = ∫ q(X) log p(D, X|θ) dX + H[q] =: F(q, θ), where H[q] is the entropy of q(X).

The E and M Steps of EM. The lower bound on the log likelihood is F(q, θ) = ∫ q(X) log p(D, X|θ) dX + H[q]. EM alternates between: E step: optimize F w.r.t. the distribution over hidden variables, holding the parameters fixed: q^{(k)}(X) = argmax_q F(q, θ^{(k−1)}). M step: maximize F w.r.t. the parameters, holding the hidden distribution fixed: θ^{(k)} = argmax_θ F(q^{(k)}, θ).

The E Step. For fixed θ, F(q, θ) = log p(D|θ) − KL(q(X) ‖ p(X|D, θ)); the second term is the Kullback-Leibler divergence. This means that, for fixed θ, F is bounded above by L and achieves that bound when KL(q(X) ‖ p(X|D, θ)) = 0. So the E step simply sets q^{(k)}(X) = p(X|D, θ^{(k−1)}).

The M Step. Maximize F w.r.t. the parameters, holding the hidden distribution q fixed: θ^{(k)} = argmax_θ ∫ q^{(k)}(X) log p(D, X|θ) dX. The entropy term can be dropped because the entropy of q(X) does not depend on θ. The specific form of the M step depends on the model; often the maximum w.r.t. θ can be found analytically.

EM Never Decreases the Likelihood. The E and M steps together never decrease the log likelihood: L(θ^{(k−1)}) = F(q^{(k)}, θ^{(k−1)}) ≤ F(q^{(k)}, θ^{(k)}) ≤ L(θ^{(k)}). The E step raises F(q, θ) up to the likelihood L(θ) (it touches the ceiling); the M step then lifts F(q, θ) by maximizing it w.r.t. θ. F(q, θ) ≤ L(θ) by Jensen's inequality, or equivalently by the non-negativity of the KL divergence.

Reference. Zoubin Ghahramani, Machine Learning (4F13), 2006, Cambridge (Unsupervised Learning, Lecture 5 slides). Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Springer.

WHY DO WE NEED GRAPHICAL MODELS?

Why Do We Need Graphical Models? Cons: – a graphical model can become quite complex, even with only a few nodes – we have to make many assumptions. Pros: – we do need probability to explain our world, but the joint probability is hard to compute directly – a graphical model helps us analyze and understand our problems – graphs are an intuitive way of representing and visualizing the relationships between many variables – with a graphical model, we can decompose the joint probability into conditional probabilities, which are usually easier to handle.

Directed Acyclic Graphical Models (Bayesian Networks). A DAG model / Bayesian network corresponds to a factorization of the joint probability distribution: p(A,B,C,D,E) = p(A)p(B)p(C|A,B)p(D|B,C)p(E|C,D). In general, p(x_1, …, x_N) = Π_i p(x_i | x_{pa(i)}), where pa(i) are the parents of node i.

Directed Graphs for Statistical Models: Plate Notation. A data set of N points generated from a Gaussian: the plate (a box labelled N) replaces N repeated nodes x_n that all share the parameters μ and σ, so p(D|μ, σ) = Π_{n=1}^{N} p(x_n|μ, σ).

PLSA – PROBABILISTIC LATENT SEMANTIC ANALYSIS

Latent Semantic Indexing (LSI) Review. For natural language queries, simple term matching does not work effectively: – terms are ambiguous – queries for the same need vary with personal style. Latent semantic indexing creates a 'latent semantic space' (hidden meaning). LSI puts documents together even if they don't share common words, provided the documents share frequently co-occurring terms. Disadvantage: the statistical foundation is missing.

pLSA – Probabilistic Latent Semantic Analysis. Automated document indexing and information retrieval. Identification of latent classes using an Expectation-Maximization (EM) algorithm. Shown to handle: – polysemy ("Java" could mean coffee and also the programming language; "cricket" is a game and also an insect) – synonymy ("computer", "pc", "desktop" could all mean the same thing). Has a better statistical foundation than LSA.

pLSA. [Plate diagram: for each of the M documents d, each of its N_d words w is generated from a latent topic z.] z_1, …, z_N are latent variables, z_i ∈ [1, K], where K is the number of latent topics.

pLSA. [Unrolled model: each document d_1, …, d_M generates its words w_1, …, w_{N_i} through per-word latent topics z_1, …, z_{N_i}.] The topic-word distributions p(w|z=1), …, p(w|z=K) are shared across all documents. Likelihood: L = Π_d Π_w p(d, w)^{n(d,w)}, where n(d, w) is the number of times word w occurs in document d.

Joint Probability vs Likelihood. Joint probability: p(d, w) = p(d) Σ_z p(z|d) p(w|z). Likelihood (over the observed variables only): L = Σ_d Σ_w n(d, w) log p(d, w). p(d) is assumed to be uniform.

Document Decomposition. Each document can be decomposed as p(w|d) = Σ_z p(w|z) p(z|d). This is similar to matrix decomposition if we consider each discrete distribution as a vector: p(w|d) = Z_{V×K} p(z|d), where each column of Z is one topic distribution p(w|z) over the V-word vocabulary. With many documents, we hope to find the latent topics as a common basis.

pLSA – Objective Function. pLSA tries to maximize the log likelihood L = Σ_d Σ_w n(d, w) log Σ_z p(w|z) p(z|d). Because of the summation over z inside the log, we have to resort to EM.

EM Steps. E-step: the expectation of the likelihood function is calculated with the current parameter values. M-step: update the parameters with the calculated posterior probabilities, i.e. find the parameters that maximize the likelihood function.

Lower Bounding the Log Likelihood

EM Steps. The E-step: p(z|d, w) = p(w|z) p(z|d) / Σ_{z'} p(w|z') p(z'|d). The M-step: p(w|z) ∝ Σ_d n(d, w) p(z|d, w) and p(z|d) ∝ Σ_w n(d, w) p(z|d, w).
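Below is a minimal NumPy sketch of these pLSA EM updates applied to a word-count matrix; the random initialization, iteration count, toy counts, and eps smoothing are illustrative assumptions.

```python
import numpy as np

def plsa(n_dw, K, iters=100, eps=1e-12):
    """pLSA EM. n_dw: (D, W) word-count matrix. Returns p(w|z) and p(z|d)."""
    D, W = n_dw.shape
    rng = np.random.default_rng(0)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # p(w|z)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # p(z|d)
    for _ in range(iters):
        # E-step: p(z|d,w), normalized over z; shape (D, K, W).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate p(w|z) and p(z|d) from expected counts n(d,w) p(z|d,w).
        weighted = n_dw[:, None, :] * post
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_w_z, p_z_d

n_dw = np.random.default_rng(1).integers(0, 5, size=(8, 12))
p_w_z, p_z_d = plsa(n_dw, K=3)
print(p_w_z.shape, p_z_d.shape)   # (3, 12) (8, 3)
```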

Latent Subspace

pLSA vs LSA. LSA and pLSA both perform dimensionality reduction: – in LSA, by keeping only the K largest singular values – in pLSA, by having K aspects. Comparison to SVD: – the U matrix is related to P(z|d) (document to aspect) – the V matrix is related to P(w|z) (aspect to term) – the Σ matrix is related to P(z) (aspect strength).

pLSA vs LSA. The main difference is the way the approximation is done: pLSA generates a model (the aspect model) and maximizes its predictive power. Selecting the proper value of K is heuristic in LSA, whereas model selection in statistics can determine the optimal K in pLSA.

Applications Text mining / topic discovering Scene Classification

Text Mining

Scene Classification

Example Images

Classification Result

Reference. Thomas Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence (UAI'99), Stockholm, 1999. Bosch, A., Zisserman, A. and Munoz, X. Scene Classification via pLSA. Proceedings of the European Conference on Computer Vision, 2006. Sivic, J., Russell, B. C., Efros, A. A., Zisserman, A. and Freeman, W. T. Discovering Object Categories in Image Collections. MIT AI Lab Memo, February.

LDA – LATENT DIRICHLET ALLOCATION

Problems in pLSA. pLSA provides no probabilistic model at the document level: each document has its own topic mixture proportions. The number of parameters in the model grows linearly with M (the number of documents in the training set).

Problems in pLSA. There is no constraint on the distributions p(z|d_i), which easily leads to serious over-fitting. [Figure: documents d_1, …, d_m, each with its own unconstrained topic distribution p(z|d_1), …, p(z|d_m).]

Dirichlet Distribution. In the LDA model, the topic mixture proportions for each document are assumed to follow some distribution. Requirements for such a distribution: – the samples (mixture proportions) generated from it are K-tuples of non-negative numbers that sum to one, i.e. the samples are multinomial parameter vectors – it should be easy to optimize. The Dirichlet distribution is one such distribution. The space of all of these multinomials has a nice geometric interpretation as a (K−1)-simplex.

Dirichlet Distribution. Definition: p(θ | α) = [Γ(Σ_{i=1}^{K} α_i) / Π_{i=1}^{K} Γ(α_i)] Π_{i=1}^{K} θ_i^{α_i − 1}, for θ_i ≥ 0 with Σ_i θ_i = 1. The density is zero outside this open (K − 1)-dimensional simplex.

Example Dirichlet Distributions (K = 3). [Density plots for various parameters α: (6, 2, 2), (3, 7, 5), (2, 3, 4), (6, 2, 6).]

Equal α_i, different α_0 = Σ_i α_i: α_0 = 0.1, α_0 = 1, α_0 = 10. [Small α_0 spreads samples toward the corners of the simplex; large α_0 concentrates them near its center.]
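The quick NumPy sketch below draws samples from symmetric Dirichlet distributions with different concentrations to reproduce this effect; the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha0 in (0.1, 1.0, 10.0):
    # Symmetric Dirichlet over K = 3 topics: alpha_i = alpha0 / 3.
    samples = rng.dirichlet([alpha0 / 3] * 3, size=5000)
    # With small alpha0 most mass sits near a corner of the simplex,
    # so the largest coordinate is close to 1; with large alpha0 it is near 1/3.
    print(alpha0, samples.max(axis=1).mean().round(3))
```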

The LDA Model. [Figure: three documents shown as word-level graphical models, each with its own topic proportions θ drawn from a shared Dirichlet prior α.] For each document: choose θ ~ Dirichlet(α). For each of the N words w_n: – choose a topic z_n ~ Multinomial(θ) – choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

The LDA Model (plate notation). For each document: choose θ ~ Dirichlet(α). For each of the N words w_n: – choose a topic z_n ~ Multinomial(θ) – choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
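To make the generative process concrete, here is a small NumPy sketch that samples a synthetic corpus from it; the vocabulary size, topic count, document length, and hyperparameters are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 20, 5, 30          # topics, vocabulary size, documents, words per doc
alpha = np.full(K, 0.5)            # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.1), size=K)   # each row is one topic's word distribution

corpus = []
for _ in range(M):
    theta = rng.dirichlet(alpha)                      # theta ~ Dirichlet(alpha)
    z = rng.choice(K, size=N, p=theta)                # z_n ~ Multinomial(theta)
    words = [rng.choice(V, p=beta[zn]) for zn in z]   # w_n ~ p(w | z_n, beta)
    corpus.append(words)

print(corpus[0])   # word indices of the first synthetic document
```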

Joint Probability. Given parameters α and β: p(θ, z, w | α, β) = p(θ|α) Π_{n=1}^{N} p(z_n|θ) p(w_n|z_n, β), where p(θ|α) is the Dirichlet density and p(z_n|θ) = θ_{z_n}.

Likelihood. Joint probability: p(θ, z, w | α, β) = p(θ|α) Π_n p(z_n|θ) p(w_n|z_n, β). Marginal distribution of a document: p(w | α, β) = ∫ p(θ|α) Π_n Σ_{z_n} p(z_n|θ) p(w_n|z_n, β) dθ. Likelihood over all the documents: p(D | α, β) = Π_d p(w_d | α, β).

Inference. The log likelihood is a sum over documents; as in EM, we lower-bound it using Jensen's inequality.

Inference. In the E-step, we need to compute the posterior distribution of the hidden variables. Unfortunately, this distribution is intractable to compute in general, so we have to resort to a variational approach.

Variational Inference. In variational inference, we consider a simplified graphical model with variational parameters γ and φ, and minimize the KL divergence between the variational and posterior distributions.

Variational Inference. The difference between the lower bound and the log likelihood is the KL divergence. Maximizing the lower bound L(γ, φ; α, β) with respect to γ and φ is equivalent to minimizing the KL divergence.

VBEM vs EM. They differ only in the E-step. In standard EM, q(X) is set directly to p(X|D, θ), making KL = 0. In VBEM, p(X|D, θ) is intractable to compute; instead, it is approximated by a variational distribution q(X) obtained by minimizing KL(q(X) ‖ p(X|D, θ)). This is also equivalent to maximizing the lower bound L(θ).

Parameter Estimation. Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data. Strategy (variational EM): lower-bound log p(w | α, β) by a function L(γ, φ; α, β), then repeat until convergence: – E: maximize L(γ, φ; α, β) with respect to the variational parameters γ, φ – M: maximize the bound with respect to the model parameters α and β.

Parameter Estimation. E-step (variational inference), repeat until convergence: φ_{ni} ∝ β_{i,w_n} exp(Ψ(γ_i)); γ_i = α_i + Σ_n φ_{ni}. M-step (parameter estimation): β_{ij} ∝ Σ_d Σ_n φ_{dni} w_{dn}^j; α has no closed-form update and can be estimated using the Newton-Raphson method.
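A minimal sketch of the per-document variational E-step (updating φ and γ) is given below; the digamma function from SciPy, the toy word indices, the random β, and the omission of the M-step are all illustrative assumptions.

```python
import numpy as np
from scipy.special import digamma

def lda_e_step(word_ids, alpha, beta, iters=50):
    """Variational E-step for one document.

    word_ids: indices of the document's words (length N).
    alpha: (K,) Dirichlet prior; beta: (K, V) topic-word distributions.
    Returns gamma (K,) and phi (N, K).
    """
    K = alpha.shape[0]
    N = len(word_ids)
    phi = np.full((N, K), 1.0 / K)
    gamma = alpha + N / K
    for _ in range(iters):
        # phi_{ni} proportional to beta_{i, w_n} * exp(digamma(gamma_i)).
        phi = beta[:, word_ids].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_{ni}.
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

K, V = 3, 10
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(V), size=K)          # (K, V) topic-word distributions
gamma, phi = lda_e_step([0, 3, 3, 7, 9], alpha=np.full(K, 0.1), beta=beta)
print(gamma.round(3))
```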

Topic Examples in a 100-topic LDA Model. 16,000 documents from a subset of the TREC AP corpus.

Classification (50-topic LDA + SVM) Reuters dataset – contains 8000 documents and 15,818 words. (a) EARN vs. NOT EARN. (b) GRAIN vs. NOT GRAIN.

Problems in LDA. The Dirichlet distribution helps avoid over-fitting, but the assumption might be too strong. [The same LDA graphical-model figure as above.]

A Bayesian Hierarchical Model for Learning Natural Scene Categories. Incorporating category information. [Plate diagram over M images and N_d patches, with variables π, z, x and parameters θ, β.]

Codebook. 174 codewords formed from local image patches. Detection: evenly sampled grid, random sampling, saliency detector, or Lowe's DoG detector. Representation: normalized 11×11 gray values, or 128-dim SIFT.

Topic Distribution in Different Categories

Topic Hierarchical Clustering

More Topic Models. Dynamic topic models, ICML 2006. Correlated Topic Model, NIPS 2005. Hierarchical Dirichlet Process, Journal of the American Statistical Association 2003. Nonparametric Bayes pachinko allocation, UAI 2007. Supervised LDA, NIPS 2007. MedLDA – Maximum Margin Discriminant LDA, ICML 2009. …

Are you really into Graphical Models? Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS 2005.

Reference. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research (JMLR), 2003. Matthew J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003. L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. CVPR, 2005.

Outline Matrix Decomposition – PCA, SVD, NMF – LDA, ICA, Sparse Coding, etc. Graphical Model – Basic concepts in probabilistic machine learning – EM – pLSA – LDA Two Applications – Document decomposition for “long query” retrieval, ICCV 2009 – Modeling Threaded Discussions, SIGIR 2009

Large-Scale Indexing for “Long Query” Retrieval (Similarity Search) Xiao Zhang, Zhiwei Li, Lei Zhang, Wei-Ying Ma, and Heung-Yeung Shum ICCV 2009

The Long Query Problem. If a query contains 1000 keywords: – we need to access 1000 inverted lists – the intersection of 1000 inverted lists may be empty – the union of 1000 inverted lists may be the whole corpus. Dimension reduction? [Illustration: an image's sparse term vector (Term1 … TermN, dim = 1 million) is projected onto a dense topic vector (f_1 … f_M, dim = 200).]

Key Idea: Dimension Reduction + Residual Error Preservation. p ≈ Xw + ε, where p is the original TF-IDF vector in vocabulary space, X is the projection matrix for dimension reduction, w is the low-dimensional feature vector, and ε is the residual error. An image = a compact low-dimensional code + a few residual words (about 10 words).
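The sketch below illustrates this decomposition with NumPy: a sparse TF-IDF-like vector is projected onto an orthonormal basis X, and the largest residual coordinates play the role of the "few words" kept alongside the compact code. The basis (a random orthonormal matrix here), the dimensions, and the choice of keeping 10 residual words are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 20                               # vocabulary size and reduced dimension (toy values)
p = rng.random(V) * (rng.random(V) < 0.02)    # sparse TF-IDF-like vector

# Orthonormal projection basis X (V x k), e.g. obtained from an SVD of the corpus.
X, _ = np.linalg.qr(rng.standard_normal((V, k)))

w = X.T @ p                                   # low-dimensional code
eps = p - X @ w                               # residual error in vocabulary space

# Keep only the few largest residual entries ("a few words").
top = np.argsort(-np.abs(eps))[:10]
p_approx = X @ w
p_approx[top] += eps[top]
print(np.linalg.norm(p - X @ w), np.linalg.norm(p - p_approx))   # residual shrinks
```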

Orthogonal Decomposition. [Illustration: an image's term vector is decomposed over base vectors X_1, X_2, X_3, …, X_k into a low-dimensional representation plus a residual of a few words (about 10 words).]

A Probabilistic Implementation. x is a switch variable; it controls whether a word is generated from a topic-specific distribution, a document-specific distribution, or a background distribution. C. Chemudugunta et al. Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model. NIPS 2006.

Search (Online). [System diagram: the query's compact code is looked up in an LSH index over document signatures; the retrieved candidates (e.g. Doc 300, Doc 401) are re-ranked using the residual words against the document metadata.] Index size for 10M images: 4.6 GB. Search speed: < 100 ms.

Search Example Query Image

Search Example Query Image

Simultaneously Modeling Semantics and Structure of Threaded Discussions: A Sparse Coding Approach and Its Applications. Chen Lin, Jiang-Ming Yang, Rui Cai, Xin-Jing Wang, Wei Wang, Lei Zhang. SIGIR 2009.

Semantics & Structure. Semantics: the topics. Structure: who replies to whom.

Optimize Them Together. Model the semantics and model the structure jointly.

Reply Reconstruction. Combines document similarity, topic similarity, and structure similarity.

Baselines. NP: reply to the nearest post. RR: reply to the root. DS: document similarity. LDA: Latent Dirichlet Allocation, projecting documents into the topic space. SWB: Special Words topic model with a Background distribution, projecting documents into the topic and junk-topic space.

Evaluation. Reply-reconstruction accuracy on the Slashdot and Apple datasets, each reported for All Posts and Good Posts, comparing NP, RR, DS, LDA, SWB, and SMSS. [Numeric results omitted in this transcript.]

Expert Finding. Methods: HITS, PageRank, …

Baselines. LM: Formal Models for Expert Finding in Enterprise Corpora (SIGIR 2006); achieves stable performance on the expert-finding task using a language model. PageRank: benchmark node-ranking method. HITS: finds hub nodes and authority nodes. EABIF: Personalized Recommendation Driven by Information Flow (SIGIR 2006); finds the most influential nodes.

Evaluation. Expert-finding performance of the Bayesian estimate compared against LM, EABIF (ori./rec.), PageRank (ori./rec.), and HITS (ori./rec.). [Numeric results omitted in this transcript.]

Summary. Matrices and probability are fundamental mathematics in information retrieval and computer vision: – matrix decomposition is a good practice for learning matrices – graphical models are a good practice for learning probability. Graphical models are a good tool for analyzing problems. The essence of decomposition is to discover a set of mid-level features to describe the original documents/images. Graphical models are more adaptable to various applications than matrix decomposition.