Probabilistic Latent Semantic Analysis

Probabilistic Latent Semantic Analysis Bob Durrant School of Computer Science University of Birmingham (Slides: Dr Ata Kabán)

Overview
We will learn how we can:
- represent text in a simple numerical form in the computer
- find out topics from a collection of text documents

Salton’s Vector Space Model
Represent each document by a high-dimensional vector in the space of words.
Gerard Salton, 1960s–70s (the dates are of the work, not Salton’s dates!)

Salton’s Vector Space Model
Represent the document as a vector where each entry corresponds to a different word and the number at that entry is how many times that word occurs in the document (or some function of it).
- The number of words is huge, so select and use a smaller set of words that are of interest.
- Uninteresting words such as ‘and’, ‘the’, ‘at’, ‘is’, etc. are called stop-words and are removed.
- Stemming: remove word endings. E.g. ‘learn’, ‘learning’, ‘learnable’, ‘learned’ could all be replaced by the single stem ‘learn’.
- Other simplifications can also be invented and used.
The set of distinct remaining words is called the dictionary or vocabulary. Fix an ordering of the terms in the dictionary so that you can refer to them by their index.
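
A rough sketch of this preprocessing; the stop-word list and the suffix rules here are illustrative toys standing in for the lecture's actual choices (a real system would use something like the Porter stemmer):

```python
import re

# Illustrative choices only: a tiny stop-word list and a crude suffix stripper.
STOP_WORDS = {"and", "the", "at", "is", "a", "an", "of", "to", "in"}
SUFFIXES = ("ing", "able", "ed", "s")

def tokens(text):
    """Lowercase, keep alphabetic words, drop stop-words, strip common suffixes."""
    out = []
    for w in re.findall(r"[a-z]+", text.lower()):
        if w in STOP_WORDS:
            continue
        for suf in SUFFIXES:
            # Only strip when a reasonably long stem remains.
            if w.endswith(suf) and len(w) >= len(suf) + 3:
                w = w[: -len(suf)]
                break
        out.append(w)
    return out
```

For example, `tokens("Learning and the learnable is learned")` maps all three content words to the single stem `learn`.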

Example This is a small document collection that consists of 9 text documents. Terms that are in our dictionary are italicised.

Collect all doc vectors into a term by document matrix
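
Collecting the document vectors into a term-by-document count matrix might look like this (function and variable names are just illustrative):

```python
import numpy as np
from collections import Counter

def term_doc_matrix(docs, vocab):
    """X[t, d] = count of vocabulary term t in document d.

    docs:  list of documents, each a list of tokens
    vocab: ordered list of dictionary terms (fixes the row indexing)
    """
    index = {term: t for t, term in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)), dtype=int)
    for d, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            if term in index:  # terms outside the dictionary are ignored
                X[index[term], d] = count
    return X
```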

Queries
We have a collection of documents and want to find the most relevant ones for a query.
A query is just like a very short document.
Compute the similarity between the query and all documents in the collection, then return the best-matching documents.
When are two documents similar? When are two document vectors similar?

Document similarity
A standard choice is the cosine similarity: the inner product of the document vectors x and y, divided by the product of their lengths.
It is simple and intuitive, and fast to compute, because x and y are typically sparse (i.e. have many 0-s).
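
A minimal sketch using dense NumPy vectors for simplicity (a real IR system would exploit the sparsity mentioned above):

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity: x.y / (|x||y|); 1 for parallel vectors, 0 for orthogonal."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0.0 or ny == 0.0:
        return 0.0  # convention: an all-zero vector matches nothing
    return float(x @ y) / (nx * ny)
```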

How to measure success?
Assume there is a set of ‘correct answers’ to the query; the documents in this set are called relevant to the query. The set of documents returned by the system are called retrieved documents.
- Precision: what percentage of the retrieved documents are relevant?
- Recall: what percentage of all relevant documents are retrieved?
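
Both measures fall out directly from set operations (a sketch; document ids stand in for the documents):

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: iterables of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

E.g. retrieving {1, 2, 3, 4} when {2, 4, 5} are relevant gives precision 2/4 and recall 2/3.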

Problems
- Synonyms: separate words that have the same meaning, e.g. ‘car’ & ‘automobile’. They tend to reduce recall.
- Polysemes: words with multiple meanings, e.g. ‘saturn’: a planet, a Roman deity, a manned rocket, a Sega game console… They tend to reduce precision.
The problem is more general: there is a disconnect between topics and words. ‘… a more appropriate model should consider some conceptual dimensions instead of words.’ (Gärdenfors)

Latent Semantic Analysis (LSA)
LSA aims to discover something about the meaning behind the words: about the topics in the documents.
What is the difference between topics and words? Words are observable; topics are not, they are latent.
How do we find the topics behind the words automatically? We can imagine a topic as a compression of words, a combination of words. Let us try to formalise this.

Probabilistic Latent Semantic Analysis
Let us start from what we know: the random sequence model. We know how to compute the parameters of this model, i.e. P(term_t|doc).
- We ‘guessed’ it intuitively in Lecture 1.
- We also derived it by Maximum Likelihood in Lecture 1, because we said the guessing strategy may not work for more complicated models.
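
For reference, the random sequence (unigram) model and its Maximum Likelihood estimate take the standard form below, writing X(t, d) for the count of term t in document d (the same notation as the PLSA algorithm slide):

```latex
P(\mathrm{doc}) = \prod_{t=1}^{T} P(t \mid \mathrm{doc})^{X(t,\,\mathrm{doc})},
\qquad
\hat{P}(t \mid \mathrm{doc}) = \frac{X(t,\,\mathrm{doc})}{\sum_{t'=1}^{T} X(t',\,\mathrm{doc})}
```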

Probabilistic Latent Semantic Analysis Now let us have K topics as well: What are the parameters of this model?
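
With K topics, each word of a document is generated by first choosing a topic k with probability P(k|doc) and then a term t with probability P(t|k); this is the standard PLSA mixture:

```latex
P(t \mid \mathrm{doc}) = \sum_{k=1}^{K} P(t \mid k)\, P(k \mid \mathrm{doc})
```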

Probabilistic Latent Semantic Analysis
The parameters of this model are P(t|k) and P(k|doc). It is possible to derive the equations for computing these parameters by Maximum Likelihood. If we do so, what do we get?
- P(t|k), for all t and k, is a term-by-topic matrix (describes which terms make up a topic)
- P(k|doc), for all k and doc, is a topic-by-document matrix (describes which topics are in a document)

Deriving the parameter estimation algorithm The log likelihood of this model is the log probability of the entire collection:
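
In terms of the parameters P(t|k) and P(k|d) and the term counts X(t, d), the standard form of this log likelihood is:

```latex
\mathcal{L} = \sum_{d=1}^{N} \sum_{t=1}^{T} X(t,d)\,
\log \sum_{k=1}^{K} P(t \mid k)\, P(k \mid d)
```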

Extra Credit!
For those who would like to work it out:
- Rewrite the constraints in Lagrangian terms.
- Take derivatives w.r.t. each of the parameters (one of them at a time) and equate these to zero.
- Solve the resulting equations. You will get fixed-point equations which can be solved iteratively. This is the PLSA algorithm.
Note these steps are the same as those we did in Lecture 1 when deriving the Maximum Likelihood estimate for random sequence models; just the working is (ahem!) a little more tedious. We will skip doing this in class and just give the resulting algorithm (on the next slide). You can get a 5% bonus if you work this algorithm out.

The PLSA algorithm
Inputs: a term-by-document matrix X(t,d), t=1:T, d=1:N, and the number K of topics sought.
Initialise arrays P1 and P2 randomly with numbers in [0,1] and normalise them to sum to 1 along rows.
Iterate until convergence: for d=1 to N, for t=1 to T, for k=1 to K, apply the fixed-point update equations.
Output: arrays P1 and P2, which hold the estimated parameters P(t|k) and P(k|d) respectively.
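
A compact NumPy sketch of this algorithm, using the standard EM fixed-point updates for PLSA (as derived in Hofmann's paper), written in matrix form instead of the explicit triple loop; the array names P1 and P2 follow the slide:

```python
import numpy as np

def plsa(X, K, n_iter=100, seed=0):
    """Fit PLSA by EM.

    X: (T, N) term-by-document count matrix.
    Returns P1 = P(t|k) of shape (T, K) and P2 = P(k|d) of shape (K, N);
    each column of P1 and each column of P2 sums to 1.
    """
    rng = np.random.default_rng(seed)
    T, N = X.shape
    P1 = rng.random((T, K)); P1 /= P1.sum(axis=0, keepdims=True)
    P2 = rng.random((K, N)); P2 /= P2.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # E-step (folded in): responsibilities P(k|t,d) are proportional
        # to P(t|k) P(k|d), with normaliser P(t|d) = sum_k P(t|k) P(k|d).
        mix = P1 @ P2                       # (T, N): current model P(t|d)
        R = X / np.maximum(mix, 1e-12)      # (T, N): X(t,d) / P(t|d)
        # M-step: new P(t|k) ~ sum_d X(t,d) P(k|t,d),
        #         new P(k|d) ~ sum_t X(t,d) P(k|t,d)
        N1 = P1 * (R @ P2.T)                # (T, K) unnormalised
        N2 = P2 * (P1.T @ R)                # (K, N) unnormalised
        P1 = N1 / N1.sum(axis=0, keepdims=True)
        P2 = N2 / N2.sum(axis=0, keepdims=True)
    return P1, P2
```

Each EM iteration never decreases the log likelihood, so the iteration converges to a fixed point (generally a local maximum).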

Example of topics found from a Science Magazine papers collection

The performance of a retrieval system based on this model (PLSI) was found to be superior to that of both the vector-space similarity (cos) and a non-probabilistic Latent Semantic Indexing (LSI) method. (We skip the details here.) From Th. Hofmann, 2000.

Summing up
- Documents can be represented as numeric vectors in the space of words.
- The order of words is lost, but the co-occurrences of words may still provide useful insights about the topical content of a collection of documents.
- PLSA is an unsupervised method based on this idea. We can use it to find out what topics are present in a collection of documents.
- It is also a good basis for information retrieval systems.

Related resources
- Thomas Hofmann, Probabilistic Latent Semantic Analysis. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'99). http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
- Scott Deerwester et al., Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990. http://citeseer.ist.psu.edu/cache/papers/cs/339/http:zSzzSzsuperbook.bellcore.comzSz~stdzSzpaperszSzJASIS90.pdf/deerwester90indexing.pdf
- The BOW toolkit for creating term-by-document matrices and other text processing and analysis utilities: http://www.cs.cmu.edu/~mccallum/bow