Latent Semantic Analysis (LSA) Jed Crandall 16 June 2009

What is LSA? “A technique in natural language processing of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms” -Paraphrasing Wikipedia Based on the “bag of words” model

Timeline LSA patented in 1988 ◦ Deerwester et al., mostly psychology types from the University of Colorado pLSA by Hofmann in 1999 ◦ Assumes a Poisson distribution of terms and documents instead of a Gaussian one Latent Dirichlet Allocation, Blei et al. in 2002 ◦ More of a graphical model

What you can do with LSA Start with a corpus of text, e.g., Wikipedia Create a term frequency matrix Do some fancy math (not too fancy, though) Output is a matrix you can use to project terms (or documents) into a lower-dimensional concept space ◦ “Mao Zedong” and “communism” should have a high dot product ◦ “Mao Zedong” and “pocket watch” should not

Intuition behind LSA “A is 5 furlongs away from B” “A is 5 furlongs away from C” “B is 8 furlongs away from C”
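
The same intuition can be made concrete with classical multidimensional scaling: from nothing but the three pairwise distances we can recover a consistent two-dimensional placement of A, B, and C (up to rotation and reflection). A minimal numpy sketch, not part of the original slides:

```python
# Recover 2-D coordinates for A, B, C from their pairwise distances alone,
# using classical multidimensional scaling (double-centering + eigendecomposition).
import numpy as np

# Pairwise distances in furlongs: A-B = 5, A-C = 5, B-C = 8
D = np.array([[0.0, 5.0, 5.0],
              [5.0, 0.0, 8.0],
              [5.0, 8.0, 0.0]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of the centered points
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
top = order[:2]
coords = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

print(coords)  # one valid 2-D placement of A, B, C (unique up to rotation/reflection)
```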

In two dimensions [Figure: points B and C drawn 8 furlongs apart, with A 5 furlongs from each, forming a triangle]

Noise in the measurements “A, B, and C are all on a straight, flat, road”

Dimension reduction [Figure: B, A, and C projected onto a single line, with the projected distances shown as roughly 4.5 and 9 furlongs]

Assumptions Suppose we want to do LSA on Wikipedia, with k = 600. We're assuming all authors draw from two sources when choosing words while writing an article ◦ A 600-dimensional “true” concept space ◦ Their freedom of choice, which we model as white Gaussian noise

Process Build a term frequency matrix Do tf-idf weighting Calculate a singular value decomposition (SVD) Do a rank reduction by chopping off all but the k = 600 largest singular values Can now map terms (or documents) into the space defined by the rank-600 approximation of our original matrix, and compare them with a dot product or cosine similarity
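
A rough end-to-end sketch of this process in Python using scikit-learn, on a hypothetical three-document toy corpus (k is tiny here rather than 600 only because the corpus is tiny):

```python
# End-to-end LSA sketch: tf-idf weighted term frequencies, truncated SVD,
# then similarity comparisons in the reduced concept space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "mao zedong led the communist revolution in china",
    "communism and socialism are political ideologies",
    "a pocket watch is a small timepiece on a chain",
]

# Steps 1-2: term frequency matrix with tf-idf weighting (documents x terms)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Steps 3-4: truncated SVD keeps only the k largest singular values
svd = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = svd.fit_transform(X)            # documents in concept space

# Step 5: compare documents (or, symmetrically, terms) in the reduced space
print(cosine_similarity(doc_concepts))
```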

Build a term frequency matrix
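
For illustration, a small sketch (toy documents, not from the slides) of building the raw term frequency matrix, with one row per term and one column per document:

```python
# Build a term-document count matrix: tf[i, j] = occurrences of term i in document j.
from collections import Counter
import numpy as np

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]
counts = [Counter(d.split()) for d in docs]
vocab = sorted({term for c in counts for term in c})

tf = np.zeros((len(vocab), len(docs)), dtype=int)
for j, c in enumerate(counts):
    for i, term in enumerate(vocab):
        tf[i, j] = c[term]

print(vocab)
print(tf)
```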

Do tf-idf weighting (optional) tf_{i,j} = n_{i,j} / Σ_k n_{k,j}, where n_{i,j} is the number of occurrences of term i in document j and the denominator is the total number of terms in document j idf_i = log( |D| / |{d : term i ∈ d}| ), where the numerator |D| is the total number of documents and the denominator is the number of documents in which term i appears at least once (an entropy-like weighting) The final weight is the product tf_{i,j} · idf_i
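
A direct translation of these formulas into numpy, applied to a hypothetical term frequency matrix (the n_{i,j} values below are made up purely for illustration):

```python
# tf-idf weighting of a term-document count matrix, following the formulas above.
import numpy as np

tf = np.array([[2, 1, 0],
               [1, 0, 0],
               [0, 0, 1]], dtype=float)        # toy n_ij values (terms x documents)

n_docs = tf.shape[1]
tf_norm = tf / tf.sum(axis=0, keepdims=True)   # n_ij / (total terms in doc j)
df = (tf > 0).sum(axis=1)                      # number of documents containing term i
idf = np.log(n_docs / df)                      # log(|D| / df_i)

tfidf = tf_norm * idf[:, np.newaxis]
print(tfidf)
```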

Calculate an SVD Why “an” SVD, not “the” SVD? The factorization is not unique U and V are unitary matrices ◦ Normal ◦ |Determinant| = 1 ◦ Length preserving The number of nonzero singular values is the rank of the matrix An SVD exists for every matrix Fast, numerically stable, can stop after the k largest singular values, etc.
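
A small sketch of computing an SVD with numpy on a random matrix; for a large sparse term-document matrix one would typically use a truncated solver such as scipy.sparse.linalg.svds and stop at the k largest singular values:

```python
# Full SVD of a small dense matrix and an exact reconstruction check.
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U has orthonormal columns, Vt has orthonormal rows, and s holds the
# singular values sorted from largest to smallest.
print(np.allclose(A, U @ np.diag(s) @ Vt))     # True: A = U diag(s) V^T
```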

Rank reduction Rank reduction = dimension reduction Gives us the rank-k matrix that is the optimal approximation of our original matrix in terms of the Frobenius norm (the Eckart-Young theorem) Rank reduction has the effect of reducing the white Gaussian noise
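
Continuing the sketch above, rank-k truncation keeps only the k largest singular values; the resulting matrix is the best rank-k approximation in the Frobenius norm, and the approximation error is exactly the energy in the discarded singular values:

```python
# Rank-k truncation of the SVD and a check of the Frobenius-norm error.
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt((s[k:] ** 2).sum()))        # equal: the discarded singular values
```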

Map terms (or documents) into concept space Note: using V here where some sources use P for the same matrix, since these slides borrow notation from multiple sources Can compare terms using, e.g., cosine similarity
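
A hedged sketch of the mapping: writing the truncated SVD of the term-document matrix as U_k Σ_k V_k^T, each row of U_k Σ_k gives one term's coordinates in the k-dimensional concept space (the toy matrix below is random, just to show the mechanics):

```python
# Project terms into concept space and compare two of them with cosine similarity.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

A = np.random.default_rng(1).normal(size=(8, 5))   # toy term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
term_vectors = U[:, :k] * s[:k]     # rows of U_k * sigma_k: terms in concept space

print(cosine(term_vectors[0], term_vectors[1]))
```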

Example application ConceptDoppler Examples of filtered keywords: 我的奋斗 (Mein Kampf), 转化率 (conversion rate), 绝食 (hunger strike) The list changes dramatically at times ◦ 19 September 2007: 122 out of ? ◦ 6 March 2008: 108 out of ? ◦ 18 June 2008: 133 words out of ? ◦ As of February 2009, 法轮功 (Falun Gong) was not blocked

Questions? Sources I plagiarized from: ◦ Wikipedia article on latent semantic analysis ◦ ConceptDoppler, Crandall et al., CCS 2007