Latent Semantic Analysis

Problem Introduction
Traditional term-matching does not work well in information retrieval: we want to capture concepts rather than words. Concepts are reflected in the words used, but the mapping is not one-to-one: one term may have multiple meanings, and different terms may share the same meaning.

The Problem
Two problems arise when using the vector space model:
synonymy: many ways to refer to the same object, e.g. car and automobile; this leads to poor recall
polysemy: most words have more than one distinct meaning, e.g. model, python, chip; this leads to poor precision

The Problem
Example (from Lillian Lee): three documents reduced to their terms.
Doc 1: auto, engine, bonnet, tyres, lorry, boot
Doc 2: car, emissions, hood, make, model, trunk
Doc 3: make, hidden, Markov, model, emissions, normalize
Synonymy: Docs 1 and 2 cover the same topic in British and American vocabulary; they will have a small cosine but are related.
Polysemy: Docs 2 and 3 share the terms make, model, and emissions; they will have a large cosine but are not truly related.

LSI (Latent Semantic Indexing)
LSI tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of the observed term-document association data as a statistical problem. The goal is to find effective models of the relationship between terms and documents, so that a set of terms, which by itself is incomplete and unreliable, is replaced by a set of entities that are more reliable indicants. Terms that did not appear in a document may still be associated with it. LSI derives uncorrelated index factors that can be regarded as artificial concepts.

Some History
Latent Semantic Indexing was developed at Bellcore (now Telcordia) in 1988 and patented in 1989. http://lsi.argreenhouse.com/lsi/LSI.html

Some History
The first papers about LSI:
Dumais, S. T., Furnas, G. W., Landauer, T. K., and Deerwester, S. (1988). "Using latent semantic analysis to improve access to textual information." In Proceedings of CHI '88: Conference on Human Factors in Computing, New York: ACM, 281-285.
Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6), 391-407.
Foltz, P. W. (1990). "Using latent semantic indexing for information filtering." In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA, 40-47.

LSA
But first: what is the difference between LSI and LSA? LSI refers to using the technique for indexing and information retrieval; LSA refers to everything else.

LSA
Idea (Deerwester et al.): "We would like a representation in which a set of terms, which by itself is incomplete and unreliable evidence of the relevance of a given document, is replaced by some other set of entities which are more reliable indicants. We take advantage of the implicit higher-order (or latent) structure in the association of terms and documents to reveal such relationships."

SVD (Singular Value Decomposition)
How do we learn the concepts from the data? SVD is applied to the term-document matrix to derive the latent semantic structure model. What is SVD?

SVD Basics
Singular value decomposition of the t x d term-document matrix X:
X = T S D^T
where T is t x m (term vectors), S is m x m (the diagonal matrix of singular values), and D^T is m x d (document vectors).
Selecting only the first k singular values gives the rank-k approximation
X̂ = T_k S_k D_k^T
with T_k of size t x k, S_k of size k x k, and D_k^T of size k x d.
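
A minimal sketch of this decomposition in numpy; the toy matrix X and the choice of k are illustrative assumptions, not the example from the paper:

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents.
    X = np.array([
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 1],
    ], dtype=float)

    # Full SVD: X = T @ diag(s) @ Dt, with T (t x m), s the singular
    # values in decreasing order, and Dt (m x d), where m = min(t, d).
    T, s, Dt = np.linalg.svd(X, full_matrices=False)

    # Keep only the k largest singular values: the rank-k approximation.
    k = 2
    T_k, S_k, Dt_k = T[:, :k], np.diag(s[:k]), Dt[:k, :]
    X_hat = T_k @ S_k @ Dt_k  # best rank-k fit in the least-squares sense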

SVD Basics II
Rank-reduced singular value decomposition performed on the matrix:
all but the k highest singular values are set to 0
this produces the best k-dimensional approximation of the original matrix (in the least-squares sense)
the result is the "semantic space"
Similarities between entities are then computed in the semantic space (usually with the cosine).

SVD
SVD of the term-by-document matrix X: X = T0 S0 D0^T. If the singular values in S0 are ordered by size, we keep only the first k largest values and get a reduced model X̂ = T S D^T. X̂ does not exactly match X, and it gets closer as more singular values are kept. This is what we want: we do not want a perfect fit, since we believe some of the 0s in X should be 1s and vice versa. The reduced model reflects the major associative patterns in the data and ignores the smaller, less important influences and noise.
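
Continuing the numpy sketch above, the Frobenius-norm reconstruction error shrinks as k grows (T, s, Dt, and X are the names from that sketch):

    # Error decreases monotonically as more singular values are kept;
    # at k = min(t, d) the reconstruction is exact.
    for k in range(1, len(s) + 1):
        X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
        err = np.linalg.norm(X - X_k)  # Frobenius norm by default
        print(f"k={k}  ||X - X_k||_F = {err:.4f}")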

Fundamental Comparison Quantities from the SVD Model
Comparing two terms: the dot product between two row vectors of X̂ (equivalently, between rows of T S) reflects the extent to which the two terms have a similar pattern of occurrence across the set of documents.
Comparing two documents: the dot product between two column vectors of X̂ (equivalently, between rows of D S).
Comparing a term and a document: the value of the corresponding cell of X̂, which is the dot product between a row of T S^(1/2) and a row of D S^(1/2).
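
In code, continuing the earlier sketch (T_k, S_k, and Dt_k are the truncated factors from the first snippet; the indices 0 and 1 are arbitrary):

    def cosine(u, v):
        """Cosine similarity between two vectors."""
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Term coordinates in the semantic space: rows of T_k @ S_k.
    term_coords = T_k @ S_k
    print("term 0 vs term 1:", cosine(term_coords[0], term_coords[1]))

    # Document coordinates: rows of D_k @ S_k, i.e. (S_k @ Dt_k).T.
    doc_coords = (S_k @ Dt_k).T
    print("doc 0 vs doc 1:", cosine(doc_coords[0], doc_coords[1]))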

LSI Paper Example
(Figure: the sample document titles from the paper, with index terms shown in italics.)

Term-Document Matrix
(Figure: the example term-by-document matrix X.)

Latent Semantic Indexing
As before: X = T S D^T, with t x d = (t x m)(m x m)(m x d); selecting the first k singular values gives X̂ = T_k S_k D_k^T.

T0
(Figure: the full term matrix T0.)

S0
(Figure: the diagonal matrix of singular values S0.)

D0
(Figure: the full document matrix D0.)

SVD with Minor Terms Dropped
T S and D S define coordinates for terms and documents in the latent space.

Terms Graphed in Two Dimensions
(Figure: term vectors plotted in the first two latent dimensions.)

Documents and Terms
(Figure: terms and documents plotted together in the two-dimensional latent space.)

Change in Text Correlation
(Figure: how inter-document correlations change after dimension reduction.)

Summary: Some Issues
SVD algorithm complexity: O(n^2 k^3), where n = number of terms and k = number of dimensions in the semantic space (typically small, ~50 to 350).
For a stable document collection, the SVD only has to be run once.
For dynamic document collections, the SVD might need to be rerun, but new documents can also be "folded in" (see the sketch below).
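
Folding in projects a new document into the existing semantic space without recomputing the SVD. A minimal sketch, reusing names from the earlier snippets (x_new is an invented term-count vector; cosine and doc_coords come from the comparison snippet):

    # Fold a new document into the k-dimensional space:
    # its D-space row is x^T T_k S_k^{-1}.
    x_new = np.array([1, 0, 0, 1, 1], dtype=float)  # invented counts
    d_new = x_new @ T_k @ np.linalg.inv(S_k)

    # doc_coords holds rows of D_k @ S_k, so rescale by S_k before
    # comparing (x^T T_k S_k^{-1} S_k = x^T T_k).
    print("new doc vs doc 0:", cosine(d_new @ S_k, doc_coords[0]))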

Summary: Some Issues
Finding the optimal dimension for the semantic space:
precision-recall improves as the dimension is increased, until it hits the optimum; it then slowly decreases until it matches the standard vector model
run the SVD once with a big dimension, say k = 1000, then test dimensions <= k by truncation (the error loop above reuses a single SVD in exactly this way)
in many tasks 150-350 dimensions work well, but there is still room for research

Summary: Some Issues
SVD assumes normally distributed data, but term occurrences are not normally distributed. In practice the matrix entries are weights, not raw counts, and the weights may be approximately normally distributed even when the counts are not.
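
One common weighting, chosen here purely for illustration (the slides do not prescribe a scheme, and the original LSI work used log-entropy weights), is tf-idf applied to the counts before the SVD:

    # tf-idf weighting of the toy count matrix before the SVD.
    tf = np.log1p(X)                  # dampened term frequency
    df = (X > 0).sum(axis=1)          # document frequency per term
    idf = np.log(X.shape[1] / df)     # inverse document frequency
    X_weighted = tf * idf[:, None]    # scale each term row by its idf

    T, s, Dt = np.linalg.svd(X_weighted, full_matrices=False)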