

Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text
Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas
University of Maryland, Baltimore County
ACM Workshop on Web Information and Data Management (WIDM), 2002, pp. 31-35
Presented by Conley Read, Computer Science & Engineering, University of California, Riverside

Overview
- The problem: research motivation
- The solution: LSA?
- LSA doesn't work so well
- Let's do it (LSA) again
- Two-stage LSA works!
- Create your own corpus

The Problem
Mumbai / Bombay: two names for the same city.

Motivation
al Qaeda / al Qaida: two spellings of the same organization.

Motivation
Nutrasweet / aspartame: a brand name and the chemical name for the same substance.

Motivation
al Qaeda / al Qaida, surrounded by related terms in news text: cells, network, suspects, Iraq, bin Laden, alleged, cell, warned, terrorist.

An Old IR Problem
… drove their car to …
… minor car accident …
… new 2002 car models …
… car gets good mileage …
… drove their automobile to …
… minor automobile accident …
… new 2002 automobile models …
… automobile gets good mileage …

Keyword Query: CAR
Matches:
… drove their car to …
… minor car accident …
… new 2002 car models …
… car gets good mileage …
Missed:
… drove their automobile to …
… minor automobile accident …
… new 2002 automobile models …
… automobile gets good mileage …

Keyword Query: AUTOMOBILE
Matches:
… drove their automobile to …
… minor automobile accident …
… new 2002 automobile models …
… automobile gets good mileage …
Missed:
… drove their car to …
… minor car accident …
… new 2002 car models …
… car gets good mileage …

Latent Semantic Analysis
A query in the latent space matches all of the snippets, because "car" and "automobile" occur in similar contexts:
… drove their car to …
… minor car accident …
… new 2002 car models …
… car gets good mileage …
… drove their automobile to …
… minor automobile accident …
… new 2002 automobile models …
… automobile gets good mileage …

Term-Document Matrix
A is an m × n matrix: m terms (rows), n documents (columns).
A(i, j) = number of times term i occurs in document j.
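The matrix construction can be sketched in Python; the helper name `term_document_matrix` and the toy documents are illustrative, not from the paper:

```python
from collections import Counter

def term_document_matrix(docs):
    """Build the m x n term-document matrix A, where A[i][j] is the
    number of times term i occurs in document j."""
    tokenized = [d.lower().split() for d in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    index = {t: i for i, t in enumerate(terms)}
    A = [[0] * len(docs) for _ in terms]
    for j, doc in enumerate(tokenized):
        for t, count in Counter(doc).items():
            A[index[t]][j] = count
    return terms, A

terms, A = term_document_matrix(["car accident", "car car models"])
# terms -> ['accident', 'car', 'models']; A -> [[1, 0], [1, 2], [0, 1]]
```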

Latent Semantic Analysis
Compute the singular value decomposition (SVD) of A: A = U Σ V^T.
Retain the k < n largest singular values; set the remainder to zero.
This projects terms and documents into a k-dimensional space.
Compute similarity in that space.
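A minimal sketch of the rank-k truncation with NumPy; the toy matrix and the names `truncated_svd`, `A_k` are assumptions for illustration:

```python
import numpy as np

def truncated_svd(A, k):
    """Keep only the k largest singular values of A (the LSA projection)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

A = np.array([[2., 0., 1.],
              [0., 1., 0.],
              [1., 0., 2.]])
Uk, sk, Vtk = truncated_svd(A, k=2)
A_k = Uk @ np.diag(sk) @ Vtk  # best rank-2 approximation of A
```

By the Eckart-Young theorem, A_k is the closest rank-k matrix to A, which is why dropping small singular values discards noise rather than signal.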

Singular Value Decomposition: A = U Σ V^T
U: each row corresponds to a word.
Σ: diagonal matrix of the singular values of A.
V^T: each column corresponds to a document.
[Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4).

Using the SVD
U: look only at the first k columns (words).
Σ: set all but the k largest singular values to zero.
V^T: look only at the first k rows (documents).
[Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4).

Using LSA to Find Aliases
Given a name N and a document collection D:
1. Compute the SVD of the term-document matrix.
2. Retain the k largest singular values.
3. Compute the similarity of all terms to N.
4. Report a rank-ordered list of terms.
True aliases for N should appear high in the list.
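The ranking steps above can be sketched as follows, representing term i by row i of U_k Σ_k and comparing terms by cosine similarity. The transcript does not spell out the exact similarity measure, so treat that choice, the function name, and the toy matrix as assumptions:

```python
import numpy as np

def rank_candidate_aliases(A, terms, name, k):
    """Rank all other terms by cosine similarity to `name` in the
    k-dimensional LSA space; term i is row i of U_k * diag(s_k)."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    T = U[:, :k] * s[:k]                                  # term coordinates
    T = T / np.linalg.norm(T, axis=1, keepdims=True)      # unit-length rows
    sims = T @ T[terms.index(name)]
    order = np.argsort(-sims)
    return [(terms[i], float(sims[i])) for i in order if terms[i] != name]

terms = ["car", "automobile", "sports"]
# toy counts: "car" and "automobile" share document 2; "sports" stands alone
A = np.array([[2., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 0., 3.]])
ranked = rank_candidate_aliases(A, terms, "car", k=2)
# "automobile" ranks first; "sports" ends up near zero similarity
```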

Experiment: Creating Aliases
Given a name N and a document collection D:
1. Set P, a percentage.
2. Choose S1 and S2, two strings not in D.
3. Replace N with S1 in P% of the documents containing it.
4. Replace N with S2 in the remaining documents.
5. Search for aliases of S1.
6. Observe the rank of S2 in the ordered list.
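The alias-planting setup can be sketched with a simple string substitution; the function name, seed handling, and toy documents are assumptions:

```python
import random

def plant_aliases(docs, name, s1, s2, p, seed=0):
    """Replace `name` with s1 in p% of the documents that contain it,
    and with s2 in the remaining ones."""
    rng = random.Random(seed)
    hits = [i for i, d in enumerate(docs) if name in d]
    rng.shuffle(hits)
    cut = round(len(hits) * p / 100)
    out = list(docs)
    for i in hits[:cut]:
        out[i] = out[i].replace(name, s1)
    for i in hits[cut:]:
        out[i] = out[i].replace(name, s2)
    return out

docs = ["al Qaeda attacked", "an al Qaeda cell", "weather report"]
planted = plant_aliases(docs, "al Qaeda", "alqaeda1", "alqaeda2", p=50)
```

Because S1 and S2 now stand for the same underlying entity, S2 is a known-correct alias for S1, giving a ground truth against which the ranked list can be scored.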

Our Dataset
77 documents from …
The shortest has 131 words, the longest 1923.
"al Qaeda" occurs in 49 documents; the others cover politics, sports, and entertainment.
N = "al Qaeda", S1 = "alqaeda1", S2 = "alqaeda2", P = 50.

Algorithm Parameters
k: dimensionality of the compressed space.
- Small values result in spurious similarities.
- Large values closely approximate A.
T: threshold on the TF-IDF (term frequency / inverse document frequency) value.
- Larger values filter more aggressively.
- We want to filter irrelevant words without filtering true aliases.
We want a high retrieval (precision) rate and a low miss rate, since aliases are infrequent in the collection.
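The TF-IDF filter controlled by T might look like the following sketch. The transcript does not give the exact weighting or filtering rule, so this assumes a common tf × log(N/df) variant that keeps a term if any of its per-document weights reaches the threshold:

```python
import math
from collections import Counter

def tfidf_filter(docs, threshold):
    """Keep a term if its TF-IDF weight, tf * log(N / df), reaches
    `threshold` in at least one document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    kept = set()
    for doc in tokenized:
        for t, tf in Counter(doc).items():
            if tf * math.log(n / df[t]) >= threshold:
                kept.add(t)
    return kept

kept = tfidf_filter(["the car crashed", "the car raced", "the weather report"],
                    threshold=1.0)
# "the" (in every document) and "car" (low weight) are filtered out
```

Note that a term occurring in every document gets IDF zero and is always filtered, which is the desired behavior for stopword-like terms.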

Results 1: LSA Stage 1
Figure 1: Plot of rank as a function of T, for several values of k.

Results: Ontologically Dissimilar
Top 10 terms most similar to "alqaeda1":
k = 5:  arrested, government, ressam, lindh, zubaydah, raids, attacks, brahim, passengers, virginia
k = 10: zubaydah, raids, ressam, pakistani, hamdi, soldier, trial, alqaeda2, pakistan, walker
k = 20: zubaydah, ressam, raids, hamdi, alqaeda2, pakistani, trial, soldier, pakistan, lindh
Problem: LSA ranks organizations and individuals as similar.

Local Context to Ontology
An organization:
… list of al Qaeda leaders …
… most senior al Qaeda member captured …
… alleged al Qaeda representative …
An individual:
… photograph showing Lindh blindfolded …
… with Lindh, the 21-year-old American …
… Lindh pleaded guilty …
Ontology: hierarchical structuring of knowledge according to relevant or cognitive qualities.

A Second Run of LSA
For each term T in the top 250 candidates:
- Create a document D_T containing the words just before and just after each occurrence of T in the original corpus.
Run LSA on all of the D_T (the new corpus).
… most senior al Qaeda member captured …
… photograph showing Lindh blindfolded and …
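Building the second-stage corpus can be sketched as below, with a window of one word on each side as the slide describes; the tokenization details and function name are assumptions:

```python
def context_documents(docs, candidates):
    """For each candidate term T, build the pseudo-document D_T from the
    words immediately before and after each occurrence of T."""
    contexts = {t: [] for t in candidates}
    for doc in docs:
        words = doc.lower().split()
        for i, w in enumerate(words):
            if w in contexts:
                if i > 0:
                    contexts[w].append(words[i - 1])
                if i + 1 < len(words):
                    contexts[w].append(words[i + 1])
    return {t: " ".join(ws) for t, ws in contexts.items()}

new_corpus = context_documents(
    ["senior qaeda member captured", "alleged qaeda representative"],
    {"qaeda"})
# new_corpus["qaeda"] -> "senior member alleged representative"
```

Two terms then look similar only if the words immediately around them are similar, which separates organizations from individuals.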

Results 2: LSA Stage 2
Figure 2: Plot of rank as a function of T, for several values of k.

Results 2: Scaled to Figure 1
Figure 3: Plot of rank as a function of T, for several values of k, on the same scale as Figure 1.

Results 1 & 2: Comparison
LSA-1 and LSA-2, before and after.

Results: Contextually Similar
Top 10 terms most similar to "alqaeda1" after the second LSA run:
k = 5:  tenet, suspected, warned, alqaeda2, terrorism, terrorist, anaconda, potential, operation, operations
k = 10: cells, alqaeda2, network, suspects, germany, laden, alleged, cell, terrorist, warned
k = 20: cells, network, alqaeda2, cell, terrorist, alleged, suspects, laden, singapore, germany
Solution: LSA with context ranks terms by ontological similarity.

Applications
Example alias search in movie titles, with query N = "Ocean's 12".
Create your own corpus:
1. Submit N as a Google query.
2. Create a corpus from the top M hits (here, M = 100).
3. Run the two-stage LSA algorithm.
You might retrieve:
1. GoldenEye
2. Ocean's …
3. Die Hard: Vengeance
4. The Italian Job

Review
- Goal: find semantically related terms.
- The obvious solution: LSA.
- Plain LSA is not so good.
- So we ran LSA again, on local contexts, and two-stage LSA works!
- Create a corpus with Google.

Your Questions?
Acknowledgements: Dr. Tim Oates.
References:
Berry, M., and Fierro, R. 1996. Low-rank orthogonal decompositions for information retrieval applications. Numerical Linear Algebra with Applications 3(4).