Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas University.


1 Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text. Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas, University of Maryland - Baltimore County. ACM Web Information Data Management, 2002:31-35. Conley Read, cread@cs.ucr.edu, Computer Science & Engineering, University of California - Riverside

2 Overview The problem – research motivation The solution, LSA? LSA doesn’t work so well Let’s do it (LSA) again Two-stage LSA works! Create your own Corpus

3 The Problem Mumbai Bombay

4 Motivation al Qaeda al Qaida

5 Motivation Nutrasweet aspartame

6 Motivation al Qaeda / al Qaida. Shared context words: cells, network, suspects, Iraq, bin Laden, alleged, cell, warned, terrorist

7 An Old IR Problem … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

8 Keyword Query: CAR … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

9 Keyword Query: AUTOMOBILE … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

10 Latent Semantic Analysis … drove their car to … … minor car accident … … new 2002 car models … … car gets good mileage … … drove their automobile to … … minor automobile accident … … new 2002 automobile models … … automobile gets good mileage …

11 Term-Document Matrix A = m terms n documents A(I, J) = number of times term I occurs in document J
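
A minimal sketch of how the term-document matrix on this slide could be built. The whitespace tokenizer, the lowercasing, and the function name `term_document_matrix` are illustrative assumptions, not details from the paper.

```python
from collections import Counter

import numpy as np


def term_document_matrix(docs):
    """Return (vocab, A) where A[i, j] counts how often term i occurs in document j."""
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(tokenized)), dtype=int)
    for j, toks in enumerate(tokenized):
        for term, count in Counter(toks).items():
            A[index[term], j] = count
    return vocab, A


# Hypothetical three-document corpus in the spirit of the car/automobile slides.
docs = ["drove their car to", "minor car accident", "minor automobile accident"]
vocab, A = term_document_matrix(docs)
```

Here `A` has one row per distinct term (m terms) and one column per document (n documents), matching the slide's definition of A(I, J).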

12 Latent Semantic Analysis Compute singular value decomposition (SVD) of A. Retain k < n largest singular values. Set remainder to zero. Projects terms/docs into k-dimensional space. Compute similarity in that space. A = U Σ V^T

13 Singular Value Decomposition A = U Σ V^T. U – each row corresponds to a word. Σ – singular values of A. V^T – each column corresponds to a document. [Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4):301-327

14 Using SVD A = U Σ V^T. U – look only at k columns (words). Σ – set all but k largest to zero. V^T – look only at k rows (documents). [Berry & Fierro 1996] Numerical Linear Algebra with Applications 3(4):301-327
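
The truncation described on the last three slides can be sketched with NumPy. Representing each term by its row of U_k Σ_k is an assumption made here for illustration; it gives the same term-to-term inner products as the rows of the rank-k matrix U_k Σ_k V_k^T. The toy matrix and the names `lsa_term_vectors` and `cosine` are hypothetical.

```python
import numpy as np


def lsa_term_vectors(A, k):
    """Return one k-dimensional vector per term (row of A) via truncated SVD."""
    U, s, _ = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    # NumPy returns singular values in descending order, so the first k
    # columns of U pair with the k largest singular values.
    return U[:, :k] * s[:k]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Toy matrix (hypothetical data): rows are terms, columns are documents.
terms = ["car", "automobile", "accident", "banana", "fruit"]
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # accident
    [0, 0, 1],  # banana
    [0, 0, 1],  # fruit
])

raw = cosine(A[0], A[1])            # car vs. automobile on raw counts
vecs = lsa_term_vectors(A, k=2)
reduced = cosine(vecs[0], vecs[1])  # same pair in the 2-dimensional space
```

On this toy matrix the raw count vectors for "car" and "automobile" are orthogonal because the two words never share a document, while their reduced vectors nearly coincide because both co-occur with "accident".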

15 Using LSA to Find Aliases Given name N and document collection D Compute SVD of term-document matrix Retain k largest singular values Compute similarity of all terms to N Report rank-ordered list of terms True aliases for N must be high in list
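
The procedure on this slide can be sketched as follows. Cosine similarity in the reduced space is an assumed similarity measure (the slide only says "compute similarity"), and `rank_aliases` and the toy data are hypothetical.

```python
import numpy as np


def rank_aliases(A, vocab, name, k):
    """Rank every other term by cosine similarity to `name` in a k-dim LSA space."""
    U, s, _ = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    T = U[:, :k] * s[:k]               # one k-dimensional vector per term
    norms = np.linalg.norm(T, axis=1)
    norms[norms == 0] = 1.0            # leave all-zero vectors as zeros
    T = T / norms[:, None]
    sims = T @ T[vocab.index(name)]    # cosine similarity to the query term
    order = np.argsort(-sims, kind="stable")
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != name]


# Toy matrix (hypothetical data): rows are terms, columns are documents.
vocab = ["car", "automobile", "accident", "banana", "fruit"]
A = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])
ranked = rank_aliases(A, vocab, "car", k=2)
```

A true alias should appear near the top of `ranked`. In this toy case "automobile" and "accident" both score close to 1, which also hints at the weakness discussed later in the talk: terms that are merely related rank as highly as true aliases.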

16 Experiment: Creating Aliases Given name N and document collection D. Set P, a percentage. S1 and S2 are two strings not in D. Replace N with S1 in P% of the documents. Replace N with S2 in the other documents. Search for aliases for S1. Observe rank of S2 in ordered list
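
One way the alias-creation step might look in code. This sketch assumes P% is taken over the documents that actually contain N and that the split is random; neither detail is stated on the slide. The function name `split_aliases`, the seed, and the sample documents are hypothetical.

```python
import random


def split_aliases(docs, name, s1, s2, p, seed=0):
    """Replace `name` with s1 in about p% of the documents containing it, s2 in the rest."""
    rng = random.Random(seed)
    with_name = [i for i, doc in enumerate(docs) if name in doc]
    rng.shuffle(with_name)
    n_s1 = round(len(with_name) * p / 100)
    to_s1 = set(with_name[:n_s1])
    out = []
    for i, doc in enumerate(docs):
        if name in doc:
            doc = doc.replace(name, s1 if i in to_s1 else s2)
        out.append(doc)
    return out


# Hypothetical mini-corpus; the real experiment used 77 CNN articles.
docs = [
    "list of al Qaeda leaders",
    "most senior al Qaeda member captured",
    "alleged al Qaeda representative",
    "al Qaeda cells in Germany",
    "new 2002 car models",
]
out = split_aliases(docs, "al Qaeda", "alqaeda1", "alqaeda2", p=50)
```

After the split, S1 and S2 never occur in the same document, so any similarity between them has to come from shared context rather than direct co-occurrence.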

17 Our Dataset 77 documents from www.cnn.com. Shortest has 131 words, longest has 1923. “al Qaeda” occurs in 49 documents. Others on politics, sports, entertainment. N = “al Qaeda”, S1 = “alqaeda1”, S2 = “alqaeda2”, P = 50

18 Algorithm Parameters k – dimensionality of compressed space: small values result in spurious similarities; large values closely approximate A. T – threshold on TF/IDF (Term Frequency / Inverse Document Frequency) value: more aggressive filtering with larger values; want to avoid filtering aliases; want to filter irrelevant words. We want High Retrieval (precision) and Low Miss (infrequent in collection) rates.
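
A rough sketch of TF/IDF thresholding. The slide does not give the exact weighting, so this assumes a term's score is its largest value of tf × ln(n / df) over all documents and that terms scoring below the threshold are dropped. The function name `tfidf_filter` and the counts are hypothetical.

```python
import numpy as np


def tfidf_filter(A, vocab, t):
    """Keep only terms whose best TF-IDF score over all documents is at least t."""
    A = np.asarray(A, dtype=float)
    n_docs = A.shape[1]
    df = np.count_nonzero(A, axis=1)           # documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))   # assumed IDF variant: ln(n / df)
    scores = (A * idf[:, None]).max(axis=1)    # best tf * idf per term
    keep = scores >= t
    kept_vocab = [term for term, flag in zip(vocab, keep) if flag]
    return kept_vocab, A[keep]


# Hypothetical counts: "the" occurs everywhere, the other two terms are rare.
vocab = ["the", "qaeda", "lindh"]
A = np.array([
    [3, 2, 4],
    [2, 0, 0],
    [0, 0, 1],
])
loose_vocab, _ = tfidf_filter(A, vocab, t=0.5)
strict_vocab, _ = tfidf_filter(A, vocab, t=2.0)
```

With the lower threshold only the ubiquitous "the" is removed; raising the threshold also removes "lindh", illustrating how larger values filter more aggressively and risk discarding a genuine alias.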

19 Results 1: LSA Stage 1 Figure 1: Plot of Rank as a function of t for values of k.

20 Results: Ontologically Dissimilar k = 5: arrested, government, ressam, lindh, zubaydah, raids, attacks, brahim, passengers, virginia. k = 10: zubaydah, raids, ressam, pakistani, hamdi, soldier, trial, alqaeda2, pakistan, walker. k = 20: zubaydah, ressam, raids, hamdi, alqaeda2, pakistani, trial, soldier, pakistan, lindh. Problem: LSA shows Organizations and Individuals as similar.

21 Local Context to Ontology … list of al Qaeda leaders … … most senior al Qaeda member captured … … alleged al Qaeda representative … … photograph showing Lindh blindfolded … … with Lindh, the 21-year-old American … … Lindh pleaded guilty … Ontology: Hierarchical structuring of knowledge according to relevant or cognitive qualities. An Organization An Individual

22 A Second Run of LSA For each term T in the top 250 candidates: create a document D_T. D_T contains the words just before and just after each occurrence of T in the original corpus. Run LSA on all of the D_T (the new corpus). … most senior al Qaeda member captured … … photograph showing Lindh blindfolded and …
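
Building the second-stage corpus could look like the sketch below. It assumes single-token candidate terms, whitespace tokenization, and a window of one word on each side; the slide only says "just before and just after". The name `context_documents` and the `window` parameter are hypothetical.

```python
def context_documents(docs, candidates, window=1):
    """Map each candidate term T to a document D_T made of its neighbouring words."""
    candidates = set(candidates)
    contexts = {term: [] for term in candidates}
    for doc in docs:
        toks = doc.lower().split()
        for i, tok in enumerate(toks):
            if tok in candidates:
                before = toks[max(0, i - window):i]
                after = toks[i + 1:i + 1 + window]
                contexts[tok].extend(before + after)
    return {term: " ".join(words) for term, words in contexts.items()}


# Hypothetical snippets modelled on the slide's examples.
docs = [
    "most senior alqaeda1 member captured",
    "photograph showing lindh blindfolded and",
]
d_t = context_documents(docs, ["alqaeda1", "lindh"])
```

Each value in `d_t` is one document of the new corpus, so a second LSA pass over these documents compares terms by the company they keep locally rather than by which articles they appear in.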

23 Results 2: LSA Stage 2 Figure 2: Plot of Rank as a function of t for values of k.

24 Results 2: Scaled to Figure 1 Figure 3: Plot of Rank as a function of t for values of k.

25 Results 1 & 2: Comparison LSA-1 and LSA-2, Before and After.

26 Results: Contextually Similar k = 5: tenet, suspected, warned, alqaeda2, terrorism, terrorist, anaconda, potential, operation, operations. k = 10: cells, alqaeda2, network, suspects, germany, laden, alleged, cell, terrorist, warned. k = 20: cells, network, alqaeda2, cell, terrorist, alleged, suspects, laden, singapore, germany. Solution: LSA with context ranks terms by ontological similarity.

27 Applications Example alias in Movie Titles: Query N = “Ocean’s 12”. Use Google to get top 100 hits. Run two-stage LSA algorithm. Create your own corpus: submit N as Google query; create corpus from top M hits; run two-stage LSA. You might retrieve: 1. GoldenEye 2. Ocean’s 11 3. Die Hard: Vengeance 4. The Italian Job

28 Review Find semantically related terms Obvious solution – LSA LSA is not so good We ran LSA again! LSA is great! Create a Corpus with Google

29 Your Questions? Acknowledgements: Dr. Tim Oates, oates@cs.umbc.edu. References – the math… Berry, M., Fierro, R. 1996. Low-rank orthogonal decompositions for information retrieval applications. Numerical Linear Algebra with Applications 3(4):301-327.

