LATENT SEMANTIC INDEXING Hande Zırtıloğlu Levent Altunyurt

OUTLINE
Problem
History
What is LSI?
What do we really search for?
How LSI works
Concept Space
Application Areas
Examples
Advantages / Disadvantages

PROBLEM
Conventional IR methods depend on
– Boolean
– vector space
– probabilistic models
Handicap
– dependent on term matching
– not efficient for IR
Need to capture the concepts instead of only the words
– multiple terms can contribute to a similar semantic meaning (synonymy, e.g. car and automobile)
– one term may have various meanings depending on its context (polysemy, e.g. apple the fruit vs. Apple the computer company)

IR MODELS

THE IDEAL SEARCH ENGINE
Scope
– able to search every document on the Internet
Speed
– better hardware and programming
Currency
– frequent updates
Recall
– always find every document relevant to our query
Precision
– no irrelevant documents in our result set
Ranking
– most relevant results would come first
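A quick worked illustration of the last three properties (the numbers here are made up, not from the slides): if 10 documents in the collection are relevant to a query and the engine returns 8 results of which 6 are relevant, then recall is 6/10 = 0.6, precision is 6/8 = 0.75, and good ranking would place those 6 relevant results ahead of the 2 irrelevant ones.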

What a Simple Search Engine Cannot Differentiate
Polysemy
– monitor (to monitor a workflow vs. a computer monitor)
Synonymy
– car, automobile
Singular and plural forms
– tree, trees
Words with similar roots
– different, differs, differed

HISTORY
A mathematical technique for information filtering
Developed at Bellcore (later Telcordia) at the end of the 1980s
30% more effective in filtering relevant documents than word-matching methods
A solution to “polysemy” and “synonymy”

LOOKING FOR WHAT?
Search
– “Paris Hilton”: really interested in the Hilton Hotel in Paris?
– “Tiger Woods”: searching for something about wildlife or for the famous golf player?
Simple word matching fails

LSI?
Concepts instead of words
A mathematical model
– relates documents and the concepts
Looks for concepts in the documents
Stores them in a concept space
– related documents are connected to form a concept space
Does not need an exact match for the query

CONCEPTS

HOW LSI WORKS
Given a set of documents, how do we determine the similar ones?
– examine the documents
– try to find concepts in common
– classify the documents
This is also how LSI works: it represents terms and documents in a high-dimensional space, allowing relationships between terms and documents to be exploited during searching (see the sketches after the term-document matrix below).

HOW TO OBTAIN A CONCEPT SPACE?
One possible way would be to find canonical representations of natural language
– a difficult task to achieve
Much simpler
– use the mathematical properties of the term-document matrix, i.e. determine the concepts by matrix computation

TERM-DOCUMENT MATRIX
Query: Human-Computer Interaction
Dataset:
c1: Human machine interface for Lab ABC computer application
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relations of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
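A minimal sketch of how such a term-document matrix can be built (this is not the presenters' code; using scikit-learn, English stop-word removal, and a min_df=2 filter is an assumption that roughly reproduces the twelve index terms of the classic Deerwester et al. example):

```python
# Build the term-document matrix for the nine example titles above.
# Assumption: scikit-learn's CountVectorizer, keeping only terms that occur
# in at least two documents and dropping English stop words.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Human machine interface for Lab ABC computer application",        # c1
    "A survey of user opinion of computer system response time",       # c2
    "The EPS user interface management system",                        # c3
    "System and human system engineering testing of EPS",              # c4
    "Relations of user-perceived response time to error measurement",  # c5
    "The generation of random, binary, unordered trees",               # m1
    "The intersection graph of paths in trees",                        # m2
    "Graph minors IV: Widths of trees and well-quasi-ordering",        # m3
    "Graph minors: A survey",                                          # m4
]

vectorizer = CountVectorizer(stop_words="english", min_df=2)
X = vectorizer.fit_transform(docs).T.toarray()  # rows = terms, columns = documents

print(vectorizer.get_feature_names_out())  # computer, eps, graph, human, interface, ...
print(X.shape)                             # (number of index terms, 9 documents)
```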

CONCEPT SPACE
Makes semantic comparison possible
Created by using matrices
Tries to detect the hidden similarities between documents
Avoids minor similarities
Words with similar meanings
– occur close to each other
Dimensions: terms
Plotted points: documents
Each document is a vector
Reduce the dimensions
– Singular Value Decomposition (SVD)
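A minimal sketch of the dimension reduction and query matching, continuing from the previous block (the choice of two latent concepts and the standard folding-in formula q_hat = q^T U_k S_k^-1 are assumptions of this sketch, not the presenters' code):

```python
# Reduce the term-document matrix X to a k-dimensional concept space via SVD
# and rank the example documents against the query "human computer interaction".
import numpy as np

U, s, Vt = np.linalg.svd(X.astype(float), full_matrices=False)
k = 2                                      # number of latent concepts kept
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

doc_vecs = (np.diag(sk) @ Vtk).T           # one k-dimensional vector per document

# Fold the query into the same concept space: q_hat = q^T U_k S_k^{-1}
q = vectorizer.transform(["human computer interaction"]).toarray().ravel()
q_hat = q @ Uk @ np.linalg.inv(np.diag(sk))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

labels = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
ranking = sorted(range(len(labels)), key=lambda i: -cosine(q_hat, doc_vecs[i]))
print([labels[i] for i in ranking])  # the c* titles are expected to rank above the m* titles
```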

APPLICATION AREAS
Dynamic advertisements put on pages (Google’s AdSense)
Improving the performance of search engines
– in ranking pages
Spam filtering for e-mail
Optimizing the link profile of your web page
Cross-language retrieval
Foreign language translation
Automated essay grading
Modelling of human cognitive function

GOOGLE USES LSI
Increasing its weight in ranking pages
– the ~ sign before a search term stands for semantic search
– “~phone”: the first link that appears is the page for “Nokia”, although the page does not contain the word “phone”
– “~humor”: retrieved pages contain its synonyms: comedy, jokes, funny
Google AdSense sandbox
– check which advertisements Google would put on your page

ANOTHER USAGE
Tried on the TOEFL exam
– a word is given, and the word most similar in meaning must be selected from four candidates
– scored 65% correct
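A toy illustration of the same idea on the small example from the earlier sketches (this reuses the variables defined above and is not the TOEFL data or the presenters' code): in the reduced space each term gets a vector U_k S_k, and the "most similar in meaning" candidate is simply the one whose term vector has the highest cosine similarity.

```python
# Term vectors in the same k-dimensional concept space, one row per index term.
term_vecs = Uk @ np.diag(sk)
terms = list(vectorizer.get_feature_names_out())

def most_similar(word):
    v = term_vecs[terms.index(word)]
    candidates = [(cosine(v, term_vecs[i]), t) for i, t in enumerate(terms) if t != word]
    return max(candidates)[1]              # term with the highest cosine similarity

print(most_similar("human"))               # expected to land among the HCI terms (e.g. "user"),
                                           # not among the graph/tree terms
```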

+ / -
Improves the efficiency of the retrieval process
Decreases the dimensionality of the vectors
Good for machine learning algorithms in which high dimensionality is a problem
Dimensions are more semantic
Newly reduced vectors are dense, so saving memory is not guaranteed
Some words have several meanings, which can make retrieval confusing
Expensive to compute; the complexity is high