CS 430 / INFO 430 Information Retrieval, Lecture 11: Latent Semantic Indexing; Extending the Boolean Model


1 CS 430 / INFO 430 Information Retrieval Lecture 11 Latent Semantic Indexing Extending the Boolean Model

2 Course Administration
Assignment 1
If you have questions about your grading, send me email. The following are reasonable requests: the wrong files were graded, points were added up wrongly, comments are unclear, etc. We are not prepared to argue over details of judgment. If you ask for a regrade, the final grade may be lower than the original!

3 Course Administration Assignment 2 The assignment has been posted. The test data is being checked. Look for changes before Saturday evening.

4 Course Administration Midterm Examination Wednesday, October 14, 7:30 to 9:00 p.m., Upson B17. Open book. Laptop computers may be used for lecture slides, notes, readings, etc., but no network connections during the examination. A sample examination and discussion of the solution will be posted to the Web site.

5 CS 430 / INFO 430 Information Retrieval Latent Semantic Indexing

6 Objective
Replace indexes that use sets of index terms by indexes that use concepts.
Approach
Map the term vector space into a lower-dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.

7 Deficiencies with Conventional Automatic Indexing
Synonymy: various words and phrases refer to the same concept (lowers recall).
Polysemy: individual words have more than one meaning (lowers precision).
Independence: no significance is given to two terms that frequently appear together.

8 Example Query: "IDF in computer-based information look-up" Index terms for a document: access, document, retrieval, indexing How can we recognize that information look-up is related to retrieval and indexing? Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?

9 Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

10 Technical Memo Example: Terms and Documents
Term counts in each title (terms that occur in two or more titles):

            c1 c2 c3 c4 c5 m1 m2 m3 m4
human        1  0  0  1  0  0  0  0  0
interface    1  0  1  0  0  0  0  0  0
computer     1  1  0  0  0  0  0  0  0
user         0  1  1  0  1  0  0  0  0
system       0  1  1  2  0  0  0  0  0
response     0  1  0  0  1  0  0  0  0
time         0  1  0  0  1  0  0  0  0
EPS          0  0  1  1  0  0  0  0  0
survey       0  1  0  0  0  0  0  0  1
trees        0  0  0  0  0  1  1  1  0
graph        0  0  0  0  0  0  1  1  1
minors       0  0  0  0  0  0  0  1  1
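This matrix can be built mechanically from the titles on the previous slide. A minimal sketch (not from the lecture; the dictionary of lowercased titles is illustrative):

```python
# Build the term-document count matrix for the Deerwester technical-memo
# example from the (lowercased, de-hyphenated) titles of slide 9.
import numpy as np

titles = {
    "c1": "human machine interface for lab abc computer applications",
    "c2": "a survey of user opinion of computer system response time",
    "c3": "the eps user interface management system",
    "c4": "system and human system engineering testing of eps",
    "c5": "relation of user perceived response time to error measurement",
    "m1": "the generation of random binary unordered trees",
    "m2": "the intersection graph of paths in trees",
    "m3": "graph minors iv widths of trees and well quasi ordering",
    "m4": "graph minors a survey",
}
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]

# X[i, j] = number of times term i appears in the title of document j
X = np.array([[title.split().count(t) for title in titles.values()]
              for t in terms])
print(X.shape)                    # (12, 9)
print(X[terms.index("system")])   # [0 1 1 2 0 0 0 0 0]
```

Note that 'system' gets a count of 2 in c4, matching the matrix above.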

11 Technical Memo Example: Query
Query: Find documents relevant to "human computer interaction"
Simple term matching: matches c1, c2, and c4; misses c3 and c5.

12 The index term vector space
[Figure: document vectors d1 and d2 drawn in a space with axes t1, t2, t3.]
The space has as many dimensions as there are terms in the word list.

13 Models of Semantic Similarity
Proximity models put similar items together in some space or structure:
- Clustering (hierarchical, partition, overlapping). Documents are considered close to the extent that they contain the same terms; most methods then arrange the documents into a hierarchy based on distances between documents. [Covered later in course.]
- Factor analysis based on a matrix of similarities between documents (single mode).
- Two-mode proximity methods. Start with a rectangular matrix and construct explicit representations of both row and column objects.

14 Selection of Two-mode Factor Analysis
Additional criterion: computationally efficient, O(N^2 k^3), where N is the number of terms plus documents and k is the number of dimensions.

15 Figure 1
[Figure: terms, documents, and a query plotted in the two-dimensional concept space; the dashed boundary marks points with cosine > 0.9 to the query.]

16 Mathematical concepts: Singular Value Decomposition
Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents). There exist matrices T0, S0 and D0, such that:
X = T0 S0 D0'
T0 and D0 are the matrices of left and right singular vectors.
T0 and D0 have orthonormal columns.
S0 is the diagonal matrix of singular values.

17 Dimensions of matrices
X = T0 S0 D0'
(t x d) = (t x m)(m x m)(m x d)
where m is the rank of X, m <= min(t, d).

18 Reduced Rank
Diagonal elements of S0 are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives:
X ≈ X̂ = TSD'
Interpretation: if the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.

19 Selection of singular values
X̂ = T S D'
(t x d) = (t x k)(k x k)(k x d)
k is the number of singular values chosen to represent the concepts in the set of documents. Usually, k << m.
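A minimal sketch of this rank-k reduction using NumPy's SVD (not from the lecture; the helper name `lsi` and the toy matrix are illustrative):

```python
# Keep only the k largest singular values of a term-document matrix.
import numpy as np

def lsi(X, k):
    """Return T (t x k), s (k,), D (d x k) for the rank-k reduction."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    return T0[:, :k], s0[:k], D0t[:k, :].T

X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
T, s, D = lsi(X, k=2)
X_hat = T @ np.diag(s) @ D.T   # best rank-2 approximation of X
```

With k equal to the full rank of X, T diag(s) D' reconstructs X exactly; smaller k discards the smallest singular values and with them, ideally, the noise.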

20 Comparing Two Terms
X̂X̂' = TSD'(TSD')' = TSD'DS'T' = TSS'T' (since D is orthonormal) = TS(TS)'
To calculate the i, j cell, take the dot product between rows i and j of TS.
Since S is diagonal, TS differs from T only by stretching the coordinate system.
The dot product of two rows of X̂ reflects the extent to which two terms have a similar pattern of occurrences.

21 Comparing Two Documents
X̂'X̂ = (TSD')'TSD' = DS'T'TSD' = DS'SD' (since T is orthonormal) = DS(DS)'
To calculate the i, j cell, take the dot product between rows i and j of DS.
Since S is diagonal, DS differs from D only by stretching the coordinate system.
The dot product of two columns of X̂ reflects the extent to which two documents have a similar pattern of occurrences.
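The two identities above can be checked numerically. A sketch (toy matrix and variable names are illustrative, not from the lecture):

```python
# Term-term and document-document similarities as dot products of the
# rows of TS and DS, verified against X_hat X_hat' and X_hat' X_hat.
import numpy as np

X = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
TS = T0[:, :k] * s0[:k]        # term vectors, stretched by S
DS = D0t[:k, :].T * s0[:k]     # document vectors, stretched by S

term_sims = TS @ TS.T          # equals X_hat X_hat'
doc_sims = DS @ DS.T           # equals X_hat' X_hat
X_hat = TS @ D0t[:k, :]
assert np.allclose(term_sims, X_hat @ X_hat.T)
assert np.allclose(doc_sims, X_hat.T @ X_hat)
```

The asserts confirm that comparing rows of TS (or DS) is the same as comparing rows (or columns) of the reduced matrix X̂ directly, without ever forming X̂.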

22 Comparing a Term and a Document
Comparison between a term and a document is the value of an individual cell of X̂.
X̂ = TSD' = TS^(1/2)(DS^(1/2))'
where S^(1/2) is the diagonal matrix whose values are the square roots of the corresponding elements of S.

23 Technical Memo Example: Query
Query: "human system interactions on trees"
In term-document space, a query is represented by x_q, a t x 1 vector.
In concept space, a query is represented by d_q, a 1 x k vector.

Term        x_q
human        1
interface    0
computer     0
user         0
system       1
response     0
time         0
EPS          0
survey       0
trees        1
graph        0
minors       0

24 Comparing a Query and a Document
A query can be expressed as a vector x_q in the term-document vector space: x_qi = 1 if term i is in the query, and 0 otherwise.
Let p_qj be the inner product of the query x_q with document d_j in the term-document vector space. p_qj is the jth element of the product x_q'X̂.

25 Comparing a Query and a Document
p_q' = [p_q1 ... p_qj ... p_qd] = x_q'X̂ = x_q'TSD' = x_q'T(DS)'
where p_qj is the inner product of query q with document d_j, and document d_j is column j of X̂.
similarity(q, d_j) = p_qj / (|x_q| |d_j|)
(the cosine of the angle is the inner product divided by the lengths of the vectors).
Revised October 6, 2004

26 Comparing a Query and a Document
In the reading, the authors treat the query as a pseudo-document d_q in the concept space:
d_q = x_q'TS^-1
To compare a query against document j, they extend the method used to compare document i with document j: take the jth element of the product of d_qS and (DS)'. This is the jth element of the product of x_q'T and (DS)', which is the same expression as before. Note that d_q is a row vector.
Revised October 6, 2004
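A sketch of query scoring via the pseudo-document, checking that it reproduces x_q'X̂ (toy matrix and query are illustrative, not from the lecture):

```python
# Score a query against documents in the reduced concept space.
import numpy as np

X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

x_q = np.array([1.0, 0.0, 1.0])       # query contains terms 0 and 2
d_q = x_q @ T @ np.linalg.inv(S)      # pseudo-document, 1 x k
p_q = (d_q @ S) @ (D @ S).T           # one score per document

X_hat = T @ S @ D.T
assert np.allclose(p_q, x_q @ X_hat)  # same as x_q' X_hat, as claimed
```

The final assert is the point of the slide: folding the query into the concept space and comparing it there gives exactly the scores x_q'X̂ from slide 25.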

27 Experimental Results
Deerwester et al. tried latent semantic indexing on two test collections, MED and CISI, for which queries and relevance judgments were available. Documents were the full text of title and abstract. Stop list of 439 words (SMART); no stemming, etc. Comparison with: (a) simple term matching, (b) SMART, (c) the Voorhees method.

28 Experimental Results: 100 Factors

29 Experimental Results: Number of Factors

30 CS 430 / INFO 430 Information Retrieval Extending the Boolean Model

31 Boolean Diagram
[Venn diagram: sets A and B, with the regions A and B, A or B, and not (A or B) marked.]

32 Problems with the Boolean model
Counter-intuitive results:
Query q = A and B and C and D and E. Document d has terms A, B, C and D, but not E. Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.
Query q = A or B or C or D or E. Document d1 has terms A, B, C, D and E. Document d2 has term A, but not B, C, D or E. Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.

33 Problems with the Boolean model (continued)
Boolean is all or nothing:
The Boolean model has no way to rank documents.
The Boolean model allows for no uncertainty in assigning index terms to documents.
The Boolean model has no provision for adjusting the importance of query terms.

34 Boolean model as sets
[Diagram: set A with element d.] d is either in the set A or not in A.

35 Extending the Boolean model
Term weighting: give weights to terms in documents and/or queries; combine standard Boolean retrieval with vector ranking of results.
Fuzzy sets: relax the boundaries of the sets used in Boolean retrieval.

36 Ranking methods in Boolean systems
SIRE (Syracuse Information Retrieval Experiment)
Term weights: add term weights to documents; weights calculated by the standard method of term frequency * inverse document frequency.
Ranking: calculate the results set by standard Boolean methods, then rank the results by vector distances.
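A small sketch of this two-stage approach on hypothetical toy data (document contents and the OR query are illustrative, not from SIRE):

```python
# Stage 1: a standard Boolean OR query computes the result set.
# Stage 2: tf * idf weights rank the documents in that set.
import math

docs = {
    "d1": "cat cat dog",
    "d2": "cat cat fish fish",
    "d3": "dog fish fish",
}
query_terms = {"cat", "fish"}  # Boolean query: cat OR fish

# Stage 1: Boolean retrieval selects documents containing any query term
matches = [d for d, text in docs.items()
           if query_terms & set(text.split())]

# Stage 2: rank by summed term frequency * inverse document frequency
N = len(docs)
def idf(term):
    df = sum(term in text.split() for text in docs.values())
    return math.log(N / df)

def score(doc_id):
    words = docs[doc_id].split()
    return sum(words.count(t) * idf(t) for t in query_terms)

ranked = sorted(matches, key=score, reverse=True)
# d2 contains both query terms (twice each), so it ranks first
```

All three documents survive the Boolean stage, but the vector-style weighting then separates them, which is exactly what the pure Boolean model cannot do.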

37 Relevance feedback in SIRE
SIRE (Syracuse Information Retrieval Experiment)
Relevance feedback is particularly important with Boolean retrieval because it allows the results set to be expanded.
The results set is created by standard Boolean retrieval. The user selects one document from the results set. Other documents in the collection are ranked by vector distance from this document.

38 Boolean model as fuzzy sets
[Diagram: fuzzy set A with a graded boundary and element d.] d is more or less in A.

39 Basic concept A document has a term weight associated with each index term. The term weight measures the degree to which that term characterizes the document. Term weights are in the range [0, 1]. (In the standard Boolean model all weights are either 0 or 1.) For a given query, calculate the similarity between the query and each document in the collection. This calculation is needed for every document that has a non-zero weight for any of the terms in the query.

40 MMM: Mixed Min and Max model
Fuzzy set theory: d_A is the degree of membership of an element in set A.
intersection (and): d_A∩B = min(d_A, d_B)
union (or): d_A∪B = max(d_A, d_B)
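These two rules are trivial to state in code. A sketch (the function names and weight values are illustrative):

```python
# Fuzzy-set AND/OR over term-membership weights in [0, 1].
def fuzzy_and(*weights):
    return min(weights)   # intersection: the weakest membership wins

def fuzzy_or(*weights):
    return max(weights)   # union: the strongest membership wins

d_A, d_B = 0.6, 0.9
assert fuzzy_and(d_A, d_B) == 0.6
assert fuzzy_or(d_A, d_B) == 0.9
# With weights restricted to {0, 1}, min/max reduce to Boolean and/or:
assert fuzzy_and(1, 0) == 0 and fuzzy_or(1, 0) == 1
```

The last assert shows why this is a strict generalization of the Boolean model rather than a replacement for it.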

41 MMM: Mixed Min and Max model
Fuzzy set theory example: suppose d_A = 0.6 and d_B = 0.9 (illustrative values). Then:
and: d_A∩B = min(0.6, 0.9) = 0.6
or: d_A∪B = max(0.6, 0.9) = 0.9
In standard set theory, d_A and d_B are restricted to 0 or 1, and min and max reduce to Boolean and and or.

42 MMM: Mixed Min and Max model
Terms: A1, A2, ..., An. Document D, with index-term weights d_A1, d_A2, ..., d_An.
Q_or = (A1 or A2 or ... or An)
Query-document similarity:
S(Q_or, D) = C_or1 * max(d_A1, ..., d_An) + C_or2 * min(d_A1, ..., d_An)
where C_or1 + C_or2 = 1

43 MMM: Mixed Min and Max model
Terms: A1, A2, ..., An. Document D, with index-term weights d_A1, d_A2, ..., d_An.
Q_and = (A1 and A2 and ... and An)
Query-document similarity:
S(Q_and, D) = C_and1 * min(d_A1, ..., d_An) + C_and2 * max(d_A1, ..., d_An)
where C_and1 + C_and2 = 1
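A sketch of both MMM similarity formulas (the document weights and coefficient values here are illustrative only, not from the experiments):

```python
# MMM query-document similarity: softened min/max combinations.
def s_or(weights, c_or1):
    c_or2 = 1.0 - c_or1
    return c_or1 * max(weights) + c_or2 * min(weights)

def s_and(weights, c_and1):
    c_and2 = 1.0 - c_and1
    return c_and1 * min(weights) + c_and2 * max(weights)

d = [0.2, 0.7, 0.9]            # document weights for query terms A1..A3
print(s_and(d, c_and1=0.7))    # 0.7*0.2 + 0.3*0.9 ≈ 0.41
print(s_or(d, c_or1=0.7))      # 0.7*0.9 + 0.3*0.2 ≈ 0.69
```

Setting C_and1 = C_or1 = 1 recovers the pure fuzzy-set min/max rules; values below 1 let a partially matching document still score above zero on an AND query.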

44 MMM: Mixed Min and Max model
Experimental values: C_and1 in the range [0.5, 0.8]; C_or1 > 0.2.
Computational cost is low. Retrieval performance is much improved.

45 Other Models
Paice model: The MMM model considers only the maximum and minimum document weights; the Paice model takes into account all of the document weights. Computational cost is higher than MMM.
P-norm model: Document D, with term weights d_A1, d_A2, ..., d_An. Query terms are given weights a1, a2, ..., an. Operators have coefficients that indicate their degree of strictness. Query-document similarity is calculated by considering each document and query as a point in n-space.

46 Test data
[Table: percentage improvement over the standard Boolean model (average best precision) on the CISI, CACM and INSPEC test collections, for the P-norm, Paice and MMM models. Lee and Fox, 1988.]

47 Reading
E. Fox, S. Betrabet, M. Koushik and W. Lee, "Extended Boolean Models", Chapter 15 of Frakes. Methods based on fuzzy set concepts.