
1 CS 430: Information Discovery Lecture 12 Latent Semantic Indexing

2 Course Administration Thursday, October 16: Presidential Inauguration. Narayana N. R. Murthy, Chairman and Chief Mentor Officer, Infosys Technologies Limited. "Cornell - The Unfinished Agenda: The Musings of a Corporate Person." Biotech Building, Large Conference Room, 10:00 to 11:00 a.m.

3 Course Administration Assignment 1: you should receive answers to questions by email. Newsgroup: if you have a general question about the assignment, send it to the newsgroup. News server: newsstand.cit.cornell.edu. Newsgroup: cornell.class.cs430

4 Course Administration Midterm Examination: a sample examination and a discussion of its solution will be posted to the Web site. Assignment 2: it is not the job of the Teaching Assistants to do the assignments for you. They may answer your question with "This is a matter for you to judge," or "This was covered in the lectures." Use the report to explain your choices. We will post answers to general questions on the newsgroup.

5 Probabilistic Principle Given a query q and a document d_j, the model needs an estimate of the probability that the user finds d_j relevant, i.e., P(R | d_j).

  similarity(d_j, q) = P(R | d_j) / P(R̄ | d_j)
                     = P(d_j | R) P(R) / (P(d_j | R̄) P(R̄))     by Bayes Theorem
                     = P(d_j | R) / P(d_j | R̄) x k             where k is constant

P(d_j | R) is the probability of randomly selecting d_j from R, the set of relevant documents; R̄ is the set of non-relevant documents.

6 Binary Independence Retrieval Model (BIR) Let x = (x_1, x_2, ..., x_n) be the term incidence vector for d_j, where x_i = 1 if term i is in the document and 0 otherwise. We estimate P(d_j | R) by P(x | R). If the index terms are independent:

  P(x | R) = P(x_1 | R) P(x_2 | R) ... P(x_n | R) = ∏ P(x_i | R)

7 Binary Independence Retrieval Model (BIR)

  S = similarity(d_j, q) = k ∏ P(x_i | R) / P(x_i | R̄)

Since the x_i are either 0 or 1, this can be written:

  S = k  ∏ over x_i = 1 of  P(x_i = 1 | R) / P(x_i = 1 | R̄)  x  ∏ over x_i = 0 of  P(x_i = 0 | R) / P(x_i = 0 | R̄)

8 Binary Independence Retrieval Model (BIR) For terms that appear in the query let
  p_i = P(x_i = 1 | R)
  r_i = P(x_i = 1 | R̄)
For terms that do not appear in the query assume p_i = r_i. Then:

  S = k  ∏ over x_i = q_i = 1 of  p_i / r_i  x  ∏ over x_i = 0, q_i = 1 of  (1 - p_i) / (1 - r_i)

    = k  ∏ over x_i = q_i = 1 of  p_i (1 - r_i) / (r_i (1 - p_i))  x  ∏ over q_i = 1 of  (1 - p_i) / (1 - r_i)

The last product is constant for a given query.

9 Binary Independence Retrieval Model (BIR) Taking logs and ignoring factors that are constant for a given query, we have:

  similarity(d, q) = ∑ log [ p_i (1 - r_i) / ((1 - p_i) r_i) ]

where the summation is taken over those terms that appear in both the query and the document. This similarity measure can be used to rank all documents against the query q.
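A minimal sketch of this ranking function in NumPy. The binary incidence vectors and the names `score_bir`, `p`, `r` are illustrative choices, not from the lecture:

```python
import numpy as np

def score_bir(x, q, p, r):
    """Binary Independence Retrieval score for one document.

    x : 0/1 term-incidence vector of the document
    q : 0/1 term-incidence vector of the query
    p : p_i = P(x_i = 1 | R) for each term
    r : r_i = P(x_i = 1 | R-bar) for each term
    """
    # Sum log p_i(1 - r_i) / ((1 - p_i) r_i) over terms that appear
    # in both the query and the document.
    both = (x == 1) & (q == 1)
    return np.sum(np.log((p[both] * (1 - r[both])) /
                         ((1 - p[both]) * r[both])))
```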

10 Estimates of P(x_i | R) Initial guess, with no information to work from:
  p_i = P(x_i | R) = c
  r_i = P(x_i | R̄) = n_i / N
where:
  c is an arbitrary constant, e.g., 0.5
  n_i is the number of documents that contain term i
  N is the total number of documents in the collection

11 Improving the Estimates of P(x_i | R) Human feedback -- relevance feedback (discussed later). Automatically: (a) Run query q using the initial values. Consider the t top-ranked documents. Let s_i be the number of these documents that contain the term x_i. (b) The new estimates are:
  p_i = P(x_i | R) = s_i / t
  r_i = P(x_i | R̄) = (n_i - s_i) / (N - t)
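A sketch of this automatic estimation loop, reusing the `score_bir` function sketched earlier and assuming `X` is a documents-by-terms 0/1 incidence matrix (all names are illustrative):

```python
import numpy as np

def estimate_pr(X, q, t=10, c=0.5, iterations=2):
    """Iteratively estimate p_i and r_i from the t top-ranked documents.

    X : (N documents) x (n terms) 0/1 incidence matrix
    q : 0/1 query vector of length n
    """
    N, n = X.shape
    n_i = X.sum(axis=0)              # number of documents containing each term
    p, r = np.full(n, c), n_i / N    # initial guesses from the previous slide

    for _ in range(iterations):
        scores = np.array([score_bir(x, q, p, r) for x in X])
        top = np.argsort(-scores)[:t]      # t top-ranked documents
        s_i = X[top].sum(axis=0)           # how many of them contain term i
        # 0.5 smoothing keeps p_i and r_i away from 0 and 1; the slide's
        # plain s_i / t and (n_i - s_i) / (N - t) are the unsmoothed forms.
        p = (s_i + 0.5) / (t + 1)
        r = (n_i - s_i + 0.5) / (N - t + 1)
    return p, r
```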

12 Discussion of Probabilistic Model Advantages: built on a firm theoretical foundation. Disadvantages: initial values have to be guessed; weights ignore term frequency; assumes index terms are independent.

13 Latent Semantic Indexing Objective: replace indexes that use sets of index terms by indexes that use concepts. Approach: map the index term vector space into a lower-dimensional space, using singular value decomposition.

14 Deficiencies with Conventional Automatic Indexing Synonymy: various words and phrases refer to the same concept (lowers recall). Polysemy: individual words have more than one meaning (lowers precision). Independence: no significance is given to two terms that frequently appear together.

15 Example Query: "IDF in computer-based information look-up". Index terms for a document: access, document, retrieval, indexing. How can we recognize that information look-up is related to retrieval and indexing? Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?

16 Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

17 Technical Memo Example: Terms and Documents

Term        c1 c2 c3 c4 c5 m1 m2 m3 m4
human        1  0  0  1  0  0  0  0  0
interface    1  0  1  0  0  0  0  0  0
computer     1  1  0  0  0  0  0  0  0
user         0  1  1  0  1  0  0  0  0
system       0  1  1  2  0  0  0  0  0
response     0  1  0  0  1  0  0  0  0
time         0  1  0  0  1  0  0  0  0
EPS          0  0  1  1  0  0  0  0  0
survey       0  1  0  0  0  0  0  0  1
trees        0  0  0  0  0  1  1  1  0
graph        0  0  0  0  0  0  1  1  1
minors       0  0  0  0  0  0  0  1  1
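The same matrix entered as a NumPy array, so the SVD steps on the later slides can be tried directly (a sketch; the row and column order follows the table above):

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Term-document matrix X: X[i, j] = frequency of term i in document j.
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
], dtype=float)
```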

18 Technical Memo Example: Query Query: find documents relevant to "human computer interaction". Simple term matching: matches c1, c2, and c4; misses c3 and c5.

19 The index term vector space [Figure: axes t1, t2, t3 with document vectors d1 and d2.] The space has as many dimensions as there are terms in the word list.

20 Models of Semantic Similarity Proximity models: put similar items together in some space or structure. Clustering (hierarchical, partition, overlapping): documents are considered close to the extent that they contain the same terms; most methods then arrange the documents into a hierarchy based on distances between documents. [Covered later in course.] Factor analysis: based on a matrix of similarities between documents (single mode). Two-mode proximity methods: start with a rectangular matrix and construct explicit representations of both row and column objects.

21 Selection of Two-mode Factor Analysis Additional criterion: computationally efficient, O(N^2 k^3), where N is the number of terms plus documents and k is the number of dimensions.

22 Figure 1 [plot of terms, documents, and the query; --- indicates cosine > 0.9]

23 Mathematical Concepts Singular Value Decomposition Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents). There exist matrices T0, S0 and D0', such that:

  X = T0 S0 D0'

T0 and D0 are the matrices of left and right singular vectors; T0 and D0 have orthonormal columns; S0 is the diagonal matrix of singular values.

24 Dimensions of matrices

  X = T0 S0 D0'

X is t x d; T0 is t x m; S0 is m x m; D0' is m x d, where m is the rank of X, m ≤ min(t, d).
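A sketch of the decomposition with NumPy, applied to the example matrix X built earlier; `numpy.linalg.svd` returns the singular values as a vector, so S0 is formed with `np.diag`:

```python
import numpy as np

# Full SVD: X = T0 @ S0 @ D0.T
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s)          # singular values, positive and decreasing
D0 = D0t.T               # right singular vectors as columns

m = np.linalg.matrix_rank(X)
assert np.allclose(X, T0 @ S0 @ D0.T)   # the full SVD reconstructs X exactly
```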

25 Reduced Rank Diagonal elements of S0 are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives:

  X ≈ X̂ = T S D'

Interpretation: if the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.
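The rank reduction as code, continuing the sketch above (k = 2 is a convenient choice for this small example; the best k is an empirical question):

```python
k = 2                         # number of concepts (singular values) to keep
T = T0[:, :k]                 # t x k
S = S0[:k, :k]                # k x k
D = D0[:, :k]                 # d x k

X_hat = T @ S @ D.T           # rank-k approximation of X
```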

26 Selection of singular values

  X̂ = T S D'

X̂ is t x d; T is t x k; S is k x k; D' is k x d, where k is the number of singular values chosen to represent the concepts in the set of documents. Usually, k « m.

27 Comparing Two Terms

  X̂ X̂' = TSD'(TSD')' = TSD'DS'T' = TSS'T'   since D is orthonormal
        = TS(TS)'

To calculate the i, j cell, take the dot product between rows i and j of TS. Since S is diagonal, TS differs from T only by stretching the coordinate system.

The dot product of two rows of X̂ reflects the extent to which two terms have a similar pattern of occurrences.
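A sketch of the term-term comparison via rows of TS, using the T and S from the reduced-rank sketch above:

```python
TS = T @ S                    # each row is one term in concept space

def term_similarity(i, j):
    # Dot product of rows i and j of TS = (i, j) cell of X_hat @ X_hat.T
    return TS[i] @ TS[j]

# e.g. term_similarity(terms.index("human"), terms.index("user"))
```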

28 Comparing Two Documents

  X̂' X̂ = (TSD')'TSD' = DS'T'TSD' = DS(DS)'   since T is orthonormal

To calculate the i, j cell, take the dot product between rows i and j of DS. Since S is diagonal, DS differs from D only by stretching the coordinate system.

The dot product of two columns of X̂ reflects the extent to which two documents have a similar pattern of occurrences.
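The corresponding document-document comparison via rows of DS, continuing the same sketch:

```python
DS = D @ S                    # each row is one document in concept space

def doc_similarity(i, j):
    # Dot product of rows i and j of DS = (i, j) cell of X_hat.T @ X_hat
    return DS[i] @ DS[j]

# e.g. doc_similarity(docs.index("c1"), docs.index("c3"))
```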

29 Comparing a Term and a Document Comparison between a term and a document is the value of an individual cell of X̂.

  X̂ = T S D' = (T S^(1/2)) (D S^(1/2))'

where S^(1/2) is a diagonal matrix whose values are the square roots of the corresponding elements of S. The i, j cell of X̂ is therefore the dot product of row i of T S^(1/2) with row j of D S^(1/2).
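And the term-document comparison, splitting S between the two factors as described above (same illustrative variables):

```python
S_half = np.sqrt(S)           # diagonal matrix of square roots of S

def term_doc_similarity(i, j):
    # (i, j) cell of X_hat: row i of T*S^(1/2) dotted with row j of D*S^(1/2)
    return (T @ S_half)[i] @ (D @ S_half)[j]
```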

30 Technical Memo Example: Query

Query: "human system interactions on trees"

Term        x_q
human        1
interface    0
computer     0
user         0
system       1
response     0
time         0
EPS          0
survey       0
trees        1
graph        0
minors       0

In term-document space, a query is represented by x_q, a t x 1 vector. In concept space, a query is represented by d_q, a 1 x k vector.

31 Query Suggested form of d_q is:

  d_q = x_q' T S^-1

Note that d_q is a row vector.

Example of use: to compare the query against document i, take the i-th element of the product of DS and (d_q S)', which is the i-th element of the product of DS and (x_q' T)'.
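A sketch of folding the query into concept space and ranking the documents, using the x_q from the previous slide and the T, S, DS, and docs variables from the earlier sketches:

```python
# x_q: 1 for human, system, trees; 0 for the other nine terms.
x_q = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0], dtype=float)

d_q = x_q @ T @ np.linalg.inv(S)   # query as a 1 x k concept-space vector

# Compare the query with every document: the i-th element of DS (x_q' T)'.
scores = DS @ (x_q @ T)
ranking = np.argsort(-scores)      # best-matching documents first
print([docs[i] for i in ranking])
```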

32 Experimental Results Deerwester et al. tried latent semantic indexing on two test collections, MED and CISI, for which queries and relevance judgments were available. Documents were the full text of title and abstract. A stop list of 439 words (SMART) was used; no stemming, etc. Comparison with: (a) simple term matching, (b) SMART, (c) the Voorhees method.

33 Experimental Results: 100 Factors

34 Experimental Results: Number of Factors

