# Information Retrieval CSE 8337 (Part III), Spring 2011

Some material for these slides was obtained from:

- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
- Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book
- Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, http://informationretrieval.org

CSE 8337 Outline

- Introduction
- Text Processing
- Indexes
- Boolean Queries
- Web Searching/Crawling
- Vector Space Model
- Matching
- Evaluation
- Feedback/Expansion

Modeling TOC (Vector Space and Other Models)

- Introduction
- Classic IR Models
  - Boolean Model
  - Vector Model
  - Probabilistic Model
- Extended Boolean Model
- Vector Space Scoring
- Vector Model and Web Search

IR Models

- User task: Retrieval (Adhoc, Filtering) or Browsing
- Classic Models: Boolean, vector, probabilistic
  - Set Theoretic: Fuzzy, Extended Boolean
  - Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  - Probabilistic: Inference Network, Belief Network
- Structured Models: Non-Overlapping Lists, Proximal Nodes
- Browsing: Flat, Structure Guided, Hypertext

The Boolean Model

- Simple model based on set theory.
- Queries are specified as Boolean expressions, with precise semantics and a neat formalism.
- Terms are either present or absent, so $w_{ij} \in \{0,1\}$.
- Consider $q = k_a \wedge (k_b \vee \neg k_c)$. Over $(k_a, k_b, k_c)$ its disjunctive normal form is $q_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$.
- $q_{cc} = (1,1,0)$ is a conjunctive component.

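The Boolean Model

For $q = k_a \wedge (k_b \vee \neg k_c)$:

$$
sim(q,d_j) = \begin{cases} 1 & \text{if } \exists\, q_{cc} \mid (q_{cc} \in q_{dnf}) \wedge (\forall k_i,\ g_i(d_j) = g_i(q_{cc})) \\ 0 & \text{otherwise} \end{cases}
$$

[Figure: Venn diagram of $K_a$, $K_b$, $K_c$ with the regions (1,1,1), (1,1,0) and (1,0,0) marked.]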
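
The matching rule can be sketched in a few lines of Python. This is only an illustration: the names `TERMS`, `query`, `Q_DNF` and `sim` are hypothetical, and the document is assumed to be given as a set of the index terms it contains, so that $g_i(d_j)$ is 1 when term $k_i$ is in the set.

```python
from itertools import product

TERMS = ("ka", "kb", "kc")  # hypothetical names for the three index terms

def query(ka, kb, kc):
    """The example query q = ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)

# q_dnf: every binary weight vector over (ka, kb, kc) that satisfies q.
Q_DNF = {bits for bits in product((1, 0), repeat=3)
         if query(*(bool(b) for b in bits))}

def sim(doc_terms):
    """Return 1 if the document's binary vector equals some conjunctive
    component of q_dnf, and 0 otherwise."""
    g = tuple(int(t in doc_terms) for t in TERMS)
    return int(g in Q_DNF)

print(sorted(Q_DNF, reverse=True))
print(sim({"ka", "kb"}), sim({"ka", "kc"}))
```

A document containing only `ka` and `kc` is rejected outright, which illustrates the all-or-nothing behaviour of the model.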

Drawbacks of the Boolean Model

- Retrieval is based on binary decision criteria, with no notion of partial matching.
- No ranking of the documents is provided.
- The information need has to be translated into a Boolean expression.
- The Boolean queries formulated by users are most often too simplistic.
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.

The Vector Model

- Use of binary weights is too limiting.
- Non-binary weights provide consideration for partial matches.
- These term weights are used to compute a degree of similarity between a query and each document.
- A ranked set of documents provides for better matching.

The Vector Model

- $w_{ij} > 0$ whenever $k_i$ appears in $d_j$.
- $w_{iq} \ge 0$ is associated with the pair $(k_i, q)$.
- $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ and $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$.
- To each term $k_i$ is associated a unit vector $\vec{i}$.
- The unit vectors $\vec{i}$ and $\vec{j}$ are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents).
- The $t$ unit vectors $\vec{i}$ form an orthonormal basis for a $t$-dimensional space in which queries and documents are represented as weighted vectors.

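The Vector Model

$$
sim(q,d_j) = \cos(\theta) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{|\vec{d}_j|\,|\vec{q}|}
$$

- Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, we have $0 \le sim(q,d_j) \le 1$.
- A document is retrieved even if it matches the query terms only partially.

[Figure: vectors $\vec{d}_j$ and $\vec{q}$ in the plane spanned by $\vec{i}$ and $\vec{j}$, separated by the angle $\theta$.]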
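
As a minimal sketch of the similarity formula, the function below computes the cosine of the angle between a document vector and a query vector. The function name `cosine` and the convention of returning 0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine(d, q):
    """Cosine of the angle between weight vectors d and q (same length)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_d * norm_q)

# A document sharing one of two query terms still gets a nonzero score.
print(cosine([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]))
```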

Weights $w_{ij}$ and $w_{iq}$?

One approach is to examine the frequency of occurrence of a word in a document:

- Absolute frequency: the tf factor, the term frequency within a document. $freq_{i,j}$ is the raw frequency of $k_i$ within $d_j$. Both high-frequency and low-frequency terms may not actually be significant.
- Relative frequency: tf divided by the number of words in the document.
- Normalized frequency: $f_{i,j} = freq_{i,j} / \max_l freq_{l,j}$.

Inverse Document Frequency

- The importance of a term may depend more on how well it can distinguish between documents.
- Quantification of inter-document separation.
- Dissimilarity, not similarity.
- The idf factor is the inverse document frequency.

IDF

- Let $N$ be the total number of docs in the collection.
- Let $n_i$ be the number of docs which contain $k_i$.
- The idf factor is computed as $idf_i = \log(N / n_i)$.
- The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.

IDF example (base-10 logs): $N = 1000$, $n_1 = 100$, $n_2 = 500$, $n_3 = 800$.

- $idf_1 = 3 - 2 = 1$
- $idf_2 = 3 - 2.7 = 0.3$
- $idf_3 = 3 - 2.9 = 0.1$
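
The worked example can be checked with a short sketch. The base-10 logarithm is an assumption inferred from the numbers on the slide (log 1000 = 3), and the helper name `idf` is hypothetical.

```python
import math

def idf(n_docs, doc_freq):
    """idf_i = log10(N / n_i)."""
    return math.log10(n_docs / doc_freq)

N = 1000
values = [round(idf(N, n_i), 1) for n_i in (100, 500, 800)]
print(values)
```

Rounded to one decimal place the three values agree with the slide: the rarer the term, the larger its idf.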

The Vector Model

- The best term-weighting schemes take both tf and idf into account: $w_{ij} = f_{i,j} \cdot \log(N / n_i)$.
- This strategy is called a tf-idf weighting scheme.

The Vector Model

- For the query term weights, a suggestion is $w_{iq} = \left(0.5 + \frac{0.5 \cdot freq_{i,q}}{\max_l freq_{l,q}}\right) \cdot \log(N / n_i)$.
- The vector model with tf-idf weights is a good ranking strategy with general collections.
- The vector model is usually as good as any known ranking alternative. It is also simple and fast to compute.
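
The two weighting formulas, $w_{ij} = f_{i,j} \log(N/n_i)$ for documents and the 0.5-smoothed variant for queries, can be sketched as follows. The function names, the token-list input, the `df` dictionary of document frequencies and the base-10 logarithm are all assumptions chosen for the illustration.

```python
import math
from collections import Counter

def doc_weights(tokens, df, n_docs):
    """w_ij = f_ij * log(N / n_i), with f_ij = freq_ij / max_l freq_lj."""
    freq = Counter(tokens)
    max_freq = max(freq.values())
    return {t: (c / max_freq) * math.log10(n_docs / df[t])
            for t, c in freq.items()}

def query_weights(tokens, df, n_docs):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    freq = Counter(tokens)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * c / max_freq) * math.log10(n_docs / df[t])
            for t, c in freq.items()}

df = {"car": 100, "insurance": 10}  # hypothetical document frequencies
print(doc_weights(["car", "car", "insurance"], df, 1000))
print(query_weights(["car", "insurance", "insurance"], df, 1000))
```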

The Vector Model

Advantages:

- Term weighting improves the quality of the answer set.
- Partial matching allows retrieval of docs that approximate the query conditions.
- The cosine ranking formula sorts documents according to their degree of similarity to the query.

Disadvantages:

- Assumes independence of index terms (??); it is not clear that this is bad, though.

The Vector Model: Example I

[Figure: documents d1 to d7 and index terms k1, k2, k3.]

The Vector Model: Example II

[Figure: documents d1 to d7 and index terms k1, k2, k3.]

The Vector Model: Example III

[Figure: documents d1 to d7 and index terms k1, k2, k3.]

Probabilistic Model

- Objective: to capture the IR problem using a probabilistic framework.
- Given a user query, there is an ideal answer set.
- Querying is the specification of the properties of this ideal answer set (clustering).
- But what are these properties?
- Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set).
- Improve by iteration.

Probabilistic Model

- An initial set of documents is retrieved somehow.
- The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected).
- The IR system uses this information to refine the description of the ideal answer set.
- By repeating this process, it is expected that the description of the ideal answer set will improve.
- Always keep in mind the need to guess the description of the ideal answer set at the very beginning.
- The description of the ideal answer set is modeled in probabilistic terms.

Probabilistic Ranking Principle

- Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant).
- The ideal answer set is referred to as $R$ and should maximize the probability of relevance. Documents in the set $R$ are predicted to be relevant.
- But how do we compute the probabilities? What is the sample space?

The Ranking

- Probabilistic ranking is computed as $sim(q,d_j) = \dfrac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$.
- This is the odds of the document $d_j$ being relevant.
- Taking the odds minimizes the probability of an erroneous judgement.
- Definitions:
  - $w_{ij} \in \{0,1\}$
  - $P(R \mid d_j)$: probability that the given doc is relevant
  - $P(\neg R \mid d_j)$: probability that the doc is not relevant

The Ranking

$$
sim(d_j,q) = \frac{P(R \mid d_j)}{P(\neg R \mid d_j)} = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \neg R)\, P(\neg R)} \sim \frac{P(d_j \mid R)}{P(d_j \mid \neg R)}
$$

$P(d_j \mid R)$: probability of randomly selecting the document $d_j$ from the set $R$ of relevant documents.

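The Ranking

$$
sim(d_j,q) \sim \frac{P(d_j \mid R)}{P(d_j \mid \neg R)} \sim \frac{\left[\prod_{g_i(d_j)=1} P(k_i \mid R)\right] \left[\prod_{g_i(d_j)=0} P(\neg k_i \mid R)\right]}{\left[\prod_{g_i(d_j)=1} P(k_i \mid \neg R)\right] \left[\prod_{g_i(d_j)=0} P(\neg k_i \mid \neg R)\right]}
$$

$P(k_i \mid R)$: probability that the index term $k_i$ is present in a document randomly selected from the set $R$ of relevant documents.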

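The Ranking

$$
sim(d_j,q) \sim \log \frac{\left[\prod_{g_i(d_j)=1} P(k_i \mid R)\right] \left[\prod_{g_i(d_j)=0} P(\neg k_i \mid R)\right]}{\left[\prod_{g_i(d_j)=1} P(k_i \mid \neg R)\right] \left[\prod_{g_i(d_j)=0} P(\neg k_i \mid \neg R)\right]}
$$

$$
\sim \sum_{g_i(d_j)=1} \left[ \log \frac{P(k_i \mid R)}{P(\neg k_i \mid R)} + \log \frac{P(\neg k_i \mid \neg R)}{P(k_i \mid \neg R)} \right]
$$

Factors that are the same for every document have been dropped, since they do not affect the ranking. Here $P(\neg k_i \mid R) = 1 - P(k_i \mid R)$ and $P(\neg k_i \mid \neg R) = 1 - P(k_i \mid \neg R)$.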

The Initial Ranking

$$
sim(d_j,q) \sim \sum_i w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{P(\neg k_i \mid R)} + \log \frac{P(\neg k_i \mid \neg R)}{P(k_i \mid \neg R)} \right)
$$

- How do we obtain the probabilities $P(k_i \mid R)$ and $P(k_i \mid \neg R)$?
- Estimates based on assumptions: $P(k_i \mid R) = 0.5$ and $P(k_i \mid \neg R) = n_i / N$.
- Use this initial guess to retrieve an initial ranking.
- Improve upon this initial ranking.
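
A minimal sketch of the initial ranking, assuming the usual binary independence form of the term weight, $\log\frac{p}{1-p} + \log\frac{1-q}{q}$ with $p = P(k_i \mid R)$ and $q = P(k_i \mid \neg R)$. With the initial guesses $p = 0.5$ and $q = n_i/N$ the weight reduces to $\log\frac{N - n_i}{n_i}$, which behaves like an idf. The function names and the set-based inputs are hypothetical.

```python
import math

def term_weight(p_rel, p_nonrel):
    """log[p / (1 - p)] + log[(1 - q) / q] for p = P(k_i|R), q = P(k_i|not R)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def initial_score(query_terms, doc_terms, df, n_docs):
    """Sum the weights of terms present in both query and document
    (binary w_iq and w_ij), using the initial estimates 0.5 and n_i / N."""
    return sum(term_weight(0.5, df[t] / n_docs)
               for t in query_terms & doc_terms)

df = {"a": 10, "b": 500}  # hypothetical document frequencies, N = 1000
print(initial_score({"a", "b"}, {"a", "b"}, df, 1000))
print(initial_score({"a", "b"}, {"b"}, df, 1000))
```

The rare term `a` dominates the score, while `b`, which occurs in half the collection, contributes nothing under these initial estimates.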

Improving the Initial Ranking

- Let $V$ be the set of docs initially retrieved, and $V_i$ the subset of retrieved docs that contain $k_i$ (the same symbols also denote the sizes of these sets).
- Reevaluate the estimates: $P(k_i \mid R) = \dfrac{V_i}{V}$ and $P(k_i \mid \neg R) = \dfrac{n_i - V_i}{N - V}$.
- Repeat recursively.

Improving the Initial Ranking

To avoid problems with $V = 1$ and $V_i = 0$:

$$
P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}, \qquad P(k_i \mid \neg R) = \frac{n_i - V_i + 0.5}{N - V + 1}
$$

Also:

$$
P(k_i \mid R) = \frac{V_i + n_i/N}{V + 1}, \qquad P(k_i \mid \neg R) = \frac{n_i - V_i + n_i/N}{N - V + 1}
$$
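
The smoothed re-estimation step can be sketched as a small helper. The function name `reestimate` and its argument order are assumptions; `V` and `V_i` are passed as counts. The returned pair would replace the initial guesses in the next ranking round.

```python
def reestimate(V, V_i, n_i, N):
    """Return smoothed estimates (P(k_i|R), P(k_i|not R)).

    V   : number of top-ranked docs assumed relevant
    V_i : how many of those contain term k_i
    n_i : number of docs in the collection containing k_i
    N   : collection size
    """
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel

# Even when no retrieved doc contains the term, both estimates stay in (0, 1).
print(reestimate(V=10, V_i=0, n_i=100, N=1000))
```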

Pluses and Minuses

Advantages:

- Docs are ranked in decreasing order of probability of relevance.

Disadvantages:

- Need to guess initial estimates for $P(k_i \mid R)$.
- The method does not take tf and idf factors into account.

Brief Comparison of Classic Models

- The Boolean model does not provide for partial matches and is considered to be the weakest classic model.
- Salton and Buckley did a series of experiments indicating that, in general, the vector model outperforms the probabilistic model with general collections.
- This also seems to be the view of the research community.

Extended Boolean Model

- The Boolean model is simple and elegant, but it makes no provision for a ranking.
- As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership.
- Extend the Boolean model with the notions of partial matching and term weighting.
- Combine characteristics of the vector model with properties of Boolean algebra.

The Idea

- The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra.
- Let $q = k_x \wedge k_y$.
- The weight associated with $[k_x, d_j]$ is $w_{xj} = f_{xj} \cdot \dfrac{idf_x}{\max_i idf_i}$.
- Further, write $w_{xj} = x$ and $w_{yj} = y$.

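The Idea: AND

For $q_{and} = k_x \wedge k_y$, with $w_{xj} = x$ and $w_{yj} = y$:

$$
sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}
$$

[Figure: document $d_j$ plotted at $(x, y)$ in the unit square with axes $k_x$ and $k_y$; for AND, similarity decreases with distance from the corner (1,1).]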

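The Idea: OR

For $q_{or} = k_x \vee k_y$, with $w_{xj} = x$ and $w_{yj} = y$:

$$
sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}}
$$

[Figure: document $d_j$ plotted at $(x, y)$ in the unit square with axes $k_x$ and $k_y$; for OR, similarity increases with distance from the origin (0,0).]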

Generalizing the Idea

- We can extend the previous model to consider Euclidean distances in a t-dimensional space.
- This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter.

Generalizing the Idea

- A generalized disjunctive query is given by $q_{or} = k_1 \vee^p k_2 \vee^p \ldots \vee^p k_m$.
- A generalized conjunctive query is given by $q_{and} = k_1 \wedge^p k_2 \wedge^p \ldots \wedge^p k_m$.

$$
sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \ldots + x_m^p}{m} \right)^{1/p}
$$

$$
sim(q_{and}, d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \ldots + (1-x_m)^p}{m} \right)^{1/p}
$$
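
The p-norm similarities can be sketched directly from their definitions. The names `sim_or` and `sim_and` are hypothetical, and treating `p = math.inf` as the limiting case (max for OR, min for AND) is an explicit choice in this illustration.

```python
import math

def sim_or(x, p):
    """p-norm OR similarity for term weights x = [x_1, ..., x_m]."""
    if math.isinf(p):
        return max(x)  # limiting case p -> infinity
    return (sum(w ** p for w in x) / len(x)) ** (1 / p)

def sim_and(x, p):
    """p-norm AND similarity for term weights x = [x_1, ..., x_m]."""
    if math.isinf(p):
        return min(x)  # limiting case p -> infinity
    return 1 - (sum((1 - w) ** p for w in x) / len(x)) ** (1 / p)

x = [0.2, 0.8]
for p in (1, 2, math.inf):
    print(p, sim_or(x, p), sim_and(x, p))
```

For `p = 1` both functions return the plain average of the weights, and as `p` grows the OR score moves toward the largest weight while the AND score moves toward the smallest.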

Properties

- If $p = 1$ (vector-like): $sim(q_{or}, d_j) = sim(q_{and}, d_j) = \dfrac{x_1 + \ldots + x_m}{m}$.
- If $p = \infty$ (fuzzy-like): $sim(q_{or}, d_j) = \max_i(x_i)$ and $sim(q_{and}, d_j) = \min_i(x_i)$.
- By varying $p$, we can make the model behave as a vector model, as a fuzzy model, or as an intermediate model.

Properties

- This is quite powerful and is a good argument in favor of the extended Boolean model.
- Consider $q = (k_1 \wedge^p k_2) \vee^p k_3$: $k_1$ and $k_2$ are to be used as in a vector retrieval while the presence of $k_3$ is required.

$$
sim(q, d_j) = \left( \frac{ \left( 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p}{2} \right)^{1/p} \right)^p + x_3^p }{2} \right)^{1/p}
$$

Conclusions

- The model is quite powerful.
- Its properties are interesting and might be useful.
- Computation is somewhat complex.
- However, distributivity does not hold for the ranking computation: with $q_1 = (k_1 \vee k_2) \wedge k_3$ and $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$, in general $sim(q_1, d_j) \ne sim(q_2, d_j)$.
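
The failure of distributivity is easy to observe numerically. The sketch below assumes $p = 2$ and hypothetical term weights $x_1 = 1$, $x_2 = 0$, $x_3 = 1$; the helper names `p_or` and `p_and` are illustrative.

```python
def p_or(xs, p=2.0):
    """p-norm OR over a list of weights."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def p_and(xs, p=2.0):
    """p-norm AND over a list of weights."""
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

x1, x2, x3 = 1.0, 0.0, 1.0  # hypothetical weights of k1, k2, k3 in d_j

s1 = p_and([p_or([x1, x2]), x3])                 # q1 = (k1 OR k2) AND k3
s2 = p_or([p_and([x1, x3]), p_and([x2, x3])])    # q2 = (k1 AND k3) OR (k2 AND k3)
print(s1, s2)
```

The two queries are equivalent in ordinary Boolean algebra, yet the two scores differ, so query rewriting can change the ranking.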

Vector Space Scoring

- First cut: distance between two points (= distance between the end points of the two vectors).
- Euclidean distance?
- Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

Why distance is a bad idea

The Euclidean distance between $q$ and $d_2$ is large even though the distribution of terms in the query $q$ and the distribution of terms in the document $d_2$ are very similar.

Use angle instead of distance

- Thought experiment: take a document d and append it to itself. Call this document d′.
- “Semantically” d and d′ have the same content.
- The Euclidean distance between the two documents can be quite large.
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to their angle with the query.

From angles to cosines

- The following two notions are equivalent:
  - Rank documents in increasing order of the angle between query and document (smallest angle first).
  - Rank documents in decreasing order of cosine(query, document) (largest cosine first).
- Cosine is a monotonically decreasing function on the interval [0°, 180°].

Length normalization

- A vector can be (length-) normalized by dividing each of its components by its length. For this we use the $L_2$ norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$.
- Dividing a vector by its $L_2$ norm makes it a unit (length) vector.
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length normalization.
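
A small sketch of the thought experiment, under the assumption that the vector components are raw term counts, so that appending d to itself exactly doubles every component. The name `l2_normalize` and the sample counts are hypothetical.

```python
import math

def l2_normalize(v):
    """Divide every component by the vector's L2 norm."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d = [3.0, 4.0, 0.0]              # hypothetical raw term counts for d
d_prime = [2 * x for x in d]     # d appended to itself: every count doubles

print(l2_normalize(d))
print(l2_normalize(d_prime))
```

Both calls print the same unit vector, so after normalization the longer document no longer looks different from the shorter one.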

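cosine(query,document)

$$
\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\, \sqrt{\sum_i d_i^2}}
$$

- The numerator is the dot product; the middle form is the dot product of the two unit vectors.
- $q_i$ is the tf-idf weight of term $i$ in the query.
- $d_i$ is the tf-idf weight of term $i$ in the document.
- $\cos(\vec{q}, \vec{d})$ is the cosine similarity of $\vec{q}$ and $\vec{d}$ or, equivalently, the cosine of the angle between $\vec{q}$ and $\vec{d}$.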

Cosine similarity amongst 3 documents

How similar are the novels SaS: Sense and Sensibility, PaP: Pride and Prejudice, and WH: Wuthering Heights?

Term frequencies (counts):

| term | SaS | PaP | WH |
|---|---|---|---|
| affection | 115 | 58 | 20 |
| jealous | 10 | 7 | 11 |
| gossip | 2 | 0 | 6 |
| wuthering | 0 | 0 | 38 |

3 documents example contd.

Log frequency weighting:

| term | SaS | PaP | WH |
|---|---|---|---|
| affection | 3.06 | 2.76 | 2.30 |
| jealous | 2.00 | 1.85 | 2.04 |
| gossip | 1.30 | 0 | 1.78 |
| wuthering | 0 | 0 | 2.58 |

After normalization:

| term | SaS | PaP | WH |
|---|---|---|---|
| affection | 0.789 | 0.832 | 0.524 |
| jealous | 0.515 | 0.555 | 0.465 |
| gossip | 0.335 | 0 | 0.405 |
| wuthering | 0 | 0 | 0.588 |

- cos(SaS,PaP) ≈ 0.789 ∗ 0.832 + 0.515 ∗ 0.555 + 0.335 ∗ 0.0 + 0.0 ∗ 0.0 ≈ 0.94
- cos(SaS,WH) ≈ 0.79
- cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
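
The three cosines can be reproduced from the raw counts. The sketch assumes the log-frequency weight $1 + \log_{10}(tf)$ for $tf > 0$ (and 0 otherwise), which matches the weighted table, with no idf applied; the variable and function names are hypothetical.

```python
import math

counts = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58, "jealous": 7, "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_tf(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalized(doc):
    """Log-weight the counts, then divide by the L2 norm."""
    w = {t: log_tf(tf) for t, tf in doc.items()}
    length = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / length for t, x in w.items()}

vecs = {name: normalized(doc) for name, doc in counts.items()}

def cos(a, b):
    """Cosine of two novels: dot product of their unit vectors."""
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

for pair in (("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")):
    print(pair, round(cos(*pair), 2))
```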

tf-idf weighting has many variants

- Columns headed ‘n’ are acronyms for weight schemes.
- Why is the base of the log in idf immaterial?

Weighting may differ in Queries vs Documents

- Many search engines allow for different weightings for queries vs documents.
- To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table.
- Example: ltn.lnc means:
  - Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization.
  - Document: logarithmic tf, no idf, and cosine normalization.
- Is this a bad idea?

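tf-idf example: ltn.lnc

Document: car insurance auto insurance

Query: best car insurance

| Term | Query tf-raw | Query tf-wt | df | idf | Query wt | Doc tf-raw | Doc tf-wt | Doc wt | Doc n’lized | Prod |
|---|---|---|---|---|---|---|---|---|---|---|
| auto | 0 | 0 | 5000 | 2.3 | 0 | 1 | 1 | 1 | 0.52 | 0 |
| best | 1 | 1 | 50000 | 1.3 | 1.3 | 0 | 0 | 0 | 0 | 0 |
| car | 1 | 1 | 10000 | 2.0 | 2.0 | 1 | 1 | 1 | 0.52 | 1.04 |
| insurance | 1 | 1 | 1000 | 3.0 | 3.0 | 2 | 1.3 | 1.3 | 0.68 | 2.04 |

- Exercise: what is N, the number of docs?
- Doc length = $\sqrt{1^2 + 0^2 + 1^2 + 1.3^2} \approx 1.92$
- Score = 0 + 0 + 1.04 + 2.04 = 3.08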
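
The example can be recomputed in a few lines. The sketch assumes base-10 logs and a collection size of N = 1,000,000, which is the value consistent with the df and idf pairs on the slide; all names are hypothetical. Without intermediate rounding the score comes out near 3.07 rather than the slide's 3.08, which is built from rounded per-term products.

```python
import math

N = 1_000_000  # assumption: collection size implied by the slide's df/idf pairs
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query = {"best": 1, "car": 1, "insurance": 1}   # raw query term counts
doc = {"auto": 1, "car": 1, "insurance": 2}     # raw document term counts

def log_tf(tf):
    """Logarithmic tf: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# Query side (ltn): logarithmic tf times idf, no normalization.
q_wt = {t: log_tf(query.get(t, 0)) * math.log10(N / df[t]) for t in df}

# Document side (lnc): logarithmic tf, no idf, cosine normalization.
d_wt = {t: log_tf(doc.get(t, 0)) for t in df}
doc_len = math.sqrt(sum(w * w for w in d_wt.values()))

score = sum(q_wt[t] * d_wt[t] / doc_len for t in df)
print(round(doc_len, 2), round(score, 2))
```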

Summary: vector space ranking

- Represent the query as a weighted tf-idf vector.
- Represent each document as a weighted tf-idf vector.
- Compute the cosine similarity score for the query vector and each document vector.
- Rank documents with respect to the query by score.
- Return the top K (e.g., K = 10) to the user.

Vector Model and Web Search

- Speeding up vector space ranking.
- Putting together a complete search system.
- Will require learning about a number of miscellaneous topics and heuristics.

Efficient cosine ranking

- Find the K docs in the collection “nearest” to the query, i.e., the K largest query-doc cosines.
- Efficient ranking:
  - Computing a single cosine efficiently.
  - Choosing the K largest cosine values efficiently.
- Can we do this without computing all N cosines?

Efficient cosine ranking

- What we’re doing in effect: solving the K-nearest neighbor problem for a query vector.
- In general, we do not know how to do this efficiently for high-dimensional spaces.
- But it is solvable for short queries, and standard indexes support this well.

Special case: unweighted queries

- No weighting on query terms.
- Assume each query term occurs only once.
- Then for ranking, we don’t need to normalize the query vector.

Faster cosine: unweighted query
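
A rough sketch of term-at-a-time scoring for an unweighted query. Because every query term has weight 1 and the query length is the same for all documents, each posting simply adds its document weight to an accumulator, and only the document length is divided out. The data layout (`postings` as term to list of `(doc_id, wf)` pairs, `length` as doc_id to vector length) and the function name are assumptions for this illustration.

```python
import heapq
from collections import defaultdict

def fast_cosine_score(query_terms, postings, length, k):
    """Return the k best (doc_id, score) pairs for an unweighted query.

    postings: term -> list of (doc_id, wf_td)
    length:   doc_id -> precomputed document vector length
    """
    scores = defaultdict(float)
    for t in query_terms:
        for doc_id, wf in postings.get(t, []):
            scores[doc_id] += wf           # query weight is 1: just add wf_td
    for doc_id in scores:
        scores[doc_id] /= length[doc_id]   # normalize by document length only
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

postings = {"car": [(1, 1.0), (2, 2.0)], "insurance": [(2, 1.0), (3, 3.0)]}
length = {1: 1.0, 2: 2.0, 3: 6.0}
print(fast_cosine_score(["car", "insurance"], postings, length, k=2))
```

The resulting scores are not true cosines, since the query norm is skipped, but they rank documents in the same order.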

Computing the K largest cosines: selection vs. sorting

- Typically we want to retrieve the top K docs (in the cosine ranking for the query), not to totally order all docs in the collection.
- Can we pick off the docs with the K highest cosines?
- Let J = number of docs with nonzero cosines.
- We seek the K best of these J.

Use heap for selecting top K

- Binary tree in which each node’s value > the values of its children.
- Takes 2J operations to construct, then each of the K “winners” is read off in 2 log J steps.
- For J = 1M, K = 100, this is about 10% of the cost of sorting.

[Figure: example heap with root 1, children 0.9 and 0.3, and leaves 0.8, 0.3 and 0.1.]
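
Heap-based selection can be sketched with Python's standard `heapq` module. `heapq` implements a min-heap, so the scores are negated to obtain max-heap behaviour; the function name `top_k` and the sample scores (taken from the heap values shown on the slide) are illustrative.

```python
import heapq

def top_k(scores, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    # heapq is a min-heap, so store negated scores to pop the largest first.
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)                      # linear-time construction
    winners = []
    for _ in range(min(k, len(heap))):       # each pop costs O(log J)
        neg_score, doc_id = heapq.heappop(heap)
        winners.append((doc_id, -neg_score))
    return winners

scores = {"d1": 1.0, "d2": 0.9, "d3": 0.3, "d4": 0.8, "d5": 0.3, "d6": 0.1}
print(top_k(scores, 3))
```

In practice `heapq.nlargest(k, ...)` does the same job in one call; the explicit version is shown only to mirror the build-then-pop description.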

Bottlenecks

- Primary computational bottleneck in scoring: cosine computation.
- Can we avoid all this computation?
- Yes, but we may sometimes get it wrong: a doc not in the top K may creep into the list of K output docs.
- Is this such a bad thing?

Cosine similarity is only a proxy

- The user has a task and a query formulation.
- Cosine matches docs to the query.
- Thus cosine is anyway a proxy for user happiness.
- If we get a list of K docs “close” to the top K by cosine measure, that should be ok.

Generic approach

- Find a set A of contenders, with K < |A| << N.
- A does not necessarily contain the top K, but has many docs from among the top K.
- Return the top K docs in A.
- Think of A as pruning non-contenders.
- The same approach is also used for other (non-cosine) scoring functions.
- We will look at several schemes following this approach.

Index elimination

- The basic algorithm of Fig 7.1 only considers docs containing at least one query term.
- Take this further:
  - Only consider high-idf query terms.
  - Only consider docs containing many query terms.

High-idf query terms only

- For a query such as “catcher in the rye”, only accumulate scores from “catcher” and “rye”.
- Intuition: “in” and “the” contribute little to the scores and don’t alter the rank ordering much.
- Benefit: postings of low-idf terms have many docs, so these (many) docs get eliminated from A.

Docs containing many query terms

- Any doc with at least one query term is a candidate for the top K output list.
- For multi-term queries, only compute scores for docs containing several of the query terms, say at least 3 out of 4.
- This imposes a “soft conjunction” on queries, as seen on web search engines (early Google).
- Easy to implement in postings traversal.

3 of 4 query terms

| Term | Postings |
|---|---|
| Antony | 3, 4, 8, 16, 32, 64, 128 |
| Brutus | 2, 4, 8, 16, 32, 64, 128 |
| Caesar | 1, 2, 3, 5, 8, 13, 21, 34 |
| Calpurnia | 13, 16, 32 |

Scores are only computed for docs 8, 16 and 32.
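
The candidate filter can be sketched by counting, for each doc, how many of the query terms' postings lists it appears in. The function name `candidates` is hypothetical, and a production index would do this during a merged postings traversal rather than with a counter over all postings.

```python
from collections import Counter

postings = {
    "Antony":    [3, 4, 8, 16, 32, 64, 128],
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16, 32],
}

def candidates(postings, min_terms):
    """Doc ids that occur in at least min_terms of the postings lists."""
    hits = Counter(doc for plist in postings.values() for doc in plist)
    return sorted(doc for doc, n in hits.items() if n >= min_terms)

print(candidates(postings, 3))
```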

Champion lists

- Precompute, for each dictionary term t, the r docs of highest weight in t’s postings.
- Call this the champion list for t (aka fancy list or top docs for t).
- Note that r has to be chosen at index time.
- At query time, only compute scores for docs in the champion list of some query term.
- Pick the K top-scoring docs from amongst these.
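
A minimal sketch of champion lists, assuming postings are stored as term to list of `(doc_id, weight)` pairs. The function names and the sample postings are hypothetical; scoring of the surviving candidates would then proceed as usual.

```python
import heapq

def build_champion_lists(postings, r):
    """Index time: keep the r highest-weight postings for every term."""
    return {t: heapq.nlargest(r, plist, key=lambda p: p[1])
            for t, plist in postings.items()}

def candidate_docs(query_terms, champions):
    """Query time: union of the champion lists of the query terms."""
    return {doc_id for t in query_terms for doc_id, _ in champions.get(t, [])}

postings = {
    "car":       [(1, 0.1), (2, 0.9), (3, 0.5)],
    "insurance": [(3, 0.7), (4, 0.2), (5, 0.8)],
}
champions = build_champion_lists(postings, r=2)
print(candidate_docs(["car", "insurance"], champions))
```

With `r = 2`, docs 1 and 4 are never scored for this query, which is the intended saving and also the source of possible misses.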

Quantitative Static quality scores

- We want top-ranking documents to be both relevant and authoritative.
- Relevance is being modeled by cosine scores.
- Authority is typically a query-independent property of a document.
- Examples of authority signals:
  - Wikipedia among websites
  - Articles in certain newspapers
  - A paper with many citations
  - Many diggs, Y!buzzes or del.icio.us marks
  - (Pagerank)

Modeling authority

- Assign to each document d a query-independent quality score in [0,1].
- Denote this by g(d).
- Thus, a quantity like the number of citations is scaled into [0,1].
- Exercise: suggest a formula for this.

Net score

- Consider a simple total score combining cosine relevance and authority: net-score(q,d) = g(d) + cosine(q,d).
- Can use some other linear combination than an equal weighting.
- Indeed, any function of the two “signals” of user happiness can be used (more later).
- Now we seek the top K docs by net score.

Top K by net score: fast methods

- First idea: order all postings by g(d).
- Key: this is a common ordering for all postings.
- Thus, we can concurrently traverse query terms’ postings for:
  - Postings intersection
  - Cosine score computation
- Exercise: write pseudocode for cosine score computation if postings are ordered by g(d).

Why order postings by g(d)?

- Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal.
- In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early, short of computing scores for all docs in the postings.

Champion lists in g(d)-ordering

- Can combine champion lists with g(d)-ordering.
- Maintain for each term a champion list of the r docs with highest $g(d) + \text{tf-idf}_{t,d}$.
- Seek top-K results from only the docs in these champion lists.

High and low lists

- For each term, we maintain two postings lists called high and low. Think of high as the champion list.
- When traversing postings on a query, only traverse the high lists first.
  - If we get more than K docs, select the top K and stop.
  - Else proceed to get docs from the low lists.
- Can be used even for simple cosine scores, without global quality g(d).
- A means for segmenting the index into two tiers.

Impact-ordered postings

- We only want to compute scores for docs for which $wf_{t,d}$ is high enough.
- We sort each postings list by $wf_{t,d}$.
- Now: not all postings are in a common order!
- How do we compute scores in order to pick off the top K? Two ideas follow.

1. Early termination

- When traversing t’s postings, stop early after either:
  - a fixed number of r docs, or
  - $wf_{t,d}$ drops below some threshold.
- Take the union of the resulting sets of docs, one from the postings of each query term.
- Compute only the scores for docs in this union.

2. idf-ordered terms

- When considering the postings of query terms, look at them in order of decreasing idf.
- High-idf terms are likely to contribute most to the score.
- As we update the score contribution from each query term, stop if doc scores are relatively unchanged.
- Can apply to cosine or some other net scores.

Cluster pruning: preprocessing

- Pick $\sqrt{N}$ docs at random: call these leaders.
- For every other doc, pre-compute its nearest leader.
- Docs attached to a leader are its followers.
- Likely: each leader has ~ $\sqrt{N}$ followers.

Cluster pruning: query processing

Process a query as follows:

- Given query Q, find its nearest leader L.
- Seek the K nearest docs from among L’s followers.
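
Preprocessing and query processing for cluster pruning can be sketched together. The code assumes dense vectors stored in a dict from doc id to list of weights, cosine as the nearness measure, and that each leader is also listed among its own followers so that it can be returned as a result; these choices and all names are assumptions for the illustration.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_clusters(docs, seed=0):
    """Pick about sqrt(N) random leaders; attach every other doc to its
    nearest leader. Returns leader id -> list of doc ids (leader included)."""
    rng = random.Random(seed)
    leaders = rng.sample(sorted(docs), max(1, math.isqrt(len(docs))))
    followers = {leader: [leader] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id not in followers:
            nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
            followers[nearest].append(doc_id)
    return followers

def search(query, docs, followers, k):
    """Find the nearest leader, then rank only that leader's followers."""
    leader = max(followers, key=lambda l: cosine(query, docs[l]))
    ranked = sorted(followers[leader],
                    key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

docs = {i: [1.0 + i, 10.0 - i] for i in range(9)}   # hypothetical 2-d vectors
followers = build_clusters(docs)
print(search([1.0, 10.0], docs, followers, k=2))
```

Only about $\sqrt{N}$ leader comparisons plus one cluster's worth of follower comparisons are made per query, at the cost of possibly missing a true nearest neighbour attached to a different leader.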

Visualization

[Figure: a query point shown among leaders and their followers.]

Why use random sampling

- Fast
- Leaders reflect data distribution

General variants

- Have each follower attached to $b_1 = 3$ (say) nearest leaders.
- From the query, find $b_2 = 4$ (say) nearest leaders and their followers.
- Can recur on leader/follower construction.

Putting it all together
