
1 7CCSMWAL Algorithmic Issues in the WWW
Lecture 9

2 Information Retrieval Models
Boolean Model and Vector Model.
Basic idea: there are a total of t terms in the document collection. For each document D we have a vector of length t showing the occurrence of each of the terms. The collection is then a (t × d) term-document matrix, where d is the number of documents. What are the matrix entries?

3 Boolean model
The binary term-document incidence matrix records which term occurs in which document (1 if the document contains the term, 0 otherwise):

Terms       Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony               1                   1              0           0        0         1
Brutus               1                   1              0           1        0         0
Caesar               1                   1              0           1        1         1
Calpurnia            0                   1              0           0        0         0
Cleopatra            1                   0              0           0        0         0
mercy                1                   0              1           1        1         1
worser               1                   0              1           1        1         0

4 Example
Documents:
D1 = "I like databases"
D2 = "I hate hate databases"
D3 = "I like coffee"
D4 = "Databases of coffee and tea"
D5 = "Mmm, tea and tea biscuits for tea"
Terms? (omit stop words). Stemming?
Queries:
Q = "Database of coffee hate(rs)"
P = Q OR "tea"

5 Terms: biscuit, coffee, database, hate, like, tea (zeros omitted in table)
Query Q can be viewed as a document, but P can't. (A small script below shows how this table can be built.)

           D1   D2   D3   D4   D5    Q    P
biscuit                         1         ?
coffee               1    1          1    ?
database    1    1        1          1    ?
hate             1                   1    ?
like        1        1                    ?
tea                       1     1         ?

6 Boolean Search
Simplest form: retrieve documents which positively match all (or some) query terms; ignore negative matches.
Q1 = coffee, Q1 = (0,1,0,0,0,0). Retrieve D3, D4.
Q2 = biscuit, coffee, Q2 = (1,1,0,0,0,0). Nothing retrieved (no document contains both terms).

7 Query relevance
Retrieve documents based on similarity to the query: the number of common entries between document and query.
Q1 retrieves {D3, D4}, followed by {D1, D2, D5}.
Q2 retrieves {D3, D4, D5}, all with relevance 1, and then {D1, D2} with relevance 0.
Q3 = (0,1,1,1,0,1) gets D4 (rel 3), D2 (rel 2), {D1, D3, D5} (rel 1).

8 Jaccard Index
J = |A ∩ B| / |A ∪ B|, a set-similarity ratio. For Boolean queries, use the sets of terms occurring in the query and the document. Let Q2 = {biscuit, coffee}:
J1 = |Q2 ∩ D1| / |Q2 ∪ D1| = 0/4
J2 = |Q2 ∩ D2| / |Q2 ∪ D2| = 0/4
J3 = |Q2 ∩ D3| / |Q2 ∪ D3| = 1/3
J4 = |Q2 ∩ D4| / |Q2 ∪ D4| = 1/4
J5 = |Q2 ∩ D5| / |Q2 ∪ D5| = 1/3
Ranking: {D3, D5}, D4, {D1, D2}. (A sketch in code follows below.)
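A minimal sketch of this computation; the term sets are read off the table on slide 5.

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

docs = {
    "D1": {"like", "database"},
    "D2": {"hate", "database"},
    "D3": {"like", "coffee"},
    "D4": {"database", "coffee", "tea"},
    "D5": {"tea", "biscuit"},
}
Q2 = {"biscuit", "coffee"}
for name, d in docs.items():
    print(name, round(jaccard(Q2, d), 2))   # 0.0, 0.0, 0.33, 0.25, 0.33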

9 Boolean search
Each term can be represented by a Boolean vector of d dimensions, where d is the number of documents (the term's row in the incidence matrix):
Brutus: 110100
Caesar: 110111
Calpurnia: 010000
Queries are in the form of a Boolean expression of terms. Terms are combined with the operators AND, OR, and NOT.
The query result can be seen as applying the Boolean operators to the Boolean vectors:
Brutus AND Caesar AND NOT Calpurnia
110100 AND 110111 AND 101111 = 100100
i.e., the first and fourth documents (Antony and Cleopatra, Hamlet) satisfy the query. (A bit-vector sketch follows below.)
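One way to experiment with this, assuming we encode each term's incidence row as a Python integer used as a bit vector (an illustration, not the lecture's own notation):

# Incidence rows as integer bit vectors; bit i (from the left) = document i.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111                      # six documents in the collection
result = brutus & caesar & (~calpurnia & mask)
print(format(result, "06b"))              # 100100 -> documents 1 and 4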

10 Disjunctive Normal Form (DNF)
Queries are translated into DNF: C(1) OR C(2) OR ... OR C(k), where each C is a clause of the form (term AND term AND ...).
E.g. hot AND (sunny OR rainy) = (hot AND sunny) OR (hot AND rainy)
D AND (B OR (NOT C)) = (D AND B) OR (D AND (NOT C))
Exercise: Convert (x OR y) AND (z OR w) into DNF. (A mechanical check is sketched below.)
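For a mechanical check of such conversions, sympy's to_dnf can be used (assuming sympy is installed); here it is on the slide's own hot/sunny/rainy example:

from sympy import symbols
from sympy.logic.boolalg import to_dnf

hot, sunny, rainy = symbols("hot sunny rainy")
print(to_dnf(hot & (sunny | rainy)))   # (hot & rainy) | (hot & sunny)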

11 Why DNF?
Easy to think about: think of AND as × and OR as +, where
1+1 = 1, 1+0 = 1, 0+0 = 0 and 1×1 = 1, 1×0 = 0, 0×0 = 0.
A AND (B OR C) = A × (B + C) = (A × B) + (A × C) = (A AND B) OR (A AND C)
[Also: A OR (B AND C) = (A OR B) AND (A OR C)]
NOT(A AND B) = (NOT A) OR (NOT B)
NOT(A OR B) = (NOT A) AND (NOT B)
Exercise: Convert (x OR y) AND (z OR w) into DNF

12 Boolean model
Caesar AND (Antony OR Brutus) = (Caesar AND Antony) OR (Caesar AND Brutus)
(Term-document incidence matrix as on slide 3.)

13 Caesar AND (Antony OR Brutus)
= (Caesar AND Antony) OR (Caesar AND Brutus)
Using the incidence rows Caesar = 110111, Antony = 110001, Brutus = 110100:
(110111 × 110001) OR (110111 × 110100) = 110001 OR 110100 = 110101
(add the entries, with 1 + 1 = 1)
A document is relevant if it is retrieved by the query: D1, D2, D4, D6 are relevant.

14 Exercise
Brutus AND mercy
Brutus AND Calpurnia AND mercy
Brutus AND ((NOT Caesar) OR (worser))
Construct a query which returns all documents.
Construct a query which returns no documents, based on the terms in the documents.

15 Boolean search results
Documents either match a query or don't.
Good for expert users with a precise understanding of their needs and of the collection (e.g. legal systems).
Also good for (computer) applications: applications can easily evaluate 1000s of results.
Not good for the majority of users: most users are incapable of writing Boolean queries (or, if they are, think it's too much work), and most users don't want to wade through 1000s of results. This is particularly true of web search.

16 Ranked retrieval
The Boolean model returns all relevant documents as equally valid. Instead, use scoring as the basis of ranked retrieval: we wish to return the documents in the order most likely to be useful to the searcher. How can we rank-order the documents in the collection with respect to a query? Assign a score, say in [0, 1], to each document; this score measures how well document and query "match".

17 Vector Model
Given a set of documents d1, d2, ... and the terms in the vocabulary t1, t2, ..., each document di is represented as a vector Di of v dimensions, where v is the number of terms in the vocabulary: Di = (wi1, wi2, ..., wiv)

18 wij reflects the significance of the term tj in document di
Di = (wi1, wi2, ..., wiv)
In the Boolean model, wij = 1 if di contains term tj, wij = 0 otherwise.
In the Vector model, wij reflects the significance of the term tj in document di.
How is wij determined?

19 Similarity based on Boolean?
The Jaccard measure we used earlier is a measure of similarity between Boolean vectors; it assigns {0,1} weights to the vector components. A common similarity measure is cosine similarity:
S(x,y) = cos(x,y) = (∑ x(i) y(i)) / (|x| · |y|)
where |x| is the Euclidean (Pythagorean) length of x, i.e. |x|² = x(1)² + x(2)² + ... + x(n)².

20 Cosine similarity?

21 Boolean Weight Similarity
We do Boolean similarity first, to explain the general idea. For Boolean vectors 1×1 = 1, so |x| = √(number of 1's in the vector x) [easy]. E.g. Q = (1,1,0,0,0,0) (biscuit, coffee), so |Q| = √2.

22 Boolean similarity example
S(Q,D) = Q·D / (|Q| |D|), e.g. S(Q,D4) = 1/√(2·3) = 0.41. (A code check follows below.)

           D1    D2    D3    D4    D5    Q
biscuit     0     0     0     0     1    1
coffee      0     0     1     1     0    1
database    1     1     0     1     0    0
hate        0     1     0     0     0    0
like        1     0     1     0     0    0
tea         0     0     0     1     1    0
Length²     2     2     2     3     2    2
|D|        √2    √2    √2    √3    √2   √2
Q·D         0     0     1     1     1
S(Q,D)      0     0    1/2   0.41  1/2
RANK        4=    4=    1=    3     1=
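The S(Q,D) row can be reproduced with a short sketch; the vector order (biscuit, coffee, database, hate, like, tea) follows the table.

from math import sqrt

docs = {                      # Boolean vectors in term order
    "D1": [0, 0, 1, 0, 1, 0],
    "D2": [0, 0, 1, 1, 0, 0],
    "D3": [0, 1, 0, 0, 1, 0],
    "D4": [0, 1, 1, 0, 0, 1],
    "D5": [1, 0, 0, 0, 0, 1],
}
Q = [1, 1, 0, 0, 0, 0]        # biscuit, coffee

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(x)) * sqrt(sum(y)))   # for 0/1 vectors, |x| = sqrt(number of 1's)

for name, d in docs.items():
    print(name, round(cosine(Q, d), 2))          # 0.0, 0.0, 0.5, 0.41, 0.5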

23 Document-Document Similarity
S(A,B), where A and B are the Boolean vectors of two documents. We need to calculate A·B for each pair of docs, then divide by |A| |B|.

A·B   D1  D2  D3  D4  D5
D1     ·   1   1   1   0
D2         ·   0   1   0
D3             ·   1   0
D4                 ·   1
D5                     ·

24 Document-Document Similarity
Then divide by |A| |B|, with |D1|, ..., |D5| = √2, √2, √2, √3, √2:

S(A,B)  D1   D2   D3   D4    D5
D1       ·   1/2  1/2  0.41   0
D2            ·    0   0.41   0
D3                 ·   0.41   0
D4                      ·    0.41
D5                            ·

25 Vector Model How to determine the weights in the term vector?

26 Term-document count matrices
Consider the number of occurrences of a term in a document (the term frequency of the term in that document). Each document is a count vector (a column below):

Terms       Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony             157                  73              0           0        0         1
Brutus               4                 157              0           1        0         0
Caesar             232                 227              0           2        1         1
Calpurnia            0                  10              0           0        0         0
Cleopatra           57                   0              0           0        0         0
mercy                2                   0              3           5        5         1
worser               2                   0              1           1        1         0

27 Term frequency tf
The term frequency tfij of term tj in document di is defined as the number of times that tj occurs in di. We want to use tfij when computing wij. But how? Raw term frequency is not what we want: a document with 10 occurrences of a term is more relevant than a document with one occurrence of the term, but not 10 times more relevant. Relevance does not increase proportionally with term frequency.

28 Log-frequency weighting
The log frequency weight gij of term tj in di is
gij = 1 + log10 tfij, if tfij > 0
gij = 0, if tfij = 0
So 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc. gij increases slowly as tfij increases. (A one-line implementation follows below.)
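A sketch reproducing the numbers above:

from math import log10

def log_tf_weight(tf):
    # g = 1 + log10(tf) for tf > 0, and 0 when the term is absent
    return 1 + log10(tf) if tf > 0 else 0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 2))   # 0, 1, 1.3, 2, 4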

29 Document frequency
Rare terms are more informative than frequent terms (frequent terms include, e.g., stop words). Consider a term in the query that is rare in the collection (e.g., arachnocentric): a document containing this term is very likely to be relevant to the query "arachnocentric". We want a high weight for rare terms like arachnocentric.

30 Document frequency The number of documents containing a given term
Consider a query term that is frequent in the collection (e.g., high, increase, line). A document containing such a term is more likely to be relevant than a document that doesn't contain it, but the term is not a sure indicator of relevance. For frequent terms we still want positive weights for words like high, increase, and line, but lower weights than for rare terms (the rare terms get higher weights). We use document frequency (df) to capture this: dfi (≤ N, the number of documents) is the number of documents that contain the term ti. Go back to the table for examples.

32 Document Frequency
N = 6 documents.

Terms       Antony and Cleopatra   Julius Caesar   Tempest   Hamlet   Othello   Macbeth   DOCUMENT FREQUENCY
Antony             157                  73             0        0        0         1              3
Brutus               4                 157             0        1        0         0              3
Caesar             232                 227             0        2        1         1              5
Calpurnia            0                  10             0        0        0         0              1
Cleopatra           57                   0             0        0        0         0              1
mercy                2                   0             3        5        5         1              5
worser               2                   0             1        1        1         0              4

33 Inverse document frequency
dfi is the document frequency of term ti: the number of documents that contain ti. dfi is a measure of the "informativeness" of ti. We define the idf (inverse document frequency) of ti as
idfi = log10 (N / dfi)
We use log10 (N / dfi) instead of N / dfi to "dampen" the effect of idf. What if dfi = 0? Put idfi = 0.

34 idf example, suppose N=1 million
terms           dfi   idfi
calpurnia         1     6
animal          100     4
sunday        1,000     3
fly          10,000     2
under       100,000     1
the       1,000,000     0
There is one idf value for each term in a collection.

35 Collection vs. Document frequency
The collection frequency of term ti is the number of occurrences of ti in the collection, counting multiple occurrences of the term in each document. Example: which word is a better search term (and should get a higher weight)?

terms       Collection frequency   Document frequency
insurance         10440                  3997
try               10422                  8760

36 tf-idf weighting
The term frequency tfij of term tj in document di is defined as the number of times that tj occurs in di. The tf-idf weight of a term in a document is the product of its tf weight (within the doc.) and its idf weight (among docs.):
wij = (1 + log10 tfij) × log10 (N / dfj)
w(i,j) is the weight of term j in document i. Put w(i,j) = 0 if tfij = 0.
This is the best known weighting scheme in information retrieval. Alternative names: tf.idf, tf × idf.
The weight increases with the number of occurrences of the term within a document, and increases with the rarity of the term in the collection. (A sketch in code follows below.)
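A minimal sketch of the weighting function, checked against the w11 = 0.96 example on the next slide:

from math import log10

def tf_idf(tf, df, N):
    # w_ij = (1 + log10 tf_ij) * log10(N / df_j), and 0 if the term is absent
    if tf == 0:
        return 0.0
    return (1 + log10(tf)) * log10(N / df)

print(round(tf_idf(157, 3, 6), 2))   # Antony in "Antony and Cleopatra": 0.96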

37 Example (see page 14 for data)
N = 6 documents. Document frequency of Antony: df(Antony) = 3 (Antony occurs in 3 of the 6 documents).
Term frequencies in document "Antony and Cleopatra": Antony 157, Brutus 4, Caesar 232, Calpurnia 0, Cleopatra 57, mercy 2, worser 2.
For term "Antony" in document "Antony and Cleopatra":
w11 = (1 + log10 157) × log10 (6 / 3) = 0.96

38 Each document as a real-valued vector of tf-idf weights
Each document is now represented by a real-valued vector of tf-idf weights. For term "Antony" in document "Antony and Cleopatra", w11 = (1 + log10 157) × log10 (6 / 3) = 0.96. The other wij are computed similarly (compare page 14):

Terms            d1 Antony and Cleopatra   d2 Julius Caesar   d3 The Tempest   d4 Hamlet   d5 Othello   d6 Macbeth
Antony    wi1           0.96                    0.86                0             0            0           0.3
Brutus    wi2           0.48                    0.96                0            0.3           0            0
Caesar    wi3           0.27                    0.27                0            0.1          0.08         0.08
Calpurnia wi4            0                      1.56                0             0            0            0
Cleopatra wi5           2.14                     0                  0             0            0            0
mercy     wi6           0.1                      0                 0.12          0.13         0.13         0.08
worser    wi7           0.23                     0                 0.18          0.18         0.18          0

39 Queries in vector model
A query is just a list of terms, without any connecting search operators such as Boolean operators, similar to the formation of documents. This query style is popular on the web. A query is represented by a v-dimensional vector, where v is the number of terms in the vocabulary: Q = (q1, q2, ..., qv).
E.g. Boolean model: qj is 0 or 1.
E.g. tf-idf model: qj = log10 (N / dfj) if the query contains term tj, qj = 0 otherwise (N = number of documents, dfj = document frequency of tj).
Assumption: since a term either appears or not in the query, the term frequency in the query is either 1 or 0, so the log-frequency weight (the tf part) is also either 1 or 0.

40 Query as a vector (Example)
Query: Caesar, Calpurnia, worser, i.e. (0, 0, Caesar, Calpurnia, 0, 0, worser) in vocabulary order.
q1 = q2 = q5 = q6 = 0 since those terms are not in the query
q3 = log10 (6/5) = 0.08
q4 = log10 (6/1) = 0.78
q7 = log10 (6/4) = 0.18
Q = (0, 0, 0.08, 0.78, 0, 0, 0.18)
(Term-document count matrix as on slide 32. A sketch in code follows below.)
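A sketch of this construction; the df values are read off the document-frequency table on slide 32.

from math import log10

N = 6
vocab = ["Antony", "Brutus", "Caesar", "Calpurnia", "Cleopatra", "mercy", "worser"]
df = {"Antony": 3, "Brutus": 3, "Caesar": 5, "Calpurnia": 1,
      "Cleopatra": 1, "mercy": 5, "worser": 4}   # document frequencies
query = {"Caesar", "Calpurnia", "worser"}
Q = [round(log10(N / df[t]), 2) if t in query else 0 for t in vocab]
print(Q)   # [0, 0, 0.08, 0.78, 0, 0, 0.18]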

41 Document-Query proximity
Measures the "similarity" of the document vectors and the query vector, S(Q, Di).
In the Boolean model: S(Q, Di) = 1 if Di contains the terms of Q so as to satisfy the Boolean expression, S(Q, Di) = 0 otherwise.
In the Vector model: S(Q, Di) is a real value between 0 and 1; the higher the value S(Q, Di), the more similar the two vectors Q and Di. We can rank the documents using the value S(Q, Di).

42 Document-Query proximity
Consider the 2-dimensional case, i.e., a vocabulary with 2 terms only, e.g., Brutus and Caesar. Which of D1, D2, and D3 is more similar to Q?
The length of the vectors is not important. Consider a document d' that contains the same terms as document d, but with term frequencies twice those of d: the lengths of the two vectors are different but the orientations are the same. In the example figure, D2 is twice as long as Q.
[Figure: vectors D1, D2, D3 and Q in the 2-D term space with axes Brutus and Caesar]

43 Document-Query proximity
The similarity S(Q, Di) is measured by the angle between the vectors Di and Q. An angle of 0 between the two vectors corresponds to maximal similarity; the smaller the angle, the higher the similarity.
Technically, we calculate the cosine of the angle between the two vectors. While an angle θ ranges from 0° to 180°, cos θ ranges from 1 to -1; the higher the cosine value, the smaller the angle.
We rank the documents in decreasing order of the cosine of the angle between document and query.
[Figure: same 2-D term space as on the previous slide]

44 Notes: Cosine function
In a right-angled triangle, cos θ = adjacent / hypotenuse. The cosine of the angle between two lines (vectors) can be calculated from the coordinates of the vectors involved.

45 Di  Q = (wi1q1) + (wi2q2) + ... + (wivqv)
S(Q, Di) S(Q, Di), which is the cosine of the angle between vectors Di and Q, is equal to Di  Q / ( |Di|  |Q| ) where Di  Q is the dot product of Di and Q, i.e., Di  Q = (wi1q1) + (wi2q2) (wivqv) and |Di| and |Q| are the lengths of the Di and Q, |Di| = wi12 + wi wiv2 |Q| = q12 + q qv2
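A sketch of the cosine computation, checked against S(Q, D2) = 0.754 as computed on slide 49 below; the vectors are taken from the tf-idf table on slide 38 and the query on slide 40.

from math import sqrt

def cosine(x, y):
    # S(x, y) = x·y / (|x| |y|)
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

D2 = [0.86, 0.96, 0.27, 1.56, 0, 0, 0]   # tf-idf column for "Julius Caesar"
Q  = [0, 0, 0.08, 0.78, 0, 0, 0.18]      # query vector from slide 40
print(round(cosine(Q, D2), 3))           # 0.754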

46 Properties of S(Q, Di)
0 ≤ S(Q, Di) ≤ 1 (all weights are non-negative).
S(Q, Di) = 0 ⇔ qj wij = 0 for all j; then S(Q, Di) = 0 / (|Di| × |Q|) = 0.
S(Q, Di) = 1 if qj = wij for all j (more generally, whenever Q and Di have the same orientation); then S(Q, Di) = (q1² + q2² + ... + qv²) / (q1² + q2² + ... + qv²) = 1.
It's basically 'correlation'.

47 Consider query Q: Caesar, Calpurnia, worser
As calculated before, Q = (0, 0, 0.08, 0.78, 0, 0, 0.18). Compute S(Q, Di) for all i, using the tf-idf matrix (page 24; reproduced on slide 38 above).

48 Example
|Q| = √(0.08² + 0.78² + 0.18²) = 0.8045
|D1| = 2.4223, |D2| = 2.0415, |D3| ≈ 0.21, |D4| ≈ 0.39, |D5| = 0.236, |D6| ≈ 0.32
(lengths computed from the tf-idf columns of the table on slide 38, extended with the query column Q)

49 Computing the similarities
S(Q, D1) = D1 · Q / (|D1| |Q|)
         = (0.96×0 + 0.48×0 + 0.27×0.08 + 0×0.78 + 2.14×0 + 0.1×0 + 0.23×0.18) / (2.4223 × 0.8045) = 0.0323
S(Q, D2) = D2 · Q / (|D2| |Q|)
         = (0.86×0 + 0.96×0 + 0.27×0.08 + 1.56×0.78 + 0×0 + 0×0 + 0×0.18) / (2.0415 × 0.8045) = 0.754
Similarly, we compute the other similarities. (tf-idf table as on slide 38, with the query column Q.)

50 Query Q: Caesar, Calpurnia, worser
S(Q, D1) = 0.0323, S(Q, D2) = 0.754, S(Q, D3) ≈ 0.19, S(Q, D4) = 0.13, S(Q, D5) ≈ 0.20, S(Q, D6) ≈ 0.02 (the remaining values are computed the same way).
Ranking of the documents according to their similarity to Q (in descending order): D2, D5, D3, D4, D1, D6.
(tf-idf table as on slide 38, with the query column Q.)

51 Similarity between documents
Why might we want to do this? Determine the similarity between two documents Di and Dj, using the same method as before:
S(Di, Dj) = Di · Dj / (|Di| |Dj|)
E.g., S(D1, D2) = (0.96×0.86 + 0.48×0.96 + 0.27×0.27 + 2.14×0 + 0.1×0 + 0.23×0) / (2.4223 × 2.0415) ≈ 0.27
(tf-idf table as on slide 38.)

52 Exercise Calculate the document similarities for the previous table

53 Pseudo code of computing cosine scores
Assume the postings of term tj contain the frequencies of tj in each document di (i.e., tfij), as well as the number of documents tj appears in (i.e., dfj), so wij and qj can be computed. Assume |Di| for all i has been computed and stored in Length[]. The algorithm returns the top K documents by cosine score. (Dividing by Length[i] alone, rather than by Length[i] × |Q|, does not change the ranking, since |Q| is the same for every document. A Python rendering follows below.)

CosineScore(q)
  float Scores[N] = 0
  Initialize Length[N]            {i.e., |Di|}
  for each query term tj do
    fetch postings list for tj and calculate qj
    for each document di in postings list do
      Scores[i] += qj × wij
  Read the array Length[]
  for each i = 1 to N do
    Scores[i] = Scores[i] / Length[i]
  return Top K components of Scores[]
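A runnable Python rendering of this pseudocode; the in-memory index layout (postings as a dict from term to {doc: tf}, lengths as a dict from doc to |Di|) and all names are assumptions for illustration.

from math import log10

def cosine_score(query_terms, postings, N, length, K):
    # postings: term -> {doc_id: tf_ij}; length: doc_id -> |Di| (precomputed)
    scores = {}
    for t in query_terms:
        plist = postings.get(t, {})
        df = len(plist)
        if df == 0:
            continue                              # term absent from the collection
        q_t = log10(N / df)                       # query weight: tf in the query is 1
        for doc, tf in plist.items():
            w = (1 + log10(tf)) * log10(N / df)   # tf-idf weight w_ij
            scores[doc] = scores.get(doc, 0.0) + q_t * w
    for doc in scores:
        scores[doc] /= length[doc]                # normalize by |Di|; |Q| is constant
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:K]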

54 tf-idf weighting has many variants
tf-idf weighting is the product of a term-frequency factor and a document-frequency factor, and each factor comes in several variants:

Term frequency tf                            Document frequency idf
natural:    tfij                             none (Boolean): 1
logarithm*: 1 + log10(tfij) if tfij > 0,     log (inverse*): log10(N / dfj)
            0 otherwise

* used in the previous examples

55 Exercise
D1 = "I like databases", D2 = "I hate hate databases", D3 = "I like coffee", D4 = "Databases of coffee and tea", D5 = "Mmm, tea and tea biscuits for tea"
Q = "Database of coffee hate(rs)"
i) Construct the term-frequency matrix for the docs.
ii) Rank the documents according to their relevance to the query Q.

