7CCSMWAL Algorithmic Issues in the WWW


7CCSMWAL Algorithmic Issues in the WWW Lecture 9

Information Retrieval Models: the Boolean Model and the Vector Model. Basic idea: there are a total of t terms in the document collection. For each document D we have a vector of length t showing the occurrence of each of the terms, so the collection is a (t × D) term-document matrix. What are the matrix entries?

Boolean model. The binary term-document incidence matrix records which term occurs in which document: the entry is 1 if the document contains the term, 0 otherwise.

Terms       Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1              0         0        0        1
Brutus             1                1              0         1        0        0
Caesar             1                1              0         1        1        1
Calpurnia          0                1              0         0        0        0
mercy              1                0              1         1        1        1
worser             1                0              1         1        1        0

Example Documents
D1 = "I like databases"
D2 = "I hate hate databases"
D3 = "I like coffee"
D4 = "Databases of coffee and tea"
D5 = "Mmm, tea and tea biscuits for tea"
Terms? (omit stop words). Stemming?
Queries
Q = "Database of coffee hate(rs)"
P = Q OR "tea"

Terms: biscuit, coffee, database, hate, like, tea. Query Q can be viewed as a document, but P can't.

            D1   D2   D3   D4   D5    Q    P
biscuit      0    0    0    0    1    0    ?
coffee       0    0    1    1    0    1    ?
database     1    1    0    1    0    1    ?
hate         0    1    0    0    0    1    ?
like         1    0    1    0    0    0    ?
tea          0    0    0    1    1    0    ?

Boolean Search. Simplest form: retrieve documents which positively match all (or some) of the query terms; ignore negative matches.
Q1 = coffee, Q1 = (0,1,0,0,0,0): retrieve D3, D4
Q2 = biscuit, coffee, Q2 = (1,1,0,0,0,0): nothing retrieved (no document contains both terms)

Query relevance. Retrieve documents based on similarity to the query, e.g. the number of common entries between document and query.
Q1 retrieves {D3, D4} (relevance 1), followed by {D1, D2, D5} (relevance 0)
Q2 retrieves {D3, D4, D5}, all with relevance 1, and then {D1, D2} with relevance 0
Q3 = (0,1,1,1,0,1) gets D4 (relevance 3), D2 (relevance 2), then {D1, D3, D5} (relevance 1)

Jaccard Index J(A,B) = |A ∩ B| / |A ∪ B|, a set-similarity ratio. For Boolean queries use the sizes of the term sets. Let Q2 = {biscuit, coffee}:
J1 = |Q2 ∩ D1| / |Q2 ∪ D1| = 0/4
J2 = |Q2 ∩ D2| / |Q2 ∪ D2| = 0/4
J3 = |Q2 ∩ D3| / |Q2 ∪ D3| = 1/3
J4 = |Q2 ∩ D4| / |Q2 ∪ D4| = 1/4
J5 = |Q2 ∩ D5| / |Q2 ∪ D5| = 1/3
Ranking: {D3, D5}, D4, {D1, D2}
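A minimal sketch of this Jaccard ranking in Python, assuming stop words have already been removed and each document is held as a set of terms (the names below simply mirror the toy example above):

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; defined as 0 when both sets are empty
    return len(a & b) / len(a | b) if (a | b) else 0.0

docs = {
    "D1": {"like", "database"},
    "D2": {"hate", "database"},
    "D3": {"like", "coffee"},
    "D4": {"database", "coffee", "tea"},
    "D5": {"tea", "biscuit"},
}
q2 = {"biscuit", "coffee"}

# rank documents by Jaccard similarity to the query, highest first
for d in sorted(docs, key=lambda d: jaccard(q2, docs[d]), reverse=True):
    print(d, jaccard(q2, docs[d]))   # D3, D5 (1/3), then D4 (1/4), then D1, D2 (0)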

Boolean search. Each term can be represented by a Boolean vector of d dimensions, where d is the number of documents:
Brutus:    110100
Caesar:    110111
Calpurnia: 010000
Queries are Boolean expressions of terms, combined with the operators AND, OR and NOT. The query result can be seen as applying the Boolean operators to the Boolean vectors:
Brutus AND Caesar AND NOT Calpurnia
110100 AND 110111 AND NOT 010000 = 100100
i.e., the first and fourth documents satisfy the query.
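A small sketch of this bit-vector evaluation in Python, keeping each term vector as a 6-bit integer (leftmost bit = first document); the variable names are just illustrative:

# term incidence vectors for the six-document collection
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
ALL       = 0b111111            # universe of documents, so NOT stays within 6 bits

result = brutus & caesar & (ALL ^ calpurnia)   # AND, AND, AND NOT
print(format(result, '06b'))                   # '100100' -> documents 1 and 4 match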

Disjunctive Normal Form (DNF). Queries are translated into DNF: C(1) OR C(2) OR ... OR C(k), where each C is a clause of the form (term AND term AND ...).
E.g. hot AND (sunny OR rainy) = (hot AND sunny) OR (hot AND rainy)
D AND (B OR (NOT C)) = (D AND B) OR (D AND (NOT C))
Exercise: Convert (x OR y) AND (z OR w) into DNF

Why DNF? Easy to think about: think of AND as × and OR as +, where 1+1=1, 1+0=1, 0+0=0 and 1×1=1, 1×0=0, 0×0=0.
A AND (B OR C) = A × (B + C) = (A × B) + (A × C) = (A AND B) OR (A AND C)
[Also: A OR (B AND C) = (A OR B) AND (A OR C)]
NOT(A AND B) = (NOT A) OR (NOT B)
NOT(A OR B) = (NOT A) AND (NOT B)
Exercise: Convert (x OR y) AND (z OR w) into DNF

Boolean model. Caesar AND (Antony OR Brutus) = (Caesar AND Antony) OR (Caesar AND Brutus)

Terms       Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1              0         0        0        1
Brutus             1                1              0         1        0        0
Caesar             1                1              0         1        1        1
Calpurnia          0                1              0         0        0        0
mercy              1                0              1         1        1        1
worser             1                0              1         1        1        0

Caesar AND (Antony OR Brutus) = (Caesar AND Antony) OR (Caesar AND Brutus)
Caesar AND Antony:  1 1 0 1 1 1 × 1 1 0 0 0 1 = 1 1 0 0 0 1
Caesar AND Brutus:  1 1 0 1 1 1 × 1 1 0 1 0 0 = 1 1 0 1 0 0
(1 1 0 0 0 1) OR (1 1 0 1 0 0) = (1 1 0 1 0 1)   (for the OR, add the entries, with 1+1=1)
A document is relevant if it is retrieved by the query: D1, D2, D4, D6 are relevant.
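The same evaluation, written as a sketch against rows of the incidence matrix (values as in the table above), with elementwise min/max playing the roles of AND/OR:

# incidence rows (columns: the six plays in the order of the table)
antony = [1, 1, 0, 0, 0, 1]
brutus = [1, 1, 0, 1, 0, 0]
caesar = [1, 1, 0, 1, 1, 1]

def AND(x, y): return [min(a, b) for a, b in zip(x, y)]
def OR(x, y):  return [max(a, b) for a, b in zip(x, y)]

# Caesar AND (Antony OR Brutus), distributed into DNF
result = OR(AND(caesar, antony), AND(caesar, brutus))
print(result)   # [1, 1, 0, 1, 0, 1] -> D1, D2, D4, D6 are relevant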

Exercise
Brutus AND mercy
Brutus AND Calpurnia AND mercy
Brutus AND ((NOT Caesar) OR worser)
Construct a query which returns all documents.
Construct a query, using only terms that appear in the documents, which returns no documents.

Boolean search results. Documents either match a query or they don't.
Good for expert users with a precise understanding of their needs and of the collection (e.g. legal systems). Also good for (computer) applications, which can easily evaluate 1000s of results.
Not good for the majority of users: most users are incapable of writing Boolean queries (or, if they are, think it's too much work), and most users don't want to wade through 1000s of results. This is particularly true of web search.

Ranked retrieval. The Boolean model returns all relevant documents as equally valid. Instead, use scoring as the basis of ranked retrieval: we wish to return the documents in the order most likely to be useful to the searcher. How can we rank-order the documents in the collection with respect to a query? Assign a score, say in [0, 1], to each document; this score measures how well the document and the query "match".

Vector Model. Given a set of documents d1, d2, ... and the terms in the vocabulary t1, t2, ..., each document di is represented as a vector Di of v dimensions, where v is the number of terms in the vocabulary: Di = (wi1, wi2, ..., wiv)

How is wij determined? Di = (wi1, wi2, ..., wiv). In the Boolean model, wij = 1 if di contains term tj and wij = 0 otherwise. In the Vector model, wij reflects the significance of the term tj in document di.

Similarity based on Boolean? The Jaccard measure used earlier is a similarity measure between Boolean vectors; it assigns {0,1} weights to the vector components. A common similarity measure is cosine similarity:
S(x, y) = cos(x, y) = (∑ x(i) y(i)) / (|x| · |y|)
where |x| is the Euclidean (Pythagorean) length, i.e. |x|^2 = x(1)^2 + x(2)^2 + ... + x(n)^2

Cosine similarity?

Boolean Weight Similarity. We do Boolean similarity first to explain the general idea. With 0/1 weights, x(i) y(i) = 1 only when both entries are 1, so the dot product counts shared terms, and |x| = √(number of 1's in the vector x) [easy]. Example: Q = (1,1,0,0,0,0) (biscuit, coffee), so |Q| = √2.

            D1    D2    D3    D4    D5     Q
biscuit      0     0     0     0     1     1
coffee       0     0     1     1     0     1
database     1     1     0     1     0     0
hate         0     1     0     0     0     0
like         1     0     1     0     0     0
tea          0     0     0     1     1     0
Length^2     2     2     2     3     2     2
|D|         √2    √2    √2    √3    √2    √2
Q·D          0     0     1     1     1
S(Q,D)       0     0    1/2   0.41  1/2
RANK        4=    4=    1=     3    1=

e.g. S(Q,D3) = 1/√(2·2) = 1/2 and S(Q,D4) = 1/√(2·3) = 0.41.
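A short sketch reproducing these Boolean cosine scores in Python (term order biscuit, coffee, database, hate, like, tea, as in the table above):

from math import sqrt

def cosine(x, y):
    # dot product divided by the product of Euclidean lengths
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

docs = {                        # Boolean incidence columns from the table
    "D1": [0, 0, 1, 0, 1, 0], "D2": [0, 0, 1, 1, 0, 0], "D3": [0, 1, 0, 0, 1, 0],
    "D4": [0, 1, 1, 0, 0, 1], "D5": [1, 0, 0, 0, 0, 1],
}
Q = [1, 1, 0, 0, 0, 0]          # biscuit, coffee

for d in sorted(docs, key=lambda d: cosine(Q, docs[d]), reverse=True):
    print(d, round(cosine(Q, docs[d]), 2))   # D3 0.5, D5 0.5, D4 0.41, D1 0.0, D2 0.0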

Document-Document Similarity. S(A,B), where A, B are the Boolean vectors of two documents: first calculate A·B for the pair of documents, then divide by |A| |B|.

A·B     D1   D2   D3   D4   D5
D1       -    1    1    1    0
D2       1    -    0    1    0
D3       1    0    -    1    0
D4       1    1    1    -    1
D5       0    0    0    1    -

Document-Document Similarity. Then divide by |A| |B|, where |D1|, ..., |D5| = √2, √2, √2, √3, √2.

S(A,B)    D1     D2     D3     D4     D5
D1         1     1/2    1/2    0.41    0
D2        1/2     1      0     0.41    0
D3        1/2     0      1     0.41    0
D4        0.41   0.41   0.41    1     0.41
D5         0      0      0     0.41    1

Vector Model How to determine the weights in the term vector?

Term-document count matrices. Consider the number of occurrences of a term in a document (the term frequency in the document). Each document is a count vector, one column of the matrix:

Terms       Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            157               73              0        0        0       1
Brutus              4              157              0        1        0       0
Caesar            232              227              0        2        1       1
Calpurnia           0               10              0        0        0       0
Cleopatra          57                0              0        0        0       0
mercy               2                0              3        5        5       1
worser              2                0              1        1        1       0

Term frequency tf. The term frequency tfij of term tj in document di is defined as the number of times that tj occurs in di. We want to use tfij when computing wij, but how? Raw term frequency is not what we want: a document with 10 occurrences of a term is more relevant than a document with one occurrence of the term, but not 10 times more relevant. Relevance does not increase proportionally with term frequency.

Log-frequency weighting. The log-frequency weight gij of term tj in di is
gij = 1 + log10 tfij, if tfij > 0
gij = 0, if tfij = 0
So 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.; gij increases slowly as tfij increases.
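A one-function sketch of this weighting in Python, reproducing the mapping above:

from math import log10

def log_tf(tf):
    # log-frequency weight: 1 + log10(tf) for tf > 0, else 0
    return 1 + log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf(tf), 1))   # 0 -> 0, 1 -> 1, 2 -> 1.3, 10 -> 2, 1000 -> 4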

Document frequency. Rare terms are more informative than frequent terms (stop words being the extreme case of frequent, uninformative terms). Consider a term in the query that is rare in the collection (e.g., arachnocentric): a document containing this term is very likely to be relevant to the query arachnocentric. We therefore want a high weight for rare terms like arachnocentric.

Document frequency: the number of documents containing a given term. Consider a query term that is frequent in the collection (e.g., high, increase, line): a document containing such a term is more likely to be relevant than a document that doesn't, but it is not a sure indicator of relevance. For frequent terms we still want positive weights for words like high, increase, and line, but lower weights than for rare terms (the rare terms get the higher weights). We use document frequency (df) to capture this: dfi (≤ N, where N is the number of documents) is the number of documents that contain the term ti. Go back to the table for examples.

Document Frequency. N = 6 documents.

Terms       Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  DOCUMENT FREQUENCY
Antony            157               73              0        0        0       1            3
Brutus              4              157              0        1        0       0            3
Caesar            232              227              0        2        1       1            5
Calpurnia           0               10              0        0        0       0            1
Cleopatra          57                0              0        0        0       0            1
mercy               2                0              3        5        5       1            5
worser              2                0              1        1        1       0            4

Inverse document frequency. dfi is the document frequency of term ti: the number of documents that contain ti. dfi is a measure of the "informativeness" of ti. We define the idf (inverse document frequency) of ti as
idfi = log10(N / dfi)
We use log10(N / dfi) instead of N / dfi to "dampen" the effect of idf. What if dfi = 0? Put idfi = 0.

idf example, with N = 1 million documents:

term              dfi    idfi
calpurnia           1      6
animal            100      4
sunday          1,000      3
fly            10,000      2
under         100,000      1
the         1,000,000      0

There is one idf value for each term in a collection.
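A minimal sketch reproducing this table (the term list and N = 1,000,000 are taken from the example above):

from math import log10

def idf(N, df):
    # inverse document frequency; the slide's convention puts idf = 0 when df = 0
    return log10(N / df) if df > 0 else 0.0

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(term, round(idf(N, df)))   # 6, 4, 3, 2, 1, 0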

Collection vs. Document frequency. The collection frequency of term ti is the number of occurrences of ti in the collection, counting multiple occurrences of the term within each document. Example: which word is a better search term (and should get a higher weight)?

terms       Collection frequency   Document frequency
insurance          10440                 3997
try                10422                 8760

tf-idf weighting. The term frequency tfij of term tj in document di is the number of times that tj occurs in di. The tf-idf weight of a term in a document is the product of its tf weight (within the document) and its idf weight (across documents):
wij = (1 + log10 tfij) × log10(N / dfj)
where w(i,j) is the weight of term j in document i; put w(i,j) = 0 if tfij = 0.
This is the best-known weighting scheme in information retrieval (alternative names: tf.idf, tf × idf). The weight increases with the number of occurrences of the term within a document, and increases with the rarity of the term in the collection.
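Putting the two components together, a small sketch of this weighting function (the function name is just illustrative):

from math import log10

def tfidf(tf, df, N):
    # (1 + log10 tf) * log10(N / df); zero when the term is absent from the document
    if tf == 0 or df == 0:
        return 0.0
    return (1 + log10(tf)) * log10(N / df)

print(round(tfidf(157, 3, 6), 2))   # 0.96, matching the worked example below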

Example (see the count matrix above for the data). Antony = (157, 73, 0, 0, 0, 1), N = 6 documents, and the document frequency of Antony is df(Antony) = 3. Term frequencies in the document "Antony and Cleopatra": Antony 157, Brutus 4, Caesar 232, Calpurnia 0, Cleopatra 57, mercy 2, worser 2. For the term "Antony" in the document "Antony and Cleopatra":
w11 = (1 + log10 157) × log10(6 / 3) = 0.96

Each document is now represented by a real-valued vector of tf-idf weights. For the term "Antony" in the document "Antony and Cleopatra", w11 = (1 + log10 157) × log10(6 / 3) = 0.96; the other wij are computed similarly (compare the count matrix above).

                 d1 Antony&Cleopatra  d2 Julius Caesar  d3 The Tempest  d4 Hamlet  d5 Othello  d6 Macbeth
Antony    (wi1)        0.96                0.86              0             0          0           0.3
Brutus    (wi2)        0.48                0.96              0             0.3        0           0
Caesar    (wi3)        0.27                0.27              0             0.1        0.08        0.08
Calpurnia (wi4)        0                   1.56              0             0          0           0
Cleopatra (wi5)        2.14                0                 0             0          0           0
mercy     (wi6)        0.1                 0                 0.12          0.13       0.13        0.08
worser    (wi7)        0.23                0                 0.18          0.18       0.18        0
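A sketch that builds this matrix from the count matrix (counts, document frequencies and term order as in the tables above); apart from rounding it reproduces the weights shown:

from math import log10

counts = {   # term -> counts in (A&C, Julius Caesar, Tempest, Hamlet, Othello, Macbeth)
    "Antony":    [157, 73, 0, 0, 0, 1],
    "Brutus":    [4, 157, 0, 1, 0, 0],
    "Caesar":    [232, 227, 0, 2, 1, 1],
    "Calpurnia": [0, 10, 0, 0, 0, 0],
    "Cleopatra": [57, 0, 0, 0, 0, 0],
    "mercy":     [2, 0, 3, 5, 5, 1],
    "worser":    [2, 0, 1, 1, 1, 0],
}
N = 6

def tfidf(tf, df):
    return (1 + log10(tf)) * log10(N / df) if tf > 0 else 0.0

weights = {}
for term, row in counts.items():
    df = sum(1 for c in row if c > 0)                 # document frequency of the term
    weights[term] = [round(tfidf(tf, df), 2) for tf in row]

print(weights["Antony"])   # [0.96, 0.86, 0.0, 0.0, 0.0, 0.3]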

Queries in the vector model. A query is just a list of terms, without any connecting search operators such as Boolean operators; this is similar to the formation of documents, and this query style is popular on the web. A query is represented by a v-dimensional vector, where v is the number of terms in the vocabulary: Q = (q1, q2, ..., qv).
E.g. Boolean model: each qj is 0 or 1.
E.g. tf-idf model: qj = log10(N / dfj) if the query contains term tj, and qj = 0 otherwise, where N is the number of documents and dfj is the document frequency of tj.
Assumption: since a term either appears in the query or not, its term frequency is either 1 or 0, so the log-frequency (tf) part of the weight is also either 1 or 0.

Query as a vector (Example). Query: Caesar, Calpurnia, worser, i.e. (0, 0, Caesar, Calpurnia, 0, 0, worser).
q1 = q2 = q5 = q6 = 0 since those terms are not in the query
q3 = log10(6/5) = 0.08
q4 = log10(6/1) = 0.78
q7 = log10(6/4) = 0.18
Q = (0, 0, 0.08, 0.78, 0, 0, 0.18)

Terms       Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            157               73              0        0        0       1
Brutus              4              157              0        1        0       0
Caesar            232              227              0        2        1       1
Calpurnia           0               10              0        0        0       0
Cleopatra          57                0              0        0        0       0
mercy               2                0              3        5        5       1
worser              2                0              1        1        1       0
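A sketch of building this query vector from the document frequencies above (vocabulary order: Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser):

from math import log10

N = 6
df = {"Antony": 3, "Brutus": 3, "Caesar": 5, "Calpurnia": 1,
      "Cleopatra": 1, "mercy": 5, "worser": 4}
vocab = ["Antony", "Brutus", "Caesar", "Calpurnia", "Cleopatra", "mercy", "worser"]

query_terms = {"Caesar", "Calpurnia", "worser"}
Q = [round(log10(N / df[t]), 2) if t in query_terms else 0.0 for t in vocab]
print(Q)   # [0.0, 0.0, 0.08, 0.78, 0.0, 0.0, 0.18]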

Document-Query proximity. Measures the "similarity" of a document vector and the query vector, S(Q, Di).
In the Boolean model: S(Q, Di) = 1 if Di contains the terms of Q so as to satisfy the Boolean expression, and S(Q, Di) = 0 otherwise.
In the Vector model: S(Q, Di) is a real value between 0 and 1; the higher the value of S(Q, Di), the more similar the two vectors Q and Di. We can rank the documents using the value S(Q, Di).

Document-Query proximity. Consider the 2-dimensional case, i.e., a vocabulary with 2 terms only, e.g., Brutus and Caesar. Which of D1, D2, and D3 is most similar to Q? The length of the vectors is not important: consider a document d' that contains the same terms as document d, but with term frequencies twice those of d; the lengths of the two vectors are different but the orientations are the same. In this example D2 is twice as long as Q.
[Figure: the vectors D1, D2, D3 and the query Q plotted in the plane with axes Caesar and Brutus]

Document-Query proximity. The similarity S(Q, Di) is measured by the angle between the vectors Di and Q: an angle of 0 corresponds to maximal similarity, and the smaller the angle, the higher the similarity. Technically, we calculate the cosine of the angle between the two vectors. While an angle θ ranges from 0° to 180°, cos θ ranges from 1 to -1, so the higher the cosine value, the smaller the angle. We rank the documents in decreasing order of the cosine of the angle between the document and the query.
[Figure: the same vectors D1, D2, D3 and Q, with axes Caesar and Brutus]

Notes: the cosine function. In a right-angled triangle, cos θ = adjacent/hypotenuse. The cosine of the angle between two lines (vectors) can be calculated from the coordinates of the vectors involved.

Di  Q = (wi1q1) + (wi2q2) + ... + (wivqv) S(Q, Di) S(Q, Di), which is the cosine of the angle between vectors Di and Q, is equal to Di  Q / ( |Di|  |Q| ) where Di  Q is the dot product of Di and Q, i.e., Di  Q = (wi1q1) + (wi2q2) + ... + (wivqv) and |Di| and |Q| are the lengths of the Di and Q, |Di| = wi12 + wi22 + ... + wiv2 |Q| = q12 + q22 + ... + qv2

Properties of S(Q, Di). 0 ≤ S(Q, Di) ≤ 1.
S(Q, Di) = 0 when qj × wij = 0 for all j: then S(Q, Di) = 0 / ( |Di| × |Q| ) = 0.
S(Q, Di) = 1 when Q and Di point in the same direction, e.g. qj = wij for all j: then S(Q, Di) = (q1^2 + q2^2 + ... + qv^2) / (q1^2 + q2^2 + ... + qv^2) = 1.
It's basically 'correlation'.

Consider the query Q: Caesar, Calpurnia, worser. As calculated before, Q = (0, 0, .08, .78, 0, 0, .18). Compute S(Q, Di) for all i, using the tf-idf matrix above, repeated here with the query column:

                 d1 Antony&Cleopatra  d2 Julius Caesar  d3 The Tempest  d4 Hamlet  d5 Othello  d6 Macbeth     Q
Antony    (wi1)        0.96                0.86              0             0          0           0.3        0     (q1)
Brutus    (wi2)        0.48                0.96              0             0.3        0           0          0     (q2)
Caesar    (wi3)        0.27                0.27              0             0.1        0.08        0.08       0.08  (q3)
Calpurnia (wi4)        0                   1.56              0             0          0           0          0.78  (q4)
Cleopatra (wi5)        2.14                0                 0             0          0           0          0     (q5)
mercy     (wi6)        0.1                 0                 0.12          0.13       0.13        0.08       0     (q6)
worser    (wi7)        0.23                0                 0.18          0.18       0.18        0          0.18  (q7)

Example |Q| = .082 + .782 + .182 = 0.8045 |D1| = .962 + .482 +.272 + 2.142 + .12 + .232 = 2.4223 |D2| = .862 + .962 + .272 + 1.562 = 2.0415 |D3| = .122 + .182 = 0.2163 |D4| = .32 + .12 + .132 + .182 = 0.3864 |D5| = .082 + .132 + .182 = 0.236 |D6| = .32 + .082 + .082 = 0.3206 D1 Antony and Cleopatra D2 Julius Caesar D3 The Tempest D4 Hamlet D5 Othello D6 Macbeth Q Antony wi1 0.96 0.86 0.3 q1 Brutus wi2 0.48 q2 wi3 0.27 0.1 0.08 q3 Calpurnia wi4 1.56 0.78 q4 wi5 2.14 q5 mercy wi6 0.12 0.13 q6 worser wi7 0.23 0.18 q7

S(Q, D1) = D1 · Q / (|D1| × |Q|) = (.96×0 + .48×0 + .27×.08 + 0×.78 + 2.14×0 + .1×0 + .23×.18) / (2.4223 × .8045) = 0.0323
S(Q, D2) = D2 · Q / (|D2| × |Q|) = (.86×0 + .96×0 + .27×.08 + 1.56×.78 + 0×0 + 0×0 + 0×.18) / (2.0415 × .8045) = 0.754
Similarly, we compute the other similarities.

Query Q: Caesar, Calpurnia, worser.
S(Q, D1) = 0.0323, S(Q, D2) = 0.754, S(Q, D3) = 0.1862, S(Q, D4) = 0.13, S(Q, D5) = 0.2044, S(Q, D6) = 0.0246
Ranking of the documents according to their similarity to Q (in descending order): D2, D5, D3, D4, D1, D6
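A compact sketch that reproduces this ranking end to end (weights and query vector as in the tables above; the scores agree with the slide up to small rounding differences):

from math import sqrt

W = {   # tf-idf weights per document; term order: Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser
    "D1": [0.96, 0.48, 0.27, 0,    2.14, 0.1,  0.23],
    "D2": [0.86, 0.96, 0.27, 1.56, 0,    0,    0   ],
    "D3": [0,    0,    0,    0,    0,    0.12, 0.18],
    "D4": [0,    0.3,  0.1,  0,    0,    0.13, 0.18],
    "D5": [0,    0,    0.08, 0,    0,    0.13, 0.18],
    "D6": [0.3,  0,    0.08, 0,    0,    0.08, 0   ],
}
Q = [0, 0, 0.08, 0.78, 0, 0, 0.18]   # Caesar, Calpurnia, worser

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

for d in sorted(W, key=lambda d: cosine(Q, W[d]), reverse=True):
    print(d, round(cosine(Q, W[d]), 4))   # D2, D5, D3, D4, D1, D6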

Similarity between documents. Why might we want to do this? Determine the similarity between two documents Di and Dj using the same method as before: S(Di, Dj) = Di · Dj / (|Di| × |Dj|).
E.g., S(D1, D2) = (.96×.86 + .48×.96 + .27×.27 + 0×1.56 + 2.14×0 + .1×0 + .23×0) / (2.4223 × 2.0415) = 0.2749

Exercise Calculate the document similarities for the previous table

Pseudo code for computing cosine scores. Assume the postings of term tj contain the frequencies of tj in each document di, i.e., tfij, as well as the number of documents tj appears in, i.e., dfj, so that wij and qj can be computed. Assume |Di| for all i has been computed and stored in Length[ ]. The algorithm returns the top K documents by cosine score.

CosineScore(q)
  float Scores[N] = 0
  Initialize Length[N]                  {i.e., |Di|}
  for each query term tj do
    fetch postings list for tj and calculate qj
    for each document di in the postings list do
      Scores[i] += qj × wij
  Read the array Length[ ]
  for each i = 1 to N do
    Scores[i] = Scores[i] / Length[i]
  return Top K components of Scores[ ]
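A runnable sketch of the same algorithm in Python, assuming a small in-memory index; the postings structure and names below are illustrative, not taken from the slides:

from math import log10
import heapq

def cosine_scores(query_terms, postings, N, lengths, K=3):
    # postings: term -> list of (doc_id, tf); lengths: doc_id -> |Di|
    scores = {}
    for t in query_terms:
        plist = postings.get(t, [])
        if not plist:
            continue
        df = len(plist)                               # number of documents containing t
        q_t = log10(N / df)                           # query weight (idf only, tf part is 1)
        for doc, tf in plist:
            w = (1 + log10(tf)) * log10(N / df)       # tf-idf weight of t in doc
            scores[doc] = scores.get(doc, 0.0) + q_t * w
    for doc in scores:
        scores[doc] /= lengths[doc]                   # normalise by document length, as in the pseudo code
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])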

tf-idf weighting has many variants: the tf-idf weight is the product of a term-frequency component and a document-frequency component, and there are several choices for each.

Term frequency tf
  natural:      tfij
  Boolean:      1 if tfij > 0, 0 otherwise
  logarithm*:   1 + log10(tfij)

Document frequency idf
  none:           1
  log (inverse*): log10(N / dfj)

* used in the previous examples

Exercise
D1 = "I like databases"
D2 = "I hate hate databases"
D3 = "I like coffee"
D4 = "Databases of coffee and tea"
D5 = "Mmm, tea and tea biscuits for tea"
Q = "Database of coffee hate(rs)"
(i) Construct the term frequency matrix for the documents.
(ii) Rank the documents according to their relevance to the query Q.