# Zoekmachines Lecture 4 (Gertjan van Noord, 2014)


## Overview (Chapter 6)
- Scoring
- Parametric and zone indexes
- Weighted zone scoring (skip 6.1.2 and 6.1.3)
- Term frequency and weighting
- Vector model, queries as vectors
- Computing scores (skip 6.4)

## Formulas and symbols
- Σ : sum over array elements; Π : product over array elements
- √ : square root: √x = y means y² = x; √1 = 1, √4 = 2, √10 ≈ 3.16, √100 = 10
- log : logarithm (base 10): log x = y means 10^y = x; log 1 = 0, log 4 ≈ 0.6, log 10 = 1, log 100 = 2
- Note how slowly the logarithm grows as the numbers increase
- |x| : the number of elements of x
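The square-root and logarithm values above can be checked with Python's `math` module:

```python
import math

# Square roots from the slide:
print(math.sqrt(10))    # ~3.162
print(math.sqrt(100))   # 10.0

# Base-10 logarithms from the slide:
print(math.log10(4))    # ~0.602
print(math.log10(100))  # 2.0
```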

## Parametric indexes
- Metadata can be attached to a digital document in fields: format, date, author, title, keywords, …
- Dublin Core: a set of metadata fields for digital libraries
- Extra parametric indexes are built for these fields, each field with its own (limited) dictionary
- The user interface should accommodate querying of metadata next to text queries
- The system must combine the query parts for retrieval

## Zone indexes
- In the (free) text of a document itself, different zones can be recognized: author, title, date of publication, abstract, keywords, body, …
- Difference between a parametric index and a zone index: parametric fields take values from a limited vocabulary, while zones contain arbitrary free text

## Zone indexes (continued)
If documents have a common outline, zones can be separated in the index in two ways:
1. Different index terms per zone: william.author, william.body
2. Different postings, one per (doc, zone) pair: william -> [1.author -> 2.author -> 2.body -> …]
Again: the user interface and retrieval must be adapted.
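The two encodings can be sketched as plain Python dictionaries (hypothetical toy data, not the book's on-disk format):

```python
# 1. Zone in the dictionary: each term.zone is its own index term.
index_by_term = {
    "william.author": [1, 2],   # doc IDs whose author zone contains "william"
    "william.body":   [2],
}

# 2. Zone in the postings: one posting per (doc, zone) pair.
index_by_posting = {
    "william": [(1, "author"), (2, "author"), (2, "body")],
}

# Both encodings answer: which docs have "william" in the author zone?
docs_a = set(index_by_term["william.author"])
docs_b = {doc for doc, zone in index_by_posting["william"] if zone == "author"}
assert docs_a == docs_b == {1, 2}
```

The first encoding blows up the dictionary; the second keeps one dictionary entry per term and moves the zone information into the postings.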

## Weighted zone scoring
- Zones can be used by the system for a ranked Boolean retrieval (without zone queries)
- A document with a query word in the title is more relevant
- Each zone gets a partial weight (the weights sum to 1)
- The score of each zone of a doc for a query is its Boolean score (0/1) * the zone weight
- The score of the doc for the query is the sum of its zone scores (0 <= docscore <= 1), so docs can be ordered

## Example
Query: pie AND cream; zones: 1 = title, 2 = abstract; g₁ = 0.6, g₂ = 0.4; s = AND.

| Zone | Document D1 | g | s | g * s |
|------|-------------|-----|---|-------|
| 1. Title | apple pie | 0.6 | 0 | 0 |
| 2. Abstract | pie cream | 0.4 | 1 | 0.4 |

Total score(D1, q) = 0.4. What if one word is in the title and the other in the abstract?
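A minimal sketch of weighted zone scoring, reproducing the example (the function name and the dict-based document representation are my own, not from the slides):

```python
def weighted_zone_score(doc_zones, query_terms, weights):
    """Sum over zones of (zone weight) * (Boolean AND score of the zone)."""
    score = 0.0
    for zone, text in doc_zones.items():
        tokens = text.split()
        matches = all(t in tokens for t in query_terms)  # s = AND
        score += weights[zone] * (1 if matches else 0)
    return score

# Document D1 from the example above:
d1 = {"title": "apple pie", "abstract": "pie cream"}
g = {"title": 0.6, "abstract": 0.4}
print(weighted_zone_score(d1, ["pie", "cream"], g))  # 0.4
```

This also answers the closing question: with AND evaluated per zone, a document with "pie" only in the title and "cream" only in the abstract scores 0 in every zone, hence 0 overall.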

## Summary: simple Boolean model
- Only the presence of a term in a document is stored
- Conjunctive query: select docs that contain all query terms
- Disjunctive query: select docs that contain at least one of the query terms
- Arbitrary Boolean query: select docs in which the query term distribution satisfies the query, using the operators AND, OR and NOT
- Binary decision: documents are selected or not; no ranking

## Problems with the Boolean model
- Boolean systems give too many or too few results
- Only useful for specific search in a homogeneous corpus
- Boolean operators are difficult for many users
- A more flexible system is needed to let anyone search in a huge heterogeneous corpus of documents
- Users want to type in just some query words, without operators

## How can we improve the system?
We need methods to rank the results containing some or all query terms.
- Is there more than presence/absence?
- Is each term equally important?

## Vector model
- Terms in a document (and in a query) receive a score
- The scores are combined to calculate, for each document, its similarity with the query
- Using the similarity scores, the result set of documents containing one or more query terms can be ranked
- How can we calculate a score for a term? For a term in a document? For a term in general?

## Term weighting
First idea: use the frequency of the query terms in the document.

## Documents as vectors
Documents are represented as a vector of weights for each term of the dictionary: V(Doc1) = (8, 2, 2, 10, 2, …). Only a few of the dictionary terms are shown below. Using term frequencies as weights, which document scores best for the query "food of ape"?

|       | ape | child | food | of | panther |
|-------|-----|-------|------|----|---------|
| Doc1  | 8   | 2     | 2    | 10 | 2       |
| Doc2  | 1   | 5     | 9    | 2  | 0       |
| …     |     |       |      |    |         |
| Query | 1   | 0     | 1    | 1  | 0       |
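With raw term frequencies as weights, a document's score is just the sum of the tf values of the query terms. A sketch with the table above (variable names are my own):

```python
vocab = ["ape", "child", "food", "of", "panther"]
doc1  = [8, 2, 2, 10, 2]
doc2  = [1, 5, 9, 2, 0]
query = [1, 0, 1, 1, 0]   # "food of ape"

def tf_score(doc, q):
    # Sum the term frequencies of the terms that occur in the query.
    return sum(d for d, qi in zip(doc, q) if qi)

print(tf_score(doc1, query))  # 20  (8 + 2 + 10)
print(tf_score(doc2, query))  # 12  (1 + 9 + 2)
```

Doc1 wins mainly thanks to ten occurrences of the uninformative word "of", which motivates the idf weighting introduced next.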

## Scores for terms in documents
- tf_{t,d} : term frequency of t in d: how many tokens of term t occur in document d?
- df_t : document frequency of t: how many documents contain term t?
- df_t / N : relative document frequency of t: how many docs with term t, divided by the total number of docs (N)
- BUT: does a high document frequency make a term important? No: frequent terms (like "of") discriminate poorly
- idf_t : inverse document frequency of t: N/df_t; the higher df_t is, the lower the inverse
- tf_{t,d} · idf_t : score for a term in a document

## Score for a query-document combination
score(q, d) = Σ_{t ∈ q} tf_{t,d} · idf_t
- q : query; d : document
- t : each term that is an element of query q
- tf_{t,d} : term frequency of term t in doc d
- idf_t : inverse document frequency of t: log(N/df_t)
- N : number of documents in the collection

## Example
Assume: idf_ape = 5, idf_food = 2, idf_of = 0.001.
- score("food of ape", Doc1) = ?
- score("food of ape", Doc2) = ?

|       | ape | child | food | of | panther |
|-------|-----|-------|------|----|---------|
| Doc1  | 8   | 2     | 2    | 10 | 2       |
| Doc2  | 1   | 5     | 9    | 2  | 0       |
| …     |     |       |      |    |         |
| Query | 1   | 0     | 1    | 1  | 0       |

## Example (solution)
- score("food of ape", Doc1) = 8·5 + 2·2 + 10·0.001 ≈ 44
- score("food of ape", Doc2) = 1·5 + 9·2 + 2·0.001 ≈ 23

|       | ape | child | food | of | panther |
|-------|-----|-------|------|----|---------|
| Doc1  | 8   | 2     | 2    | 10 | 2       |
| Doc2  | 1   | 5     | 9    | 2  | 0       |
| …     |     |       |      |    |         |
| Query | 1   | 0     | 1    | 1  | 0       |
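The tf-idf scores above can be reproduced with a few lines of Python (the idf values are the slide's assumptions, not computed from a real collection):

```python
vocab = ["ape", "child", "food", "of", "panther"]
idf   = {"ape": 5, "food": 2, "of": 0.001}   # assumed idf values from the slide
doc1  = dict(zip(vocab, [8, 2, 2, 10, 2]))
doc2  = dict(zip(vocab, [1, 5, 9, 2, 0]))

def tfidf_score(doc, query_terms):
    # score(q, d) = sum over query terms of tf(t, d) * idf(t)
    return sum(doc[t] * idf[t] for t in query_terms)

print(tfidf_score(doc1, ["food", "of", "ape"]))  # 44.01, ~44
print(tfidf_score(doc2, ["food", "of", "ape"]))  # 23.002, ~23
```

Thanks to the tiny idf of "of", Doc2's nine occurrences of "food" now count for more than Doc1's ten occurrences of "of".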

## What about the query terms?
The query vector can also have weights; then, for each term, we multiply the two weights:
score(q, d) = Σ_{t ∈ q} w_{t,q} · w_{t,d}
Often the query and the documents do not use the same weighting scheme, so in general the two weight functions may differ.

## Document length
Problem: the current approach does not take document length into account.
Solution: use length-normalized term weights instead of raw term frequencies.
For vector length we use the Euclidean length: the length of a vector V with n elements is the square root of the sum of the squares of all its elements:
|V| = √(v₁² + v₂² + … + vₙ²)

## Example
Compute normalized frequencies for Figure 6.9.
The next couple of slides are from http://www.stanford.edu/class/cs276/

## Documents as vectors (Sec. 6.3)
- So we have a |V|-dimensional vector space
- Terms are the axes of the space
- Documents are points, or vectors, in this space
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors: most entries are zero

## Queries as vectors
- Key idea 1: do the same for queries: represent them as vectors in the space
- Key idea 2: rank documents according to their proximity to the query in this space
  - proximity = similarity of vectors
  - proximity ≈ inverse of distance
- Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model; instead, rank more relevant documents higher than less relevant documents

## Formalizing vector space proximity
First cut: the distance between two points (= the distance between the end points of the two vectors). Euclidean distance?
Euclidean distance is a bad idea, because it is large for vectors of different lengths.

## Why distance is a bad idea
The Euclidean distance between q and d₂ can be large even though the distribution of terms in the query q and the distribution of terms in the document d₂ are very similar.

## Use angle instead of distance
- Thought experiment: take a document d and append it to itself; call this document d′
- "Semantically" d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity
- Key idea: rank documents according to their angle with the query

## Length normalization
- A vector can be length-normalized by dividing each of its components by its length; for this we use the L₂ norm: |V|₂ = √(Σᵢ vᵢ²)
- Dividing a vector by its L₂ norm makes it a unit (length) vector (on the surface of the unit hypersphere)
- Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization
- Long and short documents now have comparable weights
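A minimal sketch of L₂ normalization; appending a document to itself doubles every term frequency, and after normalization the two vectors coincide:

```python
import math

def l2_normalize(v):
    # Divide each component by the vector's Euclidean (L2) length.
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

d  = [3.0, 4.0]          # toy tf vector, length 5
dd = [6.0, 8.0]          # d appended to itself: every tf doubled, length 10
print(l2_normalize(d))   # [0.6, 0.8]
print(l2_normalize(dd))  # [0.6, 0.8]: identical after normalization
```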

## cosine(query, document)
cos(q, d) = (q · d) / (|q| |d|) = Σᵢ qᵢ dᵢ / (√(Σᵢ qᵢ²) · √(Σᵢ dᵢ²))
- the numerator is the dot product; the denominator makes q and d unit vectors
- qᵢ is the tf-idf weight of term i in the query
- dᵢ is the tf-idf weight of term i in the document
- cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d

## Cosine for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
cos(q, d) = q · d = Σᵢ qᵢ dᵢ
for q, d length-normalized.
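The equivalence is easy to check numerically: the full cosine of two raw vectors equals the plain dot product of their normalized versions (toy vectors, my own choice):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(q, d):
    # Dot product divided by the product of the two L2 lengths.
    return dot(q, d) / (math.sqrt(dot(q, q)) * math.sqrt(dot(d, d)))

q = [0.0, 1.0, 1.0]
d = [1.0, 2.0, 3.0]
qn = [x / math.sqrt(dot(q, q)) for x in q]   # length-normalized q
dn = [x / math.sqrt(dot(d, d)) for x in d]   # length-normalized d

# For normalized vectors the cosine reduces to the dot product:
print(abs(cosine(q, d) - dot(qn, dn)) < 1e-12)  # True
```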

## Example
Compute weights for Figure 6.13, for the query "jealous gossip":
- v(q) = (0, 1, 1), not normalized
- v(q) = (0, 0.707, 0.707) after normalization (why? the length of (0, 1, 1) is √2, and 1/√2 ≈ 0.707)
- v(SaS) = (0.996, 0.087, 0.017)
- v(PaP) = (0.993, 0.120, 0)
- v(WH) = (0.874, 0.466, 0.254)
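Since all four vectors are already length-normalized, ranking the three novels is just a dot product per document. A sketch with the slide's numbers (I assume the three dimensions are the terms affection, jealous, gossip, as in the book's running example):

```python
v_q   = (0.0, 0.707, 0.707)          # "jealous gossip", length-normalized
v_SaS = (0.996, 0.087, 0.017)        # Sense and Sensibility
v_PaP = (0.993, 0.120, 0.0)          # Pride and Prejudice
v_WH  = (0.874, 0.466, 0.254)        # Wuthering Heights

def dot(a, b):
    # For length-normalized vectors, cosine similarity = dot product.
    return sum(x * y for x, y in zip(a, b))

for name, v in [("SaS", v_SaS), ("PaP", v_PaP), ("WH", v_WH)]:
    print(name, round(dot(v_q, v), 3))
# WH scores highest (~0.509), so Wuthering Heights ranks first.
```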