
Slide 1: CSE3201/CSE4500 Term Weighting

Slide 2: Weighting Terms
Having decided on a set of terms for indexing, we need to consider whether all terms should be given the same significance. If not, how should we decide on their significance? A significant term:
– represents the content of a document
– discriminates one document from the others

Slide 3: Weighting Terms - tf
Let tf_ij be the term frequency of term i in document j. The more often a term appears in a document, the more likely it is to be a highly significant index term.

Slide 4: Weighting Terms - df & idf
Let df_i be the document frequency of the i-th term, i.e. the number of documents that contain term i. Since a term's significance increases as its document frequency decreases, we use the inverse document frequency:
idf_i = log_e(N / df_i)
where N is the number of documents in the database and log_e is the natural logarithm (ln on a calculator).

Slide 5: Weighting Terms - tf.idf
The two indicators above are very often multiplied together to form the "tf.idf" weight:
w_ij = tf_ij * idf_i
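As a concrete illustration of this formula, here is a minimal Python sketch (not from the slides; the function name tf_idf_weights and its arguments are our own) that computes w_ij for a small collection:

```python
import math
from collections import Counter

def tf_idf_weights(index_sets, dictionary):
    """Compute w_ij = tf_ij * idf_i for every document j and term i.

    index_sets: one list of index terms per document (terms may repeat,
                so their count gives tf_ij).
    dictionary: the set of all index terms in the collection.
    """
    n_docs = len(index_sets)
    # df_i: number of documents containing term i
    df = {t: sum(1 for doc in index_sets if t in doc) for t in dictionary}
    # idf_i = ln(N / df_i), only defined for terms that occur somewhere
    idf = {t: math.log(n_docs / df[t]) for t in dictionary if df[t] > 0}

    weights = []
    for doc in index_sets:
        tf = Counter(doc)                               # tf_ij
        weights.append({t: tf[t] * idf[t] for t in tf}) # w_ij = tf_ij * idf_i
    return weights
```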

Slide 6: Example
Consider a 5-document collection:
D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"

Slide 7: Example - Cont.
We might generate the following index sets:
I1 = (dog, eat, cat)
I2 = (dog, mouse)
I3 = (mouse, eat)
I4 = (cat, play, rat, mouse)
I5 = (cat, play)
System dictionary: (cat, dog, eat, mouse, play, rat)

Slide 8: Example - Cont.
df_cat = 3, idf_cat = ln(5/3) = 0.51
df_dog = 2, idf_dog = ln(5/2) = 0.91
df_eat = 2, idf_eat = ln(5/2) = 0.91
df_mouse = 3, idf_mouse = ln(5/3) = 0.51
df_play = 2, idf_play = ln(5/2) = 0.91
df_rat = 1, idf_rat = ln(5/1) = 1.61

Slide 9: Example - Cont.
I1 (dog, eat, cat):
– w_cat = tf_cat * idf_cat = 1 * 0.51 = 0.51
– w_dog = tf_dog * idf_dog = 1 * 0.91 = 0.91
– w_eat = tf_eat * idf_eat = 2 * 0.91 = 1.82
I2 (dog, mouse):
– w_dog = tf_dog * idf_dog = 1 * 0.91 = 0.91
– w_mouse = tf_mouse * idf_mouse = 1 * 0.51 = 0.51

Slide 10: Example - Cont.
I3 (mouse, eat):
– w_mouse = tf_mouse * idf_mouse = 1 * 0.51 = 0.51
– w_eat = tf_eat * idf_eat = 1 * 0.91 = 0.91
I4 (cat, play, rat, mouse):
– w_cat = tf_cat * idf_cat = 1 * 0.51 = 0.51
– w_play = tf_play * idf_play = 1 * 0.91 = 0.91
– w_rat = tf_rat * idf_rat = 1 * 1.61 = 1.61
– w_mouse = tf_mouse * idf_mouse = 1 * 0.51 = 0.51

Slide 11: Example - Cont.
I5 (cat, play):
– w_cat = tf_cat * idf_cat = 2 * 0.51 = 1.02
– w_play = tf_play * idf_play = 1 * 0.91 = 0.91

Slide 12: Example - Cont.
Dictionary: (cat, dog, eat, mouse, play, rat)
Weights:
I1 = [cat(0.51), dog(0.91), eat(1.82), 0, 0, 0]
I2 = [0, dog(0.91), 0, mouse(0.51), 0, 0]
I3 = [0, 0, eat(0.91), mouse(0.51), 0, 0]
I4 = [cat(0.51), 0, 0, mouse(0.51), play(0.91), rat(1.61)]
I5 = [cat(1.02), 0, 0, 0, play(0.91), 0]
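For reference, these weight vectors can be reproduced with a short script, assuming the toy collection is represented as lists of index terms with repeats standing in for tf (the variable names below are ours, not from the slides):

```python
import math
from collections import Counter

dictionary = ["cat", "dog", "eat", "mouse", "play", "rat"]
index_sets = [
    ["dog", "eat", "eat", "cat"],        # I1 ("eat" occurs twice in D1)
    ["dog", "mouse"],                    # I2
    ["mouse", "eat"],                    # I3
    ["cat", "play", "rat", "mouse"],     # I4
    ["cat", "cat", "play"],              # I5 ("cat" occurs twice in D5)
]

N = len(index_sets)
idf = {t: math.log(N / sum(1 for d in index_sets if t in d)) for t in dictionary}

for name, doc in zip(["I1", "I2", "I3", "I4", "I5"], index_sets):
    tf = Counter(doc)
    vector = [round(tf[t] * idf[t], 2) for t in dictionary]
    print(name, vector)
# e.g. I1 [0.51, 0.92, 1.83, 0.0, 0.0, 0.0]
# The slides round idf to two decimals before multiplying, which is why
# they show 0.91 and 1.82 where this prints 0.92 and 1.83.
```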

Slide 13: A Larger Example
Doc 1: The problem of how to describe documents for retrieval is called indexing.
Doc 2: It is possible to use a document as its own index.
Doc 3: The problem is that a document will exactly match only one query, namely the document itself.
Doc 4: The purpose of indexing then is to provide a description of a document so that it can be retrieved with queries that concern the same subject as the document.
Doc 5: It must be a sufficiently specific description so that the document will not be returned for queries unrelated to the document.

Slide 14: A Larger Example
Doc 6: A simple way of indexing a document is to give a single code from a predefined set.
Doc 7: We have the task of describing how we are going to match queries against documents.
Doc 8: The vector space model creates a space in which both documents and queries are represented by vectors.
Doc 9: A vector is obtained for each document and query from sets of index terms with associated weights.
Doc 10: In order to compare the similarity of these vectors, we may measure the angle between them.

Slide 15: A Larger Example
If we index these documents using all words not on a stop list, we might obtain (a (*) marks a term that occurs more than once in the document):
D1 - problem, describe, documents, retrieval, called, indexing
D2 - possible, document, own, index
D3 - problem, document (*), exactly, match, one, query, namely
D4 - purpose, indexing, provide, description, document (*), retrieved, queries, concern, subject
D5 - sufficiently, specific, description, document (*), returned, queries, unrelated

Slide 16: A Larger Example
If we index these documents using all words not on a stop list, we might obtain:
D6 - simple, way, indexing, document, give, single, code, predefined, list
D7 - task, describing, going, match, queries, against, documents
D8 - vector (*), space (*), model, creates, documents, queries, represented
D9 - vector, obtained, document, query, sets, index, terms, associated, weights
D10 - order, compare, similarity, vectors, measure, angle

Slide 17: A Larger Example
We may now choose to stem the terms, which may leave us with:
D1 - problem, describ, docu, retriev, call, index
D2 - possibl, docu, own, index
D3 - problem, docu (*), exact, match, on, quer, name
D4 - purpos, index, provid, descript, docu (*), retriev, quer, concern, subject
D5 - suffic, specif, descript, docu (*), return, quer, unrelat

Slide 18: A Larger Example
We may now choose to stem the terms, which may leave us with:
D6 - simpl, way, index, docu, giv, singl, cod, predefin, list
D7 - task, describ, go, match, quer, against, docu
D8 - vect (*), spac (*), model, creat, docu, quer, represent
D9 - vect, obtain, docu, quer, set, index, terms, associat, weight
D10 - order, compar, similarit, vect, measur, angle
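The stems on these slides are hand-made truncations for illustration. A real system would typically use a suffix-stripping algorithm such as Porter's; a minimal sketch using NLTK's PorterStemmer (assuming the nltk package is installed) is shown below. Its output differs slightly from the slides' abbreviated stems.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Doc 8 after stop-word removal, before stemming
doc8 = ["vector", "space", "model", "creates", "documents", "queries", "represented"]
print([stemmer.stem(w) for w in doc8])
# roughly: ['vector', 'space', 'model', 'creat', 'document', 'queri', 'repres']
```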

Slide 19: Document Frequencies
(Table of document frequencies for the stemmed terms; not reproduced in the transcript.)

Slide 20: A Larger Example
We can now calculate the weights of the terms of one of the documents. For document 8, using the tf.idf formula, we give the terms the following weights:
vect (2.41), spac (4.60), model (2.30), creat (2.30), docu (0.22), quer (0.51), represent (2.30)

Slide 21: CSE3201/CSE4500 Information Retrieval Systems - Retrieval Model

Slide 22: Retrieval Process
(Diagram of the retrieval process; not reproduced in the transcript.)

Slide 23: Retrieval Paradigms
How do we match?
– Produce non-ranked output
  > Boolean retrieval
– Produce ranked output
  > vector space model
  > probabilistic retrieval

Slide 24: Advantages of Ranking
Good control over how many documents are viewed by a user.
Good control over the order in which documents are viewed by a user.
The first documents that are viewed may help modify the order in which later documents are viewed.
– The main disadvantage is computational cost.

Slide 25: Boolean Retrieval
A query is a set of terms combined by the Boolean connectives "and", "or" and "not".
– e.g. FIND (document OR information) AND retrieval AND (NOT (information AND systems))
Each document is matched against this query and either matches (TRUE) or it doesn't (FALSE).

Slide 26: An Example
Consider the following document collection:
D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"
indexed by:
D1 = dog, eat, cat
D2 = dog, mouse
D3 = mouse, eat
D4 = cat, play, rat, mouse
D5 = cat, play

Slide 27: An Example
The Boolean query
(cat AND dog) returns D1
(cat OR (dog AND eat)) returns D1, D4, D5
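A minimal sketch of evaluating such Boolean queries over the index sets above (the dictionary layout and helper function are ours, not from the slides):

```python
# Each document is represented by its set of index terms.
index = {
    "D1": {"dog", "eat", "cat"},
    "D2": {"dog", "mouse"},
    "D3": {"mouse", "eat"},
    "D4": {"cat", "play", "rat", "mouse"},
    "D5": {"cat", "play"},
}

def matches(predicate):
    """Return the documents whose term sets satisfy the Boolean predicate."""
    return [d for d, terms in index.items() if predicate(terms)]

# (cat AND dog)
print(matches(lambda ts: "cat" in ts and "dog" in ts))                    # ['D1']
# (cat OR (dog AND eat))
print(matches(lambda ts: "cat" in ts or ("dog" in ts and "eat" in ts)))   # ['D1', 'D4', 'D5']
```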

Slide 28: Problems with Boolean Retrieval
No ranking
– users must fuss with retrieved-set size and structural reformulation
– users must scan the entire retrieved set
No weights on query terms and document terms
– no discrimination between significant and non-significant terms/words
Writing a query can be difficult for novice users.

Slide 29: Any Good News for Boolean?
Yes. Advantages:
– conceptually simple
– computationally inexpensive
– commercially available

Slide 30: The Vector Space Model
Each document and query is represented by a vector. A vector is obtained for each document and query from sets of index terms with associated weights. The document and query representatives are considered as vectors in n-dimensional space, where n is the number of unique terms in the dictionary/document collection.
Measuring vector similarity:
– inner product
– value of the cosine of the angle between the two vectors

Slide 31: Vector Space
Assume that a document is represented by vector D and the query is represented by vector Q. The total number of terms in the dictionary is n. The similarity between D and Q can be measured by the angle θ between them.

Slide 32: Vectors
A·B = |A| |B| cos θ
A = (a_1, a_2, a_3, …, a_n), B = (b_1, b_2, b_3, …, b_n)
A·B = a_1 b_1 + a_2 b_2 + a_3 b_3 + … + a_n b_n
The magnitude of a vector A = (a_1, a_2, a_3, …, a_n) is defined as
|A| = sqrt(a_1^2 + a_2^2 + … + a_n^2)

Slide 33: Cosine
The similarity between D and Q can be written as:
cos θ = (D·Q) / (|D| |Q|)
Using the weights of the terms as the components of D and Q, with w_iD and w_iQ the weight of term i in the document and the query respectively:
cos θ = (Σ_i w_iD * w_iQ) / ( sqrt(Σ_i w_iD^2) * sqrt(Σ_i w_iQ^2) )

Slide 34: Inner Product
The inner-product similarity is simply the numerator of the cosine measure:
sim(D, Q) = D·Q = Σ_i w_iD * w_iQ
(the vectors are not normalised by their lengths).

Slide 35: Simple Example (1)
Assume:
– there are 2 terms in the dictionary (t1, t2)
– Doc-1 contains t1 and t2, with weights 0.5 and 0.3 respectively
– Doc-2 contains t1 with weight 0.6
– Doc-3 contains t2 with weight 0.4
– the query contains t2 with weight 0.5

Slide 36: Simple Example (2)
The vectors for the query and documents:
Doc#  w_t1  w_t2
1     0.5   0.3
2     0.6   0
3     0     0.4
Doc-1 = (0.5, 0.3)
Doc-2 = (0.6, 0)
Doc-3 = (0, 0.4)
Query = (0, 0.5)

Slide 37: Simple Example - Inner Product
Doc#  w_t1  w_t2
1     0.5   0.3
2     0.6   0
3     0     0.4
Query = (0, 0.5)
D1 = 0.5 x 0 + 0.3 x 0.5 = 0.15
D2 = 0.6 x 0 + 0 x 0.5 = 0
D3 = 0 x 0 + 0.4 x 0.5 = 0.2
Ranking: D3, D1, D2
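A short sketch reproducing this inner-product ranking (variable and function names are ours):

```python
docs = {"D1": (0.5, 0.3), "D2": (0.6, 0.0), "D3": (0.0, 0.4)}
query = (0.0, 0.5)

def inner_product(d, q):
    """Dot product of two equal-length weight vectors."""
    return sum(wd * wq for wd, wq in zip(d, q))

scores = {name: inner_product(vec, query) for name, vec in docs.items()}
print(scores)                                          # {'D1': 0.15, 'D2': 0.0, 'D3': 0.2}
print(sorted(scores, key=scores.get, reverse=True))    # ['D3', 'D1', 'D2']
```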

Slide 38: Simple Example - Cosine
Similarity measured between the query Q and:
Doc-1: 0.15 / (sqrt(0.5^2 + 0.3^2) x 0.5) ≈ 0.51
Doc-2: 0
Doc-3: 0.2 / (0.4 x 0.5) = 1.0
Ranked output: D3, D1, D2
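And the corresponding cosine ranking as a sketch, under the same assumptions:

```python
import math

docs = {"D1": (0.5, 0.3), "D2": (0.6, 0.0), "D3": (0.0, 0.4)}
query = (0.0, 0.5)

def cosine(d, q):
    """Cosine of the angle between two weight vectors (0 if either is all-zero)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

scores = {name: round(cosine(vec, query), 2) for name, vec in docs.items()}
print(scores)   # {'D1': 0.51, 'D2': 0.0, 'D3': 1.0} -> ranking D3, D1, D2
```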

Slide 39: Large Example (1)
Consider the same five-document collection:
D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"
indexed by:
I1 = (dog, eat, cat)
I2 = (dog, mouse)
I3 = (mouse, eat)
I4 = (cat, play, rat, mouse)
I5 = (cat, play)

Slide 40: Large Example (2)
The set of all terms (dictionary): (cat, dog, eat, mouse, play, rat)
Using tf.idf weights, we obtain:
I1 = (cat(0.51), eat(1.82), dog(0.91))
I2 = (dog(0.91), mouse(0.51))
I3 = (mouse(0.51), eat(0.91))
I4 = (cat(0.51), play(0.91), rat(1.61), mouse(0.51))
I5 = (cat(1.02), play(0.91))

Slide 41: Large Example (3)
In the vector space model, with dictionary order (cat, dog, eat, mouse, play, rat), we obtain the vectors:
V1 = (0.51, 0.91, 1.82, 0.00, 0.00, 0.00)
V2 = (0.00, 0.91, 0.00, 0.51, 0.00, 0.00)
V3 = (0.00, 0.00, 0.91, 0.51, 0.00, 0.00)
V4 = (0.51, 0.00, 0.00, 0.51, 0.91, 1.61)
V5 = (1.02, 0.00, 0.00, 0.00, 0.91, 0.00)
This is a 6-dimensional space, one dimension per term.

Slide 42: Inner Product
The query "what do cats play with?" forms the query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00).
D1 = 0.51x0.51 + 0x0.91 + 0x1.82 + 0x0 + 0x0.91 + 0x0 = 0.2601
D2 = 0.00x0.51 + 0.91x0 + 0x0 + 0.51x0 + 0x0.91 + 0x0 = 0
D3 = 0.00x0.51 + 0x0 + 0.91x0 + 0.51x0 + 0x0.91 + 0x0 = 0
D4 = 0.51x0.51 + 0x0 + 0x0 + 0.51x0 + 0.91x0.91 + 1.61x0 = 1.0882
D5 = 1.02x0.51 + 0x0 + 0x0 + 0x0 + 0.91x0.91 + 0x0 = 1.3483
Ranking: D5, D4, D1, D2, D3

Slide 43: Cosine Similarity
The query "what do cats play with?" forms the query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00). Using the cosine measure, we obtain the following similarity values:
D1 = 0.51^2 / [(0.51^2 + 0.91^2)^0.5 x (0.51^2 + 0.91^2 + 1.82^2)^0.5]
D2 = 0.0
D3 = 0.0
D4 = (0.51^2 + 0.91^2) / [(0.51^2 + 0.91^2)^0.5 x (0.51^2 + 0.51^2 + 0.91^2 + 1.61^2)^0.5]
D5 = (0.51 x 1.02 + 0.91^2) / [(0.51^2 + 0.91^2)^0.5 x (1.02^2 + 0.91^2)^0.5]
Thus we obtain the ranking: D5, D4, D1, D2, D3 (or D3, D2)
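For completeness, a self-contained sketch that reproduces both rankings for this query (vector layout as on slide 41; names are ours):

```python
import math

vectors = {
    "D1": (0.51, 0.91, 1.82, 0.00, 0.00, 0.00),
    "D2": (0.00, 0.91, 0.00, 0.51, 0.00, 0.00),
    "D3": (0.00, 0.00, 0.91, 0.51, 0.00, 0.00),
    "D4": (0.51, 0.00, 0.00, 0.51, 0.91, 1.61),
    "D5": (1.02, 0.00, 0.00, 0.00, 0.91, 0.00),
}
query = (0.51, 0.00, 0.00, 0.00, 0.91, 0.00)   # "what do cats play with?"

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

for name, score_fn in [("inner product", dot), ("cosine", cosine)]:
    scores = {d: score_fn(v, query) for d, v in vectors.items()}
    print(name, sorted(scores, key=scores.get, reverse=True))
# inner product ['D5', 'D4', 'D1', 'D2', 'D3']
# cosine        ['D5', 'D4', 'D1', 'D2', 'D3']  (D2 and D3 both score 0)
```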

