CSE3201/4500 Information Retrieval Systems

CSE3201/4500 Information Retrieval Systems: Term Weighting

Weighting Terms Having decided on a set of terms for indexing, we need to consider whether all terms should be given the same significance. If not, how should we decide on their significance?

Weighting Terms - tf Let tf_ij be the term frequency of term i in document j. The more frequently a term appears in a document, the more likely it is to be a highly significant index term.

Weighting Terms - df & idf Let df_i be the document frequency of term i, i.e. the number of documents that contain it. Since significance increases as document frequency decreases, we use the inverse document frequency, idf_i = ln(N / df_i), where N is the number of documents in the database and ln is the natural logarithm.

Weighting Terms - tf.idf These two indicators are very often multiplied together to form the "tf.idf" weight, w_ij = tf_ij * idf_i, or, as is now more popular, w_ij = ln(1 + tf_ij) * (1 + idf_i).

Example
Consider a collection of 5 documents:
D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"

Example - Cont.
We might generate the following index sets:
V1 = (dog, eat, cat)
V2 = (dog, mouse)
V3 = (mouse, eat)
V4 = (cat, play, rat, mouse)
V5 = (cat, play)
System dictionary: (cat, dog, eat, mouse, play, rat)

Example - Cont.
df_cat = 3, idf_cat = ln(5/3) = 0.51
df_dog = 2, idf_dog = ln(5/2) = 0.91
df_eat = 2, idf_eat = ln(5/2) = 0.91
df_mouse = 3, idf_mouse = ln(5/3) = 0.51
df_play = 2, idf_play = ln(5/2) = 0.91
df_rat = 1, idf_rat = ln(5/1) = 1.61

Example - Cont.
V1 (cat, eat, dog):
w_cat = tf_cat * idf_cat = 1 * 0.51 = 0.51
w_dog = tf_dog * idf_dog = 1 * 0.91 = 0.91
w_eat = tf_eat * idf_eat = 2 * 0.91 = 1.82
V2 (dog, mouse):
w_dog = tf_dog * idf_dog = 1 * 0.91 = 0.91
w_mouse = tf_mouse * idf_mouse = 1 * 0.51 = 0.51

Example - Cont.
V3 (mouse, eat):
w_mouse = tf_mouse * idf_mouse = 1 * 0.51 = 0.51
w_eat = tf_eat * idf_eat = 1 * 0.91 = 0.91
V4 (cat, mouse, play, rat):
w_cat = tf_cat * idf_cat = 1 * 0.51 = 0.51
w_mouse = tf_mouse * idf_mouse = 1 * 0.51 = 0.51
w_play = tf_play * idf_play = 1 * 0.91 = 0.91
w_rat = tf_rat * idf_rat = 1 * 1.61 = 1.61

Example - Cont.
V5 (cat, play):
w_cat = tf_cat * idf_cat = 2 * 0.51 = 1.02
w_play = tf_play * idf_play = 1 * 0.91 = 0.91

Example - Cont.
Dictionary: (cat, dog, eat, mouse, play, rat)
Weights:
V1 = [cat(0.51), dog(0.91), eat(1.82), 0, 0, 0]
V2 = [0, dog(0.91), 0, mouse(0.51), 0, 0]
V3 = [0, 0, eat(0.91), mouse(0.51), 0, 0]
V4 = [cat(0.51), 0, 0, mouse(0.51), play(0.91), rat(1.61)]
V5 = [cat(1.02), 0, 0, 0, play(0.91), 0]
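To make the arithmetic above concrete, here is a minimal Python sketch (not from the original slides) that recomputes df, idf and the tf.idf weights for this five-document collection:

```python
from math import log

# Term frequencies per document, taken from the index sets V1..V5 above.
docs = {
    "V1": {"dog": 1, "eat": 2, "cat": 1},
    "V2": {"dog": 1, "mouse": 1},
    "V3": {"mouse": 1, "eat": 1},
    "V4": {"cat": 1, "play": 1, "rat": 1, "mouse": 1},
    "V5": {"cat": 2, "play": 1},
}
N = len(docs)  # 5 documents

# df_i: the number of documents containing term i
df = {}
for tf in docs.values():
    for term in tf:
        df[term] = df.get(term, 0) + 1

# idf_i = ln(N / df_i); w_ij = tf_ij * idf_i
idf = {term: log(N / n) for term, n in df.items()}
for name, tf in docs.items():
    print(name, {t: round(f * idf[t], 2) for t, f in tf.items()})
# V1 {'dog': 0.92, 'eat': 1.83, 'cat': 0.51} ...
# (the slides round idf to two decimals before multiplying, hence 0.91 and 1.82)
```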

A Larger Example
Doc 1: The problem of how to describe documents for retrieval is called indexing.
Doc 2: It is possible to use a document as its own index.
Doc 3: The problem is that a document will exactly match only one query, namely the document itself.
Doc 4: The purpose of indexing then is to provide a description of a document so that it can be retrieved with queries that concern the same subject as the document.
Doc 5: It must be a sufficiently specific description so that the document will not be returned for queries unrelated to the document.

A Larger Example
Doc 6: A simple way of indexing a document is to give it a single code from a predefined set.
Doc 7: We have the task of describing how we are going to match queries against documents.
Doc 8: The vector space model creates a space in which both documents and queries are represented by vectors.
Doc 9: A vector is obtained for each document and query from sets of index terms with associated weights.
Doc 10: In order to compare the similarity of these vectors, we may measure the angle between them.

A Larger Example
If we index these documents using all words not on a stop list, we might obtain (a term marked (*) occurs twice in its document):
D1 - problem, describe, documents, retrieval, called, indexing
D2 - possible, document, own, index
D3 - problem, document (*), exactly, match, one, query, namely
D4 - purpose, indexing, provide, description, document (*), retrieved, queries, concern, subject
D5 - sufficiently, specific, description, document (*), returned, queries, unrelated

A Larger Example
D6 - simple, way, indexing, document, give, single, code, predefined, list
D7 - task, describing, going, match, queries, against, documents
D8 - vector (*), space (*), model, creates, documents, queries, represented
D9 - vector, obtained, document, query, sets, index, terms, associated, weights
D10 - order, compare, similarity, vectors, measure, angle

A Larger Example
We may now choose to stem the terms, which may leave us with:
D1 - problem, describ, docu, retriev, call, index
D2 - possibl, docu, own, index
D3 - problem, docu (*), exact, match, on, quer, name
D4 - purpos, index, provid, descript, docu (*), retriev, quer, concern, subject
D5 - suffic, specif, descript, docu (*), return, quer, unrelat

A Larger Example
D6 - simpl, way, index, docu, giv, singl, cod, predefin, list
D7 - task, describ, go, match, quer, against, docu
D8 - vect (*), spac (*), model, creat, docu, quer, represent
D9 - vect, obtain, docu, quer, set, index, terms, associat, weight
D10 - order, compar, similarit, vect, measur, angle

Document Frequencies
(The df table from the original slide is not reproduced in the transcript; the values follow from the stemmed lists above. For the terms used next: quer occurs in 6 of the 10 documents, vect in 3, and spac, model, creat and represent in 1 each. The lists contain docu in 9 documents, though the weight 0.22 below implies the slide's own table used df_docu = 8.)

A Larger Example
We can now calculate the weights of the terms of one of the documents. For document 8, using the tf.idf formula w = tf * ln(N/df) with N = 10, we give the terms the following weights:
vect (2.41), spac (4.60), model (2.30), creat (2.30), docu (0.22), quer (0.51), represent (2.30)
For example, vect occurs twice in D8 and in 3 of the 10 documents, so w_vect = 2 * ln(10/3) = 2.41.
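A quick check of these numbers in Python, assuming N = 10 and the document frequencies noted above (the slides round idf before multiplying, so the last decimals differ slightly in places):

```python
from math import log

N = 10
# (term, tf in Doc 8, df) -- df for docu is taken as 8, as the slide's
# weight 0.22 implies, although the stemmed lists contain it in 9 documents.
doc8 = [("vect", 2, 3), ("spac", 2, 1), ("model", 1, 1), ("creat", 1, 1),
        ("docu", 1, 8), ("quer", 1, 6), ("represent", 1, 1)]
for term, tf, df in doc8:
    print(term, round(tf * log(N / df), 2))
# vect 2.41, spac 4.61, model 2.3, creat 2.3, docu 0.22, quer 0.51, represent 2.3
```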

CSE3201/4500 Information Retrieval Systems: Retrieval Model

Retrieval Process (the process diagram from the original slide is not reproduced in the transcript)

Retrieval Paradigms
How do we match?
Produce non-ranked output:
- Boolean retrieval
Produce ranked output:
- vector space model
- probabilistic retrieval

Advantages of Ranking
- Good control over how many documents a user views.
- Good control over the order in which documents are viewed.
- The first documents viewed may help modify the order in which later documents are viewed.
The main disadvantage is computational cost.

Boolean Retrieval
A query is a set of terms combined by the Boolean connectives "and", "or" and "not", e.g.
FIND (document OR information) AND retrieval AND (NOT (information AND systems))
Each document is matched against the query and either matches (TRUE) or does not (FALSE).

Systems Provide
Most systems provide match information, such as:
FIND (document OR information): 1,000 records found
FIND (document OR information) AND retrieval: 40 records found
FIND (document OR information) AND retrieval AND (NOT (information AND systems)): 10 records found
SHOW

An Example
Consider the following document collection:
D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"
indexed by:
D1 = dog, eat, cat
D2 = dog, mouse
D3 = mouse, eat
D4 = cat, play, rat, mouse
D5 = cat, play

An Example
The Boolean query (cat AND dog) returns D1.
(cat OR (dog AND eat)) returns D1, D4, D5.
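A minimal Python sketch of Boolean matching over these index sets (the helper name and the lambda-as-predicate style are illustrative, not from the slides):

```python
# Index sets from the example above: document -> set of index terms.
index = {
    "D1": {"dog", "eat", "cat"},
    "D2": {"dog", "mouse"},
    "D3": {"mouse", "eat"},
    "D4": {"cat", "play", "rat", "mouse"},
    "D5": {"cat", "play"},
}

def matching(predicate):
    """Return the documents whose term set satisfies the Boolean predicate."""
    return [doc for doc, terms in index.items() if predicate(terms)]

print(matching(lambda t: "cat" in t and "dog" in t))                  # ['D1']
print(matching(lambda t: "cat" in t or ("dog" in t and "eat" in t)))  # ['D1', 'D4', 'D5']
```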

Problem with Boolean
No ranking:
- users must fuss with retrieved-set size through structural reformulation of the query
- users must scan the entire retrieved set
No weights on query terms:
- users cannot give more importance to some terms, e.g. retrieval:2 AND system:1
- users cannot give more importance to some clauses, e.g. retrieval:1 AND (system OR model):2

Problem with Boolean
No weights on document terms:
- no use can be made of the importance of a term in a document (whether it occurs frequently)
- no use can be made of the importance of a term in the collection (whether it occurs rarely)

Any Good News for Boolean?
Yes. Advantages:
- conceptually simple
- computationally inexpensive
- commercially available

Introduction to Vectors
A.B = |A||B| cos θ
For A = (a1, a2, a3, ..., an) and B = (b1, b2, b3, ..., bn):
A.B = a1b1 + a2b2 + a3b3 + ... + anbn
The magnitude of a vector A = (a1, a2, a3, ..., an) is defined as |A| = sqrt(a1^2 + a2^2 + ... + an^2).

Similarity Measures
Inner product: A.B = a1b1 + a2b2 + a3b3 + ... + anbn
Cosine: cos θ = A.B / (|A||B|)

The Vector Space Model
- Each document and query is represented by a vector.
- A vector is obtained for each document and query from sets of index terms with associated weights.
- The document and query representatives are considered as vectors in an n-dimensional space, where n is the number of unique terms in the dictionary/document collection.
- Vector similarity is measured by the inner product, or by the value of the cosine of the angle between the two vectors.

Vector Space
Assume a document is represented by vector D and the query by vector Q. The total number of terms in the dictionary is n. Similarity between D and Q is measured by the angle θ between them.

Inner Product
sim(D, Q) = D.Q = d1q1 + d2q2 + ... + dnqn, where di and qi are the weights of term i in D and Q.

Cosine
The similarity between D and Q can be written as: cos θ = (D.Q) / (|D||Q|)
Using the weights of the terms as the components of D and Q:
cos θ = (d1q1 + d2q2 + ... + dnqn) / (sqrt(d1^2 + ... + dn^2) * sqrt(q1^2 + ... + qn^2))

Simple Example (1)
Assume:
- there are 2 terms in the dictionary (t1, t2)
- Doc-1 contains t1 and t2, with weights 0.5 and 0.3 respectively
- Doc-2 contains t1 with weight 0.6
- Doc-3 contains t2 with weight 0.4
- the query contains t2 with weight 0.5

Simple Example (2)
The vectors for the query and documents:
Doc#  wt1  wt2
1     0.5  0.3
2     0.6  0.0
3     0.0  0.4
Doc-1 = (0.5, 0.3)
Doc-2 = (0.6, 0)
Doc-3 = (0, 0.4)
Query = (0, 0.5)

Simple Example - Inner Product
With Query = (0, 0.5):
D1 = 0.5 x 0 + 0.3 x 0.5 = 0.15
D2 = 0.6 x 0 + 0 x 0.5 = 0
D3 = 0 x 0 + 0.4 x 0.5 = 0.2
Ranking: D3, D1, D2

Simple Example - Cosine
Similarity measured between the Query (Q) and:
Doc-1: cos θ = 0.15 / (sqrt(0.5^2 + 0.3^2) x 0.5) = 0.51
Doc-2: cos θ = 0
Doc-3: cos θ = 0.2 / (0.4 x 0.5) = 1.0
Ranked output: D3, D1, D2
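The two measures side by side, as a small Python sketch applied to this example (function names are illustrative):

```python
from math import sqrt

def inner(a, b):
    """Inner product: sum of pairwise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between a and b."""
    return inner(a, b) / (sqrt(inner(a, a)) * sqrt(inner(b, b)))

docs = {"D1": (0.5, 0.3), "D2": (0.6, 0.0), "D3": (0.0, 0.4)}
query = (0.0, 0.5)

print({d: round(inner(v, query), 2) for d, v in docs.items()})   # D1 0.15, D2 0.0, D3 0.2
print({d: round(cosine(v, query), 2) for d, v in docs.items()})  # D1 0.51, D2 0.0, D3 1.0
# Both measures rank the documents D3, D1, D2.
```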

Large Example (1)
Consider the same five-document collection:
D1 = "Dogs eat the same things that cats eat"
D2 = "No dog is a mouse"
D3 = "Mice eat little things"
D4 = "Cats often play with rats and mice"
D5 = "Cats often play, but not with other cats"
indexed by:
V1 = (dog, eat, cat)
V2 = (dog, mouse)
V3 = (mouse, eat)
V4 = (cat, play, rat, mouse)
V5 = (cat, play)

Large Example (2)
The set of all terms (dictionary): (cat, dog, eat, mouse, play, rat)
Using tf.idf weights, we obtain:
V1 = (cat(0.51), eat(1.82), dog(0.91))
V2 = (dog(0.91), mouse(0.51))
V3 = (mouse(0.51), eat(0.91))
V4 = (cat(0.51), play(0.91), rat(1.61), mouse(0.51))
V5 = (cat(1.02), play(0.91))

Large Example (3)
In the vector space model, with one dimension per dictionary term (cat, dog, eat, mouse, play, rat), we obtain the vectors:
V1 = (0.51, 0.91, 1.82, 0.00, 0.00, 0.00)
V2 = (0.00, 0.91, 0.00, 0.51, 0.00, 0.00)
V3 = (0.00, 0.00, 0.91, 0.51, 0.00, 0.00)
V4 = (0.51, 0.00, 0.00, 0.51, 0.91, 1.61)
V5 = (1.02, 0.00, 0.00, 0.00, 0.91, 0.00)
A 6-dimensional space for 6 terms.

Inner Product
The query "what do cats play with?" forms the query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00).
D1 = 0.51 x 0.51 + 0.91 x 0 + 1.82 x 0 + 0 x 0 + 0 x 0.91 + 0 x 0 = 0.2601
D2 = 0 x 0.51 + 0.91 x 0 + 0 x 0 + 0.51 x 0 + 0 x 0.91 + 0 x 0 = 0
D3 = 0 x 0.51 + 0 x 0 + 0.91 x 0 + 0.51 x 0 + 0 x 0.91 + 0 x 0 = 0
D4 = 0.51 x 0.51 + 0 x 0 + 0 x 0 + 0.51 x 0 + 0.91 x 0.91 + 1.61 x 0 = 1.0882
D5 = 1.02 x 0.51 + 0 x 0 + 0 x 0 + 0 x 0 + 0.91 x 0.91 + 0 x 0 = 1.3483
Ranking: D5, D4, D1, D2, D3

Cosine Similarity
The query "what do cats play with?" forms the query vector (0.51, 0.00, 0.00, 0.00, 0.91, 0.00). Using the cosine measure, we obtain the following similarities:
D1 = 0.51^2 / [(0.51^2 + 0.91^2)^0.5 x (0.51^2 + 0.91^2 + 1.82^2)^0.5] = 0.12
D2 = 0.0
D3 = 0.0
D4 = (0.51^2 + 0.91^2) / [(0.51^2 + 0.91^2)^0.5 x (0.51^2 + 0.51^2 + 0.91^2 + 1.61^2)^0.5] = 0.53
D5 = (0.51 x 1.02 + 0.91^2) / [(0.51^2 + 0.91^2)^0.5 x (1.02^2 + 0.91^2)^0.5] = 0.95
Thus we obtain the ranking: D5, D4, D1, D2, D3 (D2 and D3 tie with similarity 0).
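The same cosine ranking over the six-dimensional vectors, as a short self-contained Python sketch:

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Document vectors over the dictionary (cat, dog, eat, mouse, play, rat).
vectors = {
    "D1": (0.51, 0.91, 1.82, 0.00, 0.00, 0.00),
    "D2": (0.00, 0.91, 0.00, 0.51, 0.00, 0.00),
    "D3": (0.00, 0.00, 0.91, 0.51, 0.00, 0.00),
    "D4": (0.51, 0.00, 0.00, 0.51, 0.91, 1.61),
    "D5": (1.02, 0.00, 0.00, 0.00, 0.91, 0.00),
}
query = (0.51, 0.00, 0.00, 0.00, 0.91, 0.00)

for doc in sorted(vectors, key=lambda d: -cosine(vectors[d], query)):
    print(doc, round(cosine(vectors[doc], query), 2))
# D5 0.95, D4 0.53, D1 0.12, D2 0.0, D3 0.0
```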