(C) 2003, The University of Michigan. Information Retrieval, Handout #2, February 3, 2003.

2 Course Information
Instructor: Dragomir R. Radev (radev@si.umich.edu)
Office: 3080, West Hall Connector
Phone: (734) 615-5225
Office hours: M&F 11-12
Course page: http://tangra.si.umich.edu/~radev/650/
Class meets on Mondays, 1-4 PM in 409 West Hall

3 Queries and documents

4 Queries
Single-word queries
Context queries
  – Phrases
  – Proximity
Boolean queries
Natural language queries

5 Pattern matching
Words, prefixes, suffixes, substrings, ranges, regular expressions
Structured queries (e.g., XML)

6 Relevance feedback
Query expansion
Term reweighting
Pseudo-relevance feedback
Latent semantic indexing
Distributional clustering

7 Document processing
Lexical analysis
Stopword elimination
Stemming
Index term identification
Thesauri

8 Porter’s algorithm
1. The measure, m, of a stem is based on its alternating sequences of vowels and consonants. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
   m=0: free, why
   m=1: frees, whose
   m=2: prologue, compute
2. *X - stem ends with the letter X
3. *v* - stem contains a vowel
4. *d - stem ends in a double consonant
5. *o - stem ends with a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
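The measure defined above can be checked mechanically. The following is a minimal sketch, not the reference implementation: the function names `cv_pattern` and `measure` are made up for illustration, and the treatment of "y" (a vowel when it follows a consonant, as in "why") is taken from Porter's original definition rather than from this slide.

```python
import re

VOWELS = set("aeiou")

def cv_pattern(stem):
    """Map each letter to 'C' or 'V'; 'y' counts as a vowel after a consonant."""
    kinds = []
    for ch in stem.lower():
        if ch in VOWELS or (ch == "y" and kinds and kinds[-1] == "C"):
            kinds.append("V")
        else:
            kinds.append("C")
    return "".join(kinds)

def measure(stem):
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    # Collapse runs such as 'CCVVC' into 'CVC', then count the VC pairs.
    collapsed = re.sub(r"(.)\1+", r"\1", cv_pattern(stem))
    return collapsed.count("VC")

for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```

Running it on the six example words gives m = 0, 0, 1, 1, 2, 2, matching the slide.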

9 Porter’s algorithm
Suffix conditions take the form current_suffix == pattern.
Actions take the form old_suffix -> new_suffix.
Rules are divided into steps that define the order in which they are applied. Some example rules:

STEP  CONDITION          SUFFIX  REPLACEMENT    EXAMPLE
1a    NULL               sses    ss             stresses -> stress
1b    *v*                ing     NULL           making -> mak
1b1   NULL               at      ate            inflat(ed) -> inflate
1c    *v*                y       i              happy -> happi
2     m>0                aliti   al             formaliti -> formal
3     m>0                icate   ic             duplicate -> duplic
4     m>1                able    NULL           adjustable -> adjust
5a    m>1                e       NULL           inflate -> inflat
5b    m>1 and *d and *L  NULL    single letter  controll -> control
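To make the condition/suffix/replacement format concrete, here is a small sketch that applies a handful of the example rules in isolation. It is an illustration under stated assumptions, not a full stemmer: the names `RULES` and `apply_rule` are hypothetical, rules 1b1 and 5b are left out because they depend on what an earlier rule did, and "y" is treated as a vowel after a consonant (from Porter's original definition).

```python
import re

def cv_pattern(stem):
    """Letter-by-letter 'C'/'V' string; 'y' is a vowel after a consonant."""
    kinds = ""
    for ch in stem:
        is_vowel = ch in "aeiou" or (ch == "y" and kinds.endswith("C"))
        kinds += "V" if is_vowel else "C"
    return kinds

def measure(stem):
    """Number of VC repeats, m, in [C](VC)^m[V]."""
    return re.sub(r"(.)\1+", r"\1", cv_pattern(stem)).count("VC")

def has_vowel(stem):
    """The *v* condition: the stem contains a vowel."""
    return "V" in cv_pattern(stem)

# step -> (condition on the stem, suffix, replacement); a subset of the table.
RULES = {
    "1a": (lambda s: True, "sses", "ss"),
    "1b": (has_vowel, "ing", ""),
    "1c": (has_vowel, "y", "i"),
    "2": (lambda s: measure(s) > 0, "aliti", "al"),
    "3": (lambda s: measure(s) > 0, "icate", "ic"),
    "4": (lambda s: measure(s) > 1, "able", ""),
    "5a": (lambda s: measure(s) > 1, "e", ""),
}

def apply_rule(word, step):
    """Apply one rule; return the word unchanged if the rule does not fire."""
    condition, suffix, replacement = RULES[step]
    if word.endswith(suffix):
        stem = word[: len(word) - len(suffix)]
        if condition(stem):
            return stem + replacement
    return word

print(apply_rule("stresses", "1a"))
print(apply_rule("adjustable", "4"))
print(apply_rule("sing", "1b"))  # stem "s" has no vowel, so the rule does not fire
```

Note that the condition is always tested on the stem that remains after the suffix is removed, which is why "sing" is left alone while "making" is reduced to "mak".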

10 Porter’s algorithm
Example: the word “duplicatable”
duplicat   rule 4
duplicate  rule 1b1
duplic     rule 3
Another rule in step 4, which would remove “ic,” cannot be applied, since only one rule from each step may be applied.

11 Porter’s algorithm

12 Relevance feedback
Automatic
Manual
Method: identifying feedback terms
Q’ = a_1 Q + a_2 R - a_3 N
Often a_1 = 1, a_2 = 1/|R| and a_3 = 1/|N|

13 Example
Q = “safety minivans”
D1 = “car safety minivans tests injury statistics” - relevant
D2 = “liability tests safety” - relevant
D3 = “car passengers injury reviews” - non-relevant
R = ? N = ? Q’ = ?
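One way to work through this exercise is sketched below. It rests on assumptions the slides do not state: term weights are raw term counts, R and N are the sums of the relevant and non-relevant document vectors, and the coefficients are the common choice from the previous slide (1, 1/|R|, 1/|N|, where |R| and |N| count documents). The function names `term_vector` and `feedback` are made up for illustration.

```python
from collections import Counter

def term_vector(text):
    """Bag-of-words vector with raw term counts (an assumed weighting)."""
    return Counter(text.split())

def feedback(query, relevant, nonrelevant):
    """Q' = a1*Q + a2*R - a3*N with a1 = 1, a2 = 1/|R|, a3 = 1/|N|."""
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    new_query = {}
    for term in terms:
        r = sum(doc[term] for doc in relevant)      # component of R
        n = sum(doc[term] for doc in nonrelevant)   # component of N
        new_query[term] = query[term] + a2 * r - a3 * n
    return new_query

q = term_vector("safety minivans")
d1 = term_vector("car safety minivans tests injury statistics")
d2 = term_vector("liability tests safety")
d3 = term_vector("car passengers injury reviews")

q_new = feedback(q, [d1, d2], [d3])
for term, weight in sorted(q_new.items(), key=lambda item: -item[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" ends up with the highest weight (2.0), followed by "minivans" (1.5) and "tests" (1.0), while terms that occur only in the non-relevant document, such as "passengers", get a weight of -1.0.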

14 Automatic query expansion
Thesaurus-based expansion
Distributional similarity-based expansion

15 WordNet and DistSim
wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn reason -simsn - synonyms
wn reason -over - overview of senses
wn reason -famln - familiarity/polysemy
wn reason -grepn - compound nouns
/clair3/tools/relatedwords/relate reason

16 Related (substitutable) words
WordNet:
Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper

17 Indexing and searching

18 Computing term salience
Term frequency (TF)
Document frequency (DF)
Inverse document frequency (IDF)
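The three quantities can be sketched in a few lines. This is an illustration only: the slide gives no formulas, so the IDF definition used here, log(N / df) with the natural logarithm, is an assumption (it is one common variant), and the function names are hypothetical. The three documents are reused from the relevance feedback example.

```python
import math
from collections import Counter

def term_frequencies(doc):
    """TF: how often each term occurs in one document."""
    return Counter(doc.lower().split())

def document_frequencies(docs):
    """DF: in how many documents each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def inverse_document_frequencies(docs):
    """IDF, assumed here to be log(N / df) with N the number of documents."""
    n = len(docs)
    return {term: math.log(n / count)
            for term, count in document_frequencies(docs).items()}

docs = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]

tf = term_frequencies(docs[0])
df = document_frequencies(docs)
idf = inverse_document_frequencies(docs)
print(tf["safety"], df["safety"], round(idf["safety"], 3))
print(tf["minivans"], df["minivans"], round(idf["minivans"], 3))
```

A term that appears in fewer documents ("minivans", df = 1) gets a higher IDF than one that appears in more ("safety", df = 2), which is the sense in which IDF measures salience.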

19 Scripts to compute tf and idf
cd /clair4/class/ir-w03/hw2
./tf.pl 053.txt | sort -nr +1 | more
./tfs.pl 053.txt | sort -nr +1 | more
./stem.pl reasonableness
./build-idf.pl
./idf.pl | sort -n +2 | more

20 Applications of TFIDF
Cosine similarity
Indexing
Clustering
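As an illustration of the first application, the sketch below builds TF*IDF vectors and compares documents by cosine similarity. The weighting is an assumption (raw term count times log(N / df), natural logarithm), since the handout does not fix one, and the names `tfidf_vectors` and `cosine` are made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse {term: tf * idf} vector per document (assumed weighting)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: tf * math.log(n / df[term])
         for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
v1, v2, v3 = tfidf_vectors(docs)
print(cosine(v1, v2))  # share "safety" and "tests"
print(cosine(v2, v3))  # no terms in common
```

Documents with no terms in common score exactly 0, and a document compared with itself scores 1 (up to floating-point rounding). The same vectors could serve as input to indexing or clustering.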

