(C) 2003, The University of Michigan. Information Retrieval, Handout #4, January 28, 2005.

1 Information Retrieval, Handout #4, January 28, 2005

2 Course Information
Instructor: Dragomir R. Radev (radev@si.umich.edu)
Office: 3080, West Hall Connector
Phone: (734) 615-5225
Office hours: M 11-12 & Th 12-1 or via email
Course page: http://tangra.si.umich.edu/~radev/650/
Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

3 Arithmetic coding

4 Arithmetic coding
Uses probabilities
Achieves about 2.5 bits per character, close to optimal (Rissanen and Langdon 1979; Witten, Neal, and Cleary 1987)

6 Exercise
Assuming the alphabet consists of a, b, and c, develop arithmetic encodings for the following strings:
– aaaaab
– ababaa
– abccab
– cbabac
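One way to approach an exercise like this is to track how the coding interval narrows symbol by symbol. The sketch below assumes a static model in which a, b, and c are equally likely; the slides do not fix the probabilities, so `uniform` and the function name `arithmetic_interval` are illustrative choices, not part of the handout. Any number inside the final interval identifies the message under that model.

```python
from fractions import Fraction


def arithmetic_interval(message, probs):
    """Return the final [low, high) interval for message under a static model."""
    # Lower edge of each symbol's slice of [0, 1), taking symbols in sorted order.
    cum = {}
    total = Fraction(0)
    for sym in sorted(probs):
        cum[sym] = total
        total += probs[sym]

    low, width = Fraction(0), Fraction(1)
    for sym in message:
        # Narrow the current interval to the slice owned by this symbol.
        low += width * cum[sym]
        width *= probs[sym]
    return low, low + width


# Assumed model: the three symbols are equally likely.
uniform = {s: Fraction(1, 3) for s in "abc"}

for text in ["aaaaab", "ababaa", "abccab", "cbabac"]:
    low, high = arithmetic_interval(text, uniform)
    print(text, low, high)
```

Exact fractions are used so that the interval bounds are not blurred by floating-point rounding; a practical coder would use scaled integer arithmetic instead.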

7 Stemming

8 Goals
Motivation:
– Computer, computers, computerize, computational, computerization
– User, users, using, used
Representing related words as one token
Simplify matching
Reduce storage and computation
Also known as: term conflation

9 Methods
Manual (tables)
– Achievement -> achiev
– Achiever -> achiev
– Etc.
Affix removal (Harman 1991, Frakes 1992)
– If a word ends in “ies” but not “eies” or “aies”, then “ies” -> “y”
– If a word ends in “es” but not “aes”, “ees”, or “oes”, then “es” -> “e”
– If a word ends in “s” but not “us” or “ss”, then “s” -> NULL
– (apply only the first applicable rule)
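The three affix-removal rules above translate almost directly into code. The following is a minimal sketch: the function name `s_stem` is made up for illustration, and "first applicable rule" is read as "the first rule whose suffix and exception conditions both hold".

```python
def s_stem(word):
    """Apply the first applicable plural-stripping rule and return the result."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes".
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


for word in ["ponies", "houses", "cats", "caress", "corpus"]:
    print(word, "->", s_stem(word))
```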

10 Porter’s algorithm (Porter 1980)
Home page:
– http://www.tartarus.org/~martin/PorterStemmer
Reading assignment:
– http://www.tartarus.org/~martin/PorterStemmer/def.txt
Consonant-vowel sequences:
– CVCV... C
– CVCV... V
– VCVC... C
– VCVC... V
– Shorthand: [C]VCVC... [V]

11 Porter’s algorithm (cont’d)
[C](VC){m}[V]
{m} indicates repetition
Examples:
    m=0    TR, EE, TREE, Y, BY
    m=1    TROUBLE, OATS, TREES, IVY
    m=2    TROUBLES, PRIVATE, OATEN
Conditions:
    *S  - the stem ends with S (and similarly for the other letters).
    *v* - the stem contains a vowel.
    *d  - the stem ends with a double consonant (e.g. -TT, -SS).
    *o  - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).
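The measure m can be computed by counting how often a run of vowels is followed by a consonant. The sketch below assumes the treatment of Y given in Porter's definition (the reading assignment): Y counts as a vowel only when it follows a consonant. The helper names `is_consonant` and `measure` are illustrative.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense (word is lowercase)."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m, the number of VC pairs in the form [C](VC){m}[V]."""
    stem = stem.lower()
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = vowel
    return m


for word in ["TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "OATEN"]:
    print(word, measure(word))
```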

12 Step 1a
    SSES -> SS                  caresses -> caress
    IES  -> I                   ponies -> poni; ties -> ti
    SS   -> SS                  caress -> caress
    S    ->                     cats -> cat
Step 1b
    (m>0) EED -> EE             feed -> feed; agreed -> agree
    (*v*) ED  ->                plastered -> plaster; bled -> bled
    (*v*) ING ->                motoring -> motor; sing -> sing
Step 1b1
If the second or third of the rules in Step 1b is successful, the following is done:
    AT -> ATE                   conflat(ed) -> conflate
    BL -> BLE                   troubl(ed) -> trouble
    IZ -> IZE                   siz(ed) -> size
    (*d and not (*L or *S or *Z)) -> single letter
                                hopp(ing) -> hop; tann(ed) -> tan; fall(ing) -> fall; hiss(ing) -> hiss; fizz(ed) -> fizz
    (m=1 and *o) -> E           fail(ing) -> fail; fil(ing) -> file
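Step 1a is the simplest of these rule groups because its rules carry no conditions on the stem. A minimal sketch, assuming lowercase input and trying the longest suffix first (the function name `step_1a` is illustrative):

```python
def step_1a(word):
    """Apply Porter's Step 1a: the first matching suffix rule wins."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS (unchanged, and blocks the S rule)
    if word.endswith("s"):
        return word[:-1]  # S -> (removed)
    return word


for word in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(word, "->", step_1a(word))
```

The SS rule changes nothing by itself; it exists so that words such as "caress" match before the plain S rule can strip their final letter.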

13 Step 1c
    (*v*) Y -> I                happy -> happi; sky -> sky
Step 2
    (m>0) ATIONAL -> ATE        relational -> relate
    (m>0) TIONAL  -> TION       conditional -> condition; rational -> rational
    (m>0) ENCI    -> ENCE       valenci -> valence
    (m>0) ANCI    -> ANCE       hesitanci -> hesitance
    (m>0) IZER    -> IZE        digitizer -> digitize
    (m>0) ABLI    -> ABLE       conformabli -> conformable
    (m>0) ALLI    -> AL         radicalli -> radical
    (m>0) ENTLI   -> ENT        differentli -> different
    (m>0) ELI     -> E          vileli -> vile
    (m>0) OUSLI   -> OUS        analogousli -> analogous
    (m>0) IZATION -> IZE        vietnamization -> vietnamize
    (m>0) ATION   -> ATE        predication -> predicate
    (m>0) ATOR    -> ATE        operator -> operate
    (m>0) ALISM   -> AL         feudalism -> feudal
    (m>0) IVENESS -> IVE        decisiveness -> decisive
    (m>0) FULNESS -> FUL        hopefulness -> hopeful
    (m>0) OUSNESS -> OUS        callousness -> callous
    (m>0) ALITI   -> AL         formaliti -> formal
    (m>0) IVITI   -> IVE        sensitiviti -> sensitive
    (m>0) BILITI  -> BLE        sensibiliti -> sensible

14 Step 3
    (m>0) ICATE -> IC           triplicate -> triplic
    (m>0) ATIVE ->              formative -> form
    (m>0) ALIZE -> AL           formalize -> formal
    (m>0) ICITI -> IC           electriciti -> electric
    (m>0) ICAL  -> IC           electrical -> electric
    (m>0) FUL   ->              hopeful -> hope
    (m>0) NESS  ->              goodness -> good
Step 4
    (m>1) AL    ->              revival -> reviv
    (m>1) ANCE  ->              allowance -> allow
    (m>1) ENCE  ->              inference -> infer
    (m>1) ER    ->              airliner -> airlin
    (m>1) IC    ->              gyroscopic -> gyroscop
    (m>1) ABLE  ->              adjustable -> adjust
    (m>1) IBLE  ->              defensible -> defens
    (m>1) ANT   ->              irritant -> irrit
    (m>1) EMENT ->              replacement -> replac
    (m>1) MENT  ->              adjustment -> adjust
    (m>1) ENT   ->              dependent -> depend
    (m>1 and (*S or *T)) ION -> adoption -> adopt
    (m>1) OU    ->              homologou -> homolog
    (m>1) ISM   ->              communism -> commun
    (m>1) ATE   ->              activate -> activ
    (m>1) ITI   ->              angulariti -> angular
    (m>1) OUS   ->              homologous -> homolog
    (m>1) IVE   ->              effective -> effect
    (m>1) IZE   ->              bowdlerize -> bowdler

15 Step 5a
    (m>1) E ->                  probate -> probat; rate -> rate
    (m=1 and not *o) E ->       cease -> ceas
Step 5b
    (m > 1 and *d and *L) -> single letter
                                controll -> control; roll -> roll

16 Porter’s algorithm (cont’d)
Example: the word “duplicatable”
    duplicat     rule 4
    duplicate    rule 1b1
    duplic       rule 3
A further rule in step 4, which would remove “ic”, cannot be applied, since only one rule from each step may be applied.
    % cd /clair4/class/ir-w03/tf-idf
    % ./stem.pl computers
    computers comput

17 Porter’s algorithm

18 Stemming
Not always appropriate (e.g., proper names, titles)
The same applies to casing (e.g., CAT vs. cat)

19 String matching

20 String matching methods
Index-based
Full or approximate
– E.g., theater = theatre

21 Index-based matching
Inverted files
Position-based inverted files
Block-based inverted files
Example text: This is a text. A text has many words. Words are made from letters.
Starting position of each word: This 1, is 6, a 9, text 11, A 17, text 19, has 24, many 28, words 33, Words 40, are 46, made 50, from 55, letters 60
Index entries:
– Text: 11, 19
– Words: 33, 40
– From: 55
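A position-based inverted file for the example text can be built in a few lines. This sketch assumes 1-based character positions, case folding, and words made only of letters, which reproduces the positions shown above; `build_inverted_index` is an illustrative name, and unlike the slide's index it keeps every word, including short ones such as "a" and "is".

```python
import re
from collections import defaultdict


def build_inverted_index(text):
    """Map each lowercased word to the 1-based positions where it starts."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        index[match.group().lower()].append(match.start() + 1)
    return dict(index)


text = "This is a text. A text has many words. Words are made from letters."
index = build_inverted_index(text)
print(index["text"], index["words"], index["from"])
```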

22 Inverted index (trie)
Vocabulary and positions:
– Letters: 60
– Text: 11, 19
– Words: 33, 40
– Made: 50
– Many: 28
[Figure: trie over the vocabulary, with edges labeled l, m, t, w, a, d, n]

23 Sequential searching
No indexing structure given
Given: database d and search pattern p.
– Example: find “words” in the earlier example
Brute force method
– Try all possible starting positions
– There are O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn)
– Typical runtime is actually O(n), since most mismatches are detected after only a few character comparisons
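The brute-force method can be sketched as two nested loops: the outer one over starting positions, the inner one over pattern characters. The function name `brute_force_search` and the convention of returning a 0-based index (or -1 when there is no match) are assumptions made for this example.

```python
def brute_force_search(database, pattern):
    """Return the 0-based index of the first match of pattern, or -1."""
    n, m = len(database), len(pattern)
    for start in range(n - m + 1):
        j = 0
        # Compare until the first mismatch; usually this stops very early.
        while j < m and database[start + j] == pattern[j]:
            j += 1
        if j == m:
            return start
    return -1


text = "This is a text. A text has many words. Words are made from letters."
# Add 1 to report the position in the 1-based numbering of the index example.
print(brute_force_search(text, "words") + 1)
```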

24 Knuth-Morris-Pratt
Average runtime similar to brute force
Worst-case runtime is linear: O(n) for the search, plus O(m) to preprocess the pattern
Idea: reuse knowledge of the characters already matched
Needs preprocessing of the pattern

25 Knuth-Morris-Pratt (cont’d)
Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
    index   0  1  2  3  4  5  6  7
    char    A  B  C  D  A  B  D  –
    pos    -1  0  0  0  0  1  2  0
    1234567
    ABCDABD
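The "pos" row of the table can be computed from the pattern alone. The sketch below assumes the convention the table appears to use: entry i is the length of the longest proper prefix of the first i pattern characters that is also a suffix of them, with -1 in entry 0. The names `failure_table` and `kmp_search` are illustrative, and the search is one common formulation of KMP rather than necessarily the one presented in class.

```python
def failure_table(pattern):
    """table[i] = longest border length of pattern[:i]; table[0] = -1."""
    table = [-1] + [0] * len(pattern)
    k = 0  # length of the border of the prefix processed so far
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k]  # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i + 1] = k
    return table


def kmp_search(database, pattern):
    """Return the 0-based index of the first match of pattern, or -1."""
    if not pattern:
        return 0
    table = failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(database):
        while k > 0 and ch != pattern[k]:
            k = table[k]  # reuse what is already known instead of restarting
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


database = "ABC ABC ABC ABDAB ABCDABCDABDE"
print(failure_table("ABCDABD"))
print(kmp_search(database, "ABCDABD"))
```

The database index never moves backwards, which is what gives the linear worst case.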

26 Knuth-Morris-Pratt (cont’d)
[Figure: five successive alignments of the pattern ABCDABD beneath the database ABC ABC ABC ABDAB ABCDABCDABDE, each with a ^ marking the character being compared; the horizontal offsets are not preserved in this transcript]

27 Boyer-Moore
Used in text editors
Demos
– http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
– http://www.blarg.com/~doyle/pages/bmi.html

28 Other methods
The Soundex algorithm (Odell and Russell)
Uses:
– spelling correction
– hash function
– non-recoverable

29 Word similarity
Hamming distance - when words are of the same length
Levenshtein distance - number of edits (insertions, deletions, replacements)
– color --> colour (1)
– survey --> surgery (2)
– com puter --> computer ?
Longest common subsequence (LCS)
– lcs (survey, surgery) = surey
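All three measures have short dynamic-programming or single-pass implementations. The sketch below uses unit cost for every insertion, deletion, and replacement, which is an assumption consistent with the counts shown above; the function names are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and replacements from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # replace, or free match
        prev = cur
    return prev[-1]


def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(levenshtein("color", "colour"), levenshtein("survey", "surgery"))
print(lcs("survey", "surgery"))
```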

30 The Soundex algorithm
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
    b, f, p, v : 1
    c, g, j, k, q, s, x, z : 2
    d, t : 3
    l : 4
    m, n : 5
    r : 6

31 The Soundex algorithm
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits
Examples:
– Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
– These are the same codes as for Ellery, Ghosh, Heilbronn, Kant, and Ladd, respectively
Some problems:
– Rogers and Rodgers, Sinclair and StClair
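The four steps can be sketched as a single pass over the name. This version follows the rules as stated on these slides, where only letters that are adjacent in the original name are merged; some other Soundex variants treat h and w differently, so it should be read as one interpretation rather than a reference implementation. `CODES` and `soundex` are illustrative names.

```python
# Letter-to-digit table from step 2.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character code (letter plus three digits) for a name."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    digits = []
    prev = CODES.get(name[0])  # the first letter takes part in the adjacency rule
    for ch in name[1:]:
        code = CODES.get(ch)   # None for a, e, h, i, o, u, w, y (dropped)
        if code is not None and code != prev:
            digits.append(code)
        prev = code            # a dropped letter breaks adjacency
    # Pad with zeros, then keep the letter and the first three digits.
    return (name[0].upper() + "".join(digits) + "000")[:4]


for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd", "Rogers", "Rodgers"]:
    print(name, soundex(name))
```

Running it on Rogers and Rodgers, or on Sinclair and StClair, shows the problem noted above: names that sound alike can still receive different codes.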

