Slide 1: Information Retrieval, Handout #4, January 28, 2005. (C) 2003, The University of Michigan

Slide 2: Course Information
– Instructor: Dragomir R. Radev
– Office: 3080, West Hall Connector
– Phone: (734)
– Office hours: M & Th 12-1 or via
– Course page:
– Class meets on Fridays, 2:10-4:55 PM in 409 West Hall

Slide 3: Arithmetic coding

Slide 4: Arithmetic coding
– Uses probabilities
– Achieves about 2.5 bits per character, close to optimal (Rissanen and Langdon 1979; Witten, Neal, and Cleary 1987)

Slide 5: (no transcribed content)

Slide 6: Exercise
Assuming the alphabet consists of a, b, and c, develop arithmetic encodings for the following strings:
– aaaaab
– ababaa
– abccab
– cbabac
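A minimal sketch of how such an encoding can be computed, assuming a fixed model that gives each of a, b, c probability 1/3 (the exercise leaves the probability model unspecified). The encoder narrows the interval [0, 1) once per input symbol; any number inside the final interval identifies the string.

```python
def encode_interval(text, probs):
    """Return the final [low, high) interval for `text` under fixed symbol probabilities."""
    # Cumulative ranges: each symbol owns a sub-interval of [0, 1).
    cum, start = {}, 0.0
    for sym, p in probs.items():
        cum[sym] = (start, start + p)
        start += p

    low, high = 0.0, 1.0
    for ch in text:
        width = high - low
        sym_low, sym_high = cum[ch]
        # Narrow the current interval to the symbol's sub-range.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

if __name__ == "__main__":
    probs = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}   # assumed uniform model
    for s in ("aaaaab", "ababaa", "abccab", "cbabac"):
        lo, hi = encode_interval(s, probs)
        print(f"{s}: any number in [{lo:.6f}, {hi:.6f}) encodes it")
```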

Slide 7: Stemming

Slide 8: Goals
Motivation:
– Computer, computers, computerize, computational, computerization
– User, users, using, used
Goals:
– Representing related words as one token
– Simplify matching
– Reduce storage and computation
Also known as: term conflation

Slide 9: Methods
Manual (tables):
– Achievement -> achiev
– Achiever -> achiev
– Etc.
Affix removal (Harman 1991, Frakes 1992):
– If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
– If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e"
– If a word ends in "s" but not "us" or "ss", then "s" -> NULL
– (apply only the first applicable rule)
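A minimal sketch of these three affix-removal rules, applying only the first rule that matches, as the slide specifies; the function name and the test words other than "cats" are illustrative.

```python
def remove_s_suffix(word):
    # Rules are tried in order; only the first applicable rule is applied.
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"     # ponies -> pony
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"     # plates -> plate
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]           # cats -> cat
    return word

for w in ("ponies", "plates", "cats", "caress", "corpus"):
    print(w, "->", remove_s_suffix(w))
# ponies -> pony, plates -> plate, cats -> cat; caress and corpus are unchanged
```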

Slide 10: Porter's algorithm (Porter 1980)
Home page:
Reading assignment:
Consonant-vowel sequences:
– CVCV ... C
– CVCV ... V
– VCVC ... C
– VCVC ... V
– Shorthand: [C]VCVC ... [V]

Slide 11: Porter's algorithm (cont'd)
[C](VC){m}[V], where {m} indicates repetition
Examples:
– m=0: TR, EE, TREE, Y, BY
– m=1: TROUBLE, OATS, TREES, IVY
– m=2: TROUBLES, PRIVATE, OATEN
Conditions:
– *S: the stem ends with S (and similarly for the other letters)
– *v*: the stem contains a vowel
– *d: the stem ends with a double consonant (e.g. -TT, -SS)
– *o: the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
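A rough sketch of the measure m and the conditions above, following Porter's convention that a, e, i, o, u are vowels and that y counts as a vowel when it follows a consonant (the slide does not spell out the treatment of y); words are assumed lowercase and the helper names are illustrative.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant only at the start of the word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m in [C](VC){m}[V]: the number of VC pairs after the optional leading consonants."""
    m, i, n = 0, 0, len(stem)
    while i < n and is_consonant(stem, i):          # optional leading [C]
        i += 1
    while i < n:
        while i < n and not is_consonant(stem, i):  # a run of vowels V
            i += 1
        if i == n:
            break                                   # trailing [V] with no C after it
        m += 1                                      # that vowel run plus the consonants after it: one VC
        while i < n and is_consonant(stem, i):
            i += 1
    return m

def contains_vowel(stem):            # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):     # *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):                  # *o (the final consonant must not be w, x or y)
    return (len(stem) >= 3
            and is_consonant(stem, len(stem) - 3)
            and not is_consonant(stem, len(stem) - 2)
            and is_consonant(stem, len(stem) - 1)
            and stem[-1] not in "wxy")

print([measure(w) for w in ("tree", "by", "trouble", "ivy", "private", "oaten")])
# [0, 0, 1, 1, 2, 2], matching the examples on the slide
```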

Slide 12: Step 1a
– SSES -> SS   (caresses -> caress)
– IES -> I     (ponies -> poni, ties -> ti)
– SS -> SS     (caress -> caress)
– S ->         (cats -> cat)
Step 1b:
– (m>0) EED -> EE   (feed -> feed, agreed -> agree)
– (*v*) ED ->       (plastered -> plaster, bled -> bled)
– (*v*) ING ->      (motoring -> motor, sing -> sing)
Step 1b1 (applied only if the second or third rule of Step 1b was successful):
– AT -> ATE   (conflat(ed) -> conflate)
– BL -> BLE   (troubl(ed) -> trouble)
– IZ -> IZE   (siz(ed) -> size)
– (*d and not (*L or *S or *Z)) -> single letter   (hopp(ing) -> hop, tann(ed) -> tan, fall(ing) -> fall, hiss(ing) -> hiss, fizz(ed) -> fizz)
– (m=1 and *o) -> E   (fail(ing) -> fail, fil(ing) -> file)
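A minimal, self-contained sketch of Step 1a alone, which needs only string tests; the later steps additionally require the measure m and the *v*, *d, *o conditions sketched above.

```python
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS  (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]          # IES  -> I   (ponies -> poni, ties -> ti)
    if word.endswith("ss"):
        return word               # SS   -> SS  (caress -> caress)
    if word.endswith("s"):
        return word[:-1]          # S    ->     (cats -> cat)
    return word

for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```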

Slide 13: Step 1c
– (*v*) Y -> I   (happy -> happi, sky -> sky)
Step 2:
– (m>0) ATIONAL -> ATE   (relational -> relate)
– (m>0) TIONAL -> TION   (conditional -> condition, rational -> rational)
– (m>0) ENCI -> ENCE   (valenci -> valence)
– (m>0) ANCI -> ANCE   (hesitanci -> hesitance)
– (m>0) IZER -> IZE   (digitizer -> digitize)
– (m>0) ABLI -> ABLE   (conformabli -> conformable)
– (m>0) ALLI -> AL   (radicalli -> radical)
– (m>0) ENTLI -> ENT   (differentli -> different)
– (m>0) ELI -> E   (vileli -> vile)
– (m>0) OUSLI -> OUS   (analogousli -> analogous)
– (m>0) IZATION -> IZE   (vietnamization -> vietnamize)
– (m>0) ATION -> ATE   (predication -> predicate)
– (m>0) ATOR -> ATE   (operator -> operate)
– (m>0) ALISM -> AL   (feudalism -> feudal)
– (m>0) IVENESS -> IVE   (decisiveness -> decisive)
– (m>0) FULNESS -> FUL   (hopefulness -> hopeful)
– (m>0) OUSNESS -> OUS   (callousness -> callous)
– (m>0) ALITI -> AL   (formaliti -> formal)
– (m>0) IVITI -> IVE   (sensitiviti -> sensitive)
– (m>0) BILITI -> BLE   (sensibiliti -> sensible)

Slide 14: Step 3
– (m>0) ICATE -> IC   (triplicate -> triplic)
– (m>0) ATIVE ->   (formative -> form)
– (m>0) ALIZE -> AL   (formalize -> formal)
– (m>0) ICITI -> IC   (electriciti -> electric)
– (m>0) ICAL -> IC   (electrical -> electric)
– (m>0) FUL ->   (hopeful -> hope)
– (m>0) NESS ->   (goodness -> good)
Step 4:
– (m>1) AL ->   (revival -> reviv)
– (m>1) ANCE ->   (allowance -> allow)
– (m>1) ENCE ->   (inference -> infer)
– (m>1) ER ->   (airliner -> airlin)
– (m>1) IC ->   (gyroscopic -> gyroscop)
– (m>1) ABLE ->   (adjustable -> adjust)
– (m>1) IBLE ->   (defensible -> defens)
– (m>1) ANT ->   (irritant -> irrit)
– (m>1) EMENT ->   (replacement -> replac)
– (m>1) MENT ->   (adjustment -> adjust)
– (m>1) ENT ->   (dependent -> depend)
– (m>1 and (*S or *T)) ION ->   (adoption -> adopt)
– (m>1) OU ->   (homologou -> homolog)
– (m>1) ISM ->   (communism -> commun)
– (m>1) ATE ->   (activate -> activ)
– (m>1) ITI ->   (angulariti -> angular)
– (m>1) OUS ->   (homologous -> homolog)
– (m>1) IVE ->   (effective -> effect)
– (m>1) IZE ->   (bowdlerize -> bowdler)

Slide 15: Step 5a
– (m>1) E ->   (probate -> probat, rate -> rate)
– (m=1 and not *o) E ->   (cease -> ceas)
Step 5b:
– (m>1 and *d and *L) -> single letter   (controll -> control, roll -> roll)

Slide 16: Porter's algorithm (cont'd)
Example: the word "duplicatable"
– duplicat    (rule 4)
– duplicate   (rule 1b1)
– duplic      (rule 3)
Another rule in step 4, which would remove "ic", cannot be applied, since only one rule from each step may be applied.
% cd /clair4/class/ir-w03/tf-idf
% ./stem.pl computers
computers comput
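In practice an off-the-shelf implementation can stand in for the course's stem.pl script; for example, NLTK ships a Porter stemmer (shown here as an assumed alternative, not part of the course materials, and its rule variant may differ slightly on a few words).

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ("computers", "duplicatable", "relational", "hopefulness"):
    print(w, "->", stemmer.stem(w))
# e.g. computers -> comput, matching the stem.pl output above
```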

Slide 17: Porter's algorithm

Slide 18: Stemming
– Not always appropriate (e.g., proper names, titles)
– The same applies to casing (e.g., CAT vs. cat)

Slide 19: String matching

Slide 20: String matching methods
– Index-based
– Full or approximate (e.g., theater = theatre)

Slide 21: Index-based matching
– Inverted files
– Position-based inverted files
– Block-based inverted files
Example text: "This is a text. A text has many words. Words are made from letters."
Position-based entries (character positions, counted from 1):
– Text: 11, 19
– Words: 33, 40
– From: 55
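A minimal sketch of building such a position-based inverted file over the example text; positions are character offsets counted from 1, which reproduces the numbers on the slide.

```python
import re
from collections import defaultdict

text = "This is a text. A text has many words. Words are made from letters."

index = defaultdict(list)
for m in re.finditer(r"[A-Za-z]+", text):
    index[m.group().lower()].append(m.start() + 1)   # 1-based character offset

print(index["text"])    # [11, 19]
print(index["words"])   # [33, 40]
print(index["from"])    # [55]
```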

Slide 22: Inverted index (trie)
– Letters: 60
– Text: 11, 19
– Words: 33, 40
– Made: 50
– Many: 28
[Figure: trie over the vocabulary, with root branches l, m, t, w and the branches a, d, n below m distinguishing "made" from "many"]
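A minimal sketch of storing these postings in a trie, using nested dictionaries with a "$" key marking the end of a word and holding its posting list; the exact node layout in the slide's figure is only suggested, so this structure is an assumption.

```python
postings = {"letters": [60], "text": [11, 19], "words": [33, 40],
            "made": [50], "many": [28]}

trie = {}
for word, positions in postings.items():
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})   # descend, creating nodes as needed
    node["$"] = positions                # end-of-word marker with postings

def lookup(word):
    node = trie
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

print(lookup("text"))   # [11, 19]
print(lookup("made"))   # [50]
print(lookup("tex"))    # None (a prefix only, not a stored word)
```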

Slide 23: Sequential searching
No indexing structure is given.
Given: a database (text) d and a search pattern p.
– Example: find "words" in the earlier example
Brute-force method:
– Try all possible starting positions
– O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn)
– Typical runtime is closer to O(n), since most mismatches are detected after only a few character comparisons
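A minimal sketch of the brute-force scan over the earlier example text; note it reports 0-based offsets, unlike the 1-based positions on slide 21.

```python
def brute_force_search(text, pattern):
    """Try every starting position and compare character by character: O(mn) worst case."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:
            hits.append(start)
    return hits

text = "This is a text. A text has many words. Words are made from letters."
print(brute_force_search(text.lower(), "words"))   # [32, 39] (0-based offsets)
```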

Slide 24: Knuth-Morris-Pratt
– Average runtime similar to brute force
– Worst-case runtime is linear: O(n)
– Idea: reuse knowledge from earlier comparisons
– Needs preprocessing of the pattern

Slide 25: Knuth-Morris-Pratt (cont'd)
Example:
– database: ABC ABC ABC ABDAB ABCDABCDABDE
– pattern:  ABCDABD
Failure table for the pattern (length of the longest proper prefix of the pattern that is also a suffix of the pattern read up to each position):
index  0  1  2  3  4  5  6
char   A  B  C  D  A  B  D
fail   0  0  0  0  1  2  0

Slide 26: Knuth-Morris-Pratt (cont'd)
[Figure: step-by-step trace of the pattern ABCDABD being shifted along the database ABC ABC ABC ABDAB ABCDABCDABDE, with a caret marking the character compared at each step]
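A minimal KMP sketch over the slide's example; the failure table is computed once from the pattern, so on a mismatch the pattern is shifted without re-reading text characters that already matched. Function names are illustrative.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail, k = [0] * len(pattern), 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    fail, hits, k = build_failure(pattern), [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]          # fall back in the pattern, never in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)   # 0-based start of the match
            k = fail[k - 1]
    return hits

text = "ABC ABC ABC ABDAB ABCDABCDABDE"
print(build_failure("ABCDABD"))      # [0, 0, 0, 0, 1, 2, 0]
print(kmp_search(text, "ABCDABD"))   # [22]
```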

Slide 27: Boyer-Moore
– Used in text editors
– Demos:

Slide 28: Other methods
The Soundex algorithm (Odell and Russell)
Uses:
– spelling correction
– hash function
– non-recoverable (the original name cannot be reconstructed from its code)

Slide 29: Word similarity
– Hamming distance: defined only when the words have the same length
– Levenshtein distance: number of edits (insertions, deletions, replacements)
  – color -> colour (1)
  – survey -> surgery (2)
  – "com puter" -> computer ?
– Longest common subsequence (LCS)
  – lcs(survey, surgery) = surey
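A minimal dynamic-programming sketch of Levenshtein distance that reproduces the two worked examples above.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # replacement (or match)
        prev = curr
    return prev[-1]

print(levenshtein("color", "colour"))    # 1
print(levenshtein("survey", "surgery"))  # 2
```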

Slide 30: The Soundex algorithm
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions.
2. Assign the following numbers to the remaining letters after the first:
– b, f, p, v: 1
– c, g, j, k, q, s, x, z: 2
– d, t: 3
– l: 4
– m, n: 5
– r: 6

Slide 31: The Soundex algorithm (cont'd)
3. If two or more letters with the same code were adjacent in the original name, omit all but the first.
4. Convert to the form "LDDD" by adding terminal zeros or by dropping rightmost digits.
Examples:
– Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
– Ellery, Ghosh, Heilbronn, Kant, and Ladd map to the same five codes, respectively.
Some problems: Rogers vs. Rodgers, Sinclair vs. StClair
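A minimal sketch of the four steps above; letters are also collapsed against the retained first letter, which is what makes Lloyd come out as L300 rather than L430. The function name is illustrative, and names are assumed to contain letters only.

```python
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(name):
    name = name.lower()
    first = name[0].upper()
    digits = []
    prev = CODES.get(name[0], "")          # collapse against the retained first letter too
    for ch in name[1:]:
        code = CODES.get(ch, "")           # vowels and h, w, y get no code
        if code and code != prev:
            digits.append(code)
        prev = code
    # Step 4: pad with terminal zeros or drop rightmost digits to get the form LDDD.
    return first + "".join(digits + ["0", "0", "0"])[:3]

for n in ("Euler", "Gauss", "Hilbert", "Knuth", "Lloyd",
          "Ellery", "Ghosh", "Heilbronn", "Kant", "Ladd"):
    print(n, soundex(n))   # E460 G200 H416 K530 L300, then the same five codes again
```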