Presentation is loading. Please wait.

Presentation is loading. Please wait.

Stemming Algorithms 資訊擷取與推薦技術:期中報告 指導教授:黃三益 老師 學生: 9142608 黃哲修 9142609 張家豪 From

Similar presentations


Presentation on theme: "Stemming Algorithms 資訊擷取與推薦技術:期中報告 指導教授:黃三益 老師 學生: 9142608 黃哲修 9142609 張家豪 From"— Presentation transcript:

1 Stemming Algorithms 資訊擷取與推薦技術:期中報告 指導教授:黃三益 老師 學生: 9142608 黃哲修 9142609 張家豪 From www.mis.nsysu.edu.tw/~syhwang/Courses/IR/StemmingAlgorithms.ppt,www.mis.nsysu.edu.tw/~syhwang/Courses/IR/StemmingAlgorithms.ppt modified by Sumanta

2 The Porter Algorithm Word = Stem + Affix(es) E.g., generalizations = general + ization + s Stemming is the determination of the stem of a given word Porter’s stemmer is a rule-based algorithm E.g., ational → ate (apply: relational → relate) Porter’s stemmer is heuristic, in that it is a practical method not guaranteed to be optimal

3 Spring 2002 NLE3 The Porter Stemmer: Definitions Definitions CONSONANT: a letter other than A, E, I, O, U, and Y preceded by consonant (in TOY, consonants are T,Y; in SYZYGY they are S, Z, G) VOWEL: any other letter With this definition all words and parts of words are of form : [C](VC) m [V] C=string of one or more consonants (con+) and [C] indicates arbitrary presence of the contents V=string of one or more vowels and [V] indicates arbitrary … E.g., Troubles C VC VC = C(VC) 2 m is the measure of the word m = 0: TR, EE, TREE m = 1: TROUBLE, OATS, TREES m = 2: TROUBLES, PRIVATE, OATE

4 Rule Format Rules are of the form (condition) S1 → S2 where S1 and S2 are suffixes. Given a set of rules, only the one with the longest matching suffix S1 is applies. Conditions: 1. m --- measure m = k or m > k, where k is an integer 2.*X --- the stem ends with a given letter X 3.*v*--- the stem contains a vowel 4.*d --- the stem ends in double consonant 5.*o --- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y, (e.g., wil, hop) Rules are divided into sets and in each successive step one set of rules is applied.

5 Porter Steps Each step corresponds to a set of rules. The rules in a step are examined in sequence, and only one rule from a step can apply { step1a(word); step1b(stem); if (the second or third rule of step 1b was used) step1b1(stem); step1c(stem); step2(stem); step3(stem); step4(stem); step5a(stem); step5b(stem); }

6

7 if the second or third rule of step 1b was used

8

9

10

11

12 Examples/Problems computers → computer → comput singing → sing generalizations → information → instructor → Try words of your own … Step 1aStep 4 Step 1b

13 Porter’s Mishaps On-line Porter’s at http://textanalysisonline.com/nltk- porter-stemmer giveshttp://textanalysisonline.com/nltk- porter-stemmer gas (noun) → ga gases (plural) → gase gasses (verb, present tense) → gass gassing (verb, present continuous) → gass gaseous (adjective) → gaseou This is not good – all these words should ideally reduce to the same stem. Trade-off: More rules (accurate but slow) vs Less rules (efficient but sometimes wrong). Google does give different results for gas and gases, so maybe they use these Porter rules:-)


Download ppt "Stemming Algorithms 資訊擷取與推薦技術:期中報告 指導教授:黃三益 老師 學生: 9142608 黃哲修 9142609 張家豪 From"

Similar presentations


Ads by Google