Discussion Class 3: The Porter Stemmer


1 Discussion Class 3: The Porter Stemmer

2 Course Administration
  No class on Thursday.

3 Discussion Classes
  Format:
    Question
    Ask a member of the class to answer.
    Provide opportunity for others to comment.
  When answering:
    Stand up.
    Give your name. Make sure that the TA hears it.
    Speak clearly so that all the class can hear.
  Suggestions:
    Do not be shy at presenting partial answers.
    Differing viewpoints are welcome.

4 Question 1: Stemming
  (a) Define the terms: stem, suffix, prefix, conflation.
  (b) What makes a good stemming algorithm? How would you measure it?
  (c) Porter proposes a criterion for removing suffixes. What is it? Do you agree with it?
  (d) The paper uses "recall cutoff" to measure effectiveness. What does it measure?

5 Question 2: Categories of Stemmer
  The following diagram illustrates the various categories of stemmer. Porter's algorithm is shown by the red path. What do these terms mean?

  Conflation methods
    Manual
    Automatic (stemmers)
      Affix removal
        Longest match
        Simple removal
      Successor variety
      Table lookup
      n-gram

6 Question 3: Mechanics Step 1a
  The paper gives the following example of Step 1a. Explain what this step does.

  Suffix   Replacement   Examples
  sses     ss            caresses -> caress
  ies      i             ponies   -> poni
                         ties     -> ti
  ss       ss            caress   -> caress
  s        null          cats     -> cat
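As a rough illustration of the Step 1a table above, the rules can be read as an ordered list in which only the first (longest) matching suffix is applied. This is a minimal sketch, not Porter's reference implementation; the names `STEP_1A_RULES` and `step_1a` are made up for this example.

```python
# Step 1a rules from the table, longest suffix first.
# Each pair is (suffix, replacement); "" stands for the null replacement.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step_1a(word: str) -> str:
    """Apply the first Step 1a rule whose suffix matches; otherwise return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "ties", "caress", "cats"]:
        print(w, "->", step_1a(w))
```

The "ss -> ss" rule changes nothing by itself; in this sketch its role is to stop the shorter "s" rule from firing on words such as "caress".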

7 Question 4: Mechanics Step 1b

  Conditions   Suffix   Replacement   Examples
  (m > 0)      eed      ee            feed      -> feed
                                      agreed    -> agree
  (*v*)        ed       null          plastered -> plaster
                                      bled      -> bled
  (*v*)        ing      null          motoring  -> motor
                                      sing      -> sing

  (a) Explain this table.
  (b) How does this table apply to: "exceeding", "ringed"?
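One way to experiment with the Step 1b table is the sketch below. It assumes the definitions from Porter's paper: a consonant is any letter other than a, e, i, o, u, and other than y preceded by a consonant; m counts the vowel-consonant sequences in the stem; (*v*) means the stem contains a vowel. It covers only the three rules shown in the table and omits the clean-up rules the paper applies after "ed" or "ing" is removed. All function names are chosen for this example.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions in the stem (Porter's m)."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


def contains_vowel(stem: str) -> bool:
    """The (*v*) condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1b(word: str) -> str:
    """Apply the rule for the longest matching suffix, if its condition holds."""
    if word.endswith("eed"):
        stem = word[:-3]
        # If the condition fails, no shorter suffix (such as "ed") is tried.
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word


if __name__ == "__main__":
    for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
        print(w, "->", step_1b(w))
```

Note that the condition is tested on the stem that would remain after the suffix is removed, which is why "feed" (stem "f", m = 0) and "bled" (stem "bl", no vowel) are left alone.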

8 Question 5: Mechanics Step 5a
  Step 5a is defined as follows. What does this do and why?

  (m>1) E -> null               probate -> probat
                                rate    -> rate
  (m=1 and not *o) E -> null    cease   -> ceas
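The Step 5a rules can be tried out with a sketch like the one below. It assumes the paper's definitions: m is the number of vowel-consonant sequences in the stem, and *o means the stem ends consonant-vowel-consonant where the final consonant is not w, x or y. The helper names are hypothetical, and the code handles Step 5a in isolation, not as part of the full algorithm.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions in the stem (Porter's m)."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m


def ends_cvc(stem: str) -> bool:
    """The *o condition: stem ends cvc and the last consonant is not w, x or y."""
    n = len(stem)
    if n < 3:
        return False
    return (
        is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def step_5a(word: str) -> str:
    """Drop a final e when the remaining stem satisfies one of the two conditions."""
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    if m > 1 or (m == 1 and not ends_cvc(stem)):
        return stem
    return word


if __name__ == "__main__":
    for w in ["probate", "rate", "cease"]:
        print(w, "->", step_5a(w))
```

In this sketch "rate" keeps its e because the stem "rat" has m = 1 and ends consonant-vowel-consonant, while "ceas" has m = 1 but ends vowel-vowel-consonant, so the e is dropped.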

9 Question 6: Ad hoc decisions
  Discuss the following:

  "The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix."

  (a) What is m?
  (b) Why is it a reasonable measure?
  (c) What anomalies does it produce?
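As a starting point for part (a), Porter's paper writes every word in the form [C](VC)^m[V], where C is a run of consonants, V is a run of vowels, and m is the measure. The sketch below computes m by mapping each letter to C or V, collapsing runs, and counting the "VC" pairs. It assumes the paper's convention that y counts as a vowel when it follows a consonant; the function names are invented for this example.

```python
import re


def cv_pattern(word: str) -> str:
    """Map each letter to 'C' or 'V' using Porter's consonant definition."""
    pattern = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            pattern.append("V")
        elif ch == "y" and i > 0 and pattern[-1] == "C":
            # y preceded by a consonant is treated as a vowel.
            pattern.append("V")
        else:
            pattern.append("C")
    return "".join(pattern)


def measure(word: str) -> int:
    """Collapse consonant and vowel runs, then count the VC pairs (Porter's m)."""
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", cv_pattern(word)))
    return collapsed.count("VC")


if __name__ == "__main__":
    # Example words taken from Porter's paper, grouped by their measure.
    for w in ["tr", "ee", "tree", "y", "by",
              "trouble", "oats", "trees", "ivy",
              "troubles", "private", "oaten", "orrery"]:
        print(f"{w:10s} {cv_pattern(w):10s} m={measure(w)}")
```

Printing the C/V pattern next to m makes it easier to look for the anomalies asked about in part (c), since words of quite different lengths can share the same measure.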

10 Question 7: Stemming in Web searching
  (a) In Web search engines, the tendency is not to use stemming. Why? (There are several answers.)
  (b) Does your answer to part (a) mean that stemming is no longer useful?

