Discussion Class 3: The Porter Stemmer

Presentation transcript:

1 Discussion Class 3 The Porter Stemmer

2 Course Administration No class on Thursday

3 Discussion Classes
Format:
  Question. Ask a member of the class to answer. Provide opportunity for others to comment.
When answering:
  Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear.
Suggestions:
  Do not be shy at presenting partial answers. Differing viewpoints are welcome.

4 Question 1: Stemming
(a) Define the terms: stem, suffix, prefix, conflation.
(b) What makes a good stemming algorithm? How would you measure it?
(c) Porter proposes a criterion for removing suffixes. What is it? Do you agree with it?
(d) The paper uses "recall cutoff" to measure effectiveness. What does it measure?

5 Question 2: Categories of Stemmer
The following diagram illustrates the various categories of stemmer. Porter's algorithm is shown by the red path. What do these terms mean?

Conflation methods
  Manual
  Automatic (stemmers)
    Affix removal
      Longest match
      Simple removal
    Successor variety
    Table lookup
    n-gram

6 Question 3: Mechanics Step 1a
The paper gives the following example of Step 1a. Explain what this step does.

  Suffix   Replacement   Examples
  sses     ss            caresses -> caress
  ies      i             ponies -> poni
                         ties -> ti
  ss       ss            caress -> caress
  s        null          cats -> cat
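
As a starting point for the discussion, the Step 1a rules above can be written as a short Python sketch. The function name `step_1a` and the rule list are illustrative choices, not taken from the paper; the sketch assumes lower-case input and that the rules are tried from longest suffix to shortest, with only the first match applied.

```python
def step_1a(word: str) -> str:
    """Apply the Step 1a suffix rules; only the first matching rule fires."""
    # (suffix, replacement) pairs, longest suffix first so that
    # "sses" is tried before "ss" and "ss" before "s".
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ("caresses", "ponies", "ties", "caress", "cats"):
        print(w, "->", step_1a(w))
```

The "ss -> ss" rule looks like a no-op, but in this sketch it is what stops the final "s" rule from turning "caress" into "cares".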

7 Question 4: Mechanics Step 1b

  Conditions   Suffix   Replacement   Examples
  (m > 0)      eed      ee            feed -> feed
                                      agreed -> agree
  (*v*)        ed       null          plastered -> plaster
                                      bled -> bled
  (*v*)        ing      null          motoring -> motor
                                      sing -> sing

(a) Explain this table.
(b) How does this table apply to: "exceeding", "ringed"?
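
The Step 1b table can be tried out with the following Python sketch. It covers only the three rules shown in the table and leaves out the follow-up adjustments that Porter's paper applies after "ed" or "ing" is removed. The helper names (`is_consonant`, `measure`, `contains_vowel`, `step_1b`) are hypothetical, and the sketch assumes lower-case input and Porter's definitions: a consonant is any letter other than a, e, i, o, u, and other than y preceded by a consonant; m counts the vowel-consonant sequences in the stem; *v* means the stem contains a vowel.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a + b == "vc")


def contains_vowel(stem: str) -> bool:
    """The *v* condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1b(word: str) -> str:
    """Apply only the three table rules; the longest matching suffix decides."""
    if word.endswith("eed"):
        stem = word[:-3]
        # If the condition fails, the word is left alone; "ed" is not tried.
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word


if __name__ == "__main__":
    for w in ("feed", "agreed", "plastered", "bled", "motoring", "sing"):
        print(w, "->", step_1b(w))
```

Calling `step_1b` on the two words in part (b) is a quick way to check an answer worked out by hand.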

8 Question 5: Mechanics Step 5a
Step 5a is defined as follows. What does this do and why?

  Condition            Rule         Examples
  (m>1)                E -> null    probate -> probat
                                    rate -> rate
  (m=1 and not *o)     E -> null    cease -> ceas
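
A minimal Python sketch of Step 5a, for experimenting with the two conditions. The helper names are hypothetical. The sketch assumes lower-case input and Porter's definitions: m counts vowel-consonant sequences in the stem, and *o means the stem ends consonant-vowel-consonant where the final consonant is not w, x or y.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Porter's m: the number of vowel-to-consonant transitions in the stem."""
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a + b == "vc")


def ends_cvc(stem: str) -> bool:
    """The *o condition: stem ends consonant-vowel-consonant, last letter not w, x or y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def step_5a(word: str) -> str:
    """Drop a final e when m > 1, or when m == 1 and the stem does not satisfy *o."""
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    if m > 1 or (m == 1 and not ends_cvc(stem)):
        return stem
    return word


if __name__ == "__main__":
    for w in ("probate", "rate", "cease"):
        print(w, "->", step_5a(w))
```

In this sketch "rate" keeps its e because the stem "rat" has m = 1 and ends consonant-vowel-consonant, while "ceas" has m = 1 but ends vowel-vowel-consonant.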

9 Question 6: Ad hoc decisions
Discuss the following:
"The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix."
(a) What is m?
(b) Why is it a reasonable measure?
(c) What anomalies does it produce?
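
To make part (a) concrete, here is a small Python sketch of the measure. It assumes Porter's definition, in which every word or stem has the form [C](VC)^m[V], with C a run of consonants and V a run of vowels, so that m is the number of vowel-consonant sequences. The function names `is_consonant`, `cv_pattern` and `measure` are illustrative, and lower-case input is assumed.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word: str) -> str:
    """Map each letter to 'c' (consonant) or 'v' (vowel)."""
    return "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))


def measure(stem: str) -> int:
    """m equals the number of places where a vowel is directly followed by a consonant."""
    pattern = cv_pattern(stem)
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a + b == "vc")


if __name__ == "__main__":
    for w in ("tree", "by", "trouble", "ivy", "troubles", "orrery"):
        print(w, cv_pattern(w), measure(w))
```

With this definition "tree" and "by" have m = 0, "trouble" and "ivy" have m = 1, and "troubles" and "orrery" have m = 2, which matches the examples in Porter's paper. Trying short words of your own is one way to look for the anomalies asked about in part (c).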

10 Question 7: Stemming in Web searching
(a) In Web search engines, the tendency is not to use stemming. Why? (There are several answers.)
(b) Does your answer to part (a) mean that stemming is no longer useful?