LING/C SC/PSYC 438/538 Lecture 17 Sandiway Fong

Administrivia
Grading
– Midterm grading not finished yet
– Homework 3 graded
Reminder
– Next Monday: guest lecture from Dr. Jerry Ball

Morphology
Chapter 3 of JM
– Morphology: Introduction
– Stemming

Morphology
Inflectional Morphology:
– basically: no change in category
– φ-features (person, number, gender)
  Examples: movies, blonde, actress
  Irregular examples: appendices (from appendix), geese (from goose)
– case
  Examples: he/him, who/whom
– comparatives and superlatives
  Examples: happier/happiest
– tense
  Examples: drive/drives/drove (-ed)/driven

Morphology
Derivational Morphology:
– basically: category changing
– nominalization
  Examples: formalization, informant, informer, refusal, lossage
– deadjectivals
  Examples: weaken, happiness, simplify, formalize, slowly, calm
– deverbals
  Examples: see nominalization; readable, employee
– denominals
  Examples: formal, bridge, ski, cowardly, useful

Morphology and Semantics
Morphemes: units of meaning
suffixation
– Examples:
  x employ y: employee picks out y; employer picks out x
  x read y: readable picks out y
prefixation
– Examples: undo, redo, un-redo, encode, defrost, asymmetric, malformed, ill-formed, pro-Chomsky

Stemming
Normalization procedure
– inflectional morphology: cities → city, improves/improved → improve
– derivational morphology: transformation/transformational → transform
criterion
– preserve meaning (word senses): organization → organ fails this criterion
primary application
– information retrieval (IR)
– efficacy questioned: Harman (1991)
Google didn’t use stemming

Stemming and Search
Up until very recently…
– Word Variations (Stemming): “To provide the most accurate results, Google does not use ‘stemming’ or support ‘wildcard’ searches. In other words, Google searches for exactly the words that you enter in the search box. Searching for ‘book’ or ‘book*’ will not yield ‘books’ or ‘bookstore’. If in doubt, try both forms: ‘airline’ and ‘airlines,’ for instance.”
Google didn’t use stemming

Stemming and Search
Google is more successful than other search engines in part because it returns “better”, i.e. more relevant, information
– its ranking algorithm (a trade secret) is called PageRank
SEO (Search Engine Optimization) is a topic of considerable commercial interest
– Goal: get your webpage ranked higher by PageRank
– Techniques:
  e.g. writing keyword-rich text in your page
  e.g. listing morphological variants of keywords
Google does not use stemming everywhere
– and it does not reveal its algorithm, to prevent people from “optimizing” their pages

Stemming and Search
search on: diet requirements (results from Fall 2004)
note
– can’t use quotes around the phrase: quoting blocks stemming
(screenshot: 6th-ranked result)

Stemming and Search
search on: diet requirements
notes
– top-ranked page has the words diet, dietary and requirements, but not the phrase “diet requirements”
(screenshot: BBC - Health - Healthy living - Dietary requirements)
<meta name="keywords" content="health, cancer, diabetes, cardiovascular disease, osteoporosis, restricted diets, vegetarian, vegan"/>

Stemming and Search
search on: diet requirements
notes
– 5th-ranked page does not have the word diet in it: an academic conference page, unlikely to be “optimized” for page hits
– it ranks above the 6th-ranked page, which does have the exact phrase “diet requirements” in it
(screenshots: PATLIB; Dietary requirements)

Stemming and Search (Fall 2008)

Stemming
IR-centric view
– applies to open-class lexical items only
  stop-word list: the, below, being, does
  excludes: determiners, prepositions, auxiliary verbs
– not full morphology
  prefixes generally excluded (not meaning preserving)
  Examples: asymmetric, undo, encoding
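A stop-word filter like the one just described is a one-liner in Perl; a minimal sketch (the word list here is illustrative, not a standard list):

```perl
# Minimal stop-word filter; the list is illustrative only
my %stop = map { $_ => 1 } qw(the a of below being does is to);
my @tokens  = qw(The cities below the river);
my @content = grep { !$stop{ lc $_ } } @tokens;
print "@content\n";   # prints: cities river
```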

Stemming: Methods
use a dictionary (look-up)
– OK for English, but not for languages with more productive morphology, e.g. Japanese, Turkish
write ad hoc rules, e.g. the Porter Algorithm (Porter, 1980)
– Example: ends in doubled consonant (not l, s or z) → remove last character
  hopping → hop
  hissing → hiss
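The doubled-consonant rule is a single substitution in Perl; a sketch, applied after -ing has already been stripped:

```perl
# Doubled final consonant (not l, s or z) -> single consonant,
# applied to stems left after -ing removal
for my $stem (qw(hopp hiss fall fizz)) {
    (my $out = $stem) =~ s/([b-df-hj-km-np-rtv-y])\1$/$1/;
    print "$stem -> $out\n";  # hopp->hop; hiss, fall, fizz unchanged
}
```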

Stemming: Methods
a dictionary approach is not enough
– Example (Porter, 1991): routed → route or rout?
  At Waterloo, Napoleon’s forces were routed
  The cars were routed off the highway
notes
– here, the (inflected) verb form is ambiguous
– the preceding word (context) does not disambiguate

Stemming: Errors (JM, page 68)
Understemming: failure to merge
– Example: adhere/adhesion
Overstemming: incorrect merge
– Example: probe/probable
  Claim: here -able is an irregular suffix; root: probare (Lat.)
Mis-stemming: removing a non-suffix (Porter, 1991)
– Example: reply → rep

Stemming: Interaction
interacts with noun compounding
– Example: operating systems
– negative polarity items
– for IR, compounds need to be identified first…

Stemming: Porter Algorithm
The Porter Stemmer (Porter, 1980)
Textbook
– JM, Chapter 3, page 68: 3.8 Lexicon-Free FSTs: The Porter Stemmer
URL –
Properties
– for English
– most widely used system
– manually written rules
– 5-stage approach to extracting roots
– considers suffixes only
– may produce non-word roots

Stemming: Porter Algorithm

rule format:
– (condition on stem) suffix1 → suffix2
– in case of conflict, prefer the longest suffix match
“Measure” of a word is the m in:
– [C](VC)^m[V]
  C = sequence of one or more consonants
  V = sequence of one or more vowels
– examples:
  tree: C(VC)^0 V = (tr)()(ee), so m = 0
  troubles: C(VC)^2 = (tr)(ou)(bl)(e)(s), so m = 2
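The measure is easy to compute by collapsing vowel and consonant runs first; a Perl sketch (simplification: y is always treated as a consonant here, whereas Porter treats it context-sensitively):

```perl
# Compute Porter's measure m: the number of VC sequences in [C](VC)^m[V].
# Simplification: y always counts as a consonant.
sub measure {
    my $w = lc shift;
    $w =~ s/[aeiou]+/V/g;        # collapse vowel runs to V
    $w =~ s/[^V]+/C/g;           # collapse consonant runs to C
    my $m = () = $w =~ /VC/g;    # count VC pairs
    return $m;
}
print measure("tree"), " ", measure("troubles"), "\n";   # 0 2
```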

Stemming: Porter Algorithm
Step 1a: remove plural suffixation
– SSES → SS (caresses)
– IES → I (ponies)
– SS → SS (caress)
– S → ∅ (cats)
Step 1b: remove verbal inflection
– (m>0) EED → EE (agreed, feed)
– (*v*) ED → ∅ (plastered, bled)
– (*v*) ING → ∅ (motoring, sing)
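Step 1a maps directly onto ordered Perl substitutions; trying the longer suffixes first implements the longest-match preference (a sketch):

```perl
# Step 1a as ordered substitutions; the first matching rule wins,
# so longer suffixes are tried first
sub step1a {
    my $w = shift;
    for ($w) {
        s/sses$/ss/ and last;  # caresses -> caress
        s/ies$/i/   and last;  # ponies   -> poni
        /ss$/       and last;  # caress   -> caress (unchanged)
        s/s$//      and last;  # cats     -> cat
    }
    return $w;
}
print join(" ", map { step1a($_) } qw(caresses ponies caress cats)), "\n";
```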

Stemming: Porter Algorithm
Step 1b (cont’d, cleanup after -ed and -ing removal):
– AT → ATE (conflated)
– BL → BLE (troubled)
– IZ → IZE (sized)
– (*doubled c & ¬(*L ∨ *S ∨ *Z)) → single c (hopping, hissing, falling, fizzing)
– (m=1 & *cvc) → E (filing, failing, slowing)
Step 1c: Y and I
– (*v*) Y → I (happy, sky)
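Step 1c is one conditional substitution; a Perl sketch:

```perl
# Step 1c: final y -> i if the stem contains a vowel
sub step1c {
    my $w = shift;
    $w =~ s/y$/i/ if $w =~ /[aeiou].*y$/;
    return $w;
}
print step1c("happy"), " ", step1c("sky"), "\n";   # happi sky
```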

Stemming: Porter Algorithm
Step 2: peel one suffix off for multiple suffixes
– (m>0) ATIONAL → ATE (relational)
– (m>0) TIONAL → TION (conditional, rational)
– (m>0) ENCI → ENCE (valenci)
– (m>0) ANCI → ANCE (hesitanci)
– (m>0) IZER → IZE (digitizer)
– (m>0) ABLI → ABLE (conformabli); -able is handled in step 4
– …
– (m>0) IZATION → IZE (vietnamization)
– (m>0) ATION → ATE (predication)
– …
– (m>0) IVITI → IVE (sensitiviti)
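Steps 2–4 all share the same (condition) suffix1 → suffix2 shape, so they are naturally table-driven; a sketch with a few step 2 rules, assuming the measure() helper defined earlier:

```perl
# Table-driven step 2 (subset of the rules); longest suffix tried first
my @step2_rules = (
    [ 'ational', 'ate'  ], [ 'ization', 'ize' ],
    [ 'tional',  'tion' ], [ 'ation',   'ate' ], [ 'iviti', 'ive' ],
);
sub step2 {
    my $w = shift;
    for my $r (sort { length($b->[0]) <=> length($a->[0]) } @step2_rules) {
        my ($suf, $rep) = @$r;
        return $1 . $rep if $w =~ /^(.*)$suf$/ && measure($1) > 0;
    }
    return $w;
}
print step2("relational"), "\n";   # relate
```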

Stemming: Porter Algorithm
Step 3
– (m>0) ICATE → IC (triplicate)
– (m>0) ATIVE → ∅ (formative)
– (m>0) ALIZE → AL (formalize)
– (m>0) ICITI → IC (electriciti)
– (m>0) ICAL → IC (electrical, chemical)
– (m>0) FUL → ∅ (hopeful)
– (m>0) NESS → ∅ (goodness)

Stemming: Porter Algorithm
Step 4: delete last suffix
– (m>1) AL → ∅ (revival; cf. revive, see step 5)
– (m>1) ANCE → ∅ (allowance, dance)
– (m>1) ENCE → ∅ (inference, fence)
– (m>1) ER → ∅ (airliner, employer)
– (m>1) IC → ∅ (gyroscopic, electric)
– (m>1) ABLE → ∅ (adjustable, mov(e)able)
– (m>1) IBLE → ∅ (defensible, bible)
– (m>1) ANT → ∅ (irritant, ant)
– (m>1) EMENT → ∅ (replacement)
– (m>1) MENT → ∅ (adjustment)
– …

Stemming: Porter Algorithm
Step 5a: remove e
– (m>1) E → ∅ (probate, rate)
– (m=1 & ¬*cvc) E → ∅ (cease)
Step 5b: ll reduction
– (m>1 & *LL) → L (controller, roll)
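Step 5b as a Perl substitution, again assuming the measure() helper from above (a sketch):

```perl
# Step 5b: reduce a final ll to l when m > 1
sub step5b {
    my $w = shift;
    $w =~ s/ll$/l/ if $w =~ /ll$/ && measure($w) > 1;
    return $w;
}
print step5b("controll"), " ", step5b("roll"), "\n";   # control roll
```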

Stemming: Porter Algorithm
Misses (understemming)
– unaffected: agreement – the stem agree = (VC)^1 V has m = 1, so step 4’s (m>1) MENT rule does not fire; adhesion
– irregular morphology: drove, geese
Overstemming
– relativity – step 2
Mis-stemming
– wander = C(VC)^1 VC


Porter Stemmer: Perl

Porter Stemmer: Perl

Just 3 pages of code
You should be familiar enough with Perl programming and regular expressions to be able to read and understand the code
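For a feel of the overall shape, here is a heavily reduced driver in the same spirit (not Porter’s actual code; it chains only the step sketches defined above, where the full stemmer has many more rules per step):

```perl
# Reduced driver: lowercase, then apply the steps in order
sub stem {
    my $w = lc shift;
    $w = step1a($w);
    $w = step1c($w);
    $w = step2($w);
    $w = step5b($w);
    return $w;
}
print join(" ", map { stem($_) } qw(ponies relational controllers)), "\n";
# prints: poni relate controller
```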

Finite State Transducers (FST)
A finite state transducer maps sequences of symbols onto one another, using an FSA with input:output labels on the arcs
– epsilon arcs (ϵ) can occur on the input side as well as the output side
– formally: two mappings, δ for states and σ for output
– non-deterministic, i.e. one-to-many: strictly speaking not a function but a relation
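A tiny Perl sketch of running such a machine from a transition table; the states, arcs, and alphabet here are made up for illustration, and the machine is deterministic to keep the code short:

```perl
# Hypothetical FST: delta{state}{input} = [next_state, output_symbol]
my %delta = (
    0 => { a => [1, 'b'] },
    1 => { a => [1, 'b'], c => [0, 'd'] },
);
my %final = (0 => 1);   # accepting states

sub transduce {
    my ($state, $out) = (0, '');
    for my $sym (split //, shift) {
        my $arc = $delta{$state}{$sym} or return undef;   # reject
        ($state, my $o) = @$arc;
        $out .= $o;
    }
    return $final{$state} ? $out : undef;
}
print transduce("aac") // "reject", "\n";   # bbd
```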

Finite State Transducers (FST)
Problem:
– not all FSTs can be turned into deterministic FSTs (DFSTs)
– i.e. non-deterministic FSTs and DFSTs are not equivalent in power
– e.g. a DFST can only compute functions, so it has no way to handle one-to-many relations such as a state with two outgoing arcs a:b and a:c
– there are also functions that non-deterministic FSTs can compute but DFSTs cannot
Restrictions (so that things can be determinized):
– sequential transducers: deterministic on the input side; no epsilon on the input side (OK on the output side)
– subsequential transducers = sequential transducer + an extra tail output at the final state(s)

Spelling Rules
aka orthographic rules
– Examples: e.g. e-insertion (fox + -s → foxes; see below)
– Idea: apply orthographic rules to fix up the output of a morphological parsing FST

Spelling Rules
T_lex: (lexical transducer diagram)

e-insertion FST
Context-sensitive rule (rewrite / left context __ right context):
– ε → e / {x, s, z}^ __ s#
  i.e. insert e after a morpheme boundary (^) that follows x, s or z, and before a word-final s (#)
Corresponding FST: (diagram)
– other = any feasible pair not in this transducer
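As a string rewrite, the rule is a single substitution over the lexical-level string; a Perl sketch, using ^ for the morpheme boundary and # for the word boundary as in the rule:

```perl
# e-insertion: epsilon -> e / {x,s,z}^ __ s#
for my $lex ('fox^s#', 'cat^s#', 'buzz^s#') {
    (my $surf = $lex) =~ s/([xsz])\^(s#)/${1}e$2/;   # insert the e
    $surf =~ tr/^#//d;                               # erase boundary symbols
    print "$lex -> $surf\n";  # foxes, cats, buzzes
}
```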

e-insertion FST
State transition table: (table)
FST: (diagram)