Morphological Processing & Stemming Using FSAs/FSTs

FSAs and Morphology
– Can be used to validate/recognize an input string
– For example, consider the Spanish conjugation of amar in J&M p. 64
– What would an FSA look like that would recognize this input?
(Slide figure: an FSA with numbered states and arcs labeled am, a/e, s, m, …)
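
Concretely, such a recognizer is just a transition table plus a set of accepting states. Below is a minimal Python sketch; the states and arcs are illustrative (covering amo, ama, amas, aman), not the exact automaton from J&M.

# Minimal FSA recognizer for a few present-tense forms of Spanish "amar".
# The transition table is hand-built for illustration; a real system would
# compile it from stem classes and suffix sub-lexicons.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "m"): 2,   # stem "am-"
    (2, "o"): 3,   # amo
    (2, "a"): 4,   # ama (and continues to amas / aman)
    (4, "s"): 3,   # amas
    (4, "n"): 3,   # aman
}
ACCEPTING = {3, 4}

def recognize(word):
    """Return True if the automaton accepts the surface form."""
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

for w in ["amo", "amas", "ama", "aman", "amx"]:
    print(w, recognize(w))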

FSTs and Morphology
– An FST could output information about the input, such as a translation or grammatical info
(Slide figure: an FST with numbered states and arc labels such as am:love, a:ε, o:ε, ε:impf)
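
One way to picture this is an automaton whose arcs carry input:output pairs, with ε for the empty string. The Python sketch below is a toy transducer in that spirit; the arc labels and the +pres / +impf tags are illustrative, not the slide's exact machine, and it uses a simple greedy longest-match walk rather than full nondeterministic search.

# Toy FST: each arc consumes a chunk of the surface form and emits a chunk
# of the analysis; "" plays the role of epsilon on the output side.
# (state, input_chunk) -> (next_state, output_chunk)
ARCS = {
    (0, "am"):  (1, "love"),
    (1, "o"):   (2, "+pres+1sg"),
    (1, "as"):  (2, "+pres+2sg"),
    (1, "aba"): (2, "+impf"),
}
ACCEPTING = {2}

def transduce(word):
    """Greedy longest-match walk over the arcs; returns the analysis or None."""
    state, out, i = 0, [], 0
    while i < len(word):
        for length in (3, 2, 1):          # try the longest arc label first
            chunk = word[i:i + length]
            if (state, chunk) in ARCS:
                state, emitted = ARCS[(state, chunk)]
                out.append(emitted)
                i += length
                break
        else:                             # no arc matched: reject
            return None
    return "".join(out) if state in ACCEPTING else None

print(transduce("amo"))    # love+pres+1sg
print(transduce("amaba"))  # love+impf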

FSAs and NLP
– Why even use FSAs in NLP? Memory and storage are cheap
– Build one large lexicon
– List all entries and required output, e.g.:
  amo:  [love, pres ind]
  amas: [love, pres impf]
  ames: [love, pres subj]
– Some NLP apps do this (e.g., the AZ Noun Phraser, Tolle 2001)
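
As a data structure, a full-form lexicon is nothing more than a lookup table from surface forms to analyses. A minimal sketch, using the entries above (the tag strings simply follow the slide's example and are illustrative):

# Full-form lexicon: every surface form is listed together with its output.
# Workable when the number of forms per word is small.
LEXICON = {
    "amo":  ("love", "pres ind"),
    "amas": ("love", "pres impf"),
    "ames": ("love", "pres subj"),
}

def analyze(word):
    return LEXICON.get(word)      # None for out-of-vocabulary forms

print(analyze("ames"))            # ('love', 'pres subj')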

FSAs and NLP
– For more morphologically complex languages, one big lexicon is not feasible
– Consider Hungarian and Finnish:
  – One verbal form → hundreds of possible inflections → millions of resulting forms
  – A complete "word" lexicon is not feasible
  – Morphological processing is essential

Hungarian
Consider one concept/'word' in Hungarian:
  haz     – house
  hazat   – house (object)
  haznak  – of the house
  hazzal  – with the house
  hazza   – into a house
  hazba   – into the house
  hazra   – to the house
  …

Hungarian
Now consider plural inflections:
  hazak     – houses
  hazakat   – houses (object)
  hazaknak  – of the houses
  hazakzal  – with the houses
  hazakza   – into houses
  hazakba   – into the houses
  hazakra   – to the houses
  …

Hungarian
And possessives:
  hazaim     – my houses
  hazaimat   – my houses (object)
  hazaimnak  – of my houses
  hazaimzal  – with my houses
  hazaimza   – into my houses
  hazaimba   – into my houses
  hazaimra   – to my houses
  …
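
The point of these tables is combinatorial: the endings multiply. The sketch below concatenates the slides' (simplified) number/possessive and case endings onto one stem, ignoring vowel harmony and the many endings not shown, and the count already grows quickly.

# Combinatorial growth from one stem, using only the endings shown on the
# slides (simplified: no vowel harmony, assimilation, or further derivation).
from itertools import product

stem = "haz"
number_poss = ["", "ak", "aim"]                       # sg, plural, "my" plural
case = ["", "at", "nak", "zal", "za", "ba", "ra"]     # endings from the slides

forms = ["".join(parts) for parts in product([stem], number_poss, case)]
print(len(forms))   # 3 * 7 = 21 forms already, from one stem and a tiny suffix set
print(forms[:8])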

Stop

Stemming
– Used in many IR applications for building equivalence classes:
  Connect, Connected, Connecting, Connection, Connections → same class; suffixes irrelevant
– Porter Stemmer: simple and efficient
– Website:
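
For instance, NLTK ships an implementation of the Porter stemmer (assuming nltk is installed), which collapses the example words above into a single class:

# Equivalence classes via the Porter stemmer (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Connect", "Connected", "Connecting", "Connection", "Connections"]
print({w: stemmer.stem(w.lower()) for w in words})
# All five forms reduce to the same stem ("connect"), i.e. one equivalence class.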

Stop

Stemming and Performance
– Does stemming help IR performance?
– Harman 91 indicated that it hurt as much as it helped
– Krovetz 93 shows that stemming does help:
  – Porter-like algorithms work well with smaller documents
  – Krovetz proposes that stemming loses information
  – Derivational morphemes tell us something that helps identify word senses (and helps in IR); stemming them away = information loss

Evaluating Performance
Measures of stemming performance rely on metrics similar to those used in IR:
– Precision: the proportion of selected items the system got right
  precision = tp / (tp + fp)
– Recall: the proportion of target items the system selected
  recall = tp / (tp + fn)
– Rule of thumb: as precision increases, recall drops, and vice versa
These metrics are widely adopted in statistical NLP.

Precision and Recall
Take a given stemming task:
– Suppose there are 100 words that could be stemmed
– A stemmer gets 52 of these right (tp)
– But it inadvertently stems 10 others (fp)
Precision = 52 / (52 + 10) ≈ 0.84
Recall = 52 / (52 + 48) = 0.52
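
A quick check of the arithmetic, using the two formulas from the previous slide:

# Precision/recall for the stemming example: 100 stemmable words,
# 52 stemmed correctly (tp), 10 stemmed that should not have been (fp),
# so 48 stemmable words were missed (fn = 100 - 52).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, fn = 52, 10, 100 - 52
print(round(precision(tp, fp), 2))   # 0.84
print(round(recall(tp, fn), 2))      # 0.52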