Presentation is loading. Please wait.

Presentation is loading. Please wait.

CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted.

Similar presentations


Presentation on theme: "CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted."— Presentation transcript:

1 CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) and Jim Martin: CSC NLP - 3: Morphology, Finite State Transducers

2 Today Elementary Morphology Computational morphology
Finite State Transducers Lexicon-only schemes Rule-only schemes Lab: Introduction to NLTK CSC NLP - 3: Morphology, Finite State Transducers

3 Morphology Morphology: Morphemes: Contrasts: A useful resource:
The study of the way words are built up from smaller meaning units. Morphemes: The smallest meaningful unit in the grammar of a language. Contrasts: Derivational vs. Inflectional Regular vs. Irregular Concatinative vs. Templatic (root-and-pattern) A useful resource: Glossary of linguistic terms by Eugene Loos CSC NLP - 3: Morphology, Finite State Transducers

4 Examples (English) “unladylike” “technique” “dogs”
3 morphemes, 4 syllables un- ‘not’ lady ‘(well behaved) female adult human’ -like ‘having the characteristics of’ Can’t break any of these down further without distorting the meaning of the units “technique” 1 morpheme, 2 syllables “dogs” 2 morphemes, 1 syllable -s, a plural marker on nouns CSC NLP - 3: Morphology, Finite State Transducers

5 Morpheme Definitions Root Stem Affix Clitic
The portion of the word that: is common to a set of derived or inflected forms, if any, when all affixes are removed is not further analyzable into meaningful elements carries the principal portion of meaning of the words Stem The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added. Affix A bound morpheme that is joined before, after, or within a root or stem. Clitic a morpheme that functions syntactically like a word, but does not appear as an independent phonological word Spanish: un beso, las aguas English: Hal’s (genetive marker) Proto-European: Kwe  -que (Latin), te (Greek), and –ca (Sanskrit) CSC NLP - 3: Morphology, Finite State Transducers

6 Inflectional vs. Derivational
Word Classes Parts of speech: noun, verb, adjectives, etc. Word class dictates how a word combines with morphemes to form new words Inflection: Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast. Doesn’t change the word class Usually produces a predictable, non-idiosyncratic change of meaning. Derivation: The formation of a new word or inflectable stem from another word or stem. CSC NLP - 3: Morphology, Finite State Transducers

7 Inflectional Morphology
Adds: tense, number, person, mood, aspect Word class doesn’t change Word serves new grammatical role Examples come is inflected for person and number: The pizza guy comes at noon. las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by –s las manzanas rojas (‘the red apples’) CSC NLP - 3: Morphology, Finite State Transducers

8 Derivational Morphology
Nominalization (formation of nouns from other parts of speech, primarily verbs in English): computerization appointee killer fuzziness Formation of adjectives (primarily from nouns) computational clueless Embraceable Diffulcult cases: building  from which sense of “build”? CSC NLP - 3: Morphology, Finite State Transducers

9 Concatinative Morphology
Morpheme+Morpheme+Morpheme+… Stems: also called lemma, base form, root, lexeme hope+ing  hoping hop  hopping Affixes Prefixes: Antidisestablishmentarianism Suffixes: Antidisestablishmentarianism Infixes: hingi (borrow) – humingi (borrower) in Tagalog Circumfixes: sagen (say) – gesagt (said) in German Agglutinative Languages uygarlaştıramadıklarımızdanmışsınızcasına (Turkish) uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına Behaving as if you are among those whom we could not cause to become civilized Say (has) said CSC NLP - 3: Morphology, Finite State Transducers

10 Templatic Morphology Roots and Patterns Example: Hebrew verbs Root:
Consists of 3 consonants CCC Carries basic meaning Template: Gives the ordering of consonants and vowels Specifies semantic information about the verb Active, passive, middle voice Example: lmd (to learn or study) CaCaC -> lamad (he studied) CiCeC -> limed (he taught) CuCaC -> lumad (he was taught) Psycholinguistic reality format  فرمت farmat CSC NLP - 3: Morphology, Finite State Transducers

11 Nouns and Verbs (in English)
Nouns have simple inflectional morphology cat cat+s, cat+’s Verbs have more complex morphology CSC NLP - 3: Morphology, Finite State Transducers

12 Nouns and Verbs (in English)
Have simple inflectional morphology Cat/Cats Mouse/Mice, Ox, Oxen, Goose, Geese Verbs More complex morphology Walk/Walked Go/Went, Fly/Flew CSC NLP - 3: Morphology, Finite State Transducers

13 Regular (English) Verbs
Morphological Form Classes Regularly Inflected Verbs Stem walk merge try map -s form walks merges tries maps -ing form walking merging trying mapping Past form or –ed participle walked merged tried mapped CSC NLP - 3: Morphology, Finite State Transducers

14 Irregularly Inflected Verbs
Irregular (English) Verbs Morphological Form Classes Irregularly Inflected Verbs Stem eat catch cut -s form eats catches cuts -ing form eating catching cutting Past form ate caught -ed participle eaten CSC NLP - 3: Morphology, Finite State Transducers

15 “To love” in Spanish CSC NLP - 3: Morphology, Finite State Transducers

16 Syntax and Morphology Phrase-level agreement
Subject-Verb John studies hard (STUDY+3SG) Noun-Adjective Las vacas hermosas Sub-word phrasal structures שבספרינו ש+ב+ספר+ים+נו That+in+book+PL+Poss:1PL Which are in our books CSC NLP - 3: Morphology, Finite State Transducers

17 Phonology and Morphology
Script Limitations Spoken English has 14 vowels heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough English Alphabet has 5 Use vowel combinatios: far fair fare Consonantal doubling (hopping vs. hoping) CSC NLP - 3: Morphology, Finite State Transducers

18 Computational Morphology
Approaches Lexicon only Rules only Lexicon and Rules Finite-state Automata Finite-state Transducers Systems WordNet’s morphy PCKimmo Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay Accurate but complex Two-level morphology Commercial version available from InXight Corp. Background Chapter 3 of Jurafsky and Martin A short history of Two-Level Morphology CSC NLP - 3: Morphology, Finite State Transducers

19 Computational Morphology
WORD STEM (+FEATURES)* cats cat +N +PL cat cat +N +SG cities city +N +PL geese goose +N +PL ducks (duck +N +PL) or (duck +V +3SG) merging merge +V +PRES-PART caught (catch +V +PAST-PART) or (catch +V +PAST) CSC NLP - 3: Morphology, Finite State Transducers

20 FSAs and the Lexicon First we’ll capture the morphotactics
The rules governing the ordering of affixes in a language. Then we’ll add in the actual words CSC NLP - 3: Morphology, Finite State Transducers

21 Simple Rules CSC NLP - 3: Morphology, Finite State Transducers

22 Adding the Words CSC NLP - 3: Morphology, Finite State Transducers

23 Derivational Rules CSC NLP - 3: Morphology, Finite State Transducers

24 Parsing/Generation vs. Recognition
Recognition is usually not quite what we need. Usually if we find some string in the language we need to find the structure in it (parsing) Or we have some structure and we want to produce a surface form (production/generation) Example From “cats” to “cat +N +PL” and back Morphological analysis Morphological analysis – either An important stand-alone component of an application (spelling correction, information retrieval) a link in a chain of processing CSC NLP - 3: Morphology, Finite State Transducers

25 Finite State Transducers
The simple story Add another tape Add extra symbols to the transitions On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around. CSC NLP - 3: Morphology, Finite State Transducers

26 FSTs CSC NLP - 3: Morphology, Finite State Transducers

27 Transitions c:c a:a t:t +N:ε
+PL:s c:c means read a c on one tape and write a c on the other +N:ε means read a +N symbol on one tape and write nothing on the other +PL:s means read +PL and write an s CSC NLP - 3: Morphology, Finite State Transducers

28 Ambiguity Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. Didn’t matter which path was actually traversed In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result CSC NLP - 3: Morphology, Finite State Transducers

29 Ambiguity What’s the right parse for
Unionizable Union-ize-able Un-ion-ize-able Each represents a valid path through the derivational morphology machine. CSC NLP - 3: Morphology, Finite State Transducers

30 Ambiguity There are a number of ways to deal with this problem
Simply take the first output found Find all the possible outputs (all paths) and return them all (without choosing) Bias the search so that only one or a few likely paths are explored CSC NLP - 3: Morphology, Finite State Transducers

31 The Gory Details Multi-tape machines Of course, its not as easy as
“cat +N +PL” <-> “cats” As we saw earlier there are geese, mice and oxen But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes Cats vs Dogs Multi-tape machines CSC NLP - 3: Morphology, Finite State Transducers

32 Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape CSC NLP - 3: Morphology, Finite State Transducers

33 Lexical to Intermediate Level
CSC NLP - 3: Morphology, Finite State Transducers

34 Intermediate to Surface
The add an “e” rule as in fox^s# <-> foxes A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply. Meaning that they are written out unchanged to the output tape. CSC NLP - 3: Morphology, Finite State Transducers

35 Foxes CSC NLP - 3: Morphology, Finite State Transducers

36 Foxes CSC NLP - 3: Morphology, Finite State Transducers

37 FST Review FSTs allow us to take an input and deliver a structure based on it Or… take a structure and create a surface form Or take a structure and create another structure In many applications its convenient to decompose the problem into a set of cascaded transducers where The output of one feeds into the input of the next. We’ll see this scheme again for deeper semantic processing. CSC NLP - 3: Morphology, Finite State Transducers

38 Overall Plan CSC 9010- NLP - 3: Morphology, Finite State Transducers
This leaves out the fact that many such transducerss are needed and the easiest way to create the morphological analyzer is to create the transducers separately and then compose them to obtain the grand result, as described in the text. CSC NLP - 3: Morphology, Finite State Transducers

39 Lexicon-only Morphology
The lexicon lists all surface level and lexical level pairs No rules … Analysis/Generation is easy Very large for English What about Arabic or Turkish or Chinese? acclaim acclaim $N$ acclaim acclaim $V+0$ acclaimed acclaim $V+ed$ acclaimed acclaim $V+en$ acclaiming acclaim $V+ing$ acclaims acclaim $N+s$ acclaims acclaim $V+s$ acclamation acclamation $N$ acclamations acclamation $N+s$ acclimate acclimate $V+0$ acclimated acclimate $V+ed$ acclimated acclimate $V+en$ acclimates acclimate $V+s$ acclimating acclimate $V+ing$ CSC NLP - 3: Morphology, Finite State Transducers

40 Stemming vs Morphology
Sometimes you just need to know the stem of a word and you don’t care about the structure. In fact you may not even care if you get the right stem, as long as you get a consistent string. This is stemming… it most often shows up in IR applications CSC NLP - 3: Morphology, Finite State Transducers

41 Stemming in IR Run a stemmer on the documents to be indexed
Run a stemmer on users queries Match This is basically a form of hashing Example: Computerization ization -> -ize computerize ize -> ε computer CSC NLP - 3: Morphology, Finite State Transducers

42 Porter Stemmer Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational > rational (m>0) ENCI -> ENCE valenci > valence (m>0) ANCI -> ANCE hesitanci > hesitance (m>0) IZER -> IZE digitizer > digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli > radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator > operate (m>0) ALISM -> AL feudalism > feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti > formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible CSC NLP - 3: Morphology, Finite State Transducers

43 Porter No lexicon needed
Basically a set of staged sets of rewrite rules that strip suffixes Handles both inflectional and derivational suffixes Doesn’t guarantee that the resulting stem is really a stem (see first bullet) Lack of guarantee doesn’t matter for IR CSC NLP - 3: Morphology, Finite State Transducers

44 Porter Stemmer Errors of Omission Errors of Commission European Europe
analysis analyzes matrices matrix noise noisy explain explanation Errors of Commission organization organ doing doe generalization generic numerical numerous university universe CSC NLP - 3: Morphology, Finite State Transducers

45 Dr Papalarsky or Dr Matuzka
Soundex You work as the Villanova telephone operator. Someone calls looking for: Dr Papalarsky or Dr Matuzka ???????? What do you type as your query string? CSC NLP - 3: Morphology, Finite State Transducers

46 Soundex Keep the first letter
Drop non-initial occurrences of vowels, h, w and y Replace the remaining letters with numbers according to group (e.g.. b, f, p, and v -> 1 Replace strings of identical numbers with a single number (333 -> 3) Drop any numbers beyond a third one CSC NLP - 3: Morphology, Finite State Transducers

47 Soundex Effect is to map (hash) all similar sounding transcriptions to the same code. Structure your directory so that it can be accessed by code as well as by correct spelling Used for census records, phone directories, author searches in libraries etc. CSC NLP - 3: Morphology, Finite State Transducers


Download ppt "CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted."

Similar presentations


Ads by Google