LIN3022 Natural Language Processing Lecture 3 Albert Gatt

Reminder: Non-deterministic FSA

An FSA where there can be multiple paths for a single input (tape). Two basic approaches to recognition:
1. Take the non-deterministic machine, convert it to a deterministic machine, and then do recognition with that.
2. Explicitly manage the process of recognition as a state-space search (leaving the machine as is).

Example

[Slide shows an FSA diagram: states q0–q4 with arcs labelled b, a, a, !, and a non-deterministic choice between two a-arcs out of q2.]

Non-Deterministic Recognition: Search

Given an ND FSA representing some language, and given an input string:
– If the input string belongs to the language, the ND FSA will contain at least one accepting path for that string.
– If the input string does not belong to the language, there will be no such path.
– Not all paths through the machine for an acceptable string lead to an accept state.
A recognition algorithm succeeds if a path to an accept state is found for the input string, and fails otherwise.
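To make the search idea concrete, here is a minimal Python sketch of non-deterministic recognition as state-space search. The function name, the transition-table format, and the sheep-talk machine it runs on are illustrative assumptions, not code from the lecture:

```python
# A minimal sketch of ND FSA recognition as state-space search.
# The transition table maps (state, symbol) to a *set* of next states.

def nd_recognise(tape, transitions, start, accept_states):
    """Return True iff some path through the ND FSA accepts the tape."""
    agenda = [(start, 0)]            # search states: (FSA state, tape position)
    while agenda:
        state, pos = agenda.pop()    # depth-first; use a queue for breadth-first
        if pos == len(tape) and state in accept_states:
            return True              # one accepting path is enough
        if pos < len(tape):
            for nxt in transitions.get((state, tape[pos]), set()):
                agenda.append((nxt, pos + 1))
    return False                     # every path failed

# A version of the sheep-talk machine above, with a non-deterministic
# choice between staying at q2 and moving on to q3.
trans = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}
print(nd_recognise("baaa!", trans, "q0", {"q4"}))   # True
print(nd_recognise("ba!", trans, "q0", {"q4"}))     # False
```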

Words

Finite-state methods are particularly useful in dealing with a lexicon. Many devices need access to large lists of words:
– Spell checkers
– Syntactic parsers
– Language generators
– Machine translation systems
What sort of knowledge should a computational lexicon contain? What we're mainly concerned with today is morphology (inflection and derivation).

Computational tasks involving morphology

Morphological analysis (parsing):
– ommijiet → omm + PL
Morphological (lexical) generation:
– omm + PL → ommijiet
Stemming (mapping from a word to its stem):
– ommijiet → omm
– ommijiethom → omm
– ommna → omm
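As a toy illustration that analysis and generation are inverses of one another, assuming a small hand-listed lexicon (the only form/gloss pair used is the one from the slide):

```python
# Toy illustration: analysis and generation as inverse mappings.
# A real system computes these mappings with finite-state machinery
# rather than a lookup table.
analyses = {
    "ommijiet": "omm+PL",
}
generate = {feats: form for form, feats in analyses.items()}

print(analyses["ommijiet"])    # omm+PL   (morphological analysis)
print(generate["omm+PL"])      # ommijiet (morphological generation)
```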

Computational tasks involving morphology

Lemmatisation: convert words to their "basic" form
– ommijiet → omm
– ommijiethom → omm
–...
Tokenisation: split running text into individual tokens
– Qtilt il-kelb → qtilt, il-, kelb
– Xrobt l-ilma → xrobt, l-, ilma

Regular, irregular and messy

The problem is to cover everything, of course:
– Mouse/mice, goose/geese, ox/oxen (PLURAL)
– Go/went, fly/flew (PAST)
– Solid/solidify, mechanical/mechanise (VERB FORMATION)

Inflection vs Derivation

Inflection is typically fairly straightforward:
– Relatively easy to identify regular and irregular cases.
Derivational morphology is much messier:
– Quasi-systematicity
– Irregular meaning change
– Changes of word class

Derivational Examples: Verbs and Adjectives to Nouns

-ation: computerise → computerisation
-ee: appoint → appointee
-er: kill → killer
-ness: fuzzy → fuzziness

Derivational Examples: Nouns and Verbs to Adjectives

-al: computation → computational
-able: embrace → embraceable
-less: clue → clueless

Example: Compute

Many derivational paths are possible. Start with compute:
– computer → computerise → computerisation
– computer → computerise → computerisable
But not all paths/operations are equally good:
– clue → *clueable

Morphology, recognition and FSAs

We'd like to use the machinery provided by FSAs to capture these facts about morphology: accept strings that are in the language and reject strings that are not, in an efficient way:
– Without listing every word in the language
– Capturing some generalisations
– Enabling fast search

What does a morphological parser or recogniser need?

Lexicon
– Often not feasible to just list all the words: Maltese has literally thousands of forms for some verbs, and some morphological processes are productive, so we're likely to meet completely new formations.
Morphotactics (the order of morphemes)
– E.g. the English plural morpheme comes after the noun stem.
– E.g. the Maltese "accusative" -l comes before dative pronominal suffixes (e.g. qatilulna).
– E.g. English -ise comes before -ation (formalisation).
Orthographic rules
– Needed to handle variations in the spelling of the stem.
– E.g. English nouns ending in -y change to -i (city → cities).

Start Simple

Regular singular nouns are OK as is.
Regular plural nouns have an -s on the end.
Irregulars are OK as is (i.e. treat them as atomic for now).

Simple Rules

[Slide shows the FSA diagram for English nominal inflection: arcs for reg-noun (optionally followed by the plural -s), irreg-sg-noun and irreg-pl-noun.]
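Since the diagram itself is not reproduced in this transcript, here is a rough Python sketch of the same machine's behaviour. The word lists are tiny illustrative stand-ins for the real sub-lexicons:

```python
# Sketch of the noun-inflection FSA, with stand-in word lists for the
# reg-noun / irreg-sg-noun / irreg-pl-noun classes.
reg_nouns = {"cat", "fox", "dog"}
irreg_sg_nouns = {"goose", "mouse", "ox"}
irreg_pl_nouns = {"geese", "mice", "oxen"}

def accepts_noun(word):
    """Accept: reg-noun, reg-noun + s, irreg-sg-noun, or irreg-pl-noun."""
    if word in reg_nouns or word in irreg_sg_nouns or word in irreg_pl_nouns:
        return True
    # the plural arc: a regular noun stem followed by -s
    return word.endswith("s") and word[:-1] in reg_nouns

for w in ["cat", "cats", "geese", "gooses"]:
    print(w, accepts_noun(w))   # gooses -> False: no plural arc from irregulars
```

Note that foxes is also rejected by this sketch: handling the e-insertion spelling change is exactly what the orthographic rules in Part 2 are for.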

Substitute words for word classes

The idea is to be able to use this kind of FSA for recognition. We've replaced classes like "reg-noun" with the actual words.

Derivational Rules

[Slide shows an FSA fragment for derivational rules.]
If everything is an accept state, how do things ever get rejected?

Lexicons

A lexicon can be stored as an FSA. A base lexicon (with base forms) can be plugged into a larger FSA to capture morphological rules and morphotactics.

Part 2

Finite state transducers
Morphological parsing
Morphological generation

Parsing/Generation vs. Recognition

We can now run strings through these machines to recognise strings in the language. But recognition is usually not quite what we need:
– Often, if we find some string in the language, we would like to assign a structure to it (parsing).
– Or we might have some structure and want to produce a surface form for it (production/generation).

Finite State Transducers

The simple story:
– Add another tape.
– Add extra symbols to the transitions.
– On one tape we read "cats"; on the other we write "cat +N +PL".

What is a finite-state transducer?

A transducer is a machine that takes input of a certain form and outputs something of a different form.
– We can think of morphological analysis as transduction (from a word to stem + features):
ommijietna → omm + PL + POSS.1PL
– So is morphological generation (from a stem + feature combination to a word):
omm + PL + POSS.1PL → ommijietna

Structure of an FST

The easiest way to think of an FST is as a variation on the classic FSA. A simple FSA has states and transitions, and recognises something on an input tape. An FST has states and transitions too, but works with two tapes:
– One corresponds to the input.
– The other to the output.

The uses of an FST

Recognition: take a pair of strings (on two tapes) as input; output accept or reject.
Generation: output a pair of strings from the language.
Translation: read one string and output another.
Relation between two sets: a machine that computes the relation between the set of possible input strings and the set of possible output strings.

FSTs

[Slide shows example FST diagrams.]

Morphological parsing

Morphological analysis (parsing) can either be:
– an important stand-alone component of many applications (spelling correction, information retrieval), or
– simply a link in a chain of further linguistic analysis.
Interestingly, FSTs are bidirectional, i.e. they can be used for both parsing and generation.

Transitions

c:c means read a c on one tape and write a c on the other.
+N:ε means read a +N symbol on one tape and write nothing on the other.
+PL:s means read +PL and write an s.
Note the convention: x:y represents an input symbol x and an output symbol y.
[Diagram: a linear FST with arcs c:c, a:a, t:t, +N:ε, +PL:s.]
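A minimal Python sketch of this particular transducer, treating the machine as the single linear sequence of arcs just listed (the helper names are assumptions):

```python
# Sketch of the linear "cat" transducer: one (input, output) pair per arc.
EPS = ""   # epsilon: write nothing
arcs = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")]

def transduce(symbols, arcs):
    """Run the lexical tape through the arcs, producing the surface tape."""
    out = []
    for sym, (inp, outp) in zip(symbols, arcs):
        if sym != inp:
            return None        # a real FST would simply fail to accept
        out.append(outp)
    return "".join(out)

print(transduce(["c", "a", "t", "+N", "+PL"], arcs))   # cats
```

Swapping the two symbols on every arc runs the same machine in the opposite direction, from surface to lexical tape, which is the bidirectionality noted above.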

Typical Uses

Typically, we read from one tape using the first symbol on each machine transition (just as in a simple FSA), and we write to the second tape using the other symbol on the transition.

Ambiguity

Recall that in non-deterministic recognition, multiple paths through a machine may lead to an accept state; it didn't matter which path was actually traversed. In FSTs, the path to an accept state does matter, since different paths represent different parses, and different outputs will result.

Ambiguity

What's the right parse (segmentation) for unionisable?
– union-ise-able
– un-ion-ise-able
Each represents a valid path through the derivational morphology machine. Unlike in an FSA, the differences matter! (Some are not legal parses in English.)

Ambiguity

There are a number of ways to deal with this problem:
– Simply take the first output found.
– Find all the possible outputs (all paths) and return them all (without choosing).
– Bias the search so that only one or a few likely paths are explored.

The Gory Details

Of course, it's not as easy as "cat +N +PL" ↔ "cats". As we saw earlier, there are geese, mice and oxen. But there is also a whole host of spelling/pronunciation changes that go along with inflectional changes:
– cats vs dogs
– fox and foxes

Multi-Tape Machines

To deal with these complications, we add more tapes and use the output of one tape machine as the input to the next:
– This gives us a cascade of FSTs.
So, to handle irregular spelling changes, we add intermediate tapes with intermediate symbols.

Multi-Level Tape Machines

We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes down to the surface tape.
– ^ marks a morpheme boundary
– # marks a word boundary

Lexical to Intermediate Level

[Slide shows the lexical-to-intermediate FST diagram.]

Intermediate to Surface

The "add an e" rule, as in fox^s# → foxes#.
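The effect of this rule can be approximated with a regular-expression rewrite; this is a sketch of the rule's behaviour, not of the actual two-level implementation:

```python
import re

# Sketch of the e-insertion rule: insert an e between a sibilant-final
# stem and the plural s, i.e. x^s# -> xes#. As in the slides, ^ is the
# morpheme boundary and # the word boundary.
def e_insertion(tape):
    return re.sub(r"(ch|sh|[xsz])\^s#", r"\1es#", tape)

def strip_boundaries(tape):
    return tape.replace("^", "").replace("#", "")

print(strip_boundaries(e_insertion("fox^s#")))   # foxes
print(strip_boundaries(e_insertion("dog^s#")))   # dogs (rule doesn't fire)
```

The second call shows the rule doing the right thing for an input it doesn't apply to, a point the Note slide below returns to.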

Foxes

[Three slides step through the transduction of fox +N +PL via fox^s# to foxes#, showing the tapes at each stage.]

Note

A key feature of this lower machine is that it has to do the right thing for inputs to which it doesn't really apply. So:
– fox → foxes, but bird → birds

Overall Scheme

We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity):
– Lexical level to intermediate forms.
We have a larger set of machines that capture orthographic/spelling rules:
– Intermediate forms to surface forms.

Overall Scheme

[Slide shows the overall lexicon-plus-rules architecture.]

Cascades

This is a common architecture:
– Overall processing is divided up into distinct rewrite steps.
– The output of one layer serves as the input to the next.
– The intermediate tapes may or may not wind up being useful in their own right.
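A sketch of the cascade idea in Python, with each layer written as a string rewrite. The three specific steps are simplified stand-ins for the lexical → intermediate → surface pipeline described above:

```python
import re

# Sketch of an FST cascade: each step rewrites one tape into the next.
steps = [
    lambda t: t.replace("+N", "").replace("+PL", "^s#"),   # lexical -> intermediate
    lambda t: re.sub(r"(ch|sh|[xsz])\^s#", r"\1es#", t),   # e-insertion rule
    lambda t: t.replace("^", "").replace("#", ""),         # strip boundary symbols
]

def cascade(tape):
    for step in steps:
        tape = step(tape)   # the output of one layer is the input to the next
    return tape

print(cascade("fox+N+PL"))   # foxes
print(cascade("cat+N+PL"))   # cats
```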

Overall Plan

[Slide shows the overall processing plan.]

Part 3

A brief look at stemming
A brief look at tokenisation

Stemming

Stemming is the process of stripping affixes from words to reduce them to their stems.
– NB: the stem is not necessarily the base form.
Example: strip "ing" from all word endings:
– going → go
– stripping → stripp
–...

Uses of stemming

Stemming is often used in Information Retrieval, the basic technology underlying search engines.
– Task: given a query (e.g. keywords), retrieve documents which match it.
– Stemming is useful because it increases the likelihood of matches: e.g. a search for kangaroos returns documents containing kangaroo or kangaroos.

The Porter stemmer

Built by Martin Porter in 1980, and still widely used. A very simple FST-based stemmer: no lexicon, just rules. View a demo online here:

Rules in the Porter Stemmer

ATIONAL → ATE
– E.g. relational → relate
– But what about rational?
ING → ε
– E.g. shivering → shiver
These can be viewed as an FST, but without a lexical layer.
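The two rules above can be written as ordered rewrite rules, as in this Python sketch (a toy version only: the real Porter stemmer adds "measure" conditions on the stem that this sketch omits):

```python
import re

# Two Porter-style rewrite rules; the first matching rule fires.
rules = [
    (re.compile(r"ational$"), "ate"),   # relational -> relate
    (re.compile(r"ing$"), ""),          # shivering -> shiver
]

def stem(word):
    for pattern, replacement in rules:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

for w in ["relational", "shivering", "rational"]:
    print(w, "->", stem(w))
# note how the toy rule also misfires on 'rational',
# previewing the errors discussed on the next slide
```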

Errors in the Porter stemmer

Simple rules like the ones used by the Porter stemmer are error-prone (they miss exceptions).
Errors of commission (stemming where it shouldn't):
– rational → ration
Errors of omission (failing to relate forms that belong together):
– European → Europe

Tokenisation

Defined as the task of splitting running text into component tokens.
Related task: sentence segmentation (splitting running text into sentences).
Simplest technique: just split on whitespace. But what about:
– Punctuation
– Clitics
–...

Tokenisation & Sentence segmentation

The two often go hand in hand! Examples:
– A full stop can be a sentence boundary or an intra-word boundary.
– "Punctuation" marks in numbers: 30,
We can get a long way using simple regular expressions, at least in Indo-European languages.

Example from Maltese

Treat numbers as tokens, with or without decimals:
– \d+(\.\d+)?
Honorifics shouldn't be broken up at full stops or apostrophes:
– sant['’]|(onor|sra|nru|dott|kap|mons|dr|prof)\.
Definite articles shouldn't be broken up at hyphens:
– i?[dtlrnsxzżċ]-...[other exceptions]
General rule:
– A word is just a sequence of characters: \w+\s
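These patterns can be combined into an ordered, rule-based tokeniser, e.g. with Python's re module. This is a sketch built only from the slide's rules; the combined pattern and its name are assumptions:

```python
import re

# Rule-based tokeniser from the Maltese patterns above, tried in order
# of specificity; the general word rule comes last.
TOKEN = re.compile(
    r"\d+(?:\.\d+)?"                                # numbers, optional decimals
    r"|sant['’]"                                    # sant' keeps its apostrophe
    r"|(?:onor|sra|nru|dott|kap|mons|dr|prof)\."    # honorifics keep the full stop
    r"|i?[dtlrnsxzżċ]-"                             # definite article keeps the hyphen
    r"|\w+",                                        # fallback: a plain word
    re.IGNORECASE,
)

print(TOKEN.findall("Qtilt il-kelb"))   # ['Qtilt', 'il-', 'kelb']
print(TOKEN.findall("Xrobt l-ilma"))    # ['Xrobt', 'l-', 'ilma']
```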

Chinese

Words in Chinese are composed of hanzi characters. Each character usually represents one morpheme. Words tend to be quite short (ca. 2.4 characters long on average). There is no whitespace between words!

A simple algorithm for segmenting Chinese

Maximum Matching (maxmatch):
– Requires a lexicon.
– Start by pointing at the beginning of the input string.
– Choose the longest item in the lexicon that matches the input starting at the pointer, and move the pointer past it.
– If no word matches, move the pointer one character forward.

MaxMatch example

Imagine an English string with no whitespace:
– The table down there → thetabledownthere
The first item found by maxmatch is theta, because the algorithm tries to match the longest portion of the input. Then: bled, own, there.
Result: theta bled own there
Luckily, it works better for Chinese, because words there tend to be shorter than in English!
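A direct implementation of maxmatch in Python (a sketch: the function name and toy lexicon are illustrative) reproduces the failure case above:

```python
# A sketch of maximum-matching (maxmatch) segmentation.
def maxmatch(text, lexicon):
    tokens, i = [], 0
    while i < len(text):
        # try the longest possible match at the current pointer first
        for j in range(len(text), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # no lexicon item matches: emit one character and move on
            tokens.append(text[i])
            i += 1
    return tokens

lexicon = {"theta", "the", "table", "bled", "down", "own", "there"}
print(maxmatch("thetabledownthere", lexicon))
# ['theta', 'bled', 'own', 'there'] -- the English failure case above
```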