1 LIN3022 Natural Language Processing Lecture 3 Albert Gatt

2 Reminder: Non-deterministic FSA An FSA where there can be multiple paths for a single input (tape). Two basic approaches to recognition: 1. Either take a ND machine, convert it to a D machine, and then do recognition with that. 2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).
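The conversion option (1) can be sketched with the standard subset construction: each deterministic state is a set of ND states. The encoding below (a dict from (state, symbol) pairs to sets of successor states) and the toy b/a/! machine are illustrative assumptions, not the lecture's own code.

```python
# Approach 1: convert the ND machine to a deterministic one (subset
# construction), then recognise with the result.  The encoding (dict from
# (state, symbol) to a set of next states) is illustrative.

def determinise(transitions, start, alphabet):
    """Build a deterministic table whose states are frozensets of ND states."""
    dstart = frozenset([start])
    dtrans, seen, todo = {}, {dstart}, [dstart]
    while todo:
        dstate = todo.pop()
        for sym in alphabet:
            # the D successor is the union of all ND successors
            nxt = frozenset(s for q in dstate
                            for s in transitions.get((q, sym), ()))
            if nxt:
                dtrans[(dstate, sym)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return dstart, dtrans

def d_recognise(dstart, dtrans, accept, tape):
    """Deterministic recognition: exactly one live state at a time."""
    state = dstart
    for sym in tape:
        if (state, sym) not in dtrans:
            return False
        state = dtrans[(state, sym)]
    return bool(state & accept)

# A toy ND machine over b, a, !: state 2 has two outgoing a-arcs.
ND_TOY = {(0, 'b'): {1}, (1, 'a'): {2}, (2, 'a'): {2, 3}, (3, '!'): {4}}
```

Once determinised, recognition needs no search at all, which is the payoff of approach 1.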

3 Example [FSA diagram: a non-deterministic machine with states q0–q4 and transitions labelled b, a, a, !; state q2 has two outgoing a-arcs, which is what makes the machine non-deterministic]

4 Non-Deterministic Recognition: Search Given an ND FSA representing some language and given an input string: – If the input string belongs to the language, the ND FSA will contain at least one path for that string; – If the input string does not belong to the language, there will be no path; – Not all paths through the machine for an accepted string lead to an accept state. A recognition algorithm succeeds if a path is found for the input string; it fails otherwise.
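The search-based option (2) can be sketched directly: treat pairs of (machine state, tape position) as search states and explore them breadth-first. This is an illustrative sketch, not the lecture's own code; the transition-table encoding and the toy b/a/! machine are my assumptions.

```python
# Approach 2: non-deterministic recognition as state-space search.
# The machine is a dict mapping (state, symbol) -> set of next states.
from collections import deque

def nd_recognise(transitions, start, accept, tape):
    """True iff some path through the ND FSA consumes the whole tape
    and ends in an accept state."""
    frontier = deque([(start, 0)])           # search states: (FSA state, position)
    while frontier:
        state, pos = frontier.popleft()
        if pos == len(tape):
            if state in accept:
                return True                  # one successful path is enough
            continue
        for nxt in transitions.get((state, tape[pos]), ()):
            frontier.append((nxt, pos + 1))
    return False                             # no path found: reject

# Toy machine accepting b a a...a !: two a-arcs out of state 2.
SHEEP = {(0, 'b'): {1}, (1, 'a'): {2}, (2, 'a'): {2, 3}, (3, '!'): {4}}
```

Note how rejection only happens after every path has been explored, which is exactly the point made on the slide: some paths for an accepted string still dead-end.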

5 Words Finite-state methods are particularly useful in dealing with a lexicon. Many devices need access to large lists of words: – Spell checkers – Syntactic parsers – Language generators – Machine translation systems What sort of knowledge should a computational lexicon contain? – What we’re mainly concerned with today is morphology (inflection and derivation)

6 Computational tasks involving morphology Morphological analysis (parsing): – ommijiet → omm+PL Morphological (lexical) generation: – omm+PL → ommijiet Stemming: – ommijiet → omm – ommijiethom → omm – ommna → omm – (a map from a word to its stem)

7 Computational tasks involving morphology Lemmatisation – Convert words to their “basic” form: ommijiet → omm, ommijiethom → omm,... Tokenisation – Split running text into individual tokens: – Qtilt il-kelb → qtilt, il-, kelb – Xrobt l-ilma → xrobt, l-, ilma

8 Regular, irregular and messy The problem is to cover everything, of course: – mouse/mice, goose/geese, ox/oxen (PLURAL) – go/went, fly/flew (PAST) – solid/solidify, mechanical/mechanise (VERBALISATION)

9 Inflection vs Derivation Inflection is typically fairly straightforward – Relatively easy to identify regular and irregular cases Derivational morphology is much messier: – Quasi-systematicity – Irregular meaning change – Changes of word class

10 Derivational Examples Verbs and Adjectives to Nouns: – -ation: computerise → computerisation – -ee: appoint → appointee – -er: kill → killer – -ness: fuzzy → fuzziness

11 Derivational Examples Nouns and Verbs to Adjectives: – -al: computation → computational – -able: embrace → embraceable – -less: clue → clueless

12 Example: Compute Many paths are possible… Start with compute – computer → computerise → computerisation – computer → computerise → computerisable But not all paths/operations are equally good: – clue → *clueable

13 Morphology, recognition and FSAs We’d like to use the machinery provided by FSAs to capture these facts about morphology – Accept strings that are in the language – Reject strings that are not –... in an efficient way: Without listing every word in the language Capturing some generalisations Enabling fast search

14 What does a morphological parser or recogniser need? Lexicon – Often not feasible to just list all the words: Maltese has literally thousands of forms for some verbs... Some morphological processes are productive; we’re likely to meet completely new formations. Morphotactics (order of morphemes) – E.g. English plural morpheme after the noun stem E.g. Maltese “accusative” -l before dative pronominal suffixes (e.g. Qatilulna) E.g. English -ise before -ation (formalisation) Orthographic rules – Needed to handle variations in the spelling of the stem – E.g. English nouns ending in -y change to -i (city → cities)

15 Start Simple Regular singular nouns are ok Regular plural nouns have an -s on the end Irregulars are ok as is (i.e. treat as atomic for now)

16 Simple Rules

17 Substitute words for word classes The idea is to be able to use this kind of FSA for recognition. We’ve replaced classes like “reg-noun” with the actual words.
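A minimal sketch of such a recogniser, with the word classes expanded into tiny illustrative word lists (the lists and the naive -s rule are stand-ins, not the slide's actual lexicon; spelling changes like fox → foxes are handled later in the lecture):

```python
# Word classes from the slides, expanded into tiny illustrative word lists.
REG_NOUN = {"fox", "cat", "dog"}       # regular nouns: take an optional -s
IRREG_SG = {"goose", "mouse", "ox"}    # irregular singulars, treated as atomic
IRREG_PL = {"geese", "mice", "oxen"}   # irregular plurals, treated as atomic

def accept_noun(word):
    """Recognise a word as a well-formed (toy) English noun form."""
    if word in REG_NOUN | IRREG_SG | IRREG_PL:
        return True
    # reg-noun plus plural -s; spelling rules (fox -> foxes) come later
    return word.endswith("s") and word[:-1] in REG_NOUN
```

The key generalisation is that the -s rule applies only to the regular class, so over-regularised forms like gooses are rejected without being listed anywhere.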

18 Derivational Rules If everything is an accept state, how do things ever get rejected?

19 Lexicons The lexicon can be stored as an FSA. A base lexicon (with baseforms) can be plugged into a larger FSA to capture morphological rules and morphotactics.

20 Part 2 Finite state transducers Morphological parsing Morphological generation

21 Parsing/Generation vs. Recognition We can now run strings through these machines to recognise strings in the language But recognition is usually not quite what we need – Often if we find some string in the language we might like to assign a structure to it (parsing) – Or we might have some structure and we want to produce a surface form for it (production/generation)

22 Finite State Transducers The simple story: – Add another tape – Add extra symbols to the transitions – On one tape we read “cats”; on the other we write “cat +N +PL”

23 What is a finite-state transducer A transducer is a machine that takes input of a certain form and outputs something of a different form. – We can think of morphological analysis as transduction (from a word to stem+features): Ommijietna → omm + PL + POSS.1PL – So is morphological generation (from a stem+feature combination to a word): Omm + PL + POSS.1PL → ommijietna

24 Structure of an FST The easiest way to think of an FST is as a variation on the classic FSA A simple FSA has states and transitions, and recognises something on an input tape. An FST has states and transitions, but works with two tapes. – One corresponds to the input. – The other to the output.

25 The uses of an FST Recognition: – Take a pair of strings (on two tapes) as input; output accept or reject Generation: – Output a pair of strings from the language Translation: – Read one string and output another Relation between two sets: – A machine that computes the relation between the set of possible input strings and the set of possible output strings.

26 FSTs

27 Morphological parsing Morphological analysis or parsing can be either an important stand-alone component of many applications (spelling correction, information retrieval), or simply a link in a chain of further linguistic analysis. Interestingly, FSTs are bidirectional, i.e. they can be used for both parsing and generation.

28 Transitions c:c means read a c on one tape and write a c on the other +N:ε means read a +N symbol on one tape and write nothing on the other +PL:s means read +PL and write an s Note the convention: x:y represents an input symbol x and an output symbol y. [Transition labels: c:c a:a t:t +N:ε +PL:s]
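Read in one direction those arcs generate, read in the other they parse. Here is a hand-rolled sketch of just this single path (the encoding of arcs as (lexical, surface) pairs is my assumption, not a general FST toolkit):

```python
# The slide's arcs as (lexical symbol, surface symbol) pairs, in order.
# EPS plays the role of the slides' epsilon (write/read nothing).
EPS = ""
ARCS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")]

def generate(lexical_syms):
    """Lexical -> surface: read the lexical side, write the surface side."""
    if len(lexical_syms) != len(ARCS):
        return None
    out = []
    for (lex, surf), sym in zip(ARCS, lexical_syms):
        if sym != lex:
            return None                 # input doesn't follow this path
        out.append(surf)
    return "".join(out)

def parse(surface):
    """Surface -> lexical: the same arcs run in the other direction.
    An epsilon on the surface side consumes nothing from the input."""
    out, i = [], 0
    for lex, surf in ARCS:
        if surf == EPS:
            out.append(lex)             # write +N "for free"
        elif surface[i:i + len(surf)] == surf:
            out.append(lex)
            i += len(surf)
        else:
            return None
    return "".join(out) if i == len(surface) else None
```

The same arc list drives both directions, which is the bidirectionality point made two slides back.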

29 Typical Uses Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA). And we’ll write to the second tape using the other symbols on the transitions.

30 Ambiguity Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state; it didn’t matter which path was actually traversed. In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result

31 Ambiguity What’s the right parse (segmentation) for unionisable? – union-ise-able – un-ion-ise-able Each represents a valid path through the derivational morphology machine. Unlike in an FSA, the differences matter! (Some are not legal parses in English)

32 Ambiguity There are a number of ways to deal with this problem: – Simply take the first output found – Find all the possible outputs (all paths) and return them all (without choosing) – Bias the search so that only one or a few likely paths are explored

33 The Gory Details Of course, it’s not as easy as “cat +N +PL” ↔ “cats” As we saw earlier there are geese, mice and oxen But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes: – cats vs. dogs – fox and foxes

34 Multi-Tape Machines To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next – This gives us a cascade of FSTs So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols

35 Multi-Level Tape Machines We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape. ^ marks a morpheme boundary; # marks a word boundary

36 Lexical to Intermediate Level

37 Intermediate to Surface The add-an-“e” rule, as in fox^s# → foxes#

38 Foxes

39 Foxes

40 Foxes

41 Note A key feature of this lower machine is that it has to do the right thing for inputs to which it doesn’t really apply. So... – fox → foxes, but bird → birds

42 Overall Scheme We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity). – Lexical level to intermediate forms We have a larger set of machines that capture orthographic/spelling rules. – Intermediate forms to surface forms

43 Overall Scheme

44 Cascades This is a common architecture: overall processing is divided up into distinct rewrite steps, and the output of one layer serves as the input to the next. The intermediate tapes may or may not wind up being useful in their own right
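The cascade can be sketched as two composed functions, one per machine. The feature bundles, the boundary-erasure step, and the single e-insertion rule below are illustrative stand-ins for the full transducers:

```python
import re

# A two-step cascade in the spirit of the slides:
#   lexical form -> intermediate tape (^ and # boundaries) -> surface form.

def lexical_to_intermediate(stem, features):
    """First machine, e.g. ('fox', '+N +PL') -> 'fox^s#'."""
    if features == "+N +PL":
        return stem + "^s#"
    if features == "+N +SG":
        return stem + "#"
    raise ValueError("feature bundle not covered by this sketch")

def intermediate_to_surface(form):
    """Second machine: apply e-insertion, then erase boundary symbols."""
    form = re.sub(r"([xsz])\^s#", r"\1es#", form)   # fox^s# -> foxes#
    return form.replace("^", "").replace("#", "")

def generate_surface(stem, features):
    """The cascade: output of one layer is the input to the next."""
    return intermediate_to_surface(lexical_to_intermediate(stem, features))
```

Note how the spelling machine does the right thing for inputs to which e-insertion does not apply (bird^s# simply becomes birds), echoing the point made on the Note slide.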

45 Overall Plan

46 Part 3 A brief look at stemming... A brief look at tokenisation

47 Stemming Stemming is the process of stripping affixes from words to reduce them to their stems – NB: the stem is not necessarily the baseform – Example: strip “ing” from all word endings – going → go – stripping → stripp –...

48 Uses of stemming Often used in Information Retrieval – The basic technology underlying search engines – Task: given a query (e.g. keywords), retrieve documents which match it – Stemming is useful because it increases the likelihood of matches – E.g. a search for kangaroos returns documents containing kangaroo or kangaroos

49 The Porter stemmer Built by Martin Porter in 1980 Still widely used A very simple FST-based stemmer: no lexicon, just rules View a demo online here: http://qaa.ath.cx/porter_js_demo.html

50 Rules in the Porter Stemmer ATIONAL → ATE – E.g. relational → relate – But what about rational? ING → ε – E.g. shivering → shiver These can be viewed as an FST, but without a lexical layer.
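These two rules can be sketched as plain suffix rewrites. This fragment deliberately omits the conditions the real Porter stemmer attaches to each rule, so it over-applies in just the way the next slide discusses:

```python
# Two rewrite rules from the slide, as bare (suffix, replacement) pairs.
# The real Porter stemmer guards each rule with a condition on the
# remaining stem; this condition-free sketch has no such guard.
RULES = [("ational", "ate"), ("ing", "")]

def stem(word):
    """Apply the first matching suffix rule, if any."""
    w = word.lower()
    for suffix, repl in RULES:
        if w.endswith(suffix):
            return w[: len(w) - len(suffix)] + repl
    return w
```

With no lexicon and no conditions, the rules happily produce non-words like stripp, exactly as in the earlier stemming example.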

51 Errors in the Porter stemmer Simple rules like the ones used by the Porter stemmer are error-prone (they miss exceptions) Errors of commission: – rational → ration Errors of omission: – European → Europe

52 Tokenisation Defined as the task of splitting running text into component tokens. Related task: sentence segmentation (splitting running text into sentences) Simplest technique: – Just split on whitespace But what about: – Punctuation – Clitics –...

53 Tokenisation & Sentence segmentation These often go hand in hand! Examples: – A full stop can be a sentence boundary or an intra-word boundary – “Punctuation” marks in numbers: 30,000; 20.2 We can get a long way using simple regular expressions, at least in Indo-European languages.

54 Example from Maltese Treat numbers as tokens, with or without decimals: – \d+(\.\d+)? Honorifics shouldn’t be broken up at full stops or apostrophes: – sant['’]|(onor|sra|nru|dott|kap|mons|dr|prof)\. Definite articles shouldn’t be broken up at hyphens: – i?[dtlrnsxzżċ]-...[other exceptions] General rule: – A word is just a sequence of characters: \w+\s
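The rules above can be assembled into a single alternation, tried most-specific-first. The `tokenise` wrapper, the lowercasing, and the ordering of alternatives are my assumptions about how the slide's rules combine:

```python
import re

# The slide's rules as one alternation; earlier alternatives win, so the
# specific patterns are listed before the general \w+ fallback.
TOKEN = re.compile(
    r"""\d+(?:\.\d+)?                              # numbers, optional decimals
      | sant['’]                                   # sant' with either apostrophe
      | (?:onor|sra|nru|dott|kap|mons|dr|prof)\.   # honorific + full stop
      | i?[dtlrnsxzżċ]-                            # definite article + hyphen
      | \w+                                        # fallback: a word
    """,
    re.IGNORECASE | re.VERBOSE,
)

def tokenise(text):
    """Return the list of tokens, lowercased for comparison."""
    return TOKEN.findall(text.lower())
```

Ordering matters: if `\w+` came first, it would swallow the article and its noun as one token before the hyphen rule ever fired.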

55 Chinese Words in Chinese are composed of hanzi characters. Each character usually represents one morpheme. Words tend to be quite short (ca. 2.4 characters long on average) There is no whitespace between words!

56 A simple algorithm for segmenting Chinese Maximum Matching (maxmatch) – Requires a lexicon. – Start by pointing at the beginning of the input string – Choose the longest item in the lexicon that matches the input starting at the pointer, and move the pointer past it – If no word matches, move the pointer one symbol forward.

57 MaxMatch example Imagine an English string with no whitespace: – the table down there → thetabledownthere The first item found by MaxMatch is theta – Because the algorithm tries to match the longest portion of the input – Then: bled, own, there Result: theta bled own there Luckily, it works better for Chinese, because words there tend to be shorter than in English!
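A sketch of maxmatch as described above, with a toy lexicon chosen to reproduce exactly this failure (both the lexicon and the function name are illustrative):

```python
# Maximum Matching: greedily take the longest lexicon word at the current
# position; if nothing matches, emit one character and move on.
def maxmatch(text, lexicon):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):        # longest candidate first
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])               # no match: move one symbol
            i += 1
    return tokens

# Toy lexicon reproducing the slide's English failure case.
LEX = {"the", "theta", "table", "down", "there", "bled", "own"}
```

Because theta is longer than the, the greedy choice at position 0 commits the algorithm to the garden-path segmentation, and there is no backtracking to recover.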

