Natural Language Processing >> Morphology <<

Natural Language Processing >> Morphology << winter / fall 2010/2011 41.4268 Prof. Dr. Bettina Harriehausen-Mühlbauer Univ. of Applied Science, Darmstadt, Germany www.fbi.h-da.de/~harriehausen b.harriehausen@fbi.h-da.de Bettina.Harriehausen@h-da.de

content 1 morphemes 2 compounds / concatenation 3 idiomatic phrases multiple word entries (MWE) spell aid regular expressions Finite State Automata (FSA) WS 2010/2011 NLP - Harriehausen

Morphemes
definition: morpheme = smallest possible item in a language that carries meaning
- lexeme (man, house, dog, ...)
- inflectional affixes (dog-s, want-ed, ...)
- other affixes (pre-/in-/suffixes): unwanted, atypical, antipathetic, ...
- esp. in technical language (-itis = "inflammation", gastro = stomach ... gastroenteritis)

morphemes
- free morphemes: stand-alone, carry lexical and morphological meaning (e.g. house = sing, neuter, nominative; case/number/gender)
- bound morphemes: form a legal wordform only in combination with another morpheme; they cannot stand alone (e.g. un-happy, gastroenteritis)

morphemes
- inflectional morphemes: create wordforms of an existing word and carry morphological meaning (e.g. dogs, laughed, going)
- derivational morphemes: create new words and carry morphological meaning (happily, intellectually, instruction, instructor, insulator, the pounding, limpness, blindness, ...)
Question: which string (~morpheme) do we include in our dictionary?

compounds / concatenation
in addition to single morphemes, we need to consider "multiple morpheme strings / multi word expressions" (fixed phrases). From top to bottom, the idiomatic rigidity and the formal complexity increase:
- independent of the context: dog, cat, ...
- compounding: combine lexical meanings: carseat, houseboat, ...
- compounding: not a combination of the lexical meanings: nosebag, nosedive, paperback, ladybug, ...
- depending on the context: bite the dust, lose face, kick the bucket, ...

Samples for long compounds in German
- die Armbrust
- die Mehrzweckhalle
- das Mehrzweckkirschentkerngerät
- die Gemeindegrundsteuerveranlagung
- die Nummernschildbedruckungsmaschine
- der Mehrkornroggenvollkornbrotmehlzulieferer
- der Schifffahrtskapitänsmützenmaterialhersteller
- die Verkehrsinfrastrukturfinanzierungsgesellschaft
- die Feuerwehrrettungshubschraubernotlandeplatzaufseherin
- der Oberpostdirektionsbriefmarkenstempelautomatenmechaniker
- das Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
- die Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

compounds / concatenation
decompounding: principles / rules:
- FANO rule: "the analysis is unambiguous, when a morpheme is not the beginning of another morpheme" (= principle of longest match), e.g. but / butter
- Segmentation has to be done recursively in order to find all possibilities:
  horseshoe: horses - hoe (?) vs. horse - shoe
  Staubecken: Stau - Becken vs. Staub - Ecken
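The recursive segmentation mentioned on this slide can be sketched in a few lines. This is only an illustration, not the lecture's own algorithm: the function name `segmentations` and the toy `LEXICON` are assumptions made for this example, and a real system would need a full morpheme dictionary plus linking elements.

```python
def segmentations(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon entries."""
    if word == "":
        return [[]]  # the empty string has exactly one (empty) segmentation
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in lexicon:
            # Recurse on the remainder and prepend the matched prefix.
            for rest in segmentations(word[i:], lexicon):
                results.append([prefix] + rest)
    return results


# Hypothetical toy lexicon (lower-cased), just large enough for the slide's examples.
LEXICON = {"horse", "horses", "shoe", "hoe", "stau", "staub", "becken", "ecken"}

print(segmentations("horseshoe", LEXICON))
print(segmentations("staubecken", LEXICON))
```

Both examples yield two analyses, which is why a longest-match heuristic alone cannot decide between them; choosing "horse - shoe" over "horses - hoe" needs additional knowledge.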

concatenation
Problems: not all morphemes can be concatenated

idiomatic phrases (http://www. geo
Out of the blue; To be on Cloud Nine; A leopard cannot change its spots; Head over heels; Fair Play; As cool as a cucumber; The early bird catches the worm; An apple a day keeps the doctor away; As fit as a fiddle; Beat about the bush; The Big Apple; The apple of my eye; Wet behind the ears; A bird in the hand is worth two in the bush; It's raining cats and dogs; A friend in need is a friend indeed; It's all Greek to me

idiomatic phrases, German examples (http://www. geo
Wie bei Hempels unterm Sofa; Schmetterlinge im Bauch; Jemanden übers Ohr hauen; Ein Bäuerchen machen; Mit jemandem durch dick und dünn gehen; Seine Pappenheimer kennen; Jemandem die Würmer aus der Nase ziehen; Die Arschkarte ziehen; Mit jemandem Pferde stehlen können; Sich aus dem Staub machen; Hummeln im Hintern haben; Im siebten Himmel sein; Viele Wege führen nach Rom; Mit einem lachenden und einem weinenden Auge; Nah am Wasser gebaut haben; Da ist der Bär los; Nachtigall, ick hör dir trapsen; Mein lieber Scholli!

idiomatic phrases, German examples (http://www. geo
Jemandem einen Denkzettel verpassen; Sich auf den Schlips getreten fühlen; Alles für die Katz; Wo drückt denn der Schuh?; Gegen den Strich gehen; Den Faden verlieren; Etwas ausbaden müssen; Einen Stein im Brett haben; Bahnhof verstehen; Der springende Punkt; Der Sündenbock sein; Einen Ohrwurm haben; Das ist doch zum Mäusemelken!; Schmiere stehen; Den Teufel an die Wand malen; Auf dem Holzweg sein; Eselsbrücke; In der Kreide stehen

idiomatic phrases, German examples (http://www. geo
Die Ohren steif halten; Auf Vordermann bringen; Um die Ecke bringen; Hals- und Beinbruch; Auf dem Kerbholz haben; Eine Schlappe einstecken; Frosch im Hals; Es zieht wie Hechtsuppe; Jemandem einen Bärendienst erweisen; Damoklesschwert; Tomaten auf den Augen haben; Jemandem raucht der Kopf; Für 'n Appel und 'n Ei; Etwas an die große Glocke hängen; Das ist Jacke wie Hose; Etwas aus dem Ärmel schütteln; Ein X für ein U vormachen; Jemandem nicht das Wasser reichen können

idiomatic phrases, German examples (http://www. geo
Alles im grünen Bereich; Die Hand ins Feuer legen; Auf Draht sein; Sein blaues Wunder erleben; Der hat es faustdick hinter den Ohren; Mein Name ist Hase, ich weiß von nichts; Aus dem Stegreif; Der Groschen ist gefallen; Einen Vogel haben; Den Kürzeren ziehen; Bis in die Puppen; Etwas hinter die Ohren schreiben; Ins Fettnäpfchen treten; Beleidigte Leberwurst; Jemanden auf dem Kieker haben; Ich verstehe immer nur Bahnhof!; Die Katze im Sack kaufen; Das kann kein Schwein lesen!

idiomatic phrases, German examples (http://www. geo
Bekannt wie ein bunter Hund; Den Kopf in den Sand stecken; Mit dem ist nicht gut Kirschen essen; Aller guten Dinge sind drei; Lampenfieber; Das kommt mir spanisch vor; Schwein haben; Das hast du dir selbst eingebrockt; Seinen Senf dazugeben; Jemandem ist eine Laus über die Leber gelaufen; Kalte Füße bekommen; Im Stich lassen; Schwedische Gardinen; Alles in Butter; Geld auf den Kopf hauen; Das Handtuch werfen; Sich mit fremden Federn schmücken

multiple word entries (MWE)
in addition to single morphemes, we need to consider "multiple morpheme strings" (fixed phrases):
- independent of the context: dog, cat, ...
- compounding (a): combine lexical meanings: carseat, houseboat, ...
- compounding (b): not a combination of the lexical meanings: nosebag, nosedive, paperback, ladybug, soap opera, ...
- depending on the context: bite the dust, lose face, kick the bucket, ...
relevant (!) for: electronic dictionaries, all NLP applications, machine translation

multiple word entries (MWE)
Problems: the relationships among the components change (the "Schnitzel" problem):
- sirloin steak (made from certain parts of ...)
- soy steak (made out of material ...)
- "Wiener Schnitzel" (according to a certain recipe)
- pepper steak (served with ...)
- ...
Even though the single lexical meanings remain untouched in the compound, the relationship between the components varies tremendously!

multiple word entries (MWE)
the 3 main relationships (default ?) between parts of a compound word (the role of global knowledge in decompounding):

compound      meaning                     relationship
doorknob      knob of the door            1: is-a / is-part-of / genitive
carseat       seat of the car             1: is-a / is-part-of / genitive
glassdoor     door made of glass          2: made from / material
nutbread      (not: bread of the nut)     2: made from / material
waterglass    glass filled with water     3: used for
oiltruck      truck that carries oil      3: used for
              (not: truck made of oil)

multiple word entries (MWE)
decompounding: the orange bowl problem
"Can you please bring me the orange bowl?"
- bowl of orange colour?
- bowl filled with oranges?
- bowl that was formerly / usually filled with oranges?
- bowl having the shape of an orange?
- bowl with an orange pattern?

spell aid
in NLP, decompounding algorithms are essential for spell-checking / spell aid.
How do we define lexical error in NLP terms?
An error is a string that cannot be found in / matched with a dictionary entry. It is not necessarily an incorrect word (esp. neologisms).

spell aid
spell checking algorithms are based on the following types of mistakes (statistics !):
- phonetic similarities (ph - f: telephone - telefone)
- deletion of multiple entries (mouuse - mouse)
- wrong order (from - form; mouse - muose)
- substitution of neighbouring letters on the keyboard (miuse - mouse)
- include missing letters (vowels in between consonants ...) (telephne)
- typos occur towards the end of a word (assumption: first letter is correct)
- segmentation / decomposition into substrings (horeshoe - horseshoe)
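To make the error types concrete, here is a minimal sketch that generates repair candidates for a misspelled string and keeps those found in a dictionary. It is an assumption-laden illustration, not the algorithm of any particular spell checker: the names `candidates`, `ALPHABET` and `DICTIONARY` are invented for this example, the letter substitution tries every letter (a superset of keyboard neighbours), and no statistics are used to rank the results.

```python
import string

ALPHABET = string.ascii_lowercase


def candidates(word, dictionary):
    """Return dictionary words reachable from `word` by one simple repair."""
    repairs = set()
    for i in range(len(word)):
        # Drop one letter (covers doubled letters such as "mouuse").
        repairs.add(word[:i] + word[i + 1:])
        # Replace one letter (superset of keyboard-neighbour substitution).
        for c in ALPHABET:
            repairs.add(word[:i] + c + word[i + 1:])
    for i in range(len(word) - 1):
        # Swap two neighbouring letters ("muose").
        repairs.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    for i in range(len(word) + 1):
        # Insert one missing letter ("telephne", "horeshoe").
        for c in ALPHABET:
            repairs.add(word[:i] + c + word[i:])
    # Phonetic confusion f / ph ("telefone").
    repairs.add(word.replace("f", "ph"))
    repairs.discard(word)  # the input itself is not a repair
    return sorted(repairs & dictionary)


# Hypothetical toy word list.
DICTIONARY = {"mouse", "from", "form", "telephone", "horseshoe"}

for typo in ["mouuse", "muose", "miuse", "telephne", "telefone", "horeshoe"]:
    print(typo, "->", candidates(typo, DICTIONARY))
```

Note that "from" and "form" are both valid dictionary entries, so a confusion between them cannot be detected by dictionary lookup at all; it needs the sentence-level analysis discussed on the later slide about degrees of "intelligence".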

spell aid
include missing letters (vowels in between consonants ...) (telephne)
certain rules apply, e.g. in German: never concatenate "l", "n" or "r" with "tz" and "ck":
_ltz_ (*Holtz)   _lck_
_ntz_            _nck_
_rtz_            _rck_
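The German rule on this slide fits in a single regular expression. The sketch below is an illustration under stated assumptions: the function name `violates_tz_ck_rule` is invented here, and the rule is applied blindly, so proper names that legitimately contain such sequences would be flagged as well.

```python
import re

# "l", "n" or "r" directly followed by "tz" or "ck".
ILLEGAL_SEQUENCE = re.compile(r"[lnr](tz|ck)")


def violates_tz_ck_rule(word):
    """True if the word contains one of the sequences ruled out on the slide."""
    return ILLEGAL_SEQUENCE.search(word.lower()) is not None


for word in ["Holtz", "Holz", "Katze", "Zucker", "Banck", "Bank"]:
    print(word, violates_tz_ck_rule(word))
```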

spell aid
include missing letters: www.dositey.com/language/spelling/Mislet3.htm

spell aid
How does spell checking work (w.r.t. grammar checking)? Various degrees of "intelligence":
- System A: no match found in the dictionary -> mark entry as incorrect
- System B: no match found in the dictionary. Initiate a rudimentary parse (left-right-search). Try to identify the wordclass, i.e. limit possibilities and continue a sentential analysis, e.g. the ... man (statistics: DET + ADJ + NOUN)
- System C: no match found in the dictionary. Initiate a segmentation of the word to identify the wordclass, e.g. look for typical endings (-ly = adverb / capital letters = proper noun, ...). This way new wordcreations can be identified (e.g. any word ending in -ness = noun)
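A "System C" style guess can be sketched with the three cues named on the slide. The function name `guess_word_class` and the order in which the cues are tried are assumptions of this example; a capitalised word at the start of a sentence, for instance, would be misclassified by this naive version.

```python
def guess_word_class(token):
    """Guess a word class for an unknown token from surface cues only."""
    if token[:1].isupper():
        return "proper noun"   # capital letter = proper noun
    if token.endswith("ness"):
        return "noun"          # any word ending in -ness = noun
    if token.endswith("ly"):
        return "adverb"        # -ly = adverb
    return None                # no cue applies


for token in ["Darmstadt", "blurriness", "happily", "xyz"]:
    print(token, "->", guess_word_class(token))
```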

regular expressions (Jurafsky, section 2.1)
- In order to figure out whether something is an incorrect word, the machine has to match the string (= a sequence of symbols; any sequence of alphanumeric characters: letters, numbers, spaces, tabs, punctuation) to an entry in the dictionary
- other matches: e.g. information retrieval in www-search engines (Google, AltaVista, ...)
- the standard notation for characterizing text sequences = regular expressions
- regular expressions are written in (regular expression) languages: e.g. Perl, grep (Global Regular Expression Print)
- formally, regular expressions are algebraic notations for characterizing a set of strings
- regular expression search requires a pattern that we want to search for (and a corpus of text to search through) (text mining !)

regular expressions (Jurafsky, section 2.1)
Example: Search for the pattern "linguistics". You also want to find documents with "Linguistics" and "LINGUISTICS". (remember: the computer does EXACTLY what you tell it to ...)
The regular expression /linguistics/ matches any string in any document containing exactly the substring "linguistics". Regular expressions are case sensitive.
samples (Jurafsky, p. 23):

regular expression   example pattern matched
/woodchucks/         "interesting links to woodchucks and lemurs"
/a/                  "Mary Ann stopped by Mona's"
/Claire says,/       "Dagmar, my gift please," Claire says,
/song/               "all our pretty songs"
/!/                  "You've left the burglar behind again!" said Nori
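The case sensitivity described here is easy to check with Python's standard `re` module, used as a stand-in for the Perl and grep notation of the slides. The sample sentence and variable names are made up for this sketch.

```python
import re

text = "Computational Linguistics, linguistics and LINGUISTICS"

# A plain pattern matches only the exact lower-case substring.
exact = re.findall(r"linguistics", text)

# The IGNORECASE flag is one way to find all three spellings.
any_case = re.findall(r"linguistics", text, flags=re.IGNORECASE)

# /a/ matches the first "a" it finds, here inside "Mary".
first_a = re.search(r"a", "Mary Ann stopped by Mona's").start()

print(exact)
print(any_case)
print(first_a)
```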

regular expressions (Jurafsky, section 2.1)
linguistics - Linguistics - LINGUISTICS
to search for alternative characters "l" and/or "L" we use square brackets: [lL]

regular expression   match                        sample pattern
/[lL]inguistics/     Linguistics or linguistics   "computational linguistics is fun"
/[1234567890]/       any digit                    "this is Linguistics 5981"

regular expressions (Jurafsky, section 2.1)
to search for a character in a range we use the dash: [-]

regular expression   match                  sample pattern
/[A-Z]/              any uppercase letter   "this is Linguistics 5981"
/[0-9]/              any single digit       "this is Linguistics 5981"
/[1234567890]/       any single digit       "this is Linguistics 5981"

regular expressions (Jurafsky, section 2.1)
to search for negation, i.e. a character that I do NOT want to find, we use the caret: [^]

regular expression   match                     sample pattern
/[^A-Z]/             not an uppercase letter   "this is Linguistics 5981"
/[^Ll]/              neither L nor l           "this is Linguistics 5981"
/[^\.]/              not a period              "this is Linguistics 5981"

Special characters:
\*   an asterisk        "L*I*N*G*U*I*S*T*I*C*S"
\.   a period           "Dr.Doolittle"
\?   a question mark    "Is this Linguistics 5981 ?"
\n   a newline
\t   a tab
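The bracket, range, negation and escape notations from these slides can be tried out on the slides' own sample strings. This is a small sketch using Python's `re` module; the variable names are arbitrary.

```python
import re

sample = "this is Linguistics 5981"

first_upper = re.search(r"[A-Z]", sample).group()        # first uppercase letter
first_digit = re.search(r"[0-9]", sample).group()        # first digit
first_non_upper = re.search(r"[^A-Z]", sample).group()   # first character that is not uppercase
both_spellings = re.findall(r"[lL]inguistics", "Linguistics or linguistics")
period_at = re.search(r"\.", "Dr.Doolittle").start()     # escaped period = literal "."
asterisks = re.findall(r"\*", "L*I*N*G")                 # escaped asterisk = literal "*"

print(first_upper, first_digit, repr(first_non_upper))
print(both_spellings, period_at, len(asterisks))
```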

regular expressions (Jurafsky, section 2.1)
to search for optional characters we use the question mark: ?

regular expression   match              sample pattern
/colou?r/            colour or color    "beautiful colour"

to search for any number of a certain character we use the Kleene star: *

regular expression   match
/a*/                 any string of zero or more "a"s
/aa*/                at least one "a", followed by any number of further "a"s

regular expressions (Jurafsky, section 2.1)
To look for at least one character of a type we use the Kleene "+":

regular expression   match
/[0-9]+/             a sequence of digits

Any combination is possible:

regular expression   match
/[ab]*/              zero or more "a"s or "b"s
/[0-9][0-9]*/        any integer (= a string of digits)
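The counters "?", "*" and "+" can be compared side by side. The helper `matches` below is an assumption of this sketch; it uses `re.fullmatch`, so the whole string has to fit the pattern rather than just some substring.

```python
import re


def matches(pattern, string):
    """True if the entire string is described by the pattern."""
    return re.fullmatch(pattern, string) is not None


print(matches(r"colou?r", "color"), matches(r"colou?r", "colour"))  # optional "u"
print(matches(r"a*", ""), matches(r"aa*", ""))                      # zero vs. at least one "a"
print(matches(r"[ab]*", "abba"))                                    # any mix of "a" and "b"
print(matches(r"[0-9][0-9]*", "5981"), matches(r"[0-9]+", "5981"))  # two spellings of "integer"
print(re.findall(r"[0-9]+", "this is Linguistics 5981"))            # digit sequences in running text
```

`/[0-9][0-9]*/` and `/[0-9]+/` describe the same set of strings; the Kleene "+" is simply the shorter notation.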

regular expressions (Jurafsky, section 2.1)
The "." is a very special character -> so-called wildcard

regular expression   match                             sample pattern
/b.ll/               any character between b and ll    ball, bell, bull, bill

Will the search find "Bill"?

regular expressions (Jurafsky, section 2.1)
Anchors (start of line: "^", end of line: "$")

regular expression   match                                       sample pattern
/^Linguistics/       "Linguistics" at the beginning of a line    "Linguistics is fun."
/linguistics\.$/     "linguistics." at the end of a line         "We like linguistics."

Anchors (word boundary: "\b", non-boundary: "\B")

regular expression   match             sample pattern
/\bthe\b/            "the" alone       "This is the place."
/\Bthe\B/            "the" included    "This is my mother."
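A short check of the four anchors on the slide's sample sentences, again with Python's `re` module as a stand-in for Perl; the variable names are made up for the example.

```python
import re

line_start = re.search(r"^Linguistics", "Linguistics is fun.") is not None
line_end = re.search(r"linguistics\.$", "We like linguistics.") is not None

# \b requires a word boundary on each side, \B requires that there is none.
alone_in_place = re.findall(r"\bthe\b", "This is the place.")
alone_in_mother = re.findall(r"\bthe\b", "This is my mother.")
inside_mother = re.findall(r"\Bthe\B", "This is my mother.")
inside_place = re.findall(r"\Bthe\B", "This is the place.")

print(line_start, line_end)
print(alone_in_place, alone_in_mother, inside_mother, inside_place)
```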

regular expressions (Jurafsky, section 2.1)
More on alternative characters: the pipe symbol "|" (disjunction)

regular expression   match                   sample pattern
/colou?r/            colour or color         "beautiful colour"
/progra(m|mme)/      program or programme    "linguistics program"
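The wildcard and the disjunction can be tested in the same way. The word list below is invented for the sketch; it also answers the earlier wildcard question: since matching is case sensitive, /b.ll/ does not find "Bill".

```python
import re

words = "ball bell bull bill Bill"

# "." stands for exactly one arbitrary character between "b" and "ll".
wildcard_hits = re.findall(r"b.ll", words)

# The group (m|mme) offers two alternative endings after "progra".
is_program = re.fullmatch(r"progra(m|mme)", "program") is not None
is_programme = re.fullmatch(r"progra(m|mme)", "programme") is not None
is_programm = re.fullmatch(r"progra(m|mme)", "programm") is not None

print(wildcard_hits)
print(is_program, is_programme, is_programm)
```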

regular expressions (Jurafsky, section 2.1)
operator precedence hierarchy
What does the following expression match? /student [0-9]+ */
Will it match "student 1 student 2 student 3"?
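The question can be settled by trying it out. In the usual precedence order (parentheses, then counters such as * + ?, then sequences and anchors, then disjunction) a counter applies only to the item directly before it, so the final "*" in /student [0-9]+ */ repeats just the space. The sketch below uses Python's `re` module; the parenthesised variant is an addition for comparison, not part of the slide.

```python
import re

text = "student 1 student 2 student 3"

# Without parentheses only the first "student <number>" plus trailing spaces is matched.
first_match = re.match(r"student [0-9]+ *", text).group()

# The pattern therefore does not describe the whole string ...
whole_string = re.fullmatch(r"student [0-9]+ *", text)

# ... unless the unit to be repeated is grouped explicitly.
grouped = re.fullmatch(r"(student [0-9]+ *)*", text)

print(repr(first_match))
print(whole_string is None)
print(grouped is not None)
```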

regular expressions (Jurafsky, section 2.1)
Perl expressions are also used for string substitution (used in ELIZA):
s/man/men/                         man -> men
Perl expressions are also used for string repetition via memory (the number operator):
s/(linguistics)/wonderful \1/      linguistics -> wonderful linguistics
ELIZA:
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1 ?/
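Perl's s/pattern/replacement/ corresponds roughly to `re.sub` in Python, which also writes the memory reference as \1. The sketch below re-creates the slide's substitutions; the function name `eliza_reply` and the input sentence are assumptions made for this example.

```python
import re

plural = re.sub(r"man", "men", "man")

# \1 re-inserts whatever the first parenthesised group matched.
praised = re.sub(r"(linguistics)", r"wonderful \1", "I like linguistics")


def eliza_reply(utterance):
    """Apply the first ELIZA rule from the slide to one utterance."""
    return re.sub(r".* YOU ARE (depressed|sad) .*",
                  r"I AM SORRY TO HEAR YOU ARE \1",
                  utterance)


print(plural)
print(praised)
print(eliza_reply("I KNOW YOU ARE sad TODAY"))
```

Because the pattern is wrapped in `.*` on both sides, the whole utterance is replaced and only the remembered word survives into the reply.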

Finite State Automata (FSA)
The regular expression is more than just a convenient metalanguage for text searching.
First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe and look at in this lecture. Any regular expression can be implemented as a finite-state automaton*. Symmetrically, any finite-state automaton can be described with a regular expression.
Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in the following figure.
* Except regular expressions that use the memory feature (more on that later)

Finite State Automata (FSA)
[Figure: diagram linking finite automata, regular expressions, and regular languages]
The relationship between finite state automata, regular expressions, and regular languages*
* as suggested by Martin Kay in: Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the ACL (EACL-87), Copenhagen, Denmark, pp. 2-10. ACL.

Finite State Automata (FSA)
Examples:
- Introduction to finite-state automata for regular expressions
- Mapping from regular expressions to automata

Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
"After a while, with the parrot's help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said."
Hugh Lofting, The Story of Doctor Dolittle

Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
Sheep language can be defined as any string from the following (infinite) set:
baa! baaa! baaaa! baaaaa! baaaaaa! ...

Finite State Automata (FSA)
baa! baaa! baaaa! baaaaa! baaaaaa! ...
The regular expression for this kind of sheeptalk is /baa+!/
All regular expressions can be represented as finite-state automata (FSA):

Finite State Automata (FSA)
a finite-state automaton (FSA) for the regular expression /baa+!/

q0 --b--> q1 --a--> q2 --a--> q3 --!--> q4
(q3 additionally has an a-transition back to itself)

q0 = start state, q4 = final state / accepting state

Finite State Automata (FSA)
[Figure: a tape with cells holding the input ... a b a ! b ..., with the automaton's current state q marked on the tape]
Example of a non-final state = rejection of the input

Finite State Automata (FSA)
The state-transition table for the previous FSA (∅ = null, no transition; the colon marks the final state):

         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4:       ∅    ∅    ∅

Finite State Automata (FSA)
An algorithm for deterministic recognition of FSAs

function D-RECOGNIZE(tape, machine) returns accept or reject
  index <- Beginning of tape
  current-state <- Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elseif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state <- transition-table[current-state, tape[index]]
      index <- index + 1
  end
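D-RECOGNIZE translates almost line by line into Python. The sketch below encodes the sheeptalk state-transition table as a dictionary of dictionaries, where a missing key plays the role of an empty table cell. The names `TRANSITIONS`, `ACCEPT_STATES` and `d_recognize` are chosen for this example only.

```python
# Sheeptalk automaton for /baa+!/: state -> {input symbol -> next state}.
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPT_STATES = {4}


def d_recognize(tape, transitions=TRANSITIONS, start=0, accept=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the input string `tape`."""
    current_state = start
    for symbol in tape:
        next_state = transitions[current_state].get(symbol)
        if next_state is None:
            return False  # empty table cell: reject immediately
        current_state = next_state
    # End of input reached: accept only in an accept state.
    return current_state in accept


for tape in ["baa!", "baaaa!", "ba!", "baa", "abab", "baa!a"]:
    print(tape, d_recognize(tape))
```

The `for` loop replaces the explicit index of the pseudocode; leaving the loop corresponds to the "End of input has been reached" branch.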

Finite State Automata (FSA)
[Figure: the tape ... b a a a ! ... with the automaton's state shown at each step]
Tracing the execution of the FSA on some sheeptalk

Finite State Automata (FSA)
Regular expressions can be represented as FSAs:
[Figure: the sheeptalk automaton q0 ... q4, shown once as before and once with an added fail state qf that collects the transitions missing from the original automaton]

Finite State Automata (FSA)
[Figure: state diagram with states q0 ... q4 and arcs labelled b, a, a, !]
A non-deterministic finite-state automaton for talking sheep

Finite State Automata (FSA)
[Figure: state diagram with states q0 ... q4, arcs labelled b, a, a, ! and an additional ε-arc]
A non-deterministic finite-state automaton (NFSA) for the sheep language, having an ε-transition
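Recognition with a non-deterministic automaton can be sketched by tracking the set of all states the machine might be in. The arcs below are an assumption, since the figure itself is not recoverable from the transcript: they follow the textbook version of this automaton, with an ε-arc leading from q3 back to q2. The names `NFSA`, `epsilon_closure` and `nd_recognize` are invented for this example.

```python
EPSILON = ""  # label used here for transitions that consume no input

# Assumed arcs: (state, symbol) -> set of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, "!"): {4},
    (3, EPSILON): {2},  # the epsilon-transition
}
START, ACCEPT = 0, 4


def epsilon_closure(states):
    """Add every state reachable through epsilon-transitions alone."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in NFSA.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def nd_recognize(tape):
    """Return True if some path through the NFSA accepts `tape`."""
    current = epsilon_closure({START})
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= NFSA.get((state, symbol), set())
        current = epsilon_closure(moved)
        if not current:
            return False  # no path survives this symbol
    return ACCEPT in current


for tape in ["baa!", "baaaa!", "ba!", "baa"]:
    print(tape, nd_recognize(tape))
```

Tracking state sets avoids explicit backtracking; it accepts exactly the same strings as the deterministic automaton for /baa+!/.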