Natural Language Processing >> Morphology << Prof. Dr. Bettina Harriehausen-Mühlbauer Univ. of Applied Science, Darmstadt, Germany www.fbi.h-da.de/~harriehausen.

1 Natural Language Processing >> Morphology << Prof. Dr. Bettina Harriehausen-Mühlbauer Univ. of Applied Science, Darmstadt, Germany winter / fall 2010/

2 WS 2010/2011 NLP - Harriehausen
content
1 morphemes
2 compounds / concatenation
3 idiomatic phrases
4 multiple word entries (MWE)
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)

4 Morphemes: definition
morpheme = the smallest unit in a language that carries meaning
- lexemes (man, house, dog, ...)
- inflectional affixes (dog-s, want-ed, ...)
- other affixes (prefixes / infixes / suffixes): unwanted, atypical, antipathetic, ...
- especially common in technical language (-itis = inflammation, gastro = stomach ... gastroenteritis)

5 morphemes

6 morphemes
- free morphemes: stand alone and carry lexical and morphological meaning (e.g. house = singular, neuter, nominative; case / number / gender)
- bound morphemes: form a legal word form only in combination with another morpheme (e.g. un-happy, gastroenteritis)

7 morphemes
- inflectional morphemes: create word forms and carry morphological meaning (e.g. dogs, laughed, going)
- derivational morphemes: create new words and carry morphological meaning (happily, intellectually, instruction, instructor, insulator, the pounding, limpness, blindness, ...)
Question: which string (~ morpheme) do we include in our dictionary?

9 compounds / concatenation
In addition to single morphemes, we need to consider multiple-morpheme strings / multi-word expressions (fixed phrases). Increasing formal complexity = increasing idiomatic rigidity:
- independent of the context: dog, cat, ...
- compounding that combines the lexical meanings: carseat, houseboat, ...
- compounding that is not a combination of the lexical meanings: nosebag, nosedive, paperback, ladybug, ...
- depending on the context: bite the dust, lose face, kick the bucket, ...

10 Examples of long compounds in German
- die Armbrust
- die Mehrzweckhalle
- das Mehrzweckkirschentkerngerät
- die Gemeindegrundsteuerveranlagung
- die Nummernschildbedruckungsmaschine
- der Mehrkornroggenvollkornbrotmehlzulieferer
- der Schifffahrtskapitänsmützenmaterialhersteller
- die Verkehrsinfrastrukturfinanzierungsgesellschaft
- die Feuerwehrrettungshubschraubernotlandeplatzaufseherin
- der Oberpostdirektionsbriefmarkenstempelautomatenmechaniker
- das Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
- die Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft

11 compounds / concatenation
Decompounding: principles / rules
- FANO rule: the analysis is unambiguous when no morpheme is the beginning of another morpheme (= principle of longest match), e.g. but / butter
- Segmentation has to be done recursively in order to find all possibilities:
  horseshoe: horses-hoe (?) vs. horse-shoe
  Staubecken: Stau-Becken vs. Staub-Ecken
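The recursive segmentation described on this slide can be sketched as follows. This is only a minimal illustration: the function name `segmentations` and the toy lexicon are assumptions made for the example, and a real decompounder would also need linking elements, capitalisation handling and a much larger dictionary.

```python
def segmentations(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon entries."""
    if not word:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            # Recurse on the remainder and prepend the matched morpheme.
            for rest in segmentations(word[end:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon, invented for this illustration only (lowercased).
LEXICON = {"horse", "horses", "shoe", "hoe", "stau", "staub", "becken", "ecken"}

for compound in ("horseshoe", "staubecken"):
    print(compound, segmentations(compound, LEXICON))
```

Both words yield two analyses, which is exactly the ambiguity the slide points out: a longest-match strategy alone would not tell us which split is the intended one.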

12 concatenation
Problems: not all morphemes can be concatenated

14 idiomatic phrases (English)
- Out of the blue
- To be on Cloud Nine
- A leopard cannot change its spots
- Head over heels
- Fair Play
- As cool as a cucumber
- The early bird catches the worm
- An apple a day keeps the doctor away
- As fit as a fiddle
- Beat about the bush
- The Big Apple
- The apple of my eye
- Wet behind the ears
- A bird in the hand is worth two in the bush
- It's raining cats and dogs
- A friend in need is a friend indeed
- It's all Greek to me

15 idiomatic phrases (German)
- Wie bei Hempels unterm Sofa
- Schmetterlinge im Bauch
- Jemanden übers Ohr hauen
- Ein Bäuerchen machen
- Mit jemandem durch dick und dünn gehen
- Seine Pappenheimer kennen
- Jemandem die Würmer aus der Nase ziehen
- Die Arschkarte ziehen
- Mit jemandem Pferde stehlen können
- Sich aus dem Staub machen
- Hummeln im Hintern haben
- Im siebten Himmel sein
- Viele Wege führen nach Rom
- Mit einem lachenden und einem weinenden Auge
- Nah am Wasser gebaut haben
- Da ist der Bär los
- Nachtigall, ick hör dir trapsen
- Mein lieber Scholli!

16 idiomatic phrases (German)
- Jemandem einen Denkzettel verpassen
- Sich auf den Schlips getreten fühlen
- Alles für die Katz
- Wo drückt denn der Schuh?
- Gegen den Strich gehen
- Den Faden verlieren
- Etwas ausbaden müssen
- Einen Stein im Brett haben
- Bahnhof verstehen
- Der springende Punkt
- Der Sündenbock sein
- Einen Ohrwurm haben
- Das ist doch zum Mäusemelken!
- Schmiere stehen
- Den Teufel an die Wand malen
- Auf dem Holzweg sein
- Eselsbrücke
- In der Kreide stehen

17 idiomatic phrases (German)
- Die Ohren steif halten
- Auf Vordermann bringen
- Um die Ecke bringen
- Hals- und Beinbruch
- Auf dem Kerbholz haben
- Eine Schlappe einstecken
- Frosch im Hals
- Es zieht wie Hechtsuppe
- Jemandem einen Bärendienst erweisen
- Damoklesschwert
- Tomaten auf den Augen haben
- Jemandem raucht der Kopf
- Für 'n Appel und 'n Ei
- Etwas an die große Glocke hängen
- Das ist Jacke wie Hose
- Etwas aus dem Ärmel schütteln
- Ein X für ein U vormachen
- Jemandem nicht das Wasser reichen können

18 idiomatic phrases (German)
- Alles im grünen Bereich
- Die Hand ins Feuer legen
- Auf Draht sein
- Sein blaues Wunder erleben
- Der hat es faustdick hinter den Ohren
- Mein Name ist Hase, ich weiß von nichts
- Aus dem Stegreif
- Der Groschen ist gefallen
- Einen Vogel haben
- Den Kürzeren ziehen
- Bis in die Puppen
- Etwas hinter die Ohren schreiben
- Ins Fettnäpfchen treten
- Beleidigte Leberwurst
- Jemanden auf dem Kieker haben
- Ich verstehe immer nur Bahnhof!
- Die Katze im Sack kaufen
- Das kann kein Schwein lesen!

19 idiomatic phrases (German)
- Bekannt wie ein bunter Hund
- Den Kopf in den Sand stecken
- Mit dem ist nicht gut Kirschen essen
- Aller guten Dinge sind drei
- Lampenfieber
- Das kommt mir spanisch vor
- Schwein haben
- Das hast du dir selbst eingebrockt
- Seinen Senf dazugeben
- Jemandem ist eine Laus über die Leber gelaufen
- Kalte Füße bekommen
- Im Stich lassen
- Schwedische Gardinen
- Alles in Butter
- Geld auf den Kopf hauen
- Das Handtuch werfen
- Sich mit fremden Federn schmücken

21 multiple word entries (MWE)
In addition to single morphemes, we need to consider multiple-morpheme strings (fixed phrases). This matters for:
- electronic dictionaries
- all NLP applications
- machine translation!
Degrees of rigidity:
- independent of the context: dog, cat, ...
- compounding (a): combines the lexical meanings: carseat, houseboat, ...
- compounding (b): not a combination of the lexical meanings: nosebag, nosedive, paperback, ladybug, soap opera, ...
- depending on the context: bite the dust, lose face, kick the bucket, ...

22 multiple word entries (MWE)
Problem: the relationships among the components change (the Schnitzel problem):
- sirloin steak (made from a certain part of ...)
- soy steak (made out of a material ...)
- Wiener Schnitzel (prepared according to a certain recipe)
- pepper steak (served with ...)
Even though the individual lexical meanings remain untouched in the compound, the relationship between the components varies tremendously!

23 multiple word entries (MWE)
The 3 main relationships (default?) between the parts of a compound word (the role of global knowledge in decompounding):

    compound     meaning                                        relationship
1   doorknob     knob of the door                               is-a / is-part-of / genitive
    carseat      seat of the car
2   glassdoor    door made of glass                             made from / material
    nutbread     bread of the nut
3   waterglass   glass filled with water                        used for
    oiltruck     truck that carries oil (truck made of oil)

24 multiple word entries (MWE)
Decompounding: the orange bowl problem
"Can you please bring me the orange bowl?"
- bowl filled with oranges?
- bowl having the shape of an orange?
- bowl with an orange pattern?
- bowl of orange colour?
- bowl that was formerly / usually filled with oranges?

26 spell aid
In NLP, decompounding algorithms are essential for spell checking / spell aid.
How do we define a lexical error in NLP terms? An error is a string that cannot be found in, or matched with, a dictionary entry. It is not necessarily an incorrect word (especially in the case of neologisms).

27 spell aid
Spell checking algorithms are based on the following types of mistakes and corrections (statistics!):
- phonetic similarities (ph / f: telephone / telefone)
- deletion of duplicated letters (mouuse -> mouse)
- wrong order (from / form; muose -> mouse)
- substitution of neighbouring letters on the keyboard (miuse -> mouse)
- insertion of missing letters, e.g. vowels between consonants (telephne -> telephone)
- typos tend to occur towards the end of a word (assumption: the first letter is correct)
- segmentation / decomposition into substrings (horeshoe -> horseshoe)
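Several of the error types listed above (doubled letter, swapped letters, wrong neighbouring key, missing letter) amount to a single character edit. The sketch below generates all strings one edit away from the input and keeps those found in a dictionary. It is an illustrative assumption, not the algorithm used in the lecture; `single_edit_variants`, `suggest` and the tiny `DICTIONARY` are hypothetical names and data.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by one delete, swap, replace or insert."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    swaps = {left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1}
    replaces = {left + ch + right[1:]
                for left, right in splits if right for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    return deletes | swaps | replaces | inserts


def suggest(word, dictionary):
    """Return the word itself if known, else all dictionary words one edit away."""
    if word in dictionary:
        return [word]
    return sorted(single_edit_variants(word) & dictionary)


# Toy dictionary, invented for this illustration only.
DICTIONARY = {"mouse", "form", "from", "telephone", "horseshoe"}

for typo in ("mouuse", "muose", "miuse", "telephne", "horeshoe", "telefone"):
    print(typo, "->", suggest(typo, DICTIONARY))
```

Note that `telefone` gets no suggestion: the phonetic ph / f confusion needs two character edits, so it would require a separate phonetic rule. A real-word error such as `form` for `from` is not flagged at all, because the string is in the dictionary.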

29 spell aid
Including missing letters (vowels between consonants ...) (telephne): certain rules apply. For example, in German, l, n or r is never followed by tz or ck:
*ltz (e.g. *Holtz), *lck, *ntz, *nck, *rtz, *rck
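A rule of this kind can be checked with a single pattern. The following is a minimal sketch; the function name `violates_cluster_rule` is an assumption made for the example, and the check deliberately ignores exceptions such as proper names.

```python
import re

# l, n or r directly followed by tz or ck (the clusters the slide rules out).
FORBIDDEN_CLUSTER = re.compile(r"[lnr](?:tz|ck)", re.IGNORECASE)


def violates_cluster_rule(word):
    """True if the word contains one of the forbidden letter clusters."""
    return FORBIDDEN_CLUSTER.search(word) is not None


for word in ("Holtz", "Holz", "Bank", "Dreck"):
    print(word, violates_cluster_rule(word))
```

A spell aid could use such a check to rank correction candidates: a candidate that violates the rule is unlikely to be a valid German word.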

30 spell aid
include missing letters

31 spell aid
How does spell checking work (with respect to grammar checking)? Various degrees of intelligence:
- System A: no match found in the dictionary -> mark the entry as incorrect.
- System B: no match found in the dictionary. Initiate a rudimentary parse (left-to-right search). Try to identify the word class, i.e. limit the possibilities, and continue the sentential analysis, e.g. the ... man (statistics: DET + ADJ + NOUN).
- System C: no match found in the dictionary. Initiate a segmentation of the word to identify the word class, e.g. look for typical endings (-ly = adverb, capital letters = proper noun, ...). This way newly created words can be identified (e.g. any word ending in -ness = noun).
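The "System C" idea of guessing a word class from typical endings can be sketched in a few lines. The suffix table and the function name `guess_word_class` are assumptions for illustration; only the three cues named on the slide (-ly, -ness, capital letters) are used, and the heuristic will misclassify words such as "fly".

```python
# (suffix, word class) pairs taken from the cues mentioned on the slide.
SUFFIX_CLASSES = [("ness", "noun"), ("ly", "adverb")]


def guess_word_class(token, sentence_initial=False):
    """Guess the word class of an unknown token from its surface form."""
    lowered = token.lower()
    for suffix, word_class in SUFFIX_CLASSES:
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return word_class
    # A capital letter only counts as a cue when the token does not start a sentence.
    if token[:1].isupper() and not sentence_initial:
        return "proper noun"
    return None  # no cue found


for unknown in ("googleness", "bloggishly", "Darmstadt", "xyzzy"):
    print(unknown, "->", guess_word_class(unknown))
```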

33 regular expressions (Jurafsky, section 2.1)
To figure out whether something is an incorrect word, the machine has to match the string (= any sequence of alphanumeric characters: letters, numbers, spaces, tabs, punctuation) against an entry in the dictionary.
- other kinds of matching: e.g. information retrieval in web search engines (Google, AltaVista, ...)
- the standard notation for characterizing text sequences is the regular expression
- regular expressions are supported by languages and tools such as Perl and grep (Global Regular Expression Print)
- formally, regular expressions are an algebraic notation for characterizing a set of strings
- a regular expression search requires a pattern that we want to search for and a corpus of text to search through (text mining!)

34 regular expressions (Jurafsky, section 2.1)
Example: search for the pattern linguistics. You also want to find documents with Linguistics and LINGUISTICS. (Remember: the computer does EXACTLY what you tell it to do ...)
The regular expression /linguistics/ matches any string in any document containing exactly the substring linguistics. Regular expressions are case sensitive.
Samples (Jurafsky, p. 23):

regular expression    example pattern matched
/woodchucks/          interesting links to woodchucks and lemurs
/a/                   Mary Ann stopped by Mona's
/Claire says,/        "Dagmar, my gift please," Claire says,
/song/                all our pretty songs
/!/                   "You've left the burglar behind again!" said Nori
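The case sensitivity mentioned above is easy to observe with Python's `re` module, used here as a stand-in for the Perl notation of the slides. The sample sentence is invented for the illustration.

```python
import re

text = "Computational Linguistics is fun. LINGUISTICS too."

# A plain pattern is case sensitive: lowercase "linguistics" does not occur in the text.
exact = re.search(r"linguistics", text)

# The IGNORECASE flag relaxes this and finds both spellings.
relaxed = re.findall(r"linguistics", text, flags=re.IGNORECASE)

print(exact)    # None
print(relaxed)  # ['Linguistics', 'LINGUISTICS']
```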

35 regular expressions (Jurafsky, section 2.1)
linguistics - Linguistics - LINGUISTICS
To search for alternative characters, e.g. l or L, we use square brackets: [lL]

regular expression    match                         sample pattern
/[lL]inguistics/      Linguistics or linguistics    computational linguistics is fun
/[1234567890]/        any digit                     this is Linguistics 5981

36 regular expressions (Jurafsky, section 2.1)
To search for a character in a range we use the dash: [-]

regular expression    match                   sample pattern
/[A-Z]/               any uppercase letter    this is Linguistics 5981
/[0-9]/               any single digit        this is Linguistics 5981
/[1234567890]/        any single digit        this is Linguistics 5981
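The bracket and range notation can be tried out directly on the sample string from the slides. This sketch uses Python's `re` module; the variable names are arbitrary.

```python
import re

sample = "this is Linguistics 5981"

either_case = re.search(r"[lL]inguistics", sample).group()  # l or L, then "inguistics"
first_upper = re.search(r"[A-Z]", sample).group()           # first uppercase letter
first_digit = re.search(r"[0-9]", sample).group()           # first single digit
all_digits = re.findall(r"[0-9]", sample)                   # every single digit

print(either_case, first_upper, first_digit, all_digits)
```

`re.search` returns only the first (leftmost) match, which is why `[A-Z]` yields just `L`; `re.findall` collects all non-overlapping matches.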

37 regular expressions (Jurafsky, section 2.1)
To search for negation, i.e. a character that we do NOT want to find, we use the caret inside the brackets: [^]

regular expression    match                      sample pattern
/[^A-Z]/              not an uppercase letter    this is Linguistics 5981
/[^Ll]/               neither L nor l            this is Linguistics 5981
/[^\.]/               not a period               this is Linguistics 5981

Special characters:

\*    an asterisk         L*I*N*G*U*I*S*T*I*C*S
\.    a period            Dr.Doolittle
\?    a question mark     Is this Linguistics 5981 ?
\n    a newline
\t    a tab
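A short sketch of negated classes and escaped special characters, again using Python's `re` module on the slide's sample strings. The variable names are arbitrary.

```python
import re

sample = "this is Linguistics 5981"

not_upper = re.search(r"[^A-Z]", sample).group()  # first character that is not uppercase
upper_only = re.sub(r"[^A-Z]", "", sample)        # delete everything that is not uppercase

# Escaped characters are matched literally.
asterisks = re.findall(r"\*", "L*I*N*G*U*I*S*T*I*C*S")
period_at = re.search(r"\.", "Dr.Doolittle").start()
has_question = re.search(r"\?", "Is this Linguistics 5981 ?") is not None

# Without the backslash the period is a wildcard and matches the very first character.
unescaped = re.search(r".", "Dr.Doolittle").group()

print(not_upper, upper_only, len(asterisks), period_at, has_question, unescaped)
```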

38 regular expressions (Jurafsky, section 2.1)
To search for optional characters we use the question mark: ?

regular expression    match              sample pattern
/colou?r/             colour or color    beautiful colour

To search for any number of a certain character we use the Kleene star: *

regular expression    match
/a*/                  any string of zero or more a's
/aa*/                 at least one a, followed by any number of further a's

39 regular expressions (Jurafsky, section 2.1)
Any combination is possible:

regular expression    match
/[ab]*/               zero or more a's or b's
/[0-9][0-9]*/         any integer (= a string of digits)

To look for at least one character of a type we use the Kleene +:

regular expression    match
/[0-9]+/              a sequence of digits
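The optional marker, the Kleene star and the Kleene + can be compared side by side. The sketch uses Python's `re` module; the sample strings are invented for the illustration.

```python
import re

# ? makes the preceding character optional.
spellings = re.findall(r"colou?r", "beautiful colour, beautiful color")

# * means zero or more, so /a*/ also accepts the empty string, while /aa*/ does not.
star_accepts_empty = re.fullmatch(r"a*", "") is not None
needs_one_a = re.fullmatch(r"aa*", "") is None
many_as = re.fullmatch(r"aa*", "aaa") is not None
mixed = re.fullmatch(r"[ab]*", "abba") is not None

# + means one or more: /[0-9]+/ is shorthand for /[0-9][0-9]*/.
with_plus = re.findall(r"[0-9]+", "rooms 12 and 407")
with_star = re.findall(r"[0-9][0-9]*", "rooms 12 and 407")

print(spellings, with_plus, with_star)
```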

40 regular expressions (Jurafsky, section 2.1)
The period (.) is a very special character: the so-called wildcard.

regular expression    match                               sample pattern
/b.ll/                any character between b and ll      ball, bell, bull, bill

Will the search find Bill?
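One way to explore the question on this slide is to run the pattern over all five spellings. This is a small sketch with Python's `re` module; the word list is taken from the slide.

```python
import re

words = "ball bell bull bill Bill"

lowercase_only = re.findall(r"b.ll", words)   # the wildcard covers the vowel, not the b
either_case = re.findall(r"[bB].ll", words)   # a character class for the first letter

print(lowercase_only)
print(either_case)
```

The wildcard stands for exactly one arbitrary character in its own position, so it does not make the leading `b` case insensitive; the capitalised `Bill` is only found once the first letter is written as `[bB]`.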

41 regular expressions (Jurafsky, section 2.1)
Anchors (start of line: ^, end of line: $)

regular expression    match                                      sample pattern
/^Linguistics/        Linguistics at the beginning of a line     Linguistics is fun.
/linguistics\.$/      linguistics. at the end of a line          We like linguistics.

Anchors (word boundary: \b, non-boundary: \B)

regular expression    match                        sample pattern
/\bthe\b/             the as a word of its own     This is the place.
/\Bthe\B/             the inside another word      This is my mother.
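The four anchors can be checked against the slide's sample sentences. The helper `found` is a hypothetical convenience function introduced only for this sketch.

```python
import re


def found(pattern, text):
    """True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None


line_start = found(r"^Linguistics", "Linguistics is fun.")
not_at_start = found(r"^Linguistics", "We like Linguistics.")
line_end = found(r"linguistics\.$", "We like linguistics.")

whole_word = found(r"\bthe\b", "This is the place.")
no_whole_word = found(r"\bthe\b", "This is my mother.")
inside_word = found(r"\Bthe\B", "This is my mother.")
not_inside = found(r"\Bthe\B", "This is the place.")

print(line_start, not_at_start, line_end)
print(whole_word, no_whole_word, inside_word, not_inside)
```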

42 regular expressions (Jurafsky, section 2.1)
More on alternative characters: the pipe symbol | (disjunction)

regular expression    match                   sample pattern
/colou?r/             colour or color         beautiful colour
/progra(m|mme)/       program or programme    linguistics program
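The following sketch shows the disjunction in Python's `re` module and one subtlety that is an observation about this particular engine, not something stated on the slide: alternatives are tried from left to right, so when searching inside a longer text the shorter alternative `m` can win before `mme` is tried.

```python
import re

# Matching a whole string: both spellings are accepted.
accepts_program = re.fullmatch(r"progra(m|mme)", "program") is not None
accepts_programme = re.fullmatch(r"progra(m|mme)", "programme") is not None

text = "linguistics program, linguistics programme"

# Searching in running text: the first alternative that succeeds is used.
short_first = re.findall(r"progra(?:m|mme)", text)
long_first = re.findall(r"progra(?:mme|m)", text)

print(short_first)
print(long_first)
```

Listing the longer alternative first is a common way to make sure the full word is captured. The `(?:...)` group is used so that `findall` returns the whole match instead of just the group.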

43 regular expressions (Jurafsky, section 2.1)
What does the following expression match?
/student [0-9]+ */
Will it match "student 1 student 2 student 3"?
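The question can be explored experimentally. The sketch below assumes the intended pattern is `student [0-9]+ *` (the word, a space, one or more digits, then zero or more spaces); that reading of the slide is an assumption.

```python
import re

text = "student 1 student 2 student 3"
pattern = r"student [0-9]+ *"

first = re.search(pattern, text).group()          # leftmost match only
whole = re.fullmatch(pattern, text)               # does the pattern cover the whole string?
each = re.findall(pattern, text)                  # every non-overlapping match
repeated = re.fullmatch(r"(student [0-9]+ *)+", text) is not None

print(repr(first), whole, each, repeated)
```

Under this reading the pattern describes a single "student N" unit: a search finds it three times in the string, but it only describes the whole string once the unit is wrapped in a group and repeated with +.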

44 regular expressions (Jurafsky, section 2.1)
Perl expressions are also used for string substitution (used in ELIZA):
s/man/men/    man -> men
Perl expressions are also used for string repetition via memory (the number operator):
s/(linguistics)/wonderful \1/    linguistics -> wonderful linguistics
ELIZA:
s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1 ?/
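The Perl substitutions above have a close counterpart in Python's `re.sub`, where `\1` likewise refers back to the first parenthesised group. The input sentence for the ELIZA rule is invented for this sketch.

```python
import re

plural = re.sub(r"man", "men", "man")
praise = re.sub(r"(linguistics)", r"wonderful \1", "linguistics")

# ELIZA-style rule: the group remembers which adjective was used.
user_input = "WELL YOU ARE sad TODAY"
reply_one = re.sub(r".* YOU ARE (depressed|sad).*",
                   r"I AM SORRY TO HEAR YOU ARE \1", user_input)
reply_two = re.sub(r".* YOU ARE (depressed|sad).*",
                   r"WHY DO YOU THINK YOU ARE \1 ?", user_input)

print(plural)
print(praise)
print(reply_one)
print(reply_two)
```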

46 Finite State Automata (FSA)
The regular expression is more than just a convenient metalanguage for text searching. First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will describe and look at in this lecture. Any regular expression can be implemented as a finite-state automaton*. Symmetrically, any finite-state automaton can be described with a regular expression. Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in the following figure.
* Except regular expressions that use the memory feature; more on that later.

47 Finite State Automata (FSA)
[Figure: the relationship between finite-state automata, regular expressions, and regular languages*]
* as suggested by Martin Kay in: Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the ACL (EACL-87), Copenhagen, Denmark. ACL.

48 Finite State Automata (FSA)
Examples:
- introduction to finite-state automata for regular expressions
- mapping from regular expressions to automata

49 Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
"After a while, with the parrot's help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said."
Hugh Lofting, The Story of Doctor Dolittle

50 Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
Sheep language can be defined as any string from the following (infinite) set: baa! baaa! baaaa! baaaaa! baaaaaa! ...

51 Finite State Automata (FSA)
baa! baaa! baaaa! baaaaa! baaaaaa! ...
The regular expression for this kind of sheeptalk is /baa+!/
All regular expressions can be represented as finite-state automata (FSA):
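Before turning to the automaton, the sheeptalk expression itself can be tested with Python's `re` module. The helper name `is_sheeptalk` is an assumption for this sketch; `fullmatch` is used because a string only belongs to the sheep language if the whole string fits the pattern.

```python
import re

SHEEPTALK = re.compile(r"baa+!")


def is_sheeptalk(utterance):
    """True if the complete utterance is in the language of /baa+!/."""
    return SHEEPTALK.fullmatch(utterance) is not None


for utterance in ("baa!", "baaaa!", "ba!", "baa", "abaa!"):
    print(utterance, is_sheeptalk(utterance))
```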

52 Finite State Automata (FSA)
[Figure: a finite-state automaton (FSA) for the regular expression /baa+!/, with start state q0, states q1, q2, q3 and final / accepting state q4; the arcs are labelled b, a, a, a, !]

53 Finite State Automata (FSA)
[Figure: a tape with cells containing a b a ! b, with the automaton in state q0]
Example of a non-final state = rejection of the input

54 Finite State Automata (FSA)
The state-transition table for the previous FSA (∅ = null, no transition; the colon marks the accepting state):

         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4:       ∅    ∅    ∅

55 Finite State Automata (FSA)
An algorithm for deterministic recognition of FSAs:

function D-RECOGNIZE(tape, machine) returns accept or reject
  index <- Beginning of tape
  current-state <- Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elseif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state <- transition-table[current-state, tape[index]]
      index <- index + 1
  end
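The D-RECOGNIZE pseudocode translates almost line by line into Python. In this sketch the transition table for /baa+!/ is written as a dictionary; the names `TRANSITIONS`, `ACCEPT_STATES` and `d_recognize` are chosen for the example, and a missing dictionary key plays the role of an empty table cell.

```python
# Transition table of the sheeptalk automaton: (state, input symbol) -> next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop: any number of further a's
    (3, "!"): 4,
}
ACCEPT_STATES = {4}


def d_recognize(tape, transitions=TRANSITIONS, start=0, accept_states=ACCEPT_STATES):
    """Return True (accept) or False (reject) for the input string `tape`."""
    current_state = start
    for symbol in tape:
        next_state = transitions.get((current_state, symbol))
        if next_state is None:
            return False  # empty table cell: reject immediately
        current_state = next_state
    # End of input reached: accept only in an accepting state.
    return current_state in accept_states


for tape in ("baa!", "baaa!", "ba!", "baa", "baa!a"):
    print(tape, d_recognize(tape))
```

The `for` loop replaces the explicit index of the pseudocode; running out of symbols corresponds to the "End of input has been reached" branch.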

56 Finite State Automata (FSA)
[Figure: tracing the execution of the FSA on some sheeptalk; tape: b a a a !, states visited: q0, q1, q2, q3, q3, q4]

57 Finite State Automata (FSA)
Regular expressions can be represented as FSAs:
[Figure: the sheeptalk automaton (states q0 to q4, arcs b, a, a, a, !) extended with an explicit fail state qf; the arcs for the otherwise undefined input symbols lead to qf]

58 Finite State Automata (FSA)
[Figure: states q0 to q4 with arcs labelled b, a, a, a, !]
A non-deterministic finite-state automaton for talking sheep

59 Finite State Automata (FSA)
[Figure: states q0 to q4 with arcs labelled b, a, a, ! and an ε-arc]
A non-deterministic finite-state automaton (NFSA) for the sheep language, having an ε-transition
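A non-deterministic automaton with an ε-transition can be simulated by tracking the set of all states the machine could be in. The sketch below makes two assumptions: the ε-arc is taken to lead from q3 back to q2 (the figure itself is not part of the transcript), and it uses set-of-states simulation instead of a search with backtracking. All names are hypothetical.

```python
EPSILON = ""  # label for transitions that consume no input

# (state, symbol) -> set of possible next states; the epsilon arc is an assumption.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, EPSILON): {2},
    (3, "!"): {4},
}


def epsilon_closure(states, transitions):
    """All states reachable from `states` using epsilon arcs only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in transitions.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def nd_recognize(tape, transitions=NFSA, start=0, accept_states=frozenset({4})):
    """Accept if at least one path through the automaton consumes the whole tape."""
    current = epsilon_closure({start}, transitions)
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= transitions.get((state, symbol), set())
        current = epsilon_closure(moved, transitions)
        if not current:
            return False  # no path survives this symbol
    return bool(current & accept_states)


for tape in ("baa!", "baaaa!", "ba!", "baa"):
    print(tape, nd_recognize(tape))
```

The automaton accepts the same strings as the deterministic version: after two a's the machine may be in q3 or, via the ε-arc, back in q2, which is what allows further a's to be read.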

