
Presentation on theme: "Dictionaries and Grammar Do we include all forms of a particular word, or do we include only the base word and derive its forms? How are the grammatical."— Presentation transcript:

1 Dictionaries and Grammar
Questions to Address
–Do we include all forms of a particular word, or do we include only the base word and derive its forms?
–How are the grammatical rules of a language represented?
–How do we represent the parts of speech that go with particular grammatical rules?

2 Definitions
Morphology
–The study of the patterns used to form words
–E.g. inflection, derivation, and compounds
Morpheme: the minimal meaning-bearing unit
–Could be a stem or an affix
Stems {"unthinkable", "realization", "distrust"}
–The part of a word that carries the root meaning (e.g. cat)
Affixes {-s, un-, de-, -en, -able, -ize, -hood}
–A linguistic element added to a word to modify its meaning
–E.g. prefix (unbuckle), suffix (buckled), infix (absobloodylutely), and circumfix (German gesagt, "said")
–Affixes can attach to other affixes (boyishness)

3 Knowing Words
When we know a word, we know its
1. Phonological sound sequences
2. Semantic meanings
3. Morphological relationships
4. Syntactic categories and its proper place in sentence structure
Morphological relationships adjust word meanings
–Person: Jill waits.
–Number: Jill carried two buckets.
–Case: The chair's leg is broken.
–Tense: Jill is waiting there now.
–Degree: Jill ran faster than Jack.
–Gender: Jill is female.
–Part of Speech: Jill is a proper noun.
These are the kinds of things we want our computers to figure out

4 Units of Meaning
How many morphemes does each of the following sentences have?
–"I have two cats"
–"She wants to leave soon"
–"He walked across the room"
–"Her behavior was unbelievable"
Free Morphemes {eye, think, run, apple}
Bound Morphemes {-able, un-, -s, -tion, -ly}

5 Affix Examples
Prefixes from Karuk, a Hokan language of California
–[pasip] "Shoot!"
–[nipasip] "I shoot"
–[ʔupasip] "She/he shoots"
Suffixes from Mende, spoken in Liberia and Sierra Leone
–[pElE] "house"
–[pElEi] "the house"
–[mEmE] "glass"
–[mEmEi] "the glass"
Infixes from Bontoc, spoken in the Philippines
–[fikas] "strong"
–[fumikas] "she is becoming strong"
–[fusul] "enemy"
–[fumusul] "she is becoming an enemy"

6 Turkish Morphology
Uygarlastiramadiklarimizdanmissinizcasina
Meaning: "behaving as if you are among those whom we could not civilize"
Uygar "civilized" + las "become" + tir "cause" + ama "not able" + dik "past" + lar "plural" + imiz "p1pl" + dan "abl" + mis "past" + siniz "2pl" + casina "as if"

7 How Does the Mind Store Meanings?
Hypotheses
–Full listing: we store all words individually
–Minimum redundancy: we store morphemes and how they relate
Analysis
–Determine if people understand new words based on root meanings
–Observe whether children have difficulty learning exceptions
–Regular form: government/govern; irregular form: department/depart
Evidence suggests
–The mind represents word and affix meanings separately
–Linguists observe that affixes were originally separate words that speakers slurred together over time

8 General Observations about Lexicons
Meanings are continually changing
Roots and morphemes do not have to occur in a fixed position in relation to other elements
How many words do people know?
–Shakespeare used about 15,000 words
–A typical high school student knows 60,000 (learning 10 words a day from 12 months to 18 years)
How many English words are there?
–Over 300,000 words (not counting morphological variants) as of 1988

9 Computational Morphology
Consider all of the forms derived from the word 'true'
–true, truer, truest, truly, untrue, truth, truthful, truthfully, untruthfully, untruthfulness
–untruthfulness = un- + true + -th + -ful + -ness
Productive morphemes
–An affix that, at some point in time, spread rapidly through the language
–Consider goose/geese versus cat/cats
 The former was an older way to indicate plurals
 The latter is a more recent way that spread throughout the language
If we store morpheme rules, not all words, we can
–Reduce storage requirements and simplify creating entire dictionaries
–More closely mimic how the mind does it
–Automatically understand newly encountered word forms
Speech recognition requires a language dictionary: how many words would it contain?

10 Morphology Rules
There are rules used to form complex words from their roots
–'re-' only precedes verbs (rerun, release, return)
–'-s' indicates plurals
–'-ed' indicates past tense
Affix Rules
–Regular: follow productive affix rules
–Irregular: don't follow productive affix rules
Nouns
–Regular: (cat, thrush), (cats, thrushes), (cat's, thrushes')
–Irregular: (mouse, ox), (mice, oxen)
Observation: more frequent words resist changes that result from productive affixes and take irregular forms (e.g. am, is, are)
Exceptions: a singer sings, and a writer writes. Why doesn't a whisker whisk, a spider spid, or a finger fing?
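The regular/irregular split above can be sketched directly in code: apply the productive '-s'/'-es' rules unless an exception table says otherwise. The word lists here are a tiny illustrative sample, not a real lexicon.

```python
# Regular '-s'/'-es' plural rules with an irregular-form lookup table.
# The exception table is a small illustrative sample, not a real lexicon.
IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen"}

def pluralize(noun: str) -> str:
    if noun in IRREGULAR_PLURALS:                  # irregular: look it up
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"                         # thrush -> thrushes
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"                   # pony -> ponies
    return noun + "s"                              # cat -> cats
```

Frequent irregular forms are checked first, mirroring the observation that common words resist the productive rules.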

11 Parsing
Morphological parsing
–Identifies stems and affixes and how they relate
–Examples:
 fish → fish +Noun +Singular
 fish → fish +Noun +Plural
 fish → fish +Verb +Singular
 (similarly, goose → goose +Noun +Singular or goose +Verb)
–Bracketing: indecipherable → [in [[de [cipher]] able]]
Why do we parse?
–Spell-checking: is muncheble a real word?
–Identify a word's part of speech (POS)
–Sentence parsing and machine translation
–Identify word stems for data-mining search operations
–Speech recognition and text-to-speech
Identify components and underlying structure

12 Parsing Applications
Lexicon
–Create a word list
–Include both stems and affixes (with the part of speech)
Morphotactics
–Models how morphemes can be affixed to a stem
–E.g., the plural morpheme follows the noun in English
Orthographic rules
–Define spelling modifications during affixation
–E.g. true → tru in the context of true → truthfully

13 Grammatical Morphemes
New forms are rarely added to closed morpheme classes
Examples
–prepositions: at, for, by
–articles: a, the
–conjunctions: and, but, or

14 Morphological Parsing (stemming)
Goal: break the surface input into morphemes
foxes
–Fox is a noun stem
–It has -es as a plural suffix
rewrites
–Write is the verb stem
–It has re- as a prefix meaning to do again
–It has an -s suffix indicating third-person singular present tense
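A toy version of this stemming step can be written by stripping one known prefix and one known suffix. The affix inventory and glosses below are illustrative assumptions, not a full English lexicon.

```python
# Strip one known prefix and one known suffix from a surface form and
# report the remaining stem plus affix glosses. Illustrative inventory only.
PREFIXES = {"re": "again", "un": "not"}
SUFFIXES = {"ing": "+Progressive", "ed": "+Past",
            "es": "+Plural/+3sg", "s": "+Plural/+3sg"}   # longest first

def parse_surface(surface):
    glosses = []
    for prefix, gloss in PREFIXES.items():
        if surface.startswith(prefix):
            surface, glosses = surface[len(prefix):], glosses + [gloss]
            break
    for suffix, gloss in SUFFIXES.items():
        if surface.endswith(suffix):
            surface, glosses = surface[: -len(suffix)], glosses + [gloss]
            break
    return surface, glosses

# parse_surface("foxes") -> ("fox", ["+Plural/+3sg"]); note that
# parse_surface("rewrites") yields the stem "writ" -- the orthographic
# rules of slide 12 would be needed to restore the spelling "write".
```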

15 Inflectional Morphology
Nouns
–plural marker: -s (dog + s = dogs)
–possessive marker: -'s (dog + 's = dog's)
Verbs
–3rd person present singular: -s (walk + s = walks)
–past tense: -ed (walk + ed = walked)
–progressive: -ing (walk + ing = walking)
–past participle: -en or -ed (eat + en = eaten)
Adjectives
–comparative: -er (fast + er = faster)
–superlative: -est (fast + est = fastest)
In English
–Meaning transformations are predictable
–All inflectional affixes are suffixes
–Inflectional affixes attach after any derivational (next slide) affixes
 E.g. modern + ize + s = modernizes, not modern + s + ize
–Inflection does not change the grammatical category

16 Concatenative and Non-concatenative
Concatenative morphology combines morphemes by concatenation
–prefixes and suffixes
Non-concatenative morphology combines morphemes in complex ways
–circumfixes and infixes
–templatic morphology: words change by internal changes to the root (e.g. Arabic, Hebrew)
Templatic example: the root k-t-b ("write") interleaved with the template CVCCVC and the vowel melody u-i yields kuttib ("will have been written")

17 Verbal Inflectional Morphology
Verbal inflection
–Main verbs (sleep, like, fear) are relatively regular
 Standard morphemes: -s, -ing, -ed
 These morphemes are productive: emails, emailing, emailed
–Verbs combine with nouns for syntactic agreement
 I am, we are, they were
There are exceptions
–Eat (will eat, eats, eating, ate)
–Catch (will catch, catches, catching, caught)
–Be (will be, is, being, was)
–Have (will have, has, having, had)
General observations about English
–There are approximately 250 irregular verbs
–Other languages have more complex verbal inflection rules

18 Nominal Inflectional Morphology
Plural forms (-s or -es)
Possessives (cat's or cats')
Regular nouns
–Singular (cat, bush)
–Plural (cats, bushes)
–Possessive (cat's, bushes')
Irregular nouns
–Singular (mouse, ox)
–Plural (mice, oxen)

19 Derivational Morphology
A word stem combines with a grammatical morpheme
–Usually produces a word of a different class
–Complex rules that are less productive, with many exceptions
–Sometimes meanings of derived terms are hard to predict (e.g. hapless)
Examples: verbs to nouns
–generalize, realize → generalization, realization
–murder, spell → murderer, speller
Examples: verbs and nouns to adjectives
–embrace, pity → embraceable, pitiable
–care, wit → careless, witless
Example: adjectives to adverbs
–happy → happily
More complicated to model than inflection
–Less productive: science-less, concern-less, go-able, sleep-able

20 Derivational Morphology Examples
Level 1 examples: -ize, -ization, -ity, -ic, -al, -ion, -y, -ate, -ous, -ive, -ation
Observations
–Can attach to non-words (e.g. fratern-al, patern-al)
–Often change the stem's stress and vowel quality
Level 2 examples: -hood, -ness, -ly, -s, -ing, -ish, -ful, -less, -y (adj.)
Observations
–Never precede Level 1 suffixes
–Never change stress or vowel quality
–Almost always attach to words that exist
Level 1 + Level 1: histor-ic-al, illumin-at-ion, indetermin-ac-y
Level 1 + Level 2: fratern-al-ly, transform-at-ion-less
Level 2 + Level 2: weight-less-ness
Big one: antidisestablishmentarianism

21 Adjective Morphology
Standard forms
–Big, bigger, biggest
–Cool, cooler, coolest, coolly
–Red, redder, reddest
–Clear, clearer, clearest, clearly, unclear, unclearly
–Happy, happier, happiest, happily
–Unhappy, unhappier, unhappiest, unhappily
–Real, unreal, really
Exceptions: unbig, redly, realest

22 Identify and Classify Morphemes
In each group
–Two words have a different morphological structure
–One word has a different type of suffix
–One word has no suffix at all
Perform the following tasks
1. Isolate the suffix that two of the words share.
2. Identify whether it is (i) free or bound; (ii) a prefix, infix, or suffix; (iii) inflectional or derivational.
3. Give its function/meaning.
4. Identify the word that has no suffix.
5. Identify the word that has a suffix which is different from the others in each group.
a. rider, colder, silver, actor
b. tresses, melodies, Bess's, guess
c. running, foundling, handling, fling
d. tables, lens, witches, calculates

23 Computational Techniques Regular Grammars Finite State Automata Finite State Transducer Parsing – Top down and bottom up

24 Regular Grammars
Grammar: rules that define legal character strings
A regular grammar accepts a regular language; regular expressions are built as follows:
–The empty language is regular
–The language containing only the empty string is regular
–A single character is a regular expression
–If r1 and r2 are regular expressions, then the union of r1 and r2, and r1 concatenated with r2, are regular
–If r is a regular expression, then r* (where * means zero or more occurrences) is regular

25 Notations to Express Regular Expressions
Concatenation: abc
Disjunction: [a-zA-Z], gupp(y|ies)
Counters: a*, a+, ?, a{5}, a{5,8}, a{5,}
Any character: a.b
Not: [^0-9]
Anchors: /^The dog\.$/
–Note: the backslash before the period is an escape character
–Other escape characters include \*, \?, \n, \t, \\, \[, \], etc.
Operators
–\d equivalent to [0-9], \D equivalent to [^0-9]
–\w equivalent to [a-zA-Z0-9_], \W equivalent to [^\w]
–\s equivalent to [ \r\t\n\f], \S equivalent to [^\s]
Substitute one regular expression for another: s/regExp1/regExp2/
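These notations can be spot-checked with Python's re module (a quick sketch: re uses the same escapes, though anchors are written ^ and $ without the /…/ delimiters, and s/…/…/ corresponds to re.sub).

```python
import re

assert re.search(r"gupp(y|ies)", "two guppies")      # disjunction
assert re.fullmatch(r"[a-zA-Z]+", "Dog")             # character range
assert re.fullmatch(r"a{5}", "aaaaa")                # counter
assert re.search(r"a.b", "acb")                      # '.' = any character
assert not re.search(r"[^0-9]", "2024")              # 'not': digits only
assert re.search(r"^The dog\.$", "The dog.")         # anchors, escaped '.'
# s/regExp1/regExp2/ corresponds to re.sub:
assert re.sub(r"ies$", "y", "guppies") == "guppy"
```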

26 Examples of Regular Expressions
All strings ending with two zeroes
All strings containing three consecutive zeroes
All strings in which every block of five consecutive symbols has at least two zeroes
All strings in which the tenth symbol from the right is a one
The set of all binary strings representing numbers divisible by five

27 Finite State Automata (FSA)
Definition: a FSA consists of
1. a finite set of states (Q)
2. a starting state (q0 ∈ Q)
3. a set of final or accepting states (F ⊆ Q)
4. a finite alphabet of symbols (Σ)
5. a transition function δ(q,i) that maps Q × Σ to Q; it switches from a from-state to a to-state based on one of the valid symbols
Synonyms: finite automaton, finite state machine
FSAs recognize grammars that are regular

28 Recognition
Determine if the machine accepts a particular string, i.e. is the string in the language?
Traditionally (following Turing's machines), a tape and tape reader depict an FSA
Algorithm
–Begin in the start state
–Examine the current input character
–Consult the transition table
–Go to the new state and advance the tape pointer
–Repeat until you run out of tape
–The machine accepts the string if processing stops in a final state

29 Graphs and State Transition Tables
(Figure: annotated directed graph and its state transition table)
What can we say about this machine?
–It has 5 states
–At least b, a, and ! are in its alphabet
–q0 is the start state
–q4 is an accept state
–It has 5 transitions
Questions
–Which of these strings does it accept? baaaa, aaabaaa, ba
–Is this the only FSA that can accept this language?
An FSA can only accept regular languages. Question: can you think of a language that is not regular?

30 Recognizer Implementation
index = beginning of tape
state = start state
DO
    IF transition[state, tape[index]] is empty RETURN false
    state = transition[state, tape[index]]
    index = index + 1
UNTIL end of tape is reached
IF state is a final state RETURN true
ELSE RETURN false
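The same loop in runnable Python, with the transition table as a dict keyed by (state, symbol). The example machine is a sheep-language automaton for /baa+!/ in the spirit of the earlier slides; the state numbering is an assumption.

```python
def recognize(tape, transition, start, finals):
    """Table-driven deterministic FSA recognizer."""
    state = start
    for symbol in tape:
        if (state, symbol) not in transition:
            return False                  # no transition: reject
        state = transition[(state, symbol)]
    return state in finals                # accept iff we stop in a final state

# /baa+!/ : 0 -b-> 1 -a-> 2 -a-> 3 (-a-> 3 loop) -!-> 4
SHEEP = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
```

Usage: recognize("baaa!", SHEEP, 0, {4}) accepts, while "ba!" runs off the table and is rejected.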

31 Key Points Regarding FSAs
This algorithm is a state-space search algorithm
–Implementation uses simple table lookups
–Success occurs when, at the end of a string, we reach a final state
The results are always deterministic
–There is one unique choice at each step
–The algorithm recognizes all regular languages
Perl, Java, etc. use a regular expression algorithm
–Create a state transition table from the expression
–Pass the table to the FSA interpreter
FSA algorithms
–Recognizer: determines if a string is in the language
–Generator: generates all strings in the language

32 Non-Deterministic FSA
Deterministic: given a state and symbol, only one transition is possible
Nondeterministic
–Given a state and a symbol, multiple transitions are possible
–Epsilon (ε) transitions: those which do NOT examine or advance the tape
The nondeterministic FSA recognizes a string if
–At least one transition sequence ends at a final state
–Note: all sequences do NOT have to end at a final state
–Note: string rejection occurs only when NO sequence ends at a final state
(Figure: ε-transition examples)

33 Concatenation

34 Closure

35 Union

36 Using NFSAs
State transition table (blank = no transition):
State | b | a   | ! | ε
0     | 1 |     |   |
1     |   | 2   |   |
2     |   | 2,3 |   |
3     |   |     | 4 |
4     |   |     |   |

37 NFSA Recognition of “baaa!”

38 Breadth-first Recognition of “baaa!”

39 Nondeterministic FSA Example
Input:  b    a    a    a    !
States: q0 → q1 → q2 → q2 → q3 → q4

40 Other FSA Examples
Dollars and cents
Exercise: create a FSA for each of the following regular expressions
–(0|1)*
–[a-f1-9]
–abc{5}

41 Non-Deterministic FSA Recognizer
Recognizer(state, index)
    IF end of tape THEN
        IF state is final RETURN true
        ELSE RETURN false
    IF no possible transitions RETURN false
    FOR each possible transition from state
        IF epsilon transition THEN nextIndex = index
        ELSE nextIndex = index + 1
        result = CALL Recognizer(nextState, nextIndex)
        IF result = true RETURN true
    RETURN false
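The backtracking search above, sketched in Python: transitions map (state, symbol) to a set of successor states, and ε-moves use the empty-string symbol. The sketch assumes the machine has no ε-cycles, so the recursion terminates.

```python
def nfsa_recognize(tape, transition, state, finals, index=0):
    """Depth-first, backtracking NFSA recognizer (assumes no epsilon-cycles)."""
    if index == len(tape) and state in finals:
        return True
    # epsilon transitions consume no input
    for nxt in transition.get((state, ""), set()):
        if nfsa_recognize(tape, transition, nxt, finals, index):
            return True
    if index < len(tape):
        for nxt in transition.get((state, tape[index]), set()):
            if nfsa_recognize(tape, transition, nxt, finals, index + 1):
                return True
    return False

# The machine of slide 36: state 2 on 'a' may stay at 2 or move on to 3.
NFA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
```

On "baaa!" the first branch (staying at state 2) dead-ends, and backtracking finds the accepting path through state 3, so the string is recognized.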

42 FSA’s and Morphology Apply an FSA to each word in the dictionary to capture the morphological forms. Groups of words with common morphology can share FSAs

43 Building a Lexicon with a FSA

44 Derivational Rules

45 Simple Morphology Example
States: q0 -(un-)→ q1 -(adj-root)→ q2 -(-er, -est, -ly)→ q3
Stop states: q2 and q3
From | To | Output
0    | 1  | un
0    | 1  | NULL
1    | 2  | adj-root-list
2    | 3  | er; est; ly
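The table can be turned into a small generator that enumerates the forms the machine accepts. This sketch plugs in the adj-root list from the next slide (clear, happy, real) as an assumption; the orthographic rules of slide 12 would still be needed to turn happy+er into happier.

```python
# Arcs from the slide's table as (output, next_state) pairs. The empty
# string plays the role of the NULL output on the un-/no-prefix choice.
ARCS = {
    0: [("un", 1), ("", 1)],
    1: [("clear", 2), ("happy", 2), ("real", 2)],   # adj-root-list (assumed)
    2: [("er", 3), ("est", 3), ("ly", 3)],
}
FINALS = {2, 3}     # the slide's stop states q2 and q3

def generate(state=0, prefix=""):
    """Enumerate every string the machine can output and accept."""
    forms = [prefix] if state in FINALS else []
    for output, nxt in ARCS.get(state, []):
        forms += generate(nxt, prefix + output)
    return forms
```

With 2 prefix choices, 3 roots, and 4 endings (bare form plus -er/-est/-ly), the machine generates 24 forms, including unclear and clearly.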

46 An Extended Example
Adj-root-1: clear, happy, real
Adj-root-2: big, red
From | To | Output
0    | 1  | un
0    | 3  | NULL
1    | 2  | adj-root-list-1
2    | 5  | er; est; ly
3    | 2  | adj-root-list-1
3    | 4  | adj-root-list-2
4    | 5  | er; est

47 Representing Derivational Rules

48 Finite State Transducer (FST)
Definition: a FST is a 5-tuple consisting of
–Q: a set of states {q0, q1, q2, q3, q4}
–Σ: an alphabet of complex symbols
 Each complex symbol i:o pairs two simple symbols
 The first symbol is from an input alphabet (i ∈ I)
 The second symbol is from an output alphabet (o ∈ O)
 Σ is in I × O; ε is the null character
–q0: a start state
–F: a set of final states in Q {q4}
–δ(q, i:o): a transition function mapping Q × Σ to Q
Concept: translates the input tape, writing the result to a second tape
Example: a machine with transitions b:m, a:o, and !:? maps baaaa to moooo
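A deterministic sketch of such a transducer in Python, using a hypothetical four-state version of the sheep machine: each arc reads one input symbol and emits one output symbol.

```python
# (state, in_symbol) -> (next_state, out_symbol); a hypothetical machine
# with the b:m, a:o, !:? pairings described above.
FST = {(0, "b"): (1, "m"), (1, "a"): (2, "o"),
       (2, "a"): (2, "o"), (2, "!"): (3, "?")}

def transduce(tape, fst, start=0, finals=frozenset({3})):
    """Run the transducer; return the output tape, or None on rejection."""
    state, out = start, []
    for symbol in tape:
        if (state, symbol) not in fst:
            return None                  # reject: no matching transition
        state, written = fst[(state, symbol)]
        out.append(written)
    return "".join(out) if state in finals else None
```

Usage: transduce("baaaa!", FST) yields "moooo?", while an input that stops in a non-final state is rejected.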

49 Transition Example
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
Example: the transition sequence c:c a:a t:t +N:ε +PL:s maps cat+N+PL to cats

50 On-line demos Finite state automata demos -analysis/fsCompiler/fsinput.html Finite state morphology -analysis/demos/english Some other downloadable FSA tools:

51 Lexicon for L0 (rule-based languages)

52 Top Down Parsing
Driven by the grammar, working down
Grammar: S → NP VP, NP → Pro, Pro → I, VP → V NP, V → prefer, NP → Det Nom, Det → a, Nom → Noun Nom, Noun → morning, Noun → flight
Parse of "I prefer a morning flight":
[S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]
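Top-down expansion can be sketched as a small recursive-descent search over this grammar, backtracking over the alternative NP and Nom rules (a recognizer sketch, not a tree-builder; it works here because the grammar has no left recursion).

```python
# Each non-terminal maps to its list of alternative right-hand sides.
GRAMMAR = {
    "S":    [["NP", "VP"]],
    "NP":   [["Pro"], ["Det", "Nom"]],
    "Nom":  [["Noun", "Nom"], ["Noun"]],
    "VP":   [["V", "NP"]],
    "Pro":  [["I"]], "V": [["prefer"]], "Det": [["a"]],
    "Noun": [["morning"], ["flight"]],
}

def derives(symbols, words):
    """True if the symbol sequence can expand to exactly these words."""
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:                      # non-terminal: try each rule
        return any(derives(expansion + rest, words)
                   for expansion in GRAMMAR[head])
    # terminal: must match the next word
    return bool(words) and words[0] == head and derives(rest, words[1:])
```

Usage: derives(["S"], "I prefer a morning flight".split()) succeeds, and word orders the grammar cannot produce are rejected.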

53 Bottom Up Parsing
Driven by the words, working up
The Grammar
0) S → E $
1) E → E + T | E - T | T
2) T → T * F | T / F | F
3) F → num | id
The Bottom Up Parse of "id - num * id"
1) id - num * id
2) F - num * id
3) T - num * id
4) E - num * id
5) E - F * id
6) E - T * id
7) E - T * F
8) E - T
9) E
10) S → correct sentence
Note: if there is no rule that applies, backtracking is necessary

54 Top-Down and Bottom-Up Top-down –Advantage: Searches only trees that are legal –Disadvantage: Tries trees that don’t match the words Bottom-up –Advantage: Only forms trees matching the words –Disadvantage: Tries trees that make no sense globally Efficient combined algorithms –Link top-down expectations with bottom-up data –Example: Top-down parsing with bottom-up filtering

55 Stochastic Language Models
Problems
–A language model cannot cover all grammatical rules
–Spoken language is often ungrammatical
Solution
–Constrain the search space by emphasizing likely word sequences
–Enhance the grammar to recognize intended sentences even when the sequence doesn't satisfy the rules
A probabilistic view of language modeling

56 Probabilistic Context-Free Grammars (PCFG)
Definition: G = (V_N, V_T, S, R, p)
–V_N: the set of non-terminal symbols
–V_T: the set of terminal symbols
–S: the start symbol
–R: the set of rules
–p: the set of rule probabilities
P(S ⇒ W | G): the probability that the start symbol S derives the word string W in grammar G
Training the grammar: count rule occurrences in a training corpus
–P(R | G) = Count(R) / ∑ Count(R'), summing over the rules R' with the same left-hand side as R
Goal: assist in discriminating among competing choices
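Training by counting can be sketched in a few lines: each rule's probability is its count divided by the total count of rules sharing its left-hand side. The toy corpus below is made up for illustration.

```python
from collections import Counter

def rule_probabilities(observed_rules):
    """observed_rules: list of (lhs, rhs) pairs seen in a training corpus."""
    rule_counts = Counter(observed_rules)
    lhs_counts = Counter(lhs for lhs, _ in observed_rules)
    # relative frequency: Count(lhs -> rhs) / total count of rules with this lhs
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

corpus = [("NP", "Det N"), ("NP", "Det N"), ("NP", "Pro"), ("VP", "V NP")]
probs = rule_probabilities(corpus)   # P(NP -> Det N | NP) = 2/3
```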

57 PFSA (Probabilistic Finite State Automata)
A PFSA is a type of probabilistic context-free grammar
–The states are the non-terminals in a production rule
–The output symbols are the observed outputs
–The arcs represent context-free rules
–A path through the automaton represents a parse tree
A PCFG considers state transitions and the transition path
Example: the rules S1 → a S2, S2 → b S3, S3 → ε correspond to the state sequence S1 S2 S3 emitting the string ab
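Scoring a path through such an automaton just multiplies the arc probabilities. The numbers below are illustrative assumptions added to the S1→S2→S3 example, not values from the slide.

```python
# (from_state, symbol, to_state) -> probability; None marks the final
# epsilon-rule S3 -> eps. Probabilities are illustrative assumptions.
ARC_P = {
    ("S1", "a", "S2"): 0.7,
    ("S1", "b", "S2"): 0.3,
    ("S2", "b", "S3"): 1.0,
    ("S3", "", None): 1.0,
}

def path_probability(arcs):
    p = 1.0
    for arc in arcs:
        p *= ARC_P[arc]     # product of the probabilities along the path
    return p
```

The path for the string "ab" (S1 -a-> S2 -b-> S3 -ε-> accept) scores 0.7 × 1.0 × 1.0 = 0.7.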

58 Probabilistic Finite State Machines
Probabilistic models determine the weights of the transitions
The weights leaving a state sum to unity
Operations
–Use the weights to compute the probability of a given string or the most likely path
–The machine can 'learn' the weights over time
(Figure: example with the words Canine, Companion, and Tooth and weights .01, .0035, .001)

59 Another Example

60 Pronunciation decoding [n iy]

61 Merging the machines together [n iy]

62 Another Example
