Search and Decoding in Speech Recognition
Words and Transducers
Outline
- Introduction
- Survey of English Morphology
- Inflectional Morphology
- Regular Verbs
- Derivational Morphology
- Cliticization
- Non-Concatenative Morphology
- Finite-State Morphological Parsers
- Constructing a Finite-State Lexicon
- Finite-State Transducers
- FSTs for Morphological Parsing
- Transducers and Orthographic Rules
- Combining the FST Lexicon and Rules
- Lexicon-Free FSTs: The Porter Stemmer
- Word and Sentence Tokenization
- Detecting and Correcting Spelling Errors
- Human Morphological Processing
- Summary
Introduction
Introduction
From Ch. 1 (regular expressions) we saw how easy it is to search for the plural of woodchuck (woodchucks). However, searching for the plural of fox, fish, peccary, or wild goose is not as trivial as just tacking on an s, as these dictionary entries show:
- fox \'fäks\ noun, plural fox·es, also fox; often attributive. From Middle English, from Old English; akin to Old High German fuhs "fox" and perhaps to Sanskrit puccha "tail".
- fish \'fish\ noun, plural fish or fish·es; often attributive. From Middle English, from Old English fisc; akin to Old High German fisc "fish", Latin piscis.
- pec·ca·ry \'pe-kə-rē\ noun, plural -ries. Of Cariban origin; akin to Suriname Carib paki:ra "peccary": any of several largely nocturnal gregarious American mammals resembling the related pigs, such as the grizzled Tayassu tajacu with an indistinct white collar and the blackish Tayassu pecari with a whitish mouth region.
- goose \'güs\ noun, plural geese \'gēs\. From Middle English gos, from Old English gōs; akin to Old High German gans "goose", Latin anser, Greek chēn.
Introduction
Required knowledge to correctly search for singulars and plurals in English:
- Orthographic rules: words ending in -y are pluralized by changing the -y to -i and adding -es.
- Morphological rules: fish has a null plural, and the plural of goose is formed by changing the vowel.
Morphological parsing: recognizing that a word (like foxes) breaks down into component morphemes (fox and -es) and building a structured representation of it.
Parsing means taking an input and producing some sort of linguistic structure for it. Parsing can be thought of broadly as producing structures from several kinds of input (morphology, syntax, semantics, discourse) and several kinds of output (string, tree, network).
Introduction
Morphological parsing (or stemming) applies to many affixes other than plurals. Example: parsing any English verb ending in -ing (e.g., going, talking, congratulating) into its verbal stem plus the -ing morpheme: going ⇨ VERB-go + GERUND-ing.
Morphological parsing is important for speech and language processing:
- Part-of-speech tagging
- Dictionaries (spell-checking)
- Machine translation
Introduction
To solve the morphological parsing problem, one could simply store all the plural forms of English nouns and all -ing forms of English verbs in a dictionary, as is done, for example, in English speech recognition tasks. For many natural language processing applications this is not possible, because -ing is a productive suffix: it applies to almost every verb, so handling it requires knowing the rules for adding the suffix. Similarly, -s applies to almost every noun. Productive suffixes also apply to new words (e.g., fax and faxing). New words (e.g., acronyms and proper nouns) are created constantly, and we need to add the plural morpheme -s to each; the plural form of a new noun depends on the spelling/pronunciation of the singular form (e.g., for nouns ending in -z, the plural is formed with -es). In other languages (e.g., Turkish) one simply cannot list all the morphological variants of every word: Turkish verbs have 40,000 possible forms, not counting derivational suffixes.
Noun
Most of us learned the classic definition of noun back in elementary school, where we were told simply that "a noun is the name of a person, place, or thing." That's not a bad beginning; it even clues us in to the origin of the word, since noun is derived ultimately from the Latin word nōmen, which means 'name'.
noun: any member of a class of words that can function as the main or only element of the subject of a verb (A dog just barked) or of the object of a verb or preposition (to send money from home), and that in English can take plural forms and possessive endings (Three of his buddies want to borrow John's laptop). Nouns are often described as referring to persons, places, things, states, or qualities, and the word noun is itself often used as an attributive modifier, as in noun compound, noun group.
Verb
The key word in most sentences, the word that reveals what is happening, is the verb. It can declare something (You ran), ask a question (Did you run?), convey a command (Run faster!), express a wish (May this good weather last!), or express a possibility (If you had run well, you might have won; if you run better tomorrow, you may win).
Verb
You cannot have a complete English sentence without at least one verb.
verb: any member of a class of words that function as the main elements of predicates, that typically express action, state, or a relation between two things, and that may be inflected for tense, aspect, voice, and mood, and to show agreement with their subject or object.
The definitions of noun and verb were taken from dictionary.com.
Outline
Outline
- Survey of morphological knowledge for English.
- Introduction of the finite-state transducer as the key algorithm for morphological parsing. Finite-state transducers are key algorithms for speech and language processing.
- Related algorithms:
  Stemming: mapping from a word to its root or stem; important for information retrieval tasks.
  Lemmatization: we often need to know whether two words have a similar root despite their surface differences, e.g., sang and sung. The word sing is called the common lemma of these words, and mapping from all of these forms to sing is called lemmatization.
Outline
- Tokenization, or word segmentation: an algorithm related to morphological parsing, defined as the task of separating out (tokenizing) words from running text. English text separates words by white space, but "New York" and "rock 'n' roll" are considered single words, while "I'm" is considered two words ("I" and "am"), etc.
- For many applications we need to know how similar two words are orthographically. Morphological parsing is one method for computing similarity; another is comparison of strings of letters via the minimum edit distance algorithm.
Morphological Parsing
Morphological parsing, in natural language processing, is the process of determining the morphemes from which a given word is constructed. It must be able to distinguish between orthographic rules and morphological rules. For example, the word 'foxes' can be decomposed into 'fox' (the stem) and 'es' (a suffix indicating plurality). The generally accepted approach to morphological parsing is the finite-state transducer (FST), which takes words as input and outputs their stems and modifiers. The FST is initially created through algorithmic parsing of some word source, such as a dictionary, complete with modifier markups.
Survey of English Morphology
Survey of English Morphology
Morphology is the study of the way words are built up from smaller meaning-bearing units called morphemes. A morpheme is often defined as the minimal meaning-bearing unit in a language.
mor·pheme \'mȯr-fēm\ noun. From French morphème, from Greek morphē "form": a distinctive collocation of phonemes (as the free form pin or the bound form -s of pins) having no smaller meaningful parts.
Survey of English Morphology
Example: fox consists of a single morpheme (fox); cats consists of two morphemes (cat and -s).
Two broad classes of morphemes:
- Stems: the main morpheme of a word.
- Affixes: add additional meaning to the word.
  Prefixes precede the stem: unbuckle.
  Suffixes follow the stem: eats.
  Infixes are inserted in the stem: humingi in the Philippine language Tagalog, formed by inserting an infix into the stem.
  Circumfixes precede and follow the stem: German gesagt (past participle of sagen), formed with ge- and -t around the stem sag.
Survey of English Morphology
A word can have more than one affix:
- rewrites: prefix re-, stem write, suffix -s.
- unbelievably: prefix un-, stem believe, suffixes -able and -ly.
English does not tend to stack more than four or five affixes; Turkish can have words with nine or ten affixes. Languages like Turkish are called agglutinative languages.
ag·glu·ti·na·tive \ə-ˈglü-tən-ˌā-tiv, -ə-tiv\ adjective (1634): 1: adhesive; 2: characterized by linguistic agglutination.
Survey of English Morphology
There are many ways to combine morphemes to create a word. Four methods are common and play an important role in speech and language processing:
- Inflection: combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function like agreement. Example: -s (plural of nouns), -ed (past tense of verbs).
Survey of English Morphology
- Derivation: combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict. Example: computerize (verb) → computerization (noun).
- Compounding: combination of multiple word stems. Example: doghouse = dog + house.
- Cliticization: combination of a word stem with a clitic, a morpheme that acts syntactically like a word but is reduced in form and attached (phonologically and sometimes orthographically) to another word. Example: I've = I + 've = I + have.
Inflectional Morphology
Inflectional Morphology
English has a relatively simple inflectional system: only nouns, verbs, and (sometimes) adjectives can be inflected, and the number of possible inflectional affixes is quite small.
Inflectional Morphology: Nouns
English nouns are inflected for plural and possessive. Many (but not all) nouns can either appear in the bare-stem (singular) form or take a plural suffix:

           Regular Nouns      Irregular Nouns
Singular   cat    thrush      mouse   ox
Plural     cats   thrushes    mice    oxen
Inflectional Morphology: Nouns
The regular plural is spelled -s, or -es after words ending in -s (ibis/ibises), -z (waltz/waltzes), -sh (thrush/thrushes), -ch (finch/finches), and sometimes -x (box/boxes). Nouns ending in -y preceded by a consonant change the -y to -i (butterfly/butterflies).
The possessive suffix is realized by apostrophe + -s for regular singular nouns (llama's) and plural nouns not ending in -s (children's), and often by a lone apostrophe after regular plural nouns (llamas') and some names ending in -s or -z (Euripides' comedies).
Inflectional Morphology: Verbs
English inflection of verbs is more complicated than nominal inflection; there are both regular and irregular verbs. English has three kinds of verbs:
- Main verbs (eat, sleep, impeach)
- Modal verbs (can, will, should)
- Primary verbs (be, have, do)
We are concerned with main and primary verbs because these have inflectional endings. Of these verbs, a large class are regular: all verbs in this class have the same endings marking the same functions.
Inflectional Morphology: Regular & Irregular Verbs
Regularly Inflected Verbs
Regular verbs have four morphological forms. For regular verbs, given the stem, we can predict the other forms by adding one of three endings and making (some) regular spelling changes:

Morphological Form Class       Regularly Inflected Verbs
Stem                           walk     merge    try     map
-s form                        walks    merges   tries   maps
-ing participle                walking  merging  trying  mapping
Past tense or -ed participle   walked   merged   tried   mapped
Regular Verbs
Since regular verbs cover the majority of verbs and forms, and the regular class is productive, they are significant in the morphology of English. A productive class is one that automatically includes any new words that enter the language.
Irregularly Inflected Verbs
Irregular verbs are those that have some more or less idiosyncratic forms of inflection. English irregular verbs often have five different forms, but can have as many as eight (e.g., the verb be) or as few as three (e.g., cut or hit). They constitute a smaller class of verbs, estimated at about 250:

Morphological Form Class   Irregularly Inflected Verbs
Stem                       eat      catch     cut
-s form                    eats     catches   cuts
-ing participle            eating   catching  cutting
Past tense                 ate      caught    cut
-ed participle             eaten    caught    cut
Usage of Morphological Forms for Irregular Verbs
- The -s form is used in the "habitual present" to distinguish the third-person singular ("She jogs every Tuesday") from the other choices of person and number ("I/you/we/they jog every Tuesday").
- The stem form is used in the infinitive, and also after certain other verbs ("I'd rather walk home, I want to walk home").
- The -ing participle is used in the progressive construction to mark a present or ongoing activity ("It is raining"), or when the verb is treated as a noun; this particular kind of nominal use of a verb is called the gerund: "Fishing is fine if you live near water."
- The -ed participle is used in the perfect construction ("He's eaten lunch already") or the passive construction ("The verdict was overturned yesterday").
Spelling Changes
A number of regular spelling changes occur at morpheme boundaries. Examples:
- A single consonant letter is doubled before adding the -ing and -ed suffixes: beg/begging/begged.
- If the final letter is c, the doubling is spelled ck: picnic/picnicking/picnicked.
- If the base ends in a silent -e, it is deleted before adding -ing and -ed: merge/merging/merged.
- Just as for nouns, the -s ending is spelled -es after verb stems ending in -s (toss/tosses), -z (waltz/waltzes), -sh (wash/washes), -ch (catch/catches), and sometimes -x (tax/taxes).
- Also like nouns, verbs ending in -y preceded by a consonant change the -y to -i (try/tries).
A sketch of these rules as code follows below.
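These spelling changes can be sketched as a small set of Perl substitutions (hedged: the rule set and the consonant-doubling heuristic below are illustrative simplifications, not a complete treatment):

#!/usr/bin/perl
use strict; use warnings;

# Illustrative -s attachment: -es after s/z/sh/ch/x, y -> ies, else -s.
sub add_s {
    my ($stem) = @_;
    return $stem . "es" if $stem =~ /(?:s|z|sh|ch|x)$/;   # toss -> tosses
    return $1 . "ies"   if $stem =~ /^(.*[^aeiou])y$/;    # try  -> tries
    return $stem . "s";                                   # walk -> walks
}

# Illustrative -ing attachment: e-deletion, rough consonant doubling, c -> ck.
sub add_ing {
    my ($stem) = @_;
    $stem =~ s/e$//;                                  # merge  -> merging
    $stem =~ s/([^aeiou][aeiou])([bgmnpt])$/$1$2$2/;  # beg    -> begging (heuristic)
    $stem =~ s/c$/ck/;                                # picnic -> picnicking
    return $stem . "ing";
}

print join(" ", add_s("toss"), add_s("try"), add_ing("beg"), add_ing("merge")), "\n";
# prints: tosses tries begging merging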
Derivational Morphology
Derivational Morphology
Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict exactly. While English inflection is relatively simple compared to other languages, derivation in English is quite complex.
Derivational Morphology
A common kind of derivation in English is the formation of new nouns from verbs or adjectives, called nominalization. For example, the suffix -ation produces nouns from verbs, often verbs ending in the suffix -ize (computerize → computerization):

Suffix   Base Verb/Adjective   Derived Noun
-ation   computerize (V)       computerization
-ee      appoint (V)           appointee
-er      kill (V)              killer
-ness    fuzzy (A)             fuzziness
Derivational Morphology
Adjectives can also be derived from nouns and verbs:

Suffix   Base Noun/Verb    Derived Adjective
-al      computation (N)   computational
-able    embrace (V)       embraceable
-less    clue (N)          clueless
Complexity of Derivation in English
There are a number of reasons for the complexity of derivation in English:
- It is generally less productive: even a nominalizing suffix like -ation, which can be added to almost any verb ending in -ize, cannot be added to absolutely every verb. Example: we can't say *eatation or *spellation (* marks forms that do not occur in English).
- There are subtle and complex meaning differences among nominalizing suffixes. Example: sincerity vs. sincereness.
Cliticization
Cliticization
clitic, noun (linguistics): a morpheme that functions like a word, but appears not as an independent word but rather always attached to a following or preceding word. In English, the possessive 's is an example.
cliticization, noun: the process or an instance of a word becoming a clitic.
Cliticization
A clitic is a unit whose status lies in between that of an affix and a word:
- Phonological behavior: short, unaccented.
- Syntactic behavior: acts like a word (pronoun, article, conjunction, or verb).
Cliticization
Proclitics are clitics preceding a word; enclitics are clitics following a word.

Full Form   Clitic      Full Form   Clitic
am          'm          have        've
are         're         has         's
is          's          had         'd
will        'll         would       'd

Ambiguity: she's → she is or she has.
Non-Concatenative Morphology
Non-Concatenative Morphology
The morphology discussed so far is called concatenative morphology. Languages other than English can have extensive non-concatenative morphology, in which morphemes are combined in more complex ways, as in Tagalog. Arabic, Hebrew, and other Semitic languages exhibit templatic morphology, or root-and-pattern morphology.
Agreement
Agreement
In English, the plural is marked on both nouns and verbs. Consequently, the subject noun and the main verb have to agree in number: both must be either singular or plural.
Finite-State Morphological Parsing
Finite-State Morphological Parsing
The goal of morphological parsing is to take the input forms in the first column of the following table and produce the output forms in the second column:

Input     Morphologically Parsed Output
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
goose     goose +N +SG or goose +V
gooses    goose +V +1P +SG
merging   merge +V +PresPart
caught    catch +V +PastPart or catch +V +Past
Finite-State Morphological Parsing
The second column of the table on the previous slide contains the stem of each word together with assorted morphological features that provide additional information about the stem. Examples: +N means the word is a noun; +SG means it is singular. Some of the input forms are ambiguous (caught, goose). For now we will consider the goal of morphological parsing to be merely listing all possible parses; the task of disambiguating among morphological parses is discussed in Chapter 5.
Requirements of a Morphological Parser
- Lexicon: the list of stems and affixes, together with basic information about them (e.g., whether a stem is a noun or a verb).
- Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word. Example: the English plural morpheme follows the noun and does not precede it.
- Orthographic rules: spelling rules used to model the changes that occur in a word, usually when two morphemes combine. Example: the y → ie spelling rule, as in city + -s → cities.
Links to Morphological Parsers
- PC-KIMMO: downloads and documentation for the PC-KIMMO morphological parser, as well as background information and research in computational morphology.
- Hermit Crab: a morphological parser and generator for classical generative phonology and morphology.
- NLPUtils (Boisclair tokenizer and morphological analyzer): a versatile tokenizer and morphological analyzer for English, in C#; C# class libraries developed for natural language processing by the CASPR (Computer Analysis of Speech for Psychological Research) and IHUT projects.
Links to Morphological Parsers
- CELEX: the CELEX lemma lexicon is the one most similar to an ordinary dictionary, since every entry in this lexicon represents a set of related inflected words.
- LDC96L14 (LDC Catalog): the second release of CELEX contains an enhanced, expanded version of the German lexical database (2.5), featuring approximately 1000 new lemma entries.
Example: "I am Veton Kepuska and I am trying this link grammar to tokenize and parse this sentence"

(S (S (NP I) (VP am (NP Veton Kepuska)))
   and
   (S (NP I) (VP trying (NP this link grammar)
              (S (VP to (VP tokenize and parse (NP this sentence)))))))
Requirements of a Morphological Parser
In the next section we will present:
- A representation of a simple lexicon for the sub-problem of morphological recognition.
- FSAs built to model morphotactic knowledge.
- The finite-state transducer (FST), introduced as a way of modeling morphological features in the lexicon.
Constructing a Finite-State Lexicon
Lexicon
A lexicon is a repository for words. The simplest possible lexicon would consist of an explicit list of every word of the language, where "every word" includes abbreviations (AAA, FSA, NY, FL) and proper names (Jane, Beijing, Gent, Lydra, Saranda, Zana, etc.): a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, …
For various reasons it is in general inconvenient, or even impossible, to list every word in a language. Computational lexicons are therefore usually structured as a list of the stems and affixes of the language, together with a representation of the morphotactics. One of the most common ways to model morphotactics is the finite-state automaton.
Example of an FSA for English Nominal Inflection
This FSA assumes that the lexicon includes:
- regular nouns (reg-noun) that take the regular -s plural: cat, dog, aardvark (ignoring for now that the plural of words like fox has an inserted e: foxes);
- irregular noun forms that don't take -s, both singular (irreg-sg-noun: goose, mouse, sheep) and plural (irreg-pl-noun: geese, mice, sheep).
Examples of Noun Inflections

reg-noun   irreg-pl-noun   irreg-sg-noun   plural
fox        geese           goose           -s
cat        sheep           sheep
aardvark   mice            mouse
Example for English Verbal Inflection
This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem, irreg-past-verb-form) and four affix classes:
- -ed: past
- -ed: participle
- -ing: participle
- -s: third person singular
Examples of Verb Inflections

reg-verb-stem   irreg-verb-stem   irreg-past-verb   past   past-part   pres-part   3sg
walk            cut               caught            -ed    -ed         -ing        -s
fry             speak             ate
talk            sing              eaten
impeach                           sang
English Derivational Morphology
As discussed earlier in this chapter, English derivational morphology is significantly more complex than English inflectional morphology. FSAs that model derivational morphology thus tend to be quite complex; some models of English derivation are in fact based on the more complex context-free grammars.
Morphotactics of English Adjectives
An example of a simple case of derivation, from Antworth (1990):
- big, bigger, biggest
- happy, happier, happiest, happily
- unhappy, unhappier, unhappiest, unhappily
- clear, clearer, clearest, clearly, unclear, unclearly
- cool, cooler, coolest, coolly
- red, redder, reddest
- real, unreal, really
Problem Issues
While the FSA on the previous slide will recognize all the adjectives in the table presented earlier, it will also recognize ungrammatical forms like *unbig, *unfast, *oranger, or *smally. The adj-root classes would need to distinguish adjectives that can occur with un- and -ly (clear, happy, real) from those that cannot (big, small, etc.). This simple example gives an idea of the complexity to be expected from English derivation.
Derivational Morphology Example 2
An FSA can model a number of derivational facts, such as the generalization that any verb ending in -ize can be followed by the nominalizing suffix -ation, and that -al or -able can be followed by -ity or -ness. Examples:
- fossil → fossilize → fossilization
- equal → equalize → equalization
- formal → formalize → formalization
- realize → realizable → realization
- natural → naturalness
- casual → casualness
See Exercise 3.1 of the textbook to discover some of the individual exceptions to many of these constructs. Such FSAs model another fragment of English derivational morphology.
Solving the Problem of Morphological Recognition
Using FSAs like the ones above, we can solve the problem of morphological recognition: given an input string of letters, does it constitute a legitimate English word or not? We take the morphotactic FSAs and plug each "sub-lexicon" into the FSA, expanding each arc (e.g., the reg-noun-stem arc) with all the morphemes that make up that class. The resulting FSA can then be defined at the level of the individual letter.
Solving the Problem of Morphological Recognition
A noun-recognition FSA is produced by expanding the nominal-inflection FSA with sample regular and irregular nouns for each class. We can use this FSA to recognize strings like aardvarks by simply starting at the initial state and comparing the input letter by letter with each word on each outgoing arc, and so on, just as we saw in Ch. 2. A sketch of this letter-level recognition follows below.
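A minimal sketch of this expansion in Perl (the sub-lexicons are the sample words from the earlier slides; here the letter-level FSA is compiled into a single regular expression, which is itself a finite-state recognizer):

#!/usr/bin/perl
use strict; use warnings;

# Sample sub-lexicons from the nominal-inflection slides.
my $reg_noun      = join "|", qw(cat dog aardvark fox);
my $irreg_sg_noun = join "|", qw(goose mouse sheep);
my $irreg_pl_noun = join "|", qw(geese mice sheep);

# Expand the nominal-inflection FSA: (reg-noun -s?) | irreg-sg | irreg-pl.
my $noun_fsa = qr/^(?:(?:$reg_noun)s?|$irreg_sg_noun|$irreg_pl_noun)$/;

for my $w (qw(aardvarks geese foxs blarg)) {
    printf "%-10s %s\n", $w, ($w =~ $noun_fsa ? "accept" : "reject");
}
# aardvarks and geese are accepted; blarg is rejected. Note that "foxs"
# is (wrongly) accepted: the E-insertion spelling rule handled later in
# the chapter is what produces the correct surface form "foxes".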
77
9 10 11 12 13 14 15 16 17 Outline Finite State Transducers
FST for Morphological Parsing 10 Transducers and Orthographic Rules 11 Combining FST Lexicon and Rules 12 Lexicon-Free FSTs: The Porter Stemmer 13 Word and Sentence Tokenization 14 Detecting and Correcting Spelling Errors 15 Human Morphological Processing 16 Summary 17 1 December 2019 Veton Këpuska
78
Finite-State Transducers
Finite-State Transducers (FST)
An FSA can represent the morphotactic structure of a lexicon and thus can be used for word recognition. In this section we introduce the finite-state transducer and show how it can be applied to morphological parsing. A transducer maps between one representation and another; a finite-state transducer, or FST, is a type of finite automaton that maps between two sets of symbols. An FST can be visualized as a two-tape automaton that recognizes or generates pairs of strings. Intuitively, we do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape. In the figure on the next slide, an FST is depicted where each arc is labeled by an input and an output string, separated by a colon.
A Finite-State Transducer (FST)
An FST has a more general function than an FSA: where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings. An FST can be thought of as a machine that reads one string and generates another.
A Finite-State Transducer (FST)
- FST as recognizer: a transducer that takes a pair of strings as input and outputs accept if the string pair is in the string-pair language, and reject if it is not.
- FST as generator: a machine that outputs pairs of strings of the language; the output is a yes or no, and a pair of output strings.
- FST as translator: a machine that reads a string and outputs another string.
- FST as set relater: a machine that computes relations between sets.
A Finite-State Transducer (FST)
All four categories of FST on the previous slide have applications in speech and natural language processing (NLP). For morphological parsing (and for many other NLP applications) we apply the FST-as-translator metaphor: the input is a string of letters, and the output is a string of morphemes.
Formal Definition of an FST
- Q: a finite set of N states q0, q1, …, qN-1.
- Σ: a finite set corresponding to the input alphabet.
- Δ: a finite set corresponding to the output alphabet.
- q0 ∈ Q: the start state.
- F ⊆ Q: the set of final states.
- δ(q, w): the transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q' ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (the set of subsets of Q). δ returns a set of states rather than a single state because a given input may be ambiguous as to which state it maps to.
- σ(q, w): the output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).
A toy rendering of this definition as code follows below.
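Rendered as a data structure in Perl, the definition looks like this (a hedged sketch: the toy machine below, which implements [a:b]+, and its field names are invented for illustration):

#!/usr/bin/perl
use strict; use warnings;

# A tiny FST per the formal definition: delta and sigma map a
# (state, input-symbol) pair to a set of next states / output strings.
# This particular machine implements [a:b]+.
my %fst = (
    start  => "q0",
    finals => { q1 => 1 },
    delta  => { "q0,a" => ["q1"], "q1,a" => ["q1"] },  # delta: Q x Sigma -> 2^Q
    sigma  => { "q0,a" => ["b"],  "q1,a" => ["b"]  },  # sigma: Q x Sigma -> 2^(Delta*)
);

sub transduce {
    my ($m, @input) = @_;
    my ($q, $out) = ($m->{start}, "");
    for my $c (@input) {
        my $next = $m->{delta}{"$q,$c"} or return undef;  # no transition: reject
        $out .= $m->{sigma}{"$q,$c"}[0];   # this toy machine is deterministic
        $q = $next->[0];
    }
    return $m->{finals}{$q} ? $out : undef;   # accept only in a final state
}

print transduce(\%fst, split //, "aaa"), "\n";   # prints "bbb"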
Properties of FSTs
FSAs are isomorphic to regular languages; FSTs are isomorphic to regular relations. FSTs are closed under union, but generally they are not closed under difference, complementation, and intersection. In addition to union, FSTs have two closure properties that turn out to be extremely useful:
- Inversion: the inversion of a transducer T, written T⁻¹, simply switches the input and output labels. Thus if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I.
- Composition: if T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1∘T2 maps from I1 to O2.
Properties of FSTs
Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator. Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer. Composition works as in algebra: applying T1∘T2 to an input sequence S is identical to applying T1 to S and then T2 to the result; thus T1∘T2(S) = T2(T1(S)).
FST Composition Example
The composition of [a:b]+ with [b:c]+ produces [a:c]+. A small demonstration of running the two machines in series follows below.
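Operationally, the effect of this composition can be sketched by running the two single-pair transducers in series (a toy Perl illustration; a real composition algorithm builds the product machine rather than running the machines one after the other):

#!/usr/bin/perl
use strict; use warnings;

# [a:b]+ and [b:c]+ over one-letter alphabets, run in series per
# T1 o T2 (S) = T2(T1(S)).
my $s = "aaa";
(my $t1 = $s)  =~ tr/a/b/;    # T1 = [a:b]+ : "aaa" -> "bbb"
(my $t2 = $t1) =~ tr/b/c/;    # T2 = [b:c]+ : "bbb" -> "ccc"
print "$s -> $t1 -> $t2\n";   # the composed machine [a:c]+ maps "aaa" to "ccc"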
Projection
The projection of an FST is the FSA produced by extracting only one side of the relation. We refer to the projection to the left or upper side of the relation as the upper or first projection, and the projection to the lower or right side of the relation as the lower or second projection.
Sequential Transducers and Determinism
Sequential Transducers and Determinism
Transducers as described so far may be nondeterministic: for a given input there may be many possible output symbols. Thus using general FSTs requires the kinds of search algorithms discussed in Chapter 1, which makes FSTs quite slow in the general case. This suggests that it would be nice to have an algorithm to convert a nondeterministic FST to a deterministic one. But while every nondeterministic FSA is equivalent to some deterministic FSA, not all FSTs can be determinized.
Sequential Transducers
Sequential transducers are a subtype of transducers that are deterministic on their input: at any state of a sequential transducer, each given symbol of the input alphabet Σ can label at most one transition out of that state.
Sequential Transducers
Sequential transducers are not necessarily sequential on their output: in the FST example on the previous slide, two distinct transitions leaving state q0 have the same output (b). The inverse of a sequential transducer may thus not be sequential, so we always need to specify the direction of the transduction when discussing sequentiality. Formally, the definition of a sequential transducer modifies the δ and σ functions slightly: δ becomes a function from Q × Σ* to Q (rather than to 2^Q), and σ becomes a function from Q × Σ* to Δ* (rather than to 2^(Δ*)). A subsequential transducer is a generalization of the sequential transducer that generates an additional output string at each final state, concatenating it onto the output produced so far.
Importance of Sequential and Subsequential Transducers
Sequential and subsequential FSTs are efficient because they are deterministic on input: they can be processed in time proportional to the number of symbols in the input (linear in input length), rather than proportional to some much larger number that is a function of the number of states. Efficient algorithms exist for the determinization of subsequential transducers (Mohri, 1997) and for their minimization (Mohri, 2000).
Importance of Sequential and Subsequential Transducers
While sequential and subsequential FSTs are deterministic and efficient, neither is able to handle ambiguity: they transduce each input string to exactly one output string. Since ambiguity is a crucial property of natural language, it is useful to have an extension of subsequential transducers that can deal with ambiguity but still retains the efficiency and other useful properties of sequential transducers. One such generalization is the p-subsequential transducer, which allows p (p ≥ 1) final output strings to be associated with each final state (Mohri, 1996). p-subsequential transducers can handle a finite amount of ambiguity, which is useful for many NLP tasks.
2-subsequential Transducer (Mohri, 1997)
Mohri (1996, 1997) shows a number of tasks whose ambiguity can be limited in this way, including the representation of dictionaries, the compilation of morphological and phonological rules, and local syntactic constraints. For each of these kinds of problems, he and others have shown that they are p-subsequentializable, and thus can be determinized and minimized. This class of transducers includes many, though not necessarily all, morphological rules.
FSTs for Morphological Parsing
Tasks of Morphological Parsing
Example: cats → cat +N +PL. In the finite-state morphology paradigm, we represent a word as a correspondence between:
- a lexical level, which represents a concatenation of the morphemes that make up the word, and
- the surface level, which represents the concatenation of letters that make up the actual spelling of the word.
Lexical Tape
For finite-state morphology it is convenient to view an FST as having two tapes. The upper or lexical tape is composed of characters from one (input) alphabet Σ; the lower or surface tape is composed of characters from another (output) alphabet Δ. In the two-level morphology of Koskenniemi (1983), each arc is allowed to have only a single symbol from each alphabet. A two-symbol alphabet can be obtained by combining the alphabets Σ and Δ into a new alphabet Σ', which makes the relationship to FSAs clear: Σ' is a finite alphabet of complex symbols, each composed of an input-output pair i:o, where i is one symbol from the input alphabet Σ and o is one symbol from the output alphabet Δ. Thus Σ' ⊆ Σ × Δ; Σ and Δ may each also include the epsilon symbol ε.
Lexical Tape
Comparing FSAs to FSTs for modeling the morphology of a language: an FSA accepts a language stated over a finite alphabet of single symbols, e.g. the sheep language Σ = {b, a, !}, whereas an FST as defined above accepts a language stated over pairs of symbols, as in Σ' = {a:a, b:b, !:!, a:!, a:ε, ε:!}.
Feasible Pairs
In two-level morphology, the pairs of symbols in Σ' are also called feasible pairs. Each feasible-pair symbol a:b in the transducer alphabet Σ' expresses how the symbol a from one tape is mapped to the symbol b on the other tape. Example: a:ε means that an a on the upper tape corresponds to nothing on the lower tape. We can write regular expressions in the complex alphabet Σ' just as in the case of FSAs.
Default Pairs
Since it is most common for symbols to map to themselves, in two-level morphology we call pairs like a:a default pairs and refer to them simply by the single letter a.
FST Morphological Parser
From the morphotactic FSAs covered earlier, by adding a lexical tape and the appropriate morphological features, we can build an FST morphological parser. The figure on the next slide augments the FSA for English nominal inflection with the nominal morphological features (+Sg and +Pl) that correspond to each morpheme. The symbol ^ indicates a morpheme boundary, while the symbol # indicates a word boundary.
A Schematic Transducer for English Nominal Number Inflection (Tnum)
The symbols above each arc represent elements of the morphological parse on the lexical tape; the symbols below each arc represent the surface tape (or the intermediate tape, to be described later), using the morpheme-boundary symbol ^ and the word-boundary marker #. The arcs need to be expanded with the individual words in the lexicon.
Transducer & Lexicon In order to use the Transducer in the previous slide as a morphological noun parser it needs to be expanded with all the individual regular and irregular noun stems: replacing the labels reg-noun etc. This expansion can be done by updating the lexicon for this transducer, so that irregular plurals like geese will parse into the correct stem goose +N +Pl. This is achieved by allowing the lexicon to also have two levels: Surface geese maps to lexical goose → new lexical entry: “g:g o:e o:e s:s e:e”. Regular forms are simpler: two level entry for fox will now be “f:f o:o x:x” Relying on the orthographic convention that f stands for f:f and so on, we can simply refer to it as fox and the form for geese as “g o:e o:e s e”. 1 December 2019 Veton Këpuska
104
Lexicon

reg-noun   irreg-pl-noun     irreg-sg-noun
fox        g o:e o:e s e     goose
cat        sheep             sheep
aardvark   m o:i u:e s:c e   mouse
FST
The resulting transducer will map plural nouns into the stem plus the morphological marker +Pl, and singular nouns into the stem plus the morphological marker +Sg. Example: cats → cat +N +Pl, via the feasible pairs c:c a:a t:t +N:ε +Pl:^s#. Because the output symbols include the morpheme- and word-boundary markers ^ and #, the lower labels do not correspond exactly to the surface level. Hence the lower tape containing these boundary markers is referred to as the intermediate tape, as shown in the next figure. A sketch of this mapping in code follows below.
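A lookup-table stand-in for this lexicon transducer, in Perl (hedged: a real Tlex is a transducer over feasible pairs, not a hash; the entries below are just the examples from these slides):

#!/usr/bin/perl
use strict; use warnings;

# Surface form -> [lexical tape, intermediate tape], per the slides.
# ^ marks a morpheme boundary and # a word boundary on the intermediate tape.
my %t_lex = (
    "cat"   => [ "cat +N +Sg",   "cat#"   ],
    "cats"  => [ "cat +N +Pl",   "cat^s#" ],
    "foxes" => [ "fox +N +Pl",   "fox^s#" ],  # lexicon entry "f:f o:o x:x"
    "geese" => [ "goose +N +Pl", "geese#" ],  # lexicon entry "g o:e o:e s e"
);

for my $surface (qw(cats geese)) {
    my ($lexical, $intermediate) = @{ $t_lex{$surface} };
    print "$surface <- $intermediate <- $lexical\n";
}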
Lexical and Intermediate Tapes
Transducers and Orthographic Rules
Transducers and Orthographic Rules
The method described in the previous section will successfully recognize words like aardvarks and mice. However, simply concatenating morphemes won't work for cases where there is a spelling change: it would incorrectly reject an input like foxes and incorrectly accept an input like foxs. This is because English often requires spelling changes at morpheme boundaries, which we handle by introducing spelling (or orthographic) rules.
Notations for Writing Orthographic Rules
Implementing a rule in a transducer is important in general for speech and language processing. The table below introduces a number of spelling rules:

Name                 Description of Rule                              Example
Consonant doubling   1-letter consonant doubled before -ing/-ed       beg/begging
E deletion           Silent e dropped before -ing and -ed             make/making
E insertion          e added after -s, -z, -x, -ch, -sh before -s     watch/watches
Y replacement        -y changes to -ie before -s, to -i before -ed    try/tries
K insertion          Verbs ending with vowel + -c add -k              panic/panicked

A sketch of these rules as code follows below.
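The five rules can be sketched as rewrite rules applied at a morpheme boundary ^ (a hedged Perl illustration; the patterns are simplified and not exhaustive):

#!/usr/bin/perl
use strict; use warnings;

# Apply the table's spelling rules to stem^suffix, then drop the boundary.
sub attach {
    my ($stem, $suffix) = @_;
    my $w = "$stem^$suffix";
    $w =~ s/e\^(ing|ed)/^$1/;                                    # E deletion
    $w =~ s/([sxz]|[cs]h)\^s/$1^es/;                             # E insertion
    $w =~ s/([^aeiou])y\^s/$1ie^s/;                              # Y replacement
    $w =~ s/([^aeiou][aeiou])([bdgmnprt])\^(ing|ed)/$1$2$2^$3/;  # consonant doubling
    $w =~ s/([aeiou])c\^(ing|ed)/$1ck^$2/;                       # K insertion
    $w =~ s/\^//;                                                # remove boundary marker
    return $w;
}

print join(" ", attach("watch", "s"), attach("try", "s"), attach("make", "ing"),
                attach("beg", "ed"), attach("panic", "ed")), "\n";
# prints: watches tries making begged panicked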
Lexical → Intermediate → Surface
An example of the lexical, intermediate, and surface tapes: between each pair of tapes is a two-level transducer — the lexicon transducer between the lexical and intermediate levels, and the E-insertion spelling rule between the intermediate and surface levels. The E-insertion spelling rule inserts an e on the surface tape when the intermediate tape has a morpheme boundary ^ followed by the morpheme -s.
Orthographic Rule Example
The E-insertion rule might be formulated as: insert an e on the surface tape just when the lexical tape has a morpheme ending in x (or z, etc.) and the next morpheme is -s. A formalization of this rule:

ε → e / {x, s, z} ^ __ s #

This rule notation is due to Chomsky and Halle (1968): a rule of the form a → b / c __ d means "rewrite a as b when it occurs between c and d."
Orthographic Rule Example
The symbol ε means an empty transition; replacing it means inserting something. Morpheme boundaries ^ are deleted by default on the surface level (^:ε). Since the # symbol marks a word boundary, the rule above means: insert an e after a morpheme-final x, s, or z, and before the morpheme s.
Transducer for the E-insertion Rule
The transition table below is reconstructed following the corresponding figure in Jurafsky & Martin; a colon after a state name marks an accepting state. q3 and q4 lie on the e-insertion path, and q5 handles the case where an s follows the boundary without an inserted e:

State/Input   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:           q1    q1    q1    q0    -     q0   q0
q1:           q1    q1    q1    q2    -     q0   q0
q2:           q5    q1    q1    q0    q3    q0   q0
q3            q4    -     -     -     -     -    -
q4            -     -     -     -     -     q0   -
q5            q1    q1    q1    q2    -     -    q0

The generation direction of this rule is sketched in code below.
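In the generation direction, the effect of this transducer on the intermediate tape can be sketched as a single substitution (Perl; the real FST also runs in the parsing direction and rejects ill-formed strings):

#!/usr/bin/perl
use strict; use warnings;

# Generation direction of the E-insertion rule: insert e between a
# morpheme-final x, s, or z and a following -s, then delete the
# boundary markers ^ and # (deleted by default).
sub e_insertion {
    my ($intermediate) = @_;            # e.g. "fox^s#"
    (my $surface = $intermediate) =~ s/([xsz])\^s#/$1es#/;
    $surface =~ s/[\^#]//g;
    return $surface;
}

print join(" ", e_insertion("fox^s#"), e_insertion("cat^s#")), "\n";  # foxes cats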
Combining FST Lexicon and Rules
Combining the FST Lexicon and Rules
We are now ready to combine the lexicon and rule transducers for parsing and generating. The figure below depicts the architecture of a two-level cascade of FSTs with a lexicon and rules.
The Tlex FST Combined with the Te-insert FST
Finite-State Transducers
The exact same cascade with the same state sequences is used whether the machine is generating the surface tape from the lexical tape or parsing the lexical tape from the surface tape. Parsing is slightly more complicated than generation because of the problem of ambiguity. Example: foxes can also be a verb (meaning "to baffle or confuse"), so the lexical parse for foxes could be fox +V +3Sg ("That trickster foxes me every time") as well as fox +N +Pl ("I saw two foxes yesterday").
Finite-State Transducers (cont.)
For ambiguous cases of this sort, the transducer is not capable of deciding. Disambiguation requires some external evidence, such as the surrounding words; disambiguation algorithms are discussed in Ch. 5 and Ch. 20 of the textbook. In the absence of such external evidence, the best an FST can do is enumerate all possible choices, transducing fox^s# into both fox +V +3Sg and fox +N +Pl.
FSTs and Ambiguity
However, there is a kind of ambiguity that FSTs do need to handle: the local ambiguity that occurs during the process of parsing. Example: parsing the input verb assess. After seeing ass, the E-insertion transducer may propose that the e that follows is inserted by the spelling rule; as far as the FST is concerned, we might have been parsing the word asses. It is not until we fail to see the # after asses, and instead run into another s, that we realize we have gone down an incorrect path.
FSTs and Ambiguity
Because of this nondeterminism, FST parsing algorithms need to incorporate some sort of search algorithm. Homework: Exercise 3.7 asks you to modify the algorithm for nondeterministic FSA recognition to do FST parsing. Note that many possible spurious segmentations of the input, such as parsing assess as ^a^s^ses^s, will be ruled out, since no entry in the lexicon matches such a string.
FSTs and Cascading
Running a cascade of a number of FSTs can be unwieldy (unwieldy: not readily handled or managed in use or action, as from size, shape, or weight; awkward). Using the composition property of FSTs, we can instead compose a single more complex transducer from a cascade of transducers run in series.
FSTs and Automaton Intersection
Transducers run in parallel can be combined by automaton intersection. The automaton intersection algorithm takes the Cartesian product of the states: for each state qi in machine 1 and state qj in machine 2, we create a new state qij. Then, for any input symbol a, if machine 1 would transition to state qn and machine 2 would transition to state qm, the new transducer transitions to state qnm. The figure on the next slide sketches how the intersection (∧) and composition (∘) processes might be carried out.
Intersection and Composition of Transducers
Intersection and Composition of Transducers
Since the rules are numerous, rule-to-FST compilers exist, and it is almost never necessary in practice to write an FST by hand. Kaplan and Kay (1994) give the mathematics that define the mapping from rules to two-level relations; Antworth (1990) gives details of the algorithms for rule compilation; Mohri (1997) gives algorithms for transducer minimization and determinization.
Lexicon-Free FSTs: The Porter Stemmer
Lexicon-Free FSTs: The Porter Stemmer
Building FSTs from a lexicon plus rules is the standard algorithm for morphological parsing. However, there are simpler algorithms that do not require the large online lexicon demanded by the standard algorithm.
The Porter Stemmer
These simpler algorithms are especially used in information retrieval (IR) tasks like web search, where a query is a Boolean combination of relevant keywords or phrases: marsupial OR kangaroo OR koala. Since a document containing the word marsupials might not match the keyword marsupial, some IR systems first run a stemmer on the query and document words. Morphological information in IR is thus used only to determine that two words have the same stem; the suffixes are thrown away.
Porter (1980) Stemming Algorithm
The Porter stemming algorithm is based on a series of simple cascaded rewrite rules. Since cascaded rewrite rules can easily be implemented as an FST, the Porter algorithm can be thought of as a lexicon-free FST stemmer (Homework: Exercise 3.6). The algorithm contains rules like these:
- ATIONAL → ATE (e.g., relational → relate)
- ING → ε if the stem contains a vowel (e.g., motoring → motor)
- SSES → SS (e.g., grasses → grass)
A toy fragment of this cascade in code follows below.
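A toy fragment of such a cascade in Perl (just the three rules above, applied in order; the full algorithm has several rule steps and a more careful measure of the stem):

#!/usr/bin/perl
use strict; use warnings;

# Three Porter-style rewrite rules applied as a cascade.
sub stem_fragment {
    my ($w) = @_;
    $w =~ s/ational$/ate/;                     # relational -> relate
    $w =~ s/sses$/ss/;                         # grasses    -> grass
    $w =~ s/ing$// if $w =~ /[aeiou].*ing$/;   # motoring   -> motor (vowel in stem)
    return $w;
}

print join(" ", map { stem_fragment($_) } qw(relational grasses motoring king)), "\n";
# prints: relate grass motor king   (king keeps -ing: no vowel before it)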
Porter (1980) Stemming Algorithm
Not all IR engines use stemming, partly because of stemmer errors such as these noted by Krovetz:

Errors of Commission           Errors of Omission
organization → organ           European → Europe
doing → doe                    analysis → analyzes
generalization → generic       matrices → matrix
numerical → numerous           noise → noisy
policy → police                sparse → sparsity
Word and Sentence Tokenization
Word and Sentence Tokenization
So far we have focused on the problem of segmenting words into morphemes. A related problem is segmenting running text into words and sentences: tokenization.
Word and Sentence Tokenization
Word tokenization may seem very simple in a language like English, which separates words by a special 'space' character. However, not every language does this (e.g., Chinese, Japanese, and Thai).
Why Segmentation Is a Problem
The whitespace character is not sufficient by itself. Consider:
  Mr. Sherwood said reaction to Sea Containers' proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
  "I said, 'what're you? Crazy?' " said Sadowsky. "I can't afford to do that."
Segmenting purely on whitespace would produce words like these: cents. said, positive." Crazy?
Why Segmentation Is a Problem (cont.)
In speech recognition systems the whitespace character is not produced by the recognizer: tokens are phonetic labels (e.g., AE), not orthographic ones, and the label sequence may contain missing phones (recognizer deletions), non-existent phones (recognizer insertions), and erroneous phone labels (recognizer substitutions).
Segmentation Errors
Segmentation errors produced by using only whitespace could be addressed by treating punctuation, in addition to whitespace, as a word boundary. Punctuation, however, often occurs word-internally: m.p.g., Ph.D., AT&T, cap'n, 01/02/06, google.com, etc. Also, if we want 62.5 in the previous example to be a word, we must avoid segmenting at every period. Number expressions introduce additional problems, whether expressed in numeric form or as words: 777,777.77; seven hundred and seventy seven dollars and seventy seven cents; 9/15/2007.
Segmentation Errors (cont.)
Languages differ in their punctuation styles for numbers and dates. Continental European languages use a comma to mark the decimal "point" and spaces (or sometimes periods) where English puts commas: 777 777,77 or 777.777,77 for English 777,777.77.
Other Tokenizer Tasks
Expansion of clitic contractions that are marked by apostrophes:
- what're → what are
- we're → we are
Other Tokenizer Tasks (cont.)
Complications: apostrophes are ambiguous, since they are also used as genitive markers ("the book's cover" or "Containers'" in the earlier segmentation example) and as quotative markers (" 'what're you? Crazy?' ").
Other Tokenizer Tasks
Multi-word expressions: New York, rock 'n' roll, etc. Handling these requires a multiword-expression dictionary; a tokenizer must also detect names, dates, and organization names — named entity detection (Ch. 22).
Other Tokenizer Tasks (cont.)
Sentence segmentation is a crucial first step in text processing. Segmenting a text into sentences is generally based on punctuation: sentence boundaries are usually marked with ".", "?", or "!". "?" and "!" are relatively unambiguous, but "." is not. Example: the sentence-boundary "." vs. the period in "Mr." or "Inc."; one of the previous examples used "Inc." to mark both an abbreviation and a sentence boundary. Consequently, word tokenization and sentence tokenization tend to be addressed jointly.
Sentence Tokenization
Typically, sentence tokenization methods work by building a binary classifier — based on a sequence of rules or on some machine learning algorithm (introduced in subsequent chapters) — that decides whether a period is part of the word or is a sentence-boundary marker. An abbreviation dictionary is useful for determining whether a period is attached to a commonly used abbreviation. A useful first step in sentence tokenization can be taken via a sequence of regular expressions; a sketch of such a rule-based splitter follows below, and the first part of this algorithm, word tokenization, is presented on the next slide as a Perl script.
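A first-pass sentence splitter along these lines (hedged: the abbreviation list is a toy, and real systems use the classifiers noted above):

#!/usr/bin/perl
use strict; use warnings;

# A period/question mark/exclamation point ends a sentence unless the
# token is a known abbreviation. Toy abbreviation list for illustration.
my %abbr = map { $_ => 1 } qw(Mr. Dr. Inc. Co.);

sub split_sentences {
    my ($text) = @_;
    my @sents;
    my $cur = "";
    for my $tok (split /\s+/, $text) {
        $cur .= ($cur ? " " : "") . $tok;
        if ($tok =~ /[.?!]["']?$/ && !$abbr{$tok}) {   # boundary punctuation
            push @sents, $cur;
            $cur = "";
        }
    }
    push @sents, $cur if $cur ne "";
    return @sents;
}

print "$_\n" for split_sentences(
    "Mr. Sherwood said reaction has been positive. Sea Containers closed at \$62.625, up 62.5 cents.");
# prints the two sentences on separate lines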
Word Tokenization with Perl
#!/usr/bin/perl
$letternumber = "[A-Za-z0-9]";
$notletter = "[^A-Za-z0-9]";
$alwayssep = "[\\?!()\";/\\|']";
$clitic = "('|:|-|'S|'D|'M|'LL|'RE|'VE|N'T|'s|'d|'m|'ll|'re|'ve|n't)";
$abbr{"Co."} = 1; $abbr{"Dr."} = 1; $abbr{"Jan."} = 1; $abbr{"Feb."} = 1;
while (<>) {
    # put whitespace around unambiguous separators
    s/$alwayssep/ $& /g;
    # put whitespace around commas that aren't inside numbers
    s/([^0-9]),/$1 , /g;
    s/,([^0-9])/ , $1/g;
    # distinguish single quotes from apostrophes by
    # segmenting off single quotes not preceded by a letter
    s/^'/$& /g;
    s/($notletter)'/$1 '/g;
    # segment off unambiguous word-final clitics and punctuation
    s/$clitic$/ $&/g;
    s/$clitic($notletter)/ $1 $2/g;
    # now deal with periods: for each possible word
    @possiblewords = split(/\s+/, $_);
    foreach $word (@possiblewords) {
        # if it ends in a period,
        if (($word =~ /$letternumber\./)
            && !($abbr{$word})   # and isn't on the abbreviation list
            # and isn't a sequence of letters and periods (U.S.)
            # and doesn't resemble an abbreviation (no vowels: Inc.)
            && !($word =~ /^([A-Za-z]\.([A-Za-z]\.)+|[A-Z][bcdfghj-nptvxz]+\.)$/)) {
            # then segment off the period
            $word =~ s/\.$/ \./;
        }
        # expand clitics
        $word =~ s/'ve/have/;
        $word =~ s/'m/am/;
        print $word, " ";
    }
    print "\n";
}
Software Tools
- Perl sentence-parsing module: Sentence.pm
- Finite State Morphology (Beesley and Karttunen): book and accompanying software
Sentence Tokenization
The fact that a simple tokenizer can be built with such simple regular-expression patterns as those in the Perl example suggests that tokenizers can easily be implemented in FSTs. Such FSTs have indeed been built; Karttunen et al. (1996) and Beesley and Karttunen (2003) give descriptions of FST-based tokenizers.
Detecting and Correcting Spelling Errors
Detecting and Correcting Spelling Errors
Here we introduce the problem of detecting and correcting spelling errors. Since the standard algorithm for spelling error correction is probabilistic, we continue the spell-checking discussion in Ch. 5, after the probabilistic noisy channel model is introduced. The detection and correction of spelling errors is an integral part of modern word processors and search engines, of optical character recognition (OCR) — the automatic recognition of machine- or hand-printed characters — and of on-line handwriting recognition of printed or cursive handwriting.
Spell-Checking Problems in Increasing Order of Broadness
- Non-word error detection: detecting spelling errors that result in non-words (e.g., graffe for giraffe).
- Isolated-word error correction: correcting spelling errors that result in non-words (e.g., correcting graffe to giraffe), looking only at the word in isolation.
- Context-dependent error detection and correction: using the context to help detect and correct spelling errors even when they accidentally result in an actual word of English (real-word errors). These happen when typographical errors (insertion, deletion, transposition) accidentally produce a real word (e.g., there for three), or when the writer substitutes the wrong spelling of a homophone or near-homophone (e.g., dessert for desert, or piece for peace).
Non-Word Error Detection
Non-word error detection works by marking any word that is not found in a dictionary; a dictionary would not have a word entry graffe. Early research (Peterson 1986) suggested that such spelling dictionaries would need to be kept small, because large dictionaries contain rare words that resemble misspellings of other words (e.g., wont resembles a misspelling of won't, and veery of very). Later work showed that a larger dictionary proved more help than harm: the benefit of not marking rare words as errors outweighed the fact that some misspellings were hidden by real words (Damerau and Mays 1989). Modern spell-checking systems therefore tend to be based on large dictionaries. 1 December 2019 Veton Këpuska
149
Dictionary Implementation
The finite-state morphological parsers described throughout this chapter provide a technology for implementing such large dictionaries. Because it assigns a morphological parse to a word, an FST parser is inherently a word recognizer, and it can be turned into an even more efficient FSA word recognizer by using the projection operation to extract the lower-side language graph. Such FST dictionaries also have the advantage of representing productive morphology, like the English -s and -ed inflections described previously. This is important for dealing with new legitimate combinations of stems and inflections: a new stem can be added to the dictionary, and all its inflected forms are then easily recognized. This makes FST dictionaries especially powerful for spell-checking in morphologically rich languages, where a single stem can have tens or hundreds of possible surface forms. 1 December 2019 Veton Këpuska
150
Dictionary Implementation
FST dictionaries can help with non-word error detection. But what about error correction? Algorithms for isolated-word error correction operate by finding words that are the likely source of the erroneous form. Correcting the spelling error graffe requires: Searching through all possible candidates such as giraffe, graph, craft, grail, etc.; Selecting the best choice among them by computing a distance metric between each candidate source and the surface error. Intuitively, giraffe is a more likely source than grail for graffe. The most powerful way to capture this similarity intuition uses probability theory and will be discussed in Ch. 5. That approach, however, builds on the non-probabilistic minimum edit distance algorithm introduced next. 1 December 2019 Veton Këpuska
151
FST’s for Dictionaries
Minimum Edit Distance 1 December 2019 Veton Këpuska
152
Minimum Edit Distance Deciding which of two words is closer to some third word in spelling is a special case of the general problem of string distance. The distance between two strings is a measure of how alike they are. The class of algorithms that solves this problem is known as minimum edit distance. The minimum edit distance between two strings is the minimum number of editing operations: Insertions Deletions Substitutions needed to transform one string into the other. 1 December 2019 Veton Këpuska
153
Minimum Edit Distance The minimum edit distance between two strings can be represented as an alignment, built from the operations of Deletion (D), Insertion (I), and Substitution (S). To get a numeric value for the minimum edit distance we assign a particular cost or weight to each of these operations. The Levenshtein distance between two sequences uses the simplest weighting, in which each of the three operations has a cost of 1 (Levenshtein 1966); under this weighting the distance between intention and execution is 5. Another version weighs D and I with a cost of 1 and S with a cost of 2, making the same distance equal to 8. 1 December 2019 Veton Këpuska
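As a worked example, one minimal alignment of intention to execution is:

I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s

There is one deletion (d), one insertion (i), and three substitutions (s): with all operations costing 1 this gives the Levenshtein distance 1 + 1 + 3 × 1 = 5, and with substitutions costing 2 it gives 1 + 1 + 3 × 2 = 8.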
154
Minimum Edit Distance The minimum edit distance is computed by dynamic programming. Dynamic programming is the name for a class of algorithms introduced by Bellman (1957), more than 60 years ago. It applies a table-driven method to solve problems by combining solutions to sub-problems. This class includes the most commonly used algorithms in speech and language processing, such as: Viterbi and Forward (Ch. 6), and CYK and Earley (Ch. 13). 1 December 2019 Veton Këpuska
155
Distance Matrix 1 December 2019 Veton Këpuska
156
DP Algorithm Details A char-char matrix illustrates the alignment process. [Figure: a character-by-character alignment matrix; a cell (i, j) can be reached from (i-1, j-1), (i-1, j), or (i, j-1).] 1 December 2019 Veton Këpuska
157
DP Algorithm Details One way to find the path that yields the best match (i.e., the lowest global distance) between the test string and a given reference string is to evaluate all possible paths in the matrix. That approach is time consuming, because the number of possible paths is exponential in the length of the input. The matching process can be shortened by requiring that: A path cannot go backwards (i.e., to the left or down in the figure); A path must include an association for every character in the test string; and Local distance scores are combined by adding, to give a global distance. 1 December 2019 Veton Këpuska
158
DTW - DP Algorithm Details
DTW - DP Algorithm Details For the moment, assume that a path must include an association for every frame in the template and every frame in the utterance. Then, for a point (i, j) in the time-time matrix (where i indexes the utterance frame and j the template frame), the previous point must have been (i-1, j-1), (i-1, j), or (i, j-1) (because of the prohibition of going backward in time). [Figure: point (i, j) and its three possible predecessors (i-1, j-1), (i-1, j), and (i, j-1).] 1 December 2019 Veton Këpuska
159
DTW - DP Algorithm Details
DTW - DP Algorithm Details The principle of dynamic programming (DP) is that the next selected point on the path, at (i, j), comes from the one among (i-1, j-1), (i-1, j), and (i, j-1) that has the lowest distance. DP finds the lowest-distance path through the matrix while minimizing the amount of computation. The DP algorithm operates in a time-synchronous manner, considering each column of the time-time matrix in succession (which is equivalent to processing the utterance frame by frame). For a template of length N (corresponding to an N-row matrix), the maximum number of paths being considered at any time is N. A test utterance feature vector j is compared to all reference template features 1…N, generating a vector of corresponding local distances d(1, j), d(2, j), …, d(N, j). 1 December 2019 Veton Këpuska
160
DP Algorithm Details If D(i, j) is the global distance up to and including matrix point (i, j), and the local distance of matching character i of the test string with character j of the reference string is given by d(i, j), then D(i, j) = min [D(i-1, j-1), D(i-1, j), D(i, j-1)] + d(i, j). Given that D(1, 1) = d(1, 1) (the initial condition), we have the basis for an efficient recursive algorithm for computing D(i, j). The final global distance D(M, N) at the end of the path gives the overall lowest matching score of the template with the utterance, where M is the number of vectors of the utterance. The test string is then matched against the dictionary word with the lowest matching score, i.e., the minimum edit distance. 1 December 2019 Veton Këpuska
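The recurrence can be rendered directly as a recursive Perl sub (a sketch for illustration only, using the slide's boundary condition D(1,1) = d(1,1); the local distance d is assumed here to be 0 for matching characters and 1 otherwise):

use List::Util qw(min);

# D(i,j) = min[ D(i-1,j-1), D(i-1,j), D(i,j-1) ] + d(i,j)
# Exponential time without memoization -- exactly the inefficiency
# the following slides remove with a column-by-column table.
sub D {
    my ($i, $j, $test, $ref) = @_;
    my $d = substr($test, $i - 1, 1) eq substr($ref, $j - 1, 1) ? 0 : 1;
    return $d if $i == 1 && $j == 1;    # initial condition D(1,1) = d(1,1)
    my @prev;
    push @prev, D($i - 1, $j - 1, $test, $ref) if $i > 1 && $j > 1;
    push @prev, D($i - 1, $j,     $test, $ref) if $i > 1;
    push @prev, D($i,     $j - 1, $test, $ref) if $j > 1;
    return min(@prev) + $d;
}

print D(6, 7, "graffe", "giraffe"), "\n";   # prints 1 (the missing 'i')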
161
DP Algorithm Details The equation presented in the previous slide enforces the rule that the only directions in which a path can move from (i, j) in the time-time matrix are up, right, or diagonally up and right. Computationally, the equation is in a form that could be programmed recursively. However, unless the language is optimized for recursion, this method can be slow even for relatively small pattern sizes. Another method that is both quicker and requires less memory storage uses two nested "for" loops. This method needs only two arrays that hold adjacent columns of the time-time matrix. 1 December 2019 Veton Këpuska
162
DP Algorithm Details Referring to the figure below, note that the cells at (i, j) and (i, 0) have different possible originator cells: the path to (i, 0) can originate only from (i-1, 0), but the path to any other (i, j) can originate from the three standard locations. [Figure: previous column i-1 and current column i, showing cell (i, j) with predecessors (i-1, j), (i-1, j-1), and (i, j-1).] The algorithm to find the least global distance path is given on the next slide. 1 December 2019 Veton Këpuska
163
DP Algorithm Details
1. Calculate the global distance for the bottom-most cell of the left-most column, column 0. The global distance up to this cell is just its local distance. The calculation then proceeds upward in column 0: the global distance at each successive cell is the local distance for that cell plus the global distance to the cell below it. Column 0 is then designated the predCol (predecessor column).
2. Calculate the global distance to the bottom-most cell of the next column, column 1 (designated curCol, for current column): the local distance for that cell plus the global distance to the bottom-most cell of the predecessor column.
3. Calculate the global distance of the rest of the cells of curCol. For example, at cell (i, j) this is the local distance at (i, j) plus the minimum global distance at (i-1, j), (i-1, j-1), or (i, j-1).
4. curCol becomes predCol, and steps 2-3 are repeated until all columns have been calculated.
5. The minimum global distance is the value stored in the top-most cell of the last column.
1 December 2019 Veton Këpuska
164
DP Algorithm Pseudo Code
DP Algorithm Pseudo Code The pseudocode for this process is:

Calculate First Column Distances (Prev. Column)
for i = 1 to Number of Input Feature Vectors
    CurCol[0] = local distance at (i,0) + global cost at (i-1,0)
    for j = 1 to Number of Template Feature Vectors
        CurCol[j] = local distance at (i,j)
                    + minimal global cost at (i-1,j), (i-1,j-1), (i,j-1)
    end
    PrevCol = CurCol
end

1 December 2019 Veton Këpuska
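A Perl rendering of this pseudocode for strings (a sketch under the costs used later in this section: insertion and deletion cost 1, substitution cost 2; only two columns of the matrix are kept in memory):

use strict;
use warnings;
use List::Util qw(min);

sub min_edit_distance_2col {
    my ($test, $template) = @_;
    my @x = split //, $test;          # column index i
    my @y = split //, $template;      # row index j
    my @predCol = (0 .. scalar @y);   # column 0: distances from the empty prefix
    for my $i (1 .. @x) {
        my @curCol = ($predCol[0] + 1);     # cell (i,0) comes only from (i-1,0)
        for my $j (1 .. @y) {
            my $sub = $x[$i - 1] eq $y[$j - 1] ? 0 : 2;
            push @curCol, min($predCol[$j] + 1,         # from (i-1, j)
                              $predCol[$j - 1] + $sub,  # from (i-1, j-1)
                              $curCol[$j - 1] + 1);     # from (i, j-1)
        }
        @predCol = @curCol;           # curCol becomes predCol
    }
    return $predCol[-1];
}

print min_edit_distance_2col("intention", "execution"), "\n";   # prints 8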
165
DP Algorithm Details To perform recognition on an utterance, the algorithm is repeated for each template; the template that gives the lowest global matching score is picked as the most likely word. Note that the minimum global matching score for the test word needs to be compared against all dictionary candidates; the candidate words can then be sorted by minimum edit distance (DP score). 1 December 2019 Veton Këpuska
167
Distance Computation

function MIN-EDIT-DISTANCE(target, source) returns min-distance
    n ← LENGTH(target)
    m ← LENGTH(source)
    Create a distance matrix distance[n+1, m+1]
    # Initialize the zeroth row and column to be the distance from the empty string
    distance[0,0] = 0
    for each column i from 1 to n do
        distance[i,0] ← distance[i-1,0] + ins-cost(target[i])
    for each row j from 1 to m do
        distance[0,j] ← distance[0,j-1] + del-cost(source[j])
    for each column i from 1 to n do
        for each row j from 1 to m do
            distance[i,j] ← MIN(distance[i-1,j] + ins-cost(target[i]),
                                distance[i-1,j-1] + subst-cost(source[j], target[i]),
                                distance[i,j-1] + del-cost(source[j]))
    return distance[n,m]

1 December 2019 Veton Këpuska
168
Minimum Edit Distance Algorithm
The minimum edit distance algorithm, an example of the class of dynamic programming algorithms, is shown in the previous slide. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1) or can be specific to the letter (to model the fact that some letters are more likely to be inserted than others). We assume that there is no cost for substituting a letter for itself (i.e., subst-cost(x, x) = 0). 1 December 2019 Veton Këpuska
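A Perl sketch of MIN-EDIT-DISTANCE with the costs supplied as subs, so they can be fixed or made letter-specific as just described (here the fixed costs of the next slide's example: ins = del = 1, subst = 2, and subst(x, x) = 0):

use strict;
use warnings;
use List::Util qw(min);

sub ins_cost   { 1 }
sub del_cost   { 1 }
sub subst_cost { my ($x, $y) = @_; $x eq $y ? 0 : 2 }

sub min_edit_distance {
    my ($target, $source) = @_;
    my @t = split //, $target;
    my @s = split //, $source;
    my @d;
    $d[0][0] = 0;
    # zeroth row and column: distance from the empty string
    $d[$_][0] = $d[$_ - 1][0] + ins_cost($t[$_ - 1]) for 1 .. @t;
    $d[0][$_] = $d[0][$_ - 1] + del_cost($s[$_ - 1]) for 1 .. @s;
    for my $i (1 .. @t) {
        for my $j (1 .. @s) {
            $d[$i][$j] = min($d[$i - 1][$j]     + ins_cost($t[$i - 1]),
                             $d[$i - 1][$j - 1] + subst_cost($s[$j - 1], $t[$i - 1]),
                             $d[$i][$j - 1]     + del_cost($s[$j - 1]));
        }
    }
    return $d[scalar @t][scalar @s];
}

print min_edit_distance("execution", "intention"), "\n";   # prints 8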
169
Example of Distance Matrix Computation (Cost = 1 for INS & DEL, Cost = 2 for SUB)
N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
The completed distance matrix for INTENTION (rows, read bottom to top) versus EXECUTION (columns, read left to right); the top-right cell holds the minimum edit distance, 8. 1 December 2019 Veton Këpuska
170
Alignment The DP algorithm can also be used to find the minimum-cost alignment between two strings, which is useful in speech and language processing generally, in speech recognition, and in machine translation. 1 December 2019 Veton Këpuska
171
Alignment Path 1 December 2019 Veton Këpuska
172
Example of Distance Matrix Computation (Cost = 1 for INS & DEL, Cost = 2 for SUB)
The same distance matrix as in the previous example, with the cells of the minimum-cost alignment path highlighted: starting from (#, #) the path passes through the cell values 0, 1, 3, 5, 5, 6, 8, 8, 8, 8, 8, ending in the top-right cell with the minimum edit distance 8. 1 December 2019 Veton Këpuska
173
Back-tracking/Backtrace
Requires a minimal augmentation of original algorithm by storing in each cell of the matrix information form which cell the minimal score was derived. Homework: Exercise Write a complete DP algorithm to match two strings by finding minimal edit distance and the best matching path. You can use: Matlab, C++, C, C# perl, php, … 1 December 2019 Veton Këpuska
174
Publicly Available Packages
UNIX diff (available on Windows via the Cygwin package): sclite and other software tools for speech & language processing from NIST: 1 December 2019 Veton Këpuska
175
Human Morphological Processing
176
Human Morphological Processing
A brief survey of psycholinguistic studies on how multi-morphemic words are represented in the minds of speakers of English. Example: walk → walks and walked; happy → happily and happiness. Are all three forms in the human lexicon? Or just walk plus the affixes -ed and -s, and happy plus the rules -y → -ily and -y → -iness? 1 December 2019 Veton Këpuska
177
Human Morphological Processing
Two extreme possibilities: Full listing proposes that all words of a language are listed in the mental lexicon without any internal morphological structure; this hypothesis is certainly untenable for morphologically complex languages like Turkish. The minimum redundancy hypothesis suggests that only the constituent morphemes are represented in the lexicon, and that when processing walks (whether for reading, listening, or talking) we must access both morphemes (walk and -s) and combine them. 1 December 2019 Veton Këpuska
178
Evidence of Human Lexicon
The earliest evidence comes from speech errors, also called slips of the tongue. In conversational speech, speakers often mix up the order of words or sounds: If you break it it'll drop. In the slips of the tongue collected by Fromkin and Ratner (1998) and Garrett (1975), inflectional and derivational affixes can appear separately from their stems: It's not only us who have screw looses (for "screws loose"); Words of rule formation (for "rules of word formation"); Easy enoughly (for "easily enough"). ⇒ This suggests that the mental lexicon contains some representation of morphological structure. 1 December 2019 Veton Këpuska
179
Evidence of Human Lexicon
More recent experimental evidence suggests that neither the full listing nor the minimum redundancy hypothesis may be completely true. Stanners et al. (1979) provide evidence that: Derived words like happily and happiness are stored separately from their stem happy; Regularly inflected forms like pouring are not distinct in the lexicon from their stems like pour. 1 December 2019 Veton Këpuska
180
Evidence of Human Lexicon
Marslen-Wilson et al. (1994) result: derived words are linked to their stems only if they are semantically related. [Figure, after Marslen-Wilson et al. (1994): a fragment of the lexicon network with the stems depart and govern linked to affixes such as -s, -ing, -ure, and -al, while department appears as a separate entry because it is not semantically related to depart.] 1 December 2019 Veton Këpuska
181
Summary
182
Summary Morphology: the area of language processing dealing with the subparts of words. Finite-State Transducer: the computational device that is important for morphology. Applications: stemming, morphological parsing (stem + affixes), word and sentence tokenization, and spelling error detection. 1 December 2019 Veton Këpuska
183
Summary
Morphological parsing is the process of finding the constituent morphemes in a word (e.g., cat +N +PL for cats).
English mainly uses prefixes and suffixes to express inflectional and derivational morphology.
English inflectional morphology is relatively simple and includes person and number agreement (-s) and tense markings (-ed and -ing).
English derivational morphology is more complex and includes suffixes like -ation, -ness, and -able, as well as prefixes like co- and re-.
Many constraints on English morphotactics (allowable morpheme sequences) can be represented by finite automata.
Finite-state transducers are an extension of finite-state automata that can generate output symbols.
Important operations for FSTs include composition, projection, and intersection.
Finite-state morphology and two-level morphology are applications of finite-state transducers to morphological representation and parsing.
Spelling rules can be implemented as transducers; there are automatic transducer-compilers that can produce a transducer for any simple rewrite rule.
1 December 2019 Veton Këpuska
184
Summary
The lexicon and spelling rules can be combined by composing and intersecting various transducers.
The Porter algorithm is a simple and efficient way to do stemming, stripping off affixes. It is not as accurate as a transducer model that includes a lexicon, but may be preferable for applications like information retrieval, in which exact morphological structure is not needed.
Word tokenization can be done by simple regular expression substitutions or by transducers.
Spelling error detection is normally done by finding words which are not in a dictionary; an FST dictionary can be useful for this.
The minimum edit distance between two strings is the minimum number of operations it takes to edit one into the other. It can be computed by dynamic programming, which also results in an alignment of the two strings.
1 December 2019 Veton Këpuska
185
END 1 December 2019 Veton Këpuska