Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011.

1 Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011

2 Announcements Wednesday online GP meeting scheduling Seminar on Friday: Luke Zettlemoyer (CSE) Automatic grammar induction Treehouse Friday: Classifiers – Memory Lane

3 Roadmap
- Motivation: FST applications
- FST perspectives
- FSTs and Regular Relations
- FST Operations

4 FSTs
- Finite automaton that maps between two strings
- Automaton with two labels per arc, input:output

5 FST Applications
Tokenization, segmentation, morphological analysis, transliteration, parsing, translation, speech recognition, spoken language understanding, …

9 Approaches to FSTs
- FST as recognizer: takes a pair of input:output strings; accepts if the pair is in the language, otherwise rejects
- FST as generator: outputs pairs of strings in the language
- FST as translator: reads an input string and prints an output string
- FST as set relator: computes relations between sets
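The first three views can be tried on one toy machine. The following is a minimal sketch, not code from the lecture: the one-state transducer with a single a:b arc, the arc-table layout, and the names `translate`, `recognize`, and `generate` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy transducer: one state, one arc a:b, relation {(a^n, b^n)}.
# Arc table: (state, input symbol) -> list of (output symbol, next state).
ARCS = {(0, "a"): [("b", 0)]}
START, FINALS = 0, {0}
ALPHABET = ["a"]

def translate(x):
    """Translator view: read x and return the set of output strings."""
    configs = {(START, "")}  # (current state, output written so far)
    for sym in x:
        configs = {(nxt, out + o)
                   for state, out in configs
                   for o, nxt in ARCS.get((state, sym), [])}
    return {out for state, out in configs if state in FINALS}

def recognize(x, y):
    """Recognizer view: accept or reject the pair (x, y)."""
    return y in translate(x)

def generate(max_len):
    """Generator view: list accepted pairs whose input has at most max_len symbols."""
    pairs = []
    for n in range(max_len + 1):
        for letters in product(ALPHABET, repeat=n):
            x = "".join(letters)
            pairs.extend((x, y) for y in sorted(translate(x)))
    return pairs

print(recognize("aa", "bb"))  # True
print(generate(2))            # [('', ''), ('a', 'b'), ('aa', 'bb')]
```

The recognizer here is defined through the translator, which is only reasonable for a machine this small; a direct search over both strings is the usual approach when ε arcs are present.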

12 FSTs & Regular Relations
- FSAs: equivalent to regular languages
- FSTs: equivalent to regular relations (sets of pairs of strings)
- Regular relations:
  - For all (x, y) in Σ1 × Σ2, {(x, y)} is a regular relation
  - The empty set is a regular relation
  - If R1, R2 are regular relations, then R1 · R2, R1 ∪ R2 and R1* are regular relations

17 Regular Relation Closures
- By definition, regular relations are closed under (like regular languages):
  - Concatenation: R1 · R2
  - Union: R1 ∪ R2
  - Kleene *: R1*
- Unlike regular languages, they are NOT closed under:
  - Intersection: R1 = {(a^n b*, c^n)} and R2 = {(a* b^n, c^n)}; their intersection is {(a^n b^n, c^n)}, which is not regular
  - Difference
  - Complementation
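The intersection counterexample can be checked on a finite slice. The sketch below enumerates R1 and R2 up to a bound (the bound and the helper names `r1` and `r2` are assumptions for illustration) and confirms that the only shared pairs have the form (a^n b^n, c^n). The input side of that intersection is {a^n b^n}, which is not a regular language, whereas the projection of any regular relation is regular, so the intersection cannot be a regular relation.

```python
def r1(bound):
    """Finite slice of R1 = {(a^n b*, c^n)}: n a's, any number of b's, n c's."""
    return {("a" * n + "b" * k, "c" * n)
            for n in range(bound + 1) for k in range(bound + 1)}

def r2(bound):
    """Finite slice of R2 = {(a* b^n, c^n)}: any number of a's, n b's, n c's."""
    return {("a" * k + "b" * n, "c" * n)
            for n in range(bound + 1) for k in range(bound + 1)}

bound = 4
both = r1(bound) & r2(bound)
expected = {("a" * n + "b" * n, "c" * n) for n in range(bound + 1)}
print(both == expected)  # True
```

A bounded enumeration illustrates the claim but does not prove it; the proof rests on the non-regularity of {a^n b^n}.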

21 Regular Relation Closures
- Regular relations are also closed under:
  - Composition
  - Inversion
- Operations:
  - Projection
  - Identity & cross-product of regular languages
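On finite relations these operations reduce to a few set comprehensions. The following is a sketch under the assumption that a relation is a Python set of (input, output) string pairs; the function names and the sample data are invented for illustration and do not come from the slides.

```python
def compose(r1, r2):
    """Pairs (x, z) such that r1 relates x to some y and r2 relates that y to z."""
    return {(x, z) for x, y in r1 for y2, z in r2 if y == y2}

def invert(r):
    """Swap the input and output side of every pair."""
    return {(y, x) for x, y in r}

def project(r, side):
    """Project onto the input side (side=0) or the output side (side=1)."""
    return {pair[side] for pair in r}

def identity(lang):
    """Identity relation of a language: every word maps to itself."""
    return {(w, w) for w in lang}

def cross_product(l1, l2):
    """Relate every word of l1 to every word of l2."""
    return {(x, y) for x in l1 for y in l2}

# Hypothetical sample relations: case folding, then tagging.
fold = {("He", "he"), ("Said", "said")}
tag = {("he", "PRO"), ("said", "VERB")}
print(sorted(compose(fold, tag)))  # [('He', 'PRO'), ('Said', 'VERB')]
print(sorted(project(fold, 1)))    # ['he', 'said']
```

Real regular relations are usually infinite, so toolkits apply the same operations to the transducers rather than to enumerated pairs.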

26 FST Formal Definition
A Finite-State Transducer is a 6-tuple (Q, Σ, Γ, I, F, δ):
- A finite set of states: Q
- A finite set of input symbols: Σ
- A finite set of output symbols: Γ
- A finite set of initial states: I
- A finite set of final states: F
- A set of transition relations between states: δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q
FSAs are a special case of FSTs
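The definition maps directly onto a small record type. This is an illustrative sketch only: the class name `FST`, its field names, the use of the empty string for ε, and the helper `fsa_to_fst` are assumptions. The helper shows one common way to read "FSAs are a special case": every FSA arc labelled a becomes an FST arc labelled a:a.

```python
from dataclasses import dataclass

EPS = ""  # assumption: the empty string stands for epsilon

@dataclass(frozen=True)
class FST:
    states: frozenset        # Q
    in_alphabet: frozenset   # Sigma
    out_alphabet: frozenset  # Gamma
    initial: frozenset       # I
    final: frozenset         # F
    delta: frozenset         # set of (src, in_sym, out_sym, dst) tuples

    def is_well_formed(self):
        """Check I and F are subsets of Q and every arc uses declared states and symbols."""
        return (self.initial <= self.states
                and self.final <= self.states
                and all(p in self.states and q in self.states
                        and i in self.in_alphabet | {EPS}
                        and o in self.out_alphabet | {EPS}
                        for p, i, o, q in self.delta))

def fsa_to_fst(states, alphabet, initial, final, transitions):
    """Embed an FSA as an FST whose arcs copy each symbol (a becomes a:a)."""
    return FST(frozenset(states), frozenset(alphabet), frozenset(alphabet),
               frozenset(initial), frozenset(final),
               frozenset((p, a, a, q) for p, a, q in transitions))

t = fsa_to_fst({0, 1}, {"a"}, {0}, {1}, {(0, "a", 1)})
print(t.is_well_formed())  # True
```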

28 FST Operations Union: Concatenation:

30 FST Operations
- Inversion: switching input and output labels
  - If T maps from I to O, T^-1 maps from O to I
- Composition:
  - If T1 is a transducer from I1 to O2 and T2 is a transducer from O2 to O3, then T1 ∘ T2 is a transducer from I1 to O3
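Both operations can also be carried out on the machines themselves. The sketch below assumes ε-free transducers represented as sets of (source, input, output, destination) arcs; the product construction shown for composition is the simplified ε-free version, and all names and example machines are hypothetical.

```python
def invert(arcs):
    """Inversion: swap the input and output label on every arc."""
    return {(p, o, i, q) for p, i, o, q in arcs}

def compose(arcs1, arcs2):
    """Epsilon-free product construction.

    A composed arc exists whenever the output label of a T1 arc matches
    the input label of a T2 arc; composed states are pairs of states.
    """
    return {((p1, p2), a, c, (q1, q2))
            for p1, a, b, q1 in arcs1
            for p2, b2, c, q2 in arcs2
            if b == b2}

def outputs(arcs, start, finals, x):
    """Run an epsilon-free transducer on x and collect its output strings."""
    configs = {(start, "")}
    for sym in x:
        configs = {(q, out + o)
                   for state, out in configs
                   for p, i, o, q in arcs
                   if p == state and i == sym}
    return {out for state, out in configs if state in finals}

T1 = {(0, "a", "b", 0)}  # hypothetical: maps a^n to b^n
T2 = {(0, "b", "c", 0)}  # hypothetical: maps b^n to c^n
T12 = compose(T1, T2)
print(outputs(T12, (0, 0), {(0, 0)}, "aa"))   # {'cc'}
print(outputs(invert(T1), 0, {0}, "bb"))      # {'aa'}
```

Handling ε arcs in composition needs extra care to avoid redundant paths, which this sketch deliberately leaves out.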

35 FST Examples
- R(T) = {(ε,ε), (a,b), (aa,bb), (aaa,bbb), …}
- R(T) = {(a,x), (ab,xy), (abb,xyy), …}
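The transducer diagrams did not survive in this transcript, so the machine below is a guess at one transducer that yields the second relation: an a:x arc from state 0 to state 1, followed by a b:y loop on state 1. The sketch enumerates the accepted pairs by following arcs breadth-first; the arc list and the function name `pairs` are assumptions.

```python
# Hypothetical machine for R(T) = {(a,x), (ab,xy), (abb,xyy), ...}.
ARCS = [(0, "a", "x", 1), (1, "b", "y", 1)]  # (src, input, output, dst)
START, FINALS = 0, {1}

def pairs(max_arcs):
    """List the pairs accepted along paths of at most max_arcs arcs."""
    found = []
    frontier = [(START, "", "")]  # (state, input so far, output so far)
    for _ in range(max_arcs + 1):
        found += [(x, y) for state, x, y in frontier if state in FINALS]
        frontier = [(q, x + i, y + o)
                    for state, x, y in frontier
                    for p, i, o, q in ARCS if p == state]
    return found

print(pairs(3))  # [('a', 'x'), ('ab', 'xy'), ('abb', 'xyy')]
```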

38 FST Application Examples
- Case folding: He said → he said
- Tokenization: “He ran.” → “ He ran. ”
- POS tagging: They can fish → PRO VERB NOUN

39 FST Application Examples
- Pronunciation: B AH T EH R → B AH DX EH R
- Morphological generation: Fox s → Foxes
- Morphological analysis: cats → cat s
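The pronunciation example looks like flapping: a T between two vowels surfaces as DX. The sketch below hand-codes that rule as a small three-state machine that holds back a T until it has seen the next phone. The rule's exact conditioning, the partial vowel inventory, and the function name `flap` are assumptions; the slides give only the single input/output pair.

```python
VOWELS = {"AH", "EH", "IY", "AA"}  # hypothetical, partial vowel inventory

def flap(phones):
    """Rewrite T as DX when it stands between two vowels."""
    out = []
    state = "start"  # one of "start", "after_vowel", "held_t"
    for p in phones:
        if state == "held_t":
            # The right context is now known, so decide the held T.
            out.append("DX" if p in VOWELS else "T")
            state = "start"
        if p == "T" and state == "after_vowel":
            state = "held_t"  # emit nothing yet (an epsilon output)
            continue
        out.append(p)
        state = "after_vowel" if p in VOWELS else "start"
    if state == "held_t":
        out.append("T")  # end of input: flush the held T unchanged
    return out

print(" ".join(flap("B AH T EH R".split())))  # B AH DX EH R
```

Delaying the output until the right-hand context is visible is what lets a deterministic left-to-right machine apply a rule that depends on the following symbol.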

45 FST Algorithms
- Recognition: is a given string pair (x, y) accepted by the FST? (x, y) → yes/no
- Composition: given a pair of transducers T1 and T2, create a new transducer T1 ∘ T2
- Transduction: given an input string and an FST, compute the output string. x → y
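Recognition can be written as a search over configurations (state, position in x, position in y). The sketch below is one way to do it, not the algorithm from the lecture: arcs are (source, input, output, destination) tuples, the empty string stands for ε, and the `seen` set guarantees termination even with ε loops because only finitely many configurations exist. All names and the example machine are assumptions.

```python
def accepts(arcs, start, finals, x, y):
    """Return True if the transducer relates input x to output y."""
    stack = [(start, 0, 0)]
    seen = set(stack)
    while stack:
        state, i, j = stack.pop()
        if state in finals and i == len(x) and j == len(y):
            return True
        for src, a, b, dst in arcs:
            if src != state:
                continue
            # An epsilon label ("") matches without consuming anything.
            if not x.startswith(a, i) or not y.startswith(b, j):
                continue
            nxt = (dst, i + len(a), j + len(b))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

# Hypothetical machine: delete every a (a:epsilon) and copy every b (b:b).
DROP_A = {(0, "a", "", 0), (0, "b", "b", 0)}
print(accepts(DROP_A, 0, {0}, "abab", "bb"))  # True
print(accepts(DROP_A, 0, {0}, "abab", "b"))   # False
```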

49 WFST Definition
A probabilistic finite-state transducer consists of the FST components plus three weight functions:
- A finite set of states: Q
- A finite set of input symbols: Σ
- A finite set of output symbols: Γ
- A finite set of initial states: I
- A finite set of final states: F
- A set of transitions: δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q
- Initial state probabilities: Q → R+
- Transition probabilities: δ → R+
- Final state probabilities: Q → R+
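With the three weight functions in place, the weight of a pair (x, y) is the sum, over all accepting paths, of initial weight × transition weights × final weight. The sketch below computes that sum with a forward pass, assuming an ε-free machine whose input and output are aligned token by token. The tagging machine and every number in it are made up for illustration.

```python
def pair_weight(arcs, initial, final, xs, ys):
    """Total weight of all paths relating token list xs to token list ys."""
    if len(xs) != len(ys):
        return 0.0  # epsilon-free assumption: one output token per input token
    weights = dict(initial)  # state -> summed weight of paths reaching it
    for a, b in zip(xs, ys):
        nxt = {}
        for (src, i, o, dst), w in arcs.items():
            if i == a and o == b and src in weights:
                nxt[dst] = nxt.get(dst, 0.0) + weights[src] * w
        weights = nxt
    return sum(w * final.get(q, 0.0) for q, w in weights.items())

# Hypothetical weighted tagger for "they can fish"; all weights are invented.
ARCS = {
    (0, "they", "PRO", 1): 1.0,
    (1, "can", "VERB", 2): 0.7,
    (1, "can", "NOUN", 2): 0.3,
    (2, "fish", "NOUN", 3): 0.6,
    (2, "fish", "VERB", 3): 0.4,
}
INITIAL = {0: 1.0}
FINAL = {3: 1.0}

words = ["they", "can", "fish"]
print(pair_weight(ARCS, INITIAL, FINAL, words, ["PRO", "VERB", "NOUN"]))
```

Under these invented weights the PRO VERB NOUN reading scores 1.0 × 1.0 × 0.7 × 0.6 × 1.0, higher than the PRO NOUN VERB reading at 0.3 × 0.4.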

53 Summary
- FSTs
  - Equivalent to regular relations
  - Transduce strings to strings
  - Useful for a range of applications
- Closed under union, concatenation, Kleene *, inversion, composition
- Project to FSAs
- Not closed under intersection, complementation, difference
- Algorithms: recognition, composition, transduction

54 Morphology and FSTs

55 Roadmap
- Motivation: Representing words
- A little (mostly English) Morphology
- Stemming
- FSTs & Morphology
- FSTs & Phonology

59 Surface Variation & Morphology
- Searching (a la Google) for documents about: Televised sports
- Many possible surface forms:
  - Televised, television, televise, …
  - Sports, sport, sporting, …
- How can we match?
  - Convert surface forms to a common base form
  - Stemming or morphological analysis

64 The Lexicon
- Goal: Represent all the words in a language
- Approach?
  - Enumerate all words?
    - Doable for English
    - Typical for ASR (Automatic Speech Recognition)
    - English is morphologically relatively impoverished
  - Other languages?
    - Wildly impractical
    - Turkish: 40,000 forms/verb
    - uygarlaştıramadıklarımızdanmışsınızcasına: “(behaving) as if you are among those whom we could not civilize”

71 Morphological Parsing
- Goal: Take a surface word form and generate a linguistic structure of component morphemes
- A morpheme is the minimal meaning-bearing unit in a language.
  - Stem: the morpheme that forms the central meaning unit in a word
  - Affix: prefix, suffix, infix, circumfix
    - Prefix: e.g., possible → impossible
    - Suffix: e.g., walk → walking
    - Infix: e.g., hingi → humingi (Tagalog)
    - Circumfix: e.g., sagen → gesagt (German)

76 Two Perspectives
- Stemming:
  - writing → write (or writ)
  - Beijing → Beije
- Morphological Analysis:
  - writing → write+V+prog
  - cats → cat+N+pl
  - writes → write+V+3rdpers+Sg

80 Ambiguity in Morphology
- Alternative analyses:
  - Flies → fly+N+Pl
  - Flies → fly+V+3rdpers+Sg
  - Saw → see+V+past
  - Saw → saw+N
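Because one surface form can have several analyses, an analyzer has to return a list rather than a single answer. The sketch below is a lookup-based toy, not an FST and not the course's implementation: the five-word lexicon, the irregular-form table, and the two spelling rules are all assumptions chosen to reproduce the analyses listed on the slides.

```python
# Hypothetical mini-lexicon: stem -> set of parts of speech.
LEXICON = {"fly": {"N", "V"}, "cat": {"N"}, "saw": {"N"},
           "see": {"V"}, "write": {"V"}}
# Hypothetical table of irregular forms.
IRREGULAR = {"saw": ["see+V+past"]}

def candidate_stems(word):
    """Undo -s spelling: flies -> fly, cats -> cat, writes -> write."""
    stems = []
    if word.endswith("ies"):
        stems.append(word[:-3] + "y")
    if word.endswith("s"):
        stems.append(word[:-1])
    return stems

def analyze(word):
    """Return every analysis the toy lexicon licenses for a surface form."""
    word = word.lower()
    analyses = list(IRREGULAR.get(word, []))
    for pos in sorted(LEXICON.get(word, ())):
        analyses.append(f"{word}+{pos}")
    for stem in candidate_stems(word):
        for pos in sorted(LEXICON.get(stem, ())):
            # The same -s suffix marks noun plural or verb 3rd person singular.
            tag = "+N+Pl" if pos == "N" else "+V+3rdpers+Sg"
            analyses.append(stem + tag)
    return analyses

print(analyze("Flies"))  # ['fly+N+Pl', 'fly+V+3rdpers+Sg']
print(analyze("Saw"))    # ['see+V+past', 'saw+N']
```

Choosing among the returned analyses needs context, which is outside what this lookup does.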

83 Multi-linguality in Morphology
- Morphologically impoverished languages, e.g. English
- Isolating languages, e.g. Chinese
- Morphologically rich languages, e.g. Turkish

87 Combining Morphemes
- Inflection: stem + gram. morpheme → same class
  - E.g.: help + ed → helped
- Derivation: stem + gram. morpheme → new class
  - E.g.: walk + er → walker (N)
- Compounding: multiple stems → new word
  - E.g.: doghouse, catwalk, …
- Clitics: stem + clitic
  - I + ll → I’ll; he + is → he’s

