
1 MaxEnt POS Tagging: Shallow Processing Techniques for NLP, Ling570, November 21, 2011

2 Roadmap
MaxEnt POS tagging: features; beam search vs. Viterbi
Named entity tagging

3 MaxEnt Feature Template
Words: current word w0; previous word w-1; word two back w-2; next word w+1; word after next w+2
Tags: previous tag t-1; previous tag pair t-2 t-1
How many features? 5|V| + |T| + |T|^2

4 Representing Orthographic Patterns
How can we represent morphological patterns as features?
Character sequences. Which sequences? Prefixes/suffixes, e.g. suffix(wi)=ing or prefix(wi)=well
Specific characters or character types. Which? is-capitalized, is-hyphenated

5 MaxEnt Feature Set (table of feature templates shown as an image; not preserved in the transcript)
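
For reference, these binary features f_j plug into the standard conditional MaxEnt model, with one learned weight \lambda_j per feature:

    p(y \mid x) = \frac{\exp\left(\sum_j \lambda_j f_j(x, y)\right)}{Z(x)},
    \qquad Z(x) = \sum_{y'} \exp\left(\sum_j \lambda_j f_j(x, y')\right)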

6 Examples well-heeled: rare word

7 Examples well-heeled: rare word, tagged JJ
prevW=about:1 prev2W=stories:1 nextW=communities:1 next2W=and:1
pref=w:1 pref=we:1 pref=wel:1 pref=well:1
suff=d:1 suff=ed:1 suff=led:1 suff=eled:1
is-hyphenated:1 preT=IN:1 pre2T=NNS-IN:1
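
A minimal Python sketch of this feature extraction. The function name, the rare-word flag, and the exact feature-name strings mirror the slide's examples but are otherwise illustrative, not the course's actual implementation:

    def extract_features(words, tags, i, rare=True):
        """Binary MaxEnt features for position i; `tags` holds tags for positions < i."""
        w = words[i]
        feats = {}
        # Word-context features (w-1, w-2, w+1, w+2)
        if i >= 1: feats["prevW=" + words[i - 1]] = 1
        if i >= 2: feats["prev2W=" + words[i - 2]] = 1
        if i + 1 < len(words): feats["nextW=" + words[i + 1]] = 1
        if i + 2 < len(words): feats["next2W=" + words[i + 2]] = 1
        # Tag-context features (t-1 and the pair t-2 t-1)
        if i >= 1: feats["preT=" + tags[i - 1]] = 1
        if i >= 2: feats["pre2T=" + tags[i - 2] + "-" + tags[i - 1]] = 1
        if rare:
            # Orthographic features stand in for the identity of rare words
            for n in range(1, min(5, len(w))):  # prefixes/suffixes up to length 4
                feats["pref=" + w[:n]] = 1
                feats["suff=" + w[-n:]] = 1
            if "-" in w: feats["is-hyphenated"] = 1
            if w[:1].isupper(): feats["is-capitalized"] = 1
        else:
            feats["curW=" + w] = 1
        return feats

    # Reproduces the slide's feature set for "well-heeled":
    words = ["stories", "about", "well-heeled", "communities", "and"]
    print(extract_features(words, ["NNS", "IN"], 2))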

8-9 Finding Features
In training, where do features come from? Where do features come from in testing?
In testing, tag features come from the classification of prior words.

    x           w-1    w0     w-1 w0       w+1    t-1   y
    x1 (Time)   -      Time   -            flies  BOS   N
    x2 (flies)  Time   flies  Time flies   like   N     N
    x3 (like)   flies  like   flies like   an     N     V
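
A sketch of how such training rows could be generated from a tagged sentence. This is illustrative (the "EOS" padding value is an assumption); at training time the t-1 column comes from the gold tags, while at test time it must come from the tagger's own earlier predictions:

    def training_instances(words, gold_tags):
        """Yield one (feature dict, label) training row per word, as in the table."""
        for i, (w, y) in enumerate(zip(words, gold_tags)):
            feats = {
                "w-1": words[i - 1] if i > 0 else "BOS",
                "w0":  w,
                "w+1": words[i + 1] if i + 1 < len(words) else "EOS",
                "t-1": gold_tags[i - 1] if i > 0 else "BOS",
            }
            yield feats, y

    for feats, y in training_instances(["Time", "flies", "like", "an", "arrow"],
                                       ["N", "N", "V", "D", "N"]):
        print(feats, "->", y)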

10 Sequence Labeling

11 Goal: find the most probable labeling of a sequence
Many sequence labeling tasks: POS tagging, word segmentation, named entity tagging, story/spoken sentence segmentation, pitch accent detection, dialog act tagging

12 Solving Sequence Labeling

13 Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM

14 Solving Sequence Labeling Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM Via classification: Use classification algorithm Issue: What about tag features?

15 Solving Sequence Labeling Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM Via classification: Use classification algorithm Issue: What about tag features? Features that use class labels – depend on classification Solutions:

16 Solving Sequence Labeling Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM Via classification: Use classification algorithm Issue: What about tag features? Features that use class labels – depend on classification Solutions: Don’t use features that depend on class labels (loses info)

17 Solving Sequence Labeling Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM Via classification: Use classification algorithm Issue: What about tag features? Features that use class labels – depend on classification Solutions: Don’t use features that depend on class labels (loses info) Use other process to generate class labels, then use

18 Solving Sequence Labeling Direct: Use a sequence labeling algorithm E.g. HMM, CRF, MEMM Via classification: Use classification algorithm Issue: What about tag features? Features that use class labels – depend on classification Solutions: Don’t use features that depend on class labels (loses info) Use other process to generate class labels, then use Perform incremental classification to get labels, use labels as features for instances later in sequence
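
The third solution (incremental classification) in sketch form; here `classify` stands for an already-trained MaxEnt classifier returning the single best label, and `extract_features` is the earlier sketch:

    def greedy_tag(words, classify):
        """Left-to-right tagging: tag features come from earlier predictions."""
        tags = []
        for i in range(len(words)):
            feats = extract_features(words, tags, i)  # only tags[:i] are needed
            tags.append(classify(feats))              # commit to the best label
        return tags

Committing to a single label at each step is greedy; beam search, below, generalizes this by keeping the top k partial sequences instead of one.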

19 HMM Trellis (figure: trellis over "time flies like an arrow"; adapted from F. Xia)

20 Viterbi
Initialization: \delta_1(j) = \pi_j \, b_j(o_1)
Recursion: \delta_t(j) = \max_i \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t), with backpointer \psi_t(j) = \arg\max_i \delta_{t-1}(i) \, a_{ij}
Termination: P^* = \max_j \delta_T(j); the best tag sequence is read off by following the backpointers \psi from \arg\max_j \delta_T(j)

21 Viterbi trellis for "time flies like an arrow" (each cell: value and backpointer):

    tag   1      2 time       3 flies     4 like          5 an            6 arrow
    N     0      0.05 (BOS)   0.001 (N)   0               0               d_5(D)*P(N|D)*P(arrow|N) = 0.00000168 (D)
    V     0      0.01 (BOS)   0.007 (N)   0.00014 (N,V)   0               0
    P     0      0            0           0.00007 (V)     0               0
    D     0      0            0           0               0.0000168 (V)   0
    BOS   1.0    0            0           0               0               0
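
A compact Viterbi sketch that fills in a trellis like the one above, assuming toy transition and emission probability tables (dicts keyed by tag pairs and tag-word pairs; the data layout and names are illustrative):

    def viterbi(words, tags, trans, emit):
        """Return the best tag sequence and its probability for a bigram HMM."""
        # Initialization: delta[0][t] = P(t | BOS) * P(w1 | t)
        delta = [{t: trans.get(("BOS", t), 0.0) * emit.get((t, words[0]), 0.0)
                  for t in tags}]
        back = [{}]
        # Recursion: delta[i][t] = max over t' of delta[i-1][t'] * P(t | t') * P(w_i | t)
        for i in range(1, len(words)):
            delta.append({}); back.append({})
            for t in tags:
                prev, score = max(((tp, delta[i - 1][tp] * trans.get((tp, t), 0.0))
                                   for tp in tags), key=lambda x: x[1])
                delta[i][t] = score * emit.get((t, words[i]), 0.0)
                back[i][t] = prev
        # Termination: take the best final cell, then follow backpointers
        last = max(delta[-1], key=delta[-1].get)
        path = [last]
        for i in range(len(words) - 1, 0, -1):
            path.append(back[i][path[-1]])
        return path[::-1], delta[-1][last]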

22 Decoding Goal: Identify highest probability tag sequence

23 Decoding Goal: Identify highest probability tag sequence Issues: Features include tags from previous words Not immediately available

24 Decoding Goal: Identify highest probability tag sequence Issues: Features include tags from previous words Not immediately available Uses tag history Just knowing highest probability preceding tag insufficient

25 Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices

26 Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences?

27 Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences? All sequences?

28 Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences? All sequences? No. Why not?

29 Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences? All sequences? No. Why not? How many sequences?

30 Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences? All sequences? No. Why not? How many sequences? Branching factor: N (# tags); depth: T (# words), giving N^T sequences

31 Decoding Approach: Retain multiple candidate tag sequences Essentially search through tagging choices Which sequences? All sequences? No. Why not? How many sequences? Branching factor: N (# tags); depth: T (# words), giving N^T sequences Top K highest probability sequences
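
For concreteness: with the 45-tag Penn Treebank tagset and a 20-word sentence, N^T = 45^20 ≈ 10^33 candidate sequences, far too many to enumerate.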

32-37 Breadth-First Search (figure: the search tree over "time flies like an arrow", expanded word by word across six animation slides)

38 Breadth-first Search Is breadth-first search efficient?

39 Breadth-first Search Is it efficient? No, it tries everything

40 Beam Search Intuition: Breadth-first search explores all paths

41 Beam Search Intuition: Breadth-first search explores all paths Lots of paths are (pretty obviously) bad Why explore bad paths?

42 Beam Search Intuition: Breadth-first search explores all paths Lots of paths are (pretty obviously) bad Why explore bad paths? Restrict to (apparently best) paths Approach: Perform breadth-first search, but

43 Beam Search Intuition: Breadth-first search explores all paths Lots of paths are (pretty obviously) bad Why explore bad paths? Restrict to (apparently best) paths Approach: Perform breadth-first search, but Retain only the k 'best' paths thus far (k: beam width)

44-48 Beam Search, k=3 (figure: beam search over "time flies like an arrow", keeping the top 3 paths at each step, across five animation slides)

49-55 Beam Search
W = {w_1, w_2, ..., w_n}: test sentence
s_ij: the j-th highest probability sequence up to and including word w_i
Generate tags for w_1, keep the top k, set s_1j accordingly
for i = 2 to n:
  Extension: add tags for w_i to each s_(i-1)j
  Beam selection: sort sequences by probability; keep only the top k sequences
Return the highest probability sequence s_n1
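
The same procedure in Python; `score` stands for an assumed hook into the trained MaxEnt model returning P(tag | features), and `extract_features` is the earlier sketch:

    def beam_search(words, tagset, score, k=3):
        """Beam search decoding: keep the k best tag sequences after each word."""
        # Generate tags for w1, keep the top k; each beam is (tag_sequence, prob)
        beams = [((t,), score(extract_features(words, (), 0), t)) for t in tagset]
        beams = sorted(beams, key=lambda b: b[1], reverse=True)[:k]
        for i in range(1, len(words)):
            # Extension: add every tag for w_i to each retained sequence
            extended = [(seq + (t,),
                         p * score(extract_features(words, seq, i), t))
                        for seq, p in beams for t in tagset]
            # Beam selection: sort by probability, keep only the top k
            beams = sorted(extended, key=lambda b: b[1], reverse=True)[:k]
        return beams[0]  # the highest-probability sequence s_n1 and its score

With k = 1 this reduces to the greedy tagger sketched earlier; larger k trades time for accuracy.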

56 POS Tagging
Overall accuracy: 96.3+%; unseen word accuracy: 86.2%
Comparable to HMM or TBL tagging accuracy
Provides a probabilistic framework, better able to model different information sources
Topline accuracy 96-97%: annotation consistency issues limit the ceiling

57-61 Beam Search
Beam search decoding: a variant of breadth-first search that keeps only the top k sequences at each layer
Advantages:
Efficient in practice: beam width 3-5 is near optimal
Empirically, the beam explores only 5-10% of the search space, pruning 90-95%
Simple to implement: just extensions + sorting, no dynamic programming
Running time: O(kT) [vs. O(N^T)]
Disadvantage: not guaranteed optimal (or complete)

62-64 Viterbi Decoding
Viterbi search: exploits dynamic programming and memoization; requires only a small history window
Efficient search: O(N^2 T)
Advantage: exact, the optimal solution is returned
Disadvantage: limited window of context

65 Beam vs Viterbi Dynamic programming vs heuristic search

66 Beam vs Viterbi Dynamic programming vs heuristic search Guaranteed optimal vs no guarantee

67 Beam vs Viterbi Dynamic programming vs heuristic search Guaranteed optimal vs no guarantee Different context window

68 MaxEnt POS Tagging Part of speech tagging by classification: Feature design word and tag context features orthographic features for rare words

69 MaxEnt POS Tagging Part of speech tagging by classification: Feature design word and tag context features orthographic features for rare words Sequence classification problems: Tag features depend on prior classification

70 MaxEnt POS Tagging
Part of speech tagging by classification: feature design (word and tag context features, orthographic features for rare words)
Sequence classification problems: tag features depend on prior classification
Beam search decoding: efficient but inexact, near optimal in practice

71 Named Entity Recognition

72 Roadmap Named Entity Recognition Definition Motivation Challenges Common Approach

73 Named Entity Recognition
Task: identify named entities in (typically) unstructured text
Typical entities: person names, locations, organizations, dates, times

74 Example Microsoft released Windows Vista in 2007.

75-78 Example
Microsoft released Windows Vista in 2007. (on the slide, Microsoft, Windows Vista, and 2007 are highlighted as entities; the markup is not preserved)
Entities: often application/domain specific
Business intelligence: products, companies, features
Biomedical: genes, proteins, diseases, drugs, ...

79 Why NER? Machine translation:

80 Why NER? Machine translation: Person

81 Why NER? Machine translation: Person names typically not translated Possibly transliterated Waldheim Number:

82 Why NER? Machine translation: Person names typically not translated Possibly transliterated Waldheim Number: 9/11: Date vs ratio 911: Emergency phone number, simple number

83 Why NER? Information extraction: MUC task: Joint ventures/mergers Focus on

84 Why NER? Information extraction: MUC task: Joint ventures/mergers Focus on Company names, Person Names (CEO), valuations

85 Why NER? Information extraction: MUC task: Joint ventures/mergers Focus on Company names, Person Names (CEO), valuations Information retrieval: Named entities focus of retrieval In some data sets, 60+% queries target NEs

86 Why NER? Information extraction: MUC task: Joint ventures/mergers Focus on Company names, Person Names (CEO), valuations Information retrieval: Named entities focus of retrieval In some data sets, 60+% queries target NEs Text-to-speech:

87 Why NER? Information extraction: MUC task: Joint ventures/mergers Focus on Company names, Person Names (CEO), valuations Information retrieval: Named entities focus of retrieval In some data sets, 60+% queries target NEs Text-to-speech: 206-616-5728 Phone numbers (vs other digit strings), differ by language

88 Challenges Ambiguity Washington chose

89 Challenges Ambiguity Washington chose D.C., State, George, etc Most digit strings

90 Challenges Ambiguity Washington chose D.C., State, George, etc Most digit strings cat: (95 results)

91 Challenges
Ambiguity: Washington chose... (D.C., the state, George, etc.)
Most digit strings are ambiguous
cat (95 results): CAT(erpillar) stock ticker, Computerized Axial Tomography, Chloramphenicol Acetyl Transferase, small furry mammal

92 Evaluation Precision Recall F-measure
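
In the standard formulation (not spelled out on the slide), for NER these are computed over entities:

    P = \frac{\text{correctly identified entities}}{\text{entities identified}}, \qquad
    R = \frac{\text{correctly identified entities}}{\text{entities in the gold standard}}, \qquad
    F_1 = \frac{2PR}{P + R}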

93 Resources
Online: name lists (baby names, who's who, newswire services), gazetteers, etc.
Tools: LingPipe, OpenNLP, Stanford NLP toolkit

