
Slide 1: Logistics
– Course reviews
– Project report deadline: March 16
– Poster session guidelines:
  – 2.5 minutes per poster (3 hrs / 55 posters, minus overhead)
  – presentations will be videotaped
  – food will be provided

Slide 2: Task: Named-Entity Recognition in a new corpus

Slide 4: Named-Entity Recognition
Fragment of an example sentence: "Julian Assange accused the United ..."
Labels: Julian → PER, Assange → PER, accused → Other, the → Other, United → LOC

Slide 5: NER as Machine Learning
Fragment of an example sentence: "Julian Assange accused the United ..." with labels PER, PER, Other, Other, LOC
– Y_i: the word's label, from {Other, LOC, PER, ORG}
– X_i: some feature representation of the word

Slide 6: Feature Vector: Three Choices
– Words: current word
– Context: current word, previous word, next word
– Features: current word, previous word, next word
  – is the word capitalized?
  – "word shape" (compact summary of orthographic information, like internal digits and punctuation)
  – prefixes up to length 5, suffixes up to length 5
  – any word in a +/- six-word window (*not* differentiated by position the way previous word and next word are)

Slide 7: Discriminative vs Generative I
[Diagram: two single-variable models for the word "Assange", with features Capitalized=1, Previous=Julian, POS=noun; generative: Y → X; discriminative: X → Y]

Slide 8: Generative vs Discriminative I
10K training words from CoNLL (British newswire), looking only for PERSON. Metric: F1.

                        Words   Context   Features
  Naive Bayes            51.3      59.1       70.8
  Logistic regression    52.8      65.5       81.5

Slide 9: Do More Features Always Help?
How do we evaluate multiple feature sets?
– On a validation set, not the test set!
Detecting underfitting
– Train & test performance similar and low
Detecting overfitting
– Train performance high, test performance low
The same holds every time we want to consider models of varying complexity!

Slide 10: Sequential Modeling
Fragment of an example sentence: "Julian Assange accused the United ..." with labels PER, PER, Other, Other, LOC
– Y_i: random variable with domain {Other, LOC, PER, ORG}
– X_i: random variable for the vector of features about the word

Slide 11: Hidden Markov Model (HMM)
[Diagram: chain Y1 → Y2 → Y3 → Y4 → Y5 with emissions Yi → Xi, over the sentence "Julian Assange accused the United"]

Slide 12: Hidden Markov Model (HMM)
[Same chain diagram; the words "Julian Assange accused the United" shown as the observations X1 ... X5]

Slide 13: Hidden Markov Model (HMM)
[Same chain diagram; the observation for "Assange" expanded into its features: Capitalized=1, Previous=Julian, POS=noun]

Slide 14: Advantage of Sequential Modeling

                 Words   Context   Features
  Naive Bayes     51.3      59.1       70.8
  HMM             57.4      61.8       70.8

Reminder: plain logistic regression gives us 81.5!

Slide 15: Max Entropy Markov Model (MEMM)
– Markov chain over the Y_i's
– Each Y_i has a logistic-regression CPD given Y_{i-1} and X_i
[Diagram: chain Y1 → ... → Y5 with feature inputs Xi → Yi, over "Julian Assange accused the United"; the observation for "Assange" expanded into Capitalized=1, Previous=Julian, POS=noun]

Slide 16: Max Entropy Markov Model (MEMM)
– Pro: uses features in a powerful way
– Con: downstream evidence doesn't help, because of v-structures
[Same diagram as slide 15]

Slide 17: MEMM vs HMM vs NB

          Words   Context   Features
  MEMM     59.1      68.3       84.6

Finally beat logistic regression!

Slide 18: Conditional Random Field (CRF)
[Diagram: undirected chain Y1 — Y2 — Y3 — Y4 — Y5 with pairwise potentials, each Yi connected to its features Xi, over "Julian Assange accused the United"]

Slide 19: Comparison: Sequence Models

          Words   Context   Features
  MEMM     59.1      68.3       84.6
  CRF      59.6      70.2       85.8
  HMM      57.4      61.8       70.8

Slide 20: Tradeoffs in Learning I
– HMM
  – Simple closed-form solution
– MEMM
  – Gradient ascent for the parameters of the logistic P(Y_i | X_i)
  – But no inference required for learning
– CRF
  – Gradient ascent for all parameters
  – Inference over the entire graph required at each iteration

Slide 21: Tradeoffs in Learning II
Can we learn from unsupervised data?
– HMM: yes, using EM
– MEMM/CRF: no
  – Discriminative objective: maximize log P(Y | X)
  – But if Y is not observed, we can't maximize its probability

Slide 22: PGMs and ML
– PGMs deal well with predictions of structured objects (sequences, graphs, trees)
  – Exploit correlations between multiple parts of the prediction task
– Can easily incorporate prior knowledge into the model
– A learned model can often be used for multiple prediction tasks
– Useful framework for knowledge discovery

Slide 23: Inference
– Exact marginals?
  – Clique tree calibration gives all marginals
  – The final labeling might not be jointly consistent
– Approximate marginals?
  – Doesn't make sense in this context
– MAP?
  – Gives a single coherent solution
  – Hard to get ROC curves (trading off precision & recall)

Slide 24: Mismatch of Objectives
– MAP inference optimizes LL = log P(Y | X)
– The actual performance metric is usually different (e.g., F1)
– Performance is best if the two metrics are relatively well-aligned
  – If the MAP assignment gets significantly lower F1 than the ground truth, the model needs to be adjusted
– Very useful for debugging approximate MAP:
  – If LL(y*) >> LL(y_MAP): the algorithm found a local optimum
  – If LL(y*) << LL(y_MAP): LL is a bad surrogate for the objective

Slide 25: Richer Models
Two fragments of the document: "Julian Assange accused the United ..." and, roughly 100 words later, "... said Stephen, Assange's lawyer, to ..."
[Diagram: two label/feature chains, Y1 ... Y5 over X1 ... X5 and Y101 ... Y105 over X101 ... X105, suggesting long-range dependencies between the two mentions of "Assange"]

Slide 26: Summary
– Foundation I: Probabilistic model
  – Coherent treatment of uncertainty
  – Declarative representation: separates model and inference; separates inference and learning
– Foundation II: Graphical model
  – Encode and exploit structure for compact representation and efficient inference
  – Allows modularity in updating the model
