1 Logistics
Course reviews
Project report deadline: March 16
Poster session guidelines:
– 2.5 minutes per poster (3 hrs / 55, minus overhead)
– presentations will be videotaped
– food will be provided

2 Task: Named-Entity Recognition in new corpus

3

4 Named-Entity Recognition
Fragment of an example sentence: "Julian Assange accused the United ..."
Labels: Julian/PER, Assange/PER, accused/Other, the/Other, United/LOC

5 NER as Machine Learning
Same example fragment: Julian Assange accused the United (PER, PER, Other, Other, LOC)
– Y_i: word label ∈ {Other, LOC, PER, ORG}
– X_i: some feature representation of the word

6 Feature Vector: Three Choices
Words: current word
Context: current word, previous word, next word
Features: current word, previous word, next word, plus:
– is the word capitalized?
– "word shape" (compact summary of orthographic information, like internal digits and punctuation)
– prefixes up to length 5, suffixes up to length 5
– any word in a +/- six word window (*not* differentiated by position the way previous word and next word are)
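
A minimal sketch of the richest ("Features") representation, assuming plain tokenized input; the slides do not spell out exact feature templates, so the helper names and the word-shape encoding below are illustrative.

```python
import re

def word_shape(word):
    # Compact orthographic summary: "Assange" -> "Xxxxxxx", "U.N.-7" -> "X.X.-d"
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)

def features(tokens, i):
    # The "Features" choice: word/context plus orthographic and window features.
    w = tokens[i]
    feats = {
        "word": w,
        "prev": tokens[i - 1] if i > 0 else "<S>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "</S>",
        "capitalized": w[0].isupper(),
        "shape": word_shape(w),
    }
    for k in range(1, min(5, len(w)) + 1):            # prefixes/suffixes up to length 5
        feats[f"prefix={w[:k]}"] = True
        feats[f"suffix={w[-k:]}"] = True
    for j in range(max(0, i - 6), min(len(tokens), i + 7)):
        if j != i:                                    # +/- 6 word window, not position-specific
            feats[f"window={tokens[j]}"] = True
    return feats

print(features("Julian Assange accused the United".split(), 1))
```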

7 Discriminative vs Generative I
Two graphical models over the label Y and the features of the word "Assange" (Capitalized=1, Previous=Julian, POS=noun):
– generative (naive Bayes): Y is a parent of each feature
– discriminative (logistic regression): the features are parents of Y
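
To make the contrast concrete, a small scikit-learn sketch (my choice of library, not stated in the slides): a generative naive Bayes model of P(Y) P(X | Y) versus a discriminative logistic regression that models P(Y | X) directly, both over the same dictionary features; the toy data is made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB          # generative: fits P(Y) and P(X | Y)
from sklearn.linear_model import LogisticRegression  # discriminative: fits P(Y | X) directly

X_dicts = [{"word": "Julian", "capitalized": True},
           {"word": "Assange", "capitalized": True, "prev": "Julian"},
           {"word": "accused", "capitalized": False},
           {"word": "the", "capitalized": False}]
y = ["PER", "PER", "Other", "Other"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)   # binary indicator features

for model in (BernoulliNB(), LogisticRegression(max_iter=1000)):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X))
```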

8 Generative vs Discriminative II
10K training words from CoNLL (British newswire), looking only for PERSON. Metric: F1.
                      Words   Context   Features
Naive Bayes            51.3      59.1       70.8
Logistic regression    52.8      65.5       81.5

9 Do More Features Always Help?
How do we evaluate multiple feature sets?
– On the validation set, not the test set!
Detecting underfitting
– Train & test performance similar and low
Detecting overfitting
– Train performance high, test performance low
The same holds every time we want to consider models of varying complexity!
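
A sketch of the train-versus-validation check described above, on synthetic data (the real comparison would use CoNLL features); the thresholds in the comments are arbitrary illustrations, not part of the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in for per-token feature vectors and binary PERSON labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)
X_train, y_train, X_val, y_val = X[:1500], y[:1500], X[1500:], y[1500:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_f1 = f1_score(y_train, clf.predict(X_train))
val_f1 = f1_score(y_val, clf.predict(X_val))
print(f"train F1 = {train_f1:.2f}, validation F1 = {val_f1:.2f}")

# Rough reading of the two numbers, in the spirit of the slide:
#   both low                    -> underfitting (features/model too weak)
#   train high, validation low  -> overfitting (choose feature sets on validation, never on test)
```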

10 Sequential Modeling
Same example fragment: Julian Assange accused the United (PER, PER, Other, Other, LOC)
– Y_i: random variable with domain {Other, LOC, PER, ORG}
– X_i: random variable for the vector of features about the word

11 Hidden Markov Model (HMM)
Chain Y_1 → Y_2 → Y_3 → Y_4 → Y_5 over the labels, with each Y_i emitting the observed word X_i, for "Julian Assange accused the United"

12 Hidden Markov Model (HMM): same chain over "Julian Assange accused the United" (diagram repeated)

13 Hidden Markov Model (HMM): same chain, with each X_i expanded into a feature vector (e.g., for "Assange": Capitalized=1, Previous=Julian, POS=noun)
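
A small numeric sketch of the HMM factorization over this five-word fragment; every probability below is made up purely for illustration (a real model estimates them from data, as discussed later).

```python
import math

words = ["Julian", "Assange", "accused", "the", "United"]
labels = ["PER", "PER", "Other", "Other", "LOC"]

# Made-up parameters for illustration only.
pi = {"PER": 0.15, "Other": 0.7, "LOC": 0.1, "ORG": 0.05}                  # P(Y_1)
A = {("PER", "PER"): 0.5, ("PER", "Other"): 0.4,
     ("Other", "Other"): 0.8, ("Other", "LOC"): 0.05}                      # P(Y_i | Y_{i-1})
B = {("PER", "Julian"): 0.01, ("PER", "Assange"): 0.01,
     ("Other", "accused"): 0.02, ("Other", "the"): 0.1,
     ("LOC", "United"): 0.02}                                              # P(X_i | Y_i)

# HMM joint: P(Y, X) = P(Y_1) P(X_1 | Y_1) * prod_{i>1} P(Y_i | Y_{i-1}) P(X_i | Y_i)
log_p = math.log(pi[labels[0]]) + math.log(B[(labels[0], words[0])])
for i in range(1, len(words)):
    log_p += math.log(A[(labels[i - 1], labels[i])]) + math.log(B[(labels[i], words[i])])
print(f"log P(Y, X) = {log_p:.2f}")
```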

14 Advantage of Sequential Modeling
                      Words   Context   Features
Naive Bayes            51.3      59.1       70.8
HMM                    57.4      61.8       70.8
Reminder: plain logistic regression gives us 81.5!

15 Max Entropy Markov Model (MEMM)
– Markov chain over the Y_i's
– Each Y_i has a logistic regression CPD given X_i (and the previous label Y_i-1)
Diagram: directed chain Y_1 → ... → Y_5, with each feature vector X_i (e.g., Capitalized=1, Previous=Julian, POS=noun) a parent of Y_i, over "Julian Assange accused the United"
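
A rough sketch of the MEMM idea, using a scikit-learn logistic regression as the per-position CPD P(Y_i | Y_i-1, X_i) and greedy left-to-right decoding (a full MEMM would decode with Viterbi); the single training sentence and feature names are illustrative.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def memm_features(tokens, i, prev_label):
    # Features for the CPD P(Y_i | Y_{i-1}, X_i): word features plus the previous label.
    return {"word": tokens[i], "capitalized": tokens[i][0].isupper(), "prev_label": prev_label}

train_tokens = "Julian Assange accused the United".split()
train_labels = ["PER", "PER", "Other", "Other", "LOC"]

X_dicts, y = [], []
for i, lab in enumerate(train_labels):
    prev = train_labels[i - 1] if i > 0 else "<START>"
    X_dicts.append(memm_features(train_tokens, i, prev))
    y.append(lab)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_dicts), y)

def greedy_decode(tokens):
    prev, out = "<START>", []
    for i in range(len(tokens)):
        prev = clf.predict(vec.transform([memm_features(tokens, i, prev)]))[0]
        out.append(prev)
    return out

print(greedy_decode("the United accused Julian Assange".split()))
```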

16 Max Entropy Markov Model (MEMM)
– Pro: uses features in a powerful way
– Con: downstream evidence doesn't help, because of the v-structures
(Same diagram as the previous slide.)

17 MEMM vs HMM vs NB
MEMM with the Words / Context / Features feature sets: 59.1, 68.3, 84.6
Finally beat logistic regression (81.5)!

18 Conditional Random Field (CRF)
Undirected chain over Y_1 ... Y_5, each Y_i also connected to its feature vector X_i, over "Julian Assange accused the United"
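
A minimal numpy sketch of a linear-chain CRF over this fragment, with randomly generated node and edge scores standing in for learned feature weights; it shows the unnormalized score of a labeling, the log-partition function, and MAP decoding with Viterbi.

```python
import numpy as np

labels = ["Other", "LOC", "PER", "ORG"]
L, T = len(labels), 5                 # label set size, sequence length

rng = np.random.default_rng(0)
unary = rng.normal(size=(T, L))       # score of label l at position t (would come from features X_t)
trans = rng.normal(size=(L, L))       # score of a label pair (y_{t-1}, y_t)

def score(y):
    # Unnormalized log-score of a complete labeling y (array of label indices).
    return unary[np.arange(T), y].sum() + trans[y[:-1], y[1:]].sum()

def log_partition():
    # Forward recursion in log space: log of the sum of exp(score) over all labelings.
    alpha = unary[0]
    for t in range(1, T):
        alpha = unary[t] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
    return np.logaddexp.reduce(alpha)

def viterbi():
    # Same recursion with max instead of log-sum-exp, plus backpointers: the MAP labeling.
    delta, back = unary[0], []
    for t in range(1, T):
        s = delta[:, None] + trans
        back.append(s.argmax(axis=0))
        delta = unary[t] + s.max(axis=0)
    y = [int(delta.argmax())]
    for bp in reversed(back):
        y.append(int(bp[y[-1]]))
    return y[::-1]

y_map = np.array(viterbi())
print("MAP labeling:", [labels[i] for i in y_map])
print("log P(y_map | x) =", score(y_map) - log_partition())
```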

19 Comparison: Sequence Models
          Words   Context   Features
MEMM       59.1      68.3       84.6
CRF        59.6      70.2       85.8
HMM        57.4      61.8       70.8

20 Tradeoffs in Learning I
HMM
– Simple closed-form solution
MEMM
– Gradient ascent for the parameters of the logistic CPD P(Y_i | X_i)
– But no inference required for learning
CRF
– Gradient ascent for all parameters
– Inference over the entire graph required at each iteration
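
The HMM's closed-form solution is just relative-frequency counting of labeled data; a sketch on one toy sentence (a real estimator would pool counts over the whole corpus and smooth them).

```python
from collections import Counter

tokens = "Julian Assange accused the United".split()
labels = ["PER", "PER", "Other", "Other", "LOC"]

start = Counter([labels[0]])                  # counts for P(Y_1)
trans = Counter(zip(labels, labels[1:]))      # counts for P(Y_i | Y_{i-1})
emit = Counter(zip(labels, tokens))           # counts for P(X_i | Y_i)
prev_count = Counter(labels[:-1])
label_count = Counter(labels)

# Closed-form maximum-likelihood estimates: normalized counts.
P_start = {a: c / sum(start.values()) for a, c in start.items()}
P_trans = {(a, b): c / prev_count[a] for (a, b), c in trans.items()}
P_emit = {(a, w): c / label_count[a] for (a, w), c in emit.items()}
print(P_trans)
print(P_emit)
```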

21 Tradeoffs in Learning: II
Can we learn from unsupervised data?
HMM
– Yes, using EM
MEMM / CRF
– No
– Discriminative objective: maximize log P(Y | X)
– But if Y is not observed, we can't maximize its probability

22 PGMs and ML
PGMs deal well with predictions of structured objects (sequences, graphs, trees)
– Exploit correlations between multiple parts of the prediction task
Can easily incorporate prior knowledge into the model
Learned model can often be used for multiple prediction tasks
Useful framework for knowledge discovery

23 Inference
Exact marginals?
– Clique tree calibration gives all marginals
– Final labeling might not be jointly consistent
Approximate marginals?
– Doesn't make sense in this context
MAP?
– Gives a single coherent solution
– Hard to get ROC curves (tradeoff precision & recall)
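
To make the marginals-versus-MAP contrast concrete, a sketch that reuses `unary`, `trans`, `T`, `L`, `labels`, and `viterbi()` from the CRF sketch above: forward-backward yields per-position marginals, and their position-wise argmax need not form a jointly consistent labeling, while Viterbi returns one coherent MAP assignment.

```python
def marginals():
    # Forward-backward in log space; returns log P(Y_t = l | X) for every position and label.
    alpha, beta = np.zeros((T, L)), np.zeros((T, L))
    alpha[0] = unary[0]
    for t in range(1, T):
        alpha[t] = unary[t] + np.logaddexp.reduce(alpha[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = np.logaddexp.reduce(trans + (unary[t + 1] + beta[t + 1])[None, :], axis=1)
    log_m = alpha + beta
    return log_m - np.logaddexp.reduce(log_m, axis=1, keepdims=True)

per_position = marginals().argmax(axis=1)     # may not be jointly consistent
print("argmax of marginals:", [labels[i] for i in per_position])
print("Viterbi MAP:        ", [labels[i] for i in viterbi()])
```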

24 Mismatch of Objectives
MAP inference optimizes LL = log P(Y | X)
Actual performance metric is usually different (e.g., F1)
Performance is best if we can get these two metrics to be relatively well aligned
– If the MAP assignment gets significantly lower F1 than the ground truth, the model needs to be adjusted
Very useful for debugging approximate MAP
– If LL(y*) >> LL(y_MAP): the algorithm found a local optimum
– If LL(y*) << LL(y_MAP): LL is a bad surrogate for the objective
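
A sketch of the debugging check above, again reusing `score()` and `viterbi()` from the CRF sketch; `y_gold` is an arbitrary stand-in for the ground-truth labeling, and the check is only informative when the MAP inference being debugged is approximate.

```python
y_gold = np.array([2, 2, 0, 0, 1])   # stand-in ground truth: PER PER Other Other LOC
y_map = np.array(viterbi())          # labeling returned by the (possibly approximate) MAP algorithm

ll_gold, ll_map = score(y_gold), score(y_map)
if ll_gold > ll_map:
    print("LL(y*) > LL(y_MAP): the inference algorithm found a local optimum")
elif ll_map > ll_gold:               # and y_map scores much worse than y_gold on F1
    print("LL(y*) < LL(y_MAP): LL is a bad surrogate for the task metric")
```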

25 Richer Models
Two distant fragments of the same document: "Julian Assange accused the United ..." (Y_1 ... Y_5, X_1 ... X_5) and "... said Stephen, Assange's lawyer, to ..." (Y_101 ... Y_105, X_101 ... X_105), allowing, e.g., a dependency between the two mentions of "Assange"

26 Summary
Foundation I: Probabilistic model
– Coherent treatment of uncertainty
– Declarative representation: separates model and inference, separates inference and learning
Foundation II: Graphical model
– Encode and exploit structure for compact representation and efficient inference
– Allows modularity in updating the model

