
1 CPSC 503 Computational Linguistics, Lecture 6. Giuseppe Carenini. CPSC 503, Winter 2008.

2 Knowledge-Formalisms Map
–State machines (and prob. versions): finite state automata, finite state transducers, Markov models (morphology, syntax)
–Rule systems (and prob. versions), e.g. (prob.) context-free grammars (syntax)
–Logical formalisms (first-order logics) (semantics)
–AI planners (pragmatics, discourse and dialogue)
Markov models: Markov chains -> n-grams; Hidden Markov Models (HMM); MaxEntropy Markov Models (MEMM)

3 Today 24/9
–n-gram evaluation
–Markov chains
–Hidden Markov Models: definition; the three key problems (only one in detail)
–Part-of-speech tagging: what it is, why we need it, how to do it

4 Model Evaluation: Goal. On a given corpus, you may want to compare: 2-grams with 3-grams; or two different smoothing techniques (given the same n-grams).

5 Model Evaluation: Key Ideas. A: split the corpus into a training set and a testing set. B: train the models Q1 and Q2 on the training set (counting frequencies, smoothing). C: apply both models to the testing set and compare the results.

6 Entropy. Def1. A measure of uncertainty. Def2. A measure of the information that we need to resolve an uncertain situation. Let p(x) = P(X=x), where x ∈ X. Then H(p) = H(X) = -Σ_{x ∈ X} p(x) log2 p(x). It is normally measured in bits.
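The definition above is easy to check with a short sketch (Python; the coin distributions are made-up examples, not from the lecture):

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) * log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin has maximal uncertainty for two outcomes: exactly 1 bit.
fair = {"heads": 0.5, "tails": 0.5}
biased = {"heads": 0.9, "tails": 0.1}
print(entropy(fair))    # 1.0
print(entropy(biased))  # about 0.469 -- less uncertain, so fewer bits
```

The biased coin needs fewer bits on average because its outcome is more predictable.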

7 Model Evaluation. How different is our approximation q from the actual distribution p? Relative entropy (KL divergence): D(p||q) = Σ_{x ∈ X} p(x) log2( p(x) / q(x) ).
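A minimal sketch of the formula (hypothetical distributions; assumes q(x) > 0 wherever p(x) > 0):

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) * log2(p(x)/q(x)); zero iff p and q agree."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, p))  # 0.0 -- a distribution diverges from itself by 0
print(kl_divergence(p, q))  # positive; note D(p||q) != D(q||p), it is not symmetric
```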

8 Entropy rate and the entropy of a language. Assumptions: the language is ergodic and stationary. Under these assumptions (Shannon-McMillan-Breiman), entropy can be computed by taking the average log probability of a looooong sample. Do they hold for natural language?

9 Cross-Entropy. Defined between a probability distribution P and another distribution Q (a model for P). Between two models Q1 and Q2, the more accurate is the one that assigns higher probability to the data, i.e., the one with the lower cross-entropy. Applied to language: the per-word cross-entropy of a model on a long sample measures how well the model predicts the language.
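A sketch of the idea with hypothetical unigram models (not from the lecture): the empirical per-word cross-entropy is the average negative log probability the model assigns to the test tokens, and the better-matching model scores lower (perplexity is then 2^H):

```python
import math

def cross_entropy_per_word(model, test_tokens):
    """Empirical per-word cross-entropy: -(1/N) * sum_i log2 model(w_i), in bits.
    The true distribution p is approximated by the test sample itself."""
    return -sum(math.log2(model[w]) for w in test_tokens) / len(test_tokens)

test = ["the", "cat", "sat", "on", "the", "mat"]
q1 = {"the": 0.4, "cat": 0.15, "sat": 0.15, "on": 0.15, "mat": 0.15}  # favors "the"
q2 = {"the": 0.2, "cat": 0.2, "sat": 0.2, "on": 0.2, "mat": 0.2}      # uniform
h1 = cross_entropy_per_word(q1, test)
h2 = cross_entropy_per_word(q2, test)
print(h1 < h2)  # True: q1 matches the sample better, so its cross-entropy is lower
```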

10 Model Evaluation: In practice. A: split the corpus into a training set and a testing set. B: train the models Q1 and Q2 on the training set (counting frequencies, smoothing). C: apply both models to the testing set and compare their cross-perplexities.

11 k-fold cross-validation and the t-test. Randomly divide the corpus into k subsets of equal size. Use each subset for testing (and all the others for training); in practice you do k times what we saw in the previous slide. Now for each model you have k perplexities. Compare the models' average perplexities with a t-test.
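Since the two models are evaluated on the same k folds, a paired t-test is the natural choice. A minimal sketch with hypothetical per-fold perplexities (the numbers below are made up for illustration):

```python
import math
import statistics

def paired_t_statistic(xs, ys):
    """t = mean(d) / (stdev(d) / sqrt(k)) over paired differences d_i = x_i - y_i.
    Compare |t| against the t distribution with k-1 degrees of freedom."""
    d = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Hypothetical per-fold perplexities for two models, k = 5 folds.
perp_q1 = [110.2, 108.7, 112.1, 109.5, 111.0]
perp_q2 = [118.9, 117.2, 120.4, 116.8, 119.5]
t = paired_t_statistic(perp_q1, perp_q2)
print(t)  # a large negative t: Q1's perplexity is consistently lower than Q2's
```

With 4 degrees of freedom, |t| > 2.776 rejects "no difference" at the 0.05 level.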

12 Knowledge-Formalisms Map (including probabilistic formalisms)
–State machines (and prob. versions): finite state automata, finite state transducers, Markov models (morphology, syntax)
–Rule systems (and prob. versions), e.g. (prob.) context-free grammars (syntax)
–Logical formalisms (first-order logics) (semantics)
–AI planners (pragmatics, discourse and dialogue)

13 Today 24/9
–n-gram evaluation
–Markov chains
–Hidden Markov Models: definition; the three key problems (only one in detail)
–Part-of-speech tagging: what it is, why we need it, how to do it

14 Example of a Markov Chain. [Diagram: a chain over the states t, e, h, a, p, i, with start probabilities .6 and .4 and transition probabilities on the arcs.]

15 Markov Chain: formal description (Manning/Schütze, 2000: 318). A vector of initial state probabilities (here .6 and .4 for the two possible start states) and a stochastic transition matrix A with a_ij = P(X_{t+1} = s_j | X_t = s_i), each row summing to 1. [Table: the transition matrix over the states t, i, p, a, h, e.]

16 Markov Assumptions. Let X = (X_1, ..., X_t) be a sequence of random variables taking values in some finite set S = {s_1, ..., s_n}, the state space. The Markov properties are: (a) Limited Horizon: for all t, P(X_{t+1} | X_1, ..., X_t) = P(X_{t+1} | X_t); (b) Time Invariant: for all t, P(X_{t+1} | X_t) = P(X_2 | X_1), i.e., the dependency does not change over time.

17 Markov Chain: probability of a sequence of states X_1 ... X_T (Manning/Schütze, 2000: 320): P(X_1, ..., X_T) = π_{X_1} ∏_{t=1}^{T-1} a_{X_t, X_{t+1}}. Example: Similar to .......?
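The product above can be sketched directly. The chain below is a made-up toy (not the diagram from the slides); the transition rows each sum to 1:

```python
# Hypothetical toy chain: start distribution pi and transition matrix a.
start = {"t": 0.6, "a": 0.4}
trans = {
    "t": {"i": 0.3, "a": 0.7},
    "i": {"p": 1.0},
    "a": {"t": 0.5, "i": 0.5},
    "p": {"t": 1.0},
}

def sequence_probability(states):
    """P(X1..XT) = pi(X1) * product over t of a(X_t, X_{t+1})."""
    p = start.get(states[0], 0.0)
    for prev, nxt in zip(states, states[1:]):
        p *= trans[prev].get(nxt, 0.0)
    return p

print(sequence_probability(["t", "i", "p"]))  # = 0.6 * 0.3 * 1.0
```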

18 Today 24/9
–n-gram evaluation
–Markov chains
–Hidden Markov Models: definition; the three key problems (only one in detail)
–Part-of-speech tagging: what it is, why we need it, how to do it

19 HMMs (and MEMMs): intro. They are probabilistic sequence classifiers / sequence labelers: they assign a class/label to each unit in a sequence. We have already seen a non-probabilistic version... Used extensively in NLP: part-of-speech tagging, partial parsing, named entity recognition, information extraction.

20 Hidden Markov Model (State Emission). [Diagram: four states s_1 ... s_4 with transition probabilities on the arcs, start probabilities .6 and .4, each state emitting symbols from {a, b, i} with its own emission probabilities.]

21 Hidden Markov Model: formal specification as a five-tuple
–Set of states
–Output alphabet
–Initial state probabilities
–State transition probabilities
–Symbol emission probabilities

22 Three fundamental questions for HMMs. Decoding: finding the probability of an observation, by brute force or the Forward/Backward algorithm (Manning/Schütze, 2000: 325). Finding the best state sequence: the Viterbi algorithm. Training: finding the model parameters which best explain the observations.

23 Computing the probability of an observation sequence O = o_1 ... o_T: P(O) = Σ_X P(O, X), where X ranges over all sequences of T states. E.g., P(b, i | sample HMM).

24 Decoding Example (Manning/Schütze, 2000: 327). Enumerating state sequences: s_1, s_1 = 0?; s_1, s_4 = 1 * .5 * .6 * .7; s_2, s_4 = 0?; ...; s_1, s_2 = 1 * .1 * .6 * .3; ... Complexity?

25 The forward procedure. Let α_j(t) = P(o_1 ... o_t, X_t = s_j). 1. Initialization: α_j(1) = π_j b_j(o_1). 2. Induction: α_j(t+1) = [Σ_i α_i(t) a_ij] b_j(o_{t+1}). 3. Total: P(O) = Σ_j α_j(T). Complexity: O(N^2 T), versus O(T N^T) for brute-force enumeration.
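A minimal Python sketch of the forward recursion for a state-emission HMM. The two-state model below is hypothetical (its numbers are made up, not the HMM from the slides):

```python
def forward(obs, states, start, trans, emit):
    """P(O) by dynamic programming in O(N^2 * T), vs. O(T * N^T) brute force."""
    # Initialization: alpha[s] = pi(s) * b_s(o_1)
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    # Induction: alpha'[s] = (sum over s' of alpha[s'] * a(s', s)) * b_s(o_t)
    for o in obs[1:]:
        alpha = {s: sum(alpha[sp] * trans[sp].get(s, 0.0) for sp in states)
                    * emit[s].get(o, 0.0)
                 for s in states}
    # Total: P(O) = sum over s of alpha[s]
    return sum(alpha.values())

# Hypothetical two-state HMM emitting symbols "a" and "b".
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
print(forward(["a", "b"], states, start, trans, emit))
```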

26 Three fundamental questions for HMMs. Decoding: finding the probability of an observation, by brute force or the Forward algorithm. Finding the best state sequence: the Viterbi algorithm. Training: finding the model parameters which best explain the observations. If interested in the details of the last two questions, read Manning/Schütze Sections 6.4 - 6.5.
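For the second question, the Viterbi algorithm replaces the forward procedure's sum with a max and keeps backpointers. A sketch on a hypothetical two-state HMM (made-up numbers, not a model from the lecture):

```python
def viterbi(obs, states, start, trans, emit):
    """Return the most likely state sequence for obs and its probability."""
    # delta[s]: probability of the best path ending in state s; paths[s]: that path.
    delta = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    paths = {s: [s] for s in states}
    for o in obs[1:]:
        new_delta, new_paths = {}, {}
        for s in states:
            # Pick the predecessor maximizing path probability into s.
            best = max(states, key=lambda sp: delta[sp] * trans[sp].get(s, 0.0))
            new_delta[s] = delta[best] * trans[best].get(s, 0.0) * emit[s].get(o, 0.0)
            new_paths[s] = paths[best] + [s]
        delta, paths = new_delta, new_paths
    final = max(states, key=lambda s: delta[s])
    return paths[final], delta[final]

states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "b"], states, start, trans, emit))  # best path + its probability
```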

27 Today 24/9
–n-gram evaluation
–Markov chains
–Hidden Markov Models: definition; the three key problems (only one in detail)
–Part-of-speech tagging: what it is, why we need it, how to do it

28 Parts of Speech Tagging
–What is it?
–Why do we need it?
–Word classes (tags): distribution, tagsets
–How to do it: rule-based, stochastic, transformation-based

29 Parts of Speech Tagging: What. Input: Brainpower, not physical plant, is now a firm's chief asset. Output: Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. Tag meanings: NNP (proper noun, singular), RB (adverb), JJ (adjective), NN (noun, singular or mass), VBZ (verb, 3sg present), DT (determiner), POS (possessive ending), . (sentence-final punctuation).

30 Parts of Speech Tagging: Why? The part of speech (word class, morphological class, syntactic category) gives a significant amount of information about the word and its neighbors. Useful in the following NLP tasks:
–As a basis for (partial) parsing
–Information retrieval
–Word-sense disambiguation
–Speech synthesis
–Improving language models (spelling/speech)

31 Parts of Speech. Eight basic categories: noun, verb, pronoun, preposition, adjective, adverb, article, conjunction. These categories are based on: morphological properties (the affixes they take) and distributional properties (what other words can occur nearby), e.g., green: "It is so…", "both…", "The… is". Not semantics!

32 Parts of Speech. Two kinds of category:
–Closed class (generally function words): prepositions, articles, conjunctions, pronouns, determiners, auxiliaries, numerals. Very short, frequent and important.
–Open class: nouns (proper/common; mass/count), verbs, adjectives, adverbs. Objects, actions, events, properties.
If you run across an unknown word….??

33 PoS Distribution. Parts of speech follow a typical distribution in language: ~35k words have 1 PoS; ~4k words have 2 PoS (unfortunately these are very frequent); fewer still have >2 PoS. But luckily the different tags associated with a word are not equally likely.

34 Sets of Parts of Speech: Tagsets. Most commonly used: the 45-tag Penn Treebank, the 61-tag C5, the 146-tag C7. The choice of tagset is based on the application (do you care about distinguishing between "to" as a preposition and "to" as an infinitive marker?). Accurate tagging can be done even with large tagsets.

35 PoS Tagging. Input text: "Brainpower, not physical plant, is now a firm's chief asset. ..." -> Tagger (using a dictionary mapping each word_i to its set of tags from the tagset) -> Output: "Brainpower_NN ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._. ..."

36 Tagger Types
–Rule-based (~'95)
–Stochastic: HMM tagger (~>= '92); transformation-based tagger (Brill) (~>= '95); maximum entropy models (~>= '97)

37 Rule-Based (ENGTWOL, '95). 1. A lexicon transducer returns for each word all possible morphological parses. 2. A set of ~1,000 constraints is applied to rule out inappropriate PoS. Step 1, sample I/O: "Pavlov had shown that salivation…." -> Pavlov: N SG PROPER; had: HAVE V PAST SVO, HAVE PCP2 SVO; shown: SHOW PCP2 SVOO; that: ADV, PRON DEM SG, CS; ... Sample constraint, the adverbial "that" rule: given input "that", if (+1 A/ADV/QUANT) and (+2 SENT-LIM) and (NOT -1 SVOC/A), then eliminate non-ADV tags; else eliminate ADV.
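The constraint can be sketched as a function that prunes a word's candidate tag set. This is only an illustration of how such a rule fires, with a hypothetical tag inventory (the real ENGTWOL constraints use a richer formalism):

```python
# Sketch of the adverbial-"that" constraint over a window of candidate tag sets.
# `candidates[i]` holds the still-possible tags for token i; the rule prunes it.
def adverbial_that_rule(tokens, candidates, i):
    """If 'that' is followed (+1) by an adj/adv/quantifier and then (+2) by a
    sentence limit, and the word before is not an SVOC/A verb, keep only ADV;
    otherwise eliminate ADV."""
    if tokens[i].lower() != "that":
        return
    next_is_adj_adv = i + 1 < len(tokens) and bool(candidates[i + 1] & {"A", "ADV", "QUANT"})
    next_ends_sent = i + 2 >= len(tokens) or tokens[i + 2] in {".", "!", "?"}
    prev_is_svoc = i > 0 and "SVOC/A" in candidates[i - 1]
    if next_is_adj_adv and next_ends_sent and not prev_is_svoc:
        candidates[i] = {"ADV"}          # eliminate non-ADV tags
    else:
        candidates[i] -= {"ADV"}         # eliminate ADV

tokens = ["it", "isn't", "that", "odd", "."]
cands = [{"PRON"}, {"V"}, {"ADV", "PRON", "DET", "CS"}, {"A"}, {"."}]
adverbial_that_rule(tokens, cands, 2)
print(cands[2])  # {'ADV'} -- here "that" is an intensifying adverb
```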

38 HMM Stochastic Tagging. Tags correspond to HMM states; words correspond to the HMM alphabet symbols. Tagging: given a sequence of words (observations), find the most likely sequence of tags (states). But this is…..! (the decoding problem). We need state transition and symbol emission probabilities: 1) from a hand-tagged corpus; 2) with no tagged corpus, via parameter estimation (Baum-Welch).
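Case 1, estimating the probabilities from a hand-tagged corpus, is just relative-frequency counting. A sketch on a tiny hypothetical corpus (a real tagger would also smooth these counts):

```python
from collections import Counter

# Tiny hypothetical hand-tagged corpus: one list of (word, tag) pairs per sentence.
corpus = [
    [("the", "DT"), ("plant", "NN"), ("grows", "VBZ")],
    [("the", "DT"), ("firm", "NN"), ("is", "VBZ"), ("a", "DT"), ("plant", "NN")],
]

trans_counts, emit_counts, tag_counts = Counter(), Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    # Transition counts, with a sentence-start pseudo-tag <s>.
    for prev, nxt in zip(["<s>"] + tags, tags):
        trans_counts[(prev, nxt)] += 1
    for w, t in sent:
        emit_counts[(t, w)] += 1
        tag_counts[t] += 1

# MLE emission probability P(word | tag) by relative frequency.
def p_emit(word, tag):
    return emit_counts[(tag, word)] / tag_counts[tag]

print(p_emit("plant", "NN"))  # 2 of the 3 NN tokens are "plant"
```

The transition probabilities P(tag_i | tag_{i-1}) come from `trans_counts` the same way; decoding with these parameters is then the Viterbi problem from the HMM slides.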

39 Evaluating Taggers. Accuracy: percent correct (most current taggers reach 96-97%); test on unseen data! Human ceiling: the agreement rate of humans on the classification (96-97%). Unigram baseline: assign each token the class it occurred in most frequently in the training set (e.g., race -> NN); ~91%. What is causing the errors? Build a confusion matrix…
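Both accuracy and the confusion matrix fall out of one pass over the gold and predicted tags. A sketch on a hypothetical test sentence (the tags below are invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical gold vs. predicted tags for one test sentence.
gold = ["NN", "VBZ", "DT", "NN", "JJ", "NN"]
pred = ["NN", "VBZ", "DT", "VB", "JJ", "NN"]

# Accuracy: fraction of tokens tagged correctly.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Confusion matrix: rows are the gold tag, columns what the tagger said.
confusion = defaultdict(Counter)
for g, p in zip(gold, pred):
    confusion[g][p] += 1

print(accuracy)         # 5 of 6 tokens correct
print(confusion["NN"])  # shows one NN was mistagged as VB
```

Off-diagonal cells of the matrix point directly at the error classes worth fixing.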

40 Knowledge-Formalisms Map (next three lectures)
–State machines (and prob. versions): finite state automata, finite state transducers, Markov models (morphology, syntax)
–Rule systems (and prob. versions), e.g. (prob.) context-free grammars (syntax)
–Logical formalisms (first-order logics) (semantics)
–AI planners (pragmatics, discourse and dialogue)

41 Next Time. Read Chapter 12 (Syntax and Context-Free Grammars).

