Machine Learning: Basic Introduction Jan Odijk January 2011 LOT Winter School 2011 1.


1 Machine Learning: Basic Introduction Jan Odijk January 2011 LOT Winter School 2011 1

2 Overview Introduction Rule-based Approaches Machine Learning Approaches –Statistical Approach –Memory Based Learning Methodology Evaluation Machine Learning & CLARIN 2

3 Introduction As a scientific discipline –Studies algorithms that allow computers to evolve behaviors based on empirical data Learning: empirical data are used to improve performance on some tasks Core concept: Generalize from observed data 3

4 Introduction Plural Formation –Observed: list of (singular form, plural form) pairs –Generalize: predict the plural form of new words (not in the observed list) PoS tagging –Observed: text corpus with PoS-tag annotations –Generalize: predict the PoS tag of each token in a new text corpus 4

5 Introduction Supervised Learning –Map input into desired output, e.g. classes –Requires a training set Unsupervised Learning –Model a set of inputs (e.g. into clusters) –No training set required 5

6 Introduction Many approaches –Decision Tree Learning –Artificial Neural Networks –Genetic programming –Support Vector Machines –Statistical Approaches –Memory Based Learning 6

7 Introduction Focus here –Supervised learning –Statistical Approaches –Memory-based learning 7

8 Rule-Based Approaches Rule based systems for language –Lexicon Lists all idiosyncratic properties of lexical items –Unpredictable properties, e.g. man is a noun –Exceptions to rules, e.g. past tense(go) = went Hand-crafted In a fully formalized manner 8

9 Rule-Based Approaches Rule based systems for language (cont.) –Rules Specify regular properties of language –E.g. the direct object directly follows the verb (in English) Hand-crafted In a fully formalized manner 9

10 Rule-Based Approaches Problems for rule based systems –Lexicon Very difficult to specify and create Always incomplete Existing dictionaries –Were developed for use by humans –Do not specify enough properties –Do not specify the properties in a formalized manner 10

11 Rule-Based Approaches Problems for rule based systems (cont.) –Rules Extremely difficult to describe a language (or even a significant subset of language) by rules Rule systems become very large and difficult to maintain (No robustness (fail softly) for unexpected input) 11

12 Machine Learning –A machine learns Lexicon Regularities of language –From a large corpus of observed data 12

13 Statistical Approach Statistical approach Goal: get output O given some input I –Given a word in English, get its translation in Spanish –Given acoustic signal with speech, get the written transcription of the spoken word –Given preceding tags and following ambitag, get tag of the current word Work with probabilities P(O|I) 13

14 Statistical Approach P(A): probability of A A: an event (usually modeled by a set) Ω: the event space = the set of all possible event elements 0 ≤ P(A) ≤ 1 For a finite event space and a uniform distribution: P(A) = |A| / |Ω| 14

15 Statistical Approach Simple Example A fair coin is tossed 3 times –What is the probability of (exactly) two heads? 2 possibilities for each toss: Heads or Tails Solution: –Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} –A = {HHT, HTH, THH} –P(A) = |A| / |Ω| = 3/8 15
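
The coin-toss example can be checked by brute-force enumeration of the event space; a minimal Python sketch (names like `omega` are purely illustrative):

```python
from itertools import product
from fractions import Fraction

# Event space: all sequences of 3 tosses, each H or T.
omega = [''.join(seq) for seq in product('HT', repeat=3)]

# Event A: exactly two heads.
A = [s for s in omega if s.count('H') == 2]

# Uniform distribution: P(A) = |A| / |Omega|.
p_A = Fraction(len(A), len(omega))
print(p_A)  # 3/8
```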

16 Statistical Approach Conditional Probability P(A|B) –Probability of event A given that event B has occurred P(A|B) = P(A ∩ B) / P(B) (for P(B) > 0) 16

17 Statistical Approach A fair coin is tossed 3 times –What is the probability of (exactly) two heads (A) if the first toss has occurred and is H (B)? –Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} –A = {HHT, HTH, THH} –B = {HHH, HHT, HTH, HTT} –A ∩ B = {HHT, HTH} –P(A|B) = P(A ∩ B) / P(B) = (2/8) / (4/8) = 2/4 = 1/2 17
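
The conditional probability on this slide (and the Bayes identity that follows) can be verified the same way, enumerating the same coin-toss space; a small sketch:

```python
from itertools import product
from fractions import Fraction

omega = [''.join(seq) for seq in product('HT', repeat=3)]
A = {s for s in omega if s.count('H') == 2}   # exactly two heads
B = {s for s in omega if s[0] == 'H'}         # first toss is heads

def prob(event):
    # Uniform distribution over a finite event space.
    return Fraction(len(event), len(omega))

# P(A|B) = P(A ∩ B) / P(B)
p_A_given_B = prob(A & B) / prob(B)
print(p_A_given_B)  # 1/2

# Bayes Theorem check: P(A|B) = P(B|A) * P(A) / P(B)
p_B_given_A = prob(A & B) / prob(A)
assert p_A_given_B == p_B_given_A * prob(A) / prob(B)
```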

18 Statistical Approach Given –P(A|B) = P(A ∩ B) / P(B) (multiply by P(B)) –P(A ∩ B) = P(A|B) P(B) –P(B ∩ A) = P(B|A) P(A) –P(A ∩ B) = P(B ∩ A) –P(A ∩ B) = P(B|A) P(A) Bayes Theorem: –P(A|B) = P(A ∩ B) / P(B) = P(B|A) P(A) / P(B) 18

19 Statistical Approach Bayes Theorem Check –Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} –A = {HHT, HTH, THH} –B = {HHH, HHT, HTH, HTT} –A ∩ B = {HHT, HTH} –P(B|A) = P(B ∩ A) / P(A) = (2/8) / (3/8) = 2/3 –P(A|B) = P(B|A) P(A) / P(B) = (2/3 × 3/8) / (4/8) = (1/4) / (1/2) = 1/2 19

20 Statistical Approach Statistical approach –Using Bayesian inference (noisy channel model): get P(O|I) for all possible O, given I take the O for which P(O|I) is highest: Ô Ô = argmax_O P(O|I) 20

21 Statistical Approach Statistical approach How to obtain P(O|I)? Bayes Theorem: P(O|I) = P(I|O) P(O) / P(I) 21

22 Statistical Approach Did we gain anything? Yes! –P(O) and P(I|O) are often easier to estimate than P(O|I) –P(I) can be ignored: it is independent of O –(though the resulting scores are then no longer probabilities) In particular: argmax_O P(O|I) = argmax_O P(I|O) × P(O) 22
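
The noisy-channel argmax can be sketched with toy numbers; everything below (the Dutch-looking input, the candidate outputs, and all probabilities) is made up purely to show the maximization of P(I|O) × P(O):

```python
# Hypothetical toy decoder. P(O) plays the role of the language model,
# P(I|O) the role of the channel (translation/acoustic) model.
prior = {'cat': 0.6, 'cap': 0.4}             # P(O)
likelihood = {                                # P(I|O)
    ('kat', 'cat'): 0.8,
    ('kat', 'cap'): 0.3,
}

def decode(i, candidates):
    # Ô = argmax over O of P(I|O) * P(O); P(I) is constant and ignored.
    return max(candidates, key=lambda o: likelihood[(i, o)] * prior[o])

print(decode('kat', ['cat', 'cap']))  # cat
```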

23 Statistical Approach P(O) (also called the prior probability) –Used for the language model in MT and ASR –Cannot be computed: must be estimated –P(w) estimated using the relative frequency of w in a (representative) corpus: count how often w occurs in the corpus divide by the total number of word tokens in the corpus set this relative frequency as P(w) –(ignoring smoothing) 23
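
The relative-frequency estimate described above is a one-liner per word; a minimal sketch on a made-up toy corpus (no smoothing, as the slide notes):

```python
from collections import Counter

# Toy corpus: word tokens separated by whitespace.
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
total = len(corpus)   # total number of word tokens

def p(word):
    # Relative frequency of `word` in the corpus, used as the estimate of P(word).
    return counts[word] / total

print(p('the'))  # 3 occurrences out of 9 tokens
```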

24 Statistical Approach P(I|O) (also called the likelihood) –Cannot easily be computed –But estimated on the basis of a corpus –Speech recognition: Transcribed speech corpus Acoustic Model –Machine translation Aligned parallel corpus Translation Model 24

25 Statistical Approach How to deal with sentences instead of words? Sentence S = w1..wn –P(S) = P(w1) × .. × P(wn)? –NO: this misses the connections between the words –P(S) = (chain rule) P(w1) P(w2|w1) P(w3|w1w2) .. P(wn|w1..wn-1) 25

26 Statistical Approach –Full n-gram histories needed (not really feasible) –Probabilities of n-grams are estimated by the relative frequency of n-grams in a corpus Frequencies get too low for higher-order n-grams to be useful –In practice: use bigrams, trigrams (sometimes 4-grams) –E.g. bigram model: P(S) = P(w1) P(w2|w1) .. P(wn|wn-1) 26
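
A bigram model in this spirit, estimating each P(w|w_prev) by relative frequency of the bigram, can be sketched as follows (toy corpus, no smoothing):

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # P(w | w_prev) estimated as count(w_prev w) / count(w_prev).
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_sentence(words):
    # P(S) = P(w1) * P(w2|w1) * ... * P(wn|wn-1)
    p = unigrams[words[0]] / len(corpus)
    for w_prev, w in zip(words, words[1:]):
        p *= p_bigram(w_prev, w)
    return p

print(p_sentence("the cat sat".split()))
```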

27 Memory Based Learning Classification Determine input features Determine output classes Store observed examples Use similarity metrics to classify unseen cases 27

28 Memory Based Learning Example: PP-attachment Given an input sequence V .. N .. PP –PP attaches to V?, or –PP attaches to N? Examples –John ate crisps with Mary –John ate pizza with fresh anchovies –John had pizza with his best friends 28

29 Memory Based Learning Input features (feature vector): –Verb –Head noun of complement NP –Preposition –Head noun of complement NP in PP Output classes (indicated by class labels) –Verb (i.e. attaches to the verb) –Noun (i.e. attaches to the noun) 29

30 Memory Based Learning Training Corpus:

Id  Verb  Noun1   Prep  Noun2      Class
1   ate   crisps  with  Mary       Verb
2   ate   pizza   with  anchovies  Noun
3   had   pizza   with  friends    Verb
4   has   pizza   with  John       Verb
5   …

30

31 Memory Based Learning MBL: Store the training corpus (feature vectors + associated classes) in memory For new cases: –Stored in memory? Yes: assign the associated class No: use similarity metrics 31

32 Similarity Metrics (actually: distance metrics) Input: eats pizza with Liam Compare the input feature vector X with each vector Y in memory: Δ(X,Y) Comparing vectors: sum the differences for the n individual features Δ(X,Y) = Σ_{i=1..n} δ(x_i, y_i) 32

33 Similarity Metrics δ(f1, f2) = –(f1, f2 numeric): |f1 − f2| / (max − min) 12 − 2 = 10 in a range of 0..100: 10/100 = 0.1 12 − 2 = 10 in a range of 0..20: 10/20 = 0.5 –(f1, f2 not numeric): –0 if f1 = f2 (no difference, distance = 0) –1 if f1 ≠ f2 (difference, distance = 1) 33
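
The per-feature δ above translates directly to code; a small sketch, with the numeric range bounds passed in explicitly since max and min come from the feature's value range:

```python
def delta(f1, f2, lo=None, hi=None):
    # Numeric features: normalized absolute difference over the feature's range.
    if isinstance(f1, (int, float)) and isinstance(f2, (int, float)):
        return abs(f1 - f2) / (hi - lo)
    # Symbolic features: overlap metric, 0 if equal, 1 otherwise.
    return 0 if f1 == f2 else 1

print(delta(12, 2, lo=0, hi=100))   # 0.1
print(delta(12, 2, lo=0, hi=20))    # 0.5
print(delta('with', 'with'))        # 0
```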

34 Similarity Metrics

Id       Verb   Noun1     Prep    Noun2        Class  Δ(X,Y)
New (X)  eats   pizza     with    Liam         ??
Mem 1    ate:1  crisps:1  with:0  Mary:1       Verb   3
Mem 2    ate:1  pizza:0   with:0  anchovies:1  Noun   2
Mem 3    had:1  pizza:0   with:0  friends:1    Verb   2
Mem 4    has:1  pizza:0   with:0  John:1       Verb   2
Mem 5    …

34

35 Similarity Metrics Look at the k nearest neighbours (k-NN) –(k = 1): look at the nearest set of vectors The set of feature vectors with ids {2,3,4} has the smallest distance (viz. 2) Take the most frequent class occurring in this set: Verb Assign this as class to the new example 35
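
The whole classification step can be sketched on the slides' toy corpus: overlap distance, the k = 1 nearest set, and a majority vote over its classes:

```python
from collections import Counter

# Toy training corpus from slide 30: (verb, noun1, prep, noun2) -> class.
memory = [
    (('ate', 'crisps', 'with', 'Mary'),      'Verb'),
    (('ate', 'pizza',  'with', 'anchovies'), 'Noun'),
    (('had', 'pizza',  'with', 'friends'),   'Verb'),
    (('has', 'pizza',  'with', 'John'),      'Verb'),
]

def distance(x, y):
    # Overlap metric: count mismatching symbolic features.
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

def classify(x):
    # k = 1: take the set of vectors at the smallest distance,
    # then assign the most frequent class occurring in that set.
    dists = [(distance(x, y), label) for y, label in memory]
    nearest = min(d for d, _ in dists)
    labels = [label for d, label in dists if d == nearest]
    return Counter(labels).most_common(1)[0][0]

print(classify(('eats', 'pizza', 'with', 'Liam')))  # Verb
```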

36 Similarity Metrics With Δ(X,Y) = Σ_{i=1..n} δ(x_i, y_i) –every feature is equally important –but perhaps some features are more important Adaptation: –Δ(X,Y) = Σ_{i=1..n} w_i · δ(x_i, y_i) –where w_i is the weight of feature i 36

37 Similarity Metrics How to obtain the weight of a feature? –Can be based on knowledge –Can be computed from the training corpus –In various ways: Information Gain Gain Ratio χ² 37
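
One of the listed weighting schemes, Information Gain, can be computed from the toy corpus of slide 30; a sketch (the four-row corpus is far too small for meaningful weights and only illustrates the entropy computation):

```python
from collections import Counter
from math import log2

# Toy rows: (feature vector, class label), as on slide 30.
rows = [
    (('ate', 'crisps', 'with', 'Mary'),      'Verb'),
    (('ate', 'pizza',  'with', 'anchovies'), 'Noun'),
    (('had', 'pizza',  'with', 'friends'),   'Verb'),
    (('has', 'pizza',  'with', 'John'),      'Verb'),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(i):
    # Class entropy minus the weighted entropy of the partitions
    # induced by the values of feature i.
    labels = [c for _, c in rows]
    h = entropy(labels)
    for v in {x[i] for x, _ in rows}:
        part = [c for x, c in rows if x[i] == v]
        h -= len(part) / len(rows) * entropy(part)
    return h

for i, name in enumerate(['verb', 'noun1', 'prep', 'noun2']):
    print(name, round(information_gain(i), 3))
```

Note the known weakness that Gain Ratio corrects for: the many-valued noun2 feature gets the maximal gain simply because every value is unique.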

38 Methodology Split corpus into –Training corpus –Test Corpus Essential to keep test corpus separate (Ideally) Keep Test Corpus unseen Sometimes –Development set –To do tests while developing 38

39 Methodology Split –Training 50% –Test 50% Pro –Large test set Con –Small training set 39

40 Methodology Split –Training 90% –Test 10% Pro –Large training set Con –Small test set 40

41 Methodology 10-fold cross-validation –Split the corpus into 10 equal subsets –Train on 9; test on 1 (in all 10 combinations) Pro: –Large training sets –Still independent test sets Con: –training set still not maximal –requires a lot of computation 41
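
The fold construction can be sketched in a few lines; the splitting scheme below (round-robin rather than contiguous blocks) is just one illustrative choice:

```python
def cross_validation_folds(data, k=10):
    # Split data into k (nearly) equal folds, round-robin,
    # and yield each (train, test) combination in turn.
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
for train, test in cross_validation_folds(data, k=10):
    assert len(test) == 2 and len(train) == 18
    assert sorted(train + test) == data   # every item used exactly once
```

Leave One Out (next slide) is the special case k = len(data).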

42 Methodology Leave One Out –Use all examples in the training set except 1 –Test on the 1 held-out example (in all combinations) Pro: –Maximal training sets –Still independent test sets Con: –requires a lot of computation 42

43 Evaluation

                     True class
                     Positive (P)         Negative (N)
Predicted: positive  True Positive (TP)   False Positive (FP)
Predicted: negative  False Negative (FN)  True Negative (TN)

43

44 Evaluation TP= examples that have class C and are predicted to have class C FP = examples that have class ~C but are predicted to have class C FN= examples that have class C but are predicted to have class ~C TN= examples that have class ~C and are predicted to have class ~C 44

45 Evaluation Precision = TP / (TP+FP) Recall = True Positive Rate = TP / P False Positive Rate = FP / N F-Score = (2*Prec*Rec) / (Prec+Rec) Accuracy = (TP+TN)/(TP+TN+FP+FN) 45
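
The formulas on this slide translate directly to code; the counts in the usage line are hypothetical, purely for illustration:

```python
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                    # = true positive rate, TP / P
    fpr = fp / (fp + tn)                       # false positive rate, FP / N
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, fpr, f_score, accuracy

# Hypothetical counts.
prec, rec, fpr, f, acc = metrics(tp=40, fp=10, fn=20, tn=30)
print(round(prec, 2), round(rec, 2), round(f, 2), round(acc, 2))
```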

46 Example Applications Morphology for Dutch –Segmentation into stems and affixes Abnormaliteiten -> abnormaal + iteit + en –Map to morphological features (e.g. inflectional) liepen -> lopen + past plural One instance per character Features: focus character; 5 preceding and 5 following letters + class 46

47 Example Applications Morphology for Dutch Results

             Prec  Rec   F-Score
Full         81.1  80.7  80.9
Typed Seg    90.3  89.9  90.1
Untyped Seg  90.4  90.0  90.2

–Seg = correctly segmented –Typed = assigned the correct type –Full = typed segmentation + correct spelling changes 47

48 Example Applications Part-of-Speech Tagging –Assignment of tags to words in context –[word] -> [(word, tag)] –[book that flight] -> –[(book, verb) (that,Det) (flight, noun)] –Book in isolation is ambiguous between noun and verb: marked by an ambitag: noun/verb 48

49 Example Applications Part-of-Speech Tagging Features –Context: preceding tag + following ambitag –Word: actual word form (for the 1000 most frequent words) some features of the word: –ambitag of the word –+/- capitalized –+/- with digits –+/- hyphen 49

50 Example Applications Part-of-Speech Tagging Results –WSJ: 96.4% accuracy –LOB Corpus: 97.0% accuracy 50

51 Example Applications Phrase Chunking –Marking of major phrase boundaries –The man gave the boy the money -> –[NP the man] gave [NP the boy] [NP the money] –Usually encoded with tags per word: –I-X = inside X; O = outside; B-X = beginning of new X –the/I-NP man/I-NP gave/O the/I-NP boy/I-NP the/B-NP money/I-NP 51

52 Example Applications Phrase Chunking Features –Word form –PoS-tags of: the 2 preceding words, the focus word, and 1 word to the right 52

53 Example Applications Phrase Chunking Results

      Prec  Rec   F-score
NP    92.5  92.2  92.3
VP    91.9  91.7  91.8
ADJP  68.4  65.0  66.7
ADVP  78.0  77.9  77.9
PP    91.9  92.2  92.0

53

54 Example Applications Coreference Marking –COREA project –Demo –Een 21-jarige dronkenlap[3] besloot maandagnacht zijn[5005=3] roes uit te slapen op de snelweg A1[9] bij Naarden. De politie[12=9] trof de man[14=5005] slapend aan achter het stuur van zijn[5017=14] auto[18], terwijl de motor nog draaide –(A 21-year-old drunk decided on Monday night to sleep off his intoxication on the A1 motorway near Naarden. The police found the man asleep behind the wheel of his car, with the engine still running.) 54

55 Machine Learning & CLARIN Web services in work flow systems are created for several MBL-based tools –Orthographic normalization –Morphological analysis –Lemmatization –PoS-Tagging –Chunking –Coreference assignment –Semantic annotation (semantic roles, locative and temporal adverbs) 55

56 Machine learning & CLARIN Web services in work flow systems are created for statistically based tools such as –Speech recognition –Audio mining –All based on SPRAAK –Tomorrow more on this! 56
