# Information Extraction Lecture 7 – Linear Models (Basic Machine Learning) CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS.

## Presentation on theme: "Information Extraction Lecture 7 – Linear Models (Basic Machine Learning) CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS."— Presentation transcript:

Information Extraction Lecture 7 – Linear Models (Basic Machine Learning) CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS

Decision Trees vs. Linear Models Decision Trees are an intuitive way to learn classifiers from data They fit the training data well With heavy pruning, you can control overfitting NLP practitioners often use linear models instead Please read Sarawagi Chapter 3 (Entity Extraction: Statistical Methods) for next time The models discussed in Chapter 3 are linear models, as I will discuss here 2

Decision Trees for NER So far we have seen: How to learn rules for NER A basic idea of how to formulate NER as a classification problem Decision trees Including the basic idea of overfitting the training data 3

Rule Sets as Decision Trees Decision trees are quite powerful It is easy to see that complex rules can be encoded as decision trees For instance, let's go back to border detection in CMU seminars... 4

A Path in the Decision Tree The tree will check if the token to the left of the possible start position has "at" as a lemma Then check if the token after the possible start position is a Digit Then check the second token after the start position is a timeid ("am", "pm", etc) If you follow this path at a particular location in the text, then the decision should be to insert a 6

Linear Models However, in practice decision trees are not used so often in NLP Instead, linear models are used Let me first present linear models Then I will compare linear models and decision trees 7

Binary Classification I'm going to first discuss linear models for binary classification, using binary features We'll take the same scenario as before Our classifier is trying to decide whether we have a tag or not at the current position (between two words in an email) The first thing we will do is encode the context at this position into a feature vector 8

Feature Vector Each feature is true or false, and has a position in the feature vector The feature vector is typically sparse, meaning it is mostly zeros (meaning false) It will represent the full feature space. For instance, consider... 9

11 Our features represent this table using binary variables For instance, consider the lemma column Most features will be false (false = off = 0, these words are used interchangably). The features that will be on (true = on = 1) are: -3_lemma_the -2_lemma_Seminar -1_lemma_at +1_lemma_4 +2_lemma_pm +3_lemma_will

Classification To classify we will take the dot product of the feature vector with a learned weight vector We will say that the class is true (i.e., we should insert a here) if the dot product is >= 0, and false otherwise Because we might want to shift the values, we add a *bias* term, which is always true 12

Feature Vector We might use a feature vector like this: (this example is simplified – really we'd have all features for all positions) 13 101010011011101010011011 Bias term -3_lemma_the -2_lemma_Seminar -1_lemma_at +1_lemma_4 +1_Digit +2_timeid

Weight Vector Now we'd like the dot product to be > 0 if we should insert a tag To encode the rule we looked at before we have three features that we want to have a positive weight -1_lemma_at +1_Digit +2_timeid We can give them weights of 1 Their sum will be three To make sure that we only classify if all three weights are on, let's set the weight on the bias term to -2 14

Dot Product - I 15 101010011011101010011011 Bias term -3_lemma_the -2_lemma_Seminar -1_lemma_at +1_lemma_4 +1_Digit +2_timeid -2 0 1 0 1 To compute the dot product take the product of each row, and sum this

Dot Product - II 101010011011101010011011 Bias term -3_lemma_the -2_lemma_Seminar -1_lemma_at +1_lemma_4 +1_Digit +2_timeid -2 0 1 0 1 1*-2 0*0 1*0 0*0 1*1 1*0 0*0 1*1 1*-2 1*1 ----- 1

Learning the Weight Vector The general learning task is simply to find a good weight vector! This is sometimes also called "training" Basic intuition: you can check weight vector candidates to see how well they classify the training data Better weights vectors get more of the training data right So we need some way to make (smart) changes to the weight vector, such that we annotate more and more of the training data right I will talk about this next time 17

Feature Extraction We run feature extraction to get the feature vectors for each position in the text We typically use a text representation to represent true values (which are sparse) We usually define feature templates which describe the feature to be extracted and give the name (i.e., -1_lemma_ XXX) -3_lemma_the -2_lemma_Seminar -1_lemma_at +1_lemma_4 +1_Digit +2_timeid STIME -3_lemma_Seminar -2_lemma_at -1_lemma_4 -1_Digit +1_timeid +2_lemma_ will NONE... 18

Training vs. Testing When training the system, we have gold standard labels (see previous slide) When testing the system on new data, we have no gold standard We run the same feature generation first Then we take the dot product to get the classification decision Finally, we usually have to go back to the original text to write the tags into the correct positions 19

Further reading (optional): Tom Mitchell “Machine Learning” (text book) http://www.meta-net.eu/meta- research/training/machine-learning- tutorial/ http://www.meta-net.eu/meta- research/training/machine-learning- tutorial/ 20

Thank you for your attention! 21

Download ppt "Information Extraction Lecture 7 – Linear Models (Basic Machine Learning) CIS, LMU München Winter Semester 2014-2015 Dr. Alexander Fraser, CIS."

Similar presentations