# CSE 454 Advanced Internet Systems: Machine Learning for Extraction (Dan Weld)


Logistics
– Project warm-up: due this Sunday
– Computing: $100 EC2 credit per student
– Team selection: topic survey later this week

An (Incomplete) Timeline of UW MR Systems, 1997–2011+: ShopBot, WIEN, Mulder, RESOLVER, KnowItAll, REALM, Opine, LEX, TextRunner, Kylin, Luchs, KOG, WOE, USP, OntoUSP, SNE, Velvet, PrecHybrid, Holmes, Sherlock, WebTables, LDA-SP, IIA, UCR, AuContraire, SRL-IE, ReVerb, MultiR, Figer

Perspective: crawling the Web, inverted indices, query processing, PageRank computation & ranking, search UI, computational advertising, security & malware, social systems, information extraction


Today’s Outline
– Supervised learning: compact introduction; learning as function approximation; need for bias; overfitting; bias/variance tradeoff; loss functions, regularization, and learning as optimization; curse of dimensionality; logistic regression
– IE as supervised learning
– Features for IE

Terminology: examples (features and labels), training set, validation set, test set. Input: {… …}. Output: a hypothesis h: X → Y approximating the target function F: X → Y. Objective: minimize the error of h on (unseen) test examples.

Learning is Function Approximation (© Daniel S. Weld)

Linear Regression: h_w(x) = w_1 x + w_0

Classifier: Y (the range of F) is discrete. A hypothesis is a function for labeling examples. [Scatter plot of training points labeled + and -, with several unlabeled ? points to classify.]

Objective: minimize error on test examples. [The same scatter plot with a candidate hypothesis drawn in.] So… how good is this hypothesis?

Objective: minimize error on test examples. We only know the training data, so minimize error on that: Loss(F) = Σ_{j=1}^n |y_j - F(x_j)|

Generalization: hypotheses must generalize to correctly classify instances not in the training data. Simply memorizing training examples yields a [consistent] hypothesis that does not generalize.

Why is Learning Possible? Experience alone never justifies any conclusion about any unseen instance. Learning occurs when PREJUDICE meets DATA! (Example: learning a “Frobnitz”.)

[Figure: examples of a Frobnitz and a non-Frobnitz.]

Bias. The nice word for prejudice is “bias” (different from “bias” in statistics). What kind of hypotheses will you consider? That is, what is the allowable range of approximation functions, e.g., conjunctions or linear functions. What kind of hypotheses do you prefer? E.g., simple hypotheses (Occam’s Razor): few parameters, small parameters.

Fitting a Polynomial

Overfitting. [Plot: accuracy (0.6–0.9) vs. model complexity (e.g., number of parameters in the polynomial), with one curve for training data and one for test data: training accuracy keeps rising while test accuracy peaks and then falls.]

Bias / Variance Tradeoff (slide from T. Dietterich)
Variance: E[(h(x*) - E[h(x*)])²], how much h(x*) varies between training sets; reducing variance risks underfitting.
Bias: E[h(x*)] - f(x*), the average error of h(x*); reducing bias risks overfitting.

Learning as Optimization. Loss function: Loss(h, data) = error(h, data) + complexity(h), i.e., error plus a regularization term; minimize loss over the training data. Optimization methods: closed form, greedy search, gradient descent.

Effect of Regularization: Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))² + λ Σ_{i=1}^k |w_i| [plot shown for ln λ = -25]

Regularization: L1 vs. L2 [figure]

Curse of Dimensionality: in high dimensions, intuitions fail and hypotheses become hard to distinguish.

A Great Learning Algorithm Logistic Regression

Univariate Linear Regression: h_w(x) = w_1 x + w_0. Loss(h_w) = Σ_{j=1}^n L_2(y_j, h_w(x_j)) = Σ_{j=1}^n (y_j - h_w(x_j))² = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))²

Understanding Weight Space: h_w(x) = w_1 x + w_0; Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))² [plot of the loss surface over (w_0, w_1)]


Finding Minimum Loss: for h_w(x) = w_1 x + w_0 and Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))², Argmin_w Loss(h_w) is found where ∂Loss(h_w)/∂w_0 = 0 and ∂Loss(h_w)/∂w_1 = 0.

Unique Solution! For h_w(x) = w_1 x + w_0:
w_1 = (N Σ(x_j y_j) - (Σ x_j)(Σ y_j)) / (N Σ(x_j²) - (Σ x_j)²)
w_0 = ((Σ y_j) - w_1 (Σ x_j)) / N
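The closed-form solution above can be sketched directly in code (a sketch; the function name, toy data, and variable names are mine, not the course's):

```python
# Closed-form solution for univariate linear regression,
# following the slide's formulas for w_1 and w_0.
def fit_univariate(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    w0 = (sy - w1 * sx) / n
    return w0, w1

# Points lying exactly on y = 2x + 1 should recover w1 = 2, w0 = 1.
w0, w1 = fit_univariate([0, 1, 2, 3], [1, 3, 5, 7])
```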

Could also Solve Iteratively: start with w at any point in weight space, then loop until convergence: for each w_i in w, w_i := w_i - α ∂Loss(w)/∂w_i
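The iterative update above can be sketched as plain gradient descent on the squared loss (a sketch; the step size alpha and iteration count are my arbitrary choices, not from the slide):

```python
# Batch gradient descent on sum_j (y_j - (w1*x_j + w0))^2.
def gd_univariate(xs, ys, alpha=0.01, iters=5000):
    w0 = w1 = 0.0
    for _ in range(iters):
        # Partial derivatives of the squared loss w.r.t. w0 and w1.
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in zip(xs, ys))
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in zip(xs, ys))
        w0 -= alpha * g0
        w1 -= alpha * g1
    return w0, w1

# Should converge to the same solution as the closed form: w1 = 2, w0 = 1.
w0, w1 = gd_univariate([0, 1, 2, 3], [1, 3, 5, 7])
```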

Multivariate Linear Regression: h_w(x_j) = w_0 + Σ_i w_i x_{j,i} = wᵀx_j. Unique solution: w* = (XᵀX)⁻¹ Xᵀ y. Problem….

Overfitting? Regularize!! Penalize high weights: Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))² + λ Σ_{i=1}^k w_i². Alternatively: Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))² + λ Σ_{i=1}^k |w_i|

Regularization: [figure contrasting the L1 and L2 penalties]

Back to Classification: [figure showing regions where P(edible|X)=1 and P(edible|X)=0, separated by a decision boundary]

Logistic Regression: learn P(Y|X) directly! Assume a particular functional form. A step function (P(Y)=1 on one side of a threshold, P(Y)=0 on the other) is not differentiable…

Logistic Regression: learn P(Y|X) directly! Assume a particular functional form: the logistic function, aka the sigmoid, σ(z) = 1 / (1 + e^(-z)).

Logistic Function in n Dimensions: the sigmoid applied to a linear function of the data, σ(w_0 + Σ_i w_i X_i). Features can be discrete or continuous!
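As a minimal sketch of the sigmoid-of-a-linear-function form (the weights and features below are made up; the convention P(Y=1|x) = σ(w_0 + w·x) matches the labeling used later in the lecture):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(x, w0, w):
    # P(Y=1 | x) = sigmoid(w0 + sum_i w_i * x_i)
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

# On the decision boundary the linear part is 0, so the probability is 0.5.
p = p_y1([1.0, 2.0], w0=0.0, w=[1.0, -0.5])
```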

Understanding Sigmoids: [plots of σ(w_0 + w_1 x) for (w_0=0, w_1=-1), (w_0=-2, w_1=-1), and (w_0=0, w_1=-0.5)]

Very convenient! The sigmoid form implies a linear classification rule: the decision boundary is where w_0 + Σ_i w_i X_i = 0. (© Carlos Guestrin 2005–2009)

Logistic Regression More Generally: in the general case, Y ∈ {y_1, …, y_R}; learn R-1 sets of weights, and for k < R, P(Y = y_k | X) ∝ exp(w_k0 + Σ_i w_ki X_i), with the remaining class y_R serving as the baseline.

Loss Functions: Likelihood vs. Conditional Likelihood. Generative models (Naïve Bayes) use the data likelihood as the loss function; discriminative models (logistic regression) use the conditional data likelihood. Discriminative models can’t compute P(x_j|w)! Or, put another way, “they don’t waste effort learning P(X)”: they focus only on P(Y|X), which is all that matters for classification.

Expressing Conditional Log Likelihood:
ln P(Y=0|X, w) = -ln(1 + exp(w_0 + Σ_i w_i X_i))
ln P(Y=1|X, w) = w_0 + Σ_i w_i X_i - ln(1 + exp(w_0 + Σ_i w_i X_i))
So l(w) = Σ_j [ y^j ln P(Y=1|x^j, w) + (1 - y^j) ln P(Y=0|x^j, w) ], where y^j is 1 when the correct answer is 1 and 0 when it is 0.
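The two log-probability expressions can be checked numerically (a sketch; the example weights and features are made up). Under this convention the two probabilities must sum to one:

```python
import math

def lin(w0, w, x):
    # The linear part: w0 + sum_i w_i * x_i
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def ln_p_y0(w0, w, x):
    return -math.log(1.0 + math.exp(lin(w0, w, x)))

def ln_p_y1(w0, w, x):
    return lin(w0, w, x) - math.log(1.0 + math.exp(lin(w0, w, x)))

x, w0, w = [1.0, 2.0], 0.5, [1.0, -0.25]
# Exponentiate both log probabilities; they should sum to 1.
total = math.exp(ln_p_y0(w0, w, x)) + math.exp(ln_p_y1(w0, w, x))
```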


Maximizing Conditional Log Likelihood. Bad news: no closed-form solution maximizes l(w). Good news: l(w) is a concave function of w, so there are no local optima, and concave functions are easy to optimize.

Optimizing Concave Functions: Gradient Ascent. The conditional likelihood for logistic regression is concave, so find the optimum with gradient ascent. Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading). Update rule, with learning rate η > 0: w_i ← w_i + η ∂l(w)/∂w_i.
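A minimal sketch of the training loop, using per-example (stochastic) updates rather than batch gradients; the toy data, learning rate eta, and iteration count are my choices, not the lecture's:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, eta=0.5, iters=2000):
    # data: list of (feature_list, label) with label in {0, 1}.
    d = len(data[0][0])
    w0, w = 0.0, [0.0] * d
    for _ in range(iters):
        for x, y in data:
            p = sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))
            # Gradient of the conditional log likelihood: x_i * (y - P(Y=1|x, w))
            w0 += eta * (y - p)
            for i in range(d):
                w[i] += eta * (y - p) * x[i]
    return w0, w

# One-feature toy problem: small x -> class 0, large x -> class 1.
data = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
w0, w = train_logistic(data)
predict = lambda x: sigmoid(w0 + w[0] * x) > 0.5
```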

Earthquake or Nuclear Test? [Scatter plot of seismic readings in the (x_1, x_2) plane.] The sigmoid form implies a linear classification rule: if P(Y=0|X)/P(Y=1|X) > 1, then predict Y=0.

Logistic Regression with Initial Weights: w_0=20, w_1=-5, w_2=10. [Scatter plot with the corresponding decision boundary, and the surface of l(w) over (w_0, w_1).] Loss(h_w) = Error(h_w, data); minimizing error corresponds to maximizing l(w) = ln P(D_Y | D_X, h_w).

Gradient Ascent: after updates, w_0=40, w_1=-10, w_2=5. [Scatter plot with the improved decision boundary, and the corresponding climb on the l(w) surface.] Maximize l(w) = ln P(D_Y | D_X, h_w).

Details: [close-up of the l(w) surface over (w_0, w_1) and the gradient-ascent update rule]

IE as Classification:
– “Citigroup has taken over EMI, the British …” (+)
– “Citigroup’s acquisition of EMI comes just ahead of …” (+)
– “Google’s Adwords system has long included ways to connect to Youtube.” (-)
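A minimal sketch of casting relation extraction as classification: each candidate entity pair becomes a feature vector, labeled +/- as above. The specific features below are illustrative choices of mine, not the course's feature set:

```python
# Turn a (sentence, entity pair) candidate into features for a classifier.
def features(tokens, e1, e2):
    i, j = tokens.index(e1), tokens.index(e2)
    between = tokens[min(i, j) + 1:max(i, j)]
    return {
        "words_between": tuple(between),
        # A crude lexical cue for the acquisition relation (illustrative only).
        "has_acquisition_verb": any(t in {"acquired", "acquisition", "taken"}
                                    for t in between),
        "e1_first": i < j,
    }

f = features("Citigroup has taken over EMI , the British".split(),
             "Citigroup", "EMI")
```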

Preprocessed Data Files. Each line corresponds to one sentence; the running example is "John likes eating sausage." tokens (after tokenization): John likes eating sausage .

pos (part-of-speech tags): John/NNP likes/VBZ eating/VBG sausage/NN ./. Grade school teaches “9 parts of speech in English” (noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, interjection), but real tagsets must also distinguish plurals, possessives, case, tense, aspect, ….

ner (named entity tags), one per token.

parse (automatic analysis of grammatical structure, stored on one line per sentence): (S (NP (NNP John)) (VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) (. .)) dep (grammatical dependencies).

Summary of preprocessed layers: tokens (after tokenization), pos (part-of-speech tags), parse (grammatical structure, one line per sentence), dep (grammatical dependencies), ner (named entities).
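Reading the pos layer shown above is a one-liner per sentence: each token is written as word/TAG (a sketch of the format as described on the slides; the function name is mine):

```python
# Parse one line of the pos layer, e.g. "John/NNP likes/VBZ ... ./."
def parse_pos_line(line):
    pairs = []
    for item in line.split():
        # rpartition handles tokens that themselves contain "/".
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_pos_line("John/NNP likes/VBZ eating/VBG sausage/NN ./.")
```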

Text as Vectors. Each document j can be viewed as a vector of frequency values, with one component for each word (or phrase). So we have a vector space: words (or phrases) are the axes, and documents live in this space; even with stemming, it may have 20,000+ dimensions.
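The document-as-vector idea can be sketched with raw term frequencies (a toy illustration; real collections have tens of thousands of dimensions, as noted above):

```python
from collections import Counter

# Build term-frequency vectors over a shared vocabulary.
def tf_vectors(docs):
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))  # the axes of the vector space
    return vocab, [[c[w] for w in vocab] for c in counts]

vocab, vecs = tf_vectors(["the cat sat", "the cat and the dog"])
```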

Vector Space Representation: documents that are close to the query (measured using a vector-space metric) are returned first. [Figure: query and document vectors in the space.] (slide from Raghavan, Schütze, Larson)
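A common choice for the vector-space metric is cosine similarity; a minimal sketch (the tiny example vectors are made up):

```python
import math

# Cosine similarity between two vectors: dot product over norms.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

query = [1, 1, 0]
docs = {"d1": [2, 2, 0], "d2": [0, 1, 3]}
# Rank documents by closeness to the query; closest first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```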

Lemmatization: reduce inflectional/variant forms to the base form: am, are, is → be; car, cars, car's, cars' → car; “the boy's cars are different colors” → “the boy car be different color”.

Stemming: reduce terms to their “roots” before indexing; language dependent. E.g., automate(s), automatic, automation are all reduced to automat. Applying the stemmer to its own description: “for example compressed and compression are both accepted as equivalent to compress” becomes “for exampl compres and compres are both accept as equival to compres”.

Porter’s algorithm: the common algorithm for stemming English. Conventions plus 5 phases of reductions; phases are applied sequentially, and each phase consists of a set of commands. Sample convention: of the rules in a compound command, select the one that applies to the longest suffix. Porter’s stemmer available: http://www.sims.berkeley.edu/~hearst/irbook/porter.html

Typical rules in Porter: sses → ss; ies → i; ational → ate; tional → tion
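The sample rules above, together with the longest-suffix convention, can be sketched in a few lines (a toy illustration only, not Porter's full multi-phase algorithm):

```python
# The four sample rules from the slide, as (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    # Of the applicable rules, pick the one matching the longest suffix.
    matches = [(old, new) for old, new in RULES if word.endswith(old)]
    if not matches:
        return word
    old, new = max(matches, key=lambda r: len(r[0]))
    return word[:len(word) - len(old)] + new

stemmed = [apply_rules(w)
           for w in ["caresses", "ponies", "relational", "conditional"]]
```

Note that "relational" matches both "ational" and "tional"; the convention selects the longer suffix, giving "relate" rather than "relation".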

Challenges: Sandy, Sanded, Sander → Sand ???

Why Extract Temporal Information? Many relations and events are temporally bounded: a person's place of residence or employer, an organization's members, the duration of a war between two countries, the precise time at which a plane landed, …. Temporal information is also widely distributed: one of every fifty lines of database application code involves a date or time value (Snodgrass, 1998), and each news document in PropBank (Kingsbury and Palmer, 2002) includes eight temporal arguments. (Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do, TIE Tutorial)

Time-Intensive Slot Types.
Person: per:alternate_names, per:date_of_birth, per:age, per:country_of_birth, per:stateorprovince_of_birth, per:city_of_birth, per:origin, per:date_of_death, per:country_of_death, per:stateorprovince_of_death, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended, per:title, per:member_of, per:employee_of, per:religion, per:spouse, per:children, per:parents, per:siblings, per:other_family, per:charges.
Organization: org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website.

Temporal Expression Examples (reference date = December 8, 2012):

| Expression | Value in TIMEX format |
|---|---|
| December 8, 2012 | 2012-12-08 |
| Friday | 2012-12-07 |
| today | 2012-12-08 |
| 1993 | 1993 |
| the 1990's | 199X |
| midnight, December 8, 2012 | 2012-12-08T00:00:00 |
| 5pm | 2012-12-08T17:00 |
| the previous day | 2012-12-07 |
| last October | 2011-10 |
| last autumn | 2011-FA |
| last week | 2012-W48 |
| Thursday evening | 2012-12-06TEV |
| three months ago | 2012-09 |
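The normalization step for a few of the relative expressions in the table can be sketched against the reference date (the rules below are my illustrative choices, e.g. treating "last October" as October of the previous year to match the table; real TIMEX normalizers handle far more):

```python
from datetime import date, timedelta

REFERENCE = date(2012, 12, 8)  # the table's reference date

def normalize(expr, ref=REFERENCE):
    if expr == "today":
        return ref.isoformat()
    if expr == "the previous day":
        return (ref - timedelta(days=1)).isoformat()
    if expr == "last October":
        # "last <month>" resolved to that month of the previous year.
        return f"{ref.year - 1}-10"
    return None  # unhandled expression

values = [normalize(e) for e in ["today", "the previous day", "last October"]]
```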

Temporal Expression Extraction. Rule-based (Strötgen and Gertz, 2010; Chang and Manning, 2012; Do et al., 2012). Machine learning: risk minimization model (Boguraev and Ando, 2005); conditional random fields (Ahn et al., 2005; UzZaman and Allen, 2010). State of the art: about 95% F-measure for extraction and 85% F-measure for normalization.

Ordering Events in Discourse. (1) John entered the room at 5:00pm. (2) It was pitch black. (3) It had been three days since he'd slept. [Timeline: the event “John entered the room” at 5pm; the state “pitch black” holding then; the state “John slept” ending three days earlier; reference time: now.]

Ordering Events in Time. Speech (S), Event (E), and Reference (R) time (Reichenbach, 1947). Tense relates R and S; grammatical aspect relates R and E. R is associated with temporal anaphora (Partee, 1984); order events by comparing R across sentences. Example: “By the time Boris noticed his blunder, John had (already) won the game.”

| Sentence | Tense | Order |
|---|---|---|
| John wins the game | Present | E,R,S |
| John won the game | Simple Past | E,R < S |
| John had won the game | Past Perfect | E < R < S |

Types of Eventualities (chart from Dölling, 2011).

Inter-Eventuality Relations (chart from Dölling, 2011): a boundary begins/ends a happening; a boundary culminates an event; a moment is the reduction of an episode; a state is the result of a change; a habitual state is realized by a class of occurrences; a process is made of event constituents; …