1 CSE 454 Advanced Internet Systems Machine Learning for Extraction Dan Weld

2 Logistics
- Project Warm-Up: due this Sunday
- Computing: $100 EC2 credit per student
- Team Selection: topic survey later this week

3 An (Incomplete) Timeline of UW MR Systems [Figure: systems arranged along a timeline; 2004 is one of the year labels]: ShopBot, WIEN, Mulder, RESOLVER, KnowItAll, REALM, Opine, LEX, TextRunner, Kylin, Luchs, KOG, WOE, USP, OntoUSP, SNE, Velvet, PrecHybrid, Holmes, Sherlock, WebTables, LDA-SP, IIA, UCR, AuContraire, SRL-IE, ReVerb, MultiR, Figer

4 Perspective Crawling the Web Inverted indices Query processing Pagerank computation & ranking Search UI Computational advertising Security & malware Social systems Information extraction

5 Perspective Crawling the Web Inverted indices Query processing Pagerank computation & ranking Search UI Computational advertising Security & malware Social systems Information extraction

6 Today's Outline
- Supervised Learning: compact introduction; learning as function approximation; the need for bias; overfitting; the bias / variance tradeoff; loss functions, regularization, and learning as optimization; the curse of dimensionality; logistic regression
- IE as Supervised Learning
- Features for IE

7 Terminology: examples (features, labels), training set, validation set, test set. Input: { … …}. Output: a hypothesis h: X → Y approximating the target function F: X → Y. Objective: minimize the error of h on (unseen) test examples.

8 Learning is Function Approximation

9 Linear Regression: h_w(x) = w_1 x + w_0

10 Classifier: Y (the range of F) is discrete. Hypothesis: a function for labeling examples. [Figure: points marked Label: + and Label: -, plus several unlabeled points marked ?]

11 Objective: minimize error on test examples. So… how good is this hypothesis? [Figure: a candidate hypothesis with several unlabeled points marked ?]

12 Objective: minimize error on test examples. We only know the training data, so minimize error on that: Loss(F) = Σ_{j=1}^n |y_j - F(x_j)|

13 Generalization. Hypotheses must generalize to correctly classify instances not in the training data. Simply memorizing training examples yields a [consistent] hypothesis that does not generalize.

14 Why is Learning Possible? Experience alone never justifies any conclusion about any unseen instance. Learning occurs when PREJUDICE meets DATA! Learning a "Frobnitz"

15 [Figure: example images labeled "Frobnitz" and "Not a Frobnitz"]

16 Bias. The nice word for prejudice is "bias" (different from "bias" in statistics). What kind of hypotheses will you consider? What is the allowable range of approximation functions? E.g., conjunctions, linear functions. What kind of hypotheses do you prefer? E.g., simple hypotheses (Occam's Razor): few parameters, small parameter values.

17 Fitting a Polynomial

18 Overfitting [Figure: accuracy on training data vs. on test data, plotted against model complexity (e.g., number of parameters in the polynomial)]

19 Bias / Variance Tradeoff (slide from T. Dietterich). Variance: E[(h(x*) - E[h(x*)])^2], how much h(x*) varies between training sets; reducing variance risks underfitting. Bias: E[h(x*)] - f(x*), which describes the average error of h(x*); reducing bias risks overfitting.

20 Learning as Optimization. Loss function: Loss(h, data) = error(h, data) + complexity(h), i.e. error + regularization; minimize loss over the training data. Optimization methods: closed form, greedy search, gradient ascent.

21 Effect of Regularization: Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))^2 + λ Σ_{i=1}^k |w_i|, shown for ln λ = -25

22 Regularization: [figure comparing two fits]

23 Curse of Dimensionality: in high dimensions, intuitions fail and it becomes hard to distinguish hypotheses.

24 A Great Learning Algorithm Logistic Regression

25 Univariate Linear Regression: h_w(x) = w_1 x + w_0; Loss(h_w) = Σ_{j=1}^n L2(y_j, h_w(x_j)) = Σ_{j=1}^n (y_j - h_w(x_j))^2 = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))^2

26 Understanding Weight Space: h_w(x) = w_1 x + w_0, Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))^2 [Figure: the loss surface plotted over the weights (w_0, w_1)]

27 Understanding Weight Space (continued): h_w(x) = w_1 x + w_0, Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))^2 [Figure: another view of the loss surface over (w_0, w_1)]

28 Finding Minimum Loss: with h_w(x) = w_1 x + w_0 and Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))^2, find argmin_w Loss(h_w) by setting ∂Loss(h_w)/∂w_0 = 0 and ∂Loss(h_w)/∂w_1 = 0.

29 Unique Solution! For h_w(x) = w_1 x + w_0, argmin_w Loss(h_w) is w_1 = [N Σ(x_j y_j) - (Σ x_j)(Σ y_j)] / [N Σ(x_j^2) - (Σ x_j)^2] and w_0 = [Σ y_j - w_1 Σ x_j] / N
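A minimal Python sketch of this closed-form solution (the function name and sample data are illustrative, not from the course materials):

```python
# Closed-form univariate least squares, following the slide's formulas:
#   w1 = (N*sum(x*y) - sum(x)*sum(y)) / (N*sum(x^2) - sum(x)^2)
#   w0 = (sum(y) - w1*sum(x)) / N

def fit_univariate(xs, ys):
    """Return (w0, w1) minimizing sum_j (y_j - (w1*x_j + w0))^2."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return w0, w1

if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [1.9, 4.1, 6.0, 8.2]        # roughly y = 2x
    print(fit_univariate(xs, ys))    # w1 should come out close to 2
```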

30 Could also Solve Iteratively: to find argmin_w Loss(h_w), set w to any point in weight space, then loop until convergence: for each w_i in w, w_i := w_i - α ∂Loss(w)/∂w_i
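A corresponding sketch of the iterative version for the same univariate model; the learning rate, tolerance, and starting point are arbitrary illustrative choices:

```python
def fit_univariate_gd(xs, ys, alpha=0.01, tol=1e-8, max_iters=100_000):
    """Gradient descent on Loss(w) = sum_j (y_j - (w1*x_j + w0))^2."""
    w0, w1 = 0.0, 0.0                    # start anywhere in weight space
    for _ in range(max_iters):
        # Partial derivatives of the squared loss with respect to w0 and w1.
        g0 = sum(-2 * (y - (w1 * x + w0)) for x, y in zip(xs, ys))
        g1 = sum(-2 * (y - (w1 * x + w0)) * x for x, y in zip(xs, ys))
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1   # w_i := w_i - alpha * dLoss/dw_i
        if abs(g0) < tol and abs(g1) < tol:         # (near) zero gradient: converged
            break
    return w0, w1
```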

31 Multivariate Linear Regression: h_w(x_j) = w_0 + Σ_i w_i x_{j,i} = Σ_i w_i x_{j,i} = w^T x_j (taking x_{j,0} = 1). Unique solution: argmin_w Loss(h_w) = (X^T X)^{-1} X^T y. Problem….
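A NumPy sketch of the normal-equation solution, with a column of 1s prepended so w_0 acts as the intercept (variable names and the toy data are illustrative):

```python
import numpy as np

def fit_multivariate(X, y):
    """Ordinary least squares via the normal equations: w = (X^T X)^-1 X^T y."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_{j,0} = 1 for the intercept w_0
    return np.linalg.solve(X1.T @ X1, X1.T @ y)     # solve rather than invert, for stability

# Example: two features, y = 1 + 2*x1 - 3*x2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + 0.01 * rng.normal(size=100)
print(fit_multivariate(X, y))   # approximately [1, 2, -3]
```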

32 Overfitting? Regularize!! Penalize high weights: Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))^2 + λ Σ_{i=1}^k w_i^2, or alternatively Loss(h_w) = Σ_{j=1}^n (y_j - (w_1 x_j + w_0))^2 + λ Σ_{i=1}^k |w_i|
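A sketch of the L2-penalized ("ridge") variant, which keeps a closed form by adding λ on the diagonal; leaving the intercept unpenalized is a design choice made here, and the L1 penalty has no closed form, so it is normally handled with iterative solvers:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Minimize sum_j (y_j - w^T x_j)^2 + lam * sum_{i>=1} w_i^2 in closed form."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # column of 1s for the intercept w_0
    penalty = lam * np.eye(X1.shape[1])
    penalty[0, 0] = 0.0                              # leave the intercept unpenalized
    return np.linalg.solve(X1.T @ X1 + penalty, X1.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=30)
print(fit_ridge(X, y, lam=10.0))   # weights are shrunk toward zero relative to plain OLS
```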

33 Regularization [Figure: L1 vs. L2 penalties]

34 Back to Classification [Figure: regions where P(edible|X)=1 and P(edible|X)=0, separated by a decision boundary]

35 Logistic Regression: learn P(Y|X) directly! Assume a particular functional form. [Figure: a step function jumping from P(Y)=0 to P(Y)=1] Not differentiable…

36 Logistic Regression: learn P(Y|X) directly! Assume a particular functional form: the logistic function, aka the sigmoid.

37 Logistic Function in n Dimensions: the sigmoid applied to a linear function of the data, P(Y=1|X) = 1 / (1 + exp(-(w_0 + Σ_i w_i X_i))). Features can be discrete or continuous!
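A tiny sketch of this functional form, written as P(Y=1|x,w) = 1 / (1 + exp(-(w_0 + Σ_i w_i x_i))); the weights and feature values below are made up:

```python
import math

def sigmoid(z):
    """The logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def p_y1(x, w, w0):
    """P(Y=1 | x, w): sigmoid applied to a linear function of the features."""
    return sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, x)))

print(p_y1([1.0, 2.0], w=[0.5, -0.25], w0=0.1))   # a value strictly between 0 and 1
```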

38 Understanding Sigmoids [Figure: sigmoid curves for (w_0=0, w_1=-1), (w_0=-2, w_1=-1), and (w_0=0, w_1=-0.5)]

39 Very convenient! The logistic form implies a linear classification rule! ©Carlos Guestrin

40 Logistic regression more generally: the multiclass case, where Y ∈ {y_1, …, y_R}; learn a separate weight vector for each class y_k with k < R.

41 Loss Functions: Likelihood vs. Conditional Likelihood. Generative (Naïve Bayes) loss function: data likelihood. Discriminative (Logistic Regression) loss function: conditional data likelihood. Discriminative models can't compute P(x_j|w)! Or, … "they don't waste effort learning P(X)": they focus only on P(Y|X), which is all that matters for classification. ©Carlos Guestrin

42 Expressing Conditional Log Likelihood: l(w) = Σ_j [ y_j ln P(Y=1|x_j,w) + (1 - y_j) ln P(Y=0|x_j,w) ]. The coefficient y_j is 1 when the correct answer is 1 and (1 - y_j) is 1 when the correct answer is 0, so each example contributes the log probability of its correct label. With the logistic form: ln P(Y=0|X,w) = -ln(1 + exp(w_0 + Σ_i w_i X_i)) and ln P(Y=1|X,w) = w_0 + Σ_i w_i X_i - ln(1 + exp(w_0 + Σ_i w_i X_i)). ©Carlos Guestrin

43 Expressing Conditional Log Likelihood (continued): ln P(Y=0|X,w) = -ln(1 + exp(w_0 + Σ_i w_i X_i)); ln P(Y=1|X,w) = w_0 + Σ_i w_i X_i - ln(1 + exp(w_0 + Σ_i w_i X_i))

44 Maximizing Conditional Log Likelihood. Bad news: there is no closed-form solution that maximizes l(w). Good news: l(w) is a concave function of w, so there are no suboptimal local optima, and concave functions are easy to optimize. ©Carlos Guestrin

45 Optimizing Concave Functions: Gradient Ascent. The conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent. The gradient is ∇_w l(w) = [∂l(w)/∂w_0, …, ∂l(w)/∂w_n]; with learning rate η > 0, the update rule is w_i := w_i + η ∂l(w)/∂w_i. Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading). ©Carlos Guestrin
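A small gradient-ascent sketch for maximizing the conditional log likelihood, using the P(Y=1|x,w) = sigmoid(w·x) convention from the previous snippet; the gradient Σ_j x_{j,i}(y_j - P(Y=1|x_j,w)) follows from differentiating l(w), and the learning rate, iteration count, and toy data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eta=1.0, iters=2000):
    """Gradient ascent on the (averaged) conditional log likelihood
    l(w) = sum_j [ y_j ln P(Y=1|x_j,w) + (1 - y_j) ln P(Y=0|x_j,w) ]."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 so w[0] plays the role of w_0
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = sigmoid(X1 @ w)                          # P(Y=1 | x_j, w) for every example
        grad = X1.T @ (y - p) / len(y)               # dl/dw_i = sum_j x_{j,i} (y_j - p_j), averaged
        w = w + eta * grad                           # gradient *ascent*: step uphill
    return w

# Toy data: the true label is 1 exactly when x1 + x2 > 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w = train_logistic(X, y)
p = sigmoid(np.hstack([np.ones((len(y), 1)), X]) @ w)
print("training accuracy:", ((p > 0.5) == (y == 1)).mean())
```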

46 Earthquake or Nuclear Test? [Figure: two classes of points in (x_1, x_2) space] The logistic form implies a linear classification rule: if P(Y=0|X,w) / P(Y=1|X,w) > 1, then predict Y=0; equivalently, predict Y=0 when w_0 + Σ_i w_i X_i < 0.

47 Logistic Regression with Initial Weights: w_0=20, w_1=-5, w_2=10. [Figure: the corresponding decision boundary over the (x_1, x_2) data, and the current position on the l(w) surface over (w_0, w_1)] Loss(H_w) = Error(H_w, data); minimizing error corresponds to maximizing l(w) = ln P(D_Y | D_X, H_w).

48 Gradient Ascent: after updates, w_0=40, w_1=-10, w_2=5. [Figure: the updated decision boundary over the (x_1, x_2) data, and the new position on the l(w) surface] Maximize l(w) = ln P(D_Y | D_X, H_w).

49 Details. [Figure: successive gradient-ascent steps on the l(w) surface over (w_0, w_1)] Update rule: w_i := w_i + η ∂l(w)/∂w_i.

50 IE as Classification Citigroup has taken over EMI, the British … Citigroup’s acquisition of EMI comes just ahead of … Google’s Adwords system has long included ways to connect to Youtube

51 Preprocessed Data Files. Each line corresponds to a sentence: "John likes eating sausage."
- tokens (after tokenization): John likes eating sausage.

52 Preprocessed Data Files. Each line corresponds to a sentence: "John likes eating sausage."
- tokens (after tokenization): John likes eating sausage.
- pos (part-of-speech tags): John/NNP likes/VBZ eating/VBG sausage/NN ./.
Grade school: "9 parts of speech in English": noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, interjection. But: plurals, possessives, case, tense, aspect, ….

53 Preprocessed Data Files. Each line corresponds to a sentence: "John likes eating sausage."
- tokens (after tokenization): John likes eating sausage.
- pos (part-of-speech tags): John/NNP likes/VBZ eating/VBG sausage/NN ./.
- ner (named entities)

54 Preprocessed Data Files. Each line corresponds to a sentence: "John likes eating sausage."
- tokens (after tokenization): John likes eating sausage.
- pos (part-of-speech tags): John/NNP likes/VBZ eating/VBG sausage/NN ./.
- parse (automatic analysis of grammatical structure, stored in one line): (S (NP (NNP John)) (VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) (. .))
- dep (grammatical dependencies)

55 Preprocessed Data Files. Each line corresponds to a sentence: "John likes eating sausage."
- tokens (after tokenization): John likes eating sausage.
- pos (part-of-speech tags): John/NNP likes/VBZ eating/VBG sausage/NN ./.
- parse (automatic analysis of grammatical structure, stored in one line): (S (NP (NNP John)) (VP (VBZ likes) (S (VP (VBG eating) (NP (NN sausage))))) (. .))
- dep (grammatical dependencies)
- ner (named entities)
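One way to produce this kind of per-sentence annotation is sketched below, assuming the spaCy library and its en_core_web_sm model are installed; the course's preprocessed files may well have been generated with different tools, and spaCy gives dependency parses rather than the constituency parse shown above:

```python
# Sketch of token / POS / NER / dependency annotations for one sentence,
# assuming spaCy and "en_core_web_sm" are installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John likes eating sausage.")

tokens = [t.text for t in doc]                       # tokenization
pos = [f"{t.text}/{t.tag_}" for t in doc]            # Penn Treebank tags (NNP, VBZ, ...)
ner = [(ent.text, ent.label_) for ent in doc.ents]   # named entities, e.g. ("John", "PERSON")
deps = [(t.text, t.dep_, t.head.text) for t in doc]  # grammatical dependencies

print(tokens)
print(" ".join(pos))
print(ner)
print(deps)
```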

56 Text as Vectors. Each document j can be viewed as a vector of frequency values, with one component for each word (or phrase). So we have a vector space: words (or phrases) are the axes, documents live in this space, and even with stemming there may be 20,000+ dimensions.

57 Vector Space Representation. Documents that are close to the query (measured using a vector-space metric) are returned first. [Figure: documents and a query plotted as points in the vector space] slide from Raghavan, Schütze, Larson
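A minimal sketch of the bag-of-words vector space and closeness-to-query ranking described on the last two slides (pure Python; the documents and query are illustrative):

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector: one component per word."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

docs = [
    "machine learning for information extraction",
    "crawling the web and building inverted indices",
    "logistic regression is a learning algorithm",
]
query = tf_vector("machine learning algorithm")
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(d)), reverse=True)
print(ranked[0])   # the document closest to the query is returned first
```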

58 Lemmatization. Reduce inflectional/variant forms to the base form:
- am, are, is → be
- car, cars, car's, cars' → car
- "the boy's cars are different colors" → "the boy car be different color"
slide from Raghavan, Schütze, Larson
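A quick sketch using NLTK's WordNet lemmatizer, assuming nltk and its wordnet data are installed; note that recovering "are → be" requires telling the lemmatizer the word is a verb:

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))            # car  (default part of speech: noun)
print(lemmatizer.lemmatize("colors"))          # color
print(lemmatizer.lemmatize("are", pos="v"))    # be   (needs the verb POS hint)
```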

59 Stemming Reduce terms to their “roots” before indexing – language dependent – e.g., automate(s), automatic, automation all reduced to automat. for example compressed and compression are both accepted as equivalent to compress. for exampl compres and compres are both accept as equival to compres. slide from Raghavan, Schütze, Larson

60 Porter's algorithm. A common algorithm for stemming English: conventions + 5 phases of reductions; phases are applied sequentially, and each phase consists of a set of commands. Sample convention: of the rules in a compound command, select the one that applies to the longest suffix. Porter's stemmer available: http://www.sims.berkeley.edu/~hearst/irbook/porter.html slide from Raghavan, Schütze, Larson

61 Typical rules in Porter: sses → ss; ies → i; ational → ate; tional → tion. slide from Raghavan, Schütze, Larson
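Porter's algorithm is also available off the shelf; a sketch using NLTK's implementation (assuming nltk is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "relational", "conditional", "compression", "compressed"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress (sses -> ss) and ponies -> poni (ies -> i);
# compression and compressed both stem to compress, matching the earlier slide's example.
```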

62 Challenges: Sandy, Sanded, Sander → Sand ??? slide from Raghavan, Schütze, Larson


65 Why Extract Temporal Information? Many relations and events are temporally bounded: a person's place of residence or employer, an organization's members, the duration of a war between two countries, the precise time at which a plane landed, … Temporal information distribution: one of every fifty lines of database application code involves a date or time value (Snodgrass, 1998), and each news document in PropBank (Kingsbury and Palmer, 2002) includes eight temporal arguments. Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

66 Time-intensive Slot Types
Person: per:alternate_names, per:date_of_birth, per:age, per:country_of_birth, per:stateorprovince_of_birth, per:city_of_birth, per:origin, per:date_of_death, per:country_of_death, per:stateorprovince_of_death, per:city_of_death, per:cause_of_death, per:countries_of_residence, per:stateorprovinces_of_residence, per:cities_of_residence, per:schools_attended, per:title, per:member_of, per:employee_of, per:religion, per:spouse, per:children, per:parents, per:siblings, per:other_family, per:charges
Organization: org:alternate_names, org:political/religious_affiliation, org:top_members/employees, org:number_of_employees/members, org:members, org:member_of, org:subsidiaries, org:parents, org:founded_by, org:founded, org:dissolved, org:country_of_headquarters, org:stateorprovince_of_headquarters, org:city_of_headquarters, org:shareholders, org:website
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

67 Temporal Expression Examples (Reference Date = December 8, 2012). Expression → value in TIMEX format:
- December 8, 2012 → 2012-12-08
- Friday → 2012-12-07
- today → 2012-12-08
- the 1990's → 199X
- midnight, December 8, 2012 → 2012-12-08T00:00:00
- 5pm → 2012-12-08T17:00
- the previous day → 2012-12-07
- last October → 2012-10
- last autumn → 2011-FA
- last week → 2012-W48
- Thursday evening → 2012-12-06TEV
- three months ago → 2012-09
Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

68 Temporal Expression Extraction. Rule-based (Strötgen and Gertz, 2010; Chang and Manning, 2012; Do et al., 2012). Machine learning: risk minimization model (Boguraev and Ando, 2005); conditional random fields (Ahn et al., 2005; UzZaman and Allen, 2010). State of the art: about 95% F-measure for extraction and 85% F-measure for normalization. Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial
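A toy rule-based sketch in the spirit of the systems cited above: a handful of regex patterns plus a reference date for normalization. Real systems such as HeidelTime or SUTime cover far more expressions; the patterns, function name, and example sentence here are all illustrative:

```python
import re
from datetime import date, timedelta

REFERENCE = date(2012, 12, 8)   # document creation date used for normalization

MONTHS = {m: i + 1 for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june",
     "july", "august", "september", "october", "november", "december"])}

def extract_timex(text):
    """Return (expression, normalized TIMEX-style value) pairs found in text."""
    results = []
    # Explicit dates like "December 8, 2012".
    for m in re.finditer(r"\b(January|February|March|April|May|June|July|August|"
                         r"September|October|November|December)\s+(\d{1,2}),\s+(\d{4})\b", text):
        month, day, year = MONTHS[m.group(1).lower()], int(m.group(2)), int(m.group(3))
        results.append((m.group(0), f"{year:04d}-{month:02d}-{day:02d}"))
    # A few relative expressions resolved against the reference date.
    if re.search(r"\btoday\b", text, re.I):
        results.append(("today", REFERENCE.isoformat()))
    if re.search(r"\bthe previous day\b", text, re.I):
        results.append(("the previous day", (REFERENCE - timedelta(days=1)).isoformat()))
    for m in re.finditer(r"\b(\d+|three)\s+months\s+ago\b", text, re.I):
        n = 3 if m.group(1).lower() == "three" else int(m.group(1))
        month = (REFERENCE.month - n - 1) % 12 + 1
        year = REFERENCE.year + (REFERENCE.month - n - 1) // 12
        results.append((m.group(0), f"{year:04d}-{month:02d}"))
    return results

print(extract_timex("The treaty was signed on December 8, 2012, three months ago she left."))
```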

69 Ordering events in discourse: (1) John entered the room at 5:00pm. (2) It was pitch black. (3) It had been three days since he'd slept. [Figure: timeline relating "now", the event "John entered the room" at 5pm, the state "pitch black", and the state "John slept" with a duration of 3 days] Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

70 Ordering events in time. Speech (S), Event (E), & Reference (R) time (Reichenbach, 1947). Tense relates R and S; grammatical aspect relates R and E. R is associated with temporal anaphora (Partee, 1984). Order events by comparing R across sentences: "By the time Boris noticed his blunder, John had (already) won the game."
Sentence | Tense | Order
"John wins the game" | Present | E,R,S simultaneous
"John won the game" | Simple Past | E,R before S

71 Types of eventualities. [Chart from (Dölling, 2011)] Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

72 Inter-eventuality relations. A boundary begins/ends a happening; a boundary culminates an event; a moment is the reduction of an episode; a state is the result of a change; a habitual state is realized by a class of occurrences; a process is made of event constituents; … Chart from (Dölling, 2011). Slide from Dan Roth, Heng Ji, Taylor Cassidy, Quang Do TIE Tutorial

