
(Linear-chain) Conditional Random Fields



Presentation on theme: "(Linear-chain) Conditional Random Fields"— Presentation transcript:

1 (Linear-chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize the conditional probability of outputs given inputs.
[Figure: finite state model / graphical model. FSM states (OTHER, PERSON, ORG, TITLE, ...) form the output sequence y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}; the observations x_{t-1}, ..., x_{t+3} form the input sequence, e.g. "... said Veght a Microsoft VP ...".]

A CRF is simply an undirected graphical model trained to maximize a conditional probability. First explorations with these models centered on finite state models, represented as linear-chain graphical models, with the joint probability distribution over the state sequence y calculated as a normalized product over potentials on cliques of the graph. As is traditional in NLP and other application areas, these potentials are defined as log-linear combinations of weights on features of the clique values. The chief excitement from an application point of view is the ability to use rich and arbitrary features of the input without complicating inference. This has yielded many good results and an explosion of interest at many different conferences, for example:
Noun phrase and named entity recognition [HLT'03], [CoNLL'03]
Segmenting tables in textual government reports (85% reduction in error over HMMs)
Asian word segmentation [COLING'04], [ACL'04]
IE from research papers [HLT'04]
Object classification in images [CVPR '04], [NIPS '04]
Protein structure prediction [ICML'04]
IE from bioinformatics text [Bioinformatics '04], ...

2 Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
[Figure: finite state model / graphical model with transitions among states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}; generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.]
Parameters, for all states S = {s1, s2, ...}:
Start state probabilities: P(s_t)
Transition probabilities: P(s_t | s_{t-1})
Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize the probability of the training observations (with a prior).

HMMs are the standard tool: a finite state machine with probabilities on transitions and emissions. They operate by starting in a state, rolling a die, emitting a symbol, and so on, simultaneously generating a state sequence and an observation sequence. The Markov and independence assumptions are as depicted in this graphical model: the random variable at each node takes on the values of the states in the FSM. Emissions are traditionally atomic entities, like words. We say this is a generative model because it has parameters for the probability of producing the observed emissions (given the state).
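To make the parameterization above concrete, here is a minimal sketch, with hypothetical toy numbers for the start, transition, and emission tables, that computes the HMM joint probability P(s, o) of a state sequence and an observation sequence.

```python
import numpy as np

# Hypothetical toy parameters: state 0 = "background", state 1 = "person name".
start = np.array([0.8, 0.2])                 # P(s_1)
trans = np.array([[0.9, 0.1],                # P(s_t | s_{t-1}), rows = previous state
                  [0.4, 0.6]])
emit = np.array([[0.4, 0.1, 0.1, 0.4],       # P(o_t | s_t), one row per state
                 [0.1, 0.5, 0.3, 0.1]])
vocab = {"Yesterday": 0, "Rich": 1, "Caruana": 2, "spoke": 3}

def joint_prob(state_seq, obs_seq):
    """P(s, o) = P(s_1) P(o_1 | s_1) * prod_t P(s_t | s_{t-1}) P(o_t | s_t)."""
    p = start[state_seq[0]] * emit[state_seq[0], obs_seq[0]]
    for t in range(1, len(obs_seq)):
        p *= trans[state_seq[t - 1], state_seq[t]] * emit[state_seq[t], obs_seq[t]]
    return p

obs = [vocab[w] for w in ["Yesterday", "Rich", "Caruana", "spoke"]]
print(joint_prob([0, 1, 1, 0], obs))   # background, person, person, background
```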

3 IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Rich Caruana spoke this example sentence.
and a trained HMM (with states such as person name, location name, and background), find the most likely state sequence (Viterbi):
Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name: Rich Caruana

How do these apply to IE? Given an HMM, some of whose states are associated with certain database fields, we find the most likely state sequence; then any words Viterbi says were generated from the field-designated states, we extract as values for those fields. HMMs are an old standard, but...
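A minimal Viterbi sketch of the decoding step described above, reusing the hypothetical start/trans/emit tables from the previous sketch; the "person name" state index and the extraction step in the trailing comments are illustrative.

```python
import numpy as np

def viterbi(start, trans, emit, obs):
    """Most likely state sequence under an HMM, via dynamic programming."""
    n_states, T = len(start), len(obs)
    delta = np.zeros((T, n_states))               # best log-prob ending in each state
    back = np.zeros((T, n_states), dtype=int)     # argmax back-pointers
    delta[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans)   # (prev_state, cur_state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# With the toy tables from the previous sketch (state 1 = person name):
# words = ["Yesterday", "Rich", "Caruana", "spoke"]
# states = viterbi(start, trans, emit, [vocab[w] for w in words])
# person_words = [w for w, s in zip(words, states) if s == 1]
```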

4 We want More than an Atomic View of Words
Would like a richer representation of text: many arbitrary, overlapping features of the words, such as:
identity of word
ends in "-ski"
is capitalized
is part of a noun phrase
is in a list of city names
is under node X in WordNet
is in bold font
is indented
is in hyperlink anchor
last person name was female
next two words are "and Associates"
[Figure: states S_{t-1}, S_t, S_{t+1} over observations with overlapping features such as: is "Wisniewski", part of noun phrase, ends in "-ski".]

We prefer a richer representation, one that allows for multiple, arbitrary features. In many cases, for example names and other large-vocabulary fields, the identity of the word alone is not very predictive, because you have probably never seen it before, for example "Wisniewski" (not a city). In many other cases, such as table segmentation, the relevant features are not really the words at all, but attributes of the line as a whole (emissions = whole lines of text at a time).
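A small sketch of the kind of overlapping word features the slide lists. The lexicon and the exact feature names are made up for illustration; a real extractor would also include formatting and layout features.

```python
# Hypothetical lexicon; a real system would load city lists, WordNet, etc.
CITY_LEXICON = {"boston", "seattle", "cape town"}

def word_features(words, t):
    """Indicator features of the word at position t and its neighbors."""
    w = words[t]
    nxt = words[t + 1].lower() if t + 1 < len(words) else "<END>"
    return {
        "word=" + w.lower(): 1,
        "ends-in-ski": int(w.lower().endswith("ski")),
        "capitalized": int(w[:1].isupper()),
        "in-city-lexicon": int(w.lower() in CITY_LEXICON),
        "contains-digit": int(any(c.isdigit() for c in w)),
        "next-word=" + nxt: 1,
    }

print(word_features(["said", "Wisniewski", "yesterday"], 1))
```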

5 Problems with Richer Representation and a Joint Model
These arbitrary features are not independent:
Multiple levels of granularity (chars, words, phrases)
Multiple dependent modalities (words, formatting, layout)
Past & future
Two choices:
Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
Ignore the dependencies. This causes "over-counting" of evidence (a la naïve Bayes), which is a big problem when combining evidence, as in Viterbi! (Similar issues arise in bio-sequence modeling.)
[Figure: two graphical models over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}, one per choice.]
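A tiny numeric illustration, with hypothetical numbers, of the over-counting problem: counting the same perfectly correlated evidence twice, as a naïve Bayes independence assumption would, makes the posterior far more confident than a single observation of the feature justifies.

```python
import numpy as np

p_feat_given_person, p_feat_given_other = 0.9, 0.2
prior = np.array([0.5, 0.5])                         # [person, other]
likelihood = np.array([p_feat_given_person, p_feat_given_other])

one_feature = prior * likelihood                     # evidence counted once
two_copies = prior * likelihood ** 2                 # same evidence counted twice

print(one_feature / one_feature.sum())   # ~[0.82, 0.18]
print(two_copies / two_copies.sum())     # ~[0.95, 0.05]  -> overconfident
```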

6 Conditional Sequence Models
We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
Can examine features, but is not responsible for generating them.
Don't have to explicitly model their dependencies.
Don't "waste modeling effort" trying to generate what we are given at test time anyway.

7 From HMMs to Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Joint (HMM): states S_{t-1}, S_t, S_{t+1}, ... generate observations O_{t-1}, O_t, O_{t+1}, ...:
P(s, o) = prod_t P(s_t | s_{t-1}) P(o_t | s_t)
Conditional: the same chain of states, now conditioned on the observations:
P(s | o) = (1/Z(o)) prod_t exp( sum_k lambda_k f_k(s_{t-1}, s_t, o, t) )
where Z(o) normalizes over all state sequences.
(A super-special case of Conditional Random Fields.) Set parameters by maximum likelihood, using an optimization method on the log-likelihood L.

For context, we begin again with our old friend the HMM, the joint model; to simplify the drawing, the many feature variables are temporarily collapsed into one. We want the conditional, so divide by the marginal probability of the observation sequence. We don't want to estimate P(o|s): there is no reason we need a probability there, so make it simply a potential, with a normalizer Z. Use a sparse representation for the potential function, a log-linear model; each parameter lambda corresponds to P(feature | state) but does not need to be a probability. The result is an undirected graphical model that we train to maximize the conditional probability. This is a super-special case because, as discussed before, we can use features of the past and future. Here is a crash course in conditional-probability FSMs; don't let the equations scare you, this slide is easier than it looks. In the old, traditional HMM, the probability of the state and observation sequence is a product of each state transition and observation emission in the sequence. MEMMs replace these with conditionals, where we no longer generate the observations but condition on them; this corresponds to reversing the direction of the arrows. What functional form should the potential take? For various reasons, an excellent choice is an exponential function: a sum of features, each with a learned weight, exponentiated and normalized. This functional form, with the normalizer where it is here, corresponds to a graphical model with undirected edges, also known as a Markov Random Field in statistical physics.
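A minimal sketch of the conditional model just defined: each feature function looks at (s_{t-1}, s_t, o, t), each has a weight lambda_k, and Z(o) sums the exponentiated scores over every possible label sequence. Z is brute-forced here purely to keep the sketch short (a dynamic-programming version appears in the sketch after the next slide), and the single feature function at the end is made up for the toy usage.

```python
import math
from itertools import product

def log_score(y, o, feats, lambdas):
    """sum_t sum_k lambda_k * f_k(s_{t-1}, s_t, o, t)  (unnormalized)."""
    total = 0.0
    for t in range(len(o)):
        prev = y[t - 1] if t > 0 else "<START>"
        total += sum(lam * f(prev, y[t], o, t) for f, lam in zip(feats, lambdas))
    return total

def conditional_prob(y, o, feats, lambdas, labels):
    """P(y | o) = exp(score(y, o)) / Z(o), with Z summed over all label sequences."""
    z = sum(math.exp(log_score(list(yp), o, feats, lambdas))
            for yp in product(labels, repeat=len(o)))
    return math.exp(log_score(y, o, feats, lambdas)) / z

# Toy usage with one made-up feature: "current label is PER and word is capitalized".
f_cap_per = lambda prev, cur, o, t: float(cur == "PER" and o[t][:1].isupper())
o = ["Yesterday", "Rich", "Caruana", "spoke"]
print(conditional_prob(["O", "PER", "PER", "O"], o, [f_cap_per], [1.5], ["O", "PER"]))
```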

8 Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
1. FSM special case: a linear chain among the unknowns, with parameters tied across time steps.
States S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}, conditioned on O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}.
2. In general: CRFs = "conditionally-trained Markov network", with arbitrary structure among the unknowns.
3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").

This is the "non-super" special case: the graphical model on the previous slide was itself a special case of this. In general, the observations don't have to be chopped up, and feature functions can ask arbitrary questions about any range of the observation sequence. Exact inference is very efficient with dynamic programming, as long as the graph among the unknowns is a linear chain or a tree. We are not going to get into the method for learning the lambda weights from training data; it is simply a matter of maximum likelihood: this is what we try to maximize, differentiate, and solve with conjugate gradient. Again, dynamic programming makes it efficient.
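A sketch of the dynamic programming the notes refer to: for a linear chain, the normalizer Z(o) can be computed with a forward pass in time linear in the sequence length. Here `log_potentials[t][i, j]` is assumed to already hold the summed weighted features for the transition from state i to state j at position t, with row 0 of the first position standing in for a START state; the toy scores are random and hypothetical.

```python
import numpy as np

def log_partition(log_potentials):
    """log Z for a linear-chain model given per-position log transition scores."""
    T = len(log_potentials)
    alpha = log_potentials[0][0]          # scores out of the START state
    for t in range(1, T):
        # log-sum-exp over the previous state, for every current state
        alpha = np.logaddexp.reduce(alpha[:, None] + log_potentials[t], axis=0)
    return float(np.logaddexp.reduce(alpha))

# Toy usage: 5 positions, 3 states, random (hypothetical) scores.
rng = np.random.default_rng(0)
print(log_partition(rng.normal(size=(5, 3, 3))))
```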

9 Training CRFs
The gradient of the conditional log-likelihood with respect to each weight is:
(feature count using the correct labels) - (expected feature count using the model's predicted labels) - (smoothing penalty)
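A sketch of the gradient this slide describes, for a single training pair (o, y_true): empirical feature counts on the correct labels, minus expected feature counts under the current model, minus a smoothing term. The lambda_k / sigma^2 penalty is an assumption about the exact form of the smoothing (a Gaussian prior), and the expectation is brute-forced over all label sequences to keep the sketch short; real trainers use forward-backward instead. Feature functions follow the same (prev, cur, o, t) signature as the sketch two slides back.

```python
import math
from itertools import product

def counts(y, o, feats):
    """Total value of each feature function over a whole label sequence."""
    return [sum(f(y[t - 1] if t else "<START>", y[t], o, t) for t in range(len(o)))
            for f in feats]

def gradient(o, y_true, feats, lambdas, labels, sigma2=10.0):
    seqs = [list(y) for y in product(labels, repeat=len(o))]
    all_counts = [counts(y, o, feats) for y in seqs]
    weights = [math.exp(sum(l * c for l, c in zip(lambdas, cs))) for cs in all_counts]
    z = sum(weights)
    empirical = counts(y_true, o, feats)                      # correct labels
    expected = [sum(w / z * cs[k] for w, cs in zip(weights, all_counts))
                for k in range(len(feats))]                   # model expectation
    return [emp - ex - lam / sigma2
            for emp, ex, lam in zip(empirical, expected, lambdas)]
```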

10 Linear-chain CRFs vs. HMMs
Comparable computational efficiency for inference.
Features may be arbitrary functions of any or all observations.
Parameters need not fully specify the generation of observations; this can require less training data.
Easy to incorporate domain knowledge.

All features used in HMMs can be used here too; with only those features, the model is just a discriminatively-trained HMM. "Need not fully specify": a strong example of this comes later. "Domain knowledge" means arbitrary features you make up.

11 Table Extraction from Government Reports
[Sample document: an excerpt of a government report on milk production. Narrative text ("Cash receipts from marketings of milk during 1995, at $19.9 billion dollars, was slightly below ... Producer returns averaged $12.93 per hundredweight ... Marketings totaled 154 billion pounds ... An estimated 1.56 billion pounds of milk were used on farms where produced ... Calves were fed 78 percent of this milk ...") is followed by a table, "Milk Cows and Production of Milk and Milkfat: United States", with columns for the number of milk cows, production per cow, percentage of fat in all milk produced, and total production of milk and milkfat, and by footnotes such as "1/ Average number during year, excluding heifers not yet fresh." and "2/ Excludes milk sucked by calves."]

You can find reams and reams of government reports at fedstats.gov, including this piece of a document about cow milk production. It has a complicated structure: rich headers, data rows, and footnotes saying things like "Excludes...".

12 Table Extraction from Government Reports
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from fedstats.gov.
Labels (12 in all): Non-Table, Table Title, Table Header, Table Data Row, Table Section Data Row, Table Footnote, ...
[Sample document: the same milk-production report excerpt shown on the previous slide.]
Features:
Percentage of digit chars
Percentage of alpha chars
Indented
Contains 5+ consecutive spaces
Whitespace in this line aligns with previous line
...
Conjunctions of all previous features, at time offsets {0,0}, {-1,0}, {0,1}, {1,2}.

Our job is to find the tables, separating them from the rest of the text, and then label lines that have different functions in the table, such as title, header, and data row. We do this with a CRF, with one random variable per line of text and a large array of features assessing contents and formatting, as well as, importantly, conjunctions of those features: exponential models are log-LINEAR models, so to form complex decision boundaries we need higher dimensionality. Just as you can't represent XOR with a sum and a threshold, but you can with a sum, conjunctions, and a threshold.
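A sketch of line-level features like the ones listed above, computed for each line of a report, plus conjunctions of the current line's features with the previous line's (one of the offsets mentioned). The thresholds and feature names are illustrative, not the paper's exact feature set.

```python
def base_features(line):
    n = max(len(line), 1)
    return {
        "digit-frac>0.3": int(sum(c.isdigit() for c in line) / n > 0.3),
        "alpha-frac>0.5": int(sum(c.isalpha() for c in line) / n > 0.5),
        "indented": int(line.startswith("  ")),
        "5+-consecutive-spaces": int("     " in line),
    }

def line_features(lines, i):
    cur = base_features(lines[i])
    feats = dict(cur)
    if i > 0:  # conjunctions at offset {-1, 0}, as in the slide
        prev = base_features(lines[i - 1])
        for a, va in prev.items():
            for b, vb in cur.items():
                feats[f"prev:{a} & cur:{b}"] = va * vb
    return feats

sample = ["Milk Cows and Production of Milk and Milkfat:  United States",
          "  1,000 Head     Pounds     Percent     Million Pounds"]
print(line_features(sample, 1))
```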

13 Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                    Line labels,     Table segments,
                    percent correct  F1
HMM                 65%              64%
Stateless MaxEnt    85%              -
CRF                 95%              92%

The rows compared are: an HMM (a generative model); a stateless maximum-entropy model, i.e. a conditional model without the benefit of context from an FSM; a conditional random field without conjunction features (the conjunctions turn out to be pretty important); and the main CRF model as introduced. The story is similar when scoring whole segments being correct rather than individual lines.

14 IE from Research Papers
[McCallum et al '99]
Some of you might know about Cora, a contemporary of CiteSeer. This is a system created by Kamal Nigam, Kristie Seymore, Jason Rennie and myself that mines the web for computer science research papers. You can view search results as summaries built from the extracted data; here you see the automatically extracted titles, authors, institutions, and abstracts.

15 IE from Research Papers
Field-level F1:
Hidden Markov Models (HMMs)            [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)   89.7  [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs) 93.9  [Peng, McCallum, 2004]   (error reduced by about 40%)

HMMs have FSM sequence modeling, but are generative. SVMs are discriminative, but don't do full inference over sequences.

16 Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Labels and examples:
PER   Yayuk Basuki, Innocent Butare
ORG   3M, KDP, Cleveland
LOC   Cleveland, Nirmal Hriday, The Oval
MISC  Java, Basque, 1,000 Lakes Rally
(Nirmal Hriday is a hospice house run by Mother Teresa.)

17 Automatically Induced Features
[McCallum & Li, 2003, CoNLL]
Index  Feature
0      inside-noun-phrase (o_{t-1})
5      stopword (o_t)
20     capitalized (o_{t+1})
75     word=the (o_t)
100    in-person-lexicon (o_{t-1})
200    word=in (o_{t+2})
500    word=Republic (o_{t+1})
711    word=RBI (o_t) & header=BASEBALL
1027   header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298   company-suffix-word (firstmention_{t+2})
4040   location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945   moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474   word=the (o_{t-2}) & word=of (o_t)
(Names such as Merrill Lynch, Lockheed Martin, and Oscar Mayer match the moderately-rare-first-name & very-common-last-name conjunction.)
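The conjunction features above are induced automatically rather than hand-listed. A rough sketch of that idea: propose conjunctions of existing atomic features as candidates and keep the ones with the highest gain. The `likelihood_gain` hook and the toy gain function are assumed placeholders, not the paper's exact scoring procedure.

```python
from itertools import combinations

def induce_conjunctions(atomic_features, likelihood_gain, top_k=100):
    """Keep the top_k candidate conjunctions ranked by a gain score."""
    candidates = [a + " & " + b for a, b in combinations(sorted(atomic_features), 2)]
    return sorted(candidates, key=likelihood_gain, reverse=True)[:top_k]

# Toy usage with a made-up gain function (here: just prefer longer conjunctions):
atoms = {"header=CRICKET", "in-English-county-lexicon", "capitalized", "stopword"}
print(induce_conjunctions(atoms, likelihood_gain=len, top_k=3))
```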

18 Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]
Method                                        F1
HMMs: BBN's Identifinder                      73%
CRFs without feature induction                83%
CRFs with feature induction (LikelihoodGain)  90%

