
1 Toward Unified Models of Information Extraction and Data Mining Andrew McCallum Information Extraction and Synthesis Laboratory Computer Science Department University of Massachusetts Amherst Joint work with Aron Culotta, Wei Li, Khashayar Rohanimanesh, Charles Sutton, Ben Wellner

2 Goal: Improving our ability to mine actionable knowledge from unstructured text.

3 Larger Context (pipeline figure): Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Filter, Prediction, Outlier detection, Decision support → Actionable knowledge

4 Problem: Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
1) KD begins from a populated DB, unaware of where the data came from or its inherent uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

5 Solution (pipeline figure, as before): Spider → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Filter, Prediction, Outlier detection, Decision support → Actionable knowledge, now with feedback: Uncertainty Info passed from IE to Data Mining, and Emerging Patterns passed from Data Mining back to IE.

6 Solution: a Unified Model. (Pipeline figure as before, but the Database is replaced by a Probabilistic Model shared by IE and Data Mining.)
Discriminatively-trained undirected graphical models:
  – Conditional Random Fields [Lafferty, McCallum, Pereira]
  – Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Complex inference and learning: just what we researchers like to sink our teeth into!

7 Outline
The need for unified IE and DM.
Review of Conditional Random Fields for IE.
Preliminary steps toward unification:
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Segmentation and Co-reference (Iterated Conditional Sampling)
Conclusions

8 Hidden Markov Models
Finite state model / graphical model: hidden states S_{t-1}, S_t, S_{t+1}, … emit observations O_{t-1}, O_t, O_{t+1}, …
Parameters, for all states S = {s_1, s_2, …}:
  Start state probabilities: P(s_1)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (with prior).
Generates: a state sequence and an observation sequence o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8.
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
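To make the notation concrete, here is a minimal sketch (not from the talk; the states, alphabet, and probability values are all made up) of how the HMM parameters above combine into the joint probability of a state/observation sequence:

```python
import numpy as np

# Toy HMM: all names and values below are illustrative.
states = ["title", "author"]          # S = {s_1, s_2, ...}
start = np.array([0.7, 0.3])          # P(s_1)
trans = np.array([[0.8, 0.2],         # P(s_t | s_{t-1})
                  [0.3, 0.7]])
obs_alphabet = ["Interface", "Laurel", "Agents"]
emit = np.array([[0.5, 0.1, 0.4],     # P(o_t | s_t), a multinomial per state
                 [0.2, 0.7, 0.1]])

def joint_prob(state_seq, obs_seq):
    """P(s, o) = P(s_1) P(o_1|s_1) * prod_t P(s_t|s_{t-1}) P(o_t|s_t)."""
    s, o = state_seq[0], obs_alphabet.index(obs_seq[0])
    p = start[s] * emit[s, o]
    for t in range(1, len(state_seq)):
        s_prev, s_t = state_seq[t - 1], state_seq[t]
        o_t = obs_alphabet.index(obs_seq[t])
        p *= trans[s_prev, s_t] * emit[s_t, o_t]
    return p

print(joint_prob([0, 1, 0], ["Interface", "Laurel", "Agents"]))
```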

9 From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
(A super-special case of Conditional Random Fields.)
Joint (HMM):        P(s, o) = prod_t P(s_t | s_{t-1}) P(o_t | s_t)
Conditional (CRF):  P(s | o) = (1 / Z(o)) prod_t Phi(s_{t-1}, s_t, o_t),
  where Phi(s_{t-1}, s_t, o_t) = exp( sum_k lambda_k f_k(s_{t-1}, s_t, o_t) )
Set parameters by maximum likelihood, using an optimization method on the log-likelihood L.
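A toy sketch of the conditional model above (illustrative features and weights only; a real CRF learns the weights by maximizing conditional log-likelihood and computes Z(o) with dynamic programming rather than the brute-force enumeration used here):

```python
import itertools, math

labels = [0, 1]  # e.g. 0 = OTHER, 1 = NAME (illustrative label set)

def features(y_prev, y, x, t):
    """f_k(y_{t-1}, y_t, x, t): a few hand-made indicator features."""
    word = x[t]
    return {
        ("capitalized", y): float(word[0].isupper()),
        ("transition", y_prev, y): 1.0 if y_prev is not None else 0.0,
    }

# Illustrative weights lambda_k (would normally be learned).
weights = {("capitalized", 1): 2.0, ("capitalized", 0): -1.0,
           ("transition", 1, 1): 1.0, ("transition", 0, 0): 0.5}

def score(y_seq, x):
    """sum_t sum_k lambda_k f_k(y_{t-1}, y_t, x, t)"""
    total, y_prev = 0.0, None
    for t, y in enumerate(y_seq):
        for k, v in features(y_prev, y, x, t).items():
            total += weights.get(k, 0.0) * v
        y_prev = y
    return total

def conditional_prob(y_seq, x):
    """p(y | x) = exp(score(y, x)) / Z(x), with Z(x) computed by enumeration."""
    z = sum(math.exp(score(list(y), x))
            for y in itertools.product(labels, repeat=len(x)))
    return math.exp(score(y_seq, x)) / z

print(conditional_prob([1, 1, 0], ["Brenda", "Laurel", "wrote"]))
```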

10 Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
          :            :        Production of Milk and Milkfat 2/
          :   Number   :-------------------------------------------------------
   Year   :     of     :   Per Milk Cow    :  Percentage   :       Total
          :Milk Cows 1/:-------------------: of Fat in All :------------------
          :            :   Milk  : Milkfat : Milk Produced :   Milk  : Milkfat
--------------------------------------------------------------------------------
          :  1,000 Head    --- Pounds ---       Percent       Million Pounds
   1993   :    9,589      15,704      575        3.66        150,582   5,514.4
   1994   :    9,500      16,175      592        3.66        153,664   5,623.7
   1995   :    9,461      16,451      602        3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

11 Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
          :            :        Production of Milk and Milkfat 2/
          :   Number   :-------------------------------------------------------
   Year   :     of     :   Per Milk Cow    :  Percentage   :       Total
          :Milk Cows 1/:-------------------: of Fat in All :------------------
          :            :   Milk  : Milkfat : Milk Produced :   Milk  : Milkfat
--------------------------------------------------------------------------------
          :  1,000 Head    --- Pounds ---       Percent       Million Pounds
   1993   :    9,589      15,704      575        3.66        150,582   5,514.4
   1994   :    9,500      16,175      592        3.66        153,664   5,694.3
   1995   :    9,461      16,451      602        3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF Labels (12 in all): Non-Table, Table Title, Table Header, Table Data Row, Table Section Data Row, Table Footnote, …
Features: percentage of digit chars; percentage of alpha chars; indented; contains 5+ consecutive spaces; whitespace in this line aligns with previous line; …; conjunctions of all previous features at time offsets {0,0}, {-1,0}, {0,1}, {1,2}.
Corpus: 100+ documents from www.fedstats.gov
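A sketch of how per-line features like those listed above might be computed; the helper below is hypothetical, not the actual feature-extraction code from the SIGIR paper:

```python
import re

def line_features(line, prev_line=""):
    """Per-line features in the spirit of those listed above (illustrative only)."""
    n = max(len(line), 1)
    return {
        "pct_digit": sum(c.isdigit() for c in line) / n,
        "pct_alpha": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "five_plus_spaces": "     " in line,
        # Crude alignment check: whitespace runs starting at the same columns
        # as in the previous line.
        "aligns_with_prev": bool(
            set(m.start() for m in re.finditer(r"\s{2,}", line))
            & set(m.start() for m in re.finditer(r"\s{2,}", prev_line))
        ),
    }

print(line_features("   1993   :    9,589      15,704",
                    "   Year   :     of     :   Per Milk Cow"))
```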

12 Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
                           Line labels,      Table segments,
                           percent correct   F1
  HMM                      65 %              64 %
  Stateless MaxEnt         85 %              -
  CRF w/out conjunctions   52 %              68 %
  CRF                      95 %              92 %
                           Δ error = 85%     Δ error = 77%

13 IE from Research Papers [McCallum et al ‘99]

14 IE from Research Papers: Field-level F1
  Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
  Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]
  Δ error = 40%

15 Main Point #2
Conditional Random Fields were more accurate in practice than a generative model
... on a research paper extraction task,
... and others, including:
  - a table extraction task
  - noun phrase segmentation
  - named entity extraction
  - …

16 Outline
The need for unified IE and DM.
Review of Conditional Random Fields for IE.
Preliminary steps toward unification:
  1. Joint Labeling of Cascaded Sequences (Belief Propagation) – Charles Sutton
  2. Joint Co-reference Resolution (Graph Partitioning) – Aron Culotta
  3. Joint Labeling for Semi-Supervision (Graph Partitioning) – Wei Li
  4. Joint Segmentation and Co-reference (Iterated Conditional Sampling) – Andrew McCallum

17 1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Rohanimanesh, McCallum, ICML 2004]
Layers of labels over the English words: part-of-speech, noun-phrase boundaries, named-entity tags.
Joint prediction of part-of-speech and noun-phrase boundaries in newswire reaches equivalent accuracy with only 50% of the training data.
Inference: tree reparameterization [Wainwright et al, 2002]

18 1b. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, 2004]
Example: "Mr. Ted Green said today …" … "Mary saw Green at …"
14% reduction in error on the most repeated field in email seminar announcements.
Inference: tree reparameterization [Wainwright et al, 2002]

19 2. Joint co-reference among all pairs: Affinity Matrix CRF [McCallum, Wellner, IJCAI WS 2003]
(figure: mentions "... Mr Powell ...", "... Powell ...", "... she ..." linked by pairwise Y/N co-reference variables with affinity scores such as 45, −99, and 11)
25% reduction in error on co-reference of proper nouns in newswire.
Inference: correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]
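A greedy toy sketch of partitioning a mention-pair affinity graph; the scores echo the figure, but the correlational-clustering inference used in the paper is more involved than this heuristic:

```python
# Pairwise affinity scores between mentions: positive = likely coreferent,
# negative = likely distinct.  Numbers mirror the figure but are illustrative.
mentions = ["Mr Powell", "Powell", "she"]
affinity = {("Mr Powell", "Powell"): 45.0,
            ("Mr Powell", "she"): -99.0,
            ("Powell", "she"): 11.0}

def pair_score(a, b):
    return affinity.get((a, b), affinity.get((b, a), 0.0))

def greedy_partition(items):
    """Greedy stand-in for correlational-clustering graph partitioning:
    put each mention in the existing cluster with the best total affinity,
    or start a new cluster if every existing cluster scores negatively."""
    clusters = []
    for m in items:
        scores = [sum(pair_score(m, x) for x in c) for c in clusters]
        if scores and max(scores) > 0:
            clusters[scores.index(max(scores))].append(m)
        else:
            clusters.append([m])
    return clusters

print(greedy_partition(mentions))
# -> [['Mr Powell', 'Powell'], ['she']]: the +11 pull toward "Powell" is
#    outweighed by the -99 push away from "Mr Powell", so "she" stays apart.
```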

20 3. Joint Labeling for Semi-Supervision: Affinity Matrix CRF with prototypes [Li, McCallum, 2003]
(figure: labeled prototype nodes y1, y2 and unlabeled instances x1, x2, x3 linked by pairwise Y/N variables with affinity scores such as 45, −99, and 11)
50% reduction in error on document classification with labeled and unlabeled data.
Inference: correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]

21 4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]
Extraction from and matching of research paper citations, e.g.:
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
(figure: observed citations o, segmentations s, citation attributes c, database field values p, co-reference decisions y, and world knowledge)
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: variant of Iterated Conditional Modes [Besag, 1986]; see also [Marthi, Milch, Russell, 2003]

22 To Charles

23 Citation Segmentation and Coreference Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

24 Citation Segmentation and Coreference
Segment citation fields:
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

25 Citation Segmentation and Coreference
1) Segment citation fields.
2) Resolve coreferent papers (pairwise Y/N decision).
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

26 Incorrect Segmentation Hurts Coreference
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
  (coreferent?)
  Segmentation Quality    Citation Co-reference (F1)
  No Segmentation         .787
  CRF Segmentation        .913
  True Segmentation       .932

27 Incorrect Segmentation Hurts Coreference
  Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
  Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
  (coreferent?)
Solution: Perform segmentation and coreference jointly. Use segmentation uncertainty to improve coreference, and use coreference to improve segmentation.

28 Segmentation + Coreference Model (figure: an observed citation o and its CRF segmentation s)

29 Segmentation + Coreference Model (figure: observed citation o, CRF segmentation s, and citation attributes c)

30 Segmentation + Coreference Model (figure: three citations, each with observed citation o, CRF segmentation s, and citation attributes c)

31 Segmentation + Coreference Model (figure: the three citations as before, with pairwise co-reference variables y connecting their citation attributes)

32 Such a highly connected graph makes exact inference intractable, so…

33 Approximate Inference 1: Loopy Belief Propagation
(figure: nodes v1–v6 with messages m1(v2), m2(v3), m3(v2), m2(v1) passed between nodes)

34 Approximate Inference 1: Loopy Belief Propagation vs. Generalized Belief Propagation
(figure: loopy BP passes messages between individual nodes v1–v6; generalized BP passes messages between regions of nodes v1–v9)
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!
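A small loopy belief propagation sketch on a toy pairwise model with a single loop (all potentials are illustrative), showing the node-to-node messages described above:

```python
import numpy as np

# Toy pairwise model with one loop v1-v2-v3 (all potentials illustrative).
edges = [(0, 1), (1, 2), (2, 0)]
unary = [np.array([0.6, 0.4]), np.array([0.3, 0.7]), np.array([0.5, 0.5])]
pairwise = np.array([[2.0, 1.0],   # favors neighboring nodes agreeing
                     [1.0, 2.0]])

def loopy_bp(iters=20):
    # msgs[(i, j)] = message from node i to node j, a vector over the states of j
    msgs = {(i, j): np.ones(2) for a, b in edges for (i, j) in [(a, b), (b, a)]}
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            # product of messages arriving at i, excluding the one coming from j
            incoming = np.ones(2)
            for (k, tgt) in msgs:
                if tgt == i and k != j:
                    incoming = incoming * msgs[(k, tgt)]
            m = pairwise.T @ (unary[i] * incoming)   # sum out the states of i
            new[(i, j)] = m / m.sum()                # normalize for stability
        msgs = new
    # node beliefs = unary potential times all incoming messages, normalized
    beliefs = []
    for i in range(len(unary)):
        b = unary[i].copy()
        for (k, tgt) in msgs:
            if tgt == i:
                b = b * msgs[(k, tgt)]
        beliefs.append(b / b.sum())
    return beliefs

print(loopy_bp())
```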

35 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]
(figure: nodes v1–v6; v6 is updated while the other variables are held constant)
v6^(i+1) = argmax_{v6} P(v6 | v \ v6)

36 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]
(figure: nodes v1–v6; v5 is updated while the other variables are held constant)
v5^(j+1) = argmax_{v5} P(v5 | v \ v5)

37 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]
(figure: nodes v1–v6; v4 is updated while the other variables are held constant)
v4^(k+1) = argmax_{v4} P(v4 | v \ v4)
But ICM is greedy, and easily falls into local optima.
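A minimal ICM sketch on the same kind of toy pairwise model (illustrative potentials), showing the one-variable-at-a-time argmax update with everything else held constant:

```python
import numpy as np

# Same flavor of toy pairwise model as in the BP sketch (values illustrative).
edges = [(0, 1), (1, 2), (2, 0)]
unary = [np.array([0.6, 0.4]), np.array([0.3, 0.7]), np.array([0.5, 0.5])]
pairwise = np.array([[2.0, 1.0],
                     [1.0, 2.0]])

def local_score(i, xi, assignment):
    """Unnormalized P(v_i = xi | all other variables): only the factors touching v_i."""
    s = unary[i][xi]
    for (a, b) in edges:
        if a == i:
            s *= pairwise[xi, assignment[b]]
        elif b == i:
            s *= pairwise[assignment[a], xi]
    return s

def icm(assignment, sweeps=10):
    """Iterated Conditional Modes: greedily set each variable to its conditional argmax."""
    for _ in range(sweeps):
        changed = False
        for i in range(len(assignment)):
            best = max(range(2), key=lambda xi: local_score(i, xi, assignment))
            if best != assignment[i]:
                assignment[i], changed = best, True
        if not changed:      # a fixed point, possibly only a local optimum
            break
    return assignment

print(icm([0, 1, 0]))        # converges to an all-agreeing assignment here
```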

38 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986] vs. Iterated Conditional Sampling (ICS) (our proposal; related work?)
ICM: v4^(k+1) = argmax_{v4} P(v4 | v \ v4), all other variables held constant.
ICS: instead of passing only the argmax, pass a sample of the top values of P(v4 | v \ v4), i.e. an N-best list (the top N values).
Can use a "generalized version" of this, doing exact inference on a region of several nodes at once. Here, a "message" grows only linearly with region size and N!
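A toy illustration of the N-best idea (the candidate segmentations and scores below are hypothetical): instead of forwarding only the single best value of a region, keep the top N.

```python
import heapq

# Hypothetical conditional scores for one region's candidate values.
scores = {"Laurel, B. | Interface Agents: Metaphors with Character": 0.52,
          "Laurel, B. Interface | Agents: Metaphors with Character": 0.31,
          "Laurel, | B. Interface Agents: Metaphors with Character": 0.17}

def n_best(candidate_scores, n):
    """ICM would keep only the argmax; ICS keeps the top-N values as the 'message'."""
    return heapq.nlargest(n, candidate_scores, key=candidate_scores.get)

print(n_best(scores, 1))  # the single ICM-style argmax
print(n_best(scores, 3))  # the N-best list passed on to the co-reference factors
```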

39 Sample = N-best List from CRF Segmentation
(figure: the joint model from before, with observed citations o, segmentations s, citation attributes c, pairwise co-reference variables y, and prototype / pairwise variables p)
Do exact inference over these linear-chain regions; pass the N-best list to coreference.

40 Sample = N-best List from Viterbi
(figure: two citations, each with observed citation o, segmentation s, and attributes c, joined by a pairwise co-reference variable y parameterized by N-best lists)

41 Sample = N-best List from Viterbi
N-best segmentations of one citation:
  Name                 | Title
  Laurel, B            | Interface Agents: Metaphors with Character The …
  Laurel, B.           | Interface Agents: Metaphors with Character …
  Laurel, B. Interface | Agents Metaphors with Character …
When calculating similarity with another citation, we have more opportunity to find correct, matching fields:
  Name                 | Title                                        | Book Title                                  | Year
  Laurel, B. Interface | Agents: Metaphors with Character             | The Art of Human Computer Interface Design  | 1990
  Laurel, B.           | Interface Agents: Metaphors with Character   | The Art of Human Computer Interface Design  | 1990
  Laurel, B. Interface | Agents: Metaphors with Character             | The Art of Human Computer Interface Design  | 1990
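A sketch of why the N-best list helps co-reference (the field values and similarity measure below are hypothetical): the candidate pair is scored against every segmentation in the list, so one correct segmentation is enough to line the fields up.

```python
def field_overlap(seg_a, seg_b):
    """Fraction of shared fields whose values match exactly (an illustrative similarity)."""
    keys = set(seg_a) & set(seg_b)
    return sum(seg_a[k] == seg_b[k] for k in keys) / len(keys) if keys else 0.0

# Hypothetical N-best segmentations of the first citation.
n_best_a = [
    {"Name": "Laurel, B", "Title": "Interface Agents: Metaphors with Character The"},
    {"Name": "Laurel, B.", "Title": "Interface Agents: Metaphors with Character"},
    {"Name": "Laurel, B. Interface", "Title": "Agents Metaphors with Character"},
]
# A segmentation of the other citation in the candidate coreferent pair.
seg_b = {"Name": "Laurel, B.", "Title": "Interface Agents: Metaphors with Character"}

# Scoring against every candidate gives the pair more chances to line up on the
# correct field boundaries than trusting only the single Viterbi segmentation.
print(max(field_overlap(a, seg_b) for a in n_best_a))   # -> 1.0 (second candidate matches)
```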

42 Results on 4 Sections of CiteSeer Citations
Coreference F1 performance:
  N         Reinforce   Face    Reason   Constraint
  1         0.946       0.967   0.945    0.961
  3         0.95        0.979   0.961    0.960
  7         0.948       0.979   0.951    0.971
  9         0.982       0.967   0.960    0.971
  Optimal   0.995       0.992   0.994    0.988
Average error reduction is 35%. "Optimal" makes best use of the N-best list by using the true labels, indicating that even more improvement can be obtained.

43 Conclusions
Conditional Random Fields combine the benefits of
  – conditional probability models (arbitrary features)
  – Markov models (for sequences or other relations)
Success in
  – factorial finite state models
  – coreference analysis
  – semi-supervised learning
  – segmentation uncertainty aiding coreference
Future work:
  – structure learning
  – further tight integration of IE and Data Mining
  – application to Social Network Analysis

44 End of Talk

45 Application Project:

46 Research Paper Cites

47 Application Project (figure: linked entities: Research Paper, Cites, Person, University, Conference, Grant, Groups, Expertise)

48 Software Infrastructure
MALLET: Machine Learning for Language Toolkit
~60k lines of Java.
Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, …
Many ML basics in a common, convenient framework:
  – naïve Bayes, MaxEnt, Boosting, SVMs, Dirichlets, Conjugate Gradient
Advanced ML algorithms:
  – Conditional Random Fields, Maximum Margin Markov Networks, BFGS, Expectation Propagation, Tree-Reparameterization, …
Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and 100k's of training examples, as needed for NLP.
Released as Open Source Software. http://mallet.cs.umass.edu
In use at UMass, MIT, CMU, UPenn, …

49 End of Talk

