Information Extraction Data Mining and Topic Discovery with Probabilistic Models Andrew McCallum Computer Science Department University of Massachusetts.


1 Information Extraction Data Mining and Topic Discovery with Probabilistic Models Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

2 Information Extraction with Conditional Random Fields Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

3 Goal: Mine actionable knowledge from unstructured text.

4 An HR office Jobs, but not HR jobs

5 Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1

6 A Portal for Job Openings

11 Job Openings: Category = High Tech Keyword = Java Location = U.S.

12 Data Mining the Extracted Job Information

13 Extracting Continuing Education Courses Data automatically extracted from www.calpoly.edu. Source web page: color highlights indicate the type of information (e.g., orange = course #).

14 This took place in ‘99 Not in Maryland Courses from all over the world

19 IE from Research Papers [McCallum et al ‘99]

20 IE from Research Papers

21 Mining Research Papers [Giles et al] [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004]

22 IE from Chinese Documents regarding Weather Department of Terrestrial System, Chinese Academy of Sciences 200k+ documents, several centuries old - Qing Dynasty Archives - memos - newspaper articles - diaries

23 Why prefer “knowledge base search” over “page search”? Targeted, restricted universe of hits –Don’t show resumes when I’m looking for job openings. Specialized queries –Topic-specific –Multi-dimensional –Based on information spread over multiple pages. Get correct granularity –Site, page, paragraph Specialized display –Super-targeted hit summarization in terms of DB slot values Ability to support sophisticated data mining Information Extraction is needed to automatically build the Knowledge Base.

25 What is “Information Extraction”? Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.” Richard Stallman, founder of the Free Software Foundation, countered saying… NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. IE

26 What is “Information Extraction”? Information Extraction = segmentation + classification + clustering + association As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.” Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation

29 What is “Information Extraction”? Information Extraction = segmentation + classification + association + clustering As a family of techniques: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.” Richard Stallman, founder of the Free Software Foundation, countered saying… Microsoft Corporation CEO Bill Gates Microsoft Gates Microsoft Bill Veghte Microsoft VP Richard Stallman founder Free Software Foundation NAME TITLE ORGANIZATION Bill Gates CEO Microsoft Bill Veghte VP Microsoft Richard Stallman founder Free Soft.. * * * *

30 Larger Context Spider → document collection → IE (segment, classify, associate, cluster, filter) → database → data mining (discover patterns: entity types, links/relations, events; prediction, outlier detection, decision support) → actionable knowledge.

32 Why Information Extraction (IE)? Science –Grand old dream of AI: Build large KB* and reason with it. IE enables the automatic creation of this KB. –IE is a complex problem that inspires new advances in machine learning. Profit –Many companies interested in leveraging data currently “locked in unstructured text on the Web”. –Not yet a monopolistic winner in this space. Fun! –Build tools that we researchers like to use ourselves: Cora & CiteSeer, MRQE.com, FAQFinder,… –See our work get used by the general public. * KB = “Knowledge Base”

33 Tutorial Outline IE History Landscape of problems and solutions Parade of models for segmenting/classifying: –Sliding window –Boundary finding –Finite state machines –Trees Overview of related problems and solutions –Association, Clustering –Integration with Data Mining Where to go from here 15 min break

34 IE History Pre-Web Mostly news articles –De Jong’s FRUMP [1982] Hand-built system to fill Schank-style “scripts” from news wire –Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96] Most early work dominated by hand-built models –E.g. SRI’s FASTUS, hand-built FSMs. –But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Web AAAI ’94 Spring Symposium on “Software Agents” –Much discussion of ML applied to Web. Maes, Mitchell, Etzioni. Tom Mitchell’s WebKB, ‘96 –Build KB’s from the Web. Wrapper Induction –Initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97],…

35 What makes IE from the Web Different? Less grammar, but more formatting & linking. The directory structure, link structure, formatting & layout of the Web is its own new grammar. Newswire example: Apple to Open Its First Retail Store in New York City MACWORLD EXPO, NEW YORK--July 17, 2002-- Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience. "Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles." Web example: www.apple.com/retail, www.apple.com/retail/soho, www.apple.com/retail/soho/theatre.html

36 Landscape of IE Tasks (1/4): Pattern Feature Domain Text paragraphs without formatting Grammatical sentences and some formatting & links Non-grammatical snippets, rich formatting & links Tables Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

37 Landscape of IE Tasks (2/4): Pattern Scope Web site specific (formatting; e.g., Amazon.com book pages) / genre specific (layout; e.g., resumes) / wide, non-specific (language; e.g., university names).

38 Landscape of IE Tasks (3/4): Pattern Complexity E.g. word patterns: Closed set (U.S. states): “He was born in Alabama…”, “The big Wyoming sky…” Regular set (U.S. phone numbers): “Phone: (413) 545-1323”, “The CALD main office can be reached at 412-268-1299” Complex pattern (U.S. postal addresses): “University of Arkansas P.O. Box 140 Hope, AR 71802”, “Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210” Ambiguous patterns, needing context and many sources of evidence (person names): “…was among the six houses sold by Hope Feldman that year.”, “Pawel Opalinski, Software Engineer at WhizBang Labs.”
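The “regular set” category above (e.g., U.S. phone numbers) is the one case a plain regular expression handles well. A minimal sketch in Python; the pattern is an illustrative assumption covering only a few common U.S. formats, not a production extractor:

```python
import re

# Illustrative pattern for common US phone formats such as
# (413) 545-1323, 413-545-1323, 413.545.1323.  Real extractors
# must handle many more variants (extensions, country codes, ...).
US_PHONE = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b")

def extract_phones(text):
    """Return all US-style phone numbers found in text, normalized
    to the form AAA-PPP-NNNN."""
    return ["-".join(m.groups()) for m in US_PHONE.finditer(text)]
```

Closed sets, complex patterns, and ambiguous patterns are exactly the cases where such fixed patterns break down and learned extractors become necessary.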

39 Landscape of IE Tasks (4/4): Pattern Combinations Example text: “Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.” Single entity (“named entity” extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut. Binary relationship: Relation: Person-Title, Person: Jack Welch, Title: CEO; Relation: Company-Location, Company: General Electric, Location: Connecticut. N-ary record: Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt.

40 Evaluation of Single Entity Extraction TRUTH: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. PRED: Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke. Precision = (# correctly predicted segments) / (# predicted segments) = 2/6. Recall = (# correctly predicted segments) / (# true segments) = 2/4. F1 = harmonic mean of precision and recall = 2PR / (P + R).
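The slide’s counts (2 of 6 predicted segments correct, 4 true segments) can be checked mechanically. A small sketch of segment-level scoring, where segments are (start, end) spans and only exact matches count as correct:

```python
def prf1(predicted, truth):
    """Segment-level precision, recall, and F1.  Segments are
    (start, end) spans; a prediction counts as correct only if it
    exactly matches a true segment (no partial credit)."""
    correct = len(set(predicted) & set(truth))
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(truth) if truth else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```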

41 State of the Art Performance Named entity recognition –Person, Location, Organization, … –F1 in the high 80s to mid-90s Binary relation extraction –Contained-in (Location1, Location2), Member-of (Person1, Organization1) –F1 in the 60s to 80s, depending on the relation Wrapper induction –Extremely accurate performance obtainable –Human effort (~30 min) required for each site

42 Landscape of IE Techniques (1/1): Models Any of these models can be used to capture words, formatting or both. Running example: “Abraham Lincoln was born in Kentucky.” Lexicons (Alabama, Alaska, … Wisconsin, Wyoming): is this token a member? Classify pre-segmented candidates: a classifier decides “which class?” for each candidate. Sliding window: a classifier decides “which class?” for each window, trying alternate window sizes. Boundary models: classifiers mark BEGIN and END positions. Finite state machines: most likely state sequence? Context free grammars: most likely parse? (NNP VP NP V NNP; NP, PP, VP, S) …and beyond

43 Sliding Windows

44 Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement


48 A “Naïve Bayes” Sliding Window Model [Freitag 1997] Example: “… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …” Window layout: prefix w_{t-m} … w_{t-1}, contents w_t … w_{t+n}, suffix w_{t+n+1} … w_{t+n+m}. Estimate Pr(LOCATION | window) using Bayes rule. Try all “reasonable” windows (vary length, position). Assume independence for length, prefix words, suffix words, content words. Estimate from data quantities like Pr(“Place” in prefix | LOCATION). If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it. Other examples of sliding windows: [Baluja et al 2000] (decision tree over individual words & their context)

49 A “Naïve Bayes” Sliding Window Model [Freitag 1997] P(“Wean Hall Rm 5409” = LOCATION) = (prior probability of start position) × (prior probability of length) × (probability of prefix words) × (probability of contents words) × (probability of suffix words). Try all start positions and reasonable lengths. Estimate these probabilities by (smoothed) counts from labeled training data. If the probability is above some threshold, extract it.
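The decomposition above (length prior times independent prefix, contents, and suffix word probabilities) can be sketched in log space. The model tables and the context width k below are hypothetical stand-ins for the smoothed counts Freitag estimates from labeled data:

```python
def nb_window_score(tokens, start, length, model):
    """Log-score of tokens[start:start+length] as a field (e.g. LOCATION):
    log prior(length) + sums of log word probabilities for the k prefix
    words, the contents words, and the k suffix words, all assumed
    independent.  Every table in `model` holds hypothetical smoothed
    log-probabilities; unknown words fall back to an OOV log-prob."""
    k = model["k"]
    prefix = tokens[max(0, start - k):start]
    contents = tokens[start:start + length]
    suffix = tokens[start + length:start + length + k]
    score = model["log_p_len"].get(length, model["log_p_len_other"])
    for table, words in (("log_p_prefix", prefix),
                         ("log_p_contents", contents),
                         ("log_p_suffix", suffix)):
        for w in words:
            score += model[table].get(w, model["log_p_oov"])
    return score
```

Scoring every (start, length) pair and thresholding gives the extraction rule from the slide.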

50 “Naïve Bayes” Sliding Window Results Domain: CMU UseNet Seminar Announcements. Results (F1): Person Name: 30%; Location: 61%; Start Time: 98%.

51 Problems with Sliding Windows and Boundary Finders Decisions in neighboring parts of the input are made independently from each other. –Naïve Bayes Sliding Window may predict a “seminar end time” before the “seminar start time”. –It is possible for two overlapping windows to both be above threshold. –In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

52 Finite State Machines

53 Outline Examples of IE and Data Mining IE with Hidden Markov Models Introduction to Conditional Random Fields (CRFs) Examples of IE with CRFs Sequence Alignment with CRFs Semi-supervised Learning

54 Hidden Markov Models HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, … Finite state model (equivalently, a graphical model) with transitions between states and an observation emitted at each step. Parameters, for all states S = {s_1, s_2, …}: start state probabilities P(s_1); transition probabilities P(s_t | s_{t-1}); observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet. Generates: a state sequence and an observation sequence o_1 o_2 o_3 o_4 o_5 o_6 o_7 o_8. Training: maximize probability of training observations (w/ prior).
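With those three parameter tables, the joint probability of a state/observation sequence is just a product of start, transition, and emission factors. A minimal sketch in log space; the toy states and probabilities in the usage are made up for illustration:

```python
import math

def hmm_joint_logprob(states, obs, start, trans, emit):
    """Log of the HMM joint probability
    P(s, o) = P(s_1) P(o_1|s_1) * prod_{t>1} P(s_t|s_{t-1}) P(o_t|s_t)."""
    lp = math.log(start[states[0]]) + math.log(emit[states[0]][obs[0]])
    for t in range(1, len(states)):
        lp += math.log(trans[states[t - 1]][states[t]])  # transition factor
        lp += math.log(emit[states[t]][obs[t]])          # emission factor
    return lp
```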

55 IE with Hidden Markov Models Given a sequence of observations (“Yesterday Pedro Domingos spoke this example sentence.”) and a trained HMM (states: person name, location name, background), find the most likely state sequence (Viterbi). Any words said to be generated by the designated “person name” state are extracted as a person name: Person name: Pedro Domingos
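The Viterbi step above can be sketched directly. The two-state (background vs. person-name) model in the usage is hypothetical, and unseen words get a tiny floor probability rather than proper smoothing:

```python
def viterbi(obs, states, start, trans, emit):
    """Most likely HMM state sequence for the observed words; words
    tagged with the 'name' state would be extracted as a person name."""
    OOV = 1e-9  # tiny floor for unseen words (stand-in for real smoothing)
    V = [{s: start[s] * emit[s].get(obs[0], OOV) for s in states}]
    back = []
    for o in obs[1:]:
        prev, col, ptr = V[-1], {}, {}
        for s in states:
            # best predecessor for state s at this position
            r_best = max(states, key=lambda r: prev[r] * trans[r][s])
            ptr[s] = r_best
            col[s] = prev[r_best] * trans[r_best][s] * emit[s].get(o, OOV)
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):        # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```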

56 HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”] Task: Named Entity Extraction. States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence. Transition probabilities: P(s_t | s_{t-1}, o_{t-1}); observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}); back-off to P(s_t | s_{t-1}), P(s_t) and P(o_t | s_t), P(o_t). Train on ~500k words of news wire text. Results (F1): Mixed case English 93%; Upper case English 91%; Mixed case Spanish 90%. Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]

57 We want More than an Atomic View of Words Would like richer representation of text: many arbitrary, overlapping features of the words. For example, for the token “Wisniewski” (part of a noun phrase, ends in “-ski”): identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; last person name was female; next two words are “and Associates”; …

58 Problems with Richer Representation and a Generative Model These arbitrary features are not independent: multiple levels of granularity (chars, words, phrases); multiple dependent modalities (words, formatting, layout); past & future. Two choices: (1) Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data! (2) Ignore the dependencies. This causes “over-counting” of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!

59 Conditional Sequence Models We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(s|o) instead of P(s,o): –Can examine features, but not responsible for generating them. –Don’t have to explicitly model their dependencies. –Don’t “waste modeling effort” trying to generate what we are given at test time anyway.

60 Outline Examples of IE and Data Mining IE with Hidden Markov Models Introduction to Conditional Random Fields (CRFs) Examples of IE with CRFs Sequence Alignment with CRFs Semi-supervised Learning

62 Locally Normalized Conditional Sequence Model Generative (traditional HMM): transitions P(s_t | s_{t-1}) and observations P(o_t | s_t). Conditional: transitions conditioned on the observations, or, more generally, on the entire observation sequence. Standard belief propagation: forward-backward procedure; Viterbi and Baum-Welch follow naturally. Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000]

63 Exponential Form for “Next State” Function Each source state s_{t-1} owns a “black-box” classifier over next states: P(s_t | s_{t-1}, o) = (1/Z(o, s_{t-1})) exp(Σ_k λ_k f_k(o, s_t)), with weights λ_k and features f_k. Overall recipe: labeled data is assigned to transitions; train each state’s exponential model by maximum likelihood (iterative scaling or conjugate gradient).
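The per-source-state exponential classifier is a softmax over destination states. A minimal sketch; the feature names and weights in the usage are hypothetical:

```python
import math

def memm_next_state_probs(weights, active_features):
    """Locally normalized next-state distribution for one source state:
    P(s | s_prev, o) = exp(sum_k lambda_k f_k) / Z.  `weights[s]` maps
    feature names to weights for destination state s; `active_features`
    is the set of binary features firing on the current observation."""
    scores = {s: sum(w.get(f, 0.0) for f in active_features)
              for s, w in weights.items()}
    z = sum(math.exp(v) for v in scores.values())  # per-state normalizer
    return {s: math.exp(v) / z for s, v in scores.items()}
```

This per-state normalizer Z is exactly what the later label-bias slides criticize: probability mass is renormalized locally at every step.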

64 Feature Functions Example: “Yesterday Lawrence Saul spoke this example sentence.” with observations o = o_1 o_2 o_3 o_4 o_5 o_6 o_7 and states s_1, s_2, s_3, s_4 assigned along the sequence. A feature function f_k(s_{t-1}, s_t, o, t) may test any property of the whole observation sequence at position t together with the state pair.

65 Experimental Data 38 files belonging to 7 UseNet FAQs Example: X-NNTP-Poster: NewsHound v1.33 Archive-name: acorn/faq/part2 Frequency: monthly 2.6) What configuration of serial cable should I use? Here follows a diagram of the necessary connection programs to work properly. They are as far as I know agreed upon by commercial comms software developers fo Pins 1, 4, and 8 must be connected together inside is to avoid the well known serial port chip bugs. The Procedure: For each FAQ, train on one file, test on other; average.

66 Features in Experiments begins-with-number begins-with-ordinal begins-with-punctuation begins-with-question-word begins-with-subject blank contains-alphanum contains-bracketed-number contains-http contains-non-space contains-number contains-pipe contains-question-mark contains-question-word ends-with-question-mark first-alpha-is-capitalized indented indented-1-to-4 indented-5-to-10 more-than-one-third-space only-punctuation prev-is-blank prev-begins-with-ordinal shorter-than-30
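A few of the line features listed above, sketched as simple predicates over a line and its predecessor; the exact definitions here are illustrative guesses, not the originals used in the experiments:

```python
import re

def line_features(line, prev_line=""):
    """A handful of binary line features in the spirit of the experiment's
    feature list.  Definitions are illustrative approximations."""
    stripped = line.strip()
    return {
        "begins-with-number": bool(re.match(r"\s*\d", line)),
        "begins-with-question-word": bool(
            re.match(r"\s*(what|how|why|when|where|who)\b", line, re.I)),
        "blank": stripped == "",
        "contains-question-mark": "?" in line,
        "ends-with-question-mark": stripped.endswith("?"),
        "indented": line.startswith((" ", "\t")),
        "prev-is-blank": prev_line.strip() == "",
        "shorter-than-30": len(stripped) < 30,
    }
```

Each line of a FAQ document becomes such a feature vector before being fed to the sequence model.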

67 Models Tested ME-Stateless: A single maximum entropy classifier applied to each line independently. TokenHMM: A fully-connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters). FeatureHMM: Identical to TokenHMM, only the lines in a document are first converted to sequences of features. MEMM: The Maximum Entropy Markov Model described in this talk.

68 Results

69 Label Bias Problem in Conditional Sequence Models Example (after Bottou ‘91): a finite-state model in which two branches leave the start state, one spelling “rib” (r-i-b) and one spelling “rob” (r-o-b). Bias toward states with few “siblings”: per-state normalization in MEMMs does not allow “probability mass” to transfer from one branch to the other.

70 Proposed Solutions Determinization (e.g., merging the shared initial r of “rib” and “rob”): not always possible; state-space explosion. Use fully-connected models: lacks prior structural knowledge. Our solution: Conditional random fields (CRFs): probabilistic conditional models generalizing MEMMs; allow some transitions to vote more strongly than others in computing state sequence probability; whole-sequence rather than per-state normalization.

71 From HMMs to MEMMs to CRFs Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]. Graphical models for the HMM, the MEMM, and the linear-chain CRF. (The linear chain shown is a special case of MEMMs and CRFs.)

72 From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001] Joint (HMM): P(s, o) = Π_t P(s_t | s_{t-1}) P(o_t | s_t). Conditional: P(s | o) = (1/Z_o) Π_t Φ_t(s_{t-1}, s_t, o_t), where each potential Φ_t is an exponential function of weighted features (a super-special case of Conditional Random Fields). Set parameters by maximum likelihood, using an optimization method on the gradient of the log-likelihood L.

73 Linear Chain Conditional Random Fields [Lafferty, McCallum, Pereira 2001] Markov on s, conditional dependency on the whole observation sequence o. The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph, P(s | o) = (1/Z_o) Π_t exp(Σ_k λ_k f_k(s_{t-1}, s_t, o, t)). Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
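The dynamic-programming inference mentioned above can be sketched as the forward recursion that computes the CRF normalizer Z_o in O(|o|·|S|^2); the log clique potentials here are arbitrary scores supplied by the caller (in a real CRF they are the weighted feature sums):

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def crf_log_partition(init, trans):
    """Forward recursion for a linear-chain CRF.  init[s] is the log
    potential of state s at position 1; trans[t][r][s] is the log clique
    potential for the transition r -> s at step t+1.  Returns log Z_o."""
    alpha = dict(init)                       # log alpha at position 1
    for step in trans:
        targets = {s for r in step for s in step[r]}
        alpha = {s: logsumexp([alpha[r] + step[r][s] for r in alpha])
                 for s in targets}
    return logsumexp(list(alpha.values()))
```

With all potentials zero (exp = 1), log Z must equal the log of the number of paths, which gives a quick sanity check.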

74 CRFs vs. HMMs More general and expressive modeling technique Comparable computational efficiency Features may be arbitrary functions of any or all observations Parameters need not fully specify generation of observations; require less training data Easy to incorporate domain knowledge State means only “state of process”, vs “state of process” and “observational history I’m keeping”

75 Efficient Inference

76 Training CRFs The gradient of the log-likelihood with respect to each weight λ_k is: (feature count using correct labels) − (expected feature count using the model’s predicted label distribution) − (smoothing penalty λ_k / σ^2).
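The gradient structure (correct-label feature counts minus expected counts minus a smoothing penalty) is easiest to see on a single labeled position, i.e., a length-1 chain; a minimal sketch:

```python
import math

def loglik_gradient(weights, feats_by_state, true_state, sigma2):
    """Gradient of the smoothed conditional log-likelihood at one labeled
    position (a length-1 chain, for clarity):
      dL/dw_k = (count of k under the correct label)
              - (expected count of k under the model distribution)
              - w_k / sigma2.
    feats_by_state[s] is the set of features active if the label is s."""
    scores = {s: sum(weights.get(f, 0.0) for f in fs)
              for s, fs in feats_by_state.items()}
    z = sum(math.exp(v) for v in scores.values())
    p = {s: math.exp(v) / z for s, v in scores.items()}  # model distribution
    return {k: (1.0 if k in feats_by_state[true_state] else 0.0)
               - sum(p[s] for s, fs in feats_by_state.items() if k in fs)
               - weights[k] / sigma2
            for k in weights}
```

For full chains the expected counts come from the forward-backward marginals rather than this single softmax, but the three-term shape of the gradient is the same.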

77 Training CRFs Methods: iterative scaling (quite slow) conjugate gradient (much faster) conjugate gradient with preconditioning (super fast) limited-memory quasi-Newton methods (also super fast) Complexity comparable to standard Baum-Welch [Sha & Pereira 2002] & [Malouf 2002]

78 Voted Perceptron Sequence Models [Collins 2002] Like CRFs with stochastic gradient ascent and a Viterbi approximation. Avoids calculating the partition function (normalizer) Z_o, but uses plain gradient ascent, not a 2nd-order or conjugate gradient method. The update (add the feature counts of the true path, subtract those of the Viterbi path) is analogous to the CRF gradient for this one training instance.
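The perceptron update itself is tiny: after Viterbi decoding, add the true sequence’s feature counts and subtract the predicted sequence’s, with no partition function anywhere. A sketch:

```python
def perceptron_update(weights, feat_counts_true, feat_counts_pred, lr=1.0):
    """Collins-style structured perceptron step: move weights toward the
    feature counts of the true label sequence and away from those of the
    Viterbi-predicted sequence.  No normalizer Z_o is ever computed."""
    for k, v in feat_counts_true.items():
        weights[k] = weights.get(k, 0.0) + lr * v
    for k, v in feat_counts_pred.items():
        weights[k] = weights.get(k, 0.0) - lr * v
    return weights
```

Features shared by both sequences cancel, so only the disagreements move the weights; the "voted" variant additionally averages the weight vectors over all updates.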

79 MEMM & CRF Related Work Maximum entropy for language tasks: –Language modeling [Rosenfeld ‘94, Chen & Rosenfeld ‘99] –Part-of-speech tagging [Ratnaparkhi ‘98] –Segmentation [Beeferman, Berger & Lafferty ‘99] –Named entity recognition “MENE” [Borthwick, Grishman,…’98] HMMs for similar language tasks –Part of speech tagging [Kupiec ‘92] –Named entity recognition [Bikel et al ‘99] –Other Information Extraction [Leek ‘97], [Freitag & McCallum ‘99] Serial Generative/Discriminative Approaches –Speech recognition [Schwartz & Austin ‘93] –Reranking Parses [Collins, ‘00] Other conditional Markov models –Non-probabilistic local decision models [Brill ‘95], [Roth ‘98] –Gradient-descent on state path [LeCun et al ‘98] –Markov Processes on Curves (MPCs) [Saul & Rahim ‘99] –Voted Perceptron-trained FSMs [Collins ’02]

80 Outline Examples of IE and Data Mining IE with Hidden Markov Models Introduction to Conditional Random Fields (CRFs) Examples of IE with CRFs Sequence Alignment with CRFs Semi-supervised Learning

82 Table Extraction from Government Reports Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households. Milk Cows and Production of Milk and Milkfat: United States, 1993-95 -------------------------------------------------------------------------------- : : Production of Milk and Milkfat 2/ : Number :------------------------------------------------------- Year : of : Per Milk Cow : Percentage : Total :Milk Cows 1/:-------------------: of Fat in All :------------------ : : Milk : Milkfat : Milk Produced : Milk : Milkfat -------------------------------------------------------------------------------- : 1,000 Head --- Pounds --- Percent Million Pounds : 1993 : 9,589 15,704 575 3.66 150,582 5,514.4 1994 : 9,500 16,175 592 3.66 153,664 5,623.7 1995 : 9,461 16,451 602 3.66 155,644 5,694.3 -------------------------------------------------------------------------------- 1/ Average number during year, excluding heifers not yet fresh. 2/ Excludes milk sucked by calves. CRF Labels: Non-Table Table Title Table Header Table Data Row Table Section Data Row Table Footnote... (12 in all) [Pinto, McCallum, Wei, Croft, 2003 SIGIR] Features: Percentage of digit chars Percentage of alpha chars Indented Contains 5+ consecutive spaces Whitespace in this line aligns with prev.... Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}. 100+ documents from www.fedstats.gov
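The conjunction features in the last bullet above can be generated mechanically from per-line base features. A minimal sketch, using the offset windows named on the slide; the function and feature names here are illustrative, not from the paper:

```python
from itertools import product

def conjunction_features(base_feats, t, offsets=((0, 0), (-1, 0), (0, 1), (1, 2))):
    """Conjoin the base features of lines t+a..t+b for each offset window (a, b).

    base_feats: one list of base-feature names per line of the document.
    Returns conjunction-feature strings for line t.
    """
    conj = []
    for a, b in offsets:
        window = list(range(t + a, t + b + 1))
        if window[0] < 0 or window[-1] >= len(base_feats):
            continue  # window falls off the edge of the document
        # one conjunction per combination of features, one feature per line in the window
        for combo in product(*(base_feats[i] for i in window)):
            conj.append("&".join(f"{f}@{i - t}" for f, i in zip(combo, window)))
    return conj
```

With per-line features like `indented` or `digits`, a window of {-1,0} yields features such as `indented@-1&digits@0`, tying a line's label to the layout of its neighbors.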

83 Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Line labels (% correct) / Table segments (F1):
  CRF               95%   92%
  Stateless MaxEnt  65%   64%
  HMM               85%   -

84 IE from Research Papers [McCallum et al ‘99]

85 IE from Research Papers
Field-level F1:
  Hidden Markov Models (HMMs)       75.6   [Seymore, McCallum, Rosenfeld, 1999]
  Support Vector Machines (SVMs)    89.7   [Han, Giles, et al, 2003]
  Conditional Random Fields (CRFs)  93.9   [Peng, McCallum, 2004]   (Δerror 40%)

86 Part-of-speech Tagging. The asbestos fiber, crocidolite, is unusually resilient once it enters the lungs, with even brief exposures to it causing symptoms that show up decades later, researchers said. DT NN NN , NN , VBZ RB JJ IN PRP VBZ DT NNS , IN RB JJ NNS TO PRP VBG NNS WDT VBP RP NNS JJ , NNS VBD . 45 tags, 1M words training data, Penn Treebank. [Lafferty, McCallum, Pereira 2001]
  Model                     error              oov error
  HMM                       5.69%              45.99%
  CRF                       5.55%              48.05%
  CRF, spelling features*   4.27% (Δerr -24%)  23.76% (Δerr -50%)
* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
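The spelling features in the footnote are simple surface tests on each word. A hedged sketch; the suffix list is the one given on the slide, while the feature-name strings are my own:

```python
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def spelling_features(word):
    """Binary surface features of the kind used in [Lafferty et al. 2001]."""
    feats = [f"word={word.lower()}"]
    if word[:1].isupper():
        feats.append("capitalized")
    if word.startswith("#"):
        feats.append("begins-with-#")
    if "-" in word:
        feats.append("contains-hyphen")
    for suf in SUFFIXES:
        if word.endswith(suf.lstrip("-")):
            feats.append(f"ends-in{suf}")
    return feats
```

Because these features overlap freely (a word can be capitalized, hyphenated, and end in -ed all at once), they are easy to use in a conditionally trained model but awkward in a generative HMM.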

87 Chinese Word Segmentation [McCallum & Feng 2003] ~100k words data, Penn Chinese Treebank.
Lexicon features: adjective ending character; digit characters; provinces; adverb ending character; foreign name chars; punctuation chars; building words; function words; Roman alphabetics; Chinese number characters; job title; Roman digits; Chinese period; locations; stopwords; cities and regions; money; surnames; countries; negative characters; symbol characters; dates; organization indicator; verb chars; department characters; preposition characters; wordlist (188k lexicon).

88 Chinese Word Segmentation Results [McCallum & Feng 2003]
Precision and recall of segments with perfect boundaries:
  Method    # training sentences   prec.   recall   F1
  [Peng]    ~5M                    75.1    74.0     74.2
  [Ponte]   ?                      84.4    87.8     86.0
  [Teahan]  ~40k                   ?       ?        94.4
  [Xue]     ~10k                   95.2    95.1     95.2
  CRF       2805                   97.3    97.8     97.5
  CRF       140                    95.4    96.0     95.7
  CRF       56                     93.9    95.0     94.4
CRF reduces error by ~50% relative to the previous world's best.
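"Precision and recall of segments with perfect boundaries" can be computed by comparing the character spans of predicted and gold words. A minimal sketch; the function names are my own:

```python
def spans(words):
    """Convert a word sequence into the set of (start, end) character spans."""
    out, i = set(), 0
    for w in words:
        out.add((i, i + len(w)))
        i += len(w)
    return out

def seg_prf(gold_words, pred_words):
    """Precision, recall, F1 over segments whose boundaries match exactly."""
    g, p = spans(gold_words), spans(pred_words)
    tp = len(g & p)  # predicted words whose boundaries are both correct
    prec, rec = tp / len(p), tp / len(g)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

A word counts as correct only if both its start and end boundaries agree with the gold segmentation, which is the strict criterion used on the slide.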

89 Named Entity Recognition CRICKET - MILLNS SIGNS FOR BOLAND CAPE TOWN 1996-08-22 South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional. Labels: Examples: PERYayuk Basuki Innocent Butare ORG3M KDP Cleveland LOCCleveland Nirmal Hriday The Oval MISCJava Basque 1,000 Lakes Rally

90 Automatically Induced Features [McCallum & Li, 2003, CoNLL]
  Index   Feature
  0       inside-noun-phrase(o_{t-1})
  5       stopword(o_t)
  20      capitalized(o_{t+1})
  75      word=the(o_t)
  100     in-person-lexicon(o_{t-1})
  200     word=in(o_{t+2})
  500     word=Republic(o_{t+1})
  711     word=RBI(o_t) & header=BASEBALL
  1027    header=CRICKET(o_t) & in-English-county-lexicon(o_t)
  1298    company-suffix-word(firstmention_{t+2})
  4040    location(o_t) & POS=NNP(o_t) & capitalized(o_t) & stopword(o_{t-1})
  4945    moderately-rare-first-name(o_{t-1}) & very-common-last-name(o_t)
  4474    word=the(o_{t-2}) & word=of(o_t)

91 Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]
  Method                                                  F1
  HMMs: BBN's Identifinder                                73%
  CRFs w/out Feature Induction                            83%
  CRFs with Feature Induction (based on LikelihoodGain)   90%

92 Related Work CRFs are widely used for information extraction...including more complex structures, like trees: –[Zhu, Nie, Zhang, Wen, ICML 2007] Dynamic Hierarchical Markov Random Fields and their Application to Web Data Extraction –[Viola & Narasimhan]: Learning to Extract Information from Semi-structured Text using a Discriminative Context Free Grammar –[Jousse et al 2006]: Conditional Random Fields for XML Trees

93 Broader View Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Document collection Database Filter by relevance Label training data Train extraction models Up to now we have been focused on segmentation and classification

94 Broader View Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Tokenize Document collection Database Filter by relevance Label training data Train extraction models Now touch on some other issues 1 2 3 4 5

95 (1) Association as Binary Classification [Zelenko et al, 2002] Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair. Person-Role(Christos Faloutsos, KDD 2003 General Chair) → NO. Person-Role(Ted Senator, KDD 2003 General Chair) → YES. (Person, Role.) Do this with SVMs and tree kernels over parse trees.

96 (1) Association with Finite State Machines [Ray & Craven, 2001] … This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. … DETthis Nenzyme Nubc6 Vlocalizes PREPto ARTthe ADJendoplasmic Nreticulum PREPwith ARTthe ADJcatalytic Ndomain Vfacing ARTthe Ncytosol Subcellular-localization (UBC6, endoplasmic reticulum)

97 (1) Association using Parse Tree [Miller et al 2000] Simultaneously POS tag, parse, extract & associate! Increase space of parse constituents to include entity and relation tags.
  Notation: c_h = head constituent category; c_m = modifier constituent category; X_p = X of parent node; t = POS tag; w = word.
  Parameters, e.g.:
    P(c_h | c_p)                      e.g. P(vp | s)
    P(c_m | c_p, c_hp, c_{m-1}, w_p)  e.g. P(per/np | s, vp, null, said)
    P(t_m | c_m, t_h, w_h)            e.g. P(per/nnp | per/np, vbd, said)
    P(w_m | c_m, t_m, t_h, w_h)       e.g. P(nance | per/np, per/nnp, vbd, said)
  (This is also a great example of extraction using a tree model.)

98 (1) Association with Graphical Models [Roth & Yih 2002] Capture arbitrary-distance dependencies among predictions. Local language models contribute evidence to entity classification. Local language models contribute evidence to relation classification. Random variable over the class of entity #2, e.g. over {person, location,…} Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Dependencies between classes of entities and relations! Inference with loopy belief propagation.

99 (1) Association with Graphical Models [Roth & Yih 2002] Also capture long-distance dependencies among predictions. Local language models contribute evidence to entity classification. Random variable over the class of entity #1, e.g. over {person, location,…} Local language models contribute evidence to relation classification. Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Dependencies between classes of entities and relations! Inference with loopy belief propagation. person? person lives-in

100 (1) Association with Graphical Models [Roth & Yih 2002] Also capture long-distance dependencies among predictions. Local language models contribute evidence to entity classification. Random variable over the class of entity #1, e.g. over {person, location,…} Local language models contribute evidence to relation classification. Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of,…} Dependencies between classes of entities and relations! Inference with loopy belief propagation. location person lives-in

101 (1) Association with “Grouping Labels” Create a simple language that reflects a field’s relation to other fields Language represents ability to define: –Disjoint fields –Shared fields –Scope Create rules that use field labels [Jensen & Cohen, 2001]

102

103

104

105

106 Broader View Create ontology Segment Classify Associate Cluster Load DB Spider Query, Search Data mine IE Tokenize Document collection Database Filter by relevance Label training data Train extraction models Now touch on some other issues 1 2 3 4 5 Object Consolidation

107 (2) Learning a Distance Metric Between Records [Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003] Learn Pr({duplicate, not-duplicate} | record1, record2) with a Maximum Entropy classifier. Do greedy agglomerative clustering using this probability as a distance metric.
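A hedged sketch of the clustering step: the cited systems learn Pr(duplicate | record1, record2) with a maximum-entropy classifier over record-pair features; here a toy token-overlap score stands in for the learned model, and all names are illustrative.

```python
def greedy_agglomerative(records, pair_prob, threshold=0.5):
    """Single-link greedy agglomerative clustering with distance 1 - P(dup)."""
    clusters = [[r] for r in records]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: the most duplicate-looking cross-cluster pair
                p = max(pair_prob(a, b) for a in clusters[i] for b in clusters[j])
                if p > best:
                    best, pair = p, (i, j)
        if pair is None:          # no pair is more likely than not a duplicate
            return clusters
        i, j = pair
        clusters[i] += clusters.pop(j)

def toy_pair_prob(a, b):
    """Toy stand-in for the learned classifier: token-overlap (Jaccard) score."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)
```

Stopping when the best merge drops below 0.5 corresponds to merging only pairs the classifier judges more likely duplicate than not.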

108 (2) String Edit Distance
distance("William Cohen", "Willliam Cohon")
  s:     WILLIAM_COHEN
  t:     WILLLIAM_COHON
  op:    CCCCICCCCCCCSC    (C = copy, I = insert, S = substitute)
  cost:  00001111111122    (running total of the alignment; final distance = 2)

109 (2) Computing String Edit Distance
D(i,j) = min of:
  D(i-1,j-1) + d(s_i, t_j)   // subst/copy
  D(i-1,j) + 1               // insert
  D(i,j-1) + 1               // delete
DP table:
        C  O  H  E  N
  M     1  2  3  4  5
  C     1  2  3  4  5
  C     2  3  3  4  5
  O     3  2  3  4  5
  H     4  3  2  3  4
  N     5  4  3  3  3
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than one). The substitution costs d can be learned.
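The recurrence above translates directly into code. A minimal implementation with unit costs, keeping the slide's subst/insert/delete labels:

```python
def levenshtein(s, t):
    """Minimum number of unit-cost edits (d = 0/1) turning s into t."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # subst/copy
                D[i - 1][j] + 1,                           # insert
                D[i][j - 1] + 1,                           # delete
            )
    return D[m][n]
```

For learned costs, the 0/1 mismatch test would be replaced by a table d(s_i, t_j), as the slide notes.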

110 (2) String Edit Distance Learning Precision/recall for MAILING dataset duplicate detection [Bilenko & Mooney, 2002, 2003]

111

112 Outline Examples of IE and Data Mining IE with Hidden Markov Models Introduction to Conditional Random Fields (CRFs) Examples of IE with CRFs Sequence Alignment with CRFs Semi-supervised Learning

113 String Edit Distance Distance between sequences x and y: –“cost” of lowest-cost sequence of edit operations that transform string x into y.

114 String Edit Distance Distance between sequences x and y: –“cost” of lowest-cost sequence of edit operations that transform string x into y. Applications –Database Record Deduplication Apex International HotelGrassmarket Street Apex Internat’l Grasmarket Street Records are duplicates of the same hotel?

115 String Edit Distance Distance between sequences x and y: –“cost” of lowest-cost sequence of edit operations that transform string x into y. Applications –Database Record Deduplication –Biological Sequences AGCTCTTACGATAGAGGACTCCAGA AGGTCTTACCAAAGAGGACTTCAGA

116 String Edit Distance Distance between sequences x and y: –“cost” of lowest-cost sequence of edit operations that transform string x into y. Applications –Database Record Deduplication –Biological Sequences –Machine Translation Il a achete une pomme He bought an apple

117 String Edit Distance Distance between sequences x and y: –“cost” of lowest-cost sequence of edit operations that transform string x into y. Applications –Database Record Deduplication –Biological Sequences –Machine Translation –Textual Entailment He bought a new car last night He purchased a brand new automobile yesterday evening

118 Levenshtein Distance [1966]. Edit operations:
  copy    Copy a character from x to y            (cost 0)
  insert  Insert a character into y               (cost 1)
  delete  Delete a character from y               (cost 1)
  subst   Substitute one character for another    (cost 1)
Align two strings: x_1 = "William W. Cohon", x_2 = "Willleam Cohen". Lowest-cost alignment:
  operation: copy copy copy copy insert subst copy copy delete delete delete copy copy copy copy subst copy
  cost:      0    0    0    0    1      1     0    0    1      1      1      0    0    0    0    1     0
Total cost = 6 = Levenshtein Distance

119 Levenshtein Distance. Edit operations: copy (cost 0), insert (cost 1), delete (cost 1), subst (cost 1).
Dynamic program: D(i,j) = score of best alignment from x_1...x_i to y_1...y_j.
  D(i,j) = min { D(i-1,j-1) + δ(x_i ≠ y_j),   D(i-1,j) + 1,   D(i,j-1) + 1 }
DP table (total cost = distance):
        W  i  l  l  l  e  a  m
     0  1  2  3  4  5  6  7  8
  W  1  0  1  2  3  4  5  6  7
  i  2  1  0  1  2  3  4  5  6
  l  3  2  1  0  1  2  3  4  5
  l  4  3  2  1  0  1  2  3  4
  i  5  4  3  2  1  1  2  3  4
  a  6  5  4  3  2  2  2  2  4
  m  7  6  5  4  3  3  3  3  2

120 Levenshtein Distance with Markov Dependencies. Edit-operation costs now depend on the previous operation (cost after copy / insert / delete / subst) and are learned from training data; for example, a repeated delete can be made cheaper than a first delete. The dynamic program becomes a 3D DP table (position in x, position in y, previous operation).
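A minimal sketch of that 3D dynamic program, with the cost table indexed by (previous operation, operation) so that, e.g., a repeated delete can be made cheaper. The cost values and names below are illustrative, not the learned ones:

```python
OPS = ("copy", "subst", "insert", "delete")

def markov_edit_distance(s, t, cost):
    """3D DP over (position in s, position in t, previous operation).

    cost[(prev_op, op)] -> float. The start state treats the "previous op" as copy.
    """
    INF = float("inf")
    # D[i][j][k] = best cost aligning s[:i], t[:j] with last operation OPS[k]
    D = [[[INF] * 4 for _ in range(len(t) + 1)] for _ in range(len(s) + 1)]
    D[0][0][0] = 0.0  # start state
    for i in range(len(s) + 1):
        for j in range(len(t) + 1):
            for k, prev in enumerate(OPS):
                d = D[i][j][k]
                if d == INF:
                    continue
                if i < len(s) and j < len(t):  # diagonal: copy if equal, else subst
                    op = "copy" if s[i] == t[j] else "subst"
                    ki = OPS.index(op)
                    D[i+1][j+1][ki] = min(D[i+1][j+1][ki], d + cost[(prev, op)])
                if j < len(t):                 # insert a character of t
                    ki = OPS.index("insert")
                    D[i][j+1][ki] = min(D[i][j+1][ki], d + cost[(prev, "insert")])
                if i < len(s):                 # delete a character of s
                    ki = OPS.index("delete")
                    D[i+1][j][ki] = min(D[i+1][j][ki], d + cost[(prev, "delete")])
    return min(D[len(s)][len(t)])
```

With uniform costs (copy 0, everything else 1) this reduces to plain Levenshtein distance; lowering cost[("delete", "delete")] makes runs of deletions cheap, which is the Markov dependency the slide describes.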

121 Ristad & Yianilos (1997) Essentially a Pair-HMM, generating an edit/state/alignment sequence and two strings complete data likelihood Learn via EM: Expectation step: Calculate likelihood of alignment paths Maximization step: Make those paths more likely. W i l l i a m _ W. _ C o h o n W i l l l e a m _ C o h e n copy subst copy insertcopy delete substcopy delete 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14 x1x1 x2x2 a.i 1 a.e a.i 2 string 1 alignment string 2 incomplete data likelihood (sum over all alignments consistent with x 1 and x 2 ) Match score = Given training set of matching string pairs, objective fn is

122 Ristad & Yianilos Regrets Limited features of input strings –Examine only single character pair at a time –Difficult to use upcoming string context, lexicons,... –Example: “Senator John Green” “John Green” Limited edit operations –Difficult to generate arbitrary jumps in both strings –Example: “UMass” “University of Massachusetts”. Trained only on positive match data –Doesn’t include information-rich “near misses” –Example: “ACM SIGIR” ≠ “ACM SIGCHI” So, consider model trained by conditional probability

123 Conditional Probability (Sequence) Models We prefer a model that is trained to maximize a conditional probability rather than joint probability: P(y|x) instead of P(y,x): –Can examine features, but not responsible for generating them. –Don’t have to explicitly model their dependencies.

124 CRF String Edit Distance W i l l i a m _ W. _ C o h o n W i l l l e a m _ C o h e n copy subst copy insertcopydelete substcopy delete joint complete data likelihood 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14 x1x1 x2x2 a.i 1 a.e a.i 2 string 1 alignment string 2 conditional complete data likelihood Want to train from set of string pairs, each labeled one of {match, non-match} match“William W. Cohon”“Willlleam Cohen” non-match“Bruce D’Ambrosio”“Bruce Croft” match“Tommi Jaakkola”“Tommi Jakola” match“Stuart Russell”“Stuart Russel” non-match“Tom Deitterich”“Tom Dean”

125 CRF String Edit Distance FSM subst insertdelete copy

126 CRF String Edit Distance FSM subst insertdelete copy subst insertdelete copy Start match m = 1 non-match m = 0 conditional incomplete data likelihood

127 CRF String Edit Distance FSM subst insertdelete copy subst insertdelete copy Start match m = 1 non-match m = 0 Probability summed over all alignments in match states 0.8 Probability summed over all alignments in non-match states 0.2 x 1 = “Tommi Jaakkola” x 2 = “Tommi Jakola”

128 CRF String Edit Distance FSM subst insertdelete copy subst insertdelete copy Start match m = 1 non-match m = 0 Probability summed over all alignments in match states 0.1 Probability summed over all alignments in non-match states 0.9 x 1 = “Tom Dietterich” x 2 = “Tom Dean”

129 Parameter Estimation Expectation Maximization E-step: Estimate distribution over alignments,, using current parameters M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS) Given training set of string pairs and match/non-match labels, objective fn is the incomplete log likelihood The complete log likelihood This is “conditional EM”, but avoid complexities of [Jebara 1998], because no need to solve M-step in closed form.

130 Efficient Training. Dynamic programming table is 3D; |x1| = |x2| = 100, |S| = 12 gives 120,000 entries. Use beam search during E-step [Pal, Sutton, McCallum 2005]. Unlike completely observed CRFs, the objective function is not convex. Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

131 What Alignments are Learned? subst insertdelete copy subst insertdelete copy Start match m = 1 non-match m = 0 x 1 = “Tommi Jaakkola” x 2 = “Tommi Jakola” T o m m i J a a k k o l a T o m i J a k o l a

132 What Alignments are Learned? subst insertdelete copy subst insertdelete copy Start match m = 1 non-match m = 0 x 1 = “Bruce Croft” x 2 = “Tom Dean” B r u c e C r o f t T o m D e a n

133 What Alignments are Learned? subst insertdelete copy subst insertdelete copy Start match m = 1 non-match m = 0 x 1 = “Jaime Carbonell” x 2 = “Jamie Callan” J a i m e C a r b o n e l l J a m i e C a l a n

134 Example Learned Alignment

135 Summary of Advantages Arbitrary features of the input strings –Examine past, future context –Use lexicons, WordNet Extremely flexible edit operations –Single operation may make arbitrary jumps in both strings, of size determined by input features Discriminative Training –Maximize ability to predict match vs non-match Easy to Label Data –Match/Non-Match no need for labeled alignments

136 Experimental Results: Data Sets Restaurant name, Restaurant address –864 records, 112 matches –E.g. “Abe’s Bar & Grill, E. Main St” “Abe’s Grill, East Main Street” People names, UIS DB generator –synthetic noise –E.g. “John Smith” vs “Snith, John” CiteSeer Citations –In four sections: Reason, Face, Reinforce, Constraint –E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...” “Russell & Norvig, “Artificial Intelligence: An Intro...”

137 Experimental Results: Features same, different; same-alphabetic, different-alphabetic; same-numeric, different-numeric; punctuation1, punctuation2; alphabet-mismatch, numeric-mismatch; end-of-1, end-of-2; same-next-character, different-next-character

138 Experimental Results: Edit Operations insert, delete, substitute/copy swap-two-characters skip-word-if-in-lexicon skip-parenthesized-words skip-any-word substitute-word-pairs-in-translation-lexicon skip-word-if-present-in-other-string

139 Experimental Results. F1 (harmonic mean of precision and recall); vector-space metrics from [Bilenko & Mooney 2003].
  Distance metric   CiteSeer Reason  Face   Reinf  Constraint   Restaurant name   Restaurant address
  Levenshtein       0.927            0.952  0.893  0.924        0.290             0.686
  Learned Leven.    0.938            0.966  0.907  0.941        0.354             0.712
  Vector            0.897            0.922  0.903  0.923        0.365             0.380
  Learned Vector    0.924            0.875  0.808  0.913        0.433             0.532

140 Experimental Results. F1 (harmonic mean of precision and recall); vector-space metrics from [Bilenko & Mooney 2003].
  Distance metric     CiteSeer Reason  Face   Reinf  Constraint   Restaurant name   Restaurant address
  Levenshtein         0.927            0.952  0.893  0.924        0.290             0.686
  Learned Leven.      0.938            0.966  0.907  0.941        0.354             0.712
  Vector              0.897            0.922  0.903  0.923        0.365             0.380
  Learned Vector      0.924            0.875  0.808  0.913        0.433             0.532
  CRF Edit Distance   0.964            0.918  0.917  0.976        0.448             0.783

141 Experimental Results. Data set: person names, with word-order noise added.
  Without skip-if-present-in-other-string:  F1 = 0.856
  With skip-if-present-in-other-string:     F1 = 0.981

142 Related Work Learned Edit Distance –[Bilenko & Mooney 2003], [Cohen et al 2003],... –[Joachims 2003]: Max-margin, trained on alignments Conditionally-trained models with latent variables –[Jebara 1999]: “Conditional Expectation Maximization” –[Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches –[Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses.

143 Outline Examples of IE and Data Mining IE with Hidden Markov Models Introduction to Conditional Random Fields (CRFs) Examples of IE with CRFs Sequence Alignment with CRFs Semi-supervised Learning

144 Semi-Supervised Learning How to train with limited labeled data? Augment with lots of unlabeled data “Expectation Regularization” [Mann, McCallum, ICML 2007]

145 Supervised Learning Decision boundary Creation of labeled instances requires extensive human effort

146 What if limited labeled data? Small amount of labeled data

147 Semi-Supervised Learning: Labeled & Unlabeled data Large amount of unlabeled data Small amount of labeled data Augment limited labeled data by using unlabeled data

148 More Semi-Supervised Algorithms than Applications (chart of # papers; compiled from [Zhu, 2007])

149 Weakness of Many Semi-Supervised Algorithms. Difficult to implement: significantly more complicated than supervised counterparts. Fragile: meta-parameters hard to tune. Lacking in scalability: O(n²) or O(n³) in the amount of unlabeled data.

150 “EM will generally degrade [tagging] accuracy, except when only a limited amount of hand-tagged text is available.” [Merialdo, 1994] “When the percentage of labeled data increases from 50% to 75%, the performance of [Label Propagation with Jensen-Shannon divergence] and SVM become almost same, while [Label propagation with cosine distance] performs significantly worse than SVM.” [Niu,Ji,Tan, 2005]

151 Families of Semi-Supervised Learning 1.Expectation Maximization 2.Graph-Based Methods 3.Auxiliary Functions 4.Decision Boundaries in Sparse Regions

152 Family 1 : Expectation Maximization [Dempster, Laird, Rubin, 1977] Fragile -- often worse than supervised

153 Family 2: Graph-Based Methods [Zhu, Ghahramani, 2002] [Szummer, Jaakkola, 2002] Lacking in scalability, Sensitive to choice of metric

154 Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005] Complicated to find appropriate auxiliary tasks

155 Family 4: Decision Boundary in Sparse Region

156 Transductive SVMs [Joachims, 1999]: Sparsity measured by margin Entropy Regularization [Grandvalet and Bengio, 2005] …by label entropy

157 Minimal Entropy Solution!

158 How do we know the minimal entropy solution is wrong? We suspect at least some of the data is in the second class! In fact we often have prior knowledge of the relative class proportions 0.8 : Student 0.2 : Professor

159 How do we know the minimal entropy solution is wrong? We suspect at least some of the data is in the second class! In fact we often have prior knowledge of the relative class proportions 0.1 : Gene Mention 0.9 : Background

160 How do we know the minimal entropy solution is wrong? We suspect at least some of the data is in the second class! In fact we often have prior knowledge of the relative class proportions 0.6 : Person 0.4 : Organization

161 Families of Semi-Supervised Learning 1.Expectation Maximization 2.Graph-Based Methods 3.Auxiliary Functions 4.Decision Boundaries in Sparse Regions 5.Expectation Regularization

162 Family 5: Expectation Regularization Low density region Favor decision boundaries that match the prior

163 Family 5: Expectation Regularization. Label Regularization is the special case, constraining p(y); Expectation Regularization is the general case, constraining p(y|feature).

164 Expectation Regularization Simple: Easy to implement Robust: Meta-parameters need little or no tuning Scalable: Linear in number of unlabeled examples

165 Discriminative Models: predict class boundaries directly; do not directly estimate class densities; make few assumptions (e.g. independence) on features; are trained by optimizing conditional log-likelihood.

166 Logistic Regression Constraints Expectations

167 Expectation Regularization (XR). Objective = log-likelihood + XR term, where XR is the KL-divergence between a prior distribution and an expected distribution over the unlabeled data.

168 Prior distribution (provided from supervised training or estimated on the labeled data) Model’s expected distribution on the unlabeled data
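The regularizer itself is small: it compares the prior label distribution with the mean of the model's per-example posteriors on the unlabeled data. A plain-Python sketch, assuming a model exposing a probability vector per example; the function names are my own:

```python
from math import log

def expected_label_distribution(predict_proba, unlabeled):
    """Model's expected label distribution: mean of per-example posteriors."""
    k = len(predict_proba(unlabeled[0]))
    mean = [0.0] * k
    for x in unlabeled:
        p = predict_proba(x)
        for c in range(k):
            mean[c] += p[c] / len(unlabeled)
    return mean

def xr_penalty(prior, predict_proba, unlabeled):
    """KL(prior || expected distribution): the Expectation Regularization term."""
    q = expected_label_distribution(predict_proba, unlabeled)
    return sum(p * log(p / q[c]) for c, p in enumerate(prior) if p > 0)
```

The penalty is zero exactly when the model's expected distribution on the unlabeled data matches the prior, which is the behavior slide 169 illustrates.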

169 After Training, Model Matches Prior Distribution: Supervised + XR vs. Supervised only

170 Gradient for Logistic Regression When the gradient is 0

171 XR Results for Classification: Secondary Structure Prediction. Accuracy by # labeled examples (2, 100, 1000):
  SVM (supervised):                  55.41%, 66.29%
  Cluster Kernel SVM:                57.05%, 65.97%
  QC Smartsub:                       57.68%, 59.16%
  Naïve Bayes (supervised):          52.42%, 57.12%, 64.47%
  Naïve Bayes EM:                    50.79%, 57.34%, 57.60%
  Logistic Regression (supervised):  52.42%, 56.74%, 65.43%
  Logistic Regression + Ent. Reg.:   48.56%, 54.45%, 58.28%
  Logistic Regression + XR:          57.08%, 58.51%, 65.44%

172 XR Results for Classification: Sliding Window Model CoNLL03 Named Entity Recognition Shared Task

173 XR Results for Classification: Sliding Window Model 2 BioCreativeII 2007 Gene/Gene Product Extraction

174 XR Results for Classification: Sliding Window Model 3 Wall Street Journal Part-of-Speech Tagging

175 XR Results for Classification: SRAA Simulated/Real Auto/Aviation Text Classification

176 Noise in Prior Knowledge What happens when users’ estimates of the class proportions are in error?

177 Noisy Prior Distribution 20% change in probability of majority class CoNLL03 Named Entity Recognition Shared Task

178 Conclusion and Ongoing Work. Conclusion: Expectation Regularization is an effective, robust method of semi-supervised training which can be applied to discriminative models, such as logistic regression. Ongoing and future work: applying Expectation Regularization to other discriminative models, e.g. conditional random fields; experimenting with priors other than class label priors.

179 End of Part 1

180 Joint Inference in Information Extraction & Data Mining Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

181 From Text to Actionable Knowledge Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Database Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge

182 Problem: Combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities. 1)DM begins from a populated DB, unaware of where the data came from, or its inherent errors and uncertainties. 2)IE is unaware of emerging patterns and regularities in the DB. The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

183 Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Database Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge Uncertainty Info Emerging Patterns Solution:

184 Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Probabilistic Model Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge Solution: Conditional Random Fields [Lafferty, McCallum, Pereira] Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…] Discriminatively-trained undirected graphical models Complex Inference and Learning Just what we researchers like to sink our teeth into! Unified Model

185 Scientific Questions What model structures will capture salient dependencies? Will joint inference actually improve accuracy? How to do inference in these large graphical models? How to do parameter estimation efficiently in these models, which are built from multiple large components? How to do structure discovery in these models?

186 Scientific Questions What model structures will capture salient dependencies? Will joint inference actually improve accuracy? How to do inference in these large graphical models? How to do parameter estimation efficiently in these models, which are built from multiple large components? How to do structure discovery in these models?

187 Outline The need for joint inference Examples of joint inference –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) –Joint Relation Extraction and Data Mining (ICM) –Probability + First-order Logic, Co-ref on Entities (MCMC)

188 1. Jointly labeling cascaded sequences Factorial CRFs Part-of-speech Noun-phrase boundaries Named-entity tag English words [Sutton, Khashayar, McCallum, ICML 2004]

189 1. Jointly labeling cascaded sequences Factorial CRFs Part-of-speech Noun-phrase boundaries Named-entity tag English words [Sutton, Khashayar, McCallum, ICML 2004]

190 1. Jointly labeling cascaded sequences Factorial CRFs Part-of-speech Noun-phrase boundaries Named-entity tag English words [Sutton, Khashayar, McCallum, ICML 2004] But errors cascade--must be perfect at every stage to do well.

191 1. Jointly labeling cascaded sequences Factorial CRFs Part-of-speech Noun-phrase boundaries Named-entity tag English words [Sutton, Khashayar, McCallum, ICML 2004] Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data. Inference: Loopy Belief Propagation

192 Cascaded Predictions Segmentation Chinese character(input observation) (output prediction) Part-of-speech Named-entity tag

193 Cascaded Predictions Segmentation Part-of-speech Chinese character(input observation) (output prediction) Named-entity tag

194 Cascaded Predictions Segmentation Part-of-speech Named-entity tag Chinese character(input observation) (input obseration) (output prediction) But errors cascade--must be perfect at every stage to do well.

195 Joint Prediction Cross-Product over Labels Segmentation+POS+NE Chinese character (input observation) (output prediction) 3 x 45 x 11 = 1485 possible states; O(|o| × 1485²) running time; O(|V| × 1485²) parameters. e.g.: state label = (Wordbeg, Noun, Person)

196 Segmentation Part-of-speech Named-entity tag Chinese character(input observation) (output prediction) O(|V| x 2785) parameters Joint Prediction Factorial CRF

197 Linear-chain Factorial where Linear-Chain to Factorial CRFs Model Definition... w v u x y x

198 Linear-chain Factorial Linear-Chain to Factorial CRFs Log-likelihood Training... w v u x y x

199 Dynamic CRFs Undirected conditionally-trained analogue to Dynamic Bayes Nets (DBNs) FactorialHigher-OrderHierarchical

200 Training CRFs. Gradient of the log-likelihood = (feature count using correct labels) - (expected feature count using predicted labels) - (smoothing penalty)

201 Training DCRFs. Same form as general CRFs: (feature count using correct labels) - (expected feature count using predicted labels) - (smoothing penalty)
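The three quantities named on the training slides are the terms of the standard CRF log-likelihood gradient, reconstructed here in its textbook form (this is the well-known formula with a Gaussian prior, not copied from the slide's image):

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  \;=\; \sum_i F_k\!\left(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}\right)
  \;-\; \sum_i \sum_{\mathbf{y}} p_\Lambda\!\left(\mathbf{y} \mid \mathbf{x}^{(i)}\right)
        F_k\!\left(\mathbf{y}, \mathbf{x}^{(i)}\right)
  \;-\; \frac{\lambda_k}{\sigma^2}
```

where F_k sums feature k over the sequence: the first term is the feature count under the correct labels, the second the expected count under the model's predicted label distribution, and the third the smoothing (Gaussian prior) penalty.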

202 Max-clique: 3 x 45 x 45 = 6075 assignments NP POS Inference (Exact) Junction Tree

203 Max-clique: 3 x 45 x 45 x 11 = 66825 assignments NER POS SEG Inference (Exact) Junction Tree

204 Inference (Approximate) Loopy Belief Propagation. Messages passed between neighboring nodes v_1..v_6, e.g. m_4(v_1), m_5(v_2), m_6(v_3), m_1(v_4), m_2(v_5), m_3(v_6), m_1(v_2), m_2(v_3), m_4(v_5), m_5(v_6), m_5(v_4), m_3(v_2), m_2(v_1).
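Loopy BP just iterates the sum-product message update m_{u→v}(x_v) = Σ_{x_u} φ_u(x_u) ψ_{uv}(x_u,x_v) Π_{w∈N(u)\v} m_{w→u}(x_u) until (hopefully) convergence. A self-contained sketch for a pairwise model; the data layout and names are my own, not from the deck:

```python
def loopy_bp(phi, psi, neighbors, iters=50):
    """Sum-product loopy belief propagation on a pairwise model.

    phi[v]       : unary potential of node v, a list over its states
    psi[(u, v)]  : pairwise potential, psi[(u, v)][xu][xv]; both directions needed
    neighbors[v] : list of neighbors of v
    Returns normalized beliefs b[v] (exact marginals when the graph is a tree).
    """
    msg = {(u, v): [1.0] * len(phi[v]) for v in phi for u in neighbors[v]}
    for _ in range(iters):
        new = {}
        for (u, v) in msg:
            vec = []
            for xv in range(len(phi[v])):
                s = 0.0
                for xu in range(len(phi[u])):
                    p = phi[u][xu] * psi[(u, v)][xu][xv]
                    for w in neighbors[u]:
                        if w != v:                 # all incoming messages except v's
                            p *= msg[(w, u)][xu]
                    s += p
                vec.append(s)
            z = sum(vec)
            new[(u, v)] = [x / z for x in vec]     # normalize for stability
        msg = new
    beliefs = {}
    for v in phi:
        b = list(phi[v])
        for u in neighbors[v]:
            b = [bi * msg[(u, v)][xv] for xv, bi in enumerate(b)]
        z = sum(b)
        beliefs[v] = [x / z for x in b]
    return beliefs
```

On a tree the fixed point gives the exact marginals; on a loopy graph the same updates are run anyway, which is the approximation this slide illustrates.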

205 [Wainwright, Jaakkola, Willsky 2001] Inference (Approximate) Tree Re-parameterization

206 [Wainwright, Jaakkola, Willsky 2001] Inference (Approximate) Tree Re-parameterization

207 [Wainwright, Jaakkola, Willsky 2001] Inference (Approximate) Tree Re-parameterization

208 [Wainwright, Jaakkola, Willsky 2001] Inference (Approximate) Tree Re-parameterization

209 Experiments Simultaneous noun-phrase & part-of-speech tagging Data from CoNLL Shared Task 2000 (Newswire) –8936 training instances –45 POS tags, 3 NP tags –Features: word identity, capitalization, regex’s, lexicons… B I I B I I O O O N N N O N N V O V Rockwell International Corp. 's Tulsa unit said it signed B I I O B I O B I O J N V O N O N N a tentative agreement extending its contract with Boeing Co.

210 Experiments Simultaneous noun-phrase & part-of-speech tagging Two experiments Compare exact and approximate inference Compare Noun Phrase Segmentation F1 of –Cascaded CRF+CRF –Cascaded Brill+CRF –Joint Factorial DCRFs B I I B I I O O O N N N O N N V O V Rockwell International Corp. 's Tulsa unit said it signed B I I O B I O B I O J N V O N O N N a tentative agreement extending its contract with Boeing Co.

211 Comparison of Cascaded & Joint

             CRF+CRF   Brill+CRF   FCRF
POS acc      98.28     N/A         98.92
Joint acc    95.56     N/A         96.48
NP F1        93.10     93.33       93.87

CRF+CRF and FCRF trained on 8936 CoNLL sentences. Brill tagger trained on 30,000+ sentences, including CoNLL test set! 20% reduction in error.

212 Accuracy by Training Set Size Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.

213 Inference: Approximate vs. Exact

214 Comparison to State-of-the-art -- NP F1: FCRF 93.87, [KM01] 94.39, [SP03] 94.38. [diagram: model from [SP03] with part-of-speech, noun-phrase, and word-identity variables (Rockwell Int'l Corp)]

215 Comparison to State-of-the-art -- NP F1: FCRF 93.87, [KM01] 94.39, [SP03] 94.38. [diagram: part-of-speech, noun-phrase, and word-identity variables (Rockwell Int'l Corp)]. Future work

216 Comparing Inference Algorithms

217 Noun Phrase Experimental Results

218 Experiment with Seminar Announcements Data: 485 emails announcing seminars at CMU Annotated with start time, end time, location and speaker Measure per-field F1, where each mention counts once

219 Outline The need for joint inference Examples of joint inference –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) –Joint Relation Extraction and Data Mining (ICM) –Probability + First-order Logic, Co-ref on Entities (MCMC)

220 Jointly labeling distant mentions Skip-chain CRFs Senator Joe Green said today …. Green ran for … … [Sutton, McCallum, SRL 2004] Dependency among similar, distant mentions ignored.

221 Jointly labeling distant mentions Skip-chain CRFs Senator Joe Green said today …. Green ran for … … [Sutton, McCallum, SRL 2004] 14% reduction in error on most repeated field in email seminar announcements. Inference: Tree reparameterized BP [Wainwright et al, 2002] See also [Finkel, et al, 2005]

222 Outline The need for joint inference Examples of joint inference –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) –Joint Relation Extraction and Data Mining (ICM) –Probability + First-order Logic, Co-ref on Entities (MCMC)

223 Joint co-reference among all pairs: Affinity Matrix CRF... Mr Powell...... Powell...... she... (pairwise affinities 45, -99, 11; Y/N decisions) [McCallum, Wellner, IJCAI WS 2003, NIPS 2004] ~25% reduction in error on co-reference of proper nouns in newswire. Inference: correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]. Also known as "entity resolution", "object correspondence"

224 Coreference Resolution Input AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty" Output News article, with named-entity "mentions" tagged Number of entities, N = 3 #1 Secretary of State Colin Powell he Mr. Powell Powell #2 Condoleezza Rice she Rice #3 President Bush Bush Today Secretary of State Colin Powell met with................................................. he................... Condoleezza Rice......... Mr Powell..........she..................... Powell............... President Bush.................................. Rice................ Bush.........................................................................

225 Inside the Traditional Solution: Pair-wise Affinity Metric. Mention (3) "... Mr Powell ..." vs. Mention (4) "... Powell ..." -- same person, Y/N?

Fired?  Feature                                         Weight
N       Two words in common                             29
Y       One word in common                              13
Y       "Normalized" mentions are string identical      39
Y       Capitalized word in common                      17
Y       > 50% character tri-gram overlap                19
N       < 25% character tri-gram overlap                -34
Y       In same sentence                                9
Y       Within two sentences                            8
N       Further than 3 sentences apart                  -1
Y       "Hobbs Distance" < 3                            11
N       Number of entities in between two mentions = 0  12
N       Number of entities in between two mentions > 4  -3
Y       Font matches                                    1
Y       Default                                         -19

OVERALL SCORE = 98 > threshold = 0
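The thresholded sum on this slide can be sketched directly; the weights are copied from the slide, while the shortened feature names are paraphrases:

```python
# Minimal sketch of the slide's pair-wise affinity metric: each fired feature
# contributes a learned weight, and the total is compared to a threshold.
FEATURES = {                      # weights copied from the slide (fired rows)
    "one_word_in_common": 13,
    "normalized_mentions_identical": 39,
    "capitalized_word_in_common": 17,
    "trigram_overlap_gt_50": 19,
    "same_sentence": 9,
    "within_two_sentences": 8,
    "hobbs_distance_lt_3": 11,
    "font_matches": 1,
    "default": -19,
}

def affinity(fired, weights=FEATURES, threshold=0):
    score = sum(weights[f] for f in fired)
    return score, score > threshold   # Y if the score exceeds the threshold

score, same = affinity(FEATURES.keys())
print(score, same)  # 98 True -- matches the slide's OVERALL SCORE
```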

226 The Problem... Mr Powell...... Powell...... she... affinity = 98, affinity = 11, affinity = -104. Pair-wise merging decisions (Y, Y, N) are being made independently from each other. Affinity measures are noisy and imperfect. They should be made in relational dependence with each other.

227 A Generative Model Solution [Russell 2001], [Pasula et al 2002] (applied to citation matching, and object correspondence in vision) [plate diagram: entity attributes (surname, age, gender) generate mention attributes (words, context, distance, fonts)]. Issues: 1) Generative model makes it difficult to use complex features. 2) Number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify model structure during inference---MCMC.

228 A Markov Random Field (MRF) for Co-reference [McCallum & Wellner, 2003, ICML]... Mr Powell...... Powell...... she... (edge weights 45, -30, 11; Y/N decisions). Make pair-wise merging decisions in dependent relation to each other by - calculating a joint prob. - including all edge weights - adding dependence on consistent triangles.

229 A Markov Random Field (MRF) for Co-reference [McCallum & Wellner, 2003]... Mr Powell...... Powell...... she... (edge weights 45, -30, 11; Y/N decisions). Make pair-wise merging decisions in dependent relation to each other by - calculating a joint prob. - including all edge weights - adding dependence on consistent triangles.

230 A Markov Random Field (MRF) for Co-reference [McCallum & Wellner, 2003]... Mr Powell...... Powell...... she... decisions Y N N -> partition score 44 (signed sum of the edge weights 45, -30, 11)

231 A Markov Random Field (MRF) for Co-reference [McCallum & Wellner, 2003]... Mr Powell...... Powell...... she... decisions Y Y N -> partition score -infinity (inconsistent triangle; edge weights 45, -30, 11)

232 A Markov Random Field (MRF) for Co-reference [McCallum & Wellner, 2003]... Mr Powell...... Powell...... she... decisions N Y N -> partition score 64 (signed sum of the edge weights 45, -30, 11)

233 Inference in these MRFs = Graph Partitioning [Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]... Mr Powell...... Powell...... she...... Condoleezza Rice... (edge weights 45, 11, -30, -134, 10, -106)

234 Inference in these MRFs = Graph Partitioning [Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]... Mr Powell...... Powell...... she...... Condoleezza Rice... (edge weights 45, 11, -30, -134, 10, -106) partition score = -22

235 Inference in these MRFs = Graph Partitioning [Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]... Mr Powell...... Powell...... she...... Condoleezza Rice... (edge weights 45, 11, -30, -134, 10, -106) partition score = 314
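A minimal sketch of the partitioning objective these slides illustrate: edges inside a cluster contribute +weight, edges cut by the partition contribute -weight. The toy graph below echoes some of the slide's weights but is an illustrative assumption, not the slide's exact computation:

```python
# Correlational-clustering objective: reward kept edges, penalize cut edges.
def partition_score(edges, cluster_of):
    score = 0
    for (a, b), w in edges.items():
        score += w if cluster_of[a] == cluster_of[b] else -w
    return score

# Toy affinity graph (weights are illustrative).
edges = {("Powell", "Mr Powell"): 45, ("Mr Powell", "she"): -30,
         ("Powell", "she"): 11, ("she", "Rice"): 10}
good = {"Powell": 0, "Mr Powell": 0, "she": 1, "Rice": 1}
bad  = {"Powell": 0, "Mr Powell": 1, "she": 1, "Rice": 1}
print(partition_score(edges, good))  # 74  (= 45 + 30 - 11 + 10)
print(partition_score(edges, bad))   # -76 (= -45 - 30 - 11 + 10)
```

Maximizing this score over all partitions is the graph-partitioning problem the slides reduce inference to.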

236 Co-reference Experimental Results [McCallum & Wellner, 2003]

Proper noun co-reference, DARPA ACE broadcast news transcripts, 117 stories:
                           Partition F1   Pair F1
Single-link threshold      16 %           18 %
Best prev match [Morton]   83 %           89 %
MRFs                       88 %           92 %
Error reduction            30 %           28 %

DARPA MUC-6 newswire article corpus, 30 stories:
                           Partition F1   Pair F1
Single-link threshold      11 %           7 %
Best prev match [Morton]   70 %           76 %
MRFs                       74 %           80 %
Error reduction            13 %           17 %

237 Y/N Joint Co-reference for Multiple Entity Types Stuart Russell [Culotta & McCallum 2005] S. Russel People

238 Y/N Joint Co-reference for Multiple Entity Types Stuart Russell University of California at Berkeley [Culotta & McCallum 2005] S. Russel Berkeley PeopleOrganizations

239 Y/N Joint Co-reference for Multiple Entity Types Stuart Russell University of California at Berkeley [Culotta & McCallum 2005] S. Russel Berkeley PeopleOrganizations Reduces error by 22%

240 Outline The need for joint inference Examples of joint inference –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) –Joint Relation Extraction and Data Mining (ICM) –Probability + First-order Logic, Co-ref on Entities (MCMC)

241 p Database field values c Joint segmentation and co-reference o s o s c c s o Citation attributes y y y Segmentation [Wellner, McCallum, Peng, Hay, UAI 2004] Inference: Sparse Generalized Belief Propagation Co-reference decisions Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990. [Pal, Sutton, McCallum, 2005] World Knowledge 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference. Extraction from and matching of research paper citations. see also [Marthi, Milch, Russell, 2003]

242 Joint IE and Coreference from Research Paper Citations. Textual citation mentions (noisy, with duplicates) -> joint segmentation and co-reference -> paper database, with fields, clean, duplicates collapsed. AUTHORS | TITLE | VENUE: Cowell, Dawid… | Probab… | Springer; Montemerlo, Thrun… | FastSLAM… | AAAI…; Kjaerulff | Approxi… | Technic…

243 Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. Citation Segmentation and Coreference

244 Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields Citation Segmentation and Coreference

245 Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields 2) Resolve coreferent citations Citation Segmentation and Coreference Y?NY?N

246 Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields 2) Resolve coreferent citations 3) Form canonical database record Citation Segmentation and Coreference AUTHOR =Brenda Laurel TITLE =Interface Agents: Metaphors with Character PAGES =355-366 BOOKTITLE =The Art of Human-Computer Interface Design EDITOR =T. Smith PUBLISHER =Addison-Wesley YEAR =1990 Y?NY?N Resolving conflicts

247 Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990. 1) Segment citation fields 2) Resolve coreferent citations 3) Form canonical database record Citation Segmentation and Coreference AUTHOR =Brenda Laurel TITLE =Interface Agents: Metaphors with Character PAGES =355-366 BOOKTITLE =The Art of Human-Computer Interface Design EDITOR =T. Smith PUBLISHER =Addison-Wesley YEAR =1990 Y?NY?N Perform jointly.

248 x s Observed citation CRF Segmentation IE + Coreference Model J Besag 1986 On the… AUT AUT YR TITL TITL

249 x s Observed citation CRF Segmentation IE + Coreference Model Citation mention attributes J Besag 1986 On the… AUTHOR = “J Besag” YEAR = “1986” TITLE = “On the…” c

250 x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Structure for each citation mention

251 x s IE + Coreference Model c Binary coreference variables for each pair of mentions J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining…

252 x s IE + Coreference Model c y n n J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Binary coreference variables for each pair of mentions

253 y n n x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Research paper entity attribute nodes AUTHOR = “P Smyth” YEAR = “2001” TITLE = “Data Mining…”...

254 y y y x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Research paper entity attribute node

255 y n n x s IE + Coreference Model c J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining…

256 Such a highly connected graph makes exact inference intractable, so…

257 Approximate Inference 1: Loopy Belief Propagation [figure: messages m_i(v_j) passed between nodes v1..v6]

258 Approximate Inference 1: Loopy Belief Propagation vs. Generalized Belief Propagation [figures: messages passed between nodes vs. messages passed between regions]. Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with region size!

259 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]. v6^(i+1) = argmax P(v6^i | v \ v6^i), all other nodes held constant

260 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]. v5^(j+1) = argmax P(v5^j | v \ v5^j), all other nodes held constant

261 Approximate Inference 2: Iterated Conditional Modes (ICM) [Besag 1986]. v4^(k+1) = argmax P(v4^k | v \ v4^k), all other nodes held constant. But greedy, and easily falls into local minima.

262 Approximate Inference 2: from ICM to "Iterated Conditional Sampling" or "Sparse Belief Propagation". Instead of passing only the argmax of P(v4^k | v \ v4^k), pass a sample of its top values, e.g. an N-best list (the top N values). Can use a "generalized" version of this, doing exact inference on a region of several nodes at once. Here, a "message" grows only linearly with region size and N!
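The contrast between ICM and its N-best variant can be sketched in a few lines; the label set and conditional scores below are illustrative assumptions:

```python
# ICM passes a single argmax per node; "iterated conditional sampling"
# passes an N-best list of high-probability values instead.
def top_n(conditional_scores, n):
    ranked = sorted(conditional_scores.items(), key=lambda kv: -kv[1])
    return ranked[:n]  # the N best values of P(v | all other nodes)

scores = {"PER": 0.55, "ORG": 0.30, "LOC": 0.10, "O": 0.05}
print(top_n(scores, 1))  # ICM-style message: [('PER', 0.55)]
print(top_n(scores, 3))  # N-best message: top three (label, score) pairs
```

Passing several candidates keeps downstream decisions from being locked into one greedy choice, which is exactly how the citation experiments later use N-best segmentations.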

263 Inference by Sparse “Generalized BP” Exact inference on these linear-chain regions J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… From each chain pass an N-best List into coreference [Pal, Sutton, McCallum 2005]

264 Inference by Sparse “Generalized BP” J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Approximate inference by graph partitioning… …integrating out uncertainty in samples of extraction Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000] [Pal, Sutton, McCallum 2005]

265 Inference: Sample = N-best List from CRF Segmentation. When calculating similarity with another citation, have more opportunity to find correct, matching fields.

N-best segmentations of one citation mention:
Name                 | Title                                          | …
Laurel, B            | Interface Agents: Metaphors with Character The | …
Laurel, B.           | Interface Agents: Metaphors with Character     | …
Laurel, B. Interface | Agents Metaphors with Character                | …

N-best segmentations of the other citation mention (y ? n):
Name                 | Title                                      | Book Title                                 | Year
Laurel, B. Interface | Agents: Metaphors with Character           | The Art of Human Computer Interface Design | 1990
Laurel, B.           | Interface Agents: Metaphors with Character | The Art of Human Computer Interface Design | 1990
Laurel, B. Interface | Agents: Metaphors with Character           | The Art of Human Computer Interface Design | 1990

266 y n n Inference by Sparse “Generalized BP” J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Exact (exhaustive) inference over entity attributes [Pal, Sutton, McCallum 2005]

267 y n n Inference by Sparse “Generalized BP” J Besag 1986 On the… Smyth. 2001 Data Mining… Smyth, P Data mining… Revisit exact inference on IE linear chain, now conditioned on entity attributes [Pal, Sutton, McCallum 2005]

268 y n n Parameter Estimation: Piecewise Training Coref graph edge weights MAP on individual edges Divide-and-conquer parameter estimation IE Linear-chain Exact MAP Entity attribute potentials MAP, pseudo-likelihood In all cases: Climb MAP gradient with quasi-Newton method [Sutton & McCallum 2005]

269 Results on 4 Sections of CiteSeer Citations -- Coreference F1 performance

N        Reinforce   Face    Reason   Constraint
1        0.946       0.967   0.945    0.961
3        0.950       0.979   0.961    0.960
7        0.948       0.979   0.951    0.971
9        0.982       0.967   0.960    0.971
Optimal  0.995       0.992   0.994    0.988

Average error reduction is 35%. "Optimal" makes best use of the N-best list by using true labels, indicating that even more improvement can be obtained.

270 p Database field values c Joint segmentation and co-reference o s o s c c s o Citation attributes y y y Segmentation [Wellner, McCallum, Peng, Hay, UAI 2004] Inference: Sparse Belief Propagation Co-reference decisions Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990. Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990. World Knowledge 35% reduction in co-reference error by using segmentation uncertainty. 6-14% reduction in segmentation error by using co-reference. Extraction from and matching of research paper citations. [Pal, Sutton, McCallum, 2005]

271 Outline The need for joint inference Examples of joint inference –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) –Joint Relation Extraction and Data Mining (ICM) –Probability + First-order Logic, Co-ref on Entities (MCMC)

272 Motivation: Robust Relation Extraction NameEducation George W. BushYale George W. Bush attended Yale… Bill Clinton attended Yale. Fellow alumnus George W. Bush… Yale is located in New Haven. When George W. Bush visited… Pattern matching Classifier with contextual and external features (thesaurus) Relations with sparse, noisy, or complex contextual evidence? How to learn predictive relational patterns? Knowledge discovery from text –Mine web to discover unknown facts George W. Bush graduated from Yale…

273 Data 270 Wikipedia articles 1000 paragraphs 4700 relations 52 relation types –JobTitle, BirthDay, Friend, Sister, Husband, Employer, Cousin, Competition, Education, … Targeted for density of relations –Bush/Kennedy/Manning/Coppola families and friends

274

275 Relation Extraction as Named-Entity Recognition + Classification …George W. Bush and his father, George H. W. Bush, … Person classify P(brother)=0.2 P(father)=0.6 P(sister)=0.01 … … both are subjects of same sentence arg1=person & arg2=person context=“father” Features

276 Relation Extraction as Named-Entity Recognition + Classification: difficulties with this approach. Must enumerate all pairs of entities in a document -- low signal/noise. Errors in NER propagate -- if "Ford" is mislabeled as a company, it won't be part of a brother relation.

277 Relation Extraction as Sequence Labeling George W. Bush …the son of George H. W. Bush … Father Most entities are related to subject Folds together NER and relation extraction Models dependency of adjacent relations –“Austrian physicist” [nationality jobTitle] Lots of work on sequence labeling –HMMs, CRFs, …

278 CRF Features Context words Lexicons –cities, states, names, companies Regexp –Capitalization, ContainsDigits, ContainsPunctuation Part-of-speech Prefixes/suffixes Conjunctions of these within window of size 6
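A hedged sketch of the kind of token-level feature extractor this slide lists (lexicons omitted; the feature names are invented for illustration):

```python
import re

# Sketch of per-token CRF features: word identity, capitalization/digit
# regexes, suffixes, and a conjunction with the previous word.
def token_features(tokens, i):
    w = tokens[i]
    feats = {f"word={w}": 1}
    if w[:1].isupper():
        feats["is_capitalized"] = 1
    if re.search(r"\d", w):
        feats["contains_digit"] = 1
    feats[f"suffix3={w[-3:]}"] = 1          # prefix/suffix features
    if i > 0:                               # conjunction within a small window
        feats[f"prev+cur={tokens[i-1]}+{w}"] = 1
    return feats

print(token_features(["George", "W.", "Bush"], 2))
```

A real extractor would add lexicon lookups, part-of-speech tags, and conjunctions over a window of size 6, as the slide describes.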

279 Example Features Father: son of NAME ; father, NAME Brother: [his|her] brother X Executive: JOBTITLE of X Birthday: born MONTH [0-9]+ Boss: under JOBTITLE X Competition: defeating NAME Award: awarded DET X ; won DET X

280

281 Mining Relational Features Want to discover database regularities across documents that provide strong evidence of relation –High Precision: parent(x,z) ^ sibling(z,w) ^ child(w,y) => cousin(x,y) –Low Precision: friends tend to attend the same schools

282 Mining Relational Features Generate relational path features from extracted (or true) database. Paths between entities up to length k

283 cousin Nancy Ellis Bush sibling George HW Bush George W Bush son John Prescott Ellis son George W. Bush …his father George H. W. Bush… …his cousin John Prescott Ellis… George H. W. Bush …his sister Nancy Ellis Bush… Nancy Ellis Bush …her son John Prescott Ellis… Cousin = Father’s Sister’s Son X Y

284 John Kerry …celebrated with Stuart Forbes… Stuart ForbesJames Forbes John KerryRosemary Forbes SonName James ForbesRosemary Forbes SiblingName cousin James Forbes sibling Rosemary Forbes John Kerry son Stuart Forbes son likely a cousin
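The relational path features these two examples walk through can be sketched as a bounded path search over an extracted fact table; the toy facts below paraphrase the slides' Bush-family example, and the entity names are shortened for illustration:

```python
# Sketch: enumerate relation paths up to length k between two entities in the
# extracted DB, e.g. the slide's cousin = Father -> Sister -> Son.
DB = {  # (subject, relation) -> object; illustrative toy facts
    ("GeorgeW", "Father"): "GeorgeHW",
    ("GeorgeHW", "Sister"): "NancyEllis",
    ("NancyEllis", "Son"): "JohnEllis",
}

def paths(src, dst, k):
    found, frontier = [], [(src, [])]
    for _ in range(k):                      # breadth-first, at most k hops
        nxt = []
        for node, path in frontier:
            for (s, rel), o in DB.items():
                if s == node:
                    if o == dst:
                        found.append(path + [rel])
                    nxt.append((o, path + [rel]))
        frontier = nxt
    return found

print(paths("GeorgeW", "JohnEllis", 3))  # [['Father', 'Sister', 'Son']]
```

Each discovered path (here Father -> Sister -> Son) becomes one relational feature for the second-pass CRF described on the next slide.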

285 Iterative DB Construction Joseph P. Kennedy, Sr … son John F. Kennedy with Rose Fitzgerald NameSon Joseph P. KennedyJohn F. Kennedy Rose FitzgeraldJohn F. Kennedy Wife Son Fill DB with “first-pass” CRF Use relational features with “second-pass” CRF Ronald Reagan George W. Bush (0.3)

286 Results -- Member Of: 0.4211 -> 0.6093

         ME      CRF     RCRF    RCRF.9   RCRF.5   RCRF Truth   RCRF Truth.5
F1       .5489   .5995   .6100   .6008    .6136    .6791        .6363
Prec     .6475   .7019   .6799   .7177    .7095    .7553        .7343
Recall   .4763   .5232   .5531   .5166    .5406    .6169        .5614

ME = maximum entropy; CRF = conditional random field; RCRF = CRF + mined features

287 Examples of Discovered Relational Features -- Mother: Father -> Wife; Cousin: Mother -> Husband -> Nephew; Friend: Education -> Student; Education: Father -> Education; Boss: Boss -> Son; MemberOf: Grandfather -> MemberOf; Competition: PoliticalParty -> Member -> Competition

288 Outline The need for joint inference Examples of joint inference –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) –Joint Relation Extraction and Data Mining (ICM) –Probability + First-order Logic, Co-ref on Entities (MCMC)

289 Sometimes graph partitioning with pairwise comparisons is not enough. Entities have multiple attributes (name, email, institution, location); need to measure "compatibility" among them. Having 2 "given names" is common, but not 4 -- e.g. Howard M. Dean / Martin, Dean / Howard Martin. Need to measure the size of the clusters of mentions. ∃ a pair of last-name strings that differ > 5? We need measures on hypothesized "entities". We need first-order logic.

290 Toward High-Order Representations Identity Uncertainty..Howard Dean....H Dean....Dean Martin....Dino....Howard Martin....Howard..

291 Toward High-Order Representations Identity Uncertainty..Howard Dean....H Dean....Dean Martin....Dino....Howard Martin....Howard..

292 Pairwise Co-reference Features. Dean Martin, Howard Dean, Howard Martin. SamePerson(Howard Dean, Howard Martin)? SamePerson(Dean Martin, Howard Martin)? SamePerson(Dean Martin, Howard Dean)? Pairwise features: StringMatch(x1,x2), EditDistance(x1,x2)

293 Dean Martin Howard Dean Howard Martin SamePerson(Howard Dean, Howard Martin, Dean Martin)? First-Order Features maximum edit distance between any pair is < 0.5 number of distinct names < 3 All have same gender There exists a number mismatch Only pronouns Toward High-Order Representations Identity Uncertainty
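A minimal sketch of computing such first-order features over a whole hypothesized cluster of mentions (the feature names are illustrative; the edit distance is plain Levenshtein):

```python
# First-order features are functions of an entire mention cluster,
# not of a single pair -- e.g. max pairwise edit distance, distinct names.
def edit_distance(a, b):
    # classic Levenshtein dynamic program
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster_features(mentions):
    pairs = [(a, b) for i, a in enumerate(mentions) for b in mentions[i + 1:]]
    max_dist = max(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs)
    first_names = {m.split()[0] for m in mentions}
    return {"max_pair_edit_dist": max_dist,
            "num_distinct_first_names": len(first_names)}

f = cluster_features(["Howard Dean", "Howard Martin", "Dean Martin"])
print(f["num_distinct_first_names"])  # 2 (Howard, Dean)
```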

294 Weighted Logic This model brings together the two main (long-separated) branches of Artificial Intelligence: Logic Probability

295 Toward High-Order Representations: Identity Uncertainty. Dean Martin, Howard Dean, Howard Martin, Dino, Howie Martin. SamePerson(x1,x2)… SamePerson(x1,x2,x3)… SamePerson(x1,x2,x3,x4)… SamePerson(x1,x2,x3,x4,x5)… SamePerson(x1,x2,x3,x4,x5,x6)… Combinatorial Explosion!

296 This space complexity is common in first-order probabilistic models

297 Markov Logic as a Template to Construct a (Ground) Markov Network using First-Order Logic [Richardson & Domingos 2005]. Grounding the Markov network requires space O(n^r), where n = number of constants and r = highest clause arity [Paskin & Russell 2002]

298 How can we perform inference and learning in models that cannot be grounded?

299 Inference in First-Order Models: SAT Solvers. Weighted SAT solvers [Kautz et al 1997] -- requires complete grounding of the network. LazySAT [Singla & Domingos 2006] -- saves memory by storing only clauses that may become unsatisfied; initialization still requires time O(n^r) to visit all ground clauses.

300 Inference in First-Order Models: MCMC. Gibbs Sampling -- difficult to move between high-probability configurations by changing single variables (although, consider MC-SAT! [Poon & Domingos '06]). An alternative: Metropolis-Hastings sampling [Culotta & McCallum 2006] -- can be extended to partial configurations (only instantiate relevant variables); key advantage: can design arbitrary "smart" jumps; successfully used in BLOG models [Milch et al 2005]; 2 parts: proposal distribution, acceptance distribution.

301 Model Howard Dean Governor Howie Dean Martin Dino Howard Martin Howie Martin f w : SamePerson(x) f b : DifferentPerson(x, x’ ) “First-order features”

302 Model Howard Dean Governor Howie Dean Martin Dino Howard Martin Howie Martin

303 Model Z X : Sum over all possible configurations!

304 Inference with Metropolis-Hastings. p(y')/p(y): likelihood ratio -- ratio of P(Y|X); Z_X cancels! q(y'|y): proposal distribution -- probability of proposing move y -> y'. What is nice about this? Can design arbitrary "smart" proposal distributions.
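A sketch of the acceptance probability, showing why only the ratio p(y')/p(y) is needed: unnormalized scores suffice, because the partition function Z_X divides out of the ratio. The numeric scores below are illustrative assumptions:

```python
# Metropolis-Hastings acceptance: min(1, [p(y')/p(y)] * [q(y|y')/q(y'|y)]).
# p_new and p_old may be unnormalized -- Z_X cancels in the ratio.
def mh_accept_prob(p_new, p_old, q_rev, q_fwd):
    return min(1.0, (p_new * q_rev) / (p_old * q_fwd))

# With a symmetric proposal, the q terms cancel; only the likelihood ratio matters.
print(mh_accept_prob(p_new=2.0, p_old=8.0, q_rev=0.5, q_fwd=0.5))  # 0.25
print(mh_accept_prob(p_new=9.0, p_old=3.0, q_rev=0.5, q_fwd=0.5))  # 1.0
```

A "smart" proposal (e.g. splitting or merging whole mention clusters, as the next slides show) only changes the q terms, not this acceptance rule.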

305 Proposal Distribution Dean Martin Howie Martin Howard Martin Dino Dean Martin Dino Howard Martin Howie Martin y y’

306 Proposal Distribution Dean Martin Howie Martin Howard Martin Dino Dean Martin Howie Martin Howard Martin Howie Martin y y’

307 Proposal Distribution Dean Martin Howie Martin Howard Martin Dino Dean Martin Howie Martin Howard Martin Howie Martin y y’

308 Feature List. Exact match/mis-match: entity type; gender (requires lexicon); number; case; entity text; entity head; entity modifier/numerical modifier; sentence; WordNet: hypernym, synonym, antonym. Other: relative pronoun agreement; sentence distance in bins; partial text overlaps. Quantification: Existential -- ∃ a gender mismatch, ∃ three different first names; Universal -- ∀ NER types match, ∀ named mentions string-identical. Filters (limit quantifiers to mention type): none; pronoun; nominal (description); proper (name).

309 Learning the Likelihood Ratio. Can't normalize over all possible y's... temporarily consider the following ad hoc training: maximize p(b|y,x), where b ∈ {TRUE, FALSE}

310 Error Driven Training Motivation Where to get training examples? Generate all possible partial clusters intractable –Sample uniformly? –Sample from clusters visited during inference? Error driven –Focus learning on examples that need it most

311 Error-Driven Training Results -- B-cubed F1: Non-Error-driven 69; Error-driven 72

312 Learning the Likelihood Ratio Given a pair of configurations, learn to rank the “better” configuration higher.

313 Rank-Based Training Instead of training [Powell, Mr. Powell, he] --> TRUE [Powell, Mr. Powell, she] --> FALSE...Rather... [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she] [Powell, Mr. Powell, he] > [Powell, Mr. Powell] [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

314 Rank-Based Training Results -- B-cubed F1: Non-Rank-Based (Error-driven) 72; Rank-Based (Error-driven) 79. Previous best in literature: 68

315 Experimental Results ACE 2005 newswire coreference All entity types –Proper-, common-, pro- nouns 443 documents B-cubed Previous best results, 1997: 65 Previous best results, 2002: 67 Previous best results, 2005: 68 Our new results 79 [Culotta, Wick, Hall, McCallum, NAACL/HLT 2007]
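A hedged sketch of the B-cubed metric used in these results: per-mention precision and recall over the predicted vs. true cluster containing each mention, averaged and combined into F1. The toy clustering is an illustrative assumption:

```python
# B-cubed F1 over cluster assignments (mention -> cluster id).
def b_cubed(pred, gold):
    def cluster(members, m, assign):
        return {x for x in members if assign[x] == assign[m]}
    mentions = list(gold)
    p = r = 0.0
    for m in mentions:
        pc, gc = cluster(mentions, m, pred), cluster(mentions, m, gold)
        overlap = len(pc & gc)
        p += overlap / len(pc)   # precision for mention m
        r += overlap / len(gc)   # recall for mention m
    p, r = p / len(mentions), r / len(mentions)
    return 2 * p * r / (p + r)

pred = {"a": 0, "b": 0, "c": 1}
gold = {"a": 0, "b": 1, "c": 1}
print(round(b_cubed(pred, gold), 3))  # 0.667
```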

316 Learning the Proposal Distribution by Tying Parameters Proposal distribution q(y’|y) “cheap” approximation to p(y) Reuse subset of parameters in p(y) E.g. in identity uncertainty model –Sample two clusters –Stochastic agglomerative clustering to propose new configuration

317 Scalability Currently running on >2 million author name mentions. Canopies Schedule of different proposal distributions

318 Weighted Logic Summary Bring together –Logic –Probability Inference and Learning incredibly difficult. Our recommendation: –MCMC for inference –Error-driven, rank-based for training

319 Outline The need for joint inference Examples of joint inference –Joint Labeling of Cascaded Sequences (Belief Propagation) –Joint Labeling of Distant Entities (BP by Tree Reparameterization) –Joint Co-reference Resolution (Graph Partitioning) –Joint Segmentation and Co-ref (Sparse BP) –Joint Relation Extraction and Data Mining (ICM) –Probability + First-order Logic, Co-ref on Entities (MCMC)

320 End of Part 2

321 Investigate Beam Search More Formally Beam Search is a well known method for speeding up decoding We investigate: A principled approach for beam search and learning Cast the problem as variational inference Adaptively construct smallest possible beam, with KL divergence < eps Graphical model for HMM

322 Sparse Belief Propagation: "Beam Search" in arbitrary graphical models, with a different pruning strategy based on variational inference. 1. Compute messages m_st, m_vt into x_t; 2. Form the marginal b(x_t); 3. Retain 1 - ε of the mass, giving the pruned belief b'(x_t).
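Step 3 on this slide, pruning a marginal to the smallest set of states holding at least 1 - ε of the mass, can be sketched as follows (the tag set and probabilities are illustrative assumptions):

```python
# Keep the fewest states whose total probability is at least 1 - eps,
# then renormalize the surviving sparse marginal.
def sparse_support(belief, eps=0.1):
    total, kept = 0.0, {}
    for state, p in sorted(belief.items(), key=lambda kv: -kv[1]):
        kept[state] = p
        total += p
        if total >= 1 - eps:
            break
    z = sum(kept.values())
    return {s: p / z for s, p in kept.items()}   # renormalized sparse belief

b = {"NN": 0.55, "VB": 0.30, "JJ": 0.10, "DT": 0.05}
print(sorted(sparse_support(b, eps=0.1)))  # ['JJ', 'NN', 'VB'] cover >= 90% of mass
```

Unlike a fixed-width beam, the retained set grows or shrinks with the entropy of the marginal, which is the adaptivity the slide's KL-divergence criterion formalizes.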

323 "Sparse Belief Propagation" used during training. Like beam search, but with modifications from variational methods of inference. [Pal, Sutton, McCallum, ICASSP 2006] [plot: "accuracy" vs. training time for no beam, sparse BP, and traditional beam]

324 Context Segment Classify Associate Cluster Filter Prediction Outlier detection Decision support IE Document collection Database Discover patterns - entity types - links / relations - events Data Mining Spider Actionable knowledge Leveraging Text in Social Network Analysis Joint inference among detailed steps

325 Outline Role Discovery (Author-Recipient-Topic Model, ART) Group Discovery (Group-Topic Model, GT) Enhanced Topic Models –Correlations among Topics (Pachinko Allocation, PAM) –Time Localized Topics (Topics-over-Time Model, TOT) –Markov Dependencies in Topics (Topical N-Grams Model, TNG) Bibliometric Impact Measures enabled by Topics Social Network Analysis with Topic Models Multi-Conditional Mixtures

