1 Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the universe a dozen pictures of what he was doing. He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated planets in the universe - ninety-six billion planets - into the supercircuit that would connect them all into one supercalculator, one cybernetics machine that would combine all the knowledge of all the galaxies. Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment’s silence he said, “Now, Dwar Ev.” Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel. Dwar Ev stepped back and drew a deep breath. “The honour of asking the first questions is yours, Dwar Reyn.” “Thank you,” said Dwar Reyn. “It shall be a question which no single cybernetics machine has been able to answer.” He turned to face the machine. “Is there a God ?” The mighty voice answered without hesitation, without the clicking of a single relay. “Yes, now there is a god.” Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch. A bolt of lightning from the cloudless sky struck him down and fused the switch shut. ‘Answer’ by Fredric Brown. ©1954, Angels and Spaceships Information Extraction: 10-707 and 11-748

2 Instructor and stuff: –William Cohen (wcohen@cs, Wean 8217) Assistant: Sharon Cavlovich (sharonw@cs) WikiManager: Katie Rivard (krivard@andrew.cmu.edu) Office hours: Thurs 11:30-12:30 or by appt –TA: Ni Lao (nlao@cs.cmu.edu) Web page: –http://www.cs.cmu.edu/~wcohen/10-707/ -> http://malt.ml.cmu.edu/mw/index.php/Information_Extraction_10-707_in_Fall_2010 Mon-Wed 1-2:50pm, Gates 4101

3 Information Extraction: 10-707 and 11-748 Prerequisite: –Machine learning or consent of William Grading: –Do a project, preferably in a group of 2-3. Typical example: new algorithm on an old dataset, or old algorithm on a new dataset. Write it up as a conference paper and present (as poster?) Your idea doesn’t have to actually work, or be novel. Timing: proposal and team 10/8, status update 11/8, due 12/8 –Present a paper: one of the suggested “optional” papers, or one I approve. Schedule anytime after we’ve covered the associated material in class. Must give me one week’s notice –Do the readings and understand them. Ask or answer a question with Google moderator before class. Contribute 9 pages to a wiki on machine learning papers, sort of like this one –Three in Sept, three in Oct, three in Nov. Think of this as a semi-structured related work section for your project

4 What’s with the wiki? Reading, thinking and summarizing papers is hard work Doing critiques that nobody but the TA reads seems like wasted effort So why not direct the effort toward my evil secret plan? –Personalized summarization for scientific papers –E.g., “Combination of Kaski’s sparse block model with Erosheva et al’s LinkLDA model using collapsed Gibbs sampling applied to the problem of protein-protein interaction prediction, evaluated on Airoldi et al’s data.”

5 First assignment: due Monday Ask for an account to be created on the class wiki –Email: Katie Rivard (krivard@andrew.cmu.edu) Go to http://malt.cmu.edu/mw Set up your user page –Your real name & a link to your home page –Preferably a picture –Who you are and what you hope to get out of the class (Let me know if you’re just auditing) –Any special skills you have, research interests that you have, IE related projects you have been or might be working on, etc.

6 Information Extraction: 10-707 and 11-748 What is covered? –What is information extraction? “(ML Approaches to) Extracting Structured Information from Text” “Learning How to Turn Words into Data” –Applications: Web info extraction: building catalogs, directories, etc from web sites Biotext info extraction: extracting facts like regulates(CDC23,TNF-1b) Question-answering: answering Q’s like “who invented the light bulb?” …. –Techniques: Named entity recognition: finding names in text –… –Graphical models for classifying sequences of tokens Extracting facts (aka events, relationships) – classifying pairs of extractions Normalizing extracted data – classifying pairs of extractions Semi- and unsupervised approaches to finding information from large corpora (aka bootstrapping – “read the web”-like techniques) Today: –Admin, motivation –A brief overview of IE, and a less brief overview of named entity recognition

7 Motivation: Why bother with IE?

8 [‘Answer’ by Fredric Brown, ©1954, Angels and Spaceships: full text as on slide 1]

9 Some observations In the distant future: –Complex AI systems are completed by ceremonially soldering the final connection, not ceremonially compiling the last Java class –Performance is monitored by clicking relays –A “lightning-from-a-cloudless-sky” peripheral exists Writing and debugging device drivers is a dangerous and highly skilled profession –Question-answering interfaces are still in use Natural-language query in, answer out –Answering (some) complex questions requires combining information from many different places With different parts contributed by different people?

10 Two ways to manage information
Retrieval: Query -> [text: Xxx xxxx xxxx xxx xxx xxx xx xxxx xxxx xxx] -> Answer
Inference: Query -> [facts] -> Answer
Facts: advisor(wc,nl) advisor(yh,tm) affil(wc,mld) affil(vc,nl) name(wc,William Cohen) name(nl,Ni Lao)
Query: X:advisor(wc,X)&affil(X,lti) ?
Answer: {X=em; X=nl}
[Other diagram labels: “ceremonial soldering”, AND]
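The inference side of slide 10 can be sketched as a join over extracted facts. This is only an illustration: the fact base below is hypothetical (the slide does not show every fact needed to derive its answer), and `advisees_in` is a made-up helper name.

```python
# Hypothetical fact base in the style of slide 10; these tuples are assumptions.
FACTS = {
    ("advisor", "wc", "nl"),
    ("advisor", "wc", "em"),
    ("advisor", "yh", "tm"),
    ("affil", "nl", "lti"),
    ("affil", "em", "lti"),
    ("affil", "tm", "mld"),
}


def advisees_in(advisor, dept, facts=FACTS):
    """Answer the query X : advisor(advisor, X) & affil(X, dept)."""
    # Each conjunct selects a set of bindings for X; AND is set intersection.
    advised = {x for (rel, a, x) in facts if rel == "advisor" and a == advisor}
    members = {x for (rel, x, d) in facts if rel == "affil" and d == dept}
    return sorted(advised & members)


print(advisees_in("wc", "lti"))
```

Retrieval would instead return the documents that mention the query terms; the join is only possible once the text has been turned into facts.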

11 Some observations Using computers to combine information from multiple places is and has been important…

12 Some observations Using computers to merge information is and has been important… –Data cleaning and integration, record linkage, … –Standards for data exchange: KQML, KIF, DAML+OIL, … Semantic web: N3Logic, OWL, … –Friend-of-a-friend, GeneOntology, …. –Growth from 456 OWL ontologies in 2004 to 14,600 in 2007 Number of web pages estimated at 11.5B as of early 2006 –#webPages/#ontologies =~ 1,000,000 ? –#webSites/#ontologies =~ 10,000 ? –It seems to be much easier to generate sharable text than to generate sharable knowledge. –A lot of accessible knowledge is only accessible in text

13 How do you extract information? [Cohen / McCallum tutorial, NIPS 2002, KDD 2003, …] [Some pilfering from Tom Mitchell’s invited talks]

14 What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: October 14, 2002, 4:00 a.m. PT For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
NAME | TITLE | ORGANIZATION

15 What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: [news text as on slide 14] IE
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..

16 What is “Information Extraction” Filling slots in a database from sub-segments of text. As a task: [news text as on slide 14] IE [table as on slide 15] QA End User

17 What is “Information Extraction” Information Extraction = segmentation + classification + clustering + association As a family of techniques: [news text as on slide 14]
Segments: Microsoft Corporation; CEO; Bill Gates; Microsoft; Gates; Microsoft; Bill Veghte; Microsoft; VP; Richard Stallman; founder; Free Software Foundation
aka “named entity extraction”

18 What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: [news text as on slide 14] [segments as on slide 17]

19 What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: [news text as on slide 14] [segments as on slide 17]

20 What is “Information Extraction” Information Extraction = segmentation + classification + association + clustering As a family of techniques: [news text as on slide 14] [segments as on slide 17]
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
* * * *

21 Example: Finding Job Ads on the Web [Screenshot labels: Martin Baker, a person; Genomics job; Employers job posting form]

22 Example: A Solution

23 Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1

24 Job Openings: Category = Food Services Keyword = Baker Location = Continental U.S.

25 Data Mining the Extracted Job Information

27 Notice that we get something useful from just identifying the person names and then doing some counting and trending

31 Sunita’s Breakdown of IE What’s the end goal (application?) What’s the input (corpus)? How is it preprocessed? How is output postprocessed (to make querying easier)? What structure is extracted? –Entity names? (“William Cohen”, “Anthony ‘Van’ Jones”) –Relationships between entities? (“Richard Wang” studentOf “William Cohen”) –Features/properties/adjectives describing entities? (“iPhone 3G”  “expensive service plan”, “color screen”) What (learning) methods are used?

32 Landscape of IE Tasks (1/4): Degree of Formatting Text paragraphs without formatting Grammatical sentences and some formatting & links Non-grammatical snippets, rich formatting & links Tables Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

33 Landscape of IE Tasks (2/4): Intended Breadth of Coverage
Web site specific | Genre specific | Wide, non-specific
Amazon.com Book Pages | Resumes | University Names
Formatting | Layout | Language

34 Landscape of IE Tasks (3/4): Complexity of extraction task
E.g. word patterns:
Closed set (U.S. states): He was born in Alabama… / The big Wyoming sky…
Regular set (U.S. phone numbers): Phone: (413) 545-1323 / The CALD main office can be reached at 412-268-1299
Complex pattern (U.S. postal addresses): University of Arkansas P.O. Box 140 Hope, AR 71802 / Headquarters: 1128 Main Street, 4th Floor Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence (Person names): …was among the six houses sold by Hope Feldman that year. / Pawel Opalinski, Software Engineer at WhizBang Labs.
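The two easiest cases on slide 34 can be sketched directly: a closed set is a lexicon lookup and a regular set is a regular expression. This is a minimal sketch; the truncated state list and the phone pattern are assumptions that cover only the formats quoted on the slide.

```python
import re

# Closed set: membership in a fixed lexicon (truncated here; an assumption).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}

# Regular set: a deliberately narrow pattern for "(413) 545-1323" and
# "412-268-1299"; real phone formats vary more than this.
PHONE = re.compile(r"(?:\(\d{3}\)\s*|\d{3}-)\d{3}-\d{4}")


def find_states(text):
    """Return single-token state names found in the text."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]


def find_phones(text):
    """Return every substring that matches the phone pattern."""
    return PHONE.findall(text)


print(find_states("He was born in Alabama"))
print(find_phones("The CALD main office can be reached at 412-268-1299"))
```

Neither trick carries over to the harder rows: postal addresses and person names such as "Hope Feldman" need context, which is what the learned models later in the deck are for.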

35 Landscape of IE Tasks (4/4): Single Field/Record
Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.
Single entity (“Named entity” extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
Binary relationship: Relation: Person-Title, Person: Jack Welch, Title: CEO; Relation: Company-Location, Company: General Electric, Location: Connecticut
N-ary record: Relation: Succession, Company: General Electric, Title: CEO, Out: Jack Welch, In: Jeffrey Immelt

36 A little more depth on named entity recognition (NER)

37 Models for NER (running example: Abraham Lincoln was born in Kentucky.)
Lexicons: Alabama Alaska … Wisconsin Wyoming; member?
Classify Pre-segmented Candidates: Classifier, which class?
Sliding Window: Classifier, which class? Try alternate window sizes.
Boundary Models: Classifier, which class? BEGIN END BEGIN END BEGIN
Token Tagging: Most likely state sequence? This is often treated as a structured prediction problem…classifying tokens sequentially. HMMs, CRFs, ….

38 Sliding Windows

39 Extraction by Sliding Window GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

43 A “Naïve Bayes” Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}; contents: w_t … w_{t+n}; suffix: w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.

44 A “Naïve Bayes” Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}; contents: w_t … w_{t+n}; suffix: w_{t+n+1} … w_{t+n+m}
1. Create dataset of examples like these:
+ (prefix00,…,prefixColon,contentWean,contentHall,…,suffixSpeaker,…)
- (prefixColon,…,prefixWean,contentHall,…,contentSpeaker,suffixColon,…)
…
2. Train a NaiveBayes classifier (or YFCL), treating the examples like BOWs for text classification
3. If Pr(class=+|prefix,contents,suffix) > threshold, predict the content window is a location.
To think about: what if the extracted entities aren’t consistent, e.g. if the location overlaps with the speaker?
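The three steps on slide 44 can be sketched in plain Python. This is an illustrative toy under stated assumptions, not Freitag's system: the feature names use the raw token (e.g. `prefix:` rather than `prefixColon`), the prefix/suffix width `m`, the add-one smoothing, and the single training document are all choices made here.

```python
import math
from collections import Counter


def featurize(tokens, start, end, m=2):
    """Position-tagged bag of words for the window tokens[start:end]."""
    prefix = tokens[max(0, start - m):start]
    suffix = tokens[end:end + m]
    return (["prefix" + w for w in prefix]
            + ["content" + w for w in tokens[start:end]]
            + ["suffix" + w for w in suffix])


def all_windows(tokens, max_len=4):
    """Every (start, end) window of at most max_len tokens."""
    return [(s, e) for s in range(len(tokens))
            for e in range(s + 1, min(len(tokens), s + max_len) + 1)]


class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing."""

    def fit(self, examples):
        self.class_counts = Counter(label for _, label in examples)
        self.feat_counts = {"+": Counter(), "-": Counter()}
        for feats, label in examples:
            self.feat_counts[label].update(feats)
        self.vocab_size = len({f for feats, _ in examples for f in feats})
        return self

    def prob_positive(self, feats):
        total = sum(self.class_counts.values())
        log_score = {}
        for c in ("+", "-"):
            # +1 in the denominator reserves mass for unseen features
            denom = sum(self.feat_counts[c].values()) + self.vocab_size + 1
            log_score[c] = math.log((self.class_counts[c] + 1) / (total + 2))
            for f in feats:
                log_score[c] += math.log((self.feat_counts[c][f] + 1) / denom)
        top = max(log_score.values())
        z = sum(math.exp(v - top) for v in log_score.values())
        return math.exp(log_score["+"] - top) / z


# Step 1: one positive window (the location), every other window negative.
tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
gold = (5, 9)  # "Wean Hall Rm 5409"
examples = [(featurize(tokens, s, e), "+" if (s, e) == gold else "-")
            for s, e in all_windows(tokens)]

# Step 2: train. Step 3: score candidate windows and compare to a threshold.
model = NaiveBayes().fit(examples)
p_gold = model.prob_positive(featurize(tokens, *gold))
p_other = model.prob_positive(featurize(tokens, 0, 1))
print(round(p_gold, 3), round(p_other, 3))
```

With a single training document the class prior heavily favors the negative class, so the useful signal here is the ranking of windows rather than any particular threshold. The slide's closing question also shows up directly: overlapping windows are scored independently, so nothing prevents inconsistent extractions.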

45 “Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements [seminar announcement as on slide 39]
Field | F1
Person Name | 30%
Location | 61%
Start Time | 98%

46 Token Tagging

47 NER by tagging tokens
Given a sentence: Yesterday Pedro Domingos flew to New York.
1) Break the sentence into tokens, and classify each token with a label indicating what sort of entity it’s part of (person name, location name, background): Yesterday Pedro Domingos flew to New York
2) Identify names based on the entity labels: Person name: Pedro Domingos; Location name: New York
3) To learn an NER system, use YFCL.

48 NER by tagging tokens Yesterday Pedro Domingos flew to New York person name location name background Another common labeling scheme is BIO (begin, inside, outside; e.g. beginPerson, insidePerson, beginLocation, insideLocation, outside) BIO also leads to strong dependencies between nearby labels (eg inside follows begin) Similar labels tend to cluster together in text
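Step 2 of the tagging recipe (turning token labels back into names) can be sketched for the BIO scheme as follows. The label strings `B-PER`, `I-PER`, `B-LOC`, `I-LOC`, `O` are an assumed spelling of the slide's beginPerson, insidePerson, beginLocation, insideLocation, outside.

```python
def bio_to_spans(tokens, labels):
    """Collect (entity_type, text) spans from BIO labels such as B-PER, I-PER, O."""
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        prefix, _, etype = label.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # Start a new span; an I- tag with no matching open span is
            # treated as a begin rather than dropped.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = etype, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O": close any open span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "Yesterday Pedro Domingos flew to New York".split()
labels = ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, labels))
```

The explicit begin tag is what lets two adjacent names of the same type be separated, which a plain person/location/background labeling cannot do.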

49 NER with Hidden Markov Models
Given a sequence of observations: Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM (states: person name, location name, background):
Find the most likely state sequence: (Viterbi)
Any words said to be generated by the designated “person name” state extract as a person name: Person name: Pedro Domingos

50 HMM for Segmentation of Addresses [Pilfered from Sunita Sarawagi, IIT/Bombay]
Simplest HMM Architecture: One state per entity type
Per-state word probabilities shown for two states:
CA 0.15, NY 0.11, PA 0.08, …
Hall 0.15, Wean 0.03, N-S 0.02, …

51 HMMs for Information Extraction
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
1. The HMM consists of two probability tables:
Pr(currentState=s|previousState=t) for s=background, location, speaker
Pr(currentWord=w|currentState=s) for s=background, location, …
2. Estimate these tables with a (smoothed) CPT:
Prob(location|location) = #(loc->loc)/#(loc->*) transitions
3. Given a new sentence, find the most likely sequence of hidden states using Viterbi method:
MaxProb(curr=s|position=k) = max over states t of MaxProb(curr=t|position=k-1) * Prob(word=w_{k-1}|t) * Prob(curr=s|prev=t)
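Step 3 can be sketched as a standard Viterbi decoder over the two probability tables. This is a minimal sketch: the toy probabilities are invented for illustration, the `unk` floor for unseen words is an assumption standing in for proper smoothing, and the emission for position k is folded into the score at k rather than at k+1 as in the slide's recurrence (the resulting best path is the same).

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely state sequence for `words` under an HMM given as nested dicts."""
    # best[k][s] = (log-prob of the best path ending in state s at position k,
    #               previous state on that path)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(words[0], unk)), None)
             for s in states}]
    for k in range(1, len(words)):
        column = {}
        for s in states:
            score, prev = max(
                (best[k - 1][t][0] + math.log(trans_p[t][s]), t) for t in states)
            column[s] = (score + math.log(emit_p[s].get(words[k], unk)), prev)
        best.append(column)
    # Follow back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for k in range(len(words) - 1, 0, -1):
        state = best[k][state][1]
        path.append(state)
    return path[::-1]


# Toy model: all numbers below are made up for illustration.
STATES = ["background", "location"]
START = {"background": 0.9, "location": 0.1}
TRANS = {"background": {"background": 0.8, "location": 0.2},
         "location": {"background": 0.3, "location": 0.7}}
EMIT = {"background": {"Place": 0.3, ":": 0.3, "Speaker": 0.4},
        "location": {"Wean": 0.4, "Hall": 0.4, "Rm": 0.1, "5409": 0.1}}

words = ["Place", ":", "Wean", "Hall", "Speaker"]
print(list(zip(words, viterbi(words, STATES, START, TRANS, EMIT))))
```

Working in log space avoids underflow on long sentences; the tokens assigned to the `location` state are the ones that would be extracted.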

52 “Naïve Bayes” Sliding Window vs HMMs
Domain: CMU UseNet Seminar Announcements [seminar announcement as on slide 39]
Field | F1 (sliding window) | F1 (HMM)
Speaker | 30% | 77%
Location | 61% | 79%
Start Time | 98% | 98%

53 What is a “symbol” ??? Cohen => “Cohen”, “cohen”, “Xxxxx”, “Xx”, … ? 5317 => “5317”, “9999”, “9+”, “number”, … ? Datamold: choose best abstraction level using holdout set
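The abstraction levels on slide 53 can be generated mechanically. A minimal sketch follows; the function name and the exact set of levels are assumptions, and the most abstract level collapses repeated characters to a single one (so a number becomes `9` rather than the slide's `9+`).

```python
import re


def word_shapes(token):
    """Return the token at several abstraction levels, most specific first."""
    long_shape = re.sub(r"[A-Z]", "X", token)
    long_shape = re.sub(r"[a-z]", "x", long_shape)
    long_shape = re.sub(r"[0-9]", "9", long_shape)
    # Collapse runs of the same shape character: Xxxxx -> Xx, 9999 -> 9
    short_shape = re.sub(r"(.)\1+", r"\1", long_shape)
    return [token, token.lower(), long_shape, short_shape]


print(word_shapes("Cohen"))
print(word_shapes("5317"))
```

Choosing which level to emit as the HMM symbol is then a model-selection question, which is what the slide says Datamold settles with a holdout set.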

54 HMM Example: “Nymble” [Bikel, et al 1998], [BBN “IdentiFinder”]
Task: Named Entity Extraction
States: start-of-sentence, end-of-sentence, Person, Org, Other (Five other name classes)
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}); Back-off to: P(s_t | s_{t-1}), P(s_t)
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}); Back-off to: P(o_t | s_t), P(o_t)
Train on ~500k words of news wire text.
Results:
Case | Language | F1
Mixed | English | 93%
Upper | English | 91%
Mixed | Spanish | 90%
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum ‘99]

55 What is a symbol? Bikel et al mix symbols from two abstraction levels

56 What is a symbol? Ideally we would like to use many, arbitrary, overlapping features of words.
[Diagram: states S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}]
Features: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …
Example: is “Wisniewski”; ends in “-ski”; part of noun phrase
Lots of learning systems are not confounded by multiple, non-independent features: decision trees, neural nets, SVMs, …
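A few of the overlapping features listed on slide 56 can be written as a token feature function. This is a sketch only: the feature names, the tiny city lexicon, and the previous-word feature are assumptions, and layout features (bold, indented, hyperlink) are left out because they need markup that plain tokens do not carry.

```python
# Hypothetical lexicon; a real system would load a much larger list.
CITY_NAMES = {"Pittsburgh", "Cincinnati"}


def token_features(tokens, i):
    """Overlapping, non-independent features for the token at position i."""
    word = tokens[i]
    feats = {
        "word=" + word.lower(): 1,                      # identity of word
        "is_capitalized": int(word[:1].isupper()),
        "ends_in_ski": int(word.lower().endswith("ski")),
        "in_city_list": int(word in CITY_NAMES),
    }
    if i > 0:
        feats["prev_word=" + tokens[i - 1].lower()] = 1
    return feats


print(token_features(["Dr", "Wisniewski", "spoke"], 1))
```

Several of these fire together on "Wisniewski", which is exactly the kind of dependence a generative HMM over single symbols handles poorly and a discriminative classifier tolerates.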

57 What is a symbol? [feature diagram as on slide 56] Idea: replace generative model in HMM with a maxent model, where state depends on observations

58 What is a symbol? [feature diagram as on slide 56] Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state

59 What is a symbol? [feature diagram as on slide 56] Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history

60 Ratnaparkhi’s MXPOST Sequential learning problem: predict POS tags of words. Uses MaxEnt model described above. Rich feature set. To smooth, discard features occurring < 10 times.
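The smoothing rule on slide 60 (discard features occurring fewer than 10 times) amounts to a count cutoff. The sketch below illustrates that rule only; it is not Ratnaparkhi's code, and `frequent_features` is a hypothetical helper name.

```python
from collections import Counter


def frequent_features(examples, min_count=10):
    """Keep only features seen at least min_count times across all examples."""
    counts = Counter(f for feats in examples for f in feats)
    return {f for f, c in counts.items() if c >= min_count}


# Toy data: "suffix=ing" and "cap" occur 10 times, "word=zyzzyva" only 9.
examples = [["suffix=ing", "cap"]] * 10 + [["word=zyzzyva"]] * 9
print(sorted(frequent_features(examples)))
```

Dropping rare features keeps the maxent model from assigning large weights to evidence seen only a handful of times.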

61 Conditional Markov Models (CMMs) aka MEMMs aka Maxent Taggers vs HMMs [Two chain diagrams, each over states … S_{t-1}, S_t, S_{t+1} … and observations … O_{t-1}, O_t, O_{t+1} …]

62 HMMs vs MEMM vs CRF HMM MEMM CRF
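The slide shows only the three graphical models. As a reminder, the standard textbook factorizations are given below, with state sequence s, observation sequence o, feature functions f_k and weights λ_k; this notation is an assumption and is not taken from the slide.

```latex
% HMM: generative, models the joint distribution
P(s, o) = \prod_{t} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)

% MEMM: conditional, normalized locally at each position
P(s \mid o) = \prod_{t} P(s_t \mid s_{t-1}, o_t), \qquad
P(s_t \mid s_{t-1}, o_t) =
  \frac{\exp\big(\sum_k \lambda_k f_k(s_t, s_{t-1}, o_t)\big)}{Z(s_{t-1}, o_t)}

% Linear-chain CRF: conditional, normalized once over the whole sequence
P(s \mid o) = \frac{1}{Z(o)} \prod_{t}
  \exp\Big(\sum_k \lambda_k f_k(s_t, s_{t-1}, o, t)\Big)
```

The only structural difference between the MEMM and the CRF is where the normalizer sits: per position in the MEMM, per sequence in the CRF.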

63 Some things to think about We’ve seen sliding windows, non-sequential token tagging, and sequential token tagging. –Which of these are likely to work best, and when? –Are there other ways to formulate NER as a learning task? –Is there a benefit from using more complex graphical models? What potentially useful information does a linear-chain CRF not capture? –Can you combine sliding windows with a sequential model? Next lecture will survey IE of sets of related entities (e.g., person and his/her affiliation). –How can you formalize that as a learning task?

