CSE 5539: Web Information Extraction

1 CSE 5539: Web Information Extraction
Instructor: Alan Ritter

2 Bigger Unstructured Data
Motivation: Data Analytics / Big Data. Companies have lots of data lying around, and computing cycles are cheap, so data is used to get insights in business, healthcare, science, government, and politics. Challenge: most of the world's data is unstructured (text, speech, images), and unstructured data is far bigger than structured data.

3 Extracting Knowledge from Text
The Web News Text Extractors Structured Data

5 Example: Information Extraction from Twitter
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250” OK, so just to make this a bit more concrete, let’s look at an idealized version of information extraction on Twitter.

6 Example: Information Extraction from Twitter
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250” Typically the first step in an IE pipeline is named entity recognition. Here I've highlighted all the “entities” in this post, for example Nintendo, the 3DS, the date, and the price. Doing this kind of thing automatically is the standard NLP task of Named Entity Recognition.
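To make the NER step concrete, here is a minimal sketch of what a tagger's output looks like on this tweet, using a toy gazetteer plus two regular expressions rather than a real statistical model; the entity lists and patterns below are invented for illustration.

```python
import re

# Toy gazetteer and patterns, invented for illustration; real NER
# systems are trained statistical models, not lookup tables.
GAZETTEER = {
    "nintendo": "COMPANY",
    "nintendo 3ds": "PRODUCT",
    "north america": "REGION",
}
DATE_RE = re.compile(
    r"\b(?:january|february|march|april|may|june|july|august|"
    r"september|october|november|december)\s+\d{1,2}\b", re.IGNORECASE)
PRICE_RE = re.compile(r"\$\d+")

def tag_entities(text):
    """Return (surface form, entity type) pairs found in the text."""
    lower = text.lower()
    found = [(p, t) for p, t in GAZETTEER.items() if p in lower]
    found += [(m.group(0), "DATE") for m in DATE_RE.finditer(text)]
    found += [(m.group(0), "PRICE") for m in PRICE_RE.finditer(text)]
    return found
```

Run on the example tweet, this recovers the company, product, region, date, and price spans highlighted on the slide.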

7 Example: Information Extraction from Twitter
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250” Next we might classify this post as a PRODUCT RELEASE event. We might also have a pre-defined idea that a product release event involves a company, a product, a date on which the product is released, a price, and the geographical region where it is being released:
PRODUCT RELEASE: COMPANY | PRODUCT | DATE | PRICE | REGION

8 Example: Information Extraction from Twitter
“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250” The information extraction task is then to fill in the fields for the event. You could imagine many events mentioned on Twitter which fit this general schema, and if we can automatically convert tweets into database rows like this, we can run SQL-like queries and higher-level data mining and analysis that weren't possible by just looking at the raw text.
PRODUCT RELEASE
COMPANY | PRODUCT | DATE | PRICE | REGION
Nintendo | 3DS | March 27 | $250 | North America
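Once tweets become database rows, ordinary SQL works over them. A minimal sketch using Python's built-in sqlite3, with an in-memory table whose columns mirror the PRODUCT RELEASE slots (the table and column names here are my own, not part of any real system):

```python
import sqlite3

# In-memory table mirroring the PRODUCT RELEASE event slots on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product_release
                (company TEXT, product TEXT, date TEXT, price TEXT, region TEXT)""")
conn.execute("INSERT INTO product_release VALUES "
             "('Nintendo', '3DS', 'March 27', '$250', 'North America')")
conn.execute("INSERT INTO product_release VALUES "
             "('Samsung', 'Galaxy S5', 'April 11', NULL, 'U.S.')")

def releases_in(region):
    """A SQL-style query that raw text cannot answer directly."""
    cur = conn.execute(
        "SELECT company, product FROM product_release WHERE region = ?", (region,))
    return cur.fetchall()
```

Queries like `releases_in("U.S.")` are exactly the kind of structured access the extraction step buys us.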

9 Example: Information Extraction from Twitter
Samsung Galaxy S5 Coming to All Major U.S. Carriers Beginning April 11th. Many different events mentioned on Twitter fit this same general schema; note that here the price field is unknown:
PRODUCT RELEASE
COMPANY | PRODUCT | DATE | PRICE | REGION
Samsung | Galaxy S5 | April 11 | ? | U.S.
Nintendo | 3DS | March 27 | $250 | North America

10 Example: Information Extraction from Twitter
News The same PRODUCT RELEASE schema applies to events reported in news text:
PRODUCT RELEASE
COMPANY | PRODUCT | DATE | PRICE | REGION
Samsung | Galaxy S5 | April 11 | ? | U.S.
Nintendo | 3DS | March 27 | $250 | North America

11 Example Applications Question Answering / Structured Queries
Which companies are releasing new products in Europe this spring? Alert me any time a new smartphone is announced in the U.S. Data Mining: Analyze trends in product releases across different industries. Is there a correlation between price and release date?

12 Knowledge Graphs Things, not strings!
The slide's diagram, written as triples: (CSE 5539, Instructor, Alan Ritter); (CSE 5539, Course offered at, Ohio State Univ.); (Ohio State Univ., Located In, Columbus OH)

13 Data Sources

14 Available Data Sources
All of these databases are sparsely populated and out of date. We need to extract this type of knowledge from text!

16 Traditional Information Extraction

23 Example Text from MUC-4 (1992)
[Cowie and Wilks] Example Text from MUC-4 (1992)

24 Example Output from MUC-4 (1992)
[Cowie and Wilks] Example Output from MUC-4 (1992)

25 Approaches Initially: Rule Based
Basically just write a bunch of regular expressions
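A caricature of the rule-based approach, on the seminar-announcement domain that appears later in this deck: one hand-written pattern per slot. The patterns below are mine and deliberately brittle; that brittleness is a big part of why the field moved to machine learning.

```python
import re

# Hand-written, MUC-era-style patterns, one per slot. Invented for
# illustration; real systems had many more rules per slot.
SPEAKER_RE = re.compile(r"Speaker\s*:\s*(.+)")
TIME_RE = re.compile(r"\b(\d{1,2}:\d{2}\s*(?:am|pm))\b", re.IGNORECASE)
PLACE_RE = re.compile(r"Place\s*:\s*(.+)")

def extract_seminar(text):
    """Fill seminar-announcement slots with first-match regex rules."""
    slots = {}
    for name, pat in [("speaker", SPEAKER_RE), ("time", TIME_RE),
                      ("place", PLACE_RE)]:
        m = pat.search(text)
        if m:
            slots[name] = m.group(1).strip()
    return slots
```

This works on announcements that happen to say "Speaker:" and "Place:", and silently fails on any other phrasing.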

28 Approaches Initially: Rule Based
Basically just write a bunch of regular expressions. Machine Learning: (Freitag 1998), (Soderland 1999), (Mooney 1999). Annotate training / dev / test documents, then train machine learning models.

29 Extraction by Sliding Window
[Slide from William Cohen] GRAND CHALLENGES FOR MACHINE LEARNING Jaime Carbonell School of Computer Science Carnegie Mellon University 3:30 pm 7500 Wean Hall Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on. CMU UseNet Seminar Announcement

33 A “Naïve Bayes” Sliding Window Model
[Slide from William Cohen] [Freitag 1997]
Example text: “00 : pm Place : Wean Hall Rm Speaker : Sebastian Thrun”, with a candidate window split into prefix (w_{t-m} … w_{t-1}), contents (w_t … w_{t+n}), and suffix (w_{t+n+1} … w_{t+n+m}).
Estimate Pr(LOCATION | window) using Bayes rule. Try all “reasonable” windows (vary length, position). Assume independence for length, prefix words, suffix words, and content words. Estimate from data quantities like Pr(“Place” in prefix | LOCATION). If Pr(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
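A sketch of the scoring rule just described, in the spirit of Freitag (1997). All of the probabilities below are invented stand-ins for estimates a real system would fit from annotated data:

```python
import math

# Made-up illustrative estimates; a trained model would learn these.
P_PREFIX = {"Place": 0.4, ":": 0.3}            # P(word in prefix | LOCATION)
P_CONTENT = {"Wean": 0.2, "Hall": 0.25, "Rm": 0.15}  # P(word in contents | LOCATION)
DEFAULT = 0.001                                # smoothing for unseen words
PRIOR = 0.05                                   # P(LOCATION) for any window

def window_log_prob(prefix, contents):
    """Naive-Bayes log score: prior plus independent per-word terms."""
    score = math.log(PRIOR)
    for w in prefix:
        score += math.log(P_PREFIX.get(w, DEFAULT))
    for w in contents:
        score += math.log(P_CONTENT.get(w, DEFAULT))
    return score

def is_location(prefix, contents, threshold=-20.0):
    """Extract the window if its score clears a (tunable) threshold."""
    return window_log_prob(prefix, contents) > threshold
```

A window preceded by "Place :" with room-like contents scores well above one full of unseen words.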

34 “Naïve Bayes” Sliding Window Results
[Slide from William Cohen] Domain: CMU UseNet Seminar Announcements (the same announcement shown above). Results:
Field | F1
Person Name | 30%
Location | 61%
Start Time | 98%

35 IE with Hidden Markov Models
[Slide from William Cohen] Given a sequence of observations: “Yesterday Pedro Domingos spoke this example sentence.” and a trained HMM with states {person name, location name, background}, find the most likely state sequence (Viterbi). Any words generated by the designated “person name” state are extracted as a person name. Person name: Pedro Domingos
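A minimal Viterbi decoder over a toy two-state HMM (background vs. person name). The transition and emission numbers, and the name list used as a stand-in emission model, are all invented for illustration:

```python
import math

STATES = ["BG", "PER"]                      # background vs. person name
START = {"BG": 0.8, "PER": 0.2}             # invented start probabilities
TRANS = {"BG": {"BG": 0.8, "PER": 0.2},     # invented transition probabilities
         "PER": {"BG": 0.4, "PER": 0.6}}
NAMES = {"Pedro", "Domingos"}               # stand-in emission gazetteer

def emit(state, word):
    # Toy emission model: PER strongly prefers words in the name list.
    if state == "PER":
        return 0.4 if word in NAMES else 0.01
    return 0.001 if word in NAMES else 0.2

def viterbi(words):
    # v[s] = best log-prob of any state path ending in s at the current word.
    v = {s: math.log(START[s]) + math.log(emit(s, words[0])) for s in STATES}
    back = []
    for w in words[1:]:
        nv, bp = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: v[p] + math.log(TRANS[p][s]))
            nv[s] = v[prev] + math.log(TRANS[prev][s]) + math.log(emit(s, w))
            bp[s] = prev
        back.append(bp)
        v = nv
    state = max(v, key=v.get)               # best final state
    path = [state]
    for bp in reversed(back):               # follow backpointers
        state = bp[state]
        path.append(state)
    return path[::-1]
```

Decoding the example sentence tags "Pedro Domingos" with the PER state, which is exactly the extraction step described above.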

36 Finite State Models
Generative, directed models: Naïve Bayes (single class) -> HMMs (sequence) -> generative directed models (general graphs).
Conditional models: Logistic Regression (single class) -> Linear-chain CRFs (sequence) -> General CRFs (general graphs).
Each conditional model is the discriminative counterpart of the generative model above it.

37 Various Annotated Datasets for Event / Relation Extraction
ACE (Automatic Content Extraction): newswire; successor to MUC

38 Various Annotated Datasets for Event / Relation Extraction
GENIA: Medline abstracts; a similar extraction task in the biomedical domain

39 Schemas -> Triples “Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”
PRODUCT RELEASE
COMPANY | PRODUCT | DATE | PRICE | REGION
Nintendo | 3DS | March 27 | $250 | North America
An n-ary event template like this can also be decomposed into binary relation triples, which is the relation extraction view of the same task: Manufacturer(3DS, Nintendo), ReleaseDate(3DS, March 27), Price(3DS, $250).
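The template-to-triples step is mechanical once the slots are filled: each non-product slot pairs with the product. A sketch, where the Manufacturer, ReleaseDate, and Price relation names come from the slide and ReleaseRegion is my own addition for the remaining slot:

```python
def template_to_triples(event):
    """Decompose a filled n-ary event template into binary relation triples."""
    product = event["PRODUCT"]
    # Manufacturer/ReleaseDate/Price are from the slide; ReleaseRegion is
    # a hypothetical name I chose for the REGION slot.
    relation_names = {"COMPANY": "Manufacturer", "DATE": "ReleaseDate",
                      "PRICE": "Price", "REGION": "ReleaseRegion"}
    return [(relation_names[slot], product, value)
            for slot, value in event.items()
            if slot in relation_names and value is not None]
```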

40 Open Information Extraction (Banko et al. 2007)

41 Demo (TextRunner)

42 Distant (weak) Supervision for Relation Extraction
e.g. [Mintz et al. 2009]
Known facts (e.g. from Freebase):
Person | Birth Location
Barack Obama | Honolulu
Mitt Romney | Detroit
Albert Einstein | Ulm
Nikola Tesla | Smiljan
Matching sentences found in a large corpus:
“Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...”
“Birth notices for Barack Obama were published in the Honolulu Advertiser…”
“Born in Honolulu, Barack Obama went on to become…”
In distant supervision we have access to both a database and a large text corpus. For example, Freebase contains large lists of people and their birth locations, and if we search for pairs of these entities we can find many sentences that mention both, like the Barack Obama / Honolulu sentences above. Each such entity pair becomes a positive example of the relation we're trying to extract (here, birth location): features are extracted from the group of sentences and a supervised classifier is trained to recognize the relation for new entity pairs. Negative instances are just random pairs of entities. One problem: these databases are incomplete, so sentences mentioning a true pair that happens to be missing from the database effectively become negative training examples, which is clearly problematic.
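The data-generation step described above can be sketched in a few lines: any sentence containing both members of a known (person, birthplace) pair becomes a positive example. The matching here is naive substring search, far cruder than the entity matching a real system would use:

```python
# Facts taken from the slide's example table.
KB = {("Barack Obama", "Honolulu"), ("Mitt Romney", "Detroit"),
      ("Albert Einstein", "Ulm"), ("Nikola Tesla", "Smiljan")}

def label_sentences(sentences):
    """Pair corpus sentences with KB facts to build distant training data."""
    positives = []
    for sent in sentences:
        for person, place in KB:
            if person in sent and place in sent:
                positives.append((person, place, sent))
    return positives
```

Every returned tuple is treated as a positive training example of the birth-location relation; features would then be extracted from the grouped sentences per entity pair.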

43 Demo (NELL)

44 Demo (Literome)

45 Knowledge Base Population Subtasks
Entity Recognition/Classification/Linking Relation Extraction Event Extraction Knowledge Base Inference

46 Applications Google knowledge graph Facebook graph search
Biomedical knowledge bases -> Your application domain here Geoscience knowledge graph? Patent knowledge graph? Cybersecurity knowledge graph?

47 Research Groups at Other Places

48 Why learn about this stuff?

49 Paper Selection Form! (please fill out before next class)

50 Administrative Details
Course Webpage

