Presentation is loading. Please wait.

Presentation is loading. Please wait.

Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)

Similar presentations


Presentation on theme: "Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)"— Presentation transcript:

1 Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)

2 9/26/2016 Speech and Language Processing - Jurafsky and Martin 2 Information Extraction  Turns out that these partial/parsing and chunking methods for syntax, given us the means to address shallow semantic problems as well  Figure out the entities (the players, props, instruments, locations, etc.) in a text  Figure out how they’re related  Figure out what kind of events/activities they’re all up to  And do each of those tasks in a loosely-coupled data-driven manner

3 9/26/2016 Speech and Language Processing - Jurafsky and Martin 3 Information Extraction  Ordinary newswire text is often used in typical examples.  And there’s an argument that there are useful applications there  But the real interest/money is in specialized domains  Bioinformatics  Electronic medical records  Stock market analysis  Intelligence analysis  Social media

4 9/26/2016 Speech and Language Processing - Jurafsky and Martin 4 Example App

5 9/26/2016 Speech and Language Processing - Jurafsky and Martin 5 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

6 9/26/2016 Speech and Language Processing - Jurafsky and Martin 6 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York.

7 9/26/2016 Speech and Language Processing - Jurafsky and Martin 7 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

8 9/26/2016 Speech and Language Processing - Jurafsky and Martin 8 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

9 9/26/2016 Speech and Language Processing - Jurafsky and Martin 9 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

10 9/26/2016 Speech and Language Processing - Jurafsky and Martin 10 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

11 9/26/2016 Speech and Language Processing - Jurafsky and Martin 11 NER  Find and classify all the named entities in a text.  What’s a named entity?  A reference to an entity via the mention of its name.  Colorado Rockies  This is a subset of the possible mentions...  Rockies, the team, it, they...  Find means identify the exact span of the mention.  Classify means determine the category of the entity being referred to.

12 9/26/2016 Speech and Language Processing - Jurafsky and Martin 12 NER Approaches  As with partial parsing and chunking there are two basic approaches (and hybrids)  Rule-based (regular expressions)  Lists of names  Patterns to match things that look like names  Patterns to match the environments that classes of names tend to occur in.  ML-based approaches  Get annotated training data  Extract features  Train systems to replicate the annotation

13 9/26/2016 Speech and Language Processing - Jurafsky and Martin 13 ML Approach

14 9/26/2016 Speech and Language Processing - Jurafsky and Martin 14 Encoding for Sequence Labeling  We can use the same IOB encoding here that we used for chunking:  For N classes we have 2*N+1 tags  An I and B for each class and a O for outside any class.  Each token in a text gets a tag.

15 9/26/2016 Speech and Language Processing - Jurafsky and Martin 15 NER Features

16 9/26/2016 Speech and Language Processing - Jurafsky and Martin 16 NER as Sequence Labeling

17 9/26/2016 Speech and Language Processing - Jurafsky and Martin 17 NER Evaluation  Suppose you employ this scheme. What’s the best way to measure performance.  Probably not the per-tag accuracy we used for POS tagging.  Why? It’s not measuring what we care about We need a metric that looks at the chunks not the tags

18 9/26/2016 Speech and Language Processing - Jurafsky and Martin 18 Example  Suppose we were looking for Location mentions.  If the system simply said O all the time it would do pretty well on a per-label basis since most words reside outside any location.

19 9/26/2016 Speech and Language Processing - Jurafsky and Martin 19 Precision/Recall/F  Precision: The fraction of entities the system returned that were right  “Right” means the boundaries and the label are correct given some labeled test set.  Recall: The fraction of entities that system got from those that it should have gotten.  F: Simple harmonic mean of those two numbers.

20 9/26/2016 Speech and Language Processing - Jurafsky and Martin 20 Entity-level evaluation  We need to evaluation P/R/F at the entity level.  But we may not care equally about all kinds of entities  So we might weight them differently in the evaluation routine.  Or consider the P/R/F for each one separately

21 9/26/2016 Speech and Language Processing - Jurafsky and Martin 21 Relations  Once you have captured the entities in a text you might want to ascertain how they relate to one another.  Here we’re just talking about explicitly stated relations

22 9/26/2016 Speech and Language Processing - Jurafsky and Martin 22 Information Extraction CHICAGO (AP) — Citing high fuel prices, United Airlines said Friday it has increased fares by $6 per round trip on flights to some cities also served by lower-cost carriers. American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said. United, a unit of UAL, said the increase took effect Thursday night and applies to most routes where it competes against discount carriers, such as Chicago to Dallas and Atlanta and Denver to San Francisco, Los Angeles and New York

23 9/26/2016 Speech and Language Processing - Jurafsky and Martin 23 Relation Types  As with named entities, the list of relations is application specific. For generic news texts...

24 9/26/2016 Speech and Language Processing - Jurafsky and Martin 24 Relations  By relation we really mean sets of tuples.  Think about populating a database.

25 9/26/2016 Speech and Language Processing - Jurafsky and Martin 25 Relation Analysis  We can divide relation analysis into two parts  Determining if 2 entities are related  And if they are, classifying the relation  There are 2 reasons to do this  Cutting down on training time for classification by eliminating most pairs  Producing separate feature-sets that are appropriate for each task.

26 9/26/2016 Speech and Language Processing - Jurafsky and Martin 26 Relation Analysis  Let’s just worry about named entities within the same sentence

27 9/26/2016 Speech and Language Processing - Jurafsky and Martin 27 Features  We can group the features (for both tasks) into three categories  Features of the named entities involved  Features derived from the words between and around the named entities  Features derived from the syntactic environment that governs the two entities

28 9/26/2016 Speech and Language Processing - Jurafsky and Martin 28 Features  Features of the entities  Their types  Concatenation of the types  Headwords of the entities  George Washington Bridge  Words in the entities  Features between and around  Particular positions to the left and right of the entities  +/- 1, 2, 3  Bag of words between

29 9/26/2016 Speech and Language Processing - Jurafsky and Martin 29 Features  Syntactic environment  Constituent path through the tree from one to the other  Base syntactic chunk sequence from one to the other  Dependency path

30 9/26/2016 Speech and Language Processing - Jurafsky and Martin 30 Example  For the following example, we’re interested in the possible relation between American Airlines and Tim Wagner.  American Airlines, a unit AMR, immediately matched the move, spokesman Tim Wagner said.


Download ppt "Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)"

Similar presentations


Ads by Google