Information Extraction in the Past 20 Years: Traditional vs. Open Heng Ji Acknowledgement: some slides from Radu Florian and Stephen Soderland.


1 Information Extraction in the Past 20 Years: Traditional vs. Open Heng Ji Acknowledgement: some slides from Radu Florian and Stephen Soderland

2  Long successful run: MUC, CoNLL, ACE, TAC-KBP, DEFT, BioNLP
 Genres: newswire, broadcast news, broadcast conversations, weblogs, blogs, newsgroups, speech, biomedical data, electronic medical records
 Programs: MUC, ACE, GALE, MRP, BOLT, DEFT

3 Quality / Portability

4 Quality Challenges

5 Where have we been?
 We’re thriving: Entity Linking
 We’re making slow but consistent progress: Relation Extraction, Event Extraction, Slot Filling
 We’re running around in circles: Name Tagging
 We’re stuck in a tunnel: Entity Coreference Resolution

6 Name Tagging: “Old” Milestones

Year | Tasks & Resources | Methods | F-Measure | Example References
1966 | First person name tagger, with punch cards | 30+ decision-tree-type rules | - | (Borkowski et al., 1966)
1998 | MUC-6 | MaxEnt with diverse levels of linguistic features | 97.12% | (Borthwick and Grishman, 1998)
2003 | CoNLL | System combination; sequential labeling with Conditional Random Fields | 89% | (Florian et al., 2003; McCallum et al., 2003; Finkel et al., 2005)
2006 | ACE | Diverse levels of linguistic features, re-ranking, joint inference | ~89% | (Florian et al., 2006; Ji and Grishman, 2006)

 Our progress compared to 1966: more data, a few more features, and fancier learning algorithms
 Not much active work after ACE, because we tend to believe it’s a solved problem…
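The CoNLL-style sequential labelers in the table above emit per-token BIO tags that must be decoded into entity spans before scoring. A minimal sketch of that decoding step (function name and tag scheme are illustrative, not from the slides):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (entity_type, start, end) spans,
    with end exclusive. An I- tag whose type mismatches closes the span."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and etype == tag[2:]:
            continue  # extend the current span
        if start is not None:  # close any open span
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans
```

For example, tagging "Donna Karan International offered shares" as `B-ORG I-ORG I-ORG O O` decodes to a single ORG span over the first three tokens.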

7 State-of-the-art reported in papers: the end of extreme happiness is sadness…

8 Experiments on ACE2005 data

9 Challenges
 Defining or choosing an IE schema
 Dealing with genres & variations
 –Dealing with novelty
 Bootstrapping a new language
 Improving the state-of-the-art with unlabeled data
 Dealing with a new domain
 Robustness

10 99 Schemas of IE on the Wall…
 Many IE schemas over the years:
 –MUC: 7 types – PER, ORG, LOC, DATE, TIME, MONEY, PERCENT
 –ACE: PER, ORG, GPE, LOC, FAC, WEA, VEH; has substructure (subtypes, mention types, specificity, roles)
 –CoNLL: 4 types – ORG, PER, LOC, MISC
 –OntoNotes: 18 types – CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
 –IBM KLUE2: 50 types, including event anchors
 –Freebase categories
 –Wikipedia categories
 Challenges:
 –Selecting an appropriate schema to model
 –Combining training data
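One common way to combine training data annotated under different schemas, as the last bullet asks, is to map each schema's types onto a shared coarse tag set before training. A toy sketch, where the specific mapping choices (e.g., collapsing ACE GPE and FAC into a coarse LOC) are illustrative modeling decisions, not prescribed by any of the schemas:

```python
# Illustrative partial mapping from (schema, type) to a common coarse tag set.
TO_COARSE = {
    ("ACE", "PER"): "PER", ("ACE", "ORG"): "ORG",
    ("ACE", "GPE"): "LOC", ("ACE", "FAC"): "LOC",
    ("CoNLL", "PER"): "PER", ("CoNLL", "ORG"): "ORG",
    ("CoNLL", "LOC"): "LOC", ("CoNLL", "MISC"): "OTHER",
    ("OntoNotes", "PERSON"): "PER", ("OntoNotes", "ORG"): "ORG",
    ("OntoNotes", "GPE"): "LOC", ("OntoNotes", "WORK_OF_ART"): "OTHER",
}

def normalize(schema, label):
    """Map a schema-specific label to the coarse set; unknown types fall back to OTHER."""
    return TO_COARSE.get((schema, label), "OTHER")
```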

11 My Favorite Booby-trap Document
 LVMH Makes a Two-Part Offer for Donna Karan
 By LESLIE KAUFMAN. Published: December 19, 2000
 The fashion house of Donna Karan, which has long struggled to achieve financial equilibrium, has finally found a potential buyer. The giant luxury conglomerate LVMH-Moet Hennessy Louis Vuitton, which has been on a sustained acquisition bid, has offered to acquire Donna Karan International for $195 million in a cash deal, with the idea that it could expand the company's revenues and beef up accessories and overseas sales.
 At $8.50 a share, the LVMH offer represents a premium of nearly 75 percent to the closing stock price on Friday. Still, it is significantly less than the $24 a share at which the company went public in The final price is also less than one-third of the company's annual revenue of $662 million, a significantly smaller multiple than European luxury fashion houses like Fendi were receiving last year.
 The deal is still subject to board approval, but in a related move that will surely help pave the way, LVMH purchased Gabrielle Studio, the company held by the designer and her husband, Stephan Weiss, that holds all of the Donna Karan trademarks, for $450 million. That price would be reduced by as much as $50 million if LVMH enters into an agreement to acquire Donna Karan International within one year. In a press release, LVMH said it aimed to combine Gabrielle and Donna Karan International and that it expected that Ms. Karan and her husband ''will exchange a significant portion of their DKI shares for, and purchase additional stock in, the combined entity.''

12 Analysis of an Error: Donna Karan International

13 Analysis of an Error: How Can You Tell?
 Ambiguous “International” names: Donna Karan International, Saddam Hussein International, Ronald Reagan International, Dana International
 Labeled examples (type, name, count):
 FAC Saddam Hussein International Airport 8
 FAC Baghdad International 1
 ORG Amnesty International 3
 FAC International Space Station 1
 ORG International Criminal Court 1
 ORG Habitat for Humanity International 1
 ORG U-Haul International 1
 FAC Saddam International Airport 7
 ORG International Committee of the Red Cross 4
 ORG International Committee for the Red Cross 1
 FAC International Press Club 1
 ORG American International Group Inc. 1
 ORG Boots and Coots International Well Control Inc. 1
 ORG International Committee of Red Cross 1
 ORG International Black Coalition for Peace and Justice 1
 FAC Baghdad International Airport
 ORG Center for Strategic and International Studies 2
 ORG International Monetary Fund 1


15 Dealing with Different Genres: Weblogs
 All-lowercase data: “obama has stepped up what bush did even to the point of helping our enemy in Libya.”
 Non-standard capitalization/title case: “LiveLeak.com - Hillary Clinton: Saddam Has WMD, Terrorist Ties (Video)”
 Solution: case restoration (truecasing)
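The truecasing fix named above can be approximated by learning each word's most frequent casing from well-edited text and applying it to case-damaged input. A minimal sketch under that assumption (real truecasers use sequence models and treat sentence-initial tokens more carefully; here they are simply skipped during training):

```python
from collections import Counter, defaultdict

def train_truecaser(corpus_sentences):
    """Count observed casings of each word in well-edited, tokenized text,
    skipping sentence-initial tokens (their capitalization is ambiguous)."""
    counts = defaultdict(Counter)
    for sent in corpus_sentences:
        for tok in sent[1:]:
            counts[tok.lower()][tok] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(tokens, model):
    """Restore each token's most frequent casing; unseen words pass through."""
    return [model.get(t.lower(), t) for t in tokens]
```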


17 Out-of-domain Data
 Volunteers have also aided victims of numerous other disasters, including hurricanes Katrina, Rita, Andrew and Isabel, the Oklahoma City bombing, and the September 11 terrorist attacks.

18 Out-of-domain Data
 Manchester United manager Sir Alex Ferguson got a boost on Tuesday as a horse he part-owns, What A Friend, landed the prestigious Lexus Chase here at Leopardstown racecourse.

19 Bootstrapping a New Language
 English is resource-rich:
 –Lexical resources: gazetteers
 –Syntactic resources: Penn TreeBank
 –Semantic resources: WordNet; entity-labeled data (MUC, ACE, CoNLL); FrameNet, PropBank, NomBank, OntoBank
 How can we leverage these resources in other languages?
 MT to the rescue!

20 Mention Detection Transfer
 ES: El soldado nepalés fue baleado por ex soldados haitianos cuando patrullaba la zona central de Haiti, informó Minustah.
 EN (transfer): The Nepalese soldier was gunned down by former Haitian soldiers when patrolling the central area of Haiti, reported Minustah.
 Projected tags: “Haiti” → B-GPE, “Minustah” → B-PER; all other tokens → O
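The transfer step this slide illustrates can be sketched as projecting mention tags from a labeled source sentence to an unlabeled target sentence over an MT word alignment. A minimal one-to-one projection, assuming alignments are given as (source_index, target_index) pairs (real systems must also handle many-to-many and noisy alignments):

```python
def project_labels(src_tags, alignment, tgt_len):
    """Project BIO mention tags from a labeled source sentence onto a
    target sentence of tgt_len tokens via (src_index, tgt_index) pairs."""
    tgt_tags = ["O"] * tgt_len
    for s, t in alignment:
        if src_tags[s] != "O":  # only project labeled tokens
            tgt_tags[t] = src_tags[s]
    return tgt_tags
```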


22  How to deal with out-of-domain data? How to even detect if you’re out of domain?
 How to deal with unseen WotD (e.g., ISIS, ISIL, IS, Ebola)?
 How to significantly improve the state-of-the-art using unlabeled data?

23 What’s Wrong?
 Name taggers are getting old (trained on 2003 news, tested on 2012 news)
 Genre adaptation (informal contexts, posters)
 Revisit the definition of name mention: extraction for linking
 Limited types of entities (we really only cared about PER, ORG, GPE)
 Old unsolved problems:
 –Identification: “Asian Pulp and Paper Joint Stock Company, Lt. of Singapore”
 –Classification: “FAW has also utilized the capital market to directly finance,…” (FAW = First Automotive Works)
 Potential solutions for quality:
 –Word clustering, lexical knowledge discovery (Brown, 1992; Ratinov and Roth, 2009; Ji and Lin, 2010)
 –Feedback from linking, relations, events (Sil and Yates, 2013; Li and Ji, 2014)
 Potential solutions for portability:
 –Extend entity types based on AMR (140+)

24 Entity Linking Milestones
 2006: First definition of the Wikification task (Bunescu and Pasca, 2006)
 2009: TAC-KBP Entity Linking launched (McNamee and Dang, 2009)
 Supervised learning-to-rank with diverse levels of features, such as entity profiling and various popularity and similarity measures (Gao et al., 2010; Chen and Ji, 2011; Ratinov et al., 2011; Zheng et al., 2010; Dredze et al., 2010; Anastacio et al., 2011)
 Collective inference and coherence measures (Milne and Witten, 2008; Kulkarni et al., 2009; Ratinov et al., 2011; Chen and Ji, 2011; Ceccarelli et al., 2013; Cheng and Roth, 2013)
 2012: Various applications, e.g., coreference resolution (Ratinov & Roth, 2012) – Dan’s talk
 2014: TAC-KBP Entity Discovery and Linking: end-to-end name tagging, cross-document entity clustering, and entity linking (Ji et al., 2014)
 Many international evaluations were inspired by TAC-KBP; more than 130 papers have been published

25 Current Linking Problems and Possible Solutions
 State-of-the-art Entity Linking: 85% B-cubed+ F-score on formal genres, 70% on informal genres
 State-of-the-art Entity Discovery and Linking: 66% Discovery-and-Linking F-score, 73% clustering CEAFm F-score
 Remaining challenges:
 –Popularity bias
 –Need for better meaning representation
 –Selecting collaborators from rich contexts
 –Knowledge gap between source and KB
 –Cross-lingual entity linking (the name translation problem)
 Potential solutions:
 –Deep knowledge acquisition and representation (e.g., AMR)
 –Better graph search/alignment algorithms
 –Make more people excited about Chinese and Spanish
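The "popularity bias" challenge above shows up even in a very simple linker that mixes a popularity prior with context similarity: a strong prior can outvote the context. A toy sketch, where the weights, candidate fields, and scores are all invented for illustration:

```python
def link_entity(context_words, candidates):
    """Pick the KB candidate maximizing a weighted mix of a popularity prior
    and bag-of-words overlap between mention context and KB description."""
    def score(cand):
        overlap = len(set(context_words) & set(cand["description"]))
        return 0.5 * cand["popularity"] + 0.5 * overlap / max(len(cand["description"]), 1)
    return max(candidates, key=score)
```

With a context like ["machine", "learning", "talk"], the far more popular basketball sense of "Michael Jordan" can still win over the professor sense, which is exactly the bias the slide flags.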

26 Slot Filling Milestones
 Top systems achieved 30%-40% F-measure
 Ground truth is created by manual assessment of pooled system output (relative recall); scores may appear lower when teams are stronger
 2014 queries are more challenging than 2013’s, including some ambiguous queries shared with entity linking (Stephen’s talk)
 Consistent progress for an individual system (RPI, tested on 2014 data): 2010: 20% → 2011: 22% → 2013: 28% → 2014: 34%
 Successful methods:
 –Multi-label, multi-instance learning (Surdeanu et al., 2012)
 –Combining distant supervision with heuristic rules and patterns (Roth et al., 2013)
 –Cross-source, cross-system inference (Chen et al., 2011; Yu et al., 2014)
 –Linguistic constraints (Yu et al., 2014): Heng’s one-week pencil-and-paper effort to semi-automatically acquire trigger phrases; an awfully simple trigger-scoping method beat all 2013 systems
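The trigger-scoping idea in the last bullet (accept a candidate slot fill only when a relation-specific trigger phrase appears in scope between the query entity and the candidate) can be sketched roughly as follows; the trigger lists here are invented stand-ins for the curated ones, and the "scope" is simplified to the token span between the two mentions:

```python
# Hypothetical trigger phrases per KBP relation (illustrative only).
TRIGGERS = {
    "per:spouse": ["wife", "husband", "married"],
    "per:employee_or_member_of": ["joined", "works for", "hired by"],
}

def accept_fill(sentence, query, candidate, relation):
    """Keep a candidate slot fill only if a trigger phrase for the relation
    occurs in the token span between the query entity and the candidate."""
    toks = sentence.lower().split()
    try:
        qi, ci = toks.index(query.lower()), toks.index(candidate.lower())
    except ValueError:
        return False  # one of the mentions is not in this sentence
    lo, hi = sorted((qi, ci))
    span = " ".join(toks[lo:hi + 1])
    return any(trigger in span for trigger in TRIGGERS.get(relation, []))
```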

27 Have the Error Sources Changed over the Years? (Min and Grishman, 2011; Yu and Ji, 2014)

28 Blame Ourselves First…
 Non-verb and multi-word expressions as triggers: “his men back to their compound”
 Knowledge scarcity – the long tail:
 –“A suicide bomber detonated explosives at the entrance to a crowded …”
 –“medical teams carting away dozens of wounded victims”
 –“Today I was let go from my job after working there for 4 1/2 years.”
 –Possible solution: increase coverage with FrameNet (Li et al., 2014)
 Global context:
 –“I didn't want to hurt him. I miss him to death.”
 –“I threw stone out of the window.” vs. “I threw him out of the window.”
 –“Ellison to spend $10.3 billion to get his company.”
 –“We believe that the likelihood of them using those weapons goes up.”
 –“Fifteen people were killed and more than 30 wounded Wednesday as a suicide bomber blew himself up on a student bus in the northern town of Haifa”
 –Possible solution: joint modeling of triggers and arguments (Li et al., 2013)

29 Then Blame Others…
 Fundamental language problems: ambiguity and variety
 Coreference, coreference, coreference…
 –25% of the examples involve coreference that is beyond current system capabilities, such as nominal anaphors and non-identity coreference
 –“Almost overnight, he became fabulously rich, with a $3-million book deal, a $100,000 speech-making fee, and a lucrative multifaceted consulting business, Giuliani Partners. … His consulting partners included seven of those who were with him on 9/11, and in 2002 Alan Placa, his boyhood pal, went to work at the firm.”
 –“After a successful karting career in Europe, Perera became part of the Toyota F1 Young Drivers Development Program and was a Formula One test driver for the Japanese company in”
 –“a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”

30 Then Blame Others…
 Paraphrase, paraphrase, paraphrase…
 “employee/member”:
 –Sutil, a trained pianist, tested for Midland in 2006 and raced for Spyker in 2007, where he scored one point in the Japanese Grand Prix.
 –Daimler Chrysler reports 2004 profits of $3.3 billion; Chrysler earns $1.9 billion.
 –In her second term, she received a seat on the powerful Ways and Means Committee.
 –Jennifer Dunn was the face of the Washington state Republican Party for more than two decades.
 –Buchwald lied about his age and escaped into the Marine Corps.
 –By 1942, Peterson was performing with one of Canada's leading big bands, the Johnny Holmes Orchestra.
 “spouse”:
 –Buchwald's 1952 wedding – Lena Horne arranged for it to be held in London's Westminster Cathedral – was attended by Gene Kelly, John Huston, Jose Ferrer, Perle Mesta and Rosemary Clooney, to name a few.

31 Then Blame Others…
 Inference, inference, inference…
 Systems would benefit from specialists able to reason about times, locations, family relationships, and employment relationships:
 –“People Magazine has confirmed that actress Julia Roberts has given birth to her third child, a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8 lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus, who were born in November…”
 –“He [Pascal Yoadimnadji] has been evacuated to France on Wednesday after falling ill and slipping into a coma in Chad, Ambassador Moukhtar Wawa Dahab told The Associated Press. His wife, who accompanied Yoadimnadji to Paris, will repatriate his body to Chad.” → Is he dead? In Paris?
 –“Until last week, Palin was relatively unknown outside Alaska…” → Does she live in Alaska?
 –“The list says that the state is owed $2,665,305 in personal income taxes by singer Dionne Warwick of South Orange, N.J., with the tax lien dating back to” → Does she live in NJ?

32 Portability/Scalability Challenges

33 Defining the Problem
 Deep understanding of all possible relations? Open IE, pre-emptive IE, on-demand IE…
 DEFT PI meeting -- U. Washington, 10/15/2014

34 Defining the Problem
 Deep understanding of all possible relations?
 Deep Extraction for Focused Tasks (D.E.F.T.):
 –The user has a focused information need: a few dozen relations, several entity types:
 Date_of_birth(per, date), city_of_headquarters(org, city), …
 Treatment(substance, condition), studies_disease(per/org, condition), …
 Arrive_in(per, loc), meet_with(per, per), unveil(org, product), …
 –Quickly train an extractor for the task:
 Domain-independent: parsing, Open IE, SRL, …
 Task-specific: semantic tagging, extraction patterns, …
 Freedman et al., “Extreme Extraction – Machine Reading in a Week,” EMNLP 2011
 Zhang et al., NewsSpike Event Extractor, in review

35 Aim for the Head
 [Figure: a Zipfian distribution of the surface forms expressing a textual relation – the high-frequency patterns are dead simple, the middle of the distribution is the real challenge, and the long tail is a hopeless case]

36 Open IE for KBP (github/knowitall/openie)
 Advantages of Open IE:
 –Robust
 –Massively scalable
 –Works out of the box
 –Finds whatever relations are expressed in the text
 –Not tied to an ontology of relations
 Disadvantages:
 –Finds whatever relations are expressed in the text
 –Not tied to an ontology of relations
 Challenge:
 –Map Open IE output to an ontology of relations, with a minimum of user effort

37 OpenIE–KBP Rule Language
 (Smith, was appointed, Acting Director of Acme Corporation)
   → per:employee_or_member_of(Smith, Acme Corporation)

 Term in Rule | Example
 Target relation | per:employee_or_member_of
 Query entity in | Arg1
 Slotfill in | Arg2
 Slotfill type | Organization
 Arg1 terms | -
 Relation terms | appointed
 Arg2 terms | of
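The rule on this slide can be read as a matcher over Open IE triples. A rough sketch of applying it, with the matching logic deliberately simplified (the slot fill is taken as the text after the last " of " in Arg2, and type checking is omitted):

```python
# Simplified encoding of the slide's example rule.
RULE = {
    "target": "per:employee_or_member_of",
    "relation_terms": ["appointed"],  # must all appear in the Open IE relation phrase
    "arg2_term": " of ",              # slot fill is the NP following this term in Arg2
}

def apply_rule(triple, rule=RULE):
    """Map an Open IE (arg1, rel, arg2) triple to a KBP slot fill, or None."""
    arg1, rel, arg2 = triple
    if all(t in rel.lower() for t in rule["relation_terms"]) and rule["arg2_term"] in arg2:
        slotfill = arg2.rsplit(rule["arg2_term"], 1)[-1]
        return (rule["target"], arg1, slotfill)
    return None
```

On the slide's example triple, this yields per:employee_or_member_of(Smith, Acme Corporation).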

38 Hits the Head, but …
 High precision, average recall
 Limited recall from Open IE:
 –Good with verb-based relations
 –Weak on noun-based relations
 “Implicit relation” patterns:
 –“Bashardost, 43, is …” → (Bashardost, [has age], 43)
 –“… the Election Complaints Commission (ECC) …” → (Election Complaints Commission, [has acronym], ECC)
 –“French journalist Jean LeGall reported that …” → (Jean LeGall, [has job title], journalist), (Jean LeGall, [has nationality], French)
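The "implicit relation" patterns above are naturally expressed as surface regexes over noun-based constructions that Open IE misses. Two illustrative patterns (real pattern sets are larger and tuned on annotated data):

```python
import re

# "Bashardost, 43, ..." -> (name, has_age, 43)
AGE = re.compile(r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)*), (?P<age>\d{1,3}),")
# "Election Complaints Commission (ECC)" -> (full name, has_acronym, ECC)
ACRONYM = re.compile(r"(?P<full>(?:[A-Z][a-z]+ )+[A-Z][a-z]+) \((?P<acro>[A-Z]{2,})\)")

def extract_implicit(text):
    """Extract implicit relations via surface patterns; returns (arg1, rel, arg2) triples."""
    rels = []
    m = AGE.search(text)
    if m:
        rels.append((m.group("name"), "has_age", m.group("age")))
    m = ACRONYM.search(text)
    if m:
        rels.append((m.group("full"), "has_acronym", m.group("acro")))
    return rels
```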

39 NewsSpike Event Extractor
 Extracts event relations from news streams: an event = event_phrase(arg1_type, arg2_type)
 A NewsSpike = (entity1, entity2, date, {sentences}), built from parallel news streams:
 –Open IE identifies entity1, entity2, and the event phrase
 –A spike in frequency on that date indicates an event between entity1 and entity2
 Automatically discovers relations not covered by Freebase:
 –arrive_in(person, location), beat(sports_team, sports_team), meet_with(person, person), nominate(person/politician, person), unveil(organization, product), …
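The spike signal that defines a NewsSpike can be approximated by comparing an entity pair's mention count on one day to its average daily count across the stream. A toy sketch, where the threshold and counting scheme are invented for illustration:

```python
from collections import Counter

def find_newsspikes(mentions, ratio=3.0):
    """mentions: list of (entity1, entity2, date) tuples from parallel news streams.
    A pair 'spikes' on a date when its count that day reaches `ratio` times
    its average daily count; returns the spiking (entity1, entity2, date) keys."""
    by_pair_date = Counter((e1, e2, d) for e1, e2, d in mentions)
    by_pair = Counter((e1, e2) for e1, e2, d in mentions)
    num_days = len({d for _, _, d in mentions})
    spikes = []
    for (e1, e2, d), count in by_pair_date.items():
        avg = by_pair[(e1, e2)] / num_days
        if count >= ratio * avg:
            spikes.append((e1, e2, d))
    return spikes
```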

40 NewsSpike Architecture
 [Diagram: Training phase – discover events NS = (a1, a2, d, S) from parallel news streams, group parallel sentences, generate training data, and learn the event extractor. Testing phase – the learned extractor takes test sentences as input and produces extractions.]

41 High-Quality Training
 Paraphrases within a NewsSpike give positive training examples
 Negative training comes from a temporal negation heuristic:
 –If event phrases e1 and e2 are in the same NewsSpike and one of them is negated, then e1 is probably not a paraphrase of e2
 –“Team1 faces Team2” vs. “Team1 did not beat Team2” → face ≠ beat
 High precision comes from the negative training
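The temporal-negation heuristic above can be sketched as a pairing rule over the event phrases within one NewsSpike: each non-negated phrase is paired with each negated phrase as a non-paraphrase (negative) training example. The negation-word list here is a simplification:

```python
def is_negated(phrase):
    """Crude negation detector over a surface event phrase."""
    toks = phrase.split()
    return "not" in toks or "never" in toks or "n't" in phrase

def negative_pairs(phrases):
    """Pair each plain phrase with each negated phrase from the same NewsSpike
    as a non-paraphrase training example, e.g. ('faces', 'did not beat')."""
    negated = [p for p in phrases if is_negated(p)]
    plain = [p for p in phrases if not is_negated(p)]
    return [(a, b) for a in plain for b in negated]
```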

42 High-Precision Event Extractor
 Doubles the area under the precision-recall curve vs. Universal Schemas
 [Figure: PR curves for NewsSpike-E2 on the news stream, Universal Schemas on NYT, and Universal Schemas on the news stream]

