Tri-lingual EDL Planning Heng Ji (RPI) Hoa Trang Dang (NIST) WORRY, BE HAPPY!

1 Tri-lingual EDL Planning Heng Ji (RPI) Hoa Trang Dang (NIST) WORRY, BE HAPPY!

2 Motivations: Cross-lingual KBP

3 Motivations: Cross-lingual Information Fusion
 Who is Jim Parsons?
 How is he doing lately?

4 Motivations: A Smart Cross-lingual Kindle
Example entities: Xi Jinping, Sunnylands, California, China, US, South Sea, Diaoyu Islands
 Navigating Unfamiliar Languages/Domains
 Education purposes

5 Current Status of EDL/EL/Wikification
 English EDL attracted 20 teams
 End-to-end EDL score: 70%
 EL: mature mono-lingual linking techniques reach 90% accuracy
 But there are few ACL papers on cross-lingual EDL/EL/Wikification
 Goals
   Extend from mono-lingual to cross-lingual
   Rapid KB construction for a foreign language

6 Tri-lingual EDL Task Definition
 Input:
   Source Collection: English, Chinese, Spanish
   KB: English only (Chinese KB and Spanish KB are disallowed)
   Discourage using inter-lingual Wikipedia links
     Rapid KB construction for low-density languages
 Output: Entity clusters presented in English; some have links to the English KB
   Some clusters are from a single language, and some are from multiple languages
   May need to normalize NIL mention translations for ground truth
 A typical system should extract entity mentions from all three languages, link them to the English KB, and cluster and translate NIL mentions
 Query: English, Chinese, Spanish
   Some queries will be from a single language only
   Some queries will exist in multiple languages to form cross-lingual entity clusters
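The output described on this slide (entity clusters presented in English, some linked to the English KB, the rest NIL) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Mention` fields, the language codes, the KB id `kb:0001`, and the rule of clustering NIL mentions by their normalized English translation are hypothetical and are not part of the task specification.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    doc_id: str
    text: str              # surface string in the source language
    lang: str              # hypothetical codes: "eng", "cmn", "spa"
    kb_id: Optional[str]   # English KB entry, or None for a NIL mention
    english_name: str      # English translation supplied by an upstream step


def cluster_mentions(mentions):
    """Group linked mentions by KB id and NIL mentions by normalized English name."""
    clusters = defaultdict(list)
    for m in mentions:
        if m.kb_id is not None:
            key = ("KB", m.kb_id)
        else:
            # Assumption: NIL mentions are merged when their English names match.
            key = ("NIL", m.english_name.strip().lower())
        clusters[key].append(m)
    return dict(clusters)


mentions = [
    Mention("d1", "Xi Jinping", "eng", "kb:0001", "Xi Jinping"),
    Mention("d2", "习近平", "cmn", "kb:0001", "Xi Jinping"),
    Mention("d3", "Juan Pérez", "spa", None, "Juan Perez"),
    Mention("d4", "Juan Perez", "eng", None, "Juan Perez"),
]
clusters = cluster_mentions(mentions)
for key, members in clusters.items():
    print(key, sorted({m.lang for m in members}))
```

In this toy input both clusters are cross-lingual: one is linked to the KB and one is NIL. A real system would need far more than string matching to cluster NIL mentions across languages.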

7 Tri-lingual Diagnostic EL Task Definition
 Perfect mentions (queries) are given
 Query: English, Chinese, Spanish
   Some queries will be from a single language only
   Some queries will exist in multiple languages to form cross-lingual entity clusters

8 Source Collection
 Some KBA web streaming data in English, Chinese, Spanish
 Some social media data with code-switching
 Some formal comparable newswire
 Some discussion forum posts
 Include KBP2014 EDL corpora
 Larger scale than KBP2014 EDL
 Share some documents with the Cold-start KBP task
 Maybe consider news only for 2015

9 KB: Freebase
 2.6 billion triples (vs. 583 million triples in DBpedia)
 Potential Problem (and Opportunity)
   Some entries don’t have corresponding Wikipedia pages, so systems don’t have Wikipedia articles to analyze (similar to the EL optional task before 2014)
   May trigger some new research on linking when the KB doesn’t include unstructured texts

10 Resources: English
 Google, LCC, IBM, RPI will run English EDL on the entire source collection
   Each system generates its top 10 candidates for each mention, then the systems vote
   Oracle linking accuracy should be above 97%
 Give these to LDC as starting points to speed up human annotation/assessment
   This follows a pipeline RPI+ISI used for AMR EDL annotation (ISI has an annotation interface to correct the top 10 RPI system-generated candidates + Google search + …)
 RPI can share English entity embeddings
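The slide does not say how the systems vote or how oracle accuracy is computed, so the Python sketch below is only one plausible reading. It assumes that each system returns a ranked list of KB ids per mention, that the vote picks the id proposed by the most systems (ties broken by summed rank, then by id), and that a mention counts as an oracle hit when the gold id appears in any system's top-k list. The function names and the ids such as `kb:A` are hypothetical.

```python
from collections import Counter


def vote(candidate_lists, top_k=10):
    """Return the KB id proposed by the most systems, or None if nothing was proposed."""
    votes = Counter()
    rank_sum = Counter()
    for candidates in candidate_lists:
        for rank, kb_id in enumerate(candidates[:top_k]):
            votes[kb_id] += 1
            rank_sum[kb_id] += rank
    if not votes:
        return None
    # Most votes first; ties go to the lower summed rank, then to the smaller id.
    return min(votes, key=lambda k: (-votes[k], rank_sum[k], k))


def oracle_accuracy(per_mention_lists, gold, top_k=10):
    """Fraction of gold mentions whose correct id appears in any system's top-k list."""
    if not gold:
        return 0.0
    hits = 0
    for mention_id, gold_id in gold.items():
        lists = per_mention_lists.get(mention_id, [])
        if any(gold_id in candidates[:top_k] for candidates in lists):
            hits += 1
    return hits / len(gold)


# Three hypothetical systems, two mentions.
per_mention = {
    "m1": [["kb:A", "kb:B"], ["kb:A"], ["kb:B", "kb:A"]],
    "m2": [["kb:C"], ["kb:D"], []],
}
gold = {"m1": "kb:A", "m2": "kb:E"}

voted = {m: vote(lists) for m, lists in per_mention.items()}
print(voted)
print(oracle_accuracy(per_mention, gold))
```

Oracle accuracy is an upper bound on what an annotator can recover by choosing among the pooled candidates, which is why it matters when the candidate lists are used as starting points for annotation.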

11 Resources: Chinese
 Software
   Stanford Basic Chinese NLP (name tagging, parsing)
   CAS Basic Chinese NLP (POS tagging, name tagging, chunking)
   RPI Chinese IE (name tagging, relation, event, not-great coref/nominal)
 Resources
   RPI has 2 million manually cleaned Chinese-English name translation pairs to share, as well as Chinese entity embeddings
   LDC has Chinese-English name dictionaries with frequency information
   LDC is developing more training data for Chinese/Spanish SF
 Automatic Annotation
   RPI can provide Chinese name tagging and translation, and event trigger/argument extraction
   BBN/IBM run Chinese IE on the source collection

12 Resources: Spanish
 Software
   Dependency parser: MaltParser
   Stanford Spanish CoreNLP (name tagger, …) coming in the next 6 months
   Need more help from the community
 Automatic Annotation
   IBM runs Spanish ACE entity extraction (name, coref) and parsing on the source collection

13 Timeline
 Release training data in May
 A pilot study in May 2015
   You can submit manual runs!
 Evaluation: September/October 2015

14 Teams with Expertise/Interest (only asked workshop attendees so far)
 English/Spanish/Chinese
   Yes: IBM, HITS, NYU, RPI
   Maybe: JHU, LCC, BBN
   …
 English/Chinese
   PKU, Tsinghua, a lot more Chinese teams
   …
 English/Spanish
   CSFG
   Maybe: UIUC
   …
 Speed-dating between Chinese & Spanish teams

15 Another Ambitious Proposal: Cross-lingual Slot Filling
 Chinese-to-English Slot Filling
   Annotation guideline available
   BLENDER Pilot system (Snover et al., 2011)
   Off-cycle Pilot in DEFT (Jan 2015): RPI, Univ. of Washington, Univ. of Wisconsin, CMU
 Spanish-to-English Slot Filling
   Evaluation proposed in KBP2013
   Guideline, Annotation available
 Tri-lingual Slot Filling

16 Other Issues
 Mention Definition
   Extraction for linking
   Nested mentions
   Posters
 Scoring
   Is the current scoring reasonable?
   If we do EDL on 50K documents and only some of the entities/documents are manually annotated, how do we evaluate clustering performance?
 Add new entity types in 2015: Location and Facility?
 Add non-name concepts (e.g., nominal mentions)?
   Link “wife” in “Obama’s wife” to “Michelle” in KB
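The scoring question on this slide is left open. As one possible illustration only, the Python sketch below computes B-cubed precision, recall, and F1 restricted to the mentions that were manually annotated, ignoring any system mentions outside the annotated subset. B-cubed is swapped in here as a familiar clustering metric; it is not presented as the metric the task uses or will use, and the function name and toy data are hypothetical.

```python
def b_cubed_on_annotated(system, gold):
    """B-cubed precision, recall, F1 over the mentions present in `gold` only.

    system, gold: dicts mapping mention id -> cluster id.
    Annotated mentions the system missed are treated as singleton clusters.
    """
    scored = list(gold)
    if not scored:
        return 0.0, 0.0, 0.0

    def sys_id(m):
        # A missed mention gets a unique cluster id of its own.
        return system.get(m, ("__missing__", m))

    precision = recall = 0.0
    for m in scored:
        # Quadratic in the number of annotated mentions; fine for a sketch.
        sys_cluster = {x for x in scored if sys_id(x) == sys_id(m)}
        gold_cluster = {x for x in scored if gold[x] == gold[m]}
        overlap = len(sys_cluster & gold_cluster)
        precision += overlap / len(sys_cluster)
        recall += overlap / len(gold_cluster)

    n = len(scored)
    p, r = precision / n, recall / n
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


# Mentions a, b, c are annotated; d is system output with no annotation.
gold = {"a": 1, "b": 1, "c": 2}
system = {"a": "x", "b": "x", "c": "x", "d": "y"}
p, r, f = b_cubed_on_annotated(system, gold)
print(round(p, 4), round(r, 4), round(f, 4))
```

Restricting the score to annotated mentions means that clustering errors involving unannotated mentions go unseen, which is exactly the concern the slide raises.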

17 Other Issues
 Evidence & Confidence
   Annotation to provide evidence on NIL
   System confidence/justification
 Correct annotation errors
   Need community effort to report errors / share corrections
   Improve/extend annotation guidelines, check IAA
 Shift some annotation cost from annotating new data to knowledge resource construction?
   Current research bottlenecks in coreference and slot filling lie in knowledge acquisition rather than in a lack of labeled data
     e.g., semantic distance between any two nominals for coreference
     e.g., large-scale clean paraphrases for slot filling
