Motivations: Cross-lingual Information Fusion
- Who is Jim Parsons? How is he doing lately?
Motivations: A Smart Cross-lingual Kindle
- Example entities: Xi Jinping, Sunnylands, California, China, US, South Sea, Diaoyu Islands
- Navigating unfamiliar languages/domains
- Education purposes
Current Status of EDL/EL/Wikification
- English EDL attracted 20 teams
  - End-to-end EDL score: 70%
- EL: mature mono-lingual linking techniques, 90% accuracy
- But limited ACL papers on cross-lingual EDL/EL/Wikification
- Goals
  - Extend from mono-lingual to cross-lingual
  - Rapid KB construction for a foreign language
Tri-lingual EDL Task Definition
- Input
  - Source collection: English, Chinese, Spanish
  - KB: English only (Chinese KB and Spanish KB are disallowed)
  - Discourage using inter-lingual Wikipedia links
    - Rapid KB construction for low-density languages
- Output
  - Entity clusters presented in English; some have links to the English KB
  - Some clusters are from a single language, and some are from multiple languages
  - May need to normalize NIL mention translations for ground truth
- A typical system should extract entity mentions from all three languages, link them to the English KB, and cluster and translate NIL mentions
- Query: English, Chinese, Corpus
  - Some queries will be from a single language only
  - Some queries will exist in multiple languages to form cross-lingual entity clusters
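The clustering step of the typical system described above can be sketched as follows. This is a minimal illustration, not the task's reference implementation: the `Mention` fields, the language codes, the KB id `kb:0001`, and the rule of grouping NIL mentions by their normalized English translation are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    doc_id: str
    lang: str              # "eng", "cmn", or "spa" (hypothetical codes)
    text: str              # surface form in the source language
    kb_id: Optional[str]   # English KB entry id, or None for a NIL mention
    english: str           # English translation of the name


def cluster_mentions(mentions):
    """Group linked mentions by KB id and NIL mentions by normalized English name."""
    clusters = defaultdict(list)
    for m in mentions:
        if m.kb_id is not None:
            key = ("KB", m.kb_id)
        else:
            # Assumed normalization: lowercase and collapse whitespace.
            key = ("NIL", " ".join(m.english.lower().split()))
        clusters[key].append(m)
    return dict(clusters)


def languages(cluster):
    """Return the set of source languages that contribute to a cluster."""
    return {m.lang for m in cluster}


# Toy data; the KB id and the NIL name are made up for illustration.
mentions = [
    Mention("d1", "eng", "Xi Jinping", "kb:0001", "Xi Jinping"),
    Mention("d2", "cmn", "习近平", "kb:0001", "Xi Jinping"),
    Mention("d3", "spa", "Juan Pérez", None, "Juan Pérez"),
    Mention("d4", "eng", "Juan  Pérez", None, "Juan Pérez"),
]
clusters = cluster_mentions(mentions)
```

Here both resulting clusters are cross-lingual: one is linked to the English KB, the other is a NIL cluster whose members are tied together only through their English translations, which is why normalizing NIL translations matters for the ground truth.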
Tri-lingual Diagnostic EL Task Definition
- Perfect mentions (queries) are given
- Query: English, Chinese, Spanish
  - Some queries will be from a single language only
  - Some queries will exist in multiple languages to form cross-lingual entity clusters
Source Collection
- Some KBA web streaming data in English, Chinese, Spanish
- Some social media data with code-switching
- Some formal comparable newswire
- Some discussion forum posts
- Include KBP2014 EDL corpora
- Larger scale than KBP2014 EDL
- Share some documents with the Cold-start KBP task
- Maybe consider news only for 2015
KB: Freebase
- 2.6 billion triples (vs. 583 million triples in DBpedia)
- Potential problem (and opportunity)
  - Some entries don’t have corresponding Wikipedia pages, so systems don’t have Wikipedia articles to analyze (similar to the EL optional task before 2014)
  - May trigger some new research when the KB doesn’t include unstructured texts
Resources: English
- Google, LCC, IBM, RPI will run English EDL on the entire source collection
  - Each system generates the top 10 candidates for each mention, then the systems vote
  - Oracle linking accuracy should be above 97%
- Give these to LDC as starting points to speed up human annotation/assessment
  - A pipeline RPI+ISI did for AMR EDL annotation (ISI has an annotation interface to correct top 10 RPI system generated candidates + Google search + …)
- RPI can share English entity embeddings
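The slide does not specify how the systems vote or how oracle accuracy is computed; the sketch below shows one plausible reading. The voting rule (one vote per system for each candidate in its top 10, ties broken by summed rank) and all function names are assumptions for illustration. Oracle accuracy here is the fraction of mentions whose gold KB id appears anywhere in the pooled candidates, i.e. the ceiling a human annotator could reach by picking from the pool.

```python
from collections import defaultdict

TOP_K = 10  # each system contributes its top 10 candidates per mention


def pool_candidates(ranked_lists, top_k=TOP_K):
    """Merge per-system ranked candidate lists for one mention.

    Every system casts one vote for each candidate in its top_k.
    Ties are broken by smaller summed rank, then by id for determinism.
    """
    votes = defaultdict(int)
    rank_sum = defaultdict(int)
    for ranked in ranked_lists:
        for rank, kb_id in enumerate(ranked[:top_k]):
            votes[kb_id] += 1
            rank_sum[kb_id] += rank
    return sorted(votes, key=lambda c: (-votes[c], rank_sum[c], c))


def oracle_accuracy(pooled, gold):
    """Fraction of gold mentions whose KB id occurs in the pooled candidates."""
    if not gold:
        return 0.0
    hits = sum(1 for mention, kb_id in gold.items()
               if kb_id in pooled.get(mention, []))
    return hits / len(gold)


# Toy outputs from three hypothetical systems (ids are made up).
system_outputs = {
    "m1": [["e1", "e2"], ["e2", "e1"], ["e2", "e3"]],
    "m2": [["e4"], ["e5"], ["e4"]],
}
pooled = {m: pool_candidates(lists) for m, lists in system_outputs.items()}
gold = {"m1": "e3", "m2": "e9"}
accuracy = oracle_accuracy(pooled, gold)
```

In this toy case the gold entry for `m1` is recoverable from the pool even though it is not the top-voted candidate, while the gold entry for `m2` was proposed by no system, so the oracle accuracy is 0.5.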
Resources: Chinese
- Software
  - Stanford Basic Chinese NLP (name tagging, parsing)
  - CAS Basic Chinese NLP (POS tagging, name tagging, chunking)
  - RPI Chinese IE (name tagging, relation, event, not-great coref/nominal)
- Resources
  - RPI has 2 million manually cleaned Chinese-English name translation pairs to share, and Chinese entity embeddings
  - LDC has Chinese-English name dictionaries with frequency information
  - LDC is developing more training data for Chinese/Spanish SF
- Automatic Annotation
  - RPI can provide Chinese name tagging and translation, and event trigger/argument extraction
  - BBN/IBM run Chinese IE on the source collection
Resources: Spanish
- Software
  - Dependency parser: MaltParser
  - Stanford Spanish CoreNLP (name tagger, …) coming in the next 6 months
  - Need more help from the community
- Automatic Annotation
  - IBM runs Spanish ACE entity extraction (name, coref) and parsing on the source collection
Timeline
- Release training data in May
- A pilot study in May 2015
  - You can submit manual runs!
- Evaluation: September/October 2015
Teams with Expertise/Interest (only asked workshop attendees so far)
- English/Spanish/Chinese
  - Yes: IBM, HITS, NYU, RPI
  - Maybe: JHU, LCC, BBN …
- English/Chinese: PKU, Tsinghua, a lot more Chinese teams …
- English/Spanish
  - CSFG
  - Maybe: UIUC …
- Speed-dating between Chinese & Spanish teams
Another Ambitious Proposal: Cross-lingual Slot Filling
- Chinese-to-English Slot Filling
  - Annotation guideline available
  - BLENDER pilot system (Snover et al., 2011)
  - Off-cycle pilot in DEFT (Jan 2015): RPI, Univ. of Washington, Univ. of Wisconsin, CMU
- Spanish-to-English Slot Filling
  - Evaluation proposed in KBP2013
  - Guideline, annotation available
- Tri-lingual Slot Filling
Other Issues
- Mention Definition
  - Extraction for linking
  - Nested mentions
  - Posters
- Scoring
  - Is the current scoring reasonable?
  - If we do EDL on 50K documents and only partial entities/documents are manually annotated, how to evaluate clustering performance?
- Add new entity types in 2015: Location and Facility?
- Add non-name concepts (e.g., nominal mentions)?
  - Link “wife” in “Obama’s wife” to “Michelle” in KB
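One possible answer to the partial-annotation question is to score clustering only over the mentions that annotators actually labeled. The sketch below does this with B-cubed precision and recall. B-cubed is chosen here purely for illustration and is not claimed to be the official KBP metric; the restriction to mentions present in both the gold subset and the system output is likewise an assumption, and it ignores mention-detection misses.

```python
from collections import defaultdict


def b_cubed(system, gold):
    """B-cubed precision, recall, F1 over annotated mentions only.

    system, gold: dicts mapping mention id -> cluster id.
    Mentions without a gold annotation are ignored (assumed policy).
    """
    mentions = [m for m in gold if m in system]
    if not mentions:
        return 0.0, 0.0, 0.0

    sys_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for m in mentions:
        sys_clusters[system[m]].add(m)
        gold_clusters[gold[m]].add(m)

    precision = recall = 0.0
    for m in mentions:
        s = sys_clusters[system[m]]
        g = gold_clusters[gold[m]]
        overlap = len(s & g)
        precision += overlap / len(s)
        recall += overlap / len(g)

    precision /= len(mentions)
    recall /= len(mentions)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: mention "d" is unannotated and therefore not scored.
gold = {"a": "E1", "b": "E1", "c": "E2"}
system = {"a": "X", "b": "X", "c": "X", "d": "Y"}
p, r, f = b_cubed(system, gold)
```

The system over-merges the two gold entities, so recall is perfect while precision drops to 5/9. A limitation of this policy is that errors involving unannotated mentions, such as wrongly merging `d` into a gold cluster, would go unpenalized.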
Other Issues
- Evidence & Confidence
  - Annotation to provide evidence on NIL
  - System confidence/justification
- Correct annotation errors
  - Need community effort to report errors / share corrections
  - Improve/extend annotation guidelines, check IAA
- Shift some annotation cost from annotating new data to knowledge resource construction?
  - Current research bottlenecks on coreference and slot filling are on knowledge acquisition instead of more labeled data
    - e.g., semantic distance between any two nominals for coreference
    - e.g., large-scale clean paraphrases for slot filling