Tri-lingual EDL Planning Heng Ji (RPI) Hoa Trang Dang (NIST) WORRY, BE HAPPY!

Presentation transcript:

Motivations: Cross-lingual KBP

Motivations: Cross-lingual Information Fusion
- Who is Jim Parsons? How is he doing lately?

Motivations: A Smart Cross-lingual Kindle
- Example entities: Xi Jinping, Sunnylands, California, China, US, South Sea, Diaoyu Islands
- Navigating unfamiliar languages/domains
- Education purposes

Current Status of EDL/EL/Wikification
- English EDL attracted 20 teams
- End-to-end EDL score: 70%
- EL: mature mono-lingual linking techniques reach 90% accuracy
- But few ACL papers address cross-lingual EDL/EL/Wikification
- Goals
  - Extend from mono-lingual to cross-lingual
  - Rapid KB construction for a foreign language

Tri-lingual EDL Task Definition
- Input:
  - Source collection: English, Chinese, Spanish
  - KB: English only (Chinese and Spanish KBs are disallowed)
  - Discourage using inter-lingual Wikipedia links
    - Rapid KB construction for low-density languages
- Output: entity clusters presented in English; some have links to the English KB
  - Some clusters are from a single language, and some are from multiple languages
  - May need to normalize NIL mention translations for ground truth
- A typical system should extract entity mentions from all three languages, link them to the English KB, and cluster and translate NIL mentions
- Query: English, Chinese, Spanish
  - Some queries will be from a single language only
  - Some queries will exist in multiple languages to form cross-lingual entity clusters
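The output described above can be pictured as a small data structure. This is only a minimal sketch under stated assumptions: the class names, the field names, the language codes ("eng", "cmn", "spa"), the offsets, and the KB id "kb:E0001" are all hypothetical and are not taken from the task specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Mention:
    doc_id: str
    start: int       # character offset of the mention (illustrative)
    end: int
    text: str        # surface string in the source language
    language: str    # hypothetical codes: "eng", "cmn", "spa"


@dataclass
class EntityCluster:
    english_name: str        # clusters are presented in English
    kb_id: Optional[str]     # None stands for a NIL cluster (no KB entry)
    mentions: List[Mention] = field(default_factory=list)

    def languages(self) -> Set[str]:
        return {m.language for m in self.mentions}

    def is_cross_lingual(self) -> bool:
        return len(self.languages()) > 1

    def is_nil(self) -> bool:
        return self.kb_id is None


# A linked cluster with mentions from all three languages.
linked = EntityCluster(
    english_name="Xi Jinping",
    kb_id="kb:E0001",  # hypothetical identifier
    mentions=[
        Mention("doc_en_1", 10, 20, "Xi Jinping", "eng"),
        Mention("doc_zh_1", 4, 7, "习近平", "cmn"),
        Mention("doc_es_1", 33, 43, "Xi Jinping", "spa"),
    ],
)

# A NIL cluster from a single language; its name still has to be given in English.
nil_cluster = EntityCluster(
    english_name="Zhang Wei",
    kb_id=None,
    mentions=[Mention("doc_zh_2", 0, 2, "张伟", "cmn")],
)
```

Here `linked.is_cross_lingual()` is true and `nil_cluster.is_nil()` is true, which mirrors the two kinds of clusters the slide distinguishes.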

Tri-lingual Diagnostic EL Task Definition
- Perfect mentions (queries) are given
- Query: English, Chinese, Spanish
  - Some queries will be from a single language only
  - Some queries will exist in multiple languages to form cross-lingual entity clusters

Source Collection
- Some KBA web streaming data in English, Chinese, Spanish
- Some social media data with code-switching
- Some formal comparable newswire
- Some discussion forum posts
- Include KBP2014 EDL corpora
- Larger scale than KBP2014 EDL
- Share some documents with the Cold-start KBP task
- Maybe consider news only for 2015

KB: Freebase
- 2.6 billion triples (vs. 583 million triples in DBpedia)
- Potential problem (and opportunity)
  - Some entries don’t have corresponding Wikipedia pages, so systems don’t have Wikipedia articles to analyze (similar to the optional EL task before 2014)
  - May trigger some new research when the KB doesn’t include unstructured texts

Resources: English
- Google, LCC, IBM, and RPI will run English EDL on the entire source collection
  - Each generates the top 10 candidates for each mention, then the systems vote
  - Oracle linking accuracy should be above 97%
  - Give these to LDC as starting points to speed up human annotation/assessment
  - RPI and ISI used such a pipeline for AMR EDL annotation (ISI has an annotation interface to correct the top 10 RPI system-generated candidates + Google search + …)
- RPI can share English entity embeddings
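The pooling step above can be sketched as follows. The slides do not say how the vote is taken, so this assumes the simplest scheme: one vote per system per candidate, with ties broken by the best rank any system assigned. The function names and the KB ids ("E1", "E2", …) are hypothetical. Oracle accuracy is computed as the fraction of mentions whose gold KB entry appears anywhere in the pooled list, which is the quantity the 97% target refers to.

```python
from collections import Counter
from typing import Dict, List


def vote(candidate_lists: List[List[str]], top_k: int = 10) -> List[str]:
    """Pool ranked candidate lists from several systems for one mention."""
    votes: Counter = Counter()
    for candidates in candidate_lists:
        # dict.fromkeys removes duplicates while keeping rank order
        for kb_id in dict.fromkeys(candidates[:top_k]):
            votes[kb_id] += 1

    def best_rank(kb_id: str) -> int:
        return min(c.index(kb_id) for c in candidate_lists if kb_id in c[:top_k])

    # More votes first, then better rank, then id for a deterministic order.
    ranked = sorted(votes, key=lambda k: (-votes[k], best_rank(k), k))
    return ranked[:top_k]


def oracle_accuracy(pooled: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of gold mentions whose KB id is among the pooled candidates."""
    if not gold:
        return 0.0
    hits = sum(1 for mention, kb_id in gold.items() if kb_id in pooled.get(mention, []))
    return hits / len(gold)


# Three hypothetical systems, two mentions.
pooled = {
    "m1": vote([["E1", "E2"], ["E2", "E3"], ["E2", "E1"]]),
    "m2": vote([["E5"], ["E6"]]),
}
gold = {"m1": "E2", "m2": "E7"}
accuracy = oracle_accuracy(pooled, gold)
```

A human annotator would then only need to pick from (or reject) the pooled list for each mention, which is where the annotation speed-up would come from.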

Resources: Chinese
- Software
  - Stanford Basic Chinese NLP (name tagging, parsing)
  - CAS Basic Chinese NLP (POS tagging, name tagging, chunking)
  - RPI Chinese IE (name tagging, relation, event, not-great coref/nominal)
- Resources
  - RPI has 2 million manually cleaned Chinese-English name translation pairs and Chinese entity embeddings to share
  - LDC has Chinese-English name dictionaries with frequency information
  - LDC is developing more training data for Chinese/Spanish SF
- Automatic Annotation
  - RPI can provide Chinese name tagging and translation, and event trigger/argument extraction
  - BBN/IBM will run Chinese IE on the source collection

Resources: Spanish
- Software
  - Dependency parser: MaltParser
  - Stanford Spanish CoreNLP (name tagger, …) coming in the next 6 months
  - Need more help from the community
- Automatic Annotation
  - IBM will run Spanish ACE entity extraction (name, coref) and parsing on the source collection

Timeline
- Release training data in May
- A pilot study in May 2015
  - You can submit manual runs!
- Evaluation: September/October 2015

Teams with Expertise/Interest (only asked workshop attendees so far)
- English/Spanish/Chinese
  - Yes: IBM, HITS, NYU, RPI
  - Maybe: JHU, LCC, BBN
  - …
- English/Chinese
  - PKU, Tsinghua, a lot more Chinese teams
  - …
- English/Spanish
  - CSFG
  - Maybe: UIUC
  - …
- Speed-dating between Chinese & Spanish teams

Another Ambitious Proposal: Cross-lingual Slot Filling
- Chinese-to-English Slot Filling
  - Annotation guideline available
  - BLENDER pilot system (Snover et al., 2011)
  - Off-cycle pilot in DEFT (Jan 2015): RPI, Univ. of Washington, Univ. of Wisconsin, CMU
- Spanish-to-English Slot Filling
  - Evaluation proposed in KBP2013
  - Guideline and annotation available
- Tri-lingual Slot Filling

Other Issues
- Mention Definition
  - Extraction for linking
  - Nested mentions
  - Posters
- Scoring
  - Is the current scoring reasonable?
  - If we do EDL on 50K documents and only some of the entities/documents are manually annotated, how do we evaluate clustering performance?
- Add new entity types in 2015: Location and Facility?
- Add non-name concepts (e.g., nominal mentions)?
  - Link “wife” in “Obama’s wife” to “Michelle” in KB
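One possible answer to the partial-annotation question, offered as an illustration and not as anything the slides propose, is to compute a standard clustering metric such as B-cubed over only the manually annotated mentions. The sketch below assumes that system mentions outside the annotated subset are ignored, and that annotated mentions the system missed count as singleton clusters; the function name and the toy ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def b_cubed_on_annotated(
    system: Dict[str, Hashable], gold: Dict[str, Hashable]
) -> Tuple[float, float, float]:
    """B-cubed precision, recall, F1 over the annotated mentions only.

    system, gold: mention id -> cluster id. Only mentions present in gold
    are scored; a gold mention missing from system is its own singleton.
    """
    if not gold:
        return 0.0, 0.0, 0.0

    restricted = {m: system.get(m, ("missing", m)) for m in gold}
    sys_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for m in gold:
        sys_clusters[restricted[m]].add(m)
        gold_clusters[gold[m]].add(m)

    precision = recall = 0.0
    for m in gold:
        s = sys_clusters[restricted[m]]
        g = gold_clusters[gold[m]]
        overlap = len(s & g)
        precision += overlap / len(s)
        recall += overlap / len(g)

    precision /= len(gold)
    recall /= len(gold)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Mention "d" is unannotated, so it does not affect the score.
gold = {"a": 1, "b": 1, "c": 2}
system = {"a": "x", "b": "x", "c": "x", "d": "y"}
p, r, f = b_cubed_on_annotated(system, gold)
```

In this toy case the system merges everything into one cluster, so recall is perfect while precision drops. A restriction like this cannot penalize wrong merges between annotated and unannotated mentions, which is part of why the question on the slide remains open.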

Other Issues
- Evidence & Confidence
  - Annotation to provide evidence on NIL
  - System confidence/justification
- Correct annotation errors
  - Need community effort to report errors / share corrections
  - Improve/extend annotation guidelines, check IAA
- Shift some annotation cost from annotating new data to knowledge resource construction?
  - Current research bottlenecks in coreference and slot filling lie in knowledge acquisition rather than in a lack of labeled data
    - e.g., semantic distance between any two nominals for coreference
    - e.g., large-scale clean paraphrases for slot filling