A Corpus for Cross-Document Co-Reference D. Day 1, J. Hitzeman 1, M. Wick 2, K. Crouch 1 and M. Poesio 3 1 The MITRE Corporation 2 University of Massachusetts,

Presentation transcript:

A Corpus for Cross-Document Co-Reference
D. Day (1), J. Hitzeman (1), M. Wick (2), K. Crouch (1) and M. Poesio (3)
(1) The MITRE Corporation
(2) University of Massachusetts, Amherst
(3) Universities of Essex and Trento
Approved for public release. Distribution unlimited. MITRE case number #

Within-doc Coreference
The LDC has developed a corpus for within-doc coreference, i.e., when a phrase in a document refers back to a previously mentioned entity
“Smith succeeded Jones as CEO of the company. He started his career at IBM….”

Cross-doc Coreference
In order to determine a chain of events, the movements of a person, changes in ownership of a company, etc., we need a corpus that identifies co-referring mentions of entities appearing in different documents
“Smith succeeded Jones as CEO of the company. He started his career at IBM….”
“Smith is currently the vice-president of IBM. He was hired in 1972 in order to improve profits.”

The Johns Hopkins Workshop
Johns Hopkins hosted a summer workshop
– To investigate the use of lexical and encyclopedic resources to improve coreference resolution
– To build a cross-doc corpus
– To build systems to perform cross-doc coreference
One question was how well the techniques used for within-doc coreference would carry over to cross-doc coreference
Our team was in charge of building the corpus
We intend to release this corpus for unlimited use and distribution

The Technique
We began with the within-doc corpus developed by the LDC for the Automatic Content Extraction (ACE) competition
We built the Callisto/EDNA annotation tool
– A specialized annotation task plug-in for the Callisto annotation tool
– A Callisto client plug-in that uses a web server (Tomcat) and search/indexing web services plug-ins that support multiple simultaneous annotators

The Search Query and Search Results Panes

Search Results Details Pane

The Annotation Process
An entity was considered for cross-doc linking if
– It has at least one mention of type NAME within a document
– It is of type PER, ORG, GPE or LOC
To expedite the process, we applied an initial automated cross-doc linking prior to manual annotation
– E.g., all mentions of “Tony Blair” were coreferenced
– When a NAME is common, this pre-linking saved the annotator many mouse clicks
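The two eligibility criteria above can be sketched as a simple filter. This is a minimal illustration, not the workshop's code: the `Mention` and `Entity` classes, their field names, and the label strings (including the "WEA" type code in the usage example) are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Entity types eligible for cross-doc annotation, as listed on the slide.
ELIGIBLE_TYPES = {"PER", "ORG", "GPE", "LOC"}


@dataclass
class Mention:
    text: str
    mention_type: str  # hypothetical labels: "NAME", "NOMINAL", "PRONOUN"


@dataclass
class Entity:
    entity_id: str
    entity_type: str
    mentions: list = field(default_factory=list)


def is_eligible(entity: Entity) -> bool:
    """True if the entity has at least one NAME mention and an eligible type."""
    has_name = any(m.mention_type == "NAME" for m in entity.mentions)
    return has_name and entity.entity_type in ELIGIBLE_TYPES


# Usage: only the first entity passes both criteria.
smith = Entity("doc1-E1", "PER", [Mention("Smith", "NAME"), Mention("He", "PRONOUN")])
speaker = Entity("doc1-E2", "PER", [Mention("the speaker", "NOMINAL")])
missile = Entity("doc1-E3", "WEA", [Mention("Scud", "NAME")])
eligible = [e.entity_id for e in (smith, speaker, missile) if is_eligible(e)]
```

Here `speaker` fails because it has no NAME mention, and `missile` fails because its type is outside the four eligible types.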

The Pre-Linking Process
The pre-linked entities had to have at least one identical NAME mention and to be of the same TYPE and SUBTYPE
We were concerned that the automatic pre-linking would produce errors, but it produced very few
The errors were largely due to errors in the within-doc data, e.g., within-doc coreferencing of
– “anonymous speaker” with other anonymous speakers
– “Scott Peterson” and “Laci Peterson”
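The pre-linking rule can be sketched with a union-find over document-level entities. This is an illustrative sketch under stated assumptions, not the tool's actual implementation: the dictionary keys (`id`, `type`, `subtype`, `names`), the exact case-sensitive string match, and the transitive merging of links are all choices made for the example.

```python
from collections import defaultdict


def pre_link(entities):
    """Group entities that share an identical NAME string, type and subtype.

    `entities` is a list of dicts with the hypothetical keys "id", "type",
    "subtype" and "names" (the strings of the entity's NAME mentions).
    Returns a list of sets of entity ids.
    """
    parent = {e["id"]: e["id"] for e in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first entity seen for each (type, subtype, name) key
    # and link every later entity with the same key to it.
    first_seen = {}
    for e in entities:
        for name in e["names"]:
            key = (e["type"], e["subtype"], name)
            if key in first_seen:
                union(first_seen[key], e["id"])
            else:
                first_seen[key] = e["id"]

    clusters = defaultdict(set)
    for e in entities:
        clusters[find(e["id"])].add(e["id"])
    return list(clusters.values())


# Usage: the two "Tony Blair" entities are linked; the ORG entity with the
# same name string is kept apart because its type differs.
entities = [
    {"id": "doc1-E1", "type": "PER", "subtype": "Individual", "names": ["Tony Blair", "Blair"]},
    {"id": "doc2-E4", "type": "PER", "subtype": "Individual", "names": ["Tony Blair"]},
    {"id": "doc3-E2", "type": "ORG", "subtype": "Commercial", "names": ["Tony Blair"]},
]
clusters = pre_link(entities)
```

Because links are merged transitively, one wrong within-doc link (such as a stray “Peterson” mention) can pull two distinct people into the same cluster, which matches the kind of error described above.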

The ACE2005 English EDT Corpus
1.5 million characters
257,000 words
18,000 distinct document-level entities (prior to cross-doc linking)
– PER 9.7K
– ORG 3K
– Geo-Political entity (GPE) 3K
– FAC 1K
– LOC 897
– Weapon 579
– Vehicle [...]
[...],000 entity mentions
– Pronoun 20K
– Name 18K
– Nominal 17K

Resulting Entities
7,129 entities satisfied the constraints required for cross-doc annotation
Automatic and manual annotation resulted in 3,660 entities
Of these, 2,390 entities were mentioned in only one document
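A short worked calculation makes the relationship between these three counts explicit. The variable names are illustrative, and reading the drop from 7,129 to 3,660 as document-level entities being merged into cross-doc entities is an interpretation of the slide rather than something it states.

```python
# Counts taken from the slide.
eligible_doc_entities = 7129  # document-level entities meeting the criteria
cross_doc_entities = 3660     # entities after automatic and manual linking
single_doc_entities = 2390    # of those, mentioned in only one document

# Cross-doc entities that appear in more than one document.
multi_doc_entities = cross_doc_entities - single_doc_entities

# Net reduction in entity count produced by linking.
merged_away = eligible_doc_entities - cross_doc_entities

# Share of cross-doc entities confined to a single document.
single_doc_share = single_doc_entities / cross_doc_entities

print(multi_doc_entities, merged_away, round(single_doc_share, 3))
```

So 1,270 entities span more than one document, and roughly 65% of the resulting entities occur in a single document.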

Comparison to Previous Work
John Smith corpus (Bagga et al., 1998)
– Baldwin and Bagga created a cross-doc corpus and evaluated it for the common name “John Smith”
Benefits of our work
– By using an existing within-doc corpus, we have high-quality co-reference information for both within-doc and cross-doc
– This corpus is significantly larger than previous data sets

Data Format
The output is similar to the ACE APF format
John Wayne...

Observations
One side effect of performing cross-doc coreference is that it exposed errors in the within-doc annotation
– E.g., “Scott Peterson” and “Laci Peterson” are coreferenced because there is a misannotated reference to “Peterson”
It allowed us to cross-reference names with nicknames that will not be found in a gazetteer
– E.g., “Bama” with “Alabama”
– “Q”, “Qland”, “Queensland”
– Once a nickname is linked to its full name, it can be mapped using a gazetteer

Scoring
To test the ambiguity of the dataset, we implemented a discriminatively trained clustering algorithm similar to Culotta et al. (2007)
We measured cross-doc coreference performance on a held-out test set of gold-standard documents
– F = 0.96 (B-cubed)
– F = 0.91 (Pairwise)
– F = 0.89 (MUC)
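For readers unfamiliar with the three metrics, the sketch below computes B-cubed, pairwise and MUC F-scores for a predicted clustering against a gold clustering. It follows the standard textbook definitions and is not the scorer used at the workshop; it assumes both clusterings are lists of sets that partition the same items, and the function names are illustrative.

```python
from itertools import combinations


def _f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def _index(clusters):
    """Map each item to the index of the cluster that contains it."""
    return {item: i for i, c in enumerate(clusters) for item in c}


def b_cubed(gold, pred):
    """Per-item precision and recall, averaged over all items."""
    gold_of, pred_of = _index(gold), _index(pred)
    p = r = 0.0
    for x in gold_of:
        g, s = gold[gold_of[x]], pred[pred_of[x]]
        overlap = len(g & s)
        p += overlap / len(s)
        r += overlap / len(g)
    n = len(gold_of)
    return p / n, r / n, _f1(p / n, r / n)


def _pairs(clusters):
    return {frozenset(pair) for c in clusters for pair in combinations(sorted(c), 2)}


def pairwise(gold, pred):
    """Precision and recall over coreferent item pairs."""
    gold_pairs, pred_pairs = _pairs(gold), _pairs(pred)
    tp = len(gold_pairs & pred_pairs)
    p = tp / len(pred_pairs) if pred_pairs else 0.0
    r = tp / len(gold_pairs) if gold_pairs else 0.0
    return p, r, _f1(p, r)


def _muc_recall(key, response):
    """Link-based recall: (|K| - partitions of K) summed over key clusters."""
    resp_of = _index(response)
    num = den = 0
    for k in key:
        partitions = {resp_of[x] for x in k}
        num += len(k) - len(partitions)
        den += len(k) - 1
    return num / den if den else 0.0


def muc(gold, pred):
    r = _muc_recall(gold, pred)
    p = _muc_recall(pred, gold)  # precision swaps the roles of key and response
    return p, r, _f1(p, r)


# Usage on a toy example with one misplaced item ("c").
gold = [{"a", "b", "c"}, {"d", "e"}]
pred = [{"a", "b"}, {"c", "d", "e"}]
scores = {"bcubed": b_cubed(gold, pred), "pairwise": pairwise(gold, pred), "muc": muc(gold, pred)}
```

On this toy input the three F-scores come out as 11/15, 1/2 and 2/3, which shows how the same clustering error is penalised differently by each metric; the figures reported on the slide come from the authors' own system and data.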