Learning Event Durations from Event Descriptions. Feng Pan, Rutu Mulkar, Jerry R. Hobbs, University of Southern California. ACL '06.


Introduction. Example: George W. Bush met with Vladimir Putin in Moscow. How long was the meeting? Most people would say the meeting lasted between an hour and three days. This research is potentially very important for applications in which the time course of events must be extracted from news.

Inter-Annotator Agreement. The kappa statistic (Krippendorff, 1980; Carletta, 1996) has become the de facto standard for assessing inter-annotator agreement. It is computed as: kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed agreement among the annotators and P(E) is the expected agreement, i.e., the probability that the annotators agree by chance.
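The kappa computation on the slide above is a one-liner; a minimal sketch (function name is illustrative, not from the paper):

```python
def kappa(p_a: float, p_e: float) -> float:
    """Chance-corrected agreement: how far observed agreement p_a
    exceeds the chance agreement p_e, rescaled to at most 1."""
    return (p_a - p_e) / (1.0 - p_e)

# With the paper's expected agreement P(E) = 0.15, an observed
# agreement of 0.80 yields kappa = 0.65 / 0.85, about 0.76.
print(kappa(0.80, 0.15))
```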

What Should Count as Agreement? Determining what should count as agreement is not only important for assessing inter-annotator agreement, but is also crucial for evaluation. We use the normal (i.e., Gaussian) distribution to model our duration distributions.

What Should Count as Agreement? If the area between the lower and upper bounds covers 80% of the entire distribution, the bounds are each 1.28 standard deviations from the mean. With this data model, the agreement between two annotations can be defined as the overlapping area between the two normal distributions. The agreement among many annotations is the average of all the pairwise overlapping areas.

A logarithmic scale is used for the durations.

Expected Agreement

There are two peaks in this distribution. One is from 5 to 7 on the natural logarithmic scale, which corresponds to about 1.5 minutes to 30 minutes; the other is from 14 to 17, which corresponds to about 8 days to 6 months. We also computed the distribution of the widths (i.e., X_upper - X_lower) of all the annotated durations.

Expected Agreement Two different methods were used to compute the expected agreement (baseline), both yielding nearly equal results. These are described in detail in (Pan et al., 2006). For both, P(E) is about 0.15.

Features Local Context Syntactic Relations WordNet Hypernyms

Local Context: a window of n tokens to the event's left and right. The best n, determined via cross-validation, turned out to be 0, i.e., the event itself with no local context; but we also present results for n = 2 to evaluate the utility of local context. For each token, three features are included: the original form of the token, its lemma (or root form), and its part-of-speech (POS) tag.
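The local-context feature vector can be sketched as follows. This is an illustration, not the paper's code: tokens are assumed to arrive as (surface, lemma, POS) triples from some tagger and lemmatizer, and positions beyond the sentence boundary are padded with a NULL placeholder:

```python
def context_features(tokens, event_index, n):
    """Flatten the event token plus n tokens of left/right context
    into one feature vector of (surface, lemma, POS) values."""
    feats = []
    for i in range(event_index - n, event_index + n + 1):
        if 0 <= i < len(tokens):
            feats.extend(tokens[i])          # (surface, lemma, POS)
        else:
            feats.extend(("NULL",) * 3)      # padding past the sentence edge
    return feats

tokens = [("Bush", "Bush", "NNP"), ("met", "meet", "VBD"), ("with", "with", "IN")]
print(context_features(tokens, 1, 0))  # n = 0: just the event itself
print(context_features(tokens, 1, 1))  # n = 1: one token of context each side
```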

Local Context

Syntactic Relations. For a given event, both the head of its subject and the head of its object are extracted from the parse trees generated by the CONTEX parser. In sentence (1), the head of the subject is "presidents" and the head of the object is "plan", so the feature vector is [presidents, president, NNS, plan, plan, NN].

WordNet Hypernyms. Events with the same hypernyms may have similar durations. Hypernyms are extracted only for the events and their subjects and objects, not for the local context words. A word sense disambiguation module might improve the learning performance; but since the features we need are the hypernyms, not the word sense itself, even if the first word sense is not the correct one, its hypernyms can still be good enough in many cases.
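The hypernym feature can be sketched with a toy stand-in for WordNet. The chains below are hand-written illustrations, not real WordNet output; a real system would walk the first sense's hypernym path via a WordNet API:

```python
# Hypothetical hypernym chains, standing in for WordNet's
# first-sense hypernym path (real chains would come from WordNet).
TOY_HYPERNYMS = {
    "meeting": ["gathering", "social_event", "event"],
    "war": ["military_action", "group_action", "event"],
}

def hypernym_features(word, depth=3):
    """First `depth` hypernyms of the word's first sense, NULL-padded
    so every event yields a fixed-length feature vector."""
    chain = TOY_HYPERNYMS.get(word, [])[:depth]
    return chain + ["NULL"] * (depth - len(chain))

print(hypernym_features("meeting"))
print(hypernym_features("summit"))  # unknown word: all NULL
```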

WordNet Hypernyms

Experiments. The corpus we have annotated currently contains all 48 non-Wall-Street-Journal (non-WSJ) news articles (a total of 2132 event instances), as well as 10 WSJ articles (156 event instances), from the TimeBank corpus annotated in TimeML (Pustejovsky et al., 2003). The non-WSJ articles (mainly political and disaster news) include both print and broadcast news from a variety of news sources, such as ABC, AP, and VOA.

Experiments. Annotators were instructed to provide lower and upper bounds on the duration of the event, encompassing 80% of the possibilities and taking the entire context of the article into account. In our first machine learning experiment, we learn this coarse-grained event duration information as a binary classification task.

Data. For each event annotation, the most likely (mean) duration is calculated first by averaging (the logs of) its lower and upper bound durations. If the most likely (mean) duration is less than a day (about 11.4 on the natural logarithmic scale), the event is assigned to the "short" class; otherwise it is assigned to the "long" class.

Data. We divide the total annotated non-WSJ data (2132 event instances) into two sets: a training set with 1705 event instances (about 80% of the total non-WSJ data) and a held-out test set with 427 event instances (about 20%). The WSJ data (156 event instances) is kept for further testing.

Learning Algorithms. Since 59.0% of the total data is "long" events, the baseline performance is 59.0%. We compare Support Vector Machines (SVM), Naïve Bayes (NB), and Decision Trees (C4.5).
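A minimal sketch of the classifier comparison using scikit-learn on a toy stand-in data set (the data and feature dicts below are invented for illustration; the paper uses its own feature set, and scikit-learn's tree is CART rather than C4.5):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Tiny invented data set: one symbolic feature dict per event instance.
X_dicts = [
    {"event": "said"}, {"event": "met"}, {"event": "explosion"},
    {"event": "war"}, {"event": "occupation"}, {"event": "election"},
]
y = ["short", "short", "short", "long", "long", "long"]

vec = DictVectorizer()          # one-hot encode the symbolic features
X = vec.fit_transform(X_dicts)

scores = {}
for clf in (LinearSVC(), BernoulliNB(), DecisionTreeClassifier(random_state=0)):
    clf.fit(X, y)
    scores[type(clf).__name__] = clf.score(X, y)  # training accuracy on toy data
print(scores)
```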

Experimental Results (non-WSJ)

Feature Evaluation

We can see that most of the performance comes from the event word itself. A significant improvement above that is due to the addition of syntactic relations ("Syn"). Local context and hypernyms do not seem to help. In the "Syn+Hyper" cases, the learning algorithm gives identical results with and without local context, probably because the other features dominate.

Experimental Results (WSJ). The precision (75.0%) is very close to the test performance on the non-WSJ data.

Learning the Most Likely Temporal Unit. Seven classes: second, minute, hour, day, week, month, and year. However, human agreement on this more fine-grained task is low (44.4%), so "approximate agreement" is computed for the most likely temporal unit of events. In "approximate agreement", temporal units are considered to match if they are the same temporal unit or an adjacent one.
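The adjacency check behind approximate agreement is simple to state in code; a sketch (function name is illustrative):

```python
# The seven temporal-unit classes, in order of granularity.
UNITS = ["second", "minute", "hour", "day", "week", "month", "year"]

def approx_match(u1, u2):
    """Units match if they are identical or adjacent on the scale,
    e.g. "day" matches "hour", "day", and "week"."""
    return abs(UNITS.index(u1) - UNITS.index(u2)) <= 1

print(approx_match("day", "week"))    # adjacent units match
print(approx_match("second", "hour")) # two steps apart: no match
```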

Learning the Most Likely Temporal Unit. Human agreement rises to 79.8% under approximate agreement. Since the "week", "month", and "year" classes together take up the largest portion (51.5%) of the data, the baseline is to always choose the "month" class.

Conclusion. We have addressed the problem of extracting information about event durations encoded in event descriptions. We described a method for measuring inter-annotator agreement when the judgments are intervals on a scale. We have shown that machine learning techniques achieve impressive results.