1 Combining Probability-Based Rankers for Action-Item Detection HLT/NAACL 2007 April 24, 2007 Paul N. Bennett Microsoft Research Jaime G. Carbonell Carnegie Mellon, LTI Copyright © 2007 Paul N. Bennett, Microsoft Corporation

2 Action Items Action-Item: An explicit request for information that requires the recipient's attention or action.

3 Problem Motivation Many users have limited time and more email than they can process efficiently and accurately. Especially important during crunch times or crises. Some emails have a greater response urgency than others; those that contain action-items are more likely to be urgent. Action-item detection is one part of a comprehensive email system including spam detection, prioritization, time management, etc.

4 Primary Tasks Document detection: Classify a document as to whether or not it contains an action-item. Document ranking: Rank the documents such that all documents containing action-items occur as high as possible in the ranking. Sentence detection: Classify each sentence in a document as to whether or not it is an action-item.

5 Standard vs Fine-Grained Text Classification Document-level instances: treat each document as an instance. Sentence-level instances: treat each (automatically-segmented) sentence as an instance, then make document-level predictions using the sentence-level predictions. The most basic rule: predict that a document is in the action-item class if it contains a sentence predicted to be an action-item (sketched below).
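A minimal sketch of that basic rule in Python; sentence_classifier is a hypothetical stand-in for any of the sentence-level classifiers discussed later:

```python
# Most basic document-level decision from sentence-level predictions:
# the document is in the action-item class if any of its sentences is
# predicted to be an action-item.
def document_contains_action_item(sentences, sentence_classifier):
    return any(sentence_classifier(s) for s in sentences)
```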

6 Representation and View Differences from Other Classification Tasks Unlike topic classification, keywords at the document level don't really capture the major semantics: whether or not "could" and "you" occur in a document is relatively uninformative. For this reason, n-grams are more effective at both levels. Other features, such as end-of-sentence terminators and position in the document, have a high impact as well. Fine-grained judgments can be used by a sentence-level classifier to predict with high accuracy on this task.

7 Different Views Focus on Different Features The document-level view tends to select features that indicate messages from people or organizations with an extremely high or low number of action-items: org, com, edu, joe, sue. These features are very corpus-specific but can work well at times; n-grams significantly improve the document-level approach. The sentence-level view selects words that are relevant to the task regardless of the corpus, even though at the document level these words can be common in most documents: "could", "you", "UPS", "send". N-grams matter less at the sentence level because the sentence already provides a window.

8 What approach should we use? Document-level view or sentence-level? N-grams or bag-of-words? Which algorithm: naïve Bayes (multinomial or multivariate Bernoulli), dependency networks, linear SVMs, kNN? Let's just use them all and combine them!

9 Metaclassifiers STRIVE: Stacked Reliability Indicator Variable Ensemble, an extension of stacking (Wolpert, 1992). [Diagram: the base classifiers' outputs w1, ..., wn and reliability indicators r1, ..., rn feed a metaclassifier that predicts the class c.] Nested cross-validation over the training data: the values obtained while an item was in the validation set are used as the metaclassifier's inputs. A sketch of the stacking step follows below.
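A minimal sketch of that stacking step, assuming scikit-learn-style estimators (the library and function names are illustrative, not the original implementation); cross_val_predict plays the role of the nested cross-validation, so each training item's base score comes from a model that never trained on it:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def fit_strive(base_classifiers, X, y, reliability_indicators):
    # Base outputs w_1..w_n, each computed while the item sat in a
    # held-out validation fold.
    meta_features = [
        cross_val_predict(clf, X, y, cv=10, method="decision_function")
        for clf in base_classifiers
    ]
    # Metaclassifier input: base outputs plus reliability indicators r_1..r_n.
    Z = np.column_stack(meta_features + [reliability_indicators])
    metaclassifier = LinearSVC().fit(Z, y)
    for clf in base_classifiers:  # refit the bases on all training data
        clf.fit(X, y)
    return metaclassifier
```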

10 Defining Reliability Indicators in STRIVE The original STRIVE model lacked a formalization of which properties of the model and the current example are useful for combination. We need reliability indicator variables that come with each classification model.

11 kNN-Based Local Variance [Figure: the score f(x) at the current point alongside the scores f(x1), ..., f(x6) at its nearest neighbors; the local variance of these neighboring scores serves as the indicator.]
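An illustrative sketch of such an indicator (the exact formula is an assumption, not taken from the paper): the variance of the base classifier's scores over the k training points nearest to x.

```python
import numpy as np

def knn_local_variance(x, train_X, train_scores, k=6):
    # Variance of the scores f(x_i) at the k nearest neighbors of x;
    # high variance suggests the score surface is unstable around x.
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.var(train_scores[nearest]))
```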

12 What if we had a single base classifier? Assume binary classification with labels {-1, +1}. The base classifier estimates the log-odds, λ̂(x), of belonging to the positive class. The metaclassifier learns a weight vector w and makes its final prediction of the log-odds as a linear correction, w0 + w1·λ̂(x). The metaclassifier can only improve matters if the base classifier is uncalibrated, both in the linear-transform case and in general (DeGroot and Fienberg, Bayesian Inference and Decision Techniques, 1986). Platt recalibration is a special case of this.
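A minimal sketch of fitting that linear correction; using logistic regression on the base score alone is the Platt-style recalibration the slide mentions (the library choice is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_correction(base_scores, labels):
    # Learns the correction w0 + w1 * lambda_hat(x) by logistic regression
    # on the base classifier's (held-out) scores.
    lr = LogisticRegression().fit(np.asarray(base_scores).reshape(-1, 1), labels)
    return lr.intercept_[0], lr.coef_[0, 0]  # (w0, w1)
```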

13 What about locally linear corrections? What if the metaclassifier learns weighting functions of the inputs, W0(x) and W1(x), and then outputs W0(x) + W1(x)·λ̂(x)? Assuming we have a local distribution Δ_x = p(z|x) that gives the probability of drawing a point z similar to x, we can recast this problem: for every x, the metaclassifier uses the weight vector w obtained by solving w(x) = argmin_w E_{z~Δ_x}[(λ(z) − w0 − w1·λ̂(z))²], where λ denotes the true log-odds.

14 Motivation for Model-Based Indicators Assume we know the true log-odds, λ. Then, by the standard least-squares solution, w1(x) = Cov_{Δ_x}(λ, λ̂) / Var_{Δ_x}(λ̂) and w0(x) = E_{Δ_x}[λ] − w1(x)·E_{Δ_x}[λ̂]. Obviously we can't compute the terms involving the true log-odds, but each classification model can specify a Δ and then compute terms like the sensitivity of its own estimate, such as Var_{Δ_x}(λ̂).

15 Model-Specific Reliability Indicators For each model, define a distribution over documents similar to the current document, then compute indicators under it (a sketch for the unigram case follows below). kNN: randomly shift toward one of the k neighbors. Unigram: randomly delete a word. Naïve Bayes: randomly flip a bit in the vocabulary vector. SVM: randomly shift toward the support vectors. Decision tree: randomly shift toward nearby leaves.
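For concreteness, a hypothetical sketch of the unigram case; the indicator computed here (score spread under the locality distribution) is illustrative rather than one of the paper's exact variables:

```python
import random

def sample_unigram_neighbor(tokens):
    # Delta_x for the unigram model: a document similar to the current one
    # is the current document with one word deleted at random.
    z = list(tokens)
    z.pop(random.randrange(len(z)))
    return z

def unigram_score_spread(tokens, log_odds, n_samples=50):
    # Spread of the model's log-odds over samples from Delta_x: a simple
    # sensitivity-style reliability indicator.
    scores = [log_odds(sample_unigram_neighbor(tokens)) for _ in range(n_samples)]
    return max(scores) - min(scores)
```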

16 Model-Specific Reliability Indicators (cont.) We continued developing similar variables from related terms. In total, the number of variables for each model: kNN: 10; SVM: 5; multivariate Bernoulli naïve Bayes (MBNB): 6; multinomial naïve Bayes (NB): 6.

17 Data Collection 744 email messages collected at CMU and anonymized. For this experiment, the messages were hand-cleaned by removing embedded previous messages, attachments, etc.; this prevents chronological taint across cross-validation folds and was needed for token balancing in the user experiment. Two people labeled all 744 messages: at the message level, 93% agreement (Kappa = 0.85); at the sentence level, 98% agreement (Kappa = 0.82). Kappa is the better indicator here, since labeling all 6301 sentences as non-action-items would already yield high raw agreement. Disputes were resolved to determine the gold standard (44% of messages contain action-items).

18 Base Classifiers Dnet: decision trees built with a Bayesian machine-learning algorithm (i.e., dependency networks) using the WinMine Toolkit; estimated log-odds at the leaf nodes. SVM: linear support vector machines built using SVMLight; margin score. Naïve Bayes: also referred to as the multivariate Bernoulli model in the literature; smoothed estimated log-odds. Unigram: also referred to as the multinomial naïve Bayes classifier in the literature; smoothed estimated log-odds. kNN: distance-weighted voting with an s-cut threshold.

19 Obtaining Document Rankings from Sentence-Level Classifiers A simple combination of the per-sentence scores: if any sentence was predicted positive, the document score is the sum of all sentence scores above the threshold; otherwise it is the max of the sentence scores. The score is then normalized by the length of the document, since longer documents (more sentences) give rise to more false positives. (See the sketch below.)
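A minimal sketch of this combination rule, assuming the sentence scores are already computed and threshold is the sentence-level decision threshold:

```python
def document_ranking_score(sentence_scores, threshold=0.0):
    above = [s for s in sentence_scores if s > threshold]
    # Any predicted-positive sentence: sum of the scores above the
    # threshold; otherwise fall back to the best single sentence.
    score = sum(above) if above else max(sentence_scores)
    # Normalize by document length, since longer documents (more
    # sentences) give rise to more false positives.
    return score / len(sentence_scores)
```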

20 Feature Representations Bag-of-words: an alphanumeric bag-of-words representation plus sentence-ending punctuation. Ngram: the basic representation plus sentence-ending punctuation, n-grams, and the relative position of the sentence in the document (for the sentence-level classifiers).

21 Performance Measures Ranking: area under the ROC curve (AUC), equivalent to the Mann-Whitney-Wilcoxon sum-of-ranks test (Hanley & McNeil, Radiology, 1982; Flach, ICML Tutorial, 2004); it is the probability that a randomly chosen positive example x+ is ranked higher than a randomly chosen negative example x-, i.e. P(s(x+) > s(x-)) (computed directly in the sketch below). RRA: relative residual area, (1 − AUC) / (1 − AUC_Baseline). bRRA: the decrease relative to the AUC of the oracle-selected best base classifier. dRRA: the decrease relative to the AUC of the oracle-selected dynamically best base classifier per cross-validation run. F1: to ensure the ranking improvement does not come at the cost of a significant decrease in classification performance.
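The AUC and RRA definitions above translate directly into code; a brute-force sketch:

```python
def auc(pos_scores, neg_scores):
    # P(s(x+) > s(x-)), the Mann-Whitney-Wilcoxon statistic; ties count
    # one half.  O(n+ * n-), fine at this corpus size.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def rra(auc_value, auc_baseline):
    # Relative residual area: (1 - AUC) / (1 - AUC_Baseline).
    return (1.0 - auc_value) / (1.0 - auc_baseline)
```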

22 Methodological Details 10-fold cross-validation. Top 300 features ranked by χ². Two-tailed t-test with p = 0.05 to judge significance.

23 Metaclassifiers 20 base classifiers: 5 algorithms × 2 representations × 2 level views. Stacking: a linear SVM using just the base-classifier outputs. STRIVE: a linear SVM using the base-classifier outputs plus document-level model-based RIVs (2 × 29 = 58), sentence-level model-based RIVs averaged across the sentence instances (2 × 29 = 58), the mean and deviation of the confidence scores for the sentences in a document (2 × 2 × 5 = 20), and two voting-based RIVs (from Bennett et al., 2005).

24 Action-Item Detection Ranking Performance

25 Combining Action-Item Detector Performance A 24% improvement over the best base classifier! A 6% improvement over the dynamically chosen best base classifier.

26 User Experiments (Jill Lehman & Aaron Steinfeld)

27 Related Work on Action-Item Detection Cohen et al. (EMNLP, 2004) predict an ontology of speech acts in email; action-items can be seen as one (very important) type of speech act. They worked only with document-level judgments, whereas we focus on both using and predicting at finer levels of granularity. Corston-Oliver et al. (ACL-WS, 2004) automatically construct to-do lists; they use fine-grained judgments but offer no study of their impact (does the extra label-collection effort really pay off in performance?). Bennett and Carbonell (SIGIR BBOW WS, 2005). Bennett (PhD Thesis, 2006).

28 Related Work on Classifier Combination Bennett et al. (Information Retrieval, 2005). Bennett (PhD Thesis, 2006). Kahn (PhD Thesis, 2004). Lee et al. (ICML 2006). Wolpert (Neural Networks, 1992).

29 Conclusions & Future Work A formal motivation for reliability indicators. Locality distributions to compute indicators tied to common classification models. Ranking performance improved by 24% relative to the best base classifier, with less variation in performance relative to the training set. Future work: use the sensitivity estimates more directly, as suggested by the derivation.

30 Action-Item Distribution

Num Action-Items  Num Messages
0                 416 (56%)
1                 259 (35%)
2                 55 (7%)
3                 11 (1%)
4                 2 (1%)
5                 0 (0%)
6                 1 (1%)

31 Inter-Annotator Agreement

Document-level (Kappa = 0.85):

              Annotator 2
Annotator 1   No    Yes
No            391   26
Yes           29    298

Sentence-level (a sentence is labeled positive when it has at least 30% of the characters of an action-item; Kappa = 0.82):

              Annotator 2
Annotator 1   No    Yes
No            …     …
Yes           74    352
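For reference, Cohen's kappa computed from a 2x2 agreement table like the ones above; with the document-level counts, cohens_kappa([[391, 26], [29, 298]]) reproduces the reported 0.85:

```python
def cohens_kappa(table):
    # table[i][j]: items labeled i by annotator 1 and j by annotator 2.
    n = float(sum(sum(row) for row in table))
    p_o = sum(table[i][i] for i in range(2)) / n              # observed
    p_e = sum(sum(table[i]) * sum(row[i] for row in table)    # expected
              for i in range(2)) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)
```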

32 Resolving Annotator Differences Problems: annotator oversight, and interpreting conditional statements, e.g. "If you would like to keep your job, come to tomorrow's meeting." vs. "If you would like to join the football betting pool, come to tomorrow's meeting." After reconciling judgements: 416 messages with no action-items and 328 action-item messages. Sentence-level agreement with the reconciled standard: Annotator 1 Kappa = …, Annotator 2 Kappa = 0.92.

33 What makes detection hard? The language is mimicked in many other types of mail: spam, solicited advertisements, volunteer recruiting, etc. Indirect speech acts. Conditional statements. Context: "I'd like to take Friday off from work." reads very differently in email from a secretary to her boss vs. between friends.