Automating Slot Filling Validation to Assist Human Assessment
Suzanne Tamang and Heng Ji
Computer Science Department and Linguistics Department, Queens College and the Graduate Center, City University of New York
November 5, 2012
Overview
- KBP SF validation task
- Two-step validation: logistic regression based reranking; predicted confidence adjustment and filtering
- Validation features: shallow, contextual, emergent (voting)
- System combination: perfect setting; limiting conditions
- Evaluation results
- Opportunities
SF Validation Task
Standard answer format: id, slot, run, docid, filler, start and end offset for filler, start and end offset for justification, confidence
Example: Richmond Flowers, per:title, SFV_10_1, APW_ENG_20070810.1457.LDC2009T13, Attorney General, 336, 351, 321, 44, 1.0
Validation goal: use post-processing methods to label each answer 1 (valid) or -1 (invalid)
Step one: combine runs and rerank using a probabilistic classifier; identify a threshold for filtering the best candidates
Step two: automatically assess system quality; when available, use deeper contextual information; adjust confidence values to dampen noisy system contributions
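The standard answer format above can be parsed into its ten fields with a small helper. This is a minimal sketch; the class and field names are illustrative, not the official scorer's:

```python
from dataclasses import dataclass

# Hypothetical container for one answer line in the standard format
# (id, slot, run, docid, filler, filler offsets, justification offsets, confidence).
@dataclass
class SlotFill:
    query_id: str
    slot: str
    run_id: str
    doc_id: str
    filler: str
    filler_start: int
    filler_end: int
    just_start: int
    just_end: int
    confidence: float

def parse_answer(line: str) -> SlotFill:
    """Split a comma-separated answer line into its ten fields."""
    parts = [p.strip() for p in line.split(",")]
    return SlotFill(parts[0], parts[1], parts[2], parts[3], parts[4],
                    int(parts[5]), int(parts[6]), int(parts[7]), int(parts[8]),
                    float(parts[9]))

example = ("Richmond Flowers, per:title, SFV_10_1, "
           "APW_ENG_20070810.1457.LDC2009T13, Attorney General, "
           "336, 351, 321, 44, 1.0")
fill = parse_answer(example)
```

Note this naive split would break if a filler itself contained a comma; a real parser would need to handle that case.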
Features

| Feature | Description | Value | Type |
| document type | provided by document collection (newswire, broadcast news, web log) | category | shallow |
| *number of tokens | count of white spaces (+1) between contiguous character strings | integer | shallow |
| *acronym | identify and concatenate first letter of each token | binary | shallow |
| *url | structural rules to determine if a valid URL | binary | shallow |
| named entity type | label with gazetteer | category | shallow |
| city, *state, *country, *title, ethnicity, religion | appears in specific slot-related gazetteer | binary | shallow |
| *alphanumeric | indicate if numbers and letters appear | binary | shallow |
| date | structural rules to determine if an acceptable date format | binary | shallow |
| capitalized | first character of token(s) capitalized | binary | shallow |
| same | if query and fill strings match | binary | shallow |
| keywords | used primarily for spouse and residence slots | binary | context |
| dependency parse | length from query to answer | integer | context |
| **system votes | proportion of systems with answer agreement | 0-1 | emergent |
| **answer votes | proportion of answers with answer agreement | 0-1 | emergent |

* statistically significant predictor in select models
** statistically significant predictor in most all models
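A few of the shallow and emergent features above can be sketched as simple extractor functions. This is a minimal illustration under my own assumptions; the function names and exact heuristics are not the system's:

```python
from collections import Counter

def shallow_features(query: str, filler: str) -> dict:
    """Illustrative shallow features computed from the query and filler strings."""
    tokens = filler.split()
    return {
        # token count: whitespace-separated chunks (count of spaces + 1)
        "num_tokens": len(tokens),
        # alphanumeric: both digits and letters appear in the filler
        "alphanumeric": int(any(c.isdigit() for c in filler)
                            and any(c.isalpha() for c in filler)),
        # capitalized: first character of every token is upper-case
        "capitalized": int(all(t[0].isupper() for t in tokens if t)),
        # same: query and fill strings match (case-insensitive here)
        "same": int(query.strip().lower() == filler.strip().lower()),
    }

def system_votes(answers_for_query: list, filler: str) -> float:
    """Emergent voting feature: proportion of runs proposing this filler."""
    votes = Counter(a.lower() for a in answers_for_query)
    return votes[filler.lower()] / len(answers_for_query)
```

For example, `shallow_features("Richmond Flowers", "Attorney General")` yields a two-token, capitalized, non-alphanumeric filler that does not match the query string.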
Two-Phased Validation Approach
Step 1: Classification
- Training with 2011 KBP SF data, using features extracted from the 2011 KBP results
- Model selection using a stepwise procedure and AIC
- Threshold tuning on predicted confidence estimates
Step 2: Adjustment and filtering
- Automatic assessment of system quality
- Adjustment of predicted confidence using quality/DP
- Contextual analysis with answer provenance offsets
Features (answer, system, and group level): shallow, contextual, emergent
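Step 1 above can be sketched as a probabilistic classifier that reranks candidate fills by predicted confidence and keeps those above a tuned threshold. The hand-rolled logistic regression below is a minimal stand-in for the stepwise/AIC-selected model described on the slide:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent (illustrative, not stepwise/AIC)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(valid)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def rerank_and_filter(X, w, b, threshold=0.5):
    """Score candidates, sort by predicted confidence, keep those above the threshold."""
    conf = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    order = np.argsort(-conf)
    return [(int(i), float(conf[i])) for i in order if conf[i] >= threshold]
```

In the actual system the threshold would be tuned on held-out 2011 predicted-confidence estimates rather than fixed at 0.5.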
Attribute Distribution in Automatic Slot Filling
PER Attribute Distribution
ORG Attribute Distribution
SF Performance: Training and Testing
Performance, Mean Confidence, and Set Size
27 distinct runs; variable F1, set size, confidence, and offset use.
Results: Slot Filling Validation
Pre/Post Validation Results

| | R | P | F1 |
| LDC | 0.72 | 0.77 | 0.75 |
| w/o validation | 0.71 | 0.03 | 0.06 |
| validation P1 | 0.12 | 0.07 | 0.09 |
| validation P2 | 0.35 | 0.08 | 0.13 |
Reranking Multiple Systems
Ideal case:
- Diversity of systems
- Comparable performance
- Rich information: reliable answer context; system approach / intermediate system results
KBP SF task:
- Twenty-seven runs, limited intermediate results, unknown strategies, and variable performance
- Inconsistencies paired with a 'rigid' framework
- Provenance: unavailable or unreliable (off a little and a lot)
- Confidence may or may not be available
What have we learned that translates to more efficient assessment?
Confidence, provenance, approximating system quality, and flexibility
Challenges and Solutions
Labor intensive:
- Training, quality control; tedious and unfulfilling work
- 22% of total answers were redundant
- 1% gain on recall over systems
Validation:
- Inconsistencies in reporting (provenance / confidence)
- Lack of intermediate output
Confidence:
- Uniform weighting
- Automatic assessment of quality: inconsistency, confidence distributions

| | R | P | F | TP |
| LDC | 0.72 | 0.77 | 0.75 | 1119 |
| Systems | 0.71 | 0.03 | 0.06 | 1081 |
| Answer Key | ? | 1 | ? | 1543 |
Naïve Estimation of System Quality
Confidence of High and Low Performers
Shallow/emergent features reduce noise at the expense of better systems
Confidence-Based Reranking
Confidence is an important factor for a validator:
- Informative at the >0.90 threshold
- Paired with quality estimates, it culls more valid answers
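The idea of pairing reported confidence with quality estimates (Step 2's adjustment to dampen noisy systems) could be sketched as follows. The per-run quality weights, the default weight, and the max-combination rule are my illustrative assumptions, not the paper's automatic assessment method:

```python
def adjust_confidence(answers, quality):
    """Dampen each run's reported confidence by an estimated run quality.

    answers: list of (run_id, filler, confidence) tuples
    quality: dict mapping run_id -> estimated quality in [0, 1]
    Returns fillers sorted by best adjusted score, highest first.
    """
    adjusted = {}
    for run_id, filler, conf in answers:
        # Down-weight answers from runs estimated to be noisy;
        # unknown runs get a neutral default weight (an assumption here).
        score = conf * quality.get(run_id, 0.5)
        # Keep the best adjusted score seen for each distinct filler.
        adjusted[filler] = max(adjusted.get(filler, 0.0), score)
    return sorted(adjusted.items(), key=lambda kv: -kv[1])
```

Under this scheme a filler reported with confidence 1.0 by a low-quality run can rank below a filler reported with confidence 0.8 by a high-quality run, which is the dampening effect the slides describe.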
Summary
Evaluation of a two-phase SF validation approach for KBP 2012
- Improves overall F1: before (0.06) / after (0.13)
- Helps low performers at the expense of better systems
Key observations
- Shallow features contribute to establishing a baseline
- Voting features did not generalize and were susceptible to system noise
- Contextual features are helpful (P1 to P2 gains)
Opportunities
- Incorporating confidence as a classifier feature or for filtering
- More flexible frameworks for using provenance information
- Improved methods for naively estimating low and high performers in the multi-system setting
Thank you