Automating Slot Filling Validation to Assist Human Assessment
Suzanne Tamang and Heng Ji
Computer Science Department and Linguistics Department, Queens College and the Graduate Center, City University of New York
November 5, 2012
Overview
- KBP SF validation task
- Two-step validation: logistic regression based reranking; predicted confidence adjustment and filtering
- Validation features: shallow, contextual, emergent (voting)
- System combination: perfect setting; limiting conditions
- Evaluation results
- Opportunities
SF Validation Task
Standard answer format: id, slot, run, docid, filler, start and end offset for filler, start and end offset for justification, confidence
Example: Richmond Flowers, per:title, SFV_10_1, APW_ENG_20070810.1457.LDC2009T13, Attorney General, 336, 351, 321, 44, 1.0
Validation goal: use post-processing methods to label each answer 1 (valid) or -1 (invalid)
Step one: combine runs and rerank using a probabilistic classifier; identify a threshold for filtering the best candidates
Step two: automatically assess system quality; when available, use deeper contextual information; adjust confidence values to dampen noisy system contributions
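The standard answer format above can be parsed into its ten fields with a small helper. This is a minimal sketch; the class and field names are illustrative, not the official scorer's:

```python
from dataclasses import dataclass

# Hypothetical container for one answer line in the standard format
# (id, slot, run, docid, filler, filler offsets, justification offsets, confidence).
@dataclass
class SlotFill:
    query_id: str
    slot: str
    run_id: str
    doc_id: str
    filler: str
    filler_start: int
    filler_end: int
    just_start: int
    just_end: int
    confidence: float

def parse_answer(line: str) -> SlotFill:
    """Split a comma-separated answer line into its ten fields."""
    parts = [p.strip() for p in line.split(",")]
    return SlotFill(parts[0], parts[1], parts[2], parts[3], parts[4],
                    int(parts[5]), int(parts[6]), int(parts[7]), int(parts[8]),
                    float(parts[9]))

example = ("Richmond Flowers, per:title, SFV_10_1, "
           "APW_ENG_20070810.1457.LDC2009T13, Attorney General, "
           "336, 351, 321, 44, 1.0")
fill = parse_answer(example)
```

Note this naive split would break if a filler itself contained a comma; a real parser would need to handle that case.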
Features

| Feature | Description | Value | Type |
| document type | provided by document collection (newswire, broadcast news, web log) | category | shallow |
| *number of tokens | count of white spaces (+1) between contiguous character strings | integer | shallow |
| *acronym | identify and concatenate first letter of each token | binary | shallow |
| *url | structural rules to determine if a valid URL | binary | shallow |
| named entity type | label with gazetteer | category | shallow |
| city, *state, *country, *title, ethnicity, religion | appears in specific slot-related gazetteer | binary | shallow |
| *alphanumeric | indicate if numbers and letters appear | binary | shallow |
| date | structural rules to determine if an acceptable date format | binary | shallow |
| capitalized | first character of token(s) capitalized | binary | shallow |
| same | if query and fill strings match | binary | shallow |
| keywords | used primarily for spouse and residence slots | binary | context |
| dependency parse | length from query to answer | integer | context |
| **system votes | proportion of systems with answer agreement | 0-1 | emergent |
| **answer votes | proportion of answers with answer agreement | 0-1 | emergent |

* statistically significant predictor in select models
** statistically significant predictor in most all models
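A few of the shallow and emergent features above can be sketched as simple extractor functions. This is a minimal illustration under my own assumptions; the function names and exact heuristics are not the system's:

```python
from collections import Counter

def shallow_features(query: str, filler: str) -> dict:
    """Illustrative shallow features computed from the query and filler strings."""
    tokens = filler.split()
    return {
        # token count: whitespace-separated chunks (count of spaces + 1)
        "num_tokens": len(tokens),
        # alphanumeric: both digits and letters appear in the filler
        "alphanumeric": int(any(c.isdigit() for c in filler)
                            and any(c.isalpha() for c in filler)),
        # capitalized: first character of every token is upper-case
        "capitalized": int(all(t[0].isupper() for t in tokens if t)),
        # same: query and fill strings match (case-insensitive here)
        "same": int(query.strip().lower() == filler.strip().lower()),
    }

def system_votes(answers_for_query: list, filler: str) -> float:
    """Emergent voting feature: proportion of runs proposing this filler."""
    votes = Counter(a.lower() for a in answers_for_query)
    return votes[filler.lower()] / len(answers_for_query)
```

For example, `shallow_features("Richmond Flowers", "Attorney General")` yields a two-token, capitalized, non-alphanumeric filler that does not match the query string.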
Two-Phased Validation Approach
Step 1: Classification
- Training with 2011 KBP SF data, using features extracted from the 2011 KBP results
- Model selection using a stepwise procedure and AIC
- Threshold tuning on predicted confidence estimates
Step 2: Adjustment and filtering
- Automatic assessment of system quality
- Adjustment of predicted confidence using quality/DP
- Contextual analysis with answer provenance offsets
Features (answer, system, and group level): shallow, contextual, emergent
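Step 1 above can be sketched as a probabilistic classifier that reranks candidate fills by predicted confidence and keeps those above a tuned threshold. The hand-rolled logistic regression below is a minimal stand-in for the stepwise/AIC-selected model described on the slide:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent (illustrative, not stepwise/AIC)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(valid)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

def rerank_and_filter(X, w, b, threshold=0.5):
    """Score candidates, sort by predicted confidence, keep those above the threshold."""
    conf = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    order = np.argsort(-conf)
    return [(int(i), float(conf[i])) for i in order if conf[i] >= threshold]
```

In the actual system the threshold would be tuned on held-out 2011 predicted-confidence estimates rather than fixed at 0.5.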
Attribute Distribution in Automatic Slot Filling
PER Attribute Distribution
ORG Attribute Distribution
SF Performance: Training and Testing
Performance, Mean Confidence, and Set Size
27 distinct runs; variable F1, set size, confidence, and offset use.
Results: Slot Filling Validation
Pre/Post Validation Results

| | R | P | F1 |
| LDC | 0.72 | 0.77 | 0.75 |
| w/o validation | 0.71 | 0.03 | 0.06 |
| validation P1 | 0.12 | 0.07 | 0.09 |
| validation P2 | 0.35 | 0.08 | 0.13 |
Reranking Multiple Systems
Ideal case:
- Diversity of systems
- Comparable performance
- Rich information: reliable answer context; system approach / intermediate system results
KBP SF task:
- Twenty-seven runs, limited intermediate results, unknown strategies, and variable performance
- Inconsistencies paired with a 'rigid' framework
- Provenance: unavailable or unreliable (off a little and a lot)
- Confidence may or may not be available
What have we learned that translates to more efficient assessment?
Confidence, provenance, approximating system quality, and flexibility
Challenges and Solutions
Labor intensive:
- Training, quality control; tedious and unfulfilling work
- 22% of total answers were redundant
- 1% gain on recall over systems
Validation:
- Inconsistencies in reporting (provenance / confidence)
- Lack of intermediate output
Confidence:
- Uniform weighting
- Automatic assessment of quality: inconsistency, confidence distributions

| | R | P | F | TP |
| LDC | 0.72 | 0.77 | 0.75 | 1119 |
| Systems | 0.71 | 0.03 | 0.06 | 1081 |
| Answer Key | ? | 1 | ? | 1543 |
Naïve Estimation of System Quality
Confidence of High and Low Performers
Shallow/emergent features reduce noise at the expense of better systems
Confidence-Based Reranking
Confidence is an important factor for a validator:
- Informative at the >0.90 threshold
- Paired with quality estimates, it culls more valid answers
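The idea of pairing reported confidence with quality estimates (Step 2's adjustment to dampen noisy systems) could be sketched as follows. The per-run quality weights, the default weight, and the max-combination rule are my illustrative assumptions, not the paper's automatic assessment method:

```python
def adjust_confidence(answers, quality):
    """Dampen each run's reported confidence by an estimated run quality.

    answers: list of (run_id, filler, confidence) tuples
    quality: dict mapping run_id -> estimated quality in [0, 1]
    Returns fillers sorted by best adjusted score, highest first.
    """
    adjusted = {}
    for run_id, filler, conf in answers:
        # Down-weight answers from runs estimated to be noisy;
        # unknown runs get a neutral default weight (an assumption here).
        score = conf * quality.get(run_id, 0.5)
        # Keep the best adjusted score seen for each distinct filler.
        adjusted[filler] = max(adjusted.get(filler, 0.0), score)
    return sorted(adjusted.items(), key=lambda kv: -kv[1])
```

Under this scheme a filler reported with confidence 1.0 by a low-quality run can rank below a filler reported with confidence 0.8 by a high-quality run, which is the dampening effect the slides describe.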
Summary
Evaluation of a two-phase SF validation approach for KBP 2012
- Improves overall F1: before (0.06) / after (0.13)
- Helps low performers at the expense of better systems
Key observations
- Shallow features contribute to establishing a baseline
- Voting features did not generalize and were susceptible to system noise
- Contextual features are helpful (P1 to P2 gains)
Opportunities
- Incorporating confidence as a classifier feature or for filtering
- More flexible frameworks for using provenance information
- Improved methods for naively estimating low and high performers in the multi-system setting
Thank you