Published by Clement Barrett. Modified over 9 years ago.
1
The Evolution of Shared-Task Evaluation
Douglas W. Oard
College of Information Studies and UMIACS
University of Maryland, College Park, USA
December 4, 2013
FIRE
2
The Story
Evaluation-guided research
The three C’s
Five examples
Thinking forward
3
Evaluation-Guided Research
Information Retrieval
Text classification
Automatic Speech Recognition
Optical Character Recognition
Named Entity Recognition
Machine Translation
Extractive summarization
…
4
Key Elements
Task model
Single-valued evaluation measure
Affordable evaluation process
5
Critiques
Early convergence
Duplicative ($)
Incrementalism
Privileging the measurable
6
The Big Four
TREC
NTCIR
CLEF
FIRE
7
10 More
TDT
Amaryllis
INEX
TRECVid
TAC
MediaEval
STD
OAEI
CoNLL
WePS
8
What We Create
Collections
Comparison points
  – Baseline results
Communities
Competition?
9
Elsewhere in the Ecosystem …
Capacity
  – From universities, industry, individuals, and funding agencies
Completed work
  – Often requires working outside our year-long innovation cycles with rigid timelines
Culling
  – Conferences and journals are the guardians of community standards
10
A Typical Task Life Cycle
Year 1:
  – Task definition
  – Evaluation design
  – Community building
Year 2:
  – Creating training data
Year 3:
  – Reusable test collection
  – Establishing strong baselines
11
Some Sea Stories
TDT
CLIR
Speech Retrieval
E-Discovery
12
Topic Detection and Tracking
Cultures
  – Speech, sponsor
Event-based relevance
Document boundary discovery
Complexity
  – 5 tasks, 3 languages, 2 modalities
Lasting influence
13
Cross-Language IR
TREC CLIR (Arabic)
  – Standard resources
  – Light stemming
  – Problematic task model
CLEF Interactive CLIR
  – Controlled user studies
  – Problematic evaluation design
  – Qualitative vs. quantitative
14
Speech Retrieval
TREC Spoken Document Retrieval
  – The “solved problem”
CLEF Cross-Language Speech Retrieval
  – Grounded queries
  – Start time error evaluation measure
FIRE QA for the Spoken Web
15
TREC Legal Track
Iterative task design
Sampling
Measurement error
Families
Cultures
16
What’s in a Test Collection?
Queries
Documents
Relevance judgments
17
What’s in a Test Collection?
Queries
Content
Units of judgment
Relevance judgments
Evaluation measure(s)
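The components named on this slide can be pictured as a small data structure. The following is only an illustrative sketch, not something from the talk: the class name `SharedTaskCollection`, its field names, and the choice of mean average precision as the single-valued measure are all assumptions made for the example. Each ranking is assumed to list a unit of judgment at most once.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SharedTaskCollection:
    """Hypothetical container for the slide's components (names are assumptions)."""
    queries: Dict[str, str]   # query id -> query text
    content: Dict[str, str]   # unit-of-judgment id -> text of that unit
    # query id -> ids of the units judged relevant for that query
    judgments: Dict[str, Set[str]] = field(default_factory=dict)


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list (assumed duplicate-free)."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, unit_id in enumerate(ranking, start=1):
        if unit_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant unit's rank
    return total / len(relevant)


def mean_average_precision(collection: SharedTaskCollection,
                           runs: Dict[str, List[str]]) -> float:
    """Mean of average precision over every judged query in the collection."""
    scores = [
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in collection.judgments.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Keeping the evaluation measure as a function outside the collection reflects one possible design choice: the same queries, content, and judgments can then be scored with more than one measure.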
18
Personality Types
Innovators
Organizers
Optimizers
Deployers
Resourcers
19
Some Takeaways
Progressive invalidation
Social engineering
Innovation from outside
20
A Final Thought
It isn’t what you don’t know that limits your thinking. Rather, it is what you know that isn’t true.