Evaluation Issues: June 2002
Donna Harman, Ellen Voorhees (NIST)

Interlocking Evaluation Plan
Major metrics evaluation to be in the TREC QA track
Additional AQUAINT-specific evaluations to be run for narrowly focused areas
Pilot tasks to test new evaluation methodologies to be run every 6 months; resulting tasks will then migrate to TREC or to an AQUAINT-specific evaluation
Testbed will be focused on integration and usability issues

Why use TREC QA?
To open the evaluation to a much broader community
–allows many different/unusual approaches
–ensures that AQUAINT technology is “competitive” with the outside world
–encourages more rapid technology transfer
To maintain continuity across the various question types, building to an ever larger set of question-answering capabilities

When would an evaluation be AQUAINT-specific?
Evaluation plan not likely to scale to TREC-size participation
–example: user dialog evaluation
Data not available outside of AQUAINT
–example: CNS data
Narrow focus not likely to attract many research groups
–example: multimedia/multilingual QA

Criteria for Pilot Tasks
Known type of question with evaluation problems
–example: definitional/biographical questions (Who is Colin Powell?)
Known area of interest from AQUAINT users
–example: questions with no answer or only a partial answer
Known area of research concentration
–example: multimedia QA

June 02 Pilot Evaluation Tasks
Dialog for QA
–Tomek Strzalkowski, Sanda Harabagiu
Relationships or cause-and-effect QA
–John Prager, Eric Nyberg
Answer explanation/justification
–Stefano Bertolo, Richard Fikes
QA access to multimedia data
–Howard Wactlar, Yiming Yang, Herb Gish

June 02 Pilot Evaluation Tasks (continued)
Opinion questions
–Eduard Hovy, Kathleen McKeown
Definitional (who is, what is) questions
–Ralph Weischedel, Dan Moldovan
Questions for a fixed domain
–Jerry Hobbs, Daniel Marcu
Questions with no or only partial answer
–Maureen Caudill, Bill Ogden

Evaluation Breakout Goals
Develop a workable evaluation plan for a pilot evaluation of the target task
Pilot evaluations run July–November 2002
–results reported at December meeting
–the result of interest is the effectiveness of the evaluation, not of the systems

Stakeholders in Evaluation Plans
Contractors/researchers
–plan needs to address an appropriate facet of the problem so that groups will participate
Eventual end users (analysts)
–plan needs to reflect some facet of user needs so that the evaluation is seen as useful
Implementers
–plan needs to be specific and actually doable so that NIST (or others) can carry it out

Workable Evaluation Plan
Concrete definition of problem to be addressed
Detailed specification of the data structure that systems are to return as a response
Operational method for scoring the quality of a response, including any human judgments required
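
As a concrete illustration of the last two bullets, here is a minimal sketch, assuming a simple factoid-style task, of what a returned data structure and an operational scoring method might look like. The class and field names are hypothetical, not part of any agreed specification; the scoring shown is reciprocal rank of the first answer judged correct, in the style of earlier TREC QA tracks.

```python
# Minimal sketch, assuming a factoid-style pilot task. All names and
# fields are illustrative, not an agreed AQUAINT/TREC specification.
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class SystemResponse:
    question_id: str
    ranked_answers: List[str]                                 # answer strings, best first
    supporting_docs: List[str] = field(default_factory=list)  # doc IDs backing the answers

def score_response(response: SystemResponse,
                   judgments: Dict[str, Set[str]]) -> float:
    """Reciprocal rank of the first answer a human assessor judged correct.

    `judgments` maps question_id -> set of answer strings judged correct
    (the human judgments the plan calls for). Returns 0.0 if no correct
    answer appears in the ranked list.
    """
    correct = judgments.get(response.question_id, set())
    for rank, answer in enumerate(response.ranked_answers, start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0
```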

Examples of thorny issues
Who is / what is questions
–Task definition: how to supply context to help systems select “better” answers?
–Form of answer: a ranked/“binned” list of facts? a filled template? a narrative?
–Judgment: recall of “important” facts? missing a critical fact? precision/redundancy?
–Operational details of pilot (who is doing what?)
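
One way the judgment bullet could be operationalized is sketched below, under the assumption that assessors enumerate the facts ("nuggets") a good answer should contain and mark some as vital: recall is measured over vital nuggets, while a length allowance per matched nugget stands in for precision/redundancy. The constants (100-character allowance, recall-heavy beta of 5) are illustrative choices, not a settled protocol.

```python
# Hypothetical nugget-style scoring sketch for "who is / what is" questions.
# The constants and the length-based precision proxy are illustrative.
def nugget_f_score(vital_returned: int, vital_total: int,
                   nuggets_matched: int, response_length: int,
                   allowance_per_nugget: int = 100, beta: float = 5.0) -> float:
    """F-measure trading nugget recall against a length-based precision proxy.

    vital_returned / vital_total: vital nuggets covered / defined by assessors.
    nuggets_matched: all nuggets (vital or not) the response covered.
    response_length: length of the response in characters.
    """
    recall = vital_returned / vital_total if vital_total else 0.0
    allowance = allowance_per_nugget * nuggets_matched
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```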

More thorny issues
Answer justification
–Task definition: what does this mean?
–Form of answer: a logical reasoning chain? a list of document extracts? metadata?
–Judgment: ???
–Operational details of pilot (who is doing what?)
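
For the "form of answer" bullet, one hypothetical shape a justified response could take is an answer plus an ordered chain of support steps, each tied either to a document extract or to a stated inference. The structure below is only a sketch of that idea, with invented field names.

```python
# Hypothetical sketch of a justification response structure; the field
# names are invented for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SupportStep:
    kind: str          # "extract" (quoted source text) or "inference"
    text: str          # the extract itself or the stated inference
    doc_id: str = ""   # source document ID; empty for pure inference steps

@dataclass
class JustifiedAnswer:
    question_id: str
    answer: str
    chain: List[SupportStep] = field(default_factory=list)  # premises -> conclusion
```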

Breakout areas
Dialog for QA
Definitional (who is, what is) questions
Opinion questions
Relationships or cause-and-effect
Questions for a fixed domain QA
Questions with no or only partial answer
Answer explanation/justification
QA access to multimedia data