The Browser Evaluation Test A Proposal Pierre Wellner, Mike Flynn IDIAP, September 2003.


1 The Browser Evaluation Test A Proposal Pierre Wellner, Mike Flynn IDIAP, September 2003

2 Prior Browsers
Ricoh “MuVIE”, Lee et al.
–Video editing, key frames, transcript search, embedded web browser, slides, whiteboard, minutes, perspective & panoramic views, speaker location, visual & audio activity
–NOT TESTED ON HUMANS
Microsoft “Distributed Meetings”, Cutler et al.
–Panoramic video, person-tracking, audio source localisation & beam-forming, speaker clustering & change detection, whiteboard camera, PC capture
–SUBJECTIVELY TESTED

3 The Problem
Browsers receive no assessment, or are assessed by a scheme unique to each system
Assessment is often very subjective [from Cutler et al., “Distributed Meetings: A Meeting Capture and Broadcasting System”, ACM Multimedia, 2002]:
–“I was able to get the information I needed […]”
–“I would use this system again if I had to miss a meeting.”
–“I would recommend the use of this system to my peers.”
With no standard browsing task, objective comparison is not possible

4 Aims of the BET
Performance, not judgment
Independent of experimenter perception
Directly comparable numeric scores
Replicable

5 The Browsing Task
Find a maximum number of observations of interest in a minimum amount of time.
But what is an “observation of interest”?

6 BET Overview
[Diagram: meeting participants are captured by the recording system into a corpus; observers use a playback system to produce observations; tests are sampled from the observations; subjects answer the tests through a media browser; scoring turns their answers into scores.]

7 People
Participants
Observers
–Observer selection
–Many diverse interests
–Interesting for participants or absentees?
Subjects
–Subject selection

8 Data
Corpus
–Discussion, Presentation, Decision, Status…
–Normal meetings, if possible
–Reflect common distribution
Observations
–Pairs of statements, one true, one false
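An observation pair could be represented as a minimal record; the proposal only specifies that each observation is a pair of statements, one true and one false, so the field names and example statements below are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class ObservationPair:
    """One observation of interest: a true/false statement pair.

    Field names are illustrative, not taken from the proposal.
    """
    meeting_id: str
    true_statement: str
    false_statement: str


# Hypothetical example of the kind of pair an observer might write.
pair = ObservationPair(
    meeting_id="m01",
    true_statement="Susan agreed to write the minutes.",
    false_statement="Pierre agreed to write the minutes.",
)
```

A test is then simply a sample drawn from the pool of such pairs, with the true/false ordering hidden from the subject.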

9 Tests & Scores
Test: a sample of observations
Subjects must decide which statement is true
–using the browser
Score is correct minus incorrect answers
Control scores established with:
–Educated guesses, no media
–Same software as observers
–Well-known basic applications
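The scoring rule (correct answers minus incorrect answers) can be sketched directly; this is an illustration of the stated rule, not the proposal's reference implementation, and the treatment of unanswered observations as contributing nothing is an assumption:

```python
def bet_score(answers, truth):
    """Return correct minus incorrect answers.

    `answers` maps observation IDs to the subject's True/False choices;
    `truth` maps observation IDs to the correct values. Observations the
    subject did not answer contribute nothing (an assumption).
    """
    correct = sum(1 for k, a in answers.items() if truth.get(k) == a)
    incorrect = sum(1 for k, a in answers.items() if k in truth and truth[k] != a)
    return correct - incorrect


# Three answers, two correct and one incorrect: score = 2 - 1 = 1.
score = bet_score(
    {"o1": True, "o2": False, "o3": True},
    {"o1": True, "o2": False, "o3": False},
)
print(score)  # 1
```

Subtracting incorrect answers discourages guessing: a subject answering at random on a true/false pair expects a score of zero, so only genuine retrieval through the browser raises the score.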

10 Illustration
Corpus
–20 meetings @ ~40 minutes ≈ 13 hrs 20 mins of recordings
Observations
–60 observers; 3 observers watch each meeting @ 18 observation-pairs/hour
–6× real-time ≈ 240 hours observation time
–216 observation-pairs/meeting, or 4,320 observation-pairs total
Testing
–10 subjects each watch 8 meetings, in 2 hours 40 mins per subject
–4 subjects watch each meeting, 26 hours 40 mins total subject time
–1 answer per minute, 160 answers/subject ≈ 1,600 answers total
Significance
–Assume: binomial distribution of results, 90% answered correctly
–Confidence interval: 88.2% to 91.6%, with 95% confidence level
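The significance figures can be roughly checked with the standard normal approximation to a binomial proportion; the slide does not state which method was used, and this approximation gives an interval slightly tighter than the quoted 88.2%–91.6% (an exact binomial method may account for the difference):

```python
import math


def binomial_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence interval for a binomial proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 1,600 answers, 90% answered correctly.
lo, hi = binomial_ci(0.90, 1600)
print(f"{lo:.1%} to {hi:.1%}")  # 88.5% to 91.5%
```

The point of the illustration stands either way: with around 1,600 answers, the accuracy of a browser can be pinned down to within roughly ±1.5 percentage points.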

11 Summary
Performance, not judgment
–Subjects are measured on performance of tasks
Independent of experimenter perception
–Observers indirectly decide the tasks
Directly comparable numeric scores
–Standard methods, standard scores
Replicable
–Publicly accessible web-site
–All media available for download
–Tests and scoring on-line

12 Questions…?
Is this a good method?
Do you recognise the problem?
Would you use this method?
Do you have a browser to test?
Do you know of an existing MM corpus?
…

