National Institute of Standards and Technology Information Technology Laboratory 2000 TREC-9 Spoken Document Retrieval Track

2 National Institute of Standards and Technology, Information Technology Laboratory
2000 TREC-9 Spoken Document Retrieval Track
http://www.nist.gov/speech/sdr2000
John Garofolo, Jerome Lard, Ellen Voorhees

3 SDR 2000 - Overview
- SDR 2000 Track Overview, changes for TREC-9
- SDR Collection/Topics
- Technical Approaches
- Speech Recognition Metrics/Performance
- Retrieval Metrics/Performance
- Conclusions
- Future

4 Spoken Document Retrieval (SDR)
Task:
- Given a text topic, retrieve a ranked list of relevant excerpts from a collection of recorded speech
Requires 2 core technologies:
- Speech Recognition
- Information Retrieval
[Diagram: Broadcast News Audio Recording Corpus -> Speech Recognition Engine -> Recognized Transcripts / Temporal Index -> IR Search Engine (+ Topic/Query) -> Ranked Time Pointer List]
First step towards multimedia information access.
Focus is on the effect of recognition accuracy on retrieval performance.
Domain: Radio and TV Broadcast News
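As a rough illustration of the pipeline on this slide, the sketch below strings the two stages together. The function names (recognize, search) and the data layout are assumptions standing in for a site's own ASR and IR engines, not NIST-provided code.

```python
# A minimal sketch of the two-stage SDR pipeline: run ASR over each broadcast to
# build a time-stamped (temporal) index of recognized words, then let an IR
# engine rank passages against the text topic and return time pointers.
# `recognize` and `search` are hypothetical callables supplied by the caller.

def sdr_retrieve(topic, audio_corpus, recognize, search, top_k=1000):
    """topic: text query; audio_corpus: iterable of (show_id, audio) pairs.
    Returns a ranked list of (show_id, start_time, score) time pointers."""
    temporal_index = []
    for show_id, audio in audio_corpus:
        # ASR yields (word, start_time) pairs for the whole recording.
        for word, start_time in recognize(audio):
            temporal_index.append((show_id, start_time, word))
    # The IR engine scores passages of the temporal index against the topic.
    return search(topic, temporal_index)[:top_k]
```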

5 SDR Evaluation Approach
In the TREC tradition:
- Create realish but doable application task
- Increase realism (and difficulty) each year
NIST creates:
- infrastructure: test collection, queries, task definition, relevance judgements
- task includes several different control conditions: recognizer, boundaries, etc.
Sites submit:
- speech recognizer transcripts for benchmarking and sharing
- rank-ordered retrieval lists for scoring

6 Past SDR Test Collections

7 Past SDR Evaluation Conditions

8 SDR 2000 - Changes from 1999
2000:
- evaluated on whole shows including non-news segments
- 50 ad-hoc topics in two forms: short description and keyword
- 1 baseline recognizer transcript set (NIST/BBN B2 from 1999)
- story boundaries unknown (SU) condition is required
- recognition and use of non-lexical information
1999:
- evaluated on hand-segmented news excerpts only
- 49 ad-hoc-style topics/metrics
- 2 baseline recognizer transcript sets (NIST/BBN)
- story boundaries known (SK) focus and exploratory unknown (SU) conditions

9 SDR 2000 - Test Collection
Based on the LDC TDT-2 Corpus:
- 4 sources (TV: ABC, CNN; Radio: PRI, VOA)
- February through June 1998 subset, 902 broadcasts
- 557.5 hours, 21,754 stories, 6,755 filler and commercial segments (~55 hours)
Reference transcripts:
- Human-annotated story boundaries
- Full broadcast word transcription
  - News segments hand-transcribed (same as in '99)
  - Commercials and non-news filler transcribed via NIST ROVER applied to 3 automatic recognizer transcript sets
- Word times provided by LIMSI forced alignment
- Automatic recognition of non-lexical information (commercials, repeats, gender, bandwidth, non-speech, signal energy, and combinations) provided by CU

10 Test Variables
Collection:
- Reference (R1) - transcripts created by LDC human annotators
- Baseline (B1) - transcripts created by NIST/BBN time-adaptive automatic recognizer
- Speech (S1/S2) - transcripts created by sites' own automatic recognizers
- Cross-Recognizer (CR) - all contributed recognizers
Boundaries:
- Known (K) - story boundaries provided by LDC annotators
- Unknown (U) - story boundaries unknown

11 Test Variables (cont'd)
Queries:
- Short (S) - 1- or 2-phrase description of information need
- Terse (T) - keyword list
Non-Lexical Information:
- Default - could make use of automatically-recognized features
- None (N) - no non-lexical information (control)
Recognition language models:
- Fixed (FLM) - fixed language model/vocabulary predating test epoch
- Rolling (RLM) - time-adaptive language model/vocabulary using daily newswire texts

12 Test Conditions
Primary Conditions (may use non-lexical side info, but must run contrast below):
- R1SU: Reference Retrieval, short topics, using human-generated "perfect" transcripts without known story boundaries
- R1TU: Reference Retrieval, terse topics, using human-generated "perfect" transcripts without known story boundaries
- B1SU: Baseline Retrieval, short topics, using provided recognizer transcripts without known story boundaries
- B1TU: Baseline Retrieval, terse topics, using provided recognizer transcripts without known story boundaries
- S1SU: Speech Retrieval, short topics, using own recognizer without known story boundaries
- S1TU: Speech Retrieval, terse topics, using own recognizer without known story boundaries
Optional Cross-Recognizer Condition (may use non-lexical side info, but must run contrast below):
- CRSU-: Cross-Recognizer Retrieval, short topics, using other participants' recognizer transcripts without known story boundaries
- CRTU-: Cross-Recognizer Retrieval, terse topics, using other participants' recognizer transcripts without known story boundaries
Conditional No Non-Lexical Information Condition (required contrast if non-lexical information is used in other conditions):
- R1SUN: Reference Retrieval, short topics, using human-generated "perfect" transcripts without known story boundaries, no non-lexical info
- R1TUN: Reference Retrieval, terse topics, using human-generated "perfect" transcripts without known story boundaries, no non-lexical info
- B1SUN: Baseline Retrieval, short topics, using provided recognizer transcripts without known story boundaries, no non-lexical info
- B1TUN: Baseline Retrieval, terse topics, using provided recognizer transcripts without known story boundaries, no non-lexical info
- S1SUN: Speech Retrieval, short topics, using own recognizer without known story boundaries, no non-lexical info
- S1TUN: Speech Retrieval, terse topics, using own recognizer without known story boundaries, no non-lexical info
- S2SUN: Speech Retrieval, short topics, using own second recognizer without known story boundaries, no non-lexical info
- S2TUN: Speech Retrieval, terse topics, using own second recognizer without known story boundaries, no non-lexical info
Optional Known Story Boundaries Conditions:
- R1SK: Reference Retrieval, short topics, using human-generated "perfect" transcripts with known story boundaries
- R1TK: Reference Retrieval, terse topics, using human-generated "perfect" transcripts with known story boundaries
- B1SK: Baseline Retrieval, short topics, using provided recognizer transcripts with known story boundaries
- B1TK: Baseline Retrieval, terse topics, using provided recognizer transcripts with known story boundaries
- S1SK: Speech Retrieval, short topics, using own recognizer with known story boundaries
- S1TK: Speech Retrieval, terse topics, using own recognizer with known story boundaries
Recognition Language Models:
- FLM: Fixed language model/vocabulary predating test epoch
- RLM: Rolling language model/vocabulary using daily newswire adaptation

13 Test Topics
50 topics developed by NIST Assessors using a similar approach to the TREC Ad-Hoc Task
- Short and terse forms of topics were generated
Hard: Topic 125 (10 relevant stories)
- Short: Provide information pertaining to security violations within the U.S. intelligence community. (.024 average MAP)
- Terse: U.S. intelligence violations (.019 average MAP)
Medium: Topic 143 (8 relevant stories)
- Short: How many Americans file for bankruptcy each year? (.505 average MAP)
- Terse: Americans bankruptcy debts (.472 average MAP)
Easy: Topic 127 (11 relevant stories)
- Short: Name some countries which permit their citizens to commit suicide with medical assistance. (.887 average MAP)
- Terse: assisted suicide (.938 average MAP)

14 Test Topic Relevance

15 Topic Difficulty

16 Participants
Full SDR (recognition and retrieval):
- Cambridge University, UK
- LIMSI, France
- Sheffield University, UK

17 Approaches for 2000
Automatic Speech Recognition:
- HMM, word-based - most
- NN/HMM hybrid-based - Sheffield
Retrieval:
- OKAPI Probabilistic Model - all
- Blind Relevance Feedback and parallel-corpus BRF for query expansion - all
Story boundary unknown retrieval:
- passage windowing, retrieval and merging - all
Use of automatically-recognized non-lexical features:
- repeat and commercial detection - CU
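Since the sites' retrieval engines were built on the Okapi probabilistic model, a minimal sketch of Okapi BM25 term weighting is given below. The k1 and b values and the toy transcript collection are illustrative assumptions, not any site's tuned configuration, and query expansion (blind relevance feedback) is omitted.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # illustrative BM25 parameters

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len):
    """Okapi BM25 score of one document's token list against a bag of query terms."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0 or term not in doc_freq:
            continue
        # Robertson/Sparck Jones style idf, smoothed to stay positive.
        idf = math.log((num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
        # Term-frequency saturation with document-length normalization.
        norm = tf[term] * (K1 + 1) / (
            tf[term] + K1 * (1 - B + B * len(doc_terms) / avg_doc_len))
        score += idf * norm
    return score

# Toy "collection" of recognizer transcripts, one token list per story.
docs = {
    "story_1": "assisted suicide law debated in oregon".split(),
    "story_2": "stock market fell sharply on monday".split(),
}
df = Counter(t for d in docs.values() for t in set(d))
avg_len = sum(len(d) for d in docs.values()) / len(docs)
query = "assisted suicide".split()
ranking = sorted(docs, key=lambda d: bm25_score(query, docs[d], df, len(docs), avg_len),
                 reverse=True)
print(ranking)  # ['story_1', 'story_2']
```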

18 ASR Metrics
Traditional ASR metric:
- Word Error Rate (WER) and Mean Story Word Error Rate (SWER) using SCLITE and LDC reference transcripts
- WER = (word insertions + word deletions + word substitutions) / (total words in reference)
- LDC created 2 Hub-4-compliant 10-hour subsets for ASR scoring and analyses (LDC-SDR-99 and LDC-SDR-2000)
Note that there is a 10.3% WER in the collection's human (closed-caption) transcripts.
Note: SDR recognition is not directly comparable to Hub-4 benchmarks due to transcript quality, test-set selection method, and the word-mapping method used in scoring.
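The sketch below computes WER with a standard word-level Levenshtein alignment, matching the formula above; it assumes simple whitespace tokenization and ignores the SCLITE normalization and word-mapping rules used in the official scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with a minimum-edit-distance alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edit cost aligning ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # substitution / match
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word against a 6-word reference -> WER = 1/6, about 0.17.
print(wer("the white house said on monday", "the white house said monday"))
```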

19 ASR Performance
Ovals indicate no significant difference.

20 IR Metrics
Traditional TREC ad-hoc metric:
- Mean Average Precision (MAP) using TREC_EVAL
- Created assessment pools for each topic using top 100 of all retrieval runs
  - Mean pool size: 596 (2.1% of all segments)
  - Min pool size: 209
  - Max pool size: 1309
- NIST assessors created reference relevance assessments from topic pools
- Somewhat artificial for boundary-unknown conditions
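A minimal sketch of how MAP is computed from ranked story lists and relevance judgements, mirroring what TREC_EVAL reports; the topic IDs and rankings in the example are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant story
    appears, divided by the number of relevant stories for the topic."""
    hits, precision_sum = 0, 0.0
    for rank, story_id in enumerate(ranked_ids, start=1):
        if story_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs, qrels):
    """runs: topic -> ranked story IDs; qrels: topic -> set of relevant story IDs."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)

# Hypothetical two-topic run.
runs  = {"125": ["s3", "s9", "s1"], "127": ["s7", "s2"]}
qrels = {"125": {"s9"}, "127": {"s7", "s2"}}
print(mean_average_precision(runs, qrels))  # (0.5 + 1.0) / 2 = 0.75
```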

21 Story Boundaries Known Condition
Retrieval using pre-segmented news stories:
- systems given index of story boundaries for recognition, with IDs for retrieval
- excluded non-news segments
- stories are treated as documents
- systems produce rank-ordered list of Story IDs
- document-based scoring: score as in other TREC Ad Hoc tests using TREC_EVAL

22 Story Boundaries Known Retrieval Condition

23 Unknown Story Boundary Condition
Retrieval using continuous speech stream:
- systems process entire broadcasts for ASR and retrieval with no provided segmentation
- systems output a single time marker for each relevant excerpt to indicate topical passages
  - this task does NOT attempt to determine topic boundaries
- time-based scoring:
  - map each time marker to a story ID ("dummy" ID for retrieved non-stories and duplicates)
  - score as usual using TREC_EVAL
  - penalizes duplicate retrieved stories
  - story-based scoring somewhat artificial but expedient
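The time-based scoring step can be sketched roughly as below: each retrieved time marker is mapped to the reference story whose boundaries contain it, non-stories map to a dummy ID, and repeat retrievals of the same story also become dummies so duplicates are penalized when the mapped list is scored with TREC_EVAL. The show IDs, spans, and times are made up for illustration.

```python
DUMMY = "DUMMY"

def map_times_to_stories(ranked_times, story_spans):
    """ranked_times: [(show_id, time_sec), ...] in ranked order.
    story_spans: show_id -> list of (start, end, story_id) from the reference."""
    seen, mapped = set(), []
    for show_id, t in ranked_times:
        story = next((sid for start, end, sid in story_spans.get(show_id, [])
                      if start <= t < end), DUMMY)   # non-story -> dummy ID
        if story != DUMMY and story in seen:
            story = DUMMY                            # duplicate retrieval -> dummy ID
        seen.add(story)
        mapped.append(story)
    return mapped   # scored with TREC_EVAL as if these were retrieved documents

# Hypothetical example: two hits in the same story, one in a commercial break.
spans = {"abc_19980301": [(0.0, 120.0, "story_1"), (180.0, 300.0, "story_2")]}
times = [("abc_19980301", 30.0), ("abc_19980301", 45.0), ("abc_19980301", 150.0)]
print(map_times_to_stories(times, spans))  # ['story_1', 'DUMMY', 'DUMMY']
```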

24 Story Boundaries Unknown Retrieval Condition

25 SDR-2000 Cross-Recognizer Results
Performance for own ASR similar to human reference.

26 Conclusions
- Ad hoc retrieval in the broadcast news domain appears to be a "solved problem"
  - systems perform well at finding relevant passages in transcripts produced by a variety of recognizers on full unsegmented news broadcasts
  - performance on own recognizer comparable to human reference
- Just beginning to investigate use of non-lexical information
- Caveat Emptor: ASR may still pose serious problems for the Question Answering domain, where content errors are fatal

27 Future for Multi-Media Retrieval?
- SDR Track will be sunset
- Other opportunities:
  - TREC Question Answering Track
  - New Video Retrieval Track
  - CLEF Cross-language SDR
  - TDT Project

28 TREC-9 SDR Results, Primary Conditions

29 TREC-9 SDR Results, Cross Recognizer Conditions

