
Slide 1: Multimodal Technology Integration for News-on-Demand
SRI International
News-on-Demand Compare & Contrast, DARPA, September 30, 1998

Slide 2: Personnel
- Speech: Dilek Hakkani, Madelaine Plauche, Zev Rivlin, Ananth Sankar, Elizabeth Shriberg, Kemal Sonmez, Andreas Stolcke, Gokhan Tur
- Natural language: David Israel, David Martin, John Bear
- Video analysis: Bob Bolles, Marty Fischler, Marsha Jo Hannah, Bikash Sabata
- OCR: Greg Myers, Ken Nitz
- Architectures: Luc Julia, Adam Cheyer

Slide 3: SRI News-on-Demand Highlights
- Focus on technologies
- New technologies: scene tracking, speaker tracking, flash detection, sentence segmentation
- Exploit technology fusion
- MAESTRO multimedia browser

Slide 4: Outline
- Goals for News-on-Demand
- Component Technologies
- The MAESTRO testbed
- Information Fusion
- Prosody for Information Extraction
- Future Work
- Summary

Slide 5: High-level Goal
Develop techniques to provide direct and natural access to a large database of information sources through multiple modalities, including video, audio, and text.

Slide 6: Information We Want
- Geographical location
- Topic of the story
- News-makers
- Who or what is in the picture
- Who is speaking

Slide 7: Component Technologies
- Speech processing: automatic speech recognition (ASR), speaker identification, speaker tracking/grouping, sentence boundary/disfluency detection
- Video analysis: scene segmentation, scene tracking/grouping, camera flashes
- Optical character recognition (OCR): video captions, scene text (light or dark), person identification
- Information extraction (IE): names of people, places, and organizations; temporal terms; story segmentation/classification

Slide 8: Component Flowchart (diagram)

Slide 9: MAESTRO
- Testbed for multimodal News-on-Demand technologies
- Links input data and the output of the component technologies through a common time line
- The MAESTRO score visually correlates the component technologies' output
- New technologies are easy to integrate through a uniform data representation format (sketched below)
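
To make the common-time-line idea concrete, here is a minimal sketch of a time-aligned annotation record that a MAESTRO-style score could be built from. The field names and the overlapping() helper are illustrative assumptions for this transcript, not the actual MAESTRO data format.

    from dataclasses import dataclass

    @dataclass
    class Annotation:
        """One time-aligned output item from a component technology."""
        source: str        # e.g. "asr", "speaker_id", "caption_ocr", "scene_seg"
        start: float       # start time in seconds on the common time line
        end: float         # end time in seconds
        label: str         # recognized word, speaker name, caption text, ...
        confidence: float = 1.0

    def overlapping(score, t0, t1):
        """Return all annotations that intersect the interval [t0, t1]."""
        return [a for a in score if a.end > t0 and a.start < t1]

    # Example: correlate what is on screen with who is speaking around 12-15 s.
    score = [
        Annotation("caption_ocr", 11.0, 16.0, "TONY BLAIR"),
        Annotation("speaker_id", 10.5, 20.0, "anchor"),
        Annotation("asr", 12.3, 12.7, "minister"),
    ]
    print([(a.source, a.label) for a in overlapping(score, 12.0, 15.0)])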

Slide 10: MAESTRO Interface (screenshot showing the score, ASR output, video, and IR results)

Slide 11: The Technical Challenge
- Problem: knowledge sources are not always available or reliable
- Approaches: make existing sources more reliable; combine multiple sources for increased reliability and functionality (fusion); exploit new knowledge sources

Slide 12: Two Examples
- Technology fusion: speech recognition + named entity finding = better OCR
- New knowledge source: speech prosody for finding names and sentence boundaries

Slide 13: Fusion Ideas
- Use the names of people detected in the audio track to suggest names in captions
- Use the names of people detected in yesterday's news to suggest names in audio
- Use a video caption to identify a person speaking, then use their voice to recognize them again

Slide 14: Information Fusion (diagram: a name such as "Moore" detected by text recognition and natural language name extraction is added to the lexicon, which then lets ASR recognize "moore")

Slide 15: System diagram. Input modalities (video imagery, auxiliary text news sources, audio track) feed technology components (face detection/recognition, caption recognition, scene text detection/recognition, video object tracking, scene segmentation/clustering/classification, speaker segmentation/clustering/classification, audio event detection, speech recognition, name extraction, topic detection), which produce the extracted information (story start/end, geographic focus, story topic, who or what is in view, who is speaking). Arrows mark input processing paths and first-pass fusion opportunities.

Slide 16: Augmented Lexicon Improves Recognition Results
Without lexicon: TONY BLAKJB, WNITEE SIATEE
With lexicon: TONY BLAIR, UNITED STATES
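
A minimal sketch of the fusion step behind this slide: names found by ASR and named-entity extraction are added to a lexicon, and noisy OCR hypotheses are snapped to the closest lexicon entry. The use of difflib and the cutoff value are illustrative choices for this example, not the method actually used.

    import difflib

    # Hypothetical lexicon built from ASR output plus named-entity extraction.
    name_lexicon = ["TONY BLAIR", "UNITED STATES", "BILL CLINTON"]

    def correct_ocr(hypothesis, lexicon, cutoff=0.6):
        """Replace a noisy OCR hypothesis with the closest lexicon entry, if any."""
        match = difflib.get_close_matches(hypothesis, lexicon, n=1, cutoff=cutoff)
        return match[0] if match else hypothesis

    print(correct_ocr("TONY BLAKJB", name_lexicon))    # -> TONY BLAIR
    print(correct_ocr("WNITEE SIATEE", name_lexicon))  # -> UNITED STATES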

Slide 17: Prosody for Enhanced Speech Understanding
- Prosody = the rhythm and melody of speech
- Measured through duration (of phones and pauses), energy, and pitch (see the feature sketch below)
- Can help extract information crucial to speech understanding
- Examples: sentence boundaries and named entities
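
As a rough illustration of the kinds of features involved, the sketch below computes duration, pause, energy, and pitch statistics for a single word from frame-level pitch and energy tracks plus word times from a forced alignment. The inputs, frame rate, and feature names are assumptions for the example, not the feature set used in the experiments.

    import numpy as np

    def prosodic_features(f0, energy, word_start, word_end, next_word_start,
                          frame_rate=100):
        """Duration, pause, energy, and pitch features for one word.

        f0, energy: frame-level pitch (Hz, 0 = unvoiced) and energy arrays.
        word_start, word_end, next_word_start: times in seconds, e.g. from a
        forced alignment (assumed to be available).
        """
        s, e = int(word_start * frame_rate), int(word_end * frame_rate)
        voiced = f0[s:e][f0[s:e] > 0]
        return {
            "duration": word_end - word_start,
            "pause_after": next_word_start - word_end,
            "mean_energy": float(np.mean(energy[s:e])),
            "mean_f0": float(np.mean(voiced)) if voiced.size else 0.0,
            "f0_range": float(np.ptp(voiced)) if voiced.size else 0.0,
        }

    # Toy example: a 0.4 s word followed by a 0.6 s pause.
    f0 = np.concatenate([np.linspace(180, 120, 40), np.zeros(60)])
    energy = np.concatenate([np.full(40, 0.8), np.full(60, 0.05)])
    print(prosodic_features(f0, energy, 0.0, 0.4, 1.0))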

Slide 18: Prosody for Sentence Segmentation
- Finding sentence boundaries is important for information extraction and for structuring output for retrieval
- Example: "Any surprises? No. Tanks are in the area."
- Experiment: predict sentence boundaries from duration and pitch using decision tree classifiers (see the sketch below)
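
A minimal sketch of such a classifier, assuming per-boundary features like pause length and pitch before and after the candidate boundary. The toy numbers simply mirror the cues reported on the next slide; they are not the actual training data.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # One row per word boundary: [pause_after_s, f0_before_hz, f0_after_hz]
    X = np.array([
        [0.03, 185.0, 180.0],   # within-sentence boundary
        [0.70, 120.0, 215.0],   # sentence boundary: long pause, pitch reset
        [0.05, 190.0, 182.0],
        [0.85, 110.0, 225.0],
    ])
    y = np.array([0, 1, 0, 1])  # 1 = sentence boundary

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(clf.predict([[0.65, 118.0, 210.0]]))  # expected: [1]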

Slide 19: Sentence Segmentation: Results
- Baseline accuracy = 50% (equal numbers of boundaries and non-boundaries)
- Accuracy using prosody = 85.7%
- Boundaries indicated by long pauses, low pitch before the boundary, and high pitch after it
- Pitch cues work much better in Broadcast News than in Switchboard

Slide 20: Prosody for Named Entities
- Finding names (of people, places, organizations) is key to information extraction
- Names tend to be important to the content and hence receive prosodic emphasis
- Prosodic cues can be detected even when words are misrecognized, so they could help find new named entities

Slide 21: Named Entities: Results
- Baseline accuracy = 50%
- Using prosody only: accuracy = 64.9%
- Named entities indicated by longer duration (more careful pronunciation) and more within-word pitch variation (see the sketch below)
- Challenges: only first mentions are accented; only one word in a longer N.E. is marked; non-names are also accented
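
A heuristic sketch of how the two cues above (longer duration, more within-word pitch variation) could be turned into a named-entity candidate score. The scaling constants and threshold are illustrative assumptions, not the classifier evaluated in these results.

    import numpy as np

    def ne_candidate_score(duration, f0_frames, avg_duration=0.28):
        """Accentedness score for one recognized word.

        duration: word length in seconds; f0_frames: frame-level pitch in Hz
        (0 = unvoiced). avg_duration is an assumed corpus-average word length.
        """
        voiced = f0_frames[f0_frames > 0]
        pitch_var = float(np.std(voiced)) if voiced.size else 0.0
        # Longer-than-average words with large pitch movement score high.
        return duration / avg_duration + pitch_var / 30.0

    word_f0 = np.array([0, 150, 165, 190, 210, 205, 0], dtype=float)
    print(ne_candidate_score(0.45, word_f0) > 2.0)  # flag as possible N.E.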

Slide 22: Using Prosody in NoD: Summary
- Prosody can help information extraction independently of word recognition
- Preliminary positive results for sentence segmentation and N.E. finding
- Other uses: topic boundaries, emotion detection

Slide 23: Ongoing and Future Work
- Combine prosody and words for name finding
- Implement additional fusion opportunities: OCR helping speech; speaker tracking helping topic tracking
- Leverage geographical information for recognition technologies

Slide 24: Conclusions
- News-on-Demand technologies are making great strides
- Robustness is still a challenge
- Improved reliability through data fusion and new knowledge sources

