
1 Informedia Interface Evaluation and Information Visualization Digital Video Library December 10, 2002 Mike Christel

2 © Copyright 2002 Michael G. Christel and Alexander G. Hauptmann. Carnegie Mellon.
Outline
Surrogates for Informedia Digital Video Library
  Abstractions for a single video document
  Empirical studies on thumbnail images, skims
  Quick overview of early HCI investigations
Summaries across video documents (collages)
  Demonstration of information visualization
  Required advances in automated content extraction
TREC Video Retrieval Track 2002
  Overview of Carnegie Mellon participation and results
  Multiple storyboard interface emphasizing imagery

3 Informedia Digital Video Library Project
Initiated by the National Science Foundation, DARPA, and NASA under the Digital Libraries Initiative, 1994-98
Continued funding via Digital Libraries Initiative Phase 2 (NSF, DARPA, National Library of Medicine, Library of Congress, NASA, National Endowment for the Humanities)
New work and directions via NSF NSDL, ARDA VACE, the “Capturing, Coordinating, and Remembering Human Experience” (CCRHE) project, etc.
Details at http://www.informedia.cs.cmu.edu/

4 Techniques Underlying Video Metadata
Image processing
  Detection of text overlaid on video
  Detection of faces
  Identification of camera and object motion
  Breaking video into component shots
  Detecting corpus-specific categories, e.g., anchorperson shots and weather map shots
Speech recognition
  Text extraction and alignment
Natural language processing
  Determining best text matches for a given query
  Identifying places, organizations, people
  Producing phrase summaries

5 Text and Face Detection

6 Text Extraction and Alignment
[Figure: raw audio and raw video tracks aligned with extracted text; the audio track is labeled SILENCE, MUSIC, and the recognized phrase “electric cars are they are the jury every toy owner hopes to please” (speech recognition output, errors included)]

7 Deriving “Matching Shots”
[Figure: pipeline combining shot detection and shot frame extraction with speech recognition and alignment. Recognized narration (“These strange markings, preserved in the clay of a Texas riverbed, are footsteps… dinosaur graveyards… group of scientists… nature’s special effects...”) is aligned to shots at times 0, 3500, 4600, 5930, yielding the matching shot for “dinosaur footprint”]
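The "align words to shots" step above can be sketched in a few lines: given shot start times from shot detection and per-word timestamps from speech recognition, assign each word to the shot whose time span contains it. This is an illustrative sketch, not the Informedia implementation; the timings echo the slide's "dinosaur footprint" example.

```python
import bisect

def align_words_to_shots(shot_starts, words):
    """shot_starts: sorted shot start times (ms); words: (word, time_ms) pairs.
    Returns a dict mapping shot index -> list of words spoken during that shot."""
    aligned = {}
    for word, t in words:
        # bisect_right finds the shot whose start time is the latest one <= t
        shot = bisect.bisect_right(shot_starts, t) - 1
        aligned.setdefault(shot, []).append(word)
    return aligned

shot_starts = [0, 3500, 4600, 5930]  # shot boundaries from shot detection
words = [("footsteps", 1200), ("dinosaur", 3900),
         ("graveyards", 4100), ("scientists", 5000)]
print(align_words_to_shots(shot_starts, words))
# {0: ['footsteps'], 1: ['dinosaur', 'graveyards'], 2: ['scientists']}
```

Once words are attached to shots, a query match against the transcript immediately identifies candidate shots, which is what makes the "matching shot" retrieval possible.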

8 Initial User Testing of Video Library, ca. 1996
104-hour library consisting of 3481 clips
Average clip length of 1.8 minutes, consuming 15.7 megabytes of storage
Automatic logs generated for usage of the Informedia Library by high school science teachers and students
243 hours logged (2473 queries, 2910 video clips played)

9 Early Lessons Learned
Titles frequently used; should include length and production date
Results and title placement affect usage
Greater quantity of video was desired
Storyboards (filmstrips) used infrequently

10 Empirical Study Into Thumbnail Images

11 Text-based Result List

12 “Naïve” Thumbnail List (Uses First Shot Image)

13 Query-based Thumbnail Result List

14 Query-based Thumbnail Selection Process
1. Decompose video segment into shots.
2. Compute a representative frame for each shot.
3. Locate query scoring words (shown by arrows).
4. Use the frame from the highest-scoring shot.
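The four steps above can be sketched as follows. Shot decomposition and query scoring are stubbed out: each shot record simply carries its representative frame and the words spoken during the shot, and the score is a count of matching query words. The data structures and file names are illustrative, not the Informedia implementation.

```python
def query_based_thumbnail(shots, query_words):
    """shots: list of dicts with 'frame' (representative image) and 'words'
    (set of words spoken during the shot). Returns the representative frame
    of the shot matching the most query words."""
    def score(shot):
        return sum(w in shot["words"] for w in query_words)
    best = max(shots, key=score)
    return best["frame"]

shots = [
    {"frame": "shot1.jpg", "words": {"these", "strange", "markings"}},
    {"frame": "shot2.jpg", "words": {"dinosaur", "graveyards"}},
    {"frame": "shot3.jpg", "words": {"dinosaur", "footprint", "texas"}},
]
print(query_based_thumbnail(shots, {"dinosaur", "footprint"}))  # shot3.jpg
```

The same scoring contrasts with the "naïve" treatment on the previous slide, which would always return the first shot's frame regardless of the query.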

15 Thumbnail Study Results

16 Empirical Study Summary*
Significant performance improvements for the query-based thumbnail treatment over the other two treatments
Subjective satisfaction significantly greater for the query-based thumbnail treatment
Subjects could not identify differences between thumbnail treatments, but their performance definitely showed differences!
_____
*Christel, M., Winkler, D., and Taylor, C.R. Improving Access to a Digital Video Library. In Human-Computer Interaction: INTERACT ’97, Chapman & Hall, London, 1997, 524-531

17 Thumbnail View with Query Relevance Bar

18 Close-up of Thumbnail with Relevance Bar
Relevance score in [0, 100]; this document has a score of 30
Color-coded scoring words: “asylum” contributes some, “rights” a bit, “refugee” contributes 50%
Query-based thumbnail
Shortcut to storyboard

19 “Skim Video”: Extracting Significant Content
Original video (1100 frames) condensed into skim video (78 frames)

20 Skims: Preliminary Findings
Real benefit for skims appears to be for comprehension rather than navigation
For PBS documentaries, information in the audio track is very important
Empirical study conducted in September 1997 to determine advantages of skims over subsampled video, and synchronization requirements for audio and visuals

21 Empirical Study: Skims
DFL - “default” long skim
DFS - default short skim
NEW - selective skim
RND - same audio as NEW but with unsynchronized video

22 Skim Study Results
Subjects asked if an image was in the video just seen
Subjects asked if text summarizes information that would be in the full source video

23 Skim Study QUIS Results
[QUIS subjective rating scales anchored from “wonderful, satisfying, stimulating” to “terrible, frustrating, dull”]

24 Skim Study Results*
1996 “selective” skims performed no better than subsampled skims, but results from the 1997 study show significant differences, with “selective” skims more satisfactory to users:
  audio is less choppy than in the earlier 1996 skim work
  synchronization with video is better preserved
  grain size has increased
_____
*Christel, M., Smith, M., Taylor, C.R., and Winkler, D. Evolving Video Skims into Useful Multimedia Abstractions. In Proc. ACM CHI ’98 (Los Angeles, CA, April 1998), ACM Press, 171-178

25 Match Information

26 Using Match Information for Browsing

27 Using Match Info to Reduce Storyboard Size

28 Adding Value to Video Surrogates via Text
Captions AND pictures better than either modality alone
  Large, A., et al. Multimedia and Comprehension: The Relationship among Text, Animation, and Captions. J. American Society for Information Science 46(5) (June 1995), 340-347
  Nugent, G.C. Deaf Students' Learning from Captioned Instruction: The Relationship between the Visual and Caption Display. J. Special Education 17(2) (1983), 227-234
Video surrogates better with BOTH images and text
  Ding, W., et al. Multimodal Surrogates for Video Browsing. In Proc. ACM Conf. on Digital Libraries (Berkeley, CA, Aug. 1999), 85-93
  Christel, M. and Warmack, A. The Effect of Text in Storyboards for Video Navigation. In Proc. IEEE ICASSP (Salt Lake City, UT, May 2001), Vol. III, 1409-1412
For news/documentaries, audio narrative is important, but other video genres may be different
  Li, F., Gupta, A., et al. Browsing Digital Video. In Proc. ACM CHI ’00 (The Hague, Neth., April 2000), 169-176

29 How Much Text, and Does Layout Matter?
Treatments: NoText, BriefByRow, Brief, AllByRow, All

30 Results from Christel/Warmack Study
Mean task completion times, in seconds (graphed with 95% confidence intervals):
  NoText: 192
  AllByRow: 160
  All: 137
  BriefByRow: 117
  Brief: 162

31 More Results from Storyboard/Text Study
Mean ranking for treatments (1 = favorite, 5 = least favorite):
  NoText: 4.6
  AllByRow: 1.12
  All: 3.04
  BriefByRow: 2.52
  Brief: 3.72
AllByRow was favored, but had relatively poor performance (160 seconds for tasks)
BriefByRow ranked 2nd by preference, and had the best performance (117 seconds for tasks)

32 Conclusions from Storyboard/Text Study
Storyboard surrogates clearly improved with text
Participants favored interleaved presentation
Navigation efficiency is best served with reduced interleaved text (BriefByRow)
BriefByRow and All had the best task performance, but BriefByRow requires less display space
If interleaving is done in conjunction with text reduction, to better preserve and represent the time association between lines of text, imagery, and their affiliated video sequence, then a storyboard with great utility for information assessment and navigation can be constructed.

33 Discussed Multimedia Surrogates, i.e., Abstractions Based on Library Metadata
[Chart arranging surrogates along a static-to-temporal axis and by content detail/object size: text title, thumbnail image, storyboard, match bars, skim video]

34 Range of Multimedia Surrogates
[Chart arranging surrogates along a static-to-temporal axis and by content detail/object size: text title, thumbnail image, storyboard, match bars, skim video, plus audio data, full text transcript, and storyboard with audio]

35 Evaluating Multimedia Surrogates
Techniques discussed here:
  transaction logs
  formal empirical studies
Other techniques used in interface refinement:
  contextual inquiry
  heuristic evaluation
  cognitive walkthroughs
  “think aloud” protocols

36 Extending to Surrogates ACROSS Video
As digital video assets grow, so do result sets
As automated processing techniques improve, e.g., speech and image processing, more metadata is generated with which to build interfaces into video
Need overview capability to deal with greater volume
Prior work offered many solutions:
  Visualization By Example (VIBE) for matching entity relationships
  Scatter plots for low-dimensionality relationships, e.g., timelines
  Dynamic query sliders for direct manipulation of plots
  Colored maps for geographic relationships

37 Enhancing Library Utility via Better Metadata
[Architecture diagram: a metadata extractor and summarizer feed perspective templates (people, event topics, affiliation, location, time) into the user interface as the final representation]

38 Displaying Metadata in Effective “Collages”
[Map collage of the North and South Pacific Ocean region emphasizing distribution by nation of “El Niño effects”, with overlaid thumbnails]

39 Zooming into “Collage” to Reveal Details

40 Example of “Chrono-Collage”
[Timeline collage spanning March-May 1998, emphasizing “key player faces” and short event descriptors (Suharto economic reform meetings, U.S. policy on Indonesia, Habibie new president, El Niño wildfires, student protests against Suharto), representing the same data shown in the Indonesia map perspective]

41 Named Entity Extraction
Sample tagged transcript (key: place, time, organization/person):
“CNN national correspondent John Holliman is at Hartsfield International Airport in Atlanta. Good morning, John. … But there was one situation here at Hartsfield where one airplane flying from Atlanta to Newark, New Jersey yesterday had a mechanical problem and it caused a backup that spread throughout the whole system because even though there were a lot of planes flying to the New York area from the Atlanta area yesterday, …”
F. Kubala, R. Schwartz, R. Stone, and R. Weischedel, “Named Entity Extraction from Speech”, Proc. DARPA Workshop on Broadcast News Understanding Systems, Lansdowne, VA, February 1998.

42 Challenge: Integrating Imagery into Collages

43 Great Volume of Imagery Requires Filtering
Video can be decomposed into shots
Consider 2050 hours of CNN videos from 1997-2002:
  1,688,000 shots
  67,700 segments/stories
  1 minute 53 seconds average story duration
  4.5 seconds average shot duration
  23 shots per segment on average
Result sets for queries number in the hundreds or thousands
  Against the 2001 CNN collection, the top 1000 stories for queries on “terrorism” and “bomb threat” produced 17,545 and 18,804 shots respectively
User needs a way to filter down tens of thousands of images
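As a sanity check, the per-shot and per-segment averages above can be derived from the corpus totals. Note the slide's figures are rounded, so the derived values only approximately match the quoted 4.5 s/shot, 1 min 53 s/story, and 23 shots/segment.

```python
# Derive average durations from the quoted CNN corpus totals.
HOURS = 2050
SHOTS = 1_688_000
SEGMENTS = 67_700

def corpus_averages(hours, shots, segments):
    """Return (seconds per shot, seconds per segment, shots per segment)."""
    total_seconds = hours * 3600
    return total_seconds / shots, total_seconds / segments, shots / segments

sec_per_shot, sec_per_segment, shots_per_segment = corpus_averages(HOURS, SHOTS, SEGMENTS)
print(f"{sec_per_shot:.2f} s/shot")              # ~4.4 s (slide quotes 4.5 s)
print(f"{sec_per_segment:.0f} s/segment")        # ~109 s (slide quotes 1 min 53 s)
print(f"{shots_per_segment:.0f} shots/segment")  # ~25 (slide quotes 23)
```

Either way, the scale of the numbers makes the point: any realistic query surfaces tens of thousands of shot images, far too many to browse without filtering.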

44 Adding Imagery to Visualizations
Query-based thumbnail images added to VIBE, timeline, and map summaries
Layout differs: overlap in VIBE/timeline; tile in map
Extend concept of “highest scoring” to represent a country, a point in time, or a point on a VIBE plot

45 Leveraging from Our Prior Video Summarization Work
Context, e.g., matching terms, and synchronization between imagery and narrative can reduce summary complexity
Text with imagery more useful in video summaries than either text alone or imagery alone
“Overview first, zoom and filter, then details on demand”
  Visual Information-Seeking Mantra of Ben Shneiderman
  Direct manipulation interfaces leave the user in control
Iterative prototyping reveals areas needing further work

46 Adding Text Overviews to Collages
Transcript and other derived text, such as scene text and characters overlaid on broadcast video, provide input for further processing
Named entity tagging and common phrase extraction provide a filtering mechanism to reduce text into defined subsets
Visualization interface allows subsets, e.g., people, organizations, locations, and common phrases, to be displayed for the set of documents plotted in the visualization view

47 Example of Text-Augmented Timeline
Most frequent common phrases and people from query on “anthrax” against 2001 news listed beneath timeline plot.

48 Example of Text-Augmented VIBE Plot
Left pane shows videos from 1/01-3/01 focusing on refugees and asylum. Right pane shows videos from 5/01-8/01 focusing on human rights and stem cell research.

49 Refinement of Collages*
Image addition to summaries improved over time
  Anchorperson removal for more representative visuals
  Consume more space in timeline with images via better layout
  Image resizing under user control to see detail on demand
Text addition found to require new interface controls
  Selection controls, e.g., list people, organizations, locations, and/or common phrases
  Stopping rules, e.g., list at most X terms, or list terms only if they are covered by Y documents or Z% of the document set
  Show some text where the user’s attention is focused, by the mouse pointer, i.e., pop-up tooltip text
_____
*Christel, M., et al. Collages as Dynamic Summaries for News Video. In Proc. ACM Multimedia ’02 (Juan-les-Pins, France, December 2002)
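The "stopping rules" above can be sketched as a small term-selection filter: cap the number of listed terms and require a minimum document coverage. The function name, data shapes, and thresholds are hypothetical illustrations, not the Informedia implementation.

```python
from collections import Counter

def terms_to_display(doc_terms, max_terms=10, min_doc_fraction=0.2):
    """doc_terms: one set of extracted terms per document in the visualization
    view. Returns up to max_terms terms covered by at least min_doc_fraction
    of the documents, most frequent first."""
    # Count, per term, how many documents it appears in
    coverage = Counter(t for terms in doc_terms for t in set(terms))
    min_docs = min_doc_fraction * len(doc_terms)
    kept = [t for t, n in coverage.most_common() if n >= min_docs]
    return kept[:max_terms]

docs = [{"anthrax", "mail"}, {"anthrax", "senate"},
        {"anthrax", "mail", "fbi"}, {"senate"}]
print(terms_to_display(docs, max_terms=3, min_doc_fraction=0.5))
```

With a 50% coverage threshold, rare terms like "fbi" drop out while frequent ones survive, which is exactly the reduction the collage text overlays need.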

50 NIST TREC Video Retrieval Track
Definitive information at the NIST TREC Video Track web site: http://www-nlpir.nist.gov/projects/trecvid/
TREC series sponsored by the National Institute of Standards and Technology (NIST) with additional support from other U.S. government agencies
  Goal is to encourage research in information retrieval from large amounts of text by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results
Video Retrieval Track started in 2001
  Goal is investigation of content-based retrieval from digital video
  Focus on the shot as the unit of information retrieval rather than the scene or story/segment/clip

51 TREC-Video 2001 and TREC-Video 2002
2001 collection had ~11 hours of MPEG-1 video: 260 segments, 8000 shots, 80,000 I-frames
2002 search test collection had ~40 hours of MPEG-1 video: 1160 segments, 14,524 shots (given by TREC-V), 292,000 I-frames
2001 results:
  http://trec.nist.gov/pubs/trec10/t10_proceedings.html
  Definite need to define the unit of information retrieval
  Automatic search (no human in loop) difficult: about 1/3 of queries were unanswered by any of the automatic systems
  Research groups submitting search runs were Carnegie Mellon, Dublin City Univ., Fudan Univ. China, IBM, Johns Hopkins Univ., Lowlands Group Netherlands, Univ. Maryland, Univ. North Texas
2002 results to be published after the TREC conference in 11/02

52 TREC-Video 2001 Queries
Specific item or person: the planet Jupiter, corn on the cob, Ron Vaughn, Harry Hertz, Lou Gossett Jr., R. Lynn Bonderant
Specific fact: number of spikes on the Statue of Liberty’s crown
Specific event or activity: liftoff of the Space Shuttle, Ronald Reagan reading speech about the Space Shuttle
Instances of a category: mountains as prominent scenery, scenes with a yellow boat, pink flowers
Instances of events/activities: vehicle traveling on the moon, water skiing, speaker talking in front of the US flag, chopper landing

53 Carnegie Mellon TREC-Video 2001 Results*
Retrieval using:                                       ARR      Recall
Speech recognition transcripts only                    1.84 %   13.2 %
Raw video OCR only                                     5.21 %    6.10 %
Raw video OCR + speech transcripts                     6.36 %   19.30 %
Enhanced VOCR with dictionary post-processing          5.93 %    7.52 %
Speech transcripts + enhanced video OCR                7.07 %   20.74 %
Image retrieval only, using a probabilistic model     14.99 %   24.45 %
Image retrieval + speech transcripts                  14.99 %   24.45 %
Image retrieval + face detection                      15.04 %   25.08 %
Image retrieval + raw VOCR                            17.34 %   26.95 %
Image retrieval + enhanced VOCR                       18.90 %   28.52 %
____
Average reciprocal rank (ARR) used as the evaluation metric.
*See http://trec.nist.gov/pubs/trec10/papers/CMU-VideoTrack.pdf for the full report.
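For readers unfamiliar with the metric in the table, one common definition of reciprocal-rank scoring can be sketched as follows: each query scores 1/rank of its first relevant result (0 if none is retrieved), and ARR averages these over all queries. This is an illustrative sketch of that common definition; the sample ranks are made up, and the exact TREC-V scoring details are in the linked report.

```python
def average_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: for each query, the rank of the first relevant
    result, or None when no relevant result was retrieved."""
    scores = [0.0 if r is None else 1.0 / r for r in first_relevant_ranks]
    return sum(scores) / len(scores)

# Four queries: first hits at ranks 2, 1, and 5, and one complete miss
print(average_reciprocal_rank([2, 1, 5, None]))  # (0.5 + 1 + 0.2 + 0) / 4 = 0.425
```

This explains why ARR and recall can diverge in the table: ARR rewards placing one relevant shot near the top, while recall measures how many relevant shots are found at all.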

54 TREC-Video 2002 Queries
Specific item or person: Eddie Rickenbacker, James Chandler, George Washington, Golden Gate Bridge, Price Tower in Bartlesville, OK
Specific fact: Arch in Washington Square Park in NYC, map of the continental US
Instances of a category: football players, overhead views of cities, one or more women standing in long dresses
Instances of events/activities: people spending leisure time at the beach, one or more musicians with audible music, crowd walking in an urban environment, locomotive approaching the viewer

55 TREC-Video 2002 Features for Auto-Detection
Outdoors: recognizably outdoor location
Indoors: recognizably indoor location
Face: at least one human face with nose, mouth, and both eyes
People: group of two or more humans
Cityscape: recognizably city/urban/suburban setting
Landscape: a predominantly natural inland setting, i.e., one with little or no evidence of development by humans
Text Overlay: superimposed text large enough to be read
Speech: human voice uttering recognizable words
Instrumental Sound: sound produced by one or more musical instruments, including percussion instruments
Monologue: an event in which a single person is at least partially visible and speaks for a long time without interruption by another speaker

56 New Interface Development for TREC-V 2002
Multiple document storyboards
Resolution and layout under user control
Query context plays a key role in filtering image sets to manageable sizes
TREC 2002 image feature set offers additional filtering capabilities for indoor, outdoor, faces, people, etc.
Displaying filter count and distribution guides their use in manipulating the storyboard views
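The feature-based filtering described above (and walked through in the beach-shot example on the following slides) can be sketched as a predicate over per-shot boolean features. The shot records and feature names here are illustrative, not the actual TREC-V data format.

```python
def filter_shots(shots, require=(), exclude=()):
    """Keep shots whose auto-detected features include every feature in
    `require` and none of the features in `exclude`."""
    return [s for s in shots
            if all(s["features"].get(f) for f in require)
            and not any(s["features"].get(f) for f in exclude)]

shots = [
    {"id": 1, "features": {"outdoors": True, "people": True}},
    {"id": 2, "features": {"outdoors": True, "people": False}},
    {"id": 3, "features": {"outdoors": False, "people": True}},
]
outdoor = filter_shots(shots, require=["outdoors"])
outdoor_with_people = filter_shots(shots, require=["outdoors", "people"])
print([s["id"] for s in outdoor])              # [1, 2]
print([s["id"] for s in outdoor_with_people])  # [1]
```

Successively adding required features shrinks the candidate set the same way the slides narrow 863 beach shots to 469 "outdoor" shots and then to 56 shots with people.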

57 Multiple Document Storyboards

58 Resolution and Layout under User Control

59 Leveraging from Query Context
User has already expressed information need via query
Query-based thumbnail representation has proven summarization effectiveness*
Therefore, use query-based scoring for shot selection, reducing thousands of shots to tens or hundreds: decompose video into shots, align query matches to shots, and use the highest-scoring shot to represent the video segment
_____
*See the INTERACT ’97 conference paper by Christel et al. for more details.

60 TREC 2002 Image Feature Set

61 Filter Interface for Using Image Features

62 Example: Looking for Beach Shots, 863 Shots

63 Ex.: “Outdoor” Beach Shot Set at 469 Shots

64 Ex.: Beach Shot Set at a Manageable Size of 56 after Filtering Out Shots with No People

65 Conclusions
Multi-document storyboard view facilitates quick inspection of a large set of images
First-order filtering by query very useful in providing the user with an initial set of images for investigation
Shots temporally near relevant shots often were relevant as well, so image ordering by video segment and time is useful
Image features useful to filter, specific to certain queries
Drill-down to details, from images to video, necessary to eliminate ambiguity
These strategies hold promise for finding visual information from video corpora beyond the TREC 2002 collection

66 Credits
Many Informedia Project and CMU research community members contributed to this work; a partial list appears here:
Project Director: Howard Wactlar
User Interface: Mike Christel, Chang Huang, Adrienne Warmack, Dave Winkler
Image Processing: Takeo Kanade, Norm Papernick, Toshio Sato, Henry Schneiderman, Michael Smith
Speech and Language Processing: Alex Hauptmann, Ricky Houghton, Rong Jin, Raj Reddy, Michael Witbrock
Informedia Library Essentials: Bob Baron, Bruce Cardwell, Colleen Everett, Mark Hoy, Melissa Keaton, Bryan Maher, Craig Marcus

