
1 Paul Over, TRECVID Project Leader
Information Access Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
DMASM 2011
http://trecvid.nist.gov

2 What is TRECVID?
A workshop series (2001 – present), http://trecvid.nist.gov, to promote research and progress in content-based video analysis and exploitation:
- a foundation for large-scale laboratory testing
- a forum for the exchange of research ideas and for discussion of research methodology: what works, what doesn't, and why
Focus: content-based approaches to retrieval, detection, summarization, segmentation, …
Aims for realistic system tasks and test collections:
- unfiltered data
- focus on relatively high-level functionality (e.g., interactive search)
- measurement against human abilities
Provides data, tasks, and uniform, appropriate scoring procedures.

3 TRECVID Philosophy
TRECVID is a modern example of the Cranfield tradition:
- laboratory system evaluation based on test collections
- emphasis on advancing the state of the art from evaluation results
TRECVID's primary aim is not competitive product benchmarking; it is an experimental workshop: sometimes experiments fail!
Laboratory experiments (vs., e.g., observational studies) sacrifice operational realism and broad scope of conclusions for control and information about causality – what works and why:
- results tend to be narrow and, at best, indicative, not final
- evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years

4 TRECVID Yearly Cycle
- Call for Participation
- Data procurement
- Task definitions complete
- System building & experimentation; community contributions (shots, training data, ASR, MT, etc.)
- Search topic, ground truth development
- Results
- Evaluation
- Results analysis and workshop paper/presentation preparation (~400 authors/year)
- TRECVID Workshop
- Post-workshop experiments, final papers

5 TRECVID's Evolution
[Timeline chart, 2003–2011. For each task, a bar shows the years it ran, with a dashed segment marking new development or test data as added; data sources move from English TV news through Sound & Vision (S&V) to BBC rushes and beyond. Tasks charted: shot boundaries, ad hoc search, features/semantic indexing, stories, camera motion, BBC rushes summaries, copy detection, surveillance events, known-item search, instance search pilot, and the multimedia event detection (MED) pilot. Rows below the chart give data volume (hours) and participating teams per year.]

6 TRECVID 2010 Tasks and Data
Data:
- Internet Archive – Creative Commons (IACC) [video, title, keywords, description]
- Sound and Vision [video]
- Airport surveillance [video]
- HAVIC – Internet multimedia [video]
Tasks:
- Known-item search from a text-only query
- Instance search from multiple frames with bounding boxes
- Surveillance event detection
- Multimedia event detection
- Semantic indexing (automatic assignment of ~150 tags)
- Content-based copy detection

7 TV2010 Finishers
Groups finished | Task code | Task name
22 | CCD | Copy detection
11 | SED | Surveillance event detection
39 | SIN | Semantic indexing
15 | KIS | Known-item search
 5 | MED | Multimedia event detection pilot
15 | INS | Instance search pilot

8 Support
Funding: National Institute of Standards and Technology (NIST), Intelligence Advanced Research Projects Activity (IARPA), Department of Homeland Security (DHS).
Contributors:
- Brewster Kahle (Internet Archive's founder) and R. Manmatha (U. Mass, Amherst) suggested in December 2008 that TRECVID take another look at the resources of the Archive.
- Cara Binder and Raj Kumar @ archive.org helped explain how to query and download automatically from the Internet Archive (a sketch of such automated querying follows this list).
- Georges Quénot, with Franck Thollard, Andy Tseng, and Bahjat Safadi from LIG and Stéphane Ayache from LIF, shared coordination of the semantic indexing task and organized additional judging with support from the Quaero program.
- Georges Quénot and Stéphane Ayache again organized a collaborative annotation of 130 features.
- Shin'ichi Satoh at NII, along with Alan Smeaton and Brian Boyle at DCU, arranged for the mirroring of the video data.
- Colum Foley and Kevin McGuinness (DCU) helped segment the instance search topic examples and set up the oracle at DCU for interactive systems in the known-item search task.
- The LIMSI Spoken Language Processing Group and VexSys Research provided ASR for the IACC.1 videos.
- Laurent Joyeux (INRIA-Rocquencourt) updated the copy detection query generation code.
- Matthijs Douze from INRIA-LEAR volunteered a camcorder simulator to automate the camcording transformation for the copy detection task.
- Emine Yilmaz (Microsoft Research) and Evangelos Kanoulas (U. Sheffield) updated their xinfAP code (sample_eval.pl) to estimate additional values and made it available.
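The automated Internet Archive querying acknowledged above can be illustrated with a minimal Python sketch against the Archive's public advancedsearch.php JSON endpoint. This is an illustration under assumptions, not the scripts actually used to build IACC; the query string and the download-URL note are hypothetical examples.

```python
# Minimal sketch of querying the Internet Archive's public search API
# (advancedsearch.php) for Creative Commons video items.
# Illustration only, not the actual TRECVID collection scripts.
import json
import urllib.parse
import urllib.request

def search_archive(query, rows=10):
    """Return item identifiers matching `query` on archive.org."""
    params = urllib.parse.urlencode({
        "q": query,
        "fl[]": "identifier",
        "rows": rows,
        "output": "json",
    })
    url = "https://archive.org/advancedsearch.php?" + params
    with urllib.request.urlopen(url) as resp:
        result = json.load(resp)
    return [doc["identifier"] for doc in result["response"]["docs"]]

if __name__ == "__main__":
    # Hypothetical query: Creative Commons movies, IACC-style
    for ident in search_archive(
            "mediatype:movies AND licenseurl:*creativecommons*", rows=5):
        # an item's files are browsable at https://archive.org/download/<identifier>
        print(ident)
```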

9 Some impacts …
Continuing improvement in feature detection (automatic tagging) in the University of Amsterdam's MediaMill system:
- performance on 36 features doubled from 2006 to 2009 (MAP; the metric is sketched below)
- within-domain (train and test) MAP: 0.22 -> 0.41
- cross-domain MAP: 0.13 -> 0.27
Bibliometric study of TRECVID's scholarly impact, 2003–2009 (Dublin City University & University College Dublin): 2073 peer-reviewed journal/conference papers.
2010 RTI International economic impact study of TREC/TRECVID: "… for every $1 that NIST and its partners invested in TREC[/TRECVID], at least $3.35 to $5.07 in benefits accrued to IR [Information Retrieval] researchers."
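For readers unfamiliar with the metric behind these numbers: average precision (AP) rewards a system for ranking relevant shots early, and MAP averages AP over topics or features. A minimal sketch follows; TRECVID itself often reports the inferred variant (xinfAP), computed by sample_eval.pl from sampled judgments.

```python
# Minimal sketch of (mean) average precision, the metric behind the
# MAP figures quoted above. `ranking` is a system's ranked shot list,
# `relevant` the ground-truth relevant shots for one topic or feature.
def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranking, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant shot
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """`runs` maps each topic to a (ranking, relevant_set) pair."""
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)

# Example: relevant shots retrieved at ranks 1 and 4 -> AP = (1/1 + 2/4)/2
print(average_precision(["s1", "s2", "s3", "s4"], {"s1", "s4"}))  # 0.75
```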

10 TRECVID search types so far
TRECVID search has modeled a user looking for video shots for reuse:
- of people, objects, locations, and events, not just information (e.g., video of X, not video of someone talking about X), independent of original intent, saliency, etc.
- in video of various sorts (without metadata other than file names): multilingual broadcast news (Arabic, Chinese, English); Dutch "edutainment", cultural, news magazine, and historical shows
- using queries containing: text only; text + image/video examples; image/video examples only
- in two modes: fully automatic; human-in-the-loop search

11 Panofsky/Shatford mode/facet matrix **
      | Specific (Iconographic)                  | Generic (Pre-iconographic)                 | Abstract (Iconological)
Who   | Individually named person, group, thing  | Kind of person, thing                      | Mythical, fictitious being
What  | Individually named event, action         | Kind of event, action, condition           | Emotion, abstraction
Where | Individually named geographical location | Kind of place: geographical, architectural | Place symbolized
When  | Linear time: date or period              | Cyclical time: season, time of day         | Emotion, abstraction symbolized by time
** From Enser, Peter G. B. and Sandom, Chris J. Retrieval of Archival Moving Imagery – CBIR Outside the Frame. CIVR 2002, LNCS 2383, pp. 206–214.

12 24 Topics from TRECVID 2009
- Find shots of a road taken from a moving vehicle through the front window.
- Find shots of a crowd of people, outdoors, filling more than half of the frame area.
- Find shots with a view of one or more tall buildings (more than 4 stories) and the top story visible.
- Find shots of a person talking on a telephone.
- Find shots of a close-up of a hand, writing, drawing, coloring, or painting.
- Find shots of exactly two people sitting at a table.
- Find shots of one or more people, each walking up one or more steps.
- Find shots of one or more dogs, walking, running, or jumping.
- Find shots of a person talking behind a microphone.
- Find shots of a building entrance.
- Find shots of people shaking hands.
- Find shots of a microscope.
- Find shots of two or more people, each singing and/or playing a musical instrument.
- Find shots of a person pointing.
- Find shots of a person playing a piano.
- Find shots of a street scene at night.
- Find shots of printed, typed, or handwritten text, filling more than half of the frame area.
- Find shots of something burning with flames visible.
- Find shots of one or more people, each at a table or desk with a computer visible.
- Find shots of an airplane or helicopter on the ground, seen from outside.
- Find shots of one or more people, each sitting in a chair, talking.
- Find shots of one or more ships or boats, in the water.
- Find shots of a train in motion, seen from outside.
- Find shots with the camera zooming in on a person's face.

13 Drilling down in the search landscape
Example search scenarios:
- Documentary producer searches a TV archive for reusable shots of Berlin in the 1920s
- Student searches the Web for a new music video
- Your mother searches home videos for shots of daughter playing with the family pet
- Voter looks for video of candidate X at a recent town hall meeting
- Intelligence analyst searches multilingual open-source video for background info on location X
- Security personnel search a surveillance video archive for suspicious behavior
- Fan searches for a favorite TV show episode
- 10-yr-old looks for video of tigers for a school report
- Doctor searches echocardiogram videos for instances like an example
- You want something to make you laugh
Research areas TRECVID draws on:
- Human-computer interaction: human visual capabilities, expert vs. novice, text/image/concept querying, visualization, …
- Information retrieval: indexing, query typing, concept selection, weighting, ranking, positive/negative relevance feedback, metadata, …
- Machine vision: segmentation, keypoints, SIFT, classifier fusion, face recognition, …
- Machine learning: SVM, GMM, graphical models, boosting, …
- Metrology: metrics, data, task definition, ground truth, significance, …

14 Finding meaning in text (words) versus images (pixels)
Hurricane Andrew, which hit the Florida coast south of Miami in late August 1992, was at the time the most expensive disaster in US history. Andrew's damage in Florida cost the insurance industry about $8 billion. There were fifteen deaths, severe property damage, 1.2 million homes were left without electricity, and in Dade county alone 250,000 were left homeless.

15 One image/video – many different (changing) views of content
www.archive.org/details/StupidSister
Creator's keywords: "stupid sister"
Possible content keywords, tags: women, pigeons, plaza, buildings, outdoors, daytime, running, falling, clapping, …

16 One person/thing/location – many different (changing) appearances

17 Can multimedia features serve as "words"?
Low-level:
- color
- texture
- shape
High-level:
- 449 annotated LSCOM features
- 39 LSCOM-Lite
- TRECVID 2009: Classroom, Chair, Infant, Traffic intersection, Doorway, Airplane-flying, Person-playing-a-musical-instrument, Bus, Person-playing-soccer, Cityscape, Person-riding-a-bicycle, Telephone, Person-eating, Demonstration-Or-Protest, Hand, People-dancing, Nighttime, Boat-Ship, Female-human-face-closeup, Singing
Text from:
- speech
- video OCR
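To make the "features as words" idea concrete, here is a minimal, hypothetical sketch of one common recipe in participants' systems: a global color histogram per keyframe and one SVM per concept. It assumes NumPy, scikit-learn, and keyframes as RGB arrays; real systems add texture, shape, and keypoint features plus fusion.

```python
# Minimal sketch of one "feature as word" recipe: a global color
# histogram per keyframe plus one SVM per concept. Real TRECVID
# systems use far richer features (SIFT, texture, shape) and fusion.
import numpy as np
from sklearn.svm import SVC

def color_histogram(frame, bins=8):
    """frame: HxWx3 uint8 RGB array -> flattened, normalized 3D histogram."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3), bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)))
    return (hist / hist.sum()).ravel()

def train_concept_detector(frames, labels):
    """labels: 1 if the concept (e.g. 'Cityscape') is present in the keyframe."""
    X = np.stack([color_histogram(f) for f in frames])
    clf = SVC(probability=True)  # probability=True yields ranking scores
    return clf.fit(X, labels)

def score_shots(detector, keyframes):
    """Return a concept-confidence score per shot, usable for ranking."""
    X = np.stack([color_histogram(f) for f in keyframes])
    return detector.predict_proba(X)[:, 1]
```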

18 LSCOM feature sample (http://www.lscom.org)
000 Parade, 001 Exiting_Car, 002 Handshaking, 003 Running, 004 Airplane_Crash, 005 Earthquake, 006 Demonstration_Or_Protest, 007 People_Crying, 008 Airplane_Takeoff, 009 Airplane_Landing,
010 Helicopter_Hovering, 011 Golf, 012 Walking, 013 Singing, 014 Baseball, 015 Basketball, 016 Football, 017 Soccer, 018 Tennis, 019 Speaking_To_Camera,
020 Riot, 021 Natural_Disasters, 022 Tornado, 023 Ice_Skating, 024 Snow, 025 Flood, 026 Skiing, 027 Talking, 028 Dancing, 029 Car_Crash,
030 Funeral, 031 Gymnastics, 032 Rocket_Launching, 033 Cheering, 034 Greeting, 035 Throwing, 036 Shooting, 037 Address_Or_Speech, 038 Bomber_Bombing, 039 Celebration_Or_Party,
040 Airport, 041 Barn, 042 Castle, 043 College, 044 Courthouse, 045 Fire_Station, 046 Gas_Station, 047 Grain_Elevator, 048 Greenhouse, 049 Hangar,
050 Hospital, 051 Hotel, 052 House_Of_Worship, 053 Police_Station, 054 Power_Plant, 055 Processing_Plant, 056 School, 057 Shopping_Mall, 058 Stadium, 059 Supermarket,
060 Airport_Or_Airfield, 061 Aqueduct, 062 Avalanche, 063 River_Bank, 064 Aircraft_Cabin, ...
810 Still_Image_Composition_May_Include_Text, 811 Stock_Exchange, 812 Stockyard, 813 Storage_Tanks, 814 Store_Outside, 815 Street_Signs, 816 Street_Vendor, 817 Students_Schoolkids, 818 Suitcases, 819 Surgeons,
820 Sword, 821 Synagogue, 822 Tailor, 823 Tanneries, 824 Taxi_Driver, 825 Teacher, 826 Team_Organized_Group, 827 Technicians, 828 Teenagers, 829 Temples,
830 Terrorist, 831 Text_Only_Artificial_Bkgd, 832 Thatched_Roof_Buildings, 833 Theater, 834 Toddlers, 835 Town_Halls, 836 Town_Squares, 837 Townhouse, 838 Tractor, 839 Traffic_Cop,
840 Train_Station, 841 Tribal_Chief, 842 Twilight, 843 Uav, 844 Vacationer_Tourist, 845 Vandal, 846 Veterinarian, 847 Viaducts, 848 Vineyards, 849 Voter,
850 Waiter_Waitress, 851 Water_Mains, 852 Windmill, 853 Wooden_Buildings, 854 Worker_Laborer

19 Simulation study suggests …
"… 'concept-based' video retrieval with fewer than 5000 concepts, detected with minimal accuracy of 10% mean average precision, is likely to provide high accuracy results, comparable to text retrieval on the web, in a typical broadcast news collection." * ?
* Alexander Hauptmann, Rong Yan, Wei-Hao Lin, Michael Christel, and Howard Wactlar. Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News. IEEE Transactions on Multimedia, Vol. 9, No. 5, August 2007, pp. 958–966.

20 A generic TRECVID search system (based on Snoek and Worring 2008 **)
[System diagram: shot-segmented video feeds basic concept detection (feature fusion, classifier fusion), modeling relations, and best-of selection, which populate a database; on the searcher's side, an information need drives query requests through query methods, with query prediction, query results combination, learning from the searcher, and visualization.]
** Cees G. M. Snoek and Marcel Worring. Concept-Based Video Retrieval. Foundations and Trends in Information Retrieval, Vol. 2, No. 4 (2008), pp. 215–322.
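A minimal sketch of the "query prediction" and "query results combination" boxes in such a system: select the detectors whose names lexically overlap the query, then rank shots by a weighted sum of their scores. This is an illustrative simplification, not Snoek and Worring's implementation; deployed systems select concepts via ontologies or learned query-to-concept mappings.

```python
# Minimal sketch of "query prediction" + "query results combination"
# in a concept-based search system: pick detectors that lexically match
# the query, then rank shots by a weighted sum of detector scores.
def select_concepts(query, concept_scores):
    """concept_scores: {concept_name: {shot_id: detector_score}}."""
    terms = set(query.lower().split())
    return [c for c in concept_scores
            if terms & set(c.lower().replace("_", " ").split())]

def rank_shots(query, concept_scores, weights=None):
    concepts = select_concepts(query, concept_scores)
    fused = {}
    for c in concepts:
        w = (weights or {}).get(c, 1.0)  # default: equal concept weights
        for shot, score in concept_scores[c].items():
            fused[shot] = fused.get(shot, 0.0) + w * score
    return sorted(fused, key=fused.get, reverse=True)

# Invented detector outputs for illustration
scores = {"Boat_Ship": {"s1": 0.9, "s2": 0.2},
          "Nighttime": {"s1": 0.1, "s2": 0.8}}
print(rank_shots("boat on the water", scores))  # ['s1', 's2']
```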

21 Innovative search interfaces …
U. Amsterdam MediaMill: http://www-nlpir.nist.gov/projects/tvpubs/tv9.slides/mediamill1.slides.pdf

22 Some results
Keyframes from the top 20 clips returned by a system for the query "shots of person seated at computer".

23 Variation in Average Precision by topic
[Per-topic AP chart; labeled topics include "Dogs walking …", "Printed, typed … text …", and "Closeup of hand writing ….".]
Crowds of people (270), Building entrance (278), and People at desk with computer (287) each had an automatic max better than the interactive max.

24 Observations, questions …
One solution will not fit all. Investigations and discussions of video search must be related to the searcher's specific needs, capabilities, and history, and to the kinds of data being searched.
The enormous and growing amount of video requires extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail.
TRECVID participants have explored some automatic approaches to tagging, and the use of those tags in automatic and interactive search systems, on a couple of sorts of video. Much has been learned, and some results may already be useful, but most of the territory is still unexplored.

25 Observations, questions …
Within the focus of TRECVID experiments …
- Multiple information sources (text, audio, video), each errorful, can yield better results when combined than when used alone (a small fusion sketch follows this list).
- A human in the loop still makes an enormous difference in search.
- Text from speech via automatic speech recognition (ASR) is a powerful source of information, but:
  - its usefulness varies by video genre
  - not everything/everyone in a video is talked about, "in the news"
  - audible mentions are often offset in time from visibility
  - not all languages have good ASR
- Machine learning approaches to tagging yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data, but will they work well enough to be useful on highly heterogeneous video?
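A minimal sketch of the late fusion behind the first bullet: normalize each errorful source's scores to a common range, then combine with a weighted sum. The scores and weights below are invented for illustration, not tuned values from any TRECVID system.

```python
# Minimal sketch of late fusion across errorful sources: min-max
# normalize each source's shot scores, then take a weighted sum.
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {k: (v - lo) / span for k, v in scores.items()}

def fuse(text_scores, visual_scores, w_text=0.6, w_visual=0.4):
    t, v = minmax(text_scores), minmax(visual_scores)
    return {s: w_text * t.get(s, 0.0) + w_visual * v.get(s, 0.0)
            for s in set(t) | set(v)}

asr = {"s1": 2.1, "s2": 0.3, "s3": 1.0}  # e.g. text retrieval over ASR output
vis = {"s1": 0.4, "s2": 0.9}             # e.g. concept detector output
fused = fuse(asr, vis)
print(sorted(fused, key=fused.get, reverse=True))  # ['s1', 's2', 's3']
```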

26 Observations, questions …
Within the focus of TRECVID experiments …
- A hierarchy of automatically derived features can help bridge the gap between pixels and meaning and can assist search, but problems abound:
  - What is the right set of features for a given application?
  - Given a query, how do you automatically decide which specific features to use?
  - Creating quality training data, even with active learning, is very expensive.
- Searchers (experts and non-experts) will use more than text queries if available: concepts, visual similarity, temporal browsing, positive and negative relevance feedback, … (http://www.videolympics.org)
- Processing video using a sample of more than one frame per shot yields better results but quickly pushes common hardware configurations to their limits (a sketch follows this list).
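A minimal sketch of the multi-frame sampling trade-off from the last bullet, assuming OpenCV and shot boundaries already produced by a segmentation step; the function and parameter names are hypothetical.

```python
# Minimal sketch of sampling several frames per shot instead of a single
# keyframe. Assumes shot boundaries are given as (start_frame, end_frame)
# pairs from a prior shot-segmentation step.
import cv2

def sample_shot_frames(video_path, shots, per_shot=3):
    """Yield (shot_index, frame) for `per_shot` evenly spaced frames per shot."""
    cap = cv2.VideoCapture(video_path)
    for i, (start, end) in enumerate(shots):
        step = max((end - start) // (per_shot + 1), 1)
        for k in range(1, per_shot + 1):
            cap.set(cv2.CAP_PROP_POS_FRAMES, min(start + k * step, end - 1))
            ok, frame = cap.read()
            if ok:
                yield i, frame  # downstream cost grows linearly with per_shot
    cap.release()
```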

27 Observations, questions …
Within the focus of TRECVID experiments …
- TRECVID has only just started looking at combining automatically derived and manually provided evidence in search:
  - Systems have been using externally annotated video (e.g., Flickr), but results are not conclusive.
  - Internet Archive video will provide titles, keywords, and descriptions. Where in the Panofsky hierarchy are the donors' descriptions? If very personal, does that mean less useful for other people?
- Need observational studies of real searching of various sorts, using current functionality and identifying unmet needs.
- Need more access for researchers to much more multimedia data of varying kinds and mixtures, with and without human annotation.

28 Observations, questions …
Time to take some of the ideas developed in the laboratory out for small-scale testing with real users, with real needs, and real video collections?

