Paul Over, TRECVID Project Leader, Information Access Division, National Institute of Standards and Technology, Gaithersburg, MD, USA. DMASM 2011


What is TRECVID?
- Workshop series (2001 to present) to promote research/progress in content-based video analysis/exploitation
- Foundation for large-scale laboratory testing
- Forum for the exchange of research ideas and discussion of research methodology: what works, what doesn't, and why
- Focus: content-based approaches to retrieval/detection/summarization/segmentation/…
- Aims for realistic system tasks and test collections
  - unfiltered data
  - focus on relatively high-level functionality (e.g. interactive search)
  - measurement against human abilities
- Provides data, tasks, and uniform, appropriate scoring procedures

TRECVID Philosophy
- TRECVID is a modern example of the Cranfield tradition
- Laboratory system evaluation based on test collections
- Emphasis on advancing the state of the art from evaluation results
  - TRECVID's primary aim is not competitive product benchmarking
  - experimental workshop: sometimes experiments fail!
- Laboratory experiments (vs. e.g., observational studies)
  - sacrifice operational realism and broad scope of conclusions
  - for control and information about causality: what works and why
  - results tend to be narrow, at best indicative, not final
  - evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years

TRECVID Yearly Cycle
[Cycle diagram. Stage labels:]
- Post-workshop experiments, final papers
- Results
- Evaluation
- TRECVID Workshop (~400 authors/year)
- Results analysis and workshop paper/presentation preparation
- System building & experimentation; community contributions (shots, training data, ASR, MT, etc.)
- Search topic, ground truth development
- Task definitions complete
- Call for Participation
- Data procurement

TRECVID's Evolution
[Timeline chart. Task rows: shot boundaries; ad hoc search; features/semantic indexing; stories; camera motion; BBC rushes; summaries; copy detection; surveillance events; known-item search; instance search pilot; multimedia event detection (MED) pilot. Further rows: data (hours); participating teams. Data labels: English TV news; BBC rushes; S&V. Legend: new development or test data as added.]

TRECVID 2010 Tasks and Data
Data:
- Internet Archive - Creative Commons (IACC) [video, title, keywords, description]
- Sound and Vision [video]
- Airport surveillance [video]
- HAVIC - Internet multimedia [video]
Tasks:
- Known-item search from text-only query
- Instance search from multiple frames with bounding boxes
- Surveillance event detection
- Multimedia event detection
- Semantic indexing (automatic assignment of ~150 tags)
- Content-based copy detection

TV2010 Finishers

Groups finished   Task code   Task name
22                CCD         Copy detection
11                SED         Surveillance event detection
39                SIN         Semantic indexing
15                KIS         Known-item search
5                 MED         Multimedia event detection pilot
15                INS         Instance search pilot

Support:
- National Institute of Standards and Technology (NIST)
- Intelligence Advanced Research Projects Activity (IARPA)
- Department of Homeland Security (DHS)
Contributors:
- Brewster Kahle (Internet Archive's founder) and R. Manmatha (U. Mass, Amherst) suggested in December of 2008 that TRECVID take another look at the resources of the Archive.
- Cara Binder and Raj archive.org helped explain how to query and download automatically from the Internet Archive.
- Georges Quénot with Franck Thollard, Andy Tseng, Bahjat Safadi from LIG and Stéphane Ayache from LIF shared coordination of the semantic indexing task and organized additional judging with support from the Quaero program.
- Georges Quénot and Stéphane Ayache again organized a collaborative annotation of 130 features.
- Shin'ichi Satoh at NII along with Alan Smeaton and Brian Boyle at DCU arranged for the mirroring of the video data.
- Colum Foley and Kevin McGuinness (DCU) helped segment the instance search topic examples and set up the oracle at DCU for interactive systems in the known-item search task.
- The LIMSI Spoken Language Processing Group and VexSys Research provided ASR for the IACC.1 videos.
- Laurent Joyeux (INRIA-Roquencourt) updated the copy detection query generation code.
- Matthijs Douze from INRIA-LEAR volunteered a camcorder simulator to automate the camcording transformation for the copy detection task.
- Emine Yilmaz (Microsoft Research) and Evangelos Kanoulas (U. Sheffield) updated their xinfAP code (sample_eval.pl) to estimate additional values and made it available.

Some impacts …
- Continuing improvement in feature detection (automatic tagging) in the University of Amsterdam's MediaMill system
  - Performance on 36 features doubled: 2006 -> 2009
  - Within domain (train and test) MAP > 0.41
  - Cross domains MAP > 0.27
- Bibliometric study of TRECVID's scholarly impact: (Dublin City University & University College, Dublin)
  - 2073 peer-reviewed journal/conference papers
- 2010 RTI International economic impact study of TREC/TRECVID
  - "… for every $1 that NIST and its partners invested in TREC[/TRECVID], at least $3.35 to $5.07 in benefits accrued to IR [Information Retrieval] researchers"

TRECVID search types so far
TRECVID search has modeled a user looking for video shots for reuse
- of people, objects, locations, events
- not just information (e.g., video of X, not video of someone talking about X)
- independent of original intent, saliency, etc.
- in video of various sorts (without metadata other than file names):
  - multilingual broadcast news (Arabic, Chinese, English)
  - Dutch "edutainment", cultural, news magazine, historical shows
- using queries containing:
  - text only
  - text + image/video examples
  - image/video examples only
- in two modes:
  - fully automatic
  - human-in-the-loop search

Panofsky/Shatford mode/facet matrix **

      | Specific (Iconographic) | Generic (Pre-iconographic) | Abstract (Iconological)
Who   | Individually named person, group, thing | Kind of person, thing | Mythical, fictitious being
What  | Individually named event, action | Kind of event, action, condition | Emotion, abstraction
Where | Individually named geographical location | Kind of place, geographical, architectural | Place symbolized
When  | Linear time: date or period | Cyclical time: season, time of day | Emotion, abstraction symbolized by time

** From Enser, Peter G. B. and Sandom, Chriss J. Retrieval of Archival Moving Imagery – CBIR Outside the Frame. CIVR2002. LNCS 2383 pp **

24 Topics from TRECVID 2009
- Find shots of a road taken from a moving vehicle through the front window.
- Find shots of a crowd of people, outdoors, filling more than half of the frame area.
- Find shots with a view of one or more tall buildings (more than 4 stories) and the top story visible.
- Find shots of a person talking on a telephone.
- Find shots of a close-up of a hand, writing, drawing, coloring, or painting.
- Find shots of exactly two people sitting at a table.
- Find shots of one or more people, each walking up one or more steps.
- Find shots of one or more dogs, walking, running, or jumping.
- Find shots of a person talking behind a microphone.
- Find shots of a building entrance.
- Find shots of people shaking hands.
- Find shots of a microscope.
- Find shots of two or more people, each singing and/or playing a musical instrument.
- Find shots of a person pointing.
- Find shots of a person playing a piano.
- Find shots of a street scene at night.
- Find shots of printed, typed, or handwritten text, filling more than half of the frame area.
- Find shots of something burning with flames visible.
- Find shots of one or more people, each at a table or desk with a computer visible.
- Find shots of an airplane or helicopter on the ground, seen from outside.
- Find shots of one or more people, each sitting in a chair, talking.
- Find shots of one or more ships or boats, in the water.
- Find shots of a train in motion, seen from outside.
- Find shots with the camera zooming in on a person's face.

Drilling down in the search landscape
Example searches:
- Documentary producer searches TV archive for reusable shots of Berlin in the 1920s
- Student searches Web for new music video
- Your mother searches home videos for shots of daughter playing with family pet
- Voter looks for video of candidate X at recent town hall meeting
- Intelligence analyst searches multilingual open source video for background info on location X
- Security personnel searches surveillance video archive for suspicious behavior
- Fan searches for favorite TV show episode
- 10-yr old looks for video of tigers for school report
- Doctor searches echocardiogram videos for instances like example
- You want something to make you laugh
Research areas:
- Human-computer interaction: human visual capabilities, expert vs novice, text/image/concept querying, visualization, …
- Information retrieval: indexing, query typing, concept selection, weighting, ranking, pos/neg relevance feedback, metadata, …
- Machine vision: segmentation, keypoints, SIFT, classifier fusion, face recognition, …
- Machine learning: SVM, GMM, graphical models, boosting, …
- Metrology: metrics, data, task definition, ground truth, significance, …
- …
[Diagram label: TRECVID]

Finding meaning in text (words) versus images (pixels)
Hurricane Andrew which hit the Florida coast south of Miami in late August 1992 was at the time the most expensive disaster in US history. Andrew's damage in Florida cost the insurance industry about $8 billion. There were fifteen deaths, severe property damage, 1.2 million homes were left without electricity, and in Dade county alone 250,000 were left homeless.

One image/video – many different (changing) views of content
Creator's keywords: "stupid sister"
Possible content keywords, tags: women, pigeons, plaza, buildings, outdoors, daytime, running, falling, clapping, …

One person/thing/location – many different (changing) appearances

Can multimedia features serve as "words"?
Low-level
- Color
- Texture
- Shape
High-level
- 449 annotated LSCOM features
- 39 LSCOM-Lite
- TRECVID 2009: Classroom, Chair, Infant, Traffic intersection, Doorway, Airplane-flying, Person-playing-a-musical-instrument, Bus, Person-playing-soccer, Cityscape, Person-riding-a-bicycle, Telephone, Person-eating, Demonstration-Or-Protest, Hand, People-dancing, Nighttime, Boat-Ship, Female-human-face-closeup, Singing
Text from
- speech
- video OCR
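To make the "low-level" end of this list concrete, the following is a minimal Python sketch of one common color feature: a quantized RGB histogram for a keyframe, compared with histogram intersection. It is an illustration only; the function names, the bin count, and the representation of a frame as a list of (r, g, b) tuples are assumptions made here, not something the slides specify.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (r, g, b), each channel in 0..255 (assumed layout)


def color_histogram(pixels: Iterable[Pixel], bins: int = 4) -> List[float]:
    """Return a normalized RGB histogram with bins**3 cells."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    counts = [0] * (bins ** 3)
    total = 0
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bin index in 0..bins-1.
        ri, gi, bi = r * bins // 256, g * bins // 256, b * bins // 256
        counts[ri * bins * bins + gi * bins + bi] += 1
        total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity in [0, 1] for two normalized histograms of equal length."""
    return sum(min(a, b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    frame_a = [(0, 0, 0), (255, 255, 255)]  # half black, half white
    frame_b = [(0, 0, 0), (10, 10, 10)]     # all near-black
    ha = color_histogram(frame_a, bins=2)
    hb = color_histogram(frame_b, bins=2)
    print(histogram_intersection(ha, hb))
```

A histogram like this discards all spatial layout, which is one reason such features are far from "words" with stable meaning.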

LSCOM feature sample
Parade; Exiting_Car; 002 – Handshaking; 003 – Running; Airplane_Crash; 005 – Earthquake; Demonstration_Or_Protest; People_Crying; Airplane_Takeoff; Airplane_Landing; Helicopter_Hovering; 011 – Golf; 012 – Walking; 013 – Singing; 014 – Baseball; 015 – Basketball; 016 – Football; 017 – Soccer; 018 – Tennis; Speaking_To_Camera; 020 – Riot; Natural_Disasters; 022 – Tornado; Ice_Skating; 024 – Snow; Flood; 026 – Skiing; 027 – Talking; 028 – Dancing; Car_Crash; 030 – Funeral; 031 – Gymnastics; Rocket_Launching; 033 – Cheering; 034 – Greeting; 035 – Throwing; 036 – Shooting; Address_Or_Speech; Bomber_Bombing; Celebration_Or_Party; 040 – Airport; 041 – Barn; 042 – Castle; 043 – College; 044 – Courthouse; Fire_Station; Gas_Station; 047 – Grain_Elevator; 048 – Greenhouse; 049 – Hangar; 050 – Hospital; 051 – Hotel; House_Of_Worship; Police_Station; Power_Plant; Processing_Plant; 056 – School; Shopping_Mall; 058 – Stadium; 059 – Supermarket; Airport_Or_Airfield; 061 – Aqueduct; 062 – Avalanche; River_Bank; Aircraft_Cabin; Still_Image_Composition_May_Include_Text; Stock_Exchange; 812 – Stockyard; Storage_Tanks; Store_Outside; Street_Signs; Street_Vendor; Students_Schoolkids; 818 – Suitcases; 819 – Surgeons; 820 – Sword; 821 – Synagogue; 822 – Tailor; 823 – Tanneries; Taxi_Driver; 825 – Teacher; Team_Organized_Group; 827 – Technicians; 828 – Teenagers; 829 – Temples; 830 – Terrorist; Text_Only_Artificial_Bkgd; Thatched_Roof_Buildings; 833 – Theater; 834 – Toddlers; Town_Halls; Town_Squares; 837 – Townhouse; 838 – Tractor; Traffic_Cop; Train_Station; Tribal_Chief; 842 – Twilight; 843 – Uav; Vacationer_Tourist; 845 – Vandal; 846 – Veterinarian; 847 – Viaducts; 848 – Vineyards; 849 – Voter; Waiter_Waitress; Water_Mains; 852 – Windmill; Wooden_Buildings; Worker_Laborer

Simulation study suggests ….
"… 'concept-based' video retrieval with fewer than 5000 concepts, detected with minimal accuracy of 10% mean average precision is likely to provide high accuracy results, comparable to text retrieval on the web, in a typical broadcast news collection." * ?
* Alexander Hauptmann, Rong Yan, Wei-Hao Lin, Michael Christel, and Howard Wactlar. Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News. IEEE Transactions on Multimedia. Vol. 9, No. 5. August 2007 pp
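The accuracy threshold in the quotation is stated as mean average precision (MAP). As a reminder of what that number measures, here is a minimal Python sketch of non-interpolated average precision for one topic and its mean over topics. The function names, shot identifiers, and dictionary layout are illustrative assumptions; this is not TRECVID's scoring code, which relies on dedicated tools such as the sample_eval.pl script credited earlier.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked_shots: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision for a single topic.

    Precision is sampled at each rank where a relevant shot appears, and the
    sum is divided by the total number of relevant shots, so relevant shots
    that are never returned count as zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, Sequence[str]], truth: Dict[str, Set[str]]
) -> float:
    """Mean of per-topic average precision over every topic in `truth`."""
    if not truth:
        return 0.0
    total = sum(average_precision(runs.get(t, []), truth[t]) for t in truth)
    return total / len(truth)


if __name__ == "__main__":
    runs = {"t1": ["s1", "s2", "s3", "s4"], "t2": ["s1", "s2"]}
    truth = {"t1": {"s1", "s3"}, "t2": {"s2", "s9"}}
    print(mean_average_precision(runs, truth))
```

Because the divisor is the number of relevant shots rather than the number retrieved, a detector can rank its few hits well and still score low, which helps explain why a MAP of 10% is a meaningful bar.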

A generic TRECVID search system (based on Snoek and Worring 2008 **)
[Diagram. Component labels: Basic Concept Detection; Feature Fusion; Classifier Fusion; Modeling Relations; Best of Selection; Shot-segmented video; Database; SEARCHER; Query results combination; Query Prediction; Learning from the searcher; Visualization; Query Methods; Information need; Query requests]
** Cees G. M. Snoek and Marcel Worring. Concept-Based Video Retrieval. In Foundations and Trends in Information Retrieval, Vol. 2, No. 4 (2008)

Innovative search interfaces … U. Amsterdam MediaMill

Some results
Keyframes from top 20 clips returned by a system to query for "shots of person seated at computer"

Variation in Average Precision by topic
[Chart. Topic labels: Dogs walking …; Printed, typed … text …; Closeup of hand writing …]
Crowds of people (270), Building entrance (278), and People at desk with computer (287) each had an automatic max better than the interactive max.

Observations, questions …
- One solution will not fit all. Investigations/discussion of video search must be related to the searcher's specific needs/capabilities/history and to the kinds of data being searched.
- The enormous and growing amounts of video require extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail.
- TRECVID participants have explored some automatic approaches to tagging and use of those tags in automatic and interactive search systems on a couple of sorts of video.
- Much has been learned, some results may already be useful, but most of the territory is still unexplored.

Observations, questions …
Within the focus of TRECVID experiments …
- Multiple information sources (text, audio, video), each errorful, can yield better results when combined than when used alone …
- A human in the loop in search still makes an enormous difference.
- Text from speech via automatic speech recognition (ASR) is a powerful source of information, but:
  - Its usefulness varies by video genre
  - Not everything/everyone in a video is talked about, "in the news"
  - Audible mentions are often offset in time from visibility
  - Not all languages have good ASR
- Machine learning approaches to tagging
  - yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data
  - but will they work well enough to be useful on highly heterogeneous video?
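The point about combining several errorful sources can be illustrated with a simple late-fusion sketch in Python: each source scores shots independently, scores are min-max normalized per source, and a weighted sum gives the fused ranking. This is one generic scheme chosen for illustration, not the method of any particular TRECVID system; the source names, weights, and score values are made up.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, float]  # shot id -> raw score from one source


def min_max(scores: Scores) -> Scores:
    """Rescale one source's scores to [0, 1]; a constant source maps to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {shot: 0.0 for shot in scores}
    return {shot: (s - lo) / (hi - lo) for shot, s in scores.items()}


def fuse(sources: Dict[str, Scores], weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores, highest fused score first."""
    fused: Dict[str, float] = {}
    for name, scores in sources.items():
        w = weights.get(name, 0.0)  # sources without a weight are ignored
        for shot, s in min_max(scores).items():
            fused[shot] = fused.get(shot, 0.0) + w * s
    # Sort by descending score, breaking ties by shot id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sources = {
        "asr_text": {"a": 2.0, "b": 1.0, "c": 0.0},  # hypothetical text-match scores
        "visual": {"a": 0.1, "b": 0.9, "c": 0.5},    # hypothetical detector scores
    }
    print(fuse(sources, {"asr_text": 0.5, "visual": 0.5}))
```

In this toy input, shot "b" ranks first even though neither source alone is decisive about it, which is the effect the slide describes. Choosing the weights per query is a separate, harder problem.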

Observations, questions …
Within the focus of TRECVID experiments …
- A hierarchy of automatically derived features can help bridge the gap between pixels and meaning and can assist search, but problems abound:
  - What is the right set of features for a given application?
  - Given a query, how do you automatically decide which specific features to use?
  - Creating quality training data, even with active learning, is very expensive
- Searchers (experts and non-experts) will use more than text queries if available: concepts, visual similarity, temporal browsing, positive and negative relevance feedback, …
- Processing video using a sample of more than one frame per shot yields better results but quickly pushes common hardware configurations to their limits
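The question of how to pick features for a query has no settled answer here, but a deliberately naive baseline helps show what the problem looks like: match query words against the words in concept names. The Python sketch below does only that. The stop-word list, the scoring rule, and the function names are assumptions for illustration; real systems use far richer evidence than name overlap.

```python
import re
from typing import List, Set, Tuple

# Small illustrative stop list; an assumption, not taken from any TRECVID system.
STOP_WORDS: Set[str] = {
    "find", "shots", "of", "a", "an", "the", "or", "and",
    "in", "on", "with", "one", "more",
}


def tokens(text: str) -> Set[str]:
    """Lowercased alphabetic tokens, so 'Boat-Ship' and 'Boat_Ship' both split."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def select_concepts(query: str, concepts: List[str]) -> List[Tuple[str, float]]:
    """Rank concepts by the fraction of their name words found in the query."""
    query_tokens = tokens(query)
    scored = []
    for concept in concepts:
        concept_tokens = tokens(concept)
        overlap = len(query_tokens & concept_tokens)
        if overlap:
            scored.append((concept, overlap / len(concept_tokens)))
    # Best match first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    concepts = ["Telephone", "Person-playing-soccer", "Boat-Ship", "Talking"]
    query = "Find shots of a person talking on a telephone."
    print(select_concepts(query, concepts))
```

Note that this baseline also selects "Person-playing-soccer" for the telephone query, because the word "person" matches. That kind of spurious match is exactly why automatic feature selection is listed as an open problem.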

Observations, questions …
Within the focus of TRECVID experiments …
- TRECVID has only just started looking at combining automatically derived and manually provided evidence in search
  - Systems have been using externally annotated video (e.g. Flickr) but results are not conclusive
  - Internet Archive video will provide titles, keywords, descriptions
  - Where in the Panofsky hierarchy are the donors' descriptions? If very personal, does that mean less useful for other people?
- Need observational studies of real searching of various sorts using current functionality and identifying unmet needs
- Need more access for researchers to much more multimedia data of varying kinds, mixtures, with and without human annotation

Observations, questions …
Time to take some of the ideas developed in the laboratory out for small-scale testing with real users with real needs and real video collections?