Video Google: A Text Retrieval Approach to Object Matching in Videos
Josef Sivic and Andrew Zisserman, ICCV 2003
Presented by: Indriyati Atmosukarto

Motivation  Retrieve key frames and shots of video containing particular object with ease, speed and accuracy with which Google retrieves web pages containing particular words  Investigate whether text retrieval approach is applicable to object recognition  Visual analogy of word: vector quantizing descriptor vectors

Benefits  Matches are pre-computed so at run time frames and shots containing particular object can be retrieved with no delay  Any object (or conjunction of objects) occurring in video can be retrieved even though there was no explicit interest in object when descriptors were built

Text Retrieval Approach  Documents are parsed into words  Words represented by stems  Stop list to reject common words  Remaining words assigned unique identifier  Document represented by vector of weighted frequency of words  Vectors organized in inverted files  Retrieval returns documents with closest (angle) vector to query

Viewpoint Invariant Description
• Two types of viewpoint-covariant regions are computed for each frame: Shape Adapted (SA) and Maximally Stable (MS)
• The two detectors fire on different image areas and provide complementary representations of a frame
• Regions are described at twice their originally detected size to be more discriminating

Shape Adapted (SA) Regions
• Elliptical shape adaptation about an interest point
• The ellipse centre, scale, and shape are determined iteratively
• Scale: local extremum (across scales) of the Laplacian
• Shape: maximize intensity-gradient isotropy over the elliptical region
• SA regions tend to be centred on corner-like features

Maximally Stable (MS) Regions
• Based on an intensity watershed image segmentation
• Select areas that stay approximately stationary as the intensity threshold is varied
• Correspond to blobs of high contrast with respect to their surroundings (see the sketch below)
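A hedged sketch of MS-style region detection using OpenCV's MSER detector; the image path is a placeholder and the paper's exact detector settings are not reproduced here:

```python
# Detecting Maximally Stable (Extremal) Regions with OpenCV.
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
mser = cv2.MSER_create()            # default thresholds; tune delta/min_area per dataset
regions, bboxes = mser.detectRegions(gray)
print(f"{len(regions)} maximally stable regions")
# Each region is a pixel set; fitting an ellipse to it yields the elliptical
# affine region that is then described with SIFT (next slide).
```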

Feature Descriptor
• Each elliptical affine-invariant region is represented by a 128-dimensional vector using the SIFT descriptor
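A hedged illustration of 128-D SIFT descriptors with OpenCV (4.4+, where SIFT ships in the main build). Note the paper computes SIFT on its affine-normalized elliptical regions, whereas OpenCV's detector below uses its own circular keypoints, so this only illustrates the descriptor itself:

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
print(descriptors.shape)   # (num_regions, 128)
```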

Noise Removal  Information aggregated over sequence of frames  Regions detected in each frame tracked using simple constant velocity dynamical model and correlation  Region not surviving more than 3 frames are rejected  Estimate descriptor for region computed by averaging descriptors throughout track

Noise Removal (figure: a region tracked over 70 frames)

Visual Vocabulary  Goal: vector quantize descriptors into clusters (visual words)  When new frame observed, descriptor of new frame assigned nearest cluster, generating matches for all frames

Visual Vocabulary  Implementation: K-Means clustering  Regions tracked through contiguous frames  Mean vector descriptor x_i computed for each i regions  Subset of 48 shots selected  Distance function: Mahalanobis  6000 SA clusters and MS clusters

Visual Vocabulary

Visual Indexing  Apply weighting to vector components  Weighting: term frequency-inverse document frequency (tf-idf)  Vocabulary k words, each doc represented by k- vector V d = (t 1,…,t i,…,t k ) T where n id = # of occurences of word i in doc d n d = total # of words in doc d n i =# of occurences of word I in db N = # of doc in db

Experiments - Setup  Goal: match scene locations within closed world of shots  Data:164 frames from 48 shots taken at 19 different 3D locations; 4-9 frames from each location

Experiments - Retrieval  Entire frame is query  Each of 164 frames as query region in turn  Correct retrieval: other frames which show same location  Retrieval performance: average normalized rank of relevant images N rel = # of relevant images for query image N = size of image set R i = rank of ith relevant image

Experiments - Results

Experiments - Results

• Precision = # relevant frames retrieved / total # of frames retrieved
• Recall = # relevant frames retrieved / total # of relevant frames in the database
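A toy check of the two definitions (all numbers hypothetical):

```python
# 8 frames retrieved, 5 of them relevant, 10 relevant frames exist in total.
retrieved, relevant_retrieved, relevant_total = 8, 5, 10
precision = relevant_retrieved / retrieved        # 0.625
recall = relevant_retrieved / relevant_total      # 0.5
```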

Stop List  Top 5% and bottom 10% of frequent words are stopped

Spatial Consistency  Matched region in retrieved frames have similar spatial arrangement to outlined region in query  Retrieve frames using weighted frequency vector and re-rank based on spatial consistency

Spatial Consistency  Search area of 15 nearest neighbors of each match cast a vote for the frame  Matches with no support are rejected  Total number of votes determine rank circular areas are defined by the fifth nearest neighbour and the number of votes cast by the match is three.

Inverted File  Entry for each visual word  Store all matches : occurences of same word in all frames

More Results

Future Works  Lack of visual descriptors for some scene types  Define object of interest over more than single frame  Learning visual vocabularies for different scene types  Latent semantic indexing for content  Automatic clustering to find principal objects throughout movie

Demo  h/vgoogle/how/method/method_a.html h/vgoogle/how/method/method_a.html  h/vgoogle/index.html h/vgoogle/index.html