Clustering with Application to Fast Object Search

1 Clustering with Application to Fast Object Search
Clustering with Application to Fast Object Search. Computer Vision CS 543 / ECE 549, University of Illinois. Derek Hoiem. 03/08/12.

2 This section. Clustering: grouping together similar points, images, feature vectors, etc. Segmentation: dividing the image into meaningful regions. Segmentation by clustering: K-means and mean-shift. Graph approaches to segmentation: graph cuts and normalized cuts. Segmentation from boundaries: watershed. EM: soft clustering, or parameter estimation with hidden data.

3 Today’s class. Clustering algorithms: K-means, hierarchical clustering, spectral clustering. Application to fast object search.

4 Clustering: group together similar points and represent them with a single token
Key Challenges: 1) What makes two points/images/patches similar? 2) How do we compute an overall grouping from pairwise similarities?

5 Why do we cluster? Summarizing data: look at large amounts of data, patch-based compression or denoising, represent a large continuous vector with the cluster number. Counting: histograms of texture, color, SIFT vectors. Segmentation: separate the image into different regions. Prediction: images in the same cluster may have the same labels.

6 How do we cluster? K-means: iteratively re-assign points to the nearest cluster center. Agglomerative clustering: start with each point as its own cluster and iteratively merge the closest clusters. Spectral clustering: split the nodes in a graph based on assigned links with similarity weights.

7 Clustering for Summarization
Goal: cluster to minimize variance in the data given the clusters, i.e. preserve information: $c^*, \delta^* = \arg\min_{c,\delta} \frac{1}{N} \sum_{j}^{N} \sum_{i}^{K} \delta_{ij} \, (c_i - x_j)^2$, where $c_i$ is a cluster center, $x_j$ is a data point, and $\delta_{ij}$ indicates whether $x_j$ is assigned to $c_i$.

8 K-means algorithm 1. Randomly select K centers
2. Assign each point to the nearest center. 3. Compute the new center (mean) for each cluster.

9 K-means algorithm 1. Randomly select K centers
2. Assign each point to the nearest center. 3. Compute the new center (mean) for each cluster. Go back to step 2 and repeat until convergence.

10 K-means demos General Color clustering

11 K-means. 1. Initialize cluster centers $c^0$; $t = 0$. 2. Assign each point to the closest center: $\delta^t = \arg\min_{\delta} \frac{1}{N} \sum_{j}^{N} \sum_{i}^{K} \delta_{ij} \, \|c_i^{t-1} - x_j\|^2$. 3. Update cluster centers as the mean of their assigned points: $c^t = \arg\min_{c} \frac{1}{N} \sum_{j}^{N} \sum_{i}^{K} \delta_{ij}^{t} \, \|c_i - x_j\|^2$ ($t = t+1$). 4. Repeat steps 2-3 until no points are re-assigned.

12 K-means: Matlab code

function C = kmeans(X, K)
% Initialize cluster centers to be randomly sampled points
[N, d] = size(X);
rp = randperm(N);
C = X(rp(1:K), :);

lastAssignment = zeros(N, 1);
while true
    % Assign each point to the nearest cluster center
    bestAssignment = zeros(N, 1);
    mindist = Inf*ones(N, 1);
    for k = 1:K
        for n = 1:N
            dist = sum((X(n, :) - C(k, :)).^2);
            if dist < mindist(n)
                mindist(n) = dist;
                bestAssignment(n) = k;
            end
        end
    end

    % Break if the assignment is unchanged
    if all(bestAssignment == lastAssignment), break; end
    lastAssignment = bestAssignment;

    % Set each cluster center to the mean of the points assigned to it
    for k = 1:K
        C(k, :) = mean(X(bestAssignment == k, :), 1);
    end
end

13 K-means: design choices
Initialization: randomly select K points as initial cluster centers, or greedily choose K points to minimize the residual. Distance measures: traditionally Euclidean, but could be others. Optimization: will converge to a local minimum; may want to perform multiple restarts.

14 How to choose the number of clusters?
Minimum Description Length (MDL) principle for model comparison: minimize the Schwarz Criterion, also called the Bayes Information Criterion (BIC), which penalizes the sum of squared errors by the model complexity.
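For reference, the general Schwarz/BIC form (the exact weighting used on the slide may differ) is $\mathrm{BIC} = -2 \ln \hat{L} + k \ln n$, where $\hat{L}$ is the model likelihood, $k$ is the number of free parameters (which grows with the number of clusters), and $n$ is the number of data points. Under a fixed-variance spherical Gaussian model, $-2 \ln \hat{L}$ is proportional to the sum of squared errors, so BIC trades reconstruction error against the number of clusters.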

15 How to choose the number of clusters?
Validation set: try different numbers of clusters and look at performance. When building dictionaries (discussed later), more clusters typically work better.

16 How to evaluate clusters?
Generative: how well are points reconstructed from the clusters? Discriminative: how well do the clusters correspond to labels (e.g., purity)? Note: unsupervised clustering does not aim to be discriminative.

17 Common similarity/distance measures
P-norms: City Block (L1), Euclidean (L2), and L-infinity. Mahalanobis: a scaled Euclidean distance. Cosine distance. In each case the distance is measured between two points (feature vectors).
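For concreteness, the standard definitions for two feature vectors $x$ and $y$ are: $d_{L_1}(x,y) = \sum_i |x_i - y_i|$, $d_{L_2}(x,y) = \big(\sum_i (x_i - y_i)^2\big)^{1/2}$, $d_{L_\infty}(x,y) = \max_i |x_i - y_i|$, $d_{\mathrm{Mahalanobis}}(x,y) = \sqrt{(x-y)^\top \Sigma^{-1} (x-y)}$, and $d_{\cos}(x,y) = 1 - \frac{x^\top y}{\|x\|\,\|y\|}$.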

18 Conclusions: K-means. Good: finds cluster centers that minimize conditional variance (good representation of the data); simple to implement, widespread application. Bad: prone to local minima; need to choose K; all clusters have the same parameters (e.g., the distance measure is non-adaptive); can be slow, since each iteration is O(KNd) for N d-dimensional points.

19 K-medoids Just like K-means except
Just like K-means, except: represent the cluster with one of its members rather than the mean of its members, choosing the member (data point) that minimizes cluster dissimilarity. Applicable when a mean is not meaningful, e.g., clustering values of hue or using L-infinity similarity.

20 Building Visual Dictionaries
Sample patches from a database, e.g., 128-dimensional SIFT vectors. Cluster the patches; the cluster centers are the dictionary. Assign a codeword (number) to each new patch, according to the nearest cluster.
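A minimal Matlab sketch of this pipeline, assuming patches is an N x 128 matrix of row-vector SIFT descriptors and reusing the kmeans function from slide 12 (variable names are illustrative):

% Build the dictionary
K = 1000;                          % dictionary size
C = kmeans(patches, K);            % cluster centers are the codewords

% Assign a codeword to a new patch by finding the nearest center
d = sum((C - repmat(newPatch, K, 1)).^2, 2);   % squared distance to each center
[~, codeword] = min(d);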

21 Examples of learned codewords
Most likely codewords for 4 learned “topics”. EM with a multinomial model (problem 3) is used to learn the topics. Sivic et al., ICCV 2005.

22 Application of Clustering
How to quickly find images in a large database that match a given image region?

23 Simple idea: see how many SIFT keypoints are close to SIFT keypoints in each other image (many matches means a likely hit; few or no matches means no hit). But this will be really, really slow!

24 Key idea 1: “Visual Words”
Cluster the keypoint descriptors. Assign each descriptor to a cluster number. What does this buy us? Each descriptor was 128-dimensional floating point; now it is 1 integer (easy to match!). Is there a catch? We need a lot of clusters (e.g., 1 million) if we want points in the same cluster to be very similar, and points that really are similar might end up in different clusters.

25 Key idea 1: “Visual Words”
Cluster the keypoint descriptors Assign each descriptor to a cluster number Represent an image region with a count of these “visual words”

26 Key idea 1: “Visual Words”
Cluster the keypoint descriptors Assign each descriptor to a cluster number Represent an image region with a count of these “visual words” An image is a good match if it has a lot of the same visual words as the query region
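A minimal sketch of that representation in Matlab, assuming words holds the codeword index of every descriptor in the region and K is the dictionary size (illustrative names):

% Bag-of-words histogram for one image region
h = histc(words(:), 1:K);          % count of each visual word
h = h / max(sum(h), 1);            % optionally normalize before comparing regions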

27 Naïve matching is still too slow
Imagine matching 1,000,000 images, each with 1,000 keypoints

28 Key Idea 2: Inverse document file
Like a book index: keep a list of all the words (keypoints) and all the pages (images) that contain them. Rank database images based on the tf-idf measure. tf-idf (Term Frequency - Inverse Document Frequency): $t_{id} = \frac{n_{id}}{n_d} \log \frac{N}{n_i}$, where $n_{id}$ is the number of times word $i$ appears in document $d$, $n_d$ is the number of words in document $d$, $n_i$ is the number of documents that contain word $i$, and $N$ is the total number of documents.
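A hedged Matlab sketch of computing these weights from a word-by-document count matrix W, where W(i,j) is the number of times word i appears in image j (a small dense example for illustration, not the paper's code):

nd  = sum(W, 1);                         % words per document
ni  = sum(W > 0, 2);                     % documents containing each word
N   = size(W, 2);                        % number of documents
tf  = bsxfun(@rdivide, W, max(nd, 1));   % term frequency
idf = log(N ./ max(ni, 1));              % inverse document frequency
T   = bsxfun(@times, tf, idf);           % tf-idf weights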

29 Fast visual search “Video Google”, Sivic and Zisserman, ICCV 2003
“Scalable Recognition with a Vocabulary Tree”, Nister and Stewenius, CVPR 2006.

30 110,000,000 Images in 5.8 Seconds
Very recently, we have scaled the system even further. The system we just demo'd searches a 50 thousand image index in the RAM of this laptop at real-time rates. We have built and programmed a desktop system at home in which we currently have 12 hard drives. We bypass the operating system and read and write the disk surfaces directly. Last week this system clocked in at 110 million images in 5.8 seconds, so roughly 20 million images per second per machine. The images consist of around 10 days of four TV channels, so more than a month of TV. Soon we won't have to watch TV, because we'll have machines doing it for us. (This slide and the following slides are by David Nister.)

31 I will now try to describe the approach with an animation.
The first, offline phase, is to train the vocabulary tree. We use Maximally Stable Extremal Regions by Matas et al. and then we extract SIFT descriptors from the MSER regions. Regarding the feature extraction, we are not really claiming anything other than down-in-the-trenches nose-to-the-grindstone hard work to get good implementations of the state-of-the-art, and since we are all obsessed with novelty I will not spend any time talking about the feature extraction. We believe that the vocabulary tree approach will work well on any of your favorite descriptors.

32 All the descriptor vectors are thrown into a common space.
This is done for many images, the goal being to acquire enough statistics on how natural images distribute points in the descriptor space.

33-37 (image-only slides)

38 We then run k-means on the descriptor space
We then run k-means on the descriptor space. In this setting, k defines what we call the branch-factor of the tree, which indicates how fast the tree branches. In this illustration, k is three. We then run k-means again, recursively on each of the resulting quantization cells. This defines the vocabulary tree, which is essentially a hierarchical set of cluster centers and their corresponding Voronoi regions. We typically use a branch-factor of 10 and six levels, resulting in a million leaf nodes. We lovingly call this the Mega-Voc.
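A rough Matlab sketch of that recursive construction, reusing the kmeans function from slide 12; this illustrates the idea only and is not Nister and Stewenius' implementation:

function node = buildVocabTree(X, k, levels)
% Recursively quantize descriptor space into a tree of cluster centers
node.center = mean(X, 1);
node.children = {};
if levels == 0 || size(X, 1) < k
    return;
end
C = kmeans(X, k);                              % k centers at this level
D = zeros(size(X, 1), k);
for j = 1:k                                    % distance to each center
    D(:, j) = sum(bsxfun(@minus, X, C(j, :)).^2, 2);
end
[~, a] = min(D, [], 2);                        % assign points to cells
for j = 1:k                                    % recurse into each Voronoi cell
    node.children{j} = buildVocabTree(X(a == j, :), k, levels - 1);
    node.children{j}.center = C(j, :);
end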

39-47 (image-only slides)

48 We now have the vocabulary tree, and the online phase can begin.
In order to add an image to the database, we perform feature extraction. Each descriptor vector is now dropped down from the root of the tree and quantized very efficiently into a path down the tree, encoded by a single integer. Each node in the vocabulary tree has an associated inverted file index. Indices pointing back to the new image are then added to the relevant inverted files. This is a very efficient operation that is carried out whenever we want to add an image.
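A minimal sketch of the inverted-file update, assuming invFiles is a cell array with one list of image ids per visual word (leaf node) and leaves holds the leaf ids reached by the new image's descriptors (illustrative names):

for w = unique(leaves(:))'               % each distinct visual word in the image
    invFiles{w}(end + 1) = newImageId;   % record that this image uses word w
end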

49-52 (image-only slides)

53 When we then wish to query on an input image, we quantize the descriptor vectors of the input image in a similar way, and accumulate scores for the images in the database with so-called term frequency inverse document frequency (tf-idf). This is effectively an entropy weighting of the information. The winner is the image in the database that has the most information in common with the input image.
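A simplified voting version of that scoring in Matlab, using a per-word idf weight and the inverted files from the earlier sketch (illustrative names; the paper compares full normalized tf-idf vectors):

scores = zeros(numImages, 1);
for w = unique(queryLeaves(:))'              % visual words in the query image
    ids = invFiles{w};                       % database images containing word w
    scores(ids) = scores(ids) + idf(w);      % entropy-style weighting
end
[~, ranked] = sort(scores, 'descend');       % best matches first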

54 Performance. One of the reasons we managed to take the system this far is that we have worked with rigorous performance testing against a database with ground truth. We now have a benchmark database of images grouped in sets of four images of the same object. We query on one of the images in the group and see how many of the other images in the group make it to the top of the query results. In the paper, we have tested this up to a million images, by embedding the ground-truth database into a database encompassing all the frames of 10 feature-length movies.

55 More words is better: it improves retrieval and it improves speed.
We have used the performance testing to explore various settings for the algorithm. The most important finding is that the number of leaf nodes in the vocabulary tree is very important. In principle, there is a trade-off, because we wish to have discriminative quantizations, which implies many leaf-nodes, while we also wish to have repeatable quantizations, which implies few leaf nodes. However, it seems that in a large practical range, the bigger the better when it comes to the vocabulary. This probably depends on the descriptors used, but is not entirely surprising, since we are trying to divide up a 128-dimensional space in discriminative bins. This is a very useful finding, because it means that we are in a win-win situation: a larger vocabulary improves retrieval performance. It turns out that it also improves the retrieval speed, because our inverted file index becomes much sparser, so we exploit the true power of inverted files better with a larger vocabulary. We also use a hierarchical scoring scheme, which has the nice property that the performance will only level out, not actually go down.

56 Higher branch factor works better (but slower)
We also find that performance improves with the branch factor. This improvement is not dramatic, but it is interesting to note that very low branch factors are somewhat weak, and a branch factor of two results in partitioning the space with planes.

57 Application: Google Goggles

58 Sampling strategies: sparse (at interest points), dense (uniform), random, or multiple complementary interest operators. To find specific, textured objects, sparse sampling from interest points is often more reliable, and multiple complementary interest operators offer more image coverage. For object categorization, dense sampling offers better coverage. [See Nowak, Jurie & Triggs, ECCV 2006] K. Grauman, B. Leibe. Image credits: F-F. Li, E. Nowak, J. Sivic.

59 Can we be more accurate? So far, we treat each image as containing a “bag of words”, with no spatial information. Which candidate matches the query better?

60 Can we be more accurate? So far, we treat each image as containing a “bag of words”, with no spatial information Real objects have consistent geometry

61 Final key idea: geometric verification
RANSAC for an affine transform. Repeat N times: randomly choose 3 matching pairs, estimate the affine transformation, then predict the remaining points and count the “inliers”.
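A hedged Matlab sketch of this verification loop for putative matches p1 and p2 (each N x 2, with p1(i,:) matched to p2(i,:)); the iteration count and inlier threshold are illustrative placeholders, not values from the lecture:

bestInliers = 0;
bestA = [];
for iter = 1:500
    s = randperm(size(p1, 1), 3);                  % 3 random correspondences
    A = [p1(s, :), ones(3, 1)] \ p2(s, :);         % 3x2 affine parameters
    pred = [p1, ones(size(p1, 1), 1)] * A;         % map all query points
    inliers = sum(sum((pred - p2).^2, 2) < 5^2);   % count points within 5 pixels
    if inliers > bestInliers
        bestInliers = inliers;
        bestA = A;
    end
end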

62 Application: Large-Scale Retrieval
Query results on 5K images (demo available for 100K). K. Grauman, B. Leibe. [Philbin et al., CVPR 2007]

63 Video Google System: collect all words within the query region; use the inverted file index to find relevant frames; compare word counts; perform spatial verification. Sivic & Zisserman, ICCV 2003. Demo available online. Retrieved frames. K. Grauman, B. Leibe.

64 Building Visual Dictionaries
Sample patches from a database, e.g., 128-dimensional SIFT vectors. Cluster the patches; the cluster centers are the dictionary. Assign a codeword (number) to each new patch, according to the nearest cluster.

65 Examples of learned codewords
Most likely codewords for 4 learned “topics”. EM with a multinomial model (problem 3) is used to learn the topics. Sivic et al., ICCV 2005.

66 Learning Visual Dictionaries
Inpainting Application

67 Agglomerative clustering

68-71 Agglomerative clustering (image-only animation frames)

72 Agglomerative clustering
How to define cluster similarity? Average distance between points, maximum distance, minimum distance, or the distance between means or medoids. How many clusters? Clustering creates a dendrogram (a tree); threshold based on a maximum number of clusters or on the distance between merges.
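A minimal Matlab sketch of agglomerative clustering using the Statistics Toolbox (assuming it is available); average-link merging, cut at a fixed number of clusters:

D = pdist(X);                        % pairwise Euclidean distances
Z = linkage(D, 'average');           % merge closest clusters, average-link
labels = cluster(Z, 'maxclust', 5);  % cut the dendrogram into 5 clusters
dendrogram(Z);                       % visualize the merge tree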

73 Agglomerative clustering demo

74 Conclusions: Agglomerative Clustering
Good: simple to implement, widespread application; clusters have adaptive shapes; provides a hierarchy of clusters. Bad: may have imbalanced clusters; still have to choose the number of clusters or a threshold; need to use an “ultrametric” to get a meaningful hierarchy.

75 Spectral clustering: group points based on links in a graph.

76 Cuts in a graph. Normalized Cut: the raw cut cost encourages splitting out just one node; fix this by normalizing for the size of the segments: $\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{volume}(A)} + \frac{\mathrm{cut}(A,B)}{\mathrm{volume}(B)}$, where $\mathrm{volume}(A)$ is the sum of the costs of all edges that touch $A$. Source: Seitz
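Normalized cuts are usually solved via a spectral relaxation; here is a hedged Matlab sketch for a symmetric affinity matrix W (a simplification of the idea, not the full segmentation pipeline; assumes every node has at least one edge):

d = sum(W, 2);                              % node degrees
Dhalf = diag(1 ./ sqrt(d));
L = eye(size(W, 1)) - Dhalf * W * Dhalf;    % normalized graph Laplacian
[V, E] = eig(L);
[~, order] = sort(diag(E));                 % sort eigenvalues ascending
v2 = V(:, order(2));                        % second-smallest eigenvector
groupA = v2 > 0;                            % threshold it to split the graph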

77 Normalized cuts for segmentation

78 Visual PageRank Determining importance by random walk
What’s the probability that you will randomly walk to a given node? Create an adjacency matrix based on visual similarity; the edge weights determine the probability of transition. This gives a way to select representative samples. Jing & Baluja, 2008.
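A hedged power-iteration sketch of that random walk over a visual-similarity adjacency matrix W (assumes every node has at least one neighbor; the damping factor and iteration count are illustrative):

n = size(W, 1);
P = bsxfun(@rdivide, W, sum(W, 2));   % row-normalized transition probabilities
r = ones(n, 1) / n;                   % start from a uniform distribution
alpha = 0.85;                         % damping (teleport) factor
for iter = 1:100
    r = alpha * (P' * r) + (1 - alpha) / n;   % one random-walk step
end
% r now ranks nodes by how often the walk visits them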

79 Which algorithm to use? Quantization/Summarization: K-means
Aims to preserve the variance of the original data; can easily assign a new point to a cluster. Examples: a summary of 20,000 photos of Rome using “greedy k-means”; quantization for computing histograms.

80 Which algorithm to use? Image segmentation: agglomerative clustering
More flexible with distance measures (e.g., can be based on boundary prediction) Adapts better to specific data Hierarchy can be useful

81 Which algorithm to use? Image segmentation: spectral clustering
Can provide more regular regions. Spectral methods are also used to propagate global cues (e.g., global Pb).

82 Things to remember: K-means is useful for summarization, building dictionaries of patches, and general clustering. Fast object retrieval uses visual words and an inverted index. Agglomerative clustering is useful for segmentation and general clustering. Spectral clustering is useful for determining relevance, summarization, and segmentation.

83 Next class: Gestalt grouping; image segmentation; mean-shift segmentation; watershed segmentation.

