Carolina Galleguillos, Brian McFee, Serge Belongie, Gert Lanckriet Computer Science and Engineering Department Electrical and Computer Engineering Department.

Multi-Class Object Localization by Combining Local Contextual Interactions. Carolina Galleguillos, Brian McFee, Serge Belongie, Gert Lanckriet. Computer Science and Engineering Department, Electrical and Computer Engineering Department, University of California, San Diego.

Outline Introduction, Multi-Class Multi-Kernel Approach, Contextual Interactions, Experiments & Results, Conclusion

Introduction Incorporating contextual cues into object localization can greatly improve accuracy over models that use appearance features alone. Context considers information from the neighborhood of an object, such as pixel, region, and object interactions.

Introduction In this work, we present a novel framework for object localization that efficiently and effectively combines different levels of interaction. We develop a multiple kernel learning algorithm that integrates appearance features with pixel and region interaction data. The result is a unified similarity metric optimized for nearest neighbor classification. Object-level interactions are modeled by a conditional random field (CRF) to produce the final label prediction.

Multi-Class Multi-Kernel Approach Large Margin Nearest Neighbor, Multiple Kernel Extension, Spatial Smoothing by Segment Merging, Contextual Conditional Random Field

Multi-Class Multi-Kernel Approach In our model, each training image $I$ is partitioned into segments $s_i$ using ground-truth information. Each segment $s_i$ corresponds to exactly one object of class $c_i \in C$, where $C$ is the set of all object labels. These segments are collected into the training set $S$. For each segment $s_i \in S$, we extract several types of features. The $p$th feature space is characterized by a kernel function $k_p : S \times S \to \mathbb{R}$ and an inner product matrix $K^p$ with entries $K^p_{ij} = k_p(s_i, s_j)$.
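The sketch below illustrates building one inner product matrix $K^p$ per feature type. It is a minimal example under stated assumptions. Each feature type is assumed to arrive as a NumPy array with one row per segment. A Gaussian (RBF) kernel stands in for $k_p$ purely for illustration; it is not necessarily the kernel used in this work. The names `rbf_kernel_matrix` and `build_kernels` are hypothetical.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Stand-in kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2) for rows x_i of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def build_kernels(features, gamma=1.0):
    """Hypothetical helper: one n x n matrix K^p per feature space p."""
    return [rbf_kernel_matrix(X, gamma) for X in features]

# Usage: two feature types (e.g., 8-D and 3-D descriptors) for 5 segments.
rng = np.random.default_rng(0)
features = [rng.normal(size=(5, 8)), rng.normal(size=(5, 3))]
kernels = build_kernels(features)
```

Each resulting matrix is symmetric and positive semidefinite, which is what the later metric-learning steps require.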

Multi-Class Multi-Kernel Approach From this collection of kernels, we learn a unified similarity metric over $S$, along with an embedding function $g$ that maps the training set into the learned space. To provide more representative examples for nearest neighbor prediction, we augment the training set $S$ with additional segments. These are obtained by running a segmentation algorithm multiple times on the training images [24]. Ground-truth segmentations are not available at test time, so the test image must be segmented automatically.

Multi-Class Multi-Kernel Approach Multiple Kernel Extension: several different features are extracted, segments are mapped into a unified space, and a soft label prediction is computed. Spatial Smoothing by Segment Merging. Contextual Conditional Random Field: predicts the final labeling of each segment.

Large Margin Nearest Neighbor Our classification algorithm is based on $k$-nearest neighbor prediction. We apply the Large Margin Nearest Neighbor (LMNN) algorithm to optimally distort the features for nearest neighbor prediction [35]. Neighbors are selected using the learned Mahalanobis distance metric $W$, where $W$ is a positive semidefinite (PSD) matrix: $d(s_i, s_j) = (s_i - s_j)^\top W (s_i - s_j)$.
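The following sketch shows $k$-nearest neighbor prediction under a Mahalanobis metric $W$ and how $W$ changes which neighbor is selected. It is a minimal illustration, not the authors' code. Segments are assumed to be plain feature vectors, and the function names are hypothetical.

```python
import numpy as np

def mahalanobis_sq(x, y, W):
    """Squared Mahalanobis distance (x - y)^T W (x - y) for a PSD matrix W."""
    diff = x - y
    return float(diff @ W @ diff)

def knn_predict(x, X_train, y_train, W, k=3):
    """Majority vote among the k training points nearest to x under metric W."""
    dists = np.array([mahalanobis_sq(x, xi, W) for xi in X_train])
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [1.0, 5.0]])
y_train = np.array([0, 1])
x = np.array([0.9, 0.0])
# With W = I the first point is nearest. With W = diag(1, 0) the second
# coordinate is ignored, so the second point becomes nearest.
```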

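Large Margin Nearest Neighbor $W$ is trained so that, for each training segment, neighboring segments (in feature space) with differing labels are pushed away by a large margin. This is achieved by solving the following semidefinite program: $\min_{W \succeq 0,\, \xi \ge 0} \sum_{i,j} d(s_i, s_j) + \beta \sum_{i,j,l} \xi_{ijl}$ subject to $d(s_i, s_l) - d(s_i, s_j) \ge 1 - \xi_{ijl}$ for all $i, j, l$. Here $s_i$ and $s_j$ have similar (identical) labels, $s_i$ and $s_l$ have dissimilar labels, $\beta$ is a slack parameter, and $\xi_{ijl}$ is a slack variable.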
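The sketch below evaluates the LMNN objective for a fixed $W$. For fixed $W$, each slack $\xi_{ijl}$ is minimized at the hinge value $\max(0, 1 + d(s_i, s_j) - d(s_i, s_l))$. It is a minimal illustration under one stated simplification: every same-label pair $(i, j)$ is treated as a target pair. The original LMNN restricts $j$ to the $k$ nearest same-label neighbors of $i$. The solver itself, which optimizes $W$ over the PSD cone, is omitted. `lmnn_objective` is a hypothetical name.

```python
import numpy as np

def lmnn_objective(X, y, W, beta=1.0):
    """LMNN objective for fixed PSD W, with each slack at its optimal (hinge) value.

    Simplification: all same-label pairs (i, j) are treated as target pairs.
    """
    diffs = X[:, None, :] - X[None, :, :]              # (n, n, d) pairwise differences
    D = np.einsum("ijd,de,ije->ij", diffs, W, diffs)   # D[i, j] = d_W(s_i, s_j)
    n = len(y)
    pull, push = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i == j or y[i] != y[j]:
                continue
            pull += D[i, j]                            # pull same-label pairs together
            for l in range(n):
                if y[l] != y[i]:
                    # optimal slack xi_ijl = max(0, 1 + d(i, j) - d(i, l))
                    push += max(0.0, 1.0 + D[i, j] - D[i, l])
    return pull + beta * push
```

On well-separated data the push term vanishes. With $W = 0$, every differently labeled point violates the margin by exactly 1.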

Large Margin Nearest Neighbor A linear projection matrix $L$ can be recovered from $W$ by its spectral decomposition, so that $W = L^\top L$: $W = V \Lambda V^\top$ and $L = \Lambda^{1/2} V^\top$. Here $V$ contains the eigenvectors of $W$, and $\Lambda$ is a diagonal matrix containing the eigenvalues.
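This decomposition can be checked numerically. The following minimal sketch uses NumPy's symmetric eigensolver. The function name is hypothetical, and tiny negative eigenvalues from round-off are clipped to zero before the square root is taken.

```python
import numpy as np

def projection_from_metric(W):
    """Return L such that W = L^T L, using the spectral decomposition W = V diag(lam) V^T."""
    lam, V = np.linalg.eigh(W)          # eigenvalues (ascending) and orthonormal eigenvectors
    lam = np.clip(lam, 0.0, None)       # guard against tiny negative round-off
    return np.diag(np.sqrt(lam)) @ V.T  # L = Lambda^{1/2} V^T
```

Distances under $W$ then equal squared Euclidean distances after projecting with $L$: $(x - y)^\top W (x - y) = \lVert Lx - Ly \rVert^2$.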

Large Margin Nearest Neighbor Although the learned projection is linear, the algorithm can be kernelized [28] to effectively learn non-linear feature transformations. To kernelize the algorithm, each segment $s_i$ is represented by its corresponding column $K_i$ of the kernel matrix $K$, and a regularization term $\operatorname{tr}(WK)$ is introduced. The embedding function then takes the form $g(s) = L K_s$, where $K_s = [k(s_1, s), \dots, k(s_n, s)]^\top$.
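The sketch below implements the kernelized embedding $g(s) = L K_s$. It is a minimal example, not the authors' implementation. The kernel is an RBF stand-in, and `rbf`, `kernel_column` and `embed` are hypothetical names. For a training segment $s_i$, $K_{s_i}$ is simply column $i$ of the Gram matrix.

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    """Stand-in kernel k(x, z); the slide does not fix the kernel form."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

def kernel_column(s, train, kernel=rbf):
    """K_s = [k(s_1, s), ..., k(s_n, s)]^T for a new or training segment s."""
    return np.array([kernel(si, s) for si in train])

def embed(s, train, L, kernel=rbf):
    """Kernelized embedding g(s) = L K_s."""
    return L @ kernel_column(s, train, kernel)
```

Distances between embeddings equal $(K_i - K_j)^\top W (K_i - K_j)$ with $W = L^\top L$. The regularizer $\operatorname{tr}(WK)$ can then be computed as `np.trace(W @ K)`.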

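Multiple Kernel Extension To effectively integrate different types of feature descriptions, we learn a linear projection $L^p$ from each kernel's feature space. We define the combined distance between two points as the sum of the distances in each (transformed) space. Over $m$ kernels, this is expressed algebraically as $d(s_i, s_j) = \sum_{p=1}^{m} (K^p_i - K^p_j)^\top W^p (K^p_i - K^p_j)$, with $W^p = L^{p\top} L^p$. The regularization term $\operatorname{tr}(WK)$ is similarly extended to the sum $\sum_{p=1}^{m} \operatorname{tr}(W^p K^p)$. The multiple-kernel embedding function then concatenates the per-kernel embeddings: $g(s) = \big[L^1 K^1_s;\ L^2 K^2_s;\ \dots;\ L^m K^m_s\big]$.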
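Summing per-kernel distances is equivalent to taking the squared Euclidean distance between concatenated per-kernel embeddings. The following minimal sketch checks this numerically. It assumes the kernel matrices and projections are already given, and all function names are hypothetical.

```python
import numpy as np

def combined_distance(i, j, kernels, metrics):
    """d(s_i, s_j) = sum_p (K^p_i - K^p_j)^T W^p (K^p_i - K^p_j), using columns of each K^p."""
    total = 0.0
    for K, W in zip(kernels, metrics):
        diff = K[:, i] - K[:, j]
        total += float(diff @ W @ diff)
    return total

def multi_kernel_embedding(i, kernels, projections):
    """Concatenate L^p K^p_i over all kernels p."""
    return np.concatenate([L @ K[:, i] for K, L in zip(kernels, projections)])

def regularizer(kernels, metrics):
    """Multiple-kernel regularizer sum_p tr(W^p K^p)."""
    return sum(float(np.trace(W @ K)) for K, W in zip(kernels, metrics))
```

Each kernel may have its own projection dimension, so the concatenated embedding length is the sum of the row counts of the $L^p$.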

Multiple Kernel Extension Multiple Kernel LMNN (MKLMNN) algorithm:

Multiple Kernel Extension The probability distribution over the labels for a test segment $s'$ is computed from its $k$ nearest neighbors in the learned space. Each neighbor $s_i$ contributes its label $\ell_i$ with a weight that depends on its distance from $g(s')$. To simplify the process, we restrict each $W^p$ to be diagonal. This can be interpreted as learning weightings over $S$ in each feature space.
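The following sketch computes a soft label distribution from the $k$ nearest neighbors in the learned space. It rests on one assumption: each neighbor is weighted by $\exp(-\lVert g(s') - g(s_i) \rVert^2)$. The slide only says the votes are weighted by distance, so this exact weighting is not necessarily the paper's. `soft_label_distribution` is a hypothetical name.

```python
import numpy as np

def soft_label_distribution(g_test, G_train, labels, num_classes, k=5):
    """Distance-weighted k-NN label distribution for one test segment.

    Assumed weighting: exp(-||g(s') - g(s_i)||^2) per neighbor s_i.
    """
    G_train = np.asarray(G_train, dtype=float)
    labels = np.asarray(labels)
    d2 = np.sum((G_train - np.asarray(g_test, dtype=float)) ** 2, axis=1)
    nearest = np.argsort(d2)[:k]
    # Shifting by the smallest distance scales all weights equally, so the
    # normalized result is unchanged, but it prevents underflow to all zeros.
    weights = np.exp(-(d2[nearest] - d2[nearest].min()))
    probs = np.zeros(num_classes)
    np.add.at(probs, labels[nearest], weights)  # accumulate weight per label
    return probs / probs.sum()
```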

Spatial Smoothing by Segment Merging At test time an object may be represented by multiple segments, so some of those segments will contain only partial information about the object. This results in less reliable label predictions. We therefore smooth a segment's label distribution by incorporating information from segments that are likely to come from the same object, which yields an updated label distribution.

Spatial Smoothing by Segment Merging Using the extra segments, we train an SVM classifier to predict when two segments belong to the same object. The ground-truth object annotations tell us when a pair of training segments came from the same object. Given two segments $s_i$ and $s_j$, we compute the following features:

- pixel and region interaction features;
- the overlap between the segment masks;
- the normalized segment centroids;
- the number of segments obtained in the segmentation;
- the Euclidean distance between the two segment centroids.
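The sketch below assembles the pairwise feature vector fed to the "same object?" classifier. It is a hypothetical illustration with several assumptions, none of which the slide specifies:

- the pixel and region interaction descriptors are precomputed and simply concatenated;
- overlap is measured as intersection over union;
- the centroid distance is computed on the normalized centroids.

The SVM training step is omitted.

```python
import numpy as np

def centroid(mask):
    """(row, col) centroid of a boolean mask, normalized by image height and width."""
    rows, cols = np.nonzero(mask)
    h, w = mask.shape
    return np.array([rows.mean() / h, cols.mean() / w])

def pair_features(mask_i, mask_j, ctx_i, ctx_j, num_segments):
    """Hypothetical feature vector for deciding whether two segments should merge.

    ctx_i, ctx_j: precomputed pixel + region interaction descriptors (assumed given).
    Overlap is intersection-over-union (an assumption).
    """
    inter = np.logical_and(mask_i, mask_j).sum()
    union = np.logical_or(mask_i, mask_j).sum()
    overlap = inter / union if union else 0.0
    c_i, c_j = centroid(mask_i), centroid(mask_j)
    dist = np.linalg.norm(c_i - c_j)  # Euclidean distance between normalized centroids
    return np.concatenate([ctx_i, ctx_j, [overlap], c_i, c_j, [num_segments, dist]])
```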

Spatial Smoothing by Segment Merging We construct an undirected graph in which each vertex is a segment. Edges are added between pairs that the classifier predicts should be merged, and each connected group of merged segments forms a new object segment. The smoothed label distribution is the geometric mean of the segment's distribution and its corresponding object's distribution: $\tilde{P}(c \mid s_i) \propto \sqrt{P(c \mid s_i)\, P(c \mid o_i)}$, where $o_i$ is the object segment containing $s_i$.
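The following sketch covers the merge-and-smooth step. Connected components of the merge graph are found with union-find. Each segment's distribution is then combined with its object's distribution by a renormalized geometric mean. The object distributions `P_obj` are assumed to be given; for example, they could come from running the same nearest-neighbor predictor on each merged segment, which the slide does not state. Function names are hypothetical.

```python
import numpy as np

def connected_components(num_nodes, edges):
    """Assign each vertex a component id (0, 1, ...) using union-find."""
    parent = list(range(num_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in edges:
        parent[find(a)] = find(b)
    roots = [find(u) for u in range(num_nodes)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}  # ids in order of first appearance
    return [ids[r] for r in roots]

def smooth(P_seg, P_obj, comp):
    """Geometric mean of each segment's and its object's label distribution, renormalized."""
    P = np.sqrt(P_seg * P_obj[comp])
    return P / P.sum(axis=1, keepdims=True)
```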

Contextual Conditional Random Field Pixel and region interactions can be described by low-level features, but object interactions require a high-level description, e.g., the objects' labels. We follow the soft label prediction with a conditional random field (CRF) that encodes high-level object interactions.

Contextual Conditional Random Field We learn potential functions from object co-occurrences, capturing long-distance dependencies between whole regions of the image and across classes. In our CRF model, the image is treated as a bag of segments $I = \{s_1, \dots, s_q\}$, and $\mathbf{c} = (c_1, \dots, c_q)$ is the vector of labels for the segments in $I$. The final label vector is the value of $\mathbf{c}$ that maximizes $P(\mathbf{c} \mid I)$: $\mathbf{c}^* = \arg\max_{\mathbf{c}} P(\mathbf{c} \mid I)$.
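The sketch below illustrates MAP inference with exhaustive search over all label vectors. It uses a hypothetical score, not the paper's exact CRF. Unary terms are the log smoothed label probabilities, and a pairwise matrix stands in for the co-occurrence potentials. The score sums these over all segment pairs $i < j$.

```python
import itertools
import numpy as np

def crf_map(log_unary, pairwise):
    """Exhaustive MAP labeling for a small, fully connected CRF over segments.

    log_unary[i, c]: log smoothed probability of label c for segment i (assumed unary term).
    pairwise[a, b]:  hypothetical co-occurrence potential for labels a and b.
    Score(c) = sum_i log_unary[i, c_i] + sum_{i<j} pairwise[c_i, c_j].
    """
    m, C = log_unary.shape
    best, best_score = None, -np.inf
    for labels in itertools.product(range(C), repeat=m):
        score = sum(log_unary[i, c] for i, c in enumerate(labels))
        score += sum(pairwise[labels[i], labels[j]]
                     for i in range(m) for j in range(i + 1, m))
        if score > best_score:
            best, best_score = labels, score
    return list(best)
```

Exhaustive search grows as $C^m$, so it only suits toy examples. Practical CRFs use approximate inference instead. In the tests, a pairwise term that favors classes 0 and 1 appearing together flips the second segment's label away from its unary preference.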

Contextual Interactions In this part, we describe the features we use to characterize each level of contextual interaction: pixel level, region level, and object level.

Pixel Level Interaction Pixel level interactions can implicitly capture background contextual information as well as information about object boundaries. We use a new type of contextual source, boundary support.

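Pixel Level Interaction Boundary support is encoded by computing a histogram $H$ over the LAB color values of the pixels lying between 0 and a fixed number of pixels away from the object's boundary. We compute the $\chi^2$-distance between boundary support histograms: $\chi^2(H_i, H_j) = \frac{1}{2} \sum_b \frac{(H_i(b) - H_j(b))^2}{H_i(b) + H_j(b)}$. The pixel interaction kernel is then defined from this distance.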
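The following sketch computes a normalized joint LAB histogram for the boundary band, the $\chi^2$-distance, and a kernel derived from it. It is a minimal illustration under stated assumptions. The kernel form $\exp(-\gamma\,\chi^2)$ is the common exponentiated-$\chi^2$ choice and is assumed here. The bin count and LAB ranges are also hypothetical parameters, and selecting the band pixels is left out.

```python
import numpy as np

def lab_histogram(lab_pixels, bins=4, ranges=((0, 100), (-128, 127), (-128, 127))):
    """Joint LAB histogram of an (N, 3) array of band pixels, normalized to sum to 1."""
    hist, _ = np.histogramdd(lab_pixels, bins=bins, range=ranges)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)

def chi2_distance(h1, h2, eps=1e-12):
    """chi^2(h1, h2) = 1/2 * sum_b (h1[b] - h2[b])^2 / (h1[b] + h2[b])."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

def pixel_kernel(H, gamma=1.0):
    """Assumed kernel form K[i, j] = exp(-gamma * chi^2(H_i, H_j)) over rows of H."""
    n = len(H)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-gamma * chi2_distance(H[i], H[j]))
    return K
```

The small `eps` keeps empty bins (where both histograms are zero) from producing a division by zero; such bins contribute nothing to the sum.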

Region Level Interaction By using large windows around an object, known as contextual neighborhoods [7], regions encode probable geometrical configurations, and capture information from neighboring (parts of) objects.

Region Level Interaction A contextual neighborhood is computed by dilating the bounding box around the object with a disk-shaped structuring element. We model region interactions by computing the gist [31] of the contextual neighborhood, $G_i$. Our region interactions are then represented by a kernel over these gist descriptors.
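Dilating a rectangle by a disk keeps every pixel whose distance to the rectangle is at most the disk's radius; this is the Minkowski sum of the two shapes. The sketch below builds that mask using NumPy only. It is a minimal illustration: the diameter is left as a free parameter because the slide does not give its formula, and the gist computation is omitted. `contextual_neighborhood` is a hypothetical name.

```python
import numpy as np

def contextual_neighborhood(shape, box, diameter):
    """Boolean mask of a bounding box dilated by a disk of the given diameter.

    box = (r0, c0, r1, c1) with inclusive pixel coordinates. A pixel is kept if
    its Euclidean distance to the box is at most diameter / 2.
    """
    r0, c0, r1, c1 = box
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    dr = np.maximum(np.maximum(r0 - rows, rows - r1), 0)  # vertical distance to the box
    dc = np.maximum(np.maximum(c0 - cols, cols - c1), 0)  # horizontal distance to the box
    return dr ** 2 + dc ** 2 <= (diameter / 2.0) ** 2
```

For example, a single-pixel box dilated by a disk of diameter 2 yields the pixel plus its four direct neighbors.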

Object Level Interactions To train the object interaction CRF, we derive semantic context from the co-occurrence of objects within each training image. In the co-occurrence matrix $A$, the entry $A(i, j)$ counts the number of times an object with label $c_i$ appears in a training image together with an object with label $c_j$. Diagonal entries correspond to the frequency of each object in the training set.
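The following sketch builds $A$ from per-image label lists. It is a minimal illustration of one reading of the definition: each image's labels are deduplicated, so $A(i, j)$ counts images containing both classes and $A(i, i)$ counts images containing class $i$. Counting object instances instead would be an equally plausible reading. `cooccurrence_matrix` is a hypothetical name.

```python
import numpy as np

def cooccurrence_matrix(image_labels, num_classes):
    """A[i, j] = number of training images containing both a class-i and a class-j object.

    Labels are deduplicated per image, so A[i, i] counts images that contain class i.
    """
    A = np.zeros((num_classes, num_classes), dtype=int)
    for labels in image_labels:
        present = sorted(set(labels))
        for a in present:
            for b in present:
                A[a, b] += 1
    return A
```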

Experiments Datasets: MSRC and PASCAL 2007. Appearance features: SIFT, self-similarity (SSIM), LAB histogram, and Pyramid of Histogram of Oriented Gradients (PHOG). Context features: GIST and LAB color.

Result Object localization: Mean accuracy results

Result MSRC presents more co-occurrences of object classes per image than PASCAL, providing more information to the object interaction model.

Result Feature combination: Learning the optimal embedding

Result Learned kernel weights

Result Comparison to other models: MSRC, PASCAL 07

Conclusion We have introduced a novel framework that efficiently and effectively combines different levels of local context interactions. Our multiple kernel learning algorithm integrates appearance features with pixel and region interaction data. We obtain significant improvements over current state-of-the-art contextual frameworks. Adding another type of object interaction, such as spatial context [8], could further improve localization accuracy.

Thank you!!!
