A Higher-Level Visual Representation For Semantic Learning In Image Databases. Ismail EL SAYAD, Ph.D. defense, 18/07/2011.


1 A Higher-Level Visual Representation For Semantic Learning In Image Databases
Ismail EL SAYAD, 18/07/2011
Madam President, members of the jury, and my friends: good morning and welcome to my Ph.D. defense presentation.
President: Sophie Tison, Université Lille 1
Reviewers: Philippe Mulhem, Laboratoire d'Informatique de Grenoble; Zhongfei Zhang, State University of New York
Examiner: Bernard Merialdo, Eurecom Sophia-Antipolis
Advisor: Chabane Djeraba. Co-advisor: Jean Martinet

2 Overview
Introduction
Related works
Our approach: Enhanced Bag of Visual Words (E-BOW), Multilayer Semantically Significant Analysis Model (MSSA), Semantically Significant Invariant Visual Glossary (SSIVG)
Experiments: image retrieval, image classification, object recognition
Conclusion and perspectives
Talk briefly about the introduction, then briefly about the different parts of the approach.

3 Motivation
Digital content grows rapidly: personal acquisition devices, broadcast TV, surveillance.
It is relatively easy to store, but useless without automatic processing, classification, and retrieval.
The usual way to solve this problem is to describe images with keywords. This method suffers from subjectivity, text ambiguity, and the lack of automatic annotation.

4 Visual representations
Nowadays, images can be described using their visual content.
Two families: image-based representations and part-based representations.
Image-based representations rely on global visual features extracted over the whole image, such as color, color moments, shape, or texture.

5 Visual representations
Main drawbacks of image-based representations: high sensitivity to scale, pose, lighting condition changes, and occlusions; they cannot capture the local information of an image.
Part-based representations: based on the statistics of features extracted from segmented image regions.

6 Visual representations: part-based representations (bag of visual words)
Compute local descriptors, cluster them in feature space to build a visual word vocabulary (VW1, VW2, VW3, VW4), then represent each image as a histogram of visual word frequencies.
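The pipeline on this slide can be sketched as follows. This is a minimal toy illustration (a k-means quantizer with hypothetical helper names), not the thesis implementation:

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Toy k-means: cluster local descriptors into k visual words."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest center, then recompute means
        dist = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bow_histogram(image_descriptors, centers):
    """Quantize one image's descriptors and count visual word occurrences."""
    dist = np.linalg.norm(image_descriptors[:, None] - centers[None], axis=2)
    return np.bincount(dist.argmin(axis=1), minlength=len(centers))
```

Real systems use many more words and faster nearest-neighbor search, but the histogram-of-assignments idea is the same.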

7 Visual representations: bag of visual words (BOW) drawbacks
Spatial information loss: it records the number of occurrences but ignores positions.
Only keypoint-based intensity descriptors are used: neither shape nor color information is captured.
Feature quantization noisiness: unnecessary and insignificant visual words are generated.

8 Visual representations: bag of visual words (BOW) drawbacks
Low discrimination power: different image semantics are represented by the same visual words (e.g., the same VW1364 in semantically different images).
Low invariance to visual diversity: one image semantic is represented by different visual words (e.g., VW330, VW480, VW148, VW263).

9 Objectives
This work aims at addressing these drawbacks with the following:
Enhanced BOW representation: different local information (intensity, color, shape, ...), the spatial constitution of the image, and an efficient visual word vocabulary structure.
Higher-level visual representation: less noisy, more discriminative, and more invariant to visual diversity.

10 Overview of the proposed higher-level visual representation
Pipeline: set of images → E-BOW (visual word vocabulary building, E-BOW representation) → MSSA model (learning the MSSA model) → SSIVG (SSIVW and SSIVP generation, SSIVG representation).

11 Related works
Spatial Pyramid Matching Kernel (SPM) & sparse coding
Visual phrase & descriptive visual phrase
Visual phrase pattern & visual synset
All the related works are based on the BOW representation; each proposes a different higher-level representation.

12 Spatial Pyramid Matching Kernel (SPM) & sparse coding
Lazebnik et al. [CVPR06], Spatial Pyramid Matching Kernel (SPM): extends the BOW representation by exploiting the spatial information of local regions, computing histograms over a pyramid of spatial levels and resolutions.
Yang et al. [CVPR09], SPM + sparse coding: replaces the k-means quantization in the SPM with sparse coding.
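As an illustration of the SPM idea (not Lazebnik et al.'s code), the following sketch concatenates per-cell visual word histograms over a three-level pyramid; keypoint coordinates are assumed normalized to [0, 1):

```python
import numpy as np

def spatial_pyramid(points, labels, k, levels=3):
    """Concatenate visual word histograms over 1x1, 2x2, 4x4, ... grids."""
    points = np.asarray(points, dtype=float)   # keypoint (x, y) in [0, 1)
    labels = np.asarray(labels)                # visual word id per keypoint
    hists = []
    for level in range(levels):
        cells = 2 ** level
        # cell index of each keypoint at this resolution
        cx = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        for i in range(cells):
            for j in range(cells):
                in_cell = (cx == i) & (cy == j)
                hists.append(np.bincount(labels[in_cell], minlength=k))
    return np.concatenate(hists)
```

For three levels the vector has k * (1 + 4 + 16) entries; SPM additionally weights each pyramid level, which is omitted here for brevity.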

13 Visual phrase & descriptive visual phrase
Zheng and Gao [TOMCCAP08], visual phrase: a pair of spatially adjacent local image patches.
Zhang et al. [ACM MM09], descriptive visual phrase: enhances this approach by selecting descriptive visual phrases from the constructed visual phrases according to the frequencies of their constituent visual word pairs.

14 Visual phrase pattern & visual synset
Yuan et al. [CVPR07], visual phrase pattern: a spatially co-occurring group of visual words.
Zheng et al. [CVPR08], visual synset: a relevance-consistent group of visual words or phrases, in the spirit of the text synset.

15 Comparison of the different enhancements of the BOW
Approaches compared: SPM, SPM + sparse coding, visual phrase, descriptive visual phrase, visual phrase pattern, visual synset, and our approach.
Comparison criteria: considering the spatial location; describing different local information; semantically eliminating ambiguous visual words; efficient structure for storing the visual vocabulary; enhancing low discrimination power; tackling low invariance to visual diversity.
Note: other approaches use an efficient structure only for the visual word vocabulary, whereas we use it for both words and phrases.

16 Our approach
Introduction
Related works
Our approach: Enhanced Bag of Visual Words (E-BOW), Multilayer Semantically Significant Analysis Model (MSSA), Semantically Significant Invariant Visual Glossary (SSIVG)
Experiments
Conclusion and perspectives

17 Enhanced Bag of Visual Words (E-BOW)
Pipeline: set of images → SURF & edge context extraction → feature fusion → hierarchical feature quantization → E-BOW representation (which feeds the MSSA model and the SSIVG).

18 Enhanced Bag of Visual Words (E-BOW): feature extraction
Interest point detection and edge point detection, after color filtering with the vector median filter (VMF).
SURF feature vector extraction at each interest point.
Color feature extraction at each interest and edge point; the color and position vectors are clustered with a Gaussian mixture model (mean µ, covariance Σ, and weight π per component).
Edge context feature vector extraction at each interest point.
Fusion of the SURF and edge context feature vectors; all vectors are collected over the whole image set.
HAC and divisive hierarchical k-means clustering then build the visual word vocabulary.

19 Enhanced Bag of Visual Words (E-BOW): feature extraction (SURF)
SURF is a low-level feature descriptor that describes how the pixel intensities are distributed within a scale-dependent neighborhood of each interest point.
Good at handling serious blurring and image rotation; poor at handling illumination changes; efficient to compute.

20 Enhanced Bag of Visual Words (E-BOW): feature extraction (edge context descriptor)
The edge context descriptor is represented at each interest point as a histogram: 6 bins for the magnitude of the vectors drawn to the edge points, and 4 bins for the orientation angle.

21 Enhanced Bag of Visual Words (E-BOW): feature extraction (edge context descriptor)
This descriptor is invariant to:
Translation: the distribution of the edge points is measured with respect to fixed points.
Scale: the radial distance is normalized by the mean distance over the whole set of points within the same Gaussian.
Rotation: all angles are measured relative to the tangent angle of each interest point.
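A rough sketch of such a descriptor follows. The binning details (normalized radial distance, tangent-relative angle, 6 x 4 = 24 bins) are assumptions for illustration, not the exact thesis formulation:

```python
import numpy as np

def edge_context(interest_pt, edge_pts, tangent=0.0, n_mag=6, n_ang=4):
    """Histogram of the vectors drawn from an interest point to edge points:
    6 magnitude bins x 4 orientation bins, with the invariances above."""
    v = np.asarray(edge_pts, float) - np.asarray(interest_pt, float)
    mag = np.hypot(v[:, 0], v[:, 1])
    mag = mag / mag.mean()                                  # scale invariance
    ang = (np.arctan2(v[:, 1], v[:, 0]) - tangent) % (2 * np.pi)  # rotation
    mbin = np.minimum((mag / mag.max() * n_mag).astype(int), n_mag - 1)
    abin = ((ang / (2 * np.pi)) * n_ang).astype(int) % n_ang
    hist = np.zeros((n_mag, n_ang))
    np.add.at(hist, (mbin, abin), 1)       # one count per edge point
    return hist.ravel()
```

Translation invariance comes for free because only relative vectors are used.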

22 Enhanced Bag of Visual Words (E-BOW): hierarchical feature quantization
The visual word vocabulary is created by clustering the merged features (SURF + edge context, 88-D) in two steps:
Hierarchical agglomerative clustering (HAC): stop clustering at the desired level to obtain k initial clusters (e.g., a cluster at k = 4).
Divisive hierarchical k-means clustering: starting from the k clusters from HAC, the tree is determined level by level, down to some maximum number of levels L, with each division into k parts.
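The divisive step can be sketched as a recursive k-means tree. This is a simplified stand-in for the HAC + divisive hierarchical k-means combination, where leaf nodes play the role of visual words:

```python
import numpy as np

def divisive_tree(features, k=2, max_levels=3, min_size=4, seed=0):
    """Recursively split features with k-means, building a vocabulary tree."""
    rng = np.random.default_rng(seed)

    def kmeans(X, k, iters=15):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(axis=0)
        return labels

    def split(X, level):
        # stop at the maximum depth or when a cluster is too small
        if level == max_levels or len(X) < max(min_size, k):
            return {"center": X.mean(axis=0), "children": []}  # leaf node
        labels = kmeans(X, k)
        return {"center": X.mean(axis=0),
                "children": [split(X[labels == j], level + 1)
                             for j in range(k) if (labels == j).any()]}

    return split(np.asarray(features, dtype=float), 0)
```

Quantizing a new descriptor then means descending the tree, comparing against only k centers per level instead of the whole vocabulary.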

23 Multilayer Semantically Significant Analysis (MSSA) model
Within the overall pipeline, the MSSA stage covers: the generative process, parameter estimation, estimation of the number of latent topics, and estimation of the semantic inference of the visual words.

24 Multilayer Semantically Significant Analysis (MSSA) model: generative process
An image contains different visual aspects grouped under a higher-level aspect (e.g., People).
A topic model that considers this hierarchical structure is needed.

25 Multilayer Semantically Significant Analysis (MSSA) model: generative process
In the MSSA, there are two different latent (hidden) topics:
A high latent topic that represents the high-level aspects.
A visual latent topic that represents the visual aspects.
Graphical model: visual words v, high latent topics h, and images im, with parameters φ, Θ, Ψ, over N words and M images.

26 Multilayer Semantically Significant Analysis (MSSA) model: parameter estimation
This generative process leads to the following conditional probability distribution:
P(v, im) = P(im) Σ_h Σ_z P(v | z) P(z | h) P(h | im),
where z ranges over the visual latent topics and h over the high latent topics.
Following the maximum likelihood principle, the parameters are estimated by maximizing the log-likelihood:
L = Σ_im Σ_v n(v, im) log P(v, im),
where n(v, im) is the number of occurrences of visual word v in image im.
Gaussier et al. [ACM SIGIR05]: maximizing this likelihood can be seen as a Nonnegative Matrix Factorization (NMF) problem under the generalized KL divergence, which yields the objective function to minimize.

27 Multilayer Semantically Significant Analysis (MSSA) model: parameter estimation
The Karush-Kuhn-Tucker (KKT) conditions are used to derive multiplicative update rules that minimize the objective function; each parameter matrix is updated multiplicatively until convergence.
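For intuition, the classical multiplicative updates for NMF under the generalized KL divergence (the Lee-Seung form, to which this estimation is related) look as follows. This is a generic two-factor sketch, not the full MSSA update set:

```python
import numpy as np

def nmf_kl(V, k, iters=100, seed=0, eps=1e-9):
    """Multiplicative updates minimizing the generalized KL divergence
    D(V || WH), keeping W and H nonnegative throughout."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / H.sum(axis=1)            # update basis
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]   # update coefficients
    return W, H
```

Because the updates only multiply by nonnegative factors, nonnegativity is preserved automatically, which is why this style of rule suits probability-like parameters.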

28 Multilayer Semantically Significant Analysis (MSSA) model: number of latent topics estimation
The Minimum Description Length (MDL) principle is used as the model selection criterion: the number of high latent topics, L, and the number of visual latent topics, K, are determined in advance for the model fitting by trading off the log-likelihood against the number of free parameters.
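A minimal sketch of MDL-based selection, assuming the common form of the criterion (negative log-likelihood plus half the number of free parameters times the log of the number of observations); the exact penalty used in the thesis may differ:

```python
import numpy as np

def mdl_score(log_likelihood, n_free_params, n_observations):
    """MDL = -log-likelihood + (p / 2) * log N; smaller is better."""
    return -log_likelihood + 0.5 * n_free_params * np.log(n_observations)

def select_model(candidates, n_observations):
    """candidates: list of (K, log_likelihood, n_free_params) per fitted model.
    Returns the K whose MDL score is smallest."""
    return min(candidates,
               key=lambda c: mdl_score(c[1], c[2], n_observations))[0]
```

Larger K raises the likelihood but also the penalty, so the minimum picks a model that fits well without overfitting.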

29 Semantically Significant Invariant Visual Glossary (SSIVG) representation
Pipeline: from the E-BOW representation and the learned MSSA model, select the SSVWs (SSVW representation), generate the SSVPs (SSVP representation), then apply divisive information-theoretic clustering to obtain the SSIVW and SSIVP representations, which together form the SSIVG representation.

30 Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Word (SSVW)
From the set of visual words, the set of relevant visual topics is estimated using the MSSA; the visual words with significant semantic inference form the set of SSVWs.

31 Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)
SSVP: a higher-level and more discriminative representation built from SSVWs and their inter-relationships.
SSVPs are formed from SSVW sets that satisfy all the following conditions:
They occur in the same spatial context.
They are involved in strong association rules (high support and confidence).
They have the same semantic meaning (high probability related to at least one common visual latent topic).
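The association rule condition can be illustrated as follows, treating each spatial context as a "transaction" of SSVW ids. The helper and any thresholds are hypothetical, not from the thesis:

```python
def support_confidence(transactions, a, b):
    """Support of {a, b} and confidence of the rule a -> b, where each
    transaction is the set of SSVWs occurring in one spatial context."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n if n else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence
```

A candidate pair would be kept only when both values exceed chosen thresholds, i.e., when it forms a strong association rule.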

32 Semantically Significant Invariant Visual Glossary (SSIVG) representation: Semantically Significant Visual Phrase (SSVP)
Example: instances of the same phrases (SSIVP126, SSIVP326, SSIVP304) detected across different images.

33 Semantically Significant Invariant Visual Glossary (SSIVG) representation: invariance problem
Studying the co-occurrence and spatial scatter information makes the image representation more discriminative, but the invariance power of SSVWs and SSVPs is still low.
Analogy with text documents: synonymous words can be clustered into one synonymy set to improve document categorization performance.

34 Semantically Significant Invariant Visual Glossary (SSIVG) representation
SSIVG: a higher-level visual representation composed of two layers:
Semantically Significant Invariant Visual Words (SSIVWs): SSVWs re-indexed after a distributional clustering.
Semantically Significant Invariant Visual Phrases (SSIVPs): SSVPs re-indexed after a distributional clustering.
From the set of SSVWs and SSVPs, the relevant visual topics are estimated using the MSSA, and divisive information-theoretic clustering produces the sets of SSIVWs and SSIVPs that form the SSIVG.

35 Experiments
Introduction
Related works
Our approach
Experiments: image retrieval, image classification, object recognition
Conclusion and perspectives

36 Assessment of the SSIVG representation performance in image retrieval
Dataset: NUS-WIDE; 269,648 images in total (161,789 training, 107,859 test), 81 image categories.
Evaluation criterion: Mean Average Precision (MAP).
The traditional vector space model of information retrieval is adapted: a weighting for the SSIVPs, a spatial weighting for the SSIVWs, and an inverted file structure for indexing.
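The adapted vector space model with an inverted file can be sketched as follows. This uses plain tf-idf scoring; the thesis-specific SSIVP and spatial weightings are omitted:

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """docs: one term list (SSIVW/SSIVP ids) per image.
    Returns term -> {doc_id: tf} postings and a per-term idf table."""
    index = defaultdict(dict)
    for d, terms in enumerate(docs):
        for t in terms:
            index[t][d] = index[t].get(d, 0) + 1
    n = len(docs)
    idf = {t: math.log(n / len(postings)) for t, postings in index.items()}
    return index, idf

def retrieve(query_terms, index, idf):
    """Rank documents by accumulated tf-idf score over the query terms;
    only postings of the query terms are touched, not every document."""
    scores = defaultdict(float)
    for t in query_terms:
        for d, tf in index.get(t, {}).items():
            scores[d] += tf * idf.get(t, 0.0)
    return sorted(scores.items(), key=lambda s: -s[1])
```

The inverted file is what keeps retrieval fast on a collection the size of NUS-WIDE: scoring cost depends on the query terms' posting lists, not on the total number of images.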

37 Assessment of the SSIVG representation performance in image retrieval (results figure)

38 Assessment of the SSIVG representation performance in image retrieval (results figure)

39 Evaluation of the SSIVG representation in image classification
Dataset: MIRFLICKR; 25,000 images in total (15,000 training, 10,000 test), 11 image categories.
Evaluation criterion: classification Average Precision over each class.
Classifiers: SVM with a linear kernel, and the Multiclass Vote-Based Classifier (MVBC).

40 Evaluation of the SSIVG representation in image classification
Multiclass Vote-Based Classifier (MVBC): for each SSIVW or SSIVP in an image, we detect the high latent topic that maximizes its inferred probability; these detections accumulate into a final voting score for each high latent topic, and each image is categorized according to the dominant high latent topic.
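The voting scheme can be sketched as follows, assuming a matrix of inferred probabilities P(h | term) from the MSSA; the exact score used in the thesis may differ:

```python
import numpy as np

def mvbc_predict(term_ids, p_h_given_term):
    """Each visual term votes for its most probable high latent topic;
    the image gets the topic with the highest accumulated vote."""
    votes = np.zeros(p_h_given_term.shape[1])
    for t in term_ids:
        row = p_h_given_term[t]
        votes[row.argmax()] += row.max()   # weighted vote for the best topic
    return int(votes.argmax())
```

Unlike an SVM, this classifier needs no extra training once the MSSA probabilities are estimated, which is its main appeal here.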

41 Evaluation of the SSIVG representation performance in classification (results figure)

42 Assessment of the SSIVG representation performance in object recognition
Dataset: Caltech101; 8,707 images in total (7,697 training, 1,010 test), 101 object classes.
Each test image is recognized by predicting the object class using the SSIVG representation and the MVBC.
Evaluation criterion: classification Average Precision (AP) over each object class.

43 Assessment of the SSIVG representation performance in object recognition (results figure)

44 Conclusion and perspectives
Introduction
Related works
Our approach
Experiments
Conclusion and perspectives

45 Conclusion
Enhanced BOW (E-BOW) representation: modeling the spatial-color image constitution using a GMM; a new local feature descriptor (edge context); an efficient visual word vocabulary structure.
New Multilayer Semantically Significant Analysis (MSSA) model: semantic inference for the different layers of representation.
Semantically Significant Invariant Visual Glossary (SSIVG): more discriminative and more invariant to visual diversity.
Experimental validation: outperforms other state-of-the-art works.

46 Perspectives
Parameter updates: it will be essential to design on-line algorithms to continuously (re-)learn the parameters of the proposed MSSA model, as the content of digital databases is modified by the regular upload and deletion of images.
Invariance issue: it will be interesting to investigate the invariance issue further, especially in the context of large-scale databases where large intra-class variations can occur.
Cross-modality extension: the proposed higher-level visual representation can be extended to video content, applied at the frame level and based on cross-modal data (visual content and textual closed captions).
Video summarization: a new generic framework of video summarization can be designed, based on the extended higher-level semantic representation of video content and studying the semantic coherence between visual contents and textual captions.

47 Thank you for your attention! Questions?

48 Parameter Settings

