
1 Elisavet Chatzilari, Spiros Nikolopoulos, Yiannis Kompatsiaris, Josef Kittler – Information Technologies Institute

2
 Introduction – Scalability
 Proposed approach
 Experiments
 Discussion and Conclusions

3 The problem
 In the past decade there has been a huge increase in the availability of multimedia content on the web
 How can all these data be indexed to facilitate searching, browsing, etc.?

4 Keyword-based
 Search multimedia content using text
   File names
   Tags
   User comments
 Not always successful, especially for ambiguous concepts such as "palm"
   Palm – hand
   Palm – tree
   Palm – palmtop
 A large amount of multimedia data has auto-generated file names and no associated text
   e.g. default digital camera image names (im….jpg)

5 Content-based
 Machine learning is the popular paradigm
 Simulates the way humans learn to recognize objects
 A model is trained to recognize a concept from a set of positive and negative examples (i.e. the training set)

6 Challenges
 The performance of a machine learning model depends mainly on the quality and the quantity of the training set
 Quality is achieved through manual annotation
   Laborious
   Time-consuming
 Quantity is limited when manual annotation is required
 This limits the scalability of content-based methods to numerous concepts
   Positive and negative instances are required for each concept

7 The potential
 A tremendous volume of user-contributed content (UGC) is available on the web, usually along with an indication of its meaning (i.e. tags)
 Can this content be used for training, and thus facilitate scalable learning?
 Rapid growth of social media (e.g. images, videos, etc.)
   Emerged from users' willingness to communicate, socialize, collaborate and share content

8 Scalability
 Definition: the ability of a system to effectively train recognition models for numerous concepts
 Requirement: availability of significant amounts of training data
 Problem: manual annotation limits this availability
 Proposed solution: find "cheaper" ways (i.e. ones that do not require manual annotation) to obtain annotated data
 Possible sources of "cheaper" content
   Crowdsourcing (e.g. MTurk): high-quality annotations; faster than in-house manual annotation; not free
   Online games and tools (e.g. Peekaboom, LabelMe): medium-quality annotations; content availability is still limited; free
   Tagging environments (e.g. Flickr): noisy annotations (tags); almost unlimited size; free
   Web content: no annotations; unlimited size; free
 Our objective: examine whether, and under which circumstances, user-tagged images are suitable content for training

9 Proposed approach
 Based on the typical bootstrapping paradigm; however, the search space defined by the pool of candidates is too dense for new samples to be selected accurately (see the baseline sketch below)
 Proposed approach: select additional training samples from UGC after refining the search space using:
   Visual information – typical self-training
   Textual information – user tags
   Visual ambiguity – discard the ambiguous images for which our classifier cannot localize the targeted region (e.g. grass–bush vs. grass–fence)
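For reference, a minimal sketch of plain self-training, the baseline bootstrapping paradigm referred to above, using scikit-learn and NumPy. Treating the top-scoring pool samples as positives is an assumption of this sketch, not a detail from the slide, and all names are illustrative:

```python
# Plain self-training: repeatedly promote the pool samples the current model
# is most confident about, then retrain. Features are plain NumPy vectors;
# in the paper they would be region descriptors (see the next slides).
import numpy as np
from sklearn.svm import SVC

def self_train(model, X_lab, y_lab, X_pool, rounds=3, n_add=1000):
    for _ in range(rounds):
        conf = model.decision_function(X_pool)   # confidence per pool sample
        top = np.argsort(conf)[::-1][:n_add]     # most confident candidates
        X_lab = np.vstack([X_lab, X_pool[top]])  # assumed to be positives
        y_lab = np.concatenate([y_lab, np.ones(len(top), dtype=int)])
        X_pool = np.delete(X_pool, top, axis=0)  # shrink the candidate pool
        model = model.fit(X_lab, y_lab)          # retrain on the enlarged set
    return model
```

Because the pool is dense and the model's own scores are the only filter, mistakes accumulate across rounds; this is the failure mode the textual and ambiguity cues on the next slides are meant to counter.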

10 Visual Scores (VS)
 Manually labeled images
 Feature extraction
   Dense and Harris-Laplace keypoints
   SIFT descriptors
   Soft assignment
 Classification
   Support Vector Machines (SVMs)
 The Visual Score of each region is obtained by applying the SVM model to it
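A minimal sketch of this step, assuming regions arrive as bag-of-visual-words histograms (the keypoint/SIFT/soft-assignment pipeline above is abstracted away) and using scikit-learn's probabilistic SVM; the function names are illustrative, not the authors' code:

```python
# Visual Score sketch: one binary SVM per concept, applied to region
# histograms. pos_hists / neg_hists are NumPy arrays of BoW histograms
# extracted from manually labeled regions.
import numpy as np
from sklearn.svm import SVC

def train_concept_svm(pos_hists, neg_hists):
    """Train a one-vs-rest SVM for a single concept from labeled regions."""
    X = np.vstack([pos_hists, neg_hists])
    y = np.array([1] * len(pos_hists) + [0] * len(neg_hists))
    return SVC(kernel="rbf", probability=True).fit(X, y)

def visual_score(model, region_hist):
    """Visual Score: the SVM's confidence that the region shows the concept."""
    return model.predict_proba(region_hist.reshape(1, -1))[0, 1]
```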

11 Textual Scores (TS)
 Gauges the likelihood that the image contains the concept
 Measure the semantic relatedness between each tag of the image and the concept's lexical description; the maximum of these scores is the Textual Score
 Relatedness is computed over WordNet (a lexical database)
 The chosen similarity metric, the vector measure [vector], combines the benefits of using the strict definitions of WordNet with knowledge of concept co-occurrence derived from a large data corpus
 [vector] S. Patwardhan, "Incorporating Dictionary and Corpus Information into a Context Vector Measure of Semantic Relatedness", Master's thesis, University of Minnesota, Duluth, August 2003.
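A hedged sketch of this step. The slide specifies Patwardhan's WordNet vector relatedness measure; NLTK's path similarity is substituted here as a readily available stand-in, so the scores will differ from the authors' implementation, and comparing tags to the concept word rather than its full lexical description is a further simplification:

```python
# Textual Score sketch: max WordNet relatedness between any image tag and
# the concept. Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def relatedness(word_a, word_b):
    """Max path similarity over all noun senses of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(word_a, pos=wn.NOUN)
              for s2 in wn.synsets(word_b, pos=wn.NOUN)]
    return max(scores, default=0.0)

def textual_score(tags, concept):
    """Textual Score: best relatedness between any tag and the concept."""
    return max((relatedness(tag, concept) for tag in tags), default=0.0)

# e.g. textual_score(["lawn", "picnic", "sky"], "grass")
```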

12 Visual Ambiguity (VA)
 Visual ambiguity arises between visually similar concepts
   Attributed to the imperfection of the specific visual analysis algorithms
 Visual ambiguity extraction
   For every concept c_k, a model (SVM) is trained on manually labeled regions and applied to all training samples
   The Visual Ambiguity score between c_k and a concept c_l is the mean of the Visual Scores that the regions belonging to class c_l receive from c_k's model
 The Visual Ambiguity scores indicate how much we trust a specific classifier to distinguish between two concepts when asked to classify a region
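A sketch of the VA computation under the same assumptions as the earlier snippets (it reuses the illustrative visual_score() and the per-concept models and labeled regions from the slide 10 sketch):

```python
# Visual Ambiguity sketch: VA[(c_k, c_l)] is the mean Visual Score that
# c_k's model assigns to regions whose true label is c_l.
import numpy as np

def visual_ambiguity(models, regions_by_concept, concepts):
    """Return VA as a dict keyed by concept pairs, values in [0, 1]."""
    va = {}
    for c_k in concepts:
        for c_l in concepts:
            if c_k == c_l:
                va[(c_k, c_l)] = 0.0  # convention on the next slide: no self-ambiguity
                continue
            scores = [visual_score(models[c_k], r)
                      for r in regions_by_concept[c_l]]
            va[(c_k, c_l)] = float(np.mean(scores))
    return va
```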

13 Image Trustworthiness – an example (c_k = grass)
 Goal: gauge how much we trust the model for c_k (here, grass) to classify a specific image
 Decide which concepts of our vocabulary appear in the examined image, based on its tags
   Left image I_1 (tags: grass, potted plant, plant, countryside, green, nature): grass, plant, bush
   Right image I_2 (tags: grass, fence, picket fence, wooden fence, nature, banister, white fence): grass, fence
 Find the concept most ambiguous with the examined concept (grass), i.e. the one with the maximum VA
   VA(grass, grass) = 0, VA(grass, plant) = 0.824, VA(grass, bush) = 0.874, VA(grass, fence) = 0.638
   Left: bush; Right: fence
 Trustworthiness is the complement of that maximum
   Trust_grass(I_1) = 1 − max(0, 0.824, 0.874) = 1 − 0.874 = 0.126
   Trust_grass(I_2) = 1 − max(0, 0.638) = 1 − 0.638 = 0.362
 We therefore trust the model for grass to classify the regions of image I_2 more than the regions of image I_1
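The same computation in code, reproducing the slide's numbers; the per-image concept lists are assumed to have been resolved from the raw tags beforehand:

```python
# Image Trustworthiness sketch: 1 minus the VA of the most confusable
# concept among those tagged in the image.

def trustworthiness(va, c_k, tagged_concepts):
    """Complement of the maximum ambiguity over the image's concepts."""
    return 1.0 - max(va[(c_k, c_l)] for c_l in tagged_concepts)

va = {("grass", "grass"): 0.0, ("grass", "plant"): 0.824,
      ("grass", "bush"): 0.874, ("grass", "fence"): 0.638}
print(trustworthiness(va, "grass", ["grass", "plant", "bush"]))  # ≈ 0.126
print(trustworthiness(va, "grass", ["grass", "fence"]))          # ≈ 0.362
```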

14 Region Selection
 Combine the Visual Score, Textual Score and Image Trustworthiness into a single Region Relevance score using their geometric mean:
   RR_{c_k}(r_m^I) = [ V(r_m^I, c_k) · T(I, c_k) · Trust(I, c_k) ]^{1/3}
 The regions of the loosely tagged images are ranked by their Region Relevance score
 The 1000 regions with the highest relevance scores are selected to enhance the initial training set
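A minimal sketch of this ranking step; the three score values per region are assumed to come from the earlier sketches, and all names are illustrative:

```python
# Region Relevance sketch: geometric mean of the three cues, then keep
# the top-ranked regions to augment the training set.

def region_relevance(v, t, trust):
    """Geometric mean of Visual, Textual and Trustworthiness scores."""
    return (v * t * trust) ** (1.0 / 3.0)

def select_regions(scored_regions, n=1000):
    """scored_regions: iterable of (region, v, t, trust) tuples."""
    ranked = sorted(scored_regions,
                    key=lambda s: region_relevance(s[1], s[2], s[3]),
                    reverse=True)
    return [region for region, *_ in ranked[:n]]
```

The geometric mean means a region must score well on all three cues at once; a zero on any single cue (e.g. no related tag) vetoes the region entirely.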

15 Experiments
 Datasets
   SAIAPR TC-12: 20k manually labeled images, split into 3 subsets
     train: 14k images (used for testing Visual Ambiguity directly)
     validation: 2k images (used as the initial training set)
     test: 4k images (used for evaluation)
   MIRFlickr-1M: 1 million loosely tagged images (used as the pool from which regions are selected to enhance the initial classifiers)
 Experiments examine the impact of the auxiliary information
   on the accurate selection of new training samples
   on the performance of the initial classification models

16 Sample Selection Performance Assessment
 Simulation experiment
   Train on the validation set
   Use the training set as the pool of candidates
   Compare the distributions of the selection scores under three configurations (V, VT, VTA)
 Observations
   Without the auxiliary information (V), the two distributions overlap significantly
   The textual information (VT) eliminates a large number of non-relevant regions, but the distributions still overlap
   Discarding the ambiguous content (VTA) allows part of the true-positive distribution (shown in black in the figure) to stand out
 [Figure: selection score distributions; annotated scores – V: 4.56%, VT: 58.78%, VTA: 65.5%]

17 Indicative regions selected from Flickr – grass
 Blue bounding boxes indicate the false positives
 Observations are similar to those of the simulation experiment (which had accurate tags/keywords), but here with real user-tagged images
 [Figure: example selected regions under the V, VT and VTA configurations]

18 Performance of the retrained models
 Similar observations
   Without the auxiliary information, the classifiers deteriorate when new training data are added
   The textual information allows for more accurate selection of new training samples
   The proposed approach (VTA) outperforms both V and VT

19 Discussion of the results – future plans
 Proposed method: modeling visual ambiguity and utilizing it directly in classification
 Discarding the ambiguous content allows for more accurate selection of new training data
 However, the impact on the retrained models is not as strong as expected
   Key observation: the informativeness of the added samples was not taken into account
 Future work
   Design a more sophisticated method to incorporate visual ambiguity
   Investigate the use of informativeness within the selection strategy
   Extend the framework to an online iterative learning scheme
   Use a richer pool of candidates (e.g. download data for each concept independently)

