
1 FRAMEWORK FOR CREATING LARGE-SCALE CONTENT-BASED IMAGE RETRIEVAL SYSTEM (CBIR) FOR SOLAR DATA ANALYSIS - Doctoral Dissertation Defense, Juan M. Banda, April 5th, 2011

2 Ph.D. Committee: Dr. Rafal Angryk (chair), Dr. Petrus Martens, Dr. Brendan Mumey, Dr. Rocky Ross, Dr. Michael Ivie (Graduate Representative)

3 Acknowledgements: Dr. Angryk, Dr. Martens, and the Ph.D. Committee; the NASA SDO Mission; the Montana State University Computer Science faculty; fellow Data Mining Lab students

4 Accomplishments - Refereed Publications:
- J. M. Banda and R. Angryk, "Selection of Image Parameters as the First Step Towards Creating a CBIR System for the Solar Dynamics Observatory." International Conference on Digital Image Computing: Techniques and Applications (DICTA), Sydney, Australia, December 1-3, 2010, pp. 528-534.
- J. M. Banda and R. Angryk, "Usage of dissimilarity measures and multidimensional scaling for large scale solar data analysis." NASA Conference on Intelligent Data Understanding (CIDU 2010), Computer History Museum, Mountain View, CA, October 5-6, 2010, pp. 189-203.
- J. M. Banda and R. Angryk, "An Experimental Evaluation of Popular Image Parameters for Monochromatic Solar Image Categorization." Proceedings of the Twenty-Third International Florida Artificial Intelligence Research Society Conference (FLAIRS-23), Daytona Beach, Florida, USA, May 19-21, 2010, pp. 380-385.
- J. M. Banda and R. Angryk, "On the effectiveness of fuzzy clustering as a data discretization technique for large-scale classification of solar images." Proceedings of the 18th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '09), Jeju Island, Korea, August 2009, pp. 2019-2024.

5 Accomplishments - Under Review:
- J. M. Banda, R. Angryk, and P. Martens, "On dimensionality reduction for indexing and retrieval of large-scale solar image data," special issue of the Solar Physics journal.
- J. M. Banda, R. Angryk, and P. Martens, "Large-scale solar data analysis using dissimilarity measures and multidimensional scaling," special issue of the Solar Physics journal.
- J. M. Banda, R. Angryk, and P. Martens, "On the surprisingly accurate transfer of image parameters between medical and solar images," IEEE International Conference on Image Processing 2011.
- J. M. Banda, R. Angryk, and P. Martens, "Dimensionality reduction for indexing and retrieval of highly dimensional solar image data," ACM Knowledge Discovery and Data Mining 2011.
Under Preparation:
- M. Schuh, J. M. Banda, R. Angryk, P. Martens, and P. Bernasconi, "Quantitative Comparison of Automated Filament Detection Methods," for Solar Physics.
- J. M. Banda and R. Angryk, "Composite single-dimensional indexing for high-dimensional data," for Proceedings of the Very Large Database Endowment (PVLDB) 2011.
- J. M. Banda and R. Angryk, "A Framework for creating large-scale content-based image retrieval systems," for the Information Retrieval journal.

6 Agenda
1. Motivation and Contributions
2. Framework Outline
3. Benchmark Datasets
4. Feature Extraction Module (FEM)
5. Attribute Evaluation Module (AEM)
6. Dissimilarity Measures Module (DMM)
7. Dimensionality Reduction Module (DRM)
8. Indexing (and Retrieval) Module (IM)
9. Conclusions

7 What is CBIR? A Content-Based Image Retrieval system retrieves images based on the actual content of the images, instead of meta-data describing that content

8 1. Motivation: With the launch of NASA's Solar Dynamics Observatory mission, a whole new age of high-quality solar image analysis began. The mission generates over 1.5 terabytes of solar images per day, at ten times the resolution of high-definition television, so analyzing them by hand is simply impossible for scientists. Storing all of these images is a second major problem: there is only one full repository, with partial mirrors.

9 1. Major Contributions
To Computer Science:
- Creation of a CBIR building framework (first of its kind)
- Creation of a composite single-dimensional indexing technique for multi-dimensional data
To Solar Physics:
- Creation of a CBIR system for the Solar Dynamics Observatory mission

10 2. Framework Outline

11 3. Benchmark Datasets

12 Solar Dataset: created using the Heliophysics Events Knowledgebase (HEK) portal; 200 images per class, 8 classes, available on the web; all images from the TRACE mission, two different wavelengths, partial disk

13 Solar Dataset - Sample Images

14 ImageCLEFMed Datasets - Sample Images

15 INDECS Dataset - Sample Images

16 PASCAL Dataset - Sample Images

17 4. Feature Extraction Module (FEM)

18 FEM - Image Parameter Extraction

19 FEM - Image Parameters (** user extendable)
P1 - Entropy
P2 - Mean
P3 - Standard Deviation
P4 - 3rd Moment (skewness)
P5 - 4th Moment (kurtosis)
P6 - Uniformity
P7 - Relative Smoothness (RS)
P8 - Fractal Dimension
P9 - Tamura Directionality
P10 - Tamura Contrast

20 FEM - Image Parameter Extraction: 10 image parameters extracted for each of 64 image cells = one 640-dimensional feature vector per image
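The per-cell extraction can be sketched roughly as follows. This is a minimal numpy sketch assuming 8-bit grayscale cells on an 8x8 grid; the fractal dimension and the two Tamura parameters are omitted for brevity, so this toy version yields 7 of the 10 parameters per cell (448 dimensions instead of 640):

```python
import numpy as np

def cell_features(cell):
    """A subset of the slide's image parameters for one cell:
    entropy, mean, std, skewness, kurtosis, uniformity, and
    relative smoothness (fractal dimension and Tamura omitted)."""
    hist, _ = np.histogram(cell, bins=256, range=(0, 256))
    p = hist / hist.sum()
    nz = p[p > 0]
    entropy = -np.sum(nz * np.log2(nz))
    mean = cell.mean()
    std = cell.std()
    skew = np.mean((cell - mean) ** 3) / std ** 3 if std > 0 else 0.0
    kurt = np.mean((cell - mean) ** 4) / std ** 4 if std > 0 else 0.0
    uniformity = np.sum(p ** 2)
    rs = 1 - 1 / (1 + std ** 2)  # relative smoothness (unnormalized variance)
    return [entropy, mean, std, skew, kurt, uniformity, rs]

def image_vector(img, grid=8):
    """Split an image into grid x grid cells, concatenate cell features."""
    h, w = img.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = img[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            feats.extend(cell_features(cell))
    return np.array(feats)

img = np.random.default_rng(0).integers(0, 256, (128, 128))  # stand-in image
vec = image_vector(img)
print(vec.shape)  # (448,) = 64 cells x 7 features
```

With all 10 parameters implemented, the same loop produces the 640-dimensional vector described on the slide.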

21 Classification Algorithms for Comparative Evaluation

22 Classification Algorithms (** user extendable)
- Naïve Bayes (NB): fast learning and surprisingly accurate
- C4.5: most popular in real-world applications
- Support Vector Machines (SVM): usually the best performing, but very slow to train

23 Classification Evaluation - Evaluation Measure (equation shown on slide)

24 Classification Evaluation - Receiver Operating Characteristic (ROC) curves: plot of the true positive rate vs. the false positive rate
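For a discrete classifier, each per-class ROC point comes from one-vs-rest true/false positive counts. A minimal sketch (the labels here are illustrative, not from the solar dataset):

```python
import numpy as np

def roc_point(y_true, y_pred, positive):
    """One ROC-space point (FPR, TPR) for a single class, treating
    `positive` as the positive label in a one-vs-rest fashion."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
    return fpr, tpr

fpr, tpr = roc_point([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0], positive=1)
print(fpr, tpr)  # 1/3, 2/3
```

A perfect classifier lands at (0, 1), which is why the slides describe curves "closer to (0,1)" as better.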

25 FEM - Comparative Evaluation: classification accuracy for our benchmark datasets (closer to 100% is better); experiments use 10-fold cross-validation

26 FEM - Comparative Evaluation: Naïve Bayes classification ROC curve per class for the solar dataset (closer to the (0,1) point of 1 true positive rate / 0 false positive rate is better)

27 FEM - Comparative Evaluation: C4.5 classification ROC curve per class for the solar dataset (closer to the (0,1) point of 1 true positive rate / 0 false positive rate is better)

28 FEM - Comparative Evaluation: SVM classification ROC curve per class for the solar dataset (closer to the (0,1) point of 1 true positive rate / 0 false positive rate is better)

29 5. Attribute Evaluation Module (AEM)

30 AEM - Motivation: by selecting only the most relevant image parameters, we save processing and storage costs for each parameter we remove

31 AEM - Unsupervised Attribute Evaluation: intra-class correlation for the Active Region class (correlation map / 2D multi-dimensional scaling plot)

32 AEM - Unsupervised Attribute Evaluation: inter-class correlation for the Active Region class (correlation map / 2D multi-dimensional scaling plot)

33 AEM - Supervised Attribute Evaluation (** user extendable): Chi Squared, Gain Ratio, Info Gain

34 AEM - Supervised Attribute Evaluation (higher score indicates higher relevance)

Chi Squared        Info Gain      Gain Ratio
2039.06  P7        0.8440  P7     0.3597  P9
2036.05  P6        0.8425  P6     0.3152  P4
2016.69  P1        0.8292  P1     0.3140  P5
1887.89  P4        0.7848  P4     0.3116  P6
1883.68  P9        0.7607  P5     0.3102  P7
1839.05  P5        0.7552  P9     0.3025  P1
1740.33  P2        0.6882  P2     0.2781  P8
1578.76  P10       0.6627  P10    0.2647  P10
1370.68  P3        0.5702  P3     0.2493  P2
1134.43  P8        0.5344  P8     0.2399  P3
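Info Gain, for instance, can be computed directly from class-label entropies. A small numpy sketch under the textbook definition (not necessarily the exact implementation used in the dissertation), for a discretized attribute:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attr_values, labels):
    """Information gain of a discretized attribute w.r.t. class labels:
    H(class) minus the weighted entropy after splitting on the attribute."""
    attr_values = np.asarray(attr_values)
    labels = np.asarray(labels)
    h = entropy(labels)
    for v in np.unique(attr_values):
        mask = attr_values == v
        h -= mask.mean() * entropy(labels[mask])
    return h

# A perfectly informative attribute has gain equal to the class entropy.
g = info_gain(['a', 'a', 'b', 'b'], [0, 0, 1, 1])
print(g)  # 1.0
```

Ranking attributes by this score over the 640-dimensional vectors gives orderings like the table above.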

35 AEM - Experimental Evaluation: classification accuracy for the removal experiments (closer to 100% is better), using 10-fold cross-validation

Exp 1 - all parameters used
Exp 2 - parameters P2, P9, P10 removed
Exp 3 - parameters P3, P6, P10 removed
Exp 4 - parameters P2, P3, P8 removed
Exp 5 - parameters P1, P6, P7 removed
Exp 6 - parameters P2, P3, P10 removed
Exp 7 - parameters P1, P4, P7 removed

Label   NB       C4.5     SVM      Average
Exp 1   74.50%   86.56%   89.94%   83.67%
Exp 2   74.56%   84.69%   90.19%   83.15%
Exp 3   76.81%   85.75%   89.94%   84.17%
Exp 4   74.44%   85.13%   89.69%   83.08%
Exp 5   71.75%   83.50%   89.25%   81.50%
Exp 6   75.19%   85.25%   89.75%   83.40%
Exp 7   72.88%   85.56%   92.56%   83.67%

36 AEM - Experimental Evaluation: Naïve Bayes classification ROC curve per class for Experiment 3 (closer to the (0,1) point of 1 true positive rate / 0 false positive rate is better)

37 AEM - Experimental Evaluation: C4.5 classification ROC curve per class for Experiment 3 (closer to the (0,1) point of 1 true positive rate / 0 false positive rate is better)

38 AEM - Experimental Evaluation: SVM classification ROC curve per class for Experiment 3 (closer to the (0,1) point of 1 true positive rate / 0 false positive rate is better)

39 AEM - Conclusions: removing some image parameters maintains comparable classification accuracy while saving up to 30% of storage and processing costs

40 6. Dissimilarity Measures Module (DMM)

41 Dissimilarity Measure Module (DMM) - Outline (** user extendable)

42 DMM - Motivation for use with the AEM module: a deeper understanding of the data is needed to find interesting underlying relationships in our datasets; using different dissimilarity measures has revealed previously hidden data relationships

43 DMM with AEM - Experimental Evaluation: using the dissimilarity measures and labels from Table 7.1 in the handout and Table 7.2 (shown below), we present scaled image plots of the dissimilarity matrices, plus 2D and 3D multi-dimensional scaling plots

Event Name            Scaled Image Plot Range   MDS Color
Active Region         1-200                     Red
Coronal Jet           201-400                   Green
Emerging Flux         401-600                   Blue
Filament              601-800                   Yellow
Filament Activation   801-1000                  Magenta
Filament Eruption     1001-1200                 Gray
Flare                 1201-1400                 Orange
Oscillation           1401-1600                 Black

44 DMM with AEM - Scaled Image Plots: scaled image plot of the dissimilarity matrix for (left) the Correlation (D7) measure with image parameter mean (P2) and (right) the JSD (D10) measure with image parameter mean (P2); blue indicates low dissimilarity, red indicates high dissimilarity

45 DMM with AEM - Scaled Image Plots: scaled image plot of the dissimilarity matrix for (left) the Chebychev (D5) measure with image parameter mean (P2) and (right) the Chebychev (D5) measure with image parameter relative smoothness (P7); blue indicates low dissimilarity, red indicates high dissimilarity

46 DMM with AEM - MDS 2D Visualization: 2D-MDS map for (left) the Correlation (D7) measure with image parameter mean (P2) and (right) the Chebychev (D5) measure with image parameter relative smoothness (P7)

47 DMM with AEM - MDS 3D Visualization: 3D-MDS map for (left) the Correlation (D7) measure with image parameter mean (P2) and (right) the Chebychev (D5) measure with image parameter relative smoothness (P7)
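Maps like these can be produced with classical (Torgerson) MDS, which embeds points from a dissimilarity matrix alone. A minimal numpy sketch (the dissertation's plots may use a different MDS variant):

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed points in k dimensions from a
    symmetric dissimilarity matrix D by eigendecomposing the
    double-centered matrix B = -0.5 * J D^2 J."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:k]             # largest eigenvalues first
    L = np.sqrt(np.clip(vals[idx], 0, None))     # clip tiny negatives
    return vecs[:, idx] * L

# Euclidean distances between the 4 corners of a unit square are
# recovered exactly by a 2-D embedding.
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, 2)
D2 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.allclose(D, D2))  # True
```

For the solar data, D would be one of the 1600x1600 dissimilarity matrices and each embedded point would be colored by its event class, as in Table 7.2.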

48 DMM with AEM - Experimental Evaluation: 10-component threshold and tangent threshold experiments, using 10-fold cross-validation

49 Percentage of correctly classified instances for the 10-component threshold: a) Naïve Bayes (NB) results, b) C4.5 results, and c) SVM results, for the original data (1-10) and our 180 experiments, as labeled in Table 7.1 in the handout

50 Percentage of correctly classified instances for the tangent threshold: a) Naïve Bayes (NB) results, b) C4.5 results, and c) SVM results, for the original data (1-10) and our 180 experiments, as labeled in Table 7.1 in the handout

51 DMM with AEM - Conclusions: certain combinations of dissimilarity measures and image parameters produce very interesting and different results; some combinations perform very stably in our classification experiments; the MDS results lead us to believe our dataset is a good candidate for dimensionality reduction, so this avenue should be pursued

52 7. Dimensionality Reduction Module (DRM)

53 DRM - Motivation: eliminating redundant dimensions saves indexing, retrieval, and storage costs; in our case, 540 kilobytes per dimension per day, since we will have a 10,240-dimensional feature vector per image (5.27 GB per day)

54 DRM - Methods (** user extendable)
Linear dimensionality reduction methods:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Locality Preserving Projections (LPP)
- Factor Analysis (FA)
Non-linear dimensionality reduction methods:
- Kernel PCA
- Isomap
- Locally-Linear Embedding (LLE)
- Laplacian Eigenmaps (Laplacian)

55 DRM - Experimental Evaluation: we selected 67% of our data as the training set and the remaining 33% for evaluation; for comparative evaluation we use the number of components returned by the standard PCA and SVD algorithms, with a variance threshold between 96% and 99%
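The variance-threshold component count can be sketched with plain SVD-based PCA. This is an illustrative sketch on synthetic data, not the exact toolbox used in the dissertation:

```python
import numpy as np

def pca_components_for_variance(X, threshold=0.96):
    """Number of principal components needed to explain `threshold`
    of the total variance, via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()           # cumulative explained variance
    return int(np.searchsorted(ratio, threshold) + 1)

rng = np.random.default_rng(1)
# 2 strong latent directions plus small isotropic noise in 50-D.
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 50)) \
    + 0.01 * rng.normal(size=(300, 50))
print(pca_components_for_variance(X, 0.96))  # 1 or 2: the planted low-rank signal
```

Applying this at 96%-99% thresholds to each benchmark dataset yields component counts like those in the table on the next slide.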

56 DRM - Experimental Evaluation: number of components per dataset at each variance threshold (experiment labels 1-8 correspond to the eight columns)

                  PCA Variance              SVD Variance
Dataset           96%   97%   98%   99%     96%   97%   98%   99%
Solar             42    46    51    58      74    99    143   -
INDECS            94    106   121   143     215   239   270   319
ImageCLEFMed05    79    89    103   126     193   218   253   307
ImageCLEFMed07    76    87    102   126     192   217   251   304
Pascal2006        77    85    96    114     100   111   125   147
Pascal2008        115   127   141   160     212   239   275   331
Experiment label  1     2     3     4       5     6     7     8

57 Percentage of correctly classified instances for the linear dimensionality reduction methods; experiments use 67% training and 33% testing data (closer to 100% is better)

58 Percentage of correctly classified instances for the non-linear dimensionality reduction methods; experiments use 67% training and 33% testing data (closer to 100% is better)

59 DRM - Experimental Evaluation: classification accuracy averaged over all dimensionality reduction methods for the targeted dimensions on the solar dataset (closer to 100% is better); experiments use 67% training and 33% testing data

60 DRM - Experimental Evaluation: SVM ROC curves for the eight classes of the solar dataset in the PCA 6 (74-dimensional) experiment (closer to the (0,1) point of 1 true positive rate / 0 false positive rate is better)

61 DRM - Conclusions: selecting anywhere between 42 and 74 dimensions provided stable results; for our current benchmark dataset we can reduce the dimensionality of the feature vector by 90%; for the SDO mission, a 90% reduction would imply savings of up to 4.74 gigabytes per day (from 5.27 gigabytes of data per day)

62 8. Indexing Module (IM)

63 Indexing Module (IM) - Outline

64 IM - Motivation: multi-dimensional indexing techniques are not optimal for high numbers of dimensions; single-dimensional approaches to high-dimensional data are currently popular, but published results have been very domain-specific; dimensionally reduced data spaces mean reduced index complexity

65 IM - Evaluation Measures (* relevant images are images belonging to the same class as the query). Page reads = the number of memory pages of size n that must be accessed in order to retrieve each k-nearest neighbor
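With "relevant" defined as same-class, retrieval precision over the k nearest neighbors is straightforward. A small sketch with made-up class labels:

```python
def precision_at_k(query_label, neighbor_labels, k=10):
    """Retrieval precision: the fraction of the k nearest neighbors that
    share the query image's class label (the slide's notion of relevant)."""
    return sum(1 for lab in neighbor_labels[:k] if lab == query_label) / k

# Hypothetical result list for one query of class 'flare'.
p = precision_at_k('flare',
                   ['flare', 'flare', 'jet', 'flare', 'jet',
                    'flare', 'jet', 'jet', 'jet', 'flare'], k=10)
print(p)  # 0.5
```

Averaging this value over all 1,600 queries gives the average precision figures reported on the following slides.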

66 IM - Experimental Evaluation Setup: index all 1,600 feature vectors; query for the 10 nearest neighbors

67 IM - Experimental Evaluation: let's try kd-trees... kd-tree precision results for each of our 1,600 queries on our solar dataset (640 dimensions); average precision: 0.1114, or about 11%

68 IM - Experimental Evaluation: average retrieval precision values for the solar dataset and the eight targeted dimensionalities (closer to 1 is better), using single-dimensional indexing for multi-dimensional data algorithms: iDistance and Pyramid-tree
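The core iDistance idea is to map each d-dimensional point to a single-dimensional key: the id of its nearest reference point times a spacing constant, plus the distance to that reference. Those 1-D keys can then live in an ordinary B+-tree. A minimal numpy sketch (the reference-point choice here is a toy assumption; real iDistance typically picks them by clustering):

```python
import numpy as np

def idistance_keys(X, refs, c=1000.0):
    """Map d-dimensional points to single-dimensional iDistance keys:
    key = partition_id * c + distance to that partition's reference point.
    c must exceed the largest within-partition distance so key ranges
    of different partitions never overlap."""
    d = np.linalg.norm(X[:, None, :] - refs[None, :, :], axis=-1)
    part = d.argmin(axis=1)                      # nearest reference point
    keys = part * c + d[np.arange(len(X)), part]
    return keys, part

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
refs = X[:2]                                     # toy choice of 2 references
keys, part = idistance_keys(X, refs)
print(keys.shape)  # (8,)
```

A k-NN query then probes, per partition, the 1-D key range [part * c + max(0, dist_to_ref - r), part * c + dist_to_ref + r] for a growing radius r, which is what makes a plain single-dimensional index usable for multi-dimensional data.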

69 IM - Experimental Evaluation - iDistance + DMM: average retrieval precision values for iDistance with the Dissimilarity Measure Module (DMM) using the original dimensionality (640); closer to 1 is better

70 IM - Experimental Evaluation - iDistance + DMM

Top 3 results (dissimilarity measure, average precision) per dataset:
Solar:          D6 0.8612   D5 0.8611   D1 0.8609
INDECS:         D2 0.3878   D8 0.3836   D5 0.3806
ImgCLEFMed05:   D2 0.7660   D8 0.7631   D5 0.7601
ImgCLEFMed07:   D5 0.8438   D6 0.8438   D7 0.8438
PASCAL2006:     D2 0.3223   D1 0.3192   D9 0.3191
PASCAL2008:     D7 0.2706   D1 0.2703   D5 0.2701

Experimental setup for dimensionality reduction tests:
Dataset        Experiment (dimensions)   EXP1             EXP2                EXP3
Solar          LPP-8 (143)               D1 (Euclidean)   D6 (Cosine)         D5 (Chebychev)
INDECS         Laplacian-2 (106)         D1 (Euclidean)   D2 (STDeuclidean)   D8 (Spearman)
ImgCLEFMed05   PCA-1 (79)                D1 (Euclidean)   D2 (STDeuclidean)   D8 (Spearman)
ImgCLEFMed07   SVD-7 (251)               D1 (Euclidean)   D5 (Chebychev)      D6 (Cosine)
PASCAL2006     Isomap-2 (85)             D1 (Euclidean)   D2 (STDeuclidean)   D9 (Hausdorff)
PASCAL2008     FA-3 (141)                D1 (Euclidean)   D7 (Correlation)    D5 (Chebychev)

71 IM - Experimental Evaluation - iDistance + DMM: retrieval precision values for dimensionally reduced experiments vs. the originals (closer to 1 is better); Exp 1-3 represent the best dissimilarity measures paired with the best reduction experiments

72 IM - Experimental Evaluation - iDistance + DMM: average page reads for dimensionally reduced experiments vs. the originals (lower is better); Exp 1-3 represent the best dissimilarity measures paired with the best reduction experiments

73 IM - Retrieval Demo - Solar

74 IM - Retrieval Demo - Solar

75 IM - Retrieval Demo - Solar

76 IM - Retrieval Demo - Solar

77 IM - Retrieval Demo - Medical

78 IM - Retrieval Demo - Pascal

79 IM - Conclusions: single-dimensional indexing for high-dimensional data was the best fit for our CBIR-building purposes; our composite implementation of iDistance shows stable results across several different measures; our dimensionality reduction experiments are validated in an actual indexing and retrieval setting

80 9. General Conclusions: the framework provides flexible and extendable capabilities to aid the creation of large-scale CBIR systems; each step of the process can be quantitatively evaluated; all steps are finally evaluated in a real CBIR scenario of indexing and retrieval; while the solar CBIR system returns relevant images, work remains to handle the high repetition of images (and the high similarity between them)

81 Questions? Thank you for your time

