
1 Support Vector Machines for Visualization and Dimensionality Reduction
Tomasz Maszczyk and Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
ICANN 2008

2 Plan
Main idea.
Short description of three popular methods frequently used for visualization:
– Multidimensional Scaling (MDS)
– Principal Component Analysis (PCA)
– Fisher Discriminant Analysis (FDA)
Description of our approach.
Results on artificial and real data.

3 Main idea
Many classifiers work as black-box predictors.
Their quality is estimated using accuracy or cost measures.
It is important to be able to evaluate a specific case, showing the confidence in predictions in the region of space close to this case.
PCA, ICA, MDS and other methods commonly used for direct visualization are useful for exploratory data analysis, but do not provide information about the reliability of the method.
We show how to use any LDA or SVM classifier, in its linear or kernelized version, for dimensionality reduction and data visualization.

4 Multidimensional Scaling (MDS)
A non-linear visualization technique.
Decreases dimensionality while preserving the original distances of the high-dimensional space.
Needs only similarities between objects, so an explicit vector representation of the objects is not necessary.
Metric scaling uses a quantitative evaluation of similarity based on numerical functions.
Non-metric scaling uses qualitative information about the pairwise similarities.
MDS variants differ in their cost functions, optimization algorithms, the number of similarity matrices used, and the use of feature weighting.

5 Multidimensional Scaling (MDS)
The most commonly used stress function is

$S(d) = \sum_{i<j} \left( D_{ij} - d_{ij} \right)^{2}$

where $d_{ij}$ are the distances in the target space and $D_{ij}$ are the distances in the input space, calculated using some metric function.
Such stress measures are minimized over the positions of all target points.
The cost functions are not easy to minimize, with multiple local minima representing different mappings.
The initial configuration is either selected randomly or based on a projection of the data onto the space spanned by the principal components.
The orientation of the axes is arbitrary, and the values of the coordinates do not have any simple interpretation (only relative distances are important).
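
As a rough illustration (not the authors' implementation), the raw stress above can be minimized by plain gradient descent over the target positions; in practice sklearn.manifold.MDS (SMACOF) is a more robust choice:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mds_stress_descent(X, dim=2, lr=0.01, n_iter=2000, seed=0):
    """Minimize S(d) = sum_{i<j} (D_ij - d_ij)^2 by gradient descent.
    X: (n, p) input data; returns (n, dim) target-space positions."""
    rng = np.random.default_rng(seed)
    D = squareform(pdist(X))                        # input-space distances D_ij
    Y = rng.normal(scale=1e-2, size=(len(X), dim))  # random initial configuration
    for _ in range(n_iter):
        d = squareform(pdist(Y)) + np.eye(len(Y))   # target distances; diagonal padded to avoid /0
        diff = Y[:, None, :] - Y[None, :, :]        # pairwise coordinate differences Y_i - Y_j
        grad = 2 * (((d - D) / d)[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```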

6 Principal Component Analysis (PCA)
A linear projection method.
Finds orthogonal combinations of input features accounting for most of the variation in the data.
Principal components guarantee minimal loss of information when positions are recreated from their low-dimensional projections.
Projecting the data into the space defined by 1-3 principal components provides, for each input vector, its representative in the target space.
For some data distributions such projections show informative structures.
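
A minimal sketch of such a projection via SVD (equivalent in spirit to sklearn.decomposition.PCA):

```python
import numpy as np

def pca_project(X, k=2):
    """Return the coordinates of each input vector on the k leading
    principal components (directions of largest variance)."""
    Xc = X - X.mean(axis=0)                        # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # target-space representatives
```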

7 Fisher Discriminant Analysis (FDA)
Supervised methods find projections that separate data from different classes in a better way.
FDA maximizes the ratio of between-class to within-class scatter, seeking a direction W such that

$W = \arg\max_{W} \frac{W^{T} S_{B} W}{W^{T} S_{I} W}$

where the scatter matrices $S_B$ and $S_I$ are defined by

$S_{B} = \sum_{i} \frac{n_{i}}{n} (m_{i} - m)(m_{i} - m)^{T}, \qquad S_{I} = \sum_{i} \frac{n_{i}}{n} \hat{\Sigma}_{i}$

where $m_i$ and $\hat{\Sigma}_i$ are the sample means and covariance matrices of each class and $m$ is the sample mean.
In our implementation the pseudoinverse matrix has been used to generate higher FDA directions.
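
A sketch of this construction with the pseudoinverse, assuming the weighted scatter definitions written above (numpy is my tooling choice, not the paper's):

```python
import numpy as np

def fda_direction(X, y):
    """First FDA direction: argmax_W (W^T S_B W) / (W^T S_I W),
    computed via the pseudoinverse of S_I so that it also works
    when the within-class scatter is singular."""
    n, p = X.shape
    m = X.mean(axis=0)
    S_B, S_I = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_B += (len(Xc) / n) * np.outer(mc - m, mc - m)  # between-class scatter
        S_I += (len(Xc) / n) * np.cov(Xc, rowvar=False)  # within-class scatter
    # the ratio criterion is solved by the leading eigenvector of pinv(S_I) @ S_B
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_I) @ S_B)
    W = np.real(vecs[:, np.argmax(np.real(vals))])
    return W / np.linalg.norm(W)
```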

8 Linear SVM
Searches for a hyperplane that provides a large classification margin, using a regularization term and quadratic programming.
Non-linear versions are based on the kernel trick, which implicitly maps data vectors to a high-dimensional feature space in which the best separating hyperplane is constructed.
The linear discriminant function is defined by

$g_{W}(X) = W^{T} X + w_{0}$

The best discriminating hyperplane should maximize the distance between the decision hyperplane and the vectors that are nearest to it.
The vector W, orthogonal to the discriminant hyperplane, defines the direction onto which data vectors are projected, and thus may be used for one-dimensional projections.
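
A sketch of this projection using scikit-learn (an assumed tooling choice; any SVM implementation exposing the weight vector works the same way):

```python
import numpy as np
from sklearn.svm import SVC

def svm_direction(X, y, C=1.0):
    """Train a linear SVM and return the unit vector W orthogonal
    to the separating hyperplane; X @ W is the 1-D projection."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    W = clf.coef_.ravel()               # normal of the decision hyperplane
    return W / np.linalg.norm(W)

# one-dimensional projection of all data onto the first SVM direction:
# x = X @ svm_direction(X, y)
```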

9 Linear SVM
The same may be done using a non-linear SVM based on the kernel discriminant

$g_{W}(X) = \sum_{i} \alpha_{i} y_{i} K(X_{i}, X) + w_{0}$

The $x = g_W(X)$ values for different classes may be smoothed and displayed as a histogram, estimating either the class-conditional or the posterior probabilities.
The first projection should give $g_{W_1}(X) < 0$ for vectors from the first class and $g_{W_1}(X) > 0$ for the second class.
The second best direction may be obtained by repeating the SVM calculations in the space orthogonalized to the already obtained W directions.
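
Sketch, same assumed tooling: for a kernel SVM the decision function supplies $x = g_W(X)$ directly, ready to be histogrammed or smoothed per class (see the density sketch after slide 16):

```python
from sklearn.svm import SVC

def kernel_projection(X, y, gamma="scale"):
    """1-D projection x = g_W(X) from a Gaussian-kernel SVM."""
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    return clf.decision_function(X)     # sum_i alpha_i y_i K(X_i, X) + w_0
```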

10 Linear SVM
This process may be repeated to obtain more dimensions.
Each additional dimension should help to decrease the errors estimated by cross-validation (CV).
The optimal dimensionality is reached when new dimensions stop decreasing the number of errors in CV tests.
In the case of a non-linear kernel, $g_W(X)$ provides the first direction, while the second may be generated, e.g., by repeating the training on the subset of vectors that are close to the hyperplane in the extended space, using some other kernel.
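
A minimal rendering of this loop for the linear case (my reading of the procedure, not the authors' code): each new direction is found after orthogonalizing the data to the previous ones, and 10-fold CV decides when to stop:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def sequential_svm_directions(X, y, max_dim=5, C=1.0):
    """Extract SVM directions W_1, W_2, ... sequentially.
    After each direction the data are orthogonalized to it, and the
    loop stops when a new dimension no longer improves 10-fold CV."""
    Xr = X.astype(float).copy()
    dirs, best_acc = [], 0.0
    for _ in range(max_dim):
        W = SVC(kernel="linear", C=C).fit(Xr, y).coef_.ravel()
        W /= np.linalg.norm(W)
        dirs.append(W)
        Xr = Xr - np.outer(Xr @ W, W)        # project out the new direction
        Z = X @ np.array(dirs).T             # data in the reduced space
        acc = cross_val_score(SVC(kernel="linear", C=C), Z, y, cv=10).mean()
        if acc <= best_acc:                  # no improvement: discard and stop
            dirs.pop()
            break
        best_acc = acc
    return np.array(dirs)
```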

11 Four datasets
Parity_8: 8-bit parity dataset (8 binary features and 256 vectors).
Heart: disease dataset consisting of 270 samples, each described by 13 attributes; 150 cases belong to the „absence” group and 120 to the „presence of heart disease” group.
Wisconsin: breast cancer data containing 699 samples collected from patients. Among them, 458 biopsies are from patients labeled as „benign” and 241 are labeled as „malignant”. Feature 6 has 16 missing values; removing the corresponding vectors leaves 683 examples.
Leukemia: microarray gene expressions for two types of leukemia (ALL and AML), with a total of 47 ALL and 25 AML samples measured with 7129 probes. Visualization is based on the 100 best features from a simple feature ranking using the FDA index.
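
The ranking step for Leukemia can be sketched with a per-feature Fisher-type index; the exact FDA index used in the paper is not spelled out here, so the normalization below is an assumption:

```python
import numpy as np

def fda_index_ranking(X, y, k=100):
    """Rank features of a two-class problem by a Fisher/FDA-type index,
    here (m1 - m2)^2 / (s1^2 + s2^2) per feature (assumed form)."""
    c1, c2 = (y == np.unique(y)[0]), (y == np.unique(y)[1])
    num = (X[c1].mean(0) - X[c2].mean(0)) ** 2
    den = X[c1].var(0) + X[c2].var(0) + 1e-12  # guard against zero variance
    return np.argsort(num / den)[::-1][:k]     # indices of the k best features
```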

12 Parity_8 dataset

13 Heart dataset

14 Wisconsin dataset

15 Leukemia dataset

16 Estimation of class-conditional probability in the first SVM direction
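
A sketch of how such class-conditional estimates can be produced from the 1-D projections, using a Gaussian kernel density per class (scipy and its default bandwidth are my assumptions):

```python
import numpy as np
from scipy.stats import gaussian_kde

def class_conditionals(x, y, grid=None):
    """Smooth the 1-D SVM projections x = g_W(X) separately for each
    class, estimating the class-conditional densities p(x | class)."""
    if grid is None:
        grid = np.linspace(x.min(), x.max(), 200)
    return {c: gaussian_kde(x[y == c])(grid) for c in np.unique(y)}, grid
```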

17 SVM 10CV accuracy for datasets with reduced features

18 Summary
Visualization enables exploratory data analysis and gives more information than accuracy or probability alone.
It helps to understand what black-box classifiers really do.
When safety is important, visualization methods may be used to evaluate confidence in predictors.
Sequential dimensionality reduction based on SVM has several advantages:
– enables visualization
– guarantees dimensionality reduction without loss of accuracy
– increases the accuracy of the linear discrimination model
– is very fast
– preserves simple interpretation

19 Summary cont.
SVM decision borders can be visualized using estimations of class-dependent or posterior probabilities.
Visualization helps to evaluate the reliability of predictions for individual cases, showing them in the context of the known cases.
We plan to add visualization of probabilities and scatterograms to a few popular SVM packages soon.

