1 On Combining Principal Components with Parametric LDA-based Feature Extraction for Supervised Learning: kNN, Naïve Bayes and C4.5
Mykola Pechenizkiy, Seppo Puuronen (Department of Computer Science, University of Jyväskylä, Finland); Alexey Tsymbal (Department of Computer Science, Trinity College Dublin, Ireland)
ADMKD’05, Tallinn, Estonia, September 15-16, 2005

2 Contents
– DM and KDD background: KDD as a process; DM strategy
– Classification: the curse of dimensionality and indirectly relevant features; dimensionality reduction by feature selection (FS) and feature extraction (FE)
– Feature extraction for classification: conventional PCA; class-conditional FE (parametric and non-parametric); combining principal components (PCs) and linear discriminants (LDs)
– Experimental results: 3 FE strategies, 3 classifiers, 21 UCI datasets
– Conclusions and further research

3 What is Data Mining
Data mining, or knowledge discovery, is the process of finding previously unknown and potentially interesting patterns and relations in large databases (Fayyad, KDD’96). Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases (John 1997). It lies at the intersection of many fields: statistics, AI, machine learning, databases, neural networks, pattern recognition, econometrics, etc.

4 Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.

5 The task of classification
Given J classes and a training set of n observations (x_i, y_i), where each x_i is a vector of p attribute values and y_i is the class label, the goal is to predict the class membership y_0 of a new instance x_0. Examples: prognosis of recurrence of breast cancer; diagnosis of thyroid diseases; heart attack prediction, etc.
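To make the task concrete, here is a minimal sketch of training a 3-nearest-neighbour classifier and predicting the class of a new instance. The paper's experiments used WEKA; scikit-learn and the synthetic data below are stand-ins for illustration only.

# Minimal sketch of the classification task with a 3NN classifier.
# Synthetic data; illustrative only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# n = 100 training observations with p = 4 features and J = 2 classes
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

clf = KNeighborsClassifier(n_neighbors=3)      # 3NN, as in the experiments
clf.fit(X_train, y_train)

x_new = rng.normal(size=(1, 4))                # a new instance x_0
print("predicted class y_0:", clf.predict(x_new)[0])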

6 Goals of Feature Extraction
Improvement of the representation space.

7 Constructive Induction
Feature extraction (FE) is a dimensionality reduction technique that extracts a subset of new features from the original set by means of some functional mapping, keeping as much information in the data as possible (Fukunaga 1990).

8 Feature selection or transformation?
– Features can be (and often are) correlated: FS techniques that simply assign weights to individual features are insensitive to interacting or correlated features.
– Data is often not homogeneous: for some problems a feature subset may be useful in one part of the instance space and at the same time useless or even misleading in another part of it.
– Therefore it may be difficult or even impossible to remove irrelevant and/or redundant features from a data set and leave only the useful ones by means of feature selection. That is why transformation of the given representation before weighting the features is often preferable.

9 FE for Classification

10 Principal Component Analysis
PCA extracts a lower-dimensional space by analyzing the covariance structure of multivariate statistical observations. The main idea is to determine the features that explain as much of the total variation in the data as possible with as few of these features as possible. PCA has the following properties: (1) it maximizes the variance of the extracted features; (2) the extracted features are uncorrelated; (3) it finds the best linear approximation; (4) it maximizes the information contained in the extracted features.

11 The Computation of the PCA

12 The Computation of the PCA
1) Calculate the covariance matrix S from the input data.
2) Compute the eigenvalues and eigenvectors of S and sort them in descending order with respect to the eigenvalues.
3) Form the actual transition matrix by taking the predefined number of components (eigenvectors).
4) Finally, multiply the original feature space by the obtained transition matrix, which yields a lower-dimensional representation.
The cumulative percentage of variance explained by the principal axes is commonly used as a threshold that defines the number of components to be chosen.
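A minimal numpy sketch of these four steps follows; the function name, the 85% variance threshold, and the use of numpy are illustrative assumptions rather than details from the paper.

# Minimal PCA sketch following the four steps above (numpy only).
import numpy as np

def pca_transform(X, var_threshold=0.85):
    """Project X onto the principal components that together explain
    at least `var_threshold` of the total variance."""
    Xc = X - X.mean(axis=0)                    # center the data
    S = np.cov(Xc, rowvar=False)               # 1) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # 2) eigen-decomposition
    order = np.argsort(eigvals)[::-1]          #    sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, var_threshold)) + 1
    W = eigvecs[:, :k]                         # 3) transition matrix
    return Xc @ W                              # 4) lower-dimensional representation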

13 FT example: “Heart Disease”
(Figure: the extracted features are linear combinations of the original attributes, e.g.
0.1·Age - 0.6·Sex - 0.73·RestBP - 0.33·MaxHeartRate
-0.01·Age + 0.78·Sex - 0.42·RestBP - 0.47·MaxHeartRate
-0.7·Age + 0.1·Sex - 0.43·RestBP + 0.57·MaxHeartRate
with the percentage of variance covered annotated: 60%, 67%, 87%, 100%.)

14 PCA for Classification
(Figure: PCA for classification: a) effective work of PCA; b) a case where an irrelevant principal component was chosen from the classification point of view.)
PCA gives high weights to features with higher variability, disregarding whether they are useful for classification or not.
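A tiny synthetic illustration of this point: when the class-irrelevant direction has much higher variance than the discriminative one, the first principal component follows the irrelevant direction. The data and numbers below are assumed for illustration, not taken from the paper.

# The first PC tracks the high-variance (class-irrelevant) axis.
import numpy as np

rng = np.random.default_rng(0)
n = 500
noise = rng.normal(scale=10.0, size=n)               # high variance, class-irrelevant
signal = np.where(rng.random(n) < 0.5, -1.0, 1.0)    # low variance, defines the class
y = (signal > 0).astype(int)
X = np.column_stack([noise, signal + rng.normal(scale=0.1, size=n)])

S = np.cov(X - X.mean(axis=0), rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
first_pc = eigvecs[:, np.argmax(eigvals)]
print("first PC:", np.round(first_pc, 3))            # ~[1, 0]: ignores the class axis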

15 Class-conditional Eigenvector-based FE
The usual decision is to use some class separability criterion based on a family of functions of scatter matrices: the within-class, the between-class, and the total covariance matrices.
Simultaneous diagonalization algorithm:
1) Transform X to Y: Y = Λ^(-1/2) Φ^T X, where Λ and Φ are the eigenvalue and eigenvector matrices of S_W.
2) Compute S_B in the obtained Y space.
3) Select the m eigenvectors of S_B which correspond to the m largest eigenvalues.
4) Compute the new feature space Z = Ψ^T Y, where Ψ is the set of selected eigenvectors.

16 Parametric Eigenvalue-based FE
The within-class covariance matrix shows the scatter of samples around their respective class expected vectors:
S_W = Σ_{i=1..c} Σ_{j=1..n_i} (x_j^(i) - m^(i)) (x_j^(i) - m^(i))^T
The between-class covariance matrix shows the scatter of the expected vectors around the mixture mean:
S_B = Σ_{i=1..c} n_i (m^(i) - m) (m^(i) - m)^T
where c is the number of classes, n_i is the number of instances in class i, x_j^(i) is the j-th instance of the i-th class, m^(i) is the mean vector of the instances of the i-th class, and m is the mean vector of all the input data.
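The following numpy sketch combines these scatter matrices with the simultaneous diagonalization of slide 15. It is an illustrative reconstruction, not the authors' code; the small regularization added to S_W is an assumption to keep the whitening step numerically stable.

# Sketch of parametric class-conditional FE (LDA-style feature extraction).
import numpy as np

def scatter_matrices(X, y):
    classes = np.unique(y)
    m = X.mean(axis=0)
    p = X.shape[1]
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        d = Xc - mc
        S_W += d.T @ d                                  # within-class scatter
        diff = (mc - m).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)                # between-class scatter
    return S_W, S_B

def parametric_lda_fe(X, y, m_out=None):
    S_W, S_B = scatter_matrices(X, y)
    lam, Phi = np.linalg.eigh(S_W + 1e-8 * np.eye(S_W.shape[0]))
    W_whiten = Phi @ np.diag(lam ** -0.5)               # Y = Lambda^(-1/2) Phi^T X
    S_B_y = W_whiten.T @ S_B @ W_whiten                 # S_B in the Y space
    vals, Psi = np.linalg.eigh(S_B_y)
    order = np.argsort(vals)[::-1]
    if m_out is None:
        m_out = len(np.unique(y)) - 1                   # at most #classes - 1 LDs
    Psi = Psi[:, order[:m_out]]
    return X @ W_whiten @ Psi                           # extracted linear discriminants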

17 Nonparametric Eigenvalue-based FE
Tries to increase the number of degrees of freedom in the between-class covariance matrix by measuring the between-class covariances on a local basis. The k-nearest neighbor (kNN) technique is used for this purpose. The coefficient w_ik is a weighting coefficient that shows the importance of each summand: it assigns more weight to those elements of the matrix which involve instances lying near the class boundaries, which are more important for classification.
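A rough sketch of the idea follows. The uniform weighting used below is a deliberate simplification; the actual method uses a distance-based weighting coefficient w_ik, whose exact form is not reproduced here.

# Rough sketch of a nonparametric between-class scatter matrix: for each
# instance, the other-class mean is replaced by the mean of its k nearest
# neighbours from the other classes, so S_B is measured locally.
import numpy as np

def nonparametric_between_scatter(X, y, k=3):
    n, p = X.shape
    S_B = np.zeros((p, p))
    for i in range(n):
        others = X[y != y[i]]                      # instances of the other classes
        dists = np.linalg.norm(others - X[i], axis=1)
        nn = others[np.argsort(dists)[:k]]         # k nearest other-class neighbours
        local_mean = nn.mean(axis=0)
        diff = (X[i] - local_mean).reshape(-1, 1)
        S_B += (1.0 / n) * (diff @ diff.T)         # simplified weight w_ik = 1/n
    return S_B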

18 S_B: Parametric vs Nonparametric
(Figure: differences in the between-class covariance matrix calculation for the nonparametric (left) and parametric (right) approaches in the two-class case.)

19 Combining PCs and LDs for SL
Improve the parametric class-conditional LDA-based approach by adding a few principal components (PCs) to the linear discriminants (LDs) for further supervised learning (SL).
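A short sketch of the combined representation: the LDs are concatenated with the first few PCs before the classifier is trained. scikit-learn's PCA and LDA transformers stand in for the paper's own WEKA-based feature extraction; this is only an illustration of the idea.

# Concatenate LDs (at most #classes - 1) with the first 3 PCs.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lds_plus_pcs(X_train, y_train, X_test, n_pcs=3):
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)   # class-conditional FE
    pca = PCA(n_components=n_pcs).fit(X_train)                  # unsupervised FE
    train = np.hstack([lda.transform(X_train), pca.transform(X_train)])
    test = np.hstack([lda.transform(X_test), pca.transform(X_test)])
    return train, test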

20 Experimental Settings
– 21 data sets with different characteristics taken from the UCI machine learning repository.
– 3 classifiers: 3-nearest neighbor classification (3NN), the Naïve Bayes (NB) learning algorithm, and C4.5 decision tree learning (C4.5). The classifiers were taken from the WEKA library with their default settings.
– 3 approaches with each classifier: PCA with the classifier; parametric LDA with the classifier; PCA+LDA with the classifier.
– After PCA we took the 3 main PCs. We took all the LDs (features extracted by parametric LDA), as their number was always equal to #classes - 1.
– 30 test runs of Monte-Carlo cross-validation were made for each data set to evaluate classification accuracy. In each run, the data were divided into a training set and a test set (70%/30%) by stratified random sampling, to keep class distributions approximately the same.
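A sketch of this evaluation protocol (30 stratified 70/30 Monte-Carlo runs) might look as follows; the iris data and scikit-learn classifier are stand-ins, since the actual experiments used WEKA and the 21 UCI datasets.

# 30 Monte-Carlo runs with stratified 70/30 train/test splits.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
splitter = StratifiedShuffleSplit(n_splits=30, test_size=0.3, random_state=0)

accuracies = []
for train_idx, test_idx in splitter.split(X, y):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print("mean accuracy over 30 runs: %.3f" % np.mean(accuracies))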

22 Ranking of the FE approaches
Ranking of the FE approaches according to the results on the 21 UCI data sets: kNN.

23 Ranking of the FE approaches
Ranking of the FE approaches according to the results on the 21 UCI data sets: Naïve Bayes.

24 Ranking of the FE approaches
Ranking of the FE approaches according to the results on the 21 UCI data sets: C4.5.

25 Accuracy of classifiers, averaged over the 21 datasets.

26 State transition diagram

27 Effect of combining PCs with LDs according to state transition diagrams
        PAR+PCA vs PCA    PAR+PCA vs PAR
kNN     -10               +5
NB      +11               +14
C4.5    +15               +17

28 When the combination of PCs and LDs is practically useful.

29 Conclusions
– “The curse of dimensionality” is a serious problem in ML/DM/KDD: classification accuracy decreases and processing time increases dramatically.
– FE is a common way to cope with this problem. Before applying a learning algorithm, the space of instances is transformed into a new space of lower dimensionality, trying to preserve the distances among instances and the class separability.
– A classical approach that takes class information into account is Fisher's LDA. It tries to minimize the within-class covariance and to maximize the between-class covariance in the extracted features. It is well studied, commonly used, and often provides informative features for classification, but it extracts no more than #classes - 1 features and often fails to provide reasonably good classification accuracy even with fairly simple datasets whose intrinsic dimensionality exceeds that number.
– A number of ways to solve this problem have been considered: many approaches suggest non-parametric variations of LDA (rather time-consuming) which lead to greater numbers of extracted features; others use dataset partitioning and local FE.

30 Conclusions (cont.)
– In this paper we consider an alternative way to improve LDA-based FE for classification: combining the extracted LDs with a few PCs.
– Our experiments with the combination of LDs and PCs have shown that the discriminating power of LDA features can be improved by PCs for many datasets and learning algorithms.
– The best performance is exhibited with C4.5. A possible explanation for the good behaviour with C4.5 is that decision trees use implicit feature selection, and thus implicitly select the LDs and/or PCs useful for classification out of the combined set of features, discarding the less relevant and duplicate ones. Moreover, this feature selection is local.

31 Thank You!
Mykola Pechenizkiy, Department of Computer Science and Information Systems, University of Jyväskylä, FINLAND. E-mail: mpechen@cs.jyu.fi, Tel. +358 14 2602472, Mobile: +358 44 3851845, Fax: +358 14 2603011, www.cs.jyu.fi/~mpechen
Acknowledgments: ADMKD reviewers; COMAS Graduate School of the University of Jyväskylä, Finland; Science Foundation Ireland; the WEKA software library; the UCI datasets.

