Computational Intelligence: Methods and Applications
Lecture 34: Applications of information theory and selection of information
Włodzisław Duch
Dept. of Informatics, UMK
Google: W Duch

Some applications of info theory

Information theory has many applications in different CI areas. Here only a few are mentioned: data visualization, discretization, and selection of features.

Information gain has already been used in decision trees (ID3, C4.5) to define the gain of information obtained by making a split. For a feature A used to split node S into the left S_l and right S_r sub-nodes, with classes ω = (ω_1 ... ω_K):

G(A, S) = H(S) − (|S_l|/|S|) H(S_l) − (|S_r|/|S|) H(S_r)

where H(S) = −Σ_i P(ω_i|S) log2 P(ω_i|S) is the information (entropy) in the class distribution ω for vectors in the node S.

Information is zero if all samples in the node are from one class (log 1 = 0, and 0·log 0 = 0), and reaches the maximum H(S) = log2 K for a uniform distribution over K equally probable classes, P(ω_i|S) = 1/K.

Gain ratio (used in C4.5): GR(A, S) = G(A, S) / SI(A, S), where SI(A, S) = −Σ_v (|S_v|/|S|) log2(|S_v|/|S|) is the split information, i.e. the entropy of the proportions of samples sent to each sub-node.
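A minimal sketch of these quantities in code (illustrative, not part of the lecture); the helper names and the toy labels are made up, and only the standard library is used:

```python
# Entropy, information gain and gain ratio for a binary split, as described above.
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i P(w_i|S) log2 P(w_i|S); 0*log(0) is treated as 0."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(left_labels, right_labels):
    """G(A,S) for a split of node S into S_l and S_r."""
    n_l, n_r = len(left_labels), len(right_labels)
    n = n_l + n_r
    parent = entropy(left_labels + right_labels)
    return parent - (n_l / n) * entropy(left_labels) - (n_r / n) * entropy(right_labels)

def gain_ratio(left_labels, right_labels):
    """Gain divided by the split information (entropy of the split proportions)."""
    n_l, n_r = len(left_labels), len(right_labels)
    split_info = entropy(['l'] * n_l + ['r'] * n_r)  # same entropy formula, applied to the split itself
    return information_gain(left_labels, right_labels) / split_info if split_info else 0.0

# Toy example: a pure left node and a mixed right node.
print(information_gain(['a', 'a'], ['a', 'b', 'b']))
print(gain_ratio(['a', 'a'], ['a', 'b', 'b']))
```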

Model selection

How complex should our model be? For example, what size of a tree, how many functions in a network, what degree of the kernel? Crossvalidation is a good method, but on huge data it is costly.

Another way to optimize model complexity is to measure the amount of information necessary to specify the model and its errors. Simpler models make more errors; complex models need longer descriptions. This is the Minimum Description Length principle for model + errors (very simplified).

General intuition: learning is compression, finding simple models and regularities in the data (the ability to compress). Therefore estimate:

L(M) = how many bits of information are needed to transmit the model.
L(D|M) = how many bits are needed to transmit information about the data, given M.

Minimize L(M) + L(D|M). Data handled correctly by the model need not be transmitted. Estimations of L(M) are usually nontrivial, but approximations work well.
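A toy illustration of the L(M) + L(D|M) trade-off (not from the lecture); the coding scheme is an arbitrary assumption: each parameter costs a fixed number of bits, and each error is transmitted as its sample index plus its true class:

```python
# Compare two hypothetical models by total description length.
from math import log2

def description_length(n_params, n_errors, n_samples, n_classes, bits_per_param=16):
    l_model = n_params * bits_per_param                        # L(M): cost of the parameters
    l_errors = n_errors * (log2(n_samples) + log2(n_classes))  # L(D|M): only the errors are sent
    return l_model + l_errors

# Simple model: few parameters but more errors; complex model: many parameters, fewer errors.
print(description_length(n_params=3,  n_errors=40, n_samples=1000, n_classes=2))
print(description_length(n_params=50, n_errors=5,  n_samples=1000, n_classes=2))
```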

More on model selection

Many criteria for information-based model selection have been devised in computational learning theory; the two best known are:

AIC, the Akaike Information Criterion.
BIC, the Bayesian Information Criterion.

The goal is to predict, using training data, which model has the best potential for accurate generalization. Although model selection is a popular topic, applications are relatively rare and selection via crossvalidation is commonly used.

Models may also be trained by maximizing the mutual information between outputs and classes. This may in general be any non-linear transformation, for example implemented via basis set expansion methods.
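For reference, the standard forms of the two criteria (the slide does not spell them out), with k the number of model parameters, n the number of training samples, and L̂ the maximized likelihood; the model with the smaller value is preferred:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L}
```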

Visualization via max MI

A linear transformation Y = WX that maximizes the mutual information between a set of class labels and a set of new input features Y_i, i = 1..d' < d. Here W is a d' x d matrix of mixing coefficients. Maximization proceeds using gradient-based iterative methods.

Figure: left – FDA view of 3 clusters; right – linear MI view with better separation.

FDA does not perform well if the distributions are multimodal: separating the means may lead to overlapping clusters.
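A rough sketch of the idea (not Torkkola's algorithm and not a gradient method): maximize a crude histogram estimate of MI between class labels and the projected features Y = WX by random-search hill climbing; all names and parameters are illustrative, and real implementations use smooth MI estimators with analytic gradients:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import mutual_info_score

def mi_of_projection(W, X, y, bins=10):
    """Average MI between class labels and each discretized projected coordinate of Y = X W^T."""
    Y = X @ W.T
    total = 0.0
    for j in range(Y.shape[1]):
        edges = np.histogram_bin_edges(Y[:, j], bins=bins)
        total += mutual_info_score(y, np.digitize(Y[:, j], edges))
    return total / Y.shape[1]

def maximize_mi(X, y, d_out=2, steps=500, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d_out, X.shape[1]))
    best = mi_of_projection(W, X, y)
    for _ in range(steps):
        W_try = W + step_size * rng.normal(size=W.shape)     # random perturbation of W
        W_try /= np.linalg.norm(W_try, axis=1, keepdims=True)
        score = mi_of_projection(W_try, X, y)
        if score > best:                                     # keep only improvements
            W, best = W_try, score
    return W, best

X, y = load_iris(return_X_y=True)
W, mi = maximize_mi(X, y)
print("MI of the 2D projection:", round(mi, 3))
```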

More examples

Torkkola (Motorola Labs) has developed this MI-based visualization; see more examples at:

Example: the Landsat Image data contain 36 features (spectral intensities of a 3x3 submatrix of pixels in 4 spectral bands), used to classify 6 types of land use; 1500 samples were used for visualization.

Figure: left – FDA, right – MI; note the violet/blue separation. Classification in the reduced space is more accurate.

Movie 1: Reuters. Movie 2: Satimage (local only).

Feature selection and attention

Attention is a basic cognitive skill; without attention, learning would not have been possible. First we focus on some sensation (a visual object, sounds, smells, tactile sensations), and only then is the full power of the brain used to analyze this sensation.

Given a large database, to find relevant information you may:
– discard features that do not contain information,
– use weights to express their relative importance,
– reduce dimensionality by aggregating information, making linear or non-linear combinations of subsets of features (FDA, MI) – the new features may not be as understandable as the original ones,
– create new, more informative features, introducing new higher-level concepts; this is usually left to human invention.

Ranking and selection

Feature ranking: treat each feature as independent and compare features to determine their order of relevance or importance (rank). Note that:

Several features may have identical relevance, especially for nominal features. This happens either by chance, or because the features are strongly correlated and therefore redundant.

Ranking depends on what we look for: for example, rankings of cars from the point of view of comfort, usefulness in the city, or rough-terrain performance will be quite different.

Feature selection: search for the best subsets of features, remove redundant features, create subsets of the 1, 2, ..., k best features.
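A minimal filter-style ranking sketch (illustrative, using scikit-learn on a toy dataset): each feature is scored independently by its mutual information with the class and then sorted; strongly correlated features can receive nearly identical scores:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)   # MI of each feature with the class
ranking = np.argsort(scores)[::-1]                   # best feature first
for rank, idx in enumerate(ranking, start=1):
    print(f"rank {rank}: feature {idx}, MI score {scores[idx]:.3f}")
```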

Filters and wrappers

Can ranking or selection be universal, independent of the particular decision system used? Some features are quite irrelevant to the task at hand.

Feature filters are model-independent, universal methods based on some criterion measuring relevance, used for information filtering. They are usually computationally inexpensive.

Wrappers are methods that check the influence of feature selection on the results of a particular classifier at each step of the algorithm. For example, LDA, kNN or NB methods may be used as wrappers for feature selection.

Forward selection: add one feature, evaluate the result using a wrapper. Backward selection: remove one feature, evaluate the results.
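A minimal wrapper sketch, assuming a recent scikit-learn version that provides SequentialFeatureSelector; the wrapped classifier (here kNN), the dataset and the number of selected features are arbitrary choices:

```python
# Forward (or backward, with direction="backward") wrapper selection around a kNN classifier:
# each candidate subset is evaluated by cross-validating the wrapped classifier.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("selected feature indices:", list(sfs.get_support(indices=True)))
```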

NB feature selection

Naive Bayes assumes that all features are independent; results degrade if redundant features are kept.

Naive Bayes predicts the class and its probability. For a single feature X_i:

P(ω | X_i) = P(ω) P(X_i | ω) / P(X_i)

and P(X_i) does not need to be computed if NB returns only a class label.

Comparing the predictions with the desired class C(X) (or with the probability distribution over all classes) for all training data gives an error estimate when a given feature X_i alone is used for the dataset D.
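A sketch of this single-feature evaluation (illustrative, with Gaussian naive Bayes on a toy dataset): train NB on one feature at a time and estimate its error on the training data D:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
for i in range(X.shape[1]):
    nb = GaussianNB().fit(X[:, [i]], y)              # NB trained on feature X_i alone
    err = np.mean(nb.predict(X[:, [i]]) != y)        # error on the dataset D
    print(f"feature {i}: error {err:.3f}")
```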

NB selection algorithm

Forward selection, best first, but the complexity is O(d^2), which is rather high:

Start with a single feature: find X_i1 that minimizes the NB classifier error rate; this is the most important feature. Set X_s = {X_i1}.

Let X_s be the subspace of the s features already selected; check all the remaining features, one after another, calculating the probabilities (as products of factors) in the X_s + X_i subspace:

P(ω | X_s, X_i) ~ P(ω) P(X_i | ω) Π_{j in s} P(X_j | ω)

Set X_s <= X_s + {X_i} for the best remaining feature X_i, and repeat until the error stops decreasing.

For NB this forward selection wrapper approach works quite well.
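A sketch of the forward-selection wrapper described above (illustrative; the dataset and the use of Gaussian NB with training-set error are arbitrary choices): each pass checks all remaining features in the X_s + X_i subspace, giving O(d^2) evaluations overall:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

def nb_error(feature_idx):
    """Training error of NB restricted to the given feature subspace."""
    nb = GaussianNB().fit(X[:, feature_idx], y)
    return np.mean(nb.predict(X[:, feature_idx]) != y)

X_s, remaining = [], list(range(X.shape[1]))
current_err = 1.0
while remaining:
    errors = {f: nb_error(X_s + [f]) for f in remaining}   # try each X_i in the X_s + X_i subspace
    f_best = min(errors, key=errors.get)
    if errors[f_best] >= current_err:                      # stop when the error no longer decreases
        break
    X_s.append(f_best); remaining.remove(f_best); current_err = errors[f_best]
print("selected features:", X_s, "error:", round(current_err, 4))
```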

NB WEKA example

First run WEKA with NB on data with a larger number of attributes, for example the colic or hypothyroid data. Then add wrapper selection using the FilteredClassifier with the AttributeSelectionFilter.

Note that the number of attributes decreases, while the accuracy increases.

Note that WEKA may run forward selection, backward selection, or bi-directional selection. The search terminates after a user-specified number of node expansions fails to decrease the error; a beam search is used instead of pure best-first search. This means that if expansion of the best node does not improve the results, the second-best node is used and new features are added, then the third best, etc., avoiding local minima in the search process.