Tin Kam Ho Bell Laboratories Lucent Technologies

Outline
Classifier Combination: motivation, methods, difficulties
Data Complexity Analysis: motivation, methods, early results

Supervised Classification -- Discrimination Problems

An Ill-Posed Problem

Where Were We in the Late 1990’s?
Statistical Methods: Bayesian classifiers, polynomial discriminators, nearest neighbors, decision trees, neural networks, support vector machines, …
Syntactic Methods: regular grammars, context-free grammars, attributed grammars, stochastic grammars, …
Structural Methods: graph matching, elastic matching, rule-based systems, …

Classifiers
Competition among different:
– choices of features
– feature representations
– classifier designs
Chosen by heuristic judgements
No clear winners

Classifier Combination Methods
Decision optimization methods:
– find consensus from a given set of classifiers
– majority/plurality vote, sum/product rule
– probability models, Bayesian approaches
– logistic regression on ranks or scores
– classifiers trained on confidence scores
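As a minimal sketch of the simplest decision-optimization rule, a plurality vote over hard decisions can be written as follows; the three "classifiers" and their outputs are made up purely for illustration:

```python
import numpy as np

def plurality_vote(predictions):
    """Combine hard decisions from several classifiers.

    predictions: (n_classifiers, n_samples) array of integer class labels.
    Returns, for each sample, the label chosen by the most classifiers.
    """
    preds = np.asarray(predictions)
    # for each sample (column), pick the most frequent label
    return np.array([np.bincount(col).argmax() for col in preds.T])

# three hypothetical classifiers voting on four samples
votes = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 1, 0]]
print(plurality_vote(votes))  # -> [0 1 1 0]
```

The sum/product rules mentioned above differ only in that they combine per-class confidence scores instead of hard labels.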

Classifier Combination Methods
Coverage optimization methods:
– subsampling methods: stacking, bagging, boosting
– subspace methods: random subspace projection, localized selection
– superclass/subclass methods: mixture of experts, error-correcting output codes
– perturbation in training
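The first two coverage-optimization families differ mainly in what they randomize: bagging resamples the training points, while the random subspace method resamples the features. A sketch of just the two sampling schemes, with the component classifiers themselves omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_draws(n_samples, n_learners):
    """Bagging-style: each component sees a bootstrap resample of the points."""
    return [rng.integers(0, n_samples, size=n_samples) for _ in range(n_learners)]

def subspace_draws(n_features, n_learners, k):
    """Random-subspace-style: each component sees all points in k random features."""
    return [rng.choice(n_features, size=k, replace=False) for _ in range(n_learners)]

samples = bootstrap_draws(n_samples=100, n_learners=5)    # index sets into the rows
features = subspace_draws(n_features=20, n_learners=5, k=8)  # index sets into the columns
```

Each index set would be used to train one component classifier; the component counts and subspace size here are arbitrary.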

Layers of Choices
Best features?
Best classifier?
Best combination method?
Best (combination of)* combination methods?

Before We Start …
Single or multiple classifiers? – accuracy vs efficiency
Feature or decision combination? – compatible units, common metric
Sequential or parallel classification? – efficiency, classifier specialties

Difficulties in Decision Optimization
Reliability versus overall accuracy
Fixed or trainable combination function
Simple models or combinatorial estimates
How to model complementary behavior

Difficulties in Coverage Optimization
What kind of differences to introduce:
– subsamples? subspaces? subclasses?
– training parameters?
– model geometry?
3-way tradeoff: discrimination + uniformity + generalization
Effects of the form of component classifiers

Dilemmas and Paradoxes
Weaken individuals for a stronger whole?
Sacrifice known samples for unseen cases?
Seek agreements or differences?

Model of Complementary Decisions
Statistical independence of decisions: assumed or observed?
Collective vs point-wise error estimates
Related estimates of neighboring samples

Stochastic Discrimination
Set-theoretic abstraction
Probabilities in model or feature spaces
Enrichment / uniformity / projectability
Convergence by the law of large numbers

Stochastic Discrimination
Algorithm for uniformity enforcement
Fewer, but more sophisticated, classifiers
Other ways to address the 3-way trade-off

Geometry vs Probability
Geometry of classifiers
Rule of generalization to unseen samples
Assumption of representative samples: optimistic error bounds
Distribution-free arguments: pessimistic error bounds

Sampling Density (panels for N = 2, 10, 100, 500, 1000)

Summary of Difficulties
Many theories have inadequate assumptions
Geometry and probability lack connection
Combinatorics defies detailed modeling
Attempts to cover all cases give weak results
Empirical results are overly specific to problems
Lack of systematic organization of evidence

More Questions
How do confidence scores differ from feature values?
Is combination a convenience or a necessity?
What is common among the various combination methods?
When should the combination hierarchy terminate?

Data-Dependent Behavior of Classifiers
Different classifiers excel at different problems
So do combined systems
This complicates theories and the interpretation of observations

Questions to ask:
Does this method work for all problems?
Does this method work for this problem?
Does this method work for this type of problem?
Study the interaction of data and classifiers

Characterization of Data and Classifier Behavior in a Common Language

Sources of Difficulty in Classification
Class ambiguity
Boundary complexity
Sample size and dimensionality

Class Ambiguity
Is the problem intrinsically ambiguous?
Are the classes well defined?
What is the information content of the features?
Are the features sufficient for discrimination?

Boundary Complexity
Kolmogorov complexity of the class boundary
Trivial description: list all points and their class labels
The description length may be exponential in the dimensionality
Is there a shorter description?

Sample Size & Dimensionality
A problem may appear deceptively simple or complex with small samples
Large degrees of freedom in high-dimensional spaces
Representativeness of samples vs. generalization ability of classifiers

Mixture of Effects
Real problems often mix the effects of:
– class ambiguity
– boundary complexity
– sample size & dimensionality
Geometrical complexity of class manifolds, coupled with a probabilistic sampling process

Easy or Difficult Problems Linearly separable problems

Easy or Difficult Problems
Random noise (panels with 1000, 500, 100, and 10 points)

Easy or Difficult Problems
Others: nonlinear boundary, spirals, 4x4 checkerboard, 10x10 checkerboard
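The checkerboard problems above are easy to reproduce; a minimal generator, with the sample count and cell grid as assumed parameters, might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def checkerboard(n_points, cells=4):
    """Sample n_points uniformly in the unit square and label them by a
    cells x cells checkerboard: adjacent cells alternate between class 0 and 1."""
    X = rng.random((n_points, 2))
    col = np.floor(X[:, 0] * cells).astype(int)
    row = np.floor(X[:, 1] * cells).astype(int)
    y = (row + col) % 2
    return X, y

X, y = checkerboard(1000, cells=4)   # the 4x4 problem; cells=10 gives the 10x10 one
```

Raising `cells` lengthens the class boundary without changing the sample size, which is one way such problems become harder.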

Description of Complexity
What are real-world problems like?
Need a description of complexity to:
– set expectations on recognition accuracy
– characterize behavior of classifiers
Apparent or true complexity?

Possible Measures
Separability of classes:
– linear separability
– length of class boundary
– intra/inter-class scatter and distances
Discriminating power of features:
– Fisher’s discriminant ratio
– overlap of feature values
– feature efficiency

Possible Measures
Geometry, topology, clustering effects:
– curvature of boundaries
– overlap of convex hulls
– packing of points in regular shapes
– intrinsic dimensionality
– density variations

Linear Separability
Intensively studied in the early literature
Many algorithms stop only with positive conclusions:
– Perceptrons, Perceptron Cycling Theorem, 1962
– Fractional Correction Rule, 1954
– Widrow-Hoff Delta Rule, 1960
– Ho-Kashyap algorithm, 1965
– Linear programming, 1968
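The perceptron illustrates the "positive conclusions only" point: it provably converges when the classes are linearly separable, but merely cycles forever otherwise, so in practice one imposes an iteration cap and treats non-convergence as inconclusive. A sketch with an assumed cap of 1000 epochs:

```python
import numpy as np

def perceptron_separable(X, y, max_epochs=1000):
    """Rosenblatt perceptron as a separability test for labels in {0, 1}.

    Returns True when a separating hyperplane is found within max_epochs;
    False only means no conclusion was reached (the slide's point)."""
    X = np.asarray(X, dtype=float)
    Xa = np.hstack([X, np.ones((len(X), 1))])      # absorb the bias term
    s = np.where(np.asarray(y) == 1, 1.0, -1.0)    # signed targets
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        updated = False
        for xi, si in zip(Xa, s):
            if si * (w @ xi) <= 0:                 # misclassified: nudge w toward xi
                w = w + si * xi
                updated = True
        if not updated:                            # a full clean pass: separated
            return True
    return False
```

On the XOR configuration this returns False for any cap, since no linear boundary exists.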

Length of Class Boundary (Friedman & Rafsky 1979)
– Find the MST (minimum spanning tree) connecting all points, regardless of class
– Count edges joining opposite classes
– Sensitive to separability and clustering effects
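The Friedman-Rafsky statistic can be sketched directly with Prim's algorithm; this is an illustrative O(n²) version using Euclidean distances, reporting the fraction (rather than the count) of MST edges that join opposite classes:

```python
import numpy as np

def boundary_fraction(X, y):
    """Fraction of minimum-spanning-tree edges joining opposite classes."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()                 # cheapest known edge from tree to each node
    parent = np.zeros(n, dtype=int)    # tree endpoint of that cheapest edge
    cross = 0
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        cross += int(y[j] != y[parent[j]])   # MST edge (parent[j], j)
        in_tree[j] = True
        closer = D[j] < best                 # j now offers a cheaper edge to these
        parent[closer] = j
        best = np.minimum(best, D[j])
    return cross / (n - 1)
```

For two tight, well-separated clusters only the single bridging edge crosses, so the fraction is small; interleaved classes drive it toward 1.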

Fisher’s Discriminant Ratio
Defined for one feature, from the means and variances of classes 1 and 2
One good feature makes a problem easy
Take the maximum over all features
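For a single feature with class means m1, m2 and variances s1², s2², the ratio is f = (m1 - m2)² / (s1² + s2²); taking the maximum over features gives the dataset-level measure. A direct sketch for two classes labeled 0 and 1:

```python
import numpy as np

def fisher_ratio(X, y):
    """Maximum over features of (m1 - m2)^2 / (s1^2 + s2^2)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    f = (a.mean(axis=0) - b.mean(axis=0)) ** 2 / (a.var(axis=0) + b.var(axis=0))
    return float(f.max())
```

A large value means at least one feature alone nearly separates the classes, which is why one good feature makes the problem easy.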

Volume of Overlap Region
Measures overlap of the class manifolds
Per dimension: overlap of the two classes’ value ranges, as a fraction of the range spanned by both classes
Multiply the per-dimension fractions to estimate the volume
Zero if there is no overlap
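The construction above translates almost line for line into code; this sketch uses per-feature min/max ranges for the two classes:

```python
import numpy as np

def overlap_volume(X, y):
    """Product over features of the overlap of the two classes' value ranges,
    each expressed as a fraction of the joint range; 0 if any feature separates."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    lo = np.maximum(a.min(axis=0), b.min(axis=0))   # lower edge of the overlap interval
    hi = np.minimum(a.max(axis=0), b.max(axis=0))   # upper edge of the overlap interval
    span = (np.maximum(a.max(axis=0), b.max(axis=0))
            - np.minimum(a.min(axis=0), b.min(axis=0)))
    frac = np.clip(hi - lo, 0.0, None) / span       # clipped to 0 when ranges are disjoint
    return float(np.prod(frac))
```

A single feature whose class ranges are disjoint zeroes the whole product, mirroring the "zero if no overlap" property.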

Convex Hulls & Decision Regions (Hoekstra & Duin 1996)
Measure the nonlinearity of a classifier w.r.t. a given dataset
Sensitive to the smoothness of decision boundaries

Shapes of Class Manifolds (Lebourgeois & Emptoz 1996)
Packing of same-class points into hyperspheres
Thick and spherical, or thin and elongated, manifolds

Measures of Geometrical Complexity

Space of Complexity Measures
A single measure may not suffice
Make a measurement space
See where datasets lie in this space
Look for a continuum of difficulty, from the easiest to the most difficult cases

Data Sets: UCI
UC-Irvine collection: 14 datasets (no missing values, > 500 pts)
844 two-class problems
– 452 linearly separable
– 392 linearly nonseparable

Data Sets: Random Noise
Randomly located and labeled points
100 artificial problems
1- to 100-dimensional feature spaces
2 classes, 1000 points per class
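A generator matching this description is short; the balanced label assignment is an assumption, since the slide says only that labels are random:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_noise_problem(dim, n_per_class=1000):
    """Uniformly located points with labels unrelated to location: by
    construction there is no class structure to learn, so any apparent
    separability is an artifact of sampling."""
    X = rng.random((2 * n_per_class, dim))
    y = np.repeat([0, 1], n_per_class)   # balanced classes, random positions
    return X, y

X, y = random_noise_problem(dim=10)      # one of the 100 problems
```

Such sets bound the "apparent complexity" end of the measurement space: any measure value they attain can be produced by pure noise.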

Patterns in Measurement Space

Correlated or Uncorrelated Measures

(scatter plots; panels: Separation + Scatter, Separation)

Observations
Noise sets and linearly separable sets occupy opposite ends in many dimensions
In-between positions indicate relative difficulty
Fan-like structure in most plots
At least 2 independent factors, with joint effects
Noise sets are far from real data
Ranges of noise sets: apparent complexity

Principal Component Analysis

A Trajectory of Difficulty

What Else Can We Do?
Find clusters in this space
Determine intrinsic dimensionality
Study the effectiveness of these measures
Interpret problem distributions

What Else Can We Do?
Study specific domains with these measures
Study alternative formulations, and subproblems induced by localization, projection, transformation
Apply these measures to more problems

What Else Can We Do? Relate complexity measures to classifier behavior

Bagging vs Random Subspaces for Decision Forests
(scatter plot, axes: Fisher’s discriminant ratio vs. length of class boundary; regions where subspaces are better and where subsampling is better)

Bagging vs Random Subspaces for Decision Forests
(plots: nonlinearity of nearest-neighbor vs. linear classifier; % retained adherence subsets; intra/inter-class NN distances)

Error Rates of Individual Classifiers
(plots: error rate of nearest neighbors vs. linear classifier; error rate of single trees vs. sampling density)

Observations
Both types of forests are good for problems of various degrees of difficulty
Neither is good for extremely difficult cases:
– many points on the boundary
– ratio of intra/inter-class NN distances close to 1
– low Fisher’s discriminant ratio
– high nonlinearity of NN or LP classifiers
Subsampling is preferable for sparse samples
Subspaces are preferable for smooth boundaries

Conclusions
Real-world problems have different types of geometric characteristics
Relevant measures can be related to classifier accuracies
Data complexity analysis improves understanding of classifier and combination behavior
Helpful for combination theories and practices