1 Overview of Predictive Learning
Vladimir Cherkassky, Electrical and Computer Engineering, University of Minnesota
Presented at the University of Cyprus, 2009

2 OUTLINE
- Background and motivation
- Application study: real-time pricing of mutual funds
- Inductive Learning and Philosophy
- Two methodologies: classical statistics and predictive learning
- Statistical Learning Theory and SVM
- Summary and discussion

3 Recall: Learning ~ function estimation
Math terminology:
- Past observations ~ data points
- Explanation (model) ~ function
- Learning ~ function estimation (from data points)
- Prediction ~ using the estimated model to make predictions

4 Statistical vs Predictive Approach
Binary classification problem: estimate a decision boundary from training data.
Assuming the distribution P(x,y) were known:
[Figure: training samples and decision boundary in (x1, x2) space]

5 Classical Statistical Approach
(1) The parametric form of the unknown distribution P(x,y) is known.
(2) Estimate the parameters of P(x,y) from training data.
(3) Construct the decision boundary using the estimated distribution and the given misclassification costs.
[Figure: estimated decision boundary]
Modeling assumption: the unknown P(x,y) can be accurately estimated from the available data.
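A minimal sketch of this plug-in procedure, assuming Gaussian class-conditional densities and equal misclassification costs (function and variable names below are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_plugin_gaussian(X, y):
    """Steps (1)-(2): assume Gaussian class-conditionals and estimate their parameters."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def predict_plugin(params, X):
    """Step (3): plug the estimated densities and priors into the Bayes decision rule."""
    classes = np.array(list(params.keys()))
    scores = np.column_stack([
        prior * multivariate_normal.pdf(X, mean=mu, cov=cov)
        for mu, cov, prior in params.values()
    ])
    return classes[np.argmax(scores, axis=1)]

# Toy 2-D binary problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([2, 2], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(predict_plugin(fit_plugin_gaussian(X, y), np.array([[0.0, 0.0], [2.0, 2.0]])))
```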

6 Predictive Modeling Approach
(1) The parametric form of the decision boundary f(x,w) is given.
(2) Explain the available data by fitting f(x,w), i.e., by minimizing some loss function (e.g., squared error).
(3) The function f(x,w*) providing the smallest fitting error is then used for prediction.
[Figure: estimated decision boundary]
Modeling assumptions:
- Need to specify f(x,w) and the loss function a priori.
- No need to estimate P(x,y).
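By contrast, a minimal sketch of the predictive approach: fit a linear f(x,w) by directly minimizing squared error on the training data, with no density estimation (names and hyperparameters are illustrative):

```python
import numpy as np

def fit_linear_decision(X, y, lr=0.1, epochs=500):
    """Fit f(x, w) = w.x + b by minimizing mean squared error between f(x, w)
    and the labels coded as +1/-1; P(x, y) is never estimated."""
    Xb = np.hstack([X, np.ones((len(X), 1))])       # absorb the bias term b into w
    t = np.where(y == 1, 1.0, -1.0)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        grad = 2.0 * Xb.T @ (Xb @ w - t) / len(Xb)  # gradient of the squared-error loss
        w -= lr * grad
    return w                                        # w* with the smallest fitting error found

def predict_linear(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ w >= 0).astype(int)                # sign of f(x, w*) gives the predicted class
```

Applied to the toy data from the previous sketch, `predict_linear(fit_linear_decision(X, y), X)` should recover a very similar boundary without ever modeling the densities.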

7 Philosophical Interpretation
Unknown system, observed data (input x, output y), unknown P(x,y). The goal is to estimate a function y = f(x).
- Probabilistic approach ~ the goal is to estimate the true model of the data (x,y), i.e., System Identification → REALISM
- Predictive approach ~ the goal is to imitate (predict) the system output y, i.e., System Imitation → INSTRUMENTALISM

8 Classification with High-Dimensional Data
Digit recognition, 5 vs 8: each example is a 16 x 16 pixel image → a 256-dimensional vector x.
Given a finite number of labeled examples, estimate a decision rule y = f(x) for classifying new images.
Note: x ~ 256-dimensional vector, y ~ binary class label 0/1.
- Estimation of P(x,y) with finite data is not possible.
- Accurate estimation of the decision boundary in 256-dim. space is possible, using just a few hundred samples.
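A short sketch of the data encoding described above, using synthetic pixel arrays just to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((300, 16, 16))        # a few hundred synthetic 16x16 grayscale "digits"
labels = rng.integers(0, 2, size=300)     # binary labels: 5 vs 8 coded as 0/1

X = images.reshape(len(images), -1)       # each image flattened into a 256-dim vector x
print(X.shape, labels.shape)              # (300, 256) (300,)
```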

9 Statistical vs Predictive
Predictive approach:
- estimates certain properties of the unknown P(x,y) that are useful for predicting y
- has solid theoretical foundations (VC-theory)
- has been used successfully in many applications
BUT its methodology and concepts differ from classical statistical estimation:
- understanding of the application
- a priori specification of a loss function (necessary for imitation)
- interpretation of predictive models is hard
- possibility of several good models estimated from the same data

10 OUTLINE
- Background and motivation
- Application study: real-time pricing of mutual funds
- Inductive Learning and Philosophy
- Two methodologies: classical statistics and predictive learning
- Statistical Learning Theory and SVM
- Summary and discussion

11 Quick Tour of VC-theory - 1
Goals of Predictive Learning:
- explain (or fit) the available training data
- predict well on future (yet unobserved) data
- ample empirical evidence in many applications
Similar to biological learning.
Example: given 1, 3, 7, … predict the rest of the sequence.
- Rule 1: x_n = 2^n - 1
- Rule 2: randomly chosen odd numbers
- Rule 3:
BUT for the sequence 1, 3, 7, 15, 31, 63, …, Rule 1 seems very reliable (why?)
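Reading Rule 1 as x_n = 2^n − 1 (an assumption; the formula is implied by the sequence rather than stated explicitly), a few lines show how the rule both fits the observed terms and predicts the continuation:

```python
def rule1(n):
    return 2 ** n - 1                                    # assumed form of Rule 1: x_n = 2^n - 1

observed = [1, 3, 7, 15, 31, 63]
assert [rule1(n) for n in range(1, 7)] == observed       # the rule explains the past data
print([rule1(n) for n in range(7, 10)])                  # and predicts 127, 255, 511
```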

12 Quick Tour of VC-theory - 2
Main practical result of VC-theory: if a model explains past data well AND is simple, then it can predict well.
This explains why Rule 1 is a good model for the sequence 1, 3, 7, 15, 31, 63, …
Measure of model complexity ~ VC-dimension ~ ability to explain the past data 1, 3, 7, 15, 31, 63 BUT not all other possible sequences → low VC-dimension (~ large falsifiability).
For linear models, VC-dim = DoF (as in statistics), but for nonlinear models they are different.
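This qualitative statement corresponds to the standard VC generalization bound (added here for reference; it is not shown explicitly on the slide): with probability at least 1 − η, for every function in a class of VC-dimension h estimated from n samples,

$$
R(\omega) \;\le\; R_{\mathrm{emp}}(\omega) \;+\; \sqrt{\frac{h\left(\ln\tfrac{2n}{h} + 1\right) - \ln\tfrac{\eta}{4}}{n}} ,
$$

so good prediction requires both a small fitting (empirical) error and a small complexity term, which grows with h and shrinks with n.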

13 Quick Tour of VC-theory - 3
Strategy for modeling high-dimensional data: find a model f(x) that explains the past data AND has low VC-dimension, even when the dimensionality is large.
SVM methods for high-dimensional data: large margin = low VC-dimension ~ easy to falsify.
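The link between large margin and low VC-dimension can be stated precisely (a standard SLT result, given here for reference): for hyperplanes separating data contained in a sphere of radius R with margin Δ in d dimensions,

$$
h \;\le\; \min\!\left(\left\lceil \frac{R^2}{\Delta^2} \right\rceil,\; d\right) + 1 ,
$$

so maximizing the margin Δ bounds the VC-dimension h independently of the input dimension d.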

14 Non-separable data: classification
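For reference, the standard soft-margin SVM formulation for non-separable data introduces slack variables ξ_i that measure margin violations:

$$
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}\xi_i
\qquad \text{subject to} \qquad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 .
$$

The parameter C controls the trade-off between a large margin (low VC-dimension) and a small number of margin violations.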

15 Support Vectors
- SVs ~ training samples with non-zero loss
- SVs are the samples that falsify the model
- The model depends only on the SVs → SVs ~ robust characterization of the data
WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."
SVM generalization ~ data compression.
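That the model depends only on the SVs follows from the form of the SVM decision function (standard result, shown here for reference):

$$
D(\mathbf{x}) \;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, (\mathbf{x}_i \cdot \mathbf{x}) + b ,
$$

where α_i = 0 for every training sample that meets the margin with zero loss; only the support vectors contribute to the sum.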

16 Nonlinear Decision Boundary
- A fixed (linear) parameterization is too rigid.
- A nonlinear curved boundary may yield a larger margin (falsifiability) and lower error → nonlinear kernel SVM.
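A minimal sketch of a nonlinear kernel SVM, using scikit-learn's SVC as one possible implementation (the toy dataset and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data that no linear boundary separates well: two concentric rings
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0) + rng.normal(0, 0.1, 200)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (np.arange(200) >= 100).astype(int)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # nonlinear (kernel) boundary

print("linear training accuracy:", linear_svm.score(X, y))
print("RBF    training accuracy:", rbf_svm.score(X, y))
print("support vectors per class:", rbf_svm.n_support_)
```

On such data the linear SVM should not do much better than chance, while the RBF-kernel SVM separates the rings; the support-vector counts indicate how many samples actually constrain the model.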

17 Handwritten Digit Recognition (mid-90's)
Data set: postal (zip-code) images, segmented and cropped; ~7K training samples and 2K test samples.
Data encoding: 16x16 pixel image → 256-dim. vector.
Summary: test error rate ~3-4%
- prediction accuracy better than custom NNs
- accuracy does not depend on the kernel type
- 400 support vectors per class (digit)

18 Interpretation of SVM models
Humans cannot provide an interpretation of high-dimensional data, even when they make good decisions (predictions) using such data, e.g., digit recognition: 5 vs 8.
How to interpret high-dimensional models?
- Project the data samples onto the normal direction w of the SVM decision boundary D(x) = (w · x) + b.
- Interpret the univariate histograms of the projections.

19 Univariate histogram (of projections)
Project the training data onto the normal vector w of the trained SVM.
[Figure: histogram of projections onto w]
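One way to compute such projections, assuming a linear-kernel SVM fitted with scikit-learn (the data here is synthetic, standing in for the 256-dim digit vectors):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Toy high-dimensional data standing in for the 5-vs-8 digit vectors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 256)), rng.normal(0.3, 1.0, (100, 256))])
y = np.array([0] * 100 + [1] * 100)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
w = svm.coef_.ravel()                                     # normal vector of D(x) = w.x + b
proj = (X @ w + svm.intercept_[0]) / np.linalg.norm(w)    # signed distance of each sample

# Univariate histograms of the projections, one per class
plt.hist(proj[y == 0], bins=30, alpha=0.5, label="class 0")
plt.hist(proj[y == 1], bins=30, alpha=0.5, label="class 1")
plt.xlabel("projection onto w"); plt.legend(); plt.show()
```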

20 Projections for high-dimensional data - 1
Most training samples cluster on the margin borders.
For the 5 vs 8 recognition data, 100 training samples:
[Figure: histogram of training-data projections]
→ Explanation (~ fitting of the training data) is easy.

21 Continued…
BUT the test-data projections (for this SVM model) have a completely different distribution.
For the 5 vs 8 recognition data, 1000 test samples: test error ~6%
[Figure: histogram of test-data projections]
→ prediction is more difficult.

22 Projections for high-dimensional data - 2
For the 5 vs 8 recognition data, 1000 training samples.
Projections of the training data:
[Figure: histogram of training-data projections]

23 Continued…
For this SVM model, the test error is ~1.35%.
Histogram of projections for 1000 test samples:
[Figure: histogram of test-data projections]

24 OUTLINE
- Background and motivation
- Application study: real-time pricing of mutual funds
- Inductive Learning and Philosophy
- Two methodologies: classical statistics and predictive learning
- Statistical Learning Theory and SVM
- Summary and discussion

25 Summary
In many real-life applications:
1. Estimation of models that can explain the available data is easy.
2. Estimation of models that can make useful predictions is very difficult.
3. It is important to make a clear distinction between (1) and (2).
Usually this constitutes the difference between beliefs (opinions) and predictive models.

26 Current Challenges
Non-technical:
- lack of agreement on the understanding of uncertainty and risk
Technical:
- many different, fragmented disciplines dealing with predictive learning
VC-theory gives a consistent, practical approach to handling uncertainty and risk, but it is often misinterpreted by scientists.

27 Acknowledgements
Parts of this presentation are taken:
- from the forthcoming book Introduction to Predictive Learning by V. Cherkassky and Y. Ma, Springer
- and from the course EE 4389 at