
1 Support Vector Machines and Predictive Data Modeling
Vladimir Cherkassky, Electrical and Computer Engineering, University of Minnesota, cherk001@umn.edu
Presented at Tech Tune Ups, ECE Dept, June 1, 2011

2 Acknowledgements
Research on Predictive Learning supported by NSF grant ECCS-0802056 and the A. Richard Newton Breakthrough Research Award from Microsoft Research.
Joint work with grad students F. Cai & S. Dhar.
Parts of this presentation are from the books Introduction to Predictive Learning, by Cherkassky and Ma, Springer 2011, and Learning from Data, by Cherkassky and Mulier, Wiley 2007.

3 OUTLINE
Introduction and motivation
Four parts of this course:
- Philosophy, induction and predictive data modeling
- Support vector machines (SVM)
- SVM practical issues and applications
- Advanced SVM-based learning technologies

4 Motivation 1
Two critical points:
(1) Humans cannot reason about uncertainty in a rational way (examples follow)
(2) Humans and animals have excellent biological capabilities to cope with uncertainty and risk (examples follow)

5 Motivation 2
Growth of data in the digital age.
Is it possible to extract knowledge from this data? (philosophical and cultural implications)
How to extract knowledge from data? (business and technological aspects)
Is this a natural domain of statistics?

6 Motivation 3: biological learning
Rosenblatt's Perceptron (early 1960s): an early attempt to simulate biological learning (a simple learning algorithm for a linear classifier).
Young scientists in Moscow tried to understand the generalization properties of such 'machines' and developed a new statistical learning theory.

7 Motivation 4: why SVM?
Support Vector Machines:
- developed in the USSR in the mid-1960s
- later introduced in the West in the mid-1990s
- currently the most widely used method for modeling high-dimensional data
- based on a new mathematical theory different from classical statistics
VC-theory also provides a philosophical framework for 'learning from data'.
This new predictive modeling methodology is still poorly understood.

8 PART 1: Philosophy, induction and predictive data modeling
- Understanding uncertainty and risk
- Induction and knowledge discovery
- Philosophy and statistical learning
- Predictive learning approach
- Introduction to VC-theory

9 Understanding Uncertainty
Humans tend to avoid uncertainty and try to explain unpredictable events.
Aristotle: 'All men by nature desire knowledge.'
Learning ~ discovering regularities from data.
Ancient cultures, e.g. the Ancient Greeks, had no formal concepts related to randomness: unpredictable events (wars, natural disasters, etc.) were thought to be controlled by Gods or Fate.
In modern society, religion has been replaced by science and pseudo-science.

10 Gods, Prophets and Shamans

11 Science and Uncertainty
Math, logic and science are about certainty ~ deterministic rules.
Probability and empirical data involve uncertainty ~ inferior knowledge.
This view dominates modern science, e.g.:
- True scientific knowledge consists of deterministic Laws of Nature.
- There is a (true, causal) model explaining a given natural phenomenon (e.g., a disease).

12 Causal Determinism in Science
Popular view of science:
- deterministic rules (laws of Nature)
- reflects objective reality (single truth)
- knowledge inferred from (observed) data
Digital technology enables the growth of data → one can expect rapid growth of knowledge by applying (statistical, data mining, etc.) algorithms to this data.
Reality is more sobering (as usual).

13 Popular Hype: the data deluge makes the scientific method obsolete
Wired Magazine, 16.07: 'We can stop looking for (scientific) models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.'
Early detection of cancer (or other diseases): massive data analysis of cancer samples in order to identify unique proteins for tens of thousands of types of cancer. The goal is that (in the future) we can all be screened for these proteins as early warning signals for cancer.

14 REALITY
Many studies have questionable value (statistical correlation vs. causation).
Some border on pseudoscience: US scientists at SUNY discovered an 'Adultery Gene'!!! (based on a sample of 181 volunteers interviewed about their sexual life)
The usual conclusion: more research is needed …

15 Some Views on Science
Karl Popper: 'Science starts from problems, and not from observations.'
Werner Heisenberg: 'What we observe is not nature itself, but nature exposed to our method of questioning.'
Albert Einstein: 'Reality is merely an illusion, albeit a very persistent one.'

16 Scientific Discovery
Always involves ideas (models) and facts (data).
Classical first-principle knowledge: hypothesis → data → scientific theory. Note: deterministic, simple models.
Modern data-driven discovery: computer program + DATA → knowledge. Note: statistical, complex systems.
Two philosophies, both poorly understood.

17 COMPLEX SYSTEMS
A. Einstein: 'When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails us.'
Example: weather prediction.
Does digital technology make Einstein's claim obsolete?

18 Examples of Complex Systems
- Life sciences
- Healthcare
- Climate modeling
- Social systems (e.g., financial markets)
Attempts to understand and model such systems using a deterministic approach usually fail.

19 Problem of Induction in Philosophy
Francis Bacon: advocated empirical (inductive) knowledge vs. scholastic knowledge.
David Hume: What right do we have to assume that the future will be like the past?
Philosophy of science tries to resolve this dilemma/contradiction between deterministic logic and the uncertain nature of empirical data.
Digital age: empirical data grows, and this dilemma becomes important in practice.

20 What is 'a good model'?
All models are mental constructs that (hopefully) relate to the real world.
Two goals of data-driven modeling: explain available data, and predict future data.
All good (scientific) models make non-trivial predictions → good data-driven models can predict well, so the goal is to estimate predictive models.

21 Three Types of Knowledge
Growing role of empirical knowledge.
Classical philosophy of science differentiates only between (first-principle) science and beliefs (the demarcation problem).
Importance of demarcation between empirical knowledge and beliefs in applications.

22 Examples of Nonscientific Beliefs
Aristotle's science: everything is a mix of 4 basic elements: earth, water, air and fire.
Geocentric system of the world.
Origin of life (spontaneous generation), disproved by L. Pasteur in the 19th century.
Modern belief: every medical condition can be traced to genetic variations. Is this a popular belief or a scientific theory?

23 Popper's Demarcation Principle
Karl Popper: Every true (inductive) theory prohibits certain events or occurrences, i.e., it should be falsifiable.
First-principle scientific theories vs. beliefs or metaphysical theories.
Risky prediction, testability, falsifiability.

24 Popper's conditions for a scientific hypothesis
- Should be testable
- Should be falsifiable
Example 1: Efficient Market Hypothesis (EMH): the prices of securities reflect all known information that impacts their value.
Example 2: 'We do not see our noses, because they all live on the Moon.'

25 Predictive Learning: Formalization
Given: data samples ~ training data (x, y).
Estimate: a model, or function, f(x) that explains this data and can predict future data.
Classification problem: y is a binary class label, and f(x) is an indicator function separating the input space into two classes.
→ Learning ~ function estimation.

26 Application Example: predicting gender of face images
Training data: labeled face images (male examples and female examples).

27 Predicting Gender of Face Images
Input ~ 32x32 pixel image.
Model ~ indicator function f(x) separating the 1024-dimensional pixel space into two halves.
The model should predict well on new images.
A difficult machine learning problem, but easy for human recognition.
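To make the setting concrete, here is a minimal sketch in Python/NumPy of how a 32x32 image becomes a 1024-dimensional vector classified by a linear indicator function. The weight vector w and threshold b are hypothetical placeholders (in practice they are learned from data), and the random image stands in for a real face image:

```python
import numpy as np

# A 32x32 grayscale face image becomes a 1024-dimensional input vector x.
image = np.random.rand(32, 32)      # stand-in for a real face image
x = image.reshape(-1)               # flatten to a 1024-dim vector

# A linear indicator function f(x) = sign(w.x + b) splits the
# 1024-dim pixel space into two half-spaces (male vs. female).
# w and b are hypothetical here; in practice they are estimated from data.
w = np.random.randn(1024)
b = 0.0
f_x = np.sign(w @ x + b)            # +1 ~ one class, -1 ~ the other
print("predicted label:", f_x)
```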

28 Learning ~ Reliable Induction
Induction ~ function estimation from data: from training samples (x, y), estimate a model f(x).
Deduction ~ prediction for new (test) inputs: apply the estimated f(x) to new x.

29 Common Learning Problems
Classification
Regression
Note: explanation does not ensure prediction.

30 Common Learning Problems (cont'd)
Unsupervised learning (e.g., clustering)
Note: many other types of problems exist. All such problems ~ the inductive learning setting.

31 Generalization and Complexity Control
Consider regression estimation with ten training samples.
Fitting linear and second-order polynomial models gives different estimates from the same data.
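A minimal sketch of this comparison (Python/NumPy; the sine target function and noise level are assumptions for illustration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten training samples from an assumed target function plus noise.
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)

# Fit a linear model (degree 1) and a 2nd-order polynomial (degree 2).
linear = np.polynomial.Polynomial.fit(x, y, deg=1)
quadratic = np.polynomial.Polynomial.fit(x, y, deg=2)

# The two models explain the same data with different complexity,
# and therefore generalize differently on new inputs.
x_new = np.linspace(0, 1, 5)
print("linear:   ", linear(x_new))
print("quadratic:", quadratic(x_new))
```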

32 Complexity Control (cont'd)
The same data set, estimated using k-nn regression with k=1 and k=4, yields models of very different smoothness.
→ Generalization depends on model complexity.
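For completeness, a small sketch of k-nn regression itself (plain NumPy; the training data here is an assumed example, not the slide's data set):

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Predict by averaging the y-values of the k nearest training points."""
    preds = []
    for xq in np.atleast_1d(x_query):
        idx = np.argsort(np.abs(x_train - xq))[:k]   # k nearest neighbors
        preds.append(y_train[idx].mean())
    return np.array(preds)

# k controls model complexity: k=1 interpolates the training data
# (high complexity), k=4 averages over neighbors (smoother, lower complexity).
x_train = np.array([0.05, 0.15, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
y_train = np.sin(2 * np.pi * x_train)
print(knn_regress(x_train, y_train, 0.42, k=1))
print(knn_regress(x_train, y_train, 0.42, k=4))
```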

33 Complexity Control: issues
Theoretical + conceptual: how to define model complexity.
Practical 1: high-dimensional data.
Practical 2: the true model is not known → use resampling for choosing optimal complexity.
Model selection ~ choosing optimal model complexity.

34 Resampling
Split available data into 2 sets: training + validation.
(1) Use the training set for model estimation (via data fitting).
(2) Use the validation data to estimate the 'prediction' error of the model.
Change the model complexity index and repeat (1) and (2).
Select the final model providing the lowest (estimated) prediction error.
BUT the results are sensitive to data splitting.

35 K-fold cross-validation
1. Divide the training data Z into k (randomly selected) disjoint subsets {Z_1, Z_2, …, Z_k} of size n/k.
2. For each 'left-out' validation set Z_i: use the remaining data to estimate the model f_i, and estimate its prediction error on Z_i as $r_i = \frac{k}{n}\sum_{(x,y)\in Z_i} L(f_i(x), y)$.
3. Estimate the average prediction risk as $R_{cv} = \frac{1}{k}\sum_{i=1}^{k} r_i$.
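A compact sketch of this procedure in Python/NumPy. The polynomial model and squared-error loss are placeholder assumptions for illustration; the fold-splitting and risk-averaging steps follow the slide:

```python
import numpy as np

def kfold_cv_risk(x, y, degree, k=5, seed=0):
    """Estimate prediction risk of a polynomial model via k-fold CV."""
    n = len(x)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)          # k disjoint validation sets
    fold_risks = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit the model on the remaining data...
        model = np.polynomial.Polynomial.fit(x[train], y[train], deg=degree)
        # ...and measure squared-error loss on the left-out fold Z_i.
        fold_risks.append(np.mean((model(x[val]) - y[val]) ** 2))
    # Average prediction risk over the k folds.
    return np.mean(fold_risks)
```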

36 Example of model selection
25 samples are generated from a target function with additive noise, with x uniformly sampled in [0,1] and noise ~ N(0,1).
Regression is estimated using polynomials of degree m = 1, 2, …, 10.
Polynomial degree m = 5 is chosen via 5-fold cross-validation. The slide's figure shows the chosen polynomial model, along with training and validation data points, for one partitioning.

m    Estimated R via cross-validation
1    0.1340
2    0.1356
3    0.1452
4    0.1286
5    0.0699
6    0.1130
7    0.1892
8    0.3528
9    0.3596
10   0.4006
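As a usage sketch, the experiment could be reproduced along these lines, assuming the kfold_cv_risk function from the sketch above is in scope. The sine target function is an assumption standing in for the slide's unspecified target:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 25)                             # 25 samples, x ~ U[0, 1]
y = np.sin(2 * np.pi * x) + rng.standard_normal(25)  # noise ~ N(0, 1)

# 5-fold CV risk for polynomial degrees m = 1..10; pick the minimizer.
risks = {m: kfold_cv_risk(x, y, degree=m, k=5) for m in range(1, 11)}
best_m = min(risks, key=risks.get)
print("selected degree:", best_m)
```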

37 Statistical vs Predictive Approach
Binary classification problem: estimate a decision boundary from training data (x, y), where y ~ binary class label (-1/+1).
Assuming the distribution P(x,y) is known: (figure: training samples and decision boundary in the (x1, x2) space)

38 Classical Statistical Approach
(1) The parametric form of the unknown distribution P(x,y) is known.
(2) Estimate the parameters of P(x,y) from the training data.
(3) Construct a decision boundary using the estimated distribution and given misclassification costs.
(figure: estimated boundary)
Modeling assumption: the distribution P(x,y) can be accurately estimated from the available data.

39 Predictive Approach
(1) The parametric form of the decision boundary f(x,w) is given.
(2) Explain the available data by fitting f(x,w), i.e., minimizing some loss function (e.g., squared error).
(3) The function f(x,w*) providing the smallest fitting error is then used for prediction.
(figure: estimated boundary)
Modeling assumptions: need to specify f(x,w) and the loss function a priori; no need to estimate P(x,y).

40 Two Different Methodologies
System identification (~ classical statistics):
- estimate a probabilistic model (class densities) from available data
- use this model to make predictions
System imitation (~ biological learning):
- need only predict well, i.e., imitate a specific aspect of the unknown system
- multiplicity of good models
- can they be interpreted and/or trusted?
Which approach works for high-dimensional data?

41 Classification with High-Dimensional Data
Digit recognition, 5 vs. 8: each example ~ a 32 x 32 pixel image → a 1,024-dimensional vector x.
Medical analogy:
- each pixel ~ a genetic marker
- each patient (sample) described by 1024 genetic markers
- two classes ~ presence/absence of a disease
Estimation of P(x,y) with finite data is not possible, but accurate estimation of a decision boundary in the 1024-dimensional space is possible using just a few hundred samples.

42 Statistical vs Predictive: Discussion
Classical statistics has modeling goals:
- an interpretable model explaining the data
- few important input variables (risk factors)
- prediction performance is not verified but (usually) assumed (why?)
Predictive modeling has different goals:
- prediction (generalization) is the main goal
- prediction accuracy is measured/reported
- model interpretation is not important, as it cannot be objectively evaluated

43 PART 1: Philosophy, induction and predictive data modeling
- Understanding uncertainty and risk
- Induction and knowledge discovery
- Philosophy and statistical learning
- Predictive learning approach
- Introduction to VC-theory

44 Empirical Risk Minimization
ERM principle for learning:
- Model parameterization: f(x, w)
- Loss function: L(f(x, w), y)
- Estimate risk from data: $R_{emp}(w) = \frac{1}{n}\sum_{i=1}^{n} L(f(x_i, w), y_i)$
- Choose w* that minimizes R_emp → the model f(x, w*) explains past data
ERM principle ~ biological approach.
Statistical Learning Theory (aka VC-theory): under what conditions will ERM-style models generalize (predict) well?
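As a sketch, empirical risk with a squared-error loss might be computed like this (Python/NumPy; the linear parameterization f(x, w) = w.x and the synthetic data are assumptions for illustration):

```python
import numpy as np

def f(X, w):
    """Assumed model parameterization: a linear function of the inputs."""
    return X @ w

def empirical_risk(w, X, y):
    """R_emp(w) = (1/n) * sum of per-sample losses L(f(x_i, w), y_i)."""
    losses = (f(X, w) - y) ** 2          # squared-error loss
    return losses.mean()

# ERM chooses w* minimizing R_emp; for squared loss and a linear model
# this is ordinary least squares, solved here in closed form.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print("R_emp at w*:", empirical_risk(w_star, X, y))
```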

45 Inductive Learning Setting
The learning machine observes samples (x, y) and returns an estimated response $\hat{y} = f(x, w)$.
Recall 'first-principles' vs 'empirical' knowledge → two types of inference: identification vs imitation.
Risk: $R(w) = \int L(y, f(x, w))\, dP(x, y) \rightarrow \min$

46 VC-theory basics - 1
Goals of predictive learning:
- explain (or fit) available training data
- predict well on future (yet unobserved) data
Similar to biological learning.
Example: given 1, 3, 7, …, predict the rest of the sequence.
Rule 1: $x_n = 2^n - 1$
Rule 2: randomly chosen odd numbers
Rule 3: …
BUT for the sequence 1, 3, 7, 15, 31, 63, …, Rule 1 seems very reliable (why?)

47 VC-theory basics - 2
Main practical result of VC-theory: if a model explains well past data AND is simple, then it can predict well.
This explains why Rule 1 is a good model for the sequence 1, 3, 7, 15, 31, 63, …
Measure of model complexity ~ VC-dimension ~ ability to explain past data: Rule 1 explains 1, 3, 7, 15, 31, 63 BUT cannot explain all other possible sequences → low VC-dimension (~ large falsifiability).
For linear models, VC-dim = DoF (as in statistics), but for nonlinear models they are different.

48 VC-theory basics - 3
Strategy for modeling high-dimensional data: find a model f(x) that explains past data AND has low VC-dimension, even when the dimensionality is large.
SVM approach: large margin = low VC-dimension ~ easy to falsify.
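A minimal sketch of this idea with a linear large-margin classifier. It uses scikit-learn, which the slides do not mention, and synthetic high-dimensional data as an assumption, mimicking the earlier digit / genetic-marker examples:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic high-dimensional data: 200 samples, 1024 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1024))
w_true = rng.standard_normal(1024)
y = np.sign(X @ w_true)

# A linear SVM seeks the separating hyperplane with the largest margin;
# large margin ~ low VC-dimension, which is what permits generalization
# even though the dimensionality (1024) exceeds the sample size (200).
clf = LinearSVC(C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```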

49 SUMMARY & DISCUSSION
Predictive data modeling:
- training data is similar to future (test) data
- performance index / loss function
- predictive methodology differs from classical statistics
- there may not be a single true model
- 'conventional' model interpretation is hard
Understanding of uncertainty and risk:
- changing due to technological advances
- cultural and ethical issues

