 # Learning From Data Chichang Jou Tamkang University.

Learning From Data Chichang Jou Tamkang University

2 Chapter Objectives

- Analyze the general model of inductive learning
- Explain how to select an approximating function
- Introduce risk functional for regression and classification problems
- Identify concepts in statistical learning theory
- Discuss the differences of inductive principles, empirical risk minimization, and structural risk minimization
- Discuss practical aspects of VC dimension
- Compare inductive learning tasks using graphics
- Introduce validation methods of inductive learning results

3 Background

- Biological systems learn to cope with the unknown, statistical environment in a data-driven fashion
- Two phases of the predictive-learning process:
  - Learning or estimating unknown dependencies
    - Induction: progressing from particular cases to a model
  - Using estimated dependencies to predict
    - Deduction: progressing from a model and given input to particular cases

4 Induction, Deduction, Transduction Local estimation, like association rules

5 4.1 Learning machine

- Machine learning algorithms vary in their goals, in the available training data sets, and in the learning strategies and representation of data
- Inductive machine learning
  - A generalization of models is obtained from a set of samples

6 Observational setting of a Learning machine

- Conditional probability p(Y|X)
- Real-world systems often have unmeasured inputs

7 Inductive Learning machine

- Try to form generalizations from particular true facts (called the training data set).
  - Formalized as a set of functions that approximate a system's behavior
    - Given X as an input, implementing a set of functions f(X, w), where w is a parameter of the function
  - Its solution requires a priori knowledge

8 Inductive Learning machine

- The task of inductive inference
  - Given samples (x_i, f(x_i)), return a function h(x), called a hypothesis, that approximates f(x)
  - (Figure: linear and non-linear hypotheses)
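
The inference task above can be illustrated with a small sketch. It assumes a made-up target f(x) = x^2 and two hypothetical hypothesis families, a linear one h(x) = a*x + b and a non-linear one h(x) = c*x^2 + d; the names `fit_line`, `h_linear`, and `h_quadratic` are illustrative and not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of h(x) = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(h, xs, ys):
    """Average squared difference between h(x_i) and the observed f(x_i)."""
    return sum((y - h(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def f(x):
    """Hypothetical target function; unknown to the learner in practice."""
    return x * x


# Samples (x_i, f(x_i)) observed from the target.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [f(x) for x in xs]

# Linear hypothesis h(x) = a*x + b.
a, b = fit_line(xs, ys)
h_linear = lambda x: a * x + b

# Non-linear hypothesis h(x) = c*x^2 + d, fitted as a line in z = x^2.
c, d = fit_line([x * x for x in xs], ys)
h_quadratic = lambda x: c * x * x + d

mse_linear = mean_squared_error(h_linear, xs, ys)
mse_quadratic = mean_squared_error(h_quadratic, xs, ys)
print("linear hypothesis MSE:", mse_linear)
print("non-linear hypothesis MSE:", mse_quadratic)
```

On these samples the non-linear hypothesis matches the target closely while the linear one cannot, which is the contrast the two hypothesis types are meant to show.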

9 Inductive Learning machine

- Statistical dependency vs. causality
  - Inductive-learning processes build the model of dependencies, but they should not be automatically interpreted as causality relations
  - Examples: people in Florida are on average older than in other states; married men live longer than single men.

10 Loss function and Risk function

- Loss function L(y, f(X, w))
  - Measures the difference between y and f(X, w)
- Inductive learning is the process of estimating f(X, w_opt), which minimizes the risk function R(w)
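
For reference, R(w) is usually written as the expected loss over the joint distribution of inputs and outputs. The form below is the standard one from statistical learning theory rather than a formula taken from the slide; the joint density p(X, y) is an assumed symbol for the unknown distribution of the data.

```latex
R(w) = \int L\bigl(y, f(X, w)\bigr)\, p(X, y)\, dX\, dy,
\qquad
w_{\mathrm{opt}} = \arg\min_{w} R(w)
```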

11 Common Loss function
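
Two loss functions commonly used in this setting are the squared error for regression and the 0/1 loss for classification. This is a sketch of the standard forms; that they are the ones intended on this slide is an assumption.

```latex
% Regression: squared error
L\bigl(y, f(X, w)\bigr) = \bigl(y - f(X, w)\bigr)^{2}

% Classification: 0/1 loss
L\bigl(y, f(X, w)\bigr) =
\begin{cases}
0 & \text{if } y = f(X, w) \\
1 & \text{if } y \neq f(X, w)
\end{cases}
```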

12 Inductive principle

- An inductive principle is a general prescription (what to do with the data) for obtaining an estimate f(X, w_opt*)
- Human intervention in the learning algorithm
  - Selection of input and output variables
  - Data encoding and representation
  - Incorporating a priori knowledge
  - Influence over the generator in terms of the sampling rate or distribution

13 4.2 Statistical Learning Method

- A formalized theory for finite-sample inductive learning, mainly for classification or pattern recognition
  - Provides a quantitative description of the trade-off between model complexity and the available information
  - Also called VC (Vapnik-Chervonenkis) theory
- Other approaches are more engineering-oriented, without proofs and formalizations

14 Empirical risk minimization (ERM)

- Typically used when the model is given or approximated first, and then its parameters are estimated from the data
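
A minimal sketch of ERM, assuming a one-parameter model f(x, w) = w*x, a squared-error loss, and made-up training samples (all three are illustrative choices, not from the slides). The model form is fixed first; w is then chosen to minimize the average loss over the training data, here by a simple search over a grid of candidate values.

```python
def squared_loss(y, y_hat):
    """Loss L(y, f(x, w)) for a single sample."""
    return (y - y_hat) ** 2


def model(x, w):
    """Approximating function f(x, w) = w * x, chosen before seeing the data."""
    return w * x


def empirical_risk(w, samples):
    """Average loss of f(x, w) over the training samples."""
    return sum(squared_loss(y, model(x, w)) for x, y in samples) / len(samples)


# Hypothetical training samples (x, y), roughly following y = 2x.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

# Candidate parameter values 0.00, 0.01, ..., 4.00.
candidates = [i / 100 for i in range(401)]

# ERM: pick the parameter with the smallest empirical risk.
w_opt = min(candidates, key=lambda w: empirical_risk(w, samples))
print("estimated w:", w_opt)
print("empirical risk at w_opt:", empirical_risk(w_opt, samples))
```

A grid search keeps the sketch short; for this particular model and loss the minimizer also has a closed form, sum(x*y) / sum(x*x).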

15 Empirical risk minimization (ERM)

- The consistency property
  - Minimizing the empirical risk for a given data set will also minimize the true risk
- Nontrivial consistency
  - The consistency requirement must hold for all approximating functions

16 Behavior of the Growth function G(n)

- Approximating functions in the form of G(n) will have a consistency property

17 Structural Risk Minimization (SRM)

- ERM is good when n/h (the ratio of the number of training samples n to the VC dimension h) is large
- When n/h < 20, use SRM
  1. Selecting an element of a structure having optimal complexity
  2. Estimating the model based on the set of approximating functions defined in the selected element of the structure
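
The selection step can be sketched as follows, assuming the classical VC generalization bound for classification, in which the true risk is bounded (with probability 1 - eta) by the empirical risk plus a confidence term that depends on n and h. The empirical risks per structure element below are hypothetical numbers, and the function names are illustrative.

```python
import math


def vc_confidence(n, h, eta=0.05):
    """Confidence term of the classical VC bound for n samples and VC dimension h."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)


def select_structure_element(empirical_risks, n, eta=0.05):
    """Return the VC dimension h whose element minimizes empirical risk + confidence.

    empirical_risks maps the VC dimension of each element of the structure
    to the empirical risk obtained by ERM within that element.
    """
    return min(
        empirical_risks,
        key=lambda h: empirical_risks[h] + vc_confidence(n, h, eta),
    )


# Hypothetical empirical risks: richer elements fit the training data better.
hypothetical_risks = {2: 0.30, 5: 0.18, 10: 0.15, 20: 0.14}

best_h = select_structure_element(hypothetical_risks, n=200)
print("selected VC dimension:", best_h)

# The confidence term shrinks as n/h grows, which is why plain ERM suffices then.
print("n/h = 10: ", vc_confidence(100, 10))
print("n/h = 100:", vc_confidence(1000, 10))
```

The most complex element has the lowest empirical risk here but is not selected, because its confidence term is large at n = 200.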

18 SRM in practice

19 SRM

- Applications of SRM for non-linear approximations are difficult, and impossible in many cases
  - Use heuristics, like early stopping rules and weight initialization
- Three optimization approaches
  - Stochastic approximation (gradient descent)
  - Iterative methods
  - Greedy optimization

20 SRM

- Problems with the optimization approaches
  - Too sensitive to initial conditions
  - Too sensitive to stopping rules
  - Too sensitive to many local minima
- Two useful guidelines
  - Do not attempt to solve a problem by indirectly solving a harder general problem
  - Occam's razor: The best performance is provided by a model of optimal complexity

21 Requirement of any inductive-learning process

22 Types of Learning Methods

- Examples: logistic regression, multilayer perceptron, decision rules, decision trees, etc.
- Emphasis on a task-independent measure of quality of representation. Examples: cluster analysis, artificial neural network, association rules

23 Common Learning Tasks

- Classification
- Regression
- Clustering
- Summarization (Formalized Description)
- Dependency-modeling
- Deviation Detection (Outlier, Changes in time)

24 Data-mining and Knowledge-discovery techniques

- Statistical Methods
- Cluster Analysis
- Decision Trees and Decision Rules
- Association Rules
- Artificial Neural Network
- Genetic Algorithms
- Fuzzy Inference Systems
- N-dimensional Visualization Methods

25 4.5 Model Estimation

26 Testing

27 Objective of Testing

28 How to Split Samples

29 Common Resampling Methods

- Resubstitution Method
- Holdout Method
- Leave-one-out Method
- Rotation Method
- Bootstrap Method
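
The methods listed above differ mainly in how sample indices are divided into training and testing sets. The following is a minimal sketch of four of them, assuming the rotation method means n-fold cross-validation and that bootstrap test samples are those never drawn for training; function names and defaults are illustrative. Resubstitution needs no split, since the same samples are used for both training and testing.

```python
import random


def holdout_split(n, test_fraction=1 / 3, seed=0):
    """Holdout: shuffle the indices once and reserve a fraction for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def rotation_splits(n, k):
    """Rotation (k-fold): every sample appears in a test set exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def leave_one_out_splits(n):
    """Leave-one-out: rotation with as many folds as samples."""
    return rotation_splits(n, n)


def bootstrap_sample(n, seed=0):
    """Bootstrap: draw n training indices with replacement; test on the rest."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


train, test = holdout_split(9)
print("holdout:", train, test)
for fold_train, fold_test in rotation_splits(6, 3):
    print("rotation:", fold_train, fold_test)
```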

30 Error rate, Accuracy

- R = E / S
- A = 1 - R = (S - E) / S
- Two classes
  - False Negative: False Reject Rate (FRR)
  - False Positive: False Acceptance Rate (FAR)
- More than two classes
  - Confusion matrix
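
These measures can be computed directly from true and predicted labels. The sketch below assumes E is the number of misclassified test samples and S the total number of test samples, and that FAR and FRR are normalized by the number of actual negatives and actual positives respectively; the labels are made up for illustration.

```python
from collections import Counter


def error_rate(y_true, y_pred):
    """R = E / S: misclassified samples over all test samples."""
    errors = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return errors / len(y_true)


def far_frr(y_true, y_pred, positive=1):
    """Return (FAR, FRR) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    negatives = sum(1 for t in y_true if t != positive)
    positives = sum(1 for t in y_true if t == positive)
    return false_pos / negatives, false_neg / positives


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Hypothetical test results: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

R = error_rate(y_true, y_pred)
A = 1 - R
FAR, FRR = far_frr(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
print("R =", R, "A =", A, "FAR =", FAR, "FRR =", FRR)
print("confusion matrix:", matrix)
```

The same `confusion_matrix` function works for more than two classes by passing a longer `labels` list.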

31 Confusion matrix for three classes

32 Receiver Operating Characteristic (ROC) Curve

- To evaluate FAR and FRR at the same time
- The ROC curve shows sensitivity (1 - FRR) vs. 1 - specificity (FAR)
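
A minimal sketch of how ROC points can be obtained, assuming the classifier outputs a score per sample and a sample is accepted when its score reaches a threshold. Each threshold yields one point with the false acceptance rate on the horizontal axis and the true acceptance rate, 1 - FRR, on the vertical axis. The scores and labels are made up, and the area computation uses the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (FAR, 1 - FRR) points for every distinct score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing accepted
    for threshold in sorted(set(scores), reverse=True):
        accepted = [l for s, l in zip(scores, labels) if s >= threshold]
        true_pos = sum(accepted)
        false_pos = len(accepted) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical classifier scores with true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
for far, tar in points:
    print("FAR = %.3f, 1 - FRR = %.3f" % (far, tar))
print("area under the curve:", auc)
```

Lowering the threshold moves along the curve from (0, 0) to (1, 1): FAR rises while FRR falls, which is the trade-off the curve makes visible.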