Bump Hunting
The objective, PRIM algorithm, Beam search

References:
Feelders, A.J. (2002). Rule induction by bump hunting. In J. Meij (Ed.), Dealing with the data flood (STT, 65). Den Haag, the Netherlands: STT/Beweton.
Friedman, J.H. and Fisher, N.I. (1999). Bump-hunting in high-dimensional data. Statistics and Computing, 9:123–143.

Bump Hunting - The objective
Find regions in the feature space where the outcome variable has a high average value. In classification, this means a region of the feature space where the majority of the samples belong to one class. The decision rule looks like an intersection of several conditions, each on one predictor variable:
If condition 1 & condition 2 & … & condition N, then predict value …
Ex: if $0 < x_1 < 1$ and $2 < x_2 < 5$ and … and $-1 < x_n < 0$, then class 1.
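For concreteness, such a conjunctive rule is just an intersection of per-variable conditions. A tiny illustration on a hypothetical data array x (the data and thresholds below are made up):

```python
import numpy as np

x = np.random.rand(100, 3) * 6 - 1          # hypothetical data, one row per sample
rule = (0 < x[:, 0]) & (x[:, 0] < 1) \
     & (2 < x[:, 1]) & (x[:, 1] < 5) \
     & (-1 < x[:, 2]) & (x[:, 2] < 0)       # predict class 1 where rule is True
```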

Bump Hunting - The objective
When the dimension is high and there are many such boxes, the problem is not easy.

Bump Hunting - The objective
Let's formalize the problem:
Predictors $x = (x_1, \ldots, x_p)$
Target variable $y$, either continuous or binary
Feature space: $S = S_1 \times S_2 \times \cdots \times S_p$, where $S_j$ is the set of possible values of $x_j$
Find a subspace $R \subseteq S$ such that $\bar{y}_R = \mathrm{ave}(y_i \mid x_i \in R)$ is much larger than the overall average of $y$.
Note: when $y$ is binary, this average is the proportion of class 1 in the region, i.e. $\Pr(y = 1 \mid x \in R)$.
Define any box: $B = \{x : x_j \in s_j,\ j = 1, \ldots, p\}$, where each $s_j \subseteq S_j$ (an interval for a continuous $x_j$, a subset of categories for a categorical $x_j$).

Bump Hunting - The objective
[Figure: a box in a continuous feature space.]

Bump Hunting - The objective
[Figure: a box in a categorical feature space.]

Bump Hunting - PRIM
Sequentially find boxes in subsets of the data.
Support of a box: $\beta_B = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(x_i \in B)$, the fraction of the data falling in the box.
Continue searching for boxes until there is not enough support for a new box.

Bump Hunting - PRIM
"Patient Rule Induction Method". Two steps:
(1) Patient successive top-down refinement (peeling)
(2) Bottom-up recursive expansion (pasting)
Both are greedy algorithms.

Bump Hunting - PRIM
Peeling: Begin with a box B containing all the data (or all remaining data in later steps).
Remove the sub-box $b^*$ that maximizes the mean of $y$ in what is left:
$b^* = \arg\max_{b} \ \mathrm{ave}(y_i \mid x_i \in B - b)$
Each candidate sub-box $b$ is defined on a single variable (peeling happens in only one dimension at a time), and only a small fraction $\alpha$ of the data is peeled each time. For a continuous $x_j$, the candidates are $b_{j-} = \{x : x_j < x_{j(\alpha)}\}$ and $b_{j+} = \{x : x_j > x_{j(1-\alpha)}\}$, where $x_{j(\alpha)}$ is the $\alpha$-quantile of $x_j$ within the box.

Bump Hunting - PRIM
This is a greedy hill-climbing algorithm. Stop the iteration when the support drops to a pre-determined threshold.
Why is it called "patient"? Because only a small fraction of the data is removed at each step.
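The peeling loop fits in a few lines of numpy for continuous predictors. This is a minimal sketch, not Friedman and Fisher's reference implementation; the quantile-based candidate peels and the default values of alpha and the support threshold are illustrative choices:

```python
import numpy as np

def peel_once(X, y, in_box, alpha=0.05):
    """One PRIM peeling step: for each variable, try removing the alpha
    fraction at either end of its distribution within the current box,
    and keep the peel that maximizes the mean of y in what remains."""
    best_mean, best_mask = -np.inf, None
    for j in range(X.shape[1]):
        xj = X[in_box, j]
        lo, hi = np.quantile(xj, alpha), np.quantile(xj, 1 - alpha)
        for keep in (X[:, j] >= lo, X[:, j] <= hi):   # peel low end / high end
            mask = in_box & keep
            if mask.any() and y[mask].mean() > best_mean:
                best_mean, best_mask = y[mask].mean(), mask
    return best_mask

def prim_peel(X, y, alpha=0.05, min_support=0.1):
    """Patient top-down refinement: peel until the box support
    (fraction of all data inside the box) drops to the threshold."""
    in_box = np.ones(len(y), dtype=bool)
    while True:
        mask = peel_once(X, y, in_box, alpha)
        if mask is None or mask.mean() < min_support:
            return in_box
        in_box = mask
```

Calling prim_peel(X, y) returns a boolean mask of the final box; in the full PRIM procedure this would be followed by pasting, and then the whole search would be repeated on the data outside the box.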

Bump Hunting - PRIM
Pasting: In peeling, box boundaries are determined without knowledge of later peels, so some non-optimal steps may be taken. The final box can therefore often be improved by readjusting its boundaries: expand the box along one variable at a time, as long as the box mean increases.
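A matching sketch of the pasting pass, assuming the box is represented by explicit per-variable bounds (lo, hi) rather than a membership mask; the expansion step alpha is again an illustrative choice:

```python
import numpy as np

def paste(X, y, lo, hi, alpha=0.01, max_iter=100):
    """Bottom-up pasting (a sketch): the box is lo[j] <= x_j <= hi[j].
    Expand one face at a time, keeping an expansion only if it raises
    the box mean; stop when no face can be pushed out profitably."""
    in_box = lambda a, b: np.all((X >= a) & (X <= b), axis=1)
    for _ in range(max_iter):
        improved = False
        base = y[in_box(lo, hi)].mean()
        for j in range(X.shape[1]):
            step = alpha * (X[:, j].max() - X[:, j].min())
            trial = lo.copy(); trial[j] -= step       # push the lower face out
            if y[in_box(trial, hi)].mean() > base:
                lo, base, improved = trial, y[in_box(trial, hi)].mean(), True
            trial = hi.copy(); trial[j] += step       # push the upper face out
            if y[in_box(lo, trial)].mean() > base:
                hi, base, improved = trial, y[in_box(lo, trial)].mean(), True
        if not improved:
            break
    return lo, hi
```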

Bump Hunting - PRIM example
[Figures: a sequence of peeling steps on a small example data set. Candidate single-variable peels are compared by the resulting box mean (the values 2/7 and 3/7 appear for two candidates); the winner is the peel with the highest box mean. The next peel reaches a box mean of 1 with support β = 0.4.]

Bump Hunting - Beam search algorithm
At each step, the w best sub-boxes (each defined on a single variable) are selected, subject to a minimum support requirement.
More greedy than PRIM: at each step, much more data can be peeled away.

Bump Hunting - Beam search algorithm
[Figures: beam search with w = 2, keeping the two best single-variable sub-boxes at each step.]
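A rough sketch of the beam idea: keep the w best boxes, refine each on a single variable per step, and prune by support. The median split used here is only a stand-in for the real peeling rule; the names and defaults are illustrative:

```python
import numpy as np

def beam_search(X, y, width=2, min_support=0.1, n_steps=5):
    """Beam search over boxes (a sketch): at each step, every box in the
    beam is refined by a single-variable split, and the `width` boxes
    with the highest mean of y (and enough support) are kept."""
    beam = [np.ones(len(y), dtype=bool)]            # start: everything in one box
    for _ in range(n_steps):
        candidates = []
        for box in beam:
            for j in range(X.shape[1]):
                cut = np.median(X[box, j])          # coarse single-variable split
                for mask in (box & (X[:, j] <= cut), box & (X[:, j] > cut)):
                    if mask.mean() >= min_support:  # minimum support requirement
                        candidates.append(mask)
        if not candidates:
            break
        candidates.sort(key=lambda m: y[m].mean(), reverse=True)
        beam = candidates[:width]                   # keep the w best sub-boxes
    return beam[0]
```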

Bump Hunting - About PRIM
It is a greedy search. However, it is "patient", and this is important. Methods that partition the data much faster, e.g. beam search and CART, can be less successful. The patient approach makes it easier to recover from earlier unfortunate steps, since we do not run out of data too fast. It also does not discard predictors merely because they are highly correlated with others.

Neural networks (1)
Introduction, Fitting neural networks

Neural network
K-class classification: K nodes in the top layer.
Continuous outcome (regression): a single node in the top layer.

Neural network
The $Z_m$ are created from linear combinations of the inputs:
$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X), \quad m = 1, \ldots, M$, where $\sigma(v) = 1/(1 + e^{-v})$ is the sigmoid activation.
$Y_k$ is modeled as a function of linear combinations of the $Z_m$:
$T_k = \beta_{0k} + \beta_k^T Z, \qquad f_k(X) = g_k(T), \quad k = 1, \ldots, K$
For regression, we can use the simple identity $g_k(T) = T_k$; typically K = 1.
For K-class classification, the softmax (multilogit model): $g_k(T) = \dfrac{e^{T_k}}{\sum_{\ell=1}^{K} e^{T_\ell}}$

Neural network
[Figure: schematic of a single-hidden-layer, feed-forward neural network.]
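The forward computation just described fits in a few lines of numpy. This sketch assumes a sigmoid hidden layer and softmax outputs; the function and parameter names are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(X, alpha0, alpha, beta0, beta):
    """Forward pass of a single-hidden-layer network.
    X: (N, p) inputs; alpha: (M, p) and beta: (K, M) weight matrices;
    alpha0: (M,) and beta0: (K,) biases. Returns softmax outputs (N, K)."""
    Z = sigmoid(X @ alpha.T + alpha0)        # hidden layer: Z_m = sigma(alpha_0m + alpha_m^T x)
    T = Z @ beta.T + beta0                   # output layer: T_k = beta_0k + beta_k^T z
    T -= T.max(axis=1, keepdims=True)        # stabilize the softmax numerically
    E = np.exp(T)
    return E / E.sum(axis=1, keepdims=True)  # g_k(T) = softmax
```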

Neural network
A simple network built from linear threshold units ("bias" = intercept):
y1: $x_1 + x_2 - 0.5 \ge 0$
y2: $x_1 + x_2 - 1.5 \ge 0$
$z_1 = +1$ if and only if $y_1 = +1$ and $y_2 = -1$, i.e. exactly one of the two binary inputs is 1 (the XOR function).
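A quick check of this construction on the four binary input pairs (hypothetical helper code; sign conventions as in the slide):

```python
import itertools

sign = lambda v: 1 if v >= 0 else -1
for x1, x2 in itertools.product([0, 1], repeat=2):
    y1 = sign(x1 + x2 - 0.5)
    y2 = sign(x1 + x2 - 1.5)
    z1 = 1 if (y1 == 1 and y2 == -1) else -1
    print(x1, x2, "->", z1)   # z1 = +1 exactly when x1 != x2
```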

Neural network
[Figure (not preserved in the transcript).]

Fitting Neural Networks
Set of parameters (weights):
$\theta = \{\alpha_{0m}, \alpha_m : m = 1, \ldots, M\} \cup \{\beta_{0k}, \beta_k : k = 1, \ldots, K\}$
Objective function:
Regression (sum of squared errors): $R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - f_k(x_i))^2$
Classification: cross-entropy (deviance): $R(\theta) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$
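Both objectives are straightforward to compute. A small sketch, where Y holds the targets (one-hot labels for classification) and F the network outputs $f_k(x_i)$; the small eps is a numerical-safety detail, not from the slides:

```python
import numpy as np

def squared_error(Y, F):
    """R(theta) for regression: sum over outputs and observations."""
    return np.sum((Y - F) ** 2)

def cross_entropy(Y, F, eps=1e-12):
    """R(theta) for classification: Y is one-hot (N, K), F holds the
    predicted class probabilities f_k(x_i)."""
    return -np.sum(Y * np.log(F + eps))
```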

Fitting Neural Networks
$R(\theta)$ is minimized by gradient descent, called "back-propagation".
Middle-layer values for each data point: $z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T x_i)$, with $z_i = (z_{1i}, z_{2i}, \ldots, z_{Mi})$.
We use the squared error loss for demonstration: $R(\theta) = \sum_{i=1}^{N} R_i = \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - f_k(x_i))^2$

Fitting Neural Networks
Derivatives (indices: $k$ output unit, $m$ hidden unit, $\ell$ input, $i$ observation):
$\dfrac{\partial R_i}{\partial \beta_{km}} = -2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, z_{mi} = \delta_{ki}\, z_{mi}$
$\dfrac{\partial R_i}{\partial \alpha_{m\ell}} = -\sum_{k=1}^{K} 2(y_{ik} - f_k(x_i))\, g_k'(\beta_k^T z_i)\, \beta_{km}\, \sigma'(\alpha_m^T x_i)\, x_{i\ell} = s_{mi}\, x_{i\ell}$
Descent along the gradient, with learning rate $\gamma_r$:
$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^{N} \dfrac{\partial R_i}{\partial \beta_{km}}, \qquad \alpha_{m\ell}^{(r+1)} = \alpha_{m\ell}^{(r)} - \gamma_r \sum_{i=1}^{N} \dfrac{\partial R_i}{\partial \alpha_{m\ell}}$

Fitting Neural Networks
By definition, $\delta_{ki}$ and $s_{mi}$ are the "errors" at the output and hidden units. They satisfy the back-propagation equations:
$s_{mi} = \sigma'(\alpha_m^T x_i) \sum_{k=1}^{K} \beta_{km}\, \delta_{ki}$

Fitting Neural Networks
General workflow of back-propagation:
Forward pass: fix the weights and compute the fitted values $\hat{f}_k(x_i)$.
Backward pass: compute the output errors $\delta_{ki}$; back-propagate them to obtain the hidden-layer errors $s_{mi}$; use both to compute the gradients for the updates; update the weights.
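Putting the two passes together, here is a minimal numpy sketch of one batch gradient step for the squared error case, with identity output units ($g_k(T) = T_k$) and bias terms omitted for brevity; all names are illustrative:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(X, Y, alpha, beta, gamma=0.01):
    """One batch gradient-descent step for a single-hidden-layer network
    with identity outputs and squared error loss (biases omitted).
    X: (N, p), Y: (N, K), alpha: (M, p), beta: (K, M)."""
    # Forward pass
    Z = sigmoid(X @ alpha.T)            # (N, M) hidden activations z_mi
    F = Z @ beta.T                      # (N, K) outputs, g_k(T) = T_k
    # Backward pass
    delta = -2.0 * (Y - F)              # (N, K) output errors delta_ki
    s = (delta @ beta) * Z * (1 - Z)    # (N, M) hidden errors; sigma'(t) = sigma(t)(1 - sigma(t))
    # Gradient updates
    beta -= gamma * delta.T @ Z         # dR/dbeta_km  = sum_i delta_ki z_mi
    alpha -= gamma * s.T @ X            # dR/dalpha_ml = sum_i s_mi x_il
    return alpha, beta
```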

Fitting Neural Networks
Back-propagation can exploit parallel computing: each hidden unit passes and receives information only to and from the units that share a connection with it.
Online training: the fitting scheme allows the network to handle very large training sets, and also to update the weights as new observations come in.
Training a neural network is an "art":
the model is generally overparametrized;
the optimization problem is nonconvex and unstable;
a neural network model is a black box and hard to interpret directly.

Fitting Neural Networks
Initialization: when the weight vectors are close to zero, all Z values are close to zero, the sigmoid curve is close to linear, and so the overall model is close to linear, i.e. a relatively simple model. (This can be seen as a regularized solution.)
So start with very small weights, and let the neural network learn the necessary nonlinear relations from the data. Starting with large weights often leads to poor solutions.
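A minimal illustration of the near-zero start; the uniform range [-0.7, 0.7] is a common heuristic rather than something prescribed by the slides, and the layer sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
M, p, K = 4, 3, 2                        # hidden units, inputs, outputs (illustrative sizes)
alpha = rng.uniform(-0.7, 0.7, (M, p))   # small random weights: model starts near-linear
beta = rng.uniform(-0.7, 0.7, (K, M))
```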