
1 Classification & Regression COSC 526 Class 7 Arvind Ramanathan Computational Science & Engineering Division Oak Ridge National Laboratory, Oak Ridge Ph: 865-576-7266 E-mail: ramanathana@ornl.gov

2 Last Class
– Streaming data analytics: sampling, counting, frequent item sets, etc.
– Berkeley Data Analytics Stack (BDAS)

3 Project Deliverables and Deadlines
– Initial selection of topics (1, 2): Jan 27, 2015 (10% of grade)
– Project Description and Approach: Feb 20, 2015 (20%)
– Initial Report: Mar 20, 2015 (10%)
– Project Demonstration: Apr 16-19, 2015 (10%)
– Final Project Report (10 pages): Apr 21, 2015 (25%)
– Poster (12-16 slides): Apr 23, 2015 (25%)
(1) Projects can come with their own data (e.g., from your project) or data can be provided.
(2) Datasets need to be open! Please don't use datasets that have proprietary limitations.
All reports will be in NIPS format: http://nips.cc/Conferences/2013/PaperInformation/StyleFiles

4 This class and next…
– Classification
– Decision Tree Methods
– Instance-based Learning
– Support Vector Machines (SVM)
– Latent Dirichlet Allocation (LDA)
– How do we make these work with large-scale datasets?

5 Part I: Classification and a few new terms…

6 Classification Given a collection of records (training):
– Each record has a set of attributes [x1, x2, …, xn]
– One attribute is referred to as the class [y]
Training goal: build a model for the class attribute as a function of the other attributes.
Testing goal: previously unseen records should be assigned a class attribute as accurately as possible.

7 Illustration of Classification
Training records (ID, Attrib1, Attrib2, Attrib3, Class):
ID  Attrib1  Attrib2  Attrib3  Class
1   Yes      Large    120,000  No
2            Medium   100,000  No
3            Small     70,000  No
4   Yes      Medium   120,000  No
5            Large     95,000  Yes
6            Large    220,000  Yes
…   …        …        …        …
Test records:
20  No       Small     55,000  ?
21  Yes      Medium    90,000  ?
22  Yes      …         …       ?
Workflow: training set → learning algorithm → learn model → apply model to the test set.

8 Examples of classification
– Predicting whether tumor cells are benign or cancerous
– Classifying credit card transactions as legitimate or fraudulent
– Classifying whether a protein sequence is a helix, beta-sheet or a random coil
– Classifying whether a tweet is talking about sports, finance, terrorism, etc.

9 Two Problem Settings
Supervised learning:
– Training and testing data
– Training data includes labels (the class attribute y)
– Testing data consists of all other attributes
Unsupervised learning:
– Training data does not usually include labels
– We will cover unsupervised learning later (mid Feb)

10 Two more terms…
Parametric:
– A particular functional form is assumed, e.g., Naïve Bayes
– Simplicity: easy to interpret and understand
– High bias: real data might not follow this functional form
Non-parametric:
– Estimation is purely data-driven
– No functional form is assumed

11 Part II: Classification with Decision Trees Examples illustrated from Tan, Steinbach and Kumar, Introduction to Data Mining textbook

12 Decision Trees
– One of the more popular learning techniques
– Uses a tree-structured plan: a set of attributes is tested in sequence in order to predict the output
– The decision of which attribute to test is based on information gain

13 Example of a Decision Tree (model learned from the training data): the attributes are categorical (Refund, MarSt), continuous (TaxInc), and the class label. The splitting attributes are Refund (Yes/No), then MarSt (Married vs. Single/Divorced), then TaxInc (< 80K vs. > 80K), and the leaves are labeled YES or NO.

14 Another Example of Decision Tree: the same data can also be fit by first splitting on MarSt (Married vs. Single/Divorced), then Refund (Yes/No), then TaxInc (< 80K vs. > 80K). There could be more than one tree that fits the same data!

15 Decision Tree Classification Task (workflow figure: induction on the training set produces a decision tree; the model is then applied to the test set)

16-21 Apply Model to Test Data: starting from the root of the tree, route the test record down the branch that matches each of its attribute values (Refund Yes/No, then MarSt Married vs. Single/Divorced, then TaxInc < 80K vs. > 80K) until a leaf is reached. For the test record shown, the leaf is NO, so Cheat is assigned to "No".

22 Decision Tree Classification Task (workflow figure repeated, leading into tree induction)

23 Decision Tree Induction Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

24 General Structure of Hunt's Algorithm Let Dt be the set of training records that reach a node t. General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled with the default class yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
– Recursively apply the procedure to each subset (a minimal sketch follows)
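To make the three cases concrete, here is a minimal Python sketch of Hunt's procedure. The record format, the Counter-based majority default, and the trivial attribute-selection rule (take the next attribute in the list) are illustrative assumptions, not the lecture's reference implementation.

from collections import Counter

def hunt(records, attributes, default_class):
    """records: list of (attrib_dict, label) pairs; attributes: list of attribute names."""
    # Case 1: empty set -> leaf labeled with the default class
    if not records:
        return {"leaf": default_class}
    labels = [y for _, y in records]
    # Case 2: all records in the same class (or no attributes left) -> leaf node
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Case 3: pick an attribute test (here simply the next attribute) and
    # split the records into smaller subsets, one per attribute value.
    attr = attributes[0]
    majority = Counter(labels).most_common(1)[0][0]
    tree = {"split_on": attr, "children": {}}
    for v in set(x[attr] for x, _ in records):
        subset = [(x, y) for x, y in records if x[attr] == v]
        tree["children"][v] = hunt(subset, attributes[1:], majority)
    return tree

# Toy usage with made-up records:
data = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
        ({"Refund": "No",  "MarSt": "Married"}, "No"),
        ({"Refund": "No",  "MarSt": "Single"}, "Yes")]
print(hunt(data, ["Refund", "MarSt"], default_class="No"))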

25 Hunt's Algorithm on the cheat data (figure): the tree starts as a single leaf (Don't Cheat), is first split on Refund (Yes → Don't Cheat), then the No branch is split on Marital Status (Married → Don't Cheat), and finally the Single/Divorced branch is split on Taxable Income (< 80K vs. >= 80K, with the < 80K leaf labeled Don't Cheat).

26 Tree Induction Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion
Issues:
– Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
– Determine when to stop splitting

27 Tree Induction (slide repeated, leading into how to specify the attribute test condition)

28 How to Specify the Test Condition? Depends on the attribute type:
– Nominal
– Ordinal
– Continuous
Depends on the number of ways to split:
– 2-way split
– Multi-way split

29 Splitting Based on Nominal Attributes
– Multi-way split: use as many partitions as distinct values (e.g., CarType → Family / Sports / Luxury)
– Binary split: divide the values into two subsets and find the optimal partitioning (e.g., {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family})

30 Splitting Based on Ordinal Attributes
– Multi-way split: use as many partitions as distinct values (e.g., Size → Small / Medium / Large)
– Binary split: divide the values into two subsets that respect the ordering and find the optimal partitioning (e.g., {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}); a split such as {Small, Large} vs. {Medium} breaks the ordering

31 Splitting Based on Continuous Attributes Different ways of handling:
– Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering)
– Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut, which can be more compute-intensive

32 Splitting Based on Continuous Attributes (figure illustrating binary and multi-way splits on a continuous attribute)

33 Tree Induction (slide repeated, leading into how to determine the best split)

34 How to Determine the Best Split Before splitting: 10 records of class 0, 10 records of class 1. Which test condition is the best?

35 How to Determine the Best Split Greedy approach:
– Nodes with a homogeneous class distribution are preferred
– Need a measure of node impurity: a non-homogeneous node has a high degree of impurity, a homogeneous node has a low degree of impurity

36 Measures of Node Impurity
– Gini index
– Entropy
– Misclassification error

37 How to Find the Best Split (figure): let M0 be the impurity before splitting. Splitting on attribute A yields nodes N1 and N2 with impurities M1 and M2, combined (weighted) into M12; splitting on B yields N3 and N4 with M3 and M4, combined into M34. Compare Gain = M0 – M12 vs. M0 – M34 and choose the attribute with the larger gain.

38 Measure of Impurity: GINI The Gini index for a given node t is GINI(t) = 1 – Σ_j [p(j | t)]², where p(j | t) is the relative frequency of class j at node t.
– Maximum (1 – 1/nc, with nc the number of classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information

39 Examples for computing GINI
– P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0
– P(C1) = 1/6, P(C2) = 5/6: Gini = 1 – (1/6)² – (5/6)² = 0.278
– P(C1) = 2/6, P(C2) = 4/6: Gini = 1 – (2/6)² – (4/6)² = 0.444
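A small helper reproduces these numbers directly from raw class counts; the function name and count-list interface are illustrative choices.

def gini(counts):
    """Gini(t) = 1 - sum_j p(j|t)^2, from raw class counts at a node."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0   (all records in one class)
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444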

40 Splitting Based on GINI Used in CART, SLIQ, SPRINT. When a node p is split into k partitions (children), the quality of the split is computed as GINI_split = Σ_{i=1}^{k} (n_i / n) · GINI(i), where n_i is the number of records at child i and n is the number of records at node p.

41 Binary Attributes: Computing the GINI Index
– The split produces two partitions; the effect of weighting the partitions is that larger and purer partitions are sought.
– Example (B? splits the 12 records into N1 with class counts 5/2 and N2 with 1/4): Gini(N1) = 1 – (5/7)² – (2/7)² = 0.408; Gini(N2) = 1 – (1/5)² – (4/5)² = 0.320; Gini(children) = 7/12 · 0.408 + 5/12 · 0.320 = 0.371.
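The weighted combination can be checked in a few lines; the class counts 5/2 and 1/4 are read off the slide's fractions and partition weights, so treat them as an assumption about the underlying count matrix.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

n1, n2 = [5, 2], [1, 4]                  # class counts in the two children
total = sum(n1) + sum(n2)                # 12 records at the parent
g_split = (sum(n1) / total) * gini(n1) + (sum(n2) / total) * gini(n2)
print(round(gini(n1), 3), round(gini(n2), 3), round(g_split, 3))
# 0.408 0.32 0.371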

42 Categorical Attributes: Computing the Gini Index For each distinct value, gather the counts for each class in the dataset and use the count matrix to make decisions: either a multi-way split or a two-way split (find the best partition of values).

43 Continuous Attributes: Computing the Gini Index Use binary decisions based on one value. Several choices for the splitting value:
– The number of possible splitting values = the number of distinct values
– Each splitting value v has a count matrix associated with it: the class counts in each of the partitions, A < v and A >= v
Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient! Repetition of work.

44 Continuous Attributes: Computing the Gini Index… For efficient computation, for each attribute:
– Sort the records on the attribute values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index
(The slide's table lists the sorted values and the candidate split positions between them; a sketch of this scan follows.)
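A minimal version of the sort-then-sweep search, with running class counts on each side of the candidate threshold. The midpoint thresholds, variable names, and the toy income data (modeled loosely on the textbook example) are illustrative assumptions.

from collections import Counter

def best_split(values, labels):
    pairs = sorted(zip(values, labels))          # sort on the attribute
    right = Counter(labels)                      # everything starts on the right side
    left = Counter()
    n = len(pairs)
    best = (float("inf"), None)                  # (gini, threshold)

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values()) if total else 0.0

    for i in range(1, n):
        v_prev, y_prev = pairs[i - 1]
        left[y_prev] += 1                        # move one record to the left side
        right[y_prev] -= 1
        v_cur = pairs[i][0]
        if v_cur == v_prev:                      # no valid threshold between equal values
            continue
        threshold = (v_prev + v_cur) / 2.0
        g = (i / n) * gini(left, i) + ((n - i) / n) * gini(right, n - i)
        if g < best[0]:
            best = (g, threshold)
    return best

# Taxable-income style toy data (values in thousands), labels = Cheat
vals = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labs = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(vals, labs))   # roughly (0.3, 97.5) on this toy data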

45 How do we do this with large datasets? GINI computation is expensive:
– Sorting along numeric attributes is O(n log n)!
– An in-memory hash table is needed to split based on attributes
Empirical observations:
– GINI scores tend to increase or decrease slowly along a sorted attribute
– The minimum GINI value for the splitting attribute is lower than at the other data points
CLOUDS (Classification for Large OUt-of-core Datasets) exploits these observations.

46 How to hasten the GINI computation? (1) Sampling of Splitting points (SS):
– Use quantiles to divide the points into q intervals
– Evaluate the GINI index at each interval boundary
– Determine the minimum value, GINI_min
– Split along the attribute with GINI_min
Pro:
– Only one pass over the entire dataset
Cons:
– Greedy
– Time is dependent on the number of points in each interval
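One way to picture the SS idea is to evaluate the Gini index only at the q quantile boundaries instead of at every distinct value. The sketch below is my own rendering under that reading, not the CLOUDS code: the real method gathers the interval count matrices in a single scan, whereas this version recomputes the masks per boundary for clarity, and q, numpy's quantile function, and the function names are assumptions.

import numpy as np

def gini_from_counts(counts):
    n = counts.sum()
    return 1.0 - ((counts / n) ** 2).sum() if n else 0.0

def ss_split(values, labels, q=10):
    """Evaluate the Gini index only at q quantile boundaries of `values`."""
    values = np.asarray(values, dtype=float)
    classes, y = np.unique(labels, return_inverse=True)
    boundaries = np.quantile(values, np.linspace(0, 1, q + 1)[1:-1])
    best = (np.inf, None)
    n = len(values)
    for v in boundaries:
        left_mask = values <= v
        left = np.bincount(y[left_mask], minlength=len(classes))
        right = np.bincount(y[~left_mask], minlength=len(classes))
        nl, nr = left.sum(), right.sum()
        if nl == 0 or nr == 0:
            continue
        g = (nl / n) * gini_from_counts(left) + (nr / n) * gini_from_counts(right)
        if g < best[0]:
            best = (g, v)
    return best   # (GINI_min over the sampled boundaries, split value)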

47 How to hasten the GINI computation? (2) Sampling Splitting points with Estimation (SSE):
– Find GINI_min as outlined in SS
– Estimate a lower bound GINI_est for each interval
– Eliminate (prune) all intervals where GINI_est >= GINI_min
– Derive the list of (alive) intervals within which the correct split can still be determined
How do we determine what to prune?
– Goal: estimate the minimum GINI within an interval [v_l, v_u]
– Approach: a hill-climbing heuristic
– Inputs: n, the number of data points (records); c, the number of classes
– Compute: x_i (y_i), the number of elements of class i that are less than or equal to v_l (v_u); c_i, the total number of elements of class i; n_l (n_u), the number of elements less than or equal to v_l (v_u)
– Take the partial derivative along the subset of classes; the class with the minimum gradient is retained as the split point for the next step
Pros:
– Instead of being dependent on n, the hill-climbing heuristic is dependent on c
– Other optimizations include prioritizing alive intervals based on the estimated GINI_min
– Better accuracy than SS in estimating the splitting points
Cons:
– Paying more for splitting the data (I/O costs are higher)

48 Summary Decision trees are an elegant way to design classifiers:
– Simple, with an intuitive meaning associated with the splits
– More than one tree representation is possible for the same data
– Many heuristic measures are available for deciding how to construct a tree
For Big Data, we need to design heuristics that provide approximate (yet close to accurate) solutions.

49 Part III: Classification with k-Nearest Neighbors Examples illustrated from Tom Mitchell, Andrew Moore, Jure Leskovec, …

50 Instance-based Learning Given a number of examples {(x, y)} and a query vector q:
– Find the closest example(s) x*
– Predict y*
Learning: store all training examples.
Prediction (a sketch follows):
– Classification problem: return the majority class among the k examples
– Regression problem: return the average y value of the k examples
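A minimal brute-force k-NN sketch matching these prediction rules: majority vote for classification, mean of the neighbours' y for regression. The function name, the (x, y) tuple format, and the toy data are illustrative.

import math
from collections import Counter

def knn_predict(examples, q, k=3, regression=False):
    """examples: list of (x, y) with x a tuple of numbers; q: query vector."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbours = sorted(examples, key=lambda xy: dist(xy[0], q))[:k]
    ys = [y for _, y in neighbours]
    if regression:
        return sum(ys) / len(ys)               # average y of the k neighbours
    return Counter(ys).most_common(1)[0][0]    # majority class

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 0.9), k=3))     # "A"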

51 Application Examples Collaborative filtering:
– Find the k most similar people to user x who have rated movies in a similar way
– Predict the rating y_x of x as an average of the ratings y_k of those k users

52 Visual interpretation of how k-Nearest Neighbor works (figure with decision regions for k = 1 and k = 3)

53 1-Nearest Neighbor
– Distance metric: e.g., Euclidean
– How many neighbors to look at? One
– Weighting function (optional): not used
– How to fit with the local points? Predict the same output as the nearest neighbor

54 k-Nearest Neighbor (k = 9)
– Distance metric: e.g., Euclidean
– How many neighbors to look at? k (e.g., k = 9)
– Weighting function (optional): not used
– How to fit with the local points? Predict the average output among the k nearest neighbors

55 Generalizing k-NN: Kernel Regression Instead of k neighbors, look at all points. Weight each point x_i with a Gaussian function of its distance to the query, w_i = exp(–d(x_i, q)² / K_w²), and fit the local points with the weighted average. (The weight is maximal when d(x_i, q) = 0.)
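A sketch of that weighted-average prediction; the kernel width K_w is a free parameter, and the function name and toy data are illustrative assumptions rather than the slide's exact formulation.

import numpy as np

def kernel_regression(X, y, q, K_w=1.0):
    X, y, q = np.asarray(X, float), np.asarray(y, float), np.asarray(q, float)
    d = np.linalg.norm(X - q, axis=1)          # distance of each point to the query
    w = np.exp(-(d ** 2) / (K_w ** 2))         # Gaussian weights, w_i = exp(-d_i^2 / K_w^2)
    return np.sum(w * y) / np.sum(w)           # weighted-average prediction

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
print(kernel_regression(X, y, q=[1.5], K_w=0.5))   # a value between 1 and 4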

56 Distance-Weighted k-NN Points that are close by may be weighted more heavily than points farther away. Prediction rule: ŷ(q) = Σ_i w_i y_i / Σ_i w_i, where w_i can be, e.g., 1 / d(x, x_i)², with d the distance between the query x and x_i.

57 How to compute the weights from d(x, x_i)? One approach: Shepard's method (inverse-distance weighting)

58 How to find nearest neighbors? Given a set P of n points in R^d and a query point q:
– NN: find the nearest neighbor p of q in P
– Range search: find one/all points in P within distance r from q

59 Issues with large datasets
– Which distance measure to use
– Curse of dimensionality: in high dimensions, the nearest neighbor might not be near at all
– Irrelevant attributes: a lot of attributes (in big data) are not informative
– k-NN wants the data in memory, and it must make a pass over the entire dataset to do classification

60 Distance Metrics A distance D must satisfy:
– D(x, x_i) >= 0 (non-negative)
– D(x, x_i) = 0 iff x = x_i (reflexive)
– D(x, x_i) = D(x_i, x) (symmetric)
– For any other data point y, D(x, x_i) + D(x_i, y) >= D(x, y) (triangle inequality)
Euclidean distance is one of the most commonly used metrics.

61 But… Euclidean is not always useful!
– It makes sense when all data points are expressed in the same units (or a comparable scale)
– Euclidean distance generalizes to the L_k norm, also called the Minkowski distance: L_k(x, y) = (Σ_i |x_i – y_i|^k)^(1/k)
– L1: Manhattan distance (L1 norm); L2: Euclidean distance (L2 norm)
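A one-function sketch of the Minkowski distance, with Manhattan (k = 1) and Euclidean (k = 2) as the special cases mentioned above; names and the example points are illustrative.

def minkowski(x, y, k=2):
    """L_k distance between two equal-length vectors."""
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

p, q = (0, 0), (3, 4)
print(minkowski(p, q, k=1))   # 7.0  (Manhattan)
print(minkowski(p, q, k=2))   # 5.0  (Euclidean)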

62 Curse of Dimensionality… What happens if d is very large? Say we have 10,000 points uniformly distributed in a unit hypercube and we want to run k = 5 nearest neighbors for a query at the origin (0, 0, …, 0):
– In d = 1: we have to go 5/10000 = 0.0005 along the axis on average to capture our 5 neighbors
– In d = 2: we have to go (0.0005)^(1/2) to find 5 neighbors
– In general d: we have to go (0.0005)^(1/d) to find the neighbors
Very expensive, and the method can break down!
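A quick arithmetic check of that claim: the edge length of the sub-cube expected to hold 5 of the 10,000 points approaches the full unit range as d grows.

# Edge length (0.0005)**(1/d) of the cube expected to hold 5 of 10,000
# uniformly distributed points in the unit hypercube.
for d in (1, 2, 3, 10, 100):
    print(d, round(0.0005 ** (1.0 / d), 4))
# 1 0.0005, 2 0.0224, 3 0.0794, 10 0.4676, 100 0.9268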

63 Irrelevant features… A large number of features are uninformative about the data:
– They do not discriminate across classes (though they may vary within a class)
– We may be left with a classic "hole" in the data when computing the neighbors

64 Efficient indexing (for computing neighbors) A kd-tree is an efficient data structure:
– Split along the median value of the dimension having the highest variance
– Points are stored in the leaves
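In practice a library kd-tree can handle both queries from slide 58; using scipy.spatial.cKDTree here is my own choice of tool, not one prescribed by the slides.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 3))              # n points in R^d (here d = 3)
tree = cKDTree(points)                        # build the spatial index

q = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(q, k=5)                # 5 nearest neighbours of q
in_range = tree.query_ball_point(q, r=0.05)   # range search within radius r
print(idx, len(in_range))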

65 Space and Time Complexity of kd-trees Space: O(n)!
– Storage can be reduced with other efficient data structures, e.g., a ball-tree (which does better with higher-dimensional datasets)
Time per query:
– O(n) in the worst-case scenario
– O(log n) in normal scenarios…

66 Now, how do we make this online? Instead of assuming a static k-NN classifier, how about we make our k-NN adapt to incoming data? Locally weighted linear regression:
– Form an explicit approximation f̂ for a region surrounding the query q (a piece-wise approximation)
– Minimize the error over the k nearest neighbors of q
– Minimize the error over all training examples (weighting by distance)
– Or combine the two

67 LWLR: Mathematical form Linear regression: f̂(x) = w_0 + w_1 x_1 + … + w_n x_n. Error criteria to minimize for a query x_q:
– Over the k nearest neighbors: E_1(x_q) = ½ Σ_{x ∈ kNN(x_q)} (f(x) – f̂(x))²
– Over the entire set of examples, weighting by distance with a kernel K: E_2(x_q) = ½ Σ_{x ∈ D} (f(x) – f̂(x))² K(d(x_q, x))
– Combining the two: E_3(x_q) = ½ Σ_{x ∈ kNN(x_q)} (f(x) – f̂(x))² K(d(x_q, x))
(A sketch of the weighted least-squares fit follows.)
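One common way to realize the E_2-style criterion is a closed-form weighted least-squares fit around each query, sketched below. The Gaussian bandwidth tau, the intercept handling, and the toy data are assumptions for illustration, not the lecture's exact formulation.

import numpy as np

def lwlr_predict(X, y, q, tau=0.5):
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    q = np.asarray(q, float)
    Xb = np.hstack([np.ones((len(X), 1)), X])         # add intercept column
    qb = np.hstack([1.0, q])
    d2 = np.sum((X - q) ** 2, axis=1)                 # squared distances to the query
    w = np.exp(-d2 / (2 * tau ** 2))                  # Gaussian kernel weights
    W = np.diag(w)
    # Solve the weighted normal equations (X^T W X) beta = X^T W y
    beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return qb @ beta                                  # local linear prediction at q

X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0.0, 0.8, 0.9, 0.1, -0.7]                        # roughly sin(x)
print(lwlr_predict(X, y, q=[2.5]))                    # local linear estimate near x = 2.5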

68 Example of LWLR (figure comparing it with simple regression)

69 Large-scale Classification & Regression Decision trees:
– Querying and inference are easy
– Induction needs special constructs for handling large-scale datasets
– Can adapt to incoming data streams
k-NN:
– Intuitive algorithm
– Needs efficient data structures to handle large datasets
– Can be made online
Next class: more classification & regression!

