BUS 297D: Data Mining, Professor David Mease, Lecture 8

Presentation transcript:

1 BUS 297D: Data Mining
Professor David Mease
Lecture 8 Agenda:
1) Reminder about HW #4 (due Thursday, 10/15)
2) Lecture over Chapter 10
3) Discuss final exam + give sample questions

2 Homework 4
- Homework 4 is posted on the course website
- It is due Thursday, October 15, during class
- It is worth 50 points
- It must be printed out using a computer and turned in during the class meeting time; anything handwritten on the homework will not be counted
- Late homework will not be accepted

3 Introduction to Data Mining by Tan, Steinbach, Kumar
Chapter 10: Anomaly Detection

4 What is an Anomaly?
- An anomaly is an object that is different from most of the other objects (p. 651)
- "Outlier" is another word for anomaly
- "An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism" (p. 653)
- Some good examples of applications for anomaly detection are on page 652

5 Detecting Outliers for a Single Attribute
- A common method of detecting outliers for a single attribute is to look for observations more than a large number of standard deviations above or below the mean
- The "z score" is the number of standard deviations above or below the mean (p. 661)
- For the normal (bell-shaped) distribution we know the exact probabilities for the z scores
- For non-normal distributions this approach is still useful and valid
- A z score of 3 or -3 is a common cutoff value (see the sketch below)
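In R, this rule can be sketched generically as follows (illustrative only; the data vector here is simulated rather than taken from the course files):

x <- c(rnorm(100), 8)        # 100 typical values plus one planted anomaly
z <- (x - mean(x))/sd(x)     # z score: standard deviations above/below the mean
x[abs(z) > 3]                # observations beyond the cutoff of 3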

6 In class exercise #59:
For the second exam scores in the file exams_and_names.csv, use a z score cutoff of 3 to identify any outliers.

7 In class exercise #59 (solution):
data<-read.csv("exams_and_names.csv")
exam2mean<-mean(data[,3],na.rm=TRUE)   # mean of exam 2 scores (3rd column), ignoring NAs
exam2sd<-sd(data[,3],na.rm=TRUE)       # standard deviation of exam 2 scores
z<-(data[,3]-exam2mean)/exam2sd        # z score for each student
sort(z)                                # scan the sorted z scores for values beyond +/-3

8 In class exercise #60:
Compute the count of each IP address (1st column) in the data file. Then use a z score cutoff of 3 to identify any outliers for these counts.
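One possible solution sketch, along the lines of exercise #59 (the file name "ip_log.csv" is a placeholder, since the data location was not given on the slide):

logdata <- read.csv("ip_log.csv")            # placeholder file name
tab <- table(logdata[,1])                    # count of each ip address
counts <- as.numeric(tab)
z <- (counts - mean(counts))/sd(counts)      # z score for each count
names(tab)[abs(z) > 3]                       # ip addresses with outlying counts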

9 Detecting Outliers for a Single Attribute
- A second popular method of detecting outliers for a single attribute is to look for observations more than a large number of IQRs above the 3rd quartile or below the 1st quartile (the IQR is the interquartile range = Q3 - Q1)
- This approach is used by default in the boxplot function in R
- The default in R is to identify outliers more than 1.5 IQRs above the 3rd quartile or below the 1st quartile (see the check below)
- This approach is thought to be more robust than the z score because the mean and standard deviation are sensitive to outliers, while the quartiles are not
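Base R implements this rule directly; a quick illustrative check (not from the slides):

x <- c(1:10, 50)         # toy data with one planted outlier
IQR(x)                   # interquartile range Q3 - Q1
boxplot.stats(x)$out     # points beyond the 1.5 IQR fences (what boxplot() marks as outliers)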

11 In class exercise #61:
For the second exam scores in exams_and_names.csv, identify any outliers more than 1.5 IQRs above the 3rd quartile or below the 1st quartile. Verify that these are the same outliers found by the boxplot function in R.

12 In class exercise #61 (solution):
data<-read.csv("exams_and_names.csv")
q1<-quantile(data[,3],.25,na.rm=TRUE)   # 1st quartile of exam 2
q3<-quantile(data[,3],.75,na.rm=TRUE)   # 3rd quartile of exam 2
iqr<-q3-q1                              # interquartile range
data[(data[,3]>q3+1.5*iqr),3]           # outliers above the upper fence
data[(data[,3]<q1-1.5*iqr),3]           # outliers below the lower fence

13 In class exercise #61 (solution continued):
boxplot(data[,2],data[,3],col="blue",main="Exam Scores",
        names=c("Exam 1","Exam 2"),ylab="Exam Score")   # the same outliers appear as boxplot points

14 Detecting Outliers for Multiple Attributes
- In the exam score data there are two students who did better on exam 2 than exam 1
- Our single attribute approaches would not identify these as outliers since they are not outliers on either attribute alone
- So for multiple attributes we need some other approaches
- There are 4 techniques in Chapter 10 that may work well here; they are listed on the next slide

15 Detecting Outliers for Multiple Attributes
- Mahalanobis distance (p. 662): a distance measure that takes correlation into account (see the sketch below)
- Proximity-based outlier detection (p. 666): points are identified as outliers if they are far from most other points
- Model based techniques (p. 654): points which don't fit a certain model well are identified as outliers
- Clustering based techniques (p. 671): points are identified as outliers if they are far from all cluster centers (or if they form their own small cluster with only a few points)
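As an illustration of the first technique, the base R function mahalanobis() can score the exam data (a sketch, not code from the lecture):

data <- read.csv("exams_and_names.csv")
x <- data[!is.na(data[,3]), 2:3]             # students with both exam scores
d2 <- mahalanobis(x, colMeans(x), cov(x))    # squared Mahalanobis distance for each student
x[order(d2, decreasing=TRUE)[1:5], ]         # the five most anomalous score pairs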

16 Proximity-Based Outlier Detection (p. 666)
- Points are identified as outliers if they are far from most other points
- One method is to identify points as outliers if their distance to their kth nearest neighbor is large (sketched below)
- Choosing k is tricky because it should not be too small or too big
- Page 667 has some good examples with k=5
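A sketch of the kth nearest neighbor score in R (illustrative; x is the two-column exam score data frame from the Mahalanobis sketch above):

D <- as.matrix(dist(x))                          # pairwise Euclidean distances
k <- 5
score <- apply(D, 1, function(r) sort(r)[k+1])   # kth neighbor distance (position 1 is the point itself at distance 0)
x[order(score, decreasing=TRUE)[1:5], ]          # the most isolated points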

17 Model Based Techniques (p. 654)
- First build a model
- Points which don't fit the model well are identified as outliers
- For the example pictured on the original slide, a least squares regression model would be appropriate

18 In class exercise #62:
Use the function lm in R to fit a least squares regression model which predicts the exam 2 score as a function of the exam 1 score for the data in exams_and_names.csv. Plot the fitted line and determine for which points the fitted exam 2 values are the furthest from the actual values using the model residuals.

19 In class exercise #62 (solution):
data<-read.csv("exams_and_names.csv")
model<-lm(data[,3]~data[,2])                 # regress exam 2 on exam 1
plot(data[,2],data[,3],pch=19,xlab="Exam 1",
     ylab="Exam 2",xlim=c(100,200),ylim=c(100,200))
abline(model)                                # add the fitted line to the plot
sort(model$residuals)                        # the largest (absolute) residuals identify the outliers

20 In class exercise #62 (solution continued): the plot produced by the code above, showing exam 2 versus exam 1 scores with the fitted least squares line.

21 Clustering Based Techniques (p. 671)
- Clustering can be used to find outliers
- One approach is to compute the distance of each point to its cluster center and identify points as outliers for which this distance is large (a sketch follows below)
- Another approach is to look for points that form clusters containing very few points and identify these points as outliers
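A sketch of the first approach (illustrative; again reusing the exam score data frame x from the earlier sketches):

fit <- kmeans(x, 5)                       # k-means with k=5
own <- fit$centers[fit$cluster, ]         # the center assigned to each point
d <- sqrt(rowSums((x - own)^2))           # Euclidean distance to own cluster center
x[order(d, decreasing=TRUE)[1:5], ]       # points farthest from their centers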

22 In class exercise #63:
Use kmeans() in R with all the default values to find the k=5 solution for the data in exams_and_names.csv. Plot the data. Also plot the fitted cluster centers using a different color. Finally, use the knn() function to assign the cluster membership for the points to the nearest cluster center. Color the points according to their cluster membership. Do the two people who did better on exam 2 than exam 1 form their own cluster?

23 In class exercise #63 (solution):
data<-read.csv("exams_and_names.csv")
x<-data[!is.na(data[,3]),2:3]   # keep students with both exam scores

24 In class exercise #63 (solution continued):
plot(x,pch=19,xlab="Exam 1",ylab="Exam 2")
fit<-kmeans(x,5)                             # k-means with k=5 and default settings
points(fit$centers,pch=19,col="blue",cex=2)  # overlay the 5 fitted cluster centers
library(class)
knnfit<-knn(fit$centers,x,as.factor(c(1,2,3,4,5)))  # assign each point to its nearest center
points(x,col=as.numeric(knnfit),pch=19)      # color points by cluster membership

25 Final Exam:
- The final exam will be Thursday 10/15
- Just like with the midterm, you are allowed one 8.5 x 11 inch sheet (front and back) containing notes
- No books or computers are allowed, but please bring a hand held calculator
- The exam will cover the material from Lectures 5, 6, 7 and 8 and Homeworks #3 and #4 (Chapters 4, 5, 8 and 10), so it is not cumulative
- I have some sample questions on the next slides
- In general the questions will be similar to the homework questions (much less multiple choice this time)

26 Sample Final Exam Question #1:
Which of the following describes bagging as discussed in class?
A) Bagging combines simple base classifiers by upweighting data points which are classified incorrectly
B) Bagging builds different classifiers by training on repeated samples (with replacement) from the data
C) Bagging usually gives zero training error, but rarely overfits, which is very curious
D) All of these

27 Sample Final Exam Question #2: Homework 3 question #2

28 Sample Final Exam Question #3: Homework 3 question #3

29 Sample Final Exam Question #4: Homework 3 question #4

30 Sample Final Exam Question #5: Chapter 5 textbook problem #17 part a:

31 Sample Final Exam Question #6:
Compute the precision, recall, F-measure and misclassification error rate with respect to the positive class when a cutoff of P=.50 is used for model M2.
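For reference, the standard formulas in R form (the confusion matrix for M2 is not reproduced here, so the four counts below are placeholders):

TP <- 40; FP <- 10; FN <- 5; TN <- 45                  # placeholder counts
precision <- TP/(TP + FP)
recall <- TP/(TP + FN)
f_measure <- 2*precision*recall/(precision + recall)
error_rate <- (FP + FN)/(TP + FP + FN + TN)            # misclassification error rate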

32 Sample Final Exam Question #7:
For the one dimensional data at the right, give the k-nearest neighbor classifier for the points x=2, x=10 and x=120 using k=5. [table of x and y values omitted]
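A sketch of how an answer can be checked with knn() (the training values and labels below are hypothetical stand-ins for the omitted table):

library(class)
train_x <- matrix(c(1,3,5,9,11,100,105,110))   # hypothetical x values
train_y <- as.factor(c(0,0,0,0,0,1,1,1))       # hypothetical class labels y
test_x <- matrix(c(2,10,120))                  # the query points from the question
knn(train_x, test_x, train_y, k=5)             # majority vote among the 5 nearest neighbors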

33 Sample Final Exam Question #8:
Consider the one-dimensional data set given by x<-c(1,2,3,5,6,7,8) (I left out 4 on purpose). Starting with initial cluster center values of 1 and 2, carry out Algorithm 8.1 until convergence by hand for k=2 clusters. Show the cluster membership and cluster centers for each iteration.
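The hand computation can be checked in R (a sketch; algorithm="Lloyd" is the basic k-means of Algorithm 8.1):

x <- c(1,2,3,5,6,7,8)
fit <- kmeans(matrix(x), centers=matrix(c(1,2)), algorithm="Lloyd")
fit$cluster   # final cluster membership for each point
fit$centers   # final cluster centers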

34 Sample Final Exam Question #9:
For the Midterm 1 and Midterm 2 scores listed below, use a z score cutoff of +/-3 to identify any outliers for each midterm. Show all your work. [table of scores omitted]