Exam #3 Review Zuyin (Alvin) Zheng.


Important Notes
Time: Monday, May 8, 8:00am to 10:00am (2 hours, NOT 50 minutes)
Location: regular classroom, Alter Hall 231
Calculator: please bring your own calculator; you will NOT be allowed to share calculators during the exam

Overview of the Exam (Time Matters)
Part 1: Multiple Choice (50 points): 25 questions * 2.5 points each
Part 2: Interpreting Decision Tree Output (16 points): 6 questions, short answers required
Part 3: Interpreting Clustering Output (14 points): 5 questions, short answers required
Part 4: Computing Support, Confidence, and Lift (20 points): 4 questions, calculations and short answers required
Bonus Questions: 2 questions * 2 points each

Using R and RStudio

R and RStudio
R (the base/engine): open source and free, with many, many statistical add-on "packages" that perform data analysis; R packages contain additional functionality.
RStudio (the pretty face): an Integrated Development Environment (IDE) for R; a nicer interface that makes R easier to use; requires R to run.

R Syntax 1: Variables and Functions
Define variables: R is case sensitive. Variable names must start with a letter, not a digit. The assignment operator in R is <- or =.
Functions: c(), rep(), and sort() are functions. Each function consists of a function name followed by parentheses ().

R Syntax 2: Vectors and Datasets
Define and index vectors: use the combine function c() to define a vector; use brackets [ ] to index elements in a vector.
Access a variable from a dataset: use $ to index a variable (a column) from a data set (a table).
(Slide shows examples of loading data into R and accessing a variable.)

Understanding Descriptive Statistics

Summary Statistics
summary: provides basic summary statistics
describe: provides detailed summary statistics
describeBy: provides summary statistics by group

Read and Interpret a Histogram
For a histogram: what does the x-axis represent? What does the y-axis represent?
Basic statistics: mean and median. Example data sets: 1, 3, 3, 6, 7, 8, 9 and 1, 2, 3, 4, 5, 6, 7, 8.
What does it mean if the mean is greater (or smaller) than the median?
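The mean-median comparison for the two example data sets above can be checked directly. A quick illustration (sketched in Python for the arithmetic, though the course itself uses R):

```python
from statistics import mean, median

# Data set from the slide where mean and median differ
a = [1, 3, 3, 6, 7, 8, 9]
print(mean(a))    # 37/7, approximately 5.29
print(median(a))  # 6

# Symmetric data set from the slide: mean equals median
b = [1, 2, 3, 4, 5, 6, 7, 8]
print(mean(b))    # 4.5
print(median(b))  # 4.5
```

In the first set the mean (about 5.29) is below the median (6), which is what a left-skewed distribution looks like; in the symmetric set the two coincide.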

Hypothesis Testing: t-test and P-value
A t-test compares the mean between two groups/samples.
P-value < 0.05: the difference between the means is statistically significant.
P-value >= 0.05: the difference between the means is not statistically significant.

Decision Tree Analysis

What is Classification?
A statistical method used to determine to what category (or "class") a new observation belongs, e.g., will a customer pay off the loan, will a customer purchase organic food, is a charge legitimate?
How to classify: a decision tree, built on a training set and evaluated on a validation set.
How to evaluate a decision tree: correct classification rate and error rate; error rate + correct classification rate = 1.

Read the Decision Tree
(Slide shows a credit-approval decision tree: the root splits on income (<40k vs. >40k), then on debt (<20% vs. >20%) and housing (owns house vs. rents); the leaves show default probabilities 0.85, 0.20, 0.30, 0.22, 0.60, and 0.26. Labels: 1 = Default, 0 = No Default.)
Questions: How many leaf nodes are there (what is the size of the tree)? What does 0.85 mean? How likely is a person with a 50k income and 10% debt to default? Who is most likely to default? Who is least likely to default?
Answer: Customers whose income is less than 40k, whose debt is smaller than 20%, and who own their house are the least likely to default.

How to Calculate the Classification Accuracy?
A confusion matrix compares the predicted outcomes to the observed (actual) outcomes:

                        Predicted: Default   Predicted: No default
Observed: Default             600                    100
Observed: No default          150                    650

Total: 1500 observations.
Correct classification rate = (600 + 650) / 1500 = 0.83
Error rate = 1 - 0.83 = 0.17
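The accuracy arithmetic from the confusion matrix above can be verified in a few lines (a Python sketch of the same calculation; the cell names are just labels for this example):

```python
# Confusion matrix cells from the slide: (observed, predicted) -> count
matrix = {
    ("default", "default"): 600,        # correctly predicted defaults
    ("default", "no_default"): 100,     # defaults the model missed
    ("no_default", "default"): 150,     # false alarms
    ("no_default", "no_default"): 650,  # correctly predicted non-defaults
}

total = sum(matrix.values())  # 1500
# Correct classifications sit on the diagonal (observed == predicted)
correct = sum(n for (obs, pred), n in matrix.items() if obs == pred)
accuracy = correct / total    # (600 + 650) / 1500
error_rate = 1 - accuracy
print(round(accuracy, 2), round(error_rate, 2))  # 0.83 0.17
```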

Control Tree Size
Tree size and classification accuracy: a bigger tree = more predictors used = more leaf nodes = a larger size. A larger tree leads to higher classification accuracy (on the training data) but makes the tree more complex. Goal: a simple tree with a high correct classification rate.
Minimum split: the minimum number of observations a node must contain for an additional split to be attempted. Smaller minimum split → more complex tree.
Complexity factor: the minimum reduction in error needed to add an additional split. Smaller complexity factor → more complex tree.

Clustering Analysis

What is Cluster Analysis?
Grouping data so that elements in a group are similar (or related) to one another and different from (or unrelated to) elements in other groups. We do not know how many groups there are beforehand.
How to cluster: k-means groups data based on distances to cluster centroids (centers; shown as red dots on the slide).

How to Evaluate Clusters?
Within-cluster SSE (cohesion): the sum of squared distances from each point to its cluster centroid. Example: distances of 1, 1.3, and 2 give SSE1 = 1^2 + 1.3^2 + 2^2 = 1 + 1.69 + 4 = 6.69. Small within-cluster SSE means high cohesion.
Between-cluster SSE (separation): large between-cluster SSE means high separation.
We want high cohesion and high separation, i.e., small within-cluster SSE and large between-cluster SSE.
More clusters → higher cohesion (good); more clusters → lower separation (bad).
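The SSE example above is just a sum of squared distances; here is the same calculation as a short Python check (the course uses R, but the arithmetic is identical):

```python
# Distances of the three points in the example cluster from its centroid
distances = [1, 1.3, 2]

# Within-cluster SSE: sum of squared distances to the centroid
sse = sum(d ** 2 for d in distances)
print(round(sse, 2))  # 6.69
```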

Interpreting Clustering Results 1: Summary Statistics for Each Cluster
What are the characteristics of cluster 1 compared to the population? What does a positive / negative value mean?

> aggregate(kData, by=list(MyKMeans$cluster), FUN=mean)
  Group.1 RegionDensityPercentile MedianHouseholdIncome AverageHouseholdSize
1       1               0.8835353            -0.2681572           -0.5992532
2       2              -1.1224866            -0.5594931           -0.5055258
3       3              -0.4836836            -0.1400215            0.3502257
4       4               0.9557776            -0.3145690            1.3672892
5       5               0.8561222             1.3520755            0.2790425

Interpreting Clustering Results 2: Within-Cluster SSE
What is the within-cluster SSE for cluster 4? Which cluster has the highest cohesion? Which cluster has the lowest cohesion? What is the range of the within-cluster SSE?

> # Display withinss (i.e. the within-cluster SSE for each cluster)
> print("Within cluster SSE for each cluster (Cohesion):")
[1] "Within cluster SSE for each cluster (Cohesion):"
> MyKMeans$withinss
[1] 6523.491 4990.183 6772.426 2707.390 5102.896
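The questions above can be answered mechanically from the withinss vector: smaller SSE means higher cohesion. A Python sketch of that reading (R's clusters are numbered from 1, hence the + 1):

```python
# The withinss vector from the R output above, one SSE per cluster
withinss = [6523.491, 4990.183, 6772.426, 2707.390, 5102.896]

# Smallest within-cluster SSE = highest cohesion; largest = lowest cohesion
highest_cohesion = withinss.index(min(withinss)) + 1  # cluster 4
lowest_cohesion = withinss.index(max(withinss)) + 1   # cluster 3
sse_range = max(withinss) - min(withinss)
print(highest_cohesion, lowest_cohesion, round(sse_range, 3))  # 4 3 4065.036
```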

Interpreting Clustering Results 3: Between-Cluster SSE
What is the average between-cluster SSE? What is the total between-cluster SSE?

> # Display betweenss (i.e. the between-cluster SSE between clusters)
> print("Total between-cluster SSE (Seperation):")
[1] "Total between-cluster SSE (Seperation):"
> MyKMeans$betweenss
[1] 45301.67
> # Compute average separation: more clusters = less separation
> print("Average between-cluster SSE:");
[1] "Average between-cluster SSE:"
> MyKMeans$betweenss/NUM_CLUSTER
[1] 9060.334
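The average separation is simply betweenss divided by the number of clusters; a one-line Python check of the output above (assuming NUM_CLUSTER = 5, the five-cluster solution used throughout this deck):

```python
# Total between-cluster SSE and cluster count from the R output above
betweenss = 45301.67
NUM_CLUSTER = 5

avg_between = betweenss / NUM_CLUSTER
print(round(avg_between, 3))  # 9060.334
```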

Association Rule Mining

What is Association Rule Mining?
Find out which items (X) predict the occurrence of other items (Y), e.g., market basket analysis.
Itemset: a collection of items of interest.
Support count (σ): in how many baskets does the itemset appear? σ({Milk, Diapers, Beer}) = 2; σ({Milk, Diapers}) = 3.

Basket  Items
1       Bread, Milk
2       Bread, Diapers, Beer, Eggs
3       Milk, Diapers, Beer, Coke
4       Bread, Milk, Diapers, Beer
5       Bread, Milk, Diapers, Coke

Support and Confidence
Support (s) = support count (σ) / total number of baskets; it measures whether the itemset appears often.
s({Milk, Diapers}) = 3/5 = 0.6; s({Bread}) = 4/5 = 0.8, so Bread appears very often in shopping baskets.
Is s(X,Y) always greater than, smaller than, or equal to s(X)? (It is never greater: every basket that contains both X and Y also contains X.)
Confidence (c) is the strength of the association between two itemsets: c(X→Y) = s(X,Y) / s(X) uses X to predict Y, where X is the antecedent and Y is the consequent. c(X→Y) is always between 0 and 1.

Basket  Items
1       Bread, Milk
2       Bread, Diapers, Beer, Eggs
3       Milk, Diapers, Beer, Coke
4       Bread, Milk, Diapers, Beer
5       Bread, Milk, Diapers, Coke
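The support and confidence values above can be recomputed from the five baskets directly. A Python sketch (the support and confidence helpers here are illustrative, not part of any course library):

```python
# The five shopping baskets from the slide
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

def support(itemset):
    """s(itemset): fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(x, y):
    """c(X -> Y) = s(X, Y) / s(X)."""
    return support(x | y) / support(x)

print(support({"Milk", "Diapers"}))               # 3/5 = 0.6
print(support({"Bread"}))                          # 4/5 = 0.8
print(confidence({"Milk", "Diapers"}, {"Beer"}))   # 0.4 / 0.6, about 0.67
```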

Lift
What is the problem with confidence? The purpose of association rule mining is to predict Y with X. If X and Y are both frequently purchased items, the confidence will be large even when X is a bad predictor of Y. So we need to account for the natural occurrence of both X and Y.
Lift measures whether two itemsets X and Y appear together at the same frequency as random chance: lift(X→Y) = s(X,Y) / (s(X) * s(Y)).
Lift > 1: the occurrence of X and Y together is more likely than what you would expect by chance [good].
Lift < 1: the occurrence of X and Y together is less likely than what you would expect by chance [bad].
Lift = 1: the occurrence of X and Y together is the same as what you would expect by chance.

Example 1
What is the lift for the rule {Milk, Diapers} → {Beer}? (Baskets are the same five as on the previous slides.)
X = {Milk, Diapers}; Y = {Beer}
s(X,Y) = 2/5 = 0.4; s(X) = 3/5 = 0.6; s(Y) = 3/5 = 0.6
c(X→Y) = 0.4 / 0.6 ≈ 0.67
lift(X→Y) = 0.4 / (0.6 * 0.6) ≈ 1.11
The support is 0.4, meaning that {Milk, Diapers, Beer} appears fairly often. The confidence is 0.67, indicating a strong association between {Milk, Diapers} and {Beer}. The lift is greater than 1, suggesting that {Milk, Diapers} and {Beer} occur together more often than random chance, so {Milk, Diapers} is predictive of {Beer}.
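The worked example above follows the exam's calculation pattern exactly; here is the same arithmetic as a Python check:

```python
# Supports from the slide for the rule {Milk, Diapers} -> {Beer}
s_xy = 2 / 5   # s({Milk, Diapers, Beer}) = 0.4
s_x = 3 / 5    # s({Milk, Diapers}) = 0.6
s_y = 3 / 5    # s({Beer}) = 0.6

confidence = s_xy / s_x        # c(X -> Y), about 0.67
lift = s_xy / (s_x * s_y)      # 0.4 / 0.36, about 1.11
print(round(confidence, 2), round(lift, 2))  # 0.67 1.11
```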

Example 2
What is the effect of Netflix on Cable TV? {Netflix → Cable TV}

              Cable TV: No   Cable TV: Yes
Netflix: No        200            8000
Netflix: Yes      3800            1000

s(Netflix, Cable TV) = 1000/13000 ≈ 7%
s(Cable TV) = 9000/13000 ≈ 69%
s(Netflix) = 4800/13000 ≈ 37%
lift = s(Netflix, Cable TV) / (s(Netflix) * s(Cable TV)) ≈ 0.30
Having one service negatively affects the purchase of the other (lift closer to 0 than 1).
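The Netflix/Cable TV lift can be recomputed from the contingency table counts; a Python check of the slide's conclusion:

```python
# Contingency table counts from the slide
#                 Cable TV: No   Cable TV: Yes
# Netflix: No          200            8000
# Netflix: Yes        3800            1000
total = 200 + 8000 + 3800 + 1000           # 13000 customers

s_both = 1000 / total                       # s(Netflix, Cable TV), about 0.077
s_netflix = (3800 + 1000) / total           # about 0.369
s_cable = (8000 + 1000) / total             # about 0.692

lift = s_both / (s_netflix * s_cable)       # about 0.30: well below 1
print(round(lift, 2))
```

A lift of roughly 0.30 means the two services appear together far less often than chance would predict, which is why the slide concludes that having one service negatively affects the purchase of the other.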

Good Luck!