Introduction to Predictive Analytics


1 Introduction to Predictive Analytics
Tiber Training, 6/12/2015

2 Short Bio
B.A. from Allegheny College
M.S. from Carnegie Mellon University
Internship at the Department of Veterans Affairs
July 2014 – Present: Senior Consultant, IBM Global Business Services, Advanced Analytics and Optimization

3 Today’s Goals: Students Will Be Able To:
Identify the purpose of Data Mining
Discuss major paradigms in Data Mining
Apply paradigms to business problems
Discuss major considerations in model performance
Evaluate model performance
Discuss the general functionality of some major classification algorithms
Use RapidMiner software to load data, train a classifier, and evaluate it
Use RapidMiner to create clusters from data

4 Agenda
Conceptual Material:
Data Mining Foundations
Problem Types and Model Paradigms
Model Performance Considerations
Recognizing a Strong Model
Hands-On Activities:
K-Nearest Neighbors Classification
Decision Tree Models
K-Means Clustering

5 Data Mining Foundations – Data Mining Definition
BRAINSTORM: Based on exposure to literature, conversations with clients, and professional publications: What is data mining?

6 Data Mining Foundations – Several Definitions
To extract insight or information from data
“Extracting useful knowledge from data to solve business problems”*
To discover interesting relationships within data
* Provost & Fawcett, “Data Science for Business”

7 CRISP-DM Framework
“Cross-Industry Standard Process for Data Mining”
Iterative process; the human stays in the loop throughout iterations
Understanding of both the business and the data is paramount

8 System Design
[Architecture diagram: SOURCE SYSTEMS feed a PROCESSED STATE (data warehouse? operational data store? other form?). Data mining runs either AD HOC/ONE-OFF or AUTOMATICALLY; results flow to RESULTS STORAGE, RESULTS REPORTING, and BI DASHBOARDS. It is assumed that some action will occur on the results.]

9 Data Format
Name             Age  Occupation  Purchase?  Total Spent
Jane Smith       38   Designer    Yes        40,000
James Doe        19   Student     No         N/A
Sally Q. Public  50   CEO         Yes        9,000

Rows, also known as: records, observations
Columns, also known as: fields, attributes, dimensions
Labels/output fields can be: binary, categorical, numeric
NOTE: Expects inputs and outputs to be at the same level (i.e., for an individual, for a visit to the store, for a webpage view, etc.)
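For readers who want to see this shape in code, here is a minimal sketch of the same table as a pandas DataFrame (pandas is an assumption of this transcript; the hands-on portions of the session use RapidMiner):

```python
# A minimal sketch: rows are records/observations, columns are fields/attributes.
import pandas as pd

df = pd.DataFrame({
    "Name":       ["Jane Smith", "James Doe", "Sally Q. Public"],
    "Age":        [38, 19, 50],
    "Occupation": ["Designer", "Student", "CEO"],
    "Purchase":   ["Yes", "No", "Yes"],   # binary label/output field
    "TotalSpent": [40000, None, 9000],    # numeric field; N/A when no purchase
})

X = df[["Age", "Occupation"]]   # input fields
y = df["Purchase"]              # label/output field
print(df)                       # one row per record
```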

10 Data Format in Different Disciplines
BRAINSTORM Based on experience within business intelligence and data warehousing: How does this data format expectation differ from other applications? Does it pose any unique challenges?

11 Problem Types and Paradigms
Supervised: Classification, Regression
Unsupervised: Clustering, Visualization
Supervised methods assume labels/outputs; unsupervised methods lack labels/outputs

12 Problem Types: Classification
Given a set of inputs, predict the most likely value for a categorical or binary output.
(The preceding slide had the $ amount of purchase. Why not use it?!)
Examples:
(From the preceding slide) Every new customer must record their name, age, and occupation. Jimmy G. Generic, 10, Student – will this new customer make a purchase?
Given the dollar amount, GPS location of the store, category of the retailer, date, and time of a purchase – is it fraudulent?
Given the pulse, respiration, and blood pressure – will the patient have a complication?

13 Problem Types: Regression
Given a set of inputs, predict the most likely value for a real-valued output.
Example:
(From the preceding slide) Every new customer must record their name, age, and occupation. Jimmy G. Generic, 10, Student – IF this customer makes a purchase, how much are they likely to spend?
Today’s session has minimal coverage of regression. Why?

14 Problem Types: Clustering
Given a set of inputs, with or without outputs, assign a class label to each record based upon similarity to other records.
Example:
[Scatter plot: Days as a Salesperson vs. Average $ Value of Sales/Day]

15 Problem Types: Visualization
Not necessarily a problem type like the others, but it can sometimes create new insight for existing problems.
[Image: Social Network Visualization, courtesy of Wikipedia]
What can we say about this? Would we necessarily glean the same from clustering?

16 Problem Types and Business
BRAINSTORM Now that we’ve introduced the major problem types in data mining: Form groups and create a list of at least five business use cases for each of the problem types.

17 Model Performance Considerations
PROBLEM SCENARIO:
We have clickstream data on how long a customer has viewed an item (total) and how many times they’ve clicked it.
We pull 10 records from the database.
Build a model (which type of model? Which type of problem?)
[Scatter plot: # of times clicked vs. total time spent viewing; X = purchase, O = nonpurchase]

18 Model Performance Considerations
PROBLEM SCENARIO, continued:
[Scatter plot: the same 10 records, with a vertical threshold q on total time spent viewing separating a NONPURCHASE region from a PURCHASE region]
SUCCESS! If Total_Time >= q THEN Purchase ELSE Nonpurchase
Time to cash the check! Right?
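The slide’s one-rule model is simple enough to sketch directly in Python; the viewing times and the threshold q below are hypothetical, chosen only to illustrate the rule:

```python
# Hypothetical viewing times (seconds) and purchase labels for the 10 records.
total_time = [12, 45, 60, 8, 30, 75, 5, 50, 20, 90]
purchased  = [0,  1,  1,  0, 0,  1,  0, 1,  0,  1]   # 1 = purchase, 0 = nonpurchase

q = 40  # hypothetical threshold chosen from the training data

# The slide's rule: IF Total_Time >= q THEN Purchase ELSE Nonpurchase
predictions = [1 if t >= q else 0 for t in total_time]
accuracy = sum(p == y for p, y in zip(predictions, purchased)) / len(purchased)
print(f"Training accuracy: {accuracy:.0%}")   # 100% -- on the data we trained on
```

Perfect accuracy on the ten records we trained on – which is exactly the trap the next slide springs.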

19 Model Performance Considerations
We deploy the system and 10 new customers arrive. Suddenly our classifier looks much worse.
[Scatter plot: the new points fall on both sides of threshold q; a second threshold p on # of times clicked is added]
With more rules, using both dimensions, we can still have reasonable accuracy.
We can’t have a perfect classifier using these two dimensions… Can we imagine a third? And another rule?

20 Model Performance Considerations
[Images: an UNDERFIT model and an OVERFIT model]
Which of these did we display in the original model for our clickstream customers? What if we kept adding rules, and kept adding dimensions?

21 Testing a Model
Cross-validation lets us choose random subsets of the data to hold out from the model-building stage.
10-fold cross-validation is the most popular.

22 10-Fold Cross-Validation Utility
Choose between different modeling algorithms
Tune parameters of the chosen algorithm to reduce bias or variance
Create an estimate of model performance
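A minimal sketch of 10-fold cross-validation using scikit-learn (an assumption of this transcript; the session’s hands-on work uses RapidMiner):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Each of the 10 folds is held out once while the other 9 train the model,
# giving 10 performance estimates instead of one optimistic one.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```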

23 Evaluating Classifiers
Refer to handout – Confusion Matrix and Performance Metrics
“Time for the big bucks! I can deliver 99% accuracy for some problems, guaranteed!” How?
BRAINSTORM: Why not just use accuracy?
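One answer to the brainstorm, sketched in code: on a rare-event problem, a model that always predicts the majority class scores 99% accuracy while catching nothing. The transaction counts below are hypothetical:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical: 1,000 transactions, only 10 of them fraudulent.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000   # a "model" that never predicts fraud

print(accuracy_score(y_true, y_pred))     # 0.99 -- looks impressive
print(confusion_matrix(y_true, y_pred))   # ...but all 10 frauds are missed
```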

24 Receiver Operating Characteristic (ROC) Curve
IMPORTANT: Points on the ROC curve each represent a specific threshold for a classifier.
“Time for the big bucks! I can deliver a 0% false positive rate, guaranteed!” How?
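A sketch of where those threshold points come from, using scikit-learn’s roc_curve (an assumption; any classifier that outputs scores would do). Note the answer to the guarantee: a threshold above every score predicts nothing positive, giving a 0% false positive rate – and a 0% true positive rate:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Score the held-out records, then sweep the decision threshold.
scores = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thr = roc_curve(y_te, scores)

# Each (FPR, TPR) point corresponds to one threshold on the scores.
for f, t, th in list(zip(fpr, tpr, thr))[:5]:
    print(f"threshold {th:.3f}: FPR {f:.2f}, TPR {t:.2f}")
```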

25 Using ROC Curves to Compare Models
Generally, models closer to the top left are best, i.e., toward a 100% true positive rate and a 0% false positive rate.
Occasionally, one model will tend toward lower false positives while another favors higher true positives.
Leads to: cost-sensitive or unbalanced classifiers

26 K-Nearest Neighbors
Non-parametric approach (does not actually extract model parameters; relies on having all examples in memory)
Relies on a distance function between each point and the others
Each nearby point “votes” on the class label

27 K-Nearest Neighbors, Continued
Parameters that can be tuned:
K (number of neighbors to use in the vote)
Distance metric
What about 1 nearest neighbor? What about 100 nearest neighbors?
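A from-scratch sketch of the voting procedure described above, with hypothetical data; the distance metric and K are exactly the tunable parameters just listed:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    # No training step: measure the distance from the query to every stored example.
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = [label for _, label in dists[:k]]   # the k nearest neighbors...
    return Counter(votes).most_common(1)[0][0]  # ...vote on the class label

train_X = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8)]
train_y = ["no", "no", "no", "yes", "yes"]
print(knn_predict(train_X, train_y, query=(7, 8), k=3))   # -> "yes"
```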

28 K-Nearest Neighbors
Fairly interpretable (can present example cases)
Can handle multiple output categories (not just binary)
Can only handle numeric inputs
Nonparametric – memory intensive and slower to score, but no time spent training
Distance metrics begin to break down as dimensionality increases

29 Decision Tree Models
Recursive algorithm approach
Uses some measure of purity: given a certain split on an input, how “pure” is the output?
Information Gain, based on Entropy, is a popular measure

30 Decision Trees, Continued
Test the Information Gain of the output variable given a split on each of the input variables
First split – maximum Information Gain
Recursively repeat for each of the nodes, which correspond to split points of the original input values
Stop when the leaf nodes have reached an adequate level of “purity”
Handle numeric fields with cutpoints (beyond the scope of this session)

31 Decision Trees – Entropy and Information Gain
Entropy: H(y) = -Σi p(yi) * log2 p(yi)
Example: 7/10 people purchased, 3/10 did not
H(y) = -[0.7 * log2(0.7) + 0.3 * log2(0.3)]
H(y) = -[0.7 * (-0.51) + 0.3 * (-1.74)]
H(y) ≈ 0.88
How does 0.88 correspond to the chart at right? [Chart: entropy as a function of class proportion]
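A quick sketch to verify the arithmetic, and to see how entropy behaves at the extremes:

```python
import math

def entropy(probs):
    # H(y) = -sum(p * log2(p)); the 0 * log2(0) term is taken as 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.7, 0.3]))   # ~0.881, the slide's 0.88
print(entropy([0.5, 0.5]))   # 1.0  -- the maximum for two classes
print(entropy([1.0, 0.0]))   # 0.0  -- a perfectly pure outcome
```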

32 Entropy, Another View
(Best regards to CMU’s Andrew Moore, who used a similar image)
[Two images of pillow stuffing: one with LOW entropy, one with HIGH entropy]

33 Decision Trees – Entropy and Information Gain, Continued
IG(T, a) = H(T) – H(T|a)
In other words, take the entropy of the outcome and subtract the entropy after you’ve split on some criterion.
For example, if H(y) = .88 for purchases on the preceding slide, what is the entropy if you first break the records into income >= 50,000 and income < 50,000?

34 Information Gain – Worked Example
[Table: ten customer records with fields Shopped? (all Y), Income >= 50k? (Y/N), and Purchased? (Y/N)]
IG, Shopped:
Entropy(Purchased) = .88
Entropy(Purchased | Shopped = Y) = .88
P(Shopped = Y) = 1
IG = .88 – (1 * .88) = 0
IG, Income:
Entropy(Purchased | Income >= 50k) = .72
Entropy(Purchased | Income < 50k) = .97
P(Income >= 50k) = .5
P(Income < 50k) = .5
IG = .88 – (.5 * .72 + .5 * .97)
IG = .88 – (.36 + .485)
IG = .035
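A sketch verifying the worked example. The branch class proportions below (4/5 and 3/5 purchased) are an inference: they are the mixes that reproduce the slide’s .72 and .97 conditional entropies:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

h_y = entropy([0.7, 0.3])        # Entropy(Purchased) ~ 0.88

# Shopped = Y for every record, so splitting on it changes nothing.
ig_shopped = h_y - 1.0 * h_y     # 0.0

# Income splits the records in half; each branch's class mix gives the
# slide's conditional entropies (~0.72 and ~0.97).
ig_income = h_y - (0.5 * entropy([0.8, 0.2]) + 0.5 * entropy([0.6, 0.4]))

print(round(ig_shopped, 3))      # 0.0
print(round(ig_income, 3))       # 0.035
```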

35 Information Gain, Worked Example, Continued
[Two contingency tables: the records broken out by Income > 50k, showing Purchased = Y/N counts in each branch]

36 Why devote so much time to decision trees?
Many modern implementations, including ID3 and C4.5/C5.0
Intuitive, interpretable models
Fast, scalable
Can handle many dimensions and mixed categorical and numeric inputs
Can handle multiple-valued outputs, not just binary
Parametric – requires longer to train, but scores very quickly
Sometimes called CART (Classification and Regression Trees); can be used for regression, beyond the scope of this session

37 K-Means Clustering
1. K is a given – the number of clusters
2. Place K cluster centers randomly throughout the space, assigning all points to the nearest of the K centers
3. Calculate the centroid of the points assigned to each cluster (in all dimensions) and move the cluster centers to these new points
4. Calculate the distance of all points to the new centers and reassign each point to its nearest center
5. Repeat steps 3 and 4 until convergence (cluster centroids move very little from one iteration to the next)
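A from-scratch sketch of the loop just described, with hypothetical two-dimensional points (the hands-on exercise does this in RapidMiner):

```python
import math
import random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)    # step 2: random starting centers
    for _ in range(iters):
        # step 4: assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # step 3: move each center to the centroid of its assigned points
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:        # step 5: centers stopped moving
            break
        centers = new_centers
    return centers

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
print(kmeans(points, k=2))
```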

38 K-Means Clustering Example
Notice that the clusters are extremely imbalanced in Iteration 1 Imagine the green points on the bottom left pulling that centroid down and left Likewise with blue

39 Conclusion
Topics introduced:
Definition of Data Mining
Data format – inputs, outputs, dimensions
Problem Types and Paradigms
Model Performance Considerations; measuring and comparing models
K-Nearest Neighbors non-parametric classification
Decision Tree parametric classification
K-Means unsupervised clustering

40 Topics Not Covered
Much of classical statistics (p-values, ANOVA, statistical representativeness, sampling plans)
Bayesian methods (Naïve Bayes, Bayes Nets)
Preprocessing data (binning, normalizing, deriving new variables)
Segmentation for model performance
IT infrastructure concerns
Geospatial analytics
Social Network Analysis and Visualization
Recommender Systems
Regression, Time Series Analysis
Advanced techniques (two-stage modeling, etc.)
Many algorithms (Neural Nets, Support Vector Machines, etc.)
Operationalizing Data Mining

