1
KDD-09 Tutorial
Predictive Modeling in the Wild: Success Factors in Data Mining Competitions and Real-Life Projects
Saharon Rosset, Tel Aviv University
Claudia Perlich, IBM Research
2
Predictive modeling
Most general definition: build a model from observed data, with the goal of predicting some unobserved outcomes.
Primary example: supervised learning
- Get training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), drawn i.i.d. from a joint distribution on (X, y)
- Build a model f(x) to describe the relationship between x and y
- Use it to predict y when only x is observed in the "future"
Other cases may relax some of the supervised learning assumptions.
For example: in KDD Cup 2007 we did not see any y_i's, and had to extrapolate them based on the training x_i's (see later in the tutorial).
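A minimal sketch of this setup (our illustration; scikit-learn is an assumption, the tutorial itself is library-agnostic):

```python
# Train on observed (x_i, y_i) pairs, then predict y for "future" x.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))                  # x_1, ..., x_n drawn i.i.d.
y_train = X_train @ [1.0, -2.0, 0.5] + rng.normal(scale=0.1, size=100)

f = LinearRegression().fit(X_train, y_train)         # build model f(x)
X_future = rng.normal(size=(5, 3))                   # "future": only x observed
print(f.predict(X_future))                           # predicted y
```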
3
Predictive Modeling Competitions
Competitions like KDD-Cup extract "core" predictive modeling challenges from their application environment.
They are usually supposed to represent real-life predictive modeling challenges.
Extracting a real-life problem from its context and making a credible competition out of it is often more difficult than it seems.
We will see this in the examples.
4
The Goals of this Tutorial
Understand the two modes of predictive modeling, their similarities and differences:
- Real-life projects
- Data mining competitions
Describe the main factors for success in the two modes of predictive modeling.
Discuss some of the recurring challenges that come up in determining success.
These goals will be addressed and demonstrated through a series of case studies.
5
Credentials in Data Mining Competitions
Claudia Perlich
- Runner-up KDD CUP 03*
- Winner ILP Challenge 05
- Winner KDD CUP
Saharon Rosset
- Winner KDD CUP 99~
- Winner KDD CUP 00+
Jointly
- Winners in KDD CUP
- Winners of KDD CUP
- Winners of the INFORMS data mining challenge
Collaborators: @Prem Swirszcz, *Foster Provost, *Sofus Macskassy, +~Aron Inger, +Nurit Vatnik, +Einat Niculescu-Mizil
6
Experience with Real Life Projects
Collaboration on Business Intelligence projects at IBM Research:
- Total of >10 publications on real-life projects
- Total of 4 IBM Outstanding Technical Achievement awards, plus an IBM Accomplishment and a Major Accomplishment
- Finalists in this year's INFORMS Edelman Prize for real-life applications of Operations Research and Statistics
One of the successful projects will be discussed here as a case study.
7
Outline
Introduction and overview (SR)
- Differences between competitions and real life
- Success factors
- Recurrent challenges in competitions and real projects
Case studies
- KDD CUP 2007 (SR)
- KDD CUP 2008 (CP)
- Business Intelligence example: Market Alignment Program (MAP) (CP)
Conclusions and summary (CP)
8
Introduction: What do you think is important?
- Domain knowledge
- Statistics background
- Data mining algorithms
- Computing power
- General experience with data
9
Differences between competitions and projects
Task
- Competitions: clearly defined tasks, clear evaluation metrics
- Projects: 'improve marketing effectiveness', 'identify underperforming stores', 'at what R^2 can I fire people?'
Data
- Competitions: clean and available, with (some) documentation
- Projects: don't know what data they have, don't know what the data mean
Objective
- Competitions: prediction
- Projects: insight, decision support, weapon in political battlefields, prediction
Deliverable
- Competitions: ASCII file with numbers
- Projects: endless conference calls, PowerPoint slides, prototype/predictions (bi-monthly to annual refresh)
Duration
- Competitions: weeks/months; you know when it is over
- Projects: some projects just fail to die (3+ years); most die before being born
In this tutorial we deal with the predictive modeling aspect, so our discussion of projects will also start with a well-defined predictive task and ignore most of the difficulties with getting to that point.
10
Real life project evolution and our focus
[Flow diagram: business/modeling problem definition -> statistical problem definition -> data preparation & integration -> modeling methodology design -> model generation & validation -> implementation & application development. Examples along the pipeline: sales force mgmt. and wallet estimation; quantile estimation and latent variable estimation; IBM relationships and firmographics; quantile estimation and graphical models; programming and simulation; IBM Wallets, OnTarget, MAP. Our focus: the middle stages, from statistical problem definition through model generation & validation; problem definition is loosely related, and implementation/application development is not our focus.]
11
Two types of competitions
Sterile
- Clean data matrix, standard error measure, often anonymized features: pure machine learning
- Examples: KDD Cup 2009, PKDD Challenge 2007
- Approach: emphasize algorithms and computation; attack with heavy (kernel?) machines
- Challenges: size, missing values, # features
Real
- Raw data, set up the model yourself, task-specific evaluation: simulates the real-life mode
- Examples: KDD Cup 2007, KDD Cup 2008
- Approach: understand the domain, analyze the data, build the model
- Challenges: too numerous
12
Factors of Success in Competitions and Real Life
1. Data and domain understanding
- Generation of data and task
- Cleaning and representation/transformation
2. Statistical insights
- Statistical properties
- Test validity of assumptions
- Performance measure
3. Modeling and learning approach
- Most "publishable" part
- Choice or development of the most suitable algorithm
[Diagram: the relative weight of these factors differs between "real" and "sterile" competitions]
13
Recurring challenges
We emphasize three recurring challenges in predictive modeling that often get overlooked:
1. Data leakage: impact, avoidance and detection
- Leakage: use of "illegitimate" data for modeling
- "Legitimate" data: data that will be available when the model is applied
- In competitions, the definition of leakage is unclear
2. Adapting learning to real-life performance measures
- These can move well beyond standard measures like MSE, error rate, or AUC
- We will see this in two of our case studies
3. Relational learning / feature construction
- Real data is rarely flat, and good, practical solutions for this problem remain a challenge
14
1 Leakage in Predictive Modeling
Leakage: the introduction of predictive information about the target by the data generation, collection, and preparation process.
- Trivial example: a binary target was created using a cutoff on a continuous variable, and by accident the continuous variable was not removed from the data
- Reversal of cause and effect, when information from the future becomes available
- It produces models that do not generalize: true model performance is much lower than the "out-of-sample" (but leakage-contaminated) estimate
- Commonly occurs when combining data from multiple sources or multiple time points, and often manifests itself in the ordering of data files
Leakage is surprisingly pervasive in competitions and real life:
- KDD CUP 2007 and KDD CUP 2008 had leakage, as we will see in the case studies
- The INFORMS competition had leakage due to partial removal of information for only the positive cases
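A synthetic illustration of the "trivial example" above; all names and numbers here are ours, not from the competitions discussed:

```python
# The binary target is a cutoff on a continuous variable that was accidentally
# left among the features: cross-validation then wildly overstates performance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 1000
legit = rng.normal(size=(n, 5))                  # legitimate predictors
severity = legit[:, 0] + rng.normal(size=n)      # continuous source variable
y = (severity > 0).astype(int)                   # target = cutoff on severity

X_leaky = np.column_stack([legit, severity])     # severity was not removed!
X_clean = legit

for name, X in [("leaky", X_leaky), ("clean", X_clean)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
    print(name, round(acc, 3))  # "leaky" looks near-perfect; "clean" is honest
```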
15
Real life leakage example
P. Melville, S. Rosset, R. Lawrence (2008). Customer Targeting Models Using Actively-Selected Web Content. KDD-08.
Built models for identifying new customers for IBM products, based on:
- IBM internal databases
- Companies' websites
Example pattern: companies with the word "Websphere" on their website are likely to be good customers for IBM Websphere products.
- Ahem, a slight cause-and-effect problem
Source of the problem: we only have the current view of a company's website, not its view when it was an IBM prospect (= prior to buying).
Ad-hoc solution: remove all obvious leakage words. This does not solve the fundamental problem.
16
General leakage solution: “predict the future”
Niels Bohr is quoted as saying: "Prediction is difficult, especially about the future."
Flipping this around, if:
- The true prediction task is "about the future" (it usually is)
- We can make sure that our model only has access to information "at the present"
- We can apply the time-based cutoff in the competition / evaluation / proof-of-concept stage
then we are guaranteed (intuitively and mathematically) that we can prevent leakage.
For the websites example, this would require getting an internet snapshot from (say) two years ago, and using only what we knew then to learn who bought since.
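A minimal sketch of such a time-based cutoff, assuming a simple pandas event log with illustrative column names (not from the tutorial):

```python
import pandas as pd

events = pd.DataFrame({
    "company": ["a", "a", "b", "b", "c"],
    "date": pd.to_datetime(["2006-05-01", "2008-02-01",
                            "2006-11-15", "2007-08-30", "2008-01-10"]),
    "bought_product": [0, 1, 0, 1, 1],
})
cutoff = pd.Timestamp("2007-01-01")

# "Present": only information observed strictly before the cutoff -> features
past = events[events["date"] < cutoff]
features = past.groupby("company").size().rename("n_past_events")

# "Future": outcomes after the cutoff -> labels (nothing leaks back in time)
future = events[events["date"] >= cutoff]
labels = future.groupby("company")["bought_product"].max().rename("bought")

training = features.to_frame().join(labels, how="left").fillna({"bought": 0})
print(training)
```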
17
2 Real-life performance measures
Real-life prediction models should be constructed and judged on real-life measures:
- Address the real problem at hand: optimize $$$, life span, etc.
At the same time, we need to maintain statistical soundness:
- Can we optimize these measures directly?
- Are we better off just building good models in general?
Example: breast cancer detection (KDD Cup 2008)
- At first sight, a standard classification problem (malignant or benign?)
- Obvious extension: a cost-sensitive objective. Much better to do a biopsy on a healthy subject than to send a malignant patient home!
- Competition objective: optimize effective use of radiologists' time, via a complex measure called FROC (see the case study in Claudia's part)
18
Optimizing real-life measures
It is a common approach to use the prediction objective to motivate an empirical loss function for modeling:
- If the prediction objective is the expected value of Y given x, then squared-error loss (e.g., linear regression or CART) is appropriate
- If we want to predict the median of Y instead, then absolute loss is appropriate
- More generally, quantile loss can be used (cf. the MAP case study)
We will see successful examples of this approach in two case studies (KDD CUP 07 and MAP).
What do we do with complex measures like FROC?
- There is really no way to build a good model for them directly
- Less ambitious approach: build a model using standard approaches (e.g., logistic regression), then post-process the model's output to do well on the specific measure
- We will see a successful example of this approach in KDD CUP 08
19
3 Relational and Multi-Level Data
Real-life databases are rarely flat! Example: INFORMS Challenge 08, medical records:
[Schema diagram, tables linked by m:n relationships:
- Medication (629K): Event ID, Patient ID, Diagnosis, Medication, ..., Accounting
- Hospital (39K): Event ID, Patient ID, Diagnosis, Hospital Stay, ..., Accounting
- Conditions (210K): Patient ID, Diagnosis, Year
- Demographics (68K): Patient ID, Demographics, ..., Year]
20
Approaches for dealing with relational data
Modeling approaches that use relational data directly
- There has been a lot of research, but there is a scarcity of practically useful methods that take this approach
Flattening the relational structure into a standard (X, y) setup
- The key to this approach is the generation of useful features from the relational tables
- This is the approach we took in the INFORMS08 challenge (see the sketch below)
Ad-hoc approaches
- Based on specific properties of the data and modeling problem, it may be possible to "divide and conquer" the relational setup
- See the example in the KDD CUP 08 case study
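A minimal sketch of flattening by aggregation, on a hypothetical medication table loosely modeled on the INFORMS08 schema above (not the actual contest code):

```python
import pandas as pd

medication = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 3, 3],
    "diagnosis":  ["J18", "J18", "I10", "I10", "J18", "E11"],
    "cost":       [120.0, 80.0, 40.0, 55.0, 200.0, 90.0],
})

flat = medication.groupby("patient_id").agg(
    n_events=("diagnosis", "size"),              # number of medication events
    n_distinct_dx=("diagnosis", "nunique"),      # number of distinct diagnoses
    total_cost=("cost", "sum"),
    any_J18=("diagnosis", lambda d: int((d == "J18").any())),
)
print(flat)  # one row per patient: a standard X matrix, ready to join with y
```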
21
Modeler’s best friend: Exploratory data analysis
Exploratory data analysis (EDA) is a general name for a class of techniques aimed at:
- Examining data
- Validating data
- Forming hypotheses about data
The techniques are often graphical or intuitive, but can also be statistical:
- Testing very simple hypotheses as a way of getting at more complex ones
- E.g.: test each variable separately against the response, and look for strong correlations
The most important proponent of EDA was the great, late statistician John Tukey.
22
The beauty and value of exploratory data analysis
EDA is a critical step in creating successful predictive modeling solutions:
- Expose leakage
- AVOID PRECONCEPTIONS about what matters, what would work, etc.
Example: identifying the KDD CUP 08 leakage through EDA
- Graphical display of identifier vs. malignant/benign (see the case study slide)
- It could also be discovered via a statistical variable-by-variable examination of significant correlations with the response
- Key to finding this: AVOIDING PRECONCEPTIONS about the irrelevance of the identifier
23
Elements of EDA for predictive modeling
Examine data variable by variable
- Outliers? Missing data patterns?
Examine relationships with the response
- Strong correlations? Unexpected correlations?
Compare to other similar datasets/problems
- Are variable distributions consistent? Are correlations consistent?
Stare: at raw data, at graphs, at correlations/results.
Unexpected answers to any of these questions may change the course of the predictive modeling process.
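A minimal sketch of this variable-by-variable screening loop on synthetic data (AUC as the association measure is our choice; plain correlations work too):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))
df.loc[::50, "b"] = np.nan                       # inject some missing values
y = (df["a"] + rng.normal(size=500) > 0).astype(int)

for col in df.columns:
    x = df[col]
    auc = roc_auc_score(y[x.notna()], x.dropna())
    n_out = int((np.abs((x - x.mean()) / x.std()) > 4).sum())
    print(f"{col}: missing={x.isna().mean():.1%}, outliers>4sd={n_out}, "
          f"AUC vs response={auc:.2f}")   # suspiciously high AUC -> leakage?
```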
24
Case study #1: Netflix/KDD-Cup 2007
25
October 2006 Announcement of the NETFLIX Competition
USA Today headline: "Netflix offers $1 million prize for better movie recommendations."
Details:
- Beat the RMSE of NETFLIX's current recommender, 'Cinematch', by 10% prior to 2011
- $50,000 for the annual progress prize; the first two were awarded to the AT&T team: 9.4% improvement as of 10/08 (almost there!)
- The data contain a subset of 100 million movie ratings from NETFLIX, covering 480,189 users and 17,770 movies
- Performance is evaluated on holdout movie-user pairs
The NETFLIX competition has attracted ~50K contestants on ~40K teams from >150 different countries, with ~40K valid submissions from ~5K different teams.
(Netflix went public in 2002, at around $20.)
26
NETFLIX Data and the Internet Movie Data Base
[Diagram: the NETFLIX competition data as a subset of a larger universe:
- Movies: 17K out of all 80K movies (selection unclear), with IMDB fields such as Title, Year, Actors, Awards, Revenue, ...
- Users: 480K out of all 6.8M users, each with at least 20 ratings by the end of 2005
- Ratings: 100M, on a 1-5 scale]
27
NETFLIX data generation process
[Timeline diagram of the NETFLIX data generation process: the training data (ratings on a 1-5 scale) accumulates over time, with user arrivals and movie arrivals; the qualifier dataset (3M pairs) is held out at the end. For the KDD CUP tasks (Task 1 and Task 2, over the 17K movies), there are NO new user or movie arrivals.]
28
KDD-CUP 2007: based on the NETFLIX data
Training: the NETFLIX competition data (ratings through the end of 2005).
Test: 2006 ratings, randomly split by movie into two tasks.
Task 1: Who rated what in 2006
- Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006
- Result: the IBM Research team was second runner-up, No. 3 out of 39 teams
Task 2: Number of ratings per movie in 2006
- Given a list of 8,863 movies, predict the number of additional reviews that all existing users will give in 2006
- Result: the IBM Research team was the winner, No. 1 out of 34 teams
29
Test sets from 2006 for Task 1 and Task 2
[Diagram: construction of the two test sets from 2006 data. The 2006 marginal distributions of ratings over movies and over users are estimated; (movie, user) pairs are sampled according to the product of the marginals, and pairs that were rated prior to 2006 are removed, yielding the Task 1 test set (100K pairs). The Task 2 test set (8.8K movies) uses each movie's 2006 rating total, on the log(n+1) scale.]
30
Task 1: Did User A review Movie B in 2006?
A standard classification task: will "existing" users review "existing" movies in 2006?
- In line more with the "synthetic" (sterile) mode of competitions than the "real" mode
Challenges:
- Huge amount of data: how to sample the data so that learning algorithms can be applied at all is critical
- Complex underlying factors: decreasing interest in old movies; the growing tendency of Netflix users to watch (and review) more movies
Key solutions:
- Effective sampling strategies that keep as much information as possible
- Careful feature extraction from multiple sources
31
Task 2: How many reviews in 2006?
Task formulation:
- Regression task: predict the total count of reviews from "existing" users for 8,863 "existing" movies
- Evaluation is by RMSE on the log scale
Challenges:
- Movie dynamics and life-cycle: interest in movies changes over time
- User dynamics and life-cycle: no new users are added to the database
Key solutions:
- Use counts from the Task 1 test set to learn a model for 2006, adjusting for pair removal
- Build a set of quarterly lagged models to determine the overall scalar
- Use Poisson regression
32
Some data observations
Leakage alert! The Task 1 test set is a potential response for training a model for Task 2:
- It was sampled according to the movie marginal (= # reviews for the movie in '06 / total # reviews in '06), which is proportional to the Task 2 response (= # reviews for the movie in '06)
- BIG advantage: we get a view of 2006 behavior for half the movies. Build a model on this half, apply it to the other half (the Task 2 test set)
Caveats:
- Proportional sampling implies there is a scaling parameter left over, which we don't know
- Recall that after sampling, (movie, person) pairs that appeared before 2006 were dropped from the Task 1 test set; correcting for this is an inverse rejection sampling problem
33
Test sets from 2006 for Task 1 and Task 2
[Same test-set construction diagram as before, annotated: estimating the 2006 marginal distribution from the Task 1 test set is a surrogate learning problem for the Task 2 response.]
34
Some data observations (ctd.)
No new movies or reviewers appear in 2006:
- Need to emphasize modeling the life-cycle of movies (and reviewers)
- How are older movies reviewed relative to newer movies? Does this depend on other features (like the movie's genre)?
- This is especially critical when we consider the scaling caveat above
35
Some statistical perspectives
The Poisson distribution is very appropriate for counts:
- Clearly true of overall counts for 2006, assuming any kind of reasonable reviewer arrival process
- The right modeling approach for true counts is Poisson regression:
  n_i ~ Pois(lambda_i * t),  log(lambda_i) = sum_j beta_j * x_ij,
  beta* = arg max_beta l(n; X, beta)   (maximum likelihood)
What does this imply for the model evaluation approach?
- The variance-stabilizing transformation for Poisson is the square root: sqrt(n_i) has roughly constant variance
- RMSE on the log scale emphasizes performance on unpopular movies (small Poisson parameter -> larger log-scale variance)
- We still assumed that if we do well in a likelihood formulation, we will do well under any evaluation approach
Adapting to evaluation measures!
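A sketch of this Poisson regression on synthetic counts (statsmodels is our illustrative choice; the slide names no package):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_movies = 200
X = np.column_stack([
    np.ones(n_movies),               # intercept
    rng.normal(size=n_movies),       # e.g. log of last quarter's review count
    rng.normal(size=n_movies),       # e.g. movie age
])
true_beta = np.array([2.0, 0.8, -0.3])
counts = rng.poisson(np.exp(X @ true_beta))      # n_i ~ Pois(lambda_i)

model = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(model.params)                              # beta* = arg max likelihood
```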
36
Some statistical perspectives (ctd.)
Can we invert the rejection sampling mechanism? This can be viewed as a missing data problem:
- n_i, m_j are the observed counts for movie i and reviewer j, respectively
- p_i, q_j are the true marginals for movie i and reviewer j, respectively
- N is the total number of pairs rejected due to a review prior to 2006
- U_i, P_j are the users who reviewed movie i prior to 2006, and the movies reviewed by user j prior to 2006, respectively
Can we design a practical EM algorithm at our huge data size? An interesting research problem...
We implemented an ad-hoc inversion algorithm; iterate until convergence between:
- assuming movie marginals are fixed, adjusting reviewer marginals
- assuming reviewer marginals are fixed, adjusting movie marginals
We verified that it indeed improved our data, since it increased the correlation with the 4Q2005 counts.
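The slides only outline the ad-hoc inversion; below is one plausible concrete form, under the stated sampling model (pairs (i, j) drawn with probability p_i * q_j and rejected when user j rated movie i before 2006), on a toy example:

```python
# The observed movie marginal is proportional to p_i * (1 - sum of q_j over
# prior raters of movie i), and symmetrically for users, so we alternately
# invert the two relations until the marginals stabilize.
import numpy as np

def invert_rejection(obs_p, obs_q, prior_raters, prior_movies, n_iter=50):
    p, q = obs_p.copy(), obs_q.copy()
    for _ in range(n_iter):
        # reviewer marginals fixed -> adjust movie marginals
        keep = np.array([1.0 - q[list(u)].sum() for u in prior_raters])
        p = obs_p / np.clip(keep, 1e-9, None)
        p /= p.sum()
        # movie marginals fixed -> adjust reviewer marginals
        keep = np.array([1.0 - p[list(m)].sum() for m in prior_movies])
        q = obs_q / np.clip(keep, 1e-9, None)
        q /= q.sum()
    return p, q

# Toy example: 3 movies, 4 users; U_i / P_j from pre-2006 ratings.
prior_raters = [{0, 1}, {2}, set()]    # users who rated movie i before 2006
prior_movies = [{0}, {0}, {1}, set()]  # movies rated by user j before 2006
obs_p = np.array([0.2, 0.3, 0.5])      # marginals observed after rejection
obs_q = np.array([0.1, 0.2, 0.3, 0.4])
print(invert_rejection(obs_p, obs_q, prior_raters, prior_movies))
```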
37
Modeling Approach Schema
[Flow diagram, two paths:
Utilizing the leakage: "Who reviewed" test set (100K) -> inverse rejection sampling -> count ratings by movie -> estimate Poisson regression M1 and predict on Task 1 movies -> use M1 to predict Task 2 movies -> scale predictions to the estimated total.
Standard approach: NETFLIX challenge data + IMDB movie features -> construct movie features and lagged features (Q1-Q4 2005) -> estimate 4 Poisson regressions G1...G4 and predict for 2006 -> find the optimal scalar -> estimate the total 2006 ratings for the Task 2 test set.]
38
Some observations on modeling approach
Lagged datasets are meant to simulate forward prediction to 2006:
- Select a quarter (e.g., Q1'05), remove all movies & reviewers that "started" later
- Build a model on this data with, e.g., Q3'05 as the response
- Apply the model to our full dataset, which is naturally cropped at Q4'05; this gives a prediction for Q2'06
- With several models like this, predict all of 2006
Two potential uses:
- Use as our prediction for 2006, but only if better than the model built on Task 1 movies!
- Consider only the sum of their predictions, to use for scaling the Task 1 model
We evaluated models on the Task 1 test set (using a holdout when also building them on this set).
How can we evaluate the models built on lagged datasets? We are missing a scaling parameter between the 2006 prediction and the sampled set.
- Solution: select the optimal scaling based on Task 1 test set performance
- Since the other model was still better, we knew we should use it!
39
Some details on our models and submission
All models are at the movie level. Features we used:
- Historical reviews in previous months/quarters/years (on the log scale)
- Movie's age since premiere, movie's age in Netflix (since first review); also consider log, square, etc., to allow flexibility in the form of the functional dependence
- Movie's genre, including interactions between genre and age: the "life cycle" seems to differ by genre!
Models we considered (MSE on the log scale on the Task 1 holdout):
- Poisson regression on the Task 1 test set (0.24)
- Log-scale linear regression model on the Task 1 test set (0.25)
- Sum of lagged models built on 2005 quarters + best scaling (0.31)
Scaling based on the lagged models:
- Our estimate of the number of reviews for all movies in the Task 1 test set: about 9.5M
- Implied scaling parameter for the predictions: about 90
- Total of our submitted predictions for the Task 2 test set: 9.3M
40
Competition evaluation
First we were informed that we had won with an RMSE of ~770:
- They had mistakenly evaluated on the non-log scale, which puts a strong emphasis on the most popular movies
- We won by a large margin: our model did well on the most popular movies!
Then they re-evaluated on the log scale, and we still won:
- On the log scale the least popular movies are emphasized (recall that the variance-stabilizing transformation, the square root, is in between)
- So our predictions did well on unpopular movies too!
Interesting question: would we win on the square-root scale (or, similarly, under a Poisson likelihood-based evaluation)? We sure hope so!
41
Competition evaluation (ctd.)
Results of the competition (log-scale evaluation). The components of our model's MSE:
- The error of the model for the scaled-down Task 1 test set (which we estimated at about 0.24)
- Additional error from an incorrect scaling factor
Scaling numbers:
- True total reviews: 8.7M
- Sum of our predictions: 9.3M
Interesting question: what would be the best scaling
- for log-scale evaluation? Conjecture: we would need to under-estimate the true total
- for square-root evaluation? Conjecture: we would need to estimate about right
42
Effect of scaling on the two evaluation approaches
Scaling | Total reviews (M) | Log-scale MSE | Square-root-scale MSE | Comment
0.7     | 6.55              | 0.222         | 40.28                 |
0.8     | 7.48              | 0.208         | 29.80                 | Best log performance
0.9     | 8.42              | 0.225         | 26.38                 | Best sqrt performance
0.93    | 8.70              | 0.234         | 26.55                 | Correct scaling
1       | 9.35              | 0.263         | 28.86                 | Our solution
1.1     | 10.29             | 0.316         | 36.37                 |
43
Effect of scaling on the two evaluation approaches
[Chart: log-scale MSE and square-root-scale MSE as a function of the scaled total of predicted reviews.]
44
KDD CUP 2007: Summary
Keys to our success:
- Identifying the subtle leakage (is it formally leakage? Depends on the intentions of the organizers...)
- An appropriate statistical approach: Poisson regression; inverting the rejection sampling in the leakage
- Careful handling of the time-series aspects
Not keys to our success:
- Fancy machine learning algorithms
45
Case Study #2: KDD CUP 2008, Siemens Medical: Breast Cancer Identification
[Diagram: 6,816 images (MLO and CC views) from 1,712 patients yield 105,000 candidate regions; each candidate is a feature vector [x_1, x_2, ..., x_117, class]. Which candidates are malignant?]
46
KDD-CUP 2008 based on Mammography
Training: labeled candidates from 1,300 patients, plus the association of each candidate to its location, image and patient.
Test: candidates from a separate set of 1,300 patients.
Task 1: Rank all candidates by the likelihood of being cancerous
- Result: the IBM Research team was the winner, out of 246 entries
Task 2: Identify a list of healthy patients
- Result: the IBM Research team was the winner, out of 205 entries
47
Task 1: Candidate Likelihood of Cancer
An almost standard probability estimation/ranking task at the candidate level; somewhat synthetic, as the meaning of the features is unknown.
Challenges:
- Low positive rate: 7% of patients and 0.6% of candidates; beware of overfitting; consider sampling
- Unfamiliar evaluation measure: FROC, related to AUC but non-robust, and it hints at locality (see the sketch below)
Key solutions:
- Simple linear model
- Post-processing of scores
- Leakage in the identifiers
Adapting to evaluation measures!
[Plot: FROC curve, true positive patient rate vs. false positive candidate rate per image]
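A hedged sketch of FROC as we read it from this slide; the official KDD CUP 2008 scorer may differ in details:

```python
# y-axis: fraction of malignant patients hit by at least one flagged candidate;
# x-axis: false-positive candidates per image.
import numpy as np

def froc_points(scores, labels, patient_ids, n_images):
    order = np.argsort(-scores)                    # rank candidates by score
    cancer = {p for p, l in zip(patient_ids, labels) if l == 1}
    hit, fp, pts = set(), 0, []
    for i in order:
        if labels[i] == 1:
            hit.add(patient_ids[i])                # a cancer patient detected
        else:
            fp += 1                                # a false-positive candidate
        pts.append((fp / n_images, len(hit) / len(cancer)))
    return pts

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([1, 0, 0, 1, 0, 0])              # candidate-level malignancy
patient_ids = np.array([1, 2, 3, 4, 2, 5])         # patients 1 and 4 have cancer
print(froc_points(scores, labels, patient_ids, n_images=8))
```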
48
Task 2: Classify patients
A derivative of Task 1:
- A patient is healthy if all her candidates are benign
- The probability that a patient is healthy is the product of the probabilities of her candidates
Challenges:
- Extremely non-robust performance measure: including any patient with cancer in the list disqualified the entry
- Risk tradeoff: need to anticipate the solutions of the other participants
Key solutions:
- Pick a model with high sensitivity to false negatives
- Leakage in the identifiers: EDA at work
49
EDA on the Breast Cancer Domain
Console output of sorted 'patient_ID patient_lable' pairs:
[console listing elided: ---more---]
A base rate of 7%?!?
And what about the ID range 200K to 999K?
50
Leakage Mystery of the Data Generation: Identifier Leakage in the Breast Cancer Data
[Scatter plot: model score vs. log of patient ID; every point is a candidate. Annotated groups: 245 patients with 36% cancer; 414 patients with 1% cancer; 1,027 patients with 0% cancer; 18 patients with 85% cancer.]
- The distribution of identifiers has a strong natural grouping of patient identifiers: 3 natural buckets
- The three groups have VERY different base rates of cancer prevalence
- The last group seems to be sorted (cancer first), for a total of 4 groups with very different probabilities of cancer
The organizers admitted to having combined data from different years in order to increase the positive rate. For example, one source might be a preventive-care institution with a very low base rate of malignant patients, while another could be a treatment-oriented institution with much higher cancer prevalence.
51
Building the classification model
For evaluation we created a stratified 50% training/test split by patient:
- Given the few positives (~300), results may exhibit high variance
We explored various learning algorithms, including neural networks, logistic regression and various SVMs:
- Linear models (logistic regression or linear SVMs) yielded the most promising FROC results
Down-sampling the negative class?
- Keep only 25% of all healthy patients
- Helped in some cases, but not a reliable improvement
Add the identifier category (1, 2, 3, 4) as an additional feature.
52
Modeling Neighborhood Dependence
Candidates are not really i.i.d. but actually relational. Two approaches:
Stacking:
- Build an initial model and score all candidates
- Use the labels of the neighbors in a second round
- Can be formulated as an EM problem: treat the labels of the neighbors as unobserved
Pair-wise constraints based on location adjacency:
- Calculate the Euclidean distance between candidates within the same picture, and the distance to the nipple in both views for each breast
- Select the pairs of candidates with a distance difference less than a threshold (we used threshold = 20, resulting in 29,139 constraints)
- Constraints: selected pairs of examples (x_i,MLO, x_i,CC) should have the same predicted labels, i.e., f(x_i,MLO) = f(x_i,CC)
Results:
- Seems to improve the probability estimates in terms of AUC
- Did not improve FROC in our experiments
53
Outlier Treatment
Many of the 117 numeric features have large outliers. They:
- Incur a huge penalty in terms of likelihood
- Cause large bias and badly calibrated probabilities
- Produce extreme (wrong) values in the predictions
[Histogram of Feature 10: 142 observations > 5]
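One simple treatment consistent with this observation is winsorizing (clipping) each feature at extreme percentiles; the slide does not specify the exact treatment used, so this is an illustrative option:

```python
import numpy as np

def winsorize(X, lower=0.5, upper=99.5):
    lo = np.percentile(X, lower, axis=0)
    hi = np.percentile(X, upper, axis=0)
    return np.clip(X, lo, hi)            # extreme values no longer dominate

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
X[::97, 2] += 50.0                       # inject large outliers into feature 2
print(X[:, 2].max(), winsorize(X)[:, 2].max())
```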
54
ROC vs. FROC optimization: Post-processing of model scores?
Adapting to evaluation.
[Plots: ROC (true positive rate vs. false positive rate) next to FROC (true positive patient rate vs. false positive candidate rate per image)]
- In ROC, all rows are independent, and both true positives and false positives are counted by row
- FROC counts true patients against false-positive candidates
- A higher TP rate on candidates does not improve FROC unless the candidates come from new patients; e.g., it is better to have 2 correctly identified candidates from different patients than 5 from the same patient
- It is best to re-order candidates based on model scores so as to ensure that many different patients appear up front
55
Theory of Post-processing
Probabilistic approach: at any point, emit the candidate that maximizes the expected gradient of the FROC at that point.
Define, for each candidate c of patient i:
- p_c: the probability that candidate c is malignant
- np_i: the probability that patient i has not yet been identified
Three cases:
- The candidate is positive but the patient is already identified, with probability p_c * (1 - np_i)
- The candidate is positive and the patient is new, with probability p_c * np_i
- The candidate is negative, with probability 1 - p_c
Pick the candidate with the largest expected gain: p_c * np_i / (1 - p_c).
Theorem: the expected value of the FROC for this order is higher than for any other order.
Problem: our probability estimates are not good enough for this to work well.
(Side note: "pUp", a name derived from "probability to go up", refers to the probability that the AUC will increase if the observations are reordered.)
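A minimal sketch of the greedy ordering implied by the expected-gain rule; the multiplicative update of np_i after emitting a candidate is our assumption, the slide does not spell it out:

```python
def expected_froc_order(cands):
    """cands: list of (p_c, patient_id); returns the re-ordered list."""
    np_i = {pid: 1.0 for _, pid in cands}   # P(patient not yet identified)
    remaining, out = list(cands), []
    while remaining:
        # expected gain of emitting candidate c next: p_c * np_i / (1 - p_c)
        best = max(remaining,
                   key=lambda c: c[0] * np_i[c[1]] / max(1.0 - c[0], 1e-9))
        remaining.remove(best)
        out.append(best)
        np_i[best[1]] *= 1.0 - best[0]      # patient may now be identified
    return out

cands = [(0.90, "A"), (0.85, "A"), (0.80, "B"), (0.10, "C")]
print(expected_froc_order(cands))  # the second "A" candidate drops below "B"
```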
56
Bad calibration!
[Calibration plot: we consistently over-predict the probability of cancer for the most likely candidates]
Causes:
- The linear bias of the method
- High class skew
- Outliers in the 117 numeric features lead to extreme predictions on the holdout
Re-calibration? We tried a number of methods:
- No improvement; some resulted in better calibration but hurt the ranking
57
Post-Processing Heuristic
Ad-hoc approach (see the sketch below):
- Take the top n ranked candidates, where n is approximately the number of positive candidates
- Select the candidate with the highest score for each patient from this list, and put these at the top of the list
- Iterate until all top n candidates are re-ordered
[Plot: re-ordering the model scores significantly improves the FROC with no additional modeling]
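A minimal sketch of this heuristic, reconstructed from the three steps above:

```python
def heuristic_reorder(cands, n):
    """cands: list of (score, patient_id), already sorted by score descending."""
    top, rest, out = list(cands[:n]), list(cands[n:]), []
    while top:
        seen, picked, leftover = set(), [], []
        for c in top:                 # top stays score-sorted, so the first
            if c[1] not in seen:      # hit per patient is its highest-scoring
                seen.add(c[1])        # candidate
                picked.append(c)
            else:
                leftover.append(c)
        out.extend(picked)            # one candidate per patient this round
        top = leftover                # iterate on the remainder
    return out + rest

cands = [(0.9, "A"), (0.8, "A"), (0.7, "B"), (0.6, "C"), (0.5, "B")]
print(heuristic_reorder(cands, n=4))  # A, B, C first; A's duplicate comes later
```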
58
Submissions and Results
Task 1:
- Bagged linear SVMs with hinge loss and heuristic post-processing
- This approach scored the winning FROC result, out of 246 submissions from 110 unique participants
- Some rumors that other participants also found the ID leakage
Task 2:
- The logistic model performs better than the SVM models, probably because the likelihood is more sensitive to extreme errors (the first false negative)
- The first false negative typically occurs around 1,100 patients in the training set
- We submitted the first 1,020 patients ranked by a logistic model that included the ID feature + the original 117 features
- This scored the winning specificity on the test set, with no false negatives
- Only 24 out of 203 submissions had no false negatives; second place scored 0.17 specificity
59
Summary in terms of success factors
Leakage in the identifier provides information about the likelihood of a patient to have cancer:
- Caused by the organizers' effort to increase the positive rate by adding 'old' patients that developed cancer
Post-processing for FROC optimization.
Awareness of the impact of feature outliers:
- Interacts with the statistical properties of the data and the model
- Log-likelihood is more sensitive than hinge loss
Otherwise, simple models to avoid overfitting:
- Linear models
- Relational modeling did not help for the given evaluation
60
KDD CUP 2009
Data: a customer database from Orange, with 100K observations and 15K variables.
Three different tasks and 2.5 versions:
- Prediction: churn, appetency, upselling
- Versions: fast (5 days) & slow (1 month); large and small versions of the data
Interesting characteristics:
- Highly 'sterile': nothing known about anything
- Leaderboard: it was possible to match the large and small versions and to receive feedback on 20% of the test set
61
KDD CUP 2000 Data: online store history for Gazelle.com
Five different tasks, including:
- Prediction: who will continue in the session? Who will buy?
- Insights: characterize heavy spenders
Interesting characteristics:
- "Leakage": internal testing sessions were left in the data; their deterministic behavior, if identified, gives 100% accuracy in prediction for part of the data
- Evaluation in terms of "real" business objectives? Sort of: handled by defining a set of "standard" questions, each covering a different aspect of the business objective
- Relational data? Yes: customers had different numbers of sessions, of different lengths, with different stages
62
KDD CUP 2003
Data: citation rates of physics papers. Two tasks:
- Predict the change in the number of citations during the next 3 months
- Write an interesting paper about the data
Interesting characteristics:
- Highly relational: links between papers and authors; feature construction was up to the participants
- Leakage was impossible, since the truth was really in the future
- Evaluation by SSE against integer values (Poisson)
63
ILP Challenge 2005
Data: the yeast genome, including protein sequences, alignment similarity scores with other proteins, and additional protein information from a relational DB.
Task: identify (potentially multiple) functional classes for each gene.
Interesting characteristics:
- 420 possible classes, very subjective assignment
- Purely relational, no 'features' available: distances (supposedly p-values) of gene alignments; secondary structure of the protein (amino acids); a protein DB with keywords, etc.
- 'Leakage' in the identifier: it contains a letter identifying the labeling research group
- Highly unsatisfactory evaluation: precision of the predictions
64
INFORMS data mining contest 2008
Data: 2 years of hospital records with accounting information (cost, reimbursement, ...), patient demographics, and medication history.
Tasks:
- Identify pneumonia patients
- Design an optimization setting for preventive treatment
Interesting characteristics:
- Relational setting (4 tables linked through a patient identifier)
- Leakage: removal of the pneumonia code left hidden traces
- 'Dirty' data with plenty of missing values, contradictory demographics, and changing patient IDs
65
Data Mining in the Wild: Project Work
Similarities with competitions (compared to DM research):
- Single dataset
- Algorithms can be existing and simple; no real need for baselines (although useful)
- The absolute performance matters
Differences from competitions:
- You need to decide what the analytical problem is
- You need to define the evaluation, rather than optimize it
- You need to avoid leakage, rather than use it
- You need to FIND all relevant data, rather than use what is there (this often leads to relational settings)
- You need to deliver it somehow to have impact
66
Case Study #3: Market Alignment Program
Wallet: the total amount of money that a customer can spend in a certain product category in a given period.
Why are we interested in wallet?
Customer targeting:
- Focus on acquiring customers with high wallet
- For existing customers, focus on high-wallet, low share-of-wallet customers
Sales force management:
- Use wallet as a sales force allocation target, and make resource assignment decisions based on it
- Evaluate the success of sales personnel by attained share-of-wallet
67
Wallet Modeling Challenge
The customer wallet is never observed:
- There is nothing to "fit a model" to
- Even if you have a model, how do you evaluate it?
We would like a predictive approach based on the available data:
- Firmographics (sales, industry, employees)
- IBM sales and transaction history
68
Define Wallet/Opportunity?
TOTAL: the customer's total available budget for IT
- Can we really hope to attain all of it?
SERVED: total customer spending on IT products offered by IBM
- A better definition for our marketing purposes
REALISTIC: the IBM spending of the "best similar customers"
[Diagram: nested definitions plotted against company revenue: IBM Sales <= REALISTIC <= SERVED <= TOTAL]
69
REALISTIC Wallets as quantiles
Motivation:
- Imagine 100 identical firms with identical IT needs, and consider the distribution of IBM sales to these firms
- The bottom 95% of firms could spend as much as the top 5% actually do
Define the REALISTIC wallet as a high percentile of spending, conditional on the customer attributes.
- This implies that a few customers are spending their full wallet with us; however, we do not know which ones
70
Formally: Percentile of a Conditional Distribution
The distribution of IBM sales s to the customer, given customer attributes x: s | x ~ f(s; x).
Two obvious ways to get at the p-th percentile (the REALISTIC wallet):
- Estimate the conditional distribution by integrating over a neighborhood of similar customers, and take the p-th percentile of spending in the neighborhood
- Create a global model for the p-th percentile; e.g., build a global regression model such as s | x ~ Exp(alpha*x + beta)
71
Estimation: the Quantile Loss Function
The mean minimizes a sum of squared residuals:
  mean = arg min_c sum_i (y_i - c)^2
The median minimizes a sum of absolute residuals:
  median = arg min_c sum_i |y_i - c|
The p-th quantile minimizes an asymmetrically weighted sum of absolute residuals:
  q_p = arg min_c sum_i L_p(y_i - c),  where L_p(r) = p*r if r >= 0, and (p-1)*r if r < 0
[Plot: the quantile loss function for p = 0.8 vs. p = 0.5 (absolute loss)]
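A sketch of the quantile loss and a model fit that minimizes it; QuantReg from statsmodels is our illustrative choice, and the data below are synthetic (the MAP system itself is not public):

```python
import numpy as np
import statsmodels.api as sm

def quantile_loss(y, pred, p):
    r = y - pred
    return np.where(r >= 0, p * r, (p - 1) * r).sum()   # asymmetric |residual|

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, size=500)
y = x * rng.exponential(scale=1.0, size=500)     # skewed "spending" target
X = sm.add_constant(x)

fit = sm.QuantReg(y, X).fit(q=0.9)               # global model of the 0.9 quantile
print(fit.params)
print(quantile_loss(y, fit.predict(X), p=0.9))   # in-sample 0.9-quantile loss
```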
72
Overview of analytical approaches
[Diagram of the analytical approaches:
- 'Ad hoc': kNN by industry and size
- Optimization: quantile regression (model form: linear, decision tree, quanting) and general kNN (choices: K, distance, features)
- Decomposition: linear model with adjustment]
73
Data Generation Process
We need to combine data on revenue with customer properties:
- A complicated matching process between the IBM-internal customer view (accounts) and the external sources (Dun & Bradstreet)
- A probabilistic process with plenty of heuristics; huge danger of introducing data bias
- Tradeoff between data quality and coverage
Leakage potential:
- We can only get current customer information, and this information might be tainted by the customer's interaction with IBM
- The problem gets amplified when we try to augment the data with home-page information
74
Evaluating Measures for Wallet
We still don't know the truth. Combined approach:
- Quantile loss to assess only the relevant predictive ability, and for feature selection
- Expert feedback to select a suitable model class
- Business impact to identify overall effectiveness
Quantile loss: available, but not that relevant (missing a parameter; sensitive to skew; on what scale - original? log?)
Expert feedback: relevant, but similar to a survey (unclear incentives, potentially biased, hard to come by on a large scale)
Business impact: very relevant, but highly aggregated, long lag, convoluted with the impact of other things, and requires intense tracking
75
Empirical Evaluation I: Quantile Loss
Setup:
- Four domains with relevant quantile modeling problems: direct mailing, housing prices, income data, IBM sales
- Performance on a test set in terms of the 0.9th quantile loss
Approaches:
- Linear quantile regression
- Q-kNN (kNN with the quantile prediction taken over the neighbors; see the sketch below)
- Quantile trees (quantile prediction in the leaf)
- Bagged quantile trees
- Quanting (Langford et al.; reduces quantile estimation to averaged classification, using trees)
Baselines:
- Best constant model
- Traditional regression models for expected values, adjusted under a Gaussian assumption (+1.28 standard deviations)
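A minimal sketch of Q-kNN as we read it from the description above (predict the 0.9 quantile of the target over the k nearest neighbors):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def q_knn_predict(X_train, y_train, X_test, k=50, q=0.9):
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(X_test)                  # neighbors per test row
    return np.quantile(y_train[idx], q, axis=1)     # quantile over neighbors

rng = np.random.default_rng(6)
X_train = rng.normal(size=(2000, 4))
y_train = np.exp(X_train[:, 0] + rng.normal(size=2000))   # skewed target
X_test = rng.normal(size=(5, 4))
print(q_knn_predict(X_train, y_train, X_test))
```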
76
Performance on Quantile Loss (smaller is better)
Conclusions:
- Standard regression is not competitive (because the residuals are not normal)
- If there is a time-lagged variable, the linear quantile model is best
- The splitting criterion is irrelevant in the tree models
- Quanting (using decision trees) and quantile trees perform comparably
- Generalized kNN is not competitive
77
Evaluation II: MAP Workshops Overview
Calculated the 2005 opportunity using the naive Q-kNN approach for the 2005 MAP workshops:
- Displayed the opportunity by brand; the expert can accept or alter the opportunity
Evaluation:
- Select 3 brands: DB2, Rational, Tivoli
- Build ~100 models for each brand using different approaches
- Compare the expert opportunity to the model predictions
- Error measures: absolute and squared, on the original, log, and root scales (a total of 6 measures)
78
Expert Feedback to Original Model
[Chart: breakdown of expert feedback on the originally displayed opportunity:
- Experts accept the opportunity: 45%
- Experts change the opportunity: 40% (breaking out the 40%: increase 17%, decrease 23%)
- Experts reduce the opportunity to 0: 15%]
79
Observations
- Many accounts have their opportunity set to zero for external reasons: exclude them from the evaluation, since no model can predict the competitive environment
- Exponential distribution of opportunities: evaluation on the original (non-log) scale is subject to large outliers
- Experts seem to make percentage adjustments: consider log-scale evaluation in addition to the original scale, with the root scale as an intermediate
- Suspect a strong "anchoring" bias: 45% of opportunities were not touched
80
Model Comparison Results
We count how often a model scores within the top 10 and top 20 for each of the 6 measures, per brand (Rational, DB2, Tivoli).
[Table, partially garbled in the source; the models compared include: the displayed model (kNN; the anchoring baseline), max revenue, linear quantile (0.8), regression tree, Q-kNN 50 + flooring, decomposition, center, and quantile tree (0.8, the best), each with and without flooring.]
81
MAP Experiments Conclusions
- Q-kNN performs very well after flooring, but is typically inferior prior to flooring
- Linear quantile regression (80th percentile) performs consistently well (flooring has a minor effect)
- Experts are strongly influenced by the displayed opportunity (and the displayed revenue of previous years)
- Models without last year's revenue don't perform well
- Decision: use linear quantile regression with q = 0.8 in MAP 06
82
MAP Business Impact
- MAP launched in 2005, in workshops held worldwide, with teams responsible for most of IBM's revenue
- MAP was recognized as a 2006 IBM Research Accomplishment, awarded based on "proven" business impact
- Runner-up for the Case Study Award at KDD 2007; Edelman finalist 2009
- The most important use is segmentation of the customer base: shift resources into "invest" segments with low wallet share
83
Business Impact
For 2006, 270 resource shifts were made to 268 Invest accounts. We examine the performance of these accounts relative to the background (segments: Invest, Core - Growth, Core - Optimize):
- REVENUE: 9% growth in INVEST accounts vs. 4% growth in all other accounts
- PIPELINE (relative to 2005): 17% growth in INVEST accounts vs. 3% growth in all other accounts
- QUOTA ATTAINMENT: 45% for MAP-shifted resources vs. 36% for non-MAP shifts
[Chart (Edelman): validated revenue opportunity ($M) vs. 2005 actual revenue ($M), showing the 270 shifts]
84
Summary in terms of success factors
1. Data and domain understanding
- Matching the business objective to the modeling approach made a previously unsolvable business problem solvable with predictive modeling
2. Statistical insight
- Minimizing quantile loss estimates the correct quantity
- A single evaluation metric is not enough in real life
- Autocorrelation helps the linear model
3. Modeling
- Extension to tree induction; comparative study
- In the end: linear it is
85
Identify Potential Causes for Chip Failure
Data: 5K machines, of which 18 failed in the last year.
Task: identify a (short) list of other machines that are likely to fail, so they can be preemptively fixed.
Characteristics:
- Relational: tool ID; multiple chips per machine (only the first failure is detected)
- Leakage: the database was clearly augmented after failures: all failures have a customer associated, but the customer is missing for most non-failures
- Statistical observation: this is really a survival analysis problem; failures do not occur before a runtime of 180 days
- Accuracy, and even AUC, is NOT relevant; insight into the cause of failure, and lift and false-positive rate in the top k, matter more
86
Threats in Competitions and Projects
Unavailability of data; data generation problems.
The model is not good enough to be useful.
Model results are not accessible to the user:
- If the user has to understand the model, you need to keep it simple
- Web delivery of predictions
Competitions:
- Mistakes under time pressure
- Accidental use of the target (e.g., in a kernel SVM)
- Complexity
- Overfitting
87
Overfitting
Even if you think you know this one, you probably still overdo it!
KDD CUP results have shown that a large number of entries overfit:
- In 2003, 90% of entries did worse than the best constant prediction
- Corollary: don't overdo it in the search
Having a holdout does NOT make you immune to overfitting: you just overfit on the holdout.
10-fold cross-validation does NOT make you immune either.
Leaderboards on 10% of the test set are VERY deceptive. KDD CUP 2009:
- The winner of the fast challenge, after only 5 days, was indeed the leader of the board
- The winner of the slow challenge, after 1 more month, was NOT the leader of the board
88
Overfitting: Example KDD CUP 2008
Data: 105,000 candidates, 117 numeric features. Sounds good, right?
- Overfitting is NOT just about the training size and model complexity; linear models overfit too!
How robust is the evaluation measure?
- AUC vs. FROC vs. the number of healthy patients
What is the base rate?
- Only 600 positives
89
Factors of Success in Competitions and Real Life
1. Data and domain understanding
- Generation of data and task
- Cleaning and representation/transformation
2. Statistical insights
- Statistical properties
- Test validity of assumptions
- Performance measure
3. Modeling and learning approach
- Most "publishable" part
- Choice or development of the most suitable algorithm
[Diagram: the relative weight of these factors differs between "real" and "sterile" competitions]
90
Success Factor 1: Data and Domain Understanding
Task and data generation; formulating the analytical problem (MAP); EDA; checking for leakage:
- KDD 07 (NETFLIX): adjust for the decreasing population; the Task 1 target as leakage
- KDD 08 (Cancer): combined sources lead to leakage
- MAP: wallet definition and the design of the analytical solution
91
Success Factors 2: Statistical insights
Properties of evaluation measures:
- Does it measure what you care about?
- Robustness; invariance to transformations
- Linkage between model optimization, the statistic, and performance
Per case study:
- KDD 07 (NETFLIX): Poisson regression; log transform; downscale
- KDD 08 (Cancer): highly non-robust measure, beware of overfitting; post-processing
- MAP: robust evaluation; multiple measures
92
Success Factors 3: Models and approach
How much complexity do you need?
- Often linear does just fine with correctly constructed features (actually, all of my wins have been with linear models)
- Feature selection
Can you optimize what you want to optimize?
- How does the model relate to your evaluation metric? Regression approaches predict the conditional mean; accuracy vs. AUC vs. log-likelihood
Does it scale to your problem?
- Some cool methods just do not run on 100K rows
Per case study:
- NETFLIX: linear Poisson regression, log transform
- KDD CUP 08: logistic regression, linear SVM
- MAP: linear quantile regression
93
Summary: comparison of case studies
                        | KDD CUP 07 Task 2                 | KDD CUP 08 Task 1             | MAP
Ultimate modeling goal  | Demand forecasting                | Breast cancer detection       | Customer wallet estimation
Evaluation objective    | Log-scale RMSE                    | FROC                          | Quantile loss / expert feedback
Key data/domain insight | Leakage from Task 1               | Leakage in patient IDs        | Duality quantile-wallet
Key statistical insight | Poisson distribution              | FROC post-processing          | Optimizer of quantile loss
Best modeling approach  | Maximum likelihood (Poisson reg.) | Machine learning (linear SVM) | Empirical risk minimization (quantile reg.)
94
Invitation: Please join us in another data mining competition!
INFORMS Data Mining contest on health care data. Register at
- Real data of hospital visits for patients with severe heart disease
- 'Real' tasks for an ongoing project: transfer to specialized hospitals; severity / death
- Relational (multiple hospital stays per patient)
- Evaluation: AUC
- Publication and workshop at INFORMS 2009