Predictive modeling competitions making data science a sport Anthony Goldbloom CEO, Kaggle e-mail anthony.goldbloom@kaggle.com twitter @antgoldbloom Photo by mikebaird, www.flickr.com/photos/mikebaird
Global competitions. Predicting HIV viral load (chart: state of the art 70%; after 1½ weeks 70.8%; at competition close 77%). Competitions involve participants from all over the world competing to produce the best models. One of our first competitions helped improve the state of the art in HIV modelling by 10 per cent. The scientific literature, like an in-house modelling effort, evolves slowly: somebody tries something, somebody else tweaks that approach, and so on. Opening up a problem to a wide audience leads to rapid improvements.
Diverse experts solving diverse problems. Competitions: Grant Application Forecasting, Chess Ratings, HIV Research, Stock Price Prediction, Travel Time Prediction. Participants: Edmund & Adrian London & USA Dr. Derek Gatherer UK Felipe Maia Uppsala University Ivan Russian Federation Philipp Emanuel Widmann Heidelberg, DE Dr. Christopher Hefele, New York Robert Warsaw Chih-Li Sung & Roy Tseng Penghu & Taipei Gzegorz Swiszcz Gera Cole Harris Texas Giuseppe Ragusa Rome Jure Zbontar Ljubljana Claudio Perlich USA Chris DuBois Portland Edmund & Adrian London & USA John Blatz Baltimore Jason Trigg Pennsylvania Chris Raimondi Baltimore Rajstennaj Barrabas USA Jason Trigg Pennsylvania Uri Blass Tel-Aviv Lee Baker Las Cruces, NM Nan Zhou Pittsburgh Jeremy Howard Australia Thomas Mahony Canberra Glen Maher Canberra Emir Delic Australia
Motivation Why host a competition? Why compete? How it works Heritage Health Prize Questions
“I keep saying the sexy job in the next ten years will be statisticians.” Hal Varian Google Chief Economist 2009
Crowdsourcing: mismatch between those with data and those with the skills to analyse it. It is almost never the case that a single organization has access to all the advanced machine learning and statistical techniques that would allow it to extract maximum value from its data. Meanwhile, data scientists crave real-world data to develop and refine their techniques. Crowdsourcing corrects this mismatch by offering companies a cost-effective way to harness the ‘cognitive surplus’ of the world's best data scientists.
Countless possible approaches to any data prediction problem. Which to choose? There are countless models that can be applied to any one predictive analytics problem. It is impossible to know at the outset which technique will be most effective.
18-year-old beating his professors
Motivation Why host a competition? Why compete? How it works Heritage Health Prize Questions
Tourism Forecasting Competition. Chart: forecast error (MASE) for the existing model, on Aug 9, 2 weeks later, 1 month later, and at competition end. Improvements are very rapid at first, then the rate of change slows down.
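For readers unfamiliar with the error measure on this slide, here is a minimal sketch of MASE (mean absolute scaled error) in its common non-seasonal form: the forecast's mean absolute error divided by the in-sample mean absolute error of a naive "repeat the last value" forecast. The slides do not say which variant the competition used, so the function name, the scaling choice, and the sample numbers are all illustrative assumptions.

```python
def mase(train, actual, forecast):
    """Mean absolute scaled error, scaled by a one-step naive forecast.

    Assumption: non-seasonal scaling; the competition's exact variant
    is not stated in the slides.
    """
    if len(train) < 2:
        raise ValueError("need at least two training observations")
    if not actual or len(actual) != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equal in length")

    # In-sample error of the naive forecast: predict each value with the previous one.
    naive_mae = sum(abs(b - a) for a, b in zip(train, train[1:])) / (len(train) - 1)
    if naive_mae == 0:
        raise ValueError("training series is constant; MASE is undefined")

    # Out-of-sample error of the forecast being evaluated.
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return forecast_mae / naive_mae


# Made-up numbers, purely for illustration.
history = [10, 12, 14, 16]
observed = [18, 20]
predicted = [17, 22]
score = mase(history, observed, predicted)  # 1.5 / 2.0 = 0.75
```

A value below 1 means the forecast beat the naive benchmark on average; lower is better.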
Chess Ratings Competition. Chart: error rate (RMSE) for the existing model (Elo), on Aug 4, 1 month later, 2 months later, and today. Elo is the algorithm used to power Mark Zuckerberg’s Facemash. For those who have seen The Social Network, it was the algorithm that Eduardo Saverin wrote on Mark’s window.
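As a rough illustration of the benchmark on this slide, the sketch below computes the standard Elo expected score and measures predictions with RMSE. The expected-score formula is the usual Elo one; how the competition aggregated games before scoring is not given in the slides, so the per-game RMSE and the sample ratings here are assumptions.

```python
import math


def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("sequences must be non-empty and equal in length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


# Hypothetical games: (rating of A, rating of B, result for A),
# where the result is 1 for a win, 0.5 for a draw, 0 for a loss.
games = [(1600, 1600, 1.0), (1800, 1400, 1.0), (1500, 1700, 0.5)]
predictions = [elo_expected_score(a, b) for a, b, _ in games]
results = [result for _, _, result in games]
error = rmse(predictions, results)
```

Any competing rating system can be scored the same way by swapping out `elo_expected_score`, which is what makes a fixed benchmark like Elo useful.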
Our user base: from many different (maths-related) disciplines
Users apply different techniques: neural networks, logistic regression, support vector machines, decision trees, ensemble methods, AdaBoost, Bayesian networks, genetic algorithms, random forests, Monte Carlo methods, principal component analysis, Kalman filters, evolutionary fuzzy modeling. Users have the option to tell us their favourite techniques.
Benchmarking. We’re talking to a bank in Australia at the moment. They are receiving criticism for the credit scores on a particular product; they want to know whether the
Case study: VicRoads has an algorithm that they use to forecast travel time on Melbourne freeways (taking into account time, weather, accidents etc.). Their current model is inaccurate and somewhat useless. They want to do better (or at least find out whether it’s possible to do better).
NASA tried, now it’s our turn. NASA’s leading experts have tried for years to find galaxies that have been gravitationally lensed. They haven’t satisfactorily solved the problem. Now it’s our turn.
Ideal for complex problems. Example: a real estate data provider that wants to predict which houses in a particular suburb will go up for sale in any three-month period.
Case study: Melbourne University. ~25% successful grant applications. Outcomes of a competition to predict the success of grant applications: (1) better identify likely successes to avoid wasting resources on hopeless applications; (2) identify and communicate the characteristics of a successful application to future applicants.
Motivation Why host a competition? Why compete? How it works Heritage Health Prize Questions
More fun than Sudoku. Why participants compete: (1) clean, real-world data; (2) professional reputation and experience; (3) interactions with experts in related fields; (4) prizes. Participants compete for four reasons: access to real-world data (which is delivered on a silver platter); the chance to benchmark their techniques and enhance their professional reputations (winners are the rockstars on Kaggle); the opportunity to interact with experts in related fields (whom they might otherwise not get to meet); and prizes.
User base: many are academics who want access to real-world data and problems.
Motivation Why host a competition? Why compete? How it works Heritage Health Prize Questions
1. Upload 2. Submit 3. Evaluate & Exchange
Use the wizard to post a competition
Participants make their entries
Competitions are judged based on predictive accuracy. How do you know who to choose? Compare techniques on a uniform dataset with a uniform evaluation algorithm.
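The idea of scoring every entry with one metric against one held-out answer set can be sketched as follows. This is not Kaggle's implementation: the function names, the choice of mean absolute error as the metric, and the team names and numbers are all hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Example metric; any function where lower is better can be plugged in."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("sequences must be non-empty and equal in length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rank_submissions(submissions, answers, metric=mean_absolute_error):
    """Score every submission with the same metric on the same hidden answers.

    Returns (name, score) pairs sorted from best (lowest error) to worst.
    """
    scores = {name: metric(preds, answers) for name, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical held-out answers and entries.
hidden_answers = [3.0, 5.0, 2.0]
entries = {
    "team_c": [4.0, 6.0, 3.0],
    "team_a": [3.0, 5.0, 2.0],
    "team_b": [2.0, 5.0, 2.0],
}
leaderboard = rank_submissions(entries, hidden_answers)
```

Because the answers and the metric are fixed in advance, the ranking depends only on predictive accuracy, not on who submitted or which technique they used.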
Competition mechanics: competitions are judged on objective criteria. The essence of a ‘predicting the past’ competition is deriving insights from data that is already available to facilitate better decisions in the future.
Motivation Why host a competition? Why compete? How it works Heritage Health Prize Questions
An upcoming competition, powered by Kaggle. De-identified dataset containing medical records of 100,000 Americans. $3 million prize. http://www.heritagehealthprize.com
Probability of going to hospital in the next year. Diagram factors: unfilled prescriptions, hypertension, high cholesterol, diabetes.
Netflix Prize (2006-2009): $1 million prize, 50,000 registrations. Heritage Health Prize (2011): $3 million prize, projected 100,000 registrations.
Motivation Why host a competition? Why compete? How it works Heritage Health Prize Questions
Predict Grant Applications; Tourism Forecasting (Part 2); IJCNN Social Network Challenge; Chess Ratings: Elo vs. the Rest of the World
Jeff Moser, Jeremy Howard, Nicholas Gruen, Anthony Goldbloom
What could the world’s best analysts find in your data? e-mail anthony.goldbloom@kaggle.com phone +61438400053 Photo by gidzy, www.flickr.com/photos/gidzy