Predictive modeling competitions

Presentation transcript:

Predictive modeling competitions: making data science a sport. Anthony Goldbloom, CEO, Kaggle. E-mail: anthony.goldbloom@kaggle.com; Twitter: @antgoldbloom. Photo by mikebaird, www.flickr.com/photos/mikebaird

Global competitions: predicting HIV viral load. Chart labels: state of the art 70%; 1½ weeks 70.8%; competition closes 77%. Competitions involve participants from all over the world competing to produce the best models. One of our first competitions helped improve the state of the art in HIV modelling by 10 per cent. The scientific literature, like an in-house modelling effort, evolves slowly: somebody tries something, somebody tweaks that approach, and so on. Opening up a problem to a wide audience leads to rapid improvements.

Diverse experts solving diverse problems. Competitions: Grant Application Forecasting, Chess Ratings, HIV Research, Stock Price Prediction, Travel Time Prediction. Participants: Edmund & Adrian (London & USA); Dr. Derek Gatherer (UK); Felipe Maia (Uppsala University); Ivan (Russian Federation); Philipp Emanuel Widmann (Heidelberg, DE); Dr. Christopher Hefele (New York); Robert (Warsaw); Chih-Li Sung & Roy Tseng (Penghu & Taipei); Gzegorz Swiszcz Gera; Cole Harris (Texas); Giuseppe Ragusa (Rome); Jure Zbontar (Ljubljana); Claudio Perlich (USA); Chris DuBois (Portland); Edmund & Adrian (London & USA); John Blatz (Baltimore); Jason Trigg (Pennsylvania); Chris Raimondi (Baltimore); Rajstennaj Barrabas (USA); Jason Trigg (Pennsylvania); Uri Blass (Tel-Aviv); Lee Baker (Las Cruces, NM); Nan Zhou (Pittsburgh); Jeremy Howard (Australia); Thomas Mahony (Canberra); Glen Maher (Canberra); Emir Delic (Australia).

Outline: Motivation; Why host a competition?; Why compete?; How it works; Heritage Health Prize; Questions

“I keep saying the sexy job in the next ten years will be statisticians.” Hal Varian, Google Chief Economist, 2009

Crowdsourcing: a mismatch between those with data and those with the skills to analyse it. It is almost never the case that any single organization has access to the advanced machine learning and statistical techniques that would allow them to extract maximum value from their data. Meanwhile, data scientists crave real-world data to develop and refine their techniques. Crowdsourcing corrects this mismatch by offering companies a cost-effective way to harness the ‘cognitive surplus’ of the world's best data scientists.

Countless possible approaches to any data prediction problem. Which to choose? There are countless models that can be applied to solve any one predictive analytics problem. It is impossible to know at the outset which technique will be most effective.

18-year-old beating his professors

Tourism Forecasting Competition. Chart labels: forecast error (MASE); existing model; Aug 9; 2 weeks later; 1 month later; competition end. Improvements are very rapid at first, then the rate of change slows down.
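The forecast error on this slide is MASE (mean absolute scaled error). As a rough illustration of what that number measures, here is a minimal sketch using the standard definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a naive "repeat the previous value" forecast. The function name, the `season` parameter and the sample numbers are made up for illustration and are not taken from the competition.

```python
def mase(actual, forecast, training, season=1):
    """Mean absolute scaled error.

    actual, forecast: out-of-sample values and their predictions.
    training: the in-sample history used to scale the error.
    season: lag of the naive benchmark (1 = "same as last period").
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(training) <= season:
        raise ValueError("training history is too short for this lag")

    # Mean absolute error of the forecast being scored.
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

    # Mean absolute error of the naive forecast on the training history.
    naive_errors = [abs(training[i] - training[i - season])
                    for i in range(season, len(training))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    if naive_mae == 0:
        raise ValueError("naive benchmark error is zero; MASE is undefined")

    return mae / naive_mae


# Made-up numbers, purely for illustration.
history = [100.0, 110.0, 105.0, 115.0]
actual = [120.0, 118.0]
forecast = [117.0, 121.0]
score = mase(actual, forecast, history)
print(round(score, 2))
```

A value below 1 means the forecast beat the naive benchmark on average; lower is better.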

Chess Ratings Competition. Chart labels: existing model (Elo); error rate (RMSE); Aug 4; 1 month later; 2 months later; today. Elo is the algorithm used to power Mark Zuckerberg’s Facemash. For those who have seen The Social Network, it was the algorithm that Eduardo Saverin wrote on Mark’s window.
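For readers who have not met Elo, the following is a minimal sketch of the textbook rating update, plus an RMSE helper of the kind named on the chart. The K-factor of 32 is a common default and an assumption here, as is the idea of scoring predicted expected results against actual game results; the slide does not spell out the competition's exact setup.

```python
import math


def elo_expected(rating_a, rating_b):
    """Expected score of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return both players' new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (assumed 32 here) controls how fast ratings move.
    """
    expected_a = elo_expected(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Elo is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


def rmse(predicted, observed):
    """Root mean squared error between predictions and outcomes."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Two equally rated players; A wins.
new_a, new_b = elo_update(1500.0, 1500.0, 1.0)
print(new_a, new_b)
```

A rating system can then be judged by how far its expected scores are from the actual results, which is what an RMSE curve like the one on this slide tracks.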

Our user base: from many different (maths-related) disciplines

Users apply different techniques: neural networks, logistic regression, support vector machines, decision trees, ensemble methods, AdaBoost, Bayesian networks, genetic algorithms, random forests, Monte Carlo methods, principal component analysis, Kalman filters, evolutionary fuzzy modeling. Users have the option to tell us their favourite techniques.

Benchmarking. We’re talking to a bank in Australia at the moment. They are receiving criticism for the credit scores on a particular product; they want to know whether the

Case study: VicRoads has an algorithm that they use to forecast travel time on Melbourne freeways (taking into account time, weather, accidents, etc.). Their current model is inaccurate and somewhat useless. They want to do better (or at least find out whether it’s possible to do better).

NASA tried, now it’s our turn. NASA’s leading experts have tried for years to find galaxies that have been gravitationally lensed. They haven’t satisfactorily solved the problem. Now it’s our turn.

Ideal for complex problems. Example: a real estate data provider that wants to predict which houses in a particular suburb will go up for sale in any three-month period.

Case study: Melbourne University. Chart label: ~25% successful grant applications. Outcomes of a competition to predict the success of grant applications: better identify likely successes to avoid wasting resources on hopeless applications; identify and communicate the characteristics of a successful application to future applicants.

More fun than Sudoku: why participants compete. (1) Clean, real-world data. (2) Professional reputation and experience. (3) Interactions with experts in related fields. (4) Prizes. Participants compete for four reasons: access to real-world data (which is delivered on a silver platter); the chance to benchmark their techniques and enhance their professional reputations (winners are the rockstars on Kaggle); the opportunity to interact with experts in related fields (who they might otherwise not get to meet); and prizes.

User base: many are academics who want access to real-world data and problems.

User base

1. Upload 2. Submit 3. Evaluate & Exchange

Use the wizard to post a competition

Participants make their entries

Competitions are judged based on predictive accuracy. How do you know who to choose? Compare techniques on a uniform dataset with a uniform evaluation algorithm.
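To make "a uniform dataset with a uniform evaluation algorithm" concrete, here is a minimal sketch of how entries could be ranked: every submission is scored against the same held-back answers with the same metric, and the lowest error leads. The function names, team names and numbers are all hypothetical, and RMSE is just one possible metric; this is not Kaggle's actual scoring code.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error; lower is better."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def rank_submissions(submissions, solution, metric=rmse):
    """Score every entry against the same solution with the same metric.

    submissions: dict mapping a team name to its list of predictions.
    solution: the held-back true values, hidden from participants.
    Returns (team, score) pairs sorted from best to worst.
    """
    scores = {}
    for team, predictions in submissions.items():
        if len(predictions) != len(solution):
            raise ValueError(f"{team}: wrong number of predictions")
        scores[team] = metric(predictions, solution)
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical entries, for illustration only.
solution = [1.0, 0.0, 1.0, 1.0]
submissions = {
    "team_c": [1.0, 1.0, 1.0, 0.0],
    "team_a": [0.9, 0.2, 0.8, 0.7],
    "team_b": [0.5, 0.5, 0.5, 0.5],
}
leaderboard = rank_submissions(submissions, solution)
for team, score in leaderboard:
    print(team, round(score, 3))
```

Because the data and the metric are identical for everyone, the ranking reflects only the quality of the predictions, not who made them or which technique was used.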

Competition mechanics: competitions are judged on objective criteria. The essence of a “predicting the past” competition is deriving insights from data that is already available to facilitate better decisions in the future.

An upcoming competition, powered by Kaggle. De-identified dataset containing medical records of 100,000 Americans. $3 million prize. http://www.heritagehealthprize.com

Diabetes & hypertension & high cholesterol & unfilled prescriptions: probability of going to hospital in the next year
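One simple way to turn risk factors like these into a probability is a logistic model. The sketch below is only an illustration of that idea: the weights and intercept are hypothetical numbers picked by hand, not fitted to any data and not taken from the Heritage Health Prize, and the slide does not say which kind of model would be used.

```python
import math

# Hypothetical weights chosen only for illustration; not fitted to any data.
WEIGHTS = {
    "diabetes": 0.8,
    "hypertension": 0.5,
    "high_cholesterol": 0.3,
    "unfilled_prescriptions": 0.6,
}
INTERCEPT = -2.0  # also hypothetical


def hospital_probability(patient):
    """Map 0/1 risk-factor flags to a probability with the logistic function.

    patient: dict of flag name -> 0 or 1; missing flags count as 0.
    """
    z = INTERCEPT + sum(weight * float(patient.get(name, 0))
                        for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


no_risk = hospital_probability({})
all_risk = hospital_probability({name: 1 for name in WEIGHTS})
print(round(no_risk, 3), round(all_risk, 3))
```

In a real competition the weights, or an entirely different model, would be learned from the de-identified records rather than set by hand.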

Netflix Prize (2006–2009): $1 million prize, 50,000 registrations. Heritage Health Prize (2011): $3 million prize, projected 100,000 registrations.

Predict Grant Applications; Tourism Forecasting (Part 2); IJCNN Social Network Challenge; Chess Ratings – Elo vs. the Rest of the World

Jeff Moser, Jeremy Howard, Nicholas Gruen, Anthony Goldbloom

What could the world’s best analysts find in your data? E-mail: anthony.goldbloom@kaggle.com; phone: +61438400053. Photo by gidzy, www.flickr.com/photos/gidzy