Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Kun Zhang, Wei Fan, Xiaojing Yuan, Ian Davidson, and Xiangshang Li

What this Paper Offers

- Application: a more accurate (higher recall and precision) solution for predicting ozone days.
- An interesting and difficult data mining problem:
  - High dimensionality, with possibly irrelevant features: 72 continuous features, only 10 verified by scientists to be relevant.
  - Skewed class distribution: either 2% or 5% ozone days, depending on the ozone-day criterion (1-hr peak or 8-hr peak).
  - Streaming: data collected in the past is used to train a model that predicts the future.
  - Feature sample selection bias: it is hard to find many days in the training data that are very similar to a day in the future.
  - Stochastic true model: given the measurable information, sometimes the target event happens and sometimes it doesn't.

Key Solution Highlights

- Non-parametric models are easier to use when the physical or generative mechanism is unknown.
- Reliable conditional probability estimation under skewed distributions, high dimensionality, and possibly irrelevant features.
- Estimating a decision threshold to predict under the unknown distribution of the future.

Seriousness of the Ozone Problem

- Ground-level ozone formation is a sophisticated chemical and physical process that is stochastic in nature.
- Ozone levels above some threshold are harmful to human health and daily life.

Drawbacks of Current Ozone Forecasting Systems

- Traditional simulation systems:
  - Consume high computational power.
  - Are customized for a particular location, so solutions are not portable to different places.
- Regression-based methods (e.g., regression trees, parametric regression equations, and ANNs):
  - Limited prediction performance.

Ozone Level Prediction: Problems We Are Facing

[Figure: daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ).]

Challenges as a Data Mining Problem

1. Rather skewed and relatively sparse distribution:
- Examples collected over 7 years.
- 72 continuous features with missing values.
- Huge instance space: if the features were binary and uncorrelated, 2^72 is an astronomical number.
- Only 2% and 5% true positive ozone days for the 1-hour and 8-hour peak criteria, respectively.

2. The true model for ozone days is stochastic in nature:
- Given all relevant features X_R, P(Y = ozone day | X_R) < 1.
- Predictive mistakes are therefore inevitable.

3. A large number of irrelevant features:
- Only about 10 of the 72 features are verified to be relevant; there is no information on the relevancy of the other 62.
- For a stochastic problem with irrelevant features X_ir, where X = (X_r, X_ir), P(Y|X) = P(Y|X_r) only if the data is exhaustive.
- Irrelevant features may introduce overfitting and change the probability distribution represented in the data:
  P(Y = ozone day | X_r, X_ir) → 1
  P(Y = normal day | X_r, X_ir) → 0

4. Feature sample selection bias:
- Given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future.

[Figure: training distribution vs. testing distribution.]

Given these, two closely related challenges:
1. How to train an accurate model.
2. How to effectively use the model to predict the future under a different and yet unknown distribution.

Addressing Challenges: Skewed and Stochastic Distribution

- Probability distribution estimation:
  - Parametric methods (e.g., logistic regression, naïve Bayes, kernel methods, linear regression, RBF, Gaussian mixture models): highly accurate if the data is indeed generated from the model you use. But what if you don't know which model to choose, or you choose the wrong one?
  - Non-parametric methods (e.g., decision trees, the RIPPER rule learner, CBA association rules, clustering-based methods): use a family of free-form functions to match the data given some preference criteria. Highly accurate if the free-form function family and the preference criteria are appropriate.
- Decision threshold determination through optimization of some given criterion: a compromise between precision and recall.

[Figure: precision-recall curves of two models Ma and Mb, with decision threshold V_E.]
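To make the parametric/non-parametric trade-off concrete, here is a minimal sketch (an illustration, not the paper's experiment; scikit-learn's LogisticRegression and RandomForestClassifier are assumed stand-ins for the two families). The synthetic concept is deliberately non-linear, so the mis-specified parametric model loses:

```python
# A small illustration (assumed scikit-learn models): the parametric model
# is mis-specified for this non-linear stochastic concept, while the
# free-form (non-parametric) learner matches it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(4000, 2)
# XOR-like stochastic concept: P(positive) = 0.8 in two quadrants, else 0.2.
p = np.where(X[:, 0] * X[:, 1] > 0, 0.8, 0.2)
y = (rng.rand(len(p)) < p).astype(int)
Xtr, Xte, ytr, yte = X[:3000], X[3000:], y[:3000], y[3000:]

for name, model in [("parametric (logistic)", LogisticRegression()),
                    ("non-parametric (forest)", RandomForestClassifier(random_state=0))]:
    print(name, round(model.fit(Xtr, ytr).score(Xte, yte), 2))
```

The logistic model can only draw a linear boundary and stays near chance on this concept, while the forest approaches the best achievable accuracy of 0.8.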

Reliable Probability Estimation under Irrelevant Features

- Recall that due to irrelevant features:
  P(Y = ozone day | X_r, X_ir) → 1
  P(Y = normal day | X_r, X_ir) → 0
- Solution: construct multiple models and average their predictions.
  - P(ozone | x_r): the true probability.
  - P(ozone | X_r, X_ir, θ): the probability estimated by model θ.
  - MSE_SingleModel: difference between the true probability and a single model's estimate.
  - MSE_Average: difference between the true probability and the average over many models.
- The paper formally shows that MSE_Average ≤ MSE_SingleModel.
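A minimal sketch of the averaging argument (an illustration assuming scikit-learn's DecisionTreeClassifier; the paper's estimators are probabilistic trees such as C4.4 and RDT). On a synthetic stochastic concept with a known P(y|x), the averaged estimates of many randomized trees land closer to the true probability than a single fully grown tree, whose estimates are pushed toward 0/1 by the irrelevant features:

```python
# A minimal sketch (not the paper's code) of MSE_Average <= MSE_SingleModel.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

def true_prob(X):
    # Only feature 0 is relevant; the other 9 are irrelevant noise,
    # mimicking the ozone setting.
    return 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))

X_train = rng.randn(2000, 10)
y_train = (rng.rand(2000) < true_prob(X_train)).astype(int)
X_test = rng.randn(1000, 10)
p_true = true_prob(X_test)

# Single fully grown tree: leaf probabilities are essentially 0 or 1.
single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
mse_single = np.mean((single.predict_proba(X_test)[:, 1] - p_true) ** 2)

# Average of many randomized trees: smoother, closer to the truth.
preds = [DecisionTreeClassifier(splitter="random", max_features=1,
                                random_state=s).fit(X_train, y_train)
         .predict_proba(X_test)[:, 1] for s in range(30)]
mse_avg = np.mean((np.mean(preds, axis=0) - p_true) ** 2)

print(f"MSE single: {mse_single:.4f}  MSE averaged: {mse_avg:.4f}")
```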

Prediction with Feature Sample Selection Bias: A CV-Based Procedure for Decision Threshold Selection

1. Run 10-fold cross-validation on the training set: for each fold, train on the other nine folds and estimate P(y = ozone day | x, θ) on the held-out fold.
2. Concatenate the estimated probabilities from all ten folds with their true labels into a probability-vs-true-label file (one row per day: date, P(y = ozone day | x, θ), label Normal/Ozone).
3. Sort by estimated probability and compute the precision-recall plot.
4. Choose the decision threshold V_E from the plot (e.g., the threshold attaining a required recall).
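A sketch of this CV-based threshold selection (it assumes numpy arrays and scikit-learn's StratifiedKFold and BaggingClassifier; the helper name select_threshold and the recall target are illustrative):

```python
# A sketch of the CV-based decision threshold selection described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def select_threshold(X, y, target_recall=0.6, n_splits=10, seed=0):
    """Collect out-of-fold estimates of P(ozone | x), then pick the
    threshold V_E with the best precision at recall >= target_recall."""
    probs = np.zeros(len(y), dtype=float)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in skf.split(X, y):
        # Stand-in for a probabilistic tree ensemble such as BC4.4.
        model = BaggingClassifier(DecisionTreeClassifier(),
                                  n_estimators=30, random_state=seed)
        probs[te] = model.fit(X[tr], y[tr]).predict_proba(X[te])[:, 1]

    best_t, best_prec = 0.5, 0.0
    for t in np.unique(probs):             # walk the precision-recall trade-off
        pred = probs >= t
        tp = np.sum(pred & (y == 1))
        recall = tp / max(np.sum(y == 1), 1)
        precision = tp / max(np.sum(pred), 1)
        if recall >= target_recall and precision > best_prec:
            best_t, best_prec = t, precision
    return best_t
```

Once V_E is chosen this way, θ is retrained on the whole training set and applied to future days as on the next slide.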

Addressing Data Mining Challenges: Prediction with Feature Sample Selection Bias

Future prediction based on the selected decision threshold: train model θ on the whole training set; for each future day, predict an ozone day if P(Y = ozone day | X, θ) ≥ V_E.

Probabilistic Tree Models

- Single-tree estimators:
  - C4.5 (Quinlan '93), C4.5Up, C4.5P
  - C4.4 (Provost '03)
- Ensembles:
  - RDT, Random Decision Tree (Fan et al. '03): member trees are trained randomly; probabilities are averaged.
  - Bagging Probabilistic Trees (Breiman '96): bootstrap, then compute probabilities; member trees are C4.5 or C4.4.

RDT: encoding data in trees (see the sketch below).
- At each node, an unused feature is chosen randomly:
  - A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
  - A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
- Stop when one of the following happens:
  - a node becomes too small (≤ 3 examples), or
  - the total height of the tree exceeds some limit.
- Different from Random Forest:
  1. Original data vs. bootstrap samples.
  2. Random feature pick vs. a random subset plus information gain.
  3. Probability averaging vs. voting.
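A compact sketch of the RDT construction described above (an illustration, not the authors' implementation; it assumes all features are continuous, so a feature may be reused on a path with a fresh random threshold):

```python
# A compact sketch of Random Decision Tree construction.
import numpy as np

def build_rdt(X, y, depth=0, max_depth=10, rng=None):
    rng = rng or np.random.RandomState()
    # Stop when the node is too small (<= 3 examples) or the height
    # limit is reached; the leaf keeps class counts for probabilities.
    if len(y) <= 3 or depth >= max_depth:
        return ("leaf", np.bincount(y, minlength=2))
    f = rng.randint(X.shape[1])                    # random feature, no info gain
    t = rng.uniform(X[:, f].min(), X[:, f].max())  # random threshold
    left = X[:, f] <= t
    if left.all() or (~left).all():                # degenerate split -> leaf
        return ("leaf", np.bincount(y, minlength=2))
    return ("node", f, t,
            build_rdt(X[left], y[left], depth + 1, max_depth, rng),
            build_rdt(X[~left], y[~left], depth + 1, max_depth, rng))

def tree_prob(node, x):
    if node[0] == "leaf":
        counts = node[1]
        return counts[1] / counts.sum()
    _, f, t, lo, hi = node
    return tree_prob(lo if x[f] <= t else hi, x)

def rdt_prob(trees, x):
    # Probability averaging across member trees (not voting).
    return np.mean([tree_prob(t, x) for t in trees])
```

An ensemble is then just trees = [build_rdt(X, y, rng=np.random.RandomState(s)) for s in range(30)], with rdt_prob(trees, x) giving the averaged P(ozone | x).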

Optimal Decision Boundary

[Figure: optimal decision boundary, from Tony Liu's thesis (supervised by Kai Ming Ting).]

Baseline Forecasting Parametric Model

The baseline is a parametric regression equation (given on the slide) in which:
- O3: local ozone peak prediction
- Upwind: upwind ozone background level
- EmFactor: precursor-emissions-related factor
- Tmax: maximum temperature in degrees F
- Tb: base temperature where net ozone production begins (50 °F)
- SRd: solar radiation total for the day
- WSa: wind speed near sunrise (using UTC forecast mode)
- WSp: wind speed mid-day (using UTC forecast mode)

Model Evaluation Criteria

- Precision and recall: at the same recall level, Ma is preferred over Mb if the precision of Ma is consistently higher than that of Mb.
- Coverage under the PR curve, analogous to AUC.
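A sketch of the coverage criterion (it assumes scikit-learn's precision_recall_curve and auc; restricting coverage to a recall interval such as [0.4, 0.6] follows the next slide):

```python
# A sketch of "coverage under the PR curve" over a recall interval.
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_coverage(y_true, y_score, recall_range=(0.4, 0.6)):
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    mask = (recall >= recall_range[0]) & (recall <= recall_range[1])
    if mask.sum() < 2:                 # not enough points in the interval
        return 0.0
    r, p = recall[mask], precision[mask]
    order = np.argsort(r)              # auc needs a monotonic x-axis
    return auc(r[order], p[order])
```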

Some Coverage Results

[Figure: coverage under the PR curve for the 8-hour data, recall in [0.4, 0.6].]

Some Action Results

Annual test setup:
1. Previous years' data used for training.
2. The next year used for testing.
3. Repeated 6 times using the 7 years of data.
Decision thresholds were selected at a fixed recall level for both the 8-hour and 1-hour criteria.

Findings:
1. BC4.4 and RDT are more accurate than the baseline parametric model (Para).
2. BC4.4 and RDT produce fewer surprises than a single tree.
3. C4.4 is the best among single trees.
4. BC4.4 and RDT are the best among tree ensembles.

[Figure: precision-recall results for BC4.4, RDT, C4.4, and Para.]

Summary

- Procedures to formulate the task as a data mining problem, analyze the combination of technical challenges, and search for the most suitable solutions.
- Model averaging of probability estimators can effectively approximate the true probability despite many irrelevant features and feature sample selection bias.
- A CV-based guide for decision threshold determination for stochastic problems under sample selection bias.

Choosing the Appropriate PET

- Given a dataset, estimate its signal-noise separability via the AUC score of RDT or BPET.
- AUC ≥ 0.9 (high signal-noise separability): use a single tree; CFT when the target metric is AUC, C4.5 or C4.4 when it is MSE or error rate.
- AUC < 0.9 (low signal-noise separability): use an ensemble; BPET for categorical features with limited values, RDT (or BPET) for continuous features or categorical features with a large number of values.

For details, come to our other talk: 10:30, Room 402.
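The selection procedure can be read as a simple decision function; this sketch encodes one reading of the flowchart (the 0.9 cut-off and the model names come from the slide, while the exact pairing of high separability with single trees is an assumption):

```python
# A sketch encoding the PET-selection flowchart as a decision function.
def choose_pet(separability_auc: float,
               categorical_with_few_values: bool,
               metric: str) -> str:
    """Pick a probability estimation tree given the estimated signal-noise
    separability (AUC from RDT or BPET), the feature characteristics, and
    the evaluation metric ("AUC", "MSE", or "ErrorRate")."""
    if separability_auc >= 0.9:          # high separability -> single tree
        return "CFT" if metric == "AUC" else "C4.5 or C4.4"
    if categorical_with_few_values:      # low separability -> ensemble
        return "BPET"
    return "RDT (or BPET)"               # continuous / high-cardinality features
```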

Thank you! Questions?