
1 Forecasting Skewed Biased Stochastic Ozone Days: Analyses and Solutions
Presenter: Prof. Longbin Cao
Wei Fan, Kun Zhang, and Xiaojing Yuan

2 Data Mining Challenges
Application: a more accurate solution for predicting “ozone days” than physical models.
An interesting and difficult data mining problem:
- High dimensionality, with possibly irrelevant features: 72 continuous features, only 10 verified by scientists to be relevant.
- Skewed class distribution: either 2% or 5% “ozone days”, depending on the “ozone day” criterion (1-hr average peak or 8-hr average peak).
- Streaming: data collected in the “past” train a model to predict the “future”.
- “Feature sample selection bias”: it is hard to find many days in the training data that are very similar to a day in the future.
- Stochastic true model: given the measurable information, sometimes the target event happens and sometimes it doesn’t.

3 Key Solution Highlights
- The physical model is not known; we do not know well which factors really contribute.
- Non-parametric models are easier to use when the “physical or generative mechanism” is unknown.
- Reliable conditional probability estimation under skewed, high-dimensional data with possibly irrelevant features.
- Estimate a decision threshold to predict under the unknown distribution of the future.

4 Seriousness of the Ozone Problem
Ground-level ozone formation is a sophisticated chemical and physical process, “stochastic” in nature. Ozone levels above some threshold are rather harmful to human health and daily life.

5 Drawbacks of Current Ozone Forecasting Systems
Traditional simulation systems:
- Consume high computational power.
- Customized for a particular location, so solutions are not portable to different places.
Regression-based methods:
- E.g., regression trees, parametric regression equations, and ANNs.
- Limited prediction performance.
Physical models: hard to come up with good equations when there are many parameters, and the equations change from place to place.

6 Challenges as a Data Mining Problem
1. Rather skewed and relatively sparse distribution:
- 3500+ examples collected over a 10-year period.
- 72 continuous features with missing values.
- Huge instance space: if features were binary and uncorrelated, 2^72 is an astronomical number.
- 2% and 5% true positive ozone days for the 1-hour and 8-hour peak criteria, respectively.
- Many factors contribute to ozone pollution; some we know and some we do not know well.

7 2. It is suspected that the true model for ozone days is stochastic in nature. Given all relevant features X_R, P(Y = “ozone day” | X_R) < 1, so predictive mistakes are inevitable.

8 3. A large number of unverified physical features:
- Only about 10 of the 72 features are verified to be relevant; there is no information on the relevancy of the other 62.
- For a stochastic problem, given irrelevant features X_ir, where X = (X_r, X_ir), P(Y|X) = P(Y|X_r) only if the data is exhaustive.
- Irrelevant features may introduce overfitting and change the probability distribution represented in the data:
  P(Y = “ozone day” | X_r, X_ir) → 1
  P(Y = “normal day” | X_r, X_ir) → 0

9 4. “Feature sample selection bias”: given 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future.
Given these, two closely related challenges:
1. How to train an accurate model.
2. How to effectively use the model to predict the future, which has a different and yet unknown distribution.
[Figure: training distribution vs. testing distribution]

10 Addressing the Challenges
Skewed and stochastic distribution:
- Probability distribution estimation:
  - Parametric methods (e.g., logistic regression, naïve Bayes, kernel methods, linear regression, RBF, Gaussian mixture models): highly accurate if the data is indeed generated from the model you use. But what if you don’t know which model to choose, or you use the wrong one?
  - Non-parametric methods (e.g., decision trees, the RIPPER rule learner, CBA association rules, clustering-based methods): use a family of “free-form” functions to “match the data” given some “preference criteria”. Appropriate as long as the free-form functions and the preference criteria are appropriate.
- Decision threshold determination through optimization of some given criteria: a compromise between precision and recall.

11 Reliable Probability Estimation under Irrelevant Features
Recall that due to irrelevant features:
P(Y = “ozone day” | X_r, X_ir) → 1
P(Y = “normal day” | X_r, X_ir) → 0
Remedy: construct multiple models and average their predictions.
- P(“ozone” | X_r): the true probability.
- P(“ozone” | X_r, X_ir, θ): the probability estimated by model θ.
- MSE_SingleModel: difference between “true” and “estimated”.
- MSE_Average: difference between “true” and the “average of many models”.
We formally show that MSE_Average ≤ MSE_SingleModel.
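A toy numeric sketch (not the paper's experiment) of why averaging helps: each model's probability estimate is perturbed away from the true value, and by convexity of squared error the averaged estimate can never have a larger squared error than the average single-model squared error. The true probability and noise level below are made-up illustration values.

```python
import random

random.seed(0)
p_true = 0.3  # hypothetical true P("ozone day" | x)

# Each "model" returns a noisy, clipped estimate of the true probability,
# standing in for trees overfit on irrelevant features.
estimates = [min(1.0, max(0.0, p_true + random.gauss(0, 0.2)))
             for _ in range(50)]

# Expected squared error of a single model picked at random:
mse_single = sum((p_true - e) ** 2 for e in estimates) / len(estimates)

# Squared error of the averaged prediction:
avg = sum(estimates) / len(estimates)
mse_average = (p_true - avg) ** 2
# By Jensen's inequality, mse_average <= mse_single always holds.
```

The inequality holds for any set of estimates, which mirrors the slide's claim MSE_Average ≤ MSE_SingleModel.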

12 Prediction with Feature Sample Selection Bias: a CV-Based Procedure for Decision Threshold Selection
1. Run 10-fold cross-validation of the algorithm on the training set, producing estimated probability values for each fold (fold 1, fold 2, ..., fold 10).
2. Concatenate the per-fold estimates into one “probability–true label” file, e.g.:
   Date     P(y = “ozone day” | x, θ)   Label
   7/1/98   0.1316                      Normal
   7/2/98   0.6245                      Ozone
   7/3/98   0.5944                      Ozone
   ...
3. Sort the file by estimated probability and draw the precision-recall plot.
4. Choose the decision threshold V_E from the plot.
[Figure: training distribution vs. testing distribution; precision-recall plot]
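The steps above can be sketched as follows. This is a minimal illustration, assuming made-up (probability, label) pairs pooled from the CV folds; it sorts them and returns the threshold V_E at which a target recall is first reached.

```python
# Pooled (estimated probability, true label) pairs from the CV folds;
# label 1 = "ozone day", 0 = "normal day" (toy data, not the real file).
scored = [(0.92, 1), (0.81, 0), (0.75, 1), (0.62, 1), (0.59, 1),
          (0.40, 0), (0.31, 0), (0.13, 0), (0.08, 1), (0.05, 0)]

def threshold_at_recall(scored, target_recall):
    """Walk the list from highest probability down; the first score at
    which recall reaches the target becomes the decision threshold V_E."""
    ranked = sorted(scored, reverse=True)
    total_pos = sum(label for _, label in ranked)
    tp = 0
    for prob, label in ranked:
        tp += label
        if tp / total_pos >= target_recall:
            return prob  # predict "ozone day" when P(y|x, theta) >= V_E
    return 0.0

v_e = threshold_at_recall(scored, 0.6)  # 0.62 for this toy data
```

On the future day, the rule from the next slide applies: predict “ozone day” whenever the model's estimated probability is at least v_e.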

13 Addressing Data Mining Challenges: Prediction with Feature Sample Selection Bias
Future prediction based on the selected decision threshold: train θ on the whole training set; then, for a future day, if P(Y = “ozone day” | X, θ) ≥ V_E, predict “ozone day”.

14 Probabilistic Tree Models
Single-tree estimators:
- C4.5 (Quinlan ’93), C4.5Up, C4.5P
- C4.4 (Provost ’03)
Ensembles:
- RDT (Fan et al. ’03): member trees trained randomly; probabilities averaged.
- Bagging probabilistic trees (Breiman ’96): bootstrap samples; compute probabilities; member trees are C4.5 or C4.4.

15 Illustration of RDT
Example tree construction: B1 ∈ {0,1} chosen randomly at the root; B2 ∈ {0,1} chosen randomly below it; B3 (continuous) chosen randomly with a random threshold (e.g., 0.3 or 0.6).
RDT vs. Random Forest:
1. Original data vs. bootstrap samples.
2. Random feature pick vs. random subset + information gain.
3. Probability averaging vs. voting.
4. RDT: superfast.
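A minimal sketch of the RDT idea under simplifying assumptions (continuous features only, fixed depth, toy data): features and split thresholds are chosen at random with no information-gain computation, leaves keep class statistics, and the ensemble averages leaf probabilities rather than voting.

```python
import random

random.seed(1)

def build_rdt(rows, labels, features, depth=0, max_depth=3):
    """One random tree: random feature, random threshold, no purity test."""
    if depth == max_depth or len(set(labels)) <= 1 or not features:
        return {"leaf": True, "p": sum(labels) / len(labels)}
    f = random.choice(features)           # random feature pick (no info gain)
    t = random.uniform(0.0, 1.0)          # random threshold for a continuous feature
    left = [i for i, r in enumerate(rows) if r[f] < t]
    right = [i for i, r in enumerate(rows) if r[f] >= t]
    if not left or not right:
        return {"leaf": True, "p": sum(labels) / len(labels)}
    rest = [g for g in features if g != f]
    return {"leaf": False, "f": f, "t": t,
            "left": build_rdt([rows[i] for i in left],
                              [labels[i] for i in left], rest, depth + 1, max_depth),
            "right": build_rdt([rows[i] for i in right],
                               [labels[i] for i in right], rest, depth + 1, max_depth)}

def predict(tree, row):
    while not tree["leaf"]:
        tree = tree["left"] if row[tree["f"]] < tree["t"] else tree["right"]
    return tree["p"]

# Toy data: 3 continuous features in [0, 1]; positive when feature 0 is large.
rows = [[random.random() for _ in range(3)] for _ in range(40)]
labels = [1 if r[0] > 0.7 else 0 for r in rows]

trees = [build_rdt(rows, labels, [0, 1, 2]) for _ in range(10)]
p = sum(predict(t, [0.9, 0.5, 0.5]) for t in trees) / len(trees)  # averaged probability
```

Note how this differs from Random Forest exactly along the slide's four points: the original data (no bootstrap), a single random feature (no info-gain subset search), and probability averaging (no voting); skipping the gain computation is what makes training so fast.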

16 Optimal Decision Boundary (from Tony Liu’s thesis, supervised by Kai Ming Ting)

17

18 [Figure panels: target distribution; single decision tree (5 sec to train); SVM, linear kernel (overnight); SVM, RBF kernel (1 day); RDT (5 sec)]

19 Hidden Variable

20 Limitations of GUIDE
- Need to decide the grouping variables and the independent variables: a non-trivial task.
- If all variables are categorical, GUIDE becomes a single CART regression tree.
- Strong assumptions and greedy search can sometimes lead to very unexpected results.

21 Baseline Forecasting Parametric Model, in which:
- O3: local ozone peak prediction
- Upwind: upwind ozone background level
- EmFactor: precursor-emissions-related factor
- Tmax: maximum temperature in degrees F
- Tb: base temperature where net ozone production begins (50 F)
- SRd: solar radiation total for the day
- WSa: wind speed near sunrise (using 09-12 UTC forecast mode)
- WSp: wind speed mid-day (using 15-21 UTC forecast mode)

22 Model Evaluation Criteria
Precision and recall:
- At the same recall level, M_a is preferred over M_b if the precision of M_a is consistently higher than that of M_b.
- Coverage under the PR curve, analogous to AUC.
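Coverage under the PR curve can be computed by stepping the decision threshold through the sorted scores and accumulating trapezoids between consecutive recall levels. A toy sketch (made-up scores and labels; not the paper's evaluation code):

```python
# Toy (estimated probability, true label) pairs; label 1 = "ozone day".
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 1), (0.3, 0), (0.2, 0)]

def pr_points(scored):
    """(recall, precision) at every threshold, highest score first."""
    ranked = sorted(scored, reverse=True)
    total_pos = sum(label for _, label in ranked)
    tp, points = 0, []
    for k, (_, label) in enumerate(ranked, start=1):
        tp += label
        points.append((tp / total_pos, tp / k))
    return points

def coverage(points):
    """Trapezoidal area under the PR curve (precision at recall 0 taken as 1)."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

area = coverage(pr_points(scored))
```

Restricting the sum to recall levels inside a window such as [0.4, 0.6] gives the partial coverage used on the next slide.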

23 Some Coverage Results
Coverage under the PR curve, 8-hour criterion, recall ∈ [0.4, 0.6].

24 System Results: Annual Test
Setup:
1. Previous years’ data for training.
2. Next year for testing.
3. Repeated 6 times using 7 years of data.
Thresholds selected at recall = 0.6 for both the 8-hour and 1-hour criteria.
Findings:
1. BC4.4 and RDT are more accurate than the baseline Para model.
2. BC4.4 and RDT give “less surprise” than a single tree.
3. C4.4 is the best among single trees.
4. BC4.4 and RDT are the best among tree ensembles.

25 SVM: 1-hr criteria CV

26 AdaBoost: 1-hr criteria CV

27 Intuition
The true distribution P(y|X) is never known. Is it an elephant?
Every random tree is not a random guess of this P(y|X): its structure is random, but not its “node statistics”. Each tree looks at the elephant from a different angle. Every tree is consistent with the training data, and each tree is quite strong.

28 Expected Error Reduction
Quadratic loss:
- For probability estimation: (P(y|X) − P(y|X, θ))²
- For regression problems: (y − f(x))²
Theorem 1: the “expected quadratic loss” of RDT is less than that of any combined model chosen at random.
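The standard bias-variance argument behind this kind of result can be sketched as follows (notation simplified from the slide; not the paper's full proof). Decomposing the expected quadratic loss of a model θ drawn at random:

```latex
\mathbb{E}_{\theta}\!\left[\big(P(y|X) - P(y|X,\theta)\big)^2\right]
  = \big(P(y|X) - \mathbb{E}_{\theta}[P(y|X,\theta)]\big)^2
    + \mathrm{Var}_{\theta}\!\big(P(y|X,\theta)\big)
  \;\ge\; \big(P(y|X) - \mathbb{E}_{\theta}[P(y|X,\theta)]\big)^2 .
```

The left-hand side is the expected loss of a single randomly chosen model; the right-hand side is the loss of the averaged prediction. The variance term, which averaging removes, is exactly the gain, which is the theme of the next slide's bias and variance reduction.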

29 Bias and Variance Reduction

30 Summary
- When a physical model is hard to build, data mining is one of the top choices.
- Procedures to formulate the task as a data mining problem: how to collect data; analysis of the combination of technical challenges: skewed problem, sample selection bias, many features, stochastic problem.
- A process to search for the most suitable solutions:
  - Model averaging of probability estimators can effectively approximate the true probability under a lot of irrelevant features and feature sample selection bias.
  - A CV-based guide for decision threshold determination for stochastic problems under sample selection bias.
  - Random Decision Tree (Fan et al. ’03).
- ICDM ’06 Best Application Award; ICDM ’08 Data Mining Contest Championship.

31 Thank you! Questions?

