
1 An Evaluation of A Commercial Data Mining Suite Oracle Data Mining Presented by Emily Davis Supervisor: John Ebden

2 Oracle Data Mining  An Investigation
Investigating the data mining tools and software available with Oracle9i. Oracle Data Mining and JDeveloper (Java API) are used to run the suite's algorithms on sample data. Results are evaluated using confusion matrices, lift charts and error rates, and the effectiveness of the different algorithms is compared.
Tools: Oracle Data Mining, DM4J and JDeveloper
Algorithms: Adaptive Bayes Network and Naïve Bayes
Supervisor: John Ebden
Contact: g01d1801@campus.ru.ac.za
Visit: http://www.cs.ru.ac.za/research/students/g01D1801/
Example confusion matrix (Model A):

               Model Accept  Model Reject
Actual Accept           600            25
Actual Reject            75           300

3 Problem Statement
To determine how Oracle provides data mining functionality:
- Ease of use
- Data preparation
- Model building
- Model testing
- Applying models to new data

4 Problem Statement
To determine whether the algorithms used would find a pattern in a data set
- What happened when the models were applied to a new data set
To determine which algorithm built the most effective model, and under what circumstances

5 Problem Statement
To determine how models are tested, and whether this indicates how they will perform when applied to new data
To determine how the data affected the model building, and how the test data affected the model testing

6 Methodology
Two classification algorithms were selected:
- Naïve Bayes
- Adaptive Bayes Network
Both produce predictions, which could then be compared.
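To illustrate the first of the two algorithms, here is a minimal from-scratch Gaussian Naïve Bayes sketch in plain Python. This is not Oracle Data Mining's implementation, and the toy humidity/barometer rows are invented for illustration only:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate per-class priors and per-feature Gaussian parameters."""
    by_class = defaultdict(list)
    for row, y in zip(rows, labels):
        by_class[y].append(row)
    n = len(rows)
    stats = {}
    for y, members in by_class.items():
        prior = len(members) / n
        params = []
        for col in zip(*members):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) or 1e-9
            params.append((mean, var))
        stats[y] = (prior, params)
    return stats

def predict_nb(stats, row):
    """Pick the class with the highest log-posterior."""
    best, best_score = None, float("-inf")
    for y, (prior, params) in stats.items():
        score = math.log(prior)
        for v, (mean, var) in zip(row, params):
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = y, score
    return best

# Invented toy rows: (humidity %, barometer inHg) -> did it rain?
build = [(90, 29.8), (85, 29.7), (40, 30.2), (35, 30.1)]
rain  = ["yes", "yes", "no", "no"]
model = fit_gaussian_nb(build, rain)
print(predict_nb(model, (88, 29.75)))  # high humidity, low pressure -> yes
```

The Adaptive Bayes Network is a more elaborate Oracle-specific algorithm and is not sketched here.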

7 Methodology
Weather data from http://www.ru.ac.za/weather/
Data recorded includes:
- Temperature (degrees F)
- Humidity (percent)
- Barometer (inches of mercury)
- Wind Direction (degrees; 360 = North, 90 = East)
- Wind Speed (MPH)
- High Wind Speed (MPH)
- Solar Radiation (Watts/m^2)
- Rainfall (inches)
- Wind Chill (computed from high wind speed and temperature)

8 Data
The rainfall reading was removed and replaced with a yes or no depending on whether rainfall was recorded. This variable, RAIN, was chosen as the target variable.
Two data sets were put into tables in the database:
- WEATHER_BUILD
- WEATHER_APPLY
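The rainfall-to-RAIN transformation can be sketched as follows; this is a plain-Python illustration with invented readings, not the actual wizard step used in the project:

```python
# Hypothetical rainfall readings in inches; the transformation replaces
# the numeric column with a binary yes/no target called RAIN.
records = [{"rainfall": 0.0}, {"rainfall": 0.12}, {"rainfall": 0.0}]

for rec in records:
    # pop() removes the raw reading and returns it for the comparison
    rec["RAIN"] = "yes" if rec.pop("rainfall") > 0 else "no"

print([rec["RAIN"] for rec in records])  # ['no', 'yes', 'no']
```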

9 WEATHER_BUILD
- 2601 records
- Used to create the build and test data with the Transformation Split wizard
WEATHER_APPLY
- 290 records
- Used to validate the models
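The Transformation Split wizard's behaviour can be approximated by a stratified split like the sketch below. The function name, the 70/30 split fraction, and the class counts are assumptions for illustration, not details taken from the wizard:

```python
import random
from collections import defaultdict

def stratified_split(rows, target_key, test_frac=0.3, seed=42):
    """Split rows so that the target distribution is preserved
    in both the build and test halves."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target_key]].append(row)
    build, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(len(members) * test_frac)
        test.extend(members[:cut])
        build.extend(members[cut:])
    return build, test

# Invented example: 100 rainy and 400 dry records
rows = [{"RAIN": "yes"} for _ in range(100)] + [{"RAIN": "no"} for _ in range(400)]
build, test = stratified_split(rows, "RAIN")
print(len(build), len(test))  # 350 150
```

Because each class is split separately, the test half keeps the same yes/no ratio as the build half.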

10 Building and Testing the Models The Priors technique Training and tuning the models The models built Testing Results

11 Data Preparation Techniques - Priors

12 Priors Stratified Sampling

13 Priors Stratified Sampling

14 Training and Tuning the Models
Confusion matrix (rows = actual, columns = predicted):

            Predicted No  Predicted Yes
Actual No            384             34
Actual Yes            14            174
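A confusion matrix like this one summarises accuracy directly. The short sketch below computes the overall accuracy and the false-negative count for the matrix shown on the slide:

```python
# Confusion matrix from the slide, keyed by (actual, predicted).
matrix = {("no", "no"): 384, ("no", "yes"): 34,
          ("yes", "no"): 14, ("yes", "yes"): 174}

total = sum(matrix.values())                          # 606 test records
correct = matrix[("no", "no")] + matrix[("yes", "yes")]
false_negatives = matrix[("yes", "no")]               # rain predicted as no rain
accuracy = correct / total

print(f"accuracy = {accuracy:.1%}, false negatives = {false_negatives}")
# accuracy = 92.1%, false negatives = 14
```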

15 Training and Tuning the Models
- Viable to introduce a weighting of 3 against false negatives
- Makes a false negative prediction 3 times as costly as a false positive
- The algorithm attempts to minimise costs
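The weighting idea can be illustrated as a cost-sensitive decision rule. This is a sketch of the general technique, not Oracle's cost-matrix API; the function name and default costs are assumptions:

```python
def costed_prediction(p_yes, fn_cost=3.0, fp_cost=1.0):
    """Predict 'yes' when the expected cost of wrongly predicting 'no'
    (a false negative) exceeds that of wrongly predicting 'yes'."""
    expected_cost_no  = p_yes * fn_cost        # predict no, but it rains
    expected_cost_yes = (1 - p_yes) * fp_cost  # predict yes, but it stays dry
    return "yes" if expected_cost_yes < expected_cost_no else "no"

# With a 3:1 weighting the decision threshold drops from 0.5 to 0.25,
# so even a 30% rain probability is enough to predict rain.
print(costed_prediction(0.30))  # yes
print(costed_prediction(0.20))  # no
```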

16 The Models
8 models in total, 4 using each algorithm:
- One using default settings
- One using the Priors technique
- One using weighting
- One using Priors and weighting

17 Testing the Models
- Tested on the test data set created from the WEATHER_BUILD data set
- Confusion matrices indicate the accuracy of the models

18 Testing Results

19 Applying the Models to New Data
The models were applied to the new data in WEATHER_APPLY. Extracts showing 2 predictions in the actual results:

Prediction  Probability  THE_TIME
no          0.9999       1
yes         0.6711       138

Prediction  Cost of incorrect prediction  THE_TIME
no          0                             1
yes         0.3288                        138

20 Attribute Influence on Predictions
The Adaptive Bayes Network provides rules along with its predictions, in if…then format. The rules showed that the attributes with the most influence were:
- Wind Chill
- Wind Direction
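The if…then rule format can be mimicked in a small sketch. The thresholds and confidence below are hypothetical, invented for illustration, and are not taken from the actual model's rules:

```python
# Hypothetical if...then rule of the kind the Adaptive Bayes Network
# emits: if wind chill is low and the wind is southerly-to-westerly,
# predict rain with some confidence. The numbers are made up.
rules = [
    {"if": lambda r: r["wind_chill"] < 40 and 180 <= r["wind_dir"] <= 270,
     "then": ("yes", 0.78)},
]

def apply_rules(record, default=("no", 0.5)):
    """Return the (prediction, confidence) of the first matching rule."""
    for rule in rules:
        if rule["if"](record):
            return rule["then"]
    return default

print(apply_rules({"wind_chill": 35, "wind_dir": 200}))  # ('yes', 0.78)
```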

21 Results of Applying Models to New Data

22 Comparing Accuracy

23 Observations
- The algorithms found a pattern in the weather data
- Most effective model: the Adaptive Bayes Network algorithm using weighting
- Accuracy of the Naïve Bayes models improves dramatically if weighting and Priors are used
- Significant difference between accuracy during testing of the models and accuracy when applied to new data

24 Conclusions
Oracle Data Mining provides easy-to-use wizards that support all aspects of the data mining process.
The algorithms found a pattern in the weather data
- Best case: the Adaptive Bayes Network model predicted 73.1% of RAIN outcomes correctly

25 Conclusions
The Adaptive Bayes Network algorithm produced the most effective model: 73.1% accuracy when applied to new data
- Tuned using a weighting of 3 against false negatives
The most effective Naïve Bayes model: accuracy of 63.79%
- Tuned using a weighting of 3 against false negatives and the Priors technique

26 Conclusions
- Accuracy during testing does not always indicate the performance of a model on new data
- Test accuracy is inflated if the target-attribute distribution in the build and test data sets is similar
- This shows the need to test a model on a variety of data sets

27 Questions

