An Evaluation of a Commercial Data Mining Suite: Oracle Data Mining
Presented by Emily Davis
Supervisor: John Ebden

Oracle Data Mining: An Investigation
Emily Davis, Supervisor: John Ebden
- Investigating the data mining tools and software available with Oracle9i.
- Using Oracle Data Mining and JDeveloper (Java API) to run the algorithms in the data mining suite on sample data.
- Evaluating the results using confusion matrices, lift charts and error rates.
- Comparing the effectiveness of the different algorithms: Adaptive Bayes Network and Naïve Bayes.
[Poster graphic: a confusion matrix for Model A (Model Accept / Model Reject against Actual Accept / Actual Reject), produced with Oracle Data Mining, DM4J and JDeveloper.]

Problem Statement
To determine how Oracle provides data mining functionality:
- Ease of use
- Data preparation
- Model building
- Model testing
- Applying models to new data

Problem Statement
To determine whether the algorithms used would find a pattern in a data set:
- What happened when the models were applied to a new data set?
To determine which algorithm built the most effective model, and under what circumstances.

Problem Statement
To determine how models are tested, and whether this indicates how they will perform when applied to new data.
To determine how the data affected the model building, and how the test data affected the model testing.

Methodology
Two classification algorithms were selected:
- Naïve Bayes
- Adaptive Bayes Network
Both produce predictions, which could then be compared (a scoring sketch follows below).
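Neither slide shows how the two algorithms actually score a record, and no Oracle API calls are reproduced in the deck. As a rough, hypothetical illustration of the Naïve Bayes side only (the priors, attribute names and likelihoods below are invented, and this is generic Java, not Oracle Data Mining code):

```java
import java.util.Map;

/**
 * Minimal Naive Bayes scoring sketch for a binary RAIN target.
 * All probabilities here are invented for illustration; Oracle Data
 * Mining estimates them from the build data.
 */
public class NaiveBayesSketch {

    // Class priors, P(RAIN = yes) and P(RAIN = no) -- hypothetical values.
    static final double P_YES = 0.34, P_NO = 0.66;

    // Conditional likelihoods P(attribute value | class) -- hypothetical.
    static final Map<String, Double> GIVEN_YES = Map.of(
            "WIND_CHILL=low", 0.70, "WIND_DIR=north", 0.55);
    static final Map<String, Double> GIVEN_NO = Map.of(
            "WIND_CHILL=low", 0.20, "WIND_DIR=north", 0.30);

    /** Returns P(RAIN = yes | record); the record is given as attribute=value strings. */
    static double probYes(String... attributeValues) {
        // Work in log space so many small likelihoods do not underflow.
        double logYes = Math.log(P_YES), logNo = Math.log(P_NO);
        for (String av : attributeValues) {
            logYes += Math.log(GIVEN_YES.getOrDefault(av, 1e-3));
            logNo += Math.log(GIVEN_NO.getOrDefault(av, 1e-3));
        }
        // Normalise the two joint scores into a probability.
        double max = Math.max(logYes, logNo);
        double yes = Math.exp(logYes - max), no = Math.exp(logNo - max);
        return yes / (yes + no);
    }

    public static void main(String[] args) {
        System.out.printf("P(RAIN=yes) = %.3f%n",
                probYes("WIND_CHILL=low", "WIND_DIR=north"));
    }
}
```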

Methodology
Data: weather data. The recorded readings include:
- Temperature (degrees F)
- Humidity (percent)
- Barometer (inches of mercury)
- Wind Direction (degrees; 360 = North, 90 = East)
- Wind Speed (mph)
- High Wind Speed (mph)
- Solar Radiation (watts/m^2)
- Rainfall (inches)
- Wind Chill (computed from high wind speed and temperature; see the sketch below)
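The deck does not say how wind chill was computed from the high wind speed and temperature readings; for reference, a sketch using the standard NWS wind chill formula (temperature in degrees F, wind speed in mph; the formula is only defined for temperatures at or below 50 F and winds above 3 mph):

```java
/** Wind chill via the standard NWS formula (degrees F, mph).
 *  Whether the weather station used exactly this formula is an assumption. */
public class WindChillSketch {
    static double windChillF(double tempF, double windMph) {
        double v = Math.pow(windMph, 0.16);
        return 35.74 + 0.6215 * tempF - 35.75 * v + 0.4275 * tempF * v;
    }

    public static void main(String[] args) {
        // 30 F with a 20 mph high wind speed feels like roughly 17 F.
        System.out.printf("%.1f F%n", windChillF(30, 20));
    }
}
```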

Data
The rainfall reading was removed and replaced with a yes or no depending on whether any rainfall was recorded. This variable, RAIN, was chosen as the target variable (a recoding sketch follows below).
Two data sets were put into tables in the database:
- WEATHER_BUILD
- WEATHER_APPLY
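A minimal sketch of that recoding step, assuming the rainfall reading is a numeric column in inches (the class and method names are hypothetical):

```java
/** Derives the binary RAIN target from the numeric rainfall reading. */
public class RainTargetSketch {
    static String rainLabel(double rainfallInches) {
        // Any recorded rainfall becomes "yes"; none becomes "no".
        return rainfallInches > 0 ? "yes" : "no";
    }

    public static void main(String[] args) {
        System.out.println(rainLabel(0.02)); // yes
        System.out.println(rainLabel(0.0));  // no
    }
}
```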

WEATHER_BUILD
- 2601 records
- Used to create the build and test data with the Transformation Split wizard (a split sketch follows below)
WEATHER_APPLY
- 290 records
- Used to validate the models
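The slides do not state what ratio the Transformation Split wizard was configured with; below is a minimal sketch of the underlying idea, assuming a random 70/30 build/test split with a fixed seed for repeatability:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Random build/test split in the spirit of the Transformation Split
 *  wizard. The 70/30 ratio and the seed are assumptions. */
public class SplitSketch {
    static <T> List<List<T>> split(List<T> records, double buildFraction, long seed) {
        List<T> shuffled = new ArrayList<>(records);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * buildFraction);
        return List.of(shuffled.subList(0, cut),
                       shuffled.subList(cut, shuffled.size()));
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 2601; i++) ids.add(i); // WEATHER_BUILD size
        List<List<Integer>> parts = split(ids, 0.7, 42);
        System.out.println("build=" + parts.get(0).size()
                + " test=" + parts.get(1).size()); // build=1821 test=780
    }
}
```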

Building and Testing the Models
- The Priors technique
- Training and tuning the models
- The models built
- Testing
- Results

Data Preparation Techniques - Priors

Priors: Stratified Sampling
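The slides name the technique without giving the proportions used; this is a minimal sketch of one common form of it, assuming the goal is a build sample with equal numbers of RAIN=yes and RAIN=no records, obtained by downsampling the majority class:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Stratified sampling sketch: equalise the class priors by
 *  downsampling the majority class. The equal-priors target is an
 *  assumption; the slides do not state the exact proportions. */
public class PriorsSketch {
    record Row(String rain) {} // stand-in for a WEATHER_BUILD record

    static List<Row> equalPriors(List<Row> rows, long seed) {
        List<Row> yes = new ArrayList<>(), no = new ArrayList<>();
        for (Row r : rows) (r.rain().equals("yes") ? yes : no).add(r);
        Random rnd = new Random(seed);
        Collections.shuffle(yes, rnd);
        Collections.shuffle(no, rnd);
        int n = Math.min(yes.size(), no.size()); // size of the smaller class
        List<Row> sample = new ArrayList<>(yes.subList(0, n));
        sample.addAll(no.subList(0, n));
        Collections.shuffle(sample, rnd); // mix the two strata back together
        return sample;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        for (int i = 0; i < 800; i++) rows.add(new Row("no"));
        for (int i = 0; i < 200; i++) rows.add(new Row("yes"));
        System.out.println(equalPriors(rows, 42).size()); // 400: 200 of each
    }
}
```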

Training and Tuning the Models

              Predicted No   Predicted Yes
Actual No          384              34
Actual Yes         141              74

Training and Tuning the Models
- It proved viable to introduce a weighting of 3 against false negatives
- This makes a false negative prediction 3 times as costly as a false positive
- The algorithm attempts to minimise this cost (see the sketch below)
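A sketch of what that weighting means when scoring a model. The cost function is generic, not Oracle's; the counts reuse the confusion matrix from the earlier slide (whose cell boundaries had to be reconstructed from the transcript):

```java
/** Cost-sensitive scoring sketch: a false negative costs 3 units,
 *  a false positive 1, and correct predictions cost nothing. */
public class CostSketch {
    static double totalCost(int falsePositives, int falseNegatives,
                            double fpCost, double fnCost) {
        return falsePositives * fpCost + falseNegatives * fnCost;
    }

    public static void main(String[] args) {
        // Confusion matrix counts as reconstructed on the earlier slide.
        int tn = 384, fp = 34, fn = 141, tp = 74;
        System.out.printf("accuracy     = %.3f%n",
                (double) (tn + tp) / (tn + fp + fn + tp));
        System.out.printf("equal costs  = %.0f%n", totalCost(fp, fn, 1, 1)); // 175
        System.out.printf("3x FN weight = %.0f%n", totalCost(fp, fn, 1, 3)); // 457
    }
}
```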

The Models
8 models in total, 4 using each algorithm:
- One using default settings
- One using the Priors technique
- One using weighting
- One using Priors and weighting

Testing the Models
- Tested on the test data set created from the WEATHER_BUILD data set
- Confusion matrices indicate the accuracy of the models

Testing Results

Applying the Models to New Data
The models were applied to the new data in WEATHER_APPLY.
[Slide tables: extracts showing two predictions from the actual results. One table lists Prediction, Probability and THE_TIME for a 'no' and a 'yes' prediction; the other lists Prediction, Cost of incorrect prediction and THE_TIME.]

Attribute Influence on Predictions
- The Adaptive Bayes Network provides rules along with its predictions
- The rules are in if ... then format
- The rules showed that the attributes with the most influence were:
  - Wind Chill
  - Wind Direction
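The slides do not reproduce one of the actual rules, but a hypothetical example of the if ... then format, with invented thresholds, would be: IF WIND_CHILL <= 25 AND WIND_DIRECTION >= 45 AND WIND_DIRECTION <= 135 THEN RAIN = yes.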

Results of Applying Models to New Data

Comparing Accuracy

Observations
- The algorithms found a pattern in the weather data
- The most effective model: the Adaptive Bayes Network algorithm using weighting
- The accuracy of the Naïve Bayes models improves dramatically if weighting and Priors are used
- There is a significant difference between the accuracy seen during testing and the accuracy when the models are applied to new data

Conclusions
- Oracle Data Mining provides easy-to-use wizards that support all aspects of the data mining process
- The algorithms found a pattern in the weather data
  - Best case: the Adaptive Bayes Network model predicted 73.1% of RAIN outcomes correctly

Conclusions
- The Adaptive Bayes Network algorithm produced the most effective model: 73.1% accuracy when applied to new data
  - Tuned using a weighting of 3 against false negatives
- The most effective Naïve Bayes model had an accuracy of 63.79%
  - Tuned using a weighting of 3 against false negatives together with the Priors technique

Conclusions
- Accuracy during testing does not always indicate how a model will perform on new data
- Test accuracy is inflated if the target attribute distribution in the build and test data sets is similar
- This shows the need to test a model on a variety of data sets

Questions