An Investigation of Commercial Data Mining Presented by Emily Davis Supervisor: John Ebden.


Outline
- Data Mining Classification
- Data Mining Algorithms
- Choice of Technique
- Data Mining Process
- Evaluation of Results
- Oracle Data Mining
- Progress

Data Mining Classification
- Directed data mining builds a model that describes one particular variable in terms of the rest of the data. It includes classification, estimation and prediction.
- Undirected data mining builds a model to establish the relationships among all the variables. It includes affinity grouping (association discovery), clustering, and description or visualization.

Data Mining Algorithms
- Clustering: Groups instances of data into classes and allows for the discovery of structures in the data.
- Neural networks: Segment the state space of the data with gradients or sloping lines.
- Estimation: Determines the value of an unknown numerical output attribute.

- Prediction: Determines future outcomes from data (similar to estimation).
- Classification: Assigns new instances of data to categorical classes.
- Association discovery: Discovers associations between data fields (includes market basket analysis).

- Decision trees: Use data-splitting rules to split the data, then apply further splitting rules to the resulting subsets.
- Association rules: Use rule induction to generate patterns relating business goals to other data fields. The patterns are generated as trees with splits on data fields.
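
Association discovery of the market basket kind is commonly scored with two measures, support and confidence. The slides do not name these measures, so the sketch below is an assumption about how such associations might be quantified; the class name, method names and the basket data are all hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationMeasures {

    // Support: the fraction of baskets that contain every item in `items`.
    static double support(List<Set<String>> baskets, Set<String> items) {
        long hits = baskets.stream().filter(b -> b.containsAll(items)).count();
        return (double) hits / baskets.size();
    }

    // Confidence of the rule lhs -> rhs: of the baskets that contain `lhs`,
    // the fraction that also contain `rhs`. Assumes `lhs` occurs at least once.
    static double confidence(List<Set<String>> baskets, Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        return support(baskets, both) / support(baskets, lhs);
    }

    public static void main(String[] args) {
        // Hypothetical baskets, made up purely for illustration.
        List<Set<String>> baskets = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "milk", "butter"),
                Set.of("milk"));

        System.out.println("support(bread, milk)  = "
                + support(baskets, Set.of("bread", "milk")));
        System.out.println("confidence(bread -> milk) = "
                + confidence(baskets, Set.of("bread"), Set.of("milk")));
    }
}
```

In this made-up data, bread and milk appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets containing bread also contain milk (confidence 2/3).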

Choosing a Technique
Supervised learning:
- set of input and output data
- clear explanation of results
Association rules:
- input and output data have interesting interactions
Decision trees:
- it is known which attributes best define the data
- faster

Clustering and neural networks:
- all attributes are of equal importance
- perform well on noisy data (neural networks)
When increased accuracy is required, create multiple models using the same data mining technique until the optimal model is created.

Data Mining Process
- Too much focus on the automatic techniques.
- Not enough focus on the exploration and analysis of the problem and the data.
Common to all the presented processes:
- Thorough data preparation and exploration
- Interpretation and validation of the resulting models

Evaluating the Output
- Evaluation of supervised learning models involves determining the level of predictive accuracy.
- Models are evaluated using test data sets.
- Compare the error rates of models created from the same training data to determine which is more accurate.

Model A          Model Accept   Model Reject
Actual Accept    600            25
Actual Reject    75             300
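
The accuracy and error rate of Model A can be worked out directly from the confusion matrix above. The sketch below assumes the counts read as 600, 25, 75 and 300; the class and method names are hypothetical.

```java
class ConfusionMatrixDemo {

    // Accuracy: correctly classified instances divided by all instances.
    static double accuracy(int acceptAsAccept, int acceptAsReject,
                           int rejectAsAccept, int rejectAsReject) {
        int total = acceptAsAccept + acceptAsReject + rejectAsAccept + rejectAsReject;
        return (double) (acceptAsAccept + rejectAsReject) / total;
    }

    // Error rate: misclassified instances divided by all instances.
    static double errorRate(int acceptAsAccept, int acceptAsReject,
                            int rejectAsAccept, int rejectAsReject) {
        int total = acceptAsAccept + acceptAsReject + rejectAsAccept + rejectAsReject;
        return (double) (acceptAsReject + rejectAsAccept) / total;
    }

    public static void main(String[] args) {
        // Counts taken from the Model A matrix:
        // actual accept -> (600 accepted, 25 rejected),
        // actual reject -> (75 accepted, 300 rejected).
        System.out.println("accuracy   = " + accuracy(600, 25, 75, 300));
        System.out.println("error rate = " + errorRate(600, 25, 75, 300));
    }
}
```

With these counts, 900 of the 1000 test instances lie on the diagonal, giving an accuracy of 0.9 and an error rate of 0.1.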

When evaluating numerical output, use error measures in place of the percentage of correct predictions.
- Mean absolute error = average absolute difference between computed and desired outcome.
- Mean squared error = average squared difference between computed and desired outcome.
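
The two error measures for numerical output can be sketched as follows. The class name, method names and the sample values are hypothetical and serve only to illustrate the definitions.

```java
class NumericErrorMeasures {

    // Mean absolute error: average of |computed - desired|.
    static double meanAbsoluteError(double[] computed, double[] desired) {
        double sum = 0.0;
        for (int i = 0; i < computed.length; i++) {
            sum += Math.abs(computed[i] - desired[i]);
        }
        return sum / computed.length;
    }

    // Mean squared error: average of (computed - desired)^2.
    static double meanSquaredError(double[] computed, double[] desired) {
        double sum = 0.0;
        for (int i = 0; i < computed.length; i++) {
            double diff = computed[i] - desired[i];
            sum += diff * diff;
        }
        return sum / computed.length;
    }

    public static void main(String[] args) {
        // Made-up model outputs and target values.
        double[] computed = {2.5, 0.0, 2.0, 8.0};
        double[] desired  = {3.0, -0.5, 2.0, 7.0};

        System.out.println("MAE = " + meanAbsoluteError(computed, desired));
        System.out.println("MSE = " + meanSquaredError(computed, desired));
    }
}
```

Squaring the differences means that the mean squared error penalises a few large errors more heavily than the mean absolute error does.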

Cumulative Gains Chart

Evaluating unsupervised learning models using supervised learning
1. Perform clustering.
2. Treat each cluster as a class and assign it a name.
3. Choose random samples from the instances of each class; these samples form the training set.
4. Build a supervised model with the class names as output.
5. Use the remaining instances to test the accuracy of the clustering model.
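
The procedure above can be sketched end to end on toy data. Everything here is an assumption chosen to keep the example small: a two-cluster, one-dimensional k-means stands in for the clustering step, a nearest-mean classifier stands in for the supervised model, and all names and data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ClusterEvaluationSketch {

    // Clustering step: split 1-D data into two clusters with a few k-means passes.
    static int[] twoMeans(double[] xs, int iterations) {
        double c0 = Double.POSITIVE_INFINITY, c1 = Double.NEGATIVE_INFINITY;
        for (double x : xs) { c0 = Math.min(c0, x); c1 = Math.max(c1, x); }
        int[] label = new int[xs.length];
        for (int it = 0; it < iterations; it++) {
            double sum0 = 0, sum1 = 0;
            int n0 = 0, n1 = 0;
            for (int i = 0; i < xs.length; i++) {
                label[i] = Math.abs(xs[i] - c0) <= Math.abs(xs[i] - c1) ? 0 : 1;
                if (label[i] == 0) { sum0 += xs[i]; n0++; } else { sum1 += xs[i]; n1++; }
            }
            if (n0 > 0) c0 = sum0 / n0;
            if (n1 > 0) c1 = sum1 / n1;
        }
        return label;
    }

    // Remaining steps: treat clusters as classes, train on a random sample of
    // each class, and test on the rest. Assumes both clusters are non-empty.
    static double testAccuracy(double[] xs, int[] label, double trainFraction, long seed) {
        Random rng = new Random(seed);
        double[] sum = new double[2];
        int[] count = new int[2];
        List<Integer> test = new ArrayList<>();
        for (int c = 0; c < 2; c++) {
            List<Integer> members = new ArrayList<>();
            for (int i = 0; i < xs.length; i++) {
                if (label[i] == c) members.add(i);
            }
            Collections.shuffle(members, rng);
            int nTrain = (int) Math.ceil(trainFraction * members.size());
            for (int j = 0; j < members.size(); j++) {
                int i = members.get(j);
                if (j < nTrain) { sum[c] += xs[i]; count[c]++; } else { test.add(i); }
            }
        }
        // Supervised model: predict the class whose training mean is nearest.
        double m0 = sum[0] / count[0], m1 = sum[1] / count[1];
        int correct = 0;
        for (int i : test) {
            int predicted = Math.abs(xs[i] - m0) <= Math.abs(xs[i] - m1) ? 0 : 1;
            if (predicted == label[i]) correct++;
        }
        return test.isEmpty() ? Double.NaN : (double) correct / test.size();
    }

    public static void main(String[] args) {
        // Made-up data with two well-separated groups.
        double[] xs = {1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9};
        int[] clusters = twoMeans(xs, 10);
        System.out.println("test accuracy = " + testAccuracy(xs, clusters, 0.6, 42L));
    }
}
```

A high test accuracy suggests that the clusters are well separated enough for a supervised model to reproduce them; a low one suggests that the clustering did not find clearly distinguishable groups.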

Measures of interestingness
These include whether the pattern:
- is easily understood
- is valid with a degree of certainty
- is potentially useful
- is novel
- confirms a hypothesis of some kind
- represents knowledge

Oracle
- Adaptive Bayes Network, supporting decision trees (classification)
- Naive Bayes (classification)
- Model Seeker (classification)
- k-Means (clustering)
- O-Cluster (clustering)
- Predictor Variance (attribute importance)
- Apriori (association rules)

ODM

    public class Sample_NaiveBayesBuild_short extends Object {

        public static void main(String[] args) {
            System.out.println("Start: " + new java.util.Date());
            DataMiningServer dms = null;
            oracle.dmt.odm.Connection dmsConnection = null;
            try {
                // Create an instance of the data mining server and get a connection.
                // The mining server URL, user_name and password need to be specified.
                dms = new DataMiningServer("ora1.ict.ru.ac.za", "system", "emily");
                dmsConnection = dms.login();

                // Create a PhysicalDataSpecification object.
                // First create a LocationAccessData using the table name and schema name.
                LocationAccessData lad =
                    new LocationAccessData("CENSUS_2D_BUILD_UNBINNED", "odm_mtr");

                // Create a NonTransactionalDataSpecification object,
                // since the dataset is nontransactional.
                PhysicalDataSpecification m_PhysicalDataSpecification =
                    new NonTransactionalDataSpecification(lad);

                // The remaining model build steps are not shown on the slide.
            } catch (Exception e) {
                // Report any failure raised while talking to the mining server.
                e.printStackTrace();
            }
        }
    }

Data Mining for Java (DM4J)

Progress
- Literature survey
- Oracle installed on Ora in COE
- Exploring the Oracle suite, including JDeveloper
- Member of MetaLink (Oracle’s online support service)

Addressing the Problem:
- Run the different algorithms available in the data mining suite on sample data using ODM and DM4J.
- Document and evaluate the results using the techniques discussed.