Data Mining Demystified John Aleshunas Fall Faculty Institute October 2006.


Prediction is very hard, especially when it's about the future. - Yogi Berra

Data Mining Stories
- “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” (credit card fraud detection)
- The NSA is using data mining to analyze telephone call data to track al-Qaeda activities
- Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores

Preview
- Why data mining?
- Example data sets
- Data mining methods
- Example application of data mining
- Social issues of data mining

Why Data Mining?
Source: Han
- Database systems have been around since the 1970s
- Organizations have a vast digital history of the day-to-day pieces of their processes
- Simple queries no longer provide satisfying results
  - They take too long to execute
  - They cannot help us find new opportunities

Why Data Mining?
Source: Han
- Data doubles about every year while useful information seems to be decreasing
- Vast data stores overload traditional decision making processes
- We are data rich, but information poor

Data Mining: a definition
Simply stated, data mining refers to the extraction of knowledge from large amounts of data.

Data Mining Models: A Taxonomy
Source: Dunham
Data Mining
- Predictive: Classification, Regression, Time Series Analysis, Prediction
- Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery

Example Datasets
- Iris
- Wine
- Diabetes

Iris Dataset
Source: Fisher
- Created by R.A. Fisher (1936)
- 150 instances
- Three species (Setosa, Virginica, Versicolor), 50 instances each
- 4 measurements (petal width, petal length, sepal width, sepal length)
- One species (Setosa) is easily separable; the other two overlap (noisy data)

Iris Dataset Analysis

Wine Dataset
Source: UCI Machine Learning Repository
- This data is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different varieties.
- 153 instances with 13 constituents found in each of the three types of wines.

Wine Dataset Analysis

Diabetes Dataset
Source: UCI Machine Learning Repository
- Data is based on a population of women who were at least 21 years old of Pima Indian heritage and living near Phoenix in 1990
- 768 instances
- 9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes)
- Dataset has many missing values; only 532 instances are complete

Diabetes Dataset Analysis

Classification
- Classification builds a model using a training dataset with known classes of data
- That model is used to classify new, unknown data into those classes

Classification Techniques
- K-Nearest Neighbors
- Decision Tree Classification (ID3, C4.5)

K-Nearest Neighbors Example
(figure: scatter of points labeled A and B, with an unknown point X to be classified)
- Easy to explain
- Simple to implement
- Sensitive to the selection of the classification population
- Not always conclusive for complex data
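The idea behind the figure can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results in the slides: it assumes Euclidean distance and a simple majority vote, and the function name `knn_classify` and the toy points are made up for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; Euclidean distance is an
    assumption made for this sketch.
    """
    # Keep the k training points closest to the query.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups labeled A and B.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B")]

print(knn_classify(train, (1.1, 1.0)))  # the unknown point X near group A
print(knn_classify(train, (2.9, 3.0)))  # an unknown point near group B
```

The sensitivity noted on the slide shows up directly here: changing `k` or the contents of `train` can change the vote for points that lie between the two groups.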

K-Nearest Neighbors Example
Source: Indelicato
Misclassification percentages
Iris dataset            All attributes    Petal length and petal width
Setosa                  0/150 = 0%        0/150 = 0%
Versicolor, Virginica   9/150 = 6%        7/150 = 4.67%
Total                   6%                4.67%
Wine dataset            All attributes    Phenols, flavanoids, OD280/OD315
Class 1                 0/153 = 0%        2/153 = 1.31%
Class 2                 9/153 = 5.88%     30/153 = 19.61%
Class 3                 0/153 = 0%        0/153 = 0%
Total                   5.88%             20.92%
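The totals in the Wine table are the per-class error counts summed and divided by the number of instances. A small sketch of that arithmetic follows; the helper name is hypothetical, and the Class 3 count for the three-attribute run is taken as 0 because that is what the stated total implies.

```python
def misclassification_rate(errors_per_class, total_instances):
    """Return the overall misclassification rate as a percentage."""
    return 100.0 * sum(errors_per_class) / total_instances

# Wine dataset, all attributes: errors for Class 1, 2, 3.
wine_all = misclassification_rate([0, 9, 0], 153)
# Wine dataset, three attributes (Class 3 assumed to be 0, see lead-in).
wine_three = misclassification_rate([2, 30, 0], 153)

print(f"{wine_all:.2f}%  {wine_three:.2f}%")  # prints "5.88%  20.92%"
```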

Decision Tree Example (C4.5)
- C4.5 is a decision tree generating algorithm, based on the ID3 algorithm. It contains several improvements, especially needed for software implementation.
- Choice of best splitting attribute is based on an entropy calculation.
- These improvements include:
  - Choosing an appropriate attribute selection measure.
  - Handling training data with missing attribute values.
  - Handling attributes with differing costs.
  - Handling continuous attributes.
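The entropy-based choice of splitting attribute can be sketched as follows. This is a simplified illustration for categorical attributes only, using the gain ratio measure commonly associated with C4.5; it does not handle missing values, costs, or continuous attributes, and the function names and toy rows are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on `attr`, divided by the split information."""
    total = len(labels)
    # Group the class labels by the value each row has for the attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Expected entropy remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Toy data: attribute "a" separates the classes perfectly, "b" not at all.
rows = [{"a": "x", "b": "p"}, {"a": "x", "b": "q"},
        {"a": "y", "b": "p"}, {"a": "y", "b": "q"}]
labels = ["yes", "yes", "no", "no"]

best = max(["a", "b"], key=lambda attr: gain_ratio(rows, labels, attr))
print(best)  # the attribute chosen for the first split
```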

Decision Tree Example (C4.5)
Source: Seidler
- Iris dataset: accuracy 97.67%
- Wine dataset: accuracy 86.7%

Decision Tree Example (C4.5)
- C4.5 produces a complex tree (195 nodes)
- The simplified (pruned) tree reduces the classification accuracy
Diabetes dataset   Before pruning   After pruning
Size               […]              […]
Errors             […] (5.2%)       […] (13.3%)
Accuracy           94.8%            86.7%

Association Rules
Association rules are used to show the relationships between data items. Purchasing one product when another product is purchased is an example of an association rule. They do not represent any causality or correlation.

Association Rule Techniques
- Market Basket Analysis
- Terminology
  - Transaction database
  - Association rule: implication {A, B} => {C}
  - Support: % of transactions in which {A, B, C} occurs
  - Confidence: ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}
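The two measures defined above can be computed directly from a transaction database. The sketch below represents each transaction as a Python set; the helper names and the five toy transactions are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent alone."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(transactions, set(antecedent) | set(consequent)) / base

# Toy transaction database.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"},
                {"B", "C"}, {"A", "C"}]

# Rule {A, B} => {C}
print(support(transactions, {"A", "B", "C"}))     # 2 of 5 transactions
print(confidence(transactions, {"A", "B"}, {"C"}))  # 2 of the 3 that contain A and B
```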

Association Rule Example
Source: UCI Machine Learning Repository
1984 United States Congressional Voting Records Database
Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. El-Salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-Nicaraguan-contras: 2 (y,n)
10. MX-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)
Rules:
- {budget resolution = no, MX-missile = no, aid to El Salvador = yes} => {Republican}, confidence 91.0%
- {budget resolution = yes, MX-missile = yes, aid to El Salvador = no} => {Democrat}, confidence 97.5%
- {crime = yes, right-to-sue = yes, Physician fee freeze = yes} => {Republican}, confidence 93.5%
- {crime = no, right-to-sue = no, Physician fee freeze = no} => {Democrat}, confidence 100.0%

Clustering
Clustering is similar to classification in that data are grouped. Unlike classification, the groups are not predefined; they are discovered. Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.

Clustering Techniques
- K-Means Clustering
- Neural Network Clustering (SOM)

K-Means Example
- The K-Means algorithm is a method to cluster objects based on their attributes into k partitions.
- It assumes that the k clusters exhibit normal distributions.
- The objective it tries to achieve is to minimize the variance within the clusters.
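A minimal sketch of the usual iterative procedure (often called Lloyd's algorithm) is shown below: assign each point to its nearest mean, recompute each mean, and repeat until nothing changes. The starting means are supplied by hand here to keep the example deterministic; real implementations typically pick them at random, and the function name and toy points are assumptions.

```python
import math

def kmeans(points, means, iterations=20):
    """Cluster `points` around the given starting `means` (a list of tuples)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Update step: move each mean to the centroid of its cluster
        # (an empty cluster keeps its previous mean).
        new_means = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else m
            for cluster, m in zip(clusters, means)
        ]
        if new_means == means:
            break  # assignments can no longer change
        means = new_means
    return means, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
means, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(means)
print(clusters)
```

Each update step can only lower (or leave unchanged) the total squared distance from points to their cluster means, which is the within-cluster variance objective named on the slide.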

K-Means Example
(figure: a dataset partitioned into Cluster 1, Cluster 2, and Cluster 3 around Mean 1, Mean 2, and Mean 3)

K-Means Example
Iris dataset, only the petal width attribute, 3 clusters, accuracy 95.33%
- Cluster 1: 46 Versicolor, 3 Virginica; cluster mean […]
- Cluster 2: […] Versicolor, 47 Virginica; cluster mean […]
- Cluster 3: […] Setosa; cluster mean […]
Iris dataset, all attributes, 3 clusters, accuracy 66.0%
- Cluster 1: 47 Versicolor, 49 Virginica; mean 6.30, 2.89, 4.96, […]
- Cluster 2: […] Setosa, 1 Virginica; mean 4.59, 3.07, 1.44, […]
- Cluster 3: […] Setosa, 3 Versicolor; mean 5.21, 3.53, 1.67, 0.35
Iris dataset, all attributes, 7 clusters, accuracy 90.67%
- Clusters 1 to 7, contents in slide order: 23 Virginica; 1 Virginica; 26 Setosa; 12 Virginica; 24 Versicolor; 1 Virginica; 26 Versicolor; 13 Virginica; 24 Setosa

Self-Organizing Map Example
- The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map.
- SOM is especially good for visualizing high-dimensional data.
- SOM maps input vectors onto a two-dimensional grid of nodes.
- Nodes that are close together have similar attribute values and nodes that are far apart have different attribute values.
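The mapping described above can be sketched with a small training loop. This is a bare-bones illustration under several assumptions that the slides do not specify: a rectangular grid, random initial weights, a Gaussian neighborhood, and linearly decaying learning rate and radius. All names and parameter values are hypothetical.

```python
import math
import random

def best_matching_unit(grid, x):
    """Return (row, col) of the node whose weight vector is closest to x."""
    return min(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: math.dist(grid[rc[0]][rc[1]], x),
    )

def train_som(data, rows=4, cols=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a rows x cols map on `data` (vectors with values in [0, 1])."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Each grid node holds a weight vector of the same dimension as the inputs.
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays over time
        radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks over time
        for x in data:
            br, bc = best_matching_unit(grid, x)
            for r in range(rows):
                for c in range(cols):
                    # Nodes near the winner on the grid are pulled toward x more strongly.
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2 * radius ** 2))
                    w = grid[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return grid

# Toy 3-dimensional inputs forming two groups.
data = [(0.1, 0.1, 0.2), (0.2, 0.1, 0.1), (0.9, 0.8, 0.9), (0.8, 0.9, 0.9)]
grid = train_som(data)
for x in data:
    print(x, "->", best_matching_unit(grid, x))
```

Because neighbors of the winning node are updated together, nearby grid nodes end up with similar weight vectors, which is what makes the two-dimensional grid usable as a picture of higher-dimensional data.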

Self-Organizing Map Example
(figure: input vectors with components X, Y, Z mapped onto the grid of nodes)

Self-Organizing Map Example: Iris Data
(figure: SOM grid with nodes labeled Setosa, Versicolor, or Virginica)

Self-Organizing Map Example: Wine Data
(figure: SOM grid with nodes labeled Class-1, Class-2, or Class-3)

Self-Organizing Map Example: Diabetes Data
(figure: SOM grid with nodes labeled Healthy or Sick)

NFL Quarterback Analysis
Source: McKee
- Data from 2005 for 42 NFL quarterbacks
- Preprocessed data to normalize for a full 16 game regular season
- Used SOM to cluster individuals based on performance and descriptive data
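The slides do not say how the season normalization was done. One plausible reading is pro-rata scaling: a quarterback's cumulative totals are scaled to what they would be over 16 games at the same per-game rate. The sketch below assumes that reading; the function name, stat names, and numbers are hypothetical.

```python
def normalize_to_season(totals, games_played, season_games=16):
    """Scale cumulative stats to a full season at the same per-game rate."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    factor = season_games / games_played
    return {name: value * factor for name, value in totals.items()}

# Hypothetical totals for a quarterback who played 8 games.
qb = {"passing_yards": 2000, "touchdowns": 12}
print(normalize_to_season(qb, games_played=8))
# prints {'passing_yards': 4000.0, 'touchdowns': 24.0}
```

Rate statistics such as a passer rating would not be scaled this way, since they are already independent of the number of games played.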

NFL Quarterback Analysis: The SOM Map
Source: McKee

NFL Quarterback Analysis
Source: McKee
(figures: QB Passing Rating; Overall Clustering)

NFL Quarterback Analysis: The SOM Map
Source: McKee

Data Mining Stories - Revisited
- Credit card fraud detection
- NSA telephone network analysis
- Supply chain management

Social Issues of Data Mining
- Impacts on personal privacy and confidentiality
- Classification and clustering are similar to profiling
- Association rules resemble logical implications
- Data mining is an imperfect process subject to interpretation

Conclusion
- Why data mining?
- Example data sets
- Data mining methods
- Example application of data mining
- Social issues of data mining

What on earth would a man do with himself if something did not stand in his way? - H.G. Wells
I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble. - Agatha Christie, from “An Autobiography, Pt III, Growing Up”

References
- Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003
- Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp
- Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006
- Indelicato, Nicolas, Analysis of the K-Nearest Neighbors Algorithm, MATH 4500: Foundations of Data Mining, 2004
- McKee, Kevin, The Self Organized Map Applied to 2005 NFL Quarterbacks, MATH 4200: Data Mining Foundations, 2006
- Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science
- Seidler, Toby, The C4.5 Project: An Overview of the Algorithm with Results of Experimentation, MATH 4500: Foundations of Data Mining, 2004