
1 Data Mining Demystified John Aleshunas Fall Faculty Institute October 2006

2 Prediction is very hard, especially when it's about the future. - Yogi Berra

3 Data Mining Stories
- “My bank called and said that they saw that I bought two surfboards at Laguna Beach, California.” - credit card fraud detection
- The NSA is using data mining to analyze telephone call data to track al-Qaeda activities
- Victoria’s Secret uses data mining to control product distribution based on typical customer buying patterns at individual stores

4 Preview
- Why data mining?
- Example data sets
- Data mining methods
- Example application of data mining
- Social issues of data mining

5 Source: Han. Why Data Mining?
- Database systems have been around since the 1970s
- Organizations have a vast digital history of the day-to-day pieces of their processes
- Simple queries no longer provide satisfying results: they take too long to execute, and they cannot help us find new opportunities

6 Source: Han. Why Data Mining?
- Data doubles about every year, while useful information seems to be decreasing
- Vast data stores overload traditional decision-making processes
- We are data rich, but information poor

7 Data Mining: a definition Simply stated, data mining refers to the extraction of knowledge from large amounts of data.

8 Source: Dunham. Data Mining Models: A Taxonomy
- Predictive: Classification, Regression, Time Series Analysis, Prediction
- Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery

9 Example Datasets: Iris, Wine, Diabetes

10 Source: Fisher. Iris Dataset
- Created by R.A. Fisher (1936)
- 150 instances: three cultivars (Setosa, Virginica, Versicolor), 50 instances each
- 4 measurements (petal width, petal length, sepal width, sepal length)
- One cultivar (Setosa) is easily separable; the other two are not – noisy data
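For readers who want to confirm these counts themselves, the Iris data ships with scikit-learn; a minimal sketch, assuming a standard scikit-learn install (this loader is not part of the original deck):

```python
# Minimal sketch: load the Iris data and confirm the slide's figures.
# Assumes scikit-learn is installed; load_iris() is its standard loader.
from collections import Counter
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)                                      # (150, 4)
print(Counter(iris.target_names[t] for t in iris.target))   # 50 per cultivar
```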

11 Iris Dataset Analysis

12 Source: UCI Machine Learning Repository. Wine Dataset
- The result of a chemical analysis of wines grown in the same region in Italy but derived from three different varieties
- 153 instances, with 13 constituents found in each of the three types of wine

13 Wine Dataset Analysis

14 Source: UCI Machine Learning Repository. Diabetes Dataset
- Based on a population of women of Pima Indian heritage who were at least 21 years old and living near Phoenix in 1990
- 768 instances
- 9 attributes (Pregnancies, PG Concentration, Diastolic BP, Tri Fold Thick, Serum Ins, BMI, DP Function, Age, Diabetes)
- The dataset has many missing values; only 532 instances are complete

15 Diabetes Dataset Analysis

16 Classification
- Classification builds a model from a training dataset whose class labels are known
- That model is then used to assign new, unlabeled data to those classes

17 Classification Techniques
- K-Nearest Neighbors
- Decision Tree Classification (ID3, C4.5)

18 K-Nearest Neighbors Example
[Diagram: an unknown point X surrounded by training points labeled A and B; X takes the majority class among its nearest neighbors]
- Easy to explain
- Simple to implement
- Sensitive to the selection of the classification population
- Not always conclusive for complex data
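A minimal sketch of the idea just described, assuming Euclidean distance and a plain majority vote (the training values are made up for illustration, not taken from the slide's data):

```python
# Minimal k-nearest-neighbors sketch: classify a query point by the
# majority label among its k closest training points.
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (features, label) pairs; query: feature tuple."""
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    # Majority vote among the k closest.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (petal length, petal width) examples.
train = [((1.4, 0.2), "Setosa"), ((4.7, 1.4), "Versicolor"),
         ((1.3, 0.2), "Setosa"), ((5.1, 1.9), "Virginica"),
         ((4.5, 1.5), "Versicolor"), ((5.9, 2.1), "Virginica")]
print(knn_classify(train, (4.6, 1.3), k=3))  # -> "Versicolor"
```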

19 Source: Indelicato. K-Nearest Neighbors Example

Misclassification percentages, Iris dataset:
                          All attributes    Petal length and petal width
Setosa                    0/150 = 0%        0/150 = 0%
Versicolor + Virginica    9/150 = 6%        7/150 = 4.67%
Total                     6%                4.67%

Misclassification percentages, Wine dataset:
            All attributes    Phenols, Flavanoids, OD280/OD315
Class 1     0/153 = 0%        2/153 = 1.31%
Class 2     9/153 = 5.88%     30/153 = 19.61%
Class 3     0/153 = 0%        0/153 = 0%
Total       5.88%             20.92%

20 Decision Tree Example (C4.5)
C4.5 is a decision-tree-generating algorithm based on the ID3 algorithm, with several practical improvements. The best splitting attribute is chosen using an entropy calculation. The improvements include:
- Choosing an appropriate attribute selection measure
- Handling training data with missing attribute values
- Handling attributes with differing costs
- Handling continuous attributes
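A sketch of the entropy calculation behind the split choice, in the raw information-gain form inherited from ID3 (C4.5 itself normalizes this into a gain ratio); the example labels below are made up:

```python
# Entropy and information gain: the measures behind decision-tree splits.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy reduction from splitting `labels` into `partitions`."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder

labels = ["setosa"] * 5 + ["virginica"] * 5
split = [["setosa"] * 5, ["virginica"] * 5]    # a perfect split
print(entropy(labels))                  # 1.0 bit of class uncertainty
print(information_gain(labels, split))  # 1.0 -- the split removes it all
```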

21 Source: Seidler. Decision Tree Example (C4.5)
[Decision trees for the two datasets. Iris dataset: accuracy 97.67%; Wine dataset: accuracy 86.7%]

22 Decision Tree Example (C4.5)
- C4.5 produces a complex tree (195 nodes)
- The simplified (pruned) tree reduces the classification accuracy

Diabetes dataset    Before pruning    After pruning
Size                195               69
Errors              40 (5.2%)         102 (13.3%)
Accuracy            94.8%             86.7%

23 Association Rules Association rules are used to show the relationships between data items. Purchasing one product when another product is purchased is an example of an association rule. They do not represent any causality or correlation.

24 Association Rule Techniques
- Market Basket Analysis
- Terminology:
  - Transaction database
  - Association rule – an implication {A, B} => {C}
  - Support – the percentage of transactions in which {A, B, C} occurs
  - Confidence – the ratio of the number of transactions that contain {A, B, C} to the number of transactions that contain {A, B}
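Both measures are direct to compute; a minimal sketch over a hypothetical transaction database (the groceries below are invented for illustration):

```python
# Support and confidence for the rule {bread, milk} => {butter}.
transactions = [
    {"bread", "milk", "butter"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter", "jam"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(A union C) / support(A) for the rule A => C."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk", "butter"}))       # 0.4 (2 of 5 transactions)
print(confidence({"bread", "milk"}, {"butter"}))  # 0.667 (2 of 3)
```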

25 Source: UCI Machine Learning Repository. Association Rule Example
1984 United States Congressional Voting Records Database

Attribute information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. El-Salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-Nicaraguan-contras: 2 (y,n)
10. MX-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)

Rules:
{budget resolution = no, MX-missile = no, aid to El Salvador = yes} => {Republican}, confidence 91.0%
{budget resolution = yes, MX-missile = yes, aid to El Salvador = no} => {Democrat}, confidence 97.5%
{crime = yes, right-to-sue = yes, physician fee freeze = yes} => {Republican}, confidence 93.5%
{crime = no, right-to-sue = no, physician fee freeze = no} => {Democrat}, confidence 100.0%

26 Clustering Clustering is similar to classification in that data are grouped. Unlike classification, the groups are not predefined; they are discovered. Grouping is accomplished by finding similarities between data according to characteristics found in the actual data.

27 Clustering Techniques
- K-Means Clustering
- Neural Network Clustering (SOM)

28 K-Means Example
- The K-Means algorithm is a method for clustering objects, based on their attributes, into k partitions
- It assumes that the k clusters exhibit normal distributions
- Its objective is to minimize the variance within the clusters
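A one-dimensional sketch of the two alternating steps K-Means repeats (assign each point to its nearest mean, then move each mean to its cluster's average); the petal-width-like values are made up:

```python
# Minimal 1-D K-Means sketch; real runs use full vectors and
# multiple random restarts.
def kmeans(points, means, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest mean.
        clusters = [[] for _ in means]
        for x in points:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Update step: move each mean to the average of its cluster.
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means, clusters

points = [0.2, 0.3, 0.25, 1.3, 1.4, 1.5, 2.0, 2.2, 2.3]
means, clusters = kmeans(points, means=[0.0, 1.0, 2.0])
print(means)  # roughly one mean per natural group
```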

29 K-Means Example
[Diagram: a dataset partitioned into Clusters 1–3, each marked with its mean (Mean 1, Mean 2, Mean 3)]

30 K-Means Example

Iris dataset, only the petal width attribute – accuracy 95.33%:
- Cluster 1: 46 Versicolor, 3 Virginica (cluster mean 4.22857)
- Cluster 2: 4 Versicolor, 47 Virginica (cluster mean 5.55686)
- Cluster 3: 50 Setosa (cluster mean 1.46275)

Iris dataset, all attributes – accuracy 66.0%:
- Cluster 1: 47 Versicolor, 49 Virginica (mean 6.30, 2.89, 4.96, 1.70)
- Cluster 2: 21 Setosa, 1 Virginica (mean 4.59, 3.07, 1.44, 0.29)
- Cluster 3: 29 Setosa, 3 Versicolor (mean 5.21, 3.53, 1.67, 0.35)

Iris dataset, all attributes, seven clusters – accuracy 90.67%. Cluster compositions as given: 23 Virginica; 1 Virginica with 26 Setosa; 12 Virginica with 24 Versicolor; 1 Virginica with 26 Versicolor; 13 Virginica; 24 Setosa

31 Self-Organizing Map Example
- The Self-Organizing Map was first described by the Finnish professor Teuvo Kohonen and is thus sometimes referred to as a Kohonen map
- SOM is especially good for visualizing high-dimensional data
- SOM maps input vectors onto a two-dimensional grid of nodes
- Nodes that are close together have similar attribute values; nodes that are far apart have different attribute values
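A toy sketch of the Kohonen update rule behind this mapping, assuming a small NumPy grid and a Gaussian neighborhood (an illustration, not the deck's implementation):

```python
# Toy SOM update: each input pulls its best-matching node, and that
# node's grid neighbors, toward itself.
import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 5, 5, 4            # 5x5 map over 4-D inputs
weights = rng.random((grid_w, grid_h, dim))

def train_step(x, lr=0.5, radius=1.0):
    # Best-matching unit: the node whose weight vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=2)
    bx, by = np.unravel_index(np.argmin(dists), dists.shape)
    # Pull every node toward x, scaled by its grid distance to the BMU.
    for i in range(grid_w):
        for j in range(grid_h):
            d2 = (i - bx) ** 2 + (j - by) ** 2
            influence = np.exp(-d2 / (2 * radius ** 2))
            weights[i, j] += lr * influence * (x - weights[i, j])

for x in rng.random((100, dim)):         # 100 random 4-D inputs
    train_step(x)
```

After training, nearby grid nodes hold similar weight vectors, which is what makes the class maps on the following slides readable.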

32 Self-Organizing Map Example
[Diagram: input vectors in three dimensions (X, Y, Z) mapped onto a two-dimensional grid of nodes]

33 Self-Organizing Map Example – Iris Data
[SOM grid of the Iris data: nodes labeled by cultivar (Setosa, Versicolor, Virginica)]

34 Self-Organizing Map Example – Wine Data
[SOM grid of the Wine data: nodes labeled by class (Class 1, Class 2, Class 3)]

35 Self-Organizing Map Example – Diabetes Data
[SOM grid of the Diabetes data: Healthy and Sick nodes are heavily interleaved across the map]

36 Source: McKee. NFL Quarterback Analysis
- Data from 2005 for 42 NFL quarterbacks
- Preprocessed data to normalize for a full 16-game regular season
- Used SOM to cluster individuals based on performance and descriptive data

37 Source: McKee. NFL Quarterback Analysis – The SOM Map

38 Source: McKee. NFL Quarterback Analysis – QB Passing Rating / Overall Clustering

39 Source: McKee. NFL Quarterback Analysis – The SOM Map

40 Data Mining Stories – Revisited
- Credit card fraud detection
- NSA telephone network analysis
- Supply chain management

41 Social Issues of Data Mining
- Impacts on personal privacy and confidentiality
- Classification and clustering are similar to profiling
- Association rules resemble logical implications
- Data mining is an imperfect process, subject to interpretation

42 Conclusion
- Why data mining?
- Example data sets
- Data mining methods
- Example application of data mining
- Social issues of data mining

43 What on earth would a man do with himself if something did not stand in his way? - H.G. Wells
I don’t think necessity is the mother of invention – invention, in my opinion, arises directly from idleness, probably also from laziness, to save oneself trouble. - Agatha Christie, from “An Autobiography, Pt III, Growing Up”

44 References
- Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003
- Fisher, R.A., "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics 7, pp. 179-188, 1936
- Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006
- Indelicato, Nicolas, "Analysis of the K-Nearest Neighbors Algorithm", MATH 4500: Foundations of Data Mining, 2004
- McKee, Kevin, "The Self Organized Map Applied to 2005 NFL Quarterbacks", MATH 4200: Data Mining Foundations, 2006
- Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science
- Seidler, Toby, "The C4.5 Project: An Overview of the Algorithm with Results of Experimentation", MATH 4500: Foundations of Data Mining, 2004

