Introduction to Data Mining

Data Mining
Data mining is a rapidly growing field of business analytics focused on better understanding of characteristics and patterns among variables in large data sets. It is used to identify and understand hidden patterns that large data sets may contain. It involves both descriptive and prescriptive analytics, though it is primarily prescriptive.
Copyright © 2013 Pearson Education, Inc. publishing as Prentice Hall

The Scope of Data Mining
Some common approaches to data mining:
• Association: analyze data to identify natural associations among variables and create rules for target marketing or buying recommendations
– Netflix uses association to understand what types of movies a customer likes and provides recommendations based on the data.
– Amazon makes recommendations based on past purchases.
– Supermarket loyalty cards collect data on customer purchase habits and print coupons based on what was just bought.

The Scope of Data Mining
Some common approaches to data mining:
• Clustering
– Similar to classification, but used when no groups have been defined; finds groupings within the data
– Example: An insurance company could use clustering to group clients by their age, location, and types of insurance purchased.
– The categories are unspecified; this is referred to as ‘unsupervised learning’.

The Scope of Data Mining
Some common approaches to data mining:
• Classification: analyze data to predict how to classify new elements
– Spam filtering by examining textual characteristics of a message
– Predicting whether a credit-card transaction may be fraudulent
– Is a loan application high risk?
– Will a consumer respond to an ad?

Association Rule Mining
• Association rule mining (affinity analysis) seeks to uncover associations in large data sets.
• Association rules identify attributes that occur together frequently in a given data set.
• Market basket analysis, for example, is used to determine groups of items consumers tend to purchase together.
• Association rules provide information in the form of if-then (antecedent-consequent) statements.
• The rules are probabilistic in nature.
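
The idea of items that "occur together frequently" can be illustrated with a small counting sketch. The transactions below are invented for illustration and are not from the slides; the function name `pair_counts` is likewise hypothetical. It only counts co-occurring pairs, which is a first step toward association rules rather than a full rule-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "eggs"},
    {"bread", "butter", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for if-then rules; the strength measures on the later slides would then be applied to them.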

Association Rule Mining
Custom Computer Configuration (PC Purchase Data)
Suppose we want to know which PC components are often ordered together.
[Figure: PC Purchase Data]

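Association Rule Mining
Measuring the Strength of Association Rules
• Support for the (association) rule is the percentage (or number) of transactions that include all items, both antecedent and consequent.
• Confidence of the (association) rule is the conditional probability that a transaction containing the antecedent also contains the consequent: confidence = (number of transactions with both antecedent and consequent) / (number of transactions with the antecedent).
• Lift is the ratio of confidence to expected confidence, where expected confidence is the share of all transactions that contain the consequent.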

Association Rule Mining
Measuring Strength of Association
• A supermarket database has 100,000 point-of-sale transactions:
– 2,000 include both A and B items
– 5,000 include C
– 800 include A, B, and C
• Association rule: If A and B are purchased, then C is also purchased.
• Support = 800/100,000 = 0.008
• Confidence = 800/2,000 = 0.40
• Expected confidence = 5,000/100,000 = 0.05
• Lift = 0.40/0.05 = 8
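
The supermarket arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `rule_metrics` and its parameter names are assumptions made for illustration, and only the four counts come from the slide.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return support, confidence, expected confidence, and lift for one rule.

    n_total      -- all transactions
    n_antecedent -- transactions containing the antecedent (A and B)
    n_consequent -- transactions containing the consequent (C)
    n_both       -- transactions containing antecedent and consequent
    """
    support = n_both / n_total
    confidence = n_both / n_antecedent
    expected_confidence = n_consequent / n_total
    lift = confidence / expected_confidence
    return support, confidence, expected_confidence, lift

# Counts taken from the supermarket example on the slide.
support, confidence, expected, lift = rule_metrics(
    n_total=100_000, n_antecedent=2_000, n_consequent=5_000, n_both=800
)
print(f"support={support:.3f} confidence={confidence:.2f} "
      f"expected={expected:.2f} lift={lift:.1f}")
```

A lift well above 1, as here, suggests the antecedent makes the consequent considerably more likely than it is in the database overall.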

Association Rule Mining (continued)
Identifying Association Rules for PC Purchase Data
[Figure: association rules for the PC Purchase Data]

Association Rule Mining
Example (continued): Identifying Association Rules for PC Purchase Data
[Figure: association rules for the PC Purchase Data]
Rules are sorted by their Lift Ratio (how much more likely one is to purchase the consequent if they purchase the antecedents).

Cluster Analysis
• Similar to classification, but used when no groups have been defined; finds groupings within the data
• Cluster analysis has many powerful uses, such as market segmentation.
• You can view each individual record’s predicted cluster membership.
• Also called data segmentation
• Two major methods:
1. Hierarchical clustering
a) Agglomerative methods (used in XLMiner) proceed as a series of fusions
2. k-means clustering (available in XLMiner) partitions data into k clusters so that each element belongs to the cluster with the closest mean
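
The k-means idea (each element belongs to the cluster with the closest mean) can be sketched in plain Python. This is an illustrative sketch of the standard assign-then-update loop, not XLMiner's implementation; the sample points are invented, and starting from the first k points is a simplifying assumption.

```python
import math

def assign(points, centers):
    """Put each point into the cluster whose center (mean) is closest."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, k, max_iterations=100):
    """Alternate assignment and mean updates until the centers stop moving."""
    centers = list(points[:k])  # assumption: naive start from the first k points
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[j]  # keep the old center if a cluster is empty
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical two-dimensional records, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
print(centers)
print(clusters)
```

Results of k-means can depend on the starting centers, so practical tools usually try several starts.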

Cluster Analysis – Agglomerative Methods
• Dendrogram: a diagram illustrating fusions or divisions at successive stages
• Objects “closest” in distance to each other are gradually joined together.
• Euclidean distance is the most commonly used measure of the distance between objects.
[Figure 12.2: Euclidean distance]
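
Euclidean distance is the square root of the sum of squared differences between two objects' coordinates. The sketch below computes it and then picks the closest pair, which is the pair an agglomerative method would fuse first. The three labeled objects are hypothetical.

```python
import math
from itertools import combinations

def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# Hypothetical objects, invented for illustration.
objects = {"A": (1.0, 2.0), "B": (2.0, 2.0), "C": (6.0, 5.0)}

# The closest pair is the first candidate for fusion.
closest = min(
    combinations(objects, 2),
    key=lambda pair: euclidean_distance(objects[pair[0]], objects[pair[1]]),
)
print(closest)  # ('A', 'B')
```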

Clustering Colleges and Universities
• Cluster the Colleges and Universities data using the five numeric columns in the data set.
• Use the hierarchical method.
[Figure 12.3]

This process of agglomeration leads to the construction of a dendrogram, a tree-like diagram that summarizes the process of clustering. For any given number of clusters, we can determine the records in each cluster by sliding a horizontal line (ruler) up and down the dendrogram until the number of vertical lines it crosses equals the number of clusters desired.
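
A bare-bones version of agglomeration can be written in a few lines. In this sketch, stopping the merges once k clusters remain plays the same role as placing the horizontal line on the dendrogram. The slides do not say which linkage rule is used, so single linkage (distance between the closest members) is an assumption here, and the points are invented.

```python
import math
from itertools import combinations

def single_linkage(cluster_a, cluster_b):
    """Distance between two clusters = distance of their closest members."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerate(points, k):
    """Fuse the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]   # every record starts as its own cluster
    merge_heights = []                 # dissimilarity at each fusion
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merge_heights.append(single_linkage(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                # safe because i < j
    return clusters, merge_heights

# Hypothetical records, invented for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0), (9.0, 1.0)]
clusters, heights = agglomerate(points, k=3)
print(clusters)
print(heights)
```

The recorded merge heights correspond to the bar heights in a dendrogram: each one is the dissimilarity between the two clusters being fused.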

Clustering of Colleges (continued)
Hierarchical clustering results: Dendrogram (from Figure 12.8)
• Smaller clusters “agglomerate” into bigger ones, with the least possible loss of cohesiveness at each stage.
• The height of the bars is a measure of the dissimilarity between the clusters that are merging into one.

Clustering of Colleges (continued)
Hierarchical clustering results: Predicted clusters
[Figure: predicted clusters]

Clustering of Colleges (continued)
Hierarchical clustering results: Predicted clusters
[Figure 12.9: predicted clusters; columns: Cluster, # Colleges]

Clustering of Colleges (continued)
Hierarchical clustering results for clusters 3 and 4
• Schools in cluster 3 appear similar.
• Cluster 4 has considerably higher Median SAT and Expenditures/Student.

Classification
• Recognizes patterns that describe the group to which an item belongs
• We will analyze the Credit Approval Decisions data to predict how to classify new elements.
• Categorical variable of interest: Decision (whether to approve or reject a credit application)
• Predictor variables: shown in columns A-E
[Figure: Credit Approval Decisions data]

Classification
Modified Credit Approval Decisions
The categorical variables are coded as numeric:
• Homeowner: 0 if No, 1 if Yes
• Decision: 0 if Reject, 1 if Approve
[Figure: modified Credit Approval Decisions data]
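
The 0/1 coding described above amounts to a lookup per categorical column. The sketch below uses two made-up records; only the column names and the coding scheme come from the slide.

```python
# Coding scheme from the slide.
HOMEOWNER_CODES = {"No": 0, "Yes": 1}
DECISION_CODES = {"Reject": 0, "Approve": 1}

# Hypothetical records, invented for illustration.
records = [
    {"Homeowner": "Yes", "Decision": "Approve"},
    {"Homeowner": "No", "Decision": "Reject"},
]

def encode(record):
    """Return a copy of the record with categorical values coded as 0/1."""
    coded = dict(record)
    coded["Homeowner"] = HOMEOWNER_CODES[record["Homeowner"]]
    coded["Decision"] = DECISION_CODES[record["Decision"]]
    return coded

coded_records = [encode(r) for r in records]
print(coded_records)
```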

Classification
Using Training and Validation Data
• Data mining projects typically involve large volumes of data.
• The data can be partitioned into:
▪ training data set: has known outcomes and is used to “teach” the data-mining algorithm
▪ validation data set: used to fine-tune a model
▪ test data set: tests the accuracy of the model
• In XLMiner, partitioning can be random or user-specified.

Classification (continued)
Partitioning Data Sets in XLMiner
• Partitioning choices when choosing random:
1. Automatic: 60% training, 40% validation
2. Specify %: 50% training, 30% validation, 20% test (training and validation % can be modified)
3. Equal # records: 33.33% each for training, validation, and test
• XLMiner has size and relative size limitations on the data sets, which can affect the amount and % of data assigned to the data sets.
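
Random partitioning can be sketched outside XLMiner as a shuffle followed by slicing. This is an illustration of the idea only, not how XLMiner does it; the function name `partition`, the fixed seed, and the use of record numbers 0 to 99 as stand-in data are all assumptions.

```python
import random

def partition(records, fractions, seed=0):
    """Shuffle the records and split them according to the given fractions.

    The last part receives whatever remains, so rounding never loses a record.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    parts, start = [], 0
    for fraction in fractions[:-1]:
        end = start + round(fraction * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])
    return parts

# 60% training, 40% validation (the "Automatic" proportions).
train, valid = partition(range(100), [0.6, 0.4])
print(len(train), len(valid))

# 50% training, 30% validation, 20% test (the "Specify %" defaults).
train3, valid3, test_set = partition(range(100), [0.5, 0.3, 0.2])
print(len(train3), len(valid3), len(test_set))
```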

Classification Techniques
Three Data-Mining Approaches to Classification:
1. k-Nearest Neighbors (k-NN) Algorithm: find records in a database that have similar numerical values of a set of predictor variables
2. Discriminant Analysis (what we will do): use predefined classes based on a set of linear discriminant functions of the predictor variables
3. Logistic Regression: estimate the probability of belonging to a category using a regression on the predictor variables
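
Of the three approaches, k-NN is the simplest to sketch: find the k training records closest to the new record and take a majority vote of their classes. The training records and predictor values below are invented for illustration, and Euclidean distance on unscaled predictors is an assumption; in practice predictors are usually standardized first so that no single variable dominates the distance.

```python
import math
from collections import Counter

def knn_classify(training, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training records."""
    neighbors = sorted(training, key=lambda rec: math.dist(rec[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical (predictor values, class) pairs, invented for illustration.
training = [
    ((1.0, 1.0), "Reject"),
    ((1.5, 2.0), "Reject"),
    ((2.0, 1.0), "Reject"),
    ((6.0, 6.0), "Approve"),
    ((7.0, 5.5), "Approve"),
    ((6.5, 7.0), "Approve"),
]

print(knn_classify(training, (6.2, 6.1)))  # Approve
print(knn_classify(training, (1.2, 1.4)))  # Reject
```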

Classification Techniques (continued)
Using Discriminant Analysis for Classifying New Data
[Figure: discriminant analysis results]
Half of the applicants are in the “Approved” class.