Clustering Algorithms: Minimize Distance, But to Centers of Groups

Presentation transcript:

Clustering Algorithms: Minimize Distance, But to Centers of Groups

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

Clustering
First need to identify clusters
–Can be done automatically
–Often clusters determined by problem
Then simple matter to measure distance from new observation to each cluster
–Use same measures as with memory-based reasoning

Partitioning
Define new categorical variables
–Divide data into fixed number (k) of regions
–K-means clustering

Clustering Uses
Segment customers
–Find profitability of each, treat accordingly
Star classification:
–Red giants, white dwarfs, normal
–Brightness & temperature used to classify
U.S. Army
–Identify sizes needed for female soldiers
–(males – one size fits all)

Tires
Segment customers into product categories
–High end (they would buy Michelins)
–Intermediate & Low
Standardize data (as in memory-based reasoning)

Raw Tire Data
BRAND           INCOME      AGE OF CAR
Michelin        $182,200    5 months
Michelin        $171,200    3 years
Goodyear        $28,800     7 years
Goodyear        $37,800     6 years
Goodyear        $42,200     5 years
Goodyear        $55,600     4 years
Goodyear        $51,200     9 years
Goodyear        $173,400    7 years
Opie’s tires    $13,400     3 years
Opie’s tires    $68,800     6 years

Standardize
INCOME
–MIN(1, INCOME/200000)
AGE OF CAR
–IF (AGE OF CAR) < 12 months, 1
–ELSE MIN[(8 - Years)/7, 1]
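The two standardization rules can be sketched in a few lines, applied to the raw tire data from the earlier slide. The function and variable names are illustrative assumptions, and ages are expressed in years (5 months becomes 5/12). The rule is implemented exactly as written on the slide, so a car older than 8 years gets a negative score; whether the original analysis clipped such values at 0 is not stated, so no clipping is applied here.

```python
# Raw tire data from the slide: (brand, income in dollars, car age in years).
RAW_TIRE_DATA = [
    ("Michelin", 182_200, 5 / 12),  # 5 months
    ("Michelin", 171_200, 3),
    ("Goodyear", 28_800, 7),
    ("Goodyear", 37_800, 6),
    ("Goodyear", 42_200, 5),
    ("Goodyear", 55_600, 4),
    ("Goodyear", 51_200, 9),
    ("Goodyear", 173_400, 7),
    ("Opie's tires", 13_400, 3),
    ("Opie's tires", 68_800, 6),
]


def standardize_income(income):
    """MIN(1, INCOME / 200000): incomes of $200,000 or more all map to 1."""
    return min(1.0, income / 200_000)


def standardize_age(years):
    """1 for cars under 12 months old, otherwise MIN((8 - years) / 7, 1)."""
    if years < 1:
        return 1.0
    # As written on the slide; negative for cars older than 8 years.
    return min((8 - years) / 7, 1.0)


standardized = [
    (brand, standardize_income(income), standardize_age(years))
    for brand, income, years in RAW_TIRE_DATA
]

for brand, income_score, age_score in standardized:
    print(f"{brand:<14}{income_score:8.3f}{age_score:8.3f}")
```

Both scores end up on a comparable scale with an upper bound of 1, so neither variable dominates the distance calculation simply because of its units.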

Sort Data by Outcome
BRAND           INCOME         AGE OF CAR
Michelin        High income    Bought this year
Michelin        High income    Bought 1-3 yrs ago
Goodyear        Low income     Bought 4+ yrs ago
Goodyear        Low income     Bought 4+ yrs ago
Goodyear        Low income     Bought 4+ yrs ago
Goodyear        Avg income     Bought 1-3 yrs ago
Goodyear        Avg income     Bought 4+ yrs ago
Goodyear        High income    Bought 4+ yrs ago
Opie’s tires    Low income     Bought 1-3 yrs ago
Opie’s tires    Avg income     Bought 4+ yrs ago

Standardized Training Data
[INCOME and AGE OF CAR values not preserved]
BRAND           INCOME    AGE OF CAR
Michelin
Michelin
Goodyear
Goodyear
Goodyear
Goodyear
Goodyear
Goodyear
Opie’s tires
Opie’s tires

Identify Cluster Means (could use median, mode)
[INCOME and CAR AGE values not preserved]
BRAND           INCOME    CAR AGE
Michelin
Goodyear
Opie’s tires

New Case #1
From new data (could be test set or new observations to classify)
Squared distance to each centroid
Michelin: 0.840
Goodyear: 0.025
Opie’s tires: 0.047
So minimum distance to Goodyear

New Case #2
Squared distance to each centroid
Michelin: 0.634
Goodyear: 0.255
Opie’s tires: 0.057
So minimum distance to Opie’s
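The classification step in the two new cases can be sketched as follows: compute one centroid (mean) per brand from the standardized training data, then assign a new observation to the brand whose centroid has the smallest squared distance. The attribute values of the slides' two new cases are not given, so the two observations below are hypothetical, and the distances this sketch produces are not expected to match the figures on the slides. Function names are illustrative, and the standardization follows the rules on the Standardize slide as written.

```python
from collections import defaultdict

# Raw tire data from the slides: (brand, income in dollars, car age in years).
RAW_TIRE_DATA = [
    ("Michelin", 182_200, 5 / 12),
    ("Michelin", 171_200, 3),
    ("Goodyear", 28_800, 7),
    ("Goodyear", 37_800, 6),
    ("Goodyear", 42_200, 5),
    ("Goodyear", 55_600, 4),
    ("Goodyear", 51_200, 9),
    ("Goodyear", 173_400, 7),
    ("Opie's tires", 13_400, 3),
    ("Opie's tires", 68_800, 6),
]


def standardize(income, years):
    """Apply the slide's rules: MIN(1, income/200000) and the car-age rule."""
    income_score = min(1.0, income / 200_000)
    age_score = 1.0 if years < 1 else min((8 - years) / 7, 1.0)
    return (income_score, age_score)


def brand_centroids(rows):
    """Mean standardized (income, age) point for each brand."""
    groups = defaultdict(list)
    for brand, income, years in rows:
        groups[brand].append(standardize(income, years))
    return {
        brand: tuple(sum(col) / len(pts) for col in zip(*pts))
        for brand, pts in groups.items()
    }


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_brand(point, centroids):
    """Brand whose centroid is closest to the standardized point."""
    return min(centroids, key=lambda brand: squared_distance(point, centroids[brand]))


centroids = brand_centroids(RAW_TIRE_DATA)

# Hypothetical new observations (not the cases from the slides).
for income, years in [(45_000, 6), (190_000, 0.5)]:
    point = standardize(income, years)
    distances = {b: round(squared_distance(point, c), 3) for b, c in centroids.items()}
    print(income, years, distances, "->", nearest_brand(point, centroids))
```

Because each new case is compared with only three centroids rather than with every training record, the lookup stays cheap as the training set grows.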

Software Methods
Hierarchical clustering
–Number of clusters unspecified a priori
–Two-step a form of hierarchical clustering
K-means clustering
Self-organizing maps
–Neural network
Hybrids combine methods

Application: Credit Cards
Credit scoring critical
Use past applicants; develop model to predict payback
–Look for indicators providing early warning of trouble

British Credit Card Company
Monthly account status – over 90 thousand customers, one year operations
Outcome variable STATE: cumulative months of missed payments (integer)
–Some errors & missing data (eliminated observations)
–Biased sample of 10 thousand observations
–Required initial STATE of 0

British Credit Card Company
Compared clustering approaches with pattern detection method
Used medians rather than centroids
–More stable
–Partitioned data
Clustering useful for general profile behavior
Pattern search method sought local clusters
–Unable to partition entire data set
–Identified a few groups with unusual behavior
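A small illustration of why medians can be "more stable" than centroids: a single extreme observation pulls the mean of a cluster much further than it pulls the median. The data points are hypothetical, and the coordinate-wise median used here is an assumption; the slide does not say how the study computed its medians.

```python
from statistics import mean, median


def center(points, stat):
    """Apply a statistic (mean or median) to each coordinate separately."""
    return tuple(stat(col) for col in zip(*points))


def shift(before, after):
    """Largest absolute change in any coordinate between two centers."""
    return max(abs(a - b) for a, b in zip(before, after))


# Hypothetical standardized cluster members, then the same cluster plus one outlier.
cluster = [(0.20, 0.30), (0.22, 0.28), (0.25, 0.31), (0.21, 0.33)]
with_outlier = cluster + [(0.95, 0.90)]

mean_shift = shift(center(cluster, mean), center(with_outlier, mean))
median_shift = shift(center(cluster, median), center(with_outlier, median))

print(f"centroid moved by {mean_shift:.3f}, median moved by {median_shift:.3f}")
```

In this example the centroid moves far more than the median when the outlier is added, which is the kind of stability the slide attributes to medians.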

Insurance Claim Application
Large data warehouse of financial transactions & claims
Customer retention very important
–Recent heavy growth in policies
–Decreased profitability
Used clustering to analyze claim patterns
–Wanted hidden trends & patterns

Insurance Claim Mining
Undirected knowledge discovery
–Cluster analysis to identify risk categories
Data for
–Quarterly data
–Claims for prior 12 months
–Contribution to profit of each policy
–Over 100,000 samples
–Heavy growth in young people with expensive automobiles
–Transformed data to normalize, remove outliers

Insurance Claim Mining
Number of clusters
–Too few – no discrimination – best here was 50
–Used k-means algorithm to minimize squared error
Identified a few clusters with high claims frequency, unprofitability
Compared 1998 data with 1996 data to find trends
Developed model to predict new policy holder performance
–Used for pricing

Computational Constraints
Each cluster should have adequate sample size
Since cluster averages are used, cluster analysis not as sensitive to disproportional cluster sizes relative to matching
The more variables you have, the greater the computational complexity
–The curse of dimensionality
–(it won’t run in a reasonable time if you have too many variables)