Introduction to Data Mining: Mining Association Rules. Reference: Tan et al., Introduction to Data Mining. Some slides are adapted from Tan et al.

What is Data Mining? Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

What is Data Mining? (cont.) Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns. Data mining falls into the broad field of knowledge discovery.

The data mining process

Origins of Data Mining Draws ideas from machine learning/artificial intelligence, pattern recognition, statistics, and database systems. Human analysis and traditional techniques may be unsuitable due to –Enormity of data –High dimensionality of data –Heterogeneous data –Distributed nature of data

Data Mining Tasks Prediction Methods –Use some variables to predict unknown or future values of other variables. Description Methods –Find human-interpretable patterns that describe the data.

Data Mining Tasks... Association Rule Discovery [Descriptive] Clustering [Descriptive] Classification [Predictive] (for discrete variables) Sequential Pattern Discovery [Descriptive] Regression [Predictive] (for continuous variables) Deviation Detection [Predictive]

Association Rule Discovery: Definition Given a set of records, each of which contains some number of items from a given collection: –Produce dependency rules that predict the occurrence of an item based on occurrences of other items. Rules discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}

Association Rule Discovery: Sample Application Supermarket shelf management. –Goal: To identify items that are bought together by sufficiently many customers. –Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.

Clustering: Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that –Data points in one cluster are more similar to one another. –Data points in separate clusters are less similar to one another. Similarity Measures: –Euclidean Distance if attributes are continuous. –Other Problem-specific Measures.
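
A minimal clustering sketch to make the definition concrete, assuming scikit-learn is available; the toy customer features and the choice of three clusters are my own illustration, not from the slides.

```python
# Minimal clustering sketch (assumes scikit-learn is installed).
# The features and cluster count below are illustrative, not from the slides.
import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual income (k$), weekly store visits]
X = np.array([[25, 1], [27, 2], [90, 5], [95, 6], [60, 3], [62, 3]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each data point
print(kmeans.cluster_centers_)  # centroid of each cluster
```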

Illustrating Clustering (figure): Euclidean distance-based clustering in 3-D space.

Clustering: Sample Application 1 Market Segmentation: –Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. –Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.

Clustering: Sample Application 2 Document Clustering: –Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. –Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. –Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Classification: Definition Given a collection of records (training set) –Each record contains a set of attributes; one of the attributes is the class. Find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. –A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
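
A minimal sketch of the train/test workflow described above, assuming scikit-learn; the Iris data and the decision-tree model are stand-ins chosen for illustration, not taken from the slides.

```python
# Minimal train/test classification sketch (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Split the labelled records into a training set (build the model)
# and a test set (estimate its accuracy on unseen records).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```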

Classification and Clustering Classification: –Classes pre-defined –Uses training set (thus also known as supervised learning) Clustering: –Classes not defined in advance –No training set (thus also known as unsupervised learning)

Classification: Example

Classification: Sample Application 1 Direct Marketing –Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. –Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. –Type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model.

Classification: Sample Application 2 Fraud Detection –Goal: Predict fraudulent cases in credit card transactions. –Approach: Use credit card transactions and the information on the account holder as attributes. –When does a customer buy, what do they buy, how often do they pay on time, etc. Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.

Sequential Pattern Discovery: Definition Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

Regression Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Extensively studied in statistics and the neural network field. Examples: –Predicting sales amounts of a new product based on advertising expenditure. –Predicting wind velocities as a function of temperature, humidity, air pressure, etc. –Time series prediction of stock market indices.
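
A minimal regression sketch for the first example above (sales predicted from advertising expenditure), assuming NumPy; the numbers are made up for illustration.

```python
# Minimal linear-regression sketch (assumes NumPy); the data are invented.
import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # expenditure
sales       = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # observed sales

# Fit sales ~ a * advertising + b by least squares.
a, b = np.polyfit(advertising, sales, deg=1)
print(f"predicted sales at expenditure 6.0: {a * 6.0 + b:.2f}")
```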

Deviation/Anomaly Detection Detect significant deviations from normal behavior Applications: –Credit Card Fraud Detection –Network Intrusion Detection

Mining Association Rules

Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. Market-basket transactions (example table not reproduced in this transcript). Examples of association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. Implication means co-occurrence, not causality!

Definition: Frequent Itemset Itemset –A collection of one or more items Example: {Milk, Bread, Diaper} –k-itemset: an itemset that contains k items Support count (σ) –Frequency of occurrence of an itemset –E.g. σ({Milk, Bread, Diaper}) = 2 Support (s) –Fraction of transactions that contain an itemset –E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset –An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule Association Rule –An implication expression of the form X → Y, where X and Y are itemsets –Example: {Milk, Diaper} → {Beer} Rule Evaluation Metrics –Support (s): fraction of transactions that contain both X and Y –Confidence (c): measures how often items in Y appear in transactions that contain X
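
Both metrics can be computed directly from their definitions. The sketch below assumes the usual five-transaction market-basket table from Tan et al. (the table itself did not survive in this transcript) and reproduces s = 0.4 and c = 0.67 for {Milk, Diaper} → {Beer}, the values quoted on the next slides.

```python
# Support and confidence computed from their definitions, using the
# five-transaction market-basket example assumed from Tan et al.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):                # sigma(X): transactions containing X
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y) / len(transactions)   # support of the rule X -> Y
c = support_count(X | Y) / support_count(X)    # confidence of the rule X -> Y
print(f"s = {s}, c = {c:.2f}")                 # s = 0.4, c = 0.67
```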

Association Rule Mining Task Given a set of transactions T, the goal of association rule mining is to find all rules having –support ≥ minsup threshold –confidence ≥ minconf threshold Brute-force approach: –List all possible association rules –Compute the support and confidence for each rule –Prune rules that fail the minsup and minconf thresholds ⇒ Computationally prohibitive!

Mining Association Rules Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67) {Milk, Beer} → {Diaper} (s=0.4, c=1.0) {Diaper, Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk, Diaper} (s=0.4, c=0.67) {Diaper} → {Milk, Beer} (s=0.4, c=0.5) {Milk} → {Diaper, Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.

Mining Association Rules Two-step approach: 1. Frequent Itemset Generation –Generate all itemsets whose support ≥ minsup 2. Rule Generation –Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Generation Brute-force approach: –Each itemset in the lattice is a candidate frequent itemset –Count the support of each candidate by scanning the database –Match each transaction against every candidate –Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width ⇒ expensive, since M = 2^d !!!
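
A brute-force sketch of the approach above on the same five transactions: every non-empty itemset is a candidate and each candidate is checked against every transaction, which is exactly the O(NMw) cost being criticized.

```python
# Brute-force support counting: every one of the 2^d - 1 non-empty itemsets
# is a candidate and each is matched against every transaction.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))   # d = 6 distinct items

support = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):                       # M ~ 2^d candidates
        cset = set(candidate)
        support[candidate] = sum(cset <= t for t in transactions)  # one pass over N transactions

minsup = 3
print([c for c, s in support.items() if s >= minsup])
```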

Computational Complexity Given d unique items: –Total number of itemsets = 2^d –Total number of possible association rules: R = 3^d - 2^(d+1) + 1. If d = 6, R = 602 rules.
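
The 602 figure can be checked by direct enumeration: for each non-empty antecedent X, any non-empty subset of the remaining items can serve as the consequent, which sums to the closed form above.

```python
# Quick check of the rule count R = 3^d - 2^(d+1) + 1 for d = 6 by enumeration.
from itertools import combinations

def rule_count(d):
    items = range(d)
    count = 0
    for k in range(1, d + 1):                  # size of the rule's antecedent X
        for X in combinations(items, k):
            remaining = [i for i in items if i not in X]
            # any non-empty subset of the remaining items can be the consequent Y
            count += 2 ** len(remaining) - 1
    return count

print(rule_count(6), 3**6 - 2**7 + 1)   # both print 602
```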

Frequent Itemset Generation Strategies Reduce the number of candidates (M) –Complete search: M = 2^d –Use pruning techniques to reduce M Reduce the number of transactions (N) –Reduce size of N as the size of itemset increases –Used by DHP and vertical-based mining algorithms Reduce the number of comparisons (NM) –Use efficient data structures to store the candidates or transactions –No need to match every candidate against every transaction

Reducing Number of Candidates Apriori principle: –If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: –Support of an itemset never exceeds the support of its subsets –This is known as the anti-monotone property of support

Illustrating Apriori Principle (figure: the itemset lattice; once an itemset is found to be infrequent, all of its supersets are pruned)

Illustrating Apriori Principle (minimum support count = 3). Items (1-itemsets), then pairs (2-itemsets) – no need to generate candidates involving Coke or Eggs – then triplets (3-itemsets). If every subset is considered, 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates; with support-based pruning, 6 + 6 + 1 = 13.

Apriori Algorithm Method: –Let k=1 –Generate frequent k-itemsets –Repeat until no new frequent itemsets are identified Generate candidate (k+1)-itemsets from frequent k-itemsets Prune candidate itemsets containing subsets that are infrequent Count the support of each candidate by scanning the DB Eliminate candidates that are infrequent, leaving only those that are frequent

Example: Apriori method for finding frequent itemsets. Transactions:

Transaction ID  Items
01              milk, bread, cookies, juice
02              milk, bread, juice
03              milk, eggs
04              bread, cookies, coffee

Find itemsets with support >= 50% (support count >= 2):
1. Frequent 1-itemsets: {bread}, {cookies}, {juice}, {milk}
2. Candidate 2-itemsets: {bread, cookies}, {bread, juice}, {bread, milk}, {cookies, juice}, {cookies, milk}, {juice, milk}
3. Frequent 2-itemsets: {bread, cookies}, {bread, juice}, {bread, milk}, {juice, milk}
4. Candidate 3-itemsets: {bread, cookies, juice}, {bread, cookies, milk}, {bread, juice, milk}
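
A compact sketch of the Apriori loop from the previous slide, run on this four-transaction example with minsup = 50% (support count >= 2); the function and variable names are my own.

```python
# Apriori sketch (names are my own, not from the slides), run on the
# four-transaction example above with minsup = 50% (count >= 2).
from itertools import combinations

transactions = [
    {"milk", "bread", "cookies", "juice"},   # 01
    {"milk", "bread", "juice"},              # 02
    {"milk", "eggs"},                        # 03
    {"bread", "cookies", "coffee"},          # 04
]
minsup_count = 2

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# k = 1: frequent 1-itemsets
items = set().union(*transactions)
frequent = [{frozenset([i]) for i in items if support_count({i}) >= minsup_count}]

k = 1
while frequent[-1]:
    # generate candidate (k+1)-itemsets by joining frequent k-itemsets
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    # prune candidates that contain an infrequent k-subset
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
    # count support by scanning the transactions; keep only the frequent ones
    frequent.append({c for c in candidates if support_count(c) >= minsup_count})
    k += 1

for level, itemsets in enumerate(frequent, start=1):
    for s in sorted(map(sorted, itemsets)):
        print(level, s)
```

Running it reproduces the lists above, with only {bread, juice, milk} surviving pruning and support counting at the 3-itemset level.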

Counting the Support Support counting: –Scan the database of transactions to determine the support of each candidate itemset –To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets (details omitted)
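
Tan et al. describe a hash tree for this step; as a simplified stand-in, the sketch below hashes candidates in a plain dictionary and matches each transaction only against its own k-subsets rather than against every candidate.

```python
# Support counting with candidates stored in a hash structure (a plain dict
# here; the hash tree of Tan et al. is simplified away on purpose).
from itertools import combinations

def count_support(transactions, candidates, k):
    counts = {c: 0 for c in candidates}            # candidates hashed by itemset
    for t in transactions:
        for subset in combinations(sorted(t), k):  # only the k-subsets of this transaction
            key = frozenset(subset)
            if key in counts:                      # hash lookup, no scan over all candidates
                counts[key] += 1
    return counts

transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "bread", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
candidates = [frozenset(c) for c in [("bread", "cookies"), ("bread", "juice"), ("cookies", "milk")]]
print(count_support(transactions, candidates, k=2))
```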

Rule Generation for Apriori Algorithm (figure: the lattice of rules for one frequent itemset; a low-confidence rule and the rules below it are pruned)

Rule Generation for Apriori Algorithm A candidate rule is generated by merging two rules that share the same prefix in the rule consequent: join(CD=>AB, BD=>AC) would produce the candidate rule D=>ABC. Prune rule D=>ABC if its subset rule AD=>BC does not have high confidence.
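
A sketch of this level-wise rule generation for a single frequent itemset; the helper names are my own, and the pruning mirrors the merge-and-extend-only-high-confidence-consequents idea described above.

```python
# Rule generation from one frequent itemset, growing consequents level-wise
# (a sketch of the idea on the slide; helper names are my own).
from itertools import combinations

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def generate_rules(freq_itemset, transactions, minconf):
    freq_itemset = frozenset(freq_itemset)
    rules = []
    # start with rules having a 1-item consequent
    consequents = [frozenset([i]) for i in freq_itemset]
    while consequents:
        kept = []
        for Y in consequents:
            X = freq_itemset - Y
            if not X:
                continue
            conf = support_count(freq_itemset, transactions) / support_count(X, transactions)
            if conf >= minconf:
                rules.append((set(X), set(Y), conf))
                kept.append(Y)          # only high-confidence consequents are extended
        # merge kept consequents that differ in one item to build the next level
        consequents = list({a | b for a in kept for b in kept if len(a | b) == len(a) + 1})
    return rules
```

For example, generate_rules({"bread", "juice", "milk"}, transactions, minconf=2/3) on the four-transaction table used earlier returns, among others, {milk} → {bread, juice} with confidence 2/3.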

Example: Generating Rules Find rules with confidence ≥ 2/3 from the frequent itemset {bread, milk, diaper}: 1. {bread, milk} → {diaper} (confidence 2/3), {bread, diaper} → {milk} (2/3), {diaper, milk} → {bread} (2/3) – all three meet the threshold. 2. {bread} → {diaper, milk} (confidence 2/4) – below the threshold, so it is pruned.
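
To double-check the fractions above, the snippet below recomputes them, assuming the slide refers to the same five-transaction market-basket table used earlier in these notes.

```python
# Recomputing the slide's confidences against the assumed five-transaction table.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

whole = {"bread", "milk", "diaper"}
for X in [{"bread", "milk"}, {"bread", "diaper"}, {"diaper", "milk"}, {"bread"}]:
    print(X, "->", whole - X, f"{sigma(whole)}/{sigma(X)}")
# prints confidences 2/3, 2/3, 2/3 and 2/4 -- only the first three meet minconf = 2/3
```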

Next: clustering and classification.