Stats 202: Statistical Aspects of Data Mining Professor Rajan Patel

Presentation transcript:

Stats 202: Statistical Aspects of Data Mining
Professor Rajan Patel
Lecture 5 = Start Chapter 6
Agenda:
1) Reminder: Midterm is on Monday, July 14th
2) Lecture over Chapter 6

Announcement – Midterm Exam:
- The midterm exam will be on Monday, July 14, during the scheduled class time.
- It is best to take the exam in the classroom, even for SCPD students.
- Remote students who absolutely cannot come to the classroom that day should make arrangements with SCPD to take the exam with their proctor. You will submit the exam through Scoryst.
- You are allowed one 8.5 x 11 inch sheet (front and back) containing notes.
- No books or computers are allowed, but please bring a handheld calculator.
- The exam will cover the material that we covered in class and in the homeworks.

Introduction to Data Mining by Tan, Steinbach, Kumar
Chapter 6: Association Analysis

What is Association Analysis:
Association analysis uses a set of transactions to discover rules that indicate the likely occurrence of an item based on the occurrences of other items in the transaction.
Examples:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
Industry Examples:
- Netflix, Amazon – related videos
- Safeway: coupons for products

Definitions:
Itemset
- A collection of one or more items
- Example: {Milk, Bread, Diaper}
- k-itemset = an itemset that contains k items
Support count (σ)
- Frequency of occurrence of an itemset
- E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
- Fraction of transactions that contain an itemset
- E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
- An itemset whose support is greater than or equal to a minsup threshold
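The definitions above can be checked with a few lines of Python. This is only a sketch. The slide's transaction table is not in the transcript, so the five baskets below are an assumption, chosen because they reproduce σ = 2 and s = 2/5 for {Milk, Bread, Diaper}. The helper names `support_count`, `support` and `is_frequent` are illustrative.

```python
# ASSUMPTION: the slide's transaction table is missing from the transcript.
# These five baskets are illustrative and reproduce the quoted values.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for basket in transactions if items <= basket)


def support(itemset, transactions):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent when its support is at least minsup."""
    return support(itemset, transactions) >= minsup


example = {"Milk", "Bread", "Diaper"}
print(support_count(example, TRANSACTIONS))  # 2
print(support(example, TRANSACTIONS))        # 0.4
```

With a minsup of 0.4, `is_frequent(example, TRANSACTIONS, 0.4)` is true for this itemset, since its support equals the threshold.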

Another Definition:
Association Rule
- An implication expression of the form X → Y, where X and Y are disjoint itemsets
- Example: {Milk, Diaper} → {Beer}

Even More Definitions:
Association Rule Evaluation Metrics
- Support (s): fraction of transactions that contain both X and Y
- Confidence (c): measures how often the items in Y appear in transactions that contain X
Example: {Milk, Diaper} → {Beer}
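Written as formulas, the two metrics look as follows, with σ the support count and |T| the number of transactions (the |T| notation is an assumption here). The worked values for the example rule are not read from the slide's table, which is missing from the transcript. They are inferred from figures quoted elsewhere in the lecture: 5 transactions, s = 0.4 and c = 0.67 for this rule.

```latex
\begin{align*}
s(X \rightarrow Y) &= \frac{\sigma(X \cup Y)}{|T|}, &
c(X \rightarrow Y) &= \frac{\sigma(X \cup Y)}{\sigma(X)} \\[4pt]
s(\{\text{Milk, Diaper}\} \rightarrow \{\text{Beer}\}) &= \frac{2}{5} = 0.4, &
c(\{\text{Milk, Diaper}\} \rightarrow \{\text{Beer}\}) &= \frac{2}{3} \approx 0.67
\end{align*}
```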

In class exercise #19: Compute the support for itemsets {a}, {b, d}, and {a, b, d} by treating each transaction ID as a market basket.

In class exercise #20: Use the results in the previous problem to compute the confidence for the association rules {b, d} → {a} and {a} → {b, d}. State what these values mean in plain English.

In class exercise #21: Compute the support for itemsets {a}, {b, d}, and {a, b, d} by treating each customer ID as a market basket.

In class exercise #22: Use the results in the previous problem to compute the confidence for the association rules {b, d} → {a} and {a} → {b, d}. State what these values mean in plain English.

An Association Rule Mining Task:
Given a set of transactions T, find all rules having both
- support ≥ minsup threshold
- confidence ≥ minconf threshold
Brute-force approach:
- List all possible association rules
- Compute the support and confidence for each rule
- Prune rules that fail the minsup and minconf thresholds
- Problem: this is computationally prohibitive, because the number of possible rules grows exponentially with the number of items!
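A small Python sketch of the brute-force approach shows why it does not scale. The five baskets are an assumption (the slide's table is not in the transcript), and the function names are illustrative. With d items there are 3^d − 2^(d+1) + 1 candidate rules, because each item goes to X, to Y, or to neither, and neither side may be empty. That is already 602 rules for the six items used here.

```python
from itertools import combinations

# ASSUMPTION: illustrative baskets; the slide's table is not in the transcript.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, transactions):
    """Support count: number of baskets that contain the whole itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def brute_force_rules(transactions, minsup, minconf):
    """List every rule X -> Y, score it, and prune by both thresholds."""
    all_items = set().union(*transactions)
    n = len(transactions)
    candidates = 0
    kept = []
    for x in nonempty_subsets(all_items):
        for y in nonempty_subsets(all_items - x):
            candidates += 1
            x_count = count(x, transactions)
            if x_count == 0:
                continue  # confidence is undefined when X never occurs
            both = count(x | y, transactions)
            s = both / n
            c = both / x_count
            if s >= minsup and c >= minconf:
                kept.append((x, y, s, c))
    return candidates, kept


candidates, rules = brute_force_rules(TRANSACTIONS, minsup=0.4, minconf=0.6)
print(candidates, "candidate rules scored;", len(rules), "kept")
```

Every candidate requires a pass over all transactions, so the cost is the number of rules times the size of the data set.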

The Support and Confidence Requirements can be Decoupled
  {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
- Rules originating from the same itemset have identical support but can have different confidence
- Thus, we may decouple the support and confidence requirements
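The six rules can be regenerated by splitting {Milk, Diaper, Beer} into every antecedent/consequent pair. The sketch below assumes five illustrative baskets (the slide's table is not in the transcript) that reproduce the quoted s and c values. `rules_from_itemset` is a hypothetical helper name.

```python
from itertools import combinations

# ASSUMPTION: illustrative baskets; the slide's table is not in the transcript.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, transactions):
    """Support count: number of baskets that contain the whole itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def rules_from_itemset(itemset, transactions):
    """Yield (X, Y, support, confidence) for each binary partition X -> Y."""
    itemset = frozenset(itemset)
    whole = count(itemset, transactions)
    s = whole / len(transactions)  # identical for every partition
    for size in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), size):
            x = frozenset(combo)
            x_count = count(x, transactions)
            if x_count == 0:
                continue  # X never occurs, so confidence is undefined
            yield x, itemset - x, s, whole / x_count


for x, y, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, TRANSACTIONS):
    print(sorted(x), "->", sorted(y), f"s={s:.1f} c={c:.2f}")
```

The support is computed once, outside the loop. Only the denominator of the confidence changes from rule to rule, which is what allows the two requirements to be handled separately.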

Two Step Approach:
1) Frequent Itemset Generation = Generate all itemsets whose support ≥ minsup
2) Rule Generation = Generate high confidence (confidence ≥ minconf) rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Note: Frequent itemset generation is still computationally expensive, and your book discusses algorithms that can be used
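The two steps might be sketched in Python as below. Step 1 here is plain enumeration of every itemset, not one of the faster algorithms from the book, and the five baskets are an assumption because the slide's table is not in the transcript. The function names are illustrative.

```python
from itertools import combinations

# ASSUMPTION: illustrative baskets; the slide's table is not in the transcript.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def frequent_itemsets(transactions, minsup):
    """Step 1: map each itemset with support >= minsup to its support count."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            k = sum(1 for basket in transactions if itemset <= basket)
            if k / n >= minsup:
                frequent[itemset] = k
    return frequent


def generate_rules(frequent, n, minconf):
    """Step 2: binary-partition each frequent itemset, keep confident rules."""
    rules = []
    for itemset, k in frequent.items():
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                x = frozenset(combo)
                # X is a subset of a frequent itemset, so it is frequent too
                # and its count is already stored.
                c = k / frequent[x]
                if c >= minconf:
                    rules.append((x, itemset - x, k / n, c))
    return rules


frequent = frequent_itemsets(TRANSACTIONS, minsup=0.4)
rules = generate_rules(frequent, len(TRANSACTIONS), minconf=0.6)
for x, y, s, c in rules:
    print(sorted(x), "->", sorted(y), f"s={s:.1f} c={c:.2f}")
```

Step 2 never rescans the transactions. It only reuses the counts collected in step 1, so the expensive part is the itemset generation, as the note on the slide says.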

In class exercise #23: Use the two step approach to generate all rules having support ≥ 0.4 and confidence ≥ 0.6 for the transactions below.

In class exercise #23: Use the two step approach to generate all rules having support ≥ 0.4 and confidence ≥ 0.6 for the transactions below.
1) Create a CSV file: one row per transaction, one column per item
   Milk Beer Diapers Butter Cookies Bread
   1 …
2) Find itemsets of size 2 that have support >= 0.4
   # Each row is a transaction; the code treats each column as a 0/1 indicator for one item
   data = read.csv("ice23.csv")
   num_transactions = dim(data)[1]
   num_items = dim(data)[2]
   item_labels = labels(data)[[2]]   # column names, i.e. the item names
   # Loop over every unordered pair of items
   for (col in 1:(num_items-1)) {
     for (col2 in (col+1):num_items) {
       # The product is 1 only when both items are in the transaction
       sup = sum(data[,col] * data[,col2]) / num_transactions
       if (sup >= 0.4) {
         print(item_labels[c(col, col2)])
       }
     }
   }

Drawback of Confidence
Association Rule: Tea → Coffee
             Coffee   No Coffee   Total
  Tea          15         5         20
  No Tea       75         5         80
  Total        90        10        100
Confidence(Tea → Coffee) = P(Coffee|Tea) = 15/20 = 0.75
but support(Coffee) = P(Coffee) = 90/100 = 0.9
Although confidence is high, the rule is misleading: P(Coffee|No Tea) = 75/80 = 0.9375, so people who do not drink tea are even more likely to drink coffee.
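The comparison can be reproduced from the counts on the slide. This is a small sketch, and the variable names are illustrative. It sets the rule's confidence next to the overall share of coffee drinkers and the share among people who do not drink tea.

```python
# Counts from the Tea/Coffee table on the slide (100 people in total).
tea_and_coffee = 15
tea_total = 20
coffee_total = 90
total = 100

# Counts for people who do not drink tea follow by subtraction.
no_tea_and_coffee = coffee_total - tea_and_coffee  # 75
no_tea_total = total - tea_total                   # 80

confidence = tea_and_coffee / tea_total            # P(Coffee | Tea)
baseline = coffee_total / total                    # P(Coffee)
coffee_given_no_tea = no_tea_and_coffee / no_tea_total  # P(Coffee | no Tea)

print(confidence, baseline, coffee_given_no_tea)   # 0.75 0.9 0.9375
```

Knowing that someone drinks tea lowers the estimated chance of coffee from 0.9 to 0.75, so the rule points the wrong way even though its confidence looks high.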

Other Proposed Metrics: