Data Mining: Concepts and Techniques — Chapter 3 —

1 Data Mining: Concepts and Techniques — Chapter 3 —
Association Rules

2 Mining Frequent Patterns, Association
Basic concepts
Apriori
FP-Growth
Mining multi-dimensional associations
Exercises

3 What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
- What products are often purchased together? Cheese and chips?
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, Web log (click-stream) analysis, and DNA sequence analysis

4 Why Is Freq. Pattern Mining Important?
Frequent patterns form the foundation for many essential data mining tasks:
- Association, correlation, and causality analysis
- Sequential and structural (e.g., sub-graph) patterns
- Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
- Classification: associative classification
- Cluster analysis: frequent-pattern-based clustering
- Semantic data compression

5 Basic Concepts: Frequent Patterns and Association Rules
Transaction database:
Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Itemset X = {x1, …, xk}
Find all the rules X => Y with minimum support and confidence:
- support, s: probability that a transaction contains X ∪ Y
- confidence, c: conditional probability that a transaction containing X also contains Y

(Venn diagram: customers who buy cheese, customers who buy chips, and the overlap who buy both.)

Let sup_min = 50%, conf_min = 50%.
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A => D (60%, 100%), D => A (60%, 75%)
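A minimal Python sketch (an illustration, not part of the original slides) makes the two measures concrete; the transaction sets are copied from the table above:

transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions.values() if itemset <= t) / len(transactions)

def confidence(X, Y):
    """Conditional probability that a transaction containing X also contains Y."""
    return support(X | Y) / support(X)

print(support({"A", "D"}))        # 0.6  -> A => D holds with 60% support
print(confidence({"A"}, {"D"}))   # 1.0  -> A => D holds with 100% confidence
print(confidence({"D"}, {"A"}))   # 0.75 -> D => A holds with 75% confidence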

6 Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns: e.g., {a1, …, a100} contains 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! (Why? Each of the 100 items is either included or excluded, minus the empty set.)
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
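Both definitions can be checked mechanically. The sketch below (an illustration, not from the slides) uses the frequent patterns of slide 5 ({A}:3, {D}:4, {A,D}:3) as its support table:

freq = {
    frozenset({"A"}): 3,
    frozenset({"D"}): 4,
    frozenset({"A", "D"}): 3,
}

def is_closed(X):
    # closed: no frequent proper superset has the same support as X
    return all(not (X < Y and s == freq[X]) for Y, s in freq.items())

def is_max(X):
    # maximal: no frequent proper superset of X exists at all
    return all(not (X < Y) for Y in freq)

for X in freq:
    print(set(X), "closed:", is_closed(X), "max:", is_max(X))
# {A} is neither ({A,D} has the same support); {D} is closed but not maximal;
# {A,D} is both closed and maximal.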

7 Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, with min_sup = 1.
What is the set of closed itemsets?
- <a1, …, a100>: support 1
- <a1, …, a50>: support 2
What is the set of max-patterns?
- <a1, …, a100>: support 1 (the only frequent itemset with no frequent super-pattern)

8 Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns:
- Any subset of a frequent itemset must be frequent
- If {cheese, chips, nuts} is frequent, so is {cheese, chips}, i.e., every transaction containing {cheese, chips, nuts} also contains {cheese, chips}
Scalable mining methods:
- Apriori (Agrawal & Srikant, VLDB'94)
- Frequent pattern growth (FP-growth; Han, Pei & Yin, SIGMOD'00)

9 Apriori: A Candidate Generation-and-Test Approach
Apriori is a classic algorithm for mining association rules.
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
Method:
- Initially, scan the DB once to get the frequent 1-itemsets
- Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
- Test the candidates against the DB
- Terminate when no frequent or candidate set can be generated

10 The Apriori Algorithm—An Example
Sup_min = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (after the 1st scan): {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1; counted in the 2nd scan): {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (counted in the 3rd scan): {B,C,E}:2
L3: {B,C,E}:2

11 Apriori Algorithm - An Example
Assume minimum support = 2 and the transaction database T1 = {1, 3, 4}, T2 = {2, 3, 5}, T3 = {1, 2, 3, 5}, T4 = {2, 5} (the supports on slide 14 follow from these transactions).

12 Apriori Algorithm - An Example
The final “frequent” itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori is {1,3} and {2,3,5}. These are the only itemsets from which we will generate association rules.

13 Generating Association Rules from Frequent Itemsets
Only strong association rules are generated:
- Frequent itemsets are those that satisfy the minimum support threshold
- Strong rules are those that also satisfy the minimum confidence threshold

confidence(A => B) = P(B | A) = support(A ∪ B) / support(A)

For each frequent itemset f, generate all non-empty proper subsets of f.
For every non-empty subset s of f:
    if support(f) / support(s) >= min_confidence
    then output the rule s => (f - s)
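This loop translates almost line for line into Python. The sketch below is an illustration with the supports of the example on the next slide hard-coded; it generates rules from every frequent itemset of size two or more:

from itertools import combinations

supp = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2, frozenset({2, 5}): 3,
    frozenset({3, 5}): 2, frozenset({2, 3, 5}): 2,
}

def rules(min_conf):
    for f in supp:
        if len(f) < 2:
            continue
        for r in range(1, len(f)):                     # all non-empty proper subsets
            for s in map(frozenset, combinations(f, r)):
                conf = supp[f] / supp[s]               # support(f) / support(s)
                if conf >= min_conf:
                    yield set(s), set(f - s), conf

for lhs, rhs, conf in rules(0.75):
    print(f"{lhs} => {rhs}  (conf = {conf:.2f})")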

14 Generating Association Rules (Example Continued)
Itemsets: {1,3} and {2,3,5}

Candidate rules for {1,3}:
Rule          Conf.
{1} => {3}    2/2 = 1.00
{3} => {1}    2/3 = 0.67

Candidate rules for {2,3,5}:
Rule            Conf.
{2,3} => {5}    2/2 = 1.00
{2,5} => {3}    2/3 = 0.67
{3,5} => {2}    2/2 = 1.00
{2} => {3,5}    2/3 = 0.67
{3} => {2,5}    2/3 = 0.67
{5} => {2,3}    2/3 = 0.67
{2} => {3}      2/3 = 0.67
{3} => {2}      2/3 = 0.67
{2} => {5}      3/3 = 1.00
{5} => {2}      3/3 = 1.00
{3} => {5}      2/3 = 0.67
{5} => {3}      2/3 = 0.67

Assuming a minimum confidence of 75%, the final set of rules reported by Apriori is: {1} => {3}, {2,3} => {5}, {3,5} => {2}, {5} => {2}, and {2} => {5}.

15 The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
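A compact Python rendering of this pseudo-code (a sketch, not the textbook's reference implementation), applied to the running example from slides 11-12; the join and prune details are spelled out on the next two slides:

from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    L = {c: n for c, n in count(frozenset({i}) for i in items).items()
         if n >= min_support}
    result = dict(L)
    k = 1
    while L:
        # join: unite pairs of frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # prune by downward closure: every k-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c: n for c, n in count(candidates).items() if n >= min_support}
        result.update(L)
        k += 1
    return result

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]   # running example, min_sup = 2
for itemset, n in sorted(apriori(db, 2).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)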

16 Important Details of Apriori
How to generate candidates?
- Step 1: self-joining Lk
- Step 2: pruning
How to count the supports of candidates?

Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
Pruning:
- acde is removed because ade is not in L3
C4 = {abcd}

17 How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.
Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
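In Python, with each itemset in Lk-1 kept as a sorted tuple, the join condition becomes a prefix/last-element comparison. A hedged sketch, reproducing the example from the previous slide:

from itertools import combinations

def apriori_gen(L_prev):
    """Generate Ck from Lk-1, a set of equal-length sorted tuples."""
    L_prev = set(L_prev)
    C = set()
    for p in L_prev:
        for q in L_prev:
            # self-join: same (k-2)-prefix, strictly ordered last items
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: drop c if any (k-1)-subset is missing from Lk-1
                if all(s in L_prev for s in combinations(c, len(c) - 1)):
                    C.add(c)
    return C

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3))   # {('a','b','c','d')}; acde is pruned since ade is not in L3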

18 Challenges of Frequent Pattern Mining
Challenges:
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
Improving Apriori: general ideas
- Reduce the number of passes over the transaction database
- Shrink the number of candidates
- Facilitate support counting of candidates

19 Construct FP-tree from a Transaction Database
min_support = 3

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3

(FP-tree figure: root {} with the path f:4 -> c:3 -> a:3, below which m:2 -> p:2 and b:1 -> m:1 branch off; b:1 directly below f:4; and a separate path c:1 -> b:1 -> p:1.)

Steps:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order
3. Scan the DB again and construct the FP-tree
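A hedged sketch of the construction (node-links and the header table are omitted to keep it short); the transactions are the five from the table above:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support):
    # pass 1: global item frequencies, keep only frequent items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_support}
    order = sorted(freq, key=freq.get, reverse=True)   # descending frequency
    # pass 2: insert each transaction's ordered frequent items into the tree
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            child = node.children.setdefault(item, Node(item, node))
            child.count += 1          # shared prefixes share nodes
            node = child
    return root, freq

db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, freq = build_fp_tree(db, 3)
print(freq)   # item -> support count: f:4, c:4, a:3, b:3, m:3, p:3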

20 Construct FP-tree from a Transaction Database
L = [I2:7, I1:6, I3:6, I4:2, I5:2] when the minimum support count = 2

21 Benefits of the FP-tree Structure
Completeness:
- Preserves complete information for frequent pattern mining
- Never breaks a long pattern of any transaction
Compactness:
- Reduces irrelevant information: infrequent items are gone
- Items are stored in frequency-descending order: the more frequently an item occurs, the more likely its nodes are to be shared
- The tree is never larger than the original database (not counting node-links and count fields)

22 Mining Frequent Patterns With FP-trees
Idea: frequent pattern growth
- Recursively grow frequent patterns by pattern and database partition
Method:
- For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Stop when the resulting FP-tree is empty or contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
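The recursion can be sketched without the tree itself: below, a conditional pattern base is a plain list of (prefix, count) pairs rather than a compressed FP-tree. This keeps the idea visible in a few lines; a real FP-growth builds a conditional FP-tree and follows node-links instead. The database is the running example from slides 11-14:

from collections import Counter

def pattern_growth(base, suffix, min_support, out):
    # support of each item within the current conditional pattern base
    freq = Counter()
    for items, n in base:
        for i in items:
            freq[i] += n
    for item, n in freq.items():
        if n < min_support:
            continue
        out[frozenset(suffix | {item})] = n
        # conditional pattern base of `item`: the part of each entry that
        # precedes `item` in a fixed item order (prevents duplicate patterns)
        cond = [([i for i in items if i < item], c)
                for items, c in base if item in items]
        pattern_growth([e for e in cond if e[0]], suffix | {item},
                       min_support, out)

db = [({1, 3, 4}, 1), ({2, 3, 5}, 1), ({1, 2, 3, 5}, 1), ({2, 5}, 1)]
out = {}
pattern_growth(db, set(), 2, out)
print({tuple(sorted(k)): v for k, v in out.items()})
# same frequent itemsets as Apriori found: (1,3), (2,3), (2,5), (3,5), (2,3,5), ...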

23 Scaling FP-growth by DB Projection
What if the FP-tree cannot fit in memory? Use DB projection:
- First partition the database into a set of projected DBs
- Then construct and mine an FP-tree for each projected DB

24 FP-Growth vs. Apriori: Scalability With the Support Threshold
Data set: T25I20D10K (figure: run time vs. support threshold for FP-growth and Apriori)

25 Multiple-Level Association Rules
Items often form a hierarchy
Items at the lower level are expected to have lower support
Rules regarding itemsets at the appropriate levels could be quite useful
A transaction database can be encoded based on dimensions and levels

(Hierarchy figure: Food splits into Milk and Bread; Milk into Skim and 2%; Bread into Wheat and White.)

26 Mining Multi-Level Associations
A top-down, progressive deepening approach:
- First find high-level strong rules: milk -> bread [20%, 60%]
- Then find their lower-level "weaker" rules: 2% milk -> wheat bread [6%, 50%]
With one threshold set for all levels: if the support threshold is too high, meaningful associations at the lower levels may be missed; if it is too low, uninteresting rules may be generated.
Different minimum support thresholds across levels lead to different algorithms (e.g., decreasing min-support at lower levels).

27 Quantitative Association Rules
Handling quantitative rules may require mapping continuous variables to Boolean ones.

28 Mapping Quantitative to Boolean
One possible solution is to map the problem to Boolean association rules:
- Discretize each non-categorical attribute into intervals, e.g., Age: [20,29], [30,39], …
- Categorical attributes: each value becomes one item
- Non-categorical attributes: each interval becomes one item
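A small illustrative sketch of this mapping (the attribute names and age bins are invented for the example):

def to_items(record, age_bins=((20, 29), (30, 39), (40, 49))):
    items = {f"job={record['job']}"}      # categorical: each value -> one item
    for lo, hi in age_bins:               # numeric: each interval -> one item
        if lo <= record["age"] <= hi:
            items.add(f"age[{lo},{hi}]")
    return items

print(to_items({"age": 34, "job": "engineer"}))
# {'job=engineer', 'age[30,39]'} -- a Boolean itemset ready for Apriori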

29 Example in Web Content Mining
Document associations:
- Find (content-based) associations among documents in a collection
- Documents correspond to items and words correspond to transactions
- Frequent itemsets are groups of docs in which many words occur in common
Term associations:
- Find associations among words based on their occurrences in documents
- Similar to the above, but with the table inverted (terms as items, docs as transactions)
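A tiny sketch of the inversion (the three-document collection is invented for the example): the same doc-term data yields two different transaction encodings:

doc_terms = {
    "d1": {"mining", "pattern", "tree"},
    "d2": {"mining", "pattern"},
    "d3": {"tree", "graph"},
}

# Term associations: terms as items, one transaction per document.
term_transactions = list(doc_terms.values())

# Document associations: docs as items, one transaction per term.
term_docs = {}
for doc, terms in doc_terms.items():
    for term in terms:
        term_docs.setdefault(term, set()).add(doc)
doc_transactions = list(term_docs.values())

print(term_transactions)   # feed either list to Apriori / FP-growth
print(doc_transactions)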

30 Example in Web Usage Mining
Association rules in Web transactions:
- Discover affinities among sets of Web page references across user sessions
Examples:
- 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
- 30% of clients who accessed /special-offer.html placed an online order in /products/software/
- An actual example from the IBM official Olympics site: {Badminton, Diving} => {Table Tennis} [conf = 69.7%, sup = 0.35%]
Applications:
- Serve dynamic, customized content to users
- Prefetch the files that are most likely to be accessed
- Determine the best way to structure the Web site (site optimization)
- Targeted electronic advertising and increased cross-sales

31 Example in Web Usage Mining
Association rules from the Cray Research Web site: design "suggestions"

32 Assignment
Using an IBM synthetic data set with the following specifications, implement both Apriori and FP-Growth and compare the results (run time and scalability): the number of transactions = 100,000; the average transaction length = 10; the average frequent pattern length = …
Review a paper on Web mining that uses association rules or frequent pattern mining (be prepared to present it).
Using the Weka software, run a sample experiment with Apriori and FP-Growth (use data sets from the UCI repository).

