
Mining Frequent Patterns, Associations, and Correlations (Chapter 5 of Data Mining: Concepts and Techniques)





2  Mining Frequent Patterns, Associations, and Correlations (Chapter 5)

3  Chapter 5: Mining Frequent Patterns, Associations, and Correlations
What is association rule mining
Mining single-dimensional Boolean association rules
Mining multilevel association rules
Mining multidimensional association rules
From association mining to correlation analysis
Summary

4  What Is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Rule form: "Body → Head [support, confidence]".
Examples:
buys(X, "diapers") → buys(X, "beers") [0.5%, 60%]
age(X, "20..29") ^ income(X, "30..39K") → buys(X, "PC") [2%, 60%]
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, classification, etc.

5  Association Rule Mining (Basic Concepts)
Given: a database or set of transactions, where each transaction is a list of items (e.g., the items purchased by a customer in one visit).
Find: all strong rules that correlate the presence of one set of items with that of another set of items (X → Y).
E.g., 98% of people who purchase tires and auto accessories also get car maintenance done.
Application examples:
? → Car Maintenance Agreement (what the store should do to boost Car Maintenance Agreement sales)
Home Electronics → ? (what other products the store should stock up on)

6  Denotation of Association Rules
Rule form: "Body → Head [support, confidence]" (expressing association, correlation, and causality).
Rule examples:
sales(T, "computer") → sales(T, "software") [support = 1%, confidence = 75%]
buy(T, "Beer") → buy(T, "Diaper") [support = 2%, confidence = 70%]
age(X, "20..29") ^ income(X, "30..39K") → buys(X, "PC") [support = 2%, confidence = 60%]
In each rule, the predicate name (e.g., buys) is an attribute name, the quoted value is an item in an object, and the variable (T or X) denotes the data object that provides the context.

7  Association Rule Mining (Support and Confidence)
Given a transaction database, find all rules X → Y with minimum support and confidence.
Support S: probability that a transaction contains both X and Y.
Confidence C: conditional probability that a transaction containing X also contains Y.
I = {i1, i2, i3, ..., in} is the set of all items; a transaction Tj is a subset of I (Tj ⊆ I).
Example rules: A → C (support 50%, confidence 66.6%); C → A (support 50%, confidence 100%).
(Figure: Venn diagram of transactions that buy X, transactions that buy Y, and transactions that buy both.)
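As a concrete illustration (not part of the original slides), the following minimal Python sketch computes support and confidence of a rule directly from a list of transactions. The helper name rule_stats and the four-transaction toy database are assumptions; the database is chosen so that it reproduces the A → C (50%, 66.6%) and C → A (50%, 100%) figures quoted above.

```python
def rule_stats(transactions, body, head):
    """Return (support, confidence) of the rule body -> head.

    support    = P(body and head): fraction of transactions containing both sides
    confidence = P(head | body)  : support(body and head) / support(body)
    """
    body, head = set(body), set(head)
    n = len(transactions)
    n_body = sum(1 for t in transactions if body <= set(t))
    n_both = sum(1 for t in transactions if (body | head) <= set(t))
    support = n_both / n
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence

# Assumed toy database: A appears in 3 of 4 transactions, {A, C} in 2 of 4.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(rule_stats(D, {"A"}, {"C"}))   # -> (0.5, 0.666...)
print(rule_stats(D, {"C"}, {"A"}))   # -> (0.5, 1.0)
```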

8  A Road Map for Association Rule Mining (Types of Association Rules)
Boolean vs. quantitative associations (based on the types of data values):
buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
Single-dimensional vs. multi-dimensional associations (see the examples above).
Single-level vs. multiple-level analysis, e.g., what brands of beer are associated with what brands of diapers?
Other extensions: correlation and causality analysis; association does not necessarily imply correlation or causality.

9  Chapter 5: Mining Association Rules in Large Databases
What is association rule mining
Mining single-dimensional Boolean association rules
Mining multilevel association rules
Mining multidimensional association rules
From association mining to correlation analysis
Summary

10  Mining Association Rules (How to Find Strong Rules: An Example)
Goal: find strong rules with support >= 50% and confidence >= 50%.
1) Find the frequent itemsets. A frequent itemset is an itemset whose support is at least the minimum support.
2) Generate association rules from the frequent itemsets, in the form "Body → Head [support, confidence]", keeping only rules that meet the minimum support and confidence.
Example: for rule A → C, find the supports of {A, C} and {A}:
support = support({A, C}) = 50% >= min. support
confidence = support({A, C}) / support({A}) = 66.6% >= min. confidence
Problem: there may be a huge number of frequent itemsets.
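Step 2 (rule generation) can be sketched in a few lines of Python, assuming the supports of all frequent itemsets from step 1 are already in a dictionary. The function name generate_rules and the small support table below are illustrative, not from the slides; the table reuses the supports of {A}, {C}, and {A, C} from the example above.

```python
from itertools import combinations

def generate_rules(supports, min_conf):
    """Yield rules (body, head, support, confidence) from a support table.

    supports maps frozenset itemsets (all already frequent) to their support;
    every non-empty proper subset of an itemset is tried as a rule body.
    """
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for body in map(frozenset, combinations(itemset, k)):
                conf = sup / supports[body]
                if conf >= min_conf:
                    yield set(body), set(itemset - body), sup, conf

# Illustrative supports consistent with the slide's example.
supports = {frozenset("A"): 0.75, frozenset("C"): 0.5, frozenset("AC"): 0.5}
for body, head, s, c in generate_rules(supports, min_conf=0.5):
    print(body, "->", head, f"[{s:.0%}, {c:.1%}]")
# prints A -> C [50%, 66.7%] and C -> A [50%, 100.0%]
```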

11  Mining Association Rules Efficiently: The Apriori Algorithm
1. Find the frequent itemsets (key step): iteratively find the frequent itemsets of cardinality 1 to k (from 1-itemsets to k-itemsets) using the Apriori principle.
A k-itemset is an itemset that contains k items.
Apriori principle: any subset of a frequent itemset must also be frequent. For example, if {A, B} is a frequent 2-itemset, both {A} and {B} must be frequent 1-itemsets; if either {A} or {B} is not frequent, {A, B} cannot be frequent.
2. Generate association rules from the frequent itemsets.

12  The Apriori Algorithm: Example (min. support = 50%, i.e., support count = 2)
(Figure: the database D is scanned repeatedly, producing candidate sets C1, C2, C3 and frequent sets L1, L2, L3 in turn.)

13  The Apriori Algorithm (Pseudo-Code for Finding Frequent Itemsets)
Iteration: C1 → L1 → C2 → L2 → C3 → L3 → ... → CN → ∅
Ck (like a table): the set of candidate itemsets of size k
Lk (like a table): the set of frequent itemsets of size k
Pseudo-code:
C1 = {candidate 1-itemsets};
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk            // join & prune: Lk → Ck+1
    for each transaction t in the database do      // database scan: Ck+1 → Lk+1
        increment the count of every itemset in Ck+1 contained in t
    Lk+1 = itemsets in Ck+1 with support >= min_support
end
return ∪k Lk;
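The pseudo-code maps almost line for line onto the following Python sketch. It is a minimal illustration, not the book's reference implementation; the names apriori and apriori_gen are assumptions, and apriori_gen uses a simple pairwise union with a subset check, which after pruning yields the same candidates as the sorted join described on the next two slides.

```python
from itertools import combinations

def apriori_gen(Lk):
    """Join Lk with itself and prune candidates that have an infrequent k-subset."""
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    candidates = set()
    for p in Lk:
        for q in Lk:
            union = p | q
            if len(union) == k + 1:                      # join step
                if all(frozenset(s) in Lk                # prune step
                       for s in combinations(union, k)):
                    candidates.add(union)
    return candidates

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent, Ck = {}, items
    while Ck:
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(Lk)
        Ck = apriori_gen(Lk) if Lk else set()
    return frequent

# Toy run with a classic 4-transaction example and min. support 50% (count 2).
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(D, 0.5))   # L1..L3, ending with the frequent 3-itemset {B, C, E}
```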

14  Generating Candidate Itemsets Ck+1 from Lk (Lk → Ck+1: Join & Prune)
Representation: Lk → Ck+1, where Lk is the set of frequent itemsets of size k and Ck+1 is the set of candidate itemsets of size k+1.
Steps for computing the candidate (k+1)-itemsets Ck+1:
1) Join step (based on the Apriori principle): Ck+1 is generated by joining Lk with itself (note the length constraint: k → k+1).
2) Prune step (apply the Apriori principle to prune): any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset, so Lk is used to prune Ck+1.

15  Generation of Candidate Itemsets Ck+1 (Pseudo-Code for Join & Prune)
Suppose the items in each itemset of Lk are listed (sorted) in order.
Step 1: self-join Lk (SQL-style pseudo-code, where p and q are aliases of Lk and the items are sorted):
insert into Ck+1
select p.item1, p.item2, ..., p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1, ..., p.itemk-1 = q.itemk-1, p.itemk < q.itemk
Step 2: prune. Any k-itemset that is not frequent cannot be a subset of a frequent (k+1)-itemset (Ck+1 is like a table):
for every (k+1)-itemset c in Ck+1 do
    for every k-subset s of c do
        if (s is not in Lk) then delete c from Ck+1
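The SQL-style self-join above translates directly into Python over sorted tuples. This is a sketch; the helper name join_and_prune is an assumption, and the usage lines reproduce the L3 → C4 example worked on the next slide.

```python
from itertools import combinations

def join_and_prune(Lk):
    """Generate C(k+1) from Lk, with itemsets kept as sorted tuples.

    Join : pair p, q that agree on their first k-1 items and have p[k-1] < q[k-1].
    Prune: drop any candidate that has an infrequent k-subset.
    """
    Lk = sorted(Lk)
    Lk_set = set(Lk)
    k = len(Lk[0])
    candidates = []
    for p in Lk:
        for q in Lk:
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                c = p + (q[k - 1],)                          # join step
                if all(s in Lk_set for s in combinations(c, k)):
                    candidates.append(c)                     # survives the prune
    return candidates

# L3 = {abc, abd, acd, ace, bcd}: acde is joined from acd and ace but pruned
# because ade is not in L3, so C4 = {abcd}.
L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(join_and_prune(L3))   # -> [('a', 'b', 'c', 'd')]
```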

16  Generating Candidate Itemsets C4 (Example: L3 → C4)
L3 = {abc, abd, acd, ace, bcd}
Self-joining L3 * L3 (pair-wise joining of the two copies p and q of L3):
abcd (from abc + abd) goes into C4
acde (from acd + ace) goes into C4 (before pruning)
abce (from abc + ace) does not go into C4: abc and ace do not agree on their first two items, so the join does not produce it
Pruning:
acde is removed from C4 because ade ∉ L3 (and cde ∉ L3); the 3-subsets of acde are acd, ace, ade, cde
Therefore, C4 = {abcd}

17  Is Apriori Fast Enough? (Performance Bottlenecks)
The core of the Apriori algorithm:
Compute Ck (Lk-1 → Ck): use the frequent (k-1)-itemsets to generate candidate k-itemsets (candidate generation).
Compute Lk (Ck → Lk): use a database scan and pattern matching to collect counts for the candidate itemsets.
The bottlenecks of Apriori:
Huge candidate sets when computing Ck: 10^4 frequent 1-itemsets generate about 10^7 candidate 2-itemsets, and the numbers of 3-itemsets, 4-itemsets, ... grow further; for a transaction database over 100 items, e.g., {a1, a2, ..., a100}, there are 2^100 ≈ 10^30 possible candidate itemsets.
Multiple database scans when computing Lk: N scans are needed, where N is the length of the longest pattern.

18  Methods to Improve Apriori's Efficiency
Hash-based itemset counting: a k-itemset whose hash-bucket count is below the threshold cannot be frequent.
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
Partitioning (parallel computing): any itemset that is potentially frequent in the whole database must be frequent in at least one of the partitions of the database.
Sampling: mine a subset of the given data with a lowered support threshold, plus a method to verify the completeness of the result.
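For the first bullet, here is a hedged sketch of hash-based counting in the spirit of DHP (hash buckets filled while scanning for 1-itemsets, then used to rule out candidate 2-itemsets). The bucket count, hash function, and helper name hash_filtered_pairs are illustrative choices, not taken from the slides.

```python
from itertools import combinations

def hash_filtered_pairs(transactions, min_count, n_buckets=8):
    """While counting 1-itemsets, hash every 2-itemset of each transaction into a
    bucket; a pair can be frequent only if its bucket count reaches min_count,
    so low-count buckets prune candidate 2-itemsets without exact counting."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1

    def may_be_frequent(pair):
        return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_count

    return may_be_frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
check = hash_filtered_pairs(D, min_count=2)
print(check({"B", "E"}), check({"A", "D"}))  # BE passes; AD may already be ruled out
```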

19  Mining Frequent Patterns by FP-Growth (Without Candidate Generation or Repeated DB Scans)
Avoid costly database scans: compress a large transaction database into a compact Frequent-Pattern tree (FP-tree) structure that is highly condensed, yet complete for frequent pattern mining.
Avoid costly candidate generation: develop an efficient FP-tree-based frequent pattern mining method using a divide-and-conquer methodology that decomposes mining tasks into smaller ones.

20  FP-growth vs. Apriori for Mining Rules (Scalability with the Support Threshold)
(Figure: run time of FP-growth and Apriori on data set T25I20D10K as the support threshold decreases.)
For a data set named TxIyDz: x is the average transaction length, y is the average itemset length, and z is the total number of transaction instances.

21  Major Steps of the FP-growth Algorithm
1) Construct the FP-tree.
2) Construct the conditional pattern base for each item in the header table of the FP-tree.
3) Construct the conditional FP-tree from each entry of the conditional pattern base.
4) Derive frequent patterns by mining the conditional FP-trees recursively; if a conditional FP-tree contains a single path, simply enumerate all the pattern combinations.

22  Construct the FP-tree from the Transaction DB (min_support = 50%, i.e., support count >= 2.5)
TID  Items bought                 (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
300  {b, f, h, j, o}              {f, b}
400  {b, c, k, s, p}              {c, b, p}
500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}
Steps:
1. Scan the DB once and find the frequent 1-itemsets; header table: f:4, c:4, a:3, b:3, m:3, p:3.
2. Order the frequent items of each transaction in frequency-descending order.
3. Scan the DB again and insert each ordered transaction into the FP-tree.
(Figure: the resulting FP-tree rooted at {}, with paths f:4 → c:3 → a:3 → m:2 → p:2, a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1, plus node-links from the header table.)
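A compact Python sketch of steps 1 to 3 (counting, reordering, and inserting into a trie of counted nodes). The Node class and the name build_fp_tree are assumptions, and the header-table/node-link bookkeeping is reduced to a dictionary of node lists for brevity; ties in frequency are broken alphabetically, so the exact path order may differ from the slide's figure while remaining a valid FP-tree.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_count):
    """Build an FP-tree; return (root, header), where header maps each
    frequent item to the list of tree nodes labelled with that item."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in freq.items() if n >= min_count}        # step 1
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        # step 2: keep frequent items, in frequency-descending order
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:                                         # step 3: insert path
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

D = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(D, min_count=3)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# per-item totals, e.g. f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
```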

23  Benefits of the FP-tree Structure
Completeness: it never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining.
Compactness: irrelevant information is removed (infrequent items are gone); frequency-descending ordering means more frequent items are more likely to be shared; the tree is never larger than the original database (not counting node-links and counts).
Example: for the Connect-4 DB, the compression ratio can be over 100.

24  Step 2: Construct the Conditional Pattern Base
Starting from the second item of the header table of the FP-tree:
Traverse the FP-tree by following the node-links of each frequent item.
Accumulate the prefix paths of that item, with their counts, to form its entry in the conditional pattern base.
Conditional pattern base (for the FP-tree built above):
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

25  Properties of the FP-tree for Conditional Pattern Base Construction
Node-link property: for a frequent item ai, all the frequent itemsets that contain ai can be obtained by following ai's node-links, starting from ai's entry in the header table of the FP-tree.
Prefix-path property: to calculate the frequent itemsets containing ai in a path P, only the prefix sub-path of the ai node in P needs to be accumulated, and its frequency count carries the same count as the ai node.

26  Step 3: Construct the Conditional FP-tree
For each conditional pattern-base entry (illustrated here with m):
Accumulate the count for each item in the entry.
Construct the conditional FP-tree for the frequent items of the entry.
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3 (b is dropped because its accumulated count of 1 is below the minimum support count of 3).

27  Conditional Pattern Bases and Conditional FP-trees (assume minimum support = 3; | denotes concatenation)
Item   Conditional pattern base           Conditional FP-tree
f      Empty                              Empty
c      {(f:3)}                            {(f:3)}|c
a      {(fc:3)}                           {(f:3, c:3)}|a
b      {(fca:1), (f:1), (c:1)}            Empty
m      {(fca:2), (fcab:1)}                {(f:3, c:3, a:3)}|m
p      {(fcam:2), (cb:1)}                 {(c:3)}|p

28  Step 4: Recursively Mine the Conditional FP-trees
Suppose a conditional FP-tree T has a single path P. Then all frequent itemsets of T can be generated by enumerating all the item combinations in path P.
Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, so all frequent itemsets containing m are:
m, fm, cm, am, fcm, fam, cam, fcam
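The single-path case reduces to enumerating subsets of the path and appending the suffix item. A small sketch; the function name single_path_patterns is illustrative.

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """Enumerate every frequent pattern from a single-path conditional FP-tree.

    path   : items on the path with their counts, e.g. [("f", 3), ("c", 3), ("a", 3)]
    suffix : the item whose conditional tree this is, e.g. "m"
    Each pattern's support is the minimum count among the path items it uses.
    """
    items = [i for i, _ in path]
    counts = dict(path)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            support = min((counts[i] for i in combo), default=None)
            yield set(combo) | {suffix}, support

for pattern, support in single_path_patterns([("f", 3), ("c", 3), ("a", 3)], "m"):
    print(pattern, support)
# yields m, fm, cm, am, fcm, fam, cam, fcam; {m} alone reports None because its
# support comes from m's own count in the original FP-tree, not from the path
```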

29  Why Is the FP-growth Algorithm Fast?
Performance studies show that FP-growth is an order of magnitude faster than the Apriori algorithm.
Reasons: it builds a compact FP-tree data structure; there is no candidate generation and no candidate test; repeated database scans are eliminated; and the basic operation is counting.

30  Presentation of Association Rules (Table Form)

31  Visualization of Association Rules Using a Plane Graph

32  Visualization of Association Rules Using a Rule Graph

33  Chapter 5: Mining Association Rules in Large Databases
What is association rule mining
Mining single-dimensional Boolean association rules
Mining multilevel association rules
Mining multidimensional association rules
From association mining to correlation analysis
Summary

34  Multiple-Level Association Rules
Items often form a concept hierarchy, e.g., food → milk, bread; milk → skim, 2%; bread → white, wheat; and brand-level items such as Fraser (a milk brand) and Wonder (a bread brand).
Rules regarding itemsets at the appropriate levels could be quite useful.
Items at a lower level are expected to have lower support.
The transaction database can be encoded based on dimensions and levels, and we can then explore multi-level mining.

35  Mining Multi-Level Associations
A top-down, progressive deepening approach:
First find high-level strong rules, e.g., milk → bread [20%, 60%].
Then find their lower-level "weaker" rules, e.g., 2% milk → wheat bread [6%, 50%].
Variations in mining multiple-level association rules:
Level-crossed association rules, e.g., 2% milk → Wonder wheat bread.
Association rules with multiple, alternative concept hierarchies, e.g., 2% milk → Wonder bread.

36  Uniform Support vs. Reduced Support in Multi-level Association Mining
Uniform support: use the same minimum support for mining all levels.
Pros: simple, with a single minimum support threshold; if a candidate itemset contains any item whose ancestors do not have minimum support, there is no need to examine it.
Cons: lower-level items do not occur as frequently, so if the support threshold is too high we miss low-level associations, and if it is too low we generate too many high-level associations.

37  Uniform Support Example
Multi-level mining with uniform support (min_sup = 5% at both level 1 and level 2):
Level 1: Milk [support = 10%] is frequent.
Level 2: 2% Milk [support = 6%] is frequent; Skim Milk [support = 4%] is not.

38  Uniform Support vs. Reduced Support in Multi-level Association Mining (cont.)
Reduced support: use a reduced minimum support at lower levels. There are three main search strategies:
Level-by-level independent (no level (i-1) information used): each node (item) is examined regardless of whether its parent node is frequent (too loose, so not efficient).
Level-cross filtering by k-itemset (uses level (i-1) information about k-itemsets): a k-itemset at the i-th level is examined iff its parent k-itemset at the (i-1)-th level is frequent (efficient, but too restrictive, so many rules are missed).
Level-cross filtering by single item (uses level (i-1) information about single items): an item at the i-th level is examined iff its parent node at the (i-1)-th level is significant, which does not require it to be frequent (a compromise between the two strategies above).

39  Reduced Support Example
Multi-level mining with reduced support (level-cross filtering by single item), given Milk [support = 10%], 2% Milk [support = 6%], and Skim Milk [support = 4%]:
Case 1 (level 1 min_sup = 8%, level 2 min_sup = 4%): Milk is frequent, and both 2% Milk and Skim Milk are frequent.
Case 2 (level 1 min_sup = 12%, level 2 min_sup = 6%): Milk is not frequent (10% < 12%), 2% Milk is frequent (6% >= 6%), and Skim Milk is not (4% < 6%).
In case 2, the level-1 parent node is significant but not frequent, and its child may still be frequent.

40  Redundancy Filtering in Multi-level Association Mining
Some rules may be redundant due to "ancestor" relationships between items.
Example of a redundant rule:
milk → wheat bread [support = 8%, confidence = 70%]
2% milk → wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to, or less than, the "expected" value computed from the rule's ancestor.
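To make the "expected" value concrete (a hedged illustration; the one-quarter share is an assumed figure, not stated on the slide): if roughly a quarter of all milk sold is 2% milk, then the ancestor rule predicts
expected support of (2% milk → wheat bread) ≈ 8% × 1/4 = 2%
expected confidence ≈ 70%
Since the observed [2%, 72%] is close to this expectation, the second rule conveys no new information and can be filtered out.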

41  Progressive Deepening in Multi-Level Association Mining
A top-down, progressive deepening approach:
First mine the high-level frequent items, e.g., milk (15%), bread (10%).
Then mine their lower-level "weaker" frequent itemsets, e.g., 2% milk (5%), wheat bread (4%).
Different min_support threshold strategies across the levels lead to different algorithms:
If the same min_support is used across all levels, toss item t if any of t's ancestors is infrequent.
If a reduced min_support is used at lower levels, examine only those descendants whose ancestor's support is frequent or non-negligible.

42  Progressive Refinement of Data Mining Quality
Why progressive refinement? Trade speed for quality with step-by-step refinement; a mining operator can be expensive or cheap, fine or rough.
Two- or multi-step mining: first apply a rough/cheap operator (superset coverage), then apply the expensive algorithm on a substantially reduced candidate set (Koperski & Han, SSD'95).
Superset coverage property: preserve all the positive answers, i.e., allow false positives but not false negatives.

43  Chapter 5: Mining Association Rules in Large Databases
What is association rule mining
Mining single-dimensional Boolean association rules
Mining multilevel association rules
Mining multidimensional association rules
From association mining to correlation analysis
Summary

44  Multi-Dimensional Association (Concepts)
Single-dimensional rules: buys(X, "milk") → buys(X, "bread")
Multi-dimensional rules:
Inter-dimension association rules (no repeated predicates): age(X, "19-25") ^ occupation(X, "student") → buys(X, "coke")
Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ^ buys(X, "popcorn") → buys(X, "coke")
Categorical attributes: a finite number of possible values, with no ordering among values.
Quantitative attributes: numeric, with an implicit ordering among values, e.g., age.

45  Mining Quantitative Associations
Search for frequent k-predicate sets; for example, {age, occupation, buys} is a 3-predicate set.
Techniques can be categorized by how the quantitative attributes (e.g., age) are treated:
1. Static discretization: quantitative attributes are statically discretized using predefined concept hierarchies, where numeric values are replaced by ranges, e.g., salary as 0-20K, 21K-30K, 31K-40K, >= 41K.
2. Dynamic discretization: each quantitative attribute is dynamically discretized into bins based on the distribution of its values, e.g., equi-width or equi-depth bins.
3. Distance-based (clustering) discretization: a dynamic discretization process that considers the distance between data points (see the distance-based association rules slides below for an example).
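A quick sketch of the two dynamic-discretization options mentioned in point 2; equi_width_bins, equi_depth_bins, and the sample ages are illustrative, not from the slides.

```python
def equi_width_bins(values, k):
    """Split the value range into k intervals of equal width; return cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equi_depth_bins(values, k):
    """Choose cut points so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

ages = [23, 25, 27, 31, 34, 35, 38, 44, 46, 52, 58, 63]
print(equi_width_bins(ages, 4))   # equal-width cut points: [33.0, 43.0, 53.0]
print(equi_depth_bins(ages, 4))   # equal-depth cut points: [31, 38, 52]
```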

46  Static Discretization of Quantitative Attributes
Before mining, discretize the data using concept hierarchies, so numeric values are replaced by ranges (e.g., salary as 0-20K, 21K-30K, 31K-40K, >= 41K).
In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
A data cube is well suited for this mining: the cells of an n-dimensional cuboid correspond to the predicate sets and store the support counts of the corresponding n-predicate sets.
(Figure: the lattice of cuboids over {age, income, buys}: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys).)

47  Quantitative Association Rule Mining
Numeric attributes are dynamically discretized so that the compactness of the mined rules is maximized.
For a 2-D quantitative association rule Aquan1 ^ Bquan2 → Ccat, cluster "adjacent" association rules to form more general rules using a 2-D grid.
Example:
age(X, "34") ^ income(X, "30K-40K") → buys(X, "TV")
age(X, "34") ^ income(X, "40K-50K") → buys(X, "TV")
age(X, "35") ^ income(X, "30K-40K") → buys(X, "TV")
age(X, "35") ^ income(X, "40K-50K") → buys(X, "TV")
can be merged into the more compact rule
age(X, "34-35") ^ income(X, "30K-50K") → buys(X, "TV")

48  Mining Distance-based Association Rules
Binning methods do not capture the semantics of interval data.
Distance-based partitioning gives a more meaningful discretization by considering the density (number of points) in an interval and the "closeness" of the points in an interval.

49  Clusters and Distance Measurements
S[X] is a set of N tuples t1, t2, ..., tN projected on the numerical attribute set X.
The diameter of S[X] is the average pairwise distance over the N(N-1) ordered pairs of tuples:
d(S[X]) = ( Σ_{i=1..N} Σ_{j=1..N} dist_X(t_i[X], t_j[X]) ) / ( N(N-1) )
where dist_X is a distance metric, e.g., Euclidean distance or Manhattan distance.
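A direct transcription of the diameter definition into Python (a sketch; the Manhattan metric, the sample cluster, and the name diameter are illustrative choices).

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def diameter(tuples, dist=manhattan):
    """Average pairwise distance over all N(N-1) ordered pairs of projected tuples."""
    n = len(tuples)
    total = sum(dist(ti, tj) for ti in tuples for tj in tuples)  # i == j terms are 0
    return total / (n * (n - 1))

cluster = [(30, 31_000), (32, 33_000), (34, 30_000)]   # (age, income) projections
print(diameter(cluster))                               # ~2002.67 for this sample
```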

50  How to Find Clusters
Assess the density of a cluster CX: the cluster's diameter must be small enough, and the number of samples in the cluster must be large enough.
To find clusters and distance-based rules, the density threshold replaces the notion of support, and a modified version of the BIRCH clustering algorithm is used.

51  Chapter 5: Mining Association Rules in Large Databases
What is association rule mining
Mining single-dimensional Boolean association rules
Mining multilevel association rules
Mining multidimensional association rules
From association mining to correlation analysis
Summary

52  Interestingness Measurements
Objective measures: two popular ones are support and confidence.
Subjective measures (Silberschatz & Tuzhilin, KDD'95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it).

53  Criticism of Support and Confidence (Example 1)
Example 1 (Aggarwal & Yu, PODS'98): 5000 students in total, analyzed in a 2-D space (playing basketball, eating cereal). Given: 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal; the remaining cells are derived:
                 cereal   not cereal   total
basketball        2000       1000       3000
not basketball    1750        250       2000
total             3750       1250       5000
Consider the following association rules:
play basketball → eat cereal [40%, 66.7%]
play basketball → not eat cereal [20%, 33.3%]

54  Criticism of Support and Confidence (cont.)
Rule 1: play basketball → eat cereal [40%, 66.7%]. Rule 1 suggests that a student who plays basketball tends to eat cereal, but it is misleading: the overall percentage of students eating cereal is 75% (3750/5000), which is higher than 66.7%.
Rule 2: play basketball → not eat cereal [20%, 33.3%]. Rule 2 is far more accurate: it indicates that a student who plays basketball is less likely to eat cereal.

55  Interestingness Measure: Correlation (Lift)
play basketball → eat cereal [40%, 66.7%] is misleading, because the overall share of students eating cereal is 75% > 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence.
A measure of dependent/correlated events: lift, defined as lift(A, B) = P(A ^ B) / (P(A) P(B)).
With B = basketball and C = cereal: lift(B, C) = 0.40 / (0.60 × 0.75) ≈ 0.89, while lift(B, ¬C) = 0.20 / (0.60 × 0.25) ≈ 1.33, so lift(B, C) < lift(B, ¬C): playing basketball and eating cereal are negatively correlated.
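A tiny check of the lift numbers above (a sketch; the function name lift is illustrative).

```python
def lift(p_a, p_b, p_ab):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); a value of 1 means independence."""
    return p_ab / (p_a * p_b)

p_b, p_c = 3000 / 5000, 3750 / 5000          # P(basketball), P(cereal)
print(lift(p_b, p_c, 2000 / 5000))           # ~0.89 -> negatively correlated
print(lift(p_b, 1 - p_c, 1000 / 5000))       # ~1.33 -> positively correlated
```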

56  Criticism of Support and Confidence (Example 2)
(Table: three items X, Y, Z over eight transactions, where 1 means the item occurred and 0 means it did not.)
1) By actual co-occurrence, X and Y are positively correlated, while X and Z are negatively related.
2) According to support and confidence alone, however, Z appears more positively correlated with X than Y does; the two conclusions contradict each other.
We therefore need a measure of dependent or correlated events.

57  Criticism of Support and Confidence (Example 2, cont.)
Correlation (lift) takes both P(A) and P(B) into consideration:
corr(A, B) = P(A ^ B) / (P(A) P(B))
P(A ^ B) = P(A) P(B) if A and B are independent events; A and B are negatively correlated if corr(A, B) is less than 1, and positively correlated otherwise.
In Example 2: P(X) = 1/2, P(Y) = 1/4, P(Z) = 7/8, and
corr(X, Y) = 2 (X and Y are positively correlated)
corr(X, Z) = 6/7 (X and Z are negatively related)
corr(Y, Z) = 4/7 (Y and Z are negatively related)

58  Chapter 5: Mining Association Rules in Large Databases
What is association rule mining
Mining single-dimensional Boolean association rules
Mining multilevel association rules
Mining multidimensional association rules
From association mining to correlation analysis
Summary

59  Summary
Association rule mining is probably the most significant contribution from the database community to KDD.
A large number of papers have been published, and many interesting issues have been explored.
An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, and time-series data.

60  Steps in the KDD Process (Technically)
Databases → relevant data → data preprocessing → data mining → patterns → evaluation/presentation → knowledge.
Data mining is the core step of the KDD process.




