Slide 1: Association Rule Mining
CIT366: Data Mining & Data Warehousing
Instructor: Bajuna Salehe
The Institute of Finance Management: Computing and IT Dept.

Slide 2: What Is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Frequent pattern: a pattern (a set of items, a sequence, etc.) that occurs frequently in a database.

Slide 3: Motivations for Association Mining
Motivation: finding regularities in data.
- What products were often purchased together? Beer and nappies!
- What are the subsequent purchases after buying a PC?
- What kinds of DNA are sensitive to this new drug?
- Can we automatically classify web documents?

Slide 4: Motivations for Association Mining (cont…)
Broad applications:
- Basket data analysis, cross-marketing, catalog design, sales campaign analysis
- Web log (click stream) analysis, DNA sequence analysis, etc.

Slide 5: Market Basket Analysis
Market basket analysis is a typical example of frequent itemset mining.
Customers' buying habits are discovered by finding associations between the different items that customers place in their "shopping baskets".
This information can be used to develop marketing strategies.

Slide 6: Market Basket Analysis (cont…)

Slide 7: Application of Association
Association analysis can be used to promote and improve a marketing strategy by analysing frequent itemsets. As the marketing manager of a Company X, for instance, you would like to determine which items are frequently purchased together within the same transactions.

Slide 8: Application of Association
An example of such a rule, mined from the X Company transactional database, is

buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]

where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well.

Slide 9: Application of Association
A support of 1% means that 1% of all the transactions under analysis showed that computer and software were purchased together. This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules.
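The support and confidence figures quoted above can be computed in a few lines. This is a hedged sketch over a small made-up transaction list, not the X Company data, so the numbers differ from the slide's 1%/50% except that the confidence comes out to 50% by construction:

```python
# Hypothetical transactions (assumption for illustration; not real X Company data).
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"software"},
]

A, B = {"computer"}, {"software"}
n = len(transactions)

# support(A => B): fraction of all transactions containing both A and B.
support = sum(1 for t in transactions if A | B <= t) / n
# confidence(A => B): fraction of A-transactions that also contain B.
confidence = sum(1 for t in transactions if A | B <= t) / sum(1 for t in transactions if A <= t)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```

Here {computer, software} appears in 2 of 6 baskets and computer in 4, giving support 0.33 and confidence 0.50.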

Slide 10: Application of Association
In addition to the marketing application, the same sort of question has the following uses:
- Baskets = documents; items = words. Words appearing frequently together in documents may represent phrases or linked concepts. Can be used for intelligence gathering.

Slide 11: Application of Association
- Baskets = sentences; items = documents. Two documents with many of the same sentences could represent plagiarism or mirror sites on the Web.

Slide 12: Association Rule Basic Concepts
Let I = {I1, I2, I3, …, Im} be a set of items.
Let D be a database of transactions, where each transaction T is a set of items such that T ⊆ I.
So, if A is a set of items, a transaction T is said to contain A if and only if A ⊆ T.
An association rule is an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅.

Slide 13: Association Rule Support & Confidence
We say that an association rule A ⇒ B holds in the transaction set D with support s and confidence c.
The support of the association rule is the percentage of transactions in D that contain both A and B (i.e., A ∪ B).
So, the support can be considered the probability P(A ∪ B).

Slide 14: Association Rule Support & Confidence (cont…)
The confidence of the association rule is the percentage of transactions in D containing A that also contain B.
So, the confidence can be considered the conditional probability P(B|A).
Association rules that satisfy minimum support and confidence values are said to be strong.

Slide 15: Itemsets & Frequent Itemsets
An itemset is a set of items.
A k-itemset is an itemset that contains k items.
The occurrence frequency of an itemset is the number of transactions that contain the itemset; this is also known more simply as the frequency, support count or count.
An itemset is said to be frequent if its support count satisfies a minimum support count threshold.
The set of frequent k-itemsets is denoted Lk.
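These definitions translate directly into code. A minimal sketch, with a toy transaction list assumed for illustration:

```python
# Support count (occurrence frequency): number of transactions containing the itemset.
def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

# An itemset is frequent if its support count meets the minimum threshold.
def is_frequent(itemset, transactions, min_count):
    return support_count(itemset, transactions) >= min_count

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support_count(frozenset({"A", "C"}), transactions))            # 2
print(is_frequent(frozenset({"A", "C"}), transactions, min_count=2))  # True
```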

Slide 16: Support & Confidence Again
Support and confidence values can be calculated as follows:

support(A ⇒ B) = P(A ∪ B) = support_count(A ∪ B) / |D|
confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

Slides 17 and 18: Mining Association Rules: An Example

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

With a minimum support of 50%, the frequent patterns are:

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%
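The frequent-pattern table for this four-transaction database can be reproduced by brute force, which is feasible only because the example is tiny:

```python
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_support = 0.5
items = sorted(set().union(*transactions))

# Enumerate every candidate itemset and keep those meeting min_support.
frequent = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        sup = sum(1 for t in transactions if set(cand) <= t) / len(transactions)
        if sup >= min_support:
            frequent[frozenset(cand)] = sup

for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{sup:.0%}")
```

This prints exactly the four patterns in the table: {A} at 75%, then {B}, {C} and {A, C} at 50%.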

Slide 19: Association Rule Mining
So, in general, association rule mining can be reduced to the following two steps:
1. Find all frequent itemsets. Each of these itemsets occurs at least as frequently as a predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets. These rules satisfy the minimum support and confidence measures.

Slide 20: Combinatorial Explosion!
A major challenge in mining frequent itemsets is that the number of frequent itemsets generated can be massive.
For example, a long frequent itemset contains a combinatorial number of shorter frequent sub-itemsets.
A frequent itemset of length 100 contains the following number of frequent sub-itemsets:

C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30
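This count is easy to verify directly:

```python
from math import comb

# Number of non-empty sub-itemsets of a 100-itemset:
# C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1.
total = sum(comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")  # roughly 1.27e+30
```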

Slide 21: The Apriori Algorithm
Any subset of a frequent itemset must itself be frequent:
- If {beer, nappy, nuts} is frequent, so is {beer, nappy}.
- Every transaction containing {beer, nappy, nuts} also contains {beer, nappy}.
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
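The pruning principle can be expressed as a subset check: a k-candidate survives only if every one of its (k - 1)-subsets is frequent. A minimal sketch, with frequent_2 a made-up set of frequent pairs for illustration:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k_minus_1):
    """True if any (k-1)-subset of the candidate is not frequent (so prune it)."""
    return any(frozenset(sub) not in frequent_k_minus_1
               for sub in combinations(sorted(candidate), len(candidate) - 1))

# Assumed frequent 2-itemsets for this illustration:
frequent_2 = {frozenset(p) for p in [("beer", "nappy"), ("beer", "nuts"), ("nappy", "nuts")]}

print(has_infrequent_subset({"beer", "nappy", "nuts"}, frequent_2))    # False: keep
print(has_infrequent_subset({"beer", "nappy", "crisps"}, frequent_2))  # True: prune
```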

Slide 22: The Apriori Algorithm (cont…)
The Apriori algorithm is known as a candidate generation-and-test approach.
Method:
- Generate length-(k+1) candidate itemsets from length-k frequent itemsets.
- Test the candidates against the database.
Performance studies show the algorithm's efficiency and scalability.

Slide 23: The Apriori Algorithm: An Example

Database TDB (minimum support count = 2):
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, candidate 1-itemsets C1 with counts:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 ({D} is infrequent): {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2, generated from L1: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan, C2 counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3, generated from L2: {B, C, E}
3rd scan, L3: {B, C, E}: 2
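The whole trace above can be reproduced with a compact, unoptimised Apriori sketch (the slide's later hash-tree optimisation is omitted; min_sup = 2 is taken from the pruning of {D}):

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori; returns {k: {frozenset itemset: support count}}."""
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    L = {1: {s: c for s, c in counts.items() if c >= min_sup}}
    k = 1
    while L[k]:
        prev = list(L[k])
        # Self-join: unions of two frequent k-itemsets that form a (k+1)-itemset.
        cands = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be in L_k.
        cands = {c for c in cands
                 if all(frozenset(s) in L[k] for s in combinations(c, k))}
        # Scan the database to count the surviving candidates.
        counted = {c: sum(1 for t in transactions if c <= t) for c in cands}
        L[k + 1] = {c: n for c, n in counted.items() if n >= min_sup}
        k += 1
    return {level: itemsets for level, itemsets in L.items() if itemsets}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
L = apriori(tdb, min_sup=2)
print(L[3])  # the only frequent 3-itemset is {B, C, E}, with support count 2
```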

Slide 24: Important Details of the Apriori Algorithm
There are two crucial questions in implementing the Apriori algorithm:
- How to generate candidates?
- How to count the supports of candidates?

Slide 25: Generating Candidates
There are two steps to generating candidates:
- Step 1: self-joining Lk
- Step 2: pruning
Example of candidate generation:
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining L3 * L3:
  - abcd from abc and abd
  - acde from acd and ace
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}
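This example can be checked in code. The sketch below joins by taking unions (simpler than the textbook prefix-based join, so it produces one extra candidate, abce), but the prune step removes it along with acde and yields the same C4:

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Generate C_{k+1} from the frequent k-itemsets Lk: self-join, then prune."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # Prune any candidate with a k-subset that is not in Lk.
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(sorted(c), k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = apriori_gen(L3, 3)
print(sorted("".join(sorted(c)) for c in C4))  # ['abcd']
```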

Slide 26: How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
- The total number of candidates can be huge.
- One transaction may contain many candidates.
Method:
- Candidate itemsets are stored in a hash-tree.
- A leaf node of the hash-tree contains a list of itemsets and counts.
- An interior node contains a hash table.
- A subset function finds all the candidates contained in a transaction.
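A hash-tree implementation is beyond a slide, but the counting step it accelerates can be sketched naively: one pass over the database, testing every candidate against every transaction (exactly the cost the hash-tree's subset function avoids). The data below is the slide 23 example:

```python
def count_supports(transactions, candidates):
    """Naive one-pass support counting; a hash-tree would skip most candidate tests."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
C2 = [frozenset(p) for p in [("A", "C"), ("B", "E"), ("C", "E")]]
counts = count_supports(tdb, C2)
for c in C2:
    print(sorted(c), counts[c])
```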

Slide 27: Generating Association Rules
Once all frequent itemsets have been found, association rules can be generated.
Strong association rules are generated from a frequent itemset by calculating the confidence of each possible rule arising from that itemset and testing it against a minimum confidence threshold.
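Concretely: for each non-empty proper subset A of a frequent itemset F, emit the rule A ⇒ (F − A) if its confidence support_count(F) / support_count(A) clears the threshold. A sketch, using support counts from the slide 23 example and an assumed minimum confidence of 70%:

```python
from itertools import combinations

def rules_from_itemset(F, support_count, min_conf):
    """Emit (antecedent, consequent, confidence) for rules A => F - A meeting min_conf."""
    out = []
    for r in range(1, len(F)):
        for A in map(frozenset, combinations(sorted(F), r)):
            conf = support_count[F] / support_count[A]
            if conf >= min_conf:
                out.append((set(A), set(F - A), conf))
    return out

# Support counts from the earlier four-transaction example (min_sup = 2):
sc = {frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
      frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
      frozenset("BCE"): 2}

rules = rules_from_itemset(frozenset("BCE"), sc, min_conf=0.7)
for A, B, conf in rules:
    print(sorted(A), "=>", sorted(B), f"(confidence {conf:.0%})")
```

Only {B, C} ⇒ {E} and {C, E} ⇒ {B} survive, each with confidence 100%; the other four candidate rules have confidence 67%.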

Slide 28: Example

TID    List of item_IDs
T100   Coke, Crisps, Milk
T200   Crisps, Bread
T300   Crisps, Nappies
T400   Coke, Crisps, Bread
T500   Coke, Nappies
T600   Crisps, Nappies
T700   Coke, Nappies
T800   Coke, Crisps, Nappies, Milk
T900   Coke, Crisps, Nappies

ID   Item
I1   Coke
I2   Crisps
I3   Nappies
I4   Bread
I5   Milk
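A few support counts from this table, computed directly (item names are used in place of the I1 to I5 codes):

```python
tdb = {
    "T100": {"Coke", "Crisps", "Milk"},
    "T200": {"Crisps", "Bread"},
    "T300": {"Crisps", "Nappies"},
    "T400": {"Coke", "Crisps", "Bread"},
    "T500": {"Coke", "Nappies"},
    "T600": {"Crisps", "Nappies"},
    "T700": {"Coke", "Nappies"},
    "T800": {"Coke", "Crisps", "Nappies", "Milk"},
    "T900": {"Coke", "Crisps", "Nappies"},
}

def count(itemset):
    """Number of the nine transactions containing the itemset."""
    return sum(1 for t in tdb.values() if itemset <= t)

print(count({"Crisps"}))            # 7
print(count({"Coke"}))              # 6
print(count({"Coke", "Nappies"}))   # 4
```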

Slide 29: Example (cont…)

Slide 30: Challenges of Frequent Pattern Mining
Challenges:
- Multiple scans of the transaction database
- A huge number of candidates
- The tedious workload of support counting for candidates
Improving Apriori, general ideas:
- Reduce the number of passes over the transaction database
- Shrink the number of candidates
- Facilitate the support counting of candidates

Slide 31: Questions?
