EECS 800 Research Seminar Mining Biological Data


1 EECS 800 Research Seminar Mining Biological Data
Instructor: Luke Huan, Fall 2006

2 Administrative Last day to add/change sections without written permission Send a request to reserve reference books in Spahr library

3 Outline for today A road map of frequent item set mining
Efficient and scalable frequent itemset mining methods Mining various kinds of association rules Summary

4 What Is Frequent Pattern Analysis?
Motivation: Finding inherent regularities in data What products were often purchased together?— Beer and diapers?! What are the subsequent purchases after buying a PC? What are the commonly occurring subsequences in a group of genes? What are the shared substructures in a group of effective drugs?

5 What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining Applications Identify motifs in bio-molecules DNA sequence analysis, protein structure analysis Identify patterns in micro-arrays Business applications: Market basket analysis, cross-marketing, catalog design, sale campaign analysis, etc.

6 Data
An item is an element (a literal, a variable, a symbol, a descriptor, an attribute, a measurement, etc.)
A transaction is a set of items
A data set is a set of transactions
A database is a data set

Transaction-id  Items bought
100             f, a, c, d, g, i, m, p
200             a, b, c, f, l, m, o
300             b, f, h, j, o
400             b, c, k, s, p
500             a, f, c, e, l, p, m, n

7 Where Data are Collected
Retailer stores On-line shopping record Genetic profiling Business records Entities and attributes

8 The Nature of the Data
Large volume of data: gigabytes to terabytes
Usually very sparse: most of the items are task-irrelevant

9 Frequent Patterns
Pattern (itemset): a set of items, e.g., acm = {a, c, m}
Support of a pattern: the number of transactions that contain it, e.g., Sup(acm) = 3
Given min_sup = 3, acm is a frequent pattern
Frequent pattern mining: find all frequent patterns in a database

Transaction database TDB
Transaction-id  Items bought
100             f, a, c, d, g, i, m, p
200             a, b, c, f, l, m, o
300             b, f, h, j, o
400             b, c, k, s, p
500             a, f, c, e, l, p, m, n
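The support count on this slide can be checked with a few lines of Python. This is only a sketch: the `support` helper and the dictionary layout are illustrative assumptions, not part of the lecture.

```python
# Transaction database TDB from the slide (item "I" written as "i").
TDB = {
    100: set("facdgimp"),
    200: set("abcflmo"),
    300: set("bfhjo"),
    400: set("bcksp"),
    500: set("afcelpmn"),
}

def support(pattern, db):
    """Count the transactions that contain every item of `pattern` (hypothetical helper)."""
    items = set(pattern)
    return sum(1 for t in db.values() if items <= t)

min_sup = 3
print(support("acm", TDB))             # 3
print(support("acm", TDB) >= min_sup)  # True, so acm is a frequent pattern
```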

10 Association Rules
Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum support and confidence
support, s: the probability that a transaction contains X ∪ Y
confidence, c: the conditional probability that a transaction having X also contains Y
[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]

Transaction-id  Items bought
100             f, a, c, d, g, i, m, p
200             a, b, c, f, l, m, o
300             b, f, h, j, o
400             b, c, k, s, p
500             a, f, c, e, l, p, m, n

Let sup_min = 50%, conf_min = 50%
Association rules:
A → C (60%, 100%)
C → A (60%, 75%)
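The two rule statistics can be reproduced from the transaction table. The following sketch uses hypothetical helper names (`count`, `support`, `confidence`) and treats the slide's A and C as the items a and c.

```python
# The five transactions from the slide, as sets of single-letter items.
TDB = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]

def count(items, db):
    """Number of transactions containing all of `items`."""
    return sum(1 for t in db if set(items) <= t)

def support(x, y, db):
    """Fraction of transactions containing X ∪ Y."""
    return count(set(x) | set(y), db) / len(db)

def confidence(x, y, db):
    """Fraction of the transactions containing X that also contain Y."""
    return count(set(x) | set(y), db) / count(x, db)

print(support("a", "c", TDB), confidence("a", "c", TDB))  # 0.6 1.0
print(support("c", "a", TDB), confidence("c", "a", TDB))  # 0.6 0.75
```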

11 Frequent Pattern Mining: A Road Map
Boolean vs. quantitative associations
occupation(x, “college student”) → buys(x, “laptop”)
age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “car”) [1%, 75%]
Single dimension vs. multiple dimensional associations
Single level vs. multiple-level analysis
What brands of beers are associated with what brands of diapers?

12 Extensions Correlation, causality analysis & mining interesting rules
Maxpatterns and frequent closed itemsets Constraint-based mining Sequential patterns Periodic patterns Structural Patterns

13 Frequent Pattern Mining Methods
Apriori and its variations/improvements Mining frequent-patterns without candidate generation Mining max-patterns and closed itemsets Mining multi-dimensional, multi-level frequent patterns with flexible support constraints Interestingness: correlation and causality

14 Apriori: Candidate Generation-and-test
Frequency anti-monotone property: Any subset of a frequent itemset must be also frequent — an anti-monotone property E.g. a transaction containing {beer, diaper, nuts} also contains {beer, diaper} {beer, diaper, nuts} is frequent implies that {beer, diaper} must also be frequent No superset of any infrequent itemset should be generated or tested Many item combinations can be pruned

15 Apriori-based Mining Generate length (k+1) candidate itemsets from length k frequent itemsets, and Test the candidates against DB

16 Apriori Algorithm
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)
Min_sup = 2

Database D
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D, 1-candidates with support: a: 2, b: 3, c: 3, d: 1, e: 3
Freq 1-itemsets: a: 2, b: 3, c: 3, e: 3
2-candidates: ab, ac, ae, bc, be, ce
Scan D, counting: ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
Freq 2-itemsets: ac: 2, bc: 2, be: 3, ce: 2
3-candidates: bce
Scan D, Freq 3-itemsets: bce: 2

17 The Apriori Algorithm Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;
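A minimal Python sketch of the level-wise loop follows. It is an illustration under stated assumptions, not the original implementation: candidates are formed by taking unions of frequent k-itemsets and then pruned, and supports are counted by a plain scan rather than a hash-tree. The function name `apriori` and the example database (TIDs 10 to 40 with min_sup = 2) are chosen for illustration.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori sketch; returns {frozenset: support count}."""
    db = [frozenset(t) for t in db]
    # L1: count single items in one scan.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items
        # and whose k-subsets are all frequent (the Apriori pruning).
        candidates = set()
        for a, b in combinations(level, 2):
            c = a | b
            if len(c) == k + 1 and all(frozenset(s) in level for s in combinations(c, k)):
                candidates.add(c)
        # One scan of the database to count the surviving candidates.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

D = ["acd", "bce", "abce", "be"]
result = apriori(D, min_sup=2)
print(result[frozenset("bce")])  # 2
```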

18 Important Details of Apriori
How to generate candidates? Step 1: self-joining Lk Step 2: pruning Example of Candidate-generation L3={abc, abd, acd, ace, bcd} Self-joining: L3*L3 abcd from abc and abd acde from acd and ace Pruning: acde is removed because ade is not in L3 C4={abcd}

19 How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order
Step 1: self-joining L_{k-1}
    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
    for each itemset c in C_k do
        for each (k-1)-subset s of c do
            if (s is not in L_{k-1}) then delete c from C_k
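The join and prune steps can be sketched in Python as below. Itemsets are represented as sorted tuples, and the function name `generate_candidates` is an assumption made for this example. The input is the L3 = {abc, abd, acd, ace, bcd} example from the previous slide.

```python
from itertools import combinations

def generate_candidates(prev_level):
    """Self-join and prune; `prev_level` holds sorted tuples of equal length k-1."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join: the first k-2 items agree and p's last item precedes q's.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: every (k-1)-subset of c must itself be frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = generate_candidates(L3)
print(["".join(c) for c in C4])  # ['abcd']; acde is pruned because ade is not in L3
```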

20 How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and counts
An interior node contains a hash table
Subset function: finds all the candidates contained in a transaction

21 Revisit the Apriori Algorithm
Ck: Candidate itemset of size k
Lk: Frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
return ∪k Lk;

22 Challenges of Frequent Pattern Mining
Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates

23 DIC: Reduce Number of Scans
Once both A and D are determined frequent, the counting of AD can begin
Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin
[Figure: itemset lattice over {A, B, C, D}, from {} up to ABCD, with a timeline over the transactions comparing when Apriori and DIC start counting 1-itemsets, 2-itemsets, and 3-itemsets]
S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997.

24 Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95

25 DHP: Reduce the Number of Candidates
Observation: the first few iterations take a large portion of the total CPU consumption
Intuition: give Apriori a jump start!
Identify all frequent k-patterns, where k is a small number > 1, in the first scan of a DB
Use a hash table to keep track of frequent k-patterns
For each transaction t in DB do
    For each triplet (k-pattern) s of t do
        H[s] ← H[s] + 1

26 DHP: Reduce the Number of Candidates
Technique: a hashing bucket count < min_sup → every candidate in the bucket is infrequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae} {bd, be, de} …
Large 1-itemset: a, b, d, e
The sum of counts of {ab, ad, ae} < min_sup → ab should not be a candidate 2-itemset
J. Park, M. Chen, and P. Yu, 1995. An effective hash-based algorithm for mining association rules. SIGMOD’95.
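The bucket-count filter can be sketched for 2-itemsets as follows. This is a simplified illustration, not the DHP paper's implementation: the bucket count, the use of `zlib.crc32` as the hash function, and the name `dhp_pair_candidates` are all assumptions. A pair survives only if both items are frequent and its bucket count reaches min_sup; because a bucket count is an upper bound on the support of every pair hashed into it, no frequent pair is lost.

```python
from itertools import combinations
from zlib import crc32

def dhp_pair_candidates(db, min_sup, n_buckets=7):
    """First scan: count items and hash every 2-subset of each transaction into a bucket."""
    item_count = {}
    buckets = [0] * n_buckets

    def bucket(pair):
        # Deterministic hash of a sorted pair (an arbitrary choice for this sketch).
        return crc32(",".join(pair).encode()) % n_buckets

    for t in db:
        for item in t:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket(pair)] += 1

    frequent_items = sorted(i for i, c in item_count.items() if c >= min_sup)
    # Keep a pair only if its bucket count could still reach min_sup.
    return [p for p in combinations(frequent_items, 2) if buckets[bucket(p)] >= min_sup]

D = ["acd", "bce", "abce", "be"]
candidates = dhp_pair_candidates(D, min_sup=2)
```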

27 A New Algebraic Frame: Set Enumeration Tree
Subsets of I can be enumerated systematically
I = {a, b, c, d}
[Figure: set enumeration tree, listed by level]
Level 1: a, b, c, d
Level 2: ab, ac, ad, bc, bd, cd
Level 3: abc, abd, acd, bcd
Level 4: abcd
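One way to see the systematic enumeration is to extend each node only with items that come later in the item order, so every subset is reached exactly once. The generator below is a sketch; its name and the depth-first visiting order are choices made for this example.

```python
def set_enumeration_tree(items):
    """Yield every non-empty subset of `items` exactly once, depth-first."""
    def expand(prefix, start):
        for i in range(start, len(items)):
            node = prefix + (items[i],)
            yield node
            # Children of a node only add items that come after its last item.
            yield from expand(node, i + 1)
    yield from expand((), 0)

nodes = ["".join(n) for n in set_enumeration_tree("abcd")]
print(nodes)
# ['a', 'ab', 'abc', 'abcd', 'abd', 'ac', 'acd', 'ad', 'b', 'bc', 'bcd', 'bd', 'c', 'cd', 'd']
```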

28 Borders of Frequent Itemsets
Connected: X and Y are frequent and X is an ancestor of Y implies that all patterns between X and Y are frequent
[Figure: set enumeration tree over {a, b, c, d}]

29 Projected Databases
To find a child Xy of X, only the X-projected database is needed
The sub-database of transactions containing X
Item y is frequent in the X-projected database
[Figure: set enumeration tree over {a, b, c, d}]

30 Tree-Projection Method
Find frequent 2-itemsets For each frequent 2-itemset xy, form a projected database The sub-database containing xy Recursive mining If x’y’ is frequent in xy-proj db, then xyx’y’ is a frequent pattern

31 Why Is Tree-projection Fast?
A bi-level unfolding of set enumeration tree Major operations Finding frequent 2-itemsets: faster than matching candidates Form projected databases AAP’01

32 Reduce Pattern Number by Sampling
Select a sample of original database, mine frequent patterns within sample using Apriori Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked Example: check abcd instead of ab, ac, …, etc. Scan database again to find missed frequent patterns H. Toivonen. Sampling large databases for association rules. In VLDB’96

33 Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
To find frequent itemset i1i2…i100
# of scans: 100
# of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
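The candidate count can be verified directly, since Python integers are arbitrary precision. This is only an arithmetic check of the figure on the slide.

```python
from math import comb

# Sum of C(100, k) for k = 1..100: every non-empty subset of a 100-item pattern.
total = sum(comb(100, k) for k in range(1, 101))

print(total == 2**100 - 1)  # True
print(f"{total:.2e}")       # 1.27e+30
```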

34 Mining Frequent Patterns Without Candidate Generation
Grow long patterns from short ones using local frequent items
“abc” is a frequent pattern
Get all transactions having “abc”: DB|abc
“d” is a local frequent item in DB|abc → abcd is a frequent pattern
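The growth idea can be sketched without an FP-tree by recursing on projected databases held as plain lists. This is a simplified illustration, not FP-growth itself: transactions are strings of distinct single-letter items, patterns are reported with their items in alphabetical order, and the name `pattern_growth` is an assumption. The example data is the five-transaction database used earlier in the lecture.

```python
from collections import Counter

def pattern_growth(db, min_sup, prefix=""):
    """Return {pattern: support} by growing `prefix` with local frequent items."""
    counts = Counter(item for t in db for item in set(t))
    result = {}
    for item in sorted(counts):
        if counts[item] < min_sup:
            continue
        pattern = prefix + item
        result[pattern] = counts[item]
        # Projected database DB|pattern: transactions containing `item`,
        # restricted to later items so that no pattern is generated twice.
        projected = ["".join(i for i in t if i > item) for t in db if item in t]
        result.update(pattern_growth(projected, min_sup, pattern))
    return result

TDB = ["facdgimp", "abcflmo", "bfhjo", "bcksp", "afcelpmn"]
patterns = pattern_growth(TDB, min_sup=3)
print(patterns["acfm"])  # 3
```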

35 Construct FP-tree from a Transaction Database
TID  Items bought              (Ordered) frequent items
100  {f, a, c, d, g, i, m, p}  {f, c, a, m, p}
200  {a, b, c, f, l, m, o}     {f, c, a, b, m}
300  {b, f, h, j, o, w}        {f, b}
400  {b, c, k, s, p}           {c, b, p}
500  {a, f, c, e, l, p, m, n}  {f, c, a, m, p}
min_support = 3

Scan DB once, find frequent 1-itemsets (single item patterns)
Sort frequent items in frequency descending order: the f-list
Scan DB again, construct the FP-tree
F-list = f-c-a-b-m-p

Header table
Item  Frequency
f     4
c     4
a     3
b     3
m     3
p     3

FP-tree (root-to-leaf paths)
{} → f:4 → c:3 → a:3 → m:2 → p:2
{} → f:4 → c:3 → a:3 → b:1 → m:1
{} → f:4 → b:1
{} → c:1 → b:1 → p:1
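The second scan can be sketched as inserting each ordered transaction into a prefix tree. The `Node` class and `build_fp_tree` function below are hypothetical names for this illustration, and the f-list is passed in explicitly because ties in frequency (f and c, for instance) are ordered on the slide in a way that a generic sort would not necessarily reproduce.

```python
class Node:
    """One FP-tree node: an item, its count, its parent, and its children by item."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, f_list):
    """Insert every transaction, filtered and ordered by `f_list`, into a prefix tree."""
    root = Node(None, None)
    header = {item: [] for item in f_list}  # item -> list of its nodes (node-links)
    for t in db:
        node = root
        for item in (i for i in f_list if i in t):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

TDB = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
root, header = build_fp_tree(TDB, f_list="fcabmp")
print(root.children["f"].count)                # 4
print(root.children["f"].children["c"].count)  # 3
```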

36 Benefits of the FP-tree Structure
Completeness Preserve complete information for frequent pattern mining Never break a long pattern of any transaction Compactness Reduce irrelevant info—infrequent items are gone Items in frequency descending order: the more frequently occurring, the more likely to be shared Never be larger than the original database (not count node-links and the count field) For Connect-4 DB, compression ratio could be over 100

37 Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list
F-list = f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
…
Patterns having c but no a nor b, m, p
Pattern f
Completeness and non-redundancy

38 Find Patterns Having P
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base
[Figure: FP-tree and header table as on slide 35]

Conditional pattern bases
Item  Cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1
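The conditional pattern bases in the table can be reproduced without building the tree, because the prefix path of an item in the FP-tree is the part of each ordered transaction that precedes the item. The sketch below relies on that equivalence; the helper names `ordered` and `conditional_pattern_base` are assumptions.

```python
from collections import Counter

F_LIST = "fcabmp"
TDB = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]

def ordered(transaction):
    """Keep only frequent items, in f-list order."""
    return "".join(i for i in F_LIST if i in transaction)

def conditional_pattern_base(item):
    """Count the prefixes that precede `item` in the ordered transactions."""
    base = Counter()
    for t in map(ordered, TDB):
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base[prefix] += 1
    return dict(base)

print(conditional_pattern_base("m"))  # {'fca': 2, 'fcab': 1}
print(conditional_pattern_base("p"))  # {'fcam': 2, 'cb': 1}
```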

39 Find Patterns Having Item m But No p
Form m-projected database TDB|m
Item p is excluded
Contains fca:2, fcab:1
Local frequent items: f, c, a
Build FP-tree for TDB|m
[Figure: the global FP-tree with header table (f, c, a, b, m, p) next to the m-projected FP-tree with header table (f, c, a) and the single path root → f:3 → c:3 → a:3]

40 Find Patterns Having Item m But No p
For each pattern base
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base
m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam
[Figure: global FP-tree and header table as on slide 35]

41 Recursion: Mining Each Conditional FP-tree
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Cond. pattern base of “am”: (fc:3); am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of “cm”: (f:3); cm-conditional FP-tree: {} → f:3
Cond. pattern base of “cam”: (f:3); cam-conditional FP-tree: {} → f:3

42 Recursive Mining Patterns having m but no p can be mined recursively
Optimization: enumerate patterns from a single-branch FP-tree
Enumerate all combinations
Support = that of the last item
m, fm, cm, am
fcm, fam, cam
fcam
m-projected FP-tree (header table: f, c, a): root → f:3 → c:3 → a:3
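The single-branch shortcut amounts to enumerating every combination of nodes on the path and attaching the suffix. A sketch, with `patterns_from_single_path` as a hypothetical name and the path given as (item, count) pairs from the root downward:

```python
from itertools import combinations

def patterns_from_single_path(path, suffix, suffix_support):
    """Enumerate all patterns from a single-path conditional FP-tree."""
    result = {suffix: suffix_support}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = "".join(item for item, _ in combo)
            # Support is that of the last (deepest) node in the combination.
            result[items + suffix] = combo[-1][1]
    return result

m_tree_path = [("f", 3), ("c", 3), ("a", 3)]
found = patterns_from_single_path(m_tree_path, suffix="m", suffix_support=3)
print(sorted(found))
# ['am', 'cam', 'cm', 'fam', 'fcam', 'fcm', 'fm', 'm']
```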

43 A Special Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix-path P
Mining can be decomposed into two parts
Reduction of the single prefix path into one node
Concatenation of the mining results of the two parts
[Figure: a tree with the single prefix path {} → a1:n1 → a2:n2 → a3:n3 followed by a multi-branch part with nodes b1:m1, C1:k1, C2:k2, C3:k3; it is split into the prefix path plus a tree rooted at r1 that holds the multi-branch part]

44 Mining Frequent Patterns With FP-trees
Idea: Frequent pattern growth Recursively grow frequent patterns by pattern and database partition Method For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree Repeat the process on each newly created conditional FP-tree Until the resulting FP-tree is empty, or it contains only one path—single path will generate all the combinations of its sub-paths, each of which is a frequent pattern

45 Why Is FP-Growth the Winner?
Divide-and-conquer: decompose both the mining task and DB according to the frequent patterns obtained so far leads to focused search of smaller databases Other factors no candidate generation, no candidate test compressed database: FP-tree structure no repeated scan of entire database basic ops—counting local freq items and building sub FP-tree, no pattern search and matching

46 Implications of the Methodology
Mining closed frequent itemsets and max-patterns CLOSET (DMKD’00) Mining sequential patterns FreeSpan (KDD’00), PrefixSpan (ICDE’01) Constraint-based mining of frequent patterns Convertible constraints (KDD’00, ICDE’01) Computing iceberg data cubes with complex measures H-tree and H-cubing algorithm (SIGMOD’01)

47 Borders and Max-patterns
Max-patterns: borders of frequent patterns
A subset of a max-pattern is frequent
A superset of a max-pattern is infrequent
[Figure: set enumeration tree over {a, b, c, d}]

48 Closed Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al., ICDT’99)
A closed pattern is a lossless compression of frequent patterns
Reducing the # of patterns and rules

49 Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1.
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
What is the set of all patterns? !!
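The definitions can be tried on a scaled-down analogue of the exercise, with 4 items instead of 100 so that brute-force enumeration is feasible. The database {abcd, ab} and all helper names below are assumptions made for this sketch; it is not a solution to the 100-item exercise.

```python
from itertools import combinations

def frequent_itemsets(db, min_sup):
    """Brute force: test every non-empty subset of the item universe."""
    items = sorted(set().union(*db))
    result = {}
    for r in range(1, len(items) + 1):
        for c in combinations(items, r):
            s = frozenset(c)
            n = sum(1 for t in db if s <= t)
            if n >= min_sup:
                result[s] = n
    return result

def closed_patterns(freq):
    """Frequent itemsets with no proper superset of equal support."""
    return {x: n for x, n in freq.items()
            if not any(x < y and freq[y] == n for y in freq)}

def max_patterns(freq):
    """Frequent itemsets with no frequent proper superset."""
    return {x for x in freq if not any(x < y for y in freq)}

DB = [set("abcd"), set("ab")]  # scaled-down stand-in for {<a1..a100>, <a1..a50>}
freq = frequent_itemsets(DB, min_sup=1)
print(len(freq))                                              # 15
print({"".join(sorted(x)): n for x, n in closed_patterns(freq).items()})
print({"".join(sorted(x)) for x in max_patterns(freq)})       # {'abcd'}
```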

50 MaxMiner: Mining Max-patterns
Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F
Min_sup = 2

1st scan: find frequent items
A, B, C, D, E
2nd scan: find support for
AB, AC, AD, AE, ABCDE
BC, BD, BE, BCDE
CD, CE, CDE, DE
Potential max-patterns: ABCDE, BCDE, CDE
Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scans
R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD’98

51 Mining Frequent Closed Patterns: CLOSET
Flist: list of all frequent items in support ascending order
Flist: d-a-f-e-c
Min_sup = 2
Divide search space
Patterns having d
Patterns having d but no a, etc.
Find frequent closed patterns recursively
Every transaction having d also has cfa → cfad is a frequent closed pattern
J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.

52 CLOSET+: Mining Closed Itemsets by Pattern-Growth
Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X’s descendants in the set enumeration tree can be pruned
Hybrid tree projection
Bottom-up physical tree-projection
Top-down pseudo tree-projection
Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header table at higher levels
Efficient subset checking

53 CHARM: Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, …}
tid-list: list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining
Only keep track of differences of tids
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
Diffset(XY, X) = {T2}
Eclat/MaxEclat (Zaki et al.), VIPER (P. Shenoy et al.), CHARM (Zaki & …)
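The tid-list and diffset bookkeeping can be illustrated on a three-transaction database chosen to match the t(X) and t(XY) example above. The database contents and the helper name `tid_lists` are assumptions made for this sketch.

```python
def tid_lists(db):
    """Convert horizontal transactions into a vertical item -> tid-set map."""
    vertical = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

# Hypothetical database chosen so that t(X) = {T1, T2, T3} and t(XY) = {T1, T3}.
DB = {"T1": {"X", "Y"}, "T2": {"X"}, "T3": {"X", "Y"}}
t = tid_lists(DB)

t_xy = t["X"] & t["Y"]        # tid-list of XY by intersection
diffset = t["X"] - t_xy       # Diffset(XY, X)
sup_xy = len(t["X"]) - len(diffset)

print(sorted(t_xy))     # ['T1', 'T3']
print(sorted(diffset))  # ['T2']
print(sup_xy)           # 2
```

Storing the diffset instead of the full tid-list keeps the support recoverable, as the last line shows, while usually holding fewer ids when X and XY occur in nearly the same transactions.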

54 Further Improvements of Mining Methods
AFOPT (Liu et al., KDD’03)
A “push-right” method for mining condensed frequent pattern (CFP) tree
Carpenter (Pan et al., KDD’03)
Mine data sets with few rows but numerous columns
Construct a row-enumeration tree for efficient mining

55 Visualization of Association Rules: Plane Graph

56 Visualization of Association Rules: Rule Graph

57 Visualization of Association Rules (SGI/MineSet 3.0)

58 Summary
The frequent itemset mining problem is important and has many applications in bioinformatics
Efficient heuristics were developed in the past
Patterns are used to construct association rules; rules are used to predict properties of objects
Maximal patterns and closed patterns are techniques to reduce the total number of patterns

59 Ref: Basic Concepts of Frequent Pattern Mining
(Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
(Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
(Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
(Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.

