Presentation transcript: Data Mining 1

2 Data Mining 1
Data Mining is one aspect of Database Query Processing (on the "what if" or pattern-and-trend end of Query Processing, rather than the "please find" or straightforward end). To say it another way, data mining queries are on the ad hoc or unstructured end of the query spectrum rather than the standard-report-generation, "retrieve all records matching a criteria," or SQL side. Still, Data Mining queries ARE queries and are processed (or will eventually be processed) by a Database Management System the same way queries are processed today, namely:
1. SCAN and PARSE (SCANNER-PARSER): a Scanner identifies the tokens or language elements of the DM query; the Parser checks for syntax or grammar validity.
2. VALIDATE: the Validator checks for valid names and semantic correctness.
3. CONVERT: the Converter converts the query to an internal representation.
4. QUERY OPTIMIZE: the Optimizer devises a strategy for executing the DM query (chooses among alternative internal representations).
5. CODE GENERATION: generate code to implement each operator in the selected DM query plan (the optimizer-selected internal representation).
6. RUNTIME DATABASE PROCESSING: run the plan code.
Developing new, efficient and effective Data Mining Query (DMQ) processors is the central need and issue in DBMS research today (far and away!). These notes concentrate on step 5, i.e., generating code (algorithms) to implement operators (at a high level), namely operators that do: Association Rule Mining (ARM), Clustering (CLU), Classification (CLA).

3 Machine Learning is almost always based on Near Neighbor Set(s), NNS. Clustering, even density-based clustering, identifies near-neighbor cores first (round NNSs, ε-disks about a center). Classification is continuity-based, and Near Neighbor Sets (NNS) are the central concept in continuity: ∀ε>0 ∃δ>0 such that d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector; equivalently, every ε-NNS of f(a) has a δ-NNS of a in its pre-image. If f(Dom) is categorical: ∃δ>0 such that d(x,a)<δ ⇒ f(x)=f(a).
Database analysis can be broken down into 2 areas, Querying and Data Mining. Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining. Machine Learning can be broken down into 2 areas, Clustering and Classification. Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based. Classification can be broken down into 2 types, Model-based and Neighbor-based.
Caution: for classification, boundary analysis may also be needed to see the class (done by projecting?). Finding NNS in a lower dimension may still be the first step. E.g., points 1-8 are all within ε of a (an unclassified sample); 1-4 are red-class, 5-8 are blue-class. Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace and then taking ε/2, the ε/2-disk about a contains only blue-class votes (5, 6).
Using horizontal data, NNS derivation requires ≥1 scan (O(n)). An L∞ ε-NNS can be derived using vertical data in O(log₂ n) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)
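
As a rough illustration of ε-NNS majority voting, including the project-then-shrink-ε tie-break described above, here is a minimal Python sketch (the function names and projection parameters are assumptions, not from the slides):

```python
from collections import Counter
import math

def eps_nns(sample, data, eps):
    """Indices of all points within eps of the unclassified sample (Euclidean distance)."""
    return [i for i, x in enumerate(data) if math.dist(sample, x) < eps]

def nns_vote(sample, data, labels, eps):
    """Classify by majority vote over the eps-near-neighbor set; None if there are no votes."""
    votes = Counter(labels[i] for i in eps_nns(sample, data, eps))
    return votes.most_common(1)[0][0] if votes else None

def nns_vote_projected(sample, data, labels, eps, dims):
    """Tie-break idea from the slide: project onto a subspace, then vote with eps/2."""
    project = lambda v: [v[d] for d in dims]
    return nns_vote(project(sample), [project(x) for x in data], labels, eps / 2)
```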

4 Association Rule Mining (ARM)
Assume a relationship between two entities, T (e.g., a set of Transactions an enterprise performs) and I (e.g., a set of Items which are acted upon by those transactions). In Market Basket Research (MBR) a transaction is a checkout transaction and an item is an item in that customer's market basket going through checkout.
An I-Association Rule, A ⇒ C, relates 2 disjoint subsets of I (itemsets) and has 2 main measures, support and confidence (A is called the antecedent, C is called the consequent). There are also the dual concepts of T-association rules (just reverse the roles of T and I above).
Examples of Association Rules include:
- MBR: the relationship between customer cash-register transactions, T, and purchasable items, I (t is related to i iff i is being bought by that customer during that cash-register transaction).
- Software Engineering (SE): the relationship between Aspects, T, and Code Modules, I (t is related to i iff module i is part of aspect t).
- Bioinformatics: the relationship between experiments, T, and genes, I (t is related to i iff gene i expresses at a threshold level during experiment t).
- ER diagramming: any "part of" relationship in which i ∈ I is part of t ∈ T (t is related to i iff i is part of t); and any "ISA" relationship in which i ∈ I ISA t ∈ T (t is related to i iff i IS A t)...
The support of an I-set, A, is the fraction of T-instances related to every I-instance in A; e.g., if A={i1,i2} and C={i4} then supp(A) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5. Note: | | means set size, the count of elements in the set. I.e., t2 and t4 are the only transactions from the total transaction set, T={t1,t2,t3,t4,t5}, that are related to both i1 and i2 (buy i1 and i2 during the pertinent T-period of time).
The support of the rule, A ⇒ C, is defined as supp(A ∪ C) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5.
The confidence of the rule, A ⇒ C, is supp(A ∪ C) / supp(A) = (2/5) / (2/5) = 1.
DM queriers typically want STRONG RULES: supp ≥ minsupp and conf ≥ minconf (minsupp and minconf are threshold levels). Note that conf(A ⇒ C) is also just the conditional probability of t being related to C, given that t is related to A.
[Diagram: the bipartite relationship between T = {t1,...,t5} and I = {i1,...,i4}, with the antecedent A and consequent C marked.]
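
A minimal Python sketch of these two measures over a T-to-I relationship (the dictionary representation is an assumption for illustration):

```python
def supp(itemset, transactions):
    """Support of an I-set: the fraction of transactions related to every item in it.
    `transactions` maps each transaction id to the set of items it is related to."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def conf(antecedent, consequent, transactions):
    """Confidence of A => C, i.e. supp(A u C) / supp(A)."""
    return supp(antecedent | consequent, transactions) / supp(antecedent, transactions)

def is_strong(antecedent, consequent, transactions, minsupp, minconf):
    """STRONG RULE test: both thresholds must be met."""
    return (supp(antecedent | consequent, transactions) >= minsupp and
            conf(antecedent, consequent, transactions) >= minconf)
```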

5 Finding Strong Association Rules
The relationship between Transactions and Items can be expressed in a Transaction Table where each transaction is a row containing its ID and the list of the items related to that transaction. As a Transaction Bitmap Table it can be expressed using "item bit vectors":

TID   A B C D E F
2000  1 1 1 0 0 0
1000  1 0 1 0 0 0
4000  1 0 0 1 0 0
5000  0 1 0 0 1 1

Suppose minsupp is set by the querier at .5 and minconf at .75.
Inheritance property: any subset of a large itemset is large. Why? (E.g., if {A,B} is large, {A} and {B} must be large.)
APRIORI METHOD: iteratively find the large k-itemsets, k=1,2,...; then find all association rules supported by each large itemset. Ck denotes the candidate k-itemsets generated at each step; Lk denotes the large k-itemsets.
Start by finding the large 1-itemsets: the 1-itemset supports are A:3, B:2, C:2, D:1, E:1, F:1, so the large 1-itemsets (supp ≥ 2, i.e., ≥ .5 of the 4 transactions) are {A}, {B}, {C}.
To find frequent or large itemsets (support ≥ minsupp), pseudocode (assume the items in Lk-1 are ordered):
Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1=q.item1, ..., p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) delete c from Ck
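
A minimal Python sketch of the two pseudocode steps above (self-join then prune); itemsets are sorted tuples and the function name is illustrative:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets C_k from the large (k-1)-itemsets L_prev."""
    L_prev = set(L_prev)
    C_k = set()
    # Step 1: self-join -- pair itemsets that agree on the first k-2 items.
    for p in L_prev:
        for q in L_prev:
            if p[:k-2] == q[:k-2] and p[k-2] < q[k-2]:
                C_k.add(p + (q[k-2],))
    # Step 2: prune -- drop candidates with a (k-1)-subset that is not large.
    return {c for c in C_k if all(s in L_prev for s in combinations(c, k-1))}

# e.g. apriori_gen({('A',), ('B',), ('C',)}, 2) == {('A','B'), ('A','C'), ('B','C')}
```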

6 Example: ARM using uncompressed P-trees (note: the 1-count has been placed at the root of each P-tree).

Database D (TID and item bit vector over items 1-5):
TID 100: 1 0 1 1 0   (items 1,3,4)
TID 200: 0 1 1 0 1   (items 2,3,5)
TID 300: 1 1 1 0 1   (items 1,2,3,5)
TID 400: 0 1 0 0 1   (items 2,5)

Scan D and build the item P-trees (leaf bits over TIDs 100,200,300,400, with root 1-counts):
P1 = 1010 (2), P2 = 0111 (3), P3 = 1110 (3), P4 = 1000 (1), P5 = 0111 (3)
L1 = {1}, {2}, {3}, {5}   ({4} is not large at a minsupp count of 2)

C2 is counted by ANDing P-trees:
P1^P2 = 0010 (1), P1^P3 = 1010 (2), P1^P5 = 0010 (1), P2^P3 = 0110 (2), P2^P5 = 0111 (3), P3^P5 = 0110 (2)
L2 = {1,3}, {2,3}, {2,5}, {3,5}

C3 = {2,3,5}, {1,2,3}, {1,3,5}: {1,2,3} is pruned since {1,2} is not large, and {1,3,5} is pruned since {1,5} is not large.
P1^P2^P3 = 0010 (1), P1^P3^P5 = 0010 (1), P2^P3^P5 = 0110 (2)
L3 = {2,3,5}
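
A minimal sketch of this vertical computation, using plain Python integers as stand-ins for the (uncompressed) P-trees of the example:

```python
# Item bit vectors from the 4-transaction database above, one bit per TID (100,200,300,400).
P = {1: 0b1010, 2: 0b0111, 3: 0b1110, 4: 0b1000, 5: 0b0111}
N = 4  # number of transactions

def support_count(itemset):
    """Root 1-count of the AND of the items' bit vectors = number of supporting transactions."""
    bits = (1 << N) - 1          # start from the all-ones vector
    for item in itemset:
        bits &= P[item]
    return bin(bits).count("1")

# e.g. support_count({2, 3, 5}) == 2, so {2,3,5} is large at a minsupp count of 2.
```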

7 Finding the strong rules supported by L1, L2, and L3.
1-itemsets don't support association rules (they would have no antecedent or no consequent); 2-itemsets (and larger) do.
Are there any strong rules supported by large 2-itemsets (at minconf = .75)?
{1,3}: conf{1}⇒{3} = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75  STRONG!
       conf{3}⇒{1} = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf{2}⇒{3} = supp{2,3}/supp{2} = 2/3 = .67 < .75
       conf{3}⇒{2} = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf{2}⇒{5} = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75  STRONG!
       conf{5}⇒{2} = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75  STRONG!
{3,5}: conf{3}⇒{5} = supp{3,5}/supp{3} = 2/3 = .67 < .75
       conf{5}⇒{3} = supp{3,5}/supp{5} = 2/3 = .67 < .75
Are there any strong rules supported by large 3-itemsets?
{2,3,5}: conf{2,3}⇒{5} = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75  STRONG!
         conf{2,5}⇒{3} = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
         conf{3,5}⇒{2} = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75
No smaller (subset) antecedent can yield a strong rule either: there is no need to check conf{2}⇒{3,5} or conf{5}⇒{2,3}, since their denominators (supp{2}, supp{5}) are at least as large as supp{2,5}, so both confidences are at most conf{2,5}⇒{3} = .67; likewise there is no need to check conf{3}⇒{2,5}, which is at most conf{3,5}⇒{2} = .67. DONE!
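
A minimal Python sketch of this rule-checking step (`supp` here is just a dictionary of the support counts from the example):

```python
from itertools import combinations

def strong_rules(large_itemset, supp, minconf):
    """Return the rules A => C (with A u C = large_itemset) whose confidence meets minconf."""
    items = frozenset(large_itemset)
    rules = []
    for r in range(1, len(items)):                       # antecedent sizes 1 .. k-1
        for antecedent in map(frozenset, combinations(items, r)):
            confidence = supp[items] / supp[antecedent]  # supp(A u C) / supp(A)
            if confidence >= minconf:
                rules.append((antecedent, items - antecedent, confidence))
    return rules

# Support counts from the example database:
supp = {frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
        frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
        frozenset({2, 3, 5}): 2}
print(strong_rules({2, 3, 5}, supp, 0.75))   # only {2,3} => {5} is strong
```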

8 P-tree Review: a data table, R(A1..An), containing horizontal structures (records) is processed vertically (vertical scans). Vertical basic binary Predicate-tree (P-tree): vertically partition the table, compress each vertical bit slice into a basic binary P-tree, and then process using multi-operand logical ANDs.
The basic binary P-tree, P11, for bit slice R11 is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity is reached (a pure half, whether pure1 or pure0, ends that branch):
1. The whole file is not pure1 → 0
2. The 1st half is not pure1 → 0
3. The 2nd half is not pure1 → 0
4. The 1st half of the 2nd half is not pure1 → 0
5. The 2nd half of the 2nd half is pure1 → 1
6. The 1st half of the 1st quarter of the 2nd half is pure1 → 1
7. The 2nd half of the 1st quarter of the 2nd half is not pure1 → 0
[Figure: the example table R(A1 A2 A3 A4) of 3-bit attribute values, its bit slices R11..R43, and the basic P-trees P11..P43; e.g., the number of occurrences of the value combination 111 000 001 100 is counted by ANDing P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43 level by level, giving a count of 2.]
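
A minimal Python sketch of this top-down construction (the nested-tuple representation is an assumption for illustration; only impure halves are subdivided):

```python
def build_ptree(bits):
    """Basic binary P-tree over one bit slice: a node is 1 iff its half is pure1;
    pure halves (pure1 or pure0) end the branch, impure halves recurse on their halves."""
    if all(bits):                 # pure1
        return (1, [])
    if not any(bits):             # pure0
        return (0, [])
    mid = len(bits) // 2          # impure: record 0 and recurse on the two halves
    return (0, [build_ptree(bits[:mid]), build_ptree(bits[mid:])])

# e.g. an 8-record bit slice (illustrative values, not the slide's exact R11):
# build_ptree([0, 0, 0, 0, 1, 0, 1, 1])
```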

9 Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient. Bottom-up construction of P11 is done using an in-order tree traversal and the collapsing of pure siblings, as follows:
[Figure: the bit slices R11..R43 of the example table and the bottom-up construction of P11, with pure sibling leaves collapsed into their parent.]
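
A rough Python sketch of the bottom-up alternative (same nested-tuple representation as above; it assumes the number of bits is a power of two and collapses pure sibling pairs as it moves up):

```python
def build_ptree_bottom_up(bits):
    """Build leaf nodes first, then repeatedly pair siblings, collapsing pure pairs."""
    level = [(b, []) for b in bits]                   # one leaf per bit
    while len(level) > 1:
        paired = []
        for left, right in zip(level[0::2], level[1::2]):
            if left == right and left[1] == []:       # pure siblings collapse into the parent
                paired.append(left)
            else:                                     # mixed: impure parent keeps both children
                paired.append((0, [left, right]))
        level = paired
    return level[0]
```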

10 To count the occurrences of the value combination 7,0,1,4 (i.e., the bit pattern 111 000 001 100), AND the corresponding basic P-trees and their complements (primes denote complement P-trees):
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
The AND is evaluated level by level: a 0 high in one operand makes the entire branch 0. Here the only 1-bit in the result is at the 2^1 level, so the 1-count = 1 * 2^1 = 2.
Processing efficiencies? (Prefixed leaf-sizes have been removed.)
[Figure: the example table R(A1 A2 A3 A4), its bit slices R11..R43, the basic P-trees, and the level-by-level AND showing which 0s zero out branches.]
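
A minimal Python sketch of this pattern count (flat bit vectors stand in for the compressed P-trees; names are illustrative):

```python
def count_value(slices, n, value_bits):
    """Count records matching `value_bits` by ANDing basic bit slices, using the
    complement slice (P') wherever the target bit is 0.
    `slices[j]` is the j-th vertical bit slice packed into an int over n records."""
    mask = (1 << n) - 1
    result = mask                                          # all records selected initially
    for slice_, target in zip(slices, value_bits):
        result &= slice_ if target else (~slice_ & mask)   # P_ij for 1-bits, P'_ij for 0-bits
    return bin(result).count("1")

# Counting 7,0,1,4 means value_bits = (1,1,1, 0,0,0, 0,0,1, 1,0,0) over the 12 slices R11..R43.
```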

11 P-tree ARM (P-ARM) versus Apriori on aerial photo (RGB) data together with yield data.
1320 × 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000). P-ARM is compared to horizontal Apriori (classical) and to FP-growth (an improvement of it).
- In P-ARM, we find all frequent itemsets, not just those containing Yield (for fairness).
- Aerial TIFF images (R, G, B) with synchronized yield (Y).
- Identical results.
- P-ARM is more scalable for lower support thresholds.
- The P-ARM algorithm is more scalable to large spatial datasets.
[Charts: scalability with support threshold; scalability with number of transactions.]

12 P-ARM versus FP-growth (see the literature for its definition).
17,424,000 pixels (transactions).
- FP-growth = an efficient, tree-based frequent pattern mining method (details later).
- For a dataset of 100K bytes, FP-growth runs very fast. But for images of large size, P-ARM achieves better performance.
- P-ARM achieves better performance in the case of a low support threshold.
[Charts: scalability with support threshold; scalability with number of transactions.]

13 Other methods (other than FP-growth) to improve Apriori's efficiency (see the literature or the html notes 10datamining.html in Other Materials for more detail):
- Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (sketched below).
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine the completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
The core of the Apriori algorithm:
- Use only large (k-1)-itemsets to generate candidate large k-itemsets.
- Use database scans and pattern matching to collect counts for the candidate itemsets.
The bottleneck of Apriori is candidate generation:
1. Huge candidate sets: 10^4 large 1-itemsets may generate 10^7 candidate 2-itemsets; to discover a large pattern of size 100, e.g. {a1...a100}, we would need to generate 2^100 ≈ 10^30 candidates.
2. Multiple scans of the database: it needs (n+1) scans, where n is the length of the longest pattern.
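
A minimal Python sketch of the transaction-reduction idea (the data representation is an assumption):

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Drop transactions that contain no frequent k-itemset: they cannot contribute
    to any frequent itemset of size > k, so later scans can skip them.
    `transactions` maps a transaction id to its item set; itemsets are frozensets."""
    return {tid: items for tid, items in transactions.items()
            if any(itemset <= items for itemset in frequent_k_itemsets)}
```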

