CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 – DB Applications Data Warehousing / Data Mining (R&G, ch 25 and 26) C. Faloutsos and A. Pavlo


Carnegie Mellon Univ. Dept. of Computer Science, 15-415/615 – DB Applications. Data Warehousing / Data Mining (R&G, ch 25 and 26). C. Faloutsos and A. Pavlo

Data mining - detailed outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
Unsupervised learning
–association rules

Problem
Given: multiple data sources, e.g., sales(p-id, c-id, date, $price) and customers(c-id, age, income,...)
Find: patterns (classifiers, rules, clusters, outliers...)

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
How? How often? How about discrepancies / non-homogeneities?

Data Warehousing
First step: collect the data in a single place (= Data Warehouse)
How? A: Triggers / Materialized views
How often? A: [Art!]
How about discrepancies / non-homogeneities? A: Wrappers / Mediators

Data Warehousing
Step 2: collect counts (DataCubes / OLAP). E.g.:

OLAP
Problem: “is it true that shirts in large sizes sell better in dark colors?” (on the sales table)

DataCubes
‘color’, ‘size’: DIMENSIONS; ‘count’: MEASURE
Aggregating the counts by color, by size, and by (color, size) gives the DataCube.

DataCubes
SQL query to generate the DataCube, naively (and painfully):

select size, color, count(*)
from sales
where p-id = ‘shirt’
group by size, color

select size, count(*)
from sales
where p-id = ‘shirt’
group by size

...

DataCubes
SQL query to generate the DataCube, with the ‘cube by’ keyword:

select size, color, count(*)
from sales
where p-id = ‘shirt’
cube by size, color
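What ‘cube by’ computes can be sketched in plain Python: one group-by per subset of the dimensions. A minimal sketch; the toy `rows` list standing in for the shirt rows of the sales table is made up for illustration.

```python
from itertools import combinations
from collections import Counter

# hypothetical (color, size) rows, already filtered to p-id = 'shirt'
rows = [('red', 'L'), ('red', 'M'), ('blue', 'L'), ('red', 'L')]
dims = ('color', 'size')

cube = {}
for keep in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), keep):
        # project each row onto the kept dimensions ('*' = aggregated away)
        key = lambda r: tuple(r[i] if i in subset else '*' for i in range(len(dims)))
        cube[tuple(dims[i] for i in subset)] = Counter(map(key, rows))
```

`cube[()]` is the grand total, `cube[('color',)]` the per-color marginals, and `cube[('color', 'size')]` the full cross-tab — exactly the 2^d GROUP BY queries of the naive approach.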

DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)?
Q2: Which operations to allow?

DataCubes
DataCube issues:
Q1: How to store them (and/or materialize portions on demand)? A: ROLAP / MOLAP
Q2: Which operations to allow? A: roll-up, drill-down, slice, dice
[More details: book by Han+Kamber]

DataCubes
Q1: How to store a DataCube?
A1: Relational (R-OLAP)
A2: Multi-dimensional (M-OLAP)
A3: Hybrid (H-OLAP)

DataCubes
Pros/Cons - ROLAP strong points (DSS, Metacube):
use existing RDBMS technology
scales up better with dimensionality

DataCubes
Pros/Cons - MOLAP strong points (EssBase / hyperion.com):
faster indexing (careful with high dimensionality and sparseness)
HOLAP (MS SQL Server OLAP services): detail data in ROLAP; summaries in MOLAP

DataCubes
Q1: How to store a DataCube?
Q2: What operations should we support?

DataCubes
Q2: What operations should we support?
Roll-up
Drill-down
Slice
Dice
(also: pivot/rotate; drill-across; drill-through; top-N; moving averages, etc.)

D/W - OLAP - Conclusions
D/W: copy (summarized) data + analyze
OLAP concepts:
–DataCube
–R/M/H-OLAP servers
–‘dimensions’; ‘measures’

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
Unsupervised learning
–association rules
–(clustering)

Decision trees - Problem
Given a table of labeled records, predict the label (‘??’) of a new record.

Decision trees
Pictorially, we have points in the plane: num. attr#1 (e.g., ‘age’) vs num. attr#2 (e.g., chol-level)

Decision trees
... and we want to label the new point ‘?’

Decision trees
so we build a decision tree, recursively splitting the plane (e.g., at age = 50, then at chol-level = 40)

Decision trees
so we build a decision tree:
age < 50? — Y: ‘+’; N: test ‘chol. < 40’, and so on down the tree
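As code, such a tree is just nested ifs. A sketch using the slide’s thresholds; the slide elides one subtree, so the labels on the lower branches are made-up placeholders:

```python
def classify(age, chol):
    """Walk the slide's decision tree: split on age first, then on chol-level."""
    if age < 50:
        return '+'   # left branch of the root
    if chol < 40:
        return '-'   # second split, on chol-level (assumed leaf label)
    return '+'       # placeholder for the subtree the slide elides
```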

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Decision trees
Typically, two steps:
–tree building
–tree pruning (against over-training / over-fitting)

Tree building
How?

Tree building
How? A: Partition, recursively - pseudocode:

Partition(Dataset S):
  if all points in S have the same label, then return
  evaluate splits along each attribute A
  pick the best split, to divide S into S1 and S2
  Partition(S1); Partition(S2)
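The pseudocode above, fleshed out as runnable Python. A sketch, assuming numeric attributes and binary splits, and scoring splits by the size-weighted gini index (introduced below); the tuple-based tree encoding is an arbitrary choice for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(points, labels):
    """Try every (attribute, threshold) binary split; return the one
    minimizing the size-weighted gini of the two halves."""
    best = None  # (score, attr, threshold)
    n = len(points)
    for a in range(len(points[0])):
        for t in sorted({p[a] for p in points}):
            left = [y for p, y in zip(points, labels) if p[a] < t]
            right = [y for p, y in zip(points, labels) if p[a] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, a, t)
    return best

def partition(points, labels):
    """Recursive Partition(S): a node is (attr, threshold, left, right);
    a leaf is just the majority label."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(points, labels)
    if split is None:  # no split separates the points
        return Counter(labels).most_common(1)[0][0]
    _, a, t = split
    li = [i for i, p in enumerate(points) if p[a] < t]
    ri = [i for i, p in enumerate(points) if p[a] >= t]
    return (a, t,
            partition([points[i] for i in li], [labels[i] for i in li]),
            partition([points[i] for i in ri], [labels[i] for i in ri]))
```

On the toy age/chol data of the earlier slides, `partition` finds the single split that separates the two labels.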

Tree building
Q1: how to introduce splits along attribute Ai?
Q2: how to evaluate a split?

Tree building
Q1: how to introduce splits along attribute Ai?
A1:
–for numerical attributes: binary split, or multiple split
–for categorical attributes: compute all subsets (expensive!), or use a greedy algorithm

Tree building
Q2: how to evaluate a split?
A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

Tree building (details)
Two measures of uniformity, for a node with fractions p+ and p- of ‘+’ and ‘-’ labels:
entropy: H(p+, p-) = -p+ log2(p+) - p- log2(p-)
‘gini’ index: 1 - p+^2 - p-^2
(How about multiple labels?)

Tree building (details)
Intuition:
entropy: #bits to encode the class label
gini: classification error, if we randomly guess ‘+’ with prob. p+

Tree building
Thus, we choose the split that reduces the entropy / classification error the most. E.g.:

Tree building (details)
Before the split, we need (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13) bits in total to encode all the class labels.
After the split, we need 0 bits for the first half (it is pure) and (2+6) * H(2/8, 6/8) bits for the second half.
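A few lines of Python back the arithmetic, with entropy and gini as defined above:

```python
from math import log2

def entropy(p_pos, p_neg):
    """H(p+, p-) in bits; 0 * log 0 taken as 0."""
    return -sum(p * log2(p) for p in (p_pos, p_neg) if p > 0)

def gini(p_pos, p_neg):
    """1 - p+^2 - p-^2: expected error of a random guess."""
    return 1 - p_pos**2 - p_neg**2

before = 13 * entropy(7/13, 6/13)   # ~ 12.94 bits before the split
after = 0 + 8 * entropy(2/8, 6/8)   # pure half costs 0 bits; ~ 6.49 total
```

So this split saves roughly half the encoding cost — the kind of reduction the splitting policy is maximizing.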

Tree pruning
What for?

Tree pruning (details)
Shortcut for scalability - DYNAMIC pruning: stop expanding the tree if a node is ‘reasonably’ homogeneous
–ad-hoc threshold [Agrawal+, vldb92]
–(Minimum Description Length (MDL) criterion (SLIQ) [Mehta+, edbt96])

Tree pruning (details)
Q: How to do it?
A1: use a ‘training’ and a ‘testing’ set - prune nodes that improve classification on the ‘testing’ set. (Drawbacks?)
(A2: or, rely on MDL (= Minimum Description Length))

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Scalability enhancements (details)
Interval Classifier [Agrawal+, vldb92]: dynamic pruning
SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but the label column has to fit in core)
SPRINT: even more clever partitioning

Conclusions for classifiers
Classification through trees
Building phase - splitting policies
Pruning phase (to avoid over-fitting)
For scalability:
–dynamic pruning
–clever data partitioning

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–(clustering)

Association rules - idea [Agrawal+ SIGMOD93]
Consider the ‘market basket’ case:
(milk, bread)
(milk)
(milk, chocolate)
(milk, bread)
Find ‘interesting things’, e.g., rules of the form:
milk, bread -> chocolate | 90%

Association rules - idea
In general, for a given rule Ij, Ik, ... Im -> Ix | c:
‘c’ = confidence: how often people buy Ix, given that they have bought Ij, ... Im
‘s’ = support: how often people buy Ij, ... Im, Ix

Association rules - idea
Problem definition: given
–a set of ‘market baskets’ (= binary matrix of N rows/baskets and M columns/products)
–min-support ‘s’ and
–min-confidence ‘c’
find
–all the rules with higher support and confidence
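On the four baskets from the earlier slide, support and confidence are each a one-liner (a minimal sketch, representing baskets as Python sets):

```python
# the slide's four market baskets
baskets = [{'milk', 'bread'}, {'milk'}, {'milk', 'chocolate'}, {'milk', 'bread'}]

def support(itemset):
    """Fraction of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """How often the rhs appears, given that the lhs appeared."""
    return support(lhs | rhs) / support(lhs)
```

Here `support({'milk', 'bread'})` is 2/4, and the rule milk -> bread has confidence 2/4 over 4/4 = 0.5.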

Association rules - idea
Closely related concept: “large itemset”
Ij, Ik, ... Im, Ix is a ‘large itemset’ if it appears more than ‘min-support’ times
Observation: once we have a ‘large itemset’, we can find the qualifying rules easily (how?)
Thus, let’s focus on how to find ‘large itemsets’

Association rules - idea
Naive solution: scan the database once; keep 2**|I| counters
Drawback? Improvement?

Association rules - idea
Naive solution: scan the database once; keep 2**|I| counters
Drawback? 2**1000 is prohibitive...
Improvement? scan the db |I| times, looking for 1-, 2-, etc. itemsets
E.g., for |I|=3 items only (A, B, C), we have:

Association rules - idea
first pass (min-sup: 10): count the 1-itemsets A, B, C
next pass: count the candidate 2-itemsets (A,B), (A,C), (B,C)

Association rules - idea
Anti-monotonicity property: if an itemset fails to be ‘large’, so will every superset of it (hence all supersets can be pruned)
Sketch of the (famous!) ‘a-priori’ algorithm:
Let L(i-1) be the set of large itemsets with i-1 elements
Let C(i) be the set of candidate itemsets of size i

Association rules - idea
Compute L(1) by scanning the database.
Repeat, for i = 2, 3, ...:
–‘join’ L(i-1) with itself to generate C(i); two itemsets can be joined if they agree on their first i-2 elements
–prune the itemsets of C(i) (how?)
–scan the db, finding the counts of the C(i) itemsets - set this to be L(i)
–unless L(i) is empty, repeat the loop
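A compact runnable sketch of the loop above. For brevity, the prefix-based join is replaced by a simpler union-based candidate generation (same candidates, a bit more work); the prune step is exactly the anti-monotonicity check:

```python
from itertools import combinations

def apriori(baskets, min_sup):
    """Return all large itemsets (as frozensets) appearing in at least
    min_sup baskets, following the L(i-1) -> C(i) -> L(i) loop."""
    def count(cands):
        return {c for c in cands if sum(c <= b for b in baskets) >= min_sup}

    items = {frozenset([x]) for b in baskets for x in b}
    large = count(items)          # L(1): one db scan over the 1-itemsets
    all_large = set(large)
    i = 2
    while large:
        # join step: unions of two large (i-1)-itemsets that form an i-itemset
        cands = {a | b for a in large for b in large if len(a | b) == i}
        # prune step: drop candidates with any non-large (i-1)-subset
        cands = {c for c in cands
                 if all(frozenset(s) in large for s in combinations(c, i - 1))}
        large = count(cands)      # L(i): one more db scan
        all_large |= large
        i += 1
    return all_large
```

On the four baskets from the earlier slide with min-support 2, it returns {milk}, {bread}, and {milk, bread}; chocolate never survives the first pass, so no superset of it is ever counted.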

Association rules - Conclusions
Association rules: a great tool to find patterns
easy to understand its output
fine-tuned algorithms exist

Outline
Problem
Getting the data: Data Warehouses, DataCubes, OLAP
Supervised learning: decision trees
–problem
–approach
–scalability enhancements
Unsupervised learning
–association rules
–clustering

Clustering
Problem:
–given N points in V dimensions,
–group them
MANY algorithms: K-means, X-means, BIRCH, OPTICS, ...

Clustering
Easiest to describe: k-means
–user gives the number of clusters ‘k’
–start with ‘k’ random seeds
–assign each point to its nearest seed
–move each seed towards the center of its points, and repeat
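The four steps above, as a runnable sketch (plain Python, 2-D points; the fixed random seed and iteration count are arbitrary choices for illustration):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """k-means on 2-D points: pick k random seeds, then alternate
    assign-to-nearest-seed and move-seed-to-centroid."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: (p[0] - centers[j][0]) ** 2
                                + (p[1] - centers[j][1]) ** 2)
            clusters[j].append(p)
        # update step: move each seed to the centroid of its points
        for j, c in enumerate(clusters):
            if c:
                centers[j] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters
```

On two well-separated blobs it recovers the blobs; on harder data the result depends on the initial seeds, which is why variants like X-means and BIRCH exist.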

Overall Conclusions
Data Mining = “Big Data” Analytics = Business Intelligence:
–of high commercial, government and research interest
DM = DB + ML + Stat + Sys
Data warehousing / OLAP: to get the data
Tree classifiers (SLIQ, SPRINT)
Association Rules - the ‘a-priori’ algorithm
Clustering: k-means (& BIRCH, CURE, OPTICS)

Reading material
Agrawal, R., T. Imielinski, A. Swami, ‘Mining Association Rules between Sets of Items in Large Databases’, SIGMOD 1993.
M. Mehta, R. Agrawal and J. Rissanen, ‘SLIQ: A Fast Scalable Classifier for Data Mining’, Proc. of the Fifth Int’l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.

Additional references
Agrawal, R., S. Ghosh, et al. (Aug. 1992). An Interval Classifier for Database Mining Applications. VLDB Conf. Proc., Vancouver, BC, Canada.
Jiawei Han and Micheline Kamber, Data Mining, Morgan Kaufman, 2001, chapters , , 7.3.5