Database Management Systems Data Mining - Data Cubes


15-721 Database Management Systems: Data Mining - Data Cubes. Anastassia Ailamaki, http://www.cs.cmu.edu/~natassa

Substitute lecturer: Christos Faloutsos, christos@cs.cmu.edu, www.cs.cmu.edu/~christos. (Slides copyright C. Faloutsos, 2001.)

Outline: Problem; Getting the data: Data Warehouses, DataCubes, OLAP; Supervised learning: decision trees; Unsupervised learning: association rules (clustering).

Problem. Given: multiple data sources (e.g., sales(p-id, c-id, date, $price) and customers(c-id, age, income, ...) spread over sites such as PGH, NY, SF). Find: patterns (classifiers, rules, clusters, outliers, ...).

Data Warehousing. First step: collect the data in a single place (= Data Warehouse). How? How often? How about discrepancies / non-homogeneities?

Data Warehousing. First step: collect the data in a single place (= Data Warehouse). How? A: Triggers / materialized views. How often? A: [Art!] How about discrepancies / non-homogeneities? A: Wrappers / mediators.

Data Warehousing. Step 2: collect counts (DataCubes / OLAP). E.g.:

OLAP. Problem: "is it true that shirts in large sizes sell better in dark colors?" (posed against the sales table).

DataCubes. 'color' and 'size' are the DIMENSIONS; 'count' is the MEASURE. Aggregating the counts along color, along size, and along (color, size) together - i.e., along every combination of the dimensions - gives the DataCube.

DataCubes. SQL query to generate the DataCube, naively (and painfully) - one GROUP BY query per combination of dimensions:

    select size, color, count(*)
    from sales
    where p-id = 'shirt'
    group by size, color

    select size, count(*)
    from sales
    where p-id = 'shirt'
    group by size

    ... (and so on, for every subset of the dimensions)

DataCubes. SQL query to generate the DataCube with the 'cube by' keyword (GROUP BY CUBE(...) in most current SQL dialects):

    select size, color, count(*)
    from sales
    where p-id = 'shirt'
    cube by size, color
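To make the effect of the CUBE operator concrete, here is a minimal, self-contained sketch (not from the slides; the toy 'sales' rows are made up) that computes the COUNT(*) aggregate for every subset of the dimensions {size, color}:

    # A toy illustration of what CUBE BY returns: COUNT(*) for every
    # subset of the dimensions. The rows below are hypothetical shirt sales.
    from itertools import combinations
    from collections import Counter

    sales = [("L", "dark"), ("L", "dark"), ("L", "bright"),
             ("M", "dark"), ("M", "bright"), ("S", "bright")]
    dims = ("size", "color")

    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            # project each row onto the grouped dimensions; '*' = aggregated away
            key = lambda row: tuple(row[i] if i in group else "*" for i in range(len(dims)))
            cube[tuple(dims[i] for i in group)] = Counter(map(key, sales))

    for group, counts in cube.items():
        print(group, dict(counts))
    # () counts everything; ('size',) and ('color',) are the marginals;
    # ('size', 'color') is the finest-granularity table.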

DataCubes. DataCube issues: Q1: How to store them (and/or materialize portions on demand)? Q2: How to index them? Q3: Which operations to allow?

DataCubes. DataCube issues and answers: Q1: How to store them (and/or materialize portions on demand)? A: ROLAP / MOLAP. Q2: How to index them? A: bitmap indices. Q3: Which operations to allow? A: roll-up, drill-down, slice, dice. [More details: book by Han+Kamber]

DataCubes. Q1: How to store a DataCube? A1: Relational (R-OLAP). A2: Multi-dimensional (M-OLAP). A3: Hybrid (H-OLAP).

DataCubes. Pros/Cons. ROLAP strong points (e.g., DSS, MetaCube): uses existing RDBMS technology; scales up better with dimensionality.

DataCubes. Pros/Cons. MOLAP strong points (e.g., Essbase / hyperion.com): faster indexing (but careful with high dimensionality and sparseness). HOLAP (e.g., MS SQL Server OLAP Services): detail data in ROLAP; summaries in MOLAP.

DataCubes. Q1: How to store a DataCube? Q2: What operations should we support? Q3: How to index a DataCube?

DataCubes. Q2: What operations should we support? On the (color, size) cube: Roll-up, Drill-down, Slice, and Dice (each illustrated on the cube in the slides; a small sketch follows).
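As a rough illustration (reusing the hypothetical shirt-sales rows from the CUBE sketch above), roll-up aggregates a dimension away and slice fixes one dimension to a single value; drill-down is simply the inverse of roll-up, and dice selects ranges on several dimensions at once:

    from collections import Counter

    sales = [("L", "dark"), ("L", "dark"), ("L", "bright"),
             ("M", "dark"), ("M", "bright"), ("S", "bright")]
    base = Counter(sales)                 # finest-granularity (size, color) -> count cells

    # Roll-up: aggregate the 'color' dimension away -> counts per size only
    per_size = Counter()
    for (size, color), cnt in base.items():
        per_size[size] += cnt

    # Slice: fix color = 'dark' and keep the remaining (size) sub-cube
    dark_only = {size: cnt for (size, color), cnt in base.items() if color == "dark"}

    print(per_size)    # Counter({'L': 3, 'M': 2, 'S': 1})
    print(dark_only)   # {'L': 2, 'M': 1}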

DataCubes. Q1: How to store a DataCube? Q2: What operations should we support? Q3: How to index a DataCube?

DataCubes. Q3: How to index a DataCube? A1: Bitmap indices.
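A minimal sketch (with made-up column values) of the bitmap-index idea: one bit-vector per distinct value of a low-cardinality column, so conjunctive predicates become bitwise ANDs over the vectors instead of scans of the base table:

    colors = ["dark", "bright", "dark", "dark", "bright"]   # column over 5 rows
    sizes  = ["L",    "L",      "M",    "S",    "M"]

    def bitmap(column):
        """Map each distinct value to an int used as a bit-vector (bit i = row i)."""
        bm = {}
        for row, value in enumerate(column):
            bm[value] = bm.get(value, 0) | (1 << row)
        return bm

    color_bm, size_bm = bitmap(colors), bitmap(sizes)
    # rows with color = 'dark' AND size = 'M':
    hits = color_bm["dark"] & size_bm["M"]
    print([row for row in range(len(colors)) if hits & (1 << row)])   # -> [2]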

DataCubes. Q3: How to index a DataCube? A2: Join indices (see [Han+Kamber]).

D/W - OLAP - Conclusions. D/W: copy (summarized) data + analyze. OLAP concepts: DataCube; R/M/H-OLAP servers; 'dimensions' and 'measures'.

Outline: Problem; Getting the data: Data Warehouses, DataCubes, OLAP; Supervised learning: decision trees; Unsupervised learning: association rules (clustering).

Decision trees - Problem: given labeled records, predict the missing ('??') class label of a new record.

Decision trees. Pictorially, we have points over two numerical attributes (attr#1, e.g., 'age'; attr#2, e.g., chol-level), each labeled '+' or '-'.

Decision trees. And we want to label the new point '?'.

Decision trees. So we build a decision tree, splitting the attribute space (in the example, at age = 50 and chol-level = 40).

Decision trees. The resulting tree, flattened from the slide's figure: test age < 50 (Y/N); on the Y branch, test chol-level < 40, with '+' for Y and '-' for N; and so on.
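Applying this toy tree amounts to a couple of nested tests; the label on the age >= 50 branch is elided on the slide, so the '-' used for it below is an assumption made only for illustration:

    # Applying the toy tree: test age < 50, then chol-level < 40.
    def classify(age, chol):
        if age < 50:
            return "+" if chol < 40 else "-"
        return "-"          # assumed label for the age >= 50 branch (elided on the slide)

    print(classify(age=35, chol=30))   # -> '+'
    print(classify(age=65, chol=30))   # -> '-'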

Outline: Problem; Getting the data: Data Warehouses, DataCubes, OLAP; Supervised learning: decision trees (problem, approach, scalability enhancements); Unsupervised learning: association rules (clustering).

Decision trees. Typically, two steps: tree building; tree pruning (to guard against over-training / over-fitting).

Tree building. How?

Tree building (skip). How? A: Partition, recursively - pseudocode:

    Partition(Dataset S):
        if all points in S have the same label then return
        evaluate splits along each attribute A
        pick the best split, to divide S into S1 and S2
        Partition(S1); Partition(S2)
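A runnable sketch of this pseudocode, under simplifying assumptions: two numeric attributes, axis-aligned binary splits, and a plain misclassification count as the "best split" criterion (the entropy / gini measures on the following slides are the usual choices). The toy points are made up:

    def impurity(labels):
        """# of points outside the majority class: a crude uniformity measure."""
        return len(labels) - max(labels.count(c) for c in set(labels))

    def partition(points, labels, depth=0):
        """points: list of (attr1, attr2) tuples; labels: parallel list of '+'/'-'."""
        if len(set(labels)) <= 1:                     # all points have the same label
            print("  " * depth + "leaf:", labels[0])
            return
        best = None                                   # (cost, attribute, threshold)
        for a in (0, 1):                              # evaluate splits along each attribute
            for t in sorted({p[a] for p in points}):
                left  = [l for p, l in zip(points, labels) if p[a] < t]
                right = [l for p, l in zip(points, labels) if p[a] >= t]
                if left and right:
                    cost = impurity(left) + impurity(right)
                    if best is None or cost < best[0]:
                        best = (cost, a, t)
        _, a, t = best                                # pick the best split: S -> S1, S2
        print("  " * depth + f"split: attr#{a + 1} < {t}")
        for side in (lambda p: p[a] < t, lambda p: not (p[a] < t)):
            keep = [i for i, p in enumerate(points) if side(p)]
            partition([points[i] for i in keep], [labels[i] for i in keep], depth + 1)

    # hypothetical (age, chol-level) points
    partition([(30, 35), (35, 30), (45, 60), (60, 50), (65, 70)],
              ["+", "+", "-", "-", "-"])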

Tree building (skip). Q1: how to introduce splits along attribute Ai? Q2: how to evaluate a split?

Tree building (skip). Q1: how to introduce splits along attribute Ai? A1: for numerical attributes: binary split, or multiple split; for categorical attributes: compute all subsets (expensive!), or use a greedy algorithm.

Tree building. Q1: how to introduce splits along attribute Ai? Q2: how to evaluate a split? A: by how close to uniform each subset is - i.e., we need a measure of uniformity:

Tree building (skip). One measure of uniformity is the entropy H(p+, p-); another is the 'gini' index, 1 - p+^2 - p-^2. Both are 0 for a pure subset and maximal at p+ = p- = 0.5. (How about multiple labels?)

Tree building (skip). Intuition: entropy = # of bits to encode the class label; gini = classification error, if we randomly guess '+' with probability p+.
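In code, written for an arbitrary number of class labels (a sketch; the two-label formulas on the previous slides are the special case):

    from math import log2
    from collections import Counter

    def entropy(labels):
        """H = -sum_i p_i * log2(p_i): # of bits needed per class label."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        """1 - sum_i p_i^2: error rate of guessing a label at random with prob. p_i."""
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(entropy(["+"] * 7 + ["-"] * 6))   # ~0.996 bits (the 7/13 vs 6/13 case below)
    print(gini(["+"] * 7 + ["-"] * 6))      # ~0.497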

Tree building. Thus, we choose the split that reduces entropy / classification error the most. E.g., on the '+'/'-' points over age and chol-level:

Tree building (skip). Before the split we need (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13) bits in total to encode all the class labels. After the split we need 0 bits for the first half and (2+6) * H(2/8, 6/8) bits for the second half.
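Plugging in the entropy() helper sketched above (the first half must hold the remaining 5 '+' points, since 2 of the 7 '+' and all 6 '-' fall in the second half):

    before = 13 * entropy(["+"] * 7 + ["-"] * 6)                  # (7+6) * H(7/13, 6/13)
    after  = 5 * entropy(["+"] * 5) + 8 * entropy(["+"] * 2 + ["-"] * 6)
    print(round(before, 2), round(after, 2))                      # roughly 12.94 vs 6.49 bits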

Tree pruning. What for? (A tree grown until every region of the age / chol-level plane is pure ends up fitting the noise.)

Tree pruning. Shortcut for scalability - DYNAMIC pruning: stop expanding the tree if a node is 'reasonably' homogeneous; either an ad hoc threshold [Agrawal+, vldb92] or the Minimum Description Length (MDL) criterion (SLIQ) [Mehta+, edbt96].

Tree pruning (skip). Q: How to do it? A1: use a 'training' and a 'testing' set - prune nodes whose removal improves classification on the 'testing' set. (Drawbacks?) A2: or rely on MDL (= Minimum Description Length) - in detail:

Tree pruning (skip). Envision the problem as compression (of what?): try to minimize the total # of bits needed to encode (a) the class labels AND (b) the representation of the decision tree itself.

(MDL, skip) A brilliant idea - e.g., the best degree-n polynomial to compress a set of points is the one that minimizes (sum of errors + n).
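A sketch of that idea on synthetic data (numpy assumed available; the "+ n" penalty is taken literally from the slide rather than being a full MDL code length):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 20)
    y = 2 * x + 1 + rng.normal(scale=0.05, size=x.size)    # roughly a line, plus noise

    def mdl_cost(n):
        coeffs = np.polyfit(x, y, deg=n)
        errors = np.abs(np.polyval(coeffs, x) - y).sum()    # "sum of errors"
        return errors + n                                    # + model-size penalty

    best_degree = min(range(1, 8), key=mdl_cost)
    print(best_degree)   # high-degree fits have smaller error but pay a bigger penalty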

Outline: Problem; Getting the data: Data Warehouses, DataCubes, OLAP; Supervised learning: decision trees (problem, approach, scalability enhancements); Unsupervised learning: association rules (clustering).

Scalability enhancements. Interval Classifier [Agrawal+, vldb92]: dynamic pruning. SLIQ: dynamic pruning with MDL; vertical partitioning of the file (but the label column has to fit in core). SPRINT: even more clever partitioning.

Conclusions for classifiers. Classification through trees: building phase (splitting policies); pruning phase (to avoid over-fitting). For scalability: dynamic pruning; clever data partitioning.

Overall Conclusions. Data Mining: of high commercial interest; DM = DB + ML + Stat. Data warehousing / OLAP: to get the data. Tree classifiers (SLIQ, SPRINT). Association Rules - the 'a-priori' algorithm. (Clustering: BIRCH, CURE, OPTICS.)

Reading material: Agrawal, R., T. Imielinski and A. Swami, 'Mining Association Rules between Sets of Items in Large Databases', SIGMOD 1993. M. Mehta, R. Agrawal and J. Rissanen, 'SLIQ: A Fast Scalable Classifier for Data Mining', Proc. 5th Int'l Conf. on Extending Database Technology (EDBT), Avignon, France, March 1996.

Additional references: Agrawal, R., S. Ghosh, et al., 'An Interval Classifier for Database Mining Applications', VLDB Conf. Proc., Vancouver, BC, Canada, Aug. 23-27, 1992. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001, chapters 2.2-2.3, 6.1-6.2, 7.3.5.