1 Data Mining, Database Tuning Tuesday, Feb. 27, 2007.

Slides:



Advertisements
Similar presentations
Copyright © 2007 Ramez Elmasri and Shamkant B. Navathe Slide
Advertisements

Data Mining Techniques Association Rule
OLAP Tuning. Outline OLAP 101 – Data warehouse architecture – ROLAP, MOLAP and HOLAP Data Cube – Star Schema and operations – The CUBE operator – Tuning.
Association Rules Spring Data Mining: What is it?  Two definitions:  The first one, classic and well-known, says that data mining is the nontrivial.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining 198:541. Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Definition Data mining is the exploration and analysis of large.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Overview of Indexing Chapter 8 – Part II. 1. Introduction to indexing 2. First glimpse at indices and workloads.
ICS 421 Spring 2010 Data Mining 1 Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 4/6/20101Lipyeow Lim.
Manajemen Basis Data Pertemuan 7 Matakuliah: M0264/Manajemen Basis Data Tahun: 2008.
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Data Warehousing. 2 Data Warehouse A data warehouse is a huge database that stores historical data Example: Store information about all sales of products.
1 Physical Database Design and Tuning Module 5, Lecture 3.
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
6/23/2015CSE591: Data Mining by H. Liu1 Association Rules Transactional data Algorithm Applications.
Query Optimization 3 Cost Estimation R&G, Chapters 12, 13, 14 Lecture 15.
Fast Algorithms for Association Rule Mining
Mining Association Rules
1 Overview of Indexing Chapter 8 – Part II. 1. Introduction to indexing 2. First glimpse at indices and workloads.
Mining Association Rules
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
Association Discovery from Databases Association rules are a simple formalism for expressing positive connections between columns in a 0/1 matrix. A classical.
Practical Database Design and Tuning. Outline  Practical Database Design and Tuning Physical Database Design in Relational Databases An Overview of Database.
Database Tuning Prerequisite Cluster Index B+Tree Indexing Hash Indexing ISAM (indexed Sequential access)
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
Chapter 16 Practical Database Design and Tuning Copyright © 2004 Pearson Education, Inc.
Physical Database Design I, Ch. Eick 1 Physical Database Design I About 25% of Chapter 20 Simple queries:= no joins, no complex aggregate functions Focus.
EXAM REVIEW MIS2502 Data Analytics. Exam What Tool to Use? Evaluating Decision Trees Association Rules Clustering.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Instructor : Prof. Marina Gavrilova. Goal Goal of this presentation is to discuss in detail how data mining methods are used in market analysis.
Association Rule Mining
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Physical Database Design I, Ch. Eick 1 Physical Database Design I Chapter 16 Simple queries:= no joins, no complex aggregate functions Focus of this Lecture:
Lec 7 Practical Database Design and Tuning Copyright © 2004 Pearson Education, Inc.
Ramakrishnan and Gehrke. Database Management Systems, 3 rd Edition. Definition Data mining is the exploration and analysis of large quantities of data.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Chapter 8 Association Rules. Data Warehouse and Data Mining Chapter 10 2 Content Association rule mining Mining single-dimensional Boolean association.
Chap 6: Association Rules. Rule Rules!  Motivation ~ recent progress in data mining + warehousing have made it possible to collect HUGE amount of data.
Introduction to Machine Learning Lecture 13 Introduction to Association Rules Albert Orriols i Puig Artificial.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Chapter 26: Data Mining Prepared by Assoc. Professor Bela Stantic.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
1 CS122A: Introduction to Data Management Lecture #15: Physical DB Design Instructor: Chen Li.
Data Mining – Association Rules
Practical Database Design and Tuning
Record Storage, File Organization, and Indexes
Data Mining Association Analysis: Basic Concepts and Algorithms
Association rule mining
Mining Association Rules
Knowledge discovery & data mining Association rules and market basket analysis--introduction UCLA CS240A Course Notes*
Frequent Pattern Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Rules Assoc.Prof.Songül Varlı Albayrak
Transactional data Algorithm Applications
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rule Mining
Practical Database Design and Tuning
Market Basket Analysis and Association Rules
Department of Computer Science National Tsing Hua University
Presentation transcript:

1 Data Mining, Database Tuning Tuesday, Feb. 27, 2007

2 Outline Data Mining: chapter 26 Database tuning: chapter 20

3 Data Mining Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data): –If (relationship = husband), then (gender = male). 99.6%

4 Data Mining Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns.

5 Why Use Data Mining Today ? Human analysis skills are inadequate: Volume and dimensionality of the data High data growth rate Availability of: Data Storage Computational power Off-the-shelf software Expertise

6 Types of Data Mining Association Rules Decision trees Clustering Niave Bayes Etc, etc, etc. We’ll discuss only association rules, and only briefly.

7 Association Rules Most studied mining method in db community: –Simple, easy to understand –Clever, scalable algorithm We discuss only association rules in class Project Phase 4, Task 1: –Use association rules –You should be done in 10’ Tasks 2, 3: may try something else –E.g Bayesian Networks –But need to read first

8 Association Rules Market Basket Analysis Consider shopping cart filled with several items Market basket analysis tries to answer the following questions: –Who makes purchases? –What do customers buy together? –In what order do customers purchase items?

9 Market Basket Analysis TIDCIDDateItemQty /1/99Pen /1/99Ink /1/99Milk /1/99Juice /3/99Pen /3/99Ink /3/99Milk /5/99Pen /5/99Milk /1/99Pen /1/99Ink /1/99Juice4 A database of customer transactions Each transaction is a set of items Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

10 Market Basket Analysis Coocurrences 80% of all customers purchase items X, Y and Z together. Association rules 60% of all customers who purchase X and Y also buy Z. Sequential patterns 60% of customers who first buy X also purchase Y within three weeks.

11 Market Basket Analysis We prune the set of all possible association rules using two interestingness measures: Confidence of a rule: –X -->Y has confidence c if P(Y|X) = c Support of a rule: –X -->Y has support s if P(XY) = s We can also define Support of an itemset (a coocurrence) XY: –XY has support s if P(XY) = s

12 Market Basket Analysis Examples: {Pen} => {Milk} Support: 75% Confidence: 75% {Ink} => {Pen} Support: 100% Confidence: 100% TIDCIDDateItemQty /1/99Pen /1/99Ink /1/99Milk /1/99Juice /3/99Pen /3/99Ink /3/99Milk /5/99Pen /5/99Milk /1/99Pen /1/99Ink /1/99Juice4

13 Market Basket Analysis Find all itemsets with support >= 75%? TIDCIDDateItemQty /1/99Pen /1/99Ink /1/99Milk /1/99Juice /3/99Pen /3/99Ink /3/99Milk /5/99Pen /5/99Milk /1/99Pen /1/99Ink /1/99Juice4

14 Market Basket Analysis Can you find all association rules with support >= 50%? TIDCIDDateItemQty /1/99Pen /1/99Ink /1/99Milk /1/99Juice /3/99Pen /3/99Ink /3/99Milk /5/99Pen /5/99Milk /1/99Pen /1/99Ink /1/99Juice4

15 Finding Frequent Itemsets Input: a set of “transactions”: TIDItemSet T1T1 Pen, Milk, Juice, Wine T2T2 Pen, Beer, Juice, Eggs, Bread, Salad... TnTn Beer, Diapers

16 Finding Frequent Itemsets Itemset I; E.g I = {Milk, Eggs, Diapers} TIDItemSet T1T1 Pen, Milk, Juice, Wine T2T2 Pen, Beer, Juice, Eggs, Bread, Salad... TnTn Beer, Diapers Support of I = supp(I) = # of transactions that contain I

17 Finding Frequent Itemsets Find ALL itemsets I with supp(I) > minsup TIDItemSet T1T1 Pen, Milk, Juice, Wine T2T2 Pen, Beer, Juice, Eggs, Bread, Salad... TnTn Beer, Diapers Problem: too many I’s to check; too big a table (sequential scan)

18 A priory property I  I’  supp(I)  supp(I’) (WHY ??) TIDItemSet T1T1 Pen, Milk, Juice, Wine T2T2 Pen, Beer, Juice, Eggs, Bread, Salad... TnTn Beer, Diapers Question: which is bigger supp({Pen}) or supp({Pen, Beer}) ?

19 The A-priori Algorithm Goal: find all itemsets I s.t. supp(I) > minsupp For each item X check if supp(X) > minsupp then retain I 1 = {X} K=1 Repeat –For every itemset I k, generate all itemsets I k+1 s.t. I k  I k+1 –Scan all transactions and compute supp(I k+1 ) for all itemsets I k+1 –Drop itemsets I k+1 with support < minsupp Until no new frequent itemsets are found

20 Association Rules Finally, construct all rules X  Y s.t. XY has high support Supp(XY)/Supp(X) > min-confidence

21 Database Tuning Goal: improve performance, without affecting the application –Recall the “data independence” principle How to achieve good performance: –Make good design choices (we’ve been studying this for 8 weeks…) –Physical database design, or “database tuning”

22 The Database Workload A list of queries, together with their frequencies –Note these queries are typically parameterized, since they are embedded in applications A list of updates and their frequencies Performance goals for each type of query and update

23 Analyze the Workload For each query: –What tables/attributes does it touch –How selective are the conditions; note: this is even harder since queries are parameterized For each update: –What kind of update –What tables/attributes does it affect

24 Physical Design and Tuning Choose what indexes to create Tune the conceptual schema: –Alternative BCNF form (recall: there can be several choices) –Denormalization: may seem necessary for performance –Vertical/horizontal partitioning (see the lecture on views) –Materialized views Manual query/transaction rewriting

25 Guidelines for Index Selection Guideline 1: don’t build it unless someone needs it ! Guideline 2: consider building it if it occurs in a WHERE clause –WHERE R.A= consider B+-tree or hash-index –WHERE R.A > 555 and R.A < consider B+ tree

26 Guidelines for Index Selection Guideline 3: Multi-attribute indexes –WHERE R.A = 555 and R.B = consider an index with key (A,B) –Note: multi-attribute indexes enable “index only” strategies Guideline 4: which index to cluster –Rule of thumb: range predicate  clustered –Rule of thumb: “index only”  unclustered

27 Guidelines for Index Selection Guideline 5: Hash v.s. B+ tree –For index nested loop join: prefer hash –Range predicates: prefer B+ Guideline 6: balance maintenance cost v.s. benefit –If touched by too many updates, perhaps drop it

28 Clustered v.s. Unclustered Index Recall that when the selectivity is low, then an unclustered index may be less efficient than a linear scan. See graph on pp. 660

29 Co-clustering Two Relations Product(pid, pname, manufacturer, price) Company(cid, cname, address) cid1 p1 p2p3p4p5p6p7p8p9papb cid2 pc Block 1Block 2Block 3 company product company We say that Company is unclustered

30 Index-Only Plans SELECT Company.name FROM Company, Product WHERE Company.cid = Product.manufacturer SELECT Company.name, Company.city,Product.price FROM Company, Product WHERE Company.cid = Product.manufacturer How can we evaluate these using an index only ?

31 Automatic Index Selection SQL Server -- see book

32 Denormalization 3NF instead of BCNF Alternative BCNF when possible Denormalize (I.e. keep the join) Vertical partitioning Horizontal partitioning