1
Data Mining: Concepts and Techniques
What is a Warehouse? Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization's operational database. Supports information processing by providing a solid platform of consolidated, historical data for analysis. "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." —W. H. Inmon. Data warehousing: the process of constructing and using data warehouses.
2
Cube: A Lattice of Cuboids
Lattice of cuboids (figure): 0-D (apex) cuboid: all; 1-D cuboids: time, item, location, supplier; 2-D cuboids: (time,item), (time,location), (item,location), (location,supplier), (time,supplier), (item,supplier); 3-D cuboids: (time,location,supplier), (time,item,location), (time,item,supplier), (item,location,supplier); 4-D (base) cuboid: (time, item, location, supplier).
3
Typical OLAP Operations
Roll-up (drill-up): summarize data by climbing up a concept hierarchy or by dimension reduction. Drill-down (roll-down): the reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions. Slice and dice: project and select. Pivot (rotate): reorient the cube for visualization, e.g., turning a 3-D cube into a series of 2-D planes. Other operations — drill-across: involves more than one fact table; drill-through: through the bottom level of the cube to its back-end relational tables (using SQL). A small illustration follows below.
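A minimal sketch of roll-up and slice on a tiny fact table, using pandas; the column names and values (quarter, item, location, sales) are made up for illustration and are not from the original slides.

import pandas as pd

# Hypothetical fact table: one row per (quarter, item, location) with a sales measure.
fact = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["home ent.", "computer", "home ent.", "computer"],
    "location": ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "sales":    [605, 825, 680, 952],
})

# Roll-up: climb the location dimension all the way up (dimension reduction),
# aggregating sales over the remaining dimensions.
rollup = fact.groupby(["quarter", "item"], as_index=False)["sales"].sum()

# Slice: select a single value on one dimension (here quarter = "Q1").
slice_q1 = fact[fact["quarter"] == "Q1"]

print(rollup)
print(slice_q1)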
4
MOLAP versus ROLAP
MOLAP (Multidimensional OLAP): data stored in a multi-dimensional cube; transformation required; data retrieved directly from the cube for analysis; faster analytical processing; cube size limitations.
ROLAP (Relational OLAP): data stored in a relational database as a virtual cube; no transformation needed; data retrieved via SQL from the database for analysis; slower analytical processing; no size limitations.
5
Data Mining: Concepts and Techniques
An OLAM Architecture (figure): Layer 4 — user interface (mining queries in, mining results out, through a user GUI API); Layer 3 — OLAP/OLAM (an OLAM engine and an OLAP engine sharing a data cube API); Layer 2 — multidimensional database (MDDB) plus metadata, accessed through a database API with filtering and integration; Layer 1 — data repository (data warehouse and databases, populated by data cleaning and data integration).
6
Data Mining: Concepts and Techniques
Summary Data warehouse A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process A multi-dimensional model of a data warehouse Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures OLAP operations: drilling, rolling, slicing, dicing and pivoting OLAP servers: ROLAP, MOLAP, HOLAP From OLAP to OLAM September 16, 2018 Data Mining: Concepts and Techniques
7
Data Mining: Concepts and Techniques
Lecture #2 September 16, 2018 Data Mining: Concepts and Techniques
8
Data Mining: Concepts and Techniques
Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary September 16, 2018 Data Mining: Concepts and Techniques
9
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility Broad categories: intrinsic, contextual, representational, and accessibility. September 16, 2018 Data Mining: Concepts and Techniques
10
Major Tasks in Data Preprocessing
Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data September 16, 2018 Data Mining: Concepts and Techniques
11
Data Cleaning Problems --- see figure 2 of data cleaning paper
Data quality problems (figure): Single-source — schema level (poor schema design, e.g., uniqueness violations) and instance level (data entry errors, e.g., misspellings); Multi-source — schema level (heterogeneity, e.g., naming conflicts) and instance level (overlapping and contradicting data, e.g., inconsistent aggregation).
12
Data Mining: Concepts and Techniques
Data Analysis Data profiling Focuses on the instance analysis of individual attributes Derives information such as data type, length, value range, discrete values, frequency, variance, uniqueness, occurrence of null values, typical string pattern (e.g., phone numbers, zip codes), providing a view of quality aspects of the attribute September 16, 2018 Data Mining: Concepts and Techniques
13
Data Mining: Concepts and Techniques
Data analysis and DM Data Mining Discover patterns in the data Integrity constraints among attributes can be derived “business rules” With the rules at hand, one can find exceptions which may be suspicious (candidates for cleaning) E.g. Discovered rule “total = quantity*unitprice” with confidence 99%. Then, 1% of the records require closer examination (usually by hand…) September 16, 2018 Data Mining: Concepts and Techniques
14
Daimler-Chrysler Example
A warehouse which contains information about vehicle repairs To analyze quality of Products (cars) Processes (warranty claims) Services (actual repairs) To evaluate and redefine Policies Costs To collect and analyze data Wear Damages Potential recalls September 16, 2018 Data Mining: Concepts and Techniques
15
Data Source Analysis (cont.)
Discovery of integrity constraints: Vehicle type ∈ {C180, C220, C250, …}; production date precedes date of repair. Data mining can be used to uncover integrity constraints, e.g., use visualization to discover that vehicle type, power and weight correlate. 99% of WGT values fall in [1000, 2000] (are the remaining 1% incorrect?).
16
Structural integration & data mining
Description conflicts: objects modeled differently in two or more schemas. Structural conflicts: different model constructs are used (e.g., taxes included in one schema and not in the other). Data conflicts: incorrect data, different representations, etc.
17
Structural integration and DM
Using data mining methods to identify and resolve these conflicts: assume that the same vehicle repair cases are stored in two different databases, where one DB records fuel efficiency in km/liter and the other in miles/gallon. A linear regression method would discover the conversion factor!
18
Data Mining: Concepts and Techniques
Data cleansing and DM. Missing values: replace them by the most frequent value, or, better, use the inferred rules to determine the set of possible values; e.g., the rule (WGT = 800) → (WGT_GROUP = light) helps in filling a missing value of WGT_GROUP. Correction of noise and incorrect data: e.g., a 10-year-old car with an odometer reading of 10… Use rules such as (AGE > 10) → (odometer > 100,000). Conflicts between data sources: e.g., same product, different prices; again, use rules to correct.
19
Multidimensional Data Modeling and DM
Identification of orthogonal dimensions: Some fields are functionally dependent (e.g., customer birthday and age) Other fields do not strongly influence the measure (e.g., steering wheel type may not have much influence in number of repairs) Data mining methods to rank the variables according to importance and correlations can be used to decide which dimensions are kept. September 16, 2018 Data Mining: Concepts and Techniques
20
Data Mining: Concepts and Techniques
Data transformations. Define the transformations in an appropriate language, usually supported by a GUI; various ETL tools support this functionality. Alternatively, use User Defined Functions (UDFs) and SQL, e.g.: CREATE VIEW Customer2(Lname, Fname, Gender, Street, City, State, ZIP, CID) AS SELECT LastNameExtract(Name), FirstNameExtract(Name), Sex, Street, CityExtract(City), … FROM Customer. The UDFs (LastNameExtract, FirstNameExtract, CityExtract) extract names and contain cleaning logic, e.g., remove misspellings or provide missing ZIP codes.
21
Data Mining: Concepts and Techniques
Data Cleaning Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data September 16, 2018 Data Mining: Concepts and Techniques
22
Data Mining: Concepts and Techniques
Noisy Data. Noise: random error or variance in a measured variable. Incorrect attribute values may be due to: faulty data collection instruments, data entry problems, data transmission problems, technology limitations, or inconsistent naming conventions. Other data problems which require data cleaning: duplicate records, incomplete data, inconsistent data.
23
How to Handle Noisy Data?
Binning method: first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Clustering detect and remove outliers Combined computer and human inspection detect suspicious values and check by human Regression smooth by fitting the data into regression functions September 16, 2018 Data Mining: Concepts and Techniques
24
Simple Discretization Methods: Binning
Equal-width (distance) partitioning: It divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. The most straightforward But outliers may dominate presentation Skewed data is not handled well. Equal-depth (frequency) partitioning: It divides the range into N intervals, each containing approximately same number of samples Good data scaling Managing categorical attributes can be tricky. September 16, 2018 Data Mining: Concepts and Techniques
25
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 September 16, 2018 Data Mining: Concepts and Techniques
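A minimal Python sketch of equi-depth binning with smoothing by bin means and by bin boundaries, reproducing the numbers above; the function and variable names are made up for illustration.

def equi_depth_bins(values, n_bins):
    # Split sorted values into n_bins bins of (roughly) equal size.
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Every value in a bin is replaced by the bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Every value is replaced by the closest of the two bin boundaries.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)     # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))          # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))     # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]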
26
Chapter 3: Data Preprocessing
Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary September 16, 2018 Data Mining: Concepts and Techniques
27
Data Transformation: Normalization
Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A. Z-score normalization: v' = (v - mean_A) / std_dev_A. Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
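A small Python sketch of the three normalizations above; the data values are illustrative.

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]
print(min_max(incomes))          # 12000 -> 0.0, 98000 -> 1.0
print(z_score(incomes))
print(decimal_scaling(incomes))  # divide by 10^5 so every |v'| < 1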
28
Dimensionality Reduction
Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features; this reduces the number of patterns produced and makes them easier to understand. Heuristic methods (due to the exponential number of choices): step-wise forward selection, step-wise backward elimination, combining forward selection and backward elimination, decision-tree induction.
29
Data Mining: Concepts and Techniques
Example of Decision Tree Induction (figure). Initial attribute set: {A1, A2, A3, A4, A5, A6}. The induced tree tests A4 at the root, then A1 and A6 below it, with leaves labeled Class 1 and Class 2. Reduced attribute set: {A1, A4, A6}.
30
Data Mining: Concepts and Techniques
Data Compression. String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion. Audio/video compression: typically lossy, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole. Time sequences are not audio: they are typically short and vary slowly with time.
31
Data Mining: Concepts and Techniques
Wavelet Transforms (figure: Haar-2 and Daubechies-4 wavelets). Discrete wavelet transform (DWT): linear signal processing. Compressed approximation: store only a small fraction of the strongest wavelet coefficients. Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space. Method: the length L must be an integer power of 2 (pad with 0s when necessary); each transform has two functions, smoothing and difference; they are applied to pairs of data, resulting in two sets of data of length L/2; the two functions are applied recursively until the desired length is reached.
32
Principal Component Analysis
Given N data vectors from k-dimensions, find c <= k orthogonal vectors that can be best used to represent data The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions) Each data vector is a linear combination of the c principal component vectors Works for numeric data only Used when the number of dimensions is large September 16, 2018 Data Mining: Concepts and Techniques
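A minimal numpy sketch of PCA as described above (center the data, take the top c eigenvectors of the covariance matrix, project); the names are illustrative.

import numpy as np

def pca(X, c):
    # Reduce an N x k data matrix X to N x c using the top c principal components.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)          # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:c]             # indices of the c largest eigenvalues
    components = eigvecs[:, top]                    # k x c
    return X_centered @ components                  # N x c reduced representation

X = np.random.rand(100, 5)     # 100 data vectors in 5 dimensions
X_reduced = pca(X, 2)          # each row is a linear combination of 2 principal components
print(X_reduced.shape)         # (100, 2)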
33
Data Mining: Concepts and Techniques
Principal Component Analysis (figure): data plotted in the original axes X1, X2, with the principal component axes Y1 (direction of greatest variance) and Y2 overlaid.
34
Data Mining: Concepts and Techniques
Numerosity Reduction Parametric methods Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) Log-linear models: obtain value at a point in m-D space as the product on appropriate marginal subspaces Non-parametric methods Do not assume models Major families: histograms, clustering, sampling September 16, 2018 Data Mining: Concepts and Techniques
35
Regression and Log-Linear Models
Linear regression: Data are modeled to fit a straight line Often uses the least-square method to fit the line Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector Log-linear model: approximates discrete multidimensional probability distributions September 16, 2018 Data Mining: Concepts and Techniques
36
Regression Analysis and Log-Linear Models
Linear regression: Y = α + βX. The two parameters α and β specify the line and are estimated from the data at hand, using the least-squares criterion on the known values (X1, Y1), (X2, Y2), …. Multiple regression: Y = b0 + b1·X1 + b2·X2; many nonlinear functions can be transformed into this form. Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd.
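A small Python sketch of fitting Y = α + βX by least squares, as described above; the data values are made up.

def fit_line(xs, ys):
    # Least-squares estimates of alpha and beta for Y = alpha + beta * X.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

years_experience = [1, 3, 5, 7, 9]
salary           = [30, 42, 55, 61, 77]
alpha, beta = fit_line(years_experience, salary)
print(alpha, beta)   # the two stored parameters replace the raw data points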
37
Data Mining: Concepts and Techniques
Clustering Partition data set into clusters, and one can store cluster representation only Can be very effective if data is clustered but not if data is “smeared” Can have hierarchical clustering and be stored in multi-dimensional index tree structures There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8 September 16, 2018 Data Mining: Concepts and Techniques
38
Data Mining: Concepts and Techniques
Sampling (figure): raw data on one side, a cluster/stratified sample drawn from it on the other.
39
Hierarchical Reduction
Use multi-resolution structure with different degrees of reduction Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters” Parametric methods are usually not amenable to hierarchical representation Hierarchical aggregation An index tree hierarchically divides a data set into partitions by value range of some attributes Each partition can be considered as a bucket Thus an index tree with aggregates stored at each node is a hierarchical histogram September 16, 2018 Data Mining: Concepts and Techniques
40
Chapter 3: Data Preprocessing
Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary September 16, 2018 Data Mining: Concepts and Techniques
41
Data Mining: Concepts and Techniques
Discretization Three types of attributes: Nominal — values from an unordered set Ordinal — values from an ordered set Continuous — real numbers Discretization: divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical attributes. Reduce data size by discretization Prepare for further analysis September 16, 2018 Data Mining: Concepts and Techniques
42
Discretization and Concept Hierarchy
Discretization: reduce the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then be used to replace actual data values. Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
43
Discretization and concept hierarchy generation for numeric data
Binning (see sections before) Histogram analysis (see sections before) Clustering analysis (see sections before) Entropy-based discretization Segmentation by natural partitioning September 16, 2018 Data Mining: Concepts and Techniques
44
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S,T) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2). The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) − E(S,T) falls below a threshold δ. Experiments show that it may reduce data size and improve classification accuracy.
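A small Python sketch of choosing a single binary split point by minimizing the post-split entropy E(S,T), assuming numeric values paired with class labels; the names and data are illustrative.

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_boundary(values, labels):
    # Return the boundary T minimizing E(S,T) over candidate midpoints.
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [c for v, c in pairs if v <  t]
        right = [c for v, c in pairs if v >= t]
        if not left or not right:
            continue
        e = len(left) / len(pairs) * entropy(left) + \
            len(right) / len(pairs) * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages    = [23, 25, 31, 35, 42, 47, 55, 60]
classes = ["buy", "buy", "buy", "no", "no", "no", "no", "no"]
print(best_boundary(ages, classes))   # (33.0, 0.0): a perfect split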
45
Data Mining: Concepts and Techniques
Example (figure), for large data sets: a small discretization tree that first splits on Age < 25, then on Car = Sports, with leaves labeled H, H, L (risk classes).
46
Segmentation by natural partitioning
3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals. * If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals * If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals * If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals September 16, 2018 Data Mining: Concepts and Techniques
47
Data Mining: Concepts and Techniques
Example of the 3-4-5 rule (figure). Step 1: profit ranges from Min = -$351 to Max = $4,700, with Low (5th percentile) = -$159 and High (95th percentile) = $1,838. Step 2: msd = 1,000, so Low is rounded down to -$1,000 and High up to $2,000; the interval (-$1,000 … $2,000) is split into three equi-width intervals: (-$1,000 … 0], (0 … $1,000], ($1,000 … $2,000]. Step 3: adjust to the actual Min and Max — the first interval shrinks to (-$400 … 0] and a new interval ($2,000 … $5,000] is added to cover Max. Step 4: recursively apply the rule to each interval: (-$400 … 0] into 4 sub-intervals (-$400 … -$300], (-$300 … -$200], (-$200 … -$100], (-$100 … 0]; (0 … $1,000] into 5 sub-intervals (0 … $200], ($200 … $400], ($400 … $600], ($600 … $800], ($800 … $1,000]; ($1,000 … $2,000] into 5 sub-intervals ($1,000 … $1,200], ($1,200 … $1,400], ($1,400 … $1,600], ($1,600 … $1,800], ($1,800 … $2,000]; ($2,000 … $5,000] into 3 sub-intervals ($2,000 … $3,000], ($3,000 … $4,000], ($4,000 … $5,000].
48
Concept hierarchy generation for categorical data
Specification of a partial ordering of attributes explicitly at the schema level by users or experts Specification of a portion of a hierarchy by explicit data grouping Specification of a set of attributes, but not of their partial ordering Specification of only a partial set of attributes September 16, 2018 Data Mining: Concepts and Techniques
49
Specification of a set of attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set; the attribute with the most distinct values is placed at the lowest level of the hierarchy. Example (top to bottom): country (15 distinct values), province_or_state (65), city (3,567), street (674,339). A sketch of this heuristic follows below.
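A tiny Python sketch of the heuristic: order the attributes by their number of distinct values, fewest first; the column names mirror the example above and the rows are made up.

def concept_hierarchy(table, attributes):
    # Order attributes from fewest to most distinct values (top of hierarchy first).
    counts = {a: len({row[a] for row in table}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a])

rows = [
    {"country": "Canada", "province_or_state": "BC", "city": "Vancouver", "street": "Main St"},
    {"country": "Canada", "province_or_state": "BC", "city": "Victoria",  "street": "Oak St"},
    {"country": "Canada", "province_or_state": "ON", "city": "Toronto",   "street": "King St"},
    {"country": "USA",    "province_or_state": "NY", "city": "New York",  "street": "5th Ave"},
    {"country": "USA",    "province_or_state": "NY", "city": "New York",  "street": "Broadway"},
]
print(concept_hierarchy(rows, ["street", "city", "country", "province_or_state"]))
# ['country', 'province_or_state', 'city', 'street']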
50
Data Mining: Concepts and Techniques
Lecture #4 September 16, 2018 Data Mining: Concepts and Techniques
51
Data Mining: Concepts and Techniques
Example (cube lattice figure): cuboid sizes — PSC: 6 million cells; PC: 6 million; PS: 0.8 million; SC: 6 million; P: 0.2 million; S: 0.05 million; C: 0.1 million; ALL: 1 cell. The lattice orders the cuboids PSC → {PC, PS, SC} → {P, S, C} → ALL.
52
Data Mining: Concepts and Techniques
Decisions, decisions... How many views must we materialize to get good performance? Given space S (on disk), which views do we materialize? In the previous example we would need space for about 19 million cells to materialize everything. Can we do better, while still avoiding the raw (fact table) data? PC (6M) can be answered using PSC (6M) — no advantage in materializing it; likewise SC (6M) can be answered using PSC (6M) — no advantage.
53
Data Mining: Concepts and Techniques
Example again (figure): the lattice annotated with, for each cuboid, its size and the cost of answering it from the cheapest materialized ancestor (PSC 6M; PC 6M; PS 0.8M; SC 6M; P 0.2M; S 0.01M; C 0.1M). Materializing only a subset of the cuboids uses far less space than materializing all of them, yet gives about the same query performance.
54
Data Mining: Concepts and Techniques
Formal treatment. Dependency relation: Q1 ⪯ Q2 means Q1 can be answered using the results of Q2; e.g., Q(P) ⪯ Q(PC) ⪯ Q(PSC), which makes the set of cuboids a lattice. Add hierarchies on the dimensions (figure): customers C → N (nation-wide customers, e.g., USA, Japan) → DF (domestic-foreign) → ALL (all customers); suppliers S → SN (nation-wide) → ALL; parts P → Sz (size) and Ty (type) → ALL.
55
Data Mining: Concepts and Techniques
Formal treatment (2) (figure): the combined lattice with estimated sizes — CP (6M), CSz (5M), CTy (5.99M), NP (5M), NSz (1,250), NTy (3,750), C (0.1M), P (0.2M), N (25), Sz (50), Ty (150), ALL (1).
56
Optimizing Data Cube lattices
First problem (no space restrictions): a VERY HARD problem (NP-complete). Heuristic: always include the "core" (base) cuboid; at every step you have a set Sv of materialized views. Compute the benefit of a view v relative to Sv as follows. For each w ⪯ v, define B_w: let u be the view of least cost in Sv such that w ⪯ u; if Cost(v) < Cost(u) then B_w = Cost(v) − Cost(u) (negative), else B_w = 0. Define B(v, Sv) = −Σ_{w ⪯ v} B_w.
57
Data Mining: Concepts and Techniques
Greedy algorithm:
Sv = {core view}
for i = 1 to k begin
  select v not in Sv such that B(v, Sv) is maximum
  Sv = Sv ∪ {v}
end
A runnable sketch follows below.
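A minimal Python sketch of the greedy view-selection heuristic above, under the usual assumption that the cost of answering a query from a materialized view equals that view's size; the lattice and sizes are illustrative, and the core cuboid is assumed to be an ancestor of every cuboid.

def greedy_views(sizes, ancestors, core, k):
    # sizes: cuboid -> estimated size; ancestors: cuboid -> cuboids it can be answered from
    # (including itself); pick k views beyond the core cuboid.
    selected = {core}

    def benefit(v):
        total = 0
        for w, anc in ancestors.items():
            if v not in anc:
                continue
            cheapest = min(sizes[u] for u in anc if u in selected)
            if sizes[v] < cheapest:
                total += cheapest - sizes[v]
        return total

    for _ in range(k):
        candidates = [v for v in sizes if v not in selected]
        best = max(candidates, key=benefit)
        selected.add(best)
    return selected

sizes = {"PSC": 6.0, "PC": 6.0, "PS": 0.8, "SC": 6.0,
         "P": 0.2, "S": 0.05, "C": 0.1, "ALL": 0.000001}   # millions of cells
ancestors = {   # which materialized cuboids can answer each cuboid
    "PSC": {"PSC"}, "PC": {"PC", "PSC"}, "PS": {"PS", "PSC"}, "SC": {"SC", "PSC"},
    "P": {"P", "PC", "PS", "PSC"}, "S": {"S", "PS", "SC", "PSC"},
    "C": {"C", "PC", "SC", "PSC"}, "ALL": set(sizes),
}
print(greedy_views(sizes, ancestors, core="PSC", k=3))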
58
Data Mining: Concepts and Techniques
September 16, 2018 Data Mining: Concepts and Techniques
59
Data Mining: Concepts and Techniques
Structures Two levels: Blocks in the first level correspond to the dense dimension combinations. The basic block will have the size proportional to the product of the cardinalities for these dimensions. Each entry in the block points to a second-level block. Blocks in the second level correspond to the sparse dimensions. They are arrays of pointers, as many as the product of the cardinalities for sparse dimensions. Each pointer has one of three values: null (non-existent data), impossible (non-allowed combination) or a pointer to an actual data block. September 16, 2018 Data Mining: Concepts and Techniques
60
Data Mining: Concepts and Techniques
Data Example Departments will generally have data for each Time period. (so the two are the dense dimension combination) Geographical information, Product and Distribution channels, on the other hand are typically sparse (e.g., most cities have only one Distribution channel and some Product values). Dimensions Departments (Sales,Mkt) Time Geographical information Product Distribution channels September 16, 2018 Data Mining: Concepts and Techniques
61
Data Mining: Concepts and Techniques
Structures revisited (figure): the upper-level dense structure has one entry per (Department, Quarter) combination — S,1Q; S,2Q; S,3Q; S,4Q; M,1Q; M,2Q; M,3Q; M,4Q — and each entry points to a lower-level sparse array over (Geography, Product, Distribution channel), whose non-null pointers lead to actual data blocks.
62
Data Mining: Concepts and Techniques
Allocating memory Define member structure (e.g., dimensions) Select dense dimension combinations and create upper level structure Create lower level structure. Input data cell: if pointer to data block is empty, create new else insert data in data block September 16, 2018 Data Mining: Concepts and Techniques
63
Problem 2: COMPUTING DATACUBES
Four algorithms PIPESORT PIPEHASH SORT-OVERLAP Partitioned-cube September 16, 2018 Data Mining: Concepts and Techniques
64
Data Mining: Concepts and Techniques
Optimizations. Smallest-parent: AB can be computed from ABC, ABD, or ABCD — which one should we use? Cache-results: having computed ABC, we compute AB from it while ABC is still in memory. Amortize-scans: we may try to compute ABC, ACD, ABD, and BCD in one scan of ABCD. Share-sorts and share-partitions.
65
Data Mining: Concepts and Techniques
PIPESORT. Input: cube lattice and cost matrix. Each edge e_ij in the lattice is annotated with two costs: S(i,j), the cost of computing j from i when i is not sorted, and A(i,j), the cost of computing j from i when i is sorted. Output: a subgraph of the lattice where each cuboid (group-by) is connected to a single parent from which it will be computed and is associated with an attribute order in which it will be sorted. If that order is a prefix of the order of its parent, the child can be computed without sorting the parent (cost A); otherwise the parent has to be sorted (cost S). For every parent there will be only one out-edge labeled A.
66
Data Mining: Concepts and Techniques
PIPESORT (2). Algorithm: proceeds in levels, k = 0, …, N−1 (N = number of dimensions). For each level, it finds the best way of computing level k from level k+1 by reducing the problem to a weighted bipartite matching problem: make k additional copies of each level-(k+1) group-by (each node then has k+1 vertices) and connect them to the same children as the original; the edges from the original copy carry A costs, while the edges from the copies carry S costs. Find the minimum-cost matching in this bipartite graph (each vertex in level k+1 is matched with one vertex in level k).
67
Data Mining: Concepts and Techniques
Example (figure): the level-2 group-bys AB, AC, BC are each duplicated (AB, AB, AC, AC, BC, BC) and connected to the level-1 group-bys A, B, C for the bipartite matching.
68
Data Mining: Concepts and Techniques
Transformed lattice (figure): level-1 group-bys A, B, C; level-2 copies annotated with costs AB(2), AB(10), AC(5), AC(12), BC(13), BC(20) — the lower cost of each pair is the A (already sorted) cost, the higher one the S (re-sort) cost.
69
Data Mining: Concepts and Techniques
Explanation of edges (figure): consider the two copies AB(2) and AB(10) with edges to A. The edge with cost 10 (an S cost) means the group-by is effectively stored as BA, so it must be re-sorted to produce A; the edge with cost 2 (an A cost) means we have AB already in sorted order, so A can be computed with no sort.
70
PIPESORT pseudo-algorithm
Pipesort (input: lattice with A() and S() edge costs):
for level k = 0 to N−1:
  Generate_plan(k+1);
  for each cuboid g in level k:
    fix the sort order of g as the order of the cuboid connected to g by an A edge;
71
Data Mining: Concepts and Techniques
Generate_plan(k+1):
  make k additional copies of each level-(k+1) cuboid;
  connect each copy to the same set of vertices as the original;
  assign A costs to the original edges and S costs to the copies' edges;
  find a minimum-cost matching on the transformed graph;
72
Data Mining: Concepts and Techniques
Example (figure): a worked pipesort plan.
73
Data Mining: Concepts and Techniques
PipeHash. Input: the lattice and the estimated sizes of the cuboids. PipeHash chooses for each vertex the parent with the smallest estimated size; the outcome is a minimum spanning tree (MST), where each vertex is a cuboid and an edge from i to j means that i is the smallest parent of j. Available memory is usually not enough to compute all the cuboids in the MST together, so we need to decide which cuboids can be computed together (a sub-MST), when to allocate and deallocate memory for the different hash tables, and which attribute to use for partitioning the data.
Initialize the worklist with the MST of the search lattice;
while the worklist is not empty:
  pick a tree T from the worklist;
  T' = Select-subtree(T) to be executed next;
  Compute-subtree(T');
74
Data Mining: Concepts and Techniques
Select-subtree(T): if the memory required by T is less than what is available, return T. Else, let S be the set of attributes in root(T). For any s ∈ S we get a subtree Ts of T, also rooted at root(T), including all cuboids that contain s; let Ps be the maximum number of partitions of root(T) possible when partitioning on s. Choose s such that mem(Ts)/Ps is smaller than the available memory and Ts is the largest such subtree over all choices of s. Remove Ts from T (and put T − Ts back in the worklist).
75
Data Mining: Concepts and Techniques
Compute-subtree(T'): numP = mem(T') * f / available memory; partition the root of T' into numP partitions. For each partition of root(T'): for each node n in T', compute all children of n in one scan; once n has been fully computed (cached), save it to disk and release the memory occupied by its hash table.
76
Data Mining: Concepts and Techniques
OVERLAP — sorted runs: consider a cuboid on j attributes {A1, A2, …, Aj}; we use B = (A1, A2, …, Aj) to denote the cuboid sorted in that attribute order. Consider S = (A1, A2, …, Al−1, Al+1, …, Aj), computed from B. A sorted run R of S in B is defined as R = π_S(Q), where Q is a maximal sequence of tuples of B such that, for each tuple in Q, the first l columns have the same value.
77
Data Mining: Concepts and Techniques
Sorted-run B = [(a,1,2),(a,1,3),(a,2,2),(b,1,3),(b,3,2),(c,3,1)] S = first and third attribute S = [(a,2),(a,3),(b,3),(b,2),(c,1)] Sorted runs: [(a,2),(a,3)] [(a,2)] [(b,3)] [(b,2)] [(c,1)] September 16, 2018 Data Mining: Concepts and Techniques
78
Data Mining: Concepts and Techniques
Partitions B and S have a common prefix (A1… Al-1) A partition of the cuboid S in B is the union of sorted runs such that the first l-1 columns of all the tuples of the sorted runs have the same values. [(a,2),(a,3)] [(b,2),(b,3)] [(c,1)] September 16, 2018 Data Mining: Concepts and Techniques
79
Data Mining: Concepts and Techniques
OVERLAP Sort the base cuboid: this forces the sorted order in which the other cuboids are computed ABCD ABC ABD ACD BCD AB AC BC AD CD BD A B C D ALL September 16, 2018 Data Mining: Concepts and Techniques
80
Data Mining: Concepts and Techniques
OVERLAP (2). If there is enough memory to hold all the cuboids, compute them all (very seldom true). Otherwise, use the partition as the unit of computation: we just need sufficient memory to hold one partition. As soon as a partition is computed, its tuples can be pipelined to compute descendant cuboids (same partition) and then written to disk; the memory is then reused for the next partition. Example XYZ → XZ, with partitions [(a,2),(a,3)], [(b,2),(b,3)], [(c,1)] and XYZ = [(a,1,2),(a,1,3),(a,2,2),(b,1,3),(b,3,2),(c,3,1)]: compute the (c,3,1) cell of XYZ and use it to compute (c,1) in XZ, then write these cells to disk; compute the (b,1,3) and (b,3,2) cells of XYZ and use them to compute [(b,2),(b,3)] in XZ, then write these cells to disk; compute the (a,1,2), (a,1,3), (a,2,2) cells of XYZ and use them to compute (a,2), (a,3) in XZ, then write all these cells to disk.
81
Data Mining: Concepts and Techniques
OVERLAP(3) Choose a parent to compute a cuboid: DAG Goal: minimize the size of the partitions of a cuboid, so less memory is needed. E.g., it is better to compute AC from ACD than from ABC, (since the sort order matches and the partition size is 1). This is a hard problem. Heuristic: maximize the size of the common prefix. ABCD ABC ABD ACD BCD AB AC BC AD CD BD A B C D ALL September 16, 2018 Data Mining: Concepts and Techniques
82
Data Mining: Concepts and Techniques
OVERLAP (4) Choosing a set of cuboids for overlapped computation, according to your memory constraints. To compute a cuboid in memory, we need memory equal to the size of its partition. Partition sizes can be estimated from cuboid sizes by using some distribution (uniform?) assumption. If this much memory can be spared, then the cuboid will be marked as in Partition state. For other cuboids, allocate a single page (for temporary results), these cuboids are in SortRun state. A cuboid in partition state can have its tuples pipelined for computation of its descendants. A cuboid can be considered for computation if it is the root, or its parent is marked as in Partition State The total memory allocated to all cuboids cannot be more than the available memory. September 16, 2018 Data Mining: Concepts and Techniques
83
Data Mining: Concepts and Techniques
OVERLAP (5). Again, a hard problem… Heuristic: traverse the tree in BFS manner. (Figure: the lattice annotated with partition sizes — ABC(1), ABD(1), ACD(1), BCD(50), AB(1), AC(1), BC(1), AD(5), CD(40), BD(1), A(1), B(1), C(1), D(5).)
84
Data Mining: Concepts and Techniques
Computing a cuboid S from its parent B (output: the sorted cuboid S):
foreach tuple t of B do
  if (state == Partition) then process_partition(t) else process_sorted_run(t);
Process_partition: if the input tuple starts a new partition, output the current partition at the end of the cuboid and start a new one; if the input tuple matches an existing tuple in the partition, update the aggregate; else insert the input tuple as a new aggregate.
Process_sorted_run: if the input tuple starts a new sorted run, flush all the pages of the current sorted run and start a new one; if the input tuple matches the last tuple in the sorted run, recompute the aggregate; else append the tuple to the end of the existing run.
85
Data Mining: Concepts and Techniques
Observations In ABCD ABC, the partition size is 1. Why? In ABCD ABD, the partition size is equal to the number of distinct C values, Why? In ABCD BCD the partition size is the size of the cuboid BCD, Why? September 16, 2018 Data Mining: Concepts and Techniques
86
Data Mining: Concepts and Techniques
Lecture #5 September 16, 2018 Data Mining: Concepts and Techniques
87
Data Mining: Concepts and Techniques
Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of indexed keys. Dynamic, stable and exhibit good performance under updates. (But OLAP is not about updates….) Bitmaps: Space efficient Difficult to update (but we don’t care in DW). Can effectively prune searches before looking at data. September 16, 2018 Data Mining: Concepts and Techniques
88
Data Mining: Concepts and Techniques
Bitmaps (figure): a relation R = (…, A, …, M) where attribute A has cardinality 9; the index on A consists of nine bitmaps B0 … B8, one per attribute value, each with one bit per tuple of R.
89
Data Mining: Concepts and Techniques
Query optimization Consider a high-selectivity-factor query with predicates on two attributes. Query optimizer: builds plans (P1) Full relation scan (filter as you go). (P2) Index scan on the predicate with lower selectivity factor, followed by temporary relation scan, to filter out non-qualifying tuples, using the other predicate. (Works well if data is clustered on the first index key). (P3) Index scan for each predicate (separately), followed by merge of RID. September 16, 2018 Data Mining: Concepts and Techniques
90
Query optimization (continued)
(Figure) Plan P2 scans the index on predicate 1 to get a tuple list, fetches the corresponding blocks of data, and filters them with predicate 2; plan P3 scans the index on each predicate separately, producing tuple list 1 and tuple list 2, and merges the two RID lists to obtain the answer.
91
Query optimization (continued)
When using bitmap indexes (P3) can be an easy winner! CPU operations in bitmaps (AND, OR, XOR, etc.) are more efficient than regular RID merges: just apply the binary operations to the bitmaps (In B-trees, you would have to scan the two lists and select tuples in both -- merge operation--) Of course, you can build B-trees on the compound key, but we would need one for every compound predicate (exponential number of trees…). September 16, 2018 Data Mining: Concepts and Techniques
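A tiny Python sketch of why bitmap ANDing is cheap: each predicate's qualifying rows are represented as a bitmap (here a Python integer used as a bit vector), and the conjunctive answer is a single bitwise AND; the row numbers are made up.

# One bit per row of the fact table (bit i set <=> row i satisfies the predicate).
def bitmap(matching_rows, n_rows):
    bm = 0
    for r in matching_rows:
        bm |= 1 << r
    return bm

n_rows = 16
pred1 = bitmap([0, 2, 5, 7, 9, 12], n_rows)   # e.g., rows where A = 'a'
pred2 = bitmap([2, 3, 7, 8, 12, 15], n_rows)  # e.g., rows where B = 'x'

answer = pred1 & pred2                        # conjunction: one word-wise AND
rows = [i for i in range(n_rows) if answer >> i & 1]
print(rows)                                   # [2, 7, 12]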
92
Data Mining: Concepts and Techniques
Tradeoffs: small dimension cardinality → dense bitmaps; large dimension cardinality → sparse bitmaps; compression helps, at the price of decompression.
93
Query strategy for Star joins
Maintain join indexes between fact table and dimension tables Prod. Fact table Dimension table a k … … Bitmap for type a Bitmap for type k ….. Bitmap for loc. Bitmap for loc. ….. Bitmap for prod Bitmap for prod ….. September 16, 2018 Data Mining: Concepts and Techniques
94
Data Mining: Concepts and Techniques
Star-Joins. SELECT F.S, D1.A1, D2.A2, …, Dn.An FROM F, D1, D2, …, Dn WHERE F.A1 = D1.A1 AND F.A2 = D2.A2 AND … AND F.An = Dn.An AND D1.B1 = 'c1' AND D2.B2 = 'p2' AND …. Likely strategy: for each Di, find the suitable values of Ai such that Di.Bi = 'xi' (unless you have a bitmap index for Bi); use the bitmap index on those Ai values to form a bitmap for the related rows of F (OR-ing the bitmaps). At this stage you have n such bitmaps, and the result can be found by AND-ing them.
95
Data Mining: Concepts and Techniques
Example: selectivity per predicate = 0.01 (predicates on the dimension tables), with n statistically independent predicates, so total selectivity = 10^(-2n). With a fact table of 10^8 rows and n = 3, the answer has 10^8 × 10^-6 = 100 rows; in the worst case that is 100 blocks — still far better than reading all the blocks of the relation (e.g., assuming 100 tuples/block, that would be 10^6 blocks!).
96
Design Space of Bitmap Indexes
The basic bitmap design is called Value-list index. The focus there is on the columns If we change the focus to the rows, the index becomes a set of attribute values (integers) in each tuple (row), that can be represented in a particular way. We can encode this row in many ways... September 16, 2018 Data Mining: Concepts and Techniques
97
Attribute value decomposition
Let C be the attribute cardinality. Consider a value v of the attribute and a sequence of bases <b_{n-1}, b_{n-2}, …, b_1>; define b_n = ceil(C / (b_{n-1} · … · b_1)). Then v can be decomposed into a sequence of n digits <v_n, v_{n-1}, …, v_1> as follows: v = V_1 = V_2·b_1 + v_1 = V_3·(b_2·b_1) + v_2·b_1 + v_1 = … = v_n·(b_{n-1}·…·b_1) + … + v_i·(b_{i-1}·…·b_1) + … + v_2·b_1 + v_1, where v_i = V_i mod b_i and V_i = floor(V_{i-1} / b_{i-1}), with V_1 = v.
98
Data Mining: Concepts and Techniques
Number systems — how do you write 576 in different base sequences? In <7,7,5,3>: 576 / (7×7×5×3) = 576 / 735 = 0 remainder 576 (so the value fits); 576 / (7×5×3) = 576 / 105 = 5 remainder 51; 51 / (5×3) = 51 / 15 = 3 remainder 6; 6 / 3 = 2 remainder 0; hence 576 = 5×(7×5×3) + 3×(5×3) + 2×3 + 0, i.e., digits <5, 3, 2, 0>. In <2,2,2,2,2,2,2,2,2,2> (binary): 576 = 1×2^9 + 0×2^8 + 0×2^7 + 1×2^6 + 0×2^5 + … + 0×2^0, i.e., 1001000000. In <10,10,10> (the decimal system): 576 / 100 = 5 remainder 76; 76 / 10 = 7 remainder 6; hence 576 = 5×10^2 + 7×10 + 6.
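A small Python sketch of decomposing a value into digits for an arbitrary base sequence, reproducing the 576 examples above (the leading base in the slide's <7,7,5,3> only bounds the most significant digit, so it is not needed for the arithmetic).

def decompose(v, bases):
    # bases = <b_{n-1}, ..., b_1>; returns the digits <v_n, ..., v_1>.
    digits = []
    for b in reversed(bases):      # process b_1 first
        digits.append(v % b)
        v //= b
    digits.append(v)               # v_n: whatever is left
    return list(reversed(digits))

print(decompose(576, [7, 5, 3]))   # [5, 3, 2, 0]
print(decompose(576, [2] * 9))     # [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  (binary 1001000000)
print(decompose(576, [10, 10]))    # [5, 7, 6]  (decimal)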
99
Data Mining: Concepts and Techniques
September 16, 2018 Data Mining: Concepts and Techniques
100
Data Mining: Concepts and Techniques
Bitmaps, value-list index (figure): the same relation R = (…, A, …, M); the value-list index on A again consists of one bitmap per attribute value, B0 … B8.
101
Data Mining: Concepts and Techniques
Example (figure): with base sequence <3,3>, each value of A is decomposed into two digits, and the value-list (equality-encoded) index has bitmaps B0^1, B1^1, B2^1 for the low-order digit and B0^2, B1^2, B2^2 for the high-order digit; e.g., the value 3 = 1×3 + 0 sets B1^2 and B0^1.
102
Data Mining: Concepts and Techniques
Encoding scheme. Equality encoding: all bits set to 0 except the one that corresponds to the value. Range encoding: the v_i rightmost bits set to 0, the remaining bits set to 1.
103
Range encoding single component, base-9
(Figure: the relation R with attribute A and its single-component, base-9 range-encoded bitmaps B0 … B8; for a tuple with value v, bitmaps B_v … B_8 are set to 1 and B_0 … B_{v-1} to 0.)
104
Data Mining: Concepts and Techniques
Example revisited (figure): the same base-<3,3> equality-encoded (value-list) index as before, shown again for comparison with the range-encoded version on the next slide.
105
Data Mining: Concepts and Techniques
Example (figure): with base sequence <3,3>, the range-encoded index needs only bitmaps B0^1, B1^1 for the low-order digit and B0^2, B1^2 for the high-order digit (the all-ones bitmap B2 of each component can be dropped).
106
Data Mining: Concepts and Techniques
Design space (figure): a spectrum of encodings with range encoding at one end and equality encoding at the other.
107
Data Mining: Concepts and Techniques
RangeEval evaluates each range predicate by computing two bitmaps: a BEQ bitmap and either a BGT or a BLT bitmap. RangeEval-Opt uses only <=: A < v is the same as A <= v−1; A > v is the same as NOT(A <= v); A >= v is the same as NOT(A <= v−1). A small sketch follows below.
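A minimal Python sketch of the RangeEval-Opt rewrites over a single-component range-encoded index, where bitmap LE[v] marks the tuples with A <= v (bitmaps are Python ints); purely illustrative.

values = [3, 0, 7, 2, 5, 3, 8, 1]         # attribute A, one value per tuple
C = 9                                     # attribute cardinality
ALL = (1 << len(values)) - 1              # all-ones bitmap

# Range-encoded bitmaps: bit i of LE[v] is set iff values[i] <= v.
LE = [sum((values[i] <= v) << i for i in range(len(values))) for v in range(C)]

def le(v):            # A <= v
    return ALL if v >= C - 1 else (0 if v < 0 else LE[v])

def lt(v):            # A <  v   ==  A <= v-1
    return le(v - 1)

def gt(v):            # A >  v   ==  NOT(A <= v)
    return ALL & ~le(v)

def ge(v):            # A >= v   ==  NOT(A <= v-1)
    return ALL & ~le(v - 1)

print([i for i in range(len(values)) if gt(4) >> i & 1])   # rows with A > 4 -> [2, 4, 6]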
108
Data Mining: Concepts and Techniques
RangeEval-OPT September 16, 2018 Data Mining: Concepts and Techniques
109
Data Mining: Concepts and Techniques
September 16, 2018 Data Mining: Concepts and Techniques
110
Tree-Structured Indexes
The slides for this text are organized into chapters; this lecture covers Chapter 9: Tree-Structured Indexing.
111
Data Mining: Concepts and Techniques
Introduction As for any index, 3 alternatives for data entries k*: Data record with key value k <k, rid of data record with search key value k> <k, list of rids of data records with search key k> Choice is orthogonal to the indexing technique used to locate data entries k*. Tree-structured indexing techniques support both range searches and equality searches. ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes. September 16, 2018 Data Mining: Concepts and Techniques 2
112
Data Mining: Concepts and Techniques
Range Searches. "Find all students with gpa > 3.0." If the data is in a sorted file, do a binary search to find the first such student, then scan to find the others; but the cost of the binary search can be quite high. Simple idea: create an 'index' file with one entry (k1, k2, …, kN) per page of the data file (figure), and do the binary search on the (much smaller) index file!
113
Data Mining: Concepts and Techniques
ISAM. An index entry has the form <P0, K1, P1, K2, P2, …, Km, Pm>. The index file may still be quite large, but we can apply the idea repeatedly! (Figure: non-leaf pages on top, leaf pages below — the primary pages — with overflow pages chained off them.) Leaf pages contain data entries.
114
Data Mining: Concepts and Techniques
Comments on ISAM. File creation: leaf (data) pages are allocated sequentially, sorted by search key; then index pages are allocated, then space for overflow pages. Index entries: <search key value, page id>; they 'direct' the search for data entries, which live in the leaf pages. Search: start at the root and use key comparisons to go to a leaf; cost ≈ log_F N, where F = # entries per index page and N = # leaf pages. Insert: find the leaf the data entry belongs to and put it there. Delete: find and remove from the leaf; if this empties an overflow page, de-allocate it. Static tree structure: inserts and deletes affect only leaf pages.
115
Data Mining: Concepts and Techniques
Example ISAM Tree (figure). Each node can hold 2 entries; no need for 'next-leaf-page' pointers. The root holds key 40; the two second-level index nodes hold keys (20, 33) and (51, 63); the leaf pages hold the data entries 10*, 15*, 20*, 27*, 33*, 37*, 40*, 46*, 51*, 55*, 63*, 97*.
116
Data Mining: Concepts and Techniques
After inserting 23*, 48*, 41*, 42* … (figure): the index pages are unchanged (root 40; nodes 20, 33 and 51, 63) and the primary leaf pages still hold 10* … 97*; the new entries go to overflow pages — 23*, 48*, 41* in first-level overflow pages and 42* in a further overflow page chained behind the one holding 41*.
117
Data Mining: Concepts and Techniques
… Then deleting 42*, 51*, 97* (figure): the index pages are unchanged; the primary leaves now hold 10*, 15*, 20*, 27*, 33*, 37*, 40*, 46*, 55*, 63*, and the overflow entries 23*, 48*, 41* remain. Note that 51 still appears at the index level, but no longer in any leaf!
118
B+ Tree: The Most Widely Used Index
Insert/delete at log_F N cost; the tree is kept height-balanced (F = fanout, N = # leaf pages). Minimum 50% occupancy (except for the root): each node contains m entries with d <= m <= 2d, where the parameter d is called the order of the tree. Supports equality and range searches efficiently. (Figure: index entries above, used for direct search; data entries below, forming the "sequence set".)
119
Data Mining: Concepts and Techniques
Example B+ Tree (figure). Search begins at the root, and key comparisons direct it to a leaf (as in ISAM). Search for 5*, for 15*, and for all data entries >= 24*. The root holds keys 13, 17, 24, 30; the leaves hold 2*, 3*, 5*, 7* | 14*, 16* | 19*, 20*, 22* | 24*, 27*, 29* | 33*, 34*, 38*, 39*. Based on the search for 15*, we know it is not in the tree!
120
Data Mining: Concepts and Techniques
B+ Trees in Practice. Typical order: 100; typical fill-factor: 67%, giving an average fanout of 133. Typical capacities: height 4: 133^4 = 312,900,700 records; height 3: 133^3 = 2,352,637 records. The top levels can often be held in the buffer pool: level 1 = 1 page = 8 KB; level 2 = 133 pages = 1 MB; level 3 = 17,689 pages = 133 MB.
121
Inserting a Data Entry into a B+ Tree
Find correct leaf L. Put data entry onto L. If L has enough space, done! Else, must split L (into L and a new node L2) Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L. This can happen recursively To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height. Tree growth: gets wider or one level taller at top. September 16, 2018 Data Mining: Concepts and Techniques 6
122
Inserting 8* into Example B+ Tree
(Figure.) Observe how minimum occupancy is guaranteed in both leaf and index page splits. Note the difference between copy-up and push-up, and be sure you understand the reasons for it: the leaf split produces leaves 2*, 3* and 5*, 7*, 8*, and the entry to be inserted in the parent node carries key 5, which is copied up and continues to appear in the leaf; the index split produces an entry to be inserted in the parent node with key 17, which is pushed up and appears only once in the index — contrast this with a leaf split.
123
Example B+ Tree After Inserting 8*
(Figure) The root now holds 17, with index nodes 5, 13 and 24, 30 below it, and leaves 2*, 3* | 5*, 7*, 8* | 14*, 16* | 19*, 20*, 22* | 24*, 27*, 29* | 33*, 34*, 38*, 39*. Notice that the root was split, leading to an increase in height. In this example we could avoid the split by redistributing entries; however, this is usually not done in practice.
124
Deleting a Data Entry from a B+ Tree
Start at root, find leaf L where entry belongs. Remove the entry. If L is at least half-full, done! If L has only d-1 entries, Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height. September 16, 2018 Data Mining: Concepts and Techniques 14
125
Example Tree After (Inserting 8*, Then) Deleting 19* and 20* ...
(Figure) The root holds 17, with index nodes 5, 13 and 27, 30, and leaves 2*, 3* | 5*, 7*, 8* | 14*, 16* | 22*, 24* | 27*, 29* | 33*, 34*, 38*, 39*. Deleting 19* is easy; deleting 20* is done with redistribution — notice how the middle key is copied up.
126
Data Mining: Concepts and Techniques
… And then deleting 24* (figure): we must merge. Observe the 'toss' of the index entry 27 (on the right) and the 'pull down' of the index entry 17 (below); the final tree has a single index node 5, 13, 17, 30 over the leaves 2*, 3* | 5*, 7*, 8* | 14*, 16* | 22*, 27*, 29* | 33*, 34*, 38*, 39*.
127
Data Mining: Concepts and Techniques
Summary Tree-structured indexes are ideal for range-searches, also good for equality searches. ISAM is a static structure. Only leaf pages modified; overflow pages needed. Overflow chains can degrade performance unless size of data set and data distribution stay constant. B+ tree is a dynamic structure. Inserts/deletes leave tree height-balanced; log F N cost. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. September 16, 2018 Data Mining: Concepts and Techniques 23
128
Data Mining: Concepts and Techniques
Summary (Contd.) Typically, 67% occupancy on average. Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully. Key compression increases fanout, reduces height. Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set. Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS. September 16, 2018 Data Mining: Concepts and Techniques 24
129
Data Mining: Concepts and Techniques
Lecture #6 September 16, 2018 Data Mining: Concepts and Techniques
130
Monitoring Techniques
Periodic snapshots, database triggers, log shipping, data shipping (replication service), transaction shipping, polling (queries to the source), application-level monitoring → each has advantages & disadvantages!
131
Data Mining: Concepts and Techniques
Monitoring Issues Frequency periodic: daily, weekly, … triggered: on “big” change, lots of changes, ... Data transformation convert data to uniform format remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways September 16, 2018 Data Mining: Concepts and Techniques
132
Data Mining: Concepts and Techniques
Monitoring Products Gateways: Info Builders EDA/SQL, Oracle Open Connect, Informix Enterprise Gateway, … Data Shipping: Oracle Replication Server, Praxis OmniReplicator, … Transaction Shipping: Sybase Replication Server, Microsoft SQL Server Extraction: Aonix, ETI, CrossAccess, DBStar Monitoring/Integration products later on September 16, 2018 Data Mining: Concepts and Techniques
133
Data Mining: Concepts and Techniques
Integration (architecture figure): sources feed an integration layer (data cleaning, data loading, derived data), which populates the warehouse and its metadata; clients run query & analysis against the warehouse.
134
Change detection Detect & send changes to integrator Different classes of sources Cooperative Queryable Logged Snapshot/dump
135
Data transformation Convert data to uniform format Byte ordering, string termination Internal layout Remove, add, & reorder attributes Add (regeneratable) key Add date to get history
136
Data transformation (2)
Sort tuples. May use external utilities, which can be much faster (10x) than the SQL engine, e.g., a perl script to reorder attributes.
137
External functions (EFs)
Special transformation functions E.g., Yen_to_dollars User defined Specified in warehouse table definition Aid in integration Must be applied to updates, too
138
Data integration Rules for matching data from different sources Build composite view of data Eliminate duplicate, unneeded attributes
139
Data Mining: Concepts and Techniques
Data Cleaning. Migration (e.g., yen → dollars). Scrubbing: use domain-specific knowledge (e.g., social security numbers). Fusion (e.g., mailing lists, customer merging). Auditing: discover rules & relationships (like data mining). (Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe).)
140
Data cleansing Find (& remove) duplicate tuples E.g., Jane Doe & Jane Q. Doe Detect inconsistent, wrong data Attributes that don’t match E.g., city, state and zipcode Patch missing, unreadable data Want to “backflush” clean data Notify sources of errors found
141
Data Mining: Concepts and Techniques
Loading Data Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load September 16, 2018 Data Mining: Concepts and Techniques
142
Data Mining: Concepts and Techniques
Derived Data Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. refresh September 16, 2018 Data Mining: Concepts and Techniques
143
The “everything is a view” view
Pure programs: e.g., "canned queries" — always the same cost; no data is materialized (DBMSs). Derived data: materialized views — the data is always there but must be updated (good for warehouses). Pure data: snapshots — the procedure is thrown away, so not maintainable. Approximate: a snapshot plus a refresh procedure applied under some conditions (quasi-copies), or approximate models, e.g., statistical ones (quasi-cubes).
144
Data Mining: Concepts and Techniques
Materialized Views: define new warehouse relations using SQL expressions; the materialized relation does not exist at any source.
145
Data Mining: Concepts and Techniques
Integration Products. Monitoring & integration: Apertus, Informatica, Prism, Sagent, … Merging: DataJoiner, SAS, … Cleaning: Trillium, … These typically take the warehouse off-line and typically do a refresh or a simple incremental load: e.g., Red Brick Table Management Utility, Prism.
146
Data Mining: Concepts and Techniques
Managing Metadata (architecture figure): warehouse design tools and the integration layer read and write the metadata repository, which sits alongside the warehouse between the sources and the query & analysis clients.
147
Data Mining: Concepts and Techniques
Metadata Administrative definition of sources, tools, ... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ... September 16, 2018 Data Mining: Concepts and Techniques
148
Data Mining: Concepts and Techniques
Metadata Business business terms & definition data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails September 16, 2018 Data Mining: Concepts and Techniques
149
Data Mining: Concepts and Techniques
Tools Development design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management “reliable scripts” for cleaning & analyzing data September 16, 2018 Data Mining: Concepts and Techniques
150
Data Mining: Concepts and Techniques
Tools - Products Management Tools HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager System & Network Management HP OpenView, IBM NetView, Tivoli September 16, 2018 Data Mining: Concepts and Techniques
151
Current State of Industry
Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything September 16, 2018 Data Mining: Concepts and Techniques
152
Data Mining: Concepts and Techniques
Future Directions Better performance Larger warehouses Easier to use What are companies & research labs working on? September 16, 2018 Data Mining: Concepts and Techniques
153
Data Mining: Concepts and Techniques
Research (1) Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush) September 16, 2018 Data Mining: Concepts and Techniques
154
Data Mining: Concepts and Techniques
Research (2) Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data September 16, 2018 Data Mining: Concepts and Techniques
155
Make warehouse self-maintainable
Add auxiliary tables to minimize update cost Original + auxiliary are self-maintainable E.g., an auxiliary table of all unsold catalog items Some updates may still be self-maintainable E.g., insert into catalog if item (the join attribute) is a key (figure: Items sold = Sales ⋈ Catalog)
156
Detection of self-maintainability
Most algorithms are at table level Most algorithms are compile-time Tuple level at runtime [Huyn 1996, 1997] Use state of tables and update to determine if self-maintainable E.g., check whether sale is for item previously sold
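A minimal sketch (not from the slides) of the tuple-level, runtime check described above, using plain Python sets; the relation and attribute names are illustrative only.

# Sketch: runtime check of whether an update is self-maintainable, i.e. whether
# the warehouse view Sold = sale JOIN catalog (on item) can be updated using
# only data already stored at the warehouse.

# Warehouse state: the materialized view plus the catalog prices already known
# for items that have been sold before.
sold = {("hat", "Sue", 12.0)}                 # (item, clerk, price)
known_prices = {item: price for item, _, price in sold}

def insert_sale_self_maintainable(item):
    """An insert into sale is self-maintainable if the item was sold before,
    because its catalog row (the price) is already known at the warehouse."""
    return item in known_prices

def apply_sale_insert(item, clerk):
    if insert_sale_self_maintainable(item):
        sold.add((item, clerk, known_prices[item]))   # maintained locally
        return "maintained locally"
    return "must query the catalog source"            # not self-maintainable

print(apply_sale_insert("hat", "Joe"))    # maintained locally
print(apply_sale_insert("scarf", "Ann"))  # must query the catalog source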
157
Warehouse maintenance
Current systems ignore integration of new data Or assume warehouse can be rebuilt periodically Depend on long “downtime” to regenerate warehouse Technology gap: continuous incremental maintenance
158
Maintenance research Change detection Data consistency Single table consistency Multiple table consistency Expiration of data Crash recovery
159
Snapshot change detection
Compare old & new snapshots Join-based algorithms Hash old data, probe with new Window algorithm Sliding window over snapshots Good for local changes
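A small sketch of the join-based approach above: hash the old snapshot by key, probe it with the new snapshot, and report inserts, deletes, and updates. The (key, payload) record layout is an illustrative assumption.

def snapshot_diff(old_rows, new_rows):
    old = {key: payload for key, payload in old_rows}   # hash the old snapshot
    inserts, updates = [], []
    for key, payload in new_rows:                        # probe with the new one
        if key not in old:
            inserts.append((key, payload))
        else:
            old_payload = old.pop(key)                   # matched: drop from hash table
            if old_payload != payload:
                updates.append((key, old_payload, payload))
    deletes = list(old.items())                          # keys never probed were deleted
    return inserts, deletes, updates

old = [(1, "hat"), (2, "scarf"), (3, "cap")]
new = [(1, "hat"), (3, "beret"), (4, "glove")]
print(snapshot_diff(old, new))
# ([(4, 'glove')], [(2, 'scarf')], [(3, 'cap', 'beret')])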
160
Integrated data consistency
Conventional maintenance is inadequate Sources report changes but: No locking, no global transactions (sources don’t communicate or coordinate with each other) Inconsistencies caused by interleaving of updates
161
Example anomaly Warehouse table Sold = catalog ⋈ sale ⋈ emp
Updates: insert into sale [hat, Sue]; delete from catalog [$12, hat]. Source tables: catalog(price, item) = {[$12, hat]}, sale(item, clerk) = {}, emp(clerk, age) = {[Sue, 26]}; warehouse view Sold(price, item, clerk, age)
162
Data Mining: Concepts and Techniques
Anomaly (2) The two updates interleave with the warehouse’s delta queries: insert into sale [hat, Sue] triggers Q1 = catalog ⋈ [hat, Sue], answered as A(Q1) = [$12, hat, Sue]; then Q2 = [$12, hat, Sue] ⋈ emp, answered as A(Q2) = [$12, hat, Sue, 26]. The concurrent delete from catalog [$12, hat] is effectively ignored, so the warehouse ends up with Sold = {[$12, hat, Sue, 26]} even though the catalog row no longer exists. September 16, 2018 Data Mining: Concepts and Techniques
163
Choices to deal with anomalies
Keep all relations in the DW (storage-expensive!) Run all queries as distributed (may not be feasible — legacy systems — and gives poor performance) Use specialized algorithms, e.g., the Eager Compensating Algorithm (ECA) or Strobe. September 16, 2018 Data Mining: Concepts and Techniques
164
Another anomaly example
V = π_{price,clerk}(catalog ⋈ sale); initially catalog = {[$12, hat]}, sale = {[hat, Sue]}, so V = {[$12, Sue]}. Delete(catalog[$12, hat]) triggers Q1 = π_{p,c}([$12, hat] ⋈ sale); Delete(sale[hat, Sue]) triggers Q2 = π_{p,c}(catalog ⋈ [hat, Sue]). Both queries are answered after both deletes have been applied at the source, so A(Q1) = ∅ and A(Q2) = ∅; V = V − ∅ − ∅ still contains [$12, Sue] — WRONG! (the view should now be empty) September 16, 2018 Data Mining: Concepts and Techniques
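A small simulation (not from the slides) that replays this deletion anomaly: with naive incremental maintenance, both delta queries reach the source only after both deletes have been applied, so nothing gets subtracted from V.

# Sketch of the deletion anomaly: V = project_{price,clerk}(catalog JOIN sale).
catalog = {(12, "hat")}            # (price, item)
sale = {("hat", "Sue")}            # (item, clerk)
V = {(12, "Sue")}                  # materialized view (price, clerk)

def join_project(cat, sal):
    return {(p, c) for (p, i) in cat for (i2, c) in sal if i == i2}

# Two source deletes happen back to back.
catalog.clear()                    # Delete(catalog[$12, hat])
sale.clear()                       # Delete(sale[hat, Sue])

# The warehouse now evaluates both delta queries against the CURRENT source state.
A_Q1 = join_project({(12, "hat")}, sale)        # delta for the catalog delete -> empty
A_Q2 = join_project(catalog, {("hat", "Sue")})  # delta for the sale delete    -> empty
V -= A_Q1
V -= A_Q2
print(V)   # {(12, 'Sue')}  -- WRONG: the view should be empty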
165
Yet another anomaly example
Depts = π_Dept(catalog ⋈ Store); initially catalog(Dept, City) = {[Shoes, NY]} and Store(City, Addr) = {}. Insert(Store[NY, Madison Ave]) triggers Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]); Insert(catalog[Bags, NY]) triggers Q2 = π_Dept([Bags, NY] ⋈ Store). Evaluated after both inserts, A(Q1) = {[Shoes], [Bags]} and A(Q2) = {[Bags]}, so the warehouse ends up with Depts = {Shoes, Bags, Bags} — a spurious duplicate. September 16, 2018 Data Mining: Concepts and Techniques
166
Eager Compensating Algorithm (ECA)
Principle: send compensating queries to offset the effect of concurrent updates ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED IN ONE NODE (ONE SOURCE). September 16, 2018 Data Mining: Concepts and Techniques
167
Anomaly example revisited (ECA)
Depts = π_Dept(catalog ⋈ Store), as before. With ECA, when Insert(catalog[Bags, NY]) arrives while Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]) is still unanswered, the warehouse sends the compensated query Q2 = π_Dept([Bags, NY] ⋈ Store) − π_Dept([Bags, NY] ⋈ [NY, Madison Ave]). Now A(Q1) = {[Shoes], [Bags]} and A(Q2) = ∅, so Depts = {Shoes, Bags} — the duplicate is avoided. September 16, 2018 Data Mining: Concepts and Techniques
168
Data Mining: Concepts and Techniques
ECA Algorithm SOURCE: S_upi: execute Ui; send Ui to the DW (trigger W_upi at the DW). S_qui: receive Qi; let Ai = Qi(ssi), where ssi is the current source state; send Ai to the DW (trigger W_ansi at the DW). DATA WAREHOUSE (DW): W_upi: receive Ui; Qi = V⟨Ui⟩ − Σ_{Qj ∈ UQS} Qj⟨Ui⟩; UQS = UQS + {Qi}; send Qi to S (trigger S_qui at S). W_ansi: receive Ai; COL = COL + Ai; UQS = UQS − {Qi}; if UQS = ∅ then MV = MV + COL and COL = ∅. (UQS = unanswered query set.) September 16, 2018 Data Mining: Concepts and Techniques
169
Data Mining: Concepts and Techniques
ECA-key Avoids the need for compensating queries. Necessary condition: the view contains key attributes for each of the base tables (e.g., star schema) September 16, 2018 Data Mining: Concepts and Techniques
170
Data Mining: Concepts and Techniques
Example of ECA-key UQS = UQS = {Q2} UQS = {Q1} UQS=UQS+{Q2}={Q1,Q2} COL = {[bags,Jane]} COL = {[bags,Jane], [bags,Sue]} COL = {[hat,Sue]} COL = Item clerk Sells bags Sue bagsJane Q1= i,d(catalog [hat,Jane]) Insert(catalog[bag,acc])) A1 = {[bag,Jane]} hat Sue Q2= i,c([bags,acc] emp) A(Q2) = {[bags,Sue],[bags,Jane]} Delete(catalog,[hat,acc]) bags acc Item dept. catalog hat acc bags acc Item dept. catalog hat acc Item dept. catalog Insert (sale[acc,Jane]) item clerk emp acc Sue acc Jane September 16, 2018 Data Mining: Concepts and Techniques
171
Strobe algorithm ideas
Apply actions only after a set of interleaving updates are all processed Wait for sources to quiesce Compensate effects of interleaved updates Subtract effects of later updates before installing changes Can combine these ideas STROBE IS A FAMILY OF ALGORITHMS
172
Data Mining: Concepts and Techniques
Strobe Terminology The materialized view MV is the current state of the view at the warehouse, V(ws). Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x, Qi), where x is the next source to contact and Qi is the portion of the query that can be answered by x. Example: if V = r1 ⋈ r2 ⋈ r3 and U is an update received from r2, then Q = r1 ⋈ U ⋈ r3 and next_source(Q) = (r1, r1 ⋈ U) September 16, 2018 Data Mining: Concepts and Techniques
173
Data Mining: Concepts and Techniques
Strobe terminology (2) Source_evaluate(Q): /* returns the answer to Q */ Begin i = 0; WQ = Q; A0 = Q; (x, Q1) ← next_source(WQ); While x is not nil do: let i = i + 1; send Qi to source x; when x returns Ai, let WQ = WQ⟨Ai⟩ (substitute the answer into the remaining query); let (x, Qi+1) ← next_source(WQ); Return(Ai); End September 16, 2018 Data Mining: Concepts and Techniques
174
Data Mining: Concepts and Techniques
Strobe Algorithm Source: after executing Ui, send Ui to the DW; when receiving query Qi, compute Ai over the current source state ss[x] and send Ai to the DW. DW: AL = ∅ (action list). When update Ui is received: if Ui is a deletion, then for every Qj ∈ UQS add Ui to pend(Qj), and add key_del(MV, Ui) to AL; if Ui is an insertion, then Qi = V⟨Ui⟩, pend(Qi) = ∅, Ai = source_evaluate(Qi); for every Uj ∈ pend(Qi) apply key_del(Ai, Uj); add insert(MV, Ai) to AL. When UQS = ∅, apply AL to MV as a single transaction, without adding duplicate tuples to MV, and reset AL. September 16, 2018 Data Mining: Concepts and Techniques
175
Example with Strobe
View Sold(price, item, clerk, age) = catalog ⋈ sale ⋈ emp; initially MV = ∅. U1 = Insert(sale, [hat, Sue]): the warehouse issues Q1 = catalog ⋈ [hat, Sue] ⋈ emp via source_evaluate; the first step Q11 = catalog ⋈ [hat, Sue] returns A11 = [$12, hat, Sue]. Meanwhile U2 = Del(catalog[$12, hat]) arrives, so key_del(MV, U2) is added to AL and U2 is added to pend(Q1). The second step Q12 = [$12, hat, Sue] ⋈ emp returns A12 = [$12, hat, Sue, 26]; applying key_del(A12, U2) leaves A1 = ∅, so nothing is added to AL for Q1. When UQS = ∅ the action list is applied and MV = ∅ — the correct result
176
Data Mining: Concepts and Techniques
Transaction-Strobe Source transaction T1 = {delete(sale, [hat, Sue]), insert(sale, [shoes, Jane])}. The warehouse collects the corresponding actions, AL = {del([hat, Sue]), ins([shoes, Jane])}, and applies them to MV as a single transaction, taking MV from {[hat, Sue]} to {[shoes, Jane]}. September 16, 2018 Data Mining: Concepts and Techniques
177
Multiple table consistency
More than 1 table at the warehouse Multiple tables share source data Updates at a source should be reflected in all warehouse tables at the same time Example: V1: Customer-info, V2: Cust-prefs, V3: Total-sales, derived from source tables Customer and Sales
178
Multiple table consistency
(figure: warehouse tables V1, V2, ..., Vn — single-table consistency — derived from sources S1, S2, ..., Sm — source consistency)
179
Painting algorithm Use merge process (MP) to coordinate sending updates to warehouse MP holds update actions for each table MP charts potential table states arising from each set of update actions MP sends batch of update actions together when tables will be consistent
180
Data Mining: Concepts and Techniques
Lecture #7 September 16, 2018 Data Mining: Concepts and Techniques
181
Chapter 7. Classification and Prediction
What is classification? What is prediction? Classification by decision tree induction Bayesian Classification Other Classification Methods (SVM) Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
182
Classification problem
Given: Tuples, each assigned a class label. Develop a model for each class Example: Good creditor: (age in [25,40]) AND (income > 50K) AND (status = MARRIED) Applications: Credit approval (good, bad) Store locations (good, fair, poor) Emergency situations (emergency, non-emergency) September 16, 2018 Data Mining: Concepts and Techniques
183
Classification vs. Prediction
Classification: predicts categorical class labels; classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data Prediction: models continuous-valued functions, i.e., predicts unknown or missing values Typical Applications credit approval target marketing medical diagnosis treatment effectiveness analysis September 16, 2018 Data Mining: Concepts and Techniques
184
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute The set of tuples used for model construction: training set The model is represented as classification rules, decision trees, or mathematical formulae Model usage: for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model Test set is independent of training set, otherwise over-fitting will occur September 16, 2018 Data Mining: Concepts and Techniques
185
Supervised vs. Unsupervised Learning
Supervised learning (classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of the training data are unknown Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data September 16, 2018 Data Mining: Concepts and Techniques
186
Chapter 7. Classification and Prediction
What is classification? What is prediction? Classification by decision tree induction Bayesian Classification Other Classification Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
187
Classification by Decision Tree Induction
A flow-chart-like tree structure Internal node denotes a test on an attribute Branch represents an outcome of the test Leaf nodes represent class labels or class distribution Decision tree generation consists of two phases Tree construction At start, all the training examples are at the root Partition examples recursively based on selected attributes Tree pruning Identify and remove branches that reflect noise or outliers Use of decision tree: Classifying an unknown sample Test the attribute values of the sample against the decision tree September 16, 2018 Data Mining: Concepts and Techniques
188
Data Mining: Concepts and Techniques
Training Dataset This follows an example from Quinlan’s ID3 September 16, 2018 Data Mining: Concepts and Techniques
189
Output: A Decision Tree for “buys_computer”
age? <=30 → student? (no → no, yes → yes); 31..40 → yes; >40 → credit rating? (excellent → no, fair → yes) September 16, 2018 Data Mining: Concepts and Techniques
190
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf There are no samples left September 16, 2018 Data Mining: Concepts and Techniques
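A compact sketch of this greedy, top-down procedure (not taken from the slides): categorical attributes, information gain as the selection measure, and majority voting at impure leaves. The toy tuples and attribute names are illustrative.

# Minimal ID3-style sketch: top-down recursive divide-and-conquer induction.
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all samples in the same class
        return labels[0]
    if not attrs or not rows:                 # no attributes / no samples left
        return Counter(labels).most_common(1)[0][0]   # majority voting
    def gain(a):
        split = {}
        for row, y in zip(rows, labels):
            split.setdefault(row[a], []).append(y)
        return info(labels) - sum(len(s) / len(labels) * info(s) for s in split.values())
    best = max(attrs, key=gain)               # attribute with highest information gain
    node = {}
    for value in {row[best] for row in rows}: # partition examples recursively
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        node[(best, value)] = build_tree([r for r, _ in sub], [y for _, y in sub],
                                         [a for a in attrs if a != best])
    return node

rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31..40", "student": "no"}, {"age": ">40", "student": "no"}]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, ["age", "student"]))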
191
Data Mining: Concepts and Techniques
Decision trees Training set (figure: a small decision tree fitted to the training set, splitting first on Salary < 20000 and then on Education = G, with leaf classes A and R) September 16, 2018 Data Mining: Concepts and Techniques
192
Data Mining: Concepts and Techniques
Decision trees Pros: fast; rules are easy to interpret; handles high-dimensional data. Cons: does not model correlations between attributes; only axis-parallel cuts. September 16, 2018 Data Mining: Concepts and Techniques
193
Data Mining: Concepts and Techniques
Decision trees (cont.) Machine learning: ID3 (Quinlan, 1986) C4.5 (Quinlan, 1993) CART (Breiman, Friedman, Olshen, Stone, Classification and Regression Trees, 1984) Database: SLIQ (Mehta, Agrawal and Rissanen, EDBT’96) SPRINT (Shafer, Agrawal, Mehta, VLDB’96) RainForest (Gehrke, Ramakrishnan, Ganti, VLDB’98) September 16, 2018 Data Mining: Concepts and Techniques
194
Data Mining: Concepts and Techniques
Decision trees Finding the best tree is NP-Hard We look at non-backtracking algorithms (never look back at a previous decision) Assume we have a test with n outcomes that partitions T into subsets T1, T2,…, Tn If the test is to be evaluated without exploring subsequent dimensions of the Ti’s, the only information available for guidance is the distribution of classes in T and its subsets. September 16, 2018 Data Mining: Concepts and Techniques
195
Decision tree algorithms
Building phase: Recursively split nodes using best splitting attribute and value for node Pruning phase: Smaller (yet imperfect) tree achieves better prediction accuracy. Prune leaf nodes recursively to avoid over-fitting. September 16, 2018 Data Mining: Concepts and Techniques
196
Predictor variables (attributes)
Numerically ordered: values are ordered and can be represented on the real line. (E.g., salary.) Categorical: takes values from a finite set without any natural ordering. (E.g., color.) Ordinal: takes values from a finite set whose values possess a clear ordering, but the distances between them are unknown. (E.g., preference scale: good, fair, bad.) September 16, 2018 Data Mining: Concepts and Techniques
197
Data Mining: Concepts and Techniques
Binary Splits Recursive (binary) partitioning Univariate split: on numerically ordered or ordinal X, X <= c; on categorical X, X ∈ A Linear combination split on numerical attributes: Σ ai Xi <= c c and A are chosen to maximize separation. September 16, 2018 Data Mining: Concepts and Techniques
198
Data Mining: Concepts and Techniques
Some probability... S = a set of cases; freq(Ci, S) = # cases in S that belong to class Ci Gain is an entropic measure: Prob(“this case belongs to Ci”) = freq(Ci, S)/|S| Information conveyed: -log(freq(Ci, S)/|S|) Entropy = expected information = -Σi (freq(Ci, S)/|S|) log(freq(Ci, S)/|S|) = info(S) September 16, 2018 Data Mining: Concepts and Techniques
199
Data Mining: Concepts and Techniques
Gain Test X splits T into subsets T1, ..., Tn: infoX(T) = Σi (|Ti|/|T|) · info(Ti) gain(X) = info(T) - infoX(T) September 16, 2018 Data Mining: Concepts and Techniques
200
Data Mining: Concepts and Techniques
Example Info(T): 9 play, 5 don’t → info(T) = -9/14 log(9/14) - 5/14 log(5/14) = 0.94 (bits) Test outlook: infoOutlook = 5/14 (-2/5 log(2/5) - 3/5 log(3/5)) + 4/14 (-4/4 log(4/4)) + 5/14 (-3/5 log(3/5) - 2/5 log(2/5)); gainOutlook = 0.94 - 0.662 = 0.278 Test windy: infoWindy = 7/14 (-4/7 log(4/7) - 3/7 log(3/7)) + 7/14 (-5/7 log(5/7) - 2/7 log(2/7)); gainWindy = 0.94 - 0.64 = 0.3 (bits) Windy is a better test September 16, 2018 Data Mining: Concepts and Techniques
201
Data Mining: Concepts and Techniques
Problem with Gain Strong bias towards tests with many outcomes. Example: Z = Name, so each |Ti| = 1 (each name is unique) infoZ(T) = Σi (1/|T|) · info(Ti) = 0 → maximal gain!! (but a useless division — overfitting) September 16, 2018 Data Mining: Concepts and Techniques
202
Data Mining: Concepts and Techniques
Split Split-info(X) = -Σi (|Ti|/|T|) log(|Ti|/|T|) gain-ratio(X) = gain(X)/split-info(X) gain(X) <= log(k) while split-info(X) can reach log(n), so for tests with many outcomes the ratio becomes small September 16, 2018 Data Mining: Concepts and Techniques
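A short sketch of gain and gain ratio as defined above, using base-2 logs. The candidate test (splitting 14 cases with 9/5 class counts into subsets of 5, 4, and 5 cases) is illustrative.

from math import log2

def info(class_counts):
    n = sum(class_counts)
    return -sum(c / n * log2(c / n) for c in class_counts if c)

def gain_and_ratio(total_counts, partition_counts):
    n = sum(total_counts)
    info_x = sum(sum(p) / n * info(p) for p in partition_counts)          # infoX(T)
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partition_counts)
    g = info(total_counts) - info_x                                       # gain(X)
    return g, g / split_info                                              # gain-ratio(X)

# 9 "play" vs 5 "don't"; a 3-way split with per-subset class counts (2,3), (4,0), (3,2)
print(gain_and_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))
# (~0.247, ~0.156)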
203
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes” September 16, 2018 Data Mining: Concepts and Techniques
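A small sketch of this rule extraction, one IF-THEN rule per root-to-leaf path. The nested-dict tree encoding {(attribute, value): subtree-or-label} is an assumption, matching the induction sketch shown earlier.

def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):                      # leaf: emit one rule
        ants = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {ants} THEN class = "{tree}"']
    rules = []
    for (attr, value), subtree in tree.items():         # extend the conjunction
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

tree = {("age", "<=30"): {("student", "no"): "no", ("student", "yes"): "yes"},
        ("age", "31..40"): "yes",
        ("age", ">40"): {("credit_rating", "excellent"): "no",
                         ("credit_rating", "fair"): "yes"}}
for rule in tree_to_rules(tree):
    print(rule)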
204
Data Mining: Concepts and Techniques
OVERFITTING Decision trees can grow so long that there is a leaf for each training example. Extremes: Overfitted: “Whatever I haven’t seen can’t be classified” Too General: “If it is green, it is a tree” September 16, 2018 Data Mining: Concepts and Techniques
205
Avoid Overfitting in Classification
The generated tree may overfit the training data Too many branches, some may reflect anomalies due to noise or outliers Result is in poor accuracy for unseen samples Two approaches to avoid overfitting Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold Difficult to choose an appropriate threshold Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees Use a set of data different from the training data to decide which is the “best pruned tree” September 16, 2018 Data Mining: Concepts and Techniques
206
Approaches to Determine the Final Tree Size
Separate training (2/3) and testing (1/3) sets Use cross validation, e.g., 10-fold cross validation Use all the data for training but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution Use minimum description length (MDL) principle: halting growth of the tree when the encoding is minimized September 16, 2018 Data Mining: Concepts and Techniques
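A sketch of k-fold cross-validation as listed above (not from the slides): split the data into k folds, train on k-1 folds, test on the held-out fold, and average the accuracy. The classifier here is a stand-in (a majority-class predictor) just to make the skeleton runnable.

import random
from collections import Counter

def k_fold_accuracy(rows, labels, train, classify, k=10, seed=0):
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]                 # k roughly equal folds
    accs = []
    for test_idx in folds:
        test = set(test_idx)
        train_idx = [i for i in idx if i not in test]
        model = train([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(classify(model, rows[i]) == labels[i] for i in test_idx)
        accs.append(hits / len(test_idx))
    return sum(accs) / k

# Trivial "classifier": always predict the majority class of the training fold.
train = lambda X, y: Counter(y).most_common(1)[0][0]
classify = lambda model, x: model
rows = list(range(20)); labels = ["yes"] * 14 + ["no"] * 6
print(k_fold_accuracy(rows, labels, train, classify))     # estimated accuracy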
207
Enhancements to basic decision tree induction
Allow for continuous-valued attributes Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals Handle missing attribute values Assign the most common value of the attribute Assign probability to each of the possible values Attribute construction Create new attributes based on existing ones that are sparsely represented This reduces fragmentation, repetition, and replication September 16, 2018 Data Mining: Concepts and Techniques
208
Classification in Large Databases
Classification—a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods September 16, 2018 Data Mining: Concepts and Techniques
209
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.) builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT (VLDB’96 — J. Shafer et al.) constructs an attribute list data structure PUBLIC (VLDB’98 — Rastogi & Shim) integrates tree splitting and tree pruning: stop growing the tree earlier RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree builds an AVC-list (attribute, value, class label) September 16, 2018 Data Mining: Concepts and Techniques
210
Data Mining: Concepts and Techniques
SPRINT For large data sets. (figure: example tree splitting on Age < 25 and then on Car = Sports, with leaf risk classes H, H, and L) September 16, 2018 Data Mining: Concepts and Techniques
211
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σj pj², where pj is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is ginisplit(T) = (N1/N) gini(T1) + (N2/N) gini(T2). The attribute that provides the smallest ginisplit(T) is chosen to split the node (all possible splitting points must be enumerated for each attribute). September 16, 2018 Data Mining: Concepts and Techniques
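A direct sketch of the two formulas above. The class-count inputs are illustrative: a split of 6 tuples into subsets with per-class counts (2, 0) and (2, 2), matching the sports split in the later example.

def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)   # gini(T) = 1 - sum pj^2

def gini_split(partition_counts):
    n = sum(sum(p) for p in partition_counts)
    return sum(sum(p) / n * gini(p) for p in partition_counts)

print(gini([2, 0]))                     # 0.0   (pure subset)
print(gini([2, 2]))                     # 0.5
print(gini_split([[2, 0], [2, 2]]))     # 2/6 * 0.0 + 4/6 * 0.5 = 0.333...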
212
Data Mining: Concepts and Techniques
SPRINT Partition(S): if all points of S are in the same class, return; else: for each attribute A, evaluate_splits on A; use the best split to partition S into S1 and S2; Partition(S1); Partition(S2) September 16, 2018 Data Mining: Concepts and Techniques
213
SPRINT Data Structures
Training set Age Car Attribute lists September 16, 2018 Data Mining: Concepts and Techniques
214
Data Mining: Concepts and Techniques
Splits (figure: splitting on Age < 27.5 divides the attribute lists into Group 1 and Group 2) September 16, 2018 Data Mining: Concepts and Techniques
215
Data Mining: Concepts and Techniques
Histograms For continuous attributes, two histograms are associated with each node: Cbelow for the tuples already processed and Cabove for the tuples still to process September 16, 2018 Data Mining: Concepts and Techniques
216
Data Mining: Concepts and Techniques
Example ginisplit3 =3/6 gini(S1) +3/6 gini(S2) gini(S1) = 1 - [(3/3) 2 ] = 0 gini(S2) = 1 - [(1/3)2 +(2/3)2 ] = 0.444 ginisplit0 = 0/6 gini(S1) + 6/6 gini(S2) gini(S2) = 1 - [(4/6)2 +(2/6)2 ] = 0.444 ginisplit2 = 2/6 gini(S1) +4/6 gini(S2) gini(S1) = 1 - [(2/2) 2 ] = 0 gini(S2) = 1 - [(2/4)2 +(2/4)2 ] = 0.5 ginisplit5 =6/6 gini(S1) +0/6 gini(S2) gini(S1) = 1 - [(4/6) 2 +(2/6) 2 ] = 0.320 ginisplit4 =4/6 gini(S1) +2/6 gini(S2) gini(S1) = 1 - [(3/4) 2 +(1/4) 2 ] = 0.375 gini(S2) = 1 - [(1/2)2 +(1/2)2 ] = 0.5 ginisplit5 =5/6 gini(S1) +1/6 gini(S2) gini(S1) = 1 - [(4/5) 2 +(1/5) 2 ] = 0.320 gini(S2) = 1 - [(1/1)2 ] = 0 ginisplit1 = 1/6 gini(S1) +5/6 gini(S2) gini(S1) = 1 - [(1/1) 2 ] = 0 gini(S2) = 1 - [(3/4)2 +(2/4)2 ] = ginisplit0 = 0.444 ginisplit1= 0.156 ginisplit2= 0.333 ginisplit3= 0.222 ginisplit4= 0.416 ginisplit5= 0.222 Age <= 18.5 ginisplit6= 0.444 September 16, 2018 Data Mining: Concepts and Techniques
217
Splitting categorical attributes
Single scan through the attribute list collecting counts on count matrix for each combination of class label + attribute value September 16, 2018 Data Mining: Concepts and Techniques
218
Data Mining: Concepts and Techniques
Example ginisplit(family) = 3/6 gini(S1) + 3/6 gini(S2); gini(S1) = 1 - [(2/3)² + (1/3)²] = 4/9; gini(S2) = 1 - [(2/3)² + (1/3)²] = 4/9 ginisplit(sports) = 2/6 gini(S1) + 4/6 gini(S2); gini(S1) = 1 - [(2/2)²] = 0; gini(S2) = 1 - [(2/4)² + (2/4)²] = 0.5 ginisplit(truck) = 1/6 gini(S1) + 5/6 gini(S2); gini(S1) = 1 - [(1/1)²] = 0; gini(S2) = 1 - [(4/5)² + (1/5)²] = 0.32 ginisplit(family) = 0.444, ginisplit(sports) = 0.333, ginisplit(truck) = 0.266 → best categorical split: Car Type = Truck September 16, 2018 Data Mining: Concepts and Techniques
219
Data Mining: Concepts and Techniques
Example (2 attributes) The winner is Age <= 18.5 (figure: root node split Age <= 18.5 with Y/N branches; the Y branch is a leaf of class H) September 16, 2018 Data Mining: Concepts and Techniques
220
Data Mining: Concepts and Techniques
Performing the split Create 2 child nodes Split the attribute list of the winning attribute For the remaining attributes: insert tuple ids into a hash table recording which child each tuple belongs to, then scan the other attribute lists and probe the hash table (the table may be too large for memory and require several passes). September 16, 2018 Data Mining: Concepts and Techniques
221
Data Mining: Concepts and Techniques
Drawbacks Large explosion of space (possibly tripling the size of database). Costly Hash-Join. September 16, 2018 Data Mining: Concepts and Techniques
222
Chapter 7. Classification and Prediction
What is classification? What is prediction? Classification by decision tree induction Bayesian Classification Other methods (SVM) Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
223
Data Mining: Concepts and Techniques
Bayesian Theorem Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes’ theorem: P(h|D) = P(D|h) P(h) / P(D) MAP (maximum a posteriori) hypothesis: hMAP = argmaxh P(h|D) = argmaxh P(D|h) P(h) Practical difficulty: requires initial knowledge of many probabilities, significant computational cost September 16, 2018 Data Mining: Concepts and Techniques
224
Naïve Bayes Classifier (I)
A simplified assumption: attributes are conditionally independent given the class: P(X|Ci) = Πk P(xk|Ci) Greatly reduces the computation cost: only the class distribution and per-attribute counts are needed. September 16, 2018 Data Mining: Concepts and Techniques
225
Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities September 16, 2018 Data Mining: Concepts and Techniques
226
Data Mining: Concepts and Techniques
Example E = {outlook = sunny, temp = [64,70], humidity = [65,70], windy = y} = {E1, E2, E3, E4} Pr[“Play”|E] = (Pr[E1|Play] × Pr[E2|Play] × Pr[E3|Play] × Pr[E4|Play] × Pr[Play]) / Pr[E] = (3/9 × 2/9 × 3/9 × 4/9 × 9/14) / Pr[E] = 0.007 / Pr[E] Pr[“Don’t”|E] = (3/5 × 2/5 × 1/5 × 3/5 × 5/14) / Pr[E] = 0.010 / Pr[E] Normalizing over the two classes: Pr[“Play”|E] = 41%, Pr[“Don’t”|E] = 59% September 16, 2018 Data Mining: Concepts and Techniques
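A few lines reproducing the computation above: multiply the per-attribute conditional probabilities by the class prior, then normalize over the two classes (Pr[E] cancels out).

def naive_bayes_score(cond_probs, prior):
    score = prior
    for p in cond_probs:
        score *= p
    return score

play = naive_bayes_score([3/9, 2/9, 3/9, 4/9], 9/14)   # ~0.007 (before dividing by Pr[E])
dont = naive_bayes_score([3/5, 2/5, 1/5, 3/5], 5/14)   # ~0.010
total = play + dont                                     # normalization makes Pr[E] cancel
print(round(play / total, 2), round(dont / total, 2))   # 0.41 0.59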
227
Bayesian Belief Networks (I)
Bayesian Belief Networks Nodes: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea (figure: the network DAG). The conditional probability table for the variable LungCancer (columns are the parent configurations):
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)
LC 0.8 0.5 0.7 0.1
~LC 0.2 0.5 0.3 0.9
September 16, 2018 Data Mining: Concepts and Techniques
228
Bayesian Belief Networks (II)
A Bayesian belief network allows conditional independencies to hold between subsets of the variables A graphical model of causal relationships Several cases of learning Bayesian belief networks: Given both the network structure and all the variables: easy Given the network structure but only some of the variables When the network structure is not known in advance September 16, 2018 Data Mining: Concepts and Techniques
229
Another Example (Friedman & Goldzsmidt)
Variables: Burglary, Earthquake, Alarm, Neighbor call, Radio announcement. Burglary and Earthquake are independent (P(B, E) = P(B) · P(E)) Burglary and Radio announcement are independent given Earthquake (P(B, R | E) = P(B | E) · P(R | E)) So P(A, R, E, B) = P(A | R, E, B) · P(R | E, B) · P(E | B) · P(B) can be reduced to: P(A, R, E, B) = P(A | E, B) · P(R | E) · P(E) · P(B) September 16, 2018 Data Mining: Concepts and Techniques
230
Data Mining: Concepts and Techniques
Example (cont.) (figure: the network DAG — Burglary → Alarm, Earthquake → Alarm, Earthquake → Radio announcement, Alarm → Neighbor call) September 16, 2018 Data Mining: Concepts and Techniques
231
Data Mining: Concepts and Techniques
Example (cont.) Associated with each node is a set of conditional probability distributions. For example, the "Alarm" node might have the following probability distribution September 16, 2018 Data Mining: Concepts and Techniques
232
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Other Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
233
Extending linear classification
Problem: all the algorithms we covered (plus many others) can only represent linear boundaries between classes (e.g., a single cut such as Age <= 25 vs. Age > 25) Too simplistic for many real cases September 16, 2018 Data Mining: Concepts and Techniques
234
Nonlinear class boundaries
Support vector machines (SVM) -- a misnomer, since they are algorithms, not machines -- Idea: use a non-linear mapping to transform the space into a new space. Example (two attributes a1, a2): x = w1 a1³ + w2 a1² a2 + w3 a1 a2² + w4 a2³ September 16, 2018 Data Mining: Concepts and Techniques
235
Data Mining: Concepts and Techniques
SVMs Based on an algorithm that finds a maximum margin hyperplane (linear model). (figure: the convex hulls of the two classes — the tightest enclosing polygons — the shortest line connecting the hulls, the maximum margin hyperplane, and the support vectors) September 16, 2018 Data Mining: Concepts and Techniques
236
Data Mining: Concepts and Techniques
SVMs (cont.) We have assumed that the two classes are linearly separable, so their convex hulls cannot overlap. The maximum margin hyperplane (MMH) is the one that is as far away as possible from both convex hulls; it is orthogonal to the shortest line connecting the hulls. The instances closest to the MMH (minimum distance to the line) are called support vectors (SVs) — at least one for each class, often more. Given the SVs, we can easily construct the MMH. All other training points can be deleted without any effect on the MMH. September 16, 2018 Data Mining: Concepts and Techniques
237
Data Mining: Concepts and Techniques
SVMs (cont.) A hyperplane that separates the two classes can be written as x = w0 + w1 a1 + w2 a2 for a two-attribute case. However, the equation that defines the MMH can be written in terms of the SVs. Write the class value y of a training instance (point) as 1 (yes) or -1 (no). Then the MMH is: x = b + Σ_{i ∈ SVs} αi yi a(i) · a, where yi is the class value of support vector a(i); b and the αi are numeric values to be determined; a is a test point. September 16, 2018 Data Mining: Concepts and Techniques
238
Data Mining: Concepts and Techniques
SVMs (cont.) So, now… use the training values to determine b and the αi in x = b + Σ_{i ∈ SVs} αi yi a(i) · a (a(i) · a is a dot product). This is a standard optimization problem: constrained quadratic optimization (off-the-shelf software packages exist to solve it: Fletcher, Practical Methods of Optimization, 1987). September 16, 2018 Data Mining: Concepts and Techniques
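A usage sketch assuming scikit-learn is available (the slides do not name a library): a degree-3 polynomial kernel plays the role of the nonlinear mapping, the solver handles the quadratic optimization, and the fitted model exposes its support vectors. The synthetic data and parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)   # a nonlinear (circular) class boundary

# Polynomial kernel of degree 3; coef0=1 includes the lower-degree terms as well.
clf = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)
clf.fit(X, y)

print(len(clf.support_vectors_))          # the instances that define the MMH
print(clf.score(X, y))                    # training accuracy
print(clf.predict([[0.1, 0.1], [0.9, 0.8]]))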
239
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Other Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
240
Classification Accuracy: Estimating Error Rates
Partition: Training-and-testing use two independent data sets, e.g., training set (2/3), test set(1/3) used for data set with large number of samples Cross-validation divide the data set into k subsamples use k-1 subsamples as training data and one sub-sample as test data --- k-fold cross-validation for data set with moderate size Bootstrapping (leave-one-out) for small size data September 16, 2018 Data Mining: Concepts and Techniques
241
Data Mining: Concepts and Techniques
Boosting and Bagging Boosting increases classification accuracy Applicable to decision trees or Bayesian classifier Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor Boosting requires only linear time and constant space September 16, 2018 Data Mining: Concepts and Techniques
242
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
243
Data Mining: Concepts and Techniques
What Is Prediction? Prediction is similar to classification First, construct a model Second, use the model to predict unknown values The major method for prediction is regression Linear and multiple regression Non-linear regression Prediction is different from classification Classification predicts a categorical class label Prediction models continuous-valued functions September 16, 2018 Data Mining: Concepts and Techniques
244
Predictive Modeling in Databases
Predictive modeling: Predict data values or construct generalized linear models based on the database data. One can only predict value ranges or category distributions Method outline: Minimal generalization Attribute relevance analysis Generalized linear model construction Prediction Determine the major factors which influence the prediction Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc. Multi-level prediction: drill-down and roll-up analysis September 16, 2018 Data Mining: Concepts and Techniques
245
Regress Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + β X. The two parameters, α and β, specify the line and are to be estimated from the data at hand, applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …. Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into the above. Log-linear models: The multi-way table of joint probabilities is approximated by a product of lower-order tables. Probability: p(a, b, c, d) = αab βac χad δbcd September 16, 2018 Data Mining: Concepts and Techniques
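A short sketch of fitting Y = α + β X with the least squares criterion, in closed form, plus a numpy cross-check. The data points are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Closed-form least squares estimates of the slope (beta) and intercept (alpha).
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)

# Cross-check with numpy's degree-1 polynomial fit (returns [beta, alpha]).
print(np.polyfit(x, y, 1))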
246
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Other Classification Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
247
Data Mining: Concepts and Techniques
Summary Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks) Classification is probably one of the most widely used data mining techniques with a lot of extensions Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.. September 16, 2018 Data Mining: Concepts and Techniques
248
Data Warehousing Design And Implementation
Data Warehousing Design And Implementation Yong Ye September 16, 2018 Data Mining: Concepts and Techniques
249
Data Mining: Concepts and Techniques
Outline Conceptual design Business requirement, scope of application Logical design Define the types of information you need Physical design Creation of the data warehouse with SQL statements September 16, 2018 Data Mining: Concepts and Techniques
250
Designing Data Warehouses
To begin a data warehouse project, need to find answers for questions such as: Which user requirements are most important and which data should be considered first? Should project be scaled down into something more manageable? Should infrastructure for a scaled down project be capable of ultimately delivering a full-scale enterprise-wide data warehouse? September 16, 2018 Data Mining: Concepts and Techniques
251
Designing Data Warehouses
For many enterprises, the way to avoid the complexities associated with designing a data warehouse is to start by building one or more data marts. Data marts allow designers to build something that is far simpler and achievable for a specific group of users. September 16, 2018 Data Mining: Concepts and Techniques
252
Designing Data Warehouses
The requirements collection and analysis stage of a data warehouse project involves interviewing appropriate members of staff (such as marketing users, finance users, and sales users) to enable identification of a prioritized set of requirements that the data warehouse must meet. September 16, 2018 Data Mining: Concepts and Techniques
253
Designing Data Warehouses
At the same time, interviews are conducted with members of staff responsible for operational systems to identify which data sources can provide clean, valid, and consistent data that will remain supported over the next few years. September 16, 2018 Data Mining: Concepts and Techniques
254
Designing Data Warehouses
Architecture of a data warehouse September 16, 2018 Data Mining: Concepts and Techniques
255
Design Methodology for Data Warehouses
Four steps: Choosing a business process to model Choosing the grain Identifying the dimensions Choosing the measure September 16, 2018 Data Mining: Concepts and Techniques
256
Step 1: Choosing The Process
The process (function) refers to the subject matter of a particular data mart. First data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions. September 16, 2018 Data Mining: Concepts and Techniques
257
Step 2: Choosing The Grain
Decide what a record of the fact table is to represent. Also include time as a core dimension, which is always present in star schemas. September 16, 2018 Data Mining: Concepts and Techniques
258
Step 3: Identifying and Conforming the Dimensions
Dimensions set the context for asking questions about the facts in the fact table. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. September 16, 2018 Data Mining: Concepts and Techniques
259
Step 4: Choosing The Measure of Facts
Typical measures are numeric additive quantities like dollars_sold and unit_sold. September 16, 2018 Data Mining: Concepts and Techniques
260
Fact and Dimension Tables for each Business Process of DreamHome
September 16, 2018 Data Mining: Concepts and Techniques
261
Data Mining: Concepts and Techniques
Physical Design Translate schemas into actual database structures Entities to tables Relationships to foreign key constraints Attributes to columns Primary unique identifiers to primary key constraints September 16, 2018 Data Mining: Concepts and Techniques
262
Data Mining: Concepts and Techniques
Physical Design Most critical physical design issues affecting the end-user’s perception includes: physical sort order of the fact table on disk; presence of pre-stored summaries or aggregations. Indexing September 16, 2018 Data Mining: Concepts and Techniques
263
Data Mining: Concepts and Techniques
Physical Design Dimension Table:
create table customer (
  csid   varchar(30),
  cname  varchar(20) not null,
  gender varchar(10),
  primary key (csid)
);
September 16, 2018 Data Mining: Concepts and Techniques
264
Data Mining: Concepts and Techniques
Physical Design Fact Table:
create table Sales (
  customer_id varchar(30),
  product_id  varchar(50),
  store_id    varchar(50),
  date_id     varchar(50),
  unit_sold   real,
  unit_price  real,
  total_price real,
  primary key (customer_id, product_id, store_id, date_id),
  foreign key (customer_id) references customer (csid),
  foreign key (product_id) references products (pid),
  foreign key (store_id) references store (name),
  foreign key (date_id) references time (tid)
);
September 16, 2018 Data Mining: Concepts and Techniques
265
Data Mining: Concepts and Techniques
Physical Design Dimension - A dimension is a schema object that defines hierarchical relationships between columns or column sets.
CREATE DIMENSION time_dim
  LEVEL day   IS time.tid
  LEVEL month IS time.month
  LEVEL year  IS time.year
  HIERARCHY cal_rollup (
    day CHILD OF
    month CHILD OF
    year
  );
September 16, 2018 Data Mining: Concepts and Techniques
266
Data Mining: Concepts and Techniques
Physical Design Materialized View - you can use materialized views to precompute and store aggregated data such as the sum of sales.
CREATE MATERIALIZED VIEW product_sales AS
  SELECT p.pname, SUM(s.unit_sold) AS totalsale
  FROM sales s, products p
  WHERE s.product_id = p.pid
  GROUP BY p.pname;
Bitmap Index – good for a low-cardinality column.
create bitmap index customer_gender on customer (gender);
September 16, 2018 Data Mining: Concepts and Techniques