1
Data Mining: Concepts and Techniques
What is a Warehouse? Defined in many different ways, but not rigorously. A decision support database that is maintained separately from the organization's operational database. Supports information processing by providing a solid platform of consolidated, historical data for analysis. "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." —W. H. Inmon. Data warehousing: the process of constructing and using data warehouses.
2
Cube: A Lattice of Cuboids
Lattice of cuboids (figure): 0-D (apex) cuboid: all; 1-D cuboids: time, item, location, supplier; 2-D cuboids: (time,item), (time,location), (item,location), (location,supplier), (time,supplier), (item,supplier); 3-D cuboids: (time,location,supplier), (time,item,location), (time,item,supplier), (item,location,supplier); 4-D (base) cuboid: (time, item, location, supplier).
3
Typical OLAP Operations
Roll-up (drill-up): summarize data by climbing up a concept hierarchy or by dimension reduction. Drill-down (roll-down): the reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions. Slice and dice: project and select. Pivot (rotate): reorient the cube for visualization, e.g., turning a 3-D cube into a series of 2-D planes. Other operations — drill-across: involves more than one fact table; drill-through: through the bottom level of the cube to its back-end relational tables (using SQL). A small illustration follows below.
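A minimal sketch of roll-up and slice on a tiny fact table, using pandas; the column names and values (quarter, item, location, sales) are made up for illustration and are not from the original slides.

import pandas as pd

# Hypothetical fact table: one row per (quarter, item, location) with a sales measure.
fact = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "item":     ["home ent.", "computer", "home ent.", "computer"],
    "location": ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "sales":    [605, 825, 680, 952],
})

# Roll-up: climb the location dimension all the way up (dimension reduction),
# aggregating sales over the remaining dimensions.
rollup = fact.groupby(["quarter", "item"], as_index=False)["sales"].sum()

# Slice: select a single value on one dimension (here quarter = "Q1").
slice_q1 = fact[fact["quarter"] == "Q1"]

print(rollup)
print(slice_q1)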
4
MOLAP versus ROLAP
MOLAP (Multidimensional OLAP): data stored in a multi-dimensional cube; transformation required; data retrieved directly from the cube for analysis; faster analytical processing; cube size limitations.
ROLAP (Relational OLAP): data stored in a relational database as a virtual cube; no transformation needed; data retrieved via SQL from the database for analysis; slower analytical processing; no size limitations.
5
Data Mining: Concepts and Techniques
An OLAM Architecture (figure): Layer 4 — user interface (mining queries in, mining results out, through a user GUI API); Layer 3 — OLAP/OLAM (an OLAM engine and an OLAP engine sharing a data cube API); Layer 2 — multidimensional database (MDDB) plus metadata, accessed through a database API with filtering and integration; Layer 1 — data repository (data warehouse and databases, populated by data cleaning and data integration).
6
Data Mining: Concepts and Techniques
Summary Data warehouse A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process A multi-dimensional model of a data warehouse Star schema, snowflake schema, fact constellations A data cube consists of dimensions & measures OLAP operations: drilling, rolling, slicing, dicing and pivoting OLAP servers: ROLAP, MOLAP, HOLAP From OLAP to OLAM September 16, 2018 Data Mining: Concepts and Techniques
7
Data Mining: Concepts and Techniques
Lecture #2 September 16, 2018 Data Mining: Concepts and Techniques
8
Data Mining: Concepts and Techniques
Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary September 16, 2018 Data Mining: Concepts and Techniques
9
Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view: Accuracy Completeness Consistency Timeliness Believability Value added Interpretability Accessibility Broad categories: intrinsic, contextual, representational, and accessibility. September 16, 2018 Data Mining: Concepts and Techniques
10
Major Tasks in Data Preprocessing
Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data September 16, 2018 Data Mining: Concepts and Techniques
11
Data Cleaning Problems --- see figure 2 of data cleaning paper
Data quality problems (figure): Single-source — schema level (poor schema design, e.g., uniqueness violations) and instance level (data entry errors, e.g., misspellings); Multi-source — schema level (heterogeneity, e.g., naming conflicts) and instance level (overlapping and contradicting data, e.g., inconsistent aggregation).
12
Data Mining: Concepts and Techniques
Data Analysis Data profiling Focuses on the instance analysis of individual attributes Derives information such as data type, length, value range, discrete values, frequency, variance, uniqueness, occurrence of null values, typical string pattern (e.g., phone numbers, zip codes), providing a view of quality aspects of the attribute September 16, 2018 Data Mining: Concepts and Techniques
13
Data Mining: Concepts and Techniques
Data analysis and DM Data Mining Discover patterns in the data Integrity constraints among attributes can be derived “business rules” With the rules at hand, one can find exceptions which may be suspicious (candidates for cleaning) E.g. Discovered rule “total = quantity*unitprice” with confidence 99%. Then, 1% of the records require closer examination (usually by hand…) September 16, 2018 Data Mining: Concepts and Techniques
14
Daimler-Chrysler Example
A warehouse which contains information about vehicle repairs To analyze quality of Products (cars) Processes (warranty claims) Services (actual repairs) To evaluate and redefine Policies Costs To collect and analyze data Wear Damages Potential recalls September 16, 2018 Data Mining: Concepts and Techniques
15
Data Source Analysis (cont.)
Discovery of integrity constraints: Vehicle type ∈ {C180, C220, C250, …}; production date precedes date of repair. Data mining can be used to uncover integrity constraints, e.g., use visualization to discover that vehicle type, power and weight correlate. 99% of WGT values fall in [1000, 2000] (are the remaining 1% incorrect?).
16
Structural integration & data mining
Description conflicts: objects modeled differently in two or more schemas. Structural conflicts: different model constructs are used (e.g., taxes included in one schema and not in the other). Data conflicts: incorrect data, different representations, etc.
17
Structural integration and DM
Using data mining methods to identify and resolve these conflicts: assume that the same vehicle repair cases are stored in two different databases, where one DB records fuel efficiency in km/liter and the other in miles/gallon. A linear regression method would discover the conversion factor!
18
Data Mining: Concepts and Techniques
Data cleansing and DM. Missing values: replace them by the most frequent value, or, better, use the inferred rules to determine the set of possible values; e.g., the rule (WGT = 800) → (WGT_GROUP = light) helps in filling a missing value of WGT_GROUP. Correction of noise and incorrect data: e.g., a 10-year-old car with an odometer reading of 10… Use rules such as (AGE > 10) → (odometer > 100,000). Conflicts between data sources: e.g., same product, different prices; again, use rules to correct.
19
Multidimensional Data Modeling and DM
Identification of orthogonal dimensions: Some fields are functionally dependent (e.g., customer birthday and age) Other fields do not strongly influence the measure (e.g., steering wheel type may not have much influence in number of repairs) Data mining methods to rank the variables according to importance and correlations can be used to decide which dimensions are kept. September 16, 2018 Data Mining: Concepts and Techniques
20
Data Mining: Concepts and Techniques
Data transformations. Define the transformations in an appropriate language, usually supported by a GUI; various ETL tools support this functionality. Alternatively, use User Defined Functions (UDFs) and SQL, e.g.: CREATE VIEW Customer2(Lname, Fname, Gender, Street, City, State, ZIP, CID) AS SELECT LastNameExtract(Name), FirstNameExtract(Name), Sex, Street, CityExtract(City), … FROM Customer. The UDFs (LastNameExtract, FirstNameExtract, CityExtract) extract names and contain cleaning logic, e.g., remove misspellings or provide missing ZIP codes.
21
Data Mining: Concepts and Techniques
Data Cleaning Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data September 16, 2018 Data Mining: Concepts and Techniques
22
Data Mining: Concepts and Techniques
Noisy Data. Noise: random error or variance in a measured variable. Incorrect attribute values may be due to: faulty data collection instruments, data entry problems, data transmission problems, technology limitations, or inconsistent naming conventions. Other data problems which require data cleaning: duplicate records, incomplete data, inconsistent data.
23
How to Handle Noisy Data?
Binning method: first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Clustering detect and remove outliers Combined computer and human inspection detect suspicious values and check by human Regression smooth by fitting the data into regression functions September 16, 2018 Data Mining: Concepts and Techniques
24
Simple Discretization Methods: Binning
Equal-width (distance) partitioning: It divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. The most straightforward But outliers may dominate presentation Skewed data is not handled well. Equal-depth (frequency) partitioning: It divides the range into N intervals, each containing approximately same number of samples Good data scaling Managing categorical attributes can be tricky. September 16, 2018 Data Mining: Concepts and Techniques
25
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34 September 16, 2018 Data Mining: Concepts and Techniques
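A minimal Python sketch of equi-depth binning with smoothing by bin means and by bin boundaries, reproducing the numbers above; the function and variable names are made up for illustration.

def equi_depth_bins(values, n_bins):
    # Split sorted values into n_bins bins of (roughly) equal size.
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Every value in a bin is replaced by the bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Every value is replaced by the closest of the two bin boundaries.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)     # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))          # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))     # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]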
26
Chapter 3: Data Preprocessing
Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary September 16, 2018 Data Mining: Concepts and Techniques
27
Data Transformation: Normalization
Min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A. Z-score normalization: v' = (v - mean_A) / std_dev_A. Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.
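A small Python sketch of the three normalizations above; the data values are illustrative.

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]
print(min_max(incomes))          # 12000 -> 0.0, 98000 -> 1.0
print(z_score(incomes))
print(decimal_scaling(incomes))  # divide by 10^5 so every |v'| < 1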
28
Dimensionality Reduction
Feature selection (i.e., attribute subset selection): select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features; this reduces the number of patterns produced and makes them easier to understand. Heuristic methods (due to the exponential number of choices): step-wise forward selection, step-wise backward elimination, combining forward selection and backward elimination, decision-tree induction.
29
Data Mining: Concepts and Techniques
Example of Decision Tree Induction (figure). Initial attribute set: {A1, A2, A3, A4, A5, A6}. The induced tree tests A4 at the root, then A1 and A6 below it, with leaves labeled Class 1 and Class 2. Reduced attribute set: {A1, A4, A6}.
30
Data Mining: Concepts and Techniques
Data Compression. String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion. Audio/video compression: typically lossy, with progressive refinement; sometimes small fragments of the signal can be reconstructed without reconstructing the whole. Time sequences are not audio: they are typically short and vary slowly with time.
31
Data Mining: Concepts and Techniques
Wavelet Transforms (figure: Haar-2 and Daubechies-4 wavelets). Discrete wavelet transform (DWT): linear signal processing. Compressed approximation: store only a small fraction of the strongest wavelet coefficients. Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space. Method: the length L must be an integer power of 2 (pad with 0s when necessary); each transform has two functions, smoothing and difference; they are applied to pairs of data, resulting in two sets of data of length L/2; the two functions are applied recursively until the desired length is reached.
32
Principal Component Analysis
Given N data vectors from k-dimensions, find c <= k orthogonal vectors that can be best used to represent data The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions) Each data vector is a linear combination of the c principal component vectors Works for numeric data only Used when the number of dimensions is large September 16, 2018 Data Mining: Concepts and Techniques
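A minimal numpy sketch of PCA as described above (center the data, take the top c eigenvectors of the covariance matrix, project); the names are illustrative.

import numpy as np

def pca(X, c):
    # Reduce an N x k data matrix X to N x c using the top c principal components.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)          # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:c]             # indices of the c largest eigenvalues
    components = eigvecs[:, top]                    # k x c
    return X_centered @ components                  # N x c reduced representation

X = np.random.rand(100, 5)     # 100 data vectors in 5 dimensions
X_reduced = pca(X, 2)          # each row is a linear combination of 2 principal components
print(X_reduced.shape)         # (100, 2)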
33
Data Mining: Concepts and Techniques
Principal Component Analysis (figure): data plotted in the original axes X1, X2, with the principal component axes Y1 (direction of greatest variance) and Y2 overlaid.
34
Data Mining: Concepts and Techniques
Numerosity Reduction Parametric methods Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) Log-linear models: obtain value at a point in m-D space as the product on appropriate marginal subspaces Non-parametric methods Do not assume models Major families: histograms, clustering, sampling September 16, 2018 Data Mining: Concepts and Techniques
35
Regression and Log-Linear Models
Linear regression: Data are modeled to fit a straight line Often uses the least-square method to fit the line Multiple regression: allows a response variable Y to be modeled as a linear function of multidimensional feature vector Log-linear model: approximates discrete multidimensional probability distributions September 16, 2018 Data Mining: Concepts and Techniques
36
Regression Analysis and Log-Linear Models
Linear regression: Y = α + βX. The two parameters α and β specify the line and are estimated from the data at hand, using the least-squares criterion on the known values (X1, Y1), (X2, Y2), …. Multiple regression: Y = b0 + b1·X1 + b2·X2; many nonlinear functions can be transformed into this form. Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd.
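A small Python sketch of fitting Y = α + βX by least squares, as described above; the data values are made up.

def fit_line(xs, ys):
    # Least-squares estimates of alpha and beta for Y = alpha + beta * X.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

years_experience = [1, 3, 5, 7, 9]
salary           = [30, 42, 55, 61, 77]
alpha, beta = fit_line(years_experience, salary)
print(alpha, beta)   # the two stored parameters replace the raw data points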
37
Data Mining: Concepts and Techniques
Clustering Partition data set into clusters, and one can store cluster representation only Can be very effective if data is clustered but not if data is “smeared” Can have hierarchical clustering and be stored in multi-dimensional index tree structures There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8 September 16, 2018 Data Mining: Concepts and Techniques
38
Data Mining: Concepts and Techniques
Sampling (figure): raw data on one side, a cluster/stratified sample drawn from it on the other.
39
Hierarchical Reduction
Use multi-resolution structure with different degrees of reduction Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters” Parametric methods are usually not amenable to hierarchical representation Hierarchical aggregation An index tree hierarchically divides a data set into partitions by value range of some attributes Each partition can be considered as a bucket Thus an index tree with aggregates stored at each node is a hierarchical histogram September 16, 2018 Data Mining: Concepts and Techniques
40
Chapter 3: Data Preprocessing
Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary September 16, 2018 Data Mining: Concepts and Techniques
41
Data Mining: Concepts and Techniques
Discretization Three types of attributes: Nominal — values from an unordered set Ordinal — values from an ordered set Continuous — real numbers Discretization: divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical attributes. Reduce data size by discretization Prepare for further analysis September 16, 2018 Data Mining: Concepts and Techniques
42
Discretization and Concept Hierarchy
Discretization: reduce the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then be used to replace actual data values. Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
43
Discretization and concept hierarchy generation for numeric data
Binning (see sections before) Histogram analysis (see sections before) Clustering analysis (see sections before) Entropy-based discretization Segmentation by natural partitioning September 16, 2018 Data Mining: Concepts and Techniques
44
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is E(S,T) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2). The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) − E(S,T) falls below a threshold δ. Experiments show that it may reduce data size and improve classification accuracy.
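A small Python sketch of choosing a single binary split point by minimizing the post-split entropy E(S,T), assuming numeric values paired with class labels; the names and data are illustrative.

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_boundary(values, labels):
    # Return the boundary T minimizing E(S,T) over candidate midpoints.
    pairs = sorted(zip(values, labels))
    best_t, best_e = None, float("inf")
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [c for v, c in pairs if v <  t]
        right = [c for v, c in pairs if v >= t]
        if not left or not right:
            continue
        e = len(left) / len(pairs) * entropy(left) + \
            len(right) / len(pairs) * entropy(right)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

ages    = [23, 25, 31, 35, 42, 47, 55, 60]
classes = ["buy", "buy", "buy", "no", "no", "no", "no", "no"]
print(best_boundary(ages, classes))   # (33.0, 0.0): a perfect split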
45
Data Mining: Concepts and Techniques
Example (figure), for large data sets: a small discretization tree that first splits on Age < 25, then on Car = Sports, with leaves labeled H, H, L (risk classes).
46
Segmentation by natural partitioning
3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals. * If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals * If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals * If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals September 16, 2018 Data Mining: Concepts and Techniques
47
Data Mining: Concepts and Techniques
Example of the 3-4-5 rule (figure). Step 1: profit ranges from Min = -$351 to Max = $4,700, with Low (5th percentile) = -$159 and High (95th percentile) = $1,838. Step 2: msd = 1,000, so Low is rounded down to -$1,000 and High up to $2,000; the interval (-$1,000 … $2,000) is split into three equi-width intervals: (-$1,000 … 0], (0 … $1,000], ($1,000 … $2,000]. Step 3: adjust to the actual Min and Max — the first interval shrinks to (-$400 … 0] and a new interval ($2,000 … $5,000] is added to cover Max. Step 4: recursively apply the rule to each interval: (-$400 … 0] into 4 sub-intervals (-$400 … -$300], (-$300 … -$200], (-$200 … -$100], (-$100 … 0]; (0 … $1,000] into 5 sub-intervals (0 … $200], ($200 … $400], ($400 … $600], ($600 … $800], ($800 … $1,000]; ($1,000 … $2,000] into 5 sub-intervals ($1,000 … $1,200], ($1,200 … $1,400], ($1,400 … $1,600], ($1,600 … $1,800], ($1,800 … $2,000]; ($2,000 … $5,000] into 3 sub-intervals ($2,000 … $3,000], ($3,000 … $4,000], ($4,000 … $5,000].
48
Concept hierarchy generation for categorical data
Specification of a partial ordering of attributes explicitly at the schema level by users or experts Specification of a portion of a hierarchy by explicit data grouping Specification of a set of attributes, but not of their partial ordering Specification of only a partial set of attributes September 16, 2018 Data Mining: Concepts and Techniques
49
Specification of a set of attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set; the attribute with the most distinct values is placed at the lowest level of the hierarchy. Example (top to bottom): country (15 distinct values), province_or_state (65), city (3,567), street (674,339). A sketch of this heuristic follows below.
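A tiny Python sketch of the heuristic: order the attributes by their number of distinct values, fewest first; the column names mirror the example above and the rows are made up.

def concept_hierarchy(table, attributes):
    # Order attributes from fewest to most distinct values (top of hierarchy first).
    counts = {a: len({row[a] for row in table}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a])

rows = [
    {"country": "Canada", "province_or_state": "BC", "city": "Vancouver", "street": "Main St"},
    {"country": "Canada", "province_or_state": "BC", "city": "Victoria",  "street": "Oak St"},
    {"country": "Canada", "province_or_state": "ON", "city": "Toronto",   "street": "King St"},
    {"country": "USA",    "province_or_state": "NY", "city": "New York",  "street": "5th Ave"},
    {"country": "USA",    "province_or_state": "NY", "city": "New York",  "street": "Broadway"},
]
print(concept_hierarchy(rows, ["street", "city", "country", "province_or_state"]))
# ['country', 'province_or_state', 'city', 'street']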
50
Data Mining: Concepts and Techniques
Lecture #4 September 16, 2018 Data Mining: Concepts and Techniques
51
Data Mining: Concepts and Techniques
Example (cube lattice figure): cuboid sizes — PSC: 6 million cells; PC: 6 million; PS: 0.8 million; SC: 6 million; P: 0.2 million; S: 0.05 million; C: 0.1 million; ALL: 1 cell. The lattice orders the cuboids PSC → {PC, PS, SC} → {P, S, C} → ALL.
52
Data Mining: Concepts and Techniques
Decisions, decisions... How many views must we materialize to get good performance? Given space S (on disk), which views do we materialize? In the previous example we would need space for about 19 million cells to materialize everything. Can we do better, while still avoiding the raw (fact table) data? PC (6M) can be answered using PSC (6M) — no advantage in materializing it; likewise SC (6M) can be answered using PSC (6M) — no advantage.
53
Data Mining: Concepts and Techniques
Example again (figure): the lattice annotated with, for each cuboid, its size and the cost of answering it from the cheapest materialized ancestor (PSC 6M; PC 6M; PS 0.8M; SC 6M; P 0.2M; S 0.01M; C 0.1M). Materializing only a subset of the cuboids uses far less space than materializing all of them, yet gives about the same query performance.
54
Data Mining: Concepts and Techniques
Formal treatment. Dependency relation: Q1 ⪯ Q2 means Q1 can be answered using the results of Q2; e.g., Q(P) ⪯ Q(PC) ⪯ Q(PSC), which makes the set of cuboids a lattice. Add hierarchies on the dimensions (figure): customers C → N (nation-wide customers, e.g., USA, Japan) → DF (domestic-foreign) → ALL (all customers); suppliers S → SN (nation-wide) → ALL; parts P → Sz (size) and Ty (type) → ALL.
55
Data Mining: Concepts and Techniques
Formal treatment (2) (figure): the combined lattice with estimated sizes — CP (6M), CSz (5M), CTy (5.99M), NP (5M), NSz (1,250), NTy (3,750), C (0.1M), P (0.2M), N (25), Sz (50), Ty (150), ALL (1).
56
Optimizing Data Cube lattices
First problem (no space restrictions): a VERY HARD problem (NP-complete). Heuristic: always include the "core" (base) cuboid; at every step you have a set Sv of materialized views. Compute the benefit of a view v relative to Sv as follows. For each w ⪯ v, define B_w: let u be the view of least cost in Sv such that w ⪯ u; if Cost(v) < Cost(u) then B_w = Cost(v) − Cost(u) (negative), else B_w = 0. Define B(v, Sv) = −Σ_{w ⪯ v} B_w.
57
Data Mining: Concepts and Techniques
Greedy algorithm:
Sv = {core view}
for i = 1 to k begin
  select v not in Sv such that B(v, Sv) is maximum
  Sv = Sv ∪ {v}
end
A runnable sketch follows below.
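A minimal Python sketch of the greedy view-selection heuristic above, under the usual assumption that the cost of answering a query from a materialized view equals that view's size; the lattice and sizes are illustrative, and the core cuboid is assumed to be an ancestor of every cuboid.

def greedy_views(sizes, ancestors, core, k):
    # sizes: cuboid -> estimated size; ancestors: cuboid -> cuboids it can be answered from
    # (including itself); pick k views beyond the core cuboid.
    selected = {core}

    def benefit(v):
        total = 0
        for w, anc in ancestors.items():
            if v not in anc:
                continue
            cheapest = min(sizes[u] for u in anc if u in selected)
            if sizes[v] < cheapest:
                total += cheapest - sizes[v]
        return total

    for _ in range(k):
        candidates = [v for v in sizes if v not in selected]
        best = max(candidates, key=benefit)
        selected.add(best)
    return selected

sizes = {"PSC": 6.0, "PC": 6.0, "PS": 0.8, "SC": 6.0,
         "P": 0.2, "S": 0.05, "C": 0.1, "ALL": 0.000001}   # millions of cells
ancestors = {   # which materialized cuboids can answer each cuboid
    "PSC": {"PSC"}, "PC": {"PC", "PSC"}, "PS": {"PS", "PSC"}, "SC": {"SC", "PSC"},
    "P": {"P", "PC", "PS", "PSC"}, "S": {"S", "PS", "SC", "PSC"},
    "C": {"C", "PC", "SC", "PSC"}, "ALL": set(sizes),
}
print(greedy_views(sizes, ancestors, core="PSC", k=3))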
58
Data Mining: Concepts and Techniques
September 16, 2018 Data Mining: Concepts and Techniques
59
Data Mining: Concepts and Techniques
Structures Two levels: Blocks in the first level correspond to the dense dimension combinations. The basic block will have the size proportional to the product of the cardinalities for these dimensions. Each entry in the block points to a second-level block. Blocks in the second level correspond to the sparse dimensions. They are arrays of pointers, as many as the product of the cardinalities for sparse dimensions. Each pointer has one of three values: null (non-existent data), impossible (non-allowed combination) or a pointer to an actual data block. September 16, 2018 Data Mining: Concepts and Techniques
60
Data Mining: Concepts and Techniques
Data Example Departments will generally have data for each Time period. (so the two are the dense dimension combination) Geographical information, Product and Distribution channels, on the other hand are typically sparse (e.g., most cities have only one Distribution channel and some Product values). Dimensions Departments (Sales,Mkt) Time Geographical information Product Distribution channels September 16, 2018 Data Mining: Concepts and Techniques
61
Data Mining: Concepts and Techniques
Structures revisited (figure): the upper-level dense structure has one entry per (Department, Quarter) combination — S,1Q; S,2Q; S,3Q; S,4Q; M,1Q; M,2Q; M,3Q; M,4Q — and each entry points to a lower-level sparse array over (Geography, Product, Distribution channel), whose non-null pointers lead to actual data blocks.
62
Data Mining: Concepts and Techniques
Allocating memory Define member structure (e.g., dimensions) Select dense dimension combinations and create upper level structure Create lower level structure. Input data cell: if pointer to data block is empty, create new else insert data in data block September 16, 2018 Data Mining: Concepts and Techniques
63
Problem 2: COMPUTING DATACUBES
Four algorithms PIPESORT PIPEHASH SORT-OVERLAP Partitioned-cube September 16, 2018 Data Mining: Concepts and Techniques
64
Data Mining: Concepts and Techniques
Optimizations. Smallest-parent: AB can be computed from ABC, ABD, or ABCD — which one should we use? Cache-results: having computed ABC, we compute AB from it while ABC is still in memory. Amortize-scans: we may try to compute ABC, ACD, ABD, and BCD in one scan of ABCD. Share-sorts and share-partitions.
65
Data Mining: Concepts and Techniques
PIPESORT. Input: cube lattice and cost matrix. Each edge e_ij in the lattice is annotated with two costs: S(i,j), the cost of computing j from i when i is not sorted, and A(i,j), the cost of computing j from i when i is sorted. Output: a subgraph of the lattice where each cuboid (group-by) is connected to a single parent from which it will be computed and is associated with an attribute order in which it will be sorted. If that order is a prefix of the order of its parent, the child can be computed without sorting the parent (cost A); otherwise the parent has to be sorted (cost S). For every parent there will be only one out-edge labeled A.
66
Data Mining: Concepts and Techniques
PIPESORT (2). Algorithm: proceeds in levels, k = 0, …, N−1 (N = number of dimensions). For each level, it finds the best way of computing level k from level k+1 by reducing the problem to a weighted bipartite matching problem: make k additional copies of each level-(k+1) group-by (each node then has k+1 vertices) and connect them to the same children as the original; the edges from the original copy carry A costs, while the edges from the copies carry S costs. Find the minimum-cost matching in this bipartite graph (each vertex in level k+1 is matched with one vertex in level k).
67
Data Mining: Concepts and Techniques
Example (figure): the level-2 group-bys AB, AC, BC are each duplicated (AB, AB, AC, AC, BC, BC) and connected to the level-1 group-bys A, B, C for the bipartite matching.
68
Data Mining: Concepts and Techniques
Transformed lattice (figure): level-1 group-bys A, B, C; level-2 copies annotated with costs AB(2), AB(10), AC(5), AC(12), BC(13), BC(20) — the lower cost of each pair is the A (already sorted) cost, the higher one the S (re-sort) cost.
69
Data Mining: Concepts and Techniques
Explanation of edges (figure): consider the two copies AB(2) and AB(10) with edges to A. The edge with cost 10 (an S cost) means the group-by is effectively stored as BA, so it must be re-sorted to produce A; the edge with cost 2 (an A cost) means we have AB already in sorted order, so A can be computed with no sort.
70
PIPESORT pseudo-algorithm
Pipesort (input: lattice with A() and S() edge costs):
for level k = 0 to N−1:
  Generate_plan(k+1);
  for each cuboid g in level k:
    fix the sort order of g as the order of the cuboid connected to g by an A edge;
71
Data Mining: Concepts and Techniques
Generate_plan(k+1):
  make k additional copies of each level-(k+1) cuboid;
  connect each copy to the same set of vertices as the original;
  assign A costs to the original edges and S costs to the copies' edges;
  find a minimum-cost matching on the transformed graph;
72
Data Mining: Concepts and Techniques
Example (figure): a worked pipesort plan.
73
Data Mining: Concepts and Techniques
PipeHash. Input: the lattice and the estimated sizes of the cuboids. PipeHash chooses for each vertex the parent with the smallest estimated size; the outcome is a minimum spanning tree (MST), where each vertex is a cuboid and an edge from i to j means that i is the smallest parent of j. Available memory is usually not enough to compute all the cuboids in the MST together, so we need to decide which cuboids can be computed together (a sub-MST), when to allocate and deallocate memory for the different hash tables, and which attribute to use for partitioning the data.
Initialize the worklist with the MST of the search lattice;
while the worklist is not empty:
  pick a tree T from the worklist;
  T' = Select-subtree(T) to be executed next;
  Compute-subtree(T');
74
Data Mining: Concepts and Techniques
Select-subtree(T): if the memory required by T is less than what is available, return T. Else, let S be the set of attributes in root(T). For any s ∈ S we get a subtree Ts of T, also rooted at root(T), including all cuboids that contain s; let Ps be the maximum number of partitions of root(T) possible when partitioning on s. Choose s such that mem(Ts)/Ps is smaller than the available memory and Ts is the largest such subtree over all choices of s. Remove Ts from T (and put T − Ts back in the worklist).
75
Data Mining: Concepts and Techniques
Compute-subtree(T'): numP = mem(T') * f / available memory; partition the root of T' into numP partitions. For each partition of root(T'): for each node n in T', compute all children of n in one scan; once n has been fully computed (cached), save it to disk and release the memory occupied by its hash table.
76
Data Mining: Concepts and Techniques
OVERLAP — sorted runs: consider a cuboid on j attributes {A1, A2, …, Aj}; we use B = (A1, A2, …, Aj) to denote the cuboid sorted in that attribute order. Consider S = (A1, A2, …, Al−1, Al+1, …, Aj), computed from B. A sorted run R of S in B is defined as R = π_S(Q), where Q is a maximal sequence of tuples of B such that, for each tuple in Q, the first l columns have the same value.
77
Data Mining: Concepts and Techniques
Sorted-run B = [(a,1,2),(a,1,3),(a,2,2),(b,1,3),(b,3,2),(c,3,1)] S = first and third attribute S = [(a,2),(a,3),(b,3),(b,2),(c,1)] Sorted runs: [(a,2),(a,3)] [(a,2)] [(b,3)] [(b,2)] [(c,1)] September 16, 2018 Data Mining: Concepts and Techniques
78
Data Mining: Concepts and Techniques
Partitions B and S have a common prefix (A1… Al-1) A partition of the cuboid S in B is the union of sorted runs such that the first l-1 columns of all the tuples of the sorted runs have the same values. [(a,2),(a,3)] [(b,2),(b,3)] [(c,1)] September 16, 2018 Data Mining: Concepts and Techniques
79
Data Mining: Concepts and Techniques
OVERLAP Sort the base cuboid: this forces the sorted order in which the other cuboids are computed ABCD ABC ABD ACD BCD AB AC BC AD CD BD A B C D ALL September 16, 2018 Data Mining: Concepts and Techniques
80
Data Mining: Concepts and Techniques
OVERLAP (2). If there is enough memory to hold all the cuboids, compute them all (very seldom true). Otherwise, use the partition as the unit of computation: we just need sufficient memory to hold one partition. As soon as a partition is computed, its tuples can be pipelined to compute descendant cuboids (same partition) and then written to disk; the memory is then reused for the next partition. Example XYZ → XZ, with partitions [(a,2),(a,3)], [(b,2),(b,3)], [(c,1)] and XYZ = [(a,1,2),(a,1,3),(a,2,2),(b,1,3),(b,3,2),(c,3,1)]: compute the (c,3,1) cell of XYZ and use it to compute (c,1) in XZ, then write these cells to disk; compute the (b,1,3) and (b,3,2) cells of XYZ and use them to compute [(b,2),(b,3)] in XZ, then write these cells to disk; compute the (a,1,2), (a,1,3), (a,2,2) cells of XYZ and use them to compute (a,2), (a,3) in XZ, then write all these cells to disk.
81
Data Mining: Concepts and Techniques
OVERLAP(3) Choose a parent to compute a cuboid: DAG Goal: minimize the size of the partitions of a cuboid, so less memory is needed. E.g., it is better to compute AC from ACD than from ABC, (since the sort order matches and the partition size is 1). This is a hard problem. Heuristic: maximize the size of the common prefix. ABCD ABC ABD ACD BCD AB AC BC AD CD BD A B C D ALL September 16, 2018 Data Mining: Concepts and Techniques
82
Data Mining: Concepts and Techniques
OVERLAP (4) Choosing a set of cuboids for overlapped computation, according to your memory constraints. To compute a cuboid in memory, we need memory equal to the size of its partition. Partition sizes can be estimated from cuboid sizes by using some distribution (uniform?) assumption. If this much memory can be spared, then the cuboid will be marked as in Partition state. For other cuboids, allocate a single page (for temporary results), these cuboids are in SortRun state. A cuboid in partition state can have its tuples pipelined for computation of its descendants. A cuboid can be considered for computation if it is the root, or its parent is marked as in Partition State The total memory allocated to all cuboids cannot be more than the available memory. September 16, 2018 Data Mining: Concepts and Techniques
83
Data Mining: Concepts and Techniques
OVERLAP (5). Again, a hard problem… Heuristic: traverse the tree in BFS manner. (Figure: the lattice annotated with partition sizes — ABC(1), ABD(1), ACD(1), BCD(50), AB(1), AC(1), BC(1), AD(5), CD(40), BD(1), A(1), B(1), C(1), D(5).)
84
Data Mining: Concepts and Techniques
Computing a cuboid S from its parent B (output: the sorted cuboid S):
foreach tuple t of B do
  if (state == Partition) then process_partition(t) else process_sorted_run(t);
Process_partition: if the input tuple starts a new partition, output the current partition at the end of the cuboid and start a new one; if the input tuple matches an existing tuple in the partition, update the aggregate; else insert the input tuple as a new aggregate.
Process_sorted_run: if the input tuple starts a new sorted run, flush all the pages of the current sorted run and start a new one; if the input tuple matches the last tuple in the sorted run, recompute the aggregate; else append the tuple to the end of the existing run.
85
Data Mining: Concepts and Techniques
Observations In ABCD ABC, the partition size is 1. Why? In ABCD ABD, the partition size is equal to the number of distinct C values, Why? In ABCD BCD the partition size is the size of the cuboid BCD, Why? September 16, 2018 Data Mining: Concepts and Techniques
86
Data Mining: Concepts and Techniques
Lecture #5 September 16, 2018 Data Mining: Concepts and Techniques
87
Data Mining: Concepts and Techniques
Indexing datacubes Objective: speed queries up. Traditional databases (OLTP): B-Trees Time and space logarithmic to the amount of indexed keys. Dynamic, stable and exhibit good performance under updates. (But OLAP is not about updates….) Bitmaps: Space efficient Difficult to update (but we don’t care in DW). Can effectively prune searches before looking at data. September 16, 2018 Data Mining: Concepts and Techniques
88
Data Mining: Concepts and Techniques
Bitmaps (figure): a relation R = (…, A, …, M) where attribute A has cardinality 9; the index on A consists of nine bitmaps B0 … B8, one per attribute value, each with one bit per tuple of R.
89
Data Mining: Concepts and Techniques
Query optimization Consider a high-selectivity-factor query with predicates on two attributes. Query optimizer: builds plans (P1) Full relation scan (filter as you go). (P2) Index scan on the predicate with lower selectivity factor, followed by temporary relation scan, to filter out non-qualifying tuples, using the other predicate. (Works well if data is clustered on the first index key). (P3) Index scan for each predicate (separately), followed by merge of RID. September 16, 2018 Data Mining: Concepts and Techniques
90
Query optimization (continued)
(Figure) Plan P2 scans the index on predicate 1 to get a tuple list, fetches the corresponding blocks of data, and filters them with predicate 2; plan P3 scans the index on each predicate separately, producing tuple list 1 and tuple list 2, and merges the two RID lists to obtain the answer.
91
Query optimization (continued)
When using bitmap indexes (P3) can be an easy winner! CPU operations in bitmaps (AND, OR, XOR, etc.) are more efficient than regular RID merges: just apply the binary operations to the bitmaps (In B-trees, you would have to scan the two lists and select tuples in both -- merge operation--) Of course, you can build B-trees on the compound key, but we would need one for every compound predicate (exponential number of trees…). September 16, 2018 Data Mining: Concepts and Techniques
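A tiny Python sketch of why bitmap ANDing is cheap: each predicate's qualifying rows are represented as a bitmap (here a Python integer used as a bit vector), and the conjunctive answer is a single bitwise AND; the row numbers are made up.

# One bit per row of the fact table (bit i set <=> row i satisfies the predicate).
def bitmap(matching_rows, n_rows):
    bm = 0
    for r in matching_rows:
        bm |= 1 << r
    return bm

n_rows = 16
pred1 = bitmap([0, 2, 5, 7, 9, 12], n_rows)   # e.g., rows where A = 'a'
pred2 = bitmap([2, 3, 7, 8, 12, 15], n_rows)  # e.g., rows where B = 'x'

answer = pred1 & pred2                        # conjunction: one word-wise AND
rows = [i for i in range(n_rows) if answer >> i & 1]
print(rows)                                   # [2, 7, 12]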
92
Data Mining: Concepts and Techniques
Tradeoffs: small dimension cardinality → dense bitmaps; large dimension cardinality → sparse bitmaps; compression helps, at the price of decompression.
93
Query strategy for Star joins
Maintain join indexes between fact table and dimension tables Prod. Fact table Dimension table a k … … Bitmap for type a Bitmap for type k ….. Bitmap for loc. Bitmap for loc. ….. Bitmap for prod Bitmap for prod ….. September 16, 2018 Data Mining: Concepts and Techniques
94
Data Mining: Concepts and Techniques
Star-Joins. SELECT F.S, D1.A1, D2.A2, …, Dn.An FROM F, D1, D2, …, Dn WHERE F.A1 = D1.A1 AND F.A2 = D2.A2 AND … AND F.An = Dn.An AND D1.B1 = 'c1' AND D2.B2 = 'p2' AND …. Likely strategy: for each Di, find the suitable values of Ai such that Di.Bi = 'xi' (unless you have a bitmap index for Bi); use the bitmap index on those Ai values to form a bitmap for the related rows of F (OR-ing the bitmaps). At this stage you have n such bitmaps, and the result can be found by AND-ing them.
95
Data Mining: Concepts and Techniques
Example: selectivity per predicate = 0.01 (predicates on the dimension tables), with n statistically independent predicates, so total selectivity = 10^(-2n). With a fact table of 10^8 rows and n = 3, the answer has 10^8 × 10^-6 = 100 rows; in the worst case that is 100 blocks — still far better than reading all the blocks of the relation (e.g., assuming 100 tuples/block, that would be 10^6 blocks!).
96
Design Space of Bitmap Indexes
The basic bitmap design is called Value-list index. The focus there is on the columns If we change the focus to the rows, the index becomes a set of attribute values (integers) in each tuple (row), that can be represented in a particular way. We can encode this row in many ways... September 16, 2018 Data Mining: Concepts and Techniques
97
Attribute value decomposition
Let C be the attribute cardinality. Consider a value v of the attribute and a sequence of bases <b_{n-1}, b_{n-2}, …, b_1>; define b_n = ceil(C / (b_{n-1} · … · b_1)). Then v can be decomposed into a sequence of n digits <v_n, v_{n-1}, …, v_1> as follows: v = V_1 = V_2·b_1 + v_1 = V_3·(b_2·b_1) + v_2·b_1 + v_1 = … = v_n·(b_{n-1}·…·b_1) + … + v_i·(b_{i-1}·…·b_1) + … + v_2·b_1 + v_1, where v_i = V_i mod b_i and V_i = floor(V_{i-1} / b_{i-1}), with V_1 = v.
98
Data Mining: Concepts and Techniques
Number systems — how do you write 576 in different base sequences? In <7,7,5,3>: 576 / (7×7×5×3) = 576 / 735 = 0 remainder 576 (so the value fits); 576 / (7×5×3) = 576 / 105 = 5 remainder 51; 51 / (5×3) = 51 / 15 = 3 remainder 6; 6 / 3 = 2 remainder 0; hence 576 = 5×(7×5×3) + 3×(5×3) + 2×3 + 0, i.e., digits <5, 3, 2, 0>. In <2,2,2,2,2,2,2,2,2,2> (binary): 576 = 1×2^9 + 0×2^8 + 0×2^7 + 1×2^6 + 0×2^5 + … + 0×2^0, i.e., 1001000000. In <10,10,10> (the decimal system): 576 / 100 = 5 remainder 76; 76 / 10 = 7 remainder 6; hence 576 = 5×10^2 + 7×10 + 6.
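A small Python sketch of decomposing a value into digits for an arbitrary base sequence, reproducing the 576 examples above (the leading base in the slide's <7,7,5,3> only bounds the most significant digit, so it is not needed for the arithmetic).

def decompose(v, bases):
    # bases = <b_{n-1}, ..., b_1>; returns the digits <v_n, ..., v_1>.
    digits = []
    for b in reversed(bases):      # process b_1 first
        digits.append(v % b)
        v //= b
    digits.append(v)               # v_n: whatever is left
    return list(reversed(digits))

print(decompose(576, [7, 5, 3]))   # [5, 3, 2, 0]
print(decompose(576, [2] * 9))     # [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  (binary 1001000000)
print(decompose(576, [10, 10]))    # [5, 7, 6]  (decimal)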
99
Data Mining: Concepts and Techniques
September 16, 2018 Data Mining: Concepts and Techniques
100
Data Mining: Concepts and Techniques
Bitmaps, value-list index (figure): the same relation R = (…, A, …, M); the value-list index on A again consists of one bitmap per attribute value, B0 … B8.
101
Data Mining: Concepts and Techniques
Example (figure): with base sequence <3,3>, each value of A is decomposed into two digits, and the value-list (equality-encoded) index has bitmaps B0^1, B1^1, B2^1 for the low-order digit and B0^2, B1^2, B2^2 for the high-order digit; e.g., the value 3 = 1×3 + 0 sets B1^2 and B0^1.
102
Data Mining: Concepts and Techniques
Encoding scheme. Equality encoding: all bits set to 0 except the one that corresponds to the value. Range encoding: the v_i rightmost bits set to 0, the remaining bits set to 1.
103
Range encoding single component, base-9
(Figure: the relation R with attribute A and its single-component, base-9 range-encoded bitmaps B0 … B8; for a tuple with value v, bitmaps B_v … B_8 are set to 1 and B_0 … B_{v-1} to 0.)
104
Data Mining: Concepts and Techniques
Example revisited (figure): the same base-<3,3> equality-encoded (value-list) index as before, shown again for comparison with the range-encoded version on the next slide.
105
Data Mining: Concepts and Techniques
Example (figure): with base sequence <3,3>, the range-encoded index needs only bitmaps B0^1, B1^1 for the low-order digit and B0^2, B1^2 for the high-order digit (the all-ones bitmap B2 of each component can be dropped).
106
Data Mining: Concepts and Techniques
Design space (figure): a spectrum of encodings with range encoding at one end and equality encoding at the other.
107
Data Mining: Concepts and Techniques
RangeEval evaluates each range predicate by computing two bitmaps: a BEQ bitmap and either a BGT or a BLT bitmap. RangeEval-Opt uses only <=: A < v is the same as A <= v−1; A > v is the same as NOT(A <= v); A >= v is the same as NOT(A <= v−1). A small sketch follows below.
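A minimal Python sketch of the RangeEval-Opt rewrites over a single-component range-encoded index, where bitmap LE[v] marks the tuples with A <= v (bitmaps are Python ints); purely illustrative.

values = [3, 0, 7, 2, 5, 3, 8, 1]         # attribute A, one value per tuple
C = 9                                     # attribute cardinality
ALL = (1 << len(values)) - 1              # all-ones bitmap

# Range-encoded bitmaps: bit i of LE[v] is set iff values[i] <= v.
LE = [sum((values[i] <= v) << i for i in range(len(values))) for v in range(C)]

def le(v):            # A <= v
    return ALL if v >= C - 1 else (0 if v < 0 else LE[v])

def lt(v):            # A <  v   ==  A <= v-1
    return le(v - 1)

def gt(v):            # A >  v   ==  NOT(A <= v)
    return ALL & ~le(v)

def ge(v):            # A >= v   ==  NOT(A <= v-1)
    return ALL & ~le(v - 1)

print([i for i in range(len(values)) if gt(4) >> i & 1])   # rows with A > 4 -> [2, 4, 6]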
108
Data Mining: Concepts and Techniques
RangeEval-OPT September 16, 2018 Data Mining: Concepts and Techniques
109
Data Mining: Concepts and Techniques
September 16, 2018 Data Mining: Concepts and Techniques
110
Tree-Structured Indexes
The slides for this text are organized into chapters; this lecture covers Chapter 9: Tree-Structured Indexing.
111
Data Mining: Concepts and Techniques
Introduction As for any index, 3 alternatives for data entries k*: Data record with key value k <k, rid of data record with search key value k> <k, list of rids of data records with search key k> Choice is orthogonal to the indexing technique used to locate data entries k*. Tree-structured indexing techniques support both range searches and equality searches. ISAM: static structure; B+ tree: dynamic, adjusts gracefully under inserts and deletes. September 16, 2018 Data Mining: Concepts and Techniques 2
112
Data Mining: Concepts and Techniques
Range Searches. "Find all students with gpa > 3.0." If the data is in a sorted file, do a binary search to find the first such student, then scan to find the others; but the cost of the binary search can be quite high. Simple idea: create an 'index' file with one entry (k1, k2, …, kN) per page of the data file (figure), and do the binary search on the (much smaller) index file!
113
Data Mining: Concepts and Techniques
ISAM. An index entry has the form <P0, K1, P1, K2, P2, …, Km, Pm>. The index file may still be quite large, but we can apply the idea repeatedly! (Figure: non-leaf pages on top, leaf pages below — the primary pages — with overflow pages chained off them.) Leaf pages contain data entries.
114
Data Mining: Concepts and Techniques
Comments on ISAM. File creation: leaf (data) pages are allocated sequentially, sorted by search key; then index pages are allocated, then space for overflow pages. Index entries: <search key value, page id>; they 'direct' the search for data entries, which live in the leaf pages. Search: start at the root and use key comparisons to go to a leaf; cost ≈ log_F N, where F = # entries per index page and N = # leaf pages. Insert: find the leaf the data entry belongs to and put it there. Delete: find and remove from the leaf; if this empties an overflow page, de-allocate it. Static tree structure: inserts and deletes affect only leaf pages.
115
Data Mining: Concepts and Techniques
Example ISAM Tree (figure). Each node can hold 2 entries; no need for 'next-leaf-page' pointers. The root holds key 40; the two second-level index nodes hold keys (20, 33) and (51, 63); the leaf pages hold the data entries 10*, 15*, 20*, 27*, 33*, 37*, 40*, 46*, 51*, 55*, 63*, 97*.
116
Data Mining: Concepts and Techniques
After inserting 23*, 48*, 41*, 42* … (figure): the index pages are unchanged (root 40; nodes 20, 33 and 51, 63) and the primary leaf pages still hold 10* … 97*; the new entries go to overflow pages — 23*, 48*, 41* in first-level overflow pages and 42* in a further overflow page chained behind the one holding 41*.
117
Data Mining: Concepts and Techniques
… Then deleting 42*, 51*, 97* (figure): the index pages are unchanged; the primary leaves now hold 10*, 15*, 20*, 27*, 33*, 37*, 40*, 46*, 55*, 63*, and the overflow entries 23*, 48*, 41* remain. Note that 51 still appears at the index level, but no longer in any leaf!
118
B+ Tree: The Most Widely Used Index
Insert/delete at log_F N cost; the tree is kept height-balanced (F = fanout, N = # leaf pages). Minimum 50% occupancy (except for the root): each node contains m entries with d <= m <= 2d, where the parameter d is called the order of the tree. Supports equality and range searches efficiently. (Figure: index entries above, used for direct search; data entries below, forming the "sequence set".)
119
Data Mining: Concepts and Techniques
Example B+ Tree (figure). Search begins at the root, and key comparisons direct it to a leaf (as in ISAM). Search for 5*, for 15*, and for all data entries >= 24*. The root holds keys 13, 17, 24, 30; the leaves hold 2*, 3*, 5*, 7* | 14*, 16* | 19*, 20*, 22* | 24*, 27*, 29* | 33*, 34*, 38*, 39*. Based on the search for 15*, we know it is not in the tree!
120
Data Mining: Concepts and Techniques
B+ Trees in Practice. Typical order: 100; typical fill-factor: 67%, giving an average fanout of 133. Typical capacities: height 4: 133^4 = 312,900,700 records; height 3: 133^3 = 2,352,637 records. The top levels can often be held in the buffer pool: level 1 = 1 page = 8 KB; level 2 = 133 pages = 1 MB; level 3 = 17,689 pages = 133 MB.
121
Inserting a Data Entry into a B+ Tree
Find correct leaf L. Put data entry onto L. If L has enough space, done! Else, must split L (into L and a new node L2) Redistribute entries evenly, copy up middle key. Insert index entry pointing to L2 into parent of L. This can happen recursively To split index node, redistribute entries evenly, but push up middle key. (Contrast with leaf splits.) Splits “grow” tree; root split increases height. Tree growth: gets wider or one level taller at top. September 16, 2018 Data Mining: Concepts and Techniques 6
122
Inserting 8* into Example B+ Tree
(Figure.) Observe how minimum occupancy is guaranteed in both leaf and index page splits. Note the difference between copy-up and push-up, and be sure you understand the reasons for it: the leaf split produces leaves 2*, 3* and 5*, 7*, 8*, and the entry to be inserted in the parent node carries key 5, which is copied up and continues to appear in the leaf; the index split produces an entry to be inserted in the parent node with key 17, which is pushed up and appears only once in the index — contrast this with a leaf split.
123
Example B+ Tree After Inserting 8*
(Figure) The root now holds 17, with index nodes 5, 13 and 24, 30 below it, and leaves 2*, 3* | 5*, 7*, 8* | 14*, 16* | 19*, 20*, 22* | 24*, 27*, 29* | 33*, 34*, 38*, 39*. Notice that the root was split, leading to an increase in height. In this example we could avoid the split by redistributing entries; however, this is usually not done in practice.
124
Deleting a Data Entry from a B+ Tree
Start at root, find leaf L where entry belongs. Remove the entry. If L is at least half-full, done! If L has only d-1 entries, Try to re-distribute, borrowing from sibling (adjacent node with same parent as L). If re-distribution fails, merge L and sibling. If merge occurred, must delete entry (pointing to L or sibling) from parent of L. Merge could propagate to root, decreasing height. September 16, 2018 Data Mining: Concepts and Techniques 14
125
Example Tree After (Inserting 8*, Then) Deleting 19* and 20* ...
(Figure) The root holds 17, with index nodes 5, 13 and 27, 30, and leaves 2*, 3* | 5*, 7*, 8* | 14*, 16* | 22*, 24* | 27*, 29* | 33*, 34*, 38*, 39*. Deleting 19* is easy; deleting 20* is done with redistribution — notice how the middle key is copied up.
126
Data Mining: Concepts and Techniques
… And then deleting 24* (figure): we must merge. Observe the 'toss' of the index entry 27 (on the right) and the 'pull down' of the index entry 17 (below); the final tree has a single index node 5, 13, 17, 30 over the leaves 2*, 3* | 5*, 7*, 8* | 14*, 16* | 22*, 27*, 29* | 33*, 34*, 38*, 39*.
127
Data Mining: Concepts and Techniques
Summary Tree-structured indexes are ideal for range-searches, also good for equality searches. ISAM is a static structure. Only leaf pages modified; overflow pages needed. Overflow chains can degrade performance unless size of data set and data distribution stay constant. B+ tree is a dynamic structure. Inserts/deletes leave tree height-balanced; log F N cost. High fanout (F) means depth rarely more than 3 or 4. Almost always better than maintaining a sorted file. September 16, 2018 Data Mining: Concepts and Techniques 23
128
Data Mining: Concepts and Techniques
Summary (Contd.) Typically, 67% occupancy on average. Usually preferable to ISAM, modulo locking considerations; adjusts to growth gracefully. Key compression increases fanout, reduces height. Bulk loading can be much faster than repeated inserts for creating a B+ tree on a large data set. Most widely used index in database management systems because of its versatility. One of the most optimized components of a DBMS. September 16, 2018 Data Mining: Concepts and Techniques 24
129
Data Mining: Concepts and Techniques
Lecture #6 September 16, 2018 Data Mining: Concepts and Techniques
130
Monitoring Techniques
Periodic snapshots, database triggers, log shipping, data shipping (replication service), transaction shipping, polling (queries to the source), application-level monitoring → each has advantages & disadvantages!
131
Data Mining: Concepts and Techniques
Monitoring Issues Frequency periodic: daily, weekly, … triggered: on “big” change, lots of changes, ... Data transformation convert data to uniform format remove & add fields (e.g., add date to get history) Standards (e.g., ODBC) Gateways September 16, 2018 Data Mining: Concepts and Techniques
132
Data Mining: Concepts and Techniques
Monitoring Products Gateways: Info Builders EDA/SQL, Oracle Open Connect, Informix Enterprise Gateway, … Data Shipping: Oracle Replication Server, Praxis OmniReplicator, … Transaction Shipping: Sybase Replication Server, Microsoft SQL Server Extraction: Aonix, ETI, CrossAccess, DBStar Monitoring/Integration products later on September 16, 2018 Data Mining: Concepts and Techniques
133
Data Mining: Concepts and Techniques
Integration (architecture figure): sources feed an integration layer (data cleaning, data loading, derived data), which populates the warehouse and its metadata; clients run query & analysis against the warehouse.
134
Change detection Detect & send changes to integrator Different classes of sources Cooperative Queryable Logged Snapshot/dump
135
Data transformation Convert data to uniform format Byte ordering, string termination Internal layout Remove, add, & reorder attributes Add (regeneratable) key Add date to get history
136
Data transformation (2)
Sort tuples. May use external utilities, which can be much faster (10x) than the SQL engine, e.g., a perl script to reorder attributes.
137
External functions (EFs)
Special transformation functions E.g., Yen_to_dollars User defined Specified in warehouse table definition Aid in integration Must be applied to updates, too
138
Data integration Rules for matching data from different sources Build composite view of data Eliminate duplicate, unneeded attributes
139
Data Mining: Concepts and Techniques
Data Cleaning. Migration (e.g., yen → dollars). Scrubbing: use domain-specific knowledge (e.g., social security numbers). Fusion (e.g., mailing lists, customer merging). Auditing: discover rules & relationships (like data mining). (Figure: customer1(Joe) from the billing DB and customer2(Joe) from the service DB are fused into merged_customer(Joe).)
140
Data cleansing Find (& remove) duplicate tuples E.g., Jane Doe & Jane Q. Doe Detect inconsistent, wrong data Attributes that don’t match E.g., city, state and zipcode Patch missing, unreadable data Want to “backflush” clean data Notify sources of errors found
141
Data Mining: Concepts and Techniques
Loading Data Incremental vs. refresh Off-line vs. on-line Frequency of loading At night, 1x a week/month, continuously Parallel/Partitioned load September 16, 2018 Data Mining: Concepts and Techniques
142
Data Mining: Concepts and Techniques
Derived Data Derived Warehouse Data indexes aggregates materialized views (next slide) When to update derived data? Incremental vs. refresh September 16, 2018 Data Mining: Concepts and Techniques
143
The “everything is a view” view
Pure programs: e.g., "canned queries" — always the same cost; no data is materialized (DBMSs). Derived data: materialized views — the data is always there but must be updated (good for warehouses). Pure data: snapshots — the procedure is thrown away, so not maintainable. Approximate: a snapshot plus a refresh procedure applied under some conditions (quasi-copies), or approximate models, e.g., statistical ones (quasi-cubes).
144
Data Mining: Concepts and Techniques
Materialized Views: define new warehouse relations using SQL expressions; the materialized relation does not exist at any source.
145
Data Mining: Concepts and Techniques
Integration Products. Monitoring & integration: Apertus, Informatica, Prism, Sagent, … Merging: DataJoiner, SAS, … Cleaning: Trillium, … These typically take the warehouse off-line and typically do a refresh or a simple incremental load: e.g., Red Brick Table Management Utility, Prism.
146
Data Mining: Concepts and Techniques
Managing Metadata (architecture figure): warehouse design tools and the integration layer read and write the metadata repository, which sits alongside the warehouse between the sources and the query & analysis clients.
147
Data Mining: Concepts and Techniques
Metadata Administrative definition of sources, tools, ... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ... September 16, 2018 Data Mining: Concepts and Techniques
148
Data Mining: Concepts and Techniques
Metadata Business business terms & definition data ownership, charging Operational data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails September 16, 2018 Data Mining: Concepts and Techniques
149
Data Mining: Concepts and Techniques
Tools Development design & edit: schemas, views, scripts, rules, queries, reports Planning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning Warehouse Management performance monitoring, usage patterns, exception reporting System & Network Management measure traffic (sources, warehouse, clients) Workflow Management “reliable scripts” for cleaning & analyzing data September 16, 2018 Data Mining: Concepts and Techniques
150
Data Mining: Concepts and Techniques
Tools - Products Management Tools HP Intelligent Warehouse Advisor, IBM Data Hub, Prism Warehouse Manager System & Network Management HP OpenView, IBM NetView, Tivoli September 16, 2018 Data Mining: Concepts and Techniques
151
Current State of Industry
Extraction and integration done off-line Usually in large, time-consuming, batches Everything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost Query optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything September 16, 2018 Data Mining: Concepts and Techniques
152
Data Mining: Concepts and Techniques
Future Directions Better performance Larger warehouses Easier to use What are companies & research labs working on? September 16, 2018 Data Mining: Concepts and Techniques
153
Data Mining: Concepts and Techniques
Research (1) Incremental Maintenance Data Consistency Data Expiration Recovery Data Quality Error Handling (Back Flush) September 16, 2018 Data Mining: Concepts and Techniques
154
Data Mining: Concepts and Techniques
Research (2) Rapid Monitor Construction Temporal Warehouses Materialization & Index Selection Data Fusion Data Mining Integration of Text & Relational Data September 16, 2018 Data Mining: Concepts and Techniques
155
Make warehouse self-maintainable
Add auxiliary tables to minimize update cost Original + auxiliary are self-maintainable E.g., an auxiliary table of all unsold catalog items Some updates may still be self-maintainable E.g., insert into catalog if item (the join attribute) is a key (figure: Items sold = Sales ⋈ Catalog)
156
Detection of self-maintainability
Most algorithms are at table level Most algorithms are compile-time Tuple level at runtime [Huyn 1996, 1997] Use state of tables and update to determine if self-maintainable E.g., check whether sale is for item previously sold
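A minimal sketch (not from the slides) of the tuple-level, runtime check described above, using plain Python sets; the relation and attribute names are illustrative only.

# Sketch: runtime check of whether an update is self-maintainable, i.e. whether
# the warehouse view Sold = sale JOIN catalog (on item) can be updated using
# only data already stored at the warehouse.

# Warehouse state: the materialized view plus the catalog prices already known
# for items that have been sold before.
sold = {("hat", "Sue", 12.0)}                 # (item, clerk, price)
known_prices = {item: price for item, _, price in sold}

def insert_sale_self_maintainable(item):
    """An insert into sale is self-maintainable if the item was sold before,
    because its catalog row (the price) is already known at the warehouse."""
    return item in known_prices

def apply_sale_insert(item, clerk):
    if insert_sale_self_maintainable(item):
        sold.add((item, clerk, known_prices[item]))   # maintained locally
        return "maintained locally"
    return "must query the catalog source"            # not self-maintainable

print(apply_sale_insert("hat", "Joe"))    # maintained locally
print(apply_sale_insert("scarf", "Ann"))  # must query the catalog source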
157
Warehouse maintenance
Current systems ignore integration of new data Or assume warehouse can be rebuilt periodically Depend on long “downtime” to regenerate warehouse Technology gap: continuous incremental maintenance
158
Maintenance research Change detection Data consistency Single table consistency Multiple table consistency Expiration of data Crash recovery
159
Snapshot change detection
Compare old & new snapshots Join-based algorithms Hash old data, probe with new Window algorithm Sliding window over snapshots Good for local changes
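A small sketch of the join-based approach above: hash the old snapshot by key, probe it with the new snapshot, and report inserts, deletes, and updates. The (key, payload) record layout is an illustrative assumption.

def snapshot_diff(old_rows, new_rows):
    old = {key: payload for key, payload in old_rows}   # hash the old snapshot
    inserts, updates = [], []
    for key, payload in new_rows:                        # probe with the new one
        if key not in old:
            inserts.append((key, payload))
        else:
            old_payload = old.pop(key)                   # matched: drop from hash table
            if old_payload != payload:
                updates.append((key, old_payload, payload))
    deletes = list(old.items())                          # keys never probed were deleted
    return inserts, deletes, updates

old = [(1, "hat"), (2, "scarf"), (3, "cap")]
new = [(1, "hat"), (3, "beret"), (4, "glove")]
print(snapshot_diff(old, new))
# ([(4, 'glove')], [(2, 'scarf')], [(3, 'cap', 'beret')])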
160
Integrated data consistency
Conventional maintenance is inadequate Sources report changes but: No locking, no global transactions (sources don’t communicate or coordinate with each other) Inconsistencies caused by interleaving of updates
161
Example anomaly Warehouse table Sold = catalog ⋈ sale ⋈ emp
Updates: insert into sale [hat, Sue]; delete from catalog [$12, hat]. Source tables: catalog(price, item) = {[$12, hat]}, sale(item, clerk) = {}, emp(clerk, age) = {[Sue, 26]}; warehouse view Sold(price, item, clerk, age)
162
Data Mining: Concepts and Techniques
Anomaly (2) The two updates interleave with the warehouse’s delta queries: insert into sale [hat, Sue] triggers Q1 = catalog ⋈ [hat, Sue], answered as A(Q1) = [$12, hat, Sue]; then Q2 = [$12, hat, Sue] ⋈ emp, answered as A(Q2) = [$12, hat, Sue, 26]. The concurrent delete from catalog [$12, hat] is effectively ignored, so the warehouse ends up with Sold = {[$12, hat, Sue, 26]} even though the catalog row no longer exists. September 16, 2018 Data Mining: Concepts and Techniques
163
Choices to deal with anomalies
Keep all relations in the DW (storage-expensive!) Run all queries as distributed (may not be feasible — legacy systems — and gives poor performance) Use specialized algorithms, e.g., the Eager Compensating Algorithm (ECA) or Strobe. September 16, 2018 Data Mining: Concepts and Techniques
164
Another anomaly example
V = π_{price,clerk}(catalog ⋈ sale); initially catalog = {[$12, hat]}, sale = {[hat, Sue]}, so V = {[$12, Sue]}. Delete(catalog[$12, hat]) triggers Q1 = π_{p,c}([$12, hat] ⋈ sale); Delete(sale[hat, Sue]) triggers Q2 = π_{p,c}(catalog ⋈ [hat, Sue]). Both queries are answered after both deletes have been applied at the source, so A(Q1) = ∅ and A(Q2) = ∅; V = V − ∅ − ∅ still contains [$12, Sue] — WRONG! (the view should now be empty) September 16, 2018 Data Mining: Concepts and Techniques
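A small simulation (not from the slides) that replays this deletion anomaly: with naive incremental maintenance, both delta queries reach the source only after both deletes have been applied, so nothing gets subtracted from V.

# Sketch of the deletion anomaly: V = project_{price,clerk}(catalog JOIN sale).
catalog = {(12, "hat")}            # (price, item)
sale = {("hat", "Sue")}            # (item, clerk)
V = {(12, "Sue")}                  # materialized view (price, clerk)

def join_project(cat, sal):
    return {(p, c) for (p, i) in cat for (i2, c) in sal if i == i2}

# Two source deletes happen back to back.
catalog.clear()                    # Delete(catalog[$12, hat])
sale.clear()                       # Delete(sale[hat, Sue])

# The warehouse now evaluates both delta queries against the CURRENT source state.
A_Q1 = join_project({(12, "hat")}, sale)        # delta for the catalog delete -> empty
A_Q2 = join_project(catalog, {("hat", "Sue")})  # delta for the sale delete    -> empty
V -= A_Q1
V -= A_Q2
print(V)   # {(12, 'Sue')}  -- WRONG: the view should be empty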
165
Yet another anomaly example
Depts = π_Dept(catalog ⋈ Store); initially catalog(Dept, City) = {[Shoes, NY]} and Store(City, Addr) = {}. Insert(Store[NY, Madison Ave]) triggers Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]); Insert(catalog[Bags, NY]) triggers Q2 = π_Dept([Bags, NY] ⋈ Store). Evaluated after both inserts, A(Q1) = {[Shoes], [Bags]} and A(Q2) = {[Bags]}, so the warehouse ends up with Depts = {Shoes, Bags, Bags} — a spurious duplicate. September 16, 2018 Data Mining: Concepts and Techniques
166
Eager Compensating Algorithm (ECA)
Principle: send compensating queries to offset the effect of concurrent updates ONLY GOOD IF ALL THE SOURCE RELATIONS ARE STORED IN ONE NODE (ONE SOURCE). September 16, 2018 Data Mining: Concepts and Techniques
167
Anomaly example revisited (ECA)
Depts = π_Dept(catalog ⋈ Store), as before. With ECA, when Insert(catalog[Bags, NY]) arrives while Q1 = π_Dept(catalog ⋈ [NY, Madison Ave]) is still unanswered, the warehouse sends the compensated query Q2 = π_Dept([Bags, NY] ⋈ Store) − π_Dept([Bags, NY] ⋈ [NY, Madison Ave]). Now A(Q1) = {[Shoes], [Bags]} and A(Q2) = ∅, so Depts = {Shoes, Bags} — the duplicate is avoided. September 16, 2018 Data Mining: Concepts and Techniques
168
Data Mining: Concepts and Techniques
ECA Algorithm SOURCE: S_upi: execute Ui; send Ui to the DW (trigger W_upi at the DW). S_qui: receive Qi; let Ai = Qi(ssi), where ssi is the current source state; send Ai to the DW (trigger W_ansi at the DW). DATA WAREHOUSE (DW): W_upi: receive Ui; Qi = V⟨Ui⟩ − Σ_{Qj ∈ UQS} Qj⟨Ui⟩; UQS = UQS + {Qi}; send Qi to S (trigger S_qui at S). W_ansi: receive Ai; COL = COL + Ai; UQS = UQS − {Qi}; if UQS = ∅ then MV = MV + COL and COL = ∅. (UQS = unanswered query set.) September 16, 2018 Data Mining: Concepts and Techniques
169
Data Mining: Concepts and Techniques
ECA-key Avoids the need for compensating queries. Necessary condition: the view contains key attributes for each of the base tables (e.g., star schema) September 16, 2018 Data Mining: Concepts and Techniques
170
Data Mining: Concepts and Techniques
Example of ECA-key UQS = UQS = {Q2} UQS = {Q1} UQS=UQS+{Q2}={Q1,Q2} COL = {[bags,Jane]} COL = {[bags,Jane], [bags,Sue]} COL = {[hat,Sue]} COL = Item clerk Sells bags Sue bagsJane Q1= i,d(catalog [hat,Jane]) Insert(catalog[bag,acc])) A1 = {[bag,Jane]} hat Sue Q2= i,c([bags,acc] emp) A(Q2) = {[bags,Sue],[bags,Jane]} Delete(catalog,[hat,acc]) bags acc Item dept. catalog hat acc bags acc Item dept. catalog hat acc Item dept. catalog Insert (sale[acc,Jane]) item clerk emp acc Sue acc Jane September 16, 2018 Data Mining: Concepts and Techniques
171
Strobe algorithm ideas
Apply actions only after a set of interleaving updates are all processed Wait for sources to quiesce Compensate effects of interleaved updates Subtract effects of later updates before installing changes Can combine these ideas STROBE IS A FAMILY OF ALGORITHMS
172
Data Mining: Concepts and Techniques
Strobe Terminology The materialized view MV is the current state of the view at the warehouse, V(ws). Given a query Q that needs to be evaluated, the function next_source(Q) returns the pair (x, Qi), where x is the next source to contact and Qi is the portion of the query that can be answered by x. Example: if V = r1 ⋈ r2 ⋈ r3 and U is an update received from r2, then Q = r1 ⋈ U ⋈ r3 and next_source(Q) = (r1, r1 ⋈ U) September 16, 2018 Data Mining: Concepts and Techniques
173
Data Mining: Concepts and Techniques
Strobe terminology (2) Source_evaluate(Q): /* returns the answer to Q */ Begin i = 0; WQ = Q; A0 = Q; (x, Q1) ← next_source(WQ); While x is not nil do: let i = i + 1; send Qi to source x; when x returns Ai, let WQ = WQ⟨Ai⟩ (substitute the answer into the remaining query); let (x, Qi+1) ← next_source(WQ); Return(Ai); End September 16, 2018 Data Mining: Concepts and Techniques
174
Data Mining: Concepts and Techniques
Strobe Algorithm Source: after executing Ui, send Ui to the DW; when receiving query Qi, compute Ai over the current source state ss[x] and send Ai to the DW. DW: AL = ∅ (action list). When update Ui is received: if Ui is a deletion, then for every Qj ∈ UQS add Ui to pend(Qj), and add key_del(MV, Ui) to AL; if Ui is an insertion, then Qi = V⟨Ui⟩, pend(Qi) = ∅, Ai = source_evaluate(Qi); for every Uj ∈ pend(Qi) apply key_del(Ai, Uj); add insert(MV, Ai) to AL. When UQS = ∅, apply AL to MV as a single transaction, without adding duplicate tuples to MV, and reset AL. September 16, 2018 Data Mining: Concepts and Techniques
175
Example with Strobe
View Sold(price, item, clerk, age) = catalog ⋈ sale ⋈ emp; initially MV = ∅. U1 = Insert(sale, [hat, Sue]): the warehouse issues Q1 = catalog ⋈ [hat, Sue] ⋈ emp via source_evaluate; the first step Q11 = catalog ⋈ [hat, Sue] returns A11 = [$12, hat, Sue]. Meanwhile U2 = Del(catalog[$12, hat]) arrives, so key_del(MV, U2) is added to AL and U2 is added to pend(Q1). The second step Q12 = [$12, hat, Sue] ⋈ emp returns A12 = [$12, hat, Sue, 26]; applying key_del(A12, U2) leaves A1 = ∅, so nothing is added to AL for Q1. When UQS = ∅ the action list is applied and MV = ∅ — the correct result
176
Data Mining: Concepts and Techniques
Transaction-Strobe Source transaction T1 = {delete(sale, [hat, Sue]), insert(sale, [shoes, Jane])}. The warehouse collects the corresponding actions, AL = {del([hat, Sue]), ins([shoes, Jane])}, and applies them to MV as a single transaction, taking MV from {[hat, Sue]} to {[shoes, Jane]}. September 16, 2018 Data Mining: Concepts and Techniques
177
Multiple table consistency
More than 1 table at the warehouse Multiple tables share source data Updates at a source should be reflected in all warehouse tables at the same time Example: V1: Customer-info, V2: Cust-prefs, V3: Total-sales, derived from source tables Customer and Sales
178
Multiple table consistency
(figure: warehouse tables V1, V2, ..., Vn — single-table consistency — derived from sources S1, S2, ..., Sm — source consistency)
179
Painting algorithm Use merge process (MP) to coordinate sending updates to warehouse MP holds update actions for each table MP charts potential table states arising from each set of update actions MP sends batch of update actions together when tables will be consistent
180
Data Mining: Concepts and Techniques
Lecture #7 September 16, 2018 Data Mining: Concepts and Techniques
181
Chapter 7. Classification and Prediction
What is classification? What is prediction? Classification by decision tree induction Bayesian Classification Other Classification Methods (SVM) Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
182
Classification problem
Given: Tuples, each assigned a class label. Develop a model for each class Example: Good creditor: (age in [25,40]) AND (income > 50K) AND (status = MARRIED) Applications: Credit approval (good, bad) Store locations (good, fair, poor) Emergency situations (emergency, non-emergency) September 16, 2018 Data Mining: Concepts and Techniques
183
Classification vs. Prediction
Classification: predicts categorical class labels; classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data Prediction: models continuous-valued functions, i.e., predicts unknown or missing values Typical Applications credit approval target marketing medical diagnosis treatment effectiveness analysis September 16, 2018 Data Mining: Concepts and Techniques
184
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute The set of tuples used for model construction: training set The model is represented as classification rules, decision trees, or mathematical formulae Model usage: for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model Test set is independent of training set, otherwise over-fitting will occur September 16, 2018 Data Mining: Concepts and Techniques
185
Supervised vs. Unsupervised Learning
Supervised learning (classification) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations New data is classified based on the training set Unsupervised learning (clustering) The class labels of the training data are unknown Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data September 16, 2018 Data Mining: Concepts and Techniques
186
Chapter 7. Classification and Prediction
What is classification? What is prediction? Classification by decision tree induction Bayesian Classification Other Classification Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
187
Classification by Decision Tree Induction
A flow-chart-like tree structure Internal node denotes a test on an attribute Branch represents an outcome of the test Leaf nodes represent class labels or class distribution Decision tree generation consists of two phases Tree construction At start, all the training examples are at the root Partition examples recursively based on selected attributes Tree pruning Identify and remove branches that reflect noise or outliers Use of decision tree: Classifying an unknown sample Test the attribute values of the sample against the decision tree September 16, 2018 Data Mining: Concepts and Techniques
188
Data Mining: Concepts and Techniques
Training Dataset This follows an example from Quinlan’s ID3 September 16, 2018 Data Mining: Concepts and Techniques
189
Output: A Decision Tree for “buys_computer”
age? <=30 → student? (no → no, yes → yes); 31..40 → yes; >40 → credit rating? (excellent → no, fair → yes) September 16, 2018 Data Mining: Concepts and Techniques
190
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm) Tree is constructed in a top-down recursive divide-and-conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are discretized in advance) Examples are partitioned recursively based on selected attributes Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning All samples for a given node belong to the same class There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf There are no samples left September 16, 2018 Data Mining: Concepts and Techniques
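A compact sketch of this greedy, top-down procedure (not taken from the slides): categorical attributes, information gain as the selection measure, and majority voting at impure leaves. The toy tuples and attribute names are illustrative.

# Minimal ID3-style sketch: top-down recursive divide-and-conquer induction.
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all samples in the same class
        return labels[0]
    if not attrs or not rows:                 # no attributes / no samples left
        return Counter(labels).most_common(1)[0][0]   # majority voting
    def gain(a):
        split = {}
        for row, y in zip(rows, labels):
            split.setdefault(row[a], []).append(y)
        return info(labels) - sum(len(s) / len(labels) * info(s) for s in split.values())
    best = max(attrs, key=gain)               # attribute with highest information gain
    node = {}
    for value in {row[best] for row in rows}: # partition examples recursively
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        node[(best, value)] = build_tree([r for r, _ in sub], [y for _, y in sub],
                                         [a for a in attrs if a != best])
    return node

rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31..40", "student": "no"}, {"age": ">40", "student": "no"}]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, ["age", "student"]))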
191
Data Mining: Concepts and Techniques
Decision trees Training set (figure: a small decision tree fitted to the training set, splitting first on Salary < 20000 and then on Education = G, with leaf classes A and R) September 16, 2018 Data Mining: Concepts and Techniques
192
Data Mining: Concepts and Techniques
Decision trees Pros: fast; rules are easy to interpret; handles high-dimensional data. Cons: does not model correlations between attributes; only axis-parallel cuts. September 16, 2018 Data Mining: Concepts and Techniques
193
Data Mining: Concepts and Techniques
Decision trees (cont.) Machine learning: ID3 (Quinlan, 1986) C4.5 (Quinlan, 1993) CART (Breiman, Friedman, Olshen, Stone, Classification and Regression Trees, 1984) Database: SLIQ (Mehta, Agrawal and Rissanen, EDBT’96) SPRINT (Shafer, Agrawal, Mehta, VLDB’96) RainForest (Gehrke, Ramakrishnan, Ganti, VLDB’98) September 16, 2018 Data Mining: Concepts and Techniques
194
Data Mining: Concepts and Techniques
Decision trees Finding the best tree is NP-Hard We look at non-backtracking algorithms (never look back at a previous decision) Assume we have a test with n outcomes that partitions T into subsets T1, T2,…, Tn If the test is to be evaluated without exploring subsequent dimensions of the Ti’s, the only information available for guidance is the distribution of classes in T and its subsets. September 16, 2018 Data Mining: Concepts and Techniques
195
Decision tree algorithms
Building phase: Recursively split nodes using best splitting attribute and value for node Pruning phase: Smaller (yet imperfect) tree achieves better prediction accuracy. Prune leaf nodes recursively to avoid over-fitting. September 16, 2018 Data Mining: Concepts and Techniques
196
Predictor variables (attributes)
Numerically ordered: values are ordered and can be represented on the real line. (E.g., salary.) Categorical: takes values from a finite set without any natural ordering. (E.g., color.) Ordinal: takes values from a finite set whose values possess a clear ordering, but the distances between them are unknown. (E.g., preference scale: good, fair, bad.) September 16, 2018 Data Mining: Concepts and Techniques
197
Data Mining: Concepts and Techniques
Binary Splits Recursive (binary) partitioning Univariate split: on numerically ordered or ordinal X, X <= c; on categorical X, X ∈ A Linear combination split on numerical attributes: Σ ai Xi <= c c and A are chosen to maximize separation. September 16, 2018 Data Mining: Concepts and Techniques
198
Data Mining: Concepts and Techniques
Some probability... S = a set of cases; freq(Ci, S) = # cases in S that belong to class Ci Gain is an entropic measure: Prob(“this case belongs to Ci”) = freq(Ci, S)/|S| Information conveyed: -log(freq(Ci, S)/|S|) Entropy = expected information = -Σi (freq(Ci, S)/|S|) log(freq(Ci, S)/|S|) = info(S) September 16, 2018 Data Mining: Concepts and Techniques
199
Data Mining: Concepts and Techniques
Gain Test X splits T into subsets T1, ..., Tn: infoX(T) = Σi (|Ti|/|T|) · info(Ti) gain(X) = info(T) - infoX(T) September 16, 2018 Data Mining: Concepts and Techniques
200
Data Mining: Concepts and Techniques
Example Info(T): 9 play, 5 don’t → info(T) = -9/14 log(9/14) - 5/14 log(5/14) = 0.94 (bits) Test outlook: infoOutlook = 5/14 (-2/5 log(2/5) - 3/5 log(3/5)) + 4/14 (-4/4 log(4/4)) + 5/14 (-3/5 log(3/5) - 2/5 log(2/5)); gainOutlook = 0.94 - 0.662 = 0.278 Test windy: infoWindy = 7/14 (-4/7 log(4/7) - 3/7 log(3/7)) + 7/14 (-5/7 log(5/7) - 2/7 log(2/7)); gainWindy = 0.94 - 0.64 = 0.3 (bits) Windy is a better test September 16, 2018 Data Mining: Concepts and Techniques
201
Data Mining: Concepts and Techniques
Problem with Gain Strong bias towards tests with many outcomes. Example: Z = Name, so each |Ti| = 1 (each name is unique) infoZ(T) = Σi (1/|T|) · info(Ti) = 0 → maximal gain!! (but a useless division — overfitting) September 16, 2018 Data Mining: Concepts and Techniques
202
Data Mining: Concepts and Techniques
Split Split-info(X) = -Σi (|Ti|/|T|) log(|Ti|/|T|) gain-ratio(X) = gain(X)/split-info(X) gain(X) <= log(k) while split-info(X) can reach log(n), so for tests with many outcomes the ratio becomes small September 16, 2018 Data Mining: Concepts and Techniques
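A short sketch of gain and gain ratio as defined above, using base-2 logs. The candidate test (splitting 14 cases with 9/5 class counts into subsets of 5, 4, and 5 cases) is illustrative.

from math import log2

def info(class_counts):
    n = sum(class_counts)
    return -sum(c / n * log2(c / n) for c in class_counts if c)

def gain_and_ratio(total_counts, partition_counts):
    n = sum(total_counts)
    info_x = sum(sum(p) / n * info(p) for p in partition_counts)          # infoX(T)
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partition_counts)
    g = info(total_counts) - info_x                                       # gain(X)
    return g, g / split_info                                              # gain-ratio(X)

# 9 "play" vs 5 "don't"; a 3-way split with per-subset class counts (2,3), (4,0), (3,2)
print(gain_and_ratio([9, 5], [[2, 3], [4, 0], [3, 2]]))
# (~0.247, ~0.156)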
203
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes” September 16, 2018 Data Mining: Concepts and Techniques
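A small sketch of this rule extraction, one IF-THEN rule per root-to-leaf path. The nested-dict tree encoding {(attribute, value): subtree-or-label} is an assumption, matching the induction sketch shown earlier.

def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):                      # leaf: emit one rule
        ants = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {ants} THEN class = "{tree}"']
    rules = []
    for (attr, value), subtree in tree.items():         # extend the conjunction
        rules.extend(tree_to_rules(subtree, conditions + ((attr, value),)))
    return rules

tree = {("age", "<=30"): {("student", "no"): "no", ("student", "yes"): "yes"},
        ("age", "31..40"): "yes",
        ("age", ">40"): {("credit_rating", "excellent"): "no",
                         ("credit_rating", "fair"): "yes"}}
for rule in tree_to_rules(tree):
    print(rule)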
204
Data Mining: Concepts and Techniques
OVERFITTING Decision trees can grow so long that there is a leaf for each training example. Extremes: Overfitted: “Whatever I haven’t seen can’t be classified” Too General: “If it is green, it is a tree” September 16, 2018 Data Mining: Concepts and Techniques
205
Avoid Overfitting in Classification
The generated tree may overfit the training data Too many branches, some may reflect anomalies due to noise or outliers Result is in poor accuracy for unseen samples Two approaches to avoid overfitting Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold Difficult to choose an appropriate threshold Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees Use a set of data different from the training data to decide which is the “best pruned tree” September 16, 2018 Data Mining: Concepts and Techniques
206
Approaches to Determine the Final Tree Size
Separate training (2/3) and testing (1/3) sets Use cross validation, e.g., 10-fold cross validation Use all the data for training but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution Use minimum description length (MDL) principle: halting growth of the tree when the encoding is minimized September 16, 2018 Data Mining: Concepts and Techniques
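A sketch of k-fold cross-validation as listed above (not from the slides): split the data into k folds, train on k-1 folds, test on the held-out fold, and average the accuracy. The classifier here is a stand-in (a majority-class predictor) just to make the skeleton runnable.

import random
from collections import Counter

def k_fold_accuracy(rows, labels, train, classify, k=10, seed=0):
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]                 # k roughly equal folds
    accs = []
    for test_idx in folds:
        test = set(test_idx)
        train_idx = [i for i in idx if i not in test]
        model = train([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(classify(model, rows[i]) == labels[i] for i in test_idx)
        accs.append(hits / len(test_idx))
    return sum(accs) / k

# Trivial "classifier": always predict the majority class of the training fold.
train = lambda X, y: Counter(y).most_common(1)[0][0]
classify = lambda model, x: model
rows = list(range(20)); labels = ["yes"] * 14 + ["no"] * 6
print(k_fold_accuracy(rows, labels, train, classify))     # estimated accuracy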
207
Enhancements to basic decision tree induction
Allow for continuous-valued attributes Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals Handle missing attribute values Assign the most common value of the attribute Assign probability to each of the possible values Attribute construction Create new attributes based on existing ones that are sparsely represented This reduces fragmentation, repetition, and replication September 16, 2018 Data Mining: Concepts and Techniques
208
Classification in Large Databases
Classification—a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods September 16, 2018 Data Mining: Concepts and Techniques
209
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.) builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT (VLDB’96 — J. Shafer et al.) constructs an attribute list data structure PUBLIC (VLDB’98 — Rastogi & Shim) integrates tree splitting and tree pruning: stop growing the tree earlier RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree builds an AVC-list (attribute, value, class label) September 16, 2018 Data Mining: Concepts and Techniques
210
Data Mining: Concepts and Techniques
SPRINT For large data sets. (figure: example tree splitting on Age < 25 and then on Car = Sports, with leaf risk classes H, H, and L) September 16, 2018 Data Mining: Concepts and Techniques
211
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as gini(T) = 1 - Σj pj², where pj is the relative frequency of class j in T. If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is ginisplit(T) = (N1/N) gini(T1) + (N2/N) gini(T2). The attribute that provides the smallest ginisplit(T) is chosen to split the node (all possible splitting points must be enumerated for each attribute). September 16, 2018 Data Mining: Concepts and Techniques
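A direct sketch of the two formulas above. The class-count inputs are illustrative: a split of 6 tuples into subsets with per-class counts (2, 0) and (2, 2), matching the sports split in the later example.

def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)   # gini(T) = 1 - sum pj^2

def gini_split(partition_counts):
    n = sum(sum(p) for p in partition_counts)
    return sum(sum(p) / n * gini(p) for p in partition_counts)

print(gini([2, 0]))                     # 0.0   (pure subset)
print(gini([2, 2]))                     # 0.5
print(gini_split([[2, 0], [2, 2]]))     # 2/6 * 0.0 + 4/6 * 0.5 = 0.333...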
212
Data Mining: Concepts and Techniques
SPRINT Partition(S): if all points of S are in the same class, return; else: for each attribute A, evaluate_splits on A; use the best split to partition S into S1 and S2; Partition(S1); Partition(S2) September 16, 2018 Data Mining: Concepts and Techniques
213
SPRINT Data Structures
Training set Age Car Attribute lists September 16, 2018 Data Mining: Concepts and Techniques
214
Data Mining: Concepts and Techniques
Splits (figure: splitting on Age < 27.5 divides the attribute lists into Group 1 and Group 2) September 16, 2018 Data Mining: Concepts and Techniques
215
Data Mining: Concepts and Techniques
Histograms For continuous attributes, two histograms are associated with each node: Cbelow for the tuples already processed and Cabove for the tuples still to process September 16, 2018 Data Mining: Concepts and Techniques
216
Data Mining: Concepts and Techniques
Example ginisplit3 =3/6 gini(S1) +3/6 gini(S2) gini(S1) = 1 - [(3/3) 2 ] = 0 gini(S2) = 1 - [(1/3)2 +(2/3)2 ] = 0.444 ginisplit0 = 0/6 gini(S1) + 6/6 gini(S2) gini(S2) = 1 - [(4/6)2 +(2/6)2 ] = 0.444 ginisplit2 = 2/6 gini(S1) +4/6 gini(S2) gini(S1) = 1 - [(2/2) 2 ] = 0 gini(S2) = 1 - [(2/4)2 +(2/4)2 ] = 0.5 ginisplit5 =6/6 gini(S1) +0/6 gini(S2) gini(S1) = 1 - [(4/6) 2 +(2/6) 2 ] = 0.320 ginisplit4 =4/6 gini(S1) +2/6 gini(S2) gini(S1) = 1 - [(3/4) 2 +(1/4) 2 ] = 0.375 gini(S2) = 1 - [(1/2)2 +(1/2)2 ] = 0.5 ginisplit5 =5/6 gini(S1) +1/6 gini(S2) gini(S1) = 1 - [(4/5) 2 +(1/5) 2 ] = 0.320 gini(S2) = 1 - [(1/1)2 ] = 0 ginisplit1 = 1/6 gini(S1) +5/6 gini(S2) gini(S1) = 1 - [(1/1) 2 ] = 0 gini(S2) = 1 - [(3/4)2 +(2/4)2 ] = ginisplit0 = 0.444 ginisplit1= 0.156 ginisplit2= 0.333 ginisplit3= 0.222 ginisplit4= 0.416 ginisplit5= 0.222 Age <= 18.5 ginisplit6= 0.444 September 16, 2018 Data Mining: Concepts and Techniques
217
Splitting categorical attributes
Single scan through the attribute list collecting counts on count matrix for each combination of class label + attribute value September 16, 2018 Data Mining: Concepts and Techniques
218
Data Mining: Concepts and Techniques
Example ginisplit(family) = 3/6 gini(S1) + 3/6 gini(S2); gini(S1) = 1 - [(2/3)² + (1/3)²] = 4/9; gini(S2) = 1 - [(2/3)² + (1/3)²] = 4/9 ginisplit(sports) = 2/6 gini(S1) + 4/6 gini(S2); gini(S1) = 1 - [(2/2)²] = 0; gini(S2) = 1 - [(2/4)² + (2/4)²] = 0.5 ginisplit(truck) = 1/6 gini(S1) + 5/6 gini(S2); gini(S1) = 1 - [(1/1)²] = 0; gini(S2) = 1 - [(4/5)² + (1/5)²] = 0.32 ginisplit(family) = 0.444, ginisplit(sports) = 0.333, ginisplit(truck) = 0.266 → best categorical split: Car Type = Truck September 16, 2018 Data Mining: Concepts and Techniques
219
Data Mining: Concepts and Techniques
Example (2 attributes) The winner is Age <= 18.5 (figure: root node split Age <= 18.5 with Y/N branches; the Y branch is a leaf of class H) September 16, 2018 Data Mining: Concepts and Techniques
220
Data Mining: Concepts and Techniques
Performing the split Create 2 child nodes Split the attribute list of the winning attribute For the remaining attributes: insert tuple ids into a hash table recording which child each tuple belongs to, then scan the other attribute lists and probe the hash table (the table may be too large for memory and require several passes). September 16, 2018 Data Mining: Concepts and Techniques
221
Data Mining: Concepts and Techniques
Drawbacks Large explosion of space (possibly tripling the size of database). Costly Hash-Join. September 16, 2018 Data Mining: Concepts and Techniques
222
Chapter 7. Classification and Prediction
What is classification? What is prediction? Classification by decision tree induction Bayesian Classification Other methods (SVM) Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
223
Data Mining: Concepts and Techniques
Bayesian Theorem Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes’ theorem: P(h|D) = P(D|h) P(h) / P(D) MAP (maximum a posteriori) hypothesis: hMAP = argmaxh P(h|D) = argmaxh P(D|h) P(h) Practical difficulty: requires initial knowledge of many probabilities, significant computational cost September 16, 2018 Data Mining: Concepts and Techniques
224
Naïve Bayes Classifier (I)
A simplified assumption: attributes are conditionally independent given the class: P(X|Ci) = Πk P(xk|Ci) Greatly reduces the computation cost: only the class distribution and per-attribute counts are needed. September 16, 2018 Data Mining: Concepts and Techniques
225
Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities September 16, 2018 Data Mining: Concepts and Techniques
226
Data Mining: Concepts and Techniques
Example E = {outlook = sunny, temp = [64,70], humidity = [65,70], windy = y} = {E1, E2, E3, E4} Pr[“Play”|E] = (Pr[E1|Play] × Pr[E2|Play] × Pr[E3|Play] × Pr[E4|Play] × Pr[Play]) / Pr[E] = (3/9 × 2/9 × 3/9 × 4/9 × 9/14) / Pr[E] = 0.007 / Pr[E] Pr[“Don’t”|E] = (3/5 × 2/5 × 1/5 × 3/5 × 5/14) / Pr[E] = 0.010 / Pr[E] Normalizing over the two classes: Pr[“Play”|E] = 41%, Pr[“Don’t”|E] = 59% September 16, 2018 Data Mining: Concepts and Techniques
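A few lines reproducing the computation above: multiply the per-attribute conditional probabilities by the class prior, then normalize over the two classes (Pr[E] cancels out).

def naive_bayes_score(cond_probs, prior):
    score = prior
    for p in cond_probs:
        score *= p
    return score

play = naive_bayes_score([3/9, 2/9, 3/9, 4/9], 9/14)   # ~0.007 (before dividing by Pr[E])
dont = naive_bayes_score([3/5, 2/5, 1/5, 3/5], 5/14)   # ~0.010
total = play + dont                                     # normalization makes Pr[E] cancel
print(round(play / total, 2), round(dont / total, 2))   # 0.41 0.59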
227
Bayesian Belief Networks (I)
Bayesian Belief Networks Nodes: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea (figure: the network DAG). The conditional probability table for the variable LungCancer (columns are the parent configurations):
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)
LC 0.8 0.5 0.7 0.1
~LC 0.2 0.5 0.3 0.9
September 16, 2018 Data Mining: Concepts and Techniques
228
Bayesian Belief Networks (II)
A Bayesian belief network allows conditional independencies to hold between subsets of the variables A graphical model of causal relationships Several cases of learning Bayesian belief networks: Given both the network structure and all the variables: easy Given the network structure but only some of the variables When the network structure is not known in advance September 16, 2018 Data Mining: Concepts and Techniques
229
Another Example (Friedman & Goldzsmidt)
Variables: Burglary, Earthquake, Alarm, Neighbor call, Radio announcement. Burglary and Earthquake are independent (P(B, E) = P(B) · P(E)) Burglary and Radio announcement are independent given Earthquake (P(B, R | E) = P(B | E) · P(R | E)) So P(A, R, E, B) = P(A | R, E, B) · P(R | E, B) · P(E | B) · P(B) can be reduced to: P(A, R, E, B) = P(A | E, B) · P(R | E) · P(E) · P(B) September 16, 2018 Data Mining: Concepts and Techniques
230
Data Mining: Concepts and Techniques
Example (cont.) (figure: the network DAG — Burglary → Alarm, Earthquake → Alarm, Earthquake → Radio announcement, Alarm → Neighbor call) September 16, 2018 Data Mining: Concepts and Techniques
231
Data Mining: Concepts and Techniques
Example (cont.) Associated with each node is a set of conditional probability distributions. For example, the "Alarm" node might have the following probability distribution September 16, 2018 Data Mining: Concepts and Techniques
232
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Other Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
233
Extending linear classification
Problem: all the algorithms we covered (plus many others) can only represent linear boundaries between classes (e.g., a single cut such as Age <= 25 vs. Age > 25) Too simplistic for many real cases September 16, 2018 Data Mining: Concepts and Techniques
234
Nonlinear class boundaries
Support vector machines (SVM) -- a misnomer, since they are algorithms, not machines -- Idea: use a non-linear mapping to transform the space into a new space. Example (two attributes a1, a2): x = w1 a1³ + w2 a1² a2 + w3 a1 a2² + w4 a2³ September 16, 2018 Data Mining: Concepts and Techniques
235
Data Mining: Concepts and Techniques
SVMs Based on an algorithm that finds a maximum margin hyperplane (linear model). (figure: the convex hulls of the two classes — the tightest enclosing polygons — the shortest line connecting the hulls, the maximum margin hyperplane, and the support vectors) September 16, 2018 Data Mining: Concepts and Techniques
236
Data Mining: Concepts and Techniques
SVMs (cont.) We have assumed that the two classes are linearly separable, so their convex hulls cannot overlap. The maximum margin hyperplane (MMH) is the one that is as far away as possible from both convex hulls; it is orthogonal to the shortest line connecting the hulls. The instances closest to the MMH (minimum distance to the line) are called support vectors (SVs) — at least one for each class, often more. Given the SVs, we can easily construct the MMH. All other training points can be deleted without any effect on the MMH. September 16, 2018 Data Mining: Concepts and Techniques
237
Data Mining: Concepts and Techniques
SVMs (cont.) A hyperplane that separates the two classes can be written as x = w0 + w1 a1 + w2 a2 for a two-attribute case. However, the equation that defines the MMH can be written in terms of the SVs. Write the class value y of a training instance (point) as 1 (yes) or -1 (no). Then the MMH is: x = b + Σ_{i ∈ SVs} αi yi a(i) · a, where yi is the class value of support vector a(i); b and the αi are numeric values to be determined; a is a test point. September 16, 2018 Data Mining: Concepts and Techniques
238
Data Mining: Concepts and Techniques
SVMs (cont.) So, now… use the training values to determine b and the αi in x = b + Σ_{i ∈ SVs} αi yi a(i) · a (a(i) · a is a dot product). This is a standard optimization problem: constrained quadratic optimization (off-the-shelf software packages exist to solve it: Fletcher, Practical Methods of Optimization, 1987). September 16, 2018 Data Mining: Concepts and Techniques
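A usage sketch assuming scikit-learn is available (the slides do not name a library): a degree-3 polynomial kernel plays the role of the nonlinear mapping, the solver handles the quadratic optimization, and the fitted model exposes its support vectors. The synthetic data and parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)   # a nonlinear (circular) class boundary

# Polynomial kernel of degree 3; coef0=1 includes the lower-degree terms as well.
clf = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)
clf.fit(X, y)

print(len(clf.support_vectors_))          # the instances that define the MMH
print(clf.score(X, y))                    # training accuracy
print(clf.predict([[0.1, 0.1], [0.9, 0.8]]))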
239
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Other Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
240
Classification Accuracy: Estimating Error Rates
Partition: Training-and-testing use two independent data sets, e.g., training set (2/3), test set(1/3) used for data set with large number of samples Cross-validation divide the data set into k subsamples use k-1 subsamples as training data and one sub-sample as test data --- k-fold cross-validation for data set with moderate size Bootstrapping (leave-one-out) for small size data September 16, 2018 Data Mining: Concepts and Techniques
241
Data Mining: Concepts and Techniques
Boosting and Bagging Boosting increases classification accuracy Applicable to decision trees or Bayesian classifier Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor Boosting requires only linear time and constant space September 16, 2018 Data Mining: Concepts and Techniques
242
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
243
Data Mining: Concepts and Techniques
What Is Prediction? Prediction is similar to classification First, construct a model Second, use the model to predict unknown values The major method for prediction is regression Linear and multiple regression Non-linear regression Prediction is different from classification Classification predicts a categorical class label Prediction models continuous-valued functions September 16, 2018 Data Mining: Concepts and Techniques
244
Predictive Modeling in Databases
Predictive modeling: Predict data values or construct generalized linear models based on the database data. One can only predict value ranges or category distributions Method outline: Minimal generalization Attribute relevance analysis Generalized linear model construction Prediction Determine the major factors which influence the prediction Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc. Multi-level prediction: drill-down and roll-up analysis September 16, 2018 Data Mining: Concepts and Techniques
245
Regress Analysis and Log-Linear Models in Prediction
Linear regression: Y = α + β X. The two parameters, α and β, specify the line and are to be estimated from the data at hand, applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, …. Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into the above. Log-linear models: The multi-way table of joint probabilities is approximated by a product of lower-order tables. Probability: p(a, b, c, d) = αab βac χad δbcd September 16, 2018 Data Mining: Concepts and Techniques
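A short sketch of fitting Y = α + β X with the least squares criterion, in closed form, plus a numpy cross-check. The data points are illustrative.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Closed-form least squares estimates of the slope (beta) and intercept (alpha).
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)

# Cross-check with numpy's degree-1 polynomial fit (returns [beta, alpha]).
print(np.polyfit(x, y, 1))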
246
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Other Classification Methods Classification accuracy Prediction Summary September 16, 2018 Data Mining: Concepts and Techniques
247
Data Mining: Concepts and Techniques
Summary Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks) Classification is probably one of the most widely used data mining techniques with a lot of extensions Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.. September 16, 2018 Data Mining: Concepts and Techniques
248
Data Warehousing Design And Implementation
Data Warehousing Design And Implementation Yong Ye September 16, 2018 Data Mining: Concepts and Techniques
249
Data Mining: Concepts and Techniques
Outline Conceptual design Business requirement, scope of application Logical design Define the types of information you need Physical design Creation of the data warehouse with SQL statements September 16, 2018 Data Mining: Concepts and Techniques
250
Designing Data Warehouses
To begin a data warehouse project, need to find answers for questions such as: Which user requirements are most important and which data should be considered first? Should project be scaled down into something more manageable? Should infrastructure for a scaled down project be capable of ultimately delivering a full-scale enterprise-wide data warehouse? September 16, 2018 Data Mining: Concepts and Techniques
251
Designing Data Warehouses
For many enterprises, the way to avoid the complexities associated with designing a data warehouse is to start by building one or more data marts. Data marts allow designers to build something that is far simpler and achievable for a specific group of users. September 16, 2018 Data Mining: Concepts and Techniques
252
Designing Data Warehouses
The requirements collection and analysis stage of a data warehouse project involves interviewing appropriate members of staff (such as marketing users, finance users, and sales users) to enable identification of a prioritized set of requirements that the data warehouse must meet. September 16, 2018 Data Mining: Concepts and Techniques
253
Designing Data Warehouses
At the same time, interviews are conducted with members of staff responsible for operational systems to identify which data sources can provide clean, valid, and consistent data that will remain supported over the next few years. September 16, 2018 Data Mining: Concepts and Techniques
254
Designing Data Warehouses
Architecture of a data warehouse September 16, 2018 Data Mining: Concepts and Techniques
255
Design Methodology for Data Warehouses
Four steps: Choosing a business process to model Choosing the grain Identifying the dimensions Choosing the measure September 16, 2018 Data Mining: Concepts and Techniques
256
Step 1: Choosing The Process
The process (function) refers to the subject matter of a particular data mart. First data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions. September 16, 2018 Data Mining: Concepts and Techniques
257
Step 2: Choosing The Grain
Decide what a record of the fact table is to represent. Also include time as a core dimension, which is always present in star schemas. September 16, 2018 Data Mining: Concepts and Techniques
258
Step 3: Identifying and Conforming the Dimensions
Dimensions set the context for asking questions about the facts in the fact table. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. September 16, 2018 Data Mining: Concepts and Techniques
259
Step 4: Choosing The Measure of Facts
Typical measures are numeric additive quantities like dollars_sold and unit_sold. September 16, 2018 Data Mining: Concepts and Techniques
260
Fact and Dimension Tables for each Business Process of DreamHome
September 16, 2018 Data Mining: Concepts and Techniques
261
Data Mining: Concepts and Techniques
Physical Design Translate schemas into actual database structures Entities to tables Relationships to foreign key constraints Attributes to columns Primary unique identifiers to primary key constraints September 16, 2018 Data Mining: Concepts and Techniques
262
Data Mining: Concepts and Techniques
Physical Design Most critical physical design issues affecting the end-user’s perception includes: physical sort order of the fact table on disk; presence of pre-stored summaries or aggregations. Indexing September 16, 2018 Data Mining: Concepts and Techniques
263
Data Mining: Concepts and Techniques
Physical Design Dimension Table:
create table customer (
  csid   varchar(30),
  cname  varchar(20) not null,
  gender varchar(10),
  primary key (csid)
);
September 16, 2018 Data Mining: Concepts and Techniques
264
Data Mining: Concepts and Techniques
Physical Design Fact Table:
create table Sales (
  customer_id varchar(30),
  product_id  varchar(50),
  store_id    varchar(50),
  date_id     varchar(50),
  unit_sold   real,
  unit_price  real,
  total_price real,
  primary key (customer_id, product_id, store_id, date_id),
  foreign key (customer_id) references customer (csid),
  foreign key (product_id) references products (pid),
  foreign key (store_id) references store (name),
  foreign key (date_id) references time (tid)
);
September 16, 2018 Data Mining: Concepts and Techniques
265
Data Mining: Concepts and Techniques
Physical Design Dimension - A dimension is a schema object that defines hierarchical relationships between columns or column sets.
CREATE DIMENSION time_dim
  LEVEL day   IS time.tid
  LEVEL month IS time.month
  LEVEL year  IS time.year
  HIERARCHY cal_rollup (
    day CHILD OF
    month CHILD OF
    year
  );
September 16, 2018 Data Mining: Concepts and Techniques
266
Data Mining: Concepts and Techniques
Physical Design Materialized View - you can use materialized views to precompute and store aggregated data such as the sum of sales.
CREATE MATERIALIZED VIEW product_sales AS
  SELECT p.pname, SUM(s.unit_sold) AS totalsale
  FROM sales s, products p
  WHERE s.product_id = p.pid
  GROUP BY p.pname;
Bitmap Index – good for a low-cardinality column.
create bitmap index customer_gender on customer (gender);
September 16, 2018 Data Mining: Concepts and Techniques