 Data warehouses  Decision support  The multidimensional model  OLAP queries.

1

2  Data warehouses  Decision support  The multidimensional model  OLAP queries

3

4  Traditional DBMSs are used by organizations to maintain data recording day-to-day operations  On-line Transaction Processing (OLTP)  Such DBs have driven the growth of the DBMS industry, and will continue to be important  Organizations are increasingly analyzing current and historical data to identify useful patterns  Market research to target products to particular market segments (including political polling)  Decision support for high-level decision-making  On-line Analytic Processing (OLAP)

5  High level decision making requires an overall view of all aspects of an organization  This can be prohibitively expensive with a large, global, distributed database  Many organizations have created large consolidated data warehouses, often gigabytes or terabytes in size  Information from several databases is consolidated into a data warehouse  By copying tables from many sources into one location  The focus is on identifying patterns and global analysis, so having data current is not a priority

6 [Diagram: operational databases and external data sources feed the data warehouse through extract, clean, transform, load, and refresh steps; a metadata repository records information about the warehouse contents]

7  Extract data from operational databases and other sources  Clean data to minimize errors, and fill in missing information where possible  Transform data to reconcile semantic mismatches  Typically done by defining a view over the data sources  Load data by materializing, and storing, the views created in the transformation stage  Sort the data and generate summary information  Partition data and build indexes to increase efficiency
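The extract–clean–transform–load steps above can be sketched in a few lines of Python. Everything here (source rows, field names, exchange rates) is made up for illustration:

```python
# Minimal ETL sketch (hypothetical data): extract rows from two "source
# databases", clean missing values, transform currencies to a common unit,
# and load the result into a list standing in for the warehouse.

def extract(sources):
    """Pull raw rows from every source."""
    return [row for src in sources for row in src]

def clean(rows):
    """Drop rows with no amount; fill a missing region with 'unknown'."""
    out = []
    for row in rows:
        if row.get("amount") is None:
            continue
        out.append({**row, "region": row.get("region") or "unknown"})
    return out

def transform(rows, rates):
    """Reconcile semantic mismatches: convert every amount to CAD."""
    return [{**r, "amount": r["amount"] * rates[r["currency"]], "currency": "CAD"}
            for r in rows]

def load(warehouse, rows):
    warehouse.extend(rows)
    return warehouse

sources = [
    [{"amount": 100, "currency": "CAD", "region": "BC"},
     {"amount": None, "currency": "CAD", "region": "Ont"}],   # dropped by clean
    [{"amount": 50, "currency": "USD", "region": None}],      # region filled in
]
warehouse = load([], transform(clean(extract(sources)), {"CAD": 1.0, "USD": 1.35}))
print(len(warehouse), warehouse[1]["amount"], warehouse[1]["region"])
```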

8  Semantic integration  Eliminate mismatches between data from many sources ▪ e.g. different schemas, currencies  Heterogeneous sources  Data has to be accessed from a variety of source formats  Load, refresh, purge  Data must be loaded and periodically refreshed  Data that is too old should be purged  Metadata management  The source, loading date, and other information should be maintained for all data in the warehouse

9  Traditional SQL queries are inadequate for typical decision support queries  WHERE clauses often contain many AND and OR conditions, and SQL handles OR conditions poorly  Statistical functions are not supported by SQL-92, so queries have to be embedded in host language programs  Many queries involve conditions over time, or aggregation of data over time  Users often need to submit several related queries, but SQL does not support optimization of such families of queries

10  Some systems support querying where the data is considered to be a multidimensional array  Typical queries involve group-by and aggregation operators, complex conditions, and statistical functions  OLAP applications  Some DBMSs are designed to support OLAP queries as well as traditional SQL queries  Relational DBMSs optimized for decision support  A third class of analysis tools supports exploratory data analysis  Identifying interesting patterns in the data, data mining

11  OLAP applications use ad hoc, complex queries involving group-by and aggregation operators  Typically OLAP queries are considered in terms of a multidimensional data model  The data can be represented as a multidimensional array  The focus is on a collection of numeric measures  Each measure depends on a set of dimensions  Each measure has one value, over each set of possible values for the dimensions
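As a rough illustration of the multidimensional model, the measure can be held in a mapping from dimension coordinates to values; aggregating over a dimension then collapses that coordinate. All values below are invented:

```python
# Sketch of the multidimensional model: the measure "sales" has one value
# for each (product_id, location_id, time_id) combination (made-up data).
sales = {}  # dimension coordinates -> measure value
for p in (11, 12, 13):
    for l in (1, 2):
        for t in (1, 2, 3):
            sales[(p, l, t)] = p * l + t   # arbitrary values for illustration

# Aggregating over dimensions collapses them: total sales per product.
by_product = {}
for (p, l, t), v in sales.items():
    by_product[p] = by_product.get(p, 0) + v
print(by_product[11])
```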

12 [Figure: a three-dimensional array of sales values with dimensions product_id, location_id, and time_id; the slice where location_id = 1 is highlighted]  Measure is sales; dimensions are product, location, and time

13  Multidimensional data can be stored physically in a multidimensional array, MOLAP  The array must be disk-resident and persistent  Or, data can be stored in relations, ROLAP  The relation which relates the dimensions to a particular measure is known as the fact table  Each dimension (location, product, etc.) can have additional attributes stored in a dimension table  Fact tables are usually much larger than dimension tables

14  Dimensions can be considered as hierarchies  Attributes have positions in the hierarchies  In many OLAP applications it is necessary to store information about time  Such data often cannot be represented by an SQL date or timestamp ▪ e.g. date, week, month, quarter, year, holiday status, … [Figure: dimension hierarchies — date rolls up to week and month, month to quarter, quarter to year; city rolls up to province, province to country]

15  Star schemas are a common pattern in OLAP DBs  So called because the dimension tables form a star pattern around the fact table  The majority of the data is in the fact table  The fact table should have no redundancy and be in BCNF  To reduce size, system generated dimension identifiers are used as the primary keys of dimensions ▪ e.g. product ID instead of a product name  Dimension tables are often not normalized  As they are static, so update, insert, and delete anomalies are not an issue
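A minimal star-schema sketch, assuming hypothetical table and column names (a Sales fact table referencing Location and Times dimensions by system-generated ids), using SQLite:

```python
import sqlite3

# Star schema sketch: one fact table (Sales) whose rows reference two
# dimension tables by their system-generated identifiers. Data is made up.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Location (loc_id INTEGER PRIMARY KEY, city TEXT, province TEXT);
CREATE TABLE Times    (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE Sales    (loc_id INTEGER REFERENCES Location,
                       time_id INTEGER REFERENCES Times,
                       sales REAL);
INSERT INTO Location VALUES (1, 'Victoria', 'BC'), (2, 'Toronto', 'Ont');
INSERT INTO Times VALUES (1, 2001), (2, 2002);
INSERT INTO Sales VALUES (1, 1, 63), (2, 1, 81), (1, 2, 38), (2, 2, 107);
""")

# A typical OLAP query: total sales per province (join fact to dimension).
rows = db.execute("""
    SELECT L.province, SUM(S.sales)
    FROM Sales S JOIN Location L ON S.loc_id = L.loc_id
    GROUP BY L.province ORDER BY L.province""").fetchall()
print(rows)
```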

16  Influenced by SQL and by spreadsheets  A measure is commonly aggregated over one or more dimensions e.g.  Find total sales  Find total sales for each city, or state  Find the top five products ranked by sales  Aggregation may be at different levels of a dimension hierarchy  Roll-up total sales by city to get sales by state, or  Drill-down from total sales by state, to get sales by city ▪ It is possible to drill-down on different dimensions

17  Pivoting sales on location and time produces the total sales for each location and time  The result is a cross-tabulation  Slicing a dataset is an equality selection on a dimension  Dicing a dataset is a range selection

cross-tabulation:
          BC    Ont   total
2001      63     81     144
2002      38    107     145
2003      75     35     110
total    176    223     399

18  The queries shown below give the cross-tabulation shown previously:

details:
SELECT SUM(S.sales)
FROM Sales S, Times T, Location L
WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id
GROUP BY T.year, L.province

sub-totals:
SELECT SUM(S.sales)
FROM Sales S, Times T
WHERE S.time_id = T.time_id
GROUP BY T.year

SELECT SUM(S.sales)
FROM Sales S, Location L
WHERE S.loc_id = L.loc_id
GROUP BY L.province

total:
SELECT SUM(S.sales)
FROM Sales S

19  The cross-tabulation shown previously is a roll-up on the location and time dimensions  With the sub-totals being roll-ups on time, and location  Each roll-up query corresponds to an SQL GROUP BY query with different grouping criteria  With three dimensions (time, location, and product) how many such queries are there?  If there are k dimensions, there are 2^k possible GROUP BY queries generated by pivoting  These can be generated by an SQL:1999 CUBE query

20  SQL 1999 extended the GROUP BY construct to provide better OLAP support  The GROUP BY clause with the CUBE keyword is equivalent to a collection of GROUP BY statements ▪ With one GROUP BY statement for each subset of the k dimensions  The ROLLUP keyword can also be used, which distinguishes between the GROUP BY attributes ▪ Subsets where the first attribute is null are not included in the result (except where all attributes are null)
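The equivalence between CUBE and a union of plain GROUP BYs can be illustrated directly; the rows and dimension names below are made up, and None stands in for SQL NULL:

```python
from itertools import combinations

# GROUP BY CUBE over k dimensions is equivalent to the union of plain
# GROUP BYs over every subset of the dimensions (2^k subsets in all).
rows = [
    {"year": 2001, "province": "BC",  "sales": 63},
    {"year": 2001, "province": "Ont", "sales": 81},
    {"year": 2002, "province": "BC",  "sales": 38},
]
dims = ("year", "province")

cube = {}
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        for r in rows:
            # dimensions outside the subset take the value None (SQL NULL)
            key = tuple(r[d] if d in subset else None for d in dims)
            cube[key] = cube.get(key, 0) + r["sales"]

print(cube[(None, None)], cube[(2001, None)], cube[(None, "BC")])
```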

21
SELECT T.year, L.province, SUM(S.sales)
FROM Sales S, Times T, Location L
WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id
GROUP BY CUBE (T.year, L.province)

T.year   L.province   SUM(S.sales)
2001     BC             63
2001     Ont            81
2001     null          144
2002     BC             38
2002     Ont           107
2002     null          145
2003     BC             75
2003     Ont            35
2003     null          110
null     BC            176   <- not included in the equivalent
null     Ont           223   <- rollup query (GROUP BY ROLLUP)
null     null          399

22  The eight grouping lists for the set product, time and location are shown below  Note that the CUBE operation calculates the specified grouping and its child nodes

{product, location, time}
{product, location}  {product, time}  {location, time}
{product}  {location}  {time}
{}

23  Find the percentage change in the total monthly sales of each product  Find the top five products ranked by total sales  Find the trailing n day moving average of sales  For each day compute the average daily sales over the preceding n days  The first two queries are hard to express in SQL  The third cannot be expressed if n is a parameter of the query  The SQL 1999 WINDOW clause allows such queries, over a table viewed as a sequence

24  The FROM and WHERE clauses form a normal SQL query  Call this query Temp  Temp is the sales rows with attributes of time and location  The WINDOW clause defines the window

SELECT L.province, T.month, AVG(S.sales) OVER W AS movavg
FROM Sales S, Time T, Location L
WHERE S.time_id = T.time_id AND S.loc_id = L.loc_id
WINDOW W AS (PARTITION BY L.province
             ORDER BY T.month
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
                       AND INTERVAL '1' MONTH FOLLOWING)

25  Temp is partitioned by the PARTITION BY clause  The result has one row for each row in the partition, rather than one row for each partition  The partitions are sorted by the ORDER BY clause  The WINDOW clause groups nearby records  Value-based, using RANGE (as in the example)  Based on the number of rows, using the ROWS clause  The aggregate function is computed for each row over its corresponding group, i.e. its window  There are new aggregate functions, RANK, and variants
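The window semantics can be mimicked in plain Python: partition by province, order by month, and average over a value-based (RANGE-style) window of one month on either side. The data is invented:

```python
# Pure-Python sketch of a windowed moving average: for each row, average the
# sales of rows in the SAME partition (province) whose month value lies in
# [month - 1, month + 1]. RANGE windows are value-based, not row-based.
rows = [("BC", 1, 10), ("BC", 2, 20), ("BC", 3, 30), ("Ont", 1, 40)]

result = []
for prov, month, _ in rows:
    window = [s for p, m, s in rows
              if p == prov and month - 1 <= m <= month + 1]
    result.append((prov, month, sum(window) / len(window)))
print(result)
```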

26  In analyzing trends it is common to want to find the 10 (or n) best, or worst, performers  e.g., which are the top ten best-selling products ▪ Sort products by sales ▪ Return the answer in descending order  If there are a million products this is very wasteful  Instead guess (!) a sales value, c  All top ten performers are better than c,  But most products sell less than c  Add the selection sales > c

27  The OPTIMIZE construct is not in SQL:1999 but is supported by some DBMSs  The cutoff value (c) is chosen by the optimizer  Choosing the cutoff can be tricky  The effectiveness of the approach depends on how accurately the cutoff can be estimated

SELECT P.product_id, P.pName, S.sales
FROM Product P, Sales S
WHERE P.product_id = S.product_id AND S.loc_id = 1 AND S.time_id = 3
ORDER BY S.sales DESC
OPTIMIZE FOR 10 ROWS

28  This query may be expensive if the tables are large  If speed is of the essence it is possible to return data before the query is complete  Either return the current running average  Or use sampling and other statistical techniques to return an approximation  Note that the algorithms must be non-blocking

SELECT L.province, AVG(S.sales)
FROM Sales S, Location L
WHERE S.loc_id = L.loc_id
GROUP BY L.province
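A sketch of the sampling idea, using a deterministic systematic sample of a hypothetical sales column rather than a full scan:

```python
# Approximate an average without scanning every row: sample a fixed fraction
# of a (made-up) sales column and average only the sample. A systematic
# sample is used here so the result is deterministic.
sales = list(range(1, 1001))   # stand-in for a large Sales table
sample = sales[::20]           # every 20th row: 5% of the data
approx = sum(sample) / len(sample)
exact = sum(sales) / len(sales)
print(approx, exact)
```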

29  The OLAP environment motivates some different implementation techniques  Indexing is very important in OLAP systems  Interactive response time is desired for queries over very large databases  OLAP systems are mostly read, and rarely updated, which reduces the cost of maintaining indexes  New indexing techniques have been developed for OLAP systems  Bitmap indexes  Join indexes

30  A bitmap index can speed up queries on sparse columns, that have few possible values  One bit is allocated for each possible value  The indexes can be used to answer some queries  How many male customers have a rating of 3?  AND the M and 3 columns and count the 1s

sex index          customer table          rating index
M  F     id   name  sex  rating     1  2  3  4  5
1  0     112  Sam   M    5          0  0  0  0  1
0  1     113  Sue   F    3          0  0  1  0  0
0  1     121  Ann   F    2          0  1  0  0  0
1  0     131  Bob   M    3          0  0  1  0  0
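A bitmap index can be sketched with Python integers as bit vectors, one bit per row; ANDing two bitmaps and counting the set bits answers the query above. The rows follow the example table:

```python
# Bitmap-index sketch: one bitmap per possible column value, with bit i
# corresponding to row i of the customer table.
rows = [("Sam", "M", 5), ("Sue", "F", 3), ("Ann", "F", 2), ("Bob", "M", 3)]

sex_index = {"M": 0, "F": 0}
rating_index = {r: 0 for r in range(1, 6)}
for i, (_, sex, rating) in enumerate(rows):
    sex_index[sex] |= 1 << i
    rating_index[rating] |= 1 << i

# How many male customers have a rating of 3? AND the bitmaps, count the 1s.
matches = sex_index["M"] & rating_index[3]
print(bin(matches).count("1"))
```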

31  Joins are often expensive operations  Join indexes can be built to speed up specific join queries  A join index contains record IDs of matching records from different tables  e.g. Sales, products and locations of all sales in B.C.  The index would contain the sales rids and their matching product and location rids ▪ Only locations where province = "BC" are included  The number of such indexes can be a problem where there are many similar queries

32  To reduce the number of join indexes separate indexes can be created on selected columns  Each index contains rids of dimension table records that meet the condition, and rids of matching fact table records  The separate join indexes have to be combined, using rid intersection, to compute a join query  The intersection can be performed more efficiently if the new indexes are bitmap indexes  Particularly if the selection columns are sparse  The result is a bitmapped join index

33  Decision support is a rapidly growing area of database use and research  It involves the creation of large, consolidated data repositories called data warehouses  Warehouses are queried using sophisticated analysis techniques  Complex multidimensional queries influenced by both SQL and spreadsheets  New techniques for database design, indexing, view maintenance and querying must be supported

34

35  Data mining consists of finding interesting trends in large datasets  Related to an area of statistics - exploratory data analysis  Such patterns should be identified with minimal user input  The knowledge discovery process consists of four steps  Data selection - find the target subset of the data  Data cleaning - remove noise and outliers, transform fields to common units and prepare the data for analysis  Data mining - apply data mining algorithms to find interesting trends or patterns  Evaluation - present results to end users

36  A market basket is a collection of items purchased by a customer in a single transaction  Retailers commonly want to know which items are purchased together, to identify marketing opportunities  An itemset is a set of items bought in a transaction  The support of an itemset is the fraction of transactions that contain all the items in an itemset  e.g. if {milk, cookies} has 60% support then 60% of all transactions contain both milk and cookies  We may be interested in single item itemsets as they identify frequently purchased items
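Support can be computed directly as defined above; the transactions below are invented:

```python
# Support of an itemset: the fraction of transactions that contain every
# item in the itemset (made-up transaction data).
transactions = [
    {"milk", "cookies", "bread"},
    {"milk", "cookies"},
    {"milk", "bread"},
    {"cookies", "juice"},
    {"milk", "cookies", "juice"},
]

def support(itemset):
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)

print(support({"milk", "cookies"}))   # 3 of the 5 transactions
```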

37  The a priori property is that every subset of a frequent itemset is also a frequent itemset  The algorithm can proceed iteratively by first identifying frequent itemsets with only one item  Each single-item itemset can then be extended with another item to generate larger candidate itemsets  Each iteration of the algorithm scans the transactions once  Increasing the candidate itemsets by one item ▪ The algorithm can be improved by only considering additional items that are themselves frequent  The minimum support level has to be specified by the user
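A compact (and deliberately naive) sketch of this iteration, with made-up transactions and a made-up minimum support:

```python
# A priori sketch: start from frequent single items, extend candidates one
# item at a time, and prune each level by the minimum support.
transactions = [
    {"milk", "cookies"}, {"milk", "cookies", "bread"},
    {"milk", "bread"}, {"cookies", "bread"}, {"milk", "cookies"},
]
min_support = 0.4  # user-specified

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = {i for t in transactions for i in t}
frequent = [{frozenset({i}) for i in items if support({i}) >= min_support}]
while frequent[-1]:
    # extend each frequent itemset with one more frequent single item
    candidates = {s | {i} for s in frequent[-1]
                  for f in frequent[0] for i in f if i not in s}
    frequent.append({c for c in candidates if support(c) >= min_support})

all_frequent = set().union(*frequent)
print(sorted(len(s) for s in all_frequent))
```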

38  If the number of {custid, item} pairs is large the relation may have to be sorted or hashed, but  The result set is probably small, just the tip of the iceberg  The query will waste time computing all of the groups, even though only a few will meet the HAVING condition  A modification of the a priori property suggests that we only need to consider  Customers that have purchased more than 5 items, and  Items that have been purchased more than 5 times

Find items that customers have purchased more than five times:
SELECT P.custid, P.item, SUM(P.qty)
FROM Purchases P
GROUP BY P.custid, P.item
HAVING SUM (P.qty) > 5

39  An association rule is a rule of the form  {milk} → {cookies}, which states that:  If milk is purchased in a transaction then it is likely that cookies are also purchased in that transaction  There are two measures for association rules  The support is the percentage of transactions that contain LHS ∪ RHS ▪ The same as the support for that itemset  The confidence is a measure of the strength of the rule ▪ The percentage of times that cookies are purchased whenever milk is purchased, or ▪ support (LHS ∪ RHS) / support (LHS)

40  Users ask for association rules with given minimum support and confidence  First all frequent itemsets with the specified minimum support are found ▪ As discussed previously  Once the frequent itemsets have been produced they are divided into LHS and RHS  The confidence measure is then tested for each possible LHS and RHS combination of the qualifying itemsets  The most expensive part of the algorithm is identifying the frequent itemsets
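Given the frequent itemsets, the confidence test is just a ratio of supports; the transactions below are invented:

```python
# Testing an association rule {milk} -> {cookies}:
# confidence = support(LHS union RHS) / support(LHS).
transactions = [
    {"milk", "cookies"}, {"milk", "cookies"}, {"milk"},
    {"cookies"}, {"milk", "cookies"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

conf = support({"milk", "cookies"}) / support({"milk"})
print(round(conf, 2))   # cookies appear in 75% of the milk transactions
```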

41  Association rules can be applied to sets of days  By using the date field as a grouping attribute  In calendric market basket analysis the user specifies a group of dates, or calendar, to test the rule over  Sequential patterns can be analyzed, where a customer purchases a given sequence of itemsets  Care must be taken when using association rules for predictive purposes  Rules like {milk}  {cookies} may not be causal

42  Finding causal relationships can be hard  Although two events (or purchases) are correlated there may not be a causal relationship between them  Each possible combination of causal relationships can be considered as a model of the world  Assign a score to each model based on its consistency with the observed data  Bayesian networks are graphs that can be used to describe such models ▪ The number of models is exponential in the number of variables so some subset should be considered

43  An insurance company may want to predict whether or not customers are high risk  What information can they use to do this? ▪ e.g. If a male aged between 16 and 25 drives a truck they are high risk  There is one attribute whose value is to be predicted, the dependent attribute, and  A number of predictor attributes  The general form of such rules is:  P1(X1) ∧ P2(X2) ∧ … ∧ Pk(Xk) ⇒ Y = c ▪ The Pi are predicates, the Xi are predictor attributes

44  The form of predicates depends on the type of the predictor attribute  If the attribute is numerical, then numerical comparisons can be performed ▪ The predicate is of the form low ≤ Xi ≤ high  If the attribute is categorical, then we test for set membership ▪ The predicate is of the form Xi ∈ {v1, …, vj}  age is numerical, car type and risk rating are categorical ▪ (16 ≤ age ≤ 25) ∧ car ∈ {truck} ⇒ hiRisk = true  Rules are classified by the type of the dependent attribute  Categorical – classification rules  Numerical – regression rules

45  Support and confidence can be defined for classification and regression  Support – the support for a condition, C, is the percentage of records that satisfy C ▪ The support for the rule C1 ⇒ C2 is the support for C1 ∧ C2  Confidence – the confidence is the percentage of records satisfying C1 that also satisfy C2  Classification and regression rules differ from association rules  Unlike association rules, which treat a single set-valued field as the left-hand side, they allow several predictor attributes of different types

46  A collection of classification rules can be represented as a decision tree  Each internal node of the tree is labeled with a predictor attribute, referred to as a splitting attribute  Outgoing edges are labeled with predicates  Leaf nodes are labeled with values of the dependent attribute  Decision trees are constructed in two phases  In the growth phase an overly large tree is built ▪ Trees are built by repeatedly splitting the tree on the best remaining splitting criterion ▪ The database is then partitioned on this criterion  The tree is then pruned to remove overspecialized rules
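A hand-built tree for the insurance rule from the earlier slide makes the structure concrete; the attribute names and thresholds follow the slides, while the function and test records are our own:

```python
# A tiny decision tree: internal nodes test a splitting attribute,
# leaves carry values of the dependent attribute (risk rating).
def classify(record):
    if record["car"] == "truck":            # splitting attribute: car type
        if 16 <= record["age"] <= 25:       # splitting attribute: age
            return "high"
        return "low"
    return "low"

print(classify({"age": 20, "car": "truck"}),
      classify({"age": 40, "car": "truck"}),
      classify({"age": 20, "car": "sedan"}))
```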

47  The goal is to partition a set of records into groups  Records in a group are similar, and records that belong to different groups are dissimilar  Each group is called a cluster, and records should belong to only one cluster  Partitional clustering algorithms partition the data into groups based on some criterion  Hierarchical clustering algorithms generate a sequence of partitions  In the first partition each cluster consists of one record; the algorithm then merges two clusters in each step
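A toy sketch of the hierarchical variant, using single-link distance on invented one-dimensional points:

```python
# Agglomerative (hierarchical) clustering sketch: start with one cluster
# per record, then repeatedly merge the two closest clusters.
points = [1.0, 1.5, 9.0, 10.0]
clusters = [[p] for p in points]

def dist(a, b):
    # single-link distance between clusters: the closest pair of points
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 2:   # stop when two clusters remain
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] += clusters.pop(j)

print(sorted(sorted(c) for c in clusters))
```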

48

