Presentation is loading. Please wait.

Presentation is loading. Please wait.

12/18/20141 Lecture 10: Data Warehouses  Introduction  Operational vs. Warehouse  Multidimensional Data  Examples  MOLAP vs ROLAP  Dimensional Hierarchies.

Similar presentations


Presentation on theme: "12/18/20141 Lecture 10: Data Warehouses  Introduction  Operational vs. Warehouse  Multidimensional Data  Examples  MOLAP vs ROLAP  Dimensional Hierarchies."— Presentation transcript:

1 12/18/20141 Lecture 10: Data Warehouses  Introduction  Operational vs. Warehouse  Multidimensional Data  Examples  MOLAP vs ROLAP  Dimensional Hierarchies  OLAP Queries  Demos  Comparison with SQL Queries  CUBE Operator  Multidimensional Design  Star/Snowflake Schemas  Online Aggregation  Implementation Issues  Bitmap Index  Constructing a Data Warehouse  Views  Materialized View Example  Materialized View is an Index  Issues in Materialized Views  Maintaining Materialized Views

2 12/18/20142 Introduction  In the late 80s and early 90s, companies began to use their DBMSs for complex, interactive, exploratory analysis of historical data.  This was called Decision Support, and On- Line Analytic Processing (OLAP).  DS slowed down the operation of the company, called On-Line Transaction Processing (OLTP).  This led to the creation of Data Warehouses, separate from operational Databases.

3 12/18/20143 Operational vs Data Warehouse Requirements OLTP / Operational / ProductionDSS / Warehouse / DataMart Operate the business / ClerksDiagnose the business / Managers Short queries, small amts of dataopposite Queries change dataopposite Customer inquiry, Order Entry, etc.OLAP, Statistics, Visualization, Data Mining, etc. Legacy Applications, Heterogeneous databases Opposite Often DistributedOften Centralized (Warehouse) Current dataCurrent and Historical data

4 12/18/20144 Operational vs Data Warehouse Requirements, ctd OperationalWarehouse General E-R DiagramsMultidimensional data model common Locks necessaryNo Locks necessary Crash recovery requiredCrash recovery optional Smaller volume of dataHuge volume of data Need indexes designed to access small amounts of data Need indexes designed to access large volumes of data

5 12/18/20145 Data Warehousing  Integrated data spanning long time periods, often augmented with summary information.  Several terabytes to petabytes common.  Interactive response times expected for complex queries; ad-hoc updates uncommon. Operational Data EXTRACT TRANSFORM LOAD REFRESH DATA WAREHOUSE Metadata Repository SUPPORTS OLAP DATA MINING

6 12/18/20146 Multidimensional Data  In order to support OLAP, warehouse data is often structured multidimensionally, as measures and dimensions.  Measure: Numeric attribute, e.g. sales amount  Dimension: attribute categorizing the measure, e.g. product, store, date of sale.  The fact table is a foreign key for each dimension, plus an attribute for each measure.  There will also be a dimension table for each dimension.  On the next page, the fact tables are red, the dimension tables are green.

7 12/18/20147 Examples of MultiDimensional Data  Purchase(ProductID, StoreID, DateID, Amt)  Product(ID, SKU, size, brand)  Store(ID, Address, Sales District, Region, Manager)  Date (ID, Week, Month, Holiday, Promotion)  Claims(ProvID, MembID, Procedure, DateID, Cost)  Providers(ID, Practice, Address, ZIP, City, State)  Members(ID, Contract, Name, Address)  Procedure (ID, Name, Type)  Telecomm (CustID, SalesRepID, ServiceID, DateID)  SalesRep(ID, Address, Sales District, Region, Manager)  Service(ID, Name, Category)

8 12/18/20148 MOLAP vs ROLAP  Multidimensional data can be stored physically in a (disk-resident, persistent) array; called MOLAP systems. Alternatively, can store as a relation; called ROLAP systems.  The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.  E.g., Products(pid, locid, timeid, amt)  Fact tables are much larger than dimensional tables.

9 12/18/ Multidimensional Data Model  Collection of numeric measures, which depend on a set of dimensions.  E.g., measure Amt, dimensions Product (key: pid), Location (locid), and Time (timeid) timeid pid pid timeidlocid amt locid Slice locid=1 is shown:

10 12/18/ Dimension Hierarchies  For each dimension, some of the attributes may be organized in a hierarchy: PRODUCTTIMELOCATION pname week city PID date ZIP year category quarter state

11 12/18/ OLAP Queries  Influenced by SQL and by spreadsheets.  A common operation is to aggregate a measure over one or more dimensions.  Find total sales.  Find total sales for each city, or for each state.  Find top five products ranked by total sales.  Roll-up: Aggregating at different levels of a dimension hierarchy.  E.g., Given total sales by city, we can roll-up to get sales by state.

12 12/18/ OLAP Queries  Drill-down: The inverse of roll-up.  E.g., Given total sales by state, can drill-down to get total sales by city.  E.g., Can also drill-down on different dimension to get total sales by product for each state.  Pivoting: Aggregation on selected dimensions.  E.g., Pivoting on State and Year yields this cross-tabulation : OR CA Total Total  Slicing and Dicing: Equality and range selections on one or more dimensions.

13 12/18/ Cognos Demo  Now we watch a demo of Cognos (bought by IBM)  Dimensions: Products…Margin ranges  Measure: Order value (sales)  First pivot from Product dimension to Margin Range  Notice how quickly the cube changes  Slice to Low Margin, pivot to Product and Company Region  Drill Down to High Tech, IDES AG  Now the guilty product is clear.

14 12/18/ Tableau Demo   Note the many measures.  Pivot on sales, date (drill down to month), region as color.  Clear date, pivot on product and drill down on subcategory.  Change region from color to rows  Move profit into color  Change bars to circles  Pivot on dates (columns)

15 12/18/ Comparison with SQL Queries  The cross-tabulation obtained by pivoting can also be computed using a collection of SQLqueries: SELECT T.year, L.state, SUM (S.amt) FROM Sales S, Times T, Locations L WHERE S.timeid=T.timeid AND S.locid=L.locid GROUP BY T.year, L.state SELECT T.year, SUM (S.amt) FROM Sales S, Times T WHERE S.timeid=T.timeid GROUP BY T.year SELECT L.state,SUM (S.amt) FROM Sales S, Location L WHERE S.locid=L.locid GROUP BY L.state

16 12/18/ The CUBE Operator  Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.  CUBE pid, locid, timeid BY SUM Sales  Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form: SELECT SUM (S.amt) FROM Sales S GROUP BY grouping-list Lots of work on optimizing the CUBE operator!

17 12/18/ Example Multidimensional Design  This kind of schema is very common in OLAP applications  It is called a star schema  What is “wrong” with it? pricecategorypnamepidcountrystatecitylocid amtlocidtimeidpid holiday_flagweekdatetimeidmonthquarteryear (Fact table) SALES TIMES PRODUCTSLOCATIONS

18 12/18/ Star/Snowflake Schemas  Why normalize?  Space  Redundancy, anomalies  Why unnormalize?  Performance  Which is more important in D. Warehouses?  If normalized, it is a snowflake schema

19 12/18/ Online Aggregation  Consider an aggregate query, e.g., finding the average sales by state. Can we provide the user with some information before the exact average is computed for all states?  Can show the current “running average” for each state as the computation proceeds.  Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average for Oregon is 2000  102 with 95% probability”. Should also use nonblocking algorithms!

20 12/18/ Implementation Issues  New indexing techniques: Bitmap indexes, Join indexes, array representations, compression, precomputation of aggregations, etc.  E.g., Bitmap index: sex custid name sex rating rating Bit-vector: 1 bit for each possible value. M F

21 12/18/ Bitmap Indexes  Work when an attribute has few values, e.g. gender or rating  Advantage: Small enough to fit in memory  Many queries can be answered by bit-vector ops, e.g. females with rating = 3.

22 12/18/ Constructing a D. Warehouse  Extract  Is the data in native format?  Clean  How many ways can you spell Mr.?  Errors, missing information  Transform  Fix semantic mismatches. E.g. Last+first vs. Name  Load  Do it in parallel or else….  Refresh  Both data and indexes

23 12/18/ ,9 Views and Decision Support  In large databases, precomputation is necessary for decent response times  Examples: brain, google  Example: Precompute daily sums for the cube.  What can be derived from those precomputations?  These precomputed queries are called Materialized Views (SQL Server: Indexed views).

24 12/18/ Materialized View Example CREATE VIEW DailySum(date, sumamt) AS SELECT date, SUM(amt) FROM Times Join Sales USING(timeid) GROUP BY date SELECT week, SUM (amt) FROM Times Join Sales USING(timeid) Group By week SELECT week, SUM (sumamt) FROM Times Join DailySum USING (week) GROUP BY week Mat. View Query Modified Query

25 12/18/ Pros and Cons of Materialized Views  Pro: Modified query is a join of two small tables; original query is a join with one huge table.  Con: Materialized views take up space, need to be updated.

26 12/18/ A Materialized View is an Index  Recall the definition of an index  Data structure that provides fast access to data  Table indexes were of the form {(value, pointer)}, perhaps at leaf level of a search structure. This is different.  Needs to be maintained as underlying tables change.  Ideally, we want incremental view maintenance algorithms.

27 12/18/ What views should we materialize?  Remember the software that automatically chooses optimal index configurations?  The same software will choose optimal materialized views, given a workload and available space.

28 12/18/ What about the optimizer?  Given a query and a set of materialized views, can we use the materialized views to answer the query?  This is tricky. Best reference is [348]

29 12/18/ Refreshing Materialized Views  How often should we refresh the materialized view?  Many enterprises refresh warehouse data only weekly/nightly, so can afford to completely rebuild their materialized views.  Others want their warehouses to be current, so materialized views must be updated incrementally if possible.  Let's look at some simple examples.

30 12/18/ Maintaining Materialized Views*  Incremental view maintenance  Defn: make changes in view that correspond to changes in the base tables  Example: V = SELECT a FROM R  How is V modified if r is inserted to R?  How is V modified if r is deleted from R?

31 12/18/ Maintaining Materialized Views*  Consider V = R ⋈ S  How is V modified if r is inserted to R?  How is V modified if r is deleted from R?  Consider V = SELECT g,COUNT(*) FROM R GROUP BY g  How is V modified if r is inserted to R?  How is V modified if r is deleted from R  For more general cases, see [348]


Download ppt "12/18/20141 Lecture 10: Data Warehouses  Introduction  Operational vs. Warehouse  Multidimensional Data  Examples  MOLAP vs ROLAP  Dimensional Hierarchies."

Similar presentations


Ads by Google