CS346: Advanced Databases

1 CS346: Advanced Databases
Data Warehousing and OLAP
Graham Cormode

2 Outline
- Chapter: “Overview of Data Warehousing and OLAP” in Elmasri and Navathe
- What is a data warehouse and what is it for?
- The multidimensional data model and common schema designs
- Special indexes: bitmap and join indexes
Why?
- Another model of data to study and contrast with RDBMS
- A different perspective on using data for insight
- A relatively recent development (1990s): still developing
CS346 Advanced Databases

3 Data Warehouses
- Data warehouses were introduced to handle large data stores
  - Typically, historical data for business analytic purposes
  - Separate from the organization’s “live” operational database
- “A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data” (W. Inmon, “father of the data warehouse”)
  - Subject-oriented: focused on one topic (e.g. all sales records)
  - Integrated: data brought together from many sources, cleaned
  - Time-variant: covers a long history of data (e.g. last decade)
  - Non-volatile: only periodically updated, not “live” data
- Data warehouse products from Oracle, IBM, Microsoft, Teradata
CS910 Foundations of Data Analytics

4 OLAP, OLTP, DSS
- OLAP: Online Analytical Processing
  - Analysis of complex data stored in a data warehouse
  - Often using distributed storage and processing (Hive…)
- In contrast to Online Transaction Processing (OLTP)
  - Insertions, updates, deletions and queries
- Decision Support Systems or Enterprise Information Systems
  - Allow an organization’s leaders to make complex strategic decisions
  - Support data mining / machine learning for knowledge discovery
  - “Business Intelligence”

5 Data Warehouse Characteristics
- Data warehouses adopt a different data model to RDBMS
  - Typically a multidimensional data model
- Warehouses often store integrated data from many sources
  - Contrast to DBMS, which encourages multiple disjoint DBs
- Warehouses typically support time-series and trend analysis
  - Need more historical data, not just the current values
- Warehouses are typically nonvolatile
  - Data is added only periodically. No need for transactions!
- Warehouses typically handle very large amounts of data
  - Often two orders of magnitude (100x) larger than “live” databases
  - May be terabytes to petabytes in size

6 ETL: Extract, Transform, Load
- Putting data into a data warehouse is a complex process
  - Denoted ETL: Extract, Transform, Load
- Extract: pull data out of whatever system it is stored in
  - Via an appropriate interchange format: XML, flat files, etc.
- Transform: put data into a usable format
  - Pick which attributes to keep, harmonize formats, sort and join as needed
  - Format for consistency: names of entities should agree
  - Clean the data: identify errors and fill in missing values
  - Return cleaned data to update original source: backflushing
  - Fit the data to the model of the warehouse: ensure it fits the schema

7 ETL: Extract, Transform, Load
- Load: store in an appropriate format
  - Many warehouses use simple structures, e.g. sorted flat files
- Refresh policy: How up to date is the data? Can it be offline?
  - How long does it take to load into the warehouse?
- Store metadata on the data as well: metadata repository
  - Technical metadata: how data was processed, stored, updated
  - Business metadata: relevant business rules and organization details
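
The three ETL stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RAW` flat file, its column names, and the rule that a missing amount becomes 0.0 are all hypothetical, not taken from the slides.

```python
import csv
import io

# Hypothetical raw export from an operational system (a flat file).
RAW = """store,product,amount
 coventry ,TV,300
COVENTRY,dvd,
Leeds,PC,900
"""

def extract(text):
    """Extract: pull rows out of the source's interchange format."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: harmonize names, fill in missing values, fit the schema."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "store": row["store"].strip().title(),
            "product": row["product"].strip().upper(),
            # Assumption: a missing amount is recorded as 0.0.
            "amount": float(row["amount"]) if row["amount"].strip() else 0.0,
        })
    return cleaned

def load(rows):
    """Load: store in a simple structure, here a list sorted by key."""
    return sorted(rows, key=lambda r: (r["store"], r["product"]))

warehouse = load(transform(extract(RAW)))
```

After the run, both spellings of the store name agree and the rows are held in sorted order, mirroring the "sorted flat file" style of storage mentioned on the slide.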

8 Characteristics of Data Warehouses
- A few key properties of data warehouses (DW):
  - Multidimensional: allow many levels of aggregation
  - Support multiple users via client-server architecture
  - Should be intuitive and responsive to use
- Many variations of the central concept:
  - Enterprise-wide DW: corral everything about an organization
  - Virtual DWs: provide a materialized view of an operational DB
  - Data marts: DWs restricted to a subset of an organization
- Two common architectures for warehouses:
  - Distributed: must handle replication, partitioning, consistency
  - Federated: collection of autonomous warehouses (data marts)

9 OLAP and Data Cubes
- Warehouses often support Online Analytical Processing (OLAP)
  - A multidimensional view of data
  - Represents data as a data cube
  - Explored by aggregating or refining dimensions in the data

10 Aggregating Multidimensional Data
- E.g. sales volume as a function of product, month, and region
- Dimensions: Product, Location, Time
- Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Location: Region > Country > City > Office
  - Time: Year > Quarter > Month or Week > Day
[Figure: cube with axes Product, Region, Month]

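11 A Sample Data Cube
[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, DVD, PC, sum) and Country (U.S.A., Canada, Mexico, sum). One highlighted cell holds the total annual sales of TVs in the U.S.A.; * (all) marks aggregation over a whole dimension.]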
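
A cube of this shape can be sketched by summing each fact into every cell it contributes to, including the cells where one or more dimensions are replaced by `*`. The fact records and sales figures below are invented for illustration; only the dimension names follow the figure.

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact records: (date, product, country, sales).
facts = [
    ("1Qtr", "TV", "U.S.A.", 100),
    ("2Qtr", "TV", "U.S.A.", 150),
    ("1Qtr", "PC", "Canada", 80),
    ("3Qtr", "TV", "Mexico", 60),
]

ALL = "*"  # marks a dimension that has been aggregated away

def build_cube(rows):
    """Sum sales into every cell, including cells with '*' coordinates."""
    cube = defaultdict(int)
    for date, prod, country, sales in rows:
        # Each fact contributes to 2^3 cells: keep or aggregate each dimension.
        for cell in product((date, ALL), (prod, ALL), (country, ALL)):
            cube[cell] += sales
    return cube

cube = build_cube(facts)
tv_usa_annual = cube[(ALL, "TV", "U.S.A.")]  # total annual TV sales in the U.S.A.
grand_total = cube[(ALL, ALL, ALL)]          # the apex cell of the cube
```

Materializing all cells costs a factor of 2^d per fact for d dimensions, which is why this sketch only suits small examples.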

12 OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): inverse of roll-up, from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
  - Zoom in on a particular value, or drop some attributes
- Apply aggregation: on a given dimension
  - Count, Sum, Min, Max, Average, Variance, Median, Mode
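
As a rough sketch of these operations, the snippet below applies roll-up, slice, and a finer-grained aggregation to a tiny in-memory fact list. The data, the `QUARTER` mapping, and the `aggregate` helper are assumptions made for the example, and Sum is the only aggregate shown.

```python
from collections import defaultdict

# Hypothetical sales facts at the finest level: (month, city, product, units).
sales = [
    ("Jan", "Leeds", "TV", 5),
    ("Feb", "Leeds", "TV", 7),
    ("Apr", "Leeds", "PC", 2),
    ("Jan", "York", "TV", 4),
]

# Assumed fragment of the time hierarchy: month -> quarter.
QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

def aggregate(rows, key):
    """Group rows by key(row) and sum the units measure."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Roll up by climbing the hierarchy: month -> quarter.
by_quarter_city = aggregate(sales, lambda r: (QUARTER[r[0]], r[1]))
# Roll up by dimension reduction: drop city and product entirely.
by_month = aggregate(sales, lambda r: r[0])
# Slice: select a single value on one dimension (product = "TV").
tv_slice = [r for r in sales if r[2] == "TV"]
# Drill down: view the slice at the finer month-and-city level again.
tv_by_month_city = aggregate(tv_slice, lambda r: (r[0], r[1]))
```

Drill-down needs the detailed rows (or a finer summary) to still be available, since a rolled-up total cannot be split back apart on its own.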

13 Multidimensional Storage Model
- The DW multidimensional storage model has two table types: dimension tables and fact tables
- Fact table: many tuples, one per stored fact, pointing to dimensions
  - E.g. sale of an item: which product, which store, which customer
- Dimension table: tuples of attributes of the dimension
  - E.g. details of the product, of the store, of the customer

14 Data Warehouse Schemas
- Star schema: fact table with a single table for each dimension
- Snowflake schema: variation of a star schema
  - Dimension tables are arranged hierarchically after normalization
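
A minimal star schema can be tried out with Python's built-in `sqlite3` module. The table names, columns, and rows here are hypothetical; the point is only the shape, with one fact table referencing one table per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store   (store_id   INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: one row per sale, pointing at the dimension tables.
CREATE TABLE sales (
    product_id INTEGER REFERENCES product(product_id),
    store_id   INTEGER REFERENCES store(store_id),
    amount     REAL
);
INSERT INTO product VALUES (1, 'TV', 'Electronics'), (2, 'PC', 'Computing');
INSERT INTO store   VALUES (10, 'Leeds', 'UK'), (11, 'Toronto', 'Canada');
INSERT INTO sales   VALUES (1, 10, 300.0), (1, 11, 350.0), (2, 10, 900.0);
""")

# A typical star join: aggregate facts grouped by a dimension attribute.
rows = conn.execute("""
    SELECT s.country, SUM(f.amount)
    FROM sales f JOIN store s ON f.store_id = s.store_id
    GROUP BY s.country
    ORDER BY s.country
""").fetchall()
```

In a snowflake variant of this sketch, `country` would move out of `store` into its own table that `store` references.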

15 Fact constellations
- Fact constellation: a set of fact tables that share some dimension tables

16 Bitmap Indexes
- Bitmap indexes are used to support high-performance access
  - One of various techniques used in the database
- Takes the form of a bit vector for each value in the domain of an indexed column
  - Bit i is set to 1 if row i holds that value, 0 if it does not
- Can be quite compact if the domain size is small
  - E.g. 1M rows and domain size of 4: bitmap index size 0.5MB
- Efficient to check conjunctive conditions: intersect (AND) bitmaps
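
The idea can be sketched with Python integers standing in for bit vectors. The column contents and helper names are made up for the example, and a real system would typically use compressed bitmaps rather than plain integers.

```python
def build_bitmap_index(column):
    """Return {value: int}, where bit i of the int is 1 iff column[i] == value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical columns of a six-row table.
region = ["N", "S", "N", "E", "S", "N"]
product = ["TV", "TV", "PC", "TV", "PC", "TV"]

region_idx = build_bitmap_index(region)
product_idx = build_bitmap_index(product)

# Conjunctive condition region = 'N' AND product = 'TV': intersect the bitmaps.
hits = matching_rows(region_idx["N"] & product_idx["TV"])

# Size estimate from the slide: 1M rows, 4 distinct values, one bit per (row, value).
size_bytes = 1_000_000 * 4 // 8
```

The size estimate works out to 500,000 bytes, which matches the 0.5MB quoted on the slide.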

17 Join indexing
- A join index connects dimension data to tuples in a fact table
  - Assuming a star schema
- A join index is a traditional index linking primary and foreign keys
  - Lists all the keys that meet the (equi)join condition
- E.g. consider a sales fact table that has city as one dimension
  - Join index on city: list of sales tuple ids for each different city
- Can make a join index as a bitmap index
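
The city example might look as follows in Python. The `cities` and `sales` data and the `build_join_index` helper are hypothetical, and dictionaries stand in for whatever index structure a real warehouse would use.

```python
from collections import defaultdict

# Hypothetical star schema fragment: a city dimension and a sales fact table.
cities = {1: "Leeds", 2: "York"}  # city_id -> city name
sales = [(100, 1, 30.0), (101, 2, 12.5), (102, 1, 7.5)]  # (sale_id, city_id, amount)

def build_join_index(dimension, facts):
    """Map each dimension value to the ids of fact tuples that join with it."""
    index = defaultdict(list)
    for sale_id, city_id, _amount in facts:
        if city_id in dimension:  # the equijoin condition on the key
            index[dimension[city_id]].append(sale_id)
    return dict(index)

join_index = build_join_index(cities, sales)

# The same index as bitmaps: bit i is set when fact row i joins with the city.
bitmap_join_index = defaultdict(int)
for position, (_sale_id, city_id, _amount) in enumerate(sales):
    bitmap_join_index[cities[city_id]] |= 1 << position
```

With the index in place, a query such as "all sales in Leeds" can look up the matching fact tuples directly instead of computing the join at query time.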

18 Data Warehouse versus Views
- Recall views: result of a (stored) query on a database
  - Could achieve warehouse functionality via (materialized) views
- Data warehouses are more than just views:
  - Warehouses are stored, not materialized on demand
  - Different data model: multidimensional, not relational
  - Data warehouses can be indexed (views cannot)
  - Warehouses support various analysis tasks (mining, time series)
  - Warehouses typically contain more (historic) data than one DB

19 Data Warehouses: Pros and Cons
- Data warehouses have many strengths for data analysis:
  - Support fast exploration and aggregation of data
  - Designed to handle very large data sets (TBs / billions of records)
  - Software supports analytics (data mining/machine learning) on top
    - Clustering, Regression, Classification, Rule mining
- However, they have their limitations:
  - A big undertaking: bringing together all an organization’s data
  - Need a thorough understanding of the organizational structure
  - Can be costly to maintain (time-consuming to clean and load data)
  - As underlying data organization changes, so must the warehouse

20 Summary
- What is a data warehouse and what is it for?
  - Storing and querying all the data of a large organization
- The multidimensional data model and common schema designs
  - Roll up, drill down, slice & dice; star and snowflake schemas
- Special indexes: bitmap and join indexes
- Chapter: “Overview of Data Warehousing and OLAP” in Elmasri and Navathe

