Contents of this slideshow: What is a data warehouse? Multi-dimensional data modelling. Data warehouse architecture. The hidden slides of this slideshow may be important. However, I will focus on learning by exercises, and therefore rattling off new concepts is often done in hidden slides.
Data warehouses versus operational systems
OLAP (On-Line Analytical Processing)
1. Normally uses a special data warehouse database.
2. Data analysis and decision making.
OLTP (On-Line Transaction Processing)
1. Normally uses a relational DBMS.
2. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
Why a separate data warehouse?
High performance for both systems:
–DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
–Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation (responses to slowly changing dimensions))
Different functions and different data:
–Missing data: decision support (DS) requires historical data, which operational DBs do not typically maintain in a way suited for decision support.
–Data integration: DS requires integrated data from heterogeneous sources.
–Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled.
OLTP versus OLAP OLTP = On Line Transaction Processing OLAP = On Line Analytical Processing
An example of a data warehouse: A star schema data warehouse has a central table (the fact table) surrounded by dimension tables with one-to-many relationships towards the fact table. The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!
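A minimal sketch of the star schema idea, using Python's built-in sqlite3 module. All table names, column names and rows (fact_sales, dim_product, dim_salesman, dim_date) are hypothetical and not taken from the slides. Because every dimension joins to the fact table in the same way, one small function can generate the aggregate query for any dimension, which is one way to read the claim that the programs can be generated automatically.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by three dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE dim_salesman (salesman_id INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table is on the "many" side of every relationship.
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    salesman_id INTEGER REFERENCES dim_salesman(salesman_id),
    date_id     INTEGER REFERENCES dim_date(date_id),
    qty         INTEGER,
    amount      REAL
);
INSERT INTO dim_product  VALUES (1, 'Screw'), (2, 'Bolt');
INSERT INTO dim_salesman VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO dim_date     VALUES (1, 2024, 1), (2, 2024, 2);
INSERT INTO fact_sales   VALUES (1, 1, 1, 10, 100.0), (2, 1, 1, 5, 150.0), (1, 2, 2, 20, 200.0);
""")


def aggregate_by(dimension_table, key, label):
    """Generate and run the aggregate query for one dimension of the star."""
    sql = (
        f"SELECT d.{label}, SUM(f.amount) "
        f"FROM fact_sales AS f JOIN {dimension_table} AS d ON f.{key} = d.{key} "
        f"GROUP BY d.{label} ORDER BY d.{label}"
    )
    return conn.execute(sql).fetchall()


by_product = aggregate_by("dim_product", "product_id", "product_name")
by_salesman = aggregate_by("dim_salesman", "salesman_id", "salesman_name")
print(by_product)
print(by_salesman)
```

The same function serves every dimension because the join pattern never changes; only the table and column names differ.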
Dimension hierarchies: A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table. In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy. Roll-up = aggregate along one or more dimensions. Drill-down = “de-aggregate” = break an aggregate into its constituents.
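As an illustration, here is a small sketch of a time dimension hierarchy (day, month, year) in sqlite3. The tables day_dim, month_dim, year_dim and fact_sales, and all rows, are assumptions made for this example only. The same fact rows are aggregated to each level of the hierarchy by following the one-to-many relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE year_dim  (year_id  INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE month_dim (month_id INTEGER PRIMARY KEY, month INTEGER,
                        year_id  INTEGER REFERENCES year_dim(year_id));
CREATE TABLE day_dim   (day_id   INTEGER PRIMARY KEY, day INTEGER,
                        month_id INTEGER REFERENCES month_dim(month_id));
CREATE TABLE fact_sales (day_id INTEGER REFERENCES day_dim(day_id), amount INTEGER);
INSERT INTO year_dim   VALUES (1, 2024);
INSERT INTO month_dim  VALUES (1, 1, 1), (2, 2, 1);
INSERT INTO day_dim    VALUES (1, 15, 1), (2, 20, 1), (3, 3, 2);
INSERT INTO fact_sales VALUES (1, 100), (1, 50), (2, 200), (3, 400);
""")

# The joins follow the hierarchy from the fact table up to the year level.
JOINS = """
FROM fact_sales f
JOIN day_dim   d ON f.day_id   = d.day_id
JOIN month_dim m ON d.month_id = m.month_id
JOIN year_dim  y ON m.year_id  = y.year_id
"""


def aggregate_to(level_columns):
    """Aggregate the fact table to the hierarchy level given by level_columns."""
    cols = ", ".join(level_columns)
    sql = f"SELECT {cols}, SUM(f.amount) {JOINS} GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


per_day = aggregate_to(["y.year", "m.month", "d.day"])   # most detailed level
per_month = aggregate_to(["y.year", "m.month"])          # roll up one level
per_year = aggregate_to(["y.year"])                      # roll up to the top of the hierarchy
print(per_day, per_month, per_year, sep="\n")
```

Reading the three results from bottom to top is a drill-down; reading them from top to bottom is a roll-up.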
Two different types of drilling: -Drilling in dimension hierarchies -Drilling between dimensions.
Which star schemas or data marts can be built by using the illustrated integrated E-commerce/ERP data model? Which star schema would you recommend to be implemented first?
A galaxy is a set of star fact tables with conformed (jointly adapted) dimensions: The value chain
Conceptual Modeling of Data Warehouses
–Star schema: A fact table in the middle connected to a set of dimension tables.
–Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
–Galaxy schema: Multiple fact tables share dimension tables (conformed dimensions). It is viewed as a collection of stars and is therefore called a galaxy schema or fact constellation.
The aggregation level is the argument to the GROUP BY clause (join conditions are omitted on the slide): SELECT Product#, SUM(Qty*Price) AS Turnover FROM Orderdetails JOIN Products GROUP BY Product#;
Drill down to the Product per Salesman level (join conditions are omitted on the slide): SELECT Product#, Salesman#, SUM(Qty*Price) AS Turnover FROM Orderdetails JOIN Products JOIN Salesmen GROUP BY Product#, Salesman#; Where should the Price be stored?
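The two queries can be sketched in runnable form as follows. Several things here are assumptions made only for illustration: the columns are renamed ProductNo and SalesmanNo because `#` is not valid in unquoted SQLite identifiers, the join conditions are guessed, and Price is placed in Products only so the example runs (the question of where Price should be stored is left open).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables and rows; Price in Products is an assumption for this sketch.
CREATE TABLE Products     (ProductNo  INTEGER PRIMARY KEY, Name TEXT, Price INTEGER);
CREATE TABLE Salesmen     (SalesmanNo INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Orderdetails (ProductNo INTEGER, SalesmanNo INTEGER, Qty INTEGER);
INSERT INTO Products     VALUES (1, 'Screw', 10), (2, 'Bolt', 30);
INSERT INTO Salesmen     VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO Orderdetails VALUES (1, 1, 100), (2, 1, 10), (1, 2, 50);
""")

# Aggregation level: product.
per_product = conn.execute("""
    SELECT p.ProductNo, SUM(o.Qty * p.Price) AS Turnover
    FROM Orderdetails o
    JOIN Products p ON o.ProductNo = p.ProductNo
    GROUP BY p.ProductNo
    ORDER BY p.ProductNo
""").fetchall()

# Drill down: add Salesman to the GROUP BY list.
per_product_salesman = conn.execute("""
    SELECT p.ProductNo, s.SalesmanNo, SUM(o.Qty * p.Price) AS Turnover
    FROM Orderdetails o
    JOIN Products p ON o.ProductNo  = p.ProductNo
    JOIN Salesmen s ON o.SalesmanNo = s.SalesmanNo
    GROUP BY p.ProductNo, s.SalesmanNo
    ORDER BY p.ProductNo, s.SalesmanNo
""").fetchall()

print(per_product)
print(per_product_salesman)
```

The only difference between the two statements is the extra join and the extra grouping attribute, which is what makes the second result more detailed.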
Snowflake schema with branches: A Snowflake schema may have branches in the dimension hierarchies: Are Customers related to the Regions?
Drilling between dimension hierarchies:
Salesman#  Turnover  Branch-office#
Smith      100,000   LA
Jones      300,000   LA
Adams      200,000   SF

Salesman#  Productname  Turnover  Branch-office#
Smith      Screw        10,000    LA
Smith      Bolt         30,000    LA
Smith      Nut          60,000    LA
Jones      Screw        20,000    SF
Jones      Nut          40,000    SF
...
Roll up to the top level: Roll up can be executed by removing one or more arguments from the GROUP BY clause.
Salesman#  Productname  Turnover  Branch-office#
Smith      Screw        10,000    LA
Smith      Bolt         30,000    LA
Smith      Nut          60,000    LA
Jones      Screw        20,000    SF
Jones      Nut          40,000    SF
...
Roll up to the product level:
Productname  Turnover
Screw        100,000
Bolt         200,000
Nut          300,000
Roll up to the top level:
Top level turnover: 600,000
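A small sketch of roll-up as "removing arguments from GROUP BY". It loads only the five rows that are visible on the slide (the slide's table is cut off with "..."), so the totals below are lower than the slide's. The table name sales, the column names and the helper turnover_by are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (salesman TEXT, productname TEXT, turnover INTEGER, branch_office TEXT)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Smith", "Screw", 10000, "LA"),
    ("Smith", "Bolt", 30000, "LA"),
    ("Smith", "Nut", 60000, "LA"),
    ("Jones", "Screw", 20000, "SF"),
    ("Jones", "Nut", 40000, "SF"),
])

ALLOWED_LEVELS = {"salesman", "productname", "branch_office"}


def turnover_by(*levels):
    """Sum turnover at the aggregation level given by the GROUP BY arguments."""
    for column in levels:
        if column not in ALLOWED_LEVELS:
            raise ValueError(f"unknown grouping column: {column}")
    if not levels:
        # No GROUP BY arguments left: the top level.
        return conn.execute("SELECT SUM(turnover) FROM sales").fetchall()
    cols = ", ".join(levels)
    sql = f"SELECT {cols}, SUM(turnover) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


detailed = turnover_by("salesman", "productname")  # starting level
per_product = turnover_by("productname")           # roll up: salesman removed
top_level = turnover_by()                          # roll up: all arguments removed
print(detailed, per_product, top_level, sep="\n")
```

Calling turnover_by with more arguments again is the corresponding drill-down.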
The aggregation level is the argument to the GROUP BY clause. [Figure: grouping attributes x1, x2, …, xn; aggregated data versus non-aggregated data.]
Salesman#  Productname  Turnover  Branch-office#
Smith      Screw        10,000    LA
Smith      Bolt         30,000    LA
Smith      Nut          60,000    LA
Jones      Screw        20,000    SF
Jones      Nut          40,000    SF
...
Dimension hierarchies: A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table. In contrast to star schemas, a snowflake schema may have dimension hierarchies. Describe the advantages and disadvantages of using dimension hierarchies (a snowflake schema).
Exercise: The figure illustrates an ER diagram of a car rental company like Hertz or Avis. Question 1. Design a star schema or galaxy for the car rental company. Question 2. Are there advantages to storing suppliers as customers in, e.g., an e-commerce data warehouse?
Data Models –Relational models/ER-diagram used for OLTP databases –Stars, snowflakes and galaxies used for OLAP databases –Cubes used for OLAP databases
From OLTP Data Models to Data Cubes:
A star schema DW can be illustrated as a multidimensional cube:
From Tables to Data Cubes: The tables of a data warehouse may be viewed as a multidimensional data cube. Some data warehouse software products also store data in a cube structure. In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
A Sample Data Cube. [Figure: cube with the dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A., Canada, Mexico, sum); the marked cell is the total annual sales of TV in the U.S.A.] Aggregated data is a cuboid between the base cuboid and the apex cuboid. The following slides show the different cuboids.
Cuboids corresponding to the previous cube:
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)
The nodes in the graph correspond to the aggregation levels.
A four-dimensional cube: A lattice of cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
The nodes in the graph correspond to the aggregation levels.
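The lattice can be enumerated mechanically: every subset of the dimensions is one cuboid, so n dimensions (without hierarchies) give 2^n cuboids. A short Python sketch, where the function name cuboid_lattice is made up for this example:

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: list of k-dimensional cuboids} for every aggregation level."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    # The empty tuple is the apex cuboid, shown as "all" on the slide.
    print(f"{k}-D:", [", ".join(c) or "all" for c in cuboids])

total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print("number of cuboids:", total_cuboids)
```

Each tuple corresponds to one possible GROUP BY argument list, which is why the nodes of the lattice are the aggregation levels.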
Describe the advantages and disadvantages of storing data in a cube in memory.
OLAP cube operations:
Roll Up = Aggregating to a higher level (for example from month to year).
Drill Down = Recalculation with more details.
Slice = Selecting a subset by using a fixed dimension value.
Drill Across = Join of fact data across conformed dimensions.
Drill Through = Accessing related data from an OLTP system.
Aggregating
Pivoting = See next slide!
Pivoting = Transforming SQL query output into a user-friendly two-dimensional screen layout. [Figure: the same data shown as a multi-dimensional cube (day 1, day 2) and as a fact table view.]
OLAP Server Architectures
Relational OLAP (ROLAP)
–Uses a relational or extended-relational DBMS to store and manage warehouse data
–Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
–Greater scalability
Multidimensional OLAP (MOLAP)
–Array-based multidimensional storage engine (sparse matrix techniques)
–Fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)
–Storage flexibility with a mix of ROLAP and MOLAP
POLAP: personal HOLAP
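A minimal sketch of pivoting in plain Python: rows in fact table form (one row per day and product) are turned into a two-dimensional grid with days as rows and products as columns. The sample rows and the function name pivot are hypothetical.

```python
from collections import defaultdict

# Hypothetical query output in fact table form: (day, product, sales).
rows = [
    ("day 1", "TV", 10),
    ("day 1", "PC", 5),
    ("day 2", "TV", 7),
    ("day 2", "PC", 9),
]


def pivot(fact_rows):
    """Turn (row key, column key, value) triples into a nested dict grid."""
    table = defaultdict(dict)
    for row_key, col_key, value in fact_rows:
        # Sum values that land in the same cell.
        table[row_key][col_key] = table[row_key].get(col_key, 0) + value
    return {row_key: dict(cells) for row_key, cells in table.items()}


grid = pivot(rows)
columns = sorted({col for _, col, _ in rows})

# Print the two-dimensional layout: one line per day, one column per product.
print("".ljust(8) + "".join(c.rjust(6) for c in columns))
for row_key in sorted(grid):
    cells = "".join(str(grid[row_key].get(c, "")).rjust(6) for c in columns)
    print(row_key.ljust(8) + cells)
```

Swapping the first two fields of each triple before calling pivot gives the transposed layout, with products as rows and days as columns.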
Contents of this slideshow: What is a data warehouse? Multi-dimensional data modeling. Data warehouse design/implementation architectures:
1. Kimball has a bottom-up architecture.
2. Inmon has a top-down architecture.
3. The Data Vault architecture is normalized tables extended with historic data tables. That is, the Data Vault can be used to generate any data mart when needed.
Kimball’s Bottom-Up DW architecture: Kimball’s architecture uses conformed dimensions and conformed facts. Conformed dimensions make it possible to drill across from one data mart to another in order to present data from different marts in the same view. Only the conformed data have a top-down design.
Kimball’s Data Warehousing Architecture. [Figure labels: Data sources; ETL side: Data Staging Area (Extract, Transform, Load); Metadata; Query side: Presentation servers with data marts with atomic data and data marts with aggregate-only data; Data Warehouse Bus: conformed dimensions and facts; Query Services (warehouse browsing, access and security, query management, standard reporting, activity monitor); Desktop Data Access Tools; Reporting Tools; Data mining.] Surrogate key = A sequence number used as primary key.
More definitions:
Enterprise warehouse
–Collects all of the information about subjects spanning the entire organization
Virtual warehouse
–A set of views over operational databases
–Only some of the possible summary views may be materialized
Kimball’s Data Warehouse Bus Architecture = An architecture for designing all the data marts of an enterprise by using conformed dimensions and conformed fact tables.
Conformed dimensions = Dimensions designed to be common for different data marts in order to make drill-across operations possible.
Conformed facts = Measures with common units of measurement and granularities that make it possible to integrate measures from different fact tables.
Data marts = Kimball uses the data mart concept for any multidimensional database. (Inmon uses the data mart concept for subject areas/business functions or department data warehouses.)
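A sketch of a drill-across operation over one conformed dimension, using sqlite3. The two data marts (fact_sales and fact_purchases), the shared dim_product table and all rows are assumptions for illustration. Each fact table is first aggregated to the level of the conformed dimension and the results are then joined, so that rows from one fact table are not multiplied by rows from the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One conformed dimension shared by two hypothetical data marts.
CREATE TABLE dim_product    (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales     (product_id INTEGER, qty_sold   INTEGER);
CREATE TABLE fact_purchases (product_id INTEGER, qty_bought INTEGER);
INSERT INTO dim_product    VALUES (1, 'Screw'), (2, 'Bolt');
INSERT INTO fact_sales     VALUES (1, 10), (1, 15), (2, 5);
INSERT INTO fact_purchases VALUES (1, 40), (2, 20), (2, 10);
""")

drill_across = conn.execute("""
    SELECT d.product_name, s.sold, p.bought
    FROM dim_product d
    JOIN (SELECT product_id, SUM(qty_sold) AS sold
          FROM fact_sales GROUP BY product_id) s
         ON s.product_id = d.product_id
    JOIN (SELECT product_id, SUM(qty_bought) AS bought
          FROM fact_purchases GROUP BY product_id) p
         ON p.product_id = d.product_id
    ORDER BY d.product_name
""").fetchall()

print(drill_across)
```

The join only works because both marts use the same product keys; with two separately designed product dimensions there would be nothing reliable to join on.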
William Inmon’s DATA WAREHOUSE architecture from 1990 has a top-down design without conformed data. [Figure: EDW and department data warehouses.] EDW = Enterprise Data Warehouse. The DSA (Data Staging Area), where transformation takes place, is not illustrated.
William Inmon’s DATA WAREHOUSE concept: In practice you may have a fact table for each node in the value chain!
The DATA VAULT architecture from 2002-2005 has a full top-down design and a bottom-up implementation. [Figure: Normalized Data Vault with historic data.] In the Data Vault database with historic information, only the Extract activity has taken place. Therefore, the Data Vault architecture does not get bogged down in the design phase.
The DATA VAULT architecture takes the best from Inmon, Kimball and relational database design as it may be viewed as a top-down designed normalized database with historic and conformed data. That is, the EDW is not part of the design. Normalized Data Vault with historic data
Classical data warehousing. [Figure labels: OLTP source; stores DSA, EDW and DM; step numbers 1 to 3; activities Extraction, Delta Detection, Filter, Cleansing, Transformation, Business Rules, Aggregate, Error handling.] DSA = Data Staging Area. EDW = Enterprise Data Warehouse.
[Figure: the classical data warehousing pipeline (OLTP source, DSA, EDW and DM, steps 1 to 3) next to a second pipeline with the same activities but only the DSA (step 1).] HANA from SAP is an in-memory data warehouse product.
[Figure: classical data warehousing (DSA, EDW and DM, steps 1 to 3) compared with in-memory data warehousing (only the DSA, step 1).] How can OLTP and OLAP be integrated in a common in-memory database?
ER diagram for a hospital. Exercise: Transform the OLTP database to a star schema DW for a hospital.
Exercise: Design an Airline DW.
Exercise: Design a Hotel DW.
Exercise: Design a data warehouse for a travel agency.
Problems that Datawarehouses may solve: Easy data access Easy presentation Direct access Overview Consistency Relevance
Inmon’s versus Kimball’s DW definitions: Why do you think Kimball’s DW architecture is used most in practice? Kimball and Inmon agree that OLAP data warehouses do not use the OLTP databases. However, what is the difference between the architectures?
Dates may be stored in different formats. As an example, the First purchase date may be stored as a FK to a hierarchical time dimension and the Birth date as an SQL timestamp. Why are different date formats used in the Customer table?
OLAP
OLAP = On-Line Analytical Processing
–Interactive analysis
–Exploratory discovery
–Requires fast response times
Data can be shown as multidimensional cubes
–Cubes can have an arbitrary number of dimensions
–Dimensions have hierarchies, e.g. day-month-year
OLAP operations
–Aggregation = Summarizing data, e.g. with SUM, AVG, COUNT…
–Starting level: (Quarter, Product)
–Roll Up: less detail, Quarter->Year
–Drill Down: more detail, Quarter->Month
–Slice: projection/selection, Year=1999
–Drill Across: “join” on shared dimensions
–Drill Through: looking up the source data in the operational systems
–Pivoting
The Business Dimensional Lifecycle = Kimball’s activity model for data warehouse development, which has three parallel tracks:
Project planning
Requirements specification
–Technology track: Technical architecture design; Product selection and installation
–Data track: Dimensional modelling; Physical design; ETL design and development
–Application track: Application specification; Application development
Deployment
Maintenance and growth
Project management
The Data Warehouse Bus Architecture = An architecture for designing a series of data marts that together make up the enterprise's data warehouse, with shared conformed dimensions and conformed facts. Data marts = Department data warehouses. Kimball uses the term more generally for a single multidimensional database. Conformed dimensions = Shared dimensions that are adapted to the requirements of several data marts. Stovepipe = Derogatory term for a data warehouse without conformed dimensions.
Data warehouses versus operational systems
OLTP (on-line transaction processing)
–Normally uses a relational DBMS
–Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
–Uses a data warehouse database, which may have a DBMS underneath
–Data analysis and decision making
Distinct features (OLTP versus OLAP):
–View: current versus evolutionary
–System orientation: transactions versus analyses
–Database design: ER + application versus star + subject
The SQL example table.
Syntax for SQL queries:
SELECT desired output attributes
FROM the names of the tables used as input
[WHERE conditions the desired output records must fulfill]
[GROUP BY grouping attributes
[HAVING conditions the desired output groups must fulfill] ];
Example where the number of salesmen is counted per city for records where S-KODE > 10. Because S-KODE > 10 is a condition on individual records and S-KODE is not a grouping attribute, the condition belongs in WHERE rather than HAVING:
SELECT BYNAVN, COUNT(*) AS "Antal sælgere" FROM S WHERE "S-KODE" > 10 GROUP BY BYNAVN;
BYNAVN (city name)  Antal sælgere (number of salesmen)
københavn           2
Odense              1
Aggregations at group level: In SQL grouping queries the aggregations SUM, MIN, MAX and AVG operate per group. COUNT(*) counts the number of records in each group. COUNT(attribute) counts the number of non-NULL attribute values in each group; COUNT(DISTINCT attribute) counts the number of different attribute values in each group.
Example of group level aggregations List statistics about the real sales locations with more than one salesman:
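Since the SQL example table is not reproduced in this transcript, the sketch below uses made-up data. The table name S comes from the slides; the columns name, city and code and all rows are assumptions. WHERE filters individual records before grouping, HAVING filters whole groups afterwards, and the last query contrasts the three COUNT variants.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S (name TEXT, city TEXT, code INTEGER)")
conn.executemany("INSERT INTO S VALUES (?, ?, ?)", [
    ("Smith", "Copenhagen", 20),
    ("Jones", "Copenhagen", 30),
    ("Adams", "Odense", 15),
    ("Brown", "Odense", 5),
    ("Clark", "Aarhus", 25),
    ("Davis", "Aarhus", None),   # NULL code, to show how COUNT treats NULLs
])

# Statistics per city, only for cities with more than one qualifying salesman.
stats = conn.execute("""
    SELECT city, COUNT(*), MIN(code), MAX(code), AVG(code)
    FROM S
    WHERE code > 10          -- record-level condition, applied before grouping
    GROUP BY city
    HAVING COUNT(*) > 1      -- group-level condition, applied after grouping
    ORDER BY city
""").fetchall()

# COUNT(*) counts records, COUNT(code) skips NULLs, COUNT(DISTINCT city) counts different values.
counts = conn.execute(
    "SELECT COUNT(*), COUNT(code), COUNT(DISTINCT city) FROM S"
).fetchone()

print(stats)
print(counts)
```

Only Copenhagen survives the HAVING filter, because Odense and Aarhus each have a single salesman left after the WHERE condition has been applied.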
Kimball’s data warehouse concepts. [Figure labels: Data sources (operational systems); ETL side: Data Staging Area (Extract, Transform, Load); Data Service Element; Metadata; Query side: Presentation servers with data marts with atomic data and data marts with aggregate-only data; Data Warehouse Bus: conformed dimensions and facts; Query Services (warehouse browsing, access and security, query management, standard reporting, activity monitor); Desktop Data Access Tools; Reporting Tools; Data mining.] Inmon does not use the conformed facts and dimension table concepts!
[Figure labels: DB, Appl., ETL, Data Vault, DM, OLAP, Visualization, Appl., Data mining; existing databases and systems (OLTP); new databases and systems (OLAP).] In the DATA VAULT architecture the data marts are loaded from a normalized database with historic information. …
[Figure labels: DB, Appl., ETL, Data Vault, DM, OLAP, Visualization, Appl., Data mining.] In the future the DATA VAULT may be the only database and may be stored in memory. … SAP has already developed an in-memory OLAP database called HANA.