
1 Contents of this slideshow:
What is a data warehouse? Multi-dimensional data modelling. Data warehouse architecture. The hidden slides of this slideshow may be important. However, I will focus on learning by exercises, so new concepts are often only listed in the hidden slides.

2 Data warehouses versus operational systems
OLAP (On-Line Analytical Processing): normally uses a special data warehouse database. Used for data analysis and decision making.
OLTP (On-Line Transaction Processing): normally uses a relational DBMS. Used for day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

3 Why Separate Data Warehouse?
High performance for both systems. DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery). Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation (responses to slowly changing dimensions)).
Different functions and different data. Missing data: decision support (DS) requires historical data, which operational DBs do not typically maintain in a way suited for decision support. Data integration: DS requires integrated data from heterogeneous sources. Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled.

4 OLTP versus OLAP
OLTP = On-Line Transaction Processing. OLAP = On-Line Analytical Processing.
Business functions in the supply or value chain

5 Contents of this slideshow:
What is a data warehouse? Multi-dimensional data modelling. Data warehouse architecture.

6 An example of a data warehouse:
A star schema data warehouse has a central table (the fact table) surrounded by dimension tables with one-to-many relationships towards the fact table. The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!
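As a rough illustration of the star schema described above, the following Python sketch builds a tiny star schema in an in-memory SQLite database. The table and column names (sales_fact, product_dim, salesman_dim, date_dim and their columns) are assumptions made for this example; they are not taken from the figure on the slide.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by three dimension tables (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE salesman_dim (salesman_id INTEGER PRIMARY KEY, salesman_name TEXT);
        CREATE TABLE date_dim     (date_id     INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        -- Each dimension row can be referenced by many fact rows (one-to-many).
        CREATE TABLE sales_fact (
            product_id  INTEGER REFERENCES product_dim(product_id),
            salesman_id INTEGER REFERENCES salesman_dim(salesman_id),
            date_id     INTEGER REFERENCES date_dim(date_id),
            qty         INTEGER,
            price       REAL
        );
    """)


def fact_foreign_keys(conn):
    """Return the names of the dimension tables that the fact table points to."""
    rows = conn.execute("PRAGMA foreign_key_list(sales_fact)").fetchall()
    return sorted(row[2] for row in rows)  # column 2 holds the referenced table


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
dimensions = fact_foreign_keys(conn)
print(dimensions)
```

Because every dimension is reached from the fact table through one foreign key, a generic tool only needs this list of foreign keys to generate drilling queries.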

7 Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table. In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy. Roll-up = aggregate along one or more dimensions. Drill-down = “de-aggregate” = break an aggregate into its constituents.

8 Two different types of drilling:
-Drilling in dimension hierarchies -Drilling between dimensions.

9 Which star schemas or data marts can be built by using the illustrated integrated E-commerce/ERP data model? Which star schema would you recommend to be implemented first?

10 A galaxy is a set of star fact tables with conformed (jointly adapted) dimensions:
The value chain

11 Conceptual Modeling of Data Warehouses
Star schema: a fact table in the middle connected to a set of dimension tables.
Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Galaxy schema: multiple fact tables share dimension tables (conformed dimensions). It is viewed as a collection of stars and is therefore called a galaxy schema or fact constellation.

12 The aggregation level is the argument to the GROUP BY clause:
SELECT Product#, SUM(Qty*Price) AS Turnover FROM Orderdetails JOIN Products GROUP BY Product#;
(The JOIN conditions are not shown.)

13 Drill down to the Product per Salesman level:
SELECT Product#, Salesman#, SUM(Qty*Price) AS Turnover FROM Orderdetails JOIN Products JOIN Salesmen GROUP BY Product#, Salesman#;
(The JOIN conditions are not shown.)
Where should the Price be stored?
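The idea that the GROUP BY columns decide the aggregation level can be tried out with a small Python sketch on an in-memory SQLite database. To avoid guessing the join conditions and the place where the price is stored, the sketch assumes a single denormalized table called sales with made-up rows; the table name, columns and data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, salesman TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [   # hypothetical sample rows
        ("Screw", "Smith", 10, 2.0),
        ("Screw", "Jones", 5, 2.0),
        ("Bolt", "Smith", 4, 5.0),
    ],
)


def turnover_by(*levels):
    """Aggregate turnover at the level given by the GROUP BY columns.

    The column names are fixed constants in this sketch, so building the
    SQL text with an f-string is acceptable here.
    """
    cols = ", ".join(levels)
    sql = f"SELECT {cols}, SUM(qty * price) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()


per_product = turnover_by("product")                        # product level
per_product_salesman = turnover_by("product", "salesman")   # drill down
print(per_product)
print(per_product_salesman)
```

Adding a column to the GROUP BY list drills down; removing one rolls up.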

14 Snowflake schema with branches:
A Snowflake schema may have branches in the dimension hierarchies: Are Customers related to the Regions?

15 Drilling in dimension hierarchies:
Salesman level:
Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF

Branch-office level:
Branch-office#   Turnover
LA               400,000
SF               200,000
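The roll-up from the salesman level to the branch-office level can be reproduced with a short Python sketch using SQLite. It uses the figures from the slide; the table names salesman_dim and sales_fact are assumptions, and Jones is assumed to belong to the LA office, which is consistent with the LA total of 400,000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE salesman_dim (salesman TEXT PRIMARY KEY, branch_office TEXT);
    CREATE TABLE sales_fact (salesman TEXT, turnover INTEGER);
    INSERT INTO salesman_dim VALUES ('Smith', 'LA'), ('Jones', 'LA'), ('Adams', 'SF');
    INSERT INTO sales_fact VALUES ('Smith', 100000), ('Jones', 300000), ('Adams', 200000);
""")

# Roll up from the salesman level to the branch-office level of the hierarchy.
per_branch = conn.execute("""
    SELECT d.branch_office, SUM(f.turnover)
    FROM sales_fact AS f JOIN salesman_dim AS d ON f.salesman = d.salesman
    GROUP BY d.branch_office
    ORDER BY d.branch_office
""").fetchall()
print(per_branch)  # [('LA', 400000), ('SF', 200000)]
```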

16 Drilling between dimension hierarchies:
Product per salesman level:
Salesman#   Product-name   Turn-over   Branch-office#
Smith       Screw          10,000      LA
Smith       Bolt           30,000      LA
Smith       Nut            60,000      LA
Jones       . . .          20,000      . . .
. . .       . . .          40,000      SF
. . .

Salesman level:
Salesman#   Turn-over   Branch-office#
Smith       100,000     LA
Jones       300,000     LA
Adams       200,000     SF

17 Roll up to the top level:
Salesman#   Product-name   Turn-over   Branch-office#
Smith       Screw          10,000      LA
Smith       Bolt           30,000      LA
Smith       Nut            60,000      LA
Jones       . . .          20,000      . . .
. . .       . . .          40,000      SF
. . .

Roll up can be executed by removing one or more arguments from the GROUP BY clause.

Roll up to the product level:
Productname   Turnover
Screw
Bolt
Nut           300,000

Roll up to the top level:
Top level     Turnover
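Rolling up by removing GROUP BY arguments can be sketched in Python with SQLite as follows. The three Smith rows are taken from the slide; the Jones row and the table name sales are assumptions added only to make the example non-trivial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesman TEXT, product TEXT, turnover INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Smith", "Screw", 10000),
    ("Smith", "Bolt", 30000),
    ("Smith", "Nut", 60000),
    ("Jones", "Screw", 20000),  # assumed row, not a complete copy of the slide
])


def aggregate(levels):
    """Sum turnover grouped by `levels`; an empty list means the top level."""
    if not levels:
        return conn.execute("SELECT SUM(turnover) FROM sales").fetchall()
    cols = ", ".join(levels)
    return conn.execute(
        f"SELECT {cols}, SUM(turnover) FROM sales GROUP BY {cols} ORDER BY {cols}"
    ).fetchall()


detail = aggregate(["salesman", "product"])
per_product = aggregate(["product"])  # "salesman" removed: roll up to product level
top_level = aggregate([])             # all arguments removed: roll up to top level
print(per_product, top_level)
```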

18 The aggregation level is the argument to the GROUP BY statement.
Salesman#   Productname   Turnover   Branch-office#
Smith       Screw         10,000     LA
Smith       Bolt          30,000     LA
Smith       Nut           60,000     LA
Jones       . . .         20,000     . . .
. . .       . . .         40,000     SF
. . .

[Figure: non-aggregated data values x1, x2, ..., xn are combined into aggregated data.]

19 Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table. In contrast to star schemas, a snowflake schema may have dimension hierarchies. Describe the advantages and disadvantages of using dimension hierarchies (a snowflake schema).

20 Exercise: The figure illustrates an ER-diagram of a car rental company like Hertz or Avis. Question 1. Design a star schema or galaxy for the car rental company. Question 2. Are there advantages to storing suppliers as customers in e.g. an e-commerce data warehouse?

21 Contents of this slideshow:
What is a data warehouse? Multi-dimensional data modelling. Data warehouse architecture.

22 Data Models
Relational models/ER-diagrams are used for OLTP databases.
Stars, snowflakes and galaxies are used for OLAP databases.
Cubes are used for OLAP databases.

23 From OLTP Data Models to Data Cubes:

24 A star schema DW can be illustrated as a multidimensional cube:

25 From Tables to Data Cubes
The tables of a data warehouse may be viewed as a multidimensional data cube. Some data warehouse software products also store data in a cube structure. In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.

26 A Sample Data Cube
Aggregated data is a cuboid between the base cuboid and the apex cuboid. The following slides show the different cuboids.
[Figure: data cube with the dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A, Canada, Mexico), with sum cells. One marked cell holds the total annual sales of TV in U.S.A.; the cell (All, All, All) holds the grand total.]

27 Cuboids Corresponding to the Previous Cube
The nodes in the graph correspond to the aggregation levels.
0-D (apex) cuboid: all
1-D cuboids: (country), (product), (date)
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)

28 A four dimensional cube: A Lattice of Cuboids
The nodes in the graph correspond to the aggregation levels.
0-D (apex) cuboid: all
1-D cuboids: (time), (item), (location), (supplier)
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
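The lattice for the four dimensions above can be enumerated mechanically: each cuboid is a subset of the dimensions, so n dimensions give 2^n cuboids when hierarchies inside the dimensions are ignored. A minimal Python sketch (the function name cuboid_lattice is made up for this example):

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Map each level k to the list of k-D cuboids (subsets of the dimensions)."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D:", cuboids)

total = sum(len(cuboids) for cuboids in lattice.values())
print(total)  # 16 cuboids for 4 dimensions
```

The empty tuple at level 0 is the apex cuboid ("all"), and the single tuple at level 4 is the base cuboid.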

29 Describe advantages/disadvantages of storing data in a cube in memory?
Where should dimension attributes be stored?

30 OLAP cube operations:
Roll Up = aggregating to a higher level (for example from month to year).
Drill Down = recalculation with more details.
Slice = selecting a subset by using a fixed dimension value.
Drill Across = join of fact data across conformed dimensions.
Drill Through = accessing related data from an OLTP system.
Aggregating.
Pivoting = see next slide!
dice [dais] vb.: to play dice; to cut into cubes (e.g. diced carrots).
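Two of these operations, roll up and slice, can be sketched in plain Python on a small list of fact rows. The rows and the function names are hypothetical and only meant to show the mechanics.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, product, amount)
facts = [
    (1999, 1, "TV", 100),
    (1999, 2, "TV", 150),
    (1999, 2, "PC", 200),
    (2000, 1, "TV", 120),
]


def roll_up_to_year(rows):
    """Roll up: aggregate from the month level to the year level."""
    totals = defaultdict(int)
    for year, _month, product, amount in rows:
        totals[(year, product)] += amount
    return dict(totals)


def slice_year(rows, year):
    """Slice: fix one dimension value and keep the matching sub-cube."""
    return [row for row in rows if row[0] == year]


yearly = roll_up_to_year(facts)
only_1999 = slice_year(facts, 1999)
print(yearly)
print(only_1999)
```

Drill down is the reverse direction: going from `yearly` back to the monthly rows in `facts`.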

31 Pivoting = Transforming SQL query output to user friendly two dimensional screen layout
[Figure: the same data shown as a fact table view and as a multi-dimensional cube with one layer per day (day 1, day 2).]
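Pivoting can be sketched in a few lines of Python: the flat rows that a query returns are rearranged into a grid with one dimension on each axis. The function name pivot and the sample rows are assumptions for illustration; missing combinations are shown as 0 here, which is a design choice of the sketch.

```python
def pivot(rows):
    """Turn (row_key, col_key, value) triples into a two-dimensional grid."""
    row_keys = sorted({r for r, _c, _v in rows})
    col_keys = sorted({c for _r, c, _v in rows})
    cells = {(r, c): v for r, c, v in rows}
    header = [""] + col_keys
    body = [[r] + [cells.get((r, c), 0) for c in col_keys] for r in row_keys]
    return [header] + body


# Hypothetical query output: one row per (product, day) combination.
query_output = [("Bolt", "day 1", 3), ("Bolt", "day 2", 5), ("Screw", "day 1", 7)]
grid = pivot(query_output)
for line in grid:
    print(line)
```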

32 OLAP Server Architectures
Relational OLAP (ROLAP): uses a relational or extended-relational DBMS to store and manage warehouse data. Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services. Greater scalability.
Multidimensional OLAP (MOLAP): array-based multidimensional storage engine (sparse matrix techniques). Fast indexing to pre-computed summarized data.
Hybrid OLAP (HOLAP): storage flexibility with a mix of ROLAP and MOLAP.
POLAP: personal HOLAP.

33 Contents of this slideshow:
What is a data warehouse? Multi-dimensional data modeling. Data warehouse design/implementation architectures:
1. Kimball has a bottom-up architecture.
2. Inmon has a top-down architecture.
3. Data Vault architecture is normalized tables extended with historic data tables. That is, the Data Vault can be used to generate any data mart when needed.

34 Kimball’s Bottom-Up DW architecture:
Kimball’s architecture uses conformed dimensions and conformed facts. Conformed dimensions make it possible to drill across from one data mart to another to present data from different marts in the same view. Only the conformed data have top-down design.

35 Kimball’s Data Warehousing Architecture
[Figure: Kimball’s architecture. ETL side: data sources feed the Data Staging Area, whose services are Extract, Transform and Load. Presentation servers: data marts with atomic data and data marts with aggregate-only data, tied together by the Data Warehouse Bus (conformed dimensions and facts). Query side: query services (warehouse browsing, access and security, query management, standard reporting, activity monitor) and desktop data access tools (query tools, reporting tools, data mining). Metadata spans both sides.]
Surrogate key = a sequence number used as primary key.
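A surrogate key can be illustrated with a small Python sketch using SQLite, where the database hands out the sequence number and the key from the source system is kept as an ordinary column. The table customer_dim, its columns and the source identifiers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source_id    TEXT,                               -- key used in the source system
        name         TEXT
    )
""")
# The loader never supplies customer_key; SQLite assigns 1, 2, 3, ...
conn.executemany(
    "INSERT INTO customer_dim (source_id, name) VALUES (?, ?)",
    [("CRM-17", "Smith"), ("ERP-A9", "Jones")],
)
keys = conn.execute(
    "SELECT customer_key, source_id FROM customer_dim ORDER BY customer_key"
).fetchall()
print(keys)  # [(1, 'CRM-17'), (2, 'ERP-A9')]
```

Fact tables would then reference customer_key, so keys from different source systems cannot collide.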

36 More definitions:
Enterprise warehouse: collects all of the information about subjects spanning the entire organization.
Virtual warehouse: a set of views over operational databases. Only some of the possible summary views may be materialized.

37 Kimball’s Data Warehouse Bus Architecture =
An architecture for designing all the data marts of an enterprise by using conformed dimension and conformed fact tables.
Conformed dimensions = dimensions designed to be common for different data marts in order to make drill across operations possible.
Conformed facts = measures with common units of measurement and granularities that make it possible to integrate measures from different fact tables.
Data marts = Kimball uses the data mart concept for any multidimensional database. (Inmon uses the data mart concept for subject areas/business functions or department data warehouses.)
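A drill across operation over a conformed dimension might look like the following Python sketch with SQLite. Two fact tables share the same product dimension; each is aggregated separately and the results are then joined on the shared dimension key. All table names, column names and rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT);  -- conformed
    CREATE TABLE sales_fact (product_id INTEGER, qty_sold INTEGER);
    CREATE TABLE purchase_fact (product_id INTEGER, qty_bought INTEGER);
    INSERT INTO product_dim VALUES (1, 'Screw'), (2, 'Bolt');
    INSERT INTO sales_fact VALUES (1, 10), (1, 5), (2, 7);
    INSERT INTO purchase_fact VALUES (1, 20), (2, 8), (2, 2);
""")

# Aggregate each fact table separately, then join on the conformed dimension.
drill_across = conn.execute("""
    SELECT p.name, s.sold, b.bought
    FROM product_dim AS p
    JOIN (SELECT product_id, SUM(qty_sold) AS sold
          FROM sales_fact GROUP BY product_id) AS s ON s.product_id = p.product_id
    JOIN (SELECT product_id, SUM(qty_bought) AS bought
          FROM purchase_fact GROUP BY product_id) AS b ON b.product_id = p.product_id
    ORDER BY p.name
""").fetchall()
print(drill_across)  # [('Bolt', 7, 10), ('Screw', 15, 20)]
```

Aggregating before joining avoids multiplying rows when both fact tables have several rows per product.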

38 William Inmon’s DATA WAREHOUSE architecture from 1990 has top-down design without conformed data. EDW = Enterprise Data Warehouse. [Figure labels: EDW and department data warehouses.] The DSA (Data Staging Area), where transformation takes place, is not illustrated.

39 William Inmon’s DATA WAREHOUSE concept:
In practice you may have a fact table for each node in the value chain!

40 The DATA VAULT architecture from 2002-2005 has full top-down design and bottom-up implementation:
[Figure: normalized Data Vault with historic data.] In the Data Vault database with historic information, only the Extract activity has taken place. Therefore, the Data Vault architecture does not get bogged down in the design phase.

41 The DATA VAULT architecture takes the best from Inmon, Kimball and relational database design as it may be viewed as a top-down designed normalized database with historic and conformed data. That is, the EDW is not part of the design. Normalized Data Vault with historic data

42 Classical Data warehousing
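[Figure: data moves in three steps (1, 2, 3) from the OLTP source through the DSA and the EDW to the DM. Processing steps shown: extraction, delta detection, cleansing, transformation, business rules, error handling, filter, aggregate.]
DSA = Data Staging Area. EDW = Enterprise Data Warehouse.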
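Of the processing steps named in the figure, delta detection is easy to sketch. One common approach, assumed here because the slide does not say which technique is meant, is to compare the current extract with the previous snapshot, keyed by primary key. The function name and the sample rows are hypothetical.

```python
def detect_delta(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns the keys of inserted, updated and deleted rows.
    """
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(k for k in current if k in previous and current[k] != previous[k])
    return inserted, updated, deleted


# Hypothetical customer snapshots: key -> row
yesterday = {1: ("Smith", "LA"), 2: ("Jones", "LA"), 3: ("Adams", "SF")}
today = {1: ("Smith", "LA"), 2: ("Jones", "SF"), 4: ("Brown", "LA")}

inserted, updated, deleted = detect_delta(yesterday, today)
print(inserted, updated, deleted)  # [4] [2] [3]
```

Only the changed rows then need to pass through cleansing and transformation.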

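43 Classical Data warehousing
[Figure: data moves in three steps (1, 2, 3) from the OLTP source through the DSA and the EDW to the DM. Processing steps shown: extraction, delta detection, cleansing, transformation, business rules, error handling, filter, aggregate.]
HANA from SAP is an in-memory data warehouse product.
[Figure: with an in-memory product, the same processing steps (extraction, delta detection, cleansing, transformation, business rules, error handling, filter, aggregate) take place in a single step (1) between the OLTP source and the DSA.]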

44 Classical Data warehousing versus in-memory Data warehousing
[Figure, classical data warehousing: data moves in three steps (1, 2, 3) from the source through the DSA and the EDW to the DM. Processing steps shown: extraction, delta detection, cleansing, transformation, business rules, error handling, filter, aggregate.]
[Figure, in-memory data warehousing: the same processing steps take place in a single step (1) between the OLTP source and the DSA.]
How can OLTP and OLAP be integrated in a common in-memory database?

45 Exercise: Transform the OLTP database to a Star schema DW for a Hospital.
ER-diagram for a hospital.

46 Exercise: Design an Airline DW.

47 Exercise: Design a Hotel DW.

48 Exercise. Design a datawarehouse for a travel agency.
Example: CSCW systems with replicated local data.

49 End of session Thank you !!!

50 Codd’s 12 rules for OLAP = Conformed dimensions = Pivoting

51 Problems that Datawarehouses may solve:
Easy data access, easy presentation, direct access, overview, consistency, relevance.

52 Inmon versus Kimball’s DW definitions:
Kimball and Inmon agree that OLAP data warehouses do not use the OLTP databases. However, what is the difference between the architectures? Why do you think Kimball’s DW architecture is used most in practice?

53 Dates may be stored in different formats
Dates may be stored in different formats. As an example, the First purchase date may be stored as a FK to a hierarchical time dimension and the Birth date as an SQL time stamp. Why are different date formats used in the Customer table? Example: CSCW systems with replicated local data.

54 OLAP
OLAP = On-Line Analytical Processing.
Interactive analysis. Explorative discovery. Requires fast response times.
Data can be shown as multidimensional cubes. Cubes can have an arbitrary number of dimensions. Dimensions have hierarchies, e.g. day-month-year.
OLAP operations:
Aggregation = summarizing data, e.g. with SUM, AVG, COUNT…
Starting level: (Quarter, Product).
Roll Up: less detail, Quarter -> Year.
Drill Down: more detail, Quarter -> Month.
Slice: projection/selection, Year = 1999.
Drill Across: “join” on shared dimensions.
Drill Through: looking up the source data in the operational systems.
Pivoting.

55 The Business Dimensional Lifecycle =
Kimball’s activity model for data warehouse development has three parallel tracks:
[Figure: project planning is followed by requirements, then by three parallel tracks: (1) design of the technical architecture, then product selection and installation; (2) dimensional modelling, then physical design, then ETL design and development; (3) specification of applications, then development of applications. The tracks end in deployment, followed by maintenance and growth. Project management spans the whole lifecycle.]

56 The Data Warehouse Bus Architecture =
An architecture for designing a set of data marts that together make up the enterprise’s data warehouse, with shared conformed dimensions and conformed facts.
Data marts = departmental data warehouses. Kimball uses the term more generally for a single multidimensional database.
Conformed dimensions = shared dimensions that are adapted to the requirements of several data marts.
Stovepipe = a derogatory term for a data warehouse without conformed dimensions.

57 Data warehouses versus operational systems
OLTP (on-line transaction processing): normally uses a relational DBMS. Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing): uses a data warehouse database, which may have a DBMS underneath. Data analysis and decision making.
Distinct features (OLTP versus OLAP): View: current versus evolutionary. System orientation: transactions versus analyses. Database design: ER + application versus star + subject.

58 The SQL example table.

59 Syntax for SQL queries:
SELECT desired output attributes
FROM the names of the tables used as input
[WHERE conditions the desired output records must fulfill]
[GROUP BY grouping attributes
[HAVING conditions the desired output groups must fulfill] ];
Clauses in brackets are optional.

Example where the number of salesmen is counted per city (BYNAVN = city name) for records with S-KODE > 10. Because S-KODE is a condition on the individual records and not on the groups, it belongs in the WHERE clause:
SELECT BYNAVN, COUNT(*) AS "Antal sælgere" FROM S WHERE "S-KODE" > 10 GROUP BY BYNAVN;

BYNAVN      Antal sælgere (number of salesmen)
københavn   2
Odense      1
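The difference between a condition on records (WHERE) and a condition on groups (HAVING) can be tried out with a small Python sketch using SQLite. The example table on the slide is not reproduced here, so the rows below are made up, and the column is named S_KODE with an underscore to avoid quoting; both are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE S (BYNAVN TEXT, S_KODE INTEGER)")
conn.executemany("INSERT INTO S VALUES (?, ?)", [
    # hypothetical rows
    ("København", 11), ("København", 12), ("Odense", 15), ("Aarhus", 5),
])

# WHERE removes records before the groups are formed.
rows_filtered = conn.execute(
    "SELECT BYNAVN, COUNT(*) FROM S WHERE S_KODE > 10 GROUP BY BYNAVN ORDER BY BYNAVN"
).fetchall()

# HAVING removes whole groups after the aggregation.
groups_filtered = conn.execute(
    "SELECT BYNAVN, COUNT(*) FROM S GROUP BY BYNAVN HAVING COUNT(*) > 1 ORDER BY BYNAVN"
).fetchall()

print(rows_filtered)
print(groups_filtered)
```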

60 Aggregations at group level:
In SQL grouping queries the aggregations SUM, MIN, MAX, AVG operate per group. COUNT(*) counts the number of output records in each group. COUNT(attribute) counts the number of non-NULL attribute values in each group; COUNT(DISTINCT attribute) counts the number of different attribute values in each group.
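In standard SQL, COUNT(*) counts records, COUNT(attribute) counts the non-NULL values of the attribute, and COUNT(DISTINCT attribute) counts the different non-NULL values. The following Python sketch with SQLite shows the three variants side by side; the table s and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s (city TEXT, manager TEXT)")
conn.executemany("INSERT INTO s VALUES (?, ?)", [
    ("Odense", "Ann"),
    ("Odense", "Ann"),
    ("Odense", None),  # NULL manager
])

row = conn.execute(
    "SELECT COUNT(*), COUNT(manager), COUNT(DISTINCT manager) FROM s GROUP BY city"
).fetchone()
print(row)  # (3, 2, 1)
```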

61 Example of group level aggregations
List statistics about the real sales locations with more than one salesman:

62 Kimball’s data warehouse concepts:
[Figure: Kimball’s architecture. ETL side: operational systems (data sources) feed the Data Staging Area, whose services are Extract, Transform and Load. Presentation servers: data marts with atomic data and data marts with aggregate-only data, tied together by the Data Warehouse Bus (conformed dimensions and facts). Query side: query services (warehouse browsing, access and security, query management, standard reporting, activity monitor) and desktop data access tools (query tools, reporting tools, data mining). Metadata spans both sides. Legend: data, service, element.]
Inmon does not use the conformed facts and dimension table concepts!

63 In the DATA VAULT architecture the data marts are loaded from a normalized database with historic information.
[Figure: existing databases and systems (OLTP), each an application with its database, feed the Data Vault through ETL. New databases and systems (OLAP) are data marts (DM) loaded from the Data Vault and used by OLAP, data mining and visualization applications.]

64 In the future the DATA VAULT may be the only database and stored in-memory.
[Figure: applications with their databases, ETL, the Data Vault, and data marts (DM) used by OLAP, data mining and visualization applications.]
SAP has already developed an in-memory OLAP database called HANA.

