Presentation on theme: "Contents of this slideshow:"— Presentation transcript:
1 Contents of this slideshow: What is a data warehouse? Multi-dimensional data modelling. Data warehouse architecture. The hidden slides of this slideshow may be important. However, I will focus on learning by exercises, and therefore new concepts are often only listed in hidden slides.
2 Data warehouses versus operational systems: OLAP (On-Line Analytical Processing) normally uses a special data warehouse database and supports data analysis and decision making. OLTP (On-Line Transaction Processing) normally uses a relational DBMS and supports day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
3 Why a separate data warehouse? High performance for both systems: the DBMS is tuned for OLTP (access methods, indexing, concurrency control, recovery); the warehouse is tuned for OLAP (complex OLAP queries, multidimensional view, consolidation, i.e. responses to slowly changing dimensions). Different functions and different data: Missing data: decision support (DS) requires historical data, which operational DBs do not typically maintain in a way suited for decision support. Data integration: DS requires integrated data from heterogeneous sources. Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled.
4 OLTP versus OLAP: OLTP = On-Line Transaction Processing. OLAP = On-Line Analytical Processing. [Figure: business functions in the supply or value chain.]
5 Contents of this slideshow: What is a data warehouse? Multi-dimensional data modelling. Data warehouse architecture.
6 An example of a data warehouse: A star schema data warehouse has a central table (the fact table) surrounded by dimension tables with one-to-many relationships towards the fact table. The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!
7 Dimension hierarchies: A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table. In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy. Roll-up = aggregate along one or more dimensions. Drill-down = “de-aggregate” = break an aggregate into its constituents.
8 Two different types of drilling: drilling in dimension hierarchies, and drilling between dimensions.
9 Which star schemas or data marts can be built by using the illustrated integrated e-commerce/ERP data model? Which star schema would you recommend to be implemented first?
10 A galaxy is a set of star fact tables with conformed (jointly adapted) dimensions. [Figure: the value chain.]
11 Conceptual modeling of data warehouses: Star schema: a fact table in the middle connected to a set of dimension tables. Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake. Galaxy schema: multiple fact tables share dimension tables (conformed dimensions); viewed as a collection of stars, it is therefore called a galaxy schema or fact constellation.
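As a rough illustration of the star schema definition above, the following Python sketch builds a tiny star in an in-memory SQLite database. All table and column names (dim_product, dim_salesman, fact_sales and their columns) are assumptions made for this example and do not come from the slides.

```python
import sqlite3

# Hypothetical star schema: one fact table in the middle, two dimension tables.
# Every table and column name here is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,   -- surrogate key
    product_name TEXT NOT NULL
);
CREATE TABLE dim_salesman (
    salesman_id   INTEGER PRIMARY KEY,  -- surrogate key
    salesman_name TEXT NOT NULL,
    branch_office TEXT NOT NULL         -- kept in the dimension: star, not snowflake
);
CREATE TABLE fact_sales (
    product_id  INTEGER NOT NULL REFERENCES dim_product(product_id),
    salesman_id INTEGER NOT NULL REFERENCES dim_salesman(salesman_id),
    qty         INTEGER NOT NULL,
    turnover    INTEGER NOT NULL
);
""")

# The fact table sits on the "many" side of every relationship:
# list the tables its foreign keys point to.
fact_foreign_keys = conn.execute("PRAGMA foreign_key_list(fact_sales)").fetchall()
referenced_tables = sorted(row[2] for row in fact_foreign_keys)
print(referenced_tables)
```

Moving branch_office out into its own table referenced from dim_salesman would turn this star into a snowflake; adding a second fact table that reuses dim_product would turn it into a galaxy.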
12 The aggregation level is the argument to the GROUP BY clause: SELECT Product#, SUM(Qty*Price) AS Turnover FROM Orderdetails NATURAL JOIN Products GROUP BY Product#;
13 Drill down to the Product per Salesman level: SELECT Product#, Salesman#, SUM(Qty*Price) AS Turnover FROM Orderdetails NATURAL JOIN Products NATURAL JOIN Salesmen GROUP BY Product#, Salesman#; Where should the Price be stored?
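The two queries on these slides can be tried out with a small Python sketch on an in-memory SQLite database. The table and column names follow the slides, but the rows are made up for illustration. Storing Price in Products is only one possible answer to the question above, assumed here so that the join has something to join on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; Price is assumed to live in Products.
conn.executescript("""
CREATE TABLE Products ("Product#" INTEGER PRIMARY KEY, Price REAL);
CREATE TABLE Salesmen ("Salesman#" INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Orderdetails ("Product#" INTEGER, "Salesman#" INTEGER, Qty INTEGER);
INSERT INTO Products VALUES (1, 10.0), (2, 5.0);
INSERT INTO Salesmen VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO Orderdetails VALUES (1, 1, 3), (1, 2, 2), (2, 1, 4);
""")

# Aggregation level: Product.
product_level = conn.execute("""
    SELECT "Product#", SUM(Qty * Price) AS Turnover
    FROM Orderdetails NATURAL JOIN Products
    GROUP BY "Product#"
    ORDER BY "Product#"
""").fetchall()

# Drill down: add Salesman# to the GROUP BY arguments.
product_salesman_level = conn.execute("""
    SELECT "Product#", "Salesman#", SUM(Qty * Price) AS Turnover
    FROM Orderdetails NATURAL JOIN Products NATURAL JOIN Salesmen
    GROUP BY "Product#", "Salesman#"
    ORDER BY "Product#", "Salesman#"
""").fetchall()

print(product_level)
print(product_salesman_level)
```

The only difference between the two queries is the extra grouping attribute, which is what makes it possible to generate drilling functions automatically.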
14 Snowflake schema with branches: A snowflake schema may have branches in the dimension hierarchies. Are Customers related to the Regions?
15 Drilling in dimension hierarchies:
Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF

Branch-office#   Turnover
LA               400,000
SF               200,000
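The roll-up from the salesman level to the branch-office level can be reproduced with the figures from this slide. The sketch below is Python with an in-memory SQLite database. The table name Sales and its column names are assumptions, and Jones is placed in LA because the LA total of 400,000 implies it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table and column names are illustrative; the figures are those on the slide.
conn.execute("CREATE TABLE Sales (Salesman TEXT, Branch_office TEXT, Turnover INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("Smith", "LA", 100000), ("Jones", "LA", 300000), ("Adams", "SF", 200000)],
)

# Roll up one level in the hierarchy: salesman -> branch office.
by_branch = conn.execute("""
    SELECT Branch_office, SUM(Turnover)
    FROM Sales
    GROUP BY Branch_office
    ORDER BY Branch_office
""").fetchall()

# Roll up to the top level: no GROUP BY arguments left.
top_level = conn.execute("SELECT SUM(Turnover) FROM Sales").fetchone()[0]

print(by_branch)
print(top_level)
```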
16 Drilling between dimension hierarchies:
Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones                      20,000     LA
Jones                      40,000     LA
. . .                                 SF

Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF
17 Roll up to the top level:
Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones                      20,000     LA
Jones                      40,000     LA
. . .                                 SF
Roll up can be executed by removing one or more arguments from the GROUP BY clause.
Roll up to the product level (columns Productname, Turnover): Screw, Bolt, Nut; 300,000.
Roll up to the top level (columns Top level, Turnover).
18 The aggregation level is the argument to the GROUP BY clause.
Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones                      20,000     LA
Jones                      40,000     LA
. . .                                 SF
[Figure labels: x1, x2, …, xn; aggregated data; non-aggregated data.]
19 Dimension hierarchies: A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table. In contrast to star schemas, a snowflake schema may have dimension hierarchies. Describe the advantages and disadvantages of using dimension hierarchies or a snowflake schema.
20 Exercise: The figure illustrates an ER-diagram of a car rental company like Hertz or Avis. Question 1. Design a star schema or galaxy for the car rental company. Question 2. Are there advantages in storing suppliers as customers in e.g. an e-commerce data warehouse?
21 Contents of this slideshow: What is a data warehouse? Multi-dimensional data modelling. Data warehouse architecture.
22 Data models: Relational models/ER-diagrams are used for OLTP databases. Stars, snowflakes and galaxies are used for OLAP databases. Cubes are used for OLAP databases.
24 A star schema DW can be illustrated as a multidimensional cube:
25 From tables to data cubes: The tables of a data warehouse may be viewed as a multidimensional data cube. Some data warehouse software products also store data in a cube structure. In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid.
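26 A sample data cube: Aggregated data is a cuboid between the base cuboid and the apex cuboid. The following slides show the different cuboids. [Figure: data cube with the dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A., Canada, Mexico, sum). One cell is labelled "Total annual sales of TV in U.S.A."; the cell (All, All, All) is also marked.]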
27 Cuboids corresponding to the previous cube: The nodes in the graph correspond to the aggregation levels. 0-D (apex) cuboid: all. 1-D cuboids: (country); (product); (date). 2-D cuboids: (product, date); (product, country); (date, country). 3-D (base) cuboid: (product, date, country).
28 A four-dimensional cube: a lattice of cuboids. The nodes in the graph correspond to the aggregation levels. 0-D (apex) cuboid: all. 1-D cuboids: (time); (item); (location); (supplier). 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier). 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier). 4-D (base) cuboid: (time, item, location, supplier).
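The lattice can be enumerated mechanically, since every subset of the dimensions is one cuboid, i.e. one possible list of GROUP BY arguments. The following Python sketch does this for the four dimensions named on the slide; the function name cuboid_lattice is just an illustrative choice.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: list of k-dimensional cuboids} for every subset of the dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")

# n dimensions (without hierarchies) give 2**n cuboids in total.
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())
print(total_cuboids)
```

The empty tuple at level 0 is the apex cuboid (no grouping attributes), and the single tuple at level 4 is the base cuboid.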
29 Describe the advantages and disadvantages of storing data in a cube in memory. Where should dimension attributes be stored?
30 OLAP cube operations: Roll Up = aggregating to a higher level (for example, from month to year). Drill Down = recalculation with more details. Slice = selecting a subset by using a fixed dimension value. Drill Across = join of fact data across conformed dimensions. Drill Through = accessing related data from an OLTP system. Aggregating. Pivoting = see next slide! Dictionary note: dice [dais] vb. = to play dice; to throw dice; to cut into cubes (e.g. diced carrots).
31 Pivoting = Transforming SQL query output to a user-friendly two-dimensional screen layout. [Figure: a fact table view next to a multi-dimensional cube, with the labels day 1 and day 2.]
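A minimal Python sketch of pivoting follows, assuming the query output arrives as (day, product, sales) rows. The rows and the product names are hypothetical; only the labels day 1 and day 2 come from the slide.

```python
from collections import defaultdict

# Hypothetical fact-table view: one row per (day, product, sales).
fact_rows = [
    ("day 1", "TV", 5),
    ("day 1", "PC", 3),
    ("day 2", "TV", 7),
    ("day 2", "PC", 2),
    ("day 2", "TV", 1),
]


def pivot(rows):
    """Turn (day, product, sales) rows into {product: {day: total sales}}."""
    table = defaultdict(lambda: defaultdict(int))
    for day, product, sales in rows:
        table[product][day] += sales
    return {product: dict(days) for product, days in table.items()}


grid = pivot(fact_rows)

# Two-dimensional screen layout: products as rows, days as columns.
days = sorted({day for day, _, _ in fact_rows})
print("product", *days, sep="\t")
for product in sorted(grid):
    print(product, *(grid[product].get(day, 0) for day in days), sep="\t")
```

Swapping the roles of day and product in the pivot function would rotate the layout, which is the operation the term pivoting refers to.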
32 OLAP server architectures: Relational OLAP (ROLAP): uses a relational or extended-relational DBMS to store and manage warehouse data; includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services; greater scalability. Multidimensional OLAP (MOLAP): array-based multidimensional storage engine (sparse matrix techniques); fast indexing to pre-computed summarized data. Hybrid OLAP (HOLAP): storage flexibility with a mix of ROLAP and MOLAP. POLAP: personal HOLAP.
33 Contents of this slideshow: What is a data warehouse? Multi-dimensional data modelling. Data warehouse design/implementation architectures: 1. Kimball has a bottom-up architecture. 2. Inmon has a top-down architecture. 3. The Data Vault architecture is normalized tables extended with historic data tables. That is, the Data Vault can be used to generate any data mart when needed.
34 Kimball’s Bottom-Up DW architecture: Kimball’s architecture uses conformed dimensions and conformed facts. Conformed dimensions make it possible to drill across from one data mart to another to present data from different marts in the same view. Only the conformed data have top-down design.
35 Kimball’s Data Warehousing Architecture: [Figure labels: ETL side; query side; metadata; data sources; Data Staging Area with services (extract, transform, load); presentation servers (data marts with atomic data, data marts with aggregate-only data) on the Data Warehouse Bus with conformed dimensions and facts; query services (data warehouse browsing, access and security, query management, standard reporting, activity monitor); desktop data access tools; reporting tools; data mining.] Surrogate key = a sequence number used as primary key.
36 More definitions: Enterprise warehouse: collects all of the information about subjects spanning the entire organization. Virtual warehouse: a set of views over operational databases; only some of the possible summary views may be materialized.
37 Kimball’s Data Warehouse Bus Architecture = an architecture for designing all the data marts of an enterprise by using conformed dimensions and conformed fact tables. Conformed dimensions = dimensions designed to be common for different data marts in order to make drill across operations possible. Conformed facts = measures with common units of measurement and granularities that make it possible to integrate measures from different fact tables. Data marts = Kimball uses the data mart concept for any multidimensional database. (Inmon uses the data mart concept for subject areas/business functions or department data warehouses.)
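A drill across operation over a conformed dimension could look roughly like the Python sketch below, using an in-memory SQLite database. The two fact tables, the shared product dimension, all column names and all rows are assumptions made for illustration. Each fact table is first aggregated to the grain of the conformed dimension, and the two results are then joined on the shared key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical galaxy: two fact tables sharing one conformed dimension.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE fact_sales (product_id INTEGER, sales_amount INTEGER);
CREATE TABLE fact_purchases (product_id INTEGER, purchase_amount INTEGER);
INSERT INTO dim_product VALUES (1, 'Screw'), (2, 'Bolt');
INSERT INTO fact_sales VALUES (1, 100), (1, 50), (2, 70);
INSERT INTO fact_purchases VALUES (1, 80), (2, 30), (2, 20);
""")

# Aggregate each fact table separately, then join the aggregates on the
# conformed dimension key. Joining the raw fact tables directly would
# multiply rows and inflate the sums.
drill_across = conn.execute("""
    SELECT p.product_name, s.total_sales, b.total_purchases
    FROM dim_product AS p
    JOIN (SELECT product_id, SUM(sales_amount) AS total_sales
          FROM fact_sales GROUP BY product_id) AS s
      ON s.product_id = p.product_id
    JOIN (SELECT product_id, SUM(purchase_amount) AS total_purchases
          FROM fact_purchases GROUP BY product_id) AS b
      ON b.product_id = p.product_id
    ORDER BY p.product_name
""").fetchall()

print(drill_across)
```

The query only works because both data marts use the same product keys, which is the point of conforming the dimension.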
38 William Inmon’s DATA WAREHOUSE architecture from 1990 has top-down design without conformed data. [Figure labels: EDW = Enterprise Data Warehouse; department data warehouses.] The DSA (Data Staging Area), where transformation takes place, is not illustrated.
39 William Inmon’s DATA WAREHOUSE concept: In practice you may have a fact table for each node in the value chain!
40 The DATA VAULT architecture from 2002-2005 has full top-down design and bottom-up implementation. [Figure label: normalized Data Vault with historic data.] In the Data Vault database with historic information, only the Extract activity has taken place. Therefore, the Data Vault architecture does not get bogged down in the design phase.
41 The DATA VAULT architecture takes the best from Inmon, Kimball and relational database design, as it may be viewed as a top-down designed normalized database with historic and conformed data. That is, the EDW is not part of the design. [Figure label: normalized Data Vault with historic data.]
42 Classical data warehousing: [Figure labels, stages numbered 1 to 3: Source (OLTP); DSA; EDW; DM; extraction; delta detection; cleansing; transformation; business rules; filter; aggregate; error handling.] DSA = Data Staging Area. EDW = Enterprise Data Warehouse.
43 Classical data warehousing: [Figure labels, stages numbered 1 to 3: Source (OLTP); DSA; EDW; DM; extraction; delta detection; cleansing; transformation; business rules; filter; aggregate; error handling.] HANA from SAP is an in-memory data warehouse product. [Second figure, numbered 1: Source (OLTP); DSA; delta detection; cleansing; transformation; business rules; error handling; aggregate; filter; extraction.]
44 Classical data warehousing and in-memory data warehousing: [Figure labels, classical, stages numbered 1 to 3: Source; DSA; EDW; DM; extraction; delta detection; cleansing; transformation; business rules; filter; aggregate; error handling.] [Figure labels, in-memory, numbered 1: Source (OLTP); DSA; delta detection; cleansing; transformation; business rules; error handling; aggregate; filter; extraction.] How can OLTP and OLAP be integrated in a common in-memory database?
45 Exercise: Transform the OLTP database to a Star schema DW for a Hospital. ER-diagram for a hospital.
50 Codd’s 12 rules for OLAP. [Figure annotations: "= Conformed dimensions"; "= Pivoting".]
51 Problems that data warehouses may solve: easy data access; easy presentation; direct access; overview; consistency; relevance.
52 Inmon versus Kimball’s DW definitions: Kimball and Inmon agree that OLAP data warehouses do not use the OLTP databases. However, what is the difference in the architectures? Why do you think Kimball’s DW architecture is used most in practice?
53 Dates may be stored in different formats: As an example, the first purchase date may be stored as an FK to a hierarchical time dimension and the birth date as an SQL timestamp. Why are different date formats used in the Customer table? Example: CSCW systems with replicated local data.
54 OLAP = On-Line Analytical Processing: Interactive analysis. Exploratory discovery. Requires fast response times. Data can be shown as multidimensional cubes. Cubes can have an arbitrary number of dimensions. Dimensions have hierarchies, e.g. day-month-year. OLAP operations: Aggregation = summarizing data, e.g. with SUM, AVG, COUNT… Starting level: (Quarter, Product). Roll Up: less detail, from Quarter to Year. Drill Down: more detail, from Quarter to Month. Slice: projection/selection, Year = 1999. Drill Across: “join” on shared dimensions. Drill Through: looking up the source data in the operational systems. Pivoting.
55 The Business Dimensional Lifecycle = Kimball’s activity model for data warehouse development has three parallel tracks. [Figure labels: project planning; requirements; design of technical architecture; product selection and installation; dimensional modelling; physical design; ETL: design and development; specification of applications; development; deployment; maintenance and growth; project management.]
56 The Data Warehouse Bus Architecture = an architecture for designing a set of data marts that together make up the enterprise's data warehouse, with shared conformed dimensions and conformed facts. Data marts = departmental data warehouses. Kimball uses the word more generally for a single multidimensional database. Conformed dimensions = shared dimensions that are adapted to the requirements of several data marts. Stovepipe = a derogatory term for a data warehouse without conformed dimensions.
57 Data warehouses versus operational systems: OLTP (on-line transaction processing) normally uses a relational DBMS and supports day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc. OLAP (on-line analytical processing) uses a data warehouse database, which may have a DBMS underneath it, and supports data analysis and decision making. Distinct features (OLTP versus OLAP): View: current versus evolutionary. System orientation: transactions versus analysis. Database design: ER + application versus star + subject.
59 Syntax for SQL queries:
SELECT desired output attributes
FROM the names of the tables used as input
[WHERE conditions the desired output records must fulfill]
[GROUP BY grouping attributes
[HAVING conditions the desired output groups must fulfill] ];
Clauses in brackets are optional.
Example where the number of salesmen is counted in cities where S-KODE > 10 (BYNAVN = city name, Antal sælgere = number of salesmen):
SELECT BYNAVN, COUNT(*) AS "Antal sælgere"
FROM S
GROUP BY BYNAVN
HAVING "S-KODE" > 10;
Output:
BYNAVN      Antal sælgere
københavn   2
Odense      1
60 Aggregations at group level: In SQL grouping queries the aggregations SUM, MIN, MAX, AVG operate per group. COUNT(*) counts the number of output records in each group. COUNT(attribute) counts the number of non-NULL attribute values in each group; COUNT(DISTINCT attribute) counts the number of different attribute values in each group.
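The difference between the COUNT variants is easy to check on a small table. This Python sketch uses an in-memory SQLite database. The table name S and the column BYNAVN follow the earlier slide, while the Name and Phone columns and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows: three salesmen in one city, one of them without a phone.
conn.executescript("""
CREATE TABLE S (Name TEXT, BYNAVN TEXT, Phone TEXT);
INSERT INTO S VALUES
    ('A', 'Odense', '111'),
    ('B', 'Odense', '111'),
    ('C', 'Odense', NULL);
""")

counts = conn.execute("""
    SELECT COUNT(*),               -- all records in the group
           COUNT(Phone),           -- records with a non-NULL Phone
           COUNT(DISTINCT Phone)   -- different non-NULL Phone values
    FROM S
    GROUP BY BYNAVN
""").fetchone()

print(counts)
```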
61 Example of group level aggregations: List statistics about the real sales locations with more than one salesman:
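One way such a query might look is sketched below in Python with an in-memory SQLite database. The table S and the columns BYNAVN and S-KODE are taken from the earlier syntax slide, but the rows and the choice of S-KODE as the column to summarize are assumptions. The condition "more than one salesman" is a condition on groups, so it goes in the HAVING clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows: two salesmen in København, one in Odense.
conn.executescript("""
CREATE TABLE S (BYNAVN TEXT, "S-KODE" INTEGER);
INSERT INTO S VALUES ('København', 20), ('København', 30), ('Odense', 15);
""")

stats = conn.execute("""
    SELECT BYNAVN,
           COUNT(*)      AS salesmen,
           MIN("S-KODE") AS min_code,
           MAX("S-KODE") AS max_code,
           AVG("S-KODE") AS avg_code
    FROM S
    GROUP BY BYNAVN
    HAVING COUNT(*) > 1      -- keep only groups with more than one salesman
    ORDER BY BYNAVN
""").fetchall()

print(stats)
```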
62 Kimball’s data warehouse concepts: [Figure labels: ETL side; query side; metadata; operational systems; data sources; Data Staging Area with services (extract, transform, load); presentation servers (data marts with atomic data, data marts with aggregate-only data) on the Data Warehouse Bus with conformed dimensions and facts; query services (data warehouse browsing, access and security, query management, standard reporting, activity monitor); desktop data access tools; reporting tools; data mining; data service element.] Inmon does not use the conformed facts and dimension table concepts!
63 In the DATA VAULT architecture the data marts are loaded from a normalized database with historic information. [Figure labels: existing databases and systems (OLTP): applications (Appl.) with their databases (DB); ETL; Data Vault; new databases and systems (OLAP): data marts (DM) with OLAP, data mining and visualization applications.]
64 In the future the DATA VAULT may be the only database and may be stored in-memory. [Figure labels: applications (Appl.) with their databases (DB); ETL; Data Vault; data marts (DM) with OLAP, data mining and visualization applications.] SAP has already developed an in-memory OLAP database called HANA.