Published by Rylan Bratcher
Modified over 2 years ago
1 Introduction to Data Warehouse Design
Theory, Practice & Methodology of Relational Database Design and Programming
Copyright © Ellis Cohen 2002-2006
These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. For more information on how you may use them, please see http://www.openlineconsult.com/db
2 Topics
- Overview
- Star Schema: Fact & Dimension Tables
- The Star Schema & Denormalization
- The Data Cube
- ETL: Extraction, Transformation & Loading
3 Overview
4 Data Warehousing & Data Mining
Data Warehousing
- Techniques for representing & querying large amounts of relatively static data
- Potentially stored in Multi-Dimensional Databases
- On-line Analysis & Decision Support
Data Mining
- Automated analysis: discovering (potentially) unexpected patterns in large amounts of data
5 Operational vs Analytical DBs
Operational Database
- Data needed and updated constantly to directly support business operations
- Focus on OLTP (on-line transaction processing): transactional access & modification of a relatively small number of data points at a time
Analytical Database: Data Warehouse & Data Mart
- Copious amounts of relatively static data, culled & integrated across the enterprise, cleansed & summarized, maintained historically, used for decision support and business intelligence (BI)
- Focus on OLAP (on-line analytical processing): querying large amounts of data, scheduled modifications
6 Operational vs Analytical DBs
                 Operational                     Warehouse
Usage            Transactional (OLTP)            Analytical (OLAP)
Organized for    Modifications                   Queries
Modifications    Continual                       Periodic
Queries          Narrow-scope, low-complexity    Broad-scope, high-complexity
Database         Relational                      Relational / Dimensional
Data             Normalized                      Denormalized, Aggregated & Derived
7 Central Data Warehouse
[Figure: Central Data Warehouse, from the Oracle 9i Data Warehousing Guide]
8 Warehouse Questions
- How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?
- What are the top 25 selling products by category and region for this past quarter?
- What percent of the market do we own for each product we make?
- Which of our customers' zip codes were responsible for the top 10% of total sales over the last year?
9 Star Schema: Fact & Dimension Tables
10 Star Schema
Data warehouses are organized using Star Schema models. A Star Schema has a central fact table, with a composite primary key, which references multiple dimension tables.
Stores (Dimension): storid, …
Products (Dimension): prodid, …
DailySales (Fact): storid, prodid, date, price, units
- storid and prodid are foreign keys to the Stores and Products dimension tables
- price and units are the measures: what each fact measures
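A minimal sketch of the star schema on this slide, using Python's built-in sqlite3 module as a stand-in for a warehouse database. The column types, the extra `stornam` and `color` attributes, the sample rows, and the helper `try_insert` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, stornam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT);
CREATE TABLE DailySales (
    storid INTEGER NOT NULL REFERENCES Stores(storid),     -- foreign key to a dimension
    prodid INTEGER NOT NULL REFERENCES Products(prodid),   -- foreign key to a dimension
    date   TEXT    NOT NULL,
    price  REAL,                                            -- measure
    units  INTEGER,                                         -- measure
    PRIMARY KEY (storid, prodid, date)                      -- composite primary key
);
""")
conn.execute("INSERT INTO Stores VALUES (1, 'Pittsburgh')")
conn.execute("INSERT INTO Products VALUES (10, 'red')")
conn.execute("INSERT INTO DailySales VALUES (1, 10, '2002-03-01', 59.95, 3)")


def try_insert(row):
    """Return True if the fact row is accepted, False if a key constraint rejects it."""
    try:
        conn.execute("INSERT INTO DailySales VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False
```

A second fact for the same store, product and date is rejected by the composite key, and a fact that names an unknown store is rejected by the foreign key.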
11 Subjects (Facts) & Dimensions
Instead of thinking about entities & relationships, design a data warehouse by thinking about
Subjects (represented by fact tables)
- Sales, Distribution, Purchases
Dimensions (represented by dimension tables): how to uniquely identify the facts about each subject
- Sales: Products, Stores, Dates (maybe also Employees, Customers: depends on what you want to analyze)
- Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks)
- Purchases: Products, Vendors, Dates (maybe also Employees)
12 Fact & Dimension Tables
Fact Tables
- Composite primary key: identifies the dimension values that uniquely identify each fact (or measurement)
- Additional attributes (measures): what is measured about each fact
Dimension Tables
- Primary key: a surrogate key that uniquely identifies each dimension value
- Additional attributes: properties of each dimension value
13 Dimensions & Granularity
Dimensions have different levels of granularity
- Stores -> Districts -> Regions
- Products -> ProductTypes -> SubCategories -> Categories
- Products -> ProductTypes -> Manufacturers
14 Snowflake Schema (with Normalized Dimensions)
DailySales (Fact): storid, prodid, date, price, units
Stores (Dimension): storid, stornam, city, state, distid
Districts: distid, distnam, distarea, regid
Regions: regid, regnam
Products (Dimension): prodid, color, size, prodtyp
ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
SubCategories: subcatid, subnam, subdescr, catid
Categories: catid, catnam, catdescr
Manufacturers: manfid, manfnam
15 Typical Warehouse Query
How many red Bally shoes did we sell in each region in 2002?
    SELECT r.regnam AS region, sum(f.units) AS sumunits
    FROM DailySales f
      NATURAL JOIN Stores
      NATURAL JOIN Districts
      NATURAL JOIN Regions r
      NATURAL JOIN Products p
      NATURAL JOIN ProductTypes
      NATURAL JOIN SubCategories s
      NATURAL JOIN Manufacturers m
    WHERE to_char(f.date,'YYYY') = '2002'
      AND p.color = 'red'
      AND m.manfnam = 'Bally'
      AND s.subnam = 'Shoe'
    GROUP BY r.regnam
16 The Star Schema & Denormalization
17 Snowflake Schema is Normalized
Snowflake Schema has normalized dimension tables
- Each dimension is represented by multiple sub-dimension tables at different levels of granularity (Product, ProductType, Category, etc.)
- Each sub-dimension table has attributes appropriate to its level of granularity
  - Product: color, size
  - ProductType: prodnam, prodescr
  - etc.
18 Denormalization
Normalized (snowflake):
  Products (Dimension): prodid, color, size, prodtyp
  ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
  SubCategories: subcatid, subnam, subdescr, catid
  Categories: catid, catnam, catdescr
  Manufacturers: manfid, manfnam
Denormalized (star):
  Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Why is there redundancy in the denormalized table?
19 Star Schema is Denormalized
The Star Schema has denormalized dimension tables
- Each dimension is formed by joining its sub-dimension tables together into a single dimension table
- The dimension table has attributes at different levels of granularity
- The dimension tables contain lots of redundancy, but queries use far fewer joins
- The redundancy does not dramatically impact space: dimension tables are usually < 1% of the size of the fact table (but some descriptions may need to be stored separately)
20 Star Schema (Fully Denormalized Dimensions)
DailySales (Fact): storid, prodid, date, price, units
- Why should date be replaced by a dateid?
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
- catdescr might not be included here if it is a GIF or a 4000-byte description
21 Query with Denormalized Schema
How many red Bally shoes did we sell in each region in 2002?
    SELECT s.regnam AS region, sum(f.units) AS sumunits
    FROM DailySales f
      NATURAL JOIN Stores s
      NATURAL JOIN Products p
    WHERE to_char(f.date,'YYYY') = '2002'
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam
Costly: the to_char(f.date,'YYYY') comparison
22 Typical Date Dimension Attributes
It is common and almost always more efficient to treat Dates as a dimension with a number of attributes
Field         Example Value
Year          2005
Month         Feb
Quarter       1
DayOfMonth    12
DayOfYear     43
WeekOfYear    7
DayOfWeek     Sat
- Both Month and Year are required to identify a month within a year. Might want to add a single MonthYr field to represent the pair
- Note: Quarter is less granular than Month
- Also, DayOfYear, WeekOfYear & DayOfWeek can be derived from the other fields
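The attributes on this slide can all be derived from a calendar date. The sketch below shows one way to do so in Python; the function name `date_dimension_row` is hypothetical, and the week numbering (week 1 starts on January 1) is an assumption chosen because it reproduces the slide's example value of 7, whereas ISO week numbering would give a different result.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # date.weekday(): Monday is 0


def date_dimension_row(d):
    """Build one row of a Dates dimension from a datetime.date."""
    day_of_year = d.timetuple().tm_yday
    return {
        "Year": d.year,
        "Month": MONTHS[d.month - 1],
        "Quarter": (d.month - 1) // 3 + 1,
        "DayOfMonth": d.day,
        "DayOfYear": day_of_year,
        "WeekOfYear": (day_of_year - 1) // 7 + 1,  # assumption: week 1 starts on Jan 1
        "DayOfWeek": DAYS[d.weekday()],
        "MonthYr": f"{MONTHS[d.month - 1]}{d.year}",
    }


row = date_dimension_row(date(2005, 2, 12))
```

For Feb 12, 2005 this yields the example values in the table above, plus the combined MonthYr field.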
23 Extended Date Dimension Hierarchy
[Diagram: hierarchy relating the following date attributes]
- Date (e.g. Feb 12, 2005)
- DayOfWeek (e.g. Sat)
- WeekYr (e.g. 2005Wk7)
- MonthYr (e.g. Feb2005)
- QuarterYr (e.g. 2005Q1)
- Year (e.g. 2005)
- Quarter (e.g. 1)
- Month (e.g. Feb)
- WeekOfYear (e.g. 7)
- DayOfYear (e.g. 43)
- DayOfMonth (e.g. 12)
24 Star Schema with Date Dimension
In general, represent dates by a Dates dimension table
DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Dates (Dimension): dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year
25 Query using Dates Dimension
How many red Bally shoes did we sell in each region in 2002?
    SELECT s.regnam AS region, sum(f.units) AS sumunits
    FROM DailySales f
      NATURAL JOIN Stores s
      NATURAL JOIN Products p
      NATURAL JOIN Dates d
    WHERE d.year = 2002
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam
Needs an extra join, but the query is simpler and executes faster if Dates is indexed by year
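To make the query concrete, here is a small runnable sketch using Python's built-in sqlite3 module. The sample rows and the index name `dates_year` are invented for illustration, the tables carry only the columns the query touches, and explicit `JOIN ... USING` is used in place of `NATURAL JOIN` to spell out the join columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, regnam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT, manfnam TEXT, subnam TEXT);
CREATE TABLE Dates    (dateid INTEGER PRIMARY KEY, year INTEGER);
CREATE INDEX dates_year ON Dates(year);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER, units INTEGER);

INSERT INTO Stores   VALUES (1, 'East'), (2, 'West');
INSERT INTO Products VALUES (10, 'red', 'Bally', 'Shoe'), (11, 'blue', 'Bally', 'Shoe');
INSERT INTO Dates    VALUES (1, 2002), (2, 2002), (3, 2003);
INSERT INTO DailySales VALUES
    (1, 10, 1, 3), (1, 10, 2, 4), (2, 10, 1, 5),  -- red Bally shoes sold in 2002
    (1, 11, 1, 7),                                -- blue shoes: filtered out
    (1, 10, 3, 9);                                -- sold in 2003: filtered out
""")

rows = conn.execute("""
    SELECT s.regnam AS region, SUM(f.units) AS sumunits
    FROM DailySales f
      JOIN Stores s   USING (storid)
      JOIN Products p USING (prodid)
      JOIN Dates d    USING (dateid)
    WHERE d.year = 2002
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam
    ORDER BY s.regnam
""").fetchall()
```

With this sample data, `rows` holds one total per region: 7 units for East and 5 for West.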
26 The Data Cube
27 Data Cube Representation
[Figure: Sales Cube with three axes: Products dimension, Stores dimension (e.g. Pgh, NYC), Dates dimension]
Parts of the cube labeled in the figure:
- Sales of Beanie Babies in Pittsburgh Store Today
- Sales of Beanie Babies in Pittsburgh Store Yesterday
- All Sales (of all products over time) in NYC Store
28 Data Cube Characteristics
Each axis represents a dimension
- Elements along an axis are at the lowest granularity for that dimension
Measures are the data within the cells at intersections of the cube
- Information about the topic of the cube
- e.g. units & price for each sales fact (i.e. sales in a store of a product on a date)
29 Data Cube Views
Slice: view data relative to a point in one or more dimensions
- View sales today (for each store & each product category)
- View Bally shoe sales at the NYC store (for each date)
Dice: view data relative to (sets of) ranges in one or more dimensions
- View sales for the last 4 days (for each store & each product category)
- View sales for each type of shoes at all the NY and NJ stores for each of the last 10 quarters
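Slice and dice can be illustrated on a toy cube held in a Python dictionary keyed by (store, product, date). This is only a sketch: the cell values and the function names `slice_cube` and `dice_cube` are made up, and a real MDDB would use a far more compact representation.

```python
# Toy sales cube: (store, product, date) -> units sold. Missing keys are empty cells.
cube = {
    ("NYC", "shoes", "2005-02-11"): 4,
    ("NYC", "shoes", "2005-02-12"): 6,
    ("NYC", "hats",  "2005-02-12"): 1,
    ("Pgh", "shoes", "2005-02-12"): 2,
    ("Pgh", "hats",  "2005-02-10"): 5,
}


def slice_cube(cube, store=None, product=None, date=None):
    """Slice: fix a single point in one or more dimensions (None leaves a dimension free)."""
    wanted = (store, product, date)
    return {key: units for key, units in cube.items()
            if all(w is None or w == coord for w, coord in zip(wanted, key))}


def dice_cube(cube, stores=None, products=None, dates=None):
    """Dice: restrict one or more dimensions to a set of values (None leaves it free)."""
    wanted = (stores, products, dates)
    return {key: units for key, units in cube.items()
            if all(w is None or coord in w for w, coord in zip(wanted, key))}


one_day = slice_cube(cube, date="2005-02-12")
nyc_two_days = dice_cube(cube, stores={"NYC"}, dates={"2005-02-11", "2005-02-12"})
```

Here `one_day` keeps the three cells for a single date, while `nyc_two_days` keeps the NYC cells that fall within a two-day range.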
30 MDDB: MultiDimensional DataBase
- Knows about Fact & Dimension Tables
- Uses a direct (n-dimensional) hypercube representation to provide fast access to fact elements in a query
- Supports sparse representations
  - The Pittsburgh store doesn't sell lingerie
  - The Cape Cod store is not open in the winter
  - Baked Beanie Babies are only sold in the NE region
- Uses a specialized query language, e.g. MDX (used by Microsoft OLAP Server), with basic data types: cube, slice, dice
31 ETL: Extraction, Transformation & Loading
32 ETL: Extraction, Transformation & Loading
Extraction -> Transformation -> Loading
ETL accounts for 80% of the total cost of building a warehouse
33 Extraction
Sources
- Multiple DBs
- Flat files
- External data sources, e.g. Census, Geographic, Weather, Financial, Unemployment Data
- Standard DB/spreadsheet format or semi-structured data from the web
Frequency
- Periodic (hourly, daily, weekly, …)
- Triggered: single event; number, sequence, or pattern of events
Mechanisms
- Snapshots / Materialized Views / Replication
- Database Triggers
- Process Logs
- Query Sources (full vs incremental)
34 Transformation
Cleaning
- Scrubbing
- Filtering
- Conformance
Integration
- Renaming
- Fusion & Merging
- Determine Surrogate Keys
- Timestamping
- Summarization
Schema Organization
- Dimension Tables
- Pre-Aggregation via Materialized Views
- Derivation
35 (Transformation) Cleaning
Scrubbing
- Use domain-specific knowledge, e.g. SS#, phone number, zip code
Filtering
- Check for inconsistent data
- Use data validation rules
Conformance
- Map similarly typed data to a standard representation
- Convert units (inch => cm, $ => euro), scale (mm => cm), formats (string => integer, string with/without $)
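A few of these cleaning steps can be sketched as small Python functions. The function names are hypothetical, the zip code rule assumes US 5-digit or ZIP+4 codes, and currency conversion is left out because it depends on an exchange rate that the slides do not give.

```python
import re


def is_valid_zip(text):
    """Scrubbing: domain-specific check, assuming US 5-digit or ZIP+4 codes."""
    return re.fullmatch(r"\d{5}(-\d{4})?", text) is not None


def conform_price(raw):
    """Conformance: map prices given as '$1,234.50', '1234.5' or numbers to a float."""
    if isinstance(raw, str):
        raw = raw.strip().lstrip("$").replace(",", "")
    return float(raw)


def inches_to_cm(inches):
    """Unit conversion: inch => cm."""
    return inches * 2.54


def mm_to_cm(mm):
    """Scale conversion: mm => cm."""
    return mm / 10
```

Rows that fail a check such as `is_valid_zip` would typically be filtered out or routed for manual review rather than loaded as-is.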
36 (Transformation) Integration
Renaming
- Resolve name conflicts
Fusion
- e.g. merge properties in city db with properties in developer lists
Determine Surrogate Keys
- Do not use keys from operational data as the primary key in warehouse data
Timestamping
- Add timestamps to fact data where missing to enable historical queries
Reorganization & Evolution
- Support data reorganization & schema evolution
Summarization
- Summarize original operational data and combine into less detailed tables
37 Integration (Data Reorganization)
What do we do when attributes change? Suppose districts are reorganized and a store is now part of a different district.
Consistently changing the mapping of store to district
- Allows new and old data to be compared reasonably by district
- But causes incorrect comparisons by district among older data alone
Solutions
1. Keep fields for both the old and new mapping (in fact, potentially a separate field for each reorganization)
2. Add an effective date to the store dimension, with multiple rows for the same store, each with a different effective date
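The effective-date solution can be sketched with Python's built-in sqlite3 module. The column name `effective`, the district codes and the sample rows are assumptions. Each sale is matched to the store row whose effective date is the latest one not after the sale date, so older sales stay attributed to the district the store belonged to at the time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per store per district assignment, dated by when it took effect.
CREATE TABLE Stores (storid INTEGER, distid TEXT, effective TEXT,
                     PRIMARY KEY (storid, effective));
CREATE TABLE DailySales (storid INTEGER, date TEXT, units INTEGER);

INSERT INTO Stores VALUES (1, 'D1', '2000-01-01'), (1, 'D2', '2004-01-01');
INSERT INTO DailySales VALUES (1, '2003-06-01', 10), (1, '2004-06-01', 20);
""")

rows = conn.execute("""
    SELECT s.distid, SUM(f.units)
    FROM DailySales f
      JOIN Stores s
        ON s.storid = f.storid
       AND s.effective = (SELECT MAX(h.effective)
                          FROM Stores h
                          WHERE h.storid = f.storid
                            AND h.effective <= f.date)
    GROUP BY s.distid
    ORDER BY s.distid
""").fetchall()
```

The 2003 sale is counted under district D1 and the 2004 sale under D2, even though both belong to the same store. Dates are stored as ISO strings here so that plain string comparison orders them correctly.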
38 (Integration) Summarization
Operational tables
- CustomerTransaction: transid, custid, empid, posid, time
- ItemPurchase: transid, lineno, prodid, price, units
- PointOfSaleTerminals: posid, postyp, storid, loc
Summarized fact table
- DailySales (Fact): storid, prodid, date, price, units
Might build different fact tables for different purposes, e.g. ones involving Customers, ones involving Store Locations
Tradeoff: Smaller Fact Tables vs. Missed Relationships
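One possible way to derive DailySales rows from the three operational tables is sketched below with Python's built-in sqlite3 module. The sample data is invented, and the choice to store a units-weighted average as the daily `price` is an assumption; the slides do not say how price is summarized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerTransaction (transid INTEGER PRIMARY KEY, custid INTEGER,
                                  empid INTEGER, posid INTEGER, time TEXT);
CREATE TABLE ItemPurchase (transid INTEGER, lineno INTEGER, prodid INTEGER,
                           price REAL, units INTEGER);
CREATE TABLE PointOfSaleTerminals (posid INTEGER PRIMARY KEY, postyp TEXT,
                                   storid INTEGER, loc TEXT);

INSERT INTO PointOfSaleTerminals (posid, storid) VALUES (1, 1), (2, 1);
INSERT INTO CustomerTransaction (transid, posid, time) VALUES
    (1, 1, '2005-02-12 09:00:00'),
    (2, 2, '2005-02-12 17:30:00'),
    (3, 1, '2005-02-13 10:00:00');
INSERT INTO ItemPurchase VALUES
    (1, 1, 10, 20.0, 2),
    (2, 1, 10, 20.0, 3),
    (2, 2, 11,  5.0, 1),
    (3, 1, 10, 20.0, 1);
""")

# Roll individual line items up to one row per store, product and day.
rows = conn.execute("""
    SELECT pos.storid,
           i.prodid,
           date(t.time) AS saledate,
           SUM(i.price * i.units) / SUM(i.units) AS price,  -- assumption: weighted average
           SUM(i.units) AS units
    FROM ItemPurchase i
      JOIN CustomerTransaction t    ON t.transid = i.transid
      JOIN PointOfSaleTerminals pos ON pos.posid = t.posid
    GROUP BY pos.storid, i.prodid, saledate
    ORDER BY saledate, i.prodid
""").fetchall()
```

The customer, employee and terminal columns are dropped in the roll-up, which makes the fact table smaller but also loses any relationship between sales and individual customers.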
39 Loading
Alternatives
- Incremental vs Full Refresh: most data is incrementally added to the warehouse
- Off-line vs on-line
- Frequency: nightly, weekly, monthly
- All-at-once vs staged
What indices to create or drop?
What statistics to collect (& use)?
40 Constellation Schema
Data warehouses are often designed as constellations
- Multiple fact tables
- Shared/related dimension tables
Examples
- Sales: store, product, date
- Distribution: distributor, store, product, carrier, period
- Advertising: store, medium, product, period
Query across the same or related dimensions
- Compare advertising and sales by store within various periods
41 Data Marts
Store different fact tables (or different groups of fact tables) in separate data marts
42 Data Mart Architectures
A data mart is a subset of a Data Warehouse that meets the needs of a subgroup of users
Top-down
- Extracted from the Data Warehouse
- Problem: early availability
Bottom-up
- Built directly from the staging area
- Can be combined to form a warehouse
- Problem: conformance. The ETL tool must provide metadata
Hybrid
- Some data marts built directly from the staging area
- Others extracted from the Data Warehouse
43 Metadata Management
Identify & define each attribute
- Source(s)
- Transformation(s) applied
- How aggregated
- Description of what it represents
- Relationships to other attributes
- History