1 Theory, Practice & Methodology of Relational Database Design and Programming Copyright © Ellis Cohen 2002-2006 Introduction to Data Warehouse Design.

1 Theory, Practice & Methodology of Relational Database Design and Programming
Copyright © Ellis Cohen
Introduction to Data Warehouse Design
These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. For more information on how you may use them, please see

2 Topics
–Overview
–Star Schema: Fact & Dimension Tables
–The Star Schema & Denormalization
–The Data Cube
–ETL: Extraction, Transformation & Loading

3 Overview

4 Data Warehousing & Data Mining
Data Warehousing
–Techniques for representing & querying large amounts of relatively static data
–Potentially stored in Multi-Dimensional Databases
–On-line Analysis & Decision Support
Data Mining
–Automated analysis: discovering (potentially) unexpected patterns in large amounts of data

5 Operational vs Analytical DBs
Operational Database
–Data needed and updated constantly to directly support business operations
–Focus on OLTP (on-line transaction processing): transactional access & modification of a relatively small number of data points at a time
Analytical Database: Data Warehouse & Data Mart
–Copious amounts of relatively static data, culled & integrated across the enterprise, cleansed & summarized, maintained historically, used for decision support and business intelligence (BI)
–Focus on OLAP (on-line analytical processing): querying large amounts of data, scheduled modifications

6 Operational vs Analytical DBs

                Operational             Warehouse
Usage           Transactional (OLTP)    Analytical (OLAP)
Organized for   Modifications           Queries
Modifications   Continual               Periodic
Queries         Narrow-scope,           Broad-scope,
                low-complexity          high-complexity
Database        Relational              Relational / Dimensional
Data            Normalized              Denormalized, aggregated & derived

7 Central Data Warehouse (from Oracle 9i Data Warehousing Guide)

8 Warehouse Questions
–How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?
–What are the top 25 selling products by category and region for this past quarter?
–What percent of the market do we own for each product we make?
–Which of our customers' zipcodes were responsible for the top 10% of total sales over the last year?

9 Star Schema: Fact & Dimension Tables

10 Star Schema
Stores (Dimension): storid, …
Products (Dimension): prodid, …
DailySales (Fact): storid, prodid, date, price, units
–storid and prodid are foreign keys into the dimension tables
–price and units are the measures: what each fact measures
A Star Schema has a central fact table, with a composite primary key, which references multiple dimension tables.
Data warehouses are organized using Star Schema models.
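A minimal sketch of this star schema, run through Python's built-in sqlite3 module so it can be tried directly. The `stornam` and `prodnam` columns stand in for the attributes the slide elides with "…", and the sample rows are invented for illustration; the slide itself does not prescribe any DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# Column names follow the slides; stornam and prodnam are illustrative attributes.
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, stornam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, prodnam TEXT);
CREATE TABLE DailySales (
    storid INTEGER NOT NULL REFERENCES Stores(storid),
    prodid INTEGER NOT NULL REFERENCES Products(prodid),
    date   TEXT    NOT NULL,
    price  REAL,                          -- measure
    units  INTEGER,                       -- measure
    PRIMARY KEY (storid, prodid, date)    -- composite key over the dimensions
);
INSERT INTO Stores   VALUES (1, 'Pittsburgh');
INSERT INTO Products VALUES (10, 'Loafer');
INSERT INTO DailySales VALUES (1, 10, '2002-03-01', 59.95, 3);
""")

def try_insert(row):
    """Return 'ok' if the fact row is accepted, 'rejected' if a key constraint fails."""
    try:
        conn.execute("INSERT INTO DailySales VALUES (?, ?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

duplicate_fact = try_insert((1, 10, "2002-03-01", 59.95, 1))  # same store, product, date
orphan_fact = try_insert((2, 10, "2002-03-01", 59.95, 1))     # store 2 is not in Stores
new_fact = try_insert((1, 10, "2002-03-02", 59.95, 2))        # new date, known store/product
fact_count = conn.execute("SELECT COUNT(*) FROM DailySales").fetchone()[0]
```

The composite primary key rejects a second fact for the same store, product and date, and the foreign keys reject facts that point at a missing dimension row.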

11 Subjects (Facts) & Dimensions
Instead of thinking about entities & relationships, design a data warehouse by thinking about
Subjects (represented by fact tables)
–Sales, Distribution, Purchases
Dimensions (represented by dimension tables): how to uniquely identify the facts about each subject
–Sales: Products, Stores, Dates (maybe also Employees, Customers: depends on what you want to analyze)
–Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks)
–Purchases: Products, Vendors, Dates (maybe also Employees)

12 Fact & Dimension Tables
Fact Tables
–Composite primary key: identifies the dimension values that together uniquely identify each fact (or measurement)
–Additional attributes (measures): what is measured about each fact
Dimension Tables
–Primary key: a surrogate key that uniquely identifies each dimension value
–Additional attributes: properties of each dimension value

13 Dimensions & Granularity
Dimensions have different levels of granularity
–Stores → Districts → Regions
–Products → ProductTypes → SubCategories → Categories
–ProductTypes → Manufacturers

14 Snowflake Schema (with Normalized Dimensions)
DailySales (Fact): storid, prodid, date, price, units
Stores (Dimension): storid, stornam, city, state, distid
Districts: distid, distnam, distarea, regid
Regions: regid, regnam
Products (Dimension): prodid, color, size, prodtyp
ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
SubCategories: subcatid, subnam, subdescr, catid
Categories: catid, catnam, catdescr
Manufacturers: manfid, manfnam

15 Typical Warehouse Query
How many red Bally shoes did we sell in each region in 2002?

    SELECT r.regnam AS region, sum(f.units) AS sumunits
    FROM DailySales f
      NATURAL JOIN Stores
      NATURAL JOIN Districts
      NATURAL JOIN Regions r
      NATURAL JOIN Products p
      NATURAL JOIN ProductTypes
      NATURAL JOIN SubCategories s
      NATURAL JOIN Manufacturers m
    WHERE to_char(f.date,'YYYY') = '2002'
      AND p.color = 'red'
      AND m.manfnam = 'Bally'
      AND s.subnam = 'Shoe'
    GROUP BY r.regnam

16 The Star Schema & Denormalization

17 Snowflake Schema is Normalized
The Snowflake Schema has normalized dimension tables
–Each dimension is represented by multiple sub-dimension tables at different levels of granularity (Product, ProductType, Category, etc.)
–Each sub-dimension table has attributes appropriate to its level of granularity
  –Product: color, size
  –ProductType: prodnam, prodescr
  –etc.

18 Denormalization
Normalized (snowflake) tables
  Products (Dimension): prodid, color, size, prodtyp
  ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
  SubCategories: subcatid, subnam, subdescr, catid
  Categories: catid, catnam, catdescr
  Manufacturers: manfid, manfnam
Denormalized table
  Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Why is there redundancy here?

19 Star Schema is Denormalized
The Star Schema has denormalized dimension tables
–Each dimension is formed by joining together its sub-dimension tables into a single dimension table
–The dimension table has attributes at different levels of granularity
–The dimension tables contain lots of redundancy, but queries use far fewer joins
–Does not dramatically impact space: dimension tables are usually < 1% of the size of the fact table (but some descriptions may need to be stored separately)
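As a rough illustration of forming a denormalized dimension by joining its sub-dimension tables, the sketch below builds a single Products table from the snowflake tables using sqlite3. The name `SnowProducts` (for the normalized product table) and all sample rows are assumptions made for this example; description columns are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (snowflake) sub-dimension tables, trimmed to a few columns.
CREATE TABLE Categories    (catid INTEGER PRIMARY KEY, catnam TEXT);
CREATE TABLE SubCategories (subcatid INTEGER PRIMARY KEY, subnam TEXT, catid INTEGER);
CREATE TABLE Manufacturers (manfid INTEGER PRIMARY KEY, manfnam TEXT);
CREATE TABLE ProductTypes  (prodtyp INTEGER PRIMARY KEY, prodnam TEXT,
                            subcatid INTEGER, manfid INTEGER);
CREATE TABLE SnowProducts  (prodid INTEGER PRIMARY KEY, color TEXT, size TEXT,
                            prodtyp INTEGER);

INSERT INTO Categories    VALUES (1, 'Footwear');
INSERT INTO SubCategories VALUES (1, 'Shoe', 1);
INSERT INTO Manufacturers VALUES (1, 'Bally');
INSERT INTO ProductTypes  VALUES (1, 'Loafer', 1, 1);
INSERT INTO SnowProducts  VALUES (100, 'red', '9', 1),
                                 (101, 'black', '9', 1),
                                 (102, 'red', '10', 1);

-- Denormalized star-schema dimension: one wide row per product.
CREATE TABLE Products AS
SELECT p.prodid, p.color, p.size, p.prodtyp, t.prodnam,
       m.manfid, m.manfnam, s.subcatid, s.subnam, c.catid, c.catnam
FROM SnowProducts p
JOIN ProductTypes  t ON t.prodtyp  = p.prodtyp
JOIN Manufacturers m ON m.manfid   = t.manfid
JOIN SubCategories s ON s.subcatid = t.subcatid
JOIN Categories    c ON c.catid    = s.catid;
""")

rows = conn.execute(
    "SELECT prodid, manfnam, subnam, catnam FROM Products ORDER BY prodid"
).fetchall()
distinct_manf, product_count = conn.execute(
    "SELECT COUNT(DISTINCT manfnam), COUNT(*) FROM Products"
).fetchone()
```

Every product row now repeats the manufacturer, subcategory and category names, which is the redundancy the slide mentions; in exchange, a query on manufacturer or category needs only one join from the fact table.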

20 Star Schema (Fully Denormalized Dimensions)
DailySales (Fact): storid, prodid, date, price, units
–Why should date be replaced by a dateid?
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
–Maybe catdescr is not included here if it is a GIF or a 4000-byte description

21 Query with Denormalized Schema
How many red Bally shoes did we sell in each region in 2002?

    SELECT s.regnam AS region, sum(f.units) AS sumunits
    FROM DailySales f
      NATURAL JOIN Stores s
      NATURAL JOIN Products p
    WHERE to_char(f.date,'YYYY') = '2002'
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam

Slide annotation: Costly (apparently marking the to_char condition on the date)

22 Typical Date Dimension Attributes
It is common and almost always more efficient to treat Dates as a dimension with a number of attributes

Field        Example Value
Year         2005
Month        Feb
Quarter      1
DayOfMonth   12
DayOfYear    43
WeekOfYear   7
DayOfWeek    Sat

Requires Month + Year to identify a month within a year. Might want to add a single MonthYr field to represent the pair.
Note: Quarter is less granular than Month. Also, DayOfYear, WeekOfYear & DayOfWeek can be derived from the other fields.
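The attributes in the table can be derived from a calendar date, as in the sketch below. The function name and the week-numbering rule are assumptions: weeks are counted in 7-day blocks from January 1, which reproduces the slide's WeekOfYear of 7 for Feb 12, 2005 (ISO week numbering would give a different value).

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]  # date.weekday(): Monday is 0

def date_attributes(d):
    """Build one row of a Dates dimension from a calendar date (hypothetical helper)."""
    day_of_year = d.timetuple().tm_yday
    month = MONTHS[d.month - 1]
    return {
        "year": d.year,
        "month": month,
        "monthyr": f"{month}{d.year}",           # single field for the Month + Year pair
        "quarter": (d.month - 1) // 3 + 1,
        "dayofmonth": d.day,
        "dayofyear": day_of_year,
        "weekofyear": (day_of_year - 1) // 7 + 1,  # assumed rule: 7-day blocks from Jan 1
        "dayofweek": DAYS[d.weekday()],
    }

row = date_attributes(date(2005, 2, 12))
```

For Feb 12, 2005 this yields the example values from the table: Year 2005, Month Feb, Quarter 1, DayOfMonth 12, DayOfYear 43, WeekOfYear 7, DayOfWeek Sat.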

23 Extended Date Dimension Hierarchy
Levels shown in the hierarchy diagram:
–Date (e.g. Feb 12, 2005)
–DayOfWeek (e.g. Sat)
–WeekYr (e.g. 2005Wk7)
–MonthYr (e.g. Feb2005)
–QuarterYr (e.g. 2005Q1)
–Year (e.g. 2005)
–Quarter (e.g. 1)
–Month (e.g. Feb)
–WeekOfYear (e.g. 7)
–DayOfYear (e.g. 43)
–DayOfMonth (e.g. 12)

24 Star Schema with Date Dimension
DailySales (Fact): storid, prodid, dateid, price, units
Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Dates (Dimension): dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year
In general, represent dates by a Dates dimension table

25 Query using Dates Dimension
How many red Bally shoes did we sell in each region in 2002?

    SELECT s.regnam AS region, sum(f.units) AS sumunits
    FROM DailySales f
      NATURAL JOIN Stores s
      NATURAL JOIN Products p
      NATURAL JOIN Dates d
    WHERE d.year = 2002
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam

Needs an extra join, but the query is simpler, and it executes faster if Dates is indexed by year
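To try the query, the sketch below loads a tiny star schema into SQLite and runs it. The tables are trimmed to the columns the query touches; the sample rows, the manufacturer "Acme", the index name `dates_year`, and the added ORDER BY (for a stable result order) are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, stornam TEXT, regnam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT, manfnam TEXT, subnam TEXT);
CREATE TABLE Dates    (dateid INTEGER PRIMARY KEY, date TEXT, year INTEGER);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER,
                         price REAL, units INTEGER,
                         PRIMARY KEY (storid, prodid, dateid));
CREATE INDEX dates_year ON Dates(year);   -- the slide suggests indexing Dates by year

INSERT INTO Stores   VALUES (1, 'Pittsburgh', 'East'), (2, 'Seattle', 'West');
INSERT INTO Products VALUES (100, 'red', 'Bally', 'Shoe'),
                            (101, 'black', 'Bally', 'Shoe'),
                            (102, 'red', 'Acme', 'Shoe');
INSERT INTO Dates    VALUES (1, '2002-03-01', 2002),
                            (2, '2003-03-01', 2003),
                            (3, '2002-03-02', 2002);
INSERT INTO DailySales VALUES
    (1, 100, 1, 59.95, 3),   -- East, red Bally shoe, 2002
    (1, 100, 3, 59.95, 4),   -- East, red Bally shoe, 2002
    (2, 100, 1, 59.95, 5),   -- West, red Bally shoe, 2002
    (1, 101, 1, 59.95, 7),   -- black: filtered out
    (1, 102, 1, 39.95, 11),  -- other manufacturer: filtered out
    (1, 100, 2, 59.95, 13);  -- 2003: filtered out
""")

result = conn.execute("""
    SELECT s.regnam AS region, sum(f.units) AS sumunits
    FROM DailySales f
      NATURAL JOIN Stores s
      NATURAL JOIN Products p
      NATURAL JOIN Dates d
    WHERE d.year = 2002
      AND p.color = 'red'
      AND p.manfnam = 'Bally'
      AND p.subnam = 'Shoe'
    GROUP BY s.regnam
    ORDER BY region            -- added only to make the row order deterministic
""").fetchall()
```

NATURAL JOIN matches on every shared column name, so this only works because the tables share nothing but their key columns (storid, prodid, dateid).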

26 The Data Cube

27 Data Cube Representation
[Figure: Sales Cube with one axis each for the Products, Stores (e.g. Pgh, NYC) and Dates dimensions. Labelled regions: sales of Beanie Babies in the Pittsburgh store today; sales of Beanie Babies in the Pittsburgh store yesterday; all sales (of all products over time) in the NYC store.]

28 Data Cube Characteristics
Each axis represents a dimension
–Elements along an axis are at the lowest granularity for that dimension
Measures are the data within the cells at intersections of the cube
–Information about the topic of the cube
–e.g. units & price for each sales fact (i.e. sales in a store of a product on a date)

29 Data Cube Views
Slice: view data relative to a point in one or more dimensions
–View sales today (for each store & each product category)
–View Bally shoe sales at the NYC store (for each date)
Dice: view data relative to (sets of) ranges in one or more dimensions
–View sales for the last 4 days (for each store & each product category)
–View sales for each type of shoe at all the NY and NJ stores for each of the last 10 quarters
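Slice and dice can be sketched over a cube held as a plain Python dictionary keyed by (store, product, date). This is only an illustration of the two views, not how an OLAP engine stores data; the function names and the sample cells are hypothetical.

```python
# Each cell of the cube: (store, product, date) -> units sold. Illustrative data.
DIMS = ("store", "product", "date")
cube = {
    ("NYC", "Shoe", "2002-03-01"): 3,
    ("NYC", "Hat",  "2002-03-01"): 2,
    ("NYC", "Shoe", "2002-03-02"): 4,
    ("Pgh", "Shoe", "2002-03-01"): 5,
    ("Pgh", "Hat",  "2002-03-03"): 1,
}

def select(cells, allowed):
    """Keep cells whose coordinate on each named dimension is in the allowed set."""
    index = {name: DIMS.index(name) for name in allowed}
    return {key: value for key, value in cells.items()
            if all(key[index[name]] in allowed[name] for name in allowed)}

def slice_cube(cells, **point):
    """Slice: fix a single value in one or more dimensions."""
    return select(cells, {name: {value} for name, value in point.items()})

def dice_cube(cells, **ranges):
    """Dice: restrict one or more dimensions to a set (or range) of values."""
    return select(cells, {name: set(values) for name, values in ranges.items()})

# Slice: shoe sales at the NYC store, for each date.
nyc_shoes = slice_cube(cube, store="NYC", product="Shoe")

# Dice: sales on the first two days at the NYC and Pgh stores, for each product.
first_two_days = dice_cube(cube, date=["2002-03-01", "2002-03-02"],
                           store=["NYC", "Pgh"])
```

A slice is just a dice whose allowed set for each chosen dimension has exactly one element, which is why both are built on the same `select` helper.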

30 MDDB: MultiDimensional DataBase
Knows about Fact & Dimension Tables
Uses a direct (n-dimensional) hypercube representation to provide fast access to the fact elements in a query
Supports sparse representations
–The Pittsburgh store doesn't sell lingerie
–The Cape Cod store is not open in the winter
–Baked Beanie Babies are only sold in the NE region
Uses a specialized query language, e.g. MDX (used by Microsoft OLAP Server), with basic data types: cube, slice, dice
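One way to picture a sparse representation is to store only the cells that actually hold facts and treat every other cell as empty. The sketch below does this with a dictionary; it is an assumption about the general idea, not a description of how any particular MDDB product lays out its hypercube, and the seasons, unit counts and function name are invented.

```python
import itertools

stores = ["Pittsburgh", "Cape Cod", "NYC"]
products = ["Beanie Baby", "Lingerie"]
seasons = ["winter", "summer"]

# Only cells with sales are stored. Pittsburgh sells no lingerie and
# Cape Cod is closed in the winter, so those cells are simply absent.
sparse_cube = {
    ("Pittsburgh", "Beanie Baby", "winter"): 40,
    ("Pittsburgh", "Beanie Baby", "summer"): 55,
    ("Cape Cod", "Beanie Baby", "summer"): 30,
    ("Cape Cod", "Lingerie", "summer"): 12,
    ("NYC", "Beanie Baby", "winter"): 70,
    ("NYC", "Beanie Baby", "summer"): 65,
    ("NYC", "Lingerie", "winter"): 20,
    ("NYC", "Lingerie", "summer"): 25,
}

def units(store, product, season):
    """Look up a cell directly by its coordinates; absent cells count as 0 units."""
    return sparse_cube.get((store, product, season), 0)

# A dense hypercube would allocate every combination of dimension values.
dense_cell_count = len(stores) * len(products) * len(seasons)
empty_cells = [cell for cell in itertools.product(stores, products, seasons)
               if cell not in sparse_cube]
```

Here a dense cube would need 12 cells while the sparse form stores 8; on real dimensions with thousands of products and stores the gap is typically far larger.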

31 ETL: Extraction, Transformation & Loading

32 ETL: Extraction, Transformation & Loading
–Extraction
–Transformation
–Loading
80% of the total cost of building a warehouse

33 Extraction
Sources
–Multiple DBs
–Flat files
–External data sources, e.g. Census, Geographic, Weather, Financial, Unemployment Data
–Standard DB/spreadsheet format or semi-structured data from the web
Frequency
–Periodic (hourly, daily, weekly, …)
–Triggered: single event; #, sequence, or pattern of events
Mechanisms
–Snapshots / Materialized Views / Replication
–Database Triggers
–Process Logs
–Query Sources (full vs incremental)

34 Transformation
Cleaning
–Scrubbing
–Filtering
–Conformance
Integration
–Renaming
–Fusion & Merging
–Determine Surrogate Keys
–Timestamping
–Summarization
Schema Organization
–Dimension Tables
–Pre-Aggregation via Materialized Views
–Derivation

35 (Transformation) Cleaning
Scrubbing
–Use domain-specific knowledge, e.g. SS#, phone number, zipcode
Filtering
–Check for inconsistent data
–Use data validation rules
Conformance
–Map similarly typed data to a standard representation
–Convert units (inch => cm, $ => euro), scale (mm => cm), formats (string => integer, string with/without $)
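The three cleaning steps might look like the small functions below. All function names, the ZIP code pattern (5 digits with an optional 4-digit extension) and the validation rule are assumptions chosen for illustration; real cleaning rules depend on the source systems.

```python
import re
from decimal import Decimal

def scrub_zipcode(raw):
    """Scrubbing: apply domain knowledge of US ZIP codes (5 digits, optional +4)."""
    match = re.fullmatch(r"\s*(\d{5})(?:-?(\d{4}))?\s*", raw)
    if match is None:
        return None                       # not a recognizable ZIP code
    base, plus4 = match.groups()
    return f"{base}-{plus4}" if plus4 else base

def is_consistent(row):
    """Filtering: a sample validation rule (units and price must not be negative)."""
    return row["units"] >= 0 and row["price"] >= 0

def conform_price(raw):
    """Conformance: map price strings with or without '$' to one numeric format."""
    return Decimal(str(raw).replace("$", "").replace(",", "").strip())

def inches_to_cm(inches):
    """Conformance: convert units to the warehouse standard (inch => cm)."""
    return inches * Decimal("2.54")

cleaned = [scrub_zipcode(z) for z in ["15213", " 152131234", "15213-1234", "1521"]]
prices = [conform_price(p) for p in ["$1,234.50", "19.99"]]
valid = is_consistent({"units": 3, "price": Decimal("59.95")})
invalid = is_consistent({"units": -1, "price": Decimal("59.95")})
```

Rows that fail scrubbing or filtering are usually set aside for review rather than silently dropped, although the slide does not say which policy to use.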

36 (Transformation) Integration
Renaming
–Resolve name conflicts
Fusion, e.g. merge
–properties in city db
–properties in developer lists
Determine Surrogate Keys
–Do not use keys from operational data as the primary key in warehouse data
Timestamping
–Add timestamps to fact data where missing to enable historical queries
Reorganization & Evolution
–Support Data Reorganization & Schema Evolution
Summarization
–Summarize original operational data and combine into less detailed tables
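A minimal sketch of surrogate key assignment, assuming operational keys arrive tagged with the system they came from. The class name, method name and source-system names are hypothetical.

```python
class SurrogateKeyMap:
    """Hands out warehouse keys that are independent of any operational key."""

    def __init__(self):
        self._keys = {}   # (source system, operational key) -> surrogate key

    def key_for(self, source, operational_key):
        natural = (source, operational_key)
        if natural not in self._keys:
            # Next unused integer; stable for the lifetime of the map.
            self._keys[natural] = len(self._keys) + 1
        return self._keys[natural]

store_keys = SurrogateKeyMap()
east_42 = store_keys.key_for("east_pos", 42)        # store 42 in one operational system
west_42 = store_keys.key_for("west_pos", 42)        # a different store that reuses key 42
east_42_again = store_keys.key_for("east_pos", 42)  # same store seen in a later load
```

Two operational systems can reuse the same key for different stores, so the warehouse key is derived from the pair (source, key) and stays the same across loads.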

37 Integration (Data Reorganization)
What do we do when attributes change?
Suppose districts are reorganized and a store is now part of a different district
Consistently changing the mapping of store to district
–Allows new and old data to be compared reasonably by district
–But causes incorrect comparisons by district among older data alone
Solutions
1. Keep fields for both the old and the new mapping (in fact, potentially a separate field for each reorganization)
2. Add an effective date to the store dimension, with multiple rows for the same store, each with a different effective date
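Solution 2 can be sketched as a store dimension with several rows per store, where a fact is matched to the row in effect on its date. The column name `storkey` (a surrogate key per row version), the district names and the dates are assumptions for illustration.

```python
from datetime import date

# Store dimension with one row per (store, effective date). Illustrative data:
# store 7 moved from the North district to the Central district on 2003-07-01.
store_rows = [
    {"storkey": 1, "storid": 7, "distnam": "North",   "effective": date(2000, 1, 1)},
    {"storkey": 2, "storid": 7, "distnam": "Central", "effective": date(2003, 7, 1)},
    {"storkey": 3, "storid": 8, "distnam": "North",   "effective": date(2000, 1, 1)},
]

def store_row_as_of(storid, when):
    """Return the dimension row for a store that was in effect on a given date."""
    candidates = [row for row in store_rows
                  if row["storid"] == storid and row["effective"] <= when]
    if not candidates:
        return None                      # store did not exist yet on that date
    return max(candidates, key=lambda row: row["effective"])

before = store_row_as_of(7, date(2002, 5, 1))["distnam"]
on_change = store_row_as_of(7, date(2003, 7, 1))["distnam"]
after = store_row_as_of(7, date(2004, 1, 1))["distnam"]
too_early = store_row_as_of(7, date(1999, 1, 1))
```

If each fact row stores the `storkey` in effect on its date, older sales keep rolling up to the old district while newer sales roll up to the new one.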

38 (Integration) Summarization
Operational tables
  CustomerTransaction: transid, custid, empid, posid, time
  ItemPurchase: transid, lineno, prodid, price, units
  PointOfSaleTerminals: posid, postyp, storid, loc
Summarized into
  DailySales (Fact): storid, prodid, date, price, units
Might build different fact tables for different purposes, e.g.
–ones involving Customers
–ones involving Store Locations
Tradeoff: smaller fact tables vs. missed relationships
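The summarization from the three operational tables into DailySales could be written as one aggregate query, shown here with sqlite3. The slide does not say how the fact table's price is computed; the sketch assumes it is the units-weighted average unit price, and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CustomerTransaction (transid INTEGER PRIMARY KEY, custid INTEGER,
                                  empid INTEGER, posid INTEGER, time TEXT);
CREATE TABLE ItemPurchase (transid INTEGER, lineno INTEGER, prodid INTEGER,
                           price REAL, units INTEGER);
CREATE TABLE PointOfSaleTerminals (posid INTEGER PRIMARY KEY, postyp TEXT,
                                   storid INTEGER, loc TEXT);
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, date TEXT,
                         price REAL, units INTEGER,
                         PRIMARY KEY (storid, prodid, date));

INSERT INTO PointOfSaleTerminals VALUES (1, 'register', 1, 'front'),
                                        (2, 'register', 1, 'back'),
                                        (3, 'register', 2, 'front');
INSERT INTO CustomerTransaction VALUES (1000, 1, 1, 1, '2002-03-01 09:15:00'),
                                       (1001, 2, 1, 2, '2002-03-01 17:40:00'),
                                       (1002, 3, 2, 3, '2002-03-01 10:00:00'),
                                       (1003, 1, 1, 1, '2002-03-02 11:00:00');
INSERT INTO ItemPurchase VALUES (1000, 1, 100, 50.0, 1),
                                (1000, 2, 101, 20.0, 2),
                                (1001, 1, 100, 50.0, 2),
                                (1002, 1, 100, 50.0, 4),
                                (1003, 1, 100, 50.0, 1);

-- Roll line items up to one row per store, product and day.
-- Customer, employee and terminal are dropped: those relationships are lost.
INSERT INTO DailySales (storid, prodid, date, price, units)
SELECT t.storid, i.prodid, date(c.time),
       SUM(i.price * i.units) / SUM(i.units),   -- assumed: average unit price
       SUM(i.units)
FROM ItemPurchase i
JOIN CustomerTransaction  c ON c.transid = i.transid
JOIN PointOfSaleTerminals t ON t.posid   = c.posid
GROUP BY t.storid, i.prodid, date(c.time);
""")

daily = conn.execute(
    "SELECT storid, prodid, date, price, units FROM DailySales "
    "ORDER BY storid, prodid, date"
).fetchall()
```

Five line items become four fact rows here, and nothing in DailySales can answer a question about customers any more, which is the tradeoff the slide names.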

39 Loading
Alternatives
–Incremental vs Full Refresh: most data is incrementally added to the warehouse
–Off-line vs on-line
–Frequency: nightly, weekly, monthly
–All-at-once vs staged
What indices to create or drop?
What statistics to collect (& use)?
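The difference between a full refresh and an incremental load can be sketched as follows. The function names and the rule of using the newest loaded date as a high-water mark are assumptions; a load driven only by that mark would miss late-arriving rows for dates that were already loaded.

```python
def full_refresh(source_rows):
    """Full refresh: rebuild the warehouse table from the complete source."""
    return list(source_rows)

def incremental_load(warehouse_rows, source_rows):
    """Incremental load: append only rows newer than the newest date already loaded."""
    # ISO date strings (YYYY-MM-DD) compare correctly as plain strings.
    high_water = max((row["date"] for row in warehouse_rows), default="")
    new_rows = [row for row in source_rows if row["date"] > high_water]
    return warehouse_rows + new_rows, len(new_rows)

source = [
    {"date": "2002-03-01", "units": 3},
    {"date": "2002-03-02", "units": 4},
    {"date": "2002-03-03", "units": 5},
]
warehouse = source[:2]                                   # loaded on an earlier night

warehouse_after, added = incremental_load(warehouse, source)
refreshed = full_refresh(source)
first_load, first_added = incremental_load([], source)   # empty warehouse loads everything
```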

40 Constellation Schema
Data warehouses are often designed as constellations
–Multiple fact tables
–Shared/related dimension tables
Examples
–Sales: store, product, date
–Distribution: distributor, store, product, carrier, period
–Advertising: store, medium, product, period
Query across the same or related dimensions
–Compare advertising and sales by store within various periods
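A query that compares advertising and sales by store and period might be written as below, using sqlite3. The measure names `units` and `spend`, the `period` column on Dates that relates dates to advertising periods, and the sample rows are all assumptions; the slide names only the dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores (storid INTEGER PRIMARY KEY, stornam TEXT);
CREATE TABLE Dates  (dateid INTEGER PRIMARY KEY, period TEXT);  -- date rolls up to period
CREATE TABLE Sales  (storid INTEGER, prodid INTEGER, dateid INTEGER, units INTEGER);
CREATE TABLE Advertising (storid INTEGER, medium TEXT, prodid INTEGER,
                          period TEXT, spend INTEGER);

INSERT INTO Stores VALUES (1, 'NYC'), (2, 'Pittsburgh');
INSERT INTO Dates  VALUES (1, '2002Q1'), (2, '2002Q1'), (3, '2002Q2');
INSERT INTO Sales  VALUES (1, 100, 1, 3), (1, 100, 2, 4), (1, 100, 3, 6), (2, 100, 1, 5);
INSERT INTO Advertising VALUES (1, 'radio', 100, '2002Q1', 200),
                               (1, 'print', 100, '2002Q1', 100),
                               (1, 'radio', 100, '2002Q2', 150),
                               (2, 'radio', 100, '2002Q1', 50);
""")

comparison = conn.execute("""
    WITH s AS (   -- sales rolled up to the shared grain: store and period
        SELECT f.storid, d.period, SUM(f.units) AS units
        FROM Sales f JOIN Dates d ON d.dateid = f.dateid
        GROUP BY f.storid, d.period
    ),
    a AS (        -- advertising rolled up to the same grain
        SELECT storid, period, SUM(spend) AS spend
        FROM Advertising
        GROUP BY storid, period
    )
    SELECT st.stornam, s.period, a.spend, s.units
    FROM s
    JOIN a ON a.storid = s.storid AND a.period = s.period
    JOIN Stores st ON st.storid = s.storid
    ORDER BY st.stornam, s.period
""").fetchall()
```

Each fact table is aggregated to the shared grain before the join; joining the two fact tables row by row on store would pair every sales row with every advertising row and inflate both sums.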

41 Data Marts
Store different fact tables (or different groups of fact tables) in separate data marts

42 Data Mart Architectures
Subset of the Data Warehouse
Meets the needs of a subgroup of users
Top-down
–Extracted from the Data Warehouse
–Problem: early availability
Bottom-up
–Built directly from the staging area
–Can be combined to form the warehouse
–Problem: conformance; the ETL tool must provide metadata
Hybrid
–Some data marts built directly from the staging area
–Others extracted from the Data Warehouse

43 Metadata Management
Identify & define each attribute
–Source(s)
–Transformation(s) applied
–How aggregated
–Description of what it represents
–Relationships to other attributes
–History

