Chapter 16 Data Warehouse Technology and Management.


1 Chapter 16 Data Warehouse Technology and Management

2 Outline
Basic concepts and characteristics
Business architectures and applications
Data cube concepts and operators
Relational DBMS features
Populating a data warehouse

3 Comparison of Processing Environments
Transaction processing
  – Uses operational databases
  – Short-term decisions: fulfill orders, resolve complaints, provide staffing
Decision support processing
  – Uses integrated and summarized data
  – Medium and long-term decisions: capacity planning, store locations, new lines of business

4 Data Warehouse Definition and Characteristics
A central repository for summarized and integrated data from operational databases and external data sources
Key characteristics
  – Subject-oriented
  – Integrated
  – Time-variant
  – Nonvolatile

5 Data Comparison

6 Business Architectures and Applications
Data warehouse projects
Top-down architectures
Bottom-up architecture
Applications and data mining

7 Data Warehouse Projects
Large efforts with much coordination across departments
Enterprise data model
  – Important artifact of a data warehouse project
  – Structure of the data model
  – Metadata for data transformation
Top-down vs. bottom-up business architectures

8 Two Tier Architecture

9 Three Tier Architecture

10 Bottom-up Architecture

11 Applications

12 Data Mining
Discover significant, implicit patterns
  – Target promotions
  – Change mix and collocation of items
Requires large volumes of transaction data
Important application for data warehouses

13 Data Cube Concepts and Operators
Basics
Dimension and measure details
Operators

14 Data Cube Basics
Multidimensional arrangement of data
Users think about decision support data as data cubes
Terminology
  – Dimension: subject label for a row or column
  – Member: value of a dimension
  – Measure: quantitative data stored in cells
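
The terminology above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical sales cube; the dimension names, members, and dollar values are invented for illustration and do not come from the slides.

```python
# Hypothetical sales cube: each cell is keyed by one member per dimension
# and stores one measure (sales dollars). Only cells with data are stored,
# so missing combinations are simply absent from the dict.
DIMENSIONS = ("location", "product", "time")

cube = {
    ("Denver", "Printer", "1/1/2003"): 120.0,
    ("Denver", "Scanner", "1/1/2003"): 80.0,
    ("Seattle", "Printer", "1/1/2003"): 200.0,
    ("Seattle", "Printer", "1/2/2003"): 150.0,
}


def members(cube, dimension):
    """Return the sorted member values that appear for one dimension."""
    axis = DIMENSIONS.index(dimension)
    return sorted({key[axis] for key in cube})


print(members(cube, "location"))
print(cube[("Denver", "Printer", "1/1/2003")])
```

Storing only the populated cells is one simple way to cope with sparse cubes; real multidimensional engines use more elaborate storage.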

15 Data Cube Example

16 Dimension and Measure Details
Dimensions
  – Hierarchies: members can have sub-members
  – Sparsity: many cells do not have data
Measures
  – Derived measures
  – Multiple measures in cells

17 Time Series Data
Common data type in trend analysis
Reduce dimensionality using time series
Time series properties
  – Data type
  – Start date
  – Calendar
  – Periodicity
  – Conversion

18 Slice Operator
Focus on a subset of dimensions
Set a dimension to a specific value, for example 1/1/2003
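
As a rough sketch of the slice operator, assume a hypothetical cube stored as a Python dict keyed by (location, product, time). The function name `slice_cube` and the sample data are illustrative assumptions, not part of the slides.

```python
DIMENSIONS = ("location", "product", "time")

# Hypothetical cube: key = one member per dimension, value = sales dollars.
cube = {
    ("Denver", "Printer", "1/1/2003"): 120.0,
    ("Denver", "Scanner", "1/1/2003"): 80.0,
    ("Seattle", "Printer", "1/1/2003"): 200.0,
    ("Seattle", "Printer", "1/2/2003"): 150.0,
}


def slice_cube(cube, dimensions, dimension, member):
    """Fix one dimension to a single member and drop it from the cell keys."""
    axis = dimensions.index(dimension)
    remaining = tuple(d for d in dimensions if d != dimension)
    sliced = {
        key[:axis] + key[axis + 1:]: value
        for key, value in cube.items()
        if key[axis] == member
    }
    return sliced, remaining


sliced, remaining = slice_cube(cube, DIMENSIONS, "time", "1/1/2003")
print(remaining)
print(sliced)
```

The result has one dimension fewer than the input, which is what lets a user view a three-dimensional cube as a two-dimensional table.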

19 Dice Operator
Focus on a subset of member values
Replace a dimension with a subset of its values
A dice operation often follows a slice operation
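
A dice can be sketched in the same style. The example assumes a hypothetical two-dimensional cube (as might remain after a slice); `dice_cube` and the data are illustrative names and values only.

```python
DIMENSIONS = ("location", "product")

# Hypothetical two-dimensional cube, e.g. the result of an earlier slice.
cube = {
    ("Denver", "Printer"): 120.0,
    ("Boulder", "Printer"): 30.0,
    ("Seattle", "Printer"): 200.0,
    ("Seattle", "Scanner"): 50.0,
}


def dice_cube(cube, dimensions, dimension, keep_members):
    """Keep only the cells whose member on `dimension` is in `keep_members`."""
    axis = dimensions.index(dimension)
    keep = set(keep_members)
    return {key: value for key, value in cube.items() if key[axis] in keep}


diced = dice_cube(cube, DIMENSIONS, "location", ["Denver", "Boulder"])
print(diced)
```

Unlike a slice, a dice keeps every dimension; it only narrows the set of members shown on one of them.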

20 Other Operators
Operators for hierarchical dimensions
  – Drill-down: add detail to a dimension
  – Roll-up: remove detail from a dimension
  – Recalculate measure values
Pivot: rearrange dimensions
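
The following sketch shows roll-up and pivot on a hypothetical cube. The city-to-state hierarchy, the function names, and the values are assumptions made for illustration; the measure is recalculated by summation, which suits an additive measure such as sales dollars.

```python
from collections import defaultdict

# Hypothetical hierarchy for the location dimension: city -> state.
CITY_TO_STATE = {"Denver": "CO", "Boulder": "CO", "Seattle": "WA"}

# Hypothetical cube keyed by (city, product).
cube = {
    ("Denver", "Printer"): 120.0,
    ("Boulder", "Printer"): 30.0,
    ("Seattle", "Printer"): 200.0,
    ("Seattle", "Scanner"): 50.0,
}


def roll_up(cube, axis, parent_of):
    """Replace each member on `axis` by its parent and re-sum the measure."""
    totals = defaultdict(float)
    for key, value in cube.items():
        new_key = key[:axis] + (parent_of[key[axis]],) + key[axis + 1:]
        totals[new_key] += value
    return dict(totals)


def pivot(cube, order):
    """Rearrange the dimensions of every cell key according to `order`."""
    return {tuple(key[i] for i in order): value for key, value in cube.items()}


by_state = roll_up(cube, 0, CITY_TO_STATE)
by_product_first = pivot(cube, (1, 0))
print(by_state)
print(by_product_first[("Printer", "Denver")])
```

Drill-down is the reverse navigation: it returns to the more detailed cells, so it is only possible when the detailed data (here the city-level cube) is still available.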

21 Operator Summary
Operator | Purpose | Description
Slice | Focus attention on a subset of dimensions | Replace a dimension with a single member value or with a summary of its measure values
Dice | Focus attention on a subset of member values | Replace a dimension with a subset of members
Drill-down | Obtain more detail about a dimension | Navigate from a more general level to a more specific level
Roll-up | Summarize details about a dimension | Navigate from a more specific level to a more general level
Pivot | Present data in a different order | Rearrange the dimensions in a data cube

22 Relational DBMS Support
Data modeling
Dimension representation
GROUP BY extensions
Materialized views and query rewriting
Storage structures and optimization

23 Relational Data Modeling
Dimension table: contains member values
Fact table: contains measure values
1-M relationships from dimension tables to fact tables
Grain: most detailed measure values stored
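
A minimal star schema sketch, using Python's built-in `sqlite3` module so that it runs anywhere. The table and column names follow the SQL examples elsewhere in the deck (Sales, Store, Time); the `SalesNo` key, the column types, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold member values.
CREATE TABLE Store (
    StoreId     INTEGER PRIMARY KEY,
    StoreState  TEXT,
    StoreNation TEXT
);
CREATE TABLE Time (
    TimeNo    INTEGER PRIMARY KEY,
    TimeMonth INTEGER,
    TimeYear  INTEGER
);
-- The fact table holds measure values; each row is the child in a
-- 1-M relationship from every dimension table.
CREATE TABLE Sales (
    SalesNo     INTEGER PRIMARY KEY,
    StoreId     INTEGER NOT NULL REFERENCES Store(StoreId),
    TimeNo      INTEGER NOT NULL REFERENCES Time(TimeNo),
    SalesDollar REAL NOT NULL
);
INSERT INTO Store VALUES (1, 'CO', 'USA'), (2, 'WA', 'USA');
INSERT INTO Time  VALUES (1, 1, 2002), (2, 2, 2002);
INSERT INTO Sales VALUES (1, 1, 1, 100.0), (2, 1, 2, 50.0), (3, 2, 1, 75.0);
""")

# Summarize the fact table by joining it to its dimension tables.
rows = conn.execute("""
    SELECT StoreState, TimeYear, SUM(SalesDollar)
    FROM Sales
    JOIN Store ON Sales.StoreId = Store.StoreId
    JOIN Time  ON Sales.TimeNo = Time.TimeNo
    GROUP BY StoreState, TimeYear
    ORDER BY StoreState
""").fetchall()
print(rows)
```

Here the grain is one row per individual sale; a coarser grain (for example, one row per store per day) would shrink the fact table at the cost of detail.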

24 Star Schema Example

25 Constellation Schema Example

26 Snowflake Schema Example

27 Handling M-N Relationships
Source data may have M-N relationships, not 1-M relationships
Adjust fact or dimension tables for a fixed number of exceptions
More complex solutions to support M-N relationships with a variable number of connections

28 Dimension Representation
Star schema and variations lack dimension representation
Explicit dimension representation is important to data cube operations and optimization
Proprietary extensions for dimension representation
Represent levels, hierarchies, and constraints

29 Oracle Dimension Representation
Levels: dimension components
Hierarchies: a dimension may have multiple hierarchies
Constraints: functional dependency relationships

30 CREATE DIMENSION Example
CREATE DIMENSION StoreDim
  LEVEL StoreId IS Store.StoreId
  LEVEL City IS Store.StoreCity
  LEVEL State IS Store.StoreState
  LEVEL Zip IS Store.StoreZip
  LEVEL Nation IS Store.StoreNation
  LEVEL DivId IS Division.DivId
  HIERARCHY CityRollup (
    StoreId CHILD OF City CHILD OF State CHILD OF Nation )
  HIERARCHY ZipRollup (
    StoreId CHILD OF Zip CHILD OF State CHILD OF Nation )
  HIERARCHY DivisionRollup (
    StoreId CHILD OF DivId
    JOIN KEY Store.DivId REFERENCES DivId )
  ATTRIBUTE DivId DETERMINES Division.DivName
  ATTRIBUTE DivId DETERMINES Division.DivManager ;

31 GROUP BY Extensions
ROLLUP operator
CUBE operator
GROUPING SETS operator
Other extensions
  – Ranking
  – Ratios
  – Moving summary values

32 CUBE Example
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
  AND Sales.TimeNo = Time.TimeNo
  AND (StoreNation = 'USA' OR StoreNation = 'Canada')
  AND TimeYear = 2002
GROUP BY CUBE (StoreZip, TimeMonth);
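
To make the effect of CUBE concrete, here is a small Python emulation. It is only a sketch of the grouping logic, not how a DBMS evaluates the operator; `cube_totals` and the sample rows are assumptions for illustration. Columns that are summarized away appear as `None`, where SQL would show NULL.

```python
from collections import defaultdict
from itertools import combinations


def cube_totals(rows, group_cols, measure):
    """Sum `measure` for every subset of `group_cols`, as GROUP BY CUBE does."""
    totals = defaultdict(float)
    for size in range(len(group_cols), -1, -1):
        for subset in combinations(group_cols, size):
            for row in rows:
                # Columns outside the current subset are reported as None.
                key = tuple(row[c] if c in subset else None for c in group_cols)
                totals[key] += row[measure]
    return dict(totals)


# Hypothetical pre-joined, pre-filtered sales rows.
rows = [
    {"StoreZip": "80111", "TimeMonth": 1, "SalesDollar": 100.0},
    {"StoreZip": "80111", "TimeMonth": 2, "SalesDollar": 50.0},
    {"StoreZip": "98101", "TimeMonth": 1, "SalesDollar": 75.0},
]

result = cube_totals(rows, ["StoreZip", "TimeMonth"], "SalesDollar")
for key, total in result.items():
    print(key, total)
```

With n grouping columns, CUBE produces 2^n groupings: here the detail rows, subtotals per zip code, subtotals per month, and one grand total.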

33 ROLLUP Example
SELECT TimeMonth, TimeYear, SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
  AND Sales.TimeNo = Time.TimeNo
  AND (StoreNation = 'USA' OR StoreNation = 'Canada')
  AND TimeYear BETWEEN 2002 AND 2003
GROUP BY ROLLUP (TimeMonth, TimeYear);
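
ROLLUP differs from CUBE in that it only summarizes along prefixes of the column list, so the column order matters. The Python emulation below is a sketch of that logic under assumed sample data; `rollup_totals` is a hypothetical helper, and `None` stands in for SQL NULL.

```python
from collections import defaultdict


def rollup_totals(rows, group_cols, measure):
    """Sum `measure` for each prefix of `group_cols`, as GROUP BY ROLLUP does."""
    totals = defaultdict(float)
    for size in range(len(group_cols), -1, -1):
        prefix = group_cols[:size]
        for row in rows:
            # Columns outside the current prefix are reported as None.
            key = tuple(row[c] if c in prefix else None for c in group_cols)
            totals[key] += row[measure]
    return dict(totals)


# Hypothetical pre-joined, pre-filtered sales rows.
rows = [
    {"TimeMonth": 1, "TimeYear": 2002, "SalesDollar": 100.0},
    {"TimeMonth": 1, "TimeYear": 2003, "SalesDollar": 40.0},
    {"TimeMonth": 2, "TimeYear": 2002, "SalesDollar": 60.0},
]

# Same column order as the query above: subtotals per month across years.
month_first = rollup_totals(rows, ["TimeMonth", "TimeYear"], "SalesDollar")
# Reversed order: subtotals per year instead.
year_first = rollup_totals(rows, ["TimeYear", "TimeMonth"], "SalesDollar")
print(month_first)
print(year_first)
```

With n columns, ROLLUP yields n + 1 groupings rather than the 2^n of CUBE. Listing the coarser level first (year, then month) gives the yearly subtotals that usually match a time hierarchy.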

34 GROUPING SETS Example
SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
  AND Sales.TimeNo = Time.TimeNo
  AND (StoreNation = 'USA' OR StoreNation = 'Canada')
  AND TimeYear = 2002
GROUP BY GROUPING SETS ((StoreZip, TimeMonth), StoreZip, TimeMonth, ());
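
A GROUPING SETS query can be read as the union of one ordinary GROUP BY query per listed set. The sketch below spells that out with `sqlite3`, which has no GROUPING SETS operator, using a hypothetical pre-joined table `SalesFlat` and invented rows to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical pre-joined, pre-filtered table standing in for the
-- Sales/Store/Time join in the query above.
CREATE TABLE SalesFlat (StoreZip TEXT, TimeMonth INTEGER, SalesDollar REAL);
INSERT INTO SalesFlat VALUES
    ('80111', 1, 100.0), ('80111', 2, 50.0), ('98101', 1, 75.0);
""")

# One SELECT per grouping set: (StoreZip, TimeMonth), StoreZip, TimeMonth, ().
rows = conn.execute("""
    SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales
    FROM SalesFlat GROUP BY StoreZip, TimeMonth
    UNION ALL
    SELECT StoreZip, NULL, SUM(SalesDollar)
    FROM SalesFlat GROUP BY StoreZip
    UNION ALL
    SELECT NULL, TimeMonth, SUM(SalesDollar)
    FROM SalesFlat GROUP BY TimeMonth
    UNION ALL
    SELECT NULL, NULL, SUM(SalesDollar)
    FROM SalesFlat
""").fetchall()

for row in rows:
    print(row)
```

This particular list of grouping sets produces the same groupings as CUBE (StoreZip, TimeMonth); GROUPING SETS becomes more useful when only some of those groupings are wanted.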

35 Variations of the Grouping Operators
Partial cube
Partial rollup
Composite columns
CUBE and ROLLUP inside a GROUPING SETS operation

36 Materialized Views
Stored view
Periodically refreshed with source data
Usually contain summary data
Fast query response for summary data
Appropriate in query-dominant environments

37 Materialized View Example
CREATE MATERIALIZED VIEW MV1
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
  AND Sales.TimeNo = Time.TimeNo
  AND TimeYear > 2000
GROUP BY StoreState, TimeYear;

38 Query Rewriting
Substitution process
Materialized view replaces references to fact and dimension tables in a query
Query optimizer must evaluate whether the substitution will improve performance over the original query
More complex than the query modification process for traditional views

39 Query Rewriting Example
-- Data warehouse query
SELECT StoreState, TimeYear, SUM(SalesDollar)
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
  AND Sales.TimeNo = Time.TimeNo
  AND StoreNation IN ('USA','Canada')
  AND TimeYear = 2002
GROUP BY StoreState, TimeYear;

-- Query Rewrite: replace Sales and Time tables with MV1
SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
FROM MV1, Store
WHERE MV1.StoreState = Store.StoreState
  AND TimeYear = 2002
  AND StoreNation IN ('USA','Canada');
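
The equivalence of the two queries can be checked on a small data set. This sketch uses `sqlite3`, which has neither materialized views nor automatic query rewriting, so an ordinary summary table named MV1 stands in for the materialized view and the rewrite is written by hand. All sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store (StoreId INTEGER PRIMARY KEY, StoreState TEXT, StoreNation TEXT);
CREATE TABLE Time  (TimeNo INTEGER PRIMARY KEY, TimeYear INTEGER);
CREATE TABLE Sales (SalesNo INTEGER PRIMARY KEY, StoreId INTEGER,
                    TimeNo INTEGER, SalesDollar REAL);
INSERT INTO Store VALUES (1,'CO','USA'), (2,'CO','USA'),
                         (3,'BC','Canada'), (4,'NL','Mexico');
INSERT INTO Time  VALUES (1, 2002), (2, 2001), (3, 2000);
INSERT INTO Sales VALUES (1,1,1,100.0), (2,2,1,50.0), (3,3,1,70.0),
                         (4,4,1,30.0), (5,1,2,20.0), (6,1,3,10.0);

-- Summary table standing in for the materialized view MV1.
CREATE TABLE MV1 AS
SELECT StoreState, TimeYear, SUM(SalesDollar) AS SumDollar1
FROM Sales, Store, Time
WHERE Sales.StoreId = Store.StoreId
  AND Sales.TimeNo = Time.TimeNo
  AND TimeYear > 2000
GROUP BY StoreState, TimeYear;
""")

original = conn.execute("""
    SELECT StoreState, TimeYear, SUM(SalesDollar)
    FROM Sales, Store, Time
    WHERE Sales.StoreId = Store.StoreId
      AND Sales.TimeNo = Time.TimeNo
      AND StoreNation IN ('USA','Canada')
      AND TimeYear = 2002
    GROUP BY StoreState, TimeYear
""").fetchall()

rewritten = conn.execute("""
    SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1
    FROM MV1, Store
    WHERE MV1.StoreState = Store.StoreState
      AND TimeYear = 2002
      AND StoreNation IN ('USA','Canada')
""").fetchall()

print(sorted(original))
print(sorted(rewritten))
```

Two details are worth noticing. DISTINCT is needed because several stores can share a state, so the join to Store repeats each summary row. The rewrite also relies on each state belonging to a single nation; if a state value occurred under two nations, the summary row could no longer be filtered correctly by StoreNation.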

40 Storage and Optimization Technologies
MOLAP: direct storage and manipulation of data cubes
ROLAP: relational extensions to support multidimensional data
HOLAP: combine MOLAP and ROLAP storage engines

41 ROLAP Techniques
Bitmap join indexes
Star join optimization
Query rewriting
Summary storage advisors
Parallel query execution
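
The idea behind a bitmap join index can be sketched in a few lines: for each value of a dimension attribute, keep a bitmap marking the fact rows that join to it, then combine bitmaps with bitwise AND to find the matching fact rows without joining. This is a conceptual illustration only, with Python integers as bitmaps; the helper names and data are assumptions, and real DBMS implementations differ considerably.

```python
def build_bitmap_join_index(fact_rows, dim_rows, fk, attr):
    """Map each dimension attribute value to a bitmap over fact row positions."""
    attr_of = {d["id"]: d[attr] for d in dim_rows}
    index = {}
    for pos, fact in enumerate(fact_rows):
        value = attr_of[fact[fk]]
        index[value] = index.get(value, 0) | (1 << pos)
    return index


def positions(bitmap):
    """List the fact row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical dimension and fact rows.
stores = [{"id": 1, "StoreState": "CO"}, {"id": 2, "StoreState": "WA"}]
times = [{"id": 1, "TimeYear": 2002}, {"id": 2, "TimeYear": 2003}]
sales = [
    {"StoreId": 1, "TimeNo": 1, "SalesDollar": 100.0},
    {"StoreId": 2, "TimeNo": 1, "SalesDollar": 75.0},
    {"StoreId": 1, "TimeNo": 2, "SalesDollar": 50.0},
    {"StoreId": 2, "TimeNo": 2, "SalesDollar": 25.0},
]

by_state = build_bitmap_join_index(sales, stores, "StoreId", "StoreState")
by_year = build_bitmap_join_index(sales, times, "TimeNo", "TimeYear")

# Fact rows for StoreState = 'CO' AND TimeYear = 2002.
matching = positions(by_state["CO"] & by_year[2002])
total = sum(sales[i]["SalesDollar"] for i in matching)
print(matching, total)
```

Because the conditions on several dimensions reduce to cheap bitwise operations, only the qualifying fact rows need to be read, which is the appeal of this index type for star joins.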

42 Populating a Data Warehouse
Data sources
Workflow representation
Optimizing the refresh process

43 Data Sources
Cooperative
Logged
Queryable
Snapshot

44 Maintenance Workflow

45 Data Quality Problems
Multiple identifiers
Multiple field names
Different units
Missing values
Orphaned values
Multipurpose fields
Conflicting data
Different update times

46 ETL Tools
Extraction, Transformation, and Loading
Specification based
Eliminate custom coding
Third-party and DBMS-based tools

47 Refresh Optimization

48 Determining the Refresh Frequency
Maximize net refresh benefit
Value of data timeliness
Cost of refresh
Satisfy data warehouse and source system constraints

49 Determining the Level of Historical Integrity
Primarily an issue for dimension updates
Type I: overwrite old values
Type II: version numbers for an unlimited history
Type III: new columns for a limited history
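
The three update types can be contrasted with a short sketch on a hypothetical Store dimension row. The function names, the `Version` column, and the `Prev` column prefix are illustrative assumptions; the Type III function keeps just one previous value, the simplest case of a limited history.

```python
def type1_update(row, column, new_value):
    """Type I: overwrite the old value; no history is kept."""
    updated = dict(row)
    updated[column] = new_value
    return updated


def type2_update(history, column, new_value):
    """Type II: append a new version row; earlier versions stay in place."""
    latest = max(history, key=lambda r: r["Version"])
    new_row = dict(latest)
    new_row[column] = new_value
    new_row["Version"] = latest["Version"] + 1
    return history + [new_row]


def type3_update(row, column, new_value):
    """Type III: move the current value to an extra column, then overwrite."""
    updated = dict(row)
    updated["Prev" + column] = row[column]
    updated[column] = new_value
    return updated


# Hypothetical dimension row: a store moves from Denver to Boulder.
store = {"StoreId": 1, "StoreCity": "Denver", "Version": 1}

print(type1_update(store, "StoreCity", "Boulder"))
print(type2_update([store], "StoreCity", "Boulder"))
print(type3_update(store, "StoreCity", "Boulder"))
```

Type II grows the dimension table with every change but lets old facts stay associated with the old member values, while Type III bounds the history by the number of extra columns.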

50 Summary
Data warehouse requirements differ from transaction processing requirements.
Architecture choice is important.
The multidimensional data model is intuitive.
Relational representation and storage techniques are significant.
Maintaining a data warehouse is an important operational problem.

