Data Warehousing Seminar Chapter 5. Data Warehouse Design Methodology Data Warehousing Lab. HyeYoung Cho.

1 Data Warehousing Seminar Chapter 5. Data Warehouse Design Methodology Data Warehousing Lab. HyeYoung Cho

2 Index
- The Information Utility's Infrastructure
- The Preferred Architecture: Integration Layer and High Performance Query Structures
  - Data Store 1 - The Source Systems
  - Data Flow 1 - From the Data Sources to the Integration Layer
  - Data Store 2 - The Integration Layer
  - Data Flow 2 - From the Integration Layer to the High Performance Query Structures
  - Data Store 3 - High Performance Query Structures (HPQS)
  - Data Flow 3 - From the High Performance Query Structures to the End User Reporting Applications
  - Data Store 4 - Data in the End User's Hands
- Alternate Warehousing Architectures

3 The Information Utility's Infrastructure
The warehouse must:
- extract data from a variety of sources
- integrate data into a common repository
- put data into a format that users can use
- provide users with tools to access the warehouse

4 The Preferred Architecture: Integration Layer and High Performance Query Structures
4 data stores and 3 data flows.

5 Data Store 1 - The Source Systems
Source systems provide data to the warehouse:
- enterprise resource planning (ERP) packages: SAP, PeopleSoft, Oracle applications
- home-grown applications: OASIS system
- outside sources: data purchased from outside vendors
[Diagram: source systems (sales, accounting, distribution, etc.) -> warehouse data]

6 Data Flow 1 - From the Data Sources to the Integration Layer
Data extraction step
- gets data out of its sources
- data is extracted at the beginning of every data flow
- a very complex step
  - sources use a variety of data storage technologies, e.g. Oracle, DB2, Informix, IMS, and other formats
    -> each requires its own select statements and extraction code
- considerations for extraction

7 Data Flow 1 - From the Data Sources to the Integration Layer
Is This Extract Supporting the Initial Load of the Warehouse or a Periodic Refresh Load?
- problems with complete refreshes
  - the warehouse is a record of history!
    -> history is frequently lost by source systems
  - warehouses tend to be very large!
    -> a complete refresh puts a heavy load on computing and telecommunications bandwidth
- two architectures to load the warehouse
  - initial load: history data from offline storage; bring it all over
  - periodic refresh: online data; only changed source records; use special logic for timestamps
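The difference between the two load styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `last_updated` field, and the function names `initial_load` and `periodic_refresh` are hypothetical and do not come from the slides.

```python
from datetime import datetime

# Hypothetical source records; the `last_updated` field is an assumption.
SOURCE = [
    {"id": 1, "name": "Kim", "last_updated": datetime(2024, 1, 5)},
    {"id": 2, "name": "Lee", "last_updated": datetime(2024, 2, 10)},
    {"id": 3, "name": "Park", "last_updated": datetime(2024, 3, 1)},
]

def initial_load(records):
    """Initial load: bring every record over."""
    return list(records)

def periodic_refresh(records, last_extract_time):
    """Periodic refresh: bring over only records changed since the last extract."""
    return [r for r in records if r["last_updated"] > last_extract_time]

everything = initial_load(SOURCE)
changed = periodic_refresh(SOURCE, datetime(2024, 2, 1))
print(len(everything), [r["id"] for r in changed])
```

The refresh moves two records instead of three here; on a large warehouse the same idea is what keeps the periodic load within the available bandwidth.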

8 Data Flow 1 - From the Data Sources to the Integration Layer
How Will I Determine What Records to Extract?
- change data capture
  - determines what source records have changed
  - determines how those records are moved to the warehouse
- the delete question!
  - no trace: the deleted record is just gone!
- techniques for recognizing changes
  - Timestamps
    - recorded whenever a record is inserted or updated
    - reduce the search for records that have changed
  - Triggers
    - put a trigger on the source tables
    - write a corresponding (insert, update, delete) message to a log file
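The trigger technique can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not the setup the slides describe: the `customer` table, the trigger names, and the `change_log` table (standing in for the log file) are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
-- Hypothetical change log standing in for the log file.
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    customer_id INTEGER
);
CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('insert', NEW.id);
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('update', NEW.id);
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('delete', OLD.id);
END;
""")

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Kim')")
conn.execute("UPDATE customer SET name = 'Kim H.' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 1")

# The extract program reads the log instead of scanning the source table.
changes = conn.execute(
    "SELECT op, customer_id FROM change_log ORDER BY seq"
).fetchall()
print(changes)
```

Because the delete trigger writes its own log entry, the deleted record still leaves a trace for the extract to pick up, which timestamps on the source rows alone cannot provide.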

9 Data Flow 1 - From the Data Sources to the Integration Layer
- Application Integration Software (AIS)
  - e.g. MQ Series, Mercator, Tibco
  - links applications: when a transaction occurs in one, it is transmitted to all the others
  - covers all transactions in AIS-enabled systems
  - gives real-time access to data
- File Compares
  - compare today's file to the last loaded file
  - difficult to implement and less accurate
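A file compare can be sketched as a comparison of two keyed snapshots. The sketch below assumes each file has already been parsed into a dictionary from record key to record contents; the function name `compare_snapshots` and the sample data are hypothetical.

```python
def compare_snapshots(previous, current):
    """Compare two snapshots keyed by record key.

    Returns (inserted, updated, deleted) as sorted lists of keys.
    """
    inserted = sorted(k for k in current if k not in previous)
    deleted = sorted(k for k in previous if k not in current)
    updated = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    return inserted, updated, deleted

# Last loaded file versus today's file (sample data).
last_loaded = {101: ("Kim", "Seoul"), 102: ("Lee", "Busan"), 103: ("Park", "Daegu")}
today = {101: ("Kim", "Seoul"), 102: ("Lee", "Incheon"), 104: ("Choi", "Daejeon")}

inserted, updated, deleted = compare_snapshots(last_loaded, today)
print(inserted, updated, deleted)
```

The comparison only sees the state at the two snapshot times, so a record that changed twice between files, or was inserted and deleted in between, goes unnoticed. That is one reason the technique is less accurate than triggers or AIS.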

10 Data Flow 1 - From the Data Sources to the Integration Layer
How Will I Format the Extracted Records?
- store each extracted record together with information that says:
  - what source system generated the record
  - when the record was obtained
  - the key of the record
What Will I Do with the Extracted Records?
- data loading programs
  - read flat files and load the data into the warehouse
- "loosely coupled" warehousing architectures
  - separate extract programs and load programs
    -> a more flexible and maintainable warehouse!
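A loosely coupled extract and load pair might look like the following sketch. The column names in `FIELDS`, the `payload` column, and the use of an in-memory buffer in place of a flat file on disk are assumptions made to keep the example self-contained.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed flat-file layout: extract metadata first, then the record itself.
FIELDS = ["source_system", "extracted_at", "record_key", "payload"]

def extract(rows, source_system, extracted_at, out):
    """Extract step: write source rows to a flat file with extract metadata."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for key, payload in rows:
        writer.writerow({
            "source_system": source_system,
            "extracted_at": extracted_at.isoformat(),
            "record_key": key,
            "payload": payload,
        })

def load(flat_file):
    """Load step: knows only the flat-file layout, not the source system."""
    return list(csv.DictReader(flat_file))

buffer = io.StringIO()  # stands in for a flat file on disk
extract(
    [("C-1", "Kim"), ("C-2", "Lee")],
    "sales",
    datetime(2024, 1, 5, tzinfo=timezone.utc),
    buffer,
)
buffer.seek(0)
staged = load(buffer)
print(staged[0])
```

Since `load` depends only on the file layout, an extract program can be rewritten or a new source added without touching the load program.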

11 Data Flow 1 - From the Data Sources to the Integration Layer
A Few Notes About Dirty Data
- data can be dirty in several ways
  - format violations
  - referential integrity violations
  - cross-system matching violations
  - internal consistency violations
- dirty data
  - makes the warehouse unreliable
  - should be corrected in the source systems before extracting
  - affects both refresh data and history data
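Two of these violation types are easy to detect mechanically. The sketch below checks for a format violation and a referential integrity violation; the order layout, the assumed `YYYY-MM-DD` date format, and the function name `find_violations` are hypothetical.

```python
import re

DATE_FORMAT = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed format: YYYY-MM-DD

def find_violations(orders, customer_ids):
    """Return (order_id, problem) pairs for two kinds of dirty data."""
    problems = []
    for order in orders:
        if not DATE_FORMAT.match(order["order_date"]):
            problems.append((order["order_id"], "format violation"))
        if order["customer_id"] not in customer_ids:
            problems.append((order["order_id"], "referential integrity violation"))
    return problems

customers = {1, 2}
orders = [
    {"order_id": 10, "customer_id": 1, "order_date": "2024-01-05"},
    {"order_id": 11, "customer_id": 9, "order_date": "2024-01-06"},  # unknown customer
    {"order_id": 12, "customer_id": 2, "order_date": "05/01/2024"},  # wrong date format
]
violations = find_violations(orders, customers)
print(violations)
```

Cross-system matching and internal consistency violations need business rules from the source systems and are not covered by this sketch.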

12 Data Store 2 - The Integration Layer
A normalized database in a single place
- normalization: break a flat file into smaller files to store the data more efficiently
Why Build an Integration Layer?
- Avoids extraction repetition
  - multiple data marts use data from the same source systems
    -> read from only one source (already integrated, clean data)
- Ensures standard interpretation of enterprise data
  - multiple groups interpret the same data differently
    -> develop common definitions shared across the organization
- Provides a more flexible repository than the denormalized structures in the HPQS layer
  - denormalized data structures in HPQS for querying are inflexible
    -> changing them is complex and requires reintegration and recleansing

13 Data Store 2 - The Integration Layer

14 Data Store 2 - The Integration Layer
Introduction to Database Normalization
- data model in third normal form
- completely denormalized data
(figure label: 1NF)

15 Data Store 2 - The Integration Layer
- First Normal Form
  - eliminate repeating groups!
(figure label: 2NF)

16 Data Store 2 - The Integration Layer
- Second Normal Form
  - all non-key attributes of a table must rely on the entire key of the table
(figure label: 3NF)

17 Data Store 2 - The Integration Layer
- Third Normal Form
  - all non-key fields must depend solely on the table's primary key
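Since the figures for the normal forms did not survive in the transcript, here is a small stand-in example, not the one from the slides. It assumes a flat order file that is already free of repeating groups (1NF) and splits it so that every non-key value depends on the whole key of its own table and on nothing else. The column names and sample rows are hypothetical.

```python
# Flat rows keyed by (order_id, product_id). The customer name depends only on
# the customer and the product name only on the product, so this layout
# violates second and third normal form.
flat_rows = [
    # (order_id, customer_id, customer_name, product_id, product_name, quantity)
    (1, "C1", "Kim", "P1", "Pen", 3),
    (1, "C1", "Kim", "P2", "Ink", 1),
    (2, "C2", "Lee", "P1", "Pen", 5),
]

customers, products, orders, order_lines = {}, {}, {}, {}
for order_id, cust_id, cust_name, prod_id, prod_name, qty in flat_rows:
    customers[cust_id] = cust_name           # depends only on the customer key
    products[prod_id] = prod_name            # depends only on the product key
    orders[order_id] = cust_id               # depends only on the order key
    order_lines[(order_id, prod_id)] = qty   # depends on the entire composite key

print(customers)
print(products)
print(orders)
print(order_lines)
```

Each customer and product name is now stored once, so a name change touches a single row instead of every order line that mentions it.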

18 Data Store 2 - The Integration Layer
What "Extra" Data Must the Integration Layer Hold?
- surrogate keys
  - sequential numbers generated by warehouse load programs
  - have no business meaning
  - benefits
    - a single surrogate key for the same item even when source systems use different keys for it
    - easy tracking when moving information
- dates, statuses, and other fields
  - support auditing and make it easy to identify data to move to a data mart
  - additional information in the warehouse
  - e.g. insert date, last update date, status flag, etc.
Another Note About Dirty Data
- techniques for handling bad records
  - Ignoring them.
  - Rejecting bad records, but saving them in a separate file for manual review.
  - Loading the bad record and pointing out the errors for later review.
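Surrogate key assignment in a load program could be sketched as follows. The class name `SurrogateKeyGenerator`, its methods, and the source-system keys are hypothetical, and a real load program would persist the mapping in the warehouse rather than in memory.

```python
import itertools

class SurrogateKeyGenerator:
    """Assigns sequential, meaning-free keys to (source_system, natural_key) pairs."""

    def __init__(self):
        self._next = itertools.count(1)
        self._by_source_key = {}

    def key_for(self, source_system, natural_key):
        """Reuse the existing surrogate key, or generate the next number."""
        pair = (source_system, natural_key)
        if pair not in self._by_source_key:
            self._by_source_key[pair] = next(self._next)
        return self._by_source_key[pair]

    def link(self, source_system, natural_key, surrogate_key):
        """Record that a key from another source system names a known item."""
        self._by_source_key[(source_system, natural_key)] = surrogate_key

keys = SurrogateKeyGenerator()
kim = keys.key_for("sales", "CUST-0042")
keys.link("accounting", "A-977", kim)  # same customer, different source key
lee = keys.key_for("sales", "CUST-0043")
print(kim, keys.key_for("accounting", "A-977"), lee)
```

Deciding that two source keys refer to the same customer is a matching problem this sketch leaves to the caller of `link`.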

19 Data Store 2 - The Integration Layer
(figure label: key)

20 Data Flow 2 - From the Integration Layer to the High Performance Query Structures
Data is extracted from the integration layer and inserted into the data marts
- ETL: extract, transform, and load to populate data marts
- benefits of loading from the integration layer
  - no cleansing and integration needed
  - records to load are identified using timestamps
  - no creation of surrogate keys (only reuse!)
- use of summary tables
  - data marts differ from the data warehouse
  - they hold some summaries of their atomic-level detail
    -> load both the atomic-level data and the summary tables
  - Oracle8i
    - create materialized view
    - automatic refresh on every commit
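Loading both atomic-level data and a summary table can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views, so the summary here is an ordinary table that the load program would have to rebuild on each refresh; that differs from the Oracle8i materialized view, which the database refreshes itself. Table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Atomic-level detail loaded into the data mart.
CREATE TABLE sales_detail (sale_date TEXT, product TEXT, amount INTEGER);
INSERT INTO sales_detail VALUES
    ('2024-01-05', 'Pen', 30),
    ('2024-01-05', 'Ink', 10),
    ('2024-01-06', 'Pen', 50);
-- Summary table derived from the atomic-level rows above.
CREATE TABLE sales_by_product AS
    SELECT product, SUM(amount) AS total_amount, COUNT(*) AS sale_count
    FROM sales_detail
    GROUP BY product;
""")

summary = conn.execute(
    "SELECT product, total_amount, sale_count FROM sales_by_product ORDER BY product"
).fetchall()
print(summary)
```

Queries that only need totals per product read the small summary table, while the detail table stays available for drill-down.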

21 Data Store 3 - High Performance Query Structures (HPQS)
- databases and data structures to support end-user queries
- databases managed by either relational database engines or multidimensional database engines
- a logical structure, not a physical structure
  - can share the same computer with the data warehouse
  - physically different table designs
- easier and faster for end users to access than normalized database formats

22 Data Flow 3 - From the High Performance Query Structures to the End User Reporting Applications
- query tools issue SQL calls to relational databases
- data is returned to the tools and formatted

23 Data Store 4 - Data in the End User's Hands
Reports and analyses in the end user's hands
- the last data store in the warehousing architecture
- "How can I prevent a bad employee from selling warehouse data to one of our competitors?"
  - the only way is to deny him access to that data in the first place

24 Alternate Warehousing Architectures
- Alternate Architecture 1 - No Warehouse
  - if there is no demand for a warehouse, don't build it
    - transaction systems are strong and end-user queries are limited
- Alternate Architecture 2 - Normalized Design
  - data is integrated in the integration layer
  - users query directly out of the integration layer
    - integration benefits, but no usability or query performance benefits
- Alternate Architecture 3 - Just Data Marts
  - building one or more data marts without a normalized integration layer
    - suitable when there is no need to integrate data from multiple systems

