© 2007 Robert T. Monroe Carnegie Mellon University ©2006 - 2008 Robert T. Monroe 45-875 BI Tools and Techniques Data Warehousing BI Tools and Techniques.


1 © 2007 Robert T. Monroe Carnegie Mellon University ©2006 - 2008 Robert T. Monroe 45-875 BI Tools and Techniques Data Warehousing BI Tools and Techniques Robert Monroe March 25, 2008

2 Goals: Data Warehouses and Dimensional Modeling
Understand the role and importance of data warehouses and data marts in a complete BI solution
Understand the key characteristics of data warehouses and data marts, and how they differ from transactional data stores
Understand common data warehousing architectures and the tradeoffs associated with each of them
Introduce the concept of dimensional modeling

3 Data Warehouses and Data Marts

4 Business Intelligence Systems Improve Decision Making
Question: What do the tools need to do so?
Source: O’Brien, Management Information Systems, 6th ed.

5 Transactional and Analytical Systems
Transactional systems: Systems that are used to run a business in real time, based on current data. Also called “systems of record”
Analytical systems: Systems designed to support decision making based on historical point-in-time and prediction data for complex queries or data mining applications
BI systems are generally analytical systems

6 Examples of Transactional and Analytical Systems
Transactional System Examples
–Supermarket checkout system
–ATM machines
–Purchase order processing
–Student course registration
–Warehouse/inventory tracker
–Airline ticketing system
–E-Z Pass
Analytical System Examples
–Data warehouses
–Data marts
–Enterprise spend analysis: where do we spend our $$$?
–Sales force productivity analysis: by sales person, region, or product line
–Product-line profitability analysis: which products are most profitable? Which do we lose money on?

7 Why Not Use Transactional Data Stores For BI?
It is good practice to separate transactional and analytic systems and data
Why?
–To improve system performance
–To improve database manageability and maintainability
–To optimize each type of system for its primary purpose

8 Data Warehouses
A data warehouse is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
–Subject-oriented: e.g. customers, patients, students, products
–Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
–Time-variant: can study trends and changes
–Non-updatable: read-only, periodically refreshed
A data warehouse generally provides an integrated, company-wide view of high-quality information that is derived from disparate databases across the enterprise
Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

9 Data Marts
A data mart is an informational data store that stores aggregated information for a specific group or function within an organization
–Example: sales data for a single company division
–Example: materials purchases for a group of factories
Data marts can also be limited in scope by the way that they represent information
–Optimized for specific types of analytical tools such as OLAP, data mining, visualization, etc.

10 Tradeoffs
Advantages and drawbacks of Data Warehouses vs. Data Marts

11 Tradeoffs
Data Warehouses
–Advantages: comprehensive; supports enterprise-wide analyses; single source for data and analysis tools
–Drawbacks: complex; expensive; optimized for everything/nothing
Data Marts
–Advantages: limited in scope; cheaper, easier, quicker to create and maintain; can be optimized for specific analyses/tools
–Drawbacks: limited in scope; difficult to analyze across enterprise / boundaries; may or may not have needed data

12 Data Warehousing Architectures

13 Operational Systems Feed Informational Systems
Informational systems get their data from operational databases
This process generally requires significant processing (transformation) of the data stored in operational databases
This process is commonly known as ETL
–Extract, Transform, and Load (ETL)

14 Data Warehouse Architectures
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and Active Warehouse
Each approach has benefits and drawbacks

15 Common Data Warehouse Architecture Elements
Data source systems
–Transactional systems that store day-to-day data
Data staging area
–Holding store for data as it is being extracted from source systems and transformed into appropriate format for data warehouse or data marts
Data and metadata storage area
–Physical storage for the transformed and aggregated data warehouse/mart data
End-user analysis tools
–Various tools that let end-users query, analyze, report on, or explore the informational data stored in the data warehouse or data marts
Diagram labels: Data Source Systems, Data Staging Area, Data/Metadata Storage Area, End-User Analysis Tools

16 Generic Two-Level Architecture
ETL
One company-wide warehouse
Periodic extraction → data is not completely current in warehouse
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

17 Independent Data Marts
Data marts: mini-warehouses, limited in scope
ETL: separate ETL for each independent data mart
Data access complexity due to multiple data marts
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

18 Dependent Data Marts with Operational Data Store
ETL: single ETL for the Enterprise Data Warehouse (EDW)
ODS provides option for obtaining current data
Simpler data access
Dependent data marts loaded from EDW
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

19 Logical Data Mart and Active Warehouse
ETL: near real-time ETL for Active Data Warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
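
The idea of a data mart as a logical view can be sketched with Python's built-in sqlite3 module. The table, view, and column names below (warehouse_sales, consumer_sales_mart, division, product, amount) are hypothetical and the rows are made up; the sketch only illustrates that a view holds no data of its own and reflects new warehouse rows as soon as they are loaded.

```python
import sqlite3

# Hypothetical warehouse table with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_sales (division TEXT, product TEXT, amount REAL);
INSERT INTO warehouse_sales VALUES
    ('Consumer', 'Shoes', 120.0),
    ('Consumer', 'Socks', 30.0),
    ('Industrial', 'Boots', 300.0);

-- The "data mart" is only a view over the warehouse, not a separate copy.
CREATE VIEW consumer_sales_mart AS
    SELECT product, amount FROM warehouse_sales WHERE division = 'Consumer';
""")

before = conn.execute("SELECT COUNT(*) FROM consumer_sales_mart").fetchone()[0]

# A new row loaded into the warehouse is visible through the view immediately.
conn.execute("INSERT INTO warehouse_sales VALUES ('Consumer', 'Hats', 45.0)")
after = conn.execute("SELECT COUNT(*) FROM consumer_sales_mart").fetchone()[0]

print(before, after)
```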

20 Dimensional Modeling

21 The Relational Data Model
The Relational Model has become the de-facto standard for managing operational business data
Core concepts in a relational model:
–Tables (relations)
–Records (rows)
–Data fields (columns)
–Primary keys
–Foreign keys
Products
Product ID | Description | Color | Size | Qty Available
52 | Shoes (pair) | Blue | 10 | 25
64 | Socks (pair) | White | Large | 200
145 | Blouse | Green | 7 | 14
158 | Pants | Blue | 32/34 | 0

22 Data, Information, Database Example
Data in Database Tables
Purchases
Order ID | Customer Name | Product ID | Quantity | Date
5623 | Jimmy Hwang | 52 | 3 | 12/15/2004
5624 | Sue Smith | 64 | 5 | 12/16/2004
5625 | Jane Chen | 145 | 1 | 12/16/2004
Products
Product ID | Description | Color | Size | Qty Available
52 | Shoes (pair) | Blue | 10 | 25
64 | Socks (pair) | White | Large | 200
145 | Blouse | Green | 7 | 14
158 | Pants | Blue | 32/34 | 0
Information
Jimmy Hwang purchased 3 pairs of size 10 shoes on 12/15/2004
What other information can we derive from these data tables?
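
As a minimal sketch of deriving information from these two tables, the join below uses Python's built-in sqlite3 module and the rows shown on the slide. The SQLite column types are an approximation chosen for this illustration, and the query is only one of many that could be asked of the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (
    ProductID INTEGER PRIMARY KEY,
    Description TEXT, Color TEXT, Size TEXT, QtyAvailable INTEGER
);
CREATE TABLE Purchases (
    OrderID INTEGER PRIMARY KEY,
    CustomerName TEXT,
    ProductID INTEGER REFERENCES Products(ProductID),
    Quantity NUMERIC,
    Date TEXT
);
""")
conn.executemany("INSERT INTO Products VALUES (?, ?, ?, ?, ?)", [
    (52, "Shoes (pair)", "Blue", "10", 25),
    (64, "Socks (pair)", "White", "Large", 200),
    (145, "Blouse", "Green", "7", 14),
    (158, "Pants", "Blue", "32/34", 0),
])
conn.executemany("INSERT INTO Purchases VALUES (?, ?, ?, ?, ?)", [
    (5623, "Jimmy Hwang", 52, 3, "12/15/2004"),
    (5624, "Sue Smith", 64, 5, "12/16/2004"),
    (5625, "Jane Chen", 145, 1, "12/16/2004"),
])

# Follow the foreign key from Purchases to Products to turn rows into statements.
rows = conn.execute("""
    SELECT pu.CustomerName, pu.Quantity, pr.Description, pr.Size, pu.Date
    FROM Purchases AS pu
    JOIN Products AS pr ON pr.ProductID = pu.ProductID
    ORDER BY pu.OrderID
""").fetchall()

for name, qty, description, size, date in rows:
    print(f"{name} purchased {qty} x {description} (size {size}) on {date}")
```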

23 Relational Data, Tables, Records, and Metadata Example
Data (Records) in Database Tables
Purchases
Order ID | Customer Name | Product ID | Quantity | Date
5623 | Jimmy Hwang | 52 | 3 | 12/15/2004
5624 | Sue Smith | 64 | 5 | 12/16/2004
5625 | Jane Chen | 145 | 1 | 12/16/2004
Products
Product ID | Description | Color | Size | Qty Available
52 | Shoes (pair) | Blue | 10 | 25
64 | Socks (pair) | White | Large | 200
145 | Blouse | Green | 7 | 14
158 | Pants | Blue | 32/34 | 0
Metadata
Table Name: Products
–ProductID: Int (pkey)
–Description: Text(50)
–Color: Text(50)
–Size: Text(20)
–QtyAvailable: Int
Table Name: Purchases
–OrderID: Int (pkey)
–CustomerName: Text(75)
–ProductID: Int (fkey)
–Quantity: Decimal
–Date: DateTime

24 Normalization And Denormalization
Data normalization is the process of decomposing relations with anomalies to produce smaller, well-structured relations
–Basic idea: each table only holds data about one ‘thing’
Goals of normalization include:
–Minimize data redundancy
–Simplify the enforcement of referential integrity constraints
–Simplify data maintenance (inserts, updates, deletes)
–Improve representation model to match “the real world”
Normalization sometimes hurts query performance

25 Example: Denormalized Table
Employee
Emp_ID | Name | Dept_Name | Salary | Course_Title | Date_Completed
100 | Margaret Simpson | Marketing | 48000 | SPSS | 6/19/2005
100 | Margaret Simpson | Marketing | 48000 | Surveys | 10/7/2004
140 | Alan Beeton | Accounting | 52000 | Tax Acc | 12/8/2004
110 | Chris Lucero | Info Systems | 43000 | SPSS | 1/12/2004
110 | Chris Lucero | Info Systems | 43000 | C++ | 4/22/2003
190 | Lorenzo Davis | Finance | 55000 | |
150 | Susan Martin | Marketing | 42000 | Java | 8/12/2002
150 | Susan Martin | Marketing | 42000 | SPSS | 6/19/2005
Insertion anomaly: when an employee takes a new class we need to add duplicate data (Name, Dept_Name, and Salary)
Deletion anomaly: if we remove employee 140, we lose information about the existence of a Tax Acc class
Modification anomaly: a salary increase for employee 100 forces an update of multiple records
These anomalies exist because two themes (entity types), course and employee, are combined into one relation, resulting in duplication and an unnecessary dependency between the entities
Example derived from Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

26 Normalizing Previous Employee/Class Table
Employee
Emp_ID | Name | Dept_Name | Salary
100 | Margaret Simpson | Marketing | 48000
140 | Alan Beeton | Accounting | 52000
110 | Chris Lucero | Info Systems | 43000
190 | Lorenzo Davis | Finance | 55000
150 | Susan Martin | Marketing | 42000
Course
Course_ID | Course_Title
1 | SPSS
2 | Surveys
3 | Tax Acc
4 | C++
5 | Java
Course_Completion
Emp_ID | Course_ID | Date_Completed
100 | 1 | 6/19/2005
100 | 2 | 10/7/2004
140 | 3 | 12/8/2004
110 | 1 | 1/12/2004
110 | 4 | 4/22/2003
150 | 1 | 6/19/2005
150 | 5 | 8/12/2002
This seems more complicated
Why might this approach be superior to the previous one?
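
One way to see the benefit is to perform the decomposition in code. The sketch below uses plain Python structures rather than a database; the variable names are illustrative, and course IDs are assigned in order of first appearance, which is an assumption that happens to reproduce the IDs on the slide.

```python
# Rows of the denormalized Employee table from the previous slide.
denormalized = [
    (100, "Margaret Simpson", "Marketing", 48000, "SPSS", "6/19/2005"),
    (100, "Margaret Simpson", "Marketing", 48000, "Surveys", "10/7/2004"),
    (140, "Alan Beeton", "Accounting", 52000, "Tax Acc", "12/8/2004"),
    (110, "Chris Lucero", "Info Systems", 43000, "SPSS", "1/12/2004"),
    (110, "Chris Lucero", "Info Systems", 43000, "C++", "4/22/2003"),
    (190, "Lorenzo Davis", "Finance", 55000, None, None),
    (150, "Susan Martin", "Marketing", 42000, "Java", "8/12/2002"),
    (150, "Susan Martin", "Marketing", 42000, "SPSS", "6/19/2005"),
]

employees = {}    # Emp_ID -> (Name, Dept_Name, Salary), stored once per employee
courses = {}      # Course_Title -> Course_ID
completions = []  # (Emp_ID, Course_ID, Date_Completed)

for emp_id, name, dept, salary, title, date in denormalized:
    employees[emp_id] = (name, dept, salary)
    if title is None:
        continue  # employee with no completed courses
    course_id = courses.setdefault(title, len(courses) + 1)
    completions.append((emp_id, course_id, date))

# Deletion anomaly removed: dropping employee 140 keeps the Tax Acc course.
del employees[140]
completions = [c for c in completions if c[0] != 140]
print("Tax Acc" in courses)
```

A salary change now touches exactly one entry in employees, which is the modification anomaly from the previous slide going away.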

27 Indexing
An index is a table or other data structure used to determine the location of rows in a file that satisfy some condition
Indices reduce the time needed to retrieve records … but increase the time and cost to insert, update, or delete
Indexing is critical for high performance in large, complex databases
–Especially data warehouses and data marts
Products
Product ID | Description | Color | Size
52 | Shoes (pair) | Blue | 10
145 | Socks (pair) | White | Large
62 | Blouse | Green | 7
12 | Pants | Blue | 32/34
532 | Skirt | Green | 7
… | … | … | …
Product_Index
Product ID | Row
12 | 4
52 | 1
62 | 3
145 | 2
532 | 5
… | …
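
The Product_Index table can be imitated in a few lines of Python. This is only a sketch of the idea, using a sorted list and binary search from the standard bisect module; real database indexes are typically B-trees or similar structures, and the function name lookup is made up for this example.

```python
from bisect import bisect_left

# Products rows in storage order, as on the slide (row numbers are 1-based).
products = [
    (52, "Shoes (pair)", "Blue", "10"),
    (145, "Socks (pair)", "White", "Large"),
    (62, "Blouse", "Green", "7"),
    (12, "Pants", "Blue", "32/34"),
    (532, "Skirt", "Green", "7"),
]

# The index: (Product ID, Row) pairs kept sorted by Product ID.
product_index = sorted((row[0], pos) for pos, row in enumerate(products, start=1))
index_keys = [key for key, _ in product_index]


def lookup(product_id):
    """Find a product via binary search on the index instead of scanning every row."""
    i = bisect_left(index_keys, product_id)
    if i < len(index_keys) and index_keys[i] == product_id:
        row_number = product_index[i][1]
        return products[row_number - 1]
    return None


print(product_index)
print(lookup(62))
```

The cost side of the tradeoff is visible too: every insert, update, or delete on products would also have to keep product_index sorted.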

28 Alternative Data Models
The relational data model is the current de-facto standard for storing and managing corporate data
There are other data storage models, usually associated with legacy systems
–The data you need for your analysis may be stored in them!
Four common alternative data models
–Flat file
–Hierarchical
–Network
–Object

29 Dimensional Modeling: Facts and Dimensions
Dimensional modeling: a simple database design in which dimensional data are separated from fact or event data. Dimensional models are also sometimes called star schemas.
Dimensional models are a common way to represent derived data for informational data stores
–Commonly used for data warehouse/mart storage model
–Poorly suited for transaction processing
–Well suited to ad-hoc queries and OLAP

30 Star Schema Structure
Fact tables contain factual or quantitative data
Dimension tables contain descriptions about the subjects of the business
1:N relationship between dimension tables and fact tables
Dimension tables are denormalized to maximize performance
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

31 Star Schema Example
Fact table provides statistics for sales broken down by product, period, and store dimensions
Dimension tables provide details on stores, products, and time periods
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

32 Star Schema Example With Data
Diagram labels: Product, Period, Store, Sales
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
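
A star schema with Product, Period, and Store dimensions and a Sales fact table can be sketched with Python's sqlite3 module. The column names (ProductKey, UnitsSold, DollarsSold, and so on) and every data row are invented for this illustration, since the transcript does not preserve the contents of the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes (hypothetical columns).
CREATE TABLE Product (ProductKey INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Period  (PeriodKey  INTEGER PRIMARY KEY, Year INTEGER, Quarter INTEGER);
CREATE TABLE Store   (StoreKey   INTEGER PRIMARY KEY, City TEXT);

-- Fact table: one row per product/period/store combination, with measures.
CREATE TABLE Sales (
    ProductKey INTEGER REFERENCES Product(ProductKey),
    PeriodKey  INTEGER REFERENCES Period(PeriodKey),
    StoreKey   INTEGER REFERENCES Store(StoreKey),
    UnitsSold  INTEGER,
    DollarsSold REAL
);

INSERT INTO Product VALUES (1, 'Shoes'), (2, 'Socks');
INSERT INTO Period  VALUES (1, 2007, 4), (2, 2008, 1);
INSERT INTO Store   VALUES (1, 'Pittsburgh'), (2, 'Austin');
INSERT INTO Sales   VALUES
    (1, 1, 1, 10, 500.0),
    (2, 1, 1, 40, 200.0),
    (1, 2, 1,  5, 250.0),
    (1, 2, 2,  8, 400.0);
""")

# A typical analytical query: join the fact table to the dimensions it needs,
# then aggregate the measure.
rows = conn.execute("""
    SELECT s.City, p.Year, SUM(f.DollarsSold)
    FROM Sales AS f
    JOIN Store  AS s ON s.StoreKey  = f.StoreKey
    JOIN Period AS p ON p.PeriodKey = f.PeriodKey
    GROUP BY s.City, p.Year
    ORDER BY s.City, p.Year
""").fetchall()
print(rows)
```

Swapping Store for Product in the query changes only the join and the GROUP BY columns, which is the predictability the dimensional model aims for.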

33 Dimensional Model Benefits
Simple and predictable framework
–Well suited to ad-hoc analytical queries
–Relatively straightforward mapping from most transactional systems
Dimensional independence
–Query performance is somewhat independent of dimensions used in the query
Simplifies aggregation and comparison
Straightforward model extensions support evolution

34 Challenge: Fact Table Granularity
One of the biggest challenges in designing an effective star schema is deciding on the granularity of the fact data
–Transactional grain: finest level
–Aggregated grain: more summarized
Finer grains provide
–More detailed analysis capability
–More dimension tables, more rows in fact table (much larger storage)
–Better “drill-down” capabilities
Rule of thumb: use the smallest granularity of fact data that is possible given your technical, storage, and computational constraints

35 In-Class Exercise: Dimensional Modeling
Form teams of 2-3 people
Complete exercise 2, question #1 on handout
–Build a star schema to store grades at Millenium College

36 Wrap-up

37 Preparations For Next Week
Install and configure the SQL Server 2005 client so that it runs on your laptop by next week’s class
–You only need to install the client tools on your laptop unless you want to run the database server and other BI servers directly on your laptop
–We will provide tech support in installing and configuring client software
–We can only provide ‘as-able’ support with the server components
Installation instructions posted to Wiki
–Read and follow them carefully!

38 Extra Slides - ETL

39 Extract, Transform, and Load Processing (ETL)

40 The ETL Process
The process of creating informational data stores from operational data stores is commonly described as the Extract, Transform, and Load process, or ETL
There are four basic steps to ETL
–Capture/Extract source data
–Cleanse (scrub)
–Transform
–Load and Index
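
The four steps can be sketched end to end in a few lines of Python. Everything here is hypothetical: the source rows, the field names, the cleansing rule (drop rows with no customer name), and the order_facts target table are invented to show the shape of a pipeline, not how any particular ETL tool works.

```python
import sqlite3


def extract():
    """Capture/Extract: pull raw rows from a (made-up) operational source."""
    return [
        {"order_id": 1, "customer": " Sue Smith ", "qty": "5", "unit_price": "2.50"},
        {"order_id": 2, "customer": "", "qty": "1", "unit_price": "10.00"},
        {"order_id": 3, "customer": "Jane Chen", "qty": "1", "unit_price": "30.00"},
    ]


def cleanse(rows):
    """Cleanse: trim whitespace and drop rows that lack a customer name."""
    return [
        {**r, "customer": r["customer"].strip()}
        for r in rows
        if r["customer"].strip()
    ]


def transform(rows):
    """Transform: convert text fields to numbers and derive an order total."""
    return [
        (r["order_id"], r["customer"], int(r["qty"]) * float(r["unit_price"]))
        for r in rows
    ]


def load(conn, rows):
    """Load and Index: write rows to the target table, then build an index."""
    conn.execute("CREATE TABLE order_facts (order_id INTEGER, customer TEXT, total REAL)")
    conn.executemany("INSERT INTO order_facts VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_order_customer ON order_facts (customer)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(cleanse(extract())))
loaded = conn.execute(
    "SELECT order_id, customer, total FROM order_facts ORDER BY order_id"
).fetchall()
print(loaded)
```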

41 The Three-Layer Data Architecture
Data goes through three common stages during ETL
Operational Data
–Transactional data stored in individual systems of record throughout the organization
Reconciled Data
–Detailed, current data intended to be the single, authoritative source for all decision support applications
Derived Data
–Data that have been selected, formatted, and aggregated for end-user decision support applications
Diagram labels: Operational Data, Reconciled Data, Derived Data

42 Reconciling and Deriving Data
Diagram labels: Reconcile Data, Derive Data
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

43 In-Class Exercise: ETL
Form teams of 2-3 people
Complete exercise 1 on handout

44 Data Profiling
First step: understand your source data
–What is available? What is missing?
–What is ‘good’ quality data? What is of questionable quality?
–Data volumes, frequency, sparseness
–Embedded business rules
–Obvious (and subtle) data conflicts
   Ranges and formats
   Cardinality and uniqueness
   Key collisions
This is a long and often painful process that can require a lot of meticulous effort: budget and plan accordingly!

45 Reconciling and Deriving Data
Diagram labels: Reconcile Data, Derive Data
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

46 Data Characteristics: Status vs. Event Data
Diagram label: Status
Event: a database action (create/update/delete) that results from a transaction
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

47 Data Characteristics: Transient vs. Periodic Data
Transient data:
–Changes to existing records are written over previous records, thus destroying the previous data content
Periodic data:
–Never physically altered or deleted once they have been added to the store
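
The difference is easy to show with two in-memory stores. In this sketch the account ID, balances, and dates are made up, and the function name record_balance is hypothetical: the transient store overwrites the previous value, while the periodic store only ever appends.

```python
transient_store = {}   # account_id -> latest balance (old value is overwritten)
periodic_store = []    # append-only list of (account_id, balance, as_of)


def record_balance(account_id, balance, as_of):
    """Apply the same change to both kinds of store."""
    transient_store[account_id] = balance                  # destroys prior content
    periodic_store.append((account_id, balance, as_of))    # keeps full history


record_balance("A-1", 100, "2008-03-01")
record_balance("A-1", 250, "2008-03-15")

print(transient_store)   # only the current state survives
print(periodic_store)    # both states remain available for trend analysis
```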

48 Data Reconciliation
Typical operational data is:
–Transient: not historical
–Not always normalized (perhaps due to denormalization for performance)
–Restricted in scope: not comprehensive
–Sometimes poor quality: inconsistencies and errors
After reconciliation, data should be:
–Detailed: not summarized yet
–Historical: periodic
–Normalized: 3rd normal form or higher
–Comprehensive: enterprise-wide perspective
–Timely: data should be current enough to assist decision-making
–Quality controlled: accurate with full integrity
Diagram labels: Operational Data, Reconciled Data, Derived Data

49 Data Reconciliation: Capture/Extract
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract: capturing a snapshot of the source data at a point in time
Incremental extract: capturing changes that have occurred since the last static extract
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
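
A minimal sketch of the two extract styles, assuming each source row carries a last-modified date in ISO format (the field name modified and the sample rows are hypothetical). Note that this simple timestamp filter would not notice rows deleted from the source; production tools often rely on change logs for that.

```python
# Made-up source rows with a last-modified date (ISO strings sort chronologically).
source_rows = [
    {"id": 1, "name": "Ann", "modified": "2008-03-01"},
    {"id": 2, "name": "Bob", "modified": "2008-03-10"},
    {"id": 3, "name": "Cy", "modified": "2008-03-20"},
]


def static_extract(rows):
    """Snapshot of all source rows at this point in time."""
    return [dict(r) for r in rows]


def incremental_extract(rows, since):
    """Only rows changed after the previous extract date."""
    return [dict(r) for r in rows if r["modified"] > since]


snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, since="2008-03-05")
print(len(snapshot), [r["id"] for r in changes])
```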

50 Extract Challenges / Issues
What data should be extracted, and from where?
How should it be extracted?
How frequently should it be extracted?

51 Data Reconciliation: Scrub/Cleanse
Scrub/Cleanse: use pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Rule of thumb: Automate where possible!
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

52 Common Data Cleansing Tasks
Suppliers
Supplier_ID | Supplier Name | Contact Name
5623 | International Business Machines | Joe Smith
14534 | IBM | Jim Hwang
qwq77dfs | Intl. Business Machines | Susan Chen
Quick exercise: How many suppliers are listed in this table?
Supplier_Orders_US
Order_ID | Item | Quantity_Tons
44253 | Salt | 100
14534 | Salt | 250
Supplier_Orders_Europe
Order_ID | Item | Quantity
44253 | RoadSalt | 25 Truckloads
14534 | TableSalt | 500 Cases
Quick exercise: How many pounds of salt were purchased? ???

53 Common Data Cleansing Tasks
Reconciling mismatched data fields across source databases
–E.g. CompanyName field in db1 = Comp_Name field in db2
Finding or fixing missing data or data fields
–Database 1 records “region” as part of address, database 2 does not
Mismatched data types
–Zip stored as a string in one source database and as an integer in another
Converting between different units of measure
–Kilograms in European divisions database, pounds in US database
Resolving primary key collisions
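
Several of these tasks can be sketched as small record-cleaning functions. The record layouts, the function names, and the choice of a common target format (company_name, zip as a five-character string, weight in pounds) are assumptions made for this example; CompanyName and Comp_Name are the field names mentioned on the slide.

```python
LB_PER_KG = 2.20462  # approximate pounds per kilogram


def clean_us_record(rec):
    """US source (hypothetical layout): CompanyName field, integer Zip, pounds."""
    return {
        "company_name": rec["CompanyName"].strip(),
        "zip": str(rec["Zip"]).zfill(5),   # restore leading zeros lost by integer storage
        "weight_lb": float(rec["WeightLb"]),
    }


def clean_eu_record(rec):
    """European source (hypothetical layout): Comp_Name field, string Zip, kilograms."""
    return {
        "company_name": rec["Comp_Name"].strip(),
        "zip": str(rec["Zip"]),
        "weight_lb": round(rec["WeightKg"] * LB_PER_KG, 2),
    }


us = clean_us_record({"CompanyName": " Acme Corp ", "Zip": 2139, "WeightLb": 10})
eu = clean_eu_record({"Comp_Name": "Acme Corp", "Zip": "75008", "WeightKg": 100})
print(us)
print(eu)
```

After cleansing, both records share the same field names, types, and units, so they can be merged into one reconciled table.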

54 Data Quality
Goal of cleansing stage is to improve data quality
Common dimensions for measuring data quality: [Los03]
–Accuracy
–Completeness
–Consistency
–Currency/Timeliness
Why is it so hard to achieve (and maintain) a high level of data quality in a data warehouse?

55 Data Reconciliation: Transform
Transform: convert data from format of operational system to format of data warehouse
Record-level transformation:
–Selection: data partitioning
–Joining: data combining
–Aggregation: data summarization
Field-level transformation:
–Single-field: from one field to one field
–Multi-field: from many fields to one, or one field to many
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

56 Transform Examples: Single Field Transform
General transformation:
–Maps and transforms individual fields in the source record directly to individual fields in the target record
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

57 Transform Examples: Single Field Transform
Algorithmic transformation:
–Uses a formula or logical expression to map and transform individual fields in the source record directly to individual fields in the target record
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

58 Transform Examples: Single Field Transform
Table look-up transformation:
–Uses a separate table, keyed by source record code, to map and transform individual fields in the source record directly to individual fields in the target record
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
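
The algorithmic and table look-up styles of single-field transformation can be sketched as two small functions. The specific examples (a Celsius to Fahrenheit formula and a state-code table) are chosen for illustration and are not taken from the slide diagrams, which the transcript does not preserve.

```python
# Table look-up: a separate table keyed by the source code (hypothetical contents).
STATE_NAMES = {"PA": "Pennsylvania", "NY": "New York"}


def celsius_to_fahrenheit(temp_c):
    """Algorithmic transformation: target field computed from a formula."""
    return temp_c * 9 / 5 + 32


def lookup_state_name(code):
    """Table look-up transformation: target field read from a keyed table."""
    return STATE_NAMES.get(code, "UNKNOWN")


source_record = {"temp_c": 100, "state_code": "PA"}
target_record = {
    "temp_f": celsius_to_fahrenheit(source_record["temp_c"]),
    "state_name": lookup_state_name(source_record["state_code"]),
}
print(target_record)
```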

59 Transform Examples: Multi-Field Transform
M:1 transformation: maps many source fields to one target field
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

60 Transform Examples: Multi-Field Transform
1:M transformation: maps and transforms one source field to many target fields
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

61 Surrogate Keys
Reconciled data tables should use surrogate keys
–Surrogate keys are not business related
–Surrogate keys are independent of operational store’s primary keys
Surrogate keys are important because:
–Primary keys may change over time in source system
–Ability to properly track changes over time
–Consistency of key length/format/type
–Avoid primary key collisions
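
One simple way to assign surrogate keys is a mapping from (source system, source key) to a warehouse-generated integer. The class below is a sketch with hypothetical names; the example reuses order ID 44253, which appears in both the US and Europe order tables on the earlier cleansing slide.

```python
import itertools


class SurrogateKeyMap:
    """Assigns a stable, meaningless integer key to each (source, natural key) pair."""

    def __init__(self):
        self._next_key = itertools.count(1)
        self._assigned = {}

    def key_for(self, source_system, natural_key):
        pair = (source_system, natural_key)
        if pair not in self._assigned:
            self._assigned[pair] = next(self._next_key)
        return self._assigned[pair]


keys = SurrogateKeyMap()
us_order = keys.key_for("US", 44253)
eu_order = keys.key_for("Europe", 44253)   # same source key, different system
us_again = keys.key_for("US", 44253)       # repeat lookups return the same key

print(us_order, eu_order, us_again)
```

Because the warehouse key never depends on the source key's value or format, the collision between the two order systems disappears and all keys share one type.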

62 Data Reconciliation: Load and Index
Load/Index: place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
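
The two load modes can be contrasted with a dictionary standing in for the warehouse table. This is a simplified sketch with made-up function names and data; it ignores indexing and treats a change as a new or modified row only.

```python
def refresh_load(target, source_snapshot):
    """Refresh mode: discard the target contents and rewrite them in bulk."""
    target.clear()
    target.update(source_snapshot)


def update_load(target, changes):
    """Update mode: write only the rows that changed in the source."""
    target.update(changes)


warehouse = {}

# Initial bulk load from a full snapshot of the source.
refresh_load(warehouse, {1: "Shoes", 2: "Socks", 3: "Blouse"})

# Later, only the changed and new rows are applied.
update_load(warehouse, {2: "Socks (pair)", 4: "Pants"})

print(warehouse)
```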

63 Data Reconciliation Recap
After load/index, data reconciliation should be complete
After reconciliation, data should be:
–Detailed: not summarized yet
–Historical: periodic
–Comprehensive: enterprise-wide perspective
–Timely: data is current enough to assist decision-making
–Quality controlled: accurate with full integrity
Diagram labels: Operational Data, Reconciled Data, Derived Data

64 ETL Issue: Frequency Of Data Updates
How should an organization decide the frequency of updates from operational databases to data warehouses/marts?
What are the benefits and costs of frequent loads?
What are the benefits and costs of infrequent loads?

65 ETL Tool Example – Microsoft Integration Services
Scenario:
–Extract operational sales data from AdventureWorks transactional database
   Capture sales facts
   Capture salespeople and sales territories
–Transform to data mart structure (dimensional structure)
–Load data mart

66 Derived Data

67 Quick Review: Typical Data Warehouse Structure
Diagram labels: Reconcile Data, Derive Data
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

68 Derived Data
Although reconciled data provides a consistent, high-quality collection of enterprise data, it is not necessarily in an efficient form for use by BI tools
Derived data objectives:
–Ease of use for decision support applications
–Fast response to predefined user queries
–Customized data for particular target audiences
–Ad-hoc query support
–Data mining capabilities
Characteristics
–Detailed (mostly periodic) data
–Aggregated (for summary)
–Processed
–Distributed (to data marts)
Diagram labels: Operational Data, Reconciled Data, Derived Data

69 Dimensional Modeling: Facts and Dimensions
Dimensional modeling: a simple database design in which dimensional data are separated from fact or event data. Dimensional models are also sometimes called star schemas.
Dimensional models are a common way to represent derived data for informational data stores
–Well suited to ad-hoc queries and OLAP
–Poorly suited for transaction processing
–Commonly used for data warehouse/mart storage model

