Carnegie Mellon University ©2006 - 2010 Robert T. Monroe 70-451 Management Information Systems Data Warehousing 70-451 Management Information Systems Robert.

Presentation transcript:

1 Carnegie Mellon University ©2006 - 2010 Robert T. Monroe 70-451 Management Information Systems Data Warehousing 70-451 Management Information Systems Robert Monroe November 1, 2011

2 Today’s Quiz

3 By The End Of Today's Class You Should Be Able To:
  – Explain the most important attributes of a data warehouse and how it differs from a transactional database
  – Explain what happens in each of the steps of the Extract, Transform, and Load (ETL) process and why this process is important
  – List, and provide examples of, many of the most common data integrity problems that are repaired in the ETL process

4 Database Management Systems

5 Database Management Systems (DBMS)
  What DBMSs do:
  – Store data
  – Retrieve data (queries)
  – Update/Delete data
  – Abstract and simplify data storage for programs and people
  – Support concurrent access to data
  – Support a single interface to access many data sources
  Diagram: data stores (Sales, Operations, Accounting) sit behind the Database Management System, which serves database clients (CRM System, Reporting Tools, Website, Custom Application, BI Tools, …)

6 Structured Query Language (SQL)
  – SQL is a standard textual language for designing, querying, and manipulating relational databases
  – This is how programs communicate with relational databases
  – All of the major relational database vendors implement some form of SQL in their database products
  – Example query (queries data and returns set(s) of data rows):
      SELECT FirstName, LastName, Title, Country
      FROM Employees
      WHERE Country = 'USA'
      ORDER BY LastName DESC;
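To see what the example query returns, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The Employees rows are invented for illustration; only the query text comes from the slide.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (FirstName TEXT, LastName TEXT, Title TEXT, Country TEXT)"
)

# Hypothetical sample rows (not from the slides).
sample_employees = [
    ("Ana", "Diaz", "Sales Rep", "USA"),
    ("Ben", "Young", "Sales Manager", "USA"),
    ("Cho", "Abe", "Sales Rep", "Japan"),
]
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?)", sample_employees)

# The query from the slide, unchanged.
query = """
SELECT FirstName, LastName, Title, Country
FROM Employees
WHERE Country = 'USA'
ORDER BY LastName DESC;
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Only the two USA employees come back, with Young listed before Diaz because of ORDER BY LastName DESC.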

7 Business Analytics

8 Business Analytics
  – Core question: How can an organization manage and leverage large data sets to make better business decisions?
  – Business Analytics
    – A broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. (Wikipedia)
  – Two common uses for Business Analytics tools
    – Measuring where you are / how your business is performing
    – Identifying problems and opportunities

9 Business Analytics Systems Improve Decision Making
  Figure source: O’Brien, Management Information Systems, 6th ed.

10 A Business Analytics Question: If the goal of Business Analytics tools is to help organizations manage and leverage large data sets to make better business decisions, then what do they need in order to do so?

11 Data Warehousing Concepts

12 Quick Recap: Operational and Analytic Systems
  – Operational (transactional) system: A system that is used to run a business in real time, based on current data. Also called a “system of record”
  – Analytic (informational) systems: Systems designed to support decision making based on historical point-in-time and prediction data for complex queries or data mining applications
  – It is good practice to separate operational and analytic systems and data
    – To improve system performance
    – To improve database manageability and maintainability
    – To optimize each type of system for its primary purpose

13 Decision Making: Operational And Analytic Systems
  Figure labels: Analytic Systems, Operational Systems
  Figure source: O’Brien, Management Information Systems, 6th ed.

14 Data Warehouses
  – A data warehouse is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
    – Subject-oriented: e.g. customers, patients, students, products
    – Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources
    – Time-variant: Can study trends and changes
    – Non-updatable: Read-only, periodically refreshed
  – A data warehouse generally provides an integrated, company-wide view of high-quality information that is derived from disparate databases
  – Data warehouses are analytic data stores

15 Operational Systems Feed Analytic Systems
  – Analytic systems get their data from operational databases
  – This process generally requires significant processing (transformation) of the data stored in operational databases
  – This process is commonly known as ETL
    – Extract, Transform, and Load (ETL)

16 Data Marts and Data Warehouses
  – A Data Mart is an analytic data store that stores aggregated information for a specific group or function within an organization
    – Example: sales data for a single company division
    – Example: materials purchases for a group of factories
  – Data Warehouses generally include data from across the entire enterprise (multiple functions, divisions, etc.)
    – Generally much more difficult to build/deploy than data marts
    – Basically required for enterprise-wide analysis

17 Tradeoffs
  Blank comparison table: rows Data Warehouses and Data Marts; columns Advantages and Drawbacks

18 Data Warehousing Architectures

19 Generic Two-Level Data Warehouse Architecture
  – One company-wide warehouse, fed by ETL
  – Periodic extraction, so data in the warehouse is not completely current

20 Independent Data Marts
  – Data marts: Mini-warehouses, limited in scope
  – Separate ETL for each independent data mart
  – Data access complexity due to multiple data marts

21 Dependent Data Marts
  – Single ETL for the Enterprise Data Warehouse (EDW)
  – Dependent data marts loaded from the EDW
  – Simpler data access
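One way to picture a dependent data mart is as an aggregated table derived from the EDW rather than from the source systems. The sketch below uses SQLite from Python. The table names (edw_sales, mart_sales_emea), the columns, and the rows are all hypothetical and chosen only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical enterprise-wide table: every division's sales in one place.
conn.execute("CREATE TABLE edw_sales (division TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO edw_sales VALUES (?, ?, ?)",
    [
        ("EMEA", "Oranges", 100.0),
        ("EMEA", "Oranges", 250.0),
        ("EMEA", "Apples", 30.0),
        ("US", "Oranges", 25.0),
    ],
)

# Dependent data mart: loaded from the EDW, scoped to one division,
# and stored in aggregated form.
conn.execute(
    """
    CREATE TABLE mart_sales_emea AS
    SELECT product, SUM(amount) AS total_amount
    FROM edw_sales
    WHERE division = 'EMEA'
    GROUP BY product
    """
)

mart_rows = conn.execute(
    "SELECT product, total_amount FROM mart_sales_emea ORDER BY product"
).fetchall()
print(mart_rows)
```

Because the mart is loaded from the EDW, the cleansing and integration work is done once in the single ETL process instead of being repeated for each mart.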

22 Extract, Transform, and Load (ETL)

23 The ETL Process
  Diagram source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

24 The Capture/Extract Stage
  – Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
  – Static extract: capturing a snapshot of the source data at a point in time
  – Incremental extract: capturing changes that have occurred since the last static extract
  Diagram source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
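The difference between the two extract styles can be sketched in a few lines of Python. This sketch assumes each source row carries an updated_at timestamp, which is only one possible way to detect changes and is not something the slides specify. The rows and field names are invented for illustration.

```python
from datetime import datetime

# Hypothetical operational rows; "updated_at" is an assumed change-tracking column.
source_rows = [
    {"id": 1, "amount": 100, "updated_at": datetime(2011, 10, 1)},
    {"id": 2, "amount": 250, "updated_at": datetime(2011, 10, 20)},
    {"id": 3, "amount": 75, "updated_at": datetime(2011, 10, 31)},
]


def static_extract(rows):
    """Static extract: copy every chosen row as it exists right now."""
    return [dict(r) for r in rows]


def incremental_extract(rows, last_extract_time):
    """Incremental extract: copy only rows changed since the last extract."""
    return [dict(r) for r in rows if r["updated_at"] > last_extract_time]


snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2011, 10, 15))
print(len(snapshot), [r["id"] for r in changes])
```

A timestamp filter like this one cannot see rows that were deleted at the source, so real incremental extracts often rely on other change-capture mechanisms as well.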

25 The Scrub/Cleanse Stage
  – Scrub/Cleanse: Use pattern recognition and AI techniques to upgrade data quality
  – Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
  – Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
  Diagram source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.

26 Data Quality
  – Goal of the cleansing stage is to improve data quality
  – Common dimensions for measuring data quality [Los03]:
    – Accuracy
    – Completeness
    – Consistency
    – Currency/Timeliness
  – Why is it so hard to achieve (and maintain) a high level of data quality in a data warehouse?

27 Common Data Cleansing Tasks
  Quick exercise: How many suppliers are listed in this table?

  Suppliers
    Supplier_ID   Supplier Name                     Contact Name
    5623          International Business Machines   Joe Smith
    14534         IBM                               Jim Hwang
    qwq77dfs      Intl. Business Machines           Jim Hwang

  Quick exercise: How many kilos of oranges were purchased?

  Supplier_Orders_Europe
    Order_ID   Item      Quantity_Kilos
    44253      Oranges   100
    14534      Oranges   250

  Supplier_Orders_US
    Order_Num   Product   Quantity
    44253       Oranges   25 Pounds
    14534       Oranges   500 Cases ???

28 Common Data Cleansing Tasks
  – Reconciling mismatched data fields across source databases
    – E.g. CompanyName field in db1 = Comp_Name field in db2
  – Finding or fixing missing data or data fields
    – Database 1 records “region” as part of address, database 2 does not
  – Duplicate records
    – Sometimes obvious (same data, same primary key)
    – Sometimes non-obvious (same data, different keys)
  – Mismatched data types
    – Zip stored as a string in one source database and as an integer in another
  – Converting between different units of measure
    – Kilograms in MENA divisions database, pounds in US database
  – Resolving primary key collisions
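Two of these tasks, unit conversion and non-obvious duplicates, can be sketched with the values from the quick exercise on the previous slide. The ALIASES table is a hypothetical, hand-maintained mapping: it encodes the assumption that the three supplier rows refer to one company, which is a business judgment rather than something code can establish. The "Cases" order is left unresolved because the slide gives no conversion for it.

```python
LB_TO_KG = 0.45359237  # one international pound in kilograms (exact by definition)


def to_kilos(quantity, unit):
    """Convert a quantity to kilograms, or return None if the unit is unknown."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilos", "kilograms"):
        return float(quantity)
    if unit in ("lb", "lbs", "pounds"):
        return quantity * LB_TO_KG
    return None  # e.g. "cases": cannot be converted without more information


# Orange orders from the Europe and US tables on the exercise slide.
orange_orders = [(100, "Kilos"), (250, "Kilos"), (25, "Pounds"), (500, "Cases")]
converted = [to_kilos(q, u) for q, u in orange_orders]
known_total_kg = sum(v for v in converted if v is not None)
unresolved = [o for o, v in zip(orange_orders, converted) if v is None]

# Hypothetical alias table mapping name variants to one canonical supplier name.
ALIASES = {
    "ibm": "International Business Machines",
    "intl. business machines": "International Business Machines",
}


def canonical_supplier(name):
    """Map a supplier name to its canonical form if a known alias exists."""
    cleaned = name.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


supplier_names = ["International Business Machines", "IBM", "Intl. Business Machines"]
distinct_suppliers = {canonical_supplier(n) for n in supplier_names}

print(round(known_total_kg, 2), unresolved, distinct_suppliers)
```

Returning None for an unknown unit, instead of guessing, keeps the ambiguous order visible so it can be logged and resolved by a person.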

29 In-Class Exercise – Data Cleansing

30 ETL: Transform Stage
  – Transform: convert data from format of operational system to format of data warehouse
  – Record-level transformation:
    – Selection: data partitioning
    – Joining: data combining
    – Aggregation: data summarization
  – Field-level transformation:
    – Single-field: from one field to one field
    – Multi-field: from many fields to one, or one field to many
  Diagram source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
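The three record-level transformations can be illustrated with plain Python data structures. The order and customer records below are invented for the sketch. A real ETL tool or SQL would typically do this work, but the logic is the same.

```python
from collections import defaultdict

# Hypothetical extracted records.
orders = [
    {"order_id": 1, "customer_id": 10, "region": "EU", "amount": 100.0},
    {"order_id": 2, "customer_id": 11, "region": "US", "amount": 40.0},
    {"order_id": 3, "customer_id": 10, "region": "EU", "amount": 60.0},
]
customers = {10: "Acme", 11: "Globex"}

# Selection (data partitioning): keep only the rows that match a condition.
eu_orders = [o for o in orders if o["region"] == "EU"]

# Joining (data combining): attach the customer name to each order.
joined = [{**o, "customer_name": customers[o["customer_id"]]} for o in orders]

# Aggregation (data summarization): total order amount per customer.
totals = defaultdict(float)
for o in joined:
    totals[o["customer_name"]] += o["amount"]

print(len(eu_orders), dict(totals))
```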

31 Transform Examples: Single Field Transform
  – Algorithmic transformation:
    – Uses a formula or logical expression to map and transform individual fields in the source record directly to individual fields in the target record
  Diagram source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
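A minimal sketch of an algorithmic single-field transformation, using a temperature conversion as the formula. The field names (temp_f, temp_c) and records are hypothetical. The transcript does not preserve the example shown in the slide's diagram, so this one is a stand-in.

```python
def fahrenheit_to_celsius(temp_f):
    """The formula that maps the source field to the target field."""
    return (temp_f - 32) * 5 / 9


# Hypothetical source records with a temperature stored in Fahrenheit.
source_records = [
    {"sensor_id": "A1", "temp_f": 212.0},
    {"sensor_id": "B2", "temp_f": 32.0},
]

# One source field (temp_f) becomes one target field (temp_c).
target_records = [
    {"sensor_id": r["sensor_id"], "temp_c": fahrenheit_to_celsius(r["temp_f"])}
    for r in source_records
]
print(target_records)
```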

32 Transform Examples: Single Field Transform
  – Table look-up transformation:
    – Uses a separate table, keyed by source-code records, to map and transform individual fields in the source record directly to individual fields in the target record
  Diagram source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
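A table look-up transformation replaces a coded value with the value found in a separate lookup table. The sketch below uses a Python dict as the lookup table. The state codes, field names, and records are hypothetical examples.

```python
# Hypothetical lookup table keyed by the code stored in the source records.
STATE_LOOKUP = {"PA": "Pennsylvania", "NY": "New York", "CA": "California"}


def lookup_state(code):
    """Return the full state name for a code, or None if the code is unknown."""
    return STATE_LOOKUP.get(code.strip().upper())


source_records = [
    {"customer_id": 1, "state_code": "PA"},
    {"customer_id": 2, "state_code": "ny"},
    {"customer_id": 3, "state_code": "ZZ"},
]

target_records = []
unmatched = []  # records whose code is missing from the lookup table
for record in source_records:
    state_name = lookup_state(record["state_code"])
    if state_name is None:
        unmatched.append(record)
    else:
        target_records.append(
            {"customer_id": record["customer_id"], "state_name": state_name}
        )

print(target_records, unmatched)
```

Collecting unmatched codes separately, rather than dropping them silently, ties in with the error detection and logging listed under the scrub/cleanse stage.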

33 ETL: Load and Index Stage
  – Load: Add the captured, cleansed, and transformed data to the data warehouse
  – Index: Build the index tables to allow fast retrieval for complex data warehouse queries
  Diagram source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
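The load and index steps can be sketched with SQLite from Python. The sales_fact table, its columns, and the rows are hypothetical. SQLite's CREATE INDEX stands in here for whatever indexing a production warehouse platform would actually use.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_fact "
    "(order_id INTEGER, product TEXT, quantity_kg REAL, load_date TEXT)"
)

# Hypothetical rows that have already been extracted, cleansed, and transformed.
transformed_rows = [
    (44253, "Oranges", 100.0, "2011-11-01"),
    (14534, "Oranges", 250.0, "2011-11-01"),
    (90001, "Apples", 30.0, "2011-11-01"),
]

# Load: add the prepared rows to the warehouse table.
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", transformed_rows)

# Index: build an index on a column that analytic queries filter on.
warehouse.execute("CREATE INDEX idx_sales_fact_product ON sales_fact (product)")
warehouse.commit()

total_oranges_kg = warehouse.execute(
    "SELECT SUM(quantity_kg) FROM sales_fact WHERE product = ?", ("Oranges",)
).fetchone()[0]
index_names = [
    row[0]
    for row in warehouse.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'sales_fact'"
    )
]
print(total_oranges_kg, index_names)
```

In this sketch the index is built after the rows are loaded, so the database builds it once over the full table instead of updating it on every insert.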

34 Data Reconciliation Recap
  After the ETL process, the data should be:
  – Detailed
  – Historical: periodic
  – Comprehensive: enterprise-wide perspective
  – Timely: data is current enough to assist decision-making
  – Quality controlled: accurate with full integrity
  Diagram labels: Operational Data, Reconciled Data

35 References
  [HPM05] Jeffrey Hoffer, Mary Prescott, Fred McFadden, Modern Database Management, 7th Ed., Pearson - Prentice Hall, 2005, ISBN: 0-13-145320-3.
  [Los03] David Loshin, Business Intelligence: The Savvy Manager’s Guide, Morgan Kaufmann Publishers, 2003, ISBN: 1-55860-916-4.

