MIS2502: Data Analytics Dimensional Data Modeling

Presentation transcript:

MIS2502: Data Analytics Dimensional Data Modeling Aaron Zhi Cheng http://community.mis.temple.edu/zcheng/ acheng@temple.edu Acknowledgement: David Schuff

Where we are… Now we’re here…
Data entry → Transactional Database → Data extraction → Analytical Data Store → Data analysis
Transactional Database: stores real-time transactional data
Analytical Data Store: stores historical transactional and summary data

What do we know so far? Why are relational databases good for storing transaction data? Why are they bad for analytical processing? What’s the solution?

Dimensional Data Modeling
Is a set of techniques and concepts used in data warehouse design
Optimized for analytical processing
Different from relational data modeling (ERD)

Some terminology
Data Warehouse: Takes many forms. Really is just a repository for historical data.
Data Mart: Subset of the Data Warehouse. Designed for specific analysis.
Data Cube: Organization of data as a “multidimensional matrix”. Implementation of a Data Mart.

The Actual Process
Sources: Transactional Database 1, Transactional Database 2, Other Sources
Each source → ETL → Data Warehouse
Data Warehouse → ETL → Data Mart (Sales), Data Mart (Finance), Data Mart (Inventory)
The Data Warehouse and the Data Marts make up the Analytical Data Store

The Data Cube
Core component of Online Analytical Processing (OLAP) and Multidimensional Data Analysis
Made up of “facts” and “dimensions”
Product dimension: Diet Coke, Famous Amos, M&Ms, Doritos
Store dimension: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA
Time dimension: Jan. 2013, Feb. 2013, Mar. 2013
Each cell holds quantity & total price
Quantity sold and total price are measured facts. Why isn’t product price a measured fact?

The Data Cube
Product dimension: Diet Coke, Famous Amos, M&Ms, Doritos
Store dimension: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA
Time dimension: Jan. 2013, Feb. 2013, Mar. 2013
Each cell holds quantity & total price
The highlighted element represents all the M&Ms sold in Ardmore, PA in January 2013
A single summary record representing a business event (monthly sales).
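One way to picture a single cube cell is as the sum of the individual sales that share the same product, store, and month. The sketch below is only an illustration: the transaction rows and the helper name `build_cube` are assumptions, not part of the original slides.

```python
from collections import defaultdict

# Hypothetical individual sales: (product, store, month, quantity, total_price).
transactions = [
    ("M&Ms", "Ardmore, PA", "Jan. 2013", 2, 3.00),
    ("M&Ms", "Ardmore, PA", "Jan. 2013", 1, 1.50),
    ("Doritos", "Temple Main", "Feb. 2013", 4, 8.00),
]


def build_cube(rows):
    """Sum the measured facts for each (product, store, month) cell."""
    cube = defaultdict(lambda: {"quantity": 0, "total_price": 0.0})
    for product, store, month, quantity, total_price in rows:
        cell = cube[(product, store, month)]
        cell["quantity"] += quantity
        cell["total_price"] += total_price
    return dict(cube)


cube = build_cube(transactions)
# One summary record for all M&Ms sold in Ardmore, PA in one month.
print(cube[("M&Ms", "Ardmore, PA", "Jan. 2013")])
```

The two M&Ms sales collapse into one cell, which is the "single summary record" the slide refers to.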

Product dimension: Diet Coke, Famous Amos, M&Ms, Doritos
Store dimension: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA
Time dimension: Jan. 2013, Feb. 2013, Mar. 2013
Each cell holds quantity & total price
The highlighted elements represent Famous Amos cookies sold on Temple’s Main campus from January to March, 2013
This is called “slicing the data.”

Dicing the Data
Product dimension: Diet Coke, Famous Amos, M&Ms, Doritos
Store dimension: Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA
Time dimension: Jan. 2013, Feb. 2013, Mar. 2013
Each cell holds quantity & total price
What do the orange highlighted elements represent?
What do the purple highlighted elements represent?
This is called “dicing the data”
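Slicing and dicing can both be sketched as filters on the cube's coordinates. The code below is a minimal illustration under assumptions: the cube is a plain dictionary with a placeholder fact in every cell, the helper `select` is hypothetical, and "dicing" is read here as picking a smaller sub-cube with several values on more than one dimension.

```python
# Hypothetical cube: (product, store, month) -> quantity sold (placeholder 1).
products = ["Diet Coke", "Famous Amos", "M&Ms", "Doritos"]
stores = ["Ardmore, PA", "Temple Main", "Cherry Hill, NJ", "King of Prussia, PA"]
months = ["Jan. 2013", "Feb. 2013", "Mar. 2013"]
cube = {(p, s, m): 1 for p in products for s in stores for m in months}


def select(cube, products=None, stores=None, months=None):
    """Keep the cells whose coordinates fall in the given value sets.

    A dimension left as None is unrestricted.
    """
    def keep(value, allowed):
        return allowed is None or value in allowed

    return {
        key: facts
        for key, facts in cube.items()
        if keep(key[0], products) and keep(key[1], stores) and keep(key[2], months)
    }


# Slice, as in the slide's example: one product at one store, all months.
sliced = select(cube, products={"Famous Amos"}, stores={"Temple Main"})

# Dice: a sub-cube with several values on more than one dimension.
diced = select(
    cube,
    products={"M&Ms", "Doritos"},
    stores={"Ardmore, PA", "Temple Main"},
    months={"Jan. 2013"},
)
print(len(sliced), len(diced))
```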

Could you have a data mart with five dimensions? Then why does our example (and most others) only have three?

Modeling a data cube: The Star Schema
Transactional databases aren’t built around dimensions
  They don’t map well to cubes
  They aren’t set up for summarization
So we build a star schema
  Built around “dimensions” and “facts”
  Simplified relational model
The star schema facilitates
  Aggregating individual transactions
  Creation of cubes
Fact table: Sales (Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price)
Dimension table: Store (Store_ID, Store_Address, Store_City, Store_State, Store_Type)
Dimension table: Product (Product_ID, Product_Name, Product_Price, Product_Weight)
Dimension table: Time (Time_ID, Day, Month, Year)

Fact Table
Contains the following elements:
  Primary key
  Facts (numeric measurements) associated with a specific business process
  Foreign keys that refer to dimension tables
Sales Fact: Sales (Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price)

Dimension Tables
Provide the “who, what, where, when, why, and how” context surrounding a business process event
Contain the following elements:
  Primary key
  Descriptive attributes
Fact table: Sales (Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price)
Dimension table: Store (Store_ID, Store_Address, Store_City, Store_State, Store_Type)
Dimension table: Product (Product_ID, Product_Name, Product_Price, Product_Weight)
Dimension table: Time (Time_ID, Day, Month, Year)
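The star schema above can be sketched with Python's built-in `sqlite3` module. The table and column names follow the slides (with underscores in place of spaces); the column types and every data row are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store (
    Store_ID INTEGER PRIMARY KEY,
    Store_Address TEXT, Store_City TEXT, Store_State TEXT, Store_Type TEXT
);
CREATE TABLE Product (
    Product_ID INTEGER PRIMARY KEY,
    Product_Name TEXT, Product_Price REAL, Product_Weight REAL
);
CREATE TABLE Time (
    Time_ID INTEGER PRIMARY KEY,
    Day INTEGER, Month INTEGER, Year INTEGER
);
-- Fact table: primary key, foreign keys to each dimension, numeric facts.
CREATE TABLE Sales (
    Sales_ID INTEGER PRIMARY KEY,
    Product_ID INTEGER REFERENCES Product(Product_ID),
    Store_ID INTEGER REFERENCES Store(Store_ID),
    Time_ID INTEGER REFERENCES Time(Time_ID),
    Quantity_Sold INTEGER,
    Total_Price REAL
);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO Store VALUES (1, '1 Main St', 'Ardmore', 'PA', 'Retail')")
conn.execute("INSERT INTO Product VALUES (1, 'M&Ms', 1.50, 1.69)")
conn.executemany("INSERT INTO Time VALUES (?, ?, ?, ?)",
                 [(1, 15, 1, 2013), (2, 16, 1, 2013)])
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)",
                 [(1, 1, 1, 1, 2, 3.00), (2, 1, 1, 2, 1, 1.50)])

# Join the fact table to its dimensions and summarize by month.
rows = conn.execute("""
    SELECT p.Product_Name, s.Store_City, t.Month, t.Year,
           SUM(f.Quantity_Sold), SUM(f.Total_Price)
    FROM Sales f
    JOIN Product p ON f.Product_ID = p.Product_ID
    JOIN Store s ON f.Store_ID = s.Store_ID
    JOIN Time t ON f.Time_ID = t.Time_ID
    GROUP BY p.Product_Name, s.Store_City, t.Year, t.Month
""").fetchall()
print(rows)
```

Every dimension is one join away from the fact table, which is what gives the schema its star shape.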

Designing the Star Schema
Kimball’s Four Step Process for Dimensional Data Modeling (Kimball et al., 2008)
1. Choose the business process
2. Decide on the level of granularity
3. Identify the dimensions
4. Identify the fact
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/four-4-step-design-process/

Choose the business process
Business processes are the operational activities performed by your organization
What your data cube is “about”
Determined by the questions you want to answer about your organization
Question: What are my highest selling products? Business process: Sales
Question: Which teachers have the best student performance? Business process: Standardized testing
Question: Which supplier is offering us the best deals? Business process: Purchasing
Notes: Business processes are the operational activities performed by an organization, such as taking an order, processing an insurance claim, or registering students for a class. Business process events generate or capture performance metrics that translate into facts in a fact table. Most fact tables focus on the results of a single business process. Choosing the process is important because it defines a specific design target and allows the grain, dimensions, and facts to be declared. Note that a “business process” is not always about business.

Decide on the level of granularity
Level of detail for each business process event
Will determine the data in the dimensions
Example: Who is my best customer?
  The “event” is a sales transaction
  Choices for time: yearly, quarterly, monthly, daily
  Choices for store: store, city, state
How would you select the right granularity?
Notes: Declaring the grain is the pivotal step in a dimensional design. The grain establishes exactly what a single fact table row represents.
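To see what the choice of grain does, the same sales events can be summarized at two different levels of detail. This is a rough sketch: the sales rows and the helper `aggregate` are hypothetical, and only quantity is tracked to keep it short.

```python
from collections import Counter
from datetime import date

# Hypothetical sales events: (sale date, store, state, quantity).
sales = [
    (date(2013, 1, 5), "Temple Main", "PA", 2),
    (date(2013, 1, 5), "Temple Main", "PA", 1),
    (date(2013, 1, 20), "Ardmore", "PA", 4),
    (date(2013, 2, 3), "Cherry Hill", "NJ", 3),
]


def aggregate(rows, time_grain, store_grain):
    """Total quantity per (time, store) key at the chosen grain."""
    totals = Counter()
    for day, store, state, quantity in rows:
        time_key = {
            "daily": day.isoformat(),
            "monthly": day.strftime("%Y-%m"),
            "yearly": str(day.year),
        }[time_grain]
        store_key = {"store": store, "state": state}[store_grain]
        totals[(time_key, store_key)] += quantity
    return dict(totals)


daily_by_store = aggregate(sales, "daily", "store")
monthly_by_state = aggregate(sales, "monthly", "state")
print(len(daily_by_store), len(monthly_by_state))
```

The finer grain keeps more rows and can still be rolled up to the coarser one; the monthly, state-level result can no longer say what happened on a given day or at a given store.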

Identify the dimensions
Description of the context of the business process
who, what, where, when, why, and how
Example: Sales transaction
  A “sale” is the fact
  Dimensions: Product (what), Store (where), Time (when)
Notes: Dimensions provide the “who, what, where, when, why, and how” context surrounding a business process event. Dimension tables contain the descriptive attributes used by BI applications for filtering and grouping the facts.

Identify the fact
The fact table contains data called facts associated with the business process event
Keys
  Primary key for each event
  Foreign keys for the associated dimensions
  Example: Sales has Sales_ID as primary key, and Product_ID, Store_ID, and Time_ID as foreign keys
Facts: Measured, numeric data
  Quantifiable information for each business event, almost always numeric
  Describes a particular combination of dimensional data
  Example: Sales has quantity_sold and total_price
Notes: Facts are the measurements that result from a business process event and are almost always numeric. In a retail sales transaction, the quantity of a product sold and its extended price are good facts.

From Star Schema to Data Cube
A Cube typically uses a Star Schema as its source and stores precomputed summarized (aggregated) data
Much more efficient, but can’t be changed (non-volatile)
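A rough way to imitate a cube built from a star schema is to store the result of a GROUP BY query as its own table. The sketch below uses `sqlite3`; the trimmed-down schema, the table name `Sales_Cube`, and all data rows are assumptions for illustration, and a real OLAP engine stores its aggregates differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time (Time_ID INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE Sales (
    Sales_ID INTEGER PRIMARY KEY,
    Product_ID INTEGER, Store_ID INTEGER,
    Time_ID INTEGER REFERENCES Time(Time_ID),
    Quantity_Sold INTEGER, Total_Price REAL
);
-- Hypothetical sample rows.
INSERT INTO Time VALUES (1, 5, 1, 2013), (2, 20, 1, 2013), (3, 3, 2, 2013);
INSERT INTO Sales VALUES
    (1, 10, 100, 1, 2, 3.0),
    (2, 10, 100, 2, 1, 1.5),
    (3, 10, 100, 3, 4, 6.0);
-- Precompute the monthly summary once and store the result.
CREATE TABLE Sales_Cube AS
SELECT f.Product_ID, f.Store_ID, t.Year, t.Month,
       SUM(f.Quantity_Sold) AS Quantity_Sold,
       SUM(f.Total_Price) AS Total_Price
FROM Sales f JOIN Time t ON f.Time_ID = t.Time_ID
GROUP BY f.Product_ID, f.Store_ID, t.Year, t.Month;
""")


def cube_rows():
    """Read the stored summary; no aggregation happens at query time."""
    return conn.execute(
        "SELECT Year, Month, Quantity_Sold, Total_Price "
        "FROM Sales_Cube ORDER BY Year, Month"
    ).fetchall()


before = cube_rows()
# A new sale lands in the star schema; the stored cube is not updated.
conn.execute("INSERT INTO Sales VALUES (4, 10, 100, 3, 5, 7.5)")
after = cube_rows()
print(before == after)
```

Reading the summary is cheap because the work was done up front, and the stored rows stay as they were until the cube is rebuilt.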

Advantages of Data Cube
Speed: Fast response for the information you previously designed into the cube
Analysis: The multidimensional data structure allows the data to be analyzed in the most logical way

Data Cube Caveats
The cube is “non-volatile,” so you’re locked in
  Measured facts
  Dimensions
  Granularity
So choose wisely!
For example: You can’t track daily sales if “date” is monthly
So why not include every single sale and do no aggregation?

Pivot tables in Excel
PivotTable is a data summarization tool in Excel
It is the easiest way to learn about multidimensional data and generate simple reports
Data cubes can act as the data source for a PivotTable in Excel

ICA #5
In ICA #5, we learned how to create a pivot table in Excel
Identify which fields are assigned as VALUES and which ones are assigned as ROWS
Identify the correct function for aggregation: e.g., SUM, COUNT, AVERAGE, MAX, MIN

The star schema in ICA #5
Measured Fact: Order amount
Three dimensions: Salesperson, Country, and Time

Pivot Table and Data Cube
The fields in the ROWS box correspond to dimensions in a data cube
The fields in the VALUES box correspond to measured facts in a data cube
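The ROWS and VALUES correspondence can be sketched without Excel. In the illustration below, the `rows` argument plays the part of the ROWS box, `value` plays the part of the VALUES box, and `aggfunc` is the aggregation function. The order records, the salesperson names, and the helper `pivot` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical order records shaped like the ICA #5 star schema.
orders = [
    {"Salesperson": "Ann", "Country": "USA", "Year": 2012, "Order_Amount": 100.0},
    {"Salesperson": "Ann", "Country": "USA", "Year": 2013, "Order_Amount": 50.0},
    {"Salesperson": "Bob", "Country": "UK", "Year": 2013, "Order_Amount": 200.0},
    {"Salesperson": "Bob", "Country": "UK", "Year": 2013, "Order_Amount": 25.0},
]


def pivot(records, rows, value, aggfunc=sum):
    """Group records by the ROWS fields and aggregate the VALUES field."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in rows)
        groups[key].append(record[value])
    return {key: aggfunc(values) for key, values in sorted(groups.items())}


# One dimension in ROWS, SUM of the measured fact in VALUES.
by_country = pivot(orders, ["Country"], "Order_Amount")

# Two dimensions in ROWS, COUNT instead of SUM.
orders_per_person_year = pivot(
    orders, ["Salesperson", "Year"], "Order_Amount", aggfunc=len
)
print(by_country)
print(orders_per_person_year)
```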

Example 1 [figure; labels: Dimension, Measured Fact]

Example 2 [figure; labels: Dimensions, Measured Fact]

Summary
Data warehouse vs. data mart vs. data cube
Data Cube
Star schema
Kimball’s four step process for dimensional data modeling
Pivot tables in Excel