CSD305 Data Warehouse Design

An introduction to data warehousing
Connolly and Begg, Database Systems, 4th edition, chapters 31 and 32
Tan, Steinbach and Kumar, Introduction to Data Mining, Pearson Education

Agenda
Business Intelligence
The architecture of data warehouse and data mart
Data integration and cleansing
Data extraction and OLAP

Data warehousing is no longer seen as optional for many businesses.
Database vendors now include data warehousing capabilities.
It serves not only internal users but also external ones, e.g. customers and suppliers.
Driven by:
  Government regulation that requires maintenance of transactional histories
  Cheaper, more reliable data storage
  Emergence of real-time data warehousing for time-critical BI applications

Business Intelligence (BI), Data Mining (DM) and Knowledge Discovery in Databases (KDD)
Recent advances in technology have resulted in an explosion in the amount of data generated and collected by businesses and organizations:
  Point-of-sale data in retail
  Smart card technology (e.g. Oyster cards)
  Satellite data
Extracting useful information from these data sets can be challenging.

Benefits
Potential high returns on investment (ROI)
  A data warehouse can cost from tens of thousands to millions to implement.
  Study by International Data Corporation (IDC): data warehouse projects delivered an average three-year ROI of 401% in 1996.
  A later study: analytical tools delivered an average one-year ROI of 431% in 2002.
Competitive advantage
  The huge ROI is evidence of enormous competitive advantage.
  Taps into previously unknown, untapped information on customers, trends and demands.
Increased productivity of corporate decision makers
  Consistent, subject-oriented historical data
  Integrates data from multiple incompatible systems, providing a consistent view
  Allows for more substantive, accurate and consistent analysis

Knowledge Discovery in Databases (KDD)
Figure: the KDD process. Stages: Input data, Data pre-processing, Data mining, Post-processing, Knowledge. Activity labels: Filtering, Pattern analysis, Visualization, Cleansing.
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. The process includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets, and interpreting accurate solutions from the observed results.

How to get to BI?
Data mining techniques, technologies and algorithms are being developed to provide BI.
Before we get to data mining we must first pre-process the data to get it into the right form.
This will often involve the design of a data warehouse:
  Design the data warehouse architecture
  Extract and cleanse the data
  Integrate the data into a common data mart schema

Data warehousing - General Architecture
Operational data
  Mainframe data held in first-generation hierarchical and network databases
  Departmental data in proprietary file systems (e.g. VSAM, the Virtual Storage Access Method, on IBM operating systems, and RMS file systems for other vendors) and in RDBMSs
  Private data on workstations and private servers
  External systems, e.g. the internet, commercial databases, or databases associated with business customers or suppliers

Data Warehousing concepts
Subject-oriented, integrated, time-variant, non-volatile

Subject-oriented: organised around the major subjects of the business, e.g. customers, products, sales, rather than application areas, e.g. customer invoicing, stock control, product sales.
Integrated: source data is often inconsistent, e.g. using different formats. The integrated data source has to be made consistent.
Time-variant: data in the warehouse is accurate and valid only at some point in time or over some time interval. Data is held over an extended time, with an explicit or implicit association of time with the data. The data represents a series of snapshots.
Non-volatile: data is not updated in real time but refreshed on a regular basis. New data is added as a supplement rather than a replacement, and is integrated incrementally with previous data, continuing to build historical data.
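
The time-variant and non-volatile properties can be sketched in a few lines of Python. This is a minimal illustration, not a real warehouse loader: the `refresh` and `history` functions, the row layout, the dates and the sample customers are all hypothetical. Each refresh appends a dated snapshot and never overwrites earlier rows, so history accumulates.

```python
warehouse = []  # hypothetical store of (snapshot_date, customer_id, city) rows


def refresh(store, snapshot_date, operational_rows):
    """Append a dated snapshot; existing rows are never updated or deleted."""
    for customer_id, city in operational_rows:
        store.append((snapshot_date, customer_id, city))


def history(store, customer_id):
    """Return every snapshot recorded for one customer, oldest first."""
    return sorted(row for row in store if row[1] == customer_id)


# Two periodic refreshes; customer 1 moved between them.
refresh(warehouse, "2024-01-31", [(1, "London"), (2, "Glasgow")])
refresh(warehouse, "2024-02-29", [(1, "Leeds"), (2, "Glasgow")])

customer_1 = history(warehouse, 1)
```

After both refreshes the warehouse still holds the January row for customer 1, which an operational system that updates in place would have lost.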

Data Integration Component
Systems storing data may be different:
  Data type differences
  Value differences, e.g. the colour black stored as 0, BL, or Black
  Semantic differences (the same term with different interpretations), e.g. a column ‘Title’ means ‘Job Title’ in one database and ‘Person Title’ in another
  Missing values (NULL values)
  Different schemata
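
The value differences above can be illustrated with a minimal Python sketch. The mapping table `COLOUR_CODES` and the function `normalise_colour` are hypothetical names chosen for illustration; a real integration step would derive such mappings from the actual source systems.

```python
# Hypothetical mapping from source-specific colour codes to one canonical value.
COLOUR_CODES = {
    "0": "Black",
    "bl": "Black",
    "black": "Black",
}


def normalise_colour(raw):
    """Return the canonical colour, or None for missing or unknown values."""
    if raw is None:
        return None  # a missing (NULL) value stays missing
    key = str(raw).strip().lower()
    return COLOUR_CODES.get(key)


# Three source systems represent the same colour in three different ways.
source_values = [0, "BL", "Black", None]
cleaned = [normalise_colour(v) for v in source_values]
```

Unknown codes map to None here so that they can be reviewed rather than silently guessed; that is a design choice of this sketch, not something the slides prescribe.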

Data Integration Conflicts
Schema-level conflicts are due to:
  Different perceptions
  Different focus
Value-level conflicts are due to:
  Different representations, coding, etc.
  Different precision
  Incorrect information
  Data entry errors
Resolving these conflicts is the task of data cleaning.

The Warehousing Approach
Information is integrated in advance and stored in the warehouse (WH) for direct querying and analysis.
ETL: Extract, Transform, Load
  Extract: read and convert data from various sources, usually flat files and RDBMSs, but potentially any data store.
  Transform: apply rules, filter, sort, aggregate, join, clean, generate calculated data, and validate data.
  Load: write the transformed data to the target database.
Data may be loaded into an Operational Data Store (ODS):
  A repository of current and integrated data used for analysis
  Often structured in the same way as the data warehouse, but may simply be used as a staging area
  Holds data already extracted from sources and cleaned
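
The three ETL stages can be sketched with the Python standard library (`csv`, `io`, `sqlite3`). This is an illustrative outline under stated assumptions, not a production pipeline: the `RAW_SALES` flat file, the `sales` table and the validation rule are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file extract; a real source would be a file or an RDBMS.
RAW_SALES = """sale_id,store,amount
1,London,100.50
2,london ,200.00
3,Glasgow,
"""


def extract(text):
    """Extract: read rows from a flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean store names, validate amounts, drop invalid rows."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # validation rule: skip rows with a missing amount
        cleaned.append(
            (int(row["sale_id"]), row["store"].strip().title(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows to the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (sale_id INTEGER, store TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_SALES)), conn)
loaded = conn.execute("SELECT store, SUM(amount) FROM sales GROUP BY store").fetchall()
```

The two spellings of London are merged by the transform step, and the Glasgow row is rejected because its amount is missing.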

Dimensional modelling
Dimensional modelling presents the data in a standard form.
Each dimensional model has:
  One fact table
  Multiple dimension tables
  Each dimension gives a different focus to the data
Star schema or star join
  A logical design technique that aims to present data in a standard, intuitive form that allows for high-performance access.
  A fact table in the centre, surrounded by denormalized dimension tables.

Star schema
Fact table
  One table with a composite primary key.
  Holds factual data generated by events that occurred in the past, which is unlikely to change regardless of how it is analysed.
  Can be large relative to the dimension tables.
  Holds numerical measures, or ‘facts’, that occur for each record.
  Other examples: offer price, selling price, sale commission, sale revenue.
Dimension table
  Each has a simple primary key (PK), which corresponds to exactly one component of the composite key in the fact table.
  All natural keys are replaced by surrogate keys.
  A natural key has a relationship with the data, e.g. an ISBN or a vehicle registration number.
  A surrogate key has no relationship with the data; it is a generated value that makes the row unique.
  Surrogate keys generally have a structure based on integers.
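
A minimal sketch of this structure, using SQLite through Python's built-in `sqlite3` module. The table and column names (`time_dim`, `property_dim`, `property_sale_fact`) are hypothetical, loosely modelled on a property sales example, and are not taken from the slides. Each dimension table has a simple integer surrogate key, and the fact table's composite primary key is built from those keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY,      -- surrogate key (generated integer)
    sale_date TEXT
);
CREATE TABLE property_dim (
    property_id INTEGER PRIMARY KEY,  -- surrogate key (generated integer)
    property_no TEXT,                 -- natural key kept as an ordinary attribute
    city TEXT
);
CREATE TABLE property_sale_fact (
    time_id INTEGER REFERENCES time_dim(time_id),
    property_id INTEGER REFERENCES property_dim(property_id),
    offer_price REAL,                 -- numerical facts
    selling_price REAL,
    PRIMARY KEY (time_id, property_id)  -- composite key, one part per dimension
);
""")

# Inspect the fact table: the sixth field of each row is the column's position in the PK.
columns = conn.execute("PRAGMA table_info(property_sale_fact)").fetchall()
pk_columns = [col[1] for col in columns if col[5] > 0]
```

Here `pk_columns` lists the fact table's key columns, each of which matches the simple primary key of exactly one dimension table.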

Star schema essentials
One fact table
Multiple dimension tables
Each dimension gives a different focus to the data

Star join

Star join essentials
Again: fact tables and dimension tables.
A structure is said to be a star join when one large central table is joined to two or more dimension tables.
The fact table is the structure that holds the majority of the occurrences of the data.
Fact tables typically combine data and cross-reference keys from a variety of other tables.
Dimension tables contain data that is not very voluminous.
Dimension tables are related to fact tables by means of a foreign key relationship.
A typical use would be an end-user query on any fact table.
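
A star join query can be sketched as follows, again with Python's `sqlite3`. The schema and the sample rows are hypothetical and kept deliberately small; the point is only the shape of the query, with one central fact table joined to two dimension tables through foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE branch_dim (branch_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sale_fact (
    time_id INTEGER REFERENCES time_dim(time_id),
    branch_id INTEGER REFERENCES branch_dim(branch_id),
    selling_price REAL
);
INSERT INTO time_dim VALUES (1, 2003), (2, 2004);
INSERT INTO branch_dim VALUES (1, 'London'), (2, 'Glasgow');
INSERT INTO sale_fact VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 200.0), (2, 1, 25.0);
""")

# The fact table sits in the centre; each dimension is reached by one key join.
STAR_JOIN = """
SELECT t.year, b.city, SUM(f.selling_price) AS total
FROM sale_fact AS f
JOIN time_dim AS t ON f.time_id = t.time_id
JOIN branch_dim AS b ON f.branch_id = b.branch_id
GROUP BY t.year, b.city
ORDER BY t.year, b.city
"""
totals = conn.execute(STAR_JOIN).fetchall()
```

The end user asks a question in terms of the dimensions (year, city) and the query aggregates the facts accordingly.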

Why star joins?
By building star joins, the designer creates a structure for efficient access to large volumes of data and natural end-user viewing.
Problem with star joins
To know how to create the star join, the designer must make assumptions about the usage of the data. One department will look at data very differently from another: the star join for finance will be very different from the star join for production, for example.

Star Schema development steps
Choose the process
Choose the grain
Identify the dimensions
Choose the facts

Choose Business Process
Choose the business process for the data mart.
A data mart contains a subset of corporate data to support the requirements of a business unit, e.g. the sales department.
DreamHome processes include:
  Property sales
  Property rentals
  Property viewing
  Property advertising
  Property maintenance
The best choice for a first data mart is sales and finance, where the data source is likely to be accessible and of high quality.

Choose the Grain
Balance meeting business requirements against what is possible given the data source.
The grain determines what the fact table represents, in this case sales facts for each product.
It is best to build the model using the lowest level of detail available.
Only when the grain is chosen can we identify the dimensions.
Time is included as a core dimension and is always present in dimensional models.
The grain decision also determines the grain of the dimension tables, e.g. if the grain is saleFacts for each product, StoreDimension holds the details of the store each sale took place in.

Choose Dimensions
Dimensions are the context for asking questions about the facts.
Identify dimensions in sufficient detail to describe things at the correct grain.
A dimension can be used in more than one dimensional model (data mart); it is then referred to as being conformed.
Conformed dimensions must be exactly the same, or one must be a subset of the other.
This allows individual data marts to combine to form the enterprise data warehouse.

Choose Facts
The grain of the fact table determines the facts to be used.
All facts must be expressed at the level implied by the grain, e.g. if the grain is the sale of individual products, all numerical facts must refer to this particular sale.
Facts must also be numerical and additive (able to be summed across any dimension).
Additional facts can be added to the fact table at any time, provided they are consistent with the grain.
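
Additivity can be illustrated with a small Python sketch. The fact rows, the `total_by` helper and the choice of unit price as a non-additive counter-example are assumptions made for illustration, not part of the slides.

```python
# Hypothetical fact rows at the grain "one row per product sale".
sales = [
    {"store": "London", "product": "A", "quantity": 2, "unit_price": 10.0},
    {"store": "London", "product": "B", "quantity": 1, "unit_price": 30.0},
    {"store": "Glasgow", "product": "A", "quantity": 3, "unit_price": 10.0},
]

# Revenue is stated per sale, so it can be summed across any dimension.
for row in sales:
    row["revenue"] = row["quantity"] * row["unit_price"]


def total_by(rows, dimension, fact):
    """Sum one fact over the rows, grouped by one dimension attribute."""
    totals = {}
    for row in rows:
        totals[row[dimension]] = totals.get(row[dimension], 0.0) + row[fact]
    return totals


revenue_by_store = total_by(sales, "store", "revenue")
revenue_by_product = total_by(sales, "product", "revenue")

# Unit price is not additive: this sum has no business meaning.
price_sum_by_store = total_by(sales, "store", "unit_price")
```

Whichever dimension is used for grouping, the revenue totals add up to the same grand total, which is what makes revenue a usable fact at this grain.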

Data extractions
Data mart software will often include extraction and visualization tools.
These are given the umbrella term OLAP (On-Line Analytical Processing).

OLAP and Data Mining
We will see how a data warehouse could be examined in different dimensions.
In the following week we shall look at Data Mining and begin investigating various data mining models.

DreamHome Star Schema from SQL Server

Dimensional Modelling
Step 1: Select business process
The process (function) refers to the subject matter of a particular data mart. The first data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.

ER model of extended version of DreamHome

ER model of property sales business process

Step 2: Declare grain
Decide what a record of the fact table is to represent.
Identify dimensions of the fact table. The grain decision for the fact table also determines the grain of each dimension table.
Also include time as a core dimension, which is always present in star schemas.

Step 3: Choose dimensions
Dimensions set the context for asking questions about the facts in the fact table. If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other. A dimension used in more than one data mart is referred to as being conformed.
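
The conformance rule can be expressed as a simple check. This sketch assumes that "subset" refers to the rows of the dimension table, which is one possible reading; the `is_conformed` function and the branch data are hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """Return True if the dimensions are identical or one is a subset of the other.

    Each dimension is a dict mapping a surrogate key to a tuple of attributes.
    """
    rows_a = set(dim_a.items())
    rows_b = set(dim_b.items())
    return rows_a <= rows_b or rows_b <= rows_a


# Hypothetical branch dimension as held by different data marts.
sales_branch = {1: ("London", "UK"), 2: ("Glasgow", "UK"), 3: ("Leeds", "UK")}
advertising_branch = {1: ("London", "UK"), 2: ("Glasgow", "UK")}
conflicting_branch = {1: ("London", "UK"), 2: ("Edinburgh", "UK")}

subset_ok = is_conformed(sales_branch, advertising_branch)
conflict_ok = is_conformed(sales_branch, conflicting_branch)
```

The advertising mart's dimension passes because every one of its rows also appears in the sales mart's dimension; the conflicting version fails because key 2 describes a different branch.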

Star schemas for property sales and property advertising

Step 4: Identify facts
The grain of the fact table determines which facts can be used in the data mart. Facts should be numeric and additive.
Unusable facts include:
  non-numeric facts
  non-additive facts
  facts at a different granularity from other facts in the table
Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.
