Data Warehouses Brief Overview Add ETL Copyright © 2011 Curt Hill.


1 Data Warehouses Brief Overview Add ETL Copyright © 2011 Curt Hill

2 Introduction
Data warehouses have much in common with databases. This may include using the same software. What distinguishes them is the main purpose: the database is about transactions, while the data warehouse is about decision support.

3 Example (1)
Consider the operational database for a large retailer. The database is used to monitor day-to-day operations: How many stock items of any type are present? How much was sold yesterday? What will the next payroll need? Quick response and ACID guarantees are very important. It is used by lower management.

4 Example (2)
Then there is the corporate data warehouse for the same retailer. It stores any type of data for a long period of time: tickets from sales, and almost everything the operational database has held at one time, with years of history. Trends are very important. The warehouse is used to guide upper management.

5 Operational Databases
Operational databases have: strict performance requirements, predictable workloads, small units of work, and high utilization. Data warehouses are contrary in every respect.

6 Some definitions
A data warehouse: a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions.
Data warehouses support: OLAP (OnLine Analytical Processing), DSS/EIS (Decision Support Systems or Executive Information Systems), and data mining.

7 Operational Database
The operational database needs to retain very little history. The retailer’s ticket will update stock quantities and generate a credit card request. Once this is successful there is very little need to retain the ticket. Similarly, once a product is no longer stocked, there is no need to retain information on it.

8 Cleaning
Once data is of no value to the operational database it is transferred to the data warehouse. It may need some reformatting. Data is frequently deleted from the operational database, but once in the data warehouse it becomes a permanent addition; hence the expression non-volatile.

9 Data Models
We want to organize the data in the way that will be most useful. One common model is a data cube. Example: region is one dimension, product is a second, and month of sale is the third. Hypercubes contain more dimensions. Prior to data warehouses this was often done with spreadsheets, where dimensionality gets in the way.
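
The cube in this example can be sketched as a mapping from a (region, product, month) coordinate to a total. This is only an illustration under assumed data: the rows, the name `build_cube`, and the choice of units sold as the measure are not from the slides.

```python
from collections import defaultdict

# Hypothetical sales records: (region, product, month, units sold).
sales_rows = [
    ("Reg 1", "P 1", "2007-05", 10),
    ("Reg 1", "P 1", "2007-05", 5),
    ("Reg 1", "P 2", "2007-06", 7),
    ("Reg 2", "P 1", "2007-06", 3),
]


def build_cube(rows):
    """Sum units into one cell per (region, product, month) coordinate."""
    cube = defaultdict(int)
    for region, product, month, units in rows:
        cube[(region, product, month)] += units
    return dict(cube)


cube = build_cube(sales_rows)
print(cube[("Reg 1", "P 1", "2007-05")])  # 15
```

A hypercube would simply use a longer coordinate tuple, for example adding a store or a customer segment.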

10 Figure
[Figure: a data cube whose axes are month (5/2007, 6/2007), region (Reg 1, Reg 2, Reg 3), and product (P 1, P 2, P 3).]

11 Viewing
The cube presents three distinct faces: Product vs. Region, Region vs. Quarter, and Product vs. Quarter. A hypercube would present more. Each of these looks like a spreadsheet display. To pivot or rotate the cube is to present another face.
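
One way to picture a face is as a projection of the cube onto two dimensions, summing over the third. The sketch below assumes a small made-up cube keyed by (region, product, month); `face` and `DIMENSIONS` are illustrative names, and months stand in for the time dimension.

```python
from collections import defaultdict

DIMENSIONS = ("region", "product", "month")

# Hypothetical cube cells keyed by (region, product, month).
cube = {
    ("Reg 1", "P 1", "2007-05"): 15,
    ("Reg 1", "P 1", "2007-06"): 4,
    ("Reg 1", "P 2", "2007-06"): 7,
    ("Reg 2", "P 1", "2007-06"): 3,
}


def face(cube, row_dim, col_dim):
    """Project the cube onto two dimensions, summing over the remaining one."""
    r = DIMENSIONS.index(row_dim)
    c = DIMENSIONS.index(col_dim)
    table = defaultdict(int)
    for coords, value in cube.items():
        table[(coords[r], coords[c])] += value
    return dict(table)


product_by_region = face(cube, "product", "region")
# Pivoting is just asking for a different pair of dimensions.
region_by_month = face(cube, "region", "month")
```

Each result is a two-dimensional table, which is why a face looks like a spreadsheet.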

12 Viewing Again
It is also desirable to condense the dimensions in a roll-up display: condensing days into weeks into months into quarters; condensing single stores into larger and larger groupings; condensing single products into related products by functionality or brand. The opposite of this is the drill-down.
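
A roll-up along the time dimension might look like the following sketch, which condenses months into quarters. The data and the names `quarter_of` and `roll_up` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical monthly totals keyed by (region, month as "YYYY-MM").
monthly = {
    ("Reg 1", "2007-04"): 8,
    ("Reg 1", "2007-05"): 15,
    ("Reg 1", "2007-07"): 2,
    ("Reg 2", "2007-06"): 3,
}


def quarter_of(month):
    """Map 'YYYY-MM' to 'YYYY-Qn'."""
    year, mm = month.split("-")
    return f"{year}-Q{(int(mm) - 1) // 3 + 1}"


def roll_up(cells):
    """Condense the month dimension into quarters by summing."""
    out = defaultdict(int)
    for (region, month), units in cells.items():
        out[(region, quarter_of(month))] += units
    return dict(out)


quarterly = roll_up(monthly)
```

Drill-down goes the other way, so it only works if the finer-grained cells (here `monthly`) are still available.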

13 Operations
What common operations exist besides pivot, roll-up, and drill-down?
Slice and dice: take sectional slices in the hypercube.
Sorting: arrange the data in an order, not necessarily that of the dimensions of the hypercube.
Compute attributes: arithmetic results based on existing values.
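
These operations can be sketched on a small dictionary-based cube. The example follows the common usage in which a slice fixes one dimension and a dice restricts several at once; the slides do not define the two separately, and all data and function names here are made up.

```python
# Hypothetical cube cells: (region, product, month) -> (units, revenue).
cube = {
    ("Reg 1", "P 1", "2007-05"): (15, 150.0),
    ("Reg 1", "P 2", "2007-06"): (7, 140.0),
    ("Reg 2", "P 1", "2007-06"): (3, 33.0),
}


def slice_cube(cube, month):
    """Slice: fix one dimension (here month) to a single value."""
    return {k: v for k, v in cube.items() if k[2] == month}


def dice_cube(cube, regions, products):
    """Dice: keep a sub-cube by restricting several dimensions at once."""
    return {k: v for k, v in cube.items()
            if k[0] in regions and k[1] in products}


june = slice_cube(cube, "2007-06")
sub = dice_cube(cube, {"Reg 1"}, {"P 1", "P 2"})

# Sorting: order cells by units sold rather than by any dimension.
by_units = sorted(cube.items(), key=lambda item: item[1][0], reverse=True)

# Computed attribute: average price per unit, derived from stored values.
avg_price = {k: revenue / units for k, (units, revenue) in cube.items()}
```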

14 Multiple Tables
As you might think, the data cube is more complicated than it first appears. There are two kinds of tables.
Fact table: tuples that have the actual data.
Dimension table: tuples of the attributes, with selection criteria into the fact table.

15 Said another way
The fact table contains the data that is aggregated into the cube entries. It is extracted pretty directly from the operational database, and it may be enormous.
The dimension table contains the selection criteria needed to condense the facts into a cube entry, that is, how the data is summarized into the cube. It is usually display sized, so small to medium.
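
The relationship between the two kinds of tables can be illustrated with an in-memory SQLite database. The table names, columns, and rows below are hypothetical; the point is only that the dimension table supplies the grouping criterion (brand) while the fact table supplies the numbers being summed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
    CREATE TABLE sales_fact (
        product_id INTEGER, store_id INTEGER, sale_date TEXT, units INTEGER);
    INSERT INTO product_dim VALUES
        (1, 'Widget', 'Acme'), (2, 'Gadget', 'Acme'), (3, 'Gizmo', 'Zenith');
    INSERT INTO sales_fact VALUES
        (1, 10, '2007-05-01', 4),
        (2, 10, '2007-05-01', 6),
        (3, 11, '2007-05-02', 5),
        (1, 11, '2007-06-03', 1);
""")

# Condense the facts into one cube entry per brand.
units_by_brand = dict(conn.execute("""
    SELECT d.brand, SUM(f.units)
    FROM sales_fact AS f
    JOIN product_dim AS d ON d.product_id = f.product_id
    GROUP BY d.brand
""").fetchall())
conn.close()
```

In a real warehouse the fact table would hold vastly more rows than the dimension table, which matches the size contrast described above.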

16 Fact Table Granularity
Since the fact table may be enormous, what is the smallest fact worth recording? The minimum in retail may be every item sold in one day in one store, straight off the ticket information. It may also be already aggregated: how many items of this product number sold in one day in one store.
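
The two grains can be compared with a small sketch: the finest grain keeps one fact per ticket line, while the pre-aggregated grain keeps one fact per store, day, and product. The ticket data and the name `daily_facts` are invented for illustration.

```python
from collections import Counter

# Hypothetical ticket lines: (ticket_no, store, date, product_no, quantity).
ticket_lines = [
    (1001, "S1", "2007-05-01", "P1", 2),
    (1002, "S1", "2007-05-01", "P1", 1),
    (1002, "S1", "2007-05-01", "P2", 4),
    (1003, "S2", "2007-05-01", "P1", 3),
]


def daily_facts(lines):
    """Coarser grain: one fact per (store, date, product), not per ticket line."""
    facts = Counter()
    for _ticket, store, date, product, qty in lines:
        facts[(store, date, product)] += qty
    return dict(facts)


facts = daily_facts(ticket_lines)
```

The coarser grain shrinks the fact table but gives up the ability to drill down to individual tickets.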

17 Fact/Dimension Schemas
There are two typical ways to connect the two types of tables. Generally there is only one fact table but several dimension tables.
Star: each dimension is a single table.
Snowflake: each dimension is a hierarchy of tables.

18 Dimension Example
Consider the product dimension. Each tuple specifies a range of products: as few as one, or as many as an entire brand or type. In the star model all the accessible data is in these tuples. In the snowflake model these tuples may reference further tables with more extensive data.
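
The difference between the two models for a product dimension can be sketched in SQLite. All table and column names are hypothetical. In the star version the brand is stored in the dimension row itself; in the snowflake version the row references a separate brand table, so reading the same information takes one more join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: the product dimension is one table holding everything.
    CREATE TABLE star_product (
        product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT);
    INSERT INTO star_product VALUES (1, 'Widget', 'Acme'), (2, 'Gizmo', 'Zenith');

    -- Snowflake: the dimension row references a further table.
    CREATE TABLE brand (brand_id INTEGER PRIMARY KEY, brand TEXT);
    CREATE TABLE snow_product (
        product_id INTEGER PRIMARY KEY, name TEXT,
        brand_id INTEGER REFERENCES brand(brand_id));
    INSERT INTO brand VALUES (100, 'Acme'), (200, 'Zenith');
    INSERT INTO snow_product VALUES (1, 'Widget', 100), (2, 'Gizmo', 200);
""")

star_rows = conn.execute(
    "SELECT product_id, name, brand FROM star_product ORDER BY product_id"
).fetchall()

snow_rows = conn.execute("""
    SELECT p.product_id, p.name, b.brand
    FROM snow_product AS p
    JOIN brand AS b ON b.brand_id = p.brand_id
    ORDER BY p.product_id
""").fetchall()
conn.close()
```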

19 Building the Warehouse
To build a warehouse the following steps are often used: extraction, formatting, cleaning, fitting into the model, and loading.

20 Non Warehouse
We see the same process in moving data from one database to another. The acronym is ETL: Extract, Transform, Load. These moves typically do not need the cleaning step.

21 Extraction
Obtain data from one or more sources. These are often, but not always, one or more operational databases. Any data stream of interest may also be used, for example: sensors at the Large Hadron Collider at CERN, which generate about a terabyte per second and store about 15 petabytes a year; or financial market data.

22 Formatting
There are often multiple sources of data, which means that we have a variety of fields and meanings. Formatting involves: mapping different data sources into a common meaning and format; reconciling different dates, such as the range of the fiscal year; and making the data conform to the table formats required, so that every field has the same meaning and units.
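
As an illustration of mapping sources into a common format, the sketch below assumes two made-up sources that name their fields differently, write dates differently, and record money in different units. The field names and the target layout are assumptions, not something the slides specify.

```python
from datetime import datetime

# Two hypothetical sources that describe the same kind of sale differently.
source_a = {"sku": "P1", "sold_on": "05/01/2007", "amount_cents": 1999}
source_b = {"product": "P2", "date": "2007-05-02", "amount_dollars": 5.5}


def from_a(rec):
    """Source A: US-style date, money already in cents."""
    return {
        "product_id": rec["sku"],
        "sale_date": datetime.strptime(rec["sold_on"], "%m/%d/%Y").date().isoformat(),
        "amount_cents": rec["amount_cents"],
    }


def from_b(rec):
    """Source B: ISO date, money in dollars, converted to cents."""
    return {
        "product_id": rec["product"],
        "sale_date": datetime.strptime(rec["date"], "%Y-%m-%d").date().isoformat(),
        "amount_cents": round(rec["amount_dollars"] * 100),
    }


# After formatting, every record has the same fields, meaning, and units.
unified = [from_a(source_a), from_b(source_b)]
```

One mapping function per source keeps each source's quirks in one place.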

23 Cleaning
The data must be checked for validity before entering the warehouse. This is the most labor-intensive portion of the build. The size of the incoming data requires an automated approach, and each data source may require a different approach. Backflushing: returning cleaned data to the original sources so they can update their own tables.
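
An automated validity check could take roughly the following shape. The rules here (a non-empty product id, a positive integer count, a real calendar date) are assumptions chosen for the example; actual rules depend on each data source.

```python
from datetime import date


def problems(rec):
    """Return a list of validity problems; an empty list means the record is clean."""
    found = []
    if not rec.get("product_id"):
        found.append("missing product_id")
    if not isinstance(rec.get("units"), int) or rec["units"] <= 0:
        found.append("units must be a positive integer")
    try:
        date.fromisoformat(rec.get("sale_date", ""))
    except (TypeError, ValueError):
        found.append("bad sale_date")
    return found


# Hypothetical incoming records, two of them invalid.
incoming = [
    {"product_id": "P1", "units": 3, "sale_date": "2007-05-01"},
    {"product_id": "", "units": 2, "sale_date": "2007-05-01"},
    {"product_id": "P2", "units": -1, "sale_date": "2007-02-30"},
]

clean = [r for r in incoming if not problems(r)]
rejected = [(r, problems(r)) for r in incoming if problems(r)]
```

Keeping the reasons alongside each rejected record is what would make backflushing possible: the source can be told what to fix.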

24 Fitting
Putting the data into a form suitable for the data model of the warehouse. It is usually converted from the form of the source database into the cube or hypercube model of the warehouse.

25 Loading
Insert into the warehouse. The ability to check that the load completed properly is needed. The ability to remove incomplete loads and try again is also required.
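
One possible way to meet both requirements is to load each batch inside a single transaction and tag its rows with a batch id. The sketch below uses Python's standard `sqlite3` module; the schema, the `batch_id` column, and the function `load_batch` are assumptions for illustration, not the method the slides prescribe.

```python
import sqlite3


def load_batch(conn, batch_id, rows):
    """Insert rows in one transaction; roll back everything if any row fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany(
                "INSERT INTO sales_fact (batch_id, product_id, units) VALUES (?, ?, ?)",
                [(batch_id, product, units) for product, units in rows],
            )
    except sqlite3.Error:
        return False
    # Check that the load completed: the stored row count must match the input.
    loaded = conn.execute(
        "SELECT COUNT(*) FROM sales_fact WHERE batch_id = ?", (batch_id,)
    ).fetchone()[0]
    return loaded == len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (batch_id INTEGER, product_id TEXT NOT NULL, "
    "units INTEGER NOT NULL CHECK (units > 0))"
)

ok = load_batch(conn, 1, [("P1", 3), ("P2", 5)])
bad = load_batch(conn, 2, [("P1", 2), ("P3", -4)])  # second row violates the CHECK
remaining = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Because the failed batch is rolled back as a whole, it leaves no partial rows behind and can simply be retried after the bad input is corrected.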

26 Software
Data warehouses may use a traditional RDBMS or so-called NoSQL database software. The multidimensional hypercube format does not favor a normal DBMS. Once data is loaded it is retained: SQL INSERT, DELETE, and UPDATE statements are never used on data after it is successfully loaded.

27 Warehouse vs. DBMS
Operational databases are crisp and up to date. Warehouses do not need the transactional ACID guarantees of an operational database. Warehouses may also lag operational databases by days to weeks.

28 Knowledge Workers
Knowledge workers have a different skill set than many others.
Business analyst: understands the business processes of the organization.
Programming skills: the organization of complicated queries is often much more than simple SQL, and usually involves considerable programmed search and aggregation.

29 Finally
There is much contrast between an operational database and a data warehouse. The warehouse is used to support managerial decisions, usually at a much higher level than the operational database. There is another presentation on NoSQL databases.

