Typically data is extracted from multiple sources


1 Loading the Warehouse
Typically data is extracted from multiple sources
To update the warehouse periodically, we must receive the changes that have occurred and load these into the warehouse
An initial load of the warehouse: ETL
Periodically update the warehouse: Changes? ETL Refresh?
March 2004 Ron McFadyen

2 ETL
Extract
  Data is obtained from source systems
Transform
  Data is cleansed and transformed
  Correcting values, detecting incorrect/impossible values, …
  Complex fields broken down, standardization of values, …
  Attribute types may vary in source and target
  Data may be aggregated
Load
  New data is loaded into dimensions and fact tables
  Indexes rebuilt
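
The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the field names (`customer`, `state`, `amount`, `qty`) and the in-memory "fact table" are all invented here and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical source rows; field names and values are invented for illustration.
SOURCE_ROWS = [
    {"customer": "Smith, Ann", "state": "mb", "amount": "19.50", "qty": "2"},
    {"customer": "Jones, Bob", "state": "MB ", "amount": "5.00", "qty": "-1"},
    {"customer": "Smith, Ann", "state": "Mb", "amount": "10.00", "qty": "1"},
]

def extract():
    """Extract: obtain rows from the source system (here, an in-memory list)."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: cleanse, split complex fields, standardize, convert types."""
    clean, errors = [], []
    for row in rows:
        qty = int(row["qty"])                      # source text -> target integer
        if qty <= 0:                               # detect an impossible value
            errors.append(row)
            continue
        last, first = [part.strip() for part in row["customer"].split(",")]
        clean.append({
            "first_name": first,                   # complex field broken down
            "last_name": last,
            "state": row["state"].strip().upper(), # standardized value
            "amount": float(row["amount"]),
            "qty": qty,
        })
    return clean, errors

def aggregate(rows):
    """Optional aggregation: total amount per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["last_name"], r["first_name"])] += r["amount"]
    return dict(totals)

def load(rows, fact_table):
    """Load: append the new rows to the target table (a plain list here)."""
    fact_table.extend(rows)
    return len(rows)

fact_table = []
clean, errors = transform(extract())
loaded = load(clean, fact_table)
totals = aggregate(clean)
```

Rows that fail a check are set aside in `errors` rather than silently dropped, so they can be repaired and resubmitted later.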

3 Data capture techniques
Synchronous
  Data is obtained as it is created in the source systems, in real time
Asynchronous
  A delay is present between the time the data changes and the time it is captured in the warehouse
Total Refresh
  A complete table in the warehouse is refreshed from its source
Incremental
  Only changes are acquired and loaded into the warehouse

4 Data capture techniques
Static capture
  Data is acquired by reading the database or files. Subsets may be acquired by filtering
Application assisted
  The application(s) are modified to write changes out to a file/database
Triggered capture
  Triggers in a DBMS are written to capture changes
Replication
  A DBMS replication feature is used to manage changes
Log capture
  The DBMS logging feature is utilized for capturing changes
File comparison
  A prior copy and the current file are compared to find changes
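
Of the techniques listed, file comparison is the easiest to show in isolation. The sketch below assumes each snapshot has already been read into a dictionary keyed by the record's primary key; the function name `compare_snapshots` and the sample records are hypothetical.

```python
def compare_snapshots(prior, current):
    """Compare a prior copy with the current copy and classify the changes.

    Both arguments map a primary key to the record's contents.
    Returns three dicts: inserted, updated, and deleted records.
    """
    inserts = {k: v for k, v in current.items() if k not in prior}
    deletes = {k: v for k, v in prior.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in prior and prior[k] != v}
    return inserts, updates, deletes

# Hypothetical snapshots of a customer file taken on two successive days.
prior_copy = {
    1: {"name": "Ann", "city": "Winnipeg"},
    2: {"name": "Bob", "city": "Brandon"},
    3: {"name": "Cy", "city": "Selkirk"},
}
current_copy = {
    1: {"name": "Ann", "city": "Winnipeg"},   # unchanged
    2: {"name": "Bob", "city": "Winnipeg"},   # updated
    4: {"name": "Dee", "city": "Gimli"},      # inserted
}

inserts, updates, deletes = compare_snapshots(prior_copy, current_copy)
```

Only the three changed records would then be passed on for an incremental load; the unchanged record is ignored.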

5 Data capture process
[Diagram labels: source, Extract, Cleanse, Repair, Transform, errors, Load, target]

6 Star Schema Update
Load Dim 1
Load Dim n
Load Fact table
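
The order shown on this slide, dimensions first and the fact table last, matters because each fact row stores the surrogate keys of its dimension rows. A minimal sketch, assuming two hypothetical dimensions (product and store) held as dictionaries from natural key to surrogate key:

```python
def load_dimension(dim, natural_keys):
    """Assign the next surrogate key to each natural key not yet in the dimension."""
    for nk in natural_keys:
        if nk not in dim:
            dim[nk] = len(dim) + 1
    return dim

def load_facts(source_rows, product_dim, store_dim):
    """Replace natural keys with surrogate keys; dimensions must be loaded first."""
    facts = []
    for row in source_rows:
        facts.append({
            "product_key": product_dim[row["product_code"]],
            "store_key": store_dim[row["store_code"]],
            "sales_amount": row["sales_amount"],
        })
    return facts

# Hypothetical source rows for one load cycle.
source_rows = [
    {"product_code": "P-10", "store_code": "S-A", "sales_amount": 12.0},
    {"product_code": "P-20", "store_code": "S-A", "sales_amount": 7.5},
    {"product_code": "P-10", "store_code": "S-B", "sales_amount": 3.0},
]

# Step 1: load every dimension (Dim 1 ... Dim n).
product_dim = load_dimension({}, [r["product_code"] for r in source_rows])
store_dim = load_dimension({}, [r["store_code"] for r in source_rows])

# Step 2: load the fact table, looking up the surrogate keys just assigned.
facts = load_facts(source_rows, product_dim, store_dim)
```

If the fact load ran first, the lookups in `load_facts` would fail for any product or store not yet present in its dimension.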

7 Dimension table surrogate key management
Figure 16.4 on page 360
Going back in time (page 271)
Late-arriving fact rows
Late-arriving dimension rows
From the source systems we receive data that we should have received some time ago
Technical issues having to do with dimension records being twin-timestamped for contiguous non-overlapping time intervals, placing records in the right partition
For dimensions, you also need to adjust the facts
Subtle point about the ordering of surrogate keys
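
For a late-arriving fact row, the twin timestamps are what let us find the dimension row that was in effect when the transaction actually happened. The sketch below is an assumption-laden illustration, not the procedure from the referenced pages: the dimension layout, the column names `start` and `end`, and the half-open interval convention are all choices made here.

```python
from datetime import date

# Hypothetical customer dimension: two versions of the same customer, each
# twin-timestamped with a half-open interval [start, end). The intervals are
# contiguous and do not overlap.
CUSTOMER_DIM = [
    {"key": 101, "customer_id": "C1", "city": "Winnipeg",
     "start": date(2003, 1, 1), "end": date(2003, 9, 1)},
    {"key": 245, "customer_id": "C1", "city": "Brandon",
     "start": date(2003, 9, 1), "end": date(9999, 12, 31)},
]

def surrogate_key_for(dim_rows, customer_id, as_of):
    """Return the surrogate key of the dimension row in effect on `as_of`."""
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["start"] <= as_of < row["end"]:
            return row["key"]
    return None  # no matching version; the caller must decide how to handle it

# A fact dated June 2003 that arrives months late still gets the old version.
late_key = surrogate_key_for(CUSTOMER_DIM, "C1", date(2003, 6, 15))
current_key = surrogate_key_for(CUSTOMER_DIM, "C1", date(2004, 3, 1))
boundary_key = surrogate_key_for(CUSTOMER_DIM, "C1", date(2003, 9, 1))
```

Note that the surrogate keys here (101, then 245) carry no guarantee about ordering in time; the lookup relies on the timestamps, not on the key values.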

8 Point-in-time balances
See pages 208+, SQL on page 209
The fact table is given on page 210
There is just one date-related attribute: transaction date key
“the date key is a set of integers running from 1 to N with a meaningful, predictable sequence. We assign consecutive integers to the date surrogate key so that we can physically partition a large fact table based on the date.”
“The date dimension is the only dimension whose surrogate keys have any embedded semi-intelligence”
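
Because the date keys are consecutive integers in calendar order, a balance as of a given day can be computed by summing every transaction whose date key is less than or equal to that day's key. The following sketch is not the SQL from page 209; the account data, the function names, and the choice of 1 January 2004 as key 1 are assumptions made for illustration.

```python
from datetime import date, timedelta

def build_date_keys(start, days):
    """Map each calendar date to a consecutive integer surrogate key, from 1."""
    return {start + timedelta(days=i): i + 1 for i in range(days)}

# Hypothetical date dimension covering 1 Jan 2004 through 31 Mar 2004 (91 days).
date_keys = build_date_keys(date(2004, 1, 1), 91)

# Hypothetical transaction fact rows with a single date-related attribute.
transactions = [
    {"date_key": date_keys[date(2004, 1, 5)], "account": "A1", "amount": 500.0},
    {"date_key": date_keys[date(2004, 2, 10)], "account": "A1", "amount": -120.0},
    {"date_key": date_keys[date(2004, 3, 20)], "account": "A1", "amount": 75.0},
]

def balance_as_of(rows, account, as_of_key):
    """Sum all of the account's transactions on or before the given date key."""
    return sum(r["amount"] for r in rows
               if r["account"] == account and r["date_key"] <= as_of_key)

end_of_feb = balance_as_of(transactions, "A1", date_keys[date(2004, 2, 29)])
end_of_mar = balance_as_of(transactions, "A1", date_keys[date(2004, 3, 31)])
```

The comparison `date_key <= as_of_key` is only meaningful because the keys follow calendar order, which is the "semi-intelligence" the quotation refers to.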

