1 Data Warehouse Chapter 11

2 Multiple Files Problem
- Added complexity of multiple source files
- Start simple
Diagram labels: Multiple source files; Extracted data; Logic to detect correct source

3 Transforming Data from Multiple Files
Diagram label: File

4 Missing Values Problem
Solution:
- Ignore
- Wait
- Mark rows
- Extract when time-stamped
Example (field A): If NULL then Field = 'A'
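The "If NULL then Field = 'A'" rule on this slide can be sketched in a few lines. This is only an illustration: the function name `fill_missing`, the row layout, and the sample rows are hypothetical, and the marker value 'A' is taken from the slide's example.

```python
from typing import Optional

# Assumption: "A" is the placeholder shown in the slide's example rule.
DEFAULT_MARKER = "A"


def fill_missing(value: Optional[str], marker: str = DEFAULT_MARKER) -> str:
    """Return the marker when the source field is NULL (None), else the value."""
    return marker if value is None else value


# Hypothetical extracted rows; the second one has a missing field.
rows = [{"id": 1, "field": "X"}, {"id": 2, "field": None}]

# Build new rows so the extracted data itself is left untouched.
cleaned = [{**row, "field": fill_missing(row["field"])} for row in rows]
```

Marking the row rather than dropping it keeps the record available for later correction, which corresponds to the "Mark rows" option above.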

5 Duplicate Value Problem
Solution:
- SQL self-join techniques
- RDBMS constraint utilities
SELECT ... FROM table_a, table_b WHERE table_a.key(+) = table_b.key
UNION
SELECT ... FROM table_a, table_b WHERE table_a.key = table_b.key(+)
Diagram label: ACME Inc
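One way to apply the self-join technique is to join a table to itself on the column that should be unique. The sketch below uses Python's built-in `sqlite3` module rather than Oracle, so it uses standard `JOIN` syntax instead of the `(+)` operator; the `customer` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "ACME Inc"), (3, "Widget Co")],  # hypothetical data
)

# Self-join: pair each row with any later row that carries the same name.
# The a.id < b.id condition reports each duplicate pair once and never
# pairs a row with itself.
duplicates = conn.execute(
    """
    SELECT a.id, b.id, a.name
    FROM customer a
    JOIN customer b ON a.name = b.name AND a.id < b.id
    ORDER BY a.id, b.id
    """
).fetchall()
```

Here `duplicates` holds one tuple per duplicate pair, which a later step could use to decide which row to keep.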

6 Element Names Problem
Solution:
- CTAS (CREATE TABLE AS SELECT)
- SQL*Loader
Diagram labels: Customer; Client; Contact; Name; Customer

7 Element Meaning Problem
- Avoid misinterpretation
- Complex solution
- Document meaning in metadata
Diagram labels: Customer's name; All customer details; All details except name

8 Input Format Problem
- EBCDIC to ASCII
- Example: "123-73" becomes 12373
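The EBCDIC conversion and the "123-73" to 12373 example might look like the sketch below. The slide does not say which EBCDIC code page applies, so `cp037` (one of the EBCDIC codecs that ships with Python) is an assumption, as is the reading that the hyphen is a separator to be dropped. The function name `ebcdic_to_int` is hypothetical.

```python
def ebcdic_to_int(raw: bytes, encoding: str = "cp037") -> int:
    """Decode an EBCDIC byte field and return it as an integer.

    Assumption: hyphens are formatting separators and carry no sign meaning.
    """
    text = raw.decode(encoding)      # EBCDIC bytes -> Python string
    digits = text.replace("-", "")   # "123-73" -> "12373"
    return int(digits)


# Build a sample EBCDIC field by encoding the slide's example value.
raw_field = "123-73".encode("cp037")
value = ebcdic_to_int(raw_field)
```

A real source file may use a different code page or packed-decimal fields, in which case the decoding step would need to change.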

9 Referential Integrity Problem
Solution:
- SQL anti-join
- Server constraints
- Dedicated tools

Department table:
Department
10
20
30
40

Employee table:
Emp    Name     Department
1099   Smith    10
1269   Jones    20
1234   Doe      50
6787   Harris   60
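An anti-join over the two tables on this slide returns the employees whose department has no matching row. The sketch below is a minimal version using Python's built-in `sqlite3` module; the table and column names (`department`, `employee`, `dept_id`, `emp_id`) are assumptions, while the data values come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY);
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (10), (20), (30), (40);
    INSERT INTO employee VALUES
        (1099, 'Smith', 10),
        (1269, 'Jones', 20),
        (1234, 'Doe', 50),
        (6787, 'Harris', 60);
    """
)

# Anti-join: keep only employees for which no department row exists.
orphans = conn.execute(
    """
    SELECT e.emp_id, e.name, e.dept_id
    FROM employee e
    WHERE NOT EXISTS (
        SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
    )
    ORDER BY e.emp_id
    """
).fetchall()
```

With the slide's data, the rows for Doe and Harris are the ones returned, because departments 50 and 60 do not exist.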

10 Name and Address Problem
- No unique key
- Missing values
- Personal and commercial names mixed
- Different addresses for same member
- Different names and spelling for same number
- Many names on one line
- One name on two lines

11 Name and Address Problem
Single-field format:
Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565

Multiple-field format:
Name     Mr.J.Smith
Street   100 Main St.
Town     Bigtown
County   County Luth
Code     23565
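Converting the single-field format into the multiple-field format can be sketched as a split on commas. This only works for well-formed lines like the slide's example; the function name `split_address` and the lowercase field names are hypothetical, and real name and address data usually needs the more sophisticated tools the deck mentions.

```python
FIELDS = ("name", "street", "town", "county", "code")


def split_address(line: str) -> dict:
    """Split a comma-separated address line into named fields.

    Assumption: exactly five comma-separated parts, in the order of FIELDS.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


record = split_address("Mr.J.Smith, 100 Main St., Bigtown, County Luth, 23565")
```

Lines that do not have five parts raise `ValueError`, so they can be routed to an exception process instead of being loaded with misaligned fields.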

12 Clean and Organize
1. Create atomic values.
2. Standardize formats.
3. Verify data accuracy.
4. Match with other records.
5. Identify private and commercial addresses and inhabitants.
6. Document in metadata.
Requires sophisticated tools and techniques.

13 Merging Data
- Operational transactions do not usually map one-to-one with warehouse data.
- Data for the warehouse is merged to provide information for analysis.

Sale     1/2/98 12:00:01   Ham Pizza       $10.00
Sale     1/2/98 12:00:02   Cheese Pizza    $15.00
Sale     1/2/98 12:00:02   Anchovy Pizza   $12.00
Return   1/2/98 12:00:03   Ham Pizza      -$12.00
Sale     1/2/98 12:00:04   Sausage Pizza   $11.00

Pizza sales/returns by day, hour, seconds

14 Merging Data
Sale     1/2/98 12:00:01   Ham Pizza       $10.00
Sale     1/2/98 12:00:02   Cheese Pizza    $15.00
Sale     1/2/98 12:00:02   Anchovy Pizza   $12.00
Return   1/2/98 12:00:03   Ham Pizza      -$12.00
Sale     1/2/98 12:00:01   Ham Pizza       $10.00
Sale     1/2/98 12:00:02   Cheese Pizza    $10.00
Sale     1/2/98 12:00:04   Sausage Pizza   $11.00

15 Adding a Date Stamp
- Enables time analysis
- Label loaded data with a date stamp
- Add time to fact and dimension data

16 Adding a Date Stamp
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Store Table: Store_id, District_id, Time_key
Item Table: Item_id, Dept_id, Time_key
Time Table: Week_id, Period_id, Year_id, Time_key
Product Table: Product_id, Time_key, Product_desc

17 Adding a Date Stamp
Fact table:
- Add triggers
- Recode applications
- Compare tables
Dimension table
Time representation:
- Point in time
- Time span
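Labeling loaded data with a date stamp can be as simple as attaching the load date to every row before it is written to the warehouse. The sketch below assumes rows are dictionaries; the function name `add_date_stamp`, the `load_date` column, and the sample row are hypothetical and are not the `Time_key` design shown in the schema above.

```python
from datetime import date


def add_date_stamp(rows, load_date):
    """Return copies of the rows with an ISO-formatted load date attached."""
    stamp = load_date.isoformat()
    return [{**row, "load_date": stamp} for row in rows]


# Hypothetical fact row and load date.
facts = [{"item_id": 1, "store_id": 7, "sales_dollars": 10.0}]
stamped = add_date_stamp(facts, date(1998, 1, 2))
```

This records a point in time; representing a time span would need a start and an end value per row instead of a single stamp.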

18 Adding Keys to Data
#1     Sale   1/2/98 12:00:01   Ham Pizza       $10.00
#2     Sale   1/2/98 12:00:02   Cheese Pizza    $15.00
#3     Sale   1/2/98 12:00:02   Anchovy Pizza   $12.00
#4     Sale   1/2/98 12:00:03   Ham Pizza      -$12.00
#dw1   Sale   1/2/98 12:00:01   Ham Pizza       $10.00
#dw2   Sale   1/2/98 12:00:02   Cheese Pizza    $10.00
#dw3   Sale   1/2/98 12:00:04   Sausage Pizza   $11.00
#5     Sale   1/2/98 12:00:04   Sausage Pizza   $11.00
Data values or artificial keys
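Artificial keys such as #dw1, #dw2, #dw3 can be generated with a simple counter as rows enter the warehouse. The sketch below is illustrative only: the function name `assign_surrogate_keys`, the `key` column, and the dictionary layout are assumptions, while the three sample rows mirror the #dw rows on the slide.

```python
from itertools import count


def assign_surrogate_keys(rows, prefix="dw", start=1):
    """Return copies of the rows with a sequential artificial key added."""
    counter = count(start)
    return [{"key": f"#{prefix}{next(counter)}", **row} for row in rows]


warehouse_rows = [
    {"type": "Sale", "time": "1/2/98 12:00:01", "item": "Ham Pizza", "amount": "10.00"},
    {"type": "Sale", "time": "1/2/98 12:00:02", "item": "Cheese Pizza", "amount": "10.00"},
    {"type": "Sale", "time": "1/2/98 12:00:04", "item": "Sausage Pizza", "amount": "11.00"},
]
keyed = assign_surrogate_keys(warehouse_rows)
```

An artificial key stays stable even when the data values that might otherwise serve as a key (time, item, amount) are not unique.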

19 Summarizing Data
- During extraction on staging area
- After loading onto the warehouse server
Diagram labels: Operational databases; Staging area; Warehouse database
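Summarization, wherever it runs, amounts to grouping detail rows and aggregating them. The sketch below totals the pizza transactions from slide 13 by day and transaction type; the tuple layout and the function name `summarize_by_day` are hypothetical, and `Decimal` is used so that currency amounts are not subject to floating-point rounding.

```python
from collections import defaultdict
from decimal import Decimal

# Detail rows from slide 13: (type, day, time, item, amount).
transactions = [
    ("Sale", "1/2/98", "12:00:01", "Ham Pizza", Decimal("10.00")),
    ("Sale", "1/2/98", "12:00:02", "Cheese Pizza", Decimal("15.00")),
    ("Sale", "1/2/98", "12:00:02", "Anchovy Pizza", Decimal("12.00")),
    ("Return", "1/2/98", "12:00:03", "Ham Pizza", Decimal("-12.00")),
    ("Sale", "1/2/98", "12:00:04", "Sausage Pizza", Decimal("11.00")),
]


def summarize_by_day(rows):
    """Total the amounts per (day, transaction type)."""
    totals = defaultdict(Decimal)  # missing keys start at Decimal("0")
    for kind, day, _time, _item, amount in rows:
        totals[(day, kind)] += amount
    return dict(totals)


summary = summarize_by_day(transactions)
```

Grouping by a coarser or finer key (hour, item, store) changes the grain of the summary without changing the approach.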

20 Maintaining Transformation Metadata
Contains transformation rules, algorithms, and routines.
Diagram labels: Sources; Stages; Rules; Publish; Extract; Transform; Load; Query

21 Transformation Timing and Location
Transformation is performed:
- Before load
- In parallel
May be initiated at different points.
Diagram labels: Unlikely; Probable; Possible

22 Choosing a Transformation Point
* Workload
* Network bandwidth
* Environment
* Parallel execution
* CPU use
* Load window time
* Disk space
* User information needs

23 Monitoring and Tracking
Transformations should:
- Be self-documenting
- Provide summary statistics
- Handle process exceptions
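A transformation step that reports summary statistics and handles exceptions could be structured like the sketch below. It is one possible shape, not a prescribed design: the function name `run_transform`, the statistic names, and the choice to catch only `ValueError` and `KeyError` are all assumptions.

```python
def run_transform(rows, transform):
    """Apply transform to each row, collecting rejects and summary statistics."""
    stats = {"read": 0, "loaded": 0, "rejected": 0}
    loaded, rejects = [], []
    for row in rows:
        stats["read"] += 1
        try:
            loaded.append(transform(row))
            stats["loaded"] += 1
        except (ValueError, KeyError) as exc:
            # Keep the bad row and the reason so it can be reviewed later.
            rejects.append((row, str(exc)))
            stats["rejected"] += 1
    return loaded, rejects, stats


# Hypothetical usage: convert text fields to integers; "x" cannot be converted.
loaded, rejects, stats = run_transform(["1", "x", "3"], int)
```

Returning the rejected rows alongside the counts lets the load continue while still leaving a record of what failed and why.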

24 Designing Transformation Processes
Analysis:
- Sources and target mappings, business rules
- Key users, metadata, grain
Design options: PL/SQL, replication, custom, third-party tools
Design issues:
- Performance
- Size of the staging area
- Exception handling, integrity maintenance

25 Transformation Tools
- Purchased
- SQL*Loader
- In-house developed

26 Data Management, Quality, and Auditing Tools
Data management:
- Innovative Systems
- Postalsoft
- Vality Technology
Data quality and auditing:
- Innovative Systems
- Vality Technology

27 Summary
This lesson discussed the following topics:
- Importance of data quality
- Transformation processes
- Data transformation issues
- Data anomalies
- Name and address management
- Tools

