Data Warehouse DATA TRANSFORMATION.

1 Data Warehouse DATA TRANSFORMATION

2 Extract Transform Insert
Extract data from operational system, transform and insert into data warehouse. Why ETI? Will your warehouse produce correct information with the current data? How can I ensure warehouse credibility?

3 Excuses for NOT Transforming Legacy Data
Old data works fine, new will work as well. Data will be fixed at point of entry through GUI. If needed, data will be cleaned after new system populated; After proof-of-concept pilot. Keys join the data most of the time. Users will not agree to modifying or standardizing their data.

4 Levels of Migration Problem
Existing metadata is insufficient and unreliable: metadata must hold for all occurrences; metadata must represent business and technical attributes. Data values incorrectly typed and accessible: value's form extracted from storage; value's meaning inferred from its content. Entity keys unreliable or unavailable: inferred from related values.

5 Metadata Challenge
Metadata gets out of sync with the details it summarizes: business grows faster than the systems designed to capture business info. Not at the right level of detail: multiple values in a single field; multiple meanings to a single field; no fixed format for value. Expressed in awkward or limited terms: program/compiler view rather than business view.

6 Character-level Challenge
Value instance level: spelling, aliases; abbreviations, truncations, transpositions; inconsistent storage formats. Named type level: multiple meanings, contextual meanings; synonyms, homonyms. Entity level: no common keys or representation; no integrated view across records, files, systems.

7 Some Data Quality examples
The magic shrinking vendor file; 127 ways to spell...; Data surprises in individual fields; Cowbirds and data fields; Magic numbers and embedded intelligence.

8 The Magic Shrinking Vendor File
A Medical claims processor was having trouble with their Insurance Vendor file. They thought they had 300,000 Insurance Vendors. When they cleaned up their data, they discovered they had only 27,000 unique Insurance Vendors.

9 127 ways to spell...
Have over 127 different ways to spell AT&T
Have over 1000 ways to spell duPont

10 Data surprises in individual fields
NAME SOC. SEC. # TELEPHONE Source: Vality

11 Data surprises in individual fields
Meta NAME SOC. SEC. # TELEPHONE Source: Vality

12 Data surprises in individual fields
Meta NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K Actual Data Values Source: Vality

13 Data surprises in individual fields
Meta NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K LN#12-756 Actual Data Values Source: Vality

14 Data surprises in individual fields
Meta NAME SOC. SEC. # TELEPHONE Denise Mario DBA Marc Di Lorenzo ETAL Tom & Mary Roberts First Natl Provident Digital 15 State St. Astorial Fedrl Savings Kevin Cooke, Receiver John Doe Trustee for K LN#12-756 FAX 5436 Actual Data Values Source: Vality
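
Surprises like these can often be surfaced by profiling each field against the format its metadata promises. The sketch below is one way to do that, not the method used in the presentation: the two regular expressions, the role-word list and the conforming sample values are assumptions made for illustration, while the offending values come from the slide.

```python
import re

# Hypothetical format rules for the typed columns; real rules depend on the source system.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")
PHONE_PATTERN = re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")
# Hypothetical list of role words that suggest extra information buried in a name field.
ROLE_WORDS = ("DBA", "ETAL", "TRUSTEE", "RECEIVER")


def profile_field(values, pattern):
    """Split values into those matching the expected format and those that do not."""
    conforming, surprises = [], []
    for value in values:
        target = conforming if pattern.fullmatch(value.strip()) else surprises
        target.append(value)
    return conforming, surprises


def embedded_roles(name):
    """Return the role words found as whole tokens in a name value."""
    tokens = name.upper().replace(",", " ").split()
    return [word for word in ROLE_WORDS if word in tokens]


# "123-45-6789" and "555-123-4567" are made-up conforming values.
ssn_ok, ssn_bad = profile_field(["123-45-6789", "LN#12-756"], SSN_PATTERN)
phone_ok, phone_bad = profile_field(["555-123-4567", "FAX 5436"], PHONE_PATTERN)
print(ssn_bad, phone_bad)
print(embedded_roles("Kevin Cooke, Receiver"))
```

Running the profile before writing any transformation rules shows which fields carry values their metadata does not describe.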

15 Cowbirds and Data Fields
Cowbirds lay their eggs in other birds' nests. Users put data fields that are not otherwise used to other purposes.

16 Magic Numbers and Embedded Intelligence
Customer Number = XXXX-YY-ZZ XXXX = 1st 4 Positions of Zip Code If YY = Then Cust = Pharmacy If YY = Then Cust = Hospital Except if YY = 82 and ZZ = ** Which Means...

17 Orr's Laws of Data Quality
Law #1 - “Data that is not used cannot be correct!” Law #2 - “Data quality is a function of its use, not its collection!” Law #3 - “Data will be no better than its most stringent use!” Law #4 - “Data quality problems increase with the age of the system!” Law #5 - “Data quality laws apply equally to meta-data!” Law #6 - The less likely something is to occur, the more traumatic it will be when it happens!

18 Legacy Data Contaminants Found in Migrations
Lack of standards; Data surprises in individual fields; Legacy information buried in free form fields; Legacy myopia: multiple account numbers block consolidated view; Anomaly nightmare: complex matching and consolidation.

19 4 Fundamental Types of Transformation
Simple Transformation: fundamental building blocks of all data transformations; one field at a time. Cleansing and Scrubbing: ensures consistent formatting and usage of a field or related group of fields; checks valid values.

20 4 Fundamental Types of Transformation (cont'd)
Integration: takes operational data from one or more sources and maps it, field by field, to a new data structure. Aggregation and Summarization: remove low level of detail; data for data mart.

21 Simple Transformation
Convert data element from one type to another (semantic value stays the same). Rename elements. Date/time conversion to standard warehouse format. Decode encoded fields: M F vs C S MM.
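
The four simple transformations on this slide (type conversion, renaming, date conversion, decoding) each act on one field at a time. The following is a minimal sketch under stated assumptions: the source field names, the rename map, the gender code table and the choice of ISO 8601 as the "standard warehouse format" are all hypothetical.

```python
from datetime import datetime

# Hypothetical source-to-warehouse mappings, invented for illustration.
RENAMES = {"cust_nm": "customer_name", "sex_cd": "gender",
           "ord_dt": "order_date", "amt": "amount"}
GENDER_CODES = {"M": "Male", "F": "Female"}


def simple_transform(record):
    """Apply field-at-a-time transformations to one source record."""
    # Rename elements to their warehouse names.
    out = {RENAMES.get(key, key): value for key, value in record.items()}
    # Type conversion: the semantic value is unchanged, only the type differs.
    out["amount"] = float(out["amount"])
    # Date conversion into one standard format (ISO 8601 is assumed here).
    parsed = datetime.strptime(out["order_date"], "%m/%d/%Y")
    out["order_date"] = parsed.date().isoformat()
    # Decode an encoded field; unknown codes stay visible instead of being dropped.
    code = out["gender"]
    out["gender"] = GENDER_CODES.get(code, "Unknown(" + code + ")")
    return out


row = simple_transform({"cust_nm": "Mary Roberts", "sex_cd": "F",
                        "ord_dt": "03/15/1997", "amt": "120.50"})
print(row)
```

Keeping unknown codes visible, rather than mapping them to a default, makes unanticipated values easier to spot later.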

22 Cleansing and Scrubbing
Actual content examined: range checking, enumerated lists, dependency checking. Uniform representation for the DW: address information parsed to components.
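
The three content checks named here, plus address parsing, might look like the sketch below. It is illustrative only: the field names, the age range, the state list, the dependency rule and the very small address grammar are assumptions, and real address parsing needs far more cases.

```python
import re

VALID_STATES = {"MA", "NY", "CA"}  # hypothetical enumerated list

# Hypothetical, deliberately tiny grammar: "<number> <street name> <suffix>".
ADDRESS = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<street>.+?)\s+(?P<suffix>St|Ave|Rd|Blvd)\.?\s*$",
    re.IGNORECASE,
)


def check_record(rec):
    """Return a list of rule violations for one record."""
    errors = []
    # Range check.
    if not 0 <= rec["age"] <= 120:
        errors.append("age out of range")
    # Enumerated-list check.
    if rec["state"] not in VALID_STATES:
        errors.append("state not in list")
    # Dependency check: one field's value constrains another field.
    if rec["status"] == "CLOSED" and not rec.get("close_date"):
        errors.append("closed account missing close_date")
    return errors


def parse_address(line):
    """Parse a street address into components, or return None if it does not fit."""
    match = ADDRESS.match(line)
    if match is None:
        return None
    return {
        "number": match.group("number"),
        "street": match.group("street").title(),
        "suffix": match.group("suffix").title(),
    }


print(check_record({"age": 130, "state": "ZZ", "status": "CLOSED", "close_date": None}))
print(parse_address("15 State St."))
```

Returning None for an address that does not fit the grammar routes it to review instead of forcing a wrong parse.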

23 Integration
Simple field level mappings: 80-90%. Complex integration: no common identifier (probable matches; 2-stage process, isolation/reconciliation); multiple sources for same target element (contradictory); missing data; derived/calculated data (redundant?).
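
When records share no common identifier, probable matches can be found in two stages. The sketch below is one possible reading of isolation/reconciliation, not the presentation's own algorithm: isolation is done by blocking on the first standardized token, and reconciliation by a string-similarity threshold using Python's `difflib`. The abbreviation table, the 0.85 threshold and the second sample record are assumptions.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

ABBREVIATIONS = {"natl": "national", "fedrl": "federal"}  # hypothetical table


def standardize(name):
    """Lowercase, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def probable_matches(names, threshold=0.85):
    """Return index pairs of names that probably refer to the same entity."""
    # Stage 1, isolation: only records sharing a first token are compared.
    blocks = defaultdict(list)
    for index, name in enumerate(names):
        key = standardize(name)
        blocks[key.split(" ")[0]].append((index, key))
    # Stage 2, reconciliation: score candidate pairs inside each block.
    pairs = []
    for members in blocks.values():
        for (i, a), (j, b) in combinations(members, 2):
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


records = [
    "First Natl Provident",
    "First National Provident Inc",  # made-up variant for illustration
    "Astorial Fedrl Savings",
]
print(probable_matches(records))
```

Blocking keeps the number of comparisons manageable on large files, at the cost of missing matches whose first token differs.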

24 Aggregation and Summarization
Summarization is the addition of like values along one or more business dimensions: add daily sales by stores for monthly sales by region. Aggregation is the addition of different business elements into a common total: daily product sales plus monthly consulting sales give monthly combined sales amount. Details of process available in metadata.
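
The two examples on this slide can be sketched in a few lines. The store-to-region map and the sample figures are hypothetical; the point is only that summarization rolls like values up along dimensions (day to month, store to region), while aggregation adds different business elements into one total.

```python
from collections import defaultdict

STORE_REGION = {"S1": "East", "S2": "East", "S3": "West"}  # hypothetical mapping


def summarize_monthly_by_region(daily_sales):
    """Summarization: roll daily store sales up to (month, region) totals."""
    totals = defaultdict(float)
    for iso_date, store, amount in daily_sales:
        month = iso_date[:7]  # "YYYY-MM" prefix of an ISO date
        totals[(month, STORE_REGION[store])] += amount
    return dict(totals)


def aggregate_combined(product_by_month, consulting_by_month):
    """Aggregation: add two different business elements into one monthly total."""
    months = sorted(set(product_by_month) | set(consulting_by_month))
    return {
        month: product_by_month.get(month, 0.0) + consulting_by_month.get(month, 0.0)
        for month in months
    }


daily = [
    ("1997-03-01", "S1", 100.0),
    ("1997-03-02", "S2", 50.0),
    ("1997-03-02", "S3", 70.0),
    ("1997-04-01", "S1", 30.0),
]
by_region = summarize_monthly_by_region(daily)

# Collapse regions to get monthly product sales, then add consulting sales.
product_monthly = defaultdict(float)
for (month, _region), amount in by_region.items():
    product_monthly[month] += amount
combined = aggregate_combined(dict(product_monthly), {"1997-03": 500.0})
print(by_region)
print(combined)
```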

25 Data Re-engineering Problem
Programming for the unknown: unanticipated values, structures and patterns. Programming for noise and uncertainty: conflicting and missing values. Programming for productivity and efficiency: changing data values, changing user requirements; high volumes, non-linear searches. Conventional data transformation methods do not solve the metadata and data value challenges; data re-engineering is needed. (Stephen Brown, Vality Corp.)

26 Data Re-engineering Process
External Files, Legacy Applications, Historical Extracts. Customer Information Systems, Data Warehouses, Client/server Applications, Consolidations. Data Investigation and Metadata Mining, Data Standardization, Data Integration, Data Survivorship and Formatting.

27 Natural Laws of Data Re-engineering
Data has no standard: you can’t predict or legislate format or content. Data will evolve faster than its capture and storage systems: you can’t write rules for what you don’t know and can’t see. Instructions for handling data are within the data: don’t trust the metadata, make the data reveal itself. Revealed metadata is knowledge about the business: it validates warehouse design, supports conversion project management, and is insurance against misinformation.

28 Buy tool or manually code programs ?

29 3 - DW Tools
Technologies: 2nd Generation ETL Suites / Environments Repositories DB & System Monitors Meta Data Browsers Data Visualization Data Mining DB Design Job Schedulers EIS MOLAP/ROLAP/LowLAP CASE Replication/Distribution Tools Q&R/MQE/MRE 1st Generation ETL RDBMS Utilities Universal Repositories
Processes: Design Mapping Extract Scrub Transform Load Index Aggregation Replication Data Set Distribution Access & Analysis Resource Scheduling & Distribution Meta Data System Monitoring

30 Transformation Choosing between tool and manually coded programs
Time frames: tools take longer (select, configure, learn). Budgets: short term or long term. Size of warehouse: is the initial project small enough for coding? Size and skills of warehouse team. A tool automatically generates and maintains metadata.

31 Hand Generated Code
Upside: No learning curve; Inherent skills; In house capabilities; Usually simple; No culture change/mandate (CASE). Downside: Manual meta data; Maintenance challenge when talent level changes; No automation.

32 Tools Upside
Easy to maintain as talent level changes; Automatic meta data; May gain efficiencies; Integration with repositories; Integration with other tools: schedulers, monitors, meta data management.

33 Tools Downside
Cost (1st generation tools very high $); Learning curve; Enforced culture change; Must use tool for all changes; Speed, may be slower to implement; May require additional resources.

34 Manual Code / 1st Generation ETL Tools Process
Diagram: Source OLTP Systems (Source Mainframe or C/S System) feed separate programs (Extract Program, Transform Program, File Transfer Program, File Load Program, Index Program, Aggregation Program) that populate the Data Warehouse (Client/Server System). Job Scheduling and Control is external; Meta Data Load/Maintenance is external. Copyright © 1997, Enterprise Group, Ltd.

35 2nd Generation ETL Tools Process
Diagram: Source OLTP Systems (Source Mainframe or C/S System) feed a Transformation Engine (C/S System) that populates the Data Warehouse or Data Mart (C/S System). The Transformation Engine covers Monitoring, Scheduling, Extraction, Scrubbing, Transformation, Load, Index, Aggregation, Meta Data Load, Meta Data Maint. and Caching. Copyright © 1997, Enterprise Group, Ltd.

36 2nd Generation ETL Environment Process
Diagram: Source OLTP Systems (Source Mainframe or C/S System) feed a Transformation Engine (C/S System) that populates several Data Warehouses and Data Marts (C/S Systems). The Transformation Engine covers Monitoring, Scheduling, Extraction, Scrubbing, Transformation, Load, Index, Aggregation, Meta Data Load, Meta Data Maint. and Caching. Enterprise Meta Data and a Request Broker serve the User Process: Surf Meta Data, Request Resource, Schedule Delivery. Copyright © 1997, Enterprise Group, Ltd.

37 1st Generation ETL Tools Hampered by:
High cost (average deal prices in the $ k range) Long learning curves Perceived value (most teams felt they could write better code) Cultural challenges (like a CASE tool, the team must use the code generator for all creation and changes, no matter how minor) Core capabilities (complex transformations still required manual code) Management requirements (users still had to manage all the programs generated) Performance issues (the resulting programs could not leverage parallelism)

38 Important 2nd Generation ETL tool features:
Transformation engine design; Ability to leverage parallel server technology; CDC (Change Data Capture, which allows only the new data to be extracted); Incremental aggregation (ability to add CDC incremental data to existing aggregations); Limited or no use of temporary files or data base tables (virtual caching only); Common, open and extensible meta data repository; Enterprise scalability.
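
Change data capture and incremental aggregation work together: only rows changed since the last run are extracted, and their amounts are added to totals that already exist. The sketch below assumes a timestamp (high-water mark) approach, which is only one of several CDC techniques; commercial tools may instead read database logs or use triggers. The row layout and field names are hypothetical.

```python
def extract_changes(rows, last_watermark):
    """Return rows modified after the watermark, plus the new watermark."""
    # ISO 8601 timestamps in one fixed format compare correctly as strings.
    changed = [row for row in rows if row["modified"] > last_watermark]
    new_watermark = max((row["modified"] for row in changed), default=last_watermark)
    return changed, new_watermark


def apply_increment(aggregate, changed_rows):
    """Add newly captured rows to an existing per-region total."""
    for row in changed_rows:
        aggregate[row["region"]] = aggregate.get(row["region"], 0.0) + row["amount"]
    return aggregate


source_rows = [
    {"id": 1, "region": "East", "amount": 100.0, "modified": "1997-03-01T10:00"},
    {"id": 2, "region": "West", "amount": 40.0, "modified": "1997-03-02T09:00"},
    {"id": 3, "region": "East", "amount": 25.0, "modified": "1997-03-03T08:00"},
]

# State left by the previous load: row 1 is already in the aggregate.
totals = {"East": 100.0}
watermark = "1997-03-01T10:00"

changed, watermark = extract_changes(source_rows, watermark)
totals = apply_increment(totals, changed)
print(changed, watermark, totals)
```

Simple addition like this is only safe for newly inserted rows. Updated or deleted source rows would need the old value subtracted first, which is why the capture step usually has to record the kind of change as well.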

39 Important 2nd Generation ETL tool features:
Common UI (User Interface) across all tools; Extensive selection of transformation algorithms; Easily extensible scrub and transform algorithm library; Extensive heterogeneous source and target support; Native OLAP data set target support; System monitoring & management; Enterprise meta data repository (content, resources, structure, etc.); Transform once, populate many (populate multiple targets with a single transformation output).

40 Important 2nd Generation ETL tool features:
Integrated enterprise scale scrubbing capabilities; Seamless interoperability with external point solution tools; Integrated information access, analysis, scheduling and delivery; Aggregate aware information request broker (enables virtual data warehouse); Ad hoc aggregation monitoring and management; Pipeline parallelism / very high throughput; Native drivers (source and target).

41 OLTP <> OLAP
OLTP: normalized. OLAP: redundant data; tools must provide multidimensional conceptual view of data (Providing OLAP to User Analysts, E.F. Codd).

42 Multidimensional Model
Data stored as facts and dimensions Sales Fact Cube
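
A fact row holds measures plus keys into the dimensions, and a dimension row holds the descriptive attributes used for grouping. The sketch below shows that split with plain Python structures; the dimension contents, key names and figures are all hypothetical, and a real warehouse would hold these as database tables.

```python
# Dimension tables: surrogate key -> descriptive attributes (hypothetical data).
PRODUCT_DIM = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Manual", "category": "Books"},
}
STORE_DIM = {
    10: {"city": "Boston", "region": "East"},
    11: {"city": "Albany", "region": "East"},
    12: {"city": "Fresno", "region": "West"},
}

# Fact table: one row per sale, holding dimension keys and numeric measures.
SALES_FACT = [
    {"product_id": 1, "store_id": 10, "units": 3, "revenue": 30.0},
    {"product_id": 2, "store_id": 11, "units": 1, "revenue": 15.0},
    {"product_id": 1, "store_id": 12, "units": 2, "revenue": 20.0},
]


def roll_up(facts, dimension, key, attribute, measure):
    """Total a measure over the facts, grouped by one dimension attribute."""
    totals = {}
    for fact in facts:
        label = dimension[fact[key]][attribute]
        totals[label] = totals.get(label, 0) + fact[measure]
    return totals


revenue_by_region = roll_up(SALES_FACT, STORE_DIM, "store_id", "region", "revenue")
units_by_category = roll_up(SALES_FACT, PRODUCT_DIM, "product_id", "category", "units")
print(revenue_by_region)
print(units_by_category)
```

The same fact rows answer both questions; only the dimension and attribute chosen for grouping change.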

