Taming Data Logistics: The Hardest Part of Data Science


1 Taming Data Logistics: The Hardest Part of Data Science
Ken Farmer, February 1, 2011

2 Data Logistics is the Management of Data in Motion
While handling problems. And all of this for many, many feeds.

3 Top 3 Data Logistics Problems
Dependability problems; productivity problems; data quality problems

4 Data Logistics is Surprisingly Difficult
It's not sexy; lack of dramatic improvement; few best practices; non-intuitive challenges; little-known tools & methods; tools != methods

5 Data Science is not alone
Similar activities; similar technologies; similar challenges; similar results. Different heritages: corporate DBA heritage vs academic CS heritage. Nothing on the Data Science side maps to ETL.

6 Top 3 Fundamental Mistakes
Consistency vs adaptability (or incorrect requirements & objectives); non-linear scalability problems (or misunderstood dynamics); magical thinking (or over-estimated capabilities)

7 Architectural Decisions – Buy vs Build
Considerations include: feed complexity & number (see left); developer interest & skills; organizational culture. It's really (buy+build) vs build.
[Chart: risk (vertical axis) vs complexity (horizontal axis)]

8 Architectural Decisions – Scheduling & Control
Spectrum: synchronous steps <-> asynchronous stations. Unit of work: batches; microbatches; streaming.
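The three units of work can be viewed as the same grouping operation with different sizes. The following is a minimal sketch under that assumption; the helper name `microbatches` and the sample feed are hypothetical and do not come from the slides.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def microbatches(records: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` records from a feed."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


feed = range(7)  # stand-in for a feed of 7 records

batch_units = list(microbatches(feed, 7))   # one large batch
micro_units = list(microbatches(feed, 3))   # several small batches
stream_units = list(microbatches(feed, 1))  # record-at-a-time, streaming-like
```

In this view, the choice of unit of work is mostly a choice of how much latency and per-unit overhead each downstream step should see.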

9 Stages
ETL are both activities AND stages. Enables deployment flexibility; enables different tools & technology; adds structure to process.

10 Extract Stage
Get transformation-ready data; changed data capture; minimal transformation; auditing; potential colocation
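The slides do not say how changed data capture is performed. One common approach, shown here only as an illustrative sketch, compares the previous extract snapshot with the current one, keyed by the source key. The function name `capture_changes` and the sample rows are hypothetical.

```python
from typing import Dict, Tuple

Snapshot = Dict[str, tuple]  # source key -> full row as a tuple


def capture_changes(
    previous: Snapshot, current: Snapshot
) -> Tuple[Snapshot, Snapshot, Snapshot]:
    """Return (inserts, updates, deletes) between two keyed snapshots."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    updates = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    deletes = {k: v for k, v in previous.items() if k not in current}
    return inserts, updates, deletes


previous = {"1": ("Ann", "NY"), "2": ("Bob", "LA"), "4": ("Dee", "WA")}
current = {"1": ("Ann", "NY"), "2": ("Bob", "SF"), "3": ("Cy", "TX")}

inserts, updates, deletes = capture_changes(previous, current)
```

Only the changed rows then need to flow on to the transformation stage, which keeps the extract itself close to "minimal transformation".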

11 Transformation Stage
Heavy transformations: lookups; validations; remapping; business rules. Heavy auditing. Post-transform delta-processing.
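A minimal sketch of how remapping, a validation, a lookup, and audit counters might fit together in one transform step. The field names, the lookup table, and the audit counter names are assumptions made up for illustration, not taken from the presentation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical reference data and field mapping, for illustration only.
COUNTRY_LOOKUP = {"US": "United States", "DE": "Germany"}
FIELD_MAP = {"cust": "customer_id", "ctry": "country_code", "amt": "amount"}


def transform(rows: Iterable[Dict]) -> Tuple[List[Dict], List[Dict], Counter]:
    """Remap, validate and enrich rows; return (good, rejected, audit)."""
    audit: Counter = Counter()
    good: List[Dict] = []
    rejected: List[Dict] = []
    for raw in rows:
        audit["rows_in"] += 1
        # Remapping: rename source fields to target names.
        row = {FIELD_MAP.get(k, k): v for k, v in raw.items()}
        # Validation: amount must be convertible to a number.
        try:
            row["amount"] = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            audit["rejected_bad_amount"] += 1
            rejected.append(raw)
            continue
        # Lookup: resolve the country code against reference data.
        country = COUNTRY_LOOKUP.get(row.get("country_code"))
        if country is None:
            audit["rejected_unknown_country"] += 1
            rejected.append(raw)
            continue
        row["country_name"] = country
        good.append(row)
        audit["rows_out"] += 1
    return good, rejected, audit


sample = [
    {"cust": "1", "ctry": "US", "amt": "10.5"},
    {"cust": "2", "ctry": "XX", "amt": "3"},
    {"cust": "3", "ctry": "DE", "amt": "abc"},
]
good, rejected, audit = transform(sample)
```

Keeping rejected rows and per-rule counters alongside the output is one way to make the "heavy auditing" on this slide concrete.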

12 Load Stage
Speed vs concurrency; double-duty as backup/recovery; auditing; delta-processing; insert vs insert-update vs replace
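The three load strategies can be compared side by side. This is a sketch only, using Python's built-in `sqlite3` module and a hypothetical `customer` table; a production warehouse would typically use its own bulk-load facilities instead.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[int, str]


def load_insert(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Insert only: fails if a key already exists."""
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", rows)


def load_insert_update(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Insert-update: new keys are added, existing keys are overwritten."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO customer (id, name) VALUES (?, ?)", rows
        )


def load_replace(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Replace: the table contents are swapped for the new rows."""
    with conn:
        conn.execute("DELETE FROM customer")
        conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

load_insert(conn, [(1, "Ann"), (2, "Bob")])
load_insert_update(conn, [(2, "Robert"), (3, "Cy")])
after_upsert = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()

load_replace(conn, [(9, "Zed")])
after_replace = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
```

Insert-only is the simplest but needs clean deltas, insert-update tolerates re-delivered rows, and replace is the easiest to reason about for recovery at the cost of reloading everything.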

13 File Transportation Process
Autonomous utility allows components to “fire & forget”. Like rsync, but with pre & post actions for renaming or moving files. Automatically retries failures. Commodity interface. Alternative: network file system.
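A rough sketch of what such a transport utility might look like, assuming a local copy stands in for the network transfer. The function name `transport`, its parameters, and the `.tmp` naming convention are all hypothetical; the slides do not describe the actual tool.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, Optional


def transport(
    src: Path,
    dest_dir: Path,
    pre: Optional[Callable[[Path], Path]] = None,
    post: Optional[Callable[[Path, Path], None]] = None,
    retries: int = 3,
    delay: float = 0.0,
    copy: Callable = shutil.copy2,
) -> Path:
    """Copy `src` into `dest_dir`, retrying on failure, with pre/post hooks."""
    if pre is not None:
        src = pre(src)  # pre action, e.g. rename the file before sending
    dest_dir.mkdir(parents=True, exist_ok=True)
    tmp = dest_dir / (src.name + ".tmp")
    final = dest_dir / src.name
    last_error: Optional[Exception] = None
    for _attempt in range(retries):
        try:
            copy(src, tmp)
            tmp.replace(final)  # rename so readers never see a partial file
            break
        except OSError as exc:
            last_error = exc
            time.sleep(delay)
    else:
        raise RuntimeError(f"gave up on {src} after {retries} attempts") from last_error
    if post is not None:
        post(src, final)  # post action, e.g. move the source to an archive
    return final
```

Writing to a temporary name and renaming at the end lets the receiving component pick up files without coordinating with the sender, which is what makes "fire & forget" workable.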

14 Metadata
Active documentation that drives: integration; automation. May include the audit subsystem: process-audit results; rule-audit results.
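One way to read "active documentation" is a single feed description that both generates human-readable documentation and drives automated checks. The sketch below assumes that reading; the metadata layout, the feed name, and the function names are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical feed metadata: names, types and rules are illustrative only.
FEED_METADATA: Dict[str, Any] = {
    "feed": "customer_daily",
    "fields": [
        {"name": "customer_id", "type": int, "required": True,
         "doc": "Source customer key"},
        {"name": "country_code", "type": str, "required": True,
         "doc": "ISO country code"},
        {"name": "amount", "type": float, "required": False,
         "doc": "Order amount"},
    ],
}


def describe(meta: Dict[str, Any]) -> List[str]:
    """Render the metadata as documentation lines."""
    lines = [f"Feed: {meta['feed']}"]
    for f in meta["fields"]:
        flag = "required" if f["required"] else "optional"
        lines.append(f"  {f['name']} ({f['type'].__name__}, {flag}): {f['doc']}")
    return lines


def rule_audit(meta: Dict[str, Any], row: Dict[str, Any]) -> List[str]:
    """Check one row against the metadata; return rule-audit failures."""
    failures = []
    for f in meta["fields"]:
        value = row.get(f["name"])
        if value is None:
            if f["required"]:
                failures.append(f"{f['name']}: missing required value")
            continue
        try:
            f["type"](value)  # type rule: the value must be convertible
        except (TypeError, ValueError):
            failures.append(
                f"{f['name']}: cannot convert {value!r} to {f['type'].__name__}"
            )
    return failures
```

Because the documentation and the checks come from the same structure, they cannot drift apart, which is the practical benefit of metadata that is "active" rather than descriptive only.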

15 ID Generation
Especially important for relational databases. Options: reuse source ids (don't do this); assigned in database; assigned in ETL. Consider recoverability.
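For the "assigned in ETL" option, recoverability usually means that a rerun must hand out the same ids as before. A minimal sketch, assuming the key map is persisted to a JSON file; the class name `KeyAssigner` and the file layout are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


class KeyAssigner:
    """Assign surrogate ids to natural keys and persist the mapping."""

    def __init__(self, state_path: Path) -> None:
        self.state_path = Path(state_path)
        self.mapping: Dict[str, int] = {}
        if self.state_path.exists():
            # Recover earlier assignments so reruns reuse the same ids.
            self.mapping = json.loads(self.state_path.read_text())
        self.next_id = max(self.mapping.values(), default=0) + 1

    def get_id(self, natural_key: str) -> int:
        """Return the existing id for a key, or assign the next one."""
        if natural_key not in self.mapping:
            self.mapping[natural_key] = self.next_id
            self.next_id += 1
        return self.mapping[natural_key]

    def save(self) -> None:
        """Write the mapping to a temp file, then rename it into place."""
        tmp = self.state_path.with_name(self.state_path.name + ".tmp")
        tmp.write_text(json.dumps(self.mapping))
        tmp.replace(self.state_path)
```

Usage would look like `assigner = KeyAssigner(Path("keys.json"))`, then `assigner.get_id("src:100")` per row and `assigner.save()` at the end of the run.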

16 Recap
These problems will happen: productivity issues; data quality problems; dependability issues.
Unless you avoid these mistakes: non-linear scalability; magical thinking; wrong consistency vs adaptability trade-offs.
And pick the right architecture: ETL tool for large number of simple feeds in the corporate world. Custom solution for small number of complex feeds in the startup world, if you have a great team. Think carefully about the grey area in-between.
Finally, stick to a consistent extract, transform, load breakdown whenever possible, for best maintenance and adaptability.

17 Thanks to the following:
Image credits: Old Shoe; Cabling Nightmare; Auto Repair; Chain; Assembly Line; Cottingley Faeries; Scale; Giant Gnome; Ticket

18 Ken Farmer has twenty years of experience in delivering innovations through data logistics: the unglamorous part of data science involved in acquiring, standardizing, validating, transforming, and integrating vast amounts of data, and in making that data available and accessible. Ken is a senior data architect at IBM where he leads their security & compliance data warehouse. Prior to this role Ken consulted on search engines and data warehouses in the insurance, telecom, entertainment, and retail industries.

