Taming Data Logistics: The Hardest Part of Data Science

February 1, 2011
Ken Farmer
kenfar@us.ibm.com, kenfar@gmail.com

Data Logistics is the Management of Data in Motion
- While handling problems
- And all of this for many, many feeds

Top 3 Data Logistics Problems
- Dependability problems
- Productivity problems
- Data quality problems
Images:
- Cabling nightmare: http://www.flickr.com/photos/alq666/2248613780/sizes/z/in/photostream/
- Car problem: http://www.flickr.com/photos/michelhrv/2545226437/sizes/z/in/photostream/

Data Logistics is Surprisingly Difficult
- It's not sexy
- Lack of dramatic improvement
- Few best practices
- Non-intuitive challenges
- Little-known tools & methods
- Tools != methods
Image: http://www.flickr.com/photos/freefoto/3037402633/sizes/o/in/photostream/

Data Science is not alone
- Similar activities
- Similar technologies
- Similar challenges
- Similar results
- Different heritages:
  - Corporate DBA heritage
  - Academic CS heritage
- Nothing on the Data Science side maps to ETL

Top 3 Fundamental Mistakes
- First: Magical Thinking (or over-estimated capabilities)
- Second: Consistency vs Adaptability (or incorrect requirements & objectives)
- Third: Non-Linear Scalability Problems (or misunderstood dynamics)
Images:
- Cottingley Faeries: http://www.google.com/imgres?q=fairies+cottingley&hl=en&gbv=2&biw=1280&bih=575&tbm=isch&tbnid=yTeYdw-LVEELiM:&imgrefurl=http://www.unmuseum.org/fairies.htm&docid=DIOvMkBlWc67GM&w=401&h=267&ei=Unh8Tt_AH4ro0QHNmOQH&zoom=1&iact=hc&vpx=968&vpy=227&dur=7451&hovh=183&hovw=275&tx=152&ty=69&page=1&tbnh=159&tbnw=218&start=0&ndsp=11&ved=1t:429,r:4,s:0
- Scale: http://www.flickr.com/photos/sugarcubevintage/5663606146/
- Giant Gnome: http://www.flickr.com/photos/tobyleah/2897960941/sizes/z/in/photostream/

Architectural Decisions – Buy vs Build
Considerations include:
- Feed complexity & number (see chart)
- Developer interest & skills
- Organizational culture
It's really (buy+build) vs build.
[Chart: risk (vertical axis) against complexity (horizontal axis)]

Architectural Decisions – Scheduling & Control
- Synchronous steps <-> asynchronous stations
- Unit of work: batches, microbatches, or streaming
Images:
- Chain: http://www.flickr.com/photos/pratanti/5359581911/sizes/z/in/photostream/
- Assembly line: http://www.flickr.com/photos/gblakeley/5583120966/sizes/z/in/photostream/
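
The three units of work named on this slide can be read as one mechanism with a different chunk size. The sketch below is an illustration only: the `microbatches` helper and the sample `records` are hypothetical, not from the talk.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def microbatches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists holding at most `size` records each."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical feed of seven records.
records = [{"id": i} for i in range(7)]

batch = list(microbatches(records, size=len(records)))  # one batch: the whole feed
micro = list(microbatches(records, size=3))             # microbatches of up to 3
stream = list(microbatches(records, size=1))            # streaming: one record at a time
```

A smaller unit of work generally lowers latency but raises per-unit overhead, which is one reason the choice is an architectural decision rather than a tuning detail.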

Stages
- Extract, transform, and load (ETL) are both activities AND stages
- Enables deployment flexibility
- Enables different tools & technology
- Adds structure to process

Extract Stage
- Get transformation-ready data
- Changed data capture
- Minimal transformation
- Auditing
- Potential colocation
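
One simple way to approximate changed data capture during extract is to compare the current snapshot of a source with the previous one. The sketch below assumes both snapshots fit in memory and are keyed by a stable id; `capture_changes` and the sample data are hypothetical. Other techniques (log-based capture, timestamps, triggers) exist and may fit better.

```python
def capture_changes(previous: dict, current: dict) -> dict:
    """Compare two snapshots keyed by id and classify the differences."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return {"insert": inserts, "update": updates, "delete": deletes}


# Hypothetical snapshots of a small customer table.
yesterday = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
today = {1: {"name": "Ann"}, 2: {"name": "Robert"}, 4: {"name": "Dee"}}

changes = capture_changes(yesterday, today)
```

Only the changed rows then need to pass through the heavier downstream stages.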

Transformation Stage
- Heavy transformations:
  - Lookups
  - Validations
  - Remapping
  - Business rules
- Heavy auditing
- Post-transform delta-processing
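
A minimal sketch of the four transformation types on this slide, with audit counters kept alongside. Everything named here (`FIELD_MAP`, `COUNTRY_LOOKUP`, the `is_large` rule, the sample rows) is a hypothetical illustration, not something taken from the talk.

```python
from collections import Counter

# Hypothetical reference data for the sketch.
FIELD_MAP = {"cust": "customer_id", "ctry": "country_code", "amt": "amount"}
COUNTRY_LOOKUP = {"US": "United States", "DE": "Germany"}


def transform(rows):
    """Return (good_rows, rejected_rows, audit_counts)."""
    audit = Counter()
    good, rejected = [], []
    for row in rows:
        audit["read"] += 1
        # Remapping: rename source fields to target names.
        rec = {FIELD_MAP.get(k, k): v for k, v in row.items()}
        # Validation: amount must be numeric.
        try:
            rec["amount"] = float(rec["amount"])
        except (KeyError, TypeError, ValueError):
            audit["rejected_bad_amount"] += 1
            rejected.append(row)
            continue
        # Lookup: resolve the country code.
        name = COUNTRY_LOOKUP.get(rec.get("country_code"))
        if name is None:
            audit["rejected_unknown_country"] += 1
            rejected.append(row)
            continue
        rec["country_name"] = name
        # Business rule (assumed): flag amounts of 1000 or more.
        rec["is_large"] = rec["amount"] >= 1000
        good.append(rec)
        audit["written"] += 1
    return good, rejected, audit


rows = [
    {"cust": "c1", "ctry": "US", "amt": "1500"},
    {"cust": "c2", "ctry": "DE", "amt": "abc"},
    {"cust": "c3", "ctry": "ZZ", "amt": "20"},
]
good, rejected, audit = transform(rows)
```

Keeping rejected rows and per-reason counts is what makes the auditing "heavy": every input row is accounted for as either written or rejected.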

Load Stage
- Speed vs concurrency
- Double-duty as backup/recovery
- Auditing
- Delta-processing
- Insert vs Insert-Update vs Replace
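
The three load strategies in the last bullet can be compared with a small sketch using Python's built-in `sqlite3` module. The `customer` table and the `load` helper are hypothetical; SQLite's `INSERT OR REPLACE` stands in for insert-update here, and other databases offer their own upsert or merge statements.

```python
import sqlite3

# SQL per load mode. "replace" clears the table first, then inserts.
SQL = {
    "insert": "INSERT INTO customer (id, name) VALUES (?, ?)",
    "insert-update": "INSERT OR REPLACE INTO customer (id, name) VALUES (?, ?)",
    "replace": "INSERT INTO customer (id, name) VALUES (?, ?)",
}


def load(conn, rows, mode):
    """Load rows in a single transaction using the chosen strategy."""
    with conn:  # commits on success, rolls back on error
        if mode == "replace":
            conn.execute("DELETE FROM customer")
        conn.executemany(SQL[mode], rows)


def contents(conn):
    return conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

load(conn, [(1, "a"), (2, "b")], "insert")
after_insert = contents(conn)

load(conn, [(2, "B"), (3, "c")], "insert-update")
after_upsert = contents(conn)

load(conn, [(9, "z")], "replace")
after_replace = contents(conn)
```

Plain insert fails on a duplicate key, insert-update tolerates reruns, and replace makes the load idempotent at the cost of rewriting the whole table.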

File Transportation Process
- Autonomous utility allows components to "fire & forget"
- Like rsync, but with pre & post actions for renaming or moving files
- Automatically retries failures
- Commodity interface
- Alternative: network file system
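
A rough sketch of such a transport utility, assuming a local copy stands in for the network transfer. The `transport` function, the `.part` naming convention, and the `flaky_copy` failure simulator are all hypothetical; the talk does not describe its utility at this level of detail.

```python
import os
import shutil
import tempfile
import time


def transport(src, dest_dir, copy=shutil.copy2, retries=3, delay=0.0):
    """Copy src into dest_dir, retrying on OSError; return attempts used."""
    final = os.path.join(dest_dir, os.path.basename(src))
    temp = final + ".part"  # pre-action: write under a temporary name
    for attempt in range(1, retries + 1):
        try:
            copy(src, temp)
            os.replace(temp, final)  # post-action: rename once complete
            return attempt
        except OSError:
            if attempt == retries:
                raise
            time.sleep(delay)


# Demo: a hypothetical copier that fails twice before succeeding.
calls = {"n": 0}


def flaky_copy(src, dst):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated network failure")
    shutil.copy2(src, dst)


with tempfile.TemporaryDirectory() as src_dir, \
        tempfile.TemporaryDirectory() as dest_dir:
    src = os.path.join(src_dir, "feed.csv")
    with open(src, "w") as f:
        f.write("id,name\n1,a\n")
    attempts = transport(src, dest_dir, copy=flaky_copy)
    delivered = sorted(os.listdir(dest_dir))
```

Renaming only after the copy completes means a downstream consumer that ignores `.part` files never sees a partially written feed.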

Metadata
- Active documentation that drives:
  - Integration
  - Automation
- May include the audit subsystem:
  - Process-audit results
  - Rule-audit results
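
One possible reading of "active documentation" is a feed description that the pipeline executes directly instead of merely consulting. The sketch below assumes a made-up metadata layout (`FEED_METADATA`) and a made-up `apply_metadata` function that returns both process-audit and rule-audit results.

```python
# Hypothetical feed description: it documents the feed and drives processing.
FEED_METADATA = {
    "feed": "orders",
    "fields": [
        {"name": "order_id", "type": int, "required": True},
        {"name": "amount", "type": float, "required": True},
        {"name": "note", "type": str, "required": False},
    ],
}


def apply_metadata(rows, metadata):
    """Convert and check rows as the metadata dictates; return audit results."""
    rule_audit = {}
    out = []
    for row in rows:
        rec, ok = {}, True
        for field in metadata["fields"]:
            name = field["name"]
            raw = row.get(name)
            if raw is None or raw == "":
                rec[name] = None
                if field["required"]:
                    rule = f"{name}:required"
                    rule_audit[rule] = rule_audit.get(rule, 0) + 1
                    ok = False
                continue
            try:
                rec[name] = field["type"](raw)
            except ValueError:
                rule = f"{name}:type"
                rule_audit[rule] = rule_audit.get(rule, 0) + 1
                ok = False
        if ok:
            out.append(rec)
    process_audit = {
        "feed": metadata["feed"],
        "rows_in": len(rows),
        "rows_out": len(out),
    }
    return out, process_audit, rule_audit


rows = [
    {"order_id": "1", "amount": "9.5", "note": "x"},
    {"order_id": "2", "amount": ""},
    {"order_id": "x", "amount": "3"},
]
clean, process_audit, rule_audit = apply_metadata(rows, FEED_METADATA)
```

Adding a feed then means adding a metadata entry rather than writing new code, which is where the integration and automation benefits would come from.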

ID Generation
- Especially important for relational databases
- Options:
  - Reuse source ids (don't do this)
  - Assigned in database
  - Assigned in ETL
- Consider recoverability
Image: http://www.flickr.com/photos/rrrrred/2686239220/sizes/z/in/photostream/
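
A small sketch of the "assigned in ETL" option with recoverability in mind: the mapping from source keys to surrogate ids is persisted, so a rerun after a failure hands out the same ids. The `KeyAssigner` class, the `crm` source name, and the JSON persistence are hypothetical choices for illustration.

```python
import json


class KeyAssigner:
    """Assign surrogate ids to (source system, source id) pairs."""

    def __init__(self, key_map=None):
        self.key_map = dict(key_map or {})
        # Resume numbering after the highest id already assigned.
        self.next_id = max(self.key_map.values(), default=0) + 1

    def assign(self, source_system, source_id):
        natural = f"{source_system}:{source_id}"
        if natural not in self.key_map:
            self.key_map[natural] = self.next_id
            self.next_id += 1
        return self.key_map[natural]


assigner = KeyAssigner()
first = [assigner.assign("crm", sid) for sid in ("A7", "B2", "A7")]

# Persist the map, then recover it as a restarted process would.
saved = json.dumps(assigner.key_map)
recovered = KeyAssigner(json.loads(saved))
second = [recovered.assign("crm", sid) for sid in ("A7", "B2", "C9")]
```

Because the surrogate id is independent of the source id, a source system can renumber or be replaced without disturbing keys already loaded downstream.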

Recap
These problems will happen:
- Productivity issues
- Data quality problems
- Dependability issues
Unless you avoid these mistakes:
- Non-linear scalability
- Magical thinking
- Wrong consistency vs adaptability trade-offs
And pick the right architecture:
- ETL tool for a large number of simple feeds in the corporate world.
- Custom solution for a small number of complex feeds in the startup world, if you have a great team.
- Think carefully about the grey area in between.
Finally, stick to a consistent extract, transform, load breakdown whenever possible, for best maintenance and adaptability.

Thanks to the following:
- Old Shoe: http://www.flickr.com/photos/freefoto/3037402633/sizes/o/in/photostream/
- Cabling Nightmare: http://www.flickr.com/photos/alq666/2248613780/sizes/z/in/photostream/
- Auto Repair: http://www.flickr.com/photos/michelhrv/2545226437/sizes/z/in/photostream/
- Chain: http://www.flickr.com/photos/pratanti/5359581911/sizes/z/in/photostream/
- Assembly line: http://www.flickr.com/photos/gblakeley/5583120966/sizes/z/in/photostream/
- Cottingley Faeries: http://www.google.com/imgres?q=fairies+cottingley&hl=en&gbv=2&biw=1280&bih=575&tbm=isch&tbnid=yTeYdw-LVEELiM:&imgrefurl=http://www.unmuseum.org/fairies.htm&docid=DIOvMkBlWc67GM&w=401&h=267&ei=Unh8Tt_AH4ro0QHNmOQH&zoom=1&iact=hc&vpx=968&vpy=227&dur=7451&hovh=183&hovw=275&tx=152&ty=69&page=1&tbnh=159&tbnw=218&start=0&ndsp=11&ved=1t:429,r:4,s:0
- Scale: http://www.flickr.com/photos/sugarcubevintage/5663606146/
- Giant Gnome: http://www.flickr.com/photos/tobyleah/2897960941/sizes/z/in/photostream/
- Ticket: http://www.flickr.com/photos/rrrrred/2686239220/sizes/z/in/photostream/

Ken Farmer has twenty years of experience in delivering innovations through data logistics: the unglamorous part of data science involved in acquiring, standardizing, validating, transforming, and integrating vast amounts of data, and in making that data available and accessible. Ken is a senior data architect at IBM, where he leads their security & compliance data warehouse. Prior to this role, Ken consulted on search engines and data warehouses in the insurance, telecom, entertainment, and retail industries.