INTEGRATING BIG DATA TECHNOLOGY INTO LEGACY SYSTEMS
Robert Cooley, Ph.D.
CodeFreeze 1/16/2014


1 INTEGRATING BIG DATA TECHNOLOGY INTO LEGACY SYSTEMS
Robert Cooley, Ph.D.
CodeFreeze 1/16/2014

2 AGENDA
- Do you have "Big Data"?
- Not all big data is useful data
- Strengths & Weaknesses of data technologies
- Integrating big data technologies into legacy systems
CONFIDENTIAL – COPYRIGHT © 2013 OPTIMINE SOFTWARE

3 BIG DATA?
- Defined obsolescence! A mere TB or two need not apply
- From Wikipedia: “Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.”

4 CAN YOU BENEFIT FROM BIG DATA PARADIGMS AND TECHNOLOGY?
The Three Vs*
- Volume: size of the data
- Velocity: the speed of new incoming data
- Variety: the variation of data formats and types
Plus
- Concurrency: amount of simultaneous processing needed
*“3D Data Management: Controlling Data Volume, Velocity, and Variety,” Doug Laney, 2/6/2001

5 EXAMPLE: OPTIMINE SOFTWARE
- Optimization and measurement for digital advertising
- Data comes in at an advertisement-day level or transaction level
- Volume? Not really by today’s standards. Entire datacenter is under 20TB
- Velocity? Not really. Data feeds come in once a day
- Variety? Yes. Hundreds of different data file formats
- Concurrency? Yes. Hundreds of simultaneous processing requests

6 JUST BECAUSE IT IS BIG DOESN’T MEAN IT’S USEFUL
- The danger of the big data mindset is collecting and retaining data without a purpose or plan to use it
- An advantage of legacy systems is that a history of analysis and already-collected data exists to help determine use cases
- Be on the lookout for “Accidental Data”: data collected from various applications using whatever the default settings happen to be

7 EXAMPLES OF ACCIDENTAL DATA
- Over 1M hits per day for a Web site
  - 100% of traffic assigned to a single page
- 99.8% of age fields are populated for head of household
  - 20% of population is listed as age 18
- OptiMine Example
  - 28% of conversions from search assigned to search keywords without a click or visit
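One way to look for accidental data like the "age 18" case above is to check whether a single value accounts for a suspiciously large share of a field. The following is a minimal Python sketch, not a method from the presentation: the function names, the 15% threshold, and the sample records are all assumptions chosen for illustration, and low-cardinality fields (for example a yes/no flag) would need a different threshold.

```python
from collections import Counter


def dominant_value_share(values):
    """Return (most common value, its share of all values)."""
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


def flag_suspect_fields(records, threshold=0.15):
    """Flag fields where one value exceeds `threshold` of populated rows.

    `threshold` is a hypothetical cutoff; tune it per field in practice.
    """
    suspects = {}
    fields = {name for row in records for name in row}
    for name in sorted(fields):
        values = [row[name] for row in records if row.get(name) is not None]
        value, share = dominant_value_share(values)
        if share > threshold:
            suspects[name] = (value, share)
    return suspects


# Hypothetical sample: 20 people, 4 of whom carry a default age of 18.
ages = [18] * 4 + [23, 29, 31, 35, 38, 42, 44, 47,
                   49, 52, 55, 58, 61, 64, 67, 70]
records = [{"age": age, "customer_id": i} for i, age in enumerate(ages)]
suspects = flag_suspect_fields(records)
print(suspects)
```

In this sample the `age` field is flagged because 18 covers 20% of the rows, while the unique `customer_id` field is not. A flag is only a prompt for an analyst to look at the field, not proof that the value is a default.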

8 BIG DATA WITHOUT ANALYSIS IS A BIG WASTE OF RESOURCES
- Collecting data without also investing in an appropriately scaled analytics infrastructure results in a “Data Tomb”
  - This holds even if the Big Data technology streamlines data access
  - e.g. the NSA collection of CDRs (call detail records), and most organizations where IT builds the data infrastructure independently from the business
- Make sure an analyst or data scientist has a chance to evaluate the data collection plan and fields
  - For OptiMine, the head analyst is also the head of development
- Think about possible use cases, but if no one in the organization can come up with one, question the cost of collecting and storing the data

9 CURRENT OPTIMINE ETL & STAGING
ETL
- Phase 1: Parse, Validate, & Simple Transforms
- Phase 2: Assign Clean Key
Issues
- The transform (T) step in Phase 2 processing is a bottleneck
- Insufficient meta-data makes QA difficult
- Only the latest version of the data is stored in the database
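A Phase 1 step of the kind named above (parse, validate, simple transforms) can be sketched in a few lines of Python. This is an illustrative sketch only, not OptiMine's pipeline: the CSV layout, the field names `ad_id`, `date`, and `cost`, and the validation rules are all assumptions.

```python
import csv
import io
from datetime import datetime


def parse(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def validate(row):
    """Return a list of validation errors for one row (empty if valid)."""
    errors = []
    if not (row.get("ad_id") or "").strip():
        errors.append("missing ad_id")
    try:
        datetime.strptime(row.get("date") or "", "%Y-%m-%d")
    except ValueError:
        errors.append("bad date")
    try:
        if float(row.get("cost") or "") < 0:
            errors.append("negative cost")
    except ValueError:
        errors.append("bad cost")
    return errors


def transform(row):
    """Simple transforms: normalize the id and convert cost to a number."""
    return {
        "ad_id": row["ad_id"].strip().upper(),
        "date": row["date"],
        "cost": round(float(row["cost"]), 2),
    }


def phase1(text):
    """Split rows into transformed clean rows and rejected rows with reasons."""
    clean, rejected = [], []
    for row in parse(text):
        errors = validate(row)
        if errors:
            rejected.append((row, errors))
        else:
            clean.append(transform(row))
    return clean, rejected


# Hypothetical advertisement-day feed with three bad rows.
sample = "\n".join([
    "ad_id,date,cost",
    "a1,2014-01-15,12.5",
    ",2014-01-15,3.0",
    "a3,2014-13-01,4.0",
    "a4,2014-01-15,-1",
])
clean, rejected = phase1(sample)
print(clean)
print([errors for _, errors in rejected])
```

Keeping the rejected rows together with their reasons is one simple way to retain some of the meta-data that QA needs.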

10 RDBMS
Strengths
- Mature technology
- Variety of technologies available
  - MPP architectures (e.g. Teradata, BitYota)
- Very efficient for set operations & relational algebra
- Very efficient for updating data while maintaining data integrity
Weaknesses
- Not great for procedural operations (e.g. iterators)
- Full transaction locking overhead is not always needed
- Inserts can be slow due to indexing
- Fixed schema (“Schema on write”*)
*Amr Awadallah, co-founder of Cloudera

11 ETL TOOLS
Strengths
- Built-in library of common transforms
- Built-in library of data source connectors
- Typically a drag-and-drop workflow
Weaknesses
- Expensive, especially for scalable parallel processing

12 PROCEDURAL PROGRAMMING LANGUAGES
Strengths
- Flexibility
- Complex data structures
- Iterators
- Recursion
Weaknesses
- More programming time required compared to higher level tools (e.g. ETL)

13 DISTRIBUTED FILE SYSTEM (HADOOP)
Strengths
- Flexibility
  - “Schema on Read”*
  - Full procedural programming power
- Parallelism/Redundancy
- Low cost
- Data load speed
Weaknesses
- Flexibility!
  - “Hadoop makes the easy things hard, but the impossible things possible”
  - Often need to add additional tools (Hive, Pig, etc.)
- Evolving technology: ecosystem is still in flux with new tools coming and going
- No ability to update, only insert
- Data read speed
*Amr Awadallah, co-founder of Cloudera
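The "schema on read" trade-off listed above (fast loads, slower reads) can be illustrated with a small Python sketch. It uses an in-memory list as a stand-in for append-only files and does not use any Hadoop API; the function names, the JSON-lines format, and the field names are assumptions made for the example.

```python
import json


def load(store, line):
    """Append a raw line as-is: no parsing, no schema check (fast load)."""
    store.append(line)


def read(store, schema):
    """Apply a schema at read time.

    `schema` maps field name -> converter. Fields missing from a record
    come back as None. Parsing on every read is the cost paid for cheap loads.
    """
    for line in store:
        record = json.loads(line)
        yield {
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        }


# Hypothetical feed lines with differing fields.
store = []
load(store, '{"ad_id": "a1", "clicks": "10"}')
load(store, '{"ad_id": "a2", "clicks": "7", "device": "mobile"}')

# Two readers interpret the same raw data with different schemas.
clicks_view = list(read(store, {"ad_id": str, "clicks": int}))
device_view = list(read(store, {"ad_id": str, "device": str}))
print(clicks_view)
print(device_view)
```

With a fixed schema on write, the second line's extra `device` field would have to be handled at load time; here the decision is deferred to whichever reader cares about it.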

14 PICK THE PARADIGM FIRST, TOOL SECOND
OptiMine Technologies
- RDBMS: SQL Server 2008
- ETL: SSIS (SQL Server Integration Services)
- Procedural Language: Java/Groovy
- Distributed File System: Hadoop (MapReduce)
Issues and the paradigm chosen for each
- Transform of Phase 2 processing is a bottleneck: MapReduce
- Insufficient meta-data makes QA difficult: RDBMS
- Only stores latest version of data in database: HDFS

15 NEW OPTIMINE ETL & STAGING
- HDFS stores all versions of the inbound data
- MapReduce handles heavy lifting for assigning and updating Meta Data
- Staging to Production queries are reduced to simple inner joins
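The flow described above can be sketched in plain Python: keep every version of the inbound data, use a map/shuffle/reduce pass to pick the latest version per key, then join the result to production data with a simple inner join. This is a single-process stand-in for illustration only, not Hadoop code and not OptiMine's implementation; the record layout and all function names are assumptions.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (key, record) pair for every stored version."""
    for rec in records:
        yield rec["key"], rec


def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_latest(versions):
    """Reduce: keep the record with the highest version number."""
    return max(versions, key=lambda rec: rec["version"])


def latest_versions(records):
    """Run map, shuffle, and reduce to get the latest record per key."""
    groups = shuffle(map_phase(records))
    return {key: reduce_latest(versions) for key, versions in groups.items()}


def inner_join(staged, production):
    """Inner join on key: only keys present on both sides survive."""
    return [
        {**staged[key], **production[key]}
        for key in sorted(staged)
        if key in production
    ]


# Append-only history: nothing is updated in place, versions accumulate.
history = [
    {"key": "a1", "version": 1, "cost": 10.0},
    {"key": "a1", "version": 2, "cost": 12.0},
    {"key": "a2", "version": 1, "cost": 5.0},
]
production = {"a1": {"campaign": "spring"}}  # hypothetical production rows

staged = latest_versions(history)
joined = inner_join(staged, production)
print(joined)
```

Because older versions stay in `history`, an earlier state can be recomputed by filtering on `version` before the reduce step, which is what an insert-only store makes possible.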

16 SUMMARY
- Your data doesn’t have to be “big” in order to get value out of “big data” technologies
- Conversely, don’t fall into the trap of pursuing “all of the data” just because you have the technology to cheaply store and retrieve it
- Figure out the right paradigm for the problem first, then select the appropriate technology
