AGENDA
- Do you have "Big Data"?
- Not all big data is useful data
- Strengths & Weaknesses of data technologies
- Integrating big data technologies into legacy systems

BIG DATA?
- Defined obsolescence!
- A mere TB or two need not apply
- From Wikipedia - "Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set."

CAN YOU BENEFIT FROM BIG DATA PARADIGMS AND TECHNOLOGY?
The Three Vs*
- Volume – size of the data
- Velocity – the speed of new incoming data
- Variety – the variation of data formats and types
+ Concurrency – amount of simultaneous processing needed
*"3D Data Management: Controlling Data Volume, Velocity, and Variety", Doug Laney, 2/6/2001

EXAMPLE: OPTIMINE SOFTWARE
- Optimization and measurement for digital advertising
- Data comes in at an advertisement-day level or transaction level
- Volume? Not really by today's standards. Entire datacenter is under 20 TB
- Velocity? Not really. Data feeds come in once a day
- Variety? Yes. Hundreds of different data file formats
- Concurrency? Yes. Hundreds of simultaneous processing requests

JUST BECAUSE IT IS BIG DOESN'T MEAN IT'S USEFUL
- The danger of the big data mindset is collecting and retaining data without a purpose or a plan to use it
- An advantage of legacy systems is that a history of analysis and already-collected data exists to help determine use cases
- Be on the lookout for "Accidental Data": data collected from various applications using whatever the default settings happen to be

EXAMPLES OF ACCIDENTAL DATA
- Over 1M hits per day for a Web site
  - 100% of traffic assigned to a single page
- 99.8% of age fields are populated for head of household
  - 20% of population is listed as age 18
- OptiMine Example
  - 28% of conversions from search assigned to search keywords without a click or visit
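
One way to spot accidental data like the "20% of population is listed as age 18" case is to look for a single value that holds an implausibly large share of a field. The following is a minimal sketch on synthetic records, not OptiMine's actual check; the function name `dominant_values` and the 15% threshold are assumptions chosen for illustration.

```python
from collections import Counter


def dominant_values(records, field, threshold=0.15):
    """Return (value, share) pairs whose share of the populated
    values in `field` exceeds `threshold` (an assumed cutoff)."""
    values = [r[field] for r in records if r.get(field) is not None]
    if not values:
        return []
    total = len(values)
    counts = Counter(values)
    return [(value, count / total)
            for value, count in counts.most_common()
            if count / total > threshold]


# Synthetic data: 20 of 100 records carry the suspicious default age 18,
# the other 80 each have a distinct age.
records = [{"age": 18} for _ in range(20)]
records += [{"age": 19 + i} for i in range(80)]

suspects = dominant_values(records, "age")
print(suspects)
```

A flagged value is only a prompt for an analyst to ask where it came from; some fields legitimately have a dominant value.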

BIG DATA WITHOUT ANALYSIS IS A BIG WASTE OF RESOURCES
- Collecting data without also investing in an appropriately scaled analytics infrastructure results in a "Data Tomb"
- Even when Big Data technology streamlines data access (e.g. the NSA collection of CDRs, call detail records), in most organizations IT builds the data infrastructure independently from the business
- Make sure an analyst or data scientist has a chance to evaluate the data collection plan and fields
  - For OptiMine, the head analyst is also the head of development
- Think about possible use cases; if no one in the organization can come up with one, question the cost of collecting and storing the data

CURRENT OPTIMINE ETL & STAGING
ETL
- Phase 1 – Parse, Validate, & Simple Transforms
- Phase 2 – Assign Clean Key
Issues
- The T (Transform) in Phase 2 processing is a bottleneck
- Insufficient metadata makes QA difficult
- Only the latest version of the data is stored in the database
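
To make the Phase 1 step (parse, validate, simple transforms) concrete, here is a minimal sketch. The feed layout (`ad_id`, `day`, `clicks`), the validation rules, and the function name `parse_and_validate` are all assumptions for illustration; the slides do not describe the real file formats. The sketch also records why each row was rejected, which is the kind of metadata whose absence makes QA difficult.

```python
import csv
import io
from datetime import datetime

# Hypothetical daily feed; the real OptiMine formats are not given.
SAMPLE_FEED = """ad_id,day,clicks
a1,2013-05-01,10
a2,2013-05-01,-3
,2013-05-01,4
a3,not-a-date,5
a4,2013-05-02,7
"""


def parse_and_validate(raw_text):
    """Split a CSV feed into accepted rows and rejected rows with reasons."""
    accepted, rejected = [], []
    reader = csv.DictReader(io.StringIO(raw_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        try:
            ad_id = (row["ad_id"] or "").strip()
            day = datetime.strptime(row["day"], "%Y-%m-%d").date()
            clicks = int(row["clicks"])
            if not ad_id:
                raise ValueError("missing ad_id")
            if clicks < 0:
                raise ValueError("negative clicks")
        except (KeyError, TypeError, ValueError) as exc:
            rejected.append({"line": line_no, "reason": str(exc)})
            continue
        # Simple transform: normalise the date to ISO text.
        accepted.append({"ad_id": ad_id, "day": day.isoformat(),
                         "clicks": clicks})
    return accepted, rejected


accepted, rejected = parse_and_validate(SAMPLE_FEED)
print(len(accepted), "accepted;", len(rejected), "rejected")
```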

RDBMS
Strengths
- Mature technology
- Variety of technologies available
- MPP architectures (e.g. Teradata, BitYota)
- Very efficient for set operations & relational algebra
- Very efficient for updating data while maintaining data integrity
Weaknesses
- Not great for procedural operations (e.g. iterators)
- Full transaction locking overhead is not always needed
- Inserts can be slow due to indexing
- Fixed schema ("Schema on write"*)
*Amr Awadallah, founder Cloudera
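
Two of these points can be illustrated with Python's built-in `sqlite3` module: a set operation expressed as one declarative statement, and a fixed schema rejecting a bad row at write time. This is a sketch only; SQLite stands in for a full RDBMS, and the `clicks` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema on write: the table shape and constraints are fixed up front.
conn.execute(
    "CREATE TABLE clicks ("
    " ad_id TEXT NOT NULL,"
    " day TEXT NOT NULL,"
    " clicks INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [("a1", "2013-05-01", 10),
     ("a1", "2013-05-02", 5),
     ("a2", "2013-05-01", 7)],
)

# Set operation: one statement aggregates every group, with no explicit loop.
totals = dict(conn.execute(
    "SELECT ad_id, SUM(clicks) FROM clicks GROUP BY ad_id"))

# A row that violates the schema is refused when it is written.
try:
    conn.execute("INSERT INTO clicks VALUES (?, ?, ?)",
                 ("a3", "2013-05-01", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(totals, rejected)
```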

ETL TOOLS
Strengths
- Built-in library of common transforms
- Built-in library of data source connectors
- Typically a drag-and-drop workflow
Weaknesses
- Expensive, especially for scalable parallel processing

PROCEDURAL PROGRAMMING LANGUAGES
Strengths
- Flexibility
- Complex data structures
- Iterators
- Recursion
Weaknesses
- More programming time required compared to higher level tools (e.g. ETL)
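
As a small illustration of what iterators, recursion, and complex data structures buy you, the sketch below walks an arbitrarily nested structure and yields every leaf with its path. The `campaign` data and the function name `walk` are made up for the example.

```python
def walk(node, path=()):
    """Recursively yield (path, value) for every leaf of a nested
    dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node


# Hypothetical nested record.
campaign = {
    "name": "spring",
    "ads": [
        {"id": "a1", "keywords": ["shoes", "boots"]},
        {"id": "a2", "keywords": []},
    ],
}

leaves = list(walk(campaign))
for path, value in leaves:
    print(path, value)
```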

DISTRIBUTED FILE SYSTEM (HADOOP)
Strengths
- Flexibility
- "Schema on Read"*
- Full procedural programming power
- Parallelism/Redundancy
- Low cost
- Data load speed
Weaknesses
- Flexibility!
- "Hadoop makes the easy things hard, but the impossible things possible"
- Often requires additional tools (Hive, Pig, etc.)
- Evolving technology - ecosystem is still in flux with new tools coming and going
- No ability to update, only insert
- Data read speed
*Amr Awadallah, founder Cloudera
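
"Schema on read" means the raw data is stored untouched and a schema is applied only when a reader asks for it. The sketch below imitates that idea in plain Python over a list of raw lines; it does not use any Hadoop API, and the line contents, field names, and `read_with_schema` helper are assumptions for illustration.

```python
import json

# Raw lines are stored exactly as they arrived, good and bad alike.
RAW_LINES = [
    '{"ad_id": "a1", "clicks": 10}',
    '{"ad_id": "a2", "clicks": "7", "campaign": "spring"}',
    'not json at all',
]


def read_with_schema(lines, schema):
    """Yield records shaped by `schema` (field name -> cast function),
    skipping lines that cannot be read that way."""
    for line in lines:
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(raw, dict):
            continue
        try:
            yield {name: cast(raw[name]) for name, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue


# Two readers apply two different schemas to the same stored data.
clicks_view = list(read_with_schema(RAW_LINES, {"ad_id": str, "clicks": int}))
campaign_view = list(read_with_schema(RAW_LINES,
                                      {"ad_id": str, "campaign": str}))
print(clicks_view)
print(campaign_view)
```

The same flexibility is the weakness the slide names: every reader repeats the parsing work, which costs read speed.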

PICK THE PARADIGM FIRST, TOOL SECOND
OptiMine Technologies
- RDBMS – SQL Server 2008
- ETL – SSIS (SQL Server Integration Services)
- Procedural Language – Java/Groovy
- Distributed File System – Hadoop (MapReduce)
Issues (and the technology chosen to address each)
- Transform of Phase 2 processing is a bottleneck: MapReduce
- Insufficient metadata makes QA difficult: RDBMS
- Only the latest version of the data is stored in the database: HDFS

NEW OPTIMINE ETL & STAGING
- HDFS stores all versions of the inbound data
- MapReduce handles the heavy lifting for assigning and updating metadata
- Staging-to-production queries are reduced to simple inner joins
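
The division of labour described here can be sketched as a map/shuffle/reduce pipeline. The code below is an in-memory imitation in plain Python, not Hadoop code and not OptiMine's actual job: every inbound version is kept, the map step keys each record, and the reduce step assigns a clean key and picks the latest version. The record fields, the `ad_id|day` key format, and all function names are assumptions.

```python
from collections import defaultdict

# Every delivered version is retained (the role HDFS plays in the slide).
INBOUND = [
    {"ad_id": "a1", "day": "2013-05-01", "clicks": 10, "version": 1},
    {"ad_id": "a1", "day": "2013-05-01", "clicks": 12, "version": 2},
    {"ad_id": "a2", "day": "2013-05-01", "clicks": 7, "version": 1},
]


def map_phase(records):
    """Emit (key, record) pairs, keyed by ad and day."""
    for rec in records:
        yield (rec["ad_id"], rec["day"]), rec


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, versions):
    """Assign a clean key and keep the latest version's figures."""
    latest = max(versions, key=lambda rec: rec["version"])
    return {
        "clean_key": "|".join(key),
        "clicks": latest["clicks"],
        "latest_version": latest["version"],
        "versions_seen": len(versions),
    }


def run_job(records):
    grouped = shuffle(map_phase(records))
    return [reduce_phase(key, vals) for key, vals in sorted(grouped.items())]


staged = run_job(INBOUND)
for row in staged:
    print(row)
```

With one staged row per clean key, loading into production can be a plain inner join on that key.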

SUMMARY
- Your data doesn't have to be "big" in order to get value out of "big data" technologies
- Conversely, don't fall into the trap of pursuing "all of the data" just because you have the technology to cheaply store and retrieve it
- Figure out the right paradigm for the problem first, then select the appropriate technology


