
INTEGRATING BIG DATA TECHNOLOGY INTO LEGACY SYSTEMS
Robert Cooley, Ph.D. | CodeFreeze, 1/16/2014

AGENDA
- Do you have "Big Data"?
- Not all big data is useful data
- Strengths & weaknesses of data technologies
- Integrating big data technologies into legacy systems

BIG DATA?
Defined obsolescence! A mere TB or two need not apply.
From Wikipedia: "Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set."

CAN YOU BENEFIT FROM BIG DATA PARADIGMS AND TECHNOLOGY?
The Three Vs*
- Volume – the size of the data
- Velocity – the speed of new incoming data
- Variety – the variation of data formats and types
Plus a fourth consideration:
- Concurrency – the amount of simultaneous processing needed

*"3D Data Management: Controlling Data Volume, Velocity, and Variety," Doug Laney, 2/6/2001

EXAMPLE: OPTIMINE SOFTWARE
Optimization and measurement for digital advertising. Data comes in at an advertisement-day level or transaction level.
- Volume? Not really, by today's standards; the entire datacenter is under 20 TB.
- Velocity? Not really; data feeds come in once a day.
- Variety? Yes; hundreds of different data file formats.
- Concurrency? Yes; hundreds of simultaneous processing requests.

JUST BECAUSE IT IS BIG DOESN'T MEAN IT'S USEFUL
- The danger of the big data mindset is collecting and retaining data without a purpose or a plan to utilize it.
- An advantage of legacy systems is that a history of analysis and already-collected data exists to help determine use cases.
- Be on the lookout for "accidental data": data collected from various applications by default, using whatever the default settings happen to be.

EXAMPLES OF ACCIDENTAL DATA
- Over 1M hits per day for a Web site, yet 100% of traffic assigned to a single page
- 99.8% of age fields populated for head of household, yet 20% of the population listed as age 18
- OptiMine example: 28% of conversions from search assigned to search keywords without a click or visit
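To make hunting for accidental data concrete, here is a minimal Java sketch of the kind of smoke test these examples imply: flag any field where an implausible share of records carries a single value (the 20%-are-age-18 pattern above). The record shape, field names, and 10% threshold are all hypothetical, not from OptiMine's pipeline.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of an "accidental data" smoke test. Hypothetical
// record shape: each record is a field-name -> value map.
public class AccidentalDataCheck {
    static final double MAX_SINGLE_VALUE_SHARE = 0.10; // tune per field

    static void checkDefaultSkew(List<Map<String, String>> records, String field) {
        if (records.isEmpty()) return;
        // Count how often each distinct value of the field occurs.
        Map<String, Long> counts = new HashMap<>();
        for (Map<String, String> r : records) {
            counts.merge(r.getOrDefault(field, "<missing>"), 1L, Long::sum);
        }
        // Warn on any value that dominates beyond the threshold -- a
        // common signature of an application default, not real data.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            double share = (double) e.getValue() / records.size();
            if (share > MAX_SINGLE_VALUE_SHARE) {
                System.out.printf(
                    "WARN: %.1f%% of '%s' values are '%s' -- possible default, not data%n",
                    share * 100, field, e.getKey());
            }
        }
    }

    public static void main(String[] args) {
        List<Map<String, String>> sample = List.of(
            Map.of("age", "18"), Map.of("age", "18"), Map.of("age", "34"),
            Map.of("age", "18"), Map.of("age", "52"));
        checkDefaultSkew(sample, "age"); // flags "18" at 60.0%
    }
}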

BIG DATA WITHOUT ANALYSIS IS A BIG WASTE OF RESOURCES
- Collecting data without also investing in an appropriately scaled analytics infrastructure results in a "Data Tomb" (e.g., the NSA collection of CDRs), even if the big data technology streamlines data access.
- In most organizations, IT builds the data infrastructure independently from the business. Make sure an analyst or data scientist has a chance to evaluate the data collection plan and fields. (For OptiMine, the head analyst is also the head of development.)
- Think about possible use cases, but if no one in the organization can come up with one, question the cost of collecting and storing the data.

CURRENT OPTIMINE ETL & STAGING
ETL:
- Phase 1 – Parse, Validate, & Simple Transforms
- Phase 2 – Assign Clean Key
Issues:
- The "T" (transform) in Phase 2 processing is a bottleneck
- Insufficient metadata makes QA difficult
- Only the latest version of the data is stored in the database

RDBMS
Strengths:
- Mature technology
- Variety of technologies available, including MPP architectures (e.g., Teradata, BitYota)
- Very efficient for set operations & relational algebra
- Very efficient for updating data while maintaining data integrity
Weaknesses:
- Not great for procedural operations (e.g., iterators)
- Full transaction-locking overhead is not always needed
- Inserts can be slow due to indexing
- Fixed schema ("schema on write"*)

*Amr Awadallah, co-founder of Cloudera
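A small Java/JDBC sketch of the set-versus-procedural contrast called out above. The connection string, tables (staging_ads, key_map), and columns are hypothetical; the point is the shape of the two approaches, not a real schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SetVsProcedural {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=etl", "user", "pass")) {

            // Set-based: one declarative statement; the engine plans and
            // parallelizes the join. This is what an RDBMS is good at.
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(
                    "UPDATE s SET clean_key = k.clean_key " +
                    "FROM staging_ads s JOIN key_map k ON s.raw_name = k.raw_name");
            }

            // Procedural: the same work as a client-side loop. Each row pays
            // network, parse, and locking overhead -- the pattern an RDBMS
            // handles poorly.
            try (PreparedStatement sel = conn.prepareStatement(
                     "SELECT id, raw_name FROM staging_ads");
                 PreparedStatement upd = conn.prepareStatement(
                     "UPDATE staging_ads SET clean_key = " +
                     "(SELECT clean_key FROM key_map WHERE raw_name = ?) WHERE id = ?");
                 ResultSet rs = sel.executeQuery()) {
                while (rs.next()) {
                    upd.setString(1, rs.getString("raw_name"));
                    upd.setLong(2, rs.getLong("id"));
                    upd.executeUpdate();
                }
            }
        }
    }
}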

ETL TOOLS
Strengths:
- Built-in library of common transforms
- Built-in library of data source connectors
- Typically a drag-and-drop workflow
Weaknesses:
- Expensive, especially for scalable parallel processing

PROCEDURAL PROGRAMMING LANGUAGES
Strengths:
- Flexibility: complex data structures, iterators, recursion
Weaknesses:
- More programming time required compared to higher-level tools (e.g., ETL)
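A minimal Java illustration of why those strengths matter: recursively rolling up spend through an arbitrarily deep campaign hierarchy, which is natural with objects, iterators, and recursion but awkward in pure set-based SQL. The Campaign class and the numbers are invented for illustration.

import java.util.ArrayList;
import java.util.List;

public class HierarchyRollup {
    // A complex, recursive data structure: a campaign with child campaigns.
    static class Campaign {
        String name;
        double spend;
        List<Campaign> children = new ArrayList<>();
        Campaign(String name, double spend) { this.name = name; this.spend = spend; }
    }

    // Recursion: a node's total spend is its own spend plus all descendants'.
    static double totalSpend(Campaign c) {
        double total = c.spend;
        for (Campaign child : c.children) { // iterate over the nested structure
            total += totalSpend(child);
        }
        return total;
    }

    public static void main(String[] args) {
        Campaign root = new Campaign("brand", 100.0);
        Campaign search = new Campaign("search", 50.0);
        search.children.add(new Campaign("keyword-group", 25.0));
        root.children.add(search);
        System.out.println(totalSpend(root)); // prints 175.0
    }
}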

DISTRIBUTED FILE SYSTEM (HADOOP)
Strengths:
- Flexibility: "schema on read"* and full procedural programming power
- Parallelism/redundancy
- Low cost
- Data load speed
Weaknesses:
- Flexibility! "Hadoop makes the easy things hard, but the impossible things possible"
- Often need to add additional tools (Hive, Pig, etc.)
- Evolving technology: the ecosystem is still in flux, with new tools coming and going
- No ability to update, only insert
- Data read speed

*Amr Awadallah, co-founder of Cloudera
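For flavor, a minimal Hadoop MapReduce job in Java: totaling ad spend per keyword straight from raw tab-delimited feed files in HDFS. Note where "schema on read" shows up: the column layout is interpreted in the mapper at read time, not enforced at load. The feed layout assumed here (keyword in column 0, numeric spend in column 3) is an assumption for the example, not OptiMine's actual format.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SpendByKeyword {

    public static class SpendMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Schema on read: the raw line is parsed here, in the mapper.
            String[] cols = line.toString().split("\t");
            if (cols.length > 3) { // skip short/malformed rows; assumes numeric spend
                ctx.write(new Text(cols[0]),
                          new DoubleWritable(Double.parseDouble(cols[3])));
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text keyword, Iterable<DoubleWritable> spends, Context ctx)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable s : spends) total += s.get();
            ctx.write(keyword, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "spend-by-keyword");
        job.setJarByClass(SpendByKeyword.class);
        job.setMapperClass(SpendMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}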

PICK THE PARADIGM FIRST, TOOL SECOND
OptiMine technologies:
- RDBMS – SQL Server 2008
- ETL – SSIS (SQL Server Integration Services)
- Procedural language – Java/Groovy
- Distributed file system – Hadoop (MapReduce)
Issues, mapped to the paradigm that addresses each:
- The transform in Phase 2 processing is a bottleneck → MapReduce
- Insufficient metadata makes QA difficult → RDBMS
- Only the latest version of the data is stored in the database → HDFS

NEW OPTIMINE ETL & STAGING
- HDFS stores all versions of the inbound data
- MapReduce handles the heavy lifting for assigning and updating metadata
- Staging-to-production queries are reduced to simple inner joins
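A hypothetical sketch of what "reduced to simple inner joins" can look like on the RDBMS side once MapReduce has assigned clean keys upstream: promotion becomes a single set-based INSERT ... SELECT. Table and column names are illustrative only, not OptiMine's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class PromoteToProduction {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=etl", "user", "pass");
             Statement stmt = conn.createStatement()) {
            // With clean keys already assigned upstream, promotion is one
            // declarative inner join -- exactly what the RDBMS is good at.
            stmt.executeUpdate(
                "INSERT INTO prod_ad_metrics (clean_key, ad_date, spend, clicks) " +
                "SELECT s.clean_key, s.ad_date, s.spend, s.clicks " +
                "FROM staging_ad_metrics s " +
                "INNER JOIN dim_clean_keys k ON s.clean_key = k.clean_key");
        }
    }
}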

SUMMARY
- Your data doesn't have to be "big" in order to get value out of "big data" technologies.
- Conversely, don't fall into the trap of pursuing "all of the data" just because you have the technology to cheaply store and retrieve it.
- Figure out the right paradigm for the problem first, then select the appropriate technology.