© 2012 IBM Corporation January 19, 2014 The Big Deal About Big Data Dean Compher Data Management Technical Professional for UT, NV

1 © 2012 IBM Corporation January 19, 2014
The Big Deal About Big Data
Dean Compher, Data Management Technical Professional for UT, NV
Slides Created and Provided by: Paul Zikopoulos, Tom Deutsch

2 Why Big Data: How We Got Here

3 In 2005 there were 1.3 billion RFID tags in circulation…
…by the end of 2011, this was about 30 billion and growing even faster

4 An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!
LHR to JFK: 640 TB

5 350B transactions/year
Meter reads every 15 min.
3.65B meter reads/day
120M meter reads/month

6 In August of 2010, Adam Savage, of MythBusters, took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account, including the phrase "Off to work." Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken. By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.

7 The Social Layer in an Instrumented Interconnected World
2+ billion people on the Web by end
billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS enabled devices sold annually
76 million smart meters in 2009… 200M by
TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day

8 Twitter Tweets per Second Record Breakers of 2011

9 Extract Intent, Life Events, Micro Segmentation Attributes
Sample names: Jo Jobs Tina Mu Tom Sit Pauline
Figure labels: Name, Birthday, Family; Not Relevant - Noise; Monetizable Intent; Relocation; Location; Wishful Thinking; SPAMbots

10 Big Data Includes Any of the Following Characteristics
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible
Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Velocity: Streaming data and large volume data movement
Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)

11 Bigger and Bigger Volumes of Data
Retailers collect click-stream data from Web site interactions and loyalty card data
–This traditional POS information is used by retailers for shopping basket analysis, inventory replenishment, +++
–But data is being provided to suppliers for customer buying analysis
Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized
Science is increasingly dominated by big science initiatives
–Large-scale experiments generate over 15 PB of data a year that can't be stored within the data center; it is sent to laboratories
Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading
Improved instrument and sensory technology
–The Large Synoptic Survey Telescope's GPixel camera generates 6PB+ of image data per year; or consider the Oil and Gas industry

12 The Big Data Conundrum
Data AVAILABLE to an organization
Data an organization can PROCESS
The percentage of available data an enterprise can analyze is decreasing in proportion to the data available to it
Quite simply, this means as enterprises, we are getting more naive about our business over time
We don't know what we could already know…

13 Why Not All of Big Data Before: Didn't Have the Tools?

14 Applications for Big Data Analytics
Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO

15 Most Requested Uses of Big Data
Log Analytics & Storage
Smart Grid / Smarter Utilities
RFID Tracking & Analytics
Fraud / Risk Management & Modeling
360° View of the Customer
Warehouse Extension / Call Center Transcript Analysis
Call Detail Record Analysis
+++

16 So What Is Hadoop?

17 Hadoop Background
Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data.
Hadoop was inspired by Google's MapReduce and Google File System papers.
Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language.
Yahoo has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
Hadoop is a paradigm that says that you send your application to the data rather than sending the data to the application

18 What Hadoop Is Not
It is not a replacement for your Database & Warehouse strategy
–Customers need hybrid database/warehouse & Hadoop models
It is not a replacement for your ETL strategy
–Existing data flows aren't typically changed; they are extended
It is not designed for real-time complex event processing like Streams
–Customers are asking for Streams & BigInsights integration

19 So What Is Really New Here?
Cost effective / Linear Scalability
–Hadoop brings massively parallel computing to commodity servers. You can start small and scale linearly as your work requires.
–Storage and modeling at Internet scale rather than small sampling
–Cost profile for supercomputer-level compute capabilities
–Cost per TB of storage enables a superset of information to be modeled
Mixing Structured and Unstructured Data
–Hadoop is schema-less, so it doesn't care about the form the stored data is in, and thus allows a superset of information to be commonly stored. Further, MapReduce can be run effectively on any type of data and is really limited only by the creativity of the developer.
–Structure can be introduced at MapReduce run time based on the keys and values defined in the MapReduce program. Developers can create jobs that run against structured, semi-structured, and even unstructured data.
Inherently flexible in what is modeled and which analytics are run
–Ability to change direction literally on a moment's notice without any design or operational changes
–Since Hadoop is schema-less and can introduce structure on the fly, the type of analytics and the nature of the questions being asked can be changed as often as needed without upfront cost or latency
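
To illustrate how structure can be introduced at run time rather than at load time, here is a minimal Python sketch (plain Python, not Hadoop code). The raw log lines, the field layout, and the names `parse_line` and `count_by` are assumptions made up for this illustration. The point is only that the same stored text can be keyed one way for one question and another way for the next, without reloading or redesigning anything.

```python
from collections import Counter

# Hypothetical raw records, stored exactly as they arrived (no schema at load time).
RAW_LINES = [
    "2012-01-19 10:02:11 GET /cart user=17",
    "2012-01-19 10:02:15 GET /home user=42",
    "malformed line without the expected fields",
    "2012-01-19 10:03:40 POST /cart user=17",
]

def parse_line(line):
    """Apply structure at read time; return None for lines that do not fit."""
    parts = line.split()
    if len(parts) != 5 or not parts[4].startswith("user="):
        return None
    date, time, method, path, user = parts
    return {"date": date, "time": time, "method": method,
            "path": path, "user": user[len("user="):]}

def count_by(lines, key):
    """Turn each parsable line into a (key, 1) pair, then sum the pairs per key."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record[key]] += 1
    return dict(counts)

# Two different questions asked of the same unmodified data.
hits_per_path = count_by(RAW_LINES, "path")
hits_per_user = count_by(RAW_LINES, "user")
```

Changing the question only means changing the key passed to `count_by`; the stored lines are never touched, and lines that do not fit the assumed layout are simply skipped.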

20 Break It Down For Me Here…
Hadoop is a platform and framework, not a database
–It uses both the CPU and disk of single commodity boxes, or nodes
–Boxes can be combined into clusters
–New nodes can be added as needed, without needing to change:
Data formats
How data is loaded
How jobs are written
The applications on top

21 So How Does It Do That?
At its core, Hadoop is made up of:
Map/Reduce
–How Hadoop understands and assigns work to the nodes (machines)
Hadoop Distributed File System = HDFS
–Where Hadoop stores data
–A file system that runs across the nodes in a Hadoop cluster
–It links together the file systems on many local nodes to make them into one big file system

22 What Is HDFS?
The HDFS file system stores data across multiple machines.
HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
–Default is 3 copies: two on the same rack, and one on a different rack
The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS.
–They also serve the data over HTTP, allowing access to all content from a web browser or other client
–Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
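
The replica layout described above (three copies, two on one rack and one on another) can be sketched in a few lines of Python. This is an illustration only and not the actual HDFS placement code; the cluster map, the node and rack names, and the function `place_replicas` are all hypothetical.

```python
import random

# Hypothetical cluster layout: rack name -> data nodes on that rack.
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
    "rack3": ["node7", "node8"],
}

def place_replicas(cluster, rng=random):
    """Pick 3 nodes for one block: two on the same rack, one on a different rack."""
    racks_with_two = [rack for rack, nodes in cluster.items() if len(nodes) >= 2]
    if not racks_with_two:
        raise ValueError("need a rack with at least two data nodes")
    shared_rack = rng.choice(racks_with_two)
    other_racks = [rack for rack, nodes in cluster.items()
                   if rack != shared_rack and nodes]
    if not other_racks:
        raise ValueError("need a second rack with at least one data node")
    other_rack = rng.choice(other_racks)
    pair = rng.sample(cluster[shared_rack], 2)   # two distinct nodes, same rack
    single = rng.choice(cluster[other_rack])     # one node on a different rack
    return [(shared_rack, pair[0]), (shared_rack, pair[1]), (other_rack, single)]

# A seeded generator keeps the illustration repeatable.
placement = place_replicas(CLUSTER, random.Random(0))
```

With this layout, losing any single node still leaves two copies, and losing a whole rack still leaves at least one.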

23 File System on my Laptop

24 HDFS File System Example

25 Map/Reduce Explained
"Map" step:
–The problem is chopped up into many smaller sub-problems. A worker node processes some subset of the smaller problems under the global control of the JobTracker node and stores the result in the local file system where a reducer is able to access it.
"Reduce" step:
–Aggregation: the reduce step aggregates data from the map steps. There can be multiple reduce tasks to parallelize the aggregation, and these tasks are executed on the worker nodes under the control of the JobTracker.

26 The MapReduce Programming Model
"Map" step:
–Problem split into pieces
–Worker nodes process individual pieces in parallel (under global control of the JobTracker node)
–Each worker node stores its result in its local file system where a reducer is able to access it
"Reduce" step:
–Data is aggregated (reduced from the map steps) by worker nodes (under control of the JobTracker)
–Multiple reduce tasks can parallelize the aggregation

27 Map/Reduce Job Example

28 Input records: Murray 38, Salt Lake 39, Bluffdale 35, Sandy 32, Salt Lake 42, Murray 31, Bluffdale 32, Sandy 40, Murray 27, Salt Lake 25, Bluffdale 37, Sandy 32, Salt Lake 23, Murray 30
Map: Sandy 40, Salt Lake 25, Bluffdale 37, Murray 30; Murray 38, Bluffdale 35, Sandy 32, Salt Lake 42
Shuffle: Murray 38, Bluffdale 35, Bluffdale 37, Murray 30; Sandy 40, Salt Lake 25, Sandy 32, Salt Lake 42
Reduce: Murray 38, Bluffdale 37, Sandy 40, Salt Lake 42
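
The city example above can be reproduced with a small single-process Python simulation of the three phases. This is a sketch and not Hadoop code: the split of the 14 input records into two blocks of seven is an assumption inferred from the intermediate values on the slide, and the function names `map_block`, `shuffle`, and `reduce_groups` are hypothetical.

```python
from collections import defaultdict

# Input records from the slide, split into two blocks (the split point is an assumption).
BLOCKS = [
    [("Murray", 38), ("Salt Lake", 39), ("Bluffdale", 35), ("Sandy", 32),
     ("Salt Lake", 42), ("Murray", 31), ("Bluffdale", 32)],
    [("Sandy", 40), ("Murray", 27), ("Salt Lake", 25), ("Bluffdale", 37),
     ("Sandy", 32), ("Salt Lake", 23), ("Murray", 30)],
]

def map_block(block):
    """Map: each worker emits the highest value per city within its own block."""
    local_max = {}
    for city, value in block:
        local_max[city] = max(value, local_max.get(city, value))
    return list(local_max.items())

def shuffle(mapped_blocks):
    """Shuffle: group the intermediate values from all map tasks by city."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for city, value in pairs:
            grouped[city].append(value)
    return dict(grouped)

def reduce_groups(grouped):
    """Reduce: collapse each city's list of values to a single maximum."""
    return {city: max(values) for city, values in grouped.items()}

mapped = [map_block(block) for block in BLOCKS]   # runs per block, in parallel on a cluster
grouped = shuffle(mapped)
result = reduce_groups(grouped)
```

With these inputs, `result` holds Murray 38, Bluffdale 37, Sandy 40 and Salt Lake 42, which matches the Reduce output on the slide.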

29 MapReduce in More Detail
Map-Reduce applications specify the input/output locations and supply map and reduce functions via implementations of appropriate Hadoop interfaces, such as Mapper and Reducer.
These, and other job parameters, comprise the job configuration.
The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker.
The JobTracker then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
The Map/Reduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The vast majority of Map-Reduce applications executed on the Grid do not directly implement the low-level Map-Reduce interfaces; rather they are implemented in a higher-level language, such as Jaql, Pig or BigSheets.
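
The pair-in, pair-out contract described above can be made concrete with a word count written in plain Python. This is a sketch of the idea only: it does not use the Hadoop Mapper and Reducer interfaces, and `map_fn`, `reduce_fn`, and `run_job` are hypothetical names. Here the input pairs are (line number, text) and the output pairs are (word, count), so the input and output types differ.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

def map_fn(line_no: int, line: str) -> Iterator[Tuple[str, int]]:
    """Input pair (line_no, line) -> zero or more intermediate pairs (word, 1)."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Intermediate (word, [1, 1, ...]) -> one output pair (word, total)."""
    return word, sum(counts)

def run_job(lines: List[str]) -> List[Tuple[str, int]]:
    """Stand-in for the framework: call map, sort and group by key, call reduce."""
    intermediate = []
    for line_no, line in enumerate(lines):
        intermediate.extend(map_fn(line_no, line))
    intermediate.sort(key=lambda pair: pair[0])
    return [reduce_fn(word, (count for _, count in group))
            for word, group in groupby(intermediate, key=lambda pair: pair[0])]

output = run_job(["big data is big", "data about data"])
```

The application author writes only `map_fn` and `reduce_fn`; everything inside `run_job` is the part that the framework takes over on a real cluster.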

30 JobTracker and TaskTrackers
Map/Reduce requests are handed to the JobTracker, which is a master controller for the map and reduce tasks.
–Each worker node contains a TaskTracker process which manages work on the local node.
–The JobTracker pushes work out to the TaskTrackers on available worker nodes, striving to keep the work as close to the data as possible
–The JobTracker knows which node contains the data, and which other machines are nearby
–If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack
–This reduces network traffic on the main backbone network.
If a TaskTracker fails or times out, that part of the job is rescheduled
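
The placement preference described above (the node holding the data first, then a node in the same rack, then anything available) can be sketched as a simple selection function. This is a hedged illustration and not the JobTracker's actual scheduler; the topology table `RACK_OF`, the node names, and the function `pick_node` are hypothetical.

```python
# Hypothetical topology: node -> rack.
RACK_OF = {"node1": "rack1", "node2": "rack1", "node3": "rack2", "node4": "rack2"}

def pick_node(data_nodes, free_nodes, rack_of):
    """Choose a free node for a task: data-local first, then rack-local, then any."""
    if not free_nodes:
        return None, "none-available"
    # 1. Prefer a free node that already holds a copy of the data.
    for node in free_nodes:
        if node in data_nodes:
            return node, "data-local"
    # 2. Otherwise prefer a free node in the same rack as a copy of the data.
    data_racks = {rack_of[node] for node in data_nodes}
    for node in free_nodes:
        if rack_of[node] in data_racks:
            return node, "rack-local"
    # 3. Fall back to any free node; the data must then cross the backbone network.
    return free_nodes[0], "off-rack"

local_choice = pick_node({"node1"}, ["node2", "node1"], RACK_OF)
rack_choice = pick_node({"node1"}, ["node3", "node2"], RACK_OF)
remote_choice = pick_node({"node1"}, ["node3", "node4"], RACK_OF)
```

A rescheduled task after a TaskTracker failure would go through the same selection again, with the failed node left out of `free_nodes`.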

31 How To Create Map/Reduce Jobs
Map/reduce development in Java
–Hard; few resources know this
Pig
–Open source language / Apache sub-project
–Becoming a standard
Hive
–Open source language / Apache sub-project
–Provides a SQL-like interface to Hadoop
Jaql
–Invented by IBM Research
–More powerful than Pig when dealing with loosely structured data
–Visa has been a development partner
BigSheets
–BigInsights browser-based application
–Little development required
–You'll use this most often
Skill Required

32 Taken Together - What Does This Result In?
Easy To Scale
–Simply add machines as your data and jobs require
Fault Tolerant and Self-Healing
–Hadoop runs on commodity hardware and provides fault tolerance through software.
–Hardware losses are expected and tolerated
–When you lose a node, the system just redirects work to another location of the data. Nothing stops, nothing breaks, and jobs, applications and users don't even know.
Hadoop Is Data Agnostic
–Hadoop can absorb any type of data, structured or not, from any number of sources.
–Data from many sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
–Hadoop results can be consumed by any system necessary if the output is structured appropriately
Hadoop Is Extremely Flexible
–Start small, scale big
–You can turn nodes off and use them for other needs if required (really)
–Throw any data, in any form or format, you want at it
–What you use it for can be changed on a whim

33 The IBM Big Data Platform

34 Analytic Sandboxes – aka Production
Hadoop capabilities exposed to LOB with some notion of IT support
Not really production in an IBM sense
Really just ad-hoc made visible to more users in the organization
Formal declaration of direction as part of the architecture
Use it, but don't count on it
Not built for security

35 Production Usage with SLAs
SLA driven workloads
–Guaranteed job completion
–Job completion within operational windows
Data Security Requirements
–Problematic if it fails or loses data
–True DR becomes a requirement
–Data quality becomes an issue
–Secure Data Marts become a hard requirement
Integration With The Rest of the Enterprise
–Workload integration becomes an issue
Efficiency Becomes A Hot Topic
–Inefficient utilization on 20 machines isn't an issue, on 500 or it is
Relatively few are really here yet outside of Facebook, Yahoo, LinkedIn, etc…
Few are thinking of this but it is inevitable

36 IBM – Delivers a Platform Not a Product
Hardened Environment
–Removes single points of failure
–Security
–All Components Tested Together
–Operational Processes
–Ready for Production
Mature / Pervasive usage
Deployed and Managed Like Other Mature Data Center Platforms
BIG INSIGHTS
–Text Analytics, Data Mining, Streams, Others

37 The IBM Big Data Platform
InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
IBM Netezza High Capacity Appliance: Queryable Archive Structured Data
IBM Netezza 1000: BI+Ad Hoc Analytics on Structured Data
IBM Smart Analytics System: Operational Analytics on Structured Data
IBM Informix Timeseries: Time-structured analytics
IBM InfoSphere Warehouse: Large volume structured data analytics
InfoSphere Streams: Low Latency Analytics for streaming data
InfoSphere Information Server: High volume data integration and transformation
Categories: MPP Data Warehouse, Stream Computing, Information Integration, Hadoop

38 What Does a Big Data Platform Do?
Analyze Information in Motion
–Streaming data analysis
–Large volume data bursts and ad-hoc analysis
Analyze a Variety of Information
–Novel analytics on a broad set of mixed information that could not be analyzed before
Discover and Experiment
–Ad-hoc analytics, data discovery and experimentation
Analyze Extreme Volumes of Information
–Cost-efficiently process and analyze PBs of information
–Manage & analyze high volumes of structured, relational data
Manage and Plan
–Enforce data structure, integrity and control to ensure consistency for repeatable queries

39 Big Data Enriches the Information Management Ecosystem
Who Ran What, Where, and When?
Audit MapReduce Jobs and tasks
Managing a Governance Initiative
OLTP Optimization (SAP, checkout, +++)
Master Data Enrichment via Life Events, Hobbies, Roles, +++
Establishing Information as a Service
Active Archive Cost Optimization

40 Get More Information…

42 Get the Book
