Presentation on theme: "The Big Deal About Big Data"— Presentation transcript:
1 The Big Deal About Big Data. @db2Dean, facebook.com/db2Dean. This presentation is designed to get you excited about Big Data, to help you understand what it is, how to define it, and how to identify it, and to show what makes IBM so special when it comes to solving client Big Data problems. The objective isn’t to teach everything about our positioning or our products, but rather to generate excitement and leadership. NOTE that not all use cases are found in here; in fact, most use cases in here are leading edge and not the easy ‘low hanging’ fruit. In addition, many use cases are implicit in the examples, and you should encourage clients to think about their own based on the examples given here. Dean Compher, Data Management Technical Professional for UT, NV. Slides Created and Provided by: Paul Zikopoulos, Tom Deustch
2 Why Big Data How We Got Here In this section I will talk about the Big Data era, how to spot it, define it, and most of all, how we got here.
3 In 2005 there were 1.3 billion RFID tags in circulation… by the end of 2011, this was about 30 billion and growing even faster. A great example is radio frequency ID (RFID) tags. These caught lots of attention when Wal-Mart was redesigning its supply chain with them, and the cost of RFID tags has come down so much that they’ve proliferated all over the world. When you think about the Instrumented characteristic of IBM’s Smarter Planet (Instrumented, Interconnected, and Intelligent), this is just one example of how we’ve become an instrumented world. On this slide you can see that in 2005 there were 1.3 billion RFID tags in circulation; this turned into 30 billion by the end of last year (2011). That’s a pretty significant annual growth rate, and again, this is just a single example of instrumentation. RFID tags are a good place to start with Big Data, because they are now ubiquitous, as is the opportunity for Big Data. They are used to track cars on a toll route, food supplies for temperature-controlled transport, livestock, supplies, inventories, luggage, retail, tickets used for transportation, you name it.
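To put a number on that "pretty significant annual growth rate", here is a minimal sketch that computes the compound annual growth rate from the two figures on the slide. It assumes the period is six years (2005 to the end of 2011); the function name is only illustrative.

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate between two counts."""
    return (end / start) ** (1 / years) - 1

# Figures from the slide; the six-year span is an assumption.
rate = annual_growth_rate(start=1.3e9, end=30e9, years=2011 - 2005)
print(f"{rate:.1%} per year")
```

Under that assumption the tag count grew by a bit under 70 percent per year, compounding.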
4 An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics… 1 BILLION lines of code. EACH engine generating 10 TB every 30 minutes! LHR to JFK: 640 TBs. I was on an Airbus plane the other day, and do you realize that these things are hugely sensor-enabled devices that are instrumented to collect data as they operate? They also generate huge volumes of data. +CLICK+ This particular Airbus runs over a billion lines of code, and a single engine generates 10 terabytes of data every 30 minutes. And there are four engines, right? Just taking this particular plane from the UK to New York would generate 640 terabytes of data. Now stop and ponder that for a moment. Propose this amount of data ingestion to your client and it becomes obvious: there’s too much data to process, analyze, and store with traditional approaches.
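The 640 TB figure can be checked against the per-engine rate. This is only a back-of-the-envelope sketch using the numbers on the slide; the flight duration is not stated there, so the code derives the duration those numbers imply.

```python
ENGINES = 4                        # "there are four engines"
TB_PER_ENGINE_PER_HALF_HOUR = 10   # from the slide
FLIGHT_TOTAL_TB = 640              # from the slide (UK to New York)

# Two half-hour periods per hour, for every engine.
tb_per_hour = ENGINES * TB_PER_ENGINE_PER_HALF_HOUR * 2

# Flight time implied by the slide's total (not stated on the slide).
implied_flight_hours = FLIGHT_TOTAL_TB / tb_per_hour

print(tb_per_hour, "TB per hour;", implied_flight_hours, "hours implied")
```

The numbers are consistent with one another if the flight lasts about eight hours.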
5 350B Transactions/Year. Meter Reads every 15 min. You can see in this slide another example of Big Data in the utilities sector: smart metering. As meter reads have moved from a physical read every other month, to a physical read with an estimation every other month, to monthly, weekly, daily, and hourly reads, you’ve got an immense amount of data streaming into the enterprise, as shown on this slide. 350B Transactions/Year, Meter Reads every 15 min. 120M – meter reads/month. 3.65B – meter reads/day.
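One way to read the three figures on this slide is as yearly totals for the same meter population at different read frequencies. The sketch below assumes roughly 10 million meters; that count is not given on the slide and is chosen only because it reproduces the slide's numbers.

```python
METERS = 10_000_000           # assumption, not stated on the slide
READS_PER_DAY_AT_15_MIN = 24 * 4

reads_per_year = {
    "monthly reads": METERS * 12,
    "daily reads": METERS * 365,
    "reads every 15 min": METERS * READS_PER_DAY_AT_15_MIN * 365,
}

for cadence, total in reads_per_year.items():
    print(f"{cadence}: {total:,} per year")
```

Under that assumption, monthly reads give 120M a year, daily reads give 3.65B a year, and 15-minute reads give about 350B a year, which matches the slide's headline figure.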
6 In August of 2010, Adam Savage, of “Myth Busters,” took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account including the phrase “Off to work.” Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken. By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work. The notion is that we are always sharing information about ourselves. For example, this particular Hollywood star actually gave away the location of his house, when he often heads to work, and more, just by uploading a photo with GPS location enabled (the default for smartphones, by the way). The full story of this is located at The US Army had to send guidance and requirements for military phone lockdowns because geo-positioning capabilities of service men and women’s Blackberries and iPhones gave away sensitive location information when unsuspecting service personnel uploaded pictures of themselves in the Iraqi desert.
7 The Social Layer in an Instrumented Interconnected World. 2+ billion people on the Web by end 2011. 30 billion RFID tags today (1.3B in 2005). 4.6 billion camera phones worldwide. 100s of millions of GPS-enabled devices sold annually. 76 million smart meters in 2009… 200M by 2014. 12+ TBs of tweet data every day. ? TBs of data every day. Obviously, there are many other forms of data. Let’s start with the hottest topic associated with Big Data today: social networks. Twitter generates about 12 terabytes of tweet data every single day. Now, keep in mind, these numbers are hard to keep accurate, so the point is that they’re big, right? So don’t fixate on the actual numbers, because they change all the time, and realize that even if these numbers are out of date by 2 years, the volume is too staggering to handle exclusively using traditional approaches. +CLICK+ Facebook over a year ago was generating 25 terabytes of log data every day (Facebook log data reference:) and probably about 7 to 8 terabytes of data that goes up on the Internet. Google, who knows? Look at Google Plus, YouTube, Google Maps, and all that kind of stuff. So that’s the left hand of this chart: the social network layer. Now let’s get back to instrumentation: there are massive amounts of proliferated technologies that allow us to be more interconnected than ever before in the history of the world, and it isn’t just P2P (people to people) interconnections, it’s M2M (machine to machine) as well. Again, with these numbers, who cares what the current number is; I try to keep them updated, but the point is that even if they are out of date, it’s almost unimaginable how large these numbers are. Over 4.6 billion camera phones that leverage built-in GPS to tag your location or your photos, purpose-built GPS devices, smart meters.
If you recall the bridge that collapsed in Minneapolis in the USA a number of years ago, it was rebuilt with smart sensors inside it that measure the contraction of the concrete based on weather conditions, ice build-up, and so much more. I didn’t realise how true it was when Sam P launched Smarter Planet: I thought it was a marketing play. But truly the world is more instrumented, interconnected, and intelligent than it’s ever been before, and this capability allows us to address new problems and gain new insight never before thought possible, and that’s what the Big Data opportunity is going to be all about! 25+ TBs of log data every day
8 Twitter Tweets per Second Record Breakers of 2011. This slide shows the tweets per second (TPS) record breakers for 2011. As you can see, the records keep getting broken, and the topics range from news, to safety, to sport, to shocking, to ‘cult’-like movie followers. The point here is that Twitter is not only growing enormously, but the range of topics runs from emergencies to world events to social commentary to sport to entertainment and all parts in between. Source:
9 Extract Intent, Life Events, Micro Segmentation Attributes. Pauline. Name, Birthday, Family. Tom Sit. Not Relevant - Noise. Tina Mu. Monetizable Intent. Jo Jobs. You can just +CLICK+ through this slide as another example of social media (such as Facebook and Twitter) and the valuable information that can be found within it. Note also that in some cases the information is SPAM and noise, and we want to be able to discard that and find the signals in the noise. The reason why I am showing social media is that it involves heavy text analytics, and that’s the hardest part of Big Data analytics. There are easier use cases, and the IBM platform is terrific at those for sure (such as log analysis). In addition, there are easier ways to use text analytics; for example, use it to get insight into company earnings as it pores through hundreds of pages on the web to spot trends and patterns. Monetizable Intent. Location. Wishful Thinking. Relocation. SPAMbots.
10 Big Data Includes Any of the following Characteristics. Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible. Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text. Velocity: Streaming data and large volume data movement. Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs). We like to define Big Data at IBM as Variety, Velocity and Volume. If you start at the bottom, volume is pretty simple. We all understand we’re going from terabytes to petabytes and into a zettabyte world. I think most of us understand today just how much data is out there now and what’s coming (at least you should after the first couple of slides in this presentation). The variety aspect is something kind of new. Analytics allows us to explore beyond structured data: we want to fold in unstructured data as well. If you look at a Facebook post or a tweet, it may come in a structured format (JSON), but the true value is in the unstructured part: the text that you tweet, your Facebook status, and your posts. Because unstructured text arrives inside a structured wrapper, we refer to that as semi-structured data. So now we’re looking at all sorts of different kinds of data. Finally, there’s velocity. Other vendors who don’t have as big of a Big Data scope as we have at IBM will call velocity the speed at which the volume grows, but I think it’s fair to say that that’s part of volume. We talk about velocity as being how fast the data arrives at the enterprise, and of course that leads to the question of how long it takes you to do something about it. Velocity in this context is a MAJOR IBM differentiator. Now keep in mind that a Big Data problem could involve solely one of these characteristics, or all of them.
11 Bigger and Bigger Volumes of Data. Retailers collect click-stream data from Web site interactions and loyalty card data. This traditional POS information is used by retailers for shopping basket analysis, inventory replenishment, +++. But data is being provided to suppliers for customer buying analysis. Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized. Science is increasingly dominated by big science initiatives. Large-scale experiments generate over 15 PB of data a year that can’t be stored within the data center; it is sent to laboratories. Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading. Improved instrument and sensory technology. Large Synoptic Survey Telescope’s GPixel camera generates 6PB+ of image data per year, or consider the Oil and Gas industry. You can see on this slide just some other examples of different industries generating more and more data. The point is that EVERY industry has a Big Data problem.
12 The Big Data Conundrum. Data AVAILABLE to an organization. Data an organization can PROCESS. The percentage of available data an enterprise can analyze is decreasing in proportion to the data available to it. Quite simply, this means that as enterprises, we are getting “more naive” about our business over time. We don’t know what we could already know…. Signals and Noise. In this slide you can see a graph (it’s not to scale, but you get the point), and this graph shows that the amount of data available to an enterprise is growing enormously; you can see that at the top bar. And as the amount of data available to an organization grows, the percentage of that data the organization can actually process is decreasing. It’s kind of like we’re getting “dumber” as organizations: in proportion to the data we are collecting, we understand less and less of it. +CLICK+ I call the shaded area between these opposite trending lines “The Blind Spot”: it contains signals and noise. This area has got all this data in there, and perhaps it would make sense for us to ingest it into our traditional analytic systems, but we don’t know if that data will yield value or not; it’s a blind spot. We have a hunch that there is value in there, but truly we have no idea what’s in the shaded area. Furthermore, while we believe there is value in here, we know it’s not all going to be valuable, so how do we sift through the noise to find the signals? Can we start ingesting 10 TB of data a day and ask the CIO for her approval to triple OPEX and CAPEX costs on a hunch?
So we have to find a way to find signals from the noise in a cost effective manner.Now if we can leverage some new approach to find the value in the blind spot, at a relatively low cost, if we could tie together things like Big Data social media around our core trusted information that we know about our customers, and drop the stuff that isn’t related to what the business is trying to accomplish, you could really start to monetize that relationship and intents - not just transactions. And that’s the difference, right? How do we monetize intent and relationships? - And that’s a problem domain that includes Big Data.In the previous paragraph I just gave a ubiquitous example, since social media is so obviously tied to Big Data. But you can imagine this dichotomy in any industry. For example, think Oil and Gas (O&G) and the well readings streaming in – and wanting to apply analytics to that with geological data that is unstructured and comes from other sources in various formats and is likely often changing (from an attribute perspective). Harvesting wind energy, traffic patterns, and more.Data an organization can PROCESS
13 Why Not All of Big Data Before: Didn’t have the Tools? We all know there exists a SQL-controlled relational data warehouse (warehouse) today, so why are we at this era of Big Data? I think the two images on this slide really sum it up with a decent analogy around gold mining. Think about the guy on the left, where you see this old-timer gold miner sifting for gold in a river, hoping to find big chunks of gold in his sifter. If someone found big chunks of gold, word spread and that would spark a big gold rush. The find would pave the way for lots of investment, and eventually a town would spring up around this visible find. What’s a characteristic of this scenario? When you look at that gold, you can visually see it, and I would refer to that gold (data) as having a visible value (high value per byte data). You can see it. It’s obvious. It’s valuable, and therefore I can build a business case and invest in bringing this obvious high value per byte data into the warehouse, which indeed is a Big Data technology. Now bringing data into a warehouse is inherently more expensive (for good reasons), because in a warehouse we are taught that this is pristine data, the single version of the truth; it’s got to be enriched, it’s got to be documented, glossarized, transformed; and we do that because we know it is high value per byte data. Now, although mining towns sprung up around a gold find, folks didn’t go and dig up the mountains around the stream. Why? Because there is so much dirt (low value per byte data), and you didn’t have enough information or the right capital equipment to process all that dirt on a hunch. Now think of gold mining today; it’s a very different process than what I outlined on the left. In today’s gold mining, you actually can’t see most of the gold that’s mined. Gold has to be 30 parts per million (ppm) of ore or greater for you to see it, so most gold mined today isn’t visible to the naked eye.
Instead, today there exists massive capital equipment that’s able to process lots and lots of dirt (low value per byte data) and extract strains of gold (high value data) that are otherwise invisible to the naked eye. So today’s gold mine collects all these strains of gold and brings together value (insight). I was watching a gold mining documentary the other day, and they talked about how, after a recent discovery, they chemically treat the dirt to find even finer grains of gold. This particular company was going to go back to the dirt that it had already processed, chemically treat it, and find more gold (value) than what was found in the initial extraction. I think analytics is (or will be) just like that, and that’s yet another reason why Big Data complements the existing warehouse. Five years from now, we’ll be able to do more and more analytically on the data we have today, and we’re going to understand inflection points and trends better than we can today. That’s just one of the reasons why developing a corpus of information, and keeping it, not only makes today’s models more accurate, but presents unknown opportunities for the future.
14 Applications for Big Data Analytics. Smarter Healthcare. Multi-channel sales. Finance. Log Analysis. Homeland Security. Traffic Control. Telecom. Search Quality. So, think about the suitability of applications for IBM Big Data technologies. I am telling you: every single industry has a Big Data opportunity for you. For example, smarter healthcare, where a hospital can pick up the sensor readings off of neonatal babies to try to foreshadow incoming problems based on trends. We work with homeland security today. US President Barack Obama is the Twitter President: when an event happens, he tweets about it, and homeland defence wants to know how people respond and whether there are groups to focus on that are expressing negative sentiment laced with terrorism or wrong-doing. Just look across any industry and you’re going to find some recurring themes. One of those themes is more data, because I (and business, for that matter) believe we can make better decisions when we have access to more data, or can keep that data longer. More data that’s persisted for longer periods of time leads to better models.
So that’s definitely a recurring Big Data theme: “I want to keep more and more data to get better and better insight, and I want to be able to have analysis on the data that—when it’s NOT only structured” There’s unstructured and semi-structured to fold into our mostly structured analytics of today and ALL industries are facing this challenge today (and can benefit from solving it).Below are a list of all kinds of opportunities by industry for Big DataWeb: Social Network Analysis and Clickstream SessionizationMedia: Content Optimization and EngagementTelco: Network Analytics and MediationRetail: Loyalty and Promotion Analysis and Data FactoryFinancial: Fraud Analytics and Trade ReconciliationFederal: Entity Analytics and SIGINTBioinformatics: Sequencing Analysis and Genome MappingFinancial Services – Better and deeper understanding of risk to avoid credit crisisTelecommunications – More reliable networks where we can predict and prevent failure.Media – More content that is lined up with your personal preferencesLife Sciences – Better targeted medicines with fewer complications and side effectsRetail – A personal experience with products and offers that are just what you needGovernment– Government services that are based on hard data, not just gut.Some examples of use cases include:Predict weather patterns to plan optimal wind turbine usage and optimize CAPEX on asset placementDetect life-threatening conditions at hospitals in time to interveneMulti-channel customer sentiment and experience analysisIdentify criminals and threats from disparate video, audio, and data feedsAnalyzing network data to predict failure; for example. how does a network react to fluctuationsThreat analysisTrade surveillance; for example, detecting trading anomalies and harmful behaviorSearch qualityData sandbox; for example, finding patterns or relationships that allow the organization to derive additional value from dataAllows you to model true risk. 
For example, 2,220/100K cardholders that used a specific branded credit card in a drinking place missed 4 payments within a year but, only 530/100K using this card at dentist missed 4 paymentsCustomer churn analysis (CDR and IPDR analysis)Recommendation engine: next best offerGet taste information: “Other who bought this bought….”ManufacturingTrading AnalyticsFraud and RiskRetail: Churn, NBO
15 Most Requested Uses of Big Data. Log Analytics & Storage. Smart Grid / Smarter Utilities. RFID Tracking & Analytics. Fraud / Risk Management & Modeling. 360° View of the Customer. Warehouse Extension / Call Center Transcript Analysis. Call Detail Record Analysis. +++ Here are some of the more typical areas in which I see requests for Big Data use cases.
17 Hadoop Background. Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google’s MapReduce and Google File System papers. Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo has been the largest contributor to the project, and uses Hadoop extensively across its businesses. Hadoop is a paradigm that says that you send your application to the data rather than sending the data to the application.
18 What Hadoop Is Not. It is not a replacement for your Database & Warehouse strategy. Customers need hybrid database/warehouse & hadoop models. It is not a replacement for your ETL strategy. Existing data flows aren’t typically changed, they are extended. It is not designed for real-time complex event processing like Streams. Customers are asking for Streams & BigInsights integration.
19 So What Is Really New Here? Cost effective / Linear Scalability. Hadoop brings massively parallel computing to commodity servers. You can start small and scale linearly as your work requires. Storage and modeling at Internet scale rather than small sampling. Cost profile for super-computer level compute capabilities. Cost per TB of storage enables a superset of information to be modeled. Mixing Structured and Unstructured data. Hadoop is schema-less, so it doesn’t care about the form the stored data is in, and thus allows a superset of information to be commonly stored. Further, MapReduce can be run effectively on any type of data and is really limited only by the creativity of the developer. Structure can be introduced at MapReduce run time based on the keys and values defined in the MapReduce program. Developers can create jobs that run against structured, semi-structured, and even unstructured data. Inherently flexible in what is modeled and what analytics are run. Ability to change direction literally on a moment’s notice without any design or operational changes. Since hadoop is schema-less, and can introduce structure on the fly, the type of analytics and nature of the questions being asked can be changed as often as needed without upfront cost or latency.
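The idea of introducing structure at run time can be illustrated outside Hadoop with a few lines of plain Python. This is only a sketch of the concept, not Hadoop code: the sample records and the `map_record` function are made up for illustration, and the raw lines are stored as-is with no schema until the map function decides what each one is.

```python
import json
from collections import Counter

# Raw lines stored without any schema; all sample data is hypothetical.
RAW_RECORDS = [
    '{"user": "pauline", "text": "Happy birthday to me"}',  # JSON post
    "2012-03-01 10:15:02 ERROR disk full",                   # log line
    "tina,retail,42.50",                                     # CSV row
    "just some free text",                                   # raw text
]

def map_record(line):
    """Impose structure at read time and emit a (key, value) pair."""
    try:
        obj = json.loads(line)
        if isinstance(obj, dict):
            return ("json", sorted(obj))       # value: field names found
    except ValueError:
        pass                                   # not JSON, keep trying
    parts = line.split(",")
    if len(parts) == 3:
        return ("csv", parts)                  # value: the three columns
    tokens = line.split()
    if len(tokens) >= 3 and tokens[2] in {"INFO", "WARN", "ERROR"}:
        return ("log", tokens[2])              # value: the log level
    return ("text", len(tokens))               # value: word count

pairs = [map_record(record) for record in RAW_RECORDS]
counts = Counter(key for key, _ in pairs)
print(pairs)
```

Changing the question being asked means changing `map_record`, not reloading or remodeling the stored data, which is the flexibility the slide describes.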
20 Break It Down For Me Here… Hadoop is a platform and framework, not a database. It uses both the CPU and disk of single commodity boxes, or nodes. Boxes can be combined into clusters. New nodes can be added as needed, and added without needing to change: data formats; how data is loaded; how jobs are written; the applications on top.
21 So How Does It Do That? At its core, hadoop is made up of: Map/Reduce, which is how hadoop understands and assigns work to the nodes (machines); and the Hadoop Distributed File System (HDFS), which is where hadoop stores data. HDFS is a file system that runs across the nodes in a hadoop cluster. It links together the file systems on many local nodes to make them into one big file system.
22 What is HDFS. The HDFS file system stores data across multiple machines. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes. Default is 3 copies: two on the same rack, and one on a different rack. The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS. They also serve the data over HTTP, allowing access to all content from a web browser or other client. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
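The "two on the same rack, one on a different rack" rule can be sketched in a few lines. This is a simplified illustration of the placement rule as stated on the slide, not the actual HDFS placement code; the function, rack, and node names are hypothetical.

```python
def place_replicas(nodes_by_rack, first_rack):
    """Pick three nodes: two on one rack and one on a different rack."""
    other_racks = [rack for rack in nodes_by_rack
                   if rack != first_rack and nodes_by_rack[rack]]
    if len(nodes_by_rack[first_rack]) < 2 or not other_racks:
        raise ValueError("need two nodes on one rack plus a second rack")
    same_rack = nodes_by_rack[first_rack][:2]      # two copies together
    off_rack = nodes_by_rack[other_racks[0]][0]    # one copy elsewhere
    return same_rack + [off_rack]

# Hypothetical two-rack cluster.
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5"],
}
replicas = place_replicas(cluster, "rack1")
print(replicas)
```

With this layout the data survives the loss of any single node and also the loss of an entire rack, since at least one copy always sits on the other rack.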
23 File System on my Laptop. File systems exist on every operating system and allow you to create files, write to files, read from files, copy files, etc. With these commands you don’t have to worry about where data is actually stored on disk. It just gets there magically. In this example, I can manage files using Windows Explorer on my laptop. HDFS has its own commands, but in the same way it stores the files for you.
24 HDFS File System Example. This picture is copied directly from the book “Understanding Big Data”. The HDFS file system lives across several servers and breaks a file into blocks, with each of those blocks being put on internal disk drives on various servers in your cluster.
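As a rough sketch of what "breaks a file into blocks" means, the Python below splits some bytes into fixed-size blocks and spreads them across servers in round-robin order. The tiny block size, the round-robin policy, and the server names are assumptions chosen for illustration; real HDFS uses much larger blocks and its own placement logic.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, servers):
    """Map each block index to a server, cycling through the servers."""
    return {index: servers[index % len(servers)]
            for index in range(len(blocks))}

file_bytes = b"x" * 1000                       # stand-in for a file
blocks = split_into_blocks(file_bytes, 256)    # illustrative block size
placement = assign_blocks(blocks, ["server1", "server2", "server3"])

print([len(block) for block in blocks])
print(placement)
```

The last block is simply shorter than the others, and reading the file back means fetching the blocks in index order from wherever they were placed.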
25 Map/Reduce Explained. "Map" step: The problem is chopped up into many smaller sub-problems. A worker node processes some subset of the smaller problems under the global control of the JobTracker node and stores the result in the local file system where a reducer is able to access it. "Reduce" step: Aggregation. The reduce step aggregates data from the map steps. There can be multiple reduce tasks to parallelize the aggregation, and these tasks are executed on the worker nodes under the control of the JobTracker.
26 The MapReduce Programming Model. "Map" step: Problem split into pieces. Worker nodes process individual pieces in parallel (under global control of the Job Tracker node). Each worker node stores its result in its local file system where a reducer is able to access it. "Reduce" step: Data is aggregated (“reduced” from the map steps) by worker nodes (under control of the Job Tracker). Multiple reduce tasks can parallelize the aggregation. From Wikipedia on MapReduce: MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or as a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured). "Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node. "Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output, the answer to the problem it was originally trying to solve.
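28 Map, Shuffle, Reduce
Input block 1: Murray 38, Salt Lake 39, Bluffdale 35, Sandy 32, Salt Lake 42, Murray 31
Map output 1: Murray 38, Bluffdale 35, Sandy 32, Salt Lake 42
Input block 2: Bluffdale 32, Sandy 40, Murray 27, Salt Lake 25, Bluffdale 37, Sandy 32, Salt Lake 23, Murray 30
Map output 2: Sandy 40, Salt Lake 25, Bluffdale 37, Murray 30
Shuffle to reducer 1: Murray 38, Bluffdale 35 (from block 1); Bluffdale 37, Murray 30 (from block 2)
Reduce output 1: Murray 38, Bluffdale 37
Shuffle to reducer 2: Sandy 32, Salt Lake 42 (from block 1); Sandy 40, Salt Lake 25 (from block 2)
Reduce output 2: Sandy 40, Salt Lake 42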
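The city readings on this slide can be reproduced with a small plain-Python imitation of the three phases. It is a sketch of the idea only, not Hadoop code, and it assumes that the job is "find the highest reading per city" and that the readings arrive in two input blocks, both of which are inferred from the numbers rather than stated on the slide.

```python
from collections import defaultdict

BLOCK_1 = [("Murray", 38), ("Salt Lake", 39), ("Bluffdale", 35),
           ("Sandy", 32), ("Salt Lake", 42), ("Murray", 31)]
BLOCK_2 = [("Bluffdale", 32), ("Sandy", 40), ("Murray", 27),
           ("Salt Lake", 25), ("Bluffdale", 37), ("Sandy", 32),
           ("Salt Lake", 23), ("Murray", 30)]

def map_block(records):
    """Map: each block independently keeps its highest reading per city."""
    best = {}
    for city, reading in records:
        best[city] = max(reading, best.get(city, reading))
    return best

def shuffle(map_outputs):
    """Shuffle: bring all values for the same key (city) together."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for city, reading in output.items():
            grouped[city].append(reading)
    return grouped

def reduce_groups(grouped):
    """Reduce: collapse each city's list of values to a single answer."""
    return {city: max(readings) for city, readings in grouped.items()}

map_outputs = [map_block(BLOCK_1), map_block(BLOCK_2)]
grouped = shuffle(map_outputs)
result = reduce_groups(grouped)
print(result)
```

In a real cluster the two map calls would run on different nodes and the groups would be split between reducers, but the data flow is the same.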
29 MapReduce In more Detail. Map-Reduce applications specify the input/output locations and supply map and reduce functions via implementations of appropriate Hadoop interfaces, such as Mapper and Reducer. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker. The JobTracker then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client. The Map/Reduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. The vast majority of Map-Reduce applications executed on the Grid do not directly implement the low-level Map-Reduce interfaces; rather they are implemented in a higher-level language, such as Jaql, Pig or BigSheets.
30 JobTracker and TaskTrackers. Map/Reduce requests are handed to the Job Tracker, which is a master controller for the map and reduce tasks. Each worker node contains a Task Tracker process which manages work on the local node. The Job Tracker pushes work out to the Task Trackers on available worker nodes, striving to keep the work as close to the data as possible. The Job Tracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a Task Tracker fails or times out, that part of the job is rescheduled.
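The "keep the work as close to the data as possible" preference can be written down as a simple three-tier choice. This is a hedged sketch of the policy as described on the slide, not the JobTracker's real scheduler; the function name and the node and rack names are hypothetical.

```python
def choose_node(data_nodes, free_nodes, rack_of):
    """Prefer a node holding the data, then its rack, then anywhere."""
    for node in free_nodes:
        if node in data_nodes:
            return node, "node-local"
    data_racks = {rack_of[node] for node in data_nodes}
    for node in free_nodes:
        if rack_of[node] in data_racks:
            return node, "rack-local"
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no free node"

# Hypothetical cluster layout.
rack_of = {"node1": "rack1", "node2": "rack1",
           "node3": "rack2", "node4": "rack2"}
data_nodes = {"node1"}   # nodes that hold a copy of the block

print(choose_node(data_nodes, ["node3", "node1"], rack_of))
print(choose_node(data_nodes, ["node3", "node2"], rack_of))
print(choose_node(data_nodes, ["node3", "node4"], rack_of))
```

Only the last case sends the block across the backbone between racks, which is the traffic the slide says the Job Tracker tries to avoid.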
31 How To Create Map/Reduce Jobs (Skill Required). Map/reduce development in Java: hard, few resources that know this. Pig: open source language / Apache sub-project; becoming a “standard”. Hive: provides a SQL-like interface to hadoop. Jaql: invented by IBM Research; more powerful than Pig when dealing with loosely structured data; Visa has been a development partner. BigSheets: BigInsights browser-based application; little development required; you’ll use this most often.
32 Taken Together - What Does This Result In? Easy To Scale. Simply add machines as your data and jobs require. Fault Tolerant and Self-Healing. Hadoop runs on commodity hardware and provides fault tolerance through software. Hardware losses are expected and tolerated. When you lose a node, the system just redirects work to another location of the data and nothing stops, nothing breaks; jobs, applications and users don’t even know. Hadoop Is Data Agnostic. Hadoop can absorb any type of data, structured or not, from any number of sources. Data from many sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide. Hadoop results can be consumed by any system necessary if the output is structured appropriately. Hadoop Is Extremely Flexible. Start small, scale big. You can turn nodes “off” and use them for other needs if required (really). Throw any data, in any form or format, you want at it. What you use it for can be changed on a whim.
33 The IBM Big Data Platform. This is the IBM Big Data platform; you can see it’s a very rich and capable platform. It includes traditional Big Data technologies such as Netezza that have been used to address the more traditional Big Data problems, and enriches them with new age NoSQL-like technologies that include velocity and variety capabilities as well. And of course, if the data in these systems can’t be integrated, then what’s the point? That is why Information Integration is such a key part of the IBM Big Data platform. Finally, +CLICK+ this is the area that I am going to focus on in this presentation, as well as the application development and system management portion.
34 Analytic Sandboxes – aka “Production”. Hadoop capabilities exposed to LOB with some notion of IT support. Not really production in an IBM sense. Really “just” ad-hoc made visible to more users in the organization. Formal declaration of direction as part of the architecture. “Use it, but don’t count on it”. Not built for security.
35 Production Usage with SLAs. SLA driven workloads. Guaranteed job completion. Job completion within operational windows. Data Security Requirements. Problematic if it fails or loses data. True DR becomes a requirement. Data quality becomes an issue. Secure Data Marts become a hard requirement. Integration With The Rest of the Enterprise. Workload integration becomes an issue. Efficiency Becomes A Hot Topic. Inefficient utilization on 20 machines isn’t an issue; on 500 or more it is. Relatively few are really here yet outside of Facebook, Yahoo, LinkedIn, etc… Few are thinking of this but it is inevitable.
36 IBM – Delivers a Platform Not a Product. Hardened Environment. Removes single points of failure. Security. All Components Tested Together. Operational Processes. Ready for Production. Mature / Pervasive usage. Deployed and Managed Like Other Mature Data Center Platforms. BIG INSIGHTS. Text Analytics, Data Mining, Streams, Others.
37 The IBM Big Data Platform InfoSphere BigInsights Hadoop-based low latency analytics for variety and volumeHadoopInformation IntegrationStream ComputingInfoSphere Information Server High volume data integration and transformationInfoSphere Streams Low Latency Analytics for streaming dataThis is a closer look at the Big Data platform. You can see the product view and how each fits into the IBM Big Data platform.MPP Data WarehouseIBM InfoSphere Warehouse Large volume structured data analyticsIBM Netezza High Capacity Appliance Queryable Archive Structured DataIBM Netezza 1000 BI+Ad Hoc Analytics on Structured DataIBM Smart Analytics System Operational Analytics on Structured DataIBM Informix Timeseries Time-structured analytics
38 What Does a Big Data Platform Do? Analyze a Variety of InformationNovel analytics on a broad set of mixed information that could not be analyzed beforeAnalyze Information in MotionStreaming data analysis Large volume data bursts and ad-hoc analysisAnalyze Extreme Volumes of InformationCost-efficiently process and analyze PBs of information Manage & analyze high volumes of structured, relational dataWhen a vendor delivers a Big Data platform, such as IBM, it creates the ability to do a lot of new age things: some of those shown on this chart. It also gives them the ability to do things they are doing today – better. Now of course you have to ask yourself, how can you do this, and the answer is going to be a platform. That platform has to be rich in tooling, integration, and core capability and that’s EXACTLY what IBM is delivering today.Discover and ExperimentAd-hoc analytics, data discovery and experimentationManage and PlanEnforce data structure, integrity and control to ensure consistency for repeatable queries
39 Big Data Enriches the Information Management Ecosystem. Who Ran What, Where, and When? Audit MapReduce Jobs and tasks. Managing a Governance Initiative. OLTP Optimization (SAP, checkout, +++). Master Data Enrichment via Life Events, Hobbies, Roles, +++. Establishing Information as a Service. Active Archive Cost Optimization. This slide shows the IBM Big Data Platform. As you can see, it’s a rich and capable portfolio that you can use to address Big Data problems, and it shows an even richer value proposition as you see how the broader IBM IM portfolio integrates with Big Data technologies. +CLICK+ Master Data Management (MDM) is a brand within the InfoSphere platform. MDM’s role within InfoSphere is to help create trusted information, so that the information you have on your customers, your clients, your organizations, your citizens or even your criminals is accurate, timely and actionable. Its ability to provide that data makes it a complementary technology to work with the Big Data platform, as well as our Data Warehouse and our platform technologies.