“Big Data” - Technical Architecture
AGENDA
Foundational Definitions & where these technologies came from
  - Big Data
  - NoSQL
  - Hadoop
Business & Technical Drivers
How they are being used in many companies
Predictions for the future
Challenges & Obstacles
Questions
RONI
“Big Data” - Technical Architecture
Foundational Definition - Big Data
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
Big data can be characterized by the 3Vs: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed. There are many other aspects as well, such as viscosity, complexity, and ambiguity.
RONI
These technologies are not just about volumes of data. We now have the ability to capture data without having to pre-declare all of its potential uses, and to position ourselves to adjust to the velocity of change more readily (schema-on-read; leveraging/preventing data exhaust, etc.). Another way to stress this is: "we don't know what tomorrow's questions are going to be, but we expect they will need some of today's data to answer." How do I ask the question that I didn't know I wanted to ask?
Data in a corporation that cannot be processed using traditional data management techniques and technologies can be broadly classified as Big Data.
“Big Data” - Technical Architecture
Big Data ≠ Hadoop
Big Data ≠ NoSQL
Hadoop ≠ NoSQL
RONI
Hadoop & NoSQL are key technologies for working with Big Data effectively.
“Big Data” - Technical Architecture
RONI
Big Data in 2015 - Power to the People!
December 16, 2014

Last year I speculated that the big data 'power curve' in 2014 would be shaped by business demands for data blending. Customers presenting at our debut PentahoWorld conference last October, from Paytronix, to RichRelevance, to NASDAQ, certainly proved my speculations to be true. Businesses like these are examples of how increasingly large and varied data sets can be used to deliver high and sustainable ROI. In fact, Ventana Research recently confirmed that 22 percent of organizations now use upwards of 20 data sources, and 19 percent use between 11 and 20 data sources.

Moving into 2015, and fired up by their initial big data bounties, businesses will seek even more power to explore data freely, structure their own data blends, and gain profitable insights faster. They know "there's gold in them hills" and they want to mine for even more! With that said, here are my big data predictions for 2015:

Big Data Meets the Big Blender!
The digital universe is exploding at a rate that even Carl Sagan might struggle to articulate. Analysts believe it's doubling every year, with the unstructured component doubling every three months. By 2025, IDC estimates that 40 percent of the digital universe will be generated by machine data and devices, while unstructured data is getting all the headlines. The ROI business use cases we've seen require the blending of unstructured data with more traditional, relational data. For example, one of the most common use cases we are helping companies create is a 360 view of their customers. The de facto reference architecture involves blending relational/transactional data detailing what the customer has bought with unstructured weblog and clickstream data highlighting customer behavior patterns around what they might buy in the future. This blended data set is further mashed up with social media data describing sentiment around the company's products and customer demographics. This "Big Blend" is fed into recommendation platforms to drive higher conversion rates, increase sales, and improve customer engagement. This "blended data" approach is fundamental to other popular big data use cases like Internet of Things, security and intelligence applications, supply chain management, and regulatory and compliance demands in the Financial Services, Healthcare and Telco industries.

Internet of Things Will Fuel the New 'Industrial Internet'
Early big data adoption drove the birth of new business models at companies like our customers Beachmint and Paytronix. In 2015, I'm convinced that we'll see big data starting to transform traditional industrial businesses by delivering operational, strategic and competitive advantage. Germany is running an ambitious Industry 4.0 project to create "Smart Factories" that are flexible, resource efficient, ergonomic, and integrated with customers and business partners. The machine data generated from sensors and devices is fueling key opportunities like Smart Homes, Smart Cities, and Smart Medicine, which all require big data analytics. Much like the 'Industrial Internet' movement in the U.S., Industry 4.0 is being defined by the Internet of Things. According to Wikibon, the value of efficiency from machine data could reach close to $1.3 trillion and will drive $514B in IT spend by 2020. The bottlenecks are challenges related to data security and governance, data silos, and systems integration.

Big Data Gets Cloudy!
As companies with huge data volumes seek to operate in more elastic environments, we're starting to see some running all, or part of, their big data infrastructures in the cloud. This says to me that the cloud is now "IT approved" as a safe, secure, and flexible data host. At PentahoWorld, I told a story about a "big data throw down" that occurred during our Strategic Advisory Board meeting. At one point in the meeting, two enterprise customers in highly regulated industries started one-upping each other about how much data they stored in the Amazon Redshift cloud. One shared that they processed and analyzed 5-7 billion records daily. The next shared that they stored half a petabyte of new data every day and, on top of that, had to hold the data for seven years while still making it available for quick analysis. Both of these customers are held to the highest standards for data governance and compliance - regardless of who won, the forecast for their big data environments is the cloud!

Embedded Analytics is the New BI
Although "classic BI," which involves a business analyst looking at data with a separate tool outside the flow of the business application, will be around for a while, a new wave is rising in which business users increasingly consume analytics embedded within applications to drive faster, smarter decisions. Gartner's latest research estimates that more than half the enterprises that use BI now use embedded analytics. Whether it's a RichRelevance data scientist building a predictive algorithm for a recommendation engine, or a marketing director accessing Marketo to consume analytics related to lead scoring or campaign effectiveness, the way our customers are deploying Pentaho leaves me with no doubt that this prediction will bear out. As classic BI matured, we witnessed a final "tsunami" in which data visualization and self-service inspired business people to imagine the potential for advanced analytics. Users could finally see all their data - warts and all - and also start to experiment with rudimentary blending techniques. Self-service and data visualization prepared the market for what I firmly expect to be the most significant analytics trend in 2015...

Data Refineries Give Real Power to the People!
The big data stakes are higher than ever before. No longer just about quantifying 'virtual' assets like sentiment and preference, analytics are starting to inform how we manage physical assets like inventory, machines and energy. This means companies must turn their focus to the traditional ETL processes that result in safe, clean and trustworthy data. However, for the types of ROI use cases we're talking about today, this traditional IT process needs to be made fast, easy, highly scalable, cloud-friendly and accessible to the business. And this has been a stumbling block - until now. Enter Pentaho's Streamlined Data Refinery, a market-disrupting innovation that effectively brings the power of governed data delivery to "the people," unlocking big data's full operational potential. I'm tremendously excited about 2015 and the journey we're on with both customers and partners. You're going to hear a lot more about the Streamlined Data Refinery in 2015 - and that's a prediction I can guarantee will come true! Finally, as I promised at PentahoWorld in October, we're only going to succeed when you tell us you've delivered an ROI for your business. Let's go forth and prosper together in 2015!

Quentin Gallivan, CEO, Pentaho
“Big Data” - Technical Architecture
Foundational Definition - NoSQL
NoSQL, also called "Not Only SQL," is an approach to data management and database design that's useful for very large sets of distributed data.
NoSQL seeks to solve the scalability and big data performance issues that relational databases weren't designed to address.
NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data, or data that's stored remotely on multiple virtual servers in the cloud.
However - NoSQL is not just about Big Data.
RONI
“Big Data” - Technical Architecture
Where this technology came from - NoSQL
Timeline: 1970, 1980, 1990, 2000, 2005, 2007, 2010, 2014+
  - Flat files
  - Rise of relational databases
  - Relational database dominance
  - Rise of object databases
  - Document DBs inspired by Lotus Notes
  - Key-value stores: the need to store tabular data in distributed systems and replicate data for 24x7 availability
  - Many innovators in the 2005 to 2010 timeframe
  - Polyglot persistence: enterprises will have a variety of different data storage technologies for different kinds of data & application needs
RONI
“Big Data” - Technical Architecture
Market view of what's out there - we do NOT have all of these at PFG today.
There are over 150 NoSQL databases in the market - these are just a few of the top ones.
RONI
“Big Data” - Data Architecture at PFG
Foundational Definition - Hadoop
Hadoop is an open source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
TOM
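To make the MapReduce processing model concrete, here is a minimal word-count sketch in the Hadoop Streaming style: a mapper emits (word, 1) pairs and a reducer sums the counts for each word after the framework sorts by key. This is an illustrative sketch, not part of the deck; the file name and invocation are assumptions.

# wordcount_streaming.py - illustrative mapper/reducer for Hadoop Streaming (hypothetical file name)
# Run as "python wordcount_streaming.py map" or "python wordcount_streaming.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin; Hadoop sorts these pairs by key
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer():
    # Input arrives grouped by key (word); sum the counts for each word
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

Under an assumed Hadoop Streaming setup, this would be wired up roughly as: hadoop jar hadoop-streaming.jar -files wordcount_streaming.py -mapper "python wordcount_streaming.py map" -reducer "python wordcount_streaming.py reduce" -input /data/in -output /data/out (the jar location and paths are illustrative).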
“Big Data” - Data Architecture at PFG
Where this technology came from - Hadoop
TOM
1995 - 2005: Yahoo! Search team builds 4+ generations of systems to crawl & index the WWW. 20 billion pages!
2004: Google publishes the Google File System & MapReduce papers.
2005: Yahoo! staffs 'Juggernaut', an open source DFS & MapReduce effort. Doug Cutting builds the Nutch DFS & MapReduce and joins Yahoo!
  - Compete / differentiate via open source contribution
  - Attract scientists - become a known center of big data excellence
  - Avoid building proprietary systems that will be obsolesced
  - Gain the leverage of a wider community building one infrastructure
2006: Juggernaut & Nutch join forces - Hadoop is born! The Nutch prototype is used to seed the new Apache Hadoop project, and Yahoo! commits to scaling Hadoop and staffs a Hadoop team.
2010: Other Internet companies add tools and frameworks to enhance Hadoop; service providers step into the market to provide training, support, & hosting; 'Enterprise Grade' security; analytic tool interoperability.
2014+: Mass adoption.
“Big Data” - Technical Architecture
The Hadoop Vendor Landscape
TOM
This is a snapshot from "The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014".
We implemented our first Hadoop cluster in the fall. The major contenders were Cloudera, Hortonworks, and MapR. We wanted an open source solution that we could host on-prem. We chose Cloudera based on market presence, vision, and ability to help us implement quickly.
“Big Data” - Technical Architecture
Business Drivers
  - Provide access to all data needed for analytics (internal or external)
  - Provide the ability to realistically interact with greater 'depths' of data - i.e., tens of years instead of a couple of months
  - Provide a greater "speed to insight" for all types of requests
  - Lower the total cost of ownership across the enterprise for analytics
  - Allow for exploration of our data in ways we never anticipated, to identify differentiating understanding of customers and markets
There's an imbalance today...
RONI
Internal: Documents, Call Center Data, Project Data, Audio, Video, 10-K & Financial Data, Contracts, Sales & Marketing Data
External: Forums, Twitter, Facebook, Public Domain Data
“Big Data” - Technical Architecture
Technical Drivers - Current technical capabilities don't align with changing expectations
TOM
Internal: Documents, Call Center Data, Project Data, Audio, Video, 10-K & Financial Data, Contracts, Sales & Marketing Data
External: Forums, Twitter, Facebook, Public Domain Data
“Big Data” - Technical Architecture
How they are being used today
NoSQL
  - Not focused on Big Data... yet
  - Many companies are using, or at least experimenting with, MongoDB
  - Document store for web applications that only need to persist content for the lifespan of that interaction
  - Using NoSQL stores for user preferences to personalize what is presented on a web page for an interaction
  - Beginning to organize social streams of data
Hadoop
  - Interrogating our web logs to better understand the behavior of people interacting with a website
  - Merging that semi-structured web activity with other structured legacy data
  - Massive storage of data for exploration and discovery - often using interoperability with analytic consumption tools
TAG TEAM
Schema-on-Write: RDBMS
  - Schema must be created before data is loaded.
  - An explicit load operation has to take place which transforms the data to the internal structure of the database.
  - New columns must be added explicitly before data for such columns can be loaded into the database.
  - Read is fast.
  - Standards/governance.
Schema-on-Read: Hadoop
  - Data is simply copied to the file store; no special transformation is needed.
  - A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
  - New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
  - Load is fast.
  - Evolving schemas/agility.
(A small schema-on-write vs. schema-on-read sketch follows below.)
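To illustrate the contrast, here is a minimal, hypothetical Python sketch (not from the deck): the schema-on-write path declares a table and transforms records before loading, while the schema-on-read path lands raw JSON lines untouched and projects the needed fields only when they are read. The field and table names are assumptions for illustration.

import json
import sqlite3

raw_events = [
    '{"user": "a123", "page": "/quotes", "ms": 420}',
    '{"user": "b456", "page": "/claims", "ms": 95, "referrer": "email"}',  # a new field appears
]

# Schema-on-write: declare the structure first, transform on load
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weblog (user TEXT, page TEXT, ms INTEGER)")
for line in raw_events:
    rec = json.loads(line)
    db.execute("INSERT INTO weblog VALUES (?, ?, ?)", (rec["user"], rec["page"], rec["ms"]))
    # 'referrer' is silently lost until the schema is altered and the data reloaded

# Schema-on-read: land the raw lines as-is, apply structure only when querying
landed = list(raw_events)  # stands in for files copied into HDFS

def read_column(lines, column):
    # The "SerDe": parse at read time and project whichever field is asked for
    return [json.loads(l).get(column) for l in lines]

print(read_column(landed, "referrer"))  # [None, 'email'] - the new field is queryable immediately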
“Big Data” - Technical Architecture
Plans for the future
Hadoop
  - Expansion of web activity data (more logs, more data in logs, more use cases)
  - Speech-to-text translation of call recordings, and text analysis / natural language processing to determine call topics and caller sentiment
  - Extraction of text from documents to aid in analysis
  - 'Data Lake' solutioning - both for ingestion and archive
NoSQL
  - Database for web applications that need that speed of development and nimbleness
  - Layering of NoSQL solutions on top of Hadoop to improve searchability and performance
  - Exploration of graph NoSQL solutions for analytics on hierarchical-type data
TAG TEAM
Caller sentiment: Was it a positive/negative interaction? How does that influence future behavior?
Text analytics: Think supplier-negotiated contracts. Can we find all that have a particular clause in them? Or, if we have new contract language, how many and which need to be updated?
“Big Data” - Technical Architecture
Lake of Data / Data Refinery
TOM
Data Lake or Data Refinery?
We run into trouble when people envision a "Lake of Data." It is often visualized as a single large entity where all the wild animals come to drink.
A better analogy is to think of Hadoop as a "Data Refinery." Crude oil enters and is "cooked" to deliver different data and different formats for different consumers. At the bottom it is crude oil: raw, untransformed, managed for speed of acquisition and only fit for consumption by the quintessential "Data Scientist." As it moves up the stack it is filtered, formatted, business rules are applied, and it is delivered to power users for parameterized reporting. At the top, the data is formatted for performance and stored in tools for general consumption by the masses.
“Big Data” - Technical Architecture
Data Refinery
TOM
“Big Data” - Technical Architecture
Many kinds of data in our organization
RONI
Conceptual, for illustration - not a vetted/approved picture of the PFG environment.
“Big Data” - Technical Architecture
Conceptual Workload Isolation Today...
RONI
Conceptual, for illustration - not a vetted/approved picture of the PFG environment.
“Big Data” - Technical Architecture
Conceptual Workload Isolation in the Future...
RONI
Conceptual, for illustration - not a vetted/approved picture of the PFG environment.
“Big Data” - Technical Architecture
Big Data technologies are broader than just Hadoop & NoSQL - but those are the key starting points for us.
Market view of what's out there - we do NOT have all of these at PFG today.
TOM
“Big Data” - Technical Architecture
Challenges and obstacles to overcome
  - Security
  - Governance
  - Clear use cases
  - Integration points
  - Hosting models
TOM
Security: Hadoop, through various projects, can apply database-like security. If we use it, we can only access the data like a database.
"Big Data" governance: as we seek value from new data sources we need to ensure we are applying security, retention, and auditing consistent with corporate standards. We may see a shifting of roles and accountability from traditional roles to new ones depending on the data source (i.e., a shift from DBAs to analytics professionals).
Clear use cases: we have configured our early environments based on a specific use case. We are making the data available with the expectation that we will find new use cases. The goal for now is not to "paint ourselves into corners" as we learn.
Integration points: what tools outside of Hadoop will integrate? e.g., SAS, SAS VA, R, Business Objects, MicroStrategy, etc.
Hosting models: many of our new "Big Data" technologies go against the grain of our modern data center (consolidated/dense). We need a good understanding of security, data sources, and integration to make an informed decision. This may push us to have multiple distributions to meet different use cases.
NoSQL Data Architecture & Best Practices
Data View - Overview
We are in a database revolution. Existing paradigms are being challenged:
  - Models
  - Hardware
  - Software
  - Languages
Will tweaking current data solutions be enough?
NoSQL Data Architecture & Best Practices
Data View - Overview
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Relational Model
PROs
  - Most flexible queries & updates
  - Reuse data structures in any context
  - Great DB-to-DB integration
  - Mature tools
  - Standard query language
  - Easy to hire expertise
CONs
  - Design-time, static relationships
  - Design-time, static structures: design first, then load data
  - Hard to normalize the model
  - Requires code to integrate relational data with object-oriented code
  - Cannot query for relevance
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Dimensional Model
PROs
  - Queries facts in context
  - Self-service, ad hoc queries
  - High-performance platforms
  - Mature tools and integration
  - Standard query language
  - Turns data into information
CONs
  - Expensive platforms
  - Design-time, static relationships
  - Design-time, static structures: design first, then load data
  - Cannot query for relevance
  - Cannot query for answers that are not built into the model
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms
What's wrong (aka challenging) with SQL DBs?
  - Relevance
  - Velocity
  - Volume
  - Variety
  - Variability
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Key-Value / Column Family Models
PROs
  - Fast puts and gets
  - Massive scalability
  - Easy to shard & replicate
  - Data colocation
  - Simple to model
  - Inexpensive
  - Data in transactional context
  - Developer in control
CONs
  - Carefully design the key
  - Shred JSON into flat columns
  - Secondary indexes required to query outside of the hierarchical key
  - No standard query API or language
  - Hand code all joins in the app
  - Immature tools and platforms
  - Hard to integrate and hire
(See the key-design sketch below.)
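As a hedged illustration of "carefully design the key" and "secondary indexes required to query outside of the hierarchical key," here is a small Python sketch using plain dictionaries as a stand-in for a key-value store. The key layout and field names are invented; this is not a specific product's API.

# Stand-in for a key-value store: lookups only work by exact key
kv = {}

def put(customer_id, policy_id, doc):
    # The composite key has to be designed up front: all access paths hang off it
    kv[f"customer:{customer_id}:policy:{policy_id}"] = doc

def get(customer_id, policy_id):
    return kv[f"customer:{customer_id}:policy:{policy_id}"]

put("c001", "p10", {"type": "auto", "state": "IA"})
put("c001", "p11", {"type": "home", "state": "IA"})
put("c002", "p20", {"type": "auto", "state": "MN"})

# Querying outside the key ("all auto policies") needs a hand-maintained secondary index...
by_type = {}
for key, doc in kv.items():
    by_type.setdefault(doc["type"], []).append(key)

print(by_type["auto"])  # ['customer:c001:policy:p10', 'customer:c002:policy:p20']
# ...and any "join" back to customer data is code you write and keep consistent yourself.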
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Document Model
PROs
  - Fast development
  - "Schemaless", run-time designed, rich JSON and/or XML data structures
  - Queries everything in context
  - Self-service, ad hoc queries
  - Turns data into information
  - Can query for relevance
CONs
  - Defensive programming for unexpected data structures
  - Expensive platforms, immature tools, and hard to integrate
  - Non-standard query languages, and hard to hire expertise
  - Not as fast as column-family / key-value databases
(A short defensive-programming sketch follows.)
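To show what "defensive programming for unexpected data structures" tends to look like in practice, here is a small, hypothetical Python sketch over an in-memory list of documents; the document shapes and field names are invented for illustration.

# Two "documents" from the same collection with different shapes - normal in a schemaless store
docs = [
    {"name": "Ann", "phones": ["555-0101", "555-0102"], "address": {"city": "Des Moines"}},
    {"name": "Bo", "phone": "555-0200"},  # older shape: single phone field, no address
]

def phone_numbers(doc):
    # Defensive reads: nothing about the structure can be assumed at query time
    if isinstance(doc.get("phones"), list):
        return doc["phones"]
    if doc.get("phone"):
        return [doc["phone"]]
    return []

def city(doc):
    return (doc.get("address") or {}).get("city", "unknown")

for d in docs:
    print(d["name"], phone_numbers(d), city(d))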
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Graph Model
PROs
  - Unlimited flexibility - model any structure
  - Run-time definition of types & relationships
  - Relate anything to anything in any way
  - Query relationship patterns
  - Standard query language (SPARQL)
  - Creates maximum context around data
CONs
  - Hard to model at such a low level
  - Hard to integrate with other systems
  - Immature tools
  - Hard to hire expertise
  - Cannot query for relevance because the original document context is not preserved
(A tiny triple-pattern sketch follows.)
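To make "query relationship patterns" concrete, here is a tiny, hypothetical Python sketch of an RDF-style triple store with a pattern-matching query - a plain-Python stand-in for what SPARQL expresses declaratively. The subjects, predicates, and objects are invented examples.

# A graph as (subject, predicate, object) triples - relationships are defined at run time
triples = [
    ("ann", "works_for", "claims_dept"),
    ("bo", "works_for", "claims_dept"),
    ("claims_dept", "part_of", "insurance_division"),
    ("ann", "manages", "bo"),
]

def match(s=None, p=None, o=None):
    # None acts like a query variable: return every triple that fits the pattern
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)]

# "Who works for the claims department?"  (?x works_for claims_dept)
print([s for s, _, _ in match(p="works_for", o="claims_dept")])  # ['ann', 'bo']

# Two-hop pattern: which departments do Ann's direct reports work in?
reports = [o for _, _, o in match(s="ann", p="manages")]
print([o for r in reports for _, _, o in match(s=r, p="works_for")])  # ['claims_dept']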
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms
What's wrong (aka challenging) with NoSQL DBs?
The developer is responsible for consistency (handling threading):
  - Locks
  - Contention
  - Serialization
  - Deadlocks
  - Race conditions
  - Threading bugs
NoSQL Data Architecture & Best Practices
Data View
NoSQL Data Architecture & Best Practices
Data View - Modeling Takeaways
Each model has a specialized purpose:
  - Dimensional: business intelligence reporting and analytics
  - Relational: flexible queries, joins, updates; mature, standard
  - Column / Key-Value: simple, fast puts and gets, massively scalable
  - Document: fast development, "schemaless" JSON/XML, searchable
  - Graph / RDF: modeling anything at runtime, including relationships
NoSQL Data Architecture & Best Practices
Data View - How do you choose?
How much Durability do you need?
  Durable data survives system failures and can be recovered after unwanted deletion.
How much Atomicity do you need?
  An atomic transaction is all or nothing - sets of data and/or sets of commands.
How much Isolation do you need?
  Isolation prevents concurrent transactions from affecting each other.
How much Consistency do you need (or when do you need it)?
  Consistency exists when data is committed and consistent with all data rules at a point in time.
(An atomicity sketch follows below.)
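As a hedged illustration of the atomicity question, here is a small Python sketch contrasting an all-or-nothing transfer in SQLite (an ACID engine) with the same two writes against a plain key-value stand-in, where a failure between the puts leaves partial data behind. The table, account IDs, and amounts are assumptions for illustration.

import sqlite3

# Atomic: both updates commit together or neither does
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
try:
    with db:  # opens a transaction; rolls back automatically on any exception
        db.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 'a'")
        if db.execute("SELECT balance FROM accounts WHERE id = 'a'").fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        db.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 'b'")
except ValueError:
    pass
print(dict(db.execute("SELECT id, balance FROM accounts")))  # {'a': 100, 'b': 0} - untouched

# Non-atomic key-value stand-in: a crash between the two puts leaves inconsistent data
kv = {"a": 100, "b": 0}
kv["a"] -= 150
# ...process dies here; 'b' never receives the credit, and it's on your code to detect and repair it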
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Durability
Can you live with writing advanced code to compensate?
  Trusting all developers to properly check for partial transaction failures and the current physical layout of the data cluster, and to write code to propagate data across the cluster.
Can you live with lost data?
  No logs, archives, mirroring, etc.
Can you live with accidental deletion of data?
  No point-in-time recovery feature.
Can you live with scripting your own backup & recovery solutions?
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Atomicity
Can you live with modifying single documents at a time?
Can you live with partially successful transactions?
  You can achieve higher availability because transactions can partially succeed.
Can you live with inconsistent and incomplete data?
  Is it OK to not know whether data anomalies are caused by bugs in your code or are temporarily inconsistent because they haven't been synchronized yet?
Can you live with writing advanced code to compensate?
  Custom solutions for atomic rollback, handling of transactions that fail, and finding & fixing inconsistent data.
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Isolation
Can you live with modifying single documents at a time?
Can you live with inaccurate queries?
  Without isolation, query results are inaccurate because concurrent transactions can change data while you are processing it.
Can you live with race conditions and deadlocks?
Can you live with writing advanced code to compensate?
  Your own versioning system, code to hide concurrent updates, inserts and deletes from queries, and handling of race conditions and deadlocks.
(A small versioning sketch follows below.)
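To give a feel for what "your own versioning system" means, here is a minimal, hypothetical Python sketch of optimistic concurrency control: each document carries a version number, and an update is rejected if another writer got there first. This is an illustrative pattern, not a specific product's API; the key and field names are invented.

# Each stored document carries a version; updates succeed only if the version hasn't moved
store = {"policy:p10": {"version": 1, "premium": 1200}}

def read(key):
    doc = store[key]
    return doc["version"], dict(doc)

def update(key, expected_version, new_doc):
    # Compare-and-swap: the application, not the database, detects the conflict
    if store[key]["version"] != expected_version:
        return False  # someone else wrote in between; caller must re-read and retry
    new_doc["version"] = expected_version + 1
    store[key] = new_doc
    return True

v, doc = read("policy:p10")
doc["premium"] = 1100
# Simulate a concurrent writer sneaking in first
store["policy:p10"] = {"version": 2, "premium": 1300}
print(update("policy:p10", v, doc))  # False - our stale update is rejected, avoiding a lost write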
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Consistency
Do you need complete consistency? Not necessarily - instead, you may prefer:
  - The absolute fastest performance at the lowest hardware cost
  - The highest global data availability at the lowest hardware cost
  - Working with one document at a time
  - Writing advanced code to create your own consistency model
  - Eventually consistent data
  - Some inconsistent data that can't be reconciled
  - Some missing data that can't be recovered
  - Some inconsistent query results
(An eventual-consistency sketch follows below.)
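As a hedged sketch of what "eventually consistent data" means for readers, here is a toy Python model of a primary replica with asynchronous replication to a secondary: a read routed to the secondary can return stale data until the replication queue drains. The replica names and the key are invented for illustration.

from collections import deque

primary, secondary = {}, {}
replication_queue = deque()   # writes waiting to be shipped to the secondary

def write(key, value):
    primary[key] = value
    replication_queue.append((key, value))   # applied later, not as part of the write

def replicate_one():
    if replication_queue:
        key, value = replication_queue.popleft()
        secondary[key] = value

write("customer:c001:email", "old@example.com")
replicate_one()                                # secondary catches up
write("customer:c001:email", "new@example.com")

print(primary["customer:c001:email"])    # new@example.com
print(secondary["customer:c001:email"])  # old@example.com - stale until replication runs again
replicate_one()
print(secondary["customer:c001:email"])  # now converged: new@example.com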
NoSQL Data Architecture & Best Practices
Data View - How do you choose? What do you need most?
  - Highest performance for queries and transactions
  - Highest data availability across multiple data centers
  - Less data loss (e.g., durability)
  - More query accuracy & fewer deadlocks (e.g., isolation)
  - More data integrity (e.g., atomicity)
  - Less code to compensate for lack of ACID compliance
NoSQL Data Architecture & Best Practices
Key Points
  - RDBMSs will always have an important place in our architecture.
  - NoSQL implementations have a benefit to our future.
  - Once you have a list of NoSQL databases that meet your modeling needs, choose the one that best meets your needs for velocity and volume.
  - It is not a one-or-the-other, 'all in' choice to make.
At the database tier, relational databases were originally the popular choice. Their use became increasingly problematic, however, because they are a centralized, shared-everything technology that scales up rather than out. This made them a poor fit for applications that require easy and dynamic scalability. NoSQL technologies have been built from the ground up to be distributed, scale-out technologies and therefore fit better with the highly distributed nature of the three-tier Internet architecture.
The capture and use of this data creates the need for a very different type of database, however. Developers want a very flexible database that easily accommodates any new type of data they want to work with and is not disrupted by content structure changes from third-party data providers. Much of the new data is unstructured and semi-structured, so developers also need a database that is capable of efficiently storing it. Unfortunately, the rigidly defined, schema-based approach used by relational databases makes it impossible to quickly incorporate new types of data, and is a poor fit for unstructured and semi-structured data.
Finally, with the rising importance of processing data, developers are increasingly frustrated with the "impedance mismatch" between the object-oriented approach they use to write applications and the schema-based tables and rows of a relational database. NoSQL provides a data model that maps better to the application's organization of data and simplifies the interaction between the application and the database, resulting in less code to write, debug, and maintain.
(Reference: "Why NoSQL? Three trends disrupting the database status quo," Couchbase)