“Big Data” - Technical Architecture
AGENDA
Foundational Definitions & where these technologies came from
  - Big Data
  - NoSQL
  - Hadoop
Business & Technical Drivers
How they are being used in many companies
Predictions for the future
Challenges & Obstacles
Questions
RONI
“Big Data” - Technical Architecture
Foundational Definition - Big Data
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.
Big data can be characterized by the 3Vs: the extreme volume of data, the wide variety of types of data, and the velocity at which the data must be processed. There are many other aspects as well, such as viscosity, complexity, and ambiguity.
RONI
These technologies are not just about volumes of data. We now have the ability to capture data without having to pre-declare all of its potential uses, and to position ourselves to adjust to the velocity of change more readily (schema-on-read; leveraging/preventing data exhaust, etc.). Another way to stress this is: "we don't know what tomorrow's questions are going to be, but we expect they will need some of today's data to answer." How do I ask the question that I didn't know I wanted to ask?
Data in a corporation that cannot be processed using traditional data management techniques and technologies can be broadly classified as Big Data.
“Big Data” - Technical Architecture
Big Data ≠ Hadoop
Big Data ≠ NoSQL
Hadoop ≠ NoSQL
RONI
Hadoop & NoSQL are key technologies for working with Big Data effectively.
“Big Data” - Technical Architecture
RONI
Big Data in 2015 - Power to the People!
December 16, 2014

Last year I speculated that the big data 'power curve' in 2014 would be shaped by business demands for data blending. Customers presenting at our debut PentahoWorld conference last October, from Paytronix, to RichRelevance, to NASDAQ, certainly proved my speculations to be true. Businesses like these are examples of how increasingly large and varied data sets can be used to deliver high and sustainable ROI. In fact, Ventana Research recently confirmed that 22 percent of organizations now use upwards of 20 data sources, and 19 percent use between 11 and 20 data sources.

Moving into 2015, and fired up by their initial big data bounties, businesses will seek even more power to explore data freely, structure their own data blends, and gain profitable insights faster. They know "there's gold in them hills" and they want to mine for even more! With that said, here are my big data predictions for 2015:

Big Data Meets the Big Blender!
The digital universe is exploding at a rate that even Carl Sagan might struggle to articulate. Analysts believe it's doubling every year, with the unstructured component doubling every three months. By 2025, IDC estimates that 40 percent of the digital universe will be generated by machine data and devices, while unstructured data is getting all the headlines. The ROI business use cases we've seen require the blending of unstructured data with more traditional, relational data. For example, one of the most common use cases we are helping companies create is a 360 view of their customers. The de facto reference architecture involves blending relational/transactional data detailing what the customer has bought with unstructured weblog and clickstream data highlighting customer behavior patterns around what they might buy in the future. This blended data set is further mashed up with social media data describing sentiment around the company's products and customer demographics. This "Big Blend" is fed into recommendation platforms to drive higher conversion rates, increase sales, and improve customer engagement. This "blended data" approach is fundamental to other popular big data use cases like Internet of Things, security and intelligence applications, supply chain management, and regulatory and compliance demands in the Financial Services, Healthcare and Telco industries.

Internet of Things Will Fuel the New 'Industrial Internet'
Early big data adoption drove the birth of new business models at companies like our customers Beachmint and Paytronix. In 2015, I'm convinced that we'll see big data starting to transform traditional industrial businesses by delivering operational, strategic and competitive advantage. Germany is running an ambitious Industry 4.0 project to create "Smart Factories" that are flexible, resource efficient, ergonomic, and integrated with customers and business partners. The machine data generated from sensors and devices is fueling key opportunities like Smart Homes, Smart Cities, and Smart Medicine, which all require big data analytics. Much like the 'Industrial Internet' movement in the U.S., Industry 4.0 is being defined by the Internet of Things. According to Wikibon, the value of efficiency from machine data could reach close to $1.3 trillion and will drive $514B in IT spend by 2020. The bottlenecks are challenges related to data security and governance, data silos, and systems integration.

Big Data Gets Cloudy!
As companies with huge data volumes seek to operate in more elastic environments, we're starting to see some running all, or part of, their big data infrastructures in the cloud. This says to me that the cloud is now "IT approved" as a safe, secure, and flexible data host. At PentahoWorld, I told a story about a "big data throw down" that occurred during our Strategic Advisory Board meeting. At one point in the meeting, two enterprise customers in highly regulated industries started one-upping each other about how much data they stored in the Amazon Redshift cloud. One shared that they processed and analyzed 5-7 billion records daily. The next shared that they stored half a petabyte of new data every day and, on top of that, had to hold the data for seven years while still making it available for quick analysis. Both of these customers are held to the highest standards for data governance and compliance - regardless of who won, the forecast for their big data environments is the cloud!

Embedded Analytics is the New BI
Although "classic BI," which involves a business analyst looking at data with a separate tool outside the flow of the business application, will be around for a while, a new wave is rising in which business users increasingly consume analytics embedded within applications to drive faster, smarter decisions. Gartner's latest research estimates that more than half the enterprises that use BI now use embedded analytics. Whether it's a RichRelevance data scientist building a predictive algorithm for a recommendation engine, or a marketing director accessing Marketo to consume analytics related to lead scoring or campaign effectiveness, the way our customers are deploying Pentaho leaves me with no doubt that this prediction will bear out. As classic BI matured, we witnessed a final "tsunami" in which data visualization and self-service inspired business people to imagine the potential for advanced analytics. Users could finally see all their data - warts and all - and also start to experiment with rudimentary blending techniques. Self-service and data visualization prepared the market for what I firmly expect to be the most significant analytics trend in 2015...

Data Refineries Give Real Power to the People!
The big data stakes are higher than ever before. No longer just about quantifying 'virtual' assets like sentiment and preference, analytics are starting to inform how we manage physical assets like inventory, machines and energy. This means companies must turn their focus to the traditional ETL processes that result in safe, clean and trustworthy data. However, for the types of ROI use cases we're talking about today, this traditional IT process needs to be made fast, easy, highly scalable, cloud-friendly and accessible to the business. And this has been a stumbling block - until now. Enter Pentaho's Streamlined Data Refinery, a market-disrupting innovation that effectively brings the power of governed data delivery to "the people," unlocking big data's full operational potential. I'm tremendously excited about 2015 and the journey we're on with both customers and partners. You're going to hear a lot more about the Streamlined Data Refinery in 2015 - and that's a prediction I can guarantee will come true! Finally, as I promised at PentahoWorld in October, we're only going to succeed when you tell us you've delivered an ROI for your business. Let's go forth and prosper together in 2015!

Quentin Gallivan, CEO, Pentaho
“Big Data” - Technical Architecture
Foundational Definition - NoSQL
NoSQL, also called "Not Only SQL," is an approach to data management and database design that's useful for very large sets of distributed data.
NoSQL seeks to solve the scalability and big data performance issues that relational databases weren't designed to address.
NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data, or data that's stored remotely on multiple virtual servers in the cloud.
However - NoSQL is not just about Big Data.
RONI
“Big Data” - Technical Architecture
Where this technology came from - NoSQL
Timeline: 1970, 1980, 1990, 2000, 2005, 2007, 2010, 2014+
  - Flat files
  - Rise of relational databases
  - Relational database dominance
  - Rise of object databases
  - Document DBs inspired by Lotus Notes
  - Key-value stores: the need to store tabular data in distributed systems and replicate data for 24x7 availability
  - Many innovators in the 2005 to 2010 timeframe
  - Polyglot persistence: enterprises will have a variety of different data storage technologies for different kinds of data & application needs
RONI
“Big Data” - Technical Architecture
Market view of what's out there - we do NOT have all of these at PFG today.
There are over 150 NoSQL databases in the market - these are just a few of the top ones.
RONI
“Big Data” - Data Architecture at PFG
Foundational Definition - Hadoop
Hadoop is an open source, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
TOM
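To make the MapReduce processing model concrete, here is a minimal word-count sketch in the Hadoop Streaming style: a mapper emits (word, 1) pairs and a reducer sums the counts for each word after the framework sorts by key. This is an illustrative sketch, not part of the deck; the file name and invocation are assumptions.

# wordcount_streaming.py - illustrative mapper/reducer for Hadoop Streaming (hypothetical file name)
# Run as "python wordcount_streaming.py map" or "python wordcount_streaming.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin; Hadoop sorts these pairs by key
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer():
    # Input arrives grouped by key (word); sum the counts for each word
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

Under an assumed Hadoop Streaming setup, this would be wired up roughly as: hadoop jar hadoop-streaming.jar -files wordcount_streaming.py -mapper "python wordcount_streaming.py map" -reducer "python wordcount_streaming.py reduce" -input /data/in -output /data/out (the jar location and paths are illustrative).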
“Big Data” - Data Architecture at PFG
Where this technology came from - Hadoop
TOM
1995 - 2005: Yahoo! Search team builds 4+ generations of systems to crawl & index the WWW. 20 billion pages!
2004: Google publishes the Google File System & MapReduce papers.
2005: Yahoo! staffs 'Juggernaut', an open source DFS & MapReduce effort. Doug Cutting builds the Nutch DFS & MapReduce and joins Yahoo!
  - Compete / differentiate via open source contribution
  - Attract scientists - become a known center of big data excellence
  - Avoid building proprietary systems that will be obsolesced
  - Gain the leverage of a wider community building one infrastructure
2006: Juggernaut & Nutch join forces - Hadoop is born! The Nutch prototype is used to seed the new Apache Hadoop project, and Yahoo! commits to scaling Hadoop and staffs a Hadoop team.
2010: Other Internet companies add tools and frameworks to enhance Hadoop; service providers step into the market to provide training, support, & hosting; 'Enterprise Grade' security; analytic tool interoperability.
2014+: Mass adoption.
“Big Data” - Technical Architecture
The Hadoop Vendor Landscape
TOM
This is a snapshot from "The Forrester Wave™: Big Data Hadoop Solutions, Q1 2014".
We implemented our first Hadoop cluster in the fall. The major contenders were Cloudera, Hortonworks, and MapR. We wanted an open source solution that we could host on-prem. We chose Cloudera based on market presence, vision, and ability to help us implement quickly.
“Big Data” - Technical Architecture
Business Drivers
  - Provide access to all data needed for analytics (internal or external)
  - Provide the ability to realistically interact with greater 'depths' of data - i.e., tens of years instead of a couple of months
  - Provide a greater "speed to insight" for all types of requests
  - Lower the total cost of ownership across the enterprise for analytics
  - Allow for exploration of our data in ways we never anticipated, to identify differentiating understanding of customers and markets
There's an imbalance today...
RONI
Internal: Documents, Call Center Data, Project Data, Audio, Video, 10-K & Financial Data, Contracts, Sales & Marketing Data
External: Forums, Twitter, Facebook, Public Domain Data
“Big Data” - Technical Architecture
Technical Drivers - Current technical capabilities don't align with changing expectations
TOM
Internal: Documents, Call Center Data, Project Data, Audio, Video, 10-K & Financial Data, Contracts, Sales & Marketing Data
External: Forums, Twitter, Facebook, Public Domain Data
“Big Data” - Technical Architecture
How they are being used today
NoSQL
  - Not focused on Big Data... yet
  - Many companies are using, or at least experimenting with, MongoDB
  - Document store for web applications that only need to persist content for the lifespan of that interaction
  - Using NoSQL stores for user preferences to personalize what is presented on a web page for an interaction
  - Beginning to organize social streams of data
Hadoop
  - Interrogating our web logs to better understand the behavior of people interacting with a website
  - Merging that semi-structured web activity with other structured legacy data
  - Massive storage of data for exploration and discovery - often using interoperability with analytic consumption tools
TAG TEAM
Schema-on-Write: RDBMS
  - Schema must be created before data is loaded.
  - An explicit load operation has to take place which transforms the data to the internal structure of the database.
  - New columns must be added explicitly before data for such columns can be loaded into the database.
  - Read is fast.
  - Standards/governance.
Schema-on-Read: Hadoop
  - Data is simply copied to the file store; no special transformation is needed.
  - A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns.
  - New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
  - Load is fast.
  - Evolving schemas/agility.
(A small schema-on-write vs. schema-on-read sketch follows below.)
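To illustrate the contrast, here is a minimal, hypothetical Python sketch (not from the deck): the schema-on-write path declares a table and transforms records before loading, while the schema-on-read path lands raw JSON lines untouched and projects the needed fields only when they are read. The field and table names are assumptions for illustration.

import json
import sqlite3

raw_events = [
    '{"user": "a123", "page": "/quotes", "ms": 420}',
    '{"user": "b456", "page": "/claims", "ms": 95, "referrer": "email"}',  # a new field appears
]

# Schema-on-write: declare the structure first, transform on load
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weblog (user TEXT, page TEXT, ms INTEGER)")
for line in raw_events:
    rec = json.loads(line)
    db.execute("INSERT INTO weblog VALUES (?, ?, ?)", (rec["user"], rec["page"], rec["ms"]))
    # 'referrer' is silently lost until the schema is altered and the data reloaded

# Schema-on-read: land the raw lines as-is, apply structure only when querying
landed = list(raw_events)  # stands in for files copied into HDFS

def read_column(lines, column):
    # The "SerDe": parse at read time and project whichever field is asked for
    return [json.loads(l).get(column) for l in lines]

print(read_column(landed, "referrer"))  # [None, 'email'] - the new field is queryable immediately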
“Big Data” - Technical Architecture
Plans for the future
Hadoop
  - Expansion of web activity data (more logs, more data in logs, more use cases)
  - Speech-to-text translation of call recordings, and text analysis / natural language processing to determine call topics and caller sentiment
  - Extraction of text from documents to aid in analysis
  - 'Data Lake' solutioning - both for ingestion and archive
NoSQL
  - Database for web applications that need that speed of development and nimbleness
  - Layering of NoSQL solutions on top of Hadoop to improve searchability and performance
  - Exploration of graph NoSQL solutions for analytics on hierarchical-type data
TAG TEAM
Caller sentiment: Was it a positive/negative interaction? How does that influence future behavior?
Text analytics: Think supplier-negotiated contracts. Can we find all that have a particular clause in them? Or, if we have new contract language, how many and which need to be updated?
“Big Data” - Technical Architecture
Lake of Data / Data Refinery
TOM
Data Lake or Data Refinery?
We run into trouble when people envision a "Lake of Data." It is often visualized as a single large entity where all the wild animals come to drink.
A better analogy is to think of Hadoop as a "Data Refinery." Crude oil enters and is "cooked" to deliver different data and different formats for different consumers. At the bottom it is crude oil: raw, untransformed, managed for speed of acquisition and only fit for consumption by the quintessential "Data Scientist." As it moves up the stack it is filtered, formatted, business rules are applied, and it is delivered to power users for parameterized reporting. At the top, the data is formatted for performance and stored in tools for general consumption by the masses.
“Big Data” - Technical Architecture
Data Refinery
TOM
“Big Data” - Technical Architecture
Many kinds of data in our organization
RONI
Conceptual, for illustration - not a vetted/approved picture of the PFG environment.
“Big Data” - Technical Architecture
Conceptual Workload Isolation Today...
RONI
Conceptual, for illustration - not a vetted/approved picture of the PFG environment.
“Big Data” - Technical Architecture
Conceptual Workload Isolation in the Future...
RONI
Conceptual, for illustration - not a vetted/approved picture of the PFG environment.
“Big Data” - Technical Architecture
Big Data technologies are broader than just Hadoop & NoSQL - but those are the key starting points for us.
Market view of what's out there - we do NOT have all of these at PFG today.
TOM
“Big Data” - Technical Architecture
Challenges and obstacles to overcome
  - Security
  - Governance
  - Clear use cases
  - Integration points
  - Hosting models
TOM
Security: Hadoop, through various projects, can apply database-like security. If we use it, we can only access the data like a database.
"Big Data" governance: as we seek value from new data sources we need to ensure we are applying security, retention, and auditing consistent with corporate standards. We may see a shifting of roles and accountability from traditional roles to new ones depending on the data source (i.e., a shift from DBAs to analytics professionals).
Clear use cases: we have configured our early environments based on a specific use case. We are making the data available with the expectation that we will find new use cases. The goal for now is not to "paint ourselves into corners" as we learn.
Integration points: what tools outside of Hadoop will integrate? e.g., SAS, SAS VA, R, Business Objects, MicroStrategy, etc.
Hosting models: many of our new "Big Data" technologies go against the grain of our modern data center (consolidated/dense). We need a good understanding of security, data sources, and integration to make an informed decision. This may push us to have multiple distributions to meet different use cases.
NoSQL Data Architecture & Best Practices
Data View - Overview
We are in a database revolution. Existing paradigms are being challenged:
  - Models
  - Hardware
  - Software
  - Languages
Will tweaking current data solutions be enough?
NoSQL Data Architecture & Best Practices
Data View - Overview
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Relational Model
PROs
  - Most flexible queries & updates
  - Reuse data structures in any context
  - Great DB-to-DB integration
  - Mature tools
  - Standard query language
  - Easy to hire expertise
CONs
  - Design-time, static relationships
  - Design-time, static structures: design first, then load data
  - Hard to normalize the model
  - Requires code to integrate relational data with object-oriented code
  - Cannot query for relevance
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Dimensional Model
PROs
  - Queries facts in context
  - Self-service, ad hoc queries
  - High-performance platforms
  - Mature tools and integration
  - Standard query language
  - Turns data into information
CONs
  - Expensive platforms
  - Design-time, static relationships
  - Design-time, static structures: design first, then load data
  - Cannot query for relevance
  - Cannot query for answers that are not built into the model
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms
What's wrong (aka challenging) with SQL DBs?
  - Relevance
  - Velocity
  - Volume
  - Variety
  - Variability
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Key-Value / Column Family Models
PROs
  - Fast puts and gets
  - Massive scalability
  - Easy to shard & replicate
  - Data colocation
  - Simple to model
  - Inexpensive
  - Data in transactional context
  - Developer in control
CONs
  - Carefully design the key
  - Shred JSON into flat columns
  - Secondary indexes required to query outside of the hierarchical key
  - No standard query API or language
  - Hand code all joins in the app
  - Immature tools and platforms
  - Hard to integrate and hire
(See the key-design sketch below.)
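As a hedged illustration of "carefully design the key" and "secondary indexes required to query outside of the hierarchical key," here is a small Python sketch using plain dictionaries as a stand-in for a key-value store. The key layout and field names are invented; this is not a specific product's API.

# Stand-in for a key-value store: lookups only work by exact key
kv = {}

def put(customer_id, policy_id, doc):
    # The composite key has to be designed up front: all access paths hang off it
    kv[f"customer:{customer_id}:policy:{policy_id}"] = doc

def get(customer_id, policy_id):
    return kv[f"customer:{customer_id}:policy:{policy_id}"]

put("c001", "p10", {"type": "auto", "state": "IA"})
put("c001", "p11", {"type": "home", "state": "IA"})
put("c002", "p20", {"type": "auto", "state": "MN"})

# Querying outside the key ("all auto policies") needs a hand-maintained secondary index...
by_type = {}
for key, doc in kv.items():
    by_type.setdefault(doc["type"], []).append(key)

print(by_type["auto"])  # ['customer:c001:policy:p10', 'customer:c002:policy:p20']
# ...and any "join" back to customer data is code you write and keep consistent yourself.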
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Document Model
PROs
  - Fast development
  - "Schemaless", run-time designed, rich JSON and/or XML data structures
  - Queries everything in context
  - Self-service, ad hoc queries
  - Turns data into information
  - Can query for relevance
CONs
  - Defensive programming for unexpected data structures
  - Expensive platforms, immature tools, and hard to integrate
  - Non-standard query languages, and hard to hire expertise
  - Not as fast as column-family / key-value databases
(A short defensive-programming sketch follows.)
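To show what "defensive programming for unexpected data structures" tends to look like in practice, here is a small, hypothetical Python sketch over an in-memory list of documents; the document shapes and field names are invented for illustration.

# Two "documents" from the same collection with different shapes - normal in a schemaless store
docs = [
    {"name": "Ann", "phones": ["555-0101", "555-0102"], "address": {"city": "Des Moines"}},
    {"name": "Bo", "phone": "555-0200"},  # older shape: single phone field, no address
]

def phone_numbers(doc):
    # Defensive reads: nothing about the structure can be assumed at query time
    if isinstance(doc.get("phones"), list):
        return doc["phones"]
    if doc.get("phone"):
        return [doc["phone"]]
    return []

def city(doc):
    return (doc.get("address") or {}).get("city", "unknown")

for d in docs:
    print(d["name"], phone_numbers(d), city(d))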
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms: Graph Model
PROs
  - Unlimited flexibility - model any structure
  - Run-time definition of types & relationships
  - Relate anything to anything in any way
  - Query relationship patterns
  - Standard query language (SPARQL)
  - Creates maximum context around data
CONs
  - Hard to model at such a low level
  - Hard to integrate with other systems
  - Immature tools
  - Hard to hire expertise
  - Cannot query for relevance because the original document context is not preserved
(A tiny triple-pattern sketch follows.)
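To make "query relationship patterns" concrete, here is a tiny, hypothetical Python sketch of an RDF-style triple store with a pattern-matching query - a plain-Python stand-in for what SPARQL expresses declaratively. The subjects, predicates, and objects are invented examples.

# A graph as (subject, predicate, object) triples - relationships are defined at run time
triples = [
    ("ann", "works_for", "claims_dept"),
    ("bo", "works_for", "claims_dept"),
    ("claims_dept", "part_of", "insurance_division"),
    ("ann", "manages", "bo"),
]

def match(s=None, p=None, o=None):
    # None acts like a query variable: return every triple that fits the pattern
    return [(ts, tp, to) for ts, tp, to in triples
            if (s is None or ts == s) and (p is None or tp == p) and (o is None or to == o)]

# "Who works for the claims department?"  (?x works_for claims_dept)
print([s for s, _, _ in match(p="works_for", o="claims_dept")])  # ['ann', 'bo']

# Two-hop pattern: which departments do Ann's direct reports work in?
reports = [o for _, _, o in match(s="ann", p="manages")]
print([o for r in reports for _, _, o in match(s=r, p="works_for")])  # ['claims_dept']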
NoSQL Data Architecture & Best Practices
Data View - Five Data Paradigms
What's wrong (aka challenging) with NoSQL DBs?
The developer is responsible for consistency (handling threading):
  - Locks
  - Contention
  - Serialization
  - Deadlocks
  - Race conditions
  - Threading bugs
NoSQL Data Architecture & Best Practices
Data View
NoSQL Data Architecture & Best Practices
Data View - Modeling Takeaways
Each model has a specialized purpose:
  - Dimensional: business intelligence reporting and analytics
  - Relational: flexible queries, joins, updates; mature, standard
  - Column / Key-Value: simple, fast puts and gets, massively scalable
  - Document: fast development, "schemaless" JSON/XML, searchable
  - Graph / RDF: modeling anything at runtime, including relationships
NoSQL Data Architecture & Best Practices
Data View - How do you choose?
How much Durability do you need?
  Durable data survives system failures and can be recovered after unwanted deletion.
How much Atomicity do you need?
  An atomic transaction is all or nothing - sets of data and/or sets of commands.
How much Isolation do you need?
  Isolation prevents concurrent transactions from affecting each other.
How much Consistency do you need (or when do you need it)?
  Consistency exists when data is committed and consistent with all data rules at a point in time.
(An atomicity sketch follows below.)
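As a hedged illustration of the atomicity question, here is a small Python sketch contrasting an all-or-nothing transfer in SQLite (an ACID engine) with the same two writes against a plain key-value stand-in, where a failure between the puts leaves partial data behind. The table, account IDs, and amounts are assumptions for illustration.

import sqlite3

# Atomic: both updates commit together or neither does
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
try:
    with db:  # opens a transaction; rolls back automatically on any exception
        db.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 'a'")
        if db.execute("SELECT balance FROM accounts WHERE id = 'a'").fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        db.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 'b'")
except ValueError:
    pass
print(dict(db.execute("SELECT id, balance FROM accounts")))  # {'a': 100, 'b': 0} - untouched

# Non-atomic key-value stand-in: a crash between the two puts leaves inconsistent data
kv = {"a": 100, "b": 0}
kv["a"] -= 150
# ...process dies here; 'b' never receives the credit, and it's on your code to detect and repair it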
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Durability
Can you live with writing advanced code to compensate?
  Trusting all developers to properly check for partial transaction failures and the current physical layout of the data cluster, and to write code to propagate data across the cluster.
Can you live with lost data?
  No logs, archives, mirroring, etc.
Can you live with accidental deletion of data?
  No point-in-time recovery feature.
Can you live with scripting your own backup & recovery solutions?
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Atomicity
Can you live with modifying single documents at a time?
Can you live with partially successful transactions?
  You can achieve higher availability because transactions can partially succeed.
Can you live with inconsistent and incomplete data?
  Is it OK to not know whether data anomalies are caused by bugs in your code or are temporarily inconsistent because they haven't been synchronized yet?
Can you live with writing advanced code to compensate?
  Custom solutions for atomic rollback, handling of transactions that fail, and finding & fixing inconsistent data.
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Isolation
Can you live with modifying single documents at a time?
Can you live with inaccurate queries?
  Without isolation, query results are inaccurate because concurrent transactions can change data while you are processing it.
Can you live with race conditions and deadlocks?
Can you live with writing advanced code to compensate?
  Your own versioning system, code to hide concurrent updates, inserts and deletes from queries, and handling of race conditions and deadlocks.
(A small versioning sketch follows below.)
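To give a feel for what "your own versioning system" means, here is a minimal, hypothetical Python sketch of optimistic concurrency control: each document carries a version number, and an update is rejected if another writer got there first. This is an illustrative pattern, not a specific product's API; the key and field names are invented.

# Each stored document carries a version; updates succeed only if the version hasn't moved
store = {"policy:p10": {"version": 1, "premium": 1200}}

def read(key):
    doc = store[key]
    return doc["version"], dict(doc)

def update(key, expected_version, new_doc):
    # Compare-and-swap: the application, not the database, detects the conflict
    if store[key]["version"] != expected_version:
        return False  # someone else wrote in between; caller must re-read and retry
    new_doc["version"] = expected_version + 1
    store[key] = new_doc
    return True

v, doc = read("policy:p10")
doc["premium"] = 1100
# Simulate a concurrent writer sneaking in first
store["policy:p10"] = {"version": 2, "premium": 1300}
print(update("policy:p10", v, doc))  # False - our stale update is rejected, avoiding a lost write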
NoSQL Data Architecture & Best Practices
Data View - How do you choose? Consistency
Do you need complete consistency? Not necessarily - instead, you may prefer:
  - The absolute fastest performance at the lowest hardware cost
  - The highest global data availability at the lowest hardware cost
  - Working with one document at a time
  - Writing advanced code to create your own consistency model
  - Eventually consistent data
  - Some inconsistent data that can't be reconciled
  - Some missing data that can't be recovered
  - Some inconsistent query results
(An eventual-consistency sketch follows below.)
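As a hedged sketch of what "eventually consistent data" means for readers, here is a toy Python model of a primary replica with asynchronous replication to a secondary: a read routed to the secondary can return stale data until the replication queue drains. The replica names and the key are invented for illustration.

from collections import deque

primary, secondary = {}, {}
replication_queue = deque()   # writes waiting to be shipped to the secondary

def write(key, value):
    primary[key] = value
    replication_queue.append((key, value))   # applied later, not as part of the write

def replicate_one():
    if replication_queue:
        key, value = replication_queue.popleft()
        secondary[key] = value

write("customer:c001:email", "old@example.com")
replicate_one()                                # secondary catches up
write("customer:c001:email", "new@example.com")

print(primary["customer:c001:email"])    # new@example.com
print(secondary["customer:c001:email"])  # old@example.com - stale until replication runs again
replicate_one()
print(secondary["customer:c001:email"])  # now converged: new@example.com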
NoSQL Data Architecture & Best Practices
Data View - How do you choose? What do you need most?
  - Highest performance for queries and transactions
  - Highest data availability across multiple data centers
  - Less data loss (e.g., durability)
  - More query accuracy & fewer deadlocks (e.g., isolation)
  - More data integrity (e.g., atomicity)
  - Less code to compensate for lack of ACID compliance
NoSQL Data Architecture & Best Practices
Key Points
  - RDBMSs will always have an important place in our architecture.
  - NoSQL implementations have a benefit to our future.
  - Once you have a list of NoSQL databases that meet your modeling needs, choose the one that best meets your needs for velocity and volume.
  - It is not a one-or-the-other, 'all in' choice to make.
At the database tier, relational databases were originally the popular choice. Their use became increasingly problematic, however, because they are a centralized, shared-everything technology that scales up rather than out. This made them a poor fit for applications that require easy and dynamic scalability. NoSQL technologies have been built from the ground up to be distributed, scale-out technologies and therefore fit better with the highly distributed nature of the three-tier Internet architecture.
The capture and use of this data creates the need for a very different type of database, however. Developers want a very flexible database that easily accommodates any new type of data they want to work with and is not disrupted by content structure changes from third-party data providers. Much of the new data is unstructured and semi-structured, so developers also need a database that is capable of efficiently storing it. Unfortunately, the rigidly defined, schema-based approach used by relational databases makes it impossible to quickly incorporate new types of data, and is a poor fit for unstructured and semi-structured data.
Finally, with the rising importance of processing data, developers are increasingly frustrated with the "impedance mismatch" between the object-oriented approach they use to write applications and the schema-based tables and rows of a relational database. NoSQL provides a data model that maps better to the application's organization of data and simplifies the interaction between the application and the database, resulting in less code to write, debug, and maintain.
(Reference: "Why NoSQL? Three trends disrupting the database status quo," Couchbase)